TY - GEN
T1 - Tackling MeSH Indexing Dataset Shift with Time-Aware Concept Embedding Learning
AU - Jin, Qiao
AU - Ding, Haoyang
AU - Li, Linfeng
AU - Huang, Haitao
AU - Wang, Lei
AU - Yan, Jun
PY - 2020
Y1 - 2020
N2 - Medical Subject Headings (MeSH) is a controlled thesaurus developed by the National Library of Medicine (NLM). MeSH terms cover a wide variety of biomedical topics, such as diseases and drugs, and are used to classify PubMed articles. Human indexers at NLM have annotated PubMed articles with MeSH terms for decades, producing millions of MeSH-labeled articles. Recently, many deep learning algorithms have been developed to annotate articles with MeSH terms automatically, utilizing this large-scale MeSH indexing dataset. However, most of the models are trained on all articles indiscriminately, ignoring the temporal structure of the dataset. In this paper, we uncover and thoroughly characterize the problem of MeSH indexing dataset shift (MeSHIFT), meaning that the data distribution changes with time. MeSHIFT includes the shift of input articles, output MeSH labels and annotation rules. We found that machine learning models lose performance when they do not address MeSHIFT. To this end, we present a novel method, time-aware concept embedding learning (TaCEL), as an attempt to solve it. TaCEL is a plug-in module that can be easily incorporated into other automatic MeSH indexing models. Results show that TaCEL improves current state-of-the-art models with only minimal additional cost. We hope this work can facilitate understanding of the MeSH indexing dataset, especially its temporal structure, and provide a solution that can be used to improve current models. © 2020, Springer Nature Switzerland AG.
KW - Dataset shift
KW - Machine learning
KW - Medical Subject Headings
KW - Natural language processing
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85092085420&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85092085420&origin=recordpage
U2 - 10.1007/978-3-030-59419-0_29
DO - 10.1007/978-3-030-59419-0_29
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9783030594183
T3 - Lecture Notes in Computer Science
SP - 474
EP - 488
BT - Database Systems for Advanced Applications - 25th International Conference, DASFAA 2020, Proceedings, Part III
A2 - Nah, Yunmook
A2 - Cui, Bin
A2 - Lee, Sang-Won
A2 - Yu, Jeffrey Xu
A2 - Moon, Yang-Sae
A2 - Whang, Steven Euijong
PB - Springer, Cham
T2 - 25th International Conference on Database Systems for Advanced Applications (DASFAA 2020)
Y2 - 24 September 2020 through 27 September 2020
ER -