Tackling MeSH Indexing Dataset Shift with Time-Aware Concept Embedding Learning

Qiao Jin, Haoyang Ding, Linfeng Li, Haitao Huang, Lei Wang*, Jun Yan

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works > RGC 32 - Refereed conference paper (with host publication) > peer-review

Abstract

Medical Subject Headings (MeSH) is a controlled thesaurus developed by the National Library of Medicine (NLM). MeSH covers a wide variety of biomedical topics, such as diseases and drugs, and its terms are used to classify PubMed articles. Human indexers at NLM have been annotating PubMed articles with MeSH terms for decades, producing millions of MeSH-labeled articles. Recently, many deep learning algorithms have been developed to assign MeSH terms automatically, utilizing this large-scale MeSH indexing dataset. However, most of the models are trained on all articles indiscriminately, ignoring the temporal structure of the dataset. In this paper, we uncover and thoroughly characterize the problem of MeSH indexing dataset shift (MeSHIFT), meaning that the data distribution changes with time. MeSHIFT includes the shift of input articles, output MeSH labels and annotation rules. We found that machine learning models lose performance when they do not account for MeSHIFT. To this end, we present a novel method, time-aware concept embedding learning (TaCEL), as an attempt to solve it. TaCEL is a plug-in module that can be easily incorporated into other automatic MeSH indexing models. Results show that TaCEL improves current state-of-the-art models with only minimal additional cost. We hope this work can facilitate understanding of the MeSH indexing dataset, especially its temporal structure, and provide a solution that can be used to improve current models. © 2020, Springer Nature Switzerland AG.
Original language: English
Title of host publication: Database Systems for Advanced Applications - 25th International Conference, DASFAA 2020, Proceedings, Part III
Editors: Yunmook Nah, Bin Cui, Sang-Won Lee, Jeffrey Xu Yu, Yang-Sae Moon, Steven Euijong Whang
Publisher: Springer, Cham
Pages: 474-488
Number of pages: 15
ISBN (Electronic): 9783030594190
ISBN (Print): 9783030594183
Publication status: Published - 2020
Externally published: Yes
Event: 25th International Conference on Database Systems for Advanced Applications (DASFAA 2020) - Jeju, Korea, Republic of
Duration: 24 Sept 2020 to 27 Sept 2020

Publication series

Name: Lecture Notes in Computer Science
Volume: 12114
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 25th International Conference on Database Systems for Advanced Applications (DASFAA 2020)
Place: Korea, Republic of
City: Jeju
Period: 24/09/20 to 27/09/20

Research Keywords

  • Dataset shift
  • Machine learning
  • Medical Subject Headings
  • Natural language processing
  • Text classification
