TY - GEN
T1 - A Novel Hierarchical Discourse Model for Scientific Article and It's Efficient Top-K Resampling-based Text Classification Approach
AU - Gao, Min
AU - Chen, Chun-Hua
AU - Gao, Zhi-Han
AU - Chen, Wei-Long
AU - Ren, Yuan
AU - Kwong, Sam
AU - Zhan, Zhi-Hui
PY - 2022
Y1 - 2022
N2 - Scientific articles contain rich knowledge that can significantly assist scientific research, but it is difficult to extract this knowledge precisely because of the complexity of the discourse structure of scientific articles. To provide more accurate scientific research knowledge for researchers in a specific academic domain, it is necessary to study the discourse structure of scientific articles in that domain and to propose an approach that automatically annotates discourse information in articles. Unfortunately, few works have studied the discourse structure of domain-specific scientific articles and the corresponding automatic discourse annotation. To fill this gap, we take scientific articles in the wastewater-based epidemiology domain as a case study of how to annotate discourse information automatically and efficiently. This paper makes three contributions. First, we propose a hierarchical discourse model with two layers to cover all potential discourses in scientific articles across domains. Specifically, the first layer defines four core discourse concepts that describe the main process of scientific research and can be applied to scientific articles in any domain. The second layer defines a fine-grained, domain-specific structure that can accurately describe the entire research content of a specific domain. Second, based on the proposed model, we build a corpus of 100 annotated scientific articles in the wastewater-based epidemiology domain. Third, based on the model and the corpus, we propose a simple yet efficient Top-K resampling-based approach to train a more effective classifier for automatic annotation. Extensive experiments verify the effectiveness and efficiency of the proposed hierarchical discourse model and the Top-K resampling-based classification approach.
AB - Scientific articles contain rich knowledge that can significantly assist scientific research, but it is difficult to extract this knowledge precisely because of the complexity of the discourse structure of scientific articles. To provide more accurate scientific research knowledge for researchers in a specific academic domain, it is necessary to study the discourse structure of scientific articles in that domain and to propose an approach that automatically annotates discourse information in articles. Unfortunately, few works have studied the discourse structure of domain-specific scientific articles and the corresponding automatic discourse annotation. To fill this gap, we take scientific articles in the wastewater-based epidemiology domain as a case study of how to annotate discourse information automatically and efficiently. This paper makes three contributions. First, we propose a hierarchical discourse model with two layers to cover all potential discourses in scientific articles across domains. Specifically, the first layer defines four core discourse concepts that describe the main process of scientific research and can be applied to scientific articles in any domain. The second layer defines a fine-grained, domain-specific structure that can accurately describe the entire research content of a specific domain. Second, based on the proposed model, we build a corpus of 100 annotated scientific articles in the wastewater-based epidemiology domain. Third, based on the model and the corpus, we propose a simple yet efficient Top-K resampling-based approach to train a more effective classifier for automatic annotation. Extensive experiments verify the effectiveness and efficiency of the proposed hierarchical discourse model and the Top-K resampling-based classification approach.
KW - automatic annotation
KW - discourse
KW - scientific articles
KW - text classification
UR - http://www.scopus.com/inward/record.url?scp=85142682911&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85142682911&origin=recordpage
U2 - 10.1109/SMC53654.2022.9945306
DO - 10.1109/SMC53654.2022.9945306
M3 - RGC 32 - Refereed conference paper (with host publication)
T3 - Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
SP - 774
EP - 781
BT - 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC) - Proceedings
PB - IEEE
T2 - 2022 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2022
Y2 - 9 October 2022 through 12 October 2022
ER -