Abstract
The prediction of lexical complexity in context is becoming increasingly relevant in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and a limited number of Western languages.
In our paper, we introduce CompLex-ZH, a dataset of Chinese words annotated with complexity scores in their sentential contexts. Our data include sentences in Mandarin and Cantonese, selected from a variety of sources and textual genres. We provide a first evaluation with baselines that combine hand-crafted and language model-based features.
©2024 Association for Computational Linguistics
Original language | English |
---|---|
Title of host publication | Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024) |
Editors | Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, Regina Stodden |
Publisher | Association for Computational Linguistics |
Pages | 20-26 |
ISBN (Electronic) | 979-8-89176-176-6 |
DOIs | |
Publication status | Published - Nov 2024 |
Event | 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) - Hyatt Regency Miami Hotel, Miami, United States Duration: 12 Nov 2024 → 16 Nov 2024 https://2024.emnlp.org/ |
Conference
Conference | 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) |
---|---|
Abbreviated title | EMNLP 2024 |
Country/Territory | United States |
City | Miami |
Period | 12/11/24 → 16/11/24 |
Internet address |
Bibliographical note
Research Unit(s) information for this publication is provided by the author(s) concerned.
Funding
EC acknowledges the financial support from the start-up fund project “Building and Predicting Neurocognitive-Motivated Lexical-Semantic Norms for Mandarin Chinese” (1-BE8G), sponsored by the Faculty of Humanities of the Hong Kong Polytechnic University. JL acknowledges support from a Strategic Research Grant (7006037) at City University of Hong Kong.
Publisher's Copyright Statement
- This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/