Abstract
The prediction of lexical complexity in context is assuming an increasing relevance in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages.
In our paper, we introduce CompLex-ZH, a dataset of Chinese words annotated with complexity scores in sentential contexts. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language-model-based features.
©2024 Association for Computational Linguistics
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024) |
| Editors | Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, Regina Stodden |
| Publisher | Association for Computational Linguistics |
| Pages | 20-26 |
| ISBN (Electronic) | 979-8-89176-176-6 |
| Publication status | Published - Nov 2024 |
| Event | 29th Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Hybrid, Miami, United States. Duration: 12 Nov 2024 → 16 Nov 2024. https://2024.emnlp.org/ |
Conference
| Conference | 29th Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) |
|---|---|
| Abbreviated title | EMNLP 2024 |
| Place | United States |
| City | Miami |
| Period | 12 Nov 2024 → 16 Nov 2024 |
| Internet address | https://2024.emnlp.org/ |
Bibliographical note
Research Unit(s) information for this publication is provided by the author(s) concerned.
Funding
EC acknowledges financial support from the start-up fund project “Building and Predicting Neurocognitive-Motivated Lexical-Semantic Norms for Mandarin Chinese” (1-BE8G), sponsored by the Faculty of Humanities of the Hong Kong Polytechnic University. JL acknowledges support from a Strategic Research Grant (7006037) at City University of Hong Kong.
Publisher's Copyright Statement
- This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
Fingerprint
Dive into the research topics of 'CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese'. Together they form a unique fingerprint.