CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese

Le Qiu, Shanyue Guo, Tak-sum Wong, Emmanuele Chersoni*, John S. Y. Lee, Chu-Ren Huang

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

The prediction of lexical complexity in context is becoming increasingly relevant in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, however, datasets annotated with complex words are available only for English and a limited number of Western languages.

In this paper, we introduce CompLex-ZH, a Chinese dataset of words annotated with complexity scores in sentential contexts. Our data include sentences in Mandarin and Cantonese, selected from a variety of sources and textual genres. We provide a first evaluation with baselines that combine hand-crafted and language-model-based features.

©2024 Association for Computational Linguistics
Original language: English
Title of host publication: Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
Editors: Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, Regina Stodden
Publisher: Association for Computational Linguistics
Pages: 20-26
ISBN (Electronic): 979-8-89176-176-6
Publication status: Published - Nov 2024
Event: 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) - Hyatt Regency Miami Hotel, Miami, United States
Duration: 12 Nov 2024 to 16 Nov 2024
https://2024.emnlp.org/

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)
Abbreviated title: EMNLP 2024
Country/Territory: United States
City: Miami
Period: 12/11/24 to 16/11/24

Bibliographical note

Research Unit(s) information for this publication is provided by the author(s) concerned.

Funding

EC acknowledges financial support from the start-up fund project “Building and Predicting Neurocognitive-Motivated Lexical-Semantic Norms for Mandarin Chinese” (1-BE8G), sponsored by the Faculty of Humanities of the Hong Kong Polytechnic University. JL acknowledges support from a Strategic Research Grant (7006037) at City University of Hong Kong.

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
