CSSWiki: A Chinese Sentence Simplification Dataset with Linguistic and Content Operations

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Detail(s)

Original language: English
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Publisher: European Language Resources Association (ELRA)
Pages: 4205-4213
ISBN (print): 9782493814104
Publication status: Published - 23 May 2024

Publication series

Name: Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING - Main Conference Proceedings

Conference

Title: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Location: Hybrid
Place: Italy
City: Torino
Period: 20 - 25 May 2024

Abstract

Sentence Simplification aims to make sentences easier to read and understand. With most effort on corpus development focused on English, the amount of annotated data is limited in Chinese. To address this need, we introduce CSSWiki, an open-source dataset for Chinese sentence simplification based on Wikipedia. This dataset contains 1.6k source sentences paired with their simplified versions. Each sentence pair is annotated with operation tags that distinguish between linguistic and content modifications. We analyze differences in annotation scheme and data statistics between CSSWiki and existing datasets. We then report baseline sentence simplification performance on CSSWiki using zero-shot and few-shot approaches with Large Language Models.

© 2024 ELRA Language Resource Association

Research Area(s)

  • Chinese sentence simplification, Corpus creation, Linguistic simplification operations, Content simplification operations

Citation Format(s)

CSSWiki: A Chinese Sentence Simplification Dataset with Linguistic and Content Operations. / Liu, Fengkai; Lee, John S. Y.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). European Language Resources Association (ELRA), 2024. p. 4205-4213 (Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING - Main Conference Proceedings).
