A Mandarin-Cantonese Parallel Corpus with Formality Ranking

John S. Y. Lee, Qiong Wang

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

Formality-controlled machine translation allows users to specify the formality level of the target sentence so that it suits the intended audience. While formality-annotated datasets have been constructed for some major languages, no such resource is currently available for Cantonese. This paper presents a Mandarin-Cantonese parallel corpus of 300 Mandarin sentences, each aligned to a list of five or more Cantonese sentences ranked by their level of formality. To our knowledge, this is the first parallel translation corpus with manual formality ranking, which provides a more nuanced judgment than the formal/informal dichotomy used in most current formality-annotated datasets. The corpus can support future research towards more fine-grained notions of formality in terminology, translation and text style transfer. © 2025 Copyright for this paper by its authors.
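To make the structure described in the abstract concrete, the following is a minimal sketch of how one corpus entry might be represented in Python. The class name `RankedEntry`, its field names, and the placeholder sentences are hypothetical and do not reflect the corpus's actual file format. The direction of the ranking (most formal first) is also an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RankedEntry:
    """Hypothetical container: one Mandarin source with ranked Cantonese targets."""

    mandarin: str
    # Cantonese sentences in rank order (assumed here: most formal first).
    cantonese_ranked: list[str] = field(default_factory=list)

    def is_valid(self, min_targets: int = 5) -> bool:
        # The abstract states that each source has five or more targets.
        return len(self.cantonese_ranked) >= min_targets

    def preference_pairs(self) -> list[tuple[str, str]]:
        # Expand the ranking into (higher-ranked, lower-ranked) pairs.
        r = self.cantonese_ranked
        return [(r[i], r[j]) for i in range(len(r)) for j in range(i + 1, len(r))]


# Placeholder strings stand in for real sentences; no corpus data is shown.
entry = RankedEntry(
    mandarin="<Mandarin sentence>",
    cantonese_ranked=[f"<Cantonese variant {k}>" for k in range(1, 6)],
)
print(entry.is_valid())
print(len(entry.preference_pairs()))
```

Expanding a ranked list into pairs is one common way to use ranking annotations for pairwise evaluation. Whether the authors use the corpus this way is not stated in the abstract.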
Original language: English
Title of host publication: MDTT 2025 - Multilingual Digital Terminology Today 2025
Subtitle of host publication: Proceedings of the 4th International Conference on Multilingual Digital Terminology Today (MDTT 2025)
Editors: Federica Vezzani, Giorgio Maria Di Nunzio, Elpida Loupaki, Georgios Meditskos, Maria Papoutsoglou
Publisher: CEUR-WS
Number of pages: 7
Publication status: Published - Jun 2025
Event: 4th International Conference on Multilingual Digital Terminology Today (MDTT 2025): Design, representation formats and management systems - Aristotle University Research Dissemination Center, Thessaloniki, Greece
Duration: 19 Jun 2025 to 20 Jun 2025
https://mdtt2025.web.auth.gr/en/

Publication series

Name: CEUR Workshop Proceedings
Volume: 3990
ISSN (Print): 1613-0073

Conference

Conference: 4th International Conference on Multilingual Digital Terminology Today (MDTT 2025)
Abbreviated title: MDTT2025
Place: Greece
City: Thessaloniki
Period: 19/06/25 to 20/06/25

Funding

This work is partially supported by a Strategic Research Grant (project number 70006037) from City University of Hong Kong.

Research Keywords

  • formality ranking
  • formality-controlled machine translation
  • parallel corpus
  • Large Language Models
  • Cantonese

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
