Abstract
Formality-controlled machine translation allows users to specify the formality level of the target sentence so that it suits the intended audience. While formality-annotated datasets have been constructed for some major languages, no such resource is currently available for Cantonese. This paper presents a Mandarin-Cantonese parallel corpus of 300 Mandarin sentences, each aligned to a list of five or more Cantonese sentences ranked by their level of formality. To our knowledge, this is the first parallel translation corpus with manual formality ranking, which provides more nuanced judgments than the formal/informal dichotomy used in most current formality-annotated datasets. This corpus can support future research on more fine-grained notions of formality in terminology, translation and text style transfer. © 2025 Copyright for this paper by its authors.
| Original language | English |
|---|---|
| Title of host publication | MDTT 2025 - Multilingual Digital Terminology Today 2025 |
| Subtitle of host publication | Proceedings of the 4th International Conference on Multilingual Digital Terminology Today (MDTT 2025) |
| Editors | Federica Vezzani, Giorgio Maria Di Nunzio, Elpida Loupaki, Georgios Meditskos, Maria Papoutsoglou |
| Publisher | CEUR-WS |
| Number of pages | 7 |
| Publication status | Published - Jun 2025 |
| Event | 4th International Conference on Multilingual Digital Terminology Today (MDTT 2025): Design, representation formats and management systems - Aristotle University Research Dissemination Center, Thessaloniki, Greece. Duration: 19 Jun 2025 → 20 Jun 2025. https://mdtt2025.web.auth.gr/en/ |
Publication series
| Name | CEUR Workshop Proceedings |
|---|---|
| Volume | 3990 |
| ISSN (Print) | 1613-0073 |
Conference
| Conference | 4th International Conference on Multilingual Digital Terminology Today (MDTT 2025) |
|---|---|
| Abbreviated title | MDTT2025 |
| Place | Greece |
| City | Thessaloniki |
| Period | 19/06/25 → 20/06/25 |
| Internet address | https://mdtt2025.web.auth.gr/en/ |
Funding
This work is partially supported by a Strategic Research Grant (project number 70006037) from City University of Hong Kong.
Research Keywords
- formality ranking
- formality-controlled machine translation
- parallel corpus
- Large Language Models
- Cantonese
Publisher's Copyright Statement
- This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/