Investigating translated Chinese and its variants using machine learning

Hai Hu*, Sandra Kübler

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

15 Citations (Scopus)

Abstract

Translations are generally assumed to share universal features that distinguish them from texts that are originally written in the same language. Thus, we can argue that these translations constitute their own variety of a language, often called translationese. However, translations are also influenced by their source languages and thus show different characteristics depending on the source language. Consequently, we argue that these variants constitute different dialects of translations into the same target language. Studies using machine learning techniques on Indo-European languages have investigated the universal characteristics of translationese and how translations from various source languages differ. However, for typologically very different languages such as Chinese, there are only few corpus studies that tap into the intricate relation between translations and the originals, as well as into the relations among translations themselves. In this contribution, we investigate the following questions: (1) What are the characteristics of Chinese translationese, both in general and with respect to different source languages? (2) Can we find differences not only at the lexical but also on the syntactic level? and (3) Based on the characteristics found in the previous questions, which of the proposed laws and universals can we corroborate based on our evidence from Chinese? We use machine learning to operationalize determining the importance of different characteristics and comparing their importance for our Chinese dataset with characteristics previously reported in studies on English. In addition, our methodology allows us to add syntactic features, which have rarely been used to study translations into Chinese. Our results show that Chinese translations as a whole can be reliably distinguished from non-translations, even based on only five features. 
More interestingly, typological traces from the source languages can often be found in their translations, therefore creating what we call dialects of translationese. For instance, translations from two Altaic languages exhibit more noun repetition and less frequent use of pronouns. Additionally, some characteristics that are not discriminative for English work well for Chinese, possibly because the distance between Chinese and the source languages is greater than that in English studies.
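As a rough illustration of what "determining the importance of different characteristics" can look like in practice, the sketch below ranks features by how strongly they separate translated from original texts. It is not the authors' pipeline: the abstract does not name the classifier or the feature set, so a simple univariate effect size (absolute standardized mean difference) stands in for model-based importance. The feature names (`pronoun_ratio`, `noun_repetition`, `avg_sentence_len`) and all values are hypothetical and invented for the example.

```python
from statistics import mean, pstdev


def rank_features(translated, original):
    """Rank features by absolute standardized mean difference between groups.

    Both arguments are lists of dicts mapping feature name -> numeric value.
    This is a univariate stand-in for classifier-based feature importance.
    """
    names = sorted(translated[0])
    scores = {}
    for name in names:
        t = [doc[name] for doc in translated]
        o = [doc[name] for doc in original]
        spread = pstdev(t + o)  # population std. dev. over both groups
        # A constant feature cannot separate the groups, so it scores 0.
        scores[name] = abs(mean(t) - mean(o)) / spread if spread > 0 else 0.0
    # Highest-scoring (most discriminative) features first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up feature values (NOT taken from the paper).
translated = [
    {"pronoun_ratio": 0.08, "noun_repetition": 0.10, "avg_sentence_len": 22.0},
    {"pronoun_ratio": 0.09, "noun_repetition": 0.12, "avg_sentence_len": 20.0},
    {"pronoun_ratio": 0.07, "noun_repetition": 0.11, "avg_sentence_len": 21.0},
]
original = [
    {"pronoun_ratio": 0.04, "noun_repetition": 0.11, "avg_sentence_len": 20.5},
    {"pronoun_ratio": 0.05, "noun_repetition": 0.10, "avg_sentence_len": 19.5},
    {"pronoun_ratio": 0.03, "noun_repetition": 0.12, "avg_sentence_len": 20.0},
]

ranking = rank_features(translated, original)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Keeping only the top few entries of such a ranking mirrors the idea of classifying with a small feature set, although a real study would score features with a trained classifier and held-out data rather than a single effect-size statistic.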

© Cambridge University Press 2020
Original language: English
Pages (from-to): 339-372
Journal: Natural Language Engineering
Volume: 27
Issue number: 3
Online published: 3 Apr 2020
Publication status: Published - May 2021
Externally published: Yes

Funding

The authors thank Ruoze Huang, Chien-Jer Charles Lin, Jingyi Guo, Yueqi Zhu, and the computational linguistics colloquium at Indiana University for helpful discussions. The authors are grateful for the comments and suggestions from Aini Li and our two anonymous reviewers. The authors also thank Noam Ordan for providing information about the implementation for the study on English. The first author is partly funded by China Scholarship Council.

Research Keywords

  • Chinese
  • Text classification
  • Translation universal
  • Translationese
