Transductive learning, data refactoring and model reweighting in domain adaptation for language processing

  • Yan SONG

Student thesis: Doctoral Thesis

Abstract

A critical problem in statistical natural language processing (NLP) is that the performance of an NLP system often degrades significantly when its training and test data come from different domains. Given the inevitable expense of collecting and annotating training sets for a new domain, domain adaptation is intended to address this problem, aiming to maintain stable performance when an NLP system is transferred from one domain to another. Most domain adaptation methods are task-specific and require labeled data from the target domain, and to date there have been only limited studies on domain adaptation for Chinese language processing. This thesis develops novel approaches in three directions to tackle the drawbacks of current domain adaptation methodology. First, a transductive learning approach is proposed for domain adaptation, which extracts domain knowledge from target domain data and incorporates it into the learning process. A system built this way is able to capture the essential characteristics of a target domain and outperform an out-of-domain system. Second, an entropy-based training data selection approach is proposed for data refactoring, using different measures to compute domain similarity. This is a general strategy for data preprocessing and is applicable to different NLP tasks. The approach is further enhanced by feature augmentation to overcome the shortcoming of training data selection relying on only part of the data: it exploits various discriminative information from both the source and target domains and takes advantage of the entire training set. Third, a simple and effective approach is proposed for domain adaptation through model reweighting, which assigns appropriate weights to models trained on different source domains before applying them to a new target domain. It measures domain variance and uses it as a guideline for weighting the models.
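The abstract does not specify which similarity measure the entropy-based data selection uses. As one plausible, purely illustrative instance (the function names and the unigram model are assumptions, not the thesis's method), source-domain sentences can be ranked by their cross-entropy under a simple language model trained on target-domain text, keeping the sentences closest to the target domain:

```python
from collections import Counter
import math

def unigram_logprobs(corpus):
    """Train an add-one smoothed unigram model from tokenized sentences."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values()) + len(counts) + 1  # +1 mass for unseen tokens
    logprobs = {w: math.log((c + 1) / total) for w, c in counts.items()}
    return logprobs, math.log(1 / total)  # model and unknown-token log-prob

def cross_entropy(sent, logprobs, unk_lp):
    """Average negative log-probability of a sentence under the model."""
    return -sum(logprobs.get(tok, unk_lp) for tok in sent) / len(sent)

def select_training_data(source, target, k):
    """Keep the k source sentences with the lowest cross-entropy under a
    target-domain unigram model, i.e. the most target-like sentences."""
    logprobs, unk_lp = unigram_logprobs(target)
    ranked = sorted(source, key=lambda s: cross_entropy(s, logprobs, unk_lp))
    return ranked[:k]
```

Lower cross-entropy means the sentence looks more like the target domain, so the selected subset forms a refactored training set biased toward the target.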
To further enhance this approach, a hybrid approach integrating model reweighting with training data selection is also explored. The proposed approaches have several advantages: they work well with or without labeled data in the target domain, and they can also learn effectively from the test data. Together they cover the whole pipeline of a typical machine learning process, including data preprocessing, learning and prediction. Chinese language processing is chosen as the application area for this thesis research, and the approaches are evaluated on four tasks: Chinese word segmentation (CWS), Chinese part-of-speech (POS) tagging, joint CWS and POS tagging, and English-Chinese machine translation (MT). The validity and effectiveness of these approaches are confirmed by the experimental results: systems equipped with our domain adaptation perform significantly better than those without, and the performance of our domain adaptation is significantly better than, or highly comparable to, previously reported state-of-the-art methods on each task.
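The model-reweighting idea can be sketched in minimal form. This is an assumption-laden illustration, not the thesis's actual formulation: here each source-domain model's weight comes from a softmax over negative domain divergences (a stand-in for the "domain variance" guideline), and per-label scores are combined by weighted voting:

```python
import math

def domain_weights(divergences, temperature=1.0):
    """Softmax over negative divergences: source domains closer to the
    target (smaller divergence) receive larger weights."""
    scores = [math.exp(-d / temperature) for d in divergences]
    z = sum(scores)
    return [s / z for s in scores]

def reweighted_predict(model_scores, weights):
    """Combine per-label score dicts from several source-domain models
    into one weighted vote and return the best label."""
    combined = {}
    for scores, w in zip(model_scores, weights):
        for label, p in scores.items():
            combined[label] = combined.get(label, 0.0) + w * p
    return max(combined, key=combined.get)
```

With this scheme, a model trained on a source domain similar to the target dominates the vote, while distant domains still contribute, which is the intuition behind applying reweighted models to a new target domain.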
Date of Award: 15 Jul 2015
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Chun Yu KIT

Keywords

  • Chinese language
  • Natural language processing (Computer science)
  • Data processing
