A critical problem within statistical natural language processing (NLP) is that the
performance of an NLP system often degrades significantly when its training and
test data are from different domains. Bearing in mind the inevitable expense of
collecting and annotating training sets for a new domain, domain adaptation is
intended to address this problem, aiming to maintain an NLP system's performance when it is transferred from one domain to another. Most methods in domain adaptation are
task-specific and require labeled data from a target domain. Up to now, there have
been only limited studies on domain adaptation for Chinese language processing.
This thesis develops novel approaches in three directions to tackle the drawbacks of current domain adaptation methodology. First, a transductive
learning approach is proposed for domain adaptation, which extracts domain
knowledge from target domain data and subsequently incorporates it into the
learning process. A system built this way is able to capture the essential
characteristics of a target domain and outperform an out-of-domain system. Second,
an entropy-based training data selection approach is proposed to refactor training data for domain adaptation, using various measures to compute domain similarity. This is a
general strategy for data preprocessing and is applicable to different NLP tasks. This
approach is further enhanced by feature augmentation to overcome the shortcomings
of training data selection, which relies on only part of the data. It exploits various kinds of discriminative information from both the source and target domains and takes advantage of the
entire training data. Third, a simple and effective approach is proposed for domain
adaptation through model reweighting, which assigns appropriate weights to models trained on different source domains when they are applied to a new target domain. It measures domain variance and uses it as a guideline for weighting
different models. To further enhance this approach, a hybrid approach to integrate
model reweighting with training data selection is also attempted.
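As one illustration of the entropy-based selection idea described above, source-domain sentences can be ranked by their cross-entropy under a target-domain language model, keeping the most target-like ones. The toy unigram model with add-one smoothing below is a simplifying assumption for the sketch, not the thesis's actual similarity measure:

```python
import math
from collections import Counter

def unigram_logprob_model(corpus):
    """Build an add-one-smoothed unigram model from tokenized sentences
    (a toy stand-in for a target-domain language model)."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-token cross-entropy (in nats) of a sentence under the model."""
    return -sum(logprob(tok) for tok in sentence) / len(sentence)

def select_training_data(source_corpus, target_corpus, k):
    """Keep the k source sentences with the lowest cross-entropy under a
    target-domain model, i.e. the ones most similar to the target domain."""
    target_lm = unigram_logprob_model(target_corpus)
    ranked = sorted(source_corpus, key=lambda s: cross_entropy(s, target_lm))
    return ranked[:k]
```

In practice the thesis's approach would use stronger language models and richer similarity measures, but the ranking-and-truncating pattern is the same.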
Our proposed approaches have several advantages: they work well with or without labeled data in a target domain, and can also learn effectively from the test data.
These approaches cover the whole pipeline of a typical machine learning process,
including data preprocessing, learning and prediction. Chinese language processing
is chosen as the application task for this thesis research. These approaches are
evaluated on four different tasks of Chinese language processing, namely, Chinese
word segmentation (CWS), Chinese part-of-speech (POS) tagging, joint CWS and
POS tagging, and English-Chinese machine translation (MT). The validity and effectiveness of these approaches are confirmed by the experimental results that
systems equipped with our domain adaptation perform significantly better than those
without, and that the performance of our domain adaptation is significantly better
than or highly comparable to previously reported state-of-the-art methods on each
task.
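The model reweighting idea can likewise be sketched as deriving a weight for each source-domain model from its similarity to the target domain and combining model scores accordingly. The Jaccard vocabulary similarity used here is only one plausible proxy for the domain-variance measure studied in the thesis:

```python
def domain_weights(source_vocabs, target_vocab):
    """Weight each source-domain model by the Jaccard similarity between its
    vocabulary and the target domain's (an assumed proxy for domain variance),
    normalized so the weights sum to 1. Assumes at least one source domain
    shares some vocabulary with the target."""
    sims = [len(v & target_vocab) / len(v | target_vocab) for v in source_vocabs]
    total = sum(sims)
    return [s / total for s in sims]

def reweighted_score(model_scores, weights):
    """Combine the scores that the source-domain models assign to one
    candidate analysis into a single weighted score."""
    return sum(w * s for w, s in zip(weights, model_scores))
```

A hybrid scheme, as attempted in the thesis, would additionally filter each source's training data before the per-domain models are trained.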
| Date of Award | 15 Jul 2015 |
|---|---|
| Original language | English |
| Awarding Institution | City University of Hong Kong |
| Supervisor | Chun Yu KIT (Supervisor) |
- Chinese language
- Natural language processing (Computer science)
- Data processing
Transductive learning, data refactoring and model reweighting in domain adaptation for language processing
SONG, Y. (Author). 15 Jul 2015
Student thesis: Doctoral Thesis