This research is an effort to explore how the syntactic information of term
candidates can be exploited for the task of term extraction. It proposes an
approach that represents a novel, linguistically motivated perspective in the area
of terminological processing. The hypothesis of this work is that terms tend to
perform certain types of syntactic functions more prominently than others. This
syntactic behaviour of terms can be captured as termhood by estimating term
probabilities from their occurrences in different syntactic paths. Based on a large
corpus of parse trees, this feature allows for highly reliable statistics on
properties of term occurrences. In essence, the method is a weighting scheme
that measures probabilistic relations between term occurrence patterns and
syntactic paths; it is referred to in this thesis as Syntactic Function Value
(SF-Value) and is implemented in a term extraction system.
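
The abstract does not give the SF-Value formula, so the following is only a minimal sketch of the general idea, under two stated assumptions: the term rate of a syntactic path is taken to be the share of occurrences in that path that are known terms, and a candidate's score is taken to be the mean term rate over the paths in which it occurs. The function names, the path labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def path_term_rates(occurrences, known_terms):
    """For each syntactic path, return the share of occurrences that are known terms."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for candidate, path in occurrences:
        total[path] += 1
        if candidate in known_terms:
            hits[path] += 1
    return {path: hits[path] / total[path] for path in total}


def sf_value(candidate, occurrences, rates):
    """Mean term rate over the candidate's occurrences (an assumed aggregation)."""
    seen = [rates[path] for cand, path in occurrences if cand == candidate]
    return sum(seen) / len(seen) if seen else 0.0


# Toy (candidate, syntactic path) pairs; the path labels are invented for illustration.
occurrences = [
    ("insulin", "NP>VP:object"),
    ("insulin", "NP>PP:complement"),
    ("patient", "NP>S:subject"),
    ("week", "NP>PP:complement"),
    ("glucose", "NP>VP:object"),
    ("thing", "NP>S:subject"),
]
known_terms = {"insulin", "glucose"}

rates = path_term_rates(occurrences, known_terms)
candidates = {cand for cand, _ in occurrences}
scores = {cand: sf_value(cand, occurrences, rates) for cand in candidates}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting, candidates that appear mostly in term-rich paths rise to the top of `ranking`. A realistic estimate would need a large parsed corpus and term rates computed on data held out from the candidates being scored.
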
Experiments conducted in this study begin by building up an automatic term
extraction system that integrates such a weighting scheme. The purpose of these
experiments is not to design a term extraction system with the best performance
but to investigate the contributions of syntactic information to term extraction,
including single-word terms, multi-word terms, and new terms. Specifically,
these experiments aim to answer several research questions, including:
whether linguistic knowledge, in the form of term rates in syntactic paths, is
useful for recognising candidate terms in medical texts; to what extent
single-word terms can be extracted by this linguistic indicator; how
this linguistically based metric can be used to improve the ranking of multi-word
terms; and whether term rates in syntactic paths can be used effectively for new
term extraction. Finally, with the aim of investigating whether this linguistic
metric can be used as an effective feature within a machine learning framework,
a series of experiments are conducted on general term extraction and new term
extraction using the method of Conditional Random Fields (CRF).
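
As a rough illustration of how a path-based term rate could enter a CRF as one feature among others, the sketch below builds per-token feature dictionaries of the kind many CRF toolkits accept. It is not the feature set used in the thesis: the feature names, the path labels, the rates, and the BIO labels are all assumptions, and no CRF library is called.

```python
def token_features(tokens, paths, rates, i):
    """Feature dict for token i; `paths` holds one syntactic path label per token."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "path": paths[i],
        # Real-valued feature: term rate of the token's syntactic path (0.0 if unseen).
        "path_term_rate": rates.get(paths[i], 0.0),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens, paths, rates):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, paths, rates, i) for i in range(len(tokens))]


# Toy sentence with invented path labels and term rates.
tokens = ["Insulin", "lowers", "glucose"]
paths = ["NP>S:subject", "VP>S:head", "NP>VP:object"]
rates = {"NP>S:subject": 0.4, "NP>VP:object": 0.7}

features = sentence_features(tokens, paths, rates)
labels = ["B-TERM", "O", "B-TERM"]  # hypothetical gold labels for training
```

A sequence labeller would then be trained on pairs of `features` and `labels`, one pair per sentence.
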
Empirical results show that the term extraction approach proposed in
this study outperforms two existing
term extractors. The key technique of this term extraction system, SF-Value,
proves to be especially useful in selecting single-word terms and is also effective
in enhancing the ranking of multi-word term candidates after their initial ranking
by a statistical measure, C-Value. With regard to new term extraction, results
show that SF-Value does not perform as well, which suggests that more features
are needed to distinguish new terms from known terms. A CRF framework is
subsequently applied, with SF-Value and term rate as added features, to the
extraction of new terms. Results show that this machine learning
framework performs quite well in general term extraction. However, for the task
of generating a list of new term candidates, it does not perform as well as
expected. This result indicates that, for the task of new term
extraction, more features related to new term candidates should be taken into
consideration, in addition to syntactic function information.
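
The re-ranking of multi-word candidates mentioned above can be sketched as follows. The `c_value` function follows the commonly cited C-Value definition (log2 of the candidate length times its frequency, discounted by the mean frequency of longer candidates that contain it). The abstract does not say how C-Value and SF-Value are combined, so the multiplicative weighting here is purely an assumption, and the frequencies and SF scores are invented.

```python
import math


def c_value(candidate, freq):
    """C-Value of a multi-word candidate (a tuple of words); freq maps tuples to counts."""
    n = len(candidate)
    # Longer candidates that contain this one as a contiguous word sequence.
    containers = [
        other for other in freq
        if len(other) > n
        and any(other[i:i + n] == candidate for i in range(len(other) - n + 1))
    ]
    f = freq[candidate]
    if containers:
        f -= sum(freq[other] for other in containers) / len(containers)
    return math.log2(n) * f


# Toy candidate frequencies and hypothetical SF scores in [0, 1].
freq = {
    ("blood", "pressure"): 10,
    ("high", "blood", "pressure"): 4,
    ("last", "week"): 10,
}
sf_scores = {
    ("blood", "pressure"): 0.9,
    ("high", "blood", "pressure"): 0.8,
    ("last", "week"): 0.1,
}

by_c_value = sorted(freq, key=lambda c: c_value(c, freq), reverse=True)
# Assumed combination: weight each C-Value by the candidate's SF score.
reranked = sorted(freq, key=lambda c: c_value(c, freq) * sf_scores[c], reverse=True)
```

With these toy numbers, the frequent but non-terminological "last week" leads the C-Value ranking and drops to the bottom once the syntactic score is applied.
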
In conclusion, this study devises an innovative, linguistically motivated measure
for term extraction and implements it in a software system. Comprehensive
experiments are conducted to evaluate its performance, and empirical results
demonstrate its superior performance in comparison with existing term
extraction systems.
Date of Award: 15 Jul 2011
Original language: English
Awarding Institution: City University of Hong Kong
Supervisor: Chengyu Alex FANG
- Data processing
- Terms and phrases
Enhanced term extraction based on probabilistic estimation from syntactic parse trees
ZHANG, X. (Author). 15 Jul 2011
Student thesis: Doctoral Thesis