Constituent parsing, or parsing for short, is a fundamental process in natural language
processing (NLP) to derive a provably useful syntactic representation for an
input sentence. It plays an important role in many applications, e.g., machine translation
and sentiment analysis.
Natural languages are widely acknowledged to be inherently ambiguous. There
are usually many possible parses for an input sentence. Thus, statistical approaches
are adopted to evaluate the plausibility of each candidate parse, so as to select the
best. The most straightforward and commonly adopted approach is probabilistic
context-free grammar (PCFG). However, PCFG is limited by its underlying strong
independence assumption. To address this problem, a discriminative parsing model
is formulated in this thesis with a novel method of parameterization. First, this
model considers instances of grammar rules to be the basic components of a parse
tree. The same grammar rule appearing in different positions in a parse tree is considered
a different instance, and assigned a score to indicate its plausibility. The
plausibility of a parse tree is defined as the sum of scores of all such instances in
the tree. Second, the plausibility of such an instance is determined with regard to
information from a limited local context. In particular, this parsing model considers
two types of local context: lexical and structural. The lexical context of a grammar
rule instance includes all words that it covers, and the structural context consists
of its neighbors in the same parse tree. Using features extracted from such lexical
and structural contexts, this parsing model enriches the conventional PCFG-based
parsing with lexical and structural sensitivity to relax the strong independence assumption.
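As a rough illustration of the scoring scheme described above, the following minimal sketch computes the plausibility of a parse tree as the sum of the scores of its grammar rule instances, with each instance scored by a linear model over features of its lexical and structural context. The `Node` class, the feature templates, and the use of the parent label as the only structural neighbor are assumptions made for this example, not the feature set used in the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Node:
    """A parse tree node; a leaf carries a POS label and a word (hypothetical layout)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None


def covered_words(node: Node) -> Iterator[str]:
    """Yield every word covered by the subtree rooted at `node`."""
    if node.word is not None:
        yield node.word
    for child in node.children:
        yield from covered_words(child)


def rule_instances(node: Node, parent: Optional[Node] = None) -> Iterator[Tuple[Node, Optional[Node]]]:
    """Yield (node, parent) for every node that expands into children.

    The same rule at two positions in the tree yields two separate instances.
    """
    if node.children:
        yield node, parent
        for child in node.children:
            yield from rule_instances(child, node)


def instance_features(node: Node, parent: Optional[Node]) -> List[str]:
    """Illustrative feature templates: the rule itself, lexical context, structural context."""
    rule = f"{node.label}->{' '.join(c.label for c in node.children)}"
    feats = [f"rule={rule}"]
    # Lexical context: the words covered by this rule instance.
    feats += [f"rule={rule}|word={w}" for w in covered_words(node)]
    # Structural context: here only the parent label (an assumption of this sketch).
    feats.append(f"rule={rule}|parent={parent.label if parent else 'ROOT'}")
    return feats


def score_tree(root: Node, weights: Dict[str, float]) -> float:
    """Plausibility of a tree: the sum of the scores of all its rule instances."""
    return sum(
        weights.get(f, 0.0)
        for node, parent in rule_instances(root)
        for f in instance_features(node, parent)
    )
```

A decoder would then select the candidate tree with the highest `score_tree` value; how candidates are generated is outside the scope of this sketch.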
Furthermore, this thesis is intended to explore statistical constituent parsing with a
generic structured linear model. Under the framework of discriminative rescoring,
this model is capable of combining the strengths of generative and discriminative
parsing. Experiments show that it achieves its best F1 scores of 91.86% and 85.58%
on English and Chinese test sets, reducing the error rates on these two languages
by 19.6% and 14.0% over the baseline Berkeley parser. Working in a purely
discriminative manner, this model also produces competitive results against the best
discriminative approaches in the literature. More significantly, it employs a simple
perceptron for parameter estimation. A novel parallel decoding algorithm makes it
possible to train the model efficiently on large-scale treebanks. Analyses of the parser’s
outputs show that it can provide sound resolutions for many thorny syntactic ambiguities
without using any overt linguistically motivated feature. Finally, the combination
of this model with other high-performance parsers through a constituent
recombination framework further pushes its best F1 scores to 92.80% and 85.60%
on the two aforementioned languages, which are the highest ones achieved so far
on the same data sets. All these results confirm the validity and effectiveness of this
novel approach to constituent parsing.
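The perceptron-based parameter estimation mentioned in the abstract can be illustrated with a minimal sketch of one plain structured perceptron update. It assumes that each candidate parse has already been reduced to a list of feature strings and that the correct parse is among the candidates, as it might be when rescoring a candidate list; the function names are hypothetical, and the sketch covers neither the parallel decoding algorithm nor any weight averaging.

```python
from collections import Counter
from typing import Dict, List


def score(weights: Dict[str, float], feats: List[str]) -> float:
    """Linear score of one candidate parse: the sum of its feature weights."""
    return sum(weights.get(f, 0.0) for f in feats)


def perceptron_step(weights: Dict[str, float],
                    candidates: List[List[str]],
                    gold_index: int) -> int:
    """Run one perceptron update in place and return the predicted candidate index.

    `candidates` holds one feature list per candidate parse (an assumption of
    this sketch); `gold_index` points at the correct parse.
    """
    # Decode: pick the highest-scoring candidate under the current weights.
    pred_index = max(range(len(candidates)),
                     key=lambda i: score(weights, candidates[i]))
    if pred_index != gold_index:
        # Update: reward features of the gold parse, penalize the prediction's.
        delta = Counter(candidates[gold_index])
        delta.subtract(Counter(candidates[pred_index]))
        for feat, diff in delta.items():
            if diff:
                weights[feat] = weights.get(feat, 0.0) + diff
    return pred_index
```

Repeating this step over a treebank for several passes gives the basic training loop; features shared by the gold and predicted parses cancel out, so only the features on which the two parses differ are adjusted.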
| Date of Award | 16 Jul 2012 |
|---|---|
| Original language | English |
| Awarding Institution | City University of Hong Kong |
| Supervisor | Chun Yu KIT (Supervisor) |
- Grammar, Comparative and general
- Natural language processing (Computer science)
- Data processing
- Parsing (Computer grammar)
- Parsing
- Computational linguistics
Discriminative constituent parsing with localized features
CHEN, X. (Author). 16 Jul 2012
Student thesis: Doctoral Thesis