Discriminative constituent parsing with localized features

  • Xiao CHEN

    Student thesis: Doctoral Thesis

    Abstract

    Constituent parsing, or parsing for short, is a fundamental process in natural language processing (NLP) that derives a useful syntactic representation for an input sentence. It plays an important role in many applications, e.g., machine translation and sentiment analysis. Natural languages are widely acknowledged to be inherently ambiguous, so there are usually many possible parses for an input sentence. Statistical approaches are therefore adopted to evaluate the plausibility of each candidate parse and select the best one. The most straightforward and commonly adopted approach is probabilistic context-free grammar (PCFG). However, PCFG is limited by its underlying strong independence assumption. To address this problem, a discriminative parsing model is formulated in this thesis with a novel method of parameterization. First, this model considers instances of grammar rules to be the basic components of a parse tree. The same grammar rule appearing in different positions in a parse tree is considered a different instance and is assigned a score to indicate its plausibility. The plausibility of a parse tree is defined as the sum of the scores of all such instances in the tree. Second, the plausibility of such an instance is determined with regard to information from a limited local context. In particular, this parsing model considers two types of local context: lexical and structural. The lexical context of a grammar rule instance includes all words that it covers, and the structural context consists of its neighbors in the same parse tree. Using features extracted from such lexical and structural contexts, this parsing model enriches conventional PCFG-based parsing with lexical and structural sensitivity to relax the strong independence assumption. Furthermore, this thesis is intended to explore statistical constituent parsing with a generic structured linear model.
    Under the framework of discriminative rescoring, this model is capable of combining the strengths of generative and discriminative parsing. Experiments show that it achieves best F1 scores of 91.86% and 85.58% on English and Chinese test sets, reducing the error rates on these two languages by 19.6% and 14.0% over the baseline Berkeley parser. Working in a purely discriminative manner, this model also produces competitive results against the best discriminative approaches in the literature. More significantly, it employs a simple perceptron for parameter estimation. A novel parallel decoding algorithm allows it to be trained efficiently on large-scale treebanks. Analyses of the parser's outputs show that it can provide sound resolutions for many thorny syntactic ambiguities without using any overt linguistically motivated feature. Finally, the combination of this model with other high-performance parsers through a constituent recombination framework further pushes its best F1 scores to 92.80% and 85.60% on the two aforementioned languages, which are the highest ones achieved so far on the same data sets. All these results confirm the validity and effectiveness of this novel approach to constituent parsing.
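The scoring scheme described in the abstract (a tree's plausibility as the sum of scores of its grammar rule instances, each scored by a linear model over local lexical and structural features, with perceptron parameter estimation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the `RuleInstance` type, the feature templates (first and last covered word, parent rule as the only structural neighbor), and the plain unaveraged perceptron update.

```python
from collections import Counter
from typing import NamedTuple, Optional


class RuleInstance(NamedTuple):
    """One occurrence of a grammar rule at a specific position in a tree."""
    rule: str                   # e.g. "NP -> DT NN"
    start: int                  # first covered word (inclusive)
    end: int                    # last covered word (exclusive)
    parent_rule: Optional[str]  # structural context (hypothetical choice)


def extract_features(inst, words):
    """Local features of one rule instance (illustrative templates only)."""
    covered = words[inst.start:inst.end]  # lexical context: covered words
    feats = Counter()
    feats["rule=%s" % inst.rule] += 1
    feats["rule=%s|first=%s" % (inst.rule, covered[0])] += 1
    feats["rule=%s|last=%s" % (inst.rule, covered[-1])] += 1
    feats["rule=%s|parent=%s" % (inst.rule, inst.parent_rule)] += 1
    return feats


def tree_features(tree, words):
    """Sum the feature counts of all rule instances in a tree."""
    total = Counter()
    for inst in tree:
        total.update(extract_features(inst, words))
    return total


def tree_score(tree, words, weights):
    """Tree plausibility: sum of linear scores of its rule instances."""
    feats = tree_features(tree, words)
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def perceptron_update(weights, gold, predicted, words):
    """Move weights toward the gold tree and away from the predicted tree."""
    gold_f = tree_features(gold, words)
    pred_f = tree_features(predicted, words)
    for f in set(gold_f) | set(pred_f):
        delta = gold_f[f] - pred_f[f]
        if delta:
            weights[f] = weights.get(f, 0.0) + delta
```

In this sketch a tree is simply a list of `RuleInstance` values. A real parser would also need a decoder that searches for the highest-scoring tree, which is omitted here.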
    Date of Award: 16 Jul 2012
    Original language: English
    Awarding Institution
    • City University of Hong Kong
    Supervisor: Chun Yu KIT

    Keywords

    • Grammar, Comparative and general
    • Natural language processing (Computer science)
    • Data processing
    • Parsing (Computer grammar)
    • Parsing
    • Computational linguistics
