A Linguistically Intelligent Approach to Detecting Implicit Discourse Relations in Natural Texts

自然語言中隱性語篇關係識別的語言學分析

Student thesis: Doctoral Thesis

Award date: 6 Jul 2021

Abstract

This thesis explores which linguistic cues help detect implicit discourse relations in natural texts, and how. A natural text differs from a random collection of sentences because its text spans are interrelated, both within the sentence and beyond the sentence boundary. The interaction between text spans is widely studied in linguistics, where it is termed a “discourse relation”. Discourse relations are what glue a text together, allowing it to function as an integrated and meaningful whole. Research on discourse relations is intriguingly complex because there are many ways to encode a single discourse relation. One can use various linguistic resources to denote meaning: lexical resources, such as the opposite words “cold” and “hot” to denote an opposite meaning; grammatical structures, such as “had I known…” to denote a contrast between reality and assumption; contextual information, such as a contrast between “the yield in 2019” and “the decline in the stock price” to show a comparison; metaphorical encoding, such as a juxtaposition of “the bull market” and “the bear market” to encode a comparison; or even world knowledge, such as “the merger and acquisition” and “profit” to express a contingency. This range of grammatical and semantic complexity is intrinsic to language. By its nature, language enables us to use various linguistic resources to denote meaning, at the lexico-grammatical level and beyond. In this sense, discourse relations interact with all levels of linguistic cues and play a crucial role in text comprehension.

A growing number of scholars across disciplines, from theoretical linguistics to empirical computational studies, are researching discourse relations to improve discourse relation recognition in Natural Language Processing (NLP).

Discourse relations can be overtly encoded. Among all the possible means of overt encoding, connectives are the most frequently used. Connectives here refers to discourse connectives such as “and”, “because”, and “since”. These connectives are ubiquitous in sentences and signal the relatedness between text spans. However, discourse relations become “implicit” when text spans lack such overt discourse connectives, which makes it difficult to identify the discourse relation holding between the relevant text spans. This thesis takes on that challenge and attempts to show how linguistically informed cues can help detect implicit discourse relations when connectives are missing.

The Penn Discourse Treebank (PDTB) has become one of the most widely studied resources for discourse relations since its first launch in 2007 (Prasad et al., 2007). PDTB annotates discourse relations based on discourse connectives, and it differentiates Explicit from Implicit relations according to whether a discourse connective appears in the discourse. This study focuses on implicit discourse relations in PDTB 2.0 to ensure that the findings are comparable to previous NLP research and to show that linguistically intelligent systems can make a difference.

This thesis identifies the linguistic features that distinguish the four implicit discourse relations defined by the PDTB system. The four discourse relations, or “senses,” are (1) the Temporal relation, for time or sequential actions; (2) the Contingency relation, for events with causal chains; (3) the Comparison relation, for contrasting cases; and (4) the Expansion relation, for cases with further explanations and developments.

The current research adopts a corpus-driven approach to study the empirical data and to unveil the underlying patterns in discourse relations that are expressed implicitly (without a discourse connective), and it complements this linguistic analysis with an NLP experiment for further verification. It has identified 21 linguistic markers for the four level-1 relations of contingency, expansion, temporal, and comparison. The experimental results show that overt linguistic resources (traceable markers) can denote discourse relations in text spans. The main findings are as follows: (1) word pairs, main verbs, and attributes are the three most important ties for the contingency relation; (2) thematic progression, polarity words, and attributes are critical for the expansion relation; (3) word pairs, lexical negation, main verbs, and polarity words are indicative of the comparison relation; and (4) thematic progression, syntactic structures, and tense and aspect markers are frequently used clues for the temporal relation. This study consolidates the research on the usefulness of overt discourse markers in discourse relation construction and further identifies the relation-specific ties. To further verify the effectiveness of linguistic information in an NLP system, the study hand-annotated implicit temporal cases and engineered the annotations as a custom feature in the classifier, which significantly improved recognition accuracy and yielded a state-of-the-art result, with an accuracy rate of 45%.
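As a rough illustration of how hand-crafted linguistic cues can feed a classifier of the kind listed under the research areas (an averaged perceptron), the following minimal Python sketch may help. It is not the thesis's system: the function `extract_features`, the cross-argument word-pair and negation features, and the three toy training examples are all hypothetical stand-ins chosen for illustration, and they do not reproduce the 21 markers or the PDTB data.

```python
from collections import defaultdict

NEGATION_WORDS = {"not", "no", "never"}  # assumed toy list, not from the thesis


def extract_features(arg1, arg2):
    """Hypothetical extractor: cross-argument word pairs plus a negation flag."""
    words1, words2 = arg1.lower().split(), arg2.lower().split()
    feats = {f"pair={w1}|{w2}" for w1 in words1 for w2 in words2}
    if NEGATION_WORDS & set(words2):
        feats.add("arg2_negation")
    return feats


class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over all steps."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)  # (feature, label) -> current weight
        self.totals = defaultdict(float)   # accumulated weight over time
        self.stamps = defaultdict(int)     # step at which a weight last changed
        self.step = 0

    def predict(self, feats):
        # Ties are broken in favour of the label listed first.
        return max(
            self.labels,
            key=lambda y: sum(self.weights.get((f, y), 0.0) for f in feats),
        )

    def _update(self, key, delta):
        # Credit the old weight for the steps it was in force, then change it.
        self.totals[key] += (self.step - self.stamps[key]) * self.weights[key]
        self.stamps[key] = self.step
        self.weights[key] += delta

    def train(self, examples, epochs=5):
        """Train once on (feature_set, gold_label) pairs, then average."""
        for _ in range(epochs):
            for feats, gold in examples:
                self.step += 1
                guess = self.predict(feats)
                if guess != gold:
                    for f in feats:
                        self._update((f, gold), 1.0)
                        self._update((f, guess), -1.0)
        for key in list(self.weights):
            total = self.totals[key] + (self.step - self.stamps[key]) * self.weights[key]
            self.weights[key] = total / self.step


# Invented toy examples; the labels are the four PDTB level-1 senses.
toy_data = [
    ("sales rose", "profits did not", "Comparison"),
    ("the road flooded", "traffic stopped", "Contingency"),
    ("she cooked dinner", "they ate", "Temporal"),
]
model = AveragedPerceptron(["Temporal", "Contingency", "Comparison", "Expansion"])
model.train([(extract_features(a1, a2), sense) for a1, a2, sense in toy_data])
print(model.predict(extract_features("sales rose", "profits did not")))
```

In a sketch like this, each linguistic marker becomes a named feature, so the learned weights can be inspected per relation, which is one reason simple feature-based models are often described as explainable.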

The findings are both theoretical and practical. First, the linguistic markers identified enhance the traditional study of lexical cohesion with empirical data and could be applied to many other areas, such as the study of second language writing. Second, the simply engineered NLP system offers NLP researchers a perspective on building straightforward and explainable systems. Finally, the linguistic annotation demonstrates the usefulness of linguistic information in NLP studies.

    Research areas

  • implicit discourse relation, PDTB 2.0, an averaged perceptron classifier