Usage Patterns of Two-character Modals in Modern Chinese: A Corpus-Based Quantitative Study

Student thesis: Doctoral Thesis

Abstract

Chinese modals exhibit distinct grammatical behaviors and semantic complexities (Zhu, 1982; Li & Thompson, 1981), presenting significant challenges in both linguistic theory and computational applications (Morante & Sporleder, 2012). This thesis explores the intricate linguistic domain of modality, with a specific focus on two-character Chinese modals. It employs a usage-based constructional approach that emphasizes both frequently occurring patterns in language use and the principle of form-meaning mapping across various linguistic units (Goldberg, 1995, 2006; Cappelle & Depraetere, 2016a, 2016b; Trousdale, 2016; Wärnsby, 2016). The research poses three questions, concerning the frequent pattern [Mod + Verb], the contextual features of modals, and the probabilistic representations of modals.

Methodologically, the thesis draws on a self-built corpus of 2.5 billion Chinese characters compiled from media data (Xu, 2019). Seven modals were selected for their high frequency of occurrence and their representational significance in the Chinese modal system (Cui, 2002; Zhu, 2005; Peng, 2007: 82-163; Yang, 2017: 21): 可能 kěnéng ‘possible, may’, 可以 kěyǐ ‘can’, 应当 yīngdāng ‘should’, 应该 yīnggāi ‘should’, 必须 bìxū ‘must’, 能够 nénggòu ‘be able to’, and 必然 bìrán ‘inevitably; must be’. The thesis aims to uncover the underlying form-meaning mechanisms of these modals by addressing the three research questions in three studies. Study 1 (Chapter 4) uses Collostructional Analysis (Gries & Stefanowitsch, 2004a, 2004b, 2010; Stefanowitsch & Gries, 2003, 2005; Stefanowitsch & Flach, 2020; Stefanowitsch, 2013) to analyze the collexemes attracted to different modals; Study 2 (Chapter 5) adopts the Behavioral Profile (BP) approach (Gries, 2006; Divjak & Gries, 2009; Gries & Divjak, 2009) to explore the interplay among different contextual variables; and Study 3 (Chapter 5) employs token-based Semantic Vector Space Models (VSMs) (Heylen et al., 2022, 2015; De Pascale, 2019; Hilpert & Correia Saavedra, 2020; Hilpert & Flach, 2021) to represent modals and analyze the semantic relations among them.

Study 1 (Chapter 4) groups the Chinese modals into three primary clusters through an in-depth collexeme analysis. Using Multiple Distinctive Collexeme Analysis (MDCA), a subtype of Collostructional Analysis, the study sorts the modals into three groups: epistemic (kěnéng and bìrán), dynamic (kěyǐ and nénggòu), and deontic (bìxū, yīngdāng, and yīnggāi), with yīnggāi being relatively underrepresented. Each group aligns with its respective semantic domain. Epistemic modals predominantly collocate with verbs related to relation, occurrence, and causation, emphasizing the likelihood of a proposition. In contrast, dynamic modals attract verbs denoting mental activities and physical actions, reflecting subjective decision-making and potential capabilities. Deontic modals are closely associated with verbs that convey social obligations and ethical standards, highlighting their role in expressing societal norms and personal duties. The semantic groups are further validated through Correspondence Analysis and Hierarchical Clustering on Principal Components, which affirm the initial categorization by revealing coherent thematic relationships and distinct semantic properties among the modals.
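
To make the collexeme scoring concrete, the following is a minimal sketch of how an MDCA-style score can be computed. It assumes one common formulation (an exact binomial test per verb-modal pair, reported as a signed -log10 p-value); the abstract does not state which association measure the thesis uses. The verbs, the reduced set of three modals, and all counts are hypothetical toy values, not thesis data.

```python
import math

# Hypothetical toy counts of verbs following each modal (NOT thesis data).
COUNTS = {
    "fasheng (happen)": {"keneng": 18, "keyi": 1, "bixu": 1},
    "zunshou (obey)": {"keneng": 1, "keyi": 3, "bixu": 16},
}


def binomial_tail(k, n, p, upper):
    """Exact P(X >= k) if upper, else P(X <= k), for X ~ Binomial(n, p)."""
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in ks)


def mdca(counts):
    """Signed -log10(p) per verb and modal: positive = attracted, negative = repelled."""
    modals = sorted({m for row in counts.values() for m in row})
    grand_total = sum(sum(row.values()) for row in counts.values())
    # Share of all [Mod + Verb] tokens that belong to each modal.
    share = {
        m: sum(row.get(m, 0) for row in counts.values()) / grand_total
        for m in modals
    }
    scores = {}
    for verb, row in counts.items():
        verb_total = sum(row.values())
        scores[verb] = {}
        for m in modals:
            observed = row.get(m, 0)
            attracted = observed > verb_total * share[m]
            # Assumption: one-tailed exact binomial test in the observed direction.
            p = min(binomial_tail(observed, verb_total, share[m], attracted), 1.0)
            strength = -math.log10(p)
            scores[verb][m] = strength if attracted else -strength
    return scores


SCORES = mdca(COUNTS)
for verb, row in SCORES.items():
    print(verb, {m: round(s, 2) for m, s in row.items()})
```

In this toy table the occurrence verb scores positively with kěnéng and negatively with bìxū, mirroring the kind of contrast the study reports. With corpus-scale counts the exact tail sum can underflow to zero, so a log-space computation would be needed instead.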

Study 2 (Chapter 5) examines the multivariate behaviors of the seven Chinese modals and their form-meaning mappings within a network of constructions that incorporates symbolic, syntagmatic, and pragmatic associations. Using hierarchical clustering based on eight syntagmatic variables, the study categorizes the modals into three distinct groups, corresponding to epistemic (kěnéng, bìrán, yīnggāi), deontic (bìxū, yīngdāng), and dynamic (kěyǐ, nénggòu) modalities. The classification is reinforced by lexical features derived from classification models and by thematic analysis from topic modeling, which validate the clusters by highlighting unique lexical features and semantic themes for each group. The analysis also clarifies the semantic orientations of the modals according to the three-way modality types. Cluster 1 (epistemic) primarily expresses notions of knowledge and belief, particularly in contexts related to business development and economic dynamics. Cluster 2 (deontic) is prevalent in discussions of regulation, requirements, and legal frameworks. Cluster 3 (dynamic) focuses on internal or inherent abilities and situational capabilities, often in scenarios involving human empowerment, social engagement, and corporate potential.
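
The clustering step of a Behavioral Profile analysis can be sketched as follows. Each modal is represented by a vector of relative frequencies over contextual features, and the vectors are merged bottom-up. The distance measure (Canberra) and linkage (average) are assumptions chosen for illustration, since the abstract does not specify them; the three features and all proportions are hypothetical toy values rather than the eight variables used in the thesis.

```python
# Hypothetical behavioral profiles: proportions over three made-up
# contextual features per modal (NOT thesis data).
PROFILES = {
    "keneng": [0.70, 0.20, 0.10],
    "biran": [0.65, 0.25, 0.10],
    "bixu": [0.10, 0.30, 0.60],
    "yingdang": [0.15, 0.25, 0.60],
    "keyi": [0.20, 0.70, 0.10],
    "nenggou": [0.25, 0.65, 0.10],
}


def canberra(u, v):
    """Canberra distance; dimensions where both values are zero are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        # Average pairwise distance between members of the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(canberra(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


CLUSTERS = average_linkage(PROFILES, n_clusters=3)
print(CLUSTERS)
```

Because the toy profiles were constructed in three obvious pairs, the sketch recovers three two-member groups; real profiles are far less clean, which is why the study cross-checks its clusters against classification models and topic modeling.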

Study 3 (Chapter 5) further analyzes the modals by representing them as high-dimensional semantic vectors in token-based VSMs. The idea behind Vector Space Models (VSMs) is that the meaning of a word can be deduced from its usage or context. This approach marks a shift from frequency-based to meaning-based interpretations by using weighted co-occurrence matrices of collocated words to form word vectors (Turney & Pantel, 2010). This chapter constructs type-based vectors from first-order collocates and then uses the collocates of collocates, or second-order collocates, to compute the token-based vectors. Using supervised classification models, the study achieves high accuracy in differentiating modals by their token-based semantic vectors. This shows that each modal has a distinctive profile despite similarities in usage patterns and semantic overlaps. Sparse Principal Component Analysis is employed to reduce the dimensionality of the modal vectors, while K-means clustering is used to group these vectors into four distinct semantic clusters on a two-dimensional graph. These clusters include an epistemic group (kěnéng and bìrán), a deontic group (yīngdāng), a dynamic group (kěyǐ and nénggòu), and a weak deontic or polysemous group (bìxū and yīnggāi). The distributional tendencies indicate contextual similarities in reduced dimensions. These findings suggest that the three-way modality plays a significant but not decisive role in shaping the contextual environment, while semantic preferences and the grammaticalization process also influence modal choices.
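
The step from type-based to token-based vectors can be illustrated with a minimal sketch. It assumes the simplest variant, in which the vector for one occurrence of a modal is the unweighted average of the type vectors of the words in its context; the thesis may weight or select context words differently. The context words, dimensions, and values below are hypothetical toy data.

```python
import math

# Hypothetical type-level vectors for context words, standing in for rows of
# a weighted co-occurrence matrix (toy numbers, NOT thesis data).
TYPE_VECTORS = {
    "rain": [0.9, 0.1, 0.0],
    "happen": [0.8, 0.2, 0.1],
    "obey": [0.1, 0.9, 0.2],
    "law": [0.0, 0.8, 0.3],
}


def token_vector(context_words, type_vectors):
    """Average the type vectors of the known context words around one token."""
    known = [type_vectors[w] for w in context_words if w in type_vectors]
    if not known:
        raise ValueError("no context word has a type vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Three hypothetical occurrences of modals, each described by its context words.
TOKEN_A = token_vector(["rain", "happen"], TYPE_VECTORS)
TOKEN_B = token_vector(["happen", "unseen-word"], TYPE_VECTORS)
TOKEN_C = token_vector(["obey", "law"], TYPE_VECTORS)

SIM_AB = cosine(TOKEN_A, TOKEN_B)
SIM_AC = cosine(TOKEN_A, TOKEN_C)
print(round(SIM_AB, 3), round(SIM_AC, 3))
```

Tokens whose contexts share collocational behavior end up close in the vector space even when they share no context word directly, which is what allows individual occurrences to be classified or clustered downstream.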

In summary, the three studies in this thesis reveal distinct usage patterns among the modals studied, aligning roughly with the three-way modality types. Statistically, the distinction between epistemic and non-epistemic modalities is more pronounced in usage-based features than the distinction between possibility and necessity. Study 1 provides a detailed analysis of the modal-verb pattern, showing how each modality interacts with post-modal verbs. Study 2 illustrates that Chinese modals are part of a network of constructions, supported by empirical evidence in symbolic, syntagmatic, and pragmatic aspects. Study 3 demonstrates that, although some modals share usage patterns and overlap semantically, each modal possesses a distinctive profile that can be accurately differentiated using second-order collocates. The thesis highlights usage variations and semantic distinctions of modals, providing much-needed empirical evidence based on distributional features. By integrating traditional linguistic analyses with statistical and machine learning methods, the thesis deepens the theoretical understanding of Chinese modals and modality types. This approach, focused on distributional features, also supports the development of a robust and practical methodological framework for studying lexical semantics.

Date of Award: 30 Aug 2024
Original language: English
Awarding Institution: City University of Hong Kong
Supervisor: Meichun LIU (Supervisor) & Chun Yu KIT (Co-supervisor)
