Opinion Summarization and Subset Selection via Submodular Maximization
基於子模最大化的意見摘要和子集選擇
Student thesis: Doctoral Thesis
Detail(s)
Awarding Institution | |
---|---|
Award date | 22 Aug 2022 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(80d486a7-b7f3-42cd-aac9-5b9c9ff52f31).html |
---|---|
Abstract
With the growing popularity of online reviews, corpora now contain large numbers of short, opinion-bearing texts. Extracting the essential viewpoints from such a large volume of data is a critical challenge across opinion mining. Opinion subset selection aims to convey a subjective representation for opinion classification and regression by selecting a limited number of instances from the original dataset. Subset selection for sentiment analysis seeks to form a subset for sentiment classification across different sentiment domains and levels. Opinion summarization, in contrast, constructs subjective and succinct text summaries that capture the key thoughts and opinions in user-generated content. This thesis investigates the relationship between submodular maximization and three opinion mining tasks: opinion classification, regression, and summarization.
In the first part, we propose a submodular framework for opinion subset selection. The framework retrieves a small set of instances from the corpus to convey a subjective representation. To address the weak fine-grained opinion detection of conventional submodular subset selection, we first propose a topic-relevant filtering algorithm for selecting candidate documents. The candidate instances are then scored by a non-decreasing submodular function that exploits the embedded trivial opinion features. We further introduce an opinion-sensitive algorithm for optimizing submodular set functions with a greedy-heuristic lower bound. Experimental results on different context domains demonstrate that the proposed framework can distill the data while retaining substantial opinion features. The opinion-preserving training set can be compressed to 10% to 40% of its original size while still maintaining adequate performance on classification and regression tasks. A comparative study of the subsets' impact on these metrics supports the robustness of the framework across all common sentiment levels, namely positive, neutral, and negative.
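As a rough illustration of the kind of selection described in this part, the sketch below runs the standard greedy heuristic on a monotone submodular function under a size limit. It is not the thesis's algorithm: the facility-location objective, the toy similarity matrix `sim`, and the names `facility_location` and `greedy_select` are all assumptions made for the example. For monotone submodular functions under a cardinality constraint, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
from typing import Callable, FrozenSet, List, Sequence


def facility_location(sim: Sequence[Sequence[float]]) -> Callable[[FrozenSet[int]], float]:
    """Build f(S) = sum_i max_{j in S} sim[i][j] (hypothetical objective).

    With non-negative similarities this function is monotone and submodular.
    """
    def f(selected: FrozenSet[int]) -> float:
        if not selected:
            return 0.0
        return sum(max(row[j] for j in selected) for row in sim)
    return f


def greedy_select(f: Callable[[FrozenSet[int]], float], ground_size: int, k: int) -> List[int]:
    """Pick up to k items, each time adding the item with the largest marginal gain."""
    chosen: List[int] = []
    current = f(frozenset())
    for _ in range(min(k, ground_size)):
        best_gain, best_item = 0.0, None
        for j in range(ground_size):
            if j in chosen:
                continue
            gain = f(frozenset(chosen + [j])) - current
            if gain > best_gain:
                best_gain, best_item = gain, j
        if best_item is None:  # no remaining item improves the objective
            break
        chosen.append(best_item)
        current += best_gain
    return chosen


# Toy pairwise similarities between four reviews (assumed values).
# Reviews 0 and 1 are near-duplicates, as are reviews 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
f = facility_location(sim)
picked = greedy_select(f, ground_size=4, k=2)
print(picked, f(frozenset(picked)))
```

On this toy input the second pick comes from the other group of near-duplicates, which is the diminishing-returns behavior that makes submodular functions attractive for compressing a training set.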
The second part proposes SentiSS, a framework for selecting subsets for sentiment analysis. Because the evaluation is a classification problem, this work follows the three-part submodular objective formulation of part one, but it can be optimized as a plain greedy submodular maximization problem. The selected subsets perform well on corpora from various domains, including hotels, restaurants, food, and airlines. Our findings also reveal that relevance, diversity, and fine-grained sentiment are the three essential criteria to consider when selecting a sentiment subset. The experimental results show that the subsets perform robustly across the typical positive, neutral, and negative sentiment levels.
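To make the three criteria concrete, here is a minimal sketch of one way a three-part objective could be assembled from relevance, diversity, and sentiment terms. The actual SentiSS formulation is not given in this abstract, so every formula, weight, and name below (`three_part_objective`, `cluster_of`, `sentiment_of`) is an assumption: relevance is a plain sum, while diversity and sentiment use square roots of per-group totals, a common concave-over-modular construction that keeps the whole function monotone and submodular.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def three_part_objective(
    selected: Sequence[int],
    relevance: Sequence[float],
    cluster_of: Sequence[int],
    sentiment_of: Sequence[str],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Hypothetical objective: weighted relevance + diversity + sentiment terms."""
    w_rel, w_div, w_sent = weights

    # Relevance: modular sum of per-item relevance scores.
    rel = sum(relevance[i] for i in selected)

    # Diversity: sqrt of relevance mass per cluster rewards spreading over clusters.
    per_cluster: Dict[Hashable, float] = {}
    for i in selected:
        per_cluster[cluster_of[i]] = per_cluster.get(cluster_of[i], 0.0) + relevance[i]
    div = sum(math.sqrt(v) for v in per_cluster.values())

    # Sentiment: sqrt of item count per sentiment level rewards covering every level.
    per_level: Dict[Hashable, int] = {}
    for i in selected:
        per_level[sentiment_of[i]] = per_level.get(sentiment_of[i], 0) + 1
    sent = sum(math.sqrt(c) for c in per_level.values())

    return w_rel * rel + w_div * div + w_sent * sent


# Toy data for five reviews (assumed values).
relevance = [0.9, 0.8, 0.7, 0.6, 0.5]
cluster_of = [0, 0, 1, 1, 2]
sentiment_of = ["pos", "pos", "neg", "neu", "neg"]


def score(items: List[int]) -> float:
    return three_part_objective(items, relevance, cluster_of, sentiment_of)


# Diminishing returns: adding review 1 helps less once review 0 is already chosen.
gain_alone = score([1]) - score([])
gain_after_0 = score([0, 1]) - score([0])

# Plain greedy selection of three reviews.
picked: List[int] = []
for _ in range(3):
    candidates = [j for j in range(len(relevance)) if j not in picked]
    picked.append(max(candidates, key=lambda j: score(picked + [j])))
print(picked, [sentiment_of[i] for i in picked])
```

Under these assumed scores the greedy pass skips the second most relevant review, which shares a cluster and a sentiment level with the first pick, in favor of reviews that add a new cluster or sentiment level.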
In the third part, we develop an opinion summarization framework that outlines emotions and sentiments to generate a natural language representation from multiple documents. Topics generated automatically from short texts serve as part of the features for the submodular information measures. Together with document and opinion embeddings, these features are used by four submodular set functions to construct the summary. The functions reduce irrelevant sentiment and strengthen the relevance of the summary's opinions to the given topics. Notably, we propose two budget-constrained optimization algorithms with practical time complexity to maximize the utility of opinion summarization. We empirically evaluate the framework on the opinion summarization task with ROUGE and a newly proposed automatic evaluation metric, TOV. The results demonstrate state-of-the-art performance for the summarization framework and sufficient utility for the TOV metric. The framework development process also suggests a possible range for the balancing factor with respect to common opinion summarization requirements.
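The abstract does not spell out the two budget-constrained algorithms, so the following is only a generic sketch of a well-known approach to the same problem: cost-scaled greedy selection under a length budget, followed by a comparison against the best single sentence. The weighted concept-coverage objective, the toy sentences, and the names `coverage`, `budgeted_greedy`, and `scale` are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Set


def coverage(selected: Sequence[int], concepts: Sequence[Set[str]], weight: Dict[str, float]) -> float:
    """Weighted coverage of concepts; monotone and submodular (hypothetical objective)."""
    covered: Set[str] = set()
    for i in selected:
        covered |= concepts[i]
    return sum(weight[c] for c in covered)


def budgeted_greedy(
    concepts: Sequence[Set[str]],
    weight: Dict[str, float],
    cost: Sequence[float],
    budget: float,
    scale: float = 1.0,
) -> List[int]:
    """Cost-scaled greedy: repeatedly take the best gain / cost**scale that fits the budget."""
    chosen: List[int] = []
    spent = 0.0
    remaining = set(range(len(concepts)))
    while remaining:
        base = coverage(chosen, concepts, weight)
        best = max(
            remaining,
            key=lambda j: (coverage(chosen + [j], concepts, weight) - base) / cost[j] ** scale,
        )
        remaining.discard(best)  # each sentence is considered at most once
        gain = coverage(chosen + [best], concepts, weight) - base
        if gain > 0 and spent + cost[best] <= budget:
            chosen.append(best)
            spent += cost[best]

    # Guard against the ratio rule missing one long but very informative sentence.
    feasible = [j for j in range(len(concepts)) if cost[j] <= budget]
    if feasible:
        single = max(feasible, key=lambda j: coverage([j], concepts, weight))
        if coverage([single], concepts, weight) > coverage(chosen, concepts, weight):
            return [single]
    return chosen


# Toy sentences described by the opinion concepts they mention (assumed data).
concepts = [
    {"price", "staff", "location"},  # long sentence touching three concepts
    {"price"},
    {"staff", "clean"},
    {"location"},
]
weight = {"price": 3.0, "staff": 2.0, "location": 2.0, "clean": 1.0}
cost = [12, 3, 5, 4]  # sentence lengths in words
summary = budgeted_greedy(concepts, weight, cost, budget=12)
print(summary, coverage(summary, concepts, weight))
```

Dividing the gain by the cost favors short sentences that add new concepts, and the final comparison keeps the result from being worse than the best single sentence that fits the budget.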
- Opinion Subset Selection, Opinion Summarization, Submodular Maximization, PhD Thesis, Engineering