Automatic text analysis using rhetorical structure theory with application for information search and retrieval
Student thesis: Doctoral Thesis
In addition to the typical methods employed in information retrieval systems, such as calculating keyword frequency and keyword-based pattern matching, this research project proposes an approach to information search and retrieval based not only on the Dublin Core Metadata Element Set (DCMES), which represents the content or bibliographical information of the data, but also on linguistic information about the rhetorical structure of the text. This rhetorical structure information can be inferred from linguistic cues identified in the text. Both types of information are encoded as rules and facts in F-Logic (Frame Logic). The cues and criteria for identifying rhetorical structure information are based on those developed by Corston-Oliver (1998). The text base consists of abstracts of linguistics journal articles drawn from a collection of over three hundred papers on Chinese linguistics, and includes abstracts from journals in both Chinese and English. Information retrieval is web-based. Besides offering search and retrieval, the application can be extended with a web interface through which authors or publishers submit their abstracts to the text base. Because the data in this research are linguistics abstracts, part of the research focuses on investigating and analyzing the text structure of the abstracts. Since the usual way of creating an abstract is to extract all the main ideas of the text being described, analyzing the structure of an abstract helps in determining the structure of the whole article on which it is based. By identifying the relations among the different spans in an abstract, one can infer the general structure of the whole article. In other words, investigating and analyzing the text structure of smaller-scale discourse, i.e. the abstracts, makes it possible to gain insight into that of larger-scale discourse, i.e. the papers. The research serves to further the development of ‘smart’ search facilities through the use of linguistic knowledge about the text. The approach assumes a correlation between the move structure and the rhetorical structure of texts, and the results of the research support the validity of this assumption.
- Information storage and retrieval systems, Data processing, Rhetoric, Discourse analysis