Mining and Analyzing the Text for Corporate Fraud Detection: An Investigation of Financial Statements and Social Media


Student thesis: Doctoral Thesis





Award date: 19 Dec 2016


As more companies go public, a growing number of financial frauds are being exposed. Falsification of financial information by public companies not only causes significant financial losses to shareholders at large but also results in a general loss of confidence in the capital market. There is an urgent need to detect and identify financial fraud, which is instrumental in ensuring a fair, open, and transparent financial market.

Conventional auditing practices, which focus almost exclusively on statistical analysis of structured financial or nonfinancial indicators in financial statements, perform poorly in the presence of misleading financial reports. Because much of the content of financial statements is textual, this dissertation taps the largely ignored textual content of financial statements for corporate fraud detection.

First, an integrated language model, which combines a statistical language model (SLM) with latent semantic analysis (LSA), is built to detect the strategic use of deceptive language in financial statements. By integrating the SLM into the LSA framework, the integrated model not only overcomes the SLM’s inability to capture long-span information but also extracts the semantic patterns that distinguish fraudulent financial statements from non-fraudulent ones. Four different modes of the integrated model are also studied and compared. When applied to assessing fraud risk in overseas-listed Chinese companies, the integrated model flags fraudulent financial statements with high accuracy.
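
To make the SLM component concrete, the following is a minimal sketch of classifying a text by comparing its likelihood under two class-specific bigram language models. It illustrates only the statistical language model side; the LSA integration and the four modes studied in the dissertation are not reproduced here. The class `BigramModel`, the add-one smoothing, and the toy sentences are all assumptions made for illustration, not the dissertation's implementation or data.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would need proper preprocessing.
    return text.lower().split()


class BigramModel:
    """Add-one smoothed bigram language model (illustrative assumption)."""

    def __init__(self, documents):
        self.context_counts = Counter()  # how often each word appears as a context
        self.bigram_counts = Counter()   # how often each (previous, current) pair appears
        self.vocab = set()
        for doc in documents:
            tokens = ["<s>"] + tokenize(doc)
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, text, vocab_size):
        # Sum of log P(current | previous) with add-one smoothing.
        tokens = ["<s>"] + tokenize(text)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, fraud_model, clean_model):
    # Shared vocabulary size (+1 slot for unseen words) keeps the two scores comparable.
    vocab_size = len(fraud_model.vocab | clean_model.vocab) + 1
    fraud_score = fraud_model.log_prob(text, vocab_size)
    clean_score = clean_model.log_prob(text, vocab_size)
    return "fraud" if fraud_score > clean_score else "non-fraud"


# Hypothetical toy training sentences, invented for this sketch.
fraud_model = BigramModel([
    "revenue growth was exceptional and unprecedented",
    "management expects exceptional unprecedented growth",
])
clean_model = BigramModel([
    "revenue declined due to lower demand",
    "management reports lower demand and higher costs",
])

print(classify("exceptional and unprecedented growth", fraud_model, clean_model))
print(classify("lower demand and higher costs", fraud_model, clean_model))
```

A bigram model only sees one word of context, which is the kind of short-span limitation the abstract says the LSA component is meant to address.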

Although dictionary-based and statistical methods exist in the literature, a systematic, holistic, and theoretically grounded framework for guiding the text analysis of financial statements is lacking. Grounded in Systemic Functional Linguistics (SFL) theory, this dissertation develops a text analytic framework for financial statement fraud detection. Seven information types (topics, opinions, emotions, modality, personal pronouns, writing style, and genres) are identified based on the ideational, interpersonal, and textual metafunctions in SFL. Under the text analytic framework, the Latent Dirichlet Allocation (LDA) algorithm, computational linguistics techniques, and the term frequency-inverse document frequency (TF-IDF) method are combined to extract both word-level and document-level features for all information types. All these features serve as the input to a linear Support Vector Machine (SVM) classifier. When applied to assessing fraud risk for 1610 firm-year samples from U.S.-listed companies, the analytic framework achieves an average prediction accuracy of 82.36% under ten-fold cross-validation, much better than baseline methods using financial ratios.
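
As an illustration of the word-level features mentioned above, here is a minimal sketch of TF-IDF weighting in plain Python. It uses one common variant (relative term frequency times the natural log of inverse document frequency); the abstract does not say which variant the dissertation uses, so the formula, the function name `tfidf`, and the toy documents are assumptions. The LDA, computational linguistics, and SVM stages are not shown.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (illustrative variant)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens)
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy documents, invented for this sketch.
docs = [
    "we believe revenue will grow",
    "we report revenue",
    "we believe we may grow",
]
vectors = tfidf(docs)
for vector in vectors:
    print(vector)
```

Under this variant a term that appears in every document (here "we") receives weight zero, while terms concentrated in few documents are weighted up. Such per-document weights are the kind of feature vector that could then be passed to a classifier.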

Moreover, this dissertation makes a first attempt to leverage the huge amount of user-generated content in financial social media for corporate fraud prediction. Drawing upon theories and methodologies of text mining and information retrieval, a new text analytic framework is proposed for decomposing unstructured social media content into word weight features, topic features, emotion-related ratios, and social network features. Using social media content collected prior to the time of fraud disclosure for 64 fraudulent firms and 64 matched non-fraudulent firms, fraud can be predicted with an average accuracy of 75.50% by a support vector machine (SVM) classifier under ten-fold cross-validation. This demonstrates that social media content has a leading effect on financial fraud disclosure. A probability-of-fraud indicator is also created within the SVM model to show how likely a firm is to be fraudulent or non-fraudulent. In addition, the proposed analytic framework performs much better than the baseline method using financial ratios on the same sample set. This indicates that social media features can supplement existing fraud detection methods.
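
The abstract does not say how the probability-of-fraud indicator is derived from the SVM. One common way to turn an SVM decision value into a probability is Platt-style sigmoid scaling, sketched below purely as an assumption. The function name `fraud_probability` and the parameter values `a` and `b` are hypothetical; in practice they would be fitted on held-out data rather than fixed by hand.

```python
import math


def fraud_probability(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value to P(fraud) with a sigmoid.

    Platt-style form: P = 1 / (1 + exp(a * f + b)).
    a and b are placeholder values here (assumption), not fitted parameters.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical decision values: positive means the fraud side of the margin.
for value in (-2.0, 0.0, 2.0):
    p_fraud = fraud_probability(value)
    print(value, round(p_fraud, 3), round(1.0 - p_fraud, 3))
```

With a negative `a`, larger decision values map to higher fraud probabilities, and the non-fraud probability is simply the complement.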

In summary, this dissertation develops three IT artifacts for corporate fraud detection: (1) a statistical method for classifying textual financial statements; (2) a theoretical framework for extracting useful features from textual financial statements; and (3) an analytic framework for decomposing unstructured social media content for corporate fraud detection. The findings of this dissertation will help financial governors, market regulators, and auditors predict and detect financial fraud, and will help protect the public’s investments.