Abstract
People today rely increasingly on digital devices and the internet, which make daily life more convenient but also cause an information explosion. These technologies make it easier to read and write freely online and to communicate with one another. However, because the quality of web content cannot be guaranteed, we can drown in information and must think carefully about what to believe. In this thesis, we study credibility assessment of web content from two aspects: identifying problematic online investment platforms, and detecting news fraud on online news websites.
The number of P2P lending platforms has increased dramatically with the development of internet finance in China. At the same time, more problematic platforms have been exposed, raising public concern about the P2P market. In this thesis, we experiment on P2P platforms with a total of one hundred thousand transaction records. First, a feature extraction solution based on financial expertise transforms each platform’s data into a feature vector of fixed dimension. Then three classification models, Logistic Regression (LR), Decision Tree (DT), and Support Vector Machine (SVM), are applied to categorize the platforms into two types: high-risk and low-risk. Furthermore, to address the class imbalance problem, an ensemble learning method is used, which improves performance. We also compare the performance of each feature type and analyze the correlation among the features. Our proposed feature extraction algorithm and ensemble learning method are shown to be useful, and our further analysis of the features can provide effective suggestions for both investors and supervision systems to assess the credibility of investment platforms.
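The feature extraction step described above, which maps a platform's variable number of transaction records to a fixed-length vector, could look roughly like the following minimal sketch. The record fields (`amount`, `rate`, `term_months`) and the seven summary statistics are hypothetical choices made for illustration; the abstract does not list the actual features used in the thesis.

```python
from statistics import mean, pstdev


def platform_feature_vector(records):
    """Summarize a platform's transaction records as a fixed-length vector.

    `records` is a list of dicts with hypothetical keys 'amount', 'rate',
    and 'term_months'. The output always has 7 entries, regardless of how
    many records the platform has.
    """
    if not records:
        return [0.0] * 7
    amounts = [float(r["amount"]) for r in records]
    rates = [float(r["rate"]) for r in records]
    terms = [float(r["term_months"]) for r in records]
    total = sum(amounts)
    return [
        float(len(records)),                     # number of transactions
        mean(amounts),                           # average loan amount
        pstdev(amounts),                         # spread of loan amounts
        mean(rates),                             # average interest rate
        max(rates),                              # highest interest rate
        mean(terms),                             # average loan term
        max(amounts) / total if total else 0.0,  # share of the largest loan
    ]
```

Because every platform yields a vector of the same length, the vectors can be stacked into a matrix and passed to any standard classifier, whatever the number of transactions behind each one.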
News can spread quickly to the public with the development of the internet, but rumors may also propagate through the web because of the openness of the online environment. Fraud detection aims to measure the likelihood that a news article is fraudulent. We design source-based, content-based, and comment-based features and train classification models using ensemble learning methods. Source-based features are extracted from the characteristics of the news sources, while content-based and comment-based features are extracted from the linguistic characteristics of the article text and its comments. Ensemble learning methods are used to address the class imbalance problem. We experiment on a public data set that contains seventeen thousand news articles from three hundred and sixty different news websites. Our trained credibility model achieves good results, and as future work we are considering more representative features for long texts and unsupervised algorithms for news verification across multiple sources.
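The abstract does not say which ensemble method is used to handle class imbalance. One common option, shown here purely as an assumed illustration, is undersampling-based bagging: each base model is trained on all minority samples plus an equally sized random subset of the majority class, and the models vote. The nearest-centroid base learner and all function names below are hypothetical stand-ins, not the thesis's models.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid base learner: store the mean vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def fit_balanced_ensemble(X, y, n_models=5, seed=0):
    """Train each model on all minority rows plus a same-size majority sample.

    Assumes label 1 is the minority class and label 0 the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, t in enumerate(y) if t == 1]
    majority = [i for i, t in enumerate(y) if t == 0]
    models = []
    for _ in range(n_models):
        idx = minority + rng.sample(majority, len(minority))
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_ensemble(models, x):
    """Majority vote over the base models (labels are 0 or 1)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

Each base model sees a balanced training set, so no single model is dominated by the majority class, while the ensemble as a whole still draws on many different majority samples.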
On the one hand, the internet makes it convenient for people to read news, buy products, and publish viewpoints online. On the other hand, too much information can mislead people. As a result, people need to distinguish true from false and right from wrong, either by rational analysis or with technical tools.
| Date of Award | 22 Aug 2016 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Qing LI (Supervisor) |