Detection of Cross-lingual Plagiarism Based on Latent Semantic Analysis

  • WANG, Fu Lee Philips (Principal Investigator / Project Coordinator)

Project: Research

Project Details

Description

In recent years, the number of plagiarism cases has been increasing, and the protection of intellectual property has become a major public concern. To win the battle against plagiarists, it is important to develop a fully automated system for detecting plagiarism. However, automatic plagiarism detection in natural language texts is difficult because of the complexity of natural language. Moreover, plagiarism is no longer limited to a single language, which makes detection even more challenging. Although monolingual plagiarism detection techniques have been developed, no cross-lingual plagiarism detection technique is available. Hong Kong is a bilingual global city, and the Chief Executive of Hong Kong has selected technology and the creative industries as focus areas for development, so it is very important to develop related techniques for intellectual property protection.
In this project, the researchers will develop a cross-lingual plagiarism detection technique. Traditionally, automatic plagiarism detection has relied on string matching of texts. This project instead aims to develop a plagiarism detection technique based on latent semantic analysis of texts. To overcome the language barrier, a system will be developed to construct a cross-lingual thesaurus automatically by text-mining a large corpus of parallel documents in two languages. This cross-lingual thesaurus should become an important tool in cross-lingual information retrieval. In addition, an algorithm will be designed to measure the similarity of documents across languages. Given a document in one language and a collection of documents in another language, the algorithm will identify a set of concepts by semantic analysis. Each document will be represented by a vector of concepts, and the similarity between documents will then be measured by latent semantic analysis of the concept vectors.
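As a rough illustration of the concept-vector idea described above, the following minimal Python sketch maps terms from two languages onto shared concept identifiers and compares documents by cosine similarity. It is not the project's algorithm: the thesaurus entries, concept labels, function names, and sample documents are all hypothetical, the thesaurus is hand-written rather than mined from a parallel corpus, and the dimensionality-reduction step of latent semantic analysis (a singular value decomposition of the term-document matrix) is omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: terms in either language map to a shared concept ID.
# In the project, such a thesaurus would be mined from a parallel corpus.
THESAURUS = {
    "plagiarism": "C_PLAGIARISM", "抄襲": "C_PLAGIARISM",
    "detection": "C_DETECTION", "檢測": "C_DETECTION",
    "language": "C_LANGUAGE", "語言": "C_LANGUAGE",
    "weather": "C_WEATHER", "天氣": "C_WEATHER",
}


def concept_vector(tokens):
    """Count the concepts in a tokenized document; unknown terms are skipped."""
    return Counter(THESAURUS[t] for t in tokens if t in THESAURUS)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse concept-count vectors."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_tokens, candidates):
    """Rank candidate documents (name -> tokens) by similarity to the query."""
    query_vec = concept_vector(query_tokens)
    scored = [
        (name, cosine_similarity(query_vec, concept_vector(tokens)))
        for name, tokens in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical example: an English query against two Chinese documents.
query = ["plagiarism", "detection", "language"]
candidates = {
    "doc_a": ["抄襲", "檢測", "語言"],
    "doc_b": ["天氣", "語言"],
}
ranking = rank_candidates(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup, a document that shares all of the query's concepts ranks above one that shares only a single concept, even though the two sides have no strings in common. That is the property string matching alone cannot provide.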
Because the volume of online documents is growing rapidly, the researchers will build a large document database by crawling the web and by collecting documents from other sources, for example through collaboration with information content providers. Based on the techniques developed in this project, the researchers will build a cross-lingual plagiarism detection system. These techniques are intended to provide a powerful tool for cross-lingual plagiarism detection. Some commercial companies already provide plagiarism detection services to the public, so the proposed technology is expected to be well received by the industry. The system can also be used to protect the intellectual property of an organization.
Project number: 9041426
Grant type: GRF
Status: Finished
Effective start/end date: 1/01/09 to 10/02/10
