Project Details
Description
In recent years, the number of plagiarism cases has been increasing, and the protection of
intellectual property has become a major public concern. To win the battle against
plagiarists, it is important to develop a fully automated system for detecting plagiarism.
However, automatic plagiarism detection in natural language texts is difficult because of
the complexity of natural language. Moreover, plagiarism is no longer limited to a single
language, which makes detection even more challenging. Although monolingual plagiarism
detection techniques have been developed, no cross-lingual plagiarism detection technique
is available. Because the Chief Executive of Hong Kong has selected technology and the
creative industries as focus areas for development, and because Hong Kong is a bilingual
global city, it is very important to develop related techniques for intellectual property
protection. In this project, the researchers will develop a cross-lingual plagiarism
detection technique.
Traditionally, automatic plagiarism detection has relied on string matching of texts. This
project aims to develop a plagiarism detection technique based on latent semantic analysis
of texts. To overcome the language barrier, a system will be developed to automatically
construct a cross-lingual thesaurus by text-mining a large corpus of parallel documents in
two languages. This cross-lingual thesaurus is expected to become an important tool in
cross-lingual information retrieval. Moreover, an algorithm will be designed to measure
the similarity of documents across languages. Given a document in one language and a
collection of documents in another language, the algorithm will identify a set of concepts
by semantic analysis. Each document will be represented by a vector of concepts, and the
similarity between documents will then be measured by latent semantic analysis of the
concept vectors. Because the volume of online documents is growing rapidly, the
researchers will build a large document database by crawling the web and collecting
documents from other sources, for example through collaboration with information content
providers. Based on the techniques developed in this project, the researchers will develop
a cross-lingual plagiarism detection system. These novel techniques will provide a
powerful tool for cross-lingual plagiarism detection. Some commercial companies already
provide plagiarism detection services to the public, so the proposed technology is
expected to be well received by industry.
The system can be utilized to protect the intellectual property of an organization.
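
The idea of representing documents as concept vectors and comparing them across languages can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the hand-written `THESAURUS`, the concept IDs, and the function names are hypothetical, whereas the project would mine the thesaurus from a parallel corpus. The sketch also substitutes plain cosine similarity over concept counts for full latent semantic analysis, which would additionally apply a dimensionality reduction such as singular value decomposition to a term-document matrix.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus mapping words in either language to a shared
# concept ID. The project would derive this from parallel documents; here it
# is hand-written purely for illustration.
THESAURUS = {
    "plagiarism": "C_PLAGIARISM", "抄襲": "C_PLAGIARISM",
    "detection": "C_DETECTION", "檢測": "C_DETECTION",
    "language": "C_LANGUAGE", "語言": "C_LANGUAGE",
    "weather": "C_WEATHER", "天氣": "C_WEATHER",
}


def concept_vector(tokens):
    """Count the concepts found in a tokenized document; unknown words are skipped."""
    return Counter(THESAURUS[t] for t in tokens if t in THESAURUS)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse concept-count vectors."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no known concepts matches nothing
    return dot / (norm_a * norm_b)


def rank_candidates(query_tokens, candidates):
    """Rank candidate documents (name -> tokens) by similarity to the query."""
    query = concept_vector(query_tokens)
    scored = [
        (name, cosine_similarity(query, concept_vector(tokens)))
        for name, tokens in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Usage: an English query document against two Chinese candidate documents.
query = ["plagiarism", "detection", "language"]
candidates = {
    "doc_zh_1": ["抄襲", "檢測", "語言"],
    "doc_zh_2": ["天氣"],
}
ranked = rank_candidates(query, candidates)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Because both documents are projected into the same concept space before comparison, no direct string matching between the two languages is needed. In a realistic setting the tokenization, the thesaurus coverage, and a similarity threshold for flagging a suspected case would all need to be chosen and evaluated.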
| Field | Value |
|---|---|
| Project number | 9041426 |
| Grant type | GRF |
| Status | Finished |
| Effective start/end date | 1/01/09 → 10/02/10 |