Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection

Tommy W. S. Chow, M. K. M. Rahman

Research output: Journal Publications and Reviews; RGC 22 - Publication in policy or professional journal

59 Citations (Scopus)

Abstract

This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system using a multilayer self-organizing map (MLSOM). A document is modeled by a rich tree-structured representation, and a SOM-based system is used as a computationally effective solution. Instead of relying on keywords/lines, the proposed scheme uses a full document as the query for retrieval and PD. The tree-structured representation organizes document features hierarchically at the document, page, and paragraph levels. Thus, it can reflect underlying context that is difficult to acquire from the currently used word-frequency information. We show that the tree-structured data is effective for DR and PD. To handle the tree-structured representation efficiently, we use an MLSOM algorithm, which was previously developed by the authors for image retrieval. In this study, it serves as an effective clustering algorithm. Using the MLSOM, local matching techniques are developed for comparing text documents. Two novel MLSOM-based PD methods are proposed. Detailed simulations are conducted, and the experimental results corroborate that the proposed approach is computationally efficient and accurate for DR and PD. © 2009 IEEE.
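The document/page/paragraph hierarchy described in the abstract can be illustrated with a small sketch. This is not the authors' MLSOM method: the `Node` class, the raw word-count features, and the brute-force cosine comparison are all assumptions chosen for illustration, and the paper's actual features and SOM-based matching are not reproduced here. The sketch only shows how a tree keeps paragraph-level evidence that a single document-level word histogram would blur.

```python
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    """Hypothetical tree node; not taken from the paper."""
    level: str                  # "document", "page", or "paragraph"
    features: Counter           # word counts for the text under this node
    children: list = field(default_factory=list)


def word_counts(text):
    # Assumption: plain lowercase whitespace tokenization.
    return Counter(text.lower().split())


def build_tree(pages):
    """Build a three-level tree from a list of pages,
    where each page is a list of paragraph strings."""
    doc = Node("document", Counter())
    for paragraph_texts in pages:
        page = Node("page", Counter())
        for text in paragraph_texts:
            para = Node("paragraph", word_counts(text))
            page.children.append(para)
            page.features.update(para.features)   # aggregate counts upward
        doc.children.append(page)
        doc.features.update(page.features)
    return doc


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def paragraphs(doc):
    return [p for page in doc.children for p in page.children]


def best_paragraph_match(query, candidate):
    """Highest paragraph-to-paragraph similarity between two trees.
    Brute force stands in for the SOM-based local matching."""
    return max(
        (cosine(q.features, c.features)
         for q in paragraphs(query)
         for c in paragraphs(candidate)),
        default=0.0,
    )
```

Under these assumptions, a suspect document that copies one paragraph verbatim has a paragraph-level match of about 1.0 even when its document-level similarity to the source is well below that, which is the kind of local evidence a whole-document word histogram cannot expose.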
Original language: English
Pages (from-to): 1385-1402
Journal: IEEE Transactions on Neural Networks
Volume: 20
Issue number: 9
Publication status: Published - 2009

Research Keywords

  • Document retrieval (DR)
  • Multilayer self-organizing map (MLSOM)
  • Plagiarism detection (PD)
  • Tree-structured representation
