Record matching over query results from multiple web databases

Weifeng Su, Jiying Wang, Frederick H. Lochovsky

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

Abstract

Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. These methods are not applicable for the Web database scenario, where the records to match are query results dynamically generated on-the-fly. Such records are query-dependent and a prelearned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the presumed nonduplicate records from the same source can be used as training examples alleviating the burden of users having to manually label training examples. Starting from the nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply. © 2010 IEEE.
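The iterative scheme outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `difflib`-based field similarity, the weighting heuristic, and the 0.85 threshold are all assumptions made for illustration, and a simple nearest-centroid rule stands in for the SVM classifier so that the example needs only the Python standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations, product

def sim_vector(r1, r2):
    """Per-field string similarities in [0, 1] for two records (tuples of strings)."""
    return [SequenceMatcher(None, a.lower(), b.lower()).ratio()
            for a, b in zip(r1, r2)]

def field_weights(non_dup_vectors, n_fields):
    """Assumed heuristic: a field that looks similar even across presumed
    non-duplicates is less discriminative, so it receives less weight."""
    if not non_dup_vectors:
        return [1.0 / n_fields] * n_fields
    raw = [1.0 - sum(v[i] for v in non_dup_vectors) / len(non_dup_vectors) + 1e-9
           for i in range(n_fields)]
    total = sum(raw)
    return [w / total for w in raw]

def centroid(vectors):
    """Component-wise mean of a non-empty list of similarity vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def match_records(source_a, source_b, threshold=0.85, max_rounds=5):
    """Return sorted (index_in_a, index_in_b) pairs judged to be duplicates."""
    n_fields = len(source_a[0])
    # Same-source pairs serve as presumed non-duplicate training examples.
    negatives = [sim_vector(x, y)
                 for src in (source_a, source_b)
                 for x, y in combinations(src, 2)]
    # Cross-source pairs are the potential duplicates.
    candidates = {(i, j): sim_vector(a, b)
                  for (i, a), (j, b) in product(enumerate(source_a),
                                                enumerate(source_b))}
    weights = field_weights(negatives, n_fields)
    duplicates = {}
    for _ in range(max_rounds):
        # Classifier 1: weighted sum of component similarities vs. a threshold.
        found = {k: v for k, v in candidates.items()
                 if sum(w * s for w, s in zip(weights, v)) >= threshold}
        # Classifier 2 (stand-in for the SVM): nearest centroid, trained on the
        # duplicates found so far and on the presumed non-duplicates.
        positives = list(duplicates.values()) + list(found.values())
        if positives and negatives:
            pos_c, neg_c = centroid(positives), centroid(negatives)
            for k, v in candidates.items():
                if k not in found and sq_dist(v, pos_c) < sq_dist(v, neg_c):
                    found[k] = v
        if not found:
            break  # no new duplicates: stop iterating
        duplicates.update(found)
        for k in found:
            del candidates[k]
    return sorted(duplicates)

# Hypothetical query results from two sources (title, year).
site_a = [("abcd", "111"), ("mnop", "222")]
site_b = [("abxy", "111"), ("mnop", "222")]
print(match_records(site_a, site_b))
```

In this toy input the second record of each source matches exactly and is caught by the weighted-sum step, while the first pair (identical year, partially matching title) falls below the threshold and is picked up only by the second classifier once a positive example exists. That hand-off between the two classifiers is the cooperation the abstract describes, shown here in a heavily simplified form.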
Original language: English
Pages (from-to): 578-589
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 22
Issue number: 4
Publication status: Published - 1 Apr 2010

Funding

This research was supported by the Research Grants Council of Hong Kong under grant HKUST6172/04E.

Research Keywords

  • Data deduplication
  • Data integration
  • Duplicate detection
  • Query result record
  • Record linkage
  • Record matching
  • SVM
  • Web database

RGC Funding Information

  • RGC-funded
