基于双语 URL 匹配模式可信度的平行网页识别研究

Translated title of the contribution: Detection of Parallel Web Pages Based on the Automatically Discovered Bilingual URL Pairing Patterns

章成志, 马舒天, 揭春雨, 姚旭晨

Research output: Journal Publications and Reviews (RGC 21 - Publication in refereed journal), peer-review

Abstract

Parallel corpora are among the most important resources for natural language processing, and a large volume of them can be mined from bilingual parallel web pages. This paper formulates a practical algorithm for recognizing parallel web pages based on the credibility of automatically discovered bilingual URL pairing patterns (or keys). It then extends the algorithm in two ways to find more parallel web pages: rescuing weak keys of low local credibility on the strength of their global credibility, and unearthing bilingual parallel deep web pages by applying strong keys of high global credibility. Furthermore, we detect more bilingual web sites according to their credibility in terms of their link relationship with the seed set of web sites in use, and we use search engines to recognize bilingual web sites efficiently with only a small set of highly credible URL pairing patterns. To further improve recognition accuracy on top of these five methods, we calculate the cross-lingual similarity of candidate parallel web pages and filter out weak candidates with a threshold. A series of experiments confirms the effectiveness of these approaches.
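As a rough illustration of what a bilingual URL pairing pattern (key) does, the sketch below pairs URLs on one site that differ only by a language token. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the example URLs, and the simple matched-fraction score used as a stand-in for "local credibility" are all hypothetical, since the abstract does not define how credibility is computed.

```python
def pair_urls(urls, key):
    """Pair URLs that become another known URL when the source token
    of `key` is swapped for its target token (first occurrence only)."""
    src_tok, tgt_tok = key
    known = set(urls)
    pairs = []
    for url in sorted(known):
        if src_tok in url:
            candidate = url.replace(src_tok, tgt_tok, 1)
            if candidate != url and candidate in known:
                pairs.append((url, candidate))
    return pairs


def local_credibility(urls, key):
    """Hypothetical score (an assumption, not the paper's definition):
    the fraction of URLs containing the source token that find a partner."""
    src_tok, _ = key
    with_src = [u for u in set(urls) if src_tok in u]
    if not with_src:
        return 0.0
    return len(pair_urls(urls, key)) / len(with_src)


# Hypothetical crawl of a single bilingual site.
urls = [
    "http://example.org/en/news/1.html",
    "http://example.org/zh/news/1.html",
    "http://example.org/en/news/2.html",
    "http://example.org/zh/news/2.html",
    "http://example.org/en/about.html",  # no Chinese counterpart
]
key = ("/en/", "/zh/")

for src, tgt in pair_urls(urls, key):
    print(src, "<->", tgt)
print(round(local_credibility(urls, key), 3))
```

In this toy setting, a key that matches only a small fraction of pages on one site would score low locally. The abstract's idea of rescuing such weak keys would correspond to also consulting a score aggregated across many sites, which is not sketched here.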
Original language: Chinese (Simplified)
Pages (from-to): 91-100
Number of pages: 10
Journal: 中文信息学报 (Journal of Chinese Information Processing)
Volume: 32
Issue number: 3
Publication status: Published - 15 Mar 2018

Research Keywords

  • 平行网页获取
  • 平行语料库
  • 双语URL匹配模式
  • 双语文本挖掘
  • parallel webpage mining
  • parallel corpora
  • bilingual URL pairing pattern
  • bilingual text mining
