Abstract
Parallel corpora are among the most important resources for natural language processing, and a large volume of them can be mined from bilingual parallel web pages. This paper formulates a practical algorithm for recognizing parallel web pages based on the credibility of automatically discovered bilingual URL pairing patterns (or keys). We then extend it in two ways to find more parallel web pages: rescuing weak keys of low local credibility on the basis of their global credibility, and unearthing bilingual parallel deep web pages by applying strong keys of high global credibility. Furthermore, we detect more bilingual web sites according to their credibility, measured by their link relationships with the seed set of web sites in use, and we also use search engines to recognize bilingual web sites efficiently with only a small set of highly credible URL pairing patterns. To further improve recognition accuracy on top of these five methods, we calculate the cross-lingual similarity of candidate parallel web pages and filter out weak candidates with a threshold. A series of experiments confirms the effectiveness of our approaches.
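The idea of a URL pairing pattern can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the hard-coded `CANDIDATE_KEYS` (the paper discovers keys automatically), the helper names `pair_urls` and `local_credibility`, and the scoring formula, which simply measures the fraction of a site's URLs covered by pairs under a key. The abstract does not give the paper's actual credibility definition.

```python
import re

# Hypothetical language-marker pairs; the paper discovers such keys automatically.
CANDIDATE_KEYS = [("en", "zh"), ("eng", "chi"), ("english", "chinese")]


def pair_urls(urls, key):
    """Return (source, target) URL pairs that differ only by swapping key[0] for key[1]."""
    src, tgt = key
    url_set = set(urls)
    # Match the marker only as a whole token, not inside a longer alphanumeric run.
    pattern = re.compile(r"(?<![A-Za-z0-9])" + re.escape(src) + r"(?![A-Za-z0-9])")
    pairs = []
    for url in sorted(url_set):
        for m in pattern.finditer(url):
            candidate = url[:m.start()] + tgt + url[m.end():]
            if candidate in url_set:
                pairs.append((url, candidate))
                break  # one pairing per source URL is enough
    return pairs


def local_credibility(urls, key):
    """Assumed score: fraction of the site's distinct URLs covered by pairs under this key."""
    url_set = set(urls)
    if not url_set:
        return 0.0
    return 2 * len(pair_urls(urls, key)) / len(url_set)


if __name__ == "__main__":
    site = [
        "http://example.com/en/a.html",
        "http://example.com/zh/a.html",
        "http://example.com/en/b.html",
        "http://example.com/zh/b.html",
        "http://example.com/about.html",
    ]
    for key in CANDIDATE_KEYS:
        print(key, local_credibility(site, key))
```

Under this assumed score, a key that pairs most of a site's pages ranks as strong for that site, while a key that pairs few pages ranks as weak. Aggregating the same score across many sites would give a rough analogue of the global credibility mentioned in the abstract.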
| Translated title of the contribution | Detection of Parallel Web Pages Based on the Automatically Discovered Bilingual URL Pairing Patterns |
|---|---|
| Original language | Chinese (Simplified) |
| Pages (from-to) | 91-100 |
| Number of pages | 10 |
| Journal | 中文信息学报 |
| Volume | 32 |
| Issue number | 3 |
| Publication status | Published - 15 Mar 2018 |
Research Keywords
- 平行网页获取
- 平行语料库
- 双语URL匹配模式
- 双语文本挖掘
- parallel webpage mining
- parallel corpora
- bilingual URL pairing pattern
- bilingual text mining