Learning block importance models for Web pages

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

Previous work shows that a web page can be partitioned into multiple segments or blocks, and that the blocks in a page are often not equally important. It has also been shown that separating noisy or unimportant blocks from the rest of a page can facilitate web mining, search and accessibility. However, no uniform approach or model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view of the importance of blocks in web pages. In this paper, we investigate how to find a model that automatically assigns importance values to the blocks in a web page. We define block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model that assigns importance to the different segments in the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
Copyright is held by the author/owner(s).
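
The pipeline described in the abstract (one feature vector per block, then a learned model that assigns importance) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Block` fields, the particular normalisation, the importance labels, and the toy training data are hypothetical, and the nearest-centroid classifier is only a stand-in, since the abstract does not name the learning algorithms used. Segmentation is assumed to have been done already.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Block:
    """One segment of a page; all field names are hypothetical."""
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float
    num_images: int
    num_links: int


def feature_vector(block, page_width, page_height):
    """Spatial features normalised by page size, plus raw content counts."""
    return [
        (block.x + block.width / 2) / page_width,    # horizontal centre
        (block.y + block.height / 2) / page_height,  # vertical centre
        (block.width * block.height) / (page_width * page_height),  # relative area
        float(block.num_images),
        float(block.num_links),
    ]


def train_centroids(vectors, labels):
    """Average the feature vectors of each importance level."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the importance level whose centroid is closest to vec."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


# Toy, made-up training data: (block, importance level) pairs.
PAGE_W, PAGE_H = 1000.0, 2000.0
training = [
    (Block(200, 300, 600, 1200, 2, 5), 3),   # large central block
    (Block(0, 0, 1000, 100, 1, 12), 1),      # top banner
    (Block(0, 1900, 1000, 100, 0, 15), 1),   # footer
]
vectors = [feature_vector(b, PAGE_W, PAGE_H) for b, _ in training]
labels = [level for _, level in training]
centroids = train_centroids(vectors, labels)

# An unseen block that resembles the central one.
query = feature_vector(Block(250, 400, 500, 1000, 1, 4), PAGE_W, PAGE_H)
print(predict(centroids, query))
```

In this sketch the raw image and link counts are left unscaled, so they dominate the distance; a real system would likely scale the features and use a stronger learner.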
Original language: English
Title of host publication: Thirteenth International World Wide Web Conference Proceedings, WWW2004
Publisher: Association for Computing Machinery
Pages: 203-211
ISBN (Print): 158113844, 9781581138443
Publication status: Published - 2004
Externally published: Yes
Event: Thirteenth International World Wide Web Conference Proceedings, WWW2004 - New York, NY, United States
Duration: 17 May 2004 to 22 May 2004

Publication series

Name: Thirteenth International World Wide Web Conference Proceedings, WWW2004

Conference

Conference: Thirteenth International World Wide Web Conference Proceedings, WWW2004
Place: United States
City: New York, NY
Period: 17/05/04 to 22/05/04

Research Keywords

  • Block importance model
  • Classification
  • Page segmentation
  • Web mining
