TY - GEN
T1 - Incorporating site-level knowledge for incremental crawling of web forums
T2 - 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09
AU - Yang, Jiang-Ming
AU - Cai, Rui
AU - Wang, Chunsong
AU - Huang, Hua
AU - Zhang, Lei
AU - Ma, Wei-Ying
N1 - Publication details (e.g. title, author(s), publication statuses and dates) are captured on an “AS IS” and “AS AVAILABLE” basis at the time of record harvesting from the data source. Suggestions for further amendments or supplementary information can be sent to [email protected].
PY - 2009
Y1 - 2009
N2 - We study in this paper the problem of incremental crawling of web forums, which is a very fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting strategy of each individual page. However, simply assigning different weights for different individual pages is usually inefficient in crawling forum sites because of the different characteristics between forum sites and general websites. Instead of treating each individual page independently, we propose a list-wise strategy by taking into account the site-level knowledge. Such sitelevel knowledge is mined through reconstructing the linking structure, called sitemap, for a given forum site. With the sitemap, posts from the same thread but distributed on various pages can be concatenated according to their timestamps. After that, for each thread, we employ a regression model to predict the time when the next post arrives. Based on this model, we develop an efficient crawler which is 260% faster than some state-of-the-art methods in terms of fetching new generated content; and meanwhile our crawler also ensure a high coverage ratio. Experimental results show promising performance of Coverage, Bandwidth utilization,and Timeliness of our crawler on 18 various forums.Copyright 2009 ACM.
AB - We study in this paper the problem of incremental crawling of web forums, which is a very fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting strategy of each individual page. However, simply assigning different weights for different individual pages is usually inefficient in crawling forum sites because of the different characteristics between forum sites and general websites. Instead of treating each individual page independently, we propose a list-wise strategy by taking into account the site-level knowledge. Such sitelevel knowledge is mined through reconstructing the linking structure, called sitemap, for a given forum site. With the sitemap, posts from the same thread but distributed on various pages can be concatenated according to their timestamps. After that, for each thread, we employ a regression model to predict the time when the next post arrives. Based on this model, we develop an efficient crawler which is 260% faster than some state-of-the-art methods in terms of fetching new generated content; and meanwhile our crawler also ensure a high coverage ratio. Experimental results show promising performance of Coverage, Bandwidth utilization,and Timeliness of our crawler on 18 various forums.Copyright 2009 ACM.
KW - Incremental crawling
KW - Sitemap
KW - Web forum
UR - http://www.scopus.com/inward/record.url?scp=70350625344&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-70350625344&origin=recordpage
U2 - 10.1145/1557019.1557166
DO - 10.1145/1557019.1557166
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9781605584959
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 1375
EP - 1383
BT - KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Y2 - 28 June 2009 through 1 July 2009
ER -