SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content
Mao, Xianling ; Liu, Xiaobing ; Di, Nan ; Li, Xiaoming ; Yan, Hongfei
2011
Keywords: Deduplicate; Near Duplicate Detection; AF_SpotSigs; SizeSpotSigs; Information Retrieval; DOCUMENTS
Abstract (English): Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many Web-information-based applications. As a result, many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages, because the corresponding files contain a relatively large portion of noisy information, such as ads and website templates. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF_SpotSigs) that takes them into account, which could work 15% better than the state-of-the-art method. We then propose an algorithm (SizeSpotSigs) that takes the size of the page content into account and can handle both short and long Web pages. The contributions of this work are three-fold: 1) we provide an analysis of the relation between noise-content ratio and similarity, and propose two rules for making the methods work better; 2) based on the analysis, for Chinese, we propose 3 new features to improve effectiveness on short Web pages; 3) we present an algorithm named SizeSpotSigs for near-duplicate detection that considers the size of the core content of a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs over a demonstrative Mixer of manually assessed near-duplicate news articles, which includes both short and long Web pages.
http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000312259200044&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=8e1609b174ce4e31116a60747a720701
Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science, Theory & Methods; EI; CPCI-S(ISTP); 2
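Language: English
DOI: 10.1007/978-3-642-20841-6-44
Content type: Other
Source URL: [http://ir.pku.edu.cn/handle/20.500.11897/293148]
Collection: School of Information Science and Technology (信息科学技术学院)
Recommended citation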
GB/T 7714
Mao, Xianling,Liu, Xiaobing,Di, Nan,et al. SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content. 2011-01-01.