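Title: 搜索引擎作弊检测方法研究 (Research on Search Engine Spam Detection Methods)
Author: Geng Guanggang (耿光刚)
Degree type: Doctor of Engineering
Defense date: 2008-05-31
Degree-granting institution: Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor: Wang Chunheng (王春恒)
Keywords: search engine spam; content spam; link spam; machine learning; feature extraction
Alternative title: Study on Web Spam Detection Methods
Degree major: Pattern Recognition and Intelligent Systems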
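Chinese abstract (translated): Search engine spam, also called web spam, is the practice of misleading or deceiving a search engine so that a web page ranks higher in the retrieval results than it deserves. Rampant web spam has severely degraded the quality of search results and damaged the user's search experience, and it is widely regarded as one of the greatest challenges facing web search. Developing effective web spam detection methods is therefore a meaningful research topic.

Web spam takes forms such as content spam, link spam, and hiding spam, and it is diverse, covert, blended, and evolving. Most existing heuristic detection methods target one specific form of spam; their parameters are hard to tune, and spammers can easily exploit them. Detection methods based on statistical learning can respond to a new form of spam by adding or removing the corresponding features, which keeps the system effective without changing its structure and shows their advantage for spam detection. This thesis addresses search engine spam detection based on statistical learning and studies key problems such as feature extraction and classification strategy. The main contributions are:

1. A statistical-learning-based framework for search engine spam detection is proposed, and two detection strategies are designed on it. The first strategy extracts features from three perspectives (page content features, page-level link analysis features, and site-level link analysis features) and detects spam on the fused features. The second strategy is web spam detection based on two-stage feature extraction. Content statistics features and page-level link analysis features are extracted first and used for a preliminary detection. Features are then re-extracted from the spamicity predicted by the preliminary detection and from the website topology; the re-extracted features include clustering features, neighbor features, and propagation features. Detection is then performed on the features from both stages. The effectiveness of both strategies is verified on a standard dataset. Finally, the relationship between the two feature extraction strategies and their strengths and weaknesses are analyzed in depth, providing a useful reference for feature extraction in spam detection.

2. For the class imbalance problem in spam detection, Cost-Sensitive Learning (CSL) and Ensemble Random Under-Sampling (ERUS) are each applied to search engine spam detection and tested on two public datasets whose degrees of imbalance differ greatly. The experiments show that both strategies improve system performance to some extent and that the ensemble random under-sampling strategy proposed in this thesis is the more effective of the two.

3. Supervised web spam detection needs a large number of manually labeled samples. Based on an analysis of the topological dependence and clustering behavior exhibited by spam and non-spam sites, this thesis proposes two classes of semi-supervised spam detection methods based on web topology. The two classes incorporate the Self-training and Co-training algorithms respectively and comprise four semi-supervised learning algorithms: LS-training, Link-training, LCo-training, and Link-training2. Experiments on a standard dataset show that, with only a few labeled training samples, both classes of methods effectively exploit the web topological dependence between nodes and improve web spam detection performance.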
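The second-stage "neighbor features" in item 1 can be illustrated with a minimal Python sketch. The abstract does not define these features precisely, so the definition below (mean first-stage spamicity of a host's out-neighbors and in-neighbors) is an assumption, as are the function name `neighbor_features`, the dictionary input format, and the fallback to a host's own score when it has no scored neighbors.

```python
from collections import defaultdict
from statistics import mean


def neighbor_features(out_links, spamicity):
    """Re-extract features from first-stage predictions and the link graph.

    out_links: dict mapping host -> iterable of hosts it links to
               (hypothetical input format).
    spamicity: dict mapping host -> first-stage predicted spam score in [0, 1].
    Returns a dict mapping host -> feature dict.
    """
    # Build the reverse graph so in-neighbors can be looked up directly.
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)

    features = {}
    for host, own_score in spamicity.items():
        # Only neighbors that received a first-stage score contribute.
        outs = [spamicity[h] for h in out_links.get(host, ()) if h in spamicity]
        ins = [spamicity[h] for h in in_links.get(host, ()) if h in spamicity]
        features[host] = {
            "own": own_score,
            # Assumption: fall back to the host's own score when isolated.
            "out_mean": mean(outs) if outs else own_score,
            "in_mean": mean(ins) if ins else own_score,
        }
    return features


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": []}
    scores = {"a": 0.9, "b": 0.5, "c": 0.1}
    for host, feats in neighbor_features(graph, scores).items():
        print(host, feats)
```

The resulting values would be appended to the first-stage feature vector before the second classifier is trained; the clustering and propagation features mentioned in the abstract are not covered by this sketch.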
English abstract: Search engine spam, also called web spam, is the practice of manipulating the relevancy or prominence of resources indexed by a search engine without improving their utility to the viewer. In recent years web spam has become increasingly rampant and has greatly harmed the quality of search engine results, so developing efficient web spam detection algorithms is a promising research area. Web spam is usually classified into content spam, link spam, etc. Most existing heuristic-based web spam detection methods focus on one specific kind of spam, and their parameters are difficult to tune. Machine learning based web spam detection has demonstrated its superiority because it adapts easily to newly developed spam techniques. Aiming at fighting web spam effectively, the following research work has been carried out.

1. We first propose a machine learning based web spam detection framework; under this framework, two different detection strategies are designed. i) The first strategy extracts three kinds of features: content based features, hyperlink based features, and host level link analysis features. Machine learning based web spam detection is then performed on all of these features. ii) The second is a two-stage web spam detection strategy. The first-stage detection is carried out on the content and hyperlink based features, and the predicted spamicity is then used for feature re-extraction. In the second stage, detection is performed on the expanded feature space. We implement the proposed approach and demonstrate that it clearly improves web spam detection performance on the WEBSPAM-UK2006 benchmark. In addition, we further study the relation between the two methods and point out their strengths and weaknesses. The two detection methods provide a reference for feature extraction in web spam detection.

2. Based on the fact that reputable pages are much easier to obtain than spam ones on the Web, a cost-sensitive learning strategy and an ensemble random under-sampling classification strategy are adopted for web spam detection. Experiments on unbalanced public benchmarks show that both methods improve performance and that the latter outperforms the former.

3. Supervised web spam detection requires large amounts of labeled training data. However, labeled samples are more difficult, expensive, and time consuming to obtain than unlabeled ones. Based on the Web graph and two well-known bootstrapping methods, self-training and co-training, we propose several web topology based semi-supervised learning algorithms, including Link-training, LS-training, Link-training2, and LCo-training. Experiments with a few labeled samples on the standard WEBSPAM-UK2006 benchmark show that the proposed semi-supervised algorithms are effective.
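The ensemble random under-sampling idea in item 2 can be sketched as follows. This is a generic illustration and not the thesis's implementation: the abstract does not name a base classifier or a voting rule, so the nearest-centroid base learner, the majority vote, and all function names (`train_erus`, `predict_erus`, and the helpers) are assumptions chosen to keep the example self-contained.

```python
import random


def train_centroid(samples):
    """Base learner (assumption): one mean vector per class label."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the label whose centroid is nearest to x (squared Euclidean)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: dist(model[y]))


def train_erus(samples, n_models=11, seed=0):
    """Train n_models base learners, each on a class-balanced random subset."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append((x, y))
    minority_size = min(len(group) for group in by_label.values())
    models = []
    for _ in range(n_models):
        balanced = []
        for group in by_label.values():
            # Under-sample every class down to the minority class size.
            balanced.extend(rng.sample(group, minority_size))
        models.append(train_centroid(balanced))
    return models


def predict_erus(models, x):
    """Combine the base learners by majority vote."""
    votes = [predict_centroid(m, x) for m in models]
    return max(set(votes), key=votes.count)


if __name__ == "__main__":
    # Toy data: 20 non-spam samples (label 0) and 3 spam samples (label 1).
    normal = [([i * 0.1, (i % 3) * 0.1], 0) for i in range(20)]
    spam = [([5.0, 5.0], 1), ([5.5, 4.5], 1), ([4.5, 5.5], 1)]
    ensemble = train_erus(normal + spam, n_models=5)
    print(predict_erus(ensemble, [5.0, 5.0]), predict_erus(ensemble, [0.2, 0.1]))
```

Each base learner sees every minority-class sample but a different random slice of the majority class, so the ensemble as a whole uses more of the majority data than a single under-sampled classifier would.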
Language: Chinese
Other identifier: 200518014628054
Content type: Thesis
Source URL: http://ir.ia.ac.cn/handle/173211/6106
Collection: Graduates_Doctoral Dissertations
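Recommended citation
GB/T 7714
耿光刚. 搜索引擎作弊检测方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院. 2008.
(Geng Guanggang. Study on Web Spam Detection Methods [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences. 2008.)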