Research on Filtering Drug Information on the Internet


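Title	互联网毒品类信息过滤研究 (Research on Filtering Drug Information on the Internet)
Author	贺主 (He Zhu)
Degree type	Master of Engineering
Defense date	2010-06-03
Degree-granting institution	Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral	Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor	胡卫明 (Hu Weiming)
Keywords	Webpage filtering; Text classification; Machine Learning; Supervised Learning; Semi-supervised Learning; Adaboost; SVM; One-Class SVM
Alternative title	Research on Filtering Drug Information on the Internet
Degree program	Pattern Recognition and Intelligent Systems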
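Abstract (translated from Chinese)	This thesis addresses the problem of identifying drug-related information on the Internet. It studies several machine learning methods that are currently popular internationally, covering supervised learning, semi-supervised learning, and other areas, and applies them to real-world problems. The main contributions are as follows. First, we propose an algorithm for identifying prohibited-drug web pages in massive data sets. The number of pages in a search engine's index is enormous, and labeling enough training data by hand is infeasible, so traditional machine learning and text classification algorithms are hard to apply directly. We combine indexing with a divide-and-conquer strategy to design a multi-layer identification framework. Experiments show that the algorithm strikes a good balance between identification accuracy and running efficiency on large data sets and can effectively monitor web pages that sell stimulants online. Second, we design and implement an algorithm for identifying and filtering drug information on the Internet based on Adaboost and latent semantic indexing. We first introduce the basic idea and key properties of Boosting, then apply real-valued Adaboost to the task, proposing improvements in the construction of decision stumps and in information fusion. As a result, our system achieves a very low false alarm rate while maintaining a relatively high detection rate. In addition, Adaboost-based harmful web page identification has fairly low computational complexity and can be retrained frequently to adapt to the complex Internet environment, which makes it promising for practical use. Third, we propose a semi-supervised learning framework based on real-valued Adaboost for identifying and filtering drug web pages. In this classification problem the available labeled sample set is usually relatively small, and labeling samples by hand is very expensive in time and labor. The Internet, however, contains a large number of unlabeled samples, so exploiting them is an important and effective approach. Our study shows that semi-supervised learning with unlabeled samples is very effective and that a hierarchical classification framework markedly improves classification results. Semi-supervised learning greatly reduces the manual labeling burden and, compared with supervised learning, effectively improves recognition results. Because our semi-supervised active learning framework has many desirable properties, it merits further study. In summary, this thesis makes some useful explorations of machine learning methods themselves and of their application to identifying drug information on the Internet.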
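The abstract above mentions real-valued Adaboost with decision stumps. As a rough illustration only, the sketch below implements a generic real-valued Adaboost (in the Schapire-Singer style) over binary keyword-presence features. The feature encoding, the smoothing constant `eps`, and the function names are assumptions made for this example; the thesis's own stump construction and information fusion improvements are not reproduced here.

```python
import math

def train_real_adaboost(X, y, rounds=5, eps=1e-6):
    """X: list of 0/1 feature vectors; y: labels in {+1, -1} (assumed encoding)."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n          # uniform initial sample weights
    model = []                 # list of (feature index, (score for 0, score for 1))
    for _ in range(rounds):
        best = None
        for j in range(d):
            # Weighted mass of positives and negatives in each bin of feature j.
            pos, neg = [0.0, 0.0], [0.0, 0.0]
            for xi, yi, wi in zip(X, y, w):
                (pos if yi > 0 else neg)[xi[j]] += wi
            # Normalization factor Z; a smaller Z means a better stump.
            z = 2 * sum(math.sqrt(p * q) for p, q in zip(pos, neg))
            if best is None or z < best[0]:
                scores = tuple(0.5 * math.log((p + eps) / (q + eps))
                               for p, q in zip(pos, neg))
                best = (z, j, scores)
        _, j, scores = best
        model.append((j, scores))
        # Down-weight samples the stump scores correctly, then renormalize.
        w = [wi * math.exp(-yi * scores[xi[j]]) for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, x):
    """Sum of real-valued stump outputs; the sign gives the predicted class."""
    return sum(scores[x[j]] for j, scores in model)

def predict(model, x):
    return 1 if score(model, x) > 0 else -1

# Toy data (hypothetical): feature 0 alone separates the two classes.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 1, -1, -1, -1]
model = train_real_adaboost(X, y, rounds=3)
print([predict(model, x) for x in X])
```

Because each stump returns a real-valued confidence rather than a hard vote, the decision threshold on `score` can be shifted away from zero to trade detection rate against false alarm rate.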
English abstract	With the rapid growth of the World Wide Web, it plays an increasingly important role in everyday life. The World Wide Web makes it very convenient for users to obtain information, and it is growing extremely fast in China. However, there is much harmful information on the Internet, such as pornographic content, so filtering harmful web pages is an important issue. In general, harmful web page filtering is cast as a web page classification problem, in which machine learning plays a very important role. Filtering harmful content on the Internet has by now become an important research topic. Filtering efforts mainly target pornography, gambling, violence, and murder, and research on these kinds of web content has made great progress. Websites carrying information about prohibited drugs, however, have not attracted much attention. Some of these sites sell drugs online, while others provide information about growing or using drugs. There are many such sites, and their traffic is growing rapidly. In this thesis, starting from the problem of filtering harmful web content, we study several prevailing machine learning methods, including supervised and semi-supervised learning methods, and apply some of them to real-world problems. The main contributions of this thesis are as follows. We design and implement an algorithm for filtering websites that sell stimulants. The algorithm uses data extracted from the Sogou search engine. The database is so large that traditional machine learning methods may be ineffective, so we use indexing and a divide-and-conquer strategy. First, we use a rough filter and a refined filter, both based on keywords, to extract data from the database. Then we build an index over all the extracted data to speed up access. After that, we use combined rules and a one-class SVM to classify the web pages in the database. The results show that our method achieves satisfactory performance at a reasonable speed. We also design and implement a web page filtering algorithm based on Adaboost. With an improved setting, our Adaboost-based web page filtering algorithm achieves a very low false positive rate while keeping a relatively high detection rate. Meanwhile, this algorithm has low computational complexity, which makes it possible for ...
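The first stages described in the abstract (a rough keyword filter, a refined keyword filter, then an index over the surviving pages) could look roughly like the sketch below. The keyword lists, weights, threshold, and function names are all hypothetical and chosen only for illustration. The later stage with combined rules and a one-class SVM is not shown.

```python
from collections import defaultdict

# Hypothetical keyword lists and threshold, for illustration only.
COARSE_KEYWORDS = {"steroid", "stimulant", "ephedrine"}
REFINED_WEIGHTS = {"buy": 1.0, "order": 1.0, "price": 0.5, "shipping": 0.5}
REFINED_THRESHOLD = 1.5

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def coarse_filter(tokens):
    """Cheap first pass: keep a page if it mentions any coarse keyword."""
    return any(t in COARSE_KEYWORDS for t in tokens)

def refined_filter(tokens):
    """Second pass: summed weights of distinct sales terms must reach the threshold."""
    present = set(tokens)
    total = sum(w for k, w in REFINED_WEIGHTS.items() if k in present)
    return total >= REFINED_THRESHOLD

def build_index(pages):
    """Apply both filters, then build an inverted index (term -> page ids)."""
    index = defaultdict(set)
    kept = []
    for page_id, text in pages.items():
        tokens = tokenize(text)
        if coarse_filter(tokens) and refined_filter(tokens):
            kept.append(page_id)
            for t in set(tokens):
                index[t].add(page_id)
    return kept, index

pages = {
    "a": "Buy ephedrine online best price and fast shipping",
    "b": "Ephedrine pharmacology overview",
    "c": "Buy shoes online low price free shipping",
}
kept, index = build_index(pages)
print(kept, sorted(index["ephedrine"]))
```

Running the cheap filter first means the costlier scoring and indexing only touch a small fraction of pages, which is one plausible reading of the divide-and-conquer strategy the abstract mentions.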
Language	Chinese
Other identifier	200628014628032
Content type	Thesis
Source URL	[http://ir.ia.ac.cn/handle/173211/7543]
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	贺主. 互联网毒品类信息过滤研究[D]. 中国科学院自动化研究所. 中国科学院研究生院. 2010.
