Research on Similarity Detection of Massive Text Based on Semantic Fingerprint

	Xiaolin Jin 1; Shuwu Zhang 2; Jie Liu 2; Hu Guan 2
	2017-12
Conference date	December 16-17, 2017
Conference location	Guangzhou, China
Abstract	To find required information quickly and efficiently in massive text collections, this paper proposes a method that combines semantic fingerprints with cosine distance. After the Chinese texts are preprocessed, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used to extract the feature words of each text. The texts are then screened initially with the Simhash algorithm, and the resulting candidate texts are compared a second time using cosine distance to extract the most similar texts. Compared with using the Simhash algorithm alone, the proposed method can greatly improve accuracy and recall when texts have been modified, and it can also meet the requirements of similarity detection for massive texts. Combining semantic fingerprints with cosine distance therefore effectively compensates for the high false positive rate of the Simhash algorithm and is better suited in practice to similarity detection of massive texts.
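The two-stage pipeline described in the abstract (TF-IDF feature weights, Simhash screening, then cosine re-ranking) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the MD5-based 64-bit term hash, the smoothed IDF formula, and the `max_hamming=3` threshold are all assumptions, and the documents are assumed to be already segmented into tokens (Chinese word segmentation is not shown).

```python
import hashlib
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) keeps every weight positive.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def simhash(weights, bits=64):
    """Weighted Simhash fingerprint of a {term: weight} dict."""
    totals = [0.0] * bits
    for term, w in weights.items():
        # Hypothetical choice: first 8 bytes of MD5 as the term hash.
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            totals[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(query_idx, docs, max_hamming=3, top_k=5):
    """Stage 1: keep documents whose fingerprint is within max_hamming bits.
    Stage 2: rank those candidates by cosine similarity of TF-IDF vectors."""
    vecs = tfidf_vectors(docs)
    prints = [simhash(v) for v in vecs]
    candidates = [
        i for i in range(len(docs))
        if i != query_idx and hamming(prints[query_idx], prints[i]) <= max_hamming
    ]
    scored = [(i, cosine(vecs[query_idx], vecs[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Calling `most_similar(0, docs)` on a list of token lists returns `(index, cosine)` pairs for the documents that pass the fingerprint screen, best match first. In this sketch the Hamming threshold trades recall against the number of candidates that need the more expensive cosine comparison, which is the role the abstract assigns to the second stage.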
Institution ownership rank	2
Language	English
Content type	Conference paper
Source URL	[http://ir.ia.ac.cn/handle/173211/47503]
Collection	Research Center for Digital Content Technology and Services_New Media Service and Management Technology
Corresponding author	Jie Liu; Hu Guan
Author affiliations	1. Communication University of China 2. Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Xiaolin Jin, Shuwu Zhang, Jie Liu, et al. Research on Similarity Detection of Massive Text Based on Semantic Fingerprint[C]. In:. Guangzhou, China. December 16-17, 2017.
