Research on Similarity Detection of Massive Text Based on Semantic Fingerprint
Xiaolin Jin1; Shuwu Zhang2; Jie Liu2; Hu Guan2
2017-12
Conference date: December 16-17, 2017
Conference location: Guangzhou, China
Abstract

To find required information quickly and efficiently in massive text collections, this paper proposes a method that combines semantic fingerprints with cosine distance. After the Chinese texts are preprocessed, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used to extract the feature words of each text. The Simhash algorithm then performs an initial screening, and the resulting candidate texts are compared a second time using cosine distance to extract the most similar texts. Compared with using the Simhash algorithm alone, the proposed method greatly improves accuracy and recall when texts have been modified, while still meeting the requirements of similarity detection over massive texts. Combining semantic fingerprints with cosine distance therefore compensates for the high false positive rate of the Simhash algorithm and is better suited in practice to similarity detection of massive texts.
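
The two-stage pipeline described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are assumed to be already segmented into token lists (the paper's Chinese preprocessing step is not reproduced), and the smoothed IDF formula, the MD5-based term hash, the 64-bit fingerprint width, the `max_hamming` threshold, and all function names are hypothetical choices made for this example.

```python
import hashlib
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return weights


def simhash(weights, bits=64):
    """Fold weighted terms into a `bits`-wide Simhash fingerprint."""
    acc = [0.0] * bits
    for term, w in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each term votes +w or -w on every bit position.
            acc[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(bits) if acc[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(query, corpus, max_hamming=16, top_k=3):
    """Screen by Simhash Hamming distance, then rank candidates by cosine."""
    weights = tfidf_weights(corpus + [query])
    q_w = weights[-1]
    q_fp = simhash(q_w)
    # Stage 1: coarse screening on fingerprints.
    candidates = [
        i for i, w in enumerate(weights[:-1])
        if hamming(simhash(w), q_fp) <= max_hamming
    ]
    # Stage 2: second comparison of the survivors by cosine similarity.
    scored = [(i, cosine(weights[i], q_w)) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Calling `most_similar(query_tokens, corpus_tokens)` returns up to `top_k` pairs of corpus index and cosine similarity. In a large-scale setting the fingerprints would normally be computed once and indexed rather than recomputed per query; the sketch recomputes them only to stay short.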

Ownership rank: 2
Language: English
Content type: Conference paper
Source URL: http://ir.ia.ac.cn/handle/173211/47503
Collection: Research Center for Digital Content Technology and Services_New Media Service and Management Technology
Corresponding authors: Jie Liu; Hu Guan
Author affiliations: 1. Communication University of China
2. Institute of Automation, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Xiaolin Jin, Shuwu Zhang, Jie Liu, et al. Research on Similarity Detection of Massive Text Based on Semantic Fingerprint[C]. In: . Guangzhou, China. December 16-17, 2017.