Research on Similarity Detection of Massive Text Based on Semantic Fingerprint | |
Xiaolin Jin1; Shuwu Zhang2; Jie Liu2; Hu Guan2 | |
2017-12 | |
Conference date | December 16-17, 2017 |
Conference location | Guangzhou, China |
Abstract | To find required information quickly and efficiently in massive text collections, this paper proposes a method that combines semantic fingerprints with cosine distance. After preprocessing the Chinese texts, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used to extract the feature words of each text. The texts are then screened initially with the Simhash algorithm, and finally the candidate texts are compared using cosine distance in a second similarity pass to extract the most similar texts. Compared with using the Simhash algorithm alone, the proposed method can greatly improve accuracy and recall when texts have been modified, and it can also meet the requirements of similarity detection over massive texts. This combination of semantic fingerprint and cosine distance can therefore effectively compensate for the high false positive rate of the Simhash algorithm and is better suited in practice to similarity detection of massive texts. |
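Institutional rights ranking | 2 |
Language | English |
Content type | Conference paper |
Source URL | [http://ir.ia.ac.cn/handle/173211/47503] |
Collection | Research Center for Digital Content Technology and Services_New Media Service and Management Technology |
Corresponding authors | Jie Liu; Hu Guan |
Author affiliations | 1. Communication University of China; 2. Institute of Automation, Chinese Academy of Sciences |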
Recommended citation (GB/T 7714) | Xiaolin Jin, Shuwu Zhang, Jie Liu, et al. Research on Similarity Detection of Massive Text Based on Semantic Fingerprint[C]. In:. Guangzhou, China. December 16-17, 2017. |
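
The two-stage pipeline described in the abstract (TF-IDF feature weighting, Simhash screening, then cosine re-ranking of the candidates) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the texts are already segmented into word lists (the paper's Chinese preprocessing is not reproduced here), and the 64-bit fingerprint width, the MD5 term hash, the smoothed IDF formula, the Hamming threshold of 3, and all function names are illustrative assumptions rather than values taken from the paper.

```python
import hashlib
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per pre-tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        weights.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return weights


def simhash(weights, bits=64):
    """Fold weighted terms into a `bits`-wide fingerprint (bits <= 128)."""
    acc = [0.0] * bits
    for term, w in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the term hash has a 1 bit, subtract otherwise.
            acc[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(bits) if acc[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(query_idx, docs, max_hamming=3, top_k=3):
    """Stage 1: keep documents whose fingerprint is close to the query's.
    Stage 2: rank those candidates by cosine similarity of TF-IDF vectors."""
    weights = tfidf_weights(docs)
    prints = [simhash(w) for w in weights]
    candidates = [
        i for i in range(len(docs))
        if i != query_idx and hamming(prints[query_idx], prints[i]) <= max_hamming
    ]
    scored = [(i, cosine(weights[query_idx], weights[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Calling `most_similar(0, docs)` on a list of tokenized documents returns `(index, cosine_similarity)` pairs for the documents that survive the fingerprint screen, ordered from most to least similar. The Hamming comparison is cheap and prunes most of the collection, while the cosine pass is applied only to the survivors to filter out fingerprint false positives, which is the division of labour the abstract describes.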