Named Entity Disambiguation Based on Semantic Knowledge Mining and Integration (基于语义知识挖掘与融合的实体消歧技术研究)

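Title	基于语义知识挖掘与融合的实体消歧技术研究
Author	Han Xianpei (韩先培)
Degree type	Doctor of Engineering
Defense date	2010-06-02
Degree-granting institution	Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral	Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor	Zhao Jun (赵军)
Keywords	Entity Disambiguation; Entity Linking; Multi-source Heterogeneous Semantic Knowledge; Semantic Knowledge Mining; Semantic Knowledge Integration; Named Entity Disambiguation
Alternative title	Named Entity Disambiguation Based on Semantic Knowledge Mining and Integration
Degree discipline	Pattern Recognition and Intelligent Systems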
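Chinese abstract (translated)	Entity disambiguation is a key technique in information extraction and integration. It aims to resolve the name ambiguity that is pervasive in text, and it has broad application value in knowledge engineering, information retrieval, the Semantic Web, and related fields. High-performance named entity disambiguation, however, depends on the use of semantic knowledge. Although many knowledge sources exist on the Internet, they are heterogeneous and come from multiple sources, and much of their semantic knowledge is hidden in deep structures, so computers usually have difficulty acquiring and using it. Research on mining and integrating semantic knowledge from multi-source heterogeneous knowledge sources is therefore of significant academic value for named entity disambiguation and many other natural language processing tasks. This thesis studies methods for mining and integrating structured and probabilistic semantic knowledge from Web knowledge sources, and their application to entity disambiguation by clustering and by linking. The specific results are as follows.
[1] Mining and integrating structured knowledge sources: structural semantic relatedness based on the semantic graph. Most semantic knowledge in structured knowledge sources can be expressed as relatedness between concepts. To mine and integrate this structured semantic knowledge, the thesis proposes a unified representation model for structured semantic knowledge, the semantic graph, together with a graph-based mining algorithm, structural semantic relatedness, which mines both the explicit and the implicit knowledge in the semantic graph. Experiments show that structured semantic knowledge improves entity disambiguation performance by 9.7% over a traditional bag-of-words system and by 15.7% over a social-network-based system.
[2] Mining and integrating unstructured knowledge sources: entity knowledge representation based on language models. Unstructured knowledge sources contain a large amount of probabilistic semantic knowledge. To mine and integrate it, the thesis proposes a language-model-based framework for representing entity knowledge, the entity language model. Based on this representation, the thesis studies algorithms for mining entity knowledge from text corpora. To address the shortage of training samples when estimating entity language model parameters, it proposes two training-sample mining strategies based on corpus structure: related-document expansion based on similarity structure and related-document expansion based on hierarchical category structure. Experiments show that the entity language model represents entity knowledge effectively, and that parameter estimation based on related-document expansion and on hierarchical-category expansion significantly improves entity knowledge mining performance.
[3] Entity disambiguation based on knowledge inference: computing entity similarity with a structural relatedness semantic kernel. The key problem in entity disambiguation is computing the similarity between entity mentions. Traditional bag-of-words similarity considers only surface associations between mention features and cannot capture semantic relations between features, such as lexical relatedness between words, social relatedness between entities, and semantic relatedness between concepts. To incorporate these kinds of semantic relatedness into similarity computation, the thesis proposes a mention similarity method based on a structural relatedness semantic kernel. The method achieves competitive disambiguation performance: a 10.7% improvement over the bag-of-words method, a 16.7% improvement over the social-network-based method, and a 10% improvement over state-of-the-art systems.
[4] Entity linking based on knowledge inference: computing local consistency and global consistency with multi-source semantic knowledge. To build a high-performance entity linking system that makes full use of the context of entity mentions together with the probabilistic and structured semantic knowledge mined from multiple knowledge sources, the thesis proposes a local consistency model and a global consistency model. The local consistency model treats an entity mention as a sample generated by an entity language model, and the global consistency model uses structural semantic relatedness to model the topical ...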
English abstract	Named entity disambiguation is one of the key techniques in information extraction and integration. It aims to resolve the name ambiguity problem, which is common in textual information, and plays an important role in many areas, such as knowledge engineering, information retrieval, and the Semantic Web. However, high-performance named entity disambiguation depends critically on the use of semantic knowledge. In recent years, large-scale knowledge sources have become increasingly available on the Web. Unfortunately, these knowledge sources are usually heterogeneous, and the semantic knowledge within them is encoded in complex structures, which makes it difficult to use across tasks. Therefore, mining and integrating the semantic knowledge contained in heterogeneous knowledge sources is critical to named entity disambiguation and many other natural language processing tasks. This thesis focuses on semantic knowledge mining from the Web, and on entity disambiguation and entity linking methods based on the mined semantic knowledge. The main contributions and novelties are summarized as follows.
[1] Structural Knowledge Mining and Integration: Structural Semantic Relatedness. Most semantic knowledge contained in structural knowledge sources can be represented as semantic relatedness between concepts. This thesis proposes a novel representation model for structural semantic knowledge, the semantic-graph, which can uniformly represent the structural semantic knowledge extracted from multiple knowledge sources. We then propose the Structural Semantic Relatedness (SSR) measure to capture the explicit and implicit semantic knowledge contained in the semantic-graph. The experimental results show that the SSR method significantly outperforms traditional bag-of-words (BOW) methods by 9.7% and social-network-based methods by 15.7%.
[2] Unstructured Knowledge Mining and Integration: the Entity Language Model. Unstructured knowledge sources contain rich probabilistic semantic knowledge that can enhance a named entity disambiguation system. To mine and integrate this probabilistic semantic knowledge, this thesis proposes a knowledge representation model, the Entity Language Model. Based on the entity language model, we demonstrate how to mine the semantic knowledge about an entity by exploiting unstructured knowledge sources. To resolve the sample sparseness problem in entity language model estimation, we propose two sam...
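Language	Chinese
Other identifier	200718014628035
Content type	Thesis
Source URL	[http://ir.ia.ac.cn/handle/173211/6284]
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Han Xianpei. 基于语义知识挖掘与融合的实体消歧技术研究[D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2010.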
