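Title (original): 汉语文本信息抽取关键技术研究
Author: Liu Feifan (刘非凡)
Degree type: Doctor of Engineering
Defense date: 2006-06-12
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisors: Xu Bo (徐波); Zhao Jun (赵军)
Keywords: Information Extraction; Product Named Entity Recognition; Entity Mention Recognition; Entity Co-reference Resolution
Alternative title (English): Research on the Key Technologies for Chinese Text Information Extraction
Major: Pattern Recognition and Intelligent Systems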
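Abstract (translated from the Chinese abstract):

With the arrival of the information age and the growth of the Internet, "information explosion" has become a pressing problem in information processing. How to obtain information quickly and accurately is now a focus of research both in China and abroad, and text information extraction is one of the strongest tools for addressing it. Compared with English information extraction, research on Chinese information extraction rests on a weaker foundation, and the lag in its underlying key technologies has seriously hindered the realization of complete Chinese information extraction systems.

This thesis takes Chinese free text as its object and studies three key technologies of information extraction in depth: named entity recognition, entity mention recognition, and entity mention co-reference resolution. The main work and contributions are:

1) Annotation specification and corpus construction for product named entities in the business domain. Working with real Internet text, the thesis defines three types of product named entities in the business domain and analyzes their structural and expressive characteristics in depth. On this basis it formulates a practical annotation specification for product named entities and builds the first manually annotated corpus of Chinese product named entities.

2) Product named entity recognition in Chinese text. Product named entities have complex structure, flexible expression, and may be nested. To handle these properties, the thesis proposes a recognition method based on hierarchical hidden Markov models. The method builds two hierarchical hidden Markov models, one on word-form features and one on part-of-speech features, then fuses the two models and combines them with a knowledge base and heuristic rules, so that contextual features at different levels are used together. Experimental results show that the method outperforms a recognition method based on two-level cascaded maximum entropy models and achieves fairly satisfactory results in both the electronic digital products domain and the mobile phone domain.

3) Entity mention recognition in Chinese text. Entity mentions in Chinese text are nested at multiple levels. To handle this, the thesis proposes a hierarchical structure information encoding method that models the nested structure of entity mentions well while bypassing deep syntactic parsing. On this basis it builds a conditional random field model for detecting the boundaries of nested entity mentions and a support vector machine model for annotating multi-level mention information, effectively integrating rich linguistic features. Experimental results show that the encoding method effectively solves the problem of recognizing multi-level nested mentions, and that the recognition method based on the conditional random field and support vector machine models performs well.

4) Statistical Chinese entity co-reference resolution. Rule-based anaphora resolution methods depend on deep syntactic and semantic analysis and port poorly to new settings. To overcome this, the thesis applies a "classification-linking" statistical framework to entity co-reference resolution in Chinese text, which effectively improves the robustness and portability of the system. The method uses a support vector machine to model the co-reference relation between two entity mentions and adopts a "nearest-link strategy" (最近链接策略) to carry out co-reference analysis at the level of the whole text. A statistical Chinese entity co-reference resolution system is built, and the effects of contextual features at different levels, of different classifiers, and of classifier combinations on system performance are analyzed in depth. Experimental results show that the statistical method is effective and obtains fairly satisfactory results without deep syntactic or semantic analysis.

5) Statistical feature mining for co-reference resolution guided by linguistic theory. To counter the blindness of the "statistical black box" in statistical co-reference resolution, the study is guided by three kinds of resolution factors from the linguistic theory of anaphora resolution and mines statistical features of different forms through four strategies: "unordered feature recombination", "coarse-grained context representation", "context window expansion", and "collocation information extraction". The aim is to make full use of shallow linguistic features to approximate the linguistic properties described by the relevant theories and, to some extent, to map shallow features onto deep linguistic rules. Experimental results show that the proposed feature mining strategies effectively improve statistical co-reference resolution.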
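The abstract does not spell out the hierarchical structure information encoding itself. Purely as an illustration of how nested mentions can be flattened into token-level labels without a parse tree, the sketch below uses a generic layered BIO scheme: each mention is placed on a layer equal to its nesting depth. This is an assumed stand-in, not the thesis's encoding; the function name `encode_nested`, the span format, and the example labels are all hypothetical.

```python
from typing import List, Tuple

# A mention span: (start token index, end token index exclusive, label).
Span = Tuple[int, int, str]


def encode_nested(n_tokens: int, spans: List[Span]) -> List[List[str]]:
    """Encode properly nested spans as one BIO tag sequence per nesting depth.

    Assumes spans never cross and no two spans cover the same token range.
    """
    def strictly_contains(outer: Span, inner: Span) -> bool:
        same_range = (outer[0], outer[1]) == (inner[0], inner[1])
        return outer[0] <= inner[0] and inner[1] <= outer[1] and not same_range

    # Depth of a span = number of other spans that enclose it.
    depths = [sum(strictly_contains(o, s) for o in spans) for s in spans]
    n_layers = max(depths, default=-1) + 1
    layers = [["O"] * n_tokens for _ in range(n_layers)]

    for (start, end, label), depth in zip(spans, depths):
        layers[depth][start] = "B-" + label
        for i in range(start + 1, end):
            layers[depth][i] = "I-" + label
    return layers


# Hypothetical example: a person mention that contains an organization
# mention, which in turn contains a place mention.
tokens = ["中国", "科学院", "自动化", "研究所", "的", "研究员"]
spans = [(0, 6, "PER"), (0, 4, "ORG"), (0, 1, "GPE")]
layers = encode_nested(len(tokens), spans)
for depth, tags in enumerate(layers):
    print(depth, tags)
```

Each layer is an ordinary flat tagging problem, so a sequence labeler such as a conditional random field could in principle be trained per layer. Whether the thesis proceeds this way is not stated in the abstract.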
English abstract:

Text information extraction is one of the most powerful means of dealing with the problem of information explosion. This thesis studies the related issues in depth, and its main contributions are summarized as follows.

1. Study on Product NE Annotation Specifications and Construction of an Annotated Corpus. Based on texts from the Internet, three types of product named entities are thoroughly analyzed. The research also draws up a practical set of product named entity annotation specifications and establishes the first manually annotated Chinese corpus for product named entities.

2. Product Named Entity Recognition in Chinese Texts. An approach to product named entity recognition based on the Hierarchical Hidden Markov Model (HHMM) is proposed. The approach builds two HHMMs, one using word-form features and one using part-of-speech features, and combines them with a knowledge base and heuristics so that diverse contextual features are used for recognition. Extensive experiments show that the method obtains promising results in the electronic digital products domain and the mobile phone domain.

3. Entity Mention Recognition in Chinese Texts. The thesis presents a hierarchical structure information encoding method. Based on this method, two models that employ rich linguistic features are set up for entity mention recognition: a boundary detection model for multi-level nested entity mentions based on conditional random fields (CRFs), and a multiple-property annotation model based on support vector machines (SVMs). The experimental results demonstrate the feasibility and good performance of the proposed approach.

4. Statistical Method for Chinese Entity Co-reference Resolution. The "classification-linking" statistical framework is employed for Chinese entity co-reference resolution. An SVM is used to model the co-reference relation between two entity mentions, and the nearest-link ("link-first") strategy is adopted to analyze the co-reference relations among the mentions within a text. Extensive experiments show that the statistics-based approach performs well.

5. Linguistic Theory based Statistical Feature Mining for Entity Co-reference Resolution. Guided by linguistic findings on anaphora resolution, four strategies are presented for mining different statistical features: feature reconstruction, contextual granularity scale-up, contextual window enlargement, and collocation information extraction. The rationale is to take full advantage of surface linguistic features to approximate the characteristics described in the corresponding linguistic theories. The experimental results show that the proposed strategies effectively improve the performance of the statistics-based co-reference resolution approach.
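The "classification-linking" framework in contribution 4 can be sketched roughly as follows: a pairwise classifier decides whether two mentions co-refer, and each mention is then linked to the nearest preceding mention the classifier accepts. This is a minimal sketch under stated assumptions, not the thesis's system. The function names are hypothetical, and `toy_classifier` is a simple string-containment stand-in for the trained SVM described in the abstract.

```python
from typing import Callable, List


def closest_first_link(
    mentions: List[str],
    is_coreferent: Callable[[str, str], bool],
) -> List[int]:
    """Assign an entity id to each mention in document order.

    Each mention is linked to the nearest preceding mention that the
    pairwise classifier accepts; otherwise it starts a new entity.
    """
    entity_of: List[int] = []
    next_id = 0
    for j, mention in enumerate(mentions):
        assigned = None
        # Scan candidate antecedents from the closest one backwards.
        for i in range(j - 1, -1, -1):
            if is_coreferent(mentions[i], mention):
                assigned = entity_of[i]
                break
        if assigned is None:
            assigned = next_id
            next_id += 1
        entity_of.append(assigned)
    return entity_of


def toy_classifier(antecedent: str, mention: str) -> bool:
    # Assumption: stands in for a trained pairwise SVM. It accepts a pair
    # when one mention string contains the other, ignoring case.
    a, m = antecedent.lower(), mention.lower()
    return a in m or m in a


mentions = ["Nokia", "the phone", "Nokia Corporation", "Motorola", "phone"]
chains = closest_first_link(mentions, toy_classifier)
print(chains)  # [0, 1, 0, 2, 1]
```

In a real system the classifier would score feature vectors built from both mentions and their context. The linking loop stays the same regardless of which classifier is plugged in, which is one reason the two steps are usually kept separate.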
Language: Chinese
Other identifier: 200218014603213
Content type: Dissertation
Source URL: http://ir.ia.ac.cn/handle/173211/5947
Collection: Graduates_Doctoral Dissertations (毕业生_博士学位论文)
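Recommended citation
GB/T 7714
Liu Feifan (刘非凡). 汉语文本信息抽取关键技术研究 [Research on the Key Technologies for Chinese Text Information Extraction][D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2006.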