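Title: A Deep Learning-Based Information Extraction System for Listed Company Announcements
Author: 王文惠 (Wang Wenhui)
Degree type: Master's
Defense date: 2019-05-17
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Advisor: 杨超 (Yang Chao)
Keywords: information extraction system; data back-labeling; named entity recognition; entity relation extraction
Degree discipline: Computer Software and Theory
Abstract (translated from the Chinese)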

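Announcements of listed companies mainly disclose a company's development or major events that affect investors' interests, and mining the key information in them is a daily task for professional institutional researchers. With the development of deep learning, automatic information extraction has gradually been applied in many scenarios and has greatly improved work efficiency. This thesis designs and develops a deep-learning-based information extraction system for the specific domain of listed company announcements; the system extracts document-level structured data. It consists of a model training part and a prediction part, and its core deep learning techniques are sentence-level named entity recognition and entity relation extraction.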

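In the model training part, data back-labeling is applied to the crawled announcement PDF files and the document-level structured data to generate a reasonably accurate sentence-level training corpus. The thesis also improves on the classic end-to-end relation extraction model and proposes an entity relation extraction model based on BLSTM_ATT and piecewise pooling, which captures long-distance dependencies and obtains finer-grained features. Experimental results show that this model outperforms the classic model on the company announcement corpus.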
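
The data back-labeling step can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the thesis's implementation: it assumes back-labeling means locating the values of a document-level record inside each sentence by exact string match and emitting character-level BIO tags. The field names (`SHAREHOLDER`, `SHARES`), the function names, and the sample sentences are all hypothetical.

```python
from typing import Dict, List, Tuple


def back_label(sentence: str, record: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag each character of `sentence` with a BIO label derived from `record`.

    `record` maps a hypothetical field name to the value stored in the
    document-level structured data.
    """
    tags = ["O"] * len(sentence)
    # Match longer values first so a short value cannot split a longer one.
    for field, value in sorted(record.items(), key=lambda kv: -len(kv[1])):
        if not value:
            continue
        start = sentence.find(value)
        while start != -1:
            end = start + len(value)
            # Only label spans that are still completely untagged.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B-" + field
                for i in range(start + 1, end):
                    tags[i] = "I-" + field
            start = sentence.find(value, end)
    return list(zip(sentence, tags))


def build_corpus(sentences: List[str], record: Dict[str, str]):
    """Keep only sentences in which at least one record value was found."""
    corpus = []
    for s in sentences:
        labelled = back_label(s, record)
        if any(tag != "O" for _, tag in labelled):
            corpus.append(labelled)
    return corpus


# Hypothetical example data.
record = {"SHAREHOLDER": "Alice Fund", "SHARES": "1,000"}
sentences = ["Alice Fund sold 1,000 shares.", "The board met on Monday."]
corpus = build_corpus(sentences, record)
```

Exact matching is the simplest possible choice here; a real pipeline would probably also need to normalize number and date formats before matching, which this sketch does not attempt.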

 

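In the prediction part, the structured output for each announcement comes from both unstructured text and tables. For unstructured text, named entity recognition, anaphora resolution, and entity relation extraction are performed sentence by sentence; a document-level information fusion module then fills entities from the context of key sentences into a structured template to obtain document-level structured data. For tables, information is extracted with regular expressions.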
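
As an illustration of regular-expression extraction from tables, the sketch below parses rows of a flattened table into typed records. The row layout (pipe-separated columns for date, method, shares, and price), the pattern, and the function name are assumptions made for the example; the abstract does not describe the actual table formats or patterns used.

```python
import re
from typing import Dict, List

# Hypothetical row layout: date | method | shares | average price
ROW_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s*\|\s*"
    r"(?P<method>[^|]+?)\s*\|\s*"
    r"(?P<shares>\d{1,3}(?:,\d{3})*)\s*\|\s*"
    r"(?P<price>\d+(?:\.\d+)?)"
)


def extract_rows(table_text: str) -> List[Dict[str, object]]:
    """Return one dict per table row that matches ROW_PATTERN."""
    rows = []
    for line in table_text.splitlines():
        m = ROW_PATTERN.search(line)
        if m is None:
            continue  # header, total line, or malformed row
        rows.append({
            "date": m.group("date"),
            "method": m.group("method"),
            "shares": int(m.group("shares").replace(",", "")),
            "price": float(m.group("price")),
        })
    return rows


# Hypothetical example table.
table = """date | method | shares | avg price
2019-03-01 | block trade | 1,200,000 | 12.35
2019-03-04 | auction | 350,000 | 12.5
total | | 1,550,000 |"""
rows = extract_rows(table)
```

Rows that do not match the pattern are skipped rather than raising an error, so header and total lines drop out without special handling.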

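Finally, the system is validated experimentally on two types of announcements: shareholding increases and decreases, and major contracts. The results show that the system achieves high accuracy and running speed, and that it is extensible and can be applied to many types of announcements.

English abstract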

Most announcements of listed companies report on a company's development or on events that affect investors. Professional researchers need to extract the important information from these announcements. With the development of deep learning, automatic information extraction is gradually being applied in many scenarios, which greatly improves work efficiency. We develop an information extraction system based on deep learning for the specific domain of listed company announcements, which extracts document-level structured data. The system consists of a model training phase and a prediction phase. The core deep learning technologies are named entity recognition and entity relation extraction.

In the model training phase, we design a data back-labeling method that generates a sentence-level training corpus from PDF text and structured training data. We propose an entity relation extraction model based on BLSTM_ATT and piecewise max-pooling, which captures long-distance dependencies and obtains more detailed features. The experimental results show that this model outperforms classic models on the corpus of company announcements.
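
To make the piecewise max-pooling idea concrete, here is a small sketch in plain Python. It assumes the common convention of splitting the per-token feature sequence (for example, attention-weighted BLSTM outputs) into three segments at the two entity positions and max-pooling each segment separately. The function name, the segment boundaries, and the zero fill for an empty segment are assumptions; the abstract does not specify the model's exact layers.

```python
from typing import List, Sequence


def piecewise_max_pool(
    features: Sequence[Sequence[float]], e1: int, e2: int
) -> List[float]:
    """Max-pool three segments split at entity positions e1 <= e2.

    `features` is a list of per-token feature vectors of equal length.
    Returns the three pooled vectors concatenated into one list.
    """
    if not features:
        raise ValueError("features must not be empty")
    if not 0 <= e1 <= e2 < len(features):
        raise ValueError("need 0 <= e1 <= e2 < len(features)")
    dim = len(features[0])
    segments = [
        features[: e1 + 1],        # start of sentence through first entity
        features[e1 + 1 : e2 + 1],  # after first entity through second entity
        features[e2 + 1 :],        # remainder of the sentence
    ]
    pooled: List[float] = []
    for seg in segments:
        if seg:
            # Column-wise maximum over the tokens in this segment.
            pooled.extend(max(col) for col in zip(*seg))
        else:
            # Assumption: an empty segment contributes zeros.
            pooled.extend([0.0] * dim)
    return pooled


# Hypothetical 5-token sentence with 2-dimensional features.
features = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.2, 0.8], [0.5, 0.6]]
pooled = piecewise_max_pool(features, e1=1, e2=3)
```

A single global max-pool over the same input would return only `[0.7, 0.9]`, whereas the piecewise variant keeps one maximum per segment, which is one way to read "more detailed features".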

 

In the prediction phase, the structured data for each announcement comes from unstructured text and tables. For unstructured text, named entity recognition, anaphora resolution, and entity relation extraction are carried out at the sentence level. Anaphora resolution covers references between an entity's full name and its abbreviation. Entities in the context of key sentences are filled into predefined templates by a document-level information integration module. We use regular expressions to extract key information from tables.
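
Resolving abbreviations back to full names might look like the sketch below. It assumes that an abbreviation is introduced by a definitional phrase of the form `Full Name (hereinafter "Alias")`, which is an English stand-in for the kind of phrasing found in announcements; the pattern, the recognized company suffixes, and the function names are all hypothetical and not taken from the thesis.

```python
import re
from typing import Dict, List

# Simplified assumption: a run of capitalised words ending in a company
# suffix, followed by a parenthetical that introduces the alias.
ALIAS_DEF = re.compile(
    r'(?P<full>(?:[A-Z][\w&]*\s)+(?:Co\., Ltd\.|Inc\.))'
    r'\s*\(hereinafter "(?P<alias>[^"]+)"\)'
)


def collect_aliases(sentences: List[str]) -> Dict[str, str]:
    """Map each alias to the full name it was defined for."""
    aliases: Dict[str, str] = {}
    for s in sentences:
        for m in ALIAS_DEF.finditer(s):
            aliases[m.group("alias")] = m.group("full")
    return aliases


def resolve(sentence: str, aliases: Dict[str, str]) -> str:
    """Replace every alias mention in `sentence` with its full name."""
    # Drop the defining parenthetical, keeping only the full name.
    sentence = ALIAS_DEF.sub(lambda m: m.group("full"), sentence)
    if not aliases:
        return sentence
    # Full names come first in the alternation so they are consumed whole
    # and an alias embedded in a full name is never expanded twice.
    fulls = sorted(set(aliases.values()), key=len, reverse=True)
    shorts = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        "|".join(re.escape(f) for f in fulls)
        + "|"
        + "|".join(r"\b" + re.escape(a) + r"\b" for a in shorts)
    )
    return pattern.sub(lambda m: aliases.get(m.group(0), m.group(0)), sentence)


# Hypothetical example sentences.
sentences = [
    'Acme Holdings Co., Ltd. (hereinafter "Acme") signed a major contract.',
    "Acme will deliver the equipment in 2019.",
]
aliases = collect_aliases(sentences)
resolved = [resolve(s, aliases) for s in sentences]
```

Normalizing mentions this way before relation extraction means that a later template-filling step sees one canonical name per entity.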

 

Finally, we test the announcement information extraction system on two announcement types: Equity Overweight or Underweight (shareholding increases and decreases) and Major Contracts. The experimental results indicate that the system achieves high accuracy and running speed. The system is also scalable and can be applied to a variety of announcement types.

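Language: Chinese
Subject: Computer Science and Technology; Artificial Intelligence; Natural Language Processing
Content type: Thesis
Source URL: http://ir.iscas.ac.cn/handle/311060/19171
Collection: Institute of Software_Parallel Computing Laboratory_Theses
Author affiliation: Institute of Software, Chinese Academy of Sciences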
Recommended citation (GB/T 7714):
王文惠. 一种基于深度学习的上市公司公告信息抽取系统[D]. 北京. 中国科学院研究生院. 2019.
