Title: Research on Tibetan Word Segmentation and Text Resource Mining
Author: Liu Huidan (刘汇丹)
Degree: Doctor of Philosophy
Defense date: 2012-11
Degree grantor: Graduate University of Chinese Academy of Sciences
Place of conferral: Beijing
Supervisors: He Yeping (贺也平); Wu Jian (吴健)
Keywords: Tibetan encoding; Tibetan word segmentation; Tibetan contracted words; Tibetan number identification; conditional random field; Tibetan text resource mining; Tibetan search engine; text corpus
Major: Computer Software and Theory
Chinese Abstract

Research on Tibetan information processing has a long history, but only in recent years have mainstream operating system platforms provided reasonably complete support for the Tibetan character set of the Unicode international standard. Multiple Tibetan encodings are still in use, so data exchange and sharing remain a problem. Meanwhile, Tibetan text has no delimiters between words; as in Chinese, word segmentation is therefore a fundamental task in Tibetan natural language processing. In addition, corpora are the most basic raw material of statistical natural language processing, and Tibetan information processing still faces a shortage of corpora. To address these problems, this thesis studies the detection and conversion of Tibetan encodings, Tibetan word segmentation, and the mining and exploitation of Tibetan text resources on the Web. The main results are as follows:

First: we studied the current coexistence of multiple Tibetan encodings and proposed a Tibetan encoding detection method that combines the spacing pattern of Tibetan syllable dots with high-frequency syllables as features; in a large-scale application setting its accuracy is close to 100%.

We studied three encoding models and three encoding implementation schemes for Tibetan, surveyed a number of Tibetan encodings, and proposed an encoding detection method that uses both the spacing pattern of Tibetan syllable dots and high-frequency Tibetan syllables as features. Experiments show that in a large-scale application setting the detection accuracy is close to 100%. To make full use of the limited Tibetan electronic resources and to facilitate the exchange and sharing of Tibetan electronic data, we developed encoding conversion software that performs mutual conversion among multiple Tibetan encodings as well as normalization to a single encoding.
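The two-feature detection idea can be sketched as follows. This is a minimal illustration, not the thesis's detector: the candidate encoding list, the scoring weights, the spacing threshold, and the small high-frequency syllable set are all assumptions made here, and the legacy (non-Unicode) Tibetan code tables are not reproduced.

```python
TSHEG = "\u0f0b"  # Tibetan syllable dot (tsheg)

# Illustrative high-frequency Tibetan syllables; the thesis's actual
# frequency list is not reproduced here.
HIGH_FREQ = {"དང", "གི", "ནི", "ལ", "དེ"}

def score(text):
    """Score a decoded candidate by (a) hits on high-frequency
    syllables and (b) regularity of tsheg spacing: genuine Tibetan
    has short, tsheg-delimited syllables."""
    units = [u for u in text.split(TSHEG) if u]
    if not units:
        return 0.0
    hits = sum(1 for u in units if u in HIGH_FREQ)
    avg_len = sum(len(u) for u in units) / len(units)
    spacing = 1.0 if 1 <= avg_len <= 6 else 0.0  # threshold is an assumption
    return hits / len(units) + spacing

def detect_encoding(raw, candidates):
    """Return the candidate encoding whose decoding scores highest."""
    best, best_score = None, -1.0
    for name in candidates:
        try:
            text = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue
        s = score(text)
        if s > best_score:
            best, best_score = name, s
    return best
```

A wrong decoding either fails outright or yields text with no tsheg and no known syllables, so it scores near zero.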

Second: we solved, or partially solved, problems in rule-based Tibetan word segmentation such as the resolution of cross (overlap) ambiguities and the identification of Tibetan numbers, and designed and implemented a rule-based Tibetan word segmentation system whose accuracy is 99.21% on Tibetan number identification and 96.98% on segmentation.

We proposed an iterative training method for gathering word frequency statistics and a method that uses the frequency information to resolve cross ambiguities, partially solving the cross-ambiguity problem; we also proposed a fast critical-word recognition method based on a double-array trie. Applying these methods, we designed and implemented a Tibetan word segmentation system, SegT. The system first chunks the text at case particles and recognizes critical words, then segments by maximum matching; it detects cross-ambiguous fragments with bidirectional segmentation and disambiguates them using precomputed word frequencies. We studied the structure of Tibetan numbers and sorted their components into several classes, namely basic numbers, number prefixes, number connectors, number suffixes, and independent numbers; Tibetan numbers are then identified by tagging each component with its class, updating the tags according to a set of rules, and finally merging the components. Experimental results show that the case-particle chunking and critical-word recognition methods speed up segmentation by about 15%, although case-particle chunking neither clearly improves nor degrades segmentation quality; the accuracy of Tibetan number identification is 99.21%, and the final segmentation accuracy of the system is 96.98%.
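The bidirectional maximum matching with frequency-based disambiguation described above can be sketched schematically. This is not SegT's code: case-particle chunking and critical-word recognition are omitted, the lexicon and frequencies are toy placeholders (Latin strings standing in for Tibetan syllables), and words are modeled simply as tuples of syllables.

```python
def fmm(sylls, lexicon, max_len=4):
    """Forward maximum matching over a syllable list."""
    out, i = [], 0
    while i < len(sylls):
        for n in range(min(max_len, len(sylls) - i), 0, -1):
            cand = tuple(sylls[i:i + n])
            if n == 1 or cand in lexicon:  # single syllables always match
                out.append(cand)
                i += n
                break
    return out

def bmm(sylls, lexicon, max_len=4):
    """Backward maximum matching (scan from the right)."""
    out, j = [], len(sylls)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = tuple(sylls[j - n:j])
            if n == 1 or cand in lexicon:
                out.append(cand)
                j -= n
                break
    return out[::-1]

def segment(sylls, lexicon, freq):
    """Disagreement between the two passes signals a cross ambiguity;
    keep the pass whose words have the higher total frequency."""
    f, b = fmm(sylls, lexicon), bmm(sylls, lexicon)
    if f == b:
        return f
    fs = sum(freq.get(w, 0) for w in f)
    bs = sum(freq.get(w, 0) for w in b)
    return f if fs >= bs else b
```

For the classic overlap case `a b c` with both `ab` and `bc` in the lexicon, the two passes disagree and the frequency totals decide.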

Third: we proposed a Tibetan word segmentation method based on word-position tagging of Tibetan syllables and studied how different feature template sets and corpus sizes affect segmentation performance; the trained model achieves an F-score of 95.12% on the test corpus.

We recast Tibetan word segmentation as a word-position tagging problem over Tibetan syllables and, using an 8-tag position set and conditional random fields, trained a Tibetan segmentation model, CRF-SegT. Comparative experiments show that the feature template set TMPT-6 performs somewhat better than TMPT-10, that a larger training corpus improves segmentation performance significantly, and that adding dictionary data brings no obvious gain. A model trained on 131,903 sentences generated by SegT without manual correction achieves an F-score of 95.12% on a 1,000-sentence test corpus.
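The word-position reformulation amounts to a round trip between segmented words and per-syllable tags. The sketch below illustrates it with the common 4-tag BMES scheme rather than the thesis's 8-tag set (whose exact labels are not given here); words are modeled as lists of syllables.

```python
def to_tags(words):
    """Map a segmented sentence (list of words, each a list of
    syllables) to per-syllable position tags: S = single-syllable
    word, B/M/E = begin/middle/end of a multi-syllable word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def to_words(sylls, tags):
    """Recover words from syllables plus predicted tags: a word
    closes whenever the tag is E or S."""
    words, cur = [], []
    for s, t in zip(sylls, tags):
        cur.append(s)
        if t in ("E", "S"):
            words.append(cur)
            cur = []
    if cur:  # tolerate a truncated tag sequence
        words.append(cur)
    return words
```

Training data for the CRF is produced by `to_tags` over SegT's output; at decoding time the CRF predicts a tag per syllable and `to_words` rebuilds the segmentation.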

Fourth: we studied the distribution of Tibetan text resources on the Web, extracted a Tibetan text corpus of 1.59 million sentences and 35 million syllables in total, and designed and implemented a prototype of a general-purpose Tibetan search engine.

Combining link analysis with Tibetan encoding detection, we used a web crawler, supplemented by manual inspection, to mine Tibetan text resources on the Web and obtained a fairly comprehensive picture of their distribution. Our analysis shows that, first, more than 50% of domestic Tibetan websites are located in Qinghai Province; second, the old Tibetan encodings are gradually being abandoned in favor of the international standard Unicode encoding for building Web pages; and third, 87% of Tibetan web pages are concentrated in 31 large websites. This concentration makes collecting and processing the text considerably easier. We selected the three largest Tibetan websites and, using rules derived from their URL patterns and page content structure, extracted a Tibetan text corpus of 1.59 million sentences and 35 million syllables in total. Because multiple Tibetan encodings are still in use and existing search engines support them poorly, we analyzed the key technologies for building a general-purpose Tibetan search engine and designed and implemented a prototype system.
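The sentence and syllable counts reported for the corpus can be approximated from Unicode Tibetan text itself, since syllables are delimited by the tsheg (U+0F0B) and sentence or clause units by the shad (U+0F0D). The sketch below is a rough proxy, not the thesis's counting procedure: the shad also ends clauses, so splitting on it overestimates sentence boundaries somewhat.

```python
SHAD = "\u0f0d"   # Tibetan shad: sentence/clause delimiter
TSHEG = "\u0f0b"  # Tibetan tsheg: syllable delimiter

def corpus_stats(text):
    """Count shad-delimited units and tsheg-delimited syllables in
    Unicode Tibetan text. Returns (sentence_units, syllables)."""
    sentences = [s.strip() for s in text.split(SHAD) if s.strip()]
    syllables = 0
    for s in sentences:
        syllables += len([u for u in s.split(TSHEG) if u])
    return len(sentences), syllables
```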
English Abstract

Research on Tibetan information processing has a history of about 30 years. However, many Tibetan character sets and corresponding encodings are still in use, so encoding conversion remains an issue. Meanwhile, Tibetan text has no explicit word separators, so word segmentation is a fundamental task in Tibetan natural language processing. In addition, there is a lack of corpora for Tibetan natural language processing. Focusing on these problems, we study Tibetan encoding detection and conversion, Tibetan word segmentation, Web Tibetan text mining, and related issues. The main achievements of this thesis are as follows.

First, we surveyed many Tibetan encodings and proposed a method to identify them automatically, using both syllable-dot spacing and high-frequency syllables as features. Experimental data from a large-scale application show that the accuracy is very close to 100%.

We summarized three encoding models and three encoding implementation methods and surveyed many Tibetan encodings. An encoding detection method is proposed that combines two types of features, namely the spacing of syllable dots and high-frequency syllables. Experimental data from a large-scale application show that the accuracy is very close to 100%. We also developed tools for encoding conversion.

Second, we solved or partially solved several problems in rule-based Tibetan word segmentation, such as cross-ambiguity resolution and Tibetan number identification, and implemented a segmenter. Experimental data show that the accuracy of Tibetan number identification is 99.21% and the F-score of the segmenter is 96.98%.

An iterative training method is proposed to gather word frequency statistics before a good word segmenter exists; the statistics are then used to resolve cross ambiguities. A fast critical-word detection method based on the double-array trie is also proposed. Applying these methods, a segmenter named SegT is implemented. We summarized the structure of Tibetan numbers and sorted their components into classes, namely basic numbers, number prefixes, number linkers, number suffixes, and independent numbers. Based on this classification, a Tibetan number identification method is proposed: during segmentation, each number component is first tagged with its class; the tag sequence is then updated according to predefined rules; finally, adjacent number components that satisfy certain requirements are combined into a Tibetan number. Experimental data show that the fast critical-word detection method improves segmentation speed by about 15% but does not improve segmentation precision. The accuracy of Tibetan number identification is 99.21%, and the F-score of the segmenter is 96.98% on a corpus of 1,000 manually segmented Tibetan sentences.

Third, we reformulated Tibetan word segmentation as a syllable labelling problem and applied a statistics-based method to it. We compared the effects of different feature template sets and corpus sizes on performance. The segmenter achieves an F-score of 95.12% on the test set.

We propose an approach to Tibetan word segmentation using conditional random fields, reformulating segmentation as a syllable tagging problem: each syllable is labelled with a word-internal position tag, and syllables are combined into words according to their tags. As no publicly available Tibetan word segmentation corpus exists, the training corpus was generated by SegT, which has an F-score of 96.94% on the test set. Two feature template sets, TMPT-6 and TMPT-10, are compared, and the results show that the former is better. Experiments also show that a larger training set improves performance significantly. Trained on 131,903 sentences, the segmenter achieves an F-score of 95.12% on a test set of 1,000 sentences.

Finally, we analyzed the distribution of Tibetan text on the Internet and built a large-scale Tibetan text corpus of nearly 1.59 million sentences, or 35 million syllables, in total. A general-purpose Tibetan search engine prototype was also developed.

We analyzed the distribution of Tibetan text on the Web. Statistics show that more than 50% of domestic Tibetan websites are hosted by organizations in Qinghai Province, and about 87% of Tibetan web pages belong to 31 large websites. We selected the three biggest websites and, within them, the topic pages; we analyzed the layout structure of the selected pages and extracted the topic-related text. The result is a corpus of 1.59 million sentences, or 35 million syllables, in total. Our investigation shows that existing search engines handle Tibetan text retrieval poorly: they support only the Tibetan basic character set defined in the Unicode standard and cannot handle the other encodings still used by several of the largest Tibetan websites. Focusing on this problem, we analyzed the key technologies for building a general-purpose Tibetan search engine, such as Tibetan encoding detection and conversion and inverted index construction, and designed and implemented a prototype Tibetan search engine based on two open-source systems, Nutch and Solr.
Language: Chinese
Subjects: Natural Language Processing; Machine Translation; Chinese Language Information Processing (including Chinese character information processing)
Date available: 2012-12-20
Content type: Dissertation
Source URL: [http://ir.iscas.ac.cn/handle/311060/14758]
Collection: Institute of Software_National Engineering Research Center for Fundamental Software_Dissertations
Recommended citation:
GB/T 7714
Liu Huidan. Research on Tibetan Word Segmentation and Text Resource Mining [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2012.