Research on Key Technologies in Peer-to-Peer Network Retrieval Systems

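Title	Research on Key Technologies in Peer-to-Peer Network Retrieval Systems (对等网络检索系统中关键技术的研究)
Author	You Jiali (尤佳莉)
Degree type	Doctoral
Defense date	2008-06-03
Degree-granting institution	Institute of Acoustics, Chinese Academy of Sciences
Place of conferral	Institute of Acoustics
Keywords	peer-to-peer network; language classification; word-structure information; Web information; multi-source information fusion; semantic expansion
Alternative title	Research on Key Technologies for Information Retrieval in Peer-to-Peer Network
Degree discipline	Signal and Information Processing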
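Chinese abstract (translated)	In recent years the traditional client/server model, constrained by hardware and architecture, has been unable to keep up with the rapid growth of the Internet. Peer-to-peer networks (P2P networks), a new Internet application model, have developed and spread quickly. Because such networks are decentralized and self-organizing, how to perform accurate and effective information retrieval in them has become a new research topic. This thesis studies key technologies for information retrieval in P2P networks. The main contents and results are as follows:
1. A P2P overlay network structure based on language clusters is designed. P2P networks have grown worldwide and their users speak different languages. Unless the user population is partitioned effectively and search services are targeted accordingly, it is hard to obtain a practical, efficient retrieval system. This thesis proposes a multi-layer P2P architecture based on language clusters. In this architecture each query is submitted, according to the posterior probabilities produced by language identification, to the language cluster most similar to the query, which effectively shrinks the search space and lowers node load. Experiments with users querying in 6 different languages show that, at the same communication cost (for example, 400 messages), the average search success rate rises from 6.1% with flooding search to 54.5%.
2. A language classification algorithm for proper nouns is proposed that fuses multiple chunk features with AdaBoost. In traditional data-driven language classification, data sparseness prevents statistical language models from learning long-range information effectively, and modeling only the relations between letters makes it hard to distinguish languages written in the Latin alphabet. This thesis therefore mines the word-structure information of proper nouns from several angles, extracts letter chunks that are characteristic of each class, and builds a statistical model for each kind of chunk, which alleviates the overtraining caused by data sparseness. In addition, following statistical learning theory, AdaBoost is used to fuse the chunk models and boost the classification ability of the individual classifiers. The model significantly improves the accuracy of proper-noun language classification for Latin-alphabet languages, which helps nodes and users verify the language of keywords at query time. In experiments on English, German, French and Portuguese, fusing multiple chunk models raises the average accuracy from 75% (letter-based N-gram model) to 78.4%, a 13.6% reduction in classification errors. Different fusion classifiers are also compared. AdaBoost gives the best fusion performance, significantly better than voting, decision trees and Gaussian mixture models.
3. A two-stage language classification system and a proper-noun language classification algorithm that fuses multiple information sources are proposed. To suit the characteristics of Latin-alphabet languages, a multi-stage system identifies the language cluster and the language class in separate steps. To address the similar word formation of Latin-alphabet languages, the concept of word popularity information is introduced, and robust features are extracted by mining the Web. These features serve as a new information source that complements word-structure information and strengthens the classifier. Experiments show that 98.6% of clusters (Chinese, Japanese and Latin-alphabet languages) are identified correctly. For the Latin-alphabet languages within one cluster (English, German, French and Portuguese), fusing multiple information sources raises accuracy from 75% (letter-based N-gram model) to 86.3%, correcting about 45.2% of the classification errors.
4. An algorithm is proposed that expands media file descriptors using Web information. Descriptive information for media files is generally scarce in P2P networks, so related Web pages are mined and analyzed statistically to extract text that characterizes the media content and enrich the resource descriptors. Experiments show that the algorithm clearly improves the accuracy of media file retrieval: compared with unexpanded descriptors, the mean reciprocal rank rises from 0.09 to 0.23, improving the user experience.
5. A P2P network structure based on semantic expansion is proposed. Because media descriptions are insufficient and distributed hash tables (DHTs) cannot support semantic queries, a semantic feature mapping algorithm for media files is proposed, together with a feature-space vector search algorithm based on a DHT and a multi-layer ring of semantic skip lists. Simulations show that the algorithm effectively expands the feature information of media files and that its retrieval precision is close to that of traditional centralized search, so it has practical value.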
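The error-reduction percentages quoted in the abstract can be checked against the quoted accuracies. The sketch below assumes that "error reduction" means the relative drop in error rate (the record itself does not define the term); the function name is illustrative and does not come from the thesis.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the baseline's errors that the improved system removes."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# Accuracies quoted in the abstract, written as fractions.
# Letter N-gram baseline (75%) vs. multi-chunk fusion (78.4%).
print(round(100 * relative_error_reduction(0.75, 0.784), 1))  # 13.6
# Letter N-gram baseline (75%) vs. multi-source fusion (86.3%).
print(round(100 * relative_error_reduction(0.75, 0.863), 1))  # 45.2
```

Under this assumed definition, both quoted accuracy pairs reproduce the reported reductions of 13.6% and 45.2%.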
English abstract	Owing to limitations of hardware and system architecture, the traditional client/server model cannot keep pace with the rapid growth of the Internet. The peer-to-peer (P2P) network is a new Internet architecture that has developed considerably in recent years and is becoming increasingly popular worldwide. Because P2P networks are decentralized and symmetric, performing information retrieval in them accurately and efficiently is a new research topic. This thesis investigates several key technologies for P2P information retrieval. The main contents and achievements are as follows:
1. A language-group-based P2P overlay network is designed. The network spans countries all over the world, and its users speak different languages. Without services tailored to different languages, it is hard to obtain a practical and satisfactory system. This thesis proposes a language-group-based multi-layer P2P structure. In this system each query is routed to the language group with the highest posterior probability, as given by an automatic language classification algorithm, so the search space is reduced markedly. In experiments with queries in 6 languages, at the same communication cost (e.g., 400 messages), the average search success rate improves from 6.1% with flooding search to 54.5%.
2. An algorithm that combines multiple chunk models with AdaBoost is proposed to identify the languages of named entities. In traditional data-driven methods, data sparseness prevents statistical models from learning some long-distance information. If only the relationships among letters are modeled, it is hard to discriminate between different Latin languages. To solve this problem, chunks representing different aspects of morphological information are extracted by several methods, and chunk-based language models are used to alleviate data sparseness. In addition, Adaptive Boosting (AdaBoost) is used to combine the chunk models, which enhances the classification capability of any single model. In experiments, four Latin languages are classified. Compared with the letter-based N-gram model, the multi-chunk combination improves accuracy from 75% to 78.4%, an error reduction of 13.6%. Four combining classifiers are also compared (Voting, Classification and Regression Tree, Gaussian Mixture Model and AdaBoost), and AdaBoost performs best.
3. A two-stage framework for language identification and an algorithm that combines multiple information sources are proposed for identifying the languages of named entities. With the two-stage system, language clusters and language classes are identified separately. In addition, a new concept named popularity information is proposed, and corresponding robust features are extracted from the Web. This new information source complements morphological information and strengthens the identification model. In experiments, 98.6% of clusters (Chinese Pinyin, Japanese Romaji and Latin languages) are identified correctly. For the Latin languages within one cluster, combining multiple information sources raises accuracy from 75% (letter N-gram model) to 86.3%, removing 45.2% of the errors.
4. An algorithm for expanding the descriptors of media files using Web information is proposed. Because media files in P2P networks lack descriptors, the algorithm extracts useful content-based information through Web page mining and learning, which enriches the descriptors considerably. Simulations show that, compared with no descriptor expansion, the mean reciprocal rank increases from 0.09 to 0.23, which improves the user experience markedly.
5. A P2P overlay network with semantic expansion is proposed. Because it is hard to search media files by their content keywords, both a semantic expansion algorithm and a feature-space search algorithm based on a DHT and multiple rings are proposed. Simulations verify that the algorithm expands the semantic features and that its search precision is comparable with that of centralized search.
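Mean reciprocal rank, the metric quoted for descriptor expansion, is the average over queries of 1/rank of the first relevant result. The sketch below uses that standard definition. The record does not say how the thesis handled queries with no relevant result, so scoring them as 0 is an assumption, and the toy data and function names are invented for illustration.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0  # assumption: a query with no relevant hit contributes 0


def mean_reciprocal_rank(runs: Sequence[Sequence[Hashable]],
                         relevant_sets: Sequence[Set[Hashable]]) -> float:
    """Average the reciprocal rank over all queries."""
    if len(runs) != len(relevant_sets):
        raise ValueError("one relevant set is required per query")
    if not runs:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevant_sets))
    return total / len(runs)


# Toy data, invented for illustration only (not from the thesis).
runs = [["a", "b", "c"], ["x", "y"], ["p", "q"]]
relevant_sets = [{"b"}, {"x"}, {"z"}]
print(mean_reciprocal_rank(runs, relevant_sets))  # (1/2 + 1 + 0) / 3 = 0.5
```

Read this way, a rise from 0.09 to 0.23 would mean the first relevant file moves from about rank 11 to about rank 4, if every query behaved like the average.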
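Language	Chinese
Date made public	2011-05-07
Pages	142
Content type	Thesis
Source URL	[http://159.226.59.140/handle/311008/318]
Collection	Institute of Acoustics_IOA Doctoral and Master's Theses_Doctoral and Master's Theses 1981-2009
Recommended citation (GB/T 7714)	You Jiali. Research on Key Technologies in Peer-to-Peer Network Retrieval Systems [D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2008.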
