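Title: A Parallel Substructure Search Method for Large-Scale Compound Sets (大规模化合物子结构并行检索方法)
Author: Jing Yinling (井银玲)
Degree type: Master's
Defense date: 2010-05-27
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Advisor: Li Xiaoxia (李晓霞)
Keywords: chemical substructure search; large-scale chemical structure search; cluster parallelism; chemical database; cheminformatics
Alternative title: Parallel chemical substructure searching of large scale chemical database on PC Cluster
Degree discipline: Applied Chemistry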
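Abstract (Chinese original, translated): Substructure searching of compounds is an indispensable tool in computer-aided drug design, spectroscopy, chemical databases, and related fields. Because substructure searching is an NP-complete problem, however, improving search efficiency and achieving an average search time that users find acceptable has attracted researchers' attention for many years. This thesis aims to meet the substructure-search needs arising from the future development of ChemDB Portal (http://www.chemdb-portal.cn/), a unified search engine for the chemistry deep web. It studies strategies for using cluster parallelism to perform substructure searches over collections of millions of compounds, and completes the following work:
1. On a small 5-node cluster, a master-slave parallel chemical substructure search system for ChemDB Portal was designed and implemented. Ten query structures were selected and searched against a database of 8 million chemical structures. Taking the initial single-node average search time of 34.1 min as the baseline, the average search time fell to 9.02 min with 5-node cluster parallelism, an average 3.78-fold improvement in search efficiency.
2. To balance the load across cluster nodes and make full use of each node's computing resources, the parallel system was optimized with even task partitioning and multithreading of the matching process. After optimization, the average search time on the 5-node cluster fell from 9.02 min to 2.75 min, a further 3.28-fold improvement, and a 12.4-fold improvement over the initial single-node system.
3. Dynamic monitoring of slave-node status and dynamic scheduling of computing tasks were added to the ChemDB Portal parallel substructure search system, improving its reliability and the flexibility of task scheduling.
4. The substructure search modules of the open-source chemical structure toolkits CDK and MX were compared. The tests show that, in terms of search efficiency, CDK is suited to query structures with fewer than 200 atoms, whereas MX is better suited to query structures with more than 200 atoms.
The cluster parallelism, even task partitioning, and multithreaded matching used in this thesis apply not only to the ChemDB Portal substructure search system but also to other applications that involve substructure searching over large compound collections.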
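The speedup figures quoted above can be checked with a few lines of arithmetic. The sketch below only recomputes ratios from the three average search times reported in the abstract; the variable and function names are illustrative and do not come from the thesis.

```python
# Average search times in minutes, as reported in the abstract.
t_single = 34.1     # initial system on a single node
t_cluster = 9.02    # 5-node cluster, before optimization
t_optimized = 2.75  # 5-node cluster, after task balancing and multithreading


def speedup(before, after):
    """Speedup is the ratio of the earlier time to the later time."""
    return before / after


cluster_gain = speedup(t_single, t_cluster)          # effect of cluster parallelism
optimization_gain = speedup(t_cluster, t_optimized)  # effect of the optimization step
overall_gain = speedup(t_single, t_optimized)        # combined effect

print(round(cluster_gain, 2), round(optimization_gain, 2), round(overall_gain, 1))
```

The overall gain is the product of the two partial gains, which is why a 3.78-fold and a 3.28-fold improvement combine to roughly 12.4-fold.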
English abstract: Chemical substructure searching is equivalent to the subgraph isomorphism problem in graph theory, which is known to be NP-complete, and the matching has to be carried out by traversal at the atom-by-atom level. Finding subgraph isomorphism algorithms that run in acceptable average time has occupied researchers for many years and remains an active research topic. The aim of this thesis is to reduce the time required for chemical substructure matching, particularly in large databases containing millions of chemical structures. Building on the chemical substructure search of ChemDB Portal (http://www.chemdb-portal.cn/), a chemistry deep web search engine, the thesis completed the following work. A parallel substructure searching system for large-scale chemical databases was established on a 5-node PC cluster in master-slave mode. After the system was optimized by balancing computing tasks among the slave nodes and by multithreading within each node, its performance was tested with 10 representative queries against a chemical database of 8 million structures. The results show an average 12.4-fold speedup of the parallel program over the serial one running on a single node of the cluster. Parallelizing substructure searching on a PC cluster can therefore significantly improve the efficiency of searching large-scale databases. The parallel system of ChemDB Portal can also monitor the states of the slave nodes and schedule them dynamically, which improves its reliability and flexibility.
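The combination of even task partitioning across slave nodes and multithreading within a node can be illustrated with a minimal single-process sketch. Everything here is an assumption for illustration: the "nodes" are simulated by a loop rather than real cluster machines, `matches` is a plain substring test on toy strings standing in for atom-by-atom matching, and none of the function names come from the thesis.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def matches(query, structure):
    # Placeholder for atom-by-atom substructure matching (assumption):
    # a substring test on toy strings, NOT a chemically valid match.
    return query in structure


def search_chunk(query, chunk, threads):
    """Work done by one (simulated) slave node: split its share across threads."""
    parts = split_evenly(chunk, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda part: [s for s in part if matches(query, s)], parts)
        return [hit for part in results for hit in part]


def search(query, database, nodes=4, threads=4):
    """Master: give every node an equal share of the database and merge the hits."""
    hits = []
    for chunk in split_evenly(database, nodes):
        hits.extend(search_chunk(query, chunk, threads))
    return hits


toy_db = ["CCO", "c1ccccc1O", "CCN", "CC(=O)O", "OCC"]
print(search("CC", toy_db, nodes=2, threads=2))
```

In a real deployment each chunk would be sent to a separate machine and the matcher would be a cheminformatics library call; the sketch only shows how equal-sized shares keep every worker busy for a similar amount of time.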
The atom-by-atom matching performance of chemical substructure searching based on two open-source software packages (CDK and MX) was tested with four query sets. Each set contains 500 structures, grouped by atom count: 1 to 50, 50 to 100, 100 to 200, and more than 200 atoms. The test results show that CDK is much more efficient than MX when the query structure has fewer than 200 atoms, whereas MX is more suitable for query structures with more than 200 atoms. These tests should be helpful in developing an integrated chemical substructure searching tool that combines the two packages. The methods used in this thesis can be used in the chemical substructure searching system of ChemDB Portal and can be applied to other large-scale chemical substructure searching applications as well.
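One way to picture the integrated approach suggested above is a small dispatcher that picks a package from the query's atom count. This is a hypothetical sketch, not the thesis implementation: the function name is invented, the returned strings are labels rather than calls into CDK or MX, and the abstract does not say which package to prefer at exactly 200 atoms, so sending that case to MX is an assumption.

```python
ATOM_THRESHOLD = 200  # crossover reported in the abstract


def choose_package(query_atom_count):
    """Return the label of the package expected to match faster for this query size."""
    if query_atom_count < 1:
        raise ValueError("a query structure needs at least one atom")
    # Fewer than 200 atoms: CDK was reported to be much more efficient.
    if query_atom_count < ATOM_THRESHOLD:
        return "CDK"
    # More than 200 atoms: MX was reported to be more suitable.
    # Exactly 200 atoms is not covered by the abstract; MX is an assumption.
    return "MX"


for atoms in (30, 150, 350):
    print(atoms, choose_package(atoms))
```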
Date made public: 2013-09-17
Pages: 76
Content type: Thesis
Source URL: [http://ir.ipe.ac.cn/handle/122111/1532]
Collection: Institute of Process Engineering_Institute (batch import)
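Recommended citation
GB/T 7714
Jing Yinling (井银玲). A Parallel Substructure Search Method for Large-Scale Compound Sets (大规模化合物子结构并行检索方法)[D]. Beijing. Graduate University of Chinese Academy of Sciences. 2010.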