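Title: Research and Application of Discriminative Training of Acoustic Models in Automatic Speech Recognition (自动语音识别中声学模型鉴别性训练的研究与应用)
Author: Xu Ran (徐燃)
Degree: Doctoral
Defense date: 2009-05-23
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place of conferral: Institute of Acoustics
Keywords: discriminative training; acoustic model; automatic speech recognition; Minimum Phone Error (MPE); trajectory model
Alternative title: Discriminative Training of Acoustic Models and Its Applications in Automatic Speech Recognition
Major: Signal and Information Processing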
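Abstract (translated from the Chinese): Discriminative training of acoustic models has in recent years become a widely adopted optimization technique in mainstream speech recognition systems. Compared with conventional maximum likelihood estimation of acoustic models, discriminative training depends less on the model assumptions. By optimizing an objective function tied to the recognition accuracy of the system, it tries to learn more class-discriminative information from both the positive and the negative training samples of a limited training set, and thereby seeks better estimates of the acoustic model parameters under realistic conditions. This thesis studies the theory and optimization methods of several mainstream discriminative training criteria, together with their implementation and application in various speech recognition tasks, and makes the following contributions:
1. The thesis analyzes the problems in the phone accuracy computation of conventional MPE training. Starting from the observation that in state-clustered acoustic models the physical states behind a phone are its real physical carrier, it proposes MPE-SC, a phone accuracy computation based on aligning the physical state sequences of phones. MPE-SC improves the phone accuracy computation of conventional MPE training and yields consistent gains on different test sets.
2. For the problem of acoustic likelihood scaling in MPE training, the thesis proposes MPE-PPS, which introduces a posterior probability smoothing factor. While preserving the constraint between the dynamic ranges of the acoustic likelihoods and the language model probabilities, MPE-PPS offers a more flexible and effective way to adjust the distribution of phone posterior probabilities. In essence, the smoothing factor gives better control over the confusability introduced during training, which improves generalization on the test set. The gains from MPE-SC and MPE-PPS are partly additive: in the experiments reported here, combining the two reduces the error rate on the Mandarin CTS test sets by a relative 2.48% to 3.31% compared with conventional MPE training.
3. The thesis proposes a fast method for computing the very high-dimensional vector of Gaussian posterior probabilities in the reference acoustic space. Using two-step model clustering, selection of codewords for fast Gaussian computation, an optimal number of candidates, and Gaussian likelihood pruning, it controls the sparsity of the vector precisely and greatly reduces the multiplications that involve it, which makes fMPE training feasible in a variety of speech recognition systems.
4. The thesis analyzes how the precision of the reference acoustic space affects fMPE training, and proposes the fMPE-DCMT training method, which uses a discriminatively trained acoustic model as the reference acoustic space to improve the fMPE model. On a small-scale, acoustic-model-only recognition task, it improves on the conventional fMPE model by a relative 7.5%.
5. The thesis extends MPE/fMPE training to various practical grammar-constrained speech recognition systems. Different strategies for generating the training lattices are applied according to the characteristics of each system, so as to exploit as much as possible of the optimization potential of discriminative training. The thesis also extends MPE/fMPE training from monolingual LVCSR systems to a Mandarin-English bilingual LVCSR system and, based on the within-language and cross-language error rate improvements, analyzes what discriminative training contributes to bilingual and multilingual recognition.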
English abstract: Discriminative training of acoustic models has achieved great success in state-of-the-art automatic speech recognition systems. In contrast to traditional Maximum Likelihood Estimation training, discriminative training criteria rely less on the acoustic model assumptions, and learn from both the correct transcriptions and the competing hypotheses during training to give better parameter estimates for the acoustic models. In this thesis, we first discuss in detail the theoretical foundations of several predominant discriminative training criteria, then introduce the improvements we made to these criteria, and finally present the applications of these discriminative training methods in various automatic speech recognition systems. The innovations of this thesis include:
1. This thesis proposes an improved phone accuracy calculation method for MPE training, named MPE-SC. MPE-SC builds on the fact that in state-clustered acoustic models the physical state sequence is the real representation of a phoneme, so it performs the comparison at the state level to obtain more accurate phone accuracies. Experimental results show that MPE-SC consistently improves the recognition results over traditional MPE on different test sets.
2. This thesis introduces a Posterior Probability Smoothing (PPS) factor into the calculation of phone posterior probabilities during MPE training; the resulting method is named MPE-PPS. The essence of MPE-PPS is to better control the confusable data introduced in MPE training, so as to give better generalization on the test sets. It provides a more flexible probability-scaling mechanism for both the acoustic likelihoods and the language model probabilities. Experimental results show that MPE-SC and MPE-PPS can be used together to further improve the performance of MPE models. On our Mandarin CTS test sets, a relative character error rate reduction of 2.48% to 3.31% is achieved compared to the traditional MPE models.
3. This thesis proposes a fast algorithm for computing the super-vector of Gaussian posterior probabilities in fMPE training. The algorithm combines two-pass clustering of the Gaussians, selection of N-best code-words, and a likelihood pruning beam for the Gaussians to precisely control the sparsity of the super-vector, which dramatically reduces the computational cost and makes fMPE practicable in various speech recognition systems.
4. This thesis proposes the fMPE-DCMT training method, which adopts discriminatively trained models as the reference acoustic space for fMPE training to improve its modeling ability. On a small speech recognition task, fMPE-DCMT improves the performance of the traditional fMPE model by a relative 7.5%.
5. This thesis extends the application of discriminative training to various practical grammar-constrained recognition systems. Different lattice generation methods are adopted according to the features of each application system to improve system performance. We also extend MPE/fMPE from traditional monolingual LVCSR to Mandarin-English LVCSR tasks. The within- and cross-language substitution error rates introduced in this thesis show that discriminative training can improve both the within- and cross-language discriminability of acoustic models and achieve greater gains on bilingual or multilingual recognition tasks.
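The sparsification idea in item 3 of the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it skips the two-pass clustering and code-word lookup and scores every Gaussian exhaustively, and it only shows how an N-best limit and a likelihood beam together bound the number of non-zero posterior entries. The function name `sparse_gaussian_posteriors`, the diagonal-covariance assumption, and the parameters `n_best` and `beam` are all hypothetical choices made for this illustration.

```python
import math


def sparse_gaussian_posteriors(x, means, variances, weights, n_best=2, beam=10.0):
    """Return {gaussian_index: posterior} for the Gaussians kept for frame x.

    Hypothetical helper (not from the thesis): diagonal-covariance Gaussians,
    exhaustive scoring, then N-best selection and likelihood-beam pruning.
    """
    # Weighted log-likelihood of x under each diagonal-covariance Gaussian.
    log_lik = []
    for mean, var, w in zip(means, variances, weights):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mean, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_lik.append(ll)

    # N-best selection: keep at most n_best highest-scoring Gaussians.
    order = sorted(range(len(log_lik)), key=lambda i: log_lik[i], reverse=True)
    order = order[:n_best]

    # Likelihood beam: drop candidates scoring more than `beam` below the best.
    best = log_lik[order[0]]
    keep = [i for i in order if log_lik[i] >= best - beam]

    # Renormalise over the survivors so the kept posteriors sum to one.
    scores = [math.exp(log_lik[i] - best) for i in keep]
    total = sum(scores)
    return {i: s / total for i, s in zip(keep, scores)}
```

Every index missing from the returned dictionary is treated as an exact zero in the posterior vector, so any later product with that vector only touches the kept entries. Tightening `beam` or lowering `n_best` makes the vector sparser at the cost of a coarser approximation.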
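Language: Chinese
Release date: 2011-05-07
Pages: 117
Content type: Dissertation
Source URL: http://159.226.59.140/handle/311008/320
Collection: Institute of Acoustics_Doctoral and Master's Theses of the Institute of Acoustics_Doctoral and Master's Theses, 1981-2009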
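Recommended citation
GB/T 7714
Xu Ran. Research and Application of Discriminative Training of Acoustic Models in Automatic Speech Recognition [D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2009.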