Research on automatic speech recognition based on a DL-T and transfer learning

Chinese Journal of Engineering, Vol. 43, No. 3: 433-441, March 2021
https://doi.org/10.13374/j.issn2095-9389.2020.01.12.001; http://cje.ustb.edu.cn

ZHANG Wei 1,2), LIU Chen 1,2), FEI Hong-bo 1,2), LI Wei 3), YU Jing-hu 1,2), CAO Yi 1,2)
1) School of Mechanical Engineering, Jiangnan University, Wuxi 214122, China
2) Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Wuxi 214122, China
3) Suzhou Institute of Industrial Technology, Suzhou 215104, China
Corresponding author: CAO Yi, E-mail: caoyi@jiangnan.edu.cn

ABSTRACT (Chinese version, translated) To address the high prediction error rate and slow convergence of RNN-T in speech recognition, this paper proposes an acoustic modeling method based on DL-T. First, the RNN-T acoustic model is introduced. Second, DenseNet and LSTM networks are combined into a new acoustic modeling method, DL-T. It extracts high-dimensional information from the raw speech, which strengthens feature reuse and alleviates gradient problems so that information propagates more easily through deep layers; as a result, it offers both a low prediction error rate and fast convergence. Third, to further improve the accuracy of the acoustic model, a transfer learning method suited to DL-T is proposed. Finally, to validate these methods, speech recognition experiments were carried out with the DL-T acoustic model on the Aishell-1 dataset. The results show that DL-T reduces the prediction error rate by 12.52% relative to RNN-T, and that the final error rate of the model reaches 10.34%. DL-T therefore significantly improves on RNN-T in both prediction error rate and convergence speed.

KEY WORDS deep learning; speech recognition; acoustic model; DL-T; transfer learning
Classification number TN912.3

ABSTRACT (English version) Speech is a natural and effective means of communication and is widely used in information communication and human-machine interaction. In recent years, various algorithms have been used to achieve efficient communication. The main purpose of automatic speech recognition (ASR), one of the key technologies in this field, is to convert the analog signal of input speech into the corresponding text. ASR systems fall into two categories: those based on the hidden Markov model (HMM) and those based on end-to-end (E2E) models. Compared with the former, E2E models have a simpler modeling process and are easier to train, so research has moved toward developing E2E models for ASR. HMM-based speech recognition technologies, by contrast, have disadvantages in terms of prediction error rate, generalization ability, and convergence speed. This study therefore starts from the recurrent neural network-transducer (RNN-T), a typical E2E acoustic model that can model the dependencies between outputs and can be optimized jointly with a language model (LM). A new acoustic model, DL-T, based on DenseNet (dense convolutional network), LSTM (long short-term memory), and the Transducer, is then proposed to solve the problems of high prediction error rate and slow convergence speed in the RNN-T. First, the RNN-T is briefly introduced. Then, combining the merits of DenseNet and LSTM, the DL-T acoustic model is proposed. The DL-T can extract high-dimensional speech features and alleviate gradient problems, and it has the advantages of a low character error rate (CER) and fast convergence speed.

Received: 2020-01-12
Funding: National Natural Science Foundation of China (51375209); Jiangsu Province "Six Talent Peaks" Program (ZBZZ-012); Jiangsu Postgraduate Research and Innovation Program (KYCX18_0630, KYCX18_1846); Programme of Introducing Talents of Discipline to Universities (B18027)
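
The abstract reports results as a character error rate (CER) and as a relative reduction in error rate. The following is a minimal sketch of how such figures are conventionally computed, assuming CER is the character-level edit distance divided by the reference length; the function names and the sample strings and rates are illustrative assumptions, not taken from the paper, whose own scoring procedure is not shown here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error-rate reduction of `improved` with respect to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcripts: one substituted character out of six.
    print(cer("今天天气很好", "今天天汽很好"))
    # Hypothetical error rates, chosen only to show the arithmetic.
    print(relative_reduction(0.20, 0.15))
```

A relative reduction compares two error rates as a fraction of the baseline, so it is not the same quantity as the absolute difference between them; the 12.52% figure in the abstract is of the relative kind.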
