INTERSPEECH 2014, 14-18 September 2014, Singapore

Robust Speech Recognition with Speech Enhanced Deep Neural Networks

Jun Du1, Qing Wang1, Tian Gao1, Yong Xu1, Lirong Dai1, Chin-Hui Lee2

1University of Science and Technology of China, Hefei, Anhui, P.R. China
2Georgia Institute of Technology, Atlanta, GA 30332-0250, USA

{jundu,lrdai}@ustc.edu.cn, {xiaosong,gtian09,xuyong62}@mail.ustc.edu.cn, chl@ece.gatech.edu

Abstract

We propose a signal pre-processing front-end that enhances speech with deep neural networks (DNNs) and uses the enhanced speech features directly to train hidden Markov models (HMMs) for robust speech recognition. As a comprehensive study, we examine its effectiveness for different acoustic features, acoustic models, and training-testing combinations. Tested on the Aurora4 task, the experimental results indicate that our proposed framework consistently outperforms state-of-the-art speech recognition systems in all evaluation conditions. To the best of our knowledge, this is the first showcase on the Aurora4 task that yields performance gains using only an enhancement pre-processor, without any adaptation or compensation post-processing, on top of the best DNN-HMM system. The word error rate reduction from the baseline system is up to 50% for clean-condition training and 15% for multi-condition training. We believe the system performance could be improved further by incorporating post-processing techniques that work coherently with the proposed enhancement pre-processing scheme.

Index Terms: robust speech recognition, speech enhancement, clean-condition training, multi-condition training, hidden Markov models, deep neural networks

1. Introduction

With the fast development of the mobile internet, speech-enabled applications using automatic speech recognition (ASR) are becoming increasingly popular.
However, noise robustness remains one of the critical issues preventing ASR systems from being widely used in the real world. Historically, most ASR systems use Mel-frequency cepstral coefficients (MFCCs) and their derivatives as speech features, and a set of Gaussian mixture continuous density HMMs (CDHMMs) for modeling basic speech units. Many techniques [1, 2, 3] have been proposed to handle the difficult problem of mismatch between training and application conditions. One type of approach to dealing with this problem, and the topic of this study, is the data-driven approach based on stereo data. SPLICE [4] is one successful showcase: a feature compensation approach that uses environment selection and stereo data to learn the mapping function between clean and noisy speech via Gaussian mixture models (GMMs). Similar approaches were later proposed in [5, 6]. In [7], a stereo-based stochastic mapping (SSM) technique is presented which outperforms SPLICE. The basic idea of SSM is to build a GMM for the joint distribution of the clean and noisy speech by using stereo data. To relax the constraint of recorded stereo data, we proposed using synthesized pseudo-clean features, generated by HMM-based synthesis, to replace the ideal clean features from one of the stereo channels in SPLICE and SSM [8, 9].

The recent breakthrough of deep learning [10, 11], especially the application of deep neural networks (DNNs) in the ASR area [12, 13, 14], marks a new milestone: DNN-HMM has replaced GMM-HMM as the state of the art for acoustic modeling. It is believed that the first several layers of a DNN extract highly nonlinear and discriminative features that are robust to irrelevant variabilities. This makes DNN-HMM inherently noise robust to some extent, as verified on the Aurora4 database in [15].
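To make the earlier stereo-data idea concrete, the core of an SSM-style mapping can be sketched with a single joint Gaussian in place of the GMM of [7]; the data, dimensions, and noise model below are invented purely for illustration, not taken from the paper's experiments:

```python
import numpy as np

# Toy sketch of stereo-based mapping (SSM-style) with a SINGLE joint
# Gaussian instead of a full GMM -- a deliberate simplification.
# The stereo "clean/noisy" data here is synthetic and purely illustrative.

rng = np.random.default_rng(0)
D, N = 4, 5000                          # feature dim and frame count (made up)
x_clean = rng.normal(size=(N, D))       # clean channel of the stereo data
y_noisy = x_clean + 0.5 * rng.normal(size=(N, D))   # noisy channel

# Fit the joint distribution of z = [x, y] from the stereo data.
z = np.hstack([x_clean, y_noisy])
mu = z.mean(axis=0)
cov = np.cov(z, rowvar=False)
mu_x, mu_y = mu[:D], mu[D:]
cov_xy, cov_yy = cov[:D, D:], cov[D:, D:]

# MMSE estimate of the clean features: E[x|y] = mu_x + S_xy S_yy^-1 (y - mu_y)
x_hat = mu_x + (y_noisy - mu_y) @ np.linalg.solve(cov_yy, cov_xy.T)

mse_mapped = np.mean((x_hat - x_clean) ** 2)
mse_noisy = np.mean((y_noisy - x_clean) ** 2)
print(mse_mapped < mse_noisy)           # True: mapping reduces the distortion
```

Under a full GMM, the same conditional-mean formula is applied per mixture component and the component estimates are combined with posterior weights.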
In [16, 17], several conventional front-end techniques yield further performance gains on top of a DNN-HMM system for tasks with a small vocabulary or constrained grammar. But on large vocabulary tasks, a traditional enhancement approach as in [18], although effective for GMM-HMM systems, may even degrade the performance of a DNN-HMM system with log Mel-filterbank (LMFB) features under the well-matched training-testing condition [15]. Meanwhile, the data-driven approaches using stereo data via recurrent neural networks (RNNs) and DNNs proposed in [19, 20] can improve recognition accuracy on small vocabulary tasks. More recently, masking techniques [21, 22, 23] have been successfully applied to noisy speech recognition. In [23], an approach using time-frequency masking combined with feature mapping via DNN and stereo data claims the best results on the Aurora4 database. Unfortunately, for multi-condition training using DNN-HMM with LMFB features, this approach still degrades performance, similar to the conclusion in [15].

In this study, inspired by our recent progress on speech enhancement via a DNN used as a regression model [24], we further verify its effectiveness for noisy speech recognition. First, a DNN is adopted as a pre-processor which directly estimates the complicated nonlinear mapping from observed noisy speech with acoustic context to the desired clean speech in the log-power spectral domain. Second, we propose global variance equalization (GVE) to alleviate the over-smoothing problem of the DNN-based regression model, implemented as a post-processing operation that linearly scales the log-power spectral features. Third, an exhaustive experimental study compares different acoustic features (MFCC and LMFB), acoustic models (GMM-HMM and DNN-HMM), and training-testing conditions (high-mismatch, mid-mismatch, and well-matched).
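A minimal sketch of this enhancement front-end is shown below. The network weights are random (untrained), and the context size, layer sizes, and unit reference variance are invented assumptions; a real system would load the trained DNN parameters of [24] and compute the reference global variance from clean training data:

```python
import numpy as np

# Sketch of the enhancement front-end: a feed-forward DNN maps a window of
# noisy log-power spectral frames to one clean frame, then global variance
# equalization (GVE) linearly rescales the output per dimension.
# All weights and dimensions here are illustrative, not trained values.

rng = np.random.default_rng(1)
D = 257          # log-power spectral bins (e.g. from a 512-point FFT)
CONTEXT = 3      # frames on each side -> 7-frame input window (assumed)
H = 128          # hidden units (a real model would be much larger)

W1 = 0.01 * rng.normal(size=((2 * CONTEXT + 1) * D, H))
b1 = np.zeros(H)
W2 = 0.01 * rng.normal(size=(H, D))
b2 = np.zeros(D)

def enhance(noisy):
    """Map (T, D) noisy log-power spectra to (T, D) enhanced spectra."""
    T = noisy.shape[0]
    padded = np.pad(noisy, ((CONTEXT, CONTEXT), (0, 0)), mode="edge")
    windows = np.stack([padded[t:t + 2 * CONTEXT + 1].ravel()
                        for t in range(T)])
    hidden = np.maximum(0.0, windows @ W1 + b1)   # ReLU hidden layer
    return hidden @ W2 + b2                        # linear output layer

def gve(enhanced, gv_ref):
    """Linearly scale each dimension so its variance matches gv_ref."""
    mean = enhanced.mean(axis=0)
    gv_est = enhanced.var(axis=0) + 1e-8           # avoid divide-by-zero
    return mean + np.sqrt(gv_ref / gv_est) * (enhanced - mean)

noisy = rng.normal(size=(50, D))     # stand-in for real log-power spectra
gv_ref = np.ones(D)                  # reference GV from clean data (assumed)
out = gve(enhance(noisy), gv_ref)
print(out.shape)                     # (50, 257)
```

The GVE step counteracts over-smoothing: a regression DNN tends to produce outputs with smaller variance than the clean targets, and scaling by the square root of the variance ratio restores the reference dynamic range.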
Our approach achieves promising results on the Aurora4 database for all testing cases. Furthermore, compared with the enhancement approaches in [15, 23], this is the first time a performance gain has been obtained for multi-condition training with LMFB features and DNN-HMM on the Aurora4 database, which indicates that the proposed front-end DNN can further improve noise robustness on top of DNN-HMM systems under the well-matched condition for large vocabulary tasks.

Copyright © 2014 ISCA