INTERSPEECH 2020, October 25–29, 2020, Shanghai, China
http://dx.doi.org/10.21437/Interspeech.2020-1275

Densely Connected Time Delay Neural Network for Speaker Verification

Ya-Qi Yu, Wu-Jun Li
National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China
yuyq@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Time delay neural network (TDNN) has been widely used in speaker verification tasks. Recently, two TDNN-based models, extended TDNN (E-TDNN) and factorized TDNN (F-TDNN), have been proposed to improve the accuracy of vanilla TDNN. But compared with vanilla TDNN, E-TDNN and F-TDNN increase the number of parameters due to their deeper networks. In this paper, we propose a novel TDNN-based model, called densely connected TDNN (D-TDNN), by adopting bottleneck layers and dense connectivity. D-TDNN has fewer parameters than existing TDNN-based models. Furthermore, we propose an improved variant of D-TDNN, called D-TDNN-SS, which employs multiple TDNN branches with short-term and long-term contexts. D-TDNN-SS integrates the information from the multiple TDNN branches with a newly designed channel-wise selection mechanism called statistics-and-selection (SS). Experiments on VoxCeleb datasets show that both D-TDNN and D-TDNN-SS outperform existing models and achieve state-of-the-art accuracy with fewer parameters, and that D-TDNN-SS achieves better accuracy than D-TDNN.

Index Terms: speaker verification, time delay neural network, dense connectivity, attention

1. Introduction

The task of speaker verification is to decide whether or not to accept a speaker's claim of identity, usually by scoring the similarity between enrollment and test utterances. Existing speaker verification methods typically have two steps. The first step, called speaker embedding, aims to extract fixed-length vectors from variable-length utterances. The second step, called scoring, aims to calculate the similarity between speaker embedding vectors. Representative scoring methods include cosine similarity and probabilistic linear discriminant analysis (PLDA) [1].

In the last decade, the i-vector [2] used to be predominant for speaker embedding. Recently, more and more research has focused on deep neural network (DNN)-based speaker embedding models [3, 4, 5, 6, 7]. Some works [5, 8] have empirically verified that DNN-based speaker embedding models trained on large-scale datasets can outperform conventional speaker embedding models like i-vector.

Time delay neural network (TDNN) is one of the most popular DNNs for speaker embedding, and has shown state-of-the-art performance on a wide range of datasets [4, 5]. The vanilla TDNN [5] consists of three consecutive TDNN layers, each of which is a temporal one-dimensional convolutional neural network, followed by two consecutive feed-forward neural network (FNN) layers and then a statistics pooling layer.
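To make these building blocks concrete, the following is a minimal PyTorch sketch, not the authors' implementation (which is available in the repository linked in the footnote below); channel widths, context sizes, and dilations are illustrative. A TDNN layer is a dilated 1-D convolution over time, and statistics pooling concatenates the per-channel mean and standard deviation across frames.

```python
# Minimal sketch (PyTorch assumed; not the authors' code) of the two
# building blocks of the vanilla TDNN described above.
import torch
import torch.nn as nn


class TDNNLayer(nn.Module):
    """One TDNN layer: a temporal 1-D convolution followed by ReLU."""

    def __init__(self, in_channels, out_channels, context_size, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size=context_size, dilation=dilation)

    def forward(self, x):  # x: (batch, in_channels, frames)
        return torch.relu(self.conv(x))


def statistics_pooling(x):
    """Map variable-length frame-level features (batch, channels, frames)
    to a fixed-length utterance-level vector (batch, 2 * channels) by
    concatenating the per-channel mean and standard deviation."""
    return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)
```

Stacking three such TDNN layers with growing temporal context, followed by two FNN layers and the pooling above, matches the vanilla structure described in the text; the fixed-length speaker embedding is then taken from a layer after pooling.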
Recently, two new TDNN-based models, extended TDNN (E-TDNN) [9] and factorized TDNN (F-TDNN) [10], have been proposed for speaker embedding. E-TDNN extends the network depth of vanilla TDNN by adopting four TDNN layers and adding one FNN layer after each TDNN layer. F-TDNN factorizes the weight matrix of each TDNN layer as a product of two smaller matrices to reduce the number of parameters in each layer, and then adopts more channels and a deeper network than vanilla TDNN. Compared with vanilla TDNN, E-TDNN and F-TDNN achieve better accuracy, but they substantially increase the number of parameters.

To reduce the number of parameters and further improve the accuracy of TDNN, in this paper we propose a novel TDNN-based model, called densely connected TDNN (D-TDNN)¹, for speaker embedding. The contributions of this paper are outlined as follows:

• D-TDNN has fewer parameters than existing TDNN-based models, by adopting bottleneck layers and dense connectivity (see the sketch after this list).
• An improved variant of D-TDNN, called D-TDNN-SS, is proposed to employ multiple TDNN branches with short-term and long-term contexts. D-TDNN-SS can integrate the information from multiple TDNN branches with a newly designed channel-wise selection mechanism called statistics-and-selection (SS).
• Experiments on VoxCeleb datasets demonstrate that both D-TDNN and D-TDNN-SS can outperform existing TDNN-based models to achieve state-of-the-art accuracy with fewer parameters, and D-TDNN-SS can achieve better accuracy than D-TDNN.

¹ D-TDNN is publicly available at https://github.com/yuyq96/D-TDNN.
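The D-TDNN layer itself is defined later in the paper; as a preview of the first contribution only, here is a speculative sketch of how a densely connected TDNN layer with a bottleneck could look. Assumptions not stated in the text above: PyTorch, DenseNet-style channel concatenation, an FNN bottleneck realized as a 1x1 convolution, and illustrative bottleneck width and growth rate.

```python
import torch
import torch.nn as nn


class DenseTDNNLayer(nn.Module):
    """Hypothetical D-TDNN layer: an FNN bottleneck followed by a TDNN
    sublayer; the layer's input is the concatenation of the outputs of
    all preceding layers (dense connectivity)."""

    def __init__(self, in_channels, growth_rate, bottleneck,
                 context_size, dilation=1):
        super().__init__()
        self.bottleneck = nn.Sequential(  # position-wise FNN as 1x1 conv
            nn.Conv1d(in_channels, bottleneck, kernel_size=1),
            nn.BatchNorm1d(bottleneck),
            nn.ReLU(),
        )
        pad = dilation * (context_size - 1) // 2  # assumes odd context
        self.tdnn = nn.Sequential(
            nn.Conv1d(bottleneck, growth_rate, kernel_size=context_size,
                      dilation=dilation, padding=pad),
            nn.BatchNorm1d(growth_rate),
            nn.ReLU(),
        )

    def forward(self, x):                  # x: (batch, channels, frames)
        out = self.tdnn(self.bottleneck(x))
        return torch.cat([x, out], dim=1)  # reuse all earlier features
```

Because each layer only emits a small number of new channels (the growth rate) while reusing all earlier features through concatenation, the parameter count stays low, which is the motivation stated in the first contribution.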
2. Related Works

In this section, we briefly review the related works, including two TDNN-based speaker embedding models and skip-connection-related methods.

2.1. Extended TDNN

Compared with vanilla TDNN, extended TDNN (E-TDNN) has one more TDNN layer and interleaves FNN layers between the TDNN layers. This interleaving creates a repeated composite structure in E-TDNN, which we denote as an E-TDNN layer. The components of an E-TDNN layer are shown in Table 1. Note that batch normalization (BN) is employed before ReLU but omitted in the table.
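Table 1 is not reproduced in this excerpt, so the composition of an E-TDNN layer is taken from the prose: a TDNN sublayer interleaved with an FNN sublayer, each followed by BN and then ReLU. A minimal PyTorch sketch under that reading (channel width, context, and dilation are illustrative):

```python
import torch.nn as nn


class ETDNNLayer(nn.Module):
    """One E-TDNN composite layer: a TDNN sublayer followed by an FNN
    sublayer, each with BN before ReLU as stated in the text."""

    def __init__(self, channels, context_size, dilation=1):
        super().__init__()
        pad = dilation * (context_size - 1) // 2  # assumes odd context
        self.block = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=context_size,
                      dilation=dilation, padding=pad),   # TDNN sublayer
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),  # FNN sublayer
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, x):  # x: (batch, channels, frames)
        return self.block(x)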
2.2. Factorized TDNN

Singular value decomposition (SVD) is a commonly used approach for reducing the number of parameters in a DNN. Instead of performing SVD after pre-training and then fine-tuning the factorized network, factorized TDNN (F-TDNN) [11] adopts a semi-orthogonal constraint on one of the two factor matrices, so that the factorized network can be trained from scratch.
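A minimal sketch of the factorization idea (PyTorch assumed; widths illustrative): the weight matrix of a TDNN layer is replaced by the product of two smaller matrices, implemented as a convolution into a low-rank bottleneck followed by a 1x1 convolution back to full width. The semi-orthogonal constraint of [11] is omitted here for brevity.

```python
import torch.nn as nn


class FactorizedTDNNLayer(nn.Module):
    """TDNN layer whose weight matrix is factorized into two smaller
    matrices: a convolution into a low-rank bottleneck followed by a
    1x1 convolution back to the full width. [11] additionally keeps
    one factor semi-orthogonal during training; that constraint is
    omitted in this sketch."""

    def __init__(self, in_channels, out_channels, bottleneck,
                 context_size, dilation=1):
        super().__init__()
        self.factor1 = nn.Conv1d(in_channels, bottleneck,
                                 kernel_size=context_size,
                                 dilation=dilation)
        self.factor2 = nn.Conv1d(bottleneck, out_channels, kernel_size=1)

    def forward(self, x):  # x: (batch, in_channels, frames)
        return self.factor2(self.factor1(x))
```

With C input and output channels, kernel width k, and bottleneck width b, the parameter count drops from roughly C·C·k to b·C·k + b·C, a large saving when b is much smaller than C.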