ENSEMBLE ADDITIVE MARGIN SOFTMAX FOR SPEAKER VERIFICATION

Ya-Qi Yu, Lei Fan, Wu-Jun Li

National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China
{yuyq,fanl}@lamda.nju.edu.cn, liwujun@nju.edu.cn

ABSTRACT

End-to-end speaker embedding systems have shown promising performance on speaker verification tasks. Traditional end-to-end systems typically adopt softmax loss as the training criterion, which is not strong enough for training discriminative models. In this paper, we adapt the additive margin softmax (AM-Softmax) loss, which was originally proposed for face verification, to speaker embedding systems. Furthermore, we propose a novel ensemble loss, called ensemble additive margin softmax (EAM-Softmax) loss, for speaker embedding by integrating the Hilbert-Schmidt independence criterion (HSIC) into the speaker embedding system with the AM-Softmax loss. Experiments on the large-scale dataset VoxCeleb show that AM-Softmax loss is better than traditional loss functions, and that approaches using EAM-Softmax loss can outperform existing speaker verification methods to achieve state-of-the-art performance.

Index Terms— Speaker verification, additive margin softmax, ensemble, Hilbert-Schmidt independence criterion

1. INTRODUCTION

Recently, demand for high-precision speaker verification (SV) technology has increased quickly in the security domain, because SV has great potential with low requirements on collecting devices and operating environments. The task of an SV system is to verify whether a given utterance matches a specific speaker, whose characteristics can be extracted from enrollment utterances recorded in advance. The characteristics of an utterance are typically represented as an embedding vector, which is calculated by a speaker embedding system.

For the last decade, approaches based on i-vectors [1], which represent speaker and channel variability in a low-dimensional space called the total variability space, have dominated the field of speaker embedding. Nevertheless, there is a paradigm shift in recent speaker embedding studies, from i-vectors to deep neural networks (DNN) [2, 3, 4], mostly with end-to-end training. The difference between i-vector and end-to-end systems is that i-vector systems adopt generative models for embedding, whereas end-to-end systems adopt DNNs. In end-to-end systems, we generally use an intermediate layer of the neural network as the embedding layer instead of the last, 'classification' layer, because the intermediate layer appears to be more robust in open-set tasks. To complete speaker verification, the speaker embeddings, whether learned by end-to-end embedding systems or by i-vector, can be followed by back-ends like probabilistic linear discriminant analysis (PLDA) [5]. In addition, a cosine similarity based back-end can also be used for speaker verification, which is much simpler than PLDA. Although i-vector based systems are still effective if the utterances have sufficient length [1], end-to-end systems appear to outperform i-vector based methods for short utterances, which are more common in real applications.

In end-to-end systems, an appropriate training criterion (loss function) is important for exploiting the power of neural networks.
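As a concrete illustration of the cosine similarity back-end mentioned above: after length normalization, the verification score between an enrollment embedding and a test embedding reduces to a dot product. A minimal sketch (NumPy; the random vectors below are stand-ins for real system outputs, and the decision threshold would be tuned on development data):

```python
import numpy as np

def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between two speaker embeddings."""
    enroll = enroll_emb / np.linalg.norm(enroll_emb)
    test = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(enroll, test))

# Toy trial: accept if the score exceeds a tuned threshold.
rng = np.random.default_rng(0)
e = rng.standard_normal(256)            # enrollment embedding (stand-in)
t = e + 0.1 * rng.standard_normal(256)  # test embedding, same speaker
print(cosine_score(e, t))               # close to 1.0 for a matching speaker
```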
Most traditional systems adopt a softmax loss function to supervise the training of the neural networks. However, in speaker verification tasks, the embeddings learned by softmax loss based systems cannot achieve satisfactory performance in minimizing intra-class divergence [6, 7].

To improve the performance of end-to-end systems, researchers have recently proposed several new loss functions for SV, which can be divided into two major categories. The first category is classification loss, such as center loss and angular softmax (A-Softmax) loss [6, 7]. Center loss [6], which tries to reduce the intra-class distance, is typically used in combination with softmax loss to train an embedding system. A-Softmax loss [7] tries to incorporate an angular margin into the softmax loss function, which has achieved promising performance. However, the margin in A-Softmax loss is constrained to be a positive integer, which is not flexible enough.

The second category is metric learning loss, in which triplet loss [8] and pairwise loss [9, 10] are widely used. Triplet loss is defined on a set of triplets, each of which consists of an anchor sample, a positive sample and a negative sample. Triplet loss based systems try to maximize the distance between the anchor sample and the negative sample while minimizing the distance between the anchor sample and the positive sample. Pairwise loss, such as contrastive loss [9, 10], is defined on a set of pairs. Pairwise loss tries to maximize the distance between two samples if they have different class labels, and otherwise to minimize it. For models supervised by metric learning loss, the target of training and the requirement of inference are consistent, which should give promising performance as long as the training is sufficient.
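The triplet and contrastive objectives described above can be sketched as follows (NumPy; the margin values and the choice of squared Euclidean distance are illustrative, not necessarily the exact formulations used in [8, 9, 10]):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)

def contrastive_loss(x1, x2, same_speaker, margin=1.0):
    """Pull same-speaker pairs together; push different-speaker pairs apart."""
    d = np.sqrt(np.sum((x1 - x2) ** 2))
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

As the text notes, both losses are sensitive to how triplets or pairs are sampled from the training set, which is why they are often paired with a classification loss in practice.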
Nevertheless, metric learning loss based systems have the shortcoming that the size of the dataset and the strategies for sampling and composing triplets or pairs significantly affect performance, bringing obstacles to training. Thus, they are usually used in combination with classification loss.

Very recently, a novel loss function, called additive margin softmax (AM-Softmax) loss [11], was proposed for face verification. AM-Softmax loss has achieved better performance than other loss functions in face verification. In this paper, we adapt the AM-Softmax loss to speaker embedding systems. Furthermore, we propose a novel ensemble loss, called ensemble additive margin softmax (EAM-Softmax) loss, for SV by integrating the Hilbert-Schmidt independence criterion (HSIC) [12] into the speaker embedding system with the AM-Softmax loss. Experiments on the large-scale dataset VoxCeleb show that AM-Softmax loss is better than traditional loss functions, and that approaches using EAM-Softmax loss can outperform existing speaker verification methods to achieve state-of-the-art performance.

978-1-5386-4658-8/18/$31.00 ©2019 IEEE    6046    ICASSP 2019
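For reference, AM-Softmax computes cross-entropy over scaled cosine logits after subtracting a fixed margin m from the target-class cosine. A minimal single-sample sketch (NumPy; the scale s and margin m shown here are common illustrative values, not the hyperparameters of this paper, and the class-weight matrix is a hypothetical stand-in for the embedding network's output layer):

```python
import numpy as np

def am_softmax_loss(embedding, weights, label, s=30.0, m=0.35):
    """AM-Softmax for one sample: cross-entropy over s * cosine logits,
    with margin m subtracted from the target-class cosine."""
    x = embedding / np.linalg.norm(embedding)      # length-normalize embedding
    w = weights / np.linalg.norm(weights, axis=0)  # length-normalize class weights
    cos = x @ w                                    # cosine per class, shape (num_classes,)
    logits = s * cos
    logits[label] = s * (cos[label] - m)           # additive margin on target class
    logits -= logits.max()                         # numerical stability
    return -(logits[label] - np.log(np.exp(logits).sum()))
```

Unlike A-Softmax, whose angular margin must be a positive integer, the additive margin m here is a free real-valued hyperparameter, which is the flexibility advantage noted above.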