ENSEMBLE ADDITIVE MARGIN SOFTMAX FOR SPEAKER VERIFICATION

Ya-Qi Yu, Lei Fan, Wu-Jun Li

National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China
{yuyq,fanl}@lamda.nju.edu.cn, liwujun@nju.edu.cn

ABSTRACT

End-to-end speaker embedding systems have shown promising performance on speaker verification tasks. Traditional end-to-end systems typically adopt the softmax loss as the training criterion, which is not strong enough for training discriminative models. In this paper, we adapt the additive margin softmax (AM-Softmax) loss, which was originally proposed for face verification, to speaker embedding systems. Furthermore, we propose a novel ensemble loss, called the ensemble additive margin softmax (EAM-Softmax) loss, for speaker embedding by integrating the Hilbert-Schmidt independence criterion (HSIC) into the speaker embedding system with the AM-Softmax loss. Experiments on the large-scale VoxCeleb dataset show that the AM-Softmax loss is better than traditional loss functions, and that approaches using the EAM-Softmax loss can outperform existing speaker verification methods to achieve state-of-the-art performance.

Index Terms— Speaker verification, additive margin softmax, ensemble, Hilbert-Schmidt independence criterion

1. INTRODUCTION

Recently, the demand for high-precision speaker verification (SV) technology has increased quickly in the security domain, because SV has great potential given its low requirements on recording devices and operating environments. The task of an SV system is to verify whether a given utterance matches a specific speaker, whose characteristic can be extracted from enrollment utterances recorded in advance. The characteristic of an utterance is typically represented as an embedding vector, which is calculated by a speaker embedding system.

For the last decade, approaches based on i-vectors [1], which represent speaker and channel variability in a low-dimensional space called the total variability space, have dominated the field of speaker embedding. Nevertheless, there has been a paradigm shift in recent speaker embedding studies, from i-vectors to deep neural networks (DNN) [2, 3, 4], mostly with end-to-end training. The difference between the two is that i-vector systems adopt generative models for embedding, whereas end-to-end systems adopt DNNs. In end-to-end systems, we generally use an intermediate layer of the neural network as the embedding layer instead of the last ('classification') layer, because the intermediate layer appears to be more robust in open-set tasks. To complete speaker verification, the speaker embeddings, learned either by end-to-end embedding systems or by i-vector, can be followed by back-ends like probabilistic linear discriminant analysis (PLDA) [5]. In addition, a cosine similarity based back-end can also be used for speaker verification, which is much simpler than PLDA. Although i-vector based systems are still effective if the utterances have sufficient length [1], end-to-end systems appear to outperform i-vector based methods for short utterances, which are more common in real applications.

In end-to-end systems, an appropriate training criterion (loss function) is important for exploiting the power of neural networks. Most traditional systems adopt a softmax loss function to supervise the training of the neural networks.
However, in speaker verification tasks, the embeddings learned by softmax loss based systems cannot achieve satisfactory performance in minimizing intra-class divergence [6, 7].

To improve the performance of end-to-end systems, researchers have recently proposed several new loss functions for SV, which can be divided into two major categories. The first category is classification loss, such as center loss and angular softmax (A-Softmax) loss [6, 7]. Center loss [6], which tries to reduce the intra-class distance, is typically used in combination with softmax loss to train an embedding system. A-Softmax loss [7] tries to incorporate an angular margin into the softmax loss function, which has achieved promising performance. However, the margin in A-Softmax loss is constrained to be a positive integer, which is not flexible enough.

The second category is metric learning loss, of which triplet loss [8] and pairwise loss [9, 10] are widely used examples. Triplet loss is defined on a set of triplets, each of which consists of an anchor sample, a positive sample and a negative sample. Triplet loss based systems try to maximize the distance between the anchor and the negative sample while minimizing the distance between the anchor and the positive sample. Pairwise loss, such as contrastive loss [9, 10], is defined on a set of pairs. Pairwise loss tries to maximize the distance between two samples if they have different class labels, and to minimize it otherwise. For models supervised by metric learning loss, the training objective is consistent with the requirement at inference, so such models should achieve promising performance as long as training is sufficient. Nevertheless, metric learning loss based systems have the shortcoming that the dataset size and the strategies for sampling and composing triplets or pairs significantly affect performance, which makes training difficult. Thus, they are usually used in combination with classification loss.

Very recently, a novel loss function, called the additive margin softmax (AM-Softmax) loss [11], was proposed for face verification, where it has achieved better performance than other loss functions. In this paper, we adapt the AM-Softmax loss to speaker embedding systems. Furthermore, we propose a novel ensemble loss, called the ensemble additive margin softmax (EAM-Softmax) loss, for SV by integrating the Hilbert-Schmidt independence criterion (HSIC) [12] into the speaker embedding system with the AM-Softmax loss. Experiments on the large-scale VoxCeleb dataset show that the AM-Softmax loss is better than traditional loss functions, and that approaches using the EAM-Softmax loss can outperform existing speaker verification methods to achieve state-of-the-art performance.
2. PRELIMINARIES

In this section, we introduce some loss functions which have been used in SV tasks, including softmax loss, A-Softmax loss and contrastive loss.

2.1. Softmax Loss

The softmax loss is defined as follows:

$$\mathcal{L}_S = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{\mathbf{w}_{y_i}^T \mathbf{x}_i + b_{y_i}}}{\sum_{j=1}^{c} e^{\mathbf{w}_j^T \mathbf{x}_i + b_j}}, \qquad (1)$$

where $c$ is the number of classes, $\mathbf{x}_i$ is the input of the last fully connected layer corresponding to sample $i$, $y_i \in \{1, 2, \ldots, c\}$ is the class label of sample $i$, $N$ is the number of samples, and $\mathbf{w}_j$ and $b_j$ are respectively the weight vector and bias of the last fully connected layer corresponding to class $j$.

2.2. A-Softmax Loss

Note that $\mathbf{w}^T\mathbf{x}$ in the softmax loss can be rewritten as $\|\mathbf{w}\|\|\mathbf{x}\|\cos(\theta)$, where $\theta$ is the angle between $\mathbf{w}$ and $\mathbf{x}$. Hence, the softmax loss can be rewritten as follows:

$$\mathcal{L}_S = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{\|\mathbf{w}_{y_i}\|\|\mathbf{x}_i\|\cos(\theta_{y_i,i}) + b_{y_i}}}{\sum_{j=1}^{c} e^{\|\mathbf{w}_j\|\|\mathbf{x}_i\|\cos(\theta_{j,i}) + b_j}}, \qquad (2)$$

where $\theta_{j,i}$ denotes the angle between $\mathbf{w}_j$ and $\mathbf{x}_i$.

By normalizing the weights $\mathbf{w}$, zeroing the biases $b$, and replacing the cosine with a tighter function $\psi(\theta) \leq \cos(\theta)$ for the intra-class part, the formulation of a generalized margin softmax loss is given by:

$$\mathcal{L}_{MS} = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{\|\mathbf{x}_i\|\psi(\theta_{y_i,i})}}{e^{\|\mathbf{x}_i\|\psi(\theta_{y_i,i})} + \sum_{j=1; j \neq y_i}^{c} e^{\|\mathbf{x}_i\|\cos(\theta_{j,i})}}.$$

A-Softmax loss [13] adopts $\psi(\theta) = \cos(m\theta)$, where $m$ is the hyperparameter related to the margin. This makes sense intuitively, but $m\theta$ should not be larger than $\pi$, to preserve monotonicity. To avoid this problem, A-Softmax uses the following monotone decreasing function:

$$\psi(\theta) = (-1)^k \cos(m\theta) - 2k, \qquad (3)$$

where $\theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$ and $k \in \{0, \ldots, m-1\}$. Usually, $m$ is a positive integer in this function, and hence the margin in A-Softmax loss is not flexible enough.
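To make the piecewise definition in Eq. (3) concrete, the short NumPy sketch below (illustrative only; the function and variable names are ours, and m = 4 is just an example margin) evaluates ψ(θ) on [0, π] and checks that it is monotone decreasing and tighter than cos(θ):

```python
import numpy as np

def psi(theta, m):
    """A-Softmax's piecewise function from Eq. (3):
    psi(theta) = (-1)^k * cos(m*theta) - 2k  for theta in [k*pi/m, (k+1)*pi/m]."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)  # segment index in {0, ..., m-1}
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

thetas = np.linspace(0.0, np.pi, 1000)
values = psi(thetas, m=4)
assert np.all(np.diff(values) <= 1e-9)           # monotone decreasing on [0, pi]
assert np.all(values <= np.cos(thetas) + 1e-9)   # tighter than cos(theta)
```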
2.3. Contrastive Loss

Contrastive loss [14, 15] is a kind of pairwise loss in which the samples are organized into pairs, with a label $z \in \{0, 1\}$ indicating whether the two elements of the corresponding pair belong to the same class or not. The formulation of contrastive loss is as follows:

$$\mathcal{L}_C = \frac{1}{2M} \sum_{i=1}^{M} \left( z\, d_i^2 + (1 - z) \max(\rho - d_i, 0)^2 \right), \qquad (4)$$

where $M$ is the number of pairs and $d_i$ is the Euclidean distance between the two embeddings of the elements in the $i$-th pair:

$$d_i = \| f(\mathbf{p}_{i,1}; \mathbf{w}) - f(\mathbf{p}_{i,2}; \mathbf{w}) \|_2,$$

where $\mathbf{p}_{i,1}$ and $\mathbf{p}_{i,2}$ are the two elements of the $i$-th pair, $f(\cdot)$ is a non-linear function which represents the embedding system, and $\mathbf{w}$ represents the model parameters. The distance between embeddings with different class labels is expected to be larger than a margin $\rho$.
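As a reference implementation of Eq. (4), here is a minimal NumPy sketch over a batch of embedding pairs (the names are ours; in practice the embeddings would come from the network f):

```python
import numpy as np

def contrastive_loss(emb1, emb2, z, rho=1.0):
    """Contrastive loss of Eq. (4).
    emb1, emb2: (M, dim) embeddings of the two elements of each pair.
    z: (M,) labels, 1 if the pair shares a class, 0 otherwise.
    rho: margin that different-class pairs should exceed."""
    d = np.linalg.norm(emb1 - emb2, axis=1)            # Euclidean distance d_i
    same = z * d ** 2                                  # pull same-class pairs together
    diff = (1 - z) * np.maximum(rho - d, 0.0) ** 2     # push different-class pairs beyond rho
    return np.sum(same + diff) / (2 * len(z))

# toy usage with random embeddings
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
labels = rng.integers(0, 2, size=8)
print(contrastive_loss(e1, e2, labels))
```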
3. ENSEMBLE ADDITIVE MARGIN SOFTMAX LOSS

This section presents the details of our proposed loss function, called the EAM-Softmax loss.

3.1. Additive Margin Softmax Loss

As stated in Section 2.2, A-Softmax is not flexible enough. To overcome this shortcoming of A-Softmax and explore more possible margins, the additive margin softmax (AM-Softmax) loss [11] adopts a simpler function $\psi(\theta) = \cos(\theta) - m$ and further normalizes $\mathbf{x}$. The AM-Softmax loss is defined as follows:

$$\mathcal{L}_{AMS} = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{\cos(\theta_{y_i,i}) - m}}{e^{\cos(\theta_{y_i,i}) - m} + \sum_{j=1; j \neq y_i}^{c} e^{\cos(\theta_{j,i})}}.$$

The traditional softmax loss aims to learn a set of weight vectors and biases for the different classes that satisfies $\mathbf{w}_{y_i}^T \mathbf{x}_i + b_{y_i} > \mathbf{w}_j^T \mathbf{x}_i + b_j$ ($j \in \{1, \ldots, c\}, j \neq y_i$), and its decision boundary satisfies $\min\{(\mathbf{w}_{y_i}^T \mathbf{x}_i + b_{y_i}) - (\mathbf{w}_j^T \mathbf{x}_i + b_j)\} = 0$. In contrast, the AM-Softmax loss obtains a boundary satisfying $\min\{\cos(\theta_{y_i,i}) - \cos(\theta_{j,i})\} = m$, which forces the embeddings to be more discriminative and makes the verification more robust.

Since a large margin $m$ might push the decision boundary too hard and make the training difficult to converge, a hyperparameter $s$ is introduced to scale the cosine value, and the actual AM-Softmax loss function is given by:

$$\mathcal{L}_{AMS} = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j=1; j \neq y_i}^{c} e^{s \cdot \cos(\theta_{j,i})}}.$$
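Because the weights and embeddings are normalized, $\cos(\theta_{j,i})$ is simply the inner product of the normalized vectors, so the scaled loss is a few lines of code. A minimal NumPy sketch of the scaled AM-Softmax (our own illustrative code, not the authors' implementation):

```python
import numpy as np

def am_softmax_loss(x, W, y, m=0.35, s=30.0):
    """Scaled AM-Softmax loss.
    x: (N, dim) embeddings; W: (dim, c) class weight vectors;
    y: (N,) integer class labels; m: additive margin; s: scale factor."""
    x_hat = x / np.linalg.norm(x, axis=1, keepdims=True)   # normalize embeddings
    W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)   # normalize class weights
    cos = x_hat @ W_hat                                    # cos(theta_{j,i}), shape (N, c)
    cos[np.arange(len(y)), y] -= m                         # subtract margin on target class
    logits = s * cos
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()
```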
3.2. Hilbert-Schmidt Independence Criterion

The Hilbert-Schmidt independence criterion (HSIC) [12] measures the independence of two random variables $A$ and $B$, and the empirical HSIC estimates it from a finite number of observations.

Definition 1 (Empirical HSIC) Consider a series of $n$ independent observations $\mathcal{Z} = \{(a_1, b_1), \ldots, (a_n, b_n)\} \subseteq \mathcal{A} \times \mathcal{B}$ drawn from $p_{ab}$. The empirical HSIC is given by

$$\mathrm{HSIC}(\mathcal{Z}, \mathcal{F}, \mathcal{G}) = (n-1)^{-2} \operatorname{tr}(\mathbf{K}\mathbf{H}\mathbf{L}\mathbf{H}), \qquad (5)$$

where $\mathbf{K}$ and $\mathbf{L}$ are Gram matrices with $K_{ij} = \kappa(a_i, a_j)$ and $L_{ij} = \ell(b_i, b_j)$. Here, $\kappa(a_i, a_j)$ and $\ell(b_i, b_j)$ are the kernel functions defined in the spaces $\mathcal{F}$ and $\mathcal{G}$ respectively, and $\mathbf{H} = \mathbf{I}_n - n^{-1}\mathbf{J}_n$, where $\mathbf{I}_n$ and $\mathbf{J}_n \in \mathbb{R}^{n \times n}$ are an identity matrix and a matrix of all ones respectively.
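The empirical HSIC of Eq. (5) is a few lines of linear algebra. The NumPy sketch below (our own helper, using a linear kernel as adopted later in Section 3.3) returns a value near zero for independent samples and a larger value for dependent ones:

```python
import numpy as np

def empirical_hsic(A, B):
    """Empirical HSIC of Eq. (5) with linear (inner product) kernels.
    A: (n, da) observations of the first variable; B: (n, db) of the second."""
    n = A.shape[0]
    K = A @ A.T                           # Gram matrix K_ij = <a_i, a_j>
    L = B @ B.T                           # Gram matrix L_ij = <b_i, b_j>
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H = I_n - J_n / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 5))
print(empirical_hsic(a, rng.normal(size=(200, 5))))  # near 0: independent
print(empirical_hsic(a, 2.0 * a + 0.01))             # much larger: dependent
```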
3.3. Ensemble Additive Margin Softmax Loss

Diversity among weak learners proves to be critical for the performance of ensemble models. Inspired by the work in [16], we exploit parallel fully connected layers to encourage diversity among homogeneous learners. Weights in these layers are made highly pairwise independent under the constraint of HSIC. Moreover, the kernel functions in HSIC, which map variables into reproducing kernel Hilbert spaces (RKHS), give it the ability to measure nonlinear dependence.

Unlike the classification tasks in [16], the classification layer in embedding systems is not suitable for exploiting different models, since it will no longer be used once training is finished. Hence, we add the HSIC constraint to the embedding layer rather than the classification layer.

Assume that there are $V$ parallel fully connected layers for embedding in the ensemble system, and each fully connected layer contains a weight matrix $\mathbf{W} \in \mathbb{R}^{l \times n}$, where $l$ and $n$ are the input size and output size of the embedding layer respectively. The formulation of the HSIC constraint for the $v$-th weight matrix $\mathbf{W}^{(v)}$ ($v \in \{1, \ldots, V\}$) is as follows:

$$\mathrm{HSIC}(\mathbf{W}^{(v)}) = \sum_{u=1; u \neq v}^{V} (n-1)^{-2} \operatorname{tr}(\mathbf{K}^{(v)} \mathbf{H} \mathbf{K}^{(u)} \mathbf{H}), \qquad (6)$$

where $K^{(v)}_{ij} = k(\mathbf{W}^{(v)}_i, \mathbf{W}^{(v)}_j)$ and $K^{(u)}_{ij} = k(\mathbf{W}^{(u)}_i, \mathbf{W}^{(u)}_j)$, with $\mathbf{W}^{(v)}_i$ being the $i$-th column of $\mathbf{W}^{(v)}$. Although more complex kernels can be expected to achieve better performance, the inner product kernel $\mathbf{K} = \mathbf{W}^T\mathbf{W}$ is adopted for illustration in this paper. Since a weight matrix with small magnitude will have a small HSIC constraint, the $\{\mathbf{W}^{(v)}\}$ are normalized along the vertical axis.

Note that the time and space complexity of the HSIC constraint computation mainly depends on the number of columns of $\mathbf{W}$, which equals the dimensionality of the embedding vectors in our network architectures. Hence, with a low embedding dimensionality, which is typically adopted in practice, we can easily handle several models in the ensemble without the rapidly increasing memory usage and computational cost faced by [16].

There are two ways to construct the final ensemble model. The first is to average the outputs of the fully connected layers, which is equivalent to averaging the weights of the fully connected layers, since

$$\frac{1}{V} \sum_{v=1}^{V} \left[\mathbf{W}^{(v)}\right]^T \mathbf{x} = \left[\frac{1}{V} \sum_{v=1}^{V} \mathbf{W}^{(v)}\right]^T \mathbf{x}.$$

The second way is to concatenate the outputs of the fully connected layers. One shortcoming of this way is that the embedding size and the number of parameters in the classification layer are proportional to the number of models in the ensemble, leading to a higher storage and computational burden. Hence, we adopt the first way to construct the final ensemble model in this paper.

Rather than optimizing the ensemble model with multiple standalone softmax loss functions, which is the approach in [16] and may lead to inconsistency between training and inference, we directly average the outputs of the embedding layers before they are forwarded to the classification layer, and optimize the ensemble model with a single softmax loss function.

Finally, by combining the AM-Softmax loss and the HSIC constraint on the embedding layers, we obtain the formulation of the ensemble additive margin softmax (EAM-Softmax) loss for speaker embedding systems:

$$\mathcal{L}_{EAMS} = -\frac{V}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j=1; j \neq y_i}^{c} e^{s \cdot \cos(\theta_{j,i})}} + \lambda \sum_{v=1}^{V} \sum_{u=1; u \neq v}^{V} (n-1)^{-2} \operatorname{tr}(\mathbf{K}^{(v)} \mathbf{H} \mathbf{K}^{(u)} \mathbf{H}),$$

where $\lambda$ is a hyperparameter controlling the tradeoff between the AM-Softmax loss and the HSIC constraint.
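Putting the pieces together, the sketch below (a minimal NumPy illustration under our own naming; the real system trains ResNet embeddings end-to-end) averages V parallel embedding heads, applies the scaled AM-Softmax of Section 3.1, and adds the pairwise HSIC penalty of Eq. (6) on the column-normalized head weights:

```python
import numpy as np

def hsic_penalty(Ws):
    """Sum of the pairwise empirical HSIC terms of Eq. (6) over V weight matrices.
    Ws: list of (l, n) embedding weight matrices with normalized columns."""
    n = Ws[0].shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    Ks = [W.T @ W for W in Ws]                      # linear-kernel Gram matrices
    total = 0.0
    for v, Kv in enumerate(Ks):
        for u, Ku in enumerate(Ks):
            if u != v:
                total += np.trace(Kv @ H @ Ku @ H) / (n - 1) ** 2
    return total

def eam_softmax_loss(h, Ws, Wc, y, m=0.35, s=30.0, lam=0.1):
    """EAM-Softmax: AM-Softmax on the averaged embedding + lambda * HSIC penalty.
    h: (N, l) inputs to the embedding heads; Ws: V matrices of shape (l, n);
    Wc: (n, c) classification weights; y: (N,) labels."""
    Ws = [W / np.linalg.norm(W, axis=0, keepdims=True) for W in Ws]  # column-normalize
    x = h @ (sum(Ws) / len(Ws))          # averaging weights == averaging head outputs
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    Wc = Wc / np.linalg.norm(Wc, axis=0, keepdims=True)
    cos = x @ Wc
    cos[np.arange(len(y)), y] -= m       # additive margin on the target class
    logits = s * cos
    logits -= logits.max(axis=1, keepdims=True)
    ce = -(logits[np.arange(len(y)), y] - np.log(np.exp(logits).sum(axis=1)))
    return len(Ws) * ce.mean() + lam * hsic_penalty(Ws)   # (V/N) * sum + lambda * HSIC
```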
Table 1. Dataset for evaluation. POI denotes Person of Interest.

Dataset      #           Dev        Test    Total
VoxCeleb1    POIs        1,211      40      1,251
             utterances  148,642    4,874   153,516
             hours       -          -       352
VoxCeleb2    POIs        5,994      118     6,112
             utterances  1,092,009  36,237  1,128,246
             hours       -          -       2,442

4. EXPERIMENTS

We compare our method with other baselines on real datasets.

4.1. Dataset

In the experiments, we use two datasets, VoxCeleb1 [4] and VoxCeleb2 [9]. Both datasets are gender balanced and contain a large number of utterances from thousands of speakers. The utterances are collected from YouTube videos in which the speakers belong to different races and have a wide range of accents.
The datasets contain background noise from a large number of environments, e.g., overlapping speech, which makes the audio segments challenging for speaker verification.

Both datasets are split into a development set and a test set. We adopt the same evaluation strategy as that in [9]. In particular, the development set of VoxCeleb2 is used for training and the test set of VoxCeleb1 is used for testing. Details of VoxCeleb1 and VoxCeleb2 are given in Table 1. There are no overlapping identities between the two datasets.

4.2. Implementation Details

In order to facilitate fair comparison of experimental results, we try to make our experimental settings consistent with those of the baselines [4, 9], except for the loss functions and the ensemble strategy. Thus we adopt similar network architectures, data processing, training and testing strategies in our experiments.

Networks. Network architectures are modified from the original residual networks (ResNet) [17] to take spectrograms as input features. In particular, ResNet-34 and ResNet-50 are used in our experiments. The details of the network architectures are described in Table 2. With an input feature length of 512, the output size of conv5_x is 9 × h, where h is determined by the audio segment length. The conv6 layer is employed to combine information from different frequency bands; its filter size is 9 × 1 and its output size is 1 × h. The adaptive average pool avgpool, which supports different input sizes, calculates a temporal mean of size 1 × 1. These modifications make the network architectures sensitive to frequency variance rather than temporal position, which is desired in text-independent SV.

Features. Spectrograms computed with a sliding Hamming window are used as input features. The window width and window step are 25 ms and 10 ms respectively. The feature length is set to 512. Normalization is performed along the frequency axis.
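For concreteness, a spectrogram front-end matching this description could look like the NumPy sketch below. This is our own illustration: the 16 kHz sampling rate, the 1024-point FFT from which 512 frequency components are kept, and the per-frequency mean-variance normalization are assumptions, as the paper does not state them:

```python
import numpy as np

def spectrogram(wave, sr=16000, win_ms=25, step_ms=10, n_fft=1024, n_freq=512):
    """Hamming-window magnitude spectrogram, normalized along the frequency axis.
    Returns an (n_freq, n_frames) array."""
    win, step = int(sr * win_ms / 1000), int(sr * step_ms / 1000)   # 400 / 160 samples
    n_frames = 1 + (len(wave) - win) // step
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hamming(win)                  # sliding Hamming window
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T[:n_freq]
    mu = spec.mean(axis=1, keepdims=True)
    sigma = spec.std(axis=1, keepdims=True)
    return (spec - mu) / (sigma + 1e-8)                   # per-frequency normalization

# a 3-second segment at 16 kHz yields roughly a 512 x 298 spectrogram
print(spectrogram(np.random.randn(3 * 16000)).shape)
```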
Table 2. Network architectures modified from ResNet-34 and ResNet-50 for spectrogram inputs. The conv6 layers are implemented with 2d convolutional layers, where the number of groups equals the number of channels.

layer name      34-layer                       50-layer
conv1           7×7, 64, stride 2
maxpool         3×3 max pool, stride 2
conv2_x         [3×3, 64; 3×3, 64] × 3         [1×1, 64; 3×3, 64; 1×1, 256] × 3
conv3_x         [3×3, 128; 3×3, 128] × 4       [1×1, 128; 3×3, 128; 1×1, 512] × 4
conv4_x         [3×3, 256; 3×3, 256] × 6       [1×1, 256; 3×3, 256; 1×1, 1024] × 6
conv5_x         [3×3, 512; 3×3, 512] × 3       [1×1, 512; 3×3, 512; 1×1, 2048] × 3
conv6           9×1, 512, stride 1             9×1, 2048, stride 1
avgpool         1×1, adaptive average pool, stride 1
embedding       512×512                        2048×512
classification  512×5994

Hyperparameter. The margin m and scale factor s for the AM-Softmax loss are set to 0.35 and 30.0 respectively. The ensemble number is V = 4. The hyperparameter λ balancing the AM-Softmax loss and the HSIC constraint in the EAM-Softmax loss is set to 0.1.

Training. 3-second utterances are randomly sampled from each audio file during training, each producing a spectrogram of size 512 × 300. Models are optimized by stochastic gradient descent (SGD) with momentum, where the momentum is 0.9 and the weight decay is 5 × 10⁻⁴. The mini-batch size is 64. The learning rate is initialized to 0.1. For the 34-layer network, the learning rate is divided by 10 after the 6th and the 12th epochs. For the 50-layer network, the learning rate is divided by 10 after the 10th and the 20th epochs. Training terminates early to avoid overfitting if the performance on a validation set, which is randomly sampled from the VoxCeleb1 development set, stops improving, after 12 epochs for the 34-layer network and 20 epochs for the 50-layer network.
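The optimization schedule above maps directly onto standard PyTorch components. A hedged sketch for the 34-layer configuration follows, where `model` is a placeholder standing in for the ResNet-34 embedding network of Table 2 (not shown here):

```python
import torch

# Placeholder module for illustration only; in the paper this would be
# the ResNet-34 embedding network described in Table 2.
model = torch.nn.Linear(512, 512)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,        # initial learning rate
                            momentum=0.9, weight_decay=5e-4)   # momentum SGD settings
# divide the learning rate by 10 after the 6th and 12th epochs (34-layer schedule)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 12], gamma=0.1)

for epoch in range(18):
    # ... one pass over 3-second crops in mini-batches of 64, with optimizer.step() ...
    scheduler.step()  # advance the schedule once per epoch
```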
Testing. Full-length utterances are used in testing, so the generated spectrograms have different sizes. Adaptive average pooling is employed to output embeddings of the same size.

4.3. Metric

Two metrics are used for performance evaluation:

(1) Equal error rate (EER): the error rate at the operating point where the false rejection probability $P_{fr}$ equals the false acceptance probability $P_{fa}$;

(2) Minimum detection cost function (minimum DCF): similar to EER, but takes the different costs of misclassification and the uneven target/nontarget probability into account. The formulation of the minimum DCF is given by

$$C_{det}^{min} = \min \{ C_{fr} \cdot P_{fr} \cdot P_{tar} + C_{fa} \cdot P_{fa} \cdot (1 - P_{tar}) \},$$

where $C_{fr}$ and $C_{fa}$ indicate the costs of false rejection and false acceptance respectively, and $P_{tar}$ is the target probability. All three parameters are application dependent. In our experiments, we adopt the same values as those in [9] for $C_{fr}$ (1.0), $C_{fa}$ (1.0) and $P_{tar}$ (0.01).
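Both metrics can be computed by sweeping a decision threshold over the verification scores. A minimal NumPy sketch (our own helper, assuming higher scores mean "same speaker"; the EER is approximated at the threshold where the two rates are closest):

```python
import numpy as np

def eer_and_min_dcf(scores, labels, c_fr=1.0, c_fa=1.0, p_tar=0.01):
    """Sweep thresholds over trial scores (labels: 1 = target, 0 = nontarget)."""
    thresholds = np.sort(np.unique(scores))
    p_fr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])   # miss rate
    p_fa = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])  # false alarm
    eer_idx = np.argmin(np.abs(p_fr - p_fa))          # where the two rates cross
    eer = (p_fr[eer_idx] + p_fa[eer_idx]) / 2
    min_dcf = np.min(c_fr * p_fr * p_tar + c_fa * p_fa * (1 - p_tar))
    return eer, min_dcf

# toy usage with synthetic target/nontarget score distributions
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(eer_and_min_dcf(scores, labels))
```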
Table 3. Experimental results. Here, * denotes that the results are from [9]. The letters in brackets are the initials of loss functions, where S, C, AMS and EAMS denote softmax, contrastive, AM-Softmax and EAM-Softmax respectively.

Model              Trained on   C_det^min   EER (%)
i-vector + PLDA    VoxCeleb1    0.73*       8.8*
VGG-M (S)          VoxCeleb1    0.75*       10.2*
VGG-M (C)          VoxCeleb1    0.71*       7.8*
VGG-M (C)          VoxCeleb2    0.609*      5.94*
ResNet-34 (C)      VoxCeleb2    0.543*      5.04*
ResNet-50 (C)      VoxCeleb2    0.449*      4.19*
ResNet-34 (AMS)    VoxCeleb2    0.304       3.35
ResNet-34 (EAMS)   VoxCeleb2    0.305       3.14
ResNet-50 (AMS)    VoxCeleb2    0.303       3.10
ResNet-50 (EAMS)   VoxCeleb2    0.278       2.94

4.4. Baseline

Methods and results explored in the experiments of [4, 9] are used as baselines, including:

(1) an i-vector based embedding system with a PLDA back-end;

(2) end-to-end embedding systems with a cosine similarity based back-end, in which the architectures are modified from networks introduced by the Visual Geometry Group (VGG-M) [18] or from ResNet [17]. The networks modified from ResNet are exactly the same as the networks employed in the experiments with the AM-Softmax and EAM-Softmax losses, except for the extra embedding layer.

For the end-to-end embedding systems, softmax loss and contrastive loss are employed.
Nevertheless, training with a standalone contrastive loss is difficult, so the baseline models supervised by contrastive loss are obtained in two stages. First, softmax loss is used to initialize the weights of the networks. Then the classification layer is replaced by a fully connected layer with a smaller output size. This fully connected layer is treated as the embedding layer, and the contrastive loss is used for tuning its parameters.

4.5. Results

Results on the VoxCeleb1 test set are listed in Table 3. Our method achieves state-of-the-art performance, decreasing the EER to 2.94% and the minimum DCF to 0.278, which are relatively 29.8% and 38.1% lower than the best results in [9].

The VGG-M architecture trained with softmax loss is slightly weaker than the traditional i-vector based approach, but the VGG-M architecture trained with contrastive loss surpasses the i-vector based approach. Furthermore, end-to-end systems using the AM-Softmax loss outperform all of the baselines, and approaches using the EAM-Softmax loss achieve the best results.

5. CONCLUSION

This paper first adapts the AM-Softmax loss to speaker verification, and then proposes a novel EAM-Softmax loss for speaker verification. Experiments on real datasets show that the proposed methods can achieve state-of-the-art performance.

Acknowledgement. This work has been supported by the NSFC-NRF Joint Research Project (No. 61861146001).
6. REFERENCES

[1] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[2] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052–4056.

[3] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.

[4] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "Voxceleb: a large-scale speaker identification dataset," in Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 2616–2620.

[5] Simon JD Prince and James H Elder, "Probabilistic linear discriminant analysis for inferences about identity," in IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.

[6] Na Li, Deyi Tuo, Dan Su, Zhifeng Li, and Dong Yu, "Deep discriminative embeddings for duration robust speaker verification," in Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 2262–2266.

[7] Zili Huang, Shuai Wang, and Kai Yu, "Angular softmax for short-duration text-independent speaker verification," in Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 3623–3627.

[8] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, "Deep speaker: an end-to-end neural speaker embedding system," CoRR, vol. abs/1705.02304, 2017.

[9] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "Voxceleb2: deep speaker recognition," in Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 1086–1090.

[10] Gautam Bhattacharya, Md Jahangir Alam, Vishwa Gupta, and Patrick Kenny, "Deeply fused speaker embeddings for text-independent speaker verification," in Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 3588–3592.

[11] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, "Additive margin softmax for face verification," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.

[12] Arthur Gretton, Olivier Bousquet, Alexander J. Smola, and Bernhard Schölkopf, "Measuring statistical dependence with Hilbert-Schmidt norms," in International Conference on Algorithmic Learning Theory (ALT), 2005, pp. 63–77.

[13] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song, "Sphereface: deep hypersphere embedding for face recognition," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6738–6746.

[14] Sumit Chopra, Raia Hadsell, and Yann LeCun, "Learning a similarity metric discriminatively, with application to face verification," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 539–546.

[15] Raia Hadsell, Sumit Chopra, and Yann LeCun, "Dimensionality reduction by learning an invariant mapping," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 1735–1742.
[16] Xiaobo Wang, Shifeng Zhang, Zhen Lei, Si Liu, Xiaojie Guo, and Stan Z Li, "Ensemble soft-margin softmax loss for image classification," in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 992–998.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[18] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, "Return of the devil in the details: delving deep into convolutional nets," in British Machine Vision Conference (BMVC), 2014.