INTERSPEECH 2019, September 15–19, 2019, Graz, Austria. http://dx.doi.org/10.21437/Interspeech.2019-2457

Deep Hashing for Speaker Identification and Retrieval

Lei Fan, Qing-Yuan Jiang, Ya-Qi Yu, Wu-Jun Li
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, P. R. China
{fanl,jiangqy,yuyq}@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Speaker identification and retrieval have been widely used in real applications. To overcome the inefficiency of real-valued representations, several speaker hashing methods have been proposed that learn binary codes as representations for speaker identification and retrieval. However, these hashing methods are based on i-vector and cannot achieve satisfactory retrieval accuracy, because they cannot learn discriminative feature representations. In this paper, we propose a novel deep hashing method, called deep additive margin hashing (DAMH), to improve retrieval performance for the speaker identification and retrieval task. Compared with existing speaker hashing methods, DAMH performs feature learning and binary code learning seamlessly by incorporating the two procedures into an end-to-end architecture. Experimental results on the large-scale audio dataset VoxCeleb2 show that DAMH outperforms existing speaker hashing methods and achieves state-of-the-art performance.

Index Terms: speaker identification and retrieval, deep hashing, additive margin softmax, deep additive margin hashing

1. Introduction

Speaker identification and retrieval [1, 2, 3] have been widely used in real applications, including automatic access control of banking services, financial transactions, and detection of speakers in complex scenes. Both speaker identification and retrieval can be realized by a retrieval procedure (in some cases, speaker identification can also be realized by classification; in this paper, we focus only on retrieval-based speaker identification). To realize the retrieval procedure, one common solution is to first embed utterances into low-dimensional representations, a step also called speaker embedding, and then perform retrieval based on these low-dimensional representations.

Over the past decades, real-value based speaker embedding has made good progress and achieved promising accuracy [2, 4, 5, 6]. I-vector based approaches [4], which project the Gaussian mixture model (GMM) supervector into a low-dimensional vector, have dominated the field of speaker embedding. I-vector based systems are robust and accurate when utterances of sufficient length are available [4]. Nevertheless, long speech is not always available in real applications. With the upsurge of deep learning, many works have recently been devoted to deep neural networks (DNN) [2, 5, 6] and achieved promising performance due to the powerful modeling capacity of DNNs. DNN based systems can outperform i-vector based systems on short utterances, which are more common and practical than long utterances in real applications. Since i-vector and DNN based methods are real-value based, they usually suffer from high storage cost and low retrieval speed in real applications with large-scale datasets.

To enable fast queries and reduce storage cost, some hashing methods [1, 3], also called speaker hashing methods, have been proposed for speaker identification and retrieval. By representing each utterance as a binary code, speaker hashing can reduce the storage cost dramatically.
Furthermore, we can achieve constant or sub-linear query speed based on binary codes. However, existing speaker hashing methods [1, 3] are based on i-vector. Specifically, each utterance is represented as an i-vector in the first stage; then a hash function is utilized to generate binary codes for the utterances in the second stage. On one hand, the retrieval performance of these methods is limited by the i-vector representations. On the other hand, as two-stage methods they cannot learn features that are optimally compatible with hashing. Hence, their retrieval performance is far from satisfactory in real applications.

To overcome the drawbacks of existing speaker hashing methods, in this paper we propose a novel deep hashing method, called deep additive margin hashing (DAMH). The contributions of this paper are listed as follows:

• DAMH is an end-to-end deep hashing method for speaker identification and retrieval. To the best of our knowledge, DAMH is the first deep hashing method for the speaker identification and retrieval task. Compared with existing speaker hashing methods, DAMH can perform audio feature learning and binary code learning simultaneously. Hence, these two procedures can give feedback to each other.

• DAMH utilizes the additive margin softmax loss to supervise speaker hashing. The angular margin added in the loss makes the learned binary codes more discriminative.

• Experiments on the large-scale audio dataset VoxCeleb2 demonstrate that DAMH outperforms existing speaker hashing methods and achieves state-of-the-art performance.

2. Related Works

In this section, we briefly review the related works, including real-value based speaker embedding and speaker hashing.

2.1. Real-value based Speaker Embedding

To perform speaker embedding, i-vector [4] was proposed to represent the GMM supervector in a single total variability space instead of two distinct spaces, i.e., speaker space and channel space. Modeling all variability on a single manifold gives this total variability model (TVM) its superior performance. The i-vector is the vector of latent factors that represents the speaker information of a given utterance.
After the TVM is trained using the EM algorithm, the i-vector can be extracted with maximum a posteriori (MAP) estimation for each utterance.

With the rise of deep learning [7], some works [2, 5, 6, 8] have used DNNs to learn speaker embeddings and achieved promising performance. There mainly exist two categories of DNN based speaker embedding methods. One aims to classify speakers using frame-level features [5]; the other tries to classify speakers using utterance-level features [2]. After training, intermediate layer features, which might be extracted from a single layer or from multiple intermediate layers, are used as the speaker embedding.

With the rapid growth of audio data in real applications, real-value based retrieval usually suffers from high storage cost and low query speed.

2.2. Speaker Hashing

To address the inefficiency problem of real-value based retrieval, many hashing methods have been proposed [9, 10, 11, 12]. To date, two speaker hashing methods [1, 3] have been proposed for speaker identification and retrieval.

In [1], locality sensitive hashing (LSH) [9] over i-vectors is applied to achieve faster speaker retrieval. Specifically, a $d$-dimensional i-vector $\mathbf{v}$ is transformed to a lower-dimensional vector with a random weight matrix $\mathbf{A} \in \mathbb{R}^{d \times K}$ drawn from a Gaussian distribution, where $K < d$. Then an element-wise sign function turns the lower-dimensional vector into the binary code $\mathbf{b}$, i.e., $\mathbf{b} = \mathrm{sign}(\mathbf{A}^\top \mathbf{v})$.
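As a concrete illustration of this baseline, here is a minimal NumPy sketch of the projection-and-sign step. It is our reconstruction rather than code from [1], and the random 400-dimensional vector merely stands in for a real extracted i-vector:

```python
import numpy as np

def lsh_hash(v: np.ndarray, A: np.ndarray) -> np.ndarray:
    """LSH binary code b = sign(A^T v), mapping sign(0) to +1."""
    projection = A.T @ v                          # K-dimensional real vector
    return np.where(projection >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
d, K = 400, 64                                    # i-vector dim and code length, K < d
A = rng.standard_normal((d, K))                   # random Gaussian projection matrix
v = rng.standard_normal(d)                        # stand-in for an extracted i-vector
b = lsh_hash(v, A)                                # binary code in {-1, +1}^K
```

Because $\mathbf{A}$ is data-independent, nearby i-vectors are likely to share codes only when $K$ is large, which is consistent with the observation below that LSH needs long codes.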
Because the LSH based speaker hashing method in [1] utilizes a random projection matrix as its hash function, it usually requires long binary codes to achieve satisfactory accuracy. To improve the retrieval performance, Hamming distance metric learning (HDML) [13, 3] has been applied to speaker identification. HDML is a triplet-based supervised hashing method which tries to preserve the relative similarity defined over triplet inputs $\{x, x^+, x^-\}$. Here, $\{x, x^+, x^-\}$ is constructed from an anchor sample $x$, a similar sample $x^+$ and a dissimilar sample $x^-$. HDML trains a triplet-based model by adopting a hinge loss over the learned triplet binary codes. In [3], HDML was applied to the speaker identification and retrieval task.

Although the aforementioned speaker hashing methods have been used for speaker identification and retrieval, their retrieval performance is far from satisfactory because these methods cannot fully exploit the power of feature learning. In this paper, we propose a deep hashing method, presented in the following section, that performs feature learning and binary code learning seamlessly.

3. Deep Additive Margin Hashing

In this section, we present deep additive margin hashing (DAMH) in detail, including the model formulation and the learning algorithm.

3.1. Model

The proposed DAMH is shown in Figure 1 and consists of two components, i.e., a feature learning part and an objective function part.

[Figure 1: The end-to-end deep learning framework of DAMH. The feature learning part (convolutions and pooling) produces real-valued features h; the objective function part combines the AM-Softmax loss with a quantization loss between h and the binary codes b.]

Table 1: CNN architecture modified from ResNet-34 for spectrogram input. K is the binary code length and C is the number of speakers in the training set.

    Layer                 | Configuration
    ----------------------|-------------------------------------------
    conv1                 | 7 × 7, 64, stride 2
    max pooling           | 3 × 3 max pooling, stride 2
    conv2_x               | [3 × 3, 64, BN, ReLU; 3 × 3, 64, BN] × 3
    conv3_x               | [3 × 3, 128, BN, ReLU; 3 × 3, 128, BN] × 4
    conv4_x               | [3 × 3, 256, BN, ReLU; 3 × 3, 256, BN] × 6
    conv5_x               | [3 × 3, 512, BN, ReLU; 3 × 3, 512, BN] × 3
    conv6                 | 16 × 1, 512, stride 1
    avg. pooling          | 1 × 1 adaptive avg. pooling, stride 1
    hash layer            | 512 × K, sign
    classification layer  | K × C

3.1.1. Feature Learning Part

The feature learning part of the DAMH model uses a convolutional neural network (CNN) architecture modified from ResNet-34 [14], which is shown in Table 1. The CNN model contains six groups of convolutional layers, one max pooling layer, one average pooling layer, one hash layer and one classification layer. The first convolutional layer contains 64 convolution filters with kernel size 7 × 7. Following the first convolutional layer, a max pooling layer is adopted. The second to fifth groups of convolutional layers, which contain 3, 4, 6 and 3 blocks respectively, are designed in a skip-connection style. The first convolutional layer in each block is followed by a batch normalization (BN) layer and a ReLU layer successively; the second convolutional layer in each block is followed only by a BN layer. After the fifth group of convolutional layers conv5_x, we get 512 channels of 16 × t intermediate feature maps, where t is determined by the length of the audio segment. The sixth convolutional layer conv6 is then employed to combine local frequency features for each channel, followed by an adaptive average pooling layer avg. pooling that computes a temporal mean over the t frames. These modifications make the model focus on frequency variance instead of temporal position, which improves its capability of capturing speaker information. After that, a hash layer transforms the feature from a 512-dimensional real-valued vector into a binary code vector of K bits. The binary code is then used as the input of the classification layer.

Please note that in DAMH the architecture of ResNet-34 is used only as an example for illustration; it may be replaced by other network architectures.
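To make the shape bookkeeping after conv5_x concrete, here is a minimal PyTorch sketch of the conv6 / average pooling / hash-layer head. This is our illustrative reconstruction under the shapes in Table 1, not the authors' code; the ResNet-34 trunk producing the (B, 512, 16, t) feature map is assumed and omitted, and the exact filter grouping of conv6 is an assumption:

```python
import torch
import torch.nn as nn

class DAMHHead(nn.Module):
    """Layers after conv5_x: conv6 + adaptive avg. pooling + hash layer.

    Input is assumed to be the (B, 512, 16, t) feature map produced by
    the modified ResNet-34 trunk; K is the binary code length.
    """
    def __init__(self, K: int = 64):
        super().__init__()
        # conv6: 16x1 kernel collapses the 16-bin frequency axis.
        self.conv6 = nn.Conv2d(512, 512, kernel_size=(16, 1), stride=1)
        # Adaptive pooling: temporal mean over the remaining t frames.
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        # Hash layer: 512-d real feature -> K-d code logits.
        self.hash = nn.Linear(512, K)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.conv6(feat)               # (B, 512, 1, t)
        x = self.avg_pool(x).flatten(1)    # (B, 512)
        h = torch.tanh(self.hash(x))       # relaxed codes in (-1, 1), see Sec. 3.1.2
        return h

head = DAMHHead(K=64)
feat = torch.randn(2, 512, 16, 300)        # two stand-in 3-second-ish segments
h = head(feat)                             # (2, 64); b = torch.sign(h) at test time
```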
3.1.2. Objective Function Part

Given an input utterance $\mathbf{x}_i$, we define the output of the hash layer as $\mathbf{b}_i = \mathrm{sign}(f(\mathbf{x}_i; \Theta_{cnn})) \in \{-1, +1\}^K$, where $\Theta_{cnn}$ denotes the parameters of the CNN architecture except for the classification layer. We then adopt the binary codes as the input of the classification layer. Given $N$ training examples, we define the objective function with the additive margin softmax (AM-Softmax) loss [15] as follows:

$$\min\ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\theta_{y_i,i} - m)}}{e^{s(\theta_{y_i,i} - m)} + \sum_{j=1, j \neq y_i}^{C} e^{s \cdot \theta_{j,i}}}, \quad \text{s.t.}\ \mathbf{b}_i \in \{-1, +1\}^K,\ \forall i \in \{1, \dots, N\}, \tag{1}$$

where $y_i \in \{1, \dots, C\}$ denotes the class label of input $\mathbf{x}_i$, and $\theta_{j,i}$ is the cosine similarity between $\mathbf{W}_{*j}$ and $\mathbf{b}_i$, i.e., $\theta_{j,i} = \frac{\mathbf{W}_{*j}^\top \mathbf{b}_i}{\|\mathbf{W}_{*j}\| \|\mathbf{b}_i\|}$. Here, $\mathbf{W}$ denotes the parameters of the classification layer, $\mathbf{W}_{*j}$ is the $j$th column of $\mathbf{W}$, $\mathbf{b}_i$ denotes the binary code of $\mathbf{x}_i$, $m$ is the additive margin, and $s$ is a scaling hyper-parameter. By minimizing the objective function $\mathcal{L}$ defined in problem (1), training examples of the same class are mapped to similar binary codes, with smaller Hamming distances than those between training examples from different classes.

However, as the sign function is adopted to get the binary code in the hash layer, we cannot back-propagate the gradient to $\Theta_{cnn}$ due to the zero-gradient problem. In this paper, we utilize $\tanh(\cdot)$ to approximate $\mathrm{sign}(\cdot)$ and rewrite problem (1) in the following form:

$$\min\ \widetilde{\mathcal{L}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\tilde{\theta}_{y_i,i} - m)}}{e^{s(\tilde{\theta}_{y_i,i} - m)} + \sum_{j=1, j \neq y_i}^{C} e^{s \cdot \tilde{\theta}_{j,i}}} + \frac{\lambda}{N} \sum_{i=1}^{N} \|\mathbf{b}_i - \mathbf{h}_i\|_2^2, \quad \text{s.t.}\ \mathbf{b}_i \in \{-1, +1\}^K,\ \forall i \in \{1, \dots, N\}, \tag{2}$$

where $\mathbf{h}_i = \tanh(f(\mathbf{x}_i; \Theta_{cnn}))$, $\tilde{\theta}_{j,i} = \frac{\mathbf{W}_{*j}^\top \mathbf{h}_i}{\|\mathbf{W}_{*j}\| \|\mathbf{h}_i\|}$, and $\lambda$ is a hyper-parameter.

3.2. Learning

We adopt an alternating learning algorithm to learn the binary codes $\{\mathbf{b}_i\}_{i=1}^N$ and the neural network parameters $\Theta = \{\Theta_{cnn}; \mathbf{W}\}$. More specifically, we learn one group of parameters with the other group fixed.

3.2.1. Update $\{\mathbf{b}_i\}_{i=1}^N$ with $\Theta$ Fixed

When $\Theta$ is fixed, we can rewrite problem (2) as follows:

$$\min\ \widetilde{\mathcal{L}}(\{\mathbf{b}_i\}_{i=1}^N) = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{b}_i - \mathbf{h}_i\|_2^2 = -\frac{2}{N} \sum_{i=1}^{N} \mathbf{b}_i^\top \mathbf{h}_i + \text{const}, \quad \text{s.t.}\ \mathbf{b}_i \in \{-1, +1\}^K,\ \forall i \in \{1, \dots, N\},$$

where $\text{const}$ denotes a constant. Each element of the binary code vector $\mathbf{b}_i$ should keep the same sign as the corresponding element of $\mathbf{h}_i$ to maximize $\mathbf{b}_i^\top \mathbf{h}_i$. Thus we get the following closed-form solution:

$$\mathbf{b}_i = \mathrm{sign}(\mathbf{h}_i), \quad \forall i \in \{1, \dots, N\},$$

where $\mathrm{sign}(\cdot)$ is the element-wise sign function.

3.2.2. Update $\Theta$ with $\{\mathbf{b}_i\}_{i=1}^N$ Fixed

When $\{\mathbf{b}_i\}_{i=1}^N$ is fixed, we can utilize back-propagation to update $\Theta$ according to the following gradients:

$$\frac{\partial \widetilde{\mathcal{L}}}{\partial \mathbf{W}_{*j}} = \sum_{i=1}^{N} \frac{\partial \widetilde{\mathcal{L}}}{\partial \tilde{\theta}_{j,i}} \frac{1}{\|\mathbf{W}_{*j}\| \|\mathbf{h}_i\|} \left( \mathbf{h}_i - \mathbf{W}_{*j}^\top \mathbf{h}_i \frac{\mathbf{W}_{*j}}{\|\mathbf{W}_{*j}\|^2} \right), \tag{3}$$

$$\frac{\partial \widetilde{\mathcal{L}}}{\partial \mathbf{h}_i} = \sum_{j=1}^{C} \frac{\partial \widetilde{\mathcal{L}}}{\partial \tilde{\theta}_{j,i}} \frac{1}{\|\mathbf{W}_{*j}\| \|\mathbf{h}_i\|} \left( \mathbf{W}_{*j} - \mathbf{W}_{*j}^\top \mathbf{h}_i \frac{\mathbf{h}_i}{\|\mathbf{h}_i\|^2} \right) - \frac{2\lambda}{N} (\mathbf{b}_i - \mathbf{h}_i). \tag{4}$$

We can then use the chain rule to compute $\partial \widetilde{\mathcal{L}} / \partial \Theta_{cnn}$. Based on the computed gradients, we utilize mini-batch stochastic gradient descent (SGD) to update $\Theta$. The whole learning procedure for DAMH is summarized in Algorithm 1.

Algorithm 1: Learning algorithm for DAMH
  Input: $X = \{\mathbf{x}_i\}_{i=1}^N$: training utterances; $y = \{y_i\}_{i=1}^N$: speaker identities of the training utterances; $K$: binary code length.
  Output: $\Theta$ and $\{\mathbf{b}_i\}_{i=1}^N$.
   1: Initialize the deep neural network parameters $\Theta$, the mini-batch size $M$ and the iteration number $T$;
   2: for iter = 1 → T do
   3:   for k = 1 → N/M do
   4:     Randomly select $M$ samples to construct a mini-batch;
   5:     Calculate $\mathbf{h}_i$ by forward propagation for each $\mathbf{x}_i$ in the mini-batch;
   6:     Update $\mathbf{b}_i$ according to $\mathbf{b}_i = \mathrm{sign}(\mathbf{h}_i)$;
   7:     Calculate the gradients $\partial \widetilde{\mathcal{L}} / \partial \mathbf{W}$ and $\partial \widetilde{\mathcal{L}} / \partial \mathbf{h}_i$ according to (3) and (4);
   8:     Calculate the gradient $\partial \widetilde{\mathcal{L}} / \partial \Theta_{cnn}$ using the chain rule;
   9:     Update $\Theta$ with the mini-batch SGD algorithm;
  10:     Gradually increase the margin $m$;
  11:   end for
  12: end for
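As a complement to Algorithm 1, the following PyTorch sketch shows one alternating step under the relaxed objective (2). It is a minimal illustration, not the authors' released code: the network forward pass is replaced by a random stand-in tensor, autograd computes the gradients in place of the hand-derived forms (3) and (4), and s, m and λ follow the settings reported in Section 4.2:

```python
import torch
import torch.nn.functional as F

def damh_loss(h, W, labels, b, s=30.0, m=0.35, lam=0.1 / 64):
    """Relaxed objective (2): AM-Softmax over cosine(h, W) plus quantization loss.

    h: (B, K) tanh outputs of the hash layer; W: (K, C) classification weights;
    labels: (B,) speaker ids; b: (B, K) fixed binary codes in {-1, +1}.
    The batch mean stands in for the 1/N factor of the full objective.
    """
    cos = F.normalize(h, dim=1) @ F.normalize(W, dim=0)        # (B, C) cosine sims
    margin = m * F.one_hot(labels, num_classes=W.shape[1]).to(cos.dtype)
    am_softmax = F.cross_entropy(s * (cos - margin), labels)   # margin on target class
    quantization = lam * ((b - h) ** 2).sum(dim=1).mean()      # λ · mean ‖b_i − h_i‖²
    return am_softmax + quantization

# One alternating step (sizes are illustrative):
B, K, C = 8, 64, 100
z = torch.randn(B, K, requires_grad=True)     # stand-in for pre-tanh hash-layer output
W = torch.randn(K, C, requires_grad=True)     # classification layer parameters
labels = torch.randint(0, C, (B,))
h = torch.tanh(z)                             # relaxed codes h_i = tanh(f(x_i; Θcnn))
b = torch.sign(h).detach()                    # step 1: closed-form b_i = sign(h_i)
loss = damh_loss(h, W, labels, b)             # step 2: update Θ with b fixed
loss.backward()                               # autograd in place of gradients (3)-(4)
```

Detaching b before the loss mirrors the alternating scheme of Algorithm 1: the quantization term pulls h toward the fixed codes while the AM-Softmax term keeps them discriminative.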
4. Experiment

To verify the effectiveness of DAMH, we carry out experiments on a workstation with an 8-core Intel(R) CPU E5-2620 v4 @ 2.1 GHz, 128 GB RAM and an NVIDIA(R) TITAN Xp GPU.

4.1. Dataset

VoxCeleb2 [2] is a widely used dataset for the speaker recognition (identification) task. We use this dataset to evaluate DAMH and the baselines.
VoxCeleb2 contains utterances collected from YouTube videos, covering thousands of speakers who span different races and a wide range of accents. Background noise from a large number of environments and overlapping speech make speaker identification and retrieval challenging on this dataset.

We remove speakers with fewer than one hundred utterances. The remaining 958,187 utterances from 3,641 speakers are divided into a training set, a validation set and a test set. The validation set and the test set contain five and ten utterances per speaker, respectively. Details of the split are given in Table 2.

Table 2: Training/validation/test split. POI: person of interest.

    Dataset: VoxCeleb2  | Training | Validation | Test
    --------------------|----------|------------|--------
    # of POIs           |          3,641 (all splits)
    # of utterances     | 903,572  | 18,205     | 36,410

4.2. Baselines and Evaluation Protocols

Two speaker hashing methods, LSH [9, 1] and HDML [13, 3], are selected as baselines. Besides these two methods, other hashing methods can also be utilized for speaker identification and retrieval. We choose two representative hashing methods, iterative quantization (ITQ) [10] and isotropic hashing (IsoH) [16], as baselines for a comprehensive comparison.

We utilize a Gaussian mixture model - universal background model (GMM-UBM) [17] to extract the i-vectors. Specifically, the GMM-UBM uses 20-dimensional mel-frequency cepstral coefficients (MFCC) as input and extracts 400-dimensional i-vectors. After that, we use linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) [18] to reduce the dimensionality of the i-vectors to 150.

For our proposed DAMH, we randomly slice a 3-second segment from each original utterance for training. A sliding Hamming window is used to compute the spectrogram of each utterance; the feature length, window width and step size are set to 512, 25 ms and 10 ms, respectively. Normalization along the frequency axis is performed on the features. The margin $m$ of the DAMH loss function is set to a small value at the beginning and gradually increases during training; it is fixed after reaching 0.35. $s$ and $\lambda$ are set to 30.0 and $0.1 \times \frac{1}{K}$, respectively, where $K$ is the length of the binary codes. We set the mini-batch size to 64 and tune the learning rate from $10^{-2}$ to $10^{-5}$. Each model is trained for 36 epochs, and the average training time per epoch is 178 minutes.

We select top-1 accuracy and mean average precision (MAP) to evaluate the proposed DAMH and the baselines for speaker identification and retrieval, respectively. Furthermore, we report the retrieval time and storage cost of the real-value based methods and our DAMH to verify the efficiency of DAMH.
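To make the retrieval protocol and the efficiency claim of Section 4.4 concrete, the following NumPy sketch (ours, not the authors' code) ranks packed binary codes by Hamming distance via XOR and popcount, and computes top-1 accuracy on random stand-in codes and labels:

```python
import numpy as np

def pack(codes: np.ndarray) -> np.ndarray:
    """Pack {-1, +1} codes of shape (n, K) into uint8 bit arrays."""
    return np.packbits(codes > 0, axis=1)

def hamming(query: np.ndarray, db: np.ndarray) -> np.ndarray:
    """Hamming distances between one packed query and all packed db codes."""
    xor = np.bitwise_xor(db, query)                  # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per db item

rng = np.random.default_rng(0)
K, n_db, n_q = 64, 1000, 50
db_packed = pack(rng.choice([-1, 1], size=(n_db, K)))
db_labels = rng.integers(0, 30, size=n_db)
queries = pack(rng.choice([-1, 1], size=(n_q, K)))
q_labels = rng.integers(0, 30, size=n_q)

top1_hits = sum(
    db_labels[np.argmin(hamming(q, db_packed))] == y
    for q, y in zip(queries, q_labels)
)
print(f"top-1 accuracy: {top1_hits / n_q:.2%}")      # ~chance level on random codes
```

A production system would use hardware popcount on packed 64-bit words instead of unpackbits, but the ranking logic is the same.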
4.3. Accuracy

The top-1 accuracy for the speaker identification task is presented in Table 3, with binary code lengths of 32, 64, 96, 128 and 256. As the dimensionality of the i-vector is less than 256, the accuracy of IsoH and ITQ, which are based on principal component analysis (PCA), cannot be calculated for a code length of 256. Besides all hashing baselines, we also include two real-value based methods, i-vector and AM-Softmax, for comparison. Here, AM-Softmax denotes the method using real-valued features learned with a variant of DAMH without the binary constraint. We can see that our proposed DAMH outperforms all hashing baselines and achieves the highest accuracy. Comparing DAMH with the real-value based methods, we can see that DAMH outperforms i-vector, and DAMH with 256 bits achieves accuracy comparable to AM-Softmax.

Table 3: Top-1 accuracy (%) of speaker identification.

    Method       |              Code length K
                 |   32  |   64  |   96  |  128  |  256
    -------------|-------|-------|-------|-------|------
    DAMH         | 89.98 | 94.72 | 96.16 | 97.74 | 98.19
    IsoH         | 27.81 | 54.24 | 65.54 | 71.67 |  N/A
    ITQ          | 28.88 | 55.82 | 67.19 | 72.93 |  N/A
    HDML         | 27.58 | 55.48 | 67.25 | 73.03 | 82.33
    LSH          |  9.13 | 28.01 | 44.31 | 54.23 | 75.80
    i-vector     | 93.81 (real-valued)
    AM-Softmax   | 98.65 (real-valued)

Table 4: MAP (%) of speaker retrieval.

    Method       |              Code length K
                 |   32  |   64  |   96  |  128  |  256
    -------------|-------|-------|-------|-------|------
    DAMH         | 72.87 | 88.18 | 90.38 | 92.20 | 94.55
    IsoH         |  4.95 | 10.90 | 14.28 | 16.28 |  N/A
    ITQ          |  5.78 | 12.59 | 16.46 | 18.60 |  N/A
    HDML         |  6.55 | 14.65 | 43.23 | 49.89 | 61.43
    LSH          |  0.79 |  2.86 |  5.98 |  8.65 | 18.20
    i-vector     | 27.70 (real-valued)
    AM-Softmax   | 95.82 (real-valued)

In Table 4, we present the MAP for the speaker retrieval task. From Table 4, we can see that DAMH outperforms all hashing baselines in all cases. Furthermore, DAMH outperforms i-vector, and DAMH with 256 bits achieves MAP comparable to AM-Softmax.

4.4. Efficiency

In real applications, real-value based speaker identification and retrieval methods might be impractical for massive audio data. Hashing based retrieval enables fast queries based on the binary representation.

We report the retrieval time of DAMH and the real-value based methods in Table 5. From Table 5, we can see that DAMH is faster than the real-value based methods while achieving comparable top-1 accuracy. DAMH also reduces the storage cost compared with the real-value based methods. Hence, DAMH is more practical than real-value based methods in real applications.

Table 5: Top-1 accuracy (%), retrieval time (in seconds) and database storage cost (in MB) of speaker identification.

    Method       | #bits / dim | Accuracy |  Time  | Storage cost
    -------------|-------------|----------|--------|-------------
    DAMH         | 64 bits     |  94.72   | 0.1327 |     5
    DAMH         | 256 bits    |  98.19   | 0.1660 |    23
    i-vector     | 150         |  93.81   | 0.3395 |  1018
    AM-Softmax   | 512         |  98.65   | 0.6509 |  3539

5. Conclusion

In this paper, we propose a novel deep hashing method, called deep additive margin hashing (DAMH), for speaker identification and retrieval. To the best of our knowledge, DAMH is the first deep hashing method for the speaker identification and retrieval task. Experiments on a large-scale audio dataset show that our proposed DAMH outperforms the baselines and achieves state-of-the-art retrieval performance.

6. Acknowledgement

This work is supported by the NSFC-NRF Joint Research Project (No. 61861146001).
7. References

[1] L. Schmidt, M. Sharifi, and I. Lopez-Moreno, "Large-scale speaker identification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 1650–1654.
[2] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Annual Conference of the International Speech Communication Association, 2018, pp. 1086–1090.
[3] L. Li, C. Xing, D. Wang, K. Yu, and T. F. Zheng, "Binary speaker embedding," in International Symposium on Chinese Spoken Language Processing, 2016, pp. 1–4.
[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech & Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[5] E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 4052–4056.
[6] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329–5333.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[8] Y.-Q. Yu, L. Fan, and W.-J. Li, "Ensemble additive margin softmax for speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6046–6050.
[9] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in ACM Symposium on Computational Geometry, 2004, pp. 253–262.
[10] Y. Gong and S. Lazebnik, "Iterative quantization: A procrustean approach to learning binary codes," in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 817–824.
[11] W.-J. Li, S. Wang, and W.-C. Kang, "Feature learning based deep supervised hashing with pairwise labels," in International Joint Conference on Artificial Intelligence, 2016, pp. 1711–1717.
[12] Q.-Y. Jiang, X. Cui, and W.-J. Li, "Deep discrete supervised hashing," IEEE Transactions on Image Processing, vol. 27, no. 12, pp. 5996–6009, 2018.
[13] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, "Hamming distance metric learning," in Annual Conference on Neural Information Processing Systems, 2012, pp. 1070–1078.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[15] F. Wang, J. Cheng, W. Liu, and H. Liu, "Additive margin softmax for face verification," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
[16] W. Kong and W.-J. Li, "Isotropic hashing," in Annual Conference on Neural Information Processing Systems, 2012, pp. 1655–1663.
[17] D. Gutman and Y. Bistritz, "Speaker verification using phoneme-adapted Gaussian mixture models," in European Signal Processing Conference, 2002, pp. 1–4.
[18] A. O. Hatch, S. S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in International Conference on Spoken Language Processing, 2006, pp. 1471–1474.