other network architectures.

3.1.2. Objective Function Part

Given an input $\mathbf{x}_i$, we define the output of the hash layer as $\mathbf{b}_i = \mathrm{sign}(f(\mathbf{x}_i; \Theta_{cnn})) \in \{-1,+1\}^K$, where $\Theta_{cnn}$ denotes the parameters of the CNN architecture except for the classification layer. We then adopt the binary codes as the input of the classification layer. Given $N$ training examples, we define the objective function with the additive margin softmax (AM-Softmax) loss [15] as follows:

$$\min \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log\frac{e^{s(\theta_{y_i,i}-m)}}{e^{s(\theta_{y_i,i}-m)} + \sum_{j=1,\,j\neq y_i}^{C} e^{s\cdot\theta_{j,i}}}, \quad \text{s.t. } \mathbf{b}_i \in \{-1,+1\}^K,\ \forall i \in \{1,\ldots,N\}, \quad (1)$$

where $y_i \in \{1,\ldots,C\}$ denotes the class label of input $\mathbf{x}_i$, and $\theta_{j,i}$ is the cosine similarity between $\mathbf{W}_{*j}$ and $\mathbf{b}_i$, i.e., $\theta_{j,i} = \frac{\mathbf{W}_{*j}^{T}\mathbf{b}_i}{\|\mathbf{W}_{*j}\|\,\|\mathbf{b}_i\|}$. Here, $\mathbf{W}$ denotes the parameters of the classification layer, $\mathbf{W}_{*j}$ is the $j$th column of $\mathbf{W}$, $\mathbf{b}_i$ denotes the binary code of $\mathbf{x}_i$, $m$ is the additive margin, and $s$ is a scaling hyper-parameter. By minimizing the objective function $\mathcal{L}$ defined in problem (1), training examples of the same class are mapped to similar binary codes, with smaller Hamming distances than those between training examples from different classes.

However, as the sign function is adopted to obtain the binary code in the hash layer, we cannot back-propagate the gradient to $\Theta_{cnn}$ due to the zero-gradient problem. In this paper, we utilize $\tanh(\cdot)$ to approximate $\mathrm{sign}(\cdot)$ and rewrite problem (1) in the following form:

$$\min \widetilde{\mathcal{L}} = -\frac{1}{N}\sum_{i=1}^{N} \log\frac{e^{s(\widetilde{\theta}_{y_i,i}-m)}}{e^{s(\widetilde{\theta}_{y_i,i}-m)} + \sum_{j=1,\,j\neq y_i}^{C} e^{s\cdot\widetilde{\theta}_{j,i}}} + \frac{\lambda}{N}\sum_{i=1}^{N}\|\mathbf{b}_i - \mathbf{h}_i\|_2^2, \quad \text{s.t. } \mathbf{b}_i \in \{-1,+1\}^K,\ \forall i \in \{1,\ldots,N\}, \quad (2)$$

where $\mathbf{h}_i = \tanh(f(\mathbf{x}_i; \Theta_{cnn}))$, $\widetilde{\theta}_{j,i} = \frac{\mathbf{W}_{*j}^{T}\mathbf{h}_i}{\|\mathbf{W}_{*j}\|\,\|\mathbf{h}_i\|}$, and $\lambda$ is a hyper-parameter.
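To make the relaxed objective in problem (2) concrete, the following is a minimal PyTorch-style sketch of the AM-Softmax term plus the quantization penalty. The module name, the default values of s, m, and lambda, and the initialization of the classification weights are illustrative assumptions, not settings taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelaxedAMSoftmaxLoss(nn.Module):
    """Sketch of Eq. (2): AM-Softmax on relaxed codes h_i plus a quantization term."""

    def __init__(self, code_length, num_classes, s=30.0, m=0.2, lam=0.1):
        super().__init__()
        # W: parameters of the classification layer, one column per class.
        self.W = nn.Parameter(0.01 * torch.randn(code_length, num_classes))
        self.s, self.m, self.lam = s, m, lam

    def forward(self, features, labels):
        # h_i = tanh(f(x_i; Theta_cnn)): relaxed, real-valued codes.
        h = torch.tanh(features)
        # b_i = sign(h_i): binary codes; detached since sign has zero gradient.
        b = torch.sign(h).detach()
        # theta_tilde_{j,i}: cosine similarity between W_{*j} and h_i.
        cos = F.normalize(h, dim=1) @ F.normalize(self.W, dim=0)
        # AM-Softmax: subtract the margin m from the target-class cosine, scale by s.
        onehot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.s * (cos - self.m * onehot)
        am_loss = F.cross_entropy(logits, labels)
        # Quantization penalty: (lambda / batch size) * sum_i ||b_i - h_i||_2^2.
        quant = self.lam * (b - h).pow(2).sum(dim=1).mean()
        return am_loss + quant, b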
3.2. Learning

We adopt an alternating learning algorithm to learn the binary codes $\{\mathbf{b}_i\}_{i=1}^N$ and the neural network parameters $\Theta = \{\Theta_{cnn}; \mathbf{W}\}$. More specifically, we learn one group of parameters with the other group of parameters fixed.

3.2.1. Update $\{\mathbf{b}_i\}_{i=1}^N$ with $\Theta$ Fixed

When $\Theta$ is fixed, we can rewrite problem (2) as follows:

$$\min \widetilde{\mathcal{L}}(\{\mathbf{b}_i\}_{i=1}^N) = \frac{1}{N}\sum_{i=1}^{N}\|\mathbf{b}_i - \mathbf{h}_i\|_2^2 = -\frac{2}{N}\sum_{i=1}^{N}\mathbf{b}_i^{T}\mathbf{h}_i + \mathrm{const}, \quad \text{s.t. } \mathbf{b}_i \in \{-1,+1\}^K,\ \forall i \in \{1,\ldots,N\},$$

where $\mathrm{const}$ denotes a constant. The elements of the binary code vector $\mathbf{b}_i$ should keep the same sign as the corresponding elements of $\mathbf{h}_i$ to maximize $\mathbf{b}_i^{T}\mathbf{h}_i$. Thus we obtain the following closed-form solution:

$$\mathbf{b}_i = \mathrm{sign}(\mathbf{h}_i), \quad \forall i \in \{1,\ldots,N\}.$$

Here, $\mathrm{sign}(\cdot)$ is an element-wise sign function.

3.2.2. Update $\Theta$ with $\{\mathbf{b}_i\}_{i=1}^N$ Fixed

When $\{\mathbf{b}_i\}_{i=1}^N$ is fixed, we can utilize back-propagation to update $\Theta$ according to the following gradients:

$$\frac{\partial\widetilde{\mathcal{L}}}{\partial\mathbf{W}_{*j}} = \sum_{i=1}^{N}\left[\frac{\partial\widetilde{\mathcal{L}}}{\partial\widetilde{\theta}_{j,i}}\,\frac{1}{\|\mathbf{W}_{*j}\|\,\|\mathbf{h}_i\|}\left(\mathbf{h}_i - \frac{\mathbf{W}_{*j}^{T}\mathbf{h}_i}{\|\mathbf{W}_{*j}\|^2}\,\mathbf{W}_{*j}\right)\right], \quad (3)$$

$$\frac{\partial\widetilde{\mathcal{L}}}{\partial\mathbf{h}_i} = \sum_{j=1}^{C}\left[\frac{\partial\widetilde{\mathcal{L}}}{\partial\widetilde{\theta}_{j,i}}\,\frac{1}{\|\mathbf{W}_{*j}\|\,\|\mathbf{h}_i\|}\left(\mathbf{W}_{*j} - \frac{\mathbf{W}_{*j}^{T}\mathbf{h}_i}{\|\mathbf{h}_i\|^2}\,\mathbf{h}_i\right)\right] - \frac{2\lambda}{N}(\mathbf{b}_i - \mathbf{h}_i). \quad (4)$$

Then we can use the chain rule to compute $\frac{\partial\widetilde{\mathcal{L}}}{\partial\Theta_{cnn}}$. Based on the computed gradients, we utilize mini-batch stochastic gradient descent (SGD) to update $\Theta$. The whole learning procedure for DAMH is summarized in Algorithm 1.

Algorithm 1: Learning algorithm for DAMH
Input: $X = \{\mathbf{x}_i\}_{i=1}^N$: training utterances; $\mathbf{y} = \{y_i\}_{i=1}^N$: person identities for the training utterances; $K$: binary code length.
Output: $\Theta$ and $\{\mathbf{b}_i\}_{i=1}^N$.
1: Procedure
2: Initialize the deep neural network parameters $\Theta$, the mini-batch size $M$, and the iteration number $T$;
3: for iter $= 1 \to T$ do
4:   for $k = 1 \to N/M$ do
5:     Randomly select $M$ samples to construct a mini-batch;
6:     Calculate $\mathbf{h}_i$ by forward propagation for each $\mathbf{x}_i$ in the mini-batch;
7:     Update $\mathbf{b}_i$ according to $\mathbf{b}_i = \mathrm{sign}(\mathbf{h}_i)$;
8:     Calculate the gradients $\frac{\partial\widetilde{\mathcal{L}}}{\partial\mathbf{W}}$ and $\frac{\partial\widetilde{\mathcal{L}}}{\partial\mathbf{h}_i}$ according to (3) and (4);
9:     Calculate the gradient $\frac{\partial\widetilde{\mathcal{L}}}{\partial\Theta_{cnn}}$ using the chain rule;
10:    Update $\Theta$ based on the mini-batch SGD algorithm;
11:    Increase the margin $m$ gradually;
12:   end for
13: end for
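The alternating procedure in Algorithm 1 can be sketched as a standard mini-batch loop. The sketch below relies on automatic differentiation rather than the hand-derived gradients in Eqs. (3) and (4), which autograd reproduces for this objective; the encoder, data loader, learning rate, and margin schedule (here increased once per epoch for simplicity) are illustrative assumptions rather than the paper's settings.

import torch

def train_damh(encoder, criterion, loader, num_epochs=30, lr=0.01,
               margin_start=0.0, margin_end=0.2):
    """Hypothetical training loop following Algorithm 1; `encoder` stands in for
    f(.; Theta_cnn) and `criterion` for the RelaxedAMSoftmaxLoss sketched above."""
    params = list(encoder.parameters()) + list(criterion.parameters())
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for epoch in range(num_epochs):
        # Increase the additive margin m gradually (line 11 of Algorithm 1),
        # here on a per-epoch linear schedule.
        criterion.m = margin_start + (margin_end - margin_start) * epoch / max(1, num_epochs - 1)
        for x, y in loader:
            # Forward propagation gives h_i; b_i = sign(h_i) is updated inside
            # the criterion (lines 6-7 of Algorithm 1).
            features = encoder(x)
            loss, codes = criterion(features, y)
            # Back-propagate through h_i and W (Eqs. (3)-(4) plus the chain rule),
            # then update Theta = {Theta_cnn; W} with mini-batch SGD (lines 8-10).
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder, criterion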
4. Experiment

To verify the effectiveness of DAMH, we carry out experiments on a workstation with an Intel(R) CPU E5-2620 V4 @ 2.1 GHz with 8 cores, 128 GB of RAM, and an NVIDIA(R) TITAN Xp GPU.

4.1. Dataset

VoxCeleb2 [2] is a widely used dataset for the speaker recognition (identification) task. We use this dataset to evaluate DAMH and the baselines. VoxCeleb2 collects utterances from YouTube videos containing thousands of speakers who span different races and a wide range of accents. Background noise from a large number of environments and over-