INTERSPEECH 2019, September 15-19, 2019, Graz, Austria

Deep Hashing for Speaker Identification and Retrieval

Lei Fan, Qing-Yuan Jiang, Ya-Qi Yu, Wu-Jun Li

National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, P. R. China
{fanl,jiangqy,yuyq}@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Speaker identification and retrieval have been widely used in real applications. To overcome the inefficiency caused by real-valued representations, several speaker hashing methods have been proposed that learn binary codes as representations for speaker identification and retrieval. However, these hashing methods are based on i-vector and cannot achieve satisfactory retrieval accuracy, because they cannot learn discriminative feature representations. In this paper, we propose a novel deep hashing method, called deep additive margin hashing (DAMH), to improve retrieval performance for speaker identification and retrieval tasks. Compared with existing speaker hashing methods, DAMH can perform feature learning and binary code learning seamlessly by incorporating these two procedures into an end-to-end architecture. Experimental results on the large-scale audio dataset VoxCeleb2 show that DAMH can outperform existing speaker hashing methods to achieve state-of-the-art performance.

Index Terms: speaker identification and retrieval, deep hashing, additive margin softmax, deep additive margin hashing

1. Introduction

Speaker identification and retrieval [1, 2, 3] have been widely used in real applications, including automatic access control of banking services, financial transactions, and detection of speakers in complex scenes. Both speaker identification and retrieval can be realized by a retrieval procedure¹. To realize the retrieval procedure, one common solution is to first embed utterances into low-dimensional representations, which is also called speaker embedding, and then perform retrieval based on the low-dimensional representations.

¹ In some cases, speaker identification can also be realized by classification. In this paper, we only focus on retrieval based speaker identification.

Over the past decades, real-value based speaker embedding has made good progress and achieved promising accuracy [2, 4, 5, 6]. I-vector based approaches [4], which project the Gaussian mixture model (GMM) super vector into a low-dimensional vector, have dominated the field of speaker embedding. I-vector based systems are robust and accurate when utterances are of sufficient length [4]. Nevertheless, long speech is not always available in real applications. With the upsurge of deep learning, many works have recently been devoted to deep neural networks (DNN) [2, 5, 6] and achieved promising performance due to the powerful modeling capacity of DNN. DNN based systems can outperform i-vector based systems in the case of short utterances, which are more common and practical than long utterances in real applications. Since i-vector and DNN based methods are real-value based methods, they usually suffer from high storage cost and low retrieval speed in real applications with large-scale datasets.

To enable fast query and reduce storage cost, some hashing methods [1, 3], also called speaker hashing methods, have been proposed for speaker identification and retrieval. By representing each utterance as a binary code, speaker hashing can reduce the storage cost dramatically. Furthermore, constant or sub-linear query speed can be achieved based on binary codes. However, existing speaker hashing methods [1, 3] are based on i-vector. Specifically, each utterance is represented as an i-vector in the first stage. Then a hash function is used to generate binary codes for utterances in the second stage. On one hand, their retrieval performance is limited by the i-vector representations. On the other hand, existing speaker hashing methods are two-stage methods and cannot learn features that are optimally compatible with hashing. Hence, the retrieval performance of these methods is far from satisfactory in real applications.

To overcome the drawbacks of existing speaker hashing methods, in this paper we propose a novel deep hashing method, called deep additive margin hashing (DAMH). The contributions of this paper are listed as follows:

• DAMH is an end-to-end deep hashing method for speaker identification and retrieval. To the best of our knowledge, DAMH is the first deep hashing method for speaker identification and retrieval tasks. Compared with existing speaker hashing methods, DAMH can perform audio feature learning and binary code learning simultaneously. Hence, these two procedures can give feedback to each other.

• DAMH utilizes the additive margin softmax loss to supervise speaker hashing. The angular margin added in the loss makes the learned binary codes more discriminative.

• Experiments on the large-scale audio dataset VoxCeleb2 demonstrate that DAMH can outperform existing speaker hashing methods to achieve state-of-the-art performance.

2. Related Works

In this section, we briefly review the related works, including real-value based speaker embedding and speaker hashing.

2.1. Real-value based Speaker Embedding

To perform speaker embedding, i-vector [4] was proposed to represent the GMM super vector in a single total variability space instead of two distinct spaces, i.e., speaker space and channel space. Modeling all variability as a single manifold, as in this total variability model (TVM), gives superior performance. The i-vector is the vector of latent factors which represent the speaker information of a given utterance. After the TVM model

Copyright © 2019 ISCA. http://dx.doi.org/10.21437/Interspeech.2019-2457
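The storage and query-speed argument for binary codes made in the Introduction can be illustrated with a minimal sketch. It is not the DAMH method: the sign-threshold binarization, the 64-dimensional random embeddings, and the function names `binarize` and `hamming_rank` are all assumptions introduced for illustration. The sketch packs each code into bytes and ranks a database by Hamming distance using XOR and a popcount lookup table.

```python
import numpy as np

# Popcount lookup table: number of set bits for every byte value 0..255.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def binarize(embeddings):
    """Threshold real-valued embeddings at zero and pack 8 bits per byte.

    Sign thresholding is an illustrative assumption, not the paper's
    learned hash function.
    """
    bits = (np.asarray(embeddings) > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)


def hamming_rank(query_code, database_codes):
    """Rank database entries by Hamming distance to one packed query code."""
    xor = np.bitwise_xor(database_codes, query_code)  # bytes of differing bits
    dists = _POPCOUNT[xor].sum(axis=1, dtype=np.int64)
    order = np.argsort(dists, kind="stable")
    return order, dists


# Hypothetical data: 1000 utterance embeddings of dimension 64.
rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 64)).astype(np.float32)

codes = binarize(database)            # shape (1000, 8), dtype uint8
query = binarize(database[42:43])[0]  # reuse entry 42 as the query
order, dists = hamming_rank(query, codes)

# 64 float32 values take 256 bytes; a 64-bit code takes 8 bytes.
print(database.nbytes, codes.nbytes)
```

Under these assumptions, each 64-bit code needs 8 bytes where the float32 embedding needs 256, a 32-fold reduction. The exhaustive scan shown here is still linear in the database size. The constant or sub-linear query time mentioned in the text would additionally need a hash-table or multi-index lookup over the codes, which this sketch does not implement.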