Asymmetric Deep Supervised Hashing Qing-Yuan Jiang and Wu-Jun Li National Key Laboratory for Novel Software Technology Collaborative Innovation Center of Novel Software Technology and Industrialization Department of Computer Science and Technology,Nanjing University,China jiangqy@lamda.nju.edu.cn,liwujun@nju.edu.cn Abstract time and dramatically reduce the storage cost for the data points(Gong and Lazebnik 2011).Hence,hashing has at- Hashing has been widely used for large-scale approximate tracted more and more attention for large-scale ANN search. nearest neighbor search because of its storage and search ef- ficiency.Recent work has found that deep supervised hashing As the pioneering work,locality sensitive hashing (LSH) can significantly outperform non-deep supervised hashing in (Kulis and Grauman 2009:Datar et al.2004)tries to use ran- many applications.However,most existing deep supervised dom projections as hash functions.LSH-like methods are hashing methods adopt a symmetric strategy to learn one deep always called data-independent methods,because the ran- hash function for both query points and database(retrieval) dom projections are typically independent of training da- points.The training of these symmetric deep supervised hash- ta.On the contrary,data-dependent methods (Kong and Li ing methods is typically time-consuming,which makes them 2012),which are also called learning to hash (L2H)meth- hard to effectively utilize the supervised information for cas- ods,aim to learn the hash functions from training data.Data- es with large-scale database.In this paper,we propose a nov- el deep supervised hashing method,called asymmetric deep dependent methods usually achieve more promising perfor- supervised hashing (ADSH),for large-scale nearest neighbor mance than data-independent methods with shorter binary search.ADSH treats the query points and database points in codes.Hence,data-dependent methods have become more an asymmetric way.More specifically,ADSH learns a deep popular than data-independent methods in recent years hash function only for query points,while the hash codes for Based on whether supervised information is used or not, database points are directly learned.The training of ADSH is data-dependent methods can be further divided into two cat- much more efficient than that of traditional symmetric deep egories (Kang,Li,and Zhou 2016):unsupervised hashing supervised hashing methods.Experiments show that ADSH and supervised hashing.Representative unsupervised hash- can achieve state-of-the-art performance in real applications ing methods include spectral hashing(SH)(Weiss,Torral- ba,and Fergus 2008).iterative quantization (ITQ)(Gong Introduction and Lazebnik 2011),isotropic hashing (IsoH)(Kong and Li With the explosive growing of data in real applications,n- 2012),discrete graph hashing (DGH)(Liu et al.2014),scal- earest neighbor (NN)search (Gionis,Indyk,and Motwani able graph hashing (SGH)(Jiang and Li 2015)and ordinal 1999:Andoni and Indyk 2006;Andoni and Razenshteyn embedding hashing(OEH)(Liu et al.2016b).Unsupervised 2015)has attracted much attention from machine learning hashing learns hash functions that map input data points into binary codes only using unlabeled data.On the contrary,su- community,with a lot of applications in information re- pervised hashing tries to learn the hash function by utilizing trieval,computer vision and so on.However,in big data supervised information.In recent years,supervised hashing applications,the searching time for exact nearest neighbor has attracted more and more attention because it can achieve is typically expensive or impossible for the given queries. Hence,approximate nearest neighbor(ANN)search (An- better accuracy than unsupervised hashing. doni and Razenshteyn 2015)has become more and more Most traditional supervised hashing methods are non- popular in recent years.As a widely used technique for deep methods which cannot perform feature learning from ANN search,hashing (Weiss,Torralba,and Fergus 2008; scratch.Representative non-deep supervised hashing meth- Zhang et al.2010;Neyshabur et al.2013;Liu et al.2014; ods include supervised hashing with kernels (KSH)(Li- Lin et al.2014a:Shen et al.2015b:2015a:Song et al.2015: u et al.2012),asymmetric hashing with two variants Xie.Shen.and Zhu 2016:Liu et al.2016b:2016a:Shi et Lin:Lin and Lin:V(Neyshabur et al.2013),latent factor al.2017;Shen et al.2017;Dasgupta,Stevens,and Navlakha hashing (LFH)(Zhang et al.2014),fast supervised hash- 2017)aims to encode the data points into compact binary ing (FastH)(Lin et al.2014a),supervised discrete hash- hash codes.Thanks to the binary hash code representation, ing(SDH)(Shen et al.2015b),column-sampling based dis- hashing methods can provide constant or sub-linear search crete supervised hashing (COSDISH)(Kang.Li,and Zhou 2016)and asymmetric discrete graph hashing ADGH(Shi et Copyright C)2018,Association for the Advancement of Artificial al.2017).Recently,deep supervised hashing.which adopt- Intelligence (www.aaai.org).All rights reserved. s deep learning(Krizhevsky,Sutskever,and Hinton 2012)
Asymmetric Deep Supervised Hashing Qing-Yuan Jiang and Wu-Jun Li National Key Laboratory for Novel Software Technology Collaborative Innovation Center of Novel Software Technology and Industrialization Department of Computer Science and Technology, Nanjing University, China jiangqy@lamda.nju.edu.cn,liwujun@nju.edu.cn Abstract Hashing has been widely used for large-scale approximate nearest neighbor search because of its storage and search ef- ficiency. Recent work has found that deep supervised hashing can significantly outperform non-deep supervised hashing in many applications. However, most existing deep supervised hashing methods adopt a symmetric strategy to learn one deep hash function for both query points and database (retrieval) points. The training of these symmetric deep supervised hashing methods is typically time-consuming, which makes them hard to effectively utilize the supervised information for cases with large-scale database. In this paper, we propose a novel deep supervised hashing method, called asymmetric deep supervised hashing (ADSH), for large-scale nearest neighbor search. ADSH treats the query points and database points in an asymmetric way. More specifically, ADSH learns a deep hash function only for query points, while the hash codes for database points are directly learned. The training of ADSH is much more efficient than that of traditional symmetric deep supervised hashing methods. Experiments show that ADSH can achieve state-of-the-art performance in real applications. Introduction With the explosive growing of data in real applications, nearest neighbor (NN) search (Gionis, Indyk, and Motwani 1999; Andoni and Indyk 2006; Andoni and Razenshteyn 2015) has attracted much attention from machine learning community, with a lot of applications in information retrieval, computer vision and so on. However, in big data applications, the searching time for exact nearest neighbor is typically expensive or impossible for the given queries. Hence, approximate nearest neighbor (ANN) search (Andoni and Razenshteyn 2015) has become more and more popular in recent years. As a widely used technique for ANN search, hashing (Weiss, Torralba, and Fergus 2008; Zhang et al. 2010; Neyshabur et al. 2013; Liu et al. 2014; Lin et al. 2014a; Shen et al. 2015b; 2015a; Song et al. 2015; Xie, Shen, and Zhu 2016; Liu et al. 2016b; 2016a; Shi et al. 2017; Shen et al. 2017; Dasgupta, Stevens, and Navlakha 2017) aims to encode the data points into compact binary hash codes. Thanks to the binary hash code representation, hashing methods can provide constant or sub-linear search Copyright c 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. time and dramatically reduce the storage cost for the data points (Gong and Lazebnik 2011). Hence, hashing has attracted more and more attention for large-scale ANN search. As the pioneering work, locality sensitive hashing (LSH) (Kulis and Grauman 2009; Datar et al. 2004) tries to use random projections as hash functions. LSH-like methods are always called data-independent methods, because the random projections are typically independent of training data. On the contrary, data-dependent methods (Kong and Li 2012), which are also called learning to hash (L2H) methods, aim to learn the hash functions from training data. Datadependent methods usually achieve more promising performance than data-independent methods with shorter binary codes. Hence, data-dependent methods have become more popular than data-independent methods in recent years. Based on whether supervised information is used or not, data-dependent methods can be further divided into two categories (Kang, Li, and Zhou 2016): unsupervised hashing and supervised hashing. Representative unsupervised hashing methods include spectral hashing (SH) (Weiss, Torralba, and Fergus 2008), iterative quantization (ITQ) (Gong and Lazebnik 2011), isotropic hashing (IsoH) (Kong and Li 2012), discrete graph hashing (DGH) (Liu et al. 2014), scalable graph hashing (SGH) (Jiang and Li 2015) and ordinal embedding hashing (OEH) (Liu et al. 2016b). Unsupervised hashing learns hash functions that map input data points into binary codes only using unlabeled data. On the contrary, supervised hashing tries to learn the hash function by utilizing supervised information. In recent years, supervised hashing has attracted more and more attention because it can achieve better accuracy than unsupervised hashing. Most traditional supervised hashing methods are nondeep methods which cannot perform feature learning from scratch. Representative non-deep supervised hashing methods include supervised hashing with kernels (KSH) (Liu et al. 2012), asymmetric hashing with two variants Lin:Lin and Lin:V (Neyshabur et al. 2013), latent factor hashing (LFH) (Zhang et al. 2014), fast supervised hashing (FastH) (Lin et al. 2014a), supervised discrete hashing (SDH) (Shen et al. 2015b), column-sampling based discrete supervised hashing (COSDISH) (Kang, Li, and Zhou 2016) and asymmetric discrete graph hashing ADGH (Shi et al. 2017). Recently, deep supervised hashing, which adopts deep learning (Krizhevsky, Sutskever, and Hinton 2012)
to perform feature learning for hashing,has been proposed. Notation and Problem Definition Representative deep supervised hashing methods include Notation convolutional neural networks based hashing (CNNH)(Xia et al.2014),network in network hashing (NINH)(Lai et al. Boldface lowercase letters like b denote vectors,and bold- 2015),deep pairwise supervised hashing(DPSH)(Li,Wang, face uppercase letters like B denote matrices.B.denotes and Kang 2016),deep hashing network (DHN)(Zhu et al. the jth column of B.Bii denotes the (i,j)th element of ma- 2016).deep supervised hashing (DSH)(Liu et al.2016a) trix B.Furthermore,Bl and BT are used to denote the and deep asymmetric pairwise hashing(DAPH)(Shen et al. Frobenius norm and the transpose of matrix B,respectively 2017).By integrating feature learning and hash-code learn- Capital Greek letters like denote sets of indices.Boldface ing(or hash function learning)into the same end-to-end ar- 0 denotes a vector with all elements being 0.The symbol O chitecture,deep supervised hashing can significantly outper- is used to denote the Hadamard product. form non-deep supervised hashing. Most existing deep supervised hashing methods,includ- Problem Definition ing CNNH,NINH,DPSH,DHN and DSH,adopt a sym- For supervised hashing methods,supervised information metric strategy to learn one deep hash function for both can be point-wise labels(Shen et al.2015b),pairwise la- query points and database points.The training of these sym- bels (Liu et al.2011:Heo et al.2012:Liu et al.2012: metric deep supervised hashing methods is typically time- Li,Wang,and Kang 2016)or triplet labels(Norouzi,Fleet, consuming.For example,the storage and computational cost and Salakhutdinov 2012;Wang et al.2013;Zhao et al.2015; for these hashing methods with pairwise supervised infor- Zhang et al.2015).In this paper,we only focus on pairwise- mation is O(n2)where n is the number of database points. label based supervised hashing which is a common applica- The training cost for methods with triplet supervised infor- tion scenario. mation(Zhao et al.2015;Zhang et al.2015)is even high- Assume that we have m query data points which are de- er.To make the training practicable,most existing deep su- noted as X ={xi and n database points which are pervised hashing methods have to sample only a small sub- denoted as Y =fy;=1.Furthermore,pairwise super- set from the whole database to construct a training set for vised information,denoted as S {-1,+1]mxn,is also hash function learning,and many points in database may be available for supervised hashing.If Si=1,it means that discarded during training.Hence,it is hard for these deep point xi and point yi are similar.Otherwise,xi and yi are supervised hashing methods to effectively utilize the super- dissimilar.The goal of supervised hashing is to learn bi- vised information for cases with large-scale database,which nary hash codes for both query points and database points makes the search performance unsatisfactory. from X,Y and S,and the hash codes try to preserve the In this paper,we propose a novel deep super- vised hashing method,called asymmetric deep supervised similarity between query points and database points.More hashing (ADSH),for large-scale nearest neighbor search. specifically,if we use U ={ui)m1E{-1,+1)mxe to The main contributions of ADSH are outlined as follows: denote the learned binary hash codes for query points and V={vj)"=1E{-1,+1)nxe to denote the learned binary ADSH treats the query points and database points in an hash codes for database points,where c denotes the binary asymmetric way.More specifically,ADSH learns a deep code length.To preserve semantic similarity,the Hamming hash function only for query points,while the binary hash distance between u;and v;should be as small as possible codes for database points are directly learned.To the best if Sij=1.Otherwise,the Hamming distance between ui of our knowledge,ADSH is the first deep supervised and v;should be as large as possible.Moreover,we should hashing method which treats query points and database also learn a hash function h(x)E{-1,+1e so that we points in an asymmetric way. can generate binary code for any unseen query point xg. The training of ADSH is much more efficient than that of Please note that in many cases,we are only given a set of traditional symmetric deep supervised hashing methods. database points Y=[y and the pairwise supervised Hence,the whole set of database points can be used for information between them.We can also learn the hash codes training even if the database is large. and hash function by sampling a subset or the whole set of Y as the query set for training.In these cases,X Y. ADSH can directly learn the binary hash codes for database points,which will be empirically proved to be Asymmetric Deep Supervised Hashing more accurate than the strategies adopted by tradition- al symmetric deep supervised hashing methods which In this section,we introduce our asymmetric deep super- use the learned hash function to generate hash codes for vised hashing (ADSH)in detail,including model formula- database points. tion and learning algorithm. Experiments on three large-scale datasets show that Model Formulation ADSH can achieve state-of-the-art performance in real applications. Figure 1 shows the model architecture of ADSH,which con- tains two important components:feature learning part and DAPH is also an asymmetric deep supervised hashing method, loss function part.The feature learning part tries to leamn which adopts an asymmetric strategy different from our ADSH a deep neural network which can extract appropriate fea- DAPH had not been published before the submission of ADSH. ture representation for binary hash code learning.The loss
to perform feature learning for hashing, has been proposed. Representative deep supervised hashing methods include convolutional neural networks based hashing (CNNH) (Xia et al. 2014), network in network hashing (NINH) (Lai et al. 2015), deep pairwise supervised hashing (DPSH) (Li, Wang, and Kang 2016), deep hashing network (DHN) (Zhu et al. 2016), deep supervised hashing (DSH) (Liu et al. 2016a) and deep asymmetric pairwise hashing (DAPH) (Shen et al. 2017)1 . By integrating feature learning and hash-code learning (or hash function learning) into the same end-to-end architecture, deep supervised hashing can significantly outperform non-deep supervised hashing. Most existing deep supervised hashing methods, including CNNH, NINH, DPSH, DHN and DSH, adopt a symmetric strategy to learn one deep hash function for both query points and database points. The training of these symmetric deep supervised hashing methods is typically timeconsuming. For example, the storage and computational cost for these hashing methods with pairwise supervised information is O(n 2 ) where n is the number of database points. The training cost for methods with triplet supervised information (Zhao et al. 2015; Zhang et al. 2015) is even higher. To make the training practicable, most existing deep supervised hashing methods have to sample only a small subset from the whole database to construct a training set for hash function learning, and many points in database may be discarded during training. Hence, it is hard for these deep supervised hashing methods to effectively utilize the supervised information for cases with large-scale database, which makes the search performance unsatisfactory. In this paper, we propose a novel deep supervised hashing method, called asymmetric deep supervised hashing (ADSH), for large-scale nearest neighbor search. The main contributions of ADSH are outlined as follows: • ADSH treats the query points and database points in an asymmetric way. More specifically, ADSH learns a deep hash function only for query points, while the binary hash codes for database points are directly learned. To the best of our knowledge, ADSH is the first deep supervised hashing method which treats query points and database points in an asymmetric way. • The training of ADSH is much more efficient than that of traditional symmetric deep supervised hashing methods. Hence, the whole set of database points can be used for training even if the database is large. • ADSH can directly learn the binary hash codes for database points, which will be empirically proved to be more accurate than the strategies adopted by traditional symmetric deep supervised hashing methods which use the learned hash function to generate hash codes for database points. • Experiments on three large-scale datasets show that ADSH can achieve state-of-the-art performance in real applications. 1DAPH is also an asymmetric deep supervised hashing method, which adopts an asymmetric strategy different from our ADSH. DAPH had not been published before the submission of ADSH. Notation and Problem Definition Notation Boldface lowercase letters like b denote vectors, and boldface uppercase letters like B denote matrices. B∗j denotes the jth column of B. Bij denotes the (i, j)th element of matrix B. Furthermore, kBkF and BT are used to denote the Frobenius norm and the transpose of matrix B, respectively. Capital Greek letters like Ω denote sets of indices. Boldface 0 denotes a vector with all elements being 0. The symbol is used to denote the Hadamard product. Problem Definition For supervised hashing methods, supervised information can be point-wise labels (Shen et al. 2015b), pairwise labels (Liu et al. 2011; Heo et al. 2012; Liu et al. 2012; Li, Wang, and Kang 2016) or triplet labels (Norouzi, Fleet, and Salakhutdinov 2012; Wang et al. 2013; Zhao et al. 2015; Zhang et al. 2015). In this paper, we only focus on pairwiselabel based supervised hashing which is a common application scenario. Assume that we have m query data points which are denoted as X = {xi} m i=1 and n database points which are denoted as Y = {yj} n j=1. Furthermore, pairwise supervised information, denoted as S ∈ {−1, +1} m×n, is also available for supervised hashing. If Sij = 1, it means that point xi and point yj are similar. Otherwise, xi and yj are dissimilar. The goal of supervised hashing is to learn binary hash codes for both query points and database points from X, Y and S, and the hash codes try to preserve the similarity between query points and database points. More specifically, if we use U = {ui} m i=1 ∈ {−1, +1} m×c to denote the learned binary hash codes for query points and V = {vj} n j=1 ∈ {−1, +1} n×c to denote the learned binary hash codes for database points, where c denotes the binary code length. To preserve semantic similarity, the Hamming distance between ui and vj should be as small as possible if Sij = 1. Otherwise, the Hamming distance between ui and vj should be as large as possible. Moreover, we should also learn a hash function h(xq) ∈ {−1, +1} c so that we can generate binary code for any unseen query point xq. Please note that in many cases, we are only given a set of database points Y = {yj} n j=1 and the pairwise supervised information between them. We can also learn the hash codes and hash function by sampling a subset or the whole set of Y as the query set for training. In these cases, X ⊆ Y. Asymmetric Deep Supervised Hashing In this section, we introduce our asymmetric deep supervised hashing (ADSH) in detail, including model formulation and learning algorithm. Model Formulation Figure 1 shows the model architecture of ADSH, which contains two important components: feature learning part and loss function part. The feature learning part tries to learn a deep neural network which can extract appropriate feature representation for binary hash code learning. The loss
Ouery code Database code F(x;;e)E Re.Then,the problem in (1)is transformed to the following problem: 11d 01010 职 Je,V)=∑∑[h(x)Tv-cS22 i=1j= Feature Learning Part Loss Function Part 7 =∑∑[sign((Fx;o)y-cs] i=1j=1 Figure 1:Model architecture of ADSH. s.Lv∈{-1,+1}e,j∈{1,2,,n} function part aims to learn binary hash codes which preserve In(2),we set F(xi;e)to be the output of the CNN-F model the supervised information(similarity)between query points in the feature learning part,and e is the parameter of the and database points.ADSH integrates these two components CNN-F model.Through this way,we seamlessly integrate into the same end-to-end framework.During training proce- the feature learning part and the loss function part into the same framework. dure,each part can give feedback to the other part. There still exists a problem for the formulation in (2), Please note that the feature learning is only performed for query points but not for database points.Furthermore,based which is that we cannot back-propagate the gradient to e on the deep neural network for feature learning,ADSH due to the sign(F(xi;e))function.Hence,in ADSH we adopts a deep hash function to generate hash codes for adopt the following objective function: query points,but the binary hash codes for database points m n are directly learned.Hence,ADSH treats the query points gO,V=∑∑amh(Fx;o)rv-cS]2 and database points in an asymmetric way.This asymmetric i=1j=1 property of ADSH is different from traditional deep super- S.t. VE{-1,+1]nxe, (3) vised hashing methods.Traditional deep supervised hash- ing methods adopt the same deep neural network to perform where we use tanh(.)to approximate the sign()function. feature learning for both query points and database points, In practice,we might be given only a set of database and then use the same deep hash function to generate binary points Y=fyj without query points.In this case. codes for both query points and database points. we can randomly sample m data points from database to Feature Learning Part In this paper,we adopt a convo- construct the query set.More specifically,we set X=Y where Y denotes the database points indexed by Here, lutional neural network(CNN)model from (Chatfield et al. 2014),i.e.,CNN-F model,to perform feature learning.This we use I ={1,2,...,n}to denote the indices of all the database points and fi1,i2,...,im}I to denote CNN-F model has also been adopted in DPSH (Li,Wang, the indices of the sampled query points.Accordingly,we set and Kang 2016)for feature learning.The CNN-F model S=S,where S{-1,+1)nxn denotes the supervised contains five convolutional layers and three fully-connected information(similarity)between pairs of all database points layers,the details of which can be found at(Chatfield et al. and SE{-1,+1)mxn denotes the sub-matrix formed by 2014;Li,Wang,and Kang 2016).In ADSH,the last layer of the rows of S indexed by Then,we can rewrite J(,V) CNN-F model is replaced by a fully-connected layer which can project the output of the first seven layers into Re space. as follows: Please note that the framework of ADSH is general enough a聘J(6,V)=∑∑[tanh(Fy;o)Tv,-cS]2 to adopt other deep neural networks to replace the CNN-F ien jer model for feature learning.Here,we just adopt CNN-F for illustration. s.tV∈{-1,+1}mxc (4 Loss Function Part To learn the hash codes which can Because C T,there are two representations for yi. preserve the similarity between query points and database Vi E One is the binary hash code vi in database,and points,one common way is to minimize the L2 loss between the other is the query representation tanh(F(yi;))We ad- the supervised information (similarity)and inner product of d an extra constraint to keep vi and tanh(F(yi;e))as close query-database binary code pairs.This can be formulated as as possible,Vi E 1.This is intuitively reasonable,because follows: tanh(F(yi;e))is the approximation of the binary code of yi.Then we get the final formulation of ADSH with only 映J(U,V)=∑∑(uy-c5)2 database points Y for training: =1j=1 J6,V)=∑∑anh(Fy:6)7y-cS2 st.Ue{-1,+1}mxc,V∈{-1,+1}nxc, ien jer u=h(x),i∈{1,2,,m} +y∑v:-tanh(Fy;Θ)2 (5) However,it is difficult to learn h(x;)due to the dis- crete output.We can set h(xi)=sign(F(xi;e)),where s.t.V∈{-1,+1}nxc
h ! Figure 1: Model architecture of ADSH. function part aims to learn binary hash codes which preserve the supervised information (similarity) between query points and database points. ADSH integrates these two components into the same end-to-end framework. During training procedure, each part can give feedback to the other part. Please note that the feature learning is only performed for query points but not for database points. Furthermore, based on the deep neural network for feature learning, ADSH adopts a deep hash function to generate hash codes for query points, but the binary hash codes for database points are directly learned. Hence, ADSH treats the query points and database points in an asymmetric way. This asymmetric property of ADSH is different from traditional deep supervised hashing methods. Traditional deep supervised hashing methods adopt the same deep neural network to perform feature learning for both query points and database points, and then use the same deep hash function to generate binary codes for both query points and database points. Feature Learning Part In this paper, we adopt a convolutional neural network (CNN) model from (Chatfield et al. 2014), i.e., CNN-F model, to perform feature learning. This CNN-F model has also been adopted in DPSH (Li, Wang, and Kang 2016) for feature learning. The CNN-F model contains five convolutional layers and three fully-connected layers, the details of which can be found at (Chatfield et al. 2014; Li, Wang, and Kang 2016). In ADSH, the last layer of CNN-F model is replaced by a fully-connected layer which can project the output of the first seven layers into R c space. Please note that the framework of ADSH is general enough to adopt other deep neural networks to replace the CNN-F model for feature learning. Here, we just adopt CNN-F for illustration. Loss Function Part To learn the hash codes which can preserve the similarity between query points and database points, one common way is to minimize the L2 loss between the supervised information (similarity) and inner product of query-database binary code pairs. This can be formulated as follows: min U,V J(U, V) = Xm i=1 Xn j=1 u T i vj − cSij 2 (1) s.t. U ∈ {−1, +1} m×c , V ∈ {−1, +1} n×c , ui = h(xi) , ∀i ∈ {1, 2, . . . , m}. However, it is difficult to learn h(xi) due to the discrete output. We can set h(xi) = sign(F(xi ; Θ)), where F(xi ; Θ) ∈ R c . Then, the problem in (1) is transformed to the following problem: min Θ,V J(Θ,V) = Xm i=1 Xn j=1 h(xi) T vj − cSij 2 (2) = Xm i=1 Xn j=1 sign(F(xi ; Θ))T vj − cSij 2 s.t. vj ∈ {−1, +1} c , ∀j ∈ {1, 2, . . . , n}. In (2), we set F(xi ; Θ) to be the output of the CNN-F model in the feature learning part, and Θ is the parameter of the CNN-F model. Through this way, we seamlessly integrate the feature learning part and the loss function part into the same framework. There still exists a problem for the formulation in (2), which is that we cannot back-propagate the gradient to Θ due to the sign(F(xi ; Θ)) function. Hence, in ADSH we adopt the following objective function: min Θ,V J(Θ, V) = Xm i=1 Xn j=1 tanh(F(xi ; Θ))T vj − cSij 2 , s.t. V ∈ {−1, +1} n×c , (3) where we use tanh(·) to approximate the sign(·) function. In practice, we might be given only a set of database points Y = {yj} n j=1 without query points. In this case, we can randomly sample m data points from database to construct the query set. More specifically, we set X = YΩ where YΩ denotes the database points indexed by Ω. Here, we use Γ = {1, 2, . . . , n} to denote the indices of all the database points and Ω = {i1, i2, . . . , im} ⊆ Γ to denote the indices of the sampled query points. Accordingly, we set S = S Ω, where S ∈ {−1, +1} n×n denotes the supervised information (similarity) between pairs of all database points and S Ω ∈ {−1, +1} m×n denotes the sub-matrix formed by the rows of S indexed by Ω. Then, we can rewrite J(Θ, V) as follows: min Θ,V J(Θ, V) = X i∈Ω X j∈Γ tanh(F(yi ; Θ))T vj − cSij 2 s.t. V ∈ {−1, +1} n×c . (4) Because Ω ⊆ Γ, there are two representations for yi , ∀i ∈ Ω. One is the binary hash code vi in database, and the other is the query representation tanh(F(yi ; Θ)). We add an extra constraint to keep vi and tanh(F(yi ; Θ)) as close as possible, ∀i ∈ Ω. This is intuitively reasonable, because tanh(F(yi ; Θ)) is the approximation of the binary code of yi . Then we get the final formulation of ADSH with only database points Y for training: min Θ,V J(Θ, V) =X i∈Ω X j∈Γ tanh(F(yi ; Θ))T vj − cSij 2 + γ X i∈Ω [vi − tanh(F(yi ; Θ))]2 (5) s.t. V ∈ {−1, +1} n×c ,
where y is a hyper-parameter. We define={1 where j is defined as follows: In real applications,if we are given both Y and X,we use the problem in (3)for training ADSH.If we are only ∫u,ifje2 given Y,we use the problem in(5)for training ADSH.Af- 0 otherwise ter training ADSH,we can get the binary hash codes for database points,and a deep hash function for query points. Then we can rewrite the problem(7)as follows: We can use the trained deep hash function to generate the bi- nary hash codes for any query points including newly com- min J(V)=IvUT-2tr(V[cuTs+])+const ing query points which are not seen during training.One simple way to generate binary codes for query points is to =VUT+tr(VQT)+const set ug=h(xa)=sign(F(xg;e)). s.tV∈{-1,+1}nxc 4 (8) From (3)and(5),we can find that ADSH treats query points and database points in an asymmetric way.More where Q=-2cSTU-2U,"const"is a constant indepen- specifically,the feature learning is only performed for query dent of V. points but not for database points.Furthermore,ADSH Then,we update V bit by bit.That is to say,each time we adopts a deep hash function to generate hash codes for query update one column of V with other columns fixed.Let V.k points,but the binary hash codes for database points are denote the kth column of V and Vk denote the matrix of V directly learned.This is different from traditional deep su- excluding V.k Let Q.k denote the kth column of Q and Q pervised hashing methods which adopt the same deep hash denote the matrix of Q excluding Qk.Let U+k denote the function to generate binary hash codes for both query points and database points.Because mn in general,ADSH can kth column of U and Uk denote the matrix of U excluding learn the deep neural networks efficiently,and is much faster U.k.To optimize Vk,we can get the objective function: than traditional symmetric deep supervised hashing method- s.This will be verified in our experiments. J(V.k)=VUT+tr(VQ)const =tr(V.k2UTkUxV+Q])+const. Learning Algorithm Here,we only present the learning algorithm for prob- Then,we need to solve the following problem: lem(5),which can be easily adapted for problem(3).We design an alternating optimization strategy to learn the pa- in J(V)=tr(V20UV+Q)+const rameters e and V in problem(5).More specifically,in each s.tV*k∈{-1,+1}" (9) iteration we learn one parameter with the other fixed,and this process will be repeated for many iterations. Then,we can get the optimal solution of problem(9)as fol- lows: Learn e with V fixed When V is fixed,we use back- propagation(BP)algorithm to update the neural network pa- V.k=-sign(2VkUIU.k+Q.k), (10) rameter Specifically,we sample a mini-batch of the query points,then update the parameter e based on the sampled which can be used to update V.k. data.For the sake of simplicity,we define zi=F(yi;e) We summarize the whole learning algorithm for ADSH in and u;=tanh(F(yi;e)).Then we can compute the gradi- Algorithm 1.Here,we repeat the learning for several times, ent of zi as follows: and each time we can sample a query set indexed by n. -{2【@y-es,w+2a-v} 0.J Out-of-Sample Extension jer After training ADSH,the learned deep neural network can ⊙(1-). (6) be applied for generating binary codes for query points in- cluding unseen query points during training.More specifi- Then we can use chain rule to compute based on cally,we can use the following equation to generate binary and the BP algorithm is used to update code for x: Learn V with fixed When is fixed,we rewrite the ug=h(xg:Θ)=sign(F(xg:θ): problem(5)in matrix form: min J(V)=lUvT cS+v(7) Complexity Analysis The total computational complexity for training ADSH is =|UVrI房-2ctr(VTSTU) O(ToutTin[(n+2)mc+(m+1)nc2+(c(n+m)-m)c]). In practice,Tout,Tin,m and c will be much less than n. -2ytr(vSUT)+const Hence,the computational cost of ADSH is O(n).For tra- st.V∈{-1,+1}nxc ditional symmetric deep supervised hashing methods,if all database points are used for training,the computational cost where U=[un,...,[-1,+1]mxe,vo de- is at least O(n2).Furthermore,the training for deep neural notes the binary codes for the database points indexed by network is typically time-consuming.For traditional sym- ie.vo lvi,via,...,vi metric deep supervised hashing methods,they need to scan
where γ is a hyper-parameter. In real applications, if we are given both Y and X, we use the problem in (3) for training ADSH. If we are only given Y, we use the problem in (5) for training ADSH. After training ADSH, we can get the binary hash codes for database points, and a deep hash function for query points. We can use the trained deep hash function to generate the binary hash codes for any query points including newly coming query points which are not seen during training. One simple way to generate binary codes for query points is to set uq = h(xq) = sign(F(xq; Θ)). From (3) and (5), we can find that ADSH treats query points and database points in an asymmetric way. More specifically, the feature learning is only performed for query points but not for database points. Furthermore, ADSH adopts a deep hash function to generate hash codes for query points, but the binary hash codes for database points are directly learned. This is different from traditional deep supervised hashing methods which adopt the same deep hash function to generate binary hash codes for both query points and database points. Because m n in general, ADSH can learn the deep neural networks efficiently, and is much faster than traditional symmetric deep supervised hashing methods. This will be verified in our experiments. Learning Algorithm Here, we only present the learning algorithm for problem (5), which can be easily adapted for problem (3). We design an alternating optimization strategy to learn the parameters Θ and V in problem (5). More specifically, in each iteration we learn one parameter with the other fixed, and this process will be repeated for many iterations. Learn Θ with V fixed When V is fixed, we use backpropagation (BP) algorithm to update the neural network parameter Θ. Specifically, we sample a mini-batch of the query points, then update the parameter Θ based on the sampled data. For the sake of simplicity, we define zi = F(yi ; Θ) and uei = tanh(F(yi ; Θ)). Then we can compute the gradient of zi as follows: ∂J ∂zi = n 2 X j∈Γ (ue T i vj − cSij )vj + 2γ(uei − vi) o (1 − ue 2 i ). (6) Then we can use chain rule to compute ∂J ∂Θ based on ∂J ∂zi , and the BP algorithm is used to update Θ. Learn V with Θ fixed When Θ is fixed, we rewrite the problem (5) in matrix form: min V J(V) =kUVe T − cSk 2 F + γkVΩ − Ue k 2 F (7) =kUVe T k 2 F − 2ctr(VT S T Ue ) − 2γtr(VΩUe T ) + const s.t. V ∈ {−1, +1} n×c , where Ue = [uei1 , uei2 , . . . , ueim] T ∈ [−1, +1]m×c , VΩ denotes the binary codes for the database points indexed by Ω, i.e., VΩ = [vi1 , vi2 , . . . , vim] T . We define U¯ = {u¯j} n j=1, where u¯j is defined as follows: u¯j = uej if j ∈ Ω 0 otherwise. Then we can rewrite the problem (7) as follows: min V J(V) = kVUe T k 2 F − 2tr V[cUe T S + γU¯ T ] + const = kVUe T k 2 F + tr(VQT ) + const s.t. V ∈ {−1, +1} n×c , (8) where Q = −2cS T Ue −2γU¯ , “const” is a constant independent of V. Then, we update V bit by bit. That is to say, each time we update one column of V with other columns fixed. Let V∗k denote the kth column of V and Vb k denote the matrix of V excluding V∗k. Let Q∗k denote the kth column of Q and Qb k denote the matrix of Q excluding Q∗k. Let Ue ∗k denote the kth column of Ue and Ub k denote the matrix of Ue excluding Ue ∗k. To optimize V∗k, we can get the objective function: J(V∗k) =kVUe T k 2 F + tr(VQT ) + const =tr V∗k[2Ue T ∗kUb kVb T k + QT ∗k ] + const. Then, we need to solve the following problem: min V∗k J(V∗k) =tr(V∗k[2Ue T ∗kUb kVb T k + QT ∗k ] + const s.t. V∗k ∈ {−1, +1} n . (9) Then, we can get the optimal solution of problem (9) as follows: V∗k = −sign(2Vb kUb T k Ue ∗k + Q∗k), (10) which can be used to update V∗k. We summarize the whole learning algorithm for ADSH in Algorithm 1. Here, we repeat the learning for several times, and each time we can sample a query set indexed by Ω. Out-of-Sample Extension After training ADSH, the learned deep neural network can be applied for generating binary codes for query points including unseen query points during training. More specifi- cally, we can use the following equation to generate binary code for xq: uq = h(xq; Θ) = sign(F(xq; Θ)). Complexity Analysis The total computational complexity for training ADSH is O(ToutTin[(n + 2)mc + (m + 1)nc2 + (c(n + m) − m)c]). In practice, Tout, Tin, m and c will be much less than n. Hence, the computational cost of ADSH is O(n). For traditional symmetric deep supervised hashing methods, if all database points are used for training, the computational cost is at least O(n 2 ). Furthermore, the training for deep neural network is typically time-consuming. For traditional symmetric deep supervised hashing methods, they need to scan
Algorithm 1 The learning algorithm for ADSH The CIFAR-10 dataset is a single-label dataset which con- nput:Y={yi}是1:n data points. tains 60,000 32 x 32 color images.Each image belongs to SE{-1,1]nxn:supervised similarity matrix. one of the ten classes.For CIFAR-10 dataset,two images c:binary code length. will be treated as a ground-truth neighbor(similar pair)if Output:V:binary hash codes for database points. they share one common label. 曰:neural network parameter.. The NUS-WIDE dataset consists of 269.648 web im- Initialization:initialize O.V.mini-batch size M and it- ages associated with tags.It is a multi-label dataset where eration number Tout,Tin. each image might be annotated with multi-labels.Following forw=1→Tout do DPSH (Li,Wang,and Kang 2016).we only select 195,834 Randomly generate index set from I.Set S =SR, images that belong to the 21 most frequent concepts.For X=Y based on NUS-WIDE dataset,two images will be defined as a ground- fort=1→Tin do truth neighbor (similar pair)if they share at least one com- for s =1,2,...,m/M do mon label. Randomly sample M data points from X=Y to construct a mini-batch. Baselines and Evaluation Protocol Calculate zi and u;for each data point yi in the To evaluate ADSH and baselines,we choose some method- mini-batch by forward propagation. s as baselines for comparison,including one unsupervised Calculate the gradient according to (6). hashing method ITQ (Gong and Lazebnik 2011),seven Update the parameter by using back propaga- non-deep supervised hashing methods Lin:Lin (Neyshabur tion. et al.2013),Lin:V(Neyshabur et al.2013),LFH (Zhang end for et al.2014).FastH (Lin et al.2014a).SDH (Shen et al fork=1→cdo 2015b),COSDISH (Kang.Li,and Zhou 2016)and kernel Update V.according to update rule in (10) ADGH(KADGH)(Shi et al.2017).three deep supervised end for hashing methods,DSH (Liu et al.2016a),DHN (Zhu et end for al.2016)and DPSH (Li,Wang,and Kang 2016).Among end for these baselines,Lin:Lin,Lin:V and KADGH are asymmet- ric and others are based on symmetric hashing.Other meth- ods,include CNNH (Xia et al.2014)and NINH (Lai et al. n points in an epoch of the neural network training.On the 2015),are not adopted for comparison because they have contrary,only m points are scanned in an epoch of the neu- been found to be outperformed by the adopted baselines like ral network training for ADSH.Typically,mn.Hence, DPSH. ADSH is much faster than traditional symmetric deep super- For non-deep hashing methods,we utilize 4,096-dim deep vised hashing methods. features which are extracted by the pre-trained CNN-F mod- To make the training practicable,most existing symmet- el on ImageNet dataset for fair comparison.KADGH and ric deep supervised hashing methods have to sample only a SDH are kernel-based methods.for which we randomly s- small subset from the whole database to construct a train- elect 1,000 data points as anchors to construct the kernels ing set for deep hash function learning,and many points in by following the suggestion of the authors of KADGH.As database may be discarded during training.On the contrary, KADGH can only be used for single-label dataset,we on- ASDH is much more efficient to utilize more database points ly carry out experiments on CIFAR-10 dataset for KADGH. for training. For FastH,LFH and COSDISH,we use boosted decision tree for out-of-sample extension by following the setting of FastH.For most baselines including ITQ,Lin:Lin,Lin:V. Experiment LFH,FastH,SDH,COSDISH and DPSH,source code is We carry out experiments to evaluate our ADSH and base- kindly provided by their authors.For DSH and DHN.al- lines which are implemented with the deep learning toolbox though the authors provide source code,for fair comparison MatConvNet (Vedaldi and Lenc 2015)on a NVIDIA M40 we carefully re-implement their methods on MatConvNet to GPU server. remove effect on training time caused by different platform- s.For DHN,DPSH and DSH,we resize all images to 224 x Datasets 224 and use the raw pixels as the inputs for all datasets.In order to avoid overfitting,we set weight decay as 5 x 10-4. We evaluate ADSH on three datasets:MS-COCO (Lin In addition,some deep hashing methods adopt other neu- et al.2014b),CIFAR-10 (Krizhevsky 2009)and NUS- ral networks for feature learning.For example,DHN adopts WIDE (Chua et al.2009). AlexNet (Krizhevsky,Sutskever,and Hinton 2012).We find The MS-COCO contains 82.783 training,40,504 valida- that the deep baselines with CNN-F network can outperform tion images which belong to 91 categories.It's a multi-label the counterparts with the original networks.For fair compar- dataset.For training image set,we discard the images which ison,we adopt the same deep neural networks for DHN and have no category information.For MS-COCO dataset,two DSH,i.e.,the CNN-F network.We initialize CNN-F with images will be defined as a ground-truth neighbor(similar the pre-trained model on ImageNet.Following the sugges- pair)if they share at least one common label. tions of the authors,we set the mini-batch size to be 128 and
Algorithm 1 The learning algorithm for ADSH Input: Y = {yi} n i=1: n data points. S ∈ {−1, 1} n×n: supervised similarity matrix. c: binary code length. Output: V: binary hash codes for database points. Θ: neural network parameter. Initialization: initialize Θ, V, mini-batch size M and iteration number Tout, Tin. for w = 1 → Tout do Randomly generate index set Ω from Γ. Set S = S Ω, X = YΩ based on Ω. for t = 1 → Tin do for s = 1, 2, . . . , m/M do Randomly sample M data points from X = YΩ to construct a mini-batch. Calculate zi and uei for each data point yi in the mini-batch by forward propagation. Calculate the gradient according to (6). Update the parameter Θ by using back propagation. end for for k = 1 → c do Update V∗k according to update rule in (10). end for end for end for n points in an epoch of the neural network training. On the contrary, only m points are scanned in an epoch of the neural network training for ADSH. Typically, m n. Hence, ADSH is much faster than traditional symmetric deep supervised hashing methods. To make the training practicable, most existing symmetric deep supervised hashing methods have to sample only a small subset from the whole database to construct a training set for deep hash function learning, and many points in database may be discarded during training. On the contrary, ASDH is much more efficient to utilize more database points for training. Experiment We carry out experiments to evaluate our ADSH and baselines which are implemented with the deep learning toolbox MatConvNet (Vedaldi and Lenc 2015) on a NVIDIA M40 GPU server. Datasets We evaluate ADSH on three datasets: MS-COCO (Lin et al. 2014b), CIFAR-10 (Krizhevsky 2009) and NUSWIDE (Chua et al. 2009). The MS-COCO contains 82,783 training, 40,504 validation images which belong to 91 categories. It’s a multi-label dataset. For training image set, we discard the images which have no category information. For MS-COCO dataset, two images will be defined as a ground-truth neighbor (similar pair) if they share at least one common label. The CIFAR-10 dataset is a single-label dataset which contains 60,000 32 × 32 color images. Each image belongs to one of the ten classes. For CIFAR-10 dataset, two images will be treated as a ground-truth neighbor (similar pair) if they share one common label. The NUS-WIDE dataset consists of 269,648 web images associated with tags. It is a multi-label dataset where each image might be annotated with multi-labels. Following DPSH (Li, Wang, and Kang 2016), we only select 195,834 images that belong to the 21 most frequent concepts. For NUS-WIDE dataset, two images will be defined as a groundtruth neighbor (similar pair) if they share at least one common label. Baselines and Evaluation Protocol To evaluate ADSH and baselines, we choose some methods as baselines for comparison, including one unsupervised hashing method ITQ (Gong and Lazebnik 2011), seven non-deep supervised hashing methods Lin:Lin (Neyshabur et al. 2013), Lin:V (Neyshabur et al. 2013), LFH (Zhang et al. 2014), FastH (Lin et al. 2014a), SDH (Shen et al. 2015b), COSDISH (Kang, Li, and Zhou 2016) and kernel ADGH (KADGH) (Shi et al. 2017), three deep supervised hashing methods, DSH (Liu et al. 2016a), DHN (Zhu et al. 2016) and DPSH (Li, Wang, and Kang 2016). Among these baselines, Lin:Lin, Lin:V and KADGH are asymmetric and others are based on symmetric hashing. Other methods, include CNNH (Xia et al. 2014) and NINH (Lai et al. 2015), are not adopted for comparison because they have been found to be outperformed by the adopted baselines like DPSH. For non-deep hashing methods, we utilize 4,096-dim deep features which are extracted by the pre-trained CNN-F model on ImageNet dataset for fair comparison. KADGH and SDH are kernel-based methods, for which we randomly select 1,000 data points as anchors to construct the kernels by following the suggestion of the authors of KADGH. As KADGH can only be used for single-label dataset, we only carry out experiments on CIFAR-10 dataset for KADGH. For FastH, LFH and COSDISH, we use boosted decision tree for out-of-sample extension by following the setting of FastH. For most baselines including ITQ, Lin:Lin, Lin:V, LFH, FastH, SDH, COSDISH and DPSH, source code is kindly provided by their authors. For DSH and DHN, although the authors provide source code, for fair comparison we carefully re-implement their methods on MatConvNet to remove effect on training time caused by different platforms. For DHN, DPSH and DSH, we resize all images to 224 × 224 and use the raw pixels as the inputs for all datasets. In order to avoid overfitting, we set weight decay as 5 × 10−4 . In addition, some deep hashing methods adopt other neural networks for feature learning. For example, DHN adopts AlexNet (Krizhevsky, Sutskever, and Hinton 2012). We find that the deep baselines with CNN-F network can outperform the counterparts with the original networks. For fair comparison, we adopt the same deep neural networks for DHN and DSH, i.e., the CNN-F network. We initialize CNN-F with the pre-trained model on ImageNet. Following the suggestions of the authors, we set the mini-batch size to be 128 and
Table 1:MAP on three datasets.The best results for MAP are shown in bold MS-COCO CIFAR-10 NUS-WIDE Method 12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32bits 48 bits 12 bits 24 bits 32bits 48 bits TTo 0.6338 0.63260.6308 0.6338 0.26I9 0.2754 0.286T 0.294T 0.71430.73610.7457 0.7553 Lin:Lin 0.6557 0.6722 0.6701 0.6736 0.6099 0.6312 0.6079 0.6013 0.5556 0.5704 0.5627 0.5555 LFH 0.7085 0.7389 0.7580 0.7725 0.4178 0.5738 0.6414 0.6927 0.7116 0.7681 0.7949 0.8135 FastH 0.7194 0.7478 0.7544 0.7604 0.5971 0.6632 0.6847 0.7020 0.7267 0.7692 0.7817 0.8037 SDH 0.6954 0.7078 0.7115 0.7164 0.4539 0.6334 0.6514 0.6603 0.7646 0.7998 0.8017 0.8124 COSDISH 0.6895 0.6924 0.7312 0.7589 0.5831 0.6614 0.6802 0.7016 0.6425 0.7406 0.7843 0.7964 KADGH N/A N/A N/A N/A 0.6134 0.6607 0.6701 0.6829 N/A N/A N/A N/A DPSH 0.7461 0.7667 0.7729 0.7777 0.6818 0.7204 0.7341 0.7464 0.7941 0.8249 0.8351 0.8442 DHN 0.7440 0.7656 0.7691 0.7740 0.6805 0.7213 0.7233 0.7332 0.7719 0.8013 0.8051 0.8146 DSH 0.6962 0.7176 0.7156 0.7220 0.6441 0.7421 0.7703 0.7992 0.7125 0.7313 0.7401 0.7485 ADSH 0.8388 0.8590 0.8633 0.8651 0.8898 0.9280 0.9310 0.9390 0.8400 0.8784 0.8951 0.9055 MSCOCO CIFAR-10 NUS-WIDE 0.95 0.9 0.9 0.85 0.85 08 A1 +-DPSI -DHN 0.75 DHN DHN 08 Lin:V ◆。Li 0.45 5H- 0.7 ◆-LFED SDH-I -●05DSFD --KADGH-D -●0SDSF-D 0.75 0.35 0.65 12 24 48 24 A 12 24 32 48 Binary code length Binary code length Binary code length (a)MS-COCO (b)CIFAR-10 (c)NUS-WIDE Figure 2:Top-5K precision on three datasets. tune the learning rate among [10-6,10-2]by using a vali- sion curve to evaluate the proposed ADSH and baselines dation set.For ADSH method,we set y 200,Tout =50, For NUS-WIDE dataset,the MAP results are calculated Tin =3,=2000 by using a validation strategy for all based on the Top-5K returned samples.We also compare datasets.To avoid effect caused by class-imbalance prob- the training time between different deep hashing methods lem between positive and negative similarity information, All experiments are run five times,and the average values we empirically set the weight of the element-1 in S as the are reported. ratio between the number of element 1 and the number of element-1 in S. Accuracy For MS-COCO dataset,we use the pruned training images The MAP results are presented in Table 1.We can find that in as database points and randomly select 5,000 images (250 most cases the supervised methods can outperform the unsu- images per category)which belong to the 20 most cate- pervised methods,and the deep methods can outperform the gories from validation set as test set.For CIFAR-10 dataset, non-deep methods.Furthermore,we can find that ADSH can we randomly select 1,000 images (100 images per class) significantly outperform all the other baselines,including for test set,with the remaining images as database points deep hashing baselines,non-deep supervised hashing base- For NUS-WIDE dataset,we randomly choose 2,100 im- lines and unsupervised hashing baselines. ages(100 images per class)as test set,with the rest of the Some baselines,including Lin:Lin,LFH,SDH,COS- images as database points following the setting of DPSH(Li, DISH.KADGH can also be adapted to learn binary hash Wang,and Kang 2016).Because the deep hashing baselines codes for database directly due to their training efficiency are very time-consuming for training,we randomly select We carry out experiments to evaluate the adapted counter- 10,000 images(500 images per category)which belong to parts of these methods which can learn binary codes for the 20 most categories from database points for training all database directly,and denote the counterparts of these meth- baselines except Lin:V for MS-COCO dataset.Similar to ex- ods as Lin:V,LFH-D,SDH-D,COSDISH-D,KADGH-Dre- isting works like (Li,Wang,and Kang 2016),we randomly spectively.It means that Lin:V,LFH-D,SDH-D,COSDISH- choose 5,000(500 images per class)and 10,500 images(500 D.KADGH-D adopt all database points for training.We re- images per class)from database for training all baselines port the corresponding MAP results in Table 2.We can find except Lin:V for CIFAR-10 and NUS-WIDE,respectively. that Lin:V,LFH-D,SDH-D,COSDISH-D,KADGH-D can The necessity of random sampling for training set will also outperform Lin:Lin,LFH,SDH,COSDISH,KADGH re- be empirically verified in the later section. spectively.This means that directly learning the binary hash We report Mean Average Precision(MAP),Top-K preci- codes for database points is more accurate than the strate-
Table 1: MAP on three datasets. The best results for MAP are shown in bold. Method MS-COCO CIFAR-10 NUS-WIDE 12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32 bits 48 bits ITQ 0.6338 0.6326 0.6308 0.6338 0.2619 0.2754 0.2861 0.2941 0.7143 0.7361 0.7457 0.7553 Lin:Lin 0.6557 0.6722 0.6701 0.6736 0.6099 0.6312 0.6079 0.6013 0.5556 0.5704 0.5627 0.5555 LFH 0.7085 0.7389 0.7580 0.7725 0.4178 0.5738 0.6414 0.6927 0.7116 0.7681 0.7949 0.8135 FastH 0.7194 0.7478 0.7544 0.7604 0.5971 0.6632 0.6847 0.7020 0.7267 0.7692 0.7817 0.8037 SDH 0.6954 0.7078 0.7115 0.7164 0.4539 0.6334 0.6514 0.6603 0.7646 0.7998 0.8017 0.8124 COSDISH 0.6895 0.6924 0.7312 0.7589 0.5831 0.6614 0.6802 0.7016 0.6425 0.7406 0.7843 0.7964 KADGH N/A N/A N/A N/A 0.6134 0.6607 0.6701 0.6829 N/A N/A N/A N/A DPSH 0.7461 0.7667 0.7729 0.7777 0.6818 0.7204 0.7341 0.7464 0.7941 0.8249 0.8351 0.8442 DHN 0.7440 0.7656 0.7691 0.7740 0.6805 0.7213 0.7233 0.7332 0.7719 0.8013 0.8051 0.8146 DSH 0.6962 0.7176 0.7156 0.7220 0.6441 0.7421 0.7703 0.7992 0.7125 0.7313 0.7401 0.7485 ADSH 0.8388 0.8590 0.8633 0.8651 0.8898 0.9280 0.9310 0.9390 0.8400 0.8784 0.8951 0.9055 Binary code length 12 24 32 48 Top-5K precision 0.75 0.8 0.85 0.9 (a) MS-COCO Binary code length 12 24 32 48 Top-5K precision 0.35 0.45 0.55 0.65 0.75 0.85 0.95 (b) CIFAR-10 Binary code length 12 24 32 48 Top-5K precision 0.65 0.7 0.75 0.8 0.85 0.9 (c) NUS-WIDE Figure 2: Top-5K precision on three datasets. tune the learning rate among [10−6 , 10−2 ] by using a validation set. For ADSH method, we set γ = 200, Tout = 50, Tin = 3, |Ω| = 2000 by using a validation strategy for all datasets. To avoid effect caused by class-imbalance problem between positive and negative similarity information, we empirically set the weight of the element -1 in S as the ratio between the number of element 1 and the number of element -1 in S. For MS-COCO dataset, we use the pruned training images as database points and randomly select 5,000 images (250 images per category) which belong to the 20 most categories from validation set as test set. For CIFAR-10 dataset, we randomly select 1,000 images (100 images per class) for test set, with the remaining images as database points. For NUS-WIDE dataset, we randomly choose 2,100 images (100 images per class) as test set, with the rest of the images as database points following the setting of DPSH (Li, Wang, and Kang 2016). Because the deep hashing baselines are very time-consuming for training, we randomly select 10,000 images (500 images per category) which belong to the 20 most categories from database points for training all baselines except Lin:V for MS-COCO dataset. Similar to existing works like (Li, Wang, and Kang 2016), we randomly choose 5,000 (500 images per class) and 10,500 images (500 images per class) from database for training all baselines except Lin:V for CIFAR-10 and NUS-WIDE, respectively. The necessity of random sampling for training set will also be empirically verified in the later section. We report Mean Average Precision (MAP), Top-K precision curve to evaluate the proposed ADSH and baselines. For NUS-WIDE dataset, the MAP results are calculated based on the Top-5K returned samples. We also compare the training time between different deep hashing methods. All experiments are run five times, and the average values are reported. Accuracy The MAP results are presented in Table 1. We can find that in most cases the supervised methods can outperform the unsupervised methods, and the deep methods can outperform the non-deep methods. Furthermore, we can find that ADSH can significantly outperform all the other baselines, including deep hashing baselines, non-deep supervised hashing baselines and unsupervised hashing baselines. Some baselines, including Lin:Lin, LFH, SDH, COSDISH, KADGH can also be adapted to learn binary hash codes for database directly due to their training efficiency. We carry out experiments to evaluate the adapted counterparts of these methods which can learn binary codes for database directly, and denote the counterparts of these methods as Lin:V, LFH-D, SDH-D, COSDISH-D, KADGH-D respectively. It means that Lin:V, LFH-D, SDH-D, COSDISHD, KADGH-D adopt all database points for training. We report the corresponding MAP results in Table 2. We can find that Lin:V, LFH-D, SDH-D, COSDISH-D, KADGH-D can outperform Lin:Lin, LFH, SDH, COSDISH, KADGH respectively. This means that directly learning the binary hash codes for database points is more accurate than the strate-
Table 2:MAP on three datasets.The best results for MAP are shown in bold. MS-COCO CIFAR-10 NUS-WIDE Method 12 bits 24 bits 32bits 48 bits 12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32 bits 48 bits Lin:V 0.76I60.7803 0.78T2 0.7842 0.8206 0.8I600.8038 0.7993 0.7T63 0.7565 0.7615 0.7605 LFH-D 0.7408 0.7729 0.8063 0.8165 0.5091 0.7267 0.7712 0.8333 0.7761 0.8351 0.8604 0.8790 SDH-D 0.7644 0.7724 0.7708 0.7711 0.6616 0.8466 0.8501 0.8501 0.8571 0.8808 0.8815 0.8756 COSDISH-D 0.6976 0.7685 0.8052 0.7943 0.8544 0.8700 0.8802 0.8771 0.8084 0.8546 0.8636 0.8752 KADGH-D N/A N/A N/A N/a 0.8606 0.8710 0.8759 0.8749 N/A N/A N/A N/A ADSH 083880.85900.8633 0.8651 0.88980.92800.93100.93900.8400 0.87840.8951 0.9055 MS-COCO MS-COCO MS-COCO MS-COCO 0.85 09 0.8 0.75 07 075 0.65 0.6 08 0.6 0.65 12345 10 15 12345 10 15 20 12345 101520 25 12345 101520253035 Training time(in hour) Training time(in hour) Training time(in hour) Training time(in hour) (a)12 bits (b)24 bits (c)32 bits (d)48 bits Figure 3:Training time on MS-COCO dataset. gies which use the learned hash function to generate hash 0.8 codes for database points.We can also find that ADSH can 0.86 .85 outperform all the other baselines in most cases. We also report Top-5K precision in Figure 2 on three 0.83 datasets.Once again,we can find that ADSH can signifi- 0.2 cantly outperform other baselines in most cases especially 0.81 001a11101005001080 1005001000 for large code length. (a)y (b)m Time Complexity Furthermore,we compare our ADSH to deep hashing base- Figure 4:Hyper-parameters on MS-COCO dataset. lines by adopting the whole database as the training set on MS-COCO dataset.The results are shown in Figure 3.Here, DSH,DHN and DPSH denote the deep hashing baselines query points will result in higher computation cost,in our with 10,000 sampled points for training.DSH-D,DHN-D experiments we select a suitable number to get a tradeoff be- and DPSH-D denote the counterparts of the corresponding tween retrieval accuracy and computation cost.By choosing deep hashing baselines which adopt the whole database for training.We can find that if the whole database is used for m =2000,our ADSH can significantly outperform all other deep supervised hashing baselines in terms of both accuracy training,it need more than 10 hours for most baselines to and efficiency. converge.Hence,we have to sample a subset for training on large-scale datasets.From Figure 3,we can also find that to achieve similar accuracy,our ADSH is much faster than al- Conclusion I the baselines,either with sampled training points or with the whole database.In addition,ADSH can achieve a higher In this paper,we propose a novel deep supervised hash- accuracy than all baselines with much less time. ing method,called ADSH,for large-scale nearest neighbor search.To the best of our knowledge,this is the first deep Sensitivity to Parameters supervised hashing method which treats query points and database points in an asymmetric way.Experiments on re- Figure 4 presents the effect of the hyper-parameters y and al datasets show that ADSH can achieve the state-of-the-art the number of query points(m)for ADSH on MS-COCO dataset,with binary code length being 24 bits and 48 bit- performance in real applications. s.From Figure 4(a),we can see that ADSH is not sensi- tive to y in a large range with 1 <y<500.Figure 4 (b) Acknowledgement presents the MAP results for different number of sampled query points (m)on MS-COCO dataset.We can find that This work is supported by the NSFC(61472182),the key re- better retrieval accuracy can be achieved with larger number search and development program of Jiangsu(BE2016121), of sampled query points.Because larger number of sampled and a fund from Tencent
Table 2: MAP on three datasets. The best results for MAP are shown in bold. Method MS-COCO CIFAR-10 NUS-WIDE 12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32 bits 48 bits Lin:V 0.7616 0.7803 0.7812 0.7842 0.8206 0.8160 0.8038 0.7993 0.7163 0.7565 0.7615 0.7605 LFH-D 0.7408 0.7729 0.8063 0.8165 0.5091 0.7267 0.7712 0.8333 0.7761 0.8351 0.8604 0.8790 SDH-D 0.7644 0.7724 0.7708 0.7711 0.6616 0.8466 0.8501 0.8501 0.8571 0.8808 0.8815 0.8756 COSDISH-D 0.6976 0.7685 0.8052 0.7943 0.8544 0.8700 0.8802 0.8771 0.8084 0.8546 0.8636 0.8752 KADGH-D N/A N/A N/A N/A 0.8606 0.8710 0.8759 0.8749 N/A N/A N/A N/A ADSH 0.8388 0.8590 0.8633 0.8651 0.8898 0.9280 0.9310 0.9390 0.8400 0.8784 0.8951 0.9055 Training time (in hour) 12 34 5 10 15 20 MAP 0.6 0.65 0.7 0.75 0.8 0.85 (a) 12 bits Training time (in hour) 12345 10 15 20 25 MAP 0.6 0.7 0.8 0.9 (b) 24 bits Training time (in hour) 12345 10 15 20 25 30 MAP 0.6 0.7 0.8 0.9 (c) 32 bits Training time (in hour) 12345 10 15 20 25 30 35 40 MAP 0.65 0.7 0.75 0.8 0.85 0.9 (d) 48 bits Figure 3: Training time on MS-COCO dataset. gies which use the learned hash function to generate hash codes for database points. We can also find that ADSH can outperform all the other baselines in most cases. We also report Top-5K precision in Figure 2 on three datasets. Once again, we can find that ADSH can signifi- cantly outperform other baselines in most cases especially for large code length. Time Complexity Furthermore, we compare our ADSH to deep hashing baselines by adopting the whole database as the training set on MS-COCO dataset. The results are shown in Figure 3. Here, DSH, DHN and DPSH denote the deep hashing baselines with 10,000 sampled points for training. DSH-D, DHN-D and DPSH-D denote the counterparts of the corresponding deep hashing baselines which adopt the whole database for training. We can find that if the whole database is used for training, it need more than 10 hours for most baselines to converge. Hence, we have to sample a subset for training on large-scale datasets. From Figure 3, we can also find that to achieve similar accuracy, our ADSH is much faster than all the baselines, either with sampled training points or with the whole database. In addition, ADSH can achieve a higher accuracy than all baselines with much less time. Sensitivity to Parameters Figure 4 presents the effect of the hyper-parameters γ and the number of query points (m) for ADSH on MS-COCO dataset, with binary code length being 24 bits and 48 bits. From Figure 4 (a), we can see that ADSH is not sensitive to γ in a large range with 1 < γ < 500. Figure 4 (b) presents the MAP results for different number of sampled query points (m) on MS-COCO dataset. We can find that better retrieval accuracy can be achieved with larger number of sampled query points. Because larger number of sampled 0.01 0.1 1 10 100 500 1000 MAP 0.84 0.85 0.86 0.87 (a) γ 100 500 1000 2000 3000 MAP 0.81 0.82 0.83 0.84 0.85 0.86 0.87 (b) m Figure 4: Hyper-parameters on MS-COCO dataset. query points will result in higher computation cost, in our experiments we select a suitable number to get a tradeoff between retrieval accuracy and computation cost. By choosing m = 2000, our ADSH can significantly outperform all other deep supervised hashing baselines in terms of both accuracy and efficiency. Conclusion In this paper, we propose a novel deep supervised hashing method, called ADSH, for large-scale nearest neighbor search. To the best of our knowledge, this is the first deep supervised hashing method which treats query points and database points in an asymmetric way. Experiments on real datasets show that ADSH can achieve the state-of-the-art performance in real applications. Acknowledgement This work is supported by the NSFC (61472182), the key research and development program of Jiangsu (BE2016121), and a fund from Tencent
References Liu,W.;Wang,J.;Kumar,S.;and Chang,S.2011.Hashing Andoni,A.,and Indyk,P.2006.Near-optimal hashing algo- with graphs.In ICML,1-8. rithms for approximate nearest neighbor in high dimensions. Liu,W.;Wang,J.;Ji,R.;Jiang,Y.;and Chang,S.2012. InF0CS.459468. Supervised hashing with kernels.In CVPR,2074-2081. Andoni,A.,and Razenshteyn,I.P.2015.Optimal data- Liu,W.;Mu,C.;Kumar,S.;and Chang,S.2014.Discrete dependent hashing for approximate near neighbors.In S- graph hashing.In NIPS,3419-3427. T0C.793-801. Liu,H.;Wang,R.;Shan,S.;and Chen,X.2016a.Deep Chatfield,K.;Simonyan,K.;Vedaldi,A.;and Zisserman,A. supervised hashing for fast image retrieval.In CVPR,2064 2014.Return of the devil in the details:Delving deep into 2072 convolutional nets.In BMVC. Liu,H.;Ji.R.:Wu.Y.:and Liu,W.2016b.Towards optimal Chua,T.;Tang,J.;Hong,R.;Li,H.;Luo,Z.;and Zheng,Y. binary code learning via ordinal embedding.In AAA/,1258 1265. 2009.NUS-WIDE:a real-world web image database from national university of singapore.In CIVR. Neyshabur,B.;Srebro,N.;Salakhutdinov,R.;Makarychev, Dasgupta,S.;Stevens,C.F.;and Navlakha,S.2017.A neu- Y.;and Yadollahpour,P.2013.The power of asymmetry in binary hashing.In NIPS,2823-2831. ral algorithm for a fundamental computing problem.Science 358(6364):793-796. Norouzi,M.;Fleet,D.J.;and Salakhutdinov,R.2012.Ham- ming distance metric learning.In NIPS,1070-1078. Datar,M.;Immorlica,N.;Indyk,P.;and Mirrokni,V.S. 2004.Locality-sensitive hashing scheme based on p-stable Shen,F;Liu,W.;Zhang,S.;Yang,Y.;and Shen,H.T. distributions.In SCG.253-262. 2015a.Learning binary codes for maximum inner product search.In /CCV.4148-4156. Gionis,A.;Indyk,P.;and Motwani,R.1999.Similarity search in high dimensions via hashing.In VLDB,518-529. Shen,F.;Shen,C.;Liu,W.;and Shen,H.T.2015b.Super- vised discrete hashing.In CVPR,37-45. Gong,Y.,and Lazebnik,S.2011.Iterative quantization:A Shen,F.;Gao,X.;Liu,L.;Yang,Y.;and Shen,H.T.2017. procrustean approach to learning binary codes.In CVPR, Deep asymmetric pairwise hashing.In MM,1522-1530. 817-824. Shi,X.:Xing,F.;Xu,K.;Sapkota,M.;and Yang,L.2017. Heo,J.;Lee,Y.;He,J.;Chang,S.;and Yoon,S.2012.Spher- Asymmetric discrete graph hashing.In AAA/,2541-2547. ical hashing.In CVPR,2957-2964. Song,D.;Liu,W.;Ji,R.;Meyer,D.A.;and Smith,J.R. Jiang,Q.-Y.,and Li,W.-J.2015.Scalable graph hashing 2015.Top rank supervised binary coding for visual search. with feature transformation.In //CAl,2248-2254. In1CCV.1922-1930. Kang,W.-C.;Li,W.-J.;and Zhou,Z.-H.2016.Column Vedaldi,A..and Lenc,K.2015.Matconvnet:Convolutional sampling based discrete supervised hashing.In AAAl,1230- neural networks for MATLAB.In MM.689-692. 1236. Wang,J.;Liu,W.;Sun,A.X.;and Jiang,Y.2013.Learning Kong,W.,and Li,W.-J.2012.Isotropic hashing.In NIPS, hash codes with listwise supervision.In ICCV,3032-3039. 1655-1663. Weiss,Y.:Torralba,A.;and Fergus,R.2008.Spectral hash- Krizhevsky,A.;Sutskever,I.;and Hinton,G.E.2012.Ima- ing.n NIPS,1753-1760. genet classification with deep convolutional neural network- Xia,R.;Pan,Y.;Lai,H.;Liu,C.;and Yan,S.2014.Super- s.In NIPS,1106-1114 vised hashing for image retrieval via image representation Krizhevsky,A.2009.Learning multiple layers of features learning.In AAA/,2156-2162. from tiny images.Master's thesis,University of Toronto. Xie,L.:Shen.J.:and Zhu,L.2016.Online cross-modal Kulis,B.,and Grauman,K.2009.Kernelized locality- hashing for web image retrieval.In AAA/,294-300. sensitive hashing for scalable image search.In ICCV,2130- Zhang,D.;Wang,J.;Cai,D.;and Lu,J.2010.Self-taught 2137. hashing for fast similarity search.In S/GIR,18-25. Lai,H.;Pan,Y.;Liu,Y.;and Yan,S.2015.Simultaneous Zhang,P.;Zhang,W.;Li,W.-J.;and Guo,M.2014.Super- feature learning and hash coding with deep neural networks vised hashing with latent factor models.In S/GIR,173-182. In CVPR.3270-3278. Zhang,R.;Lin,L.;Zhang,R.;Zuo,W.;and Zhang,L.2015. Li,W.-J.;Wang,S.;and Kang,W.-C.2016.Feature learn- Bit-scalable deep hashing with regularized similarity learn- ing based deep supervised hashing with pairwise labels.In ing for image retrieval and person re-identification.77P 1JCAL,1711-1717 24(12):4766-4779. Lin,G.;Shen,C.;Shi,Q.;van den Hengel,A.;and Suter, Zhao,F.;Huang,Y.;Wang,L.;and Tan,T.2015.Deep D.2014a.Fast supervised hashing with decision trees for semantic ranking based hashing for multi-label image re- high-dimensional data.In CVPR.1971-1978 trieval.In CVPR,1556-1564. Lin,T.;Maire,M.;Belongie,S.J.;Hays,J.;Perona,P.;Ra- Zhu,H.;Long,M.;Wang,J.;and Cao,Y.2016.Deep hash- manan,D.;Dollar,P.;and Zitnick,C.L.2014b.Microsoft ing network for efficient similarity retrieval.In AAA/,2415- COCO:common objects in context.In ECCV,740-755. 2421
References Andoni, A., and Indyk, P. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 459–468. Andoni, A., and Razenshteyn, I. P. 2015. Optimal datadependent hashing for approximate near neighbors. In STOC, 793–801. Chatfield, K.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Return of the devil in the details: Delving deep into convolutional nets. In BMVC. Chua, T.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: a real-world web image database from national university of singapore. In CIVR. Dasgupta, S.; Stevens, C. F.; and Navlakha, S. 2017. A neural algorithm for a fundamental computing problem. Science 358(6364):793–796. Datar, M.; Immorlica, N.; Indyk, P.; and Mirrokni, V. S. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, 253–262. Gionis, A.; Indyk, P.; and Motwani, R. 1999. Similarity search in high dimensions via hashing. In VLDB, 518–529. Gong, Y., and Lazebnik, S. 2011. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 817–824. Heo, J.; Lee, Y.; He, J.; Chang, S.; and Yoon, S. 2012. Spherical hashing. In CVPR, 2957–2964. Jiang, Q.-Y., and Li, W.-J. 2015. Scalable graph hashing with feature transformation. In IJCAI, 2248–2254. Kang, W.-C.; Li, W.-J.; and Zhou, Z.-H. 2016. Column sampling based discrete supervised hashing. In AAAI, 1230– 1236. Kong, W., and Li, W.-J. 2012. Isotropic hashing. In NIPS, 1655–1663. Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1106–1114. Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto. Kulis, B., and Grauman, K. 2009. Kernelized localitysensitive hashing for scalable image search. In ICCV, 2130– 2137. Lai, H.; Pan, Y.; Liu, Y.; and Yan, S. 2015. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, 3270–3278. Li, W.-J.; Wang, S.; and Kang, W.-C. 2016. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, 1711–1717. Lin, G.; Shen, C.; Shi, Q.; van den Hengel, A.; and Suter, D. 2014a. Fast supervised hashing with decision trees for high-dimensional data. In CVPR, 1971–1978. Lin, T.; Maire, M.; Belongie, S. J.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; and Zitnick, C. L. 2014b. Microsoft ´ COCO: common objects in context. In ECCV, 740–755. Liu, W.; Wang, J.; Kumar, S.; and Chang, S. 2011. Hashing with graphs. In ICML, 1–8. Liu, W.; Wang, J.; Ji, R.; Jiang, Y.; and Chang, S. 2012. Supervised hashing with kernels. In CVPR, 2074–2081. Liu, W.; Mu, C.; Kumar, S.; and Chang, S. 2014. Discrete graph hashing. In NIPS, 3419–3427. Liu, H.; Wang, R.; Shan, S.; and Chen, X. 2016a. Deep supervised hashing for fast image retrieval. In CVPR, 2064– 2072. Liu, H.; Ji, R.; Wu, Y.; and Liu, W. 2016b. Towards optimal binary code learning via ordinal embedding. In AAAI, 1258– 1265. Neyshabur, B.; Srebro, N.; Salakhutdinov, R.; Makarychev, Y.; and Yadollahpour, P. 2013. The power of asymmetry in binary hashing. In NIPS, 2823–2831. Norouzi, M.; Fleet, D. J.; and Salakhutdinov, R. 2012. Hamming distance metric learning. In NIPS, 1070–1078. Shen, F.; Liu, W.; Zhang, S.; Yang, Y.; and Shen, H. T. 2015a. Learning binary codes for maximum inner product search. In ICCV, 4148–4156. Shen, F.; Shen, C.; Liu, W.; and Shen, H. T. 2015b. Supervised discrete hashing. In CVPR, 37–45. Shen, F.; Gao, X.; Liu, L.; Yang, Y.; and Shen, H. T. 2017. Deep asymmetric pairwise hashing. In MM, 1522–1530. Shi, X.; Xing, F.; Xu, K.; Sapkota, M.; and Yang, L. 2017. Asymmetric discrete graph hashing. In AAAI, 2541–2547. Song, D.; Liu, W.; Ji, R.; Meyer, D. A.; and Smith, J. R. 2015. Top rank supervised binary coding for visual search. In ICCV, 1922–1930. Vedaldi, A., and Lenc, K. 2015. Matconvnet: Convolutional neural networks for MATLAB. In MM, 689–692. Wang, J.; Liu, W.; Sun, A. X.; and Jiang, Y. 2013. Learning hash codes with listwise supervision. In ICCV, 3032–3039. Weiss, Y.; Torralba, A.; and Fergus, R. 2008. Spectral hashing. In NIPS, 1753–1760. Xia, R.; Pan, Y.; Lai, H.; Liu, C.; and Yan, S. 2014. Supervised hashing for image retrieval via image representation learning. In AAAI, 2156–2162. Xie, L.; Shen, J.; and Zhu, L. 2016. Online cross-modal hashing for web image retrieval. In AAAI, 294–300. Zhang, D.; Wang, J.; Cai, D.; and Lu, J. 2010. Self-taught hashing for fast similarity search. In SIGIR, 18–25. Zhang, P.; Zhang, W.; Li, W.-J.; and Guo, M. 2014. Supervised hashing with latent factor models. In SIGIR, 173–182. Zhang, R.; Lin, L.; Zhang, R.; Zuo, W.; and Zhang, L. 2015. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. TIP 24(12):4766–4779. Zhao, F.; Huang, Y.; Wang, L.; and Tan, T. 2015. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, 1556–1564. Zhu, H.; Long, M.; Wang, J.; and Cao, Y. 2016. Deep hashing network for efficient similarity retrieval. In AAAI, 2415– 2421