Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) Semi-Supervised Deep Hashing with a Bipartite Graph Xinyu Yan,Lijun Zhang,Wu-Jun Li National Key Laboratory for Novel Software Technology,Nanjing University,Nanjing 210023,China yanxy,zhanglj@lamda.nju.edu.cn,liwujun @nju.edu.cn Abstract (LSH)[Gionis et al.,1999]and data-dependent ones like Iter- ative Quantization (ITQ)Gong et al.,2013].Spectral Hash- Recently,deep learning has been successfully ap- plied to the problem of hashing,yielding remark- ing(SH)[Weiss et al,2009],Anchor Graph Hashing (AGH) able performance compared to traditional method- [Liu et al.,2011].On the other hand,to deal with more com- plicated semantic similarity,supervised hashing methods are s with hand-crafted features.However,most of proposed to exploit label information to improve the hash- existing deep hashing methods are designed for ing quality.Representative supervised methods include La- the supervised scenario and require a large num- tent Factor Hashing (LFH)[Zhang et al.,2014],Fast Super- ber of labeled data.In this paper,we propose vised Hashing (FastH)[Lin et al.,2014],Supervised Discrete a novel semi-supervised hashing method for im- age retrieval,named Deep Hashing with a Bipar- Hashing(SDH)[Shen et al.,2015].However,labeling large- scale image dataset is inefficient and time-consuming.As a tite Graph(BGDH),to simultaneously learn em- result,Semi-Supervised Hashing (SSH)[Wang et al.,2012] beddings,features and hash codes.More specifi- has been developed to make use of labeled data as well as the cally,we construct a bipartite graph to discover the abundant unlabeled data. underlying structure of data,based on which an em- bedding is generated for each instance.Then,we In traditional hashing methods,images are represented feed raw pixels as well as embeddings to a deep by hand-crafted features such as GIST [Oliva and Torralba. neural network,and concatenate the resulting fea- 2001],and the choice of features requires heavy manual in- tures to determine the hash code.Compared to ex- terventions.Motivated from the great success of deep neu- isting methods,BGDH is a universal framework ral networks in image analysis [Krizhevsky et al.,2012],re- that is able to utilize various types of graphs and cently some deep hashing methods have been proposed to losses.Furthermore,we propose an inductive vari- learn features and hash codes simultaneously [Li et al..2016: ant of BGDH to support out-of-sample extensions Zhu et al..2016:Liu et al..2016.Although those deep hash- Experimental results on real datasets show that our ing methods yield better performance compared with the tra- BGDH outperforms state-of-the-art hashing meth- ditional methods,they usually need a large number of labeled ods instances as training data.To address this limitation,a semi- supervised deep hashing,named SSDH,have been develope- d [Zhang et al.,2016].SSDH is fundamentally built upon 1 Introduction graph-based semi-supervised learning [Zhou et al.,2004]and With the explosion in the volume of image data,it has been the loss function contains a graph regularization term which raised as a big challenge about how to index and organize involves both the labeled and unlabeled data.In theory,SS- these data efficiently and accurately.Approximate Nearest DH needs to construct a nearest neighbor graph of all the da- Neighbor (ANN)search [Indyk and Motwani.1998]has be- ta.Unfortunately,this step takes O(n2)time,where n is the come a popular way to retrieve content from images with both number of instances,and thus intractable for large scale data. computational efficiency and search quality.Among existing In this paper,we propose a novel semi-supervised hash- ANN search methods.hashing is advantageous due to its fast ing method,named Deep Hashing with a Bipartite Graph query speed and low memory complexity [Gionis et al.,1999; (BGDH),which performs graph embedding,feature learning Gong et al.,2013].It aims to transform high-dimensional im- and hash code learning in a unified framework.First,we con- ages into a set of short binary codes while maintaining simi- struct a bipartite graph to capture the information hidden in larity of the original data. the labeled and unlabeled data.The bipartite graph could be Generally speaking,hashing methods can be divided into a semantic graph that describes relationships between images two categories:unsupervised and supervised.Unsupervised and concepts,an anchor graph that describes similarities be- methods utilize some kinds of distance metrics to learn a hash tween images and landmarks [Liu et al.,2010],or a tradition- function from unlabeled data.Methods in this category in- al nearest neighbor graph.Then,inspired by the recent work clude data-independent ones like Locally Sensitive Hashing on graph embedding [Yang et al.,20161,we learn an embed- 3238
Semi-Supervised Deep Hashing with a Bipartite Graph Xinyu Yan, Lijun Zhang, Wu-Jun Li National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China {yanxy, zhanglj}@lamda.nju.edu.cn, liwujun@nju.edu.cn Abstract Recently, deep learning has been successfully applied to the problem of hashing, yielding remarkable performance compared to traditional methods with hand-crafted features. However, most of existing deep hashing methods are designed for the supervised scenario and require a large number of labeled data. In this paper, we propose a novel semi-supervised hashing method for image retrieval, named Deep Hashing with a Bipartite Graph (BGDH), to simultaneously learn embeddings, features and hash codes. More specifi- cally, we construct a bipartite graph to discover the underlying structure of data, based on which an embedding is generated for each instance. Then, we feed raw pixels as well as embeddings to a deep neural network, and concatenate the resulting features to determine the hash code. Compared to existing methods, BGDH is a universal framework that is able to utilize various types of graphs and losses. Furthermore, we propose an inductive variant of BGDH to support out-of-sample extensions. Experimental results on real datasets show that our BGDH outperforms state-of-the-art hashing methods. 1 Introduction With the explosion in the volume of image data, it has been raised as a big challenge about how to index and organize these data efficiently and accurately. Approximate Nearest Neighbor (ANN) search [Indyk and Motwani, 1998] has become a popular way to retrieve content from images with both computational efficiency and search quality. Among existing ANN search methods, hashing is advantageous due to its fast query speed and low memory complexity [Gionis et al., 1999; Gong et al., 2013]. It aims to transform high-dimensional images into a set of short binary codes while maintaining similarity of the original data. Generally speaking, hashing methods can be divided into two categories: unsupervised and supervised. Unsupervised methods utilize some kinds of distance metrics to learn a hash function from unlabeled data. Methods in this category include data-independent ones like Locally Sensitive Hashing (LSH) [Gionis et al., 1999] and data-dependent ones like Iterative Quantization (ITQ) [Gong et al., 2013], Spectral Hashing (SH) [Weiss et al., 2009], Anchor Graph Hashing (AGH) [Liu et al., 2011]. On the other hand, to deal with more complicated semantic similarity, supervised hashing methods are proposed to exploit label information to improve the hashing quality. Representative supervised methods include Latent Factor Hashing (LFH) [Zhang et al., 2014], Fast Supervised Hashing (FastH) [Lin et al., 2014], Supervised Discrete Hashing (SDH) [Shen et al., 2015]. However, labeling largescale image dataset is inefficient and time-consuming. As a result, Semi-Supervised Hashing (SSH) [Wang et al., 2012] has been developed to make use of labeled data as well as the abundant unlabeled data. In traditional hashing methods, images are represented by hand-crafted features such as GIST [Oliva and Torralba, 2001], and the choice of features requires heavy manual interventions. Motivated from the great success of deep neural networks in image analysis [Krizhevsky et al., 2012], recently some deep hashing methods have been proposed to learn features and hash codes simultaneously [Li et al., 2016; Zhu et al., 2016; Liu et al., 2016]. Although those deep hashing methods yield better performance compared with the traditional methods, they usually need a large number of labeled instances as training data. To address this limitation, a semisupervised deep hashing, named SSDH, have been developed [Zhang et al., 2016]. SSDH is fundamentally built upon graph-based semi-supervised learning [Zhou et al., 2004] and the loss function contains a graph regularization term which involves both the labeled and unlabeled data. In theory, SSDH needs to construct a nearest neighbor graph of all the data. Unfortunately, this step takes O(n 2 ) time, where n is the number of instances, and thus intractable for large scale data. In this paper, we propose a novel semi-supervised hashing method, named Deep Hashing with a Bipartite Graph (BGDH), which performs graph embedding, feature learning and hash code learning in a unified framework. First, we construct a bipartite graph to capture the information hidden in the labeled and unlabeled data. The bipartite graph could be a semantic graph that describes relationships between images and concepts, an anchor graph that describes similarities between images and landmarks [Liu et al., 2010], or a traditional nearest neighbor graph. Then, inspired by the recent work on graph embedding [Yang et al., 2016], we learn an embedProceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 3238
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence(IJCAI-17) ding for each instance to predict the neighborhood context in point xi into a c-dimensional binary code bi=H(xi)= the graph.Finally,we feed both raw pixels and embeddings [hi(xi),h2(xi),...,he(xi)]TE{-1,1)c.The binary codes to a deep neural network,and concatenate the corresponding B=[bshould preserve the semantic similarity and hidden layers when producing binary codes.BGDH is a gen- structure similarity in the Hamming space. eral learning framework in the sense that any loss function of hashing and any type of graph can be incorporated. 3 Semi-Supervised Deep Hashing with a Graph-based methods are usually transductive,because they can only handle instances that are already appeared in Bipartite Graph the graph.Since embeddings of instances are learnt from In this section,we first present the details of our semi- the graph,the basic BGDH is also transductive.To address supervised Deep Hashing with a Bipartite Graph (BGDH), the out-of-sample problem,we further propose an inductive then introduce an inductive variant and finally discuss the variant of BGDH,in which the embeddings are defined as learning procedure. a parametric function of the raw features.In this way,we can produce hash codes for new instances that have not seen 3.1 The Proposed BGDH Framework during training.To demonstrate the effectiveness of our ap- The end-to-end deep learning architecture of our BGDH is proach,we conduct extensive experiments on two large-scale shown in Figure 1,which includes three main components: datasets:CIFAR-10 and NUS-WIDE.Experimental result- graph embedding,feature learning,and hash code learning. s show that BGDH outperforms other methods and achieves Similar to other semi-supervised learning methods [Yang et the state-of-the-art performance in image retrieval. al.,2016],the loss function of BGDH can be expressed as Finally,we emphasize that although both BGDH and SS- DH are semi-supervised deep hashing methods,the proposed C.+λCg (1) BGDH differs from SSDH in the following two aspects: where Cs is a supervised loss designed to preserve the simi- a. While SSDH is built upon graph regularization,our larity between pairwise instances,and Co is an unsupervised BGDH relies on graph embedding. loss of predicting the graph context.In the following,we b.SSDH uses graphs to exploit the unlabeled data,in con- first introduce Co which aims to learn embeddings from the tract BGDH makes use of bipartite graphs,which can bipartite graph g,then formulate Cs which is used to leam be constructed more efficiently since building an anchor both features and binary codes from hidden layers of deep graph only costs O(n)time. networks. 2 Notations and Problem Definitions Graph embedding We propose to use a bipartite graph g=(,O,E)to cap- In this section,we introduce notations and problem defini- ture the information hidden in the unlabeled data.It can be tions. constructed in different ways as stated below. 2.1 Notations An anchor graph constructed from the dataset t.In this case,O contains m landmarks and the construction of G We use script letters like to denote sets,boldface lowercase takes O(n)time [Liu et al.,20101. letters like e to denote vectors and boldface uppercase letters .A nearest neighbor graph.In this case,=Y and the like E to denote matrices.We denote the element at the i-th construction of G takes O(n2)time. row and j-th column of E by Eii.ET is the transpose of E. A semantic graph constructed from external data.In this l2 denotes the Euclidean norm of a vector and F denotes case,O may contain m concepts,styles,or owners. the Frobenius norm of a matrix.sgn()is an element-wise In the following,we briefly introduce one way to construct an sign function and o()is the sigmoid function.u;v]denotes anchor graph.We first randomly sample m instances from the concatenation of two vectors u and v. as landmarks,denoted by =fo1,...,om}.Then,we put an edge between xi and o;if o;is among k nearest landmarks 2.2 Problem Definitions of x;,or if the distance between them is smaller than some Given a set of n instances/images ={xi where xi is threshold e.Let A E Rxm be the similarity matrix of g, the feature vector of the i-th instance.Without loss of gener- where A:;denotes the weight of the edge between xi and oj. ality,we assume the first l instances [x1,...,x}are labeled The value of Aij may be binary,that is,Ai=1 if there is and the rest are unlabeled.We assume the supervised infor- an edge,otherwise 0.If a real value is preferred,Aij can be mation is given in term of pairwise labels,though our method set according to the heat kernel:Aij=e/,where can also support other kinds of labels.Specifically,for the P>0 is a parameter. first I instances,we have a set of pairwise labels S={sij} The goal of graph embedding is to learn an embedding for with sij E[0,1,where sij =1 means that xi and xj are each instance that predicts the context in the graph.Given an similar,sij=0 means that xi and xj are dissimilar.In ad- instance and its context,the objective of graph embedding is dition,a bipartite graph g=(,O,E)between n instances usually formulated as minimizing certain loss of predicting and m objects are given,where is a set of objects such as the context using the embedding of an instance as input fea- concepts or landmarks,and is the set of edges. ture [Weston et al.,2012;Mikolov et al.,2013].The context The goal of our semi-supervised hashing method is to learn of an instance can be simply defined as its neighbors in the a mapping function H:xi-1,1c,which encodes each graph,or generated by sophisticated methods such as random 3239
ding for each instance to predict the neighborhood context in the graph. Finally, we feed both raw pixels and embeddings to a deep neural network, and concatenate the corresponding hidden layers when producing binary codes. BGDH is a general learning framework in the sense that any loss function of hashing and any type of graph can be incorporated. Graph-based methods are usually transductive, because they can only handle instances that are already appeared in the graph. Since embeddings of instances are learnt from the graph, the basic BGDH is also transductive. To address the out-of-sample problem, we further propose an inductive variant of BGDH, in which the embeddings are defined as a parametric function of the raw features. In this way, we can produce hash codes for new instances that have not seen during training. To demonstrate the effectiveness of our approach, we conduct extensive experiments on two large-scale datasets: CIFAR-10 and NUS-WIDE. Experimental results show that BGDH outperforms other methods and achieves the state-of-the-art performance in image retrieval. Finally, we emphasize that although both BGDH and SSDH are semi-supervised deep hashing methods, the proposed BGDH differs from SSDH in the following two aspects: a. While SSDH is built upon graph regularization, our BGDH relies on graph embedding. b. SSDH uses graphs to exploit the unlabeled data, in contract BGDH makes use of bipartite graphs, which can be constructed more efficiently since building an anchor graph only costs O(n) time. 2 Notations and Problem Definitions In this section, we introduce notations and problem definitions. 2.1 Notations We use script letters like X to denote sets, boldface lowercase letters like e to denote vectors and boldface uppercase letters like E to denote matrices. We denote the element at the i-th row and j-th column of E by Eij . ET is the transpose of E. k·k2 denotes the Euclidean norm of a vector and k·kF denotes the Frobenius norm of a matrix. sgn(·) is an element-wise sign function and σ(·) is the sigmoid function. [u; v] denotes the concatenation of two vectors u and v. 2.2 Problem Definitions Given a set of n instances/images X = {xi} n i=1 where xi is the feature vector of the i-th instance. Without loss of generality, we assume the first l instances {x1, . . . , xl} are labeled and the rest are unlabeled. We assume the supervised information is given in term of pairwise labels, though our method can also support other kinds of labels. Specifically, for the first l instances, we have a set of pairwise labels S = {sij} with sij ∈ {0, 1}, where sij = 1 means that xi and xj are similar, sij = 0 means that xi and xj are dissimilar. In addition, a bipartite graph G = (X , O, E) between n instances and m objects are given, where O is a set of objects such as concepts or landmarks, and E is the set of edges. The goal of our semi-supervised hashing method is to learn a mapping function H : xi → {−1, 1} c , which encodes each point xi into a c-dimensional binary code bi = H(xi) = [h1(xi), h2(xi), · · · , hc(xi)]T ∈ {−1, 1} c . The binary codes B = {bi} n i=1 should preserve the semantic similarity and structure similarity in the Hamming space. 3 Semi-Supervised Deep Hashing with a Bipartite Graph In this section, we first present the details of our semisupervised Deep Hashing with a Bipartite Graph (BGDH), then introduce an inductive variant and finally discuss the learning procedure. 3.1 The Proposed BGDH Framework The end-to-end deep learning architecture of our BGDH is shown in Figure 1, which includes three main components: graph embedding, feature learning, and hash code learning. Similar to other semi-supervised learning methods [Yang et al., 2016], the loss function of BGDH can be expressed as Ls + λLg (1) where Ls is a supervised loss designed to preserve the similarity between pairwise instances, and Lg is an unsupervised loss of predicting the graph context. In the following, we first introduce Lg which aims to learn embeddings from the bipartite graph G, then formulate Ls which is used to learn both features and binary codes from hidden layers of deep networks. Graph embedding We propose to use a bipartite graph G = (X , O, E) to capture the information hidden in the unlabeled data. It can be constructed in different ways as stated below. • An anchor graph constructed from the dataset X . In this case, O contains m landmarks and the construction of G takes O(n) time [Liu et al., 2010]. • A nearest neighbor graph. In this case, O = X and the construction of G takes O(n 2 ) time. • A semantic graph constructed from external data. In this case, O may contain m concepts, styles, or owners. In the following, we briefly introduce one way to construct an anchor graph. We first randomly sample m instances from X as landmarks, denoted by O = {o1, . . . , om}. Then, we put an edge between xi and oj if oj is among k nearest landmarks of xi , or if the distance between them is smaller than some threshold . Let A ∈ R n×m be the similarity matrix of G, where Aij denotes the weight of the edge between xi and oj . The value of Aij may be binary, that is, Aij = 1 if there is an edge, otherwise 0. If a real value is preferred, Aij can be set according to the heat kernel: Aij = e −kxi−oj k 2 2/ρ, where ρ > 0 is a parameter. The goal of graph embedding is to learn an embedding for each instance that predicts the context in the graph. Given an instance and its context, the objective of graph embedding is usually formulated as minimizing certain loss of predicting the context using the embedding of an instance as input feature [Weston et al., 2012; Mikolov et al., 2013]. The context of an instance can be simply defined as its neighbors in the graph, or generated by sophisticated methods such as random Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 3239
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) Feature Learning Hash Code Learning labels Predict pairwise Labeled Training Set Graph Embedding Graph context Predict graph Embeddings Prediction Loss Figure 1:The end-to-end deep learning architecture of BGDH. walk [Perozzi et al.,2014].Following previous studies [Yang Algorithm 1 Context generation based on random walk et al.,2016],we present a simple procedure for context gen- 1:Input:A bipartite graph g=(,O,E),parameters r eration in Algorithm 1,where the parameter d is the length and d of the random walk and r E(0,1)determines the ratio of positive contexts. 2:Uniformly sample an instance i from 3:Uniformly sample a random variable u from [0,1] By invoking Algorithm 1 t times,we obtain a set of triples 4:if u<r then (ijcwhere 1 indicates node cj is a context y←+1 of node ij,and =-1 indicates cj is not a context.Let 6 ei,be the embedding of node ij,and we,be the parameters Uniformly sample a random walk sequence S of length d started from i for predicting node c;as a context.We define the objective 7: Uniformly sample a context c from S except i function of graph embedding 8:else t 9: y←-1 Cg= (ei,we,:7j) (2) 10: Uniformly sample context c from&' 11 11:end if 12:return(i,c,y) where e(,)is a loss that measures the discrepancy between ei We,and Y.In machine learning,the following losses are commonly used. network is presented in Table 1,and a detailed explanation ●The square loss can be found in [Li et al.,2016].The output of the last fea- ewg)=(-ewg)月 ture learning layer(full7)of labeled instance xi is represent- ed by (xi;0),where 6 denotes all the parameters in the seven layers of feature learning part.In contrast with exist- ·The logistic loss ing supervised hashing methods,we also learn features from embeddings of labeled instances.The output associated with (ewe,)=log(1+e-yeiWey) embedding ei is denoted by (ei;0e),where be contains all the parameters in hidden layers.In this paper,we only add To constrain the solution space,we may further impose sparse one fully-connected layer for embeddings,of which the size constraints or nonnegative constraints [Lee and Seung,1999]. is determined by the dimension of embeddings. Feature learning and hash code learning We concatenate (xi;0)and (ei;be)as a new feature We utilize deep neural network model to learn features from for instance i,then send it to a hash code learning layer as: the raw pixels and embeddings of labeled instances,and then ui M [o(xi;0z);(ei;0)]+v (3) combine them together to learn the hash codes.BGDH con- tains a CNN model to learn features from raw image pixels, where M E R(4096+d)xe denotes a weight matrix,d is the and the model has seven layers as those of CNN-F [Chatfield dimension ofembedding,and vERex is a bias vector.Note et al.,2014]while other networks like AlexNet [Krizhevsky that any supervised loss function of hashing can be used in and Hinton,2009]can be used too.The configuration of the our framework to learn parameters M and v.In this paper,we 3240
Labeled Training Set Graph Feature Learning Hash Code Learning Predict pairwise labels Embeddings Prediction Loss Predict graph context labeled unlabeled {ui} ψ(e, θe) Graph Embedding φ(x,θx) Figure 1: The end-to-end deep learning architecture of BGDH. walk [Perozzi et al., 2014]. Following previous studies [Yang et al., 2016], we present a simple procedure for context generation in Algorithm 1, where the parameter d is the length of the random walk and r ∈ (0, 1) determines the ratio of positive contexts. By invoking Algorithm 1 t times, we obtain a set of triples {(ij , cj , γj )} t j=1 where γj = 1 indicates node cj is a context of node ij , and γj = −1 indicates cj is not a context. Let eij be the embedding of node ij , and wcj be the parameters for predicting node cj as a context. We define the objective function of graph embedding Lg = 1 t Xt j=1 `(e > ijwcj , γj ) (2) where `(·, ·) is a loss that measures the discrepancy between e > ijwcj and γj . In machine learning, the following losses are commonly used. • The square loss `(e > ijwcj , γj ) = γj − e > ijwcj 2 • The logistic loss `(e > ijwcj , γj ) = log 1 + e −γje > ij wcj To constrain the solution space, we may further impose sparse constraints or nonnegative constraints [Lee and Seung, 1999]. Feature learning and hash code learning We utilize deep neural network model to learn features from the raw pixels and embeddings of labeled instances, and then combine them together to learn the hash codes. BGDH contains a CNN model to learn features from raw image pixels, and the model has seven layers as those of CNN-F [Chatfield et al., 2014] while other networks like AlexNet [Krizhevsky and Hinton, 2009] can be used too. The configuration of the Algorithm 1 Context generation based on random walk 1: Input: A bipartite graph G = (X , O, E), parameters r and d 2: Uniformly sample an instance i from X 3: Uniformly sample a random variable u from [0, 1] 4: if u < r then 5: γ ← +1 6: Uniformly sample a random walk sequence S of length d started from i 7: Uniformly sample a context c from S except i 8: else 9: γ ← −1 10: Uniformly sample context c from X 11: end if 12: return (i, c, γ) network is presented in Table 1, and a detailed explanation can be found in [Li et al., 2016]. The output of the last feature learning layer (full7) of labeled instance xi is represented by φ(xi ; θx), where θx denotes all the parameters in the seven layers of feature learning part. In contrast with existing supervised hashing methods, we also learn features from embeddings of labeled instances. The output associated with embedding ei is denoted by ψ(ei ; θe), where θe contains all the parameters in hidden layers. In this paper, we only add one fully-connected layer for embeddings, of which the size is determined by the dimension of embeddings. We concatenate φ(xi ; θx) and ψ(ei ; θe) as a new feature for instance i, then send it to a hash code learning layer as: ui = MT [φ(xi ; θx); ψ(ei ; θe)] + v (3) where M ∈ R (4096+d)×c denotes a weight matrix, d is the dimension of embedding, and v ∈ R c×1 is a bias vector. Note that any supervised loss function of hashing can be used in our framework to learn parameters M and v. In this paper, we Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 3240
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence(IJCAI-17) Table 1:Configuration of the feature learning network inductive hashing model can be written as: Layer Configuration convl filter64×11×1l,stride4×4,pad0,LRN,pool2×2 conv2 filter 256x5x5,stride 1x1,pad 2,LRN,pool 2x2 c。+λcg=-∑(8,6j-log(1+e9)月 conv3 filter256×3×3,stride1×l,padl 845∈S conv4 filter256x3×3,stride1×1,padI conv5 filter256x3×3,stride1x1,pad1,pool2×2 +n∑Ilb,-(MP[(x;0)片(5(x;0z):0e月+v)l full6 4096 i=1 full7 4096 ∑ewl choose the loss function of deep pairwise-supervised hashing (DPSH)[Li et al.,2016],and Cs is given by (7) We can predict the hash code of any point xas: C。=-∑(s,6-log(1+e9)+n∑Ib-ul3 ba sgn(M [o(xg:0);(E(xg;v);0e)]+v). (8) 8y∈S i=l (4) 3.3 Learning where j=uuj and n>0 is a regularization parameter. In the transductive version of BGDH,the optimization vari- By substituting Eq.(3)into Eg.(4),we obtain the final loss ables include M,v.[bi},0r,0e,fei}and {we}.We adopt function of the supervised part: stochastic gradient descent(SGD)[Bottou,2010]to train our model. c=-∑ (s6-log(1+e9)】 First,we sample a batch of labeled instances of which set 8时∈S I contains indexes.A gradient step is then taken to optimize the supervised loss Cs.For all i Z1,bi can be directly +b:-(MT[(x:0):(e::0c)]+v)2 optimized as follow: i=1 (5) bi=sgn(ui)=sgn (MT [o(xi;0);u(e;:0e)]+v)(9) The objective of BGDH For other parameters in Cs,i.e.,M,v,0r,e,and fei:iE We combine the supervised part and unsupervised part to for- I1},we use back propagation(BP)to optimize them.Deriva- m the transductive version of BGDH.From(2)and(5),the tives of Cs w.r.t.ui are presented as follows: loss function of BGDH is OC.=1 1 C.+=->e-log(1+e0u)) Oui 2 ∑(a-s4+2∑(a-s叫 j:84y∈S j:8∈S 84∈S +2m(u-b) (10) +n∑lb,-(MP(x;0z(e:9e】+v)l where aij=a(uuj).We can then update other parame- i=1 ters according to +∑em OC (11) j=1 i=[(x;0)片(e:0e】 (6) where A >0 is a constant weighting factor.The first t- ot._OC. wo terms are the loss of predicting pairwise labels and the (12) third one is the loss of predicting context.As a result,our BGDH can simultaneously learn embeddings,features,and 8Cs OCs hash codes.During the training phase,semantic similarity 3o(x:0)Ov(ei;0e) =MOC, u (13) can affect graph embeddings,at the same time structure of Then,we perform a gradient step to optimize the unsuper- data also influence the prediction of pairwise labels. vised graph embedding loss Ca calculated by sampling triples 3.2 An Inductive Variant generated in Algorithm 1.In this case parameters fei,we (i,c)E Z2}will be updated where Z2 denotes the set con- Note that the basic BGDH is transductive,because the em- taining indexes of sampled instances and contexts.The above beddings of instances are learnt from the graph.Since the procedures are repeated for T1 and T2 iterations respectively. hash code of an instance depends on both the raw pixels and its embedding,we need to design a way to infer the embed- The whole learning algorithm of BGDH is summarized in ding of a unseen instance.To this end,we insert hidden layers Algorithm 2.Notice that before training jointly,we first train to connect the raw pixels and embedding [Yang et al.,2016] unsupervised part for a number of iterations to learn the ini- and in this way,the embedding e;becomes a parameterized tialization embeddings [e).For the inductive variant,we function of xi,denoted by (xi;).The loss function of will update parameters instead of embeddings [ei}. 3241
Table 1: Configuration of the feature learning network Layer Configuration conv1 filter 64×11×11, stride 4×4, pad 0, LRN, pool 2×2 conv2 filter 256×5×5, stride 1×1, pad 2, LRN, pool 2×2 conv3 filter 256×3×3, stride 1×1, pad 1 conv4 filter 256×3×3, stride 1×1, pad 1 conv5 filter 256×3×3, stride 1×1, pad 1, pool 2×2 full6 4096 full7 4096 choose the loss function of deep pairwise-supervised hashing (DPSH) [Li et al., 2016], and Ls is given by Ls = − X sij∈S sijΘij − log(1 + e Θij ) + η X l i=1 kbi − uik 2 2 (4) where Θij = 1 2 u T i uj and η > 0 is a regularization parameter. By substituting Eq. (3) into Eq. (4), we obtain the final loss function of the supervised part: Ls = − X sij∈S sijΘij − log(1 + e Θij ) + η X l i=1 bi − (MT [φ(xi ; θx); ψ(ei ; θe)] + v) 2 2 . (5) The objective of BGDH We combine the supervised part and unsupervised part to form the transductive version of BGDH. From (2) and (5), the loss function of BGDH is Ls + λLg = − X sij∈S sijΘij − log(1 + e Θij ) + η X l i=1 bi − (MT [φ(xi ; θx); ψ(ei ; θe)] + v) 2 2 + λ t Xt j=1 `(e > ijwcj , γj ) (6) where λ > 0 is a constant weighting factor. The first two terms are the loss of predicting pairwise labels and the third one is the loss of predicting context. As a result, our BGDH can simultaneously learn embeddings, features, and hash codes. During the training phase, semantic similarity can affect graph embeddings, at the same time structure of data also influence the prediction of pairwise labels. 3.2 An Inductive Variant Note that the basic BGDH is transductive, because the embeddings of instances are learnt from the graph. Since the hash code of an instance depends on both the raw pixels and its embedding, we need to design a way to infer the embedding of a unseen instance. To this end, we insert hidden layers to connect the raw pixels and embedding [Yang et al., 2016], and in this way, the embedding ei becomes a parameterized function of xi , denoted by ξ(xi ; ϑx). The loss function of inductive hashing model can be written as: Ls + λLg = − X sij∈S sijΘij − log(1 + e Θij ) + η X l i=1 bi − MT [φ(xi ; θx); ψ(ξ(xi ; ϑx); θe)] + v 2 2 + λ t Xt j=1 `(e > ijwcj , γj ) (7) We can predict the hash code of any point xq ∈ X/ as: bq = sgn(MT [φ(xq; θx); ψ(ξ(xq; ϑx); θe)] + v). (8) 3.3 Learning In the transductive version of BGDH, the optimization variables include M, v, {bi}, θx, θe, {ei} and {wc}. We adopt stochastic gradient descent (SGD) [Bottou, 2010] to train our model. First, we sample a batch of labeled instances of which set I1 contains indexes. A gradient step is then taken to optimize the supervised loss Ls. For all i ∈ I1, bi can be directly optimized as follow: bi = sgn(ui) = sgn MT [φ(xi ; θx); ψ(ei ; θe)] + v (9) For other parameters in Ls, i.e., M, v, θx, θe, and {ei : i ∈ I1}, we use back propagation (BP) to optimize them. Derivatives of Ls w. r. t. ui are presented as follows: ∂Ls ∂ui = 1 2 X j:sij∈S (aij − sij )uj + 1 2 X j:sji∈S (aji − sji)uj + 2η(ui − bi) (10) where aij = σ( 1 2 u T i uj ). We can then update other parameters according to ∂Ls ∂M = [φ(xi ; θx); ψ(ei ; θe)] ∂Ls ∂ui T , (11) ∂Ls ∂v = ∂Ls ∂ui , (12) ∂Ls ∂φ(xi ; θx) = ∂Ls ∂ψ(ei ; θe) = M ∂Ls ∂ui . (13) Then, we perform a gradient step to optimize the unsupervised graph embedding loss Lg calculated by sampling triples generated in Algorithm 1. In this case parameters {ei , wc : (i, c) ∈ I2} will be updated where I2 denotes the set containing indexes of sampled instances and contexts. The above procedures are repeated for T1 and T2 iterations respectively. The whole learning algorithm of BGDH is summarized in Algorithm 2. Notice that before training jointly, we first train unsupervised part for a number of iterations to learn the initialization embeddings {ei}. For the inductive variant, we will update parameters ϑx instead of embeddings {ei}. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 3241
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence(IJCAI-17) Algorithm 2 Learning algorithm for BGDH [Lin et al.,20141.SDH [Shen et al.,2015]and COSDISH 1:Input:A bipartite graph G=(.O.)images [Kang et al.,2016];supervised deep hashing methods:DPSH {xi}是l,pairwise labels S={si},parametersn,入, [Li et al.,2016],DHN [Zhu et al.,2016]and DSH [Liu et al., batch iterations T1,T and sizes N,N2 2016];semi-supervised deep methods SSDH [Zhang et al., 2:Output:Binary codes B=(bi} 2016]. 3:Initialization:Initialize 0r with the pre-trained CNN-F For hashing methods using hand-crafted features,we rep- model on ImageNet;Initialize each entry of M,v,and e resent each image in CIFAR-10 by a 512-dimensional GIST by randomly sampling from a Gaussian distribution with vector.In NUS-WIDE,an image is represented by a 1134- mean 0 and variance 0.01. dimensional feature vector,including 500-D bag-of-words 4:REPEAT features,64-D color histogram,144-D color correlogram,73- 5:fort←1toT1do D edge direction histogram,128-D wavelet texture,225-D block-wise color moments.For deep methods,we use the 6: Randomly sample Ni labeled images,and let ZI be the set containing indexes of sampled instances raw image pixels as input.The network architectures of these 7: Calculate o(xi;0)and (ei:0e)for all i E I by deep models are different from each other,for fair compar- forward propagation ison,we adopt the same deep network architecture CNN-F 8: Compute ui M[o(xi;0);(e:;0e)]+v and the with the same initialization parameters pre-trained on Im- binary code of xi with bi=sgn(ui)for all i Z ageNet [Deng et al.,2009]for all deep hashing methods. 9: Compute the derivative of Cs w.r.t.fui:iZ1} The bipartite graph of BGDH is constructed based on hand- 10: Update parameters M,v,0,0e,and [e;:iE T1} crafted features with heat kemnel,where the hyper-parameter by back propagation p is set as 1 for CIFAR-10 and 10 for NUS-WIDE.The hyper- 11:end for parameter n in BGDH is set as 10 for CIFAR-10 and 100 for 12:fort←-1toT2do NUS-WIDE similar to DPSH [Li et al.,2016].We simply set 13: Randomly generate a batch of triples by invoking Al- T1 =10,T2 =5,0.1 in all the experiments. gorithm 1 N2 times,and let Z2 be the set containing in- 4.2 Experimental Results dexes of sampled instances and contexts 14: Compute the derivative of o w.r.t.e;,we:(i,c)E To evaluate the retrieval accuracy,we use the Mean Average Z Precision (MAP)as the evaluation metric.Following Lai et 15: Update parameters [ei,we:(i,c)EZ2) al.,2015;Xia et al.,2014],1000 images (100 images per 16:end for class)are randomly selected as the query set in CIFAR-10. 17:UNTIL stopping We use all the rest of images as training set for the unsu- 18:Calculate bi sgn (MT[o(xi;0);(e:;0e)]+v)for pervised methods and for building a neighbour graph g in i=1,...,n BGDH.We randomly select 5000(500 images per class)im- ages as the training set for supervised methods. In NUS-WIDE,we randomly sample 2100 query images 4 Experiments from 21 most frequent labels following the strategy in [Lai et al.,2015;Xia et al.,2014].We use all the rest of images as In this section,we present experimental results.All the ex- training set for the unsupervised methods.To reduce the time periments are performed on a NVIDIA K80 GPU server with complexity,we only use 5000 landmarks sampled randomly MatConvNet [Vedaldi and Lenc,2014]. to build a bipartite graph for BGDH.For supervised methods, we randomly select 10500 images from dataset.The MAP 4.1 Datasets and Setting values are calculated within the top 5000 returned neighbors We conduct experiments on two widely used benchmark We present the MAP results in Table 2 where BGDH.SS- datasets:CIFAR-10 and NUS-WIDE.The CIFAR-10 dataset DH,DSH,DHN,DPSH are deep methods.Except for SSDH consists of 60,000 images from 10 classes (6000 images per which uses triplet labels,all deep methods are trained with class).It is a single-label dataset.The NUS-WIDE dataset2 pairwise labels.To be fair,parameters of these methods are contains 269,648 images from Flickr.Following the settings set according to the suggestions of authors.The results show in [Lai et al.,2015],we use 19,5834 images belonging to the that in most cases our proposed BGDH method substantially 21 most frequent classes and each class consists at least 5000 outperforms other baselines including unsupervised methods, images.Additionally,two images are defined as a ground- supervised methods and semi-supervised methods truth neighbor when they share at least one common label. In real applications,the number of labeled instances is usu- In our experiments,BGDH-T denotes transductive mod- ally far less than that of unlabeled instances.To further verify el while BGDH-I denotes inductive one.We compare our the effectiveness of leveraging both pairwise labels and un- method with several state-of-the-art methods including unsu- label data.we reduce the number of labeled instances while pervised methods:ITQ [Gong et al.,2013],LSH [Gionis et other settings remain the same.For supervised learning.in al.,1999],IsoH [Kong and Li,2012]and SpH [Heo et al., CIFAR-10,we randomly select 2500(250 images per class) 2012];supervised methods:LFH [Zhang et al.,2014],FastH images and in NUS-WIDE,we randomly select 5000 images. The MAP results with this experiment setting are listed in Ta- https://www.cs.toronto.edu/kriz/cifar.html ble 3.Note that the results of unlisted unsupervised methods http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm are the same as those in Table 2.We observe that our BGDH 3242
Algorithm 2 Learning algorithm for BGDH 1: Input: A bipartite graph G = (X , O, E), images X = {xi} n i=1, pairwise labels S = {sij}, parameters η, λ, batch iterations T1, T2 and sizes N1, N2 2: Output: Binary codes B = {bi} n i=1 3: Initialization: Initialize θx with the pre-trained CNN-F model on ImageNet; Initialize each entry of M, v, and θe by randomly sampling from a Gaussian distribution with mean 0 and variance 0.01. 4: REPEAT 5: for t ← 1 to T1 do 6: Randomly sample N1 labeled images, and let I1 be the set containing indexes of sampled instances 7: Calculate φ(xi ; θx) and ψ(ei ; θe) for all i ∈ I1 by forward propagation 8: Compute ui = MT [φ(xi ; θx); ψ(ei ; θe)] + v and the binary code of xi with bi = sgn(ui) for all i ∈ I1 9: Compute the derivative of Ls w. r. t. {ui : i ∈ I1} 10: Update parameters M, v, θx, θe, and {ei : i ∈ I1} by back propagation 11: end for 12: for t ← 1 to T2 do 13: Randomly generate a batch of triples by invoking Algorithm 1 N2 times, and let I2 be the set containing indexes of sampled instances and contexts 14: Compute the derivative of Lg w. r. t. {ei , wc : (i, c) ∈ I2} 15: Update parameters {ei , wc : (i, c) ∈ I2} 16: end for 17: UNTIL stopping 18: Calculate bi = sgn MT [φ(xi ; θx); ψ(ei ; θe)] + v for i = 1, . . . , n 4 Experiments In this section, we present experimental results. All the experiments are performed on a NVIDIA K80 GPU server with MatConvNet [Vedaldi and Lenc, 2014]. 4.1 Datasets and Setting We conduct experiments on two widely used benchmark datasets: CIFAR-10 and NUS-WIDE. The CIFAR-10 dataset1 consists of 60, 000 images from 10 classes (6000 images per class). It is a single-label dataset. The NUS-WIDE dataset2 contains 269, 648 images from Flickr. Following the settings in [Lai et al., 2015], we use 19,5834 images belonging to the 21 most frequent classes and each class consists at least 5000 images. Additionally, two images are defined as a groundtruth neighbor when they share at least one common label. In our experiments, BGDH-T denotes transductive model while BGDH-I denotes inductive one. We compare our method with several state-of-the-art methods including unsupervised methods: ITQ [Gong et al., 2013], LSH [Gionis et al., 1999], IsoH [Kong and Li, 2012] and SpH [Heo et al., 2012]; supervised methods: LFH [Zhang et al., 2014], FastH 1 https://www.cs.toronto.edu/ kriz/cifar.html 2 http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm [Lin et al., 2014], SDH [Shen et al., 2015] and COSDISH [Kang et al., 2016]; supervised deep hashing methods: DPSH [Li et al., 2016], DHN [Zhu et al., 2016] and DSH [Liu et al., 2016]; semi-supervised deep methods SSDH [Zhang et al., 2016]. For hashing methods using hand-crafted features, we represent each image in CIFAR-10 by a 512-dimensional GIST vector. In NUS-WIDE, an image is represented by a 1134- dimensional feature vector, including 500-D bag-of-words features, 64-D color histogram, 144-D color correlogram, 73- D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments. For deep methods, we use the raw image pixels as input. The network architectures of these deep models are different from each other, for fair comparison, we adopt the same deep network architecture CNN-F with the same initialization parameters pre-trained on ImageNet [Deng et al., 2009] for all deep hashing methods. The bipartite graph of BGDH is constructed based on handcrafted features with heat kernel, where the hyper-parameter ρ is set as 1 for CIFAR-10 and 10 for NUS-WIDE. The hyperparameter η in BGDH is set as 10 for CIFAR-10 and 100 for NUS-WIDE similar to DPSH [Li et al., 2016]. We simply set T1 = 10, T2 = 5, λ = 0.1 in all the experiments. 4.2 Experimental Results To evaluate the retrieval accuracy, we use the Mean Average Precision (MAP) as the evaluation metric. Following [Lai et al., 2015; Xia et al., 2014], 1000 images (100 images per class) are randomly selected as the query set in CIFAR-10. We use all the rest of images as training set for the unsupervised methods and for building a neighbour graph G in BGDH. We randomly select 5000 (500 images per class) images as the training set for supervised methods. In NUS-WIDE, we randomly sample 2100 query images from 21 most frequent labels following the strategy in [Lai et al., 2015; Xia et al., 2014]. We use all the rest of images as training set for the unsupervised methods. To reduce the time complexity, we only use 5000 landmarks sampled randomly to build a bipartite graph for BGDH. For supervised methods, we randomly select 10500 images from dataset. The MAP values are calculated within the top 5000 returned neighbors. We present the MAP results in Table 2 where BGDH, SSDH, DSH, DHN, DPSH are deep methods. Except for SSDH which uses triplet labels, all deep methods are trained with pairwise labels. To be fair, parameters of these methods are set according to the suggestions of authors. The results show that in most cases our proposed BGDH method substantially outperforms other baselines including unsupervised methods, supervised methods and semi-supervised methods. In real applications, the number of labeled instances is usually far less than that of unlabeled instances. To further verify the effectiveness of leveraging both pairwise labels and unlabel data, we reduce the number of labeled instances while other settings remain the same. For supervised learning, in CIFAR-10, we randomly select 2500 (250 images per class) images and in NUS-WIDE, we randomly select 5000 images. The MAP results with this experiment setting are listed in Table 3. Note that the results of unlisted unsupervised methods are the same as those in Table 2. We observe that our BGDH Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 3242
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence(IJCAI-17) Table 2:Accuracy in terms of MAP.The best MAPs for each category are shown in boldface.Training size for supervised method is 5000 for CIFAR-10 and 10500 for NUS-WIDE. Method CIFAR-10 (MAP) NUS-WIDE (MAP) 12-bits 24-bits 32-bits 48-bits 12-bits 24-bits 32-bits 48-bits BGDH-T 0.805 0.824 0.826 0.833 0.803 0.818 0.822 0.828 BGDH-I 0.803 0.818 0.822 0.829 0.801 0.815 0.816 0.825 SSDH 0.801 0.813 0.812 0.814 0.773 0.779 0.778 0.778 DSH 0.604 0.746 0.781 0.810 0.751 0.765 0.767 0.773 DHN 0.692 0.703 0.726 0.735 0.751 0.785 0.792 0.799 DPSH 0.684 0.734 0.750 0.767 0.788 0.809 0.817 0.823 COSDISH 0.522 0.590 0.599 0.615 0.691 0.749 0.745 0.765 SDH 0.525 0.671 0.686 0.696 0.752 0.745 0.744 0.730 FastH 0.291 0.351 0.367 0.390 0.622 0.660 0.670 0.687 LFH 0.335 0.433 0.509 0.515 0.749 0.751 0.775 0.780 ITQ 0.163 0.170 0.173 0.176 0.447 0.465 0.468 0.473 LSH 0.152 0.167 0.170 0.200 0.367 0.394 0.413 0.416 IsoH 0.158 0.162 0.166 0.169 0.436 0.454 0.461 0.465 SpH 0.141 0.153 0.154 0.158 0.399 0.437 0.454 0.465 Table 3:Accuracy in terms of MAP.The best MAPs for each category are shown in boldface. Training size for supervised method is 2500 for CIFAR-10 and 5000 for NUS-WIDE Method CIFAR-10(MAP) NUS-WIDE (MAP) 12-bits 24-bits 32-bits 48-bits 12-bits 24-bits 32-bits 48-bits BGDH-T 0.755 0.791 0.800 0.812 0.772 0.798 0.806 0.816 BGDH-I 0.746 0.776 0.787 0.796 0.768 0.794 0.801 0.811 SSDH 0.581 0.589 0.595 0.596 0.743 0.745 0.746 0.749 DSH 0.617 0.707 0.737 0.761 0.749 0.769 0.771 0.786 DHN 0.591 0.646 0.640 0.662 0.741 0.763 0.766 0.773 DPSH 0.576 0.634 0.642 0.668 0.762 0.789 0.791 0.803 COSDISH 0.312 0.348 0.373 0398 0.648 0.678 0.699 0.713 SDH 0.327 0.357 0.374 0.377 0.574 0.597 0.591 0.595 FastH 0.267 0.298 0.320 0.341 0.604 0.634 0.650 0.667 LFH 0.244 0.288 0.311 0.391 0.611 0.644 0.653 0.669 outperforms all the other methods.Specifically,compared to 5 Conclusion the best baseline in Table 2.we conclude that when labeled data are insufficient,BGDH is able to leverage unlabeled data In this paper,we propose a novel semi-supervised hash- to deliver a good result ing method,named Deep Hashing with a Bipartite Graph (BGDH).To the best of our knowledge,BGDH is the first method that performs graph embedding,feature learning,and 4.3 Parameter Selection hash code learning simultaneously.BGDH constructs a bi- In BGDH,there is a hyper-parameter A which controls the partite graph to discover the underlying structure of data,and is much more efficient than methods based on neighborhood tradeoff between supervised loss and unsupervised loss.Fig- graph.Experimental results demonstrate that BGDH outper- ure 2 displays the impacts of A on the performance of BGDH with the experiment settings being the same as those in Table forms state-of-the-art methods in image retrieval. 3.As can be seen,there is a wide range of A that BGDH per- forms well.Thus,to a large extent,BGDH is insensitive to Acknowledgements A and the parameter selection is not a crucial problem in our This work was partially supported by the NSFC(61603177, algorithm.Additionally,by comparing the MAP ofA=0 61472182),JiangsuSF (BK20160658),and the Collaborative and A=1,we verify the importance of graph embedding. Innovation Center of Novel Software Technology and Indus- trialization of Nanjing University. References [Bottou,2010]Leon Bottou.Large-scale machine learning with stochastic gradient descent.In Proceedings of 19th Interna- 103020 103102101 10 tional Conference on Computational Statistics,pages 177-186. (a)CIFAR-10 (b)NUS-WIDE Springer,2010. Figure 2:Hyper-parameter Sensitivity [Chatfield et al..2014]Ken Chatfield,Karen Simonyan,Andrea Vedaldi,and Andrew Zisserman.Return of the devil in the de- 3243
Table 2: Accuracy in terms of MAP. The best MAPs for each category are shown in boldface. Training size for supervised method is 5000 for CIFAR-10 and 10500 for NUS-WIDE. Method CIFAR-10 (MAP) NUS-WIDE (MAP) 12-bits 24-bits 32-bits 48-bits 12-bits 24-bits 32-bits 48-bits BGDH-T 0.805 0.824 0.826 0.833 0.803 0.818 0.822 0.828 BGDH-I 0.803 0.818 0.822 0.829 0.801 0.815 0.816 0.825 SSDH 0.801 0.813 0.812 0.814 0.773 0.779 0.778 0.778 DSH 0.604 0.746 0.781 0.810 0.751 0.765 0.767 0.773 DHN 0.692 0.703 0.726 0.735 0.751 0.785 0.792 0.799 DPSH 0.684 0.734 0.750 0.767 0.788 0.809 0.817 0.823 COSDISH 0.522 0.590 0.599 0.615 0.691 0.749 0.745 0.765 SDH 0.525 0.671 0.686 0.696 0.752 0.745 0.744 0.730 FastH 0.291 0.351 0.367 0.390 0.622 0.660 0.670 0.687 LFH 0.335 0.433 0.509 0.515 0.749 0.751 0.775 0.780 ITQ 0.163 0.170 0.173 0.176 0.447 0.465 0.468 0.473 LSH 0.152 0.167 0.170 0.200 0.367 0.394 0.413 0.416 IsoH 0.158 0.162 0.166 0.169 0.436 0.454 0.461 0.465 SpH 0.141 0.153 0.154 0.158 0.399 0.437 0.454 0.465 Table 3: Accuracy in terms of MAP. The best MAPs for each category are shown in boldface. Training size for supervised method is 2500 for CIFAR-10 and 5000 for NUS-WIDE. Method CIFAR-10 (MAP) NUS-WIDE (MAP) 12-bits 24-bits 32-bits 48-bits 12-bits 24-bits 32-bits 48-bits BGDH-T 0.755 0.791 0.800 0.812 0.772 0.798 0.806 0.816 BGDH-I 0.746 0.776 0.787 0.796 0.768 0.794 0.801 0.811 SSDH 0.581 0.589 0.595 0.596 0.743 0.745 0.746 0.749 DSH 0.617 0.707 0.737 0.761 0.749 0.769 0.771 0.786 DHN 0.591 0.646 0.640 0.662 0.741 0.763 0.766 0.773 DPSH 0.576 0.634 0.642 0.668 0.762 0.789 0.791 0.803 COSDISH 0.312 0.348 0.373 0.398 0.648 0.678 0.699 0.713 SDH 0.327 0.357 0.374 0.377 0.574 0.597 0.591 0.595 FastH 0.267 0.298 0.320 0.341 0.604 0.634 0.650 0.667 LFH 0.244 0.288 0.311 0.391 0.611 0.644 0.653 0.669 outperforms all the other methods. Specifically, compared to the best baseline in Table 2, we conclude that when labeled data are insufficient, BGDH is able to leverage unlabeled data to deliver a good result. 4.3 Parameter Selection In BGDH, there is a hyper-parameter λ which controls the tradeoff between supervised loss and unsupervised loss. Figure 2 displays the impacts of λ on the performance of BGDH with the experiment settings being the same as those in Table 3. As can be seen, there is a wide range of λ that BGDH performs well. Thus, to a large extent, BGDH is insensitive to λ and the parameter selection is not a crucial problem in our algorithm. Additionally, by comparing the MAP of λ = 0 and λ = 1, we verify the importance of graph embedding. 0 10-3 10-2 10-1 100 0.5 0.6 0.7 0.8 0.9 MAP 48bits 12bits (a) CIFAR-10 0 10-3 10-2 10-1 100 0.74 0.76 0.78 0.8 MAP 48bits 12bits (b) NUS-WIDE Figure 2: Hyper-parameter Sensitivity 5 Conclusion In this paper, we propose a novel semi-supervised hashing method, named Deep Hashing with a Bipartite Graph (BGDH). To the best of our knowledge, BGDH is the first method that performs graph embedding, feature learning, and hash code learning simultaneously. BGDH constructs a bipartite graph to discover the underlying structure of data, and is much more efficient than methods based on neighborhood graph. Experimental results demonstrate that BGDH outperforms state-of-the-art methods in image retrieval. Acknowledgements This work was partially supported by the NSFC (61603177, 61472182), JiangsuSF (BK20160658), and the Collaborative Innovation Center of Novel Software Technology and Industrialization of Nanjing University. References [Bottou, 2010] Leon Bottou. Large-scale machine learning with ´ stochastic gradient descent. In Proceedings of 19th International Conference on Computational Statistics, pages 177–186. Springer, 2010. [Chatfield et al., 2014] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the deProceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 3243
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence(IJCAI-17) tails:Delving deep into convolutional nets.In Proceedings of [Liu et al.,2011]Wei Liu,Jun Wang.Sanjiv Kumar,and Shih-Fu British Machine Vision Conference,2014. Chang.Hashing with graphs.In Proceedings of the 28th Inter [Deng et al.,2009]Jia Deng,Wei Dong,Richard Socher,Li-Jia Li, national Conference on Machine Learning,pages 1-8,2011. Kai Li,and Li Fei-Fei.Imagenet:A large-scale hierarchical im- [Liu et al.,2016]Haomiao Liu,Ruiping Wang,Shiguang Shan,and age database.In Proceedings of the IEEE Conference on Com- Xilin Chen.Deep supervised hashing for fast image retrieval.In puter Vision and Pattern Recognition,pages 248-255,2009. Proceedings of IEEE Conference on Computer Vision and Pat- tern Recognition,pages 2064-2072,2016. Gionis et al..1999]Aristides Gionis,Piotr Indyk.Rajeev Mot- wani,et al.Similarity search in high dimensions via hashing. [Mikolov et al.,2013]Tomas Mikolov,Ilya Sutskever,Kai Chen, In Proceedings of 25th International Conference on Very Large Greg S Corrado,and Jeff Dean.Distributed representations of Data Bases,volume 99,pages 518-529,1999. words and phrases and their compositionality.In Advances in Neural Information Processing Systems,pages 3111-3119,2013. [Gong et al.,2013]Yunchao Gong,Svetlana Lazebnik,Albert Gor- Oliva and Torralba.2001]Aude Oliva and Antonio Torralba.Mod- do,and Florent Perronnin.Iterative quantization:A procrustean eling the shape of the scene:A holistic representation of the approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelli- spatial envelope.International Journal of Computer Vision, 42(3):145-175,2001. 8ence,35(12):2916-2929.2013 [Perozzi et al.,2014]Bryan Perozzi,Rami Al-Rfou,and Steven [Heo et al.,2012]Jae-Pil Heo,Youngwoon Lee,Junfeng He,Shih- Skiena.Deepwalk:online learning of social representations. Fu Chang,and Sung-Eui Yoon.Spherical hashing.In Proceed- In Proceedings of the 20th ACM International Conference on ings of the IEEE Conference on Computer Vision and Pattern Knowledge Discovery and Data Mining,pages 701-710,2014. Recognition,pages 2957-2964,2012. [Shen et al.,2015]Fumin Shen,Chunhua Shen,Wei Liu,and Heng [Indyk and Motwani,1998]Piotr Indyk and Rajeev Motwani.Ap- Tao Shen.Supervised discrete hashing.In Proceedings of the proximate nearest neighbors:towards removing the curse of di- IEEE Conference on Computer Vision and Pattern Recognition, mensionality.In Proceedings of the thirtieth annual ACM sym- pages37-45,2015. posium on Theory of computing,pages 604-613.1998. [Vedaldi and Lenc,2014]Andrea Vedaldi and Karel Lenc.Matcon- [Kang et al.,2016]Wang-Cheng Kang.Wu-Jun Li,and Zhi-Hua vnet-convolutional neural networks for MATLAB.CoRR,ab- Zhou.Column sampling based discrete supervised hashing.In s/1412.4564,2014 Proceedings of the Thirtieth AAAl Conference on Artificial Intel- [Wang et al.,2012]Jun Wang,Sanjiv Kumar,and Shih-Fu Chang. ligence,pages 1230-1236,2016. Semi-supervised hashing for large-scale search. IEEE [Kong and Li,2012]Weihao Kong and Wu-Jun Li.Isotropic hash- Transactions on Pattern Analysis and Machine Intelligence, ing.In Advances in Neural Information Processing Systems, 34(12):2393-2406.2012. pages1646-1654,2012. [Weiss et al.,2009]Yair Weiss,Antonio Torralba,and Rob Fergus [Krizhevsky and Hinton,2009]Alex Krizhevsky and Geoffrey Spectral hashing.In Advances in Neural Information Processing Systems,pages 1753-1760.2009. Hinton.Learning multiple layers of features from tiny images 2009. [Weston et al..2012]Jason Weston.Frederic Ratle.Hossein Mobahi,and Ronan Collobert. Deep learning via semi- [Krizhevsky et al.,2012]Alex Krizhevsky,Ilya Sutskever,and Ge- supervised embedding.In Neural Networks:Tricks of the Trade, offrey E Hinton.Imagenet classification with deep convolutional pages 639-655.Springer,2012. neural networks.In Advances in Neural Information Processing Systems,pages 1097-1105.2012. [Xia et al.,2014]Rongkai Xia,Yan Pan,Hanjiang Lai,Cong Liu, and Shuicheng Yan.Supervised hashing for image retrieval via [Lai et al.,2015]Hanjiang Lai,Yan Pan,Ye Liu,and Shuicheng image representation learning.In Proceedings of AAAl Confer- Yan.Simultaneous feature learning and hash coding with deep ence on Artificial Intelligence,pages 2156-2162,2014. neural networks.In Proceedings of the IEEE Conference on [Yang et al,2016]Zhilin Yang.William W.Cohen,and Ruslan Computer Vision and Pattern Recognition,pages 3270-3278, Salakhutdinov.Revisiting semi-supervised learning with graph 2015. embeddings.In Proceedings of the 33nd International Confer- [Lee and Seung,1999]Daniel D.Lee and H.Sebastian Seung. ence on Machine Learning,pages 40-48.2016. Learning the parts of objects by non-negative matrix factoriza- [Zhang et al.,2014]Peichao Zhang.Wei Zhang.Wu-Jun Li,and tion.Nature,.401(6755):788-791.1999. Minyi Guo.Supervised hashing with latent factor models.In Pro- [Li et al.,2016]Wu-Jun Li,Sheng Wang.and Wang-Cheng Kang. ceedings of the 37th International ACM Conference on Research Feature learning based deep supervised hashing with pairwise la- and Development in Information Retrieval,pages 173-182,2014. bels.In Proceedings of the Twenty-Fifth International Joint Con- Zhang et al.,20161 Jian Zhang,Yuxin Peng,and Junchao Zhang. ference on Artificial Intelligence,pages 1711-1717,2016. Ssdh:semi-supervised deep hashing for large scale image re- [Lin et al.,2014]Guosheng Lin,Chunhua Shen,Qinfeng Shi,An- trieval.arXiv preprint arXiv:1607.08477,2016. ton van den Hengel,and David Suter.Fast supervised hashing [Zhou et al.,2004]Dengyong Zhou,Olivier Bousquet with decision trees for high-dimensional data.In Proceedings of Thomas Navin Lal,Jason Weston,and Bernhard Scholkopf. the IEEE Conference on Computer Vision and Pattern Recogni- Learning with local and global consistency.In Advances in tion,pages1963-1970,2014. Neural Information Processing Systems,pages 321-328,2004. [Liu et al.,2010]Wei Liu,Junfeng He,and Shih-Fu Chang.Large [Zhu et al.,2016]Han Zhu,Mingsheng Long,Jianmin Wang,and graph construction for scalable semi-supervised learning.In Pro- Yue Cao.Deep hashing network for efficient similarity retrieval. ceedings of the 27th International Conference on Machine Learn- In Proceedings of the Thirtieth AAAl Conference on Artificial In- ing,pages679-686.2010. telligence,pages 2415-2421,2016. 3244
tails: Delving deep into convolutional nets. In Proceedings of British Machine Vision Conference, 2014. [Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. [Gionis et al., 1999] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Proceedings of 25th International Conference on Very Large Data Bases, volume 99, pages 518–529, 1999. [Gong et al., 2013] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013. [Heo et al., 2012] Jae-Pil Heo, Youngwoon Lee, Junfeng He, ShihFu Chang, and Sung-Eui Yoon. Spherical hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2957–2964, 2012. [Indyk and Motwani, 1998] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998. [Kang et al., 2016] Wang-Cheng Kang, Wu-Jun Li, and Zhi-Hua Zhou. Column sampling based discrete supervised hashing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1230–1236, 2016. [Kong and Li, 2012] Weihao Kong and Wu-Jun Li. Isotropic hashing. In Advances in Neural Information Processing Systems, pages 1646–1654, 2012. [Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009. [Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. 2012. [Lai et al., 2015] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3270–3278, 2015. [Lee and Seung, 1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999. [Li et al., 2016] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1711–1717, 2016. [Lin et al., 2014] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and David Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1963–1970, 2014. [Liu et al., 2010] Wei Liu, Junfeng He, and Shih-Fu Chang. Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th International Conference on Machine Learning, pages 679–686, 2010. [Liu et al., 2011] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning, pages 1–8, 2011. [Liu et al., 2016] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2064–2072, 2016. [Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013. [Oliva and Torralba, 2001] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001. [Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: online learning of social representations. In Proceedings of the 20th ACM International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014. [Shen et al., 2015] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 37–45, 2015. [Vedaldi and Lenc, 2014] Andrea Vedaldi and Karel Lenc. Matconvnet - convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014. [Wang et al., 2012] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2393–2406, 2012. [Weiss et al., 2009] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances in Neural Information Processing Systems, pages 1753–1760, 2009. [Weston et al., 2012] Jason Weston, Fred´ eric Ratle, Hossein ´ Mobahi, and Ronan Collobert. Deep learning via semisupervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012. [Xia et al., 2014] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In Proceedings of AAAI Conference on Artificial Intelligence, pages 2156–2162, 2014. [Yang et al., 2016] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33nd International Conference on Machine Learning, pages 40–48, 2016. [Zhang et al., 2014] Peichao Zhang, Wei Zhang, Wu-Jun Li, and Minyi Guo. Supervised hashing with latent factor models. In Proceedings of the 37th International ACM Conference on Research and Development in Information Retrieval, pages 173–182, 2014. [Zhang et al., 2016] Jian Zhang, Yuxin Peng, and Junchao Zhang. Ssdh: semi-supervised deep hashing for large scale image retrieval. arXiv preprint arXiv:1607.08477, 2016. [Zhou et al., 2004] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Scholkopf. ¨ Learning with local and global consistency. In Advances in Neural Information Processing Systems, pages 321–328, 2004. [Zhu et al., 2016] Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2415–2421, 2016. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 3244