Deep Cross-Modal Hashing

Qing-Yuan Jiang and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, P. R. China
jiangqy@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications. However, most existing CMH methods are based on hand-crafted features, which might not be optimally compatible with the hash-code learning procedure. As a result, existing CMH methods with hand-crafted features may not achieve satisfactory performance. In this paper, we propose a novel CMH method, called deep cross-modal hashing (DCMH), which integrates feature learning and hash-code learning into the same framework. DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch. Experiments on three real datasets with image-text modalities show that DCMH outperforms other baselines and achieves state-of-the-art performance in cross-modal retrieval applications.

1. Introduction

Approximate nearest neighbor (ANN) search [1] plays a fundamental role in machine learning and related applications like information retrieval. Due to its low storage cost and fast retrieval speed, hashing has recently attracted much attention from the ANN research community [17, 34, 9, 15, 26, 10, 29, 21, 28, 4]. The goal of hashing is to map the data points from the original space into a Hamming space of binary codes where the similarity in the original space is preserved in the Hamming space. By using binary hash codes to represent the original data, the storage cost can be dramatically reduced. Furthermore, we can achieve constant or sub-linear time complexity for search by using hash codes to construct an index [15]. Hence, hashing has become more and more popular for ANN search in large-scale datasets.

In many applications, the data can have multiple modalities. For example, besides the image content, there is also text information, such as tags, for the images in Flickr and many other social websites. This kind of data is often called multi-modal data. With the rapid growth of multi-modal data in real applications, especially multimedia applications, multi-modal hashing (MMH) has recently been widely used for ANN search (retrieval) on multi-modal datasets.

Existing MMH methods can be divided into two main categories: multi-source hashing (MSH) [30, 36, 32, 14] and cross-modal hashing (CMH) [18, 35, 7, 22, 3]. The goal of MSH is to learn hash codes by utilizing all the information from multiple modalities. Hence, MSH requires that all the modalities be observed for all data points, including query points and those in the database. In practice, the application of MSH is limited because in many cases it is difficult to acquire all modalities of all data points. In contrast, the application scenarios of CMH are more flexible. In CMH, the modality of a query point is different from the modality of the points in the database. Furthermore, the query point typically has only one modality, while the points in the database can have one or more modalities. For example, we can use text queries to retrieve images in the database, and we can also use image queries to retrieve texts in the database. Due to its wide applicability, CMH has gained more attention than MSH.
Many CMH methods have recently been proposed. Representative methods include cross-modality similarity-sensitive hashing (CMSSH) [2], cross-view hashing (CVH) [18], multi-modal latent binary embedding (MLBE) [39], co-regularized hashing (CRH) [38], semantic correlation maximization (SCM) [35], collective matrix factorization hashing (CMFH) [7], semantic topic multi-modal hashing (STMH) [33] and semantics-preserving hashing (SePH) [22]. Almost all of these existing CMH methods are based on hand-crafted features. One shortcoming of these hand-crafted-feature-based methods is that the feature extraction procedure is independent of the hash-code learning procedure, which means that the hand-crafted features might not be optimally compatible
with the hash-code learning procedure. Hence, these existing CMH methods with hand-crafted features may not achieve satisfactory performance in real applications.

Recently, deep learning with neural networks [19, 16] has been widely used to perform feature learning from scratch with promising performance. There also exist some methods which adopt deep learning for uni-modal hashing [37, 23, 20, 40, 24]. These methods show that an end-to-end deep learning architecture is more compatible with hash-code learning. For the CMH setting, there is also one method, called deep visual-semantic hashing (DVSH) [3], which uses deep neural networks for feature learning¹. However, DVSH can only be used for a special CMH case where one of the modalities has to be temporal dynamics.

¹The first version of our DCMH method was submitted to arXiv [13] before this CVPR submission, and thus appeared earlier than DVSH in the public literature.

In this paper, we propose a novel CMH method, called deep cross-modal hashing (DCMH), for cross-modal retrieval applications. The main contributions of DCMH are outlined as follows:

∙ DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch.

∙ The hash-code learning problem is essentially a discrete learning problem, which is difficult to solve. Hence, most existing CMH methods handle this problem by relaxing the original discrete learning problem into a continuous learning problem. This relaxation procedure may deteriorate the accuracy of the learned hash codes [25]. Unlike these relaxation-based methods, DCMH directly learns the discrete hash codes without relaxation.

∙ Experiments on real datasets with image-text modalities show that DCMH can outperform other baselines and achieve state-of-the-art performance in cross-modal retrieval applications.

The rest of this paper is organized as follows. Section 2 introduces the problem definition of this paper. We present our DCMH method in Section 3, including the model formulation and the learning algorithm. Experiments are shown in Section 4. Finally, we conclude our work in Section 5.

2. Problem Definition

2.1. Notation

Boldface lowercase letters like w are used to denote vectors. Boldface uppercase letters like W are used to denote matrices, and the element in the ith row and jth column of W is denoted as W_ij. The ith row of W is denoted as W_{i*}, and the jth column of W is denoted as W_{*j}. W^T is the transpose of W. We use 1 to denote a vector with all elements being 1. tr(·) and ∥·∥_F denote the trace and the Frobenius norm of a matrix, respectively. sign(·) is an element-wise sign function defined as follows:

    sign(x) = +1 if x ≥ 0, and −1 if x < 0.

2.2. Cross-Modal Hashing

Although the method proposed in this paper can easily be adapted to cases with more than two modalities, we focus only on the case with two modalities here.

Assume that we have n training entities (data points), each of which has two modalities of features. Without loss of generality, we use image-text datasets for illustration in this paper, which means that each training point has both a text modality and an image modality. We use X = {x_i}_{i=1}^{n} to denote the image modality, where x_i can be the hand-crafted features or the raw pixels of image i. Moreover, we use Y = {y_i}_{i=1}^{n} to denote the text modality, where y_i is typically the tag information related to image i. In addition, we are also given a cross-modal similarity matrix S.
S_ij = 1 if image x_i and text y_j are similar, and S_ij = 0 otherwise. Here, the similarity is typically defined by some semantic information such as class labels. For example, we can say that image x_i and text y_j are similar if they share the same class label; otherwise, image x_i and text y_j are dissimilar if they are from different classes.

Given the above training information X, Y and S, the goal of cross-modal hashing is to learn two hash functions for the two modalities: h^(x)(x) ∈ {−1, +1}^c for the image modality and h^(y)(y) ∈ {−1, +1}^c for the text modality, where c is the length of the binary code. These two hash functions should preserve the cross-modal similarity in S. More specifically, if S_ij = 1, the Hamming distance between the binary codes b_i^(x) = h^(x)(x_i) and b_j^(y) = h^(y)(y_j) should be small. Otherwise, if S_ij = 0, the corresponding Hamming distance should be large.

Here, we assume that both modalities of features are observed for each point in the training set, although our method can also easily be adapted to settings where some training points have only one observed modality. Please note that we make this assumption only for training points. After we have trained the model, we can use the learned model to generate hash codes for query and database points of either one modality or two modalities, which exactly matches the setting of cross-modal retrieval applications.
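To make this setup concrete, the short sketch below shows one way to build the cross-modal similarity matrix S from class-label annotations (using the "share at least one common label" rule that the experiments in Section 4.1 adopt) together with the element-wise sign(·) convention of Section 2.1. This is our own illustrative NumPy code under assumed label-matrix inputs, not code from the paper.

```python
# Illustrative sketch (not the authors' code): building S from label indicator
# matrices and the element-wise sign(.) with sign(0) = +1 used in this paper.
import numpy as np

def similarity_matrix(img_labels: np.ndarray, txt_labels: np.ndarray) -> np.ndarray:
    """S[i, j] = 1 if image i and text j share at least one label, else 0.
    Inputs are binary indicator matrices of shape (n, #labels)."""
    return (img_labels @ txt_labels.T > 0).astype(np.int8)

def sign(x: np.ndarray) -> np.ndarray:
    """Element-wise sign with sign(0) = +1 (np.sign would map 0 to 0)."""
    return np.where(x >= 0, 1, -1)

# Toy example: 3 points, 4 classes; points 0 and 2 share class 1, so S[0, 2] = 1.
L = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 1]])
S = similarity_matrix(L, L)   # both modalities of a training point carry the same labels
```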
[Figure 1: The end-to-end deep learning framework of our DCMH model. (The original figure, showing the image CNN, the bag-of-words text network, the binary codes and the loss function, is not reproduced here.)]

3. Deep Cross-Modal Hashing

In this section, we present the details of our deep CMH (DCMH) method, including the model formulation and the learning algorithm.

3.1. Model

The whole DCMH model is shown in Figure 1. It is an end-to-end learning framework that seamlessly integrates two parts: the feature learning part and the hash-code learning part. During learning, each part can give feedback to the other part.

3.1.1 Feature Learning Part

The feature learning part contains two deep neural networks, one for the image modality and the other for the text modality.

The deep neural network for the image modality is a convolutional neural network (CNN) adapted from [5]. There are eight layers in this CNN model. The first seven layers are the same as those in CNN-F of [5]. The eighth layer is a fully-connected layer whose output is the learned image features.

Table 1 shows the detailed configuration of the CNN for the image modality. More specifically, the eight layers are divided into five convolutional layers and three fully-connected layers, denoted as "conv1 - conv5" and "full6 - full8" in Table 1, respectively.

Table 1. Configuration of the CNN for image modality.
Layer | Configuration
conv1 | f. 64 × 11 × 11; st. 4 × 4; pad 0; LRN; ×2 pool
conv2 | f. 256 × 5 × 5; st. 1 × 1; pad 2; LRN; ×2 pool
conv3 | f. 256 × 3 × 3; st. 1 × 1; pad 1
conv4 | f. 256 × 3 × 3; st. 1 × 1; pad 1
conv5 | f. 256 × 3 × 3; st. 1 × 1; pad 1; ×2 pool
full6 | 4096
full7 | 4096
full8 | Hash code length c

Each convolutional layer is described by several aspects:

∙ "f. num × size × size" denotes the number of convolution filters and their receptive field size.
∙ "st" denotes the convolution stride.
∙ "pad" denotes the number of pixels to add to each side of the input.
∙ "LRN" denotes whether Local Response Normalization (LRN) [16] is applied or not.
∙ "pool" denotes the down-sampling factor.
∙ The number in the fully-connected layers, such as "4096", denotes the number of nodes in that layer. It is also the dimensionality of the output at that layer.

All of the first seven layers use the Rectified Linear Unit (ReLU) [16] as the activation function. For the eighth layer, we choose the identity function as the activation function.

To perform feature learning from text, we first represent each text y_j as a bag-of-words (BOW) vector. These bag-of-words vectors are then used as the input to a deep neural network with two fully-connected layers, denoted as "full1 - full2". The detailed configuration of the deep neural network for text is shown in Table 2, where the configuration gives the number of nodes in each layer. The activation function for the first layer is ReLU, and that for the second layer is the identity function.

Table 2. Configuration of the deep neural network for text modality.
Layer | Configuration
full1 | 8192
full2 | Hash code length c
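For concreteness, the following PyTorch sketch instantiates networks with the layer shapes in Tables 1 and 2. It is our own illustration: the paper's implementation uses MatConvNet, and details such as the LRN constants and the assumed 224 × 224 input resolution (which makes the conv5 output 256 × 6 × 6) are assumptions not specified above.

```python
# A minimal sketch of the two feature-learning networks (Tables 1 and 2).
# Assumptions: 224x224 RGB input, LRN window size 5, 2x2 max pooling.
import torch.nn as nn

def image_net(code_length: int) -> nn.Sequential:
    """CNN-F-style image network: conv1-conv5 plus full6-full8."""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=0), nn.ReLU(),
        nn.LocalResponseNorm(5), nn.MaxPool2d(2),                            # conv1
        nn.Conv2d(64, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
        nn.LocalResponseNorm(5), nn.MaxPool2d(2),                            # conv2
        nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # conv3
        nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # conv4
        nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                                     # conv5
        nn.Flatten(),
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),                             # full6
        nn.Linear(4096, 4096), nn.ReLU(),                                    # full7
        nn.Linear(4096, code_length),                                        # full8, identity activation
    )

def text_net(vocab_size: int, code_length: int) -> nn.Sequential:
    """Two fully-connected layers on the bag-of-words input (Table 2)."""
    return nn.Sequential(
        nn.Linear(vocab_size, 8192), nn.ReLU(),   # full1
        nn.Linear(8192, code_length),             # full2, identity activation
    )
```

In the model below, the c-dimensional outputs of full8 and full2 play the roles of the real-valued features f(x; θ_x) and g(y; θ_y), which are binarized only afterwards.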
Please note that the main goal of this paper is to show that it is possible to design an end-to-end learning framework for cross-modal hashing by using deep neural networks for feature learning from scratch. How to design different neural networks is not the focus of this paper. Other deep neural networks might also be used to perform feature learning for our DCMH model, which is left for future study.

3.1.2 Hash-Code Learning Part

Let f(x_i; θ_x) ∈ R^c denote the learned image feature for point i, which corresponds to the output of the CNN for the image modality. Furthermore, let g(y_j; θ_y) ∈ R^c denote the learned text feature for point j, which corresponds to the output of the deep neural network for the text modality. Here, θ_x is the network parameter of the CNN for the image modality, and θ_y is the network parameter of the deep neural network for the text modality.

The objective function of DCMH is defined as follows:

    min_{B^(x), B^(y), θ_x, θ_y}  J = − Σ_{i,j=1}^{n} (S_ij Θ_ij − log(1 + e^{Θ_ij}))
                                      + γ(∥B^(x) − F∥_F² + ∥B^(y) − G∥_F²)
                                      + η(∥F1∥_F² + ∥G1∥_F²)                                (1)
    s.t.  B^(x) ∈ {−1, +1}^{c×n},  B^(y) ∈ {−1, +1}^{c×n},

where F ∈ R^{c×n} with F_{*i} = f(x_i; θ_x), G ∈ R^{c×n} with G_{*j} = g(y_j; θ_y), Θ_ij = (1/2) F_{*i}^T G_{*j}, B^(x)_{*i} is the binary hash code for image x_i, B^(y)_{*j} is the binary hash code for text y_j, and γ and η are hyper-parameters.

The first term −Σ_{i,j=1}^{n} (S_ij Θ_ij − log(1 + e^{Θ_ij})) in (1) is the negative log likelihood of the cross-modal similarities, with the likelihood function defined as follows:

    p(S_ij | F_{*i}, G_{*j}) = σ(Θ_ij)       if S_ij = 1,
                               1 − σ(Θ_ij)   if S_ij = 0,

where Θ_ij = (1/2) F_{*i}^T G_{*j} and σ(Θ_ij) = 1 / (1 + e^{−Θ_ij}).

It is easy to see that minimizing this negative log likelihood, which is equivalent to maximizing the likelihood, makes the similarity (inner product) between F_{*i} and G_{*j} large when S_ij = 1 and small when S_ij = 0. Hence, optimizing the first term in (1) preserves the cross-modal similarity in S with the image feature representation F and the text feature representation G.

By optimizing the second term γ(∥B^(x) − F∥_F² + ∥B^(y) − G∥_F²) in (1), we can get B^(x) = sign(F) and B^(y) = sign(G). Hence, we can consider F and G to be the continuous surrogates of B^(x) and B^(y), respectively. Because F and G can preserve the cross-modal similarity in S, the binary hash codes B^(x) and B^(y) can also be expected to preserve the cross-modal similarity in S, which exactly matches the goal of cross-modal hashing.

The third term η(∥F1∥_F² + ∥G1∥_F²) in (1) is used to make each bit of the hash code balanced over all the training points. More specifically, the number of +1s and the number of −1s for each bit over all the training points should be almost the same. This constraint can be used to maximize the information provided by each bit.

In our experiments, we find that better performance can be achieved if the binary codes from the two modalities are set to be the same for the same training points. Hence, we set B^(x) = B^(y) = B. Then, the problem in (1) can be transformed into the following formulation:

    min_{B, θ_x, θ_y}  J = − Σ_{i,j=1}^{n} (S_ij Θ_ij − log(1 + e^{Θ_ij}))
                           + γ(∥B − F∥_F² + ∥B − G∥_F²)
                           + η(∥F1∥_F² + ∥G1∥_F²)                                           (2)
    s.t.  B ∈ {−1, +1}^{c×n}.

This is the final objective function of our DCMH for learning.

From (2), we can find that the parameters of the deep neural networks (θ_x and θ_y) and the binary hash code (B) are learned from the same objective function. That is to say, DCMH integrates both feature learning and hash-code learning into the same deep learning framework. Please note that we only enforce B^(x) = B^(y) for the training points.
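To illustrate the objective, the following NumPy sketch evaluates J in (2) for given network outputs F, G (each c × n) and codes B. The variable names and the use of NumPy are our own choices for illustration; this is not the authors' implementation.

```python
# Sketch of the DCMH objective in Eq. (2); F, G are c x n real matrices,
# B is a c x n matrix in {-1, +1}, S is the n x n similarity matrix.
import numpy as np

def dcmh_objective(F, G, B, S, gamma=1.0, eta=1.0):
    theta = 0.5 * F.T @ G                                          # Theta_ij = (1/2) F_{*i}^T G_{*j}
    neg_log_lik = -np.sum(S * theta - np.logaddexp(0.0, theta))    # log(1 + e^theta), computed stably
    quantization = gamma * (np.sum((B - F) ** 2) + np.sum((B - G) ** 2))
    balance = eta * (np.sum(F.sum(axis=1) ** 2) + np.sum(G.sum(axis=1) ** 2))  # ||F1||^2 + ||G1||^2
    return neg_log_lik + quantization + balance
```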
After solving the problem in (2), we still need to generate different binary codes b_i^(x) = h^(x)(x_i) and b_i^(y) = h^(y)(y_i) for the two modalities of the same point i if point i is a query point or a point from the database rather than a training point. This will be further illustrated in Section 3.3.

3.2. Learning

We adopt an alternating learning strategy to learn θ_x, θ_y and B. Each time we learn one parameter with the other parameters fixed. The whole alternating learning algorithm for DCMH is briefly outlined in Algorithm 1, and the detailed derivation is introduced in the rest of this subsection.

3.2.1 Learn θ_x, with θ_y and B Fixed

When θ_y and B are fixed, we learn the CNN parameter θ_x of the image modality by using a back-propagation (BP) algorithm. As in most existing deep learning methods [16], we utilize stochastic gradient descent (SGD) to learn θ_x with the BP algorithm. More specifically, in each iteration we sample a mini-batch of points from the training set and then carry out our learning algorithm based on the sampled data. In particular, for each sampled point x_i, we first compute the following gradient:
    ∂J/∂F_{*i} = (1/2) Σ_{j=1}^{n} (σ(Θ_ij) G_{*j} − S_ij G_{*j}) + 2γ(F_{*i} − B_{*i}) + 2η F1.        (3)

Then we can compute ∂J/∂θ_x from ∂J/∂F_{*i} by using the chain rule, based on which BP can be used to update the parameter θ_x.

3.2.2 Learn θ_y, with θ_x and B Fixed

When θ_x and B are fixed, we also learn the neural network parameter θ_y of the text modality by using SGD with a BP algorithm. More specifically, for each sampled point y_j, we first compute the following gradient:

    ∂J/∂G_{*j} = (1/2) Σ_{i=1}^{n} (σ(Θ_ij) F_{*i} − S_ij F_{*i}) + 2γ(G_{*j} − B_{*j}) + 2η G1.        (4)

Then we can compute ∂J/∂θ_y from ∂J/∂G_{*j} by using the chain rule, based on which BP can be used to update the parameter θ_y.

3.2.3 Learn B, with θ_x and θ_y Fixed

When θ_x and θ_y are fixed, the problem in (2) can be reformulated as follows:

    max_B  tr(B^T (γ(F + G))) = tr(B^T V) = Σ_{i,j} B_ij V_ij
    s.t.   B ∈ {−1, +1}^{c×n},

where V = γ(F + G).

It is easy to see that the binary code B_ij should keep the same sign as V_ij. Therefore, we have:

    B = sign(V) = sign(γ(F + G)).        (5)

3.3. Out-of-Sample Extension

For any point which is not in the training set, we can obtain its hash code as long as one of its modalities (image or text) is observed. In particular, given the image modality x_q of point q, we can adopt forward propagation to generate the hash code as follows:

    b_q^(x) = h^(x)(x_q) = sign(f(x_q; θ_x)).

Similarly, if point q only has the text modality y_q, we can also generate the hash code b_q^(y) as follows:

    b_q^(y) = h^(y)(y_q) = sign(g(y_q; θ_y)).

Hence, our DCMH model can be used for cross-modal search where the query points have one modality and the points in the database have the other modality.

Algorithm 1 The learning algorithm for DCMH.
Input: Image set X, text set Y, and cross-modal similarity matrix S.
Output: Parameters θ_x and θ_y of the deep neural networks, and binary code matrix B.
Initialization: Initialize neural network parameters θ_x and θ_y, mini-batch size N_x = N_y = 128, and iteration numbers t_x = ⌈n/N_x⌉, t_y = ⌈n/N_y⌉.
repeat
  for iter = 1, 2, ..., t_x do
    Randomly sample N_x points from X to construct a mini-batch.
    For each sampled point x_i in the mini-batch, calculate F_{*i} = f(x_i; θ_x) by forward propagation.
    Calculate the derivative according to (3).
    Update the parameter θ_x by using back propagation.
  end for
  for iter = 1, 2, ..., t_y do
    Randomly sample N_y points from Y to construct a mini-batch.
    For each sampled point y_j in the mini-batch, calculate G_{*j} = g(y_j; θ_y) by forward propagation.
    Calculate the derivative according to (4).
    Update the parameter θ_y by using back propagation.
  end for
  Learn B according to (5).
until a fixed number of iterations
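As a companion to Algorithm 1, the sketch below expresses the column-wise gradients (3) and (4) and the closed-form update (5) in NumPy. The back-propagation of these gradients into θ_x and θ_y is only indicated by comments, since it is handled by the deep-learning toolbox (MatConvNet in the paper); the function names are ours.

```python
# Hand-derived gradients (3), (4) and the B-update (5), as used in Algorithm 1.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_wrt_F_col(i, F, G, B, S, gamma=1.0, eta=1.0):
    """Eq. (3): dJ/dF_{*i}; F, G, B are c x n, S is n x n."""
    theta_i = 0.5 * F[:, i] @ G                     # Theta_ij for all j
    return (0.5 * G @ (sigmoid(theta_i) - S[i, :])  # (1/2) sum_j (sigma(Theta_ij) - S_ij) G_{*j}
            + 2.0 * gamma * (F[:, i] - B[:, i])
            + 2.0 * eta * F.sum(axis=1))            # 2 eta F1

def grad_wrt_G_col(j, F, G, B, S, gamma=1.0, eta=1.0):
    """Eq. (4): dJ/dG_{*j}."""
    theta_j = 0.5 * F.T @ G[:, j]                   # Theta_ij for all i
    return (0.5 * F @ (sigmoid(theta_j) - S[:, j])
            + 2.0 * gamma * (G[:, j] - B[:, j])
            + 2.0 * eta * G.sum(axis=1))            # 2 eta G1

def update_B(F, G, gamma=1.0):
    """Eq. (5): B = sign(gamma * (F + G)) with sign(0) = +1."""
    return np.where(gamma * (F + G) >= 0, 1, -1)

# One outer iteration of Algorithm 1 (schematic):
#   image pass: forward-propagate each mini-batch to refresh F_{*i}, feed
#     grad_wrt_F_col into back-propagation to update theta_x;
#   text pass:  analogous, with grad_wrt_G_col and theta_y;
#   finally:    B = update_B(F, G).
```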
4. Experiment

We carry out experiments on image-text datasets to verify the effectiveness of DCMH. DCMH is implemented with the open-source deep learning toolbox MatConvNet [31] on an NVIDIA K80 GPU server.

4.1. Datasets

Three datasets, MIRFLICKR-25K [12], IAPR TC-12 [8] and NUS-WIDE [6], are used for evaluation.

The original MIRFLICKR-25K dataset [12] consists of 25,000 images collected from the Flickr website. Each image is associated with several textual tags. Hence, each point is an image-text pair. We select those points which have at least 20 textual tags for our experiment. The text for each point is represented as a 1386-dimensional bag-of-words vector. For the hand-crafted-feature-based methods, each image is represented by a 512-dimensional GIST feature vector. Furthermore, each point is manually annotated with one of the 24 unique labels.

The IAPR TC-12 dataset [8] consists of 20,000 image-text pairs which are annotated using 255 labels. We use the entire dataset for our experiment. The text for each point is represented as a 2912-dimensional bag-of-words vector. For the hand-crafted-feature-based methods, each image is represented by a 512-dimensional GIST feature vector.
The NUS-WIDE dataset [6] contains 260,648 web images, and some images are associated with textual tags. It is a multi-label dataset where each point is annotated with one or multiple labels from 81 concept labels. We select 195,834 image-text pairs that belong to the 21 most frequent concepts. The text for each point is represented as a 1000-dimensional bag-of-words vector. The hand-crafted feature for each image is a 500-dimensional bag-of-visual-words (BOVW) vector.

For all datasets, image i and text j are considered to be similar if point i and point j share at least one common label. Otherwise, they are considered to be dissimilar.

4.2. Evaluation Protocol and Baseline

4.2.1 Evaluation Protocol

For the MIRFLICKR-25K and IAPR TC-12 datasets, we randomly sample 2,000 data points as the test (query) set and use the remaining points as the retrieval set (database). For the NUS-WIDE dataset, we take 2,100 data points as the test set and the rest as the retrieval set. Moreover, we sample 10,000 data points from the retrieval set as the training set for MIRFLICKR-25K and IAPR TC-12. For the NUS-WIDE dataset, we sample 10,500 data points from the retrieval set as the training set. The ground-truth neighbors are defined as those image-text pairs which share at least one common label.

For hashing-based retrieval, Hamming ranking and hash lookup are two widely used retrieval protocols [25]. We adopt these two protocols to evaluate our method and the baselines. The Hamming ranking protocol ranks the points in the database (retrieval set) according to their Hamming distances to the given query point, in increasing order. Mean average precision (MAP) [25] is the widely used metric to measure the accuracy of the Hamming ranking protocol. The hash lookup protocol returns all the points within a certain Hamming radius of the query point. The precision-recall curve is the widely used metric to measure the accuracy of the hash lookup protocol.

4.2.2 Baseline

Six state-of-the-art cross-modal hashing methods are adopted as baselines for comparison, including SePH [22], STMH [33], SCM [35], CMFH [7], CCA [11] and DVSH [3]. Since DVSH can only be used for a special CMH case where one of the modalities has to be temporal dynamics, we compare DCMH with DVSH only on the IAPR TC-12 dataset, where the original texts are sentences which can be treated as temporal dynamics. The texts in MIRFLICKR-25K and NUS-WIDE are tags, which are not suitable for DVSH. Please note that the texts are represented as BOW vectors for all the evaluated methods except DVSH.

Source code for SePH, STMH and SCM is kindly provided by the corresponding authors. For CMFH and CCA, whose code is not available, we implement them carefully by ourselves. SePH is a kernel-based method, for which we use an RBF kernel and take 500 randomly selected points as kernel bases, following its authors' suggestion. In SePH, the authors propose two strategies to construct the hash codes for retrieval (database) points according to whether both modalities of a point are observed or not. However, in this paper we only use one modality for the database (retrieval) points², because the focus of this paper is on cross-modal retrieval. All the other parameters of the baselines are set according to the suggestions in their original papers.

For DCMH, we use a validation set to choose the hyper-parameters γ and η, and find that good performance can be achieved with γ = η = 1. Hence, we set γ = η = 1 for DCMH.
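For reference, the sketch below shows one common way to compute the Hamming-ranking MAP described in Section 4.2.1 from ±1 codes. It is a generic illustration with our own conventions (for example, how a top-k cutoff is handled), not the evaluation code used for the reported numbers.

```python
# Generic MAP computation for Hamming ranking; codes are {-1, +1} row vectors,
# gt[i, j] = 1 if database point j is a ground-truth neighbor of query i.
import numpy as np

def mean_average_precision(query_codes, db_codes, gt, topk=None):
    c = query_codes.shape[1]
    average_precisions = []
    for q, rel in zip(query_codes, gt):
        hamming = 0.5 * (c - db_codes @ q)        # Hamming distance via inner product of +/-1 codes
        order = np.argsort(hamming)               # rank database points by increasing distance
        rel_ranked = rel[order][:topk] if topk else rel[order]
        if rel_ranked.sum() == 0:
            continue                              # no relevant point retrieved for this query
        ranks = np.arange(1, len(rel_ranked) + 1)
        precision_at_k = np.cumsum(rel_ranked) / ranks
        average_precisions.append(np.sum(precision_at_k * rel_ranked) / rel_ranked.sum())
    return float(np.mean(average_precisions))
```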
We exploit the CNN-F network [5] pre-trained on the ImageNet dataset [27] to initialize the first seven layers of the CNN for the image modality. All the other parameters of the deep neural networks in DCMH are randomly initialized. The input for the image modality is the raw pixels, and that for the text modality is the BOW vectors. We fix the mini-batch size to 128 and set the iteration number of the outer loop in Algorithm 1 to 500. The learning rate is chosen from 10^-6 to 10^-1 with a validation set. All experiments are run five times, and the average performance is reported.

4.3. Accuracy

4.3.1 Hamming Ranking

The MAP results for DCMH and the baselines with hand-crafted features on the MIRFLICKR-25K, IAPR TC-12 and NUS-WIDE datasets are reported in Table 3. Here, "I → T" denotes the case where the query is an image and the database consists of texts, and "T → I" denotes the case where the query is a text and the database consists of images. We can see that DCMH outperforms all the other baselines with hand-crafted features.

To further verify the effectiveness of DCMH, we exploit the CNN-F deep network [5] pre-trained on the ImageNet dataset, which is the same as the initial CNN of the image modality in DCMH, to extract CNN features. All the baselines are trained based on these CNN features. The MAP results for DCMH and the baselines with CNN features on the three datasets are reported in Table 4. We can see that DCMH outperforms all the other baselines except SePH for image-to-text retrieval on NUS-WIDE.

²For both SePH and DCMH, the accuracy obtained by using both modalities for database points is typically higher than that obtained by using only one modality. DCMH still outperforms SePH in the cases with both modalities for database points. This result is omitted in this paper due to space limitation.
Table 3. MAP. The best accuracy is shown in boldface. The baselines are based on hand-crafted features.
Columns per dataset: MIRFLICKR-25K (16 / 32 / 64 bits) | IAPR TC-12 (16 / 32 / 64 bits) | NUS-WIDE (16 / 32 / 64 bits)
I → T:
  DCMH  0.7410 0.7465 0.7485 | 0.4526 0.4732 0.4844 | 0.5903 0.6031 0.6093
  SePH  0.6573 0.6603 0.6616 | 0.4112 0.4158 0.4203 | 0.4787 0.4869 0.4888
  STMH  0.5921 0.5950 0.5980 | 0.3580 0.3732 0.3819 | 0.3973 0.4082 0.4153
  SCM   0.6290 0.6404 0.6480 | 0.3833 0.3898 0.3878 | 0.4650 0.4714 0.4822
  CMFH  0.5818 0.5808 0.5805 | 0.3683 0.3734 0.3786 | 0.3568 0.3624 0.3661
  CCA   0.5695 0.5663 0.5641 | 0.3345 0.3254 0.3193 | 0.3414 0.3336 0.3282
T → I:
  DCMH  0.7827 0.7900 0.7932 | 0.5185 0.5378 0.5468 | 0.6389 0.6511 0.6571
  SePH  0.6480 0.6521 0.6545 | 0.4024 0.4074 0.4131 | 0.4489 0.4539 0.4587
  STMH  0.5802 0.5846 0.5855 | 0.3445 0.3570 0.3690 | 0.3607 0.3738 0.3842
  SCM   0.6195 0.6302 0.6366 | 0.3698 0.3734 0.3696 | 0.4370 0.4428 0.4504
  CMFH  0.5787 0.5774 0.5784 | 0.3619 0.3687 0.3769 | 0.3623 0.3670 0.3723
  CCA   0.5690 0.5659 0.5639 | 0.3340 0.3255 0.3197 | 0.3392 0.3320 0.3272

Table 4. MAP. The best accuracy is shown in boldface. The baselines are based on CNN-F features.
Columns per dataset: MIRFLICKR-25K (16 / 32 / 64 bits) | IAPR TC-12 (16 / 32 / 64 bits) | NUS-WIDE (16 / 32 / 64 bits)
I → T:
  DCMH  0.7410 0.7465 0.7485 | 0.4526 0.4732 0.4844 | 0.5903 0.6031 0.6093
  SePH  0.7123 0.7194 0.7232 | 0.4442 0.4563 0.4639 | 0.6037 0.6136 0.6211
  STMH  0.6132 0.6219 0.6274 | 0.3775 0.4002 0.4130 | 0.4710 0.4864 0.4942
  SCM   0.6851 0.6921 0.7003 | 0.3692 0.3666 0.3802 | 0.5409 0.5485 0.5553
  CMFH  0.6377 0.6418 0.6451 | 0.4189 0.4234 0.4251 | 0.4900 0.5053 0.5097
  CCA   0.5719 0.5693 0.5672 | 0.3422 0.3361 0.3300 | 0.3604 0.3485 0.3390
T → I:
  DCMH  0.7827 0.7900 0.7932 | 0.5185 0.5378 0.5468 | 0.6389 0.6511 0.6571
  SePH  0.7216 0.7261 0.7319 | 0.4423 0.4562 0.4648 | 0.5983 0.6025 0.6109
  STMH  0.6074 0.6153 0.6217 | 0.3687 0.3897 0.4044 | 0.4471 0.4677 0.4780
  SCM   0.6939 0.7012 0.7060 | 0.3453 0.3410 0.3470 | 0.5344 0.5412 0.5484
  CMFH  0.6365 0.6399 0.6429 | 0.4168 0.4212 0.4277 | 0.5031 0.5187 0.5225
  CCA   0.5742 0.5713 0.5691 | 0.3493 0.3438 0.3378 | 0.3614 0.3494 0.3395

4.3.2 Hash Lookup

In the hash lookup protocol, we can compute the precision and recall for the returned points given any Hamming radius. By varying the Hamming radius from 0 to c with a step size of 1, we can get the precision-recall curve.

Figure 2 shows the precision-recall curves with code length 16 on the three datasets, where, in each row, the first two sub-figures are based on hand-crafted features and the last two are based on CNN-F features for the baselines. We can find that DCMH dramatically outperforms the baselines for both hand-crafted features and CNN-F features. DCMH also achieves the best performance for other code lengths, such as 32 bits and 64 bits. Those results are omitted due to space limitation.

4.4. Comparison with DVSH

Since the source code of DVSH is not publicly available and it is also difficult to re-implement DVSH, we adopt the same experimental setting as that in DVSH [3] to evaluate DCMH and directly use the result in DVSH [3] for comparison. The top-500 MAP result on the IAPR TC-12 dataset is listed in Table 5. Please note that the text input for DVSH is sentences, while we represent the sentences as BOW vectors for DCMH.

Table 5. Top-500 MAP on IAPR TC-12 dataset.
Task | Method | 16 bits | 32 bits | 64 bits
I → T | DCMH | 0.5780 | 0.6061 | 0.6310
I → T | DVSH | 0.5696 | 0.6321 | 0.6964
T → I | DCMH | 0.6594 | 0.6744 | 0.6905
T → I | DVSH | 0.6037 | 0.6395 | 0.6806
4.5. Sensitivity to Parameters

We explore the influence of the hyper-parameters γ and η. Figure 3 shows the MAP results on the MIRFLICKR-25K dataset for different values of γ and η, with the code length fixed to 16 bits. We can see that DCMH is not sensitive to γ and η over the range 0.01 < γ < 2 and 0.01 < η < 2.
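The MAP values reported in this section (Tables 3-5 and Figure 3) are obtained by ranking database points by Hamming distance to the query. For reference, here is a minimal NumPy sketch of such a MAP computation; it is our own illustrative code rather than the paper's released implementation, the names are assumptions, and details such as tie-breaking or the exact top-k convention may differ from the authors' protocol.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance, topk=None):
    """MAP under Hamming ranking.

    query_codes: (n_q, c) 0/1 bits; db_codes: (n_d, c) 0/1 bits
    relevance:   (n_q, n_d) boolean ground truth (e.g. sharing >= 1 label)
    topk:        if given (e.g. 500), only the top-k ranked points are scored
    """
    aps = []
    for q in range(query_codes.shape[0]):
        dist = (query_codes[q][None, :] != db_codes).sum(axis=1)  # Hamming distances
        order = np.argsort(dist, kind="stable")                   # rank database points
        rel = relevance[q][order]
        if topk is not None:
            rel = rel[:topk]
        if rel.sum() == 0:
            continue                                              # no relevant point retrieved
        cum_hits = np.cumsum(rel)
        ranks = np.arange(1, len(rel) + 1)
        precision_at_hit = cum_hits[rel] / ranks[rel]             # precision at each relevant rank
        aps.append(precision_at_hit.mean())
    return float(np.mean(aps)) if aps else 0.0
```

Calling mean_average_precision(..., topk=500) would correspond to the top-500 MAP setting used in the comparison with DVSH in Table 5.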
[Figure 2. Precision-recall curves on three datasets. The code length is 16. Panels (a)-(d): MIRFLICKR-25K; (e)-(h): IAPR TC-12; (i)-(l): NUS-WIDE. In each row, the first two panels use hand-crafted features and the last two use CNN-F features for the baselines; each panel plots precision against recall.]

[Figure 3. The influence of hyper-parameters: MAP on MIRFLICKR-25K as γ and η vary from 0.001 to 2.]

4.6. Further Analysis

To further verify the effectiveness of feature learning, we evaluate some variants of DCMH, i.e., DCMH-I, DCMH-T and DCMH-IT. DCMH-I denotes the variant without image feature learning, in which we fix the parameters of the first seven layers of the image-modality network during training. DCMH-T denotes the variant without text feature learning, in which we replace the deep neural network of the text modality with a linear projection. DCMH-IT denotes the variant without both image and text feature learning.

Figure 4 reports the MAP results on IAPR TC-12. We can find that DCMH achieves higher accuracy than DCMH-I, DCMH-T and DCMH-IT, which demonstrates the importance of performing hash-code learning and feature learning simultaneously.

[Figure 4. MAP on IAPR TC-12 for different code lengths (16, 32 and 64 bits).]

5. Conclusion

In this paper, we have proposed a novel hashing method, called DCMH, for cross-modal retrieval applications. DCMH is an end-to-end deep learning framework which can perform feature learning and hash-code learning simultaneously. Experiments on three datasets show that DCMH can significantly outperform other baselines and achieve state-of-the-art performance in real applications.

6. Acknowledgements

This work is supported by the NSFC (61472182) and a fund from Tencent.
References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
[2] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In CVPR, pages 3594–3601, 2010.
[3] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu. Deep visual-semantic hashing for cross-modal retrieval. In SIGKDD, pages 1445–1454, 2016.
[4] M. A. Carreira-Perpiñán and R. Raziperchikolaei. Hashing with binary autoencoders. In CVPR, pages 557–566, 2015.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[6] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: a real-world web image database from national university of singapore. In CIVR, 2009.
[7] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In CVPR, pages 2083–2090, 2014.
[8] H. J. Escalante, C. A. Hernandez, J. A. González, A. López-López, M. Montes-y-Gómez, E. F. Morales, L. E. Sucar, L. V. Pineda, and M. Grubinger. The segmented and annotated IAPR TC-12 benchmark. CVIU, 114(4):419–428, 2010.
[9] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, pages 817–824, 2011.
[10] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In CVPR, pages 2957–2964, 2012.
[11] H. Hotelling. Relations between two sets of variates. Biometrika, pages 321–377, 1936.
[12] M. J. Huiskes and M. S. Lew. The MIR flickr retrieval evaluation. In ACM MM, pages 39–43, 2008.
[13] Q.-Y. Jiang and W.-J. Li. Deep cross-modal hashing. CoRR, abs/1602.02255, 2016.
[14] Y. Kang, S. Kim, and S. Choi. Deep learning to hash with multiple representations. In ICDM, pages 930–935, 2012.
[15] W. Kong and W.-J. Li. Isotropic hashing. In NIPS, pages 1655–1663, 2012.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[17] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, pages 2130–2137, 2009.
[18] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI, pages 1360–1365, 2011.
[19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, pages 396–404, 1989.
[20] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.
[21] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In CVPR, pages 1971–1978, 2014.
[22] Z. Lin, G. Ding, M. Hu, and J. Wang. Semantics-preserving hashing for cross-view retrieval. In CVPR, pages 3864–3872, 2015.
[23] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, pages 2475–2483, 2015.
[24] H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In CVPR, pages 2064–2072, 2016.
[25] W. Liu, C. Mu, S. Kumar, and S.-F. Chang. Discrete graph hashing. In NIPS, pages 3419–3427, 2014.
[26] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F.-F. Li. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
[28] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.
[29] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang. Inductive hashing on manifolds. In CVPR, pages 1562–1569, 2013.
[30] J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In ACM MM, pages 423–432, 2011.
[31] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In ACM MM, pages 689–692, 2015.
[32] D. Wang, P. Cui, M. Ou, and W. Zhu. Deep multimodal hashing with orthogonal regularization. In IJCAI, pages 2291–2297, 2015.
[33] D. Wang, X. Gao, X. Wang, and L. He. Semantic topic multimodal hashing for cross-media retrieval. In IJCAI, pages 3890–3896, 2015.
[34] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, pages 3424–3431, 2010.
[35] D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, pages 2177–2183, 2014.
[36] D. Zhang, F. Wang, and L. Si. Composite hashing with multiple information sources. In SIGIR, pages 225–234, 2011.
[37] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, pages 1556–1564, 2015.
[38] Y. Zhen and D.-Y. Yeung. Co-regularized hashing for multimodal data. In NIPS, pages 1385–1393, 2012.
[39] Y. Zhen and D.-Y. Yeung. A probabilistic model for multimodal hash function learning. In SIGKDD, pages 940–948, 2012.
[40] B. Zhuang, G. Lin, C. Shen, and I. D. Reid. Fast training of triplet-based deep binary embedding networks. In CVPR, pages 5955–5964, 2016.