with the hash-code learning procedure. Hence, these existing CMH methods with hand-crafted features may not achieve satisfactory performance in real applications.

Recently, deep learning with neural networks [19, 16] has been widely used to perform feature learning from scratch with promising performance. There also exist some methods which adopt deep learning for uni-modal hashing [37, 23, 20, 40, 24]. These methods show that an end-to-end deep learning architecture is more compatible with hash learning. For the CMH setting, there has also appeared one method, called deep visual-semantic hashing (DVSH) [3], which uses deep neural networks for feature learning¹. However, DVSH can only be used for a special CMH case where one of the modalities has to exhibit temporal dynamics.

In this paper, we propose a novel CMH method, called deep cross-modal hashing (DCMH), for cross-modal retrieval applications. The main contributions of DCMH are outlined as follows:

∙ DCMH is an end-to-end learning framework with deep neural networks, one for each modality, to perform feature learning from scratch.

∙ The hash-code learning problem is essentially a discrete learning problem, which is difficult to solve. Hence, most existing CMH methods solve this problem by relaxing the original discrete learning problem into a continuous learning problem. This relaxation procedure may deteriorate the accuracy of the learned hash codes [25]. Unlike these relaxation-based methods, DCMH directly learns the discrete hash codes without relaxation.

∙ Experiments on real datasets with image-text modalities show that DCMH can outperform other baselines to achieve state-of-the-art performance in cross-modal retrieval applications.

The rest of this paper is organized as follows. Section 2 introduces the problem definition of this paper. We present our DCMH method in Section 3, including the model formulation and learning algorithm. Experiments are shown in Section 4. At last, we conclude our work in Section 5.

¹The first version of our DCMH method was submitted to arXiv [13] before this CVPR submission, and thus actually appeared earlier than DVSH in the public literature.
2. Problem Definition

2.1. Notation

Boldface lowercase letters like w denote vectors. Boldface uppercase letters like W denote matrices; the element in the ith row and jth column of W is denoted as W_ij, the ith row of W as W_i*, and the jth column of W as W_*j. W^T is the transpose of W. We use 1 to denote a vector with all elements equal to 1. tr(·) and ‖·‖_F denote the trace and the Frobenius norm of a matrix, respectively. sign(·) is an element-wise sign function defined as follows:

    sign(x) = {  1,  x ≥ 0,
              { −1,  x < 0.

2.2. Cross-Modal Hashing

Although the method proposed in this paper can easily be adapted to cases with more than two modalities, we focus on the two-modality case here.

Assume that we have n training entities (data points), each of which has two modalities of features. Without loss of generality, we use image-text datasets for illustration in this paper, which means that each training point has both an image modality and a text modality. We use X = {x_i}_{i=1}^n to denote the image modality, where x_i can be either hand-crafted features or the raw pixels of image i. Moreover, we use Y = {y_i}_{i=1}^n to denote the text modality, where y_i is typically the tag information associated with image i. In addition, we are given a cross-modal similarity matrix S, where S_ij = 1 if image x_i and text y_j are similar, and S_ij = 0 otherwise. Here, similarity is typically defined by semantic information such as class labels. For example, we can say that image x_i and text y_j are similar if they share the same class label, and dissimilar if they are from different classes.

Given the training information X, Y, and S, the goal of cross-modal hashing is to learn two hash functions, one per modality: h^(x)(x) ∈ {−1, +1}^c for the image modality and h^(y)(y) ∈ {−1, +1}^c for the text modality, where c is the length of the binary code. These two hash functions should preserve the cross-modal similarity in S. More specifically, if S_ij = 1, the Hamming distance between the binary codes b_i^(x) = h^(x)(x_i) and b_j^(y) = h^(y)(y_j) should be small. Otherwise, if S_ij = 0, the corresponding Hamming distance should be large.

Here, we assume that both modalities of features are observed for each point in the training set, although our method can easily be adapted to settings where some training points have only one observed modality. Note that we make this assumption only for training points. Once the model is trained, it can generate hash codes for query and database points with either one modality or both modalities, which exactly matches the setting of cross-modal retrieval applications.
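As a quick illustration of the notation in Section 2.1, the following is a minimal NumPy sketch (not from the paper; the array values are made up) of the element-wise sign(·) function, which turns any real-valued output into binary codes in {−1, +1}:

```python
import numpy as np

def sign(x):
    # Element-wise sign as defined in Section 2.1:
    # +1 where x >= 0, and -1 where x < 0 (so 0 maps to +1).
    return np.where(np.asarray(x) >= 0, 1, -1)

# A made-up real-valued output and its binary code.
u = np.array([[0.3, -1.2, 0.0], [-0.5, 2.0, -0.1]])
print(sign(u))  # [[ 1 -1  1], [-1  1 -1]]
```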
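To make the problem definition in Section 2.2 concrete, here is a similar hedged sketch of the training information and the similarity-preservation goal. The label vector, the sizes n and c, and the random stand-ins for the hash functions h^(x) and h^(y) are illustrative assumptions, not part of the paper:

```python
import numpy as np

n, c = 6, 8  # number of training points and binary code length (made up)

# Hypothetical class labels defining the cross-modal similarity matrix S:
# S_ij = 1 iff image x_i and text y_j share the same class label.
labels = np.array([0, 0, 1, 1, 2, 2])
S = (labels[:, None] == labels[None, :]).astype(int)  # (n, n), entries in {0, 1}

# Random stand-ins for the codes b_i^(x) = h^(x)(x_i) and b_j^(y) = h^(y)(y_j);
# real codes would come from the two learned hash functions.
rng = np.random.default_rng(0)
Bx = np.where(rng.standard_normal((n, c)) >= 0, 1, -1)
By = np.where(rng.standard_normal((n, c)) >= 0, 1, -1)

# For codes in {-1, +1}^c the Hamming distance follows from the inner
# product: dist_H(b_i, b_j) = (c - b_i . b_j) / 2.
hamming = (c - Bx @ By.T) / 2  # (n, n) cross-modal Hamming distances

# The goal of cross-modal hashing: distances small where S_ij = 1 and large
# where S_ij = 0. Random codes, as used here, will of course not satisfy this.
print("mean distance, similar pairs:   ", hamming[S == 1].mean())
print("mean distance, dissimilar pairs:", hamming[S == 0].mean())
```

The inner-product identity in the sketch is the standard reason hashing objectives are usually written in terms of inner products of codes rather than Hamming distances directly.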