learning simultaneously with a deep neural network. The time complexity of DCMH is also high.

Typically, discrete methods can outperform continuous methods in terms of retrieval accuracy. However, high complexity makes the training of discrete supervised CMH time-consuming. To make training practicable, discrete supervised CMH methods usually sample a subset of the database during the training procedure. Hence, existing discrete supervised CMH methods cannot fully utilize the supervised information, which deteriorates retrieval accuracy. In summary, to fully utilize the supervised information, we need to design a scalable discrete CMH method that further improves retrieval accuracy and scalability for large-scale datasets.

III. NOTATIONS AND PROBLEM DEFINITION

We briefly introduce the notations and the problem definition in this section.

A. Notations

We use bold uppercase letters like $\mathbf{U}$ and bold lowercase letters like $\mathbf{u}$ to denote matrices and vectors, respectively. The element at position $(i, j)$ of matrix $\mathbf{U}$ is denoted as $U_{ij}$. The $i$th row of matrix $\mathbf{U}$ is denoted as $\mathbf{U}_{i*}$, and the $j$th column of matrix $\mathbf{U}$ is denoted as $\mathbf{U}_{*j}$. $\|\cdot\|_F$ and $\mathbf{U}^T$ denote the Frobenius norm of a matrix and the transpose of matrix $\mathbf{U}$, respectively. $\mathrm{sign}(\cdot)$ is an element-wise sign function.
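As a minimal illustration (not part of the original paper), the following NumPy sketch shows how this notation maps to array operations; the toy matrix and the zero-to-+1 convention for the sign function are assumptions made only for the example.

import numpy as np

U = np.random.randn(6, 4)          # toy matrix with n = 6 rows and c = 4 columns
U_ij = U[1, 2]                     # U_{ij}: element at position (i, j)
U_i  = U[1, :]                     # U_{i*}: the ith row of U
U_j  = U[:, 2]                     # U_{*j}: the jth column of U
fro  = np.linalg.norm(U, 'fro')    # ||U||_F: Frobenius norm of U
U_T  = U.T                         # U^T: transpose of U
codes = np.sign(U)                 # sign(.): element-wise sign function
codes[codes == 0] = 1              # assumed convention: map zeros to +1 so codes lie in {-1, +1}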
B. Problem Definition

Without loss of generality, we assume there exist only two modalities in the data, although our DLFH can be easily adapted to more modalities. We use $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]^T \in \mathbb{R}^{n \times d_x}$ and $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n]^T \in \mathbb{R}^{n \times d_y}$ to denote the feature vectors of the two modalities (modality x and modality y), where $d_x$ and $d_y$ respectively denote the dimensionality of the feature spaces of the two modalities and $n$ is the number of training points. In particular, $\mathbf{x}_i$ and $\mathbf{y}_i$ denote the feature vectors of the two modalities for training point $i$, respectively. Without loss of generality, the data are assumed to be zero-centered, which means $\sum_{i=1}^{n}\mathbf{x}_i = \mathbf{0}$ and $\sum_{i=1}^{n}\mathbf{y}_i = \mathbf{0}$. Here, we assume that both modalities are observed for all training points. However, we do not require that both modalities be observed for query (test) points. Hence, the setting is cross-modal. Actually, DLFH can be easily adapted to cases where some training points have missing modalities, which is left for future study.

In this paper, we focus on supervised CMH, which has shown better accuracy than unsupervised CMH [16], [17], [26], [38], [44]. In supervised CMH, besides the feature vectors $\mathbf{X}$ and $\mathbf{Y}$, we are also given a cross-modal supervised similarity matrix $\mathbf{S} \in \{0, 1\}^{n \times n}$. If $S_{ij} = 1$, it means that point $\mathbf{x}_i$ and point $\mathbf{y}_j$ are similar; otherwise $\mathbf{x}_i$ and $\mathbf{y}_j$ are dissimilar. Here, we assume all elements of $\mathbf{S}$ are observed, but our DLFH can also be adapted to cases with missing elements in $\mathbf{S}$. $S_{ij}$ can be manually labeled by users, or constructed from the labels of point $i$ and point $j$.

We use $\mathbf{U}, \mathbf{V} \in \{-1, +1\}^{n \times c}$ to respectively denote the binary codes for modality x and modality y, where $\mathbf{U}_{i*}$ and $\mathbf{V}_{i*}$ respectively denote the binary hash codes of the two modalities for point $i$, and $c$ is the length of the binary codes.

The goal of supervised CMH is to learn the binary codes $\mathbf{U}$ and $\mathbf{V}$, which try to preserve the similarity information in $\mathbf{S}$. In other words, if $S_{ij} = 1$, the Hamming distance between $\mathbf{U}_{i*}$ and $\mathbf{V}_{j*}$ should be as small as possible, and vice versa. Furthermore, we also need to learn two hash functions $h_x(\mathbf{x}_q) \in \{-1, +1\}^c$ and $h_y(\mathbf{y}_q) \in \{-1, +1\}^c$ respectively for modality x and modality y, which can compute binary hash codes for any new query point ($\mathbf{x}_q$ or $\mathbf{y}_q$) unseen in the training set.
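To make the setup concrete, the NumPy sketch below shows one way the pieces above fit together; the rule of setting $S_{ij} = 1$ when two points share at least one label and the helper names (build_similarity, hamming_distance) are illustrative assumptions, not a construction prescribed by the paper.

import numpy as np

def build_similarity(Lx, Ly):
    """Construct S in {0,1}^{n x n} from binary multi-label matrices.
    Assumed rule: S_ij = 1 iff point i (modality x) and point j (modality y) share a label."""
    return (Lx @ Ly.T > 0).astype(np.int8)

def hamming_distance(u, v):
    """Hamming distance between two codes u, v in {-1,+1}^c."""
    c = u.shape[0]
    return int(c - u @ v) // 2     # for {-1,+1} codes, u.v = c - 2 * Hamming distance

# Toy usage: labels for n = 5 points with 3 classes, and a pair of 8-bit codes.
rng = np.random.default_rng(0)
Lx = rng.integers(0, 2, size=(5, 3))
Ly = rng.integers(0, 2, size=(5, 3))
S = build_similarity(Lx, Ly)       # cross-modal supervised similarity matrix

u = np.where(rng.standard_normal(8) >= 0, 1, -1)
v = np.where(rng.standard_normal(8) >= 0, 1, -1)
print(S)
print(hamming_distance(u, v))      # the learning goal is to make this small for pairs with S_ij = 1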
IV. DISCRETE LATENT FACTOR MODEL BASED CROSS-MODAL HASHING

In this section, we introduce the details of DLFH, including the model formulation and the learning algorithm.

A. Model Formulation

Given a binary code pair $\{\mathbf{U}_{i*}, \mathbf{V}_{j*}\}$, we define $\Theta_{ij}$ as $\Theta_{ij} = \frac{\lambda}{c}\mathbf{U}_{i*}\mathbf{V}_{j*}^T$, where $c$ is the code length, which is pre-specified, and $\lambda > 0$ is a hyper-parameter denoting a scale factor for tuning.

By using a logistic function, we define $A_{ij}$ as $A_{ij} = \sigma(\Theta_{ij}) = \frac{1}{1 + e^{-\Theta_{ij}}}$. Based on $A_{ij}$, we define the likelihood of the cross-modal similarity $\mathbf{S}$ as:
\[
p(\mathbf{S} \mid \mathbf{U}, \mathbf{V}) = \prod_{i,j=1}^{n} p(S_{ij} \mid \mathbf{U}, \mathbf{V}),
\]
where $p(S_{ij} \mid \mathbf{U}, \mathbf{V})$ is defined as follows:
\[
p(S_{ij} \mid \mathbf{U}, \mathbf{V}) =
\begin{cases}
A_{ij}, & \text{if } S_{ij} = 1, \\
1 - A_{ij}, & \text{otherwise}.
\end{cases}
\]
Then the log-likelihood of $\mathbf{U}$ and $\mathbf{V}$ can be derived as follows:
\[
L = \log p(\mathbf{S} \mid \mathbf{U}, \mathbf{V}) = \sum_{i,j=1}^{n} \big[ S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \big].
\]
The model of DLFH tries to maximize the log-likelihood of $\mathbf{U}$ and $\mathbf{V}$. That is, DLFH tries to solve the following problem:
\[
\max_{\mathbf{U}, \mathbf{V}} \; L = \log p(\mathbf{S} \mid \mathbf{U}, \mathbf{V}) = \sum_{i,j=1}^{n} \big[ S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \big] \qquad (1)
\]
\[
\text{s.t.} \quad \mathbf{U}, \mathbf{V} \in \{-1, +1\}^{n \times c},
\]
where $\mathbf{U}$ and $\mathbf{V}$ are constrained to lie in a binary space, i.e., $\{-1, +1\}^{n \times c}$, because they are the binary hash codes to be learned.

We can find that maximizing the objective function in (1) exactly matches the goal of hashing. More specifically, the learned binary hash codes try to preserve the similarity information in $\mathbf{S}$.

Please note that in (1), DLFH adopts a maximum likelihood loss function. Although there exist some CMH methods [34], [41] which also use a maximum likelihood loss function, none of them can utilize the discrete latent factor model to directly learn the binary hash codes.
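For concreteness, the following NumPy sketch evaluates the objective in (1) for given binary codes $\mathbf{U}$ and $\mathbf{V}$; it is only an illustrative loss computation under assumed toy inputs (the value of $\lambda$ used here is arbitrary), not the DLFH learning algorithm, which optimizes $\mathbf{U}$ and $\mathbf{V}$ discretely.

import numpy as np

def dlfh_log_likelihood(U, V, S, lam=8.0):
    """Objective in (1): sum_{i,j} [ S_ij * Theta_ij - log(1 + exp(Theta_ij)) ],
    where Theta_ij = (lam / c) * U_{i*} V_{j*}^T; U, V in {-1,+1}^{n x c}, S in {0,1}^{n x n}."""
    c = U.shape[1]
    Theta = (lam / c) * (U @ V.T)                    # n x n matrix of Theta_ij
    return float(np.sum(S * Theta - np.logaddexp(0.0, Theta)))  # logaddexp gives a stable log(1 + e^Theta)

# Toy usage with random binary codes and a random similarity matrix.
rng = np.random.default_rng(1)
n, c = 6, 4
U = np.where(rng.standard_normal((n, c)) >= 0, 1, -1)
V = np.where(rng.standard_normal((n, c)) >= 0, 1, -1)
S = rng.integers(0, 2, size=(n, n))
print(dlfh_log_likelihood(U, V, S))                  # DLFH seeks U, V that maximize this value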