different neural networks is not the focus of this paper. Other deep neural networks might also be used to perform feature learning for our DCMH model, which is left for future study.

3.1.2 Hash-Code Learning Part

Let $f(\mathbf{x}_i; \theta_x) \in \mathbb{R}^c$ denote the learned image feature for point $i$, which corresponds to the output of the CNN for the image modality. Furthermore, let $g(\mathbf{y}_j; \theta_y) \in \mathbb{R}^c$ denote the learned text feature for point $j$, which corresponds to the output of the deep neural network for the text modality. Here, $\theta_x$ is the network parameter of the CNN for the image modality, and $\theta_y$ is the network parameter of the deep neural network for the text modality.

The objective function of DCMH is defined as follows:
$$
\begin{aligned}
\min_{\mathbf{B}^{(x)},\,\mathbf{B}^{(y)},\,\theta_x,\,\theta_y} \mathcal{J} =\;& -\sum_{i,j=1}^{n}\big(S_{ij}\Theta_{ij} - \log(1+e^{\Theta_{ij}})\big) \\
& + \gamma\big(\|\mathbf{B}^{(x)}-\mathbf{F}\|_F^2 + \|\mathbf{B}^{(y)}-\mathbf{G}\|_F^2\big) \\
& + \eta\big(\|\mathbf{F}\mathbf{1}\|_F^2 + \|\mathbf{G}\mathbf{1}\|_F^2\big) \\
\text{s.t.}\quad & \mathbf{B}^{(x)} \in \{-1,+1\}^{c\times n},\quad \mathbf{B}^{(y)} \in \{-1,+1\}^{c\times n},
\end{aligned}
\tag{1}
$$
where $\mathbf{F}\in\mathbb{R}^{c\times n}$ with $\mathbf{F}_{*i} = f(\mathbf{x}_i;\theta_x)$, $\mathbf{G}\in\mathbb{R}^{c\times n}$ with $\mathbf{G}_{*j} = g(\mathbf{y}_j;\theta_y)$, $\Theta_{ij} = \frac{1}{2}\mathbf{F}_{*i}^{T}\mathbf{G}_{*j}$, $\mathbf{B}^{(x)}_{*i}$ is the binary hash code for image $\mathbf{x}_i$, $\mathbf{B}^{(y)}_{*j}$ is the binary hash code for text $\mathbf{y}_j$, and $\gamma$ and $\eta$ are hyper-parameters.

The first term $-\sum_{i,j=1}^{n}(S_{ij}\Theta_{ij} - \log(1+e^{\Theta_{ij}}))$ in (1) is the negative log likelihood of the cross-modal similarities, with the likelihood function defined as follows:
$$
p(S_{ij}\mid \mathbf{F}_{*i}, \mathbf{G}_{*j}) =
\begin{cases}
\sigma(\Theta_{ij}), & S_{ij}=1, \\
1-\sigma(\Theta_{ij}), & S_{ij}=0,
\end{cases}
$$
where $\Theta_{ij} = \frac{1}{2}\mathbf{F}_{*i}^{T}\mathbf{G}_{*j}$ and $\sigma(\Theta_{ij}) = \frac{1}{1+e^{-\Theta_{ij}}}$.

It is easy to see that minimizing this negative log likelihood, which is equivalent to maximizing the likelihood, makes the similarity (inner product) between $\mathbf{F}_{*i}$ and $\mathbf{G}_{*j}$ large when $S_{ij}=1$ and small when $S_{ij}=0$. Hence, optimizing the first term in (1) preserves the cross-modal similarity in $\mathbf{S}$ with the image feature representation $\mathbf{F}$ and the text feature representation $\mathbf{G}$.
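To make the first term concrete, the following PyTorch-style sketch computes this negative log likelihood from the feature matrices $\mathbf{F}$, $\mathbf{G}$ and the similarity matrix $\mathbf{S}$. It is an illustration rather than the authors' implementation; the function name and the column-major $c\times n$ layout are assumptions made only for this example.

```python
import torch
import torch.nn.functional as TF

def neg_log_likelihood(F, G, S):
    """Negative log likelihood term of Eq. (1).

    F: (c, n) continuous image features, one column per training point.
    G: (c, n) continuous text features.
    S: (n, n) cross-modal similarity matrix (float) with entries in {0, 1}.
    """
    theta = 0.5 * F.t() @ G                 # Theta_ij = 0.5 * F_{*i}^T G_{*j}
    # -sum_ij (S_ij * Theta_ij - log(1 + exp(Theta_ij))),
    # where softplus(theta) = log(1 + exp(theta)) is computed stably.
    return torch.sum(TF.softplus(theta) - S * theta)
```

Minimizing this quantity drives $\Theta_{ij}$ up for similar pairs ($S_{ij}=1$) and down for dissimilar pairs ($S_{ij}=0$), as discussed above.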
By optimizing the second term $\gamma(\|\mathbf{B}^{(x)}-\mathbf{F}\|_F^2 + \|\mathbf{B}^{(y)}-\mathbf{G}\|_F^2)$ in (1), we can get $\mathbf{B}^{(x)} = \operatorname{sign}(\mathbf{F})$ and $\mathbf{B}^{(y)} = \operatorname{sign}(\mathbf{G})$. Hence, we can consider $\mathbf{F}$ and $\mathbf{G}$ to be the continuous surrogates of $\mathbf{B}^{(x)}$ and $\mathbf{B}^{(y)}$, respectively. Because $\mathbf{F}$ and $\mathbf{G}$ can preserve the cross-modal similarity in $\mathbf{S}$, the binary hash codes $\mathbf{B}^{(x)}$ and $\mathbf{B}^{(y)}$ can also be expected to preserve the cross-modal similarity in $\mathbf{S}$, which exactly matches the goal of cross-modal hashing.

The third term $\eta(\|\mathbf{F}\mathbf{1}\|_F^2 + \|\mathbf{G}\mathbf{1}\|_F^2)$ in (1) is used to make each bit of the hash code balanced over all the training points. More specifically, the number of $+1$s and the number of $-1$s for each bit over all the training points should be almost the same. This constraint can be used to maximize the information provided by each bit.

In our experiments, we find that better performance can be achieved if the binary codes from the two modalities are set to be the same for the same training points. Hence, we set $\mathbf{B}^{(x)} = \mathbf{B}^{(y)} = \mathbf{B}$. Then, the problem in (1) can be transformed to the following formulation:
$$
\begin{aligned}
\min_{\mathbf{B},\,\theta_x,\,\theta_y} \mathcal{J} =\;& -\sum_{i,j=1}^{n}\big(S_{ij}\Theta_{ij} - \log(1+e^{\Theta_{ij}})\big) \\
& + \gamma\big(\|\mathbf{B}-\mathbf{F}\|_F^2 + \|\mathbf{B}-\mathbf{G}\|_F^2\big) \\
& + \eta\big(\|\mathbf{F}\mathbf{1}\|_F^2 + \|\mathbf{G}\mathbf{1}\|_F^2\big) \\
\text{s.t.}\quad & \mathbf{B} \in \{-1,+1\}^{c\times n}.
\end{aligned}
\tag{2}
$$
This is the final objective function of DCMH for learning.

From (2), we can see that the parameters of the deep neural networks ($\theta_x$ and $\theta_y$) and the binary hash code ($\mathbf{B}$) are learned from the same objective function. That is to say, DCMH integrates both feature learning and hash-code learning into the same deep learning framework.

Please note that we only make $\mathbf{B}^{(x)} = \mathbf{B}^{(y)}$ for the training points. After we have solved the problem in (2), we still need to generate different binary codes $\mathbf{b}_i^{(x)} = h^{(x)}(\mathbf{x}_i)$ and $\mathbf{b}_i^{(y)} = h^{(y)}(\mathbf{y}_i)$ for the two different modalities of the same point $i$ if point $i$ is a query point or a point from the database rather than a training point. This will be further illustrated in Section 3.3.
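As a companion to the likelihood sketch above, the following hedged example evaluates the full objective $\mathcal{J}$ in (2), adding the quantization term and the bit-balance term. The names `dcmh_objective`, `gamma`, and `eta` are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as TF

def dcmh_objective(F, G, B, S, gamma, eta):
    """Value of the objective J in Eq. (2).

    F, G: (c, n) continuous image/text features (columns F_{*i}, G_{*j}).
    B:    (c, n) binary codes in {-1, +1}, shared by both modalities.
    S:    (n, n) cross-modal similarity matrix (float) in {0, 1}.
    gamma, eta: hyper-parameters of Eq. (2).
    """
    theta = 0.5 * F.t() @ G
    nll = torch.sum(TF.softplus(theta) - S * theta)              # first term of (2)
    quant = torch.sum((B - F) ** 2) + torch.sum((B - G) ** 2)    # ||B-F||_F^2 + ||B-G||_F^2
    # ||F 1||_F^2 + ||G 1||_F^2: pushes each bit towards an equal
    # number of +1 and -1 over the training points.
    balance = torch.sum(F.sum(dim=1) ** 2) + torch.sum(G.sum(dim=1) ** 2)
    return nll + gamma * quant + eta * balance
```

Note that $\mathbf{B}$ appears only in the quantization term of (2), which is what keeps the $\mathbf{B}$-step of the alternating scheme in Section 3.2 simple.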
3.2. Learning

We adopt an alternating learning strategy to learn $\theta_x$, $\theta_y$, and $\mathbf{B}$. In each step we learn one of them with the other two fixed. The whole alternating learning algorithm for DCMH is briefly outlined in Algorithm 1, and the detailed derivation is presented in the rest of this subsection.

3.2.1 Learn $\theta_x$, with $\theta_y$ and $\mathbf{B}$ Fixed

When $\theta_y$ and $\mathbf{B}$ are fixed, we learn the CNN parameter $\theta_x$ of the image modality with a back-propagation (BP) algorithm. As in most existing deep learning methods [16], we utilize stochastic gradient descent (SGD) to learn $\theta_x$ with the BP algorithm. More specifically, in each iteration we sample a mini-batch of points from the training set and then carry out our learning algorithm based on the sampled data. In particular, for each sampled point $\mathbf{x}_i$, we first compute the following gradient:
$$
\frac{\partial \mathcal{J}}{\partial \mathbf{F}_{*i}} = \frac{1}{2}\sum_{j=1}^{n}\big(\sigma(\Theta_{ij})\mathbf{G}_{*j} - S_{ij}\mathbf{G}_{*j}\big) + 2\gamma\big(\mathbf{F}_{*i} - \mathbf{B}_{*i}\big) + 2\eta\,\mathbf{F}\mathbf{1}.
$$
Then $\partial\mathcal{J}/\partial\theta_x$ can be computed from $\partial\mathcal{J}/\partial\mathbf{F}_{*i}$ by the chain rule, based on which BP is used to update $\theta_x$.
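As a sanity check on this formula, the sketch below evaluates $\partial\mathcal{J}/\partial\mathbf{F}_{*i}$ directly; in practice the same gradient can be obtained through BP (e.g., automatic differentiation) once the loss of the sampled mini-batch is formed. The function name and the assumption that $\mathbf{S}$ is stored as a float tensor in the same column layout as before are illustrative only.

```python
import torch

def grad_wrt_F_col(i, F, G, B, S, gamma, eta):
    """Closed-form gradient dJ/dF_{*i} of Eq. (2), matching the formula above.

    F, G, B: (c, n) tensors; S: (n, n) float similarity matrix; i: column index.
    """
    theta_i = 0.5 * G.t() @ F[:, i]            # (n,) vector with entries Theta_{ij}
    sigma_i = torch.sigmoid(theta_i)
    return (0.5 * G @ (sigma_i - S[i])         # 0.5 * sum_j (sigma(Theta_ij) - S_ij) G_{*j}
            + 2 * gamma * (F[:, i] - B[:, i])  # gradient of the quantization term
            + 2 * eta * F.sum(dim=1))          # gradient of the balance term: 2*eta*F1
```

The update for $\theta_y$ with $\theta_x$ and $\mathbf{B}$ fixed is symmetric, and $\mathbf{B}$ is then updated with both network parameters fixed, completing one round of the alternating procedure outlined in Algorithm 1.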