\[
\frac{\partial \mathcal{J}}{\partial \mathbf{F}_{*i}} = \frac{1}{2}\sum_{j=1}^{n}\bigl(\sigma(\Theta_{ij})\mathbf{G}_{*j} - S_{ij}\mathbf{G}_{*j}\bigr) + 2\gamma(\mathbf{F}_{*i} - \mathbf{B}_{*i}) + 2\eta\mathbf{F}\mathbf{1}. \tag{3}
\]

Then we can compute $\frac{\partial \mathcal{J}}{\partial \theta_x}$ with $\frac{\partial \mathcal{J}}{\partial \mathbf{F}_{*i}}$ by using the chain rule, based on which BP can be used to update the parameter $\theta_x$.

3.2.2 Learn $\theta_y$, with $\theta_x$ and $\mathbf{B}$ Fixed

When $\theta_x$ and $\mathbf{B}$ are fixed, we also learn the neural network parameter $\theta_y$ of the text modality by using SGD with a BP algorithm.
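Before turning to the text-side details, the image-side gradient step of Eq. (3) can be made concrete. The sketch below is our own minimal NumPy illustration, not the paper's MatConvNet code; the function name and the assumed shapes (F, G, B are c×n real-valued matrices, S is the n×n similarity matrix) are our conventions:

```python
import numpy as np

def image_gradient(F, G, B, S, gamma, eta):
    """Sketch of Eq. (3): gradient of the DCMH loss w.r.t. F (c x n).

    F, G, B : c x n matrices (image outputs, text outputs, binary codes)
    S       : n x n cross-modal similarity matrix
    """
    Theta = 0.5 * F.T @ G                    # Theta_ij = (1/2) F_{*i}^T G_{*j}
    sigma = 1.0 / (1.0 + np.exp(-Theta))     # element-wise sigmoid
    # First term: (1/2) sum_j (sigma(Theta_ij) - S_ij) G_{*j} for each column i
    grad = 0.5 * G @ (sigma - S).T           # c x n
    # Quantization term: 2 * gamma * (F - B)
    grad += 2.0 * gamma * (F - B)
    # Balance term: 2 * eta * F @ 1, the same c-vector added to every column
    grad += 2.0 * eta * F.sum(axis=1, keepdims=True)
    return grad
```

Backpropagating this quantity through the image network then yields the update of $\theta_x$.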
More specifically, for each sampled point $\mathbf{y}_j$, we first compute the following gradient:

\[
\frac{\partial \mathcal{J}}{\partial \mathbf{G}_{*j}} = \frac{1}{2}\sum_{i=1}^{n}\bigl(\sigma(\Theta_{ij})\mathbf{F}_{*i} - S_{ij}\mathbf{F}_{*i}\bigr) + 2\gamma(\mathbf{G}_{*j} - \mathbf{B}_{*j}) + 2\eta\mathbf{G}\mathbf{1}. \tag{4}
\]

Then we can compute $\frac{\partial \mathcal{J}}{\partial \theta_y}$ with $\frac{\partial \mathcal{J}}{\partial \mathbf{G}_{*j}}$ by using the chain rule, based on which BP can be used to update the parameter $\theta_y$.

3.2.3 Learn $\mathbf{B}$, with $\theta_x$ and $\theta_y$ Fixed

When $\theta_x$ and $\theta_y$ are fixed, the problem in (2) can be reformulated as follows:

\[
\max_{\mathbf{B}} \ \mathrm{tr}\bigl(\mathbf{B}^{T}(\gamma(\mathbf{F}+\mathbf{G}))\bigr) = \mathrm{tr}(\mathbf{B}^{T}\mathbf{V}) = \sum_{i,j} B_{ij}V_{ij}, \qquad \text{s.t.}\ \mathbf{B}\in\{-1,+1\}^{c\times n},
\]

where $\mathbf{V} = \gamma(\mathbf{F}+\mathbf{G})$.

It is easy to find that the binary code $B_{ij}$ should keep the same sign as $V_{ij}$. Therefore, we have:

\[
\mathbf{B} = \mathrm{sign}(\mathbf{V}) = \mathrm{sign}(\gamma(\mathbf{F}+\mathbf{G})). \tag{5}
\]

3.3. Out-of-Sample Extension

For any point which is not in the training set, we can obtain its hash code as long as one of its modalities (image or text) is observed. In particular, given the image modality $\mathbf{x}_q$ of point $q$, we can adopt forward propagation to generate the hash code as follows:

\[
\mathbf{b}^{(x)}_{q} = h^{(x)}(\mathbf{x}_q) = \mathrm{sign}\bigl(f(\mathbf{x}_q;\theta_x)\bigr).
\]

Similarly, if point $q$ only has the text modality $\mathbf{y}_q$, we can also generate the hash code $\mathbf{b}^{(y)}_{q}$ as follows:

\[
\mathbf{b}^{(y)}_{q} = h^{(y)}(\mathbf{y}_q) = \mathrm{sign}\bigl(g(\mathbf{y}_q;\theta_y)\bigr).
\]

Hence, our DCMH model can be used for cross-modal search where the query points have one modality and the points in the database have the other modality.

Algorithm 1 The learning algorithm for DCMH.
Input: Image set $\mathbf{X}$, text set $\mathbf{Y}$, and cross-modal similarity matrix $\mathbf{S}$.
Output: Parameters $\theta_x$ and $\theta_y$ of the deep neural networks, and binary code matrix $\mathbf{B}$.
Initialization: Initialize neural network parameters $\theta_x$ and $\theta_y$, mini-batch size $N_x = N_y = 128$, and iteration numbers $t_x = \lceil n/N_x \rceil$, $t_y = \lceil n/N_y \rceil$.
repeat
    for iter = 1, 2, ..., $t_x$ do
        Randomly sample $N_x$ points from $\mathbf{X}$ to construct a mini-batch.
        For each sampled point $\mathbf{x}_i$ in the mini-batch, calculate $\mathbf{F}_{*i} = f(\mathbf{x}_i;\theta_x)$ by forward propagation.
        Calculate the derivative according to (3).
        Update the parameter $\theta_x$ by using back propagation.
    end for
    for iter = 1, 2, ..., $t_y$ do
        Randomly sample $N_y$ points from $\mathbf{Y}$ to construct a mini-batch.
        For each sampled point $\mathbf{y}_j$ in the mini-batch, calculate $\mathbf{G}_{*j} = g(\mathbf{y}_j;\theta_y)$ by forward propagation.
        Calculate the derivative according to (4).
        Update the parameter $\theta_y$ by using back propagation.
    end for
    Learn $\mathbf{B}$ according to (5).
until a fixed number of iterations

4. Experiment

We carry out experiments on image-text datasets to verify the effectiveness of DCMH. DCMH is implemented with the open-source deep learning toolbox MatConvNet [31] on an NVIDIA K80 GPU server.

4.1. Datasets

Three datasets, MIRFLICKR-25K [12], IAPR TC-12 [8] and NUS-WIDE [6], are used for evaluation.

The original MIRFLICKR-25K dataset [12] consists of 25,000 images collected from the Flickr website. Each image is associated with several textual tags; hence, each point is an image-text pair. We select those points which have at least 20 textual tags for our experiment. The text for each point is represented as a 1386-dimensional bag-of-words vector. For the hand-crafted feature based method, each image is represented by a 512-dimensional GIST feature vector. Furthermore, each point is manually annotated with one of the 24 unique labels.

The IAPR TC-12 dataset [8] consists of 20,000 image-text pairs which are annotated using 255 labels. We use the entire dataset for our experiment. The text for each point is represented as a 2912-dimensional bag-of-words vector. For the hand-crafted feature based method, each image is represented by a 512-dimensional GIST feature vector.