$$\frac{\partial \mathcal{J}}{\partial \mathbf{F}_{*i}} = \frac{1}{2}\sum_{j=1}^{n}\big(\sigma(\Theta_{ij})\mathbf{G}_{*j} - S_{ij}\mathbf{G}_{*j}\big) + 2\gamma(\mathbf{F}_{*i} - \mathbf{B}_{*i}) + 2\eta\mathbf{F}\mathbf{1}. \qquad (3)$$

Then we can compute $\frac{\partial \mathcal{J}}{\partial \theta_x}$ from $\frac{\partial \mathcal{J}}{\partial \mathbf{F}_{*i}}$ by using the chain rule, based on which BP can be used to update the parameter $\theta_x$.

3.2.2 Learn $\theta_y$, with $\theta_x$ and B Fixed

When $\theta_x$ and B are fixed, we also learn the neural network parameter $\theta_y$ of the text modality by using SGD with a BP algorithm. More specifically, for each sampled point $\mathbf{y}_j$, we first compute the following gradient:

$$\frac{\partial \mathcal{J}}{\partial \mathbf{G}_{*j}} = \frac{1}{2}\sum_{i=1}^{n}\big(\sigma(\Theta_{ij})\mathbf{F}_{*i} - S_{ij}\mathbf{F}_{*i}\big) + 2\gamma(\mathbf{G}_{*j} - \mathbf{B}_{*j}) + 2\eta\mathbf{G}\mathbf{1}. \qquad (4)$$

Then we can compute $\frac{\partial \mathcal{J}}{\partial \theta_y}$ from $\frac{\partial \mathcal{J}}{\partial \mathbf{G}_{*j}}$ by using the chain rule, based on which BP can be used to update the parameter $\theta_y$.

3.2.3 Learn B, with $\theta_x$ and $\theta_y$ Fixed

When $\theta_x$ and $\theta_y$ are fixed, the problem in (2) can be reformulated as follows:

$$\max_{\mathbf{B}} \; \mathrm{tr}\big(\mathbf{B}^T(\gamma(\mathbf{F} + \mathbf{G}))\big) = \mathrm{tr}(\mathbf{B}^T\mathbf{V}) = \sum_{i,j} B_{ij}V_{ij}, \quad \text{s.t. } \mathbf{B} \in \{-1, +1\}^{c \times n},$$

where $\mathbf{V} = \gamma(\mathbf{F} + \mathbf{G})$.

It is easy to see that the binary code $B_{ij}$ should keep the same sign as $V_{ij}$. Therefore, we have:

$$\mathbf{B} = \mathrm{sign}(\mathbf{V}) = \mathrm{sign}(\gamma(\mathbf{F} + \mathbf{G})). \qquad (5)$$

3.3. Out-of-Sample Extension

For any point which is not in the training set, we can obtain its hash code as long as one of its modalities (image or text) is observed. In particular, given the image modality $\mathbf{x}_q$ of a point $q$, we can adopt forward propagation to generate the hash code as follows:

$$\mathbf{b}_q^{(x)} = h^{(x)}(\mathbf{x}_q) = \mathrm{sign}(f(\mathbf{x}_q; \theta_x)).$$

Similarly, if point $q$ only has the text modality $\mathbf{y}_q$, we can also generate the hash code $\mathbf{b}_q^{(y)}$ as follows:

$$\mathbf{b}_q^{(y)} = h^{(y)}(\mathbf{y}_q) = \mathrm{sign}(g(\mathbf{y}_q; \theta_y)).$$

Hence, our DCMH model can be used for cross-modal search where the query points have one modality and the points in the database have the other modality.

Algorithm 1 The learning algorithm for DCMH.
Input: Image set X, text set Y, and cross-modal similarity matrix S.
Output: Parameters $\theta_x$ and $\theta_y$ of the deep neural networks, and binary code matrix B.
Initialization: Initialize neural network parameters $\theta_x$ and $\theta_y$, mini-batch size $N_x = N_y = 128$, and iteration numbers $t_x = \lceil n/N_x \rceil$, $t_y = \lceil n/N_y \rceil$.
repeat
  for iter = 1, 2, ..., $t_x$ do
    Randomly sample $N_x$ points from X to construct a mini-batch.
    For each sampled point $\mathbf{x}_i$ in the mini-batch, calculate $\mathbf{F}_{*i} = f(\mathbf{x}_i; \theta_x)$ by forward propagation.
    Calculate the derivative according to (3).
    Update the parameter $\theta_x$ by using back propagation.
  end for
  for iter = 1, 2, ..., $t_y$ do
    Randomly sample $N_y$ points from Y to construct a mini-batch.
    For each sampled point $\mathbf{y}_j$ in the mini-batch, calculate $\mathbf{G}_{*j} = g(\mathbf{y}_j; \theta_y)$ by forward propagation.
    Calculate the derivative according to (4).
    Update the parameter $\theta_y$ by using back propagation.
  end for
  Learn B according to (5).
until a fixed number of iterations
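To make the alternating optimization above concrete, the following NumPy sketch computes the gradients of Eq. (3) and Eq. (4) for all columns at once, the closed-form B-step of Eq. (5), and the out-of-sample binarization of Section 3.3. It is an illustration under assumed matrix shapes, not the authors' MatConvNet implementation; the function names (grad_F, grad_G, update_B, hash_code) are ours, while F, G, B, S, gamma, and eta correspond to the symbols in the paper, with F, G, B of size c x n, S of size n x n, and Theta_ij = 0.5 * F_{*i}^T G_{*j} as defined earlier in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_F(F, G, S, B, gamma, eta):
    """All columns of Eq. (3): dJ/dF_{*i} for i = 1..n, stacked as a c x n matrix."""
    Theta = 0.5 * F.T @ G                               # Theta[i, j] = 0.5 * F_{*i}^T G_{*j}
    dF = 0.5 * G @ (sigmoid(Theta) - S).T               # (1/2) sum_j (sigma(Theta_ij) - S_ij) G_{*j}
    dF += 2.0 * gamma * (F - B)                         # quantization term
    dF += 2.0 * eta * F.sum(axis=1, keepdims=True)      # 2 * eta * F 1, broadcast over columns
    return dF

def grad_G(F, G, S, B, gamma, eta):
    """All columns of Eq. (4): dJ/dG_{*j} for j = 1..n, stacked as a c x n matrix."""
    Theta = 0.5 * F.T @ G
    dG = 0.5 * F @ (sigmoid(Theta) - S)                 # (1/2) sum_i (sigma(Theta_ij) - S_ij) F_{*i}
    dG += 2.0 * gamma * (G - B)
    dG += 2.0 * eta * G.sum(axis=1, keepdims=True)      # 2 * eta * G 1
    return dG

def update_B(F, G, gamma):
    """Closed-form B-step of Eq. (5). Note np.sign returns 0 for exact zeros,
    whereas the paper's B lies in {-1, +1}; this suffices for a sketch."""
    return np.sign(gamma * (F + G))

def hash_code(net_output):
    """Out-of-sample extension (Section 3.3): binarize the network output of an unseen query."""
    return np.sign(net_output)

# Toy shapes to check that everything runs: c = 16 bits, n = 8 training points.
c, n = 16, 8
rng = np.random.default_rng(0)
F, G = rng.normal(size=(c, n)), rng.normal(size=(c, n))
S = (rng.random((n, n)) > 0.5).astype(float)
B = update_B(F, G, gamma=1.0)
print(grad_F(F, G, S, B, gamma=1.0, eta=1.0).shape)     # (16, 8)
print(grad_G(F, G, S, B, gamma=1.0, eta=1.0).shape)     # (16, 8)
```

In Algorithm 1, these gradients with respect to F and G would then be propagated back through the image CNN and the text network by whatever deep learning framework is used, and the B-step is applied once per outer iteration.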
4. Experiment

We carry out experiments on image-text datasets to verify the effectiveness of DCMH. DCMH is implemented with the open-source deep learning toolbox MatConvNet [31] on an NVIDIA K80 GPU server.

4.1. Datasets

Three datasets, MIRFLICKR-25K [12], IAPR TC-12 [8] and NUS-WIDE [6], are used for evaluation.

The original MIRFLICKR-25K dataset [12] consists of 25,000 images collected from the Flickr website. Each image is associated with several textual tags; hence, each point is an image-text pair. We select those points which have at least 20 textual tags for our experiment. The text for each point is represented as a 1386-dimensional bag-of-words vector. For the hand-crafted feature based method, each image is represented by a 512-dimensional GIST feature vector. Furthermore, each point is manually annotated with one of the 24 unique labels.

The IAPR TC-12 dataset [8] consists of 20,000 image-text pairs which are annotated using 255 labels. We use the entire dataset for our experiment. The text for each point is represented as a 2912-dimensional bag-of-words vector. For the hand-crafted feature based method, each image is represented by a 512-dimensional GIST feature vector.
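As a small illustration of the bag-of-words text representation mentioned for both datasets above, the sketch below builds such a vector from a point's tag set. The vocabulary here is a toy, hypothetical one; MIRFLICKR-25K and IAPR TC-12 use their own 1386- and 2912-word vocabularies, and whether binary indicators or tag counts are used is not spelled out in this section (the sketch uses binary indicators).

```python
import numpy as np

# Hypothetical toy vocabulary; the real datasets derive theirs from their own tag sets.
vocabulary = ["beach", "dog", "night", "portrait", "sky", "sunset"]
word_index = {w: i for i, w in enumerate(vocabulary)}

def bag_of_words(tags, word_index):
    """Map a point's textual tags to a binary bag-of-words vector."""
    vec = np.zeros(len(word_index), dtype=np.float32)
    for tag in tags:
        if tag in word_index:
            vec[word_index[tag]] = 1.0
    return vec

print(bag_of_words(["sunset", "beach", "sky"], word_index))  # [1. 0. 0. 0. 1. 1.]
```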
<<ๅ‘ไธŠ็ฟป้กตๅ‘ไธ‹็ฟป้กต>>