The NUS-WIDE dataset [6] contains 260,648 web images, and some images are associated with textual tags. It is a multi-label dataset where each point is annotated with one or multiple labels from 81 concept labels. We select 195,834 image-text pairs that belong to the 21 most frequent concepts. The text for each point is represented as a 1000-dimensional bag-of-words (BOW) vector. The hand-crafted feature for each image is a 500-dimensional bag-of-visual-words (BOVW) vector.

For all datasets, image i and text j are considered to be similar if point i and point j share at least one common label. Otherwise, they are considered to be dissimilar.
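To make this ground-truth definition concrete, the cross-modal similarity matrix can be computed directly from the binary label matrices. The following is a minimal numpy sketch; the function name and the matrix representation are our own illustration, not part of the paper:

import numpy as np

def pairwise_similarity(labels_a, labels_b):
    """Ground-truth cross-modal similarity: S[i, j] = 1 iff point i
    (e.g., an image) and point j (e.g., a text) share at least one label.

    labels_a: (n, c) binary multi-label matrix
    labels_b: (m, c) binary multi-label matrix
    """
    # The inner product counts shared labels; any positive count means similar.
    return (labels_a @ labels_b.T > 0).astype(np.int8)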
4.2. Evaluation Protocol and Baseline

4.2.1 Evaluation Protocol

For the MIRFLICKR-25K and IAPR TC-12 datasets, we randomly sample 2,000 data points as the test (query) set and use the remaining points as the retrieval set (database). For the NUS-WIDE dataset, we take 2,100 data points as the test set and the rest as the retrieval set. Moreover, we sample 10,000 data points from the retrieval set as the training set for MIRFLICKR-25K and IAPR TC-12. For the NUS-WIDE dataset, we sample 10,500 data points from the retrieval set as the training set. The ground-truth neighbors are defined as those image-text pairs which share at least one common label.

For hashing-based retrieval, Hamming ranking and hash lookup are two widely used retrieval protocols [25]. We adopt both protocols to evaluate our method and the baselines. The Hamming ranking protocol ranks the points in the database (retrieval set) according to their Hamming distances to the given query point, in increasing order. Mean average precision (MAP) [25] is a widely used metric to measure the accuracy of the Hamming ranking protocol. The hash lookup protocol returns all the points within a certain Hamming radius of the query point. The precision-recall curve is a widely used metric to measure the accuracy of the hash lookup protocol.
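For concreteness, both protocols can be evaluated with a few lines of numpy. The sketch below assumes hash codes stored as ±1 matrices and the ground-truth similarity defined above; the function names are ours, and MAP is computed over the full ranking (some evaluations instead truncate at a top-R cutoff):

import numpy as np

def hamming_distance(query_code, db_codes):
    """Hamming distances between one ±1 code and a ±1 code matrix."""
    bits = db_codes.shape[1]
    return 0.5 * (bits - db_codes @ query_code)

def mean_average_precision(query_codes, db_codes, sim):
    """MAP under the Hamming ranking protocol.

    sim[i, j] = 1 iff query i and database point j are true neighbors."""
    aps = []
    for i, q in enumerate(query_codes):
        order = np.argsort(hamming_distance(q, db_codes))  # increasing distance
        rel = sim[i, order]                                # relevance in ranked order
        if rel.sum() == 0:
            continue                                       # query has no true neighbors
        precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append((precision_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))

def hash_lookup_precision(query_codes, db_codes, sim, radius=2):
    """Precision of the hash lookup protocol: the fraction of points
    returned within the given Hamming radius that are true neighbors."""
    precisions = []
    for i, q in enumerate(query_codes):
        returned = hamming_distance(q, db_codes) <= radius
        if returned.sum() == 0:
            continue                                       # nothing retrieved
        precisions.append(sim[i, returned].mean())
    return float(np.mean(precisions))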
4.2.2 Baseline

Six state-of-the-art cross-modal hashing methods are adopted as baselines for comparison, including SePH [22], STMH [33], SCM [35], CMFH [7], CCA [11] and DVSH [3]. Since DVSH can only be used for a special CMH case where one of the modalities has to be temporal dynamics, we compare DCMH with DVSH only on the IAPR TC-12 dataset, where the original texts are sentences which can be treated as temporal dynamics. The texts in MIRFLICKR-25K and NUS-WIDE are tags, which are not suitable for DVSH. Please note that the texts are represented as BOW vectors for all the evaluated methods except DVSH.

Source codes of SePH, STMH and SCM are kindly provided by the corresponding authors. For CMFH and CCA, whose codes are not available, we carefully implement them ourselves. SePH is a kernel-based method, for which we use the RBF kernel and take 500 randomly selected points as kernel bases, following its authors' suggestion. In SePH, the authors propose two strategies to construct the hash codes for retrieval (database) points, according to whether both modalities of a point are observed or not. However, in this paper we only use one modality for the database (retrieval) points², because the focus of this paper is on cross-modal retrieval. All the other parameters for all baselines are set according to the suggestions in their original papers.
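The kernel construction used for SePH can be illustrated with a short sketch. The bandwidth heuristic below is our assumption, since the text only specifies an RBF kernel with 500 randomly selected kernel bases:

import numpy as np

def rbf_kernel_features(features, n_anchors=500, seed=0):
    """Map each point to its RBF kernel values against randomly chosen anchors.

    features: (n, d) feature matrix for the training points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=n_anchors, replace=False)
    anchors = features[idx]
    # Squared Euclidean distances between all points and the anchors: (n, n_anchors).
    sq_dists = ((features[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    sigma2 = sq_dists.mean()  # bandwidth set to the mean squared distance (assumed)
    return np.exp(-sq_dists / (2.0 * sigma2))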
For DCMH, we use a validation set to choose the hyper-parameters γ and η, and find that good performance can be achieved with γ = η = 1. Hence, we set γ = η = 1 for DCMH. We exploit the CNN-F network [5] pre-trained on the ImageNet dataset [27] to initialize the first seven layers of the CNN for the image modality. All the other parameters of the deep neural networks in DCMH are randomly initialized. The input for the image modality is the raw pixels, and that for the text modality is the BOW vectors. We fix the mini-batch size to 128 and set the number of outer-loop iterations in Algorithm 1 to 500. The learning rate is chosen from 10⁻⁶ to 10⁻¹ with a validation set. All experiments are run five times, and the average performance is reported.

4.3. Accuracy

4.3.1 Hamming Ranking

The MAP results for DCMH and the other baselines with hand-crafted features on the MIRFLICKR-25K, IAPR TC-12 and NUS-WIDE datasets are reported in Table 3. Here, "I→T" denotes the case where the query is an image and the database is text, and "T→I" denotes the case where the query is text and the database is images. We can find that DCMH outperforms all the other baselines with hand-crafted features.

To further verify the effectiveness of DCMH, we exploit the CNN-F deep network [5] pre-trained on the ImageNet dataset, which is the same as the initial CNN of the image modality in DCMH, to extract CNN features. All the baselines are trained based on these CNN features. The MAP results for DCMH and the other baselines with CNN features on the three datasets are reported in Table 4. We can find that DCMH outperforms all the other baselines except SePH for image-to-text retrieval on NUS-WIDE.

²For both SePH and DCMH, the accuracy obtained by using both modalities for database points is typically higher than that obtained by using only one modality. DCMH still outperforms SePH in the case where both modalities are used for database points. This result is omitted due to space limitations.