IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. X, XXX 2019

Discrete Latent Factor Model for Cross-Modal Hashing

Qing-Yuan Jiang, Wu-Jun Li, Member, IEEE

Abstract—Due to its storage and retrieval efficiency, cross-modal hashing (CMH) has been widely used for cross-modal similarity search in many multimedia applications. According to the training strategy, existing CMH methods can be mainly divided into two categories: relaxation-based continuous methods and discrete methods. In general, the training of relaxation-based continuous methods is faster than that of discrete methods, but the accuracy of relaxation-based continuous methods is not satisfactory. On the contrary, the accuracy of discrete methods is typically better than that of relaxation-based continuous methods, but the training of discrete methods is very time-consuming. In this paper, we propose a novel CMH method, called discrete latent factor model based cross-modal hashing (DLFH), for cross-modal similarity search. DLFH is a discrete method which can directly learn the binary hash codes for CMH. At the same time, the training of DLFH is efficient. Experiments show that DLFH can achieve significantly better accuracy than existing methods, and the training time of DLFH is comparable to that of relaxation-based continuous methods, which are much faster than existing discrete methods.

Index Terms—Approximate nearest neighbor, cross-modal retrieval, hashing, multimedia.

Manuscript received July 20, 2018; revised December 30, 2018; accepted January 22, 2019. This work was supported in part by the NSFC-NRF Joint Research Project under Grant 61861146001, and in part by the NSFC under Grant 61472182. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xiaochun Cao. (Corresponding author: Wu-Jun Li)
All authors are with the National Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China (E-mail: jiangqy@lamda.nju.edu.cn; liwujun@nju.edu.cn).

I. INTRODUCTION

Nearest neighbor (NN) search plays a fundamental role in many areas including machine learning, information retrieval, computer vision and so on. In many real applications, there is no need to return exact nearest neighbors for every given query, and approximate nearest neighbor (ANN) search is enough to achieve satisfactory performance [1]–[3]. Because ANN search might be much faster than exact NN search, ANN search has become an active research topic with a wide range of applications, especially for large-scale problems [1]–[3].

Among existing ANN search methods, hashing methods have attracted much more attention due to their storage and retrieval efficiency in real applications [4]–[23]. The goal of hashing is to embed data points from the original space into a Hamming space where the similarity is preserved. More specifically, in the Hamming space, each data point will be represented as a binary code. Based on the binary code representation, the storage cost can be dramatically reduced, and furthermore we can achieve constant or sub-linear search speed, which is much faster than the search speed in the original space [12], [13], [24], [25].
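To make the storage and speed benefits concrete, the following minimal sketch (illustrative only; the array sizes, code length, and function names are our own assumptions, not taken from the paper) packs ±1 binary codes into bit arrays and computes Hamming distances with XOR and bit counting, which is why binary codes support very compact storage and fast search.

```python
import numpy as np

def pack_codes(codes_pm1):
    """Pack a matrix of {-1,+1} codes (n x c) into uint8 bit arrays (n x c/8)."""
    bits = (codes_pm1 > 0).astype(np.uint8)      # map {-1,+1} -> {0,1}
    return np.packbits(bits, axis=1)             # 8x smaller than int8, 32x smaller than float32

def hamming_distances(query_packed, db_packed):
    """Hamming distance between one packed query code and all packed database codes."""
    xor = np.bitwise_xor(db_packed, query_packed)     # differing bits
    # popcount by unpacking bits and summing (a lookup table would be faster)
    return np.unpackbits(xor, axis=1).sum(axis=1)

# toy example with hypothetical sizes: 1000 database points, 64-bit codes
rng = np.random.default_rng(0)
db = np.sign(rng.standard_normal((1000, 64)))
q = np.sign(rng.standard_normal((1, 64)))
dists = hamming_distances(pack_codes(q), pack_codes(db))
print(dists.shape, dists.min())
```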
Early hashing methods were mainly proposed for uni-modal data to perform uni-modal similarity search. In recent years, with the explosive growth of multimedia data in real applications, multi-modal similarity search has attracted a lot of attention. For example, given a text query, a multi-modal similarity search system can return the nearest images or videos in the database. To achieve efficient performance for large-scale problems, multi-modal hashing (MMH) has been proposed for multi-modal search [26]–[28].

Existing MMH methods can be divided into two major categories: multi-source hashing (MSH) [15], [29]–[31] and cross-modal hashing (CMH) [26], [27], [32]–[34]. MSH methods aim to learn binary hash codes by utilizing information from multiple modalities for each point. In other words, all these multiple modalities should be observed for all data points, including the query points and those in the database, under MSH settings. Because it is usually difficult to observe all the modalities in many real applications, the application scenarios for MSH methods are limited. Unlike MSH methods, CMH methods usually require only one modality for a query point to perform search in a database with other modalities. The application scenarios for CMH are more flexible than those for MSH. For example, CMH can perform text-to-image or image-to-text retrieval tasks in real applications. Hence, CMH has gained more attention than MSH [26], [27].

A lot of CMH methods have been proposed in recent years. Based on whether supervised information is used or not during the training procedure, existing CMH methods can be further divided into two categories: unsupervised CMH and supervised CMH. Unsupervised CMH methods directly explore data features without supervised information to learn binary codes (or hash functions). Representative unsupervised CMH methods include canonical correlation analysis iterative quantization (CCA-ITQ) [6], collective matrix factorization hashing (CMFH) [27], alternating co-quantization (ACQ) [35] and unsupervised generative adversarial cross-modal hashing (UGACH) [36]. Supervised CMH tries to learn the hash function by utilizing supervised information. As supervised CMH methods can incorporate semantic labels to mitigate the semantic gap, supervised CMH methods can achieve better accuracy than unsupervised CMH methods. Representative supervised CMH methods include multi-modal latent binary embedding (MLBE) [37], semantic correlation maximization (SCM) [26], semantics preserving
hashing (SePH) [38], supervised matrix factorization hashing (SMFH) [39], binary set hashing (BSH) [40], deep cross-modal hashing (DCMH) [34], cross-modal Hamming hashing (CMHH) [41], and generalized semantic preserving hashing (GSPH) [42].

According to the training strategy, existing CMH methods can be divided into two categories: relaxation-based continuous methods and discrete methods. Hashing is essentially a discrete learning problem. To avoid the difficulty caused by discrete learning, relaxation-based continuous methods try to solve a relaxed continuous problem with some relaxation strategy. Representative continuous methods include CMFH [27], cross-view hashing (CVH) [43], SCM [26], SMFH [39] and GSPH [42]. Discrete methods try to directly solve the discrete problem without continuous relaxation. Representative discrete methods include CCA-ITQ [6], ACQ [35], MLBE [37], predictable dual-view hashing (PDH) [44] and SePH [38]. In general, the training of relaxation-based continuous methods is faster than that of discrete methods, but the accuracy of relaxation-based continuous methods is not satisfactory. On the contrary, the accuracy of discrete methods is typically better than that of relaxation-based continuous methods, but the training of discrete methods is time-consuming.

In this paper, we propose a novel CMH method, called discrete latent factor model based cross-modal hashing (DLFH), for cross-modal similarity search. The contributions of DLFH are outlined as follows:

• DLFH is a supervised CMH method, and in DLFH a novel discrete latent factor model is proposed to model the supervised information.
• DLFH is a discrete method which can directly learn the binary hash codes without continuous relaxation.
• A novel discrete learning algorithm is proposed for DLFH, which can be proved to be convergent. Furthermore, the implementation of DLFH is simple.
• The training (learning) of DLFH is still efficient although DLFH is a discrete method.
• Experiments on real datasets show that DLFH can achieve significantly better accuracy than existing methods, including both relaxation-based continuous methods and existing discrete methods. Experimental results also show that the training speed of DLFH is comparable to that of relaxation-based continuous methods, and is much faster than that of existing discrete methods.

The rest of this paper is organized as follows. In Section II, we briefly review the related works. In Section III, we describe the notations and problem definition. In Section IV, we present our DLFH in detail, including model formulation and learning algorithm. In Section V, we carry out experiments for evaluation on three widely used datasets. At last, we conclude the paper in Section VI.

II. RELATED WORK

In this section, we briefly review the related works of cross-modal hashing, including continuous cross-modal hashing and discrete cross-modal hashing.

A. Continuous Cross-Modal Hashing

Continuous CMH methods usually adopt a relaxation strategy for learning. More specifically, these methods adopt a relaxation strategy to learn continuous representations at the first stage, and then utilize some rounding techniques to generate discrete binary codes at the second stage. Representative continuous CMH methods include CMFH [27], BSH [40], SCM [26], SMFH [39] and GSPH [42]. CMFH is an unsupervised CMH method which adopts collective matrix factorization to learn cross-view hash functions.
BSH, SCM, SMFH and GSPH are supervised CMH methods. BSH tries to preserve the inter-modal and intra-modal similarity by learning two projections and taking the geometric structures of each modality into account. SCM learns two hash functions by integrating semantic labels into the learning procedure. To perform efficient training and fully utilize supervised information, SCM proposes an approximation method to avoid explicit computation of the pairwise similarity matrix. SMFH is the supervised version of CMFH, which integrates supervised information into the learning procedure to further improve retrieval performance. GSPH tries to design a generalized hashing framework to handle a wide range of scenarios.

One shortcoming of the continuous methods is that the relaxation procedure might result in a sub-optimal solution [45].

B. Discrete Cross-Modal Hashing

Discrete CMH methods try to directly learn binary codes without discarding the discrete constraints. Representative discrete CMH methods include CCA-ITQ [46], ACQ [35], MLBE [37], PDH [44], SePH [38] and DCMH [34]. CCA-ITQ, ACQ and PDH are unsupervised CMH methods. CCA-ITQ and ACQ utilize dimension reduction technologies, e.g., canonical correlation analysis (CCA) and neighborhood preserving embedding (NPE) [47], to embed multimodal data into one common subspace. Then, they minimize the quantization error to learn binary codes. PDH adopts max-margin theory to learn binary codes. By using an iterative optimization method based on block coordinate descent, PDH tries to maintain the predictability of the binary codes. MLBE, SePH and DCMH are supervised CMH methods. MLBE leverages a probabilistic latent factor model to learn binary codes, which are devised as binary latent factors, and designs an alternating learning algorithm to learn the binary latent factors (i.e., binary codes). Although MLBE is a supervised CMH method, the complexity of MLBE is extremely high. Hence, the training of MLBE is extremely time-consuming and it cannot scale to large-scale datasets. SePH aims to transform the semantic affinities between training data into a probability distribution by minimizing the Kullback-Leibler divergence. Furthermore, SePH utilizes a kernel logistic regression method as an out-of-sample strategy to learn binary codes for unseen data. SePH relaxes binary codes to real-valued continuous codes and imposes a quantization error term to learn the binary codes. In the meantime, the time complexity of SePH is also high. DCMH is a deep learning based cross-modal hashing method which integrates feature learning and binary code
learning simultaneously with deep neural networks. The time complexity of DCMH is also high.

Typically, discrete methods can outperform continuous methods in terms of retrieval accuracy. However, the high complexity makes the training of discrete supervised CMH time-consuming. To make the training practicable, discrete supervised CMH methods usually sample a subset from the database during the training procedure. Hence, existing discrete supervised CMH methods cannot fully utilize the supervised information, which deteriorates the accuracy. In summary, to fully utilize the supervised information, we need to design a scalable discrete CMH method to further improve retrieval accuracy and scalability for large-scale datasets.

III. NOTATIONS AND PROBLEM DEFINITION

We briefly introduce the notations and problem definition in this section.

A. Notations

We use bold uppercase letters like $\mathbf{U}$ and bold lowercase letters like $\mathbf{u}$ to denote matrices and vectors, respectively. The element at position $(i, j)$ of matrix $\mathbf{U}$ is denoted as $U_{ij}$. The $i$th row of matrix $\mathbf{U}$ is denoted as $\mathbf{U}_{i*}$, and the $j$th column of matrix $\mathbf{U}$ is denoted as $\mathbf{U}_{*j}$. $\|\cdot\|_F$ and $\mathbf{U}^T$ denote the Frobenius norm of a matrix and the transpose of matrix $\mathbf{U}$, respectively. $\mathrm{sign}(\cdot)$ is an element-wise sign function.

B. Problem Definition

Without loss of generality, we assume there exist only two modalities in the data, although our DLFH can be easily adapted to more modalities. We use $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]^T \in \mathbb{R}^{n \times d_x}$ and $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n]^T \in \mathbb{R}^{n \times d_y}$ to denote the feature vectors of the two modalities (modality $x$ and modality $y$), where $d_x$ and $d_y$ respectively denote the dimensionality of the feature spaces for the two modalities and $n$ is the number of training points. In particular, $\mathbf{x}_i$ and $\mathbf{y}_i$ denote the feature vectors of the two modalities for training point $i$, respectively. Without loss of generality, the data are assumed to be zero-centered, which means $\sum_{i=1}^{n} \mathbf{x}_i = \mathbf{0}$ and $\sum_{i=1}^{n} \mathbf{y}_i = \mathbf{0}$. Here, we assume that both modalities are observed for all training points. However, we do not require that both modalities are observed for query (test) points. Hence, the setting is cross-modal. Actually, DLFH can be easily adapted to cases where some training points have missing modalities, which will be left for future study. In this paper, we focus on supervised CMH, which has shown better accuracy than unsupervised CMH [16], [17], [26], [38], [44]. In supervised CMH, besides the feature vectors $\mathbf{X}$ and $\mathbf{Y}$, we are also given a cross-modal supervised similarity matrix $\mathbf{S} \in \{0, 1\}^{n \times n}$. If $S_{ij} = 1$, it means that point $\mathbf{x}_i$ and point $\mathbf{y}_j$ are similar. Otherwise, $\mathbf{x}_i$ and $\mathbf{y}_j$ are dissimilar. Here, we assume all elements of $\mathbf{S}$ are observed. But our DLFH can also be adapted for cases with missing elements in $\mathbf{S}$. $S_{ij}$ can be manually labeled by users, or constructed from the labels of point $i$ and point $j$.

We use $\mathbf{U}, \mathbf{V} \in \{-1, +1\}^{n \times c}$ to respectively denote the binary codes for modality $x$ and modality $y$, where $\mathbf{U}_{i*}$ and $\mathbf{V}_{i*}$ respectively denote the binary hash codes of the two modalities for point $i$ and $c$ is the length of the binary code. The goal of supervised CMH is to learn the binary codes $\mathbf{U}$ and $\mathbf{V}$, which try to preserve the similarity information in $\mathbf{S}$. In other words, if $S_{ij} = 1$, the Hamming distance between $\mathbf{U}_{i*}$ and $\mathbf{V}_{j*}$ should be as small as possible, and vice versa.
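As an illustration of this problem setup, the sketch below (illustrative only; the label matrices and sizes are hypothetical and not from the paper) builds a cross-modal similarity matrix $\mathbf{S}$ from multi-label annotations, where two points are similar if they share at least one label, and checks the standard identity linking Hamming distance to the code inner product, $\mathrm{dist}_H(\mathbf{U}_{i*}, \mathbf{V}_{j*}) = (c - \mathbf{U}_{i*}\mathbf{V}_{j*}^T)/2$.

```python
import numpy as np

def build_similarity(labels_x, labels_y):
    """S_ij = 1 if point i (modality x) and point j (modality y) share >= 1 label."""
    # labels_*: (n, #labels) binary multi-label matrices (hypothetical inputs)
    return (labels_x @ labels_y.T > 0).astype(np.int8)

def hamming_from_inner_product(U, V):
    """Hamming distance between all pairs of {-1,+1} codes via the inner-product identity."""
    c = U.shape[1]
    return (c - U @ V.T) / 2.0

# toy check with hypothetical sizes
rng = np.random.default_rng(0)
L = rng.integers(0, 2, size=(5, 3))            # 5 points, 3 labels (shared by both modalities here)
S = build_similarity(L, L)
U = np.sign(rng.standard_normal((5, 8)))       # 8-bit codes for modality x
V = np.sign(rng.standard_normal((5, 8)))       # 8-bit codes for modality y
assert np.allclose(hamming_from_inner_product(U, V)[0, 0],
                   np.sum(U[0] != V[0]))       # matches the element-wise disagreement count
```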
Furthermore, we also need to learn two hash functions $h_x(\mathbf{x}_q) \in \{-1, +1\}^c$ and $h_y(\mathbf{y}_q) \in \{-1, +1\}^c$, respectively for modality $x$ and modality $y$, which can compute binary hash codes for any new query point ($\mathbf{x}_q$ or $\mathbf{y}_q$) unseen in the training set.

IV. DISCRETE LATENT FACTOR MODEL BASED CROSS-MODAL HASHING

In this section, we introduce the details of DLFH, including model formulation and learning algorithm.

A. Model Formulation

Given a binary code pair $\{\mathbf{U}_{i*}, \mathbf{V}_{j*}\}$, we define $\Theta_{ij}$ as $\Theta_{ij} = \frac{\lambda}{c}\mathbf{U}_{i*}\mathbf{V}_{j*}^T$, where $c$ is the code length which is pre-specified, and $\lambda > 0$ is a hyper-parameter denoting a scale factor for tuning. By using a logistic function, we define $A_{ij}$ as $A_{ij} = \sigma(\Theta_{ij}) = \frac{1}{1 + e^{-\Theta_{ij}}}$. Based on $A_{ij}$, we define the likelihood of the cross-modal similarity $\mathbf{S}$ as:

$$p(\mathbf{S}\,|\,\mathbf{U}, \mathbf{V}) = \prod_{i,j=1}^{n} p(S_{ij}\,|\,\mathbf{U}, \mathbf{V}),$$

where $p(S_{ij}\,|\,\mathbf{U}, \mathbf{V})$ is defined as follows:

$$p(S_{ij}\,|\,\mathbf{U}, \mathbf{V}) = \begin{cases} A_{ij}, & \text{if } S_{ij} = 1, \\ 1 - A_{ij}, & \text{otherwise.} \end{cases}$$

Then the log-likelihood of $\mathbf{U}$ and $\mathbf{V}$ can be derived as follows:

$$L = \log p(\mathbf{S}\,|\,\mathbf{U}, \mathbf{V}) = \sum_{i,j=1}^{n} \left[ S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right].$$

The model of DLFH tries to maximize the log-likelihood of $\mathbf{U}$ and $\mathbf{V}$. That is, DLFH tries to solve the following problem:

$$\max_{\mathbf{U}, \mathbf{V}} \ L = \log p(\mathbf{S}\,|\,\mathbf{U}, \mathbf{V}) = \sum_{i,j=1}^{n} \left[ S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right] \quad \text{s.t.} \ \mathbf{U}, \mathbf{V} \in \{-1, +1\}^{n \times c}, \tag{1}$$

where $\mathbf{U}$ and $\mathbf{V}$ are constrained to be in a binary space, i.e., $\{-1, +1\}^{n \times c}$, because they are binary hash codes for learning.

We can find that maximizing the objective function in (1) exactly matches the goal of hashing. More specifically, the learned binary hash codes try to preserve the similarity information in $\mathbf{S}$.

Please note that in (1), DLFH adopts the maximum likelihood loss function. Although there exist some CMH methods [34], [41] which also use a maximum likelihood loss function, none of them can utilize a discrete latent factor model to model the supervised information. On the contrary, DLFH can directly utilize the discrete latent factor model to model the supervised information and learn the binary hash codes without relaxation.
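As a concrete illustration of this objective, the following sketch (our own illustrative implementation and naming, not the authors' released code) computes $\Theta$, and the log-likelihood $L$ of problem (1), for given binary codes.

```python
import numpy as np

def dlfh_log_likelihood(U, V, S, lam):
    """L = sum_ij [ S_ij * Theta_ij - log(1 + exp(Theta_ij)) ], the objective of problem (1)."""
    n, c = U.shape
    Theta = (lam / c) * (U @ V.T)                    # Theta_ij = (lambda / c) * U_i* V_j*^T
    # log(1 + exp(Theta)) computed stably via logaddexp
    return np.sum(S * Theta - np.logaddexp(0.0, Theta))

# toy usage with hypothetical sizes: n = 6 points, c = 4 bits
rng = np.random.default_rng(0)
U = np.sign(rng.standard_normal((6, 4)))
V = np.sign(rng.standard_normal((6, 4)))
S = rng.integers(0, 2, size=(6, 6))
print(dlfh_log_likelihood(U, V, S, lam=8.0))         # always <= 0
```

Each summand equals $\log A_{ij}$ when $S_{ij} = 1$ and $\log(1 - A_{ij})$ otherwise, so the value is always non-positive, consistent with the boundedness argument used later for convergence.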
B. Learning Algorithm

Problem (1) is a discrete (binary) learning problem, which is difficult to solve. One possible solution is to relax the discrete problem to a continuous problem by discarding the discrete (binary) constraints. Similar relaxation strategies have been adopted by many existing relaxation-based continuous methods like GSPH [42], CMFH [27] and SMFH [39]. However, this relaxation may cause the solution to be sub-optimal. Hence, the search accuracy might be unsatisfactory.

In this paper, we propose a novel method to directly learn the binary codes without continuous relaxation. The two parameters $\mathbf{U}$ and $\mathbf{V}$ are learned in an alternating way. Specifically, we design an iterative learning algorithm, and in each iteration we learn one parameter with the other parameter fixed.

1) Learning U with V Fixed: We try to learn $\mathbf{U}$ with $\mathbf{V}$ fixed. Even if $\mathbf{V}$ is fixed, it is still difficult to optimize (learn) the whole $\mathbf{U}$ at one time. For example, if we simply flip the signs of each element to learn $\mathbf{U}$, the total time complexity will be $O(2^{n \times c})$, which is very high. Here, we adopt a column-wise learning strategy to optimize one column (corresponding to one bit for all data points) of $\mathbf{U}$ each time with the other columns fixed. The time complexity to directly learn one column is $O(2^n)$, which is still high. Hence, we adopt a surrogate strategy [48] to learn each column, which results in a lower time complexity of $O(n^2)$. More specifically, to optimize the $k$th column $\mathbf{U}_{*k}$, we construct a lower bound of $L(\mathbf{U}_{*k})$ and then optimize the lower bound, which yields a closed-form solution and makes the learning procedure simple and efficient. Moreover, the lower-bound based learning strategy can guarantee the solution to converge. The following content introduces the details of the column-wise learning strategy.

The gradient and Hessian of the objective function $L$ with respect to $\mathbf{U}_{*k}$ can be computed as follows (footnote 1):

$$\frac{\partial L}{\partial \mathbf{U}_{*k}} = \frac{\lambda}{c}\sum_{j=1}^{n}(\mathbf{S}_{*j} - \mathbf{A}_{*j})V_{jk}, \qquad \frac{\partial^2 L}{\partial \mathbf{U}_{*k}\partial \mathbf{U}_{*k}^T} = -\frac{\lambda^2}{c^2}\,\mathrm{diag}(a_1, a_2, \cdots, a_n),$$

where $\mathbf{A} = [A_{ij}]_{i,j=1}^{n}$, $a_i = \sum_{j=1}^{n} A_{ij}(1 - A_{ij})$, and $\mathrm{diag}(a_1, a_2, \cdots, a_n)$ denotes a diagonal matrix with the $i$th diagonal element being $a_i$.

Footnote 1: Please note that the objective function $L$ is defined on the whole real space although the variables $\mathbf{U}$ and $\mathbf{V}$ are constrained to be discrete. Hence, we can still compute the gradient and Hessian for any discrete points $\mathbf{U}$ and $\mathbf{V}$.

Let $\mathbf{U}_{*k}(t)$ denote the value of $\mathbf{U}_{*k}$ at the $t$-th iteration, $\frac{\partial L}{\partial \mathbf{U}_{*k}}(t)$ denote the gradient with respect to $\mathbf{U}_{*k}(t)$, and $\mathbf{H} = -\frac{n\lambda^2}{4c^2}\mathbf{I}$, where $\mathbf{I}$ is an identity matrix. We define $\widetilde{L}(\mathbf{U}_{*k})$ as follows:

$$\begin{aligned}
\widetilde{L}(\mathbf{U}_{*k}) &= L(\mathbf{U}_{*k}(t)) + [\mathbf{U}_{*k} - \mathbf{U}_{*k}(t)]^T \frac{\partial L}{\partial \mathbf{U}_{*k}}(t) + \frac{1}{2}[\mathbf{U}_{*k} - \mathbf{U}_{*k}(t)]^T \mathbf{H} [\mathbf{U}_{*k} - \mathbf{U}_{*k}(t)] \\
&= \mathbf{U}_{*k}^T\Big[\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) - \mathbf{H}\mathbf{U}_{*k}(t)\Big] - [\mathbf{U}_{*k}(t)]^T\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) + L(\mathbf{U}_{*k}(t)) + \frac{[\mathbf{U}_{*k}(t)]^T\mathbf{H}\mathbf{U}_{*k}(t)}{2} + \frac{\mathbf{U}_{*k}^T\mathbf{H}\mathbf{U}_{*k}}{2} \\
&= \mathbf{U}_{*k}^T\Big[\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) - \mathbf{H}\mathbf{U}_{*k}(t)\Big] - [\mathbf{U}_{*k}(t)]^T\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) + L(\mathbf{U}_{*k}(t)) + \frac{[\mathbf{U}_{*k}(t)]^T\mathbf{H}\mathbf{U}_{*k}(t)}{2} - \frac{\lambda^2 n^2}{8c^2} \\
&= \mathbf{U}_{*k}^T\Big[\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) - \mathbf{H}\mathbf{U}_{*k}(t)\Big] + \mathrm{const}, 
\end{aligned} \tag{2}$$

where $\mathrm{const}$ denotes the constant term $L(\mathbf{U}_{*k}(t)) - \frac{\lambda^2 n^2}{8c^2} + \frac{1}{2}[\mathbf{U}_{*k}(t)]^T\mathbf{H}\mathbf{U}_{*k}(t) - [\mathbf{U}_{*k}(t)]^T\frac{\partial L}{\partial \mathbf{U}_{*k}}(t)$, which is independent of the variable $\mathbf{U}_{*k}$.

Theorem 1. $\widetilde{L}(\mathbf{U}_{*k})$ is a lower bound of $L(\mathbf{U}_{*k})$.

To prove Theorem 1, we first introduce the following lemma.
Lemma 1. For a concave function $f(\mathbf{x})$ with bounded curvature, if the Hessian $\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}\partial \mathbf{x}^T}$ satisfies $\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}\partial \mathbf{x}^T} \succ \mathbf{D}$ for some matrix $\mathbf{D} \prec 0$, we have:

$$f(\mathbf{y}) \geq f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} + \frac{1}{2}(\mathbf{y} - \mathbf{x})^T\mathbf{D}(\mathbf{y} - \mathbf{x}).$$

Here, $\mathbf{D} \prec 0$ denotes that $\mathbf{D}$ is a negative definite matrix, and $\mathbf{B} \succ \mathbf{C}$ indicates that $\mathbf{B} - \mathbf{C}$ is a positive definite matrix. The proof for Lemma 1 can be found in [48], [49]. Based on Lemma 1, we can prove Theorem 1.

Proof of Theorem 1. It is easy to see that $L(\mathbf{U}_{*k})$ is a concave function with bounded curvature. Furthermore, because $0 < A_{ij} < 1$, we have $0 < A_{ij}(1 - A_{ij}) < \frac{1}{4}$. Then we can get $\frac{\partial^2 L}{\partial \mathbf{U}_{*k}\partial \mathbf{U}_{*k}^T} \succ \mathbf{H}$. According to Lemma 1, we can get $L(\mathbf{U}_{*k}) \geq \widetilde{L}(\mathbf{U}_{*k})$. This means $\widetilde{L}(\mathbf{U}_{*k})$ is a lower bound of $L(\mathbf{U}_{*k})$.

Then, we learn the column $\mathbf{U}_{*k}$ by maximizing the lower bound $\widetilde{L}(\mathbf{U}_{*k})$:

$$\max_{\mathbf{U}_{*k}} \ \widetilde{L}(\mathbf{U}_{*k}) = \mathbf{U}_{*k}^T\Big[\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) - \mathbf{H}\mathbf{U}_{*k}(t)\Big] + \mathrm{const}, \quad \text{s.t.} \ \mathbf{U}_{*k} \in \{-1, +1\}^n. \tag{3}$$

$\forall l$, $U_{lk}$ is defined over $\{-1, +1\}$. Hence, to maximize $\widetilde{L}(\mathbf{U}_{*k})$, we only need to take $U_{lk} = 1$ whenever the $l$-th element of $\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) - \mathbf{H}\mathbf{U}_{*k}(t)$ is greater than 0, and $U_{lk} = -1$ otherwise. That is to say, the optimal solution for Problem (3) is $\mathbf{U}_{*k} = \mathrm{sign}\big[\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) - \mathbf{H}\mathbf{U}_{*k}(t)\big]$. And we use this solution to get $\mathbf{U}_{*k}(t+1)$:

$$\mathbf{U}_{*k}(t+1) = \mathrm{sign}\Big[\frac{\partial L}{\partial \mathbf{U}_{*k}}(t) - \mathbf{H}\mathbf{U}_{*k}(t)\Big]. \tag{4}$$
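The closed-form update in (4) is easy to implement. The sketch below (our own illustrative code, not the authors' implementation; variable names are ours) updates one column $\mathbf{U}_{*k}$ with $\mathbf{V}$ fixed, using the gradient and $\mathbf{H} = -\frac{n\lambda^2}{4c^2}\mathbf{I}$ defined above, so that $-\mathbf{H}\mathbf{U}_{*k}(t)$ becomes $+\frac{n\lambda^2}{4c^2}\mathbf{U}_{*k}(t)$.

```python
import numpy as np

def sigmoid(Theta):
    return 1.0 / (1.0 + np.exp(-Theta))

def update_column_U(U, V, S, k, lam):
    """One surrogate update of column U_{*k} with V fixed, following Eq. (4)."""
    n, c = U.shape
    A = sigmoid((lam / c) * (U @ V.T))                  # A_ij = sigma(Theta_ij)
    grad = (lam / c) * ((S - A) @ V[:, k])              # dL/dU_{*k} = (lambda/c) sum_j (S_{*j} - A_{*j}) V_jk
    minus_H_u = (n * lam**2 / (4.0 * c**2)) * U[:, k]   # -H U_{*k}(t), with H = -(n lambda^2 / 4c^2) I
    new_col = np.sign(grad + minus_H_u)
    new_col[new_col == 0] = 1                           # break sign(0) ties toward +1
    U = U.copy()
    U[:, k] = new_col
    return U
```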
Algorithm 1 Learning algorithm for DLFH
Input: $\mathbf{S} \in \{0, 1\}^{n \times n}$: supervised similarity matrix; $c$: code length.
Output: $\mathbf{U}$ and $\mathbf{V}$: binary codes for the two modalities.
1: Procedure: initialize $\mathbf{U}$ and $\mathbf{V}$.
2: for $t = 1 \to T$ do
3:     for $k = 1 \to c$ do
4:         Update $\mathbf{U}_{*k}$ according to (4).
5:     end for
6:     for $k = 1 \to c$ do
7:         Update $\mathbf{V}_{*k}$ according to (5).
8:     end for
9: end for

2) Learning V with U Fixed: When $\mathbf{U}$ is fixed, we adopt a similar strategy as that for $\mathbf{U}$ to learn $\mathbf{V}$. Specifically, we can get the following closed-form solution to update $\mathbf{V}_{*k}$:

$$\mathbf{V}_{*k}(t+1) = \mathrm{sign}\Big[\frac{\partial L}{\partial \mathbf{V}_{*k}}(t) - \mathbf{H}\mathbf{V}_{*k}(t)\Big], \tag{5}$$

where $\frac{\partial L}{\partial \mathbf{V}_{*k}} = \frac{\lambda}{c}\sum_{i=1}^{n}(\mathbf{S}_{i*}^T - \mathbf{A}_{i*}^T)U_{ik}$ and $\mathbf{H} = -\frac{n\lambda^2}{4c^2}\mathbf{I}$.

The learning algorithm for DLFH is summarized in Algorithm 1, which can be easily implemented.

Please note that LFH [50] has adopted a latent factor model for hashing. However, DLFH is different from LFH and is novel due to the following reasons. Firstly, DLFH is for cross-modal supervised hashing but LFH is for uni-modal supervised hashing. Secondly, DLFH is a discrete method which directly learns the discrete (binary) codes without relaxation, but LFH is a relaxation-based continuous method which cannot directly learn the discrete codes. Last but not least, LFH learns all bits of a point each time, but DLFH learns one bit of all points each time, which can lead to highly uncorrelated hash codes. This will be empirically verified in Section V-H.

Theorem 2. The learning algorithm for DLFH which is shown in Algorithm 1 is convergent.

Proof of Theorem 2. According to Theorem 1, we have:

$$L(\mathbf{U}_{*k}) \geq \widetilde{L}(\mathbf{U}_{*k}). \tag{6}$$

Furthermore, since we use the optimal solution of Problem (3) to get $\mathbf{U}_{*k}(t+1)$, we have:

$$\widetilde{L}(\mathbf{U}_{*k}(t+1)) \geq \widetilde{L}(\mathbf{U}_{*k}(t)). \tag{7}$$

We can also find that $\widetilde{L}(\mathbf{U}_{*k}(t)) = L(\mathbf{U}_{*k}(t))$. Then, we have:

$$L(\mathbf{U}_{*k}(t+1)) \geq \widetilde{L}(\mathbf{U}_{*k}(t+1)) \geq \widetilde{L}(\mathbf{U}_{*k}(t)) = L(\mathbf{U}_{*k}(t)).$$

Hence, during the learning procedure, we can always guarantee that $L(\mathbf{U}_{*k}(t+1)) \geq L(\mathbf{U}_{*k}(t))$. Similarly, we can also guarantee that $L(\mathbf{V}_{*k}(t+1)) \geq L(\mathbf{V}_{*k}(t))$. This means that we can always guarantee that the objective function will not decrease during the whole learning procedure in Algorithm 1. Together with the fact that $L(\mathbf{U}, \mathbf{V})$ is upper-bounded by 0, we can guarantee that Algorithm 1 will converge. Because $L(\mathbf{U}, \mathbf{V})$ is non-concave in both $\mathbf{U}$ and $\mathbf{V}$ (footnote 2), the learned solution will converge to a local optimum.

Footnote 2: Please note that $L(\cdot)$ is concave in $\mathbf{U}_{*k}$ or $\mathbf{V}_{*k}$, but non-concave in both $\mathbf{U}$ and $\mathbf{V}$.

3) Stochastic Learning Strategy: We can find that the computational cost for learning $\mathbf{U}_{*k}$ and $\mathbf{V}_{*k}$ is $O(n^2)$. DLFH will become intractable when the size of the training set is large. Here, we design a stochastic learning strategy to avoid the high computational cost.

It is easy to see that the high computational cost mainly comes from the gradient computation for both $\mathbf{U}_{*k}$ and $\mathbf{V}_{*k}$. In our stochastic learning strategy, we randomly sample $m$ columns (rows) of $\mathbf{S}$ to compute $\frac{\partial L}{\partial \mathbf{U}_{*k}}$ ($\frac{\partial L}{\partial \mathbf{V}_{*k}}$) during each iteration. Then, we get the following formulas for updating $\mathbf{U}_{*k}$ and $\mathbf{V}_{*k}$:

$$\mathbf{U}_{*k}(t+1) = \mathrm{sign}\Big[\frac{\lambda}{c}\sum_{q=1}^{m}(\mathbf{S}_{*j_q} - \mathbf{A}_{*j_q})V_{j_q k}(t) + \frac{m\lambda^2}{4c^2}\mathbf{U}_{*k}(t)\Big], \tag{8}$$

$$\mathbf{V}_{*k}(t+1) = \mathrm{sign}\Big[\frac{\lambda}{c}\sum_{q=1}^{m}(\mathbf{S}_{i_q*}^T - \mathbf{A}_{i_q*}^T)U_{i_q k}(t) + \frac{m\lambda^2}{4c^2}\mathbf{V}_{*k}(t)\Big], \tag{9}$$

where $\{j_q\}_{q=1}^{m}$ and $\{i_q\}_{q=1}^{m}$ are the $m$ sampled column and row indices, respectively.

To get the stochastic learning algorithm for DLFH, we only need to substitute (4) and (5) in Algorithm 1 by (8) and (9), respectively. Then the computational cost will decrease from $O(n^2)$ to $O(nm)$, where $m \ll n$.
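To illustrate the stochastic updates (8) and (9), the sketch below (illustrative code under our own naming and assumptions, not the authors' release) samples $m$ column indices of $\mathbf{S}$ and updates one column of $\mathbf{U}$; the update for $\mathbf{V}$ is symmetric, sampling rows instead of columns.

```python
import numpy as np

def stochastic_update_column_U(U, V, S, k, lam, m, rng):
    """Stochastic surrogate update of U_{*k} following Eq. (8): sample m columns of S."""
    n, c = U.shape
    cols = rng.choice(n, size=m, replace=False)       # sampled column indices {j_q}
    Theta = (lam / c) * (U @ V[cols].T)               # only sampled columns are needed: O(nm)
    A = 1.0 / (1.0 + np.exp(-Theta))                  # A_{*j_q}
    grad = (lam / c) * ((S[:, cols] - A) @ V[cols, k])
    new_col = np.sign(grad + (m * lam**2 / (4.0 * c**2)) * U[:, k])
    new_col[new_col == 0] = 1
    U = U.copy()
    U[:, k] = new_col
    return U

# one sweep of the stochastic variant of Algorithm 1 over all bits (toy sizes)
rng = np.random.default_rng(0)
n, c, lam, m = 200, 16, 8.0, 16
U = np.sign(rng.standard_normal((n, c)))
V = np.sign(rng.standard_normal((n, c)))
S = rng.integers(0, 2, size=(n, n))
for k in range(c):
    U = stochastic_update_column_U(U, V, S, k, lam, m, rng)
```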
C. Out-of-Sample Extension

For any unseen query point $\mathbf{x}_q \notin \mathbf{X}$ or $\mathbf{y}_q \notin \mathbf{Y}$, we need to perform out-of-sample extension to generate the binary codes. Many existing hashing methods learn classifiers to predict binary codes for out-of-sample extension. These classifiers include SVM [24], boosted decision trees [8], kernel logistic regression [38], and so on. In this paper, we adopt two classifiers for out-of-sample extension, one being linear and the other being non-linear. We use modality $x$ as an example to demonstrate these two strategies.

For the linear classifier, we minimize the following square loss:

$$L_{square}(\mathbf{W}^{(x)}) = \|\mathbf{U} - \mathbf{X}\mathbf{W}^{(x)}\|_F^2 + \gamma_x\|\mathbf{W}^{(x)}\|_F^2,$$

where $\gamma_x$ is the hyper-parameter for the regularization term. We can get $\mathbf{W}^{(x)} = (\mathbf{X}^T\mathbf{X} + \gamma_x\mathbf{I})^{-1}\mathbf{X}^T\mathbf{U}$. Then we can get the hash function for out-of-sample extension:

$$h_x(\mathbf{x}_q) = \mathrm{sign}([\mathbf{W}^{(x)}]^T\mathbf{x}_q).$$

For the non-linear classifier, we adopt kernel logistic regression (KLR) as that in SePH [38]. Specifically, we learn $c$ classifiers, one classifier for each bit, to predict the binary codes for $\mathbf{x}_q$. For the $k$th bit, we minimize the following KLR loss:

$$L_{KLR}(\mathbf{M}_{*k}^{(x)}) = \sum_{i=1}^{n}\log\big(1 + e^{-U_{ik}\phi(\mathbf{x}_i)^T\mathbf{M}_{*k}^{(x)}}\big) + \eta_x\|\mathbf{M}_{*k}^{(x)}\|_F^2,$$

where $\eta_x$ is the hyper-parameter for the regularization term and $\phi(\mathbf{x}_i)$ is an RBF kernel feature representation of $\mathbf{x}_i$. Then we can get the hash function for out-of-sample extension:

$$h_x(\mathbf{x}_q) = \mathrm{sign}([\mathbf{M}^{(x)}]^T\phi(\mathbf{x}_q)),$$

where $\mathbf{M}_{*k}^{(x)}$ is the $k$th column of $\mathbf{M}^{(x)}$.

We can use similar ways to learn the out-of-sample classifiers for the $y$ modality. We can also use other out-of-sample classifiers in our method, but this is not the focus of this paper. In the following content, we use "DLFH" and "KDLFH" to denote the linear and non-linear versions for out-of-sample extension, respectively.
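For the linear extension, the closed-form ridge solution above can be implemented directly. The following sketch (illustrative only; the names, sizes, and regularization value are our own assumptions) fits $\mathbf{W}^{(x)}$ and hashes a new query.

```python
import numpy as np

def fit_linear_hash(X, U, gamma):
    """Closed-form W = (X^T X + gamma I)^{-1} X^T U for the linear out-of-sample classifier."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ U)

def hash_query(W, x_q):
    """h_x(x_q) = sign(W^T x_q), returning a {-1,+1} code."""
    code = np.sign(W.T @ x_q)
    code[code == 0] = 1
    return code

# toy usage with hypothetical sizes: 100 training points, 32-D features, 16-bit codes
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32))
U = np.sign(rng.standard_normal((100, 16)))
W = fit_linear_hash(X, U, gamma=1.0)
print(hash_query(W, X[0]))
```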
D. Complexity Analysis

We call the DLFH version without the stochastic learning strategy DLFH-Full, and call the stochastic version of DLFH DLFH-Stochastic. The time complexity of DLFH-Full is $O(Tcn^2)$, where $T$ and $c$ are typically small constants. The time complexity of DLFH-Stochastic is $O(Tcnm)$, which is much lower than that of DLFH-Full when $m \ll n$. In practice, we suggest adopting DLFH-Stochastic, because in our experiments we find that the accuracy of DLFH-Stochastic is comparable to that of DLFH-Full even if $m$ is set to be $c$, which is typically a small constant. When $m = c$, the training of DLFH-Stochastic is very efficient.

V. EXPERIMENTS

We utilize three widely used datasets for evaluation. Our DLFH and KDLFH are non-deep methods. The experiments for non-deep baselines are performed on a workstation with an Intel(R) CPU E5-2620V2@2.1G of 12 cores and 96G RAM. For this platform, the operating system is CentOS 6.5. For deep CMH baselines, we conduct our experiments on a workstation with an Intel(R) CPU E5-2860V4@2.4G of 14 cores, 768G RAM and an NVIDIA M40 GPU (footnote 3). For this platform, the operating system is Ubuntu 14.04.

Footnote 3: When we compare our DLFH and KDLFH with deep CMH baselines in terms of time cost, DLFH and KDLFH are also performed on the same GPU platform.

A. Datasets

Three datasets, IAPR-TC12 [51], MIRFLICKR-25K [52] and NUS-WIDE [53], are used for evaluation.

The IAPR-TC12 dataset consists of 20,000 image-text pairs which are annotated using 255 labels. We use the entire dataset for our experiment. Each text is represented as a 2912-D bag-of-words (BOW) vector. Each image is represented by a 512-D GIST feature vector.

The MIRFLICKR-25K dataset contains 25,000 data points. Each point corresponds to an image associated with some textual tags. We only select those points which have at least 20 textual tags for our experiment. Each text is represented as a 1386-D BOW vector. Each image is represented by a 512-D GIST feature vector. Each point is manually annotated with some of the 24 unique labels.

The NUS-WIDE dataset contains 260,648 data points, each of which corresponds to an image associated with some textual tags. Furthermore, each point is annotated with one or multiple labels from 81 concept labels. We only select 186,577 points that belong to the 10 most frequent concepts from the original dataset. The text for each point is represented as a 1000-D BOW vector. Furthermore, each image is represented as a 500-D bag-of-visual-words (BOVW) vector.
For all three datasets, the ground-truth neighbors (similar pairs) are defined as those image-text pairs which share at least one common label. Table I illustrates some example points from the three datasets.

TABLE I
EXAMPLE POINTS FROM THREE DATASETS. (Image examples omitted; only the text information and labels are shown.)

IAPR-TC12 — Text information: cascading, waterfall, middle, jungle, pool, dirty, water, foreground. Label: sky-light, trees, waterfall, water, vegetation.
MIRFLICKR-25K — Text information: explore, beach, people, sea, summer, boat, jump, playa, mar, coast, gente, shore. Label: male, people, sea, sky, sunset, transport, water.
NUS-WIDE — Text information: sunset, night, architecture, city, building, lake, lights, colorful, buildings, tower, america, evening, downtown, work, chicago, office, great, cityscape, streets, illinois, business. Label: sky, buildings, lake.

B. Baselines and Evaluation Protocol

We adopt nine state-of-the-art CMH methods as baselines for comparison. They are CMHH [41], UGACH [36], DCMH [34], GSPH [42], SePH [38], SCM [26], CMFH [27], CCA-ITQ [6] and MLBE [37]. Among these baselines, UGACH, CCA-ITQ and CMFH are unsupervised, and the others are supervised. CMFH, SCM and GSPH are relaxation-based continuous methods, and the others are discrete methods. CMHH, UGACH and DCMH are deep learning based methods, and the others are non-deep methods. For DCMH, GSPH, SePH, SCM and MLBE, the source codes are kindly provided by their corresponding authors. For the other baselines, we implement them carefully by ourselves. GSPH and SePH are kernel-based methods. Following their authors' suggestion, we utilize the RBF kernel and adopt two strategies to create kernel bases to learn hash functions for SePH. Specifically, one strategy is randomly taking 500 data points as kernel bases and the other strategy is to use the k-means algorithm to learn 500 centers as kernel
These two versions of SePH are denoted as SePHrnd and SePHkm, respectively. For GSPH, we use the k-means algorithm to learn 500 centers as kernel bases. For the other baselines, we set the parameters by following the suggestions of the corresponding authors. For our DLFH, we set λ = 8 and T = 30, which are selected based on a validation set. We adopt the stochastic version of DLFH unless otherwise stated; in stochastic DLFH, m = c. To initialize U and V, we first use a uniform distribution to randomly generate the elements of the matrices, and then use the sign(·) function to map them to {±1}. We find that our DLFH is not sensitive to initialization. For KDLFH, we randomly take 500 data points as kernel bases, as in SePHrnd. All experiments are run 5 times to remove randomness, and the average accuracy with standard deviations is reported.
For the IAPR-TC12 and MIRFLICKR-25K datasets, we randomly select 2,000 data points as the query (test) set and use the rest as the retrieval (database) set. For the NUS-WIDE dataset, we randomly select 1,867 data points as the query set and use the rest as the retrieval set. For our DLFH and most baselines except GSPH, SePH and MLBE, we use the whole retrieval set as the training set. Because GSPH, SePH and MLBE cannot scale to large training sets, we randomly sample 5,000 data points from the retrieval set to construct the training set for GSPH and SePH, and sample 1,000 data points for training MLBE due to an out-of-memory error with larger training sets. Table II presents the detailed statistics of the three datasets.

TABLE II
DETAILED STATISTICS OF THREE DATASETS.

Dataset         #Total    #Query   #Database   Image feature   Text feature
IAPR-TC12       20,000    2,000    18,000      512-D GIST      2912-D BOW
MIRFLICKR-25K   20,015    2,000    18,015      512-D GIST      1386-D BOW
NUS-WIDE        186,577   1,867    184,710     500-D BOVW      1000-D BOW

As in existing methods [6], Hamming ranking and hash lookup are adopted as retrieval protocols for evaluation. We report Mean Average Precision (MAP) and Top-k precision for the Hamming ranking protocol. The precision-recall curve is a widely used metric to measure the accuracy of the hash lookup protocol, and we report precision-recall curves for evaluation.

C. Convergence Analysis
To verify the convergence property of DLFH, we conduct an experiment on a subset of the MIRFLICKR-25K dataset. Specifically, we randomly select 5,000 data points from the MIRFLICKR-25K dataset, with 3,000 data points for training and the rest for test (query).

[Figure panels: (a) Obj. value@32 bits, (b) MAP@32 bits, (c) Obj. value@64 bits, (d) MAP@64 bits; curves for DLFH-Full and DLFH-Stochastic.]
Fig. 1. Objective function value and MAP of DLFH on a subset of MIRFLICKR-25K.

Figure 1 shows the convergence of the objective function value (Figure 1 (a) and (c)) and MAP (Figure 1 (b) and (d)), where "DLFH-Full" denotes the full version of DLFH and "DLFH-Stochastic" denotes the stochastic version of DLFH. In the figure, "I → T" denotes image-to-text retrieval, where we use the image modality as queries and then retrieve texts from the database; the other notations are defined similarly. We can find that the objective function value of DLFH-Full does not decrease as the number of iterations increases, which verifies the claim about convergence in Theorem 2.
There is some fluctuation for DLFH-Stochastic due to the stochastic sampling procedure, but the overall trend is convergent. For MAP, we can also observe convergence in the overall trend. Both DLFH-Full and DLFH-Stochastic converge very fast, and only a small number of iterations are needed.
Another interesting phenomenon is that the accuracy of DLFH-Stochastic is almost the same as that of DLFH-Full. Hence, unless otherwise stated, DLFH in the following experiments refers to the stochastic version of DLFH.

D. Hamming Ranking Task
Table III, Table IV and Table V present the MAP and Top100 precision on the IAPR-TC12, MIRFLICKR-25K and NUS-WIDE datasets, respectively. The best accuracy is shown in boldface. Because DLFH and KDLFH are non-deep methods, here we only compare them with non-deep baselines, including GSPH, SePH, SCM, CMFH, CCA-ITQ and MLBE. The detailed comparison with deep learning based baselines will be shown separately in Section V-G.
By comparing GSPH, SePHrnd, SePHkm and SCM with CMFH and CCA-ITQ, we can see that supervised methods outperform unsupervised methods in most cases. By comparing KDLFH, DLFH, SePHrnd and SePHkm with the other methods, we can find that, in general, discrete methods outperform relaxation-based continuous methods. Please note that MLBE is a special case: its training is too slow and we have to sample only a very small subset for training, so its accuracy is low although it is supervised and discrete. We can find that our DLFH significantly outperforms all baselines in all cases for both image-to-text and text-to-image retrieval tasks. Furthermore, KDLFH can further improve retrieval accuracy thanks to the non-linear out-of-sample extension strategy.
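For reference, the following is a minimal sketch of how MAP and Top-k precision over Hamming ranking can be computed from binary codes. It is our own illustration (the function name, tie handling and other details are assumptions), not the evaluation code used for the tables below.

```python
import numpy as np

def hamming_rank_metrics(q_codes, db_codes, relevance, topk=100):
    """MAP and Top-k precision for Hamming ranking.

    q_codes: (nq, c) and db_codes: (nd, c) with entries in {-1, +1};
    relevance: (nq, nd) boolean ground truth (pairs sharing >= 1 label).
    """
    c = q_codes.shape[1]
    # Hamming distance recovered from the inner product of +/-1 codes
    dist = 0.5 * (c - q_codes @ db_codes.T)
    order = np.argsort(dist, axis=1, kind="stable")
    aps, precs = [], []
    for i in range(q_codes.shape[0]):
        rel = relevance[i, order[i]]
        if rel.sum() == 0:
            continue
        hits = np.cumsum(rel)
        ranks = np.arange(1, rel.size + 1)
        aps.append(np.mean((hits / ranks)[rel]))   # average precision of query i
        precs.append(rel[:topk].mean())            # Top-k precision of query i
    return float(np.mean(aps)), float(np.mean(precs))
```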
TABLE III
MAP (mean ± std, IN PERCENT) AND TOP100 PRECISION (mean ± std, IN PERCENT) ON IAPR-TC12 DATASET WITH 8, 16, 32 AND 64 BITS.

Task I → T
Method     MAP: 8 / 16 / 32 / 64 bits                              Top100 precision: 8 / 16 / 32 / 64 bits
DLFH       43.07±1.85 / 45.58±1.33 / 52.00±0.75 / 54.84±0.57       54.62±1.98 / 57.94±1.93 / 66.60±0.74 / 70.05±0.96
SCM        37.67±0.00 / 38.74±0.00 / 39.72±0.00 / 41.02±0.00       45.85±0.00 / 49.40±0.00 / 50.94±0.00 / 52.73±0.00
CMFH       30.92±0.81 / 30.91±0.62 / 31.70±1.16 / 30.76±0.53       31.57±1.10 / 31.27±0.61 / 32.51±2.04 / 31.31±0.94
CCA-ITQ    35.35±0.00 / 34.17±0.00 / 33.09±0.00 / 32.36±0.00       44.03±0.00 / 42.31±0.00 / 39.51±0.00 / 37.50±0.00
MLBE       30.39±0.01 / 30.52±0.16 / 30.36±0.04 / 30.46±0.09       31.16±0.00 / 30.79±0.52 / 30.30±0.72 / 30.50±0.24
KDLFH      43.26±1.42 / 47.02±1.11 / 52.61±0.65 / 56.54±1.00       54.57±1.66 / 60.51±2.11 / 67.04±1.52 / 71.04±1.21
SePHrnd    40.30±0.07 / 41.47±0.22 / 42.27±0.11 / 42.89±0.10       45.95±0.39 / 48.20±0.41 / 49.97±0.24 / 51.52±0.30
SePHkm     40.30±0.08 / 41.41±0.26 / 42.07±0.12 / 42.65±0.10       46.22±0.14 / 48.22±0.60 / 49.99±0.36 / 51.27±0.24
GSPH       38.67±0.41 / 40.03±0.22 / 41.67±0.16 / 42.85±0.21       45.43±0.44 / 48.79±0.23 / 51.27±0.40 / 53.42±0.23

Task T → I
Method     MAP: 8 / 16 / 32 / 64 bits                              Top100 precision: 8 / 16 / 32 / 64 bits
DLFH       43.84±1.96 / 48.47±1.30 / 57.13±0.60 / 62.43±0.41       55.08±1.24 / 62.22±1.56 / 71.76±0.77 / 77.65±0.62
SCM        37.65±0.00 / 38.57±0.00 / 39.81±0.00 / 40.82±0.00       44.39±0.00 / 47.81±0.00 / 51.08±0.00 / 53.08±0.00
CMFH       31.02±1.17 / 31.67±0.94 / 32.42±1.15 / 31.71±0.36       32.33±2.19 / 33.24±1.27 / 33.67±2.56 / 32.79±0.94
CCA-ITQ    35.27±0.00 / 34.19±0.00 / 33.18±0.00 / 32.44±0.00       44.85±0.00 / 42.85±0.00 / 40.63±0.00 / 38.59±0.00
MLBE       30.39±0.00 / 30.44±0.09 / 30.35±0.11 / 30.49±0.12       31.16±0.00 / 30.30±0.93 / 30.09±1.08 / 30.55±0.15
KDLFH      44.40±1.39 / 50.73±1.00 / 58.10±0.59 / 63.93±1.34       55.85±3.62 / 65.76±1.14 / 73.50±1.52 / 79.21±1.12
SePHrnd    42.70±0.12 / 44.31±0.18 / 45.62±0.14 / 46.66±0.21       53.20±0.26 / 57.04±0.41 / 60.60±0.22 / 63.38±0.70
SePHkm     42.19±0.08 / 43.77±0.28 / 44.76±0.14 / 45.67±0.29       51.92±0.30 / 55.48±0.92 / 58.49±0.48 / 60.72±0.56
GSPH       40.01±0.61 / 42.30±0.28 / 44.77±0.26 / 46.62±0.27       49.14±1.09 / 55.91±0.25 / 60.72±0.63 / 64.40±0.41
TABLE IV
MAP (mean ± std, IN PERCENT) AND TOP100 PRECISION (mean ± std, IN PERCENT) ON MIRFLICKR-25K DATASET WITH 8, 16, 32 AND 64 BITS.

Task I → T
Method     MAP: 8 / 16 / 32 / 64 bits                              Top100 precision: 8 / 16 / 32 / 64 bits
DLFH       72.68±1.11 / 75.84±0.46 / 78.02±0.07 / 78.92±0.03       78.25±2.26 / 82.94±0.93 / 85.37±0.90 / 86.13±0.46
SCM        62.88±0.00 / 63.89±0.00 / 65.06±0.00 / 65.76±0.00       67.41±0.00 / 69.59±0.00 / 70.12±0.00 / 71.01±0.00
CMFH       56.98±1.02 / 57.38±0.59 / 57.32±0.22 / 56.87±0.22       58.72±1.44 / 59.76±0.91 / 60.35±0.60 / 59.43±0.74
CCA-ITQ    58.33±0.00 / 57.62±0.00 / 57.11±0.00 / 56.77±0.00       62.26±0.00 / 60.80±0.00 / 59.67±0.00 / 58.72±0.00
MLBE       55.75±0.05 / 55.81±0.14 / 55.83±0.15 / 55.83±0.15       57.34±0.71 / 56.20±0.85 / 55.79±2.45 / 58.04±2.22
KDLFH      74.28±0.67 / 77.07±0.79 / 79.27±0.24 / 80.08±0.19       79.02±0.24 / 83.75±1.81 / 85.99±0.57 / 86.92±0.45
SePHrnd    65.16±0.16 / 65.60±0.17 / 65.96±0.09 / 66.23±0.08       67.05±0.41 / 67.64±0.13 / 68.70±0.36 / 69.40±0.39
SePHkm     65.51±0.21 / 66.06±0.16 / 66.40±0.22 / 66.66±0.10       67.81±0.57 / 68.38±0.45 / 68.99±0.53 / 69.87±0.19
GSPH       63.12±0.65 / 64.62±0.35 / 65.77±0.32 / 66.40±0.31       67.81±0.47 / 69.44±0.32 / 70.29±0.25 / 70.59±0.40

Task T → I
Method     MAP: 8 / 16 / 32 / 64 bits                              Top100 precision: 8 / 16 / 32 / 64 bits
DLFH       77.73±0.88 / 82.39±0.29 / 85.01±0.17 / 86.05±0.21       82.85±2.03 / 87.95±0.69 / 90.50±0.31 / 91.28±0.19
SCM        61.62±0.00 / 62.37±0.00 / 63.37±0.00 / 64.25±0.00       65.08±0.00 / 67.97±0.00 / 69.53±0.00 / 71.38±0.00
CMFH       57.07±0.72 / 57.52±0.46 / 57.71±0.20 / 57.69±0.07       57.09±1.74 / 57.49±0.57 / 60.20±0.24 / 60.18±0.27
CCA-ITQ    58.22±0.00 / 57.59±0.00 / 57.12±0.00 / 56.82±0.00       62.65±0.00 / 60.85±0.00 / 59.93±0.00 / 59.03±0.00
MLBE       55.73±0.08 / 55.79±0.12 / 55.79±0.10 / 56.04±0.22       57.01±1.25 / 55.54±0.87 / 55.57±0.70 / 56.05±0.59
KDLFH      79.08±1.40 / 82.14±0.57 / 85.02±0.08 / 85.83±0.24       85.15±1.91 / 88.63±1.05 / 91.37±0.49 / 92.46±0.33
SePHrnd    68.11±0.22 / 68.66±0.27 / 69.42±0.28 / 69.79±0.42       73.98±0.55 / 75.47±0.37 / 77.00±0.33 / 78.15±0.34
SePHkm     69.15±0.24 / 69.72±0.30 / 70.39±0.36 / 70.83±0.20       75.98±0.40 / 76.65±0.84 / 78.28±0.71 / 79.62±0.43
GSPH       66.50±0.50 / 68.47±0.40 / 69.93±0.37 / 71.21±0.30       76.73±0.48 / 79.12±0.38 / 80.18±0.27 / 81.53±0.40
TABLE V
MAP (mean ± std, IN PERCENT) AND TOP100 PRECISION (mean ± std, IN PERCENT) ON NUS-WIDE DATASET WITH 8, 16, 32 AND 64 BITS.

Task I → T
Method     MAP: 8 / 16 / 32 / 64 bits                              Top100 precision: 8 / 16 / 32 / 64 bits
DLFH       63.82±0.61 / 67.00±0.45 / 69.27±0.21 / 70.33±0.25       74.61±1.13 / 78.91±0.88 / 82.07±0.80 / 84.03±0.82
SCM        51.29±0.00 / 52.19±0.00 / 54.81±0.00 / 55.58±0.00       53.38±0.00 / 57.02±0.00 / 60.90±0.00 / 61.96±0.00
CMFH       36.31±1.53 / 38.12±1.08 / 38.62±1.36 / 40.03±0.81       38.23±1.36 / 41.23±1.13 / 43.08±1.42 / 44.03±0.92
CCA-ITQ    40.97±0.00 / 39.55±0.00 / 38.23±0.00 / 37.03±0.00       52.24±0.00 / 51.89±0.00 / 50.05±0.00 / 46.44±0.00
MLBE       34.56±0.00 / 34.51±0.00 / 34.46±0.00 / 34.86±0.08       35.07±0.00 / 33.84±0.00 / 33.84±0.00 / 40.77±3.65
KDLFH      65.24±1.03 / 68.47±0.65 / 70.19±0.15 / 71.12±0.29       76.84±1.18 / 79.91±0.67 / 82.49±0.63 / 84.31±0.45
SePHrnd    52.67±0.47 / 53.96±0.87 / 54.85±0.25 / 55.24±0.42       55.00±0.30 / 55.89±0.31 / 56.52±0.24 / 56.81±0.24
SePHkm     53.36±0.50 / 54.41±0.57 / 55.54±0.41 / 55.64±0.41       55.51±0.21 / 56.32±0.41 / 56.56±0.38 / 57.09±0.35
GSPH       51.28±0.75 / 54.19±0.40 / 55.29±0.25 / 56.02±0.22       55.15±0.21 / 56.66±0.74 / 57.89±0.19 / 58.27±0.16

Task T → I
Method     MAP: 8 / 16 / 32 / 64 bits                              Top100 precision: 8 / 16 / 32 / 64 bits
DLFH       72.22±2.28 / 78.50±0.50 / 82.03±0.42 / 83.78±0.09       80.01±2.92 / 86.37±1.75 / 90.12±0.67 / 90.89±0.52
SCM        48.26±0.00 / 49.25±0.00 / 51.53±0.00 / 52.26±0.00       53.45±0.00 / 57.05±0.00 / 61.59±0.00 / 62.92±0.00
CMFH       35.67±1.56 / 36.87±1.09 / 37.48±1.44 / 38.24±0.90       36.22±2.85 / 40.72±0.76 / 44.57±2.92 / 46.91±1.92
CCA-ITQ    40.32±0.00 / 38.98±0.00 / 37.81±0.00 / 36.79±0.00       51.68±0.00 / 52.67±0.00 / 50.62±0.00 / 46.81±0.00
MLBE       34.53±0.00 / 34.53±0.00 / 34.53±0.00 / 34.53±0.08       33.93±0.00 / 33.93±0.00 / 33.93±0.00 / 35.52±1.24
KDLFH      72.37±2.19 / 78.27±0.95 / 81.78±0.51 / 83.25±0.28       82.77±1.90 / 88.59±0.81 / 91.03±1.03 / 92.57±0.47
SePHrnd    60.71±0.47 / 61.79±1.17 / 62.75±0.29 / 63.40±0.50       68.85±0.41 / 69.79±0.92 / 70.36±0.81 / 71.44±0.53
SePHkm     61.34±0.38 / 62.63±0.78 / 64.14±0.14 / 64.47±0.53       69.50±0.61 / 70.37±0.27 / 71.48±0.20 / 72.30±0.45
GSPH       58.60±0.86 / 62.69±0.47 / 63.95±0.39 / 65.13±0.34       68.83±0.71 / 71.58±0.90 / 72.91±0.27 / 73.81±0.42
E. Hash Lookup Task
In real applications, retrieval with hash lookup can usually achieve constant or sub-linear search speed. For the hash lookup protocol, we report precision-recall curves to evaluate the proposed DLFH, KDLFH and the baselines on the three datasets. Figure 2, Figure 3 and Figure 4 show the precision-recall curves on the IAPR-TC12, MIRFLICKR-25K and NUS-WIDE datasets, respectively. Once again, we can find that DLFH and KDLFH significantly outperform all baselines in all cases.

[Figure: precision-recall curves for CCA-ITQ, CMFH, MLBE, SCM, SePHrnd, SePHkm, GSPH, DLFH and KDLFH; panels show the I → T and T → I tasks at 8, 16, 32 and 64 bits.]
Fig. 2. Precision-recall curve on IAPR-TC12 dataset.

[Figure: precision-recall curves for the same methods and tasks.]
Fig. 3. Precision-recall curve on MIRFLICKR-25K dataset.
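As a rough illustration of how a precision-recall curve for the hash lookup protocol can be generated by sweeping the Hamming radius, the following sketch is our own; the function name and the radius-sweeping convention are assumptions, not the evaluation code used in the paper.

```python
import numpy as np

def precision_recall_curve_hamming(q_codes, db_codes, relevance):
    """Precision and recall of hash lookup as the Hamming radius grows.

    q_codes: (nq, c), db_codes: (nd, c) with entries in {-1, +1};
    relevance: (nq, nd) boolean ground truth.  Returns arrays indexed by
    radius r = 0..c.
    """
    c = q_codes.shape[1]
    dist = 0.5 * (c - q_codes @ db_codes.T)     # Hamming distances
    precisions, recalls = [], []
    total_relevant = relevance.sum()
    for r in range(c + 1):
        retrieved = dist <= r                   # points inside radius r
        hits = np.logical_and(retrieved, relevance).sum()
        n_retrieved = retrieved.sum()
        precisions.append(hits / max(n_retrieved, 1))
        recalls.append(hits / max(total_relevant, 1))
    return np.array(recalls), np.array(precisions)
```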
F. Training Speed
To evaluate the training speed of DLFH, we adopt different numbers of data points from the retrieval set to construct training sets and then report the training time. Table VI presents the training time (in seconds) for our DLFH and the baselines, where "–" denotes that we cannot carry out the corresponding experiments due to out-of-memory errors. We can find that the unsupervised method CCA-ITQ is the fastest method because it does not use supervised information for training. Although the training of CCA-ITQ is fast, its accuracy is low. Hence, CCA-ITQ is not practical in real applications. By comparing MLBE, SePHrnd and SePHkm with SCM and CMFH, we can find that existing discrete methods are much slower than relaxation-based continuous methods. Although our DLFH is a discrete method, its training speed is comparable to that of relaxation-based continuous methods. Furthermore, KDLFH is also much faster than the kernel-based baselines SePHrnd and SePHkm.
Although KDLFH can achieve better accuracy than DLFH, the training speed of KDLFH is much slower than that of DLFH. Hence, in real applications, we provide users with a choice between DLFH and KDLFH based on whether they care more about training speed or accuracy.
Overall, our DLFH and KDLFH achieve the best accuracy with a relatively fast training speed. In particular, our methods significantly outperform relaxation-based continuous methods in terms of accuracy while having a comparable training speed. Furthermore, our methods significantly outperform existing discrete methods in terms of both accuracy and training speed.

[Figure: precision-recall curves for CCA-ITQ, CMFH, MLBE, SCM, SePHrnd, SePHkm, GSPH, DLFH and KDLFH; panels show the I → T and T → I tasks at 8, 16, 32 and 64 bits.]
Fig. 4. Precision-recall curve on NUS-WIDE dataset.

TABLE VI
TRAINING TIME (IN SECONDS) ON SUBSETS OF NUS-WIDE.

Method     1K       5K       10K      50K       ∼184K
DLFH       1.26     3.84     6.61     34.88     112.88
SCM        10.41    11.13    11.09    11.34     12.30
CMFH       4.20     12.01    20.65    84.00     305.51
CCA-ITQ    0.56     0.59     0.69     1.51      4.25
MLBE       998.98   –        –        –         –
KDLFH      56.58    214.65   479.46   2097.52   7341.05
SePHrnd    80.54    606.18   –        –         –
SePHkm     83.81    705.39   –        –         –
GSPH       126.43   981.35   –        –         –

TABLE VII
COMPARISON WITH DEEP BASELINES ON IAPR-TC12 DATASET.

Method   MAP (%): I → T   MAP (%): T → I   MAP (%): Avg.   Time (s)
CMHH     48.81±0.46       49.46±0.24       49.14±0.11      45468.5
DCMH     45.26±0.33       51.85±0.38       48.56±0.35      33090.2
DLFH     43.77±1.27       52.67±1.52       48.22±1.33      5.9
KDLFH    50.48±0.83       49.91±1.09       50.20±0.97      741.8

TABLE VIII
COMPARISON WITH DEEP BASELINES ON MIRFLICKR-25K DATASET.

Method   MAP (%): I → T   MAP (%): T → I   MAP (%): Avg.   Time (s)
CMHH     75.37±0.13       76.84±0.22       76.10±0.32      44358.6
DCMH     74.41±0.12       78.48±0.53       76.44±0.22      33736.7
DLFH     77.94±0.98       79.00±0.68       78.47±0.79      4.5
KDLFH    85.91±0.72       82.03±1.28       83.97±0.99      777.6

G. Comparison with Deep Baselines
Some deep learning based cross-modal hashing methods have appeared in recent years [16], [34], [36], [41]. We compare DLFH and KDLFH with several representative deep CMH methods, including CMHH [41], DCMH [34] and UGACH [36]. Please note that the main focus of this paper is supervised hashing, and UGACH is an unsupervised method; hence, we only compare DLFH and KDLFH with UGACH on the largest dataset, NUS-WIDE, in terms of accuracy for demonstration. For a fair comparison, we use 4096-D deep features extracted with a pre-trained CNN model for the image modality in DLFH and KDLFH. This pre-trained CNN model is also used for initializing DCMH and CMHH, so the comparison is fair. Because deep CMH methods are time-consuming, we use a subset of the whole dataset as the training set by following the same strategy as in [34].
The MAP and training time (in seconds) with 16 bits are shown in Table VII, Table VIII and Table IX on