Column Sampling based Discrete Supervised Hashing

Wang-Cheng Kang, Wu-Jun Li and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023
Department of Computer Science and Technology, Nanjing University, China
kwc.oliver@gmail.com, {liwujun,zhouzh}@nju.edu.cn

Abstract

By leveraging semantic (label) information, supervised hashing has demonstrated better accuracy than unsupervised hashing in many real applications. Because the hashing-code learning problem is essentially a discrete optimization problem which is hard to solve, most existing supervised hashing methods try to solve a relaxed continuous optimization problem by dropping the discrete constraints. However, these methods typically suffer from poor performance due to the errors caused by the relaxation. Some other methods try to directly solve the discrete optimization problem, but they are typically time-consuming and unscalable. In this paper, we propose a novel method, called column sampling based discrete supervised hashing (COSDISH), to directly learn the discrete hashing code from semantic information. COSDISH is an iterative method, in each iteration of which several columns are sampled from the semantic similarity matrix and the hashing code is decomposed into two parts which are alternately optimized in a discrete way. Theoretical analysis shows that the learning (optimization) algorithm of COSDISH has a constant-approximation bound in each step of the alternating optimization procedure. Empirical results on datasets with semantic labels illustrate that COSDISH can outperform the state-of-the-art methods in real applications like image retrieval.

Introduction

Although different kinds of methods have been proposed for approximate nearest neighbor (ANN) search (Indyk and Motwani 1998), hashing has become one of the most popular candidates for ANN search because it can achieve better performance than other methods in real applications (Weiss, Torralba, and Fergus 2008; Kulis and Grauman 2009; Zhang et al. 2010; Zhang, Wang, and Si 2011; Zhang et al. 2012; Strecha et al. 2012; Zhen and Yeung 2012; Rastegari et al. 2013; Lin et al. 2013b; Xu et al. 2013; Jin et al. 2013; Zhu et al. 2013; Wang, Zhang, and Si 2013; Zhou, Ding, and Guo 2014; Yu et al. 2014).

Existing hashing methods fall into two main categories (Kong and Li 2012; Liu et al. 2012; Zhang et al. 2014): data-independent methods and data-dependent methods. Representative data-independent methods include locality-sensitive hashing (LSH) (Andoni and Indyk 2006) and its variants. Data-dependent hashing can be further divided into unsupervised hashing and supervised hashing methods (Liu et al. 2012; Zhang et al. 2014). Unsupervised hashing methods, such as iterative quantization (ITQ) (Gong and Lazebnik 2011), isotropic hashing (IsoHash) (Kong and Li 2012), discrete graph hashing (DGH) (Liu et al. 2014) and scalable graph hashing (SGH) (Jiang and Li 2015), use only the feature information of the data points for learning, without using any semantic (label) information. On the contrary, supervised hashing methods try to leverage semantic (label) information for hashing function learning.
Representative supervised hashing methods include sequential projection learning for hashing (SPLH) (Wang, Kumar, and Chang 2010b), minimal loss hashing (MLH) (Norouzi and Fleet 2011), supervised hashing with kernels (KSH) (Liu et al. 2012), two-step hashing (TSH) (Lin et al. 2013a), latent factor hashing (LFH) (Zhang et al. 2014), FastH (Lin et al. 2014), graph cuts coding (GCC) (Ge, He, and Sun 2014) and supervised discrete hashing (SDH) (Shen et al. 2015).

Supervised hashing has attracted more and more attention in recent years because it has demonstrated better accuracy than unsupervised hashing in many real applications. Because the hashing-code learning problem is essentially a discrete optimization problem which is hard to solve, most existing supervised hashing methods, such as KSH (Liu et al. 2012), try to solve a relaxed continuous optimization problem by dropping the discrete constraints. However, these methods typically suffer from poor performance due to the errors caused by the relaxation, which has been verified by the experiments in (Lin et al. 2014; Shen et al. 2015). Some other methods, such as FastH (Lin et al. 2014), try to directly solve the discrete optimization problem. However, they are typically time-consuming and unscalable. Hence, they have to sample only a small subset of the entire dataset for training even if a large-scale training set is given, and consequently they cannot achieve satisfactory performance in real applications.

In this paper, we propose a novel method, called column sampling based discrete supervised hashing (COSDISH), to directly learn the discrete hashing code from semantic information. The main contributions of COSDISH are listed as follows:
• COSDISH is iterative, and in each iteration column sampling (Zhang et al. 2014) is adopted to sample several columns from the semantic similarity matrix. Different from traditional sampling methods which sample only a small subset of the entire dataset for training, our column sampling method can exploit all the available data points for training.

• Based on the sampled columns, the hashing-code learning procedure can be decomposed into two parts which can be alternately optimized in a discrete way. The discrete optimization strategy can avoid the errors caused by relaxation in traditional continuous optimization methods.

• Theoretical analysis shows that the learning (optimization) algorithm has a constant-approximation bound in each step of the alternating optimization procedure.

• Empirical results on datasets with semantic labels illustrate that COSDISH can outperform the state-of-the-art methods in real applications, such as image retrieval.

Notation and Problem Definition

Notation

Boldface lowercase letters like $\mathbf{a}$ denote vectors, and the $i$th element of $\mathbf{a}$ is denoted as $a_i$. Boldface uppercase letters like $\mathbf{A}$ denote matrices. $\mathbf{I}$ denotes the identity matrix. $\mathbf{A}_{i*}$ and $\mathbf{A}_{*j}$ denote the $i$th row and the $j$th column of $\mathbf{A}$, respectively. $A_{ij}$ denotes the element at the $i$th row and $j$th column of $\mathbf{A}$. $\mathbf{A}^{-1}$ denotes the inverse of $\mathbf{A}$, and $\mathbf{A}^T$ denotes the transpose of $\mathbf{A}$. $|\cdot|$ denotes the cardinality of a set, i.e., the number of elements in the set. $\|\cdot\|_F$ denotes the Frobenius norm of a vector or matrix, and $\|\cdot\|_1$ denotes the $L_1$ norm of a vector or matrix. $\mathrm{sgn}(\cdot)$ is the element-wise sign function which returns 1 if the element is a positive number and returns -1 otherwise.

Problem Definition

Suppose we have $n$ points $\{\mathbf{x}_i \in \mathbb{R}^d\}_{i=1}^n$, where $\mathbf{x}_i$ is the feature vector of point $i$. We can denote the feature vectors of the $n$ points in a compact matrix form $\mathbf{X} \in \mathbb{R}^{n\times d}$, where $\mathbf{X}_{i*}^T = \mathbf{x}_i$. Besides the feature vectors, the training set of supervised hashing also contains a semantic similarity matrix $\mathbf{S} \in \{-1,0,1\}^{n\times n}$, where $S_{ij}=1$ means that point $i$ and point $j$ are semantically similar, $S_{ij}=-1$ means that point $i$ and point $j$ are semantically dissimilar, and $S_{ij}=0$ means that whether point $i$ and point $j$ are semantically similar or not is unknown. Here, the semantic information typically refers to semantic labels provided with manual effort. In this paper, we assume that $\mathbf{S}$ is fully observed without missing entries, i.e., $\mathbf{S} \in \{-1,1\}^{n\times n}$. This assumption is reasonable because in many cases we can always get the semantic label information between two points. Furthermore, our model in this paper can be easily adapted to the cases with missing entries.

The goal of supervised hashing is to learn a binary code matrix $\mathbf{B} \in \{-1,1\}^{n\times q}$, where $\mathbf{B}_{i*}$ denotes the $q$-bit code for training point $i$. Furthermore, the learned binary codes should preserve the semantic similarity in $\mathbf{S}$. Although the goal of supervised hashing can be formulated by different optimization problems, the commonly used one is as follows:

$$\min_{\mathbf{B}\in\{-1,1\}^{n\times q}} \|q\mathbf{S} - \mathbf{B}\mathbf{B}^T\|_F^2, \qquad (1)$$

where $\|q\mathbf{S} - \mathbf{B}\mathbf{B}^T\|_F^2 = \sum_{i=1}^n\sum_{j=1}^n (qS_{ij} - \mathbf{B}_{i*}\mathbf{B}_{j*}^T)^2$.

The main idea of (1) is to adopt the inner product of two binary codes, which reflects the opposite of the Hamming distance, to approximate the similarity label with the square loss. This model has been widely used in many supervised hashing methods (Liu et al. 2012; Lin et al. 2013a; Zhang and Li 2014; Lin et al. 2014; Xia et al. 2014; Leng et al. 2014).
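To make the objective in (1) concrete, the following minimal NumPy sketch (variable names are ours, not from the paper) evaluates the loss for a given binary code matrix; it is only an illustration of the model, not the optimization itself:

```python
import numpy as np

def pairwise_square_loss(S, B, q):
    """Evaluate ||q*S - B B^T||_F^2, the objective of problem (1).

    S: (n, n) semantic similarity matrix with entries in {-1, +1}.
    B: (n, q) binary code matrix with entries in {-1, +1}.
    q: code length (number of bits).
    """
    residual = q * S - B @ B.T           # q*S_ij - <B_i*, B_j*>
    return float(np.sum(residual ** 2))  # squared Frobenius norm

# Toy usage: random codes for n = 6 points with q = 4 bits.
rng = np.random.default_rng(0)
n, q = 6, 4
B = rng.choice([-1, 1], size=(n, q))
S = np.sign(B @ B.T + 0.5)               # a +/-1 similarity matrix consistent with B
print(pairwise_square_loss(S, B, q))
```

Note that even evaluating this loss requires the full $n\times n$ matrix $\mathbf{S}$, which is one motivation for the column sampling strategy discussed later.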
LFH (Zhang et al. 2014) also uses the inner product to approximate the similarity label, but it uses the logistic loss rather than the square loss in (1).

Problem (1) is a discrete optimization problem which is hard to solve. Most existing methods optimize it by dropping the discrete constraint (Liu et al. 2012; Lin et al. 2013a). To the best of our knowledge, only one method, called FastH (Lin et al. 2014), has been proposed to directly solve the discrete optimization problem in (1). Due to the difficulty of discrete optimization, FastH adopts a bit-wise learning strategy which uses a Block Graph-Cut method to find local optima and learns one bit at a time. The experiments of FastH show that FastH can achieve better accuracy than other supervised methods with continuous relaxation.

It is easy to see that both the time complexity and the storage complexity are $O(n^2)$ if all the supervised information in $\mathbf{S}$ is used for training. Hence, all the existing methods, including relaxation-based continuous optimization methods and discrete optimization methods, sample only a small subset with $m$ ($m < n$) points for training, where $m$ is typically several thousand even if we are given a large-scale training set. Because some training points are discarded, all these existing methods cannot achieve satisfactory accuracy.

Therefore, to get satisfactory accuracy, we need to solve the problem in (1) from two aspects. On one hand, we need to adopt a proper sampling strategy to effectively exploit all the $n$ available points for training. On the other hand, we need to design strategies for discrete optimization. This motivates the work in this paper.

COSDISH

This section presents the details of our proposed method called COSDISH. More specifically, we try to solve the two aspects stated above, sampling and discrete optimization, for the supervised hashing problem.

Column Sampling

As stated above, both the time complexity and the storage complexity are $O(n^2)$ if all the supervised information in $\mathbf{S}$ is used for training. Hence, we have to perform sampling for training, which is actually adopted by almost all existing methods, such as KSH, TSH, FastH and LFH. However, all existing methods except LFH sample only a small subset with $m$ ($m < n$) points for training and discard the rest of the training points, which leads to unsatisfactory accuracy.
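To get a feel for the $O(n^2)$ storage cost, a rough back-of-the-envelope calculation for the NUS-WIDE-scale setting used later in the paper (the 8-bytes-per-entry assumption is ours, not from the paper):

```python
n = 209_347          # training points (NUS-WIDE after filtering, see Experiment)
q = 64               # code length; |Omega| = q columns sampled per iteration
bytes_per_entry = 8  # assumption: double-precision storage

full_S = n * n * bytes_per_entry               # full n x n similarity matrix
sampled_S = n * q * bytes_per_entry            # n x |Omega| sampled columns
print(f"full S:    {full_S / 1e9:.1f} GB")     # roughly 350 GB
print(f"sampled S: {sampled_S / 1e9:.3f} GB")  # roughly 0.1 GB
```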
The special case is LFH, which proposes to sample several columns from $\mathbf{S}$ in each iteration, with several iterations performed for training. In this paper, we adopt the same column sampling method as LFH (Zhang et al. 2014) for our COSDISH. Unlike LFH, which adopts continuous relaxation for learning, in COSDISH we propose a novel discrete optimization (learning) method based on column sampling.

More specifically, in each iteration we randomly sample a subset $\Omega$ of $N = \{1, 2, \ldots, n\}$ and then choose the semantic similarity between all $n$ points and those points indexed by $\Omega$. That is to say, we sample $|\Omega|$ columns of $\mathbf{S}$ with the column numbers being indexed by $\Omega$. We use $\widetilde{\mathbf{S}} \in \{-1,1\}^{n\times|\Omega|}$ to denote the sampled sub-matrix of similarity.

We can find that there exist two different kinds of points in each iteration, one being those indexed by $\Omega$ and the other being those indexed by $\Gamma = N - \Omega$. We use $\widetilde{\mathbf{S}}^\Omega \in \{-1,1\}^{|\Omega|\times|\Omega|}$ to denote the sub-matrix formed by the rows of $\widetilde{\mathbf{S}}$ indexed by $\Omega$. $\widetilde{\mathbf{S}}^\Gamma \in \{-1,1\}^{|\Gamma|\times|\Omega|}$, $\mathbf{B}^\Omega \in \{-1,1\}^{|\Omega|\times q}$ and $\mathbf{B}^\Gamma \in \{-1,1\}^{|\Gamma|\times q}$ are defined in a similar way. According to the problem in (1), the problem associated with the sampled columns in each iteration can be reformulated as follows:

$$\min_{\mathbf{B}^\Omega,\mathbf{B}^\Gamma} \|q\widetilde{\mathbf{S}}^\Gamma - \mathbf{B}^\Gamma[\mathbf{B}^\Omega]^T\|_F^2 + \|q\widetilde{\mathbf{S}}^\Omega - \mathbf{B}^\Omega[\mathbf{B}^\Omega]^T\|_F^2. \qquad (2)$$

Alternating Optimization

We propose an alternating optimization strategy, which contains several iterations with each iteration divided into two alternating steps, to solve the problem in (2). More specifically, we update $\mathbf{B}^\Gamma$ with $\mathbf{B}^\Omega$ fixed, and then update $\mathbf{B}^\Omega$ with $\mathbf{B}^\Gamma$ fixed. This two-step alternating optimization procedure is repeated several times.

Update $\mathbf{B}^\Gamma$ with $\mathbf{B}^\Omega$ Fixed

When $\mathbf{B}^\Omega$ is fixed, the objective function of $\mathbf{B}^\Gamma$ is given by:

$$f_2(\mathbf{B}^\Gamma) = \|q\widetilde{\mathbf{S}}^\Gamma - \mathbf{B}^\Gamma[\mathbf{B}^\Omega]^T\|_F^2, \quad \mathbf{B}^\Gamma \in \{-1,1\}^{|\Gamma|\times q}.$$

Minimizing $f_2(\cdot)$ can be viewed as a discrete least squares problem. By relaxing $\mathbf{B}^\Gamma$ into continuous values, it is easy to get the optimal continuous solution. However, after we quantize the continuous solution into a discrete one, the error caused by quantization cannot be bounded. Hence, even if the continuous solution is optimal for the relaxed continuous problem, the discrete solution might be far away from the optimal solution of the original problem.

Here, we design a method to solve it in a discrete way with a constant-approximation bound. Let us consider the following problem $f_1(\cdot)$, which changes the loss from the Frobenius norm in $f_2(\cdot)$ to the $L_1$ norm:

$$f_1(\mathbf{B}^\Gamma) = \|q\widetilde{\mathbf{S}}^\Gamma - \mathbf{B}^\Gamma[\mathbf{B}^\Omega]^T\|_1, \quad \mathbf{B}^\Gamma \in \{-1,1\}^{|\Gamma|\times q}. \qquad (3)$$

It is easy to find that $f_1(\cdot)$ reaches its minimum when $\mathbf{B}^\Gamma = \mathrm{sgn}(\widetilde{\mathbf{S}}^\Gamma\mathbf{B}^\Omega)$. Furthermore, we have the following theorem.

Theorem 1. Suppose that $f_1(\mathbf{F}_1^*)$ and $f_2(\mathbf{F}_2^*)$ reach their minimum at the points $\mathbf{F}_1^*$ and $\mathbf{F}_2^*$, respectively. We have $f_2(\mathbf{F}_1^*) \leq 2q\, f_2(\mathbf{F}_2^*)$.

The proof of Theorem 1 can be found in the supplementary material (available at http://cs.nju.edu.cn/lwj/paper/COSDISH_sup.pdf).

That is to say, if we use the solution of $f_1(\cdot)$, i.e., $\mathbf{B}^\Gamma = \mathbf{F}_1^* = \mathrm{sgn}(\widetilde{\mathbf{S}}^\Gamma\mathbf{B}^\Omega)$, as the solution of $f_2(\cdot)$, it is a $2q$-approximation solution. Because $q$ is usually small, we can say that it is a constant-approximation solution, which means that we can get an error bound for the original problem.

Because the elements of $\widetilde{\mathbf{S}}^\Gamma\mathbf{B}^\Omega$ can be zero, in practice we set $\mathbf{B}^\Gamma_{(t)} = \widetilde{\mathrm{sgn}}(\widetilde{\mathbf{S}}^\Gamma\mathbf{B}^\Omega, \mathbf{B}^\Gamma_{(t-1)})$, where

$$\widetilde{\mathrm{sgn}}(b_1, b_2) = \begin{cases} 1 & b_1 > 0 \\ b_2 & b_1 = 0 \\ -1 & b_1 < 0 \end{cases}$$

$t$ is the iteration number, and $\widetilde{\mathrm{sgn}}(\cdot,\cdot)$ is applied in an element-wise manner.
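The closed-form $\mathbf{B}^\Gamma$ update is simple enough to state directly in code. The sketch below (NumPy; variable names are ours) samples $|\Omega|$ columns, splits $\mathbf{S}$ and $\mathbf{B}$ as described above, and applies the sign update with the tie-breaking rule $\widetilde{\mathrm{sgn}}$:

```python
import numpy as np

def update_B_gamma(S_tilde_gamma, B_omega, B_gamma_prev):
    """B^Gamma update: sgn(S~^Gamma B^Omega), keeping the previous bit wherever
    the product is exactly zero (the element-wise tie-breaking rule above)."""
    scores = S_tilde_gamma @ B_omega           # |Gamma| x q
    B_gamma = np.sign(scores)
    B_gamma[scores == 0] = B_gamma_prev[scores == 0]
    return B_gamma.astype(int)

# Column sampling and the Omega/Gamma split (toy sizes).
rng = np.random.default_rng(1)
n, q = 50, 8
A = rng.standard_normal((n, n))
S = np.sign(A + A.T + 1e-9)                    # placeholder symmetric +/-1 similarity
B = rng.choice([-1, 1], size=(n, q))           # current binary codes

omega = rng.choice(n, size=q, replace=False)   # |Omega| = q sampled column indices
gamma = np.setdiff1d(np.arange(n), omega)
S_tilde = S[:, omega]                          # n x |Omega| sampled columns of S
S_tilde_omega, S_tilde_gamma = S_tilde[omega], S_tilde[gamma]   # S~^Omega, S~^Gamma
B_omega, B_gamma = B[omega], B[gamma]

B_gamma = update_B_gamma(S_tilde_gamma, B_omega, B_gamma)
```

Here `S_tilde_omega` is not used by this step; it enters the $\mathbf{B}^\Omega$ sub-problem described next.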
Update $\mathbf{B}^\Omega$ with $\mathbf{B}^\Gamma$ Fixed

When $\mathbf{B}^\Gamma$ is fixed, the sub-problem of $\mathbf{B}^\Omega$ is given by:

$$\min_{\mathbf{B}^\Omega\in\{-1,1\}^{|\Omega|\times q}} \|q\widetilde{\mathbf{S}}^\Gamma - \mathbf{B}^\Gamma[\mathbf{B}^\Omega]^T\|_F^2 + \|q\widetilde{\mathbf{S}}^\Omega - \mathbf{B}^\Omega[\mathbf{B}^\Omega]^T\|_F^2.$$

Inspired by TSH (Lin et al. 2013a), we can transform the above problem into $q$ binary quadratic programming (BQP) problems. The optimization of the $k$th bit of $\mathbf{B}^\Omega$ is given by:

$$\min_{\mathbf{b}^k\in\{-1,1\}^{|\Omega|}} [\mathbf{b}^k]^T\mathbf{Q}^{(k)}\mathbf{b}^k + [\mathbf{b}^k]^T\mathbf{p}^{(k)}, \qquad (4)$$

where $\mathbf{b}^k$ denotes the $k$th column of $\mathbf{B}^\Omega$, and

$$Q^{(k)}_{i,j}\Big|_{i\neq j} = -2\Big(q\widetilde{S}^\Omega_{i,j} - \sum_{m=1}^{k-1} b^m_i b^m_j\Big), \qquad Q^{(k)}_{i,i} = 0,$$

$$p^{(k)}_i = -2\sum_{l=1}^{|\Gamma|} B^\Gamma_{l,k}\Big(q\widetilde{S}^\Gamma_{l,i} - \sum_{m=1}^{k-1} B^\Gamma_{l,m} B^\Omega_{i,m}\Big).$$

Note that the formulation in (4) is not the same as that in TSH due to the additional linear term. More details about the above derivation can be found in the supplementary material.

Then, we need to turn the problem in (4) into a standard BQP form in which the domain of the binary variables is $\{0,1\}$ and there are no linear terms. First, we transform the domain of $\mathbf{b}^k$ from $\{-1,1\}^{|\Omega|}$ to $\{0,1\}^{|\Omega|}$. Let $\bar{\mathbf{b}}^k = \frac{1}{2}(\mathbf{b}^k + \mathbf{1})$. We have:

$$[\mathbf{b}^k]^T\mathbf{Q}^{(k)}\mathbf{b}^k + [\mathbf{b}^k]^T\mathbf{p}^{(k)} = 4\sum_{m=1}^{|\Omega|}\sum_{l=1}^{|\Omega|} \bar{b}^k_m\bar{b}^k_l Q^{(k)}_{m,l} + 2\sum_{m=1}^{|\Omega|} \bar{b}^k_m\Big(p^{(k)}_m - \sum_{l=1}^{|\Omega|}\big(Q^{(k)}_{m,l} + Q^{(k)}_{l,m}\big)\Big) + \mathrm{const},$$

where $\mathrm{const}$ is a constant. Hence, problem (4) can be reformulated as follows:

$$\min_{\bar{\mathbf{b}}^k\in\{0,1\}^{|\Omega|}} [\bar{\mathbf{b}}^k]^T\bar{\mathbf{Q}}^{(k)}\bar{\mathbf{b}}^k + [\bar{\mathbf{b}}^k]^T\bar{\mathbf{p}}^{(k)}, \qquad (5)$$

where $\bar{\mathbf{Q}}^{(k)} = 4\mathbf{Q}^{(k)}$ and $\bar{p}^{(k)}_i = 2\big[p^{(k)}_i - \sum_{l=1}^{|\Omega|}\big(Q^{(k)}_{i,l} + Q^{(k)}_{l,i}\big)\big]$. More details about the above derivation can be found in the supplementary material.
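For concreteness, the BQP coefficients of (4) and the shift to the $\{0,1\}$ form in (5) can be assembled as follows (a NumPy sketch under our own naming, not the authors' code; note that the bit index is 0-based here):

```python
import numpy as np

def build_bqp(k, q, S_omega, S_gamma, B_omega, B_gamma):
    """Coefficients Q^(k), p^(k) of problem (4) for the k-th bit (0-based k),
    using the first k already-updated columns of B^Omega, plus the {0,1} form of (5)."""
    # q*S~^Omega_{i,j} - sum_{m<k} b^m_i b^m_j
    resid_omega = q * S_omega - B_omega[:, :k] @ B_omega[:, :k].T
    Q = -2.0 * resid_omega
    np.fill_diagonal(Q, 0.0)                                  # Q^(k)_{i,i} = 0

    # q*S~^Gamma_{l,i} - sum_{m<k} B^Gamma_{l,m} B^Omega_{i,m}, shape |Gamma| x |Omega|
    resid_gamma = q * S_gamma - B_gamma[:, :k] @ B_omega[:, :k].T
    p = -2.0 * (B_gamma[:, k] @ resid_gamma)                  # p^(k)

    # Change of variables b = 2*b_bar - 1 gives the {0,1} form of problem (5).
    Q_bar = 4.0 * Q
    p_bar = 2.0 * (p - Q.sum(axis=1) - Q.sum(axis=0))
    return Q, p, Q_bar, p_bar
```

These $\bar{\mathbf{Q}}^{(k)}$ and $\bar{\mathbf{p}}^{(k)}$ then feed the (6)-(8) reformulation described next.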
Furthermore, a BQP can be turned into an equivalent form without linear terms (Yang 2013). That is to say, the problem in (5) can be rewritten as follows:

$$\min\ [\tilde{\mathbf{b}}^k]^T\widetilde{\mathbf{Q}}^{(k)}\tilde{\mathbf{b}}^k \quad \text{s.t.}\ \tilde{\mathbf{b}}^k\in\{0,1\}^{|\Omega|+1},\ \tilde{b}^k_{|\Omega|+1} = 1, \qquad (6)$$

where

$$\tilde{\mathbf{b}}^k = \begin{bmatrix}\bar{\mathbf{b}}^k \\ 1\end{bmatrix}, \qquad \widetilde{\mathbf{Q}}^{(k)} = \begin{bmatrix}\bar{\mathbf{Q}}^{(k)} & \frac{1}{2}\bar{\mathbf{p}}^{(k)} \\ \frac{1}{2}[\bar{\mathbf{p}}^{(k)}]^T & 0\end{bmatrix}.$$

A method proposed by (Yang 2013) can solve the problem in (6) with an additional constraint $\sum_{i=1}^{|\Omega|+1}\tilde{b}^k_i = \lceil(|\Omega|+1)/2\rceil$. We also add this constraint to our problem to get a balanced result for each bit, which has been widely used in hashing (Liu et al. 2014). So we reformulate problem (6) with the constraint as follows:

$$\min\ [\tilde{\mathbf{b}}^k]^T\widetilde{\mathbf{Q}}^{(k)}\tilde{\mathbf{b}}^k \quad \text{s.t.}\ \tilde{\mathbf{b}}^k\in\{0,1\}^{M},\ \tilde{b}^k_{M} = 1,\ \sum_{i=1}^{M}\tilde{b}^k_i = H, \qquad (7)$$

where $M = |\Omega|+1$ and $H = \lceil M/2\rceil$.

As in (Yang 2013), we transform the problem in (7) into an equivalent clustering problem: given a dataset $\mathcal{U} = \{\mathbf{u}_i\in\mathbb{R}^M\}_{i=1}^M$, we want to find a subset $\mathcal{U}'$ of size $H$ such that the sum of the squared distances within the subset $\mathcal{U}'$ is minimized. It can be formulated as:

$$\min \sum_{\mathbf{u}\in\mathcal{U}'}\Big\|\mathbf{u} - \frac{1}{H}\sum_{\mathbf{v}\in\mathcal{U}'}\mathbf{v}\Big\|^2 \quad \text{s.t.}\ \mathcal{U}'\subseteq\mathcal{U},\ |\mathcal{U}'| = H,\ \mathbf{u}_M\in\mathcal{U}'. \qquad (8)$$

Then, we have the following theorems.

Theorem 2. Let us use a matrix $\mathbf{U}$ of size $M\times M$ to denote the dataset $\mathcal{U}$ with $\mathbf{U}_{*i} = \mathbf{u}_i$. If $\mathbf{U}^T\mathbf{U} = \lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)} \succeq 0$, then (7) and (8) are equivalent.

Proof. We can use a method similar to that in (Yang 2013) to prove it.

According to (Yang 2013), we can always find such $\mathbf{U}$ and $\lambda$ to satisfy $\lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)} \succeq 0$ in Theorem 2. In practice, we can take a sufficiently large number as $\lambda$, and perform a Cholesky decomposition on $\lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)}$ to get $\mathbf{U}$.

Theorem 3. Assume $\mathcal{U}'_*$ is the global solution of (8) and $f(\mathcal{U}') = \sum_{\mathbf{u}\in\mathcal{U}'}\|\mathbf{u} - \frac{1}{H}\sum_{\mathbf{v}\in\mathcal{U}'}\mathbf{v}\|^2$ is the objective function of (8). There exists an algorithm which can find a solution $\mathcal{U}'_1$ with $f(\mathcal{U}'_1) \leq 2f(\mathcal{U}'_*)$. That is to say, there exists a 2-approximation algorithm for (8).

Proof. Please refer to (Yang 2013) for the proof and the algorithm.

The Whole Algorithm

The whole learning (optimization) algorithm is summarized in Algorithm 1.

Algorithm 1 Discrete optimization in COSDISH
Input: $\mathbf{S}\in\{-1,1\}^{n\times n}$, $q$, $T_{sto}$, $T_{alt}$, $|\Omega|$.
Initialize the binary code $\mathbf{B}$ by randomization.
for $iter = 1 \to T_{sto}$ do
    Sample $|\Omega|$ columns of $\mathbf{S}$ to get $\widetilde{\mathbf{S}}$.
    Let $\Omega$ be the set of sampled column indices and $\Gamma = N - \Omega$, with $N = \{1, 2, \ldots, n\}$.
    Split $\widetilde{\mathbf{S}}$ into $\widetilde{\mathbf{S}}^\Omega$ and $\widetilde{\mathbf{S}}^\Gamma$.
    Split $\mathbf{B}$ into $\mathbf{B}^\Omega_{(0)}$ and $\mathbf{B}^\Gamma_{(0)}$.
    for $t = 1 \to T_{alt}$ do
        for $k = 1 \to q$ do
            Construct problem (4) from $\mathbf{B}^\Gamma_{(t-1)}$, $\widetilde{\mathbf{S}}^\Omega$, $\widetilde{\mathbf{S}}^\Gamma$ and the first $k-1$ columns of $\mathbf{B}^\Omega_{(t)}$.
            Construct problem (5) from problem (4).
            Construct problem (7) from problems (5) and (6).
            Construct problem (8) from problem (7) by performing a Cholesky decomposition.
            Use the 2-approximation algorithm (Yang 2013) to solve problem (8) and acquire the $k$th column of $\mathbf{B}^\Omega_{(t)}$.
        end for
        $\mathbf{B}^\Gamma_{(t)} = \widetilde{\mathrm{sgn}}(\widetilde{\mathbf{S}}^\Gamma\mathbf{B}^\Omega_{(t)}, \mathbf{B}^\Gamma_{(t-1)})$.
    end for
    Recover $\mathbf{B}$ by combining $\mathbf{B}^\Omega_{(t)}$ and $\mathbf{B}^\Gamma_{(t)}$.
end for
Output: $\mathbf{B}\in\{-1,1\}^{n\times q}$

In general, $10 \leq T_{sto} \leq 20$, $3 \leq T_{alt} \leq 10$ and $|\Omega| \geq q$ are enough to get satisfactory performance. Unless otherwise stated, we set $T_{sto} = 10$, $T_{alt} = 3$ and $|\Omega| = q$ in our experiments. Furthermore, in our experiments we find that our algorithm is not sensitive to the initialization. Hence, we adopt random initialization in this paper.
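To illustrate the (7)-to-(8) reduction that Algorithm 1 relies on, here is a small NumPy sketch: it builds $\widetilde{\mathbf{Q}}^{(k)}$ from $\bar{\mathbf{Q}}^{(k)}$ and $\bar{\mathbf{p}}^{(k)}$, factors $\lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)}$ by Cholesky to obtain $\mathbf{U}$, and then, for illustration only, solves the size-constrained clustering problem by brute force on a tiny instance. The brute-force step is our placeholder; at realistic sizes it would be replaced by the 2-approximation algorithm of Yang (2013).

```python
import numpy as np
from itertools import combinations

def solve_bit_via_clustering(Q_bar, p_bar):
    """Solve problem (7) exactly on a tiny instance through the clustering form (8).
    Brute-force subset search stands in for the 2-approximation of Yang (2013)."""
    M = Q_bar.shape[0] + 1
    H = (M + 1) // 2                                       # H = ceil(M/2)
    Q_tilde = np.zeros((M, M))
    Q_tilde[:M-1, :M-1] = Q_bar
    Q_tilde[:M-1, -1] = 0.5 * p_bar
    Q_tilde[-1, :M-1] = 0.5 * p_bar                        # matrix of problem (6)

    lam = np.abs(Q_tilde).sum() + 1.0                      # large enough: lam*I - Q_tilde is PD
    U = np.linalg.cholesky(lam * np.eye(M) - Q_tilde).T    # U^T U = lam*I - Q_tilde; columns u_i

    best, best_subset = np.inf, None
    for rest in combinations(range(M - 1), H - 1):         # u_M must always be selected
        idx = list(rest) + [M - 1]
        pts = U[:, idx]                                    # the selected u_i as columns
        centred = pts - pts.mean(axis=1, keepdims=True)
        obj = float((centred ** 2).sum())                  # objective of (8)
        if obj < best:
            best, best_subset = obj, idx
    b_tilde = np.zeros(M)
    b_tilde[best_subset] = 1                               # indicator solves (7)
    return (2 * b_tilde[:M-1] - 1).astype(int)             # back to the {-1,+1} bit vector b^k
```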
Remark 1. Unlike other hashing methods such as TSH (Lin et al. 2013a), which try to solve the whole problem as $q$ BQP problems, our alternating optimization strategy decomposes the whole problem into two sub-problems. Typically, $|\Omega| \ll |\Gamma|$, i.e., the number of variables in $\mathbf{B}^\Omega$ is far less than that in $\mathbf{B}^\Gamma$. Furthermore, the cost to get a solution for problem (3) is much lower than that to get a solution for a BQP. Hence, the key idea of our alternating optimization strategy is to adopt a faster solution for the larger sub-problem, which makes our strategy much faster than TSH. Moreover, the faster solution of our strategy can also guarantee accuracy, which will be verified in the experiments.

TSH adopts LBFGSB to solve the BQP problem in a continuous-relaxation way. We found that if LBFGSB is used to solve the BQP problem in our method (i.e., problem (6)), the accuracy of our method deteriorates dramatically. The reason is that the solution of LBFGSB hurts the quality of the solution of $\mathbf{B}^\Gamma$.

In addition, the graph-cut method used in FastH cannot be adapted to solve our problem (6), because problem (6) does not satisfy the sub-modular property required by the graph-cut method.
Soft Constraints

As mentioned in (Leng et al. 2014), when we can only get a subset of the semantic information, pushing two dissimilar points to have the maximum Hamming distance may lead to over-fitting and unexpected results. Moreover, the number of dissimilar labels is typically far larger than that of similar labels. Hence, we can also view it as a class-imbalance problem between positive and negative labels. Inspired by (Leng et al. 2014), we change the element $-1$ in our similarity matrix $\widetilde{\mathbf{S}}$ to a real value $0 < \beta < 1$. More specifically, we empirically set $\beta$ to the ratio of the number of $+1$ entries to the number of $-1$ entries in $\widetilde{\mathbf{S}}$.

Please note that although soft constraints can further improve performance, the superior performance of our method mainly comes from the learning procedure rather than the soft constraints. Empirical verification of this can be found in the supplementary material.

Out-of-Sample Extension

Many supervised hashing methods can be viewed as two-step methods (Lin et al. 2013a): learn the binary code in the first step, and then train $q$ binary classifiers based on the feature matrix $\mathbf{X}$ and the learned code matrix $\mathbf{B}$ in the second step, with each bit corresponding to one classifier (Zhang et al. 2014; Lin et al. 2013a; 2014; Xia et al. 2014). Besides those methods using linear classifiers (Wang, Kumar, and Chang 2010a; Zhang et al. 2014), some other methods use more powerful nonlinear classifiers, such as SVM with RBF kernel (Lin et al. 2013a), deep convolutional networks (Xia et al. 2014), boosted decision trees (Lin et al. 2014) and so on. In general, the more powerful the classifiers we use for out-of-sample extension, the better the accuracy we can achieve, and also the more training time will be consumed (Lin et al. 2013a; 2014). FastH (Lin et al. 2014) adopts an efficient implementation of boosted decision trees for out-of-sample extension, which shows better accuracy and less training time than other methods like KSH and TSH with nonlinear classifiers.

Our COSDISH is also a two-step method. For out-of-sample extension, COSDISH chooses the linear classifier and the boosted decision trees in FastH to get two different variants. We will empirically evaluate these two variants in our experiments.

Complexity Analysis

The time complexity to construct problem (4) is $O(|\Gamma|\times|\Omega| + |\Omega|^2)$. Both the time complexity to construct problem (5) and that for problem (7) are $O(|\Omega|^2)$. Performing the Cholesky decomposition on $\lambda\mathbf{I} - \widetilde{\mathbf{Q}}^{(k)}$ needs $O(|\Omega|^3)$. The 2-approximation algorithm to solve the clustering problem needs $O(|\Omega|^3 + |\Omega|^2\log|\Omega|)$. For the $\mathbf{B}^\Omega$-subproblem, we need to solve $q$ BQP problems. Hence, the complexity of the $\mathbf{B}^\Omega$-subproblem is $O(q\times(|\Gamma|\times|\Omega| + |\Omega|^3))$. In addition, the time complexity of the $\mathbf{B}^\Gamma$-subproblem is $O(q\times|\Gamma|\times|\Omega|)$. Therefore, the total time complexity is $O(T_{sto}\times T_{alt}\times q\times(|\Gamma|\times|\Omega| + |\Omega|^3))$, and the space complexity is $O(|\Gamma|\times|\Omega| + |\Omega|^2)$. If we take $|\Omega| = q$, then the time complexity is $O(T_{sto}\times T_{alt}\times(nq^2 + q^4))$, which is linear in $n$. Typically, $q$ is very small, e.g., less than 64. Hence, our method is scalable.
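Returning to the out-of-sample extension above, here is a minimal sketch of the linear-classifier variant (our own illustration; the paper does not specify the exact linear model or regularization, so a ridge-style per-bit least-squares fit is assumed):

```python
import numpy as np

def fit_linear_hash_functions(X, B, reg=1.0):
    """Fit one linear predictor per bit: W minimizes ||X W - B||_F^2 + reg*||W||_F^2.
    X: (n, d) feature matrix, B: (n, q) learned codes in {-1, +1}."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ B)   # (d, q)
    return W

def hash_new_points(X_new, W):
    """q-bit codes for unseen points: sign of the linear responses."""
    codes = np.sign(X_new @ W)
    codes[codes == 0] = 1          # break exact ties toward +1
    return codes.astype(int)
```

The COSDISH_BT variant replaces this linear step with the boosted decision trees used in FastH.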
Experiment

We use real datasets to evaluate the effectiveness of our method. All the experiments are conducted on a workstation with 6 Intel Xeon CPU cores and 48GB RAM.

Dataset

Two image datasets with semantic labels are used to evaluate our method and the other baselines: CIFAR-10 (Krizhevsky 2009) and NUS-WIDE (Chua et al. 2009). Both of them have been widely used for hashing evaluation (Lin et al. 2013a; Zhang et al. 2014). Each instance in CIFAR-10 has a single label, and each instance in NUS-WIDE might have multiple labels.

CIFAR-10 contains 60,000 images. Each image is represented by a 512-dimensional GIST feature vector extracted from the original 32 × 32 color image. Each image is manually labeled as one of ten classes. Two images are considered to be semantically similar if they share the same class label. Otherwise, they are treated as semantically dissimilar.

NUS-WIDE includes 269,648 images crawled from Flickr with 81 ground-truth labels (tags). Each image is represented by a 1134-dimensional feature vector after feature extraction. Each image might be associated with multiple labels. There also exist some images without any label, which are not suitable for our evaluation. After removing those images without any label, we get 209,347 images for our experiment. We consider two images to be semantically similar if they share at least one common label. Otherwise, they are semantically dissimilar.

For all the datasets, we normalize the feature vectors to make each dimension have zero mean and equal variance.

Experimental Settings and Baselines

As in LFH (Zhang et al. 2014), for all datasets we randomly choose 1000 points as the validation set and 1000 points as the query (test) set, with the rest of the points as the training set. All experimental results are the average values over 10 independent random partitions.

Unless otherwise stated, COSDISH refers to the variant with soft constraints because in most cases it outperforms the variant without soft constraints (refer to the supplementary material). We use COSDISH to denote our method with the linear classifier for out-of-sample extension, and COSDISH_BT to denote our method with boosted decision trees for out-of-sample extension.

Because existing studies (Lin et al. 2013a; Zhang et al. 2014) have shown that supervised methods can outperform unsupervised methods, we only compare our method with representative supervised hashing methods, including SPLH (Wang, Kumar, and Chang 2010b), KSH (Liu et al. 2012), TSH (Lin et al. 2013a), LFH (Zhang et al. 2014), FastH (Lin et al. 2014) and SDH (Shen et al. 2015).
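A sketch of the preprocessing and data split described above (our own code; details such as the random seed and the exact variance scaling are not specified in the paper):

```python
import numpy as np

def normalize_features(X):
    """Zero mean per dimension and unit (i.e., equal) variance, as described above."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0            # guard against constant dimensions
    return (X - mean) / std

def split_dataset(n, n_val=1000, n_query=1000, seed=0):
    """Random validation / query / training index partition of n points."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return perm[:n_val], perm[n_val:n_val + n_query], perm[n_val + n_query:]
```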
Table 1: Accuracy in terms of MAP. The best MAPs for each category are shown in boldface.

Method       | CIFAR-10                            | NUS-WIDE
             | 8 bits   16 bits  32 bits  64 bits  | 8 bits   16 bits  32 bits  64 bits
COSDISH      | 0.4986   0.5768   0.6191   0.6371   | 0.5454   0.5940   0.6218   0.6329
SDH          | 0.2642   0.3994   0.4145   0.4346   | 0.4739   0.4674   0.4908   0.4944
LFH          | 0.2908   0.4098   0.5446   0.6182   | 0.5437   0.5929   0.6025   0.6136
TSH          | 0.2365   0.3080   0.3455   0.3663   | 0.4593   0.4784   0.4857   0.4955
KSH          | 0.2334   0.2662   0.2923   0.3128   | 0.4275   0.4546   0.4645   0.4688
SPLH         | 0.1588   0.1635   0.1701   0.1730   | 0.3769   0.4077   0.4147   0.4071
COSDISH_BT   | 0.5856   0.6681   0.7079   0.7346   | 0.5819   0.6316   0.6618   0.6786
FastH        | 0.4230   0.5216   0.5970   0.6446   | 0.5014   0.5296   0.5541   0.5736

All the baselines are implemented with the source code provided by the corresponding authors. For LFH, we use the stochastic learning version with 50 iterations, where each iteration samples $q$ columns of the semantic similarity matrix as in (Zhang et al. 2014). For SPLH, KSH and TSH, we cannot use the entire training set for training due to their high time complexity. As in (Liu et al. 2012; Lin et al. 2013a), we randomly sample 2000 points as the training set for CIFAR-10 and 5000 points for NUS-WIDE. TSH can use different loss functions for training. For a fair comparison, the loss function of KSH, which is also the same as that of our COSDISH, is used for training TSH. The SVM with RBF kernel is used for out-of-sample extension in TSH. For KSH, the number of support vectors is 300 for CIFAR-10 and 1000 for NUS-WIDE. For FastH, boosted decision trees are used for out-of-sample extension. We use the entire training set for FastH training on CIFAR-10, and randomly sample 100,000 points for FastH training on NUS-WIDE. All the other hyperparameters and the initialization strategy are the same as those suggested by the authors of the methods. In our experiments, we choose $|\Omega| = q$ for COSDISH, which is the same as that in LFH. Actually, our method is not sensitive to $|\Omega|$.

Accuracy

The mean average precision (MAP) is a widely used metric for evaluating the accuracy of hashing (Zhang et al. 2014; Lin et al. 2014). Table 1 shows the MAP of our method and the baselines. The eight methods in Table 1 can be divided into two categories. The first category contains the first (top) six methods in Table 1, which use relatively weak classifiers for out-of-sample extension, and the second category contains the last (bottom) two methods in Table 1, which use a relatively strong classifier (i.e., boosted decision trees) for out-of-sample extension.

By comparing COSDISH to SPLH, KSH, TSH, LFH, SDH and FastH, we can find that COSDISH can outperform the other baselines in most cases. By comparing COSDISH_BT to COSDISH, we can find that boosted decision trees can achieve much better accuracy than a linear classifier for out-of-sample extension. By comparing COSDISH_BT to FastH, which also uses boosted decision trees for out-of-sample extension, we can find that COSDISH_BT can outperform FastH, which verifies the effectiveness of our discrete optimization and column sampling strategies. In sum, our COSDISH and COSDISH_BT can achieve the state-of-the-art accuracy.
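As a reference for how the MAP numbers in Table 1 are typically computed, a standard implementation is sketched below (our own code; the exact ranking and tie-handling conventions used in the paper are not specified):

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP for hashing retrieval.
    query_codes: (m, q) and db_codes: (n, q) binary codes in {-1, +1};
    relevance[i, j] is True if database point j is semantically similar to query i."""
    ap_list = []
    for i in range(query_codes.shape[0]):
        # Hamming distance via inner product: d_H = (q - <b_i, b_j>) / 2.
        dist = (db_codes.shape[1] - db_codes @ query_codes[i]) / 2
        order = np.argsort(dist, kind="stable")
        rel = relevance[i, order]
        if rel.sum() == 0:
            continue                      # no relevant item for this query
        ranks = np.arange(1, len(rel) + 1)
        precision_at_hit = np.cumsum(rel) / ranks
        ap_list.append(np.sum(precision_at_hit * rel) / rel.sum())
    return float(np.mean(ap_list))
```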
Scalability

We sample different numbers of training points from NUS-WIDE as the training set, and evaluate the scalability of COSDISH and the baselines under the assumption that every method must use all the sampled training points for learning. Table 2 reports the training time, where the symbol '-' means that we cannot finish the experiment due to out-of-memory errors.

Table 2: Training time (in seconds) on subsets of NUS-WIDE

Method        3K      10K     50K     100K    200K
COSDISH       5.6     8.0     33.7    67.7    162.2
SDH           3.9     11.8    66.2    126.9   248.2
LFH           14.3    16.3    27.8    40.8    85.9
TSH           922.2   27360   >50000  -       -
KSH           1104    4446    >50000  -       -
SPLH          25.3    185     -       -       -
COSDISH_BT    60.2    69.1    228.3   422.6   893.3
FastH         172.3   291.6   1451    3602    -

One can easily see that COSDISH, COSDISH_BT, SDH and LFH easily scale to datasets of size 200,000 (200K) or even larger, while the other methods either exceed the memory limit or consume too much time. LFH is more scalable than our COSDISH. However, as stated above, our COSDISH achieves better accuracy than LFH with only slightly increased time complexity. Hence, COSDISH is more practical than LFH for supervised hashing.

Conclusion

In this paper, we have proposed a novel model called COSDISH for supervised hashing. COSDISH can directly learn discrete hashing codes from semantic labels. Experiments on several datasets show that COSDISH can outperform other state-of-the-art methods in real applications.

Acknowledgements

This work is supported by the NSFC (No. 61472182, 61333014) and the Fundamental Research Funds for the Central Universities (No. 20620140510).
References Andoni, A., and Indyk, P. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Annual Symposium on Foundations of Computer Science. Chua, T.-S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: A real-world web image database from national university of singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval. Ge, T.; He, K.; and Sun, J. 2014. Graph cuts for supervised binary coding. In Proceedings of the European Conference on Computer Vision. Gong, Y., and Lazebnik, S. 2011. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Indyk, P., and Motwani, R. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Annual ACM Symposium on Theory of Computing. Jiang, Q.-Y., and Li, W.-J. 2015. Scalable graph hashing with feature transformation. In Proceedings of the International Joint Conference on Artificial Intelligence. Jin, Z.; Hu, Y.; Lin, Y.; Zhang, D.; Lin, S.; Cai, D.; and Li, X. 2013. Complementary projection hashing. In Proceedings of the IEEE International Conference on Computer Vision. Kong, W., and Li, W.-J. 2012. Isotropic hashing. In Proceedings of the Annual Conference on Neural Information Processing Systems. Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto. Kulis, B., and Grauman, K. 2009. Kernelized locality-sensitive hashing for scalable image search. In Proceedings of the IEEE International Conference on Computer Vision. Leng, C.; Cheng, J.; Wu, J.; Zhang, X.; and Lu, H. 2014. Supervised hashing with soft constraints. In Proceedings of the ACM International Conference on Conference on Information and Knowledge Management. Lin, G.; Shen, C.; Suter, D.; and Hengel, A. v. d. 2013a. A general two-step approach to learning-based hashing. In Proceedings of the IEEE International Conference on Computer Vision. Lin, Y.; Jin, R.; Cai, D.; Yan, S.; and Li, X. 2013b. Compressed hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Lin, G.; Shen, C.; Shi, Q.; van den Hengel, A.; and Suter, D. 2014. Fast supervised hashing with decision trees for high-dimensional data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Liu, W.; Wang, J.; Ji, R.; Jiang, Y.-G.; and Chang, S.-F. 2012. Supervised hashing with kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Liu, W.; Mu, C.; Kumar, S.; and Chang, S. 2014. Discrete graph hashing. In Proceedings of the Annual Conference on Neural Information Processing Systems. Norouzi, M., and Fleet, D. J. 2011. Minimal loss hashing for compact binary codes. In Proceedings of the International Conference on Machine Learning. Rastegari, M.; Choi, J.; Fakhraei, S.; Hal, D.; and Davis, L. S. 2013. Predictable dual-view hashing. In Proceedings of the International Conference on Machine Learning. Shen, F.; Shen, C.; Liu, W.; and Shen, H. T. 2015. Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Strecha, C.; Bronstein, A. A.; Bronstein, M. M.; and Fua, P. 2012. Ldahash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(1):66–78. Wang, J.; Kumar, O.; and Chang, S.-F. 2010a. 
Semi-supervised hashing for scalable image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Wang, J.; Kumar, S.; and Chang, S.-F. 2010b. Sequential projection learning for hashing with compact codes. In Proceedings of the International Conference on Machine Learning. Wang, Q.; Zhang, D.; and Si, L. 2013. Semantic hashing using tags and topic modeling. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Weiss, Y.; Torralba, A.; and Fergus, R. 2008. Spectral hashing. In Proceedings of the Annual Conference on Neural Information Processing Systems. Xia, R.; Pan, Y.; Lai, H.; Liu, C.; and Yan, S. 2014. Supervised hashing for image retrieval via image representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence. Xu, B.; Bu, J.; Lin, Y.; Chen, C.; He, X.; and Cai, D. 2013. Harmonious hashing. In Proceedings of the International Joint Conference on Artificial Intelligence. Yang, R. 2013. New Results on Some Quadratic Programming Problems. Ph.D. thesis, University of Illinois at Urbana-Champaign. Yu, Z.; Wu, F.; Yang, Y.; Tian, Q.; Luo, J.; and Zhuang, Y. 2014. Discriminative coupled dictionary hashing for fast cross-media retrieval. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Zhang, D., and Li, W.-J. 2014. Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the AAAI Conference on Artificial Intelligence. Zhang, D.; Wang, J.; Cai, D.; and Lu, J. 2010. Self-taught hashing for fast similarity search. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Zhang, Q.; Wu, Y.; Ding, Z.; and Huang, X. 2012. Learning hash codes for efficient content reuse detection. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Zhang, P.; Zhang, W.; Li, W.-J.; and Guo, M. 2014. Supervised hashing with latent factor models. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Zhang, D.; Wang, F.; and Si, L. 2011. Composite hashing with multiple information sources. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Zhen, Y., and Yeung, D.-Y. 2012. A probabilistic model for multimodal hash function learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Zhou, J.; Ding, G.; and Guo, Y. 2014. Latent semantic sparse hashing for cross-modal similarity search. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Zhu, X.; Huang, Z.; Shen, H. T.; and Zhao, X. 2013. Linear cross-modal hashing for efficient multimedia search. In Proceedings of the ACM International Conference on Multimedia.