Isotropic Hashing

Weihao Kong, Wu-Jun Li
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
{kongweihao,liwujun}@cs.sjtu.edu.cn

Abstract

Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Typically, the variances of different projected dimensions are different for existing projection functions such as principal component analysis (PCA). Using the same number of bits for different projected dimensions is unreasonable because larger-variance dimensions will carry more information. Although this viewpoint has been widely accepted by many researchers, it is still not verified by either theory or experiment because no methods have been proposed to find a projection with equal variances for different dimensions. In this paper, we propose a novel method, called isotropic hashing (IsoHash), to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets show that IsoHash can outperform its counterpart with different variances for different dimensions, which verifies the viewpoint that projections with isotropic variances will be better than those with anisotropic variances.

1 Introduction

Due to its fast query speed and low storage cost, hashing [1, 5] has been successfully used for approximate nearest neighbor (ANN) search [28]. The basic idea of hashing is to learn similarity-preserving binary codes for data representation. More specifically, each data point will be hashed into a compact binary string, and similar points in the original feature space should be hashed into close points in the hashcode space. Compared with the original feature representation, hashing has two advantages. One is the reduced storage cost, and the other is the constant or sub-linear query time complexity [28]. These two advantages make hashing a promising choice for efficient ANN search in massive data sets [1, 5, 6, 9, 10, 14, 15, 17, 20, 21, 23, 26, 29, 30, 31, 32, 33, 34].

Most existing hashing methods adopt some projection functions to project the original data into several dimensions of real values, and then each of these projected dimensions is quantized into one bit (zero or one) by thresholding. Locality-sensitive hashing (LSH) [1, 5] and its extensions [4, 18, 19, 22, 25] use simple random projections for hash functions. These methods are called data-independent methods because the projection functions are independent of training data. Another class of methods are called data-dependent methods, whose projection functions are learned from training data. Representative data-dependent methods include spectral hashing (SH) [31], anchor graph hashing (AGH) [21], sequential projection learning (SPL) [29], principal component analysis [13] based hashing (PCAH) [7], and iterative quantization (ITQ) [7, 8]. SH learns the hashing functions based on spectral graph partitioning. AGH adopts anchor graphs to speed up the computation of graph Laplacian eigenvectors, based on which the Nyström method is used to compute projection functions. SPL learns the projection functions in a sequential way in which each function is designed to correct the errors caused by the previous one. PCAH adopts principal component analysis (PCA) to learn the projection functions.
ITQ tries to learn an orthogonal rotation matrix to refine the initial projection matrix learned by PCA so that the quantization error of mapping the data to the vertices of the binary hypercube is minimized.
Compared to the data-dependent methods, the data-independent methods need longer codes to achieve satisfactory performance [7].

For most existing projection functions such as those mentioned above, the variances of different projected dimensions are different. Many researchers [7, 12, 21] have argued that using the same number of bits for different projected dimensions with unequal variances is unreasonable because larger-variance dimensions will carry more information. Some methods [7, 12] apply an orthogonal transformation to the PCA-projected data with the expectation of balancing the variances of different PCA dimensions, and achieve better performance than the original PCA based hashing. However, to the best of our knowledge, there exist no methods which can guarantee to learn a projection with equal variances for different dimensions. Hence, the viewpoint that using the same number of bits for different projected dimensions is unreasonable has still not been verified by either theory or experiment.

In this paper, a novel hashing method, called isotropic hashing (IsoHash), is proposed to learn a projection function which can produce projected dimensions with isotropic variances (equal variances). To the best of our knowledge, this is the first work which can learn projections with isotropic variances for hashing. Experimental results on real data sets show that IsoHash can outperform its counterpart with anisotropic variances for different dimensions, which verifies the intuitive viewpoint that projections with isotropic variances will be better than those with anisotropic variances. Furthermore, the performance of IsoHash is also comparable, if not superior, to the state-of-the-art methods.

2 Isotropic Hashing

2.1 Problem Statement

Assume we are given n data points {x_1, x_2, · · · , x_n} with x_i ∈ R^d, which form the columns of the data matrix X ∈ R^{d×n}. Without loss of generality, in this paper the data are assumed to be zero centered, which means \sum_{i=1}^{n} x_i = 0. The basic idea of hashing is to map each point x_i into a binary string y_i ∈ {0, 1}^m with m denoting the code size. Furthermore, close points in the original space R^d should be hashed into similar binary codes in the code space {0, 1}^m to preserve the similarity structure in the original space. In general, we compute the binary code of x_i as y_i = [h_1(x_i), h_2(x_i), · · · , h_m(x_i)]^T with m binary hash functions {h_k(·)}_{k=1}^{m}.

Because it is NP-hard to directly compute the best binary functions h_k(·) for a given data set [31], most hashing methods adopt a two-stage strategy to learn h_k(·). In the projection stage, m real-valued projection functions {f_k(x)}_{k=1}^{m} are learned and each function can generate one real value. Hence, we have m projected dimensions, each of which corresponds to one projection function. In the quantization stage, the real values are quantized into a binary string by thresholding.

Currently, most methods use one bit to quantize each projected dimension. More specifically, h_k(x_i) = sgn(f_k(x_i)), where sgn(x) = 1 if x ≥ 0 and 0 otherwise. The only exceptions among the quantization methods are AGH [21], DBQ [14] and MH [15], which use two bits to quantize each dimension. In sum, all of these methods adopt the same number (either one or two) of bits for different projected dimensions. However, the variances of different projected dimensions are unequal, and larger-variance dimensions typically carry more information.
Hence, using the same number of bits for different projected dimensions with unequal variances is unreasonable, which has also been argued by many researchers [7, 12, 21]. Unfortunately, there exist no methods which can learn projection functions with equal variances for different dimensions. In the rest of this section, we present a novel model to learn projections with isotropic variances.

2.2 Model Formulation

The idea of our IsoHash method is to learn an orthogonal matrix to rotate the PCA projection matrix. To generate a code of m bits, PCAH performs PCA on X, and then uses the top m eigenvectors of the covariance matrix XX^T as columns of the projection matrix W ∈ R^{d×m}. Here, the top m eigenvectors are those corresponding to the m largest eigenvalues {λ_k}_{k=1}^{m}, generally arranged in non-increasing order λ_1 ≥ λ_2 ≥ · · · ≥ λ_m.
Hence, the projection functions of PCAH are defined as follows: f_k(x) = w_k^T x, where w_k is the kth column of W.

Let λ = [λ_1, λ_2, · · · , λ_m]^T and Λ = diag(λ), where diag(λ) denotes the diagonal matrix whose diagonal entries are formed from the vector λ. It is easy to prove that W^T XX^T W = Λ. Hence, the variance of the values {f_k(x_i)}_{i=1}^{n} on the kth projected dimension, which corresponds to the kth row of W^T X, is λ_k. Obviously, the variances for different PCA dimensions are anisotropic.

To get isotropic projection functions, the idea of our IsoHash method is to learn an orthogonal matrix Q ∈ R^{m×m} which makes Q^T W^T XX^T WQ become a matrix with equal diagonal values, i.e., [Q^T W^T XX^T WQ]_{11} = [Q^T W^T XX^T WQ]_{22} = · · · = [Q^T W^T XX^T WQ]_{mm}. Here, A_{ii} denotes the ith diagonal entry of a square matrix A, and a matrix Q is said to be orthogonal if Q^T Q = I, where I is an identity matrix whose dimensionality depends on the context. The effect of the orthogonal matrix Q is to rotate the coordinate axes while keeping the Euclidean distances between any two points unchanged. It is easy to prove that the new projection functions of IsoHash are f_k(x) = (WQ)_k^T x, which have the same (isotropic) variance. Here (WQ)_k denotes the kth column of WQ.

If we use tr(A) to denote the trace of a symmetric matrix A, we have the following Lemma 1.

Lemma 1. If Q^T Q = I, then tr(Q^T AQ) = tr(A).

Based on Lemma 1, we have tr(Q^T W^T XX^T WQ) = tr(W^T XX^T W) = tr(Λ) = \sum_{i=1}^{m} λ_i if Q^T Q = I. Hence, to make Q^T W^T XX^T WQ become a matrix with equal diagonal values, we should set this diagonal value to a = \frac{\sum_{i=1}^{m} λ_i}{m}. Let

a = [a_1, a_2, · · · , a_m] with a_i = a = \frac{\sum_{i=1}^{m} λ_i}{m},    (1)

and

T(z) = {T ∈ R^{m×m} | diag(T) = diag(z)},

where z is a vector of length m, and diag(T) is overloaded to denote a diagonal matrix with the same diagonal entries as the matrix T.

Based on our motivation of IsoHash, we can define the problem of IsoHash as follows:

Problem 1. The problem of IsoHash is to find an orthogonal matrix Q making Q^T W^T XX^T WQ ∈ T(a), where a is defined in (1).

Then, we have the following Theorem 1:

Theorem 1. Assume Q^T Q = I and T ∈ T(a). If Q^T ΛQ = T, then Q will be a solution to the problem of IsoHash.

Proof. Because W^T XX^T W = Λ, we have Q^T ΛQ = Q^T [W^T XX^T W]Q. It is obvious that Q will be a solution to the problem of IsoHash.

As in [2], we define

M(Λ) = {Q^T ΛQ | Q ∈ O(m)},    (2)

where O(m) is the set of all orthogonal matrices in R^{m×m}, i.e., those satisfying Q^T Q = I.

According to Theorem 1, the problem of IsoHash is equivalent to finding an orthogonal matrix Q for the following equation [2]:

||T − Z||_F = 0,    (3)

where T ∈ T(a), Z ∈ M(Λ), and || · ||_F denotes the Frobenius norm. Please note that for ease of understanding, we use the same notations as those in [2].
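To make the quantities above concrete, the following NumPy sketch (our own illustration; helper names such as pca_top_m are not from the paper) computes the PCA projection matrix W, checks that W^T XX^T W = Λ, and forms the isotropic target value a of equation (1):

import numpy as np

def pca_top_m(X, m):
    # Top-m eigenvalues (non-increasing) and eigenvectors of X X^T
    # for zero-centered X of shape (d, n).
    C = X @ X.T
    eigvals, eigvecs = np.linalg.eigh(C)          # ascending order
    idx = np.argsort(eigvals)[::-1][:m]           # indices of the m largest
    return eigvals[idx], eigvecs[:, idx]          # lambda (m,), W (d, m)

rng = np.random.default_rng(0)
d, n, m = 32, 1000, 8
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)             # zero-center: sum_i x_i = 0

lam, W = pca_top_m(X, m)
Lambda = W.T @ X @ X.T @ W                        # should equal diag(lambda)
print(np.allclose(Lambda, np.diag(lam)))          # True, up to numerical error

a = lam.sum() / m                                 # isotropic target of equation (1)
print(lam, a)                                     # anisotropic lambda_k vs. target a

At this point the per-dimension variances are the anisotropic λ_k; IsoHash then seeks an orthogonal Q for which every diagonal entry of Q^T ΛQ equals a.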
In the following content, we will use the Schur-Horn lemma [11] to prove that we can always find a solution to problem (3).

Lemma 2. [Schur-Horn Lemma] Let c = {c_i} ∈ R^m and b = {b_i} ∈ R^m be real vectors in non-increasing order, respectively¹, i.e., c_1 ≥ c_2 ≥ · · · ≥ c_m and b_1 ≥ b_2 ≥ · · · ≥ b_m. There exists a Hermitian matrix H with eigenvalues c and diagonal values b if and only if

\sum_{i=1}^{k} b_i ≤ \sum_{i=1}^{k} c_i, for any k = 1, 2, ..., m,

\sum_{i=1}^{m} b_i = \sum_{i=1}^{m} c_i.

Proof. Please refer to Horn's article [11].

¹Please note that in [2] the values are in increasing order. It is easy to prove that our presentation of the Schur-Horn lemma is equivalent to that in [2]. The non-increasing order is chosen here just because it facilitates our presentation due to the non-increasing order of the eigenvalues in Λ.

Based on Lemma 2, we have the following Theorem 2.

Theorem 2. There exists a solution to the IsoHash problem in (3). And this solution is in the intersection of T(a) and M(Λ).

Proof. Because λ_1 ≥ λ_2 ≥ · · · ≥ λ_m and a_1 = a_2 = · · · = a_m = \frac{\sum_{i=1}^{m} λ_i}{m}, it is easy to prove that \frac{\sum_{i=1}^{k} λ_i}{k} ≥ \frac{\sum_{i=1}^{m} λ_i}{m} for any k. Hence, \sum_{i=1}^{k} λ_i = k × \frac{\sum_{i=1}^{k} λ_i}{k} ≥ k × \frac{\sum_{i=1}^{m} λ_i}{m} = \sum_{i=1}^{k} a_i. Furthermore, we can prove that \sum_{i=1}^{m} λ_i = \sum_{i=1}^{m} a_i. According to Lemma 2, there exists a Hermitian matrix H with eigenvalues λ and diagonal values a. Moreover, we can prove that H is in the intersection of T(a) and M(Λ), i.e., H ∈ T(a) and H ∈ M(Λ).

According to Theorem 2, finding a Q that solves the problem in (3) is equivalent to finding the intersection point of T(a) and M(Λ), which is just an inverse eigenvalue problem called SHIEP in [2].
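As a quick sanity check of the majorization argument used in the proof of Theorem 2, the following snippet (an illustration with eigenvalues of our own choosing) verifies that the partial sums of a are dominated by those of λ while the total sums coincide:

import numpy as np

lam = np.array([5.0, 3.0, 1.5, 0.5])                    # lambda_1 >= ... >= lambda_m
a = np.full_like(lam, lam.mean())                       # a_i = sum(lambda)/m, equation (1)

print(np.all(np.cumsum(a) <= np.cumsum(lam) + 1e-12))   # partial sums of a are dominated
print(np.isclose(a.sum(), lam.sum()))                   # total sums agree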
2.3 Learning

The problem in (3) can be reformulated as the following optimization problem:

argmin_{Q: T ∈ T(a), Z ∈ M(Λ)} ||T − Z||_F.    (4)

As in [2], we propose two algorithms to learn Q: one is called lift and projection (LP), and the other is called gradient flow (GF). For ease of understanding, we use the same notations as those in [2], and some proofs of theorems are omitted. The readers can refer to [2] for the details.

2.3.1 Lift and Projection

The main idea of the lift and projection (LP) algorithm is to alternate between the following two steps:

• Lift step: Given a T^(k) ∈ T(a), we find the point Z^(k) ∈ M(Λ) such that ||T^(k) − Z^(k)||_F = dist(T^(k), M(Λ)), where dist(T^(k), M(Λ)) denotes the minimum distance between T^(k) and the points in M(Λ).

• Projection step: Given a Z^(k), we find T^(k+1) ∈ T(a) such that ||T^(k+1) − Z^(k)||_F = dist(T(a), Z^(k)), where dist(T(a), Z^(k)) denotes the minimum distance between Z^(k) and the points in T(a).

We call Z^(k) a lift of T^(k) onto M(Λ) and T^(k+1) a projection of Z^(k) onto T(a). The projection operation is easy to complete. Suppose Z^(k) = [z_{ij}]; then T^(k+1) = [t_{ij}] must be given by

t_{ij} = z_{ij}, if i ≠ j;  t_{ij} = a_i, if i = j.    (5)

For the lift operation, we have the following Theorem 3.

Theorem 3. Suppose T = Q^T DQ is an eigen-decomposition of T, where D = diag(d) with d = [d_1, d_2, ..., d_m]^T being T's eigenvalues, which are ordered as d_1 ≥ d_2 ≥ · · · ≥ d_m. Then the nearest neighbor of T in M(Λ) is given by

Z = Q^T ΛQ.    (6)

Proof. See Theorem 4.1 in [3].

Since in each step we minimize the distance between T and Z, we have

||T^(k) − Z^(k)||_F ≥ ||T^(k+1) − Z^(k)||_F ≥ ||T^(k+1) − Z^(k+1)||_F.

It is easy to see that (T^(k), Z^(k)) will converge to a stationary point. The whole IsoHash algorithm based on LP, abbreviated as IsoHash-LP, is briefly summarized in Algorithm 1.

Algorithm 1 Lift and projection based IsoHash (IsoHash-LP)
Input: X ∈ R^{d×n}, m ∈ N^+, t ∈ N^+
• [Λ, W] = PCA(X, m), as stated in Section 2.2.
• Generate a random orthogonal matrix Q_0 ∈ R^{m×m}.
• Z^(0) ← Q_0^T ΛQ_0.
• for k = 1 → t do
    Calculate T^(k) from Z^(k−1) by equation (5).
    Perform eigen-decomposition of T^(k) to get Q_k^T DQ_k = T^(k).
    Calculate Z^(k) from Q_k and Λ by equation (6).
• end for
• Y = sgn(Q_t^T W^T X).
Output: Y

Because M(Λ) is not a convex set, the stationary point we find is not necessarily inside the intersection of T(a) and M(Λ). For example, if we set Z^(0) = Λ, the lift and projection learning algorithm would make no progress because Z and T are already at a stationary point. To avoid such degenerate solutions, we initialize Z as Λ transformed by a random orthogonal matrix Q_0, as illustrated in Algorithm 1.
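Below is a minimal NumPy sketch of Algorithm 1 (IsoHash-LP); it follows equations (5) and (6) but uses our own function and variable names and should be read as an illustration rather than the authors' implementation:

import numpy as np

def isohash_lp(lam, t=100, seed=0):
    # lam: top-m PCA eigenvalues in non-increasing order.
    # Returns an orthogonal Q; binary codes are then sign(Q.T @ W.T @ X) as in Algorithm 1.
    m = len(lam)
    Lambda = np.diag(lam)
    a = lam.sum() / m                              # isotropic target, equation (1)

    rng = np.random.default_rng(seed)
    Q0, _ = np.linalg.qr(rng.standard_normal((m, m)))
    Z = Q0.T @ Lambda @ Q0                         # random start avoids the degenerate Z = Lambda
    Q = Q0

    for _ in range(t):
        T = Z.copy()                               # projection step, equation (5):
        np.fill_diagonal(T, a)                     # keep off-diagonals, force the diagonal to a
        _, V = np.linalg.eigh(T)                   # lift step, equation (6)
        V = V[:, ::-1]                             # eigenvector columns in non-increasing order
        Z = V @ Lambda @ V.T                       # nearest point to T in M(Lambda)
        Q = V.T
    return Q

# Example: diag(Q.T @ Lambda @ Q) approaches the isotropic value a.
lam = np.array([5.0, 3.0, 1.5, 0.5])
Q = isohash_lp(lam)
print(np.diag(Q.T @ np.diag(lam) @ Q))             # entries close to lam.mean() at convergence

After t iterations, diag(Q^T ΛQ) is approximately the isotropic vector a, so the rotated projection WQ yields dimensions with (nearly) equal variances.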
2.3.2 Gradient Flow

The other learning algorithm is a continuous one based on the construction of a gradient flow (GF) on the surface M(Λ) that moves towards the desired intersection point. Because there always exists a solution for the problem in (3) according to Theorem 2, the objective function in (4) can be reformulated as follows [2]:

min_{Q ∈ O(m)} F(Q) = \frac{1}{2} ||diag(Q^T ΛQ) − diag(a)||_F^2.    (7)

The details about how to optimize (7) can be found in [2]. We just show some key steps of the learning algorithm in the following content.

The gradient ∇F at Q can be calculated as

∇F(Q) = 2ΛQβ(Q),    (8)

where β(Q) = diag(Q^T ΛQ) − diag(a). Once we have computed the gradient of F, it can be projected onto the manifold O(m) according to the following Theorem 4.

Theorem 4. The projection of ∇F(Q) onto O(m) is given by

g(Q) = Q[Q^T ΛQ, β(Q)],    (9)

where [A, B] = AB − BA is the Lie bracket.

Proof. See formulas (20), (21) and (22) in [3].

The vector field Q̇ = −g(Q) defines a steepest descent flow on the manifold O(m) for the function F(Q). Letting Z = Q^T ΛQ and α(Z) = β(Q), we get

Ż = [Z, [α(Z), Z]],    (10)

where Ż is an isospectral flow that moves to reduce the objective function F(Q).

As stated by Theorems 3.3 and 3.4 in [2], a stable equilibrium point of (10) must satisfy β(Q) = 0, which means that F(Q) has decreased to zero. Hence, the gradient flow method can always find an intersection point as the solution. The whole IsoHash algorithm based on GF, abbreviated as IsoHash-GF, is briefly summarized in Algorithm 2.

Algorithm 2 Gradient flow based IsoHash (IsoHash-GF)
Input: X ∈ R^{d×n}, m ∈ N^+
• [Λ, W] = PCA(X, m), as stated in Section 2.2.
• Generate a random orthogonal matrix Q_0 ∈ R^{m×m}.
• Z^(0) ← Q_0^T ΛQ_0.
• Start integration from Z = Z^(0) with the gradient computed from equation (10).
• Stop integration when reaching a stable equilibrium point.
• Perform eigen-decomposition of Z to get Q^T ΛQ = Z.
• Y = sgn(Q^T W^T X).
Output: Y

We now discuss some implementation details of IsoHash-GF. Since all diagonal matrices in M(Λ) result in Ż = 0, one should not use Λ as the starting point. In our implementation, we use the same method as that in IsoHash-LP to avoid this degenerate case, i.e., a random orthogonal transformation matrix Q_0 is used to rotate Λ. To integrate Z with the gradient in (10), we use the Adams-Bashforth-Moulton PECE solver in [27], where the parameter RelTol is set to 10^{-3}. The relative error of the algorithm is computed by comparing the diagonal entries of Z to the target diag(a). The whole integration process is terminated when this relative error falls below 10^{-7}.
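The gradient flow of equation (10) can be sketched as follows. Here we use SciPy's general-purpose ODE integrator as a stand-in for the Adams-Bashforth-Moulton PECE solver of [27], and a fixed integration horizon replaces the paper's stopping rule based on the relative error of diag(Z), so this is only an illustrative approximation under our own naming:

import numpy as np
from scipy.integrate import solve_ivp

def isohash_gf(lam, t_max=50.0, seed=0):
    # lam: top-m PCA eigenvalues in non-increasing order.
    # Returns an orthogonal Q with diag(Q.T @ diag(lam) @ Q) approximately isotropic.
    m = len(lam)
    Lambda = np.diag(lam)
    a = np.full(m, lam.sum() / m)

    rng = np.random.default_rng(seed)
    Q0, _ = np.linalg.qr(rng.standard_normal((m, m)))
    Z0 = Q0.T @ Lambda @ Q0                        # random start, not Lambda itself

    def flow(_, z_flat):
        Z = z_flat.reshape(m, m)
        alpha = np.diag(np.diag(Z) - a)            # alpha(Z) = diag(Z) - diag(a)
        inner = alpha @ Z - Z @ alpha              # [alpha(Z), Z]
        return (Z @ inner - inner @ Z).ravel()     # equation (10): Z' = [Z, [alpha(Z), Z]]

    sol = solve_ivp(flow, (0.0, t_max), Z0.ravel(), rtol=1e-3, atol=1e-6)
    Z = sol.y[:, -1].reshape(m, m)

    # Recover Q from Q.T @ Lambda @ Q = Z, as in Algorithm 2.
    _, V = np.linalg.eigh(Z)
    return V[:, ::-1].T                            # columns reordered to match non-increasing lam

lam = np.array([5.0, 3.0, 1.5, 0.5])
Q = isohash_gf(lam)
print(np.diag(Q.T @ np.diag(lam) @ Q))             # entries near lam.mean() at equilibrium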
2.4 Complexity Analysis

The learning of our IsoHash method contains two phases: the first phase is PCA and the second phase is LP or GF. The time complexity of PCA is O(min(n²d, nd²)). The time complexity of LP after PCA is O(m³t), and that of GF after PCA is O(m³). In our experiments, t is set to 100 because good performance can be achieved with this setting. Because m is typically set to a very small number like 64 or 128, the main time complexity of IsoHash comes from the PCA phase. In general, the training of IsoHash-GF is faster than that of IsoHash-LP in our experiments.

One promising property of both LP and GF is that their time complexity after PCA is independent of the number of training data, which makes them scalable to large-scale data sets.

3 Relation to Existing Works

The method most closely related to IsoHash is ITQ [7], because both ITQ and IsoHash have to learn an orthogonal matrix. However, IsoHash differs from ITQ in many aspects: firstly, the goal of IsoHash is to learn a projection with isotropic variances, but the results of ITQ cannot necessarily guarantee isotropic variances; secondly, IsoHash directly learns the orthogonal matrix from the eigenvalues and eigenvectors of PCA, but ITQ first quantizes the PCA results to get some binary codes, and then learns the orthogonal matrix based on the resulting binary codes; thirdly, IsoHash has an explicit objective function to optimize, but ITQ uses a two-step heuristic strategy whose goal cannot be formulated by a single objective function; fourthly, to learn the orthogonal matrix, IsoHash uses lift and projection or gradient flow, but ITQ uses the Procrustes method, which is much slower than IsoHash. From the experimental results presented in the next section, we can find that IsoHash achieves accuracy comparable to ITQ with much faster training speed.

4 Experiment

4.1 Data Sets

We evaluate our methods on two widely used data sets, CIFAR [16] and LabelMe [28].

The first data set is CIFAR-10 [16], which consists of 60,000 images. These images are manually labeled into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The size of each image is 32×32 pixels. We represent them with 256-dimensional gray-scale GIST descriptors [24].

The second data set is 22K LabelMe used in [23, 28], which contains 22,019 images sampled from the large LabelMe data set. As in [28], the images are scaled to 32×32 pixels and then represented by 512-dimensional GIST descriptors [24].

4.2 Evaluation Protocols and Baselines

Following the protocols widely used in recent papers [7, 23, 25, 31], Euclidean neighbors in the original space are considered as ground truth. More specifically, a threshold of the average distance to the 50th nearest neighbor is used to define whether a point is a true positive or not. Based on the Euclidean ground truth, we compute the precision-recall curve and mean average precision (mAP) [7, 21]. For all experiments, we randomly select 1000 points as queries, and leave the rest as the training set to learn the hash functions. All the experimental results are averaged over 10 random training/test partitions.

Although a lot of hashing methods have been proposed, some of them are either supervised [23] or semi-supervised [29]. Our IsoHash method is essentially an unsupervised one. Hence, for fair comparison, we select the most representative unsupervised methods for evaluation, which include PCAH [7], ITQ [7], SH [31], LSH [1], and SIKH [25]. Among these methods, PCAH, ITQ and SH are data-dependent methods, while SIKH and LSH are data-independent methods.

All experiments are conducted on our workstation with an Intel(R) Xeon(R) CPU X7560 @ 2.27GHz and 64GB memory.

4.3 Accuracy

Table 1 shows the Hamming ranking performance measured by mAP on LabelMe and CIFAR. It is clear that our IsoHash methods, including both IsoHash-GF and IsoHash-LP, achieve far better performance than PCAH. The main difference between IsoHash and PCAH is that the PCAH dimensions have anisotropic variances while the IsoHash dimensions have isotropic variances. Hence, the intuitive viewpoint that using the same number of bits for different projected dimensions with anisotropic variances is unreasonable has been successfully verified by our experiments. Furthermore, the performance of IsoHash is also comparable, if not superior, to the state-of-the-art methods, such as ITQ.

Figure 1 illustrates the precision-recall curves on the LabelMe data set with different code sizes. The relative performance in the precision-recall curves on CIFAR is similar to that on LabelMe. We omit the results on CIFAR due to space limitation. Once again, we can find that our IsoHash methods achieve performance which is far better than PCAH and comparable to the state-of-the-art.
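For completeness, a sketch of the Hamming-ranking evaluation of Section 4.2 is given below; the function names are ours, and details such as tie handling in the ranking and the exact set over which the 50th-nearest-neighbor threshold is averaged follow common practice and may differ from the authors' scripts:

import numpy as np

def euclidean_ground_truth(train_X, query_X, k=50):
    # True neighbors are points closer than the average distance to the
    # k-th nearest neighbor (k = 50 in Section 4.2).
    dists = np.array([np.linalg.norm(train_X - q, axis=1) for q in query_X])
    thresh = np.sort(dists, axis=1)[:, k - 1].mean()
    return dists <= thresh                                   # boolean (num_queries, num_train)

def hamming_rank_map(train_codes, query_codes, ground_truth):
    # train_codes: (n, m) in {0,1}; query_codes: (q, m); ground_truth: (q, n) boolean.
    aps = []
    for code, gt in zip(query_codes, ground_truth):
        dist = np.count_nonzero(train_codes != code, axis=1)  # Hamming distances
        order = np.argsort(dist, kind="stable")               # rank the training points
        rel = gt[order]
        if not rel.any():
            continue
        hits = np.cumsum(rel)
        precision_at_hits = hits[rel] / (np.flatnonzero(rel) + 1.0)
        aps.append(precision_at_hits.mean())                  # average precision of this query
    return float(np.mean(aps))                                # mAP over all queries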
4.4 Computational Cost

Table 2 shows the training time on CIFAR. We can see that our IsoHash methods are much faster than ITQ. The time complexity of ITQ also contains two parts: the first part is PCA, which is the same as that in IsoHash, and the second part is an iterative algorithm to rotate the original PCA matrix with time complexity O(nm²), where n is the number of training points and m is the number of bits in the binary code. Hence, as the number of training data increases, the time complexity of the second part of ITQ will increase linearly with the number of training points. But the time complexity of IsoHash after PCA is independent of the number of training points. Hence, IsoHash will be much faster than ITQ, particularly in the case with a large number of training points. This is clearly shown in Figure 2, which illustrates the training time when the number of training data is changed.
Table 1: mAP on LabelMe and CIFAR data sets.

                      LabelMe                                     CIFAR
# bits        32      64      96      128     256       32      64      96      128     256
IsoHash-GF    0.2580  0.3269  0.3528  0.3662  0.3889    0.2249  0.2969  0.3256  0.3357  0.3600
IsoHash-LP    0.2534  0.3223  0.3577  0.3826  0.4274    0.1907  0.2624  0.3027  0.3223  0.3651
PCAH          0.0516  0.0401  0.0341  0.0307  0.0232    0.0319  0.0274  0.0241  0.0216  0.0168
ITQ           0.2786  0.3328  0.3504  0.3615  0.3728    0.2490  0.3051  0.3238  0.3319  0.3436
SH            0.0826  0.1034  0.1447  0.1653  0.2080    0.0510  0.0589  0.0802  0.1121  0.1535
SIKH          0.0590  0.1482  0.2074  0.2526  0.4488    0.0353  0.0902  0.1245  0.1909  0.3614
LSH           0.1549  0.2574  0.3147  0.3375  0.4034    0.1052  0.1907  0.2396  0.2776  0.3432

[Figure 1: Precision-recall curves on the LabelMe data set. Panels: (a) 32 bits, (b) 64 bits, (c) 96 bits, (d) 256 bits; recall on the horizontal axis, precision on the vertical axis; curves for IsoHash-GF, IsoHash-LP, ITQ, SH, SIKH, LSH and PCAH.]

Table 2: Training time (in seconds) on CIFAR.

# bits        32      64      96      128     256
IsoHash-GF    2.48    2.45    2.70    3.00    5.55
IsoHash-LP    2.14    2.43    2.94    3.47    8.83
PCAH          1.84    2.14    2.23    2.36    2.92
ITQ           4.35    6.33    9.73    12.40   29.25
SH            1.60    3.41    8.37    13.66   49.44
SIKH          1.30    1.44    1.57    1.55    2.20
LSH           0.05    0.08    0.11    0.19    0.31

[Figure 2: Training time on CIFAR versus the number of training data (up to 6×10^4); curves for IsoHash-GF, IsoHash-LP, ITQ, SH, SIKH, LSH and PCAH.]

5 Conclusion

Although many researchers have intuitively argued that using the same number of bits for different projected dimensions with anisotropic variances is unreasonable, this viewpoint has still not been verified by either theory or experiment because no methods have been proposed to find projection functions with isotropic variances for different dimensions. The IsoHash method proposed in this paper is the first work to learn projection functions which can produce projected dimensions with isotropic variances (equal variances). Experimental results on real data sets have successfully verified the viewpoint that projections with isotropic variances will be better than those with anisotropic variances. Furthermore, IsoHash can achieve accuracy comparable to the state-of-the-art methods with faster training speed.

6 Acknowledgments

This work is supported by the NSFC (No. 61100125), the 863 Program of China (No. 2011AA01A202, No. 2012AA011003), and the Program for Changjiang Scholars and Innovative Research Team in University of China (IRT1158, PCSIRT).
References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
[2] M. T. Chu. Constructing a Hermitian matrix from its diagonal entries and eigenvalues. SIAM Journal on Matrix Analysis and Applications, 16(1):207–217, 1995.
[3] M. T. Chu and K. R. Driessel. The projected gradient method for least squares matrix approximations with spectral constraints. SIAM Journal on Numerical Analysis, pages 1050–1060, 1990.
[4] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry, 2004.
[5] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[6] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik. Angular quantization based binary codes for fast similarity search. In NIPS, 2012.
[7] Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. In CVPR, 2011.
[8] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 2012.
[9] J. He, W. Liu, and S.-F. Chang. Scalable similarity search with optimized kernel hashing. In KDD, 2010.
[10] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In CVPR, 2012.
[11] A. Horn. Doubly stochastic matrices and the diagonal of a rotation matrix. American Journal of Mathematics, 76(3):620–630, 1954.
[12] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[13] I. Jolliffe. Principal Component Analysis. Springer, 2002.
[14] W. Kong and W.-J. Li. Double-bit quantization for hashing. In AAAI, 2012.
[15] W. Kong, W.-J. Li, and M. Guo. Manhattan hashing for large-scale image retrieval. In SIGIR, 2012.
[16] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[17] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
[18] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, 2009.
[19] B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. IEEE Trans. Pattern Anal. Mach. Intell., 31(12):2143–2157, 2009.
[20] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.
[21] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, 2011.
[22] Y. Mu and S. Yan. Non-metric locality-sensitive hashing. In AAAI, 2010.
[23] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[24] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[25] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, 2009.
[26] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009.
[27] L. F. Shampine and M. K. Gordon. Computer Solution of Ordinary Differential Equations: The Initial Value Problem. Freeman, San Francisco, California, 1975.
[28] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[29] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.
[30] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393–2406, 2012.
[31] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[32] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu. Complementary hashing for approximate nearest neighbor search. In ICCV, 2011.
[33] D. Zhang, F. Wang, and L. Si. Composite hashing with multiple information sources. In SIGIR, 2011.
[34] Y. Zhen and D.-Y. Yeung. A probabilistic model for multimodal hash function learning. In KDD, 2012.