Scalable Graph Hashing with Feature Transformation Qing-Yuan Jiang and Wu-Jun Li National Key Laboratory for Novel Software Technology Collaborative Innovation Center of Novel Software Technology and Industrialization Department of Computer Science and Technology,Nanjing University,China jiangqy@lamda.nju.edu.cn,liwujun@nju.edu.cn Abstract Song et al.,2013;Zhang et al.,2014;Liu et al.,2014]has been widely used for ANN search.The hashing techniques Hashing has been widely used for approximate n- used for ANN search are usually called similarity-preserving earest neighbor(ANN)search in big data applica- tions because of its low storage cost and fast re- hashing,which tries to map the data points from the origina space into a binary-code Hamming space where the similari- trieval speed.The goal of hashing is to map the da- ty (neighborhood structure)in the original space is preserved ta points from the original space into a binary-code space where the similarity (neighborhood struc- More specifically,the Hamming distance between the binary codes of two points should be small if these two points are ture)in the original space is preserved.By di- rectly exploiting the similarity to guide the hashing similar in the original space.Otherwise,the Hamming dis- tance should be as large as possible.With the binary-code code learning procedure,graph hashing has attract- representation,hashing can achieve constant or sub-linear ed much attention.However,most existing graph time complexity for ANN search [Gong and Lazebnik,2011; hashing methods cannot achieve satisfactory per- formance in real applications due to the high com- Zhang et al.,2014].Furthermore,hashing can also reduce the storage cost dramatically.For example,only 4GB memory is plexity for graph modeling.In this paper,we pro- needed to store one billion data points if each point is repre- pose a novel method,called scalable graph hashing sented as a binary code of 32 bits.Hence,hashing has be- with feature transformation (SGH).for large-scale graph hashing.Through feature transformation,we come one of the most popular methods for ANN search [Gio- nis et al.,1999;Datar et al.,2004;Weiss et al.,2008; can effectively approximate the whole graph with- out explicitly computing the similarity graph ma- Kulis and Darrell,2009;Wang et al.,2010b;Liu et al.,2011; trix,based on which a sequential learning method Gong and Lazebnik,2011;Kong and Li,2012;Liu et al, is proposed to learn the hash functions in a bit-wise 2012;Xu et al.,2013;Zhang et al.,2014;Lin et al.,2014; Liu et al.,2014] manner.Experiments on two datasets with one mil- lion data points show that our SGH method can Compared with traditional data-independent hashing outperform the state-of-the-art methods in terms of methods like locality sensitive hashing (LSH)[Gionis et al., both accuracy and scalability 1999:Datar et al.,2004]which do not use any data for train- ing,data-dependent hashing methods,which are also called learning to hash(LH)methods,can achieve comparable or 1 Introduction better accuracy with shorter codes by learning hash func- Nearest neighbor search [Andoni,2009]plays an importan- tions from training data [Gong and Lazebnik,2011;Liu et t role in a large variety of areas including machine learning, al.,2012;Zhang et al.,2014;Liu et al.,2014].Hence,LH data mining,and information retrieval,and so on.In big data methods have become more popular than data-independent applications,it is typically time-consuming or impossible to methods [Wang et al.,2010b;Gong and Lazebnik,2011; return the exact nearest neighbors to the given queries.In fact, Liu et al.,2012;Zhang et al.,2014;Lin et al.,2014; approximate nearest neighbors (ANN)[Indyk and Motwani. Liu et al.,2014]. 1998:Andoni and Indyk,2008]are enough to achieve satis- Existing LH methods can be divided into two main cate- factory performance in many applications,such as the image gories [Gong and Lazebnik,2011;Liu et al.,2012;Zhang retrieval task in search engines.Furthermore,ANN search is et al.,2014]:unsupervised hashing and supervised hashing usually more efficient than exact nearest neighbor search to methods.Unsupervised hashing tries to preserve the Eu- solve large-scale problems.Hence,ANN search has attract- clidean similarity between the attributes of training points, ed more and more attention in this big data era lAndoni and while supervised hashing [Norouzi and Fleet,2011;Zhang et Indyk,2008] al.,2014;Lin et al.,2014]tries to preserve the semantic sim- Because of its low storage cost and fast retrieval speed, ilarity constructed from the semantic labels of the training hashing [Andoni and Indyk,2008;Wang et al.,2010a;Gong points.Although supervised hashing methods have demon- and Lazebnik,2011;Zhen and Yeung,2012;Zhu et al.,2013; strated promising performance in some applications with se-
Scalable Graph Hashing with Feature Transformation Qing-Yuan Jiang and Wu-Jun Li National Key Laboratory for Novel Software Technology Collaborative Innovation Center of Novel Software Technology and Industrialization Department of Computer Science and Technology, Nanjing University, China jiangqy@lamda.nju.edu.cn, liwujun@nju.edu.cn Abstract Hashing has been widely used for approximate nearest neighbor (ANN) search in big data applications because of its low storage cost and fast retrieval speed. The goal of hashing is to map the data points from the original space into a binary-code space where the similarity (neighborhood structure) in the original space is preserved. By directly exploiting the similarity to guide the hashing code learning procedure, graph hashing has attracted much attention. However, most existing graph hashing methods cannot achieve satisfactory performance in real applications due to the high complexity for graph modeling. In this paper, we propose a novel method, called scalable graph hashing with feature transformation (SGH), for large-scale graph hashing. Through feature transformation, we can effectively approximate the whole graph without explicitly computing the similarity graph matrix, based on which a sequential learning method is proposed to learn the hash functions in a bit-wise manner. Experiments on two datasets with one million data points show that our SGH method can outperform the state-of-the-art methods in terms of both accuracy and scalability. 1 Introduction Nearest neighbor search [Andoni, 2009] plays an important role in a large variety of areas including machine learning, data mining, and information retrieval, and so on. In big data applications, it is typically time-consuming or impossible to return the exact nearest neighbors to the given queries. In fact, approximate nearest neighbors (ANN) [Indyk and Motwani, 1998; Andoni and Indyk, 2008] are enough to achieve satisfactory performance in many applications, such as the image retrieval task in search engines. Furthermore, ANN search is usually more efficient than exact nearest neighbor search to solve large-scale problems. Hence, ANN search has attracted more and more attention in this big data era [Andoni and Indyk, 2008] Because of its low storage cost and fast retrieval speed, hashing [Andoni and Indyk, 2008; Wang et al., 2010a; Gong and Lazebnik, 2011; Zhen and Yeung, 2012; Zhu et al., 2013; Song et al., 2013; Zhang et al., 2014; Liu et al., 2014] has been widely used for ANN search. The hashing techniques used for ANN search are usually called similarity-preserving hashing, which tries to map the data points from the original space into a binary-code Hamming space where the similarity (neighborhood structure) in the original space is preserved. More specifically, the Hamming distance between the binary codes of two points should be small if these two points are similar in the original space. Otherwise, the Hamming distance should be as large as possible. With the binary-code representation, hashing can achieve constant or sub-linear time complexity for ANN search [Gong and Lazebnik, 2011; Zhang et al., 2014]. Furthermore, hashing can also reduce the storage cost dramatically. For example, only 4GB memory is needed to store one billion data points if each point is represented as a binary code of 32 bits. Hence, hashing has become one of the most popular methods for ANN search [Gionis et al., 1999; Datar et al., 2004; Weiss et al., 2008; Kulis and Darrell, 2009; Wang et al., 2010b; Liu et al., 2011; Gong and Lazebnik, 2011; Kong and Li, 2012; Liu et al., 2012; Xu et al., 2013; Zhang et al., 2014; Lin et al., 2014; Liu et al., 2014]. Compared with traditional data-independent hashing methods like locality sensitive hashing (LSH) [Gionis et al., 1999; Datar et al., 2004] which do not use any data for training, data-dependent hashing methods, which are also called learning to hash (LH) methods, can achieve comparable or better accuracy with shorter codes by learning hash functions from training data [Gong and Lazebnik, 2011; Liu et al., 2012; Zhang et al., 2014; Liu et al., 2014]. Hence, LH methods have become more popular than data-independent methods [Wang et al., 2010b; Gong and Lazebnik, 2011; Liu et al., 2012; Zhang et al., 2014; Lin et al., 2014; Liu et al., 2014]. Existing LH methods can be divided into two main categories [Gong and Lazebnik, 2011; Liu et al., 2012; Zhang et al., 2014]: unsupervised hashing and supervised hashing methods. Unsupervised hashing tries to preserve the Euclidean similarity between the attributes of training points, while supervised hashing [Norouzi and Fleet, 2011; Zhang et al., 2014; Lin et al., 2014] tries to preserve the semantic similarity constructed from the semantic labels of the training points. Although supervised hashing methods have demonstrated promising performance in some applications with se-
mantic labels,it is time-consuming or impossible to get se- out explicitly computing the pairwise similarity graph mantic labels in many real applications.Hence,we can only matrix.Hence,the O(n)computation cost and storage perform unsupervised hashing for these cases,which is also cost are avoided in SGH.which makes SGH suitable for the focus of this paper large-scale applications Representative unsupervised hashing methods include spectral hashing(SH)[Weiss et al.,2008],binary reconstruc- A sequential learning method is proposed to learn the hash functions in a bit-wise manner.which is effective tive embeddings(BRE)[Kulis and Darrell,2009],princi- because the residual caused by former bits can be com- pal component analysis based hashing (PCAH)[Gong and Lazebnik,2011],iterative quantization (ITQ)[Gong and plementarily captured in the following bits. Lazebnik,2011],anchor graph hashing(AGH)[Liu et al., Experiments on two datasets with one million data 2011],isotropic hashing (IsoHash)[Kong and Li,2012],and points show that our SGH method can outperform the discrete graph hashing (DGH)[Liu et al.,2014].Among state-of-the-art methods in terms of both accuracy and these methods,SH,BRE,AGH and DGH are graph hash- scalability. ing methods.By directly exploiting the similarity(neighbor- The rest of this paper is organized as follows.Section 2 in- hood structure)to guide the hashing code learning procedure, troduces the problem definition of this paper.We present our the objective of graph hashing exactly matches the goal of SGH method in Section 3.Experiments are shown in Sec- similarity-preserving hashing.Hence,graph hashing should tion 4,and finally we conclude the whole paper in Section 5. be expected to achieve better performance than other non- graph based hashing methods if the learning algorithms are 2 effective enough. Problem Definition However,most existing graph hashing methods cannot 2.1 Notation achieve satisfactory performance in real applications due to We use boldface lowercase letters like v to denote vectors. the high complexity for graph modeling.More specifically, and the ith element of v is denoted as vi.Boldface uppercase the similarity typically reflects the pairwise relationship be- letters like V denote matrices.The ith row of V is denoted as tween two points.The memory cost for storing all the pair- Vi,the jth column of V is denoted as V.j.and the (i,j)th wise similarities is O(n),where n is the number of train- ing points.The time complexity for directly computing all element in V is denoted as Vij.VT is the transpose of V. the pairwise similarities is also O(n2).Besides these costs and tr(V)is the trace of matrix V.lVF=V∑,f to compute and store the similarity graph,almost all meth- is the Frobenius norm,which can also be used to define the ods will introduce extra computation and memory cost dur- length of a vector.sgn()is an element-wise sign function. ing the learning procedure.Hence,it is memory-consuming u:v denotes the concatenation of two vectors u and v.I is and time-consuming or even intractable to learn from the w- an identity matrix with dimensionality d. hole similarity graph for large-scale datasets which are typi- cal in hashing applications.Existing methods have to adop- 2.2 Graph Hashing t approximation or subsampling methods for graph hashing on large-scale datasets.For example,SH uses an eigenfunc- Assume we are given n data points X=[x1;x2;...;xn] tion solution of 1-D Laplacian for approximation by assum- Rnxd where d is the dimensionality of the data points and ing uniform data distribution,which actually loses the neigh- Xi.=x.Without loss of generality,the data are assumed borhood structure in the data.BRE has to subsample a s- to be zero centered which means0.Hashing is to mall subset for training even if a large-scale dataset is avail- map each point x;into a binary code biE{-1,+1e,where able.Both SH and BRE cannot achieve satisfactory perfor- c denotes the code size (length).In general,we use c binary mance in real applications,which has been verified by exist- hash functions th)=1,2,...,c}to compute the binary ing work [Liu et al.,2011].Both AGH and DGH use anchor code of xi.ie.bi=[h(xi),h2(xi),...,he(xi)]T graphs to approximate the similarity graph,which successful- We have different metrics to measure the similarity be- ly avoid the O(n2)complexity for both memory and compu- tween two points in the original space.Let Sij denote the tation cost.However,the accuracy of approximation cannot similarity between xi and xj.One most widely used metric be guaranteed,which might deteriorate the accuracy of the where p>0 is a parame- learned codes.Furthermore,it need extra computation cost is defined as:Se ter.We can find that Sij (0,1].Hashing need to preserve to construct the anchors,which will be proved to be time- the similarity in the original feature space.More specifical- consuming in our experiment.Hence,although the objective ly,the larger the similarity between xi and xj is,the smaller is attractive,existing graph hashing methods cannot achieve the Hamming distance between b;and b;will be.In other satisfactory performance in real applications. words,for any three points xi,xj and xk,if Sj<Sik,the In this paper,we propose a novel method,called scalable Hamming distance between bi and b;should be smaller than graph hashing with feature transformation(SGH),for large- that between bi and bi. scale graph hashing.The main contributions of SGH are If we compute all the pairwise similarities between any two briefly outlined as follows: points in X,we can get a similarity graph with the graph ma- Inspired by the asymmetric LSH (ALSH)[Shrivasta- trix S=[Sijlnxn E Rnxn.Graph hashing tries to use all the va and Li,2014],SGH adopts a feature transformation information or part of the information in S to learn the binary method to effectively approximate the whole graph with- codes.It's obvious that both the time complexity and storage
mantic labels, it is time-consuming or impossible to get semantic labels in many real applications. Hence, we can only perform unsupervised hashing for these cases, which is also the focus of this paper. Representative unsupervised hashing methods include spectral hashing (SH) [Weiss et al., 2008], binary reconstructive embeddings (BRE) [Kulis and Darrell, 2009], principal component analysis based hashing (PCAH) [Gong and Lazebnik, 2011], iterative quantization (ITQ) [Gong and Lazebnik, 2011], anchor graph hashing (AGH) [Liu et al., 2011], isotropic hashing (IsoHash) [Kong and Li, 2012], and discrete graph hashing (DGH) [Liu et al., 2014]. Among these methods, SH, BRE, AGH and DGH are graph hashing methods. By directly exploiting the similarity (neighborhood structure) to guide the hashing code learning procedure, the objective of graph hashing exactly matches the goal of similarity-preserving hashing. Hence, graph hashing should be expected to achieve better performance than other nongraph based hashing methods if the learning algorithms are effective enough. However, most existing graph hashing methods cannot achieve satisfactory performance in real applications due to the high complexity for graph modeling. More specifically, the similarity typically reflects the pairwise relationship between two points. The memory cost for storing all the pairwise similarities is O(n 2 ), where n is the number of training points. The time complexity for directly computing all the pairwise similarities is also O(n 2 ). Besides these costs to compute and store the similarity graph, almost all methods will introduce extra computation and memory cost during the learning procedure. Hence, it is memory-consuming and time-consuming or even intractable to learn from the whole similarity graph for large-scale datasets which are typical in hashing applications. Existing methods have to adopt approximation or subsampling methods for graph hashing on large-scale datasets. For example, SH uses an eigenfunction solution of 1-D Laplacian for approximation by assuming uniform data distribution, which actually loses the neighborhood structure in the data. BRE has to subsample a small subset for training even if a large-scale dataset is available. Both SH and BRE cannot achieve satisfactory performance in real applications, which has been verified by existing work [Liu et al., 2011]. Both AGH and DGH use anchor graphs to approximate the similarity graph, which successfully avoid the O(n 2 ) complexity for both memory and computation cost. However, the accuracy of approximation cannot be guaranteed, which might deteriorate the accuracy of the learned codes. Furthermore, it need extra computation cost to construct the anchors, which will be proved to be timeconsuming in our experiment. Hence, although the objective is attractive, existing graph hashing methods cannot achieve satisfactory performance in real applications. In this paper, we propose a novel method, called scalable graph hashing with feature transformation (SGH), for largescale graph hashing. The main contributions of SGH are briefly outlined as follows: • Inspired by the asymmetric LSH (ALSH) [Shrivastava and Li, 2014], SGH adopts a feature transformation method to effectively approximate the whole graph without explicitly computing the pairwise similarity graph matrix. Hence, the O(n 2 ) computation cost and storage cost are avoided in SGH, which makes SGH suitable for large-scale applications. • A sequential learning method is proposed to learn the hash functions in a bit-wise manner, which is effective because the residual caused by former bits can be complementarily captured in the following bits. • Experiments on two datasets with one million data points show that our SGH method can outperform the state-of-the-art methods in terms of both accuracy and scalability. The rest of this paper is organized as follows. Section 2 introduces the problem definition of this paper. We present our SGH method in Section 3. Experiments are shown in Section 4, and finally we conclude the whole paper in Section 5. 2 Problem Definition 2.1 Notation We use boldface lowercase letters like v to denote vectors, and the ith element of v is denoted as vi . Boldface uppercase letters like V denote matrices. The ith row of V is denoted as Vi∗, the jth column of V is denoted as V∗j , and the (i, j)th element in V is denoted as Vij . VT is the transpose of V, and tr(V) is the trace of matrix V. kVkF = qP ij V 2 ij is the Frobenius norm, which can also be used to define the length of a vector. sgn(·) is an element-wise sign function. [u; v] denotes the concatenation of two vectors u and v. Id is an identity matrix with dimensionality d. 2.2 Graph Hashing Assume we are given n data points X = [x1; x2; · · · ; xn] T ∈ Rn×d , where d is the dimensionality of the data points and Xi∗ = x T i . Without loss of generality, the data are assumed to be zero centered which means Pn i=1 xi = 0. Hashing is to map each point xi into a binary code bi ∈ {−1, +1} c , where c denotes the code size (length). In general, we use c binary hash functions {hk(·)|k = 1, 2, · · · , c} to compute the binary code of xi , i.e., bi = [h1(xi), h2(xi), · · · , hc(xi)]T . We have different metrics to measure the similarity between two points in the original space. Let Sij denote the similarity between xi and xj . One most widely used metric is defined as: Sij = e − kxi−xj k 2 F ρ , where ρ > 0 is a parameter. We can find that Sij ∈ (0, 1]. Hashing need to preserve the similarity in the original feature space. More specifically, the larger the similarity between xi and xj is, the smaller the Hamming distance between bi and bj will be. In other words, for any three points xi , xj and xk, if Sij < Sik, the Hamming distance between bk and bi should be smaller than that between bj and bi . If we compute all the pairwise similarities between any two points in X, we can get a similarity graph with the graph matrix S = [Sij ]n×n ∈ Rn×n. Graph hashing tries to use all the information or part of the information in S to learn the binary codes. It’s obvious that both the time complexity and storage
complexity are O(n2)if we explicitly compute all the pair- wise similarities in S,which is not acceptable in large-scale applications.Hence,as stated in the introduction,existing methods have to adopt approximation or subsampling tech- niques to solve it.However,they cannot achieve satisfactory performance,which motivates the work in this paper. 3 Scalable Graph Hashing with Feature Transformation In this section,we present the details of our graph hashing -05 0.5 method SGH,including the model,learning algorithm,and complexity analysis. Figure 1:Approximation in feature transformation 3.1 Model Feature Transformation The model of SGH contains two key components:the objec- As stated in Section 2,both time complexity and storage com- tive function and the feature transformation method. plexity are O(n2)if we explicitly compute all the pairwise similarities in S.which is obviously unscalable.Here,we Objective Function propose a feature transformation method to use all the simi- The aim of SGH is to approximate the similarity matrix S by the learned hashing codes,which results in the following larities without explicitly computing S. We first define P(x)and Q(x)as follows: objective function: min ∑-bb,, 2(e2-1e-x (1) P(x)= x;1 e2+1。-:川 {bt-1,j ep (3) 2(e2-1)。-x2 where Sij=2Si-l.Note that Sj∈(-1,l↓,and the rela-- Q(x)=[1 P X; 2+e-4;-1 tive distance in S is kept in S.We use S in the objective func- tion to keep consistent with the range of Ibb;E[-1,1]. to x,and then It is NP-hard to directly learn the binary codes {bi}in (1). where we multiply a value ep As in kernelized locality-sensitive hashing (KLSH)[Kulis add two extra dimensions. and Grauman,2009]and supervised hashing with kernel- Then we can get: s(KSH)[Liu et al.,2012],we define the hash function for the kth bit of bi as follows: Px)o)=22x+e正-1 hk(xi)=sgn(Wj(xixj)+biasx), ≈2e呢-3+色-1 =1 =2e--1 where W E Rexm is the weight matrix,(xi,xj)is a kernel =5 function which is a RBF(Gaussian)function in our experi- ment,m denotes the number of kernel bases,and biask is a Here,we use an approximatione, bias value.Our goal is to learn H(x)=th(x),...,he(x)} which is shown in Figure1 when x∈【-l,,We can to map the whole training set X to a binary matrix B E find that these two functions are close to each other with {-1,+1)nxe with Bi.=b.biask is typically set to z E[-1,1].Hence,to make the approximation reasonable, -∑gei∑=lWkp(x,)which has the same effect as we assume-l≤axx≤1.It is easy to prove that that by making the training data in the kernel space zero- p=2max{lx3}是1 can make-1≤axx≤1.Ac centered.In fact,the above hash function can be rewrit- ten as h(x)=sgn(K(x)w).where w=Wi.and tually,2 maxx is not a tight bound of p.In real K(x)=[(x,x)-∑1(x,x)/n,…,(x,xm)- applications,we can also treat p as a hyper-parameter,and )/n].Then,by substituting the H()into (1). tune it with cross-validation techniques. By using this simple feature transformation,we can we can get the objective function with the parameter W to learn: derive the similarity matrix S P(X)TQ(X),where P(X)={P(x),…,P(xn)}∈Rd+2)xn and Q(X)= nlleS -sgn(K(X)WT)sgn(K(x)WT)T (2) {Q(x1),...,Q(xn)}E R(d+2)xm.Both the time and stor- age complexities are still O(n2)if we explicitly computing where K(X)E Rnxm is the kernel feature matrix for all S=P(X)TQ(X)even if the feature transformation is adopt- training points in X. ed.However,we use only P(X)and O(X)for computation
complexity are O(n 2 ) if we explicitly compute all the pairwise similarities in S, which is not acceptable in large-scale applications. Hence, as stated in the introduction, existing methods have to adopt approximation or subsampling techniques to solve it. However, they cannot achieve satisfactory performance, which motivates the work in this paper. 3 Scalable Graph Hashing with Feature Transformation In this section, we present the details of our graph hashing method SGH, including the model, learning algorithm, and complexity analysis. 3.1 Model The model of SGH contains two key components: the objective function and the feature transformation method. Objective Function The aim of SGH is to approximate the similarity matrix S by the learned hashing codes, which results in the following objective function: min {bl} n l=1 Xn i,j=1 (Seij − 1 c b T i bj ) 2 , (1) where Seij = 2Sij − 1. Note that Seij ∈ (−1, 1], and the relative distance in S is kept in Se. We use Se in the objective function to keep consistent with the range of 1 c b T i bj ∈ [−1, 1]. It is NP-hard to directly learn the binary codes {bl} in (1). As in kernelized locality-sensitive hashing (KLSH) [Kulis and Grauman, 2009] and supervised hashing with kernels (KSH) [Liu et al., 2012], we define the hash function for the kth bit of bi as follows: hk(xi) = sgn( Xm j=1 Wkjφ(xi , xj ) + biask), where W ∈ Rc×m is the weight matrix, φ(xi , xj ) is a kernel function which is a RBF (Gaussian) function in our experiment, m denotes the number of kernel bases, and biask is a bias value. Our goal is to learn H(x) = {h1(x), · · · , hc(x)} to map the whole training set X to a binary matrix B ∈ {−1, +1} n×c with Bi∗ = b T i . biask is typically set to − 1 n Pn i=1 Pm j=1 Wkjφ(xi , xj ), which has the same effect as that by making the training data in the kernel space zerocentered. In fact, the above hash function can be rewritten as hk(x) = sgn(K(x)wk), where wk = WT k∗ and K(x) = [φ(x, x1) − Pn i=1 φ(xi P , x1)/n, · · · , φ(x, xm) − n i=1 φ(xi , xm)/n]. Then, by substituting the H(x) into (1), we can get the objective function with the parameter W to learn: min W kcSe − sgn(K(X)WT )sgn(K(X)WT ) T k 2 F , (2) where K(X) ∈ Rn×m is the kernel feature matrix for all training points in X. Figure 1: Approximation in feature transformation. Feature Transformation As stated in Section 2, both time complexity and storage complexity are O(n 2 ) if we explicitly compute all the pairwise similarities in Se, which is obviously unscalable. Here, we propose a feature transformation method to use all the similarities without explicitly computing Se. We first define P(x) and Q(x) as follows: P(x) = [s 2(e 2 − 1) eρ e − kxk 2 F ρ x; r e 2 + 1 e e − kxk 2 F ρ ; 1] Q(x) = [s 2(e 2 − 1) eρ e − kxk 2 F ρ x; r e 2 + 1 e e − kxk 2 F ρ ; −1] (3) where we multiply a value q2(e 2−1) eρ e − kxk 2 F ρ to x, and then add two extra dimensions. Then we can get: P(xi) T Q(xj ) =2[ e 2 − 1 2e × 2x T i xj ρ + e 2 + 1 2e ]e − kxik 2 F +kxj k 2 F ρ − 1 ≈ 2e −kxik 2 F −kxj k 2 F +2xT i xj ρ − 1 = 2e − kxi−xj k 2 F ρ − 1 = Seij . Here, we use an approximation e 2−1 2e x + e 2+1 2e ≈ e x , which is shown in Figure 1 when x ∈ [−1, 1]. We can find that these two functions are close to each other with x ∈ [−1, 1]. Hence, to make the approximation reasonable, we assume −1 ≤ 2 ρ x T i xj ≤ 1. It is easy to prove that ρ = 2 max{kxik 2 F } n i=1 can make −1 ≤ 2 ρ x T i xj ≤ 1. Actually, 2 max{kxik 2 F } n i=1 is not a tight bound of ρ. In real applications, we can also treat ρ as a hyper-parameter, and tune it with cross-validation techniques. By using this simple feature transformation, we can derive the similarity matrix Se = P(X) T Q(X), where P(X) = {P(x1), · · · , P(xn)} ∈ R(d+2)×n and Q(X) = {Q(x1), · · · , Q(xn)} ∈ R(d+2)×n. Both the time and storage complexities are still O(n 2 ) if we explicitly computing Se = P(X) T Q(X) even if the feature transformation is adopted. However, we use only P(X) and Q(X) for computation
but do not explicitly computing S in our following learning If we define At =K(X)TRK(X),then we have: algorithm.Hence,the O(n2)complexity can be avoided in our method. At=At-1- Please note that the feature transformation is inspired by K(X)sgn(K(X)wt-1)sgn(K(X)wt-1)K(X) ALSH [Shrivastava and Li,2014],which is proposed for maximum inner product search with data-independent hash- and ing.Different from ALSH,our SGH is for data-dependent A1=cK(X)TSK(X) hashing.Furthermore,the feature transformation method in SGH is different from that in ALSH. =cK(X)TP(X)TQ(X)K(X) (8) 3.2 Learning =clK(X)P(X)Q(X)K(X)]. The discrete sgn()function in (2)makes the problem very Equation(8)is the key component of our learning algorith- difficult to solve.One possible solution is to discard the m.It is easy to see that we not only implicitly include all the discrete constraints and relax the whole H()function to information of the pairwise similarity matrix S for training, a real-valued function,which has been adopted by many but also successfully avoid the high computation and storage methods such as SH [Weiss et al,2008]and AGH [Liu et al.,2011].However,this relaxation procedure may lead complexity without explicitly computing S. to poor performance,which has been verified by existing After t iterates from 1 to c,we can learn all the W= work [Kong and Li,2012;Zhang and Li,2014].Here,we [wi=1.Actually,the sequential learning procedure can fur- design a sequential learning strategy in a bit-wise manner, ther continue by adopting the following residual matrix: where the residual caused by former bits can be comple- mentarily captured in the following bits [Liu et al.,2012; Rt=cS- >sgn(K(X)w:)sgn(K(X)w:)T. Zhang and Li,2014]. Assuming that we have already learned t-1 bits which are 浮 parameterized by w the residual matrix to reconstruct We find that this procedure can further improve the accura- the similarity matrix can be computed as follows: cy.In our paper,we continue it for another c iterations,which t一4 achieves a good tradeoff between accuracy and speed. R.=cS->sgn(K(X)wi)sgn(K(X)w:)T.(4) The sequential learning strategy is briefly summarized in i=1 Algorithm 1.Here,y is a very small positive number to avoid Then our obiective function to learn the tth bit can be writ- numerical problems,which is 10-6 in our experiments. ten as follows: Algorithm 1 Sequential learning algorithm for SGH minR:-sgn(K(X)w:)sgn(K(X)w:)T (5) Input:Feature vectors X Rmxd;code length c:number of kernel bases m. The objective function in(5)is still a NP-hard problem due to the sgn()function.In order to solve the problem in(5),we Output:Weight matrix WE Rexm Procedure apply spectral relaxation [Weiss et al.,2008]and impose an Construct P(X)and Q(X)according to (3); orthogonality constraint to get the following formulation: Construct K(X)based on the kernel bases,which are m points min IR:-K(X)w:w?K(X)T randomly selected from X: Ao=K(X)P(X)Q(X)K(X): (6) A1=CAo; s.t.wK(X)K(X)wt=1 Z=K(X)TK(X)+Id: The problem in(6)can be further simplified to: fort=1->cdo Solve the following generalized eigenvalue problem R:-K(X)w:wK(X)T Atw:=λZwE; =tr[(R:-K(X)w:wK(X)T)(R-K(X)w:wK(X)T)T] U=[K(X)Tsgn(K(X)w:)][K(X)Tsgn(K(X)w:)]T; A+1=A:-U; =trK(X)w:wK(X)TK(X)w:wTK(X)T] end for A0=Ac+1 -2tr(wK(X)R:K(X)w:))+tr(RR) Randomly permutate {1,2,...,c}to generate a random index =-2tr(wiK(X)R:K(X)wt))+const. set M; fort=1->cdo Then we reformulate the problem in (6)as follows: t=M(: min -tr(wK(X)TR.K(X)w:) Ao Ao+K(X)sgn(K(X)wi)sgn(K(X)w:)K(X): (7) Solve the following generalized eigenvalue problem s.t.wfK(X)TK(X)wt=1 Aov=λZw: Update wi←-v Then we can obtain a generalized eigenvalue problem as Ao=Ao-K(X)Tsgn(K(X)wi)sgn(K(X)wi)K(X): follows: end for K(X)R:K(X)wt=AK(X)K(X)wt
but do not explicitly computing Se in our following learning algorithm. Hence, the O(n 2 ) complexity can be avoided in our method. Please note that the feature transformation is inspired by ALSH [Shrivastava and Li, 2014], which is proposed for maximum inner product search with data-independent hashing. Different from ALSH, our SGH is for data-dependent hashing. Furthermore, the feature transformation method in SGH is different from that in ALSH. 3.2 Learning The discrete sgn(·) function in (2) makes the problem very difficult to solve. One possible solution is to discard the discrete constraints and relax the whole H(·) function to a real-valued function, which has been adopted by many methods such as SH [Weiss et al., 2008] and AGH [Liu et al., 2011]. However, this relaxation procedure may lead to poor performance, which has been verified by existing work [Kong and Li, 2012; Zhang and Li, 2014]. Here, we design a sequential learning strategy in a bit-wise manner, where the residual caused by former bits can be complementarily captured in the following bits [Liu et al., 2012; Zhang and Li, 2014]. Assuming that we have already learned t−1 bits which are parameterized by {wi} t−1 i=1, the residual matrix to reconstruct the similarity matrix can be computed as follows: Rt = cSe − Xt−1 i=1 sgn(K(X)wi)sgn(K(X)wi) T . (4) Then our objective function to learn the tth bit can be written as follows: min wt kRt − sgn(K(X)wt)sgn(K(X)wt) T k 2 F (5) The objective function in (5) is still a NP-hard problem due to the sgn(·) function. In order to solve the problem in (5), we apply spectral relaxation [Weiss et al., 2008] and impose an orthogonality constraint to get the following formulation: min wt kRt − K(X)wtwT t K(X) T k 2 F s.t. wT t K(X) T K(X)wt = 1 (6) The problem in (6) can be further simplified to: kRt − K(X)wtw T t K(X) T k 2 F = tr[(Rt − K(X)wtw T t K(X) T )(Rt − K(X)wtw T t K(X) T ) T ] = tr[K(X)wtw T t K(X) T K(X)wtw T t K(X) T ] − 2tr(w T t K(X) T RtK(X)wt)) + tr(RtR T t ) = −2tr(w T t K(X) T RtK(X)wt)) + const. Then we reformulate the problem in (6) as follows: min wt −tr(wT t K(X) T RtK(X)wt) s.t. wT t K(X) T K(X)wt = 1 (7) Then we can obtain a generalized eigenvalue problem as follows: K(X) T RtK(X)wt = λK(X) T K(X)wt. If we define At = K(X) T RtK(X), then we have: At = At−1− K(X) T sgn(K(X)wt−1)sgn(K(X)wt−1) T K(X) and A1 = cK(X) T SeK(X) = cK(X) T P(X) T Q(X)K(X) = c[K(X) T P(X) T ][Q(X)K(X)]. (8) Equation (8) is the key component of our learning algorithm. It is easy to see that we not only implicitly include all the information of the pairwise similarity matrix Se for training, but also successfully avoid the high computation and storage complexity without explicitly computing Se. After t iterates from 1 to c, we can learn all the W = {wi} c i=1. Actually, the sequential learning procedure can further continue by adopting the following residual matrix: Rt = cSe − Xc i=1 i6=t sgn(K(X)wi)sgn(K(X)wi) T . We find that this procedure can further improve the accuracy. In our paper, we continue it for another c iterations, which achieves a good tradeoff between accuracy and speed. The sequential learning strategy is briefly summarized in Algorithm 1. Here, γ is a very small positive number to avoid numerical problems, which is 10−6 in our experiments. Algorithm 1 Sequential learning algorithm for SGH Input: Feature vectors X ∈ Rn×d ; code length c; number of kernel bases m. Output: Weight matrix W ∈ Rc×m. Procedure Construct P(X) and Q(X) according to (3); Construct K(X) based on the kernel bases, which are m points randomly selected from X; A0 = [K(X) T P(X) T ][Q(X)K(X)]; A1 = cA0; Z = K(X) T K(X) + γId; for t = 1 → c do Solve the following generalized eigenvalue problem Atwt = λZwt; U = [K(X) T sgn(K(X)wt)][K(X) T sgn(K(X)wt)]T ; At+1 = At − U; end for Ab 0 = Ac+1 Randomly permutate {1, 2, · · · , c} to generate a random index set M; for t = 1 → c do tˆ= M(t); Ab 0 = Ab 0 + K(X) T sgn(K(X)wtˆ)sgn(K(X)wtˆ) T K(X); Solve the following generalized eigenvalue problem Ab 0v = λZv; Update wtˆ ← v Ab 0 = Ab 0 − K(X) T sgn(K(X)wtˆ)sgn(K(X)wtˆ) T K(X); end for
Table 1:Top-1000 precision on TINY-1M and MIRFLICKR-1M.The best accuracy is shown in boldface. Method TINY-IM MIRFLICKR-1M 32 bits 64 bits 96 bits 128 bits 256 bits 32 bits 64 bits 96 bits 128 bits 256 bits SGH 0.4697 0.5742 0.6299 0.6737 0.7357 0.49190.60410.6677 0.6985 0.7584 ITO 0.4289 0.4782 0.4947 0.4986 0.5003 0.5177 0.5776 0.5999 0.6096 0.6228 AGH 0.3973 0.4402 0.4577 0.4654 0.4767 0.4299 0.4741 0.4911 0.4998 0.506 DGH-I 0.3974 0.4536 0.4737 0.4874 0.4969 0.4299 0.4806 0.5001 0.5111 0.5253 DGH-R0.3793 0.4554 0.4871 0.4989 0.5276 0.4121 0.47760.5054 0.5196 0.5428 PCAH 0.2457 0.2203 0.2000 0.1836 0.1421 0.2720 0.2384 0.2141 0.1950 0.1508 LSH 0.2507 0.3575 0.4122 0.4529 0.5212 0.2597 0.3995 0.466 0.5160 0.6072 200 800 200 800 00 200 800 1000 200 (a)64 bits @TINY-1M (b)128 bits @TINY-1M (c)64 bits @MIRFLICKR-1M (d)128 bits @MIRFLICKR-1M Figure 2:Performance of Top-K precision on TINY-1M and MIRFLICKR-1M 3.3 Complexity Analysis are defined as the top 2%nearest neighbors in the training set The computation cost can be divided in two parts:initial- in terms of the Euclidean distance.So each query has 19900 ization and the main procedure.Initialization of P(X)and ground truth neighbors for both datasets. Q(X),kernel initialization,and initialization of Ao and Z The baselines for comparison contain one data- will cost O(dn dmn+mn+(m-+mn)(d+2)+m-n). independent method LSH [Datar et al.,2004]and some The main procedure will cost O(c(mn +m2)+m3).Typi- representative unsupervised hashing methods,including two cally,m,d,c will be much less than n.Hence,the time com- graph-based hashing methods AGH [Liu et al.,2011]and plexity of SGH is O(n)although all the n2 similarities have DGH Liu et al..2014,two linear projection based methods been used.Furthermore,the memory cost is also O(n)since PCAH [Gong and Lazebnik,2011]and ITQ [Gong and the similarity matrix S is not explicitly computed.Therefore, Lazebnik,2011].By using different initialization strategies, it is expected that SGH is not only accurate but also scalable. DGH has two variants [Liu et al.,2014],DGH-I and DGH-R both of which are used for comparison.Please note that two 4 Experiment other graph hashing methods SH and BRE are not adopted for comparison because existing works have shown that AGH We use two datasets with one million data points for evalua- and DGH can outperform them [Liu et al.,2014].For kernel tion.All the experiments are conducted on a workstation with feature construction,we use Gaussian kernel and take 300 Intel(R)CPU E5-2620V2@2.1G 12 cores and 64G RAM. randomly sampled points as kernel bases for our method.We set the parameter p =2 in P(X)and Q(X).For all the other 4.1 Datasets baselines,we set the parameters by following the suggestions We evaluate our method on two widely used large-scale of the corresponding authors. benchmark datasets:TINY-/M Liu et al..2014]and The Top-K precision [Liu et al,2014]is adopted as a met- MIRFLICKR-IM [Huiskes et al..20101. ric to evaluate our method and baselines.In real applications The first dataset is TINY-IM which contains one million such as search engines,the users may only be interested in images from the 80M tiny images.Each tiny image is rep- the top returned results given a query.Hence,the Top-K pre- resented by a 384-dim GIST descriptors extracted from the cision is a good metric for evaluation. original image of size 32x32. The second dataset is MIRFL/CKR-IM from LIACS Medi- 4.3 Accuracy alab at Leiden University.This dataset has one million Flickr The Top-1000 precision based on the Hamming ranking re- images which are downloaded from Flickr.We extract 512 sults is shown in Table 1.We can see that SGH achieves features from each image. the best accuracy in all the cases except on MIRFLICKR-1M with 32 bits.In particular,SGH can outperform all the other 4.2 Evaluation Protocols and Baselines graph hashing methods in all cases.This shows that SGH is For each dataset,we randomly select 5000 data points to con- effective to capture the similarity information in the data. struct the test(query)set and the remaining points will be We also report the Top-K precision with other numbers of used for training.The groundtruth neighbors of each query K(returned samples)in Figure 2 on two datasets with 64 bits
Table 1: Top-1000 precision on TINY-1M and MIRFLICKR-1M. The best accuracy is shown in boldface. Method TINY-1M MIRFLICKR-1M 32 bits 64 bits 96 bits 128 bits 256 bits 32 bits 64 bits 96 bits 128 bits 256 bits SGH 0.4697 0.5742 0.6299 0.6737 0.7357 0.4919 0.6041 0.6677 0.6985 0.7584 ITQ 0.4289 0.4782 0.4947 0.4986 0.5003 0.5177 0.5776 0.5999 0.6096 0.6228 AGH 0.3973 0.4402 0.4577 0.4654 0.4767 0.4299 0.4741 0.4911 0.4998 0.506 DGH-I 0.3974 0.4536 0.4737 0.4874 0.4969 0.4299 0.4806 0.5001 0.5111 0.5253 DGH-R 0.3793 0.4554 0.4871 0.4989 0.5276 0.4121 0.4776 0.5054 0.5196 0.5428 PCAH 0.2457 0.2203 0.2000 0.1836 0.1421 0.2720 0.2384 0.2141 0.1950 0.1508 LSH 0.2507 0.3575 0.4122 0.4529 0.5212 0.2597 0.3995 0.466 0.5160 0.6072 (a) 64 bits @TINY-1M (b) 128 bits @TINY-1M (c) 64 bits @MIRFLICKR-1M (d) 128 bits @MIRFLICKR-1M Figure 2: Performance of Top-K precision on TINY-1M and MIRFLICKR-1M 3.3 Complexity Analysis The computation cost can be divided in two parts: initialization and the main procedure. Initialization of P(X) and Q(X), kernel initialization, and initialization of A0 and Z will cost O(dn + dmn + mn + (m2 + mn)(d + 2) + m2n). The main procedure will cost O(c(mn + m2 ) + m3 ). Typically, m, d, c will be much less than n. Hence, the time complexity of SGH is O(n) although all the n 2 similarities have been used. Furthermore, the memory cost is also O(n) since the similarity matrix Se is not explicitly computed. Therefore, it is expected that SGH is not only accurate but also scalable. 4 Experiment We use two datasets with one million data points for evaluation. All the experiments are conducted on a workstation with Intel (R) CPU E5-2620V2@2.1G 12 cores and 64G RAM. 4.1 Datasets We evaluate our method on two widely used large-scale benchmark datasets: TINY-1M [Liu et al., 2014] and MIRFLICKR-1M [Huiskes et al., 2010]. The first dataset is TINY-1M which contains one million images from the 80M tiny images. Each tiny image is represented by a 384-dim GIST descriptors extracted from the original image of size 32×32. The second dataset is MIRFLICKR-1M from LIACS Medialab at Leiden University. This dataset has one million Flickr images which are downloaded from Flickr. We extract 512 features from each image. 4.2 Evaluation Protocols and Baselines For each dataset, we randomly select 5000 data points to construct the test (query) set and the remaining points will be used for training. The groundtruth neighbors of each query are defined as the top 2% nearest neighbors in the training set in terms of the Euclidean distance. So each query has 19900 ground truth neighbors for both datasets. The baselines for comparison contain one dataindependent method LSH [Datar et al., 2004] and some representative unsupervised hashing methods, including two graph-based hashing methods AGH [Liu et al., 2011] and DGH [Liu et al., 2014], two linear projection based methods PCAH [Gong and Lazebnik, 2011] and ITQ [Gong and Lazebnik, 2011]. By using different initialization strategies, DGH has two variants [Liu et al., 2014], DGH-I and DGH-R, both of which are used for comparison. Please note that two other graph hashing methods SH and BRE are not adopted for comparison because existing works have shown that AGH and DGH can outperform them [Liu et al., 2014]. For kernel feature construction, we use Gaussian kernel and take 300 randomly sampled points as kernel bases for our method. We set the parameter ρ = 2 in P(X) and Q(X). For all the other baselines, we set the parameters by following the suggestions of the corresponding authors. The Top-K precision [Liu et al., 2014] is adopted as a metric to evaluate our method and baselines. In real applications such as search engines, the users may only be interested in the top returned results given a query. Hence, the Top-K precision is a good metric for evaluation. 4.3 Accuracy The Top-1000 precision based on the Hamming ranking results is shown in Table 1. We can see that SGH achieves the best accuracy in all the cases except on MIRFLICKR-1M with 32 bits. In particular, SGH can outperform all the other graph hashing methods in all cases. This shows that SGH is effective to capture the similarity information in the data. We also report the Top-K precision with other numbers of K (returned samples) in Figure 2 on two datasets with 64 bits
Table 2:Training time on TINY-1M (in second). Method 32 bits 64 bits 96 bits 128 bits 256 bits SGH 34.49 52.37 71.53 89.65 164.23 ITQ 31.72 60.62 89.01 149.18 322.06 AGH 18.60+1438.60 19.40+1438.60 20.08+1438.60 22.48+1438.60 25.09+1438.60 DGH-I 187.57+1438.60 296.99+1438.60 518.57+1438.60 924.08+1438.60 1838.30+1438.60 DGH-R 217.06+1438.60 360.18+1438.60615.74+1438.60 1089.10+1438.60 2300.10+1438.60 PCAH 4.29 4.54 4.75 5.85 6.49 LSH 1.68 1.77 1.84 2.55 3.76 Table 3:Training time on MIRFLICKR-1M (in second). Method 32 bits 64 bits 96 bits 128 bits 256 bits SGH 41.51 59.02 74.86 97.25 168.35 ITO 36.17 64.61 89.50 132.71 285.10 AgH 17.99+1564.86 18.80+1564.86 20.30+1564.86 19.87+1564.86 21.60+1564.86 DGH-I■T 85.81+1564.86 143.68+1564.86215.41+1564.86 352.73+1564.86 739.56+156486 DGH-R 116.25+1564.86 206.24+1564.86 308.32+1564.86 517.97+1564.86 1199.44+1564.86 PCAH 7.65 7.90 8.47 9.23 10.42 LSH 2.44 2.43 2.71 3.38 4.21 0.75 0.75 e 08 0.7 0.7 0.6 0日 0.6 0.65 0.4 0 --64 bits 05 --64 bits 64 bits 64 bits 128 bits 128 bits ←128bits 128 bits D.5 0.5 100200n 400500600 100 200 300 400500600 2 3 4 2 3 4 (a)TINY-IM (b)MIRFLICKR-1M (a)TINY-1M (b)MIRFLICKR-1M Figure 3:Top-1000 precision using different numbers of ker- Figure 4:Top-1000 precision using different values of p nel bases(m) cations,we need to choose a suitable m for tradeoff between and 128 bits.The cases with other numbers of bits are similar accuracy and cost. to these reported results,which are omitted for space saving. Figure 4 shows the Top-1000 precision on TINY-1M and We can see that SGH achieves the best accuracy for different MIRFLICKR-IM for different values of p.We can find that numbers of K. the accuracy is not sensitive to p when 1 <p <5.One nice 4.4 Speed phenomenon is that our method achieves the best accuracy when p=2 on both datasets. We report the time consumption on two datasets in Table 2 and Table 3,respectively.Please note the "+1438.60"and "+1564.86"in the two tables denote the anchor graph con- 5 Conclusion structing time in AGH and DGH.It is fair to include this an- In this paper,we have proposed a novel method,called SGH, chor graph constructing time for comparison because the time for large-scale graph hashing.SGH is scalable by avoid- of our SGH also includes all the training time.We can find ing explicitly computing the similarity matrix,and simulta- that our SGH is much faster than AGH and DGH in all cas- neously SGH can preserve the entire similarity information es,and is faster than ITQ in most cases.LSH and PCAH are in the dataset.Experiments show that SGH can outperfor- faster than SGH,but their accuracy is not satisfactory. m the state-of-the-art methods in terms of both accuracy and scalability. 4.5 Sensitivity to Parameters Figure 3 shows the Top-1000 precision on TINY-IM and 6 Acknowledgements MIRFLICKR-1M for different numbers of kernel bases(m). We can find that better accuracy can be achieved with larger This work is supported by the NSFC (No.61472182),and m,which is consistent with our intuition.However,larger m the Fundamental Research Funds for the Central Universi- will result in higher computation cost.Hence,in real appli- ties (No.20620140510)
Table 2: Training time on TINY-1M (in second). Method 32 bits 64 bits 96 bits 128 bits 256 bits SGH 34.49 52.37 71.53 89.65 164.23 ITQ 31.72 60.62 89.01 149.18 322.06 AGH 18.60 + 1438.60 19.40 + 1438.60 20.08 + 1438.60 22.48 + 1438.60 25.09 + 1438.60 DGH-I 187.57 + 1438.60 296.99 + 1438.60 518.57 + 1438.60 924.08 + 1438.60 1838.30 + 1438.60 DGH-R 217.06 + 1438.60 360.18 + 1438.60 615.74 + 1438.60 1089.10 + 1438.60 2300.10 + 1438.60 PCAH 4.29 4.54 4.75 5.85 6.49 LSH 1.68 1.77 1.84 2.55 3.76 Table 3: Training time on MIRFLICKR-1M (in second). Method 32 bits 64 bits 96 bits 128 bits 256 bits SGH 41.51 59.02 74.86 97.25 168.35 ITQ 36.17 64.61 89.50 132.71 285.10 AGH 17.99 + 1564.86 18.80 + 1564.86 20.30 + 1564.86 19.87 + 1564.86 21.60 + 1564.86 DGH-I 85.81 + 1564.86 143.68 + 1564.86 215.41 + 1564.86 352.73 + 1564.86 739.56 + 1564.86 DGH-R 116.25 + 1564.86 206.24 + 1564.86 308.32 + 1564.86 517.97 + 1564.86 1199.44 + 1564.86 PCAH 7.65 7.90 8.47 9.23 10.42 LSH 2.44 2.43 2.71 3.38 4.21 (a) TINY-1M (b) MIRFLICKR-1M Figure 3: Top-1000 precision using different numbers of kernel bases (m) and 128 bits. The cases with other numbers of bits are similar to these reported results, which are omitted for space saving. We can see that SGH achieves the best accuracy for different numbers of K. 4.4 Speed We report the time consumption on two datasets in Table 2 and Table 3, respectively. Please note the “+1438.60” and “+1564.86” in the two tables denote the anchor graph constructing time in AGH and DGH. It is fair to include this anchor graph constructing time for comparison because the time of our SGH also includes all the training time. We can find that our SGH is much faster than AGH and DGH in all cases, and is faster than ITQ in most cases. LSH and PCAH are faster than SGH, but their accuracy is not satisfactory. 4.5 Sensitivity to Parameters Figure 3 shows the Top-1000 precision on TINY-1M and MIRFLICKR-1M for different numbers of kernel bases (m). We can find that better accuracy can be achieved with larger m, which is consistent with our intuition. However, larger m will result in higher computation cost. Hence, in real appli- (a) TINY-1M (b) MIRFLICKR-1M Figure 4: Top-1000 precision using different values of ρ cations, we need to choose a suitable m for tradeoff between accuracy and cost. Figure 4 shows the Top-1000 precision on TINY-1M and MIRFLICKR-1M for different values of ρ. We can find that the accuracy is not sensitive to ρ when 1 ≤ ρ ≤ 5. One nice phenomenon is that our method achieves the best accuracy when ρ = 2 on both datasets. 5 Conclusion In this paper, we have proposed a novel method, called SGH, for large-scale graph hashing. SGH is scalable by avoiding explicitly computing the similarity matrix, and simultaneously SGH can preserve the entire similarity information in the dataset. Experiments show that SGH can outperform the state-of-the-art methods in terms of both accuracy and scalability. 6 Acknowledgements This work is supported by the NSFC (No. 61472182), and the Fundamental Research Funds for the Central Universities (No. 20620140510)
References [Liu et al.,2014]Wei Liu,Cun Mu,Sanjiv Kumar,and Shih- [Andoni and Indyk.2008]Alexandr Andoni and Piotr Indyk. Fu Chang.Discrete graph hashing.In Proceedings of Near-optimal hashing algorithms for approximate nearest the Advances in Neural Information Processing Systems, neighbor in high dimensions.Communications on ACM. pages3419-3427,2014. 51(1):117-122.2008. [Norouzi and Fleet,2011]Mohammad Norouzi and David J. [Andoni,2009 Alexandr Andoni.Nearest Neighbor Search: Fleet.Minimal loss hashing for compact binary codes.In The Old,The New,and The Impossible.PhD thesis,Mas- Proceedings of the International Conference on Machine sachusetts Institute of Technology,2009. Learning,pages 353-360,2011. [Datar et al.,2004]Mayur Datar,Nicole Immorlica,Piotr [Shrivastava and Li,2014]Anshumali Shrivastava and Ping Indyk,and Vahab S.Mirrokni.Locality-sensitive hash- Li.Asymmetric Ish (alsh)for sublinear time maximum ing scheme based on p-stable distributions.In Proceed- inner product search (mips).In Proceedings of the Ad- ings of the Annual Symposium on Computational Geome- vances in Neural Information Processing Systems,pages ty,pages253-262,2004. 2321-2329.2014 [Gionis et al.,1999]Aristides Gionis,Piotr Indyk,and Ra- [Song et al,2013]Jingkuan Song,Yang Yang,Yi Yang,Z- jeev Motwani.Similarity search in high dimensions via i Huang,and Heng Tao Shen.Inter-media hashing for hashing.In Proceedings of the International Conference large-scale retrieval from heterogeneous data sources.In on Very Large Data Bases,pages 518-529,1999. Proceedings of the ACM SIGMOD International Confer- [Gong and Lazebnik,2011]Yunchao Gong and Svetlana ence on Management of Data,pages 785-796,2013. Lazebnik.Iterative quantization:A procrustean approach [Wang et al.,2010a]Jun Wang,Ondrej Kumar,and Shih-Fu to learning binary codes.In Proceedings of the IEEE Chang.Semi-supervised hashing for scalable image re- Conference on Computer Vision and Pattern Recognition, trieval.In Proceedings of the IEEE Conference on Com- pages817-824,2011 puter Vision and Pattern Recognition,pages 3424-3431, Huiskes et al..2010]Mark J.Huiskes.B.Thomee.and 2010 Michael S.Lew.New trends and ideas in visual concep- [Wang et al.,2010b]Jun Wang,Sanjiv Kumar,and Shih-Fu t detection:The mir flickr retrieval evaluation initiative. Chang.Sequential projection learning for hashing with In Proceedings of the ACM International Conference on compact codes.In Proceedings of the International Con- Multimedia Information Retrieval,pages 527-536,2010. ference on Machine Learning,pages 1127-1134,2010. [Indyk and Motwani,1998]Piotr Indyk and Rajeev Mot- [Weiss et al.,2008]Yair Weiss,Antonio Torralba,and wani.Approximate nearest neighbors:Towards removing Robert Fergus.Spectral hashing.In Proceedings ofthe Ad- the curse of dimensionality.In Proceedings of the Annu- vances in Neural Information Processing Systems,pages al ACM Symposium on Theory of Computing,pages 604 1753-1760,2008 613.1998. [Xu et al.,2013]Bin Xu,Jiajun Bu,Yue Lin,Chun Chen, [Kong and Li,2012]Weihao Kong and Wu-Jun Li.Isotropic Xiaofei He,and Deng Cai.Harmonious hashing.In Pro- hashing.In Proceedings of the Advances in Neural Infor- ceedings of the International Joint Conference on Artificial mation Processing Systems,pages 1655-1663,2012. Intelligence,2013. Kulis and Darrell,2009]Brian Kulis and Trevor Darrell. [Zhang and Li,2014]Dongqing Zhang and Wu-Jun Li. Learning to hash with binary reconstructive embeddings. Large-scale supervised multimodal hashing with seman- In Proceedings of the Advances in Neural Information tic correlation maximization.In Proceedings of the AAAl Processing Systems,pages 1042-1050,2009 Conference on Artificial Intelligence,pages 2177-2183, [Kulis and Grauman,2009]Brian Kulis and Kristen Grau- 2014. man.Kernelized locality-sensitive hashing for scalable [Zhang et al.,2014]Peichao Zhang,Wei Zhang,Wu-Jun Li, image search.In Proceedings of the IEEE International and Minyi Guo.Supervised hashing with latent factor Conference on Computer Vision,pages 2130-2137,2009. models.In Proceedings of the International ACM SIGIR [Lin et al.,2014]Guosheng Lin,Chunhua Shen,Qinfeng Conference on Research and Development in Information Shi,Anton van den Hengel,and David Suter.Fast su- Retrieval,pages 173-182,2014. pervised hashing with decision trees for high-dimensional [Zhen and Yeung,2012]Yi Zhen and Dit-Yan Yeung.A data.In Proceedings of the IEEE Conference on Computer probabilistic model for multimodal hash function learning. Vision and Pattern Recognition,pages 1971-1978,2014. In Proceedings of the ACM SIGKDD International Con- Liu et al.,2011]Wei Liu,Jun Wang,Sanjiv Kumar,and ference on Knowledge Discovery and Data Mining,pages Shih-Fu Chang.Hashing with graphs.In Proceedings of 940-948.2012 the International Conference on Machine Learning,2011. Zhu et al.,2013]Xiaofeng Zhu,Zi Huang,Heng Tao Shen, [Liu et al.,2012]Wei Liu,Jun Wang,Rongrong Ji,Yu-Gang and Xin Zhao.Linear cross-modal hashing for efficient Jiang,and Shih-Fu Chang.Supervised hashing with ker- multimedia search.In Proceedings of the ACM Interna- nels.In Proceedings of the IEEE Conference on Computer tional Conference on Multimedia,pages 143-152,2013. Vision and Pattern Recognition,pages 2074-2081,2012
References [Andoni and Indyk, 2008] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications on ACM, 51(1):117–122, 2008. [Andoni, 2009] Alexandr Andoni. Nearest Neighbor Search: The Old, The New, and The Impossible. PhD thesis, Massachusetts Institute of Technology, 2009. [Datar et al., 2004] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Annual Symposium on Computational Geometry, pages 253–262, 2004. [Gionis et al., 1999] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Data Bases, pages 518–529, 1999. [Gong and Lazebnik, 2011] Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 817–824, 2011. [Huiskes et al., 2010] Mark J. Huiskes, B. Thomee, and Michael S. Lew. New trends and ideas in visual concept detection: The mir flickr retrieval evaluation initiative. In Proceedings of the ACM International Conference on Multimedia Information Retrieval, pages 527–536, 2010. [Indyk and Motwani, 1998] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Annual ACM Symposium on Theory of Computing, pages 604– 613, 1998. [Kong and Li, 2012] Weihao Kong and Wu-Jun Li. Isotropic hashing. In Proceedings of the Advances in Neural Information Processing Systems, pages 1655–1663, 2012. [Kulis and Darrell, 2009] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of the Advances in Neural Information Processing Systems, pages 1042–1050, 2009. [Kulis and Grauman, 2009] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In Proceedings of the IEEE International Conference on Computer Vision, pages 2130–2137, 2009. [Lin et al., 2014] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and David Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014. [Liu et al., 2011] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In Proceedings of the International Conference on Machine Learning, 2011. [Liu et al., 2012] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2074–2081, 2012. [Liu et al., 2014] Wei Liu, Cun Mu, Sanjiv Kumar, and ShihFu Chang. Discrete graph hashing. In Proceedings of the Advances in Neural Information Processing Systems, pages 3419–3427, 2014. [Norouzi and Fleet, 2011] Mohammad Norouzi and David J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the International Conference on Machine Learning, pages 353–360, 2011. [Shrivastava and Li, 2014] Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Proceedings of the Advances in Neural Information Processing Systems, pages 2321–2329, 2014. [Song et al., 2013] Jingkuan Song, Yang Yang, Yi Yang, Zi Huang, and Heng Tao Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 785–796, 2013. [Wang et al., 2010a] Jun Wang, Ondrej Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3424–3431, 2010. [Wang et al., 2010b] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Sequential projection learning for hashing with compact codes. In Proceedings of the International Conference on Machine Learning, pages 1127–1134, 2010. [Weiss et al., 2008] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In Proceedings of the Advances in Neural Information Processing Systems, pages 1753–1760, 2008. [Xu et al., 2013] Bin Xu, Jiajun Bu, Yue Lin, Chun Chen, Xiaofei He, and Deng Cai. Harmonious hashing. In Proceedings of the International Joint Conference on Artificial Intelligence, 2013. [Zhang and Li, 2014] Dongqing Zhang and Wu-Jun Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2177–2183, 2014. [Zhang et al., 2014] Peichao Zhang, Wei Zhang, Wu-Jun Li, and Minyi Guo. Supervised hashing with latent factor models. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 173–182, 2014. [Zhen and Yeung, 2012] Yi Zhen and Dit-Yan Yeung. A probabilistic model for multimodal hash function learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 940–948, 2012. [Zhu et al., 2013] Xiaofeng Zhu, Zi Huang, Heng Tao Shen, and Xin Zhao. Linear cross-modal hashing for efficient multimedia search. In Proceedings of the ACM International Conference on Multimedia, pages 143–152, 2013