
Course teaching resource (reference paper) for "Artificial Intelligence, Machine Learning and Big Data": Supervised Hashing with Latent Factor Models



Supervised Hashing with Latent Factor Models

Peichao Zhang
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering
Shanghai Jiao Tong University, China
starforever00@gmail.com

Wei Zhang
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering
Shanghai Jiao Tong University, China
razhangwei@gmail.com

Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology
Nanjing University, China
liwujun@nju.edu.cn

Minyi Guo
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering
Shanghai Jiao Tong University, China
guo-my@cs.sjtu.edu.cn

ABSTRACT

Due to its low storage cost and fast query speed, hashing has been widely adopted for approximate nearest neighbor search in large-scale datasets. Traditional hashing methods try to learn the hash codes in an unsupervised way where the metric (Euclidean) structure of the training data is preserved. Very recently, supervised hashing methods, which try to preserve the semantic structure constructed from the semantic labels of the training points, have exhibited higher accuracy than unsupervised methods. In this paper, we propose a novel supervised hashing method, called latent factor hashing (LFH), to learn similarity-preserving binary codes based on latent factor models. An algorithm with convergence guarantee is proposed to learn the parameters of LFH. Furthermore, a linear-time variant with stochastic learning is proposed for training LFH on large-scale datasets. Experimental results on two large datasets with semantic labels show that LFH can achieve superior accuracy to state-of-the-art methods with comparable training time.

Categories and Subject Descriptors

H.3.3 [Information Storage And Retrieval]: Information Search and Retrieval - Retrieval models

Keywords

Hashing; Latent Factor Model; Image Retrieval; Big Data

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGIR'14, July 6-11, 2014, Gold Coast, Queensland, Australia.
Copyright 2014 ACM 978-1-4503-2257-7/14/07 ...$15.00.
http://dx.doi.org/10.1145/2600428.2609600

1. INTRODUCTION

Nearest neighbor (NN) search plays a fundamental role in machine learning and related areas, such as pattern recognition, information retrieval, data mining and computer vision. In many real applications, it is not necessary for an algorithm to return the exact nearest neighbors for every possible query. Hence, in recent years approximate nearest neighbor (ANN) search algorithms with improved speed and memory saving have received more and more attention from researchers [1, 2, 7].

Over the last decades, there has been an explosive growth of data from many areas. To meet the demand of performing ANN search on these massive datasets, various hashing techniques have been proposed due to their fast query speed and low storage cost [1, 4, 5, 8, 10, 16, 20, 21, 26, 27, 31, 35, 36, 37, 38, 39, 40, 41].
The essential idea of hashing is to map the data points from the original feature space into binary codes in the hashcode space with similarities between pairs of data points preserved. Hamming distance, defined as the number of positions at which two binary codes differ, is used to measure the closeness between binary codes. More specifically, when two data points are deemed similar, their binary codes should have a low Hamming distance. On the contrary, when two data points are dissimilar, a high Hamming distance is expected between their binary codes. The advantage of the binary code representation over the original feature vector representation is twofold. Firstly, each dimension of a binary code can be stored using only 1 bit while several bytes are typically required for one dimension of the original feature vector, leading to a dramatic reduction in storage cost. Secondly, by using the binary code representation, all the data points within a specific Hamming distance to a given query can be retrieved in constant or sub-linear time regardless of the total size of the dataset [30]. Because of these two advantages, hashing techniques have become a promising choice for efficient ANN search on massive datasets.
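As a concrete illustration of this representation, the following minimal Python sketch packs ±1 codes into bytes and computes Hamming distances with bitwise XOR; the helper names and the bit-packing layout are assumptions of the sketch, not part of the paper.

```python
import numpy as np

def pack_codes(B):
    """Pack an N x Q matrix of {-1, +1} codes into bytes, one row per point."""
    bits = (B > 0).astype(np.uint8)      # map {-1, +1} to {0, 1}
    return np.packbits(bits, axis=1)     # N x ceil(Q/8) array of uint8

def hamming_distances(packed_query, packed_db):
    """Hamming distances between one packed query code and all packed database codes."""
    xor = np.bitwise_xor(packed_db, packed_query)      # bytes whose bits differ
    return np.unpackbits(xor, axis=1).sum(axis=1)      # count the differing bits

# Toy usage: a database of 4 points and a single query, with Q = 16 bits.
rng = np.random.default_rng(0)
db = np.where(rng.standard_normal((4, 16)) > 0, 1, -1)
query = np.where(rng.standard_normal((1, 16)) > 0, 1, -1)
print(hamming_distances(pack_codes(query), pack_codes(db)))
```

In practice a lookup table or hardware popcount instruction would replace the bit unpacking; the sketch only shows why the comparison reduces to XOR and bit counting.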


Existing hashing methods can be divided into two categories: data-independent methods and data-dependent methods [6, 17, 18]. For data-independent methods, the hashing functions are learned without using any training data. Representative data-independent methods include locality-sensitive hashing (LSH) [1, 5, 7], shift-invariant kernels hashing (SIKH) [22], and many other extensions [4, 13, 14]. On the other hand, for data-dependent methods, the hashing functions are learned from some training data. Generally speaking, data-dependent methods often require fewer bits than data-independent methods to achieve satisfactory performance.

The data-dependent methods can be further divided into two categories: unsupervised methods and supervised methods [17, 20, 32]. Unsupervised methods try to preserve the metric (Euclidean) structure between data points using only their feature information. Representative unsupervised methods include spectral hashing (SH) [34], principal component analysis based hashing (PCAH) [33], iterative quantization (ITQ) [6], anchor graph hashing (AGH) [18], isotropic hashing (IsoHash) [9], multimodal latent binary embedding (MLBE) [42] and predictable dual-view hashing (PDH) [23]. Due to the fact that the high-level semantic description of an object often differs from the extracted low-level feature descriptors, known as the semantic gap [25], returning nearest neighbors according to metric distances between the feature vectors does not always guarantee good search quality. Hence, many recent works focus on supervised methods which try to preserve the semantic structure among the data points by utilizing their associated semantic information [17, 19]. Although there are also some works that exploit other types of supervised information, such as ranking information, for hashing [16, 20], the semantic information is usually given in the form of pairwise labels indicating whether two data points are known to be similar or dissimilar. Representative supervised methods include restricted Boltzmann machine based hashing (RBM) [24], binary reconstructive embedding (BRE) [12], sequential projection learning for hashing (SPLH) [33], minimal loss hashing (MLH) [19], kernel-based supervised hashing (KSH) [17], and linear discriminant analysis based hashing (LDAHash) [28]. Additionally, there are also some semi-supervised hashing methods [32] which use both labeled data and unlabeled data to train their models. As stated in recent works [17, 19, 20], sophisticated supervised methods, such as SPLH, MLH, and KSH, can achieve higher accuracy than unsupervised methods. However, some existing supervised methods, like MLH, suffer from a very large amount of training time, making it difficult to apply them to large-scale datasets.

In this paper, we propose a novel method, called latent factor hashing (LFH), for supervised hashing. The main contributions of this paper are outlined as follows:

• Based on latent factor models, the likelihood of the pairwise similarities is elegantly modeled as a function of the Hamming distance between the corresponding data points.

• An algorithm with convergence guarantee is proposed to learn the parameters of LFH.

• To handle large-scale problems, a linear-time variant with stochastic learning is proposed for fast parameter learning.

• Experimental results on two real datasets with semantic labels show that LFH can achieve much higher accuracy than other state-of-the-art methods with efficiency in training time.

The rest of this paper is organized as follows: In Section 2, we introduce the details of our LFH model. Experimental results are presented in Section 3. Finally, we conclude the paper in Section 4.

2. LATENT FACTOR HASHING

In this section, we present the details of our latent factor hashing (LFH) model, including the model formulation and learning algorithms.

2.1 Problem Definition

Suppose we have N points as the training data, each represented as a feature vector x_i ∈ R^D. Some pairs of points have similarity labels s_ij associated, where s_ij = 1 means x_i and x_j are similar and s_ij = 0 means x_i and x_j are dissimilar. Our goal is to learn a binary code b_i ∈ {−1, 1}^Q for each x_i with the similarity between pairs preserved. In particular, when s_ij = 1, the binary codes b_i and b_j should have a low Hamming distance. On the other hand, when s_ij = 0, the Hamming distance between b_i and b_j should be high. In compact form, we use a matrix X ∈ R^{N×D} to denote all the feature vectors, a matrix B ∈ {−1, 1}^{N×Q} to denote all the binary codes, and a set S = {s_ij} to denote all the observed similarity labels. Additionally, we use the notation A_i* and A_*j to denote the ith row and the jth column of a matrix A, respectively. A^T is the transpose of A. The similarity labels S can be constructed from the neighborhood structure by thresholding on the metric distances between the feature vectors [17]. However, such an S is of low quality since no semantic information is utilized. In the supervised hashing setting, S is often constructed from the semantic labels of the data points. Such labels are often built with manual effort to ensure their quality.

2.2 Model Formulation

Let Θ_ij denote half of the inner product between two binary codes b_i, b_j ∈ {−1, 1}^Q:

$$\Theta_{ij} = \frac{1}{2} b_i^T b_j.$$

The likelihood of the observed similarity labels S can be defined as follows:

$$p(S \mid B) = \prod_{s_{ij} \in S} p(s_{ij} \mid B), \qquad (1)$$

with

$$p(s_{ij} \mid B) = \begin{cases} a_{ij}, & s_{ij} = 1 \\ 1 - a_{ij}, & s_{ij} = 0 \end{cases},$$

where a_ij = σ(Θ_ij) with σ being the logistic function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
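To make the formulation concrete, here is a minimal sketch that evaluates the log of the likelihood in Eq. (1) for a given code matrix; the list-of-triples layout for the observed entries of S is our own assumption, not the authors' implementation.

```python
import numpy as np

def log_likelihood(B, observed_pairs):
    """Log-likelihood of the observed similarity labels given codes B (Eq. (1)).

    B is an N x Q matrix in {-1, +1}; observed_pairs is a list of (i, j, s_ij)
    triples for the entries of S. No numerically stable log-sigmoid is used,
    to keep the sketch short.
    """
    ll = 0.0
    for i, j, s in observed_pairs:
        theta = 0.5 * B[i] @ B[j]              # Theta_ij = (1/2) b_i^T b_j
        a = 1.0 / (1.0 + np.exp(-theta))       # a_ij = sigma(Theta_ij)
        ll += np.log(a) if s == 1 else np.log(1.0 - a)
    return ll

# Toy usage: the first two points share a code (similar), the third differs.
B = np.array([[1, 1, -1, 1],
              [1, 1, -1, 1],
              [-1, -1, 1, -1]])
print(log_likelihood(B, [(0, 1, 1), (0, 2, 0)]))
```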


It is easy to prove the following relationship between the Hamming distance dist_H(·, ·) and the inner product of two binary codes:

$$\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(Q - b_i^T b_j) = \frac{1}{2}(Q - 2\Theta_{ij}).$$

We can find that the smaller dist_H(b_i, b_j) is, the larger p(s_ij = 1 | B) will be. Maximizing the likelihood of S in (1) will make the Hamming distance between two similar points as small as possible, and that between two dissimilar points as large as possible. Hence this model is reasonable and matches the goal of preserving similarity.

With some prior p(B), the posteriori of B can be computed as follows:

$$p(B \mid S) \propto p(S \mid B)\, p(B).$$

We can use maximum a posteriori estimation to learn the optimal B. However, directly optimizing over B is an NP-hard problem [34]. Following most existing hashing methods, we compute the optimal B through two stages. In the first stage, we relax B to be a real-valued matrix U ∈ R^{N×Q}. The ith row of U is called the latent factor for the ith data point. We learn an optimal U under the same probabilistic framework as for B. Then in the second stage, we perform some rounding technique on the real-valued U to get the binary codes B.

More specifically, we replace all the occurrences of B in the previous equations with U. Θ_ij is then re-defined as:

$$\Theta_{ij} = \frac{1}{2} U_{i*} U_{j*}^T.$$

Similarly, p(S | B), p(B), p(B | S) are replaced with p(S | U), p(U), p(U | S), respectively. We put a normal distribution on p(U):

$$p(U) = \prod_{d=1}^{Q} \mathcal{N}(U_{*d} \mid 0, \beta I),$$

where N(·) denotes the normal distribution, I is an identity matrix, and β is a hyper-parameter. The log posteriori of U can then be derived as:

$$L = \log p(U \mid S) = \sum_{s_{ij} \in S} \left( s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) - \frac{1}{2\beta} \|U\|_F^2 + c, \qquad (2)$$

where ||·||_F denotes the Frobenius norm of a matrix, and c is a constant term independent of U. The next step is to learn the optimal U that maximizes L in (2).

2.3 Learning

Since directly optimizing the whole U can be very time-consuming, we optimize one row of U at a time with its other rows fixed. We adopt the surrogate algorithm [15] to optimize each row U_i*. The surrogate learning algorithm can be viewed as a generalization of the expectation-maximization (EM) algorithm. It constructs a lower bound of the objective function, and then updates the parameters to maximize that lower bound. Just like the EM algorithm, we need to derive different lower bounds and optimization procedures for different models [15]. In the following, we derive the details of the surrogate learning algorithm for our model.

The gradient vector and the Hessian matrix of the objective function L defined in (2) with respect to U_i* can be derived as:

$$\frac{\partial L}{\partial U_{i*}^T} = \frac{1}{2}\sum_{j: s_{ij} \in S}(s_{ij} - a_{ij})U_{j*}^T + \frac{1}{2}\sum_{j: s_{ji} \in S}(s_{ji} - a_{ji})U_{j*}^T - \frac{1}{\beta}U_{i*}^T,$$

$$\frac{\partial^2 L}{\partial U_{i*}^T \partial U_{i*}} = -\frac{1}{4}\sum_{j: s_{ij} \in S} a_{ij}(1 - a_{ij})U_{j*}^T U_{j*} - \frac{1}{4}\sum_{j: s_{ji} \in S} a_{ji}(1 - a_{ji})U_{j*}^T U_{j*} - \frac{1}{\beta}I.$$

If we define H_i as:

$$H_i = -\frac{1}{16}\sum_{j: s_{ij} \in S} U_{j*}^T U_{j*} - \frac{1}{16}\sum_{j: s_{ji} \in S} U_{j*}^T U_{j*} - \frac{1}{\beta}I, \qquad (3)$$

we can prove that

$$\frac{\partial^2 L}{\partial U_{i*}^T \partial U_{i*}} \succeq H_i,$$

where A ⪰ B means A − B is a positive semi-definite matrix. Then we can construct the lower bound of L(U_i*), denoted by L̃(U_i*), as:

$$\tilde{L}(U_{i*}) = L(U_{i*}(t)) + (U_{i*} - U_{i*}(t))\frac{\partial L}{\partial U_{i*}^T}(t) + \frac{1}{2}(U_{i*} - U_{i*}(t))H_i(t)(U_{i*} - U_{i*}(t))^T.$$

The values of U and other parameters that depend on U change through the updating process. Here we use the notation x(t) to denote the value of a parameter x at some iteration t. We update U_i* to be the one that gives the maximum value of L̃(U_i*). It is easy to see that L̃(U_i*) is a quadratic form in the variable U_i*, which can be proved to be concave.
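Before deriving the closed-form update, the two quantities just defined can be computed directly. The following sketch does so for a single row; the (i, j, s_ij) triple layout for S and the function name are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def row_gradient_and_bound(U, observed_pairs, i, beta):
    """Gradient dL/dU_i* and the bound matrix H_i of Eq. (3) for one row of U.

    observed_pairs is a list of (p, q, s_pq) triples for the entries of S.
    """
    Q = U.shape[1]
    grad = -U[i] / beta               # contribution of the normal prior on U
    H = -np.eye(Q) / beta
    for p, q, s in observed_pairs:
        if i not in (p, q):
            continue
        j = q if p == i else p        # covers both the s_ij and the s_ji sums
        a = 1.0 / (1.0 + np.exp(-0.5 * U[i] @ U[j]))
        grad += 0.5 * (s - a) * U[j]
        H -= np.outer(U[j], U[j]) / 16.0
    return grad, H

# Toy usage with three latent factors and two observed labels involving row 0.
rng = np.random.default_rng(0)
U = rng.standard_normal((3, 4))
grad, H = row_gradient_and_bound(U, [(0, 1, 1), (2, 0, 0)], i=0, beta=1.0)
print(grad, H, sep="\n")
```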
Hence, we can find the optimum value of U_i* by setting the gradient of L̃(U_i*) with respect to U_i* to 0. As a result, we end up with the following update rule for U_i*:

$$U_{i*}(t+1) = U_{i*}(t) - \left[\frac{\partial L}{\partial U_{i*}^T}(t)\right]^T H_i(t)^{-1}. \qquad (4)$$

We can then update the other rows of U iteratively using the above rule.

The convergence of the iterative updating process is controlled by some threshold value ε and the maximum allowed number of iterations T. Here ε and T are both hyper-parameters. During each iteration, we update U by updating each of its rows sequentially. The initial value of U can be obtained through PCA on the feature space X. The pseudocode of the updating process is shown in Algorithm 1.

2.3.1 Rounding

After the optimal U is learned, we can obtain the final binary codes B using some rounding technique. In this paper, to keep our method simple, the real values of U are quantized into the binary codes of B by taking their signs, that is:

$$B_{ij} = \begin{cases} 1, & U_{ij} > 0 \\ -1, & \text{otherwise} \end{cases}.$$
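A minimal sketch of the resulting row update and of the rounding step follows; it assumes the gradient and bound matrix are supplied, for example by the previous sketch, and solves a Q x Q linear system instead of forming the inverse of H_i explicitly.

```python
import numpy as np

def surrogate_row_update(U_i, grad_i, H_i):
    """Update rule (4): U_i(t+1) = U_i(t) - [dL/dU_i^T]^T H_i(t)^{-1}.

    Since H_i is symmetric, the row-vector product equals H_i^{-1} grad_i.
    """
    return U_i - np.linalg.solve(H_i, grad_i)

def round_codes(U):
    """Rounding of Section 2.3.1: B_ij = 1 if U_ij > 0, and -1 otherwise."""
    return np.where(U > 0, 1, -1)

# Toy usage with an arbitrary negative-definite bound matrix H_i.
rng = np.random.default_rng(0)
Q = 4
U_i = rng.standard_normal(Q)
grad_i = rng.standard_normal(Q)
H_i = -np.eye(Q) - 0.1 * np.ones((Q, Q))
print(round_codes(surrogate_row_update(U_i, grad_i, H_i)))
```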


Algorithm 1 Optimizing U using the surrogate algorithm
Input: X ∈ R^{N×D}, S = {s_ij}, Q, T ∈ N+, β, ε ∈ R+.
  Initialize U by performing PCA on X.
  for t = 1 → T do
    for i = 1 → N do
      Update U_i* by following the rule given in (4).
    end for
    Compute L in (2) using the updated U.
    Terminate the iterative process when the change of L is smaller than ε.
  end for
Output: U ∈ R^{N×Q}.

2.3.2 Out-of-Sample Extension

In order to perform ANN search, we need to compute the binary code b for a query x which is not in the training set. We achieve this by finding a matrix W ∈ R^{D×Q} that maps x to u in the following way:

$$u = W^T x.$$

We then perform the same rounding technique discussed in Section 2.3.1 on u to obtain b.

We use linear regression to train W over the training set. The squared loss with a regularization term is shown below:

$$L_e = \|U - XW\|_F^2 + \lambda_e \|W\|_F^2.$$

And the optimal W can be computed as:

$$W = (X^T X + \lambda_e I)^{-1} X^T U. \qquad (5)$$

2.4 Convergence and Complexity Analysis

At each iteration, we first construct a lower bound at the current point U_i*(t), and then optimize it to get a new point U_i*(t+1) with a higher function value L(U_i*(t+1)). Hence, the objective function value always increases in the new iteration. Furthermore, the objective function value L = log p(U | S) is upper bounded by zero. Hence, our Algorithm 1 can always guarantee convergence to a local maximum, the principle of which is similar to that of the EM algorithm. This convergence property will also be illustrated in Figure 2. The convergence property of our surrogate algorithm is one of the key advantages compared with the standard Newton method and first-order gradient methods. In both the Newton method and first-order gradient methods, a step size, also called the learning rate in some literature, should be manually specified, which does not necessarily guarantee convergence.

We can prove that, when updating U_i*, the total time to compute ∂L/∂U_i*^T and H_i for all i = 1, ..., N is O(|S|Q) and O(|S|Q^2), respectively. Since the inversion of H_i can be computed in O(Q^3), the time to update U_i* following the rule given in (4) is O(Q^3). Then the time to update U in one iteration is O(NQ^3 + |S|Q^2). Therefore, the total time of the updating process is O(T(NQ^3 + |S|Q^2)), where T is the number of iterations.

Besides that, the time for rounding is O(NQ). And the time to compute W for the out-of-sample extension is O(ND^2 + D^3 + NDQ), which can be further reduced to O(ND^2) with the typical assumption that Q < D ≪ N. With the pre-computed W, the out-of-sample extension for a query x can be achieved in O(DQ).
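As an illustration of the out-of-sample extension just analyzed, the following sketch solves Eq. (5) and hashes a new query; the function names are illustrative and the regularization value is arbitrary.

```python
import numpy as np

def fit_out_of_sample_projection(X, U, lam):
    """Solve Eq. (5): W = (X^T X + lambda_e I)^{-1} X^T U."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ U)

def hash_query(x, W):
    """Project a new query x of length D and round it to a binary code."""
    return np.where(W.T @ x > 0, 1, -1)

# Toy usage: N = 100 training points, D = 20 features, Q = 8 bits.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))       # training features
U = rng.standard_normal((100, 8))        # learned latent factors
W = fit_out_of_sample_projection(X, U, lam=1.0)
print(hash_query(rng.standard_normal(20), W))
```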
Figure 1: Selection of S for stochastic learning. (a) Full; (b) Sparse; (c) Aligned.

2.5 Stochastic Learning

For a given set of N training points with supervised information, there are N × N pairs of similarity labels available to form S. Straightforwardly, we can choose S to include all the available similarity pairs, that is, S = {s_ij | i, j = 1, ..., N, i ≠ j}. This is illustrated in Figure 1(a), in which a colored cell in the i-th row and j-th column indicates that s_ij is included in S. In this way, the best possible accuracy can be achieved since we use as much of the available semantic information as possible. However, according to the complexity analysis described in Section 2.4, if we set S to contain the full supervised information, the time cost to update U would become O(NQ^3 + N^2 Q^2) per iteration, and O(N^2) memory is needed to store S. Such cost in both computation and storage is unacceptable when N grows large.

Inspired by the stochastic gradient descent method, we propose an efficient way of updating U, called stochastic learning. As illustrated in Figure 1(b), S contains a sparse subset with only O(NQ) entries which are randomly selected from all the available similarity pairs. The random sampling is performed for each iteration. We choose the size of S to be O(NQ) so that the time cost to update U is reduced to O(NQ^3) per iteration. For storage efficiency, we compute only the sampled subset during each iteration. In this way, the maximum required storage reduces to O(NQ).

We can further reduce the time cost by sampling S in an aligned way. During each iteration, an index set I of size O(Q) is randomly chosen from {1, ..., N}. We then construct S by using the rows and columns in I, with entries on the diagonal excluded. The resulting S is shown in Figure 1(c). Following this, the constructed S is guaranteed to be symmetric, and H_i defined in (3) can be simplified as:

$$H_i = -\frac{1}{8}\sum_{j: s_{ij} \in S} U_{j*}^T U_{j*} - \frac{1}{\beta}I.$$

For all i ∉ I, the set {j : s_ij ∈ S} is exactly the same thanks to the alignment of S. This implies that H_i remains the same for all i ∉ I. Therefore, we can compute H_i^{-1} in a preprocessing step. By doing so, the time cost to update U_i* for each i ∉ I can be reduced to O(Q^2). For each i ∈ I, we can update U_i* through some more involved calculations while still maintaining the O(Q^2) time complexity. We can also safely skip updating U_i* for that iteration without much loss in performance due to the fact that Q is much smaller than N. Even though U_i* is not updated for some small portion of i in one single iteration, it will very likely be updated in the subsequent iterations because I changes over the iterations. As a result, the total time cost to update U is reduced to O(NQ^2) per iteration. For Q up to 128, this makes our learning process two orders of magnitude faster.

Consequently, since Q is bounded by a small constant, we can say that the cost in computation and storage of our learning algorithm is linear in the number of training points N. This makes our LFH easily scalable to very large datasets.
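The aligned scheme can be sketched as follows. For readability the sketch assumes a dense 0/1 similarity matrix, which the paper never materializes, and it simply skips the rows inside the sampled index set, as the text permits; names and data layout are our own assumptions.

```python
import numpy as np

def aligned_stochastic_step(U, S_full, beta, rng):
    """One aligned stochastic update of U (Section 2.5)."""
    N, Q = U.shape
    idx = rng.choice(N, size=Q, replace=False)       # aligned index set
    idx_set = set(idx.tolist())
    U_I = U[idx]                                     # Q x Q block shared by all rows

    # Shared bound matrix for every row outside the index set (simplified Eq. (3)).
    H_shared = -(U_I.T @ U_I) / 8.0 - np.eye(Q) / beta
    H_inv = np.linalg.inv(H_shared)                  # precomputed once per iteration

    for i in range(N):
        if i in idx_set:
            continue                                 # skipped for simplicity
        theta = 0.5 * U[i] @ U_I.T                   # Theta_ij for the sampled j
        a = 1.0 / (1.0 + np.exp(-theta))
        s = S_full[i, idx]
        # The s_ij and s_ji sums coincide for the aligned, symmetric S,
        # so the two 1/2 factors in the gradient add up to 1.
        grad = (s - a) @ U_I - U[i] / beta
        U[i] = U[i] - grad @ H_inv                   # update rule (4)
    return U

# Toy usage: 50 points in 3 semantic classes, 8-bit codes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=50)
S_full = (labels[:, None] == labels[None, :]).astype(float)
U = rng.standard_normal((50, 8))
U = aligned_stochastic_step(U, S_full, beta=1.0, rng=rng)
print(U[:2])
```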


2.6 Normalized Hyper-parameters

The hyper-parameter β in the objective function defined in (2) acts as a factor weighing the relative importance of the first and the second term. However, the number of summands in each term is different: in the first term there are |S| summands, while in the second term there are N summands. Since different datasets may have different values of N and |S|, the optimal value of β may vary between datasets. To address this issue and make our method less dependent on a specific dataset, we introduce a new hyper-parameter β' satisfying:

$$\beta' = \frac{N}{|S|\beta}.$$

By expressing β in (2) through β', we obtain a specialized β for each dataset, while the optimal value of β' is normalized to roughly the same range on different datasets. We can normalize λ_e in (5) by following the same idea.

We find that the MLH method spends most of its time on selecting the best hyper-parameters for each dataset. With the normalized hyper-parameters introduced, we can pre-compute the optimal values for the hyper-parameters on some smaller dataset, and then apply the same values to all other datasets. This saves us the time of hyper-parameter selection and makes our method more efficient on large datasets.

3. EXPERIMENT

3.1 Datasets

We evaluate our method on two standard large image datasets with semantic labels: CIFAR-10 [11] and NUS-WIDE [3].

The CIFAR-10 dataset [11] consists of 60,000 color images drawn from the 80M tiny image collection [29]. Each image of size 32 x 32 is represented by a 512-dimensional GIST feature vector. Each image is manually classified into one of 10 classes, with an exclusive label indicating its belonging class. Two images are considered ground truth neighbors if they have the same class label.

The NUS-WIDE dataset [3] contains 269,648 images collected from Flickr. Each image is represented by a 1134-dimensional low-level feature vector, including color histogram, color auto-correlogram, edge direction histogram, wavelet texture, block-wise color moments, and bag of visual words. The images are manually assigned with some of the 81 concept tags. Two images are defined as ground truth neighbors if they share at least one common tag.

For data pre-processing, we follow the standard way of feature normalization by making each dimension of the feature vectors have zero mean and equal variance.

3.2 Experimental Settings and Baselines

For both the CIFAR-10 and NUS-WIDE datasets, we randomly sample 1,000 points as the query set, 1,000 points as the validation set, and all the remaining points as the training set. With the normalized hyper-parameters described in Section 2.6, the best hyper-parameters are selected using the validation set of CIFAR-10. All experimental results are averaged over 10 independent rounds of random training/validation/query partitions.

Unless otherwise stated, LFH in the experiment section refers to LFH with stochastic learning. We compare our LFH method with some state-of-the-art hashing methods, which include:

• Data-independent methods: locality-sensitive hashing (LSH), and shift-invariant kernels hashing (SIKH);

• Unsupervised data-dependent methods: spectral hashing (SH), principal component analysis based hashing (PCAH), iterative quantization (ITQ), anchor graph hashing (AGH);

• Supervised data-dependent methods: sequential projection learning for hashing (SPLH), minimal loss hashing (MLH), and kernel-based supervised hashing (KSH).

All the baseline methods are implemented using the source code provided by the corresponding authors. For KSH and AGH, the number of support points for kernel construction is set to 300 by following the same settings as in [17, 18]. For KSH, SPLH, and MLH, it is impossible to use all the supervised information for training since it would be very time-consuming. Following the same strategy used in [17], we sample 2,000 labeled points for these methods.

All our experiments are conducted on a workstation with 24 Intel Xeon CPU cores and 64 GB RAM.

3.3 Effect of Stochastic Learning

The convergence curve of the objective function on a sampled CIFAR-10 subset of 5000 points with code length 32 is shown in Figure 2. LFH-Full refers to the LFH that uses the full supervised information for updating, and LFH-Stochastic refers to LFH with stochastic learning. The objective function value is computed based on the full supervised information for both methods. We can see that the objective function of LFH-Full converges to a stationary point after a few iterations. The objective function of LFH-Stochastic shows a major trend of convergence to some stationary point with slight oscillation. This behavior is quite similar to that of stochastic gradient descent and is empirically acceptable.

Figure 3 shows the mean average precision (MAP) [10, 17, 19] values computed on a validation set during the updating process. The final MAP evaluated on a query set is 0.5237 for LFH-Full and 0.4694 for LFH-Stochastic. The reduction in MAP of LFH-Stochastic is acceptable given the dramatic decrease in time complexity brought by stochastic learning.

3.4 Hamming Ranking Performance

We perform Hamming ranking using the generated binary codes on the CIFAR-10 and NUS-WIDE datasets. For each query in the query set, all the points in the training set are ranked according to the Hamming distance between their binary codes and the query's. The MAP is reported to evaluate the accuracy of different hashing methods.


Figure 4: MAP results with different code lengths. (a) CIFAR-10; (b) NUS-WIDE.

Figure 2: Convergence curve.

Figure 3: MAP during the iterations.

Figure 4(a) and Figure 4(b) show the averaged MAP results with different code lengths on the two datasets, respectively. We can find that, with the help of semantic information, supervised data-dependent methods can generally achieve better accuracy than data-independent and unsupervised data-dependent methods. Furthermore, the accuracy of our LFH method is much higher than that of the other methods, including the supervised data-dependent methods KSH, SPLH, and MLH.

The precision-recall curves with different code lengths will be illustrated in the Appendix at the end of this paper (refer to Figure 9 and Figure 10), which will also show that our LFH method can significantly outperform other state-of-the-art hashing methods.

3.5 Computational Cost

Figure 5(a) and Figure 5(b) show the average training time of different hashing methods with different code lengths on the CIFAR-10 and NUS-WIDE datasets, respectively. The reported values are in seconds on a logarithmic scale.

We can find that the data-independent hashing methods require the least amount of training time, and the supervised data-dependent hashing methods need the most. Compared to other supervised data-dependent hashing methods, the training time of LFH is much smaller than that of MLH and is comparable to that of KSH and SPLH. For large code lengths, our LFH is even faster than KSH and SPLH. This is because the number of iterations needed to learn U decreases as the code length increases.

3.6 Performance using Full Supervised Information

For the results reported in Section 3.4 and Section 3.5, we adopt the same strategy as that in [17] to train KSH, SPLH, and MLH by sampling only 2,000 labeled points due to their high time complexity. To get a deeper comparison, we perform another experiment on smaller datasets where the full supervised information can be used for training.


[Figure 5: Training time with different code lengths; (a) CIFAR-10, (b) NUS-WIDE. The logarithm of the training time (in seconds) is plotted against the code length for LFH, KSH, MLH, SPLH, ITQ, AGH, LSH, PCAH, SH and SIKH.]

To get a deeper comparison, we perform another experiment on smaller datasets where the full supervised information can be used for training. We randomly sample a subset of CIFAR-10 with 5,000 points for evaluation. We also include LFH with stochastic learning to better demonstrate its effectiveness. Figure 6 and Figure 7 show the accuracy and computational cost of these methods.

[Figure 6: Accuracy on CIFAR-10 subset with full labels. MAP is plotted against the code length (32 to 96 bits) for LFH-Full, LFH-Stochastic, KSH-Full, SPLH-Full and MLH-Full.]

We can see that our LFH, even with stochastic learning, can achieve higher MAP than the other methods trained with full labels. The training speed of LFH with full labels is comparable to that of KSH and SPLH, and is much faster than that of MLH. LFH with stochastic learning beats all the other methods in training time.

[Figure 7: Computational cost on CIFAR-10 subset with full labels. The logarithm of the training time is plotted against the code length for the same five methods.]

Hence, we can conclude that our LFH method can outperform other supervised hashing methods in terms of both accuracy and computational cost.

3.7 Case Study

In Figure 8, we demonstrate the Hamming ranking results for some example queries on the CIFAR-10 dataset. For each query image, the top (nearest) ten images returned by different hashing methods are shown. We use red rectangles to indicate the images that are not in the same class as the query image, i.e., the wrongly returned results. It is easy to see that our LFH method exhibits the fewest errors compared to the other supervised hashing methods.
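The Hamming ranking used in this case study simply sorts the database codes by their Hamming distance to the query code and returns the top ten. As a rough illustration of that retrieval step (not the authors' implementation; the packed-code representation and helper names below are our own assumptions), the distance can be computed with an XOR followed by a per-byte popcount:

import numpy as np

def pack_codes(bits):
    """Pack {0,1} codes of shape (n, code_length) into uint8 words."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_topk(query_bits, db_packed, k=10):
    """Return indices of the k database codes closest to the query in Hamming distance."""
    q_packed = np.packbits(query_bits.astype(np.uint8))
    xor = np.bitwise_xor(db_packed, q_packed)        # differing bits, byte by byte
    # popcount per byte via an 8-bit lookup table, then sum over the bytes of each code
    popcount_table = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)
    dists = popcount_table[xor].sum(axis=1)
    return np.argsort(dists, kind="stable")[:k]

# Example usage: hamming_topk(query_bits, pack_codes(db_bits), k=10) would give the
# indices of the ten nearest images for one query. A production system would typically
# batch the queries or use hash-table lookup within a small Hamming radius instead of
# exhaustively ranking the whole database.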


[Figure 8: Example search (Hamming ranking) results on CIFAR-10, where red rectangles are used to indicate the images that are not in the same class as the query image, i.e., the wrongly returned results. (a) and (b) contain the same query images, which are duplicated for better demonstration. Other images are the returned results of (c) LFH; (d) KSH; (e) MLH; (f) SPLH.]

4. CONCLUSION

Hashing has become a very effective technique for ANN search in the massive datasets that are common in this big data era. In many datasets, it is not hard to collect some supervised information, such as the tag information on many social web sites, for part of the whole dataset. Hence, supervised hashing methods, which can outperform traditional unsupervised hashing methods, have become more and more important. In this paper, we propose a novel method, called LFH, for supervised hashing. A learning algorithm with a convergence guarantee is proposed to learn the parameters of LFH. Moreover, to handle large-scale problems, we propose a stochastic learning method which has linear time complexity. Experimental results on two large datasets with semantic labels show that our LFH method can achieve higher accuracy than other state-of-the-art methods with comparable or lower training cost.

5. ACKNOWLEDGMENTS

This work is supported by the NSFC (No. 61100125, 61272099, 61261160502), the 863 Program of China (No. 2012AA011003), Shanghai Excellent Academic Leaders Plan (No. 11XD1402900), Scientific Innovation Act of STCSM (No. 13511504200), and the Program for Changjiang Scholars and Innovative Research Team in University of China (IRT1158, PCSIRT).

6. REFERENCES

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Annual Symposium on Foundations of Computer Science, pages 459–468, 2006.
[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM, 45(6):891–923, 1998.
[3] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, 2009.
[4] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Annual Symposium on Computational Geometry, pages 253–262, 2004.
[5] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Data Bases, pages 518–529, 1999.
[6] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 817–824, 2011.
[7] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Annual ACM Symposium on Theory of Computing, pages 604–613, 1998.
[8] W. Kong and W.-J. Li. Double-bit quantization for hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, 2012.
[9] W. Kong and W.-J. Li. Isotropic hashing. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 1655–1663, 2012.


[10] W. Kong, W.-J. Li, and M. Guo. Manhattan hashing for large-scale image retrieval. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 45–54, 2012.
[11] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[12] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 1042–1050, 2009.
[13] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In Proceedings of the IEEE International Conference on Computer Vision, pages 2130–2137, 2009.
[14] B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2143–2157, 2009.
[15] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000.
[16] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. R. Dick. Learning hash functions using column generation. In Proceedings of the International Conference on Machine Learning, pages 142–150, 2013.
[17] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2074–2081, 2012.
[18] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In Proceedings of the International Conference on Machine Learning, 2011.
[19] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the International Conference on Machine Learning, pages 353–360, 2011.
[20] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 1070–1078, 2012.
[21] M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang. Comparing apples to oranges: A scalable solution with heterogeneous hashing. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 230–238, 2013.
[22] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 1509–1517, 2009.
[23] M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. S. Davis. Predictable dual-view hashing. In Proceedings of the International Conference on Machine Learning, pages 1328–1336, 2013.
[24] R. Salakhutdinov and G. E. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
[25] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
[26] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 785–796, 2013.
[27] B. Stein. Principles of hash-based text retrieval. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 527–534, 2007.
[28] C. Strecha, A. A. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):66–78, 2012.
[29] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[30] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[31] F. Ture, T. Elsayed, and J. J. Lin. No free lunch: Brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 943–952, 2011.
[32] J. Wang, O. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3424–3431, 2010.
[33] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In Proceedings of the International Conference on Machine Learning, pages 1127–1134, 2010.
[34] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 1753–1760, 2008.
[35] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang. Sparse multi-modal hashing. IEEE Transactions on Multimedia, 16(2):427–439, 2014.
[36] B. Xu, J. Bu, Y. Lin, C. Chen, X. He, and D. Cai. Harmonious hashing. In Proceedings of the International Joint Conference on Artificial Intelligence, 2013.
[37] D. Zhai, H. Chang, Y. Zhen, X. Liu, X. Chen, and W. Gao. Parametric local multimodal hashing for cross-view similarity search. In Proceedings of the International Joint Conference on Artificial Intelligence, 2013.
[38] D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2014.
[39] D. Zhang, F. Wang, and L. Si. Composite hashing with multiple information sources. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 225–234, 2011.
[40] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 18–25, 2010.
[41] Q. Zhang, Y. Wu, Z. Ding, and X. Huang. Learning hash codes for efficient content reuse detection. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 405–414, 2012.
[42] Y. Zhen and D.-Y. Yeung. A probabilistic model for multimodal hash function learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 940–948, 2012.

APPENDIX

For a more extensive evaluation, in Figure 9 and Figure 10 we illustrate the precision-recall curves with different code lengths on the two datasets, CIFAR-10 and NUS-WIDE. Our LFH method shows clear superiority in almost all settings, followed by KSH, SPLH, and MLH, and then the other methods, which do not use semantic information. These results are consistent with the MAP results given above.
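Concretely, curves of this kind are obtained by sweeping a cut-off down the Hamming ranking of each query: at each rank, precision is the fraction of retrieved points that share the query's label and recall is the fraction of all same-label points retrieved so far, and the values are then averaged over queries. The following sketch shows the per-query bookkeeping; it is our own illustrative helper (not the authors' evaluation script), with relevance defined by shared labels as before. The mean over queries of the ap value it returns is the MAP measure reported earlier.

import numpy as np

def pr_curve_and_ap(dists, relevant):
    """Precision-recall points and average precision for one query.

    dists:    Hamming distances from the query to every database point.
    relevant: boolean array, True where the database point shares the query's label.
    """
    order = np.argsort(dists, kind="stable")
    rel = relevant[order]
    tp = np.cumsum(rel)                      # true positives within the top-k
    ranks = np.arange(1, len(rel) + 1)
    precision = tp / ranks                   # precision at each cut-off k
    recall = tp / max(rel.sum(), 1)          # recall at each cut-off k
    ap = (precision * rel).sum() / max(rel.sum(), 1)
    return precision, recall, ap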


[Figure 9: Precision-recall curves on CIFAR-10 with different code lengths: (a) 8 bits, (b) 16 bits, (c) 24 bits, (d) 32 bits, (e) 48 bits, (f) 64 bits, (g) 96 bits, (h) 128 bits. Precision is plotted against recall for LFH, KSH, MLH, SPLH, ITQ, AGH, LSH, PCAH, SH and SIKH.]

[Figure 10: Precision-recall curves on NUS-WIDE with different code lengths: (a) 8 bits, (b) 16 bits, (c) 24 bits, (d) 32 bits, (e) 48 bits, (f) 64 bits, (g) 96 bits, (h) 128 bits. Precision is plotted against recall for the same ten methods.]
