Double-Bit Quantization for Hashing

Weihao Kong and Wu-Jun Li
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
{kongweihao,liwujun}@cs.sjtu.edu.cn

Abstract

Hashing, which tries to learn similarity-preserving binary codes for data representation, has been widely used for efficient nearest neighbor search in massive databases due to its fast query speed and low storage cost. Because it is NP-hard to directly compute the best binary codes for a given data set, mainstream hashing methods typically adopt a two-stage strategy. In the first stage, several projected dimensions of real values are generated. Then in the second stage, the real values are quantized into binary codes by thresholding. Currently, most existing methods use one single bit to quantize each projected dimension. One problem with this single-bit quantization (SBQ) is that the threshold typically lies in the region of the highest point density, and consequently many neighboring points close to the threshold will be hashed to totally different bits, which is undesirable according to the principle of hashing. In this paper, we propose a novel quantization strategy, called double-bit quantization (DBQ), to solve the problem of SBQ. The basic idea of DBQ is to quantize each projected dimension into double bits with adaptively learned thresholds. Extensive experiments on two real data sets show that our DBQ strategy can significantly outperform the traditional SBQ strategy for hashing.
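To make the contrast between the two quantization strategies concrete, the following sketch implements single-bit thresholding alongside a two-threshold, two-bit scheme. The fixed threshold values and the 01/00/10 region codes are illustrative assumptions; the paper's actual contribution is learning the thresholds adaptively, which is not shown here.

```python
import numpy as np

def sbq(x, threshold=0.0):
    """Single-bit quantization: one bit per projected dimension.
    Points just left and just right of the threshold receive
    different bits even when they are close neighbors."""
    return (x > threshold).astype(np.uint8)

def dbq(x, low, high):
    """Double-bit quantization (illustrative sketch).
    Two thresholds (low < high) split each dimension into three
    regions, encoded with two bits so adjacent regions differ in
    exactly one bit:
        x <= low         -> 01
        low < x <= high  -> 00
        x > high         -> 10
    The actual DBQ method learns low/high adaptively from the
    data; the fixed values here are assumptions for illustration.
    """
    first = (x > high).astype(np.uint8)   # 1 only in the right region
    second = (x <= low).astype(np.uint8)  # 1 only in the left region
    return np.stack([first, second], axis=-1)

# Two neighboring points straddling the SBQ threshold:
x = np.array([-0.01, 0.01])
print(sbq(x))             # [0 1] -- neighbors hashed to different bits
print(dbq(x, -1.0, 1.0))  # [[0 0] [0 0]] -- both land in the middle region
```

With such a two-bit encoding, points in adjacent regions differ in only one bit, so a small perturbation around a threshold changes the Hamming distance by at most one rather than flipping the code entirely.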
Introduction

With the explosive growth of data on the Internet, there has been increasing interest in approximate nearest neighbor (ANN) search in massive data sets. Common approaches for efficient ANN search are based on similarity-preserving hashing techniques (Gionis, Indyk, and Motwani 1999; Andoni and Indyk 2008), which encode similar points in the original space into close binary codes in the hashcode space. Most methods use the Hamming distance to measure the closeness between points in the hashcode space. By using hash codes, we can achieve constant or sub-linear search time complexity (Torralba, Fergus, and Weiss 2008; Liu et al. 2011). Furthermore, the storage needed for the binary codes is greatly decreased. For example, if we encode each point with 128 bits, we can store a data set of 1 million points with only 16MB of memory. Hence, hashing provides a very effective and efficient way to perform ANN search for massive data sets, and many hashing methods have been proposed by researchers from different research communities.
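The storage claim can be checked directly: 128 bits per point is 16 bytes, so one million points occupy 16MB, and the Hamming distance between two packed codes reduces to an XOR followed by a popcount. The sketch below is a minimal illustration; the function and variable names are our own.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as Python
    ints: XOR marks the differing bit positions, popcount counts
    them. int.bit_count() needs Python >= 3.10; older versions can
    use bin(a ^ b).count("1") instead."""
    return (a ^ b).bit_count()

CODE_BITS = 128
NUM_POINTS = 1_000_000

# 128 bits = 16 bytes per point -> 16MB for one million points.
print(NUM_POINTS * (CODE_BITS // 8) / 1e6, "MB")  # 16.0 MB

# Two 128-bit codes that differ in exactly two bit positions.
c1 = (1 << 127) | (1 << 64) | 1
c2 = (1 << 127) | (1 << 63) | 1
print(hamming(c1, c2))  # 2
```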
The existing hashing methods can be mainly divided into two categories (Gong and Lazebnik 2011; Liu et al. 2011; 2012): data-independent methods and data-dependent methods.

Representative data-independent methods include locality-sensitive hashing (LSH) (Gionis, Indyk, and Motwani 1999; Andoni and Indyk 2008) and its extensions (Datar et al. 2004; Kulis and Grauman 2009; Kulis, Jain, and Grauman 2009), and shift invariant kernel hashing (SIKH) (Raginsky and Lazebnik 2009). LSH and its extensions use simple random projections, which are independent of the training data, as hash functions. SIKH chooses projection vectors in a way similar to LSH, but uses a shifted cosine function to generate hash values. Both LSH and SIKH share an important property: points with high similarity are mapped to the same hashcodes with high probability. Compared with data-dependent methods, data-independent methods need longer codes to achieve satisfactory performance (Gong and Lazebnik 2011), which makes them less efficient due to the higher storage and computational cost.
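A random-projection hash of the kind LSH uses takes only a few lines. The sketch below is a generic sign-of-projection hash under our own naming; the SIKH variant, a binarized randomly shifted cosine of the same kind of projection, follows the description above, but its exact form here is an assumption rather than a transcription of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 64, 32

# Data-independent: the projection directions are drawn at random,
# without looking at any training data.
W = rng.normal(size=(n_bits, dim))

def lsh_hash(x):
    """LSH-style bits: 1 if the point falls on the positive side of
    a random hyperplane, 0 otherwise."""
    return (W @ x > 0).astype(np.uint8)

# SIKH-style bits (assumed form): a randomly shifted cosine of the
# same random projection, binarized against a random offset.
b = rng.uniform(0.0, 2.0 * np.pi, size=n_bits)
t = rng.uniform(-1.0, 1.0, size=n_bits)

def sikh_hash(x):
    return (np.cos(W @ x + b) + t > 0).astype(np.uint8)

x = rng.normal(size=dim)
print(lsh_hash(x)[:8])
print(sikh_hash(x)[:8])
```

Because neither hash looks at the data, good precision requires many such bits, which is exactly the longer-code inefficiency noted above.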
Considering the shortcomings of data-independent methods, more and more recent works have focused on data-dependent methods, whose hash functions are learned from the training data. Semantic hashing (Salakhutdinov and Hinton 2007; 2009) adopts a deep generative model to learn the hash functions. Spectral hashing (SH) (Weiss, Torralba, and Fergus 2008) uses spectral graph partitioning for hashing, with the graph constructed from the data similarity relationships. Binary reconstruction embedding (BRE) (Kulis and Darrell 2009) learns the hash functions by explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary codes. Semi-supervised hashing (SSH) (Wang, Kumar, and Chang 2010) exploits some labeled data to help hash function learning. Self-taught hashing (Zhang et al. 2010) uses supervised learning algorithms for hashing based on self-labeled data. Composite hashing (Zhang, Wang, and Si 2011) integrates multiple information sources for hashing. Minimal loss hashing (MLH) (Norouzi and Fleet 2011) formulates the hashing problem as a structured prediction problem. Both accuracy and time are jointly optimized to