ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Image Retrieval

Quan Cui (1,3), Qing-Yuan Jiang (2), Xiu-Shen Wei (3)(✉), Wu-Jun Li (2), and Osamu Yoshie (1)

1 Graduate School of IPS, Waseda University, Fukuoka, Japan
cui-quan@toki.waseda.jp, yoshie@waseda.jp
2 National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, China
qyjiang24@gmail.com, liwujun@nju.edu.cn
3 Megvii Research Nanjing, Megvii Technology, Nanjing, China
weixs.gm@gmail.com

Abstract. Retrieving content-relevant images from a large-scale fine-grained dataset could suffer from intolerably slow query speed and highly redundant storage cost, due to the high-dimensional real-valued embeddings needed to distinguish subtle visual differences of fine-grained objects. In this paper, we study the novel fine-grained hashing topic, which aims to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. Specifically, we propose a unified end-to-end trainable network, termed ExchNet. Based on attention mechanisms and the proposed attention constraints, ExchNet first obtains both local and global features to represent object parts and whole fine-grained objects, respectively. Furthermore, to ensure the discriminative ability and semantic consistency of these part-level features across images, we design a local feature alignment approach built on a feature exchanging operation. Later, an alternating learning algorithm is employed to optimize the whole ExchNet and then generate the final binary hash codes. Validated by extensive experiments, our ExchNet consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets. Moreover, compared with other approximate nearest neighbor methods, ExchNet achieves the best speed-up and storage reduction, revealing its efficiency and practicality.

Keywords: Fine-Grained Image Retrieval · Learning to hash · Feature alignment · Large-scale image search

Q. Cui, Q.-Y. Jiang—Equal contribution.

Electronic supplementary material The online version of this chapter (https://doi.org/10.1007/978-3-030-58580-8_12) contains supplementary material, which is available to authorized users.

© Springer Nature Switzerland AG 2020
A. Vedaldi et al. (Eds.): ECCV 2020, LNCS 12348, pp. 189–205, 2020.
https://doi.org/10.1007/978-3-030-58580-8_12
Fig. 1. Illustration of the fine-grained hashing task. Fine-grained images could share large intra-class variances but small inter-class variances. Fine-grained hashing aims to generate compact binary codes with tiny Hamming distances for images of the same sub-category, as well as distinct codes for images from different sub-categories.

1 Introduction

Fine-Grained Image Retrieval (FGIR) [19,26,31,36,41,42] is a practical but challenging computer vision task. It aims to retrieve images belonging to various sub-categories of a certain meta-category (e.g., birds, cars and aircrafts) and return images with the same sub-category as the query image. In real FGIR applications, previous methods could suffer from slow query speed and redundant storage costs, due to both the explosive growth of massive fine-grained data and their high-dimensional real-valued features.

Learning to hash [3,6,7,10,14,16,17,21,22,34,35] has proven to be a promising solution for large-scale image retrieval, because it can greatly reduce the storage cost and increase the query speed. As a representative research area of approximate nearest neighbor (ANN) search [1,6,13], hashing aims to embed data points as similarity-preserving binary codes. Recently, hashing has been successfully applied in a wide range of image retrieval tasks, e.g., face image retrieval [18], person re-identification [5,43], etc. We hereby explore the effectiveness of hashing for fine-grained image retrieval.

To the best of our knowledge, this is the first work to study the fine-grained hashing problem, i.e., the problem of designing hashing methods for fine-grained objects. As shown in Fig. 1, the task is to generate compact binary codes for fine-grained images exhibiting both large intra-class variances and small inter-class variances. To deal with this challenging task, we propose a unified end-to-end trainable network, ExchNet, which first learns fine-grained tailored features and then generates the final binary hash codes.

Concretely, our ExchNet consists of three main modules, i.e., representation learning, local feature alignment and hash code learning, as shown in Fig. 2. In the representation learning module, beyond obtaining the holistic image representation (i.e., global features), we also employ an attention mechanism to capture part-level features (i.e., local features) for representing fine-grained objects' parts. Localizing parts and embedding part-level cues are
Fig. 2. Framework of our proposed ExchNet, which consists of three modules. (1) The representation learning module, together with an attention mechanism under spatial and channel diversity learning constraints, is designed to obtain both local and global features of fine-grained objects. (2) The local feature alignment module is used to align the obtained local features w.r.t. object parts across different fine-grained images. (3) The hash code learning module generates the compact binary codes.

crucial for fine-grained tasks, since these discriminative but subtle parts (e.g., bird heads or tails) play a major role in distinguishing different sub-categories. Moreover, we develop two kinds of attention constraints, i.e., spatial and channel constraints, which work collaboratively to further improve the discriminative ability of these local features. Subsequently, to ensure that these part-level features correspond to their own parts across different fine-grained images, we design an anchor-based feature alignment approach to align the local features. Specifically, in the local feature alignment module, we treat an anchored local feature as the "prototype" of its part w.r.t. a sub-category, obtained by averaging the local features of that part across images. Once local features are well aligned with their own parts, exchanging one specific part's local feature of an input image with the same part's local feature of the prototype should barely change the image meaning derived from the image representation, and hence the final hash codes should be almost identical. Inspired by this motivation, we perform a feature exchanging operation upon the anchored local features and the other learned local features, as illustrated in Fig. 3. After that, to effectively train the network in this feature alignment fashion, we utilize an alternating algorithm that solves the hash learning problem and updates the anchor features simultaneously.

To quantitatively prove both the effectiveness and efficiency of our ExchNet, we conduct comprehensive experiments on five fine-grained benchmark datasets, including the large-scale ones, i.e., NABirds [11], VegFru [12] and Food101 [23]. Particularly, compared with competing approximate nearest neighbor methods, our ExchNet achieves up to hundreds of times speedup for large-scale fine-grained image retrieval without significant accuracy drops. Meanwhile, compared with state-of-the-art generic hashing methods, ExchNet consistently outperforms them by a large margin on all the fine-grained datasets. Additionally, ablation studies and visualization results justify the effectiveness of our tailored model designs, such as the local feature alignment approach and the proposed attention constraints.
The contributions of this paper are summarized as follows:

– We study the novel fine-grained hashing topic to leverage the search and storage efficiency of hash codes for solving the challenging large-scale fine-grained image retrieval problem.
– We propose a unified end-to-end trainable network, i.e., ExchNet, to first learn fine-grained tailored features and then generate the final binary hash codes. Particularly, the proposed attention constraints, local feature alignment and anchor-based learning fashion contribute to obtaining discriminative fine-grained representations.
– We conduct extensive experiments on five fine-grained datasets to validate both the effectiveness and efficiency of our proposed ExchNet. Especially on the large-scale datasets, ExchNet exhibits outstanding retrieval performance in terms of speedup, memory usage and retrieval accuracy.

2 Related Work

Fine-Grained Image Retrieval. Fine-Grained Image Retrieval (FGIR) is an active research topic that has emerged in recent years, where the database and query images could share small inter-class variance but large intra-class variance. In early works [36], handcrafted features were utilized to tackle the FGIR problem. Powered by deep learning techniques, more and more deep learning based FGIR methods [19,26,31–33,36,41,42] have been proposed. These deep methods can be roughly divided into two groups, i.e., supervised and unsupervised methods. In supervised methods, FGIR is defined as a metric learning problem. Zheng et al. [41] designed a novel ranking loss and a weakly-supervised attractive feature extraction strategy to facilitate retrieval performance. Zheng et al. [42] improved their former work [41] with a normalize-scale layer and a de-correlated ranking loss. As to unsupervised methods, Selective Convolutional Descriptor Aggregation (SCDA) [31] was proposed to first localize the main object in fine-grained images, and then discard the noisy background while keeping useful deep descriptors for fine-grained image retrieval.

Deep Hashing. Hashing methods can be divided into two categories, i.e., data-independent methods [6] and data-dependent methods [10,17], based on whether training points are used to learn the hash functions. Generally speaking, data-dependent methods, also named Learning to Hash (L2H) methods, can achieve better retrieval performance with the help of learning on training data. With the rise of deep learning, some L2H methods integrate deep feature learning into the hashing framework and achieve promising performance, and many deep hashing methods [2,3,7,14,16,17,21,22,30,35,38,39] for large-scale image retrieval have been proposed. Compared with deep unsupervised hashing methods [7,14,21], deep supervised hashing methods [14,16,17,35] can achieve superior retrieval accuracy as they can fully explore the semantic information. Specifically, the early work [35] was essentially a two-stage method
Fig. 3. Key idea of our local feature alignment approach: given an image pair of a fine-grained category, exchanging their local features of the same object parts should not change their corresponding hash codes, i.e., these hash codes should be the same as those generated without local feature exchanging, and their Hamming distance should remain small.

which tried to learn binary codes in the first stage and employed feature learning guided by the learned binary codes in the second stage. Later, numerous one-stage deep supervised hashing methods appeared, including Deep Pairwise Supervised Hashing (DPSH) [17], Deep Supervised Hashing (DSH) [22] and Deep Cauchy Hashing (DCH) [3], which aim to integrate feature learning and hash code learning into an end-to-end framework.

3 Methodology

The framework of our ExchNet is presented in Fig. 2. It contains three key modules, i.e., the representation learning module, the local feature alignment module and the hash code learning module.

3.1 Representation Learning

The learning of discriminative and meaningful local features is closely tied to fine-grained tasks [9,15,20,37,40], since such local features greatly benefit the distinguishing of sub-categories whose subtle visual differences derive from discriminative fine-grained parts (e.g., bird heads or tails). In consequence, as shown in Fig. 2, beyond the global feature extractor, we also introduce a local feature extractor in the representation learning module. Specifically, considering model efficiency, we propose to learn local features with an attention mechanism, rather than with other fine-grained techniques of tremendous computational cost, e.g., second-order representations [15,20] or complicated network architectures [9,37,40].

Given an input image $x_i$, a backbone CNN is utilized to extract a holistic deep feature $E_i \in \mathbb{R}^{H \times W \times C}$, which serves as the appetizer for both the local feature extractor and the global feature extractor.
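To make the data flow of this module concrete, the following is a minimal PyTorch-style sketch of the two extractors just introduced and detailed in Eqs. (1)-(4) below. The layer choices, dimensions and attention head (a 1x1 convolution with sigmoid) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ExchNetFeatures(nn.Module):
    """Sketch of the representation learning module: a backbone feature E_i is
    turned into M attentive local features plus one global feature (Eqs. (1)-(4)).
    All layer choices and dimensions here are illustrative assumptions."""

    def __init__(self, backbone, feat_ch=512, out_ch=1024, num_parts=4):
        super().__init__()
        self.backbone = backbone                        # x -> E_i, shape (B, C, H, W)
        self.attn = nn.Conv2d(feat_ch, num_parts, kernel_size=1)  # M attention maps
        self.lfr = nn.Sequential(                       # Local Features Refinement
            nn.Conv2d(feat_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.gfr = nn.Sequential(                       # Global Features Refinement
            nn.Conv2d(feat_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.gap = nn.AdaptiveAvgPool2d(1)              # global average pooling

    def forward(self, x):
        E = self.backbone(x)                            # holistic deep feature E_i
        A = torch.sigmoid(self.attn(E))                 # A_i: (B, M, H, W)
        local_maps, local_vecs = [], []
        for j in range(A.size(1)):
            E_hat_j = E * A[:, j:j + 1]                 # Eq. (1): channel-wise Hadamard product
            F_j = self.lfr(E_hat_j)                     # Eq. (2): refine to high-level semantics
            local_maps.append(F_j)
            local_vecs.append(self.gap(F_j).flatten(1)) # Eq. (3): f_i^j via GAP
        F_global = self.gfr(E)                          # Eq. (4)
        f_global = self.gap(F_global).flatten(1)
        return local_maps, local_vecs, f_global
```

Note that this sketch shares a single LFR network across the M parts; the paper does not state whether these weights are shared, so treat that as a design assumption.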
It is worth mentioning that the attention module is engaged in the middle of the feature extractor, since the shallow layers of deep neural networks preserve low-level context information (e.g., colors and edges) well, which is crucial for distinguishing the subtle visual differences of fine-grained objects. By feeding $E_i$ into the attention generation module, $M$ attention maps $A_i \in \mathbb{R}^{M \times H \times W}$ are generated, where $A_i^j \in \mathbb{R}^{H \times W}$ denotes the attentive region of the $j$-th ($j \in \{1, \dots, M\}$) part cue of $x_i$. After that, the obtained part-level attention map $A_i^j$ is element-wisely multiplied with $E_i$ to select the attentive local feature corresponding to the $j$-th part, which is formulated as:

$\hat{E}_i^j = E_i \otimes A_i^j$,  (1)

where $\hat{E}_i^j \in \mathbb{R}^{H \times W \times C}$ represents the $j$-th attentive local feature of $x_i$ and "$\otimes$" denotes the Hadamard product on each channel. For simplicity, we use $\hat{E}_i = \{\hat{E}_i^1, \dots, \hat{E}_i^M\}$ to denote the set of local features; subsequently, $\hat{E}_i$ is fed into the Local Features Refinement (LFR) network, composed of a stack of convolution layers, to embed these attentive local features into higher-level semantic meanings:

$F_i = f_{\mathrm{LFR}}(\hat{E}_i)$,  (2)

where the output is denoted as $F_i = \{F_i^1, \dots, F_i^M\}$, representing the final local feature maps w.r.t. high-level semantics. We denote by $f_i^j \in \mathbb{R}^{C'}$ the local feature vector obtained by applying global average pooling (GAP) on $F_i^j \in \mathbb{R}^{H' \times W' \times C'}$:

$f_i^j = f_{\mathrm{GAP}}(F_i^j)$.  (3)

On the other side, as to the global feature extractor, we directly adopt a Global Features Refinement (GFR) network composed of conventional convolutional operations to embed $E_i$:

$F_i^{\mathrm{global}} = f_{\mathrm{GFR}}(E_i)$.  (4)

We use $F_i^{\mathrm{global}} \in \mathbb{R}^{H' \times W' \times C'}$ and $f_i^{\mathrm{global}} \in \mathbb{R}^{C'}$ to denote the learned global feature and the corresponding holistic feature vector after GAP, respectively.

Furthermore, to facilitate the learning of localized feature cues (i.e., capturing fine-grained parts), we impose a spatial diversity constraint and a channel diversity constraint on the local features in $F_i$.

A natural choice for increasing the diversity of local features is to differentiate the distributions of the attention maps [40]. However, over-applied constraints on the learned attention maps might cause a problem: the attention map could have large activation values at spatial positions where the holistic feature is not activated at all. Instead, in our method, we design and apply the constraints directly on the local features. Concretely, for the local feature $F_i^j$, we obtain its "aggregation map" $\hat{A}_i^j \in \mathbb{R}^{H' \times W'}$ by summing all $C'$ feature maps along the channel dimension, apply the softmax function to convert it into a valid distribution, and then flatten it into a vector $\hat{a}_i^j$. Based on the Hellinger distance, we propose a spatial diversity induced loss:

$\mathcal{L}_{sp}(x_i) = 1 - \frac{1}{\sqrt{2}\binom{M}{2}} \sum_{l,k=1}^{M} \left\| \hat{a}_i^l - \hat{a}_i^k \right\|_2$,  (5)

where $\binom{M}{2}$ denotes the number of ways to pick 2 unordered outcomes from $M$ possibilities. The spatial diversity constraint drives the aggregation maps to be activated at spatial positions as diverse as possible.

As to the channel diversity constraint, we first convert each local feature vector $f_i^j$ into a valid distribution:

$p_i^j = \mathrm{softmax}(f_i^j), \quad \forall j \in \{1, \dots, M\}$.  (6)

Subsequently, we propose a constraint loss over $\{p_i^j\}_{j=1}^M$:

$\mathcal{L}_{cp}(x_i) = \left[\, t - \frac{1}{\sqrt{2}\binom{M}{2}} \sum_{l,k=1}^{M} \left\| p_i^l - p_i^k \right\|_2 \right]_+$,  (7)

where $t \in [0, 1]$ is a hyper-parameter to adjust the diversity and $[\cdot]_+$ denotes $\max(\cdot, 0)$. Equipping the network with the channel diversity constraint helps suppress redundancies in the features along the channel dimension. Overall, the spatial and channel diversity constraints work collaboratively to obtain discriminative local features.
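A minimal sketch of the two diversity losses, assuming the reconstruction of Eqs. (5)-(7) given above: the pairwise sum is taken over unordered part pairs to match the $\binom{M}{2}$ normalizer, and tensor layouts follow the earlier sketch in this section.

```python
import torch
import torch.nn.functional as F

def diversity_losses(local_maps, local_vecs, t=0.5):
    """Sketch of the spatial (Eq. (5)) and channel (Eq. (7)) diversity losses.
    local_maps: list of M tensors of shape (B, C', H', W');
    local_vecs: list of M tensors of shape (B, C')."""
    M = len(local_maps)
    # Aggregation maps: sum over channels, softmax over spatial positions, flatten.
    a = [F.softmax(m.sum(dim=1).flatten(1), dim=1) for m in local_maps]
    p = [F.softmax(v, dim=1) for v in local_vecs]       # Eq. (6)

    sp_sum = cp_sum = 0.0
    for l in range(M):
        for k in range(l + 1, M):                       # unordered pairs, C(M, 2) of them
            sp_sum = sp_sum + (a[l] - a[k]).norm(dim=1)
            cp_sum = cp_sum + (p[l] - p[k]).norm(dim=1)
    scale = 1.0 / (2 ** 0.5 * (M * (M - 1) / 2))        # 1 / (sqrt(2) * C(M, 2))
    loss_sp = (1.0 - scale * sp_sum).mean()             # Eq. (5)
    loss_cp = torch.clamp(t - scale * cp_sum, min=0.0).mean()  # Eq. (7) with [.]_+
    return loss_sp, loss_cp
```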
3.2 Learning to Align by Local Feature Exchanging

On top of the representation learning module, aligning the local features is necessary to ensure that they represent, and more importantly correspond to, common fine-grained parts across images, which is essential for fine-grained tasks. Hence, we propose an anchor-based local feature alignment approach assisted by a feature exchanging operation.

Intuitively, local features of the same object part (e.g., the bird heads of a bird species) should be embedded with almost the same semantic meaning. As illustrated in Fig. 3, our key idea is that, if the local features were well aligned, exchanging the features of identical parts of two input images belonging to the same sub-category should not change the generated hash codes. Inspired by this, we propose a local feature alignment strategy that leverages a feature exchanging operation between learned local features and anchored local features. As a foundation for feature exchanging, a set of dynamic anchored local features $C_{y_i} = \{c_{y_i}^1, \dots, c_{y_i}^M\}$ is maintained for class $y_i$, in which the $j$-th anchored local feature $c_{y_i}^j$ is obtained by averaging the $j$-th part's local features over all training samples of class $y_i$. At the end of each training epoch, the anchored local features are recalculated and updated.
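As a sketch, the per-class anchors can be recomputed after each epoch roughly as follows (the same class-wise averaging reappears as the $\mathcal{C}$-update of Eq. (18) in Sect. 3.4); the tensor layout and the epoch-end pass over all training features are assumptions.

```python
import torch

def update_anchors(local_vecs, labels, num_classes):
    """Recompute anchored local features c_y^j by class-wise averaging.
    local_vecs: (N, M, C') part features of all N training samples;
    labels: (N,) integer class labels. Returns anchors of shape (num_classes, M, C')."""
    N, M, C = local_vecs.shape
    anchors = torch.zeros(num_classes, M, C, device=local_vecs.device)
    counts = torch.zeros(num_classes, device=local_vecs.device)
    anchors.index_add_(0, labels, local_vecs)            # sum part features per class
    counts.index_add_(0, labels, torch.ones(N, device=local_vecs.device))
    return anchors / counts.clamp(min=1).view(-1, 1, 1)  # average -> class prototypes
```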
Fig. 4. Our feature exchanging and hash code learning in the training phase. According to the class indices (i.e., $y_i$ and $y_j$), we first select the categorical anchor features $C_{y_i}$ and $C_{y_j}$ for samples $x_i$ and $x_j$, respectively. Then, for each input image, the feature exchanging operation is conducted between its learned and anchored local features. After that, hash codes are generated from the exchanged features, and the learning is driven by preserving the pairwise similarity of the hash codes $u_i$ and $v_j$.

Subsequently, as shown in Fig. 4, for a sample $x_i$ of category $y_i$, we exchange half of the learned local features in $G_i = \{f_i^1, \dots, f_i^M\}$ with the corresponding anchored local features in $C_{y_i} = \{c_{y_i}^1, \dots, c_{y_i}^M\}$. The exchanging process can be formulated as:

$\forall j \in \{1, \dots, M\}, \quad \hat{f}_i^j = \begin{cases} f_i^j, & \text{if } \xi_j \geq 0.5, \\ c_{y_i}^j, & \text{otherwise}, \end{cases}$  (8)

where $\xi_j \sim \mathcal{B}(0.5)$ is a random variable following the Bernoulli distribution for the $j$-th part. The local features after exchanging are denoted as $\hat{G}_i = \{\hat{f}_i^1, \dots, \hat{f}_i^M\}$ and fed into the hash code learning module for generating binary codes and computing similarity-preservation losses.
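Eq. (8) amounts to an element-wise select between learned features and class anchors; a minimal sketch follows, drawing $\xi_j$ independently per sample (a granularity the text leaves implicit).

```python
import torch

def exchange_local_features(local_vecs, class_anchors, labels):
    """Sketch of the feature exchanging operation (Eq. (8)).
    local_vecs: (B, M, C') learned part features G_i;
    class_anchors: (num_classes, M, C') anchored local features;
    labels: (B,) class indices y_i."""
    anchors = class_anchors[labels]                     # gather C_{y_i}: (B, M, C')
    B, M, _ = local_vecs.shape
    keep = torch.rand(B, M, 1, device=local_vecs.device) >= 0.5  # xi_j ~ Bernoulli(0.5)
    return torch.where(keep, local_vecs, anchors)       # f_hat_i^j per Eq. (8)
```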
3.3 Hash Code Learning

After obtaining both the global and the local features, we concatenate them and feed the result into the hash code learning module. Specifically, the hashing network contains a fully connected layer and a $\mathrm{sign}(\cdot)$ activation layer. In our method, we choose an asymmetric hashing scheme for ExchNet for its flexibility [25]. Concretely, we utilize two hash functions, defined as $g(\cdot)$ and $h(\cdot)$, to learn two different binary codes for the same training sample. The learning procedure is as follows:

$u_i = g([\hat{G}_i; f_i^{\mathrm{global}}]_{cat}) = \mathrm{sign}(W^{(g)\top} [\hat{G}_i; f_i^{\mathrm{global}}]_{cat})$,  (9)
$v_i = h([\hat{G}_i; f_i^{\mathrm{global}}]_{cat}) = \mathrm{sign}(W^{(h)\top} [\hat{G}_i; f_i^{\mathrm{global}}]_{cat})$,  (10)

where $[\cdot\,; \cdot]_{cat}$ denotes the concatenation operator and $u_i, v_i \in \{-1, +1\}^q$ denote the two different binary codes of the $i$-th sample, with $q$ representing the code length. $W^{(g)}$ and $W^{(h)}$ denote the parameters of the hash functions $g(\cdot)$ and $h(\cdot)$, respectively (bias terms are omitted for simplicity). We denote the learned binary codes by $U = \{u_i\}_{i=1}^n$ and $V = \{v_i\}_{i=1}^n$. Inspired by [14], we only keep the binary codes $v_i$ and leave the hash function $h(\cdot)$ implicit. Hence, we can perform feature learning and binary code learning simultaneously.

To preserve the pairwise similarity, we adopt the squared loss and define the following objective function:

$\mathcal{L}_{sq}(u_i, v_j, \mathcal{C}) = (u_i^\top v_j - q S_{ij})^2$,  (11)

where $u_i = g([\hat{G}_i; f_i^{\mathrm{global}}]_{cat})$, $S_{ij}$ is the pairwise similarity label, and $\mathcal{C} = \{C_i\}$ denotes the set of anchored local features of all categories. We use $\Theta$ to denote the parameters of the deep neural network and the hash layer. The overall process is illustrated in Fig. 4.

Due to the zero-gradient problem caused by the $\mathrm{sign}(\cdot)$ function, $\mathcal{L}_{sq}(\cdot,\cdot,\cdot)$ is intractable to optimize. In this paper, we relax $g(\cdot) = \mathrm{sign}(\cdot)$ into $\hat{g}(\cdot) = \tanh(\cdot)$ to alleviate this problem, which yields the following loss function:

$\hat{\mathcal{L}}_{sq}(\hat{u}_i, v_j, \mathcal{C}) = (\hat{u}_i^\top v_j - q S_{ij})^2$,  (12)

where $\hat{u}_i = \hat{g}([\hat{G}_i; f_i^{\mathrm{global}}]_{cat})$ and $U$ is relaxed as $\hat{U} = \{\hat{u}_i\}_{i=1}^n$.

Then, given a set of image samples $\mathcal{X} = \{x_1, \dots, x_n\}$ and their pairwise labels $S = \{S_{ij}\}_{i,j=1}^n$, we obtain the following objective function by combining Eqs. (5), (7) and (12):

$\min_{V, \Theta, \mathcal{C}} \; \mathcal{L}(\mathcal{X}) = \sum_{i,j=1}^{n} \hat{\mathcal{L}}_{sq}(\hat{u}_i, v_j; S_{ij}) + \lambda \sum_{i=1}^{n} \mathcal{L}_{sp}(x_i) + \gamma \sum_{i=1}^{n} \mathcal{L}_{cp}(x_i)$  (13)
$\text{s.t.} \;\; \forall i \in \{1, \dots, n\}, \; \hat{u}_i = \hat{g}([\hat{G}_i; f_i^{\mathrm{global}}]_{cat}), \; v_j \in \{-1, +1\}^q$,

where $S_{ij}$ represents the similarity between the $i$-th and the $j$-th samples, $q$ denotes the code length, and $\lambda$ and $\gamma$ are hyper-parameters.
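A minimal sketch of the relaxed asymmetric objective in Eqs. (9)-(13): a single bias-free linear layer followed by tanh produces the query-side codes $\hat{u}$, while the database-side codes $v$ stay binary. Dimensions and the batch-wise formulation are assumptions.

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Sketch of the hash layer of Eqs. (9)-(12): a bias-free linear map W^(g)
    followed by tanh as the relaxation of sign(.). h(.) is kept implicit."""

    def __init__(self, feat_dim, code_len):
        super().__init__()
        self.W_g = nn.Linear(feat_dim, code_len, bias=False)

    def forward(self, feats):
        # feats: (B, feat_dim), the concatenation [G_hat_i; f_i^global]
        return torch.tanh(self.W_g(feats))              # relaxed codes u_hat

def sq_loss(u_hat, v, S, q):
    """Asymmetric squared similarity-preservation loss (Eq. (12)).
    u_hat: (B, q) relaxed query codes; v: (N, q) binary database codes in {-1, +1};
    S: (B, N) pairwise similarity labels."""
    return ((u_hat @ v.t() - q * S) ** 2).sum()
```

At inference, replacing tanh with sign recovers the binary codes of Eq. (9), matching the out-of-sample extension in Sect. 3.5.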
3.4 Learning Algorithm

To solve the optimization problem in Eq. (13), we design an alternating algorithm that learns $V$, $\Theta$ and $\mathcal{C}$ by updating one of them with the other two fixed.

Learn $\Theta$ with $V$ and $\mathcal{C}$ Fixed. With $V$ and $\mathcal{C}$ fixed, we update the network parameters $\Theta$ by back-propagation (BP). In particular, we first calculate the gradient

$\nabla_\Theta \mathcal{L}(\mathcal{X}) = \sum_{i,j=1}^{n} \nabla_\Theta \hat{\mathcal{L}}_{sq}(\hat{u}_i, v_j) + \lambda \sum_{i=1}^{n} \nabla_\Theta \mathcal{L}_{sp}(x_i) + \gamma \sum_{i=1}^{n} \nabla_\Theta \mathcal{L}_{cp}(x_i)$,  (14)

and then use the back-propagation algorithm to update $\Theta$.

Learn $V$ with $\Theta$ and $\mathcal{C}$ Fixed. With $\Theta$ and $\mathcal{C}$ fixed, we can rewrite $\mathcal{L}(V)$ as follows:

$\mathcal{L}(V) = \sum_{i,j=1}^{n} (\hat{u}_i^\top v_j - q S_{ij})^2 = \|\hat{U} V^\top - q S\|_F^2$  (15)
$= \|\hat{U} V^\top\|_F^2 - 2q \, \mathrm{tr}(S^\top \hat{U} V^\top) + \mathrm{const}$.  (16)

Because $V$ is defined over $\{-1, +1\}^{n \times q}$, we learn $V$ column by column as in ADSH [14]. Specifically, we can get the closed-form solution for the $k$-th column $V_{*k}$ as follows:

$V_{*k} = -\mathrm{sign}(V_{/k} \hat{U}_{/k}^\top \hat{U}_{*k} - q Q_{*k})$,  (17)

where $Q = S^\top \hat{U}$, and $V_{/k}$ and $\hat{U}_{/k}$ denote the matrices $V$ and $\hat{U}$ excluding their $k$-th columns.

Learn $\mathcal{C}$ with $V$ and $\Theta$ Fixed. With $\Theta$ and $V$ fixed, we update each $C_i \in \mathcal{C}$ by:

$\forall k, \quad c_i^k = \frac{1}{n_i} \sum_{j: y_j = i} f_j^k$,  (18)

where $n_i$ denotes the number of training samples belonging to class $i$.

3.5 Out-of-Sample Extension

Once the training phase is finished, we can generate the binary code of a sample $x_i$ by $u_i = \mathrm{sign}(W^{(g)\top} [G_i; f_i^{\mathrm{global}}]_{cat})$.

4 Experiments

4.1 Datasets

For comparisons, we select two widely used fine-grained datasets, i.e., CUB [29] and Aircraft [24], as well as three popular large-scale fine-grained datasets, i.e., NABirds [11], VegFru [12] and Food101 [23], to conduct experiments.

Specifically, CUB is a bird classification benchmark dataset containing 11,788 images from 200 bird species. It is officially split into 5,994 for training