It is worth mentioning that the attention is engaged in the middle of the feature extractor, since the shallow layers of deep neural networks preserve low-level context information (e.g., colors and edges) well, which is crucial for distinguishing the subtle visual differences of fine-grained objects. Then, by feeding $E_i$ into the attention generation module, $M$ attention maps $A_i \in \mathbb{R}^{M \times H \times W}$ are generated, and we use $A_i^j \in \mathbb{R}^{H \times W}$ to denote the attentive region of the $j$-th ($j \in \{1, \ldots, M\}$) part cue for $x_i$. After that, the obtained part-level attention map $A_i^j$ is element-wisely multiplied with $E_i$ to select the attentive local feature corresponding to the $j$-th part, which is formulated as:
$$\hat{E}_i^j = E_i \otimes A_i^j, \tag{1}$$
where $\hat{E}_i^j \in \mathbb{R}^{H \times W \times C}$ represents the $j$-th attentive local feature of $x_i$, and "$\otimes$" denotes the Hadamard product applied on each channel.
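The following PyTorch sketch illustrates Eq. (1): $M$ part-level attention maps are generated from the mid-level feature map $E_i$, and each map is broadcast over the channels for a Hadamard product. The 1x1-convolution-plus-sigmoid attention generator and the names (AttentiveLocalSelector, num_parts) are illustrative assumptions; the paper does not specify the module's internals here.

```python
import torch
import torch.nn as nn


class AttentiveLocalSelector(nn.Module):
    """Sketch of Eq. (1): select M attentive local features from E_i."""

    def __init__(self, in_channels: int, num_parts: int):
        super().__init__()
        # Hypothetical attention generation module: 1x1 conv + sigmoid.
        self.attn = nn.Sequential(
            nn.Conv2d(in_channels, num_parts, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, E: torch.Tensor):
        # E: (B, C, H, W) mid-level features from the feature extractor.
        A = self.attn(E)                         # (B, M, H, W) attention maps A_i
        # Hadamard product of each A^j with every channel of E (Eq. 1).
        E_hat = E.unsqueeze(1) * A.unsqueeze(2)  # (B, M, C, H, W) attentive local features
        return E_hat, A
```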
For simplicity, we use $\hat{E}_i = \{\hat{E}_i^1, \ldots, \hat{E}_i^M\}$ to denote the set of local features. Subsequently, $\hat{E}_i$ is fed into the Local Features Refinement (LFR) network, composed of a stack of convolution layers, to embed these attentive local features into higher-level semantics:
$$F_i = f_{\mathrm{LFR}}(\hat{E}_i), \tag{2}$$
where the output of the network is denoted as $F_i = \{F_i^1, \ldots, F_i^M\}$, which represents the final local feature maps w.r.t. high-level semantics. We denote $f_i^j \in \mathbb{R}^{C'}$ as the local feature vector obtained by applying global average pooling (GAP) on $F_i^j \in \mathbb{R}^{H' \times W' \times C'}$:
$$f_i^j = f_{\mathrm{GAP}}(F_i^j). \tag{3}$$
On the other side, as to the global feature extractor, for $x_i$ we directly adopt a Global Features Refinement (GFR) network composed of conventional convolutional operations to embed $E_i$:
$$F_i^{\mathrm{global}} = f_{\mathrm{GFR}}(E_i). \tag{4}$$
We use $F_i^{\mathrm{global}} \in \mathbb{R}^{H' \times W' \times C'}$ and $f_i^{\mathrm{global}} \in \mathbb{R}^{C'}$ to denote the learned global feature and the corresponding holistic feature vector after GAP, respectively.
Furthermore, to facilitate the learning of localizing local feature cues (i.e., capturing fine-grained parts), we impose spatial diversity and channel diversity constraints over the local features in $F_i$. Specifically, it is a natural choice to increase the diversity of local features by differentiating the distributions of the attention maps [40]. However, over-applied constraints upon the learned attention maps can cause a problem: the holistic feature may not be activated at some spatial positions even though the attention maps have large activation values there. Instead, in our method, we design and apply the constraints on the local features. Concretely, for the local feature $F_i^j$, we obtain its "aggregation map" $\hat{A}_i^j \in \mathbb{R}^{H' \times W'}$ by adding all $C'$ feature maps along the channel dimension and applying the softmax function on it for
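A minimal sketch of the pooling in Eq. (3)/(4) and of the "aggregation map" described above is given below, assuming channels-first PyTorch tensors (the paper writes shapes as $H' \times W' \times C'$). The function names and the (B, M, C', H', W') layout for the stacked local features are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def gap_vectors(F_local: torch.Tensor, F_global: torch.Tensor):
    """Eq. (3) and its global counterpart: global average pooling (GAP)."""
    # F_local: (B, M, C', H', W') refined local maps; F_global: (B, C', H', W').
    f_local = F_local.mean(dim=(-2, -1))    # (B, M, C') local vectors f_i^j
    f_global = F_global.mean(dim=(-2, -1))  # (B, C') holistic vector f_i^global
    return f_local, f_global


def aggregation_map(F_local: torch.Tensor) -> torch.Tensor:
    """Sum the C' channels of each F_i^j, then apply a spatial softmax."""
    A_hat = F_local.sum(dim=2)                        # (B, M, H', W')
    B, M, H, W = A_hat.shape
    A_hat = F.softmax(A_hat.view(B, M, -1), dim=-1)   # softmax over the H'*W' positions
    return A_hat.view(B, M, H, W)
```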