
Computer Science and Technology (Reference): Object Tracking Using Learned Feature Manifolds



Computer Vision and Image Understanding 118 (2014) 128–139. Journal homepage: www.elsevier.com/locate/cviu

Object tracking using learned feature manifolds

Yanwen Guo (a,*), Ye Chen (a), Feng Tang (b), Ang Li (c), Weitao Luo (a), Mingming Liu (a)

(a) National Key Lab for Novel Software Technology, Nanjing University, Nanjing 210023, PR China
(b) Hewlett-Packard Laboratories, Palo Alto, CA 94304, USA
(c) University of Maryland, College Park, MD 20740, USA

Article history: Received 10 August 2012; Accepted 30 September 2013; Available online 17 October 2013.
Keywords: Feature manifold; SIFT; Tracking.
* Corresponding author. E-mail address: ywguo@nju.edu.cn (Y. Guo).
This paper has been recommended for acceptance by J.-O. Eklundh.
doi: 10.1016/j.cviu.2013.09.007. © 2013 Elsevier Inc. All rights reserved.

Abstract

Local feature based object tracking approaches have been promising in solving tracking problems such as occlusions and illumination variations. However, existing approaches typically model feature variations using prototypes, and this discrete representation cannot capture the gradually changing property of local appearance. In this paper, we propose to model each local feature as a feature manifold to characterize the smooth changing behavior of the feature descriptor. The manifold is constructed from a series of transformed images simulating possible variations of the feature being tracked. We propose to build a collection of linear subspaces which approximate the original manifold as a low-dimensional representation. This representation is used for object tracking. The object is located by a feature-to-manifold matching process. Our tracking method can update the manifold status, add new feature manifolds, and remove expiring ones adaptively according to object appearance. We show both qualitatively and quantitatively that this representation significantly improves tracking performance under occlusions and appearance variations on standard tracking datasets.

1. Introduction

Object tracking is a central problem in computer vision with many applications, such as activity analysis, automated surveillance, traffic monitoring, and human-computer interaction. It is essentially the problem of finding the most likely estimate of the object state given a sequence of observations. Object tracking is challenging because of:

- Complex object appearance. The object may have complicated appearance which is hard to model. Furthermore, it may undergo significant changes due to pose and scale variations as well as non-rigid object motions.
- Occlusions. The object may be occluded by the background or other moving objects, making it difficult to localize.
- Complex object motion. This is caused either by the moving pattern of the object or by camera motion accompanied by object motion.

There are two key components in an object tracking algorithm: object representation and dynamics. Object representation tries to model the object as accurately as possible so that the tracking algorithm can describe the complex object appearance. Object dynamics model how the object appearance evolves over time in order to handle appearance variations. The two problems are usually coupled: the object representation should be designed to be easily updated to model appearance variations, while the object dynamics should be able to take advantage of the characteristics of the object representation for model update.

Traditional methods for representing the object, such as the global histogram based approach in mean-shift tracking [1] and the PCA subspace based approach in EigenTracking [2], are global approaches which describe the object to be tracked as a whole. Such methods work well in many practical applications, but have several intrinsic limitations. First, it is usually very difficult for a global representation to capture local details, and as a result it is unable to model complex appearances. Second, global representations are not robust to partial occlusion: once the object is occluded, the whole feature vector of the object representation is affected. Third, global representations are hard to update.

Recently, local representations have opened a promising direction to solve these problems by representing an object as a set of local parts or sparse local features. Part-based trackers generally describe the object with sets of connected local parts or components and their visual properties [3–6]. The parts used for object representation are updated during tracking by removing old parts that exhibit signs of drifting and adding new ones to easily accommodate appearance changes. Feature-based trackers often represent the target by a set of sparse local features such as SIFT [7] and affine-invariant point detectors [8], which are often invariant to changes in rotation, scale, illumination, and viewpoint.


These approaches first localize the features at a sparse set of distinctive image points using feature detectors. Then the feature vectors, usually called descriptors, are computed from the local image statistics centered at these locations. Two major advantages of sparse local features are their invariance to image changes and their robustness to occlusions. Existing local feature based approaches typically model how the local features vary using prototypes. However, this discrete representation cannot capture the gradually changing property of local appearance.

In this paper, we propose a local feature based manifold representation for object tracking. The object is represented by a set of sparse local feature manifolds. Each local feature manifold is computed from a series of SIFT feature descriptors [7] that correspond to different appearances of the same object feature under simulated variations of practical situations. To build it, we first detect a set of interest points on the object with state-of-the-art feature detectors. For each feature point, we transform the image region surrounding it to simulate real object changes. A feature manifold is then obtained by exploring the ensemble of descriptors extracted from the transformed image regions. Such a manifold is an informative yet robust representation in that it captures the local appearance variations of a part of the object over time, making the local representation more robust against object changes. The local feature variation is complicated and nonlinear in practice, as illustrated by the example in Fig. 1, which shows a feature on a walking man. As can be observed, the feature appearance changes dramatically during the movement. As a result, the feature manifold is a highly nonlinear appearance manifold. For computational efficiency, we apply incremental principal component analysis to it and obtain a collection of linear subspace approximations.

To model geometric relations among local features, the feature manifolds are organized as a feature manifold graph which is used to represent the target object to be tracked. Each local feature manifold describes object appearance details, and the relationships among them encode object structure. Such geometric relationships are elastic and have the flexibility to handle objects with coherent motion and a certain amount of variation caused by viewpoint changes and articulated motions. An advantage of the feature manifold graph is that, locally, it reinforces the power of feature description and characterizes variations of object appearance by learning a series of descriptors, while, globally, it encodes object structure with the geometric relations among those manifolds. Such characteristics make it suitable for many vision tasks.

We apply the feature manifold graph to object tracking as an application. With the feature manifold graph representation, the target object is tracked by graph-based feature-to-manifold matching. During tracking, features are extracted in a candidate region of the current frame and then matched with the manifolds. The object position is located by integrating all matches in the manifold graph. Since features may appear and disappear due to viewpoint changes and occlusions, our dynamic model is designed to add new feature manifolds and remove expiring ones adaptively. To the best of our knowledge, this is the first paper that applies manifold learning to local features for object tracking.

The rest of this paper is organized as follows. Section 2 describes related work on object tracking. We present our feature manifold model in Section 3. Section 4 describes our main tracking paradigm. Experiments and analysis are given in Section 5, and the paper is finally concluded.

2. Related work

Object tracking using local features has been explored by previous researchers. In [9], Shi and Tomasi proposed a method to select corner-based features that are most reliable for tracking. Collins et al. developed an algorithm for unsupervised learning of object models as constellations of features, and proposed a discriminative feature tracker [10]. A simultaneous modeling and tracking method is proposed in [11] to learn the object model during tracking. The object features are selected manually and tracked individually. The posterior distribution of appearance and shape is built up incrementally using an exemplar-based approach. In [12], the object is represented by view-dependent appearance models corresponding to different viewing angles. This collection of acquired models is indexed with respect to the view sphere, and the models are matched to each frame to estimate object motion. In [13], the authors proposed a "feature harvesting" approach that has a training phase to learn the object geometry and appearance using a randomized tree classifier. Online tracking then becomes a detection problem using the learned classifier. Liu et al. proposed to jointly track different types of features by representing the objects of interest with hybrid templates [14]. A generative model is developed to learn the template and to estimate object location and scale.

It is noted that state-of-the-art local features such as SIFT [7] and SURF [15] have recently been used for object tracking. In [16], an attributed relational feature graph, which represents the object using SIFT features with geometric relations, is proposed for object tracking. Zhou et al. presented a SIFT based mean shift tracking algorithm [17]. The similarity between two neighboring frames is measured in terms of color and SIFT correspondence using an expectation-maximization algorithm. In [18], He et al. proposed to represent the object by a set of SURF features of interest. Object motion is estimated in terms of maximum likelihood feature motion observations. In [19], Sun and Liu proposed an object tracking method based on the combination of local SIFT description and global PCA representation. The method is constructed in the framework of a particle filter. In fact, the change of feature appearance is smooth and highly nonlinear in nature, which is hard to model using discrete prototypes.

Fig. 1. Appearance variations of a feature during tracking. Feature patches from different frames are shown on the left and a feature manifold is visualized on the right.


In this paper, we propose to model the feature appearance variations as a feature manifold approximated by several linear subspaces. This significantly enhances the distinctiveness of the object representation.

Avidan's ensemble tracker [20] trains an ensemble of weak classifiers and combines them into a strong one using AdaBoost to distinguish between the object and background by pixel labeling. To adapt to object appearance changes and maintain temporal coherence, a new weak classifier is trained per frame and the strong classifier is updated dynamically. Appearance modeling is important for object tracking. In [21], the covariance matrices of image features in five modes are used to represent object appearance, and visual tracking is driven by Bayesian inference using a learned Log-Euclidean Riemannian subspace. In [22], Hu et al. presented an incremental learning algorithm which represents object appearance with a low-dimensional tensor subspace. During tracking, the learning algorithm is used to capture the appearance of the dynamic object. In [23], Ross et al. presented a tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in the appearance of the target. A sampling algorithm with likelihood estimation is used to estimate object locations during tracking. Originating from [24,25], TLD [26] is a real-time algorithm for tracking unknown objects in video streams. It simultaneously tracks the object, learns its appearance, and detects it whenever it appears in the video. In [27], a solution for particle filtering on general graphs is developed. As applications, an object tracking method based on high-order Markov chains and a distributed multiple-object tracking method based on multi-object graphical interaction models are developed.

Our work is also inspired by the face tracking methods using probabilistic appearance manifolds [28]. The task of tracking and recognition is formulated as a maximum a posteriori estimation problem under a manifold representation which models face appearance. Learning image manifolds from collections of local features has been addressed in [29]. A feature embedding representation that preserves local appearance similarity and the spatial structure of the features is learned by solving an eigenvalue problem. A difference between the above representations and our feature manifold is that they generally represent the whole image with a manifold, whereas we use a manifold to characterize the variations of each individual feature, which is more flexible and able to describe object details.

3. The feature manifold model

In traditional local feature based representations, each local feature is represented as a single feature vector describing local appearance. During tracking, it is matched to features extracted in the new frame. The assumption is that although object appearance may change due to viewpoint and scale changes, the invariance property of local feature descriptors is able to accommodate appearance variations. However, most existing local feature descriptors are only partially invariant to these transformations. For example, SIFT is only invariant to scale and brightness changes, and is to some extent robust to viewpoint changes. This is illustrated in the performance evaluation of local feature descriptors [30]: the matching performance of SIFT degrades significantly as viewpoint changes become larger.
To solve this problem, we propose to synthetically generate additional training samples simulating possible variations of the features under viewpoint and scale changes. This enriched dataset is used to learn the initial feature manifold. It should be noted that the idea of simulating images under different types of transformations has been used in [31] to learn a patch classifier.

To simulate possible viewpoint situations, transformations are applied to the original image. Each transformation is composed of three atomic transformations: scaling, rotation, and shearing. The mixture of these transformations can approximate a wide range of viewpoint changes. To produce a simulated image I', a combination of the three transformations is applied to the original image I_0,

I' = T_{sh} \cdot T_r \cdot T_{sc} \cdot I_0,    (1)

where T_{sh}, T_r, and T_{sc} denote the shearing, rotation, and scaling transformations, respectively.

We express them as homogeneous matrices. Specifically, we represent the scaling as

T_{sc} = \begin{pmatrix} a & 0 & 0 \\ 0 & a & 0 \\ 0 & 0 & 1 \end{pmatrix},

and set three levels for a: {0.5, 1, 2}. The rotation is expressed as

T_r = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix},

and we rotate the image around its center through a full revolution at 30° intervals. We denote the shearing by

T_{sh} = \begin{pmatrix} 1 & b & 0 \\ d & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

and set three levels {−1, 0, 1} for both b and d.

In our implementation, we use the OpenCV function cvWarpPerspective() to generate the transformed images. This function applies a perspective transformation to an image, transforming the source image with the specified matrix.

Note that the object of interest is assumed to be nearly planar or far away from the camera, so that the above combined transformations can be applied to it to simulate the possible variations of practical situations. We do not intend to handle large out-of-plane rotations, which may result in self-occlusions of the object. Our experiments show that this scheme works well on a broad range of video sequences.

Without loss of generality, we assume that the image is placed with its center at the world origin. The combined transformations are applied to every point in the image, yielding a series of transformed images. We then detect the feature points in the original image and use the corresponding transformation to find their locations in each transformed image. Specifically, we use the SIFT detector, the Harris-Affine detector, and the Hessian-Affine detector to detect feature points, and describe the local appearance of each point with a SIFT descriptor. For each point detected by the SIFT, Harris-Affine, or Hessian-Affine detector, the scale used in computing the SIFT descriptor is determined by the corresponding detector. For each SIFT feature descriptor f_i in the original image, its corresponding features are different versions of the original one under viewpoint variations. The manifold M_i is learned from {f_i, f_i1, f_i2, ...}, in which f_i1, f_i2, ... are simulated feature descriptors of f_i under different viewpoints.
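To make the simulation step concrete, the following is a minimal Python sketch using OpenCV's Python bindings (cv2) with SIFT; the paper's implementation used the C function cvWarpPerspective, and the helper names below are ours. The parameter grid follows Eq. (1): 3 scales × 12 rotations × 9 shear settings, i.e. the 324 transformation matrices mentioned in Section 5.1.

```python
import itertools
import numpy as np
import cv2

def simulation_homographies():
    # Parameter grid of Eq. (1): 3 scales x 12 rotations x 9 shears = 324 matrices.
    Hs = []
    for a in (0.5, 1.0, 2.0):
        for theta in np.deg2rad(np.arange(0, 360, 30)):
            for b, d in itertools.product((-1, 0, 1), repeat=2):
                T_sc = np.diag([a, a, 1.0])
                T_r = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                                [np.sin(theta),  np.cos(theta), 0.0],
                                [0.0, 0.0, 1.0]])
                T_sh = np.array([[1.0, b, 0.0],
                                 [d, 1.0, 0.0],
                                 [0.0, 0.0, 1.0]])
                Hs.append(T_sh @ T_r @ T_sc)
    return Hs

def simulated_descriptor_sets(image, keypoints):
    """Collect {f_i, f_i1, f_i2, ...} for every detected keypoint by warping the
    image with each simulated transformation and recomputing SIFT descriptors
    at the transformed keypoint locations."""
    sift = cv2.SIFT_create()
    h, w = image.shape[:2]
    # Transform about the image centre, as assumed in the text.
    C = np.array([[1.0, 0.0, -w / 2.0], [0.0, 1.0, -h / 2.0], [0.0, 0.0, 1.0]])
    C_inv = np.linalg.inv(C)
    samples = {i: [] for i in range(len(keypoints))}
    for H in simulation_homographies():
        M = C_inv @ H @ C
        warped = cv2.warpPerspective(image, M, (w, h))
        pts = np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)
        mapped = cv2.perspectiveTransform(pts, M).reshape(-1, 2)
        # class_id carries the original feature index, so descriptors dropped
        # near image borders do not shift the bookkeeping.
        moved = [cv2.KeyPoint(float(x), float(y), kp.size, -1.0, 0.0, 0, i)
                 for i, ((x, y), kp) in enumerate(zip(mapped, keypoints))]
        moved, descs = sift.compute(warped, moved)
        if descs is None:
            continue
        for kp, desc in zip(moved, descs):
            samples[kp.class_id].append(desc)
    return samples
```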
In general, the appearance manifold of a local feature is highly nonlinear, and many samples are needed to learn an accurate representation. Although learning such a manifold globally would be an ideal solution, it is mostly infeasible in practice, especially for tracking, where it is hard to capture enough training samples. We assume that the local linearity property holds everywhere on a globally nonlinear manifold. Therefore, a local feature manifold can be approximated by a collection of linear subspaces [32]. Note that the similar idea of modeling image variations due to changing illumination by low-dimensional linear subspaces has been exploited in [33]. There are many methods for manifold learning, such as LLE [32] and ISOMAP [34], but most of the original algorithms are unsuitable for tracking, which requires incremental learning for updating. Although incremental versions of such algorithms have been developed in the literature [35,36], we choose to approximate the nonlinear feature manifold with several PCA subspaces for computational efficiency.

Given a set of feature descriptors obtained through the above simulation process, we first use K-means clustering to cluster them into K clusters. For each cluster we fit a PCA subspace for dimension reduction. We denote the subspaces as M_i = {C_i1, C_i2, ..., C_im} to approximate the feature manifold. Here C_ij is the jth linear subspace of the feature manifold M_i.
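As a rough illustration of this clustering-plus-local-PCA approximation, the sketch below uses scikit-learn; the number of clusters K and the subspace dimension are not specified in the text, so the defaults here are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def build_feature_manifold(descriptors, n_clusters=4, n_components=8):
    """Approximate one feature manifold M_i by a set of local PCA subspaces C_ij.

    descriptors: (num_samples, 128) array of simulated SIFT descriptors
    {f_i, f_i1, f_i2, ...} for a single feature.
    Returns a list of (mean mu_ij, basis U_ij) pairs, one per subspace.
    """
    X = np.asarray(descriptors, dtype=np.float64)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    subspaces = []
    for k in range(n_clusters):
        cluster = X[labels == k]
        if len(cluster) < 2:          # skip degenerate clusters
            continue
        dim = min(n_components, len(cluster) - 1)
        pca = PCA(n_components=dim).fit(cluster)
        subspaces.append((pca.mean_, pca.components_))   # (mu_ij, U_ij)
    return subspaces
```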


4. Object tracking

Our tracking module includes two stages. The first stage is training, in which we generate the initial feature manifold set using features extracted from the first frame of a video. The other is tracking, which locates the object in each new frame and updates the object feature model dynamically.

In the training stage, SIFT descriptors are extracted for each detected feature point. We note that for some objects the SIFT detector cannot find enough features for tracking, so we employ several complementary feature detectors, including the Harris-Affine and Hessian-Affine detectors. The set of features is represented as F = {f_1, f_2, ..., f_n}. Then the approach described in the previous section is used to generate samples for feature manifold learning. We use affine transforms to simulate those transformations. From the transformed images, we extract features at the corresponding positions for every original feature f_i in F. After all the transformed images are processed, we obtain a feature manifold for every feature. We further approximate each manifold M_i with a set of PCA linear subspaces {C_i1, C_i2, ..., C_im} using the aforementioned approach.

In the tracking stage, we maintain two sets. One is the feature manifold set MS, generated in the training stage and updated according to the current object state. The other is a set of features, denoted by CF = {cf_1, cf_2, ..., cf_k}, which are candidates for creating new manifolds. When processing a new frame, we do not extract SIFT features directly on the whole frame. Instead, we choose a region centered at the predicted location of the object, with twice the object size at the previous frame. The features extracted in this enlarged region form a feature set F' = {f'_1, f'_2, ..., f'_L}. For every feature manifold M_i and every cf_i in CF, we try to find the matched feature in F' within a search window centered at the feature location of M_i or cf_i, respectively. This window filters out unrelated features in F': the locations of M_i and cf_i will not change dramatically between two consecutive frames, so restricting the search saves considerable computation. The matching probability for f'_l given M_i is

p(f'_l | M_i) \propto \exp\!\left(-\frac{1}{\sigma^2} d^2(M_i, f'_l)\right),    (2)

where l denotes the feature index in F' extracted in the current frame. For a given feature manifold, we thus have

l^* = \arg\max_l \, p(f'_l | M_i).    (3)

The feature indexed by l^* is the matched feature of M_i. Similarly, the matching judgment for cf_i is

p(f'_l | cf_i) \propto \exp\!\left(-\frac{1}{\sigma^2} d^2(cf_i, f'_l)\right),    (4)

and

l^* = \arg\max_l \, p(f'_l | cf_i).    (5)

We calculate the L2 distance between cf_i's descriptor vector and f'_l's descriptor as the value of d(cf_i, f'_l). Computing d(M_i, f'_l) is, however, more difficult, as M_i is a manifold. Since M_i is approximated with a set of linear PCA subspaces, we calculate the distance between the feature descriptor and each subspace, and take the minimal distance as d(M_i, f'_l),

d(M_i, f'_l) = \min_j \, d(C_{ij}, f'_l).    (6)

To compute d(C_{ij}, f'_l), we denote the eigenvectors of a PCA subspace C_ij by U_ij and its mean by \mu_{ij}. We then project f'_l onto C_ij,

\mathrm{proj} f'_l = U_{ij}^{T} (f'_l - \mu_{ij}).    (7)

The reconstruction of the feature f'_l on the subspace is therefore

\tilde{f}'_l = U_{ij} \cdot \mathrm{proj} f'_l + \mu_{ij}.    (8)

The distance d(C_{ij}, f'_l) between the feature and the subspace is finally formulated as

d(C_{ij}, f'_l) = \sqrt{\sum_{k=1}^{K_f} \left(f'_{lk} - \tilde{f}'_{lk}\right)^2},    (9)

with K_f the feature dimension; K_f is set to 128 for SIFT.
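A compact sketch of the feature-to-manifold distance of Eqs. (6)–(9) and the matching rule of Eqs. (2) and (3), reusing the (mean, basis) pairs produced by the manifold-building sketch above; the bandwidth sigma is not given in the text, so the value here is a placeholder (it does not affect the argmax).

```python
import numpy as np

def subspace_distance(mu, U, f):
    """d(C_ij, f'_l): norm of the PCA reconstruction error, Eqs. (7)-(9)."""
    proj = U @ (f - mu)                # Eq. (7): project onto the subspace
    recon = U.T @ proj + mu            # Eq. (8): reconstruct in descriptor space
    return np.sqrt(np.sum((f - recon) ** 2))   # Eq. (9)

def manifold_distance(subspaces, f):
    """d(M_i, f'_l) = min_j d(C_ij, f'_l), Eq. (6)."""
    return min(subspace_distance(mu, U, f) for mu, U in subspaces)

def match_feature(subspaces, candidates, sigma=0.3):
    """Return the index l* of the candidate descriptor that best matches M_i,
    together with its (unnormalized) matching score, Eqs. (2) and (3)."""
    scores = [np.exp(-manifold_distance(subspaces, f) ** 2 / sigma ** 2)
              for f in candidates]
    best = int(np.argmax(scores))
    return best, scores[best]
```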
Based on the feature-to-manifold matching discussed above, we can further enhance the matching performance by leveraging the geometric structure of the local feature manifolds. The idea is that the matching of individual features can be made more reliable by considering neighboring features. We organize the feature manifolds as an attributed manifold graph similar to [16]. Feature matching then becomes a graph matching problem, which can be solved using relaxation labeling; within this process, the Hungarian algorithm is used to enforce one-to-one matching.

After feature-to-manifold matching, we run a manifold update to incorporate the new data into the model. The manifold update includes two parts: one is manifold self-update, which handles feature appearance variations; the other is adding newly appeared manifolds and deleting expiring ones, which handles pose changes and occlusions.

- Manifold self-update. After obtaining the matched feature for M_i in the current frame, we calculate the error between this matched feature and its reconstruction using the original eigenvectors of the matched subspace of M_i. If the error exceeds a threshold, an incremental PCA (IPCA) procedure is used to update the subspace. There are many IPCA algorithms, for example the algorithm introduced by Skocaj and Leonardis [37], which requires the computation of a covariance matrix, and the covariance-free IPCA (CCIPCA) proposed by Weng et al. [38]. We implemented both algorithms; the results show that IPCA is more accurate, while CCIPCA requires less computation. As the precision of CCIPCA is only slightly lower than that of IPCA, we utilize CCIPCA for tracking. For more details about CCIPCA, please refer to [38].
- Adding newly appeared manifolds and deleting expiring ones. Within a time window, such as 5 frames, if the frequency with which a manifold M_i obtains matched features is lower than a threshold, it is deleted from MS. If the frequency with which a candidate cf_i obtains matched features is higher than a threshold, we generate its manifold representation, add it into MS, and delete this cf_i from CF.

4.1. Object localization

Once the features have been matched to all feature manifolds representing the object, the correspondences are used to estimate object motion between two successive frames. We compute a homography from the correspondences between the features in the current frame and the manifolds on the object in the previous frame. RANSAC is used to obtain the homography matrix. We then transform the box indicating the object in the previous frame using this matrix, obtaining a parallelogram. The bounding box of this parallelogram is taken as the box indicating the target object in the current frame.
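A hedged sketch of this localization step using OpenCV's findHomography under RANSAC; the box representation, reprojection threshold, and function names are our own choices and are not specified in the text.

```python
import numpy as np
import cv2

def localize(prev_box, manifold_pts, matched_pts):
    """Estimate the object box in the current frame from feature-to-manifold
    correspondences. prev_box = (x, y, w, h) in the previous frame;
    manifold_pts / matched_pts are corresponding (N, 2) point arrays (N >= 4)."""
    H, _ = cv2.findHomography(np.float32(manifold_pts),
                              np.float32(matched_pts), cv2.RANSAC, 3.0)
    x, y, w, h = prev_box
    corners = np.float32([[x, y], [x + w, y], [x + w, y + h], [x, y + h]])
    warped = cv2.perspectiveTransform(corners.reshape(-1, 1, 2), H).reshape(-1, 2)
    # The warped box is a quadrilateral; take its axis-aligned bounding box.
    x0, y0 = warped.min(axis=0)
    x1, y1 = warped.max(axis=0)
    return (float(x0), float(y0), float(x1 - x0), float(y1 - y0))
```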


The above tracking process is summarized in Algorithm 1.

Algorithm 1. Feature manifold tracking

Input: the video sequence and an initial object box.
Output: tracked object positions.

Training (for the first frame):
1. Extract the local features of the initial object to form a feature set F = {f_1, f_2, ..., f_n}.
2. Transform the first frame (more precisely, the object region), and generate a feature manifold M_i for each feature f_i.
3. Represent each M_i by the PCA linear subspaces {C_i1, C_i2, ..., C_im}.
4. Build the feature manifold graph.

Tracking (for each successive frame):
1. Extract local features F' = {f'_1, f'_2, ..., f'_L} in a candidate region centered at the object.
2. Perform feature-to-manifold matching; find the matched feature f'_{l*} for each manifold M_i.
3. Estimate object motion and transform the object box.
4. Update the manifold subspaces using IPCA, add new manifolds, and remove expiring ones.

5. Experiments and discussion

5.1. Matching performance

We first compare our feature-to-manifold matching with feature-to-feature matching. A standard dataset from the VGG datasets is used for comparison.

The VGG datasets contain six groups of images. Every group includes six images simulating variations under different imaging conditions, including viewpoint changes, scale changes, illumination variations, and so on. For viewpoint changes, the ground-truth homography is provided, so we choose this group as the dataset for comparison. We take the Graf dataset as an example. In this dataset, we select the reference image as our original image and carry out the transformations to simulate possible viewpoints of practical situations; the number of transformation matrices is 324. We then choose an image from the other five images as the query. The original and query images are shown in Fig. 2. The query image has 2806 features. We use the ground-truth homography to evaluate our matching performance. The comparison result is shown in Fig. 2, right. From the precision and recall curves, it is clear that our feature-to-manifold matching performs better than traditional feature-to-feature matching.

5.2. Tracking results

5.2.1. Parameter setting

Several parameters are used in our tracker. For each feature manifold in the current frame, we use a fixed window around its position in the next frame as the search window for matching. In the experiments, we set this window to 30 × 30 for most videos; the search window for the sequence named Pets09_2 is 60 × 60 to accommodate fast object movement. The distance threshold for determining whether a feature is matched with a manifold is 0.8, to admit more feature-to-manifold correspondences. Over a time window in the following frames, if the number of matched features for a manifold in the current frame is less than a threshold τ, the manifold is removed. For a newly detected feature in the current frame, if the number of matched features in the time window is greater than or equal to τ, we generate its manifold. We set the time window to 5 frames and τ to 4 in our implementation. We would like to emphasize that all the parameters, except for the window size set to 60 × 60 in the Pets09_2 sequence, were kept constant in our experiments.
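The add/remove rule above amounts to simple counter bookkeeping per manifold and per candidate feature. A small sketch under the stated settings (5-frame window, τ = 4), with data structures of our own choosing:

```python
from collections import deque

WINDOW, TAU = 5, 4   # time window (frames) and match-count threshold from Section 5.2.1

class MatchHistory:
    """Per-manifold (or per-candidate-feature) record of recent match successes."""
    def __init__(self):
        self.hits = deque(maxlen=WINDOW)

    def record(self, matched):
        self.hits.append(1 if matched else 0)

    def count(self):
        return sum(self.hits)

def update_model(manifolds, candidates):
    """Drop manifolds matched fewer than TAU times over a full window; promote
    candidate features matched at least TAU times (the caller then builds their
    manifold representations as in Section 3)."""
    stale = [k for k, h in manifolds.items()
             if len(h.hits) == WINDOW and h.count() < TAU]
    for key in stale:
        del manifolds[key]
    promoted = [k for k, h in candidates.items() if h.count() >= TAU]
    for key in promoted:
        del candidates[key]
    return promoted
```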
5.2. Tracking results

5.2.1. Parameter setting

Several parameters are used in our tracker. For each feature manifold in the current frame, we use a fixed window around its position in the next frame as the search window for matching. In the experiments, we set this window to 30 × 30 for most videos; for the Pets09_2 sequence the search window is enlarged to 60 × 60 to accommodate fast object movement. The distance threshold for determining whether or not a feature is matched with a manifold is set to 0.8, which admits more feature-to-manifold correspondences. Within a time window over the following frames, if the number of matched features for a manifold of the current frame is less than a threshold s, this manifold will be removed. For a newly detected feature in the current frame, if the number of matched features in the time window is greater than or equal to s, we generate its manifold. We set the time window to 5 frames and s to 4 in our implementation.
We would like to emphasize that all the parameters, except for the window size set to 60 × 60 in the Pets09_2 sequence, were kept constant throughout our experiments.
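The manifold admission and removal rule above amounts to simple per-feature bookkeeping over a sliding window. The sketch below is one possible reading of that rule, assuming a manifold is removed when it is matched in fewer than s frames of a full 5-frame window and a candidate feature is promoted to a manifold once it has been matched in at least s frames; the class and method names are hypothetical and not taken from the paper.

```python
from collections import defaultdict, deque

WINDOW = 5  # time window in frames (Section 5.2.1)
S = 4       # minimum number of matched frames within the window

class ManifoldBookkeeper:
    """Sliding-window match counts for manifolds and candidate features."""

    def __init__(self):
        # One fixed-length history of 0/1 match flags per id.
        self.manifold_hits = defaultdict(lambda: deque(maxlen=WINDOW))
        self.candidate_hits = defaultdict(lambda: deque(maxlen=WINDOW))

    def update(self, all_manifolds, matched_manifolds,
               all_candidates, matched_candidates):
        """Record the current frame's matches; return (expired, promoted) ids."""
        for mid in all_manifolds:
            self.manifold_hits[mid].append(1 if mid in matched_manifolds else 0)
        for cid in all_candidates:
            self.candidate_hits[cid].append(1 if cid in matched_candidates else 0)

        # Remove manifolds matched too rarely over a complete window.
        expired = [mid for mid, hits in self.manifold_hits.items()
                   if len(hits) == WINDOW and sum(hits) < S]
        # Promote candidate features matched often enough into new manifolds.
        promoted = [cid for cid, hits in self.candidate_hits.items()
                    if sum(hits) >= S]
        for mid in expired:
            del self.manifold_hits[mid]
        for cid in promoted:
            del self.candidate_hits[cid]
        return expired, promoted
```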

5.2.2. Qualitative results

We apply our tracker to a set of very diverse and challenging video sequences that contain background clutter, camera motion, rapid changes of object appearance, illumination changes, in-plane/out-of-plane pose changes, scale changes of the targets, and so on. Some of these videos are widely employed by previous tracking approaches for evaluating the performance of their trackers. The resolution and number of frames of each video are listed in Table 1. In the tracking results, the tracked objects are marked with red boxes. All the results are shown on our project webpage (http://cs.nju.edu.cn/ywguo/tracking2013/result.html). We visually inspect the tracking results and classify them as "keep track" or "lose track", as shown in the last column of Table 1. In the following, we show the tracking results on some intermediate frames of representative video sequences.

Table 1
Video sequences used for experiments. The last column shows whether or not our tracker successfully tracks each sequence by visually inspecting the results, without resorting to an accurate measure of the success rate.

Sequences            Resolution   Number of frames   Keep track
Bagwoman             720 × 576    96                 YES
Courtyard            720 × 480    143                YES
Crosswalk_shadow     360 × 240    117                YES
David                320 × 240    537                YES
David_indoor         320 × 240    761                YES
David_outdoor        640 × 480    252                NO
Face                 640 × 480    383                YES
Faceocc2             320 × 240    819                YES
Gym                  426 × 234    767                YES
Jumping              352 × 288    313                NO
Oldman               720 × 576    112                YES
PETS09_2             768 × 576    66                 YES
Security officer     720 × 576    200                YES
Sylvester            320 × 240    1344               YES
Uniformwoman         720 × 576    286                YES
Wallcott             640 × 352    253                YES

Fig. 3 shows the results of tracking a security officer under severe occlusions. The video clips are from PETS 2007 (Performance Evaluation of Tracking and Surveillance, http://www.cvg.rdg.ac.uk/slides/pets.html). Each frame is of size 720 × 576, and the initialized object region is 24 × 76. The security officer was walking from right to left, and the crowd occasionally occluded him heavily, as shown in the selected frames. This video clip includes 200 frames, and our tracker successfully keeps track from the 1st frame to the 192nd frame. The results show that our tracker is robust to heavy occlusions. We further check the number of active feature manifolds on the object used for tracking in each frame. The feature manifolds are generated in the first frame of each video and updated dynamically, adding new feature manifolds and removing expiring ones to adapt to changes of object appearance. We visualize the variation of the number of stable feature manifolds on the object with respect to the frame index, as shown at the bottom of Fig. 3. Each point of the blue curve represents the number of manifolds on the object in the current frame that match features in the predicted location in the next frame. We can see that the number of manifolds remains relatively stable. Even though local fluctuations around some frames exist, the tracker always keeps a certain number of stable manifolds, e.g. around 50, as is the case for the other video sequences on which our tracker works well.
The Security officer sequence is captured in an airport departure lounge. The Bagwoman, Oldman, and Uniform woman sequences are captured in departure lounges as well. In the Uniform woman sequence, the uniform woman is occasionally occluded heavily by other people. Besides, the black uniform she wears is similar to the dark background in some frames, making trackers that rely heavily on color more vulnerable to drifting. The problems of color ambiguity and background clutter are also apparent in the Bagwoman and Oldman sequences. Visual inspection of the tracking results shows that our tracker works well on these sequences.
Fig. 4 shows a sequence from a video called Courtyard. This video shows a man walking in a yard under significant pose changes, captured with a moving camera. Each frame is of size 720 × 480. The initialized region of the target man in the first frame is of size 52 × 124. Despite the moving camera, our tracker keeps track of the moving target. The average number of stable manifolds for this sequence is around 150.
Fig. 5 shows the facial tracking result of our tracker on the David_indoor video sequence. David's face undergoes significant illumination changes and partial occlusion. Furthermore, David changes his posture by moving forward, shaking his head, and turning around. Since SIFT is robust to illumination changes, our tracker keeps track when the face area shifts from dark to bright around frames 200 to 300. As shown at the bottom of Fig. 5, although the curve of the manifold number fluctuates frequently, the number of stable manifolds ensures that our tracker keeps track for this sequence, as is the case for the Sylvester sequence. The result on this sequence also demonstrates that our tracker works well for facial tracking. Fig. 6 shows another facial tracking result of our tracker. This video is from the Honda/UCSD Video Database (http://vision.ucsd.edu/leekc/HondaUCSDVideoDatabase/HondaUCSD.html), and the resolution is 640 × 480.
In this video, the man changes his posture by moving forward and backward and shaking his head, while varying his facial expression at the same time. Our algorithm successfully keeps track throughout the whole sequence. Other examples of tracking faces are shown in the results on the David and Faceocc2 sequences, the latter of which undergoes occasional partial facial occlusions. The David_indoor video sequence is of size 320 × 240 and contains 761 frames. Besides this video, our tracker successfully tracks the target objects in long video sequences such as the Sylvester sequence with 1344 frames, the Faceocc2 sequence with 819 frames, the David sequence with 537 frames, and the Gym sequence with 767 frames.
Our feature manifold model assumes that the object undergoes in-plane pose changes, in that we use concatenations of three atomic transformations applied to each feature to simulate object transformations. However, our tracker can accommodate out-of-plane pose changes of the target objects, for instance in the Sylvester, David_indoor, Gym, and Wallcott video sequences. Our tracker successfully keeps track for each of these videos throughout the whole sequence. Fig. 7 shows the tracking results on ten intermediate frames of the Sylvester sequence. The target object undergoes fast and large out-of-plane pose changes throughout the sequence. We carefully examine each of these videos to identify the frames with out-of-plane pose changes. The reason our tracker can handle out-of-plane pose changes is that the number of feature manifolds on the target object remains relatively stable over time in these frames, although small fluctuations occur frequently. Our online updating scheme can adapt to the appearance changes caused by pose variations.

5.2.2.1. Failure cases. Our tracker is sensitive to motion blur and strong noise, since local features like SIFT cannot be found in heavily blurred frames, and local feature descriptors in noisy frames are often unreliable. As shown in Fig. 8, our tracker fails to locate the target from the 16th frame in the Jumping sequence. Significant blur in the face area of most frames due to fast motion significantly degrades the performance of our tracker. In this sequence, the number of stable manifolds on the object is clearly smaller than in the other video sequences on which our tracker keeps track. Furthermore, the curve visualizing the variation of the manifold number shows an obvious drop at frame 16, where the track is lost; note, however, that a drop in the manifold number does not necessarily imply losing track.
Our tracker keeps track of David from the 1st frame to the 124th frame, but loses track at frame 129 in the David_outdoor sequence. The reason is that the object box shrinks heavily in frame 125 when David turns around to face the camera. We have carefully checked the manifold graph representing the target in that frame. The repetitive grid pattern of the plaid shirt David wears makes matching between the manifolds and features error-prone.


The errors accumulate in the successive frames and lead to loss of track. Although the tracker recovers at around frame 189, it finally loses track again at frame 237. As shown by the red rectangles at the bottom of Fig. 9, we also see obvious drops in the manifold number near the frames where the track is lost. We infer that the fluctuations of the manifold number degrade the accuracy of the manifold-based object representation, leading to tracking failure.
Besides, our tracker loses track in the last few frames of the Security officer sequence, when the officer went upstairs in the 194th frame. In this frame, his body is completely occluded by the advertising board. Our tracker cannot recover from this situation since he soon disappeared from view.

Fig. 3. The Security officer sequence. Up: the frames 1, 50, 74, 147, and 192 are shown. The red boxes indicate object locations tracked by our approach. Bottom: each point of the curve represents the number of manifolds on the object of the current frame that match features in the predicted location in its successive frame. We mark the frames of the top row at the positions where they are taken in the bottom plot with red spots. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. The Courtyard sequence. The frames 1, 38, 74, 110, and 142 are shown.

Fig. 5. The David_indoor sequence. Up: the frames 1, 181, 336, 457, and 695 are shown. The red boxes are the tracked facial regions. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 6. The Face sequence. The frames 1, 33, 54, 89, 110, 149, 173, 222, 275, and 323 are shown.

Fig. 7. The Sylvester sequence. Up: the frames 1, 273, 450, 526, 615, 681, 777, 917, 1031, and 1148 are shown.

Fig. 8. Our tracker fails to track the face in the Jumping sequence from the 16th frame due to the motion blur caused by fast motion. Up: from left to right, the frames 1, 9, 16, 162, and 312 are shown.

5.2.3. Quantitative comparisons


To quantitatively evaluate our feature manifold model-based tracker, we compare our results with those produced by six other tracking algorithms: the meanshift tracker [1], the incremental visual tracker (IVT) [23], the tracking-learning-detection (TLD) method [26], the Hough-based tracker [5], the L1-APG tracker [39], and the superpixel tracker [40], which represent state-of-the-art tracking algorithms. For a fair evaluation, we use the source code provided on the authors' websites for these trackers. The parameters of each tracker are carefully selected for best performance; for instance, for IVT the ten groups of parameters suggested by the authors are tried. For all the trackers, the objects are initialized at the same positions in the first frames. Most of the above approaches use a bounding box on each of the successive frames to represent the target location, while the Hough-based tracker [5] delivers a non-rigid contour representation by roughly segmenting the object from the background.
We use two criteria, tracking success rate and location error with respect to the object center, for quantitative evaluation. To compute the success rate, we evaluate whether each tracking result is a success or not by the following criterion. Given the tracked bounding box or non-rigid object representation R_T and the ground truth box R_G, the score is defined as (R_T ∩ R_G)/(R_T ∪ R_G), i.e., the area of the intersection divided by the area of the union. We consider the tracking result of one frame a success when this score is above 0.25, and we report the percentage of successfully tracked frames for every tracker on each video sequence. We define the location error as the distance between the object center and the ground truth center.
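For reference, both criteria can be computed in a few lines: the overlap score of two regions, the per-frame success test against the 0.25 threshold, and the center location error. The sketch below assumes axis-aligned (x, y, w, h) boxes and is only an illustration of the stated criteria, not the authors' evaluation code; for the Hough-based tracker's non-rigid regions the same ratio would be computed on segmentation masks instead.

```python
def overlap_score(box_t, box_g):
    """Area of intersection over area of union for two (x, y, w, h) boxes."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0

def center_error(box_t, box_g):
    """Euclidean distance between box centers, in pixels."""
    cxt, cyt = box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0
    cxg, cyg = box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0
    return ((cxt - cxg) ** 2 + (cyt - cyg) ** 2) ** 0.5

def success_rate(tracked_boxes, ground_truth_boxes, threshold=0.25):
    """Percentage of frames whose overlap score exceeds the threshold."""
    hits = sum(1 for bt, bg in zip(tracked_boxes, ground_truth_boxes)
               if overlap_score(bt, bg) > threshold)
    return 100.0 * hits / max(len(tracked_boxes), 1)
```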
The statistical data for each tracker on these videos are given in Tables 2 and 3. Fig. 10 shows the curves of center errors for all the testing video sequences; the horizontal axis represents the frame number and the vertical axis denotes the center error in pixels. The inset image in each sub-figure is the first frame of each sequence with the red box indicating the initial object position. Note that, for each sequence, a tracker that fails from the beginning is excluded here. From Table 2, we see that our feature manifold tracker achieves the highest success rate on ten sequences and is close to the top on five of the remaining sequences, the exception being the Jumping sequence. From Table 3, our tracker achieves the lowest average error on eight of the sixteen sequences. It should be noted that, for each tracker, the average error on each sequence is computed by considering only the tracking errors on those frames that are successfully tracked; tracking errors obtained after tracking failure are excluded. Looking at the success rate and the average center error, we also see that the David_outdoor sequence is the most difficult video to track, not just for our tracker but also for many other trackers.
The tracking result on the Wallcott sequence shows that our tracker can successfully track a target with highly articulated human motion and rapid appearance changes. The football player in this sequence changes his pose dramatically and rapidly with the ball. Moreover, the defenders in red shirts occlude him occasionally. These factors result in the failure of many trackers, including IVT and TLD. We have carefully checked the number of manifolds in each video frame during tracking, and found that the number of manifolds that match features on the football player remains stable throughout the sequence. As shown in Tables 2 and 3, our tracker

Table 2
Success rates (%). "–" means that the tracker fails from the beginning. For each sequence, the highest success rate is shown in bold font.

Sequences            Meanshift   IVT   TLD   Hough   L1-APG   Superpixel   Feature manifold
Bagwoman             6           99    9     10      5        52           100
Courtyard            40          99    100   –       100      29           100
Crosswalk_shadow     100         25    56    100     51       48           100
David                100         100   100   100     88       31           99
David_indoor         13          62    100   96      60       –            98
David_outdoor        91          69    28    76      16       98           81
Face                 99          100   100   95      99       99           100
Faceocc2             80          100   99    100     50       9            100
Gym                  31          54    95    26      33       44           87
Jumping              42          38    98    64      22       6            7
Oldman               34          90    71    49      59       98           100
PETS09_2             8           14    61    8       14       15           97
Security officer     12          26    25    87      22       45           96
Sylvester            91          46    94    100     45       41           100
Uniformwoman         92          67    30    15      67       99           98
Wallcott             100         17    30    83      99       100          100

Fig. 9. Our tracker fails to track David from frame 129 in the David_outdoor sequence. The frames 123, 125, 129, 151, 189, and 237 are shown.


Fig. 10. Center error plots for all the videos. For each sub-figure, the horizontal axis represents the frame number and the vertical axis denotes the center error in pixels. The inset image in each sub-figure is the first frame of each sequence with the red box indicating the initial object position. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 3
Average center location errors (in pixels). "–" means that the tracker fails from the beginning. For each tracker, the average error on each sequence is computed by considering only the tracking errors on those frames that are successfully tracked. For each sequence, the lowest average center location error is shown in bold font.

Sequences            Meanshift   IVT   TLD   Hough   L1-APG   Superpixel   Feature manifold
Bagwoman             56          6     50    45      47       18           6
Courtyard            33          7     7     –       4        19           5
Crosswalk_shadow     9           45    27    7       31       7            3
David                5           4     5     4       6        39           5
David_indoor         50          11    11    16      23       –            15
David_outdoor        24          11    27    28      29       10           20
Face                 35          14    16    25      32       35           14
Faceocc2             30          10    18    16      22       47           7
Gym                  36          22    21    33      34       14           11
Jumping              30          34    4     15      39       56           56
Oldman               27          9     25    36      20       9            7
PETS09_2             41          31    5     38      42       37           6
Security officer     35          33    29    12      29       11           8
Sylvester            13          18    12    4       29       22           7
Uniformwoman         17          13    28    60      12       10           8
Wallcott             6           56    3     10      4        4            4
