Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance

AHMED ELGAMMAL, RAMANI DURAISWAMI, MEMBER, IEEE, DAVID HARWOOD, AND LARRY S. DAVIS, FELLOW, IEEE

Invited Paper

Automatic understanding of events happening at a site is the ultimate goal for many visual surveillance systems. Higher level understanding of events requires that certain lower level computer vision tasks be performed. These may include detection of unusual motion, tracking targets, labeling body parts, and understanding the interactions between people. To achieve many of these tasks, it is necessary to build representations of the appearance of objects in the scene. This paper focuses on two issues related to this problem. First, we construct a statistical representation of the scene background that supports sensitive detection of moving objects in the scene, but is robust to clutter arising out of natural scene variations. Second, we build statistical representations of the foreground regions (moving objects) that support their tracking and support occlusion reasoning. The probability density functions (pdfs) associated with the background and foreground are likely to vary from image to image and will not in general have a known parametric form. We accordingly utilize general nonparametric kernel density estimation techniques for building these statistical representations of the background and the foreground. These techniques estimate the pdf directly from the data without any assumptions about the underlying distributions. Example results from applications are presented.

Keywords: Background subtraction, color modeling, kernel density estimation, occlusion modeling, tracking, visual surveillance.

Manuscript received May 31, 2001; revised February 15, 2002. This work was supported in part by the ARDA Video Analysis and Content Exploitation project under Contract MDA 90 400C2110 and in part by Philips Research.

A. Elgammal is with the Computer Vision Laboratory, University of Maryland Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland, College Park, MD 20742 USA (e-mail: elgammal@cs.umd.edu).

R. Duraiswami, D. Harwood, and L. S. Davis are with the Computer Vision Laboratory, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA (e-mail: ramani@umiacs.umd.edu; harwood@umiacs.umd.edu; lsd@cs.umd.edu).

Publisher Item Identifier 10.1109/JPROC.2002.801448.

0018-9219/02$17.00 © 2002 IEEE. PROCEEDINGS OF THE IEEE, VOL. 90, NO. 7, JULY 2002.

I. INTRODUCTION

In automated surveillance systems, cameras and other sensors are typically used to monitor activities at a site with the goal of automatically understanding events happening at the site. Automatic event understanding would enable functionalities such as detection of suspicious activities and site security. Current systems archive huge volumes of video for eventual off-line human inspection. The automatic detection of events in videos would facilitate efficient archiving and automatic annotation. It could be used to direct the attention of human operators to potential problems. The automatic detection of events would also dramatically reduce the bandwidth required for video transmission and storage as only interesting pieces would need to be transmitted or stored.

Higher level understanding of events requires certain lower level computer vision tasks to be performed such as detection of unusual motion, tracking targets, labeling body parts, and understanding the interactions between people. For many of these tasks, it is necessary to build representations of the appearance of objects in the scene. For example, the detection of unusual motions can be achieved by building a representation of the scene background and comparing new frames with this representation. This process is called background subtraction. Building representations for foreground objects (targets) is essential for tracking them and maintaining their identities. This paper focuses on two issues: how to construct a statistical representation of the scene background that supports sensitive detection of moving objects in the scene and how to build statistical representations of the foreground (moving objects) that support their tracking.

One useful tool for building such representations is statistical modeling, where a process is modeled as a random variable in a feature space with an associated probability density function (pdf). The density function could be represented parametrically using a specified statistical distribution, that
is assumed to approximate the actual distribution, with the associated parameters estimated from training data. Alternatively, nonparametric approaches could be used. These estimate the density function directly from the data without any assumptions about the underlying distribution. This avoids having to choose a model and estimate its distribution parameters.

A particular nonparametric technique that estimates the underlying density, avoids having to store the complete data, and is quite general is the kernel density estimation technique. In this technique, the underlying pdf is estimated as

$$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i K(x - x_i) \qquad (1)$$

where $K$ is a “kernel function” (typically a Gaussian) centered at the data points in feature space, $x_i$, $i = 1, \ldots, n$, and $\alpha_i$ are weighting coefficients (typically uniform weights are used, i.e., $\alpha_i = 1/n$). Kernel density estimators asymptotically converge to any density function [1], [2]. This property makes these techniques quite general and applicable to many vision problems where the underlying density is not known.

In this paper, kernel density estimation techniques are utilized for building representations for both the background and the foreground. We present an adaptive background modeling and background subtraction technique that is able to detect moving targets in challenging outdoor environments with moving trees and changing illumination. We also present a technique for modeling foreground regions and show how it can be used for segmenting major body parts of a person and for segmenting groups of people.

II. KERNEL DENSITY ESTIMATION TECHNIQUES

Given a sample $S = \{x_i\}_{i=1 \ldots N}$ from a distribution with density function $p(x)$, an estimate $\hat{p}(x)$ of the density at $x$ can be calculated using

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i) \qquad (2)$$

where $K_\sigma$ is a kernel function (sometimes called a “window” function) with a bandwidth (scale) $\sigma$ such that $K_\sigma(t) = (1/\sigma) K(t/\sigma)$. The kernel function $K$ should satisfy $K(t) \geq 0$ and $\int K(t)\,dt = 1$. We can think of (2) as estimating the pdf by averaging the effect of a set of kernel functions centered at each data point.
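As a minimal sketch of the estimator in (2), the following Python fragment averages Gaussian kernels centered at the samples of a single scalar feature. The function names, the sample values, and the bandwidth are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def gaussian_kernel(t: float) -> float:
    """Standard normal density: nonnegative and integrates to one."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kde(x: float, samples: Sequence[float], sigma: float) -> float:
    """Estimate the density at x as in (2): the mean of K_sigma(x - x_i)."""
    if not samples or sigma <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    # K_sigma(t) = (1 / sigma) * K(t / sigma)
    total = sum(gaussian_kernel((x - xi) / sigma) for xi in samples)
    return total / (len(samples) * sigma)


# Hypothetical intensity history of one pixel (illustrative values only).
history = [100.0, 102.0, 98.0, 101.0, 150.0]
p_near = kde(100.0, history, sigma=2.0)  # close to most of the samples
p_far = kde(200.0, history, sigma=2.0)   # far from every sample
```

The estimate is large where samples cluster and close to zero elsewhere, without any assumption about the overall shape of the density.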
Alternatively, since the kernel function is symmetric, we can also regard this computation as averaging the effect of a kernel function centered at the estimation point and evaluated at each data point. Kernel density estimators asymptotically converge to any density function with sufficient samples [1], [2]. This property makes the technique quite general for estimating the density of any distribution. In fact, all other nonparametric density estimation methods, e.g., histograms, can be shown to be asymptotically kernel methods [1].

For higher dimensions, products of one-dimensional (1-D) kernels [1] can be used as

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} K_{\sigma_j}(x_j - x_{ij}) \qquad (3)$$

where the same kernel function is used in each dimension with a suitable bandwidth $\sigma_j$ for each dimension. We can avoid having to store the complete data set by weighting the samples as

$$\hat{p}(x) = \sum_{i=1}^{N} \alpha_i K(x - x_i)$$

where the $\alpha_i$’s are weighting coefficients that sum up to one.

A variety of kernel functions with different properties have been used in the literature. Typically the Gaussian kernel is used for its continuity, differentiability, and locality properties. Note that choosing the Gaussian as a kernel function is different from fitting the distribution to a Gaussian model (normal distribution). Here, the Gaussian is only used as a function to weight the data points. Unlike parametric fitting of a mixture of Gaussians, kernel density estimation is a more general approach that does not assume any specific shape for the density function. A good discussion of kernel estimation techniques can be found in [1]. The major drawback of using the nonparametric kernel density estimator is its computational cost. This becomes less of a problem as the available computational power increases and as efficient computational methods have become available recently [3], [4].

III. MODELING THE BACKGROUND

A. Background Subtraction: A Review

1) The Concept: In video surveillance systems, stationary cameras are typically used to monitor activities at outdoor or indoor sites.
Since the cameras are stationary, the detection of moving objects can be achieved by comparing each new frame with a representation of the scene background. This process is called background subtraction and the scene representation is called the background model. Typically, background subtraction forms the first stage in an automated visual surveillance system. Results from background subtraction are used for further processing, such as tracking targets and understanding events.

A central issue in building a representation for the scene background is what features to use for this representation or, in other words, what to model in the background. In the literature, a variety of features have been used for background modeling, including pixel-based features (pixel intensity, edges, disparity) and region-based features (e.g., block correlation). The choice of the features affects how the background model tolerates changes in the scene and the granularity of the detected foreground objects.

In any indoor or outdoor scene, there are changes that occur over time and may be classified as changes to the scene background. It is important that the background model tolerates these kinds of changes, either by being invariant to them or by adapting to them. These changes can be local, affecting only part of the background, or global, affecting the entire background. The study of these changes is essential to understand the motivations behind different background subtraction techniques. We classify these changes according to their source.

Illumination changes:
• gradual change in illumination, as might occur in outdoor scenes due to the change in the location of the sun;
• sudden change in illumination as might occur in an indoor environment by switching the lights on or off, or in an outdoor environment by a change between cloudy and sunny conditions;
• shadows cast on the background by objects in the background itself (e.g., buildings and trees) or by moving foreground objects.

Motion changes:
• image changes due to small camera displacements (these are common in outdoor situations due to wind load or other sources of motion which cause global motion in the images);
• motion in parts of the background, for example, tree branches moving with the wind or rippling water.

Changes introduced to the background: These include any change in the geometry or the appearance of the background of the scene introduced by targets. Such changes typically occur when something relatively permanent is introduced into the scene background (for example, if somebody moves (introduces) something from (to) the background, or if a car is parked in the scene or moves out of the scene, or if a person stays stationary in the scene for an extended period).

2) Practice: Many researchers have proposed methods to address some of the issues regarding background modeling, and we provide a brief review of the relevant work here.

Pixel intensity is the most commonly used feature in background modeling. If we monitor the intensity value of a pixel over time in a completely static scene, then the pixel intensity can be reasonably modeled with a Gaussian distribution $N(\mu, \sigma^2)$, given that the image noise over time can be modeled by a zero mean Gaussian distribution $N(0, \sigma^2)$. This Gaussian distribution model for the intensity value of a pixel is the underlying model for many background subtraction techniques. For example, one of the simplest background subtraction techniques is to calculate an average image of the scene, subtract each new frame from this image, and threshold the result.
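The average-and-threshold scheme just described can be sketched as follows. This is an illustration only: frames are assumed to be small grayscale images stored as nested lists, and the function names and the threshold value are hypothetical.

```python
def mean_image(frames):
    """Per-pixel average of equally sized grayscale frames (lists of rows)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def subtract_background(frame, background, threshold):
    """Mark a pixel as foreground (1) when it differs from the background
    by more than the threshold, otherwise as background (0)."""
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


# Two hypothetical 2x2 training frames of a static scene with slight noise.
training = [
    [[10, 12], [50, 52]],
    [[12, 10], [52, 50]],
]
background = mean_image(training)

# A new frame in which only the top-right pixel changes substantially.
new_frame = [[11, 90], [51, 49]]
mask = subtract_background(new_frame, background, threshold=10)
```

A fixed global threshold is the weakest part of this scheme, since it ignores how much each pixel normally varies; the statistical models reviewed next address that.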
This basic Gaussian model can adapt to slow changes in the scene (for example, gradual illumination changes) by recursively updating the model using a simple adaptive filter. This basic adaptive model is used in [5]; also, Kalman filtering for adaptation is used in [6]–[8].

Typically, in outdoor environments with moving trees and bushes, the scene background is not completely static. For example, one pixel can be the image of the sky in one frame, a tree leaf in another frame, a tree branch in a third frame, and some mixture subsequently. In each situation, the pixel will have a different intensity (color), so a single Gaussian assumption for the pdf of the pixel intensity will not hold. Instead, a generalization based on a mixture of Gaussians has been used in [9]–[11] to model such variations. In [9] and [10], the pixel intensity was modeled by a mixture of $K$ Gaussian distributions ($K$ is a small number from 3 to 5). The mixture is weighted by the frequency with which each of the Gaussians explains the background. In [11], a mixture of three Gaussian distributions was used to model the pixel value for traffic surveillance applications. The pixel intensity was modeled as a weighted mixture of three Gaussian distributions corresponding to road, shadow, and vehicle distributions. Adaptation of the Gaussian mixture models can be achieved using an incremental version of the EM algorithm.

In [12], linear prediction using the Wiener filter is used to predict pixel intensity given a recent history of values. The prediction coefficients are recomputed each frame from the sample covariance to achieve adaptivity. Linear prediction using the Kalman filter was also used in [6]–[8].

All of the previously mentioned models are based on statistical modeling of pixel intensity with the ability to adapt the model. While pixel intensity is not invariant to illumination changes, model adaptation makes it possible for such techniques to adapt to gradual changes in illumination.
On the other hand, a sudden change in illumination presents a challenge to such models.

Another approach to model a wide range of variations in the pixel intensity is to represent these variations as discrete states corresponding to modes of the environment, e.g., lights on/off or cloudy/sunny skies. Hidden Markov models (HMMs) have been used for this purpose in [13] and [14]. In [13], a three-state HMM has been used to model the intensity of a pixel for a traffic-monitoring application where the three states correspond to the background, shadow, and foreground. The use of HMMs imposes a temporal continuity constraint on the pixel intensity, i.e., if the pixel is detected as a part of the foreground, then it is expected to remain part of the foreground for a period of time before switching back to be part of the background. In [14], the topology of the HMM representing global image intensity is learned while learning the background. At each global intensity state, the pixel intensity is modeled using a single Gaussian. It was shown that the model is able to learn simple scenarios like switching the lights on and off.

Alternatively, edge features have also been used to model the background. The use of edge features to model the background is motivated by the desire to have a representation of the scene background that is invariant to illumination changes. In [15], foreground edges are detected by comparing the edges in each new frame with an edge map of the background which is called the background “primal sketch.” The major drawback of using edge features to model the background is that it would only be possible to detect edges of foreground objects instead of the dense connected regions that result from pixel-intensity-based approaches. A fusion of intensity and edge information was used in [16].

Block-based approaches have also been used for modeling the background. Block matching has been extensively used for change detection between consecutive frames.
In [17], each image block is fit to a second-order bivariate polynomial and the remaining variations are assumed to be noise. A statistical likelihood test is then used to detect blocks with significant change. In [18], each block was represented with its median template over the background learning period and its block standard deviation. Subsequently, at each new frame, each block is correlated with its corresponding template, and blocks with too much deviation relative to the measured standard deviation are considered to be foreground. The major drawback with block-based approaches is that the detection unit is a whole image block and therefore they are only suitable for coarse detection. ELGAMMAL et al.: MODELING USING NONPARAMETRIC KERNEL DENSITY ESTIMATION FOR VISUAL SURVEILLANCE 1153
In order to monitor wide areas with sufficient resolution, cameras with zoom lenses are often mounted on pan-tilt platforms. This enables high-resolution imagery to be obtained from any arbitrary viewing angle from the location where the camera is mounted. The use of background subtraction in such situations requires a representation of the scene background for any arbitrary pan-tilt-zoom combination, which is an extension to the original background subtraction concept with a stationary camera. In [19], image mosaicing techniques are used to build panoramic representations of the scene background. Alternatively, in [20], a representation of the scene background as a finite set of images on a virtual polyhedron is used to construct images of the scene background at any arbitrary pan-tilt-zoom setting. Both techniques assume that the camera rotation is around its optical axis, so that there is no significant motion parallax.

B. Nonparametric Background Modeling

In this section, we describe a background model and a background subtraction process that we have developed, based on nonparametric kernel density estimation. The model uses pixel intensity (color) as the basic feature for modeling the background. The model keeps a sample of intensity values for each pixel in the image and uses this sample to estimate the density function of the pixel intensity distribution. Therefore, the model is able to estimate the probability of any newly observed intensity value. The model can handle situations where the background of the scene is cluttered and not completely static but contains small motions that are due to moving tree branches and bushes. The model is updated continuously and therefore adapts to changes in the scene background.

1) Background Subtraction: Let $x_1, x_2, \ldots, x_N$ be a sample of intensity values for a pixel. Given this sample, we can obtain an estimate of the pixel intensity pdf at any intensity value using kernel density estimation.
Given the observed intensity $x_t$ at time $t$, we can estimate the probability of this observation as

$$Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x_t - x_i) \tag{4}$$

where $K_\sigma$ is a kernel function with bandwidth $\sigma$. This estimate can be generalized to use color features by using kernel products as

$$Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} K_{\sigma_j}(x_{t_j} - x_{i_j}) \tag{5}$$

where $x_t$ is a $d$-dimensional color feature and $K_{\sigma_j}$ is a kernel function with bandwidth $\sigma_j$ in the $j$th color space dimension. If we choose our kernel function $K$ to be Gaussian, then the density can be estimated as

$$Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_j^2}} \, e^{-\frac{1}{2} \frac{(x_{t_j} - x_{i_j})^2}{\sigma_j^2}} \tag{6}$$

Fig. 1. Background Subtraction. (a) Original image. (b) Estimated probability image.

Using this probability estimate, the pixel is considered to be a foreground pixel if $Pr(x_t) < th$, where the threshold $th$ is a global threshold over all the images that can be adjusted to achieve a desired percentage of false positives. Practically, the probability estimation in (6) can be calculated very quickly using precalculated lookup tables for the kernel function values given the intensity value difference $(x_t - x_i)$ and the kernel function bandwidth. Moreover, a partial evaluation of the sum in (6) is usually sufficient to surpass the threshold at most image pixels, since most of the image is typically from the background. This allows us to construct a very fast implementation.

Since kernel density estimation is a general approach, the estimate of (4) can converge to any pixel intensity density function. Here, the estimate is based on the most recent $N$ samples used in the computation. Therefore, adaptation of the model can be achieved simply by adding new samples and ignoring older samples [21]. Fig. 1(b) shows the estimated background probability where brighter pixels represent lower background probability pixels.

One major issue that needs to be addressed when using kernel density estimation techniques is the choice of a suitable kernel bandwidth (scale). Theoretically, as the number of samples approaches infinity, the choice of the bandwidth is insignificant and the estimate will approach the actual density.
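The estimate in (6) and the thresholding rule can be sketched in a few lines. The code below is a minimal, unoptimized illustration that assumes the bandwidths are already known. The function names, the sample values, and the threshold used in the usage lines are hypothetical, and the lookup-table speedup is not shown. The early exit in `is_foreground` illustrates the partial evaluation of the sum mentioned above.

```python
import math

def kernel_product(x_t, x_i, sigmas):
    """Product over color channels of 1-D Gaussian kernels, as in (6)."""
    value = 1.0
    for xt_j, xi_j, sigma_j in zip(x_t, x_i, sigmas):
        z = (xt_j - xi_j) / sigma_j
        value *= math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi * sigma_j ** 2)
    return value

def background_probability(x_t, samples, sigmas):
    """Pr(x_t) from (6): the mean kernel response over the N stored samples."""
    return sum(kernel_product(x_t, x_i, sigmas) for x_i in samples) / len(samples)

def is_foreground(x_t, samples, sigmas, th):
    """Test Pr(x_t) < th, stopping early once the partial sum reaches th * N."""
    needed = th * len(samples)
    partial = 0.0
    for x_i in samples:
        partial += kernel_product(x_t, x_i, sigmas)
        if partial >= needed:
            return False  # enough evidence for background; skip remaining samples
    return True

# Hypothetical (R, G, B) history for one pixel and per-channel bandwidths.
samples = [(100, 120, 90), (102, 118, 91), (98, 121, 89), (101, 119, 92)]
sigmas = (2.0, 2.0, 2.0)
p_bg = background_probability((100, 120, 90), samples, sigmas)
p_fg = background_probability((30, 40, 200), samples, sigmas)
```

Because every kernel term is nonnegative, the partial sum can only grow, so stopping as soon as it reaches `th * N` gives the same decision as evaluating the full sum.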
Practically, since only a finite number of samples are used and the computation must be performed in real time, the choice of suitable bandwidth is essential. Too small a bandwidth will lead to a ragged density estimate, 1154 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 7, JULY 2002
while too wide a bandwidth will lead to an over-smoothed density estimate [2]. Since the expected variations in pixel intensity over time are different from one location to another in the image, a different kernel bandwidth is used for each pixel. Also, a different kernel bandwidth is used for each color channel.

To estimate the kernel bandwidth $\sigma_j^2$ for the $j$th color channel for a given pixel, we compute the median absolute deviation over the sample for consecutive intensity values of the pixel. That is, the median $m$ of $|x_i - x_{i+1}|$ for each consecutive pair $(x_i, x_{i+1})$ in the sample is calculated independently for each color channel. The motivation behind the use of the median of absolute deviations is that pixel intensities over time are expected to have jumps because different objects (e.g., sky, branch, leaf, and mixtures when an edge passes through the pixel) are projected onto the same pixel at different times. Since we are measuring deviations between two consecutive intensity values, the pair $(x_i, x_{i+1})$ usually comes from the same local-in-time distribution, and only a few pairs are expected to come from cross distributions (intensity jumps). The median is a robust estimate and should not be affected by a few jumps.

If we assume that this local-in-time distribution is Gaussian $N(\mu, \sigma^2)$, then the distribution for the deviation $(x_i - x_{i+1})$ is also Gaussian $N(0, 2\sigma^2)$. Since this distribution is symmetric, the median of the absolute deviations $m$ is equivalent to the upper quartile of the deviation distribution. That is,

$$Pr\left(N(0, 2\sigma^2) > m\right) = 0.25$$

The upper quartile of a zero-mean Gaussian lies about 0.68 standard deviations above the mean, so $m \approx 0.68\sqrt{2}\,\sigma$, and therefore the standard deviation of the first distribution can be estimated as

$$\sigma = \frac{m}{0.68\sqrt{2}}$$

Since the deviations are integer gray scale (color) values, linear interpolation is used to obtain more accurate median values.

2) Probabilistic Suppression of False Detection: In outdoor environments with fluctuating backgrounds, there are two sources of false detections. First, there are false detections due to random noise, which are expected to be homogeneous over the entire image.
Second, there are false detections due to small movements in the scene background that are not represented by the background model. This can occur locally, for example, if a tree branch moves further than it did during model generation. This can also occur globally in the image as a result of small camera displacements caused by wind load, which is common in outdoor surveillance and causes many false detections. These kinds of false detections are usually spatially clustered in the image, and they are not easy to eliminate using morphological techniques or noise filtering because these operations might also affect detection of small and/or occluded targets.

If a part of the background (a tree branch, for example) moves to occupy a new pixel, but it was not part of the model for that pixel, then it will be detected as a foreground object. However, this object will have a high probability of being a part of the background distribution corresponding to its original pixel. Assuming that only a small displacement can occur between consecutive frames, we decide if a detected pixel is caused by a background object that has moved by considering the background distributions of a small neighborhood of the detection location.

Let $x_t$ be the observed value of a pixel $x$ detected as a foreground pixel at time $t$. We define the pixel displacement probability $P_{\mathcal{N}}(x_t)$ to be the maximum probability that the observed value, $x_t$, belongs to the background distribution of some point in the neighborhood $\mathcal{N}(x)$ of $x$

$$P_{\mathcal{N}}(x_t) = \max_{y \in \mathcal{N}(x)} Pr(x_t \mid B_y)$$

where $B_y$ is the background sample for pixel $y$, and the probability estimation $Pr(x_t \mid B_y)$ is calculated using the kernel function estimation as in (6). By thresholding $P_{\mathcal{N}}$ for detected pixels, we can eliminate many false detections due to small motions in the background scene.
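The pixel displacement probability can be sketched for the grayscale case as follows. Since the kernel estimate needs a bandwidth for every neighboring pixel, the sketch also includes the median-based bandwidth estimate described earlier. This is a simplified illustration under stated assumptions: a single intensity channel, a square neighborhood of hypothetical size `radius`, a hypothetical bandwidth floor `min_sigma`, and the standard-library median in place of the linearly interpolated median used in the paper.

```python
import math
import statistics

def estimate_sigma(values, min_sigma=0.5):
    """Bandwidth from the median m of |x_i - x_{i+1}|: sigma = m / (0.68 * sqrt(2)).

    min_sigma is a hypothetical floor that avoids a zero bandwidth for a
    perfectly static pixel.
    """
    m = statistics.median(abs(a - b) for a, b in zip(values, values[1:]))
    return max(m / (0.68 * math.sqrt(2.0)), min_sigma)

def probability(x_t, sample, sigma):
    """Grayscale form of (6): Pr(x_t | B_y) for one pixel's background sample."""
    norm = math.sqrt(2.0 * math.pi) * sigma
    total = sum(math.exp(-0.5 * ((x_t - x_i) / sigma) ** 2) for x_i in sample)
    return total / (norm * len(sample))

def pixel_displacement_probability(x_t, row, col, background, radius=1):
    """Maximum of Pr(x_t | B_y) over the square neighborhood of (row, col)."""
    best = 0.0
    for r in range(max(0, row - radius), min(len(background), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(background[r]), col + radius + 1)):
            sample = background[r][c]
            best = max(best, probability(x_t, sample, estimate_sigma(sample)))
    return best

# Hypothetical 1 x 3 strip of background samples: dark sky, bright branch, dark sky.
background = [[[50, 52, 51, 49, 50], [200, 203, 199, 201, 200], [51, 50, 49, 50, 52]]]
x_t = 201  # the branch has swayed onto the left pixel
own = probability(x_t, background[0][0], estimate_sigma(background[0][0]))
p_n = pixel_displacement_probability(x_t, 0, 0, background)
```

In this toy strip the observed value is very unlikely under the left pixel's own model but likely under its neighbor's, so the displacement probability is high and the detection would be a candidate for suppression.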
To avoid losing true detections that might accidentally be similar to the background of some nearby pixel (e.g., camouflaged targets), a constraint is added that the whole detected foreground object must have moved from a nearby location, and not only some of its pixels. The component displacement probability $P_C$ is defined to be the probability that a detected connected component $C$ has been displaced from a nearby location. This probability is estimated by

$$P_C = \prod_{x \in C} P_{\mathcal{N}}(x)$$

For a connected component corresponding to a real target, the probability that this component has been displaced from the background will be very small. So, a detected pixel $x$ will be considered to be a part of the background only if $(P_{\mathcal{N}}(x) > th_1) \wedge (P_C(x) > th_2)$.

Fig. 2 illustrates the effect of the second stage of detection. The result after the first stage is shown in Fig. 2(b). In this example, the background has not been updated for several seconds, and the camera has been slightly displaced during this time interval, so we see many false detections along high-contrast edges. Fig. 2(c) shows the result after suppressing the detected pixels with high displacement probability. Most false detections due to displacement were eliminated, and only random noise that is uncorrelated with the scene remains as false detections. However, some true detected pixels were also lost. The final result of the second stage of the detection is shown in Fig. 2(d), where the component displacement probability constraint was added. Fig. 3(b) shows results for a case where, as a result of the wind load, the camera is shaking slightly, resulting in many clustered false detections, especially on the edges. After probabilistic suppression of false detection [Fig. 3(c)], most of these clustered false detections are suppressed, while the small target on the left side of the image remains.
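The second-stage decision rule can be sketched as below, assuming the pixel displacement probabilities have already been computed for every first-stage detection. The 4-connectivity used to form components, the function names, and the threshold values in the test data are assumptions for illustration only. The product defining $P_C$ is accumulated in the log domain because a product of many small probabilities underflows quickly.

```python
import math
from collections import deque

def connected_components(pixels):
    """Group detected (row, col) coordinates into 4-connected components."""
    remaining = set(pixels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = [seed], deque([seed])
        while queue:
            r, c = queue.popleft()
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    component.append(nb)
                    queue.append(nb)
        components.append(component)
    return components

def second_stage(p_n, th1, th2):
    """Return the detections that survive suppression.

    p_n maps each first-stage detection (row, col) to its pixel
    displacement probability. A pixel is relabeled as background only if
    its own probability exceeds th1 AND its component's P_C exceeds th2.
    """
    foreground = set()
    log_th2 = math.log(th2)
    for component in connected_components(p_n):
        # log P_C = sum of log P_N over the component (clamped to avoid log(0)).
        log_p_c = sum(math.log(max(p_n[p], 1e-300)) for p in component)
        for p in component:
            if not (p_n[p] > th1 and log_p_c > log_th2):
                foreground.add(p)
    return foreground

# Hypothetical detections: a displaced-background blob and a real target.
p_n = {(0, 0): 0.9, (0, 1): 0.8, (5, 5): 0.9, (5, 6): 1e-6, (5, 7): 1e-6}
kept = second_stage(p_n, th1=0.5, th2=0.1)
```

In this example the pixel at (5, 5) resembles a neighboring background pixel on its own, but it is kept because the component it belongs to as a whole is unlikely to be displaced background.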
Fig. 2. Effect of the second stage of detection on suppressing false detections. (a) Original image. (b) First stage detection result. (c) Suppressing pixels with high displacement probabilities. (d) Result using component displacement probability constraint.

Fig. 3. (a) Original image. (b) Result after the first stage of detection. (c) Result after the second stage.

Fig. 4. (a) Original image. (b) Detection using $(R, G, B)$ color space. (c) Detection using chromaticity coordinates $(r, g)$.

3) Working With Color: The detection of shadows as part of the foreground regions is a source of confusion for subsequent phases of analysis. It is desirable to discriminate between targets and their shadows. Color information is useful for suppressing shadows from the detection by separating color information from lightness information. Given three color variables, $R$, $G$, and $B$, the chromaticity coordinates are $r = R/(R+G+B)$, $g = G/(R+G+B)$, and $b = B/(R+G+B)$, where $r + g + b = 1$ [22]. Using chromaticity coordinates for detection has the advantage of being less sensitive to small changes in illumination that arise due to shadows. Fig. 4 shows the results of detection using both $(R, G, B)$ space and $(r, g)$ space. The figure shows that using the chromaticity coordinates allows detection of the target without detecting its shadow. Note that the background subtraction technique we describe in Section III-B can be used with any color space (e.g., HSV, YUV, etc.).

Fig. 5. (a) Original image. (b) Detection using $(R, G, B)$ color space. (c) Detection using chromaticity coordinates $(r, g)$ and the lightness variable, $s$.

Although using chromaticity coordinates helps in the suppression of shadows, they have the disadvantage of losing lightness information. Lightness is related to the differences in whiteness, blackness, and grayness between different objects [23]. For example, consider the case where the target wears a white shirt and walks against a gray background. In this case, there is no color information.
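Both properties of chromaticity coordinates can be checked numerically. The sketch below is only an illustration: the function name and pixel values are hypothetical, and a shadow is idealized as a uniform scaling of all three channels.

```python
def chromaticity(rgb):
    """Return (r, g) = (R, G) / (R + G + B); b = 1 - r - g is redundant."""
    R, G, B = rgb
    total = R + G + B
    if total == 0:
        return (1.0 / 3.0, 1.0 / 3.0)  # assumption: treat pure black as neutral
    return (R / total, G / total)

surface = (120, 90, 60)
shadowed = (60, 45, 30)      # same surface at half the illumination (idealized)
white = (255, 255, 255)
gray = (128, 128, 128)

same_under_shadow = chromaticity(surface) == chromaticity(shadowed)
white_equals_gray = chromaticity(white) == chromaticity(gray)
```

The shadowed surface keeps its chromaticity, which is why shadows are suppressed, but white and gray also map to the same point, which is the loss of lightness information described here.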
Since both white and gray have the same chromaticity coordinates, the target may not be detected.

To address this problem, we also need to use a measure of lightness at each pixel. We use $s = R + G + B$ as a lightness measure. Consider the case where the background is completely static, and let the expected value for a pixel be $\langle r, g, s \rangle$. Assume that this pixel is covered by shadow in frame $t$ and let $\langle r_t, g_t, s_t \rangle$ be the observed value for this pixel at this frame. Then, it is expected that $\alpha \le (s_t / s) \le 1$. That is, it is expected that the observed value $s_t$ will be darker than the normal value $s$ up to a certain limit, $\alpha s \le s_t$, which corresponds to the intuition that at most a fraction $(1 - \alpha)$ of the light coming to this pixel can be reduced by a target shadow. A similar effect is expected for highlighted background, where the observed value can be brighter than the expected value up to a certain limit. Similar reasoning was used by [24].

In our case, where the background is not static, there is no single expected value for each pixel. Let $A$ be the sample values representing the background for a certain pixel, each represented as $x_i = \langle r_i, g_i, s_i \rangle$, and let $x_t = \langle r_t, g_t, s_t \rangle$ be the observed value at frame $t$. Then, we can select a subset $B \subseteq A$ of sample values that are relevant to the observed lightness $s_t$. By relevant, we mean those values from the sample which, if affected by shadows, can produce the observed lightness of the pixel. That is,

$$B = \left\{ x_i \;\middle|\; x_i \in A \;\wedge\; \alpha \le \frac{s_t}{s_i} \le \beta \right\}$$

Using this relevant sample subset, we carry out our kernel calculation, as described in Section III-B, based on the two-dimensional (2-D) $(r, g)$ color space. The parameters $\alpha$ and $\beta$ are fixed over the whole image. Fig. 5 shows the detection results for an indoor scene using both the $(R, G, B)$ color space and the $(r, g)$ color space after using the lightness variable $s$ to
restrict the sample set to relevant values only. We illustrate the algorithm on an indoor sequence because the effect of shadows is more severe there than in outdoor environments. The target in the figure wears black pants and the background is gray, so there is no color information. Nevertheless, the target is still detected well and the shadows are suppressed, as seen in the rightmost parts of the figure.

4) Example Detection Results: The technique has been tested on a wide variety of challenging background subtraction problems in a variety of setups and was found to be robust and adaptive. In this section, we show some more example results. Fig. 6 shows two detection results for targets in a wooded area where the tree branches move heavily and the target is highly occluded. The technique is pixel-based and can work directly with raw images provided by omnidirectional cameras [25]. Fig. 7 (top) shows the detection results using an omnidirectional camera. The targets are camouflaged and walking through the woods. Fig. 7 (bottom) shows the detection result for a rainy day where the background model adapts to account for different rain and lighting conditions.¹

Fig. 6. Example of detection results.

Fig. 7. Top: detection result from an omnidirectional camera. Bottom: detection result for a rainy day.

¹Video clips showing these results and others can be downloaded from ftp://www.umiacs.umd.edu/pub/elgammal/video/index.htm

IV. MODELING THE FOREGROUND

A. Modeling Color Blobs

Modeling the color distribution of a homogeneous region has a variety of applications for object tracking and recognition. The color distribution of an object represents a feature that is robust to partial occlusion, scaling, and object deformation. It is also relatively stable under rotation in depth in certain applications.
Therefore, color distributions have been used successfully to track nonrigid bodies [5], [26]–[28], e.g., for tracking heads [27]–[30], hands [31], and other body parts against cluttered backgrounds from stationary or moving platforms. Color distributions have also been used for object recognition.

A variety of parametric and nonparametric statistical techniques have been used to model the color distribution of homogeneous colored regions. In [5], the color distribution of a region (blob) was modeled using a single Gaussian in the three-dimensional (3-D) YUV space. The use of a single Gaussian to model the color of a blob restricts it to be of a single color, which is not a sufficiently general assumption to model regions with mixtures of colors. For example, people's clothing and surfaces with texture usually contain patterns and mixtures of colors. Fitting a mixture of Gaussians using the EM algorithm provides a way to model color blobs with a mixture of colors. This technique was used in [30] and [27] for color-based tracking of a single blob and was applied to tracking faces. The mixture-of-Gaussians technique faces the problem of choosing the right number of Gaussians for the assumed model (model selection).

Nonparametric techniques using histograms have been widely used for modeling the color of objects for different applications to overcome the previously mentioned problems with parametric models. Color histograms have been used in [32] for people tracking, in [31] for tracking hands, in [26] for color region tracking, and in [33] for skin detection. The major drawback of color histograms is the lack of convergence to the right density function if the data set is small. Another major drawback of histograms, in general, is that they are not suitable for higher dimensional features.

Given a sample $S = \{x_i\}$ taken from an image region, where $i = 1, \ldots, N$ and $x_i$ is a $d$-dimensional vector representing the color, we can estimate the density function at any point $y$ of the color space directly from $S$ using the product of one-dimensional (1-D) kernels [1] as

$$\hat{p}(y) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} K_{\sigma_j}(y_j - x_{ij}) \qquad (7)$$

where the same kernel function is used in each dimension with a different bandwidth $\sigma_j$ for each dimension of the color space. Usually in color modeling 2-D or 3-D color spaces are used. Two-dimensional chromaticity spaces, e.g., $r = R/(R+G+B)$, $g = G/(R+G+B)$ and $a$, $b$ from the Lab color space, are used when it is desired to make the model invariant to illumination geometry for reasons discussed in Section III-B3. Three-dimensional color spaces are widely
used because of their better discrimination since brightness information is preserved. The use of different bandwidths for kernels in different color dimensions is desirable since the variances in each color dimension are different. For example, the luminance variable usually has more variance than the chromaticity variables, and therefore wider kernels should be used in that dimension.

Using kernel density estimation for color modeling has many motivations. Unlike histograms, even with a small number of samples, kernel density estimation leads to a smooth, continuous, and differentiable density estimate. Since kernel density estimation does not assume any specific underlying distribution and the estimate can converge to any density shape with enough samples, this approach is suitable to model the color distribution of regions with patterns and mixtures of colors. If the underlying distribution is a mixture of Gaussians, kernel density estimation converges to the right density with a small number of samples. Unlike parametric fitting of a mixture of Gaussians, kernel density estimation is a more general approach that does not require the selection of the number of Gaussians to be fitted. One other important advantage of using kernel density estimation is that the adaptation of the model is trivial and can be achieved by adding new samples. Since color spaces are low in dimensionality, efficient computation of kernel density estimation for color pdfs can be achieved using the Fast Gauss Transform algorithm [34], [35].

B. Color-Based Body Part Segmentation

In this section, we use the color modeling approach described in Section IV-A to segment foreground regions, corresponding to tracked people in upright poses, into major body parts. The foreground regions are detected using the background subtraction technique described earlier.
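
The color model of Section IV-A, on which this segmentation relies, can be illustrated with a minimal sketch of the product-kernel estimate in (7). The Gaussian kernel, the function names, the sample values, and the bandwidths below are assumptions chosen for illustration; the text does not prescribe them, and a practical implementation would likely use the Fast Gauss Transform rather than this direct sum.

```python
import math


def gaussian_kernel(t, sigma):
    """1-D Gaussian kernel with bandwidth sigma, evaluated at offset t (assumed kernel)."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)


def kde_color_density(y, samples, bandwidths):
    """Product-kernel estimate: (1/N) * sum_i prod_j K_sigma_j(y_j - x_ij)."""
    total = 0.0
    for x in samples:
        prod = 1.0
        for y_j, x_j, sigma_j in zip(y, x, bandwidths):
            prod *= gaussian_kernel(y_j - x_j, sigma_j)
        total += prod
    return total / len(samples)


# Hypothetical (r, g, s) samples from a region containing two dominant colors.
samples = [
    (0.50, 0.30, 0.40),
    (0.52, 0.29, 0.45),
    (0.20, 0.60, 0.70),
    (0.21, 0.58, 0.65),
]
# Narrow kernels for the chromaticity variables, a wider one for lightness
# (illustrative values only).
bandwidths = (0.02, 0.02, 0.08)

near = kde_color_density((0.51, 0.30, 0.42), samples, bandwidths)
far = kde_color_density((0.90, 0.05, 0.10), samples, bandwidths)
print(near > far)
```

Adapting the model then amounts to appending new tuples to `samples` and dropping old ones; no parameters need to be refitted.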
People can be dressed in many different ways but generally are dressed in a way that leads to a set of major color regions aligned vertically for people in upright poses (e.g., shirt, T-shirt, jacket on the top and pants, shorts, skirts on the bottom). We consider the case where people are dressed in a top–bottom manner, which yields a segmentation of the person into a head, torso, and bottom. Generally, a person in an upright pose is modeled as a set of vertically aligned blobs $M = \{A_i\}$ where a blob $A_i$ models a major color region along the vertical axis of the person representing a major part of the body such as the torso, bottom, or head. Each blob is represented by its color distribution as well as its spatial location with respect to the whole body. Since each blob has the same color distribution everywhere inside the blob, and since the vertical location of the blob is independent of the horizontal axis, the joint distribution of pixel $(x, y, c)$ (the probability of observing color $c$ at location $(x, y)$ given blob $A$) is a multiplication of three independent density functions

$$P_A(x, y, c) = f_A(x)\, g_A(y)\, h_A(c)$$

where $h_A(c)$ is the color density of blob $A$ and the densities $g_A(y)$, $f_A(x)$ represent the vertical and horizontal location of the blob, respectively.

Fig. 8. (a) Blob separator histogram from training data. (b) Confidence bands. (c) Blob segmentation. (d) Detected blob separators.

Estimates for the color density $h_A(c)$ can be calculated using kernel density estimation. We represent the color of each pixel as a 3-D vector $X = (r, g, s)$ where $r = R/(R+G+B)$, $g = G/(R+G+B)$ are two chromaticity variables and $s = (R+G+B)/3$ is a lightness variable. The three variables are scaled to be in the range 0 to 1. Given a sample of $N$ pixels $S_A = \{X_i = (r_i, g_i, s_i)\}$ from blob $A$, an estimate $\hat{h}_A(\cdot)$ for the color density $h_A(\cdot)$ can be calculated as

$$\hat{h}_A(X = (r, g, s)) = \frac{1}{N} \sum_{i=1}^{N} K_{\sigma_r}(r - r_i)\, K_{\sigma_g}(g - g_i)\, K_{\sigma_s}(s - s_i).$$

Given a set of samples $S = \{S_{A_i}\}$ corresponding to each blob, and initial estimates for the position of each blob $y_{A_i}$, each pixel is classified into one of the three blobs based on maximum-likelihood classification assuming that all blobs have the same prior probabilities

$$X \in A_k \ \text{s.t.} \ k = \arg\max_k P(X \mid A_k) = \arg\max_k g_{A_k}(y)\, h_{A_k}(c) \qquad (8)$$

where the vertical density $g_{A_k}(y)$ is assumed to have a Gaussian distribution $g_{A_k}(y) = N(y_{A_k}, \sigma_{A_k}^2)$. Since the blobs are assumed to be vertically above each other, the horizontal density $f_A(x)$ is irrelevant to the classification.

A horizontal blob separator is detected between each two consecutive blobs by finding the horizontal line that minimizes the classification error. Given the detected blob separators, the color model is recaptured by sampling pixels from each blob. Blob segmentation is performed, and blob separators are detected in each new frame as long as the target is isolated and tracked. Adaptation of the color model is achieved by updating the sample (adding new samples and ignoring old samples) for each blob model.

Model initialization is done automatically by taking three samples $S = \{S_H, S_T, S_B\}$ of pixels from three confidence bands corresponding to the head, torso, and bottom. The locations of these confidence bands are learned offline as follows. A set of training data with different people in upright pose (from both genders and in different orientations) is used to learn the location of blob separators (head-torso, torso-bottom) with respect to the body where these separators are manually marked. Fig. 8(a) shows a histogram of the locations of head-torso (left peak) and torso-bottom (right peak) in the training data. Based on these separator location estimates, we can determine the confidence bands proportional to the height where we are confident that they belong to the
head, torso, or bottom and use them to capture initial samples $S = \{S_H, S_T, S_B\}$. Fig. 8(b) shows the initial bands used for initialization; the segmentation result is shown in Fig. 8(c), and the detected separators are shown in Fig. 8(d).

Fig. 9 illustrates some blob segmentation examples for various people. The segmentation and separator detection are robust even under partial occlusion of the target, as in the rightmost result. Also, in some of these examples, the clothes are not of a uniform color.

Fig. 9. Example results for blob segmentation.

C. Segmentation of Multiple People

Visual surveillance systems are required to keep track of targets as they move through the scene even when they are occluded by or interacting with other people in the scene. It is highly undesirable to lose track of the targets when they are in a group. It is even more important to track the targets when they are interacting than when they are isolated. This problem is important not only for visual surveillance but also for other video analysis applications such as video indexing and video archival and retrieval.

In this section, we show how to segment foreground regions corresponding to a group of people into individuals given the representation for isolated people presented in Section IV-B. One drawback of this representation is its inability to model highly articulated parts such as hands. However, since our main objective is to segment people under occlusion, we are principally concerned with the mass of the body. Correctly locating the major blobs of the body will provide constraints on the location of the hands which could then be used to locate and segment them. The assumption we make about the scenario is that the targets are visually isolated before occlusion so that we can initialize their models.
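
Because the group segmentation developed in this section reuses the single-person blob model of Section IV-B, it may help to see the classification rule (8) written out. The following is a minimal sketch, not the authors' implementation: the Gaussian color kernel, the dictionary layout, the blob positions (given as fractions of body height), and all numeric values are hypothetical.

```python
import math


def gauss(t, sigma):
    """1-D Gaussian density with standard deviation sigma, evaluated at offset t."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)


def color_density(c, samples, bandwidths):
    """Kernel density estimate h_A(c) from a blob's (r, g, s) samples."""
    total = 0.0
    for x in samples:
        prod = 1.0
        for c_j, x_j, sigma_j in zip(c, x, bandwidths):
            prod *= gauss(c_j - x_j, sigma_j)
        total += prod
    return total / len(samples)


def classify_pixel(y, c, blobs, bandwidths):
    """Return the blob name maximizing g_A(y) * h_A(c), assuming equal priors."""
    def likelihood(blob):
        g = gauss(y - blob["y_mean"], blob["y_sigma"])  # vertical density g_A(y)
        return g * color_density(c, blob["samples"], bandwidths)
    return max(blobs, key=lambda name: likelihood(blobs[name]))


# Hypothetical three-blob person model; y runs from 0 (top of head) to 1 (feet).
blobs = {
    "head": {"y_mean": 0.08, "y_sigma": 0.05,
             "samples": [(0.40, 0.33, 0.60), (0.41, 0.32, 0.62)]},
    "torso": {"y_mean": 0.35, "y_sigma": 0.12,
              "samples": [(0.55, 0.25, 0.45), (0.56, 0.24, 0.47)]},
    "bottom": {"y_mean": 0.75, "y_sigma": 0.15,
               "samples": [(0.33, 0.33, 0.15), (0.34, 0.33, 0.17)]},
}
bandwidths = (0.03, 0.03, 0.08)

print(classify_pixel(0.36, (0.55, 0.25, 0.46), blobs, bandwidths))
print(classify_pixel(0.80, (0.33, 0.33, 0.16), blobs, bandwidths))
```

The horizontal density is omitted, as in (8), because it is the same for all vertically stacked blobs and therefore does not change the arg max.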

Given a foreground region corresponding to a group of people, we search for the arrangement that maximizes the likelihood of the appearance of this region given the models that we have built for the individuals. As a result, we obtain a segmentation of the region. The segmentation result is then used to determine the relative depth of each individual by evaluating different hypotheses about the arrangement of the people. This allows us to construct a model for occlusion.

The problem of tracking groups of people has been addressed recently in the literature. The Hydra system [36] tracks people in groups by tracking their heads based on the silhouette of the foreground regions corresponding to the group. It is able to count the number of people in the groups as long as their heads appear as part of the outer silhouette of the group; it fails otherwise. The Hydra system was not intended to accurately segment the group into individuals, nor does it recover depth information. In [32], groups of people were segmented based on the individuals' color distribution where the color distribution of the whole person was represented by a histogram. The color features are represented globally and are not spatially localized; therefore, this approach loses spatial information about the color distributions, which is an essential discriminant.

1) Segmentation Using Likelihood Maximization: For simplicity and without loss of generality, we focus on the two-person case. Given a person model $M = \{A_i\}$ where $i = 1, \ldots, n$, the probability of observing color $c$ at location $(x, y)$ given blob $A_i$ is

$$P_{A_i}(x, y, c) = f_{A_i}(x)\, g_{A_i}(y)\, h_{A_i}(c).$$

Since our blobs are aligned vertically, we can assume that all the blobs share the same horizontal density function $f(x)$. Therefore, given a person model $M = \{A_i\}_{i=1}^{n}$, the probability of $(x, y, c)$ is

$$P(x, y, c \mid M) = \frac{f(x)}{C(y)} \sum_{i=1}^{n} g_{A_i}(y)\, h_{A_i}(c) \qquad (9)$$

where $C(y)$ is a normalization factor such that $C(y) = \sum_{i=1}^{n} g_{A_i}(y)$. The location $(x, y)$ and the spatial densities $g_{A_i}(y)$, $f(x)$ are defined relative to an origin $o$.
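
As a rough illustration of (9), the sketch below evaluates the person-level likelihood with $x$ and $y$ measured relative to the model origin. Several simplifications are assumptions made only to keep the example short: the shared horizontal density $f(x)$ is taken to be a zero-mean Gaussian, the color is reduced to a single scalar, and each blob's color density is supplied as a hypothetical callable `h`.

```python
import math


def gauss(t, sigma):
    """1-D Gaussian density with standard deviation sigma, evaluated at offset t."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)


def blob_weights(y, blobs):
    """Normalized vertical weights g_i(y) / C(y), with C(y) = sum_i g_i(y)."""
    g = [gauss(y - b["y_mean"], b["y_sigma"]) for b in blobs]
    c_y = sum(g)
    if c_y == 0.0:  # y is so far from every blob that all densities underflow
        return [0.0] * len(g)
    return [g_i / c_y for g_i in g]


def person_likelihood(x, y, c, model):
    """P(x, y, c | M) = f(x) / C(y) * sum_i g_i(y) h_i(c), origin-relative x and y."""
    weights = blob_weights(y, model["blobs"])
    mixture = sum(w * b["h"](c) for w, b in zip(weights, model["blobs"]))
    return gauss(x, model["x_sigma"]) * mixture  # assumed Gaussian f(x)


# Hypothetical model: head, torso, bottom with scalar "colors" 0.6, 0.9, 0.2.
model = {
    "x_sigma": 0.15,
    "blobs": [
        {"y_mean": 0.08, "y_sigma": 0.05, "h": lambda c: gauss(c - 0.6, 0.05)},
        {"y_mean": 0.35, "y_sigma": 0.12, "h": lambda c: gauss(c - 0.9, 0.05)},
        {"y_mean": 0.75, "y_sigma": 0.15, "h": lambda c: gauss(c - 0.2, 0.05)},
    ],
}

at_torso = person_likelihood(0.0, 0.35, 0.9, model)
at_bottom = person_likelihood(0.0, 0.75, 0.9, model)
print(at_torso > at_bottom)
```

The division by $C(y)$ turns the vertical densities into mixture weights that sum to one at every height, so a torso-colored pixel scores highly only near the torso's expected vertical position.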
If the origin moves to $(x_o, y_o)$, we can shift the previous probability as

$$P(x, y, c \mid M(x_o, y_o)) = \frac{f(x - x_o)}{C(y - y_o)} \sum_{i=1}^{n} g_{A_i}(y - y_o)\, h_{A_i}(c).$$

This defines the conditional density as a function of the model origin $(x_o, y_o)$, i.e., $(x_o, y_o)$ is a parameter for the density, and it is the only degree of freedom allowed.

Given two people occluding each other with models $M_1(x_1, y_1)$ and $M_2(x_2, y_2)$, $h = (x_1, y_1, x_2, y_2)$ is a four-dimensional (4-D) hypothesis for their origins. We will call $h$ an arrangement hypothesis. For a foreground region $X = (X_1, \ldots, X_m)$ representing those two people, each foreground pixel $X_i = (x_i, y_i, c_i)$ can be classified into one of the two classes using maximum-likelihood classification (assuming the same prior probability for each person). This defines a segmentation $\omega_h(X) = (\omega_h(X_1), \ldots, \omega_h(X_m))$ that minimizes Bayes error, where

$$\omega_h(X_i) = k \ \text{s.t.} \ k = \arg\max_k P(X_i \mid M_k(x_k, y_k)), \quad k = 1, 2.$$

Notice that the segmentation $\omega_h(X)$ is a function of the origin hypothesis $h$ for the two models, i.e., each choice for the targets' origins defines a different segmentation of the foreground region. The best choice for the targets' origins is the one that maximizes the likelihood of the data over the
entire foreground region. Therefore, the optimal choice for can be defined in terms of a log-likelihood function For each new frame at time , searching for the optimal solves both the foreground segmentation as well as person tracking problems simultaneously. This formalization extends in a straightforward way to the case of people in a group. In this case, we have differerent classes and an arrangement hypothesis is a -dimensional vector . Finding the optimal hypothesis for people is a search problem in dimension space, and an exhaustive search for this solution would require tests, where is a 1-D window for each parameter (i.e., the diameter of the search region in pixels). Thus, finding the optimal solution in this way is exponential in the number of people in the group, which is impractical. Instead, since we are tracking the targets and the targets are not expected to move much between consecutive frames, we can develop a practical solution based on direct detection of an approximate solution at frame given the solution at frame . Let us choose a model origin that is expected to be visible throughout the occlusion and can be detected in a robust way. For example, if we assume that the tops of the heads are visible throughout the occlusion, we can use these as origins for the spatial densities. Moreover, the top of the head is a shape feature that can be detected robustly given our segmentation. Given the model origin location at frame , we can use this origin to classify each foreground pixel at frame using the maximum likelihood of . Since the targets are not expected to have significant translations between frames, we expect that the segmentation based on would be good in frame , except possibly at the boundaries. Using this segmentation, we can detect new origin locations (top of the head), i.e., . We can summarize this in the following steps. 1) 2) Segmentation: Classify each foreground pixel based on . 
3) Detection: Detect new origins (top of heads).

2) Modeling Occlusion: By occlusion modeling, we mean assigning a relative depth to each person in the group based on the segmentation result. Several approaches have been suggested in the literature to solve this problem. In [37], a ground plane constraint was used to reason about occlusion between cars. The assumption that object motion is constrained to the ground plane is valid for people and cars but would fail if the contact point on the ground plane is not visible because of partial occlusion by other objects or because contact points are out of the field of view (for example, see Fig. 10). In [32], the visibility index was defined to be the ratio of the number of pixels visible for each person during occlusion to the expected number of pixels for that person when isolated. This visibility index was used to measure the depth (a higher visibility index indicates that the person is in front). While this can be used to identify the person in front, this approach does not generalize to more than two people. The solution we present here does not use the ground plane constraint and generalizes to the case of $N$ people in a group.

Fig. 10. (a) Original image. (b) Foreground region. (c) Segmentation result. (d), (e) Occlusion model hypotheses.

Given a hypothesis about the 3-D arrangement of people along with their projected locations in the image plane and a model of their shape, we can construct an occlusion model that maps each pixel to one of the tracked targets or the scene background. Let us consider the case of two targets as shown in Fig. 10. The foreground region is segmented as in Section IV-C1, which yields a labeling for each pixel [Fig. 10(c)] as well as the most probable location for the model origins. There are two possible hypotheses about the depth arrangement of these two people, and the corresponding occlusion models are shown in Fig.
10(d) and (e), assuming an ellipse as a shape model for the targets. We can evaluate these two hypotheses (or, more generally, $N!$ hypotheses for $N$ people) by minimizing the error in the labeling between the occlusion model and the segmentation over the foreground pixels, i.e., the labeling error accumulated over all foreground pixels.² We use an ellipse with major and minor axes set to the expected height and width of each person estimated before the occlusion. Figs. 11 and 12 show examples of the constructed occlusion model for several occlusion situations.

Fig. 11 shows results for segmenting two people in different occlusion situations. The foreground segmentation between the two people is shown, as well as the part segmentation. Pixels with low likelihood probabilities are not labeled. In most cases, hands and feet are not labeled or are misclassified because they are not modeled by the part representation. The constructed occlusion model for each case is also shown. Notice that, in the third and fourth examples, the two people are dressed in similarly colored pants. Therefore, only the torso blobs are discriminating in color. This was sufficient to locate each person’s spatial model parameters, and therefore the similarly colored blobs (head and bottom) were segmented correctly based mainly on their spatial densities. Still, some misclassification can be noticed around the boundaries between the two pairs of pants, which are very hard even for a human to segment accurately. Fig. 12 illustrates several frames from

²In the two-person case, an efficient implementation for this error formula can be achieved by considering only the intersection region and finding the target which appears most in this region as being the one in front.

1160 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 7, JULY 2002
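The evaluation of depth hypotheses described in the occlusion modeling subsection can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the function names are hypothetical, each ellipse is parameterized by its center and half-axes, the frontmost target whose ellipse covers a pixel is taken to own that pixel, and the labeling error is taken to be a simple count of foreground pixels on which the occlusion model and the segmentation disagree.

```python
from itertools import permutations


def inside_ellipse(px, py, cx, cy, half_w, half_h):
    """True if pixel (px, py) lies inside the axis-aligned ellipse centered at (cx, cy)."""
    return ((px - cx) / half_w) ** 2 + ((py - cy) / half_h) ** 2 <= 1.0


def occlusion_label(px, py, ellipses, depth_order):
    """Label predicted for one pixel by one depth hypothesis (assumed rule).

    depth_order lists target ids from front to back; the first target whose
    ellipse covers the pixel owns it. None stands for the scene background.
    """
    for target in depth_order:
        if inside_ellipse(px, py, *ellipses[target]):
            return target
    return None


def best_depth_order(segmentation, ellipses):
    """Try every depth ordering and keep the one with the fewest mismatches.

    segmentation maps each foreground pixel (x, y) to the target id assigned
    by the segmentation step; ellipses maps each target id to a tuple
    (cx, cy, half_w, half_h). The mismatch count is an assumed error measure.
    """
    best_order, best_error = None, None
    for order in permutations(sorted(ellipses)):
        error = sum(
            1
            for (px, py), label in segmentation.items()
            if occlusion_label(px, py, ellipses, order) != label
        )
        if best_error is None or error < best_error:
            best_order, best_error = order, error
    return best_order, best_error
```

Calling `best_depth_order` with a per-pixel labeling and one ellipse per target returns the front-to-back ordering that agrees best with the segmentation. Enumerating all orderings grows factorially with the number of targets, so this brute-force form is only reasonable for small groups.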