entire foreground region. Therefore, the optimal choice for h can be defined in terms of a log-likelihood function

h_opt = \arg\max_h \sum_{i \in FG} \log P(X_i \mid M_{l_i(h)})

where l_i(h) is the class to which hypothesis h assigns foreground pixel X_i. For each new frame at time t, searching for the optimal (x_1, y_1, x_2, y_2)_t solves both the foreground segmentation and the person tracking problem simultaneously. This formalization extends in a straightforward way to the case of N people in a group. In this case, we have N different classes, and an arrangement hypothesis is a 2N-dimensional vector h = (x_1, y_1, ..., x_N, y_N). Finding the optimal hypothesis for N people is a search problem in a 2N-dimensional space, and an exhaustive search for this solution would require O(w^{2N}) tests, where w is a 1-D search window for each parameter (i.e., the diameter of the search region in pixels). Thus, finding the optimal solution in this way is exponential in the number of people in the group, which is impractical. Instead, since we are tracking the targets and the targets are not expected to move much between consecutive frames, we can develop a practical solution based on direct detection of an approximate solution h_t at frame t given the solution at frame t-1.

Let us choose a model origin that is expected to be visible throughout the occlusion and can be detected in a robust way. For example, if we assume that the tops of the heads are visible throughout the occlusion, we can use these as origins for the spatial densities. Moreover, the top of the head is a shape feature that can be detected robustly given our segmentation. Given the model origin location (x_i, y_i)_{t-1} at frame t-1, we can use this origin to classify each foreground pixel X at frame t using the maximum likelihood of P(X | M_i(x_i, y_i)_{t-1}). Since the targets are not expected to have significant translations between frames, we expect that the segmentation based on (x_i, y_i)_{t-1} will be good in frame t, except possibly at the boundaries. Using this segmentation, we can detect the new origin locations (top of the head), i.e., (x_i, y_i)_t. We can summarize this in the following steps.

1) h_{t-1} = (x_1, y_1, ..., x_N, y_N)_{t-1}.
2) Segmentation: Classify each foreground pixel X based on P(X | M_i(x_i, y_i)_{t-1}).
3) Detection: Detect new origins (top of heads).
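The per-frame procedure summarized in the three steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: person_likelihood stands in for the combined color/spatial density P(X | M_i(x_i, y_i)_{t-1}) described earlier, and the likelihood floor min_lik and the choice of the topmost labeled pixel as the new head-top origin are assumptions of the sketch.

```python
import numpy as np

def track_group_step(fg_pixels, fg_colors, origins_prev, person_likelihood, min_lik=1e-6):
    """One tracking step during occlusion.

    fg_pixels:    (M, 2) array of foreground pixel coordinates (row, col).
    fg_colors:    (M, 3) array of the corresponding pixel colors.
    origins_prev: list of N (row, col) head-top origins from frame t-1.
    person_likelihood(color, pixel, origin): stand-in for P(X | M_i(x_i, y_i)_{t-1}).
    Returns per-pixel labels (index of most likely person, -1 if unlabeled)
    and the detected head-top origins for frame t.
    """
    n_people = len(origins_prev)

    # 2) Segmentation: classify each foreground pixel by maximum likelihood
    #    using the models anchored at the previous origins.
    lik = np.empty((len(fg_pixels), n_people))
    for i, origin in enumerate(origins_prev):
        for j, (px, color) in enumerate(zip(fg_pixels, fg_colors)):
            lik[j, i] = person_likelihood(color, px, origin)
    labels = lik.argmax(axis=1)
    labels[lik.max(axis=1) < min_lik] = -1   # leave low-likelihood pixels unlabeled

    # 3) Detection: take the new origin to be the topmost (smallest row)
    #    pixel assigned to each person, i.e., the top of the head.
    origins_new = []
    for i in range(n_people):
        mine = fg_pixels[labels == i]
        origins_new.append(tuple(mine[mine[:, 0].argmin()]) if len(mine) else origins_prev[i])

    return labels, origins_new
```

Pixels whose best likelihood falls below the floor are left unlabeled, which mirrors the behavior described below for poorly modeled regions such as hands and feet.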
2) Modeling Occlusion: By occlusion modeling, we mean assigning a relative depth to each person in the group based on the segmentation result. Several approaches have been suggested in the literature to solve this problem. In [37], a ground plane constraint was used to reason about occlusion between cars. The assumption that object motion is constrained to the ground plane is valid for people and cars but would fail if the contact point on the ground plane is not visible because of partial occlusion by other objects or because contact points are out of the field of view (for example, see Fig. 10). In [32], the visibility index was defined to be the ratio of the number of pixels visible for each person during occlusion to the expected number of pixels for that person when isolated. This visibility index was used to measure the depth (a higher visibility index indicates that the person is in front). While this can be used to identify the person in front, this approach does not generalize to more than two people. The solution we present here does not use the ground plane constraint and generalizes to the case of N people in a group.

Fig. 10. (a) Original image. (b) Foreground region. (c) Segmentation result. (d), (e) Occlusion model hypotheses.

Given a hypothesis h about the 3-D arrangement of people along with their projected locations in the image plane and a model of their shape, we can construct an occlusion model O_h(x) that maps each pixel x to one of the tracked targets or the scene background. Let us consider the case of two targets as shown in Fig. 10. The foreground region is segmented as in Section IV-C1, which yields a labeling w(x) for each pixel [Fig. 10(c)] as well as the most probable location for the model origins. There are two possible hypotheses about the depth arrangement of these two people, and the corresponding occlusion models are shown in Fig. 10(d) and (e), assuming an ellipse as a shape model for the targets. We can evaluate these two hypotheses (or generally N hypotheses) by minimizing the error in the labeling between O_h(x) and w(x) over the foreground pixels, i.e.,

error(h) = \sum_{x \in FG} \bigl(1 - \delta(O_h(x), w(x))\bigr)

for all foreground pixels.² We use an ellipse with major and minor axes set to the expected height and width of each person estimated before the occlusion. Figs. 11 and 12 show some examples of the constructed occlusion model for some occlusion situations.
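The hypothesis evaluation can be sketched in the same spirit; this is an illustrative reconstruction, not the authors' code. The helper names, the ellipse placement, and the back-to-front painting of the occlusion model are assumptions of the sketch; error(h) is computed exactly as the mismatch count defined above, and candidate depth orderings are enumerated (for two people these are just the two hypotheses of Fig. 10(d) and (e)).

```python
import numpy as np
from itertools import permutations

def ellipse_mask(shape, center, height, width):
    """Boolean mask of an upright ellipse approximating one person."""
    rr, cc = np.ogrid[:shape[0], :shape[1]]
    cy, cx = center
    return ((rr - cy) / (height / 2.0)) ** 2 + ((cc - cx) / (width / 2.0)) ** 2 <= 1.0

def occlusion_model(shape, order, centers, sizes):
    """Paint people from farthest to nearest so nearer people overwrite farther ones.

    order:   person indices, farthest first.
    centers: ellipse centers, e.g., half the expected height below each head-top origin.
    sizes:   (height, width) of each person's ellipse, estimated before the occlusion.
    """
    model = np.full(shape, -1, dtype=int)          # -1 denotes background
    for person in order:
        h, w = sizes[person]
        model[ellipse_mask(shape, centers[person], h, w)] = person
    return model

def best_depth_order(labeling, fg_mask, centers, sizes):
    """Pick the depth ordering minimizing error(h) = sum over foreground pixels
    of 1 - delta(O_h(x), w(x)), where labeling holds w(x) per pixel."""
    best_order, best_err = None, None
    for order in permutations(range(len(centers))):   # all orderings; two for a pair
        model = occlusion_model(labeling.shape, order, centers, sizes)
        err = np.count_nonzero(model[fg_mask] != labeling[fg_mask])
        if best_err is None or err < best_err:
            best_order, best_err = order, err
    return best_order, best_err
```

As footnote 2 notes, the two-person case admits a shortcut: only the intersection region of the two shapes can differ between the two orderings, so finding which target's label appears most in that region directly identifies the one in front.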
Fig. 11 shows results for segmenting two people in different occlusion situations. The foreground segmentation between the two people is shown as well as the part segmentation. Pixels with low likelihood probabilities are not labeled. In most of the cases, hands and feet are not labeled or are misclassified because they are not modeled by the part representation. The constructed occlusion model for each case is also shown.
Notice that, in the third and fourth examples, the two people are dressed in similarly colored pants. Therefore, only the torso blobs are discriminating in color. This was sufficient to locate each person's spatial model parameters, and therefore similarly colored blobs (head and bottom) were segmented correctly based mainly on their spatial densities. Still, some misclassification can be noticed around the boundaries between the two pants, which is very hard even for a human to segment accurately. Fig. 12 illustrates several frames from

²In the two-person case, an efficient implementation for this error formula can be achieved by considering only the intersection region and finding the target which appears most in this region as being the one in front.