16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde

Lecture 22

Last time:

$$\hat{x} = \frac{\dfrac{\hat{x}_1}{\sigma_{\hat{x}_1}^2} + \displaystyle\sum_{k=N_1+1}^{N} \dfrac{z_k}{\sigma_k^2}}{\dfrac{1}{\sigma_{\hat{x}_1}^2} + \displaystyle\sum_{k=N_1+1}^{N} \dfrac{1}{\sigma_k^2}}$$

The estimate of $x$ based on $N$ data points can then be made without reprocessing the first $N_1$ points. Their effect can be included simply by starting with a pseudo observation which is equal to the estimate based on the first $N_1$ points, having a variance equal to the variance of that estimate. The same is true of the variance of the estimate based on $N$:

$$\frac{1}{\sigma_{\hat{x}}^2} = \frac{1}{\sigma_{\hat{x}_1}^2} + \sum_{k=N_1+1}^{N} \frac{1}{\sigma_k^2}$$

A priori information about $x$ can be included in exactly this way, whether or not it was derived from previous measurements. Whatever the source of the prior information, it can be expressed as an a priori distribution $f(x)$, or at least as an expected value and a variance. Take the expected value as a pseudo observation $z_0$, with variance $\sigma_0^2$, and accumulate this data with the actual data using the standard formulae.

With the prior information included as a pseudo observation, the least squares estimate is formed just as if there were no prior information. The result, for normal variables at least, is identical to the estimators based on the conditional distribution of $x$.

Bayes' rule can be used to form the distribution $f(x\,|\,z_1, z_2, \ldots, z_N)$ starting from the original a priori distribution $f(x)$:

$$f(x\,|\,z_1) = \frac{f(x)\,f(z_1\,|\,x)}{\int f(u)\,f(z_1\,|\,u)\,du}$$

$$f(x\,|\,z_1, z_2) = \frac{f(x\,|\,z_1)\,f(z_2\,|\,x)}{\int f(u\,|\,z_1)\,f(z_2\,|\,u)\,du}$$

etc., if the measurements are conditionally independent.

Two disadvantages relative to the previous method:
• More computation, unless you know each conditional density is going to be normal.
• Must provide $f(x)$, the a priori distribution. This is both the advantage and the disadvantage of this method.

Other estimators include the effect of a priori information directly. Several estimators are based on the conditional probability distribution of $x$ given the values of the observations. In this approach, we think of $x$ as a random variable having some distribution. This troubles some people, since we know $x$ is in fact fixed at some value throughout the experiment. However, the fact that we do not know what that value is is expressed in terms of a distribution of possible values for $x$. The extent of our a priori knowledge is reflected in the variance of the a priori distribution we assign.

Having an a priori distribution for $x$, and the values of the observations, we can in principle, and often in fact, calculate the conditional distribution of $x$ given the observations. This is the a posteriori distribution, $f(x\,|\,z_1, \ldots, z_N)$. This distribution expresses the probability density for various values of $x$, given the values of the observations and the a priori distribution. Having this distribution, one can define a number of reasonable estimates.

One is the minimum variance estimate: that value $\hat{x}$ which minimizes the error variance

$$\mathrm{Var} = \int_{-\infty}^{\infty} (\hat{x} - x)^2\, f(x\,|\,z_1, \ldots, z_N)\,dx$$

But setting the derivative of this variance with respect to $\hat{x}$ to zero shows the minimizing value of $\hat{x}$ to be

$$\hat{x} = \int_{-\infty}^{\infty} x\, f(x\,|\,z_1, \ldots, z_N)\,dx$$

which is the conditional mean: the mean of the conditional distribution of $x$.

Another reasonable estimate of $x$ based on this conditional distribution is the value at the maximum of the probability density. This can perfectly well be called the "maximum likelihood" estimate, though it is not necessarily the same as the maximum likelihood estimate we have just derived.
Schweppe calls it the MAP ("maximum a posteriori probability") estimator. The two are related as follows. The first is the $x$ which maximizes

$$f(z_1, \ldots, z_N \,|\, x) = \frac{f(z_1, \ldots, z_N, x)}{f(x)}$$
The second is the $x$ which maximizes

$$f(x\,|\,z_1, \ldots, z_N) = \frac{f(x, z_1, \ldots, z_N)}{f(z_1, \ldots, z_N)} = \frac{f(z_1, \ldots, z_N\,|\,x)\,f(x)}{f(z_1, \ldots, z_N)}$$

The a priori distribution $f(x)$ enters directly, and the distribution of $x$ is also involved in the joint distribution of the observations. Only if the distribution of $x$ is flat for all $x$ can we guarantee that these two functions will have maxima at the same value of $x$. But a flat distribution for $x$ for all $x$ is exactly the case when there is no a priori knowledge about $x$. In that case all values of $x$ have equal probability density, which is in fact an infinitesimal. We can consider $f(x)$ in that case to be the limit of almost any convenient distribution as the variance $\to \infty$.

If, on the other hand, we do have some prior information about $x$, based on some previous measurements or on physical reasoning, $f(x)$ will have some finite shape, often a normal shape, and the $x$ which maximizes $f(z_1, \ldots, z_N\,|\,x)$ will not maximize $f(x\,|\,z_1, \ldots, z_N)$. In this case the latter choice of $x$ is to be preferred, since it is the most probable value of $x$ based on the a priori distribution of $x$ and the values of the observations, whereas the former depends only on the observations.

We just note in passing that for a normal $f(x\,|\,z_1, \ldots, z_N)$, or any other distribution which is symmetric about its maximum point, the most probable value is equal to the conditional mean, which is the minimum variance value of $\hat{x}$. In the absence of prior information, this is also the maximum likelihood estimate, which is the weighted least squares estimate. This we found to be a linear combination of the data. So in the case of a normal conditional distribution with no prior information, the optimum linear estimate based on the data is the minimum variance estimator. No nonlinear operation on the data can give a smaller variance estimate.
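These relationships can be illustrated numerically. The sketch below is illustrative only (a normal prior and a single normal measurement, with all numbers assumed, and the posterior evaluated on a grid): with an informative prior, the maximizer of the likelihood $f(z\,|\,x)$ differs from the maximizer of the posterior $f(x\,|\,z)$; for this symmetric posterior the MAP value equals the conditional mean; and as the prior variance grows the MAP value approaches the maximum likelihood value.

```python
import math

# Assumed scalar problem: prior N(x0, var0), one measurement z = x + noise,
# noise ~ N(0, var_z).  All numbers are invented for illustration.
x0, z, var_z = 0.0, 2.0, 1.0

def grid_estimates(var0):
    # Evaluate the likelihood f(z|x) and the unnormalized posterior f(x|z)
    # on a grid of x values, and locate the three estimates.
    xs = [i * 0.001 for i in range(-4000, 6001)]
    like = [math.exp(-(z - x) ** 2 / (2 * var_z)) for x in xs]
    post = [math.exp(-(x - x0) ** 2 / (2 * var0)) * l for x, l in zip(xs, like)]
    x_ml = xs[max(range(len(xs)), key=like.__getitem__)]       # max of f(z|x)
    x_map = xs[max(range(len(xs)), key=post.__getitem__)]      # max of f(x|z)
    x_mean = sum(x * p for x, p in zip(xs, post)) / sum(post)  # conditional mean
    return x_ml, x_map, x_mean

# Informative prior: MAP differs from ML, and equals the conditional mean
# because the posterior is symmetric about its maximum.
x_ml, x_map, x_mean = grid_estimates(var0=1.0)

# Nearly flat prior: the MAP value approaches the ML value.
x_ml2, x_map2, _ = grid_estimates(var0=1e4)
```

With `var0 = 1.0` the posterior mode sits halfway between the prior mean and the measurement, while the likelihood peaks at the measurement itself; widening the prior moves the two maximizers together, which is the flat-prior limit discussed above.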
For non-normal distributions, one often defines a linear estimator and optimizes it based on minimum mean squared error. This gives the same estimate as that derived here. But in that case it may be that some nonlinear operation on the data could do better.

Also note that if $f(x\,|\,z_1, \ldots, z_N)$ is known to be normal, as when all noises are normal, the initial distribution is normal, and only linear operations are involved, then the distribution is completely defined by its mean and variance, and only these parameters need be computed. But in any other case, updating a general distribution requires carrying the entire distribution, and the estimation problem becomes infinite in dimension.

Estimators based on the conditional distribution of $x$ are thus very satisfying theoretically, but are more complicated to derive than a simple least squares fit. We have found the least squares estimate to equal these other estimators if there is no a priori information. Fortunately it is possible to include a priori information in the least squares format in such a way as to recast the problem in the form of an equivalent estimation problem with no prior information. This is done by introducing the a priori information in the form of an additional pseudo observation. That this can be done is due to the fact that the least squares estimate is a linear combination of the observations, and thus it is possible to group the observations in any way.

Recursive formulation

(The other method is called batch processing.)

Suppose we had the estimate $\hat{x}_0$, based on any prior information and all measurements already taken, and its variance $\sigma_0^2$. Then we took just one more measurement $z$ and wished to obtain an improved estimate of $x$ immediately. The formula says

$$\hat{x}_1 = \frac{\dfrac{\hat{x}_0}{\sigma_0^2} + \dfrac{z}{\sigma_z^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{1}{\sigma_z^2}} = \frac{\sigma_z^2\,\hat{x}_0 + \sigma_0^2\,z}{\sigma_0^2 + \sigma_z^2} = \hat{x}_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma_z^2}\,(z - \hat{x}_0) = \hat{x}_0 + k\,(z - \hat{x}_0)$$
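A quick numerical check of this update (a sketch with assumed numbers: the prior treated as a pseudo observation, and a common measurement-noise variance). Each measurement is folded in through the gain $k$, with the variance updated by combining precisions, $1/\sigma_1^2 = 1/\sigma_0^2 + 1/\sigma_z^2$, and the recursive result is compared with the batch weighted least squares answer over all the data:

```python
# Assumed numbers, for illustration only.
x_hat, var_hat = 1.0, 4.0          # prior estimate and its variance
zs = [2.1, 1.8, 2.4]               # measurements
var_z = 1.0                        # measurement-noise variance (common)

for z in zs:
    k = var_hat / (var_hat + var_z)                # gain
    x_hat += k * (z - x_hat)                       # new = old + gain * residual
    var_hat = var_hat * var_z / (var_hat + var_z)  # 1/var1 = 1/var0 + 1/var_z

# Batch weighted least squares over the pseudo observation plus all the data:
den = 1.0 / 4.0 + len(zs) / var_z
x_batch = (1.0 / 4.0 + sum(z / var_z for z in zs)) / den
var_batch = 1.0 / den
```

The recursive and batch answers agree because the least squares estimate is a linear combination of the observations, so they may be grouped in any way.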
New estimate = Old estimate + Gain × (Measurement residual)

This is the form of the modern recursive estimator. If this is formulated directly in the case where several parameters are being estimated, the $\left(\sigma_0^2 + \sigma_z^2\right)^{-1}$ is a matrix inversion. However, if only one scalar measurement is being processed, the inversion is a scalar reciprocal.

$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_z^2} = \frac{\sigma_0^2 + \sigma_z^2}{\sigma_0^2\,\sigma_z^2}$$

$$\sigma_1^2 = \frac{\sigma_0^2\,\sigma_z^2}{\sigma_0^2 + \sigma_z^2} = (1 - k)\,\sigma_0^2$$

$$k \equiv \sigma_0^2\left(\sigma_0^2 + \sigma_z^2\right)^{-1}$$

Extensions to this simple problem:
• Several estimated parameters instead of one
  o No conceptual difficulty
  o Get vector and matrix operations instead of scalar ones
• Dynamic parameters instead of static
  o No real difficulty if they obey a set of linear differential equations
• Several simultaneous measurements instead of one
  o No difficulty if they are linearly related to $x$
• Biased measurement noise
  o Estimate the bias
• Correlated measurement noise
  o Estimate the noise (different form of filter) or work with independent measurement differences
• Non-normal noises
  o Makes maximum likelihood difficult
  o Requires the full distribution rather than just the mean and variance
• Nonlinear system constraints or measurements
  o Makes things very difficult
  o Requires more information than the mean and variance

Statistics in State Space Formulation

In the state space formulation we depend on the concept of the shaping filter exclusively. Even if the input statistics are time varying, we suppose the input to be generated by passing white noise through a suitable time-varying shaping filter. In the non-stationary case it is not clear how to generate the shaping system directly.
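When several parameters are estimated at once, the scalar gain $k = \sigma_0^2(\sigma_0^2+\sigma_z^2)^{-1}$ becomes a matrix expression, with the inversion a true matrix inverse. A hypothetical two-parameter sketch (all covariances assumed, and the full parameter vector measured directly, so that the update mirrors the scalar form):

```python
import numpy as np

# Assumed prior covariance X0, measurement-noise covariance R, prior mean
# x0, and measurement z -- all invented for illustration.
X0 = np.array([[4.0, 1.0],
               [1.0, 2.0]])
R = np.eye(2)
x0 = np.array([1.0, 0.0])
z = np.array([2.0, 1.0])

K = X0 @ np.linalg.inv(X0 + R)     # matrix gain, analogue of the scalar k
x1 = x0 + K @ (z - x0)             # new = old + gain * residual
X1 = (np.eye(2) - K) @ X0          # analogue of (1 - k) * var0

# Consistency check: X1 should equal the combined-precision covariance
# inv(inv(X0) + inv(R)), just as 1/var1 = 1/var0 + 1/var_z in the scalar case.
X1_info = np.linalg.inv(np.linalg.inv(X0) + np.linalg.inv(R))
```

The two forms of the updated covariance agree by the standard matrix identity $(X_0^{-1} + R^{-1})^{-1} = X_0 - X_0 (X_0 + R)^{-1} X_0$, which reduces to $\sigma_1^2 = (1-k)\sigma_0^2$ in the scalar case.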
Non-stationary, multiple-input, multiple-output case

Note that this model may represent the shaping filter for a random process alone, or a system driven by white noise, or a system driven by correlated noise with the shaping filter included.

Shaping filter:
$$\dot{v} = A v + B n$$
$$d = C v$$

System (state space model):
$$\dot{x} = F x + G d$$
$$y = H x$$

[Block diagram: white noise $n(t)$ → shaping filter → $d(t)$ → system → $y(t)$]

Augmented system, with state $x' = \begin{bmatrix} x \\ v \end{bmatrix}$:

$$\dot{x}' = \begin{bmatrix} F & GC \\ 0 & A \end{bmatrix} x' + \begin{bmatrix} 0 \\ B \end{bmatrix} n \equiv A' x' + B' n$$

$$y = \begin{bmatrix} H & 0 \end{bmatrix} x' \equiv C' x'$$

We will drop the primes and treat the input noise as white:

$$\dot{x}(t) = A(t)\,x(t) + B(t)\,n(t)$$
$$y(t) = C(t)\,x(t)$$

We will allow the possibility of a biased input noise; $\overline{n}(t)$ is defined, and

$$E\left\{\left[n(t_1) - \overline{n}(t_1)\right]\left[n(t_2) - \overline{n}(t_2)\right]^T\right\} = N(t_1)\,\delta(t_1 - t_2)$$
The elements of $N(t)$ are the intensities, or spectral densities, of the components of $n(t)$. Never call it the variance of the white noise!

Propagation of the mean

$$\dot{\overline{x}}(t) = A(t)\,\overline{x}(t) + B(t)\,\overline{n}(t)$$
$$\overline{y}(t) = C(t)\,\overline{x}(t)$$

So the mean is propagated directly by the system dynamics.

Propagation of the covariance matrix

$$E\left\{\left[y(t) - \overline{y}(t)\right]\left[y(t) - \overline{y}(t)\right]^T\right\} = C(t)\,E\left\{\left[x(t) - \overline{x}(t)\right]\left[x(t) - \overline{x}(t)\right]^T\right\}\,C(t)^T = C(t)\,X(t)\,C(t)^T$$

So we require the covariance matrix for the full state variable $x(t)$. Can we derive a differential equation for the covariance matrix in the same way as we did in the error propagation section?

$$X(t) = E\left\{\left[x(t) - \overline{x}(t)\right]\left[x(t) - \overline{x}(t)\right]^T\right\}$$

For simplicity, define

$$\tilde{x}(t) = x(t) - \overline{x}(t)$$
$$\tilde{n}(t) = n(t) - \overline{n}(t)$$
$$\dot{\tilde{x}}(t) = A(t)\,\tilde{x}(t) + B(t)\,\tilde{n}(t)$$

Then

$$X(t) = E\left[\tilde{x}(t)\,\tilde{x}(t)^T\right]$$

$$\begin{aligned}
\dot{X}(t) &= E\left[\dot{\tilde{x}}(t)\,\tilde{x}(t)^T\right] + E\left[\tilde{x}(t)\,\dot{\tilde{x}}(t)^T\right] \\
&= E\left\{\left[A(t)\tilde{x}(t) + B(t)\tilde{n}(t)\right]\tilde{x}(t)^T\right\} + E\left\{\tilde{x}(t)\left[\tilde{x}(t)^T A(t)^T + \tilde{n}(t)^T B(t)^T\right]\right\} \\
&= A(t)X(t) + X(t)A(t)^T + B(t)\,E\left[\tilde{n}(t)\,\tilde{x}(t)^T\right] + E\left[\tilde{x}(t)\,\tilde{n}(t)^T\right]B(t)^T
\end{aligned}$$

What is the correlation between $\tilde{n}(t)$ and $\tilde{x}(t)$? It is not zero, even though $\tilde{n}(t)$ drives the derivative $\dot{\tilde{x}}(t)$ and not $\tilde{x}(t)$ directly. Because $\tilde{n}(t)$ is a white noise, and thus is infinite in magnitude almost everywhere, it has a non-negligible effect on $\tilde{x}(t)$ at the same point in time. But we need a more sophisticated calculus, i.e., the Itô calculus, to figure out what it is. So we cannot derive a differential equation for $X(t)$ just by differentiating its definition.
Instead, we can use a more roundabout procedure. The response of the differential equation for $\tilde{x}(t)$ can be expressed as

$$\tilde{x}(t) = \Phi(t, t_0)\,\tilde{x}(t_0) + \int_{t_0}^{t} \Phi(t, \tau)\,B(\tau)\,\tilde{n}(\tau)\,d\tau$$

where $\Phi(t, \tau)$ is the transition matrix for the linear system with system matrix $A(t)$. It satisfies the system homogeneous equation

$$\frac{d}{dt}\Phi(t, \tau) = A(t)\,\Phi(t, \tau), \qquad \Phi(\tau, \tau) = I$$

This approach works in this case because we can write down the form of the solution to a set of linear differential equations. We cannot write down the form of the solution to nonlinear differential equations, so we cannot use this approach in the case of nonlinear systems. However, the Itô calculus does apply to nonlinear stochastic differential equations.
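The transition matrix can be generated numerically from its defining equation. A scalar sketch, with an assumed time-varying $A(t) = -(1+t)$ chosen so the exact answer $\Phi(t, t_0) = \exp\!\left[-(t - t_0) - (t^2 - t_0^2)/2\right]$ is available for comparison:

```python
import math

def A(t):
    # Assumed scalar, time-varying system matrix.
    return -(1.0 + t)

t0, t1, dt = 0.0, 1.0, 1e-5
phi = 1.0                          # Phi(t0, t0) = I
t = t0
while t < t1 - 0.5 * dt:
    phi += dt * A(t) * phi         # Euler step of dPhi/dt = A(t) Phi
    t += dt

# Exact solution for this scalar A(t).
phi_exact = math.exp(-(t1 - t0) - (t1 ** 2 - t0 ** 2) / 2.0)
```

For a matrix $A(t)$ the same integration applies column by column, starting from the identity; only in the time-invariant case does $\Phi(t, \tau)$ reduce to the matrix exponential $e^{A(t-\tau)}$.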