Pillai, S.U., Shim, T.I., Batalama, S.N., Kazakos, D., and Daum, F. "Spectral Estimation and Modeling." The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
16 Spectral Estimation and Modeling

S. Unnikrishna Pillai, Polytechnic University
Theodore I. Shim, Polytechnic University
Stella N. Batalama, State University of New York at Buffalo
Dimitri Kazakos, University of Southwestern Louisiana
Fred Daum, Raytheon Company

16.1 Spectral Analysis
Historical Perspective • Modern Spectral Analysis

16.2 Parameter Estimation
Bayesian Estimation • Mean-Square Estimation • Minimax Estimation • Maximum Likelihood Estimation • Other Parameter Estimation Schemes

16.3 Kalman Filtering
Kalman Filter Equations • Kalman Filter Examples • Extended Kalman Filter • Nonlinear Filters • Practical Issues

16.1 Spectral Analysis

S. Unnikrishna Pillai and Theodore I. Shim

Historical Perspective

Modern spectral analysis dates back at least to Sir Isaac Newton [Newton, 1671], whose prism experiments with sunlight led him to discover that each color represented a particular wavelength of light and that sunlight contained all wavelengths. Newton used the word spectrum, a variant of the Latin word specter, to describe the band of visible light colors.

In the early eighteenth century, Bernoulli discovered that the solution to the wave equation describing a vibrating string can be expressed as an infinite sum containing weighted sine and cosine terms. Later, the French engineer Joseph Fourier, in his Analytical Theory of Heat [Fourier, 1822], extended Bernoulli's wave equation results to arbitrary periodic functions that might contain a finite number of jump discontinuities. Thus, for some T₀ > 0, if f(t) = f(t + T₀) for all t, then f(t) represents a periodic signal with period T₀, and in the case of real signals it has the Fourier series representation

$$f(t) = A_0 + 2\sum_{k=1}^{\infty}\left(A_k \cos k\omega_0 t + B_k \sin k\omega_0 t\right)$$

where ω₀ = 2π/T₀ and

$$A_k = \frac{1}{T_0}\int_0^{T_0} f(t)\cos k\omega_0 t\,dt, \qquad B_k = \frac{1}{T_0}\int_0^{T_0} f(t)\sin k\omega_0 t\,dt$$

with A₀ representing the dc term (k = 0). Moreover, the infinite sum on the right-hand side of the above expression converges to [f(t−0) + f(t+0)]/2. The total power P of the periodic function satisfies the relation

$$P = \frac{1}{T_0}\int_0^{T_0} \left|f(t)\right|^2 dt = A_0^2 + 2\sum_{k=1}^{\infty}\left(A_k^2 + B_k^2\right)$$

implying that the total power is distributed only among the dc term, the fundamental frequency ω₀ = 2π/T₀ and its harmonics kω₀, k ≥ 1, with 2(A_k² + B_k²) representing the power contained at the harmonic kω₀. For every periodic signal with finite power, since A_k → 0 and B_k → 0, the higher harmonics eventually become of decreasing importance.
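As a concrete check of this decomposition, the following Python sketch (an illustration added here; the function name fourier_coefficients and the test signal are arbitrary choices, not part of the original text) approximates A_k and B_k by numerical quadrature over one period and verifies the power relation.

```python
import numpy as np

def fourier_coefficients(f, T0, K, N=4096):
    # Rectangle-rule approximation of A_k = (1/T0) int_0^T0 f(t) cos(k w0 t) dt
    # and of B_k likewise, for k = 0..K, from N samples of one period.
    t = np.arange(N) * T0 / N
    w0 = 2 * np.pi / T0
    x = f(t)
    A = np.array([np.mean(x * np.cos(k * w0 * t)) for k in range(K + 1)])
    B = np.array([np.mean(x * np.sin(k * w0 * t)) for k in range(K + 1)])
    return A, B

# Test signal with period T0 = 1: dc term plus harmonics at k = 1 and k = 3.
f = lambda t: 1 + np.cos(2 * np.pi * t) + 0.5 * np.sin(6 * np.pi * t)
A, B = fourier_coefficients(f, T0=1.0, K=5)

# Power relation: time-domain power equals A_0^2 + 2 sum_k (A_k^2 + B_k^2).
t = np.arange(4096) / 4096
print(np.mean(f(t) ** 2))                                # 1.625
print(A[0] ** 2 + 2 * np.sum(A[1:] ** 2 + B[1:] ** 2))   # 1.625
```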
The British physicist Schuster [Schuster, 1898] used this observation to suggest that the partial power P_k = 2(A_k² + B_k²) at frequency kω₀, k = 0 → ∞, be displayed as the spectrum. Schuster termed this method the periodogram, and information over a multiple of periods was used to compute the Fourier coefficients and/or to smooth the periodogram, since, depending on the starting time, the periodogram may contain irregular and spurious peaks.

A notable exception to the periodogram was the linear regression analysis introduced by the British statistician Yule [Yule, 1927] to obtain a more accurate description of the periodicities in noisy data. Because the sampled periodic process x(k) = cos kω₀T containing a single harmonic component satisfies the recursive relation

$$x(k) = a\,x(k-1) - x(k-2)$$

where a = 2cos ω₀T represents the harmonic component, its noisy version x(k) + n(k) satisfies the recursion

$$x(k) = a\,x(k-1) - x(k-2) + n(k)$$

Yule interpreted this time series model as a recursive harmonic process driven by a noise process and used this form to determine the periodicity in the sequence of sunspot numbers. Yule further generalized the above recursion to

$$x(k) = a\,x(k-1) + b\,x(k-2) + n(k)$$

where a and b are arbitrary, to describe a truly autoregressive process. Since, for the right choice of a and b, the least-squares solution of the above autoregressive equation is a damped sinusoid, this generalization forms the basis for the modern-day parametric methods.
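A rough illustration of Yule's idea, as a hedged sketch: the synthetic data below (an assumed noisy sinusoid with an 11-sample period, standing in for the sunspot series) is fitted to the autoregression x(k) = ax(k−1) + bx(k−2) + n(k) by least squares, and the hidden frequency is recovered from a = 2cos ω₀T.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "sunspot-like" record: a noisy sinusoid hides the period.
w0T, N = 2 * np.pi / 11.0, 400                 # assumed 11-sample period
x = np.cos(w0T * np.arange(N)) + 0.2 * rng.normal(size=N)

# Least-squares fit of x[k] = a x[k-1] + b x[k-2] + n[k].
X = np.column_stack([x[1:-1], x[:-2]])         # regressors x[k-1], x[k-2]
y = x[2:]
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

# For a pure harmonic, a = 2 cos(w0 T) and b = -1; invert for the frequency.
print(a, b, np.arccos(a / 2))                  # ~1.68, ~-1.0, ~0.571 = 2*pi/11
```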
Modern Spectral Analysis

Norbert Wiener's classic work on Generalized Harmonic Analysis [Wiener, 1930] gave random processes a firm statistical foundation, and with the notion of ensemble average several key concepts were then introduced. The formalization of modern-day probability theory by Kolmogorov and his school also played an indispensable part in this development. Thus, if x(t) represents a continuous-time stochastic (random) process, then for every fixed t it behaves like a random variable with some probability density function f_x(x, t). The ensemble average or expected value of the process is given by

$$\mu_x(t) = E[x(t)] = \int_{-\infty}^{\infty} x\, f_x(x,t)\,dx$$

and the statistical correlation between two time instants t₁ and t₂ of the random process is described through its autocorrelation function

$$R_{xx}(t_1,t_2) = E[x(t_1)x^*(t_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1 x_2^*\, f_{x_1 x_2}(x_1,x_2,t_1,t_2)\,dx_1\,dx_2 = R_{xx}^*(t_2,t_1)$$

where f_{x₁x₂}(x₁, x₂, t₁, t₂) represents the joint probability density function of the random variables x₁ = x(t₁) and x₂ = x(t₂), and * denotes the complex conjugate transpose in general. Processes with autocorrelation functions that depend only upon the difference of the time instants t₁ and t₂ are known as wide sense stationary processes. Thus, if x(t) is wide sense stationary, then

$$E[x(t+\tau)x^*(t)] = R_{xx}(\tau) = R_{xx}^*(-\tau)$$

To obtain the distribution of power versus frequency in the case of a stochastic process, one can make use of the Fourier transform based on a finite segment of the data. Letting

$$P_T(\omega) = \frac{1}{2T}\left|\int_{-T}^{T} x(t)\,e^{-j\omega t}\,dt\right|^2$$

represent the power contained in a typical realization over the interval (−T, T), its ensemble average value as T → ∞ represents the true power contained at frequency ω. Thus, for wide sense stationary processes

$$S(\omega) = \lim_{T\to\infty} E[P_T(\omega)] = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}\!\!\int_{-T}^{T} R_{xx}(t_1-t_2)\,e^{-j\omega(t_1-t_2)}\,dt_1\,dt_2 = \lim_{T\to\infty}\int_{-2T}^{2T} R_{xx}(\tau)\left(1-\frac{|\tau|}{2T}\right)e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} R_{xx}(\tau)\,e^{-j\omega\tau}\,d\tau \;\geq\; 0 \tag{16.1}$$

Moreover, the inverse relation gives

$$R_{xx}(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,e^{j\omega\tau}\,d\omega \tag{16.2}$$

and hence

$$R_{xx}(0) = E\left[|x(t)|^2\right] = P = \frac{1}{2\pi}\int_{-\infty}^{\infty} S(\omega)\,d\omega$$

Thus S(ω) represents the power spectral density, and from Eqs. (16.1) and (16.2) the power spectral density and the autocorrelation function form a Fourier transform pair, the well-known Wiener–Khinchin theorem.
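A minimal numerical check of the Wiener–Khinchin pair, under the assumption of an AR(1) process whose autocorrelation and spectrum are known in closed form: the discrete-time analog of Eq. (16.1) is evaluated from the sample autocorrelation and compared with the true power spectral density. The triangular taper (1 − |τ|/2T) appearing in Eq. (16.1) reappears here as a Bartlett lag window.

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) process x[n] = a x[n-1] + w[n], unit-variance white noise w[n]:
# R(k) = a^|k| / (1 - a^2) and S(w) = 1 / |1 - a e^{-jw}|^2 in closed form.
a, N = 0.7, 1 << 16
w = rng.normal(size=N)
x = np.empty(N)
x[0] = w[0]
for n in range(1, N):
    x[n] = a * x[n - 1] + w[n]

# Sample autocorrelation for lags 0..K (biased estimator: divide by N).
K = 200
r = np.array([x[: N - k] @ x[k:] for k in range(K + 1)]) / N

# Wiener-Khinchin: S(w) = sum_k R(k) e^{-jkw} = R(0) + 2 sum_{k>0} R(k) cos(kw),
# with a triangular (Bartlett) lag window to tame the estimation noise.
wgrid = np.linspace(0, np.pi, 512)
taper = 1 - np.arange(1, K + 1) / K
S_wk = r[0] + 2 * np.cos(np.outer(wgrid, np.arange(1, K + 1))) @ (r[1:] * taper)
S_true = 1.0 / np.abs(1 - a * np.exp(-1j * wgrid)) ** 2

print(np.median(np.abs(S_wk - S_true) / S_true))   # a few percent (statistical error)
```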
If x(kT) represents a discrete-time wide sense stationary stochastic process, then

$$r_k = E\{x((n+k)T)\,x^*(nT)\} = r_{-k}^*$$

and the power spectral density is given by

$$S(\omega) = \sum_{k=-\infty}^{\infty} r_k\, e^{-jk\omega T}$$

or, in terms of the normalized variable θ = ωT,

$$S(\theta) = \sum_{k=-\infty}^{\infty} r_k\, e^{-jk\theta} = S(\theta + 2\pi k) \;\geq\; 0 \tag{16.3}$$

and the inverse relation gives the autocorrelations to be

$$r_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\theta)\,e^{jk\theta}\,d\theta = r_{-k}^*$$

Thus, the power spectral density of a discrete-time process is periodic. Such a process can be obtained by sampling a continuous-time process at t = kT, |k| = 0 → ∞, and if the original continuous-time process is bandlimited with a two-sided bandwidth equal to 2ω_b = 2π/T, then the set of discrete samples so obtained is equivalent to the original process in a mean-square sense.

As Schur [Schur, 1917] has observed, for discrete-time stationary processes the nonnegativity of the power spectrum is equivalent to the nonnegative definiteness of the Hermitian Toeplitz matrices, i.e.,

$$S(\theta) \geq 0 \iff T_k = \begin{pmatrix} r_0 & r_1 & \cdots & r_k \\ r_1^* & r_0 & \cdots & r_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ r_k^* & r_{k-1}^* & \cdots & r_0 \end{pmatrix} \geq 0, \qquad k = 0 \to \infty \tag{16.4}$$

If x(nT) is the output of a discrete-time linear time-invariant causal system driven by w(nT), then we have the following representation:

$$w(nT) \;\to\; H(z) = \sum_{k=0}^{\infty} h(kT)\,z^k \;\to\; x(nT) = \sum_{k=0}^{\infty} h(kT)\,w((n-k)T) \tag{16.5}$$

In the case of a stationary input, the output is also stationary, and its power spectral density is given by

$$S_x(\theta) = \left|H(e^{j\theta})\right|^2 S_w(\theta) \tag{16.6}$$

where S_w(θ) represents the power spectral density of the input process. If the input is a white noise process, then S_w(θ) = σ² and

$$S_x(\theta) = \sigma^2 \left|H(e^{j\theta})\right|^2$$

Clearly, if H(z) is rational, so is S_x(θ).
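A short sketch of Eq. (16.6) using scipy.signal; the particular filter H(z), record length, and tolerance are illustrative assumptions. Unit-variance white noise is passed through a rational filter, and the Welch estimate of S_x is compared against σ²|H(e^{jθ})|².

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)

# Unit-variance white noise through H(z) = (1 + 0.5 z^-1)/(1 - 0.8 z^-1).
b, a = [1.0, 0.5], [1.0, -0.8]
x = signal.lfilter(b, a, rng.normal(size=1 << 18))

# Welch PSD estimate (one-sided, fs = 1 cycle/sample) vs sigma^2 |H|^2.
f, Sx = signal.welch(x, fs=1.0, nperseg=1024)
_, H = signal.freqz(b, a, worN=2 * np.pi * f)    # H at the same frequencies

# The one-sided density carries a factor of 2 relative to |H(e^{j theta})|^2.
rel_err = np.abs(Sx - 2 * np.abs(H) ** 2) / (2 * np.abs(H) ** 2)
print(np.median(rel_err))                        # a few percent for this record
```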
Conversely, given a power spectral density

$$S_x(\theta) = \sum_{k=-\infty}^{\infty} r_k\, e^{-jk\theta} \;\geq\; 0 \tag{16.7}$$

that satisfies the integrability condition

$$\int_{-\pi}^{\pi} S_x(\theta)\,d\theta < \infty \tag{16.8}$$

and the physical realizability (Paley–Wiener) criterion

$$\int_{-\pi}^{\pi} \ln S_x(\theta)\,d\theta > -\infty \tag{16.9}$$

there exists a unique function

$$H(z) = \sum_{k=0}^{\infty} b_k z^k, \qquad |z| < 1$$

that is analytic together with its inverse in |z| < 1 (the minimum phase factor) such that

$$S_x(\theta) = \lim_{r\to 1-0} \left|H(re^{j\theta})\right|^2 = \left|H(e^{j\theta})\right|^2 \quad \text{a.e.}$$

When only a finite set of autocorrelations r₀, r₁, …, r_n is available, any admissible power spectral density K(θ) ≥ 0 must interpolate the given data, i.e.,

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} K(\theta)\,e^{jk\theta}\,d\theta = r_k, \qquad k = 0 \to n \tag{16.10}$$

and, from Eq. (16.4), this requires T_n ≥ 0. Suppose T_{m−1} > 0 and T_m is singular [det T_m = 0, det(·) representing the determinant of (·)]. In that case, there exists a unique vector X = (x₀, x₁, …, x_m)^T such that T_m X = 0 and, further, the autocorrelations have a unique extension given by

$$r_k = \sum_{i=1}^{m} P_i\, e^{jk\theta_i}, \qquad |k| = 0 \to \infty \tag{16.11}$$

where e^{jθ_i}, i = 1 → m, are the m zeros of the polynomial x₀ + x₁z + ⋯ + x_m z^m and P_i > 0. This gives

$$T_{m-1} = A \begin{pmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_m \end{pmatrix} A^* \tag{16.12}$$

where A is an m × m Vandermonde matrix given by

$$A = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \lambda_1 & \lambda_2 & \cdots & \lambda_m \\ \vdots & \vdots & & \vdots \\ \lambda_1^{m-1} & \lambda_2^{m-1} & \cdots & \lambda_m^{m-1} \end{pmatrix}, \qquad \lambda_i = e^{j\theta_i}, \; i = 1 \to m$$

and Eq. (16.12) can be used to determine P_k > 0, k = 1 → m. The power spectrum associated with Eq. (16.11) is given by

$$S(\theta) = 2\pi \sum_{i=1}^{m} P_i\, \delta(\theta - \theta_i)$$

and it represents a discrete spectrum that corresponds to pure uncorrelated sinusoids with signal powers P₁, P₂, …, P_m.
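The singular case can be exercised numerically. The sketch below (frequencies and powers chosen arbitrarily for illustration) builds r_k for two complex exponentials, extracts the null vector X of the singular Toeplitz matrix T_m, reads the frequencies off the zeros of x₀ + x₁z + ⋯ + x_m z^m as in Eq. (16.11), and solves the Vandermonde system of Eq. (16.12) for the powers.

```python
import numpy as np
from scipy.linalg import toeplitz

# Two line-spectrum components: r_k = sum_i P_i e^{j k theta_i}.
thetas = np.array([0.8, 2.0])                  # true frequencies (rad)
P = np.array([1.5, 0.7])                       # true powers
m = len(thetas)
r = np.exp(1j * np.outer(np.arange(m + 1), thetas)) @ P

# T_m: (m+1)x(m+1) Hermitian Toeplitz with first row (r_0, ..., r_m);
# singular by construction, since the spectrum has exactly m lines.
T = toeplitz(r.conj(), r)
eigval, eigvec = np.linalg.eigh(T)
X = eigvec[:, 0]                               # null vector: T_m X = 0

# The zeros of x_0 + x_1 z + ... + x_m z^m sit on |z| = 1 at e^{j theta_i}.
roots = np.roots(X[::-1])                      # np.roots wants highest power first
est = np.sort(np.angle(roots) % (2 * np.pi))
print(est)                                     # ~ [0.8, 2.0]

# Powers from the Vandermonde system: r_k = sum_i P_i e^{j k theta_i}, k = 0..m-1.
V = np.exp(1j * np.outer(np.arange(m), est))
print(np.linalg.solve(V, r[:m]).real)          # ~ [1.5, 0.7]
```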
If the given autocorrelations satisfy T_n > 0, from Eq. (16.4) every unknown r_k, k ≥ n + 1, must be selected so as to satisfy T_k > 0, and this gives

$$\left|r_{k+1} - \zeta_k\right|^2 \leq R_k^2 \tag{16.13}$$

where ζ_k = φ_k^T T_{k−1}^{−1} b_k, φ_k = (r₁, r₂, …, r_k)^T, b_k = (r_k, r_{k−1}, …, r₁)^T, and R_k = det T_k / det T_{k−1}. From Eq. (16.13), the unknowns could be anywhere inside a sequence of circles with center ζ_k and radius R_k, and as a result there are an infinite number of solutions to this problem. Schur and Nevanlinna have given an analytic characterization of these solutions in terms of bounded function extensions. A bounded function ρ(z) is analytic in |z| < 1 and satisfies |ρ(z)| ≤ 1 everywhere in |z| < 1. In a network theory context, Youla [1980] has also given a closed-form parametrization of this class of solutions. In that case, given r₀, r₁, …, r_n, the minimum phase transfer functions satisfying Eqs. (16.8) and (16.9) are given by

$$H_\rho(z) = \frac{\Gamma(z)}{P_n(z) - z\,\rho(z)\,\tilde{P}_n(z)} \tag{16.14}$$

where ρ(z) is an arbitrary bounded function that satisfies the inequality (Paley–Wiener criterion)

$$\int_{-\pi}^{\pi} \ln\left(1 - \left|\rho(e^{j\theta})\right|^2\right) d\theta > -\infty$$

Γ(z) is the minimum phase factor obtained from the factorization

$$1 - \left|\rho(e^{j\theta})\right|^2 = \left|\Gamma(e^{j\theta})\right|^2$$

and P_n(z) represents the Levinson polynomial generated from r₀ → r_n through the recursion

$$\sqrt{1 - s_n^2}\; P_n(z) = P_{n-1}(z) - z\, s_n\, \tilde{P}_{n-1}(z)$$

that starts with P₀(z) = 1/√r₀, where

$$s_n = P_{n-1}(0)\left\{P_{n-1}(z)\sum_{k=1}^{n} r_k z^k\right\}_n \tag{16.15}$$

represents the reflection coefficient at stage n. Here, {·}_n denotes the coefficient of z^n in the expansion {·}, and P̃_n(z) ≜ z^n P_n^*(1/z^*) represents the polynomial reciprocal to P_n(z).
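The recursion can be coded directly. The following sketch, a hedged illustration rather than the chapter's own algorithm, generates the Levinson polynomials and reflection coefficients from a finite (real) autocorrelation sequence; the coefficient-extraction form of s_n follows the reconstruction of Eq. (16.15) above and agrees with the standard Levinson–Durbin values on an AR(1) test case.

```python
import numpy as np

def levinson_polys(r):
    # Levinson polynomials P_n(z) (coefficient arrays, lowest power first) and
    # reflection coefficients s_n from real autocorrelations r_0..r_N, via
    # sqrt(1 - s_n^2) P_n(z) = P_{n-1}(z) - z s_n P~_{n-1}(z), P_0(z) = 1/sqrt(r_0).
    N = len(r) - 1
    P = np.array([1.0 / np.sqrt(r[0])])
    s = []
    for n in range(1, N + 1):
        # s_n = P_{n-1}(0) {P_{n-1}(z) sum_k r_k z^k}_n : coefficient of z^n.
        sn = P[0] * np.convolve(P, r[: n + 1])[n]
        s.append(sn)
        Pn = np.concatenate([P, [0.0]]) - sn * np.concatenate([[0.0], P[::-1]])
        P = Pn / np.sqrt(1.0 - sn ** 2)
    return P, np.array(s)

# AR(1) check: r_k = a^k/(1 - a^2) gives s_1 = a and s_n = 0 for n > 1, so the
# maximum entropy extension H(z) = 1/P_N(z) recovers the original AR(1) model.
a = 0.6
r = a ** np.arange(5) / (1 - a ** 2)
P, s = levinson_polys(r)
print(s)   # [0.6, 0, 0, 0]
print(P)   # [1, -0.6, 0, 0, 0]  ->  P_4(z) = 1 - 0.6 z
```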
Notice that the given information r₀ → r_n enters P_n(z) through Eq. (16.15). The power spectral density K(θ) = |H_ρ(e^{jθ})|² associated with Eq. (16.14) satisfies all the interpolation properties described before. In Eq. (16.14), the solution ρ(z) ≡ 0 gives H(z) = 1/P_n(z), a pure AR(n) system that coincides with Burg's maximum entropy extension. Clearly, if H_ρ(z) is rational, then ρ(z) must be rational and, more interestingly, every rational system must follow from Eq. (16.14) for a specific rational bounded function ρ(z). Of course, the choice of ρ(z) brings in extra freedom, and this can be profitably used for system identification as well as for rational and stable approximation of nonrational systems [Pillai and Shim, 1993].

Defining Terms

Autocorrelation function: The expected value of the product of two random variables generated from a random process for two time instants; it represents their interdependence.

Expected value (or mean) of a random variable: Ensemble average value of a random variable, given by integrating the random variable, after scaling by its probability density function (weighted average), over the entire range.

Power spectrum: A nonnegative function that describes the distribution of power versus frequency. For wide sense stationary processes, the power spectrum and the autocorrelation function form a Fourier transform pair.

Probability density function: The probability of the random variable taking values between two real numbers x₁ and x₂ is given by the area under the nonnegative probability density function between those two points.

Random variable: A continuous or discrete valued variable that maps the set of all outcomes of an experiment into the real line (or complex plane). Because the outcomes of an experiment are inherently random, the final value of the variable cannot be predetermined.

Stochastic process: A real valued function of time t, which for every fixed t behaves like a random variable.

Related Topics

14.1 Fourier Transforms • 40.2 Spectrum, Specifications, and Measurement Techniques • 73.3 Stochastic Processes

References

N.I. Akheizer and M. Krein, Some Questions in the Theory of Moments, American Math. Soc. Monogr., 2, 1962.
J.B.J. Fourier, Théorie Analytique de la Chaleur (Analytical Theory of Heat), Paris, 1822.
L.Y. Geronimus, Polynomials Orthogonal on a Circle and Their Applications, American Math. Soc., Translation, 104, 1954.
I. Newton, Philos. Trans., vol. IV, p. 3076, 1671.
S.U. Pillai and T.I. Shim, Spectrum Estimation and System Identification, New York: Springer-Verlag, 1993.
I. Schur, "Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind," Journal für die reine und angewandte Mathematik, vol. 147, pp. 205–232, 1917.
J.A. Shohat and J.D. Tamarkin, The Problem of Moments, American Math. Soc., Math. Surveys, 1, 1970.
N. Wiener, "Generalized harmonic analysis," Acta Math., vol. 55, pp. 117–258, 1930.
D.C. Youla, "The FEE: A New Tunable High-Resolution Spectral Estimator: Part I," Technical Note no. 3, Dept. of Electrical Engineering, Polytechnic Institute of New York, Brooklyn, New York, 1980; also RADC Report RADC-TR-81-397, AD A114996, 1982.
G.U. Yule, "On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers," Philos. Trans. R. Soc. London, Ser. A, vol. 226, pp. 267–298, 1927.

16.2 Parameter Estimation

Stella N. Batalama and Dimitri Kazakos

Parameter estimation is the operation of assigning a value in a continuum of alternatives to an unknown parameter based on a set of observations involving some function of the parameter. The estimate is the value assigned to the parameter, and the estimator is the function of the observations that yields the estimate. The basic elements in parameter estimation are a vector parameter θ^m, a vector space ℰ^m where θ^m takes its values, a stochastic process X(t) parameterized by θ^m, and a performance criterion or cost function. The estimate θ̂^m(x^n) based on the observation vector x^n = [x₁, x₂, …, x_n] is a solution of some optimization problem according to the performance criterion. In the following, the function f(x^n | θ^m) denotes the conditional joint probability density function of the random variables x₁, …, x_n.

There are several parameter estimation schemes. If the process X(t) is parametrically known, i.e., if its conditional joint probability density functions are known for each fixed value θ^m of the vector parameter, then the corresponding parameter estimation scheme is called parametric. If the statistics of the process X(t) are nonparametrically described, i.e., given θ^m ∈ ℰ^m any joint probability density function of the process is a member of some nonparametric class of probability density functions, then nonparametric estimation schemes arise.

Let Γ^n denote the n-dimensional observation space. Then an estimator θ̂^m(x^n) of a vector parameter θ^m is a function from the observation space Γ^n to the parameter space ℰ^m. Since it is a function of random variables, it is itself a random variable (or random vector).

There are certain stochastic properties of estimators that quantify their quality. In this sense, an estimator is said to be unbiased if its expected value is the true parameter value, i.e., if

$$E_\theta\{\hat{\theta}^m(x^n)\} = \theta^m$$

where the subscript θ on the expectation symbol denotes that the expectation is taken according to the probability density function f(x^n | θ^m). In the case where the observation space is ℝ^n and the parameter is a scalar, this reads

$$E_\theta\{\hat{\theta}(x^n)\} = \int_{\mathbb{R}^n} \hat{\theta}(x^n)\, f(x^n \mid \theta)\, dx^n$$

The bias of the estimate is the Euclidean norm ‖θ^m − E_θ{θ̂^m(x^n)}‖; thus, the bias measures the distance between the expected value of the estimate and the true value of the parameter. Clearly, the estimator is unbiased when the bias is zero.

Usually it is of interest to know the conditional variance of an unbiased estimate. The bias of the estimate θ̂^m(x^n) and the conditional variance

$$E_\theta\left\{\left\|\hat{\theta}^m(x^n) - E_\theta\{\hat{\theta}^m(x^n)\}\right\|^2 \,\Big|\, \theta^m\right\}$$

generally represent a trade-off. Indeed, an unbiased estimate may induce a relatively large variance. On the other hand, the introduction of some low-level bias may then result in a significant reduction of the induced variance.
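The trade-off can be made tangible with a Monte Carlo comparison of the two classical variance estimators; this is a hypothetical illustration added here, not part of the original text. The 1/n estimator is biased but attains lower variance, and even lower mean-square error, than the unbiased 1/(n−1) estimator.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two estimators of sigma^2 from n Gaussian samples, over many trials.
n, trials, sigma2 = 10, 200_000, 4.0
x = rng.normal(scale=np.sqrt(sigma2), size=(trials, n))

s2_biased = x.var(axis=1, ddof=0)      # (1/n)     sum (x_i - mean)^2
s2_unbiased = x.var(axis=1, ddof=1)    # (1/(n-1)) sum (x_i - mean)^2

for name, est in [("biased", s2_biased), ("unbiased", s2_unbiased)]:
    bias = est.mean() - sigma2
    var = est.var()
    mse = np.mean((est - sigma2) ** 2)
    print(f"{name:9s} bias={bias:+.3f} var={var:.3f} mse={mse:.3f}")
# The biased estimator trades a small bias for lower variance (and lower MSE).
```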
In general, the bias versus variance trade-off should be studied carefully for the correct evaluation of any given parameter estimate. A parameter estimate is called efficient if the conditional variance equals a lower bound known as the Rao–Cramér bound, which it will be useful to present briefly. The Rao–Cramér bound gives a theoretical minimum for the variance of any estimate. More specifically, let θ̂(x^n) be the estimate of a scalar parameter θ given the observation vector x^n. Let f(x^n | θ) be given, twice continuously differentiable with respect to θ, and satisfy also some other mild regularity conditions. Then,

$$E_\theta\left\{\left[\hat{\theta}(x^n) - \theta\right]^2\right\} \geq \left(E_\theta\left\{\left[\frac{\partial \log f(x^n \mid \theta)}{\partial \theta}\right]^2\right\}\right)^{-1}$$

Sometimes we need to consider the case where the sample size n increases to infinity. In such a case, an estimator is said to be consistent if θ̂^m(x^n) → θ^m as n → ∞. Since the estimate θ̂^m(x^n) is a random variable, we have to specify in what sense the above limit holds. Thus, if it holds w.p. 1, we say that θ̂^m(x^n) is strongly consistent, or consistent w.p. 1. In a similar way we can define a weakly consistent estimator.

As far as the asymptotic distribution of θ̂(x^n) as n → ∞ is concerned, it turns out that the central limit theorem can often be applied to θ̂(x^n) to infer that √n [θ̂(x^n) − θ] is asymptotically normal with zero mean as n → ∞.

In order to examine certain parameter estimation schemes, we need first to present the definitions of some related functions. The penalty or cost function c[θ^m, θ̂^m(x^n)] is a scalar, nonnegative function whose values vary as θ^m varies in the parameter space ℰ^m and as the sequence x^n takes different values in the observation space Γ^n. The conditional expected penalty c(θ^m, θ̂^m) induced by the parameter estimate and the penalty function is a function defined as follows:

$$c(\theta^m, \hat{\theta}^m) = E_\theta\left\{c\left[\theta^m, \hat{\theta}^m(x^n)\right]\right\}$$

If an a priori density function p(θ^m) is available, then the expected penalty c(θ̂^m, p) can be evaluated. The various existing parameter estimation schemes evolve as the solutions of optimization problems whose objective function is either the conditional expected penalty or the conditional density function f(x^n | θ^m).

Bayesian Estimation Scheme

In the Bayesian estimation scheme the available assets are:

1. A parametrically known stochastic process parameterized by θ^m, in other words, a given conditional joint density function f(x^n | θ^m) defined on the observation space Γ^n, where θ^m is a well-defined parameter vector.
2. A realization x^n from the underlying active process, where the implied assumption is that the process remains unchanged throughout the whole observation period.
3. A density function p(θ^m) defined on the parameter space ℰ^m.
4. For each data sequence x^n, parameter vector θ^m, and parameter estimate θ̂^m(x^n), a given penalty scalar function c[θ^m, θ̂^m(x^n)].
5. A performance criterion, which is the minimization of the expected penalty c(θ̂^m, p).

The estimate θ̂₀^m that minimizes the expected penalty is called the optimal Bayesian estimate at p. Under some mild conditions, the optimal Bayesian estimate θ̂₀^m(x^n) is the conditional expectation E{θ^m | x^n}.
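For a squared-error penalty with a Gaussian prior on a Gaussian mean, the conditional expectation has a familiar closed form, sketched below as an illustration; the prior and noise parameters are arbitrary assumed values, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(4)

# Conjugate Gaussian model: theta ~ N(mu0, tau2), x_i | theta ~ N(theta, sigma2).
# Under squared-error penalty, the optimal Bayesian estimate is E{theta | x^n}.
mu0, tau2 = 0.0, 1.0
sigma2, n = 2.0, 20

theta = rng.normal(mu0, np.sqrt(tau2))
x = rng.normal(theta, np.sqrt(sigma2), size=n)

# Posterior mean: precision-weighted combination of prior mean and sample mean.
w = (n / sigma2) / (n / sigma2 + 1 / tau2)
theta_bayes = w * x.mean() + (1 - w) * mu0

print(theta, theta_bayes)   # the conditional mean shrinks x.mean() toward mu0
```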