Ch. 3 Estimation

1 The Nature of Statistical Inference

It is argued that it is important to develop a mathematical model purporting to provide a generalized description of the data generating process. A probability model in the form of the parametric family of density functions Φ = {f(x; θ), θ ∈ Θ}, together with its various ramifications formulated in the last chapter, provides such a mathematical model. By postulating Φ as a probability model for the distribution of the observations of interest, we can go on to consider questions about the unknown parameters θ (via estimation and hypothesis tests) as well as about further observations from the probability model (prediction).

In the next section the important concept of a sampling model is introduced as a way to link the probability model postulated, say Φ = {f(x; θ), θ ∈ Θ}, to the observed data x ≡ (x1, ..., xn)′ available. The sampling model provides the second important ingredient needed to define a statistical model, the starting point of any "parametric" statistical inference.

In short, a statistical model is defined as comprising

(a). a probability model Φ = {f(x; θ), θ ∈ Θ}; and
(b). a sampling model x ≡ (X1, ..., Xn)′.
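For concreteness, the simplest such pair (given here purely as an added illustration; it is not part of the original text) is the Bernoulli model for n independent coin tosses:

\[
\text{(a)}\quad \Phi = \bigl\{\, f(x;\theta) = \theta^{x}(1-\theta)^{1-x},\ x = 0,1,\ \theta \in [0,1] \,\bigr\};
\qquad
\text{(b)}\quad x \equiv (X_1, \ldots, X_n)' ,
\]

where the Xi are independent tosses, each with density f(x; θ). The probability model describes a single toss; the sampling model describes how the n recorded tosses are assumed to be related to it.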
The concept of a statistical model provides the starting point of all forms of statistical inference to be considered in the sequel. To be more precise, the concept of a statistical model forms the basis of what is known as parametric inference. There is also a branch of statistical inference known as non-parametric inference, where no Φ is assumed a priori.

1.1 The sampling model

A sampling model is introduced as a way to link the probability model postulated, say Φ = {f(x; θ), θ ∈ Θ}, and the observed data x ≡ (x1, ..., xn)′ available. It is designed to model the relationship between them and refers to the way the observed data can be viewed in relation to Φ.
Definition 1:
A sample is defined to be a set of random variables (X1, X2, ..., Xn) whose density functions coincide with the "true" density function f(x; θ0) as postulated by the probability model.

Data are generally drawn in one of two settings. A cross-section sample is a sample of a number of observational units all drawn at the same point in time. A time-series sample is a set of observations drawn on the same observational unit at a number of (usually evenly spaced) points in time. Many recent studies have been based on time-series cross-sections, which generally consist of the same cross-section observed at several points in time. The term panel data set is usually fitting for this sort of study.

Given that a sample is a set of r.v.'s related to Φ, it must have a distribution, which we call the distribution of the sample.

Definition 2:
The distribution of the sample x ≡ (X1, X2, ..., Xn)′ is defined to be the joint distribution of the r.v.'s X1, X2, ..., Xn, denoted by

\[
f_{x}(x_1, \ldots, x_n; \theta) \equiv f(x; \theta).
\]

The distribution of the sample incorporates both forms of relevant information, the probability as well as the sample information. It should come as no surprise to learn that f(x; θ) plays a very important role in statistical inference. The form of f(x; θ) depends crucially on the nature of the sampling model as well as on Φ. The simplest but most widely used form of a sampling model is the one based on the idea of a random experiment E and is called a random sample.

Definition 3:
A set of random variables (X1, X2, ..., Xn) is called a random sample from f(x; θ) if the r.v.'s X1, X2, ..., Xn are independently and identically distributed (i.i.d.).
In this case the distribution of the sample takes the form

\[
f(x_1, \ldots, x_n; \theta) \;=\; \prod_{i=1}^{n} f(x_i; \theta) \;=\; \bigl[f(x;\theta)\bigr]^{n},
\]

the first equality being due to independence and the second due to the fact that the r.v.'s are identically distributed.

A less restrictive form of sampling model is what we call an independent sample, where the identically distributed condition of the random sample is relaxed.

Definition 4:
A set of random variables (X1, X2, ..., Xn) is said to be an independent sample from f(xi; θi), i = 1, 2, ..., n, respectively, if the r.v.'s X1, X2, ..., Xn are independent. In this case the distribution of the sample takes the form

\[
f(x_1, \ldots, x_n; \theta) \;=\; \prod_{i=1}^{n} f(x_i; \theta_i).
\]

Usually the density functions f(xi; θi), i = 1, 2, ..., n, belong to the same family, but their numerical characteristics (moments, etc.) may differ.

If we relax the independence assumption as well, we have what we can call a non-random sample.

Definition 5:
A set of random variables (X1, X2, ..., Xn) (which, from a multivariate point of view, must be regarded as a sample of size 'one') is said to be a non-random sample from f(x1, x2, ..., xn; θ) if the r.v.'s X1, X2, ..., Xn are non-i.i.d. In this case the only decomposition of the distribution of the sample possible is

\[
f(x_1, \ldots, x_n; \theta) \;=\; \prod_{i=1}^{n} f(x_i \mid x_1, \ldots, x_{i-1}; \theta_i), \quad \text{given } x_0,
\]

where f(xi | x1, ..., xi−1; θi), i = 1, 2, ..., n, represents the conditional distribution of Xi given X1, X2, ..., Xi−1.
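As an added illustration of Definition 3 (this example is not in the original text; it uses the Normal density that reappears in the estimation example of Section 2), suppose f(x; θ) = (1/√(2π)) exp{−½(x − θ)²}, θ ∈ R. For a random sample the joint density factorizes into a product of identical marginals,

\[
f(x_1, \ldots, x_n; \theta) \;=\; \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\exp\Bigl\{-\tfrac{1}{2}(x_i-\theta)^{2}\Bigr\}
\;=\; (2\pi)^{-n/2}\exp\Bigl\{-\tfrac{1}{2}\sum_{i=1}^{n}(x_i-\theta)^{2}\Bigr\},
\]

whereas for a non-random sample no such simplification is available and the joint density has to be specified directly (or decomposed sequentially, as in Definition 5).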
In the context of statistical inference we need to postulate both a probability model and a sampling model, and thus we define a statistical model as comprising both.

Definition 6:
A statistical model is defined as comprising
(a). a probability model Φ = {f(x; θ), θ ∈ Θ}; and
(b). a sampling model x ≡ (X1, X2, ..., Xn)′.

It must be emphasized that the two important components of a statistical model, the probability and sampling models, are clearly interrelated. For example, we cannot postulate the probability model Φ = {f(x; θ), θ ∈ Θ} if the sample x is non-random. This is because if the r.v.'s X1, X2, ..., Xn are not independent, the probability model must be defined in terms of their joint distribution, i.e. Φ = {f(x1, x2, ..., xn; θ), θ ∈ Θ} (for example, stock prices). Moreover, in the case of an independent but not identically distributed sample we need to specify the individual density functions for each r.v. in the sample, i.e. Φ = {fk(xk; θ), θ ∈ Θ, k = 1, 2, ..., n}. The most important implication of this relationship is that when the sampling model postulated is found to be inappropriate, the probability model has to be re-specified as well.

1.2 An overview of statistical inference

The statistical model in conjunction with the observed data enables us to consider the following questions:

(A). Are the observed data consistent with the postulated statistical model? (model misspecification)

(B). Assuming that the postulated statistical model is consistent with the observed data, what can we infer about the unknown parameter θ ∈ Θ?

(a). Can we decrease the uncertainty about θ by reducing the parameter space from Θ to Θ0, where Θ0 is a subset of Θ? (confidence estimation)

(b). Can we decrease the uncertainty about θ by choosing a particular value in Θ, say θ̂, as providing the most representative value of θ? (point estimation)
(c). Can we consider the question of whether θ belongs to some subset Θ0 of Θ? (hypothesis testing)

(C). Assuming that a particular representative value θ̂ of θ has been chosen, what can we infer about further observations from the data generating process (DGP) as described by the postulated statistical model? (prediction)
2 Point Estimation

(Point) Estimation refers to our attempt to give a numerical value to θ. Let (S, F, P(·)) be the probability space of reference, with X a r.v. defined on this space. The following statistical model is postulated:

(i) Φ = {f(x; θ), θ ∈ Θ}, Θ ⊆ R;
(ii) x ≡ (X1, X2, ..., Xn)′ is a random sample from f(x; θ).

Estimation in the context of this statistical model takes the form of constructing a mapping h(·) : X → Θ, where X is the observation space and h(·) is a Borel function. The composite function (a statistic) θ̂ ≡ h(x) : S → Θ is called an estimator, and its value h(x), x ∈ X, an estimate of θ. It is important to distinguish between the two because the former is a random variable and the latter is a real number.

Example:
Let f(x; θ) = (1/√(2π)) exp{−½(x − θ)²}, θ ∈ R, and let x be a random sample from f(x; θ). Then X = R^n and the following functions define estimators of θ:

1. θ̂1 = (1/n) Σ_{i=1}^{n} Xi;
2. θ̂2 = (1/k) Σ_{i=1}^{k} Xi, k = 1, 2, ..., n − 1;
3. θ̂3 = (1/n)(X1 + Xn).

It is obvious that we can construct infinitely many such estimators. However, constructing "good" estimators is not so obvious. It is clear that we need some criteria to choose between these estimators. In other words, we need to formalize what we mean by a "good" estimator.
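To see that these are genuinely different random variables, it may help to simulate them. The following sketch is not part of the original notes; the true value θ = 2, the sample size n = 50, the choice k = 10 for θ̂2, and the number of replications are arbitrary illustrative settings. It draws repeated random samples from f(x; θ) and evaluates the three estimators on each:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, k, n_reps = 2.0, 50, 10, 10_000  # true parameter, sample size, k for theta_hat_2, replications

    est1 = np.empty(n_reps)  # theta_hat_1 = (1/n) * sum of all n observations
    est2 = np.empty(n_reps)  # theta_hat_2 = (1/k) * sum of the first k observations
    est3 = np.empty(n_reps)  # theta_hat_3 = (1/n) * (X_1 + X_n), as defined in the example above
    for r in range(n_reps):
        x = rng.normal(loc=theta, scale=1.0, size=n)  # one random sample from N(theta, 1)
        est1[r] = x.mean()
        est2[r] = x[:k].mean()
        est3[r] = (x[0] + x[-1]) / n

    # Each estimator is a random variable: its value changes from sample to sample,
    # so each has its own (sampling) distribution, summarized crudely here.
    for name, est in [("theta_hat_1", est1), ("theta_hat_2", est2), ("theta_hat_3", est3)]:
        print(f"{name}: mean = {est.mean():.3f}, std = {est.std():.3f}")

Across replications the three estimators behave quite differently; the properties introduced next formalize how such differences can be judged.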
2.1 Finite sample properties of estimators

2.1.1 Unbiasedness

An estimator is constructed with the sole aim of providing us with the "most representative value" of θ in the parameter space Θ, based on the available information in the form of the statistical model. Given that the estimator θ̂ = h(x) is a r.v. (being a Borel function of the random vector x), any notion of what we mean by a "most representative value" must be expressed in terms of the distribution of θ̂, say f(θ̂).
The obvious property to require a 'good' estimator θ̂ of θ to satisfy is that f(θ̂) is centered around θ.

Definition 7:
An estimator θ̂ of θ is said to be an unbiased estimator of θ if

\[
E(\hat{\theta}) \;=\; \int_{-\infty}^{\infty} \hat{\theta}\, f(\hat{\theta})\, d\hat{\theta} \;=\; \theta.
\]

That is, the distribution of θ̂ has mean equal to the unknown parameter to be estimated.

Note that an alternative, but equivalent, way to define E(θ̂) is

\[
E(\hat{\theta}) \;=\; \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} h(x)\, f(x;\theta)\, dx,
\]

where f(x; θ) = f(x1, x2, ..., xn; θ) is the distribution of the sample x.

It must be remembered that unbiasedness is a property based on the distribution of θ̂. This distribution is often called the sampling distribution of θ̂, in order to distinguish it from any other distribution of a function of r.v.'s.
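As a quick check (added here for illustration; the calculation is not in the original text), the first two estimators in the example of Section 2 are unbiased: since each Xi has E(Xi) = θ,

\[
E(\hat{\theta}_1) \;=\; \frac{1}{n}\sum_{i=1}^{n} E(X_i) \;=\; \theta,
\qquad
E(\hat{\theta}_2) \;=\; \frac{1}{k}\sum_{i=1}^{k} E(X_i) \;=\; \theta .
\]

Unbiasedness alone therefore cannot tell us which of the two to prefer, which motivates the next property.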
2.1.2 Efficiency

Although unbiasedness seems at first sight to be a highly desirable property, it turns out that in most situations there are too many unbiased estimators for this property to be used as the sole criterion for judging estimators. The question which naturally arises is: "How can we choose among unbiased estimators?". Given that the variance is a measure of dispersion, intuition suggests that the estimator with the smallest variance is in a sense better, because its distribution is more 'concentrated' around θ.
Definition 8:
An unbiased estimator θ̂ of θ is said to be relatively more efficient than some other unbiased estimator θ̃ if Var(θ̂) < Var(θ̃).
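For illustration (again not part of the original text), in the Normal example of Section 2 both θ̂1 and θ̂2 are unbiased, and since the Xi are independent with variance 1,

\[
Var(\hat{\theta}_1) \;=\; \frac{1}{n},
\qquad
Var(\hat{\theta}_2) \;=\; \frac{1}{k}, \quad k = 1, 2, \ldots, n-1,
\]

so Var(θ̂1) < Var(θ̂2): θ̂1 is relatively more efficient than θ̂2.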
The variance of an unbiased estimator cannot, however, be made arbitrarily small. The Cramér-Rao inequality provides a lower bound: for an estimator θ* of θ,

\[
Var(\theta^{*}) \;\ge\; \Bigl[\frac{\partial E(\theta^{*})}{\partial \theta}\Bigr]^{2}
\biggl[\, E\Bigl(\frac{\partial \log f(x;\theta)}{\partial \theta}\Bigr)^{2}\biggr]^{-1},
\]

provided the following regularity conditions hold:

(a). The set {x : f(x; θ) > 0} does not depend on θ;
(b). For each θ ∈ Θ the derivatives ∂^i log f(x; θ)/∂θ^i, i = 1, 2, 3, exist for all x ∈ X;

(c). 0 < E[(∂/∂θ) log f(x; θ)]² < ∞ for all θ ∈ Θ.

In the case of unbiased estimators the inequality takes the form

\[
Var(\theta^{*}) \;\ge\; \biggl[\, E\Bigl(\frac{\partial \log f(x;\theta)}{\partial \theta}\Bigr)^{2}\biggr]^{-1};
\]

the inverse of the lower bound is called Fisher's information number and is denoted by In(θ). (It must be borne in mind that In(θ) is a function of the sample size n.)
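As an added illustration (not from the original notes), in the Normal random-sample example of Section 2 we have log f(x; θ) = −(n/2) log(2π) − ½ Σ_{i=1}^{n} (xi − θ)², so

\[
\frac{\partial \log f(x;\theta)}{\partial \theta} \;=\; \sum_{i=1}^{n}(x_i - \theta),
\qquad
I_n(\theta) \;=\; E\Bigl[\Bigl(\sum_{i=1}^{n}(X_i - \theta)\Bigr)^{2}\Bigr] \;=\; n ,
\]

so that the lower bound for an unbiased estimator is 1/n. Since Var(θ̂1) = 1/n, the sample mean attains this bound.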
Definition 10 (multi-parameter Cramér-Rao theorem):
An unbiased estimator θ̂ of θ is said to be fully efficient if

\[
Var(\hat{\theta}) \;=\;
\biggl[\, E\Bigl(\frac{\partial \log f(x;\theta)}{\partial \theta}\,
\frac{\partial \log f(x;\theta)}{\partial \theta'}\Bigr)\biggr]^{-1}
\;=\;
\biggl[\, E\Bigl(-\frac{\partial^{2} \log f(x;\theta)}{\partial \theta\, \partial \theta'}\Bigr)\biggr]^{-1},
\]

where

\[
I_n(\theta) \;=\; E\Bigl(\frac{\partial \log f(x;\theta)}{\partial \theta}\,
\frac{\partial \log f(x;\theta)}{\partial \theta'}\Bigr)
\;=\; E\Bigl(-\frac{\partial^{2} \log f(x;\theta)}{\partial \theta\, \partial \theta'}\Bigr)
\]

is called the sample information matrix.

Proof (for the case that θ is 1 × 1): Given that f(x1, x2, ..., xn; θ) is the joint density function of the sample, it possesses the property that

\[
\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n; \theta)\, dx_1 \ldots dx_n \;=\; 1,
\]

or, more compactly,

\[
\int_{-\infty}^{\infty} f(x;\theta)\, dx \;=\; 1.
\]
If we assume that the domain of x is independent of θ (condition (a); this permits straightforward differentiation inside the integral sign) and that the derivative ∂f(·)/∂θ exists, then differentiating the above equation with respect to θ gives

\[
\int_{-\infty}^{\infty} \frac{\partial f(x;\theta)}{\partial \theta}\, dx \;=\; 0. \qquad (1)
\]

This equation can be re-expressed as

\[
\int_{-\infty}^{\infty} \frac{\partial \ln f(x;\theta)}{\partial \theta}\, f(x;\theta)\, dx \;=\; 0
\qquad \Bigl(\text{since } \tfrac{d}{dt}\ln f(t) = \tfrac{f'(t)}{f(t)}\Bigr).
\]

Therefore, it simply states that

\[
E\Bigl[\frac{\partial \ln f(x;\theta)}{\partial \theta}\Bigr] \;=\; 0,
\]

i.e. the expectation of the derivative of the natural logarithm of the likelihood function of a random sample from a regular density is zero.

Likewise, differentiating (1) with respect to θ again gives

\[
0 \;=\; \int_{-\infty}^{\infty} \frac{\partial^{2} \ln f(x;\theta)}{\partial \theta^{2}}\, f(x;\theta)\, dx
\;+\; \int_{-\infty}^{\infty} \frac{\partial \ln f(x;\theta)}{\partial \theta}\,\frac{\partial f(x;\theta)}{\partial \theta}\, dx
\;=\; \int_{-\infty}^{\infty} \frac{\partial^{2} \ln f(x;\theta)}{\partial \theta^{2}}\, f(x;\theta)\, dx
\;+\; \int_{-\infty}^{\infty} \Bigl(\frac{\partial \ln f(x;\theta)}{\partial \theta}\Bigr)^{2} f(x;\theta)\, dx.
\]

That is, recalling that E[∂ ln f(x; θ)/∂θ] = 0,

\[
Var\Bigl(\frac{\partial \ln f(x;\theta)}{\partial \theta}\Bigr) \;=\; -\, E\Bigl[\frac{\partial^{2} \ln f(x;\theta)}{\partial \theta^{2}}\Bigr].
\]

Now consider the estimator h(x) of θ, whose expectation is

\[
E(h(x)) \;=\; \int h(x)\, f(x;\theta)\, dx. \qquad (2)
\]

Differentiating (2) with respect to θ we obtain

\[
\frac{\partial E(h(x))}{\partial \theta}
\;=\; \int h(x)\, \frac{\partial f(x;\theta)}{\partial \theta}\, dx
\;=\; \int h(x)\, \frac{\partial \ln f(x;\theta)}{\partial \theta}\, f(x;\theta)\, dx
\;=\; cov\Bigl(h(x),\ \frac{\partial \ln f(x;\theta)}{\partial \theta}\Bigr)
\qquad \Bigl(\text{since } E\Bigl[\frac{\partial \ln f(x;\theta)}{\partial \theta}\Bigr] = 0\Bigr).
\]