Ch. 6 The Linear Model Under Ideal Conditions

The (multiple) linear model is used to study the relationship between a dependent variable ($Y$) and several independent variables ($X_1, X_2, \ldots, X_k$). That is,
$$Y = f(X_1, X_2, \ldots, X_k) + \varepsilon,$$
and, assuming $f$ is a linear function,
$$Y = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon = x'\beta + \varepsilon,$$
where $Y$ is the dependent or explained variable, $x = [X_1\ X_2\ \cdots\ X_k]'$ collects the independent or explanatory variables, and $\beta = [\beta_1\ \beta_2\ \cdots\ \beta_k]'$ is a vector of unknown coefficients that we are interested in learning about, either through estimation or through hypothesis testing. The term $\varepsilon$ is an unobservable random disturbance.

Suppose we have a sample of $T$ observations (allowing for non-random sampling)¹ on the scalar dependent variable $Y_t$ and the vector of explanatory variables $x_t = (X_{t1}, X_{t2}, \ldots, X_{tk})'$, i.e.
$$Y_t = x_t'\beta + \varepsilon_t, \qquad t = 1, 2, \ldots, T.$$
In matrix form, this relationship is written as
$$y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_T \end{bmatrix}
= \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & & \vdots \\ X_{T1} & X_{T2} & \cdots & X_{Tk} \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{bmatrix}
= \begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_T' \end{bmatrix}\beta + \varepsilon
= X\beta + \varepsilon,$$
where $y$ is a $T \times 1$ vector, $X$ is a $T \times k$ matrix with rows $x_t'$, and $\varepsilon$ is a $T \times 1$ vector with elements $\varepsilon_t$.

¹ Recall from Chapter 2 that we cannot postulate the probability model $\Phi$ if the sample is non-random. The probability model must then be defined in terms of the sample's joint distribution.
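As a quick illustration of the matrix form above, the following minimal sketch (Python with NumPy) simulates a data set from $y = X\beta + \varepsilon$. The sample size, the coefficient values, and the disturbance standard deviation are not from the text; they are chosen purely for illustration.

```python
import numpy as np

# A minimal simulation of the linear model y = X @ beta + eps (illustrative values).
rng = np.random.default_rng(0)
T, k = 50, 3                         # sample size and number of regressors (assumed)
X = np.column_stack([np.ones(T),     # constant term
                     rng.normal(size=(T, k - 1))])
beta = np.array([1.0, 0.5, -2.0])    # "true" coefficients, chosen for illustration
sigma = 1.5                          # disturbance standard deviation (assumed)
eps = rng.normal(scale=sigma, size=T)
y = X @ beta + eps                   # T x 1 vector of observations
print(y.shape, X.shape)              # (50,) (50, 3)
```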
Our goal is to regard the last equation as a parametric probability and sampling model, and to draw inferences about the unknown $\beta_i$'s and the parameters of the distribution of $\varepsilon$.

1 The Probability Model: Gauss Linear Model

Assume that $\varepsilon \sim N(0, \Sigma)$. If $X$ is not stochastic, then by results on "functions of random variables" (an $n \Rightarrow n$ transformation) we have
$$y \sim N(X\beta, \Sigma).$$
That is, we have specified a probability and sampling model for $y$:

(Probability and Sampling Model)
$$y \sim N\left(
\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & & \vdots \\ X_{T1} & X_{T2} & \cdots & X_{Tk} \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix},\
\begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1T} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2T} \\ \vdots & \vdots & & \vdots \\ \sigma_{T1} & \sigma_{T2} & \cdots & \sigma_T^2 \end{bmatrix}
\right)
\equiv N(X\beta, \Sigma).$$
That is, the sample joint density function is
$$f(y; \theta) = (2\pi)^{-T/2}|\Sigma|^{-1/2}\exp\!\left[-\tfrac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)\right],$$
where $\theta = (\beta_1, \beta_2, \ldots, \beta_k, \sigma_1^2, \sigma_{12}, \ldots, \sigma_T^2)'$. It is easily seen that the number of parameters in $\theta$ is larger than the sample size $T$. Therefore, some restrictions must be imposed on the probability and sampling model for the purpose of estimation, as we shall see in the sequel.

One kind of restriction on $\theta$ is that $\Sigma$ is a scalar matrix, $\Sigma = \sigma^2 I$. Under this restriction the log-likelihood becomes
$$\ln f(y; \theta) = -\tfrac{T}{2}\ln(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta),$$
so maximizing the likelihood with respect to $\beta$ is equivalent to minimizing $(y - X\beta)'(y - X\beta)$ ($= \varepsilon'\varepsilon = \sum_{t=1}^{T}\varepsilon_t^2$, a sum of squared errors); this equivalence constitutes the foundation of ordinary least squares estimation.

To summarize the discussion so far, we have made the following assumptions:

(a) The model $y = X\beta + \varepsilon$ is correct (no problem of model misspecification);
(b) $X$ is nonstochastic (regression originally comes from experimental science, where the regressors are fixed by design);

(c) $E(\varepsilon) = 0$ (this can easily be satisfied by including a constant in the regression);

(d) $Var(\varepsilon) = E(\varepsilon\varepsilon') = \sigma^2 \cdot I$ (the disturbances have the same variance and are not autocorrelated);

(e) $Rank(X) = k$ (for model identification);

(f) $\varepsilon$ is normally distributed.

The above six assumptions are usually called the classical ordinary least squares assumptions, or the ideal conditions.

2 Estimation: Ordinary Least Squares Estimator

2.1 Estimation of β

Let us first consider the ordinary least squares (OLS) estimator, which is the value of $\beta$ that minimizes the sum of squared errors, denoted SSE (or residuals; remember the principle of estimation in Ch. 3):
$$SSE(\beta) = (y - X\beta)'(y - X\beta) = \sum_{t=1}^{T}(Y_t - x_t'\beta)^2 = y'y - 2y'X\beta + \beta'X'X\beta.$$
The first order conditions for a minimum are
$$\frac{\partial SSE(\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0.$$
If $X'X$ is nonsingular (which is guaranteed by assumption (e) of the ideal conditions and Ch. 1, Sec. 3.5), this system of $k$ equations in $k$ unknowns can be uniquely solved for the ordinary least squares (OLS) estimator
$$\hat\beta = (X'X)^{-1}X'y = \Big[\sum_{t=1}^{T} x_t x_t'\Big]^{-1}\sum_{t=1}^{T} x_t y_t. \qquad (1)$$
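A minimal numerical sketch of equation (1), again with illustrative simulated data: the OLS estimator is computed from the normal equations and cross-checked against NumPy's built-in least squares routine.

```python
import numpy as np

# Compute the OLS estimator betahat = (X'X)^{-1} X'y on simulated data (illustrative).
rng = np.random.default_rng(0)
T, k = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta = np.array([1.0, 0.5, -2.0])
y = X @ beta + rng.normal(scale=1.5, size=T)

# Normal-equations form of equation (1); solve() is preferred to an explicit inverse.
betahat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
betahat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(betahat, betahat_lstsq))  # True
print(betahat)                              # close to the assumed beta
```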
To ensure that $\hat\beta$ is indeed a minimizer, we require that
$$\frac{\partial^2 SSE(\beta)}{\partial \beta\,\partial \beta'} = 2X'X$$
be a positive definite matrix. This condition is satisfied by assumption (e) and Ch. 1, Sec. 5.6.1.

Denote by $e$ the $T \times 1$ vector of least squares residuals,
$$e = y - X\hat\beta;$$
then it is obvious that
$$X'e = X'(y - X\hat\beta) = X'y - X'X(X'X)^{-1}X'y = 0, \qquad (2)$$
i.e., the regressors are orthogonal to the OLS residuals. Therefore, if one of the regressors is a constant term, the sum of the residuals is zero, since the first element of $X'e$ would be
$$[1\ 1\ \cdots\ 1]\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix} = \sum_{t=1}^{T} e_t = 0 \quad \text{(a scalar)}.$$

2.2 Estimation of σ²

At this point we have arrived at the following notation:
$$y = X\beta + \varepsilon = X\hat\beta + e.$$
To estimate the variance of $\varepsilon$, $\sigma^2$, a simple and intuitive idea is to use information from the sample residuals $e$.
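The orthogonality result (2), and the zero-sum property of the residuals when a constant is included, are easy to verify numerically; the sketch below uses the same kind of illustrative simulated data as before.

```python
import numpy as np

# Numerical check of equation (2): the OLS residuals are orthogonal to the regressors,
# so with a constant column the residuals also sum to zero (illustrative data).
rng = np.random.default_rng(0)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=T)

betahat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ betahat                  # least squares residuals

print(np.allclose(X.T @ e, 0.0))     # True: X'e = 0
print(abs(e.sum()) < 1e-10)          # True: constant regressor => residuals sum to zero
```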
Lemma: The matrix $M_X = I - X(X'X)^{-1}X'$ is symmetric and idempotent. Furthermore, $M_X X = 0$.

Lemma: $e = M_X y = M_X\varepsilon$. That is, we can interpret $M_X$ as a matrix that produces the vector of least squares residuals in the regression of $y$ on $X$.

Proof:
$$e = y - X\hat\beta = y - X(X'X)^{-1}X'y = (I - X(X'X)^{-1}X')y = M_X y = M_X X\beta + M_X\varepsilon = M_X\varepsilon.$$

Using the fact that $M_X$ is symmetric and idempotent, we have

Lemma: $e'e = \varepsilon'M_X'M_X\varepsilon = \varepsilon'M_X\varepsilon$.

Theorem 1: $E(e'e) = \sigma^2(T - k)$.

Proof:
$$\begin{aligned}
E(e'e) &= E(\varepsilon'M_X\varepsilon) \\
&= E[\operatorname{trace}(\varepsilon'M_X\varepsilon)] \quad \text{(since $\varepsilon'M_X\varepsilon$ is a scalar, it equals its trace)} \\
&= E[\operatorname{trace}(M_X\varepsilon\varepsilon')] \\
&= \operatorname{trace}[E(M_X\varepsilon\varepsilon')] \quad (\text{why?}) \\
&= \operatorname{trace}(M_X\,\sigma^2 I_T) \\
&= \sigma^2\,\operatorname{trace}(M_X),
\end{aligned}$$
but
$$\operatorname{trace}(M_X) = \operatorname{trace}(I_T) - \operatorname{trace}\big(X(X'X)^{-1}X'\big) = \operatorname{trace}(I_T) - \operatorname{trace}\big((X'X)^{-1}X'X\big) = T - k.$$

Corollary: An unbiased estimator of $\sigma^2$ is
$$s^2 = \frac{e'e}{T - k}.$$

Exercise: Reproduce the estimation results in Table 4.2, p. 52, for $\hat\beta$, $s^2(X'X)^{-1}$ and $e'e$.

2.3 Partitioned Regression Estimation

It is common to specify a multiple regression model when, in fact, interest centers on only one or a subset of the full set of variables. Letting $k_1 + k_2 = k$, we can express the OLS results in partitioned form as
$$y = X\hat\beta + e = [X_1\ X_2]\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} + e = X_1\hat\beta_1 + X_2\hat\beta_2 + e,$$
where $X_1$ and $X_2$ are $T \times k_1$ and $T \times k_2$, respectively, and $\hat\beta_1$ and $\hat\beta_2$ are $k_1 \times 1$ and $k_2 \times 1$, respectively.

What is the algebraic solution for $\hat\beta_2$? Denote $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$; then
$$M_1 y = M_1 X_1\hat\beta_1 + M_1 X_2\hat\beta_2 + M_1 e = M_1 X_2\hat\beta_2 + e,$$
using the fact that $M_1 X_1 = 0$ and $M_1 e = e$. Premultiplying the above equation by $X_2'$ and using the fact that
$$X'e = \begin{bmatrix} X_1' \\ X_2' \end{bmatrix} e = \begin{bmatrix} X_1'e \\ X_2'e \end{bmatrix} = 0,$$
we have
$$X_2'M_1 y = X_2'M_1 X_2\hat\beta_2 + X_2'e = X_2'M_1 X_2\hat\beta_2.$$
Therefore $\hat\beta_2$ can be expressed in isolation as
$$\hat\beta_2 = (X_2'M_1 X_2)^{-1}X_2'M_1 y = (X_2^{*\prime}X_2^{*})^{-1}X_2^{*\prime}y^{*},$$
where $X_2^{*} = M_1 X_2$ and $y^{*} = M_1 y$ are the residuals from the regressions of $X_2$ and $y$ on $X_1$, respectively.

Theorem 2 (Frisch-Waugh): The subvector $\hat\beta_2$ is the set of coefficients obtained when the residuals from a regression of $y$ on $X_1$ alone are regressed on the set of residuals obtained when each column of $X_2$ is regressed on $X_1$.

Example: Consider a simple regression with a constant; the slope estimator can then also be obtained from a regression on demeaned data without a constant. (A numerical sketch of Theorem 2 is given below.)

2.4 The Restricted Least Squares Estimators

Suppose that we explicitly impose the restrictions of a hypothesis on the regression (take the example of the LM test). The restricted least squares estimator is obtained as the solution to
$$\min_{\beta}\ SSE(\beta) = (y - X\beta)'(y - X\beta) \quad \text{subject to } R\beta = q,$$
where $R$ is a known $J \times k$ matrix and $q$ is the vector of values of these $J$ linear restrictions. For example, the single restriction $\beta_2 = \beta_3$ corresponds to $R = [0\ 1\ {-1}\ 0\ \cdots\ 0]$ and $q = 0$.
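Before deriving the restricted estimator, here is the numerical sketch of Theorem 2 (Frisch-Waugh) referred to above; the data, the split of the regressors into $X_1$ and $X_2$, and the coefficient values are all assumed for illustration.

```python
import numpy as np

# Numerical check of Theorem 2 (Frisch-Waugh) on simulated data (illustrative values):
# regressing M1*y on M1*X2 reproduces the subvector betahat_2 from the full regression.
rng = np.random.default_rng(0)
T = 60
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])   # T x k1
X2 = rng.normal(size=(T, 2))                             # T x k2
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 0.5, -2.0, 0.8]) + rng.normal(size=T)

betahat = np.linalg.solve(X.T @ X, X.T @ y)              # full regression
M1 = np.eye(T) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # residual-maker for X1
X2s, ys = M1 @ X2, M1 @ y                                # partialled-out X2 and y
betahat2 = np.linalg.solve(X2s.T @ X2s, X2s.T @ ys)

print(np.allclose(betahat2, betahat[2:]))                # True
```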
A Lagrangean function for this problem can be written as
$$L^{*}(\beta, \lambda) = (y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - q),$$
where $\lambda$ is $J \times 1$. The solutions $\hat\beta^{*}$ and $\hat\lambda$ will satisfy the necessary conditions
$$\frac{\partial L^{*}}{\partial \hat\beta^{*}} = -2X'(y - X\hat\beta^{*}) + 2R'\hat\lambda = 0,$$
$$\frac{\partial L^{*}}{\partial \hat\lambda} = 2(R\hat\beta^{*} - q) = 0 \qquad \Big(\text{remember } \frac{\partial a'x}{\partial x} = a\Big).$$
Dividing through by 2 and collecting terms produces the partitioned matrix equation
$$\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}\begin{bmatrix} \hat\beta^{*} \\ \hat\lambda \end{bmatrix} = \begin{bmatrix} X'y \\ q \end{bmatrix}, \quad \text{or} \quad W\hat d^{*} = v.$$
Assuming that the partitioned matrix in brackets is nonsingular, then
$$\hat d^{*} = W^{-1}v.$$
Using the partitioned inverse rule
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} A_{11}^{-1}(I + A_{12}F_2A_{21}A_{11}^{-1}) & -A_{11}^{-1}A_{12}F_2 \\ -F_2A_{21}A_{11}^{-1} & F_2 \end{bmatrix},$$
where $F_2 = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}$, we have the restricted least squares estimator
$$\hat\beta^{*} = \hat\beta - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta - q),$$
and
$$\hat\lambda = [R(X'X)^{-1}R']^{-1}(R\hat\beta - q).$$
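A sketch of the closed-form restricted estimator derived above, applied to illustrative simulated data and a hypothetical single restriction $\beta_2 + \beta_3 = 0$; it also previews the inequality $e^{*\prime}e^{*} \ge e'e$ derived below as (3).

```python
import numpy as np

# Sketch of the restricted least squares formula (the data and the restriction R*beta = q
# are hypothetical, chosen only to illustrate the closed form derived above).
rng = np.random.default_rng(0)
T, k = 60, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0, 0.8]) + rng.normal(size=T)

# One linear restriction (J = 1): beta_2 + beta_3 = 0, written as R beta = q.
R = np.array([[0.0, 1.0, 1.0, 0.0]])
q = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y                                  # unrestricted OLS
A = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T)
betahat_r = betahat - A @ (R @ betahat - q)                  # restricted estimator

print(np.allclose(R @ betahat_r, q))                         # True: restriction holds
e, e_r = y - X @ betahat, y - X @ betahat_r
print(e @ e <= e_r @ e_r)                                    # True: restricted SSE is larger
```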
Exercise: Show that $Var(\hat\beta^{*}) - Var(\hat\beta)$ is a nonpositive definite matrix.

The above result holds whether or not the restrictions are true. One way to interpret this reduction in variance is as the value of the information contained in the restrictions. See Table 6.2 on p. 103.

Let $e^{*}$ equal $y - X\hat\beta^{*}$, i.e., the vector of residuals from the restricted least squares estimator. Then, using the familiar device,
$$e^{*} = y - X\hat\beta - X(\hat\beta^{*} - \hat\beta) = e - X(\hat\beta^{*} - \hat\beta).$$
The 'restricted' sum of squared residuals is
$$e^{*\prime}e^{*} = e'e + (\hat\beta^{*} - \hat\beta)'X'X(\hat\beta^{*} - \hat\beta) \ge e'e, \qquad (3)$$
since $X'X$ is a positive definite matrix.

2.5 Measurement of Goodness of Fit

Denote the dependent variable's 'fitted value', obtained from the explanatory variables and the OLS estimator, by $\hat y = X\hat\beta$, so that
$$y = \hat y + e.$$

Lemma: $e'e = y'y - \hat y'\hat y$.

Proof: Using the fact that $X'y = X'X\hat\beta$, we have
$$e'e = y'y - 2\hat\beta'X'y + \hat\beta'X'X\hat\beta = y'y - \hat y'\hat y.$$

Three measures of variation are defined as follows:

(a). SST (Sum of Squared Total variation) $= \sum_{t=1}^{T}(Y_t - \bar Y)^2 = y'M^0 y$,

(b). SSR (Sum of Squared Regression variation) $= \sum_{t=1}^{T}(\hat Y_t - \bar{\hat Y})^2 = \hat y'M^0\hat y$,
(c). SSE (Sum of Squared Error variation) $= \sum_{t=1}^{T}(Y_t - \hat Y_t)^2 = e'e$,

where $\bar Y = \frac{1}{T}\sum_{t=1}^{T} Y_t$ and $\bar{\hat Y} = \frac{1}{T}\sum_{t=1}^{T}\hat Y_t$.

Lemma: If one of the regressors is a constant, then $\bar Y = \bar{\hat Y}$.

Proof: Writing
$$y = \hat y + e = X\hat\beta + e = [i\ X_2]\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} + e = i\hat\beta_1 + X_2\hat\beta_2 + e,$$
where $i$ is a column of ones, and using the fact that $i'e = 0$, we obtain the result.

Lemma: If one of the regressors is a constant, then $SST = SSR + SSE$.

Proof: Premultiplying $y = \hat y + e$ by $M^0$, we have
$$M^0 y = M^0\hat y + M^0 e = M^0\hat y + e,$$
since $M^0 e = e$ (why?). Therefore,
$$y'M^0 y = \hat y'M^0\hat y + 2\hat y'M^0 e + e'e = \hat y'M^0\hat y + e'e = SSR + SSE,$$
using the fact that $\hat y'M^0 e = \hat\beta'X'M^0 e = \hat\beta'X'e = 0$.

Definition: If one of the regressors is a constant, the coefficient of determination is defined as
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.$$
From (3) we know that $e^{*\prime}e^{*} \ge e'e$. One kind of restriction is of the form $R\beta = 0$, and we may think of it as a model with fewer regressors (but with the same dependent variable). It is apparent that the coefficient of determination from this restricted model, say $R^{2*}$, is smaller.
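A final sketch, with illustrative simulated data, verifying the decomposition SST = SSR + SSE when a constant is included and showing that dropping regressors (a restriction of the form $R\beta = 0$) lowers $R^2$.

```python
import numpy as np

# Sketch of the R^2 decomposition on simulated data (illustrative values): with a constant
# regressor, SST = SSR + SSE, and dropping regressors can only lower R^2.
rng = np.random.default_rng(0)
T = 60
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 0.8]) + rng.normal(size=T)

def r_squared(X, y):
    betahat = np.linalg.solve(X.T @ X, X.T @ y)
    yhat = X @ betahat
    e = y - yhat
    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((yhat - yhat.mean()) ** 2)
    sse = e @ e
    assert np.isclose(sst, ssr + sse)        # SST = SSR + SSE (constant included)
    return 1.0 - sse / sst

print(r_squared(X, y))                       # full model
print(r_squared(X[:, :2], y))                # fewer regressors: smaller R^2
```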