Ch. 7 Violations of the Ideal Conditions

1 Specification

1.1 Selection of Variables

Consider an initial model, which we assume to be
$$Y = X_1\beta_1 + \varepsilon.$$
It is not unusual to begin with some such formulation and then contemplate adding more variables (regressors) to the model:
$$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon.$$
Let $R_1^2$ be the R-square of the model with fewer regressors and $R_{12}^2$ be the R-square of the model with more regressors. As we have shown earlier, $R_{12}^2 \geq R_1^2$. Clearly, it would be possible to push $R^2$ as high as desired simply by adding regressors. This problem motivates the use of the adjusted R-square,
$$\bar{R}^2 = 1 - \frac{T-1}{T-k}(1 - R^2).$$
It has been suggested that the adjusted R-square does not penalize the loss of degrees of freedom heavily enough. Two alternatives that have been proposed for comparing models are
$$\tilde{R}_j^2 = 1 - \frac{T+k_j}{T-k_j}(1 - R_j^2)$$
and Akaike's information criterion,
$$AIC_j = \ln\frac{e_j'e_j}{T} + \frac{2k_j}{T} = \ln\hat{\sigma}_j^2 + \frac{2k_j}{T}.$$
Although intuitively appealing, these measures are a bit unorthodox in that they have no firm basis in theory (unless they are used for time series model selection). Perhaps a somewhat more palatable alternative is the method of stepwise regression; however, economists have tended to avoid stepwise regression because it breaks down the usual inference procedures.
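As a concrete illustration, here is a minimal numerical sketch of these criteria (Python with numpy; the design, sample size, and coefficient values are my own illustrative assumptions, not from the notes): adding pure-noise regressors raises $R^2$, but need not improve the adjusted $R^2$ or the AIC.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])  # relevant: constant + one regressor
X2 = rng.normal(size=(T, 3))                            # three pure-noise regressors
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T)

def criteria(X, y):
    """Return R^2, adjusted R^2, and AIC for an OLS fit of y on X."""
    T, k = X.shape
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]    # OLS residuals
    R2 = 1 - (e @ e) / np.sum((y - y.mean()) ** 2)
    R2_adj = 1 - (T - 1) / (T - k) * (1 - R2)
    AIC = np.log(e @ e / T) + 2 * k / T
    return R2, R2_adj, AIC

print(criteria(X1, y))                    # smaller model
print(criteria(np.hstack([X1, X2]), y))   # R^2 rises mechanically; adj. R^2 / AIC need not
```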
1.2 Omission of Relevant Variables

Suppose that a correctly specified regression model would be
$$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$
where the two parts of $X$ have $k_1$ and $k_2$ columns, respectively. If we regress $Y$ on $X_1$ without including $X_2$, that is, if we estimate the model
$$Y = X_1\beta_1 + \varepsilon,$$
we obtain the estimator
$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'Y = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon.$$
Taking the expectation, we see that unless $X_1'X_2 = 0$ or $\beta_2 = 0$, $\hat{\beta}_1$ is biased:
$$E(\hat{\beta}_1) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2.$$
The variance of $\hat{\beta}_1$ is
$$Var(\hat{\beta}_1) = \sigma^2(X_1'X_1)^{-1}.$$
If we had computed the correct regression, including $X_2$, then the slope estimator on $X_1$, denoted $\hat{\beta}_{12}$, would have a covariance matrix equal to the upper-left block of $\sigma^2(X'X)^{-1}$, i.e.
$$Var(\hat{\beta}) = \begin{pmatrix} Var(\hat{\beta}_{12}) & * \\ * & Var(\hat{\beta}_{22}) \end{pmatrix} = \sigma^2(X'X)^{-1} = \sigma^2\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1} = \begin{pmatrix} \sigma^2[X_1'X_1 - X_1'X_2(X_2'X_2)^{-1}X_2'X_1]^{-1} & * \\ * & * \end{pmatrix},$$
or
$$Var(\hat{\beta}_{12}) = \sigma^2[X_1'X_1 - X_1'X_2(X_2'X_2)^{-1}X_2'X_1]^{-1}.$$
We can compare the covariance matrices of $\hat{\beta}_1$ and $\hat{\beta}_{12}$ more easily by comparing their inverses:
$$Var(\hat{\beta}_1)^{-1} - Var(\hat{\beta}_{12})^{-1} = (1/\sigma^2)\,X_1'X_2(X_2'X_2)^{-1}X_2'X_1,$$
which is nonnegative definite. We conclude that although $\hat{\beta}_1$ is biased, it has a smaller variance than $\hat{\beta}_{12}$.
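Before turning to the lemma behind this claim, here is a small Monte Carlo sketch of both results, the bias of the short regression and its smaller variance (all parameter values and the correlated design are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
T, reps = 100, 5000
beta1, beta2 = 1.0, 0.5
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=T)
x1, x2 = X[:, 0], X[:, 1]            # fixed, correlated regressors (X1'X2 != 0)
XX = np.column_stack([x1, x2])

b_short, b_long = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=T)
    b_short[r] = (x1 @ y) / (x1 @ x1)                      # omit the relevant x2
    b_long[r] = np.linalg.lstsq(XX, y, rcond=None)[0][0]   # correct regression

# Bias matches beta1 + (x1'x1)^{-1} x1'x2 beta2; short variance is smaller.
print("short mean:", b_short.mean(),
      "predicted:", beta1 + (x1 @ x2) / (x1 @ x1) * beta2)
print("variances (short < long):", b_short.var(), b_long.var())
```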
Lemma: Let $A$ be a positive definite $(n \times n)$ matrix and let $B$ denote any nonzero $(n \times m)$ matrix. Then $B'AB$ is nonnegative definite.

Proof: Let $x$ be any nonzero vector and define $\tilde{x} \equiv Bx$. Then $\tilde{x}$ can be any vector, including the zero vector, and
$$x'B'ABx = \tilde{x}'A\tilde{x} \geq 0$$
from the positive definiteness of $A$.

For statistical inference, it would be necessary to estimate $\sigma^2$. Proceeding as usual, we would use
$$s^2 = \frac{e_1'e_1}{T - k_1}.$$
But, since $M_1X_1 = 0$,
$$e_1 = M_1Y = M_1(X_1\beta_1 + X_2\beta_2 + \varepsilon) = M_1X_2\beta_2 + M_1\varepsilon.$$
Thus,
$$E[e_1'e_1] = \beta_2'X_2'M_1X_2\beta_2 + \sigma^2\,tr(M_1) = \beta_2'X_2'M_1X_2\beta_2 + \sigma^2(T - k_1).$$
It is simple to see that $\beta_2'X_2'M_1X_2\beta_2$ is positive (how?), so $s^2$ is biased upward. The conclusion is that if we omit relevant variables from the regression, then our estimates of both $\beta_1$ and $\sigma^2$ are biased, although it is possible that $\hat{\beta}_1$ is more precise than $\hat{\beta}_{12}$.
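A numerical check of the upward bias of $s^2$, under the same kind of illustrative setup (one included regressor, $k_1 = 1$, true $\sigma^2 = 1$; all values assumed for the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
T, reps, beta2 = 100, 5000, 0.5
x1 = rng.normal(size=T)
x2 = 0.6 * x1 + rng.normal(size=T)         # correlated with x1

M1x2 = x2 - x1 * (x1 @ x2) / (x1 @ x1)     # M1 x2: residual of x2 on x1
s2 = np.empty(reps)
for r in range(reps):
    y = 1.0 * x1 + beta2 * x2 + rng.normal(size=T)   # true sigma^2 = 1
    e1 = y - x1 * (x1 @ y) / (x1 @ x1)               # short-regression residuals
    s2[r] = (e1 @ e1) / (T - 1)                      # k1 = 1

print("mean of s^2                 :", s2.mean())
print("1 + b2' X2'M1X2 b2 / (T-k1) :", 1 + beta2**2 * (M1x2 @ M1x2) / (T - 1))
```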
1.3 Inclusion of Irrelevant Variables

If the correct regression model is
$$Y = X_1\beta_1 + \varepsilon,$$
and we estimate it by
$$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$
then from the partitioned regression estimator we obtain
$$\hat{\beta}_1 = (X_1'M_2X_1)^{-1}X_1'M_2Y = (X_1'M_2X_1)^{-1}X_1'M_2(X_1\beta_1 + \varepsilon) = \beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2\varepsilon,$$
and
$$\hat{\beta}_2 = (X_2'M_1X_2)^{-1}X_2'M_1Y = (X_2'M_1X_2)^{-1}X_2'M_1(X_1\beta_1 + \varepsilon) = 0 + (X_2'M_1X_2)^{-1}X_2'M_1\varepsilon.$$
Therefore, $E(\hat{\beta}_1) = \beta_1$ and $E(\hat{\beta}_2) = 0$.

Exercise: Show that $s^2$ is unbiased:
$$E\left(\frac{e'e}{T - k_1 - k_2}\right) = \sigma^2.$$

Then what's the problem? It would seem that one would generally want to "overfit" the model. However, the cost is a reduction in the precision of the estimates: as we have seen, the covariance matrix of the estimator in the shorter regression is never larger than the covariance matrix of the estimator obtained in the presence of the superfluous variables, as the sketch below illustrates.
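A sketch of this precision cost (illustrative design; here $x_2$ is truly irrelevant, so both estimators are unbiased, but the coefficient on $x_1$ is estimated less precisely when $x_2$ is included):

```python
import numpy as np

rng = np.random.default_rng(3)
T, reps = 50, 5000
x1 = rng.normal(size=T)
x2 = 0.8 * x1 + rng.normal(size=T)       # superfluous but correlated with x1
X = np.column_stack([x1, x2])

b_short, b_long = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = 1.0 * x1 + rng.normal(size=T)    # x2 is truly irrelevant (beta2 = 0)
    b_short[r] = (x1 @ y) / (x1 @ x1)
    b_long[r] = np.linalg.lstsq(X, y, rcond=None)[0][0]

print("means (both unbiased, ~1):", b_short.mean(), b_long.mean())
print("variances (overfitting costs precision):", b_short.var(), b_long.var())
```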
2 Functional Form

2.1 Dummy Variables

One of the most useful devices in regression analysis is the binary, or dummy, variable, which takes only the values 0 and 1.
2.1.1 Comparing Two Means

Suppose a model describes the salary function by
$$y = \mu + x'\beta + \varepsilon,$$
where $\mu$ can be regarded as the "initial pay" given to everyone, even individuals with different academic degrees. This model can be made more realistic by dividing the "initial pay" into two categories: individuals attending college and individuals not attending college. Formally,
$$y = \mu + \delta d_i + x'\beta + \varepsilon,$$
where
$$d_i = \begin{cases} 1 & \text{if attending college} \\ 0 & \text{if not attending college.} \end{cases}$$
Logically, $\delta > 0$, and $d_i$ is the dummy variable. The above model can also be written equivalently as
$$y = \delta d_{1i} + \eta d_{2i} + x'\beta + \varepsilon,$$
where
$$d_{1i} = \begin{cases} 1 & \text{if attending college} \\ 0 & \text{if not attending college} \end{cases} \qquad \text{and} \qquad d_{2i} = \begin{cases} 0 & \text{if attending college} \\ 1 & \text{if not attending college,} \end{cases}$$
but not as
$$y = \mu + \delta d_{1i} + \eta d_{2i} + x'\beta + \varepsilon,$$
which must be avoided as the dummy variable trap (see the sketch below). Similarly, to remove a seasonal effect we need four dummies without a common mean, or three dummies with a common mean (see eq. 7-1 at p. 118).
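A tiny sketch of the dummy variable trap (hypothetical data): with a constant plus both dummies the design matrix loses full column rank, since the constant column equals $d_1 + d_2$, so $X'X$ is singular and OLS breaks down.

```python
import numpy as np

n = 6
d1 = np.array([1, 1, 1, 0, 0, 0])    # attended college
d2 = 1 - d1                          # did not attend college
const = np.ones(n)

# Legal designs: {d1, d2} without a constant, or {const, d1} alone.
X_ok = np.column_stack([d1, d2])
# Dummy trap: constant plus both dummies is perfectly collinear (const = d1 + d2).
X_trap = np.column_stack([const, d1, d2])

print(np.linalg.matrix_rank(X_ok))    # 2: full column rank
print(np.linalg.matrix_rank(X_trap))  # 2 < 3 columns: X'X is singular
```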
2.2 Nonlinearity in the Variables

The linear model we proposed is not as "limited" as it appears at first glance. By using logarithms, exponentials, reciprocals, transcendental functions, polynomials, and so on, this "linear" model also accommodates the general form
$$g(y) = \beta_1 f_1(z) + \beta_2 f_2(z) + \dots + \beta_k f_k(z) + \varepsilon = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon = x'\beta + \varepsilon,$$
which can be tailored to any number of situations: the model need only be linear in the parameters, not in the underlying variables.
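For instance, a model quadratic in $z$ is still linear in the parameters once we define $x_1 = 1$, $x_2 = z$, $x_3 = z^2$, and can be fit by ordinary OLS (a minimal sketch with simulated data; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 200
z = rng.uniform(-2, 2, size=T)
y = 1.0 + 0.5 * z - 2.0 * z**2 + rng.normal(size=T)   # nonlinear in z, linear in beta

X = np.column_stack([np.ones(T), z, z**2])            # transformed regressors
print(np.linalg.lstsq(X, y, rcond=None)[0])           # approx [1.0, 0.5, -2.0]
```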
2.2.1 Log-Linear Model

A commonly used form of regression model is the log-linear model:
$$y = \alpha \prod_k z_k^{\beta_k}\, e^{\varepsilon},$$
or
$$\ln y = \ln\alpha + \sum_k \beta_k \ln z_k + \varepsilon = \beta_1 + \sum_k \beta_k x_k + \varepsilon.$$
All you have to do is take the natural logarithms of the data before running the regression.
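A minimal sketch of exactly that recipe (simulated data; the parameter values are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
T, alpha, beta = 200, 2.0, 0.7
z = rng.uniform(1.0, 10.0, size=T)
y = alpha * z**beta * np.exp(rng.normal(scale=0.1, size=T))   # y = a z^b e^eps

# Take natural logarithms of the data, then run OLS:
X = np.column_stack([np.ones(T), np.log(z)])
b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
print(b)    # approx [ln 2 = 0.693, 0.7]
```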
3 Stochastic Regressors

This section considers the linear regression model
$$Y = X\beta + \varepsilon.$$
It will be assumed that the full ideal conditions hold except that the regressor matrix $X$ is random.

3.1 Independent Stochastic Linear Regression Model

First of all, consider the case in which $X$ and $\varepsilon$ are independent. In this case the distribution of $\varepsilon$ conditional on $X$ is the same as its marginal distribution; specifically,
$$f(\varepsilon|X) = f(\varepsilon) \sim N(0, \sigma^2 I) \quad \text{and} \quad E(\varepsilon|X) = \int \varepsilon f(\varepsilon|X)\, d\varepsilon = 0.$$
We now investigate the statistical properties of the OLS estimator under this assumption.

3.1.1 Unbiasedness?

Using the law of iterated expectations, the expected value of $\hat{\beta}$ is
$$E(\hat{\beta}) = E_X\{E[\beta + (X'X)^{-1}X'\varepsilon \mid X]\} = E_X[\beta + (X'X)^{-1}X'E(\varepsilon|X)] = E_X(\beta) = \beta.$$
The variance-covariance matrix of $\hat{\beta}$ is slightly different from the previous model, however:
$$Var(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = E_X\{E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X]\} = E_X\{(X'X)^{-1}X'E[\varepsilon\varepsilon'|X]X(X'X)^{-1}\} = E_X\{(X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1}\} = \sigma^2 E_X(X'X)^{-1} = \sigma^2 E(X'X)^{-1},$$
provided, of course, that $\sigma^2 E(X'X)^{-1}$ exists. The variance-covariance matrix of $\hat{\beta}$ is $\sigma^2$ times the expected value of $(X'X)^{-1}$, since $(X'X)^{-1}$ takes different values with each new random sample.

The OLS estimator of the disturbance variance,
$$s^2 = \frac{e'e}{T-k},$$
remains unbiased, since
$$E(e'e) = E_X[E(\varepsilon'M\varepsilon|X)] = E_X(\sigma^2(T-k)) = \sigma^2(T-k),$$
and therefore $E(s^2) = \sigma^2$.

3.1.2 Efficiency?

The Gauss-Markov theorem can be established logically from the results of the preceding paragraph. We have shown that $Var(\tilde{\beta}|X) \geq Var(\hat{\beta}|X)$ for any other linear unbiased estimator $\tilde{\beta} \neq \hat{\beta}$ and for the specific $X$ in our sample. But if this inequality holds for every particular $X$, then it must hold for
$$Var(\hat{\beta}) = E_X[Var(\hat{\beta}|X)],$$
since $E(\hat{\beta}|X) = \beta$ does not vary with $X$. That is, if the inequality holds for every particular $X$, then it must hold on average over $X$.

Theorem (Gauss-Markov Theorem with Stochastic Regressors): In the classical linear regression model, the least squares estimator $\hat{\beta}$ is the minimum variance linear unbiased estimator of $\beta$, whether $X$ is stochastic or nonstochastic.
3.1.3 Consistency?

From the notation
$$X = \begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_T' \end{pmatrix},$$
we have
$$\frac{1}{T}X'X = \frac{1}{T}\sum_{t=1}^T X_t X_t'.$$
If we assume that
$$\text{plim}\; \frac{1}{T}X'X = \text{plim}\; \frac{1}{T}\sum_{t=1}^T X_t X_t' = Q$$
is finite and nonsingular then, by the law of large numbers, $Q = E(X_t X_t')$; that is, the second moments of the regressors are finite (this assumption is violated when $X$ is an I(1), i.e. unit root, process).

The independence assumption implies that $\text{plim}(X'\varepsilon/T) = 0$. This follows from the fact that $E(X'\varepsilon/T) = 0$ and
$$E\left[\left(\frac{X'\varepsilon}{T}\right)\left(\frac{X'\varepsilon}{T}\right)'\right] = \frac{\sigma^2}{T}\cdot\frac{E(X'X)}{T} = \frac{\sigma^2}{T}\cdot\frac{E\left(\sum_{t=1}^T X_tX_t'\right)}{T} = \frac{\sigma^2}{T}\cdot\frac{\sum_{t=1}^T E(X_tX_t')}{T} = \frac{\sigma^2}{T}\cdot\frac{TQ}{T} = \frac{\sigma^2}{T}Q,$$
so that
$$\lim_{T\to\infty} E\left[\left(\frac{X'\varepsilon}{T}\right)\left(\frac{X'\varepsilon}{T}\right)'\right] = \lim_{T\to\infty} \frac{\sigma^2}{T}Q = 0.$$
But the facts that $E(X'\varepsilon/T) = 0$ and $\lim_{T\to\infty} E[(X'\varepsilon/T)(X'\varepsilon/T)'] = 0$ together imply (convergence in mean square) that
$$\text{plim}\; \frac{X'\varepsilon}{T} = 0.$$
Recall that
$$\hat{\beta} = (X'X)^{-1}X'Y = \beta + (X'X)^{-1}X'\varepsilon = \beta + \left(\frac{X'X}{T}\right)^{-1}\frac{X'\varepsilon}{T},$$
therefore
$$\text{plim}\; \hat{\beta} = \beta + Q^{-1}\,\text{plim}\; \frac{X'\varepsilon}{T} = \beta.$$
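A quick sketch of this consistency result with stochastic regressors (simulated design, all values illustrative): here $Q = E(X_tX_t') = I$, and the sampling error $(X'X/T)^{-1}(X'\varepsilon/T)$ shrinks as $T$ grows.

```python
import numpy as np

rng = np.random.default_rng(6)
beta = np.array([1.0, -0.5])
for T in (100, 1000, 10000, 100000):
    X = rng.normal(size=(T, 2))             # stochastic regressors, independent of eps
    y = X @ beta + rng.normal(size=T)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    print(T, b, np.abs(b - beta).max())     # maximum error shrinks as T grows
```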
Remark: To prove the consistency of OLS under stochastic regressors, we need to show that $\text{plim}(X'\varepsilon/T) = 0$. This in turn only requires that $E(X'\varepsilon) = \sum_{t=1}^T E(X_t\varepsilon_t) = 0$, or $E(X_{ti}\varepsilon_t) = 0$, $i = 1, 2, \dots, k$; that is, each regressor must be uncorrelated with the disturbance at time $t$, i.e. no contemporaneous correlation. There are three common circumstances in which a stochastic regressor at time $t$ is correlated with the disturbance of the same period, i.e. $E(X_{ti}\varepsilon_t) \neq 0$: lagged dependent variables combined with serially correlated disturbances, unobservable variables (errors-in-variables) models, and simultaneous equations models. Under these circumstances OLS is not a consistent estimator, and the Instrumental Variables (IV) estimator is proposed to deal with this problem.

3.1.4 Distribution of the Estimators

Since
$$\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon,$$
the distribution of $\hat{\beta}$ depends on the stochastic properties of $X$. One might pessimistically conclude that the usual tests of hypotheses are no longer valid; however, as we will see in the following, this is not the case.

3.1.5 Hypothesis Tests

We now consider the validity of our usual test statistics and inference procedures when $X$ is stochastic. First consider the conventional t statistic for testing $H_0: \beta_i = \beta_i^0$. Under the null hypothesis,
$$t|X = \frac{\hat{\beta}_i - \beta_i^0}{[s^2 (X'X)^{-1}_{ii}]^{1/2}} \sim t_{T-k}.$$
However, what interests us is the marginal, that is, the unconditional, distribution of $t$. Remember that if $W \sim t(n)$, then the density function of $W$ is
$$f(w; n) = \frac{1}{\sqrt{n\pi}}\,\frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}\,\frac{1}{\left[1 + \frac{w^2}{n}\right]^{(n+1)/2}}, \qquad n > 0,\; w \in \mathbb{R}.$$
Therefore we see that the density $f(t|x)$ of the random variable $(t|X)$ is not a function of $X$. Let $g(x)$ be the density function of $X$; the joint pdf of $X$ and $t$ is $f(t,x) = f(t|x)g(x)$, so the marginal density of $t$ is
$$f(t) = \int f(t,x)\,dx = \int f(t|x)g(x)\,dx = f(t|x)\int g(x)\,dx = f(t|x),$$
where the third equality uses the fact that $f(t|x)$ is not a function of $x$. We have the surprising result that, regardless of the distribution of $X$, and even of whether $X$ is stochastic or nonstochastic, the marginal distribution of the t statistic is still the t distribution. The same reasoning can be used to deduce that the usual F ratio used for testing linear restrictions is valid whether $X$ is stochastic or not.

Remark: This conclusion holds only under the assumption that the disturbances are normally distributed; without normality we can conclude only that $\hat{\beta}$ is asymptotically normal.
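A sketch verifying that the marginal distribution of the t statistic is still $t(T-k)$ even though $X$ is redrawn in every sample (simulation setup assumed for illustration; requires scipy for the reference quantiles):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
T, k, reps = 20, 2, 20000
tvals = np.empty(reps)
for r in range(reps):
    X = np.column_stack([np.ones(T), rng.normal(size=T)])   # X redrawn each replication
    y = X @ np.array([0.5, 0.0]) + rng.normal(size=T)       # H0: beta_2 = 0 is true
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = (e @ e) / (T - k)
    tvals[r] = b[1] / np.sqrt(s2 * XtX_inv[1, 1])

# Simulated quantiles agree with the t(T-k) distribution.
q = [0.05, 0.5, 0.95]
print(np.quantile(tvals, q))
print(stats.t.ppf(q, df=T - k))
```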
4 Non-Normal Disturbances

In this section we suppose that all of the ideal conditions hold except that the disturbances are not normally distributed. In particular, we still suppose that the disturbances $\varepsilon_t$ are independent and identically distributed with zero mean and finite variance $\sigma^2$; however, we no longer suppose that their distribution is normal.

4.1 Unbiasedness, Efficiency, and Consistency?

It is easy to establish the following OLS properties.

Theorem: $\hat{\beta}$ is unbiased, BLUE, consistent, and has covariance matrix $\sigma^2(X'X)^{-1}$; $s^2$ is unbiased and consistent.
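A sketch of the theorem's first claims under a skewed, decidedly non-normal disturbance (illustrative design; centered exponential errors with mean 0 and variance 1): OLS remains unbiased with the usual covariance $\sigma^2(X'X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(8)
T, reps = 100, 5000
x = rng.normal(size=T)                         # fixed design across replications
b = np.empty(reps)
for r in range(reps):
    eps = rng.exponential(1.0, size=T) - 1.0   # skewed, non-normal, mean 0, var 1
    y = 2.0 * x + eps
    b[r] = (x @ y) / (x @ x)

print("mean of b (unbiased, ~2)    :", b.mean())
print("var of b vs sigma^2 / (x'x) :", b.var(), 1.0 / (x @ x))
```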