Ch. 24 Johansen's MLE for Cointegration

We have so far considered only single-equation estimation and testing for cointegration. While single-equation estimation is convenient and often consistent, for some purposes only the estimation of a system provides sufficient information. This is true, for example, when we consider the estimation of multiple cointegrating vectors and inference about the number of such vectors. This chapter examines methods of finding the cointegrating rank and derives the asymptotic distributions. To develop these results, we first begin with a discussion of canonical correlation analysis.

1 Canonical Correlation

1.1 Population Canonical Correlations

Let the (n1 × 1) vector yt and the (n2 × 1) vector xt denote stationary random vectors that are measured as deviations from their population means, so that E(yty′t) represents the variance-covariance matrix of yt. In general, there might be complicated correlations among the elements of yt and xt, i.e.

   E[(y′t  x′t)′(y′t  x′t)] = [ E(yty′t)  E(ytx′t) ; E(xty′t)  E(xtx′t) ] = [ Σyy  Σyx ; Σxy  Σxx ].

If the two sets are very large, the investigator may wish to study only a few linear combinations of yt and xt that are most highly correlated. He may find that the interrelation is completely described by the correlations between the first few canonical variates.

We now define two new (n × 1) random vectors, ηt and ξt, where n is the smaller of n1 and n2. These vectors are linear combinations of yt and xt, respectively:

   ηt ≡ K′yt,    ξt ≡ A′xt.

Here, K′ and A′ are (n × n1) and (n × n2) matrices, respectively. The matrices K′ and A′ are chosen such that the following conditions hold.
(a) E(ηtη′t) = K′ΣyyK = In and E(ξtξ′t) = A′ΣxxA = In.

(b) E(ξtη′t) = A′ΣxyK = R, where R = diag(r1, r2, ..., rn) and ri ≥ 0 for i = 1, 2, ..., n.

(c) The elements of ηt and ξt are ordered in such a way that

   1 ≥ r1 ≥ r2 ≥ ... ≥ rn ≥ 0.

The correlation ri is known as the ith population canonical correlation between yt and xt. The population canonical correlations and the values of A and K can be calculated as follows.

Theorem 1: Let

   Σ = [ Σyy  Σyx ; Σxy  Σxx ]

be a positive definite symmetric matrix. Let (λ1, λ2, ..., λn1) be the eigenvalues of Σyy⁻¹ΣyxΣxx⁻¹Σxy, ordered λ1 ≥ λ2 ≥ ... ≥ λn1, and let (k1, k2, ..., kn1) be the associated (n1 × 1) eigenvectors, normalized so that k′iΣyyki = 1 for i = 1, 2, ..., n1. Let (μ1, μ2, ..., μn2) be the eigenvalues of Σxx⁻¹ΣxyΣyy⁻¹Σyx, ordered μ1 ≥ μ2 ≥ ... ≥ μn2, and let (a1, a2, ..., an2) be the associated (n2 × 1) eigenvectors, normalized so that a′iΣxxai = 1 for i = 1, 2, ..., n2. Let n be the smaller of n1 and n2, and collect the first n vectors ki and the first n vectors ai in the matrices

   K = [k1  k2  ...  kn],    A = [a1  a2  ...  an].

Assuming that λ1, λ2, ..., λn are distinct, then

(a) 0 ≤ λi < 1 for i = 1, 2, ..., n1 and 0 ≤ μj < 1 for j = 1, 2, ..., n2;

(b) λi = μi for i = 1, 2, ..., n;

(c) K′ΣyyK = In and A′ΣxxA = In;

(d) A′ΣxyK = R, where R² = diag(λ1, λ2, ..., λn).
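The construction in Theorem 1 is easy to carry out numerically. The following is a minimal NumPy/SciPy sketch (my own illustration, not from the text; the function name pop_canonical and all variable names are assumptions). It solves the equivalent symmetric-definite generalized eigenproblems, for which scipy.linalg.eigh already returns eigenvectors satisfying the normalization k′Σyyk = 1 and a′Σxxa = 1.

import numpy as np
from scipy.linalg import eigh

def pop_canonical(Syy, Syx, Sxx):
    """Population canonical correlations and variate weights K, A.

    Solves the symmetric-definite generalized eigenproblems
        Syx Sxx^{-1} Sxy k = lambda Syy k,
        Sxy Syy^{-1} Syx a = mu    Sxx a,
    which are equivalent to the eigenproblems in Theorem 1; eigh() returns
    eigenvectors normalized so that k' Syy k = 1 and a' Sxx a = 1.
    """
    Sxy = Syx.T
    lam, K = eigh(Syx @ np.linalg.solve(Sxx, Sxy), Syy)   # lambda_i, k_i
    mu,  A = eigh(Sxy @ np.linalg.solve(Syy, Syx), Sxx)   # mu_i, a_i
    # eigh sorts eigenvalues ascending; reorder descending and keep the first n columns
    n = min(Syy.shape[0], Sxx.shape[0])
    K = K[:, np.argsort(lam)[::-1]][:, :n]
    A = A[:, np.argsort(mu)[::-1]][:, :n]
    lam = np.sort(lam)[::-1][:n]
    r = np.sqrt(np.clip(lam, 0.0, 1.0))                   # canonical correlations r_i
    return r, K, A

Since eigenvector signs are arbitrary, one may afterwards flip the signs of columns of A (or K) so that the diagonal of A′ΣxyK is nonnegative, as condition (b) requires.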
We may interpret the canonical correlations as follows. The first canonical variates η1t and ξ1t can be interpreted as those linear combinations of yt and xt, respectively, such that the correlation between η1t and ξ1t is as large as possible. The variates η2t and ξ2t give the linear combinations of yt and xt that are uncorrelated with η1t and ξ1t and yield the largest remaining correlation between η2t and ξ2t, and so on.

1.2 Sample Canonical Correlations

The canonical correlations ri calculated by the procedure just described are population parameters: they are functions of the population moments Σyy, Σxy, and Σxx. To find their sample analogs, all we have to do is to start from the sample moments corresponding to Σyy, Σxy, and Σxx. Suppose we have a sample of T observations on the (n1 × 1) vector yt and the (n2 × 1) vector xt, whose sample moments are given by

   Σ̂yy = (1/T) Σ_{t=1}^{T} yty′t,    Σ̂yx = (1/T) Σ_{t=1}^{T} ytx′t,    Σ̂xx = (1/T) Σ_{t=1}^{T} xtx′t.
Again, in many applications, yt and xt would be measured in deviations from their sample means. Then all the sample canonical correlations can be calculated from Σ̂yy, Σ̂yx and Σ̂xx by the procedure described in Theorem 1.
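A sketch of the sample version (again my own illustration, with hypothetical names): demean the data, form the sample moments above, and feed them to the pop_canonical() helper from the previous sketch.

import numpy as np

def sample_canonical(Y, X):
    """Y is (T x n1), X is (T x n2); rows are observations y_t', x_t'."""
    Y = Y - Y.mean(axis=0)          # measure in deviations from sample means
    X = X - X.mean(axis=0)
    T = Y.shape[0]
    Syy = Y.T @ Y / T               # Sigma_hat_yy = (1/T) sum y_t y_t'
    Syx = Y.T @ X / T               # Sigma_hat_yx
    Sxx = X.T @ X / T               # Sigma_hat_xx
    return pop_canonical(Syy, Syx, Sxx)   # sample canonical correlations, K_hat, A_hat

# Example usage with artificial data:
# rng = np.random.default_rng(0)
# Y, X = rng.standard_normal((200, 3)), rng.standard_normal((200, 4))
# r_hat, K_hat, A_hat = sample_canonical(Y, X)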
2 Johansen's Granger Representation Theorem

Consider a general k-dimensional VAR model with Gaussian errors written in the error correction form:

   Δyt = ξ1Δyt−1 + ξ2Δyt−2 + ... + ξp−1Δyt−p+1 + c + ξ0yt−1 + εt,                          (1)

where

   E(εt) = 0,    E(εtε′s) = Ω for t = s, and 0 otherwise.

The model defined by (1) can be rewritten as

   ξ(L)yt = −ξ0yt−1 + C(L)Δyt = c + εt,

where

   ξ(L) = (1 − L)I − Σ_{i=1}^{p−1} ξi(1 − L)L^i − ξ0L                                       (2)

and

   C(L) = (ξ(L) − ξ(1)L)/(1 − L) = I − Σ_{i=1}^{p−1} ξiL^i.                                 (3)

Note that

   −ξ0yt−1 + C(L)Δyt = −ξ0yt−1 + ξ(L)yt − ξ(1)Lyt = −ξ0yt−1 + ξ(L)yt + ξ0yt−1 = ξ(L)yt,

using the fact from (2) that ξ(1) = −ξ0.

Johansen (1991) provides the following fundamental result about error correction models of order 1 and their structure. The basic result is due to Granger (1983) and Engle and Granger (1987). In addition, he provides an explicit condition for the process to be integrated of order 1 and clarifies the role of the constant term.
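The decomposition of ξ(L) can be checked numerically. The following short NumPy check (my own illustration, not from the text) draws arbitrary coefficient matrices and verifies the identity ξ(z) = ξ(1)z + (1 − z)C(z) at a scalar z, which is the relation underlying the manipulation above.

import numpy as np

rng = np.random.default_rng(0)
k, p = 3, 4
xis = [0.2 * rng.standard_normal((k, k)) for _ in range(p - 1)]   # xi_1, ..., xi_{p-1}
xi0 = 0.2 * rng.standard_normal((k, k))                           # xi_0
I = np.eye(k)
z = 0.7

xi_z = (1 - z) * I - sum(X * (1 - z) * z**(i + 1) for i, X in enumerate(xis)) - xi0 * z   # (2)
C_z  = I - sum(X * z**(i + 1) for i, X in enumerate(xis))                                 # (3)
xi_1 = -xi0                                                                               # xi(1) = -xi_0

assert np.allclose(xi_z, xi_1 * z + (1 - z) * C_z)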
Theorem 2 (Granger's Representation Theorem): Let the process yt satisfy equation (1) for t = 1, 2, ..., and let

   ξ0 = −BA′

for A and B of dimension k × h and rank h,¹ and let B′⊥C(1)A⊥ have full rank k − h. Define

   Ψ = A⊥(B′⊥C(1)A⊥)⁻¹B′⊥.

Then Δyt and A′yt can be given initial distributions such that

(a) Δyt is stationary,

(b) A′yt is stationary,

(c) yt is nonstationary, with linear trend τ t, where τ ≡ Ψc.

Further,

(d) E(A′yt) = (B′B)⁻¹B′c + (B′B)⁻¹(B′C(1)A⊥)(B′⊥C(1)A⊥)⁻¹B′⊥c,

(e) E(Δyt) = τ.

If B′⊥c = 0, then τ = 0 and the linear trend disappears. However, the cointegrating relations still contain a constant term, i.e. E(A′yt) = (B′B)⁻¹B′c when B′⊥c = 0.

(f) Δyt = Ψ(L)(εt + c) with Ψ(1) = Ψ. For Ψ1(L) = (Ψ(L) − Ψ(1))/(1 − L), so that Ψ(L) = Ψ(1) + (1 − L)Ψ1(L), the process has the representation

(g) yt = y0 + Ψ Σ_{i=1}^{t} εi + τ t + St − S0, where St = Ψ1(L)εt.

Proof: See Johansen (1991), p. 1559.

¹Define the orthogonal complement P⊥ of any matrix P of rank q and dimension n × q as follows (0 < q < n):
(a) P⊥ is of dimension n × (n − q);
(b) P′⊥P = 0_{(n−q)×q}, P′P⊥ = 0_{q×(n−q)};
(c) P⊥ has rank n − q, and its column space lies in the null space of P′.
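For a concrete sense of the objects in the theorem, here is a small NumPy/SciPy sketch (mine, with hypothetical names) that builds the orthogonal complements defined in the footnote and the matrix Ψ; scipy.linalg.null_space is used to obtain a basis of the null space of P′.

import numpy as np
from scipy.linalg import null_space

def perp(P):
    """Orthogonal complement of an (n x q) matrix P of full column rank:
    an (n x (n-q)) matrix whose columns span the null space of P'."""
    return null_space(P.T)

def granger_psi(A, B, C1):
    """Psi from Theorem 2, given the (k x h) matrices A, B and C(1) (k x k)."""
    A_perp, B_perp = perp(A), perp(B)
    middle = B_perp.T @ C1 @ A_perp          # must have full rank k - h
    return A_perp @ np.linalg.solve(middle, B_perp.T)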
3 Maximum Likelihood Estimation of a Gaussian VAR for Cointegration and the Test for Cointegration Rank

Consider a general VAR model² for the k × 1 vector yt with Gaussian errors:

   yt = c + Φ1yt−1 + Φ2yt−2 + ... + Φpyt−p + εt,                                            (4)

where

   E(εt) = 0,    E(εtε′s) = Ω for t = s, and 0 otherwise.

We may rewrite (4) in the error correction form:

   Δyt = ξ1Δyt−1 + ξ2Δyt−2 + ... + ξp−1Δyt−p+1 + c + ξ0yt−1 + εt,                           (5)

where

   ξ0 ≡ −(I − Φ1 − Φ2 − ... − Φp) = −Φ(1).

Suppose that yt is I(1) with h cointegrating relationships, which implies that

   ξ0 = −BA′                                                                                (6)

for B and A (k × h) matrices. That is, under the hypothesis of h cointegrating relations, only h separate linear combinations of the levels of yt−1 appear in (5).

²Here, the yt in this VAR model are not necessarily I(1) variates and are not necessarily cointegrated.
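As an illustration of the link between (4) and (5), the following NumPy sketch (my own; the name var_to_ecm is hypothetical) maps the VAR coefficients Φ1, ..., Φp into the error-correction coefficients. The ξ0 formula is the one given above; the expression for ξs, s ≥ 1, is the standard companion mapping ξs = −(Φs+1 + ... + Φp), which is not stated explicitly in the text.

import numpy as np

def var_to_ecm(Phis):
    """Phis = [Phi_1, ..., Phi_p], each (k x k).  Returns xi_0 and [xi_1, ..., xi_{p-1}]."""
    k = Phis[0].shape[0]
    p = len(Phis)
    xi0 = -(np.eye(k) - sum(Phis))                 # xi_0 = -(I - Phi_1 - ... - Phi_p) = -Phi(1)
    xis = [-sum(Phis[s:]) for s in range(1, p)]    # xi_s = -(Phi_{s+1} + ... + Phi_p)
    return xi0, xis

# Under h cointegrating relations, xi_0 = -B A' has reduced rank h:
# h = np.linalg.matrix_rank(xi0)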
Consider a sample of T + p observations on y, denoted (y−p+1, y−p+2, ..., yT). If the disturbances εt are Gaussian, then the log (conditional) likelihood of (y1, y2, ..., yT) conditional on (y−p+1, y−p+2, ..., y0) is given by

   L(Ω, ξ1, ξ2, ..., ξp−1, c, ξ0) = −(Tk/2) log(2π) − (T/2) log|Ω|
       − (1/2) Σ_{t=1}^{T} [ (Δyt − ξ1Δyt−1 − ξ2Δyt−2 − ... − ξp−1Δyt−p+1 − c − ξ0yt−1)′
                             × Ω⁻¹ (Δyt − ξ1Δyt−1 − ξ2Δyt−2 − ... − ξp−1Δyt−p+1 − c − ξ0yt−1) ].     (7)

The goal is to choose (Ω, ξ1, ξ2, ..., ξp−1, c, ξ0) so as to maximize (7) subject to the constraint that ξ0 can be written in the form of (6).

3.1 Concentrated Log-likelihood Function

3.1.1 Concentrated Likelihood Function

We often encounter in practice the situation where the parameter vector θ0 can be naturally partitioned into two sub-vectors α0 and β0 as θ0 = (α0′, β0′)′. Let the likelihood function be L(α, β). The MLE is obtained by maximizing L simultaneously over α and β, i.e. by solving

   ∂ ln L/∂α = 0;                                                                           (8)

   ∂ ln L/∂β = 0.                                                                           (9)

However, it is sometimes easier to maximize L in two steps. First, maximize it with respect to β taking α as given, and insert the maximizing value of β back into L; second, maximize L with respect to α. More precisely, define

   L*(α) = L[α, β̂(α)],                                                                      (10)

where β̂(α) is defined as the solution to

   [∂ ln L/∂β]_{β=β̂} = 0,                                                                   (11)

and define α̂* as the solution to

   [∂ ln L*/∂α]_{α=α̂*} = 0.                                                                 (12)

We call L*(α) the concentrated likelihood function of α. It can be shown that the MLE of α obtained by solving (8) and (9) simultaneously, α̂, and the estimator obtained from the concentrated likelihood (12), α̂*, are identical and have the same limiting distribution.
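To make the two-step idea concrete, here is a toy example (my own, not from the text): for an i.i.d. N(μ, σ²) sample, σ² plays the role of β and can be concentrated out analytically; maximizing the resulting concentrated likelihood L*(μ) reproduces the joint MLE of μ.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=500)

def neg_concentrated_loglik(mu):
    sigma2_hat = np.mean((x - mu) ** 2)                       # beta_hat(alpha) step, as in (11)
    n = x.size
    return 0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1.0)   # minus L*(mu)

mu_star = minimize_scalar(neg_concentrated_loglik).x          # maximizer of L*(mu), as in (12)
assert np.isclose(mu_star, x.mean(), atol=1e-6)               # equals the joint MLE of mu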
3.1.2 Calculate Auxiliary Regressions

The first step involves concentrating the likelihood function. This means taking Ω and ξ0 as given and maximizing (7) with respect to (c, ξ1, ξ2, ..., ξp−1). This restricted maximization problem takes the form of a seemingly unrelated regression of the elements of the (k × 1) vector Δyt − ξ0yt−1 on a constant and the explanatory variables (Δyt−1, Δyt−2, ..., Δyt−p+1). Since each of the k regressions in this system has identical explanatory variables, the estimates of (c, ξ1, ξ2, ..., ξp−1) would come from OLS regressions of each element of Δyt − ξ0yt−1 on a constant and (Δyt−1, Δyt−2, ..., Δyt−p+1). Denote the values of (c, ξ1, ξ2, ..., ξp−1) that maximize (7) for a given value of ξ0 (and Ω, although the value of Ω does not matter, by the properties of the SURE model) by

   [ ĉ*(ξ0), ξ̂*1(ξ0), ξ̂*2(ξ0), ..., ξ̂*p−1(ξ0) ].

These values are characterized by the condition that the following residual vector must have sample mean zero and be orthogonal to (Δyt−1, Δyt−2, ..., Δyt−p+1):

   [Δyt − ξ0yt−1] − { ĉ*(ξ0) + ξ̂*1(ξ0)Δyt−1 + ξ̂*2(ξ0)Δyt−2 + ... + ξ̂*p−1(ξ0)Δyt−p+1 }.        (13)

To obtain (13) with unknown ξ0 (although we assume it is known at this stage in order to form the concentrated log-likelihood function), we may form two auxiliary regressions and estimate them by OLS to get

   Δyt = π̂0 + Π̂1Δyt−1 + Π̂2Δyt−2 + ... + Π̂p−1Δyt−p+1 + ût                                      (14)

and

   yt−1 = θ̂0 + Θ̂1Δyt−1 + Θ̂2Δyt−2 + ... + Θ̂p−1Δyt−p+1 + v̂t.                                    (15)
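A compact NumPy sketch (my own; function and variable names are hypothetical) of these two auxiliary regressions: both Δyt and yt−1 are regressed by OLS on a constant and the lagged differences, and the residuals ût and v̂t are kept for the next step.

import numpy as np

def aux_residuals(y, p):
    """y is (T_total x k) in levels; p is the VAR order in (4).  Returns (u_hat, v_hat)."""
    dy = np.diff(y, axis=0)                                   # Delta y_t
    T = dy.shape[0] - (p - 1)                                 # usable sample size
    # regressors: a constant and Delta y_{t-1}, ..., Delta y_{t-p+1}
    W = np.hstack([np.ones((T, 1))] +
                  [dy[p - 1 - j:p - 1 - j + T] for j in range(1, p)])
    DY = dy[p - 1:]                                           # dependent variable in (14)
    Y1 = y[p - 1:-1]                                          # y_{t-1}, dependent variable in (15)
    u_hat = DY - W @ np.linalg.lstsq(W, DY, rcond=None)[0]    # residuals of (14)
    v_hat = Y1 - W @ np.linalg.lstsq(W, Y1, rcond=None)[0]    # residuals of (15)
    return u_hat, v_hat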
Both residual vectors ût and v̂t have sample mean zero and are also orthogonal to (Δyt−1, Δyt−2, ..., Δyt−p+1). Moreover, ût − ξ0v̂t also has sample mean zero and is orthogonal to (Δyt−1, Δyt−2, ..., Δyt−p+1). Therefore, the residual vector (13) can be expressed as

   ût − ξ0v̂t = (Δyt − π̂0 − Π̂1Δyt−1 − Π̂2Δyt−2 − ... − Π̂p−1Δyt−p+1)
               − ξ0(yt−1 − θ̂0 − Θ̂1Δyt−1 − Θ̂2Δyt−2 − ... − Θ̂p−1Δyt−p+1)

with

   ĉ*(ξ0) = π̂0 − ξ0θ̂0,
   ξ̂*i(ξ0) = Π̂i − ξ0Θ̂i, for i = 1, 2, ..., p − 1.

The concentrated log-likelihood function is found by replacing (c, ξ1, ξ2, ..., ξp−1) with (ĉ*(ξ0), ξ̂*1(ξ0), ξ̂*2(ξ0), ..., ξ̂*p−1(ξ0)) in (7):

   L(Ω, ξ0) = −(Tk/2) log(2π) − (T/2) log|Ω| − (1/2) Σ_{t=1}^{T} (ût − ξ0v̂t)′ Ω⁻¹ (ût − ξ0v̂t).    (16)

We can go one step further and concentrate out Ω. Recall from the analysis of the estimation of a VAR on p. 17 of Chapter 18 that the value of Ω that maximizes (16) (for a given ξ0) is given by

   Ω̂*(ξ0) = (1/T) Σ_{t=1}^{T} (ût − ξ0v̂t)(ût − ξ0v̂t)′.                                           (17)

As in expression (24) of Chapter 18, the value obtained for (16) when evaluated