《多元统计分析》课程教学资源（阅读材料）Visual Hypothesis Tests in Multivariate Linear Models

Visual Hypothesis Tests in Multivariate Linear Models: The heplots Package for R John Fox Michael Friendly Georges Monette McMaster University York University York University 6 Februrary 2007 Abstract Hypothesis-error (or "HE")plots,introduced by Friendly (2006,2007),permit the visualization of hypothesis tests in multivariate linear models by representing hypothesis and error matrices of sums of squares and cross-products as ellipses.This paper describes the implementation of these methods in R, as well as their extension,for example from two to three dimensions and by scaling hypothesis ellipses and ellipsoids in a natural manner relative to error.The methods,incorporated in the heplots package for R,exploit new facilities in the car package for testing linear hypotheses in multivariate linear models and for constructing MANOVA tables for these models,including models for repeated measures. 1 Introduction This paper introduces the heplots package for R,which implements and extends the methods described in Friendly(2006,2007)for visualizing hypothesis tests in multivariate linear models.The paper begins with a brief description of multivariate linear models;proceeds to explain how dispersion matrices can be represented by ellipses or ellipsoids:describes new facilities in the car package (associated with Fox.2002)for testing linear hypotheses in multivariate linear models and for constructing multivariate analysis-of-variance tables; and illustrates the use of the functions in the heplots package for two and three-dimensional visualization of hypothesis tests in multivariate analysis of variance and regression. 2 Multivariate Linear Models The univariate linear model y=XB+8 (1) is surely the most familiar of statistical models.In Equation 1,y is an n x 1 column vector of observations on a response variable;X is an n x p model matrix of full column rank that is either fixed or,if random, independent of the n x 1 vector of errors e;and the p x 1 vector of regression coefficients B is to be estimated from the data..As is also familiar,under the standard assumptions that the errors are normally and independently distributed with zero expectations and common variance,si~NID(0,2)or equivalently eNn(0,02In),the least squares estimator, B=(XTx)-xTy is the maximum-likelihood estimator of B.Here,Nn denotes the multivariate-normal distribution for n variables,0 is the n x 1 zero vector,and In is the order-n identity matrix. In the multivariate linear model (e.g.,Timm,1975), Y=XB+E the response vector y is replaced by an n x m matrix of responses Y,where each column represents a distinct response variable,B is a pxm matrix of regression coefficients,and E is an nxm matrix of errors.Under the 入y

Visual Hypothesis Tests in Multivariate Linear Models: The heplots Package for R John Fox McMaster University Michael Friendly York University Georges Monette York University 6 Februrary 2007 Abstract Hypothesis-error (or “HE”) plots, introduced by Friendly (2006, 2007), permit the visualization of hypothesis tests in multivariate linear models by representing hypothesis and error matrices of sums of squares and cross-products as ellipses. This paper describes the implementation of these methods in R, as well as their extension, for example from two to three dimensions and by scaling hypothesis ellipses and ellipsoids in a natural manner relative to error. The methods, incorporated in the heplots package for R, exploit new facilities in the car package for testing linear hypotheses in multivariate linear models and for constructing MANOVA tables for these models, including models for repeated measures. 1 Introduction This paper introduces the heplots package for R, which implements and extends the methods described in Friendly (2006, 2007) for visualizing hypothesis tests in multivariate linear models. The paper begins with a brief description of multivariate linear models; proceeds to explain how dispersion matrices can be represented by ellipses or ellipsoids; describes new facilities in the car package (associated with Fox, 2002) for testing linear hypotheses in multivariate linear models and for constructing multivariate analysis-of-variance tables; and illustrates the use of the functions in the heplots package for two and three-dimensional visualization of hypothesis tests in multivariate analysis of variance and regression. 2 Multivariate Linear Models The univariate linear model y = Xβ + ε (1) is surely the most familiar of statistical models. In Equation 1, y is an n × 1 column vector of observations on a response variable; X is an n × p model matrix of full column rank that is either fixed or, if random, independent of the n × 1 vector of errors ε; and the p × 1 vector of regression coefficients β is to be estimated from the data.. As is also familiar, under the standard assumptions that the errors are normally and independently distributed with zero expectations and common variance, εi ∼ NID(0, σ2) or equivalently ε ∼ Nn(0, σ2In), the least squares estimator, βb = (XT X) −1XT y is the maximum-likelihood estimator of β. Here, Nn denotes the multivariate-normal distribution for n variables, 0 is the n × 1 zero vector, and In is the order-n identity matrix. In the multivariate linear model (e.g., Timm, 1975), Y = XB + E the response vector y is replaced by an n×m matrix of responses Y, where each column represents a distinct response variable, B is a p×m matrix of regression coefficients, and E is an n×m matrix of errors. Under the 1

assumption that the rows of E are independent.and that each row is multivariately normally distributed with zero expectation and common covariance matrix,~Nn(0,E)or equivalently vec(E)~Nnp(0,In ) the least squares estimator B=(XTX)-XTY is the maximum-likelihood estimator or B.Here,the 0 vectors are respectively of order n x 1 and np x 1, and represents the Kronecker product. Hypothesis tests for multivariate linear models also closely parallel those for univariate linear models. Consider the linear hypothesis Ho:L3=0 in the univariate linear model,where L is a g x p hypothesis matrix of rank q and 0 is the qx 1 zero vector. Under this hypothesis, LLXTX-LL3 P0= 9 SS/q EE SSE/(n-P) n-卫 is distributed as F with q and n-pdegrees of freedom.The quantity sLTL is the sum of squares for the hypothesis,y is the vector of residuals,SSg is the sum of squares for error,and s2=/(n-p)is the estimated error variance.To test the analogous hypothesis in the multivariate linear model, Ho:LB=0 (2) where 0 is now the gx m zero matrix,we compute the m x m hypothesis sum of squares and products matrix SSP BTLT L(XTX)-LT]-LB and the m x m error sum of squares and products matrix SSPE=ETE where E=Y-XB is the matrix of residuals.Multivariate tests of the hypothesis are based on the s min(g,m)nonzero latent roots A1 >A2>...>As of the matrix SSP relative to the matrix SSPE, that is.the values of A for which det(SSPH-λSSPE)=0 These are also the ordinary latent roots of of SSPSSP,that is,the values of for which det(SSPHSSP1-AIm)=0 The corresponding latent vectors give a set of s orthogonal linear combinations of the responses that pro- duce maximal univariate F statistics for the hypothesis in Equation 2.The several commonly employed multivariate test statistics are functions of the latent roots: Pillai's trace. j=1 Hotelling-Lawley trace. THL=∑男=1 1 Wilks's Lambda A= 1+ Roy's maximum root, There is an F approximation to the null distribution of each of these test statistics. In a univariate linear model,it is common to provide F tests for each term in the model,summarized in an analysis-of-variance (ANOVA)table.The hypothesis sums of squares for these tests can be expressed as differences in the error sums of squares for nested models.For example,dropping each term in the model

assumption that the rows of E are independent, and that each row is multivariately normally distributed with zero expectation and common covariance matrix, εT i ∼ Nn(0, Σ) or equivalently vec(E) ∼ Nnp(0, In ⊗ Σ), the least squares estimator Bb = (XT X) −1XT Y is the maximum-likelihood estimator or B. Here, the 0 vectors are respectively of order n × 1 and np × 1, and ⊗ represents the Kronecker product. Hypothesis tests for multivariate linear models also closely parallel those for univariate linear models. Consider the linear hypothesis H0: Lβ = 0 in the univariate linear model, where L is a q × p hypothesis matrix of rank q and 0 is the q × 1 zero vector. Under this hypothesis, F0 = βbT LT [L(XT X) −1 LT ] −1 Lβb q bε T bε n − p = SSH/q SSE/(n − p) is distributed as F with q and n − p degrees of freedom. The quantity SSH = βbT LT [L(XT X) −1 LT ] −1 Lβb is the sum of squares for the hypothesis, bε = y − Xβb is the vector of residuals, SSE = bε T bε is the sum of squares for error, and s2 = bε T bε/(n − p) is the estimated error variance. To test the analogous hypothesis in the multivariate linear model, H0: LB = 0 (2) where 0 is now the q×m zero matrix, we compute the m×m hypothesis sum of squares and products matrix SSPH = Bb TLT [L(XT X) −1 LT ] −1 LBb and the m × m error sum of squares and products matrix SSPE = EbTEb where Eb = Y − XBb is the matrix of residuals. Multivariate tests of the hypothesis are based on the s = min(q,m) nonzero latent roots λ1 > λ2 > ··· > λs of the matrix SSPH relative to the matrix SSPE, that is, the values of λ for which det(SSPH − λSSPE)=0 These are also the ordinary latent roots of of SSPHSSP−1 E , that is, the values of λ for which det(SSPHSSP−1 E − λIm)=0 The corresponding latent vectors give a set of s orthogonal linear combinations of the responses that produce maximal univariate F statistics for the hypothesis in Equation 2. The several commonly employed multivariate test statistics are functions of the latent roots: Pillai’s trace, TP = Xp j=1 λj 1 + λj Hotelling-Lawley trace, THL = Pp j=1 λj Wilks’s Lambda, Λ = Yp j=1 1 1 + λj Roy’s maximum root, λ1 There is an F approximation to the null distribution of each of these test statistics. In a univariate linear model, it is common to provide F tests for each term in the model, summarized in an analysis-of-variance (ANOVA) table. The hypothesis sums of squares for these tests can be expressed as differences in the error sums of squares for nested models. For example, dropping each term in the model 2

in turn and contrasting the resulting residual sum of squares with that for the full model produces so-called Type-III tests;adding terms to the model sequentially produces so-called Type-I tests;and testing each term after all terms in the model with the exception of those to which it is marginal produces so-called Type-II tests.Closely analogous multivariate analysis-of-variable (MANOVA)tables can be formed similarly by taking differences in error sum of squares and products matrices. In some contexts-for example.when the response variables represent repeated measures of the same variable over time-it is also of interest to entertain a design and hypotheses on the response (see,e.g., O'Brien and Kaiser,1985).Such tests can be formulated by extending the linear hypothesis in Equation 2 to Ho:LBP=0 where the m x k matrix P provides contrasts in the responses. 3 Data Ellipses and Ellipsoids The data ellipse,described by Dempster(1969)and Monette(1990),is a device for visualizing the relationship between two variables,Yi and Y2.Let D(y)=(y-y)TS-1(y-y)represent the squared Mahalanobis distance of the point y=(,y2)T from the centroid of the data y=(Y1,Y2)T.The data ellipse Ee of size c is the set of all points y with Di(y)less than or equal to c2: (y:s)={y:(y-)Ts-1(y-)≤2} (3) Here.S is the sample covariance matrix. s=-y-可 n-1 Selecting c=1 produces the "standard"data ellipse,as illustrated in Figure 1:The perpendicular "shadows"of the ellipse on the axes mark off twice the standard deviation of each variable;the regression line for Y2 on Yi intersects the points of vertical tangency on the boundary of the ellipse;and the correlation between the two variables is proportional to the length of the line from the bottom of the ellipse to the point of vertical tangency at the right.Many other properties of correlation and regression can be visualized using the data ellipse (see,e.g.,Monette,1990). These properties of the data ellipse hold regardless of the joint distribution of the variables,but if the variables are bivariate normal,then the data ellipse represents a contour of constant density in their joint distribution.In this case,D(y)has a large-sample x2 distribution with 2 degrees of freedom,and so,for example,taking c2=x2(0.95)=5.996 encloses approximately 95 percent of the data.Alternatively,in small samples,we can take 2=2m-Fm-2≈2n2m-2 n-2 but this typically makes little difference visually. The generalization of the data ellipse to more than two variables is immediate:Applying Equation 3 to y =(42,43)T,for example,produces a data ellipsoid in three dimensions.For m multivariate-normal variables,selecting c2=x(1-a)encloses approximately 100(1-a)percent of the data.Again,for greater precision,we can use 2=ma-Fnn-m≈mEnn-m n-m 4 Implementation of Tests for Multivariate Linear Models in the car Package Tests for multivariate linear models are implemented in the car package as S3 methods for the generic linear.hypothesis and Anova functions,with Manova provided as a synonym for the latter.The Anova function computes partial (so-called "Types II and III")hypothesis tests,as opposed to the anova function in the stats package,which computes sequential ("Type-I")tests;these tests coincide in one-way and balanced designs.Several examples of the use of these functions are given in this section. 3

in turn and contrasting the resulting residual sum of squares with that for the full model produces so-called Type-III tests; adding terms to the model sequentially produces so-called Type-I tests; and testing each term after all terms in the model with the exception of those to which it is marginal produces so-called Type-II tests. Closely analogous multivariate analysis-of-variable (MANOVA) tables can be formed similarly by taking differences in error sum of squares and products matrices. In some contexts – for example, when the response variables represent repeated measures of the same variable over time – it is also of interest to entertain a design and hypotheses on the response (see, e.g., O’Brien and Kaiser, 1985). Such tests can be formulated by extending the linear hypothesis in Equation 2 to H0: LBP = 0 where the m × k matrix P provides contrasts in the responses. 3 Data Ellipses and Ellipsoids The data ellipse, described by Dempster (1969) and Monette (1990), is a device for visualizing the relationship between two variables, Y1 and Y2. Let D2 M(y)=(y − y)T S−1(y − y) represent the squared Mahalanobis distance of the point y = (y1, y2)T from the centroid of the data y = (Y 1, Y 2)T . The data ellipse Ec of size c is the set of all points y with D2 M(y) less than or equal to c2: Ec(y; S,y) ≡ © y: (y − y) T S−1(y − y) ≤ c2ª (3) Here, S is the sample covariance matrix, S = Pn i=1(y − y)T (y − y) n − 1 Selecting c = 1 produces the “standard” data ellipse, as illustrated in Figure 1: The perpendicular “shadows” of the ellipse on the axes mark off twice the standard deviation of each variable; the regression line for Y2 on Y1 intersects the points of vertical tangency on the boundary of the ellipse; and the correlation between the two variables is proportional to the length of the line from the bottom of the ellipse to the point of vertical tangency at the right. Many other properties of correlation and regression can be visualized using the data ellipse (see, e.g., Monette, 1990). These properties of the data ellipse hold regardless of the joint distribution of the variables, but if the variables are bivariate normal, then the data ellipse represents a contour of constant density in their joint distribution. In this case, D2 M(y) has a large-sample χ2 distribution with 2 degrees of freedom, and so, for example, taking c2 = χ2 2(0.95) = 5.99 ≈ 6 encloses approximately 95 percent of the data. Alternatively, in small samples, we can take c2 = 2(n − 1) n − 2 F2,n−2 ≈ 2F2,n−2 but this typically makes little difference visually. The generalization of the data ellipse to more than two variables is immediate: Applying Equation 3 to y = (y1, y2, y3)T , for example, produces a data ellipsoid in three dimensions. For m multivariate-normal variables, selecting c2 = χ2 m(1−α) encloses approximately 100(1−α) percent of the data. Again, for greater precision, we can use c2 = m(n − 1) n − m Fm,n−m ≈ mFm,n−m 4 Implementation of Tests for Multivariate Linear Models in the car Package Tests for multivariate linear models are implemented in the car package as S3 methods for the generic linear.hypothesis and Anova functions, with Manova provided as a synonym for the latter. The Anova function computes partial (so-called “Types II and III”) hypothesis tests, as opposed to the anova function in the stats package, which computes sequential (“Type-I”) tests; these tests coincide in one-way and balanced designs. Several examples of the use of these functions are given in this section. 3

《多元统计分析》课程教学资源（阅读材料）Visual Hypothesis Tests in Multivariate Linear Models - Heplot