Overview

Principal component analysis

Hervé Abdi1∗ and Lynne J. Williams2

Principal component analysis (PCA) is a multivariate technique that analyzes a data table in which observations are described by several inter-correlated quantitative dependent variables. Its goal is to extract the important information from the table, to represent it as a set of new orthogonal variables called principal components, and to display the pattern of similarity of the observations and of the variables as points in maps. The quality of the PCA model can be evaluated using cross-validation techniques such as the bootstrap and the jackknife. PCA can be generalized as correspondence analysis (CA) in order to handle qualitative variables and as multiple factor analysis (MFA) in order to handle heterogeneous sets of variables. Mathematically, PCA depends upon the eigen-decomposition of positive semidefinite matrices and upon the singular value decomposition (SVD) of rectangular matrices. © 2010 John Wiley & Sons, Inc. WIREs Comp Stat 2010 2 433–459

∗Correspondence to: herve@utdallas.edu
1School of Behavioral and Brain Sciences, The University of Texas at Dallas, MS: GR4.1, Richardson, TX 75080-3021, USA
2Department of Psychology, University of Toronto Scarborough, Ontario, Canada
DOI: 10.1002/wics.101

Principal component analysis (PCA) is probably the most popular multivariate statistical technique and it is used by almost all scientific disciplines. It is also likely to be the oldest multivariate technique. In fact, its origin can be traced back to Pearson1 or even Cauchy2 [see Ref 3, p. 416], or Jordan4 and also Cayley, Sylvester, and Hamilton [see Refs 5,6 for more details], but its modern instantiation was formalized by Hotelling,7 who also coined the term principal component. PCA analyzes a data table representing observations described by several dependent variables, which are, in general, inter-correlated. Its goal is to extract the important information from the data table and to express this information as a set of new orthogonal variables called principal components. PCA also represents the pattern of similarity of the observations and the variables by displaying them as points in maps [see Refs 8–10 for more details].

PREREQUISITE NOTIONS AND NOTATIONS

Matrices are denoted in upper case bold, vectors are denoted in lower case bold, and elements are denoted in lower case italic. Matrices, vectors, and elements from the same matrix all use the same letter (e.g., A, a, a). The transpose operation is denoted by the superscript T. The identity matrix is denoted I.

The data table to be analyzed by PCA comprises I observations described by J variables and it is represented by the I × J matrix X, whose generic element is x_{i,j}. The matrix X has rank L where L ≤ min{I, J}.

In general, the data table will be preprocessed before the analysis. Almost always, the columns of X will be centered so that the mean of each column is equal to 0 (i.e., X^T 1 = 0, where 0 is a J by 1 vector of zeros and 1 is an I by 1 vector of ones). If, in addition, each element of X is divided by √I (or √(I − 1)), the analysis is referred to as a covariance PCA because, in this case, the matrix X^T X is a covariance matrix. In addition to centering, when the variables are measured with different units, it is customary to standardize each variable to unit norm. This is obtained by dividing each variable by its norm (i.e., the square root of the sum of all the squared elements of this variable).
In this case, the analysis is referred to as a correlation PCA because, then, the matrix X^T X is a correlation matrix (most statistical packages use correlation preprocessing as a default).

The matrix X has the following singular value decomposition [SVD, see Refs 11–13 and Appendix B for an introduction to the SVD]:

    X = PΔQ^T    (1)

where P is the I × L matrix of left singular vectors, Q is the J × L matrix of right singular vectors, and Δ is the diagonal matrix of singular values. Note that Δ² is equal to Λ, which is the diagonal matrix of the (nonzero) eigenvalues of X^T X and XX^T.
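The preprocessing options above (centering, covariance PCA, correlation PCA) and the SVD of Eq. 1 can be sketched with numpy as follows. This is a minimal illustration rather than code from the article; the function name preprocess and its scale argument are our own labels.

```python
import numpy as np

def preprocess(X, scale=None):
    """Center the columns of X; optionally rescale as for a covariance or correlation PCA.

    scale=None   : centering only
    scale='cov'  : divide by sqrt(I) (or sqrt(I - 1)) so that X.T @ X is a covariance matrix
    scale='corr' : divide each centered column by its norm, so that X.T @ X is a correlation matrix
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                # column means become 0, i.e., X^T 1 = 0
    if scale == 'cov':
        Xc = Xc / np.sqrt(X.shape[0])
    elif scale == 'corr':
        Xc = Xc / np.linalg.norm(Xc, axis=0)
    return Xc

# SVD of the preprocessed table, X = P Delta Q^T (Eq. 1).
# np.linalg.svd returns the singular values as a vector and Q transposed.
Xc = preprocess(np.random.rand(10, 4))     # any I x J data table
P, delta, Qt = np.linalg.svd(Xc, full_matrices=False)
Q = Qt.T
```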
The inertia of a column is defined as the sum of the squared elements of this column and is computed as

    γ²_j = Σ_{i=1}^{I} x²_{i,j} .    (2)

The sum of all the γ²_j is denoted ℐ and it is called the inertia of the data table or the total inertia. Note that the total inertia is also equal to the sum of the squared singular values of the data table (see Appendix B).

The center of gravity of the rows [also called centroid or barycenter, see Ref 14], denoted g, is the vector of the means of each column of X. When X is centered, its center of gravity is equal to the 1 × J row vector 0^T.

The (Euclidean) distance of the i-th observation to g is equal to

    d²_{i,g} = Σ_{j=1}^{J} (x_{i,j} − g_j)² .    (3)

When the data are centered, Eq. 3 reduces to

    d²_{i,g} = Σ_{j=1}^{J} x²_{i,j} .    (4)

Note that the sum of all the d²_{i,g} is equal to ℐ, the inertia of the data table.

GOALS OF PCA

The goals of PCA are to

(1) extract the most important information from the data table;
(2) compress the size of the data set by keeping only this important information;
(3) simplify the description of the data set; and
(4) analyze the structure of the observations and the variables.

In order to achieve these goals, PCA computes new variables called principal components which are obtained as linear combinations of the original variables. The first principal component is required to have the largest possible variance (i.e., inertia; this component will therefore 'explain' or 'extract' the largest part of the inertia of the data table). The second component is computed under the constraint of being orthogonal to the first component and to have the largest possible inertia. The other components are computed likewise (see Appendix A for a proof). The values of these new variables for the observations are called factor scores, and these factor scores can be interpreted geometrically as the projections of the observations onto the principal components.

Finding the Components

In PCA, the components are obtained from the SVD of the data table X. Specifically, with X = PΔQ^T (cf. Eq. 1), the I × L matrix of factor scores, denoted F, is obtained as:

    F = PΔ .    (5)

The matrix Q gives the coefficients of the linear combinations used to compute the factor scores. This matrix can also be interpreted as a projection matrix because multiplying X by Q gives the values of the projections of the observations on the principal components. This can be shown by combining Eqs. 1 and 5 as:

    F = PΔ = PΔQ^T Q = XQ .    (6)

The components can also be represented geometrically by the rotation of the original axes. For example, if X represents two variables, the length of a word (Y) and the number of lines of its dictionary definition (W), such as the data shown in Table 1, then PCA represents these data by two orthogonal factors. The geometric representation of PCA is shown in Figure 1. In this figure, we see that the factor scores give the length (i.e., distance to the origin) of the projections of the observations on the components. This procedure is further illustrated in Figure 2. In this context, the matrix Q is interpreted as a matrix of direction cosines (because Q is orthonormal). The matrix Q is also called a loading matrix. In this context, the matrix X can be interpreted as the product of the factor score matrix by the loading matrix as:

    X = FQ^T  with  F^T F = Δ²  and  Q^T Q = I .    (7)
This decomposition is often called the bilinear decomposition of X [see, e.g., Ref 15].
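Eqs. 5-7 can be checked numerically on the word-length example of Table 1. The sketch below is ours, not the authors'; it recovers the factor scores and the two eigenvalues, 392 and 52, reported in Tables 1 and 2 (the signs of the components returned by the SVD may be flipped with respect to the tables).

```python
import numpy as np

# Raw data of Table 1: Y = number of letters, W = number of lines of the definition.
Y = np.array([3, 6, 2, 6, 2, 9, 6, 5, 9, 4, 7, 11, 5, 4, 3, 9, 10, 5, 4, 10], float)
W = np.array([14, 7, 11, 9, 9, 4, 8, 11, 5, 8, 2, 4, 12, 9, 8, 1, 4, 13, 15, 6], float)
X = np.column_stack([Y, W])
X = X - X.mean(axis=0)                 # centered data (M_Y = 6, M_W = 8)

P, delta, Qt = np.linalg.svd(X, full_matrices=False)
Q = Qt.T

F = P * delta                          # factor scores, F = P Delta (Eq. 5)
assert np.allclose(F, X @ Q)           # Eq. 6: F = XQ
assert np.allclose(X, F @ Q.T)         # Eq. 7: bilinear decomposition X = F Q^T

print(np.round(delta**2, 2))           # eigenvalues of X^T X: [392. 52.]
```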
TABLE 1 | Raw Scores, Deviations from the Mean, Coordinates, Squared Coordinates on the Components, Contributions of the Observations to the Components, Squared Distances to the Center of Gravity, and Squared Cosines of the Observations for the Example Length of Words (Y) and Number of Lines (W)

Word           Y    W    y    w     F1     F2  ctr1×100  ctr2×100    F1²    F2²    d²  cos²1×100  cos²2×100
Bag            3   14   −3    6   6.67   0.69        11         1  44.52   0.48    45         99          1
Across         6    7    0   −1  −0.84  −0.54         0         1   0.71   0.29     1         71         29
On             2   11   −4    3   4.68  −1.76         6         6  21.89   3.11    25         88         12
Insane         6    9    0    1   0.84   0.54         0         1   0.71   0.29     1         71         29
By             2    9   −4    1   2.99  −2.84         2        15   8.95   8.05    17         53         47
Monastery      9    4    3   −4  −4.99   0.38         6         0  24.85   0.15    25         99          1
Relief         6    8    0    0   0.00   0.00         0         0   0.00   0.00     0          0          0
Slope          5   11   −1    3   3.07   0.77         3         1   9.41   0.59    10         94          6
Scoundrel      9    5    3   −3  −4.14   0.92         5         2  17.15   0.85    18         95          5
With           4    8   −2    0   1.07  −1.69         0         5   1.15   2.85     4         29         71
Neither        7    2    1   −6  −5.60  −2.38         8        11  31.35   5.65    37         85         15
Pretentious   11    4    5   −4  −6.06   2.07         9         8  36.71   4.29    41         90         10
Solid          5   12   −1    4   3.91   1.30         4         3  15.30   1.70    17         90         10
This           4    9   −2    1   1.92  −1.15         1         3   3.68   1.32     5         74         26
For            3    8   −3    0   1.61  −2.53         1        12   2.59   6.41     9         29         71
Therefore      9    1    3   −7  −7.52  −1.23        14         3  56.49   1.51    58         97          3
Generality    10    4    4   −4  −5.52   1.23         8         3  30.49   1.51    32         95          5
Arise          5   13   −1    5   4.76   1.84         6         7  22.61   3.39    26         87         13
Blot           4   15   −2    7   6.98   2.07        12         8  48.71   4.29    53         92          8
Infectious    10    6    4   −2  −3.83   2.30         4        10  14.71   5.29    20         74         26
Σ            120  160    0    0      0      0       100       100    392     52   444
                                                                       λ1     λ2     ℐ

M_W = 8, M_Y = 6. The following abbreviations are used to label the columns: y = (Y − M_Y); w = (W − M_W). The contributions and the squared cosines are multiplied by 100 for ease of reading. The positive important contributions are italicized, and the negative important contributions are represented in bold.
Projecting New Observations onto the Components

Equation 6 shows that matrix Q is a projection matrix which transforms the original data matrix into factor scores. This matrix can also be used to compute factor scores for observations that were not included in the PCA. These observations are called supplementary or illustrative observations. By contrast, the observations actually used to compute the PCA are called active observations. The factor scores for supplementary observations are obtained by first positioning these observations into the PCA space and then projecting them onto the principal components. Specifically, a 1 × J row vector x^T_sup can be projected into the PCA space using Eq. 6. This gives the 1 × L vector of factor scores, denoted f^T_sup, which is computed as:

    f^T_sup = x^T_sup Q .    (8)

If the data table has been preprocessed (e.g., centered or normalized), the same preprocessing should be applied to the supplementary observations prior to the computation of their factor scores.

As an illustration, suppose that, in addition to the data presented in Table 1, we have the French word 'sur' (it means 'on'). It has Y_sur = 3 letters, and our French dictionary reports that its definition has W_sur = 12 lines. Because sur is not an English word, we do not want to include it in the analysis, but we would like to know how it relates to the English vocabulary. So, we decided to treat this word as a supplementary observation.

The first step is to preprocess this supplementary observation in a manner identical to the active observations. Because the data matrix was centered, the values of this observation are transformed into deviations from the English center of gravity. We find the following values:

    y_sur = Y_sur − M_Y = 3 − 6 = −3   and   w_sur = W_sur − M_W = 12 − 8 = 4 .

Then we plot the supplementary word in the graph that we have already used for the active analysis. Because the principal components and the original variables are in the same space, the projections of the supplementary observation give its coordinates (i.e., factor scores) on the components. This is shown in Figure 3.

FIGURE 1 | The geometric steps for finding the components of a principal component analysis. To find the components, (1) center the variables, then plot them against each other. (2) Find the main direction (called the first component) of the cloud of points such that the sum of the squared distances from the points to the component is minimum. Add a second component orthogonal to the first such that the sum of the squared distances is minimum. (3) When the components have been found, rotate the figure in order to position the first component horizontally (and the second component vertically), then erase the original axes.
Note that the final graph could have been obtained directly by plotting the observations from the coordinates given in Table 1.
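Continuing from the factor-score sketch above (reusing Q and the column means), the projection of the supplementary word 'sur' (Eq. 8) can be computed as follows; the printed values correspond to Eq. 9 below, up to the arbitrary sign of each component returned by the SVD.

```python
# Supplementary observation 'sur': Y = 3 letters, W = 12 lines (Eq. 8).
x_sup = np.array([3.0, 12.0]) - np.array([6.0, 8.0])   # same centering as the active data: [-3, 4]
f_sup = x_sup @ Q                                       # factor scores of the supplementary word
print(np.round(f_sup, 4))                               # approximately [4.9853, -0.3835], cf. Eq. 9
```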
Equivalently, the coordinates of the projections on the components can be directly computed from Eq. 8 (see also Table 3 for the values of Q) as:

    f^T_sup = x^T_sup Q = [−3  4] × [−0.5369  0.8437; 0.8437  0.5369] = [4.9853  −0.3835] .    (9)

FIGURE 2 | Plot of the centered data, with the first and second components. The projections (or coordinates) of the word 'neither' on the first and the second components are equal to −5.60 and −2.38.

FIGURE 3 | How to find the coordinates (i.e., factor scores) on the principal components of a supplementary observation: (a) the French word sur is plotted in the space of the active observations from its deviations to the W and Y variables; and (b) the projections of sur on the principal components give its coordinates.

INTERPRETING PCA

Inertia Explained by a Component

The importance of a component is reflected by its inertia or by the proportion of the total inertia 'explained' by this factor. In our example (see Table 2), the inertia of the first component is equal to 392 and this corresponds to 88% of the total inertia.

Contribution of an Observation to a Component

Recall that the eigenvalue associated with a component is equal to the sum of the squared factor scores for this component. Therefore, the importance of an observation for a component can be obtained by the ratio of the squared factor score of this observation to the eigenvalue associated with that component. This ratio is called the contribution of the observation to the component. Formally, the contribution of observation i to component ℓ, denoted ctr_{i,ℓ}, is obtained as:

    ctr_{i,ℓ} = f²_{i,ℓ} / Σ_i f²_{i,ℓ} = f²_{i,ℓ} / λ_ℓ    (10)

where λ_ℓ is the eigenvalue of the ℓ-th component. The value of a contribution is between 0 and 1 and, for a given component, the sum of the contributions of all observations is equal to 1. The larger the value of the contribution, the more the observation contributes to the component. A useful heuristic is to base the interpretation of a component on the observations whose contribution is larger than the average contribution (i.e., observations whose contribution is larger than 1/I). The observations with high contributions and different signs can then be opposed to help interpret the component because these observations represent the two endpoints of this component.

The factor scores of the supplementary observations are not used to compute the eigenvalues and therefore their contributions are generally not computed.
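A short continuation of the earlier sketch, computing the contributions of Eq. 10 for the example; the column sums equal 1, and the values times 100 correspond (up to rounding) to the ctr columns of Table 1.

```python
# Contributions of the observations to the components (Eq. 10).
eigenvalues = delta**2                     # lambda_1 = 392, lambda_2 = 52
ctr = F**2 / eigenvalues                   # ctr[i, l] = f_{i,l}^2 / lambda_l
assert np.allclose(ctr.sum(axis=0), 1.0)   # contributions sum to 1 for each component
important = ctr > 1.0 / F.shape[0]         # heuristic: contribution above the average 1/I
```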
TABLE 2 | Eigenvalues and Percentage of Explained Inertia by Each Component

Component   λi (eigenvalue)   Cumulated (eigenvalues)   Percent of Inertia   Cumulated (percentage)
1                       392                       392                88.29                    88.29
2                        52                       444                11.71                   100.00

Squared Cosine of a Component with an Observation

The squared cosine shows the importance of a component for a given observation. The squared cosine indicates the contribution of a component to the squared distance of the observation to the origin. It corresponds to the square of the cosine of the angle from the right triangle made with the origin, the observation, and its projection on the component, and is computed as:

    cos²_{i,ℓ} = f²_{i,ℓ} / Σ_ℓ f²_{i,ℓ} = f²_{i,ℓ} / d²_{i,g}    (11)

where d²_{i,g} is the squared distance of a given observation to the origin. The squared distance, d²_{i,g}, is computed (thanks to the Pythagorean theorem) as the sum of the squared values of all the factor scores of this observation (cf. Eq. 4). Components with a large value of cos²_{i,ℓ} contribute a relatively large portion to the total distance and therefore these components are important for that observation.

The distance to the center of gravity is defined for supplementary observations and the squared cosine can be computed and is meaningful. Therefore, the value of cos² can help find the components that are important to interpret both active and supplementary observations.

Loading: Correlation of a Component and a Variable

The correlation between a component and a variable estimates the information they share. In the PCA framework, this correlation is called a loading. Note that the sum of the squared coefficients of correlation between a variable and all the components is equal to 1. As a consequence, the squared loadings are easier to interpret than the loadings (because the squared loadings give the proportion of the variance of the variables explained by the components). Table 3 gives the loadings as well as the squared loadings for the word length and definition example.

It is worth noting that the term 'loading' has several interpretations. For example, as previously mentioned, the elements of matrix Q (cf. Eq. B.1) are also called loadings. This polysemy is a potential source of confusion, and therefore it is worth checking what specific meaning of the word 'loadings' has been chosen when looking at the outputs of a program or when reading papers on PCA. In general, however, different meanings of 'loadings' lead to equivalent interpretations of the components. This happens because the different types of loadings differ mostly by their type of normalization. For example, the correlations of the variables with the components are normalized such that the sum of the squared correlations of a given variable is equal to one; by contrast, the elements of Q are normalized such that the sum of the squared elements of a given component is equal to one.
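The squared cosines of Eq. 11 and the loadings defined as correlations of the variables with the components can be computed from the same factor scores. This is an illustrative continuation of the earlier sketch ('Relief' lies at the center of gravity, so its squared cosines are set to 0 rather than 0/0).

```python
# Squared cosines (Eq. 11).
d2 = (F**2).sum(axis=1)                                 # squared distance of each observation to g
with np.errstate(divide='ignore', invalid='ignore'):
    cos2 = np.where(d2[:, None] > 0, F**2 / d2[:, None], 0.0)

# Loadings as correlations between the (centered) variables and the factor scores.
loadings = np.array([[np.corrcoef(X[:, j], F[:, l])[0, 1] for l in range(F.shape[1])]
                     for j in range(X.shape[1])])
print(np.round(loadings, 4))                            # correlations of Y (row 1) and W (row 2) with components 1 and 2
assert np.allclose((loadings**2).sum(axis=1), 1.0)      # squared loadings of each variable sum to 1
```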
Plotting the Correlations/Loadings of the Variables with the Components

The variables can be plotted as points in the component space using their loadings as coordinates. This representation differs from the plot of the observations: the observations are represented by their projections, but the variables are represented by their correlations. Recall that the sum of the squared loadings for a variable is equal to one. Remember, also, that a circle is defined as the set of points with the property that the sum of their squared coordinates is equal to a constant. As a consequence, when the data are perfectly represented by only two components, the sum of the squared loadings is equal to one, and therefore, in this case, the loadings will be positioned on a circle which is called the circle of correlations. When more than two components are needed to represent the data perfectly, the variables will be positioned inside the circle of correlations. The closer a variable is to the circle of correlations, the better we can reconstruct this variable from the first two components (and the more important it is to interpret these components); the closer to the center of the plot a variable is, the less important it is for the first two components.

Figure 4 shows the plot of the loadings of the variables on the components. Each variable is a point whose coordinates are given by the loadings on the principal components.
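A possible way to draw such a circle of correlations with matplotlib, using the loadings computed above; the styling and labels are ours.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 5))
ax.add_patch(plt.Circle((0, 0), 1.0, fill=False))       # the circle of correlations (unit circle)
for name, (l1, l2) in zip(['Y (length)', 'W (lines)'], loadings):
    ax.plot(l1, l2, 'o')
    ax.annotate(name, (l1, l2))
ax.axhline(0, lw=0.5)
ax.axvline(0, lw=0.5)
ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
ax.set_xlabel('PC 1'); ax.set_ylabel('PC 2')
plt.show()
```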
We can also use supplementary variables to enrich the interpretation. A supplementary variable should be measured for the same observations used for the analysis (for all of them or part of them, because we only need to compute a coefficient of correlation). After the analysis has been performed, the coefficients of correlation (i.e., the loadings) between the supplementary variables and the components are computed. Then the supplementary variables are displayed in the circle of correlations using the loadings as coordinates.

For example, we can add two supplementary variables to the word length and definition example. These data are shown in Table 4. A table of loadings for the supplementary variables can be computed from the coefficients of correlation between these variables and the components (see Table 5). Note that, contrary to the active variables, the squared loadings of the supplementary variables do not add up to 1.

TABLE 3 | Loadings (i.e., Coefficients of Correlation between Variables and Components) and Squared Loadings

             Loadings             Squared Loadings         Q
Component      Y         W          Y         W           Y         W
1          −0.9927   −0.9810     0.9855    0.9624     −0.5369    0.8437
2           0.1203   −0.1939     0.0145    0.0376      0.8437    0.5369
Σ                                1.0000    1.0000

The elements of matrix Q are also provided.

TABLE 4 | Supplementary Variables for the Example Length of Words and Number of Lines

Word          Frequency   # Entries
Bag                   8           6
Across              230           3
On                  700          12
Insane                1           2
By                  500           7
Monastery             1           1
Relief                9           1
Slope                 2           6
Scoundrel             1           1
With                700           5
Neither               7           2
Pretentious           1           1
Solid                 4           5
This                500           9
For                 900           7
Therefore             3           1
Generality            1           1
Arise                10           4
Blot                  1           4
Infectious            1           2

'Frequency' is expressed as number of occurrences per 100,000 words; '# Entries' is obtained by counting the number of entries for the word in the dictionary.

TABLE 5 | Loadings (i.e., Coefficients of Correlation) and Squared Loadings between Supplementary Variables and Components

             Loadings                  Squared Loadings
Component    Frequency   # Entries    Frequency   # Entries
1              −0.3012      0.6999       0.0907      0.4899
2              −0.7218     −0.4493       0.5210      0.2019
Σ                                        0.6117      0.6918
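The loadings of the supplementary variables are plain correlations with the factor scores, so they can be obtained along the same lines as above. The arrays below transcribe Table 4 in the row order of Table 1; this is an illustrative sketch of the computation described in the text (cf. Table 5 for the values reported in the article).

```python
# Supplementary variables of Table 4: word frequency and number of dictionary entries.
freq = np.array([8, 230, 700, 1, 500, 1, 9, 2, 1, 700, 7, 1, 4, 500, 900, 3, 1, 10, 1, 1], float)
entries = np.array([6, 3, 12, 2, 7, 1, 1, 6, 1, 5, 2, 1, 5, 9, 7, 1, 1, 4, 4, 2], float)

sup_loadings = np.array([[np.corrcoef(v, F[:, l])[0, 1] for l in range(F.shape[1])]
                         for v in (freq, entries)])
print(np.round(sup_loadings, 4))             # correlations with components 1 and 2
print((sup_loadings**2).sum(axis=1))         # the squared loadings no longer sum to 1
```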
FIGURE 4 | Circle of correlations and plot of the loadings of (a) the variables with principal components 1 and 2, and (b) the variables and supplementary variables with principal components 1 and 2. Note that the supplementary variables are not positioned on the unit circle.

STATISTICAL INFERENCE: EVALUATING THE QUALITY OF THE MODEL

Fixed Effect Model

The results of PCA so far correspond to a fixed effect model (i.e., the observations are considered to be the population of interest, and conclusions are limited to these specific observations). In this context, PCA is descriptive and the amount of the variance of X explained by a component indicates its importance.

For a fixed effect model, the quality of the PCA model using the first M components is obtained by first computing the estimated matrix, denoted X̂^[M], which is matrix X reconstituted with the first M components. The formula for this estimation is obtained by combining Eqs. 1, 5, and 6 in order to obtain

    X = FQ^T = XQQ^T .    (12)

Then, the matrix X̂^[M] is built back using Eq. 12, keeping only the first M components:

    X̂^[M] = P^[M] Δ^[M] Q^[M]T = F^[M] Q^[M]T = X Q^[M] Q^[M]T    (13)

where P^[M], Δ^[M], and Q^[M] represent, respectively, the matrices P, Δ, and Q with only their first M components. Note, incidentally, that Eq. 7 can be rewritten in the current context as:

    X = X̂^[M] + E = F^[M] Q^[M]T + E    (14)

(where E is the error matrix, which is equal to X − X̂^[M]).

To evaluate the quality of the reconstitution of X with M components, we evaluate the similarity between X and X̂^[M]. Several coefficients can be used for this task [see, e.g., Refs 16–18]. The squared coefficient of correlation is sometimes used, as well as the RV coefficient.18,19 The most popular coefficient, however, is the residual sum of squares (RESS). It is computed as:

    RESS_M = ‖X − X̂^[M]‖² = trace(E^T E) = ℐ − Σ_{ℓ=1}^{M} λ_ℓ    (15)

where ‖ ‖ is the norm (i.e., the square root of the sum of all the squared elements of the matrix), and where the trace of a matrix is the sum of its diagonal elements. The smaller the value of RESS, the better the PCA model. For a fixed effect model, a larger M gives a better estimation of X. For a fixed effect model, the matrix X is always perfectly reconstituted with L components (recall that L is the rank of X).

In addition, Eq. 12 can be adapted to compute the estimation of the supplementary observations as

    x̂^[M]T_sup = x^T_sup Q^[M] Q^[M]T .    (16)

Random Effect Model

In most applications, the set of observations represents a sample from a larger population. In this case, the goal is to estimate the value of new observations from this population. This corresponds to a random effect model. In order to estimate the generalization capacity of the PCA model, we cannot use standard parametric procedures. Therefore, the performance of the PCA model is evaluated using computer-based resampling techniques such as the bootstrap and cross-validation techniques where the data are separated into a learning set and a testing set. A popular cross-validation technique is the jackknife (aka the 'leave one out' procedure). In the jackknife,20–22 each observation is dropped from the set in turn and the remaining observations constitute the learning set. The learning set is then used to estimate (using Eq. 16) the left-out observation, which constitutes the testing set. Using this procedure, each observation is estimated according to a random effect model. The predicted observations are then stored in a matrix denoted X̃.
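The fixed-effect reconstruction of Eq. 13, the RESS of Eq. 15, and a jackknife PRESS in the spirit of Eqs. 16-17 can be sketched as follows. This is a simplified illustration with our own function names; the table is assumed to be centered once with all observations rather than re-centered within each jackknife fold.

```python
import numpy as np

def reconstruct(X, M):
    """Estimate X with its first M components (Eq. 13)."""
    P, delta, Qt = np.linalg.svd(X, full_matrices=False)
    return (P[:, :M] * delta[:M]) @ Qt[:M, :]             # X_hat^[M] = F^[M] Q^[M]T

def ress(X, M):
    """Residual sum of squares (Eq. 15)."""
    return np.sum((X - reconstruct(X, M))**2)

def press(X, M):
    """Jackknife ('leave one out') predicted residual sum of squares (Eqs. 16-17)."""
    total = 0.0
    for i in range(X.shape[0]):
        X_train = np.delete(X, i, axis=0)                 # learning set without observation i
        _, _, Qt = np.linalg.svd(X_train, full_matrices=False)
        Q_M = Qt[:M, :].T                                 # J x M loadings from the learning set
        x_hat = X[i] @ Q_M @ Q_M.T                        # Eq. 16 applied to the left-out observation
        total += np.sum((X[i] - x_hat)**2)
    return total
```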
The overall quality of the PCA random effect model using M components is evaluated as the similarity between X and X̃^[M]. As with the fixed effect model, this can also be done with a squared coefficient of correlation or (better) with the RV coefficient. Similar to RESS, one can use the predicted residual sum of squares (PRESS). It is computed as:

    PRESS_M = ‖X − X̃^[M]‖² .    (17)

The smaller the PRESS, the better the quality of the estimation for a random model.

Contrary to what happens with the fixed effect model, the matrix X is not always perfectly reconstituted with all L components. This is particularly the case when the number of variables is larger than the number of observations (a configuration known as the 'small N large P' problem in the literature).

How Many Components?

Often, only the important information needs to be extracted from a data matrix. In this case, the problem is to figure out how many components need to be considered. This problem is still open, but there are some guidelines [see, e.g., Refs 8, 9, 23]. A first procedure is to plot the eigenvalues according to their size [the so-called 'scree', see Refs 8, 24 and Table 2] and to see if there is a point in this graph (often called an 'elbow') such that the slope of the graph goes from 'steep' to 'flat', and to keep only the components which are before the elbow. This procedure, somewhat subjective, is called the scree or elbow test.

Another standard tradition is to keep only the components whose eigenvalue is larger than the average. Formally, this amounts to keeping the ℓ-th component if

    λ_ℓ > (1/L) Σ_{ℓ=1}^{L} λ_ℓ = ℐ/L    (18)

(where L is the rank of X). For a correlation PCA, this rule boils down to the standard advice to 'keep only the eigenvalues larger than 1' [see, e.g., Ref 25]. However, this procedure can lead to ignoring important information [see Ref 26 for an example of this problem].

Random Model

As mentioned earlier, when using a random model, the quality of the prediction does not always increase with the number of components of the model. In fact, when the number of variables exceeds the number of observations, quality typically increases and then decreases. When the quality of the prediction decreases as the number of components increases, this is an indication that the model is overfitting the data (i.e., the information in the learning set is not useful to fit the testing set). Therefore, it is important to determine the optimal number of components to keep when the goal is to generalize the conclusions of an analysis to new data.

A simple approach stops adding components when PRESS stops decreasing. A more elaborate approach [see, e.g., Refs 27–31] begins by computing, for each component ℓ, a quantity denoted Q²_ℓ, which is defined as:

    Q²_ℓ = 1 − PRESS_ℓ / RESS_{ℓ−1}    (19)

with PRESS_ℓ (RESS_ℓ) being the value of PRESS (RESS) for the ℓ-th component (where RESS_0 is equal to the total inertia). Only the components with Q²_ℓ greater than or equal to an arbitrary critical value (usually 1 − 0.95² = 0.0975) are kept [an alternative set of critical values sets the threshold to 0.05 when I ≤ 100 and to 0 when I > 100; see Ref 28].

Another approach to decide upon the number of components to keep, also based on cross-validation, uses the index W_ℓ derived from Refs 32 and 33. In contrast to Q²_ℓ, which depends on RESS and PRESS, the index W_ℓ depends only upon PRESS.
Random Model
As mentioned earlier, when using a random model, the quality of the prediction does not always increase with the number of components of the model. In fact, when the number of variables exceeds the number of observations, quality typically increases and then decreases. When the quality of the prediction decreases as the number of components increases, this is an indication that the model is overfitting the data (i.e., the information in the learning set is not useful to fit the testing set). Therefore, it is important to determine the optimal number of components to keep when the goal is to generalize the conclusions of an analysis to new data.

A simple approach is to stop adding components as soon as PRESS stops decreasing. A more elaborate approach [see, e.g., Refs 27–31] begins by computing, for each component ℓ, a quantity denoted $Q^2_\ell$, defined as:

$$Q^2_\ell = 1 - \frac{\mathrm{PRESS}_{\ell}}{\mathrm{RESS}_{\ell-1}} \qquad (19)$$

with $\mathrm{PRESS}_{\ell}$ (respectively $\mathrm{RESS}_{\ell}$) being the value of PRESS (respectively RESS) for the ℓ-th component (where $\mathrm{RESS}_0$ is equal to the total inertia). Only the components with $Q^2_\ell$ greater than or equal to an arbitrary critical value (usually $1 - 0.95^2 = 0.0975$) are kept [an alternative set of critical values sets the threshold to 0.05 when I ≤ 100 and to 0 when I > 100; see Ref 28].

Another approach to deciding upon the number of components to keep, also based on cross-validation, uses the index $W_\ell$ derived from Refs 32 and 33. In contrast to $Q^2_\ell$, which depends on both RESS and PRESS, the index $W_\ell$ depends only upon PRESS. It is computed for the ℓ-th component as

$$W_\ell = \frac{\mathrm{PRESS}_{\ell-1} - \mathrm{PRESS}_{\ell}}{\mathrm{PRESS}_{\ell}} \times \frac{df_{\mathrm{residual},\,\ell}}{df_\ell}, \qquad (20)$$

where $\mathrm{PRESS}_0$ is the inertia of the data table and $df_\ell$ is the number of degrees of freedom for the ℓ-th component, equal to

$$df_\ell = I + J - 2\ell, \qquad (21)$$

and $df_{\mathrm{residual},\,\ell}$ is the residual number of degrees of freedom, which is equal to the total number of degrees of freedom of the table [equal to $J(I-1)$] minus the number of degrees of freedom used by the previous components. The value of $df_{\mathrm{residual},\,\ell}$ is obtained as:

$$df_{\mathrm{residual},\,\ell} = J(I-1) - \sum_{k=1}^{\ell} (I + J - 2k) = J(I-1) - \ell\,(I + J - \ell - 1). \qquad (22)$$

Most of the time, $Q^2_\ell$ and $W_\ell$ will agree on the number of components to keep, but $W_\ell$ tends to give a more conservative estimate than $Q^2_\ell$. When J is smaller than I, the values of both $Q^2_L$ and $W_L$ are meaningless because they both involve a division by zero.
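Both indices can be computed directly once a sequence of PRESS and RESS values is available (e.g., from the cross-validation scheme described above). The sketch below (Python/NumPy; the function name and the assumption that PRESS and RESS are supplied as arrays indexed from 0 to L are ours) implements Eqs. (19) through (22).

```python
import numpy as np

def q2_and_w(press, ress, I, J):
    """Cross-validation indices Q2 (Eq. 19) and W (Eqs. 20-22).

    press, ress : arrays of length L+1 holding PRESS_0 ... PRESS_L and
                  RESS_0 ... RESS_L, the 0-th entries being the total inertia
    I, J        : number of observations and number of variables of the table
    """
    press = np.asarray(press, dtype=float)
    ress = np.asarray(ress, dtype=float)
    L = press.size - 1
    ell = np.arange(1, L + 1)

    # Eq. (19): Q2_ell = 1 - PRESS_ell / RESS_(ell-1)
    q2 = 1.0 - press[1:] / ress[:-1]

    # Eq. (21): degrees of freedom of the ell-th component
    df = I + J - 2 * ell
    # Eq. (22): residual degrees of freedom after the first ell components
    df_residual = J * (I - 1) - ell * (I + J - ell - 1)

    # Eq. (20): W_ell, the relative gain in PRESS weighted by the df ratio
    w = (press[:-1] - press[1:]) / press[1:] * df_residual / df

    keep_q2 = q2 >= 1 - 0.95**2   # usual critical value of 0.0975
    return q2, w, keep_q2
```

Components are then retained while $Q^2_\ell$ stays above the critical value; as noted above, the last values $Q^2_L$ and $W_L$ are not interpretable when J < I.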
Bootstrapped Confidence Intervals
After the number of components to keep has been determined, we can compute confidence intervals for the eigenvalues of X using the bootstrap [Refs 34–39]. To use the bootstrap, we draw a large number of samples (e.g., 1000 or 10,000) with replacement from the learning set. Each sample produces a set of eigenvalues. The whole set of eigenvalues can then be used to compute confidence intervals.

ROTATION
After the number of components has been determined, and in order to facilitate the interpretation, the analysis often involves a rotation of the components that were retained [see, e.g., Refs 40 and 67 for more details]. Two main types of rotation are used: orthogonal, when the new axes are also orthogonal to each other, and oblique, when the new axes are not required to be orthogonal. Because the rotations are always performed in a subspace, the new axes will always explain less inertia than the original components (which are computed to be optimal). However, the part of the inertia explained by the total subspace after rotation is the same as it was before rotation (only the partition of the inertia has changed). It is also important to note that because rotation always takes place in a subspace (i.e., the space of the retained components), the choice of this subspace strongly influences the result of the rotation. Therefore, it is strongly recommended to try several sizes for the subspace of the retained components in order to assess the robustness of the interpretation of the rotation. When performing a rotation, the term loadings almost always refers to the elements of matrix Q. We will follow this tradition in this section.

Orthogonal Rotation
An orthogonal rotation is specified by a rotation matrix, denoted R, in which the rows stand for the original factors and the columns for the new (rotated) factors. At the intersection of row m and column n we have the cosine of the angle between the original axis and the new one: $r_{m,n} = \cos\theta_{m,n}$. A rotation matrix has the important property of being orthonormal because it corresponds to a matrix of direction cosines, and therefore $\mathbf{R}^{\mathsf{T}}\mathbf{R} = \mathbf{I}$.

Varimax rotation, developed by Kaiser [Ref 41], is the most popular rotation method. For varimax, a simple solution means that each component has a small number of large loadings and a large number of zero (or small) loadings. This simplifies the interpretation because, after a varimax rotation, each original variable tends to be associated with one (or a small number) of the components, and each component represents only a small number of variables. In addition, the components can often be interpreted from the opposition of a few variables with positive loadings to a few variables with negative loadings. Formally, varimax searches for a linear combination of the original factors such that the variance of the squared loadings is maximized, which amounts to maximizing

$$\nu = \sum \left( q_{j,\ell}^{2} - \overline{q}_{\ell}^{2} \right)^{2} \qquad (23)$$

with $q_{j,\ell}^{2}$ being the squared loading of the j-th variable of matrix Q on component ℓ and $\overline{q}_{\ell}^{2}$ being the mean of the squared loadings.
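To make the varimax criterion concrete, here is a minimal sketch (Python/NumPy). The function varimax_criterion evaluates ν of Eq. (23) for a matrix of loadings, and varimax_rotate applies the standard iterative SVD-based procedure usually attributed to Kaiser to find an orthogonal rotation that maximizes it; the function names and the stopping rule are ours, and the algorithm is the textbook procedure rather than code taken from this article.

```python
import numpy as np

def varimax_criterion(Q):
    """nu of Eq. (23): variance of the squared loadings, summed over components."""
    Q2 = Q**2
    return np.sum((Q2 - Q2.mean(axis=0))**2)

def varimax_rotate(Q, max_iter=100, tol=1e-6):
    """Orthogonal (varimax) rotation of a J x L loading matrix Q.

    Returns the rotated loadings Q @ R and the rotation matrix R (R.T @ R = I).
    """
    J, L = Q.shape
    R = np.eye(L)
    d = 0.0
    for _ in range(max_iter):
        Lam = Q @ R
        # update step: SVD of Q.T (Lam^3 - Lam diag(column sums of Lam^2) / J)
        u, s, vt = np.linalg.svd(
            Q.T @ (Lam**3 - Lam @ np.diag((Lam**2).sum(axis=0)) / J)
        )
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:   # stop when the criterion stabilizes
            break
    return Q @ R, R
```

After rotation, varimax_criterion evaluated on the rotated loadings should be at least as large as on the original ones, and the rotated columns are the loadings that are interpreted as described above.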
Oblique Rotations
With oblique rotations, the new axes are free to take any position in the component space, but the degree of correlation allowed among factors is small, because two highly correlated components are better interpreted as only one factor. Oblique rotations, therefore, relax the orthogonality constraint in order to gain simplicity in the interpretation. They were strongly recommended by Thurstone [Ref 42], but are used more rarely than their orthogonal counterparts.

For oblique rotations, the promax rotation has the advantage of being fast and conceptually simple. The first step in a promax rotation defines the target matrix, almost always obtained as the result of a varimax rotation whose entries are raised to some power (typically between 2 and 4) in order to force the structure of the loadings to become bipolar. The second step computes a least squares fit from the varimax solution to the target matrix. Promax rotations are interpreted by looking at the correlations (regarded as loadings) between the rotated axes and the original variables. An interesting recent development of the concept of oblique rotation corresponds to the technique of independent component analysis (ICA), in which the axes are computed in order to replace the notion of orthogonality by statistical independence [see Ref 43 for a tutorial].

When and Why to Use Rotations
The main reason for using rotation is to facilitate the interpretation. When the data follow a model (such as the psychometric model) stipulating (1) that each variable loads on only one factor and (2) that there is a clear difference in intensity between the relevant