coefficient. Similar to RESS, one can use the predicted residual sum of squares (PRESS). It is computed as:

\[
\text{PRESS}_M = \left\|\mathbf{X} - \widetilde{\mathbf{X}}^{[M]}\right\|^2 \tag{17}
\]

The smaller the PRESS, the better the quality of the estimation for a random model.

Contrary to what happens with the fixed effect model, the matrix $\mathbf{X}$ is not always perfectly reconstituted with all $L$ components. This is particularly the case when the number of variables is larger than the number of observations (a configuration known as the 'small $N$ large $P$' problem in the literature).
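To make the random-model evaluation concrete, the sketch below estimates $\text{PRESS}_M$ by leave-one-out cross-validation: each observation is held out in turn, the loadings are re-estimated from the remaining rows, and the held-out row is reconstructed from the first $M$ components. This is a minimal illustration rather than the paper's exact estimator; the function name `loo_press` and the projection-based reconstruction of the held-out row are our own simplifying choices.

```python
import numpy as np

def loo_press(X, M):
    """Leave-one-out PRESS for a PCA model with M components (a sketch).

    Each observation is held out in turn; the loadings are re-estimated
    from the remaining rows, and the held-out row is reconstructed by
    projecting it onto the first M loadings. Projecting the held-out row
    itself is a simplification; stricter schemes also hold out entries
    within the row.
    """
    I, J = X.shape
    total = 0.0
    for i in range(I):
        train = np.delete(X, i, axis=0)
        mean = train.mean(axis=0)
        # Right singular vectors of the centered training data = loadings.
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        V = Vt[:M].T                       # J x M matrix of loadings
        x = X[i] - mean                    # held-out row, centered like the training set
        x_hat = V @ (V.T @ x)              # reconstruction from M components
        total += np.sum((x - x_hat) ** 2)
    return total
```

Evaluating `loo_press(X, M)` for $M = 1, 2, \ldots$ traces the prediction-quality curve discussed above; in the 'small $N$ large $P$' setting it typically falls and then rises again as components are added.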
How Many Components?

Often, only the important information needs to be extracted from a data matrix. In this case, the problem is to figure out how many components need to be considered. This problem is still open, but there are some guidelines [see, e.g., Refs 9,8,23]. A first procedure is to plot the eigenvalues according to their size [the so-called 'scree'; see Refs 8,24 and Table 2] and to see if there is a point in this graph (often called an 'elbow') such that the slope of the graph goes from 'steep' to 'flat', and to keep only the components which come before the elbow. This procedure, somewhat subjective, is called the scree or elbow test.

Another standard tradition is to keep only the components whose eigenvalue is larger than the average eigenvalue. Formally, this amounts to keeping the $\ell$-th component if

\[
\lambda_\ell > \frac{1}{L}\sum_{\ell=1}^{L}\lambda_\ell = \frac{1}{L}\,\mathcal{I} \tag{18}
\]

(where $L$ is the rank of $\mathbf{X}$ and $\mathcal{I}$ is the total inertia). For a correlation PCA, this rule boils down to the standard advice to 'keep only the eigenvalues larger than 1' [see, e.g., Ref 25]. However, this procedure can lead to ignoring important information [see Ref 26 for an example of this problem].
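The average-eigenvalue rule is easy to automate, while the elbow is usually judged by eye on the plotted scree. Below is a minimal sketch, assuming the eigenvalues are taken as the squared singular values of the centered (and, for a correlation PCA, standardized) data table; all names are ours.

```python
import numpy as np

def pca_eigenvalues(X, correlation=False):
    """Eigenvalues of a PCA of X, sorted in decreasing order.

    Columns are centered; if correlation=True they are also scaled to
    unit variance, giving the eigenvalues of a correlation PCA.
    """
    Xc = X - X.mean(axis=0)
    if correlation:
        Xc = Xc / Xc.std(axis=0)
    # Squared singular values of the preprocessed table are the
    # eigenvalues of its cross-product matrix.
    return np.linalg.svd(Xc, compute_uv=False) ** 2

def keep_above_average(eigenvalues):
    """Average-eigenvalue rule of Eq. (18): count the components whose
    eigenvalue exceeds the mean eigenvalue."""
    eigenvalues = np.asarray(eigenvalues)
    return int(np.sum(eigenvalues > eigenvalues.mean()))
```

Plotting the eigenvalues against their rank gives the scree on which the elbow test is applied informally; `keep_above_average` implements the formal rule of Eq. (18), whose verdict is unaffected by the overall scaling of the eigenvalues since each one is compared with their mean.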
Random Model

As mentioned earlier, when using a random model, the quality of the prediction does not always increase with the number of components of the model. In fact, when the number of variables exceeds the number of observations, quality typically increases and then decreases. When the quality of the prediction decreases as the number of components increases, this is an indication that the model is overfitting the data (i.e., the information in the learning set is not useful to fit the testing set). Therefore, it is important to determine the optimal number of components to keep when the goal is to generalize the conclusions of an analysis to new data.

A simple approach stops adding components when PRESS stops decreasing. A more elaborate approach [see, e.g., Refs 27–31] begins by computing, for each component $\ell$, a quantity denoted $Q^2_\ell$, defined as:

\[
Q^2_\ell = 1 - \frac{\text{PRESS}_\ell}{\text{RESS}_{\ell-1}} \tag{19}
\]

with $\text{PRESS}_\ell$ ($\text{RESS}_\ell$) being the value of PRESS (RESS) for the $\ell$-th component (where $\text{RESS}_0$ is equal to the total inertia). Only the components with $Q^2_\ell$ greater than or equal to an arbitrary critical value (usually $1 - 0.95^2 = 0.0975$) are kept [an alternative set of critical values sets the threshold to 0.05 when $I \leq 100$ and to 0 when $I > 100$; see Ref 28].
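Given the PRESS and RESS sequences (indexed by the number of components, with index 0 holding the total inertia), $Q^2_\ell$ and the retention rule of Eq. (19) translate directly into code. A minimal sketch under our own naming:

```python
import numpy as np

def q2(press, ress):
    """Q2 of Eq. (19) for components 1, ..., L.

    press[l] and ress[l] are the PRESS and RESS values for the model
    with l components; ress[0] must be the total inertia of the table.
    """
    press = np.asarray(press, dtype=float)
    ress = np.asarray(ress, dtype=float)
    return 1.0 - press[1:] / ress[:-1]

def keep_by_q2(press, ress, threshold=1 - 0.95**2):
    """Count the leading components whose Q2 reaches the critical value
    (0.0975 by default, as quoted in the text)."""
    kept = 0
    for score in q2(press, ress):
        if score < threshold:   # stop at the first component that fails
            break
        kept += 1
    return kept
```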
Another approach to deciding upon the number of components to keep, also based on cross-validation, uses the index $W_\ell$ derived from Refs 32 and 33. In contrast to $Q^2_\ell$, which depends on both RESS and PRESS, the index $W_\ell$ depends only upon PRESS. It is computed for the $\ell$-th component as

\[
W_\ell = \frac{\text{PRESS}_{\ell-1} - \text{PRESS}_\ell}{\text{PRESS}_\ell} \times \frac{df_{\text{residual},\,\ell}}{df_\ell}, \tag{20}
\]

where $\text{PRESS}_0$ is the inertia of the data table, $df_\ell$ is the number of degrees of freedom for the $\ell$-th component, equal to

\[
df_\ell = I + J - 2\ell, \tag{21}
\]

and $df_{\text{residual},\,\ell}$ is the residual number of degrees of freedom, which is equal to the total number of degrees of freedom of the table [equal to $J(I-1)$] minus the number of degrees of freedom used by the previous components. The value of $df_{\text{residual},\,\ell}$ is obtained as:

\[
df_{\text{residual},\,\ell} = J(I-1) - \sum_{k=1}^{\ell}(I + J - 2k) = J(I-1) - \ell\,(I + J - \ell - 1). \tag{22}
\]

Most of the time, $Q^2_\ell$ and $W_\ell$ will agree on the number of components to keep, but $W_\ell$ can give a more conservative estimate than $Q^2_\ell$. When $J$ is smaller than $I$, the values of both $Q^2_L$ and $W_L$ are meaningless because they both involve a division by zero.
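The $W_\ell$ index is likewise mechanical once the PRESS sequence is available. A minimal sketch, assuming `press[l]` holds the PRESS value for the model with $l$ components and `press[0]` the total inertia of the $I \times J$ table; the function name is ours:

```python
import numpy as np

def w_index(press, I, J):
    """W index of Eq. (20) for components 1, ..., L.

    Degrees of freedom follow Eqs. (21) and (22): the l-th component
    uses I + J - 2l of them, and the residual degrees of freedom are
    J(I - 1) minus those consumed by components 1 through l.
    """
    press = np.asarray(press, dtype=float)
    L = len(press) - 1
    ell = np.arange(1, L + 1)
    df = I + J - 2 * ell                                  # Eq. (21)
    df_residual = J * (I - 1) - ell * (I + J - ell - 1)   # Eq. (22)
    return (press[:-1] - press[1:]) / press[1:] * (df_residual / df)
```

The resulting component counts can then be compared with those from the $Q^2$ rule above; as the text notes, $W_\ell$ tends to give the more conservative of the two answers.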