... one sees another big road, turn left or right and drive down this road, and so forth. In this analogy, PCA requires that each new road explored must be perpendicular to the previous, but clearly this requirement is overly stringent and the data (or town) might be arranged along non-orthogonal axes, such as Figure 6b. Figure 6 provides two examples of this type of data where PCA provides unsatisfying results.

To address these problems, we must define what we consider optimal results. In the context of dimensional reduction, one measure of success is the degree to which a reduced representation can predict the original data. In statistical terms, we must define an error function (or loss function). It can be proved that, under a common loss function, mean squared error (i.e. the L2 norm), PCA provides the optimal reduced representation of the data. This means that selecting orthogonal directions for principal components is the best solution to predicting the original data. Given the examples of Figure 6, how could this statement be true? Our intuitions from Figure 6 suggest that this result is somehow misleading.

The solution to this paradox lies in the goal we selected for the analysis. The goal of the analysis is to decorrelate the data or, said in other terms, to remove second-order dependencies in the data. In the data sets of Figure 6, higher-order dependencies exist between the variables. Therefore, removing second-order dependencies is insufficient for revealing all structure in the data.⁷

Multiple solutions exist for removing higher-order dependencies. For instance, if prior knowledge about the problem is available, then a nonlinearity (i.e. a kernel) might be applied to the data to transform it to a more appropriate naive basis. For instance, in Figure 6a, one might examine the polar coordinate representation of the data. This parametric approach is often termed kernel PCA.
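To make this concrete, here is a minimal sketch (not from the original paper) of the polar-coordinate idea for ring-shaped data in the spirit of Figure 6a. The synthetic data, the variable names, and the hand-rolled PCA helper are illustrative assumptions. On the raw Cartesian coordinates the variance is split evenly between two orthogonal directions, while after the nonlinear re-parameterization nearly all of it lies along the angle variable.

    # Illustrative sketch: a polar-coordinate nonlinearity for ring-shaped data
    # (in the spirit of Figure 6a). Data and names are assumptions, not from the paper.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic ring: points at radius ~1 with small radial noise.
    theta = rng.uniform(0, 2 * np.pi, size=500)
    r = 1.0 + 0.05 * rng.standard_normal(500)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])  # Cartesian data, shape (500, 2)

    def pca_eigenvalues(data):
        """Eigenvalues of the covariance of mean-centered data, largest first."""
        centered = data - data.mean(axis=0)
        cov = centered.T @ centered / (len(centered) - 1)
        evals = np.linalg.eigvalsh(cov)          # covariance is symmetric
        return evals[::-1]

    # Raw coordinates: variance is spread evenly over two orthogonal directions,
    # so no single principal component summarizes the ring well.
    print(pca_eigenvalues(X))

    # After the polar re-parameterization (radius, angle): almost all variance
    # lies along the angle variable and the radius variable is nearly constant.
    X_polar = np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                               np.arctan2(X[:, 1], X[:, 0])])
    print(pca_eigenvalues(X_polar))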
Another direction is to impose more general statistical definitions of dependency within a data set, e.g. requiring that data along reduced dimensions be statistically independent. This class of algorithms, termed independent component analysis (ICA), has been demonstrated to succeed in many domains where PCA fails. ICA has been applied to many areas of signal and image processing, but suffers from the fact that solutions are (sometimes) difficult to compute.
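The following sketch contrasts the two approaches on a toy mixture of two independent, non-Gaussian sources. It assumes scikit-learn's PCA and FastICA are available; the mixing matrix and the signals are invented for illustration and are not taken from the paper. PCA only decorrelates the observed mixture, whereas ICA recovers the sources up to permutation, sign, and scale.

    # Illustrative sketch (not from the original paper): ICA vs. PCA on a linear
    # mixture of two independent, non-Gaussian sources. Assumes scikit-learn.
    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)

    # Two independent non-Gaussian sources: a sawtooth and uniform noise.
    s1 = np.mod(t, 1.0) - 0.5
    s2 = rng.uniform(-0.5, 0.5, size=t.size)
    S = np.column_stack([s1, s2])

    A = np.array([[1.0, 0.5],    # assumed mixing matrix
                  [0.7, 1.0]])
    X = S @ A.T                  # observed mixed signals

    # PCA merely decorrelates the mixture (second-order statistics only).
    X_pca = PCA(n_components=2).fit_transform(X)

    # ICA additionally seeks statistical independence.
    X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)

    # Absolute correlations between recovered components and the true sources:
    # here the PCA components still mix both sources, while the ICA components
    # line up with the true sources (up to order and sign).
    print(np.round(np.abs(np.corrcoef(X_pca.T, S.T)[:2, 2:]), 2))
    print(np.round(np.abs(np.corrcoef(X_ica.T, S.T)[:2, 2:]), 2))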
Writing this paper has been an extremely instructional experience for me. I hope that this paper helps to demystify the motivation and results of PCA, and the underlying assumptions behind this important analysis technique. Please send me a note if this has been useful to you, as it inspires me to keep writing!

⁷ When are second-order dependencies sufficient for revealing all dependencies in a data set? This statistical condition is met when the first- and second-order statistics are sufficient statistics of the data. This occurs, for instance, when a data set is Gaussian distributed.

APPENDIX A: Linear Algebra

This section proves a few unapparent theorems in linear algebra which are crucial to this paper.

1. The inverse of an orthogonal matrix is its transpose.

Let A be an m × n orthogonal matrix where a_i is the ith column vector. Because the columns of an orthogonal matrix are orthonormal, the ijth element of A^T A is

    (A^T A)_{ij} = a_i^T a_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}

Therefore, because A^T A = I, it follows that A^{-1} = A^T.

2. For any matrix A, A^T A and A A^T are symmetric.

    (A A^T)^T = A^{TT} A^T = A A^T
    (A^T A)^T = A^T A^{TT} = A^T A

3. A matrix is symmetric if and only if it is orthogonally diagonalizable.

Because this statement is bi-directional, it requires a two-part "if-and-only-if" proof: one needs to prove the forward and the backward "if-then" cases.

Let us start with the forward case: if A is orthogonally diagonalizable, then A is a symmetric matrix. By hypothesis, orthogonally diagonalizable means that there exists some E such that A = E D E^T, where D is a diagonal matrix and E is some special matrix which diagonalizes A. Let us compute A^T.

    A^T = (E D E^T)^T = E^{TT} D^T E^T = E D E^T = A

where the last step uses the fact that a diagonal matrix satisfies D^T = D. Evidently, if A is orthogonally diagonalizable, it must also be symmetric.

The reverse case is more involved and less clean, so it will be left to the reader. In lieu of this, hopefully the "forward" case is suggestive if not somewhat convincing.

4. A symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors.

Let A be a square n × n symmetric matrix with associated eigenvectors {e_1, e_2, ..., e_n}. Let E = [e_1 e_2 ... e_n], where the ith column of E is the eigenvector e_i. This theorem asserts that there exists a diagonal matrix D such that A = E D E^T.

This proof is in two parts. In the first part, we see that any matrix can be orthogonally diagonalized if and only if that matrix's eigenvectors are all linearly independent. In the second part of the proof, we see that a symmetric matrix has the special property that all of its eigenvectors are not just linearly independent but also orthogonal, thus completing our proof.
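As an added illustration of theorem 4 (this worked example is not part of the original appendix), consider

    A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \det(A - \lambda I) = (2 - \lambda)^2 - 1 = 0 \;\Rightarrow\; \lambda_1 = 3,\ \lambda_2 = 1.

The corresponding unit eigenvectors e_1 = \frac{1}{\sqrt{2}}(1, 1)^T and e_2 = \frac{1}{\sqrt{2}}(1, -1)^T are orthogonal, so

    E = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix},

and one verifies directly that E^T E = I and E D E^T = A, exactly as the theorem asserts.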
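For readers who prefer a numerical sanity check, the short NumPy script below (an addition, not part of the paper) verifies the appendix identities on a random example; the matrix size, the seed, and the tolerances are arbitrary assumptions.

    # Numerical sanity check of the appendix theorems on a random example.
    # Illustrative addition; size and seed are arbitrary.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 5

    # Theorem 2: for any matrix A, A^T A and A A^T are symmetric.
    M = rng.standard_normal((n, n))
    assert np.allclose(M.T @ M, (M.T @ M).T)
    assert np.allclose(M @ M.T, (M @ M.T).T)

    # Theorems 3 and 4: a symmetric matrix is orthogonally diagonalized by a
    # matrix of its orthonormal eigenvectors, A = E D E^T.
    A = M + M.T                              # build a symmetric matrix
    evals, E = np.linalg.eigh(A)             # eigh returns orthonormal eigenvectors
    D = np.diag(evals)
    assert np.allclose(E @ D @ E.T, A)

    # Theorem 1: the inverse of an orthogonal matrix is its transpose.
    # The eigenvector matrix E above is orthogonal, so E^T E = I and E^{-1} = E^T.
    assert np.allclose(E.T @ E, np.eye(n))
    assert np.allclose(np.linalg.inv(E), E.T)

    print("all appendix identities hold numerically")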