

18.338J/16.394J: The Mathematics of Infinite Random Matrices
Essentials of Finite Random Matrix Theory
Alan Edelman
Handout #6, Tuesday, September 28, 2004

This handout provides the essential elements needed to understand finite random matrix theory. A cursory observation should reveal that the tools for infinite random matrix theory are quite different from the tools for finite random matrix theory. Nonetheless, there are significantly more published applications that use finite random matrix theory as opposed to infinite random matrix theory. Our belief is that many of the results that have been historically derived using finite random matrix theory can be reformulated and answered using infinite random matrix theory. In this sense, it is worth recognizing that in many applications it is an integral of a function of the eigenvalues that is more important than the mere distribution of the eigenvalues. For finite random matrix theory, the tools that often come into play when setting up such integrals are the Matrix Jacobians, the Joint Eigenvalue Densities and the Cauchy-Binet theorem. We describe these in subsequent sections.

1 Matrix and Vector Differentiation

In this section, we concern ourselves with the differentiation of matrices. Differentiating matrix and vector functions is not significantly harder than differentiating scalar functions, except that we need notation to keep track of the variables. We titled this section "matrix and vector" differentiation, but of course it is the function that we differentiate. The matrix or vector is just a notational package for the scalar functions involved. In the end, a derivative is nothing more than the "linearization" of all the involved functions.

We find it useful to think of this linearization both symbolically (for manipulative purposes) as well as numerically (in the sense of small numerical perturbations). The differential notation idea captures these viewpoints very well.

We begin with the familiar product rule for scalars,

    d(uv) = u(dv) + v(du),

from which we can derive that d(x^3) = 3x^2 dx. We refer to dx as a differential. We all unconsciously interpret the "dx" symbolically as well as numerically. Sometimes it is nice to confirm on a computer that

    ((x + ε)^3 − x^3)/ε ≈ 3x^2.                                    (1)

I do this by taking x to be 1 or 2 or randn(1) and ε to be .001 or .0001.

The product rule holds for matrices as well:

    d(UV) = U(dV) + (dU)V.

In the examples we will see some symbolic and numerical interpretations.

Example 1: Y = X^3

We use the product rule to differentiate Y(X) = X^3 to obtain that

    d(X^3) = X^2(dX) + X(dX)X + (dX)X^2.
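The scalar confirmation (1) suggested above takes only two MATLAB lines; a minimal sketch (the particular values of x and e are just one possible choice):

    x = randn(1);  e = .0001;
    ((x + e)^3 - x^3)/e      % compare with ...
    3*x^2

The two printed values agree to roughly three or four digits, and the agreement improves as e shrinks. The matrix analogue for Example 1 appears as equation (2) below.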


When I introduce undergraduate students to matrix multiplication, I tell them that matrices are like scalars, except that they do not commute. The numerical (or first order perturbation theory) interpretation applies, but it may seem less familiar at first. Numerically, take X=randn(n) and E=randn(n) for ε = .001 say, and then compute

    ((X + εE)^3 − X^3)/ε ≈ X^2 E + XEX + EX^2.                     (2)

This is the matrix version of (1). Holding X fixed and allowing E to vary, the right-hand side is a linear function of E. There is no simpler form possible.

Symbolically (or numerically) one can take dX = E_kl, which is the matrix that has a one in element (k, l) and 0 elsewhere. Then we can write down the matrix of partial derivatives:

    ∂X^3/∂x_kl = X^2 E_kl + X E_kl X + E_kl X^2.

As we let k, l vary over all possible indices, we obtain all the information we need to compute the derivative in any general direction E.

In general, the directional derivative of Y_ij(X) in the direction dX is given by (dY)_ij. For a particular matrix X, dY(X) is a matrix of directional derivatives corresponding to a first order perturbation in the direction E = dX. It is a matrix of linear functions corresponding to the linearization of Y(X) about X.

Structured Perturbations

We sometimes restrict our E to be a structured perturbation. For example, if X is triangular, symmetric, antisymmetric, or even sparse, then often we wish to restrict E so that the pattern is maintained in the perturbed matrix as well. An important case occurs when X is orthogonal. We will see in an example below that we will want to restrict E so that X^T E is antisymmetric when X is orthogonal.

Example 2: y = x^T x

Here y is a scalar and dot products commute so that dy = 2x^T dx. When y = 1, x is on the unit sphere. To stay on the sphere, we need dy = 0 so that x^T dx = 0, i.e., the tangent to the sphere is perpendicular to the sphere. Note the two uses of dy. In the first case it is the change to the squared length of x. In the second case, by setting dy = 0, we find perturbations to x which to first order do not change the length at all. Indeed, if one computes (x + dx)^T (x + dx) for a small finite dx, one sees that if x^T dx = 0 then the length changes only to second order. Geometrically, one can draw a tangent to a circle. The distance to the circle is second order in the distance along the tangent.

Example 3: y = x^T A x

Again y is a scalar. We have dy = dx^T A x + x^T A dx. If A is symmetric then dy = 2x^T A dx.

Example 4: Y = X^{-1}

We have that XY = I, so that X(dY) + (dX)Y = 0, so that dY = −X^{-1} dX X^{-1}. We recommend that the reader compute ε^{-1}((X + εE)^{-1} − X^{-1}) numerically and verify that it is equal to −X^{-1} E X^{-1}. In other words,

    (X + εE)^{-1} = X^{-1} − ε X^{-1} E X^{-1} + O(ε^2).

Example 5: I = Q^T Q

If Q is orthogonal we have that Q^T dQ + dQ^T Q = 0, so that Q^T dQ is antisymmetric. In general, d(Q^T Q) = Q^T dQ + dQ^T Q, but with no orthogonality condition on Q, there is no antisymmetry condition on Q^T dQ.

If y is a scalar function of x_1, x_2, ..., x_n, then we have the "chain rule"

    dy = (∂y/∂x_1) dx_1 + (∂y/∂x_2) dx_2 + ... + (∂y/∂x_n) dx_n.
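The matrix confirmations suggested above, for equation (2) and for Example 4, are equally short; a minimal MATLAB sketch (our own variable names), where both printed norms are small and shrink in proportion to e:

    n = 4;  e = 1e-5;
    X = randn(n);  E = randn(n);
    norm(((X + e*E)^3 - X^3)/e - (X^2*E + X*E*X + E*X^2))    % checks (2)
    norm((inv(X + e*E) - inv(X))/e + inv(X)*E*inv(X))        % checks Example 4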


Often we wish to apply the chain rule to every element of a vector or matrix. Let X = [p q; r s] and

    Y = X^2 = [ p^2 + qr    q(p + s) ]
              [ r(p + s)    qr + s^2 ];

then

    dY = X dX + dX X.                                              (3)

2 Matrix Jacobians (getting started)

2.1 Definition

Let x ∈ R^n and y = y(x) ∈ R^n be a differentiable function of x. It is well known that the Jacobian matrix

    J = [ ∂y_1/∂x_1  ...  ∂y_1/∂x_n ]
        [    ...     ...     ...    ]  =  ( ∂y_i/∂x_j )_{i,j=1,2,...,n}
        [ ∂y_n/∂x_1  ...  ∂y_n/∂x_n ]

evaluated at a point x approximates y(x) by a linear function. Intuitively y(x + δx) ≈ y(x) + J δx, i.e., J is the matrix that allows us to invoke perturbation theory. The function y may be viewed as performing a change of variables.

Furthermore (intuitively) if a little box of n dimensional volume ε surrounds x, then it is transformed by y into a parallelopiped of volume |det J| ε around y(x). Therefore the Jacobian |det J| is the magnification factor for volumes.

If we are integrating some function of y ∈ R^n as in ∫ p(y) dy (where dy = dy_1 ... dy_n), then when we change variables from y to x, where y = y(x), the integral becomes ∫ p(y(x)) |det(∂y_i/∂x_j)| dx. For many people this becomes a matter of notation, but one should understand intuitively that the Jacobian tells you how volume elements scale.

The determinant is 0 exactly where the change of variables breaks down. It is always instructive to see when this happens. Either there is no "x" locally for each "y" or there are many, as in the example of polar coordinates at the origin.

Notation: throughout this book, J denotes the Jacobian matrix. (Sometimes called the derivative or simply the Jacobian in the literature.) We will consistently write det J for the Jacobian determinant (unfortunately also called the Jacobian in the literature). When we say Jacobian, we will be talking about both.

2.2 Simple Examples (n = 2)

We get our feet wet with some simple 2 × 2 examples. Every reader is familiar with changing scalar variables as in

    ∫ f(x) dx = ∫ f(y^2)(2y) dy.

We want the reader to be just as comfortable when f is a scalar function of a matrix and we change X = Y^2:

    ∫ f(X)(dX) = ∫ f(Y^2)(Jacobian)(dY).

One can compute all of the 2 by 2 Jacobians that follow by hand, but in some cases it can be tedious and hard to get right on the first try. Code 8.1 in MATLAB takes away the drudgery and gives the right answer. Later we will learn fancy ways to get the answer without too much drudgery and also without the aid of a computer. We now consider the following examples:


  " # " # 2 × 2 Example 2: Matrix Cube (Y = X3) 2 × 2 Example 3: Matrix Inverse (Y = X−1) 2 × 2 Example 4: Linear Transformation (Y = AX + B) 2 × 2 Example 5: The LU Decomposition (X = LU) 2 × 2 Example 6: A Symmetric Decomposition (S = DMD) 2 × 2 Example 7: Traceless Symmetric = Polar Decomposition (S = QΛQT , tr S = 0) 2 × 2 Example 8: The Symmetric Eigenvalue Problem (S = QΛQT ) 2 × 2 Example 9: Symmetric Congruence (Y = ATSA) Discussion: Example 1: Matrix Square (Y = X2) " # With X = p q and Y = X2 the Jacobian matrix of interest is r s ∂p ∂r ∂q ∂s   2p q r 0 ∂Y11   r p + s 0 r   ∂Y21 J =  q 0 p + s q  ∂Y12 0 q r 2s ∂Y22 On this first example we label the columns and rows so that the elements correspond to the definition J = ∂Yij . Later we will omit the labels. We invite readers to compare with Equation (3). We see that ∂Xkl the Jacobian matrix and the differential contain the same information. We can compute then det J = 4(p + s) 2(sp − qr) = 4(tr X) 2 det(X). Notice that breakdown occurs if X is singular or has trace zero. Example 2: Matrix Cube (Y = X3) p q With X = and Y = X3 r s  3 p2 + 2 qr pq + q (p + s) 2 rp + sr qr     2 rp + sr p2 + 2 qr + (p + s) s r2 rp + 2 sr    J =  ,  2 pq + qs q2 p (p + s) + 2 qr + s2 pq + 2 qs    qr pq + 2 qs r (p + s) + sr 2 qr + 3 s2 so that 2 2 det J = 9(sp − qr) 2(qr + p + s + sp) 2 = 9 (det X) 2(tr X2 + (tr X) 2) 2 . 4 Breakdown occurs if X is singular or if the eigenvalue ratio is a complex cube root of unity. Example 3: Matrix Inverse (Y = X−1) p q With X = and Y = X−1 r s  2  −s qs sr −qr    sr −ps 2 rp  1  −r  J =  , det X2 ×  qs −q2 −ps pq    − 2 qr pq rp −p


" # " # " # 6   " #   so that det J = (det X) −4 . Breakdown occurs if X is singular. Example 4: Linear Transformation (Y = AX + B)   a b 0 0    c d 0 0    J =    0 0 a b    0 0 c d The Jacobian matrix has two copies of the constant matrix A so that det J = det A2 = (det A)2 . Breakdown occurs if A is singular. Example 5: The LU Decomposition (X = LU) The LU Decomposition computes a lower triangular L with ones on the diagonal and an upper triangular U such that X = LU. For a general 2 × 2 matrix it takes the form p q 1 0 p q = r s which exists when p = 0. r p p ps−qr 1 0 Notice the function of four variables might be written: y1 = p y2 = r/p y3 = q y4 = (ps − qr)/p = det(X)/p The Jacobian matrix is itself lower triangular   1 0 0 0  p−1  0 0 r p2  ,  0 0 1 0      −  J =  qr p2 so that det J = 1/p. Breakdown occurs when p = 0. Example 6: A Symmetric Decomposition (S = DMD) Any symmetric matrix X = p r may be written as r s √p 0 1 r/√ps X = DMD where D = . 0 √s r/√ps 1 Since X is symmetric, there are three independent variables, and thus the Jacobian matrix is 3 × 3. The three independent elements in D and M may be thought of as functions of p, r, and s: namely y1 = √p y2 = √s and y3 = r/√ps q p r p − − 1 and M =


p " # " #      The Jacobian matrix is  1  √p 0 0   1  √  s J =  0 0 1  2   √ r/p r/s 2 ps √ps √ − − ps so that 1 det J = 4ps Breakdown occurs if p or s is 0. Example 7: Traceless Symmetric = Polar Decomposition (S = QΛQT , tr S = 0) The reader will recall the usual definition of polar coordinates. If (p, s) are Cartesian coordinates, then the angle is θ = arctan(s/p) and the radius is r = p2 + s2 . If we take a symmetric traceless 2 × 2 matrix p s S = , s −p and compute its eigenvalue and eigenvector decomposition, we find that the eigendecomposition is mathe￾matically equivalent to the familiar transformation between Cartesian and polar coordinates. Indeed one of the eigenvectors of S is (cos φ, sin φ), where φ = θ/2. The Jacobian matrix is   √p2 p +s √p2 s +s 2 2 J =   −s p p2+ 2 s2 p2+s The Jacobian is the inverse of the radius. This corresponds to the familiar formula using the more usual notation dxdy/r = drdθ so that det J = 1/r. Breakdown occurs when r = 0. Example 8: The Symmetric Eigenvalue Problem (S = QΛQT ) We compute the Jacobian for the general symmetric eigenvalue and eigenvector decomposition. Let p s cos θ − sin θ λ1 cos θ − sin θ T S = = . sin θ cos θ λ2 sin θ cos θ s r We can compute the eigenvectors and eigenvalues of S directly in MATLAB and compute the Jacobian of the two eigenvalues and the eigenvector angle, but when we tried with the Maple toolbox we found that the symbolic toolbox did not handle “arctan” very well. Instead we found it easy to compute the Jacobian in the other direction. We write S = QΛQT , where Q is 2 × 2 orthogonal and Λ is diagonal. The Jacobian is   −2 sin θ cos θ (λ1 − λ2) cos2 θ sin 2 θ  J =  2 sin θ cos θ (λ  1 − λ2) sin 2 θ cos2 θ    (cos2 θ − sin 2 θ)(λ1 − λ2) sin θ cos θ − sin θ cos θ so that det J = λ1 − λ2 . Breakdown occurs if S is a multiple of the identity. Example 9: Symmetric Congruence (Y = ATSA)


Let Y = A^T S A, where Y and S are symmetric, but A = [a b; c d] is arbitrary. The Jacobian matrix is

    J = [ a^2   c^2   2ca     ]
        [ b^2   d^2   2db     ]
        [ ab    cd    cb + ad ],

and det J = (det A)^3.

The cube on the determinant tends to surprise many people. Can you guess what it is for an n × n symmetric matrix (Y = A^T S A)? The answer (det J = (det A)^{n+1}) is in Example 3 of Section 3.

%jacobian2by2.m
%Code 8.1 of Random Eigenvalues by Alan Edelman
%Experiment:  Compute the Jacobian of a 2x2 matrix function
%Comment:     Symbolic tools are not perfect. The author
%             exercised care in choosing the variables.
syms p q r s a b c d t e1 e2
X=[p q ; r s]; A=[a b;c d];
%% Compute Jacobians
Y=X^2;     J=jacobian(Y(:),X(:)), JAC_square =factor(det(J))
Y=X^3;     J=jacobian(Y(:),X(:)), JAC_cube   =factor(det(J))
Y=inv(X);  J=jacobian(Y(:),X(:)), JAC_inv    =factor(det(J))
Y=A*X;     J=jacobian(Y(:),X(:)), JAC_linear =factor(det(J))
Y=[p q;r/p det(X)/p]; J=jacobian(Y(:),X(:)), JAC_lu =factor(det(J))
x=[p s r]; y=[sqrt(p) sqrt(s) r/(sqrt(p)*sqrt(s))];
J=jacobian(y,x), JAC_DMD =factor(det(J))
x=[p s];   y=[sqrt(p^2+s^2) atan(s/p)];
J=jacobian(y,x), JAC_notrace =factor(det(J))
Q=[cos(t) -sin(t); sin(t) cos(t)]; D=[e1 0;0 e2]; Y=Q*D*Q.';
y=[Y(1,1) Y(2,2) Y(1,2)]; x=[t e1 e2];
J=jacobian(y,x), JAC_symeig =simplify(det(J))
X=[p s;s r]; Y=A.'*X*A;
y=[Y(1,1) Y(2,2) Y(1,2)]; x=[p r s];
J=jacobian(y,x), JAC_symcong =factor(det(J))

Code 1
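As a purely numerical complement to Code 1, the same determinants can be approximated by finite differences. Here is a minimal sketch (our own variable names) that checks the formulas obtained in Examples 1 and 3; both printed differences should be small:

    X = randn(2);  e = 1e-6;
    J2 = zeros(4);  Jinv = zeros(4);
    for k = 1:4
        E = zeros(2);  E(k) = 1;               % perturb the k-th entry of X (column-major)
        D2   = ((X + e*E)^2  - X^2 )/e;        % approximate directional derivative of X^2
        Dinv = (inv(X + e*E) - inv(X))/e;      % approximate directional derivative of inv(X)
        J2(:,k)   = D2(:);
        Jinv(:,k) = Dinv(:);
    end
    det(J2)   - 4*trace(X)^2*det(X)            % small: Example 1
    det(Jinv) - det(X)^(-4)                    % small: Example 3

Near the breakdown points noted above (X nearly singular, or with nearly zero trace) the comparison becomes numerically less reliable, which is exactly where the change of variables itself degenerates.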


3 Y = BXA^T and the Kronecker Product

3.1 Jacobian of Y = BXA^T (Kronecker Product Approach)

There is a "nuts and bolts" approach to calculate some Jacobian determinants. A good example function is the matrix inverse Y = X^{-1}. We recall from Example 4 of Section 1 that

    dY = −X^{-1} dX X^{-1}.

In words, the perturbation dX is multiplied on the left and on the right by a fixed matrix. When this happens we are in a "Kronecker Product" situation, and can instantly write down the Jacobian.

We provide two definitions of the Kronecker product for square matrices A ∈ R^{n,n} and B ∈ R^{m,m}. See [1] for a nice discussion of Kronecker products.

Operator Definition: A ⊗ B is the operator from X ∈ R^{m,n} to Y ∈ R^{m,n} where Y = BXA^T. We write (A ⊗ B)X = BXA^T.

Matrix Definition (Tensor Product): A ⊗ B is the matrix

    A ⊗ B = [ a_11 B   ...   a_1n B ]
            [   ...    ...    ...   ]                              (4)
            [ a_n1 B   ...   a_nn B ].

The following theorem is important for applications.

Theorem 1. det(A ⊗ B) = (det A)^m (det B)^n.

Application: If Y = X^{-1} then dY = −(X^{-T} ⊗ X^{-1}) dX, so that |det J| = |det X|^{−2n}.

Notational Note: The correspondence between the operator definition and the matrix definition is worth spelling out. It corresponds to the following identity in MATLAB:

    Y    = B * X * A'
    Y(:) = kron(A,B) * X(:)   % The second line does not change Y

Here kron(A,B) is exactly the matrix in Equation (4), and X(:) is the column vector consisting of the columns of X stacked on top of each other. (In computer science this is known as storing an array in "column major" order.) Many authors write vec(X) where we use X(:). Concretely, we have that

    vec(BXA^T) = (A ⊗ B) vec(X),

where A ⊗ B is as in (4).

Proofs of Theorem 8.1: Assume A and B are diagonalizable, with Au_i = λ_i u_i (i = 1, ..., n) and Bv_i = μ_i v_i (i = 1, ..., m). Let M_ij = v_i u_j^T. The mn matrices M_ij form a basis for R^{m,n} and they are eigenmatrices of our map, since B M_ij A^T = μ_i λ_j M_ij. The determinant is

    ∏_{1 ≤ i ≤ m, 1 ≤ j ≤ n} μ_i λ_j   or   (det A)^m (det B)^n.   (5)

The assumption of diagonalizability is not important.
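As an aside, both the notational identity and Theorem 1 are easy to confirm numerically; a minimal MATLAB sketch (sizes chosen arbitrarily):

    n = 3;  m = 2;
    A = randn(n);  B = randn(m);  X = randn(m,n);
    Y = B*X*A';
    norm(Y(:) - kron(A,B)*X(:))              % ~0 : vec(B*X*A') = kron(A,B)*vec(X)
    det(kron(A,B)) - det(A)^m*det(B)^n       % ~0 : Theorem 1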


Also one can directly manipulate the matrix, since obviously

    A ⊗ B = (A ⊗ I)(I ⊗ B)

as operators, and det I ⊗ B = (det B)^n, and det A ⊗ I can be reshuffled into det I ⊗ A = (det A)^m. Other proofs may be obtained by working with the "LU" decomposition of A and B, the SVD of A and B, the QR decomposition, and many others.

Mathematical Note: That the operator Y = BXA^T can be expressed as a matrix is a consequence of linearity:

    B(c_1 X_1 + c_2 X_2)A^T = c_1 B X_1 A^T + c_2 B X_2 A^T,

i.e.,

    (A ⊗ B)(c_1 X_1 + c_2 X_2) = c_1 (A ⊗ B)X_1 + c_2 (A ⊗ B)X_2.

It is important to realize that a linear transformation from R^{m,n} to R^{m,n} is defined by an element of R^{mn,mn}, i.e., by the m^2 n^2 entries of an mn × mn matrix. The transformations defined by Kronecker products form only an m^2 + n^2 dimensional subset of this m^2 n^2 dimensional space.

Some Kronecker product properties:

1. (A ⊗ B)^T = A^T ⊗ B^T
2. (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
3. det(A ⊗ B) = det(A)^m det(B)^n, for A ∈ R^{n,n}, B ∈ R^{m,m}
4. tr(A ⊗ B) = tr(A) tr(B)
5. A ⊗ B is orthogonal if A and B are orthogonal
6. (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
7. If Au = λu and Bv = μv, then with X = vu^T we have BXA^T = λμX and also AX^T B^T = λμX^T. Therefore A ⊗ B and B ⊗ A have the same eigenvalues, and transposed eigenvectors.

It is easy to see that property 5 holds, since if A and B are orthogonal

    (A ⊗ B)^T = A^T ⊗ B^T = A^{-1} ⊗ B^{-1} = (A ⊗ B)^{-1}.

Linear Subspace Kronecker Products

Some researchers consider the symmetric Kronecker product ⊗_sym. In fact it has become clear that one might consider an anti-symmetric, upper-triangular or even a Toeplitz Kronecker product. We formulate a general notion:

Definition: Let S denote a linear subspace of R^{m,n} and π_S a projection onto S. If A ∈ R^{n,n} and B ∈ R^{m,m}, then we define (A ⊗ B)_S X = π_S(BXA^T) for X ∈ S.

Comments

1. If S is the set of symmetric matrices, then m = n and

    (A ⊗ B)_sym X = (BXA^T + AXB^T)/2.

2. If S is the set of anti-symmetric matrices, then m = n and

    (A ⊗ B)_anti X = (BXA^T + AXB^T)/2

as well, but this matrix is anti-symmetric.


3. If S is the set of upper triangular matrices, then m = n and (A ⊗ B)_upper X = upper(BXA^T).

4. We recall that π_S is a projection if π_S is linear and for all X ∈ R^{m,n}, π_S X ∈ S.

5. Jacobian of ⊗_sym:

    (A ⊗ B)_sym X = (BXA^T + AXB^T)/2.

If B = A then

    det J = ∏_{i ≤ j} λ_i λ_j = (det A)^{n+1},

by considering eigenmatrices E = u_i u_j^T for i ≤ j.

6. Jacobian of ⊗_upper: Special case: A lower triangular, B upper triangular, so that (A ⊗ B)_upper X = BXA^T, since BXA^T is upper triangular.

The eigenvalues of A and B are λ_i = A_ii and μ_j = B_jj respectively, where Au_i = λ_i u_i and Bv_i = μ_i v_i (i = 1, ..., n). The matrix M_ij = v_i u_j^T for i ≤ j is upper triangular, since v_i and u_j are zero below the ith and above the jth component respectively. (The eigenvectors of a triangular matrix are triangular.) Since the M_ij are a basis for the upper triangular matrices and B M_ij A^T = μ_i λ_j M_ij, we then have

    det J = ∏_{i ≤ j} μ_i λ_j = (λ_1 λ_2^2 λ_3^3 ··· λ_n^n)(μ_1^n μ_2^{n−1} μ_3^{n−2} ··· μ_n)
          = (A_11 A_22^2 A_33^3 ··· A_nn^n)(B_11^n B_22^{n−1} B_33^{n−2} ··· B_nn).

Note that J is singular if and only if A or B is.

7. Jacobian of ⊗_Toeplitz: Let X be a Toeplitz matrix. We can define (A ⊗_Toeplitz B)X = Toeplitz(BXA^T), where Toeplitz averages every diagonal.

4 Jacobians of Linear Functions, Powers and Inverses

The Jacobian of a linear map is just the determinant. This determinant is not always easily computed. The dimension of the underlying space of matrices plays a role. For example, the Jacobian of Y = 2X is 2^{n^2} for X ∈ R^{n×n}, 2^{n(n+1)/2} for upper triangular or symmetric X, 2^{n(n−1)/2} for antisymmetric X, and 2^n for diagonal X. We will concentrate on general real matrices X and explore the symmetric case and triangular case as well when appropriate.
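For instance, the Y = 2X counts above are easy to verify with the same symbolic approach as Code 1; a small sketch for the general and symmetric 2 × 2 cases (the variable names are ours):

    syms p q r s
    X = [p q; r s];   Y = 2*X;
    det(jacobian(Y(:), X(:)))                        % 2^(2^2) = 16 for general X
    S = [p r; r s];   Y = 2*S;                       % symmetric: 3 free parameters
    det(jacobian([Y(1,1) Y(2,2) Y(1,2)], [p s r]))   % 2^(2*3/2) = 8

The map multiplies every free parameter by 2, so det J is 2 raised to the number of independent entries in the underlying space.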
