Chapter 14. Statistical Description of Data

Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5). Copyright (C) 1988-1992 by Cambridge University Press. Programs Copyright (C) 1988-1992 by Numerical Recipes Software. Permission is granted for internet users to make one paper copy for their own personal use. Further reproduction, or any copying of machine-readable files (including this one) to any server computer, is strictly prohibited. To order Numerical Recipes books or CDROMs, visit website http://www.nr.com or call 1-800-872-7423 (North America only), or send email to directcustserv@cambridge.org (outside North America).

14.4 Contingency Table Analysis of Two Distributions

In this section, and the next two sections, we deal with measures of association for two distributions. The situation is this: Each data point has two or more different quantities associated with it, and we want to know whether knowledge of one quantity gives us any demonstrable advantage in predicting the value of another quantity. In many cases, one variable will be an “independent” or “control” variable, and another will be a “dependent” or “measured” variable.
Then, we want to know if the latter variable is in fact dependent on or associated with the former variable. If it is, we want to have some quantitative measure of the strength of the association. One often hears this loosely stated as the question of whether two variables are correlated or uncorrelated, but we will reserve those terms for a particular kind of association (linear, or at least monotonic), as discussed in §14.5 and §14.6.

Notice that, as in previous sections, the different concepts of significance and strength appear: The association between two distributions may be very significant even if that association is weak, provided the quantity of data is large enough.

It is useful to distinguish among some different kinds of variables, with different categories forming a loose hierarchy.

• A variable is called nominal if its values are the members of some unordered set. For example, “state of residence” is a nominal variable that (in the U.S.) takes on one of 50 values; in astrophysics, “type of galaxy” is a nominal variable with the three values “spiral,” “elliptical,” and “irregular.”

• A variable is termed ordinal if its values are the members of a discrete, but ordered, set. Examples are: grade in school, planetary order from the Sun (Mercury = 1, Venus = 2, ...), number of offspring. There need not be any concept of “equal metric distance” between the values of an ordinal variable, only that they be intrinsically ordered.

• We will call a variable continuous if its values are real numbers, as are times, distances, temperatures, etc. (Social scientists sometimes distinguish between interval and ratio continuous variables, but we do not find that distinction very compelling.) A continuous variable can always be made into an ordinal one by binning it into ranges. If we choose to ignore the ordering of the bins, then we can turn it into
a nominal variable. Nominal variables constitute the lowest type of the hierarchy, and therefore the most general. For example, a set of several continuous or ordinal variables can be turned, if crudely, into a single nominal variable, by coarsely binning each variable and then taking each distinct combination of bin assignments as a single nominal value. When multidimensional data are sparse, this is often the only sensible way to proceed.

               1. red                          2. green                          ...
  1. male      # of red males, $N_{11}$        # of green males, $N_{12}$        ...   # of males, $N_{1\cdot}$
  2. female    # of red females, $N_{21}$      # of green females, $N_{22}$      ...   # of females, $N_{2\cdot}$
  ...          ...                             ...                               ...   ...
               # of red, $N_{\cdot 1}$         # of green, $N_{\cdot 2}$         ...   total #, $N$

Figure 14.4.1. Example of a contingency table for two nominal variables, here sex and color. The row and column marginals (totals) are shown. The variables are “nominal,” i.e., the order in which their values are listed is arbitrary and does not affect the result of the contingency table analysis. If the ordering of values has some intrinsic meaning, then the variables are “ordinal” or “continuous,” and correlation techniques (§14.5 and §14.6) can be utilized.

The remainder of this section will deal with measures of association between nominal variables.
For any pair of nominal variables, the data can be displayed as a contingency table, a table whose rows are labeled by the values of one nominal variable, whose columns are labeled by the values of the other nominal variable, and whose entries are nonnegative integers giving the number of observed events for each combination of row and column (see Figure 14.4.1). The analysis of association between nominal variables is thus called contingency table analysis or crosstabulation analysis.

We will introduce two different approaches. The first approach, based on the chi-square statistic, does a good job of characterizing the significance of association, but is only so-so as a measure of the strength (principally because its numerical values have no very direct interpretations). The second approach, based on the information-theoretic concept of entropy, says nothing at all about the significance of association (use chi-square for that!), but is capable of very elegantly characterizing the strength of an association already known to be significant.

Measures of Association Based on Chi-Square

Some notation first: Let $N_{ij}$ denote the number of events that occur with the
first variable x taking on its ith value, and the second variable y taking on its jth value. Let N denote the total number of events, the sum of all the $N_{ij}$'s. Let $N_{i\cdot}$ denote the number of events for which the first variable x takes on its ith value regardless of the value of y; $N_{\cdot j}$ is the number of events with the jth value of y regardless of x. So we have

$$N_{i\cdot} = \sum_j N_{ij} \qquad N_{\cdot j} = \sum_i N_{ij} \qquad N = \sum_i N_{i\cdot} = \sum_j N_{\cdot j} \tag{14.4.1}$$

$N_{\cdot j}$ and $N_{i\cdot}$ are sometimes called the row and column totals or marginals, but we will use these terms cautiously since we can never keep straight which are the rows and which are the columns!

The null hypothesis is that the two variables x and y have no association. In this case, the probability of a particular value of x given a particular value of y should be the same as the probability of that value of x regardless of y. Therefore, in the null hypothesis, the expected number for any $N_{ij}$, which we will denote $n_{ij}$, can be calculated from only the row and column totals,

$$\frac{n_{ij}}{N_{\cdot j}} = \frac{N_{i\cdot}}{N} \qquad \text{which implies} \qquad n_{ij} = \frac{N_{i\cdot} N_{\cdot j}}{N} \tag{14.4.2}$$

Notice that if a column or row total is zero, then the expected number for all the entries in that column or row is also zero; in that case, the never-occurring bin of x or y should simply be removed from the analysis.
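The marginals (14.4.1) and expected counts (14.4.2) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the name `expected_counts` is hypothetical, the table is stored row-major in a flat array, and indices are zero-based rather than the unit-offset arrays used by the routines in this book.

```c
#include <assert.h>

/* Hypothetical helper, not a Numerical Recipes routine.
   nn is an ni-by-nj table of counts, row-major, zero-based indices.
   Fills rowtot[i] = N_i., coltot[j] = N_.j and expct[i*nj+j] = n_ij,
   the expected count of equation (14.4.2). Returns the grand total N. */
double expected_counts(const int *nn, int ni, int nj,
                       double *rowtot, double *coltot, double *expct)
{
    double total = 0.0;
    int i, j;

    for (i = 0; i < ni; i++) rowtot[i] = 0.0;
    for (j = 0; j < nj; j++) coltot[j] = 0.0;

    /* Accumulate the marginals of equation (14.4.1) in one pass. */
    for (i = 0; i < ni; i++) {
        for (j = 0; j < nj; j++) {
            rowtot[i] += nn[i * nj + j];
            coltot[j] += nn[i * nj + j];
            total += nn[i * nj + j];
        }
    }
    assert(total > 0.0);  /* an empty table has no expected counts */

    /* n_ij = N_i. * N_.j / N; zero whenever a row or column total is zero. */
    for (i = 0; i < ni; i++)
        for (j = 0; j < nj; j++)
            expct[i * nj + j] = rowtot[i] * coltot[j] / total;

    return total;
}
```

For the 2 by 2 table with counts 10, 20 in the first row and 30, 40 in the second, the totals are 30 and 70 for one variable, 40 and 60 for the other, N = 100, and the expected counts are 12, 18, 28, and 42.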
The chi-square statistic is now given by equation (14.3.1), which, in the present case, is summed over all entries in the table,

$$\chi^2 = \sum_{i,j} \frac{(N_{ij} - n_{ij})^2}{n_{ij}} \tag{14.4.3}$$

The number of degrees of freedom is equal to the number of entries in the table (product of its row size and column size) minus the number of constraints that have arisen from our use of the data themselves to determine the $n_{ij}$. Each row total and column total is a constraint, except that this overcounts by one, since the total of the column totals and the total of the row totals both equal N, the total number of data points. Therefore, if the table is of size I by J, the number of degrees of freedom is $IJ - I - J + 1$. Equation (14.4.3), along with the chi-square probability function (§6.2), now gives the significance of an association between the variables x and y.

Suppose there is a significant association. How do we quantify its strength, so that (e.g.) we can compare the strength of one association with another? The idea here is to find some reparametrization of $\chi^2$ which maps it into some convenient interval, like 0 to 1, where the result is not dependent on the quantity of data that we happen to sample, but rather depends only on the underlying population from which the data were drawn. There are several different ways of doing this. Two of the more common are called Cramer’s V and the contingency coefficient C.
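As a hand-checkable illustration of equation (14.4.3) and the degrees-of-freedom count, here is a minimal C sketch. The function name `chisq_table`, the flat row-major storage, and the zero-based indexing are assumptions made for this example; it skips cells whose expected count is zero and counts only nonzero rows and columns, and it does not evaluate the significance level.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not a Numerical Recipes routine.
   Returns chi-square (equation 14.4.3) for an ni-by-nj table of counts
   stored row-major in nn. On return *df holds IJ - I - J + 1, where I and J
   count only the rows and columns whose totals are nonzero. */
double chisq_table(const int *nn, int ni, int nj, int *df)
{
    double *rowtot = malloc((size_t)ni * sizeof *rowtot);
    double *coltot = malloc((size_t)nj * sizeof *coltot);
    double total = 0.0, chisq = 0.0;
    int i, j, nrows = 0, ncols = 0;

    assert(rowtot != NULL && coltot != NULL);
    for (i = 0; i < ni; i++) rowtot[i] = 0.0;
    for (j = 0; j < nj; j++) coltot[j] = 0.0;
    for (i = 0; i < ni; i++) {
        for (j = 0; j < nj; j++) {
            rowtot[i] += nn[i * nj + j];
            coltot[j] += nn[i * nj + j];
            total += nn[i * nj + j];
        }
    }
    assert(total > 0.0);

    /* Never-occurring rows and columns are dropped from the count. */
    for (i = 0; i < ni; i++) if (rowtot[i] > 0.0) nrows++;
    for (j = 0; j < nj; j++) if (coltot[j] > 0.0) ncols++;

    for (i = 0; i < ni; i++) {
        for (j = 0; j < nj; j++) {
            double expct = rowtot[i] * coltot[j] / total;
            if (expct > 0.0) {          /* skip cells of empty rows/columns */
                double d = nn[i * nj + j] - expct;
                chisq += d * d / expct;
            }
        }
    }
    *df = nrows * ncols - nrows - ncols + 1;
    free(rowtot);
    free(coltot);
    return chisq;
}
```

For the table with rows (10, 20) and (30, 40), the expected counts are 12, 18, 28, and 42, so $\chi^2 = 4/12 + 4/18 + 4/28 + 4/42 = 200/252 \approx 0.794$ with one degree of freedom.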
The formula for Cramer’s V is

$$V = \sqrt{\frac{\chi^2}{N \min(I-1,\, J-1)}} \tag{14.4.4}$$

where I and J are again the numbers of rows and columns, and N is the total number of events. Cramer’s V has the pleasant property that it lies between zero and one inclusive, equals zero when there is no association, and equals one only when the association is perfect: All the events in any row lie in one unique column, and vice versa. (In chess parlance, no two rooks, placed on a nonzero table entry, can capture each other.)

In the case of I = J = 2, Cramer’s V is also referred to as the phi statistic.

The contingency coefficient C is defined as

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}} \tag{14.4.5}$$

It also lies between zero and one, but (as is apparent from the formula) it can never achieve the upper limit. While it can be used to compare the strength of association of two tables with the same I and J, its upper limit depends on I and J. Therefore it can never be used to compare tables of different sizes.

The trouble with both Cramer’s V and the contingency coefficient C is that, when they take on values in between their extremes, there is no very direct interpretation of what that value means.
For example, you are in Las Vegas, and a friend tells you that there is a small, but significant, association between the color of a croupier’s eyes and the occurrence of red and black on his roulette wheel. Cramer’s V is about 0.028, your friend tells you. You know what the usual odds against you are (because of the green zero and double zero on the wheel). Is this association sufficient for you to make money? Don’t ask us!

    #include <math.h>
    #include "nrutil.h"
    #define TINY 1.0e-30    /* A small number. */

    void cntab1(int **nn, int ni, int nj, float *chisq, float *df, float *prob,
        float *cramrv, float *ccc)
    /* Given a two-dimensional contingency table in the form of an integer array
       nn[1..ni][1..nj], this routine returns the chi-square chisq, the number of
       degrees of freedom df, the significance level prob (small values indicating
       a significant association), and two measures of association, Cramer's V
       (cramrv) and the contingency coefficient C (ccc). */
    {
        float gammq(float a, float x);
        int nnj,nni,j,i,minij;
        float sum=0.0,expctd,*sumi,*sumj,temp;

        sumi=vector(1,ni);
        sumj=vector(1,nj);
        nni=ni;                         /* Number of rows */
        nnj=nj;                         /* and columns. */
        for (i=1;i<=ni;i++) {           /* Get the row totals. */
            sumi[i]=0.0;
            for (j=1;j<=nj;j++) {
                sumi[i] += nn[i][j];
                sum += nn[i][j];
            }
            if (sumi[i] == 0.0) --nni;  /* Eliminate any zero rows by reducing the number. */
        }
        for (j=1;j<=nj;j++) {           /* Get the column totals. */
            sumj[j]=0.0;
            for (i=1;i<=ni;i++) sumj[j] += nn[i][j];
            if (sumj[j] == 0.0) --nnj;  /* Eliminate any zero columns. */
        }
        *df=nni*nnj-nni-nnj+1;          /* Corrected number of degrees of freedom. */
        *chisq=0.0;
        for (i=1;i<=ni;i++) {           /* Do the chi-square sum. */
            for (j=1;j<=nj;j++) {
                expctd=sumj[j]*sumi[i]/sum;
                temp=nn[i][j]-expctd;
                /* Here TINY guarantees that any eliminated row or column
                   will not contribute to the sum. */
                *chisq += temp*temp/(expctd+TINY);
            }
        }
        *prob=gammq(0.5*(*df),0.5*(*chisq));    /* Chi-square probability function. */
        minij = nni < nnj ? nni-1 : nnj-1;
        *cramrv=sqrt(*chisq/(sum*minij));
        *ccc=sqrt(*chisq/(*chisq+sum));
        free_vector(sumj,1,nj);
        free_vector(sumi,1,ni);
    }

Measures of Association Based on Entropy

Consider the game of “twenty questions,” where by repeated yes/no questions you try to eliminate all except one correct possibility for an unknown object. Better yet, consider a generalization of the game, where you are allowed to ask multiple choice questions as well as binary (yes/no) ones. The categories in your multiple choice questions are supposed to be mutually exclusive and exhaustive (as are “yes” and “no”).
The value to you of an answer increases with the number of possibilities that it eliminates. More specifically, an answer that eliminates all except a fraction p of the remaining possibilities can be assigned a value $-\ln p$ (a positive number, since p < 1). The purpose of the logarithm is to make the value additive, since (e.g.) one question that eliminates all but 1/6 of the possibilities is considered as good as two questions that, in sequence, reduce the number by factors 1/2 and 1/3.

So that is the value of an answer; but what is the value of a question? If there are I possible answers to the question (i = 1,...,I) and the fraction of possibilities consistent with the ith answer is $p_i$ (with the sum of the $p_i$'s equal to one), then the value of the question is the expectation value of the value of the answer, denoted H,

$$H = -\sum_{i=1}^{I} p_i \ln p_i \tag{14.4.6}$$

In evaluating (14.4.6), note that

$$\lim_{p \to 0} p \ln p = 0 \tag{14.4.7}$$

The value H lies between 0 and $\ln I$. It is zero only when one of the $p_i$'s is one, all the others zero: In this case, the question is valueless, since its answer is preordained.
H takes on its maximum value when all the $p_i$'s are equal, in which case the question is sure to eliminate all but a fraction 1/I of the remaining possibilities.

The value H is conventionally termed the entropy of the distribution given by the $p_i$'s, a terminology borrowed from statistical physics.

So far we have said nothing about the association of two variables; but suppose we are deciding what question to ask next in the game and have to choose between two candidates, or possibly want to ask both in one order or another. Suppose that one question, x, has I possible answers, labeled by i, and that the other question, y, has J possible answers, labeled by j. Then the possible outcomes of asking both questions form a contingency table whose entries $N_{ij}$, when normalized by dividing by the total number of remaining possibilities N, give all the information about the p's.
In particular, we can make contact with the notation (14.4.1) by identifying

$$
p_{ij} = \frac{N_{ij}}{N}, \qquad
p_{i\cdot} = \frac{N_{i\cdot}}{N} \quad \text{(outcomes of question $x$ alone)}, \qquad
p_{\cdot j} = \frac{N_{\cdot j}}{N} \quad \text{(outcomes of question $y$ alone)}
\tag{14.4.8}
$$

The entropies of the questions $x$ and $y$ are, respectively,

$$
H(x) = -\sum_i p_{i\cdot} \ln p_{i\cdot} \qquad
H(y) = -\sum_j p_{\cdot j} \ln p_{\cdot j}
\tag{14.4.9}
$$

The entropy of the two questions together is

$$
H(x,y) = -\sum_{i,j} p_{ij} \ln p_{ij}
\tag{14.4.10}
$$

Now what is the entropy of the question $y$ given $x$ (that is, if $x$ is asked first)? It is the expectation value over the answers to $x$ of the entropy of the restricted $y$ distribution that lies in a single column of the contingency table (corresponding to the $x$ answer):

$$
H(y|x) = -\sum_i p_{i\cdot} \sum_j \frac{p_{ij}}{p_{i\cdot}} \ln \frac{p_{ij}}{p_{i\cdot}}
= -\sum_{i,j} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}}
\tag{14.4.11}
$$

Correspondingly, the entropy of $x$ given $y$ is

$$
H(x|y) = -\sum_j p_{\cdot j} \sum_i \frac{p_{ij}}{p_{\cdot j}} \ln \frac{p_{ij}}{p_{\cdot j}}
= -\sum_{i,j} p_{ij} \ln \frac{p_{ij}}{p_{\cdot j}}
\tag{14.4.12}
$$

We can readily prove that the entropy of $y$ given $x$ is never more than the entropy of $y$ alone, i.e., that asking $x$ first can only reduce the usefulness of asking
$y$ (in which case the two variables are associated!):

$$
\begin{aligned}
H(y|x) - H(y) &= -\sum_{i,j} p_{ij} \ln \frac{p_{ij}/p_{i\cdot}}{p_{\cdot j}} \\
&= \sum_{i,j} p_{ij} \ln \frac{p_{\cdot j}\, p_{i\cdot}}{p_{ij}} \\
&\le \sum_{i,j} p_{ij} \left( \frac{p_{\cdot j}\, p_{i\cdot}}{p_{ij}} - 1 \right) \\
&= \sum_{i,j} p_{i\cdot}\, p_{\cdot j} - \sum_{i,j} p_{ij} \\
&= 1 - 1 = 0
\end{aligned}
\tag{14.4.13}
$$

where the inequality follows from the fact

$$
\ln w \le w - 1
\tag{14.4.14}
$$

We now have everything we need to define a measure of the "dependency" of $y$ on $x$, that is to say a measure of association. This measure is sometimes called the uncertainty coefficient of $y$. We will denote it as $U(y|x)$,

$$
U(y|x) \equiv \frac{H(y) - H(y|x)}{H(y)}
\tag{14.4.15}
$$

This measure lies between zero and one, with the value 0 indicating that $x$ and $y$ have no association, the value 1 indicating that knowledge of $x$ completely predicts $y$. For in-between values, $U(y|x)$ gives the fraction of $y$'s entropy $H(y)$ that is lost if $x$ is already known (i.e., that is redundant with the information in $x$). In our game of "twenty questions," $U(y|x)$ is the fractional loss in the utility of question $y$ if question $x$ is to be asked first.
If we wish to view $x$ as the dependent variable, $y$ as the independent one, then interchanging $x$ and $y$ we can of course define the dependency of $x$ on $y$,

$$
U(x|y) \equiv \frac{H(x) - H(x|y)}{H(x)}
\tag{14.4.16}
$$

If we want to treat $x$ and $y$ symmetrically, then the useful combination turns out to be

$$
U(x,y) \equiv 2 \left[ \frac{H(y) + H(x) - H(x,y)}{H(x) + H(y)} \right]
\tag{14.4.17}
$$

If the two variables are completely independent, then $H(x,y) = H(x) + H(y)$, so (14.4.17) vanishes. If the two variables are completely dependent, then $H(x) = H(y) = H(x,y)$, so (14.4.17) equals unity. In fact, you can use the identities (easily proved from equations 14.4.9–14.4.12)

$$
H(x,y) = H(x) + H(y|x) = H(y) + H(x|y)
\tag{14.4.18}
$$

to show that

$$
U(x,y) = \frac{H(x)\,U(x|y) + H(y)\,U(y|x)}{H(x) + H(y)}
\tag{14.4.19}
$$

i.e., that the symmetrical measure is just a weighted average of the two asymmetrical measures (14.4.15) and (14.4.16), weighted by the entropy of each variable separately.

Here is a program for computing all the quantities discussed, $H(x)$, $H(y)$, $H(x|y)$, $H(y|x)$, $H(x,y)$, $U(x|y)$, $U(y|x)$, and $U(x,y)$:
    #include <math.h>
    #include "nrutil.h"
    #define TINY 1.0e-30    /* A small number. */

    void cntab2(int **nn, int ni, int nj, float *h, float *hx, float *hy,
        float *hygx, float *hxgy, float *uygx, float *uxgy, float *uxy)
    /* Given a two-dimensional contingency table in the form of an integer array
       nn[i][j], where i labels the x variable and ranges from 1 to ni, j labels
       the y variable and ranges from 1 to nj, this routine returns the entropy h
       of the whole table, the entropy hx of the x distribution, the entropy hy of
       the y distribution, the entropy hygx of y given x, the entropy hxgy of x
       given y, the dependency uygx of y on x (eq. 14.4.15), the dependency uxgy
       of x on y (eq. 14.4.16), and the symmetrical dependency uxy (eq. 14.4.17). */
    {
        int i,j;
        float sum=0.0,p,*sumi,*sumj;

        sumi=vector(1,ni);
        sumj=vector(1,nj);
        for (i=1;i<=ni;i++) {           /* Get the row totals. */
            sumi[i]=0.0;
            for (j=1;j<=nj;j++) {
                sumi[i] += nn[i][j];
                sum += nn[i][j];
            }
        }
        for (j=1;j<=nj;j++) {           /* Get the column totals. */
            sumj[j]=0.0;
            for (i=1;i<=ni;i++)
                sumj[j] += nn[i][j];
        }
        *hx=0.0;                        /* Entropy of the x distribution, */
        for (i=1;i<=ni;i++)
            if (sumi[i]) {
                p=sumi[i]/sum;
                *hx -= p*log(p);
            }
        *hy=0.0;                        /* and of the y distribution. */
        for (j=1;j<=nj;j++)
            if (sumj[j]) {
                p=sumj[j]/sum;
                *hy -= p*log(p);
            }
        *h=0.0;
        for (i=1;i<=ni;i++)             /* Total entropy: loop over both x */
            for (j=1;j<=nj;j++)         /* and y. */
                if (nn[i][j]) {
                    p=nn[i][j]/sum;
                    *h -= p*log(p);
                }
        *hygx=(*h)-(*hx);               /* Uses equation (14.4.18), */
        *hxgy=(*h)-(*hy);               /* as does this. */
        *uygx=(*hy-*hygx)/(*hy+TINY);   /* Equation (14.4.15). */
        *uxgy=(*hx-*hxgy)/(*hx+TINY);   /* Equation (14.4.16). */
        *uxy=2.0*(*hx+*hy-*h)/(*hx+*hy+TINY);   /* Equation (14.4.17). */
        free_vector(sumj,1,nj);
        free_vector(sumi,1,ni);
    }

CITED REFERENCES AND FURTHER READING:

Dunn, O.J., and Clark, V.A. 1974, Applied Statistics: Analysis of Variance and Regression (New York: Wiley).
Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985, SPSS-X Advanced Statistics Guide (New York: McGraw-Hill).

Fano, R.M. 1961, Transmission of Information (New York: Wiley and MIT Press), Chapter 2.

14.5 Linear Correlation

We next turn to measures of association between variables that are ordinal or continuous, rather than nominal. Most widely used is the linear correlation coefficient. For pairs of quantities $(x_i, y_i)$, $i = 1, \ldots, N$, the linear correlation coefficient $r$ (also called the product-moment correlation coefficient, or Pearson's $r$) is given by the formula

$$
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2}\; \sqrt{\sum_i (y_i - \bar{y})^2}}
\tag{14.5.1}
$$

where, as usual, $\bar{x}$ is the mean of the $x_i$'s, $\bar{y}$ is the mean of the $y_i$'s.

The value of $r$ lies between $-1$ and $1$, inclusive. It takes on a value of 1, termed "complete positive correlation," when the data points lie on a perfect straight line with positive slope, with $x$ and $y$ increasing together. The value 1 holds independent of the magnitude of the slope. If the data points lie on a perfect straight line with negative slope, $y$ decreasing as $x$ increases, then $r$ has the value $-1$; this is called "complete negative correlation." A value of $r$ near zero indicates that the variables $x$ and $y$ are uncorrelated.
When a correlation is known to be significant, $r$ is one conventional way of summarizing its strength. In fact, the value of $r$ can be translated into a statement about what residuals (root mean square deviations) are to be expected if the data are fitted to a straight line by the least-squares method (see §15.2, especially equations 15.2.13–15.2.14). Unfortunately, $r$ is a rather poor statistic for deciding whether an observed correlation is statistically significant, and/or whether one observed correlation is significantly stronger than another. The reason is that $r$ is ignorant of the individual distributions of $x$ and $y$, so there is no universal way to compute its distribution in the case of the null hypothesis.

About the only general statement that can be made is this: If the null hypothesis is that $x$ and $y$ are uncorrelated, and if the distributions for $x$ and $y$ each have enough convergent moments ("tails" die off sufficiently rapidly), and if $N$ is large (typically $> 500$), then $r$ is distributed approximately normally, with a mean of zero and a standard deviation of $1/\sqrt{N}$. In that case, the (double-sided) significance of the correlation, that is, the probability that $|r|$ should be larger than its observed value in the null hypothesis, is

$$
\operatorname{erfc}\!\left( \frac{|r| \sqrt{N}}{\sqrt{2}} \right)
\tag{14.5.2}
$$

where $\operatorname{erfc}(x)$ is the complementary error function, equation (6.2.8), computed by the routines erffc or erfcc of §6.2. A small value of (14.5.2) indicates that the