620 Chapter 14. Statistical Description of Data

Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5)
Copyright (C) 1988-1992 by Cambridge University Press. Programs Copyright (C) 1988-1992 by Numerical Recipes Software.
Permission is granted for internet users to make one paper copy for their own personal use. Further reproduction, or any copying of machine-readable files (including this one) to any server computer, is strictly prohibited. To order Numerical Recipes books or CDROMs, visit website http://www.nr.com or call 1-800-872-7423 (North America only), or send email to directcustserv@cambridge.org (outside North America).

14.3 Are Two Distributions Different?

Given two sets of data, we can generalize the questions asked in the previous section and ask the single question: Are the two sets drawn from the same distribution function, or from different distribution functions? Equivalently, in proper statistical language, “Can we disprove, to a certain required level of significance, the null hypothesis that two data sets are drawn from the same population distribution function?” Disproving the null hypothesis in effect proves that the data sets are from different distributions. Failing to disprove the null hypothesis, on the other hand, only shows that the data sets can be consistent with a single distribution function. One can never prove that two data sets come from a single distribution, since (e.g.) no practical amount of data can distinguish between two distributions which differ only by one part in $10^{10}$.

Proving that two distributions are different, or showing that they are consistent, is a task that comes up all the time in many areas of research: Are the visible stars distributed uniformly in the sky? (That is, is the distribution of stars as a function of declination, i.e., position in the sky, the same as the distribution of sky area as a function of declination?) Are educational patterns the same in Brooklyn as in the Bronx? (That is, are the distributions of people as a function of last-grade-attended the same?) Do two brands of fluorescent lights have the same distribution of burn-out times? Is the incidence of chicken pox the same for first-born, second-born, third-born children, etc.?

These four examples illustrate the four combinations arising from two different dichotomies: (1) The data are either continuous or binned. (2) Either we wish to compare one data set to a known distribution, or we wish to compare two equally unknown data sets. The data sets on fluorescent lights and on stars are continuous, since we can be given lists of individual burnout times or of stellar positions. The data sets on chicken pox and educational level are binned, since we are given tables of numbers of events in discrete categories: first-born, second-born, etc.; or 6th Grade, 7th Grade, etc. Stars and chicken pox, on the other hand, share the property that the null hypothesis is a known distribution (distribution of area in the sky, or incidence of chicken pox in the general population). Fluorescent lights and educational level involve the comparison of two equally unknown data sets (the two brands, or Brooklyn and the Bronx).

One can always turn continuous data into binned data, by grouping the events into specified ranges of the continuous variable(s): declinations between 0 and 10 degrees, 10 and 20, 20 and 30, etc. Binning involves a loss of information, however. Also, there is often considerable arbitrariness as to how the bins should be chosen. Along with many other investigators, we prefer to avoid unnecessary binning of data.

The accepted test for differences between binned distributions is the chi-square test. For continuous data as a function of a single variable, the most generally accepted test is the Kolmogorov-Smirnov test. We consider each in turn.

Chi-Square Test

Suppose that $N_i$ is the number of events observed in the $i$th bin, and that $n_i$ is the number expected according to some known distribution. Note that the $N_i$'s are