Taylor & Francis

Communications in Statistics - Theory and Methods
ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20

The Chi Square Test With Both Margins Fixed
I.S. Alalouf

To cite this article: I.S. Alalouf (1987) The Chi Square Test With Both Margins Fixed, Communications in Statistics - Theory and Methods, 16:1, 29-43, DOI: 10.1080/03610928708829350
To link to this article: http://dx.doi.org/10.1080/03610928708829350
Published online: 27 Jun 2007.
COMMUN. STATIST.-THEOR. METH., 16(1), 29-43 (1987)
Copyright 1987 by Marcel Dekker, Inc.

THE CHI SQUARE TEST WITH BOTH MARGINS FIXED

I.S. Alalouf
Université du Québec à Montréal

Key Words and Phrases: contingency table; Pearson's statistic; asymptotic distribution

ABSTRACT

It is well known that the chi square test for independence in a two-way contingency table is valid when the cell frequencies follow either a multinomial distribution or a product of multinomial distributions. We show that the test is valid in the third case as well, namely the case where both margins are fixed. The proof is reasonably self-contained, relying essentially on the Central Limit Theorem, and is easily made to cover the first two cases. It also leads naturally to a discussion of partitioning degrees of freedom.

1. INTRODUCTION

Consider an $r \times c$ contingency table with frequencies $x_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, c$. Let $x_{1+}, \ldots, x_{r+}$ be the row totals and $N_1, \ldots, N_c$ the column totals. Let $N = \sum_{j=1}^{c} N_j$ be the total sample size.
It is well known that, under the hypothesis of independence, the distribution of Pearson's chi square statistic,

$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - x_{i+} N_j / N)^2}{x_{i+} N_j / N} \tag{1.1}$$

is asymptotically a chi square under either one of the following models: (1) $N$ is fixed and the $x_{ij}$'s have a joint multinomial distribution; (2) the column totals $N_1, \ldots, N_c$ are fixed and, for $j = 1, \ldots, c$, the random vectors $(x_{1j}, \ldots, x_{rj})$ are independent multinomials.

Specifically, let $\{p_{ij},\ i = 1, \ldots, r,\ j = 1, \ldots, c\}$ denote the cell probabilities under the multinomial model (1), with $p_{i+} = \sum_{j=1}^{c} p_{ij}$, $i = 1, \ldots, r$, and $p_{+j} = \sum_{i=1}^{r} p_{ij}$, $j = 1, \ldots, c$. The null hypothesis is that $p_{ij} = p_{i+} p_{+j}$, and under this hypothesis the joint probability function of the cell frequencies is

$$\frac{N!}{\prod_{i=1}^{r} \prod_{j=1}^{c} x_{ij}!} \prod_{i=1}^{r} \prod_{j=1}^{c} (p_{i+} p_{+j})^{x_{ij}}. \tag{1.2}$$

Under the product multinomial model (2), the null hypothesis is that each of the $c$ multinomials has the same probability vector, $(p_{1+}, \ldots, p_{r+})$ say. Under this hypothesis and model, the joint probability function of the cell frequencies is

$$\prod_{j=1}^{c} \left[ \frac{N_j!}{\prod_{i=1}^{r} x_{ij}!} \prod_{i=1}^{r} p_{i+}^{x_{ij}} \right]. \tag{1.3}$$
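As a concrete reading of Pearson's statistic (1.1), the following minimal sketch computes it for a table stored as a list of rows. The function name `pearson_chi_square` and the example counts are illustrative assumptions, not part of the paper; exact rational arithmetic is used only so that the result can be checked by hand.

```python
from fractions import Fraction


def pearson_chi_square(table):
    """Pearson's statistic (1.1) for an r x c table given as a list of rows.

    Assumes every row total and every column total is positive, since the
    expected counts x_{i+} N_j / N appear in the denominators.
    """
    row_totals = [sum(row) for row in table]          # x_{i+}
    col_totals = [sum(col) for col in zip(*table)]    # N_j
    n = sum(row_totals)                               # N
    stat = Fraction(0)
    for i, row in enumerate(table):
        for j, x_ij in enumerate(row):
            expected = Fraction(row_totals[i] * col_totals[j], n)
            stat += (x_ij - expected) ** 2 / expected
    return stat


# Hypothetical 2 x 2 table: expected counts are 12, 18, 28, 42.
example = [[10, 20], [30, 40]]
print(pearson_chi_square(example))   # exact value as a fraction
```

A table whose rows are proportional, such as `[[2, 4], [3, 6]]`, matches its expected counts exactly and gives a statistic of zero.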
Finally, it may happen that both sets of margins are fixed. One example of where this might occur in practice is the following. Suppose $N$ individuals belonging to $c$ classes (ethnic groups, for example) are hired for $N$ summer jobs which fall into $r$ categories. Both the number of individuals of each class and the number of jobs of each category are fixed. The hypothesis of independence in this context is the hypothesis that individuals are assigned to different job categories without regard to their class. Under this assumption, the joint probability function of the cell frequencies is

$$\frac{\prod_{i=1}^{r} x_{i+}! \, \prod_{j=1}^{c} N_j!}{N! \, \prod_{i=1}^{r} \prod_{j=1}^{c} x_{ij}!}. \tag{1.4}$$

The distribution (1.4) is derived under the assumption of random assignment of people to jobs.

The sequence of models (1.2), (1.3) and (1.4) may be obtained by successive conditioning: (1.3) is the conditional distribution of the $x_{ij}$ given the column totals; and (1.4) is the conditional distribution under (1.3) given the row totals. Indeed, even (1.4) may be considered as a conditional distribution under a model in which the $x_{ij}$'s are independent Poisson variables, given the sum $N$ (see, e.g., Haberman, 1974).

It is known that the asymptotic distribution of (1.1) is chi square under (1.2) or (1.3), as well as under the independent Poissons model.

Under (1.4), does the statistic (1.1) have an asymptotic chi square distribution? The answer is that it does, provided each of the marginal frequencies tends to infinity in such a way that the relative marginal frequencies converge to numbers in $(0,1)$. We prove this in the next section on the assumption that the vector of observations is asymptotically multivariate normal, while the proof of the asymptotic normality is given in the appendix. Both proofs are reasonably self-contained, relying on the central limit theorem and on some basic results on the distribution of quadratic forms. We point out in Section 3 that the same derivation leads very easily to the conclusion that (1.1) is asymptotically a chi square in the two other models: one multinomial with no constraints on the margins, and $c$ multinomials with constraints only on the column totals.
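For very small tables the distribution (1.4) can be enumerated exactly, which gives a sanity check on the question posed above. The sketch below lists every table with given margins, confirms that the probabilities (1.4) sum to one, and computes the exact mean of Pearson's statistic (1.1). The margins and all function names are hypothetical choices made for illustration; the brute-force enumeration is only practical for tiny tables and is not the paper's method of proof.

```python
from fractions import Fraction
from math import factorial


def compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def tables_with_margins(row_totals, col_totals):
    """Yield every table, as a tuple of columns, with the given margins."""
    def extend(remaining, cols_left):
        if not cols_left:
            if all(rem == 0 for rem in remaining):
                yield ()
            return
        for col in compositions(cols_left[0], len(remaining)):
            if all(x <= rem for x, rem in zip(col, remaining)):
                new_remaining = [rem - x for x, rem in zip(col, remaining)]
                for rest in extend(new_remaining, cols_left[1:]):
                    yield (col,) + rest
    yield from extend(list(row_totals), list(col_totals))


def prob_fixed_margins(cols, row_totals, col_totals):
    """Probability (1.4) of a table whose margins are both fixed."""
    numerator = 1
    for total in list(row_totals) + list(col_totals):
        numerator *= factorial(total)
    denominator = factorial(sum(row_totals))
    for col in cols:
        for x in col:
            denominator *= factorial(x)
    return Fraction(numerator, denominator)


def pearson(cols, row_totals, col_totals):
    """Pearson's statistic (1.1) for a table stored column by column."""
    n = sum(row_totals)
    stat = Fraction(0)
    for j, col in enumerate(cols):
        for i, x in enumerate(col):
            expected = Fraction(row_totals[i] * col_totals[j], n)
            stat += (x - expected) ** 2 / expected
    return stat


row_totals, col_totals = (3, 4), (2, 2, 3)   # hypothetical margins, N = 7
total_prob = Fraction(0)
mean_stat = Fraction(0)
for cols in tables_with_margins(row_totals, col_totals):
    p = prob_fixed_margins(cols, row_totals, col_totals)
    total_prob += p
    mean_stat += p * pearson(cols, row_totals, col_totals)

print(total_prob, mean_stat)
```

For these margins the exact mean equals $N(r-1)(c-1)/(N-1)$, which approaches the chi square mean $(r-1)(c-1)$ as $N$ grows. This agrees with the asymptotic claim but checks only the first moment, not the limiting distribution.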
We also show how an overall chi square may be split into several independent chi squares to test various subhypotheses. Thus one proof covers a number of issues related to the traditional chi square test.

2. MAIN RESULT

Consider a population of $N$ elements partitioned into $r$ classes with frequencies $x_{1+}, \ldots, x_{r+}$, $\sum_{i=1}^{r} x_{i+} = N$. A sampling scheme which leads to (1.4) is the following. Suppose a sample of size $N_1$ is drawn without replacement, then a sample of size $N_2$ from the remaining elements of the population, and so on until a sample of size $N_c$ is drawn from a population which by then contains only $N_c$ elements. That is, the last sample is not random conditional on all previous samples. Then the probability function of the $x_{ij}$'s is given by (1.4). Let

$$\mathbf{x}_j = (x_{1j}, \ldots, x_{rj})', \quad j = 1, \ldots, c \tag{2.1}$$

be the $j$-th column of the contingency table, and let

$$\mathbf{x} = (\mathbf{x}_1', \ldots, \mathbf{x}_c')'. \tag{2.2}$$

Let

$$\begin{aligned}
\boldsymbol{\pi} &= (\pi_1, \ldots, \pi_r)' = (x_{1+}/N, \ldots, x_{r+}/N)', \\
\mathbf{P} &= (P_1, \ldots, P_c)' = (N_1/N, \ldots, N_c/N)', \\
D &= \mathrm{diag}(\pi_1, \ldots, \pi_r).
\end{aligned} \tag{2.3}$$

Finally, define

$$V = D - \boldsymbol{\pi}\boldsymbol{\pi}'. \tag{2.4}$$
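The successive sampling scheme described at the start of this section can be sketched in a few lines. Shuffling the population once and cutting the shuffled list into consecutive blocks of sizes $N_1, \ldots, N_c$ is equivalent to drawing the samples one after another without replacement. The function name `sample_table`, the margins and the seed are assumptions made for illustration.

```python
import random


def sample_table(row_totals, col_totals, rng):
    """Draw one r x c table by successive sampling without replacement.

    The population holds x_{i+} elements of category i. After one shuffle,
    the first N_1 elements form sample 1, the next N_2 form sample 2, and
    so on; the last sample is whatever remains.
    """
    population = [i for i, count in enumerate(row_totals) for _ in range(count)]
    if len(population) != sum(col_totals):
        raise ValueError("row totals and column totals must have the same sum")
    rng.shuffle(population)
    table = [[0] * len(col_totals) for _ in row_totals]
    start = 0
    for j, n_j in enumerate(col_totals):
        for category in population[start:start + n_j]:
            table[category][j] += 1   # x_{ij}: category i elements in sample j
        start += n_j
    return table


rng = random.Random(0)                       # fixed seed, for reproducibility only
table = sample_table([3, 4], [2, 2, 3], rng)
print(table)
```

Whatever the seed, both sets of margins of the returned table equal the inputs, which is the defining feature of model (1.4).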
Then, using (1.4), we find that

$$\begin{aligned}
E(\mathbf{x}_j) &= N P_j \boldsymbol{\pi}, & j &= 1, \ldots, c, \\
\mathrm{Var}(\mathbf{x}_j) &= [N^2/(N-1)]\, P_j (1 - P_j)\, V, & j &= 1, \ldots, c, \\
\mathrm{Cov}(\mathbf{x}_j, \mathbf{x}_k) &= -[N^2/(N-1)]\, P_j P_k\, V, & j &\neq k.
\end{aligned} \tag{2.5}$$

With $\mathbf{x} = (\mathbf{x}_1', \ldots, \mathbf{x}_c')'$ as in (2.2), the expectation and covariance matrix of $\mathbf{x}$ are

$$\begin{aligned}
E(\mathbf{x}) &= N M E \boldsymbol{\pi} = \boldsymbol{\mu}, \\
\mathrm{Var}(\mathbf{x}) &= [N^2/(N-1)]\,[\Gamma - M E V E' M] = \Sigma,
\end{aligned} \tag{2.6}$$

where

$$\begin{aligned}
\Gamma &= \mathrm{diag}\{P_1 V, \ldots, P_c V\}, \\
M &= \mathrm{diag}\{P_1 I_r, \ldots, P_c I_r\}, \\
E &= (I_r, \ldots, I_r)'.
\end{aligned} \tag{2.7}$$

The matrix $E$ is $rc \times r$, $\mathrm{diag}\{A_1, \ldots, A_k\}$ denotes a block diagonal matrix with blocks $A_1, \ldots, A_k$, and $I_r$ denotes the identity matrix of order $r$.

As will be proved in the appendix, the vector $\mathbf{x}$ is asymptotically normal. Therefore a quadratic form $\mathbf{x}'A\mathbf{x}$ is asymptotically a chi square if $A \Sigma A = A$, and the chi square will be central if $A\boldsymbol{\mu} = 0$. One such matrix is

$$A = [(N-1)/N^2]\,[\Gamma^{-} - E V^{-} E'],$$

where

$$\Gamma^{-} = \mathrm{diag}\{P_1^{-1} V^{-}, \ldots, P_c^{-1} V^{-}\}, \qquad V^{-} = D^{-1} - \mathbf{e}\mathbf{e}',$$

and $\mathbf{e}$ is a column of $r$ ones.

It can be easily verified that $A \Sigma A = A$ and that $A\boldsymbol{\mu} = 0$.
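The two identities $A \Sigma A = A$ and $A\boldsymbol{\mu} = 0$ can be checked exactly for a specific set of margins. The sketch below is one way to do so, under two stated assumptions: the margins are hypothetical, and the block matrices are built through Kronecker products, using the fact that $\Gamma - M E V E' M$ has $(j,k)$ block $(\delta_{jk} P_j - P_j P_k) V$ and $\Gamma^{-} - E V^{-} E'$ has $(j,k)$ block $(\delta_{jk} P_j^{-1} - 1) V^{-}$. The helper names are illustrative.

```python
from fractions import Fraction


def diag(v):
    """Diagonal matrix with the entries of v on the diagonal."""
    return [[v[i] if i == j else Fraction(0) for j in range(len(v))]
            for i in range(len(v))]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(s, a):
    return [[s * x for x in row] for row in a]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def kron(a, b):
    """Kronecker product: block (j, k) of the result is a[j][k] * b."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


row_totals, col_totals = [3, 4, 5], [2, 4, 6]    # hypothetical margins, N = 12
N = sum(row_totals)
pi = [Fraction(t, N) for t in row_totals]        # pi_i = x_{i+} / N
P = [Fraction(t, N) for t in col_totals]         # P_j = N_j / N
r, c = len(pi), len(P)
k = Fraction(N * N, N - 1)                       # the factor N^2 / (N - 1)

V = sub(diag(pi), outer(pi, pi))                                 # (2.4)
V_minus = sub(diag([1 / p for p in pi]), outer([1] * r, [1] * r))

Sigma = scale(k, kron(sub(diag(P), outer(P, P)), V))             # (2.6)
A = scale(1 / k, kron(sub(diag([1 / p for p in P]), outer([1] * c, [1] * c)),
                      V_minus))
mu = [N * P_j * pi_i for P_j in P for pi_i in pi]                # N M E pi

ASA = matmul(matmul(A, Sigma), A)
A_mu = [sum(a * m for a, m in zip(row, mu)) for row in A]
print(ASA == A, all(v == 0 for v in A_mu))
```

Because the arithmetic is done with `Fraction`, the comparisons are exact rather than subject to floating-point tolerance.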
Moreover, Pearson's statistic (1.1) is simply $[N/(N-1)]\,\mathbf{x}'A\mathbf{x}$, which is asymptotically equivalent to $\mathbf{x}'A\mathbf{x}$ and hence is asymptotically a chi square. The number of degrees of freedom is $\mathrm{tr}(A\Sigma)$, which can be easily shown to be $(r-1)(c-1)$.

3. DISCUSSION

As we noted before, (1.4) is the conditional distribution of the cell frequencies given the marginal totals in either a multinomial model with probability function (1.2) or a product-multinomial model with probability function (1.3). Thus the asymptotic chi square distribution obtained in Section 2 is the asymptotic conditional distribution of (1.1) under the multinomial and product-multinomial models. Since the chi square distribution does not depend on the marginal totals, this is also the unconditional asymptotic distribution.

Thus the one proof given here for the case of fixed marginals can be easily extended to the other two models.

There is a neat way of summarizing the situation: we may think of the frequencies as a single multinomial and study the distribution of (1.1) under constraints on the two sets of margins. In some cases, these constraints are imposed by the experiment itself; in other cases one set of constraints is imposed by the experiment, the other by the statistician as a mathematical device; and in yet other cases, the statistician imposes all constraints. Whatever the case, the result is the same.

The splitting of degrees of freedom for testing various hypotheses follows easily. As a simple example, consider a set of independent multinomials with numbers of trials $N_1, \ldots, N_c$ and probability vectors $\boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_c$. Suppose we want to test simultaneously

$$H_0: \boldsymbol{\pi}_1 = \cdots = \boldsymbol{\pi}_c = \boldsymbol{\pi}_0, \tag{3.1}$$

where $\boldsymbol{\pi}_0$ is a given probability vector.
The usual test statistic for testing $\boldsymbol{\pi}_j = \boldsymbol{\pi}_0$ is $(\mathbf{x}_j - N_j\boldsymbol{\pi}_0)' D_0^{-1} (\mathbf{x}_j - N_j\boldsymbol{\pi}_0)/N_j$, where $D_0 = \mathrm{diag}(\boldsymbol{\pi}_0)$, that is, the diagonal matrix whose elements are those of the vector $\boldsymbol{\pi}_0$. So a natural choice for testing $H_0$ is

$$\sum_{j=1}^{c} (\mathbf{x}_j - N_j\boldsymbol{\pi}_0)' D_0^{-1} (\mathbf{x}_j - N_j\boldsymbol{\pi}_0)/N_j, \tag{3.2}$$

which is asymptotically a chi square with $c(r-1)$ degrees of freedom. But we could also use the statistic

$$\sum_{j=1}^{c} (\mathbf{x}_j - N_j\boldsymbol{\pi}_0)' D^{-1} (\mathbf{x}_j - N_j\boldsymbol{\pi}_0)/N_j, \tag{3.3}$$

where $D = \mathrm{diag}(\boldsymbol{\pi})$, $\boldsymbol{\pi}' = (x_{1+}/N, \ldots, x_{r+}/N)$. This is the same as using a pooled estimate of the variance under $H_0$ instead of the true variance under $H_0$. In view of the convergence in probability of $D$ to $D_0$ under $H_0$, and the continuity of (3.3) as a function of $D$, (3.2) and (3.3) are asymptotically equivalent. As can be easily verified, (3.3) can be split as follows:

$$\sum_{j=1}^{c} (\mathbf{x}_j - N_j\boldsymbol{\pi})' D^{-1} (\mathbf{x}_j - N_j\boldsymbol{\pi})/N_j \;+\; N(\boldsymbol{\pi} - \boldsymbol{\pi}_0)' D^{-1} (\boldsymbol{\pi} - \boldsymbol{\pi}_0). \tag{3.4}$$

The first term in (3.4) is Pearson's statistic (1.1), with $(r-1)(c-1)$ degrees of freedom, and the second term accounts for the remaining $r-1$. The two terms in (3.4) are asymptotically independent, since we have shown that, conditional on $\boldsymbol{\pi}$, the first term is asymptotically distributed as a chi square that does not depend on $\boldsymbol{\pi}$, while the second term is a function of $\boldsymbol{\pi}$ alone. Thanks to the split in (3.4), it is possible to differentiate between two ways in which the hypothesis (3.1) can be false: the $\boldsymbol{\pi}_j$'s are equal to each other but not to $\boldsymbol{\pi}_0$, or the $\boldsymbol{\pi}_j$'s are not equal to each other but their weighted average is equal to $\boldsymbol{\pi}_0$.
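The split of (3.3) into the two terms of (3.4) is an algebraic identity, so it can be checked exactly on any table. The sketch below does this for one set of counts and one null vector, both hypothetical; the helper `quad` and the variable names are likewise illustrative. The cross term vanishes because the columns $\mathbf{x}_j$ sum to $N\boldsymbol{\pi}$.

```python
from fractions import Fraction


def quad(u, weights):
    """Quadratic form u' D^{-1} u with D = diag(weights)."""
    return sum(a * a / w for a, w in zip(u, weights))


# Hypothetical data: three columns x_j of a 3 x 3 table, and a null vector pi_0.
columns = [[3, 1, 2], [2, 2, 4], [1, 5, 4]]
pi0 = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]

N_j = [sum(col) for col in columns]                    # column totals
N = sum(N_j)
r = len(columns[0])
pi = [Fraction(sum(col[i] for col in columns), N) for i in range(r)]   # x_{i+} / N

# Statistic (3.3): deviations from N_j * pi_0, weighted by the pooled D = diag(pi).
stat_33 = sum(quad([x - n * p0 for x, p0 in zip(col, pi0)], pi) / n
              for col, n in zip(columns, N_j))

# First term of (3.4): deviations from N_j * pi, i.e. Pearson's statistic (1.1).
within = sum(quad([x - n * p for x, p in zip(col, pi)], pi) / n
             for col, n in zip(columns, N_j))

# Second term of (3.4): distance between the pooled estimate pi and pi_0.
between = N * quad([p - p0 for p, p0 in zip(pi, pi0)], pi)

print(stat_33, within, between)
```

The printed values satisfy `stat_33 == within + between` exactly, since all arithmetic is rational.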
The first term in (3.4) can be further split as follows:

$$\begin{aligned}
&\sum_{j=1}^{c_1} (\mathbf{x}_j - N_j\boldsymbol{\pi}^{(1)})' D^{-1} (\mathbf{x}_j - N_j\boldsymbol{\pi}^{(1)})/N_j
+ \sum_{j=c_1+1}^{c} (\mathbf{x}_j - N_j\boldsymbol{\pi}^{(2)})' D^{-1} (\mathbf{x}_j - N_j\boldsymbol{\pi}^{(2)})/N_j \\
&\quad + M_1 (\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi})' D^{-1} (\boldsymbol{\pi}^{(1)} - \boldsymbol{\pi})
+ M_2 (\boldsymbol{\pi}^{(2)} - \boldsymbol{\pi})' D^{-1} (\boldsymbol{\pi}^{(2)} - \boldsymbol{\pi}),
\end{aligned}$$

where $c_1$ is an integer less than $c$, $M_1 = \sum_{j=1}^{c_1} N_j$, $M_2 = \sum_{j=c_1+1}^{c} N_j$, $\boldsymbol{\pi}^{(1)} = \sum_{j=1}^{c_1} \mathbf{x}_j / M_1$ and $\boldsymbol{\pi}^{(2)} = \sum_{j=c_1+1}^{c} \mathbf{x}_j / M_2$. This splits the contingency table into 2 subtables of $c_1$ and $c - c_1$ columns, respectively. Each of the first two terms tests for the equality of the $\boldsymbol{\pi}_j$'s within the corresponding subtable; and the last two terms together test for equality between subtable averages.

APPENDIX

We show here that $\mathbf{x} = (\mathbf{x}_1', \ldots, \mathbf{x}_c')'$, as defined by (2.2) and (2.1), has a limit normal distribution. Our proof follows closely that of Lehmann (1975, p. 352) for convergence to normality in finite population sampling. Let $\mathbf{m}' = (\mathbf{m}_1', \ldots, \mathbf{m}_c')$ be an $rc$-component fixed vector, with $\mathbf{m}_j' = (m_{1j}, \ldots, m_{rj})$, $j = 1, \ldots, c$. We will show that $\mathbf{m}'\mathbf{x} = \sum_{j=1}^{c} \mathbf{m}_j'\mathbf{x}_j$ is asymptotically normal provided $\mathbf{m}$ lies in a certain vector space to be characterized later. Note that

$$\mathbf{m}'\mathbf{x} = \sum_{i=1}^{r} \sum_{j=1}^{c} m_{ij} x_{ij},$$

where the $x_{ij}$'s are the contingency table entries.

Let $U_1, \ldots, U_N$ be independent random variables, each distributed uniformly on $(0,1)$. Define the sets of integers

$$B_j = \Big\{ n \;\Big|\; \sum_{s=0}^{j-1} N_s < n \le \sum_{s=0}^{j} N_s \Big\}, \quad j = 1, \ldots, c, \tag{A.1}$$

and the intervals

$$A_j = \Big\{ x \;\Big|\; \sum_{s=0}^{j-1} P_s < x \le \sum_{s=0}^{j} P_s \Big\}, \quad j = 1, \ldots, c, \tag{A.2}$$
where

$$P_0 = N_0 = 0, \qquad P_s = N_s/N, \quad s = 1, \ldots, c. \tag{A.3}$$

For $k = 1, \ldots, N$ define the random variables

$$a_{jk} = \begin{cases} 1 & \text{if } U_k \in A_j, \\ 0 & \text{otherwise,} \end{cases} \qquad j = 1, \ldots, c, \tag{A.4}$$

and

$$b_{jk} = \begin{cases} 1 & \text{if } R_k \in B_j, \\ 0 & \text{otherwise,} \end{cases} \qquad j = 1, \ldots, c, \tag{A.5}$$

where $R_k$ is the rank of $U_k$, $k = 1, \ldots, N$.

The vectors $\mathbf{a}_k' = (a_{1k}, \ldots, a_{ck})$, $k = 1, \ldots, N$, are independent, each distributed as a multinomial with one trial and probabilities $P_1, \ldots, P_c$. We also have

$$\begin{aligned}
E(a_{jk}) &= E(b_{jk}) = P_j, \\
\mathrm{Var}(a_{jk}) &= \mathrm{Var}(b_{jk}) = P_j(1 - P_j),
\end{aligned} \qquad j = 1, \ldots, c, \quad k = 1, \ldots, N. \tag{A.6}$$

The $a_{jk}$'s and the $b_{jk}$'s define two schemes for splitting into $c$ samples a population of $N$ elements belonging to $r$ categories. Let the $N$ elements $E_1, \ldots, E_N$ be so labeled that the first $x_{1+}$ elements are in category 1, the next $x_{2+}$ elements are in category 2, and so on till the last $x_{r+}$ elements, which are in category $r$. Then the $k$-th element is in category $i$ if $k$ is an integer in the set $C_i$ defined by

$$C_i = \Big\{ k \;\Big|\; \sum_{s=0}^{i-1} x_{s+} < k \le \sum_{s=0}^{i} x_{s+} \Big\}, \quad i = 1, \ldots, r, \tag{A.7}$$

with the convention $x_{0+} = 0$. The $b_{jk}$'s define precisely the sampling scheme described in the introduction, and the contingency table frequencies