Lecture 4: Chi-Square Test

This lecture covers
• Crosstabulation
• Chi-Square Test

Crosstabulation
• Crosstabulations are also called contingency tables or two-way frequency tables. In their simplest form they give the count of the categories of one variable for each category of another variable.
• For example, we might like to examine a crosstabulation of a woman's age with whether she has ever had a child.
• We call this a 7×2 table because it has 7 rows and 2 columns. A table with r rows and c columns is an r×c table.
• A two-way table shows the relationship between two categorical variables. In the cocaine study of Example 1 later in this lecture, the explanatory variable is the treatment (the drug) and the response variable is success (no relapse) or failure (relapse). That two-way table gives the counts for all 6 combinations of values of these variables. Each of the counts occupies a cell of the table.
• We can also calculate row or column percentages. The following table shows column percentages. It presents the percentage age distribution for each category of WCEB (the "whether have any child" variable). We can see straight away that the NO category is younger than the YES category.

5-year age group * whether have any child Crosstabulation
% within whether have any child

5-year age group    no        yes       Total
15-19               51.9%      0.0%     10.3%
20-24               36.0%      7.3%     13.0%
25-29                7.9%     22.1%     19.3%
30-34                1.5%     22.4%     18.3%
35-39                1.3%     14.4%     11.8%
40-44                0.6%     18.2%     14.7%
45-49                0.9%     15.5%     12.6%
Total              100.0%    100.0%    100.0%
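As a rough illustration of how column percentages are obtained from counts, here is a minimal Python sketch. The age-by-WCEB counts are not given in the slides, so the sketch borrows the counts from the cocaine study of Example 1 later in this lecture; the function name `column_percentages` is hypothetical and is not part of SPSS.

```python
def column_percentages(table):
    """Return each cell as a percentage of its column total."""
    n_cols = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(n_cols)]
    return [[100.0 * row[j] / col_totals[j] for j in range(n_cols)]
            for row in table]


# Rows: treatments D, L, P. Columns: no relapse, relapse.
# Counts taken from Example 1 (24 subjects per treatment).
counts = [[14, 10], [6, 18], [4, 20]]

col_pct = column_percentages(counts)
for label, row in zip(["D", "L", "P"], col_pct):
    print(label, [f"{value:.1f}%" for value in row])
```

Each column of `col_pct` sums to 100%, just as each column of the SPSS table above does. Row percentages work the same way with row totals in the denominator.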
• When creating crosstabulations it is standard practice to use the dependent variable as the rows and the independent variable as the columns.
• We can create a crosstabulation with three variables. For example, we may want to see the age distribution for WCEB for urban and rural areas separately. This is shown in the following tables. A dash marks a cell that is blank in the SPSS output.

Rural and urban age distribution for WCEB

5-year age group * whether have any child * place of residence Crosstabulation
% within whether have any child

Place of residence   5-year age group    no        yes       Total
rural                15-19               57.3%      0.1%     10.7%
                     20-24               33.9%      8.4%     13.2%
                     25-29                5.0%     22.9%     19.6%
                     30-34                1.4%     22.8%     18.8%
                     35-39                1.4%     12.9%     10.8%
                     40-44                0.3%     17.9%     14.6%
                     45-49                0.8%     15.0%     12.3%
                     Total              100.0%    100.0%    100.0%
urban                15-19               37.4%      -         8.9%
                     20-24               41.5%      3.3%     12.4%
                     25-29               15.6%     19.3%     18.4%
                     30-34                1.9%     21.1%     16.5%
                     35-39                1.1%     19.9%     15.4%
                     40-44                1.5%     19.1%     14.9%
                     45-49                1.1%     17.3%     13.5%
                     Total              100.0%    100.0%    100.0%

5-year age group * whether have any child * place of residence Crosstabulation
% within 5-year age group

Place of residence   5-year age group    no        yes       Total
rural                15-19               99.5%      0.5%    100.0%
                     20-24               48.0%     52.0%    100.0%
                     25-29                4.8%     95.2%    100.0%
                     30-34                1.4%     98.6%    100.0%
                     35-39                2.4%     97.6%    100.0%
                     40-44                0.4%     99.6%    100.0%
                     45-49                1.3%     98.7%    100.0%
                     Total               18.7%     81.3%    100.0%
urban                15-19              100.0%      -       100.0%
                     20-24               79.4%     20.6%    100.0%
                     25-29               20.1%     79.9%    100.0%
                     30-34                2.7%     97.3%    100.0%
                     35-39                1.7%     98.3%    100.0%
                     40-44                2.4%     97.6%    100.0%
                     45-49                2.0%     98.0%    100.0%
                     Total               23.8%     76.2%    100.0%

• The question is: Is there a significant relationship between a woman's age and having ever had a child?

[Figures: mean of "whether have any child" by 5-year age group (15-19 to 45-49), and by single year of age (15 to 49)]

Please examine this question on your own after class. I will discuss some other examples.
Example 1: Treating cocaine addiction
• This is a three-year study of medications to help cocaine addicts stay off cocaine. It compares three treatments: D, L, and P. Each treatment was randomly assigned to 24 subjects. The counts and proportions of subjects who avoided relapse into cocaine use during the study:
Group   Treatment   Subjects   No relapse   Proportion
1       D           24         14           0.583
2       L           24          6           0.250
3       P           24          4           0.167

• The sample proportions of subjects who stayed off cocaine are quite different. Are these data good evidence that the proportions of successes for the three treatments differ in the population of all cocaine addicts? Does success differ significantly between the treatments? Is there a significant relationship between treatment and success?

Here is the two-way table of the cocaine addiction data:

              Relapse
Treatment     No      Yes
D             14      10
L              6      18
P              4      20

• We want to test the null hypothesis that there are no differences among the proportions of successes for addicts given the three treatments:
\[ H_0 : p_1 = p_2 = p_3 \]
• The alternative hypothesis is that there is some difference, that not all three proportions are equal:
\[ H_1 : \text{not all of } p_1,\ p_2, \text{ and } p_3 \text{ are equal} \]
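The idea of a crosstabulation as a count for every pair of categories can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-subject records are rebuilt from the summary counts above (the raw data are not in the slides), and the variable names are hypothetical.

```python
from collections import Counter

# Summary from the slide: 24 subjects per treatment, and the number
# in each group who avoided relapse.
no_relapse = {"D": 14, "L": 6, "P": 4}
subjects_per_group = 24

# Rebuild one (treatment, relapse) record per subject.
records = []
for treatment, successes in no_relapse.items():
    records += [(treatment, "no")] * successes
    records += [(treatment, "yes")] * (subjects_per_group - successes)

# The crosstabulation counts the records for every (row, column) pair.
crosstab = Counter(records)

# Sample proportion of successes (no relapse) within each treatment.
proportions = {}
for treatment in no_relapse:
    n_no = crosstab[(treatment, "no")]
    n_yes = crosstab[(treatment, "yes")]
    proportions[treatment] = n_no / (n_no + n_yes)
    print(treatment, n_no, n_yes, round(proportions[treatment], 3))
```

The printed counts are the six cells of the two-way table, and the proportions are the sample estimates of the three success proportions compared by the null hypothesis.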
• To test $H_0$, we compare the observed counts in a two-way table with the expected counts, the counts we would expect if $H_0$ were true. If the observed counts are far from the expected counts, that is evidence against $H_0$.

Expected counts
• The expected count in any cell of a two-way table when $H_0$ is true is
\[ \text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} \]
• In more formal language, if we have n independent tries and the probability of a success on each try is p, we expect np successes. If we draw an SRS of n individuals from a population in which the proportion of successes is p, we expect np successes in the sample. That's the fact behind the formula for expected counts in a two-way table.
• Let's apply this fact to the cocaine study. The two-way table with row and column totals is

              Relapse
Treatment     No      Yes     Total
D             14      10      24
L              6      18      24
P              4      20      24
Total         24      48      72
• We will find the expected count for the cell in row 1 and column 1. The proportion of all 72 subjects who succeed in avoiding a relapse is
\[ \frac{\text{count of successes}}{\text{table total}} = \frac{\text{column 1 total}}{\text{table total}} = \frac{24}{72} = \frac{1}{3} \]
Think of this as p, the overall proportion of successes. If $H_0$ is true, we expect this same proportion of successes in all three groups.
• So the expected count of successes among the 24 subjects who took D is
\[ np = 24 \times \frac{1}{3} = 8 \]
This expected count has the form:
\[ \frac{\text{row 1 total} \times \text{column 1 total}}{\text{table total}} = \frac{24 \times 24}{72} = 8 \]

Observed versus expected counts

              No relapse               Relapse
Treatment     Observed   Expected      Observed   Expected
D             14         8             10         16
L              6         8             18         16
P              4         8             20         16

• Because 1/3 of all subjects succeed, we expect 1/3 of the 24 subjects in each group to avoid a relapse if there are no differences among the treatments. In fact, D has more successes (14) and fewer failures (10) than expected. P has fewer successes (4) and more relapses (20) than expected. D does much better than P, with L in between.
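The expected-count formula can be applied to every cell at once. The following Python sketch is one possible way to do it, assuming the table is stored as a list of rows; the function name `expected_counts` is hypothetical.

```python
def expected_counts(observed):
    """Expected count per cell: row total * column total / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    return [[r * c / table_total for c in col_totals] for r in row_totals]


# Rows: D, L, P. Columns: no relapse, relapse (cocaine study counts).
observed = [[14, 10], [6, 18], [4, 20]]

expected = expected_counts(observed)
for label, row in zip(["D", "L", "P"], expected):
    print(label, row)
```

Every treatment row gets the same expected counts (8 successes, 16 relapses) because every treatment has the same row total of 24.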
The Chi-Square Test
• The statistical test that tells us whether those differences are statistically significant compares the observed and expected counts. The test statistic that makes the comparison is the chi-square statistic.

Chi-square statistic
• The chi-square statistic is a measure of how far the observed counts in a two-way table are from the expected counts. The formula for the statistic is
\[ \chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \]
• The chi-square statistic is a sum of terms, one for each cell in the table. In the cocaine example, 14 of the D group succeeded in avoiding a relapse. The expected count for this cell is 8. So the component of the chi-square statistic from this cell is
\[ \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \frac{(14 - 8)^2}{8} = \frac{36}{8} = 4.5 \]
• Think of the chi-square statistic $\chi^2$ as a measure of the distance of the observed counts from the expected counts. Like any distance, it is always zero or positive, and it is zero only when the observed counts are exactly equal to the expected counts. Large values of $\chi^2$ are evidence against $H_0$ because they say that the observed counts are far from what we would expect if $H_0$ were true.
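To make the "sum of terms, one for each cell" concrete, here is a small Python sketch that computes each cell's component and their sum for the cocaine data. The expected counts (8 and 16) are the ones derived earlier in the lecture; the function name `chi_square_statistic` is hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: D, L, P. Columns: no relapse, relapse.
observed = [[14, 10], [6, 18], [4, 20]]
expected = [[8, 16], [8, 16], [8, 16]]

# One component per cell, in the same layout as the table.
components = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
for label, row in zip(["D", "L", "P"], components):
    print(label, row)

print("chi-square =", chi_square_statistic(observed, expected))
```

The first component printed for D is the 4.5 worked out on the slide. A table whose observed counts equal its expected counts would give a statistic of exactly 0.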
The chi-square distribution
• The chi-square distributions are a family of distributions that take only positive values and are skewed to the right. A specific chi-square distribution is specified by giving its degrees of freedom.
• The chi-square test for a two-way table with r rows and c columns uses critical values from the chi-square distribution with (r-1)(c-1) degrees of freedom. The P-value is the area to the right of $\chi^2$ under the chi-square density curve.

Figure 1 shows the density curves for three members of the chi-square family of distributions.

There are three major properties of a chi-square distribution
• Chi-square is either 0 or positive, never negative.
• A chi-square distribution is not symmetrical. Its skewness is positive. As the number of degrees of freedom increases, chi-square approaches a symmetric distribution.
• There is a particular distribution for each number of degrees of freedom.

Table E gives critical values for chi-square distributions. Use Table E to find the P-value for a chi-square test.
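Two of these facts are easy to check numerically. The sketch below computes the degrees of freedom for an r×c table and uses the standard result that a chi-square distribution with k degrees of freedom has skewness sqrt(8/k), a formula that is not stated in the slides and is brought in here as an assumption. The function names and the particular degrees of freedom printed are illustrative choices, not the ones plotted in Figure 1.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for the chi-square test on an r x c table."""
    return (n_rows - 1) * (n_cols - 1)


def chi_square_skewness(df):
    """Skewness of a chi-square distribution with df degrees of freedom.

    Uses the standard formula sqrt(8 / df), which is always positive and
    shrinks toward 0 as df grows.
    """
    return math.sqrt(8 / df)


print("3x2 table:", degrees_of_freedom(3, 2), "degrees of freedom")
print("7x2 table:", degrees_of_freedom(7, 2), "degrees of freedom")

for df in (1, 2, 4, 8, 50, 200):
    print(df, round(chi_square_skewness(df), 3))
```

The skewness stays positive for every df but decreases steadily, which matches the statement that the distribution becomes more symmetric as the degrees of freedom increase.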
• We use the formula to calculate the chi-square statistic:
\[
\begin{aligned}
\chi^2 &= \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \\
&= \frac{(14-8)^2}{8} + \frac{(10-16)^2}{16} + \frac{(6-8)^2}{8} + \frac{(18-16)^2}{16} + \frac{(4-8)^2}{8} + \frac{(20-16)^2}{16} \\
&= 4.50 + 2.25 + 0.50 + 0.25 + 2.00 + 1.00 \\
&= 10.50
\end{aligned}
\]
• The two-way table has 3 rows and 2 columns. That is, r=3, c=2. The chi-square statistic therefore has degrees of freedom (r-1)(c-1)=(3-1)(2-1)=(2)(1)=2.
• Look in the df=2 row of Table E. The chi-square statistic $\chi^2$=10.5 falls between the 0.01 and 0.005 critical values (9.21 and 10.60). Remember that the chi-square test is always one-sided. So the P-value of $\chi^2$=10.5 is between 0.005 and 0.01. The P-value is equal to 0.005 when rounded to three decimal places.

Using SPSS we can easily find the P-value, for example with COMPUTE prob = 1 - CDF.CHISQ(10.5, 2).

• If we want our significance level to be 0.05, the critical value for df=2 is 5.99. To reject the null hypothesis at the 0.05 level, the value of chi-square needs to be greater than 5.99. If it were less than 5.99, we would fail to reject the null hypothesis.
• In this example, the value of chi-square is 10.5, which is greater than 5.99, so we reject the null hypothesis. We can conclude that there is a statistically significant difference in the effects of the treatments (drugs).
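For the special case of 2 degrees of freedom, the P-value and critical values can be checked without a table or SPSS, because the upper-tail area of a chi-square distribution with df=2 has the closed form exp(-x/2). The Python sketch below relies on that fact, so it applies only to df=2; the function names are hypothetical and this is not how SPSS computes the value internally.

```python
import math


def chi_square_p_value_df2(statistic):
    """Upper-tail area P(X > statistic) for chi-square with 2 df."""
    return math.exp(-statistic / 2)


def chi_square_critical_value_df2(alpha):
    """Value with upper-tail area alpha for chi-square with 2 df."""
    return -2 * math.log(alpha)


statistic = 10.5
p_value = chi_square_p_value_df2(statistic)
print("P-value:", round(p_value, 5))

# Critical values for a few upper-tail areas, df = 2.
for alpha in (0.05, 0.01, 0.005):
    print(alpha, round(chi_square_critical_value_df2(alpha), 2))

# Decision at the 0.05 level: reject H0 if the statistic exceeds the
# critical value.
reject = statistic > chi_square_critical_value_df2(0.05)
print("Reject H0 at 0.05:", reject)
```

For tables with other degrees of freedom there is no such simple formula, and a table of critical values or statistical software is the practical route.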
Using Crosstabs in SPSS
• Calculating the expected counts and then the chi-square statistic by hand is a bit time-consuming. We can avoid this trouble by using SPSS's Crosstabs procedure. But you need to arrange the data in the following format:

[Screenshots: SPSS Data Editor]