When qualitative attributes of individuals within a population are the object of study, data often shows up in tabular form. Some null hypothesis is stated concerning the population, which corresponds to a statement about the expected form of the table. In order to test the hypothesis, it is necessary to compare the observed table and the expected one, i.e., it is necessary to measure the difference between the two tables.
Assume that we interview 200 registered voters, and ask them (a) which party candidate they supported in 1992, and (b) on which TV channel they followed the election returns (network, all-news, or Comedy Central). We obtain the data tabulated below. [Obviously, a full study would expand the list of nominees to include independent candidates, and the channel list to include both other news sources and "none". I'm just keeping the example small.]
            | network | CNN | CC
Republican  |   46    | 24  | 26
Democrat    |   30    | 34  | 40
It appears that Republicans favored the network coverage, and Democrats Comedy Central. But do we have enough evidence here to draw a strong conclusion? Let's formulate a null hypothesis: Candidate preference was independent of viewing preference. We'll use the data to test this hypothesis -- a low significance level will tell us that our data, all by itself, provides strong evidence against the null hypothesis.
What would we expect to see, as the result of our study, if candidate and viewing preferences were independent (i.e., if the null hypothesis were true)? We would expect the relative frequency within each cell to be the product of the corresponding true marginal frequencies. And, although we don't know the true marginal frequencies, we can estimate them from our data!
The estimated marginal frequencies are
            | network | CNN | CC  |
Republican  |   46    | 24  | 26  | 48%
Democrat    |   30    | 34  | 40  | 52%
            |   38%   | 29% | 33% |
With these estimates, we would expect, were the null hypothesis true, cell frequencies of
            | network | CNN    | CC     |
Republican  | 18.24%  | 13.92% | 15.84% | 48%
Democrat    | 19.76%  | 15.08% | 17.16% | 52%
            | 38%     | 29%    | 33%    |
and, since the sample size was 200, we would have expected actual cell entries of
            | network | CNN   | CC
Republican  |  36.48  | 27.84 | 31.68
Democrat    |  39.52  | 30.16 | 34.32
It remains only to measure the difference between our original observations and the expectations we would have were the null hypothesis true. The squared difference between corresponding cell entries in the first (observed) and last (expected) tables is a tempting measure. However, it would count the difference between 10 and 15 as just as large as the difference between 200 and 205, while instinct suggests that the first difference is somehow more substantial than the second. So, instead, we measure each cell's difference by squaring the difference between the observed and expected entries, and then dividing by the expected entry. And finally, we add these "normalized" differences to measure the aggregate difference between the two tables. In this case, the resulting number is 7.7546.
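To make the arithmetic concrete, here is a minimal Python sketch (using numpy, which the original notes don't mention) that rebuilds the expected table from the marginals and sums the normalized squared differences:

```python
import numpy as np

obs = np.array([[46, 24, 26],      # Republican: network, CNN, CC
                [30, 34, 40]],     # Democrat:   network, CNN, CC
               dtype=float)
n = obs.sum()                                            # 200 voters

# expected count = (row total) * (column total) / n,
# i.e. the product of the estimated marginal frequencies
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n

# squared differences, each normalized by the expected entry
stat = ((obs - expected) ** 2 / expected).sum()
print(expected)   # 36.48, 27.84, 31.68 / 39.52, 30.16, 34.32
print(stat)       # about 7.7546
```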
Is this a "large" difference? To answer the question, we take the standard statistical approach: We imagine the entire procedure (from the drawing of the original sample, to the calculation of the final table difference) being carried out on a population in which the two traits being studied really are independent, we consider the result of the procedure as a random variable, and we ask how extreme the actual difference is in the distribution of that random variable.
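One way to see that distribution directly is simulation. The sketch below (an illustration, not part of the original argument) treats the estimated marginals as if they were the true ones, draws the two traits independently for 200 voters, and repeats the whole procedure many times:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 10_000
stats = np.empty(reps)
for i in range(reps):
    # null hypothesis true: party and channel drawn independently
    # (the 48/52 and 38/29/33 splits are illustrative stand-ins
    # for the unknown true marginals)
    party = rng.choice(2, size=n, p=[0.48, 0.52])
    channel = rng.choice(3, size=n, p=[0.38, 0.29, 0.33])
    obs = np.zeros((2, 3))
    np.add.at(obs, (party, channel), 1)
    # re-estimate marginals from the simulated sample, as in the text
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    stats[i] = ((obs - expected) ** 2 / expected).sum()

# how often does sampling error alone produce a difference this large?
print((stats >= 7.7546).mean())   # roughly 0.02
```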
A random variable resulting from a procedure such as ours is said to have a χ²-distribution. There are many χ²-distributions, each with a different number of "degrees of freedom". In this case, we note that our problem has 2 degrees of freedom: If one wishes to distribute 200 individuals into the six cells of the table, and get the marginal frequencies we estimated, then, once two cells (for example, the two leftmost cells in the top row) are specified, the other entries are all "forced". (Conducting a similar "test of independence" on a table with r rows and c columns, we would end up with (r-1)(c-1) degrees of freedom.) Looking in a χ² table, we find that a difference as large as the one we see here can be expected to turn up, purely due to sampling error, when the null hypothesis is true, less than 2.5% of the time, but more than 1% of the time. So the data, all by itself, makes us very suspicious of the null hypothesis, i.e., provides very strong evidence that candidate and viewing preferences are related.
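For readers who prefer software to tables, scipy packages the whole procedure (expected table, statistic, degrees of freedom, and tail probability) into one call; a sketch, assuming scipy is available:

```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[46, 24, 26],
                [30, 34, 40]])
stat, p, dof, expected = chi2_contingency(obs)
print(stat, dof, p)   # about 7.7546, 2, 0.0207 -- between 1% and 2.5%
```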
In contrast, suppose that we had sampled only 100 voters, and had obtained the following table:
            | network | CNN | CC
Republican  |   23    | 12  | 13
Democrat    |   15    | 17  | 20
Our calculations would have led us to a χ²-statistic of only 3.8773. The significance level of the data would have been somewhere between 10% and 25%, indicating that the data did not offer much contradiction to the null hypothesis.
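The same tail-probability lookup as above (again assuming scipy) pins the number down:

```python
from scipy.stats import chi2

print(chi2.sf(3.8773, df=2))   # about 0.144 -- between 10% and 25%
```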
60 truck drivers are monitored over the course of a year. Delivery errors are tracked for each of the drivers, and at year's end the mean number of errors per driver is 6.35, with the following results observed:
number of delivery errors
 0-4 | 5-6 | 7-8 | 9-10 | 11+
  15 |  13 |  16 |   8  |   8
We wonder whether some drivers are doing significantly better than others, so we take as our null hypothesis: All drivers have the same inherent error rate. How strongly does the data contradict this hypothesis?
If the null hypothesis is true, then our observations should fit a Poisson distribution. So we determine how many drivers we would expect to see in each cell, were the number of errors Poisson with mean 6.35:
expected number of delivery errors
  0-4  |  5-6  |  7-8  | 9-10 | 11+
 14.46 | 18.56 | 15.53 | 7.93 | 3.53
The χ²-statistic in this case is 7.37. The number of degrees of freedom? We have 5 numbers in our table, but they must sum to 60 (taking away one degree of freedom), and the original data was used to obtain the estimated mean of 6.35 (taking another): three degrees of freedom are left. Turning to the χ² table, we find that the significance level of the data is between 5% and 10% (i.e., the data is significant at the 10% level, but not at the 5% level). So we have some evidence that the drivers are not homogeneous, but not very strong evidence.
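A sketch of the whole calculation (binning a Poisson(6.35) distribution into the table's categories, then computing the statistic and its tail probability; assumes numpy/scipy):

```python
import numpy as np
from scipy.stats import poisson, chi2

obs = np.array([15, 13, 16, 8, 8], dtype=float)
mu = 6.35                      # estimated from the data itself

# expected counts in the bins 0-4, 5-6, 7-8, 9-10, 11+
edges = np.array([-1, 4, 6, 8, 10, np.inf])
expected = 60 * np.diff(poisson.cdf(edges, mu))
print(expected.round(2))       # [14.46 18.56 15.53  7.93  3.53]

stat = ((obs - expected) ** 2 / expected).sum()
dof = len(obs) - 1 - 1         # one for the fixed total, one for the estimated mean
print(stat, chi2.sf(stat, dof))   # about 7.37 and 0.06 -- between 5% and 10%
```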
One of the first applications of the χ² goodness-of-fit test was to Gregor Mendel's pea-breeding data, which appears too good to be true (i.e., his data fit the predictions of his genetic model extremely well). The null hypothesis: Mendel honestly reported his data. The test: Compute the χ²-statistic, and do a left-tail test, to see if it's surprisingly small. The result: Mendel's data, all by itself, provides extraordinarily strong evidence against the null hypothesis. It seems that he was so convinced of his theory's correctness, and so wanted to persuade others, that he "cleaned up" results which didn't seem to fit (even though, with the understanding of uncertainty that we have and he lacked, we would expect to see occasional anomalous results of the type he apparently "corrected," purely due to sampling error).
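For the mechanics only (Mendel's actual counts aren't reproduced here), a left-tail test looks like this; the statistic and degrees of freedom below are placeholders:

```python
from scipy.stats import chi2

stat, dof = 0.5, 8             # placeholder values, not Mendel's data
p_left = chi2.cdf(stat, dof)   # probability of a fit at least this good
print(p_left)   # a tiny left-tail probability means "too good to be true"
```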