Lesson 44: Chi-Square Goodness-of-Fit Tests

Suppose the Penn State student population is 60% female and 40% male. Then, if a sample of 100 students yields 53 females and 47 males, can we conclude that the sample is (random and) representative of the population? That is, how "good" do the data "fit" the probability model? As the title of the lesson suggests, that's the kind of question that we will answer in this lesson.

The General Approach

As is often the case, we'll motivate the methods of this lesson by way of example. Specifically, we'll return to the question posed in the introduction to this lesson.

Example

Suppose the Penn State student population is 60% female and 40% male. Then, if a sample of 100 students yields 53 females and 47 males, can we conclude that the sample is (random and) representative of the population? That is, how "good" do the data "fit" the assumed probability model of 60% female and 40% male?

Solution. Testing whether there is a "good fit" between the observed data and the assumed probability model amounts to testing:

\(H_0 : p_F =0.60\)
\(H_A : p_F \ne 0.60\)

Now, letting Y1 denote the number of females selected, we know that Y1 follows a binomial distribution with n trials and probability of success p1.  That is:

\(Y_1 \sim b(n,p_1)\)

Therefore, the expected value and variance of Y1 are, respectively:

\(E(Y_1)=np_1\)  and   \(Var (Y_1) =np_1(1-p_1)\)

And, letting Y2 denote the number of males selected, we know that Y2 = n − Y1 follows a binomial distribution with n trials and probability of success p2. That is:

\(Y_2 = n-Y_1 \sim b(n,p_2)=b(n,1-p_1)\)

Therefore, the expected value and variance of Y2 are, respectively:

\(E(Y_2)=n(1-p_1)=np_2\)   and   \(Var(Y_2)=n(1-p_1)(1-(1-p_1))=np_1(1-p_1)=np_2(1-p_2)\)

Now, for large samples (np1 ≥ 5 and n(1−p1) ≥ 5), the Central Limit Theorem yields the normal approximation to the binomial distribution. That is:

\[Z=\frac{(Y_1-np_1)}{\sqrt{np_1(1-p_1)}}\]

follows, at least approximately, the standard normal N(0,1) distribution. Therefore, upon squaring Z, we get that:

\[Z^2=Q_1=\frac{(Y_1-np_1)^2}{np_1(1-p_1)}\]

follows an approximate chi-square distribution with one degree of freedom. Now, we could stop there. But, that's not typically what is done. Instead, we can rewrite Q1 a bit. Let's start by multiplying Q1 by 1 in a special way, that is, by multiplying it by ((1−p1)+p1):

\[Q_1=\frac{(Y_1-np_1)^2}{np_1(1-p_1)} \times ((1-p_1)+p_1)\]

Then, distributing the "1" across the fraction, we get:

\[Q_1=\frac{(Y_1-np_1)^2(1-p_1)}{np_1(1-p_1)}+\frac{(Y_1-np_1)^2p_1}{np_1(1-p_1)}\]

which simplifies to:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(Y_1-np_1)^2}{n(1-p_1)}\]

Now, taking advantage of the fact that Y1 = n − Y2 and p1 = 1 − p2, we get:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(n-Y_2-n(1-p_2))^2}{np_2}\]

which simplifies to:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(-(Y_2-np_2))^2}{np_2}\]

Just one more thing to simplify before we're done:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(Y_2-np_2)^2}{np_2}\]

In summary, we have rewritten Q1 as:

\[ Q_1=\sum_{i=1}^{2}\frac{(Y_i-np_i)^2}{np_i}=\sum_{i=1}^{2}\frac{(\text{OBSERVED}-\text{EXPECTED})^2}{\text{EXPECTED}} \]

We'll use this form of Q1, and the fact that Q1 follows an approximate chi-square distribution with one degree of freedom, to conduct the desired hypothesis test.
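If you'd like to see the chi-square approximation in action before we proceed, here is a quick simulation sketch in Python (a sketch only, assuming NumPy and SciPy are available; the variable names are ours, not part of the lesson):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=44)
n, p1 = 100, 0.6

# Draw many samples of Y1 ~ b(100, 0.6) and compute Q1 for each one
y1 = rng.binomial(n, p1, size=100_000)
q1 = (y1 - n * p1) ** 2 / (n * p1) + ((n - y1) - n * (1 - p1)) ** 2 / (n * (1 - p1))

# The upper quantiles of the simulated Q1 should sit near those of chi-square(1)
for prob in (0.90, 0.95, 0.99):
    print(prob, np.quantile(q1, prob).round(3), stats.chi2.ppf(prob, df=1).round(3))
```

The simulated 0.95 quantile should land near 3.84, the chi-square(1) critical value that appears below.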

Before we return to solving the problem posed by our example, a couple of points are worthy of emphasis.

(1) First, Q1 has only one degree of freedom, since there is only one independent count, namely Y1. Once Y1 is known, the value of Y2 = n − Y1 immediately follows.

(2) Note that the derived approach requires the Central Limit Theorem to kick in. The general rule of thumb is that the expected number of successes must be at least 5 (that is, np1 ≥ 5) and the expected number of failures must be at least 5 (that is, n(1−p1) ≥ 5).

(3) The statistic Q1 will be large if the observed counts are very different from the expected counts. Therefore, we must reject the null hypothesis H0 if Q1 is large. How large is large? Large is determined by the values of a chi-square random variable with one degree of freedom, which can be obtained either from a statistical software package, such as Minitab or SAS, or from a standard chi-square table, such as the one in the back of our textbook.

(4) The statistic Q1 is called the chi-square goodness-of-fit statistic.

Example (continued)

Suppose the Penn State student population is 60% female and 40% male. Then, if a sample of 100 students yields 53 females and 47 males, can we conclude that the sample is (random and) representative of the population? Use the chi-square goodness-of-fit statistic to test the hypotheses:

\(H_0 : p_F =0.60\)
\(H_A : p_F \ne 0.60\)

using a significance level of α = 0.05.

Solution. The value of the test statistic Q1 is:

\[Q_1=\frac{(53-60)^2}{60}+\frac{(47-40)^2}{40}=2.04\]

We should reject the null hypothesis if the observed number of counts is very different from the expected number of counts, that is, if Q1 is large. Because Q1 follows a chi-square distribution with one degree of freedom, we should reject the null hypothesis, at the 0.05 level, if:

\[Q_1 \ge \chi_{1, 0.05}^{2}=3.84\]

Because:

\[Q_1=2.04 < 3.84\]

we do not reject the null hypothesis. There is not enough evidence at the 0.05 level to conclude that the data don't fit the assumed probability model. 
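If you want to verify these numbers with software rather than a chi-square table, here's a minimal Python sketch (assuming SciPy; scipy.stats.chisquare implements exactly the observed-versus-expected statistic above):

```python
from scipy import stats

observed = [53, 47]   # females, males in the sample
expected = [60, 40]   # n * p under H0: 100(0.60) and 100(0.40)

q1, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(round(q1, 2))                          # 2.04
print(round(stats.chi2.ppf(0.95, df=1), 2))  # 3.84, the critical value
print(round(p_value, 4))                     # about 0.153 (see the P-value below)
```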

As an aside, it is interesting to note the relationship between the chi-square goodness-of-fit statistic Q1 and the Z-statistic we've previously used for testing a null hypothesis about a proportion. In this case:

\[ Z=\frac{0.53-0.60}{\sqrt{\frac{(0.60)(0.40)}{100}}}=-1.428 \]

which, when squared, gives the same value as we obtained for Q1:

\[Q_1=Z^2 =(-1.428)^2=2.04\]

as we should expect. The Z-test for a proportion tells us that we should reject if:

\[|Z| \ge 1.96 \]

Well, again, if we square both sides, we see that this is equivalent to rejecting if:

\[Q_1 \ge (1.96)^2 =3.84\]

And not surprisingly, the P-values obtained from the two approaches are identical. The P-value for the chi-square goodness-of-fit test is:

\[P=P(\chi_{(1)}^{2} > 2.04)=0.1532\]

while the P-value for the Z-test is:

\[P=2 \times P(Z>1.428)=2(0.0766)=0.1532\]

Identical, as we should expect!
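A short numerical check of this equivalence, again sketched in Python with NumPy and SciPy assumed:

```python
import numpy as np
from scipy import stats

p_hat, p0, n = 0.53, 0.60, 100
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)

print(round(z, 3))                            # -1.429
print(round(z ** 2, 2))                       # 2.04, the same value as Q1
print(round(2 * stats.norm.sf(abs(z)), 4))    # two-sided Z-test P-value
print(round(stats.chi2.sf(z ** 2, df=1), 4))  # chi-square P-value -- identical
```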

Extension to K Categories

The work on the previous page is all well and good if your probability model involves just two categories, which, as we have seen, reduces to conducting a test for one proportion. What happens if our probability model involves three or more categories? It takes some theoretical work beyond the scope of this course to show it, but the chi-square statistic that we derived on the previous page can be extended to accommodate any number k of categories. We can thank the famous statistician Karl Pearson for doing the dirty work for us (in 1900).

The Extension

Suppose an experiment can result in any of k mutually exclusive and exhaustive outcomes, say A1, A2, ..., Ak. If the experiment is repeated n independent times, and we let pi = P(Ai) and Yi = the number of times the experiment results in Ai, i = 1, ..., k, then we can summarize the number of observed outcomes and the number of expected outcomes for each of the k categories in a table as follows:

Category      A1     A2     ...     Ak
Observed      Y1     Y2     ...     Yk
Expected     np1    np2     ...    npk

Karl Pearson showed that the chi-square statistic Qk−1 defined as:

\[Q_{k-1}=\sum_{i=1}^{k}\frac{(Y_i - np_i)^2}{np_i}  \]

follows, approximately, a chi-square distribution with k−1 degrees of freedom. Let's try it out on an example.

Example

A particular brand of candy-coated chocolate comes in five different colors that we shall denote as A1 = {brown}, A2 = {yellow}, A3 = {orange}, A4 = {green}, and A5 = {coffee}.

Let pi equal the probability that the color of a piece of candy selected at random belongs to Ai, for i = 1, 2, 3, 4, 5. Test the following null and alternative hypotheses:

\(H_0 : p_{Br}=0.4,p_{Y}=0.2,p_{O}=0.2,p_{G}=0.1,p_{C}=0.1   \)

\(H_A : \text{at least one } p_i \text{ differs from the value specified in } H_0 \text{ (many possible alternatives)} \)

using a random sample of n = 580 pieces of candy whose colors yielded the respective frequencies 224, 119, 130, 48, and 59. (This example comes from Exercise 8.1-2 in the Hogg and Tanis (8th edition) textbook.)

Solution. We can summarize the observed (yi) and expected (npi) counts in a table as follows:

Color        Brown   Yellow   Orange   Green   Coffee
Observed       224      119      130      48       59
Expected       232      116      116      58       58

where, for example, the expected number of brown candies is:

np1 = 580(0.40) = 232

and the expected number of green candies is:

np4 = 580(0.10) = 58

Once we have the observed and expected number of counts, the calculation of the chi-square statistic is straightforward. It is: 

\[Q_4=\frac{(224-232)^2}{232}+\frac{(119-116)^2}{116}+\frac{(130-116)^2}{116}+\frac{(48-58)^2}{58}+\frac{(59-58)^2}{58}  \]

Simplifying, we get:

\[Q_4=\frac{64}{232}+\frac{9}{116}+\frac{196}{116}+\frac{100}{58}+\frac{1}{58}=3.784  \]

Because there are k = 5 categories, we have to compare our chi-square statistic Q4 to a chi-square distribution with k−1 = 5−1 = 4 degrees of freedom:

\[\text{Reject }H_0 \text{ if } Q_4\ge \chi_{4,0.05}^{2}=9.488\]

Because Q4 = 3.784 < 9.488, we fail to reject the null hypothesis.  There is insufficient evidence at the 0.05 level to conclude that the distribution of the color of the candies differs from that specified in the null hypothesis.
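As a quick check of the arithmetic, here is the same calculation in a Python sketch (SciPy assumed):

```python
from scipy import stats

observed = [224, 119, 130, 48, 59]      # brown, yellow, orange, green, coffee
null_probs = [0.40, 0.20, 0.20, 0.10, 0.10]
n = sum(observed)                       # 580
expected = [n * p for p in null_probs]  # [232, 116, 116, 58, 58]

q4, p_value = stats.chisquare(observed, f_exp=expected)
print(round(q4, 3))                          # 3.784
print(round(stats.chi2.ppf(0.95, df=4), 3))  # 9.488, the critical value
print(round(p_value, 3))                     # about 0.436, well above 0.05
```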

By the way, this might be a good time to think about the practical meaning of the term "degrees of freedom." Recalling the example on the last page, we had two categories (male and female) and one degree of freedom. If we are sampling n = 100 people and 53 of them are female, then we absolutely must have 100−53 = 47 males.  If we had instead 62 females, then we absolutely must have 100−62 = 38 males. That is, the number of females is the one number that is "free" to be any number, but once it is determined, then the number of males immediately follows. It is in this sense that we have "one degree of freedom."

With the example on this page, we have five categories of candies (brown, yellow, orange, green, coffee) and four degrees of freedom.  If we are sampling n = 580 candies, and 224 are brown, 119 are yellow, 130 are orange, and 48 are green, then we absolutely must have 580−(224+119+130+48) = 59 coffee-colored candies. In this case, we have four numbers that are "free" to be any number, but once they are determined, then the number of coffee-colored candies immediately follows. It is in this sense that we have "four degrees of freedom."

Unspecified Probabilities

For the two examples that we've thus far considered, the probabilities were pre-specified. For the first example, we were interested in seeing if the data fit a probability model in which there was a 0.60 probability that a randomly selected Penn State student was female. In the second example, we were interested in seeing if the data fit a probability model in which the probabilities of selecting a brown, yellow, orange, green, and coffee-colored candy was 0.4, 0.2, 0.2, 0.1, and 0.1, respectively. That is, we were interested in testing specific probabilities:

\(H_0 : p_{B}=0.40,p_{Y}=0.20,p_{O}=0.20,p_{G}=0.10,p_{C}=0.10 \)

Someone might be also interested in testing whether a data set follows a specific probability distribution, such as:

\(H_0 : X \sim b(n, 1/2)\)

What if the probabilities aren't pre-specified though? That is, suppose someone is interested in testing whether a random variable is binomial, but with an unspecified probability of success:

\(H_0 : X \sim b(n, p)\)

Can we still use the chi-square goodness-of-fit statistic? The short answer is yes... with just a minor modification. 

Example

Let X denote the number of heads when four dimes are tossed at random. One hundred repetitions of this experiment resulted in 0, 1, 2, 3, and 4 heads being observed on 8, 17, 41, 30, and 4 trials, respectively. Under the assumption that the four dimes are independent, and the probability of getting a head on each coin is p, the random variable X is b(4, p). In light of the observed data, is b(4, p) a reasonable model for the distribution of X?

Solution. In order to use the chi-square statistic to test the data, we need to be able to determine the observed and expected number of trials in which we'd get 0, 1, 2, 3, and 4 heads. The observed part is easy... we know those:

Heads (x)      0     1     2     3     4
Observed       8    17    41    30     4

It's the expected numbers that are a problem. If the probability p of getting a head were specified, we'd be able to calculate the expected numbers. Suppose, for example, that p = 1/2. Then, the probability of getting zero heads in four dimes is:

\[P(X=0)=\binom{4}{0}\left(\frac{1}{2}\right)^0\left(\frac{1}{2}\right)^4=0.0625  \]

and therefore the expected number of trials resulting in 0 heads is 100 × 0.0625 = 6.25. We could make similar calculations for the case of 1, 2, 3, and 4 heads, and we would be well on our way to using the chi-square statistic:

\[Q_4=\sum_{i=0}^{4}\frac{(Obs_i-Exp_i)^2}{Exp_i}  \]

and comparing it to a chi-square distribution with 5−1 = 4 degrees of freedom. But, we don't know p, as it is unspecified! What do you think the logical thing would be to do in this case? Sure... we'd probably want to estimate p. But then that begs the question... what should we use as an estimate of p?

One way of estimating p would be to minimize the chi-square statistic Q4 with respect to p, yielding an estimator \(\tilde{p}\). This \(\tilde{p}\) estimator is called, perhaps not surprisingly, a minimum chi-square estimator of p. If \(\tilde{p}\) is used in calculating the expected numbers that appear in Q4, it can be shown (not easily, and therefore we won't!) that Q4 still has an approximate chi-square distribution but with only 4−1 = 3 degrees of freedom. The number of degrees of freedom of the approximating chi-square distribution is reduced by one, because we have to estimate one parameter in order to calculate the chi-square statistic. In general, the number of degrees of freedom of the approximating chi-square distribution is reduced by d, the number of parameters estimated. If we estimate two parameters, we reduce the degrees of freedom by two. And so on.

This all seems simple enough. There's just one problem... it is usually very difficult to find minimum chi-square estimators. So what to do? Well, most statisticians just use some other reasonable method of estimating the unspecified parameters, such as maximum likelihood estimation. The good news is that the chi-square statistic testing method still works well. (It should be noted, however, that the approach does provide a slightly larger probability of rejecting the null hypothesis than would the approach based purely on the minimized chi-square.)
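To make the idea concrete, here is a hypothetical numerical sketch of a minimum chi-square estimator, applied to the dime data from the example above (Python with NumPy and SciPy assumed; the function q4 and the optimizer call are our own illustration, not part of the lesson):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

observed = np.array([8, 17, 41, 30, 4])   # trials with 0, 1, 2, 3, 4 heads
n = observed.sum()                        # 100 repetitions

def q4(p):
    """The chi-square statistic as a function of the unknown p."""
    expected = n * stats.binom.pmf(np.arange(5), 4, p)
    return np.sum((observed - expected) ** 2 / expected)

# Minimize Q4 numerically over plausible values of p
result = minimize_scalar(q4, bounds=(0.01, 0.99), method="bounded")
print(result.x)   # the minimum chi-square estimate of p
```

For these data, the minimizing value of p should land very close to the maximum likelihood estimate of 0.5125 computed below; the two estimators are asymptotically equivalent.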

Let's summarize.

Chi-square method when parameters are unspecified. If you are interested in testing whether a data set fits a probability model with d parameters left unspecified:

(1) Estimate the d parameters using the maximum likelihood method (or another reasonable method).

(2) Calculate the chi-square statistic Qk−1 using the obtained estimates.

(3) Compare the chi-square statistic to a chi-square distribution with (k−1)−d degrees of freedom.

Example (continued)

Let X denote the number of heads when four dimes are tossed at random. One hundred repetitions of this experiment resulted in 0, 1, 2, 3, and 4 heads being observed on 8, 17, 41, 30, and 4 trials, respectively. Under the assumption that the four dimes are independent, and the probability of getting a head on each coin is p, the random variable X is b(4, p). In light of the observed data, is b(4, p) a reasonable model for the distribution of X?

Solution. Given that four dimes are tossed 100 times, we have 400 coin tosses resulting in 205 heads for an estimated probability of success of 0.5125:

\[\hat{p}=\frac{0(8)+1(17)+2(41)+3(30)+4(4)}{400}=\frac{205}{400}=0.5125 \]

Using 0.5125 as the estimate of p, we can use the binomial p.m.f. (or Minitab!) to calculate the probability that X = 0, 1, ..., 4:

Heads (x)         0        1        2        3        4
P(X = x)     0.0565   0.2375   0.3745   0.2625   0.0690

and then, using the probabilities, the expected number of trials resulting in 0, 1, 2, 3, and 4 heads:

Heads (x)        0       1       2       3       4
Observed         8      17      41      30       4
Expected      5.65   23.75   37.45   26.25    6.90

Calculating the chi-square statistic, we get:

\[Q_4=\frac{(8-5.65)^2}{5.65}+\frac{(17-23.75)^2}{23.75}+ ... + \frac{(4-6.90)^2}{6.90}  =4.99\]

We estimated d = 1 parameter in calculating the chi-square statistic. Therefore, we compare the statistic to a chi-square distribution with (5−1)−1 = 3 degrees of freedom. Doing so:

\[Q_4= 4.99 < \chi_{3,0.05}^{2}=7.815\]

we fail to reject the null hypothesis. There is insufficient evidence at the 0.05 level to conclude that the data don't fit a binomial probability model.
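The whole three-step procedure fits in a few lines of Python (SciPy assumed; note that chisquare's ddof argument implements the reduction in degrees of freedom for the d = 1 estimated parameter):

```python
import numpy as np
from scipy import stats

observed = np.array([8, 17, 41, 30, 4])   # trials with 0, 1, 2, 3, 4 heads
n_trials = observed.sum()                 # 100 repetitions of 4 tosses

# Step 1: maximum likelihood estimate of p -- total heads over total tosses
p_hat = (observed * np.arange(5)).sum() / (4 * n_trials)   # 205/400 = 0.5125

# Step 2: chi-square statistic from the estimated expected counts
expected = n_trials * stats.binom.pmf(np.arange(5), 4, p_hat)
q, p_value = stats.chisquare(observed, f_exp=expected, ddof=1)

# Step 3: the P-value is based on (5-1)-1 = 3 degrees of freedom
print(round(q, 2))        # about 4.99
print(round(p_value, 3))  # about 0.17, so we fail to reject at the 0.05 level
```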

Let's take a look at another example.

Example

Let X equal the number of alpha particles emitted from barium-133 in 0.1 second and counted by a Geiger counter. One hundred observations of X produced these data:

[Table: the frequency of each observed value of X, from x = 0 to x = 12]

It is claimed that X follows a Poisson distribution. Use a chi-square goodness-of-fit statistic to test whether this is true. 

Solution. Note that very few observations resulted in 0, 1, or 2 alpha particles being emitted in 0.1 second, and likewise very few resulted in 10, 11, or 12. Therefore, let's "collapse" the data at the two ends, yielding nine "not-so-sparse" categories:

[Table: observed counts for the nine collapsed categories]

Because λ, the mean of X, is not specified, we can estimate it with its maximum likelihood estimator, namely, the sample mean. Using the data, we get:

\[\bar{x}=\frac{1(1)+2(4)+3(13)+ ... + 12(1)}{100}=\frac{559}{100}=5.59 \approx 5.6\]

We can now estimate the probability that an observation will fall into each of the categories. The probability of falling into category 1, for example, is:

\[P(\{1\})=P(X=0)+P(X=1)+P(X=2) =\frac{e^{-5.6}5.6^0}{0!}+\frac{e^{-5.6}5.6^1}{1!}+\frac{e^{-5.6}5.6^2}{2!}=0.0824 \]

Here's what our table looks like now, after adding a column containing the estimated probabilities:

[Table: the nine categories with their observed counts and estimated probabilities]

Now, we just have to add a column containing the expected number falling into each category. The expected number falling into category 1, for example, is 0.0824 × 100 = 8.24. Doing a similar calculation for each of the categories, we can add our column of expected numbers:

[Table: the nine categories with observed counts, estimated probabilities, and expected counts]

Now, we can use the observed numbers and the expected numbers to calculate our chi-square test statistic. Doing so, we get:

\[Q_{8}=\frac{(5-8.24)^2}{8.24}+\frac{(13-10.82)^2}{10.82}+ ... +\frac{(4-5.39)^2}{5.39}=5.7157  \]

Because we estimated d = 1 parameter, we need to compare our chi-square statistic to a chi-square distribution with (9−1)−1 = 7 degrees of freedom. That is, our critical region is defined as:

\[\text{Reject } H_0 \text{ if } Q_8 \ge \chi_{(9-1)-1, 0.05}^{2}=\chi_{7, 0.05}^{2}=14.07  \]

Because our test statistic doesn't fall in the rejection region, that is:

\[Q_8=5.7157 < \chi_{7, 0.05}^{2}=14.07\]

we fail to reject the null hypothesis. There is insufficient evidence at the 0.05 level to conclude that the data don't fit a Poisson probability model.
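The category probabilities and expected counts above are straightforward to reproduce in Python (NumPy and SciPy assumed); this sketch mirrors the collapsed categories used in the solution:

```python
import numpy as np
from scipy import stats

lam = 5.6   # the maximum likelihood estimate (the rounded sample mean)

# Nine collapsed categories: {0,1,2}, 3, 4, 5, 6, 7, 8, 9, {10,11,12}
p_low = stats.poisson.pmf([0, 1, 2], lam).sum()      # about 0.0824
p_mid = stats.poisson.pmf(np.arange(3, 10), lam)     # P(X = 3), ..., P(X = 9)
p_high = stats.poisson.pmf([10, 11, 12], lam).sum()  # about 0.0540
probs = np.concatenate(([p_low], p_mid, [p_high]))

expected = 100 * probs
print(expected.round(2))   # 8.24 and 10.82 for the first two, about 5.40 for the last
print(round(stats.chi2.ppf(0.95, df=7), 2))   # 14.07, the critical value
```

Any small difference from the table above (5.40 versus 5.39, say) traces to rounding the probabilities before multiplying by 100.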

Continuous Random Variables

What if we are interested in using a chi-square goodness-of-fit test to see if our data follow some continuous distribution? That is, what if we want to test:

\[ H_0 : F(w) =F_0(w)\]

where F0(w) is some known, specified distribution function. Clearly, in this situation, it is no longer obvious what constitutes each of the categories. Perhaps we could all agree that the logical thing to do would be to divide up the interval of possible values of W into k "buckets" or "categories," called A1, A2, ..., Ak, say, into which the observed data can fall. Letting Yi denote the number of times the observed value of W belongs to bucket Ai, i = 1, 2, ..., k, the random variables Y1, Y2, ..., Yk follow a multinomial distribution with parameters n, p1, p2, ..., pk−1. The hypothesis that we actually test is a modification of the null hypothesis above, namely:

\[H_{0}^{'} : p_i = p_{i0}, i=1, 2, ... , k \]

The hypothesis is rejected if the observed value of the chi-square statistic:

\[Q_{k-1} =\sum_{i=1}^{k}\frac{(Obs_i - Exp_i)^2}{Exp_i}\]

is at least as great as \(\chi_{k-1, \alpha}^{2}\). If the hypothesis \(H_{0}^{'} : p_i = p_{i0}, i=1, 2, \ldots, k\) is not rejected, then we do not reject the original hypothesis \(H_0 : F(w) =F_0(w)\).

Let's make this proposed procedure more concrete by taking a look at an example.

Example

The IQs of one hundred randomly selected people were determined using the Stanford-Binet Intelligence Quotient Test. The resulting data were, in sorted order, as follows:

[Table: the 100 sorted IQ scores]

Test the null hypothesis that the data come from a normal distribution with mean 100 and standard deviation 16.

Solution.  Hmmmm. So, where do we start? Well, we first have to define some categories. Let's divide up the interval of possible IQs into k = 10 sets of equal probability 1/k = 1/10. Perhaps this is best seen pictorially:

[Figure: a normal density divided into 10 intervals of equal probability 0.10]

So, what's going on in this picture? Well, first the normal density is divided up into 10 intervals of equal probability (0.10). Well, okay, so the picture is not drawn very well to scale. At any rate, we then find the IQs that correspond to the cumulative probabilities 0.1, 0.2, ..., 0.9. This is done in two steps: (1) first by finding the Z-scores associated with the cumulative probabilities 0.1, 0.2, ..., 0.9, and (2) then by converting each Z-score into an X-value. It is those X-values (IQs) that will make up the "right-hand side" of each bucket:

Cumulative probability     0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9
Z-score                −1.2816  −0.8416  −0.5244  −0.2533   0.0000   0.2533   0.5244   0.8416   1.2816
IQ (right-hand side)      79.5     86.5     91.6     95.9    100.0    104.1    108.4    113.5    120.5

Now, it's just a matter of counting the number of observations that fall into each bucket to get the observed (Obs'd) column, and calculating the expected number (0.10 × 100 = 10) to get the expected (Exp'd) column: 

[Table: observed and expected counts for the 10 buckets, and each bucket's contribution to the chi-square statistic]


As illustrated in the table, using the observed and expected numbers, we see that the chi-square statistic is 8.2. We reject if the following is true: 

\[Q_9 =8.2 \ge \chi_{10-1, 0.05}^{2} =\chi_{9, 0.05}^{2}=16.92\]

It isn't! We do not reject the null hypothesis at the 0.05 level. There is insufficient evidence to conclude that the data do not follow a normal distribution with mean 100 and standard deviation 16.
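The bucket boundaries and counts are easy to script; here is a Python sketch (NumPy and SciPy assumed; iqs stands in for the sorted data above, which is why the counting lines are left as comments):

```python
import numpy as np
from scipy import stats

# Right-hand endpoints of the k = 10 equal-probability buckets
cum_probs = np.arange(0.1, 1.0, 0.1)   # 0.1, 0.2, ..., 0.9
z_scores = stats.norm.ppf(cum_probs)
iq_edges = 100 + 16 * z_scores         # convert each Z-score to the IQ scale
print(iq_edges.round(1))   # 79.5  86.5  91.6  95.9  100.0  104.1  108.4  113.5  120.5

# With the data in hand, the observed and expected counts would follow as:
# observed, _ = np.histogram(iqs, bins=[-np.inf, *iq_edges, np.inf])
# expected = np.full(10, 0.10 * len(iqs))   # 10 per bucket when n = 100
```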

Using Minitab to Lighten the Workload

Example

This is how I used Minitab to help with the calculations of the alpha particle example on the Unspecified Probabilities page in this lesson.

1. Use Minitab's Calc >> Probability Distributions >> Poisson command to determine the Poisson(5.6) probabilities:

[Minitab output: the Poisson(5.6) probabilities]

2. Enter the observed counts into one column and copy the probabilities (collapsing some categories, if necessary) into another column. Use Minitab's Calc >> Calculator command to generate the remaining necessary columns:

[Minitab worksheet: observed counts, probabilities, expected counts, and chi-square contributions]

3. Sum up the "Chisq" column to obtain the chi-square statistic Q.

[Minitab output: the chi-square statistic]

Example

This is how I used Minitab to help with the calculations of the IQ example on the Continuous Random Variables page in this lesson.

1. The sorted data:

[Minitab worksheet: the sorted IQ data]

2. The working table:

[Minitab worksheet: the working table]

3. The chi-square statistic:

[Minitab output: the chi-square statistic]