Suppose the Penn State student population is 60% female and 40% male. Then, if a sample of 100 students yields 53 females and 47 males, can we conclude that the sample is (random and) representative of the population? That is, how "good" do the data "fit" the probability model? As the title of the lesson suggests, that's the kind of question that we will answer in this lesson.

As is often the case, we'll motivate the methods of this lesson by way of example. Specifically, we'll return to the question posed in the introduction to this lesson.

Suppose the Penn State student population is 60% female and 40% male. Then, if a sample of 100 students yields 53 females and 47 males, can we conclude that the sample is (random and) representative of the population? That is, how "good" do the data "fit" the assumed probability model of 60% female and 40% male?

**Solution.** Testing whether there is a "good fit" between the observed data and the assumed probability model amounts to testing:

\(H_0 : p_F =0.60\)

\(H_A : p_F \ne 0.60\)

Now, letting *Y*_{1} denote the number of **females** selected, we know that *Y*_{1} follows a binomial distribution with *n* trials and probability of success *p*_{1} (the proportion written as *p*_{F} in the hypotheses above). That is:

\(Y_1 \sim b(n,p_1)\)

Therefore, the expected value and variance of *Y*_{1} are, respectively:

\(E(Y_1)=np_1\) and \(Var (Y_1) =np_1(1-p_1)\)

And, letting *Y*_{2} denote the number of **males** selected, we know that *Y*_{2} = *n* − *Y*_{1 }follows a binomial distribution with *n* trials and probability of success *p*_{2}. That is:

\(Y_2 = n-Y_1 \sim b(n,p_2)=b(n,1-p_1)\)

Therefore, the expected value and variance of *Y*_{2} are, respectively:

\(E(Y_2)=n(1-p_1)=np_2\) and \(Var(Y_2)=n(1-p_1)(1-(1-p_1))=np_1(1-p_1)=np_2(1-p_2)\)

Now, for large samples (*np*_{1} ≥ 5 and *n*(1*−p*_{1}) ≥ 5), the Central Limit Theorem yields the normal approximation to the binomial distribution. That is:

\[Z=\frac{(Y_1-np_1)}{\sqrt{np_1(1-p_1)}}\]

follows, at least approximately, the standard normal *N*(0,1) distribution. Therefore, upon squaring *Z*, we get that:

\[Z^2=Q_1=\frac{(Y_1-np_1)^2}{np_1(1-p_1)}\]

follows an approximate chi-square distribution with one degree of freedom. Now, we could stop there. But, that's not typically what is done. Instead, we can rewrite *Q*_{1} a bit. Let's start by multiplying *Q*_{1} by 1 in a special way, that is, by multiplying it by ((1−*p*_{1})+*p*_{1}):

\[Q_1=\frac{(Y_1-np_1)^2}{np_1(1-p_1)} \times ((1-p_1)+p_1)\]

Then, distributing the "1" across the numerator, we get:

\[Q_1=\frac{(Y_1-np_1)^2(1-p_1)}{np_1(1-p_1)}+\frac{(Y_1-np_1)^2p_1}{np_1(1-p_1)}\]

which simplifies to:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(Y_1-np_1)^2}{n(1-p_1)}\]

Now, taking advantage of the fact that *Y*_{1} = *n* − *Y*_{2 }and *p*_{1} = 1 − *p*_{2}, we get:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(n-Y_2-n(1-p_2))^2}{np_2}\]

which simplifies to:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(-(Y_2-np_2))^2}{np_2}\]

Just one more thing to simplify before we're done:

\[Q_1=\frac{(Y_1-np_1)^2}{np_1}+\frac{(Y_2-np_2)^2}{np_2}\]

In summary, we have rewritten *Q*_{1} as:

\[ Q_1=\sum_{i=1}^{2}\frac{(Y_i-np_i)^2}{np_i}=\sum_{i=1}^{2}\frac{(\text{OBSERVED }-\text{ EXPECTED})^2}{\text{EXPECTED}} \]

We'll use this form of *Q*_{1}, and the fact that *Q*_{1} follows an approximate chi-square distribution with one degree of freedom, to conduct the desired hypothesis test.
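If you'd like to convince yourself numerically that the squared *Z* form and the "observed minus expected" form of *Q*_{1} agree, here is a minimal Python sketch. The function names `z_squared` and `q_sum` and the trial values are our own, chosen only for illustration.

```python
import math

def z_squared(y1, n, p1):
    """Square of the normal-approximation Z statistic for Y1 ~ b(n, p1)."""
    z = (y1 - n * p1) / math.sqrt(n * p1 * (1 - p1))
    return z ** 2

def q_sum(y1, n, p1):
    """Sum of (observed - expected)^2 / expected over the two categories."""
    y2, p2 = n - y1, 1 - p1
    return (y1 - n * p1) ** 2 / (n * p1) + (y2 - n * p2) ** 2 / (n * p2)

# Arbitrary illustrative (y1, n, p1) combinations; the two forms should
# match up to floating-point rounding.
for y1, n, p1 in [(53, 100, 0.60), (12, 40, 0.25), (70, 200, 0.40)]:
    assert math.isclose(z_squared(y1, n, p1), q_sum(y1, n, p1))
    print(y1, n, p1, round(q_sum(y1, n, p1), 4))
```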

Before we return to solving the problem posed by our example, a few points are worthy of emphasis.

(1) First, *Q*_{1} has only one degree of freedom, since there is only one independent count, namely *Y*_{1}. Once *Y*_{1} is known, the value of *Y*_{2} = *n* − *Y*_{1 }immediately follows.

(2) Note that the derived approach requires the Central Limit Theorem to kick in. The general rule of thumb is that the expected number of successes must be at least 5 (that is, *np*_{1} ≥ 5) and the expected number of failures must be at least 5 (that is, *n*(1*−p*_{1}) ≥ 5).

(3) The statistic *Q*_{1 }will be large if the observed counts are very different from the expected counts. Therefore, we must reject the null hypothesis *H*_{0} if *Q*_{1 }is *large*. How large is *large*? *Large* is determined by the values of a chi-square random variable with one degree of freedom, which can be obtained either from a statistical software package, such as Minitab or SAS, or from a standard chi-square table, such as the one in the back of our textbook.

(4) The statistic *Q*_{1 }is called the **chi-square goodness-of-fit statistic**.

Suppose the Penn State student population is 60% female and 40% male. Then, if a sample of 100 students yields 53 females and 47 males, can we conclude that the sample is (random and) representative of the population? Use the chi-square goodness-of-fit statistic to test the hypotheses:

\(H_0 : p_F =0.60\)

\(H_A : p_F \ne 0.60\)

using a significance level of *α* = 0.05.

**Solution.** The value of the test statistic *Q*_{1 }is:

\[Q_1=\frac{(53-60)^2}{60}+\frac{(47-40)^2}{40}=2.04\]

We should reject the null hypothesis if the observed number of counts is very different from the expected number of counts, that is, if *Q*_{1 }is large. Because *Q*_{1 }follows a chi-square distribution with one degree of freedom, we should reject the null hypothesis, at the 0.05 level, if:

\[Q_1 \ge \chi_{1, 0.05}^{2}=3.84\]

Because:

\[Q_1=2.04 < 3.84\]

we do not reject the null hypothesis. There is not enough evidence at the 0.05 level to conclude that the data don't fit the assumed probability model.
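The calculation above can be sketched in a few lines of Python. This is only an illustration: the helper name `chi_square_gof` is our own, and the critical value 3.84 is simply copied from the text rather than computed.

```python
def chi_square_gof(observed, probs):
    """Return the chi-square goodness-of-fit statistic for observed counts
    against hypothesized category probabilities."""
    n = sum(observed)
    expected = [n * p for p in probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [53, 47]        # females, males in the sample
probs = [0.60, 0.40]       # hypothesized population proportions
CRITICAL_1DF_05 = 3.84     # chi-square(1) critical value quoted in the text

q1 = chi_square_gof(observed, probs)
reject = q1 >= CRITICAL_1DF_05
print(round(q1, 2), reject)   # 2.04 False
```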

As an aside, it is interesting to note the relationship between using the chi-square goodness of fit statistic *Q*_{1} and the *Z*-statistic we've previously used for testing a null hypothesis about a proportion. In this case:

\[ Z=\frac{0.53-0.60}{\sqrt{\frac{(0.60)(0.40)}{100}}}=-1.428 \]

which, if we square it, gives the same value as we got for *Q*_{1}:

\[Q_1=Z^2 =(-1.428)^2=2.04\]

as we should expect. The *Z*-test for a proportion tells us that we should reject if:

\[|Z| \ge 1.96 \]

Again, squaring shows that this is equivalent to rejecting if:

\[Q_1 \ge (1.96)^2 =3.84\]

And not surprisingly, the *P*-values obtained from the two approaches are identical. The *P*-value for the chi-square goodness-of-fit test is:

\[P=P(\chi_{(1)}^{2} > 2.04)=0.1532\]

while the *P*-value for the *Z*-test is:

\[P=2 \times P(Z>1.428)=2(0.0766)=0.1532\]

Identical, as we should expect!
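Here's a small Python sketch of that equivalence, using only the standard library. It relies on two standard facts: the two-sided normal tail probability equals erfc(|z|/√2), and for one degree of freedom the chi-square tail probability equals erfc(√(q/2)). The variable names are our own.

```python
import math

n, y1, p0 = 100, 53, 0.60

# Z statistic for one proportion and the equivalent chi-square statistic.
z = (y1 / n - p0) / math.sqrt(p0 * (1 - p0) / n)
q1 = z ** 2

# Two-sided normal P-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
p_z = math.erfc(abs(z) / math.sqrt(2))

# For one degree of freedom, P(chi-square > q) = erfc(sqrt(q / 2)).
p_chi = math.erfc(math.sqrt(q1 / 2))

print(round(z, 3), round(q1, 2), round(p_z, 3), round(p_chi, 3))
```

Carrying full precision, both *P*-values come out at about 0.153. The small difference from the 0.1532 shown above most likely comes from rounding *Z* before looking up the tail probability.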

The work on the previous page is all well and good if your probability model involves just two categories, which as we have seen, reduces to conducting a test for one proportion. What happens if our probability model involves three or more categories? It takes some theoretical work beyond the scope of this course to show it, but the chi-square statistic that we derived on the previous page can be extended to accommodate any number *k* of categories. We can thank the famous statistician Karl Pearson for doing the dirty work for us (in 1900).

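Suppose an experiment can result in any of *k* mutually exclusive and exhaustive outcomes, say *A*_{1}, *A*_{2}, ..., *A*_{k}. If the experiment is repeated *n* independent times, let *p*_{i} = *P*(*A*_{i}) and let *Y*_{i} denote the number of times the experiment results in *A*_{i}, for *i* = 1, 2, ..., *k*. Then the observed count for category *i* is *Y*_{i} and the expected count is *np*_{i}.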

Karl Pearson showed that the chi-square statistic *Q*_{k−1} defined as:

\[Q_{k-1}=\sum_{i=1}^{k}\frac{(Y_i - np_i)^2}{np_i} \]

follows approximately a chi-square random variable with *k*−1 degrees of freedom. Let's try it out on an example.

A particular brand of candy-coated chocolate comes in five different colors that we shall denote as:

*A*_{1} = {brown}, *A*_{2} = {yellow}, *A*_{3} = {orange}, *A*_{4} = {green}, *A*_{5} = {coffee}

Let *p*_{i} equal the probability that the color of a piece of candy selected at random belongs to *A*_{i}, for *i* = 1, 2, 3, 4, 5. Test the following null and alternative hypotheses:

\(H_0 : p_{Br}=0.4,p_{Y}=0.2,p_{O}=0.2,p_{G}=0.1,p_{C}=0.1 \)

\(H_A : p_{i} \text{ not specified in null (many possible alternatives) } \)

using a random sample of *n* = 580 pieces of candy whose colors yielded the respective frequencies 224, 119, 130, 48 and 59. (This example comes from exercise 8.1-2 in the Hogg and Tanis (8th edition) textbook.)

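**Solution.** We can summarize the observed (*y*_{i}) and expected (*np*_{i}) counts in a table as follows:

| Category | Color | Observed *y*_{i} | Assumed *p*_{i} | Expected *np*_{i} |
|---|---|---|---|---|
| 1 | brown | 224 | 0.4 | 232 |
| 2 | yellow | 119 | 0.2 | 116 |
| 3 | orange | 130 | 0.2 | 116 |
| 4 | green | 48 | 0.1 | 58 |
| 5 | coffee | 59 | 0.1 | 58 |
| Total | | 580 | 1.0 | 580 |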

where, for example, the expected number of **brown** candies is:

*np*_{1} = 580(0.40) = 232

and the expected number of **green** candies is:

*np*_{4} = 580(0.10) = 58

Once we have the observed and expected number of counts, the calculation of the chi-square statistic is straightforward. It is:

\[Q_4=\frac{(224-232)^2}{232}+\frac{(119-116)^2}{116}+\frac{(130-116)^2}{116}+\frac{(48-58)^2}{58}+\frac{(59-58)^2}{58} \]

Simplifying, we get:

\[Q_4=\frac{64}{232}+\frac{9}{116}+\frac{196}{116}+\frac{100}{58}+\frac{1}{58}=3.784 \]

Because there are *k* = 5 categories, we have to compare our chi-square statistic *Q*_{4} to a chi-square distribution with *k*−1 = 5−1 = 4 degrees of freedom:

\[\text{Reject }H_0 \text{ if } Q_4\ge \chi_{4,0.05}^{2}=9.488\]

Because *Q*_{4} = 3.784 < 9.488, we fail to reject the null hypothesis. There is insufficient evidence at the 0.05 level to conclude that the distribution of the color of the candies differs from that specified in the null hypothesis.
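As a quick check on the arithmetic, here is a minimal Python sketch of the candy calculation. The dictionary layout and variable names are our own, and the critical value 9.488 is copied from the text rather than computed.

```python
observed = {"brown": 224, "yellow": 119, "orange": 130, "green": 48, "coffee": 59}
null_probs = {"brown": 0.4, "yellow": 0.2, "orange": 0.2, "green": 0.1, "coffee": 0.1}
CRITICAL_4DF_05 = 9.488   # chi-square(4) critical value quoted in the text

n = sum(observed.values())                              # 580 candies
expected = {c: n * p for c, p in null_probs.items()}    # 232, 116, 116, 58, 58

# Pearson's statistic: sum over categories of (observed - expected)^2 / expected.
q4 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1    # k - 1 degrees of freedom

print(n, df, round(q4, 3), q4 >= CRITICAL_4DF_05)   # 580 4 3.784 False
```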

By the way, this might be a good time to think about the practical meaning of the term "degrees of freedom." Recalling the example on the last page, we had two categories (male and female) and one degree of freedom. If we are sampling *n* = 100 people and 53 of them are female, then we absolutely must have 100−53 = 47 males. If we had instead 62 females, then we absolutely must have 100−62 = 38 males. That is, the number of females is the one number that is "free" to be any number, but once it is determined, then the number of males immediately follows. It is in this sense that we have "one degree of freedom."

With the example on this page, we have five categories of candies (brown, yellow, orange, green, coffee) and four degrees of freedom. If we are sampling *n* = 580 candies, and 224 are brown, 119 are yellow, 130 are orange, and 48 are green, then we absolutely must have 580−(224+119+130+48) = 59 coffee-colored candies. In this case, we have four numbers that are "free" to be any number, but once they are determined, then the number of coffee-colored candies immediately follows. It is in this sense that we have "four degrees of freedom."

For the two examples that we've thus far considered, the probabilities were pre-specified. For the first example, we were interested in seeing if the data fit a probability model in which there was a 0.60 probability that a randomly selected Penn State student was female. In the second example, we were interested in seeing if the data fit a probability model in which the probabilities of selecting a brown, yellow, orange, green, and coffee-colored candy were 0.4, 0.2, 0.2, 0.1, and 0.1, respectively. That is, we were interested in testing specific probabilities:

\(H_0 : p_{Br}=0.40,p_{Y}=0.20,p_{O}=0.20,p_{G}=0.10,p_{C}=0.10 \)

Someone might also be interested in testing whether a data set follows a specific probability distribution, such as:

\(H_0 : X \sim b(n, 1/2)\)

What if the probabilities aren't pre-specified though? That is, suppose someone is interested in testing whether a random variable is binomial, but with an unspecified probability of success:

\(H_0 : X \sim b(n, p)\)

Can we still use the chi-square goodness-of-fit statistic? The short answer is yes... with just a minor modification.

Let *X* denote the number of heads when four dimes are tossed at random. One hundred repetitions of this experiment resulted in 0, 1, 2, 3, and 4 heads being observed on 8, 17, 41, 30, and 4 trials, respectively. Under the assumption that the four dimes are independent, and the probability of getting a head on each coin is *p*, the random variable *X* is *b*(4, *p*). In light of the observed data, is *b*(4, *p*) a reasonable model for the distribution of *X*?

**Solution.** In order to use the chi-square statistic to test the data, we need to be able to determine the observed and expected number of trials in which we'd get 0, 1, 2, 3, and 4 heads. The observed part is easy... we know those:

| Heads (*x*) | Observed trials |
|---|---|
| 0 | 8 |
| 1 | 17 |
| 2 | 41 |
| 3 | 30 |
| 4 | 4 |

It's the expected numbers that are a problem. If the probability *p* of getting a head were specified, we'd be able to calculate the expected numbers. Suppose, for example, that *p* = 1/2. Then, the probability of getting zero heads in four dimes is:

\[P(X=0)=\binom{4}{0}\left(\frac{1}{2}\right)^0\left(\frac{1}{2}\right)^4=0.0625 \]

and therefore the expected number of trials resulting in 0 heads is 100 × 0.0625 = 6.25. We could make similar calculations for the case of 1, 2, 3, and 4 heads, and we would be well on our way to using the chi-square statistic:

\[Q_4=\sum_{i=0}^{4}\frac{(Obs_i-Exp_i)^2}{Exp_i} \]

and comparing it to a chi-square distribution with 5−1 = 4 degrees of freedom. But, we don't know *p*, as it is unspecified! What do you think the logical thing would be to do in this case? Sure... we'd probably want to estimate *p*. But then that raises the question... what should we use as an estimate of *p*?
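Before we take up that question, here is a minimal Python sketch of the *p* = 1/2 calculation just described. The value *p* = 1/2 is assumed purely for illustration, and the helper name `binomial_pmf` is our own.

```python
from math import comb

def binomial_pmf(x, m, p):
    """P(X = x) for X ~ b(m, p)."""
    return comb(m, x) * p ** x * (1 - p) ** (m - x)

observed = [8, 17, 41, 30, 4]   # trials with 0, 1, 2, 3, 4 heads
trials = sum(observed)          # 100 repetitions
p = 0.5                         # assumed here only for illustration

# Expected number of trials for each possible number of heads.
expected = [trials * binomial_pmf(x, 4, p) for x in range(5)]
q4 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(expected)        # [6.25, 25.0, 37.5, 25.0, 6.25]
print(round(q4, 3))    # 5.187
```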

One way of estimating *p* would be to minimize the chi-square statistic *Q*_{4} with respect to *p*, yielding an estimator \(\tilde{p}\). This \(\tilde{p}\) estimator is called, perhaps not surprisingly, a **minimum chi-square estimator** of *p*. If \(\tilde{p}\) is used in calculating the expected numbers that appear in *Q*_{4}, it can be shown (not easily, and therefore we won't!) that *Q*_{4} still has an approximate chi-square distribution but with only 4−1 = 3 degrees of freedom. The number of degrees of freedom of the approximating chi-square distribution is reduced by one, because we have to estimate one parameter in order to calculate the chi-square statistic. In general, the number of degrees of freedom of the approximating chi-square distribution is reduced by *d*, the number of parameters estimated. If we estimate two parameters, we reduce the degrees of freedom by two. And so on.

This all seems simple enough. There's just one problem... it is usually very difficult to find minimum chi-square estimators. So what to do? Well, most statisticians just use some other reasonable method of estimating the unspecified parameters, such as maximum likelihood estimation. The good news is that the chi-square statistic testing method still works well. (It should be noted, however, that the approach does provide a slightly larger probability of rejecting the null hypothesis than would the approach based purely on the minimized chi-square.)
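To get a feel for how close the two estimators can be, here is a rough Python sketch for the four-dimes data. It finds an approximate minimum chi-square estimate by a crude grid search (our own shortcut, not a method from the text) and compares it with the maximum likelihood estimate, the total number of heads divided by the total number of tosses. The names `q_stat`, `p_tilde`, and `p_hat` are our own.

```python
from math import comb

observed = [8, 17, 41, 30, 4]   # trials with 0, 1, 2, 3, 4 heads
trials = sum(observed)

def q_stat(p):
    """Chi-square statistic when expected counts come from b(4, p)."""
    total = 0.0
    for x, obs in enumerate(observed):
        exp = trials * comb(4, x) * p ** x * (1 - p) ** (4 - x)
        total += (obs - exp) ** 2 / exp
    return total

# Crude grid search over p in (0, 1); a real analysis would likely use a
# proper numerical optimizer instead.
grid = [i / 10000 for i in range(1, 10000)]
p_tilde = min(grid, key=q_stat)

# Maximum likelihood estimate: total heads over total coin tosses.
p_hat = sum(x * obs for x, obs in enumerate(observed)) / (4 * trials)

print(p_tilde, round(q_stat(p_tilde), 3))
print(p_hat, round(q_stat(p_hat), 3))
```

By construction, the statistic evaluated at the grid minimizer can be no larger than the statistic evaluated at the maximum likelihood estimate, which is consistent with the remark above about the slightly larger rejection probability.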

Let's summarize.

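(1) Estimate the *d* unknown parameters using maximum likelihood (or some other reasonable method).

(2) Calculate the chi-square statistic using the resulting estimates:

\[Q_{k-1}=\sum_{i=1}^{k}\frac{(Obs_i-Exp_i)^2}{Exp_i}\]

(3) Compare the chi-square statistic to a chi-square distribution with (*k*−1)−*d* degrees of freedom.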

Let's return to the four dimes. Recall that one hundred repetitions of the experiment resulted in 0, 1, 2, 3, and 4 heads being observed on 8, 17, 41, 30, and 4 trials, respectively. In light of the observed data, is *b*(4, *p*) a reasonable model for the distribution of *X*, the number of heads?

**Solution.** Given that four dimes are tossed 100 times, we have 400 coin tosses resulting in 205 heads for an estimated probability of success of 0.5125:

\[\hat{p}=\frac{0(8)+1(17)+2(41)+3(30)+4(4)}{400}=\frac{205}{400}=0.5125 \]

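Using 0.5125 as the estimate of *p*, we can use the binomial p.m.f. (or Minitab!) to calculate the probability that *X* = 0, 1, ..., 4 and then, using the probabilities, the expected number of trials resulting in 0, 1, 2, 3, and 4 heads:

| Heads (*x*) | *P*(*X* = *x*) | Expected |
|---|---|---|
| 0 | 0.0565 | 5.65 |
| 1 | 0.2375 | 23.75 |
| 2 | 0.3745 | 37.45 |
| 3 | 0.2625 | 26.25 |
| 4 | 0.0690 | 6.90 |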

Calculating the chi-square statistic, we get:

\[Q_4=\frac{(8-5.65)^2}{5.65}+\frac{(17-23.75)^2}{23.75}+ ... + \frac{(4-6.90)^2}{6.90} =4.99\]

We estimated *d* = 1 parameter in calculating the chi-square statistic. Therefore, we compare the statistic to a chi-square distribution with (5−1)−1 = 3 degrees of freedom. Doing so:

\[Q_4= 4.99 < \chi_{3,0.05}^{2}=7.815\]

we fail to reject the null hypothesis. There is insufficient evidence at the 0.05 level to conclude that the data don't fit a binomial probability model.
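The three-step procedure, applied to the dimes, can be sketched in Python as follows. This is just an illustration of the steps above: the variable names are our own, and the critical value 7.815 is copied from the text rather than computed.

```python
from math import comb

observed = [8, 17, 41, 30, 4]   # trials with 0, 1, 2, 3, 4 heads
trials = sum(observed)          # 100 repetitions
coins = 4

# Step 1: estimate p by maximum likelihood (heads / total tosses).
p_hat = sum(x * obs for x, obs in enumerate(observed)) / (coins * trials)

# Step 2: expected counts under b(4, p_hat), then the chi-square statistic.
probs = [comb(coins, x) * p_hat ** x * (1 - p_hat) ** (coins - x)
         for x in range(coins + 1)]
expected = [trials * pr for pr in probs]
q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 3: degrees of freedom are (k - 1) - d with k = 5 categories, d = 1.
df = (len(observed) - 1) - 1
CRITICAL_3DF_05 = 7.815   # chi-square(3) critical value quoted in the text

print(p_hat, [round(e, 2) for e in expected], round(q, 2), df)
print(q >= CRITICAL_3DF_05)   # False
```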

Let's take a look at another example.

Let *X* equal the number of alpha particles emitted from barium-133 in 0.1 second and counted by a Geiger counter. One hundred observations of *X* produced these data.

It is claimed that *X* follows a Poisson distribution. Use a chi-square goodness-of-fit statistic to test whether this is true.

**Solution.** Note that very few observations resulted in 0, 1, or 2 alpha particles being emitted in 0.1 second. And, very few observations resulted in 10, 11, or 12 alpha particles being emitted in 0.1 second. Therefore, let's "collapse" the data at the two ends, yielding us nine "not-so-sparse" categories:

Because *λ*, the mean of *X*, is not specified, we can estimate it with its maximum likelihood estimator, namely, the sample mean. Using the data, we get:

\[\bar{x}=\frac{1(1)+2(4)+3(13)+ ... + 12(1)}{100}=\frac{559}{100}=5.59 \approx 5.6\]

We can now estimate the probability that an observation will fall into each of the categories. The probability of falling into category 1, for example, is:

\[P(\{1\})=P(X=0)+P(X=1)+P(X=2) =\frac{e^{-5.6}5.6^0}{0!}+\frac{e^{-5.6}5.6^1}{1!}+\frac{e^{-5.6}5.6^2}{2!}=0.0824 \]

Here's what our table looks like now, after adding a column containing the estimated probabilities:

Now, we just have to add a column containing the expected number falling into each category. The expected number falling into category 1, for example, is 0.0824 × 100 = 8.24. Doing a similar calculation for each of the categories, we can add our column of expected numbers:
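Here is a minimal Python sketch of how the estimated probabilities and expected numbers can be computed for the first two categories. It does not reproduce the full table. We assume that category 2 is the single count 3, which is consistent with the observed count of 13 in the sample mean calculation. The helper name `poisson_pmf` is our own.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 5.6     # rounded sample mean used in the text
n_obs = 100   # number of observations

# Category 1 collapses the counts 0, 1, and 2.
p_cat1 = sum(poisson_pmf(x, lam) for x in (0, 1, 2))
# Category 2 is assumed to be the single count 3.
p_cat2 = poisson_pmf(3, lam)

print(round(p_cat1, 4), round(n_obs * p_cat1, 2))   # 0.0824 8.24
print(round(p_cat2, 4), round(n_obs * p_cat2, 2))   # 0.1082 10.82
```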

Now, we can use the observed numbers and the expected numbers to calculate our chi-square test statistic. Doing so, we get:

\[Q_{9-1}=\frac{(5-8.24)^2}{8.24}+\frac{(13-10.82)^2}{10.82}+ ... +\frac{(4-5.39)^2}{5.39}=5.7157 \]

Because we estimated *d* = 1 parameter, we need to compare our chi-square statistic to a chi-square distribution with (9−1)−1 = 7 degrees of freedom. That is, our critical region is defined as:

\[\text{Reject } H_0 \text{ if } Q_8 \ge \chi_{8-1, 0.05}^{2}=\chi_{7, 0.05}^{2}=14.07 \]

Because our test statistic doesn't fall in the rejection region, that is:

\[Q_8=5.7157 < \chi_{7, 0.05}^{2}=14.07\]

we fail to reject the null hypothesis. There is insufficient evidence at the 0.05 level to conclude that the data don't fit a Poisson probability model.

What if we are interested in using a chi-square goodness-of-fit test to see if our data follow some continuous distribution? That is, what if we want to test:

\[ H_0 : F(w) =F_0(w)\]

where *F*_{0}(*w*) is some known, specified distribution. Clearly, in this situation, it is no longer obvious as to what constitutes each of the categories. Perhaps we could all agree that the logical thing to do would be to divide up the interval of possible values into *k* "buckets" or "categories," called *A*_{1}, *A*_{2}, ..., *A*_{k}, say, into which the observed data can fall. Letting *p*_{i} denote the probability that an observation falls into *A*_{i}, and *p*_{i0} denote that probability when the distribution is *F*_{0}(*w*), the hypothesis that we actually test is:

\[H_{0}^{'} : p_i = p_{i0}, i=1, 2, ... , k \]

The hypothesis is rejected if the observed value of the chi-square statistic:

\[Q_{k-1} =\sum_{i=1}^{k}\frac{(Obs_i - Exp_i)^2}{Exp_i}\]

is at least as great as \(\chi_{\alpha}^{2}(k-1)\). If the hypothesis \(H_{0}^{'} : p_i = p_{i0}, i=1, 2, ... , k\) is not rejected, then we do not reject the original hypothesis \(H_0 : F(w) =F_0(w)\) .

Let's make this proposed procedure more concrete by taking a look at an example.

The IQs of one hundred randomly selected people were determined using the Stanford-Binet Intelligence Quotient Test. The resulting data were, in sorted order, as follows:

Test the null hypothesis that the data come from a normal distribution with mean 100 and standard deviation 16.

**Solution.** Hmmmm. So, where do we start? Well, we first have to define some categories. Let's divide up the interval of possible IQs into *k* = 10 sets of equal probability 1/*k* = 1/10. Perhaps this is best seen pictorially:

So, what's going on in this picture? Well, first the normal density is divided up into 10 intervals of equal probability (0.10). Well, okay, so the picture is not drawn very well to scale. At any rate, we then find the IQs that correspond to the *k* = 10 cumulative probabilities 0.1, 0.2, 0.3, etc. This is done in two steps: (1) first by finding the *Z*-scores associated with the cumulative probabilities 0.1, 0.2, 0.3, etc. and (2) then by converting each *Z*-score into an *X*-value. It is those *X*-values (IQs) that will make up the "right-hand side" of each bucket:
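The two-step conversion can be sketched in Python with the standard library's `statistics.NormalDist`, whose `inv_cdf` method goes straight from a cumulative probability to an *X*-value. The helper `bucket_counts` and the three demo values are hypothetical; they are not the IQ data from the example.

```python
from bisect import bisect_left
from statistics import NormalDist

k = 10
iq_model = NormalDist(mu=100, sigma=16)   # hypothesized N(100, 16^2)

# Right-hand endpoints of the first k - 1 buckets; the last bucket is
# unbounded above. inv_cdf combines the Z-score and conversion steps.
bounds = [iq_model.inv_cdf(i / k) for i in range(1, k)]

def bucket_counts(data, bounds):
    """Count how many values fall in each bucket (right endpoint inclusive)."""
    counts = [0] * (len(bounds) + 1)
    for x in data:
        counts[bisect_left(bounds, x)] += 1
    return counts

print([round(b, 1) for b in bounds])

# Hypothetical IQ values, not the data set from the example.
demo = [70, 100.5, 130]
print(bucket_counts(demo, bounds))
```

With real data, each bucket's count would be compared against the expected number, here 0.10 × *n*.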

Now, it's just a matter of counting the number of observations that fall into each bucket to get the observed (**Obs'd**) column, and calculating the expected number (0.10 × 100 = 10) to get the expected (**Exp'd**) column:

As illustrated in the table, using the observed and expected numbers, we see that the chi-square statistic is 8.2. We reject if the following is true:

\[Q_9 =8.2 \ge \chi_{10-1, 0.05}^{2} =\chi_{9, 0.05}^{2}=16.92\]

It isn't! We do not reject the null hypothesis at the 0.05 level. There is insufficient evidence to conclude that the data do not follow a normal distribution with mean 100 and standard deviation 16.

This is how I used Minitab to help with the calculations of the alpha particle example on the Unspecified Probabilities page in this lesson.

1. Use Minitab's **Calc >> Probability distribution >> Poisson** command to determine the Poisson(5.6) probabilities:

2. Enter the observed counts into one column and copy the probabilities (collapsing some categories, if necessary) into another column. Use Minitab's **Calc >> Calculator** command to generate the remaining necessary columns:

3. Sum up the "Chisq" column to obtain the chi-square statistic *Q*.

This is how I used Minitab to help with the calculations of the IQ example on the Continuous Random Variables page in this lesson.

1. The sorted data:

2. The working table:

3. The chi-square statistic: