Key concepts:
Objectives:

Links to background information on the topics in this lesson and some programming help and examples.
Lesson 1 was a brief introduction to the categorical data and discrete distributions. We looked at different types of categorical data as well as counts and proportions. In this lesson we will see the simplest way to represent counts and proportions and then do the simplest type of analysis. Let's begin with the definition of a oneway frequency table.
A frequency table arises when sample units are classified into mutually exclusive categories; the number of units falling into each category is recorded. "One way" means that units are classified according to a single categorical variable, and that is the table's dimension. The cells of the table contain frequency counts and can be arranged in a single row or column. The size of the table refers to the number of cells.
Let us look at some examples and see how they relate to the simplest concept of contingency table.
In 2006, the Penn State's Student Affairs Office of Research and Assessment (SAORA) conducted a survey on alcohol use of undergraduate students at the Penn State, University Park campus ( Pulse Survey: Student Drinking, March 2006). A sample of 1315 undergraduate students were classified as to whether they are highrisk drinkers or not.
HighRisk

Count

Yes

630

No

685

Total

1315

Since the sample size, n = 1315, is fixed a priori and there are only two outcomes, e.g., you are a highrisk drinker or not, this is an example of a binomial sampling.
The questions of interest are
How many teams scored certain number of goals during the first round of 2002 FIFA World Cup soccer championship ( http://www.worldcupsoccer.info/ )? For example, 25 teams scored no goals during their first round matches. The table below records the frequency of the number of goals scored.
Number of goals

Count

0

23

1

37

2

20

3

11

4

2

5

1

6

0

7

0

8

1

Total

95

This is a oneway table of size 9. Note here that the total count (i.e., sample size n = 95 ) was not fixed a priori and this could be considered Poisson sampling since we fixed the time period during which we recorded the outcomes of interest. We may ask questions such as
Start a timer and roll a die for two minutes. Count the number of times each die face comes up on top during this two minutes. This is an example of a Poisson sampling. You may get results similar to those in the following table. Note that we could not fix the sample size n = 30 a priori!
Face

Count

1

3

2

7

3

5

4

10

5

2

6

3

Total

30

You may also get the table above, or a similar one by deciding instead to roll a die n = 30 times, and record out of these 30 times at each roll which face comes up. This is an example of Multinomial sampling. Now, we may ask the question, "Is this die fair?", and use the sampling knowledge to answer it.
A sample of n = 96 persons is obtained, and the eye color of each person is recorded.
Eye color

Count

Brown

46

Blue

22

Green

26

Other

2

Total

96

Notice that brown, blue, green, and other have no intrinsic ordering. The response variable, eye color, is therefore an unordered categorical or nominal variable.
Hypothetical attitudes of n = 116 people towards war. They were asked to state their opinion on a 5 point scale regarding the statement: "This is a necessary war".
Attitude

Count

Strongly disagree

35

Disagree

27

Agree

23

Strongly agree

31

Total

116

The response categories in this example are clearly ordered, but no objectively defined numerical scores can be attached to the categories. The response variable, attitude, is therefore said to be an ordered categorical or ordinal variable.
Hypothetical attitudes of n = 130 people towards the war, but with additional categories for "not sure" and "refused to answer."
Attitude

Count

Strongly disagree

35

Disagree

27

Agree

23

Strongly agree 
31

Not sure 
6

Refusal 
8

Total

130

The first four categories are clearly ordered, but the placement of "not sure" and "refusal" in the ordering is questionable. We would have to say that this response is partially ordered.
Number of children in n = 100 randomly selected families.
Number of children

Count

0

19

1

26

2

29

3

13

45

11

6+

2

Total

100

The original data, the raw number of children, has been coarsened into six categories (0, 1, 2, 3, 4–5, 6+). These categories are clearly ordered, but—unlike the previous example—the categories have objectively defined numeric values attached to them. We can say that this table represents coarsened numeric data or interval variable.
Total gross income of n = 100 households.
Income

Count

below \$10,000 
11

\$10,000–\$24,999 
23

\$25,000–\$39,999 
30

\$40,000–\$59,999 
24

\$60,000 and above 
12

Total 
100

The original data (raw incomes) were essentially continuous. Any type of data, continuous or discrete, can be grouped or coarsened into categories.
Grouping data will typically result in some loss of information. How much information is lost depends on (a) the number of categories and (b) the question being addressed. In this example, grouping has somewhat diminished our ability to estimate the mean or median household income. Our ability to estimate the proportion of households with incomes below \$10,000 has not been affected, but estimating the proportion of households with incomes above \$75,000 is now virtually impossible. (Why?)
Remember measurement hierarchy? If we have a statistical method that is developed for analysis of ordinal categorical data that method typically will not be appropriate for analysis of, for example, nominal categorical data. This is why we are making a distinction between ordinal and nominal variables and partial ordinal variables. Let's get a little bit more technical and see the notation relevant for oneway frequency tables before we go into a discussion of sampling schemes.
Suppose data are collected on a discrete variable, X, with k categories. A oneway frequency table with k cells will be denoted by the vector:
\(X=(X_1,X_2,\ldots,X_k)\)
where X_{j} is the count or frequency in cell j and x_{j} or n_{j} is the observed count in this cell j. In some textbooks the author will use x_{j} and other authors will use n_{j}; the latter is more consistent with your textbook (Agresti's) notation.
Note that X is a summary of the original raw dataset consisting of:
\(n=\sum\limits_{j=1}^k x_j=\sum\limits_{j=1}^k n_j\)
observations.
For instance, the Eye Colors example had four types of eye colors, thus k = 4.
X_{1} is the unknown number of people in the population with "brown" eyes. The observed number of people with brown eyes is x_{1} = n_{1} = 46. The total sample size n = n_{1} + n_{2} + n_{3} + n_{4} = 46 + 22 + 26 + 2 = 96.
If the observed units are randomly sampled from a large population, then x = (n_{1}, n_{2 }, n_{3 }, ... , n_{k}) will have a joint discrete probability distribution such as Poisson, Binomial, or Multinomial; a technical review of these common discrete distributions is given in Lesson 1.
Let π_{j} be the population probability (or proportion) that a randomly selected individual, (i.e., an observational unit), falls into the jth cell of the table. Then the joint probability distribution will be denoted as
\(\pi=(\pi_1,\pi_2,\pi_3,\ldots,\pi_k)=\{\pi_j\}\)
For this to be a proper joint probability distribution the probabilities must sum to 1:
\(\sum\limits_{j=1}^k \pi_j=1\)
In the social sciences we typically think of π_{j} as a proportion of individuals in the population with a particular characteristic. For example, π_{1} = proportion of people with brown eyes in the population of interest. In statistical inference, our goal is to estimate unknown parameters π_{j}, and/or their joint distribution.
For observed data we typically use p instead of π, to denote observed joint distribution { p_{j }} where p_{j} = n_{j}/n is the observed proportion of people with a certain characteristic, e.g. the proportion of people with brown eyes in our example is p_{1} = 46 / 96 = 0.479. p_{1} is an example of a sample statistic, and a point estimate of unknown population parameter, π_{1}.
For the Eye Color example:



What are some ways of generating these oneway tables of counts?
Why do you think we care about the random mechanism that generated the data? 
Any data analysis requires some assumptions about the data generation process. For continuous data and linear regression, for example, we assume that the response variable have been randomly generated from a normal distribution. For categorical data we will often assume that data have been generated from a Poisson, binomial or multinomial distribution. Statistical analysis depends on the data generation mechanism, although in some situations, depending on the objective, we may ignore that mechanism and simplify our analysis.
The following sampling methods correspond to the distributions considered
Poisson sampling assumes that the random mechanism to generate the data can be described by a Poisson distribution. It is useful for modeling counts or events that occur randomly over a fixed period of time or in a fixed space. Often it is useful when the probability of any particular incidence happening is very small while the number of incidences is very large. (This is very much like a binomial distribution where success probability π of a trial is very very small but the number of trials n is very very large. This is known as the limiting condition).
For example, consider the World Cup soccer data example where we collect data on the frequency on the number of goals scored by teams during the first round matches of the 2002 World Cup. Another example is rolling of a dice during a fixed twominute time period. Similarly count the number of emails you received between 4pm5pm on a Friday, or number of students accessing STAT 504 course website on a Saturday, etc.
Let X be the number of goals scored in the matches of the first round of the World Cup. X ∼ Poisson (λ)
\(P(X=x)=\dfrac{\lambda^x e^{\lambda}}{x!}\qquad x=0,1,2,\ldots\)
Where λ is the parameter describing the rate, that is the mean of the distribution, e.g., the average number of goals scored during the first round matches. Once you know λ, you know everything there is to know about this distribution. x! stands for x factorial, i.e., x!=1*2*3*...*x. P(X=x) or P(x) is a probability that a randomly chosen team scored x number of goals in a game, e.g.:
\(P(X=0)=\dfrac{\lambda^0 e^{\lambda}}{0!}=\dfrac{1\cdot e^{\lambda} }{1}=e^{\lambda}\)
How the average rate λ = 1.38 is obtained, is given below. Then \(P(X=0)=e^{1.38}=\dfrac{1}{e^{1.38}}=0.252\) is the probability that a randomly chosen team will score 0 goals in the first round match of the World Cup. For the remaining probabilities see the table at the end of this page.
In this sampling scheme, the overall n is not fixed. We also assume that X_{1}, X_{2}, ... X_{k} are independent Poisson variables, either each with the same rate λ or each with different rate λ_{1}, λ_{2}, ...., λ_{k} such that λ_{1} + λ_{2} + ... + λ_{k} = λ.
Recall that mean and variance of Poisson distribution are the same; e.g., E(X) = Var(X) = λ. However in practice, the observed variance is usually larger than the theoretical variance and in the case of Poisson, larger than its mean. This is known as overdispersion, an important concept that occurs with discrete data. We assumed that each team has the same probability of in each match of the first round of scoring goals, but it's more realistic to assume that these probabilities will vary by the teams skills, the day the matches were played because of the weather, maybe even if the order of the matches, etc. Then we may observe more variations in the scoring than the Poisson model predicts. Analyses assuming binomial, Poisson or multinomial distributions are sometimes invalid because of overdispersion. We will see more on this later when we study logistic regression and Poisson regression models.
Let us see how we can do some basic calculations with the World Cup Soccer example under the Poisson model.
QUESTION: What is the most likely mean number of goals scored; that is, what is the most likely value of the unknown parameter λ given the data x?
We can answer this question by relying on the basic principle of statistical inference, e.g., point estimation, confidence intervals and/or hypothesis testing.
Recall from Lesson 1 on Likelihood and MLE: The most common point estimate is the "maximum likelihood estimate" (MLE) which has nice statistical properties, and it is the most likely value of the parameter given the data; it is the value that maximizes the likelihood function.
The MLE of λ from the Poisson distribution is the sample mean or the expectation of the distribution, and from the computation below for our example this is approximately:
\begin{align}
\bar{x} &= \dfrac{1}{95}\sum\limits_{i=1} x_i\\
&= \dfrac{1}{95} (0\times 23+1\times 37+2\times 20+3\times 11+4\times 2+5\times 1+6\times 0+ 7\times 0+ 8 \times 1)\\
&= \dfrac{131}{95}\\
&= 1.38\\
\end{align}
Thus, $\hat{\lambda}=1.38$ goals per first round matches.
An asymptotic or approximate 95% confidence interval is then:
\([1.381.96\sqrt{1.38/95},1.38+1.96\sqrt{1.38/95}]=[1.14,1.62]\)
and we are 95% confident that the mean number of goals scored by a team during the first round matchups will be somewhere between 1.14 and 1.62. Now that we have some estimate of the mean number of goals we can calculate the expected probabilities of a randomly chosen team scoring 0, 1, 2, 3, etc... number of goals, as well as the expected frequencies (or counts). For example, under this Poisson model with $\hat{\lambda}=1.38$, the expected probability of scoring 2 goals is $\hat{\pi}_2=p_2=P(X=2)=\frac{{1.38}^{2}e^{1.38}}{2!}=0.239$ and the expected frequency is $np_2=95*0.239=22.75$ (see the 3rd row of the table below).
Here is a link to the World Cup Soccer data (text file).
You can easily do these calculation by hand or in Excel or in any other software package you are using. Here they are in SAS and R.
Here is the SAS program soccer.sas.
For the complete output, see the course SAS page.
Please Note: Most PROC FREQ SAS options do NOT work for oneway tables, thus some coding is needed.
Here is a link to this code in R. soccer.R.
You can click on the 'Inspect' button below to see how the Poisson probabilities are calculated using R.
Please Note: There are some discrepancies between the R code file and Inspect! The file itself contains line comments explaining the code.
Here is a summary of these probabilities:
Number of goals

Observed Counts

Expected probabilities under assumed Poisson model

Expected
Counts

0

23

0.252

23.93

1

37

0.347

32.99

2

20

0.239

22.75

3

11

0.110

10.46

4

2

0.038

3.61

5

1

0.010

0.99

6

0

0.002

0.23

7

0

0.0005

0.05

8

1

0.00008

0.01

Total

95

In the graphical form:
For additional general calculations with Poisson distribution, see the following methods in SAS and R:
Here is the SAS program for calculating Poisson probabilities: PoissonCal.sas.
Here is the R code for calculating Poisson probabilities: PoissonCal.R.
Similarly binomial and multinomial sampling data also can be analyzed.
When data are collected on a predetermined number of units and are then classified according to two levels of a categorical variable, a binomial sampling emerges. Consider the Highrisk Drinking Example where we have highrisk drinkers versus nonhighrisk drinkers. In this study there was a fixed number of trials (e.g., fixed number of students surveyed, n=1315) where the researcher counted the number of "successes" and the number of "failures" that occur. We can let X be the number of "successes" that is the number of students who are the highrisk drinkers. We can use the binomial probability distribution (i.e., binomial model), to describe this particular variable.
Binomial distributions are characterized by two parameters: n, which is fixed  this could be the number of trials or the total sample size if we think in terms of sampling, and π, which usually denotes a probability of "success". In our example this would be the probability that someone is a highrisk drinker in the population of Penn State students. Please note that some textbooks will use π to denote the population parameter and p to denote the sample estimate, whereas some may use p for the population parameters as well. We may do both; don't be confused by this, just make sure to read carefully the specification. Once you know n and π, the probability of success, you know the mean and variance of the binomial distribution, and you know everything about that binomial distribution.
Below are the probability density function, mean and variance of the binomial variable.
\(f(x)=\dfrac{n!}{x!(nx)!}π^x(1π)^{nx}\qquad \text{for }x=0,1,2,\ldots,n\)
Mean E (X) = nπ
Variance Var (X) = nπ (1  π)
QUESTION: What is the probability that no students are heavy drinkers, i.e., P(X = 0)?
Let's assume that π = 0.5.
\begin{align}
P(X=0n=1315,\pi=0.5) &= \binom{n}{x} \pi^x(1\pi)^{nx}\\
&= \dfrac{n!}{x!(nx)!}\pi^x(1\pi)^{nx}\\
&= \dfrac{1315!}{0!(13150)!}(0.5)^0(0.5)^{1315}\\
&= 1\cdot1\cdot(0.5)^{1315}\\
&\approx 0\\
\end{align}
What's the probability that there are X = 1000 heavy drinkers in this example? 
QUESTION: What is the true population proportion of students who are highrisk drinkers at Penn State?
This is a statistical inference question that can be answered with a point estimate, confidence intervals and hypothesis tests about proportions. The likelihood function for Binomial L(π ; x) is a measure of how close the population proportion π is to the data x; The Maximum Likelihood Estimate (MLE) is the most likely value for π given the observed data, and for the binomial distribution this is the sample mean,
\(\hat{\pi}=\dfrac{\sum x_i}{n}=\dfrac{x}{n}\)
and for the expected counts,
\(\hat{\mu}=\hat{\pi}\cdot n\)
Thus for our example, assuming the Binomial distribution, our "best" guess estimate of the true proportion of students who are highrisk drinkers is
\(p=\hat{\pi}=\dfrac{630}{1315}=0.48\)
Here are also the likelihood and loglikelihood graphs for our example. We can see that the peak of the likelihood is at the proportion value equal to 0.48. The loglikelihood looks quadratic which means that the largesample normal theory should work fine, and we can use the approximate 95% confidence intervals.
The MLE is used for statistical inference, such as testing for a specific value of π, or giving a range of likely values of π, via the 95% confidence interval. A key result here comes from understanding the properties of the sampling distribution of the sample proportion p.
The Rule for Sample Proportion: If numerous samples of size n are taken with n large enough, the frequency curve of the sample proportions from the various samples will be approximately normal with the mean E(p)=π and variance Var(p)=π(1 π)/n . "Large enough" usually means that the number of successes and failures are not small, i.e., nπ ≥ 5, and n(1  π) ≥ 5. The larger the sample size n, the sampling distribution of p is better approximated by a normal distribution. Note that the sampling distribution of p is really a discrete rather than a continuous distribution, but we rely on the above described approximation for statistical inference unless we deal with small samples; for the latter case see Agresti (2007), Sec 1.4.31.4.5.
QUESTION: Is the population proportion of heavydrinkers significantly different from 50%?
H_{0}: π = π_{0} vs. H_{A}: π ≠ π_{0}

The z test statistic: \(z=\dfrac{p\pi_0}{\sqrt{\dfrac{\pi_0(1\pi_0)}{n}}}\) where p is the sample proportion, in our case 0.48, and π_{0} is the value we are testing for, 0.5.
H_{0}: π = 0.5 vs. H_{A}: π ≠ 0.5

The statistic z = 1.45 with two sided pvalue of 0.136. Thus we do not have a very strong evidence that the proportion of highrisk drinkers is different from 50%; i.e., do not reject the null hypothesis.
The usual (1  α) × 100% CI, holds:
\(p\pm z_{\alpha/2}\sqrt{\dfrac{p(1p)}{n}}\)
This interval is known as the Wald confidence interval.
For our example, the 95% CI is 0.48 ± 1.96 × 0.014 = (0.453, 0.507). We can be 95% confident that the true population proportion of students highrisk drinkers is between 0.454 and 0.506.
However, when the proportions are extremely small or large, π < 0.20 or π > 0.80, this CI does not work very well. It is better to consider the likelihoodratiobased CI, as discussed in Lesson 1. This interval is more complex computationally but in essence simple by evaluating the likelihood function plotted below where we are looking for all possible values of $pi_0$ for which the null hypothesis would not be rejected. In our case we get 95% CI to be (0.453, 0.506), here is a plot of that confidence interval.
To do the above calculations in R and SAS, see the drinking.R and drinking.sas files below. Also, watch the viewlets that will walk you through how these program works.
Here is the SAS program drinking.sas.
And click on the 'Inspect' icon to see a walk through of the last part of this program.
Here is the R program drinking.R.
And, here is a walkthrough of this program.
The third alternative, also likelihoodbased confidence interval, known as the Score confidence interval in essence is looking for ALL π_{0} values that yield the desired test statistics, e.g., for 95% CI, z_{α/2} = ± 1.96. To simplify, we need to solve the following equation:
\(\dfrac{p\pi_0}{\sqrt{\dfrac{\pi_0(1\pi_0)}{n}}}=z_{\alpha/2}\)
which is the same as:
\(\dfrac{p\pi_0}{\sqrt{\dfrac{\pi_0(1\pi_0)}{n}}}=z_{\alpha/2} \text{ and } \dfrac{p\pi_0}{\sqrt{\dfrac{\pi_0(1\pi_0)}{n}}}=z_{\alpha/2}\)
To do this, we can solve the following quadratic equation:
\(\left(1+\dfrac{z^2_{\alpha/2}}{n}\right)\pi^2_0+\left(2p\dfrac{z^2_{\alpha/2}}{n}\right)\pi_0+p^2=0\)
\(a\pi^2_0+b\pi_0+c=0\)
\(\pi_0=\dfrac{b\pm \sqrt{b^24ac}}{2a}\)
In this example, the 95% Score CI is the same as the Wald CI, $(0.453, 0.507)$. Notice that the only difference between the Wald CI and the Score CI, is the standard error where in the former its calculated using the value of sample proportion p and in the latter the null value of π_{0}.
Traditionally, most software packages will give you the Wald CI, but nowadays we are starting to see the score and the likelihoodratio ones too.
The margin of error is defined to be the length of the 95% CI for a single proportion. So the 95% CI is p ± MOE; see the CNN example for a review. Confidence interval is the widest when π_{0} = 0.5. This knowledge is useful in determining sample size for given conditions.
Often we are interested in knowing what sample size is needed for a specific margin of error for a population proportion. For example, how large a sample would we need such that the 99% confidence interval is of width m. Solving the following for n:
\(m=2.575\sqrt{\pi(1\pi)/n}\)
\(n=(2.575)^2 \pi(1\pi)/m^2\)
Since π is unknown, take π = 0.5 to get the widest possible MOE. For example, if the required MOE is 3%
\(n=(2.575)^2(0.5)(0.5)/(0.03)(0.03)=1841.84\)
which is rounded up to 1842.
For additional simple Binomial distribution calculations see Lesson 0 and SimpleBinomialCal.R code with its viewlet that walks you through the code.
Multinomial sampling may be considered as a generalization of Binomial sampling. Data are collected on a predetermined number of individuals that is units and classified according to the levels of a categorical variable of interest (e.g., see Examples 4 through 8 in the Introduction of this Lesson).
X ∼ Mult (n, π), with the probability density function
\(f(x)=\dfrac{n!}{x_1!x_2!\cdots x_k!} \pi_1^{x_1}\pi_2^{x_2}\cdots \pi_k^{x_k}\) , \(x=(x_1,\ldots,x_k)\)
where n is fixed and known sample size, and π is the vector of population proportions \(\pi=(\pi_1,\pi_2,\ldots,\pi_k)\).
Each X_{j} represents a population count in the cell j, x_{j} observed sample count, and each π_{j }represents a population proportion in the cell j.
Multinomial distribution admits a mean vector with
\(E(X_j)=n\pi_j\)
The variance of X_{j} is
\(Var(X_j)=n\pi_j(1\pi_j)\)
and the covariance between X_{j} and X_{k} is
\(cov(X_j,X_k)=n\pi_j\pi_k\)
Each cell thus has a binomial marginal distribution. Binomial is a special case of multinomial for k = 2.
\(f(x)=\dfrac{n!}{x_1!x_2!}\pi_1^{x_1}\pi_2^{x_2}\)
let,
\begin{array}{lcr}
x &= & x_1\\
x_2 &=& nx\\
\pi &=& \pi_1\\
\pi_2 &=& 1\pi
\end{array}
then,
\(f(x)=\binom{n}{x}\pi^x(1\pi)^{nx}=\dfrac{n!}{x!(nx)!}\pi^x(1\pi)^{nx},\qquad x=0,1,2,\ldots,n\)
To be consistent with Agresti(2007), pg.4 or Agresti(2013), pg.6, where the random variable is denoted as Y, and probability density function as P(y), we can write the above as,
\(p(y)=\dfrac{n!}{y!(ny)!}\pi^y(1\pi)^{ny},\qquad y=0,1,2,\ldots,n\)
If n is random (not fixed, as was the case with the total number of goals scored in the World Soccer example), but we have no interest in making inferences about the sample size n only about the underlying cell probabilities, we can apply the multinomial model instead of the Poisson model and the inferences will be the same. But we do have to worry about other kinds of model failure; e.g., are the assumptions of the multinomial model satisfied?
Multinomial Model (distribution) Assumptions:
The most common violation of these assumptions occurs when clustering is present in the data. Clustering means that some of the trials occur in groups or clusters, and that trials within a cluster tend to have outcomes that are more similar than trials from different clusters. Clustering can be thought of as a violation of either (a) or (b).
In this example, eye color was recorded for n = 96 persons.
Eye color

Count

Brown

46

Blue

22

Green

26

Other

2

Total

96

Suppose that the sample included members from the same family as well as unrelated individuals. Persons from the same family are more likely to have similar eye color than unrelated persons, so the assumptions of the multinomial model would be violated. If both parents have brown eye color, it is very likely that their offspring will also have brown eye color. Whereas eye color of family members related by marriage will not violate multinomial assumption, distribution of eye color of blood relations will.
Now suppose that the sample consisted of "unrelated" persons randomly selected within Pennsylvania. In other words, persons are randomly selected from a list of Pennsylvania residents. If two members of the same family happen to be selected into the sample purely by chance, that's okay; the important thing is that each person on the list has an equal chance of being selected, regardless of who else is selected.
Could this be considered a multinomial situation? For all practical purposes, yes. The sampled individuals are not independent according the common English definition of the word, because they all live in Pennsylvania. But we can suppose that they are independent from a statistical viewpoint, because the individuals are exchangeable; before the sample is drawn, no two are a priori any more likely to have the same eye color than any other two.
Based on the readings on Multinomial distribution in Lesson 1 and 2 can you answer the following questions:

A goodnessoffit test, in general, refers to measuring how well do the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. Like in a linear regression, in essence, the goodnessoffit test compares the observed values to the expected (fitted or predicted) values.
A goodnessoffit statistic tests the following hypothesis:
H_{0}: the model M_{0} fits
vs.
H_{A}: the model M_{0} does not fit (or, some other model M_{A} fits)
Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis (H_{A}) will represent the saturated model M_{A} which fits perfectly because each observation has a separate parameter. Later in the course we will see that M_{A} could be a model other than the saturated one. Let us now consider the simplest example of the goodnessoffit test with categorical data.
In the setting for oneway tables, we measure how well an observed variable X corresponds to a Mult (n, π) model for some vector of cell probabilities, π. We will consider two cases:
In other words, we assume that under the null hypothesis data come from a Mult (n, π) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodnessoffit test is applied to corroborate our assumption.
Consider our Dice Example from the Introduction. We want to test the hypothesis that there is an equal probability of six sides; that is compare the observed frequencies to the assumed model: X ∼ Multi (n = 30, π_{0} = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)). You can think of this as simultaneously testing that the probability in each cell is being equal or not to a specified value, e.g.
H_{0}: ( π_{1}, π_{2}, π_{3}, π_{4}, π_{5}, π_{6}) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
vs.
H_{A}: ( π_{1}, π_{2}, π_{3}, π_{4}, π_{5}, π_{6}) ≠ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Most software packages will already have builtin functions that will do this for you; see the next section for examples in SAS and R. Here is a stepby step procedure to help you conceptually understand this test better and what is going on behind these functions.
Step 1: If vector π is unknown we need to estimate these unknown parameters, and proceed to Step 2; If vector π is known proceed to Step 2. Step 2: Calculate the estimated (fitted) cell probabilities $\hat{\pi_j}$s, and expected cell frequencies, E_{j}'s under H_{0}. Step 3: Calculate the Pearson goodnessoffit statistic, X ^{2} and/or the deviance statistic, G^{2} and compare them to appropriate chisquared distributions to make a decision. Step 4: If the decision is borderline or if the null hypothesis is rejected, further investigate which observations may be influential by looking, for example, at residuals. 
The Pearson goodnessoffit statistic is
\(X^2=\sum\limits_{j=1}^k \dfrac{(X_jn\hat{\pi}_j)^2}{n\hat{\pi}_j}\)
An easy way to remember it is
\(X^2=\sum\limits_j \dfrac{(O_jE_j)^2}{E_j}\)
where O_{j} = X_{j} is the observed count in cell j, and $E_j=E(X_j)=n\hat{\pi}_j$ is the expected count in cell j under the assumption that null hypothesis is true, i.e. the assumed model is a good one. Notice that \(\hat{\pi}_j\) is the estimated (fitted) cell proportion π_{j }under H_{0}.
The deviance statistic is
\(G^2=2\sum\limits_{j=1}^k X_j \text{log}\left(\dfrac{X_j}{n\hat{\pi}_j}\right)\)
where "log" means natural logarithm. An easy way to remember it is
\(G^2=2\sum\limits_j O_j \text{log}\left(\dfrac{O_j}{E_j}\right)\)
In some texts, G^{2} is also called the likelihoodratio test statistic, for comparing the likelihoods (l_{0} and l_{1 }) of two models, that is comparing the loglikelihoods under H_{0} (i.e., loglikelihood of the fitted model, L_{0}) and loglikelihood under H_{A} (i.e., loglikelihood of the larger, less restricted, or saturated model L_{1}): G^{2 }= 2log(l_{0}/l_{1}) = 2(L_{0}  L_{1}). A common mistake in calculating G^{2} is to leave out the factor of 2 at the front.
Note that X^{2} and G^{2} are both functions of the observed data X and a vector of probabilities π. For this reason, we will sometimes write them as X^{2}(x, π) and G^{2}(x, π), respectively; when there is no ambiguity, however, we will simply use X^{2} and G^{2}. We will be dealing with these statistics throughout the course; in the analysis of 2way and kway tables, and when assessing the fit of loglinear and logistic regression models.
X^{2} and G^{2} both measure how closely the model, in this case Mult (n, π) "fits" the observed data.
The answer is provided by this result:
If x is a realization of X ∼ Mult(n, π), then as n becomes large, the sampling distributions of both X^{2}(x, π) and G^{2}(x, π) approach chisquared distribution with df = k 1, where k = number of cells, χ2_{k−1}.
This means that we can easily test a null hypothesis H_{0} : π = π_{0} against the alternative H_{1} : π ≠ π_{0} for some prespecified vector π_{0}. An approximate αlevel test of H_{0} versus H_{1} is:
Reject H_{0} if computed X^{2}(x, π_{0}) or G^{2}(x, π_{0}) exceeds the theoretical value χ^{2} _{k−1}(1 − α).
Here, χ^{2}_{k−1}(1 − α) denotes the (1 − α)th quantile of the χ^{2}_{k−1} distribution, the value for which the probability that a χ^{2}_{k−1} random variable is less than or equal to it is 1 − α. The pvalue for this test is the area to the right of the computed X^{2} or G^{2} under the χ^{2}_{k−1} density curve. Below is a simple visual example. Consider a chisquared distribution with df=10. Let's assume that a computed test statistic is X^{2}=21. For α=0.05, the theoretical value is 18.31.
Useful functions in SAS and R to remember for computing the pvalues from the chisquare distribution are:
probchi(test statistic,df), e.g.,1probchi(21,10)=0.021
You can quickly review the chisquared distribution in Lesson 0, or check out http://www.statsoft.com/textbook/stathome.html and http://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/index.html. The STATSOFT link also has brief reviews of many other statistical concepts and methods.
Here are a few more comments on this test.
Suppose that we roll dice 30 times and observe the following table showing the number of times each face ends up on top.
Face

Count

1

3

2

7

3

5

4

10

5

2

6

3

Total

30

We want to test the null hypothesis that the dice is fair. Under this hypothesis, X ∼ Mult(n = 30, π_{0}) where π_{0} = (1/6 , 1/6 , 1/6 , 1/6 , 1/6 , 1/6 ).
This is our assumed model, and under this H_{0}, the expected counts are E_{j} = 30*1/6= 5 for each cell j. One way to calculate the difference between what we are assuming and what we are observing is to evaluate the the goodnessoffit statistics as discussed in the previous section.
\begin{eqnarray*}
X^2 &= & \dfrac{(35)^2}{5}+\dfrac{(75)^2}{5}+\dfrac{(55)^2}{5}\\
& & +\dfrac{(105)^2}{5}+\dfrac{(25)^2}{5}+\dfrac{(35)^2}{5}\\
&=& 9.2
\end{eqnarray*}
\begin{eqnarray*}
G^2 &=& 2\left(3\text{log}\dfrac{3}{5}+7\text{log}\dfrac{7}{5}+5\text{log}\dfrac{5}{5}\right.\\
& & \left.+ 10\text{log}\dfrac{10}{5}+2\text{log}\dfrac{2}{5}+3\text{log}\dfrac{3}{5}\right)\\
&=& 8.8
\end{eqnarray*}
Note that numerical values of Χ^{2} and G^{2} can be different. We can calculate the pvalues for these statistics in order to help determine how well this model fits. The pvalues are P(χ^{2}_{5} ≥ 9.2) = .10 and P(χ^{2}_{5} ≥ 8.8) = .12. Given these pvalues, with the critical value or Type I error of α=0.05, we fail to reject the null hypothesis. But rather than relying on pvalues, we should also think about the actual values of X^{2} and G^{2}statistics. Small values imply a good fit here, i.e., the observed values are close to what you had assumed. Large values are going to imply the opposite. In this case, the fairdice model doesn't fit the data very well, but the fit isn't bad enough to conclude beyond a reasonable doubt that the dice is unfair.
Next, we show how to do this in SAS and R; you can run either.
The following SAS code, dice_rolls.sas, will perform the goodnessoffit test for the example above.
Here is the output generated by this program, and the headings are selfexplanatory:
Notice that this SAS code only computes the Pearson chisquare statistic and not the deviance statistic.
Can you identify the relevant statistics and the pvalue in the output? What does the column labeled "Percent" represent? 
The following R code, dice_rolls.R will perform the same analysis as what we just did in SAS. You can copy the entire code in the R window and your output will be saved into two files, dice_rolls.out and dice_rolls_Results. If you are new to R, however, I suggest you run this line by line to get more familiar with the commands; you can simply copy and past into the Rterminal. For other options, see a very basic intro to R.
dice_rolls.out (part of the output)

Here are a few additional comments regarding the above R code and its output. If you run Inspect, the data there will be labeled as "ex7" where in the updated code the label is "dice", but it's the same data. The last part of the R code does the same analysis, but gives an output that looks more like what SAS produces; we did this to show a bit more programming in R and how you can play with creating different outputs.
IMPORTANT! The Pearson chisquared statistic is X^{2} = 9.2 (pvalue=0.101), the deviance statistic (from the R output) is G^{2}=8.78 (pvalue=0.118), and they both follow the chisquared sampling distribution with df=5. It is not uncommon to observe the discrepancy in the values between these two statistics especially for the smaller sample sizes. Notice, however, that in this case they do lead to the same conclusion  there is a moderate evidence that the dice is fair.
Can you identify the relevant statistics and the pvalue in the output? What does the column labeled "Percentage" in dice_rolls.out represent? 
Tall cutleaf tomatoes were crossed with dwarf potatoleaf tomatoes, and n = 1611 offspring were classified by their phenotypes.
Phenotypes

Count

tall cutleaf

926

tall potatoleaf

288

dwarf cutleaf

293

dwarf potatoleaf

104

Genetic theory says that the four phenotypes should occur with relative frequencies 9 : 3 : 3 : 1, and thus are not all equally as likely to be observed. The dwarf potatoleaf is less likely to observed than the others. Do the observed data support this theory?
Under the null hypothesis, the probabilities are
π_{1} = 9/16 , π_{2} = π_{3} = 3/16 , π_{4} = 1/16 ,
and the expected frequencies are
E_{1} = 1611(9/16) = 906.2,
E_{2} = E_{3} = 1611(3/16) = 302.1, and
E_{4} = 1611(1/16) = 100.7.
We calculate the fit statistics and find that X^{2} = 1.47 and G^{2} = 1.48, which are nearly identical. The pvalues based on the χ^{2} distribution with 3 d.f. are approximately equal to 0.69. Therefore, we fail to reject the null hypothesis and the data are consistent with the genetic theory. Again, the small values of these statistics imply that the fit between the data and the proposed model is good!
Here is how to do the computations in R using the following code :
This has stepbystep calculations and using of builtin chisq.test() with producing some nice output including Pearson and deviance residuals.
Here is the above analysis done in SAS:
Do you recall what the residuals are from linear regression? How would you define them in this context? What do they tell you about the tomato example? 
How large is the discrepancy between the two proposed models? The previous analysis provides a summary of differences between the proposed models. If we want to know more specifically where these differences are, and which differences may lead to potentially rejecting the null hypothesis, residuals can be inspected for relevant clues. There are two types of residuals we will consider: Pearson and deviance residuals.
Pearson residuals
The Pearson goodnessoffit statistic can be written as \(X^2=\sum\limits_{j=1}^k r^2_j\) , where
\(r^2_j=\dfrac{(X_jn\hat{\pi}_j)^2}{n\hat{\pi}_j}=\dfrac{(O_jE_j)^2}{E_j}\)
represents the contribution to X^{2} by cell j. The quantity
\(r_j=\dfrac{X_jn\hat{\pi}_j}{\sqrt{n\hat{\pi}_j}}=\dfrac{O_jE_j}{\sqrt{E_j}}\)
is called the Pearson residual for cell j, and it compares the observed with the expected counts. The sign (positive or negative) indicates whether the observed frequency in cell j is higher or lower than the value fitted under the model, and the magnitude indicates the degree of departure. When data do not fit a model, examination of the Pearson residuals often helps to diagnose where the model has failed.
How large should a "typical" value of r_{j} be? Recall that the expectation of a χ^{2} random variable is its degrees of freedom. Thus if a model is true, E(X^{2}) ≈ E(G^{2}) ≈ k − 1, and the typical size of a single r_{j}^{2} is (k − 1)/k. Thus, if the absolute value, r_{j}, is much larger than \(\sqrt{(k1)/k}\)—say, 2.0 or more—then the model does not appear to fit well for cell j. This helps to identify those cells that do not fit the expected model as closely as we might have assumed.
Deviance residuals
Like the X^{2} statistic, the deviance, G^{2}, can be regarded as the sum of squared Deviance residuals,
\(G^2=\sum\limits_{j=1}^k \epsilon^2_j\)
where
\(\epsilon_j=\sqrt{\left2X_j\text{log}\dfrac{X_j}{n\hat{\pi}_j}\right}\times \text{sign}(X_jn\hat{\pi}_j)\)
The sign function can take three values:
When the expected counts E_{j} are all fairly large (much greater than 5) the deviance and Pearson residuals resemble each other quite closely.
Below is a table of observed counts, expected counts, and residuals for the fairdie example; for calculations see dice_rolls.R. Unfortunately, CELLCHI2 option in SAS that gives these residuals does NOT work for oneway tables; we will use it for higherdimensional tables. Please note, this is one of the reasons why we give examples in both SAS and R; that is, there is not a single software package that can do all the analysis, and often we need to write a piece of code. For additional calculations see tomato_types.R.
cell j

O_{j}

E_{j}

r_{j}

ε_{j}

1

3

5

0.89

1.75

2

7

5

+0.89

2.17

3

5

5

+0.00

0.00

4

10

5

+2.24

3.72

5

2

5

1.34

1.91

6

3

5

+0.89

1.75

The only cell that seems to deviate substantially from the fairdice model is cell j = 4. If the dice is not fair, then it may be "loaded" in favor of the outcome 4. But recall that the pvalue was about .10, so the evidence against fairness is not overwhelming.
If an X_{j} is zero and all π_{j}'s are positive, then the Pearson X^{2} can be calculated without any problems but there is a problem in computing the deviance, G^{2}; if X_{j} = 0 then the deviance residual is undefined, and if we use the standard formula,
\(G^2=2\sum\limits_{j=1}^k X_j\text{log}\dfrac{X_j}{n\hat{\pi}_j}\)
an error will result. But if we write the deviance to avoid the potential problem of division by zero:
\(G^2=2\text{log}\dfrac{L(p;X)}{L(\hat{\pi};X)}=2\text{log}\prod\limits_{j=1}^k \left(\dfrac{X_j/n}{\hat{\pi}_j}\right)^{X_j}\)
a cell with X_{j} = 0 contributes 1 to the product and may be ignored. Thus we may calculate the deviance statistic as
\(G^2=2\sum\limits_{i:X_j>0} X_j\text{log}\dfrac{X_j}{n\hat{\pi}_j}\)
Alternatively, we can set the deviance residuals to zero for cells with X_{j} = 0 and take $G^2= \sum_j \epsilon_j^2$ as before. But if we do that, $\epsilon_j = 0$ should not be interpreted as "the model fits well in cell j." The fit could be quite poor, especially if E_{j} is large.
If any element of vector π is zero, then X^{2} and G^{2} both break down. One simple possible way to avoid such problems is to put a very small mass in all the cells, including the zerocell (e.g., $10^{5}$, $0.05$, or do a sensitivity analysis by removing or collapsing cells and redoing the tests, or try the Bayesian analysis (see Agresti (2013), Sec 1.6).
Structural zeros and sampling zeros. Note that observed zero's in the cell do not necessarily mean that there cannot be any observation in those cells. In the example of World Cup soccer, no team scored 6 or 7 goals in round one of the tournament. That does not mean that scoring 6 or 7 goals is an impossibility, that is the underlying cell probabilities $\pi_j$'s are positive but in the same we happen to observe no outcomes. These are examples of sampling zeros. Alternatively consider an example of categorizing male and female patients on whether they have carcinoma. Since only females can have ovarian cancer and males can have prostate cancer, the following cells are said to have structural zeros, and the underlying cell probabilities $\pi_j=0$. We will work with structural zero's later. To handle structural zeros, an adjustment to the degreesoffreedom is required, whereas sampling zeros do not require any special consideration other than introduction of small mass in those cells.
Patient

Ovarian Cancer

Prostate cancer

Male

0

7

Female

10

0

Total  10  7 
For the soccer example we can test how well does the Poisson model fits the observed data (see soccer.R and soccer.sas). If we ignore the Poisson model at first, and test if the data fit the Multinomial model with all scores having equal probability of occurring, then from the chisquared goodnessoffit test we get X^{2} = 127.66 (after adding 1/2 count to each cell in the table), df = 8, pvalue near to zero and G^{2} = 128.64, df = 8, pvalue near to zero, we conclude that the tested Multinomial model does not fit the observed data well. Almost all cells but the 4th one have high residual values, e.g., absolute values of residuals is greater than 2. To test if the Poisson model fits, the underlying cell proportions depend on the unknown parameter $\lambda$ that we first need to estimate; the value is $\hat{\lambda}=1.38$ and this will also reduce the degrees of freedom by 1 (see the next section). From the chisquared goodnessoffit test we get X^{2} = 128.79, df = 7, pvalue near to zero. Thus the tested Poisson model does not fit the observed data well. However, if we look at the Pearson residuals, we see that the model fits well except for the very last cell which has a very large residual value of 11.267; e.g., absolute values of residuals is greater than 2. Removing that cell or combining it with the previous 3 cells, may give a better result.
For many statistical models, we do not know the vector of probabilities π a priori, but can only specify it up to some unknown parameters. More specifically, the cell proportions may be known functions of one or more other unknown parameters.
HardyWeinberg problem. Suppose that a gene is either dominant (A) or recessive (a), and the overall proportion of dominant genes in the population is p. If we assume mating is random (i.e. members of the population choose their mates in a manner that is completely unrelated to this gene), then the three possible genotypes—AA, Aa, and aa—should occur in the socalled HardyWeinberg proportions:
genotype

proportion

no. of dominant genes

AA

π_{1} = p^{2}

2

Aa

π_{2} = 2p(1 − p)

1

aa

π_{3} = (1 − p)^{2}

0

Note that this is equivalent to saying that the number of dominant genes that an individual has (0, 1, or 2) is distributed as Bin(2, p), where the parameter p is not specified. we have to first estimate p to be able to estimate (i.e., say something about the) the unknown cell proportions in vector π.
Number of Children (The Poisson Model). Suppose that we observe the following numbers of children in n = 100 families:
no. of children: 
0

1

2

3

4+

count: 
19

26

29

13

13

Are these data consistent with a Poisson distribution? Recall that if a random variable Y has a Poisson distribution with mean λ, then
\(P(Y=y)=\dfrac{\lambda^y e^{\lambda}}{y!}\)
for y = 0, 1, 2, . . .. Therefore, under the Poisson model, the proportions given some unknown λ, are provided in the table below. For example, $\pi_1=P(Y=0)=\dfrac{\lambda^0 e^{\lambda}}{0!}=e^{\lambda}$.
no. of children

proportion

0

π_{1} = e^{−λ} 
1

π_{2} = λe^{−λ} 
2

π_{3} = λ^{2}e^{−λ}/2 
3

π_{4} = λ^{3}e^{−λ}/6 
4+

π_{5} = 1 − Σ^{4}_{j=1} π_{j} 
In both of these examples, the null hypothesis is that the multinomial probabilities π_{j} depend on one or more unknown parameters in a known way. In the children's example, we maybe want to know the proportion of the families in the sample population that have 2 children. In more general notation, the model specifies that:
π_{1} = g_{1}(θ),
π_{2} = g_{2}(θ),
...
π_{k} = g_{k}(θ),
where g_{1}, g_{2}, . . . , g_{k} are known functions but the parameter θ is unknown (e.g., $\lambda$ in the children's example or $p$ in the genetics example). Let S_{0} denote the set of all π that satisfy these constraints for some parameter θ. We want to test
H_{0} : π ∈ S_{0} versus H_{1} : π ∈ S,
where S denotes the probability simplex (the space) of all possible values of π. (Notice that S is a (k − 1)dimensional space, but the dimension of S_{0} is the number of free parameters in θ.)
The method for conducting this test is as follows.
\(\hat{\pi}_1=g_1(\hat{\theta})\)
\(\hat{\pi}_2=g_2(\hat{\theta})\)
...
\(\hat{\pi}_k=g_k(\hat{\theta})\)
\(X^2=\sum\limits_j \dfrac{(O_jE_j)^2}{E_j}\) and \(G^2=2\sum\limits_j O_j \text{log}\dfrac{O_j}{E_j}\) as usual.
If \(X^2(x,\hat{\pi})\) and \(G^2(x,\hat{\pi})\) are calculated as described above, then the distribution of both X^{2} and G^{2} under the null hypothesis as n → ∞, approaches χ^{2}_{ν}, where ν equals the number of unknown parameters under the alternative hypothesis minus the number of unknown parameters under the null hypothesis, ν = (k − 1) − d, where d = dim(θ), i.e., the number of parameters in θ.
The difference between this result and the previous one is that the expected cell counts E_{1}, E_{2}, . . . , E_{k} used to calculate X^{2} and G^{2} now contain unknown parameters. Because we need to estimate d parameters to find E_{1}, E_{2}, . . . , E_{k}, the largesample distribution of X^{2} and G^{2} has changed; it's still a chisquared distribution, but the degrees of freedom have dropped by d , the number of unknown parameters we first need to estimate.
Are the data below consistent with a Poisson model?
no. of children: 
0

1

2

3

4+

count: 
19

26

29

13

13

Let's test the null hypothesis that these data are Poisson. First, we need to estimate λ, the mean of the Poisson distribution, and thus here $d=1$. Recall that if we have an iid sample y_{1}, y_{2}, . . . , y_{n} from a Poisson distribution, then the ML estimate of λ is just the sample mean, \(\hat{\lambda}=n^{1}\sum_{i=1}^n y_i\). Based on the table above, we know that the original data y_{1}, y_{2}, . . . , y_{n} contained 19 values of 0, 26 values of 1, and so on; however, we don't know the exact values of the original data that fell into the category 4+. It is also not easy to find MLE of λ in this situation without involved numerical computations. To make matters easy, suppose for now that of the 13 values that were classified as 4+, ten were equal to 4 and three were equal to 5. Then the ML estimate of λ is, therefore,
\(\hat{\lambda}=\dfrac{19(0)+26(1)+29(2)+13(3)+10(4)+3(5)}{100}=1.78\)
Under this estimate of λ, the expected counts for the first four cells (0, 1, 2, and 3 children, respectively) are
E_{1} = 100e−1.78 = 16.86,
E_{2} = 100(1.78)e−1.78 = 30.02,
E_{3} = 100(1.78)^{2}e−1.78/2 = 26.72,
E_{4} = 100(1.78)^{3}e−1.78/6 = 15.85.
The expected count for the 4+ cell is most easily found by noting that Σ_{j} E_{j} = n, and thus
E_{5} = 100 − (16.86 + 30.02 + 26.72 + 15.85) = 10.55.
This leads to:
X^{2} = 2.08 and G^{2} = 2.09.
Since the general multinomial model here has k − 1 = 4 parameters, where k is the number of cell that is $\pi_j$'s, and the Poisson model has just one parameter $\lambda$, the degrees of freedom for this test are ν = 4 − 1 = 3, and the pvalues are
\(P(\chi^2_3\geq2.08)=0.56\)
\(P(\chi^2_3\geq2.09)=0.55\)
The Poisson model seems to fit well; there is no evidence that these data are not Poisson. Below is an example of how to do these computations using R and SAS.
Here is how to fit the Poisson Model in R using the following code :
The function dpois() calculates Poisson probabilities. You can also get the X^{2} in R by using function chisq.test(ob, p=pihat) in the above code, but notice in the output below, that the degrees of freedom, and thus the pvalue, are not correct:
Chisquared test for given probabilities
data: ob
Xsquared = 2.0846, df = 4, pvalue = 0.7202
You can use this X^{2} statistic, but need to calculate the new pvalue based on the correct degrees of freedom, in order to obtain correct inference.
Here is how we can do this goodnessoffit test in SAS, by using precalculated proportions (pihats). The TESTP option specifies expected proportions for a oneway table chisquare test. But notice in the output below, that the degrees of freedom, and thus the pvalue, are not correct:
ChiSquare Test for Specified Proportions  ChiSquare 2.0892 DF 4 Pr > ChiSq 0.7194
You can use this X^{2} statistic, but need to calculate the new pvalue based on the correct degrees of freedom, in order to obtain correct inference.
children.sas (program  text file)

children.lst (output  text file)

The frist two lessons introduced the key distributions (Poisson, binomial, and multinomial), for sampling schemes for categorical data. It also introduced basic notations and reviewed hypothesis testing, confidence intervals, likelihood and maximum likelihood estimation. Maximum likelihood is the key estimation method for statistical inference with categorical data. We also made a distinction between three types of largesample theory hypothesis tests and related confidence intervals used in the analysis of categorical data: Wald, LikelihoodRatio, and Score. These will appear over and over again throughout the course. The Wald test is the most widely used one. For example, we will see when we fit a logistic regression model that the zstatistic and the confidence intervals for the regression parameter estimates are Wald CIs. Thus, unless specified, you can assume that a given test statistic and approximate confidence intervals are based on the Wald inference.
These concepts were demonstrated through four examples:
Lesson 2 also introduced the goodnessoffit test, goodnessoffit test statistics (e.g, Pearson's chisquare and Deviance), and the corresponding residuals. The distinction was made between the goodnessoffit tests with known versus unknown population parameters. This idea of comparing the assumed model (H_{0}) to a distribution of the observed data, and assessing how close they fit, will be used throughout the course, as will these test statistics. For example, in the next lesson we will see that the Chisquared test of independence is just a goodnessoffit test where the assumed model (H_{0}) of the independence of two random variables is compared to a saturated model, that is to the observed data.
These concepts were further demonstrated in SAS and R with four examples: