# Lesson 2: One-Way Tables and Goodness-of-Fit Test

### One-Way Tables and Goodness of Fit

#### Key concepts:

• One-way Frequency Table
• Sampling Schemes (Models)
• Pearson goodness-of-fit statistic
• Deviance statistic
• Pearson residuals

#### Objectives:

• Learn how to compute the goodness-of-fit statistics
• Understand how well an observed table of counts corresponds to the multinomial model Mult(n, π) for some vector π.
• Introduce the FREQ procedure in SAS and the prop.test and the chisq.test in R

Links to background information on the topics in this lesson and some programming help and examples.

• Understand chi-square distribution:
• SAS source on PROC FREQ:
• SAS source on one-way frequency tables:
• R: Test of Equal or Given Proportions and Chi-square goodness-of-fit test.

• Ch.1 Agresti (2007, 1996)
• If you are using other textbooks or editions: Ch.1 Agresti (2013, 2002, 1996)

# 2.1 - Introduction and Examples

Lesson 1 was a brief introduction to the categorical data and discrete distributions. We looked at different types of categorical data as well as counts and proportions. In this lesson we will see the simplest way to represent counts and proportions and then do the simplest type of analysis. Let's begin with the definition of a one-way frequency table.

A frequency table arises when sample units are classified into mutually exclusive categories; the number of units falling into each category is recorded. "One way" means that units are classified according to a single categorical variable, and that is the table's dimension. The cells of the table contain frequency counts and can be arranged in a single row or column.  The size of the table refers to the number of cells.

Let us look at some examples and see how they relate to the simplest concept of contingency table.

### Example 1 - High-Risk Drinking at Penn State

In 2006, the Penn State's Student Affairs Office of Research and Assessment (SAORA) conducted a survey on alcohol use of undergraduate students at the Penn State, University Park campus ( Pulse Survey: Student Drinking, March 2006). A sample of 1315 undergraduate students were classified as to whether they are high-risk drinkers or not.

 High-Risk Count Yes 630 No 685 Total 1315

Since the sample size, n = 1315, is fixed a priori and there are only two outcomes, e.g., you are a high-risk drinker or not, this is an example of a binomial sampling.

The questions of interest are

1. What is the true population proportion of students who are high-risk drinkers at Penn State?
2. Is this proportion significantly different from 50%? In other words, is a randomly chosen student equally likely to be a high risk or non-high-risk drinker?

### Example 2 - World Cup Soccer

How many teams scored certain number of goals during the first round of 2002 FIFA World Cup soccer championship ( http://www.worldcup-soccer.info/ )?  For example, 25 teams scored no goals during their first round matches. The table below records the frequency of the number of goals scored.

 Number of goals Count 0 23 1 37 2 20 3 11 4 2 5 1 6 0 7 0 8 1 Total 95

This is a one-way table of size 9. Note here that the total count (i.e., sample size n = 95 ) was not fixed a priori and this could be considered Poisson sampling since we fixed the time period during which we recorded the outcomes of interest. We may ask questions such as

1. What is the mean number of goals scored?; or
2. What is the probability that a team will score four goals?

### Example 3 - Dice Rolls

Start a timer and roll a die for two minutes. Count the number of times each die face comes up on top during this two minutes. This is an example of a Poisson sampling. You may get results similar to those in the following table. Note that we could not fix the sample size n = 30 a priori!

 Face Count 1 3 2 7 3 5 4 10 5 2 6 3 Total 30

You may also get the table above, or a similar one by deciding instead to roll a die n = 30 times, and record out of these 30 times at each roll which face comes up. This is an example of Multinomial sampling. Now, we may ask the question, "Is this die fair?", and use the sampling knowledge to answer it.

### Example 4 - Eye Colors

A sample of n = 96 persons is obtained, and the eye color of each person is recorded.

 Eye color Count Brown 46 Blue 22 Green 26 Other 2 Total 96

Notice that brown, blue, green, and other have no intrinsic ordering. The response variable, eye color, is therefore an unordered categorical or nominal variable.

### Example 5  - Attitudes Towards War - 1

Hypothetical attitudes of n = 116 people towards war. They were asked to state their opinion on a 5 point scale regarding the statement: "This is a necessary war".

 Attitude Count Strongly disagree 35 Disagree 27 Agree 23 Strongly agree 31 Total 116

The response categories in this example are clearly ordered, but no objectively defined numerical scores can be attached to the categories. The response variable, attitude, is therefore said to be an ordered categorical or ordinal variable.

### Example 6 - Attitudes Towards a War - 2

Hypothetical attitudes of n = 130 people towards the war, but with additional categories for "not sure" and "refused to answer."

 Attitude Count Strongly disagree 35 Disagree 27 Agree 23 Strongly agree 31 Not sure 6 Refusal 8 Total 130

The first four categories are clearly ordered, but the placement of "not sure" and "refusal" in the ordering is questionable. We would have to say that this response is partially ordered.

### Example 7 - Number of Children in Families

Number of children in n = 100 randomly selected families.

 Number of children Count 0 19 1 26 2 29 3 13 4-5 11 6+ 2 Total 100

The original data, the raw number of children, has been coarsened into six categories (0, 1, 2, 3, 4–5, 6+). These categories are clearly ordered, but—unlike the previous example—the categories have objectively defined numeric values attached to them. We can say that this table represents coarsened numeric data or interval variable.

### Example 8 - Household Incomes

Total gross income of n = 100 households.

 Income Count below \$10,000 11 \$10,000–\$24,999 23 \$25,000–\$39,999 30 \$40,000–\$59,999 24 \$60,000 and above 12 Total 100

The original data (raw incomes) were essentially continuous. Any type of data, continuous or discrete, can be grouped or coarsened into categories.

Grouping data will typically result in some loss of information. How much information is lost depends on (a) the number of categories and (b) the question being addressed. In this example, grouping has somewhat diminished our ability to estimate the mean or median household income. Our ability to estimate the proportion of households with incomes below \$10,000 has not been affected, but estimating the proportion of households with incomes above \$75,000 is now virtually impossible. (Why?)

Remember measurement hierarchy? If we have a statistical method that is developed for analysis of ordinal categorical data that method typically will not be appropriate for analysis of, for example, nominal categorical data. This is why we are making a distinction between ordinal and nominal variables and partial ordinal variables. Let's get a little bit more technical and see the notation relevant for one-way frequency tables before we go into a discussion of sampling schemes.

# 2.2 - Notation & Structure

Suppose data are collected on a discrete variable, X, with k categories. A one-way frequency table with k cells will be denoted by the vector:

$$X=(X_1,X_2,\ldots,X_k)$$

where Xj is the count or frequency in cell j and xj or nj is the observed count in this cell j. In some textbooks the author will use xj and other authors will use nj; the latter is more consistent with your textbook (Agresti's) notation.

Note that X is a summary of the original raw dataset consisting of:

$$n=\sum\limits_{j=1}^k x_j=\sum\limits_{j=1}^k n_j$$

observations.

For instance, the Eye Colors example had four types of eye colors, thus k = 4.

X1 is the unknown number of people in the population with "brown" eyes. The observed number of people with brown eyes is x1 = n1 = 46. The total sample size n = n1 + n2 + n3 + n4 = 46 + 22 + 26 + 2 = 96.

#### Joint Distribution

If the observed units are randomly sampled from a large population, then x = (n1, n2 , n3 , ... , nk) will have a joint discrete probability distribution such as Poisson, Binomial, or Multinomial; a technical review of these common discrete distributions is given in Lesson 1.

Let πj be the population probability (or proportion) that a randomly selected individual, (i.e., an observational unit), falls into the jth cell of the table. Then the joint probability distribution will be denoted as

$$\pi=(\pi_1,\pi_2,\pi_3,\ldots,\pi_k)=\{\pi_j\}$$

For this to be a proper joint probability distribution the probabilities must sum to 1:

$$\sum\limits_{j=1}^k \pi_j=1$$

In the social sciences we typically think of πj as a proportion of individuals in the population with a particular characteristic. For example, π1 = proportion of people with brown eyes in the population of interest. In statistical inference, our goal is to estimate unknown parameters πj, and/or their joint distribution.

For observed data we typically use p instead of π, to denote observed joint distribution { pj } where pj = nj/n is the observed proportion of people with a certain characteristic, e.g. the proportion of people with brown eyes in our example is p1 = 46 / 96 = 0.479. p1 is an example of a sample statistic, and a point estimate of unknown population parameter, π1.

For the Eye Color example:

 X = Color Count Brown n1=46 Blue n2=22 Green n3=26 Other n4=2 Total n=96
 X = Color PopulationProportion π Brown π1 Blue π2 Green π3 Other π4 Total 1
 X = Color ObservedProportions Brown p1 = 0.479 Blue p2 = 0.229 Green p3 = 0.271 Other p4 = 0.021 Total 1

# 2.3 - Sampling Schemes

What are some ways of generating these one-way tables of counts?

 Why do you think we care about the random mechanism that generated the data?

Any data analysis requires some assumptions about the data generation process. For continuous data and linear regression, for example, we assume that the response variable have been randomly generated from a normal distribution. For categorical data we will often assume that data have been generated from a Poisson, binomial or multinomial distribution. Statistical analysis depends on the data generation mechanism, although in some situations, depending on the objective, we may ignore that mechanism and simplify our analysis.

The following sampling methods correspond to the distributions considered

• Unrestricted sampling (corresponds to Poisson distribution)
• Sampling with fixed total sample size (corresponds to Binomial or Multinomial distributions)

# 2.3.1 - Poisson Sampling

Poisson sampling assumes that the random mechanism to generate the data can be described by a Poisson distribution. It is useful for modeling counts or events that occur randomly over a fixed period of time or in a fixed space. Often it is useful when the probability of any particular incidence happening is very small while the number of incidences is very large. (This is very much like a binomial distribution where success probability π of a trial is very very small but the number of trials n is very very large. This is known as the limiting condition).

For example, consider the World Cup soccer data example where we collect data on the frequency on the number of goals scored by teams during the first round matches of the 2002 World Cup. Another example is rolling of a dice during a fixed two-minute time period. Similarly count the number of emails you received between 4pm-5pm on a Friday, or number of students accessing STAT 504 course website on a Saturday, etc.

Let X be the number of goals scored in the matches of the first round of the World Cup. XPoisson (λ)

$$P(X=x)=\dfrac{\lambda^x e^{-\lambda}}{x!}\qquad x=0,1,2,\ldots$$

Where λ is the parameter describing the rate, that is the mean of the distribution, e.g., the average number of goals scored during the first round matches. Once you know λ, you know everything there is to know about this distribution. x! stands for x factorial, i.e., x!=1*2*3*...*x.  P(X=x) or P(x) is a probability that a randomly chosen team scored x number of goals in a game, e.g.:

$$P(X=0)=\dfrac{\lambda^0 e^{-\lambda}}{0!}=\dfrac{1\cdot e^{-\lambda} }{1}=e^{-\lambda}$$

How the average rate λ = 1.38 is obtained, is given below.  Then $$P(X=0)=e^{-1.38}=\dfrac{1}{e^{1.38}}=0.252$$ is the probability that a randomly chosen team will score 0 goals in the first round match of the World Cup. For the remaining probabilities see the table at the end of this page.

#### The Poisson Model (distribution) Assumptions

1. Independence: Events must be independent (e.g. the number of goals scored by a team should not make the number of goals scored by another team more or less likely.)
2. Homogeneity: The mean number of goals scored is assumed to be the same for all teams.
3. Time period (or space) must be fixed

In this sampling scheme, the overall n is not fixed. We also assume that X1, X2, ... Xk are independent Poisson variables, either each with the same rate λ or each with different rate λ1, λ2, ...., λk such that λ1 + λ2 + ... + λk = λ.

Recall that mean and variance of Poisson distribution are the same; e.g., E(X) = Var(X) = λ. However in practice, the observed variance is usually larger than the theoretical variance and in the case of Poisson, larger than its mean. This is known as overdispersion, an important concept that occurs with discrete data. We assumed that each team has the same probability of in each match of the first round of scoring goals, but it's more realistic to assume that these probabilities will vary by the teams skills, the day the matches were played because of the weather, maybe even if the order of the matches, etc. Then we may observe more variations in the scoring than the Poisson model predicts.  Analyses assuming binomial, Poisson or multinomial distributions are sometimes invalid because of overdispersion. We will see more on this later when we study logistic regression and Poisson regression models.

Let us see how we can do some basic calculations with the World Cup Soccer example under the Poisson model.

QUESTION: What is the most likely mean number of goals scored; that is, what is the most likely value of the unknown parameter λ given the data x?

We can answer this question by relying on the basic principle of statistical inference, e.g., point estimation, confidence intervals and/or hypothesis testing.

Recall from Lesson 1 on Likelihood and MLE: The most common point estimate is the "maximum likelihood estimate" (MLE) which has nice statistical properties, and it is the most likely value of the parameter given the data; it is the value that maximizes the likelihood function.

The MLE of  λ from the Poisson distribution is the sample mean or the expectation of the distribution, and from the computation below for our example this is approximately:

\begin{align}
\bar{x} &= \dfrac{1}{95}\sum\limits_{i=1} x_i\\
&= \dfrac{1}{95} (0\times 23+1\times 37+2\times 20+3\times 11+4\times 2+5\times 1+6\times 0+ 7\times 0+ 8 \times 1)\\
&= \dfrac{131}{95}\\
&= 1.38\\
\end{align}

Thus, $\hat{\lambda}=1.38$ goals per first round matches.

$$[1.38-1.96\sqrt{1.38/95},1.38+1.96\sqrt{1.38/95}]=[1.14,1.62]$$

and we are 95% confident that the mean number of goals scored by a team during the first round match-ups will be somewhere between 1.14 and 1.62. Now that we have some estimate of the mean number of goals we can calculate the expected probabilities of a randomly chosen team scoring 0, 1, 2, 3, etc... number of goals, as well as the expected frequencies (or counts). For example, under this Poisson model with $\hat{\lambda}=1.38$, the expected probability of scoring 2 goals is $\hat{\pi}_2=p_2=P(X=2)=\frac{{1.38}^{2}e^{-1.38}}{2!}=0.239$ and the expected frequency is $np_2=95*0.239=22.75$ (see the 3rd row of the table below).

### Example - World Cup Soccer

Here is a link to the World Cup Soccer data (text file).

You can easily do these calculation by hand or in Excel or in any other software package you are using. Here they are in SAS and R.

Here is the SAS program soccer.sas.

For the complete output, see the course SAS page.

Please Note: Most PROC FREQ SAS options do NOT work for one-way tables, thus some coding is needed.

Here is a link to this code in R. soccer.R.

You can click on the 'Inspect' button below to see how the Poisson probabilities are calculated using R.

Please Note: There are some discrepancies between the R code file and Inspect! The file itself contains line comments explaining the code.

Here is a summary of these probabilities:

 Number of goals Observed Counts Expected probabilities under assumed Poisson model Expected Counts 0 23 0.252 23.93 1 37 0.347 32.99 2 20 0.239 22.75 3 11 0.110 10.46 4 2 0.038 3.61 5 1 0.010 0.99 6 0 0.002 0.23 7 0 0.0005 0.05 8 1 0.00008 0.01 Total 95

In the graphical form:

For additional general calculations with Poisson distribution, see the following methods in SAS and R:

Here is the SAS program for calculating Poisson probabilities: PoissonCal.sas.

Here is the R code for calculating Poisson probabilities: PoissonCal.R.

Here is a walk-through of this code:

Similarly binomial and multinomial sampling data also can be analyzed.

# 2.3.2 - Binomial Sampling

When data are collected on a pre-determined number of units and are then classified according to two levels of a categorical variable, a binomial sampling emerges. Consider the High-risk Drinking Example where we have high-risk drinkers versus non-high-risk drinkers. In this study there was a fixed number of trials (e.g., fixed number of students surveyed, n=1315) where the researcher counted the number of "successes" and the number of "failures" that occur. We can let X be the number of "successes" that is the number of students who are the high-risk drinkers. We can use the binomial probability distribution (i.e., binomial model), to describe this particular variable.

Binomial distributions are characterized by two parameters: n, which is fixed - this could be the number of trials or the total sample size if we think in terms of sampling, and π, which usually denotes a probability of "success". In our example this would be the probability that someone is a high-risk drinker in the population of Penn State students. Please note that some textbooks will use π to denote the population parameter and p to denote the sample estimate, whereas some may use p for the population parameters as well.  We may do both; don't be confused by this, just make sure to read carefully the specification. Once you know n and π, the probability of success, you know the mean and variance of the binomial distribution, and you know everything about that binomial distribution.

Below are the probability density function, mean and variance of the binomial variable.

$$f(x)=\dfrac{n!}{x!(n-x)!}π^x(1-π)^{n-x}\qquad \text{for }x=0,1,2,\ldots,n$$

Mean E (X) = nπ
Variance Var (X) = nπ (1 - π)

#### Binomial Model (distribution) Assumptions

• Fixed n: the total number of trials/events, (or total sample size) is fixed.
• Each event has two possible outcomes, referred to as "successes" or "failures", (e.g., each student can be either a heavy drinker or a non-heavy drinker; heavy drinker being a success here).
• Independent and Identical Events/Trials:
• Identical trials means that probability of success is the same for each trial.
• Independent means that the outcome of one trial does not affect the outcome of the other, (e.g. one student being a heavy drinker or not does not affect the status of the next student, and each student has the same probability, π, of being a heavy drinker.)

### Example - Heavy Drinking Students

QUESTION: What is the probability that no students are heavy drinkers, i.e., P(X = 0)?

Let's assume that π = 0.5.

\begin{align}
P(X=0|n=1315,\pi=0.5) &= \binom{n}{x} \pi^x(1-\pi)^{n-x}\\
&= \dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x}\\
&= \dfrac{1315!}{0!(1315-0)!}(0.5)^0(0.5)^{1315}\\
&= 1\cdot1\cdot(0.5)^{1315}\\
&\approx 0\\
\end{align}

 What's the probability that there are X = 1000 heavy drinkers in this example?

QUESTION
: What is the true population proportion of students who are high-risk drinkers at Penn State?

This is a statistical inference question that can be answered with a point estimate, confidence intervals and hypothesis tests about proportions. The likelihood function for Binomial L ; x) is a measure of how close the population proportion π is to the  data x; The Maximum Likelihood Estimate (MLE) is the most likely value for π given the observed data, and for the binomial distribution this is the sample mean,

$$\hat{\pi}=\dfrac{\sum x_i}{n}=\dfrac{x}{n}$$

and for the expected counts,

$$\hat{\mu}=\hat{\pi}\cdot n$$

Thus for our example, assuming the Binomial distribution, our "best" guess estimate of the true proportion of students who are high-risk drinkers is

$$p=\hat{\pi}=\dfrac{630}{1315}=0.48$$

Here are also the likelihood and loglikelihood graphs for our example. We can see that the peak of the likelihood is at the proportion value equal to 0.48. The loglikelihood looks quadratic which means that the large-sample normal theory should work fine, and we can use the approximate 95% confidence intervals.

The MLE is used for statistical inference, such as testing for a specific value of π, or giving a range of likely values of π, via the 95% confidence interval. A key result here comes from understanding the properties of the sampling distribution of the sample proportion p.

The Rule for Sample Proportion: If numerous samples of size n are taken with n large enough, the frequency curve of the sample proportions from the various samples will be approximately normal with the mean E(p)=π and variance Var(p)=π(1- π)/n . "Large enough" usually means that the number of successes and failures are not small, i.e., nπ ≥ 5, and n(1 - π) 5. The larger the sample size n, the sampling distribution of p is better approximated by a normal distribution.  Note that the sampling distribution of p is really a discrete rather than a continuous distribution, but we rely on the above described approximation for statistical inference unless we deal with small samples; for the latter case see Agresti (2007), Sec 1.4.3-1.4.5.

QUESTION: Is the population proportion of heavy-drinkers significantly different from 50%?

#### Large-sample hypothesis test about π

 H0: π = π0 vs. HA: π ≠ π0

The z test statistic: $$z=\dfrac{p-\pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}$$ where p is the sample proportion, in our case 0.48, and π0 is the value we are testing for, 0.5.

 H0: π = 0.5 vs. HA: π ≠ 0.5

The statistic z = -1.45 with two sided p-value of 0.136. Thus we do not have a very strong evidence that the proportion of high-risk drinkers is different from 50%; i.e., do not reject the null hypothesis.

### Confidence interval for π

The usual (1 - α) × 100% CI, holds:

$$p\pm z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}$$

This interval is known as the Wald confidence interval.

For our example, the 95% CI is 0.48 ± 1.96 × 0.014 = (0.453, 0.507). We can be 95% confident that the true population proportion of students high-risk drinkers is between 0.454 and 0.506.

However, when the proportions are extremely small or large, π < 0.20 or π > 0.80, this CI does not work very well. It is better to consider the likelihood-ratio-based CI, as discussed in Lesson 1.  This interval is more complex computationally but in essence simple by evaluating the likelihood function plotted below where we are looking for all possible values of $pi_0$ for which the null hypothesis would not be rejected. In our case we get 95% CI to be (0.453, 0.506), here is a plot of that confidence interval.

To do the above calculations in R and SAS, see the drinking.R and drinking.sas files below.  Also, watch the viewlets that will walk you through how these program works.

Here is the SAS program drinking.sas.

And click on the 'Inspect' icon to see a walk through of the last part of this program.

Here is the R program drinking.R.

And, here is a walk-through of this program.

If you encounter an error with the sink() function, please see the following page with  support materials for R.

The third alternative, also likelihood-based confidence interval, known as the Score confidence interval in essence is looking for ALL π0 values that yield the desired test statistics, e.g., for 95% CI, zα/2 = ± 1.96. To simplify, we need to solve the following equation:

$$\dfrac{|p-\pi_0|}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=z_{\alpha/2}$$

which is the same as:

$$\dfrac{|p-\pi_0|}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=z_{\alpha/2} \text{ and } \dfrac{|p-\pi_0|}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=-z_{\alpha/2}$$

To do this, we can solve the following quadratic equation:

$$\left(1+\dfrac{z^2_{\alpha/2}}{n}\right)\pi^2_0+\left(-2p-\dfrac{z^2_{\alpha/2}}{n}\right)\pi_0+p^2=0$$

$$a\pi^2_0+b\pi_0+c=0$$

$$\pi_0=\dfrac{-b\pm \sqrt{b^2-4ac}}{2a}$$

In this example, the 95% Score CI is the same as the Wald CI, $(0.453, 0.507)$. Notice that the only difference between the Wald CI and the Score CI, is the standard error where in the former its calculated using the value of sample proportion p and in the latter the null value of π0.

Traditionally, most software packages will give you the Wald CI, but nowadays we are starting to see the score and the likelihood-ratio ones too.

### Margin of error (MOE)

The margin of error is defined to be the length of the 95% CI for a single proportion.  So the 95% CI is p ± MOE; see the CNN example for a review. Confidence interval is the widest when π0 = 0.5. This knowledge is useful in determining sample size for given conditions.

#### Sample Size

Often we are interested in knowing what sample size is needed for a specific margin of error for a population proportion. For example, how large a sample would we need such that the 99% confidence interval is of width m. Solving the following for n:

$$m=2.575\sqrt{\pi(1-\pi)/n}$$

$$n=(2.575)^2 \pi(1-\pi)/m^2$$

Since π is unknown, take π = 0.5 to get the widest possible MOE. For example, if the required MOE is 3%

$$n=(2.575)^2(0.5)(0.5)/(0.03)(0.03)=1841.84$$

which is rounded up to 1842.

For additional simple Binomial distribution calculations see Lesson 0 and SimpleBinomialCal.R code with its viewlet that walks you through the code.

# 2.3.3 - Multinomial Sampling

Multinomial sampling may be considered as a generalization of Binomial sampling. Data are collected on a pre-determined number of individuals that is units and classified according to the levels of a categorical variable of interest (e.g., see Examples 4 through 8 in the Introduction of this Lesson).

X ∼ Mult (n, π), with the probability density function

$$f(x)=\dfrac{n!}{x_1!x_2!\cdots x_k!} \pi_1^{x_1}\pi_2^{x_2}\cdots \pi_k^{x_k}$$ , $$x=(x_1,\ldots,x_k)$$

where n is fixed and known sample size, and π is the vector of population proportions $$\pi=(\pi_1,\pi_2,\ldots,\pi_k)$$.

Each Xj represents a population count in the cell j, xj  observed sample count, and each πrepresents a population proportion in the cell j.

Multinomial distribution admits a mean vector with

$$E(X_j)=n\pi_j$$

The variance of Xj is

$$Var(X_j)=n\pi_j(1-\pi_j)$$

and the covariance between Xj and Xk is

$$cov(X_j,X_k)=-n\pi_j\pi_k$$

Each cell thus has a binomial marginal distribution. Binomial is a special case of multinomial for k = 2.

$$f(x)=\dfrac{n!}{x_1!x_2!}\pi_1^{x_1}\pi_2^{x_2}$$

let,

\begin{array}{lcr}
x &= & x_1\\
x_2 &=& n-x\\
\pi &=& \pi_1\\
\pi_2 &=& 1-\pi
\end{array}

then,

$$f(x)=\binom{n}{x}\pi^x(1-\pi)^{n-x}=\dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x},\qquad x=0,1,2,\ldots,n$$

To be consistent with Agresti(2007), pg.4 or Agresti(2013), pg.6, where the random variable is denoted as Y, and probability density function as P(y), we can write the above as,

$$p(y)=\dfrac{n!}{y!(n-y)!}\pi^y(1-\pi)^{n-y},\qquad y=0,1,2,\ldots,n$$

If n is random (not fixed, as was the case with the total number of goals scored in the World Soccer example), but we have no interest in making inferences about the sample size n only about the underlying cell probabilities, we can apply the multinomial model instead of the Poisson model and the inferences will be the same. But we do have to worry about other kinds of model failure; e.g., are the assumptions of the multinomial model satisfied?

Multinomial Model (distribution) Assumptions:

1. the n trials are independent, and
2. the parameter vector π remains constant from trial to trial.

The most common violation of these assumptions occurs when clustering is present in the data. Clustering means that some of the trials occur in groups or clusters, and that trials within a cluster tend to have outcomes that are more similar than trials from different clusters. Clustering can be thought of as a violation of either (a) or (b).

### Example - Eye Color

In this example, eye color was recorded for n = 96 persons.

 Eye color Count Brown 46 Blue 22 Green 26 Other 2 Total 96

Suppose that the sample included members from the same family as well as unrelated individuals. Persons from the same family are more likely to have similar eye color than unrelated persons, so the assumptions of the multinomial model would be violated. If both parents have brown eye color, it is very likely that their offspring will also have brown eye color. Whereas eye color of family members related by marriage will not violate multinomial assumption, distribution of eye color of blood relations will.

Now suppose that the sample consisted of "unrelated" persons randomly selected within Pennsylvania. In other words, persons are randomly selected from a list of Pennsylvania residents. If two members of the same family happen to be selected into the sample purely by chance, that's okay; the important thing is that each person on the list has an equal chance of being selected, regardless of who else is selected.

Could this be considered a multinomial situation? For all practical purposes, yes. The sampled individuals are not independent according the common English definition of the word, because they all live in Pennsylvania. But we can suppose that they are independent from a statistical viewpoint, because the individuals are exchangeable; before the sample is drawn, no two are a priori any more likely to have the same eye color than any other two.

 Based on the readings on Multinomial distribution in Lesson 1 and 2 can you answer the following questions:  What is the MLE  vector for eye color example? If we fuse "other" eye color with "brown", how does the distribution change; i.e., What is the Mulitnomial distribution now? It turns out that we can partition "brown" eyes as 20 with "hazel" color and 26 with "dark brown". How would you characterize these distributions now?

# 2.4 - Goodness-of-Fit Test

A goodness-of-fit test, in general, refers to measuring how well do the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. Like in a linear regression, in essence, the goodness-of-fit test compares the observed values to the expected (fitted or predicted) values.

A goodness-of-fit statistic tests the following hypothesis:

H0: the model M0 fits
vs.
HA
: the model M0 does not fit (or, some other model MA fits)

Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis (HA) will represent the saturated model MA which fits perfectly because each observation has a separate parameter.  Later in the course we will see that MA could be a model other than the saturated one. Let us now consider the simplest example of the goodness-of-fit test with categorical data.

In the setting for one-way tables, we measure how well an observed variable X corresponds to a Mult (n, π) model for some vector of cell probabilities, π. We will consider two cases:

1. when vector π is known, and
2. when vector π is unknown.

In other words, we assume that under the null hypothesis data come from a Mult (n, π) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodness-of-fit test is applied to corroborate our assumption.

Consider our Dice Example from the Introduction. We want to test the hypothesis that there is an equal probability of six sides; that is compare the observed frequencies to the assumed model: X ∼ Multi (n = 30, π0 = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)). You can think of this as simultaneously testing that the probability in each cell is being equal or not to a specified value, e.g.

H0: ( π1, π2, π3, π4, π5, π6) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

vs.

HA: ( π1, π2, π3, π4, π5, π6) ≠ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

Most software packages will already have built-in functions that will do this for you; see the next section for examples in SAS and R. Here is a step-by step procedure to help you conceptually understand this test better and what is going on behind these functions.

 Step 1: If vector π is unknown we need to estimate these unknown parameters, and proceed to Step 2; If vector π is known proceed to Step 2. Step 2: Calculate the estimated (fitted) cell probabilities $\hat{\pi_j}$s, and expected cell frequencies, Ej's under H0.  Step 3: Calculate the Pearson goodness-of-fit statistic, X 2 and/or the deviance statistic, G2 and compare them to appropriate chi-squared distributions to make a decision. Step 4: If the decision is borderline or if the null hypothesis is rejected, further investigate which observations may be influential by looking, for example, at residuals.

### Pearson and deviance test statistics

The Pearson goodness-of-fit statistic is

$$X^2=\sum\limits_{j=1}^k \dfrac{(X_j-n\hat{\pi}_j)^2}{n\hat{\pi}_j}$$

An easy way to remember it is

$$X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}$$

where Oj = Xj is the observed count in cell j, and $E_j=E(X_j)=n\hat{\pi}_j$ is the expected count in cell j under the assumption that null hypothesis is true, i.e. the assumed model is a good one. Notice that $$\hat{\pi}_j$$ is the estimated (fitted) cell proportion πj under H0.

The deviance statistic is

$$G^2=2\sum\limits_{j=1}^k X_j \text{log}\left(\dfrac{X_j}{n\hat{\pi}_j}\right)$$

where "log" means natural logarithm. An easy way to remember it is

$$G^2=2\sum\limits_j O_j \text{log}\left(\dfrac{O_j}{E_j}\right)$$

In some texts, G2 is also called the likelihood-ratio test statistic, for comparing the likelihoods (l0 and l1 ) of two models, that is comparing the loglikelihoods under H0 (i.e., loglikelihood of the fitted model, L0) and loglikelihood under HA (i.e., loglikelihood of the larger, less restricted, or saturated model L1): G2 = -2log(l0/l1) = -2(L0 - L1). A common mistake in calculating G2 is to leave out the factor of 2 at the front.

Note that X2 and G2 are both functions of the observed data X and a vector of probabilities π. For this reason, we will sometimes write them as X2(x, π) and G2(x, π), respectively; when there is no ambiguity, however, we will simply use X2 and G2. We will be dealing with these statistics throughout the course; in the analysis of 2-way and k-way tables, and when assessing the fit of log-linear and logistic regression models.

### Testing the Goodness-of-Fit

X2 and G2 both measure how closely the model, in this case Mult (n, π) "fits" the observed data.

• If the sample proportions pj = Xj /n (i.e., saturated model) are exactly equal to the model's πj for cells j = 1, 2, . . . , k, then Oj = Ej for all j, and both X2 and G2 will be zero. That is, the model fits perfectly.
• If the sample proportions pj deviate from the $\hat{\pi}$'s computed under H0, then X2 and G2 are both positive. Large values of X2 and G2 mean that the data do not agree well with the assumed/proposed model M0.

#### How can we judge the sizes of X2 and G2?

The answer is provided by this result:

If x is a realization of X ∼ Mult(n, π), then as n becomes large, the sampling distributions of both X2(x, π) and G2(x, π) approach chi-squared distribution with df = k -1, where k = number of cells, χ2k−1.

This means that we can easily test a null hypothesis H0 : π = π0 against the alternative H1 : ππ0 for some pre-specified vector π0. An approximate α-level test of H0 versus H1 is:

Reject H0 if computed X2(x, π0) or G2(x, π0) exceeds the theoretical value χ2 k−1(1 − α).

Here, χ2k−1(1 − α) denotes the (1 − α)th quantile of the χ2k−1 distribution, the value for which the probability that a χ2k−1 random variable is less than or equal to it is 1 − α. The p-value for this test is the area to the right of the computed X2 or G2 under the χ2k−1 density curve. Below is a simple visual example. Consider a chi-squared distribution with df=10. Let's assume that a computed test statistic is X2=21. For α=0.05, the theoretical value is 18.31.

Useful functions in SAS and R to remember for computing the p-values from the chi-square distribution are:

• In R, p-value=1-pchisq(test statistic, df), e.g., 1-pchisq(21,10)=0.021
• In SAS, p-value=1-probchi(test statistic,df), e.g.,1-probchi(21,10)=0.021

You can quickly review the chi-squared distribution in Lesson 0, or check out http://www.statsoft.com/textbook/stathome.html and http://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/index.html. The STATSOFT link also has brief reviews of many other statistical concepts and methods.

Here are a few more comments on this test.

• When n is large and the model is true, X2 and G2 tend to be approximately equal. For large samples, the results of the X2 and G2 tests will be essentially the same.
• An old-fashioned rule of thumb is that the χ2 approximation for X2 and G2 works well provided that n is large enough to have Ej = nπj ≥ 5 for every j. Nowadays, most agree that we can have Ej< 5 for some of the cells (say, 20% of them). Some of the Ej's can be as small as 2, but none of them should fall below 1. If this happens, then the χ2 approximation isn't appropriate, and the test results are not reliable.

• In practice, it's a good idea to compute both X2 and G2 to see if they lead to similar results. If the resulting p-values are close, then we can be fairly confident that the large-sample approximation is working well.

• If it is apparent that one or more of the Ej's are too small, we can sometimes get around the problem by collapsing or combining cells until all the Ej's are large enough. But we can also perform a small-sample inference or exact inference. We will see more on this in Lesson 3. Please note that the small-sample inference can be conservative for discrete distributions, that is may give a larger p-value than it really is (e.g., for more details see Agresti (2007), Sec. 1.4.3-1.4.5, and 2.6; Agresti (2013), Sec. 3.5, and for Bayesian inference Sec 3.6.)

• In most applications, we will reject the null hypothesis X ∼ Mult (n, π) for large values of X2 or G2. On rare occasions, however, we may want to reject the null hypothesis for unusually small values of X2 or G2. That is, we may want to define the p-value as P2 k−1 X2) or P(χ2 k−1G2). Very small values of X2 or G2 suggest that the model fits the data too well, i.e. the data may have been fabricated or altered in some way to fit the model closely. This is how R.A. Fisher figured out that some of Mendel's experimental data must have been fraudulent (e.g., see Agresti (2007), page 327; Agresti (2013), page 19).

# 2.5 - Examples in SAS/R: Dice Rolls & Tomato

## Example - Dice Rolls

Suppose that we roll dice 30 times and observe the following table showing the number of times each face ends up on top.

 Face Count 1 3 2 7 3 5 4 10 5 2 6 3 Total 30

We want to test the null hypothesis that the dice is fair. Under this hypothesis, X ∼ Mult(n = 30, π0) where π0 = (1/6 , 1/6 , 1/6 , 1/6 , 1/6 , 1/6 ).

This is our assumed model, and under this H0, the expected counts are Ej = 30*1/6= 5 for each cell j. One way to calculate the difference between what we are assuming and what we are observing is to evaluate the the goodness-of-fit statistics as discussed in  the previous section.

\begin{eqnarray*}
X^2 &= & \dfrac{(3-5)^2}{5}+\dfrac{(7-5)^2}{5}+\dfrac{(5-5)^2}{5}\\
& & +\dfrac{(10-5)^2}{5}+\dfrac{(2-5)^2}{5}+\dfrac{(3-5)^2}{5}\\
&=& 9.2
\end{eqnarray*}

\begin{eqnarray*}
G^2 &=& 2\left(3\text{log}\dfrac{3}{5}+7\text{log}\dfrac{7}{5}+5\text{log}\dfrac{5}{5}\right.\\
& & \left.+ 10\text{log}\dfrac{10}{5}+2\text{log}\dfrac{2}{5}+3\text{log}\dfrac{3}{5}\right)\\
&=& 8.8
\end{eqnarray*}

Note that numerical values of Χ2 and G2 can be different. We can calculate the p-values for these statistics in order to help determine how well this model fits. The p-values are P25 ≥ 9.2) = .10 and P25 ≥ 8.8) = .12. Given these p-values, with the critical value or Type I error of α=0.05, we fail to reject the null hypothesis. But rather than relying on p-values, we should also think about the actual values of X2 and G2statistics. Small values imply a good fit here, i.e., the observed values are close to what you had assumed. Large values are going to imply the opposite. In this case, the fair-dice model doesn't fit the data very well, but the fit isn't bad enough to conclude beyond a reasonable doubt that the dice is unfair.

Next, we show how to do this in SAS and R; you can run either.

The following SAS code, dice_rolls.sas, will perform the goodness-of-fit test for the example above.

Here is the output generated by this program, and the headings are self-explanatory:

Notice that this SAS code only computes the Pearson chi-square statistic and not the deviance statistic.

Here is a short video describing the above output.
 Can you identify the relevant statistics and the p-value in the output? What does the column labeled "Percent" represent?

The following R code, dice_rolls.R will perform the same analysis as what we just did in SAS. You can copy the entire code in the R window and your output will be saved into two files, dice_rolls.out and dice_rolls_Results. If you are new to R, however, I suggest you run this line by line to get more familiar with the commands; you can simply copy and past into the R-terminal. For other options, see a very basic intro to R.

 dice_rolls.R dice_rolls.out (part of the output)

Here are a few additional comments regarding the above R code and its output. If you run Inspect, the data there will be labeled as "ex7" where in the updated code the label is "dice", but it's the same data. The last part of the R code does the same analysis, but gives an output that looks more like what SAS produces; we did this to show a bit more programming in R and how you can play with creating different outputs.

IMPORTANT! The Pearson chi-squared statistic is X2 = 9.2 (p-value=0.101), the deviance statistic (from the R output) is G2=8.78 (p-value=0.118), and they both follow the chi-squared sampling distribution with df=5. It is not uncommon to observe the discrepancy in the values between these two statistics especially for the smaller sample sizes. Notice, however, that in this case they do lead to the same conclusion --- there is a moderate evidence that the dice is fair.

 Can you identify the relevant statistics and the p-value in the output? What does the column labeled "Percentage" in dice_rolls.out represent?

### Example - Tomato Phenotypes

Tall cut-leaf tomatoes were crossed with dwarf potato-leaf tomatoes, and n = 1611 offspring were classified by their phenotypes.

 Phenotypes Count tall cut-leaf 926 tall potato-leaf 288 dwarf cut-leaf 293 dwarf potato-leaf 104

Genetic theory says that the four phenotypes should occur with relative frequencies 9 : 3 : 3 : 1, and thus are not all equally as likely to be observed. The dwarf potato-leaf is less likely to observed than the others. Do the observed data support this theory?

Under the null hypothesis, the probabilities are

π1 = 9/16 , π2 = π3 = 3/16 , π4 = 1/16 ,

and the expected frequencies are

E1 = 1611(9/16) = 906.2,
E
2 = E3 = 1611(3/16) = 302.1, and
E
4 = 1611(1/16) = 100.7.

We calculate the fit statistics and find that X2 = 1.47 and G2 = 1.48, which are nearly identical. The p-values based on the χ2 distribution with 3 d.f. are approximately equal to 0.69. Therefore, we fail to reject the null hypothesis and the data are consistent with the genetic theory. Again, the small values of these statistics imply that the fit between the data and the proposed model is good!

Here is how to do the computations in R using the following code :

This has step-by-step calculations and using of built-in chisq.test() with producing some nice output including Pearson and deviance residuals.

Here is the above analysis done in SAS:

 Do you recall what the residuals are from linear regression? How would you define them in this context? What do they tell you about the tomato example?

# 2.6 - Residuals

How large is the discrepancy between the two proposed models? The previous analysis provides a summary of differences between the proposed models. If we want to know more specifically where these differences are, and which differences may lead to potentially rejecting the null hypothesis, residuals can be inspected for relevant clues. There are two types of residuals we will consider: Pearson and deviance residuals.

Pearson residuals

The Pearson goodness-of-fit statistic can be written as $$X^2=\sum\limits_{j=1}^k r^2_j$$ , where

$$r^2_j=\dfrac{(X_j-n\hat{\pi}_j)^2}{n\hat{\pi}_j}=\dfrac{(O_j-E_j)^2}{E_j}$$

represents the contribution to X2 by cell j. The quantity

$$r_j=\dfrac{X_j-n\hat{\pi}_j}{\sqrt{n\hat{\pi}_j}}=\dfrac{O_j-E_j}{\sqrt{E_j}}$$

is called the Pearson residual for cell j, and it compares the observed with the expected counts. The sign (positive or negative) indicates whether the observed frequency in cell j is higher or lower than the value fitted under the model, and the magnitude indicates the degree of departure. When data do not fit a model, examination of the Pearson residuals often helps to diagnose where the model has failed.

How large should a "typical" value of  rj be? Recall that the expectation of a χ2 random variable is its degrees of freedom. Thus if a model is true, E(X2) ≈ E(G2) ≈ k − 1, and the typical size of a single rj2  is (k − 1)/k. Thus, if the absolute value, |rj|, is much larger than $$\sqrt{(k-1)/k}$$—say, 2.0 or more—then the model does not appear to fit well for cell j. This helps to identify those cells that do not fit the expected model as closely as we might have assumed.

Deviance residuals

Like the X2 statistic, the deviance, G2, can be regarded as the sum of squared Deviance residuals,

$$G^2=\sum\limits_{j=1}^k \epsilon^2_j$$

where

$$\epsilon_j=\sqrt{\left|2X_j\text{log}\dfrac{X_j}{n\hat{\pi}_j}\right|}\times \text{sign}(X_j-n\hat{\pi}_j)$$

The sign function can take three values:

•  -1 if (Xj - nπj ) < 0,
•   0 if (Xj- nπj ) = 0, or
•   1 if (Xj- nπj) > 0.

When the expected counts Ej are all fairly large (much greater than 5) the deviance and Pearson residuals resemble each other quite closely.

### Example - Dice Rolls cont'd.

Below is a table of observed counts, expected counts, and residuals for the fair-die example; for calculations see dice_rolls.R. Unfortunately, CELLCHI2 option in SAS that gives these residuals does NOT work for one-way tables; we will use it for higher-dimensional tables. Please note, this is one of the reasons why we give examples in both SAS and R; that is, there is not a single software package that can do all the analysis, and often we need to write a piece of code. For additional calculations see tomato_types.R.

 cell j Oj Ej rj εj 1 3 5 -0.89 -1.75 2 7 5 +0.89 2.17 3 5 5 +0.00 0.00 4 10 5 +2.24 3.72 5 2 5 -1.34 -1.91 6 3 5 +0.89 -1.75

The only cell that seems to deviate substantially from the fair-dice model is cell j = 4. If the dice is not fair, then it may be "loaded" in favor of the outcome 4. But recall that the p-value was about .10, so the evidence against fairness is not overwhelming.

#### Effects of Zero Cell Counts.

If an Xj is zero and all πj's are positive, then the Pearson X2 can be calculated without any problems but there is a problem in computing the deviance, G2; if Xj = 0 then the deviance residual is undefined, and if we use the standard formula,

$$G^2=2\sum\limits_{j=1}^k X_j\text{log}\dfrac{X_j}{n\hat{\pi}_j}$$

an error will result. But if we write the deviance to avoid the potential problem of division by zero:

$$G^2=2\text{log}\dfrac{L(p;X)}{L(\hat{\pi};X)}=2\text{log}\prod\limits_{j=1}^k \left(\dfrac{X_j/n}{\hat{\pi}_j}\right)^{X_j}$$

a cell with Xj = 0 contributes 1 to the product and may be ignored. Thus we may calculate the deviance statistic as

$$G^2=2\sum\limits_{i:X_j>0} X_j\text{log}\dfrac{X_j}{n\hat{\pi}_j}$$

Alternatively, we can set the deviance residuals to zero for cells with Xj = 0 and take $G^2= \sum_j \epsilon_j^2$ as before. But if we do that, $\epsilon_j = 0$ should not be interpreted as "the model fits well in cell j." The fit could be quite poor, especially if Ej is large.

If any element of vector π is zero, then X2 and G2 both break down. One simple possible way to avoid such problems is to put a very small mass in all the cells, including the zero-cell (e.g., $10^{-5}$, $0.05$, or do a sensitivity analysis by removing or collapsing cells and re-doing the tests, or try the Bayesian analysis (see Agresti (2013), Sec 1.6).

Structural zeros and sampling zeros. Note that observed zero's in the cell do not necessarily mean that there cannot be any observation in those cells. In the example of World Cup soccer, no team scored 6 or 7 goals in round one of the tournament. That does not mean that scoring 6 or 7 goals is an impossibility, that is the underlying cell probabilities $\pi_j$'s are positive but in the same we happen to observe no outcomes. These are examples of sampling zeros. Alternatively consider an example of categorizing male and female patients on whether they have carcinoma. Since only females can have ovarian cancer and males can have prostate cancer, the following cells are said to have structural zeros, and the underlying cell probabilities $\pi_j=0$. We will work with structural zero's later. To handle structural zeros, an adjustment to the degrees-of-freedom is required, whereas sampling zeros do not require any special consideration other than introduction of small mass in those cells.

Patient
Ovarian Cancer
Prostate cancer
Male
0
7
Female
10
0
Total 10 7

### Example - Soccer

For the soccer example we can test how well does the Poisson model fits the observed data (see soccer.R and soccer.sas).  If we ignore the Poisson model at first, and test if the data fit the Multinomial model with all scores having equal probability of occurring, then from the chi-squared goodness-of-fit test we get X2 = 127.66 (after adding 1/2 count to each cell in the table), df = 8, p-value near to zero and G2 = 128.64, df = 8, p-value near to zero, we conclude that the tested Multinomial model does not fit the observed data well. Almost all cells but the 4th one have high residual values, e.g., absolute values of residuals is greater than 2. To test if the Poisson model fits, the underlying cell proportions depend on the unknown parameter $\lambda$ that we first need to estimate; the value is $\hat{\lambda}=1.38$ and this will also reduce the degrees of freedom by 1 (see the next section). From the chi-squared goodness-of-fit test we get X2 = 128.79, df = 7, p-value near to zero. Thus the tested Poisson model does not fit the observed data well.  However, if we look at the Pearson residuals, we see that the model fits well except for the very last cell which has a very large residual value of 11.267; e.g., absolute values of residuals is greater than 2. Removing that cell or combining it with the previous 3 cells, may give a better result.

# 2.7 - Goodness-of-Fit Tests: Cell Probabilities Functions of Unknown Parameters

For many statistical models, we do not know the vector of probabilities π a priori, but can only specify it up to some unknown parameters. More specifically, the cell proportions may be known functions of one or more other unknown parameters.

Hardy-Weinberg problem. Suppose that a gene is either dominant (A) or recessive (a), and the overall proportion of dominant genes in the population is p. If we assume mating is random (i.e. members of the population choose their mates in a manner that is completely unrelated to this gene), then the three possible genotypes—AA, Aa, and aa—should occur in the so-called Hardy-Weinberg proportions:

 genotype proportion no. of dominant genes AA π1 = p2 2 Aa π2 = 2p(1 − p) 1 aa π3 = (1 − p)2 0

Note that this is equivalent to saying that the number of dominant genes that an individual has (0, 1, or 2) is distributed as Bin(2, p), where the parameter p is not specified. we have to first estimate p to be able to estimate (i.e., say something about the) the unknown cell proportions in vector  π.

Number of Children (The Poisson Model). Suppose that we observe the following numbers of children in n = 100 families:

 no. of children: 0 1 2 3 4+ count: 19 26 29 13 13

Are these data consistent with a Poisson distribution? Recall that if a random variable Y has a Poisson distribution with mean λ, then

$$P(Y=y)=\dfrac{\lambda^y e^{-\lambda}}{y!}$$

for y = 0, 1, 2, . . .. Therefore, under the Poisson model, the proportions given some unknown λ, are provided in the table below. For example, $\pi_1=P(Y=0)=\dfrac{\lambda^0 e^{-\lambda}}{0!}=e^{-\lambda}$.

 no. of children proportion 0 π1 = e−λ 1 π2 = λe−λ 2 π3 = λ2e−λ/2 3 π4 = λ3e−λ/6 4+ π5 = 1 − Σ4j=1 πj

In both of these examples, the null hypothesis is that the multinomial probabilities πj depend on one or more unknown parameters in a known way. In the children's example, we maybe want to know the proportion of the families in the sample population that have 2 children. In more general notation, the model specifies that:

π1 = g1(θ),
π2 = g2(θ),
...
πk = gk(θ),

where g1, g2, . . . , gk are known functions but the parameter θ is unknown (e.g., $\lambda$ in the children's example or $p$ in the genetics example). Let S0 denote the set of all π that satisfy these constraints for some parameter θ. We want to test

H0 : π ∈ S0  versus   H1 : π ∈ S,

where S denotes the probability simplex (the space) of all possible values of π. (Notice that S is a (k − 1)-dimensional space, but the dimension of S0 is the number of free parameters in θ.)

The method for conducting this test is as follows.

1. Estimate θ by an efficient method (e.g. maximum likelihood). Call the estimate $$\hat{\theta}$$.
2. Calculate estimated cell probabilities $$\hat{\pi}=(\hat{\pi}_1,\hat{\pi}_2,\ldots,\hat{\pi}_k)$$, where
3. $$\hat{\pi}_1=g_1(\hat{\theta})$$
$$\hat{\pi}_2=g_2(\hat{\theta})$$
...
$$\hat{\pi}_k=g_k(\hat{\theta})$$

4. Calculate the goodness-of-fit statistics $$X^2(x,\hat{\pi})$$ and $$G^2(x,\hat{\pi})$$. That is, calculate the expected cell counts 1$$E_1=n\hat{\pi}_1$$, $$E_2=n\hat{\pi}_2$$, . . ., $$E_k=n\hat{\pi}_k$$ , and find
5. $$X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}$$ and $$G^2=2\sum\limits_j O_j \text{log}\dfrac{O_j}{E_j}$$ as usual.

If  $$X^2(x,\hat{\pi})$$ and $$G^2(x,\hat{\pi})$$ are calculated as described above, then the distribution of both X2 and G2 under the null hypothesis as n → ∞, approaches χ2ν, where ν equals the number of unknown parameters under the alternative hypothesis minus the number of unknown parameters under the null hypothesis, ν = (k − 1) − d, where d = dim(θ), i.e., the number of parameters in θ.

The difference between this result and the previous one is that the expected cell counts E1, E2, . . . , Ek used to calculate X2 and G2 now contain unknown parameters. Because we need to estimate d parameters to find E1, E2, . . . , Ek, the large-sample distribution of X2 and G2 has changed; it's still a chi-squared distribution, but the degrees of freedom have dropped by d , the number of unknown parameters we first need to estimate.

### Example - Number of Children, continued

Are the data below consistent with a Poisson model?

 no. of children: 0 1 2 3 4+ count: 19 26 29 13 13

Let's test the null hypothesis that these data are Poisson. First, we need to estimate λ, the mean of the Poisson distribution, and thus here $d=1$. Recall that if we have an iid sample y1, y2, . . . , yn from a Poisson distribution, then the ML estimate of λ is just the sample mean, $$\hat{\lambda}=n^{-1}\sum_{i=1}^n y_i$$. Based on the table above, we know that the original data y1, y2, . . . , yn contained 19 values of 0, 26 values of 1, and so on; however, we don't know the exact values of the original data that fell into the category 4+. It is also not easy to find MLE of λ in this situation without involved numerical computations. To make matters easy,  suppose for now that of the 13 values that were classified as 4+, ten were equal to 4 and three were equal to 5. Then the ML estimate of λ is, therefore,

$$\hat{\lambda}=\dfrac{19(0)+26(1)+29(2)+13(3)+10(4)+3(5)}{100}=1.78$$

Under this estimate of λ, the expected counts for the first four cells (0, 1, 2, and 3 children, respectively) are

E1 = 100e−1.78 = 16.86,
E2 = 100(1.78)e−1.78 = 30.02,
E3 = 100(1.78)2e−1.78/2 = 26.72,
E4 = 100(1.78)3e−1.78/6 = 15.85.

The expected count for the 4+ cell is most easily found by noting that Σj Ej = n, and thus

E5 = 100 − (16.86 + 30.02 + 26.72 + 15.85) = 10.55.

X2 = 2.08 and G2 = 2.09.

Since the general multinomial model here has k − 1 = 4 parameters, where k is the number of cell that is $\pi_j$'s,  and the Poisson model has just one parameter $\lambda$, the degrees of freedom for this test are ν = 4 − 1 = 3, and the p-values are

$$P(\chi^2_3\geq2.08)=0.56$$

$$P(\chi^2_3\geq2.09)=0.55$$

The Poisson model seems to fit well; there is no evidence that these data are not Poisson. Below is an example of how to do these computations using R and SAS.

Here is how to fit the Poisson Model in R using the following code :

The function dpois() calculates Poisson probabilities. You can also get the X2 in R by using function chisq.test(ob, p=pihat) in the above code, but notice in the output below, that the degrees of freedom, and thus the p-value, are not correct:

Chi-squared test for given probabilities

data: ob
X-squared = 2.0846, df = 4, p-value = 0.7202

You can use this X2 statistic, but need to calculate the new p-value based on the correct degrees of freedom, in order to obtain correct inference.

Here is how we can do this goodness-of-fit test in SAS, by using pre-calculated proportions (pihats). The TESTP option specifies expected proportions for a one-way table chi-square test. But notice in the output below, that the degrees of freedom, and thus the p-value, are not correct:

     Chi-Square Test
for Specified Proportions
-------------------------
Chi-Square         2.0892
DF                      4
Pr > ChiSq         0.7194

You can use this X2 statistic, but need to calculate the new p-value based on the correct degrees of freedom, in order to obtain correct inference.

 children.sas (program - text file) children.lst (output - text file)

# 2.8 - Summary

The frist two lessons introduced the key distributions (Poisson, binomial, and multinomial), for sampling schemes for categorical data. It also introduced  basic notations and reviewed hypothesis testing, confidence intervals, likelihood and maximum likelihood estimation. Maximum likelihood is the key estimation method for statistical inference with categorical data. We also made a distinction between three types of large-sample theory hypothesis tests and related confidence intervals used in the analysis of categorical data: Wald, Likelihood-Ratio, and Score. These will appear over and over again throughout the course. The Wald test is the most widely used one. For example, we will see when we fit a logistic regression model that the z-statistic and the confidence intervals for the regression parameter estimates are Wald CIs. Thus, unless specified, you can assume that a given test statistic and approximate confidence intervals are based on the Wald inference.

These concepts were demonstrated through four examples:

Lesson 2 also introduced the goodness-of-fit test, goodness-of-fit test statistics (e.g, Pearson's chi-square and Deviance), and the corresponding residuals. The distinction was made between the goodness-of-fit tests with known versus unknown population parameters. This idea of comparing the assumed model (H0) to a distribution of the observed data, and assessing how close they fit, will be used throughout the course, as will these test statistics. For example, in the next lesson we will see that the Chi-squared test of independence is just a goodness-of-fit test where the assumed model (H0) of the independence of two random variables is compared to a saturated model, that is to the observed data.

These concepts were further demonstrated in SAS and R with four examples: