# Lesson 10: Log-Linear Models

### Introduction to Loglinear Models

Thus far in the course we have alluded to log-linear models several times, but have never got down to the basics of it. When we dealt with inter-relationships among several categorical variables, our focus had been on describing independence, interactions or associations between two, three or more categorical variables mostly via

• single summary statistics, and
• with significance testing.

Log-linear models go beyond a single summary statistics and specify how the cell counts depend on the levels of categorical variables. They model the association and interaction patterns among categorical variables. The log-linear modeling is natural for Poisson, Multinomial and Product-Mutlinomial sampling. They are appropriate when there is no clear distinction between response and explanatory variables, or there are more than two responses. This is a major difference between logistic models and log-linear models. In the former a response is identified, but no such special status is assigned to any variable in log-linear modelling. By default log-linear models assume discrete variables to be nominal, but these models can be adjusted to deal with ordinal and matched data. Log-linear models are more general than logit models, but some log-linear models have direct correspondence to logit models.

Consider graduate admissions at Berkeley.  We may consider all possible relationships among A = Admission, D = Department and S = Gender. Alternatively, we may consider as response and and as covariates in which case the possible logit models are:

• logit model for A with only an intercept;
• logit model for A with a main effect for D;
• logit model for A with a main effect for S;
• logit model for A with a main effects for D and S; and
• logit model for A with main effects for D and S and the D × S interaction.

Corresponding to each of the above a log-linear model may be defined. The notations below follow those of Lesson 5.

• Model of joint independence (DS, A), which indicates neither D nor S has an effect on A is equivalent to a logit model for A with only an intercept;
• Model of conditional independence (DS, DA), which indicates that sex has no effect on A after the effect of department is included, is equivalent to a logit model for A with a main effect for D;
• Another conditional independence model (DS, SA) is equivalent to a logit model for A with a main effect for S only;
• Model of no three-factor interaction (DS, DA, SA) indicates that the effect of sex on A is the same at each level of department, is equivalent to a logit model for A with main effects for D and S; and
• Model of three-factor interaction or the saturated model (DSA) indicates that the effect of sex on A varies across departments and is equivalent to a logit model for A with main effects for D and S and the D × S interaction.

“Equivalent," means that two models give equivalent goodness-of-fit statistics relative to a saturated model, and equivalent expected counts for each cell. Log-linear models are not exactly the same as logit models, because the log-linear models describe the joint distribution of all three variables, whereas the logit models describe only the conditional distribution of A given D and S. Log-linear models have more parameters than the logit models, but the parameters corresponding to the joint distribution of D and S are not of interest.

In general, to construct a log-linear model that is equivalent to a logit model, we need to include all possible associations among the predictors. In the Berkeley example, we need to include DS in every model.  This lesson will walk-through examples how this is done in both SAS and R.

In subsequent sections we look at the log-linear models in more detail. The two great advantages of log-linear models are that they are flexible and they are interpretable. Log-linear models have all the flexibility associated with ANOVA and regression. We have mentioned before that log-linear models are also another form of GLM. They also have natural interpretations in terms of odds and frequently have interpretations in terms of independence, as we have shown above.

# 10.1 - Log-Linear Models for Two-way Tables

In this section we will focus on the following key concepts and objectives.

#### Key Concepts:

• Two-way log-linear model of independence
• Saturated two-way log-linear model
• Parameters Constraints, Estimation and Interpretation for these models
• Inference for log-linear models

#### Objectives

• Understand the structure of the log-linear models in two-way tables
• Understand the concepts of independence and associations described via log-linear models in two-way tables

Log-linear models for two-way tables describe associations and interaction patterns among two categorical random variables.

Recall, that a two-way ANOVA models the expected value of a continuous variable (e.g., plant length) depending on the levels of two categorical variables (e.g., low/high sunlight and low/high water amount). In contrast, the log-linear models express the cell counts (e.g., the number of plants in a cell) depending on levels of two categorical variables.

Let μij be the expected counts, E(nij), in an I × J table, created by two random variables A and B

Objective:

Model the cell counts: μij = nπij

Model structure:

An analogous saturated log-linear model to two-way ANOVA with interaction is:

$$\text{log}(\mu_{ij})=\mu+\alpha_i+\beta_j+\gamma_{ij}$$

or in the notation used by Agresti:

$$\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_{ij}^{AB}$$

where i = 1, ..., I, j = 1, ...., J, are levels of categorical random variables A and B, with constraints: Σi λi = Σj λj = Σi Σj λij = 0, to deal with overparametrization. Overparametrization means that the number of parameters is more than that can be uniquely estimated. This model is over-parametrized because term λij already has I × J parameters corresponding to the cell means μij. The constant, λ, and the "main effects", λi and λj give us additional 1 + I + J parameters. Superscripts denote variables A and B. We will see more on this in the next sections.

Model Assumptions:

The N = I × J counts in the cells are assumed to be independent observations from a Poisson random variable, $n_{ij}\sim$Poisson(μij). The log-linear modeling is natural for Poisson, Multinomial and Product-Multinomial sampling, like we have discussed in earlier lectures.

Recall the Vitamin C study, a 2 × 2 example from Lesson 3. Are the type of treatment and contracting cold independent? If there are associated, in which way are they associated?

You already know how to answer the above questions via the chi-square test of independence, but now we want to model the cell counts with the log-linear model of independence and ask if this model fits well.

#### Log-linear Models for Two-Way Tables

Given two categorical random variables, A and B, there are two main types of models we will consider:

# 10.1.1 - Model of Independence for Two-way Tables

Recall, that the independence can be stated in terms of cell probabilities as a product of marginal probabilities,

$$\pi_{ij}=\pi_{i+}\pi_{+j}$$     i = 1, ..., I, j = 1, ..., J

and in terms of cell frequencies,

$$\mu_{ij}=n\pi_{ij}=n\pi_{i+}\pi_{+j}$$       i = 1, ..., I, j = 1, ..., J

By taking natural logarithms on both sides of "=" sign, we obtain the loglinear model of independence:

\begin{array}{lcl}
\text{log}(\mu_{ij}) &=& \text{log}n+\text{log}\pi_{i+}+\text{log}\pi_{+j}\\
\text{log}(\mu_{ij}) &=& \lambda+\lambda_i^A+\lambda_j^B
\end{array}

where superscripts A and B are just used to denote the two categorical variables. The second equation is a more standard notation also used in your textbook by Agresti.

This is an ANOVA type-representation where:

• λ represents the "overall" effect, or the grand mean of the logarithms of the expected counts, and it ensures that Σi Σj μij = n, that is, the expected cell counts under the fitted model add up to the total sample size n.
• $$\lambda^A_I$$ represents the "main" effect of variable A, or a deviation from the grand mean, and it ensures that Σj μij = ni+, that is, the marginal totals under the fitted model add up to the observed marginal counts. It represents the effect of classification in row i.
• $$\lambda^B_J$$ represents the "main" effect of variable B, or a deviation from the grand mean, and it ensures that Σi μij = n+j. This is the effect of classification in column j.
• $$\lambda^A_I=\lambda^B_J=0$$, that is, Σi λAi = Σj λBj = 0, to deal with over-parametrization (see below).

The maximum likelihood (ML) fitted values for the cell counts are the same as the expected (fitted) values under the test of independence in two way tables, i.e., E(μ)ij = ni+n+j/n. Thus, the X2 and G2 for the test of independence are goodness-of-fit statistics for the log-linear model of independence testing that the independence model holds versus that it does not, or more specifically testing that the independence model is true vs. saturated model is true. This model also implies that ALL odds ratios should be equal to 1.

#### Parameter Constraints & Uniqueness

For an I × J table, and the model

$$\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B$$

there are I terms in the set {$$\lambda^A_I$$}, but one of them is redundant so there are I − 1 unknown parameters, e.g., {λ1A, ... , λI-1A}, and there are J terms in the set {$$\lambda^B_J$$}, and one is redundant, so there are J − 1 unknown parameters, e.g., {λ1B , ...., λJ-1B}. (Why is one of them redundant?) There can be many different parameterizations, that is there is NOT a unique set of parameters. Regardless which set we use, we need to set the constraints to account for redundant parameters. Nonexistence of unique set of parameters does not mean that the expected cell counts will change depending on which set of parameters is being used. It simply means that the estimates of the effects may be obtained under different sets of constraints, which will lead to different interpretations. But expected cell counts will remain the same.

DUMMY CODING: To avoid over-parametrization, one member in the set of λ's is fixed to have a constant value, typically 0. This corresponds to using dummy coding for the categorical variables (e.g. A = 1, 0). By default, in SAS PROC GENMOD, the last level is set to 0, e.g., for the expected count in the first cell, and for the last cell,

$$\text{log}(\mu_{11})=\lambda+\lambda_1^A+\lambda_1^B=\lambda+\lambda_1^A+\lambda_1^B$$

$$\text{log}(\mu_{22})=\lambda+\lambda_2^A+\lambda_2^B=\lambda+0+0=\lambda$$

By default, in R glm() the first level of the categorical variable is set to 0, e.g., for the first and the last cell,

$$\text{log}(\mu_{11})=\lambda+\lambda_1^A+\lambda_1^B=\lambda+0+0=\lambda$$

$$\text{log}(\mu_{22})=\lambda+\lambda_2^A+\lambda_2^B=\lambda+\lambda_2^A+\lambda_2^B$$

ANOVA-type CODING: Another way to avoid over-parametrization is to fix the sum of the terms equal to a constant, typically 0. That is the ANOVA-type constraint. This corresponds to using the so called ”effect” coding for categorical variables (e.g. A = 1, 0, −1). By default, SAS PROC CATMOD and R loglin(), use the zero-sum constraint, e.g., the expected cell count in the first cell and the last cell,

$$\text{log}(\mu_{11})=\lambda+\lambda_1^A+\lambda_1^B$$

$$\text{log}(\mu_{22})=\lambda+\lambda_2^A+\lambda_2^B=\lambda-\lambda_1^A-\lambda_1^B$$

We will see more on these with a specific example in the next section.

#### Link to odds and odds ratio

We can have different parameter estimates (i.e, different values of λ's) depending on type of constraints we set. So, what is unique about these parameters that will lead us to the "same" inference regardless of parametrization? The differences, that is the log odds, are unique:

$$\lambda_i^A-\lambda_{i'}^A$$

$$\lambda_j^B-\lambda_{j'}^B$$

where subscript  i denotes one level of categorical variable A and i' denotes another level of the same variable; similarly for B.

Thus the odds is also unique!

\begin{align}
\text{log}(odds) &= \text{log}\left(\dfrac{\mu_{i1}}{\mu_{i2}}\right)=\text{log}(\mu_{i1})-\text{log}(\mu_{i2})\\
&= (\lambda+\lambda_i^A+\lambda_1^B)-(\lambda+\lambda_i^A+\lambda_2^B)=\lambda_1^B-\lambda_2^B\\
\end{align}

odds = exp(λ1B − λ2B)

Think about the following question, then click on the icon to the left display an answer.

#### How about the odds ratio (OR)?

If we have model of independence we expect that the log(odds ratio)=0, that is OR=exp(log(odds ratio))=exp(0)=1. Can you finish the calculation below and show that you get zero?

\begin{align}
\text{log}(oddsratio) &= \text{log}\left(\dfrac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}}\right)\\
&= \text{log}(\mu_{11})+\text{log}(\mu_{22})-\text{log}(\mu_{12})+\text{log}(\mu_{21})\\
&= \cdots\\
\end{align}

The odds ratio measures the strength of the association and depends only on the interaction terms {$$\lambda_{ij}^{AB}$$ }, which clearly does not appear in this model, but we will discuss it when we see the saturated log-linear model.

Do you recall, how many odds ratios do we need to completely characterize associations in I × J tables?

### Log-linear Model of Independence for Two-Way Tables -- more formal version

Objective: Model the cell counts: μij = nπij; that is, the response is the cell count.

Main assumptions:

• The N = IJ counts in the cells are assumed to be independent observations of a Poisson random variable, and
• there is no interaction between two categorical variables, i.e., $$\lambda_{ij}^{AB}=0$$, for all i, j or odds ratio = 1 or log(odds) = 0.

Model Structure:

$$\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B$$

where A and B stand for two categorical variables.

MLE's:

$$E(\mu_{ij})=m_{ij}=\hat{\mu}_{ij}=\dfrac{n_{i+}n_{+j}}{n}$$

Parameter constraints: can be different, but e.g., ANOVA type are

$$\sum_i \lambda_i^A=\sum_j \lambda_j^B =0=\sum_i\sum_j \lambda_{ij}^{AB}=0$$

Parameter estimation and interpretation:

• λ represents an "overall" effect, or a grand mean of the logarithms of the expected counts, and it ensures that Σi Σj μij = n. When estimates of model parameters are derived by equating the observed and the expected total, such estimates are said to be 'conditioned on the total number of observations'.
• λAi represents a "main" effect of variable A, or a deviation from a grand mean, and it ensures that Σj μij = ni+. It represents the effect of classification in row i.
• λBJ represents a "main" effect of variables B, or a deviation from a grand mean, and it ensures that Σi μij = n+j. This is the effect of classification in j.

The values of the parameter estimates may differ based on the constraints, but the estimates of the cell counts, odds and odds ratios will be same! Thus, our inferences will be the same!

For example for ANOVA like constraints, for a 2 × 2 table, the expected cell counts are:

$$\lambda=\dfrac{1}{IJ}\sum\sum \text{log}(m_{ij})$$, and log(odds ratio): $$\text{log}(\Theta)=4\lambda_{11}^{AB}$$

Model Fit:

Use the Pearson X2 statistic and deviance or likelihood-ratio test statistic, $$G^2=2\text{log} \sum\sum n_{ij}\text{log}\left(\dfrac{n_{ij}}{m_{ij}}\right)$$ , as we have seen when we did the test for independence between two variables.

Effectively, we are testing, again,

H0 = independence model is true

vs.

HA = saturated model is true.

Thus the deviance statistic is based on the likelihood ratios of the null model of independence versus the saturated model; recall the lecture notes from the earlier in the semester; e.g.,  goodness-of-fit tests.

In general, recall, for the likelihood ratio test, we fit the full model (e.g. saturated model) and a reduced model (e.g. independence model). The LR statistic is

ΔG2 = 2 × (loglikelihood from full model −loglikelihood from reduced model) ,

and the p-value is  $$P(\Delta G^2 \geq \chi^2_k)$$ where k is the difference in df between two models, or difference in the number of parameters, as before.

Model selection: In this case, use the same statistics as above.

Next we will consider how to fit this model in SAS and R.

# 10.1.2 - Example: Therapeutic Value of Vitamin C

Recall the Vitamin C study, a 2 × 2 example from Lesson 3. From the earlier analysis, we learned that the treatment and response for this data are not independent, e.g.,

(G2 = 4.87, df = 1, p−value = 0.023),

and we estimated various strength of associations, e.g. odds ratios 0.49, indicating that skiers who took placebo were twice more likely to get cold than the ones who took vitamin C.

Next we will consider how we answer the same question but by fitting the log-linear model of independence.

There are (at least) two ways of fitting these models in SAS and R. First, we will see the specification for two procedures in SAS (PROC CATMOD and PROC GENMOD); see VitaminCLoglin.sas and different parts of output in VitaminCLoglin.lst.  Furthermore below, you can see more on the corresponding R functions loglin() and glm(), by running the Inspect video, but I suggest that the R users also read sections on SAS; see VitaminCLoglin.R and different parts of output in VitaminC.Loglin.out.

#### The PROC CATMOD procedure

INPUT:

The MODEL statement (line) specifies that we are fitting the model on two variables treatment and response specified on the left-hand side of the MODEL statement. The MODEL statement only takes independent variables in its right-hand side. In the case of fitting a log-linear model, we need to use a keyword _response_ on the right-hand side.

The LOGLIN statement specifies that it's a model of independence, because we are stating that the model has only main effects, one for "treatment" variable and the other for the "response" variable, in this case.

OUTPUT:

Population Profiles and Response profiles
Our responses, that is what we are modeling, are four cell counts:

Response Matrix
Gives the parameter constraint coding, e.g., dummy or effect, -1,1, contrasts, etc....

Maximum Likelihood Analysis
An iterative procedure is used to find MLE estimates for the model parameters.

Maximum Likelihood Analysis of Variance
Like in the ANOVA, we are breaking down the variability across factors. More specifically, the main effect TREATMENT test if the subject are evenly distributed over the two levels of this variable. The null hypothesis assumes that they are versus alternative that they are not. Based on the output below, are the subjects evenly distributed across the placebo and the vitamin C conditions?

The main effect RESPONSE test if the subjects are evenly distributed over the two levels of this variable. Wald χ2 = 98.12, df = 1 indicated highly uneven distribution.

The LIKELIHOOD RATIO (or the deviance statistic) tells us about the overall fit of the model. Does this value look familiar? Does the model of independence fit well?

Analysis of Maximum Likelihood Estimates
Gives the estimates of the parameters. We focus on estimation of odds.

For example, $$\lambda_1^A=\lambda_{placebo}^{TREATMENT}=0.00358=-\lambda_{ascobic}^{TREATMENT}=-\lambda_2^A$$
What are the odds of getting cold? Recall that PROC CATMOD, by default, uses zero-sum constraints, and the signs for the parameters for each cell are expressed in the Response Matrix output. Based on that output (see the above output) the log odds are:
\begin{align}
\text{log}(\mu_{i1}/\mu_{i2}) &=\text{log}(\mu_{i1})-\text{log}(\mu_{i2})\\
&=[\lambda+\lambda_{placebo}^{TREATMENT}+\lambda_{cold}^{RESPONSE}]-[\lambda+\lambda_{placebo}^{TREATMENT}+\lambda_{nocold}^{RESPONSE}]\\
&=\lambda_{cold}^{RESPONSE}-\lambda_{nocold}^{RESPONSE}\\
&=\lambda_{cold}^{RESPONSE}+\lambda_{cold}^{RESPONSE}\\
&=-0.7856-0.7856=-1.5712\\
\end{align}
So the odds are exp(-1.5712) = 0.208. Notice that the odds are the same for all rows!

 What are the odds for being in the placebo group? Note that this question is not really relevant here, but it's a good exercise to figure out if you understand the matching between the response matrix contrasts and how do we come up with estimates of the odds. Hint: log(odds)=0.00716.

#### The PROC GENMOD procedure

Fits a log-linear model as a Generalized Linear Model (GLM), and by default uses dummy coding. We will mostly use PROC GENMOD in this class.

INPUT:

The CLASS statement (line) specifies that we are dealing with two random variables, treatment and response.

The MODEL statement (line) specifies that our response are the cell counts, and link=log that we are modeling log of counts; two variables should form a model of independence, and error distribution, that is sampling distribution is Poisson; the remaining specifications ask for different parts of output to be printed on your screen or in the .lst document. For other options see SAS help.

OUTPUT:

Model Information
Gives assumptions about the model. In our case, tells us that we are modeling the log, i.e., the LINK FUNCTION, of the responses, i.e. DEPENDENT VARIABLE, that is counts which have Poisson distribution.

Class Level Information
Information on the categorical variables in our model

Criteria For Assessing Goodness of Fit
Gives the information on the convergence of the algorithm for MLE for the parameters, and how well the model fits. What is the value of the deviance statistic, i.e., G2? How about Pearson X2? Note that they are the same as we got when we did the chi-squared test of independence; we will discussed "scaled" options later in relation to the concept of overdispersion. Note also the value of the log likelihood for the fitted model, in this case the independence.

Analysis of Parameter Estimates
Gives individual parameter estimates. Recall that PROC GENMOD does not uses zero-sum constraints, but fixes typically the last level of each variable to be equal to zero.

$$\hat{\lambda}=4.75,\ \hat{\lambda}^A_1=0.0072,\ \hat{\lambda}^A_2=0,\ \hat{\lambda}^B_1=-1.5712,\ \hat{\lambda}^B_2=0$$

So what are the estimated cell counts for having cold given that you took vitamin C based on this model:

$$\mu_{21}=\text{exp}(\lambda+\lambda_2^A+\lambda_1^B)=\text{exp}(4.745+0-1.5712)=23.91$$

How about the other cells: μ11,  μ12, and μ22?

What are the estimated odds of getting cold?
$$\text{exp}(\lambda_1^B-\lambda_2^B)=\text{exp}(-1.571-0)=0.208$$

LR Statistics for Type 3 Analysis
Testing the main effects, as in Maximum Likelihood Analysis of Variance in the CATMOD, except this is the likelihood ratio (LR) statistic and not the Wald statistic. Do you recall the difference between the two from Lesson 2 in testing difference in proportions? Let the null hypothesis be that there is no main effect, or in this case that there is equal distribution of subjects across the levels of a particular categorical variable, e.g., response. Remember that the Wald statistic depends only on the behavior of log-likelihood at the point of ML estimate and its standard error; e.g., z=MLE/SE(MLE). The likelihood-ratio statistic depends on the behavior of log-likelihood at the point of the ML estimate but also at the point of the null hypothesis, e.g., -2(L0-L1). When the sample size is moderate or small, likelhood ratio statistic is better choice. From the CATMOD output, the chi-squared Wald statistic was 98.12, with df=1, and we reject the null hypothesis. Here, we reach the same conclusion, except, that LR statistic is 130.59, df=1.

Observation Statistics
Gives information on expected cell counts (i.e., PRED), log of the expected counts (i.e., XBETA), standard errors (i.e., Std), confidence interval (i.e., Lower, Upper) and five different residuals for further assessment of the model fit, e.g., "Resraw"=observed count - predicted (expected) count, "Reschi"=pearson residual, "Resdev"= deviance residuals, "StReschi"=standardized pearson residuals, etc.

#### The LOGLIN and GLM functions in R

To do the same analysis in R we can use loglin() function that corresponds to PROC CATMOD  and glm() function that corresponds to PROC GENMOD. See, VitaminCLoglin.R and VitaminC.Loglin.out and the viewlet which runs step-by-step through the commands and the output.

(viewlet for VitaminCLoglin.R)

Here is a brief comparison to loglin() and glm() output with respect to model estimates.

The following command fits the log-linear model of independence, and from the part of the output that gives parameter estimates, we see that they correspond to the output from CATMOD, cold=exp(Λcoldresponsenocoldresponse)=exp(-0.7856-0.7856)=exp(-1.5712).

loglin(ski, list(1, 2), fit=TRUE, param=TRUE)$param$"(Intercept)"[1] 3.963656$param$Treatment     Placebo     VitaminC  0.003584245 -0.003584245 $param$Cold      Cold     NoCold -0.7856083  0.7856083

The following command fits the log-linear model of independence using glm(), and from the part of the output that gives parameter estimates, we see that they correspond to the output from GENMOD, but with the signs reversed, e.g. odds of nocold=exp(Λnocoldresponse)=exp(1.5712), thus odds of cold exp(-1.5712).

# 10.1.4 - Saturated Loglinear Model for Two-Way Tables

Objective: Model the cell counts: μij = nπij

Main assumptions:

• The N = IJ counts in the cells are assumed to be independent observations of a Poisson random variable, and

Model structure:

$$\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_{ij}^{AB}$$

Which constraints must hold? It depends on type of parameterization, and software you are using as each application brings with it differences in implementation. The key is, it doesn't matter, because your inference in the end will be the same. This we have already discussed to some length in case of independence models.You do, however, want to specify the procedure you used in any publication of your analysis, so you do want to understand the concepts involved here. Notice, now that we are also fitting an "interaction" term, that is the association term between A and B.

Parameter estimates and interpretation:

The constant and main effect λ's have the same meaning as before.

$${\lambda_{ij}^{AB}}'s$$ (1) represent the interaction/association between two variables, (2) reflect the departure from independence, and (3) ensure that μij = nij

The odds ratio is directly related to the interaction terms. For example, for a 2 × 2 table:

\begin{align}
\text{log}(\theta) &= \text{log}\left(\dfrac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}}\right)\\
&= \\
&= \lambda_{11}^{AB}+\lambda_{22}^{AB}-\lambda_{12}^{AB}-\lambda_{21}^{AB}
\end{align}

Think about the following question, then click on the icon to the left display an answer.

#### Can you fill in the missing line for the equation above?

How many odds ratios are there? There should be (I-1) ×(J-1) which should be equal to the unique number of λij's in the model.

How many unique parameters are there in this model?

 Term # of terms #of constraints # of unique parameters λ 1 0 1 {λiA} I 1 I - 1 {λjB} J 1 J - 1 {λijAB} I × J I + J - 1 (I - 1) × (J - 1)

I × J = N is a perfect fit!

The saturated model is when:

1. the fitted values are exactly equal to observed values, that is the model fits the data perfectly,
2. df = 0, i.e., the number of unique parameters equals the number of cells,
3. this is the most complex model,
4. has the independence model as a special case. What does this imply about the assumption for the interaction terms?
5. there is a direct functional relationship with the odds ratio (and the unique number of those).

Model selection: None here, but we typically want a simpler model that smooths the data more, and it's more parsimonious. We will see later on the model selection. We compare other models, typically to the saturated model.

### Examples - General Social Survey and Vitamin C

Let's evaluate the CATMOD and GENMOD output for VitaminCLoglin.sas and vote.sas and R code for those, VitaminCLoglin.R and vote.R.

 (viewlet for VitaminCLoglin.R) (viewlet for Vote.R)

To fit the saturated models, we need to specify or add an interaction term in our code.

In CATMOD:

The LOGLIN statement (line) now specifies the saturated model. treatment|response specification adds an interaction term to the main effects model.

In GENMOD:

The '*' sign in the response*treatment, indicates that we want the interaction term and ALL lower order terms; that is, the saturated model.

Notice, that we can also specify the saturated model by explicitly entering the main effects in the model statement (and I suggest you use this option because it will be easier to read the values of the parameter estimates from the output):

model count = response treatment response*treatment/link=log dist=poisson lrci type3 obstats;

In these two model specifications under GENMOD, you will see that we are fitting different number of parameters and the ouput sections on LR Statistics For Type 3 Analysis will be different, but the estimates of odds and odds-ratios will be the same!

For the output, we have the same parts as we did for the independence model, except now we have more parameter estimates. We will consider the latter GENMOD parametrization and compare it to the CATMOD.

So, what are the odds of getting cold given that a person took a placebo pill? Let's first start with log-odds:

$$\text{log}(\mu_{11}/\mu_{12})=\text{log}(\mu_{11})-\text{log}(\mu_{12})=[\lambda+\lambda_1^A+\lambda_1^B+\lambda_{11}^{AB}]-[\lambda+\lambda_1^A+\lambda_2^B+\lambda_{12}^{AB}]$$

Based on GENMOD parameterization: λ1B11AB- λ2B - λ12AB = -1.9708 + 0.7134 - 0 - 0 and the odds are exp(-1.2574) = 0.2844

Based on CATMOD parameterization: [λ+λ1A + λ1B11AB ]- [ λ+λ1A- λ1B - λ11AB] = 2 λ1B+2 λ11AB = 2(-0.807 + 0.1784) and the odds are exp(log odds) = 0.2844

In R, loglin(),

loglin(ski, list(c(1, 2)), fit=TRUE, param=TRUE)

In R, glm(),

glm(ski.data$Freq~ski.data$Treatment*ski.data$Cold, family=poisson()) and like the note for SAS, you can also do glm(ski.data$Freq~ski.data$Treatment+ski.data$Cold+ski.data$Treatment*ski.data$Cold, family=poisson())
 Q1: What are the odds ratio based on CATMOD or loglin()? How about GENMOD or glm()?

For example, from the second way of fitting model with GENMOD, odds ratio=exp(0.7134)=2.041, with 95% CI [exp(0.079), exp(1.3770)]. This corresponds to 1/2.041=0.49 which we saw before.

treatment*response    placebo     cold       1      0.7134      0.3293      0.0790      1.3770

From R, glm(), we get the same:

ski.data$TreatmentVitaminC:ski.data$ColdNoCold  0.7134

 Q2: Is the interaction of treatment and response significant (hint: look at LR Statistics for Type 3 Analysis). Where does df = 3 come from?

 Q3: Model fits perfectly. Let's do some model selection. Compare the independence and saturated model, by calculating the LR statistics (e.g. difference in deviances). $$2\times (973.0657-970.629)\approx 4.8717$$ How many df does this statistic have?

 How about data in vote.sas or vote.R? What can you say about the saturated model? What are the estimates of the odds-ratios?

#### Hierarchical Models

These models include all lower order terms that comprise higher-order terms in the model.

(A, B) is a simpler model than (AB)

Interpretation does not depend on how the variables are coded.

Think about the following question, then click on the icon to the left display an answer.

#### Is this a hierarchical model? Why or why not?

$$\text{log}(\mu_{ij})=\lambda+\lambda_i^A+ \lambda_{ij}^{AB}$$

We want hierarchical models so that interaction terms represent associations. If there is a significant interaction term, we do NOT look at the lower-order terms, but only interpret the higher-order terms because the value of lower-order terms is coding dependent and can be misleading.

Next we'll see how these models extend to three-way tables.

# 10.2 - Log-linear Models for Three-way Tables

In this section we will extend the concepts we learned about log-linear models for two-way tables to three-way tables. We will learn how to fit varous models of independence discussed in Lesson 5, e.g., conditional independence, joint independence and homogenous associations model. We will also learn additional statistics, besides the usual X2 and G2, to assess the model fit, and to choose the "best" model.

#### Key concepts:

• Three-way Log-linear models
• Parameters Constraints, Estimation and Interpretation
• Model selection and Inference for log-linear models

#### Objectives:

• Understand the structure of the log-linear models in three-way tables
• Understand the concepts of independence and associations described via log-linear models in three-way tables

• Agresti (2007) Ch. 7, 8
• Agresti (2013) Ch. 8, 9

Expanding the log-linear model notation to 3-way tables:

$$\text{log}(\mu_{ijk})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_k^C+\lambda_{ij}^{AB}+\lambda_{ik}^{AC}+\lambda_{jk}^{BC}+\lambda_{ijk}^{ABC}$$

The main questions for this lesson are:

• What do the λ terms mean in this model?
• What hypothesis about them correspond to the models of independence we already know?
• What are some efficient ways to specify and interpret these models and tables?
• What are some efficient ways to fit and select among many possible models in three and higher dimensions?

As before for three-way tables, there are multiple models we can test, and now fit. The log-linear models we will fit and evaluate are:

Let us go back to our familiar dataset on graduate admissions at Berkeley:

 Dept. Men rejected Men accepted Women rejected Women accepted A 313 512 19 89 B 207 353 8 17 C 205 120 391 202 D 278 139 244 131 E 138 53 299 94 F 351 22 317 24

Let D = department, S = sex, and A = admission status (rejected or accepted). We analyzed this as a three-way table before, and specifically we looked at partial and marginal tables. Now we will look at it from a log-linear model point of view. You will work with this example on the next homework as well. Let y be the frequency or count in a particular cell of the three-way table. See berkelyLoglin.sas (and berkeley.sas) or berkeleyLoglin.R (and berkeley.R).

# 10.2.2 - Saturated Log-Linear Model

This model is the default model in a way that serves for testing of goodness-of-fit of the other models. Recall that saturated model has the maximum number of parameters and fitting a saturated model is the same as estimating ML parameters of distributions appropriate for each cell of the contingency table.

Objective: Model the cell counts: μijk = nπijk

Main assumptions:

• The N = IJK counts in the cells are assumed to be independent observations of a Poisson random variable, and

Model structure:

$$\text{log}(\mu_{ijk})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_k^C+\lambda_{ij}^{AB}+\lambda_{ik}^{AC}+\lambda_{jk}^{BC}+\lambda_{ijk}^{ABC}$$

Parameter constraints: can be different, but ANOVA type constraints imply:

$$\sum_i\lambda_i^A=\sum_j\lambda_j^B=\sum_i\sum_j\lambda_{ij}^{AB}=\sum_i\sum_j\sum_k\lambda_{ijk}^{ABC}=0$$

Parameter estimation and interpretation:

• λ represents an "overall" effect, or a grand mean of the logarithms of the expected counts, and it ensures that Σi Σj Σk μijk = n
• $$\lambda_i^A,\lambda_j^B,\lambda_k^C$$ represent the "main" effects of variables A, B, and C, or deviations from the grand mean.
• $$\lambda_{ij}^{AB},\lambda_{ik}^{AC},\lambda_{jk}^{BC}$$ represent the interaction/association between two variables while controlling the third (e.g., conditional odds ratios, test for partial associations), and reflect the departure from independence
• $$\lambda_{ijk}^{ABC}$$ represents the interaction/association between three variables and reflect the departure from independence

If there is a significant interaction term, we DO NOT look at the lower-order terms, but only interpret the higher-order terms because the values of lower-order terms are coding dependent and can be misleading. The highest-order terms in log-linear models correspond to the so called minimal sufficient statistics (MSS) for log-linear models. The MSSs are margins of the table, and in this case it is the entire table since the highest-order term is the three-way association and we have three random variables. This makes it equivalent to fitting independent Poisson distribution to each cell of the I x J rable.

Model Fit: Saturated model has a prefect fit, G2 = 0, df = 0. df =number of cells − number of unique parameters in the model.

Model Selection: Relevant when comparing to simpler models. Saturated model is the most complex model possible!!

#### Fitting in SAS and R

We will first discuss berkeleyLoglin.sas, and then berkeleyLoglin.R.

Using PROC GENMOD, let us fit the saturated log-linear model.

When you use the order=data option, GENMOD orders the levels of class variables in the same order as they appear in the dataset. For each class variable, GENMOD creates a set of dummy using the last category as a reference group. Recall the CATMOD and GENMOD coding from Two-way Log-linear models.

Therefore, we can interpret a two-way association as a log-odds ratio for the two variables in question, with the other variable held constant at its last category (i.e., conditional odds-ratios).

Here’s a portion of the SAS output, that is the table of ML estimates.

For how to do this in R see berkeleyLoglin.R which uses the loglin() and glm() functions.

The intercept is a normalizing constant and should be ignored. The main effects for D, S and A are all difficult to interpret and not very meaningful since we have significant two-way and three-way associations; the two- and three-way associations are highly meaningful. For example, the estimated coefficient for the SA association is 0.1889.

Exponentiating this coefficient gives

$$\text{exp}(0.1889)=1.208$$,

which is the estimated SA odds ratio for Department F since that is the reference department in this analysis. The reference group for S is "female," and the reference group for A is "accept." If we write the 2 × 2 table for S × A in Department F, i.e., the "partial" table, with the reference groups in the last row and column, we get

 Dept F Reject Accept Men 351 22 Women 317 24

for which the estimated odds ratio is:

$$351 \times 24/317 \times 22=1.208$$.

The Wald z-statistic for this coefficient,

$$z=0.1889/0.3052$$,

which corresponds to Chi-square statistic 0.622 = 0.38 with p-value 0.5359, and indicates that the SA odds ratio for Department F is not significantly different from 1.00 or that the log odds ratio is not significantly different from 0.

The 95% confidence interval for the parameter estimate, that is for the log odds ratio, is (−0.4092, 0.7870). Thus the 95% confidence interval for the odds ratio is

$$(\text{exp}(-0.4092),\text{exp}(0.7870))=(0.67,2.20)$$.

Compare this to the output from PROC FREQ with CMH option

To get the SA odds ratio for any other department, we have to combine the SA coefficient with one of the DSA coefficients. For example, the SA odds ratio for Department A is

$$\text{exp}(0.1889+0.8632)=2.864$$.

Based on:

\begin{align}
\theta_{SA(i="A")} &= \dfrac{\mu_{ijk} \mu_{ij'k'}}{\mu_{ij'k} \mu_{ijk'}}\\
&= \\
&= (\lambda_{jk}+\lambda_{j'k'}-\lambda_{j'k}-\lambda_{jk'})+(\lambda_{ijk}+\lambda_{ij'k'}-\lambda_{ij'k}-\lambda_{ijk'})\\
&= (0.1889+0-0-0)+(0.8632+0-0-0)
\end{align}

The Wald z-statistic for the first DSA coefficient,

$$z=0.8632/0.4027$$,

indicates that the SA odds ratio for Department A is significantly different from the SA odds ratio in Department F. To see if the SA odds ratio in Department A is significantly different from 1.00, we would have to compute the standard error for the sum of the two coefficients using the estimated covariance matrix, or refit the model by fixing the level of interest equal to 0.

____________________________________________________________

Let us see about parameter estimates based on the output from berkeleyLoglin.R. More specifically we will consider the output based on the following R line from our code:

berk.sat<-glm(berk.data$Freq~berk.data$Admit*berk.data$Gender*berk.data$Dept, family=poisson())

The intercept is a normalizing constant and should be ignored. The main effects for D, S and A are all difficult to interpret and not very meaningful since we have significant two-way and three-way associations; the two- and three-way associations are highly meaningful. For example, the estimated coefficient for the SA association is -1.0521, from

berk.data$AdmitRejected:berk.data$GenderFemale      -1.05208    0.26271  -4.005 6.21e-05 ***

Exponentiating this coefficient gives

$$\text{exp}(-1.0521)=0.3492$$,

which is the estimated SA odds ratio for Department A since that is the reference department in this analysis. The reference group for S is "male," and the reference group for A is "accept." If we write the 2 × 2 table for S × A in Department A, i.e., the "partial" table, with the reference groups in the last row and column, we get

 Dept A Reject Accept Women 19 89 Men 313 512

for which the estimated odds ratio is:

$$19 \times 512/313 \times 89=0.3492$$

The Wald z-statistic for this coefficient,

$$z=-1.0521/0.2627=-4.005$$

which corresponds to Chi-square statistic (-4.005)2 = 16.04 with p-value nearly 0, and indicates that the SA odds ratio for Department A is significantly different from 1.00 or that the log odds ratio is significantly different from 0.

The 95% confidence interval for the parameter estimate, that is for the log odds ratio, is -1.0521±1.96(0.2627). Thus the 95% confidence interval for the odds ratio is

$$(\text{exp}(-1.567),\text{exp}(-0.537))=(0.209,0.584)$$

To get the SA odds ratio for any other department, we have to combine the SA coefficient with one of the DSA coefficients. For example, the SA odds ratio for Department F is

$$\text{exp}(-1.0521+0.8632)=\text{exp}(-0.1889)=0.828$$

and if you compare this with the calculation from SAS above, we get the same odds-ratio, that is 1/0.828=1.028. Check the boxed explanation in SAS above, how we derived this.

The Wald z-statistic for the last DSA coefficient,

$$z=0.8632/0.4027=2.14$$

indicates that the SA odds ratio for Department F is significantly different from the SA odds ratio in Department A. To see if the SA odds ratio in Department F is significantly different from 1.00, we would have to compute the standard error for the sum of the two coefficients using the estimated covariance matrix, or refit the model by fixing the level of interest equal to 0.

In many situations, however, we take recourse to saturated model only as a last resort. As the number of variables grow, saturated models become more and more difficult to interpret. In the following sections we look at simpler models which are useful in explaining the associations among the discrete variables of interest.

# 10.2.3 - Complete Independence

Objective: Only have the main effects for the three variables; we are testing here for complete independence among these three variables.

Main assumptions:

• The N = IJK counts in the cells are assumed to be independent observations of Poisson random variable, and
• there are no partial interactions, $$lambda_{ij}^{AB}=\lambda_{ik}^{AC}=\lambda_{jk}^{BC}=0$$, for all i, j, k or all conditional odds ratio= 1 or log(oddsratio) = 0.
• \begin{array}{lcl}
\theta_{AB(k)} &=& \dfrac{\mu_{ijk}\mu_{i'j'k}}{\mu_{i'jk}\mu_{ij'k}}\\
\theta_{AC(j)} &=& \dfrac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}}\\
\theta_{BC(i)} &=& \dfrac{\mu_{ijk}\mu_{ij'k'}}{\mu_{ij'k}\mu_{ijk'}}
\end{array}

• there is no interaction between three categorical variables, i.e., $$\lambda_{ijk}^{ABC}=0$$ for all i, j, k.

Model Structure:

$$\text{log}(\mu_{ijk})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_k^C$$

A, B and C stand for three categorical variables.

MLE’s:

$$E(\mu_{ijk})=m_{ijk}=\hat{\mu}_{ijk}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}$$

Parameter constraints: can be different, but ANOVA type are ...

 Can you fill in the constraints here?

#### Parameter estimates and interpretation:

Write the model in terms of the variables, and recall the log-linear notation we used when we introduced the complete independence model in the Three-way table notes.

#### Fitting the model in SAS and R

In SAS, the model of complete independence (D, S, A) can be fitted like this (see berkeleyLoglin.sas):

What are the estimated odds of male vs. female in this example? From the output (run the code) the ML estimate of the parameter S-Male, thus, the odds of being male are higher than being female applicant:

exp(0.382) = 1.467 = 2691/1835

with p-value < .0001 indicating that the odds are significantly different. Note these are odds, not odds ratios! (When we are dealing with main effects we do not look at odds ratios.)

How about odds of being rejected? What can you conclude from the part of the output below?

But, we should really check the overall fit of the model first, to determine if these estimates are meaningful.

Model Fit:

The goodness-of-fit statistics indicate that the model does not fit.

If the model fits well, the "Value/DF" would be close to 1.

Do you recall how we get the DFs?

df = number of cells - number of fitted parameters in the model.

df = number of fitted parameters in the saturated model - number of fitted parameters in our model.

Recall that these goodness-of-fit statistics, compare our fitted model to the saturated model. Thus our model does not fit well in comparison to the saturated model.

Let's see parts of relevant R code, (berekeleyLoglin.R), and one way to fit the model of independence is:

berk.ind<-glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data$Dept, family=poisson())

What are the odds of being a female vs. male applicant? See,

berk.data$GenderFemale -0.38287 0.03027 -12.647 < 2e-16 *** Thus the odds are exp(-0.3829)=0.68. Compare this to the result we got from SAS (see above) which was 1.467=1/0.68. The same result just odds of male vs. female. Can you figure out the odds of being rejected? But, we should really check the overall fit of the model first, to determine if these estimates are meaningful. Model Fit: The goodness-of-fit statistics indicate that the model does not fit. Null deviance: 2650.1 on 23 degrees of freedom Residual deviance: 2097.7 on 16 degrees of freedom Here the "residual deviance" is the likelihood-ratio statistic, G2=2097.7 with df=16. If the ratio of G2/df is much greater than 1, you can reject the fitted model, or compute the p-value for this statistic like we did in the earlier lessons. Thus the model of complete independence does NOT fit well in comparison to the saturated model, i.e., reject the model of complete independence. The "null deviance" is the deviance for the intercept only model. Next, let us see an example of the joint independence model. # 10.2.4 - Joint Independence Two variables jointly independent of the third, e.g., (DS,A), (DA, S), (SA,D). We will not go through all of these potential models, but instead take a look at one of these. Consider, for example, modeling (D, S) jointly independent from A. Objective: Main assumptions: • The N = IJK counts in the cells are assumed to be independent observations of a Poisson random variable, (this always stays the same...) and • there is no interaction between three categorical variables, i.e., $$\lambda_{ijk}^{DSA}=0$$ for all i, j, k, and • there is no partial interactions, $$\lambda_{ik}^{DA}=\lambda_{jk}^{SA}=0$$, for all i, j, k or conditional odds ratios • \begin{array}{lcl} \theta_{DA(j)} &=& \dfrac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}}=1\\ \theta_{SA(i)} &=& \dfrac{\mu_{ijk}\mu_{ij'k'}}{\mu_{ij'k}\mu_{ijk'}}=1 \end{array} Model Structure: $$\text{log}(\mu_{ijk})=\lambda+\lambda_i^D+\lambda_j^S+\lambda_k^A+\lambda_{ij}^{DS}$$ Parameter estimates and interpretation: In SAS, this model can be fitted like this: This model implies that association between D and S does NOT depend on level of the variable A, that is association between department and gender is independent of the rejection/acceptance decision. $$\theta_{DS(A=reject)}=\theta_{DS(A=accept)}=\dfrac{\text{exp}(\lambda_{ij}^{DS})\text{exp}(\lambda_{i'j'}^{DS})}{\text{exp}(\lambda_{i'j}^{DS})\text{exp}(\lambda_{ij'}^{DS})}$$ Does this control for variable A or ignore it? (see berkeley.sas). Since we are assuming that (DS) are independent of levels of A, then we are assuming that DS given A is the same as DS so the marginal DS associations are the same as partial association DS for each level of A, i.e., if this model fits well, we can ignore A with respect to D and S. The estimated coefficient of DS association (run berkeleyLoglin.sas) of 1.9436 implies D*S DeptA Male 1 1.9436 0.1268 1.6950 2.1921 234.84 that the estimated odds ratio is exp(1.9436) = 6.98 with 95% CI (exp(1.695), exp(2.192)) = (5.45, 8.96) with reference group "female" and "department F". But, we should really check the overall fit of the model first, to determine if these estimates are meaningful. Model Fit: The goodness-of-fit statistics indicate that the model does not fit since the "Value/df" is much larger than 1. Recall, you could also compute the p-value the G2 or X2 here too. How did we get these DF? df = (IJK-1)-((I-1)+(J-1)+(K-1)+(I-1)(J-1)) df = (IJ-1)(K-1) So where is the lack of fit? As before, look at residuals. Recall that adjusted residuals have approximately N(0, 1) distribution, e.g., standardized Pearson residuals (StReschi). In general, we have a lack of fit, (1) if we have a large number of cells and adjusted residuals are greater than 3, or (2) if we have a small number of cells and adjusted residuals are greater than 2. Here is only part of the output for the 2x2x6 table. Notice that the absolute value of the StReschi for the first five cells are all large, e.g., in cell 1, StResch = -15.1632. If you look at the residuals for other cells, you will notice that most of them are rather large, indicating that this model fits poorly on the entire data. Evaluate the residuals: In R, here is one way to fit (DS, A) model: berk.join=glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data$Dept+berk.data$Gender*berk.data$Dept, family=poisson()) From the output (add the above function to berkeleyLoglin.R and run it), the estimated coefficient of DS association for "female" and "department F" is 1.9436, berk.data$GenderFemale:berk.data$DeptF 1.94356 0.12683 15.325 < 2e-16 *** Thus, the estimated odds-ratio is exp(1.9436) = 6.98 with 95% CI (exp(1.695), exp(2.192)) = (5.45, 8.96). Males are about 7 times more likely than females to be in department A (the reference level here) than in the department F. For example, this is the same value we would get by estimating the sample odds ratio in this table.  Dept A Dept F Men 825 373 Women 108 341 But, we should really check the overall fit of the model first, to determine if these estimates are meaningful. Model Fit: The goodness-of-fit statistics indicate that the model does not fit since the "Value/df" is much larger than 1, e.g, for the deviance statistc: Residual deviance: 877.06 on 11 degrees of freedom Recall, you could also compute the p-value the G2 here too, by running 1-pchisq(877.06,11). How did we get these DF? df = (IJK-1)-((I-1)+(J-1)+(K-1)+(I-1)(J-1)) df = (IJ-1)(K-1) So where is the lack of fit? As before, look at residuals. Recall that adjusted residuals have approximately N(0, 1) distribution, e.g., standardized Pearson residuals (StReschi). In general, we have a lack of fit, (1) if we have a large number of cells and adjusted residuals are greater than 3, or (2) if we have a small number of cells and adjusted residuals are greater than 2. Recall from VitaminCLoglin.R, to get Pearson residuals, residuals(berk.join, "pearson"), and to get standardize residulas, you need to adjust them by their value of the "influence", e.g., residuals(berk.join, "pearson")/sqrt(lm.influence(berk.join)$hat). Here are the values for the first three observations, and they are all very large.

 15.1987867 -10.3430897  11.4793671

If you look at the residuals for other cells, you will notice that most of them are rather large, indicating that this model fits poorly on the entire data.

Next, let us look at what is often the most interesting model of conditional independence.

# 10.2.5 - Conditional Independence

Two variables conditionally independent given the third, e.g., (DA, SA), (DS, SA), (DS, DA). Consider modeling D independent from A given S.

Objective:

Main assumptions:

• The N = IJK counts in the cells are assumed to be independent observations of a Poisson random variable, and
• there is no interaction between three categorical variables, i.e., $$\lambda_{ijk}^{DSA}=0$$ for all i, j, k, and
• there is no partial interactions, $$\lambda_{ik}^{DA}=0$$, for all i, j, k or conditional odds ratio
• $$\theta_{DA(j)} = \dfrac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}}=1$$

Model Structure:

$$\text{log}(\mu_{ijk})=\lambda+\lambda_i^D+\lambda_j^S+\lambda_k^A+\lambda_{ij}^{DS}+\lambda_{jk}^{SA}$$

Parameter estimates and interpretation:

This model implies that partial odds ratios are characterized by the two-way interaction terms, and that the associations between D and S do NOT depend on the levels A, and the associations between S and A do NOT depend on the levels of D.

$$\theta_{DS(A=reject)}=\theta_{DS(A=accept)}=\dfrac{\text{exp}(\lambda_{ij}^{DS})\text{exp}(\lambda_{i'j'}^{DS})}{\text{exp}(\lambda_{i'j}^{DS})\text{exp}(\lambda_{ij'}^{DS})}$$

$$\theta_{SA(D="A")}=\ldots=\theta_{SA(D="F")}=\dfrac{\text{exp}(\lambda_{jk}^{SA})\text{exp}(\lambda_{j'k'}^{SA})}{\text{exp}(\lambda_{j'k}^{SA})\text{exp}(\lambda_{jk'}^{SA})}$$

 How do these relate to partial and marginal tables? How would you interpret the parameters? Do you notice any similarity in the parameter estimates from the joint independence model? What is the estimated coefficient of DS association for department A? Update, the given SAS or R code to fit the above model, run it, and SUBMIT your answers on ANGEL; notes below may help.

In SAS, this model can be fitted as:

Model Fit:

The goodness-of-fit statistics indicate that the model does not fit.

How did we get these DF?

df = (IJK-1) - ((I-1) + (J-1) + (K-1) + (I-1)(J-1) + (I-1)(K- 1)) = I(J-1)(K-1)

So where is the lack of fit?

As before, look at residuals. For example, the adjusted residual for the first cell is -12.14792, a great deviation from zero.

We can also evaluate overall the individual parameters and their significance:

This is like an ANOVA table in ANOVA and regression models. All parameters are significantly different from zero. That is they contribute significantly in describing the relationships between our variables, but the overall lack of fit of the model suggests that they are not sufficient.

In R, (DS, SA) model can be fitted as:

berk.cind=glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data$Dept+berk.data$Gender*berk.data$Dept+berk.data$Gender*berk.data$Admit, family=poisson())

Model Fit:

The goodness-of-fit statistics indicate that the model does not fit, e.g., Residual deviance:  783.6  on 10  degrees of freedom

How did we get these DF?

df = (IJK-1)-((I-1)+(J-1)+(K-1)+(I-1)(J-1)+(I-1)(K- 1))=I(J-1)(K-1)

So where is the lack of fit?

As before, look at residuals. For example, the pearson residual for the first cell is 7.551454, a great deviation from zero.

We can also evaluate overall the individual parameters and their significance, by running a command anova(berk.cind):

Analysis of Deviance TableModel: poisson, link: logResponse: berk.data$FreqTerms added sequentially (first to last) Df Deviance Resid. Df Resid. DevNULL 23 2650.10berk.data$Admit                   1   230.03        22    2420.07berk.data$Gender 1 162.87 21 2257.19berk.data$Dept                    5   159.52        16    2097.67berk.data$Gender:berk.data$Dept   5  1220.61        11     877.06berk.data$Admit:berk.data$Gender  1    93.45        10     783.61

This is like an ANOVA table in ANOVA and regression models. All parameters are significantly different from zero. That is they contribute significantly in describing the relationships between our variables, but the overall lack of fit of the model suggests that they are not sufficient. If you compare this with the ANOVA table from the SAS output given above, some of the values differ because of the different order we entered the variables into the model; just like in linear regression models. The values also differ because R gives you the values for the deviance statistic (e.g., "Deviance Resid") whereas SAS gives you the usual chi-squared. To compute the p-value here, e.g. for the last term, 1-pchisq(93.45, 1), which will be nearly 0 indicating that overall this term contributes significantly to the fit of the model.

# 10.2.6 - Homogeneous Association

The homogeneous associations model is also known as a model of no-3-way interactions, or as a model of no-second order interactions, (DS, DA, SA)

Objective: same as before...

Main assumptions:

• The N = IJK counts in the cells are assumed to be independent observations of a Poisson random variable, and
• there is no interaction between three categorical variables, i.e., $$\lambda_{ijk}^{DSA}=0$$ for all i, j, k

Model Structure:

$$\text{log}(\mu_{ijk})=\lambda+\lambda_i^D+\lambda_j^S+\lambda_k^A+\lambda_{ij}^{DS}+\lambda_{jk}^{SA}+\lambda_{ik}^{DA}$$

This model implies that ALL partial odds ratios are characterized by the two-way interaction terms, and that the associations between D and S do NOT depend on the levels of the variables A, and the associations between S and A do NOT depend on the levels of D, and the associations between D and A do NOT depend on the levels of S.

$$\theta_{DS(A=reject)}=\theta_{DS(A=accept)}=\dfrac{\text{exp}(\lambda_{ij}^{DS})\text{exp}(\lambda_{i'j'}^{DS})}{\text{exp}(\lambda_{i'j}^{DS})\text{exp}(\lambda_{ij'}^{DS})}$$

$$\theta_{SA(D="A")}=\ldots=\theta_{SA(D="F")}=\dfrac{\text{exp}(\lambda_{jk}^{SA})\text{exp}(\lambda_{j'k'}^{SA})}{\text{exp}(\lambda_{j'k}^{SA})\text{exp}(\lambda_{jk'}^{SA})}$$

$$\theta_{DA(S="male")}=\theta_{DA(S="female")}=\dfrac{\text{exp}(\lambda_{ik}^{DA})\text{exp}(\lambda_{i'k'}^{DA})}{\text{exp}(\lambda_{i'k}^{DA})\text{exp}(\lambda_{ik'}^{DA})}$$

Does this model fit? Even this model doesn't fit well, but it seems to fit better than the previous models, e.g., G2= 20.2251, df= 5, Value/df=4.0450

Think about the following question, then click on the icon to the left display an answer.

#### Can you figure out DF?

Parameter estimates and interpretation:

In SAS, this model can be fitted as:

Are all terms in the model significant (e.g. look at ’Type 3 Analysis output’); recall you need to use option "type3". For example, here is the ANOVA-like table that shows that SA association does not seem to be significant,

  LR Statistics For Type 3 Analysis

                          Chi-
Source           DF     Square    Pr > ChiSq
D                 5     360.60        <.0001
S                 1     252.95        <.0001
A                 1     302.79        <.0001
D*S               5    1128.13        <.0001
D*A               5     762.42        <.0001
S*A               1       1.44        0.2306

Here is part of the output from the "Analysis of Parameter Estimates" given the values for all the parameters,

                       Analysis Of Parameter Estimates                                                     Likelihood Ratio                                         Standard    95% Confidence       Chi-Parameter                  DF  Estimate     Error        Limits         SquareD          DeptC            1    2.2236    0.1649    1.9110    2.5587   181.85D          DeptD            1    1.7479    0.1682    1.4282    2.0888   108.04D          DeptE            1    1.4816    0.1762    1.1446    1.8369    70.70D          DeptF            0    0.0000    0.0000    0.0000    0.0000      .  S          Male             1   -0.0008    0.1064   -0.2096    0.2077     0.00S          Female           0    0.0000    0.0000    0.0000    0.0000      .  A          Reject           1    2.6261    0.1577    2.3286    2.9482   277.18A          Accept           0    0.0000    0.0000    0.0000    0.0000      .  D*S        DeptA   Male     1    2.0004    0.1357    1.7376    2.2698   217.35D*S        DeptA   Female   0    0.0000    0.0000    0.0000    0.0000      .  D*S        DeptB   Male     1    3.0753    0.2229    2.6576    3.5346   190.43D*S        DeptB   Female   0    0.0000    0.0000    0.0000    0.0000      .  D*S        DeptC   Male     1   -0.6637    0.1044   -0.8688   -0.4596    40.45D*S        DeptC   Female   0    0.0000    0.0000    0.0000    0.0000      .  D*S        DeptD   Male     1    0.0432    0.1058   -0.1640    0.2506     0.17D*S        DeptD   Female   0    0.0000    0.0000    0.0000    0.0000      .  D*S        DeptE   Male     1   -0.7934    0.1167   -1.0232   -0.5658    46.25D*S        DeptE   Female   0    0.0000    0.0000    0.0000    0.0000      .  D*S        DeptF   Male     0    0.0000    0.0000    0.0000    0.0000      .  D*S        DeptF   Female   0    0.0000    0.0000    0.0000    0.0000      .  D*A        DeptA   Reject   1   -3.3053    0.1700   -3.6498   -2.9822   378.17D*A        DeptA   Accept   0    0.0000    0.0000    0.0000    0.0000      .  D*A        DeptB   Reject   1   -3.2616    0.1788   -3.6225   -2.9206   332.89D*A        DeptB   Accept   0    0.0000    0.0000    0.0000    0.0000      .  D*A        DeptC   Reject   1   -2.0444    0.1679   -2.3846   -1.7252   148.31D*A        DeptC   Accept   0    0.0000    0.0000    0.0000    0.0000      .  D*A        DeptD   Reject   1   -2.0174    0.1699   -2.3614   -1.6940   141.01D*A        DeptD   Accept   0    0.0000    0.0000    0.0000    0.0000      .  D*A        DeptE   Reject   1   -1.5678    0.1804   -1.9306   -1.2219    75.49D*A        DeptE   Accept   0    0.0000    0.0000    0.0000    0.0000      .  D*A        DeptF   Reject   0    0.0000    0.0000    0.0000    0.0000      .  D*A        DeptF   Accept   0    0.0000    0.0000    0.0000    0.0000      .  S*A        Male    Reject   1    0.0967    0.0808   -0.0612    0.2556     1.43S*A        Male    Accept   0    0.0000    0.0000    0.0000    0.0000      .  S*A        Female  Reject   0    0.0000    0.0000    0.0000    0.0000      .  S*A        Female  Accept   0    0.0000    0.0000    0.0000    0.0000      .

Recall, we are interested in the highest-order terms, thus two-way associations here, and they correspond to log-odds ratios. For example, SA-Male,Reject is 0.0967. What does that tell us about estimated odds ratio between the gender and admission? Compute exp(0.0967) and do the same for the confidence bounds. You can infer that the odds of admission for males and females, ignoring the department, are the same. However, this model does not fit well, so we can't really rely on the inferences based on this model.

In R, here is one way of fitting this model,

berk.hom=glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data$Dept+berk.data$Gender*berk.data$Dept++berk.data$Dept*berk.data$Admit+berk.data$Gender*berk.data$Admit, family=poisson())

Are all terms in the model significant, e.g., look at the anova(berk.hom) output,

Analysis of Deviance Table
Model: poisson, link: log
Response: berk.data$Freq Terms added sequentially (first to last)  Df Deviance Resid. Df Resid. Dev NULL 23 2650.10 berk.data$Admit                   1   230.03        22    2420.07
berk.data$Gender 1 162.87 21 2257.19 berk.data$Dept                    5   159.52        16    2097.67
berk.data$Gender:berk.data$Dept   5  1220.61        11     877.06
berk.data$Admit:berk.data$Dept    5   855.32         6      21.74
berk.data$Admit:berk.data$Gender  1     1.53         5      20.20

We see that the last added term SA association (Admi-Gender) didn't contribute significantly to the fit of the model.

Now take a look at the parameter estimates, and for example association of Gender and Admit, where the value is -0.09987. Compute the exp(-0.09987) and of its confidence interval. What do you conclude? You should get an estimated odds-ratio that is NOT significantly different from 1, thus gender and admission seem to be marginally independent (i.e., ignoring the department). However, this model does not fit well (G2=20.05, df=5), so we can't really rely on the inferences based on this model.

-0.09987
Coefficients:                                               Estimate Std. Error z value Pr(>|z|)    (Intercept)                                     6.27150    0.04271 146.855  < 2e-16 ***berk.data$AdmitRejected -0.58205 0.06899 -8.436 < 2e-16 ***berk.data$GenderFemale                         -1.99859    0.10593 -18.866  < 2e-16 ***berk.data$DeptB -0.40322 0.06784 -5.944 2.78e-09 ***berk.data$DeptC                                -1.57790    0.08949 -17.632  < 2e-16 ***berk.data$DeptD -1.35000 0.08526 -15.834 < 2e-16 ***berk.data$DeptE                                -2.44982    0.11755 -20.840  < 2e-16 ***berk.data$DeptF -3.13787 0.16174 -19.401 < 2e-16 ***berk.data$GenderFemale:berk.data$DeptB -1.07482 0.22861 -4.701 2.58e-06 ***berk.data$GenderFemale:berk.data$DeptC 2.66513 0.12609 21.137 < 2e-16 ***berk.data$GenderFemale:berk.data$DeptD 1.95832 0.12734 15.379 < 2e-16 ***berk.data$GenderFemale:berk.data$DeptE 2.79519 0.13925 20.073 < 2e-16 ***berk.data$GenderFemale:berk.data$DeptF 2.00232 0.13571 14.754 < 2e-16 ***berk.data$AdmitRejected:berk.data$DeptB 0.04340 0.10984 0.395 0.693 berk.data$AdmitRejected:berk.data$DeptC 1.26260 0.10663 11.841 < 2e-16 ***berk.data$AdmitRejected:berk.data$DeptD 1.29461 0.10582 12.234 < 2e-16 ***berk.data$AdmitRejected:berk.data$DeptE 1.73931 0.12611 13.792 < 2e-16 ***berk.data$AdmitRejected:berk.data$DeptF 3.30648 0.16998 19.452 < 2e-16 ***berk.data$AdmitRejected:berk.data\$GenderFemale -0.09987    0.08085  -1.235    0.217

Next, we’ll explore some more these models in terms of model fit (e.g. deviance, residuals, ...) and model selection.

# 10.2.7 - Summary Inference for the "Admissions" example

We will look in more details the statistical inference and model selection with log-linear models for k-way tables in the next part of this lesson. For now, let's take a quick look at different parts that we consider when assessing the fit of the model and inference based on the model parameter estimates in the context of the Berkeley Admissions example. Recall that for each of the previously discussed models we looked at:

• Chi-squared goodness of fit for the overall fit of the model,
• Residuals,
• Tests about partial associations, and/or
• CI’s for odds ratios

#### Chi-squared goodness-of-fit tests

Recall, the null hypothesis is that the fitted model holds, that is the fitted model gives a good representation of the data versus the alternative that the saturated model fits well.

These are global tests for overall model fit, and and all assumptions we have been discussing about the chi-squared and deviance (likelihood-ratio) statistics hold.

Below is the table with all possible models for the Admissions example data. You will complete the rest of the table for the homework; i.e., fill in the fit statistics, dfs and p-values.

 Model df G2 p-value X2 p-value (D, S, A) 16 2097.2 < .001 1999.6 < .001 (DS, A) 11 876.6 < .001 797.1 < .001 (D, SA) (DA, S) (DS, SA) 10 782.65 < .001 714.30 < .001 (DS, DA) (DA, SA) (DS, DA, SA) 5 20.23 < .01 18.83 < .01 (DSA) 0 0

Which model would you choose based on the overall fit statistics? The homogeneous model looks like the best fit. Remember, the greater the number of parameters, the better the fit will be.  We will learn a bit later more about model selection. What we want is to find the simplest model that fits the best. We can look at all possible models, and with 3 random variables, there is 9 of such models. But sometimes we can also reduce the space of possible models based on the question we are trying to answer and the contextual knowledge. For example, thinking of A as a "response" and D, S as "covariates," it makes sense to focus attention on

• the null model (DS, A), which says that neither D nor S an effect on A;
• (DS, DA), which says that sex has no effect on A after the effect of department is included;
• (DS, SA), which says that department has no effect on A after the effect of sex is included;
• (DS, DA, SA), which says that the effect of sex on A is the same at each level of department; and
• (DSA), which says that the effect of sex on A varies across departments.

Once we find a model (or a few of them) that look(s) good (i.e., have a good fit), then we want to further explore the fit that is the lack-of fit. We can look at the residuals.

#### Residuals

Assessing how well are the cells fitted by the model. If you have selected a good model the cells should have small residuals.

As before we can consider Pearson residuals and adjusted residuals. Recall that the adjusted residuals are approximately N(0, 1), and if the model holds these should be close to zero. (The smaller the difference between the expected and the observed - the better the fit.)

There is a lack of fit if:

1. There are a few cells (small N) and adjusted residuals are greater than 2 or
2. The sample is large (large N) and adjusted residuals are greater than 3

#### Partial Association (PA) Tests

The partial association (PA) tests relate to testing of a specific coefficient in the model. The null hypothesis can be stated as:

1. there is no partial association of two variable given the third, or e.g. there is no partial association between department, D and acceptance/rejection, A., or
2. conditional (partial) odds ratios equals 1, or
3. two way interaction term equal zero.

To test this, we test the difference between a model that does NOT contain the partial association and a more complex model that includes this effect. We do this by computing the difference of deviance's (G2s) of the two models. The only difference between these two models is the inclusion of the parameter of interest in the alternative that is in the more complex model.

For example:

 Model df G2 p-value H0 Δdf ΔG2 (DS, DA, SA) 5 20.23 < .01 - - - (DS, SA) 10 782.65 < .001 $$\lambda_{ik}^{DA}=0$$ 5 762.42 (DS, DA) (DA, SA)

We are interested in testing if  (DA) term is statistically significant, that is H0: λDA = 0. We compare the the model (DS, SA) with a more complex model (DS,DA,SA) where the only difference between these two models is the (DA) term. Difference in deviances then is, 782.65-20.23=762.42 and degrees of freedom, 10-5=5. Since Value/df=762.42/5 is much larger than 1, we can reject the null hypothesis, thus there seems to be significant association between D and A given S.

Complete the rest of the table for homework, and make appropriate inferences based upon what you find. Notice that you should do this only for nested models. Here is a hierarchical or nested listing:

(ABC) - saturated model

(AB, BC, AC) - homogeneous - 3 way interaction = 0

(AB, BC)    (AB, AC)    (BC, AC)

(A, BC)       (B, AC)      (C, AB)

(A, B, C)

For example, if we want to test the significance of (BC) term, we could compare (A,BC) model to (A,B,C) model. Also reall that these models are hierarchical that is the models with higher association include lower order terms too, e.g. (A,BC) model has terms (A,B,C, and BC), but we use the shorthand notation (A,BC).

Can you suggest how you would you test for other parameters? Let's take a look at this more in depth...

# 10.2.8 - Inference for Log-linear Models for Higher-Way Tables

We will cast the discussion in terms of 3-way tables, but all the concepts easily extend to higher-way tables just require more calculations/computations. We will first see a bit more on an overall fit of the model and then discuss model selection in a search of the "best" model. We will work with a 2×2×2 table that summarizes a study of blue collar job satisfaction.

#### Key Concepts

• More on model fit
• Chi-squared goodness of fit
• Residuals
• CI’s for odds ratios
• Model Selection
• Substantive importance and considerations.
• Closeness between observed and fitted odds ratios.
• Dissimilarity index.
• Correlations between observed and fitted values.
• Bayesian information Criterion (& others).

#### Objectives

• Understand the basic inference in three-way and higher-way loglinear models
• Understand the concept of model selection, and the basic tools for log-linear model selection

• Agresti (2007) Ch. 7, 8
• Agresti (2013) Ch. 8, 9

# 10.2.9 - Model fit and lack-of-fit

### Example - Job Satisfaction

Blue collar workers job satisfaction from large scale investigation in Denmark in 1968 (Andersen, 1985). Three variables are (1) Worker (W) job satisfaction (low, high), (2) Supervisor (S) satisfaction (low, high) and (3) Quality of management (M) (bad, good).

References: See Categorical Data Analysis Using SAS (2000), Sec. 16.3.5 and collar.sas and/or collar.R (you will need collart.txt for the R code).

Here is a table of observed and fitted values for selected models:

 Manage Super Worker nijk M,S,W MS, W MS, MW MS, MW, SW bad low low 103 50.15 71.78 97.16 102.26 bad low high 87 82.59 118.22 92.84 87.74 bad high low 32 49.59 27.96 37.84 32.74 bad high high 42 81.67 46.04 36.16 41.26 good low low 59 85.10 63.47 51.03 59.74 good low high 109 140.15 104.53 116.97 108.26 good high low 78 84.15 105.79 85.97 77.26 good high high 205 138.59 174.21 197.28 205.74

Notice that as you add more parameters in your model, you get the closer fit of the expected to the observed values; this is to be expected since the saturated model will give us the perfect fit. But remember, our goal is to find the simplest possible model that explains the most variability in the data; i.e., the parsimonious model.

Let's see how we can assess the fit of each model and compare it to other models in order to render our final decision.

#### Chi-squared goodness-of-fit tests

For testing the overall fit of the model.

H0 : The fitted model (M0) fits well
HA : The saturated model (MA) fits well

L0 = loglikelihood for the fitted model
LA = loglikelihood for the saturated model

G2 = −2(L0LA)
df = df0dfA = # cells - # unique parameters
df = # unique parameters in MA - # unique parameters in M0.

 Model df G2 p-value X2 p-value (M, S, W) 4 118.00 < 0.001 128.09 < 0.001 (MS, W) 3 35.60 < 0.001 35.72 < 0.001 (MW, S) 3 87.79 < 0.001 85.02 < 0.001 (M, WS) 3 102.11 < 0.001 99.09 < 0.001 (MW, SW) 2 71.90 < 0.001 70.88 < 0.001 (MS, MW) 2 5.39 0.07 5.41 0.07 (MS, WS) 2 19.71 < 0.001 19.88 < 0.001 (MW, SW, MS) 1 .065 0.80 .069 0.80

These are global tests. All the assumptions and properties about the chi-squared and deviance statistics that we have learned thus far hold. Based on these values there are two reasonably well fitted models: conditional independence of S and W given M, (MS, MW), and homogeneous associations model, (MS,SW,MS).

#### Residuals

The residuals can be used to assess how well are the cells fitted by the model. A good model should have small residuals.

As before we can consider the Pearson residuals, adjusted residuals, and deviance residuals; we defined those in Lesson 2 -- recall that you can always use the "Search Engine" for the course to find certain info. For adjusted residuals, for example, if the model holds these should be close to zero.

There is a lack of fit if:

1. There are a few cells (small N) and adjusted residuals are greater than 2 or
2. The sample is large (large N) and adjusted residuals are greater than 3

Here is a table with observed values, fitted values for the two models of interest and their adjusted residuals.

 (MS, MW) (MS, MW, SW) Manage Super Worker nijk $$\hat{\mu}_{ijk}$$ adj res $$\hat{\mu}_{ijk}$$ adj res bad low low 103 97.16 1.60 102.26 .25 bad low high 87 92.84 -1.60 87.74 -.25 bad high low 32 37.84 -1.60 32.74 -.25 bad high high 42 36.16 1.60 41.26 .25 good low low 59 51.03 1.69 59.74 -.25 good low high 109 116.97 -1.69 108.26 .25 good high low 78 85.97 1.69 77.26 .25 good high high 205 197.28 -1.69 205.74 -.25

The reported residuals are really pretty small, and both models look good. Is there additional information that could help us to be a bit more confident about our decision to accept or reject the certain model?

Recall, in SAS, OBSTATS option in MODEL statement produces various residuals, e.g., StResChi:

                             Observation StatisticsObservation      count  manager  super  worker       Pred      Xbeta        Std                          HessWgt      Lower      Upper     Resraw     Reschi                           Resdev   StResdev   StReschi     Reslik          1        103  bad      low    low     97.159091  4.5763497   0.094248                        97.159091  80.771732   116.8712  5.8409091  0.5925687                        0.5867754   1.585496  1.6011497  1.5990147

Recall, in R, use the function residuals(). More specifically, if you use rstandard() gives you standardized deviance residuals, and rstudent() gives you likelihood residuals, e.g., rstandard(modelCI) from collar.R:

> rstandard(modelCI)        1         2         3         4         5         6         7         8  1.585496 -1.618395 -1.645241  1.560709  1.645956 -1.706941 -1.714344  1.676040

We can, for example, compare these two models to each other via the so called partial association test.

#### Partial Association (PA) Tests

H0 :   there is no partial association of two variable given the third, or

conditional (partial) odds ratios equal 1, or

two-way interaction term equal zero, e.g. $$\lambda_{jk}^{SW}=0$$

HA :   there is partial association of two variable given the third, or

conditional (partial) odds ratios is NOT equal 1, or

two-way interaction term is NOT equal zero, e.g. $$\lambda_{jk}^{SW} \neq 0$$

To test this, we test the difference between a model that does NOT contain the partial association of interest and a more complex model that includes this effect by looking at the likelihood ratio statistics or at the difference of deviances of the two models :

For example, let

M0 the restricted model (MS, MW)
M1 the more complex model (MS, MW, SW)
$$\Delta G^2=G^2_0-G^2_1$$
Δdf = df0 df1

Notice the similarity of the partial association test to the previously discussed goodness-of-fit test for assessing the "global" fit of the model. Here we are doing a relative comparison of two models, where is in the goodness-of-fit test we always compare the fitted model to the saturated model, that is, M1 here is MA in the goodness of fit test.

Here is a tabular summary that compares each of conditional independence models to the homogeneous association model.

 Model df G2 p H0 Δdf ΔG2 p-value (MW, SW, MS) 1 .065 0.80 - - - - (MW, SW) 2 71.90 < 0.001 $$\lambda_{ij}^{MS}=0$$ 1 71.835 < 0.001 (MS, MW) 2 5.39 0.07 $$\lambda_{jk}^{SW}=0$$ 1 5.325 < 0.01 (MS, WS) 2 19.71 < 0.001 $$\lambda_{ik}^{M W}=0$$ 1 19.645 < 0.001

From the above table, we can conclude that each of the terms significantly contributes to the better fit of the model, if that term is present, thus the homogeneous association models is a better fit than any of the conditional independence models.

Here is an animated graphical representation summarizing the fits and partial tests for all models in this example:

Seems thus far that the homogeneous model describe our data the best. We typically do inference on the chosen model in terms of odds ratios.

#### Confidence Intervals for Odds Ratios

Recall, the odds ratios are direct estimates of log-linear parameters.

Consider estimating the partial odds ratio for SW in our example from the homogeneous association model.

Based on SAS/GENMOD (see collar.sas and collarDIndex.sas )

$$\text{log}(\hat{\theta}_{SW(i)})=0.3872\text{ and }\hat{\theta}_{SW(i)}=\text{exp}(0.3872)=1.47$$

Recall that this is a Wald statistic, thus a

(1 − α) × 100% CI for $$\text{log}(\hat{\theta}_{SW(i)})$$

$$\text{log}(\hat{\theta}_{SW(i)}) \pm z_{\alpha/2}(ASE)$$

E.g. 95% CI for $$\text{log}(\hat{\theta}_{SW(i)})$$:

0.3872 ± 1.96(0.167) = (0.056, 0.709)

and 95% CI for $$\hat{\theta}_{SW(i)}$$:

(exp(0.056), exp(0.709) = (1.06, 2.03)

Based on SAS/GENMOD (see collar.sas and collarDIndex.sas )

$$\text{log}(\hat{\theta}_{SW(i)})=0.3872\text{ and }\hat{\theta}_{SW(i)}=\text{exp}(0.3872)=1.47$$

Recall that this is a Wald statistic, thus a

(1 − α) × 100% CI for $$\text{log}(\hat{\theta}_{SW(i)})$$

$$\text{log}(\hat{\theta}_{SW(i)}) \pm z_{\alpha/2}(ASE)$$

E.g. 95% CI for $$\text{log}(\hat{\theta}_{SW(i)})$$:

0.3872 ± 1.96(0.167) = (0.056, 0.709)

and 95% CI for $$\hat{\theta}_{SW(i)}$$:

(exp(0.056), exp(0.709) = (1.06, 2.03)

But the lower bound is very close to 1. This should raise the question for you: Is this term really significant? Could we go, after all, with a simpler model (MS,MW)?

#### Effects of sample size and practical significance

Of the two models in our example, which one is the ”better” model? How do I choose between the simpler more parsimonious model or the one that is more complex?

 Criterion (MS, MW) (MS, MW, SW ) Model goodness of fit G2 = 5.39 df = 2, p = .07 .065 df = 1, p = .80 Largest adjusted residual 1.69 .25 Likelihood ratio test of $$\lambda_{jk}^{SW}=0$$ - G2 = 5.325 df = 1, p = .03 Complexity simpler more complex

Do we really need the SW partial association?

It maybe a weak effect, i.e., it may not be important in a practical sense. Although it appears statistically significant which can be due to the large sample size (n = 715) for a relatively small table N = 23.

So how do we choose between these two? How do we consider the trade-off between a simple parsimonious model (practical significance) and a more complex and closer to reality model (statistical significance). This decision is often driven by the context of the problem and the above analysis we have done. But let's see some more statistics that we can use to aid us in the decision of choosing the "best" model.

# 10.2.10 - Model Selection

Choosing a single ”best” model among the presence of more than one reasonable model involves some subjective judgment.

Some additional things to consider in the model selection process:

• Substantive importance and considerations.
• Closeness between observed and fitted odds ratios.
• Dissimilarity index.
• Correlations between observed and fitted values.
• Information Criterion (AIC, & BIC, and others).

Automatic model selection procedures for log-linear models, such as stepwise, forward selection, backward elimination are either not available or limited in software packages. In SAS, neither PROC CATMOD or GENMOD can do these for log-linear models. In R, you can do this via the step() function, but you must exercise caution. An alternative, is to do model selection via graphical models and use a software called MIM, (which has an interface with R); you can explore this on your own --- see for some references at the end of this section.

#### Closeness between Observed and Fitted Odds Ratios

If the two models have nearly the same values for the odds ratio, choose the simpler one.

However, what is "nearly the same values"? What do we do?

This is a subjective decision, and depends on the purpose of the data.

 Fitted Odds Ratio Model W - S M - W M - S (MS, MW) 1.00 2.40 4.32 (MS, WS, MW) 1.47 2.11 4.04 Observed or (MSW) - level 1 1.55 2.19 4.26 (MSW) - level 2 1.42 2.00 3.90

They seem close enough, but what else can we look at...

#### Dissimilarity Index

The Dissimilarity Index is a summary statistic describing how close the fitted values of a model are to the observed data (e.g. observed counts). For a table of any dimension, it equals

$$D=\dfrac{\sum_i|n_i-\hat{\mu}_i|}{2n}=\dfrac{\sum_i|p_i-\hat{\pi}_i|}{2}$$

For our example used in this lesson, see collarDIndex.sas.

For the model (MW, MS):

D = 55.2306 / 2 (715) = 0.039

What does this mean? We could interpret this values such that, we would need to move 3.9% percent of the observations to achieve a perfect fit of the model (MW, MS) to observed (sample) data.

For the model (MW, MS, SW):

D = 5.8888 / 2 (715) = 0.004

Here we would only need to move 0.4% of the observations to achieve a perfect fit of the models (MW, MS, SW) to the observed data.

Which one do we use? Possibly the model of conditional independence. I, personally, could be OK with moving about 3.9% of the data.

#### Properties of D

• 0 ≤ D ≤ 1
• D = the proportion of sample cases that need to move to a different cell to have the model fit perfectly.
• Small D means that there is little difference between fitted values and observed counts.
• Larger D means that there is a big difference between fitted values and observed counts.
• D is an estimate of the change, Δ, which measures the lack-of-fit of the model in the population.
• When the model fits perfectly in the population: Δ = 0, and D overestimates the lack-of-fit (especially for small samples).
• For large samples when the model does not fit perfectly: G2 > and X2 will be large, and D reveals when the lack-of-fit is important in a practical sense.
• Rule-of-thumb: D < 0.03 indicates non-important lack-of-fit.

#### Correlations between observed and fitted counts

A large correlation value indicates that the observed and fitted values are close.

For our example, for the conditional independence model (MW, MS) in our example, r = 0.9906, and for the homogeneous model, r = 0.9999. Is there really a difference here? If I round to two decimal places, there is no difference at all!

#### Information Criteria

These are different statistics that compare fitted and observed values, that is the loglikelihoods of two models, but take into account the complexity of the model and/or sample size.

The complexity typically refers to the number of parameters in the model.

Note: Unlike partial association tests, these statistics do NOT require the models to be nested.

The two most commonly used ones are:

Akiake Information Criteria (AIC)

AIC = −2 × LogL + 2 × number of parameters

Bayesian Information Criteria (BIC)

BIC = G2 − (df) × logN = −2 × log(B)

where N = total number of observations, and B = posterior odds.

Note: The smaller the value, the better the model.

There is considerable debate about which is better and in what situations.  See Sensitivity and specificity of information criteria from the Methodology Center for more.

#### BIC Statistic and Bayesian Model Selection

BIC gives a way of considering the trade-off between a simple parsimonious model (practical significance) and a more complex and closer to reality model (statistical significance).

Suppose we are considering a model M0, and comparing it to the saturated model, M1. Which model gives a better description of the main features as reflected in the data? (e.g. which of the two is more likely to be the true model? We can use posterior odds which then figure into the BIC:

$$B=\dfrac{P(M_0|X)}{P(M_1|X)}$$

where X = data.

• When comparing a set of models, choose the one with the smallest BIC value.
• If BIC is negative choose M0
• The models do not have to be nested.
• This procedure provides you with a consistent model in the sense that in large samples, it chooses the correct model with high probability.

The following table gives both BIC and AIC for our running example.

 Model df G2 p-value BIC Parameters AIC (MSW) 0 0.00 1.000 .00 8 - (MS)(MW)(SW) 1 .06 .800 -6.51 7 -13.94 (MW)(SW) 2 71.90 .000 58.76 6 59.90 (MS)(WS) 2 19.71 .000 6.57 6 7.71 (SM)(WM) 2 5.39 .068 -7.75 6 -6.61 (MW)(S) 3 87.79 .000 68.07 5 77.79 (WS)(M) 3 102.11 .000 82.07 5 92.11 (MS)(W) 3 35.60 .000 15.88 5 25.60 (M)(S)(W) 4 118.00 .000 91.71 4 110.00

Think about the following question, then click on the icon to the left display an answer.

#### Which model would you choose based on BIC? How about AIC? Is it the same model?

How did we get the values in the table above?

In , for example for (MS,MW) model:

Criteria For Assessing Goodness Of FitCriterion                 DF           Value        Value/DFDeviance                   2          5.3871          2.6936

Then, BIC=5.3871-2*log(715)=-7.75

But, keep in mind that this G2 is computed for comparing deviance of the fitted model to the saturated model. In some other software packages or for a different procedure in SAS this value maybe relative to a different model other than the saturated model. Another way to obtain G2 is to compute -2*logL from the given log likelihoods from the "Criteria For Assessing Goodness Of Fit", e.g.,

Log Likelihood                     2599.0752

G2=-2(logLik(cond.indep.model) - logLik(saturated.model) = -2(2599.0752-2601.7687)=5.38714

BIC: Locate the G2, and its df for the particular model, and the sample size N.

In , for example for (MS,MW) model, summary(modelCI) from collar.R gives:

    Null deviance: 208.7748  on 7  degrees of freedomResidual deviance:   5.3871  on 2  degrees of freedomAIC: 66.863

Then, BIC=5.3871-2*log(715)=-7.75

But, keep in mind that this G2 is computed for comparing deviance of the fitted model to the saturated model. In some other software packages or for a different procedure in SAS this value maybe relative to a different model other than the saturated model. Another way to obtain G2 is to compute -2*logL from the given log likelihoods by using the function logLik(), e.g.,

> logLik(modelCI)'log Lik.' -27.43141 (df=6)

G2=-2(logLik(modelCI) - logLik(modelSat))=5.38714

You can also compute BIC, by using AIC() function in R; for more details see collar.R and further below.

AIC: Locate the G2, and its df for the particular model, and the sample size N, or use provided functions.

In , for example for (MS,MW) model relative to the saturated model, the :

AIC=5.3871-2*6=-6.61

In , for example for (MS,MW) model, summary(modelCI) from collar.R gives:

    Null deviance: 208.7748  on 7  degrees of freedomResidual deviance:   5.3871  on 2  degrees of freedomAIC: 66.863

Then, relative to the saturated model, the AIC=5.3871-2*6=-6.61.

What is AIC=66.863 value? This is a "correct" AIC value but with a different normalizing constant in comparison to what is reported in the table above. You can use the AIC values produced by R directly and you will reach the same conclusions since the relative relationships of these values across different models still holds. As an exercise, run collar.R and look at the AIC values for all of our models, sort them out and show yourself that the lowest ones are for the (MS, MW) and (MS, SW, MS) models besides the saturated model.

You can use the AIC() function on each model, and also to get the BIC. For example,

AIC(modelCI) will produce 66.863.

AIC(modelCI, k=log(sum(count)))=AIC(modelCI, k=log(715)) will produce a BIC value for the (MS,MW) model. Then to obtain the BIC value as we did above, take the difference of the BIC values for the two models:

BIC=AIC(modelCI, k=log(sum(count)))-AIC(modelSat,k=log(sum(count)))

References:

• Raftery, A.E. (1985). A note on Bayes factors for log-linear contingency table models with vague prior information. Journal of the Royal Statistical Society, Series B.
• Raftery, A. E. (1986). Choosing models for cross-classifications. American Sociological Review, 51, 145146.
• Spiegelhalter, D.J. and Smith, A.F.M. (1982). Bayes Factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377387.

### Summary of model selection strategies higher-way tables

For higher-way (or k-way) tables, which are cross-classification of data on k discrete random variables,  the log-linear models are handled basically the same as you would for three-way tables, these are just more complex in terms of the number of possible models and parameters that you could examine. Higher -way tables are going to have many more two-way, three-way and k-way associations. For example, here is a 6-way table that cross-classifies workers by 6 binary random variables regarding their health conditions:

#### Model selection strategies with Loglinear models

Log-linear models are general in a sense that do not make an explicit distinction between responses and explanatory variables. They can be used if there are more than two response variables.

• Determine if some variables are responses and some explanatory. Include associations terms for the explanatory variables in the model. Focus your model search on models that relate the responses to explanatory variables.
• If a margin is fixed by design, included the appropriate term in the loglinear model (to ensure that the marginal fitted values from the model equals to observed margin).
• Try to determine the level of complexity that is necessary by fitting models with
• marginal/main effects only
• all 2-way associations
• all 3-way associations, etc....
• all highest-way associations.
• Try a backward elimination strategy (analogous to one discussed for logit models that we will see in the next lesson) or a stepwise procedure. Be careful in using computer algorithms; you are better off doing likelihood ratio tests, e.g. blue collar data.

You can also look at the set of Case Studies in the Case Studies folder which is still under the development! Some of those talk about logistic regression models that we have not seen yet. However, Smoking and Stress example includes a log-linear model with more than 3 random variables.

#### Graphical Models and Contingency Tables

Graphical models are useful and are widely applicable. Graphs can be used to:

• visually represent scientific content of models and thus facilitate communication,
• break down complex problems/models into smaller and simpler pieces that can be studied separately, and
• are natural data structures for programming.

For contingency tables graphical models can determine when marginal and partial associations are the same such that we can collapse a multi-way table into a smaller table (or tables) to study certain associations.

They represent substantive theories and hypotheses, which correspond to certain loglinear/logit models.

Two basic types of graphical models are (1) undirected graphical models, and (2) directed graphical models (DAGs or Bayesian Networks). Other types are chain graphs, ancestral graphs, etc...

Here are a number of references that you can use to pursue your interest in the graphical model methods:

• History:
• Statistical Physics (Gibbs, 1902)
• Genetics and Path analysis (Wright, 1921,1923, 1934
• Contingency tables (Barlett, 1935)
• Theory and application:
• Cowell, R.G., Dawid, A.P., Lauritzen, S.L, and Spiegelhalter, D.J. (1999) Probabilistic Networks and Expert Systems. NY: Springer-Verlag.
• Darroch, J.N., Lauritzen, S.L., and Speed, T.P. (1980). Markov fields and log-linear models for contingency tables. Annals of Statistics, 8, 522539.
• Edwards, D. (2000). Introductions to Graphical Modeling, 2nd edition. NY: SpringerVerlag. (includes MIM software).
• Lauritzen, S.L. (1996). Graphical Models. NY: Oxford Science Publications.
• Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics.Wiley.

Some on-line references

If time permits, and if there is interest, we can cover the basics of graphical models at the end of the course, or you can study these as a part of your project.