Thus far in the course we have alluded to loglinear models several times, but have never got down to the basics of it. When we dealt with interrelationships among several categorical variables, our focus had been on describing independence, interactions or associations between two, three or more categorical variables mostly via
Loglinear models go beyond a single summary statistics and specify how the cell counts depend on the levels of categorical variables. They model the association and interaction patterns among categorical variables. The loglinear modeling is natural for Poisson, Multinomial and ProductMutlinomial sampling. They are appropriate when there is no clear distinction between response and explanatory variables, or there are more than two responses. This is a major difference between logistic models and loglinear models. In the former a response is identified, but no such special status is assigned to any variable in loglinear modelling. By default loglinear models assume discrete variables to be nominal, but these models can be adjusted to deal with ordinal and matched data. Loglinear models are more general than logit models, but some loglinear models have direct correspondence to logit models.
Consider graduate admissions at Berkeley. We may consider all possible relationships among A = Admission, D = Department and S = Gender. Alternatively, we may consider A as response and D and S as covariates in which case the possible logit models are:
Corresponding to each of the above a loglinear model may be defined. The notations below follow those of Lesson 5.
“Equivalent," means that two models give equivalent goodnessoffit statistics relative to a saturated model, and equivalent expected counts for each cell. Loglinear models are not exactly the same as logit models, because the loglinear models describe the joint distribution of all three variables, whereas the logit models describe only the conditional distribution of A given D and S. Loglinear models have more parameters than the logit models, but the parameters corresponding to the joint distribution of D and S are not of interest.
In general, to construct a loglinear model that is equivalent to a logit model, we need to include all possible associations among the predictors. In the Berkeley example, we need to include DS in every model. This lesson will walkthrough examples how this is done in both SAS and R.
In subsequent sections we look at the loglinear models in more detail. The two great advantages of loglinear models are that they are flexible and they are interpretable. Loglinear models have all the flexibility associated with ANOVA and regression. We have mentioned before that loglinear models are also another form of GLM. They also have natural interpretations in terms of odds and frequently have interpretations in terms of independence, as we have shown above.
In this section we will focus on the following key concepts and objectives.
Key Concepts:
Objectives
Readings

Loglinear models for twoway tables describe associations and interaction patterns among two categorical random variables.
Recall, that a twoway ANOVA models the expected value of a continuous variable (e.g., plant length) depending on the levels of two categorical variables (e.g., low/high sunlight and low/high water amount). In contrast, the loglinear models express the cell counts (e.g., the number of plants in a cell) depending on levels of two categorical variables.
Let μ_{ij} be the expected counts, E(n_{ij}), in an I × J table, created by two random variables A and B.
Objective:
Model the cell counts: μ_{ij} = nπ_{ij}
Model structure:
An analogous saturated loglinear model to twoway ANOVA with interaction is:
\(\text{log}(\mu_{ij})=\mu+\alpha_i+\beta_j+\gamma_{ij}\)
or in the notation used by Agresti:
\(\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_{ij}^{AB}\)
where i = 1, ..., I, j = 1, ...., J, are levels of categorical random variables A and B, with constraints: Σ_{i} λ_{i} = Σ_{j} λ_{j} = Σ_{i} Σ_{j} λ_{ij} = 0, to deal with overparametrization. Overparametrization means that the number of parameters is more than that can be uniquely estimated. This model is overparametrized because term λ_{ij }already has I × J parameters corresponding to the cell means μ_{ij}. The constant, λ, and the "main effects", λ_{i} and λ_{j} give us additional 1 + I + J parameters. Superscripts denote variables A and B. We will see more on this in the next sections.
Model Assumptions:
The N = I × J counts in the cells are assumed to be independent observations from a Poisson random variable, $n_{ij}\sim$Poisson(μ_{ij}). The loglinear modeling is natural for Poisson, Multinomial and ProductMultinomial sampling, like we have discussed in earlier lectures.
Recall the Vitamin C study, a 2 × 2 example from Lesson 3. Are the type of treatment and contracting cold independent? If there are associated, in which way are they associated?
You already know how to answer the above questions via the chisquare test of independence, but now we want to model the cell counts with the loglinear model of independence and ask if this model fits well.
Given two categorical random variables, A and B, there are two main types of models we will consider:
Let us start with the model of independence.
Recall, that the independence can be stated in terms of cell probabilities as a product of marginal probabilities,
\(\pi_{ij}=\pi_{i+}\pi_{+j}\) i = 1, ..., I, j = 1, ..., J
and in terms of cell frequencies,
\(\mu_{ij}=n\pi_{ij}=n\pi_{i+}\pi_{+j}\) i = 1, ..., I, j = 1, ..., J
By taking natural logarithms on both sides of "=" sign, we obtain the loglinear model of independence:
\begin{array}{lcl}
\text{log}(\mu_{ij}) &=& \text{log}n+\text{log}\pi_{i+}+\text{log}\pi_{+j}\\
\text{log}(\mu_{ij}) &=& \lambda+\lambda_i^A+\lambda_j^B
\end{array}
where superscripts A and B are just used to denote the two categorical variables. The second equation is a more standard notation also used in your textbook by Agresti.
This is an ANOVA typerepresentation where:
The maximum likelihood (ML) fitted values for the cell counts are the same as the expected (fitted) values under the test of independence in two way tables, i.e., E(μ)_{ij} = n_{i+}n_{+j}/n. Thus, the X^{2} and G^{2} for the test of independence are goodnessoffit statistics for the loglinear model of independence testing that the independence model holds versus that it does not, or more specifically testing that the independence model is true vs. saturated model is true. This model also implies that ALL odds ratios should be equal to 1.
For an I × J table, and the model
\(\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B\)
there are I terms in the set {\(\lambda^A_I\)}, but one of them is redundant so there are I − 1 unknown parameters, e.g., {λ_{1}^{A}, ... , λ_{I1}^{A}}, and there are J terms in the set {\(\lambda^B_J\)}, and one is redundant, so there are J − 1 unknown parameters, e.g., {λ_{1}^{B} , ...., λ_{J1}^{B}}. (Why is one of them redundant?) There can be many different parameterizations, that is there is NOT a unique set of parameters. Regardless which set we use, we need to set the constraints to account for redundant parameters. Nonexistence of unique set of parameters does not mean that the expected cell counts will change depending on which set of parameters is being used. It simply means that the estimates of the effects may be obtained under different sets of constraints, which will lead to different interpretations. But expected cell counts will remain the same.
DUMMY CODING: To avoid overparametrization, one member in the set of λ's is fixed to have a constant value, typically 0. This corresponds to using dummy coding for the categorical variables (e.g. A = 1, 0). By default, in SAS PROC GENMOD, the last level is set to 0, e.g., for the expected count in the first cell, and for the last cell,
\(\text{log}(\mu_{11})=\lambda+\lambda_1^A+\lambda_1^B=\lambda+\lambda_1^A+\lambda_1^B\)
\(\text{log}(\mu_{22})=\lambda+\lambda_2^A+\lambda_2^B=\lambda+0+0=\lambda\)
By default, in R glm() the first level of the categorical variable is set to 0, e.g., for the first and the last cell,
\(\text{log}(\mu_{11})=\lambda+\lambda_1^A+\lambda_1^B=\lambda+0+0=\lambda\)
\(\text{log}(\mu_{22})=\lambda+\lambda_2^A+\lambda_2^B=\lambda+\lambda_2^A+\lambda_2^B\)
ANOVAtype CODING: Another way to avoid overparametrization is to fix the sum of the terms equal to a constant, typically 0. That is the ANOVAtype constraint. This corresponds to using the so called ”effect” coding for categorical variables (e.g. A = 1, 0, −1). By default, SAS PROC CATMOD and R loglin(), use the zerosum constraint, e.g., the expected cell count in the first cell and the last cell,
\(\text{log}(\mu_{11})=\lambda+\lambda_1^A+\lambda_1^B\)
\(\text{log}(\mu_{22})=\lambda+\lambda_2^A+\lambda_2^B=\lambda\lambda_1^A\lambda_1^B\)
We will see more on these with a specific example in the next section.
We can have different parameter estimates (i.e, different values of λ's) depending on type of constraints we set. So, what is unique about these parameters that will lead us to the "same" inference regardless of parametrization? The differences, that is the log odds, are unique:
\(\lambda_i^A\lambda_{i'}^A\)
\(\lambda_j^B\lambda_{j'}^B\)
where subscript i denotes one level of categorical variable A and i' denotes another level of the same variable; similarly for B.
Thus the odds is also unique!
\begin{align}
\text{log}(odds) &= \text{log}\left(\dfrac{\mu_{i1}}{\mu_{i2}}\right)=\text{log}(\mu_{i1})\text{log}(\mu_{i2})\\
&= (\lambda+\lambda_i^A+\lambda_1^B)(\lambda+\lambda_i^A+\lambda_2^B)=\lambda_1^B\lambda_2^B\\
\end{align}
odds = exp(λ_{1}^{B} − λ_{2}^{B})
The odds ratio measures the strength of the association and depends only on the interaction terms {\(\lambda_{ij}^{AB}\) }, which clearly does not appear in this model, but we will discuss it when we see the saturated loglinear model.
Do you recall, how many odds ratios do we need to completely characterize associations in I × J tables?
Objective: Model the cell counts: μ_{ij} = nπ_{ij}; that is, the response is the cell count.
Main assumptions:
Model Structure:
\(\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B\)
where A and B stand for two categorical variables.
MLE's:
\(E(\mu_{ij})=m_{ij}=\hat{\mu}_{ij}=\dfrac{n_{i+}n_{+j}}{n}\)
Parameter constraints: can be different, but e.g., ANOVA type are
\(\sum_i \lambda_i^A=\sum_j \lambda_j^B =0=\sum_i\sum_j \lambda_{ij}^{AB}=0\)
Parameter estimation and interpretation:
The values of the parameter estimates may differ based on the constraints, but the estimates of the cell counts, odds and odds ratios will be same! Thus, our inferences will be the same!
For example for ANOVA like constraints, for a 2 × 2 table, the expected cell counts are:
\(\lambda=\dfrac{1}{IJ}\sum\sum \text{log}(m_{ij})\), and log(odds ratio): \(\text{log}(\Theta)=4\lambda_{11}^{AB}\)
Model Fit:
Use the Pearson X^{2} statistic and deviance or likelihoodratio test statistic, \(G^2=2\text{log} \sum\sum n_{ij}\text{log}\left(\dfrac{n_{ij}}{m_{ij}}\right)\) , as we have seen when we did the test for independence between two variables.
Effectively, we are testing, again,
H_{0} = independence model is true
vs.
H_{A} = saturated model is true.
Thus the deviance statistic is based on the likelihood ratios of the null model of independence versus the saturated model; recall the lecture notes from the earlier in the semester; e.g., goodnessoffit tests.
In general, recall, for the likelihood ratio test, we fit the full model (e.g. saturated model) and a reduced model (e.g. independence model). The LR statistic is
ΔG^{2} = 2 × (loglikelihood from full model −loglikelihood from reduced model) ,
and the pvalue is \(P(\Delta G^2 \geq \chi^2_k)\) where k is the difference in df between two models, or difference in the number of parameters, as before.
Model selection: In this case, use the same statistics as above.
Next we will consider how to fit this model in SAS and R.
Recall the Vitamin C study, a 2 × 2 example from Lesson 3. From the earlier analysis, we learned that the treatment and response for this data are not independent, e.g.,
(G^{2} = 4.87, df = 1, p−value = 0.023),
and we estimated various strength of associations, e.g. odds ratios 0.49, indicating that skiers who took placebo were twice more likely to get cold than the ones who took vitamin C.
Next we will consider how we answer the same question but by fitting the loglinear model of independence.
There are (at least) two ways of fitting these models in SAS and R. First, we will see the specification for two procedures in SAS (PROC CATMOD and PROC GENMOD); see VitaminCLoglin.sas and different parts of output in VitaminCLoglin.lst. Furthermore below, you can see more on the corresponding R functions loglin() and glm(), by running the Inspect video, but I suggest that the R users also read sections on SAS; see VitaminCLoglin.R and different parts of output in VitaminC.Loglin.out.
INPUT:
The MODEL statement (line) specifies that we are fitting the model on two variables treatment and response specified on the lefthand side of the MODEL statement. The MODEL statement only takes independent variables in its righthand side. In the case of fitting a loglinear model, we need to use a keyword _response_ on the righthand side.
The LOGLIN statement specifies that it's a model of independence, because we are stating that the model has only main effects, one for "treatment" variable and the other for the "response" variable, in this case.
OUTPUT:
What are the odds for being in the placebo group? Note that this question is not really relevant here, but it's a good exercise to figure out if you understand the matching between the response matrix contrasts and how do we come up with estimates of the odds. Hint: log(odds)=0.00716. 
Fits a loglinear model as a Generalized Linear Model (GLM), and by default uses dummy coding. We will mostly use PROC GENMOD in this class.
INPUT:
The CLASS statement (line) specifies that we are dealing with two random variables, treatment and response.
The MODEL statement (line) specifies that our response are the cell counts, and link=log that we are modeling log of counts; two variables should form a model of independence, and error distribution, that is sampling distribution is Poisson; the remaining specifications ask for different parts of output to be printed on your screen or in the .lst document. For other options see SAS help.
OUTPUT:
To do the same analysis in R we can use loglin() function that corresponds to PROC CATMOD and glm() function that corresponds to PROC GENMOD. See, VitaminCLoglin.R and VitaminC.Loglin.out and the viewlet which runs stepbystep through the commands and the output.
(viewlet for VitaminCLoglin.R)
Here is a brief comparison to loglin() and glm() output with respect to model estimates.
The following command fits the loglinear model of independence, and from the part of the output that gives parameter estimates, we see that they correspond to the output from CATMOD, cold=exp(Λ_{cold}^{response}+Λ_{nocold}^{response})=exp(0.78560.7856)=exp(1.5712).
loglin(ski, list(1, 2), fit=TRUE, param=TRUE)
$param$"(Intercept)"
[1] 3.963656
$param$Treatment
Placebo VitaminC
0.003584245 0.003584245
$param$Cold
Cold NoCold
0.7856083 0.7856083
The following command fits the loglinear model of independence using glm(), and from the part of the output that gives parameter estimates, we see that they correspond to the output from GENMOD, but with the signs reversed, e.g. odds of nocold=exp(Λ_{nocold}^{response})=exp(1.5712), thus odds of cold exp(1.5712).
glm(ski.data$Freq~ski.data$Treatment+ski.data$Cold, family=poisson())
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 3.181632 0.156179 20.372 <2e16 ***
ski.data$TreatmentVitaminC 0.007168 0.119738 0.060 0.952
ski.data$ColdNoCold 1.571217 0.158626 9.905 <2e16 ***
Crossclassification of respondents according to choice for the president in 1992 presidential election (Bush, Clinton, Perot) and political view on the 7 point scale (extremely liberal, liberal, slightly liberal, moderate, slightly conservative, conservative, extremely conservative).
Let's group the 7 point scale to 3 levels, and consider a 3 × 3 table:
Bush

Clinton

Perot

Total


Liberal 
70

324

56

450

Moderate 
195

332

101

628

Conservative 
382

199

117

698

Total

647

855

274

1774

Are political view and choice independent? You already know how to do this with the chisquared test of independence. For example, see vote.sas (output: vote.lst ) or vote.R. (output: (viewlet for vote.R)),
and compare the results of chisquared tests with the loglinear model output.
Here is part of the output from SAS PROC FREQ ...
and a part from the output of the PROC GENMOD procedure.
Question: Can you identify the X^{2} and G^{2} statistics in these two outputs? What is your conclusion about independence? Does the model fit well or not? Can you identify the loglinear independence model and relevant part of the output? Note that if "Value/df" is much greater than 1, you have sufficient evidence to reject the model. 
______________________________________________________
Here is what the loglinear model would be:
\(\text{log}(\mu_{ij})=\lambda+\lambda_i^{pview}+\lambda_j^{choice}\)
Recall, that we assume the counts are coming from a Poisson distribution and that all odds ratios are equal to 1. The explicit assumption is that the interaction term is equal to 0. Our null hypothesis is that the above specified loglinear model of independence fits versus the alternative that the saturated model fits.
What is the LR test statistic? Identify it in the PROC FREQ, PROC GENMOD and PROC CATMOD part of the output. Notice, that in GENMOD, "conservative" and "perot" are reference levels. Thus equation for each "Choice" is:
Bush: log(μ_{i1}) = λ + λ_{i}^{pview} + λ_{1}^{choice}
Clinton: log(μ_{i2}) = λ + λ_{i}^{pview} + λ_{2}^{choice}
Perot: log(μ_{i3}) = λ + λ_{i}^{pview}  λ_{1}^{choice}  λ_{2}^{choice }in CATMOD. Notice in GENMOD with dummy coding this would be: log(μ_{i3}) = λ + λ_{i}^{pview} +λ_{3}^{choice}
If we want, for example, probability of being a liberal and voting for Bush:
LibBush: log(μ_{11}) = λ + λ_{1}^{pview} + λ_{1}^{choice} =4.680.439+0.859
How about the odds? We would look at the difference of the above equations; also see the previous page on general set up of parameters. For example,
BushClinton: λ + λ_{i}^{pview} + λ_{1}^{choice}  (λ + λ_{i}^{pview} + λ_{2}^{choice}) = λ_{1}^{choice} λ_{2}^{choice}
BushPerot: 2 λ_{1}^{choice}+λ_{2}^{choice }in CATMOD. Notice in GENMOD with dummy coding this would be: λ_{1}^{choice } λ_{3}^{choice}
ClintonPerot: λ_{1}^{choice}+2 λ_{2}^{choice } in CATMOD. Notice in GENMOD with dummy coding this would be: λ_{2}^{choice}  λ_{3}^{choice}
Think about the following question, then click on the icon to the left display an answer. Who had better odds based on this data, Bush or Clinton, e.g. 647/855? 
CATMOD: \(\text{exp}(\lambda_1^B\lambda_2^B)=\text{exp}(0.19350.4722)=0.756\)
GENMOD: \(\text{exp}(0.85921.1380)=0.756\)R, glm(): exp(pchoiceclinton)=exp(0.27876)=1.3215 which are the odds of Clinton vs. Bush, so Bush vs. Clinton is 1/1.3215=0.756
Think about the following question, then click on the icon to the left display an answer. How about Bush vs. Perot, e.g. 647/274? 
CATMOD: \(\text{exp}(2\lambda_1^B+\lambda_2^B)=\text{exp}(2\times 0.1935+0.4722)=2.361\)
GENMOD: \(\text{exp}(\lambda_1^B)=\text{exp}(0.8592)=2.361\)R, glm(): 1/exp(pchoiceperot)=exp(pchoiceperot)=exp(0.86992)=2.361
Question: Why doesn't the model fit well? Do you see any unusual residuals? 
Here is the output from GENMOD in SAS. The highlighted row are the possible residulas values as we dicussed earlier. Let's look at the standardized Perason residulas; recall they have approximate N(0,1) distribution, so we are looking for the absolute values which are greater than 2 or 3. Notice, for example, for the first cell, the value is 10.649., clearly a large residuls. Look at the other cells.
In R, glm() there are a number of ways to get the residuals, but one would be to use residuals() function, and to specify the type we want, e.g., the following code produces pearson residuals, the standardized (adjusted) residulas, with the formatted output. Notice the same values for the fist cell as we got from GENMOD above.
resids < residuals(vote.ind,type="pearson")
h < lm.influence(vote.ind)$hat
adjresids < resids/sqrt(1h)
round(cbind(count,fits,adjresids),2)
count fits adjresids
1 70 163.94 10.65
2 324 216.64 11.72
3 56 69.43 2.03
4 195 228.78 3.48
5 332 302.33 2.95
6 101 96.89 0.57
7 382 254.28 12.89
8 199 336.03 13.32
9 117 107.69 1.25
Objective: Model the cell counts: μ_{ij} = nπ_{ij}
Main assumptions:
Model structure:
\(\text{log}(\mu_{ij})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_{ij}^{AB}\)
Which constraints must hold? It depends on type of parameterization, and software you are using as each application brings with it differences in implementation. The key is, it doesn't matter, because your inference in the end will be the same. This we have already discussed to some length in case of independence models.You do, however, want to specify the procedure you used in any publication of your analysis, so you do want to understand the concepts involved here. Notice, now that we are also fitting an "interaction" term, that is the association term between A and B.
Parameter estimates and interpretation:
The constant and main effect λ's have the same meaning as before.
\({\lambda_{ij}^{AB}}'s\) (1) represent the interaction/association between two variables, (2) reflect the departure from independence, and (3) ensure that μ_{ij} = n_{ij}
The odds ratio is directly related to the interaction terms. For example, for a 2 × 2 table:
\begin{align}
\text{log}(\theta) &= \text{log}\left(\dfrac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}}\right)\\
&= \\
&= \lambda_{11}^{AB}+\lambda_{22}^{AB}\lambda_{12}^{AB}\lambda_{21}^{AB}
\end{align}
Think about the following question, then click on the icon to the left display an answer. Can you fill in the missing line for the equation above? 
How many odds ratios are there? There should be (I1) ×(J1) which should be equal to the unique number of λ_{ij}'s in the model.
How many unique parameters are there in this model?
Term

# of terms

#of constraints

# of unique parameters

λ

1

0

1

{λ_{i}^{A}}

I

1

I  1

{λ_{j}^{B}}

J

1

J  1

{λ_{ij}^{AB}}

I × J

I + J  1

(I  1) × (J  1)

I × J = N is a perfect fit!
The saturated model is when:
Model selection: None here, but we typically want a simpler model that smooths the data more, and it's more parsimonious. We will see later on the model selection. We compare other models, typically to the saturated model.
Let's evaluate the CATMOD and GENMOD output for VitaminCLoglin.sas and vote.sas and R code for those, VitaminCLoglin.R and vote.R.
To fit the saturated models, we need to specify or add an interaction term in our code.
In CATMOD:
The LOGLIN statement (line) now specifies the saturated model. treatmentresponse specification adds an interaction term to the main effects model.
In GENMOD:
The '*' sign in the response*treatment, indicates that we want the interaction term and ALL lower order terms; that is, the saturated model.
Notice, that we can also specify the saturated model by explicitly entering the main effects in the model statement (and I suggest you use this option because it will be easier to read the values of the parameter estimates from the output):
model count = response treatment response*treatment/link=log dist=poisson lrci type3 obstats;
In these two model specifications under GENMOD, you will see that we are fitting different number of parameters and the ouput sections on LR Statistics For Type 3 Analysis will be different, but the estimates of odds and oddsratios will be the same!
For the output, we have the same parts as we did for the independence model, except now we have more parameter estimates. We will consider the latter GENMOD parametrization and compare it to the CATMOD.
So, what are the odds of getting cold given that a person took a placebo pill? Let's first start with logodds:
\(\text{log}(\mu_{11}/\mu_{12})=\text{log}(\mu_{11})\text{log}(\mu_{12})=[\lambda+\lambda_1^A+\lambda_1^B+\lambda_{11}^{AB}][\lambda+\lambda_1^A+\lambda_2^B+\lambda_{12}^{AB}]\)
Based on GENMOD parameterization: λ_{1}^{B}+λ_{11}^{AB} λ_{2}^{B}  λ_{12}^{AB} = 1.978 + 0.7134  0  0 and the odds are exp(1.2574) = 0.2844
Based on CATMOD parameterization: [λ+λ_{1}^{A} + λ_{1}^{B}+λ_{11}^{AB} ] [ λ+λ_{1}^{A} λ_{1}^{B}  λ_{11}^{AB}] = 2 λ_{1}^{B}+2 λ_{11}^{AB }= 2(0.807 + 0.1784) and the odds are exp(log odds) = 0.2844
In R, loglin(),
loglin(ski, list(c(1, 2)), fit=TRUE, param=TRUE)
In R, glm(),
glm(ski.data$Freq~ski.data$Treatment*ski.data$Cold, family=poisson())
and like the note for SAS, you can also do
glm(ski.data$Freq~ski.data$Treatment+ski.data$Cold+ski.data$Treatment*ski.data$Cold, family=poisson())
Q1: What are the odds ratio based on CATMOD or loglin()? How about GENMOD or glm()? 
For example, from the second way of fitting model with GENMOD, odds ratio=exp(0.7134)=2.041, with 95% CI [exp(0.079), exp(1.3770)]. This corresponds to 1/2.041=0.49 which we saw before.
treatment*response placebo cold 1 0.7134 0.3293 0.0790 1.3770
From R, glm(), we get the same:
ski.data$TreatmentVitaminC:ski.data$ColdNoCold 0.7134
Q2: Is the interaction of treatment and response significant (hint: look at LR Statistics for Type 3 Analysis). Where does df = 3 come from? 
Q3: Model fits perfectly. Let's do some model selection. Compare the independence and saturated model, by calculating the LR statistics (e.g. difference in deviances). \(2\times (973.0657970.629)\approx 4.8717\) How many df does this statistic have? 
How about data in vote.sas or vote.R? What can you say about the saturated model? What are the estimates of the oddsratios? 
These models include all lower order terms that comprise higherorder terms in the model.
(A, B) is a simpler model than (AB)
Interpretation does not depend on how the variables are coded.
We want hierarchical models so that interaction terms represent associations. If there is a significant interaction term, we do NOT look at the lowerorder terms, but only interpret the higherorder terms because the value of lowerorder terms is coding dependent and can be misleading.
Next we'll see how these models extend to threeway tables.
In this section we will extend the concepts we learned about loglinear models for twoway tables to threeway tables. We will learn how to fit varous models of independence discussed in Lesson 5, e.g., conditional independence, joint independence and homogenous associations model. We will also learn additional statistics, besides the usual X^{2} and G^{2}, to assess the model fit, and to choose the "best" model.
Key concepts:
Objectives:
Readings

Expanding the loglinear model notation to 3way tables:
\(\text{log}(\mu_{ijk})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_k^C+\lambda_{ij}^{AB}+\lambda_{ik}^{AC}+\lambda_{jk}^{BC}+\lambda_{ijk}^{ABC}\)
The main questions for this lesson are:
As before for threeway tables, there are multiple models we can test, and now fit. The loglinear models we will fit and evaluate are:
Let us go back to our familiar dataset on graduate admissions at Berkeley:
Dept.

Men rejected

Men accepted

Women rejected

Women accepted

A

313

512

19

89

B

207

353

8

17

C

205

120

391

202

D

278

139

244

131

E

138

53

299

94

F

351

22

317

24

Let D = department, S = sex, and A = admission status (rejected or accepted). We analyzed this as a threeway table before, and specifically we looked at partial and marginal tables. Now we will look at it from a loglinear model point of view. You will work with this example on the next homework as well. Let y be the frequency or count in a particular cell of the threeway table. See berkelyLoglin.sas (and berkeley.sas) or berkeleyLoglin.R (and berkeley.R).
This model is the default model in a way that serves for testing of goodnessoffit of the other models. Recall that saturated model has the maximum number of parameters and fitting a saturated model is the same as estimating ML parameters of distributions appropriate for each cell of the contingency table.
Objective: Model the cell counts: μ_{ijk} = nπ_{ijk}
Main assumptions:
Model structure:
\(\text{log}(\mu_{ijk})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_k^C+\lambda_{ij}^{AB}+\lambda_{ik}^{AC}+\lambda_{jk}^{BC}+\lambda_{ijk}^{ABC}\)
Parameter constraints: can be different, but ANOVA type constraints imply:
\(\sum_i\lambda_i^A=\sum_j\lambda_j^B=\sum_i\sum_j\lambda_{ij}^{AB}=\sum_i\sum_j\sum_k\lambda_{ijk}^{ABC}=0\)
Parameter estimation and interpretation:
If there is a significant interaction term, we DO NOT look at the lowerorder terms, but only interpret the higherorder terms because the values of lowerorder terms are coding dependent and can be misleading. The highestorder terms in loglinear models correspond to the so called minimal sufficient statistics (MSS) for loglinear models. The MSSs are margins of the table, and in this case it is the entire table since the highestorder term is the threeway association and we have three random variables. This makes it equivalent to fitting independent Poisson distribution to each cell of the I x J rable.
Model Fit: Saturated model has a prefect fit, G^{2} = 0, df = 0. df =number of cells − number of unique parameters in the model.
Model Selection: Relevant when comparing to simpler models. Saturated model is the most complex model possible!!
We will first discuss berkeleyLoglin.sas, and then berkeleyLoglin.R.
Using PROC GENMOD, let us fit the saturated loglinear model.
When you use the order=data option, GENMOD orders the levels of class variables in the same order as they appear in the dataset. For each class variable, GENMOD creates a set of dummy using the last category as a reference group. Recall the CATMOD and GENMOD coding from Twoway Loglinear models.
Therefore, we can interpret a twoway association as a logodds ratio for the two variables in question, with the other variable held constant at its last category (i.e., conditional oddsratios).
Here’s a portion of the SAS output, that is the table of ML estimates.
For how to do this in R see berkeleyLoglin.R which uses the loglin() and glm() functions.
The intercept is a normalizing constant and should be ignored. The main effects for D, S and A are all difficult to interpret and not very meaningful since we have significant twoway and threeway associations; the two and threeway associations are highly meaningful. For example, the estimated coefficient for the SA association is 0.1889.
Exponentiating this coefficient gives
\(\text{exp}(0.1889)=1.208\),
which is the estimated SA odds ratio for Department F since that is the reference department in this analysis. The reference group for S is "female," and the reference group for A is "accept." If we write the 2 × 2 table for S × A in Department F, i.e., the "partial" table, with the reference groups in the last row and column, we get
Dept F

Reject

Accept

Men 
351

22

Women 
317

24

for which the estimated odds ratio is:
\(351 \times 24/317 \times 22=1.208\).
The Wald zstatistic for this coefficient,
\(z=0.1889/0.3052\),
which corresponds to Chisquare statistic 0.62^{2} = 0.38 with pvalue 0.5359, and indicates that the SA odds ratio for Department F is not significantly different from 1.00 or that the log odds ratio is not significantly different from 0.
The 95% confidence interval for the parameter estimate, that is for the log odds ratio, is (−0.4092, 0.7870). Thus the 95% confidence interval for the odds ratio is
\((\text{exp}(0.4092),\text{exp}(0.7870))=(0.67,2.20)\).
Compare this to the output from PROC FREQ with CMH option
To get the SA odds ratio for any other department, we have to combine the SA coefficient with one of the DSA coefficients. For example, the SA odds ratio for Department A is
\(\text{exp}(0.1889+0.8632)=2.864\).
Based on:
\begin{align}
\theta_{SA(i="A")} &= \dfrac{\mu_{ijk} \mu_{ij'k'}}{\mu_{ij'k} \mu_{ijk'}}\\
&= \\
&= (\lambda_{jk}+\lambda_{j'k'}\lambda_{j'k}\lambda_{jk'})+(\lambda_{ijk}+\lambda_{ij'k'}\lambda_{ij'k}\lambda_{ijk'})\\
&= (0.1889+000)+(0.8632+000)
\end{align}
The Wald zstatistic for the first DSA coefficient,
\(z=0.8632/0.4027\),
indicates that the SA odds ratio for Department A is significantly different from the SA odds ratio in Department F. To see if the SA odds ratio in Department A is significantly different from 1.00, we would have to compute the standard error for the sum of the two coefficients using the estimated covariance matrix, or refit the model by fixing the level of interest equal to 0.
____________________________________________________________
Let us see about parameter estimates based on the output from berkeleyLoglin.R. More specifically we will consider the output based on the following R line from our code:
berk.sat<glm(berk.data$Freq~berk.data$Admit*berk.data$Gender*berk.data$Dept, family=poisson())
The intercept is a normalizing constant and should be ignored. The main effects for D, S and A are all difficult to interpret and not very meaningful since we have significant twoway and threeway associations; the two and threeway associations are highly meaningful. For example, the estimated coefficient for the SA association is 1.0521, from
berk.data$AdmitRejected:berk.data$GenderFemale 1.05208 0.26271 4.005 6.21e05 ***
Exponentiating this coefficient gives
\(\text{exp}(1.0521)=0.3492\),
which is the estimated SA odds ratio for Department A since that is the reference department in this analysis. The reference group for S is "male," and the reference group for A is "accept." If we write the 2 × 2 table for S × A in Department A, i.e., the "partial" table, with the reference groups in the last row and column, we get
Dept A

Reject

Accept

Women  19 
89

Men 
313

512 
for which the estimated odds ratio is:
\(19 \times 512/313 \times 89=0.3492\)
The Wald zstatistic for this coefficient,
\(z=1.0521/0.2627=4.005\)
which corresponds to Chisquare statistic (4.005)^{2} = 16.04 with pvalue nearly 0, and indicates that the SA odds ratio for Department A is significantly different from 1.00 or that the log odds ratio is significantly different from 0.
The 95% confidence interval for the parameter estimate, that is for the log odds ratio, is 1.0521±1.96(0.2627). Thus the 95% confidence interval for the odds ratio is
\((\text{exp}(1.567),\text{exp}(0.537))=(0.209,0.584)\)
To get the SA odds ratio for any other department, we have to combine the SA coefficient with one of the DSA coefficients. For example, the SA odds ratio for Department F is
\(\text{exp}(1.0521+0.8632)=\text{exp}(0.1889)=0.828\)
and if you compare this with the calculation from SAS above, we get the same oddsratio, that is 1/0.828=1.028. Check the boxed explanation in SAS above, how we derived this.
The Wald zstatistic for the last DSA coefficient,
\(z=0.8632/0.4027=2.14\)
indicates that the SA odds ratio for Department F is significantly different from the SA odds ratio in Department A. To see if the SA odds ratio in Department F is significantly different from 1.00, we would have to compute the standard error for the sum of the two coefficients using the estimated covariance matrix, or refit the model by fixing the level of interest equal to 0.
In many situations, however, we take recourse to saturated model only as a last resort. As the number of variables grow, saturated models become more and more difficult to interpret. In the following sections we look at simpler models which are useful in explaining the associations among the discrete variables of interest.
Objective: Only have the main effects for the three variables; we are testing here for complete independence among these three variables.
Main assumptions:
\begin{array}{lcl}
\theta_{AB(k)} &=& \dfrac{\mu_{ijk}\mu_{i'j'k}}{\mu_{i'jk}\mu_{ij'k}}\\
\theta_{AC(j)} &=& \dfrac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}}\\
\theta_{BC(i)} &=& \dfrac{\mu_{ijk}\mu_{ij'k'}}{\mu_{ij'k}\mu_{ijk'}}
\end{array}
Model Structure:
\(\text{log}(\mu_{ijk})=\lambda+\lambda_i^A+\lambda_j^B+\lambda_k^C\)
A, B and C stand for three categorical variables.
MLE’s:
\(E(\mu_{ijk})=m_{ijk}=\hat{\mu}_{ijk}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}\)
Parameter constraints: can be different, but ANOVA type are ...
Can you fill in the constraints here? 
Write the model in terms of the variables, and recall the loglinear notation we used when we introduced the complete independence model in the Threeway table notes.
In SAS, the model of complete independence (D, S, A) can be fitted like this (see berkeleyLoglin.sas):
What are the estimated odds of male vs. female in this example? From the output (run the code) the ML estimate of the parameter SMale, thus, the odds of being male are higher than being female applicant:
exp(0.382) = 1.467 = 2691/1835
with pvalue < .0001 indicating that the odds are significantly different. Note these are odds, not odds ratios! (When we are dealing with main effects we do not look at odds ratios.)
How about odds of being rejected? What can you conclude from the part of the output below?
But, we should really check the overall fit of the model first, to determine if these estimates are meaningful.
Model Fit:
The goodnessoffit statistics indicate that the model does not fit.
If the model fits well, the "Value/DF" would be close to 1.
Do you recall how we get the DFs?
df = number of cells  number of fitted parameters in the model.
df = number of fitted parameters in the saturated model  number of fitted parameters in our model.
Recall that these goodnessoffit statistics, compare our fitted model to the saturated model. Thus our model does not fit well in comparison to the saturated model.
Let's see parts of relevant R code, (berekeleyLoglin.R), and one way to fit the model of independence is:
berk.ind<glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data$Dept, family=poisson())
What are the odds of being a female vs. male applicant? See,
berk.data$GenderFemale 0.38287 0.03027 12.647 < 2e16 ***
Thus the odds are exp(0.3829)=0.68. Compare this to the result we got from SAS (see above) which was 1.467=1/0.68. The same result just odds of male vs. female.
Can you figure out the odds of being rejected?
But, we should really check the overall fit of the model first, to determine if these estimates are meaningful.
Model Fit:
The goodnessoffit statistics indicate that the model does not fit.
Null deviance: 2650.1 on 23 degrees of freedom
Residual deviance: 2097.7 on 16 degrees of freedom
Here the "residual deviance" is the likelihoodratio statistic, G^{2}=2097.7 with df=16. If the ratio of G^{2}/df is much greater than 1, you can reject the fitted model, or compute the pvalue for this statistic like we did in the earlier lessons. Thus the model of complete independence does NOT fit well in comparison to the saturated model, i.e., reject the model of complete independence. The "null deviance" is the deviance for the intercept only model.
Next, let us see an example of the joint independence model.
Two variables jointly independent of the third, e.g., (DS,A), (DA, S), (SA,D). We will not go through all of these potential models, but instead take a look at one of these. Consider, for example, modeling (D, S) jointly independent from A.
Objective:
Main assumptions:
\begin{array}{lcl}
\theta_{DA(j)} &=& \dfrac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}}=1\\
\theta_{SA(i)} &=& \dfrac{\mu_{ijk}\mu_{ij'k'}}{\mu_{ij'k}\mu_{ijk'}}=1
\end{array}
Model Structure:
\(\text{log}(\mu_{ijk})=\lambda+\lambda_i^D+\lambda_j^S+\lambda_k^A+\lambda_{ij}^{DS}\)
Parameter estimates and interpretation:
In SAS, this model can be fitted like this:
This model implies that association between D and S does NOT depend on level of the variable A, that is association between department and gender is independent of the rejection/acceptance decision.
\(\theta_{DS(A=reject)}=\theta_{DS(A=accept)}=\dfrac{\text{exp}(\lambda_{ij}^{DS})\text{exp}(\lambda_{i'j'}^{DS})}{\text{exp}(\lambda_{i'j}^{DS})\text{exp}(\lambda_{ij'}^{DS})}\)
Does this control for variable A or ignore it? (see berkeley.sas). Since we are assuming that (DS) are independent of levels of A, then we are assuming that DS given A is the same as DS so the marginal DS associations are the same as partial association DS for each level of A, i.e., if this model fits well, we can ignore A with respect to D and S.
The estimated coefficient of DS association (run berkeleyLoglin.sas) of 1.9436 implies
D*S DeptA Male 1 1.9436 0.1268 1.6950 2.1921 234.84
that the estimated odds ratio is
exp(1.9436) = 6.98 with 95% CI (exp(1.695), exp(2.192)) = (5.45, 8.96)
with reference group "female" and "department F".
But, we should really check the overall fit of the model first, to determine if these estimates are meaningful.
Model Fit:
The goodnessoffit statistics indicate that the model does not fit since the "Value/df" is much larger than 1. Recall, you could also compute the pvalue the G^{2} or X^{2} here too.
How did we get these DF?
df = (IJK1)((I1)+(J1)+(K1)+(I1)(J1))
df = (IJ1)(K1)
So where is the lack of fit?
As before, look at residuals. Recall that adjusted residuals have approximately N(0, 1) distribution, e.g., standardized Pearson residuals (StReschi). In general, we have a lack of fit, (1) if we have a large number of cells and adjusted residuals are greater than 3, or (2) if we have a small number of cells and adjusted residuals are greater than 2. Here is only part of the output for the 2x2x6 table. Notice that the absolute value of the StReschi for the first five cells are all large, e.g., in cell 1, StResch = 15.1632. If you look at the residuals for other cells, you will notice that most of them are rather large, indicating that this model fits poorly on the entire data.
Evaluate the residuals:
In R, here is one way to fit (DS, A) model:
berk.join=glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data$Dept+berk.data$Gender*berk.data$Dept, family=poisson())
From the output (add the above function to berkeleyLoglin.R and run it), the estimated coefficient of DS association for "female" and "department F" is 1.9436,
berk.data$GenderFemale:berk.data$DeptF 1.94356 0.12683 15.325 < 2e16 ***
Thus, the estimated oddsratio is exp(1.9436) = 6.98 with 95% CI (exp(1.695), exp(2.192)) = (5.45, 8.96). Males are about 7 times more likely than females to be in department A (the reference level here) than in the department F. For example, this is the same value we would get by estimating the sample odds ratio in this table.

Dept A

Dept F

Men 
825

373 
Women 
108

341

But, we should really check the overall fit of the model first, to determine if these estimates are meaningful.
Model Fit:
The goodnessoffit statistics indicate that the model does not fit since the "Value/df" is much larger than 1, e.g, for the deviance statistc:
Residual deviance: 877.06 on 11 degrees of freedom
Recall, you could also compute the pvalue the G^{2} here too, by running 1pchisq(877.06,11).
How did we get these DF?
df = (IJK1)((I1)+(J1)+(K1)+(I1)(J1))
df = (IJ1)(K1)
So where is the lack of fit?
As before, look at residuals. Recall that adjusted residuals have approximately N(0, 1) distribution, e.g., standardized Pearson residuals (StReschi). In general, we have a lack of fit, (1) if we have a large number of cells and adjusted residuals are greater than 3, or (2) if we have a small number of cells and adjusted residuals are greater than 2.
Recall from VitaminCLoglin.R, to get Pearson residuals, residuals(berk.join, "pearson"), and to get standardize residulas, you need to adjust them by their value of the "influence", e.g., residuals(berk.join, "pearson")/sqrt(lm.influence(berk.join)$hat). Here are the values for the first three observations, and they are all very large.
15.1987867 10.3430897 11.4793671
If you look at the residuals for other cells, you will notice that most of them are rather large, indicating that this model fits poorly on the entire data.
Next, let us look at what is often the most interesting model of conditional independence.
Two variables conditionally independent given the third, e.g., (DA, SA), (DS, SA), (DS, DA). Consider modeling D independent from A given S.
Objective:
Main assumptions:
\(\theta_{DA(j)} = \dfrac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}}=1\)
Model Structure:
\(\text{log}(\mu_{ijk})=\lambda+\lambda_i^D+\lambda_j^S+\lambda_k^A+\lambda_{ij}^{DS}+\lambda_{jk}^{SA}\)
Parameter estimates and interpretation:
This model implies that partial odds ratios are characterized by the twoway interaction terms, and that the associations between D and S do NOT depend on the levels A, and the associations between S and A do NOT depend on the levels of D.
\(\theta_{DS(A=reject)}=\theta_{DS(A=accept)}=\dfrac{\text{exp}(\lambda_{ij}^{DS})\text{exp}(\lambda_{i'j'}^{DS})}{\text{exp}(\lambda_{i'j}^{DS})\text{exp}(\lambda_{ij'}^{DS})}\)
\(\theta_{SA(D="A")}=\ldots=\theta_{SA(D="F")}=\dfrac{\text{exp}(\lambda_{jk}^{SA})\text{exp}(\lambda_{j'k'}^{SA})}{\text{exp}(\lambda_{j'k}^{SA})\text{exp}(\lambda_{jk'}^{SA})}\)
How do these relate to partial and marginal tables? How would you interpret the parameters? Do you notice any similarity in the parameter estimates from the joint independence model? What is the estimated coefficient of DS association for department A? Update, the given SAS or R code to fit the above model, run it, and SUBMIT your answers on ANGEL; notes below may help. 
In SAS, this model can be fitted as:
Model Fit:
The goodnessoffit statistics indicate that the model does not fit.
How did we get these DF?
df = (IJK1)  ((I1) + (J1) + (K1) + (I1)(J1) + (I1)(K 1)) = I(J1)(K1)
So where is the lack of fit?
As before, look at residuals. For example, the adjusted residual for the first cell is 12.14792, a great deviation from zero.
We can also evaluate overall the individual parameters and their significance:
This is like an ANOVA table in ANOVA and regression models. All parameters are significantly different from zero. That is they contribute significantly in describing the relationships between our variables, but the overall lack of fit of the model suggests that they are not sufficient.
In R, (DS, SA) model can be fitted as:
berk.cind=glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data
$Dept+berk.data$Gender*berk.data$Dept+berk.data$Gender*berk.data$Admit, family=poisson())
Model Fit:
The goodnessoffit statistics indicate that the model does not fit, e.g., Residual deviance: 783.6 on 10 degrees of freedom
How did we get these DF?
df = (IJK1)((I1)+(J1)+(K1)+(I1)(J1)+(I1)(K 1))=I(J1)(K1)
So where is the lack of fit?
As before, look at residuals. For example, the pearson residual for the first cell is 7.551454, a great deviation from zero.
We can also evaluate overall the individual parameters and their significance, by running a command anova(berk.cind):
Analysis of Deviance Table
Model: poisson, link: log
Response: berk.data$Freq
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 23 2650.10
berk.data$Admit 1 230.03 22 2420.07
berk.data$Gender 1 162.87 21 2257.19
berk.data$Dept 5 159.52 16 2097.67
berk.data$Gender:berk.data$Dept 5 1220.61 11 877.06
berk.data$Admit:berk.data$Gender 1 93.45 10 783.61
This is like an ANOVA table in ANOVA and regression models. All parameters are significantly different from zero. That is they contribute significantly in describing the relationships between our variables, but the overall lack of fit of the model suggests that they are not sufficient. If you compare this with the ANOVA table from the SAS output given above, some of the values differ because of the different order we entered the variables into the model; just like in linear regression models. The values also differ because R gives you the values for the deviance statistic (e.g., "Deviance Resid") whereas SAS gives you the usual chisquared. To compute the pvalue here, e.g. for the last term, 1pchisq(93.45, 1), which will be nearly 0 indicating that overall this term contributes significantly to the fit of the model.
The homogeneous associations model is also known as a model of no3way interactions, or as a model of nosecond order interactions, (DS, DA, SA)
Objective: same as before...
Main assumptions:
Model Structure:
\(\text{log}(\mu_{ijk})=\lambda+\lambda_i^D+\lambda_j^S+\lambda_k^A+\lambda_{ij}^{DS}+\lambda_{jk}^{SA}+\lambda_{ik}^{DA}\)
This model implies that ALL partial odds ratios are characterized by the twoway interaction terms, and that the associations between D and S do NOT depend on the levels of the variables A, and the associations between S and A do NOT depend on the levels of D, and the associations between D and A do NOT depend on the levels of S.
\(\theta_{DS(A=reject)}=\theta_{DS(A=accept)}=\dfrac{\text{exp}(\lambda_{ij}^{DS})\text{exp}(\lambda_{i'j'}^{DS})}{\text{exp}(\lambda_{i'j}^{DS})\text{exp}(\lambda_{ij'}^{DS})}\)
\(\theta_{SA(D="A")}=\ldots=\theta_{SA(D="F")}=\dfrac{\text{exp}(\lambda_{jk}^{SA})\text{exp}(\lambda_{j'k'}^{SA})}{\text{exp}(\lambda_{j'k}^{SA})\text{exp}(\lambda_{jk'}^{SA})}\)
\(\theta_{DA(S="male")}=\theta_{DA(S="female")}=\dfrac{\text{exp}(\lambda_{ik}^{DA})\text{exp}(\lambda_{i'k'}^{DA})}{\text{exp}(\lambda_{i'k}^{DA})\text{exp}(\lambda_{ik'}^{DA})}\)
Does this model fit? Even this model doesn't fit well, but it seems to fit better than the previous models, e.g., G^{2}= 20.2251, df= 5, Value/df=4.0450
Think about the following question, then click on the icon to the left display an answer. Can you figure out DF? 
Parameter estimates and interpretation:
In SAS, this model can be fitted as:
Are all terms in the model significant (e.g. look at ’Type 3 Analysis output’); recall you need to use option "type3". For example, here is the ANOVAlike table that shows that SA association does not seem to be significant,
LR Statistics For Type 3 Analysis
Chi
Source DF Square Pr > ChiSq
D 5 360.60 <.0001
S 1 252.95 <.0001
A 1 302.79 <.0001
D*S 5 1128.13 <.0001
D*A 5 762.42 <.0001
S*A 1 1.44 0.2306
Here is part of the output from the "Analysis of Parameter Estimates" given the values for all the parameters,
Analysis Of Parameter Estimates
Likelihood Ratio
Standard 95% Confidence Chi
Parameter DF Estimate Error Limits Square
D DeptC 1 2.2236 0.1649 1.9110 2.5587 181.85
D DeptD 1 1.7479 0.1682 1.4282 2.0888 108.04
D DeptE 1 1.4816 0.1762 1.1446 1.8369 70.70
D DeptF 0 0.0000 0.0000 0.0000 0.0000 .
S Male 1 0.0008 0.1064 0.2096 0.2077 0.00
S Female 0 0.0000 0.0000 0.0000 0.0000 .
A Reject 1 2.6261 0.1577 2.3286 2.9482 277.18
A Accept 0 0.0000 0.0000 0.0000 0.0000 .
D*S DeptA Male 1 2.0004 0.1357 1.7376 2.2698 217.35
D*S DeptA Female 0 0.0000 0.0000 0.0000 0.0000 .
D*S DeptB Male 1 3.0753 0.2229 2.6576 3.5346 190.43
D*S DeptB Female 0 0.0000 0.0000 0.0000 0.0000 .
D*S DeptC Male 1 0.6637 0.1044 0.8688 0.4596 40.45
D*S DeptC Female 0 0.0000 0.0000 0.0000 0.0000 .
D*S DeptD Male 1 0.0432 0.1058 0.1640 0.2506 0.17
D*S DeptD Female 0 0.0000 0.0000 0.0000 0.0000 .
D*S DeptE Male 1 0.7934 0.1167 1.0232 0.5658 46.25
D*S DeptE Female 0 0.0000 0.0000 0.0000 0.0000 .
D*S DeptF Male 0 0.0000 0.0000 0.0000 0.0000 .
D*S DeptF Female 0 0.0000 0.0000 0.0000 0.0000 .
D*A DeptA Reject 1 3.3053 0.1700 3.6498 2.9822 378.17
D*A DeptA Accept 0 0.0000 0.0000 0.0000 0.0000 .
D*A DeptB Reject 1 3.2616 0.1788 3.6225 2.9206 332.89
D*A DeptB Accept 0 0.0000 0.0000 0.0000 0.0000 .
D*A DeptC Reject 1 2.0444 0.1679 2.3846 1.7252 148.31
D*A DeptC Accept 0 0.0000 0.0000 0.0000 0.0000 .
D*A DeptD Reject 1 2.0174 0.1699 2.3614 1.6940 141.01
D*A DeptD Accept 0 0.0000 0.0000 0.0000 0.0000 .
D*A DeptE Reject 1 1.5678 0.1804 1.9306 1.2219 75.49
D*A DeptE Accept 0 0.0000 0.0000 0.0000 0.0000 .
D*A DeptF Reject 0 0.0000 0.0000 0.0000 0.0000 .
D*A DeptF Accept 0 0.0000 0.0000 0.0000 0.0000 .
S*A Male Reject 1 0.0967 0.0808 0.0612 0.2556 1.43
S*A Male Accept 0 0.0000 0.0000 0.0000 0.0000 .
S*A Female Reject 0 0.0000 0.0000 0.0000 0.0000 .
S*A Female Accept 0 0.0000 0.0000 0.0000 0.0000 .
Recall, we are interested in the highestorder terms, thus twoway associations here, and they correspond to logodds ratios. For example, SAMale,Reject is 0.0967. What does that tell us about estimated odds ratio between the gender and admission? Compute exp(0.0967) and do the same for the confidence bounds. You can infer that the odds of admission for males and females, ignoring the department, are the same. However, this model does not fit well, so we can't really rely on the inferences based on this model.
In R, here is one way of fitting this model,
berk.hom=glm(berk.data$Freq~berk.data$Admit+berk.data$Gender+berk.data
$Dept+berk.data$Gender*berk.data$Dept++berk.data$Dept*berk.data$Admit+berk.data
$Gender*berk.data$Admit, family=poisson())
Are all terms in the model significant, e.g., look at the anova(berk.hom) output,
Analysis of Deviance Table
Model: poisson, link: log
Response: berk.data$Freq
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 23 2650.10
berk.data$Admit 1 230.03 22 2420.07
berk.data$Gender 1 162.87 21 2257.19
berk.data$Dept 5 159.52 16 2097.67
berk.data$Gender:berk.data$Dept 5 1220.61 11 877.06
berk.data$Admit:berk.data$Dept 5 855.32 6 21.74
berk.data$Admit:berk.data$Gender 1 1.53 5 20.20
We see that the last added term SA association (AdmiGender) didn't contribute significantly to the fit of the model.
Now take a look at the parameter estimates, and for example association of Gender and Admit, where the value is 0.09987. Compute the exp(0.09987) and of its confidence interval. What do you conclude? You should get an estimated oddsratio that is NOT significantly different from 1, thus gender and admission seem to be marginally independent (i.e., ignoring the department). However, this model does not fit well (G^{2}=20.05, df=5), so we can't really rely on the inferences based on this model.
0.09987
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 6.27150 0.04271 146.855 < 2e16 ***
berk.data$AdmitRejected 0.58205 0.06899 8.436 < 2e16 ***
berk.data$GenderFemale 1.99859 0.10593 18.866 < 2e16 ***
berk.data$DeptB 0.40322 0.06784 5.944 2.78e09 ***
berk.data$DeptC 1.57790 0.08949 17.632 < 2e16 ***
berk.data$DeptD 1.35000 0.08526 15.834 < 2e16 ***
berk.data$DeptE 2.44982 0.11755 20.840 < 2e16 ***
berk.data$DeptF 3.13787 0.16174 19.401 < 2e16 ***
berk.data$GenderFemale:berk.data$DeptB 1.07482 0.22861 4.701 2.58e06 ***
berk.data$GenderFemale:berk.data$DeptC 2.66513 0.12609 21.137 < 2e16 ***
berk.data$GenderFemale:berk.data$DeptD 1.95832 0.12734 15.379 < 2e16 ***
berk.data$GenderFemale:berk.data$DeptE 2.79519 0.13925 20.073 < 2e16 ***
berk.data$GenderFemale:berk.data$DeptF 2.00232 0.13571 14.754 < 2e16 ***
berk.data$AdmitRejected:berk.data$DeptB 0.04340 0.10984 0.395 0.693
berk.data$AdmitRejected:berk.data$DeptC 1.26260 0.10663 11.841 < 2e16 ***
berk.data$AdmitRejected:berk.data$DeptD 1.29461 0.10582 12.234 < 2e16 ***
berk.data$AdmitRejected:berk.data$DeptE 1.73931 0.12611 13.792 < 2e16 ***
berk.data$AdmitRejected:berk.data$DeptF 3.30648 0.16998 19.452 < 2e16 ***
berk.data$AdmitRejected:berk.data$GenderFemale 0.09987 0.08085 1.235 0.217
Next, we’ll explore some more these models in terms of model fit (e.g. deviance, residuals, ...) and model selection.
We will look in more details the statistical inference and model selection with loglinear models for kway tables in the next part of this lesson. For now, let's take a quick look at different parts that we consider when assessing the fit of the model and inference based on the model parameter estimates in the context of the Berkeley Admissions example. Recall that for each of the previously discussed models we looked at:
Recall, the null hypothesis is that the fitted model holds, that is the fitted model gives a good representation of the data versus the alternative that the saturated model fits well.
These are global tests for overall model fit, and and all assumptions we have been discussing about the chisquared and deviance (likelihoodratio) statistics hold.
Below is the table with all possible models for the Admissions example data. You will complete the rest of the table for the homework; i.e., fill in the fit statistics, dfs and pvalues.
Model

df

G^{2}

pvalue

X^{2}

pvalue

(D, S, A) 
16

2097.2

< .001

1999.6

< .001

(DS, A) 
11

876.6

< .001

797.1

< .001

(D, SA)  
(DA, S)  
(DS, SA) 
10

782.65

< .001

714.30

< .001

(DS, DA)  
(DA, SA)  
(DS, DA, SA) 
5

20.23

< .01

18.83

< .01

(DSA) 
0

0

Which model would you choose based on the overall fit statistics? The homogeneous model looks like the best fit. Remember, the greater the number of parameters, the better the fit will be. We will learn a bit later more about model selection. What we want is to find the simplest model that fits the best. We can look at all possible models, and with 3 random variables, there is 9 of such models. But sometimes we can also reduce the space of possible models based on the question we are trying to answer and the contextual knowledge. For example, thinking of A as a "response" and D, S as "covariates," it makes sense to focus attention on
Once we find a model (or a few of them) that look(s) good (i.e., have a good fit), then we want to further explore the fit that is the lackof fit. We can look at the residuals.
Assessing how well are the cells fitted by the model. If you have selected a good model the cells should have small residuals.
As before we can consider Pearson residuals and adjusted residuals. Recall that the adjusted residuals are approximately N(0, 1), and if the model holds these should be close to zero. (The smaller the difference between the expected and the observed  the better the fit.)
There is a lack of fit if:
The partial association (PA) tests relate to testing of a specific coefficient in the model. The null hypothesis can be stated as:
To test this, we test the difference between a model that does NOT contain the partial association and a more complex model that includes this effect. We do this by computing the difference of deviance's (G^{2}s) of the two models. The only difference between these two models is the inclusion of the parameter of interest in the alternative that is in the more complex model.
For example:
Model

df

G^{2}

pvalue

H_{0}

Δdf

ΔG^{2}

(DS, DA, SA) 
5

20.23

< .01







(DS, SA) 
10

782.65

< .001

\(\lambda_{ik}^{DA}=0\)

5

762.42

(DS, DA)  
(DA, SA) 
We are interested in testing if (DA) term is statistically significant, that is H_{0}: λ^{DA} = 0. We compare the the model (DS, SA) with a more complex model (DS,DA,SA) where the only difference between these two models is the (DA) term. Difference in deviances then is, 782.6520.23=762.42 and degrees of freedom, 105=5. Since Value/df=762.42/5 is much larger than 1, we can reject the null hypothesis, thus there seems to be significant association between D and A given S.
Complete the rest of the table for homework, and make appropriate inferences based upon what you find. Notice that you should do this only for nested models. Here is a hierarchical or nested listing:
(ABC)  saturated model
(AB, BC, AC)  homogeneous  3 way interaction = 0
(AB, BC) (AB, AC) (BC, AC)
(A, BC) (B, AC) (C, AB)
(A, B, C)
For example, if we want to test the significance of (BC) term, we could compare (A,BC) model to (A,B,C) model. Also reall that these models are hierarchical that is the models with higher association include lower order terms too, e.g. (A,BC) model has terms (A,B,C, and BC), but we use the shorthand notation (A,BC).
Can you suggest how you would you test for other parameters? Let's take a look at this more in depth...
We will cast the discussion in terms of 3way tables, but all the concepts easily extend to higherway tables just require more calculations/computations. We will first see a bit more on an overall fit of the model and then discuss model selection in a search of the "best" model. We will work with a 2×2×2 table that summarizes a study of blue collar job satisfaction.
Key Concepts
Objectives
Readings

Blue collar workers job satisfaction from large scale investigation in Denmark in 1968 (Andersen, 1985). Three variables are (1) Worker (W) job satisfaction (low, high), (2) Supervisor (S) satisfaction (low, high) and (3) Quality of management (M) (bad, good).
References: See Categorical Data Analysis Using SAS (2000), Sec. 16.3.5 and collar.sas and/or collar.R (you will need collart.txt for the R code).
Here is a table of observed and fitted values for selected models:
Manage

Super

Worker

n_{ijk}

M,S,W

MS, W

MS, MW

MS, MW, SW

bad

low

low

103

50.15

71.78

97.16

102.26

bad

low

high

87

82.59

118.22

92.84

87.74

bad

high

low

32

49.59

27.96

37.84

32.74

bad

high

high

42

81.67

46.04

36.16

41.26

good

low

low

59

85.10

63.47

51.03

59.74

good

low

high

109

140.15

104.53

116.97

108.26

good

high

low

78

84.15

105.79

85.97

77.26

good

high

high

205

138.59

174.21

197.28

205.74

Notice that as you add more parameters in your model, you get the closer fit of the expected to the observed values; this is to be expected since the saturated model will give us the perfect fit. But remember, our goal is to find the simplest possible model that explains the most variability in the data; i.e., the parsimonious model.
Let's see how we can assess the fit of each model and compare it to other models in order to render our final decision.
For testing the overall fit of the model.
H_{0} : The fitted model (M_{0}) fits well
H_{A} : The saturated model (M_{A}) fits wellL_{0} = loglikelihood for the fitted model
L_{A} = loglikelihood for the saturated modelG^{2} = −2(L_{0} − L_{A})
df = df_{0} − df_{A} = # cells  # unique parameters
df = # unique parameters in M_{A}  # unique parameters in M_{0}.
Model

df

G^{2}

pvalue

X^{2}

pvalue

(M, S, W) 
4

118.00

< 0.001

128.09

< 0.001

(MS, W) 
3

35.60

< 0.001

35.72

< 0.001

(MW, S) 
3

87.79

< 0.001

85.02

< 0.001

(M, WS) 
3

102.11

< 0.001

99.09

< 0.001

(MW, SW) 
2

71.90

< 0.001

70.88

< 0.001

(MS, MW) 
2

5.39

0.07

5.41

0.07

(MS, WS) 
2

19.71

< 0.001

19.88

< 0.001

(MW, SW, MS) 
1

.065

0.80

.069

0.80

These are global tests. All the assumptions and properties about the chisquared and deviance statistics that we have learned thus far hold. Based on these values there are two reasonably well fitted models: conditional independence of S and W given M, (MS, MW), and homogeneous associations model, (MS,SW,MS).
The residuals can be used to assess how well are the cells fitted by the model. A good model should have small residuals.
As before we can consider the Pearson residuals, adjusted residuals, and deviance residuals; we defined those in Lesson 2  recall that you can always use the "Search Engine" for the course to find certain info. For adjusted residuals, for example, if the model holds these should be close to zero.
There is a lack of fit if:
Here is a table with observed values, fitted values for the two models of interest and their adjusted residuals.
(MS, MW)

(MS, MW, SW)


Manage

Super

Worker

n_{ijk}

\(\hat{\mu}_{ijk}\)

adj res

\(\hat{\mu}_{ijk}\)

adj res

bad

low

low

103

97.16

1.60

102.26

.25

bad

low

high

87

92.84

1.60

87.74

.25

bad

high

low

32

37.84

1.60

32.74

.25

bad

high

high

42

36.16

1.60

41.26

.25

good

low

low

59

51.03

1.69

59.74

.25

good

low

high

109

116.97

1.69

108.26

.25

good

high

low

78

85.97

1.69

77.26

.25

good

high

high

205

197.28

1.69

205.74

.25

The reported residuals are really pretty small, and both models look good. Is there additional information that could help us to be a bit more confident about our decision to accept or reject the certain model?
Recall, in SAS, OBSTATS option in MODEL statement produces various residuals, e.g., StResChi:
Observation Statistics
Observation count manager super worker Pred Xbeta Std
HessWgt Lower Upper Resraw Reschi
Resdev StResdev StReschi Reslik
1 103 bad low low 97.159091 4.5763497 0.094248
97.159091 80.771732 116.8712 5.8409091 0.5925687
0.5867754 1.585496 1.6011497 1.5990147
Recall, in R, use the function residuals(). More specifically, if you use rstandard() gives you standardized deviance residuals, and rstudent() gives you likelihood residuals, e.g., rstandard(modelCI) from collar.R:
> rstandard(modelCI)
1 2 3 4 5 6 7 8
1.585496 1.618395 1.645241 1.560709 1.645956 1.706941 1.714344 1.676040
We can, for example, compare these two models to each other via the so called partial association test.
H_{0} : there is no partial association of two variable given the third, or
conditional (partial) odds ratios equal 1, or
twoway interaction term equal zero, e.g. \(\lambda_{jk}^{SW}=0\)
H_{A} : there is partial association of two variable given the third, or
conditional (partial) odds ratios is NOT equal 1, or
twoway interaction term is NOT equal zero, e.g. \(\lambda_{jk}^{SW} \neq 0\)
To test this, we test the difference between a model that does NOT contain the partial association of interest and a more complex model that includes this effect by looking at the likelihood ratio statistics or at the difference of deviances of the two models :
For example, let
M_{0 } the restricted model (MS, MW)
M_{1} the more complex model (MS, MW, SW)
\(\Delta G^2=G^2_0G^2_1\)
Δdf = df_{0} − df_{1}Notice the similarity of the partial association test to the previously discussed goodnessoffit test for assessing the "global" fit of the model. Here we are doing a relative comparison of two models, where is in the goodnessoffit test we always compare the fitted model to the saturated model, that is, M_{1} here is M_{A} in the goodness of fit test.
Here is a tabular summary that compares each of conditional independence models to the homogeneous association model.
Model

df

G^{2}

p

H_{0}

Δdf

ΔG^{2}

pvalue

(MW, SW, MS) 
1

.065

0.80









(MW, SW) 
2

71.90

< 0.001

\(\lambda_{ij}^{MS}=0\)

1

71.835

< 0.001

(MS, MW) 
2

5.39

0.07

\(\lambda_{jk}^{SW}=0\)

1

5.325

< 0.01

(MS, WS) 
2

19.71

< 0.001

\(\lambda_{ik}^{M W}=0\)

1

19.645

< 0.001

From the above table, we can conclude that each of the terms significantly contributes to the better fit of the model, if that term is present, thus the homogeneous association models is a better fit than any of the conditional independence models.
Here is an animated graphical representation summarizing the fits and partial tests for all models in this example:
Seems thus far that the homogeneous model describe our data the best. We typically do inference on the chosen model in terms of odds ratios.
Recall, the odds ratios are direct estimates of loglinear parameters.
Consider estimating the partial odds ratio for SW in our example from the homogeneous association model.
Based on SAS/GENMOD (see collar.sas and collarDIndex.sas )
\(\text{log}(\hat{\theta}_{SW(i)})=0.3872\text{ and }\hat{\theta}_{SW(i)}=\text{exp}(0.3872)=1.47\)
Recall that this is a Wald statistic, thus a
(1 − α) × 100% CI for \(\text{log}(\hat{\theta}_{SW(i)})\)
\(\text{log}(\hat{\theta}_{SW(i)}) \pm z_{\alpha/2}(ASE)\)
E.g. 95% CI for \(\text{log}(\hat{\theta}_{SW(i)})\):
0.3872 ± 1.96(0.167) = (0.056, 0.709)
and 95% CI for \(\hat{\theta}_{SW(i)}\):
(exp(0.056), exp(0.709) = (1.06, 2.03)
Based on SAS/GENMOD (see collar.sas and collarDIndex.sas )
\(\text{log}(\hat{\theta}_{SW(i)})=0.3872\text{ and }\hat{\theta}_{SW(i)}=\text{exp}(0.3872)=1.47\)
Recall that this is a Wald statistic, thus a
(1 − α) × 100% CI for \(\text{log}(\hat{\theta}_{SW(i)})\)
\(\text{log}(\hat{\theta}_{SW(i)}) \pm z_{\alpha/2}(ASE)\)
E.g. 95% CI for \(\text{log}(\hat{\theta}_{SW(i)})\):
0.3872 ± 1.96(0.167) = (0.056, 0.709)
and 95% CI for \(\hat{\theta}_{SW(i)}\):
(exp(0.056), exp(0.709) = (1.06, 2.03)
But the lower bound is very close to 1. This should raise the question for you: Is this term really significant? Could we go, after all, with a simpler model (MS,MW)?
Of the two models in our example, which one is the ”better” model? How do I choose between the simpler more parsimonious model or the one that is more complex?
Criterion

(MS, MW)

(MS, MW, SW )

Model goodness of fit G^{2} = 
5.39
df = 2, p = .07 
.065
df = 1, p = .80 
Largest adjusted residual 
1.69

.25

Likelihood ratio test of \(\lambda_{jk}^{SW}=0\) 


G^{2} = 5.325
df = 1, p = .03 
Complexity 
simpler

more complex

Do we really need the SW partial association?
It maybe a weak effect, i.e., it may not be important in a practical sense. Although it appears statistically significant which can be due to the large sample size (n = 715) for a relatively small table N = 2^{3}.
So how do we choose between these two? How do we consider the tradeoff between a simple parsimonious model (practical significance) and a more complex and closer to reality model (statistical significance). This decision is often driven by the context of the problem and the above analysis we have done. But let's see some more statistics that we can use to aid us in the decision of choosing the "best" model.
Choosing a single ”best” model among the presence of more than one reasonable model involves some subjective judgment.
Some additional things to consider in the model selection process:
Automatic model selection procedures for loglinear models, such as stepwise, forward selection, backward elimination are either not available or limited in software packages. In SAS, neither PROC CATMOD or GENMOD can do these for loglinear models. In R, you can do this via the step() function, but you must exercise caution. An alternative, is to do model selection via graphical models and use a software called MIM, (which has an interface with R); you can explore this on your own  see for some references at the end of this section.
If the two models have nearly the same values for the odds ratio, choose the simpler one.
However, what is "nearly the same values"? What do we do?
This is a subjective decision, and depends on the purpose of the data.
Fitted Odds Ratio


Model

W  S

M  W

M  S


(MS, MW) 
1.00

2.40

4.32


(MS, WS, MW) 
1.47

2.11

4.04


Observed or  (MSW)  level 1 
1.55

2.19

4.26

(MSW)  level 2 
1.42

2.00

3.90

They seem close enough, but what else can we look at...
The Dissimilarity Index is a summary statistic describing how close the fitted values of a model are to the observed data (e.g. observed counts). For a table of any dimension, it equals
\(D=\dfrac{\sum_in_i\hat{\mu}_i}{2n}=\dfrac{\sum_ip_i\hat{\pi}_i}{2}\)
For our example used in this lesson, see collarDIndex.sas.
For the model (MW, MS):
D = 55.2306 / 2 (715) = 0.039
What does this mean? We could interpret this values such that, we would need to move 3.9% percent of the observations to achieve a perfect fit of the model (MW, MS) to observed (sample) data.
For the model (MW, MS, SW):
D = 5.8888 / 2 (715) = 0.004
Here we would only need to move 0.4% of the observations to achieve a perfect fit of the models (MW, MS, SW) to the observed data.
Which one do we use? Possibly the model of conditional independence. I, personally, could be OK with moving about 3.9% of the data.
A large correlation value indicates that the observed and fitted values are close.
For our example, for the conditional independence model (MW, MS) in our example, r = 0.9906, and for the homogeneous model, r = 0.9999. Is there really a difference here? If I round to two decimal places, there is no difference at all!
These are different statistics that compare fitted and observed values, that is the loglikelihoods of two models, but take into account the complexity of the model and/or sample size.
The complexity typically refers to the number of parameters in the model.
Note: Unlike partial association tests, these statistics do NOT require the models to be nested.
The two most commonly used ones are:
Akiake Information Criteria (AIC)
AIC = −2 × LogL + 2 × number of parameters
Bayesian Information Criteria (BIC)
BIC = G^{2} − (df) × logN = −2 × log(B)
where N = total number of observations, and B = posterior odds.
Note: The smaller the value, the better the model.
There is considerable debate about which is better and in what situations. See Sensitivity and specificity of information criteria from the Methodology Center for more.
BIC gives a way of considering the tradeoff between a simple parsimonious model (practical significance) and a more complex and closer to reality model (statistical significance).
Suppose we are considering a model M_{0}, and comparing it to the saturated model, M_{1}. Which model gives a better description of the main features as reflected in the data? (e.g. which of the two is more likely to be the true model? We can use posterior odds which then figure into the BIC:
\(B=\dfrac{P(M_0X)}{P(M_1X)}\)
where X = data.
The following table gives both BIC and AIC for our running example.
Model

df

G^{2}

pvalue

BIC

Parameters

AIC

(MSW) 
0

0.00

1.000

.00

8



(MS)(MW)(SW) 
1

.06

.800

6.51

7

13.94

(MW)(SW) 
2

71.90

.000

58.76

6

59.90

(MS)(WS) 
2

19.71

.000

6.57

6

7.71

(SM)(WM) 
2

5.39

.068

7.75

6

6.61

(MW)(S) 
3

87.79

.000

68.07

5

77.79

(WS)(M) 
3

102.11

.000

82.07

5

92.11

(MS)(W) 
3

35.60

.000

15.88

5

25.60

(M)(S)(W) 
4

118.00

.000

91.71

4

110.00

Think about the following question, then click on the icon to the left display an answer. Which model would you choose based on BIC? How about AIC? Is it the same model? 
How did we get the values in the table above?
In , for example for (MS,MW) model:
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 2 5.3871 2.6936
Then, BIC=5.38712*log(715)=7.75
But, keep in mind that this G^{2 }is computed for comparing deviance of the fitted model to the saturated model. In some other software packages or for a different procedure in SAS this value maybe relative to a different model other than the saturated model. Another way to obtain G^{2} is to compute 2*logL from the given log likelihoods from the "Criteria For Assessing Goodness Of Fit", e.g.,
Log Likelihood 2599.0752
G^{2}=2(logLik(cond.indep.model)  logLik(saturated.model) = 2(2599.07522601.7687)=5.38714
BIC: Locate the G^{2}, and its df for the particular model, and the sample size N.
In , for example for (MS,MW) model, summary(modelCI) from collar.R gives:
Null deviance: 208.7748 on 7 degrees of freedom
Residual deviance: 5.3871 on 2 degrees of freedom
AIC: 66.863
Then, BIC=5.38712*log(715)=7.75
But, keep in mind that this G^{2 }is computed for comparing deviance of the fitted model to the saturated model. In some other software packages or for a different procedure in SAS this value maybe relative to a different model other than the saturated model. Another way to obtain G^{2} is to compute 2*logL from the given log likelihoods by using the function logLik(), e.g.,
> logLik(modelCI)
'log Lik.' 27.43141 (df=6)
G^{2}=2(logLik(modelCI)  logLik(modelSat))=5.38714
You can also compute BIC, by using AIC() function in R; for more details see collar.R and further below.
AIC: Locate the G^{2}, and its df for the particular model, and the sample size N, or use provided functions.
In , for example for (MS,MW) model relative to the saturated model, the :
AIC=5.38712*6=6.61
In , for example for (MS,MW) model, summary(modelCI) from collar.R gives:
Null deviance: 208.7748 on 7 degrees of freedom
Residual deviance: 5.3871 on 2 degrees of freedom
AIC: 66.863
Then, relative to the saturated model, the AIC=5.38712*6=6.61.
What is AIC=66.863 value? This is a "correct" AIC value but with a different normalizing constant in comparison to what is reported in the table above. You can use the AIC values produced by R directly and you will reach the same conclusions since the relative relationships of these values across different models still holds. As an exercise, run collar.R and look at the AIC values for all of our models, sort them out and show yourself that the lowest ones are for the (MS, MW) and (MS, SW, MS) models besides the saturated model.
You can use the AIC() function on each model, and also to get the BIC. For example,
AIC(modelCI) will produce 66.863.
AIC(modelCI, k=log(sum(count)))=AIC(modelCI, k=log(715)) will produce a BIC value for the (MS,MW) model. Then to obtain the BIC value as we did above, take the difference of the BIC values for the two models:
BIC=AIC(modelCI, k=log(sum(count)))AIC(modelSat,k=log(sum(count)))
References:
For higherway (or kway) tables, which are crossclassification of data on k discrete random variables, the loglinear models are handled basically the same as you would for threeway tables, these are just more complex in terms of the number of possible models and parameters that you could examine. Higher way tables are going to have many more twoway, threeway and kway associations. For example, here is a 6way table that crossclassifies workers by 6 binary random variables regarding their health conditions:
Loglinear models are general in a sense that do not make an explicit distinction between responses and explanatory variables. They can be used if there are more than two response variables.
You can also look at the set of Case Studies in the Case Studies folder which is still under the development! Some of those talk about logistic regression models that we have not seen yet. However, Smoking and Stress example includes a loglinear model with more than 3 random variables.
Graphical models are useful and are widely applicable. Graphs can be used to:
For contingency tables graphical models can determine when marginal and partial associations are the same such that we can collapse a multiway table into a smaller table (or tables) to study certain associations.
They represent substantive theories and hypotheses, which correspond to certain loglinear/logit models.
Two basic types of graphical models are (1) undirected graphical models, and (2) directed graphical models (DAGs or Bayesian Networks). Other types are chain graphs, ancestral graphs, etc...
Here are a number of references that you can use to pursue your interest in the graphical model methods:
Some online references
If time permits, and if there is interest, we can cover the basics of graphical models at the end of the course, or you can study these as a part of your project.