This is the first time we are dealing with continuous variables in this course. Just like in any ordinary linear regression, the covariates may be both discrete and continuous. The basic principle for logistic regression is the same whether covariates are discrete or continuous, but some adjustments are necessary for goodnessoffit testing. HosmerLemeshaw statistic is very useful in this case.
Key Concepts
Objectives

In 1846, the Donner party (Donner and Reed families) left Springfield, Illinois for California in covered wagons. After reaching Fort Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains at a place now called Donner Pass (right) when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.
Three variables:
Y_{i} = 1 if person i survived
Y_{i} = 0 if person i diedX_{1i} age of person i
X_{2i} = 1 if person i is male
X_{2i} = 0 if person i is female.
The analyses may be carried with donner.sas or donner.R with the dataset donner.dat (or donner.txt) which contains N = 45 observations, and the following columns: 1) Age, 2) Sex (1 = male, 0 = female), 3) Survivorship (1 = survived, 0 = dead).
Analysis: How about Oneway ANOVA?
Problems: The fit of the oneway ANOVA model F_{1,43} = 5.35, pvalue = 0.03432, R^{2} = 10%, and:
Correct Analysis: Analyze the twoway contingency table. You can do this via chisquare test of independence or a binary logistic regression model.
Survivorship


Sex

Survived

Died

Total

Female 
10

5

15

Male 
10

20

30

Total 
20

25

45

Test:
H_{0 }: Survivorship is independent of gender
H_{a }: Survivorship depends on gender
Conclusion: The null hypothesis is rejected (X^{2} = 4.50, df = 1, pvalue = 0.0339). We may conclude that survivorship does depend on gender. Among females survivorship is 67% while among males it is 33%. Apparently, females’ chance of survival is twice that of males.
Problem: By now you should be suspicious of simply ignoring a variable by marginalization without good justification, and wary of Simpson's Paradox. The above analysis does not consider any dependency between survivorship and age. The table above is created by collapsing over age, which may lead to a wrong conclusion.
Naive model: Model the probability of survival given age:
\(P(Y=1X_1)=\beta_0+\beta_1 X_1\)
Linear regression model: Model expected survival given age:
\(E(Y=1X_1)=\beta_0+\beta_1 X_1\)
Problems:
Above is a plot of the fitted linear regression model with the observed data. The estimated model is: $E(survival  age)=0.86920.01336age$ Consider predicting survival for a 70 years old person: P(surv=1  age=70)=0.86920.01336 × 70 =  0.0658 !!! This model predicts a negative probability of survival.
Correct Analysis (or one proper solution amongst others: Logistic regression model )
\(\pi_i=\dfrac{\text{exp}(\beta_0+\beta_1 X_{1i})}{1+\text{exp}(\beta_0+\beta_1 X_{1i})}\)
\(\text{log}\dfrac{\pi_i}{1\pi_i}=\beta_0+\beta_1 X_{1i}\)
where $\pi_i$=P(survival=1) probability of survival for person i, and X_{1}_{i} is age of a person i.
Below is the donner.sas SAS program. It reads a datafile stored on your computer. For SAS, you need to adjust the infile line of code depending on the platform, version of SAS, etc.
We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with influence iplots options in SAS.
Let’s look at a part of the output from donner.lst. Just as in R we get the same model estimates and the only difference is that R report zvalue instead of Wald ChiSquare, but recall that z^{2} ≈ Wald X^{2} (i.e., for the age: (2.063)^{2} ≈ 4.2553).
Below is the part of R code from donner.R that corresponds to donner.sas. For R, you need to adjust read.table() line depending where you stored the file.
We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with lm.influence() function and many others in R. See the part labeled "Model Diagnostics".
We can get the same model estimates as from the SAS output and the only difference is that R reports a zvalue instead of Wald ChiSquare, but recall that z^{2} ≈ Wald X^{2} (i.e., for the age: (2.063)^{2} ≈ 4.2553).
Here is this part of the R output:
The fitted model is
\(\text{log}\dfrac{\pi_i}{1\pi_i}=1.81830.0665X_{1i}\)
The expected probability of survival of a 70 years old individual (regardless of gender since it's not included in the model):
\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(1.81830.0665\times 70)}{1+\text{exp}(1.81830.0665\times 70)}=0.055\)
\(\text{exp}(\beta_0)=\text{exp}(1.8183)=3.26\)
is the estimate of the odds of survival when age = 0, that is the probability of survival is 3.26 / (1 + 3.26) = 0.77.
The exponentiated slope is
\(\text{exp}(0.0665)=0.94\)
which means that every year increase in age multiplies the odds of survival by 0.94. So the odds of survival decreases with age. This is an estimated odds ratio. In SAS this value and its 95% Wald Confidence Interval is given in the output under "Odds Ratio Estimates":
Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits age 0.936 0.878 0.997
In R, the oddsratio estimate and its 95% Wald Confidence Interval (i.e., the asymptotic CI) if not directly provided but can be obtain as follows:
> exp(coefficients(result)[2])
age
0.9356907
> exp(confint(result)) ## exponentiate to get on the oddsscale
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.9940306 54.0635442
age 0.8695861 0.9898905
In general this the usual 95% Wald CI, i.e., $\hat{\beta}\pm 1.96\times SE(\beta)$, and then we need to exponentiate the ends to get the CI on the oddsscale; for this example $exp(0.0665\pm 1.96 \times 0.0322)=[0.878, 0.997]$.
Assuming the Type I error of 0.05, we reject the null hypothesis that $\beta_1=0$, so there is a weak evidence of a statistically significant relationship between the person's age and probability of survival. But how well does this model fit? Can we trust this inference?
Let's look at the values listed for AGE. While some observations share the same AGE value, most of these are unique. Crosstabulation of AGE vs. SURVIVAL gives 21 × 2 table:
Issue: The table is sparse (e.g. small observed counts and we will get small fitted values), and the sample size requirement for the use of Pearson chisquare goodnessoffit test statistic and likelihood ratio goodnessoffit test statistic is not met. In addition, when more data are collected there could be more ages included so the number of cells is not fixed. This is almost always the case when you are dealing with continuous predictor(s) such as age. Thus, the X^{2} and G^{2} statistics for logistic regression models with continuous predictors do not have approximate chisquared distributions, and are not the best statistics to use to evaluate the overall fit of the model.
There are several alternatives. The most common are
When explanatory variables are continuous, it’s hard to analyze the lack of fit without some grouping. Recall that’s what the HosmerLemeshow test does; it groups by the predicted probabilities such that there is roughly an equal number of observations per group. Below are SAS and R outputs.
In SAS:
In the model statement, the option lackfit tells SAS to compute HL statistics and print the partitioning. In this example, there are 8 groups with roughly equal number of observations per group. For example, group 1 has 5 observations and group 2 has 7. The HL ChiSquare statistic of 8.5165, with df=number of groups  2=6, has a pvalue of 0.2027 indicating that there is no issue with the lackoffit. There are some differences between SAS and R output because of difference in the cut points used to group the data, but the inferences are the same.
Conclusion: The model fits moderately well.
For R, there is no builtin function for the HosmerLemeshow goodnessoffit test. Instead use the hosmerlem() function from ROCandHL.R code provided for the class. Run the following for the Donner example, where "result" is the saved model and g is the number of groups for the HL statistic:
hosmerlem(survive,fitted(result), g=8)
X^2 Df P(>Chi)
6.3528592 6.0000000 0.3848447
The HL ChiSquare statistic of 6.353, with df=number of groups  2=6, has a pvalue of 0.385 indicating that there is no issue with the lackoffit. There are some differences between SAS and R output because of difference in the cut points used to group the data, but the inferences are the same.
Conclusion: The model fits reasonably well.
This will involve fitting different logistic regression models, some including both continuous and categorical predictors, and comparing them via likelihoodratio test. First, we review the general setup, where
H_{0 }: The probability of the characteristic does not depend on the pth variable; i.e., reduced model fits the data
H_{a }: The probability of the characteristic depends on the pth variable; i.e., full model fits the data
This is accomplished by comparing the two models as given below:
Reduced Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\ldots+\beta_{p1} X_{(p1)i}\)
which contains all variables except the variable of interest p.
Full Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\ldots+\beta_{p1} X_{(p1)i}+\beta_p X_{pi}\)
which contains all variables including the variable of interest p.
Recall, the fit of each model to the data is assessed by the likelihood or, equivalently, the loglikelihood. The larger the log likelihood, the better the model fits the data. Adding parameters to model can only result in an increase in the log likelihood. So, the full model will always have a larger log likelihood than the reduced model and thus fit better. But the question is, is that extra cost of adding an additional variable really needed or can we capture the variability in the response well enough with the simpler model.
Question: Is the loglikelihood for the full model significantly better than the loglikelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable p.
For example, to test for the effect of gender in Donner Party data, compare:
Reduced Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}\)
where X_{1}_{i} is age of a person i.
Full Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}\)
where X_{1}_{i} is age of a person i, and X_{2}_{i} the person’s gender, with female=0.
If we are testing only for a single parameter then this test should lead to the same inference as if we are testing: $H_0: \beta_2=0$ vs. $H_a: \beta_2\neq 0$.
In SAS: From the Model Fit Statistics in SAS get the likelihoods:
Model

2 log likelihood

Reduced (age) 
56.291

Full (age and sex) 
51.256

The likelihood ratio test statistic is
\(\Delta G^2 =2 \text{log} \Lambda=56.29151.256=5.035\)
Since \(\Delta G^2 =2 \text{log} \Lambda=5.035 > 5.02= \chi^2_1(0.975)\) we can reject the null hypothesis that sex has no effect on the survivorship probability at the α = 0.025 level.
If we study the table with parameter estimates and consider $H_0: \beta_{sex}=0$ vs. $H_a: \beta_{sex}\neq=0$, based on the Wald X^{2}, and pvalue of 0.0345 we can reject the null hypothesis, and thus gender does have an effect on survival.
Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error ChiSquare Pr > ChiSq Intercept 1 3.2304 1.3870 5.4248 0.0199 age 1 0.0782 0.0373 4.3988 0.0360 sex 1 1.5973 0.7555 4.4699 0.0345
Fitted Model:
\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(3.23040.0782 X_11.5973 X_2)}{1+\text{exp}(3.23040.0782 X_11.5973 X_2)}\)
Also, from SAS, the HL = 9.323, df = 7, pvalue = 0.2305 indicates a moderate good fit.
For R, users notice that this is the usual difference in the deviance statistics for the two models, e.g., residual deviance for age only model G^{2}= 56.291 and for age and sex model, G^{2}=51.256. Thus, ΔG^{2}=2 log Λ = 56.291  51.256 = 5.035 and we can reject the null hypothesis that sex has no effect on the survivorship probability.
If we study the table with parameter estimates and consider $H_0: \beta_{sex}=0$ vs. $H_a: \beta_{sex}\neq=0$, based on the Wald X^{2}, and pvalue of 0.0345 we can reject the null hypothesis, and thus gender does have an effect on survival.
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 3.23041 1.38686 2.329 0.0198 *
age 0.07820 0.03728 2.097 0.0359 *
sex 1.59729 0.75547 2.114 0.0345 *
Fitted Model:
\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(3.23040.0782 X_11.5973 X_2)}{1+\text{exp}(3.23040.0782 X_11.5973 X_2)}\)
From R, hosmerlem(survive, fitted(result), g=10)
X^2 Df P(>Chi)
11.1771583 8.0000000 0.1918618
the HL = 11.177, df = 8, pvalue = 0.192 indicates a moderate good fit.
Conclusions:
Q: What is the probability of survival for a 24 year old male? What about 24 year old female?
For females:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(3.23040.0782 \times 241.5973 \times 0)}{1+\text{exp}(3.23040.0782 \times 241.5973 \times 0)}\\
&= 0.806\\
\end{align}
For males:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(3.23040.0782 \times 241.5973 \times 1)}{1+\text{exp}(3.23040.0782 \times 241.5973 \times 1)}\\
&= 0.439\\
\end{align}
As before, you can also calculate confidence intervals by using computed estimates of the standard errors and the fact that the estimates are Wald statistics.
The above analyses assume that the effect of gender on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for interaction compare the following models:
Reduced Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}\)
where X_{1}_{i} is age of traveller i, and X_{2}_{i} the traveller’s gender.
Full Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\beta_3 X_{1i} X_{2i}\)
where X_{1}_{i} is age of traveller i, and X_{2}_{i} the traveller’s gender.
In SAS, the interaction model is fitted via
proc logistic descending; model survive = age sex age*sex /lackfit influence iplots; output out=a p=pi; run;
Model

2 log likelihood

Reduced (age, sex) 
51.256

Full (age, sex, interaction) 
47.346

From the output, we see that the likelihood ratio test statistic is \(\Delta G^2 =2 \text{log} \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)
Conclusion: At the $\alpha=0.05$ level, we can reject the hypothesis that there is no significant interaction between age and gender. But this is weak evidence in support of the significant interaction term, however it does support the what we observed in the above figure. From the table of of the analysis of MLE below, however, notice that we would fail to reject that $\beta_{age*sex}$ is 0. This happens because Wald chisquare statistic is sensitive to sparseness in the data.
Q: What is the fitted model and what are the probabilities of survival for 24 years old males and females?
For females:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}\\
&= 0.930\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.194) = 0.824 per additional year of age
For males:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}\\
&= \dfrac{\text{exp}(0.31830.0324 \times 24)}{1+\text{exp}(0.31830.0324 \times 24)}\\
&= 0.387\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.0324) = .968 per additional year of age.
Note that the odds of survival decline more rapidly for females than for males.
In R, the interaction model is fitted via
result=glm(survive~age+sex+age*sex,family=binomial("logit"))
Model

2 log likelihood

Reduced (age, sex) 
51.256

Full (age, sex, interaction) 
47.346

From the output, we see that the likelihood ratio test statistic is \(\Delta G^2 =2 \text{log} \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)
Conclusion: At the $\alpha=0.05$ level, we can reject the hypothesis that there is no significant interaction between age and gender. But this is weak evidence in support of the significant interaction term, however it does support the what we observed in the above figure. From the table of of the analysis of MLE below, however, notice that we would fail to reject that $\beta_{age*sex}$ is significant. This happens because Wald chisquare statistic is sensitive to sparseness in the data.
Q: What is the fitted model and what are the probabilities of survival for 24 years old males and females?
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 7.24638 3.20517 2.261 0.0238 *
age 0.19407 0.08742 2.220 0.0264 *
sex 6.92805 3.39887 2.038 0.0415 *
age:sex 0.16160 0.09426 1.714 0.0865 .
For females:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}\\
&= 0.930\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.194) = 0.824 per additional year of age
For males:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}\\
&= \dfrac{\text{exp}(0.31830.0324 \times 24)}{1+\text{exp}(0.31830.0324 \times 24)}\\
&= 0.387\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.0324) = .968 per additional year of age.
Note that the odds of survival decline more rapidly for females than for males.
While the goodnessoffit statistics can tell you how well a particular model fits the data, they don’t tell you much about the lackoffit; i.e. where the model fails. To assess the lack of fit we need to look at regression diagnostics. Regression diagnostics tell you how influential each observation is to the fit of the logistic regression model.
We can evaluate the numerical values of these statistics and/or consider their graphical representation (e.g. like residual plots in linear regression).
Some measures of influence:
[Enlarge]
[Enlarge]
In SAS, under PROC LOGISTIC, INFLUENCE option and IPLOTS option will output these diagnostics. In R see, influence() function.
As pointed out before, the standardized residual values are considered to be indicative of lack of fit if their absolute value exceeds 3 (some sources are more conservative and take 2 as the threshold value).
Index plot, in which the residuals are plotted against the indices of the observations, can help us determine outliers, and/or systematic patterns in variation, and thus assess the lackoffit. In our example, all Pearson and Deviance residuals fall within 2 and +2; thus there does not seem to be any outliers or particular heavy influential points; see donner.lst for the IPLOTS.
For our example, IPLOT of Deviance residuals vs. the Case Number.
For analysis of other diagnostics see Agresti and/or SAS/R documentation, or other software you are using. More will be discussed in the next section.
For an example of a larger dataset, and more on Model Selection, see handouts relevant for the Water Level Study data below:
Go to Case Studies: The Water Level Study. The relevant SAS and R files are on the SAS supplemental pages (e.g., water_level1.sas, water_level2.sas). For corresponding R files see R supplemental pages (e.g., water.R)