7.1 - Binary Logistic Regression with Continuous Covariates

This is the first time we are dealing with continuous variables in this course. Just like in any ordinary linear regression, the covariates may be both discrete and continuous. The basic principle for logistic regression is the same whether covariates are discrete or continuous, but some adjustments are necessary for goodness-of-fit testing. Hosmer-Lemeshaw statistic is very useful in this case.

Key Concepts

  • Binary Logistic Regression with continuous predictors
  • Multiple Binary Logistic Regression with a combination of categorical and continuous predictors
  • Model Diagnostics

Objectives

  • Understand the basic ideas behind modeling binary response as a function of continuous and categorical explanatory variables.

7.1.1 - Example - The Donner Party

Image of the Donner PassIn 1846, the Donner party (Donner and Reed families) left Springfield, Illinois for California in covered wagons. After reaching Fort Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains at a place now called Donner Pass (right) when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.

Three variables:

Yi = 1 if person i survived
Yi = 0 if person i died

X1i age of person i

X2i = 1 if person i is male
X2i = 0 if person i is female.

Objectives

  1. What is the relationship between survival and gender?
  2. Predict the probability of survival as a function of age
  3. After taking into account age, are women more likely to survive harsh conditions than men?  Does age affect the survival rate of men and women differently?  As we saw in Lesson 5, there are more hypotheses to consider and more potential unexpected behavior with multiple predictors.

The analyses may be carried with donner.sas or donner.R with the dataset donner.dat (or donner.txt) which contains N = 45 observations, and the following columns: 1) Age, 2) Sex (1 = male, 0 = female), 3) Survivorship (1 = survived, 0 = dead).

Q1: What is the relationship between survival and gender?

Analysis: How about One-way ANOVA?

residuals plot

Problems: The fit of the one-way ANOVA model F1,43 = 5.35, p-value = 0.0256, R2 = 11%, and:

  • Violation of model assumptions, e.g., errors are not normally distributed, non-constant variance, correlations
  • Problems with prediction: can give values between infinity and +infinity, that is outside [0,1] range, and thus not give probabilities of survival.

Correct Analysis: Analyze the two-way contingency table. You can do this via chi-square test of independence or a binary logistic regression model.

Survivorship
Sex
Survived
Died
Total
Female
10
5
15
Male
10
20
30
Total
20
25
45

Test:

H0 : Survivorship is independent of gender
Ha : Survivorship depends on gender

Conclusion: The null hypothesis is rejected (X2 = 4.50, df = 1, p-value = 0.0339).  We may conclude that survivorship does depend on gender.  Among females survivorship is 67% while among males it is 33%. Apparently, females’ chance of survival is twice that of males.

Problem: By now you should be suspicious of simply ignoring a variable by marginalization without good justification, and wary of Simpson's Paradox.  The above analysis does not consider any dependency between survivorship and age. The table above is created by collapsing over age, which may lead to a wrong conclusion.  

 

Q2: Predict the probability of survival as a function of age

Naive model: Model the probability of survival given age:

\(P(Y=1|X_1)=\beta_0+\beta_1 X_1\)

Linear regression model: Model expected survival given age:

\(E(Y=1|X_1)=\beta_0+\beta_1 X_1\)

Problems:

  • Non-linearity – a linear model may give predicted values outside (0, 1) range
  • Heteroscedasticity – the variance π(1 -π)/n is not constant.  This is due to the fact that Y follows a binomial distribution with n = 1, that is  the  ‘Bernoulli’ distribution (after the famous statistician Bernoulli, of a family of mathematicians by the same name).

fitted line plot

Above is a plot of the fitted linear regression model with the observed data. The estimated model is: $E(survival | age)=0.8692-0.01336age$ Consider predicting survival for a 70 years old person: P(surv=1 | age=70)=0.8692-0.01336 × 70 = - 0.0658 !!! This model predicts a negative probability of survival.

Correct Analysis (or one proper solution amongst others: Logistic regression model )

\(\pi_i=\dfrac{\text{exp}(\beta_0+\beta_1 X_{1i})}{1+\text{exp}(\beta_0+\beta_1 X_{1i})}\)

\(\text{log}\dfrac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 X_{1i}\)

where $\pi_i$=P(survival=1) probability of survival for person i, and X1i is age of a person i.

SAS logoBelow is the donner.sas SAS program. It reads a datafile stored on your computer. For SAS, you need to adjust the infile line of code depending on the platform, version of SAS, etc.

SAS program

We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with influence iplots options in SAS.

Fitted model and Interpretation of parameter estimates

Let’s look at a part of the output from donner.lst. Just as in R we get the same model estimates and the only difference is that R report z-value instead of Wald Chi-Square, but recall that z2 ≈ Wald X2 (i.e., for the age:  (-2.063)2 ≈ 4.2553).

SAS output

R logoBelow is the part of R code from donner.R that corresponds to donner.sas. For R, you need to adjust read.table() line depending where you stored the file.

R program code

We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with lm.influence() function and many others in R. See the part labeled "Model Diagnostics".

Fitted model and Interpretation of parameter estimates

We can get the same model estimates as from the SAS output and the only difference is that R reports a z-value instead of Wald Chi-Square, but recall that z2 ≈ Wald X2 (i.e., for the age: (-2.063)2 ≈ 4.2553).

Here is this part of the R output:

R output

The fitted model is

\(\text{log}\dfrac{\pi_i}{1-\pi_i}=1.8183-0.0665X_{1i}\)

The expected probability of survival of a 70 years old individual (regardless of gender since it's not included in the model):

\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(1.8183-0.0665\times 70)}{1+\text{exp}(1.8183-0.0665\times 70)}=0.055\)

\(\text{exp}(\beta_0)=\text{exp}(1.8183)=3.26\)

is the estimate of the odds of survival when age = 0, that is the probability of survival is 3.26 / (1 + 3.26) = 0.77.

The exponentiated slope is

\(\text{exp}(-0.0665)=0.94\)

which means that every year increase in age multiplies the odds of survival by 0.94. So the odds of survival decreases with age. This is an estimated odds ratio. In SAS this value and its 95% Wald Confidence Interval is given in the output under "Odds Ratio Estimates":

                             Odds Ratio Estimates

                               Point          95% Wald
                  Effect    Estimate      Confidence Limits

                  age          0.936       0.878       0.997

In R, the odds-ratio estimate and its 95% Wald Confidence Interval (i.e., the asymptotic CI) if not directly provided but can be obtain as follows:

> exp(coefficients(result)[2])
      age
0.9356907

> exp(confint(result)) ## exponentiate to get on the odds-scale
Waiting for profiling to be done...
                2.5 %     97.5 %
(Intercept) 0.9940306 54.0635442
age         0.8695861  0.9898905

In general this the usual 95% Wald CI, i.e., $\hat{\beta}\pm 1.96\times SE(\beta)$, and then we need to exponentiate the ends to get the CI on the odds-scale; for this example $exp(-0.0665\pm 1.96 \times 0.0322)=[0.878, 0.997]$.

Assuming the Type I error of 0.05, we reject the null hypothesis that $\beta_1=0$, so there is a weak evidence of a statistically significant relationship between the person's age and probability of survival. But how well does this model fit? Can we trust this inference?

Overall fit of the model

Let's look at the values listed for AGE. While some observations share the same AGE value, most of these are unique. Cross-tabulation of AGE vs. SURVIVAL gives 21 × 2 table:

table

Issue: The table is sparse (e.g. small observed counts and we will get small fitted values), and the sample size requirement for the use of Pearson chi-square goodness-of-fit test statistic and likelihood ratio goodness-of-fit test statistic is not met. In addition, when more data are collected there could be more ages included so the number of cells is not fixed. This is almost always the case when you are dealing with continuous predictor(s) such as age. Thus, the X2 and G2 statistics for logistic regression models with continuous predictors do not have approximate chi-squared distributions, and are not the best statistics to use to evaluate the overall fit of the model.

There are several alternatives. The most common are

  • Fit the desired model, fit an appropriate larger model (e.g. include other predictors, or a quadratic term), and look at the differences in the log-likelihood ratio statistics (e.g., difference in deviances).
  • Use Hosmer-Lemeshow (HL) statistic (Agresti (2007), pg. 147, or Agresti (2013), pg. 173 and see the earlier notes in Lesson 6.)

When explanatory variables are continuous, it’s hard to analyze the lack of fit without some grouping. Recall that’s what the Hosmer-Lemeshow test does; it groups by the predicted probabilities such that there is roughly an equal number of observations per group. Below are SAS and R outputs.

SAS logoIn SAS:

SAS output

In the model statement, the option lackfit tells SAS to compute HL statistics and print the partitioning. In this example, there are 8 groups with roughly equal number of observations per group. For example, group 1 has 5 observations and group 2 has 7. The HL Chi-Square statistic of 8.5165, with df=number of groups - 2=6, has a p-value of 0.2027 indicating that there is no issue with the lack-of-fit. There are some differences between SAS and R output because of difference in the cut points used to group the data, but the inferences are the same.

Conclusion: The model fits moderately well.

R logoFor R, there is no built-in function for the Hosmer-Lemeshow goodness-of-fit test. Instead use the hosmerlem() function from ROCandHL.R code provided for the class.  Run the following for the Donner example, where "result" is the saved model and g is the number of groups for the HL statistic:

hosmerlem(survive,fitted(result), g=8)
      X^2        Df   P(>Chi)
6.3528592 6.0000000 0.3848447

The HL Chi-Square statistic of 6.353, with df=number of groups - 2=6, has a p-value of 0.385 indicating that there is no issue with the lack-of-fit. There are some differences between SAS and R output because of difference in the cut points used to group the data, but the inferences are the same.

Conclusion: The model fits reasonably well.

Q3: After taking into account age, are women more likely to survive harsh conditions than men?

This will involve fitting different logistic regression models, some including both continuous and categorical predictors, and comparing them via likelihood-ratio test. First, we review the general set-up, where

H0 : The probability of the characteristic does not depend on the pth variable; i.e., reduced model fits the data
Ha : The probability of the characteristic depends on the pth variable; i.e., full model fits the data

This is accomplished by comparing the two models as given below:

Reduced Model:

\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\ldots+\beta_{p-1} X_{(p-1)i}\)

which contains all variables except the variable of interest p.

Full Model:

\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\ldots+\beta_{p-1} X_{(p-1)i}+\beta_p X_{pi}\)

which contains all variables including the variable of interest p.

Recall, the fit of each model to the data is assessed by the likelihood or, equivalently, the loglikelihood. The larger the log likelihood, the better the model fits the data. Adding parameters to model can only result in an increase in the log likelihood. So, the full model will always have a larger log likelihood than the reduced model and thus fit better. But the question is, is that extra cost of adding an additional variable really needed or can we capture the variability in the response well enough with the simpler model. 

Question: Is the log-likelihood for the full model significantly better than the log-likelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable p.

For example, to test for the effect of gender in Donner Party data, compare:

Reduced Model:

\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_{1i}\)

where X1i is age of a person i.

Full Model:

\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}\)

where X1i is age of a person i, and X2i the person’s gender, with female=0.

If we are testing only for a single parameter then this test should lead to the same inference as if we are testing: $H_0: \beta_2=0$ vs. $H_a: \beta_2\neq 0$.

sas logoIn SAS: From the Model Fit Statistics in SAS get the likelihoods:

Model
-2 log likelihood
Reduced (age)
56.291
Full (age and sex)
51.256

The likelihood ratio test statistic is

\(\Delta G^2 =-2 \text{log} \Lambda=56.291-51.256=5.035\)

Since  \(\Delta G^2 =-2 \text{log} \Lambda=5.035 > 5.02= \chi^2_1(0.975)\) we can reject the null hypothesis that sex has no effect on the survivorship probability at the α = 0.025 level.

If we study the table with parameter estimates and consider $H_0: \beta_{sex}=0$ vs. $H_a: \beta_{sex}\neq=0$, based on the Wald X2, and p-value of 0.0345 we can reject the null hypothesis, and thus gender does have an effect on survival.

 Analysis of Maximum Likelihood Estimates

                                    Standard          Wald
     Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq

     Intercept     1      3.2304      1.3870        5.4248        0.0199
     age           1     -0.0782      0.0373        4.3988        0.0360
     sex           1     -1.5973      0.7555        4.4699        0.0345

Fitted Model:

\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(3.2304-0.0782 X_1-1.5973 X_2)}{1+\text{exp}(3.2304-0.0782 X_1-1.5973 X_2)}\)

Also, from SAS, the HL = 9.323, df = 7, p-value = 0.2305 indicates a moderate good fit.

r logoFor R, users notice that this is the usual difference in the deviance statistics for the two models, e.g., residual deviance for age only model G2= 56.291 and for age and sex model, G2=51.256. Thus, ΔG2=-2 log Λ = 56.291 - 51.256 = 5.035 and we can reject the null hypothesis that sex has no effect on the survivorship probability. 

If we study the table with parameter estimates and consider $H_0: \beta_{sex}=0$ vs. $H_a: \beta_{sex}\neq=0$, based on the Wald X2, and p-value of 0.0345 we can reject the null hypothesis, and thus gender does have an effect on survival.

Coefficients:
            Estimate Std. Error z value Pr(>|z|) 
(Intercept)  3.23041    1.38686   2.329   0.0198 *
age         -0.07820    0.03728  -2.097   0.0359 *
sex         -1.59729    0.75547  -2.114   0.0345 *

Fitted Model:

\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(3.2304-0.0782 X_1-1.5973 X_2)}{1+\text{exp}(3.2304-0.0782 X_1-1.5973 X_2)}\)

From R,  hosmerlem(survive, fitted(result), g=10)
       X^2         Df    P(>Chi)
11.1771583  8.0000000  0.1918618

the HL = 11.177, df = 8, p-value = 0.192 indicates a moderate good fit.

Conclusions:

  • After taking into account the effects of age, women had higher survival probabilities than men (2 log Λ = 5.035, df = 1; p = 0.025).
  • The odds that a male survives are estimated to be exp(-1.5973) = 0.202 times the odds that a female survives.
  • Moreover, by Walds test, the survivorship probability decreases with increasing age (X2 = 4.47; df. = 1; p = 0.0345).
  • The odds of surviving decline by a multiplicative factor of exp(-0.0782) = 0.925 per year of age.

SAS output

Computing Survivorship Probabilities

Q: What is the probability of survival for a 24 year old male? What about 24 year old female?

For females:

\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(3.2304-0.0782 \times 24-1.5973 \times 0)}{1+\text{exp}(3.2304-0.0782 \times 24-1.5973 \times 0)}\\
&= 0.806\\
\end{align}

For males:

\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(3.2304-0.0782 \times 24-1.5973 \times 1)}{1+\text{exp}(3.2304-0.0782 \times 24-1.5973 \times 1)}\\
&= 0.439\\
\end{align}

As before, you can also calculate confidence intervals by using computed estimates of the standard errors and the fact that the estimates are Wald statistics.

Graphical Summary of Model Results

graphical summary of model results

Independence between Age and Sex

The above analyses assume that the effect of gender on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for interaction compare the following models:

Reduced Model:

\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}\)

where X1i is age of traveller i, and X2i the traveller’s gender.

Full Model:

\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\beta_3 X_{1i} X_{2i}\)

where X1i is age of traveller i, and X2i the traveller’s gender.

sas logoIn SAS, the interaction model is fitted via

proc logistic descending;
  model survive = age sex age*sex /lackfit influence iplots;
  output out=a p=pi;
  run;
Model
-2 log likelihood
Reduced (age, sex)
51.256
Full (age, sex, interaction)
47.346

From the output, we see that the likelihood ratio test statistic is \(\Delta G^2 =-2 \text{log} \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)

Conclusion: At the $\alpha=0.05$ level, we can reject the hypothesis that there is no significant interaction between age and gender. But this is weak evidence in support of the significant interaction term, however it does support the what we observed in the above figure. From the table of of the analysis of MLE below, however, notice that we would fail to reject that $\beta_{age*sex}$ is 0. This happens because Wald chi-square statistic is sensitive to sparseness in the data.

Q: What is the fitted model and what are the probabilities of survival for 24 years old males and females?

SAS output

For females:

\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}\\
&= 0.930\\
\end{align}

The odds of surviving are multiplied by a factor of exp(-0.194) = 0.824 per additional year of age

For males:

\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\
&= \dfrac{\text{exp}(0.3183-0.0324 \times 24)}{1+\text{exp}(0.3183-0.0324 \times 24)}\\
&= 0.387\\
\end{align}

The odds of surviving are multiplied by a factor of exp(-0.0324) = .968 per additional year of age.

Note that the odds of survival decline more rapidly for females than for males.

r logoIn R, the interaction model is fitted via

result=glm(survive~age+sex+age*sex,family=binomial("logit"))
Model
-2 log likelihood
Reduced (age, sex)
51.256
Full (age, sex, interaction)
47.346

From the output, we see that the likelihood ratio test statistic is \(\Delta G^2 =-2 \text{log} \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)

Conclusion: At the $\alpha=0.05$ level, we can reject the hypothesis that there is no significant interaction between age and gender. But this is weak evidence in support of the significant interaction term, however it does support the what we observed in the above figure. From the table of of the analysis of MLE below, however, notice that we would fail to reject that $\beta_{age*sex}$ is significant. This happens because Wald chi-square statistic is sensitive to sparseness in the data.

Q: What is the fitted model and what are the probabilities of survival for 24 years old males and females?

Coefficients:
            Estimate Std. Error z value Pr(>|z|) 
(Intercept)  7.24638    3.20517   2.261   0.0238 *
age         -0.19407    0.08742  -2.220   0.0264 *
sex         -6.92805    3.39887  -2.038   0.0415 *
age:sex      0.16160    0.09426   1.714   0.0865 .

For females:

\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}\\
&= 0.930\\
\end{align}

The odds of surviving are multiplied by a factor of exp(-0.194) = 0.824 per additional year of age

For males:

\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\text{exp}(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\
&= \dfrac{\text{exp}(0.3183-0.0324 \times 24)}{1+\text{exp}(0.3183-0.0324 \times 24)}\\
&= 0.387\\
\end{align}

The odds of surviving are multiplied by a factor of exp(-0.0324) = .968 per additional year of age.

Note that the odds of survival decline more rapidly for females than for males.

More on Model Diagnostics

While the goodness-of-fit statistics can tell you how well a particular model fits the data, they don’t tell you much about the lack-of-fit; i.e. where the model fails. To assess the lack of fit we need to look at regression diagnostics. Regression diagnostics tell you how influential each observation is to the fit of the logistic regression model.

We can evaluate the numerical values of these statistics and/or consider their graphical representation (e.g. like residual plots in linear regression).

Some measures of influence (ref. Sec. 5.2 Agresti (2007), notes on ANGEL and the discussion board,  SAS 9 on-line documentation or R help):

  • Pearson and Deviance Residuals identify observations not well explained by the model.
  • Hat Matrix Diagonal detects extreme large points in the design space. These are often labeled as "leverage" or "hi" and are related to standardized residuals. A general rule says that if hi > 2*p/n or > 3*p/n the points is influential. Here "p" is the number of parameters in the model and "n" the number of observations. For donner data example with the main effects model, 2*3/45=0.13. There are two points possibly influential.
  • The regression diagnostics work is very similar to linear regression.

Hplot
[Enlarge]

  • DFBETA assesses the effect of an individual observation on the estimated parameter of the fitted model. It’s calculated for each observation for each parameter estimate. It’s a difference in the parameter estimated due to deleting the corresponding observation. For example, for the main effects model there are 3 such plots (one for intercept, one for age and one for sex). Here is the one for AGE, showing that observation #35 is potentially influential.

DfAgeplot
[Enlarge]

  • C and CBAR provide scalar measures of the influence of the individual observations on the parameter estimates; similar to the Cook distance in linear regression. The same rules in linear regression hold here too. For example, observation #35 again seems influential (see the rest of donner.lst).
  • DIFDEV and DIFCHISQ detect observations that heavily add to the lack of fit, i.e. there is a large disagreement between the fitted and observed values. DIFDEV measures the change in the deviance (G2) as an individual observation is deleted. DIFCHISQ measures the change in X2.

In SAS, under PROC LOGISTIC, INFLUENCE option and IPLOTS option will output these diagnostics. In R see, influence() function.

As pointed out before, the standardized residual values are considered to be indicative of lack of fit if their absolute value exceeds 3 (some sources are more conservative and take 2 as the threshold value).

Index plot, in which the residuals are plotted against the indices of the observations, can help us determine outliers, and/or systematic patterns in variation, and thus assess the lack-of-fit. In our example, all Pearson and Deviance residuals fall within -2 and +2; thus there does not seem to be any outliers or particular heavy influential points; see donner.lst for the IPLOTS.

For our example, IPLOT of Deviance residuals vs. the Case Number.

plot

For analysis of other diagnostics see Agresti and/or SAS/R documentation, or other software you are using. More will be discussed in the next section.

For an example of a larger dataset, and more on Model Selection, see handouts relevant for the Water Level Study data below:

Go to Case Studies: The Water Level Study. The relevant SAS and R files are on the SAS supplemental pages (e.g., water_level1.sas, water_level2.sas). For corresponding R files see R supplemental pages (e.g., water.R)