# 7.2 - Diagnosing Logistic Regression Models

Just like a linear regression, once a logistic (or any other generalized linear) model is fitted to the data it is essential to check that the assumed model is actually a valid model. A thorough examination of the extent to which the fitted model provides an appropriate description of the observed data, is a vital aspect of the modelling process.

#### Key Concepts

• Model Diagnostics
• Overdispersion
• Receiver Operating Characteristic Curves (ROC) & Classification Tables

#### Objectives

• Understand the need for evaluating residuals and become familiar with other logistic regression model diagnostic tools

• Agresti (2007), Section 5.2 on model checking
• Agresti (2013), Section 6.2 on diagnostics and model checking

# 7.2.1 - Model Diagnostics

This section is dedicated to studying the appropriateness of the model. Do the model assumptions hold? This is done via various diagnostics, such assessing the distribution of the residuals.

Let us begin with a bit of a review of Linear Regression diagnostics. The standard linear regression model is given by:

$$y_i \sim N(\mu_i,\sigma^2)$$

$$\mu_i =x_i^T \beta$$

The two crucial features of this model are

• the assumed mean structure, $$\mu_i =x_i^T \beta$$ , and
• the assumed constant variance σ2 (homoscedasticity).

The most common diagnostic tool is the residuals, the difference between the estimated and observed values of the dependent variable. There are other useful regression diagnostics, e.g. measures of leverage and influence, but for now our focus will be on the estimated residuals.

The most common way to check these assumptions is to fit the model and then plot the residuals versus the fitted values $$\hat{y}_i=x_i^T \hat{\beta}$$ .

• If the model assumptions are correct, the residuals should fall within an area representing a horizontal band, like this:

• If the residuals have been standardized in some fashion (i.e., scaled by an estimate of σ), then we would expect most of them to have values within ±2 or ±3; residuals outside of this range are potential outliers.
• If the plot reveals some kind of curvature—for example, like this,

it suggests a failure of the mean model; the true relationship between μi and the covariates might not be linear.

• If the variability in the residuals is not constant as we move from left to right—for example, if the plot looks like this,

or like this,

then the variance V (yi) is not constant but changes as the mean μi changes.

### Analogous plots for logistic regression.

The logistic regression model says that the mean of yi is

$$\mu_i=n_i \pi_i$$

where

$$\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta$$

and that the variance of yi is

$$V(y_i)=n_i \pi_i(1-\pi_i)$$.

After fitting the model, we can calculate the Pearson residuals

$$r_i=\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{V}(y_i)}}=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}}$$

or the deviance residuals. If the ni's are "large enough", these act something like standardized residuals in linear regression. To see what's happening, we can plot them against the linear predictors,

$$\hat{\eta}_i=\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \hat{\beta}_i$$

which are the estimated log-odds of success, for cases i = 1, . . . , N.

• If the fitted logistic regression model was true, we would expect to see a horizontal band with most of the residuals falling within ± 3:

• If the ni's are small, then the plot might not have such a nice pattern, even if the model is not true. In the extreme case of ungrouped data (all ni's equal to 1), this plot becomes uninformative. From now on, we will suppose that the ni's are not too small, so that the plot is at least somewhat meaningful.
• If outliers are present—that is, if a few residuals or even one residual is substantially larger than ± 3 — then X2 and G2 may be much larger than the degrees of freedom. In that situation, the lack of fit can be attributed to outliers, and the large residuals will be easy to find in the plot.

#### Curvature

• If the plot of Pearson residuals versus the linear predictors reveals curvature—for example, like this,

then it suggests that the mean model

$$\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta$$      (2)

has been mis-specified in some fashion. That is, it could be that one or more important covariates do not influence the log-odds of success in a linear fashion. For example, it could be that a covariate X ought to be replaced by a transformation such as $$\sqrt{X}$$ or logX, or by a pair of variables X and X2, etc.

To see whether individual covariates ought to enter in the logit model in a non-linear fashion, we could plot the empirical logits

$$\text{log}\left(\dfrac{y_i+1/2}{n_i-y_i+1/2}\right)$$      (3)

versus each covariate in the model.

• If one of the plots looks substantially non-linear, we might want to transform that particular covariate.
• If many of them are nonlinear, however, it may suggest that the link function has been misspecified—i.e., that the left-hand side of (2) should not involve a logit transformation, but some other function such as
•
• log,
• probit (but this is often very close to logit), or
• complementary log-log.

Changing the link function will change the interpretation of the coefficients entirely; the βj's will no longer be log-odds ratios. But, depending on what the link function is, they might still have a nice interpretation. For example, in a model with a log link, $$\text{log}\pi_i=x_i^T \beta$$, an exponentiated coefficient expj) becomes a relative risk.

#### Test for correctness of the link function

Hinkley (1985) suggests a nice, easy test to see whether the link function is plausible:

• Fit the model and save the estimated linear predictors
• $$\hat{\eta}_i=x_i^T \hat{\beta}$$

• Add  $$\eta^2_i$$ to the model as a new predictor and see if it's significant.

A significant result indicates that the link function is mis-specified. A nice feature of this test is that it applies even to ungrouped data (ni's equal to one), for which residual plots are uninformative.

#### Non-constant Variance

Suppose that the residual plot shows non-constant variance as we move from left to right:

Another way to detect non-constancy of variance is to plot the absolute values of the residuals versus the linear predictors and look for a non-zero slope:

Non-constant variance in the Pearson residuals means that the assumed form of the variance function,

$$V(y_i)=n_i \pi_i (1-\pi_i)$$

is wrong and cannot be corrected by simply introducing a scale factor for overdispersion. Overdispersion and changes to the variance function will be discussed later.

### Example

The SAS on-line help documentation provides the following quantal assay dataset. In this table, xi refers to the log-dose, ni is the number of subjects exposed, and yi is the number who responded.

 xi yi ni 2.68 10 31 2.76 17 30 2.82 12 31 2.90 7 27 3.02 23 26 3.04 22 30 3.13 29 31 3.20 29 30 3.21 23 30

If we fit a simple logistic regression model, we will find that the coefficient for xi is highly significant, but the model doesn't fit. The plot of Pearson residuals versus the fitted values resembles a horizontal band, with no obvious curvature or trends in the variance. This seems to be a classic example of overdispersion.

Since there's only a single covariate, a good place to start is to plot the empirical logits as defined in equation (3) above versus X.

This is basically the same thing as a scatterplot of Y versus X in the context of ordinary linear regression. This plot becomes more meaningful as the ni's grow. With ungrouped data (all ni = 1), the empirical logits will only take two possible values—log(1/3) and log 3/1—and the plot will not be very useful.

Here is the SAS program file assay.sas:

The relationship between the logits and X seems linear. Let's fit the logistic regression and see what happens.

See assay1.sas:

In plot reschi * xb, reschi are the Pearson residuals, and xb the linear predictor.

The output reveals that the coefficient of X is highly significant, but the model does not fit.

Here is the R code assay.R that corresponds to the SAS program assay1.sas:

With plot.lm(result), R will produce four diagnostic plots, including a residual plot, a QQ plot, a scale-location plot, and a residual vs leverage plot as well.

The output reveals that the coefficient of X is highly significant, but the model does not fit.

Here is the residual plot from R output:

The residuals plots in both SAS and R above do not show any obvious curvature or trends in the variance. And there are no other predictors that are good candidates for inclusion in the model. (It can be shown that a quadratic term for log-concentration will not improve the fit.) So it is quite reasonable to attribute the problem to overdispersion.

# 7.2.2 - Overdispersion

Overdispersion is an important concept in the analysis of discrete data. Many a time data admit more variability than expected under the assumed distribution. The greater variability than predicted by the generalized linear model random component reflects overdispersion. Overdispersion occurs because the mean and variance components of a GLM are related and depends on the same parameter that is being predicted through the independent vector.

There is no such thing as overdispersion in ordinary linear regression. In a linear regression model

$$y_i \sim N(x_i^T \beta, \sigma^2)$$

the variance σ2 is estimated independently of the mean function $$x_i^T \beta$$.

With discrete response variables, however, the possibility for overdispersion exists because the commonly used distributions specify particular relationships between the variance and the mean; we will see the same holds for Poisson.

For the binomial response, if yi ~ Bin(ni, πi), the mean is μi = ni πi and the variance is μi(ni - μi) / ni.

• Overdispersion means that the data show evidence that the variance of the response yi is greater than μi(ni - μi) / ni.
• Underdispersion is also theoretically possible, but rare in practice. McCullagh and Nelder (1989) say that overdispersion is the rule rather than the exception.

In the context of logistic regression, overdispersion occurs when the discrepancies between the observed responses yi and their predicted values $$\hat{\mu}_i=n_i \hat{\pi}_i$$ are larger than what the binomial model would predict. Overdispersion arises when the ni Bernoulli trials that are summarized in a line of the dataset are

• not identically distributed (i.e. the success probabilities vary from one trial to the next), or
• not independent (i.e. the outcome of one trial influences the outcomes of other trials).

In practice, it is impossible to distinguish non-identically distributed trials from non-independence; the two phenomena are intertwined.

Issue: If overdispersion is present in a dataset, the estimated standard errors and test statistics the overall goodness-of-fit will be distorted and adjustments must be made. When a logistic model fitted to n binomial proportions is satisfactory, the residual deviance has an approximate χ2 distribution with (np) degrees of freedom, where p is the number of unknown parameters in the fitted model. Since the expected value of a χ2 distribution is equal to its degree of freedom, it follows that the residual deviance for a well-fitting model should be approximately equal to its degrees of freedom. Equivalently, we may say that the mean deviance (deviance/df) should be close to one. Similarly, if the variance of the data is greater than that under under binomial sampling, the residual mean deviance is likely to be greater than 1.

The problem of overdispersion may also be confounded with the problem of omitted covariates. If we have included all the available covariates related to yi in our model and it still does not fit, it could be because our regression function $$x_i^T \beta$$ is incomplete. Or it could be due to overdispersion. Unless we collect more data, we cannot do anything about omitted covariates. But we can adjust for overdispersion.

The most popular method for adjusting for overdispersion comes from the theory of quasilikelihood. Quasilikelihood has come to play a very important role in modern statistics. It is the foundation of many methods that are thought to be "robust" (e.g. Generalized Estimating Equations (GEE) for longitudinal data) because they do not require specification of a full parametric model. For more details see Agresti (2007, Sec 9.2) or Agresti (2013, Sec 12.2).

In the quasilikelihood approach, we must first specify the "mean function" which determines how μi = E(yi) is related to the covariates. In the context of logistic regression, the mean function is

$$\mu_i=n_i \text{expit}(x_i^T \beta)$$,

which implies

$$\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta$$

Then we must specify the "variance function," which determines the relationship between the variance of the response variable and its mean. For a binomial model, the variance function is μi(ni - μi) / ni.

But to account for overdispersion, we will include another factor σ2 called the "scale parameter," so that

$$V(y_i)=\sigma^2 \mu_i (n_i-\mu_i)/n_i$$.

• If σ2 ≠ 1 then the model is not binomial; σ2 > 1 is called "overdispersion" and σ2 < 1 is called "underdispersion."
• If σ2 were known, we could obtain a consistent, asymptotically normal and efficient estimate for β by a quasi-scoring procedure, sometimes called "estimating equations." For the variance function shown above, the quasi-scoring procedure reduces to the Fisher scoring algorithm that we mentioned as a way to iteratively find ML estimates.

Note that no matter what σ2 is assumed to be, we get the same estimate for β. Therefore, this method for overdispersion does not change the estimate for β at all. However, the estimated covariance for $$\hat{\beta}$$ changes from

$$\hat{V}(\hat{\beta})=(X^T W X)^{-1}$$

to

$$\hat{V}(\hat{\beta})=\sigma^2 (X^T W X)^{-1}$$

That is, the estimated standard errors must be multiplied by the factor $$\sigma=\sqrt{\sigma^2}$$.

How do we estimate σ2?

McCullagh and Nelder (1989) recommend

$$\hat{\sigma}^2=X^2/(N-P)$$

where X2 is the usual Pearson goodness-of-fit statistic, N is the number of sample cases (number of rows in the dataset we are modeling), and p is the number of parameters.

• If the model holds, then X2/(N - p) is a consistent estimate for σ2 in the asymptotic sequence N → ∞ for fixed ni's.
• The deviance-based estimate G2/(N - p) does not have this consistency property and should not be used.

This is a reasonable way to estimate σ2 if the mean model $$\mu_i=g(x_i^T \beta)$$ holds. But if important covariates are omitted, then X2 tends to grow and the estimate for σ2 can be too large. For this reason, we will estimate σ2 under a maximal model, a model that includes all of the covariates we wish to consider.

The best way to estimate σ2 is to identify a rich model for μi and designate it to be the most complicated one that we are willing to consider. For example, if we have a large pool of potential covariates, we may take the maximal model to be the model that has every covariate included as a main effect. Or, if we have a smaller number of potential covariates, we decide to include all main effects along with two-way and perhaps even three-way interactions. But we must omit at least a few higher-order interactions, otherwise we will end up with a model that is saturated.

In an overdispersed model, we must also adjust our test statistics. The statistics X2 and G2 are adjusted by dividing them by σ2. That is, tests of nested models are carried out by comparing differences in the scaled Pearson statistic, ΔX22, or the scaled deviance, ΔG22 to a chisquare distribution with Δdf degrees of freedom.

If the data are overdispersed — that is, if

$$V(y_i) \approx \sigma^2 n_i \pi_i (1-\pi_i)$$

for a scale factor σ2 > 1 ,  then the residual plot may still resemble a horizontal band, but many of the residuals will tend to fall outside the ± 3 limits. In this case, the denominator of the Pearson residual will tend to understate the true variance of the yi, making the residuals larger. If the plot looks like a horizontal band but X2 and G2 indicate lack of fit, an adjustment for overdispersion might be warranted. A warning about this, however: If the residuals tend to be too large, it doesn't necessarily mean that overdispersion is the cause. Large residuals may also be caused by omitted covariates. If some important covariates are omitted from xi, then the true πi's will depart from what your model predicts, causing the numerator of the Pearson residual to be larger than usual. That is, apparent overdispersion could also be an indication that your mean model needs additional covariates. If these additional covariates are not available in the dataset, however, then there's not much we can do about it; we may need to attribute it to overdispersion.

Note, there is no overdispersion for ungrouped data. McCullagh and Nelder (1989) point out that overdispersion is not possible if ni=1. If yi only takes values 0 and 1, the it must be distributed as Bernoulli(πi) and its variance must be πi(1 - πi). There is no other distribution with support {0,1}. Therefore, with ungrouped data, we should always assume scale=1 and not try to estimate a scale parameter and adjust for overdispersion.

### Summary of Adjusting for Overdispersion in the Binary Logistic Regression

The usual way to correct for overdispersion in a logit model is to assume that:

$$E(y_i)=n_i \pi_i$$
$$V(y_i)=\sigma^2 n_i \pi_i (1-\pi_i)$$

where σ2 is a scale parameter.

Under this modification, the Fisher-scoring procedure for estimating β does not change, but its estimated covariance matrix becomes σ2(XTWX)-1—that is, the usual standard errors are multiplied by the square root of σ2.

This will make the confidence intervals wider.

Let's get back to our example and refit the model, making an adjustment for overdispersion.

We can  refit the model, making an adjustment for overdispersion in SAS by changing the model statement to

model y/n= logconc / scale=pearson;

In SAS, including the option scale=Pearson in the model statement will perform the adjustment. It will not change the estimated coefficients βj, but it will adjust the standard errors.

Here's the output:

NOTE: The covariance matrix has been multiplied by the heterogeneity factor (Pearson Chi-Square / DF) 4.08043.

The estimated scale parameter is $$\hat{\sigma}^2=X^2/df=4.08$$. SAS automatically scales the covariance matrix by this factor, which means that

• the standard errors in the table of coefficients are multiplied by $$\sqrt{4.08} \approx 2$$, and
• the Wald statistics are divided by 4.08.

If you are using glm() in R, and want to refit the model adjusting for overdispersion one way of doing it is to use summary.glm() function. For example fit the model using glm() and save the object as RESULT. By default dispersion is equal to 1. This will perform the adjustment. It will not change the estimated coefficients βj, but it will adjust the standard errors.

Estimate from the MAXIMAL model dispersion value as X2/df. Then you can call

summary(RESULT, dispersion=4.08,correlation=TRUE,symbolic.cor = TRUE).

This should give you the same model but with adjusted covariance matrix, that is adjusted standard errors (SEs) for your beta's (estimated logistic regression coefficients) and also changed z-values. Notice it will not adjust overall fit statistics. For that try the package "dispmod" (see assay.R).

R output after adjusting for overdispersion:

There are other corrections that we could make. If we were constructing an analysis-of-deviance table, we would want to divide G2 and X2 by $$\hat{\sigma}^2$$ and use these scaled versions for comparing nested models. Moreover, in reporting residuals, it would be appropriate to modify the Pearson residuals to

$$r_i^\ast=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{\hat{\sigma}^2n_i\hat{\pi}_i(1-\hat{\pi}_i)}}$$ ;

that is, we should divide the Pearson residuals (or the deviance residuals, for that matter) by $$\sqrt{\hat{\sigma}^2}$$.

# 7.2.3 - Receiver Operating Characteristic Curve (ROC)

A Receiver Operating Characteristic Curve (ROC) is a standard technique for summarizing classifier performance over a range of trade-offs between true positive (TP) and false positive (FP) error rates (Sweets, 1988). ROC curve is a plot of sensitivity (the ability of the model to predict an event correctly) versus 1-specificity for the possible cut-off classification probability values π0.

For logistic regression you can create a 2 × 2 classification table of predicted values from your model for your response if $$\hat{y}=0$$ or 1 versus the true value of y = 0 or 1. The prediction if $$\hat{y}=1$$ depends on some cut-off probability, π0. For example, $$\hat{y}=1$$ if  $$\hat{\pi}_i>\pi_0$$ and $$\hat{y}=0$$ if $$\hat{\pi}_i \leq \pi_0$$. The most common value for π0 = 0.5. Then $$sensitivity=P(\hat{y}=1|y=1)$$ and $$specificity=P(\hat{y}=0|y=0)$$.

The ROC curve is more informative than the classification table since it summarizes the predictive power for all possible π0.

The position of the ROC on the graph reflects the accuracy of the diagnostic test. It covers all possible thresholds (cut-off points). The ROC of random guessing lies on the diagonal line. The ROC of a perfect diagnostic technique is a point at the upper left corner of the graph, where the TP proportion is 1.0 and the FP proportion is 0.

The Area Under the Curve (AUC), also referred to as index of accuracy (A), or concordance index, c, in SAS, and it is an accepted traditional performance metric for a ROC curve. The higher the area under the curve the better prediction power the model has. c = 0.8 can be interpreted to mean that a randomly selected individual from the positive group has a test value larger than that for a randomly chosen individual from the negative group 80 percent of the time.

For more details see Agresti(2007), Sections 5.1.6-5.1.8, Agresti (2013), Section 6.2, and SAS links provided at the beginning.

Here is the SAS program assay4.sas.

Here is the resulting ROC graph.

Area under the curve is c = 0.746 indicates good predictive power of the model.

Option ctable prints the classification tables for various cut-off points. Each row of this output is a classification table for the specified Prob Level, π0.

Here is the R program file assay.R that corresponds to the SAS program assay4.sas.

Here is the ROC graph from R output:

The area under the curve is c = 0.746 which indicates good predictive power of the model.