Just as in linear regression, once a logistic (or any other generalized linear) model has been fitted to the data, it is essential to check that the assumed model is actually valid. A thorough examination of the extent to which the fitted model provides an appropriate description of the observed data is a vital aspect of the modelling process.
This section is dedicated to studying the appropriateness of the model. Do the model assumptions hold? This is done via various diagnostics, such as assessing the distribution of the residuals.
Let us begin with a bit of a review of Linear Regression diagnostics. The standard linear regression model is given by:
\(y_i \sim N(\mu_i,\sigma^2)\)
\(\mu_i =x_i^T \beta\)
The two crucial features of this model are the form of the mean, \(\mu_i = x_i^T \beta\) (linear in the covariates), and the constant variance \(\sigma^2\).
The most common diagnostic tool is the residuals, the differences between the observed and fitted values of the dependent variable. There are other useful regression diagnostics, e.g. measures of leverage and influence, but for now our focus will be on the estimated residuals.
The most common way to check these assumptions is to fit the model and then plot the residuals versus the fitted values \(\hat{y}_i=x_i^T \hat{\beta}\) .
If the plot of residuals versus fitted values shows a curved trend, it suggests a failure of the mean model; the true relationship between \(\mu_i\) and the covariates might not be linear. If instead the spread of the residuals changes systematically from left to right (for example, a funnel shape), then the variance \(V(y_i)\) is not constant but changes as the mean \(\mu_i\) changes.
The logistic regression model says that the mean of y_{i} is
\(\mu_i=n_i \pi_i\)
where
\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\)
and that the variance of y_{i} is
\(V(y_i)=n_i \pi_i(1-\pi_i)\).
After fitting the model, we can calculate the Pearson residuals
\(r_i=\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{V}(y_i)}}=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\)
or the deviance residuals. If the n_{i}'s are "large enough", these act something like standardized residuals in linear regression. To see what's happening, we can plot them against the linear predictors,
\(\hat{\eta}_i=\text{log}\left(\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}\right)=x_i^T \hat{\beta}\)
which are the estimated log-odds of success, for cases i = 1, ..., N.
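To make the Pearson residual formula concrete, here is a minimal sketch in Python. The counts and fitted probabilities are made up for illustration, not taken from an actual fit.

```python
import math

def pearson_residual(y, n, pi_hat):
    """Pearson residual for one grouped binomial observation:
    (y - n*pi_hat) / sqrt(n * pi_hat * (1 - pi_hat))."""
    return (y - n * pi_hat) / math.sqrt(n * pi_hat * (1 - pi_hat))

# Hypothetical counts and fitted probabilities: (y_i, n_i, pi_hat_i)
obs = [(10, 31, 0.35), (17, 30, 0.50), (29, 31, 0.90)]
residuals = [pearson_residual(y, n, p) for y, n, p in obs]
```

These residuals would then be plotted against the estimated linear predictors \(\hat{\eta}_i\) to look for trends.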
If this plot shows a curved or other systematic trend, it suggests that the mean model
\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\) (2)
has been misspecified in some fashion. That is, it could be that one or more important covariates do not influence the log-odds of success in a linear fashion. For example, it could be that a covariate X ought to be replaced by a transformation such as \(\sqrt{X}\) or log X, or by a pair of variables X and X^{2}, etc.
To see whether individual covariates ought to enter the logit model in a nonlinear fashion, we could plot the empirical logits
\(\text{log}\left(\dfrac{y_i+1/2}{n_i-y_i+1/2}\right)\) (3)
versus each covariate in the model.
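The empirical logit in equation (3) is easy to compute directly; here is a short Python sketch of the formula with the 1/2 continuity correction.

```python
import math

def empirical_logit(y, n):
    """Empirical logit with the 1/2 continuity correction:
    log((y + 1/2) / (n - y + 1/2)). Defined even when y = 0 or y = n."""
    return math.log((y + 0.5) / (n - y + 0.5))
```

Plotting these values against each covariate gives a rough picture of whether the covariate enters the logit linearly.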
Changing the link function will change the interpretation of the coefficients entirely; the β_{j}'s will no longer be log-odds ratios. But, depending on what the link function is, they might still have a nice interpretation. For example, in a model with a log link, \(\text{log}\,\pi_i=x_i^T \beta\), an exponentiated coefficient exp(β_{j}) becomes a relative risk.
Hinkley (1985) suggests a nice, easy test to see whether the link function is plausible: fit the model and save the estimated linear predictors
\(\hat{\eta}_i=x_i^T \hat{\beta}\)
then refit the model with \(\hat{\eta}_i^2\) added as an extra covariate and examine whether its coefficient is significant. A significant result indicates that the link function is misspecified. A nice feature of this test is that it applies even to ungrouped data (n_{i}'s equal to one), for which residual plots are uninformative.
Suppose that the residual plot shows nonconstant variance as we move from left to right:
Another way to detect nonconstancy of variance is to plot the absolute values of the residuals versus the linear predictors and look for a nonzero slope:
Nonconstant variance in the Pearson residuals means that the assumed form of the variance function,
\(V(y_i)=n_i \pi_i (1-\pi_i)\)
is wrong and cannot be corrected by simply introducing a scale factor for overdispersion. Overdispersion and changes to the variance function will be discussed later.
The SAS online help documentation provides the following quantal assay dataset. In this table, x_{i} refers to the log-dose, n_{i} is the number of subjects exposed, and y_{i} is the number who responded.
x_{i}    y_{i}    n_{i}
2.68     10       31
2.76     17       30
2.82     12       31
2.90      7       27
3.02     23       26
3.04     22       30
3.13     29       31
3.20     29       30
3.21     23       30
If we fit a simple logistic regression model, we will find that the coefficient for x_{i} is highly significant, but the model doesn't fit. The plot of Pearson residuals versus the fitted values resembles a horizontal band, with no obvious curvature or trends in the variance. This seems to be a classic example of overdispersion.
Since there's only a single covariate, a good place to start is to plot the empirical logits as defined in equation (3) above versus X.
This is basically the same thing as a scatterplot of Y versus X in the context of ordinary linear regression. This plot becomes more meaningful as the n_{i}'s grow. With ungrouped data (all n_{i} = 1), the empirical logits can take only two possible values, log(1/3) and log 3, and the plot will not be very useful.
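As a sketch, the empirical logits for the assay table above can be computed directly in Python (the data values are copied from the table; everything else is illustrative):

```python
import math

# Quantal assay data from the table above: (log-dose x_i, y_i, n_i)
assay = [(2.68, 10, 31), (2.76, 17, 30), (2.82, 12, 31),
         (2.90,  7, 27), (3.02, 23, 26), (3.04, 22, 30),
         (3.13, 29, 31), (3.20, 29, 30), (3.21, 23, 30)]

# Empirical logits log((y + 1/2)/(n - y + 1/2)), paired with x for plotting
logits = [(x, math.log((y + 0.5) / (n - y + 0.5))) for x, y, n in assay]

# With ungrouped data (n_i = 1), only two values are possible:
lo, hi = math.log(0.5 / 1.5), math.log(1.5 / 0.5)   # log(1/3) and log(3)
```

Plotting the `logits` pairs plays the role of the Y-versus-X scatterplot described above.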
Here is the SAS program file assay.sas:
The relationship between the logits and X seems linear. Let's fit the logistic regression and see what happens.
See assay1.sas:
In the statement plot reschi*xb, reschi refers to the Pearson residuals and xb to the linear predictor.
The output reveals that the coefficient of X is highly significant, but the model does not fit.
Here is the R code assay.R that corresponds to the SAS program assay1.sas:
With plot.lm(result), R will produce four diagnostic plots: a residual plot, a QQ plot, a scale-location plot, and a residuals-versus-leverage plot.
The output reveals that the coefficient of X is highly significant, but the model does not fit.
Here is the residual plot from R output:
The residual plots in both SAS and R above do not show any obvious curvature or trends in the variance, and there are no other predictors that are good candidates for inclusion in the model. (It can be shown that a quadratic term for log-concentration will not improve the fit.) So it is quite reasonable to attribute the problem to overdispersion.
Overdispersion is an important concept in the analysis of discrete data. Data often exhibit more variability than expected under the assumed distribution; this excess variability, relative to what the random component of the generalized linear model predicts, is overdispersion. It can occur because the mean and variance components of a GLM are related, both depending on the same parameter that is modelled through the linear predictor.
There is no such thing as overdispersion in ordinary linear regression. In a linear regression model
\(y_i \sim N(x_i^T \beta, \sigma^2)\)
the variance σ^{2} is estimated independently of the mean function \(x_i^T \beta\).
With discrete response variables, however, the possibility for overdispersion exists because the commonly used distributions specify particular relationships between the variance and the mean; we will see the same holds for Poisson.
For the binomial response, if y_{i} ~ Bin(n_{i}, π_{i}), the mean is μ_{i} = n_{i}π_{i} and the variance is μ_{i}(n_{i} − μ_{i})/n_{i}.
In the context of logistic regression, overdispersion occurs when the discrepancies between the observed responses y_{i} and their predicted values \(\hat{\mu}_i=n_i \hat{\pi}_i\) are larger than what the binomial model would predict. Overdispersion arises when the n_{i} Bernoulli trials that are summarized in a line of the dataset are not identically distributed (the success probabilities vary from one trial to the next) or are not independent.
In practice, it is impossible to distinguish non-identically distributed trials from non-independence; the two phenomena are intertwined.
Issue: If overdispersion is present in a dataset, the estimated standard errors, test statistics, and overall goodness-of-fit will be distorted, and adjustments must be made. When a logistic model fitted to n binomial proportions is satisfactory, the residual deviance has an approximate χ^{2} distribution with (n − p) degrees of freedom, where p is the number of unknown parameters in the fitted model. Since the expected value of a χ^{2} distribution is equal to its degrees of freedom, it follows that the residual deviance for a well-fitting model should be approximately equal to its degrees of freedom. Equivalently, we may say that the mean deviance (deviance/df) should be close to one. Similarly, if the variance of the data is greater than that under binomial sampling, the residual mean deviance is likely to be greater than 1.
The problem of overdispersion may also be confounded with the problem of omitted covariates. If we have included all the available covariates related to y_{i} in our model and it still does not fit, it could be because our regression function \(x_i^T \beta\) is incomplete. Or it could be due to overdispersion. Unless we collect more data, we cannot do anything about omitted covariates. But we can adjust for overdispersion.
The most popular method for adjusting for overdispersion comes from the theory of quasi-likelihood. Quasi-likelihood has come to play a very important role in modern statistics. It is the foundation of many methods that are thought to be "robust" (e.g. Generalized Estimating Equations (GEE) for longitudinal data) because they do not require specification of a full parametric model. For more details see Agresti (2007, Sec 9.2) or Agresti (2013, Sec 12.2).
In the quasi-likelihood approach, we must first specify the "mean function," which determines how μ_{i} = E(y_{i}) is related to the covariates. In the context of logistic regression, the mean function is
\(\mu_i=n_i \text{expit}(x_i^T \beta)\),
which implies
\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\)
Then we must specify the "variance function," which determines the relationship between the variance of the response variable and its mean. For a binomial model, the variance function is μ_{i}(n_{i} − μ_{i})/n_{i}.
But to account for overdispersion, we will include another factor σ^{2} called the "scale parameter," so that
\(V(y_i)=\sigma^2 \mu_i (n_i-\mu_i)/n_i\).
Note that no matter what σ^{2} is assumed to be, we get the same estimate for β. Therefore, this method for overdispersion does not change the estimate for β at all. However, the estimated covariance for \(\hat{\beta}\) changes from
\(\hat{V}(\hat{\beta})=(X^T W X)^{-1}\)
to
\(\hat{V}(\hat{\beta})=\sigma^2 (X^T W X)^{-1}\)
That is, the estimated standard errors must be multiplied by the factor \(\sigma=\sqrt{\sigma^2}\).
How do we estimate σ^{2}?
McCullagh and Nelder (1989) recommend
\(\hat{\sigma}^2=X^2/(N-p)\)
where X^{2} is the usual Pearson goodness-of-fit statistic, N is the number of sample cases (the number of rows in the dataset we are modeling), and p is the number of parameters.
This is a reasonable way to estimate σ^{2} if the mean model \(\mu_i=g(x_i^T \beta)\) holds. But if important covariates are omitted, then X^{2} tends to grow and the estimate for σ^{2} can be too large. For this reason, we will estimate σ^{2} under a maximal model, a model that includes all of the covariates we wish to consider.
The best way to estimate σ^{2} is to identify a rich model for μ_{i} and designate it to be the most complicated one that we are willing to consider. For example, if we have a large pool of potential covariates, we may take the maximal model to be the model that has every covariate included as a main effect. Or, if we have a smaller number of potential covariates, we decide to include all main effects along with twoway and perhaps even threeway interactions. But we must omit at least a few higherorder interactions, otherwise we will end up with a model that is saturated.
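The scale estimate \(\hat{\sigma}^2=X^2/(N-p)\) is straightforward to compute once the Pearson residuals of the maximal model are in hand. Here is a Python sketch; the residual values and parameter count are hypothetical, chosen only to illustrate the arithmetic.

```python
def scale_estimate(pearson_residuals, p):
    """McCullagh-Nelder scale estimate sigma^2 = X^2 / (N - p),
    where X^2 is the sum of squared Pearson residuals."""
    N = len(pearson_residuals)
    X2 = sum(r * r for r in pearson_residuals)
    return X2 / (N - p)

# Hypothetical Pearson residuals from a maximal model with p = 2 parameters;
# many of them exceed 1 in absolute value, suggesting overdispersion.
sigma2_hat = scale_estimate([1.8, -2.1, 0.4, 2.7, -1.9, 0.9, -2.4, 1.2, -0.6], 2)
```

A value of \(\hat{\sigma}^2\) well above 1, as here, is the signature of overdispersion.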
In an overdispersed model, we must also adjust our test statistics. The statistics X^{2} and G^{2} are adjusted by dividing them by \(\hat{\sigma}^2\). That is, tests of nested models are carried out by comparing differences in the scaled Pearson statistic, \(\Delta X^2/\hat{\sigma}^2\), or the scaled deviance, \(\Delta G^2/\hat{\sigma}^2\), to a chi-square distribution with Δdf degrees of freedom.
If the data are overdispersed — that is, if
\(V(y_i) \approx \sigma^2 n_i \pi_i (1-\pi_i)\)
for a scale factor σ^{2} > 1, then the residual plot may still resemble a horizontal band, but many of the residuals will tend to fall outside the ±3 limits. In this case, the denominator of the Pearson residual understates the true variance of the y_{i}, making the residuals larger. If the plot looks like a horizontal band but X^{2} and G^{2} indicate lack of fit, an adjustment for overdispersion might be warranted.

A warning about this, however: if the residuals tend to be too large, it doesn't necessarily mean that overdispersion is the cause. Large residuals may also be caused by omitted covariates. If some important covariates are omitted from x_{i}, then the true π_{i}'s will depart from what the model predicts, causing the numerator of the Pearson residual to be larger than usual. That is, apparent overdispersion could also be an indication that the mean model needs additional covariates. If these additional covariates are not available in the dataset, however, there is not much we can do about it; we may need to attribute the lack of fit to overdispersion.
Note that there is no overdispersion for ungrouped data. McCullagh and Nelder (1989) point out that overdispersion is not possible if n_{i} = 1. If y_{i} only takes values 0 and 1, then it must be distributed as Bernoulli(π_{i}) and its variance must be π_{i}(1 − π_{i}); there is no other distribution with support {0, 1}. Therefore, with ungrouped data, we should always assume scale = 1 and not try to estimate a scale parameter or adjust for overdispersion.
The usual way to correct for overdispersion in a logit model is to assume that:
\(E(y_i)=n_i \pi_i\)
\(V(y_i)=\sigma^2 n_i \pi_i (1-\pi_i)\)
where σ^{2} is a scale parameter.
Under this modification, the Fisher-scoring procedure for estimating β does not change, but its estimated covariance matrix becomes σ^{2}(X^{T}WX)^{-1}—that is, the usual standard errors are multiplied by the square root of σ^{2}.
This will make the confidence intervals wider.
Let's get back to our example and refit the model, making an adjustment for overdispersion.
We can refit the model, making an adjustment for overdispersion in SAS by changing the model statement to
model y/n= logconc / scale=pearson;
In SAS, including the option scale=Pearson in the model statement will perform the adjustment. It will not change the estimated coefficients β_{j}, but it will adjust the standard errors.
Here's the output:
NOTE: The covariance matrix has been multiplied by the heterogeneity factor (Pearson ChiSquare / DF) 4.08043.
The estimated scale parameter is \(\hat{\sigma}^2=X^2/df=4.08\). SAS automatically scales the covariance matrix by this factor, which means that the reported standard errors have already been multiplied by \(\sqrt{4.08} \approx 2.02\).
If you are using glm() in R and want to refit the model adjusting for overdispersion, one way of doing it is to use the summary.glm() function. Fit the model with glm() and save the object as RESULT; by default, the dispersion is equal to 1. Estimate the dispersion from the MAXIMAL model as X^{2}/df. Then you can call

summary(RESULT, dispersion=4.08, correlation=TRUE, symbolic.cor=TRUE)

This gives the same fitted model but with an adjusted covariance matrix, that is, adjusted standard errors (SEs) for your betas (the estimated logistic regression coefficients) and correspondingly changed z-values; the estimated coefficients β_{j} themselves do not change. Notice that it will not adjust the overall fit statistics. For that, try the package "dispmod" (see assay.R).
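The effect of the adjustment on the standard errors is just multiplication by \(\sqrt{\hat{\sigma}^2}\). Here is a minimal Python sketch using the scale estimate 4.08 from this example; the unadjusted SE values are hypothetical, not taken from the actual output.

```python
import math

# Estimated scale from the example output
sigma2 = 4.08

def adjust_se(se, sigma2):
    """Inflate a naive binomial standard error by sqrt(sigma^2).
    The coefficient estimates themselves are unchanged."""
    return se * math.sqrt(sigma2)

# Hypothetical unadjusted SEs for (intercept, log-dose):
se_naive = [2.90, 0.98]
se_adj = [adjust_se(s, sigma2) for s in se_naive]
```

With \(\hat{\sigma}^2 = 4.08\), every standard error roughly doubles, which widens the confidence intervals accordingly.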
R output after adjusting for overdispersion:
There are other corrections that we could make. If we were constructing an analysis-of-deviance table, we would want to divide G^{2} and X^{2} by \(\hat{\sigma}^2\) and use these scaled versions for comparing nested models. Moreover, in reporting residuals, it would be appropriate to modify the Pearson residuals to
\( r_i^\ast=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{\hat{\sigma}^2 n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\) ;
that is, we should divide the Pearson residuals (or the deviance residuals, for that matter) by \(\sqrt{\hat{\sigma}^2}\).
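These two scalings can be sketched in a few lines of Python; the residual values and deviance difference below are hypothetical, used only to show the arithmetic.

```python
import math

def scaled_pearson_residuals(residuals, sigma2):
    """Divide each Pearson (or deviance) residual by sqrt(sigma^2)."""
    s = math.sqrt(sigma2)
    return [r / s for r in residuals]

def scaled_statistic(delta_g2, sigma2):
    """Scaled deviance difference Delta G^2 / sigma^2 for comparing
    nested models, referred to chi-square with Delta df degrees of freedom."""
    return delta_g2 / sigma2

# Hypothetical values: two raw residuals and a deviance difference
r_star = scaled_pearson_residuals([2.0, -4.0], 4.0)
g2_scaled = scaled_statistic(8.16, 4.08)
```

After scaling, residuals that looked alarmingly large under the naive binomial variance fall back toward the usual ±3 band.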
A Receiver Operating Characteristic (ROC) curve is a standard technique for summarizing classifier performance over a range of trade-offs between true positive (TP) and false positive (FP) error rates (Swets, 1988). The ROC curve is a plot of sensitivity (the ability of the model to predict an event correctly) versus 1 − specificity over the possible cutoff classification probability values π_{0}.
For logistic regression, you can create a 2 × 2 classification table of the predicted value \(\hat{y}=0\) or 1 from your model versus the true value y = 0 or 1. The prediction \(\hat{y}=1\) depends on some cutoff probability, π_{0}: for example, \(\hat{y}=1\) if \(\hat{\pi}_i>\pi_0\) and \(\hat{y}=0\) if \(\hat{\pi}_i \leq \pi_0\). The most common value is π_{0} = 0.5. Then \(sensitivity=P(\hat{y}=1 \mid y=1)\) and \(specificity=P(\hat{y}=0 \mid y=0)\).
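The classification table, sensitivity, and specificity can be computed directly from the fitted probabilities. This is a Python sketch; the outcomes and fitted probabilities are invented for illustration.

```python
def classification_rates(y_true, pi_hat, cutoff=0.5):
    """Sensitivity P(y_hat=1 | y=1) and specificity P(y_hat=0 | y=0)
    for a given cutoff probability pi_0."""
    y_pred = [1 if p > cutoff else 0 for p in pi_hat]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    npos = sum(y_true)
    nneg = len(y_true) - npos
    return tp / npos, tn / nneg

# Hypothetical outcomes and fitted probabilities:
y    = [1, 1, 1, 0, 0, 0, 1, 0]
phat = [0.9, 0.7, 0.4, 0.3, 0.6, 0.2, 0.8, 0.1]
sens, spec = classification_rates(y, phat, 0.5)
```

Varying the `cutoff` argument traces out the pairs of rates that make up the ROC curve.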
The ROC curve is more informative than the classification table since it summarizes the predictive power for all possible π_{0}.
The position of the ROC on the graph reflects the accuracy of the diagnostic test. It covers all possible thresholds (cutoff points). The ROC of random guessing lies on the diagonal line. The ROC of a perfect diagnostic technique is a point at the upper left corner of the graph, where the TP proportion is 1.0 and the FP proportion is 0.
The Area Under the Curve (AUC), also referred to as the index of accuracy (A), or the concordance index (c in SAS), is the accepted traditional performance metric for a ROC curve: the higher the area under the curve, the better the predictive power of the model. For example, c = 0.8 means that a randomly selected individual from the positive group has a fitted value larger than that of a randomly chosen individual from the negative group 80 percent of the time.
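That probabilistic interpretation of c suggests a direct way to compute the AUC: count, over all positive-negative pairs, how often the positive case has the higher fitted probability (counting ties as half). Here is a Python sketch with invented data.

```python
def concordance_index(y_true, pi_hat):
    """AUC as the concordance index c: the probability that a randomly
    chosen positive case has a higher fitted probability than a randomly
    chosen negative case, with ties counted as 1/2."""
    pos = [p for t, p in zip(y_true, pi_hat) if t == 1]
    neg = [p for t, p in zip(y_true, pi_hat) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and fitted probabilities:
y    = [1, 1, 1, 0, 0, 0, 1, 0]
phat = [0.9, 0.7, 0.4, 0.3, 0.6, 0.2, 0.8, 0.1]
c = concordance_index(y, phat)
```

This pairwise count is equivalent to the area under the empirical ROC curve.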
For more details see Agresti (2007), Sections 5.1.6-5.1.8, Agresti (2013), Section 6.2, and the SAS links provided at the beginning.
Here is the SAS program assay4.sas.
Here is the resulting ROC graph.
The area under the curve, c = 0.746, indicates good predictive power of the model.
Option ctable prints the classification tables for various cutoff points. Each row of this output is a classification table for the specified Prob Level, π_{0}.
Here is the R program file assay.R that corresponds to the SAS program assay4.sas.
Here is the ROC graph from R output:
The area under the curve is c = 0.746, which indicates good predictive power of the model.