Lesson 9: Data Transformations
In Lessons 4 and 7, we learned tools for detecting problems with a linear regression model. Once we've identified problems with the model, we have a number of options:
- If important predictor variables are omitted, see whether adding the omitted predictors improves the model.
- If the mean of the response is not a linear function of the predictors, try a different function. For example, polynomial regression transforms one or more predictor variables while remaining within the multiple linear regression framework. Likewise, applying a logarithmic transformation to the response variable allows for a nonlinear relationship between the response and the predictors, again within the multiple linear regression framework.
- If there are unequal error variances, try transforming the response and/or predictor variables or use "weighted least squares regression."
- If an outlier exists, try using a "robust estimation procedure."
- If error terms are not independent, try fitting a "time series model."
Transforming response and/or predictor variables therefore has the potential to remedy a number of model problems. Such data transformations are the focus of this lesson. (We cover weighted least squares and robust regression in Lesson 13 and time series models in Lesson 14.)
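As a rough illustration of the log-transformation point in the list above, the following Python sketch fits an ordinary simple linear regression to ln(y) instead of y. The data, the helper name `fit_line`, and the coefficient values 5 and 0.3 are all assumptions made up for this example; they do not come from the lesson's data sets.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical error-free data from y = 5 * exp(0.3 * x): curved in x,
# but ln(y) = ln(5) + 0.3 * x is a straight line in x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [5.0 * math.exp(0.3 * x) for x in xs]

# Transform the response only, then fit the usual linear model.
log_ys = [math.log(y) for y in ys]
b0, b1 = fit_line(xs, log_ys)

print(f"intercept on the ln scale: {b0:.4f}")
print(f"slope on the ln scale:     {b1:.4f}")
print(f"exp(intercept):            {math.exp(b0):.4f}")
```

Because the made-up data contain no error, the fitted slope should land at about 0.3 and exp(intercept) at about 5. The model is nonlinear in y but still linear in its parameters, which is why the ordinary least squares machinery continues to apply.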
To introduce the basic ideas behind data transformations, we first consider a simple linear regression model in which:
- We transform the predictor (x) values only.
- We transform the response (y) values only.
- We transform both the predictor (x) values and response (y) values.
It is easy to understand how transformations work in the simple linear regression context because we can see everything in a scatterplot of y versus x. However, these basic ideas apply just as well to multiple linear regression models. With multiple predictors, we can no longer see everything in a single scatterplot, so we use residual plots to guide us instead.
You will discover that data transformation requires a "trial and error" approach. In building the model, we try a transformation and then check whether it eliminated the problems with the model. If it didn't help, we try another transformation, and so on. We continue this cyclical process until we've built a model that is appropriate and that we can use. That is, the process of model building includes model formulation, model estimation, and model evaluation:
- Model building
  - Model formulation
  - Model estimation
  - Model evaluation
- Model use
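The try-and-check cycle described above can be sketched in Python as a loop over candidate transformations (none, x only, y only, both). Everything here is an assumption for illustration: the data follow a made-up power curve, the helper names are hypothetical, and the "largest residual relative to the response range" score with its 0.01 tolerance is only a crude stand-in for inspecting residual plots, not a formal test.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def max_relative_residual(xs, ys):
    """Largest |residual| from a straight-line fit, divided by the range of ys."""
    b0, b1 = fit_line(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return max(abs(r) for r in residuals) / (max(ys) - min(ys))


def identity(value):
    """Leave the variable untransformed."""
    return value


# Hypothetical error-free data from the power curve y = 2 * x ** 1.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x ** 1.5 for x in xs]

# Each candidate: (label, transformation of x, transformation of y).
candidates = [
    ("no transformation", identity, identity),
    ("ln x only", math.log, identity),
    ("ln y only", identity, math.log),
    ("ln x and ln y", math.log, math.log),
]

TOLERANCE = 0.01  # assumed cutoff for "no systematic pattern left"
chosen = None
slope = None
for name, transform_x, transform_y in candidates:
    tx = [transform_x(x) for x in xs]   # formulation
    ty = [transform_y(y) for y in ys]
    score = max_relative_residual(tx, ty)  # estimation and evaluation
    print(f"{name:18s} max relative residual = {score:.4f}")
    if score < TOLERANCE:
        chosen = name
        slope = fit_line(tx, ty)[1]
        break  # stop cycling once the check passes

print("chosen:", chosen)
```

With real data the evaluation step would rest on residual-versus-fits plots and normal probability plots of the residuals, and more than one candidate may pass.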
We don't leave the model building process until we've convinced ourselves that the model meets the four conditions ("LINE") of the linear regression model. One important thing to remember is that there is often more than one viable model. The model you choose and the model a colleague chooses may differ and yet both be equally appropriate. What's important is that the model you choose:
- is not overly complicated,
- meets the four conditions of the linear regression model, and
- allows you to answer your research question of interest.
Don't forget that data analysis is an artful science! It involves making subjective decisions using very objective tools!
Learning objectives & outcomes
Upon completion of this lesson, you should be able to do the following:
- Understand when transforming predictor variables might help and when transforming the response variable might help (or when it might be necessary to do both).
- Use estimated regression models based on transformed data to answer various research questions.
- Perform the calculations needed to interpret the slope parameter meaningfully when the data are log-transformed.
- Use an estimated regression equation based on transformed data to predict a future response (prediction interval) or estimate a mean response (confidence interval).
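The interpretation and back-transformation calculations named in the objectives above reduce to a little arithmetic when natural logarithms are used. The sketch below is one hedged way to write them down. The function names and the numeric inputs are hypothetical, and the "median" wording assumes the errors are symmetric on the log scale.

```python
import math


def effect_log_response(b1, delta_x=1.0):
    """Model ln(y) = b0 + b1 * x.

    Returns the factor by which the median of y is multiplied
    when x increases by delta_x.
    """
    return math.exp(b1 * delta_x)


def effect_log_predictor(b1, fold=2.0):
    """Model y = b0 + b1 * ln(x).

    Returns the additive change in the mean of y
    when x is multiplied by `fold`.
    """
    return b1 * math.log(fold)


def effect_log_both(b1, fold=2.0):
    """Model ln(y) = b0 + b1 * ln(x).

    Returns the factor by which the median of y is multiplied
    when x is multiplied by `fold`.
    """
    return fold ** b1


def back_transform_interval(lower, upper):
    """Map an interval computed for ln(y) back to the original units of y."""
    return math.exp(lower), math.exp(upper)


# Hypothetical slope estimates, chosen only to exercise the functions.
print(effect_log_response(0.3))             # multiplier per 1-unit increase in x
print(effect_log_predictor(4.0, fold=2.0))  # change in mean y when x doubles
print(effect_log_both(1.5, fold=2.0))       # multiplier when x doubles

# Hypothetical prediction interval on the ln(y) scale, returned in y units.
print(back_transform_interval(1.2, 2.4))
```

Exponentiating the endpoints of a prediction interval for ln(y) gives a prediction interval for y. Exponentiating a confidence interval for the mean of ln(y) gives an interval for the median of y rather than its mean.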