9.7 - A Strategy for Dealing with Problematic Data Points
By now it should be clear that identifying and handling outliers and influential data points is a "wishy-washy" business: the various measures that we have learned in this lesson can lead to different conclusions about how extreme a particular data point is. For this reason, data analysts should use these measures only as a way of screening their data set for potentially influential data points. With this in mind, here is my recommended strategy for dealing with problematic data points:
First, check for obvious data errors:
- If the error is just a data entry or data collection error, correct it.
- If the data point is not representative of the intended study population, delete it.
- If the data point results from a procedural error that invalidates the measurement, delete it.
Consider the possibility that you might have just misformulated your regression model:
- Did you leave out any important predictors?
- Should you consider adding some interaction terms?
- Is there any nonlinearity that needs to be modeled?
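If nonlinearity is an issue, one possibility is to reduce the scope of your model, that is, to restrict it to the range of predictor values over which the relationship is approximately linear. If you do reduce the scope of your model, be sure to report it, so that readers do not misuse your model outside that range.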
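As a rough illustration of the nonlinearity check and of reducing the scope, here is a minimal sketch in Python using NumPy. The data values, the cutoff `x <= 5`, and the helper name `rss` are all made up for this example and do not come from the lesson. The sketch compares the residual sum of squares of a straight-line fit and a quadratic fit, then refits the straight line on a restricted range of the predictor.

```python
import numpy as np

def rss(x, y, degree):
    """Residual sum of squares of a least squares polynomial fit (hypothetical helper)."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    return float(np.sum(resid ** 2))

# Made-up data: roughly linear up to x = 5, then flattening out.
x = np.arange(1.0, 11.0)
y = np.array([2.0, 4.1, 5.9, 8.0, 10.1, 11.0, 11.6, 12.1, 12.3, 12.4])

# Does a curved term help? Compare a straight line with a quadratic.
rss_linear = rss(x, y, 1)
rss_quadratic = rss(x, y, 2)

# Alternative: keep the straight line but reduce the scope of the model
# to the range where the relationship looks linear (assumed here: x <= 5).
in_scope = x <= 5
rss_linear_reduced = rss(x[in_scope], y[in_scope], 1)

print(f"RSS, line on full range:      {rss_linear:.3f}")
print(f"RSS, quadratic on full range: {rss_quadratic:.3f}")
print(f"RSS, line on x <= 5 only:     {rss_linear_reduced:.3f}")
```

A large drop in the residual sum of squares when the quadratic term is added suggests that the straight-line model is misformulated. The reduced-scope fit uses fewer observations, so its residual sum of squares is not directly comparable with the full-range values; it is shown only to indicate that the line fits well inside the restricted range. In practice you would also look at residual plots before deciding.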
Decide whether or not deleting data points is warranted:
- Do not delete data points just because they do not fit your preconceived regression model.
- You must have a good, objective reason for deleting data points.
- If you delete any data after you've collected it, justify and describe it in your reports.
- If you are not sure what to do about a data point, analyze the data twice — once with and once without the data point — and report the results of both analyses.
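The "analyze the data twice" advice can be sketched as follows. This is only an illustration under stated assumptions: the data are made up, the index `suspect` marking the questionable point is hypothetical, and `fit_line` is a small helper written for this example rather than a function from any particular package.

```python
import numpy as np

def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1). Hypothetical helper."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[0]), float(coef[1])

# Made-up data; the last observation is the one we are unsure about.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 15.0])
suspect = 6  # index of the questionable data point (assumed for illustration)

# Boolean mask that keeps every observation except the suspect one.
keep = np.arange(len(x)) != suspect

# Analysis 1: all data points. Analysis 2: suspect point left out.
b0_all, b1_all = fit_line(x, y)
b0_drop, b1_drop = fit_line(x[keep], y[keep])

print(f"with the point:    intercept = {b0_all:.3f}, slope = {b1_all:.3f}")
print(f"without the point: intercept = {b0_drop:.3f}, slope = {b1_drop:.3f}")
```

Reporting both sets of estimates side by side lets readers judge for themselves how much the conclusions depend on the single data point.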
First, foremost, and finally — it's okay to use your common sense and knowledge about the situation.