An important step in the analysis of any dataset is Exploratory Data Analysis (EDA), including the graphical display of data.
Why do we look at graphical displays of the data? Graphical displays may:
Many multivariate methods assume that the data have a multivariate normal distribution. Exploratory data analysis through the graphical display of data may be used to assess the normality of data. If evidence is found that the data are not normally distributed, then graphical methods may be applied to determine appropriate normalizing transformations for the data.
To apply graphical methods, one needs software. In this course we will use SAS and Minitab to demonstrate graphical methods, as well as for other applications later. SAS and Minitab diagrams are shown side by side wherever possible. Where a diagram requires extensive instructions, separate tabs are provided for SAS and Minitab.
The objectives of this lesson are:
Let us take a look again at the nutrition data. In 1985, the USDA commissioned a study of women’s nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured:
We can read the data from the nutrient.txt file into SAS with this code. Various transformed variables are also created at this step for inspection. Here are some different ways we could look at these data graphically using SAS (and Minitab).
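The two steps described above (reading whitespace-delimited records and deriving transformed variables) might be sketched roughly as follows. This is a Python stand-in for illustration only, not the course's SAS program. The helper names `parse_records` and `add_calcium_transforms`, the column order, and the three sample rows are all assumptions made for this example; they are not the actual layout or contents of nutrient.txt. The variable names S_calciu, S_S_calc, and L_calciu follow the labels used for the histograms in this lesson.

```python
import math

def parse_records(text, columns):
    """Parse whitespace-delimited rows into dicts keyed by the given column names."""
    records = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip malformed rows rather than guessing at their meaning
        records.append({name: float(value) for name, value in zip(columns, fields)})
    return records

def add_calcium_transforms(record):
    """Return a copy of the record with transformed calcium variables added."""
    calcium = record["calcium"]
    out = dict(record)
    out["S_calciu"] = math.sqrt(calcium)   # square root
    out["S_S_calc"] = calcium ** 0.25      # quarter root (square root of the square root)
    # The log is undefined at zero, so a zero intake is left as missing (None).
    out["L_calciu"] = math.log(calcium) if calcium > 0 else None
    return out

# Hypothetical sample rows (id, calcium, iron); NOT taken from nutrient.txt.
SAMPLE = """
1 625.0 12.5
2 256.0 8.0
3 1296.0 16.0
"""

rows = [add_calcium_transforms(r)
        for r in parse_records(SAMPLE, ["id", "calcium", "iron"])]
for r in rows:
    print(r["calcium"], r["S_calciu"], r["S_S_calc"], r["L_calciu"])
```

Keeping the transformed variables alongside the raw ones, as the lesson does, makes it easy to plot each version and compare them.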
Using Histograms we can:
Here we have a histogram (produced in SAS) for daily intake of calcium. Note that the data appear to be skewed to the right, suggesting that calcium is not normally distributed and that a normalizing transformation should be considered.
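To make concrete what a histogram summarizes, here is a minimal Python sketch that counts observations in equal-width bins and prints a text version of the bars. It is not how SAS or Minitab draw their histograms. The function name `histogram_counts` and the data values are invented for illustration and are not the calcium measurements from the study.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bins would have zero width")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Invented, right-skewed values (NOT the study data).
intake = [100, 150, 200, 220, 250, 300, 320, 400, 450, 500, 650, 900, 1400, 2100]

edges, counts = histogram_counts(intake, 4)
for i, count in enumerate(counts):
    print(f"{edges[i]:7.1f} to {edges[i + 1]:7.1f} | {'*' * count}")
```

With these made-up values most observations fall in the lowest bin and a few stretch far to the right, which is the right-skewed pattern described above.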
Common transformations include the square root, the quarter root, and the log.
The square root transformation is the weakest of the above transformations, while the log transformation is the strongest. In practice, it is generally a good idea to try all three transformations to see which appears to yield the most symmetric distribution.
The following shows histograms for the raw data (calcium), square-root transformation (S_calciu), quarter-root transformation (S_S_calc), and log transformation (L_calciu). With increasingly stronger transformations of the data, the distribution shifts from being skewed to the right to being skewed to the left. Here, the square-root transformed data are still slightly skewed to the right, suggesting that the square-root transformation is not strong enough. In contrast, the log-transformed data are skewed to the left, suggesting that the log transformation is too strong. The quarter-root transformation results in the most symmetric distribution, suggesting that this transformation is most appropriate for these data.
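The visual comparison above can be complemented by a numerical one. The Python sketch below computes the sample skewness (third central moment divided by the cubed standard deviation) under each transformation; values near zero indicate symmetry. This is an illustrative addition, not part of the SAS or Minitab output. The data are synthetic and deliberately constructed so that their quarter root is symmetric, to mirror the pattern described above; they say nothing about the real calcium measurements.

```python
import math
from statistics import NormalDist, fmean

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Synthetic data: evenly spaced quantiles of a Normal(5, 1), raised to the
# fourth power. By construction the quarter root is symmetric. (Assumption
# made for illustration only; this is NOT the nutrition data.)
n = 737
z = [NormalDist(mu=5.0, sigma=1.0).inv_cdf((i + 0.5) / n) for i in range(n)]
raw = [v ** 4 for v in z]

transforms = {
    "raw": raw,
    "square root": [math.sqrt(v) for v in raw],
    "quarter root": [v ** 0.25 for v in raw],
    "log": [math.log(v) for v in raw],
}

skews = {name: skewness(vals) for name, vals in transforms.items()}
most_symmetric = min(skews, key=lambda name: abs(skews[name]))

for name, value in skews.items():
    print(f"{name:>12}: skewness = {value: .3f}")
print("most symmetric:", most_symmetric)
```

Skewness is only a one-number summary, so it is best read alongside the histograms rather than in place of them.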
In practice, histograms should be plotted for each of the variables, and transformations should be applied as needed. There is no 'best' transformation for all data sets.
Using Scatter Plots we can:
Here we have a scatterplot (produced in Minitab) in which calcium is plotted against iron. This plot suggests that daily intake of calcium tends to increase with increasing daily intake of iron. If the data have a bivariate normal distribution, then the scatterplot should be approximately elliptical in shape. However, the points appear to fan out from the origin, suggesting that the data are not bivariate normal.
After applying quarter-root transformations to both calcium and iron, we obtain a scatter of points that appears to be more elliptical in shape. Moreover, it appears that the relationship between the transformed variables is approximately linear. The point in the lower left-hand corner appears to be an unusual observation or outlier. Upon closer examination, it was found that this woman reported zero daily intake of iron. Since this is very unlikely to be correct, we might justifiably remove this observation from the data set.
Note that it is not appropriate to remove an observation from the data just because it is an outlier. Consider, for example, the ozone hole in the Antarctic. For years, NASA had been flying polar-orbiting satellites designed to measure ozone in the upper atmosphere without detecting an ozone hole. Then, one day, a scientist visiting the Antarctic pointed an instrument straight up into the sky and found evidence of an ozone hole. What happened? It turned out that the software used to process the NASA satellite data had a routine for automatically removing outliers. In this case, all observations with unusually low ozone levels were automatically removed by this routine. A close review of the raw data, as recorded before processing, confirmed that there was an ozone hole.
The above is a special case, where the outliers themselves are the most interesting observations. In general, outliers are removed only if there is a compelling reason to believe that something is wrong with the individual observations; e.g., if the observation is deemed to be impossible, as in the case of zero daily intake of iron. This underscores the need for good field or lab notes with details on the data collection process. Lab notes may indicate that something went wrong with an individual observation; e.g., a laboratory sample may have been dropped on the floor, leading to contamination. If such a sample results in an outlier, then that sample may legitimately be removed from the data.
Outliers often have greater influence on the results of data analyses than the remaining observations. For example, outliers have a strong influence on the calculation of the sample mean. If outliers are detected, and there is no corroborating evidence to suggest that they should be removed, then resistant statistical techniques should be applied. Here, by resistant techniques we mean techniques or processes that are not easily influenced by outliers. For example, the sample median is not sensitive to outliers, and so may be calculated in place of the sample mean if we believe that the sample mean may give a misleading picture. Outlier-resistant methods go well beyond the scope of this course. If outliers are detected, then you should consult with a statistician.
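The difference between the mean and the median in the presence of an outlier can be seen in a small numerical example. The Python sketch below uses invented values (not study data) and adds a single extreme observation to an otherwise well-behaved sample.

```python
from statistics import mean, median

# Invented values for illustration only.
clean = [8.0, 9.5, 10.0, 10.5, 11.0, 12.0, 13.0]
with_outlier = clean + [95.0]  # one extreme observation added

mean_shift = mean(with_outlier) - mean(clean)
median_shift = median(with_outlier) - median(clean)

print("mean:  ", mean(clean), "->", mean(with_outlier))
print("median:", median(clean), "->", median(with_outlier))
```

Here the single outlier roughly doubles the sample mean, while the median moves only from 10.5 to 10.75.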
Here, we have a matrix of scatterplots for quarter-root transformed data on all variables. Note that each variable appears to be positively related to the remaining variables. However, the strength of that relationship depends on which pair of variables is considered. For example, quarter-root iron is strongly related to quarter-root protein, but the relationship between quarter-root calcium and quarter-root vitamin C is not very strong.
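A numerical companion to a scatterplot matrix is the matrix of pairwise Pearson correlations, which puts a number on the strength of each linear relationship. The Python sketch below is an illustrative addition rather than part of the SAS or Minitab output. The helper names `pearson` and `correlation_matrix` and the data values are invented for this example and are not the nutrition data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented values for illustration only (NOT the study data).
data = {
    "calcium": [400.0, 650.0, 300.0, 900.0, 520.0, 760.0],
    "iron": [9.0, 12.0, 7.5, 15.0, 11.0, 10.0],
    "protein": [50.0, 68.0, 45.0, 85.0, 62.0, 60.0],
}

# Apply the quarter-root transformation to every variable before correlating.
quarter_root = {name: [v ** 0.25 for v in vals] for name, vals in data.items()}
corr = correlation_matrix(quarter_root)

for a in corr:
    print(f"{a:>8}", " ".join(f"{corr[a][b]:6.3f}" for b in corr))
```

A correlation only measures linear association, so the scatterplots remain necessary for spotting curvature, fanning, or outliers.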
In this lesson we learned about:
Next, complete the homework problems that will give you a chance to put what you have learned to use...