An important step in the analysis of any dataset is Exploratory Data Analysis (EDA), including the graphical display of data.

Why do we look at graphical displays of the data? Graphical displays may:

- suggest a plausible model for the data,
- help assess the validity of model assumptions,
- detect outliers, or
- suggest plausible normalizing transformations.

Many multivariate methods assume that the data have a multivariate normal distribution. Exploratory data analysis through the graphical display of data may be used to assess the normality of data. If evidence is found that the data are not normally distributed, then graphical methods may be applied to determine appropriate normalizing transformations for the data.

In this course we will use SAS and Minitab to demonstrate graphical methods, as well as for other applications later. SAS and Minitab diagrams are provided side by side as far as possible. If diagrams require extensive instructions, separate tabs are provided for SAS and Minitab.

The objectives of this lesson are:

- Introduce graphical methods for summarizing multivariate data including histograms, scatterplot matrices, and rotating 3-dimensional scatterplots;
- Produce graphics using interactive data analysis in SAS and Minitab;
- Understand when transformations of the data should be applied and what specific transformations should be considered;
- Learn how to identify unusual observations (outliers), and understand issues regarding how outliers should be handled if they are detected.

Let us take a look again at the nutrition data. In 1985, the USDA commissioned a study of women’s nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured:

- Calcium (mg)
- Iron (mg)
- Protein (g)
- Vitamin A (μg)
- Vitamin C (mg)

We can read the data from the nutrient.txt file into SAS with this code. Various transformed variables are also created at this step for inspection. Here are some different ways we could examine these data graphically using SAS (and Minitab).

**Using Histograms we can:**

- Assess Normality
- Find Normalizing Transformations
- Detect Outliers

Here we have a histogram (produced in SAS) for daily intake of calcium. Note that the data appear to be skewed to the right, suggesting that calcium is not normally distributed. This suggests that a normalizing transformation should be considered.

Common transformations include:

- Square Root (often used with count data)
- Quarter Root
- Log (either natural or base 10)

The square root transformation is the weakest of the above transformations, while the log transformation is the strongest. In practice, it is generally a good idea to try all three transformations to see which appears to yield the most symmetric distribution.
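The ordering from weakest to strongest can also be checked numerically. The following is a minimal sketch in Python rather than SAS or Minitab, and it uses synthetic right-skewed values, not the nutrient data. The helper name `sample_skewness` and the simulated sample are assumptions made for illustration only. A skewness near zero indicates an approximately symmetric distribution.

```python
import math
import random

def sample_skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic, strictly positive, right-skewed values (NOT the nutrient data).
random.seed(1)
raw = [random.lognormvariate(6.5, 0.6) for _ in range(500)]

# The three candidate transformations, ordered from weakest to strongest.
transformed = {
    "raw": raw,
    "square root": [math.sqrt(v) for v in raw],
    "quarter root": [v ** 0.25 for v in raw],
    "log": [math.log(v) for v in raw],  # requires strictly positive values
}

skewness = {name: sample_skewness(vals) for name, vals in transformed.items()}
for name, value in skewness.items():
    print(f"{name:>12}: skewness = {value:6.3f}")
```

Each stronger transformation pulls the skewness further down. Because the synthetic values here are drawn from a lognormal distribution, the log transformation should come out closest to symmetric in this sketch; on real data a different transformation may do better, and the choice should still be confirmed by inspecting the histograms.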

The following shows histograms for the raw data (calcium), square-root transformation (S_calciu), quarter-root transformation (S_S_calc), and log transformation (L_calciu). With increasingly stronger transformations of the data, the distribution shifts from being skewed to the right to being skewed to the left. Here, the square-root transformed data are still slightly skewed to the right, suggesting that the square-root transformation is not strong enough. In contrast, the log-transformed data are skewed to the left, suggesting that the log transformation is too strong. The quarter-root transformation results in the most symmetric distribution, suggesting that this transformation is the most appropriate for these data.

In practice, histograms should be plotted for each of the variables, and transformations should be applied as needed. There is no 'best' transformation for all data sets.

**Using Scatter Plots we can:**

- Describe relationships between pairs of variables
- Assess linearity
- Find Linearizing Transformations
- Detect Outliers

Here we have a scatterplot (produced in Minitab) in which calcium is plotted against iron. This plot suggests that daily intake of calcium tends to increase with increasing daily intake of iron. If the data have a bivariate normal distribution, then the scatterplot should be approximately elliptical in shape. However, the points appear to fan out from the origin, suggesting that the data are not bivariate normal.

After applying quarter-root transformations to both calcium and iron, we obtain a scatter of points that appears to be more elliptical in shape. Moreover, it appears that the relationship between the transformed variables is approximately linear. The point in the lower left-hand corner appears to be an unusual observation or outlier. Upon closer examination, it was found that this woman reported zero daily intake of iron. Since this is very unlikely to be correct, we might justifiably remove this observation from the data set.

**Outliers:**

Note that it is not appropriate to remove an observation from the data just because it is an outlier. Consider, for example, the ozone hole in the Antarctic. For years, NASA had been flying polar-orbiting satellites designed to measure ozone in the upper atmosphere without detecting an ozone hole. Then, one day, a scientist visiting the Antarctic pointed an instrument straight up into the sky and found evidence of an ozone hole. What happened? It turned out that the software used to process the NASA satellite data had a routine for automatically removing outliers. In this case, all observations with unusually low ozone levels were automatically removed by this routine. A close review of the raw (unprocessed) satellite data confirmed that there was an ozone hole.

The above is a special case, where the outliers themselves are the most interesting observations. In general, outliers are removed only if there is a compelling reason to believe that something is wrong with the individual observations; e.g., if the observation is deemed to be impossible, as in the case of zero daily intake of iron. This underscores the need to have good field or lab notes with details on the data collection process. Lab notes may indicate that something went wrong with an individual observation; e.g., a laboratory sample may have been dropped on the floor, leading to contamination. If such a sample results in an outlier, then that sample may legitimately be removed from the data.
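One way to act on this advice is to flag suspicious values for review instead of deleting them automatically. The sketch below is written in Python rather than SAS or Minitab. The function name `flag_impossible`, the rule that a non-positive daily intake is impossible, and the three records are all assumptions made for illustration; they are not part of the nutrient data set.

```python
def flag_impossible(records, variables):
    """Return (index, variable, value) for every non-positive intake value."""
    flagged = []
    for index, record in enumerate(records):
        for name in variables:
            if record[name] <= 0:
                flagged.append((index, name, record[name]))
    return flagged

# Hypothetical records for illustration only (NOT the nutrient data).
records = [
    {"calcium": 522.3, "iron": 10.2},
    {"calcium": 343.3, "iron": 0.0},   # zero daily iron intake: implausible
    {"calcium": 858.3, "iron": 13.7},
]

# Nothing is removed here; flagged values are only reported for review.
for index, name, value in flag_impossible(records, ["calcium", "iron"]):
    print(f"record {index}: {name} = {value} needs review")
```

Reporting the flagged values, rather than dropping them inside the function, leaves the decision to remove an observation with the analyst, who can check it against the field or lab notes.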

Outliers often have greater influence on the results of data analyses than the remaining observations. For example, outliers have a strong influence on the calculation of the sample mean. If outliers are detected and there is no corroborating evidence to suggest that they should be removed, then resistant statistical techniques should be applied. Here, by resistant techniques we mean techniques that are not easily influenced by outliers. For example, the sample median is not sensitive to outliers, and so may be calculated in place of the sample mean when we suspect that the sample mean gives a misleading picture. Outlier-resistant methods go well beyond the scope of this course. If outliers are detected, then you should consult with a statistician.
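The difference between the mean and the median can be seen in a small numerical example. This is a minimal sketch in Python (not SAS or Minitab) using made-up values chosen only for illustration.

```python
import statistics

# Hypothetical measurements for illustration only.
typical = [10, 11, 12, 12, 13]
with_outlier = typical + [120]   # add a single extreme observation

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

A single extreme value more than doubles the sample mean (from 11.6 to about 29.7), while the sample median stays at 12.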

**Using Rotating Scatter Plots we can:**

- Describe relationships among three variables
- Detect Outliers

Using rotating scatter plots in SAS

By rotating a 3-dimensional scatterplot, the illusion of three dimensions can be achieved. Here, we are looking to see if the cloud of points is approximately elliptical in shape.

Creating a 3D Scatter plot in Minitab for L_calc, L_iron and L_prot.

- Select Graph > 3D Scatter Plot
- The default is already Simple, so click OK.
- In Z, enter L_iron. In Y, enter L_prot. In X, enter L_calc.
- Click OK.

Note: The plot (shown below) can be rotated using the 3D Graph tools that appear with the plot. If the tools do not appear, choose Tools > Toolbars and check 3D Graph Tools.

**Using a Matrix of Scatter Plots we can:**

- Look at all of the relationships between pairs of variables in one group of plots
- Describe relationships among three or more variables

Here, we have a matrix of scatterplots for quarter-root transformed data on all variables. Note that each variable appears to be positively related to the remaining variables. However, the strength of that relationship depends on which pair of variables is considered. For example, quarter-root iron is strongly related to quarter-root protein, but the relationship between quarter-root calcium and quarter-root vitamin C is not very strong.
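The visual impression from a scatterplot matrix can be backed up with pairwise correlation coefficients. The following is a minimal sketch in Python rather than SAS or Minitab. The variables `var_a`, `var_b`, and `var_c` are synthetic and hypothetical, built so that one pair is strongly related and the others only weakly; they are not the nutrient variables.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: all three variables share a common component,
# but var_c carries much more noise than var_a and var_b.
random.seed(2)
shared = [random.gauss(0, 1) for _ in range(300)]
data = {
    "var_a": [s + random.gauss(0, 0.3) for s in shared],
    "var_b": [s + random.gauss(0, 0.3) for s in shared],
    "var_c": [s + random.gauss(0, 2.0) for s in shared],
}

# One correlation per pair, matching the off-diagonal panels of the matrix.
correlations = {
    (p, q): pearson(data[p], data[q]) for p, q in combinations(data, 2)
}
for (p, q), r in correlations.items():
    print(f"{p} vs {q}: r = {r:.2f}")
```

The Pearson coefficient measures only linear association, so it is best read alongside the scatterplots themselves, ideally after any linearizing transformations have been applied.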

Matrix of scatterplots generated using SAS.

```sas
/* Scatterplot matrix of the quarter-root transformed nutrient variables */
proc sgscatter data=nutrient;
  title "Scatterplot Matrix for Nutrition Data";
  matrix S_S_calc S_S_iron S_S_prot S_S_vitA S_S_vitC;
run;
```

Creating a matrix of scatterplots for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC in Minitab.

- Select Graph > Matrix Plot
- The default is already Simple, so click OK.
- Under Graph variables, enter S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC.
- Click OK.

A matrix plot for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC

In this lesson we learned about:

- How to interpret graphical displays of multivariate data;
- How to determine the most appropriate normalizing transformation of the data;
- How to detect outliers;
- How to use software to produce multivariate graphics;
- Issues regarding when outliers should be removed from the data, or when they should be retained.

Next, complete the homework problems that will give you a chance to put what you have learned to use...