Lesson 3: Graphical Display of Multivariate Data


An important step in the analysis of any dataset is Exploratory Data Analysis (EDA), including the graphical display of data.

Why do we look at graphical displays of the data? Graphical displays may:

Many multivariate methods assume that the data have a multivariate normal distribution. Exploratory data analysis through the graphical display of data may be used to assess the normality of data. If evidence is found that the data are not normally distributed, then graphical methods may be applied to determine appropriate normalizing transformations for the data.

To apply graphical methods, one needs software. In this course we will use SAS and Minitab to demonstrate graphical methods, and later for other applications as well. SAS and Minitab diagrams are provided side by side as far as possible. If diagrams require extensive instructions, separate tabs are provided for SAS and Minitab.

Learning Objectives & Outcomes

The objectives of this lesson are:

3.1 - Graphical Methods

Example: USDA Women’s Health Survey

Let us take a look again at the nutrition data. In 1985, the USDA commissioned a study of women’s nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured:

The data can be read from the nutrient.txt file into SAS. Various transformed variables are also created at this step for inspection. Here are some different ways we could look at these data graphically using SAS (and Minitab).
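For readers without SAS, the same read-and-transform step can be sketched in Python. This is only an illustration: the column layout assumed here (id, calcium, iron, protein, vitamin A, vitamin C, separated by whitespace) and the three sample rows are made up for the example and are not taken from nutrient.txt. The names read_nutrient and add_calcium_transforms are hypothetical helpers; the derived names S_calciu, S_S_calc, and L_calciu follow the variable names used in this lesson.

```python
import io
import math

# Made-up rows for illustration only. The real nutrient.txt layout is an
# assumption here: id, calcium, iron, protein, vitamin A, vitamin C.
SAMPLE = """\
1 500.0 10.0 60.0 800.0 75.0
2 900.0 14.0 72.0 650.0 40.0
3 250.0  6.0 45.0 300.0 20.0
"""

COLUMNS = ["id", "calcium", "iron", "protein", "vitA", "vitC"]


def read_nutrient(handle):
    """Parse whitespace-delimited rows into a list of dicts (hypothetical helper)."""
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, (float(f) for f in fields)))
        row["id"] = int(row["id"])
        rows.append(row)
    return rows


def add_calcium_transforms(row):
    """Add square-root, quarter-root, and log versions of calcium to a row."""
    c = row["calcium"]
    row["S_calciu"] = math.sqrt(c)             # square root
    row["S_S_calc"] = math.sqrt(math.sqrt(c))  # quarter root
    # The log is undefined at zero, so a zero intake is stored as None.
    row["L_calciu"] = math.log(c) if c > 0 else None
    return row


rows = [add_calcium_transforms(r) for r in read_nutrient(io.StringIO(SAMPLE))]
for r in rows:
    print(r["id"], round(r["S_calciu"], 3), round(r["S_S_calc"], 3), round(r["L_calciu"], 3))
```

To read an actual file, io.StringIO(SAMPLE) would be replaced by an open file handle, after checking the real column order.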

Univariate Cases:

Using Histograms we can:

Here we have a histogram (produced in SAS) for daily intake of calcium. Note that the data appear to be skewed to the right, suggesting that calcium is not normally distributed. This suggests that a normalizing transformation should be considered.

Histogram of daily calcium intake, generated using SAS.

Common transformations include the square root, the quarter root (fourth root), and the logarithm.

The square root transformation is the weakest of the above transformations, while the log transformation is the strongest. In practice, it is generally a good idea to try all three transformations to see which appears to yield the most symmetric distribution.

The following shows histograms for the raw data (calcium), square-root transformation (S_calciu), quarter-root transformation (S_S_calc), and log transformation (L_calciu). With increasingly stronger transformations of the data, the distribution shifts from being skewed to the right to being skewed to the left. Here, the square-root transformed data are still slightly skewed to the right, suggesting that the square-root transformation is not strong enough. In contrast, the log-transformed data are skewed to the left, suggesting that the log transformation is too strong. The quarter-root transformation results in the most symmetric distribution, suggesting that this transformation is most appropriate for these data.

Histograms of calcium under each transformation, generated using SAS.

In practice, histograms should be plotted for each of the variables, and transformations should be applied as needed. There is no 'best' transformation for all data sets.
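A numeric companion to the histograms is the sample skewness, which is positive for right-skewed data, negative for left-skewed data, and near zero for symmetric data. The sketch below is an assumption-laden illustration, not the course's method: it uses simulated gamma-distributed values in place of the survey data, and the moment-based skewness formula is one of several in common use.

```python
import math
import random


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Simulated right-skewed, strictly positive values standing in for an
# intake variable. These are NOT the survey measurements.
rng = random.Random(0)
raw = [rng.gammavariate(2.0, 300.0) for _ in range(737)]

# Transformations listed from weakest to strongest.
transforms = {
    "raw": lambda x: x,
    "square root": math.sqrt,
    "quarter root": lambda x: x ** 0.25,
    "log": math.log,
}

skews = {name: skewness([f(x) for x in raw]) for name, f in transforms.items()}
for name, value in skews.items():
    print(f"{name:>12}: skewness = {value:+.3f}")
```

The transformation whose skewness is closest to zero is a reasonable first candidate, but the histogram should still be inspected, since skewness alone does not reveal outliers or multiple modes.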

Bivariate Cases:

Using Scatter Plots we can:

Here we have a scatterplot (produced in Minitab) in which calcium is plotted against iron. This plot suggests that daily intake of calcium tends to increase with increasing daily intake of iron. If the data have a bivariate normal distribution, then the scatterplot should be approximately elliptical in shape. However, the points appear to fan out from the origin, suggesting that the data are not bivariate normal.
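The visual check for an elliptical cloud can be supplemented with a rough numeric check. For bivariate normal data, squared Mahalanobis distances from the mean approximately follow a chi-square distribution with 2 degrees of freedom, whose median is 2 ln 2 (about 1.386), so roughly half of the points should fall inside that ellipse. The sketch below assumes simulated bivariate normal data rather than the calcium and iron measurements, and the helper name mahalanobis_sq is hypothetical.

```python
import math
import random


def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Quadratic form using the inverse of the 2x2 sample covariance matrix.
    return [
        (syy * (x - mx) ** 2
         - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in zip(xs, ys)
    ]


# Simulated bivariate normal data with correlation 0.7 (illustration only).
rng = random.Random(2)
rho = 0.7
xs, ys = [], []
for _ in range(2000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(z1)
    ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)

median_chi2_2df = 2 * math.log(2)  # median of chi-square with 2 df
d2 = mahalanobis_sq(xs, ys)
frac_inside = sum(d <= median_chi2_2df for d in d2) / len(d2)
print(f"fraction inside the 50% ellipse: {frac_inside:.3f}")
```

A fraction far from one half for a real pair of variables would be one more hint that the bivariate normal assumption is doubtful; it does not replace looking at the scatterplot.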


After applying quarter-root transformations to both calcium and iron, we obtain a scatter of points that appears to be more elliptical in shape. Moreover, it appears that the relationship between the transformed variables is approximately linear. The point in the lower left-hand corner appears to be an unusual observation or outlier. Upon closer examination, it was found that this woman reported zero daily intake of iron. Since this is very unlikely to be correct, we might justifiably remove this observation from the data set.



Note that it is not appropriate to remove an observation from the data just because it is an outlier. Consider, for example, the ozone hole in the Antarctic. For years, NASA had been flying polar-orbiting satellites designed to measure ozone in the upper atmosphere without detecting an ozone hole. Then, one day, a scientist visiting the Antarctic pointed an instrument straight up into the sky and found evidence of an ozone hole. What happened? It turned out that the software used to process the NASA satellite data had a routine for automatically removing outliers. In this case, all observations with unusually low ozone levels were automatically removed by this routine. A close review of the raw, unprocessed data confirmed that there was an ozone hole.

The above is a special case, where the outliers themselves are the most interesting observations. In general, outliers are removed only if there is a compelling reason to believe that something is wrong with the individual observations; e.g., if the observation is deemed to be impossible, as in the case of zero daily intake of iron. This underscores the need to have good field or lab notes with details on the data collection process. Lab notes may indicate that something went wrong with an individual observation; e.g., a laboratory sample may have been dropped on the floor, leading to contamination. If such a sample results in an outlier, then that sample may legitimately be removed from the data.
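In code, the safer pattern is to flag implausible values for review rather than delete outliers automatically. The following minimal sketch uses made-up records and a hypothetical helper, split_implausible; the rule "intake must be strictly positive" is an assumption chosen to match the zero-iron example.

```python
# Made-up records for illustration; not taken from the survey data.
records = [
    {"id": 1, "calcium": 620.0, "iron": 11.2},
    {"id": 2, "calcium": 480.0, "iron": 0.0},  # implausible: zero daily iron
    {"id": 3, "calcium": 910.0, "iron": 14.8},
]


def split_implausible(rows, field):
    """Separate rows whose `field` is not strictly positive from the rest."""
    flagged = [r for r in rows if r[field] <= 0]
    kept = [r for r in rows if r[field] > 0]
    return kept, flagged


kept, flagged = split_implausible(records, "iron")
print("kept ids:   ", [r["id"] for r in kept])
print("flagged ids:", [r["id"] for r in flagged])
```

Keeping the flagged rows in a separate list, instead of discarding them, leaves a record that can be checked against field or lab notes before any observation is dropped.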

Outliers often have greater influence on the results of data analyses than the remaining observations. For example, outliers have a strong influence on the calculation of the sample mean. If outliers are detected, and there is no corroborating evidence to suggest that they should be removed, then resistant statistical techniques should be applied. Here, by resistant techniques we mean techniques or processes that are not easily influenced by outliers. For example, the sample median is not sensitive to outliers, and so may be calculated in place of the sample mean if we believe that the sample mean may give a misleading picture. Outlier-resistant methods go well beyond the scope of this course. If outliers are detected, then you should consult with a statistician.
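The difference between the mean and the median can be seen in a small example. The numbers below are made up for illustration: six ordinary values, then the same values with one wild observation added.

```python
from statistics import mean, median

# Made-up values for illustration only.
clean = [10, 12, 11, 13, 12, 11]
with_outlier = clean + [500]  # one wild observation added

print("clean:        mean =", mean(clean), " median =", median(clean))
print("with outlier: mean =", round(mean(with_outlier), 2),
      " median =", median(with_outlier))
# clean:        mean = 11.5   median = 11.5
# with outlier: mean = 81.29  median = 12
```

A single extreme value moves the mean from 11.5 to about 81.29, while the median moves only from 11.5 to 12.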

Trivariate Cases:

Using Rotating Scatter Plots we can:

Rotating scatter plot generated using SAS.

By rotating a 3-dimensional scatterplot, the illusion of three dimensions can be achieved. Here, we are looking to see if the cloud of points is approximately elliptical in shape.

Creating a 3D Scatter plot in Minitab for L_calc, L_iron and L_prot.

  1. Select Graph > 3D Scatter Plot
  2. The default is already Simple, so click OK.
  3. In Z, enter L_iron. In Y, enter L_prot. In X, enter L_calc.
  4. Click OK.

Note: The plot (shown below) can be rotated using the 3D Graph tools that appear with the plot. If it does not appear, choose Tools > Toolbars and check 3D Graph Tools.


Multivariate Cases:

Using Matrix of Scatter Plots we can:

Here, we have a matrix of scatterplots for quarter-root transformed data on all variables. Note that each variable appears to be positively related to the remaining variables. However, the strength of that relationship depends on which pair of variables is considered. For example, quarter-root iron is strongly related to quarter-root protein, but the relationship between calcium and vitamin C is not very strong.
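The impression given by a matrix of scatterplots can be summarized numerically with a matrix of pairwise Pearson correlations. The sketch below assumes simulated values in place of the quarter-root transformed survey variables; only the variable names are borrowed from this lesson, and the helper pearson is written out by hand to keep the example self-contained.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Simulated data: each variable is a shared component plus its own noise,
# so every pair is positively related. NOT the survey measurements.
rng = random.Random(1)
n = 200
shared = [rng.gauss(0, 1) for _ in range(n)]
data = {
    "S_S_calc": [s + rng.gauss(0, 1.0) for s in shared],
    "S_S_iron": [s + rng.gauss(0, 0.5) for s in shared],
    "S_S_prot": [s + rng.gauss(0, 0.5) for s in shared],
}

names = list(data)
corr = {(a, b): pearson(data[a], data[b]) for a in names for b in names}

# Print the correlation matrix, one row per variable.
print(" " * 10 + " ".join(f"{b:>9}" for b in names))
for a in names:
    print(f"{a:>9} " + " ".join(f"{corr[(a, b)]:9.2f}" for b in names))
```

Each off-diagonal entry corresponds to one panel of the scatterplot matrix. The correlation measures only linear association, so the plots remain necessary for spotting curvature and outliers.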

Matrix of scatterplots generated using SAS.


Creating a matrix of scatterplots for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC in Minitab.

  1. Select Graph > Matrix Plot
  2. The default is already Simple, so click OK.
  3. Under Graph variables, enter S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC.
  4. Click OK.

A matrix plot for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC



3.2 - Summary

In this lesson we learned about:

Next, complete the homework problems that will give you a chance to put what you have learned to use...