Lesson 11: Principal Components Analysis (PCA)
Sometimes data are collected on a large number of variables from a single population. As an example, consider the Places Rated dataset described below.
Example: Places Rated
In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following nine criteria:
- Climate and Terrain
- Housing
- Health Care & the Environment
- Crime
- Transportation
- Education
- The Arts
- Recreation
- Economics
Note that within the dataset, except for housing and crime, the higher the score the better. For housing and crime, the lower the score the better. While some communities might rate better in the arts, others might rate better in other areas, such as having a lower crime rate and good educational opportunities.
With a large number of variables, the dispersion (variance-covariance) matrix may be too large to study and interpret properly. There would be too many pairwise correlations between the variables to consider. Graphical displays may also not be particularly helpful when the data set is very large. With 12 variables, for example, there are 220 possible three-dimensional scatterplots, one for each choice of three variables.
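The counts behind this argument are simple binomial coefficients: p variables give "p choose 2" pairwise correlations and "p choose 3" three-dimensional scatterplots. The short Python sketch below tabulates both for the nine Places Rated criteria and for the 12-variable example; the function name `display_counts` is a hypothetical helper introduced only for illustration.

```python
from math import comb


def display_counts(p):
    """Return (pairwise correlations, 3-D scatterplots) for p variables.

    Hypothetical helper: each count is the number of ways to choose
    2 (or 3) distinct variables out of p.
    """
    return comb(p, 2), comb(p, 3)


for p in (9, 12):
    pairs, triples = display_counts(p)
    print(f"{p} variables: {pairs} pairwise correlations, "
          f"{triples} three-dimensional scatterplots")
# 9 variables: 36 pairwise correlations, 84 three-dimensional scatterplots
# 12 variables: 66 pairwise correlations, 220 three-dimensional scatterplots
```

Both counts grow quickly with the number of variables, which is why inspecting every correlation or plot soon becomes impractical.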
To interpret the data in a more meaningful form, it is necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component.
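To make "linear combination" concrete, the sketch below writes out the general form such combinations take. The notation is assumed here rather than taken from this introduction: X_1, ..., X_p stand for the p observed variables, Y_i for the i-th linear combination, and e_{ij} for the coefficient of variable j in combination i.

```latex
\begin{align*}
Y_1 &= e_{11}X_1 + e_{12}X_2 + \dots + e_{1p}X_p \\
Y_2 &= e_{21}X_1 + e_{22}X_2 + \dots + e_{2p}X_p \\
    &\;\;\vdots \\
Y_p &= e_{p1}X_1 + e_{p2}X_2 + \dots + e_{pp}X_p
\end{align*}
```

Under the usual formulation, the coefficients are chosen so that Y_1 has the largest possible variance and each later combination has the largest remaining variance while being uncorrelated with the earlier ones. Data reduction then amounts to keeping only the first few Y_i.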
(There is another very useful data reduction technique called Factor Analysis discussed in a subsequent lesson.)
Learning objectives & outcomes
Upon completion of this lesson, you should be able to do the following:
- Carry out a principal components analysis using SAS and Minitab;
- Assess how many principal components are needed;
- Interpret principal component scores and describe a subject with a high or low score;
- Determine when a principal component analysis should be based on the variance-covariance matrix or the correlation matrix;
- Use principal component scores in further analyses.