Lesson 11: Principal Components Analysis (PCA)


Introduction

Sometimes data are collected on a large number of variables from a single population. As an example, consider the Places Rated dataset below.

Example: Places Rated

In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following nine criteria:

  1. Climate and Terrain
  2. Housing
  3. Health Care & the Environment
  4. Crime
  5. Transportation
  6. Education
  7. The Arts
  8. Recreation
  9. Economics

Note that within the dataset, a higher score is better for every criterion except housing and crime; for those two, a lower score is better. While some communities might rate better in the arts, others might rate better in other areas, such as having a lower crime rate and good educational opportunities.

Objective

With a large number of variables, the dispersion matrix may be too large to study and interpret properly. There would be too many pairwise correlations between the variables to consider. Graphical displays of the data may also not be of much help when the data set is very large. With 12 variables, for example, there are 66 pairwise correlations and 220 possible three-dimensional scatterplots to study!
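The counts above follow from simple combinatorics: with p variables there are "p choose 2" distinct pairs and "p choose 3" distinct triples. The short sketch below checks this arithmetic for the nine Places Rated criteria and for the 12-variable example; the function name count_views is a hypothetical helper introduced only for illustration.

```python
from math import comb

def count_views(p):
    """Return (number of variable pairs, number of variable triples) for p variables.

    Pairs correspond to pairwise correlations; triples correspond to
    three-dimensional scatterplots. count_views is an illustrative name.
    """
    return comb(p, 2), comb(p, 3)

for p in (9, 12):
    pairs, triples = count_views(p)
    print(f"{p} variables: {pairs} pairwise correlations, "
          f"{triples} three-dimensional scatterplots")
```

Even the nine-variable Places Rated data already involve 36 correlations and 84 possible three-dimensional plots, which is why a reduction to a few components is attractive.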

To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component.
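To make the idea of "linear combinations of the data" concrete, here is a minimal sketch of one common way to compute principal components: an eigendecomposition of the sample correlation (or variance-covariance) matrix using NumPy. It is not the SAS or Minitab procedure used in this lesson. The function name principal_components, the use_correlation flag, and the randomly generated data are all assumptions made for illustration; the data are synthetic and are not the actual Places Rated values.

```python
import numpy as np

def principal_components(X, use_correlation=True):
    """Sketch of PCA via eigendecomposition (illustrative, not the lesson's software).

    X is an (n observations) x (p variables) array. Returns the eigenvalues
    (component variances, largest first), the eigenvectors (coefficients of
    each linear combination, one component per column), and the scores.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    if use_correlation:
        Xc = Xc / X.std(axis=0, ddof=1)      # standardize: analysis on correlation matrix
    S = np.cov(Xc, rowvar=False)             # p x p covariance (or correlation) matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # each column is one linear combination
    return eigvals, eigvecs, scores

# Synthetic stand-in data: 329 rows and 9 columns, matching only the
# dimensions of the Places Rated example, not its values.
rng = np.random.default_rng(0)
X = rng.normal(size=(329, 9)) @ rng.normal(size=(9, 9))

eigvals, eigvecs, scores = principal_components(X)
explained = eigvals / eigvals.sum()          # proportion of total variance per component
print(np.round(explained, 3))
```

Each column of eigvecs holds the coefficients of one linear combination, and the corresponding eigenvalue is the variance of that component's scores. Setting use_correlation=False would base the analysis on the variance-covariance matrix instead.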

(There is another very useful data reduction technique called Factor Analysis, which will be taken up in a subsequent lesson.)

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

  • Carry out a principal components analysis using SAS and Minitab;
  • Assess how many principal components should be considered in an analysis;
  • Interpret principal component scores. Be able to describe a subject with a high or low score;
  • Determine when a principal component analysis may be based on the variance-covariance matrix, and when the correlation matrix should be used;
  • Understand how principal component scores may be used in further analyses.