Lesson 14: Cluster Analysis

Introduction

Cluster analysis is a data exploration (mining) tool for dividing a multivariate dataset into “natural” clusters (groups). We use these methods to explore whether previously undefined clusters may exist in the dataset. For instance, a marketing department may wish to use survey results to sort its customers into categories (perhaps those most likely to be receptive to buying a product, those most likely to be against buying it, and so forth).

Cluster analysis is used when we believe that the sample units come from an unknown number of distinct populations or sub-populations, and there is no a priori definition of those populations. Our objective is to describe those populations using the observed data.

Until relatively recently, cluster analysis attracted very little interest. This has changed because of growing interest in bioinformatics and genome research. To explore cluster analysis in this lesson, we will use an ecological example.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

  • Carry out cluster analysis using SAS or Minitab;
  • Use a dendrogram to partition the data into clusters of known composition;
  • Carry out post hoc analyses to describe differences among clusters.
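
To make these objectives concrete before turning to SAS or Minitab, here is a minimal sketch in Python, used purely for illustration and not the software this lesson relies on. It assumes complete linkage with Euclidean distance, which is only one of several possible choices, and the data and the function names (`complete_linkage`, `agglomerate`) are hypothetical. The sketch merges the two closest clusters repeatedly, records the merge heights that a dendrogram would display, stops at a chosen number of clusters, and then summarizes each cluster by its variable means as a simple post hoc description.

```python
from itertools import combinations
from math import dist
from statistics import mean


def complete_linkage(a, b, points):
    """Distance between clusters a and b: the largest pairwise distance."""
    return max(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the final clusters (lists of row indices) and the merge
    heights, which are the heights a dendrogram would plot.
    """
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: complete_linkage(pair[0], pair[1], points))
        heights.append(complete_linkage(a, b, points))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters), heights


# Made-up data (an assumption for illustration): six sample units
# measured on two variables.
data = [(1.0, 1.2), (1.1, 0.9), (0.8, 1.0),
        (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]

clusters, heights = agglomerate(data, n_clusters=2)

# Post hoc description: the mean of each variable within each cluster.
centroids = [tuple(mean(data[i][k] for i in c) for k in range(2))
             for c in clusters]

print(clusters)
print(heights)
print(centroids)
```

Stopping at two clusters corresponds to cutting the dendrogram at a height between the last merge performed and the next one that would have occurred. In practice the number of clusters is not known in advance, so the merge heights are inspected to choose a sensible cut.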