12.4 - Agglomerative Method Example

Example: Woodyard Hammock Data

SAS uses only the Euclidean distance metric, while Minitab can also use Manhattan, Pearson, Squared Euclidean, and Squared Pearson distances. Both SAS and Minitab use only agglomerative clustering.
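The distance options named above can be sketched in a few lines of Python (Python is used here only for illustration; the lesson itself relies on SAS and Minitab). The two sites and the variances are made-up values, not Woodyard Hammock data. The Pearson form shown, squared differences scaled by each variable's variance, is our reading of Minitab's definition and should be treated as an assumption.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Sum of squared differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_pearson(x, y, variances):
    """Squared differences, each divided by that variable's variance (assumed form)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))

def pearson(x, y, variances):
    """Square root of the squared Pearson distance (assumed form)."""
    return math.sqrt(squared_pearson(x, y, variances))

# Hypothetical abundance vectors for two sites (three species each).
site_a = [3.0, 0.0, 4.0]
site_b = [0.0, 0.0, 0.0]
# Hypothetical per-variable variances across all sites.
variances = [9.0, 1.0, 16.0]

print(euclidean(site_a, site_b))          # 5.0
print(squared_euclidean(site_a, site_b))  # 25.0
print(manhattan(site_a, site_b))          # 7.0
print(pearson(site_a, site_b, variances))
```

Scaling by the variances keeps a variable with a large spread from dominating the distance, which is the usual reason to prefer a standardized distance over the plain Euclidean one.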

Cluster analysis is carried out in SAS using the cluster procedure. We will look at how this is done in the SAS program wood1.sas.

[Minitab demonstration: Performing a Cluster Analysis using Minitab]

Dendrograms (Tree Diagrams)

The results of cluster analysis are best summarized using a dendrogram. In a dendrogram, distance is plotted on one axis, while the sample units are given on the remaining axis. The tree shows how sample units are combined into clusters, the height of each branching point corresponding to the distance at which two clusters are joined.

In the cluster history section of the SAS (or Minitab) output, we see that the Euclidean distance between sites 33 and 51 was smaller than that between any other pair of sites (clusters). Therefore, this pair of sites was clustered first in the tree diagram. Following the clustering of these two sites, there are a total of n - 1 = 71 clusters, and so the cluster formed by sites 33 and 51 is designated "CL71". Note that the numerical values of the distances in SAS and in Minitab are different, because SAS shows a 'normalized' distance. However, we are not interested in the absolute values of the distances; their relative ranking is what we use for cluster formation.
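To see why a normalized distance does not change which pair is joined first, here is a minimal Python sketch. It assumes the normalization amounts to dividing every distance by a positive constant (here the mean pairwise distance); the exact constant SAS uses is not stated in this lesson, and the four points are made up for illustration.

```python
import math
from itertools import combinations

# Hypothetical two-variable observations for four sites.
points = {"A": (1.0, 2.0), "B": (2.0, 2.5), "C": (6.0, 7.0), "D": (7.0, 9.0)}

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Raw Euclidean distance for every pair of sites.
raw = {pair: euclidean(points[pair[0]], points[pair[1]])
       for pair in combinations(points, 2)}

# Assumed normalization: rescale so that the mean pairwise distance is 1.
mean_distance = sum(raw.values()) / len(raw)
normalized = {pair: d / mean_distance for pair, d in raw.items()}

# Pairs ordered from closest to farthest under each scale.
rank_raw = sorted(raw, key=raw.get)
rank_normalized = sorted(normalized, key=normalized.get)

print(rank_raw[0])                  # closest pair on the raw scale
print(rank_raw == rank_normalized)  # True: the ordering is unchanged
```

Dividing by a positive constant changes the values shown in the cluster history but not their order, so the sequence of merges is the same on either scale.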

The Euclidean distance between sites 15 and 23 was smaller than that between any other pair of the 70 heretofore unclustered sites, and smaller than the distance between any of those sites and CL71. Therefore, this pair of sites was clustered second. Its designation is "CL70".

In the seventh step of the algorithm, the distance between site 8 and cluster CL67 was smaller than the distance between any pair of heretofore unclustered sites and the distances between those sites and the existing clusters. Therefore, site 8 was joined to CL67 to form the cluster of 3 sites designated as CL65.

The clustering algorithm is completed when clusters CL2 and CL5 are joined.
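The cluster history described above can be imitated with a short Python sketch. This is not the SAS or Minitab implementation; it is a brute-force illustration that names each new cluster by the number of clusters remaining after the merge, following the "CL71", "CL70" convention in the text. The five sites and their coordinates are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_history(points, linkage=max):
    """Agglomerate until one cluster remains.

    points  : dict mapping a site label to its coordinates
    linkage : max for complete linkage, min for single linkage
    Returns a list of (new_name, joined_a, joined_b, distance).
    """
    clusters = {label: [label] for label in points}  # name -> member sites
    history = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        # Find the pair of current clusters with the smallest linkage distance.
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                d = linkage(euclidean(points[p], points[q])
                            for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # After this merge, len(clusters) - 1 clusters remain.
        new_name = f"CL{len(clusters) - 1}"
        clusters[new_name] = clusters.pop(a) + clusters.pop(b)
        history.append((new_name, a, b, d))
    return history

# Hypothetical sites, not the Woodyard Hammock data.
sites = {"1": (0.0, 0.0), "2": (0.0, 1.0), "3": (5.0, 0.0),
         "4": (5.0, 1.5), "5": (10.0, 10.0)}

history = cluster_history(sites, linkage=max)
for name, a, b, d in history:
    print(f"{name}: joined {a} and {b} at distance {d:.3f}")
```

With n sites the first merge is named CL(n-1) and the last is CL1, which is why the 72-site example in the text starts at CL71.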

The plot below is generated by Minitab. In SAS the diagram is horizontal. The color scheme depends on how many clusters we want (discussed later).

[Figure: dendrogram example]

What do you do with the information in this tree diagram?

A decision regarding the optimum number of clusters must be made at some point. We also need to decide which clustering technique to use. Therefore, we have adapted the wood1.sas program to specify the other clustering techniques; the program variants are listed below. In Minitab, you may also select options other than single linkage from the appropriate box.

wood1.sas specifies complete linkage
wood2.sas is identical, except that it uses average linkage
wood3.sas uses the centroid method
wood4.sas uses single linkage

As we run each of these programs, we must keep in mind that what we are really after is a good description of the data.

Applying the Cluster Analysis Process

First, we want to compare the results of the different clustering algorithms. Note that clusters containing one or only a few members are undesirable, as they give rise to a large number of clusters, defeating the purpose of the whole analysis. That is not to say that we can never have a cluster with a single member! If that happens, however, we need to investigate the reason. It may indicate that the single-item cluster is completely different from the other members of the sample and is best left alone.

To arrive at the optimum number of clusters, we may follow this simple guideline: for each method, select the number of clusters that the method identifies. This is accomplished by finding a break point (distance) below which further branching is ignored. In practice, this is not necessarily straightforward. You will need to try a number of different cut points to see which is most decisive, and the dendrogram helps to determine the break point. Here are the results of this type of partitioning using the different clustering methods on the Woodyard Hammock data.

Complete Linkage: Partitioning into 6 clusters yields clusters of sizes 3, 5, 5, 16, 17, and 26.
Average Linkage: Partitioning into 5 clusters would yield 3 clusters containing only a single site each.
Centroid Linkage: Partitioning into 6 clusters would yield 5 clusters containing only a single site each.
Single Linkage: Partitioning into 7 clusters would yield 6 clusters containing only 1-2 sites each.

For this example complete linkage yields the most satisfactory result.
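The comparison of cluster sizes across methods can be reproduced in miniature. The following Python sketch is an illustration under stated assumptions, not the SAS or Minitab procedure: it merges clusters until k remain and reports the sorted cluster sizes for single, complete, and average linkage. The seven one-variable observations are invented so that the chaining behavior of single linkage is easy to see; they are not the Woodyard Hammock data.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average(values):
    values = list(values)
    return sum(values) / len(values)

def cluster_sizes(points, k, linkage):
    """Merge the closest pair of clusters until k clusters remain.

    linkage: min (single), max (complete), or average (average linkage).
    Returns the cluster sizes in increasing order.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(euclidean(p, q)
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(len(c) for c in clusters)

# Hypothetical data: a chain with slowly widening gaps plus one distant point.
data = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,), (9.5,)]

sizes_single = cluster_sizes(data, 2, min)
sizes_complete = cluster_sizes(data, 2, max)
sizes_average = cluster_sizes(data, 2, average)

print("single:  ", sizes_single)
print("complete:", sizes_complete)
print("average: ", sizes_average)
```

On this toy data, single and average linkage leave the distant point as a cluster of its own, while complete linkage splits the observations into two groups of comparable size. That is the same kind of contrast the partitioning results above show for the real data.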

For your convenience, the following screenshots demonstrate how alternative clustering procedures may be carried out in Minitab.

[Screenshots: Minitab output]