14.5 - Agglomerative Method Example
Example: Woodyard Hammock Data
SAS uses only the Euclidean distance metric, while Minitab can also use Manhattan, Pearson, squared Euclidean, and squared Pearson distances. Both SAS and Minitab use only agglomerative clustering.
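To make the distance options named above concrete, here is a minimal Python sketch. It is illustrative only and is not part of the SAS or Minitab workflow. The two site vectors and the per-variable variances are made-up values, and the Pearson distance is written under the assumption that it is the Euclidean distance with each squared difference divided by that variable's sample variance.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Sum of squared differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_pearson(x, y, variances):
    """Squared differences, each scaled by the variance of its variable.
    Assumption: this is how the Pearson distance is standardized."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))

def pearson(x, y, variances):
    """Square root of the squared Pearson distance."""
    return math.sqrt(squared_pearson(x, y, variances))

# Hypothetical abundances of three species at two sites.
site_a = [3.0, 0.0, 5.0]
site_b = [0.0, 4.0, 5.0]
# Hypothetical sample variances of the three species across all sites.
variances = [9.0, 4.0, 1.0]

print("Euclidean:        ", euclidean(site_a, site_b))
print("Squared Euclidean:", squared_euclidean(site_a, site_b))
print("Manhattan:        ", manhattan(site_a, site_b))
print("Pearson:          ", pearson(site_a, site_b, variances))
print("Squared Pearson:  ", squared_pearson(site_a, site_b, variances))
```

The variance scaling keeps a variable with a large spread from dominating the distance, which is the usual reason for preferring a standardized distance over the raw Euclidean one.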
In SAS, cluster analysis is carried out with the cluster procedure (PROC CLUSTER). We will look at how this is done in the SAS program wood1.sas below.
Dendrograms (Tree Diagrams)
The results of cluster analysis are best summarized using a dendrogram. In a dendrogram, distance is plotted on one axis, while the sample units are given on the remaining axis. The tree shows how sample units are combined into clusters, the height of each branching point corresponding to the distance at which two clusters are joined.
In the cluster history section of the SAS (or Minitab) output, we see that the Euclidean distance between sites 33 and 51 was smaller than the distance between any other pair of sites (clusters). Therefore, this pair of sites was clustered first in the tree diagram. After these two sites are clustered, there are a total of n - 1 = 71 clusters, and so the cluster formed by sites 33 and 51 is designated "CL71". Note that the numerical values of the distances differ between SAS and Minitab because SAS shows a 'normalized' distance. However, we are not interested in the absolute values of the distances; their relative ranking is what determines cluster formation.
The Euclidean distance between sites 15 and 23 was smaller than the distance between any other pair of the 70 still-unclustered sites, and smaller than the distance between any of those sites and CL71. Therefore, this pair of sites was clustered second. It is designated "CL70".
In the seventh step of the algorithm, the distance between site 8 and cluster CL67 was smaller than the distance between any pair of heretofore unclustered sites and the distances between those sites and the existing clusters. Therefore, site 8 was joined to CL67 to form the cluster of 3 sites designated as CL65.
The clustering algorithm is completed when clusters CL2 and CL5 are joined.
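The cluster history described above can be imitated on a small scale. The Python sketch below is not the SAS or Minitab implementation; it is a minimal illustration that assumes complete linkage, uses five made-up sites in two dimensions, and names each new cluster "CL" followed by the number of clusters remaining after the merge. The function name `agglomerate` and the data are hypothetical.

```python
import math
from itertools import combinations

def agglomerate(points, linkage=max):
    """Agglomerative clustering on a dict {label: coordinates}.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    Returns the cluster history as (new_name, left, right, distance) tuples.
    """
    # Every site starts out as its own cluster.
    clusters = {str(label): [coords] for label, coords in points.items()}

    def cluster_distance(a, b):
        # Aggregate all pairwise Euclidean distances between the two clusters.
        return linkage(math.dist(p, q) for p in clusters[a] for q in clusters[b])

    history = []
    while len(clusters) > 1:
        # Join the pair of current clusters that is closest together.
        left, right = min(combinations(clusters, 2),
                          key=lambda pair: cluster_distance(*pair))
        distance = cluster_distance(left, right)
        members = clusters.pop(left) + clusters.pop(right)
        # Name the new cluster by the number of clusters left after the merge.
        name = f"CL{len(clusters) + 1}"
        clusters[name] = members
        history.append((name, left, right, distance))
    return history

# Hypothetical sites (labels and coordinates are made up for illustration).
sites = {1: (0.0, 0.0), 2: (0.0, 1.0), 3: (5.0, 0.0),
         4: (5.0, 1.5), 5: (10.0, 10.0)}

history = agglomerate(sites)
for name, left, right, distance in history:
    print(f"{name}: join {left} and {right} at distance {distance:.3f}")
```

With n = 5 sites, the first merge leaves n - 1 = 4 clusters and is therefore named CL4, in the same way that the first merge of the 72 Woodyard Hammock sites is named CL71. The final merge always produces CL1, the single cluster containing every site.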
The plot below is generated by Minitab. In SAS the diagram is horizontal. The color scheme depends on how many clusters we want (discussed later).
What do you do with the information in this tree diagram?
At some point we must decide on the optimum number of clusters. We also need to decide which clustering technique to use. Therefore, we have adapted the wood1.sas program to use the other clustering techniques; the adapted programs are listed below. In Minitab, you may likewise select an option other than single linkage from the appropriate box.
wood1.sas | specifies complete linkage
wood2.sas | is identical, except that it uses average linkage
wood3.sas | uses the centroid method
wood4.sas | uses single linkage
As we run each of these programs, we must keep in mind that what we are really after is a good description of the data.
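The four techniques differ only in how the distance between two clusters is measured. The Python sketch below writes out the textbook definitions using unsquared Euclidean distance on two made-up clusters. It is an illustration under those assumptions, not a reproduction of the wood1.sas to wood4.sas programs. Statistical packages can differ in details, such as whether distances are squared for some methods, so these values should not be expected to match SAS or Minitab output.

```python
import math
from statistics import fmean

def pairwise(a, b):
    """All Euclidean distances between a member of a and a member of b."""
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    """Distance between the two closest members."""
    return min(pairwise(a, b))

def complete_linkage(a, b):
    """Distance between the two farthest members."""
    return max(pairwise(a, b))

def average_linkage(a, b):
    """Mean of all between-cluster pairwise distances."""
    return fmean(pairwise(a, b))

def centroid(cluster):
    """Coordinate-wise mean of the cluster members."""
    return [fmean(column) for column in zip(*cluster)]

def centroid_method(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

# Two hypothetical clusters of two sites each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

for method in (single_linkage, complete_linkage, average_linkage, centroid_method):
    print(f"{method.__name__}: {method(cluster_a, cluster_b):.3f}")
```

Single linkage can never exceed average linkage, which in turn can never exceed complete linkage, because they are the minimum, mean, and maximum of the same set of pairwise distances. This is one reason the methods can produce noticeably different trees from the same data.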
Applying the Cluster Analysis Process
First, we want to compare the results of the different clustering algorithms. Note that clusters containing one or only a few members are undesirable, because they give rise to a large number of clusters, defeating the purpose of the whole analysis. That is not to say that we can never have a cluster with a single member. If that happens, we need to investigate the reason. It may indicate that the single-item cluster is completely different from the other members of the sample and is best left alone.
To arrive at the optimum number of clusters, we may follow this simple guideline. For each method, determine the number of clusters that it identifies. This is accomplished by finding a break point (distance) below which further branching is ignored. In practice this is not necessarily straightforward. You will need to try a number of different cut points to see which is most decisive; the dendrogram helps to determine the break point. Here are the results of this type of partitioning using the different clustering methods on the Woodyard Hammock data.
For this example complete linkage yields the most satisfactory result.
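The break-point guideline described above can be sketched in a few lines of Python. The cluster history below is hypothetical (five sites, with made-up joining distances), and the function `cut_tree` is an illustrative helper, not a SAS or Minitab routine. It assumes the joining distances never decrease from one step to the next, which holds for single, complete, and average linkage but is not guaranteed for the centroid method.

```python
def cut_tree(leaves, history, break_point):
    """Return the clusters obtained by ignoring all branching below break_point.

    history is a list of (new_name, left, right, distance) tuples, ordered
    by step, with non-decreasing distances (an assumption of this sketch).
    """
    clusters = {leaf: [leaf] for leaf in leaves}
    for name, left, right, distance in history:
        if distance > break_point:
            break  # every later merge happens above the break point
        clusters[name] = clusters.pop(left) + clusters.pop(right)
    # Largest clusters first, so small clusters are easy to spot at the end.
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical site labels and cluster history (not the Woodyard Hammock data).
leaves = ["1", "2", "3", "4", "5"]
history = [
    ("CL4", "1", "2", 1.0),
    ("CL3", "3", "4", 1.5),
    ("CL2", "CL4", "CL3", 5.2),
    ("CL1", "5", "CL2", 14.1),
]

# Try several cut points and compare the resulting cluster sizes.
for break_point in (0.5, 2.0, 6.0, 20.0):
    sizes = [len(cluster) for cluster in cut_tree(leaves, history, break_point)]
    print(f"break point {break_point}: {len(sizes)} clusters, sizes {sizes}")
```

Listing the cluster sizes for several candidate break points makes it easy to see where single-member clusters appear and where a small change in the break point changes the partition a great deal.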
For your convenience the following screenshots are provided to demonstrate how alternative clustering procedures may be done in Minitab.