Cluster Analysis

Description: Cluster analysis defines groups of cases based on the similarity of multiple variables measured for each case. The algorithm picks the groups; they are not defined in advance as they are for discriminant function analysis (DFA). Similarity is often measured in terms of "distance," which has a very general meaning in cluster analysis. For example, the number of species that 2 stream sites have in common could be the basis for a measure of "distance" between the sites: the more species they share, the smaller the distance.

Simple example: Suppose that within a region you have measurements of stream channel characteristics that are probably related to each other, such as {width, depth, sinuosity, slope}. For every possible pair of stream sites you compute a measure of distance; for example, you square the difference between the 2 sites for each measurement and then sum the squared differences. Cluster analysis returns a tree diagram, or dendrogram, that shows which sites were grouped (or split) first, which next, and so on until the number of clusters you initially specified is obtained.

It's up to you to interpret the output. You must evaluate the groups and determine what it is that they have in common. For this example, it might be ecoregion, elevation or valley shape.
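The distance step in this example can be sketched in a few lines of Python. This is only an illustration: the site names and measurement values are made up, and squared_distance is a hypothetical helper, not part of any particular statistics package.

```python
from itertools import combinations

# Hypothetical measurements per site: (width m, depth m, sinuosity, slope %)
sites = {
    "A": (4.0, 0.5, 1.2, 2.0),
    "B": (4.5, 0.6, 1.3, 1.8),
    "C": (12.0, 1.5, 1.8, 0.4),
}

def squared_distance(x, y):
    """Sum of squared differences across all measurements."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# One distance for every possible pair of sites.
pairwise = {
    (s, t): squared_distance(sites[s], sites[t])
    for s, t in combinations(sorted(sites), 2)
}

for pair, d in pairwise.items():
    print(pair, round(d, 2))
```

With these made-up values, sites A and B are much closer to each other than either is to C. Note that width dominates the sum because it has the largest numeric range; in practice the variables are often standardized before distances are computed so that no single measurement controls the result.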

MAIA example: Stribling et al. (1998) used cluster analysis to group stream sites according to the invertebrate species present at each site. They used least-impaired sites to ensure that clusters were related to natural species distributions rather than human disturbance. They found that the site clusters were best explained in terms of the ecoregion in which they were located.

Figure: Cluster analysis of Maryland stream invertebrates. Each branch represents a stream site. Sites marked "n" or "s" were in the Coastal Plain; sites marked "a", "p", or "v" were upland sites. The first mixed group and cluster 2 include sites mostly in the Coastal Plain; clusters 1 and 3 include mostly sites not in the Coastal Plain. Note that the sites do not align perfectly with the clusters. Some mixing within clusters is not unusual.

How the method works: Cluster analysis can work from the top down (divisive, or splitting) or from the bottom up (agglomerative, or joining). That is, either groups of cases are divided into smaller groups repeatedly, or individual cases are joined together repeatedly to make successively larger groups. After each step of splitting (or joining), a new distance measure is calculated for all possible pairs of the new groups. The procedure continues until the number of clusters you specified is reached.

The model operates by minimizing variability within clusters and maximizing variability between clusters.
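A bottom-up (joining) procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by any particular package: the points are made up, the function names are hypothetical, and the distance between two groups is taken to be the smallest distance between any two of their members (single linkage), which is only one of several possible rules.

```python
from itertools import combinations

def squared_distance(x, y):
    """Sum of squared differences across all variables."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def single_linkage(c1, c2, points):
    """Group-to-group distance: smallest distance between any two members."""
    return min(squared_distance(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, n_clusters):
    """Join the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    merges = []                                   # record of the joining order
    while len(clusters) > n_clusters:
        # Recompute distances for all pairs of current clusters, pick the closest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical cases described by two variables each.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.3, 4.8), (9.0, 1.0)]
clusters, merges = agglomerate(points, n_clusters=3)
print(clusters)
print(merges)
```

The merges list holds the order in which groups were joined, which is the information a dendrogram displays.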

Assumptions/limitations: Unlike most of the other methods in this section, cluster analysis does not assume a multivariate normal distribution for the variables; it is a distribution-free method.

On the other hand, cluster analysis does make a strong assumption that you have selected the appropriate distance measure for comparing cases. The outcome of any cluster analysis depends on how the distance between cases was calculated. For example, when choosing a distance measure you must decide whether 2 stream sites are more alike if they share the same species or if they both have the same species missing. Different distance measures weight these 2 situations differently.
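The shared-presence versus shared-absence question can be illustrated with two common measures for presence/absence data. The sketch below is an assumption-laden illustration: the species lists are made up, and the two helper functions are hypothetical implementations of the Jaccard distance (which ignores species missing from both sites) and the simple matching distance (which counts joint absences as agreement).

```python
def jaccard_distance(x, y):
    """1 - (species at both sites / species at either site); joint absences ignored."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 1.0 - both / either if either else 0.0

def simple_matching_distance(x, y):
    """Fraction of species on which the sites disagree; joint absences count as agreement."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

# Hypothetical presence (1) / absence (0) records for 8 species at 2 sites.
site_1 = [1, 1, 0, 0, 0, 0, 0, 0]
site_2 = [1, 0, 1, 0, 0, 0, 0, 0]

print(jaccard_distance(site_1, site_2))
print(simple_matching_distance(site_1, site_2))
```

The 2 sites share 1 species, differ on 2, and both lack the remaining 5. The Jaccard distance treats them as fairly dissimilar (about 0.67), while the simple matching distance treats them as fairly similar (0.25), because the 5 species missing from both sites are counted as agreement.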

After selecting the distance measure, you must also select the type of clustering algorithm, e.g., divisive or agglomerative. Again, different algorithms can give different results.
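A small sketch shows how the choice of algorithm can change the answer. Rather than comparing divisive with agglomerative methods, it compares two variants of the joining approach that differ only in how the distance between groups is defined: the smallest member-to-member distance (single linkage) or the largest (complete linkage). The data values and the agglomerate function are hypothetical and for illustration only.

```python
from itertools import combinations

def agglomerate(values, n_clusters, linkage):
    """Join the two closest clusters until n_clusters remain.

    linkage is min (single linkage) or max (complete linkage), applied to
    all member-to-member distances between two clusters.
    """
    clusters = [[v] for v in values]
    while len(clusters) > n_clusters:
        def gap(pair):
            c1, c2 = clusters[pair[0]], clusters[pair[1]]
            return linkage(abs(x - y) for x in c1 for y in c2)
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)

# Hypothetical values of a single variable at 5 sites, with gradually widening gaps.
positions = [0.0, 1.0, 2.1, 3.3, 4.6]

single = agglomerate(positions, 2, linkage=min)
complete = agglomerate(positions, 2, linkage=max)
print(single)
print(complete)
```

On this data, single linkage chains the first 4 sites together and leaves the last site alone, while complete linkage splits the sites into a group of 2 and a group of 3. The same distances, with a different joining rule, give different clusters.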
