Documentation for MEANSIM, Version 6.0
A set of programs for Mean Similarity Analysis 10/29/98
John Van Sickle
ph. 541-754-4314
fax 541-754-4338
email: vansickle.john@epa.gov
This distribution contains software for Mean Similarity Analysis,
a method for evaluating the strength of a classification of many objects (sites) into a relatively small number of groups. The classification
strength of a grouping is evaluated by the extent to which objects
within the same group are more similar to each other, on average, than
they are to objects in different groups.
The enclosed programs run under Windows 95/98/NT. If you need a DOS or Windows 3.1 version, please contact the author.
The analysis is based on a matrix of pairwise similarities (or
dissimilarities) for all possible pairs of objects. The mean similarity
analysis outputs consist of a small matrix of average similarities
within and between the chosen groups, and a statistical test of whether
average within-group similarities are greater than between-group
similarities.
This software is provided free of charge with the understanding
that it will not be used for any commercial purposes. It is reasonably
reliable, but has not been exhaustively tested and must be applied at
the user's own risk. Mention of trade names or commercial products does
not constitute endorsement or recommendation for use.
The software was developed partly under US Environmental
Protection Agency Contract 68-C5-0005 to Dynamac, Inc.
1 -- MEAN SIMILARITY DENDROGRAMS
This documentation closely follows the language and notation of
the article: Van Sickle, J., 1997, Using Mean Similarity Dendrograms to
Evaluate Classifications, Journal of Agricultural, Biological and
Environmental Statistics 2, 370-388. A copy of this article may be
downloaded from the same source as this software. For a paper reprint of
this article, contact the author at the above address.
The article describes a convenient dendrogram format for
displaying mean similarities, but it also describes the mean similarity
approach and provides numerous references. This software distribution
does not include software for actually drawing mean similarity dendrograms. However, users can easily draw such graphs with any
presentation graphics package, once the mean similarity matrix has been
computed using the programs discussed below.
2 -- MEAN SIMILARITY PROGRAM, AND SAMPLE INPUT AND OUTPUT FILES:
2.1 -- JACCSIM.TXT
Example input data file for the mean similarity program RNDTST6.
This is an ASCII file containing a symmetric Jaccard similarity matrix
for fish assemblages from 19 sampling sites on the Willamette River,
Oregon. The sites are grouped into four river reaches (i.e., four
classes). RNDTST6 computes the mean of all pairwise similarities within
and between reaches and tests for an overall between-reach vs. within-reach
difference in site similarities. The example is fully discussed in Van Sickle (1997).
File Format:
Row 1 contains the Number of sites (= number of rows and columns
in the matrix), followed by a Run ID (in quotes) which is printed on the
program output.
Each subsequent row is one row of the similarity matrix. Each row
corresponds to one site, and the first column gives a 'site code' (in
this example, just the river mile location of the sampling site). The
second column is a code denoting the group to which the site belongs. In
JACCSIM.TXT, there are 4 groups (river reaches) denoted REACH1 to
REACH4. Site and group codes must be 8 characters or fewer, each enclosed
in quotes, and separated by at least one blank (NOT TABS). Similarity values
are also blank-delimited. Rows should be separated by carriage returns (i.e., the matrix should still look like a matrix when viewed in a text
editor with the line wrap turned off).
Each group must have at least two sites in it. Use the identical
format for a group code in all rows of the matrix where it applies. For
example, 'Reach1 ' and ' Reach1 ' would be interpreted as two
different codes, because the group name is aligned differently within
the apostrophes in these two cases. Rows and columns of the matrix need
not be sorted by groups.
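
The file layout described above can be sketched as a small Python reader. This is only an illustration of the format as documented here, not code from RNDTST6: the function name parse_similarity_file and the four-site sample data are hypothetical, and the sketch assumes that either apostrophes or double quotes may enclose the codes.

```python
import re

def parse_similarity_file(text):
    """Parse text in the layout described above (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Row 1: number of sites, then the quoted Run ID.
    header = re.match(r"""\s*(\d+)\s+(['"])(.*?)\2""", lines[0])
    if header is None:
        raise ValueError("row 1 must hold the site count and a quoted Run ID")
    n_sites, run_id = int(header.group(1)), header.group(3)
    if len(lines) - 1 != n_sites:
        raise ValueError("expected one matrix row per site")

    # Each later row: quoted site code, quoted group code, then the values.
    row_pattern = re.compile(r"""\s*(['"])(.*?)\1\s+(['"])(.*?)\3(.*)""")
    sites, groups, matrix = [], [], []
    for ln in lines[1:]:
        if "\t" in ln:
            raise ValueError("fields must be separated by blanks, not tabs")
        m = row_pattern.match(ln)
        if m is None:
            raise ValueError(f"bad row: {ln!r}")
        values = [float(v) for v in m.group(5).split()]
        if len(values) != n_sites:
            raise ValueError(f"row for site {m.group(2)!r} has the wrong length")
        sites.append(m.group(2))
        # Group codes are kept verbatim, so 'Reach1 ' and ' Reach1 ' differ.
        groups.append(m.group(4))
        matrix.append(values)
    return run_id, sites, groups, matrix

# Made-up four-site example (not the Willamette data).
sample = """4 'demo run'
'RM10' 'REACH1' 1.0 0.8 0.3 0.2
'RM20' 'REACH1' 0.8 1.0 0.4 0.1
'RM30' 'REACH2' 0.3 0.4 1.0 0.7
'RM40' 'REACH2' 0.2 0.1 0.7 1.0
"""
run_id, sites, groups, matrix = parse_similarity_file(sample)
print(run_id, sites, groups)
```

The sketch does not check the 8-character limit or the minimum of two sites per group; those checks would be easy to add.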
2.2 -- RNDTST6.EXE
Mean similarity program. To execute, put RNDTST6.EXE and
JACCSIM.TXT in the same directory. Then apply the Windows RUN command to RNDTST6.EXE.
The program will prompt you for an input file name
(JACCSIM.TXT, in this example) and an output file name which you choose (JACCSIM.OUT, in this example). Program results are written to the screen and also to the output file.
You are also prompted for the number of randomizations for the
permutation test. Enter 0 or a positive integer. Please read the
"Permutation Test" usage notes (section 2.4.2) below!
In this example, the program displays mean Jaccard similarity
within each Reach, and mean similarity between all pairs of Reaches. It
also computes weighted and unweighted means of the mean within-group
similarities for each group, as well as a grand mean of all the between-group similarities. The permutation test confirms that the ratio of
overall mean Between similarity to overall mean Within similarity (M=Bbar/Wbar) is much smaller than would be expected by a chance
assignment of sites to reaches. This result gives strong evidence that
the river reaches are a 'significant' (Very small P-value)
classification of the sites.
Version 6 also computes the test statistic (Wbar-Bbar) and its
associated permutation test P-value. This statistic may be preferable
because it can be interpreted as the average length of the branches on
the mean similarity dendrogram. This statistic is almost, but not quite,
"equivalent" to Bbar/Wbar, in the sense that the two statistics will
have P-values that are very close to one another.
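
The quantities described in this section can be illustrated with a short Python sketch. It is not the RNDTST6 source code. It assumes that means are taken over distinct pairs of sites (diagonal entries ignored), that the weighted Wbar uses relative group size as the weight (as stated in note 4.1 below), and that Bbar is the mean over all between-group pairs. The function name and the four-site matrix are made up for illustration.

```python
from itertools import combinations

def mean_similarities(matrix, groups):
    """Return (means, wbar_weighted, wbar_unweighted, bbar).

    means maps a sorted pair of group codes to the mean similarity of all
    site pairs drawn from those two groups (same code twice = within-group).
    """
    sums, counts = {}, {}
    for i, j in combinations(range(len(groups)), 2):
        key = tuple(sorted((groups[i], groups[j])))
        sums[key] = sums.get(key, 0.0) + matrix[i][j]
        counts[key] = counts.get(key, 0) + 1
    means = {key: sums[key] / counts[key] for key in sums}

    labels = sorted(set(groups))
    n = len(groups)
    # Each group needs at least two sites, otherwise (g, g) has no pairs.
    within = {g: means[(g, g)] for g in labels}
    wbar_weighted = sum(groups.count(g) / n * within[g] for g in labels)
    wbar_unweighted = sum(within.values()) / len(labels)

    between_sum = sum(s for key, s in sums.items() if key[0] != key[1])
    between_count = sum(c for key, c in counts.items() if key[0] != key[1])
    bbar = between_sum / between_count
    return means, wbar_weighted, wbar_unweighted, bbar

# Made-up similarity matrix for four sites in two groups.
demo_matrix = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
demo_groups = ["REACH1", "REACH1", "REACH2", "REACH2"]

means, wbar, wbar_unweighted, bbar = mean_similarities(demo_matrix, demo_groups)
m_ratio = bbar / wbar          # M = Bbar/Wbar
branch_length = wbar - bbar    # (Wbar-Bbar)
print(means)
print(round(m_ratio, 4), round(branch_length, 4))
```

In this toy example both within-group means are well above the between-group mean, so M is small and (Wbar-Bbar) is large, the pattern expected for a strong classification of similarities.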
2.3 -- JACCSIM.OUT
Sample output file (ASCII) from RNDTST6, using JACCSIM.TXT as
input file.
2.4 USAGE NOTES FOR RNDTST6.EXE:
2.4.1 PROGRAM LIMITS:
Max. number of sites (objects) = 5000.
Max. number of site groups = 50.
2.4.2 NOTES ON PERMUTATION TEST -- PLEASE READ CAREFULLY!
2.4.2.1 A randomization procedure is used to select a random sample of
size NTRIALS, out of all the possible unique site permutations among the
groups. The P-value estimate is then calculated as (NLESS+1)/(NTRIALS+1),
where NLESS = number of randomized M-values that were strictly less than
the observed M. You choose a value for NTRIALS at the program's
prompting.
About 10000 trials are recommended (Van Sickle, 1997) to ensure a
reliable estimate of P in the neighborhood of .01 to .05. For more than
about 50 sites, this may take a few seconds to execute on a Pentium-class PC.
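
The randomization procedure in 2.4.2.1 can be sketched in Python as follows. This is a minimal illustration, not the RNDTST6 implementation: it assumes that each trial is a uniform random shuffle of the group codes among sites, and that Wbar is weighted by relative group size. The function names, the seed argument, and the four-site matrix are hypothetical.

```python
import random
from itertools import combinations

def m_statistic(matrix, groups):
    """M = Bbar/Wbar, with Wbar weighted by relative group size (assumed)."""
    n = len(groups)
    sizes = {g: groups.count(g) for g in set(groups)}
    within_sum = dict.fromkeys(sizes, 0.0)
    between_sum, between_count = 0.0, 0
    for i, j in combinations(range(n), 2):
        if groups[i] == groups[j]:
            within_sum[groups[i]] += matrix[i][j]
        else:
            between_sum += matrix[i][j]
            between_count += 1
    wbar = sum(
        (sizes[g] / n) * within_sum[g] / (sizes[g] * (sizes[g] - 1) / 2)
        for g in sorted(sizes)
    )
    return (between_sum / between_count) / wbar

def permutation_p_value(matrix, groups, ntrials, seed=None):
    """Return (observed M, NLESS, P) with P = (NLESS+1)/(NTRIALS+1)."""
    rng = random.Random(seed)
    observed = m_statistic(matrix, groups)
    shuffled = list(groups)
    nless = 0
    for _ in range(ntrials):
        rng.shuffle(shuffled)  # random reassignment of sites to groups
        # Strict comparison, as in the definition of NLESS above.
        if m_statistic(matrix, shuffled) < observed:
            nless += 1
    return observed, nless, (nless + 1) / (ntrials + 1)

# Made-up similarity matrix for four sites in two groups.
toy_matrix = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
toy_groups = ["REACH1", "REACH1", "REACH2", "REACH2"]
observed_m, nless, p_value = permutation_p_value(toy_matrix, toy_groups, 99, seed=1)
print(observed_m, nless, p_value)
```

With only four sites there are very few distinct groupings, so this toy run mainly shows the bookkeeping; a classification this small is exactly the case that the exact test in 2.4.2.2 is meant for.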
2.4.2.2. Exact Test for Small Classifications -- (NEW IN VERSION 6). For "small" classifications, there may not be a large number of unique possible permutations. For example, for two groups, with 5 sites
in one group and 4 in the other, there are only 9!/(5!4!) = 126 possible
permutations. In this case, choosing a large random sample of
permutations can give a poor estimate of the P-value.
For such cases, the exact P-value can instead be calculated by
systematically enumerating all of the possible permutations of sites
among groups.
After you enter your desired number of randomizations, RNDTST6
compares your request to the total number of possible unique
permutations (NPERMS) for your classification structure. If NPERMS is smaller than the requested number of randomizations, RNDTST6
prints a warning and proceeds instead with the exact test, in which all
of the NPERMS permutations are enumerated. In the exact test, the specific permutation
corresponding to observed values of M and (Wbar-Bbar) is included among the set of all
possible permutations, so that the exact P-value is calculated as (NLESS+1)/NPERMS.
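
For the two-group case, the exact test can be sketched in Python by enumerating every way of choosing the sites of the first group. This is an illustrative sketch only, not the RNDTST6 algorithm. It counts assignments with the groups treated as labelled, which reproduces the count of 126 in the 5-and-4 example above; how RNDTST6 counts unique permutations for equal-sized groups, or for more than two groups, is not stated here. The function names and the five-site matrix are hypothetical.

```python
from itertools import combinations
from math import comb

def m_two_groups(matrix, first):
    """M = Bbar/Wbar when the sites in `first` form one group, the rest the other."""
    n = len(matrix)
    first = set(first)
    sums = [0.0, 0.0, 0.0]   # within first, within second, between
    counts = [0, 0, 0]
    for i, j in combinations(range(n), 2):
        in_i, in_j = i in first, j in first
        k = 0 if (in_i and in_j) else 1 if not (in_i or in_j) else 2
        sums[k] += matrix[i][j]
        counts[k] += 1
    n1 = len(first)
    n2 = n - n1
    # Wbar weighted by relative group size (assumed); needs n1, n2 >= 2.
    wbar = (n1 / n) * (sums[0] / counts[0]) + (n2 / n) * (sums[1] / counts[1])
    return (sums[2] / counts[2]) / wbar

def exact_p_value(matrix, first):
    """Return (NLESS, NPERMS, P) with P = (NLESS+1)/NPERMS."""
    n, n1 = len(matrix), len(first)
    observed = m_two_groups(matrix, first)
    nperms = nless = 0
    for candidate in combinations(range(n), n1):
        nperms += 1
        # The observed grouping is among the candidates; it is never
        # strictly less than itself, and the +1 below accounts for it.
        if m_two_groups(matrix, candidate) < observed:
            nless += 1
    return nless, nperms, (nless + 1) / nperms

print(comb(9, 5))  # 126, the two-group count quoted in the text

# Made-up matrix: sites 0-2 resemble each other, as do sites 3-4.
hi, lo = 0.9, 0.1
five_site = [
    [1.0, hi, hi, lo, lo],
    [hi, 1.0, hi, lo, lo],
    [hi, hi, 1.0, lo, lo],
    [lo, lo, lo, 1.0, hi],
    [lo, lo, lo, hi, 1.0],
]
exact_nless, exact_nperms, exact_p = exact_p_value(five_site, (0, 1, 2))
print(exact_nless, exact_nperms, exact_p)
```

With 3 and 2 sites there are only 10 assignments, so the smallest attainable exact P-value is 1/10, which illustrates why very small classifications cannot yield very small P-values.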
You can force RNDTST6 to do an exact test by calculating the number of possible permutations (see the Mean Similarity manuscript, Van Sickle (1997), for the
formula), and then entering some larger number for the desired number of randomizations. WARNING! Doing this could tie up your computer for a
LONG time!
For more discussion, see the general permutation-testing references cited in Van Sickle (1997).
2.4.2.3 The permutation test is NOT VALID if it is used to test groups that were constructed from an optimal grouping
method, such as cluster analysis, that was applied to the same similarity matrix that is being
used in the test. However, it is still valid and very useful to informally compare the magnitude of between and within group mean similarities in such
cases, as a measure of cluster strength.
2.4.3 USING DISSIMILARITIES
The program can analyze a dissimilarity (distance) matrix equally
easily. The input file of dissimilarities is formatted exactly as for similarities, and mean between- and within-group dissimilarities are
then calculated. In this case, M is the ratio of mean Between-group dissimilarity to mean Within-group dissimilarity, so a LARGE value of M
is evidence that the groups give a "strong" classification. For a strong
classification, nearly all randomized M-values will be smaller than the observed M, and the P-value should be estimated as the number of
randomized M-values that were greater than or equal to the observed M.
For dissimilarities, the P-value printed by the program is therefore not correct, and the estimate should
instead be calculated by hand from the output counts as [(NTRIALS-NLESS)+1]/[NTRIALS+1].
For the exact test applied to dissimilarities, use NPERMS instead
as the denominator. Analogous reasoning gives the interpretation of (Wbar-Bbar) for
dissimilarities, and its P-value is calculated in a similar fashion to that for M.
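
The hand calculation for the randomized test on dissimilarities can be written as a one-line Python function. This only restates the formula [(NTRIALS-NLESS)+1]/[NTRIALS+1] given above; the function name and the example counts are made up, and the exact-test variant is not covered.

```python
def dissimilarity_p_value(nless, ntrials):
    """P estimate for dissimilarities: [(NTRIALS - NLESS) + 1] / [NTRIALS + 1]."""
    if not 0 <= nless <= ntrials:
        raise ValueError("NLESS must lie between 0 and NTRIALS")
    # NTRIALS - NLESS is the number of randomized M-values that were
    # greater than or equal to the observed M.
    return ((ntrials - nless) + 1) / (ntrials + 1)

# Hypothetical output counts: 9990 of 9999 randomized M-values fell below
# the observed M.
p_dissim = dissimilarity_p_value(9990, 9999)
print(p_dissim)
```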
3 -- CALCULATING SIMILARITIES
SIMCALC.SAS is a SAS program that calculates a site similarity
matrix from a site (columns) by species (rows) table. Code is given for Jaccard and Bray-Curtis similarity and for Euclidean distance. The
program writes the matrix to a file in the format required by RNDTST6. The SAS program also contains example code for nonmetric
multidimensional scaling and cluster analysis, based on the similarity matrix. The program is macro-free, well-documented, and should run on
any platform.
SITESPEC.TXT is a sample input file (site x species table) for the SAS program. Application of the SAS program to this file yielded the
Jaccard similarity matrix in JACCSIM.TXT.
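
For readers without SAS, the Jaccard calculation can be sketched in Python. This is not a translation of SIMCALC.SAS. It uses the standard definition of Jaccard similarity for presence/absence data (species shared by two sites divided by species found at either site), and the function names, site codes, and species lists are hypothetical.

```python
def jaccard_matrix(presence):
    """presence maps a site code to the set of species recorded there."""
    sites = list(presence)
    for site in sites:
        if not presence[site]:
            raise ValueError(f"site {site!r} has no species; Jaccard is undefined")
    matrix = []
    for a in sites:
        row = []
        for b in sites:
            shared = presence[a] & presence[b]
            either = presence[a] | presence[b]
            row.append(len(shared) / len(either))
        matrix.append(row)
    return sites, matrix

def format_row(site, group, values):
    """One matrix row in the quoted, blank-delimited layout of section 2.1."""
    return f"'{site}' '{group}' " + " ".join(f"{v:.4f}" for v in values)

# Made-up presence data for three sites.
presence = {
    "RM10": {"chub", "dace", "trout"},
    "RM20": {"dace", "trout", "carp"},
    "RM30": {"carp"},
}
jac_sites, jac = jaccard_matrix(presence)
print(format_row("RM10", "REACH1", jac[0]))
```

The number of decimal places written by format_row is an arbitrary choice for this sketch.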
4 -- USING MRPP TO CONSTRUCT MEAN SIMILARITY DENDROGRAMS (MRPPCONV.EXE).
P.W. Mielke and colleagues originally developed the permutation
test procedures for within-group mean dissimilarities (see Van Sickle (1997) for references). Their methodology is generally known as
Multiresponse Permutation Procedures (MRPP). MRPP is available as a stand-alone program, as well as a component
of the BLOSSOM and PC-ORD packages. These packages directly read site x species data and calculate some common dissimilarity measures. They also
perform additional multivariate analyses, other than MRPP. The MRPP programs report the size (number of sites) and mean
within-group dissimilarity for each group. In addition they report the overall mean of all dissimilarities (denoted "Expected Delta" in
MRPP), and the weighted overall mean (Wbar) of the within group dissimilarities (denoted "Observed Delta" in MRPP).
Bbar is the only quantity not supplied by the MRPP programs that is needed to construct mean similarity dendrograms and to compare Bbar with Wbar, as either a ratio or a
difference.
The enclosed program, MRPPCONV.EXE, computes Bbar from values entered by the user directly from MRPP output. Use the Windows Run command to execute the program. The program is interactive and uses
no input or output files.
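
One way Bbar could be recovered from the MRPP quantities listed above is by pair counting, sketched here in Python. This is not necessarily how MRPPCONV works. The sketch assumes that "Expected Delta" is the unweighted mean over all distinct pairs of sites and that every site belongs to exactly one group. Under those assumptions, the sum of all pairwise dissimilarities minus the sum of the within-group ones leaves the between-group sum. The function name and the numbers are made up.

```python
from math import comb

def bbar_from_mrpp(sizes, within_means, overall_mean):
    """Recover the mean between-group dissimilarity (Bbar) by pair counting.

    sizes        -- number of sites in each group
    within_means -- mean within-group dissimilarity of each group
    overall_mean -- mean of all pairwise dissimilarities ("Expected Delta")
    """
    n = sum(sizes)
    total_pairs = comb(n, 2)
    within_pairs = [comb(s, 2) for s in sizes]
    between_pairs = total_pairs - sum(within_pairs)
    within_sum = sum(p * w for p, w in zip(within_pairs, within_means))
    return (overall_mean * total_pairs - within_sum) / between_pairs

# Hypothetical MRPP output for two groups of two sites each.
group_sizes = [2, 2]
group_within = [0.2, 0.3]
expected_delta = 3.5 / 6     # mean of all 6 pairwise dissimilarities
bbar_est = bbar_from_mrpp(group_sizes, group_within, expected_delta)

# Weighted Wbar ("Observed Delta" with relative-group-size weights).
n_total = sum(group_sizes)
wbar_est = sum(s / n_total * w for s, w in zip(group_sizes, group_within))
print(bbar_est, wbar_est)
```

For dissimilarities, a Bbar well above Wbar, as in this toy case, corresponds to a strong classification.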
Further Notes on MRPP and MRPPCONV:
4.1 -- The statistic Wbar calculated by RNDTST6 is identical to the
"Observed Delta" of MRPP, if the relative group size is used to weight
the within-group means in MRPP. This is the default for both BLOSSOM and PC-ORD.
4.2 -- MRPP uses Wbar (relative to its expected value and variance under the null hypothesis) as its test statistic, rather than the
Bbar/Wbar and (Wbar-Bbar) statistics used by RNDTST6. However, all three statistics are nearly equivalent and will give nearly identical P-values in practice.
4.3 -- BLOSSOM offers an "exact test" version of MRPP suitable for small
numbers of sites (see 2.4.2.2 above). However, the exact version of MRPP does not print out mean within-group distances, which are needed
for input to MRPPCONV. To obtain these, rerun your MRPP analysis using the regular MRPP. But note that, like any summary statistic, values of
Wbar, Bbar, and their ratio and difference are "noisy" for small numbers
of sites.