Documentation for MEANSIM, Version 6.0

A set of programs for Mean Similarity Analysis 10/29/98

John Van Sickle 
ph. 541-754-4314
fax 541-754-4338
email: vansickle.john@epa.gov

    This distribution contains software for Mean Similarity Analysis, a method for evaluating the strength of a classification of many objects (sites) into a relatively small number of groups. The classification strength of a grouping is evaluated by the extent to which objects within the same group are more similar to each other, on average, than they are to objects in different groups. 

    The enclosed programs run under Windows 95/98/NT. If you need a DOS or Windows 3.1 version, please contact the author.

    The analysis is based on a matrix of pairwise similarities (or dissimilarities) for all possible pairs of objects. The mean similarity analysis outputs consist of a small matrix of average similarities within and between the chosen groups, and a statistical test of whether average within-group similarities are greater than between-group similarities.
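The core computation can be sketched in a few lines of Python. The similarity values and group labels below are invented for illustration; this is not the MEANSIM source code.

```python
# Sketch of the mean similarity computation: average the pairwise
# similarities separately over within-group and between-group pairs.
# (Hypothetical 4-site example; values and groups are invented.)
import itertools

sim = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
groups = ["A", "A", "B", "B"]  # group code for each site

within, between = [], []
for i, j in itertools.combinations(range(len(groups)), 2):
    (within if groups[i] == groups[j] else between).append(sim[i][j])

Wbar = sum(within) / len(within)    # overall mean within-group similarity
Bbar = sum(between) / len(between)  # overall mean between-group similarity
print(Wbar, Bbar)                   # here 0.75 and 0.25
```

A strong classification shows up as within-group means well above the between-group means.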

    This software is provided free of charge with the understanding that it will not be used for any commercial purposes. It is reasonably reliable, but has not been exhaustively tested and must be applied at the user's own risk. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
    The software was developed partly under US Environmental Protection Agency Contract 68-C5-0005 to Dynamac, Inc.

    This documentation closely follows the language and notation of the article: Van Sickle, J., 1997, Using Mean Similarity Dendrograms to Evaluate Classifications, Journal of Agricultural, Biological and Environmental Statistics 2, 370-388. A copy of this article may be downloaded from the same source as this software. For a paper reprint of this article, contact the author at the above address. 
    The article describes a convenient dendrogram format for displaying mean similarities, but it also describes the mean similarity approach and provides numerous references. This software distribution does not include software for actually drawing mean similarity dendrograms. However, users can easily draw such graphs with any presentation graphics package, once the mean similarity matrix has been computed using the programs discussed below.


    Example input data file for the mean similarity program RNDTST6. This is an ASCII file containing a symmetric Jaccard similarity matrix for fish assemblages from 19 sampling sites on the Willamette River, Oregon. The sites are grouped into four river reaches (i.e., four classes). RNDTST6 computes the mean of all pairwise similarities within and between reaches and tests for an overall between-reach vs. within-reach difference in site similarities. The example is fully discussed in Van Sickle (1997).

File Format:
    Row 1 contains the number of sites (= number of rows and columns in the matrix), followed by a Run ID (in quotes), which is printed on the program output.

    Each subsequent row is one row of the similarity matrix. Each row corresponds to one site, and the first column gives a 'site code' (in this example, just the river-mile location of the sampling site). The second column is a code denoting the group to which the site belongs. In JACCSIM.TXT, there are 4 groups (river reaches) denoted REACH1 to REACH4. Site and group codes must be 8 characters or less, each enclosed in quotes, separated by at least one blank (NOT TABS). Similarity values are also blank-delimited. Rows should be separated by carriage returns (i.e., the matrix should still look like a matrix when viewed in a text editor with line wrap turned off).
    Each group must have at least two sites in it. Use the identical format for a group code in all rows of the matrix where it applies. For example, 'Reach1 ' and ' Reach1 ' would be interpreted as two different codes, because the group name is aligned differently within the apostrophes in the two cases. Rows and columns of the matrix need not be sorted by groups.
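A hypothetical fragment that follows these rules is shown below. The real JACCSIM.TXT contains 19 sites in four reaches with different values; the site codes and similarities here are invented purely to illustrate the layout.

```
4 "Format example"
'RM185' 'REACH1' 1.00 0.62 0.31 0.25
'RM172' 'REACH1' 0.62 1.00 0.44 0.29
'RM098' 'REACH2' 0.31 0.44 1.00 0.58
'RM052' 'REACH2' 0.25 0.29 0.58 1.00
```

Row 1 gives the site count and Run ID; each data row gives a quoted site code, a quoted group code, and one blank-delimited row of the symmetric similarity matrix.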

2.2 -- RNDTST6.EXE
   Mean similarity program. To execute, put RNDTST6.EXE and JACCSIM.TXT in the same directory. Then apply the Windows RUN command to RNDTST6.EXE. 
    The program will prompt you for an input file name (JACCSIM.TXT, in this example) and an output file name of your choice (JACCSIM.OUT, in this example). Program results are written to the screen and also to the output file.
    You are also prompted for the number of randomizations for the permutation test. Enter 0 or a positive integer. ***Please read the "Permutation Test" usage notes below!
    In this example, the program displays the mean Jaccard similarity within each Reach, and the mean similarity between all pairs of Reaches. It also computes weighted and unweighted means of the within-group mean similarities, as well as a grand mean of all the between-group similarities. The permutation test confirms that the ratio of overall mean Between similarity to overall mean Within similarity (M=Bbar/Wbar) is much smaller than would be expected from a chance assignment of sites to reaches. This result gives strong evidence that the river reaches are a 'significant' (very small P-value) classification of the sites.
    Version 6 also computes the test statistic (Wbar-Bbar) and its associated permutation test P-value. This statistic may be preferable because it can be interpreted as the average length of the branches on the mean similarity dendrogram. This statistic is almost, but not quite, "equivalent" to Bbar/Wbar, in the sense that the two statistics will have P-values that are very close to one another. 

    Sample output file (ASCII) from RNDTST6, using JACCSIM.TXT as input file.



Max. number of sites (objects) = 5000.
Max. number of site groups = 50.

2.4.2 NOTES ON PERMUTATION TEST -- PLEASE READ CAREFULLY!
    A randomization procedure is used to select a random sample of size NTRIALS, out of all the possible unique site permutations among the groups. The P-value estimate is then calculated as (NLESS+1)/(NTRIALS+1), where NLESS = number of randomized M-values that were strictly less than the observed M. You choose a value for NTRIALS at the program's prompting.
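The randomization procedure can be sketched as follows. This is a Python illustration with invented data, not the RNDTST6 source; for similarities, small M indicates a strong classification.

```python
# Randomization test for M = Bbar/Wbar: shuffle the group labels NTRIALS
# times and estimate P as (NLESS+1)/(NTRIALS+1), where NLESS counts the
# randomized M-values strictly less than the observed M.
import itertools
import random

def mean_ratio(sim, groups):
    """M = Bbar/Wbar for a symmetric similarity matrix and group labels."""
    within, between = [], []
    for i, j in itertools.combinations(range(len(groups)), 2):
        (within if groups[i] == groups[j] else between).append(sim[i][j])
    return (sum(between) / len(between)) / (sum(within) / len(within))

def perm_test(sim, groups, ntrials, seed=12345):
    rng = random.Random(seed)
    m_obs = mean_ratio(sim, groups)
    labels = list(groups)
    nless = 0
    for _ in range(ntrials):
        rng.shuffle(labels)                  # one random site permutation
        if mean_ratio(sim, labels) < m_obs:  # strictly less than observed
            nless += 1
    return (nless + 1) / (ntrials + 1)

sim = [[1.0, 0.8, 0.3, 0.2], [0.8, 1.0, 0.4, 0.1],
       [0.3, 0.4, 1.0, 0.7], [0.2, 0.1, 0.7, 1.0]]
p = perm_test(sim, ["A", "A", "B", "B"], ntrials=999)
print(p)   # no relabeling beats the observed grouping, so P = 0.001
```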
    About 10000 trials is recommended (Van Sickle, 1997) to ensure a reliable estimate of P in the neighborhood of .01 to .05. For more than about 50 sites, this may take a few seconds to execute on a Pentium-class PC.

Exact Test for Small Classifications -- (NEW IN VERSION 6)
    For "small" classifications, there may not be a large number of unique possible permutations. For example, for two groups, with 5 sites in one group and 4 in the other, there are only 9!/(5!4!) = 126 possible permutations. In this case, choosing a large random sample of permutations can give a poor estimate of the P-value.
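The number of unique permutations is a multinomial coefficient, n!/(n1! n2! ... nk!) for group sizes n1, ..., nk. A quick check of the 9!/(5!4!) example (plain Python; the helper function name is ours):

```python
# Count the unique assignments of n sites to groups of fixed sizes:
# the multinomial coefficient n! / (n1! * n2! * ... * nk!).
from math import factorial

def n_unique_permutations(group_sizes):
    total = factorial(sum(group_sizes))
    for size in group_sizes:
        total //= factorial(size)
    return total

print(n_unique_permutations([5, 4]))  # 126, matching 9!/(5!4!)
```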
    For such cases, the exact P-value can instead be calculated by systematically enumerating all of the possible permutations of sites among groups. 
    After you enter your desired number of randomizations, RNDTST6 compares your request to the total number of possible unique permutations (NPERMS) for your classification structure. If NPERMS is smaller than the requested number of randomizations, RNDTST6 prints a warning and proceeds instead with the exact test, in which all of the NPERMS permutations are enumerated. In the exact test, the specific permutation corresponding to observed values of M and (Wbar-Bbar) is included among the set of all possible permutations, so that the exact P-value is calculated as (NLESS+1)/NPERMS.
    You can force RNDTST6 to do an exact test by calculating the number of possible permutations (see the Mean Similarity article, Van Sickle (1997), for the formula), and then entering some larger number for the desired number of randomizations. WARNING! Doing this could tie up your computer for a LONG time!
    For more discussion, see the general permutation-testing references cited in Van Sickle (1997). The permutation test is NOT VALID if it is used to test groups that were constructed from an optimal grouping method, such as cluster analysis, that was applied to the same similarity matrix that is being used in the test. However, it is still valid and very useful to informally compare the magnitude of between and within group mean similarities in such cases, as a measure of cluster strength. 

    The program can analyze a dissimilarity (distance) matrix equally easily. The input file of dissimilarities is formatted exactly as for similarities, and mean between- and within-group dissimilarities are then calculated. In this case, M is the ratio of mean Between-group dissimilarity to mean Within-group dissimilarity, so a LARGE value of M is evidence that the groups give a "strong" classification. For a strong classification, nearly all randomized M-values will be smaller than the observed M, and the P-value should therefore be estimated from the number of randomized M-values that were greater than or equal to the observed M.
    The program does not print this correct P-estimate; instead, calculate it by hand from the output counts as [(NTRIALS-NLESS)+1]/[NTRIALS+1]. For the exact test applied to dissimilarities, use NPERMS instead as the denominator. Analogous reasoning gives the interpretation of (Wbar-Bbar) for dissimilarities, and its P-value is calculated in a similar fashion to that for M.
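The hand correction for dissimilarities looks like this (the counts below are hypothetical; substitute the NTRIALS and NLESS values from your own output):

```python
# Hand-computed P-value for a dissimilarity analysis, using the counts
# reported by RNDTST6 (hypothetical values shown).
ntrials = 10000  # randomizations performed
nless = 9987     # randomized M-values strictly less than the observed M

# For dissimilarities, LARGE M is evidence of a strong classification,
# so the correct estimate counts the randomized M-values >= observed M:
p_dissim = ((ntrials - nless) + 1) / (ntrials + 1)
print(p_dissim)  # about 0.0014
```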

    SIMCALC.SAS is a SAS program that calculates a site similarity matrix from a site (columns) by species (rows) table. Code is given for Jaccard and Bray-Curtis similarity and for Euclidean distance. The program writes the matrix to a file in the format required by RNDTST6. The SAS program also contains example code for nonmetric multidimensional scaling and cluster analysis, based on the similarity matrix. The program is macro-free, well-documented, and should run on any platform.
    SITESPEC.TXT is a sample input file (site x species table) for the SAS program. Application of the SAS program to this file yielded the Jaccard similarity matrix in JACCSIM.TXT.
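As a cross-check on the similarity calculation: the Jaccard coefficient for two sites is the number of species shared divided by the number present at either site. A Python illustration (not the SAS code; the species vectors are invented):

```python
# Jaccard similarity between two sites from 0/1 presence/absence vectors:
# (species present at both sites) / (species present at either site).
def jaccard(a, b):
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either

site1 = [1, 1, 0, 1, 0]  # presence/absence over five species
site2 = [1, 0, 0, 1, 1]
print(jaccard(site1, site2))  # 2 shared / 4 present in either = 0.5
```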

    P.W. Mielke and colleagues originally developed the permutation test procedures for within-group mean dissimilarities (see Van Sickle (1997) for references). Their methodology is generally known as Multiresponse Permutation Procedures (MRPP). MRPP is available as a stand-alone program, as well as a component of the BLOSSOM and PC-ORD packages. These packages directly read site x species data and calculate some common dissimilarity measures. They also perform additional multivariate analyses beyond MRPP. The MRPP programs report the size (number of sites) and mean within-group dissimilarity for each group. In addition, they report the overall mean of all dissimilarities (denoted "Expected Delta" in MRPP) and the weighted overall mean (Wbar) of the within-group dissimilarities (denoted "Observed Delta" in MRPP). Bbar is the only information needed to construct mean similarity dendrograms, and to compare Bbar to Wbar as either a ratio or a difference, that the MRPP programs do not supply.

    The enclosed program, MRPPCONV.EXE, computes Bbar from values entered by the user directly from MRPP output. Use the Windows Run command to execute the program. The program is interactive and uses no input or output files.
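The algebra behind such a conversion can be reconstructed from the definitions above, ASSUMING that "Expected Delta" is the simple mean of all pairwise dissimilarities and that Wbar is weighted so that pair counts apply directly. This sketch is our reconstruction, not the MRPPCONV source, and the input numbers are hypothetical.

```python
# Recovering Bbar from MRPP-style output, ASSUMING Expected Delta is the
# mean of ALL pairwise dissimilarities and Observed Delta (Wbar) is the
# pair-count-weighted within-group mean. Hypothetical input values.
from math import comb

group_sizes = [5, 4, 6]   # number of sites in each group
expected_delta = 0.62     # overall mean of all pairwise dissimilarities
wbar = 0.48               # Observed Delta (weighted within-group mean)

n = sum(group_sizes)
pairs_total = comb(n, 2)
pairs_within = sum(comb(g, 2) for g in group_sizes)
pairs_between = pairs_total - pairs_within

# pairs_total * expected_delta = pairs_within * wbar + pairs_between * Bbar
bbar = (pairs_total * expected_delta - pairs_within * wbar) / pairs_between
print(bbar)
```

If MRPP was run with a different weighting option, the decomposition changes accordingly; MRPPCONV itself should be treated as authoritative.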

Further Notes on MRPP and MRPPCONV:

4.1 -- The statistic Wbar calculated by RNDTST6 is identical to the "Observed Delta" of MRPP, if the relative group size is used to weight the within-group means in MRPP. This is the default for both BLOSSOM and PC-ORD.

4.2 -- MRPP uses Wbar (relative to its expected value and variance under the null hypothesis) as its test statistic, rather than the Bbar/Wbar and (Wbar-Bbar) used by RNDTST6. However, all three statistics are nearly equivalent and will give nearly identical P-values in practice.

4.3 -- BLOSSOM offers an "exact test" version of MRPP suitable for small numbers of sites (see above). However, the exact version of MRPP does not print out mean within-group distances, which are needed for input to MRPPCONV. To obtain these, rerun your MRPP analysis using the regular MRPP. But note that, like any summary statistic, values of Wbar, Bbar, and their ratio and difference are "noisy" for small numbers of sites.
