Premise 33 - Multivariate statistical analyses
Multivariate statistical analyses often overlook biological knowledge
From "Restoring Life in Running Waters" by James R. Karr and Ellen W. Chu
(Reprinted with permission from Island Press)
To many field biologists, "statistics" means "multivariate statistics" because field data are complex and multidimensional. Despite the availability of numerous statistical techniques, monitoring studies have used the same multivariate techniques since the 1960s (Potvin and Travis 1993). These multivariate approaches, including cluster analysis, factor analysis, and widely used ordination techniques such as principal components analysis (PCA; James and McCulloch 1990), extract the maximum statistical variance from variance-covariance matrices, usually across species or sites (Ludwig and Reynolds 1988). Unfortunately, the contexts in which multivariate methods have been applied have often precluded detecting, understanding, and basing decisions on some of the most important signals from biological systems.
The fault lies not with multivariate statistics themselves, which can provide important insights about the structure of data sets, but rather with how they are used. Multivariate analyses were developed for finding patterns, not assessing impacts. Failure to understand the difference, or to keep it in mind when interpreting biological data, can cause mistakes. We believe that misinterpretation is more common with multivariate techniques than with the multimetric approach. Certainly it is easier for people without statistical training to understand the results of a multimetric analysis. Many authors have covered the use of multivariate methods (Wright et al. 1993; Davies et al. 1995; Davies and Tsomides 1997; Walsh 1997), so this premise discusses some of the problems associated with their misuse in biological monitoring.
First, some ordination techniques, including PCA, assume that the data follow a multivariate normal distribution (Tabachnick and Fidell 1989), a pattern that is in fact rare in data from biological monitoring. These methods assume smooth continuous relationships, either a linear or simple polynomial pattern, but relationships among environmental variables are often nonlinear. In multivariate analysis, the numerous zeros and frequent high abundances typical of biomonitoring data show up as outliers with a potentially strong influence on the statistical solution (Gauch 1982; Tabachnick and Fidell 1989), so the data are often transformed to "fix" departures from normality, usually without success (Ter Braak 1986). Second, data are often edited (e.g., rare taxa are deleted), which may result in omitting important biological information (Walsh 1997; Cao et al. 1998).
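The first point can be illustrated with a minimal sketch. It assumes NumPy and uses simulated, zero-inflated counts for a single hypothetical taxon (the 40% occupancy and the negative binomial parameters are invented for illustration, not taken from any monitoring data set). A log(x + 1) transformation compresses the large counts, but because it maps zero to zero, the spike of zeros that makes the distribution non-normal is left exactly as large as before.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical taxon: present in roughly 40% of samples, with highly
# variable abundance where it does occur (all numbers are assumptions).
present = rng.random(n_samples) < 0.4
abundance = rng.negative_binomial(1, 0.02, size=n_samples)
counts = np.where(present, abundance, 0)

# A common "fix" for non-normal count data.
logged = np.log1p(counts)

zero_share_raw = np.mean(counts == 0)
zero_share_log = np.mean(logged == 0)

print(f"share of zeros before transformation: {zero_share_raw:.0%}")
print(f"share of zeros after log(x + 1):      {zero_share_log:.0%}")
print(f"skewness before: {skewness(counts):.2f}, after: {skewness(logged):.2f}")
```

No monotonic transformation can spread a block of identical zeros into a bell-shaped curve, which is one reason such "fixes" usually fall short.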
Third, depending on which variables an analysis includes, multivariate techniques may fail to discriminate among important sources of variability, such as natural and human-induced variation or variation caused by sampling, subsampling, and error. Most multivariate data matrices contain a mix of sites, some with little influence from humans, others subject to different degrees of human influence. The matrices often mix data from different seasons or from, for example, different stream sizes or lake types. Although variables may be similarly confounded in multimetric analyses, it is usually easier to recognize and avoid this pitfall because multimetric analyses do not rely on computers to "discover" the relevant pattern.
Finally, multivariate approaches assume that statistically describing maximum variance will identify the most meaningful signal about biological condition. But because multivariate methods reduce the dimensionality of the original data by extracting or "loading" the maximum amount of variance on successive axes, they lose biological information at each step. This problem is compounded if the initial choice of biological variables was made without considering whether the variables responded across a continuum of human influence.
The most common applications of multivariate statistics rely on lists of taxa and their abundances to detect differences among sampled sites or times (Reynoldson and Metcalfe-Smith 1992; Norris and Georges 1993; Reynoldson and Zarull 1993; Norris 1995; Pan et al. 1996). PCA, for instance, uses mathematical algorithms to extract variance from a matrix of species abundances, one of the most variable aspects of biology, rather than examining how the animals feed, reproduce, use their habitat, or respond to human activities. When species-abundance matrices are the focus, important ecological attributes never even make it into the analysis. The combined loss of signal--because important components of biology are ignored and because the statistical procedure cannot apportion variance to definable causes--limits the ability of the most common multivariate applications to discern complex patterns and to help investigators understand them.
In one telling example of the pitfalls of multivariate analyses of species abundances (note 9), two investigators advocated excluding rare species, saying that such species simply add "noise to the community structure signal and... little information to the data analysis. ...We recommend excluding all taxa that contribute less than 1% of the total number or occur at less than 10% of the sites" (Reynoldson and Rosenberg 1996: 5; see also Marchant 1989; Norris 1995). Yet the presence of rare taxa indicates ecological conditions capable of supporting those often sensitive taxa; far from adding noise, rare taxa offer special clues about a site's environmental quality (Karr 1991; Courtemanch 1996; Fore et al. 1996; Cao et al. 1998).
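To make the quoted exclusion rule concrete, the sketch below applies it to a small invented data set. The taxon names, the counts, and the function name `taxa_excluded_by_rule` are all hypothetical, and the sketch assumes that "total number" means the total number of individuals summed over all sites; the quoted authors may have intended a different denominator.

```python
def taxa_excluded_by_rule(counts_by_taxon, min_share=0.01, min_site_fraction=0.10):
    """List taxa dropped by a '<1% of individuals or <10% of sites' rule.

    counts_by_taxon maps a taxon name to its counts at each site.
    """
    n_sites = len(next(iter(counts_by_taxon.values())))
    grand_total = sum(sum(counts) for counts in counts_by_taxon.values())
    excluded = []
    for taxon, counts in counts_by_taxon.items():
        share = sum(counts) / grand_total
        site_fraction = sum(1 for c in counts if c > 0) / n_sites
        if share < min_share or site_fraction < min_site_fraction:
            excluded.append(taxon)
    return excluded

# Invented counts at 12 sites (illustration only).
counts_by_taxon = {
    "midge":    [120, 95, 140, 110, 130, 90, 150, 105, 115, 125, 100, 135],
    "worm":     [30, 45, 25, 50, 35, 40, 20, 55, 30, 45, 35, 40],
    "mayfly":   [12, 0, 9, 0, 15, 0, 11, 0, 8, 0, 10, 0],
    "stonefly": [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

dropped = taxa_excluded_by_rule(counts_by_taxon)
print("excluded before analysis:", dropped)
```

In this made-up example the only taxon removed is the one found at a single site in low numbers, which is exactly the kind of record that may signal conditions good enough to support a sensitive organism.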
Furthermore, comparing the results of PCA on real data with PCA on matrices of random numbers showed that the percentage of variance described may be similar for both, especially for the second and subsequent principal components; that loadings of original variables on principal axes are often as high for random numbers as for real data; and that matrix size is an important determinant of the amount of variance extracted (Karr and Martin 1981). Multivariate techniques were unable to discern known deterministic relationships in one study (Armstrong 1967), and in another, they manufactured relationships in data sets containing no such relationships (Rexstad et al. 1988).
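The random-number comparison is easy to repeat in outline. The sketch below is not the procedure of Karr and Martin (1981); it is a minimal illustration, assuming NumPy, of PCA on the correlation matrix of pure noise. The matrix sizes (10, 30, and 300 sites by 8 variables) and the random seed are arbitrary choices made for the example.

```python
import numpy as np

def variance_fractions(data):
    """Share of total variance on each principal axis (PCA on the correlation matrix)."""
    corr = np.corrcoef(data, rowvar=False)        # columns are variables
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # largest eigenvalue first
    return eigenvalues / eigenvalues.sum()

rng = np.random.default_rng(0)
n_vars = 8
first_pc = {}

# Matrices of independent random numbers: no real relationships exist.
for n_sites in (10, 30, 300):
    noise = rng.standard_normal((n_sites, n_vars))
    first_pc[n_sites] = variance_fractions(noise)[0]
    print(f"{n_sites:>3} sites x {n_vars} variables: "
          f"first axis = {first_pc[n_sites]:.0%} of variance")

print(f"share per axis if variance were spread evenly: {1 / n_vars:.0%}")
```

Because the first axis always takes at least an even share, and chance correlations are larger when there are few rows relative to columns, small matrices of noise can yield a seemingly impressive first component.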
PCA reflects the underlying linear correlation (or covariance) among all the variables in the matrix. If correlations among the variables are absent or weak, PCA can still manufacture apparent relationships. The problem can be avoided by carefully examining the correlation matrix before applying PCA. Without careful choice of variables conveying reliable signals about biological condition or, as Gotelli and Graves (1996) argue, without a comparison of the data against a null model showing pattern(s) that would occur in the absence of any effect, multivariate statistics can misguide resource assessment efforts. General uses of PCA seldom give results that go beyond common sense (Karr and Martin 1981; Fore et al. 1996; Stewart-Oaten 1996). Gotelli and Graves (1996: 137) go so far as to suggest that "multivariate analysis has been greatly abused by ecologists. ...Drawing polygons (or amoebas) around groups of species [or points], and interpreting the results often amounts to ecological palmistry. Ad hoc 'explanations' often are based on the original untransformed variables, so that the multivariate transformation offers no more insight than the original variables did."
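One simple way to compare a PCA result against a null expectation is a column-permutation test, sketched below. This is an illustrative stand-in, not a method prescribed by Gotelli and Graves (1996): shuffling each variable independently destroys any correlation among variables while keeping each variable's own distribution, so the shuffled matrices show how much variance the first axis captures by chance alone. NumPy is assumed, and the simulated data sets, the "shared gradient," and all function names are hypothetical.

```python
import numpy as np

def first_axis_fraction(data):
    """Share of total variance on the first principal axis (correlation-matrix PCA)."""
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
    return eigenvalues[-1] / eigenvalues.sum()  # eigvalsh sorts ascending

def permutation_null(data, n_perm=200, seed=0):
    """First-axis fractions after shuffling every column independently."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        null[i] = first_axis_fraction(shuffled)
    return null

def null_p_value(data):
    """Proportion of shuffled matrices matching or beating the observed fraction."""
    observed = first_axis_fraction(data)
    null = permutation_null(data)
    return (np.sum(null >= observed) + 1) / (len(null) + 1)

rng = np.random.default_rng(1)
n_sites, n_vars = 40, 6

# Hypothetical data: six variables that all track one shared gradient...
gradient = rng.standard_normal(n_sites)
structured = np.column_stack(
    [gradient + 0.5 * rng.standard_normal(n_sites) for _ in range(n_vars)]
)
# ...versus six variables with no relationship at all.
unstructured = rng.standard_normal((n_sites, n_vars))

p_structured = null_p_value(structured)
p_unstructured = null_p_value(unstructured)
print(f"structured data:   p = {p_structured:.3f}")
print(f"unstructured data: p = {p_unstructured:.3f}")
```

A small p-value says only that the variables are more correlated than chance would produce; it says nothing about whether the correlation reflects human influence, which still requires biological judgment.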
The key danger of overreliance on multivariate analyses is that management decisions may be based on statistical properties of data--on the structure of a covariance matrix--rather than on biological knowledge and understanding. In fact, when multivariate analyses examine the same biological attributes used in multimetric indexes, they yield essentially identical results (Hughes et al., in press). The key message, then, is to choose attributes and use procedures to account for biological impacts, not just to describe pattern. Avoid analytical "shortcuts" that are not easily understood or that must be done idiosyncratically for every data set. There is simply no substitute, either in multivariate statistics or in multimetric indexes, for careful application of biological and ecological knowledge, regardless of the analytical tool. Careful design of sampling, thoughtful analysis of data, and careful description of biological condition can eliminate the need for general approaches that merely extract variance.
Note 9. From the Ninth Annual Technical Information Workshop on study design and data analysis in benthic macroinvertebrate assessments (North American Benthological Society meeting, June 1996).
References
Armstrong, J. S. 1967. Derivation of theory by means of factor analysis, or Tom Swift and his electric factor analysis machine. Am. Stat. 21: 17-21.
Cao, Y., D. D. Williams, and N. E. Williams. 1998. How important are rare species in aquatic habitat bioassessment? Bull. N. Am. Benthol. Soc. 15: 109.
Courtemanch, D. L. 1996. Commentary on the subsampling procedure used for rapid bioassessments. J. N. Am. Benthol. Soc. 15: 381-385.
Davies, S. P., and L. Tsomides. 1997. Methods for biological sampling and analysis of Maine's inland waters. DEP-LW107-A97. Maine Department of Environmental Protection, Augusta.
Davies, S. P., L. Tsomides, D. L. Courtemanch, and F. Drummond. 1995. Maine biological monitoring and biocriteria development program. Maine Department of Environmental Protection, Bureau of Land and Water Quality, Division of Environmental Assessment, Augusta.
Fore, L. S., J. R. Karr, and R. W. Wisseman. 1996. Assessing invertebrate responses to human activities: Evaluating alternative approaches. J. N. Am. Benthol. Soc. 15: 212-231.
Gauch, H. G. 1982. Multivariate Analysis in Community Ecology. Cambridge University Press, Cambridge, UK.
Gotelli, N. J., and G. R. Graves. 1996. Null Models in Ecology. Smithsonian Institution Press, Washington, DC.
Hughes, R. M., L. Reynolds, P. R. Kaufmann, A. T. Herlihy, T. Kincaid, and D. P. Larsen. In press. Development and application of an index of fish assemblage integrity for wadeable streams in the Willamette Valley, Oregon, USA. Can. J. Fish. Aquat. Sci.
James, F. C., and C. E. McCulloch. 1990. Multivariate analysis in ecology and systematics: Panacea or Pandora's box? Annu. Rev. Ecol. Syst. 21: 129-166.
Karr, J. R. 1991. Biological integrity: A long-neglected aspect of water resource management. Ecol. Appl. 1: 66-84.
Karr, J. R., and T. E. Martin. 1981. Random numbers and principal components: Further searches for the unicorn. Pages 20-24 in D. Capen, ed. The use of multivariate statistics in studies of wildlife habitat. US For. Serv. Gen Tech. Rep. RM-87.
Ludwig, J. A., and J. F. Reynolds. 1988. Statistical Ecology. Wiley, New York.
Marchant, R. 1989. A subsampler for samples of benthic invertebrates. Bull. Aust. Soc. Limnol. 12: 49-52.
Norris, R. H. 1995. Biological monitoring: The dilemma of data analysis. J. N. Am. Benthol. Soc. 14: 440-450.
Norris, R. H., and A. Georges. 1993. Analysis and interpretation of benthic surveys. Pages 234-286 in D. M. Rosenberg and V. H. Resh, eds. Freshwater Biomonitoring and Benthic Macroinvertebrates. Chapman and Hall, New York.
Pan, Y., R. J. Stevenson, B. H. Hill, A. T. Herlihy, and C. B. Collins. 1996. Using diatoms as indicators of ecological conditions in lotic systems: A regional assessment. J. N. Am. Benthol. Soc. 15: 481-494.
Potvin, C., and J. Travis, eds. 1993. Statistical methods: An upgrade for biologists. Ecology 74: 1614-1676.
Reynoldson, T. B., and J. L. Metcalfe-Smith. 1992. An overview of the assessment of aquatic ecosystem health using benthic invertebrates. J. Aquat. Ecosyst. Health 1: 295-308.
Reynoldson, T. B., and D. M. Rosenberg. 1996. Sampling strategies and practical considerations in building reference data bases for the prediction of invertebrate community structure. Pages 1-31 in R. C. Bailey, R. H. Norris, and B. Reynoldson, eds. Study Design and Data Analysis in Benthic Macroinvertebrate Assessments of Freshwater Ecosystems Using a Reference Site Approach. Technical Information Workshop, North American Benthological Society, Kalispell, MT.
Reynoldson, T. B., and M. A. Zarull. 1993. An approach to the development of biological sediment guidelines. Pages 177-200 in S. Woodley, J. Kay, and G. Francis, eds. Ecological Integrity and the Management of Ecosystems. St. Lucie Press, Delray Beach, FL.
Stewart-Oaten, A. 1996. Goals in environmental monitoring. Pages 17-28 in R. J. Schmitt and C. W. Osenberg, eds. Detecting Ecological Impacts: Concepts and Applications in Coastal Habitats. Academic Press, San Diego, CA.
Tabachnick, B. G., and L. S. Fidell. 1989. Using Multivariate Statistics, 2d ed. HarperCollins, New York.
Ter Braak, C. J. F. 1986. Canonical correspondence analysis: A new eigenvector technique for multivariate direct gradient analysis. Ecology 67: 1167-1179.
Walsh, C. J. 1997. A multivariate method for determining optimal subsample size in the analysis of macroinvertebrate samples. Mar. Freshwater Res. 48: 241-248.
Wright, J. F., M. T. Furse, and P. D. Armitage. 1993. RIVPACS: A technique for evaluating the biological quality of rivers in the UK. Eur. Water Pollut. Control 3: 15-25.