

Comparison of Statistical Methods

In the table below, click the circle in the row and column of the 2 tests you want to compare. The table is a triangular matrix and, therefore, lists only 1 comparison for each possible pair. In other words, correlation vs. Anova is listed but not the reverse, Anova vs. correlation.

Each univariate and bivariate method is compared to every other method (green circles), and every multivariate method is compared to every other multivariate method (purple circles), but only a smaller set of comparisons is given for univariate vs. multivariate tests (half green and purple). The comparisons that don't make sense are marked "na" for not applicable.

 

 

               UV     2-samp  Anova         Corr   Reg           MV      Cluster  MR      DFA     Manova  PCA
Univariate
2-sample       na
Anova          na     green
Correlation    na     green   green
Regression     na     green   green         green
Multivariate   green  na      na            na     na
Cluster        na     na      na            na     na            purple
Multiple reg.  na     na      purple/green  na     purple/green  na      purple
DFA            na     na      na            na     na            na      purple   purple
Manova         na     na      purple/green  na     na            na      purple   purple  purple
PCA            na     na      na            na     na            na      purple   purple  purple  purple
Canonical      na     na      na            na     na            na      purple   purple  purple  purple  purple


Univariate (and Bivariate) vs. Multivariate

"Uni" = one. "Bi" = two. "Multi" = many. The prefix refers to the number of response variables in the model.

Statistical models often take the form of an equation where the left hand side includes the dependent variable(s) and the right includes the independent variable(s).

Dependent variable(s) = f (independent variables).

For univariate analysis there is 1 variable on the left hand side and usually 1 (but possibly more, for example, in Anova) on the right. Or you might have only 1 variable and no equation, for example, when calculating percentiles or drawing histograms.

Bivariate analysis includes 2 variables, for example, in regression or correlation.

For multivariate analysis there can be multiple variables on either side of the equation. Often the variables on each side of the equation are also related to each other (i.e. correlated).

The word "independent" gets used a lot in statistics and can become confusing. It's easy to understand why the variables on the left are called dependent: their values depend on the values of the other variables. For example, invertebrate taxa richness at a stream site depends on the amount of urban development upstream. The reverse is not true, because urban development does not depend on taxa richness. In this case urban development is the predictor or independent variable.

In statistics we often test whether an independent variable can predict the dependent variable. Remember, what's being tested is the strength of association between the variables, not whether 1 variable causes the other.

What gets confusing is that for multivariate analysis, the sets of variables on each side of the equation may or may not be independent of each other. In this case, it may be easier to think of sets of variables as being correlated or not correlated with each other.

There are a couple of exceptions to the classification system used above. Even though Anova and multiple regression are nearly equivalent mathematically, Anova is typically considered a univariate technique and multiple regression is considered multivariate. The categories depend somewhat on your point of view.

 

Two-sample vs. Anova

Anova is an extension of the t-test. If you are testing for difference in means for 2 groups you would use a t-test (or its nonparametric equivalent, the Mann-Whitney U-test). For more than 2 groups, i.e., a multisample comparison, you would use a 1-way Anova.
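The sense in which Anova extends the t-test can be checked numerically: with exactly 2 groups, the 1-way Anova F statistic equals the square of the pooled-variance t statistic. The sketch below is only an illustration; the stream-width values and the function names `pooled_t` and `one_way_f` are made up for this example.

```python
from statistics import mean

def pooled_t(a, b):
    """Two-sample t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def one_way_f(groups):
    """1-way Anova F statistic for a list of groups."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical stream widths (m) at low- and high-elevation sites.
low = [4.1, 5.3, 4.8, 6.0, 5.1]
high = [3.2, 2.9, 4.0, 3.5, 3.1]

t = pooled_t(low, high)
f = one_way_f([low, high])
print(t ** 2, f)  # the two values agree up to rounding error
```

The same `one_way_f` function accepts 3 or more groups, which is the multisample case where the t-test no longer applies.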

 

Two-sample vs. Correlation

Two-sample comparison tests typically compare the means of 2 groups. High and low elevation might be the grouping variable and stream width the variable of interest, or dependent variable. For a test of correlation, the grouping variable is replaced by a continuous variable so the question becomes, "Does stream width increase (or decrease) with elevation?"
 

Two-sample vs. Regression

A two-sample comparison tests whether 2 groups are different in some way (e.g., means or variances); regression tests whether 1 variable can predict another. A two-sample test might use a grouping variable such as valley vs. montane to test for differences in the average stream gradient. A regression model might use elevation to predict stream gradient. Regression represents a broader paradigm for testing variables in that a two-sample test asks whether stream gradient differs while regression asks how stream gradient differs.

 

Anova vs. Correlation

Anova tests for mean differences among groups, for example, does the mean elevation differ across ecoregions? Correlation tests whether 2 variables go up or down together (or change in opposite directions), for example, elevation and stream gradient. Anova tests whether groups of cases are different while correlation tests whether variables change together.
 

Anova vs. Regression

In Anova the independent variable is categorical; for regression it is continuous. Both methods test whether the dependent variable can be predicted from the independent variable(s).
 

Correlation vs. Regression

Correlation assumes that both variables can change while regression (typically) assumes that the independent variable is measured without error. A significant correlation means that 2 variables tend to go up or down together (or change in opposite directions), for whatever reason. A significant regression means that the dependent variable can be predicted by the independent variable, again, for whatever reason. Causality is not tested by either method. Though theoretically different, both tests usually give the same answer (significant or not).
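For the specific case of Pearson correlation and simple least-squares regression, the agreement between the two tests can be shown directly: the t statistic for testing r against zero is the same as the t statistic for testing the slope against zero. The sketch below assumes those two particular methods; the elevation and gradient values and the function names are hypothetical.

```python
from statistics import mean

def sums_of_squares(x, y):
    """Return Sxx, Syy, and Sxy computed about the means."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

def t_from_correlation(x, y):
    """t statistic for testing Pearson's r against zero."""
    sxx, syy, sxy = sums_of_squares(x, y)
    r = sxy / (sxx * syy) ** 0.5
    n = len(x)
    return r * ((n - 2) / (1 - r ** 2)) ** 0.5

def t_from_regression(x, y):
    """t statistic for testing the least-squares slope against zero."""
    sxx, syy, sxy = sums_of_squares(x, y)
    slope = sxy / sxx
    residual_ss = syy - slope * sxy  # unexplained variation in y
    n = len(x)
    standard_error = (residual_ss / (n - 2) / sxx) ** 0.5
    return slope / standard_error

# Hypothetical elevation (m) and stream gradient (%) at six sites.
elevation = [200.0, 450.0, 700.0, 950.0, 1200.0, 1500.0]
gradient = [0.8, 1.1, 2.0, 2.4, 3.9, 4.1]

t_r = t_from_correlation(elevation, gradient)
t_b = t_from_regression(elevation, gradient)
print(t_r, t_b)  # equal up to rounding error
```

Other forms of correlation (for example, rank-based) or regression do not share this exact identity.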



Anova vs. Multiple regression

All Anova models can also be solved using multiple regression, though the reverse is not true. The categories from Anova can be recoded as dummy variables in regression and the results will be the same. Anova is used for independent variables that are categorical (e.g., ecoregion) while multiple regression models can be used with categorical variables, continuous variables, or a mix of both.
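The dummy-variable recoding can be sketched as follows. Each non-reference category gets its own 0/1 column, and a least-squares fit then reproduces the group means that Anova compares: the intercept is the reference group's mean and each coefficient is a difference from it. The ecoregion names, elevations, and helper functions (`solve`, `least_squares`) are assumptions made for this illustration.

```python
from statistics import mean

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(rows, y):
    """Regression coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical elevations (m) for sites in three ecoregions.
groups = {
    "coast": [120.0, 95.0, 150.0],
    "valley": [310.0, 280.0, 340.0],
    "montane": [900.0, 1100.0, 1000.0],
}

# Dummy coding: an intercept plus one 0/1 column per non-reference group.
rows, y = [], []
for name, values in groups.items():
    for v in values:
        rows.append([1.0, float(name == "valley"), float(name == "montane")])
        y.append(v)

intercept, b_valley, b_montane = least_squares(rows, y)
print(intercept, intercept + b_valley, intercept + b_montane)
```

Because the fitted values are the group means, the residual variation in the regression matches the within-group variation in Anova, which is why the two analyses lead to the same test.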
 

Anova vs. Manova

For Anova, there is 1 dependent variable that you are trying to explain by grouping the cases with 1 or more independent variables. For Manova, there can be multiple dependent variables you wish to associate with the grouping variable(s). For example, with Anova you might test for differences in elevation for different ecoregions. With Manova you might test for differences in {elevation, stream gradient, and percent forest cover} based on ecoregion; these 3 variables are often correlated with each other. If instead you were testing whether 3 variables that are independent of each other (not correlated), such as elevation, conductivity, and canopy cover, change by ecoregion, you could also perform 3 separate Anova tests.

Manova can test multiple dependent variables simultaneously while keeping track of their correlation with each other. If you used multiple Anova tests instead for correlated variables, you would be more likely to commit a type I error, that is, conclude that the groups were significantly different when they were not. This type of error occurs when multiple tests are performed at a specified p-value because with enough tests you are likely to get a significant result due to chance alone. (At p=0.05 this will occur about 5% of the time for each test.) When the variables are correlated the probability of this type of error increases.
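How quickly chance results accumulate can be illustrated with a simple baseline calculation. The sketch below assumes the tests are independent of each other, which is a simplification and not the correlated situation described above; the function name `familywise_error` is made up for this example.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least 1 false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Probability of at least one spurious "significant" result at p=0.05.
for k in (1, 3, 10):
    print(k, round(familywise_error(0.05, k), 3))
```

Under this independence assumption, 3 separate tests at p=0.05 give about a 14% chance of at least 1 false positive, and 10 tests give about 40%.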
 

Regression vs. Multiple regression

Regression draws a line that relates changes in 1 variable to changes in another. On the left hand side, multiple regression also has 1 variable, but the right hand side can have more than 1. For 2 independent variables, multiple regression draws a plane; and for more than 2 a "surface" or "hyper-plane." Thus, a regression line takes the form

Y = aX + intercept.

For multiple regression there can be many X-variables

Y = aX1 + bX2 + cX3 + ... + intercept.
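Both equations can be evaluated by the same small function, which makes the point that simple regression is the 1-variable case of multiple regression. This is only a sketch; the function name `predict` and the coefficient values are hypothetical, not fitted to any data.

```python
def predict(coefficients, intercept, x_values):
    """Evaluate Y = a*X1 + b*X2 + ... + intercept for one case."""
    return sum(c * x for c, x in zip(coefficients, x_values)) + intercept

# Simple regression: 1 X-variable, so the model is a line.
print(predict([2.0], 1.0, [3.0]))

# Multiple regression: 3 X-variables, so the model is a hyper-plane.
print(predict([2.0, -1.0, 0.5], 1.0, [3.0, 4.0, 2.0]))
```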

 

Multivariate vs. Cluster

Cluster analysis is the only multivariate method described here that does not rely on the mathematically convenient properties of the multivariate normal distribution to obtain solutions. Rather, cluster analysis uses measures of distance.
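As an illustration of what "measures of distance" means in practice, the sketch below computes Euclidean distances between cases and finds the closest pair, which is the first merge an agglomerative clustering procedure would make. Euclidean distance is an assumption here (other distance measures are also used), and the site labels and coordinates are hypothetical.

```python
from itertools import combinations
from math import dist

# Hypothetical sites described by 2 measured variables each.
sites = {
    "A": (1.0, 2.0),
    "B": (1.5, 1.8),
    "C": (8.0, 8.0),
    "D": (8.5, 7.5),
}

def closest_pair(points):
    """Return the 2 labels whose points are nearest in Euclidean distance."""
    return min(
        combinations(points, 2),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )

print(closest_pair(sites))
```

No distributional assumption enters the calculation; only the distances between cases matter. In practice, variables measured on different scales are usually standardized first so that no single variable dominates the distance.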
 

Cluster vs. Multiple regression

Both methods use multiple independent variables. Cluster analysis uses them to group cases; multiple regression uses them to predict values for a dependent variable.
 

Cluster vs. Discriminant function analysis

Both methods are used to classify cases. Cluster analysis takes the multiple independent variables and defines its own clusters. DFA also uses multiple independent variables, but instead of finding its own groups, it uses the groups that you provide. The equations derived by DFA to classify the groups you initially defined can also be used to classify new cases.

DFA is often used to test for differences in the clusters defined by cluster analysis.

 

Cluster vs. Manova

Cluster analysis puts cases into groups; Manova tests whether the grouping variable that you selected (the independent variable) defines groups that have different values for the dependent variables.
 

Cluster vs. PCA

Cluster analysis groups similar cases together while PCA separates cases as far apart as possible. Both use multiple independent variables. PCA is typically used to reduce the number of variables for further analysis.
 

Cluster vs. Canonical correlation

Cluster analysis explores the relationships between cases; canonical correlation explores the relationships between variables.
 

Multiple regression vs. DFA

Both methods are similar in that they use a single dependent variable and multiple independent variables. On the left hand side of the equation, they differ in that the dependent variable for DFA is categorical but in multiple regression it is continuous. On the right hand side of the equation, correlation among the independent variables is a nuisance and can result in an unstable solution for multiple regression. DFA, in contrast, expects correlation among the independent variables and includes it explicitly in the model.

Multiple regression predicts values for the dependent variable. In contrast, DFA starts with the already defined groups and finds the combinations of independent variables that maximize the differences among the groups. For example, you might use urban land cover, number of mines, and number of NPDES permits to predict taxa richness at a stream site using multiple regression, being careful to choose variables that aren't too correlated. DFA, in contrast, might use the abundances of various metals-sensitive species to predict whether mine waste is present.

 

Multiple regression vs. Manova

On the right hand side of the equation, both multiple regression and Manova can have several independent variables. Both methods assume that the independent variables are also independent of each other (i.e. not correlated), or nearly so. On the left hand side, multiple regression has just 1 variable while Manova has 2 or more.
 

Multiple regression vs. PCA

Multiple regression uses several independent variables to try to predict a value for a dependent variable on the other side of the equation. PCA uses just 1 set of variables and looks for a new variable (a linear combination) that will summarize the larger set of variables. For PCA there is no other side of the equation: the model doesn't try to predict values for another variable; it just looks at the relationships within a single set of variables. For example, you might use urban land cover, number of mines, and number of NPDES permits to predict taxa richness at a stream site using multiple regression, being careful to choose variables that aren't too correlated. PCA, on the other hand, expects correlated variables. For example, you might use PCA to summarize a set that includes {percent forest cover, percent urban cover, chlorine, and number of wastewater treatment plants}.
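The idea of a new variable that summarizes a correlated set can be sketched for the smallest possible case, 2 variables, where the first principal component has a closed-form solution from the 2 x 2 covariance matrix. The data values and the function name `first_component` are hypothetical, and the example works on the raw covariance matrix; standardizing the variables first is a common alternative when units differ.

```python
from math import atan2, cos, sin
from statistics import mean, variance

def first_component(x, y):
    """First principal component of 2 variables from their covariance matrix."""
    mx, my = mean(x), mean(y)
    n = len(x)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]].
    largest = (sxx + syy) / 2 + (((sxx - syy) / 2) ** 2 + sxy ** 2) ** 0.5
    # Direction of the axis with the greatest variance.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    weights = (cos(theta), sin(theta))
    # Scores: the new variable, a linear combination of the centered originals.
    scores = [weights[0] * (a - mx) + weights[1] * (b - my) for a, b in zip(x, y)]
    share = largest / (sxx + syy)  # fraction of total variance summarized
    return weights, share, scores

# Hypothetical percent forest cover and percent urban cover at six sites.
forest = [80.0, 65.0, 50.0, 35.0, 20.0, 10.0]
urban = [5.0, 15.0, 20.0, 40.0, 55.0, 70.0]

weights, share, scores = first_component(forest, urban)
print(weights, share)
```

Note that nothing is being predicted: the only output is a new axis and each site's score along it, which could then be used in place of the 2 original variables.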

 

Multiple regression vs. Canonical correlation

Multiple regression relates 1 dependent variable to a set of independent variables that are not correlated with each other (or not much). Canonical correlation evaluates the relationship between 2 sets of variables. An example for multiple regression would be taxa richness as a function of elevation and percent urban area in the watershed. Canonical correlation might explore the relationships between a set of biological metrics {mayfly taxa richness, stonefly taxa richness, and predator taxa richness} and a set of disturbance measures {percent urban area, percent forested area, and chlorine}. For canonical correlation, correlation within sets of variables is ok. For multiple regression the group of independent variables should not be too correlated.
 

Discriminant function analysis vs. Manova

Discriminant function analysis and Manova are mathematically the same, but the difference is that the independent and dependent variables trade places. In DFA we ask if there is some combination of variables that reliably separates groups. In Manova, we ask if the dependent variables (actually a linear combination of the dependent variables) are different for different groups. Using DFA, we might ask, do the values for biological metrics reliably distinguish between impaired and reference sites? With Manova we are asking, are the biological metrics different for impaired and reference sites?

 

Discriminant function analysis vs. PCA

For discriminant function analysis we begin by defining the categories, for example, impaired or reference sites, and ask the analysis to identify the best combinations of variables to define those categories. For PCA there are no guidelines or a priori groups. PCA must sort cases with its own algorithms. DFA combines variables to match the categories and groups you define; PCA also combines variables, but there is no constraint to match a priori groupings. For example, physical measures of streams might be used in a DFA to discriminate between ecoregions. In PCA the physical measures of streams would also be entered, and treated similarly as for DFA, but you must look at the plots afterward to determine if there are any clear groupings revealed by PCA, for example, stream sites in the same ecoregion might plot more closely to each other.
 

Discriminant function analysis vs. Canonical correlation

Discriminant function analysis is a simple case of canonical correlation. Canonical correlation has many variables on each side of the equation that may also be correlated with each other. DFA has sets of variables on each side of the equation, but on 1 side, the variables are all categorical and may not be correlated with each other. For example, if canonical correlation uses 2 sets of variables such as invertebrate metrics and physical measures of the stream channel, DFA would use a categorical variable such as channel types of Rosgen class on 1 side and the full set of invertebrate metrics on the other.

 

Manova vs. PCA

Manova tests whether different groups of sites have different mean values for a set of response variables. For example, Manova tests whether the set of variables {stream width, depth, and gradient} varies for different ecoregions. PCA does not compare variables on either side of an equation because there is no "other side" of the equation for PCA. PCA looks for relationships within a set of variables. For example, PCA could use the same set of correlated variables used by Manova, and the plots might show grouping by ecoregion after the analysis was complete, but information about ecoregion is not included in the model.
 

Manova vs. Canonical correlation

Canonical correlation is the general model and Manova is a special case. Canonical correlation has many variables on each side of the equation that may also be correlated with each other. Manova also has several variables on each side of the equation, but in 1 set, the independent set, the variables are not correlated with each other. For example, if canonical correlation uses 2 sets of variables such as invertebrate metrics and physical measures of the stream channel, Manova might use the same set of invertebrate metrics, but on the other side would be categorical variables such as channel shape or ecoregion. Canonical correlation would return combinations of variables from each set that are maximally correlated while Manova would test whether the ecoregions differed in terms of the invertebrate metrics.

 

PCA vs. Canonical correlation

Canonical correlation uses 2 sets of variables while PCA uses only 1. For each set of variables, canonical correlation creates a new variable which is a linear combination of the original variables. Variables are combined in such a way that maximizes the correlation between the new variables on each side.

PCA is a simple case of canonical correlation that uses 1 set of variables. In PCA, there is no left hand side of the equation. For example, if canonical correlation uses 2 sets of variables such as invertebrate metrics and physical measures of the stream channel, PCA would use only 1, either the biological metrics or the physical measures, but not both. PCA would provide a few new variables, or axes, that summarize the variation across cases for the set of variables.


 
