Comparison of Statistical Methods
In the table below, click the circle in the row and column of the 2 tests you want to compare. The table is a triangular matrix and, therefore, lists only 1 comparison for each possible pair. In other words, correlation vs. Anova is listed but not the reverse, Anova vs. correlation.
Each univariate and bivariate method is compared to every other method (green circles), and every multivariate method is compared to every other multivariate method (purple circles), but only a smaller set of comparisons is given for univariate vs. multivariate tests (half green and purple). The comparisons that don't make sense are marked "na" for not applicable.
| | UV | 2-samp | Anova | Corr | Reg | MV | Cluster | MR | DFA | Manova | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Univariate | | | | | | | | | | | |
| 2-sample | na | | | | | | | | | | |
| Anova | na | | | | | | | | | | |
| Correlation | na | | | | | | | | | | |
| Regression | na | | | | | | | | | | |
| Multivariate | na | na | na | na | | | | | | | |
| Cluster | na | na | na | na | na | | | | | | |
| Multiple reg. | na | na | na | na | | | | | | | |
| DFA | na | na | na | na | na | na | | | | | |
| Manova | na | na | na | na | na | | | | | | |
| PCA | na | na | na | na | na | na | | | | | |
| Canonical | na | na | na | na | na | na | | | | | |
Univariate (and Bivariate) vs. Multivariate
"Uni" = one. "Bi" = two. "Multi" = many. The prefix refers to the number of response variables in the model.
Statistical models often take the form of an equation where the left hand side includes the dependent variable(s) and the right includes the independent variable(s).
Dependent variable(s) = f (independent variables).
For univariate analysis there is 1 variable on the left hand side and usually 1 (but possibly more, for example, in Anova) on the right. Or you might have only 1 variable and no equation, for example, when calculating percentiles or drawing histograms.
Bivariate analysis includes 2 variables, for example, in regression or correlation.
For multivariate analysis there can be multiple variables on either side of the equation. Often the variables on each side of the equation are also related to each other (i.e. correlated).
The word "independent" gets used a lot in statistics and can become confusing. It's easy to understand why the variables on the left are called dependent: their values depend on the values of the other variables. For example, invertebrate taxa richness at a stream site depends on the amount of urban development upstream. The reverse is not true, because urban development does not depend on taxa richness. In this case urban development is the predictor or independent variable.
In statistics we often test whether an independent variable can predict the dependent variable. Remember that what's being tested is the strength of association between the variables, not whether 1 variable causes the other.
What gets confusing is that for multivariate analysis, the sets of variables on each side of the equation may or may not be independent of each other. In this case, it may be easier to think of sets of variables as being correlated or not correlated with each other.
There are a couple of exceptions to the classification system used above. Even though Anova and multiple regression are nearly equivalent mathematically, Anova is typically considered a univariate technique and multiple regression is considered multivariate. The categories depend somewhat on your point of view.
2-sample vs. Anova
Anova is an extension of the t-test. If you are testing for a difference
in means between 2 groups you would use a t-test (or its nonparametric
equivalent, the Mann-Whitney U-test). For more than 2 groups, i.e., a
multisample comparison, you would use a 1-way Anova.
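The "extension" can be checked numerically. The sketch below is a minimal illustration, assuming the pooled-variance (Student's) form of the t-test and made-up stream widths; the function names `pooled_t` and `anova_f` are hypothetical helpers, not part of any library. With exactly 2 groups, the 1-way Anova F statistic equals the square of the t statistic.

```python
from statistics import mean

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def anova_f(*groups):
    """One-way Anova F statistic for any number of groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical stream widths (m) at low- and high-elevation sites.
low = [4.1, 5.3, 6.0, 5.5, 4.8]
high = [3.2, 2.9, 4.0, 3.6, 3.1]

t = pooled_t(low, high)
f = anova_f(low, high)
print(f"t = {t:.4f}, t^2 = {t * t:.4f}, F = {f:.4f}")
```

The same `anova_f` helper accepts 3 or more groups, which is the case the t-test cannot handle.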
2-sample vs. Correlation
Two-sample comparison tests typically compare the means of 2 groups.
High and low elevation might be the grouping variable and stream width
the variable of interest, or dependent variable. For a test of correlation,
the grouping variable is replaced by a continuous variable so the question
becomes, "Does stream width increase (or decrease) with elevation?"
2-sample vs. Regression
A two-sample comparison tests whether 2 groups are different in some way (e.g., means or variances); regression tests whether 1 variable can predict another. A two-sample test might use a grouping variable such as valley vs. montane to test for differences in the average stream gradient. A regression model might use elevation to predict stream gradient. Regression represents a broader paradigm for testing variables in that a two-sample test asks whether stream gradient differs, while regression asks how stream gradient differs.
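One way to see the "broader paradigm" point is to recode the grouping variable as a 0/1 dummy and fit an ordinary regression. The sketch below assumes a pooled-variance t-test and invented gradient values. The fitted slope is the difference between the 2 group means, and the slope's t statistic matches the two-sample t statistic.

```python
from statistics import mean

# Hypothetical stream gradients (%) for valley and montane sites.
valley = [0.8, 1.1, 0.9, 1.4, 1.0]
montane = [2.9, 3.4, 2.6, 3.8, 3.1]

# Recode the grouping variable as a 0/1 dummy: valley = 0, montane = 1.
x = [0] * len(valley) + [1] * len(montane)
y = valley + montane

# Ordinary least-squares fit of y = slope * x + intercept.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# t statistic for the slope (n - 2 degrees of freedom).
n = len(y)
sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
t_slope = slope / (sse / (n - 2) / sxx) ** 0.5

# Pooled-variance two-sample t statistic computed directly from the groups.
ss_within = sum((v - mean(valley)) ** 2 for v in valley) + sum(
    (m - mean(montane)) ** 2 for m in montane
)
pooled_var = ss_within / (n - 2)
t_two_sample = (mean(montane) - mean(valley)) / (
    pooled_var * (1 / len(valley) + 1 / len(montane))
) ** 0.5

print(f"slope = {slope:.3f}, difference in means = {mean(montane) - mean(valley):.3f}")
print(f"t (regression) = {t_slope:.4f}, t (two-sample) = {t_two_sample:.4f}")
```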
Anova vs. Correlation
Anova tests for mean differences in groups, for example, does the mean
elevation differ across ecoregions? Correlation tests whether 2 variables
go up or down together (or change in opposite directions), for example,
elevation and stream gradient. Anova tests whether groups of cases are
different while correlation tests whether variables change together.
Anova vs. Regression
In Anova the independent variable is categorical; for regression it is
continuous. Both methods test whether the dependent variable can be predicted
from the independent variable(s).
Correlation vs. Regression
Correlation assumes that both variables can change while regression (typically) assumes that the independent variable is measured without error. A significant correlation means that 2 variables tend to go up or down together (or change in opposite directions), for whatever reason. A significant regression means that the dependent variable can be predicted by the independent variable, again, for whatever reason. Causality is not tested by either method. Though theoretically different, both tests usually give the same answer (significant or not).
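For Pearson correlation and simple least-squares regression, the agreement between the 2 significance tests is exact: the t statistic for the correlation coefficient and the t statistic for the regression slope are the same number. The sketch below checks this with invented elevation and gradient values; it assumes the Pearson coefficient, which is only one of several correlation measures.

```python
from statistics import mean

# Hypothetical elevation (m) and stream gradient (%) at 8 sites.
elevation = [120, 340, 560, 610, 800, 950, 1200, 1500]
gradient = [0.6, 1.1, 1.0, 1.9, 2.2, 2.0, 3.1, 3.5]

n = len(elevation)
x_bar, y_bar = mean(elevation), mean(gradient)
sxx = sum((x - x_bar) ** 2 for x in elevation)
syy = sum((y - y_bar) ** 2 for y in gradient)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(elevation, gradient))

# Pearson correlation and its t statistic (n - 2 degrees of freedom).
r = sxy / (sxx * syy) ** 0.5
t_corr = r * ((n - 2) / (1 - r * r)) ** 0.5

# Least-squares regression of gradient on elevation and the slope's t statistic.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(elevation, gradient))
t_slope = slope / (sse / (n - 2) / sxx) ** 0.5

print(f"r = {r:.4f}, slope = {slope:.5f}")
print(f"t (correlation) = {t_corr:.4f}, t (regression slope) = {t_slope:.4f}")
```

The 2 methods still answer different questions: `r` is unitless and symmetric in the 2 variables, while `slope` has units (here, percent gradient per meter) and changes if the variables trade places.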
Anova vs. Multiple regression
All Anova models can also be solved using multiple regression, though
the reverse is not true. The categories from Anova can be recoded as dummy
variables in regression and the results will be the same. Anova is used
for independent variables that are categorical (e.g., ecoregion) while
multiple regression models can be used with categorical variables,
continuous variables, or a mix of both.
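The dummy-variable recoding can be sketched in a few lines. The example below assumes 3 made-up ecoregions, treats the first as the reference level, and solves the least-squares normal equations with a small hand-written `solve` helper (hypothetical, included only to keep the example free of third-party libraries). The regression F statistic and the 1-way Anova F statistic come out the same.

```python
from statistics import mean

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical elevations (m) of sites in 3 ecoregions.
groups = {
    "coast": [120, 95, 150, 110],
    "foothill": [420, 380, 510, 460],
    "mountain": [1250, 1400, 1100, 1320],
}

# Dummy coding: "coast" is the reference level; one 0/1 column per other level.
levels = list(groups)
rows, y = [], []
for name, values in groups.items():
    for v in values:
        rows.append([1.0] + [1.0 if name == lv else 0.0 for lv in levels[1:]])
        y.append(v)

# Least squares via the normal equations (X'X) beta = X'y.
k = len(levels)
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
beta = solve(xtx, xty)

# Regression F statistic: explained vs. residual mean squares.
fitted = [sum(bi * ri for bi, ri in zip(beta, r)) for r in rows]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr = sum((fi - mean(y)) ** 2 for fi in fitted)
f_regression = (ssr / (k - 1)) / (sse / (len(y) - k))

# One-way Anova F statistic from the group means.
ss_between = sum(len(g) * (mean(g) - mean(y)) ** 2 for g in groups.values())
ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
f_anova = (ss_between / (k - 1)) / (ss_within / (len(y) - k))

print(f"F (dummy regression) = {f_regression:.4f}, F (Anova) = {f_anova:.4f}")
```

In this coding the intercept `beta[0]` is the mean of the reference group, and each remaining coefficient is the difference between one group's mean and the reference mean.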
Anova vs. Manova
For Anova, there is 1 dependent variable that you are trying to explain by grouping the cases with 1 or more independent variables. For Manova, there can be multiple dependent variables you wish to associate with the grouping variable(s). For example, with Anova you might test for differences in elevation for different ecoregions. With Manova you might test for differences in {elevation, stream gradient, and percent forest cover} based on ecoregion; 3 variables that are often correlated with each other. If instead you were testing whether 3 variables that are independent of each other (i.e., not correlated), such as elevation, conductivity, and canopy cover, change by ecoregion, you could also perform 3 separate Anova tests.
Manova can test multiple dependent variables simultaneously while keeping
track of their correlation with each other. If you used multiple Anova
tests instead for correlated variables, you would be more likely to commit
a type I error, that is, conclude that the groups were significantly different
when they were not. This type of error occurs when multiple tests are
performed at a specified p-value because with enough tests you are likely
to get a significant result due to chance alone. (At p=0.05 this will
occur about 5% of the time). When the variables are correlated the probability
of this type of error increases.
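The way chance findings accumulate can be illustrated with a short calculation. The sketch below makes a simplifying assumption that the tests are independent of each other, so it does not model the correlated-variable case described above; the function name `familywise_error` is a hypothetical helper. Under that assumption, the chance of at least 1 false positive among k tests is 1 - (1 - alpha)^k.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 3, 5, 10, 20):
    print(f"{k:2d} tests at alpha = 0.05 -> {familywise_error(0.05, k):.3f}")
```

A single test keeps the 5% rate, but the rate climbs quickly as separate tests are added, which is the motivation for testing the dependent variables together.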
Regression vs. Multiple regression
Regression draws a line that relates changes in 1 variable to changes in another. On the left hand side, multiple regression also has 1 variable, but the right hand side can have more than 1. For 2 independent variables, multiple regression draws a plane; and for more than 2 a "surface" or "hyper-plane." Thus, a regression line takes the form
Y = aX + intercept.
For multiple regression there can be many X-variables:
Y = aX1 + bX2 + cX3 + ... + intercept.
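As a minimal sketch of fitting the plane for 2 independent variables, the example below solves the least-squares normal equations in plain Python. The data are invented and generated exactly from a known plane so the recovered coefficients can be checked; `solve` and `fit_linear` are hypothetical helper names, and a real analysis would normally use a statistics package instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(xs, y):
    """Least-squares coefficients for y = c1*x1 + c2*x2 + ... + intercept."""
    rows = [list(x) + [1.0] for x in xs]  # trailing 1.0 carries the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from Y = 2*X1 - 0.5*X2 + 10.
x = [(1, 4), (2, 1), (3, 5), (4, 2), (5, 7), (6, 3)]
y = [2 * x1 - 0.5 * x2 + 10 for x1, x2 in x]

a, b, intercept = fit_linear(x, y)
print(f"Y = {a:.3f}*X1 + {b:.3f}*X2 + {intercept:.3f}")
```

Adding a third X-variable only means adding a third value to each tuple in `x`; the fitted object is then a hyper-plane rather than a plane.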
Multivariate vs. Cluster
Cluster analysis is the only multivariate method described here that
does not rely on the mathematically convenient properties of the multivariate
normal distribution to obtain solutions. Rather, cluster analysis uses
measures of distance.
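To make "measures of distance" concrete, the sketch below computes the distance between every pair of sites. It assumes Euclidean distance and invented site data; other distance measures are also used in practice, and the text does not say which one is intended. The closest pair is what an agglomerative (bottom-up) clustering method would join first.

```python
from itertools import combinations
from math import dist

# Hypothetical sites described by (elevation in km, stream gradient in %).
sites = {
    "A": (0.2, 0.8),
    "B": (0.3, 1.0),
    "C": (1.4, 3.1),
    "D": (1.5, 3.4),
    "E": (0.9, 2.0),
}

# Euclidean distance between every pair of sites.
distances = {
    (p, q): dist(sites[p], sites[q]) for p, q in combinations(sites, 2)
}

# List the pairs from most similar to least similar.
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))

# The closest pair is what an agglomerative method would merge first.
closest = min(distances, key=distances.get)
print("first merge:", closest)
```

Because distances depend on units, variables measured on very different scales are often standardized before clustering.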
Cluster vs. Multiple regression
Both methods use multiple independent variables. Cluster analysis uses
them to group cases; multiple regression uses them to predict values for
a dependent variable.
Cluster vs. Discriminant function analysis
Both methods are used to classify cases. Cluster analysis takes the multiple independent variables and defines its own clusters. DFA also uses multiple independent variables, but instead of finding its own groups, it uses the groups that you provide. The equations derived by DFA to classify the groups you initially defined can also be used to classify new cases.
DFA is often used to test for differences in the clusters defined by cluster analysis.
Cluster vs. Manova
Cluster analysis puts cases into groups; Manova tests whether the grouping
variable that you selected (the independent variable) defines groups that
have different values for the dependent variables.
Cluster vs. PCA
Cluster analysis groups similar cases together while PCA separates cases
as far apart as possible. Both use multiple independent variables. PCA
is typically used to reduce the number of variables for further analysis.
Cluster vs. Canonical correlation
Cluster analysis explores the relationships between cases; canonical
correlation explores the relationships between variables.
Multiple regression vs. Discriminant function analysis
Both methods are similar in that they use a single dependent variable and multiple independent variables. On the left hand side of the equation, they differ in that the dependent variable for DFA is categorical but in multiple regression it is continuous. On the right hand side of the equation, correlation among the independent variables is a nuisance and can result in an unstable solution for multiple regression. DFA, in contrast, expects correlation among the independent variables and includes it explicitly in the model.
Multiple regression predicts values for the dependent variable. In contrast, DFA starts with the already defined groups and finds the combinations of independent variables that maximize the differences among the groups. For example, you might use urban land cover, number of mines, and number of NPDES permits to predict taxa richness at a stream site using multiple regression, being careful to choose variables that aren't too correlated. DFA, in contrast, might use the abundances of various metals-sensitive species to predict whether mine waste is present.
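One way to judge whether predictors are "too correlated" for multiple regression is the variance inflation factor (VIF). This diagnostic is not named in the text and is swapped in here only as an illustration. With exactly 2 predictors, VIF = 1 / (1 - r^2), where r is their correlation; values near 1 indicate little overlap, and large values indicate an unstable fit. The data and helper names below are hypothetical.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly 2 predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r * r)

# Hypothetical predictors for 8 stream sites.
urban_cover = [5, 12, 20, 33, 41, 48, 60, 72]  # percent urban land cover
npdes_permits = [0, 1, 2, 4, 4, 6, 7, 9]       # tracks urban cover closely
mines = [3, 0, 5, 1, 4, 0, 2, 6]               # only weakly related to urban cover

print(f"urban vs. permits: VIF = {vif_two_predictors(urban_cover, npdes_permits):.1f}")
print(f"urban vs. mines:   VIF = {vif_two_predictors(urban_cover, mines):.1f}")
```

In this made-up data set, urban cover and permit counts overlap heavily, so a regression using both would have trouble separating their effects.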
Multiple regression vs. Manova
On the right hand side of the equation, both multiple regression and
Manova can have several independent variables. Both methods assume that
the independent variables are also independent of each other (i.e. not
correlated), or nearly so. On the left hand side, multiple regression
has just 1 variable while Manova has 2 or more.
Multiple regression vs. PCA
Multiple regression uses several independent variables to try to predict a value for a dependent variable on the other side of the equation. PCA uses just 1 set of variables and looks for a new variable (a linear combination) that will summarize the larger set of variables. For PCA there is no other side of the equation. The model doesn't try to predict values for another variable; it just looks at the relationships within a single set of variables. For example, you might use urban land cover, number of mines, and number of NPDES permits to predict taxa richness at a stream site using multiple regression, being careful to choose variables that aren't too correlated. PCA, on the other hand, expects correlated variables. For example, you might use PCA to summarize a set that includes {percent forest cover, percent urban cover, chlorine, and number of wastewater treatment plants}.
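The idea of a new variable that summarizes a correlated set can be sketched with just 2 variables, where the arithmetic is short enough to do by hand. The example below assumes PCA on the unscaled covariance matrix (many analyses standardize the variables first) and uses invented cover percentages. The first principal component is the direction along which the sites vary the most.

```python
from statistics import mean

# Hypothetical, strongly related measurements at 8 sites.
urban_cover = [5, 12, 20, 33, 41, 48, 60, 72]    # percent urban cover
forest_cover = [88, 80, 71, 60, 52, 40, 31, 20]  # percent forest cover

n = len(urban_cover)
u_bar, f_bar = mean(urban_cover), mean(forest_cover)

# Entries of the 2 x 2 sample covariance matrix [[a, b], [b, c]].
a = sum((u - u_bar) ** 2 for u in urban_cover) / (n - 1)
c = sum((f - f_bar) ** 2 for f in forest_cover) / (n - 1)
b = sum((u - u_bar) * (f - f_bar) for u, f in zip(urban_cover, forest_cover)) / (n - 1)

# Eigenvalues of a symmetric 2 x 2 matrix from the quadratic formula.
half_trace = (a + c) / 2
root = (((a - c) / 2) ** 2 + b * b) ** 0.5
lambda1, lambda2 = half_trace + root, half_trace - root

# First principal axis: eigenvector for lambda1, scaled to unit length.
v = (b, lambda1 - a)
norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
pc1 = (v[0] / norm, v[1] / norm)

explained = lambda1 / (lambda1 + lambda2)
print(f"PC1 loadings: urban = {pc1[0]:.3f}, forest = {pc1[1]:.3f}")
print(f"share of variance summarized by PC1: {explained:.1%}")
```

The loadings have opposite signs because urban and forest cover move in opposite directions, so the single new variable acts as a forested-to-urban contrast. Note that no dependent variable appears anywhere in the calculation.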
Multiple regression vs. Canonical correlation
Multiple regression relates 1 dependent variable to a set of independent
variables that are not correlated with each other (or not much). Canonical
correlation evaluates the relationship between 2 sets of variables. An example for
multiple regression would be taxa richness as a function of elevation
and percent urban area in the watershed. Canonical correlation might explore
the relationships between a set of biological metrics {mayfly taxa richness,
stonefly taxa richness, and predator taxa richness} and a set of disturbance
measures {percent urban area, percent forested area, and chlorine}. For
canonical correlation, correlation within sets of variables is ok. For
multiple regression the group of independent variables should not be too
correlated.
Discriminant function analysis vs. Manova
Discriminant function analysis and Manova are mathematically the same, but the difference is that the independent and dependent variables trade places. In DFA we ask if there is some combination of variables that reliably separates groups. Manova asks if the dependent variables (actually a linear combination of the dependent variables) are different for different groups. Using DFA, we might ask, do the values for biological metrics reliably distinguish between impaired and reference sites? With Manova we are asking, are the biological metrics different for impaired and reference sites?
Discriminant function analysis vs. PCA
For discriminant function analysis we begin by defining the categories,
for example, impaired or reference sites, and ask the analysis to identify
the best combinations of variables to define those categories. For PCA
there are no guidelines or a priori groups. PCA must sort cases with its
own algorithms. DFA combines variables to match the categories and groups
you define. PCA also combines variables, but there is no constraint to
match a priori groupings. For example, physical measures of streams might
be used in a DFA to discriminate between ecoregions. In PCA the physical
measures of streams would also be entered, and treated similarly as for
DFA, but you must look at the plots afterward to determine if there are
any clear groupings revealed by PCA; for example, stream sites in the same
ecoregion might plot more closely to each other.
Discriminant function analysis vs. Canonical correlation
Discriminant function analysis is a simple case of canonical correlation. Canonical correlation has many variables on each side of the equation that may also be correlated with each other. DFA has sets of variables on each side of the equation, but on 1 side, the variables are all categorical and may not be correlated with each other. For example, if canonical correlation uses 2 sets of variables such as invertebrate metrics and physical measures of the stream channel, DFA would use a categorical variable such as channel type or Rosgen class on 1 side and the full set of invertebrate metrics on the other.
Manova vs. PCA
Manova tests whether different groups of sites have different mean values
for a set of response variables. For example, Manova tests whether the
set of variables {stream width, depth, and gradient} varies for different
ecoregions. PCA does not compare variables on either side of an equation
because there is no "other side" of the equation for PCA. PCA
looks for relationships within a set of variables. For example, PCA could
use the same set of correlated variables used by Manova, and the plots
might show grouping by ecoregion after the analysis was complete, but
information about ecoregion is not included in the model.
Manova vs. Canonical correlation
Canonical correlation is the general model and Manova is a special case. Canonical correlation has many variables on each side of the equation that may also be correlated with each other. Manova also has several variables on each side of the equation, but in 1 set, the independent set, the variables are not correlated with each other. For example, if canonical correlation uses 2 sets of variables such as invertebrate metrics and physical measures of the stream channel, Manova might use the same set of invertebrate metrics, but on the other side would be categorical variables such as channel shape or ecoregion. Canonical correlation would return combinations of variables from each set that are maximally correlated, while Manova would test whether the ecoregions differed in terms of the invertebrate metrics.
PCA vs. Canonical correlation
Canonical correlation uses 2 sets of variables while PCA uses only 1. For each set of variables, canonical correlation creates a new variable which is a linear combination of the original variables. Variables are combined in such a way that maximizes the correlation between the new variables on each side.
PCA is a simple case of canonical correlation that uses 1 set of variables. In PCA, there is no left hand side of the equation. For example, if canonical correlation uses 2 sets of variables such as invertebrate metrics and physical measures of the stream channel, PCA would use only 1, either the biological metrics or the physical measures, but not both. PCA would provide a few new variables, or axes, that summarize the variation across cases for the set of variables.