Canonical Correlation
Description: Canonical correlation takes two sets of variables and creates new variables for each set such that the correlation between the new variables is maximized. You give the model two sets of variables, and it returns pairs of new variables, each a linear combination of the original variables in one set, that are as highly correlated with each other as possible.
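The description above can be written compactly. The symbols here are assumptions introduced for illustration and do not appear in the original text: X and Y are the data matrices for the two sets of variables (one row per site or sample), and a and b are vectors of weights.

```latex
(a_1, b_1) = \arg\max_{a,\, b} \; \operatorname{corr}(Xa,\; Yb)
```

The first pair of new variables is then Xa_1 and Yb_1. Each later pair maximizes the same correlation, subject to being uncorrelated with the pairs already found.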
Simple example: Suppose you have calculated some new metrics for marine zooplankton. This group has not been used extensively for biological monitoring, and you have very few specific hypotheses about how the metrics should relate to different measures of human disturbance. The zooplankton metrics represent the first set of variables, and the measures of human disturbance represent the second set. Canonical correlation will combine the variables within each set to give the highest correlation with the combination from the other set.
MAIA example: Hill et al. (in review) used canonical correlation to evaluate the relationship between measures of human disturbance and candidate diatom metrics. After determining the canonical axes for both sets of variables, they used the first canonical axis derived from the human disturbance measures to test for differences between genus- and species-level identification of diatoms. They found that the number of diatom species that tolerate nutrient enrichment (eutraphentic taxa) increased significantly with human disturbance, but that the number of genera did not.
Figure: The proportion of eutraphentic diatom taxa - that is, taxa tolerant of nutrient enrichment - is plotted against the canonical covariate derived from measures of human disturbance.
How the method works: For each set of variables, new variables are created from linear combinations of the original variables. As in principal components analysis (PCA), this is done to reduce the number of variables. The new variables, or axes, are similar to those derived from a PCA. The difference is that the way the variables in one set are combined depends on how well the result correlates with the new variables derived for the other set.
The idea behind canonical correlation is to reduce the complexity of the data by reducing the number of variables while still including as much relevant information as possible from the original data. Redundant information, that is, information carried by highly correlated variables, is combined, or collapsed, into single variables. Relevant information is retained by constraining the solution to maximize the correlation between the new variables derived from the two original sets of variables.
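The steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the procedure used in the MAIA analysis: the function name `canonical_correlations`, the synthetic data, and the choice of a QR and singular value decomposition approach are all introduced here, and each set is assumed to have more samples than variables and no perfectly redundant columns.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Sketch of canonical correlation for two variable sets.

    X is (n_samples, p) and Y is (n_samples, q). Assumes both have full
    column rank. Returns the canonical correlations, the weights for each
    set, and the new (canonical) variables for each set.
    """
    # Center each column so that cross-products behave like covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for each set; this collapses redundant directions.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    # Singular values of Qx'Qy are the canonical correlations, largest first.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    k = min(X.shape[1], Y.shape[1])
    # Convert back to weights on the original (centered) variables.
    A = np.linalg.solve(Rx, U[:, :k])
    B = np.linalg.solve(Ry, Vt.T[:, :k])
    return s[:k], A, B, Xc @ A, Yc @ B


# Synthetic data (hypothetical): one shared gradient drives both sets.
rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=n)
X = shared[:, None] * np.array([1.0, 0.5, 0.0]) + rng.normal(size=(n, 3))
Y = shared[:, None] * np.array([0.8, -0.6]) + rng.normal(size=(n, 2))

corrs, A, B, scores_x, scores_y = canonical_correlations(X, Y)
print("canonical correlations:", corrs)
```

The first column of `scores_x` and the first column of `scores_y` form the first pair of new variables. Their correlation equals `corrs[0]`, which is at least as large as the correlation between any single variable in one set and any single variable in the other.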
Assumptions/limitations: There are many possible combinations of variables from each set. Whether the combination of variables selected is meaningful depends more on interpretation than on a statistical test of significance. Consequently, this technique is often considered exploratory rather than predictive in a statistical sense.