Regression

Description: You give regression 2 variables, and the model gives you the line that best predicts values of the dependent variable from the independent variable. The statistic r-squared measures the strength of association between the 2 variables: it is the proportion of the variation in the dependent variable that the line explains. The closer the data points are to the predicted line, the larger r-squared is.
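
As a rough illustration of the last point, the Python sketch below computes r-squared for two made-up data sets that follow the same underlying line, one with points close to the line and one with points scattered farther from it. The data and the helper name `r_squared` are assumptions for illustration only, not part of the original analysis. For a line with a single predictor, r-squared equals the squared correlation between the 2 variables, which is what the helper computes.

```python
def r_squared(x, y):
    """Squared Pearson correlation of x and y.

    For a regression line with one predictor this equals r-squared.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical data: both responses scatter around the line y = 2x.
x = [1, 2, 3, 4, 5, 6]
tight = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5]  # points within 0.5 of the line
loose = [5, 1, 9, 5, 13, 9]               # points within 3 of the line

r2_tight = r_squared(x, tight)
r2_loose = r_squared(x, loose)
print("r-squared, points close to the line:", round(r2_tight, 3))
print("r-squared, points far from the line:", round(r2_loose, 3))
```

The first value printed should be the larger of the two, because the same trend is obscured by less scatter.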

Simple example: You would like to test whether a group of invertebrates that eats mostly fallen leaf litter (shredders) increases with riparian canopy. Shredders depend on leaves from trees, but the reverse is not true; trees do not depend on shredders. Regression tests whether an increase in riparian cover is a significant predictor of the percentage of shredders at a stream site.

MAIA example: Some of the fish metrics tested for inclusion in the EMAP fish index were significantly correlated with the size of the watershed: taxa richness increased as streams got bigger (Stoddard, pers. comm. and McCormick et al. [in review]). To control for this confounding factor (see also Sampling Bias), the researchers used regression to predict the species richness expected at undisturbed sites for a given watershed size. They then applied the regression model to all the sites, disturbed and undisturbed.

After controlling for the effect of watershed size on metric values, they tested the response of the metric to various measures of human disturbance. They tested the residuals from the regression model rather than testing the metric values directly. The residuals represent the variability in metric values after removing the effect of watershed size.
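
The residual approach described above can be sketched in a few lines of Python. Everything in this example is hypothetical: the site values, the use of log watershed area, and the helper name `fit_line` are assumptions chosen for illustration, not the EMAP data or the authors' code. The line is fitted to the reference (undisturbed) sites only, then applied to every site, and the residual is the observed value minus the value predicted for that watershed size.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical sites: (log10 watershed area, species richness, is_reference)
sites = [
    (1.0, 3, True), (1.5, 5, True), (2.0, 5, True), (2.5, 7, True), (3.0, 7, True),
    (1.2, 2, False), (2.2, 3, False), (2.8, 4, False),
]

# Step 1: fit the model to reference sites only.
ref_x = [x for x, _, is_ref in sites if is_ref]
ref_y = [y for _, y, is_ref in sites if is_ref]
slope, intercept = fit_line(ref_x, ref_y)

# Step 2: apply the model to all sites; residual = observed - predicted.
residuals = [y - (intercept + slope * x) for x, y, _ in sites]
ref_residuals = [r for r, s in zip(residuals, sites) if s[2]]
disturbed_residuals = [r for r, s in zip(residuals, sites) if not s[2]]

for (x, y, is_ref), r in zip(sites, residuals):
    label = "reference" if is_ref else "disturbed"
    print(f"{label:9s} size={x:.1f} richness={y} residual={r:+.2f}")
```

In this made-up data set the disturbed sites all fall below the reference line, so their residuals are negative. It is these residuals, rather than the raw richness values, that would then be tested against measures of human disturbance.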

Figure. The number of benthic fish species increased with watershed size for reference sites (upper panel). The variation not explained by watershed size is represented by the vertical distance of each point from the regression line. These distances are called the residuals and are indicated for a few points by arrows (lower panel).

Diatom example: Regression of diatom index on human disturbance. Multiple independent tests of biological metrics and indexes ensure that the method is robust and not specific to a particular data set.

To test the diatom multimetric index, the index was regressed on a PCA axis derived from chloride, total N, riparian condition measures, road density, % urban, forest, agriculture, and mine cover. The axis represented a linear combination of the various measures of human disturbance.

The diatom multimetric index declined significantly as disturbance increased in Virginia and Pennsylvania. Regression was not significant for West Virginia, possibly because the intensity of human disturbance was less in this state.

The data used here were from 1995 and were independent of the data collected in 1993-94 used to test the metrics.

How the method works: Regression draws the line that minimizes the sum of the squared vertical distances from all the points to that line. "Least squares" regression refers to these squared distances: the line for which their sum is smallest is the best fit. The distance from each point to the predicted line is called the residual; it is the amount left over that wasn't explained by the linear relationship.
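
A minimal Python sketch of the least-squares idea follows. The data points and the function names `least_squares` and `sum_sq_residuals` are made up for illustration. The slope and intercept come from the standard closed-form solution for a single predictor, and the final lines check that nudging the fitted line in any direction makes the sum of squared residuals larger.

```python
def least_squares(x, y):
    """Closed-form least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_sq_residuals(x, y, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = least_squares(x, y)
best = sum_sq_residuals(x, y, slope, intercept)

# Any other line has a larger sum of squared residuals.
nudged = [
    sum_sq_residuals(x, y, slope + 0.1, intercept),
    sum_sq_residuals(x, y, slope - 0.1, intercept),
    sum_sq_residuals(x, y, slope, intercept + 0.1),
    sum_sq_residuals(x, y, slope, intercept - 0.1),
]
print("fitted slope and intercept:", round(slope, 3), round(intercept, 3))
print("sum of squared residuals at the fit:", round(best, 3))
print("smallest sum among nudged lines:", round(min(nudged), 3))
```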

Assumptions/limitations: Regression assumes the relationship between the 2 variables is linear. If the relationship is very strong but strongly curved, regression may indicate that the variables are unrelated. To detect this, always graph your data.
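
To make the curvature problem concrete, here is a small Python sketch with made-up data in which the response is determined exactly by the predictor (y equals x squared), yet a straight-line fit finds no association. The helper name `r_squared` is an assumption for illustration; it computes the squared correlation, which equals r-squared for a line with one predictor.

```python
def r_squared(x, y):
    """Squared Pearson correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical U-shaped relationship: y depends entirely on x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r2_curved = r_squared(x, y)
print("r-squared for a perfect but curved relationship:", r2_curved)
```

Because the left half of the curve falls while the right half rises by the same amount, the best straight line is flat and r-squared is zero. A plot of y against x would show the relationship immediately.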

Regression is sensitive to outliers. If 1 point is far from the main group, it can have much more influence on the slope of the line than the other points; this influence is called leverage. Graphing the data will reveal the outliers.
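
The effect of a single distant point can be sketched in Python as follows. The data are invented for illustration, and `fit_slope` is a hypothetical helper, not a library function. Five points show essentially no trend; adding one point far from the rest changes the fitted slope substantially.

```python
def fit_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical main group: nearly flat.
x = [1, 2, 3, 4, 5]
y = [2.1, 1.9, 2.0, 2.1, 1.9]
slope_without = fit_slope(x, y)

# Same data plus one point far from the main group.
slope_with = fit_slope(x + [20], y + [12.0])

print("slope without the distant point:", round(slope_without, 3))
print("slope with the distant point:", round(slope_with, 3))
```

One point out of six is enough to turn a flat line into a clearly rising one, which is why a plot of the data is worth checking before interpreting the slope.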

Regression assumes that the residuals, that is, the distances of the observed values from the regression line, are normally distributed. This in turn implies that the response variable is continuous.

Regression also assumes that the independent variable(s) are measured without error. If the predictor variable has error associated with it, the estimated slope and the test statistics may be affected; random error in the predictor typically biases the slope toward zero.
