Multiple Regression
Description: Bivariate regression uses 1 independent variable to predict values for 1 dependent variable. Multiple regression uses 2 or more independent variables to predict the dependent variable. You give multiple regression several independent variables and a dependent variable that you want to predict. Multiple regression finds the linear combination of the independent variables (IVs) that best predicts values for the dependent variable (DV).
DV = f (2 or more IVs)
Simple example: Suppose you calculate invertebrate taxa richness for reference stream sites across a broad region. You may want to test whether taxa richness of invertebrates is predicted by some combination of elevation and latitude. Taxa richness would be the dependent variable and latitude and elevation would be independent variables.
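The simple example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the site data are synthetic (generated at random, not real stream measurements), the function name `fit_multiple_regression` is hypothetical, and ordinary least squares via NumPy stands in for whatever statistical package you actually use.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Ordinary least squares fit of y on the columns of X plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)           # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())     # total variation in the DV
    return coef, 1.0 - ss_res / ss_tot

# Synthetic, made-up data for illustration only (not real stream sites).
rng = np.random.default_rng(0)
n_sites = 60
elevation_m = rng.uniform(100, 1500, n_sites)
latitude_deg = rng.uniform(37, 42, n_sites)
richness = (40 - 0.01 * elevation_m - 1.5 * (latitude_deg - 37)
            + rng.normal(0, 2, n_sites))

# DV = taxa richness; IVs = elevation and latitude.
coef, r2 = fit_multiple_regression(
    np.column_stack([elevation_m, latitude_deg]), richness)
print("intercept, elevation slope, latitude slope:", coef)
print("r-squared:", round(r2, 3))
```

The fitted slopes describe how predicted richness changes with each independent variable while the other is held constant, and r-squared summarizes how much of the variation in richness the combination explains.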
MAIA example: Herlihy et al. (1998) tested which water chemistry variables were most closely associated with watershed land cover. They ran 13 separate multiple regression models, each with a different water chemistry variable, e.g., chloride or pH, as the dependent variable. The independent variables in every model were the 5 predominant classes of land use/land cover in the region: forest, agriculture, urban, wetland, and barren. They compared r-squared values across the models to determine which chemistry variable was most closely associated with land use and found that chloride had the highest r-squared.
They repeated the entire process for a second type of land cover data and found that the results were similar.
| Dependent variable | Satellite | USGS maps |
|---|---|---|
| Chloride | 0.48 | 0.47 |
| Base cations (Ca, Mg, Na, K) | 0.40 | 0.36 |
| Nitrate | 0.39 | 0.34 |
| Acid Neutralizing Capacity | 0.34 | 0.30 |
| Total Phosphorus | 0.31 | 0.31 |
| Ammonium | 0.25 | 0.34 |
| Dissolved organic carbon | 0.20 | 0.13 |
| Turbidity | 0.18 | 0.16 |
| pH | 0.17 | 0.15 |
| Manganese | 0.12 | 0.17 |
| Sulfate | 0.11 | 0.16 |
| Silicon | 0.07 | 0.06 |
| Aluminum | 0.06 | 0.08 |
Table: Water chemistry variables for stream sites and the r-squared values from multiple regression models for land use/land cover data derived from satellite data and USGS digital maps.
Regression of diatom index on human disturbance:
Multiple independent tests of biological metrics and indexes ensure that the method is robust and not specific to a particular data set. To test the diatom multimetric index, the index was regressed on a PCA axis derived from chloride, total N, riparian condition measures, road density, and % urban, forest, agriculture, and mine cover. The axis represented a linear combination of the various measures of human disturbance.
The diatom multimetric index declined significantly as disturbance increased in Virginia and Pennsylvania. Regression was not significant for West Virginia, possibly because the intensity of human disturbance was less in this state.
The data used here were from 1995 and were independent of the data collected in 1993-94 used to test the metrics.
How the method works: In the standard method, all the independent variables enter into the equation simultaneously. The significance of each is assessed as if it had entered the regression after all the other independent variables were already in the model. You can test the significance of each independent variable, given the other variables, and you can test for significance of the entire model.
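The idea of assessing each independent variable "as if it had entered last" can be sketched as a partial F statistic: the extra variation explained when that variable is added to a model that already contains all the others. The code below is a minimal sketch, not the procedure of any particular package; the function names and the synthetic data are assumptions made for illustration. Each partial F would be compared with an F distribution on (1, residual df) degrees of freedom, and the overall F with one on (k, residual df).

```python
import numpy as np

def residual_ss(X, y):
    """Residual sum of squares from an OLS fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def simultaneous_tests(X, y):
    """Partial F for each IV (entered last) and the overall model F."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    df_resid = n - k - 1
    ss_full = residual_ss(X, y)
    mse = ss_full / df_resid
    # Drop each IV in turn: the rise in residual SS is what that IV
    # explains beyond all the other IVs already in the model.
    partial_f = np.array([
        (residual_ss(np.delete(X, j, axis=1), y) - ss_full) / mse
        for j in range(k)
    ])
    ss_tot = float(((y - y.mean()) ** 2).sum())
    overall_f = ((ss_tot - ss_full) / k) / mse
    return partial_f, overall_f, (k, df_resid)

# Synthetic data for illustration only: the third IV has no real effect.
rng = np.random.default_rng(1)
n = 80
X = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)

partial_f, overall_f, df = simultaneous_tests(X, y)
print("partial F per IV:", np.round(partial_f, 2))
print("overall F:", round(overall_f, 2), "on df", df)
```

In this setup each partial F equals the square of the usual t statistic for that coefficient, which is why the two tests give the same answer about an individual variable.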
Assumptions/limitations: All the assumptions of simple linear regression apply here as well. Plots of residuals will help detect non-linear relationships.
Multiple regression works best when the independent variables are not correlated with each other. A little correlation is acceptable, but too much degrades the solution and makes it unreliable: when you change which variables are on the right-hand side of the equation, you can get very different results regarding which are significant predictors of the dependent variable. One solution is to eliminate 1 variable from each highly correlated pair. You could also use PCA to derive fewer variables that are less related to one another.
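One way to screen for correlated independent variables before fitting is sketched below. It lists pairs whose correlation exceeds a cutoff and computes variance inflation factors (VIFs), a commonly used diagnostic that this page does not itself mention. The 0.8 cutoff, the function names, and the synthetic data are all assumptions for illustration; choose a cutoff that suits your own data.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.8):
    """Return (i, j, r) for each pair of columns with |r| >= threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    k = corr.shape[0]
    return [(i, j, float(corr[i, j]))
            for i in range(k) for j in range(i + 1, k)
            if abs(corr[i, j]) >= threshold]

def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))  # undefined if a column is an exact combination of others
    return np.array(vifs)

# Synthetic data for illustration only: column 1 is nearly a copy of column 0.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

pairs = highly_correlated_pairs(X)
vifs = variance_inflation_factors(X)
print("highly correlated pairs:", pairs)
print("VIFs:", np.round(vifs, 2))
```

A flagged pair is a candidate for dropping one of its two variables, as described above. A VIF near 1 suggests a variable is largely unrelated to the others.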
Different methods for model selection, such as forward or backward stepwise selection, can produce models with different subsets of independent variables, that is, different results. These search methods are often automated by the statistical software, so the user should choose the most appropriate method carefully.
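To make the search concrete, here is a minimal sketch of forward stepwise selection only. It starts with no independent variables and repeatedly adds the one with the largest partial F, stopping when no candidate exceeds an entry threshold. The threshold of 4.0, the function names, and the synthetic data are assumptions for illustration; statistical packages differ in their entry and removal rules, which is one reason their results can differ.

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares from an OLS fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def forward_stepwise(X, y, f_to_enter=4.0):
    """Return column indices in the order forward selection adds them."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    selected = []
    current_rss = _rss(np.empty((n, 0)), y)  # intercept-only model
    while len(selected) < k:
        best_j, best_f, best_rss = None, f_to_enter, None
        for j in range(k):
            if j in selected:
                continue
            trial = selected + [j]
            rss = _rss(X[:, trial], y)
            df_resid = n - len(trial) - 1
            # Partial F for adding column j to the current model.
            f_stat = (current_rss - rss) / (rss / df_resid)
            if f_stat > best_f:
                best_j, best_f, best_rss = j, f_stat, rss
        if best_j is None:  # no remaining candidate passes the threshold
            break
        selected.append(best_j)
        current_rss = best_rss
    return selected

# Synthetic data for illustration only: the third IV has no real effect.
rng = np.random.default_rng(3)
n = 80
X = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)

selected = forward_stepwise(X, y)
print("IVs in order of entry:", selected)
```

Backward elimination would start from the full model and remove variables instead. With correlated independent variables the two directions need not arrive at the same subset.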