Methodology and Interpretation
Escherichia coli Cell Count Predictions (bacterial cells/100 ml)
Methodology
The 244 water-quality sampling locations were used to delineate 244 subwatersheds. The
study area can be depicted as a grouping of the 244 subwatersheds, each of which
contains a single hydrologic outlet (sometimes called a "pour point"). The 244
subwatersheds facilitate the mapping of landscape metrics in landscape reporting units.
The 244 subwatersheds provide the basis for the statistical development of water-quality
vulnerability indicators. It is important to understand that some of the 244 subwatersheds
are nested completely within other larger subwatersheds, and thus the total area of the
244 subwatersheds exceeds the total study area. The value of using this unconventional
view of the landscape is that the cumulative effects of landscape condition on water
quality can be assessed, thereby increasing the predictive power of any determined
relationships between land-cover and water-quality parameters. Thus, in the
two-dimensional metric and indicator maps, only a portion of some larger subwatersheds is
shown, but in the digital browser there is an additional capability for viewing the nested
(i.e., stacked) subwatersheds and viewing a synopsis of the landscape metrics and other
pertinent information for all subwatersheds.
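The nesting described above can be illustrated with a minimal sketch. The pour-point names, areas, and downstream links below are hypothetical and are not taken from the study data; the point is only that summing the areas of nested subwatersheds counts upstream land more than once.

```python
# Hypothetical pour points: incremental drainage area (km^2) and the
# next pour point downstream (None marks the outlet of the study area).
incremental_area = {"A": 10.0, "B": 15.0, "C": 25.0}
downstream = {"A": "C", "B": "C", "C": None}  # A and B both drain to C


def cumulative_area(point):
    """Area of the full subwatershed draining to `point`, including nested ones."""
    upstream = [p for p, d in downstream.items() if d == point]
    return incremental_area[point] + sum(cumulative_area(p) for p in upstream)


subwatershed_area = {p: cumulative_area(p) for p in incremental_area}
study_area = sum(incremental_area.values())      # each parcel counted once
total_nested = sum(subwatershed_area.values())   # nested parcels counted repeatedly

print(subwatershed_area, study_area, total_nested)
```

In this sketch the subwatershed for C contains those for A and B, so the summed subwatershed area is larger than the study area.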
The Missouri land-cover data set and the Arkansas land-cover data set were imported and
evaluated using Imagine image processing software (ERDAS v. 9.0). While evaluating
the forest class in the Missouri land-cover data, it was discovered that the original
classification did not fully capture the forest area. The forest areas that were lacking
were filled in by classifying and merging a new forest layer from multiple Landsat
Thematic Mapper (TM) imagery 'scenes'. The new Missouri land-cover
was superimposed onto 2003 county Digital Ortho Quarter Quadrangles (DOQQ). The
preliminary land-cover map was edited and updated to match the 2003 DOQQs, using
aerial photographic interpretation techniques. Edits were made to the forest, urban,
water, and agriculture classes. The Arkansas land-cover was then superimposed onto the
DOQQs and updated. The updated state land-cover maps were exported as ArcInfo
(ESRI ArcGIS v. 9.0) Grids. Each land-cover classification schema was aggregated to
meet the project's classification scheme requirements. The land-cover grids
for Missouri and Arkansas were merged to create a unified land-cover map for 2003,
and used for statistical analyses of landscape metrics (i.e., Landscape Metric Maps - Analysis by Decade Status,
and Landscape Metric Change Maps - Analysis by Decadal Interval).
For each of the selected sites, the watershed support area was delineated and a suite of
landscape metrics was calculated. A total of 46 landscape metrics were tested among
watersheds. Measured total phosphorus, total ammonia, and E. coli values were available at
only 18, 6, and 15 sites, respectively. Landscape metrics were for the year 2003, and surface
water constituents were averaged over the period 1997-2002. Each of the surface water
constituents from the above sites was used in a Partial Least Squares analysis (PLS) to
predict water-quality values for all of the 244 subwatersheds.
PLS is a multivariate analysis technique that permits analysis and prediction for data sets
with missing values, with collinearity, and with a relatively small number of observations
(for additional information about PLS see references, below). In the PLS analyses, both
data sets (e.g., water and landscape variables) are first centered and scaled. A linear
combination of the independent variables is formed (T = L0 W, where L0 holds the landscape
variables, T is the score, and W is the weight), yielding a set of orthogonal latent
variables [T] that are fewer in number (dimensions) than the original landscape variables.
The linear combination in [T] is formed so that the covariance between [T] and the linear
combination of the dependent variables [U] is maximized (U = B0 V, where B0 holds the water
variables, U is the score, and V is the weight). Prediction of both water and landscape data
is via regression on the common latent variables (T). Modeling and prediction in PLS,
therefore, are not based solely on the conditional distribution of the response variables
(water variables) given the independent variables (landscape variables); instead, PLS
accounts for both landscape and water together through [T].
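As a rough illustration of the first-factor extraction described above, the following sketch handles the simplest case of a single water-quality response. In that case the unit-length weight vector that maximizes the covariance between the scores and the response is proportional to the cross-product of the centered and scaled landscape data with the response. The function names and the five-site data set are hypothetical, and this is not the SAS procedure used in the study.

```python
import math


def center_scale(column):
    """Return the column centered to mean 0 and scaled to unit sample std."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def first_pls_factor(X, y):
    """Weights w and scores t of the first PLS factor for one response.

    X: list of rows (sites) of landscape metrics.
    y: one water-quality value per site.
    """
    n, p = len(X), len(X[0])
    # Center and scale every landscape column and the response.
    cols = [center_scale([row[j] for row in X]) for j in range(p)]
    y0 = center_scale(y)
    # Weight vector proportional to L0' y0, normalized to unit length.
    w = [sum(cols[j][i] * y0[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores T = L0 W: one value per site.
    t = [sum(cols[j][i] * w[j] for j in range(p)) for i in range(n)]
    return w, t, y0


# Hypothetical data: 5 sites, 2 metrics (percent urban, stream density).
X = [[5.0, 2.1], [12.0, 1.8], [20.0, 1.5], [33.0, 1.1], [41.0, 0.9]]
y = [40.0, 85.0, 150.0, 260.0, 310.0]
w, t, y0 = first_pls_factor(X, y)
print("weights:", w)
print("scores:", t)
```

The sign of each weight shows the direction of association in this toy data set: positive for the first metric and negative for the second.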
PLS produces n-1 factors, with each factor containing a pair of scores (Ti, Ui). Linear
combinations on each data set are called factors. The procedure described above extracts the
first factor. PLS extracts the second factor using the residuals from the first, again
finding the linear combinations of both data sets such that their covariance is maximized.
This process is repeated by taking residuals from the previous factor, producing n-1 factors,
where n is the number of observations. For example, if the number of sites (observations)
is 89, then 88 factors will be produced. Not all of these factors are significant under the
Cross Validation (CV) method; only the significant factors are used in the final model.
When applying CV, the data set is divided into groups (5 to 9 groups; see references in Nash
et al., 2005), and each group is held out in turn as a test data set while the model is
fitted to the remaining groups. The fitted models are tested using the test data sets, and
the predicted values are compared with the observed values using PRESS (Predicted Residual
Sum of Squares) to assess the predictive ability of the model. SAS gives the root mean PRESS
and its significance level (the lower the root mean PRESS, the better the model).
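The cross-validation step can be sketched as follows for a single response. This is a simplified illustration, not the SAS implementation: the predictors are centered but not scaled within each fold, the groups are formed by taking every fifth site, and the root mean PRESS is computed as the square root of PRESS divided by the number of sites, which may differ from the normalization SAS reports. The function names and the ten-site data set are hypothetical.

```python
import math


def fit_pls1(X, y, n_factors):
    """Fit a single-response PLS model by NIPALS; return a predict(row) function."""
    n, p = len(X), len(X[0])
    x_mean = [sum(r[j] for r in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[r[j] - x_mean[j] for j in range(p)] for r in X]  # centered predictors
    f = [v - y_mean for v in y]                            # centered response
    W, P, Q = [], [], []
    for _ in range(n_factors):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:          # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: the next factor is extracted from the residuals.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        W.append(w)
        P.append(load)
        Q.append(q)

    def predict(row):
        e = [row[j] - x_mean[j] for j in range(p)]
        y_hat = y_mean
        for w_a, load_a, q_a in zip(W, P, Q):
            t_a = sum(e[j] * w_a[j] for j in range(p))
            y_hat += q_a * t_a
            e = [e[j] - t_a * load_a[j] for j in range(p)]
        return y_hat

    return predict


def root_mean_press(X, y, n_factors, n_groups=5):
    """Hold out each group in turn and accumulate squared prediction errors."""
    n = len(X)
    press = 0.0
    for g in range(n_groups):
        test = [i for i in range(n) if i % n_groups == g]
        train = [i for i in range(n) if i % n_groups != g]
        predict = fit_pls1([X[i] for i in train], [y[i] for i in train], n_factors)
        press += sum((y[i] - predict(X[i])) ** 2 for i in test)
    return math.sqrt(press / n)


# Hypothetical data: 10 sites, 3 landscape metrics, one water-quality value.
X = [[5, 2.1, 0.4], [12, 1.8, 0.9], [20, 1.5, 1.6], [33, 1.1, 2.4], [41, 0.9, 3.1],
     [8, 2.0, 0.5], [15, 1.7, 1.2], [27, 1.3, 2.0], [36, 1.0, 2.7], [45, 0.8, 3.5]]
y = [40, 85, 150, 260, 310, 60, 110, 205, 280, 350]
for a in (1, 2):
    print(a, "factor(s): root mean PRESS =", root_mean_press(X, y, a))
```

Comparing the printed values across candidate numbers of factors mirrors the selection logic in the text: factors that do not lower the cross-validated error are not retained.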
After the significant PLS factors are defined, scores, weights, and VIP (Variable Influence
on Projection) are used to examine the strength of the relationship, irregularities, and the
contribution of each independent variable (landscape) to the model. If the VIP for an
independent variable is small in value, that variable makes a relatively small
contribution to the model and may be deleted from the model. VIP
values of less than 0.8 are considered to be small. The quality of the model was
determined by examining the residuals for both the response and the landscape variables.
An examination of any possible outliers using residuals was carried out to finalize the
fitted PLS model. SAS was used for statistical analyses.
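The VIP screening described above can be sketched with the commonly used formula, in which each predictor's squared, normalized weight in a factor is weighted by the response sum of squares that factor explains. The weights and explained sums of squares below are hypothetical values for a two-factor model with four landscape metrics; they are not results from this study, and SAS may apply a different normalization.

```python
import math


def vip_scores(weights, explained_ss):
    """Variable Influence on Projection for each predictor.

    weights[a][j]: weight of predictor j in factor a.
    explained_ss[a]: response sum of squares explained by factor a.
    """
    p = len(weights[0])
    total = sum(explained_ss)
    vip = []
    for j in range(p):
        acc = 0.0
        for w, ss in zip(weights, explained_ss):
            norm_sq = sum(v * v for v in w)
            acc += ss * w[j] ** 2 / norm_sq
        vip.append(math.sqrt(p * acc / total))
    return vip


# Hypothetical two-factor model with four landscape metrics.
weights = [[0.7, 0.5, 0.5, 0.1], [0.1, -0.6, 0.3, 0.73]]
explained_ss = [8.0, 1.5]
vip = vip_scores(weights, explained_ss)

# Flag predictors below the 0.8 threshold mentioned in the text.
small = [j for j, v in enumerate(vip) if v < 0.8]
print(vip, small)
```

With this formula the squared VIP values sum to the number of predictors, so a value near 1 marks a predictor of average influence.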
Interpretation
The E. coli PLS model resulted in two significant factors explaining 99.7% of the variability
in E. coli cell counts at a subwatershed's pour point. As in the total phosphorus PLS model,
stream density in a subwatershed is an important inverse correlate for E. coli. Other
important positive correlates are percent impervious surfaces, percent urban, and road
density within a subwatershed, as well as percent urban in close proximity to streams.
(The following excerpt is from Monitoring Water-Quality: Volunteer Stream Monitoring
- A Methods Manual - U.S. EPA Report Number EPA/841/B-97/003)
Members of two bacteria groups, coliforms and fecal streptococci, are used as indicators
of possible sewage contamination because they are commonly found in human and
animal feces. Although they are generally not harmful themselves, they indicate the
possible presence of pathogenic (disease-causing) bacteria, viruses, and protozoans that
also live in human and animal digestive systems. Therefore, their presence in streams
suggests that pathogenic microorganisms might also be present and that swimming and
eating shellfish might be a health risk.
Sources of fecal contamination to surface waters include wastewater treatment plants, on-
site septic systems, domestic and wild animal manure, and storm runoff. In addition to
the possible health risk associated with the presence of elevated levels of fecal bacteria,
they can also cause cloudy water, unpleasant odors, and an increased oxygen demand.
Escherichia coli (E. coli) is a species of fecal coliform bacteria that is specific to fecal
material from humans and other warm-blooded animals. EPA recommends E. coli as the
best indicator of health risk from water contact in recreational waters.
References
Helland, I. S., 1988. On the structure of partial least square regression. Commun.
Statist. Simula. 17(2), 581-607.
Lindberg, W., Persson, J-A, and Wold, S., 1983. Partial Least-Square method for
spectrofluorimetric analysis of mixture of humic acid and lignisulfonate. Anal. Chem.
55, 643-648.
Nash, M.S., Chaloud, D., and Lopez, R.D., 2005. Multivariate Analyses (Canonical
Correlation Analysis and Partial Least Square, PLS) to Model and Assess the
Association of Landscape Metrics to Surface Water Chemical and Biological Properties
using Savannah River Basin Data. United States Environmental Protection Agency.
EPA/600/X-05/004. 82pp.
SAS (SAS Institute), 1998. Version 9 User's Guide. SAS Institute Inc., Cary, NC.
Wold, S., 1995. PLS for multivariate Linear Modeling. In: H. van de Waterbeemd
(Editor), Chemometric methods in molecular design methods and principles in medicinal
chemistry. Verlag-Chemie, Weinheim, Germany, p.195-218.

Quantile: Each class contains an approximately equal number (count) of features. A quantile
classification is well-suited to linearly distributed data. Because features are grouped by the number
within each class, the resulting map can be misleading, in that similar features can be separated into
adjacent classes, or features with widely different values can be lumped into the same class. This
distortion can be minimized by increasing the number of classes. For continuity of the browser content,
and consistency among maps, legend gradients are from higher values (red) to lower values (green).
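A minimal sketch of a quantile classification, using hypothetical predicted E. coli counts for eight subwatersheds, shows both effects noted above: similar values can fall into adjacent classes, and widely different values can share a class. The function name is illustrative, and GIS software may handle ties and class breaks differently.

```python
def quantile_classes(values, n_classes):
    """Assign each value a class index 0..n_classes-1 with near-equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = rank * n_classes // len(values)
    return classes


# Hypothetical predicted E. coli counts (cells/100 ml) for eight subwatersheds.
counts = [12, 15, 16, 18, 240, 250, 900, 4000]
classes = quantile_classes(counts, 4)
print(list(zip(counts, classes)))
```

Here 15 and 16 land in different classes, while 900 and 4000 land in the same class, because class membership depends only on rank.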
Metric input GIS data:
- Water-quality sampling locations - Metadata