Virtual Beach (VB)
Virtual Beach is a software package designed for developing site-specific statistical models for the prediction of pathogen indicator levels at recreational beaches.
VB is primarily designed for beach managers responsible for making decisions regarding beach closures due to pathogen contamination. However, researchers, scientists, engineers, and students interested in studying relationships between water quality indicators and ambient environmental conditions will find VB useful.
Virtual Beach version 3.0.4 (VB3.0.4) has been added to the CEAM site. VB3.0.4 reads input data from a text or Excel file, assists the user in preparing the data for statistical analysis, and provides three analytical techniques for model development: multiple linear regression (MLR), partial least squares regression (PLS), and a gradient boosting machine (GBM). Using an integrated mapping component to determine the geographic orientation of the beach, the software can automatically decompose wind/current speed and direction into along-shore and onshore/offshore components. VB3.0.4 can produce new variables from sets of variables in the input file (e.g., means, minimums, maximums, differences, sums, products), and it can test an array of transformations on the independent variables to maximize the linearity of the relationship between the response and those independent variables. In the MLR module, models with a high degree of multicollinearity are automatically excluded during the selection process. The PLS and GBM modules use 5-fold cross-validation during model development to avoid overfitting. The prediction module of VB3.0.4 has a direct link to the USGS EnDDaT system to automatically retrieve data for beach sites in the Great Lakes region.
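The decomposition of speed and direction into along-shore and onshore/offshore components can be sketched as follows. This is a minimal illustration, not VB's actual code: the function name shore_components, the use of compass bearings in degrees, and the sign conventions are all assumptions made for the example.

```python
import math

def shore_components(speed, direction_deg, beach_orientation_deg):
    """Split a speed/direction pair into along-shore and cross-shore parts.

    Assumptions (hypothetical, not taken from VB):
      - direction_deg is the compass bearing of the wind/current vector
      - beach_orientation_deg is the compass bearing of the shoreline
      - the sign of each component follows from those two bearings
    """
    # Angle between the vector and the shoreline
    delta = math.radians(direction_deg - beach_orientation_deg)
    along_shore = speed * math.cos(delta)   # parallel to shore
    cross_shore = speed * math.sin(delta)   # perpendicular to shore
    return along_shore, cross_shore

# A vector aligned with the shoreline has no cross-shore part;
# one at right angles to the shoreline has no along-shore part.
parallel = shore_components(10.0, 90.0, 90.0)
perpendicular = shore_components(10.0, 180.0, 90.0)
```

Whatever sign convention is chosen, the two components always recombine to the original speed, which is a convenient check when preparing input data.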
Applications and Possible Uses
The most common use of VB is to generate statistical models that predict pathogen indicator levels at freshwater and saltwater beach sites. Analyses have been performed at these locations:
- Predicting E. coli levels at Huntington Beach, OH (2000-2010).
- Predicting enterococci levels (culturable and qPCR) at various Great Lakes beaches: West Beach, Porter, IN; Washington Park, Michigan City, IN; Silver Beach, St. Joseph, MI; Huntington Beach, Bay Village, OH; South Shore, Milwaukee, WI.
- Predicting enterococci levels (culturable and qPCR) at various marine beaches: Goddard Beach, West Warwick, RI; Edgewater Beach, Biloxi, MS; Fairhope Beach, Mobile, AL; Hobie Beach, Miami, FL; La Monserratte, Puerto Rico; Boqueron Beach, Puerto Rico; Surfside Beach, Myrtle Beach, SC.
VB3 is a direct descendant of VB2 (whose most recent release is VB2.4.3). The original Virtual Beach Model Builder application (VB1) was developed by Walter Frick and Zhongfu Ge at the USEPA in Athens, GA. VB1 can be characterized as a linear regression model-building tool that supports a primarily manual analysis of data sets via visual inspection of data plots and manipulation of variables (e.g., transformations, creating interaction terms), followed by an iterative process of testing, comparing, and evaluating models. The fitness of developed models is computed and tracked, allowing for comparison and eventual selection of a “best” model for the dataset under consideration. This model can then produce estimates of pathogen indicator levels using current or forecasted environmental data from the site.
VB2 performed the same functions as its predecessor (visual inspection of univariate data plots, manual transformations of individual variables, MLR model building, prediction, etc.), but also automated and extended them in several ways:
- The Map component provided users with information on the location and availability of local data sources through the map interface. These sources include the USGS National Water Information System (NWIS), the National Climatic Data Center (NCDC), and the U.S. EPA STORET database (STORET). These sources provide recently collected and/or forecasted data for generating predictions by a chosen MLR model.
- The Map component provided a convenient method for defining beach orientation by overlaying the beach on current shoreline layers (satellite images, Google Maps, MS Virtual Earth, etc.). Given this orientation, VB2 could calculate wind, wave, or current components (the A-component is parallel to shore and the O-component is perpendicular to shore), which can be important predictor variables.
- Although manual processing and analysis of imported data (visual inspection of univariate data plots and the transformations/interactions of variables) was retained, the data processing component of VB2 provided automated generation of all possible 2nd order interaction terms amongst a set of independent variables (IVs), formation of more complex functions of multiple columns, and automated testing of a suite of variable transformations for improved model linearity. This functionality increased the number of models to evaluate during later selection routines and removed the burden of manual assessment placed on users of VB1.
- Within the linear regression analysis component, multicollinearity amongst predictor variables was handled automatically. Any model containing an IV with a high degree of correlation with other IVs (as measured by a large Variance Inflation Factor [VIF]) was removed from consideration during model selection.
- During MLR model selection, models were ranked by a user-selected evaluation criterion. Possible criteria include R², Adjusted R², Akaike Information Criterion (AIC), Corrected AIC, Predicted Error Sum of Squares (PRESS), Bayesian Information Criterion (BIC), Accuracy, Sensitivity, Specificity, or the model’s Root Mean Square Error (RMSE). Regardless of which criterion is chosen, the software records the ten best models in terms of that criterion. In comparison, VB1 had only a single comparative criterion, Mallows’ Cp.
- As the number of IVs in a dataset increases, the number of possible MLR models increases exponentially (once transforms and interactions are considered), so that even a modest number of IVs (12-13) yields trillions of possible models. VB2 implemented a Genetic Algorithm (GA) that effectively and efficiently searched for the best possible MLR model. Alternatively, VB2 users could perform an exhaustive calculation in which all possible combinations of IVs were tested if the number of possible models was reasonably small (< 500,000 or so). Both the GA and exhaustive approaches greatly expanded the model-building capabilities of VB2 compared to VB1.
- Users no longer had to enter data values in transformed, interacted, or component-decomposed form to make a prediction with a chosen MLR model. On the VB2 MLR Prediction tab, a user-selected model was coded into an input grid with data entry columns matching the main effects of the model. Any mathematical manipulation of these IVs was then performed automatically prior to making predictions.
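Several of the ranking criteria listed above can be computed from a fitted model's residuals and predictions. The sketch below uses common textbook least-squares forms of AIC, Corrected AIC, and BIC (defined up to an additive constant) together with threshold-based Accuracy, Sensitivity, and Specificity. The function names, the exact formulas, and the threshold value in the example are assumptions for illustration; the formulas VB uses internally may differ.

```python
import math

def information_criteria(rss, n, k):
    """Textbook least-squares forms (assumed, not confirmed as VB's).

    rss: residual sum of squares
    n:   number of observations
    k:   number of fitted parameters, including the intercept
    """
    aic = n * math.log(rss / n) + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
    bic = n * math.log(rss / n) + k * math.log(n)
    return {"aic": aic, "aicc": aicc, "bic": bic}

def decision_metrics(observed, predicted, threshold):
    """Score predictions as exceed / do-not-exceed decisions at a threshold."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        obs_exceeds = obs > threshold
        pred_exceeds = pred > threshold
        if obs_exceeds and pred_exceeds:
            tp += 1          # exceedance correctly predicted
        elif obs_exceeds:
            fn += 1          # exceedance missed
        elif pred_exceeds:
            fp += 1          # false alarm
        else:
            tn += 1          # non-exceedance correctly predicted
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Illustrative data only; the threshold of 235.0 is a hypothetical value.
observed = [100.0, 300.0, 500.0, 50.0, 240.0]
predicted = [120.0, 200.0, 600.0, 40.0, 260.0]
metrics = decision_metrics(observed, predicted, threshold=235.0)
criteria = information_criteria(rss=50.0, n=100, k=3)
```

Lower values of AIC, Corrected AIC, and BIC indicate a better trade-off between fit and complexity, whereas higher values are better for Accuracy, Sensitivity, and Specificity, so any ranking routine has to account for the direction of the chosen criterion.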
VB3 builds on VB2 primarily by adding statistical methods that give users more flexibility in modeling their datasets. In addition to MLR, users can now use Partial Least Squares (PLS) regression and a Gradient Boosting Machine (GBM) to fit their data and make predictions. The redesigned software architecture (using DotSpatial libraries) can now easily accommodate future expansions of the suite of modeling tools. The Prediction tab of VB3 also allows direct interaction with the USGS’s data acquisition system, EnDDaT, for automated dataset construction and easier prediction of fecal indicator bacteria (FIB) from web-accessible data.
Technical Support and Training
Questions regarding the Virtual Beach application and its supporting software and documents can be directed to Mike Cyterski at EPA’s National Exposure Research Laboratory in Athens, GA. A user’s manual for VB3 is currently available. As training materials (including video tutorials) become available, they will be posted here.
Quality Assurance and Quality Control
VB2 and VB3 have undergone quality assurance testing to ensure their computations are consistent with other statistical packages (R and SAS), and the user’s manuals for each version are internally reviewed.