Green Bay Mass Balance
Draft Interim Project Deliverable
WORK ASSIGNMENT MANAGER:
Mr. Robert King
This report presents the results of an effort to create an annotated inventory of selected files from the Green Bay Mass Balance (GBMB) Study. The inventory was compiled by Tetra Tech, Inc. under contract to EPA's Office of Water/Assessment and Watershed Protection Division (AWPD) under EPA Contract No. 68-C3-0303. This deliverable is the first under Task 1 of Work Assignment 042.
GREEN BAY MASS BALANCE ANNOTATED INVENTORY
GREEN BAY MASS BALANCE DATA SETS
Annotated Inventory, Diskettes 13 - 51
October 29, 1993
The GBMB Study was conducted in 1990-91 to pilot the technique of mass balance analysis in understanding the sources and effects of toxic pollutants in the Great Lakes' food chain. The study, headed by EPA's Great Lakes National Program Office (GLNPO) and the Wisconsin Department of Natural Resources, had many participants from the Federal, state, interagency, and academic communities. The study focused on four representative chemicals or chemical classes: PCBs, dieldrin, cadmium, and lead.
The data collected during the GBMB Study are key to many ongoing environmental initiatives:
Their primary function is to test the mass balance technique for toxics loadings determination.
The data will serve as a key "reality check" for the Water Systems Modernization program, allowing comparison and contrast of the STORET X logical model to an actual monitoring program to measure how comprehensively STORET X serves this particular user community.
These data will help realize the integration of environmental compliance data with water quality monitoring data via EPA's Gateway/ENVIROFACTS development effort.
Data collected from the GBMB Study are held by EPA's Large Lakes and Rivers Research Station (LLRC) in Grosse Ile, Michigan.
GBMB DATA SETS
GBMB data are stored as data sets on approximately 400 floppy diskettes. One diskette represents one data collection effort. There was no consistent size to the data sets. Any one diskette could have part of a data set or more than one data set on it. A data set could have one or more files in it. In all, there are an estimated 24,000 distinct data sets among the 400 diskettes.
Fifty-one diskettes were selected by LLRC for this study because they were representative of the entire data collection. The fifty-one diskettes had different contents, and each contained one or more files in a variety of formats. Twelve of the diskettes held data created by summarizing data on all the other (388) diskettes; those twelve were therefore omitted from this inventory as duplicative. The remaining 39 diskettes had approximately 805 files.
The first step in creating this inventory was to verify each diskette for "readability" (i.e., trying to open the diskette on a PC and looking over the data that appeared). A listing of the data files on each diskette was prepared next. All documentation files were then reviewed and compared to the raw data files. The documentation files contained notes from the principal investigator who created the data files, including information on how to read the files and important remarks or qualifications about their data. The data contents of each file were summarized, described, and annotated for special features. The results of this effort are in Appendix A.
STRUCTURE OF THIS REPORT
This report presents the inventory of the first 39 of the 400 GBMB data diskettes. Its purpose is to present the interim findings from creating the inventory and to provide a representation of GBMB data that can be compared and contrasted with the STORET X logical model. In addition to this Chapter 1, which introduces the study, there are two chapters and one appendix:
Chapter 2 Findings, discussing the characteristics of the GBMB data sets
Chapter 3 Conclusions/Next Steps, presenting the inferences derived and the lessons learned from inventorying the first thirty-nine GBMB diskettes.
Appendix A contains the actual inventory of the thirty-nine inventoried GBMB diskettes (#13-51).
ORGANIZATION OF THE GBMB INVENTORY
Appendix A contains the inventory of GBMB diskettes thirteen through fifty-one. Each data collection effort is summarized with the following information (to the extent that it was present):
Diskette number and brief description: An overview of who collected the data, and why the data were collected
Format: The data structure and data element names for a given set of files, repeated if there are multiple file types/organizations on a diskette; the data files are typically in ASCII format, and the information in this section would include start/stop columns and variable descriptions
Example data: An excerpt of the data from the diskette being described, repeated for each file type on a diskette
Notes: Coding information on how the files were named, where the stations are located, and how parameters are coded; parameter codes may represent fish species, sampling equipment, investigator comments, lab notes, or field notes
Quality Assurance (QA) and Ancillary Notes: All summary QA data found on the diskette, as well as other miscellaneous notes that provide additional insight on the data characteristics
Directory: A listing of all files; in most cases, the diskettes provided a high-level annotated list of what each file contained.
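Because the Format entries record start/stop columns and variable descriptions for ASCII files, such an entry could in principle be applied mechanically to read a record. The following is a minimal Python sketch of that idea. The layout, the field names (`station`, `date`, `pcb_ng_l`), the column positions, and the sample record are all hypothetical; none are taken from an actual GBMB diskette.

```python
# Hypothetical layout in the style of a Format entry:
# (field name, start column, stop column), 1-based and inclusive.
# None of these names or positions come from an actual GBMB file.
LAYOUT = [
    ("station", 1, 6),
    ("date", 8, 15),
    ("pcb_ng_l", 17, 24),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width ASCII record into a dict of stripped strings."""
    record = {}
    for name, start, stop in layout:
        # Convert 1-based inclusive columns to a 0-based Python slice.
        record[name] = line[start - 1:stop].strip()
    return record


# Build an invented 24-character record that matches the layout above.
sample = f"{'GB0012':<6} {'19900615':<8} {'12.5':>8}"
print(parse_fixed_width(sample))
```

Values are kept as strings here because, as the findings describe, the same column may hold numbers, codes, or comments depending on the file.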
Organizing the descriptions in this format was most useful for ASCII-formatted diskettes. The diskettes in spreadsheet format (diskettes 32, 33, 34, 35, 36, and 37) were more difficult to annotate because of their width; some extended to over 150 columns. For these diskettes, the description in Appendix A represents only an example portion of the actual spreadsheets, capturing column headers (i.e., variable names) and several data elements.
There are several generalizations or observations that were noted during creation of the GBMB annotated inventory. These are presented next.
Avoiding duplicate data file review. It was common to have individual files for specific year-and-station combinations. In some cases, data were even further subsetted (e.g., analyzed vs. not analyzed for GBMB or pollutant-specific). Care had to be taken to ensure that the same data were not being analyzed twice because they were in different formats. An illustration, from Diskette #13, is shown below:
Data of Unpredictable Format and Content. Certain features of the GBMB data sets (e.g., chemical variables, location-fixed station) were very similar across all the data sets in which they appear. Other features, such as media, documentation, and format, were quite dissimilar across data sets. The largest dissimilarities would typically occur from data set to data set. On some diskettes, the format varies from one file to another.
Different file organization. Data may be organized horizontally or vertically depending on the diskette. In addition, data in some records were compressed, so that they actually contained information that would normally be in several records (e.g., relative humidity).
Stations already in the current STORET. Some data use STORET station numbers as the key that links the data records to station names. An example from Diskette #20 looks like:
Diskettes with both raw and processed data. Some diskettes contained DMR data in both processed format (e.g., monthly average, minimum and maximum flow, and suspended solids) and raw daily data values.
"Moving" stations. Some station locations were changed. Both the old and new station locations have the same station name (i.e., same name, different locations).
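A check for this situation could group records by station name and flag any name associated with more than one location. The sketch below is illustrative only: the station names, the coordinates, and the helper `moved_stations` are hypothetical and do not come from the GBMB diskettes.

```python
from collections import defaultdict

# Hypothetical (station name, latitude, longitude) records; not GBMB values.
records = [
    ("STN-A", 44.60, -87.95),
    ("STN-A", 44.62, -87.91),
    ("STN-B", 44.80, -87.70),
    ("STN-B", 44.80, -87.70),
]


def moved_stations(rows):
    """Return station names that appear with more than one distinct location."""
    locations = defaultdict(set)
    for name, lat, lon in rows:
        locations[name].add((lat, lon))
    return {
        name: sorted(locs)
        for name, locs in locations.items()
        if len(locs) > 1
    }


print(moved_stations(records))
```

Exact coordinate comparison is assumed for simplicity; real data might need a distance tolerance to separate a relocated station from a rounding difference.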
Drifters. Some files contained current meter data. Surface drifters were set in place and their velocities were traced over time to measure drift.
Imprecise values. Some data were reported to be below the level of detection (LOD) and level of quantification (LOQ). In those cases, the reported value, the LOD, and the LOQ were given.
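One way to carry all three numbers through a conversion is to keep them together and derive a qualifier from them. The sketch below assumes a common convention (a value below the LOD is "below LOD", and a value at or above the LOD but below the LOQ is "below LOQ"), which is not necessarily the convention the GBMB investigators used. The class name, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    value: float  # reported value
    lod: float    # level of detection
    loq: float    # level of quantification


def qualifier(m):
    """Classify a reported value against its LOD and LOQ (assumed convention)."""
    if m.value < m.lod:
        return "below LOD"
    if m.value < m.loq:
        return "below LOQ"
    return "quantified"


# Invented values for illustration only.
for m in (Measurement(0.02, 0.05, 0.15),
          Measurement(0.08, 0.05, 0.15),
          Measurement(0.40, 0.05, 0.15)):
    print(m.value, qualifier(m))
```

Keeping the reported value alongside the qualifier, rather than replacing it, preserves what the diskettes actually recorded.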
Variable QA data. For duplicate samples, QA data include an analysis of the coefficient of variation (i.e., more than just the lab blanks, spikes, and duplicates). For example:
Undefined codes. Some investigator comments were coded, but the meanings of the codes were not provided.
Use of different sampling instrumentation. Some parameters in a file were measured at the same place and time, but using different sampling methods.
Negative precipitation values. Small negative readings for the precipitation fields are explained by electrical noise or "drift" from the Campbell data logger. These negative readings often are counterbalanced by small positive readings. This is also the case for daily totals.
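One possible implication of this note is that the signed readings should be summed as recorded, since setting negative readings to zero before summing would leave only the positive noise and bias the total upward. The short sketch below illustrates this arithmetic with invented readings; they are not values from the Campbell data logger files.

```python
# Hypothetical logger readings for one dry day (mm); invented for illustration.
readings = [0.01, -0.01, 0.02, -0.02, 0.0, 0.01, -0.01]

# Sum the readings as recorded, so positive and negative noise can cancel.
signed_total = sum(readings)

# Sum after setting negative readings to zero, which keeps only positive noise.
clipped_total = sum(max(r, 0.0) for r in readings)

print(f"signed total:  {signed_total:.3f}")
print(f"clipped total: {clipped_total:.3f}")
```

In this invented case the signed total is essentially zero, while the clipped total reports precipitation that did not occur.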
Files of unusual formats. Some files on the same diskette had small amounts of data of widely ranging formats.
Different types of dates. Some data, such as sediment trap data, include dates that may not be the same as the sample collection date, such as deployment and recovery dates.
Sediment core data. Sediment core depths may be described as a range and mass summed over depth. For example:
Data of different quality. Some data files contained both corrected and uncorrected data (e.g., blank corrected).
Spreadsheet data questions. Spreadsheets in particular had unusual characteristics, notably:
Some data are reported with more decimal places than the "READ ME" file suggested was accurate.
The user is not restricted to a common field width, and longer comments may be covered up, making conversion to other software more difficult.
Files generally tended to be very wide (e.g., up to column "FN").
Blank records are added as natural breaks in the data, but they create problems for interpretation.
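To put the spreadsheet width in perspective, a column label can be converted to a column number by treating its letters as digits in a 26-letter numbering scheme (A = 1 through Z = 26, then AA = 27, and so on). The helper below is a hypothetical illustration, not part of any GBMB tooling.

```python
def column_number(label):
    """Convert a spreadsheet column label such as 'FN' to its 1-based number."""
    n = 0
    for ch in label.upper():
        # Each letter is a "digit" from 1 (A) to 26 (Z).
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


print(column_number("FN"))  # 6 * 26 + 14 = 170
```

Column "FN" is therefore the 170th column, which is consistent with the spreadsheets described as extending to over 150 columns.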
These special characteristics of the many data files made it necessary to read each file to create the annotated inventory. Such an exercise may have to be done for all GBMB files (i.e., the files on the remaining 350 diskettes) to properly convert and enter their data into STORET X or any other data base.
There were several lessons learned in creating this interim GBMB annotated inventory. This chapter discusses key issues that arose during this inventory effort, and provides comments on the implications of or decisions necessary for integration/migration of all GBMB data into an integrated data collection.
EACH DATA SET MUST BE INDIVIDUALLY REVIEWED TO CREATE AN INTEGRATED DATA BASE.
This annotated inventory of the first 51 GBMB diskettes is sufficient to proceed with the STORET X/GBMB comparison because the data in the 51 diskettes are representative of the entire data collection. To create a data base that integrates all the types of data in the GBMB data set, however, each file still must be individually addressed. This is because there are as many types of data, formats, and documentation as there are data sets. No assumptions can be made that any data set is the same as another. This will require a considerable effort, in some cases with the cooperation of the principal investigator who created the data file(s).
DATA SETS SHOULD BE PRIORITIZED FOR INCLUSION IN AN INTEGRATED DATA BASE
A prudent approach to the large effort of creating an integrated GBMB data base from these diskettes would be to assign priorities to the files contained on them. Some files would be considered of paramount importance and would be migrated into the integrated data base first. Others would be considered of secondary importance and would be migrated as resources and/or time allow. Yet others might be omitted from the data base entirely. Highest priority should be given to those data sets that are:
A lower priority should be given to poorly documented data sets, or data sets containing drastically different formats from one file to the next. This type of prioritization may be accomplished with input from LLRC staff and investigators.
THE DATA BASE RESULTING FROM THIS STUDY SHOULD HAVE DATA ONLY FROM THE GBMB DISKETTES
Based upon further research, it was determined that the STORET data base contains data for some of the stations used for GBMB sampling. Those data were determined to be primarily conventional pollutants (N, P, BOD, etc.) to determine eutrophication and not toxics loadings. Although inclusion of those data will enrich the eventual GBMB integrated data base, the data base created under this effort should be limited to data from the GBMB diskettes.
SOME GBMB DATA ARE OF VALUE EVEN THOUGH THEY MAY BE OUTSIDE THE SCOPE OF STORET X
It is already clear that some of the data in the GBMB data collection are not in conformance with the STORET X logical model. For example, some data sets contain "calculated" data (data derived from raw data), yet the current STORET X logical model is limited to sample (i.e., raw) data only. Some files are missing documentation, have non-water-quality/biological monitoring data, or may have inadequate/absent locational data. These data may be of great value if they are supplemented by some further research. The STORET X logical model should be used as a way of identifying dissimilarities in the data of the GBMB program, but not as a basis for excluding data from the eventual integrated data base. In addition, some further research, such as obtaining missing latitude/longitude data, may make otherwise incompatible but valuable data sets usable.