CADDIS Volume 4: Data Analysis
Exploratory Data Analysis
Author: M. McManus
Mapping Data: Spatial Analysis and GIS
An important step in a causal analysis is to define and map the spatial extent of your study area. A map of the study area can help identify other sources of data, facilitate exploratory data analysis, and highlight samples in which spatial autocorrelation may be an issue. Being able to combine data from many different sources is both a strength and a weakness of using a geographic information system (GIS) to produce a map (Waller and Gotway 2004). Brewer (2006) provides some basic principles for mapping data in GIS. Some common questions to ask when mapping your study area include:
1. In what watershed does the study occur?
The Watershed Boundary Dataset (WBD), provided by the U.S. Department of Agriculture’s Natural Resources Conservation Service on the Geospatial Data Gateway, contains the hierarchy and areas for the six nested levels of hydrologic units (region, subregion, basin, subbasin, watershed, and subwatershed). The numbering scheme for the hydrologic units increases by two digits per level, beginning with a two-digit hydrologic unit code for a region and ending with a twelve-digit hydrologic unit code for a subwatershed. The WBD also describes different types of hydrological modification, such as stormwater ditches, levees, and navigation canals, at the watershed and subwatershed scales. Such modifications may be candidate causes to consider in the analysis.
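Because each level adds two digits, a hydrologic unit code can be split into its nested parents by taking prefixes of increasing length. The following is a minimal sketch of that idea, not part of the WBD itself; the function name `split_huc` and the example code value are hypothetical and chosen only for illustration.

```python
# Names of the six nested hydrologic unit levels, ordered from largest to smallest.
HUC_LEVELS = ["region", "subregion", "basin", "subbasin", "watershed", "subwatershed"]


def split_huc(code):
    """Return a dict mapping each level name to the matching prefix of the code.

    A 2-digit code yields only the region; a 12-digit code yields all six levels.
    """
    code = code.strip()
    if not code.isdigit() or len(code) % 2 != 0 or not 2 <= len(code) <= 12:
        raise ValueError(f"expected an even-length code of 2 to 12 digits, got {code!r}")
    levels = {}
    for i, name in enumerate(HUC_LEVELS):
        width = 2 * (i + 1)  # each level adds two digits
        if width > len(code):
            break
        levels[name] = code[:width]
    return levels


# Hypothetical 12-digit code, used only to show the nesting.
example = "050600010405"
for level, prefix in split_huc(example).items():
    print(f"{level:>12}: {prefix}")
```

Splitting the code this way makes it straightforward to group sampling sites by any level of the hierarchy, for example by comparing the first eight digits to find sites in the same subbasin.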
2. What rivers and streams flow through the study area?
NHDPlus is a geospatial dataset that provides the locations of streams and rivers and incorporates elements from the National Hydrography Dataset (NHD), the National Elevation Dataset (NED), the National Land Cover Dataset (NLCD), and the WBD. The U.S. Geological Survey’s StreamStats web site provides stream-flow statistics and drainage-basin characteristics.
3. In what ecoregion does the study area occur?
An ecoregion is an area with similar environmental resources, such as vegetation, climate, soils, and geological substrate. Regions with similar topography, climate, and geology are expected to have water bodies that are similar in hydrology and water chemistry. Knowing the ecoregion may allow you to compare the measurements in your study area to measurements from other water bodies in a relevant region, or to select the data to be included in exposure-response modeling. Descriptions of the ecoregions and data on ecoregions can be downloaded from the National Atlas web site.
4. What administrative boundaries occur in the study area?
The Topologically Integrated Geographic Encoding and Referencing (TIGER) data provided by the U.S. Census Bureau contain county, metropolitan, and urban area boundaries, and the Census Bureau’s demographic data can be linked to the geographic data. Other datasets, such as transportation networks and the National Wetlands Inventory, are available at Geospatial One Stop (GOS).
5. Can water quality monitoring sites from the case be mapped?
Other sources for spatial data include state environmental protection agencies and state natural resource agencies. Metadata for these spatial datasets should include information on the coordinate system, spatial extent, and descriptions of the variables, how and when the data were collected, and contact information for the creators and managers of the data.
6. What software is available?
A variety of GIS software can be used, including ArcMap (ESRI), R (CRAN Task View: Analysis of Spatial Data), MapWindow (MapWindow Open Source GIS), and the Geographic Resources Analysis Support System (GRASS GIS). Analysts handling spatial data will need a working knowledge of GIS software so that they can perform basic GIS operations such as spatial queries, layering of several different spatial datasets, and buffering. Waller and Gotway (2004) cover some of the fundamentals of using GIS.
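To make the idea of buffering and spatial queries concrete, the sketch below selects the monitoring sites that fall within a fixed distance of a point of interest. It is an illustration only, written in plain Python rather than any of the GIS packages named above; it approximates the buffer with great-circle (haversine) distance on a spherical Earth, and the site names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sites_within_buffer(center, sites, radius_km):
    """Return the names of sites lying within radius_km of center (a simple buffer query)."""
    lat0, lon0 = center
    return [
        name
        for name, (lat, lon) in sites.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    ]


# Hypothetical point source and sampling sites (latitude, longitude in decimal degrees).
outfall = (40.58, -83.13)
sites = {
    "site_A": (40.59, -83.14),
    "site_B": (40.70, -83.30),
    "site_C": (40.57, -83.12),
}

print(sites_within_buffer(outfall, sites, radius_km=5.0))
```

A full GIS package would normally perform this operation on projected geometries and could buffer lines and polygons as well as points; the distance-based filter here is only meant to show what the query returns.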
Norton et al. (2002) and Cormier et al. (2002) performed a causal analysis of the Little Scioto River, near Marion, Ohio, and we have updated that map (Figure 1) using some of the GIS datasets described above. Besides location data, these datasets contain other information that may be helpful for causal analysis. For example, the NHD dataset contains the reach code, or reach address. The reach code consists of the 8-digit hydrologic unit code followed by a 6-digit, arbitrarily assigned sequence number. This reach code is referenced in data provided by other U.S. EPA programs, such as Impaired Waters and Fish Consumption Advisories (the Reach Address Database). One can also obtain the locations of, and information about, facilities and sites in relevant subwatersheds that are subject to environmental regulation from the U.S. EPA Geospatial Data Access Project. For the Little Scioto, the City of Marion’s wastewater treatment plant was identified as a relevant point source using this dataset. Finally, the locations of recent samples collected by the Ohio EPA were added to the map.
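The reach code structure described above (an 8-digit hydrologic unit code plus a 6-digit sequence number) can be used as a key for linking records from different programs. The following is a rough sketch under that assumption; the function name, the reach code values, and the two small record tables are all hypothetical and do not come from any U.S. EPA dataset.

```python
def split_reach_code(reach_code):
    """Split a 14-digit reach code into (8-digit hydrologic unit code, 6-digit sequence)."""
    reach_code = reach_code.strip()
    if len(reach_code) != 14 or not reach_code.isdigit():
        raise ValueError(f"expected a 14-digit reach code, got {reach_code!r}")
    return reach_code[:8], reach_code[8:]


# Hypothetical records keyed by reach code, standing in for two separate data sources.
impairments = {
    "05060001000123": "hypothetical impairment listing",
    "05060001000456": "hypothetical impairment listing",
}
advisories = {
    "05060001000123": "hypothetical fish consumption advisory",
}

# Link the two sources on the shared reach code.
for code in sorted(impairments):
    huc8, sequence = split_reach_code(code)
    advisory = advisories.get(code, "no advisory on record")
    print(f"HUC {huc8}, reach {sequence}: {impairments[code]}; {advisory}")
```

Because the first eight digits are the hydrologic unit code, the same split also lets records be grouped by subbasin before they are placed on the map.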