CADDIS Volume 4: Data Analysis
Exploratory Data Analysis
Author: J.F. Paul
An initial step in EDA is to examine how the values of different variables are distributed. Graphical approaches for examining the distribution of the data include histograms, boxplots, cumulative distribution functions, and Q-Q plots. Information on the distribution of values is often useful for selecting appropriate analyses and confirming whether assumptions underlying particular methods are supported (e.g., normally distributed residuals for a least squares regression).
A histogram summarizes the distribution of the data by placing observations into intervals (also called classes or bins) and counting the number of observations in each interval. The y-axis can be number of observations, percent of total, fraction of total (or probability), or density (in which the height of the bar multiplied by the width of the interval corresponds to the relative frequency of the interval). The appearance of a histogram can depend on how the intervals are defined. An example of a histogram is shown in Figure 1 for log-transformed total nitrogen from the EMAP-West Streams Survey data set.
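The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Figure 1: the function name `histogram`, the equal-width intervals, and the observations are all assumptions made for the example. It returns the four possible y-axis quantities and shows how the same data look different under two interval definitions.

```python
def histogram(values, n_bins):
    """Summarize values in n_bins equal-width intervals.

    Returns (edges, counts, fractions, densities).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Intervals are closed on the left; the maximum goes in the last interval.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    n = len(values)
    edges = [lo + i * width for i in range(n_bins + 1)]
    fractions = [c / n for c in counts]          # fraction of total
    densities = [f / width for f in fractions]   # height x width = fraction
    return edges, counts, fractions, densities


# Hypothetical observations (not taken from the EMAP-West data set).
obs = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

edges4, counts4, fractions4, densities4 = histogram(obs, 4)
edges2, counts2, fractions2, densities2 = histogram(obs, 2)

print(counts4)  # [3, 5, 1, 1]
print(counts2)  # [8, 2]
```

With four intervals the isolated high value is visible as a separate bar; with two intervals it is absorbed into a broad upper class.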
A box and whisker plot (also referred to as a boxplot) provides a compact summary of the distribution of a variable. A standard boxplot consists of (1) a box defined by the 25th and 75th percentiles, (2) a horizontal line or point in the box at the median, and (3) vertical lines (whiskers) drawn from each hinge (quartile) to the extreme value. In a slight variation of the standard boxplot, each whisker extends no farther than a span distance from its hinge, and observations beyond the span are identified individually as outliers. The span (S) is calculated as:
S = 1.5 × (75th percentile - 25th percentile)
Boxplots are particularly useful for comparing the distributions of different subsets of a single variable.
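The quantities behind the boxplot variation with outliers can be computed as in the following sketch. It rests on several assumptions that are not part of the original text: the function name `boxplot_summary` is hypothetical, the percentiles are computed with the "inclusive" interpolation method of Python's `statistics.quantiles` (other percentile or hinge definitions give slightly different values), each whisker is drawn to the most extreme observation lying within the span, and the data are invented.

```python
import statistics


def boxplot_summary(values):
    """Return the box, whisker ends, and outliers for one variable."""
    data = sorted(values)
    # 25th, 50th, and 75th percentiles (one of several common definitions).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    span = 1.5 * (q3 - q1)
    lower_limit, upper_limit = q1 - span, q3 + span
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    outliers = [v for v in data if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "span": span,
        "whisker_low": min(inside),   # most extreme observations within the span
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical observations with one unusually large value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary["q1"], summary["median"], summary["q3"])  # 3.25 5.5 7.75
print(summary["span"])                                   # 6.75
print(summary["whisker_high"], summary["outliers"])      # 9 [30]
```

Applying the same function to each subset of a variable yields the numbers needed to draw side-by-side boxplots for comparison.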
The cumulative distribution function (CDF) is a function F(x) that gives the probability that an observation of a variable is less than or equal to a specified value x. The reverse CDF, 1 - F(x), is also frequently used; it gives the probability that an observation is greater than the specified value. In constructing the CDF, weights (e.g., weights derived from the inclusion probabilities of a probability design) can be used. In this way, the probability that a value of the variable in the statistical population is less than or equal to a specified value is estimated. Otherwise, with equal weighting of observations, the CDF applies only to the observed values.
CDFs for phosphorus data from the EMAP northeast lakes survey are shown in Figure 3. The CDF for the sampled sites is shown in black (equal weight to the data), while the blue line is the estimated CDF for the statistical population of all lakes in the northeast U.S. (inclusion probabilities from probability design used as weights in estimation). The median phosphorus concentration for all of the lakes sampled is 11 μg/L, while the estimated median for all of the lakes in the northeast U.S. is 17 μg/L.
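The difference between an equally weighted and a design-weighted CDF can be illustrated with a short sketch. Everything specific here is an assumption for illustration: the function names `empirical_cdf` and `cdf_quantile` are hypothetical, each weight is taken as the reciprocal of a site's inclusion probability (a common choice for probability designs, not necessarily the estimator behind Figure 3), and the values and inclusion probabilities are invented rather than drawn from the EMAP data.

```python
def empirical_cdf(values, weights=None):
    """Return (xs, probs) where probs[i] estimates P(X <= xs[i])."""
    if weights is None:
        weights = [1.0] * len(values)  # equal weighting of observations
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    xs, probs, running = [], [], 0.0
    for x, w in pairs:
        running += w
        if xs and xs[-1] == x:
            probs[-1] = running / total  # tied values share one step
        else:
            xs.append(x)
            probs.append(running / total)
    return xs, probs


def cdf_quantile(xs, probs, p):
    """Smallest x whose cumulative probability is at least p."""
    for x, f in zip(xs, probs):
        if f >= p:
            return x
    return xs[-1]


# Hypothetical concentrations and inclusion probabilities (assumed values).
conc = [4, 7, 10, 20, 35]
incl_prob = [0.5, 0.5, 0.25, 0.125, 0.125]
design_weights = [1.0 / p for p in incl_prob]  # assumed weighting scheme

xs_s, probs_s = empirical_cdf(conc)                  # sampled sites only
xs_p, probs_p = empirical_cdf(conc, design_weights)  # population estimate

median_sample = cdf_quantile(xs_s, probs_s, 0.5)
median_population = cdf_quantile(xs_p, probs_p, 0.5)
reverse_cdf_p = [1.0 - f for f in probs_p]  # estimates P(X > x)

print(median_sample, median_population)  # 10 20
```

In this invented example the sites with high concentrations were less likely to be sampled, so weighting shifts the estimated median upward relative to the median of the sampled sites.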
A quantile-quantile (Q-Q) plot, or probability plot, is a graphical means of comparing the distribution of a variable to a particular theoretical distribution or to the distribution of another variable. If the two distributions are similar, the plotted points fall close to a straight line. One common application of the Q-Q plot is to check whether a variable is normally distributed. Comparisons of EMAP-West total nitrogen observations and of log-transformed total nitrogen observations to a normal distribution are shown in Figure 4. The log-transform greatly increases the degree to which observed total nitrogen values approximate a normal distribution.
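The points of a normal Q-Q plot can be computed as in the sketch below. The details are assumptions for illustration: the function names are hypothetical, the plotting positions (i - 0.5)/n are one of several conventions in use, the correlation between sample and theoretical quantiles is used only as a rough numerical stand-in for judging straightness by eye, and the right-skewed values are invented rather than taken from the EMAP-West data.

```python
import math
from statistics import NormalDist, mean


def normal_qq_points(values):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    sample = sorted(values)
    n = len(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sample


def pearson_r(xs, ys):
    """Pearson correlation, used here as a rough measure of linearity."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical right-skewed observations (assumed values).
raw = [0.1, 0.15, 0.2, 0.3, 0.4, 0.6, 0.9, 1.5, 3.0, 8.0]
logged = [math.log10(v) for v in raw]

theo_raw, samp_raw = normal_qq_points(raw)
theo_log, samp_log = normal_qq_points(logged)

r_raw = pearson_r(theo_raw, samp_raw)
r_log = pearson_r(theo_log, samp_log)
print(round(r_raw, 2), round(r_log, 2))
```

Plotting `samp_raw` against `theo_raw` and `samp_log` against `theo_log` gives the two panels of a comparison like the one described above; for these invented values the log-transformed points lie much closer to a straight line, and `r_log` is correspondingly larger than `r_raw`.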