CADDIS Volume 4: Data Analysis
Author: J.F. Paul
Conditional Probability Analysis
Conditional probability is the probability of some event (Y), given the occurrence of some other event (X), and is typically written as P(Y | X). Our application of conditional probability uses a dichotomous response variable, which requires applying a threshold value to a continuous response variable so that each sample falls into one of two categories (e.g., poor quality vs. not poor quality). For example, here we are interested in sites with a low relative abundance of clinger taxa, compared to total benthic taxa. We categorize sites at which the relative abundance of clingers is less than 40% as "poor" (Figure 1, left plot).
We use conditional probability analysis (CPA) to estimate the probability of observing Y (e.g., a site with poor biological condition) if you also observe a particular condition X. Continuing our example, we might be interested in the probability of observing clinger relative abundances less than 40% when the percentage of fine sediments in the substrate exceeds a given value (Xc), or P(Y | X > Xc). An illustrative graph of this relationship is shown in Figure 1 (right plot), where the curve represents the probability of observing a low relative abundance of clingers (i.e., < 40%) when the percentage of sand/fines exceeds a given value. In this example, an increase in the percentage of sand/fines from 0 to 50% was associated with an increase in the probability of observing poor biological conditions (as indicated by the relative abundance of clinger taxa) from 60% to about 80%.
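A curve like the one described for Figure 1 (right plot) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name exceedance_curve, the site values, and the cutoffs are hypothetical and are not taken from the CADDIS data set or from CADStat. Every site is weighted equally.

```python
def exceedance_curve(stressor, response, response_threshold, cutoffs):
    """Return a list of (Xc, P(Y | X > Xc), n) tuples, one per cutoff Xc.

    Y is the event "response < response_threshold" (e.g., clingers < 40%).
    n is the number of sites whose stressor value exceeds Xc. Cutoffs with
    no sites above them are reported with a probability of None.
    """
    curve = []
    for xc in cutoffs:
        # Keep only the responses at sites where the stressor exceeds Xc.
        subset = [r for s, r in zip(stressor, response) if s > xc]
        if not subset:
            curve.append((xc, None, 0))
            continue
        n_poor = sum(1 for r in subset if r < response_threshold)
        curve.append((xc, n_poor / len(subset), len(subset)))
    return curve


# Hypothetical data: percent sand/fines and percent clingers at 8 sites.
pct_fines = [5, 10, 20, 30, 45, 60, 75, 90]
pct_clingers = [70, 35, 55, 30, 45, 20, 38, 10]

curve = exceedance_curve(pct_fines, pct_clingers, 40, [0, 25, 50])
for xc, prob, n in curve:
    print(f"Xc = {xc:>2}%  P(poor | fines > Xc) = {prob}  (n = {n})")
```

Reporting n alongside each probability is a deliberate choice: as Xc increases, fewer sites remain in the conditioning subset, so estimates at the high end of the curve rest on less data.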
Conditional probabilities can be calculated by dividing the joint probability of observing both events by the probability of observing the conditioning event (Equation 1):

P(Y | X > Xc) = P(Y, X > Xc) / P(X > Xc)    (Equation 1)
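The ratio of the joint probability to the probability of the conditioning event can be worked through on a small example. The sketch below assumes equally weighted sites and estimates each probability as a simple proportion of sites; the function name and the data are hypothetical and are not the CADStat implementation.

```python
def conditional_probability(stressor, response, response_threshold, xc):
    """Estimate P(Y | X > Xc) as P(Y and X > Xc) / P(X > Xc).

    Y is the event "response < response_threshold". Both probabilities are
    estimated as proportions of all sites (equal weighting assumed).
    """
    n = len(stressor)
    # Sites where the conditioning event X > Xc occurs.
    n_x = sum(1 for s in stressor if s > xc)
    if n_x == 0:
        raise ValueError("no sites exceed the cutoff Xc")
    # Sites where both events occur together.
    n_joint = sum(
        1 for s, r in zip(stressor, response)
        if s > xc and r < response_threshold
    )
    p_x = n_x / n          # P(X > Xc)
    p_joint = n_joint / n  # P(Y and X > Xc)
    return p_joint / p_x


# Hypothetical data: percent sand/fines and percent clingers at 8 sites.
pct_fines = [5, 10, 20, 30, 45, 60, 75, 90]
pct_clingers = [70, 35, 55, 30, 45, 20, 38, 10]

# 5 of 8 sites have fines > 25%; 4 of those 5 also have clingers < 40%.
p = conditional_probability(pct_fines, pct_clingers, 40, 25)
print(round(p, 3))
```

With equal weights the total number of sites cancels, so the result is the same as the proportion of poor sites among those with X > Xc.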
For our purposes, CPA involves applying this technique to biological monitoring data to assist stressor identification in causal analysis. Additional background and detail can be found in Paul and McDonald (2005); note, however, that the paper discusses CPA as applied to identifying thresholds of impact, which is a different purpose from stressor identification.
A tool for computing conditional probabilities is available in CADStat.
CPA can be used as a data exploration tool. Similar to scatter plots and linear correlation, CPA can be used to help understand associations between pairs of variables (e.g., a stressor and a response).
- Since CPA requires a dichotomous response variable (i.e., there either is, or is not, an effect), you must identify a threshold value of the response metric that defines unacceptable conditions (e.g., a response value that determines if a water body is biologically impaired).
- CPA is most meaningful when applied to field data collected using a randomized, probabilistic sampling design. A probability sample is selected in an explicit manner that allows statements to be made about the statistical population from which it was selected (Overton 1990). Two key characteristics of a probability sample are that (1) the probability of sampling any element of the statistical population is known (this implies a definition of the statistical population of interest), and (2) the inclusion probability of every element of the population is positive, that is, every element has a known, non-zero probability of being included in the sample of sites (Cochran 1977, Overton 1993). The inclusion probability of an element is defined as the probability with which the element is included in the sample.
- If your sites were selected using a probability design, then their inclusion probabilities can be used to weight the analysis and extrapolate the results to the larger statistical population. For example, if the statistical population was defined as all 1st to 3rd order streams in a watershed, then the results would be representative for all 1st to 3rd order streams in that watershed, not just those stream segments that were sampled.
- If the probability of inclusion of a stream segment is unknown (which is typical for targeted sites or "found" data), the results of the analyses would be expressed in terms of the stream segments for which you have observations, and equal weighting would be applied to the stream segment data.
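The difference between the weighted and equally weighted cases can be illustrated with a short sketch. It assumes that each site is weighted by the inverse of its inclusion probability, a common convention for probability samples; the source does not state the exact weighting formula, so treat this as an assumption rather than the CADStat method. The function name and the data are hypothetical.

```python
def weighted_conditional_probability(stressor, response, response_threshold,
                                     xc, inclusion_probs=None):
    """Estimate P(Y | X > Xc), optionally weighting sites.

    Assumption: each site's weight is 1 / (its inclusion probability).
    If inclusion_probs is None (e.g., targeted or "found" data), every
    site receives equal weight.
    """
    if inclusion_probs is None:
        weights = [1.0] * len(stressor)
    else:
        weights = [1.0 / p for p in inclusion_probs]

    w_x = 0.0      # total weight of sites with X > Xc
    w_joint = 0.0  # total weight of sites with X > Xc and a poor response
    for s, r, w in zip(stressor, response, weights):
        if s > xc:
            w_x += w
            if r < response_threshold:
                w_joint += w
    if w_x == 0:
        raise ValueError("no sites exceed the cutoff Xc")
    return w_joint / w_x


# Hypothetical data for 4 sites; the third site represents many more
# stream segments than the others (low inclusion probability).
pct_fines = [10, 30, 60, 90]
pct_clingers = [50, 30, 45, 20]
incl_probs = [0.5, 0.5, 0.1, 0.5]

unweighted = weighted_conditional_probability(pct_fines, pct_clingers, 40, 20)
weighted = weighted_conditional_probability(
    pct_fines, pct_clingers, 40, 20, inclusion_probs=incl_probs
)
print(round(unweighted, 3), round(weighted, 3))
```

In this made-up example the equally weighted estimate describes only the sampled segments, while the weighted estimate gives more influence to the site that stands in for a larger share of the statistical population.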