CADDIS Volume 4: Data Analysis

Exploratory Data Analysis

Conditional Probability Analysis

Conditional probability is the probability of some event (Y), given the occurrence of some other event (X), and is typically written as P(Y | X). Our application of conditional probability uses a dichotomous response variable, which requires applying a threshold value to a continuous response variable so that each sample falls into one of two categories (e.g., poor quality vs. not poor quality). For example, here we are interested in sites with a low relative abundance of clinger taxa, compared to total benthic taxa. We categorize sites at which the relative abundance of clingers is less than 40% as "poor" (Figure 1, left plot).

We use conditional probability analysis (CPA) to estimate the probability of observing Y (e.g., a site with poor biological condition) if you also observe a particular condition X. Continuing our example, we might be interested in the probability of observing clinger relative abundances less than 40% when the percentage of fine sediments in the substrate exceeds a given value (Xc), or P(Y | X > Xc). An illustrative graph of this relationship is shown in Figure 1 (right plot), where the curve represents the probability of observing a low relative abundance of clingers (i.e., < 40%) when the percentage of sand/fines exceeds a given value. In this example, raising the exceedance value Xc for percent sand/fines from 0 to 50% was associated with an increase in the probability of observing poor biological conditions (as indicated by the relative abundance of clinger taxa) from 60% to about 80%.

Figure 1. Left plot: Percent substrate sand/fines versus relative abundance of clinger taxa. Data from EMAP-West. Red horizontal line shows relative abundance of clinger taxa = 40%, an example threshold for defining "poor" biological conditions. Right plot: Conditional probability, P(Y | X > Xc), where X is percent sand/fines in stream segment substrate and Y is defined as relative abundance of clingers < 40%. Lines show bootstrapped upper and lower 95% confidence intervals.
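The curve and confidence bands described for Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the CADStat implementation: the function names, the use of NumPy, and the choice of a simple percentile bootstrap are all assumptions made for the example, and the data at the bottom are made up.

```python
import numpy as np

def conditional_probability_curve(x, y, y_threshold, xc_values):
    """Empirical P(Y | X > Xc), where Y is the event 'y < y_threshold'."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    poor = y < y_threshold                     # dichotomous response
    probs = np.full(len(xc_values), np.nan)    # NaN where no site exceeds Xc
    for i, xc in enumerate(xc_values):
        exceeds = x > xc                       # conditioning event X > Xc
        if exceeds.any():
            probs[i] = poor[exceeds].mean()    # fraction of "poor" sites among them
    return probs

def bootstrap_ci(x, y, y_threshold, xc_values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method) for the curve."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    curves = np.empty((n_boot, len(xc_values)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample sites with replacement
        curves[b] = conditional_probability_curve(x[idx], y[idx], y_threshold, xc_values)
    lower = np.nanpercentile(curves, 100 * alpha / 2, axis=0)
    upper = np.nanpercentile(curves, 100 * (1 - alpha / 2), axis=0)
    return lower, upper

# Hypothetical data: percent sand/fines (x) and percent clingers (y) at 10 sites.
sand_fines = np.array([5, 10, 20, 30, 40, 50, 60, 70, 80, 90])
clingers = np.array([70, 65, 55, 45, 35, 50, 30, 25, 20, 10])
xc_grid = np.array([0, 25, 50, 75])

curve = conditional_probability_curve(sand_fines, clingers, 40.0, xc_grid)
lower, upper = bootstrap_ci(sand_fines, clingers, 40.0, xc_grid, n_boot=200)
```

With real monitoring data, `xc_grid` would typically span the observed range of the stressor, and estimates at large Xc values rest on few sites, so the intervals widen there.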

Conditional probabilities can be calculated by dividing the joint probability of observing both events by the probability of observing the conditioning event (Equation 1).

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
Equation 1.
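As a worked illustration of Equation 1 with the exceedance form used above, suppose (hypothetically) that 200 sites were sampled, that 80 of them have percent sand/fines above Xc, and that 60 of those 80 are also classified as "poor". These counts are invented for the example and do not come from the EMAP-West data.

$$P(Y \mid X > X_c) = \frac{P(Y,\, X > X_c)}{P(X > X_c)} = \frac{60/200}{80/200} = \frac{60}{80} = 0.75$$

With equally weighted sites, the total number of sites cancels, so the conditional probability is simply the fraction of "poor" sites among those that exceed Xc.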

For our purposes, CPA involves the application of the above analysis technique to biological monitoring data to assist stressor identification in causal analysis. Additional background and detail can be found in Paul and McDonald (2005); however, this paper discusses CPA as applied to identifying thresholds of impact, which is a different purpose than stressor identification.

How do I calculate conditional probabilities?

A tool for computing conditional probabilities is available in CADStat.

How do I use conditional probability in causal analysis?

CPA can be used as a data exploration tool. Similar to scatter plots and linear correlation, CPA can be used to help understand associations between pairs of variables (e.g., a stressor and a response).

More information

  • Since CPA requires a dichotomous response variable (i.e., there either is, or is not, an effect), you must identify a threshold value of the response metric that defines unacceptable conditions (e.g., a response value that determines if a water body is biologically impaired).

  • CPA is most meaningful when applied to field data collected using a randomized, probabilistic sampling design. A probability sample is selected in an explicit manner that allows statements to be made for estimates of the statistical population from which it was selected (Overton 1990). Two key characteristics of a probability sample are that (1) the probability of sampling any element of the statistical population is known (this implies a definition of the statistical population of interest), and (2) the inclusion probability of any element of the population is positive, that is, all elements have a known non-zero probability of being included in the sample of sites (Cochran 1977, Overton 1993). The inclusion probability of any element is defined as the probability with which the element is included in the sample.
    • If your sites were selected using a probability design, then their inclusion probabilities can be used to weight the analysis and extrapolate the results to the larger statistical population. For example, if the statistical population was defined as all 1st to 3rd order streams in a watershed, then the results would be representative for all 1st to 3rd order streams in that watershed, not just those stream segments that were sampled.
    • If the probability of inclusion of a stream segment is unknown (which is typical for targeted sites or "found" data), the results of the analyses would be expressed in terms of the stream segments for which you have observations, and equal weighting would be applied to the stream segment data.
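The two weighting cases above can be sketched as follows. This is a hedged illustration only: the function name is hypothetical, and weighting each site by the inverse of its inclusion probability is a common survey-sampling convention assumed here, not something the text specifies. When no inclusion probabilities are supplied, every site gets equal weight.

```python
import numpy as np

def weighted_conditional_probability(x, poor, xc, inclusion_prob=None):
    """Estimate P(Y | X > Xc) with optional design weights.

    x              : stressor values, one per site
    poor           : boolean flags, True where the site is classed as "poor"
    xc             : exceedance value for the stressor
    inclusion_prob : per-site inclusion probabilities; None means equal weights
    """
    x = np.asarray(x, dtype=float)
    poor = np.asarray(poor, dtype=bool)
    if inclusion_prob is None:
        weights = np.ones_like(x)                  # targeted or "found" data
    else:
        # Assumed convention: weight = 1 / inclusion probability.
        weights = 1.0 / np.asarray(inclusion_prob, dtype=float)
    exceeds = x > xc
    denom = weights[exceeds].sum()                 # weighted estimate of P(X > Xc), unnormalized
    if denom == 0:
        return float("nan")                        # no site exceeds Xc
    return float(weights[exceeds & poor].sum() / denom)

# Hypothetical sites: stressor values, "poor" flags, and inclusion probabilities.
x = [10, 30, 50, 70]
poor = [False, True, False, True]
p_equal = weighted_conditional_probability(x, poor, xc=20)
p_design = weighted_conditional_probability(x, poor, xc=20,
                                            inclusion_prob=[0.5, 0.5, 0.25, 0.5])
```

In this made-up example the third site has a lower inclusion probability, so it stands in for more of the population and pulls the design-weighted estimate below the equally weighted one.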
