## Exploratory Data Analysis

##### Topics in Conditional Probability Analysis

Author: J.F. Paul

### Conditional Probability Analysis

Conditional probability is the probability of some event (Y), given the occurrence of some other event (X), and is typically written as P(Y | X). Our application of conditional probability uses a dichotomous response variable, which requires applying a threshold value to a continuous response variable so that each sample falls into one of two categories (e.g., poor quality vs. not poor quality). For example, here we are interested in sites with a low relative abundance of clinger taxa, relative to total benthic taxa. We categorize sites at which the relative abundance of clingers is less than 40% as "poor" (Figure 1, left plot).

We use conditional probability analysis (CPA) to estimate the probability of observing Y (e.g., a site with poor biological condition) given that a particular condition X is also observed. Continuing our example, we might be interested in the probability of observing clinger relative abundances less than 40% when the percentage of fine sediments in the substrate exceeds a given value (Xc), or P(Y | X > Xc). An illustrative graph of this relationship is shown in Figure 1 (right plot), where the curve represents the probability of observing a low relative abundance of clingers (i.e., < 40%) when the percentage of sand/fines exceeds a given value. In this example, an increase in the percentage of sand/fines from 0 to 50% was associated with an increase in the probability of observing poor biological conditions (as indicated by the relative abundance of clinger taxa) from 60% to about 80%.
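The calculation behind a curve like the one described above can be sketched in a few lines of Python. This is a minimal illustration, not the CADStat implementation: the function name `conditional_probability_curve`, the eight data points, and the equal weighting of sites are all assumptions made for the example.

```python
from typing import List, Optional, Sequence


def conditional_probability_curve(
    x: Sequence[float],
    y_poor: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Optional[float]]:
    """Estimate P(Y | X > Xc) for each candidate threshold Xc.

    Returns None for a threshold that no site exceeds, because the
    conditional probability is undefined there.
    """
    curve: List[Optional[float]] = []
    for xc in thresholds:
        # Keep only the sites whose stressor value exceeds the threshold.
        exceeding = [poor for xi, poor in zip(x, y_poor) if xi > xc]
        if not exceeding:
            curve.append(None)
        else:
            # Fraction of the exceeding sites that are classed as "poor".
            curve.append(sum(exceeding) / len(exceeding))
    return curve


# Hypothetical data (not from EMAP-West): percent sand/fines and
# relative abundance of clinger taxa (%) at eight sites.
sand_fines = [5, 10, 20, 30, 45, 60, 75, 90]
clingers = [70, 35, 55, 30, 45, 25, 20, 10]

# Dichotomize the response: "poor" means clinger relative abundance < 40%.
poor = [c < 40 for c in clingers]

curve = conditional_probability_curve(sand_fines, poor, [0, 25, 50, 95])
print(curve)  # [0.625, 0.8, 1.0, None]
```

Each value is the share of "poor" sites among those exceeding the threshold, so the estimate rests on fewer sites as Xc increases and becomes undefined once no site exceeds it.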

Figure 1. Left plot: Percent substrate sand/fines versus relative abundance of clinger taxa. Data from EMAP-West. Red horizontal line shows relative abundance of clinger taxa = 40%, an example threshold for defining "poor" biological conditions. Right plot: Conditional probability, P(Y | X > Xc), where X = percent sand/fines in stream segment substrate and Y = relative abundance of clingers < 40%. Lines show bootstrapped upper and lower 95% confidence intervals.
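The caption mentions bootstrapped 95% confidence intervals but does not say how they were computed. One common approach, shown here purely as an assumption, is the percentile bootstrap: resample sites with replacement, recompute P(Y | X > Xc) each time, and take the 2.5th and 97.5th percentiles of the results. The function names, data, and default settings below are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def cond_prob(x: Sequence[float], y_poor: Sequence[bool], xc: float) -> float:
    """P(Y | X > Xc) as a simple proportion; NaN if no site exceeds Xc."""
    exceeding = [poor for xi, poor in zip(x, y_poor) if xi > xc]
    return sum(exceeding) / len(exceeding) if exceeding else math.nan


def bootstrap_ci(
    x: Sequence[float],
    y_poor: Sequence[bool],
    xc: float,
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for P(Y | X > Xc) (assumed method)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(x)
    estimates: List[float] = []
    for _ in range(n_boot):
        # Resample sites (x and y together) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        est = cond_prob([x[i] for i in idx], [y_poor[i] for i in idx], xc)
        # Skip resamples in which no site exceeds the threshold.
        if not math.isnan(est):
            estimates.append(est)
    if not estimates:
        raise ValueError("no bootstrap resample contained a site with X > Xc")
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical data (not from EMAP-West).
sand_fines = [5, 10, 20, 30, 45, 60, 75, 90]
poor = [False, True, False, True, False, True, True, True]

lower, upper = bootstrap_ci(sand_fines, poor, xc=25)
print(f"P(Y | X > 25) = {cond_prob(sand_fines, poor, 25):.2f}, "
      f"95% CI about [{lower:.2f}, {upper:.2f}]")
```

Repeating this for a grid of Xc values would give upper and lower bands of the kind drawn in the right plot. With only a handful of sites above a threshold, the interval will be wide.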

Conditional probabilities can be calculated by dividing the joint probability of observing both events by the probability of observing the conditioning event (Equation 1).

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$

Equation 1.
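Applying the definition in Equation 1 (joint probability divided by the probability of the conditioning event) to the exceedance event X > Xc used in this section gives the form below. The count-based estimate on the right is an assumption for illustration: it weights every site equally and ignores any survey design weights.

$$P(Y \mid X > X_c) = \frac{P(Y,\, X > X_c)}{P(X > X_c)} \approx \frac{\#\{\text{sites with } Y \text{ and } X > X_c\}}{\#\{\text{sites with } X > X_c\}}$$

For instance, if 5 of 8 sampled sites have X > Xc and 4 of those 5 are classed as poor, the estimate is (4/8) / (5/8) = 4/5 = 0.8.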

For our purposes, CPA involves applying the above analysis technique to biological monitoring data to assist stressor identification in causal analysis. Additional background and detail can be found in Paul and McDonald (2005); note, however, that this paper discusses CPA as applied to identifying thresholds of impact, which is a different purpose from stressor identification.

#### How do I calculate conditional probabilities?

A tool for computing conditional probabilities is available in CADStat.

#### How do I use conditional probability in causal analysis?

CPA can be used as a data exploration tool. Similar to scatter plots and linear correlation, CPA can be used to help understand associations between pairs of variables (e.g., a stressor and a response).