CADDIS Volume 4: Data Analysis

## Exploratory Data Analysis

- What is EDA?
- Variable Distributions
- Scatterplots
- Correlation Analysis
- Conditional Probability
- Multivariate Approaches
- Mapping Data
- References

Authors: G.W. Suter II, P. Shaw-Allen, S.M. Cormier

### Correlation Analysis

Correlation analysis is a method for measuring how strongly two random
variables in a matched data set vary together (their covariance). Covariance is usually standardized and expressed as the *correlation coefficient*
of two variables *X* and *Y*. The correlation coefficient is a unitless number that ranges from -1 to +1.
Its magnitude is the standardized degree of association between *X* and *Y*.
Its sign gives the direction of the association, which can be positive or
negative.
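
For Pearson's coefficient, the most common case, the standardization described above can be written out explicitly. The notation here is an assumption for illustration and does not appear in the original text: $n$ matched pairs $(x_i, y_i)$, sample means $\bar{x}$ and $\bar{y}$, and sample standard deviations $s_X$ and $s_Y$.

$$
r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Dividing the covariance by the product of the two standard deviations cancels the measurement units and bounds the result between -1 and +1.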

Pearson's product-moment correlation coefficient, *r*, measures
the degree of linear association between two variables.
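Spearman's rank-order correlation coefficient (ρ) uses the ranks
of the data, and can provide a more robust estimate of the degree to
which two variables are associated, for example when the relationship is monotonic but not linear or when the data contain outliers. Kendall's tau (τ) has the same underlying assumptions
as Spearman's ρ, but is interpreted as a probability: it is the difference between the probability that a pair of observations is ordered the same way by both variables and the probability that the pair is ordered in opposite ways.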
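
The three coefficients can be sketched in plain Python to make the definitions concrete. This is a minimal illustration rather than the method used by CADStat or any particular package: the function names and the five data pairs are invented for the example, and the Kendall function computes the simple tau-a form, which assumes no tied values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance scaled by the spread of each variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of the data."""
    return pearson_r(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Made-up matched data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

print("Pearson r    =", round(pearson_r(x, y), 3))
print("Spearman rho =", round(spearman_rho(x, y), 3))
print("Kendall tau  =", round(kendall_tau_a(x, y), 3))
```

For these five pairs the sketch gives *r* = 0.8, ρ = 0.8, and τ = 0.6. Eight of the ten possible pairs are ordered the same way by both variables and two are ordered in opposite ways, so τ = (8 - 2) / 10.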

A value of r, ρ, or τ is interpreted as follows:

- A coefficient of 0 indicates no linear (for *r*) or monotonic (for ρ and τ) association between the variables (Figure 1, left).
- A negative coefficient indicates that as one variable increases, the other decreases (Figure 1, center).
- A positive coefficient indicates that as one variable increases, the other also increases (Figure 1, right).
- Larger absolute values of coefficients indicate stronger associations (e.g., Figure 1, right and center). However, small Pearson coefficients may be due to a nonlinear relationship (Figure 2).

Figure 2 shows examples of how Pearson's and Spearman's
correlations can behave differently. Pearson's *r*
does not accurately represent the strength of the nonlinear
association in Figure 2 (left plot).
Pearson's *r* and Spearman's ρ also provide different
estimates of correlation depending on the distribution of the
data (Figure 2, right plot).
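
The contrast between the two coefficients can be reproduced with a small sketch. The data below are invented for illustration and are not the data plotted in Figure 2: *y* doubles with each unit increase in *x*, so the relationship is perfectly monotonic but strongly curved. The helper names are hypothetical, and the ranking function assumes no tied values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def ranks(values):
    """Ranks from 1 to n; this simple version assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Made-up data: y doubles for every unit step in x.
x = list(range(10))
y = [2 ** value for value in x]

r = pearson_r(x, y)                   # linear association on the raw values
rho = pearson_r(ranks(x), ranks(y))   # Spearman's rho: Pearson's r on ranks

print(f"Pearson r    = {r:.2f}")
print(f"Spearman rho = {rho:.2f}")
```

Because the ordering of *y* follows the ordering of *x* exactly, Spearman's ρ is 1, while Pearson's *r* is only about 0.80 because the points do not fall on a straight line.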

#### How do I calculate correlations?

A tool for calculating correlations is available in CADStat, and most spreadsheet or statistical software will compute them as well.

#### How do I use correlation in causal analysis?

Correlation analysis is used primarily as a data exploration technique to reveal the degree of association in a set of matched data. This information can inform subsequent analyses of relationships between variables. In particular, correlation can indicate possible factors that confound a relationship of interest. In most data sets, pairwise correlations alone may not provide enough insight, and multivariate exploratory analyses are recommended.
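
As a rough sketch of this kind of screening, the snippet below computes Pearson's *r* for every pair of variables in a small matched data set. The variable names and site values are made up for illustration and do not come from CADDIS. The point is only to show how a third variable that is strongly correlated with both a candidate cause and a response stands out as a possible confounder.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson's r for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical measurements at six sites (invented values, one entry per site).
sites = {
    "conductivity": [120, 250, 310, 480, 560, 700],
    "temperature": [14.0, 15.5, 16.0, 18.5, 19.0, 21.0],
    "taxa_richness": [32, 30, 27, 21, 18, 12],
}

# Pearson's r for every pair of variables.
pairwise = {
    (a, b): pearson_r(sites[a], sites[b])
    for a, b in combinations(sites, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

In this invented data set, temperature is strongly correlated with both conductivity and taxa richness, so a relationship between conductivity and richness could not be interpreted without also considering temperature. Correlations like these only flag candidates for further analysis; they do not establish which variable is the cause.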