# What is Correlation?

Many pedantically smart laymen know that correlation is not causation but don’t know what correlation actually is. I’m here to fix that with an intuitive explanation, a mathematical definition, and a brief philosophical comment.

### Intuitive Explanation

Correlation is a quantitative measure of how well the highs line up with the highs and the lows line up with the lows between two arrays. Correlation will always be between -1 and 1, inclusive. A correlation of 1 indicates a perfect, positive linear relationship between two variables. A correlation of zero indicates no linear relationship. A correlation of -1 indicates a perfect, negative linear relationship.

### Mathematical Definition

This value can be computed in Excel through the CORREL() function, but stepping through the formula helps build understanding. Feel free to skip ahead to the applications and philosophy sections. Mathematically, correlation can be computed as follows:

$correl(X,Y) = \frac{E[(X-\mu _X)(Y-\mu _Y)]}{\sigma _X \sigma _Y}$

Where $X$ and $Y$ are arrays of equal length, $\mu _A$ and $E[A]$ both give the expected value of $A$, and $\sigma _A$ gives the standard deviation of $A$. As previously discussed, expected value is a probability-weighted average of all possible values. Standard deviation is a measure of how spread out an array is. It will be explained in a future post. As is common when analyzing historical data, we will weight each observation equally.
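Under that equal-weighting assumption, the expectations reduce to simple averages over the $N$ observations. Written out (the indexed $x_t$, $y_t$ notation is introduced here only for illustration), the pieces of the formula would look like this:

$\mu _X = \frac{1}{N}\sum_{t=1}^{N} x_t$

$\sigma _X = \sqrt{\frac{1}{N}\sum_{t=1}^{N}(x_t-\mu _X)^2}$

$correl(X,Y) = \frac{\frac{1}{N}\sum_{t=1}^{N}(x_t-\mu _X)(y_t-\mu _Y)}{\sigma _X \sigma _Y}$

Dividing by $N-1$ instead of $N$ throughout would change the covariance and the standard deviations but not the correlation, since the factor cancels between the numerator and the denominator.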

I created a small example in Excel.

Click here for formula view. The first two columns show four observations of my simulated values. Series A was entered manually. Series B was generated based on Series A but with a small amount of noise. I preserved the original formula in cell H3.

For each value, correlation starts by computing how different it is from the expected value of its series. Values under the average, like when $t=1$, are displayed as negative deviations.

The rightmost column multiplies the two deviations by each other. If they are large and in the same direction, like at $t=1$ and $t=4$, then the product of the deviations will increase the correlation. If they are in opposite directions, like at $t=3$, then the product of the deviations will decrease the correlation. The average of these products across all observations, 723, is known as the covariance. This is related to the concept of variance, which will be discussed in a future post. The covariance value tells us everything that correlation tells us, but covariance is scale dependent. High input values will often result in high covariance regardless of the actual relationship.
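To make the scale dependence concrete, here is a minimal Python sketch. The two series are hypothetical (they are not the values from the workbook), and the helper names `covariance` and `correl` are my own. It expresses the same observations in thousands and in single units, then compares the results.

```python
from statistics import fmean, pstdev


def covariance(xs, ys):
    """Equal-weighted (population) covariance: the average product of deviations."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return fmean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])


def correl(xs, ys):
    """Covariance rescaled by the product of the population standard deviations."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))


# Hypothetical series, not the workbook's values.
x_thousands = [10.0, 40.0, 30.0, 60.0]
y_thousands = [12.0, 35.0, 38.0, 55.0]

# The same observations expressed in single units instead of thousands.
x_units = [x * 1000 for x in x_thousands]
y_units = [y * 1000 for y in y_thousands]

cov_thousands = covariance(x_thousands, y_thousands)
cov_units = covariance(x_units, y_units)
corr_thousands = correl(x_thousands, y_thousands)
corr_units = correl(x_units, y_units)

print(f"covariance:  {cov_thousands:,.1f} vs {cov_units:,.1f}")
print(f"correlation: {corr_thousands:.4f} vs {corr_units:.4f}")
```

The covariance grows by a factor of a million purely because of the change in units, while the correlation does not move.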

To reach correlation, the covariance is divided by the maximum possible covariance given the standard deviations of the two series. This maximum is equal to $\sigma _X \sigma _Y$. The last two rows show that computing the correlation manually yields the same result as the =CORREL() function.
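The same steps can be sketched outside of Excel. The Python below is an illustration only: the two four-observation series are hypothetical stand-ins (not the workbook's Series A and Series B), and the variable names are my own. It follows the spreadsheet's columns in order: deviations, product of deviations, covariance, then correlation.

```python
from statistics import fmean, pstdev

# Hypothetical four-observation series, not the workbook's values.
series_a = [10.0, 40.0, 30.0, 60.0]
series_b = [12.0, 35.0, 38.0, 55.0]

# Step 1: the expected value of each series (equal weights, so a plain mean).
mean_a = fmean(series_a)
mean_b = fmean(series_b)

# Step 2: how far each observation sits from its series' expected value.
dev_a = [a - mean_a for a in series_a]
dev_b = [b - mean_b for b in series_b]

# Step 3: product of the two deviations for each observation.
products = [da * db for da, db in zip(dev_a, dev_b)]

# Step 4: the covariance is the average of those products.
covariance = fmean(products)

# Step 5: divide by the product of the (population) standard deviations.
correlation = covariance / (pstdev(series_a) * pstdev(series_b))

for t, (da, db, p) in enumerate(zip(dev_a, dev_b, products), start=1):
    print(f"t={t}: dev_a={da:+.1f}  dev_b={db:+.1f}  product={p:+.1f}")
print(f"covariance  = {covariance:.1f}")
print(f"correlation = {correlation:.4f}")
```

In this made-up data the third observation has deviations in opposite directions, so its product is negative and pulls the covariance down, just as described for $t=3$ above.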

### Significance

With real-world data, it is very unlikely that uncorrelated variables will have a correlation of exactly zero due to noise in the data. Given this, it is necessary for a practitioner to know how likely any returned result is caused by noise. Sheet 2 in this post’s workbook gives a significance table. Derivation and the underlying assumptions will be covered at a later date.

If you have two years of monthly data (N=24), then you will need a correlation of at least 0.344 to be 95% sure the result is not caused by noise. A correlation of at least 0.404 will increase your certainty to 97.5%.

In our example (N=4), the correlation is significant at the 95% and 97.5% levels but not at the 99.5% level.

The lower the number of observations, the greater the correlation must be before we can call it statistically significant.
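The post does not yet say how the significance table is derived, so treat the following as an assumption on my part: a one-tailed t-test with $N-2$ degrees of freedom happens to reproduce the 0.344 and 0.404 thresholds quoted for N=24. Under that assumption, the threshold is $t / \sqrt{t^2 + N - 2}$, where $t$ is the critical value from a standard t-table. The function name and the hard-coded critical values below are mine, not the workbook's.

```python
import math


def critical_correlation(t_critical: float, n: int) -> float:
    """Smallest correlation that clears a t-test with n - 2 degrees of freedom.

    Assumes the one-tailed t-test described in the lead-in; this is a guess
    at the workbook's method, not a confirmed description of it.
    """
    return t_critical / math.sqrt(t_critical ** 2 + (n - 2))


# One-tailed critical t values from a standard t-table.
# Keys are the upper-tail probabilities (0.05 -> the "95%" level, and so on).
T_CRIT_DF22 = {0.05: 1.717, 0.025: 2.074}               # N = 24, df = 22
T_CRIT_DF2 = {0.05: 2.920, 0.025: 4.303, 0.005: 9.925}  # N = 4,  df = 2

for tail, t_value in T_CRIT_DF22.items():
    print(f"N=24, tail={tail}: r >= {critical_correlation(t_value, 24):.3f}")

for tail, t_value in T_CRIT_DF2.items():
    print(f"N=4,  tail={tail}: r >= {critical_correlation(t_value, 4):.3f}")
```

With only four observations the required correlation is far higher at every level, which is the pattern described above.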

### Applications

Correlation is useful to practitioners looking for relationships between variables. For example, a business valuator might analyze the correlation of an expense line item with revenue to demonstrate that a line item is variable. A practitioner studying historical performance might look at the correlation between a hotel’s revenue and historical sales tax proceeds to determine how exposed a business is to swings in economic activity.

Correlation should be thought of as the start of analysis, not the end. For example, suppose a hurricane decreases revenue and increases repairs and maintenance expense. This will serve to lower the correlation between the two line items, but this outlier may not be representative of the actual relationship. A trained practitioner thinking about business realities will always deliver more reliable results than a single measure.
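A rough sketch of the hurricane scenario shows how much a single observation can move the number. All figures below are invented for illustration, and `correl` is my own helper: six ordinary months where repairs and maintenance rises with revenue, plus one hypothetical hurricane month with low revenue and high repairs.

```python
from statistics import fmean, pstdev


def correl(xs, ys):
    """Equal-weighted correlation: average product of deviations over sigma_x * sigma_y."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = fmean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))


# Hypothetical monthly figures (in thousands); not taken from any real business.
revenue = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
repairs = [10.0, 12.0, 11.0, 14.0, 13.0, 16.0]

# One hypothetical hurricane month: revenue collapses while repairs spike.
revenue_with_storm = revenue + [60.0]
repairs_with_storm = repairs + [40.0]

corr_normal = correl(revenue, repairs)
corr_with_storm = correl(revenue_with_storm, repairs_with_storm)

print(f"ordinary months only:  {corr_normal:.3f}")
print(f"with hurricane month:  {corr_with_storm:.3f}")
```

In this made-up data the one outlier is enough to flip a strongly positive correlation to a negative one, which is why the number alone should not end the analysis.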

### Philosophy

Relevant XKCD: “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.’”

There is some merit to the old saying that “correlation does not imply causation.” For example, sometimes a confounding variable will cause two other variables to change. During the summer, both ice cream consumption and shark attacks increase, so the two will be correlated. However, it would be a mistake to assume that ice cream causes shark attacks. If ice cream consumption spiked because a new ice cream store opened, we wouldn’t expect shark attacks to increase. Therefore, correlation does not always equal causation.

Mathematically, this statement also works in reverse. It is possible to generate relationships that are explicitly causal but have a correlation at or near zero. This is especially true for non-linear relationships. Some excellent examples are found in this Wikipedia illustration.
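A minimal sketch of one such case, using a relationship I picked for illustration (it is not taken from the Wikipedia figure): $y = x^2$ over a range of $x$ that is symmetric around zero. Here $y$ is completely determined by $x$, yet the positive and negative deviations cancel. The `correl` helper is my own.

```python
from statistics import fmean, pstdev


def correl(xs, ys):
    """Equal-weighted correlation: average product of deviations over sigma_x * sigma_y."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = fmean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))


# x is symmetric around zero; y is fully determined by x.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correl(xs, ys)
print(f"correlation between x and x^2: {r:.4f}")
```

The correlation comes out at zero even though knowing $x$ tells you $y$ exactly, because correlation only picks up the linear part of a relationship.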

That being said, across many interesting fields of thought, correlation is heavily correlated with causation, and the result of this quick-and-dirty measure should not be ignored.

There is an annoying tendency among some thinkers to use lists of logical fallacies as a toolkit to shoot down ideas they disagree with. Cite a consensus among experts, and they’ll claim you are “arguing from authority.” Point out there is no evidence of something, and they’ll say that “the absence of evidence isn’t the evidence of absence.” Point out that X and Y are correlated, and they’ll say that this doesn’t imply causation.

No, experts aren’t always right. No, the lack of evidence for a claim doesn’t destroy a claim. No, correlation doesn’t always equal causation. However, we don’t need absolute mathematical precision to make compelling cases that change our minds. Experts usually know more than laymen, the absence of evidence is evidence of absence and correlation is a powerful indicator that tells us much about the relationship between two variables.

Know where the math might fail, and apply correlation to relevant problems with exactly the right amount of confidence.

All figures were generated from this Excel workbook. Feel free to download it and play around.