We know that correlation is not the same as causation. But we can get carried away when we see patterns or strong relationships. And this is exacerbated by the growing number of data sources, particularly third-party data.
This article walks through correlation and causation, and how third-party data is increasing the need to check the patterns our models find.
Causation: when one thing directly makes the other thing happen.
For instance, in insurance pricing, annual mileage directly affects the likelihood of accident claims. More time on the road increases accident risk. This is fairly straightforward.
Importantly, the reverse doesn’t follow: if A correlates with B and we find that A causes B, we can’t then conclude that B also causes A. The direction matters.
Confounding/lurking variable: there is a real relationship, but not causation, because there's a linking factor.
For example, if ice cream consumption and sunglasses sales both increase in summer, we might conclude that people are tempted to buy sunglasses while on an ice cream high. Of course, here it is easy to see that the common factor is the season, with temperature affecting one and sun glare affecting the other.
The same could easily happen with business data. For example, an increase in home insurance claims and higher umbrella sales might both be caused by more rain.
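To make this concrete, here is a minimal simulation sketch in Python (the numbers and variable names are made up for illustration): temperature drives both series, so they correlate strongly with each other, yet the apparent relationship all but vanishes once we control for temperature.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 365  # a year of daily observations (illustrative)

# The hidden confounder: daily temperature.
temperature = rng.normal(15, 8, n)

# Both series depend on temperature plus independent noise;
# neither has any direct effect on the other.
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
sunglasses = 1.5 * temperature + rng.normal(0, 5, n)

# The raw correlation looks impressive (around 0.9 with these parameters).
print(np.corrcoef(ice_cream, sunglasses)[0, 1])

def residuals(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: control for temperature and the
# relationship all but disappears (close to 0).
print(np.corrcoef(residuals(ice_cream, temperature),
                  residuals(sunglasses, temperature))[0, 1])
```

Controlling for a suspected confounder like this is a quick first check before trusting a correlation.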
Spurious correlation: often, the “pattern” is just random or accidental.
There are loads of examples of this. You've probably come across quite a few.
Tyler Vigen has several examples here, including made-up generative AI explanations, like this one about the correlation between cheddar cheese consumption and solar power generated in Haiti:
“As cheddar cheese consumption increased, the collective brainpower of the population reached new heights. With innovative solutions, they developed a way to harness the renewable energy of cheese dreams, leading to a surge in solar power generation in Haiti. The Cheesy Brainwave Initiative has now sparked a gouda revolution in the energy sector! Curd you believe it?”
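Part of the reason annual series like these correlate so easily is that anything that trends will correlate with anything else that trends. A quick sketch, assuming nothing beyond two independent random walks:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two independent random walks: by construction there is
# no relationship whatsoever between them.
a = rng.normal(size=120).cumsum()
b = rng.normal(size=120).cumsum()

# Yet the correlation is often large in magnitude; rerun with
# different seeds to see how wildly it swings.
print(np.corrcoef(a, b)[0, 1])
```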
This topic has been around for some time. But we’re now using more data sources in our algorithmic systems, increasing the scope for models to find patterns, meaningful or not.
More data can be valuable, helping us improve pricing, fraud detection and more.
But it also increases the risk of introducing patterns that distract from what’s actually relevant.
The more data we use, the higher the number of possible relationships: with n variables there are n(n-1)/2 possible pairwise comparisons, so candidate patterns grow roughly with the square of the variable count. This makes spurious correlations more likely, especially when variable selection is automated, as the sketch below shows.
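Here is a minimal illustration of that effect (the sizes are made up): a target and hundreds of candidate variables that are all pure noise, yet the best-looking one would sail through a naive automated screen.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 200        # observations, e.g. policies
n_features = 500   # candidate variables, all pure noise by construction

target = rng.normal(size=n_obs)
features = rng.normal(size=(n_obs, n_features))

# Correlation of each (irrelevant) feature with the target.
corrs = np.array([np.corrcoef(features[:, j], target)[0, 1]
                  for j in range(n_features)])

# Under the null, each correlation has a standard error of roughly
# 1/sqrt(n_obs) ~ 0.07, so the best of 500 typically lands near 0.25.
print(np.abs(corrs).max())
```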
Third-party data, in particular, presents specific challenges.
When we don’t fully understand where the data comes from or how it was generated, our models can pick up patterns that aren’t meaningful for our context. The patterns might seem useful because of a hidden confounder or some sort of data quirk.
As a result, well-intentioned algorithms might end up relying on relationships that don’t reflect the real drivers of risk or fraud. Increasingly, regulators and stakeholders expect a rational explanation for any variable used. So we need to screen our data carefully, with clear reasoning for every variable, especially when it comes from an external source.
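One lightweight way to operationalise that screening: keep a register of variables with their provenance and a plain-language rationale, and refuse to model anything without one. This is only a sketch; the structure and field names are assumptions, not an industry standard.

```python
from dataclasses import dataclass

@dataclass
class VariableRecord:
    name: str        # variable as it appears in the model
    source: str      # "internal" or the third-party provider
    rationale: str   # plain-language reasoning a regulator could read

candidates = [
    VariableRecord("annual_mileage", "internal",
                   "More time on the road increases accident exposure."),
    VariableRecord("regional_cheese_sales", "external_provider",
                   ""),  # no rationale recorded
]

def screen(records):
    """Keep only variables with a documented rationale."""
    return [r.name for r in records if r.rationale.strip()]

print(screen(candidates))  # ['annual_mileage']
```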
Disclaimer: The information in this article does not constitute legal advice. It may not be relevant to your circumstances. It was written for specific algorithmic contexts within banks and insurance companies, may not apply to other contexts, and may not be relevant to other types of organisations.