The Data Analytics Blog

Our news and views relating to Data Analytics, Big Data, Machine Learning, and the world of Credit.

5 Correlation Types In Data Science And How To Not Fool Yourself

May 25, 2018 at 4:06 PM

As part of our blog series on cognitive biases and logical fallacies that data scientists should avoid, today we address a prevalent logical fallacy: the "correlation proves causation" fallacy. Correlation due to causation is just one of the five main categories of correlation, and this blog will look into each of the five.

The reason we are running this series of blogs is to highlight critical thinking within the workplace, and particularly in data science. The discipline of metacognition (thinking about thinking) is essential in ensuring that, when you use data to lead you to the truth, you can trust what the data is telling you. This is particularly relevant in the so-called post-truth world.

Epistemology: the study or a theory of the nature and grounds of knowledge especially with reference to its limits and validity [Merriam-Webster]

In a previous blog on motivated reasoning, I showed how analysts could fall into the trap of searching for evidence to support a pre-held belief. Through a variety of statistically sloppy practices, including the file-drawer effect (publication bias), the Texas sharpshooter fallacy, p-hacking, self-fulfilling prophecy, confirmation bias and cherry picking, a conclusion can be drawn from incomplete or biased data. The correlation-causation fallacy is one of the more familiar of these.

“Correlation means causation” is just one of the five main types of correlation. We explore each of the five and how to identify them.

Correlation implies causation?

While it’s frequently said that correlation does not imply causation, that is not the whole story: in an observational study, a correlation indicates the possibility of a causal link. The point is that a further study or experiment needs to be conducted to determine whether the link is real.

Observational Study

An observational study is what happens when an analyst looks retrospectively at data. A common problem is determining what conclusions can be drawn from the data. When motivated reasoning drives an analyst, it is quite common for that analyst to make an absolute determination. Instead, they should form hypotheses and then, if possible, conduct randomised trials in which the independent variable is controlled. Alternatively, if they can control for the relevant variables by drawing out cohort groups and conducting a study on those, they can form much stronger hypotheses.

An observational study might show that red-wine drinkers live longer. Does that mean red wine makes one live longer (direct causation), or that red-wine drinkers are more affluent and affluent people live longer (marker for another variable)?

Reasons for Correlation

So when an analyst observes a strong correlation, it's important to recognise that there could be many reasons for the correlation. I've listed the five main reasons, with examples and steps to identify the cause.

Marker for another causal variable (W causes X; W causes Y)

This is also known as the "third-cause fallacy". Two data fields appear correlated – it would be tempting to infer a causal link between the two, but in fact, there is a third common variable that is causing the other two.

The wine example above is an example of this type of correlation. Another rather prosaic example would be a correlation of umbrella purchases and lightning strikes. Common sense would dictate that umbrella sales don’t cause lightning strikes (or vice versa). The third common variable is stormy weather which causes both variables.
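To make the umbrella example concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration: the `storm`, `umbrellas` and `lightning` series are synthetic, and the `pearson` helper is hand-rolled rather than taken from any library. It shows a strong raw correlation between X and Y largely disappearing once the common cause W is controlled for with a first-order partial correlation.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)

# W: a hypothetical daily "storminess" score (the hidden common cause).
storm = [rng.gauss(0, 1) for _ in range(5000)]
# X and Y each depend on W plus independent noise; neither affects the other.
umbrellas = [w + rng.gauss(0, 1) for w in storm]
lightning = [w + rng.gauss(0, 1) for w in storm]

raw = pearson(umbrellas, lightning)

# Partial correlation of X and Y given W:
# r_xy.w = (r_xy - r_xw * r_yw) / sqrt((1 - r_xw^2) * (1 - r_yw^2))
r_xw = pearson(umbrellas, storm)
r_yw = pearson(lightning, storm)
partial = (raw - r_xw * r_yw) / ((1 - r_xw ** 2) * (1 - r_yw ** 2)) ** 0.5

print(f"raw correlation:          {raw:.2f}")
print(f"controlling for weather:  {partial:.2f}")
```

In real data the third variable is rarely this obvious, so the hard part is thinking of candidate W variables in the first place; the arithmetic for checking them is the easy part.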


Indirect Causation (X causes Z which causes Y)

The correlation between two factors may not indicate a direct causal relationship. There may be an intervening variable at play. 


An example of this might be the frequently studied association between tea consumption and reduced lung cancer. Many low-quality studies show strong inverse correlations between heavy tea consumption (10 cups/day) and the onset of lung cancer. However, drinking 10 cups of tea per day means there is less time to smoke. Other studies have controlled for this factor through cohort analysis and have determined that, among non-smokers and ex-smokers, tea consumption is less strongly correlated with lung cancer.

To detect indirect causation, it is worth doing multivariate (instead of univariate) analysis. This will ensure that variable “Z” and variable “X” are analysed together. Similarly, cohort analysis on observational studies may assist too. For time-series data, one can use Granger causality testing, which checks whether past values of one series help predict another and so gives a reasonable indication of (though not proof of) direct causality.
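The cohort idea can be sketched with a small simulation. The probabilities below are invented purely for illustration and are not taken from any study: tea drinking (X) lowers the chance of smoking (Z), and only smoking affects the chance of lung cancer (Y). The hypothetical `cancer_rate` helper compares rates overall and then within each smoking cohort.

```python
import random

rng = random.Random(1)
N = 200_000

# Each row is (tea, smoker, cancer). All probabilities are made-up assumptions.
rows = []
for _ in range(N):
    tea = rng.random() < 0.5
    smoker = rng.random() < (0.10 if tea else 0.40)   # X influences Z
    cancer = rng.random() < (0.15 if smoker else 0.01)  # only Z influences Y
    rows.append((tea, smoker, cancer))

def cancer_rate(tea, smoker=None):
    """Cancer rate for a tea group, optionally restricted to one smoking cohort."""
    group = [c for t, s, c in rows
             if t == tea and (smoker is None or s == smoker)]
    return sum(group) / len(group)

# Univariate view: tea looks strongly protective.
print("overall     tea:", round(cancer_rate(True), 3),
      " no tea:", round(cancer_rate(False), 3))

# Cohort view: within each smoking cohort, tea makes little difference.
for smoker in (False, True):
    label = "smokers    " if smoker else "non-smokers"
    print(label, "tea:", round(cancer_rate(True, smoker), 3),
          " no tea:", round(cancer_rate(False, smoker), 3))
```

When the within-cohort gap closes like this, the association between X and Y is running through Z rather than being direct.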

Direct Causation (X causes Y)

Direct causation is what we are ultimately interested in with predictive analysis. Mistaking mere sequence for direct causation is the fallacy known by its Latin name “post hoc ergo propter hoc” (after this, therefore because of this): just because an outcome comes after an action does not mean that the action produced the outcome.

While this may be obvious, it is worth noting another sub-category to look out for: the cyclic causation scenario, where Y may cause X too. The often-cited example is the predator-prey population relationship (modelled using the Lotka-Volterra equations). Here the predator population grows with an increase in the prey population; conversely, the prey population decreases when the predator population is large.

[Figure: predator-prey population]
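The cyclic relationship can be sketched with the standard Lotka-Volterra equations, dx/dt = αx − βxy for prey and dy/dt = δxy − γy for predators. The parameter values, starting populations and the simple Euler integration in the hypothetical `simulate` function below are assumptions chosen only to produce visible oscillations; they are not fitted to any real population.

```python
def simulate(prey0, pred0, alpha, beta, delta, gamma, dt, steps):
    """Integrate the Lotka-Volterra equations with a basic Euler scheme."""
    prey, pred = [prey0], [pred0]
    for _ in range(steps):
        x, y = prey[-1], pred[-1]
        dx = alpha * x - beta * x * y    # prey grow, and are eaten by predators
        dy = delta * x * y - gamma * y   # predators grow by eating prey, then die off
        prey.append(x + dx * dt)
        pred.append(y + dy * dt)
    return prey, pred

# Illustrative parameters only.
prey, pred = simulate(prey0=10.0, pred0=5.0,
                      alpha=1.0, beta=0.1, delta=0.075, gamma=1.5,
                      dt=0.001, steps=20_000)

# Print a coarse trace: one sample every 2 time units.
for step in range(0, len(prey), 2000):
    print(f"t={step * 0.001:5.1f}  prey={prey[step]:7.2f}  predators={pred[step]:7.2f}")
```

Each population drives the other, so asking whether X causes Y or Y causes X has no single answer here; a plain correlation between the two series would hide that feedback loop.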

Back to correlation due to causation.

Credit Example

In the credit risk space, having a deep understanding of the confounding factors in the data environment is essential. Even for something as simple as setting a good/bad definition for application scorecards, whether an account performs as good or bad can be affected by many things, not only poor credit-risk behaviour. A common phenomenon in SME lending: a business that is denied working capital and then fails does not necessarily show that the decision to decline was correct. A working capital loan might have been what the company needed to succeed.

Similarly, in behavioural scoring, we may predict an account has a high probability of being good in 12 months. As a result, we decide to aggressively market to the account pushing the account holder over the limit. The result is that the customer defaults on their loan. Did we underestimate the risk of the client or was our action too aggressive?


The fourth correlation type is pure coincidence. Humans are notoriously bad at understanding probabilities; we are constantly seeking patterns and are prone to confirmation bias. When we stumble upon a pattern or correlation, we may be tempted to jump to the conclusion that causation is at hand. There are countless coincidental correlations out there. I stumbled across a website with some hilarious examples, including the one here!

[Figure: cheese consumption vs. bedsheet deaths]

When viewing a correlation in an observational study, you should form hypotheses to test later. When modelling, you should keep hold-out samples or out-of-time samples on which to test the correlations.
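A rough sketch of why hold-out samples help against coincidence: the code below generates 100 completely independent random series (pure synthetic noise, an assumption for illustration), searches all pairs for the strongest correlation on the first 20 observations, and then re-checks that same pair on the 20 held-out observations. The names `pearson`, `train_corr` and the sample sizes are illustrative choices, not a prescribed method.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
N_SERIES, N_TRAIN, N_HOLDOUT = 100, 20, 20

# Independent noise: by construction, no series is related to any other.
series = [[rng.gauss(0, 1) for _ in range(N_TRAIN + N_HOLDOUT)]
          for _ in range(N_SERIES)]

def train_corr(pair):
    """Absolute correlation of a pair of series on the training window only."""
    i, j = pair
    return abs(pearson(series[i][:N_TRAIN], series[j][:N_TRAIN]))

# Search all 4,950 pairs for the most "impressive" training correlation.
best_i, best_j = max(combinations(range(N_SERIES), 2), key=train_corr)
train_r = train_corr((best_i, best_j))

# Re-test the same pair on data that played no part in the search.
holdout_r = pearson(series[best_i][N_TRAIN:], series[best_j][N_TRAIN:])

print(f"best pair: series {best_i} and {best_j}")
print(f"|correlation| on training window: {train_r:.2f}")
print(f"correlation on hold-out window:   {holdout_r:.2f}")
```

With enough candidate pairs, a striking correlation on the search window is almost guaranteed even in pure noise, which is why the hold-out figure is the one to trust.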


The fifth type is correlation with an unknown cause: sometimes the reason for a correlation is simply not known. Technically (scientifically) nothing is known with 100% certainty, but for some analyses it may be premature to draw any conclusion about the reasons for a correlation.

For more on how scientists fool themselves, I can recommend this article from Nature.

Thomas Maydon
Thomas Maydon is the Head of Credit Solutions at Principa. With over 13 years of experience in the Southern African, West African and Middle Eastern retail credit markets, Tom has primarily been involved in consulting, analytics, credit bureau and predictive modelling services. He has experience in all aspects of the credit life cycle (in multiple industries) including intelligent prospecting, originations, strategy simulation, affordability analysis, behavioural modelling, pricing analysis, collections processes, and provisions (including Basel II) and profitability calculations.
