The Data Analytics Blog

Our news and views relating to Data Analytics, Big Data, Machine Learning, and the world of Credit.


P-Hacking - Are You Guilty Of Data Fishing?

April 11, 2018 at 8:04 AM

A year ago, I published an article about motivated reasoning and how that can damage the data analytics process. It is part of a blog series on cognitive biases and logical fallacies that data analysts should avoid. Today I’d like to extend this conversation into a topical matter: p-hacking, also known as data fishing.

To understand p-hacking, let's first look at the p-value.

Leaving statistical jargon aside, we use the p-value to gauge how confident we should be in a conclusion drawn from a sample. The challenge comes when observations from that sample are extrapolated to the entire population. You may have seen this in election polls, where a small set of observations is used to infer how the whole population voted; the p-value indicates how significant a conclusion may be. Essentially, it depends on the size of the sample relative to the population and the size of the trend (the percentage breakdown of votes).
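To make this concrete, here is a minimal sketch of how a p-value for a poll proportion can be computed under the usual normal approximation. The poll numbers and the function name are made up for illustration; only Python's standard library is used:

```python
import math

def poll_p_value(n_voters, n_for_candidate, p_null=0.5):
    """Two-sided p-value for a poll proportion against a null share,
    using the normal approximation to the binomial distribution."""
    p_hat = n_for_candidate / n_voters
    se = math.sqrt(p_null * (1 - p_null) / n_voters)
    z = (p_hat - p_null) / se
    # two-sided tail probability of a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical poll: 540 of 1,000 respondents favour candidate A.
# The small p-value suggests the 54% share is unlikely to be a
# sampling fluke around a true 50/50 split.
p = poll_p_value(1000, 540)
```

Note how the p-value shrinks as either the sample grows or the trend strengthens, matching the intuition above.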

Now when we observe a study (a medical trial, crime statistics, voting statistics), we should always look out for the p-value. However, in recent times statisticians have noted that many studies citing "highly reliable" p-values (usually p < 0.05) have been subject to the dubious practice of "p-hacking", or data dredging.

What is p-hacking?

P-hacking is the practice of fishing through data for correlations with a p-value below your chosen threshold and reporting them as statistically significant (without assessing causality).

P-hacking example:

  1. We run a campaign offering new cell-phone contracts to a large group of individuals (let’s say 100,000)
  2. A proportion take up the offer (let’s say 1,000)
  3. We then look back at the demographic information (let’s say 300 demographic fields) on all 100,000 individuals to see whether anything looks predictive of the customer taking up an offer.
  4. We happen to find that those with kids aged 5 to 7 are three times more likely to take up the offer than the rest of the population.
  5. The associated p-value is measured at 0.01 (i.e. there is only a 1% chance of seeing a trend this strong by chance alone)

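The dredging step itself is easy to simulate. The sketch below (scaled down from the figures in the example, with made-up field and variable names) generates demographics that are pure noise, completely unrelated to take-up, and then counts how many of them still clear a p < 0.01 threshold:

```python
import math
import random

random.seed(1)
N, FIELDS, ALPHA = 5_000, 300, 0.01

# Take-up is random and independent of every field, so any
# "significant" field found below is an anomaly by construction.
took = [random.random() < 0.05 for _ in range(N)]

def two_sided_p(k, n, p0):
    """Normal-approximation p-value for k successes in n trials vs. rate p0."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (k / n - p0) / se
    return math.erfc(abs(z) / math.sqrt(2))

base_rate = sum(took) / N
hits = 0
for _ in range(FIELDS):
    flag = [random.random() < 0.5 for _ in range(N)]  # pure-noise demographic flag
    segment = [t for t, f in zip(took, flag) if f]    # take-up within the segment
    if two_sided_p(sum(segment), len(segment), base_rate) < ALPHA:
        hits += 1

print(f"{hits} of {FIELDS} noise fields look 'significant' at p < {ALPHA}")
```

Running this typically flags a handful of fields, even though every single one is noise, which is exactly the trap the example describes.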
A p-value of 0.01 appears impressive, but the analyst has made a crucial mistake. Although 1% sounds small, he or she has actually dredged through each of the 300 demographic fields until stumbling upon what is, in all likelihood, an anomaly. Reporting this as a determinant factor is p-hacking.

As a point of interest: if we test 70 fields at a p-value threshold of 0.01, there is more than a 50% chance that at least one field will appear correlated purely due to "noise" in the data. At a threshold of 0.05, just 14 fields are enough.

How do we prevent p-hacking?

Firstly, the trend in the example may indeed be true, but to know this, we need to run further tests. One such test is to keep a hold-out sample and check whether the trend holds true in that sample too (we do this as standard when modelling).
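As an illustration of the hold-out idea, here is a minimal sketch (the function name, split fraction, and seed are our own choices, not a prescribed standard). A trend dredged up in the modelling sample only deserves attention if it reappears in the hold-out:

```python
import random

def split_holdout(records, holdout_fraction=0.3, seed=7):
    """Randomly split records into a modelling sample and a hold-out sample.
    Dredge for trends only in the modelling sample; use the untouched
    hold-out to test whether a discovered trend survives."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

# 100,000 customer IDs, as in the campaign example above.
model_sample, holdout = split_holdout(list(range(100_000)))
```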

While machine learning techniques may give you models quickly, unless sufficient attention is given to validation, you may end up with models built on anomalies.


Another worthwhile consideration is to try to understand the logic behind the relationship (i.e. "is there any reason why a consumer with kids aged 5 to 7 would be so much more likely to purchase our product than others?"). If not, exercise additional caution.

Another consideration is to stay abreast of other industries and see how they deal with such issues. In the medical world, "observational studies" are done to look for correlations or patterns. A hypothesis (not a conclusion) is then formed, and further studies are done to determine whether the hypothesis holds. This is done not as a retrospective observational study, but as a forward-looking statistical test.

"False facts are highly injurious to the progress of science, for they often long endure; but false views, if supported by some evidence, do little harm, as everyone takes a salutary pleasure in proving their falseness; and when this is done, one path towards error is closed and the road to truth is often at the same time opened." – Charles Darwin "The Descent of Man" (1871)

As data scientists, we need to be continuously aware of the risk of fooling ourselves. Feel free to fish for trends, but don't be lured by the bait of seemingly significant results. Express your results responsibly. For more on this, I can recommend this excellent article in Nature and the embedded video from Veritasium.

For more information on how Principa’s Data Scientists might help your organisation, get in touch with us.


Thomas Maydon
Thomas Maydon is the Head of Credit Solutions at Principa. With over 13 years of experience in the Southern African, West African and Middle Eastern retail credit markets, Tom has primarily been involved in consulting, analytics, credit bureau and predictive modelling services. He has experience in all aspects of the credit life cycle (in multiple industries) including intelligent prospecting, originations, strategy simulation, affordability analysis, behavioural modelling, pricing analysis, collections processes, and provisions (including Basel II) and profitability calculations.
