The Data Analytics Blog

Our news and views relating to Data Analytics, Big Data, Machine Learning, and the world of Credit.

Finding Value In Transaction Data (Part 3)

December 15, 2015 at 3:12 PM

In my previous post, we looked at the first “D” of the 3D approach of identifying and extracting value out of your transaction data: Determination.

If you recall, I proposed a three-step approach (the 3D approach) to realising value from a variety of large data sets:

  1. Determination – scour data sources to establish if and where there might be value
  2. Development – build models for the decision areas where value was identified, using data that is predictive of the predetermined outcome
  3. Deployment – implement and run the developed models

In this post, we will continue looking at the 5 steps of the Determination phase:

  1. Incorporating the data
  2. Aggregating the data
  3. Identifying the target areas of value
  4. Scouring the data for value
  5. Reviewing and planning mini-projects

Let’s now look at steps 4 and 5.

4. Scouring the data for value

The process of data scouring, or prospecting, involves looking at the observation data (such as the aggregated data) and assessing whether it might predict future target areas of value. Essentially, is there a correlation between a historic piece or group of data and a future outcome?

To determine this, analysts will typically assess a regression relationship between the observation characteristic and the outcome (target) variable. Initially the analysis will be univariate (single-variable), and multivariate analysis can follow later.

Univariate analysis involves taking an observation characteristic, creating attribute groups and then calculating the information value. The equation to calculate this is displayed below:

[Figure: Univariate calculation for finding value in big data]
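For reference, the standard definition of information value is sketched below. It is assumed here rather than taken from the original figure, but it is consistent with the column percentages and the strength bands used in this post. The symbols n, p_i and q_i are introduced only for this sketch.

```latex
\[
\mathrm{IV} \;=\; \sum_{i=1}^{n} \left( p_i - q_i \right) \ln\!\left( \frac{p_i}{q_i} \right)
\]
% n   : number of attribute groups of the observation characteristic
% p_i : share of customers WITH the target outcome who fall in group i
% q_i : share of customers WITHOUT the target outcome who fall in group i
```

Each term is zero when the two shares are equal and positive otherwise, so a larger total means the attribute groups separate the two outcomes more strongly. With two attribute groups the information value is simply the sum of two such terms.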

The information value calculated for each field against a given target variable indicates whether the field has value or not. An indicative table of strength values is displayed below.

Range            Strength
0 to 0.02        Non-predictive
0.02 to 0.1      Weak
0.1 to 0.3       Medium
0.3 to 0.5       Strong
0.5 and above    Very Strong
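The strength table can be turned into a small lookup. This is a minimal sketch: the function name `strength` is hypothetical, and because the table does not say which band a boundary value belongs to, the sketch assumes each lower bound is inclusive.

```python
# Upper band edges and labels copied from the strength table.
# Assumption: a value exactly on a boundary falls in the higher band.
BANDS = [
    (0.02, "Non-predictive"),
    (0.1, "Weak"),
    (0.3, "Medium"),
    (0.5, "Strong"),
]


def strength(information_value: float) -> str:
    """Return the strength label for an information value."""
    if information_value < 0:
        raise ValueError("information value cannot be negative")
    for upper_edge, label in BANDS:
        if information_value < upper_edge:
            return label
    return "Very Strong"


for value in (0.01, 0.05, 0.25, 0.4, 0.7):
    print(value, strength(value))
```

Keeping the band edges in one list makes it easy to adjust them if the business prefers different cut-offs.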

An example of the calculation is given below. Here an online retailer wants to understand whether customers who bought books in January 2015 - March 2015 were likely to buy music in April 2015 - June 2015. The rows therefore represent the observation data and the columns represent the outcome, or target, data.


[Table: Does 3 months of book purchases predict a music purchase 3m later? Rows: book purchases, Jan 2015-Mar 2015 (customers with no book purchases, customers with book purchases, all customers). Columns: music purchases, Apr 2015-Jun 2015 (customers with no music purchases, customers with music purchases, all customers).]

This is then converted to column percentages:

[Table: Does 3 months of book purchases predict a music purchase 3m later? Column percentages. Rows: book purchases, Jan 2015-Mar 2015 (no purchases, book purchases, all customers). Columns: music purchases, Apr 2015-Jun 2015 (customers with no music purchases, customers with music purchases).]


The calculations are then run:

[Table: calculation of the terms D1 and D2, with the information value given by the sum of D1 and D2.]

Therefore, book purchases in the first quarter of 2015 had a medium-strength correlation with music purchases in the second quarter of 2015.
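The book and music example can be walked through in code. This is a sketch only: the customer counts are made up for illustration (they are not the figures from the original tables), the function name `information_value` is hypothetical, and the standard information value formula is assumed.

```python
import math


def information_value(counts):
    """Compute per-group terms and their sum.

    counts: one (no_target, target) pair of customer counts per
    attribute group. Assumes every count is greater than zero.
    """
    total_no_target = sum(no for no, _ in counts)
    total_target = sum(yes for _, yes in counts)
    terms = []
    for no, yes in counts:
        if no <= 0 or yes <= 0:
            raise ValueError("every cell needs at least one customer")
        # Column percentages: share of each outcome column in this row.
        share_no_target = no / total_no_target
        share_target = yes / total_target
        terms.append(
            (share_target - share_no_target)
            * math.log(share_target / share_no_target)
        )
    return terms, sum(terms)


# Hypothetical counts: (no music purchase, music purchase)
counts = [
    (6000, 1000),  # customers with no book purchases
    (2000, 1000),  # customers with book purchases
]
terms, iv = information_value(counts)
print("terms per row:", [round(t, 4) for t in terms])
print("information value:", round(iv, 4))
```

With these made-up counts the two row terms sum to roughly 0.27, which falls in the 0.1 to 0.3 band of the strength table.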

This calculation is run for all observation variables against each of the target variables. The results can be displayed in a data-scouring heat map. An example is displayed here.

[Figure: example data-scouring heat map]


The rows comprise the observation (aggregated) characteristics. The columns represent each outcome target variable. The colours represent the degree of correlation according to the mapping table, with red being the strongest and blue being the weakest.
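The grid of numbers behind such a heat map can be sketched as follows. Everything here is illustrative: the flag names, the eight-customer data set and the helper names `iv_binary` and `iv_grid` are hypothetical, each characteristic is simplified to a yes/no flag, and the standard information value formula is assumed. A real data set would be far larger and would yield much smaller values.

```python
import math


def iv_binary(observed, target):
    """Information value of a 0/1 observation flag against a 0/1 target."""
    total_target = sum(target)
    total_no_target = len(target) - total_target
    iv = 0.0
    for group in (0, 1):
        pairs = [t for o, t in zip(observed, target) if o == group]
        with_target = sum(pairs)
        without_target = len(pairs) - with_target
        if with_target == 0 or without_target == 0:
            raise ValueError("empty cell: regroup the attribute first")
        share_target = with_target / total_target
        share_no_target = without_target / total_no_target
        iv += (share_target - share_no_target) * math.log(
            share_target / share_no_target
        )
    return iv


def iv_grid(observations, targets):
    """Rows are observation characteristics, columns are targets."""
    return {
        obs_name: {
            tgt_name: iv_binary(obs, tgt) for tgt_name, tgt in targets.items()
        }
        for obs_name, obs in observations.items()
    }


# Hypothetical flags for eight customers (1 = yes, 0 = no).
observations = {
    "bought_books_q1": [1, 1, 1, 1, 0, 0, 0, 0],
    "bought_games_q1": [1, 0, 1, 0, 0, 1, 0, 1],
}
targets = {
    "bought_music_q2": [1, 1, 1, 0, 1, 0, 0, 0],
    "bought_film_q2": [1, 0, 1, 1, 0, 1, 0, 0],
}

grid = iv_grid(observations, targets)
for obs_name, row in grid.items():
    print(obs_name, {name: round(value, 2) for name, value in row.items()})
```

Each cell of the grid can then be coloured according to the strength table to produce the heat map.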

This exercise, if done well, will set the business up for a well-ordered series of projects to extract value from the data.

Measuring incremental lift

Once the univariate analysis is complete, it is worthwhile assessing how correlated each characteristic is with the others. Correlation analysis is often run to determine this. For example, it may be that “Number of purchases in the last 1m” is positively correlated to someone responding to a marketing offer. Similarly “Number of purchases in the last 3m” may also be positively correlated. The naïve conclusion may be that both characteristics should be used together to predict response to an offer. The reality is that these two characteristics are highly correlated with each other, so the second adds little incremental lift once the first is already being used.

Many statistical techniques and measures can be used to determine the correlation. Ultimately, the characteristics can be grouped into sets of correlated items, which shows how many predictive groups with low correlation to one another are available. This information is important when it comes to determining what type of model (for example, decision tree/segmentation, clusters or scorecard) should be considered in the next phase.
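One simple way to group correlated characteristics is sketched below. It is only one of many possible techniques: it uses the Pearson correlation coefficient and a greedy grouping rule, and the 0.8 threshold, the sample values and the names `pearson` and `group_correlated` are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the divisor is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def group_correlated(characteristics, threshold=0.8):
    """Greedy grouping: a characteristic joins the first group whose
    first member it correlates with (in absolute value) at or above
    the threshold; otherwise it starts a new group."""
    groups = []
    for name, values in characteristics.items():
        for group in groups:
            representative = characteristics[group[0]]
            if abs(pearson(values, representative)) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical values for eight customers.
characteristics = {
    "purchases_last_1m": [0, 1, 2, 1, 3, 0, 4, 2],
    "purchases_last_3m": [1, 3, 5, 2, 8, 0, 9, 5],
    "days_since_signup": [400, 30, 250, 90, 120, 700, 15, 365],
}

groups = group_correlated(characteristics)
print(groups)
```

In this made-up data the two purchase counts land in the same group, so only one of them would normally be carried forward alongside the third characteristic.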

5. Reviewing and planning mini-projects

Once the heat maps are developed the following should be determined:

  1. For each target variable – was there a significant amount of data that strongly predicts the target?
  2. Could this data be used to create models, such as scores, segmentation or clustering? The variables identified should be tested for collinearity/correlation to determine whether each strong characteristic adds value beyond the others.
  3. Can the data merging, aggregation and model deployment be successfully accomplished given current hardware and software constraints? If not, what needs to change and is the business ready for it?
  4. Which exercise is likely to produce the most value?
  5. Should the business run projects in parallel sprints or in a relay (one-at-a-time) fashion?


The Determination step gives data analysts and risk, loyalty and marketing managers a very good insight into which fields might be valuable and for what purposes. In my next blog post, I’ll take you through the next phase of finding value in your transaction data: Development.

Thomas Maydon
Thomas Maydon is the Head of Credit Solutions at Principa. With over 17 years of experience in the Southern African, West African and Middle Eastern retail credit markets, Tom has primarily been involved in consulting, analytics, credit bureau and predictive modelling services. He has experience in all aspects of the credit life cycle (in multiple industries) including intelligent prospecting, originations, strategy simulation, affordability analysis, behavioural modelling, pricing analysis, collections processes, and provisions (including Basel II) and profitability calculations.
