The Data Analytics Blog

Our news and views relating to Data Analytics, Big Data, Machine Learning, and the world of Credit.


The Top Predictive Analytics Pitfalls To Avoid

November 24, 2016 at 8:20 AM

[Header image: three cartoon businessmen walking towards the edge of a cliff, the one in front blindfolded]

Predictive Analytics can yield amazing results. The lift achieved by basing future decisions on observed patterns in historical events can far outweigh anything achieved by relying on gut feel or being guided by anecdotal events. There are numerous examples demonstrating this lift across virtually every industry; a test we ran recently in the retail sector showed that applying stable predictive models gave us a five-fold increase in take-up of the product compared against a random sample. Let’s face it: there would not be so much focus on Predictive Analytics, and in particular Machine Learning, if it were not yielding impressive results.

But predictive models are not bulletproof. They can be a bit like racehorses: somewhat sensitive to changes, and with a propensity to leave the rider on the ground wondering what on earth just happened.

Download our guide to using machine learning in business, where we explore how you can use machine learning to better tap into your business data and gain valuable, actionable insights to improve your business revenue.

The commoditisation of Machine Learning is making data science more accessible to the non-data-scientists of the world than ever before. With this in mind, my colleague and I sat and pondered, and we devised the following list of top predictive analytics pitfalls to avoid in order to keep your models performing as expected:

  • Making incorrect assumptions about the underlying training data.
    Rushing in and making too many assumptions about the underlying training data can often lead to egg on the proverbial face. Take time to understand the data and the trends in the distributions, missing values, outliers, etc. (a quick profiling sketch follows this list).
  • Working with low volumes.
    Low volumes are the data scientist’s unhappy place – they lead to statistically weak, unstable and unreliable models.
  • The over-fitting chestnut.
    In other words, creating a model with so many branches that it appears to discriminate the target variable better, but falls over in the real world because it has learned the noise in the training data (see the overfitting sketch after this list).
  • Bias in the training data.
    For example, you only offered a certain product to the Millennials. So, guess what? The Millennials are going to come through strongly in the model.
  • Including test data in the training data.
    There have been a few epic fails where the test data was included in the training data – giving the impression that the model will perform fantastically when in reality it is broken. In the predictive analytics world, if the results look too good to be true, it is worth spending more time on your validations and even getting a second opinion to check over your work (see the leakage sketch after this list).
  • Not being creative with the provided data.
    Predictive models can be significantly improved by creating clever characteristics or features that better explain the trends in the data. Too often data scientists work only with what has been provided and do not spend enough time deriving more creative features from the underlying data – features that can strengthen a model in ways an improved algorithm cannot (a small feature-engineering sketch follows this list).
  • Expecting machines to understand business.
    Machines cannot (yet) figure out what the business problem is or how best to tackle it. This is not always straightforward and can require careful thought, involving in-depth discussions with the business stakeholders.
  • Using the wrong metric to measure the performance of a model.
    For example, out of 10,000 cases you have only two that are fraudulent and 9,998 that are not. If the performance metric used in model training is plain accuracy, the model will simply attempt to maximise accuracy. So if it predicts all 10,000 cases to be non-fraud, it achieves an accuracy of 99.98%, which seems amazing but serves no purpose in identifying fraud – it merely identifies 99.98% of the non-fraud instances correctly. For rare-event modelling (of which fraud is a good example), alternative metrics and approaches need to be applied (see the metric sketch after this list).
  • Using plain linear models on non-linear interactions.
    This commonly happens when, for example, building binary classifiers and logistic regression is chosen as the preferred method, when in reality the relationships between the features and the target are not linear. Tree-based models or support vector machines work better in such cases (see the non-linearity sketch after this list). Not knowing which methods are applicable to which problems results in poor models and subsequent predictions.
  • Forgetting about outliers.
    Outliers usually deserve special attention or should be ignored entirely. Some modelling methods are extremely sensitive to outliers, and forgetting to remove or cater for them can cause poor performance in your model.
  • Performing regularisation without standardisation.
    Many practitioners do not realise that applying regularisation to a model’s features without first standardising the data, so that everything is on the same scale, biases the result: the penalty hits features on smaller scales harder. For example, a feature on a scale of 3,000 – 10,000 will be treated very differently from one on a scale of 0 – 1 or one ranging from -9,999 to 9,999 (see the standardisation sketch after this list).
  • Not taking into account the real-time scoring environment.
    Practitioners can get so caught up in building the most perfect model that, when it comes to deployment, the model is too complex to be integrated into the operational system.
  • Using characteristics that will not be available in the future, due to operational reasons.
    One may identify a very predictive characteristic (like gender), but due to regulations the field cannot be used in modelling, or the capturing of the field has been suspended and it will not be available to the model in the future.
  • Not considering the real-world implications and possible fallout of applying effective predictive analytics.
    American retailer Target made headlines four years ago when New York Times reporter Charles Duhigg brought to the public’s attention the now famous incident of Target’s analytics models predicting a teenager’s pregnancy before her father knew.  As some have pointed out, just because you can, doesn’t mean you should.
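For pitfall 1, here is a minimal profiling sketch in Python (pandas) of the kind of checks worth running before making assumptions about training data; the file name and the "income" column are hypothetical placeholders.

```python
# Pitfall 1: profile the training data before assuming anything about it.
# "training_data.csv" and the "income" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("training_data.csv")

print(df.describe(include="all"))                      # distributions and basic statistics
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column

# Flag potential outliers in a numeric column with a simple IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
print(f"{mask.sum()} potential outliers in 'income'")
```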
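For pitfall 3, a short overfitting sketch on synthetic data: an unconstrained decision tree scores almost perfectly on its own training data but generalises worse to held-out data than a shallower tree. The data and scores are illustrative only.

```python
# Pitfall 3: an unconstrained tree memorises noise; compare train vs. hold-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (3, None):  # None lets the tree grow as many branches as it likes
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```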
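For pitfall 5, the simplest protection against leakage is to split the data before any fitting or preprocessing, so nothing from the test rows ever reaches the training step. A sketch, reusing the synthetic X and y from the previous block:

```python
# Pitfall 5: keep the test set out of everything that is fitted.
# X, y are the synthetic arrays generated in the previous sketch.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)                   # fit on the training rows only
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# The test set is only transformed and scored -- never used for fitting
print("hold-out accuracy:", model.score(scaler.transform(X_test), y_test))
```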
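For pitfall 6, a tiny feature-engineering illustration: deriving new characteristics from raw fields rather than modelling them as-is. The column names and values are made up for the example.

```python
# Pitfall 6: engineer features from the raw fields rather than modelling them as-is.
# The columns and values below are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "balance":       [1200, 300, 4500],
    "credit_limit":  [2000, 1000, 5000],
    "last_purchase": pd.to_datetime(["2016-09-01", "2016-11-15", "2016-06-30"]),
})

df["utilisation"] = df["balance"] / df["credit_limit"]                               # ratio feature
df["days_since_purchase"] = (pd.Timestamp("2016-11-24") - df["last_purchase"]).dt.days  # recency feature
print(df)
```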
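For pitfall 8, the fraud example can be reproduced in a few lines: a model that never predicts fraud scores 99.98% accuracy yet has zero recall and zero precision on the fraud class.

```python
# Pitfall 8: accuracy is the wrong yardstick for rare events.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.zeros(10_000, dtype=int)
y_true[:2] = 1                              # only two fraudulent cases
y_pred = np.zeros(10_000, dtype=int)        # a "model" that always predicts not-fraud

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.9998
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
```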
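For pitfall 9, a quick comparison on a dataset with a non-linear decision boundary shows a plain logistic regression lagging behind a tree ensemble; the dataset is synthetic and the gap will vary.

```python
# Pitfall 9: a linear model struggles where the class boundary is non-linear.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=1000, noise=0.25, random_state=0)

for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{type(model).__name__}: {score:.3f}")
```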
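For pitfall 11, putting the scaler and the regularised model into one pipeline keeps features on wildly different scales (0 – 1 next to 3,000 – 10,000) from being penalised unevenly. A sketch with made-up scales and a toy target:

```python
# Pitfall 11: standardise before regularising so the penalty treats features fairly.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(3_000, 10_000, 500),   # large-scale feature
    rng.uniform(0, 1, 500),            # small-scale feature
    rng.uniform(-9_999, 9_999, 500),   # wide-range feature
])
y = (X[:, 1] > 0.5).astype(int)        # toy target driven by the small-scale feature

model = make_pipeline(
    StandardScaler(),                           # puts every feature on the same scale
    LogisticRegression(penalty="l2", C=1.0),    # the L2 penalty is now applied fairly
)
model.fit(X, y)
print("training accuracy with scaling:", model.score(X, y))
```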

We’ve collectively learned some valuable lessons using predictive analytics over the years. Why not take advantage? If you’d like some further guidance in navigating the predictive analytics field, drop us a line and get in touch! We’d be happy to meet up for an informal chat over some coffee, share some knowledge and learn about your predictive analytics projects and plans.


Robin Davies
Robin Davies was the Head of Product Development at Principa for many years, during which time his team packaged complex concepts into easy-to-use products that help our clients lift their business in often unexpected ways. Robin is currently the Head of Machine Learning at a prestigious firm in the UK.

Latest Posts

The 7 types of credit risk in SME lending

It is common knowledge in the industry that the credit risk assessment of a consumer applying for credit is far less complex than that of a business applying for credit. Why is this the case? Simply put, consumers are usually very similar in their requirements and risks (homogeneous), whilst businesses have far more varied risk elements (heterogeneous). In this blog we will look at the different risk elements within a business (here SME) credit application. These are:

  • Risk of proprietors
  • Risk of business
  • Reason for loan
  • Financial ratios
  • Size of loan
  • Risk of industry
  • Risk of region

Before we delve into this list, it is worth noting that all of these factors need to be deployable as assessment tools within your originations system, so it is key that you ensure your system can manage them. If you are on the lookout for a loans origination system, then look no further than Principa’s AppSmart. If you are looking for a decision engine to manage your scorecards, policy rules and terms of business, then take a look at our DecisionSmart business rules engine. AppSmart and DecisionSmart are part of Principa’s FinSmart Universe, allowing for effective credit management across the customer life-cycle.

The different risk elements within a business credit application

1) Risk of proprietors

For smaller organisations the risk of the business is inextricably linked to the financial well-being of the proprietors. How small is small? The rule of thumb is that companies with up to two to three proprietors should have their proprietors assessed for risk too; this fits in with the SME segment. What data should be looked at? Generally, in countries with mature credit bureaux, credit data is considered, including the score (there is normally a score cut-off) and negative information such as the existence of judgements or defaults; these are typically used within policy rules. Businesses whose proprietors have excessive numbers of “negatives” may be disqualified from the loan application. Some credit bureaux offer a score of an individual based on the performance of all the businesses with which they are associated, which can also be useful in the credit risk assessment process. Another innovation being adopted internationally is the use of psychometrics in the credit evaluation of the proprietors. To find out more about adopting credit scoring, read our blog on how to adopt credit scoring.

2) Risk of business

The risk of the business should be managed through both scores and policy rules. Lenders will look at information such as the age of the company, the experience of directors and the size of the company within a score. Alternatively, many lenders utilise the business score offered by credit bureaux. These scores are typically not as strong as consumer scores, as the underlying data is limited and sometimes problematic. For example, large successful organisations may have judgements registered against their name which, unlike for consumers, is not necessarily a direct indication of an inability to service debt.

3) Reason for loan

The reason for a loan is used more widely in business lending than in unsecured consumer lending. Venture capital, working capital, invoice discounting and bridging finance are just some of the many types of loans/facilities available, and lenders need to equip themselves with the ability to manage each of these customer types, whether within originations or collections. Prudent lenders venturing into the SME space for the first time often focus on one or two of these loan types and expand later, as the operational implications of each type of loan are complex.

4) Financial ratios

Financial ratios are core to commercial credit risk assessment. The main challenge here is to ensure that reliable financials are available from the customer. Small businesses may not be audited and thus the financials may be less trustworthy.

Financial ratios can be divided into four categories: profitability, leverage, coverage and liquidity (a short worked sketch follows at the end of this article).

Profitability can be further divided into margin ratios and return ratios. Lenders are frequently interested in gross profit margins; this is normally explicit on the income statement. The EBITDA margin and operating profit margin are also used, as well as return ratios such as return on assets, return on equity and risk-adjusted returns.

Leverage ratios are useful to lenders as they reflect the portion of the business that is financed by debt. Lower leverage ratios indicate stability. Leverage ratios assessed often incorporate debt-to-asset, debt-to-equity and asset-to-equity.

Coverage ratios indicate the coverage that income or assets provide for the servicing of debt or interest expenses. The higher the coverage ratio, the better it is for the lender. Coverage ratios are worked out considering the loan/facility being applied for.

Finally, liquidity ratios indicate the ability of a company to convert its assets into cash. There are a variety of ratios used here. The current ratio is simply the ratio of current assets to current liabilities. The quick ratio is the ability of the business to pay off its current debts with readily available assets. The higher the liquidity ratios, the better. Ratios are used both within credit scorecards and within policy rules. You can read more about these ratios here.

5) Size of loan

When assessing credit risk for a consumer, the risk of the consumer does not normally change with the loan amount or facility (subject to the consumer passing affordability criteria). With business loans, loan amounts can range quite dramatically, and the risk of the applicant is normally tied to the loan amount requested. The loan/facility amount will of course change the ratios mentioned in the previous section, which could affect a positive or negative outcome. The outcome of the loan application is usually directly linked to a loan amount, and any marked change to this amount would change the risk profile of the application.

6) Risk of industry

The risk of the industry in which the SME operates can have a strong relationship with the entity’s ability to service the debt. Some lenders use this, and those who do not normally identify it as a missing element in their risk assessment process. The identification of industry is always important: if you are in manufacturing but your clients are the mines, then you are perhaps better identified as operating in mining rather than manufacturing. Most lenders who assess industry will periodically rule out certain industries and perhaps also incorporate industry within their scorecard. Others take a more scientific approach: the performance of an industry is tracked over two years and projected over the next six months, then compared to the country’s GDP. Where the industry tracks above the projected GDP, a positive outlook is given and this may affect the applicant favourably in the credit application.

7) Risk of region

The last area of assessment is risk of region. Of the seven, this one is used the least. Here businesses, either on book or on the bureau, are assessed against their geo-code. Each geo-code is clustered, and the projected outlook is given as positive, static or negative. As with industry, this can be used within the assessment process as a policy rule or within a scorecard.

Bringing the seven risk categories together in a risk assessment

These seven risk assessment categories are all important in the risk assessment process. How you bring it all together is critical. If you would like to discuss your SME evaluation challenges or find out more about what we offer in credit management software (like AppSmart and DecisionSmart), get in touch with us here.
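To make the financial-ratio section concrete, here is a minimal sketch computing one or two ratios from each category. All figures, field names and values are hypothetical, invented purely for illustration.

```python
# A worked sketch of the four ratio categories (profitability, leverage,
# coverage, liquidity). All figures below are hypothetical.
financials = {
    "gross_profit": 400_000, "revenue": 1_000_000,
    "total_debt": 600_000, "total_assets": 1_500_000,
    "ebit": 250_000, "interest_expense": 50_000,
    "current_assets": 500_000, "current_liabilities": 250_000,
    "cash_and_receivables": 350_000,
}

ratios = {
    "gross_margin":      financials["gross_profit"] / financials["revenue"],                   # profitability
    "debt_to_assets":    financials["total_debt"] / financials["total_assets"],                # leverage
    "interest_coverage": financials["ebit"] / financials["interest_expense"],                  # coverage
    "current_ratio":     financials["current_assets"] / financials["current_liabilities"],     # liquidity
    "quick_ratio":       financials["cash_and_receivables"] / financials["current_liabilities"],  # liquidity
}

for name, value in ratios.items():
    print(f"{name:17s} {value:.2f}")
```

In practice these ratios would feed a scorecard or policy rules alongside the other six risk elements described above.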

Collections Resilience post COVID-19 - part 2

Principa Decisions (Pty) Ltd

Collections Resilience post COVID-19

Principa Decisions (Pty) Ltd