
Data in audit - reducing noise, false positives

October 13, 2019

When using data, one of the main challenges faced by auditors is the volume of exceptions generated. How can we overcome this?

Traditional audit sampling typically involves evaluating between 5 and 50 items, so the number of exceptions can never exceed that range.

However, when we use larger sets of data - across full populations - the number of exceptions produced can be high.

The sheer volume can mean:

  • Difficulty in eliminating false positives - noise!
  • Unnecessary pain for business folk (who must deal with the mountain)
  • Audit teams losing faith in the use of data for audits


There are various approaches to overcome this, e.g. progressively categorising results into specific buckets based on key characteristics, reviewing a sample of those, and then extrapolating the results of the reviewed sample to the remainder of the population of exceptions.
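As a rough illustration of the bucket, sample and extrapolate approach described above, here is a minimal Python sketch. The record layout, the `bucket_key` function and the `review` function (which stands in for an auditor's manual review) are all assumptions made for the example.

```python
import random
from collections import defaultdict

def extrapolate_by_bucket(exceptions, bucket_key, review, sample_size, seed=0):
    """Estimate genuine exceptions per bucket from a reviewed sample.

    exceptions: list of records (hypothetical layout)
    bucket_key: function mapping a record to its bucket label
    review: function returning True when a reviewed record is a genuine exception
    """
    rng = random.Random(seed)

    # Group the exceptions into buckets based on a key characteristic.
    buckets = defaultdict(list)
    for record in exceptions:
        buckets[bucket_key(record)].append(record)

    estimates = {}
    for label, records in buckets.items():
        # Review only a sample of each bucket.
        sample = rng.sample(records, min(sample_size, len(records)))
        genuine_rate = sum(review(r) for r in sample) / len(sample)
        # Extrapolate the sample result to the whole bucket.
        estimates[label] = {
            "total": len(records),
            "reviewed": len(sample),
            "genuine_rate": genuine_rate,
            "estimated_genuine": round(genuine_rate * len(records)),
        }
    return estimates

# Hypothetical data: 40 branch exceptions (all noise), 10 online (all genuine).
exceptions = (
    [{"channel": "branch", "genuine": False} for _ in range(40)]
    + [{"channel": "online", "genuine": True} for _ in range(10)]
)
result = extrapolate_by_bucket(
    exceptions,
    bucket_key=lambda r: r["channel"],
    review=lambda r: r["genuine"],
    sample_size=5,
)
```

In practice the buckets are rarely this clean, which is why the size of the sample per bucket matters.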

However, when the key characteristics can't be easily categorised - e.g. when the exceptions are based on both structured and unstructured data - the traditional approach doesn't quite fit.

An alternative that generally works well uses machine learning techniques. The approach is similar - but uses different techniques - and can significantly reduce the number of false positives, while:

  • providing business stakeholders with relevant information to consider
  • providing audit stakeholders with the comfort that we did not simply discard results but used a well-structured and defensible approach to create a manageable set of exceptions for follow-up.


An example of how this is applied



A home loan (mortgage) assurance review for a financial services organisation.

The organisation's Internal Audit function is relatively small - fewer than ten FTE - but progressive, punching significantly above its weight, and respected by stakeholders.

The team decided to take a data-driven approach, opting to cover ALL accounts and transactions for just over one year.

Just over 800m records.

KNIME - an open source analytics platform - was used to analyse the data across various data sets, including:

  • customer and account master files
  • offset account links
  • account transactions
  • customer relationship management (CRM)

Because the CRM data was primarily free text, we used a set of natural language processing (NLP) techniques, providing a level of structure, and then blended the processed data with the other structured data sets.
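To give a feel for what "providing a level of structure" can mean, here is a minimal Python sketch that turns free-text notes into word-count columns and joins them to structured account records. This is a simple bag-of-words approach chosen for illustration, not a description of the KNIME workflow used in the review. The field names (`account_id`, `balance`) and the sample notes are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a note and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def notes_to_features(notes):
    """Turn free-text notes into one word-count row per account.

    notes: list of (account_id, text) pairs; several notes may share an account.
    Returns (vocabulary, {account_id: [count per vocabulary word]}).
    """
    counts = {}
    for account_id, text in notes:
        counts.setdefault(account_id, Counter()).update(tokenize(text))
    vocabulary = sorted({word for c in counts.values() for word in c})
    rows = {acc: [c[word] for word in vocabulary] for acc, c in counts.items()}
    return vocabulary, rows

def blend(accounts, vocabulary, rows):
    """Join the word-count columns onto structured records by account_id."""
    blended = []
    for account in accounts:
        # Accounts with no notes get a row of zeros.
        word_counts = rows.get(account["account_id"], [0] * len(vocabulary))
        merged = dict(account)
        merged.update({f"word_{w}": n for w, n in zip(vocabulary, word_counts)})
        blended.append(merged)
    return blended

# Hypothetical CRM notes and account master data.
crm_notes = [
    ("A1", "Customer asked to link offset account"),
    ("A1", "Offset link confirmed by phone"),
    ("A2", "Change of address"),
]
accounts = [
    {"account_id": "A1", "balance": 250000},
    {"account_id": "A2", "balance": 90000},
]
vocabulary, rows = notes_to_features(crm_notes)
table = blend(accounts, vocabulary, rows)
```

Every distinct word becomes a column, which is how a modest amount of free text can quickly grow into thousands of features.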

With the data in a format that could be used easily, we performed several analyses.

This included:

  1. Recalculations - full population recalculation of transactions
  2. Master data quality checks (e.g. duplicate member records)
  3. Profiling and understanding manual adjustments
  4. Checking the application of product rules
  5. Confirming adherence to customer instructions, including offset account links

Most of these don’t need much explanation. But why offset account links?

Let’s explain why we decided to do this and the challenges that we faced.



Some customers have multiple deposit and loan accounts.

Linking those accounts can save money by consolidating credit and debit balances. This is typically referred to as an offset mortgage. It is common in Australia and the United Kingdom, and differs from the “All-in-one” mortgage in the United States.

It works something like this:

Lending rates are usually higher than deposit rates, so offsetting saves money.

This is popular within the industry, as the saving is not trivial. In the simple but common example above, the saving is almost 10%.  The larger the deposit balance, the larger the difference.
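The arithmetic behind an offset can be sketched in a few lines of Python. The balances and rates below are hypothetical and are not the figures from the example referred to above. The calculation is deliberately simplified: one year, no compounding, no repayments, no tax.

```python
def annual_net_interest(loan, deposit, loan_rate, deposit_rate, offset_linked):
    """Simplified one-year net interest cost for a customer."""
    if offset_linked:
        # The deposit balance reduces the balance that loan interest is charged on.
        return max(loan - deposit, 0) * loan_rate
    # Standalone accounts: pay interest on the loan, earn interest on the deposit.
    return loan * loan_rate - deposit * deposit_rate

# Hypothetical figures, chosen only to illustrate the mechanics.
loan, deposit = 400_000, 40_000
loan_rate, deposit_rate = 0.05, 0.01

unlinked = annual_net_interest(loan, deposit, loan_rate, deposit_rate, offset_linked=False)
linked = annual_net_interest(loan, deposit, loan_rate, deposit_rate, offset_linked=True)
saving = unlinked - linked
saving_pct = 100 * saving / unlinked
```

With these assumed numbers the saving equals the deposit balance multiplied by the gap between the two rates, so it grows in step with the deposit balance.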

But the linking can easily fail because most banking systems were not originally built to deal with this type of relationship between accounts.  They generally work well with standalone home loans, or standalone deposit accounts. But combining them often means a patchy workaround with some scenarios that have not been envisaged or properly tested.

There has been a fair level of regulatory (and media) attention on such failures over the past few years, with hefty infringement penalties and costs. One example is the AUD12m that one bank had to pay to customers. There would have been separate associated costs, e.g. relating to the calculation of the refunds.

So for our project, we decided to check whether offsets were established properly.  This means identifying expected offset links and then comparing those to the actual links that were in operation.

If we expected to see an offset link (identified for example in a customer interaction or complaint), and the link hasn’t been established, we would then need to investigate this as an exception.
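In its simplest form, the comparison of expected links to actual links is a set difference. The sketch below assumes each link can be represented as a (loan account, deposit account) pair; the account identifiers are made up for the example.

```python
def find_missing_offset_links(expected_links, actual_links):
    """Return expected (loan, deposit) pairs that have no matching link in operation."""
    actual = set(actual_links)
    return sorted(pair for pair in set(expected_links) if pair not in actual)

# Hypothetical data: links we expect (e.g. from customer interactions)
# versus links actually set up in the banking system.
expected_links = [("L-001", "D-101"), ("L-002", "D-102"), ("L-003", "D-103")]
actual_links = [("L-001", "D-101"), ("L-003", "D-999")]

potential_exceptions = find_missing_offset_links(expected_links, actual_links)
```

Here `L-002` has no link at all and `L-003` is linked to a different deposit account, so both pairs come back as potential exceptions to investigate.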



We found thousands of potential exceptions.

We expected that many of the exceptions would turn out to be false positives, but we couldn’t possibly investigate them all to find the real exceptions.


How can we find the needles in that haystack? 


Traditional solution

The traditional approaches would typically look like one of these:

  1. select and test a random sample
  2. profile a representative sample (e.g., in a spreadsheet) to find common characteristics, to manually identify and eliminate false positives.

Why not opt for the traditional approaches?

  1. random sampling may have merit in some situations, but now that techniques to better target anomalies are available, relying on it alone is hard to defend and doesn't add real value
  2. representative profiling can work with structured data, for a smaller set of features (columns) - but the free text data translated to over 5,000 columns, so this would not be feasible.


Alternate solution

The software that we were using has strong predictive modelling capability. So, we decided to use it.

This is the process; it sounds complicated, but is not difficult to implement:
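The general idea behind this kind of predictive triage can be sketched in Python: review and label a sample of exceptions, train a model on that sample, then let the model score the rest. The sketch below is an assumption about how such a process might look, not the team's actual KNIME workflow. It uses a tiny naive Bayes text classifier as a stand-in technique, and the notes and labels are hypothetical.

```python
import math
from collections import Counter

class NaiveBayesTriage:
    """Tiny multinomial naive Bayes over word tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def _log_score(self, text, label):
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        # Start from the class prior, then add smoothed word likelihoods.
        score = math.log(self.doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in self.vocabulary:  # words never seen in training are skipped
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
        return score

    def predict(self, text):
        return max(self.classes, key=lambda label: self._log_score(text, label))

# Step 1: a manually reviewed and labelled sample (hypothetical).
reviewed_notes = [
    "customer requested offset link not actioned",
    "complaint offset not applied to loan",
    "customer enquired about offset product only",
    "general enquiry about offset rates",
]
reviewed_labels = ["genuine", "genuine", "false_positive", "false_positive"]

# Step 2: train on the reviewed sample.
model = NaiveBayesTriage().fit(reviewed_notes, reviewed_labels)

# Step 3: score the unreviewed exceptions and keep the likely genuine ones.
unreviewed = ["offset link requested but not applied", "enquiry about rates"]
follow_up = [note for note in unreviewed if model.predict(note) == "genuine"]
```

A real exercise would use far more labelled examples and would hold some of them back to measure accuracy before trusting the model to discard anything.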

The result

More than 90% of the exceptions were eliminated as false positives, leaving a manageable set of a few hundred to investigate.

Model accuracy was approximately 70%. Such a process is rarely going to be 100% accurate, but this is certainly better than random sampling alone. It can be defended.

Remember that this was achieved by a relatively small Internal Audit team.


Tools, approaches and techniques to improve the use of data within audit are now readily available. Are you using them?

