Small datasets for audits: 5 ways to extract value

February 28, 2021

Is big data more valuable than smaller data?

What do we even mean when we ask this question?

The thing is, there’s no real definition for either.

The term “big data” is now quite dated. It’s been around for nearly a decade but has never been defined.

Some may say that "small data" is the opposite of "big data".

Others define "small data" as a subset of "big data".


It all can be very confusing. 

So let’s set terminology aside and focus on how we, as auditors, can derive value from smaller datasets.


It is easy to see why larger datasets can be used to generate value. If we have a bigger data set to work with, we're comfortable with the quality of the data, and we have defined our approach well, it can easily generate audit value.


Equally, if we have a smaller high-quality dataset to work with, and a solid approach, we can use it to generate audit value.


But how exactly can we generate value for our audits with smaller data sets?

This article outlines five ways to do that.


Why is this important for auditors?

With the marketing hype (and FOMO) generated around the term “big data”, the perceived value of smaller datasets has been diluted.

But, as auditors, we often only have smaller data sets to work with.


Does that mean that the resulting work will be less valuable?

No.

Does it mean that we need to somehow find bigger data to produce valuable results?

Again, no.


Smaller datasets can be extremely useful in audits, particularly where the data is already of high quality and integrity.



Five ways to extract value


1. Extend the timeframe of your data

Do we have data that we can work with over a longer timeframe?

Let’s say we have data for a week or a month. Or even a year. Because the data set is not very big, we can easily work with data for a longer timeframe. This enables a different perspective. It can also increase the reliability of our analysis.


For example, suppose we are looking at data that has one transaction or event per day. We then aren't limited by technology, or by the time that it would take to do the analysis. With one transaction per day, data for 10 years would only be around 3,650 records. Unless you have a very wide dataset (many columns), this is very manageable.

One ancillary benefit of extending the timeframe is that it can reveal inconsistencies: the underlying processes may not have stayed the same over that period. The data may not all look the same, and what the data represents may vary. The longer timeframe allows us to see, through the data, what changed over time and how the processes changed.

In some cases, you might find that the process has gone backwards and is a little worse than it was earlier on. Or you may find that the process is a bit shorter and more efficient, or a bit longer and more effective.


Let’s say we have an audit that covers a six-month period. We have a small quantity of data that covers this period. Is it useful or reasonable to go back further than that six-month period, even though that's prior to the audit period?


We often want to do this, for various reasons. For example:

  1. It helps us compare the audit period to a previous period - this can be a useful point of reference. Has our organisation moved in a positive direction?
  2. If the period doesn’t cover a key event, like a calendar or financial year end, we can’t easily see the impact of those key events. And yet they are important milestones for our analysis.
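
A comparison between the audit period and a prior period can be sketched in a few lines of Python. This is a minimal illustration only: the records, the processing-time measure and the period boundaries are all made up for the example, and `average_in_period` is a hypothetical helper rather than part of any library.

```python
from datetime import date
from statistics import mean

# Hypothetical records: (transaction date, processing time in days).
records = [
    (date(2020, 3, 10), 6),
    (date(2020, 5, 22), 5),
    (date(2020, 6, 30), 9),   # falls on the assumed financial year end
    (date(2020, 8, 14), 4),
    (date(2020, 11, 2), 3),
    (date(2020, 12, 18), 4),
]


def average_in_period(rows, start, end):
    """Average processing time for rows dated from start to end, inclusive."""
    values = [days for when, days in rows if start <= when <= end]
    return mean(values) if values else None


# Assumed prior period: January to June 2020. Assumed audit period: July to December 2020.
prior_avg = average_in_period(records, date(2020, 1, 1), date(2020, 6, 30))
audit_avg = average_in_period(records, date(2020, 7, 1), date(2020, 12, 31))

print(f"Prior period average: {prior_avg:.2f} days")
print(f"Audit period average: {audit_avg:.2f} days")
```

Because the dataset is small, widening the date range costs almost nothing, and the prior period gives the audit period a point of reference.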


2. Combine/augment with other internal data

Often, when we’re looking at a particular domain, using the data from a specific process doesn’t give us the full picture. What’s really going on with our selected topic, subject, domain or audit area?

It is helpful to combine our data with internal data from adjacent functions or processes.

A simple and common example is a payroll audit where we bring in procurement data. Augmenting the data from the audit area (payroll) with data from another area (procurement) gives us a view that is a bit broader, across the organization, and is still relevant to our audit. For instance, we can match vendor details to employee details to find potential conflicts of interest.
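
One way to sketch the vendor-to-employee match is to compare bank account numbers after stripping out formatting differences. The field names (`employee_id`, `vendor_id`, `bank_account`) and the sample values are assumptions for illustration; real master data will differ, and other fields such as address or phone number may be equally useful.

```python
def normalise(value):
    """Keep only letters and digits, lower-cased, so formatting does not hide a match."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


# Hypothetical master data from payroll and procurement.
employees = [
    {"employee_id": "E001", "name": "A. Sample", "bank_account": "123-456 789"},
    {"employee_id": "E002", "name": "B. Example", "bank_account": "555-000 111"},
]
vendors = [
    {"vendor_id": "V100", "name": "Sample Consulting", "bank_account": "123456789"},
    {"vendor_id": "V200", "name": "Other Supplies", "bank_account": "999888777"},
]


def find_shared_accounts(employees, vendors):
    """Return (employee_id, vendor_id) pairs that share a bank account."""
    by_account = {}
    for emp in employees:
        by_account.setdefault(normalise(emp["bank_account"]), []).append(emp)

    matches = []
    for vendor in vendors:
        for emp in by_account.get(normalise(vendor["bank_account"]), []):
            matches.append((emp["employee_id"], vendor["vendor_id"]))
    return matches


matches = find_shared_accounts(employees, vendors)
print(matches)
```

A match is only a lead for follow-up, not a finding in itself.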

Here is another example. Suppose we’re auditing sales, and the sales data is reasonably small. If we combine the sales data with marketing data, it can help us to understand what happened before the sales occurred. What happens, marketing-wise, in the lead-up to those sales? Are there any discernible patterns? Of course, we won’t immediately conclude that the marketing drove the sales. It is not necessarily a causative relationship. But correlations can be used to paint a picture, and then we do more work to determine whether the correlation is actually causation.

If we want to go further, we can combine sales data with support data.  For example, is there a relationship between complaints and returns?  Is there a relationship between complaints and a drop in sales? And so forth.
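
A first look at whether complaints and returns move together could use a simple Pearson correlation. The sketch below uses made-up monthly counts, and `pearson` is a hand-written helper so that the example needs nothing beyond the standard library. A high value describes a pattern only; it says nothing about cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly counts taken from support data and sales data.
complaints = [12, 15, 9, 20, 18, 11]
returns = [30, 34, 25, 41, 39, 27]

r = pearson(complaints, returns)
print(f"Correlation between complaints and returns: {r:.2f}")
```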


Combining proprietary data sets gives us a broader perspective, beyond the data for the individual topic or subject matter that we're looking at.

Now sometimes this can yield insights that are not directly relevant to the audit.  But then they can inform our future audits, for example, or even throw up some new topics or new risks that need to be addressed. We can then set these aside for further discussion with the broader audit team.


Augmentation could also be where we combine master data with transactional data and then transactional data with audit logs. This opens up a whole range of possibilities. For example, logs can help us identify very granular specifics about an event - exactly who (username, terminal), when (date and time) and how (pre-event user activity, which screen they accessed, post-event activity, etc).


3. Combine/augment with open data

This is similar to the previous technique, except that we combine our internal data with open data.  Open data here refers to both:

  • Open data that is available to the public
  • Partially open data that is available to specific organizations


Public open data

There has been significant growth in the volume of open data in recent years.  And the trend shows no signs of slowing down. This is good for auditors.  Government statistics, performance data on the delivery of services and geographic data are particularly useful for us.

Geographic data that allows us to visualize data on a map is quite interesting.

We can use it to visualize, for example, where we have no coverage or where we have too much coverage.

We can also use it to understand where there may be large distances between two geographies that we cover as an organization.
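
Once locations are geocoded, the distance between two of them can be estimated with the haversine formula, which gives the great-circle distance between two latitude/longitude points. The coordinates below are illustrative only, and the Earth radius is the commonly used mean value, so treat the result as an approximation rather than a travel distance.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Hypothetical service locations (latitude, longitude).
site_a = (51.5074, -0.1278)
site_b = (53.4808, -2.2426)

distance = haversine_km(*site_a, *site_b)
print(f"Distance between sites: {distance:.0f} km")
```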

Using geocoded or mapping data also creates a lot of impact, because people are usually interested in their local area and what’s happening in their community. We can use it to see how public services are being delivered to cohorts within certain geographic boundaries. It can help visualise our findings to provide some really powerful messages.

Open data has helped deliver several effectiveness and efficiency outcomes.  As an example of this, the UK Government used open data to save £4m in 15 minutes – and this was back in 2013!  A lot of money now, even more back then.


Partially open data

This is data that is available to a limited set of recipients, for example organizations within a regulated industry.

In financial services, open data initiatives have been ongoing in several countries.  This is where anonymized data is shared between financial services institutions. In this case the data is a bit more open than proprietary/internal data, but still not available to the general public (although some of it might be).

If we have access to such partially open data, we should use it – our counterparts in first/second line should be using it too.


4. Weighted risk/performance indicators

The previous three techniques were about getting hold of more data: covering a longer timeframe, or combining our data with other internal or open data.

This one is about exploring our dataset by creating scenarios.  Let’s say we’re interested in understanding relative performance or relative risk.  We can translate our data into performance indicators or risk indicators.


As an example, say we have a thousand records (rows of data). Within that, we have five performance indicators (usually fields or columns).

Now not all performance indicators are equal.  Some are more important than others. So we assign a weighting to each indicator to reflect its importance.

Sometimes we want to know what the results will be if we modify the weightings.  Think “what-if” scenarios.


For each of the five indicators, we assign several weightings.  We combine the weighted indices to generate a total score – the performance score or risk score.

We often use these four as weight options:

  • -1 (allow for an indicator to affect the score negatively)
  • 0 (allow for an indicator to have no impact)
  • 1 (default)
  • 2 (allow for an indicator to have more impact).


So each record has four options for each of the five indicators.

That becomes a rather large set of data.

Four to the power five, per record. That’s 1,024 weighting combinations for each record.

With a thousand records, that’s 1,024,000 records.


Now we have a much larger set of data to work with, from our small data set.

It allows us to look at different scenarios and work out what different weightings we want to apply to the individual performance indicators before we bring them together.


We can do this with a large data set, but the results may be unwieldy.  For example, if we started with a million records, we would end up with just over a billion records.
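
Enumerating the scenarios directly is one way to check the count: four weight options for each of five indicators gives 4^5 = 1,024 weighting combinations per record. The sketch below uses two made-up records with five indicator values each, and the names (`WEIGHT_OPTIONS`, `score_scenarios`) are hypothetical.

```python
from itertools import product

WEIGHT_OPTIONS = (-1, 0, 1, 2)  # the four weight options listed above

# Hypothetical records, each with five indicator values.
records = [
    {"id": "R1", "indicators": (3, 1, 4, 1, 5)},
    {"id": "R2", "indicators": (2, 7, 1, 8, 2)},
]


def score_scenarios(record, options=WEIGHT_OPTIONS):
    """Return one row per weighting combination, with the weighted total score."""
    values = record["indicators"]
    rows = []
    for weights in product(options, repeat=len(values)):
        score = sum(w * v for w, v in zip(weights, values))
        rows.append({"id": record["id"], "weights": weights, "score": score})
    return rows


scenarios = [row for record in records for row in score_scenarios(record)]
print(len(scenarios))  # 2 records x 4**5 combinations = 2048 rows
```

Each row can then be filtered or ranked to see how sensitive a record's score is to the chosen weightings.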


5. Use semi-structured / text data

The data is semi-structured in the sense that it contains fields that hold free text.

An example is complaints data, collected when customers or citizens call or write to complain or provide feedback to us. In the case of a phone call, we could take a recording of that and transcribe it, or we can manually capture the notes of what the customer has said.

If we use natural language processing on text data, when we're looking at phrases and keywords and what they look like across datasets, we can end up with many thousands of columns. So, they’re still shorter datasets (rows), but a lot wider (columns).
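
The widening effect can be seen with a very simple keyword count, where every distinct word becomes a column. This is a minimal sketch with made-up complaint text and a naive tokenizer; real text analysis would normally add steps such as stop-word removal and phrase detection.

```python
import re
from collections import Counter

# Hypothetical free-text complaints.
complaints = [
    "Late delivery and no refund offered",
    "Refund not processed, late response from support",
    "Great service, quick delivery",
]


def tokenize(text):
    """Split text into lower-case words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def term_matrix(docs):
    """Return the sorted vocabulary and a row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


vocabulary, matrix = term_matrix(complaints)
print(f"{len(matrix)} rows x {len(vocabulary)} columns")
```

Even three short complaints produce more columns than rows, which is why text datasets become wide so quickly.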

Phrase analysis and keyword analysis, particularly for customer feedback and complaints, can be very valuable for a range of audits. This is detailed here.


Text data is another one where a small data set could be our friend, because a large data set can be quite unwieldy. Analysis of text data can also identify new risks that you might want to put on your audit plan. It can be a very useful technique.


Don't hang your hat on big data.

Smaller datasets can be equally powerful for your audits.

Think about using these five techniques to extract that extra value.
