Episode 47 | Three practical ways to fast-track the data cleansing process

The Assurance Show
The Assurance Show
Episode 47 | Three practical ways to fast-track the data cleansing process
/

 

Summary

In this episode we discuss three simple ways to improve the data cleansing process – easier, faster and higher quality.

  1. Find the optimal data source (e.g., don’t use a poorly structured dataset)
  2. Cleanse iteratively – you may not need all of the fields/columns
  3. Import iteratively – start with a subset of the data and build on that as needed

 

Transcript

Narrator: You’re listening to the Assurance Show. The podcast for performance auditors and internal auditors that focuses on data and risk. Your hosts are Conor McGarrity and Yusuf Moolla.

Yusuf: So Conor, today, I want to quickly talk about some recent experiences that we’ve had in the use of data and that relates to finding the easy way out. So that sounds a bit wrong. Obviously we don’t want to find, necessarily, the easy way out. But, making sure that we’re not doing work that we shouldn’t need to do.

Conor: I’m intrigued by this. So I want to hear the examples you’re going to use because you haven’t briefed me on them.

Yusuf: Okay. Good stuff. So the first one is, quite often open data, but also internal data that we might use. We find data in things like annual reports, company presentations, in the public sector it would be in financial statements, budget papers, and the like. And sometimes, in our excitement or haste, to grab that data and use it, we go to that unstructured source and then do all sorts of weird and wonderful work in order to pull it out. So we can take a PDF, we can extract tables from it, then detailed work to format those tables. We had a case like that a few months ago. And we’re actually seeing another one. We were working with a client who had taken a PDF, and that had spent about a week cleansing that data and coming up with a structured dataset to use. And one of the first questions that we asked was is this the only source of data that we have, because data that flows into a PDF or report or the like, potentially would have come from a structured source. And it turns out that the data came from a database that we could have extracted directly from and used in its original form. So that creates extra work , it’s not just extra effort, but it’s also, we then found errors in the conversion from unstructured form to structured form.

Conor: Instead of this own structured information that you have here, can you ask around within your own organization about how it got here in the first instance to try and shortcut that process so that we don’t end up with the situation halfway in where we find out there was a database or something that existed from the outset..

Yusuf: That’s right. Back in the day we learned to go and ask the question, trial and error. So you learn as you go that there’s a better way to do things. But yeah, so that example is about efficiency, but also about quality. Don’t go and do that work. So when you’re faced with a PDF of an annual report or a performance statement or budget statement, the first question is not how can I take the data out of this for my purpose? The first question is where does this data come from? And maybe I can source that data directly. Now, sometimes you can’t, you just don’t have access to the individuals, or it may be that somebody created it manually. But the first point of call is how can I do this in a simple way? So that is taking the easy way out in that case. But it’s also the better way because it produces better quality and efficiency.

Conor: And it minimizes your risk of getting things wrong in translation.

Yusuf: Yeah. So the second example, is where we taking structured data and we go through a cleansing process and we probably spoke about this before, but we all know that it takes a long, long time to cleanse the data usually takes a long time. And despite what certain visualization tool providers might promise, it usually does take quite a bit of time to actually cleanse data and standardize it and format it for use. And if you’re not going through the process, you’re going to end up with, usually going to end up with poor results, or at least you won’t have confidence in the results that you have. However, when you’re going through the cleansing process, the question you need to continuously ask yourself is, am I going to actually use this data field? Or am I going to use this data source? Because often you end up with large sets of data. And when I say large, I mean wide in terms of the number of fields or the number of columns that you have. And the automatic data analyst approach would be to cleanse everything that they have. So that any potential question that might come up will be easy to answer given the cleansed data that they now have. And that is a little bit of an inefficient approach. So we need to think about key fields that we’re going to be using? And then cleanse those first. If you’re using a repeatable analysis processes, if you’re using a workflow where it’s quite easy to go into that workflow and make changes, to flow downstream. It’s better to start small. Start with the data that you know you’re really going to need. And then you know you’re really going to use cleanse that first and then come back later, if you need to cleanse additional fields for use, that may answer questions that come up later on or where you may need to delve into some of the deals. So that’s something again that we learned over the years. Initially when you’re starting your sort of data journey, initially, you just want to grab everything and cleanse everything and make sure you have this, perfect 100% dataset. the better thing to do, is identify the specific fields that you are going to be using, focus on cleansing those. And then if you need something later on down the track, come back to it. So it’s an iterative process.

Conor: That ability to identify the fields that you will be using it initially, is that dependent upon having a clear objective for your audit as to what you’re trying to achieve, as opposed to an exploratory process where you just want to look at everything you have.

Yusuf: We always advocate for, both hypothesis focused, but also exploratory. Even in the exploratory, you don’t have to have everything a hundred percent clean. What the hypothesis based approach is pretty straightforward. You just work backwards from the question you’re trying to answer and get to the fields that you need, but even with exploratory, you will, intuitively know what you need, right? There’s certain fields that you just need, certain dates and IDs and text fields and things that you will just need straight up. So you cleanse that to the best level that you can. And then as you explore you iteratively cleanse for that exploration. So you’re exploring certain fields. There’s a question that you have that has gone unanswered or that you only have a partial answer to, and you think, you know what, there’s actually value in cleansing that field further to be able to answer this exploratory question that I have. But again, that’s coming back to what you need as opposed to trying to do it upfront.

Conor: Yeah. One of the risks, which has a positive side and a negative side is curious data analysts like to explore a lot. But they need to step back sometimes and ask what is the answer I’m trying to come up with here? Or am I adding any value by doing this further exploration for this particular field.

Yusuf: It’s an ongoing battle, right? we’ve been doing this for a long time, but I get a new data, I automatically want to go and just fix everything before I jump into it. stop yourself and resist that urge and say, no, stop what do you need? What can, what you need? And then if I need more, I’ll come back and fix it later.

Conor: I think you’re still working on your discipline on that.

Yusuf: I think I am. Yeah. , and I don’t think I’ll stop, right. Because it’s ongoing learning and it’s a nice learning that sort of doesn’t mean. So the third one is, particularly important when you bringing data together so this happens often when you’re looking at data over multiple years. let’s say you bringing 10 or 15 or 20 years worth of data together. Sometimes the data from 20 years ago, not sometimes often, always, almost always the data from 20 years ago will look different to this yesterday. when you have a situation like that, first explore a smaller set. So it could be one year, two years, maybe even three years. just depends on your circumstance, but don’t try to grab all 20 and cleanse all 20 the first time around. If you are going to be looking at some sort of longitudinal analysis, start small, start with the first year of data. What do I have maybe add a couple of years to that. and then cleanse that specific. And then later on, you can bring other data in. One of the risks that you have with that is there may be some data that existed 20 years ago that doesn’t exist anymore, that you may need to adjust for, but then you need to ask yourself if I don’t have a current view of that, am I really gonna use that field? So you may have 10 fields in your current data set 20 fields in an old data, set those other 10 fields. If I don’t have any contemporary information on it, do I really need. What am I going to do with it? what comparison am I going to be able to make the best thing you might be able to come up with as a result is that we’re not collecting the data anymore. It could have been useful. It should be useful, I mean, there’s value in that, but really the incremental value of that is, significantly outweighed by the level of effort you’re going to need to put in to get the data cleansed. So start small, start closer to the time that you’re in at the moment and then work backwards from there. Most workflow, style tools, et cetera. It doesn’t really matter how much you have except when you’re trying to deal with all of the anomalies all of the exceptions that may exist. So anybody working with data will know. The more data you bring in, the more potential you have for having to fix up some of those edge cases. That specific missing field or this one year in which we didn’t have this particular column, ends up taking a lot of time. And so you want to try to avoid that. So that’s the third simple step to make your analysis process easier, more efficient less frustrate.

Conor: And taking those tips on board will save you a lot of time in the law.

Yusuf: Summing up. First thing is unstructured, first look for the structured equivalent. The second one is don’t try to cleanse all of the data fields that you have straight up, make it an iterative process. And the third one is try to start with a smaller set of data. If you can, before expanding into the full volume of data that you might have over years or over departments or whatever.

Conor: Thanks Yusuf.

Yusuf: Thanks Conor.

Narrator: If you enjoyed this podcast, please share with a friend and rate us in your podcast app. For immediate notification of new episodes, you can subscribe at assuranceshow.com. The link is in the show notes.