Episode 20 | Data related terminology that is commonly misused

The Assurance Show

 

 

Summary

In this episode we discuss six (6) sets of words and phrases that are commonly misused.

  1. Using data vs analytics
  2. Algorithm vs model
  3. Data owner vs data “custodian”
  4. AI vs machine learning
  5. Open source vs open data
  6. Big data vs big data

Transcript

 

Yusuf Moolla: Today, we’re going to talk about six commonly used phrases in the field of data within audit. So that’s internal auditors using data and performance auditors using data and how those phrases can be misunderstood, misrepresented, misused, and can create confusion.

Conor McGarrity: Terminology is always a bugbear for people working in the audit world. So the six things just very quickly we’re going to run through: analytics versus using data. Number two, algorithm versus model. The third thing is data custodian versus data owner.  Fourthly, AI versus machine learning.  Fifth, open source versus open data. And the last one we’ll talk about is big data versus big data. Some meaty subjects in there. Let’s crack straight into it.

Yusuf: Sounds good. So why is this important for auditors? We generally try to do things in a logical, structured way. The challenge we find with using terminology incorrectly is that it creates situations where you don’t have clarity about what it is that you’re doing and how you ask for what you need. So, as a head of audit, if you’re asking one of your team to do something and you’re using the wrong terminology for it, you may get ambiguous responses, meaning some people may understand exactly what you mean and some people may interpret it differently. Getting onto the same page in terms of how these terms are used is important, so that we don’t fall into that trap.

Conor: And of course, this is even more difficult where you don’t have some accepted definitions for some of these terms.

Yusuf: The first one is analytics versus using data.

Conor: They sound similar if not the same to me. So what’s the difference? Is there any difference?

Yusuf: The word analytics has so many different definitions. In fact, lots of people will look at analysis versus analytics, and because analytics sounds so similar to analysis, the two are often confused and used interchangeably. The definition itself isn’t that important, because there isn’t a standard definition for it. So why are we saying analytics versus using data? It’s that there are various things that you can do with data. You can explore, you can profile, you can recalculate, you can model, you can visualize and report. So it’s important to think about using data. And I think that the term analytics is slowly going to fade away, because what we mean when we say the word analytics varies so much from auditor to auditor. Some people use analytics only to mean, for example, recalculations or some very deep analysis of data. It’s become a phrase that’s used for that one component of the use of data, and understanding and exploring data, and reporting and visualizing that data, have been lost.

So we do need to appreciate that exploring data, recalculating and modeling, and then visualizing are all important parts of the using data continuum.  Using the term analytics can mean that we ring fence what data analytics is. And so while the term analytics might be broad and encompass a range of methods, because of that limitation I think it’s better that we say using data and drop the term analytics.
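As an illustration of that continuum (not from the episode), here is a minimal Python sketch of exploring, recalculating and reporting on a small data set. The invoice records and field names are invented for the example; the point is only that each step is an ordinary use of data.

```python
from collections import Counter

# Hypothetical invoice records, invented for illustration only.
invoices = [
    {"id": 1, "supplier": "Acme", "qty": 10, "unit_price": 2.50, "total": 25.00},
    {"id": 2, "supplier": "Acme", "qty": 4, "unit_price": 10.00, "total": 40.00},
    {"id": 3, "supplier": "Birch", "qty": 3, "unit_price": 7.00, "total": 27.00},
    {"id": 4, "supplier": "Birch", "qty": 1, "unit_price": 99.00, "total": 99.00},
]

# Explore / profile: how many records, and how are they spread across suppliers?
record_count = len(invoices)
by_supplier = Counter(inv["supplier"] for inv in invoices)

# Recalculate: recompute each total and keep the ids that do not agree.
mismatches = [
    inv["id"]
    for inv in invoices
    if abs(inv["qty"] * inv["unit_price"] - inv["total"]) > 0.005
]

# Report: a plain summary is already a use of data.
print(f"{record_count} invoices, by supplier: {dict(by_supplier)}")
print(f"Invoices whose total does not recalculate: {mismatches}")
```

None of these steps needs a specialist label; the same checks could be done in a spreadsheet.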

Conor: So instead of using the catch-all term analytics, should we be more specific in how we’re using the data? When we describe the work we’re doing with the data, instead of saying we’re doing some data analytics, should we say we’re using the data to do profiling, or we’re using the data for visualization, with a little more specificity around exactly how we’re using it, as opposed to using the broad term?

Yusuf: It doesn’t really matter.  As long as we understand that we’re going to be using data for whatever it is that we need on a particular project. That’s the point there really; that the term analytics has been used so much and it’s been so ambiguous to so many people that we may just want to start to veer away from it.

Conor: It’s been misappropriated and misused over the course of its lifetime.

Yusuf: And the other challenge it creates is for auditors who may not be analytics specialists, who may not be focused on the use of analytics, or who may not understand machine learning, et cetera, yet. The minute you say data analytics, it almost becomes like a thing, and it’s not really a thing, right?

So using data as part of the audit is something that we all have done. If you are an auditor of any variety, and particularly if you’re in internal audit or in performance audit, you have used data on at least one audit that you’ve done before. Unless you’re fresh out of university, you have used data.

And then if somebody asks you, can you do analytics? Your response may be no.  But the reality is you have used data in some way. You’ve explored some data, or you’ve gathered some data, or you’ve done some sampling or you’ve done some recalculation. And even if you haven’t done some recalculation, you may have used data as part of your reporting.

Using data is such a broad thing that, yes, it’s important to understand how to use data properly, but that doesn’t mean that it needs to be its own little field that only a certain select set of people can do, or that you need to be able to craft a SQL query or write a Python script to be able to say, I can use data as part of my audit.

Conor: It’s almost like the term data analytics has become its own little cottage industry. And that creates like a spectrum of fear for people.

Yusuf: So break that fear.

Conor: So we just go back to using terminology that’s simple and accurate – we use data for our audits.

Yusuf: That’s right. Simple.

Conor: Number two.

Yusuf: Algorithm versus model. And I’m not going to go into the full history of algorithm, but the word algorithm derives from an individual. He wrote the book that talked about computation and calculation, Al-Jabr, which is actually the term that became algebra. So the guy’s name was al-Khwarizmi, and from his name the term algorithm was formed. And slowly over the years, it was converted into Latin and French and various other languages.

And in the 19th century, the term algorithm was formally used in English. It had been used before in English, but that is when it was formally used to describe algorithms. The 19th century means the 1800s, so the term has been used for well over a hundred years.

Okay. So that’s a brief history, right? The term algorithm has been in the media lately and has been in various industry bodies lately, described as something that could potentially have bias.  The reality is that algorithms themselves don’t have bias. It’s only when you apply data to an algorithm to create a model that you then get potential for bias. Because data and whether you use all of the data and the right data is where the bias exists.

The algorithm is purely the framework that is used. So you set up an algorithm to be a template, a framework, a broad set of instructions that needs to be followed. And in the field of machine learning you have various supervised algorithms and unsupervised algorithms: random forests and linear regression and decision trees, et cetera. Now those algorithms are standardized, or reasonably standardized. We get lots of people developing them and creating new ones, XGBoost, et cetera, but they’re fairly standardized and they can be applied to various problems.

Now it’s in that application, when you take data that applies to a problem and you use that data to modify the algorithm and instantiate it to create a model that describes the universe for that problem, that you potentially get bias. So the bias exists in the data. That bias then gets transferred, with the data as applied to the algorithm, into the model, but the underlying algorithm doesn’t have bias.
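To make the distinction concrete (this sketch is not from the episode), the Python below treats ordinary least squares as the algorithm and the fitted slope and intercept as the model. The function name `fit_line` and both data sets are hypothetical, chosen only to show one algorithm producing two different models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    This function is the algorithm: a fixed set of instructions
    that contains no data of its own.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Two hypothetical data sets, invented for illustration.
xs = [1, 2, 3, 4]
ys_first = [3, 5, 7, 9]     # rises with x
ys_second = [10, 9, 8, 7]   # falls with x

# Algorithm + data = model: same instructions, different fitted parameters.
model_first = fit_line(xs, ys_first)
model_second = fit_line(xs, ys_second)

print("model from first data set:", model_first)
print("model from second data set:", model_second)
```

The code in `fit_line` never changes between the two calls; everything that differs between the two models came from the data.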

Conor: So the relationship between an algorithm and a model, is that a one on one relationship or could there be multiple algorithms for a single model?

Yusuf: It’s a many to many relationship. So you could have multiple algorithms for a model and you could have multiple models for an algorithm.  Remember also that some algorithms are actually derived from multiple algorithms themselves.

So you take a set of algorithms that may work well together, put them together, and you create the new algorithm. Then, to create the model, you use either an algorithm or a set of algorithms, with data that is used to train or instantiate the algorithm, so that it becomes the model that describes the universe in which the problem exists and the solutions that could come out of it.

Conor: So you say that there’s no bias in an algorithm, but you also said that an algorithm is an instruction. So, there could be bias in an instruction.

Yusuf: Yes. If you decide that the algorithm is going to be so specific that it will automatically exclude certain variables for whatever reason, then you’ve taken an algorithm and you’ve made it very specific. So if you say that an algorithm is going to be used on a particular set of data, and you explain what the fields are, it’s not really a base algorithm anymore.

It’s now morphing into a model because you’ve used some data on it. The algorithm as it originally would have stood, would not have that data focus or that data field focus in it. The algorithm itself would be able to be applied to multiple different data sets.

Conor: As auditors, are we more interested in the model as opposed to the algorithms?

Yusuf: We’re interested in both. The first thing is that when we’re auditing machine learning, for example, we’ve got to be able to determine whether the algorithm is fit for purpose. It needs to be appropriate to the objective that you’re trying to achieve.

The bigger part is what data have you used and what is the model doing? The algorithm, plus the data equals the model, with some training and various things in between. But your model is really what it is that you want to be focusing on because the model explains what it is that you’re going to do in the context of the problem that you have.

Conor: Algorithms and models have a many to many relationship. And there is no inherent bias in the algorithm unless you’ve made it very specific. Is that right?

Yusuf: Yes. So there wouldn’t usually be. There may be some instances where the algorithm has bias for some specific reason, but the general algorithms that you’ll be using, which come out of Python libraries or R libraries or other machine learning libraries, will be fairly generic. If you’re looking at something like fraud, for example, and trying to understand what the propensity for fraud might be, either in employees or in suppliers or in customers, then if you’re using data that is biased, your model will have bias inherent in it, but the algorithm doesn’t. The algorithm that you started off with will be the same, regardless of what data you’re using. But by including fields like gender, race, ethnicity, potentially those sorts of fields, that’s when you get some bias that then gets built into the model.
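A rough sketch of that point (not from the episode): the Python below uses a deliberately simple, hypothetical "algorithm", `fit_flag_rates`, that learns the share of flagged records per value of a field. The two histories and the `group` field are invented; one history is skewed, the other is not.

```python
from collections import defaultdict


def fit_flag_rates(records, field):
    """Algorithm: learn the share of flagged records for each value of `field`."""
    seen = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        seen[rec[field]] += 1
        flagged[rec[field]] += rec["flagged"]
    return {value: flagged[value] / seen[value] for value in seen}


# Hypothetical skewed history: group B carries far more flags than group A.
skewed_history = (
    [{"group": "A", "flagged": 0}] * 9
    + [{"group": "A", "flagged": 1}] * 1
    + [{"group": "B", "flagged": 0}] * 5
    + [{"group": "B", "flagged": 1}] * 5
)

# Hypothetical even history: both groups are flagged at the same rate.
even_history = (
    [{"group": "A", "flagged": 0}] * 8
    + [{"group": "A", "flagged": 1}] * 2
    + [{"group": "B", "flagged": 0}] * 8
    + [{"group": "B", "flagged": 1}] * 2
)

# Same algorithm, two data sets, two models.
skewed_model = fit_flag_rates(skewed_history, "group")
even_model = fit_flag_rates(even_history, "group")

print("model trained on skewed history:", skewed_model)
print("model trained on even history:", even_model)
```

The skewed model scores group B higher purely because of what was in its training data, which is why the data and the chosen fields deserve as much audit attention as the choice of algorithm.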

Conor: So when you first spoke about this particular issue, algorithm versus model, you said it was a bugbear of yours. Is this because those terminologies are being used in the wrong circumstance or interchangeably?

Yusuf: There are bodies that we’ve seen that have been created, and really good work is being done by them, where they’re developing toolkits for ethical use of AI, toolkits for ethical use of machine learning. And they are literally saying one of their principles is that all algorithms have bias. And that’s such a misnomer. If we are going to tackle what is a complex and challenging problem, we’ve got to at least start with the right terminology. If we start at the wrong place, we’re not going to end up in the right place.

Conor: Cool.  Point number three. So here, we’re going to have a quick chat about data custodians versus data owners.

Yusuf: When you hear data owner, Conor, what does that mean to you?

Conor: The person in the organization that has accountability for the safekeeping, the gathering, the collection and the promulgation of data.

Yusuf: And if you needed to get some data to conduct an audit, either in the public sector or the private sector, what would the data owner represent to you?

Conor: What do you mean by represent?

Yusuf: If you know who the data owner is that you need to go and get some data from, what challenges do you normally experience in dealing with them?

Conor: Well, the first challenge, which is legitimate for the most part is why do you want to get this data? What is your objective with it as an auditor?

Yusuf: And what’s your response to that?

Conor: Because our role as auditors, for example in the public sector, is to perform our independent assessment, making sure your governance and accountability mechanisms are right. We need to do our job, which is to deliver independent advice.

Yusuf: And have you had situations where the data owner has said, “As the data owner, I decline this request”?

Conor: Yes.

Yusuf: Do you think that is a fair response?

Conor: I think you know the answer to that, and the answer is that’s a very unreasonable response.

Yusuf: I would suggest that part of the reason for that is that we’ve got this term data owner. There’s a good reason for having it: it’s so that you have an individual or a team that is accountable for the way in which data is used, so that it is not used incorrectly, et cetera.

However, for auditors we have a challenge in that the data owner believes that they have the right to tell us that we can’t get access to that data because they don’t agree with the purpose for which we’re going to be using it. Yes, they own the data for terminology’s sake, but really, particularly within the public sector, they are custodians of that data on behalf of government and the individuals that they serve, which is the community, the public. As auditors, our job is to protect the public interest. And so if we are not able to get access to data because a data custodian has been given the term data owner and believes that they can decline our request, then we have a challenge.

This is where there’s a disconnect between what data ownership is and what data custodianship should be. As a public servant, you can’t own my data. If you think about something like the department of transport: I have a driver’s license and I have a car, and because of that, you have data about me. You don’t own my data. You are a custodian of that data. And when an auditor comes in to determine what you’ve been doing with that data, or to get some of that data in order to fulfill their mandate, then as an individual (I’m talking about myself here; not everybody will see it the same way) I don’t believe that you are able to tell the auditor that they can’t get that, because the auditor is looking after my interests. That is my data. You are holding that data on my behalf. This is where the custodian versus owner challenge comes in.

I’m not thinking about the term custodian versus owner, as it applies to your standard data governance frameworks, which are generally full of waffle. I’m talking about really what is your role? Do you own the data or are you holding it and making sure that it’s being used for the right purpose? And therefore, should you be denying a request for data if it comes from a source that is looking to use that to protect the public interest?

Conor: So they’re gate keepers really. And we would expect them to ask the right questions of auditors to ensure that it’s being used properly, but their default position is not to withhold it.  Because it’s not theirs to withhold.

Yusuf: The private sector would be very similar. You have a board, and in most cases any mature internal audit function that’s listening to this would probably report directly into the board of directors. The board has to make sure that stakeholder interests are protected. You may have a senior manager or a manager somewhere that then has responsibility for data. If they question the use of data by internal audit, which they have every right to do, and they are met with the response that the audit committee or the board has asked for this work to be done, they cannot refuse that. They don’t have the level of authority. Unfortunately, the term data owner appears to provide that authority, and it creates both efficiency and effectiveness challenges with getting access to data and using it.

Conor: So in essence, we need to reflect upon those people that perhaps call themselves data owners and they need to maybe understand that their role is more on the custodianship rather than ownership, so to speak.

It goes right to the purpose. And the purpose of auditors getting that data is to protect either the organization or members of the public or shareholders or all of the above.

Yusuf: This goes into a broader discussion around what data is being used for and the purpose for which it is being used. Most audit teams and most performance audit functions would not use data to victimize individuals. We’ve heard this debate a few times now, and maybe it is the subject of a separate, broader discussion. But if you’re collecting data for a particular purpose, and an audit later on needs to use that data for valid purposes, we should be able to use that data regardless. I can’t see too many situations where an auditor’s request for data should be met with “why do you need this?”.

Because even if you explain very well, why you need that data, you can’t limit yourself to only using that data for that individual purpose. Because if, later on, you need to use it for something else legitimately, you should be open to use it.

Conor: Moving on to number four, AI versus machine learning.

Yusuf: What does AI mean to you?

Conor: AI, artificial intelligence, means to me that somebody has given an instruction to a machine to do something, and then that machine goes and does it without the human having to perform the task themselves.

Yusuf: Interestingly, that is one of the definitions, because again, there isn’t a standard well accepted definition for what exactly AI is.

And in fact, there’s something called the AI effect, where the definition of AI, or what is encompassed by AI, changes every time a problem is solved by AI. The AI effect says that as soon as a problem is solved by AI, it’s not AI anymore. It’s weird and it’s strange. In the eighties, AI was getting a computer to do what the human would do, like calculate one plus one. Okay. So one plus one equals two. We know that. We put it into a calculator. So the calculator is now an AI. Put it into a computer. A computer now has some AI functionality built into it, but as soon as the problem is solved, it’s not AI anymore. Difficult to explain, right? But the broad application of AI is not just machine learning.

AI can describe all sorts of things. Machine learning is a field within AI, and that’s where you have both supervised and unsupervised algorithms that can help solve particular problems or create opportunities. And so AI is not equal to machine learning and machine learning is not equal to AI, even though they are used interchangeably a lot.

Conor: So AI is a catch-all term, really.

Yusuf: It’s meant to be anything that isn’t human intelligence. Anytime you create any form of intelligence that isn’t in the natural brain of a human, and is in the form of something that is robotic in some way, like a calculator or computer, that’s what the broad AI term was meant to encapsulate. Machine learning is a very small subset of it.

Conor: So when people are talking about AI on their audits, they’re using the term so broadly as to be meaningless.

Yusuf: The other thing also is that often people talk about using AI and machine learning. And you can’t have AI and machine learning. That’s like saying I have cattle and cows.

So you’re either using AI more broadly, which means you’re using the computer in some sense, or you’re specifically using machine learning techniques or algorithms or solutions. Understanding what that is, is important. It’s probably less important to get it a hundred percent correct. But when we say that we use AI in our audits, or that we have an AI enabled audit (and we’ve seen newer technologies come up now where AI enabled is splashed all over them), what does that mean? You’ve used the computer. So does that mean that I’ve used a Word document to store my working papers and therefore I have an AI enabled audit?

You could probably say that, given the scope of what AI means. There is something called, and I won’t say exactly what the name is, something AI auditor. And it is really just journal entry testing for financial statement audit.

Conor: I saw somebody marketing some auditing software, and they said that it was AI enabled: AI will support and enable your audit. It was really just that the software had some statistical information sitting behind what was being done by the user. And that was being sold as an AI enabled audit.

Yusuf: Yeah. I just saw another one that has Benford’s law and suddenly it’s an AI enabled auditing software. So that’s not technically incorrect, but given what people believe AI to be, it’s a little bit misleading and deceptive I think. So either we get misled and deceived. Or we accept that AI is broad and encompasses many things. If you are using machine learning as part of your audit solution, then it becomes very specific. So what does that mean? That means that we are using some form of machine learning algorithm to apply to our data, to be able to get to a result.
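For context (this sketch is not from the episode), a Benford’s law check is a fixed statistical comparison: the expected share of values with leading digit d is log10(1 + 1/d), and the observed shares are simply counted. The helper names and the sample amounts below are hypothetical.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected share of values whose first digit is `digit` (1-9) under Benford's law."""
    return math.log10(1 + 1 / digit)


def first_digit_shares(amounts):
    """Observed share of each leading digit among the non-zero amounts."""
    # Strip leading zeros and the decimal point so 0.05 yields the digit 5.
    digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Hypothetical amounts, invented for illustration only.
amounts = [120, 130, 250, 0.05]
observed = first_digit_shares(amounts)

for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

Nothing here is trained on data, so under the terminology used in this episode it is statistics rather than machine learning.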

Conor: If anybody’s spruiking AI enabled audit, you need to really zone in and ask the questions. What do you actually mean by that?

Yusuf: As auditors, we hang our hats on integrity and ethics. And I think if you’re misleading people by using the term AI auditor, then you probably shouldn’t be in the field of auditing. Go and do it somewhere else. It’s not the thing to do for auditors. And if you are an auditor who is spruiking this “AI enabled software”, you need to think very carefully about what you’re doing as well. And I’m sorry that this is a strong message, but  this just doesn’t go to the principles by which auditors conduct their work and think about themselves and do what they do.

Conor: Agree. A bit of a broader message there. Okay. Number five. I’m interested in this one, particularly. Open source versus open data.  How do we differentiate? What is the difference?

Yusuf: So open source means that the source is open. So open source typically refers to software because software has a source, meaning the program by which the software was created.

All software has an underlying program, and open source means that you can see all of the program code. So the source of the software is open and it’s publicly available.

You can go and see exactly what the individual lines of code that create that software are, as opposed to proprietary software like Windows, where you can use the software, but you can’t actually see the code that’s used to create it. Nothing wrong with either of those, right?

They both have their place. Open data is where the data has been made available to the public. So in both cases, it’s open, but you can’t have open source data because data doesn’t have a programming source. So you can’t open up the source. And in fact, if you did have data that you wanted to make open, but the source of that data had some confidentiality and you therefore excluded certain data fields from the data that you then made open, then it’s no longer open source data, if you want to use source in a slightly broader context. For that reason, you talk about either open source software or open data, and they’re two different things.

Often we hear the term open source data. What does that mean? What is the programming language that you created it in? It doesn’t mean anything, right? And so that’s just a really simple one: let’s talk about open source software and talk about open data, understand where each comes from and what the openness means, and not try to mix the two.

Conor: Our last topic for today, number six. Big data versus big data.

Yusuf: I don’t know that we actually have to say any more after saying big data vs big data.  We’ll call it big dayta versus big daata. Because really there is no definition for big data. There’s no accepted definition.

Conor: How can there not be? It’s such a sexy term.

Yusuf: It is a nice, sexy term, as you say, a buzzword, but there’s no definition for it. People will talk about the five Vs or the four Vs or maybe the thousand Vs because it’s big. But when people talk about big data, they’re talking about data sets that are bigger than they’ve ever seen before. Does that mean that as soon as you see it, it’s not big anymore, and it’s got to be bigger than that to be big? It just doesn’t make sense. It’s just a buzzword. And I think as auditors, we need to see beyond that and just stop using the term.

Conor: You won’t get any disagreement from me.

Yusuf: And it’s unfortunate, again, that as auditors we are seeing more of it. Because auditors have been late to the party in terms of using data, the vendors for data consulting and for data software will still come in and talk about agile, AI enabled, big data audit solutions. No, just forget about it, don’t use it. It diminishes your brand as an auditor.

Conor: I recall we saw, maybe three years ago, an internal audit report that suggested it had used big data in its analysis of a particular objective. I think there were two Excel files that were the source of the big data, which we thought was very strange at the time.

Yusuf: Most importantly, the value of using bigger sets of data, like we said before, is not bigger than the value of using small data. We had a discussion on how to use small data and what we can do with small data and what sort of value we can derive from it.

The value is not in the size. The value is in the quality and what it is that you can use for achieving the particular audit objective.

Conor: That takes us to the end of our six topics. So in summary, what we discussed today was, first, analytics versus using data, and our preference certainly is that we have more of a conversation around using the data that’s available to us, instead of using that terminology, analytics, that’s been thrown around. Secondly, algorithm versus model.

And you spoke there about the misapprehension that there’s an inherent bias in algorithms, when the bias perhaps exists in the model. Thirdly, data custodian versus data owner, and the importance of the person performing that function understanding that they don’t own the data and are really just performing a gatekeeper or custodian role. Number four, artificial intelligence, AI, versus machine learning.

Yusuf: And there’s no really versus there, right?

Conor: It’s a subset. Fifthly, you spoke about open source versus open data: open source being about the program, and open data being about the data which has been made either publicly available or more broadly available. And lastly, big data versus big data. It’s a non-conversation, a term that’s thrown around too freely, and we probably should stay away from it to protect our brand as auditors.