Review triggers and practical checks for algorithmic systems (Part 2)
Our algorithmic systems and models need regular attention to make sure they continue to operate accurately and fairly. Customer behaviour changes. Data changes. Feedback loops shift how algorithms perform.
Depending on the system and the business, there are clear triggers for when we need a closer look at specific parts of the system. In the first article, we discussed the typical triggers.
When those triggers don’t apply, or miss things, there are practical checks we can run as well. This article explores the first practical check.
Check 1: Complaints, feedback and interactions
Most banks and insurers already spend a lot of money on complaints processes, systems and reporting. Beyond their usual role in meeting customer expectations and compliance obligations, this data can be a rich source for checking how our algorithms are behaving.
It’s like a supplementary check between our formal reviews or other triggers. While it shouldn't replace proactive testing, listening to what customers are already telling us is one way to spot issues we might otherwise miss.
Problems (and solutions, for our purposes) with complaints data
There are a few problems with this data, but we can do something about them.
1. Data Quality
Almost every complaint system that I’ve seen has data quality issues. Free‑text fields. Inconsistent codes. Gaps. Legacy workarounds.
Solution: We acknowledge the data quality issues, but for our purpose here we don’t need perfection. We use what’s there (the patterns tell us where to look), remembering that they don’t prove anything on their own. Anything serious we find this way still needs proper validation. We also accept that busy frontline staff might just log a generic 'customer dissatisfied' code, so we will miss things if we only rely on specific keywords.
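The keyword approach just described can be sketched in a few lines. This is a minimal illustration, not a production classifier: the field names, reason codes and keyword list are all hypothetical, and in practice we would tune the keywords to the algorithm under review.

```python
# Hedged sketch: flag complaints whose free text mentions themes of interest,
# even when the logged reason code is a generic one. All field names and
# keywords here are illustrative placeholders.
KEYWORDS = {"offset", "premium", "overcharged", "fee error"}

def flag_complaints(complaints):
    """Return complaints whose free text matches any keyword of interest."""
    flagged = []
    for c in complaints:
        text = (c.get("free_text") or "").lower()
        if any(kw in text for kw in KEYWORDS):
            flagged.append(c)
    return flagged

sample = [
    {"id": 1, "reason_code": "CUSTOMER_DISSATISFIED",
     "free_text": "I was overcharged on my premium"},
    {"id": 2, "reason_code": "CUSTOMER_DISSATISFIED",
     "free_text": "Staff were rude"},
]
hits = flag_complaints(sample)  # only the first record is flagged
```

Note how both records carry the same generic code; only the free text separates them, which is exactly why keyword-only matching will still miss anything the customer didn't phrase the way we expected.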
2. Exclusions (due to definitions)
The definitions of “complaint” can be narrow. One of the worst examples I’ve seen required the customer to say “I want to complain” before it would be recorded as a complaint. Anything else was just “feedback” and quietly ignored or deprioritised. There are variations of this.
Solution: We add in feedback data. If we also have access to data on other interactions (everyday customer contacts), it can be very useful too. Note: Connecting things like call centre notes and other interactions data can be difficult, so we might start with complaints/feedback only.
3. Filtering and shaping
Most complaints reporting is organised by product, channel, root cause, or something similar. Useful, but not necessarily how we need the data sliced for algorithm checks.
Solution: We take the raw data and shape it for our purposes. For example, we might filter on data that is relevant to our algorithm and the decision it makes. We won’t have a complete set (in part due to the data completeness issues), but it’s still valuable.
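As a concrete illustration of that reshaping, the sketch below filters raw complaint records down to the slice relevant to one hypothetical decision (a credit-limit algorithm). The product names and outcome codes are assumptions for the example, not real taxonomy.

```python
# Hedged sketch: slice raw complaints data down to records relevant to a
# specific algorithmic decision. Products/outcomes are illustrative only.
RELEVANT_PRODUCTS = {"credit_card"}
RELEVANT_OUTCOMES = {"limit_reduced", "limit_declined"}

def slice_for_algorithm(records):
    """Keep only records that touch the decision our algorithm makes."""
    return [
        r for r in records
        if r.get("product") in RELEVANT_PRODUCTS
        and r.get("outcome") in RELEVANT_OUTCOMES
    ]

raw = [
    {"id": 1, "product": "credit_card", "outcome": "limit_reduced"},
    {"id": 2, "product": "home_loan", "outcome": "rate_query"},
    {"id": 3, "product": "credit_card", "outcome": "fee_dispute"},
]
subset = slice_for_algorithm(raw)  # only record 1 survives the filter
```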
4. Access to the data
Customer data access is subject to privacy rules and internal politics. The team testing the algorithm usually doesn’t own the complaints data, and the customer experience team may not just hand it over.
Solution: This needs a leadership mandate. We may already have this for all data. If not, we’ll need specific permission to allow our team to access this data safely and securely for this purpose.
What our customer interaction data can tell us about our algorithmic system
There are plenty of opportunities to use this data. How we use it will depend on the specifics of our model/system, what we’re using the system for, and what our customer contact data contains.
To give this some real flavour, here are two examples of how I’ve used this:
a. Offset accounts
We were trying to identify accounts that were not properly set up as offsets. We already knew about a few, so we searched customer interactions data for those, identified the relevant patterns, then looked for others that matched. This revealed further patterns. For example, customers asking about offsets: how to set one up, what account combinations are allowed, or simply which form to fill out. By working through how the system had handled those specific customers and accounts, we were able to pinpoint where the setup logic had failed.
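The pattern-matching step in that example can be sketched as below. The regexes and note texts are invented for illustration; in the real exercise the patterns were derived from the known mis-set-up accounts.

```python
import re

# Hedged sketch: scan interaction notes for patterns seen in known
# mis-set-up offset accounts, and surface candidate accounts for review.
# Patterns and field names are illustrative, not the originals.
OFFSET_PATTERNS = [
    re.compile(r"offset\b.*\bnot\s+(?:working|linked)", re.I),
    re.compile(r"link(?:ing)?\s+my\s+offset", re.I),
    re.compile(r"offset\s+form", re.I),
]

def candidate_accounts(interactions):
    """Return account IDs whose notes match any known offset-problem pattern."""
    hits = set()
    for note in interactions:
        if any(p.search(note["text"]) for p in OFFSET_PATTERNS):
            hits.add(note["account_id"])
    return hits

notes = [
    {"account_id": "A1", "text": "Customer says the offset is not working on their loan"},
    {"account_id": "A2", "text": "Asked for the offset form to link a savings account"},
    {"account_id": "A3", "text": "General balance enquiry"},
]
candidates = candidate_accounts(notes)  # A1 and A2 are flagged for review
```

Each candidate still needs the system's handling of that specific account traced end to end; the search only tells us where to look.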
b. Incoming insurance commissions
We were recalculating commissions for a white-labelled product: our entity earned commission for referrals to an insurance company. But we didn’t have all the data we needed at first. So we jump-started the process by searching the data for any mention of insurance, related keywords (e.g. “premium”, “claim”, “policy”), specific text that matched the insurance policy number pattern, etc. That gave us a starting set of records to test the commission rules against, again identifying potential problems with how the algorithmic system worked. Note: we acknowledged that the effectiveness of this approach was limited by our ability to identify the right keywords.
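That bootstrap search might look like the sketch below. The policy-number format ("POL" plus eight digits) is a made-up assumption; the real pattern would come from the insurer's numbering scheme.

```python
import re

# Hedged sketch: bootstrap a test set by scanning interaction text for
# insurance keywords and a hypothetical policy-number pattern. Both the
# pattern and the keyword list are illustrative assumptions.
POLICY_RE = re.compile(r"\bPOL\d{8}\b")
KEYWORDS = ("premium", "claim", "policy", "insurance")

def starter_records(interactions):
    """Return interaction texts that look insurance-related."""
    out = []
    for text in interactions:
        lower = text.lower()
        if POLICY_RE.search(text) or any(k in lower for k in KEYWORDS):
            out.append(text)
    return out

sample = [
    "Customer queried commission on policy POL12345678",
    "Mortgage repayment question",
    "Asked about an insurance premium increase",
]
records = starter_records(sample)  # the two insurance-related texts
```

As the article notes, this only finds what the keywords can find, so the starter set is a lower bound on the affected records, not a complete population.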
In other words, we’re using what customers tell us to find parts of the system that deserve a closer look.
Not a replacement for usual triggers
The complaints and interactions in each of those cases were supplementary; we were not relying on them to solve the problem. This type of check can typically only catch errors that harm customers, which customers noticed and told us about. It also means filtering out a lot of noise: often the algorithm is completely accurate, and the customer is simply unhappy with a fair decline or a correct price increase.
If a system error accidentally under-prices a premium, no one complains. If a system error results in overcharging and the customer doesn’t notice it, this data won’t tell us that. We also have to remember that vulnerable customers are often the least likely to complain, so we can't rely on this method alone to catch fairness issues. It is a lagging indicator: by the time customers complain, the issue is already in the wild.
So we don’t let this distract us from proactive system testing.
Clusters
Because complaint and feedback data is messy, absolute numbers are less helpful than clusters. A small number of complaints can be a strong early warning if they’re tightly clustered around things like:
- A product or particular variant
- A customer segment (e.g. an age demographic, new customers)
- A specific outcome (e.g. limit reduced, claim partially paid)
Notes:
- We still need to ask whether these clusters reflect real issues, rather than louder voices in one channel, a social media trend driving specific complaints, or the algorithm correctly reacting to a shift in the economy (like interest rate changes affecting a specific demographic).
- We need to know what our normal baseline of complaints looks like, so we can tell when a cluster is an anomaly.
- We need thresholds for deciding what we’ll look into, so we don't waste time chasing every minor anomaly or normal business friction (like customers simply disliking a fair price increase).
- We also need to align with compliance expectations: if we find issues, we must follow a formal process for managing them.
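The baseline-and-threshold idea in the notes above can be sketched simply. The segments, outcomes, baseline counts and the 3x multiplier below are all illustrative placeholders; real baselines would come from our own historical complaint volumes.

```python
from collections import Counter

# Hedged sketch: count complaints per (segment, outcome) cluster and flag
# clusters that exceed a multiple of their historical baseline. Baselines
# and the threshold multiplier are illustrative placeholders.
BASELINE = {("new_customers", "limit_reduced"): 2}  # typical weekly count
MULTIPLIER = 3  # flag clusters at 3x baseline or more

def flag_clusters(complaints):
    """Return (cluster, count) pairs that breach the baseline threshold."""
    counts = Counter((c["segment"], c["outcome"]) for c in complaints)
    flagged = []
    for cluster, n in counts.items():
        base = BASELINE.get(cluster, 1)  # unseen clusters default to 1
        if n >= MULTIPLIER * base:
            flagged.append((cluster, n))
    return flagged

week = (
    [{"segment": "new_customers", "outcome": "limit_reduced"}] * 7
    + [{"segment": "existing", "outcome": "fee_dispute"}] * 2
)
flagged = flag_clusters(week)  # only the new-customers cluster breaches 3x
```

Anything this flags is a prompt for investigation, not a conclusion: the cluster still has to be checked against the "louder voices" and economic-shift explanations listed above.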
New technical opportunities
With large language models, there will be more possibilities than before. They may allow us to find relevant patterns more effectively. But at the time of writing this, most teams are not yet beyond the initial exploration phase. When we do use them, we will need to be careful and have the right guardrails up: security, privacy, internal models (not the public ones), human review, etc. And use them for triage, rather than final judgement.
Capture the upside, manage the downside
Complaints, feedback and interaction data can be very rich sources. But we can only capture that value if our leadership is prepared to act on what we find.
We need discipline: keep the team focused on the algorithm, rather than getting distracted by other broken processes they uncover in the complaints data.
We also need a structured approach to capture the upside, and controls to manage the potential downside.
Disclaimer: The info in this article is not legal advice. It may not be relevant to your circumstances. It was written for specific contexts within banks and insurers, may not apply to other contexts, and may not be relevant to other types of organisations.