The US Treasury released the Financial Services AI Risk Management Framework earlier this year. It has an accompanying AI lexicon, which we touched on when it came out.
At more than 500 pages, it is not easy to digest, and not easy to cover in one go, so we’ll break it up into bite-sized thematic chunks.
Let’s start with fairness. The framework has several related guidelines, covering proportionality, sources of bias and testing fairness itself. Then there are adjacent items that may not discuss fairness explicitly, but that are prerequisites for some of those that do. These are spread across various objectives and example controls, so I wondered how to read the framework and use it specifically for testing fairness.
From years of traditional system review work, I still find it useful to split systems into four parts: data inputs, data processing, outputs, and control. It’s simple enough for non‑specialists to use, but still gives us a way to spot gaps. The technology might be quite different, but this basic split still works. The details, of course, are very different for algorithmic fairness.
In this article, I’ve taken fairness concepts from various parts of the FS AI RMF and grouped some of the expectations using that split.
There are, broadly speaking, four things we need to look at: inputs, processing, outputs, and control.
Starting with inputs: internal data, external data, and hard-coded input data parameters.
[Example RMF link: MP-2.2.2 AI Interdependencies and Dependencies].
Some things are straightforward: direct use of protected attributes, like using age for insurance pricing.
But avoiding direct protected attributes does not make a model neutral. Other fields can act as stand‑ins or proxies. Postcode/zip code, channel, type of driver's license, and similar variables can encode personal characteristics.
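To illustrate, a simple association check can flag candidate proxies before they ever reach a model. The sketch below is hypothetical: the file name, the column names and the choice of Cramér’s V are my assumptions, not something the framework prescribes.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical columns (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float((chi2 / (n * min_dim)) ** 0.5)

# Hypothetical inputs: a table of applications with candidate proxy fields and a
# protected attribute held out purely for this test, never for modelling.
df = pd.read_csv("applications.csv")
protected = "age_band"
for col in ["postcode", "channel", "licence_type"]:
    print(f"{col} vs {protected}: Cramér's V = {cramers_v(df[col], df[protected]):.2f}")
```

A high association does not prove a field is acting as a proxy in the model, but it tells us which fields deserve a closer look.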
Then there are external data sources, which can be surprisingly problematic: poor data quality, information that differs from what we hold internally, or data that is simply irrelevant. A simple example is marketing data that has no clear link to credit, or to claims, but still finds its way into those models. When that data is labelled “demographic data”, the need to drop it is clear.
We start by asking:
For machine learning models, these questions apply both to the training data and to the data used to classify cases in production. We also need to ask whether the training data is reasonably representative of our customers or cases.
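One way to check representativeness is to compare the training sample against the population the model is actually applied to, segment by segment. This is a minimal sketch, assuming hypothetical files and column names, and using the population stability index (PSI) as the comparison metric; the thresholds people quote for PSI are industry rules of thumb, not regulatory limits.

```python
import numpy as np
import pandas as pd

def psi(reference: pd.Series, sample: pd.Series) -> float:
    """Population Stability Index between two categorical distributions."""
    ref = reference.value_counts(normalize=True)
    smp = sample.value_counts(normalize=True)
    idx = ref.index.union(smp.index)
    ref = ref.reindex(idx, fill_value=1e-6)   # avoid log(0) for missing categories
    smp = smp.reindex(idx, fill_value=1e-6)
    return float(np.sum((smp - ref) * np.log(smp / ref)))

# Hypothetical files: the sample the model was trained on, and the customer or
# case population it is actually scoring today.
train = pd.read_csv("training_sample.csv")
population = pd.read_csv("customer_base.csv")
for col in ["state", "age_band", "product"]:
    print(f"{col}: PSI = {psi(population[col], train[col]):.3f}")
```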
Next, processing. Here the focus is on how the model or rules use those inputs.
[RMF link: control objectives under “Measure” that deal with feature impacts, and context‑specific fairness expectations.]
We don’t always need the full internals, but we do need to understand enough to ask simple questions like:
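One way to ground those questions, without opening up the full internals, is to look at which inputs actually drive the model’s outputs. Below is a minimal sketch using permutation importance; the dataset, column names and the step of fitting a model are all hypothetical, and in a real review you would probe the model under review rather than refitting one.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one row per scored case, numeric features, and the
# decision the model produced.
df = pd.read_csv("scored_cases.csv")
features = ["income", "postcode_risk_band", "channel_code", "tenure_months"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["decision"], test_size=0.3, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Which inputs does the model actually lean on? A large importance on a
# proxy-like field (e.g. a postcode-derived band) is a flag worth chasing.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(features, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: importance = {score:.3f}")
```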
Then, outputs. We check whether the system is consistently more wrong for some groups than others (broadly what the framework means when it refers to testing for disparate impact and monitoring model performance across different segments).
[Example RMF link: MS-2.1 Measuring Nondiscrimination]
We do this at a reasonable level of disaggregation, not just as a high-level summary. We’re looking for statistically meaningful gaps, asking questions like:
Any material gaps need clear, evidenced, approved (see control below) reasons that are not simply a reflection of historical inequalities. Some gaps, depending on the relevant legislative restrictions, might need immediate fixes rather than reasons. For example, discrimination on the basis of race is generally not allowed in Australia and several other countries.
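To make this concrete, here is a minimal sketch of a disaggregated error-rate comparison with a simple significance test. The file, the segment column and the chi-square test are illustrative assumptions; the framework does not mandate a particular metric or test.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical outcomes file: one row per case with the model's prediction,
# the eventual actual outcome, and a segment column used for disaggregation.
df = pd.read_csv("outcomes.csv")
df["error"] = df["predicted"] != df["actual"]

# Error rate per segment, not just one headline number.
summary = df.groupby("segment")["error"].agg(error_rate="mean", n="count")
print(summary.sort_values("error_rate", ascending=False))

# Are the differences bigger than chance alone would explain?
table = pd.crosstab(df["segment"], df["error"])
chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi-square p-value for error-rate gaps across segments: {p_value:.4f}")
```

In practice we would repeat this for the metrics that matter for the decision (approval rates, false positives, pricing outcomes) and at intersections of segments, not just one dimension at a time.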
Finally, control. This is not only about policies, although policies are part of it. We also need objectives, principles, awareness, human oversight and accountability.
[RMF link: governance control objectives that cover roles, responsibilities, training, and escalation for AI‑related risks.]
If testing shows uneven outcomes or proxy effects, who decides what happens next, and how quickly? Who can say “this is not acceptable” and require changes to rules, thresholds or models?
How do we train our people to identify discrimination risk? Do we train everyone involved, including data scientists and senior execs? Do we engage a cross-section of stakeholders to determine fairness requirements?
Have we defined owners, time-bound follow-ups, and re-testing after changes? If not, fairness testing will only give us a point-in-time view. We need it to be sustainable, with awareness, accountability and oversight.
The FS AI RMF is here: https://cyberriskinstitute.org/artificial-intelligence-risk-management/
Disclaimer: The info in this article is not legal advice. It may not be relevant to your circumstances. It was written for specific contexts within banks and insurers, may not apply to other contexts, and may not be relevant to other types of organisations.