The US Treasury released the Financial Services AI Risk Management Framework earlier this year. It has an accompanying AI lexicon, which we touched on when it came out.
At more than 500 pages, it is not easy to digest, and not easy to cover in one go, so we’ll break it up into bite-sized thematic chunks.
Let’s start with fairness. The framework has several related guidelines, covering proportionality, sources of bias and testing fairness itself. Then there are adjacent items that may not discuss fairness explicitly, but that are prerequisites for some of those that do. These are spread across various objectives and example controls, so I wondered how to read the framework and use it specifically for testing fairness.
From years of traditional system review work, I still find it useful to split systems into four parts: data inputs, data processing, outputs, and control. It’s simple enough for non‑specialists to use, but still gives us a way to spot gaps. The technology might be quite different, but this basic split still works. The details, of course, are very different for algorithmic fairness.
In this article, I’ve taken fairness concepts from various parts of the FS AI RMF and grouped some of the expectations using that split.
There are, broadly speaking, four things we need to look at: inputs, processing, outputs, and control.
Starting with inputs: internal data, external data, and hard-coded input data parameters.
[Example RMF link: MP-2.2.2 AI Interdependencies and Dependencies].
Some things are straightforward: direct use of protected attributes, like using age for insurance pricing.
But avoiding direct protected attributes does not make a model neutral. Other fields can act as stand‑ins or proxies. Postcode/zip code, channel, type of driver's license, and similar variables can encode personal characteristics.
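To illustrate, a simple association check can flag candidate proxies before they ever reach a model. The sketch below is hypothetical: the file name, the column names and the choice of Cramér’s V are my assumptions, not something the framework prescribes.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical columns (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float((chi2 / (n * min_dim)) ** 0.5)

# Hypothetical inputs: a table of applications with candidate proxy fields and a
# protected attribute held out purely for this test, never for modelling.
df = pd.read_csv("applications.csv")
protected = "age_band"
for col in ["postcode", "channel", "licence_type"]:
    print(f"{col} vs {protected}: Cramér's V = {cramers_v(df[col], df[protected]):.2f}")
```

A high association does not prove a field is acting as a proxy in the model, but it tells us which fields deserve a closer look.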
Then there are external data sources, which can be surprisingly problematic: poor data quality, information that differs from what we hold internally, or data that is simply irrelevant. A simple example is marketing data that has no clear link to credit, or to claims, but still finds its way into those models. When that data is labelled “demographic data”, the need to drop it is clear.
We start by asking:
For machine learning models, these questions apply both to the training data and to the data used to classify cases in production. We also need to ask whether the training data is reasonably representative of our customers or cases.
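One way to check representativeness is to compare the training sample against the population the model is actually applied to, segment by segment. This is a minimal sketch, assuming hypothetical files and column names, and using the population stability index (PSI) as the comparison metric; the thresholds people quote for PSI are industry rules of thumb, not regulatory limits.

```python
import numpy as np
import pandas as pd

def psi(reference: pd.Series, sample: pd.Series) -> float:
    """Population Stability Index between two categorical distributions."""
    ref = reference.value_counts(normalize=True)
    smp = sample.value_counts(normalize=True)
    idx = ref.index.union(smp.index)
    ref = ref.reindex(idx, fill_value=1e-6)   # avoid log(0) for missing categories
    smp = smp.reindex(idx, fill_value=1e-6)
    return float(np.sum((smp - ref) * np.log(smp / ref)))

# Hypothetical files: the sample the model was trained on, and the customer or
# case population it is actually scoring today.
train = pd.read_csv("training_sample.csv")
population = pd.read_csv("customer_base.csv")
for col in ["state", "age_band", "product"]:
    print(f"{col}: PSI = {psi(population[col], train[col]):.3f}")
```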
Next, processing. Here the focus is on how the model or rules use those inputs.
[RMF link: control objectives under “Measure” that deal with feature impacts, and context‑specific fairness expectations.]
We don’t always need the full internals, but we do need to understand enough to ask simple questions like:
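One way to ground those questions, without opening up the full internals, is to look at which inputs actually drive the model’s outputs. Below is a minimal sketch using permutation importance; the dataset, column names and the step of fitting a model are all hypothetical, and in a real review you would probe the model under review rather than refitting one.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one row per scored case, numeric features, and the
# decision the model produced.
df = pd.read_csv("scored_cases.csv")
features = ["income", "postcode_risk_band", "channel_code", "tenure_months"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["decision"], test_size=0.3, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Which inputs does the model actually lean on? A large importance on a
# proxy-like field (e.g. a postcode-derived band) is a flag worth chasing.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(features, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: importance = {score:.3f}")
```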
Then, outputs. We check whether the system is consistently more wrong for some groups than others (broadly what the framework means when it refers to testing for disparate impact and monitoring model performance across different segments).
[Example RMF link: MS-2.1 Measuring Nondiscrimination]
We do this at a reasonable level of disaggregation, not just as a high-level summary. We’re looking for statistically meaningful gaps, asking questions like:
Any material gaps need clear, evidenced, approved (see control below) reasons that are not simply a reflection of historical inequalities. Some gaps, depending on the relevant legislative restrictions, might need immediate fixes rather than reasons. For example, discrimination on the basis of race is generally not allowed in Australia and several other countries.
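To make this concrete, here is a minimal sketch of a disaggregated error-rate comparison with a simple significance test. The file, the segment column and the chi-square test are illustrative assumptions; the framework does not mandate a particular metric or test.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical outcomes file: one row per case with the model's prediction,
# the eventual actual outcome, and a segment column used for disaggregation.
df = pd.read_csv("outcomes.csv")
df["error"] = df["predicted"] != df["actual"]

# Error rate per segment, not just one headline number.
summary = df.groupby("segment")["error"].agg(error_rate="mean", n="count")
print(summary.sort_values("error_rate", ascending=False))

# Are the differences bigger than chance alone would explain?
table = pd.crosstab(df["segment"], df["error"])
chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi-square p-value for error-rate gaps across segments: {p_value:.4f}")
```

In practice we would repeat this for the metrics that matter for the decision (approval rates, false positives, pricing outcomes) and at intersections of segments, not just one dimension at a time.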
Finally, control. This is not only about policies, although policies are part of it. We also need objectives, principles, awareness, human oversight and accountability.
[RMF link: governance control objectives that cover roles, responsibilities, training, and escalation for AI‑related risks.]
If testing shows uneven outcomes or proxy effects, who decides what happens next, and how quickly? Who can say “this is not acceptable” and require changes to rules, thresholds or models?
How do we train our people to identify discrimination risk? Do we train everyone involved, including data scientists and senior execs? Do we engage a cross-section of stakeholders to determine fairness requirements?
Have we defined owners, time-bound follow-ups, and re-testing after changes? If not, fairness testing will only give us a point-in-time view. We need it to be sustainable, with awareness, accountability and oversight.
The FS AI RMF is here: https://cyberriskinstitute.org/artificial-intelligence-risk-management/
Disclaimer: The info in this article is not legal advice. It may not be relevant to your circumstances. It was written for specific contexts within banks and insurers, may not apply to other contexts, and may not be relevant to other types of organisations.