
Finding Race in FS Data

TL;DR
• Race comprises four key attributes (five, if immigration status is included).
• Some racial attributes are easier to spot and manage than others.
• We can control for each of them, to varying degrees.

 

This is the second article in a series about finding protected attributes in data.

In ensuring that our algorithms are fair, one of the key discrimination categories that we need to manage is Race.

As a discrimination category, Race comprises several attributes. These typically include race, colour, descent, and national or ethnic origin.

Immigration status is sometimes added. Religion is not typically included, but may form part of broader obligations (e.g., the Canadian Human Rights Act). Because some jurisdictions prohibit racial discrimination completely, while allowing for exemptions in other categories, we will exclude both immigration status and religion from this article.

Some of the attributes are easier to spot and manage than others.
But how do we prevent 'colour'-based discrimination?
Can we really do anything about all the attributes, in terms of ensuring fair algorithms?

This article explores where each attribute might appear in our data and what we can do about it. 

 

Race attributes in data

Let’s consider each of the attributes we listed earlier. For each, we’ll identify where it might be found in structured and/or unstructured data.

Before diving in, remember that algorithms can discriminate without ever seeing these attributes directly. Things like postcodes can serve as proxies. So even if we don't collect race data, we might still create racial bias.
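
To make the proxy risk concrete, here is a minimal Python sketch, an illustration rather than a prescribed method, of how one might measure the association between a candidate field such as postcode and a protected attribute in a small reference dataset collected with consent for testing. The field names and sample data below are hypothetical.

    import pandas as pd
    from scipy.stats import chi2_contingency

    def proxy_strength(df: pd.DataFrame, feature: str, protected: str) -> float:
        # Cramér's V between a candidate proxy field and a protected attribute.
        # Near 0: little association. Near 1: the field could stand in for the
        # protected attribute even if race is never collected directly.
        table = pd.crosstab(df[feature], df[protected])
        chi2, _, _, _ = chi2_contingency(table)
        n = table.to_numpy().sum()
        r, k = table.shape
        return (chi2 / (n * (min(r, k) - 1))) ** 0.5

    # Hypothetical reference sample (illustrative field names and values).
    sample = pd.DataFrame({
        "postcode":  ["2000", "2000", "3000", "3000", "2000", "3000"],
        "ethnicity": ["A", "A", "B", "B", "A", "B"],
    })
    print(proxy_strength(sample, "postcode", "ethnicity"))

A high score for a field like postcode doesn't prove discrimination, but it does tell us the field deserves the same scrutiny as an explicit race attribute.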

1. Race

Race is often considered to be a social construct. It also suffers from a circular definition problem: definitions of “racial discrimination” typically include “race” itself. In general, race is subjective.

Race can sometimes be identified “objectively”, from a data perspective, if there is such a thing. Data fields labelled “race” do exist: some countries classify, or used to classify, people’s race.

“Indian” was the official bucket the Apartheid South African leaders put me into. For context, my ancestors left India a hundred years ago, I have never been to the country, and I don’t speak any of the languages. So, I’m not quite an Indian – which itself is a nationality, not a “race”. (To be clear, I don't reject my Indian heritage).

This type of classification is problematic on many fronts. Either way, it could be codified in structured data (a field in a database that lists me as an “Indian”) or in unstructured form (I have an old identity document that lists my “race”).

There may well be other such examples, or similar “objective” data.

 

2. Colour

Another loaded topic. People of any given racial profile can have a variety of skin tones. A photo or video can reflect colour, and as the use of unstructured data grows, so does the risk of inadvertent discrimination based on colour.

In the context of banks and insurers, this can happen when we use photos for identity verification, record video calls, or process visual data in some other way. It may not yet be a big problem, but it can’t be ignored, especially as usage of these media grows. We need to carefully check how we are capturing and using such information.

 

3. Descent

Ancestry could be captured as structured data. This happens, for example, in census data collection.

It could be inferred from a name, which is problematic. I’ll use myself as an example again. My first name is Arabic, but I am not Middle Eastern, where most people with such names might be assumed to come from.

It could also be inferred from language preferences.

Descent isn’t ordinarily available in structured policy, claims, or loan data.

 

4. National or ethnic origin

National or ethnic origin appears in some structured data – for example, country of birth. Other structured fields are less specific but can still be a problem: previous address, type of visa, and type of driver's license can each be used to separate locals from foreigners, even if only in a binary way.

It also appears in some unstructured data – for example, passports – and can be inferred from language preferences.

Third-party demographic data is another source that can predict national or ethnic origin, so its use needs to be carefully considered. Such data, purportedly collected for marketing purposes, can easily land in customer acquisition or claims data.

 

What we can do

The answer to our question "Can we really do anything about all race attributes in terms of ensuring fair algorithms?" is a resounding "Yes".

To varying degrees, depending on the attribute, the data we use and how our algorithmic systems are constructed.

In short, we inspect our algorithmic systems, and:

  • consider each attribute carefully
  • double check third party data
  • identify edge cases – particularly in unstructured data and structured data that can be used to draw inferences (this is especially important when using more advanced techniques)
  • test outputs for patterns that correlate with racial demographics, even when race data isn't directly used (a short sketch follows this list)
  • regularly monitor outputs, especially decisions, across different customer groups to spot bias patterns early.
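
As a rough sketch of what the last two points can look like in practice, assuming Python with pandas and a decision log that can be tagged with a demographic grouping for testing only, one might compare outcome rates across groups and flag large gaps. The column names and threshold below are illustrative, not a standard.

    import pandas as pd

    def outcome_rates_by_group(df: pd.DataFrame, group: str, outcome: str) -> pd.Series:
        # Share of favourable outcomes (e.g. approvals) within each group.
        return df.groupby(group)[outcome].mean()

    def adverse_impact_ratio(rates: pd.Series) -> float:
        # Lowest group rate divided by highest. A common rule of thumb treats
        # ratios below roughly 0.8 as a signal to investigate, not as proof
        # of discrimination.
        return rates.min() / rates.max()

    # Hypothetical decision log used only for fairness monitoring.
    decisions = pd.DataFrame({
        "group":    ["X", "X", "X", "Y", "Y", "Y"],
        "approved": [1, 1, 1, 1, 0, 0],
    })
    rates = outcome_rates_by_group(decisions, "group", "approved")
    print(rates)                          # per-group approval rates
    print(adverse_impact_ratio(rates))    # about 0.33 here: worth a closer look

Run regularly, this kind of check turns the monitoring point above into a repeatable report rather than a one-off review.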

Disclaimer: The information in this article does not constitute legal advice. It may not be relevant to your circumstances. It was written for specific algorithmic contexts within banks and insurance companies, may not apply to other contexts, and may not be relevant to other types of organisations.


 
