
Finding Protected Attributes in FS Data

TL;DR
• Protected attributes hide in unstructured data, proxies, and third-party sources.
• Knowing where they exist helps you detect algorithmic bias and demonstrate fairness.

 

This is the first article in a series on finding protected attributes in banking and insurance data. The series will explore the major discrimination categories and their data connections.

This introductory article explains why we need to know whether we have this data, and where to look.

Understanding where race, gender, disability, and age appear in financial services data is important for managing bias risk and for meeting customer, stakeholder, and regulatory expectations.

Most banks and insurers know they need to prevent discrimination. If your algorithms have the potential to make biased decisions, but you don't know where the protected attributes exist in your data, you're flying blind.

 

Two key reasons for exploring this topic

1. Identifying bias sources

Algorithms can discriminate through both obvious and hidden pathways. Your credit model might not use a "race" field, but it could still discriminate through postcodes, spending patterns, or even photo analysis during identity verification. Without mapping where racial characteristics appear across the data ecosystem, we can't identify all the potential bias entry points.
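To make the proxy pathway concrete, here is a minimal sketch of one common check: train a simple model to predict a protected attribute from candidate features, using a reference sample where the attribute is known (for example, from a voluntary survey). The column names and synthetic data below are purely illustrative assumptions; the point is that if the model predicts the attribute well above chance, those features are leaking it and are potential bias entry points.

```python
# Minimal sketch: do candidate features act as proxies for a protected attribute?
# Assumes a reference sample where the attribute is known; names are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2_000

# Synthetic reference sample: postcode_band is correlated with the attribute,
# spend_category is not.
protected = rng.integers(0, 2, size=n)                       # 1 = group of interest
postcode_band = protected * 2 + rng.integers(0, 3, size=n)   # correlated proxy
spend_category = rng.integers(0, 5, size=n)                  # little/no signal

X = pd.get_dummies(
    pd.DataFrame({"postcode_band": postcode_band,
                  "spend_category": spend_category}).astype("category")
)

# If a simple model predicts the attribute well above 0.5 AUC, the candidate
# features are leaking it and are potential bias entry points.
auc = cross_val_score(LogisticRegression(max_iter=1000), X, protected,
                      scoring="roc_auc", cv=5).mean()
print(f"Proxy check AUC: {auc:.2f} (0.5 = no leakage)")
```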

2. Assessing disparate impact

Regulators increasingly expect evidence that your algorithms treat distinct groups fairly. But how do you test for racial bias if you don't know which customers are which race? If your system derives skin colour from unstructured sources like photos but you lack structured demographic data, measuring and proving fairness becomes problematic.

Consider this scenario: your pricing algorithm uses photos for identity verification. The facial recognition technology processes skin tone, effectively inferring race, and that inference could feed into premium or rate calculations. Without structured race data to test against, you might not detect that darker-skinned customers systematically receive higher quotes.
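As a simple illustration of why the structured data matters, the sketch below compares average premiums across groups once labels are available to test against. The group names and figures are made up; a ratio far from 1.0 would flag the kind of systematic gap described above and warrant further investigation.

```python
# Minimal sketch: comparing pricing outcomes across groups once structured
# demographic labels exist. All names and figures are illustrative.
import pandas as pd

quotes = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],   # protected-attribute label
    "premium": [420.0, 455.0, 430.0, 510.0, 495.0, 530.0],
})

# Average premium per group, and the ratio between the most and least
# favoured groups.
by_group = quotes.groupby("group")["premium"].mean()
ratio = by_group.max() / by_group.min()
print(by_group)
print(f"Max/min premium ratio: {ratio:.2f}")
```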

 

Data can be apparent, but is often inferred

Protected attributes rarely appear neatly labelled in FS databases. Instead, they often hide in:

  • Unstructured data (e.g., photos, names, addresses)
  • Proxy variables (e.g., postcodes, purchasing patterns)
  • Third-party data sources (e.g., demographic marketing profiles)
  • System assumptions (e.g., title fields suggesting gender).

The complexity varies across the four main discrimination categories.

Race operates as a social construct. Gender involves sensitive characteristics. Disability information can surface through medical and other data. Age seems straightforward but can create unexpected algorithmic challenges.

Each category, and attribute, requires different detection strategies and presents different bias patterns.

 

Next

We'll explore each discrimination category, and its attributes, in detail.

Starting with race: finding racial attributes in banking and insurance data.

Note: this series of articles won't address how to deal with potential bias. The mitigation strategies depend largely on the nature and purpose of the process/decision. We plan to delve into this in future articles.


Disclaimer: The information in this article does not constitute legal advice. It may not be relevant to your circumstances. It was written for specific algorithmic contexts within banks and insurance companies, may not apply to other contexts, and may not be relevant to other types of organisations.


 
