TL;DR
- Preventing bias is not just about fairness.
- We explore a scenario in insurance fraud...
Finding Protected Attributes in FS Data
This is the first article in a series on finding protected attributes in banking and insurance data. The series will explore the major discrimination categories and their data connections.
This introductory article explains why we need to know whether we have this data, and where to look.
Understanding where race, gender, disability, and age appear in financial services data is important for managing bias risk and meeting customer, stakeholder, and regulatory expectations.
Most banks and insurers know they need to prevent discrimination. If your algorithms have the potential to make biased decisions, but you don't know where the protected attributes exist in your data, you're flying blind.
Two key reasons for exploring this topic
1. Identifying bias sources
Algorithms can discriminate through both obvious and hidden pathways. Your credit model might not use a "race" field, but it could still discriminate through postcodes, spending patterns, or even photo analysis during identity verification. Without mapping where racial characteristics appear across the data ecosystem, we can't identify all the potential bias entry points.
2. Assessing disparate impact
Regulators increasingly expect evidence that your algorithms treat different groups fairly. But how do you test for racial bias if you don't know which customers belong to which racial group? If your system derives skin colour from unstructured sources like photos but you lack structured demographic data, measuring and proving fairness becomes problematic.
Consider this scenario: your pricing algorithm uses photos for identity verification. The facial recognition technology processes skin tone, from which race can be inferred, potentially affecting premium or rate calculations. Without structured race data to test against, you might not detect that darker-skinned customers systematically receive higher quotes.
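To make the measurement problem concrete, here is a minimal sketch (not from the article) of how quote disparities could be checked if structured group labels were available. The column names `racial_group` and `quote_amount` are hypothetical assumptions; the point is that without structured labels, even a basic check like this cannot be run.

```python
# Minimal sketch: comparing average quotes across groups.
# Assumes a pandas DataFrame with hypothetical columns
# 'racial_group' (structured reference labels) and 'quote_amount'.
import pandas as pd

def quote_disparity(df: pd.DataFrame) -> pd.DataFrame:
    """Compare each group's mean quote against the overall mean quote."""
    overall_mean = df["quote_amount"].mean()
    summary = (
        df.groupby("racial_group")["quote_amount"]
        .agg(["count", "mean"])
        .rename(columns={"mean": "group_mean"})
    )
    # A ratio above 1 means the group is, on average, quoted above the overall mean.
    summary["ratio_to_overall"] = summary["group_mean"] / overall_mean
    return summary

# Illustrative toy data only.
quotes = pd.DataFrame({
    "racial_group": ["A", "A", "B", "B", "B"],
    "quote_amount": [500, 520, 610, 640, 655],
})
print(quote_disparity(quotes))
```

A real fairness assessment would go further (statistical significance, controlling for legitimate rating factors), but it still depends on knowing which customers belong to which group.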
Data can be apparent, but is often inferred
Protected attributes rarely appear neatly labelled in FS databases. Instead, they often hide in:
- Unstructured data (e.g., photos, names, addresses)
- Proxy variables (e.g., postcodes, purchasing patterns; see the sketch after this list)
- Third-party data sources (e.g., demographic marketing profiles)
- System assumptions (e.g., title fields suggesting gender)
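As one way to screen for proxy variables, the sketch below checks how strongly protected-group membership varies across a candidate column. This is an illustrative assumption, not the article's method: the column names `postcode_district` and `racial_group` are hypothetical, and it presumes a labelled reference sample exists.

```python
# Minimal sketch: flagging a candidate proxy (e.g. postcode district) by
# comparing group shares within each proxy value against the overall shares.
import pandas as pd

def proxy_signal(df: pd.DataFrame, proxy_col: str, group_col: str) -> pd.DataFrame:
    """Deviation of each group's share per proxy value from its overall share."""
    overall = df[group_col].value_counts(normalize=True)
    by_proxy = pd.crosstab(df[proxy_col], df[group_col], normalize="index")
    # Large deviations suggest the column could act as a proxy for the
    # protected attribute and deserves closer scrutiny.
    return by_proxy.subtract(overall, axis="columns")

# Illustrative toy data only.
sample = pd.DataFrame({
    "postcode_district": ["N1", "N1", "N2", "N2", "N2", "N3"],
    "racial_group": ["A", "B", "B", "B", "A", "A"],
})
print(proxy_signal(sample, "postcode_district", "racial_group"))
```

In practice, proxy screening might use formal association measures and domain review; this sketch only shows the basic idea of comparing group composition across a candidate variable.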
The complexity varies across the four main discrimination categories.
Race operates as a social construct. Gender involves sensitive characteristics. Disability information can surface through medical and other data. Age seems straightforward but can create unexpected algorithmic challenges.
Each category, and attribute, requires different detection strategies and presents different bias patterns.
Next
We'll explore each discrimination category, and its attributes, in detail.
Starting with race - finding racial attributes in banking and insurance data.
Note: this series of articles won't address how to deal with potential bias. The mitigation strategies depend largely on the nature and purpose of the process/decision. We plan to delve into this in future articles.
Disclaimer: The information in this article does not constitute legal advice. It may not be relevant to your circumstances. It was written for specific algorithmic contexts within banks and insurance companies, may not apply to other contexts, and may not be relevant to other types of organisations.
