Finding Sex/Gender in FS Data
This is the third article in the Algorithm Integrity Matters series about finding protected attributes in data.
To ensure our algorithms are fair, we need to manage four key discrimination categories. We explored Racial Discrimination in the previous article.
Another key category is Sex Discrimination. It contains eight distinct attributes: sex, gender identity, sexual orientation, intersex status, pregnancy, breastfeeding, marital status, and family or carer responsibilities.
As with Race, some of these attributes are easier to spot and manage than others. So we ask the same question again: Can we really control for all eight attributes to ensure fair algorithms?
This article explores where each attribute might appear in our data and what we can do about it.
Gender Attributes in Data
Let's consider each of the eight attributes listed above and identify where they might appear in structured and/or unstructured data.
We note that algorithms can discriminate without ever seeing these attributes directly. Things like employment history, spending patterns, product choices, and transaction timing can serve as proxies. So even if we don't collect gender data, we might still create gender bias.
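One way to surface such proxies is to test, on a labelled audit sample, how strongly each candidate feature separates the sensitive attribute. The sketch below is a minimal illustration of that idea; the field names (`shops_baby_products`, `is_female`) are assumptions for demonstration, not a real schema, and a production check would add statistical significance testing.

```python
# Hedged sketch: flag candidate proxy features by how strongly they
# separate a sensitive attribute on a labelled audit sample.
# Field names are illustrative, not a real data schema.

def proxy_gap(records, feature, sensitive):
    """Difference in sensitive-attribute rate between feature groups.

    A gap near 0 suggests the feature carries little information about
    the attribute; a large gap flags a potential proxy worth reviewing.
    """
    groups = {True: [], False: []}
    for r in records:
        groups[bool(r[feature])].append(1 if r[sensitive] else 0)
    rates = {g: sum(v) / len(v) for g, v in groups.items() if v}
    if len(rates) < 2:
        return 0.0  # feature does not split the sample at all
    return abs(rates[True] - rates[False])

sample = [
    {"shops_baby_products": 1, "is_female": 1},
    {"shops_baby_products": 1, "is_female": 1},
    {"shops_baby_products": 0, "is_female": 0},
    {"shops_baby_products": 0, "is_female": 1},
]
print(proxy_gap(sample, "shops_baby_products", "is_female"))
```

Running this over each feature used by a model gives a rough shortlist of proxies to investigate before they quietly reintroduce the attribute we thought we had excluded.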
1. Sex
Sex is assigned at birth. Gender, on the other hand, is either based on sex or individually selected. Historically, the two terms were used interchangeably. This is evolving, but because they used to be treated as the same, older data and older systems might still conflate them.
Sex is typically captured in structured data through sex/gender fields, title fields (Mr/Ms), legal documents, and self-identification forms. Title fields are not always reliable; e.g., “Dr” doesn’t identify sex, but a model could infer one sex or the other from its training data.
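One defensive pattern is to map titles to sex only where the mapping is unambiguous, and explicitly return "unknown" otherwise, so a downstream model cannot silently guess from training-data skew. This is a minimal sketch; the mapping below is an assumption for illustration, not a complete or authoritative list.

```python
# Hedged sketch: derive sex from a title field only where unambiguous.
# The mapping is illustrative and deliberately incomplete.

TITLE_TO_SEX = {"mr": "M", "ms": "F", "mrs": "F", "miss": "F"}

def sex_from_title(title):
    """Return 'M'/'F' for unambiguous titles, else None.

    'Dr', 'Prof', 'Mx', etc. carry no sex information; returning None
    keeps the field honestly unknown instead of letting a model infer
    a value from correlated training data.
    """
    return TITLE_TO_SEX.get(title.strip().lower().rstrip("."))

print(sex_from_title("Dr"))   # None: do not guess
print(sex_from_title("Ms."))
```

The design choice here is to prefer a missing value over a plausible guess: an explicit None can be handled (or excluded) deliberately, whereas an inferred value becomes invisible bias.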
2. Gender Identity
Gender identity rarely appears in structured data. It might be captured in diversity surveys or customer service notes, but isn't typically collected systematically.
It could potentially be inferred from name changes, title changes, or communication preferences, but these inferences are highly unreliable and potentially discriminatory.
In general, gender identity is invisible in most datasets.
3. Sexual Orientation
Sexual orientation almost never appears directly in financial services data. It might be inferred from joint account holders, beneficiary relationships, or address sharing, but these inferences are problematic.
Two men sharing an address might be flatmates, brothers, or partners. Making assumptions about sexual orientation from such data creates discrimination risks.
This attribute is largely invisible in financial data.
4. Intersex Status
Intersex status is protected under Australian law but rarely visible in data. It might appear in medical insurance claims or specific identity documents, but most financial institutions wouldn't encounter this information.
When it does appear, it's typically in unstructured data like medical records or identity documents.
5. Pregnancy
Pregnancy can be inferred from various data sources. Health insurance claims, medical appointments, changes in spending patterns (baby products, medical visits), or parental leave requests all suggest pregnancy.
Third-party data from retailers and digital platforms can reveal pregnancy status through purchasing and browsing behaviour.
Pregnancy status can appear in both structured (insurance, leave records) and unstructured data (purchasing patterns).
6. Breastfeeding
Breastfeeding rarely appears directly in financial data but might be inferred from purchasing patterns, health insurance claims, or workplace accommodation requests.
Like pregnancy, this information might come through third-party data sources that track health and baby-related purchases.
7. Marital or Relationship Status
This commonly appears in structured data through account types (joint accounts), beneficiary nominations, emergency contacts, and address sharing. Application forms often collect this directly.
However, de facto relationships, same-sex partnerships, and relationship changes create complexity. Someone might be legally single but in a committed relationship, or recently separated but still sharing accounts.
Either way, this status can be directly identified or inferred.
8. Family or Carer Responsibilities
This might be derived from transaction patterns (school fees, childcare, aged care), flexible work arrangements, or parental leave records. Emergency contact lists and beneficiary nominations also suggest family relationships.
Third-party demographic data often includes household composition information that reveals potential caring responsibilities. Except for carefully determined marketing tasks, we should generally avoid using such data in our loan approval, underwriting, and claims algorithms.
What We Can Do
The answer to our question, "Can we really control for all eight attributes to ensure fair algorithms?", is not straightforward.
Unlike race, many gender-related attributes are either invisible in financial data or highly sensitive when they do appear. This creates different challenges around inference, assumption, and privacy.
In short, we can inspect our algorithmic systems, and:
- Avoid inferring sex from indirect data – a bit more challenging than it sounds, given the various data points that can create such inferences
- Question assumptions built into structured data like title fields
- Check third-party data for sex-related predictions
- Monitor for patterns that might inadvertently disadvantage either sex; for example, test whether direct or inferred data affects credit or insurance decisions.
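The monitoring point above can be sketched as a simple approval-rate disparity check across a sex attribute (direct or inferred). This is a minimal illustration under assumed field names and data; real monitoring would add sample-size and significance checks before acting on a gap.

```python
# Hedged sketch: compare approval rates across groups and report the
# largest gap (a demographic-parity-style check). Group labels and
# decisions below are illustrative sample data, not real records.

def approval_rates(decisions):
    """decisions: list of (group, approved) pairs -> approval rate per group."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

def parity_gap(decisions):
    """Largest difference in approval rate between any two groups."""
    rates = approval_rates(decisions)
    return max(rates.values()) - min(rates.values())

sample = [("F", True), ("F", False), ("F", True),
          ("M", True), ("M", True), ("M", True)]
print(approval_rates(sample))
print(round(parity_gap(sample), 2))
```

A persistent gap on this kind of metric does not prove discrimination on its own, but it tells us where to look: which features, direct or proxied, are driving the difference.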
Disclaimer: The information in this article does not constitute legal advice. It may not be relevant to your circumstances. It was written for specific algorithmic contexts within banks and insurance companies, may not apply to other contexts, and may not be relevant to other types of organisations.
