How to Audit Your AI for Bias: A Step-by-Step Guide
We audited 12 production AI systems. 9 had measurable bias. Not malicious — just untested. Here's the exact methodology we used to find and fix it.
Step 1: Define Protected Groups
Start with legal requirements, then expand:
Legal (mandatory):
- Disability status
Extended (recommended):
- Education level
Domain-specific:
- Hiring: School tier, gap years
Step 2: Gather Data
You need:
- Production data (what the model is actually seeing)
Check for representation:
- Is there historical skew (e.g., fewer women in tech roles in training data)?
Example: A hiring tool trained on 2018–2023 data will reflect pandemic-era patterns. That may not be what you want in 2026.
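One way to check representation is to compare each group's share of the training data with its share of production traffic. The sketch below is a minimal illustration in plain Python; the function names and the toy group labels are assumptions made for this example, not part of any library.

```python
from collections import Counter

def group_shares(groups):
    """Return each group's share of the records as a fraction of the total."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

def representation_gaps(train_groups, prod_groups):
    """Share in production minus share in training, per group."""
    train = group_shares(train_groups)
    prod = group_shares(prod_groups)
    return {g: prod.get(g, 0.0) - train.get(g, 0.0)
            for g in sorted(set(train) | set(prod))}

# Toy data (hypothetical): the group label attached to each record
train_groups = ["men"] * 80 + ["women"] * 20
prod_groups = ["men"] * 55 + ["women"] * 45

gaps = representation_gaps(train_groups, prod_groups)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%} in production vs. training")
```

A large positive gap means the model sees far more of that group in production than it was trained on, which is a reason to look more closely at its error rates for that group.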
Step 3: Choose Metrics
Demographic Parity
What it measures: Are positive outcomes equally distributed across groups?
Formula: P(Ŷ = 1 | A = 0) = P(Ŷ = 1 | A = 1)
When to use: When you want equal outcome rates regardless of the ground-truth labels, for example because the historical labels are themselves biased.
Example: Loan approvals. If 60% of Group A gets approved, 60% of Group B should too.
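In the formula, Ŷ is the model's decision and A is the group. A minimal sketch of the check in plain Python follows; the function names and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def selection_rates(predictions, groups):
    """Estimate P(Y_hat = 1 | A = g) for each group g."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for y_hat, g in zip(predictions, groups):
        totals[g] += 1
        positives[g] += y_hat
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(predictions, groups):
    """Largest gap in selection rate between any two groups."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Toy data (hypothetical): 1 = approved, 0 = denied
predictions = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

rates = selection_rates(predictions, groups)
dp_gap = demographic_parity_difference(predictions, groups)
print(rates, round(dp_gap, 3))
```

A difference of zero means demographic parity holds exactly on this sample; in practice you compare the difference against a tolerance you have agreed on in advance.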
Equal Opportunity
What it measures: Are true positive rates equal across groups?
Formula: P(Ŷ = 1 | Y = 1, A = 0) = P(Ŷ = 1 | Y = 1, A = 1)
When to use: When you care about catching qualified candidates equally.
Example: Hiring. If someone is qualified, they should have an equal chance of being hired regardless of group.
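Equal opportunity only looks at people whose true label is positive. The sketch below estimates the true positive rate per group in plain Python; the function name and the toy data are assumptions for illustration.

```python
def true_positive_rates(y_true, y_pred, groups):
    """Estimate P(Y_hat = 1 | Y = 1, A = g) for each group with a positive label."""
    hits, positives = {}, {}
    for y, y_hat, g in zip(y_true, y_pred, groups):
        if y == 1:  # only truly qualified individuals count
            positives[g] = positives.get(g, 0) + 1
            hits[g] = hits.get(g, 0) + y_hat
    return {g: hits[g] / positives[g] for g in positives}

# Toy data (hypothetical): 1 = qualified / hired, 0 = not
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5

tpr = true_positive_rates(y_true, y_pred, groups)
tpr_gap = max(tpr.values()) - min(tpr.values())
print(tpr, tpr_gap)
```

Here qualified members of group B are hired less often than qualified members of group A, which is exactly the gap this metric is meant to surface.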
Calibration
What it measures: Does the model's confidence match reality across groups?
Example: If the model predicts 80% default risk, approximately 80% should actually default — in every group.
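A simple way to check this is to bin the predicted scores and compare, per group, the mean predicted score in each bin with the observed outcome rate. The following is a rough sketch under assumed names; the number of bins and the toy data are arbitrary choices for illustration.

```python
from collections import defaultdict

def calibration_by_group(scores, outcomes, groups, n_bins=5):
    """Map (group, bin) to (mean predicted score, observed rate, count)."""
    buckets = defaultdict(list)
    for s, y, g in zip(scores, outcomes, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        buckets[(g, b)].append((s, y))
    table = {}
    for key, rows in sorted(buckets.items()):
        mean_score = sum(s for s, _ in rows) / len(rows)
        observed = sum(y for _, y in rows) / len(rows)
        table[key] = (mean_score, observed, len(rows))
    return table

# Toy data (hypothetical): every applicant gets a predicted default risk of 0.8
scores = [0.8] * 10
outcomes = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]  # 1 = actually defaulted
groups = ["A"] * 5 + ["B"] * 5

table = calibration_by_group(scores, outcomes, groups)
for (group, b), (mean_score, observed, n) in table.items():
    print(f"group={group} bin={b} predicted={mean_score:.2f} observed={observed:.2f} n={n}")
```

In this toy sample the model is calibrated for group A but overstates risk for group B, even though both groups receive the same score.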
Individual Fairness
What it measures: Are similar individuals treated similarly?
When to use: When you want case-by-case consistency.
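One rough way to probe individual fairness is to look for pairs of people whose relevant features are nearly identical but who received different decisions. The sketch below assumes Euclidean distance on already-scaled features that exclude the protected attributes; the distance measure, the `max_distance` cutoff and the function name are all illustrative assumptions.

```python
import math

def inconsistent_pairs(features, predictions, max_distance):
    """Index pairs that are within max_distance of each other but got different predictions."""
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            close = math.dist(features[i], features[j]) <= max_distance
            if close and predictions[i] != predictions[j]:
                pairs.append((i, j))
    return pairs

# Toy data (hypothetical): two scaled features per person, e.g. experience and test score
features = [(5.0, 3.0), (5.0, 3.1), (1.0, 0.5), (5.1, 3.0)]
predictions = [1, 0, 0, 1]

flagged = inconsistent_pairs(features, predictions, max_distance=0.2)
print(flagged)
```

The pairwise loop is quadratic, so for large datasets you would sample pairs or use a nearest-neighbor index instead. Flagged pairs are candidates for manual review, not proof of unfairness.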
Step 4: Run the Audit
Tool: Use Aequitas or Fairlearn
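```python
# Install (run in a shell, not in Python):
#   pip install aequitas

# Basic audit
from aequitas.audit import Audit
from aequitas.plotting import Plot

# df holds one row per decision; "race", "predicted" and "actual" are column names.
# Argument order and method names vary between Aequitas releases, so check
# them against the documentation for your installed version.
audit = Audit(df, "race", "predicted", "actual")
audit.summary()
```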
What to Test
- Group-level metrics:
  - Selection rate by group
  - False positive rate by group
  - False negative rate by group
  - True positive rate by group
- Threshold analysis:
  - What happens at different decision thresholds?
  - Is there a threshold that's fair for all groups?
  - Or do you need group-specific thresholds?
- Intersectionality:
  - Don't just test gender and race separately
  - Test intersectional subgroups against each other, e.g. Black women vs. White men vs. Asian non-binary people
  - The worst bias is often at intersections
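The group-level and intersectional checks above can be sketched without any fairness library by grouping on a tuple of attributes. The code below is a minimal illustration; the record layout, the `actual` and `predicted` field names and the helper names are assumptions for this example.

```python
from collections import defaultdict

def safe_rate(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def rates_by_subgroup(records, attrs, label="actual", pred="predicted"):
    """Selection rate, FPR, FNR and TPR for every observed combination of attrs."""
    cells = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for r in records:
        key = tuple(r[a] for a in attrs)
        if r[pred] == 1:
            cells[key]["tp" if r[label] == 1 else "fp"] += 1
        else:
            cells[key]["fn" if r[label] == 1 else "tn"] += 1
    out = {}
    for key, c in cells.items():
        n = sum(c.values())
        out[key] = {
            "n": n,
            "selection_rate": safe_rate(c["tp"] + c["fp"], n),
            "fpr": safe_rate(c["fp"], c["fp"] + c["tn"]),
            "fnr": safe_rate(c["fn"], c["fn"] + c["tp"]),
            "tpr": safe_rate(c["tp"], c["fn"] + c["tp"]),
        }
    return out

# Toy data (hypothetical)
records = [
    {"gender": "F", "race": "Black", "actual": 0, "predicted": 1},
    {"gender": "F", "race": "Black", "actual": 0, "predicted": 0},
    {"gender": "F", "race": "Black", "actual": 1, "predicted": 0},
    {"gender": "F", "race": "Black", "actual": 1, "predicted": 1},
    {"gender": "M", "race": "White", "actual": 0, "predicted": 0},
    {"gender": "M", "race": "White", "actual": 0, "predicted": 0},
    {"gender": "M", "race": "White", "actual": 1, "predicted": 1},
    {"gender": "M", "race": "White", "actual": 1, "predicted": 1},
]

rates = rates_by_subgroup(records, attrs=["gender", "race"])
for subgroup, metrics in rates.items():
    print(subgroup, metrics)
```

Both subgroups have the same selection rate here, yet their error rates differ sharply, which is why selection rate alone is not enough. Keep an eye on `n`: rates for very small subgroups are noisy.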
Step 5: Analyze Results
Red Flags
| Finding | Severity | Action |
|---------|----------|--------|
| 20%+ difference in false positive rate | Critical | Stop deployment, retrain |
| 10–20% difference in false positive rate | High | Mitigate before deployment |
| 5–10% difference in false positive rate | Medium | Monitor, plan fix |
| <5% difference in false positive rate | Low | Document, review annually |
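The table can be turned into a small lookup so every audit applies the same cutoffs. This sketch assumes the differences are measured in percentage points and that a value exactly on a boundary falls into the more severe tier; both are assumptions, since the table leaves them open.

```python
def classify_fpr_gap(gap_points):
    """Map a false positive rate gap (in percentage points) to (severity, action)."""
    if gap_points >= 20:
        return ("Critical", "Stop deployment, retrain")
    if gap_points >= 10:
        return ("High", "Mitigate before deployment")
    if gap_points >= 5:
        return ("Medium", "Monitor, plan fix")
    return ("Low", "Document, review annually")

# Example: group FPRs of 12% and 14% give a 2-point gap
severity, action = classify_fpr_gap(abs(12 - 14))
print(severity, "-", action)
```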
Case Study: Hiring Tool
What we found:
- Root cause: Women were less likely to apply, so the training data had fewer positive examples for them
Fix:
- Result: 8% gap (acceptable, monitored)
Step 6: Fix or Mitigate
Option 1: Fix the Data
- Augment underrepresented groups
Option 2: Fix the Model
- Apply fairness constraints during training
Option 3: Fix the Threshold
- Post-process predictions for fairness
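As one illustration of threshold post-processing, the sketch below picks, for each group, the highest score threshold that still reaches a target true positive rate. The target, the candidate thresholds, the helper names and the toy scores are all assumptions; this is one possible approach, not the only way to post-process.

```python
def tpr_at_threshold(scores, labels, threshold):
    """True positive rate when everyone scoring at or above threshold is selected."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)

def pick_threshold(scores, labels, target_tpr, candidates):
    """Highest candidate threshold whose TPR meets the target, or None if none does."""
    for t in sorted(candidates, reverse=True):
        if tpr_at_threshold(scores, labels, t) >= target_tpr:
            return t
    return None

# Toy data (hypothetical): model scores and true labels per group
data = {
    "A": {"scores": [0.9, 0.8, 0.7, 0.4, 0.3], "labels": [1, 1, 1, 0, 0]},
    "B": {"scores": [0.7, 0.6, 0.5, 0.4, 0.2], "labels": [1, 1, 1, 0, 0]},
}
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

thresholds = {
    g: pick_threshold(d["scores"], d["labels"], target_tpr=1.0, candidates=candidates)
    for g, d in data.items()
}
print(thresholds)

# For comparison: TPR of group B under group A's threshold
shared_tpr_b = tpr_at_threshold(data["B"]["scores"], data["B"]["labels"], thresholds["A"])
print(round(shared_tpr_b, 2))
```

In this toy data a single shared threshold that works for group A misses most qualified members of group B, because the model scores group B lower overall.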
Option 4: Human Review
- Document override reasons
Step 7: Document Everything
Required for compliance:
- Monitoring plan
Template:
```
Audit Date: [Date]
System: [Name]
Auditor: [Person/Team]
Protected Groups: [List]
Metrics: [Parity/Opportunity/Calibration]
Results:
  Group A: Selection rate 62%, FPR 12%, FNR 18%
  Group B: Selection rate 58%, FPR 14%, FNR 20%
  Gap: 4% selection, 2% FPR, 2% FNR
Finding: [Acceptable/Requires mitigation/Critical]
Action: [What was done]
Retest Date: [When]
```
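The Gap line in the template is the difference between the highest and lowest group value for each metric. A small sketch of that arithmetic, using the template's example numbers and an assumed dictionary layout:

```python
def metric_gaps(results):
    """Max minus min across groups for each metric, in percentage points."""
    metrics = next(iter(results.values())).keys()
    return {
        m: max(r[m] for r in results.values()) - min(r[m] for r in results.values())
        for m in metrics
    }

# Example figures from the template, in percent
results = {
    "Group A": {"selection_rate": 62, "fpr": 12, "fnr": 18},
    "Group B": {"selection_rate": 58, "fpr": 14, "fnr": 20},
}

gaps = metric_gaps(results)
print(gaps)
```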
Step 8: Monitor Continuously
Bias isn't a one-time fix. Models drift.
Monitor:
- After data changes: Immediate audit
Alerts:
- Any group representation <5%
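The representation alert can be automated with a few lines. The sketch below flags any group whose share of recent records falls under 5%; the function name and the toy data are assumptions, and wiring the result into your alerting system is left out.

```python
from collections import Counter

def underrepresented_groups(groups, min_share=0.05):
    """Return {group: share} for every group whose share is below min_share."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items() if n / total < min_share}

# Toy data (hypothetical): group label of each record in the latest monitoring window
recent_groups = ["A"] * 60 + ["B"] * 37 + ["C"] * 3

alerts = underrepresented_groups(recent_groups)
for group, share in alerts.items():
    print(f"ALERT: group {group} is only {share:.0%} of recent records")
```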
The Catch
Three failure modes we see:
- Fixing numbers, not outcomes: You can equalize selection rates while still being unfair. Always validate with qualitative review.
The Bottom Line
Bias audits aren't optional anymore. The EU AI Act requires providers of high-risk systems to examine their data for bias and manage the resulting risks. The FTC has taken enforcement action against companies over discriminatory AI. Civil lawsuits are starting.
Cost: $10K–50K for a professional audit. $2K–5K to do it yourself.
Timeline: 2–4 weeks for initial audit. 1 week for retests.
Start with: Your highest-risk system. The one that affects the most people with the biggest consequences.
The companies that get caught aren't evil — they're just untested. Don't be untested.