Fairness in machine learning: Equalized Odds

I recently dipped my toes into fairness concepts in machine learning. What does being fair mean, practically? Is it fair that I was not born with the physique required to qualify for the NBA?

Absolutely it’s fair.

In the lottery of life, my genes were decided on in the same way as they were for everyone else who was born. Fairness does NOT mean “treating everyone the same”; rather, it means having the same starting point in that nature-lottery.

That said, in reality it’s not only nature that is calling the shots. We, the people, also play an important part. When it comes to high-stakes predictions (say who gets a loan, or who passes a medical screening) we want to enforce the same starting point for everyone. Regrettably, applying the same model to everyone doesn’t directly imply equal treatment. Why? Because we don’t treat everyone the same. Simply speaking, models are trained on real-world data produced by… us. And, since we do not treat everyone the same, our models inevitably reflect that reality. We can only ask so much from our models (see my short rant about the bias in AI misconception in that regard).

When we speak about fairness in machine learning, a powerful “referee” is a concept called Equalized Odds. Intuitively it means “the same error rates for everyone”, that is, equal true positive and false positive rates across groups: if two individuals are equally qualified for a loan, they should have the same chance of being correctly approved and, equivalently, the same risk of being incorrectly rejected; likewise, two equally unqualified individuals should face the same chance of being incorrectly approved. Note the moral stance here: we acknowledge that our classifier will make mistakes, but we want the mistakes to be distributed in a particular way (fairly). By way of contrast, we should not accept a model that assigns one group a higher likelihood of misclassification, just because. So Equalized Odds is about equalizing the conditional error rates.

Formally speaking, equalizing the error behavior across groups means that for a protected attribute $A$ (say gender, or race), true outcome $Y$ and prediction $\hat{Y}$, Equalized Odds requires:

$\Pr(\hat{Y} = 1 \mid A = 0, Y = y) = \Pr(\hat{Y} = 1 \mid A = 1, Y = y), \quad \forall y \in \{0,1\}$

In words, your gender ($A$) shouldn’t matter. People with the same true outcome $(Y = y)$ should have the same probability of being correctly or incorrectly classified by the model. With $y = 1$ the condition equalizes the true positive rate, and with $y = 0$ the false positive rate, so Equalized Odds forces both error types to be balanced across groups. This is an attractive concept in many domains where different error types are socially costly:

In lending: false positives (giving a loan that defaults) vs false negatives (denying a loan that would repay).
In medical screening: false positives (unnecessary anxiety/tests) vs false negatives (missed disease).
In policing: false positives (unwarranted scrutiny) vs false negatives (missed threats).
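
To make the definition concrete, here is a minimal sketch of how one might check it on a set of predictions. The function names (`group_rates`, `equalized_odds_gaps`) and the toy data are made up for illustration; they are not part of the analysis later in this post. Equalized Odds holds exactly when both gaps are zero.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, group):
    """Return {group: (TPR, FPR)} from parallel lists of 0/1 labels, 0/1 predictions and group ids."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for y, h, a in zip(y_true, y_pred, group):
        # "t"/"f": prediction right or wrong; "p"/"n": predicted positive or negative
        key = ("t" if h == y else "f") + ("p" if h == 1 else "n")
        counts[a][key] += 1
    rates = {}
    for a, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["fp"] + c["tn"]
        tpr = c["tp"] / pos if pos else float("nan")
        fpr = c["fp"] / neg if neg else float("nan")
        rates[a] = (tpr, fpr)
    return rates

def equalized_odds_gaps(y_true, y_pred, group):
    """Absolute TPR and FPR differences between group 0 and group 1."""
    r = group_rates(y_true, y_pred, group)
    (tpr0, fpr0), (tpr1, fpr1) = r[0], r[1]
    return abs(tpr0 - tpr1), abs(fpr0 - fpr1)

# Toy data, made up for illustration only
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, group)
print(f"TPR gap: {tpr_gap:.2f}, FPR gap: {fpr_gap:.2f}")
```

In this toy example group 1 gets both more correct approvals and more incorrect ones, so both gaps are non-zero and the predictions violate Equalized Odds.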

Equalized Odds says: whatever we’re doing, we shouldn’t systematically impose one type of mistake more heavily on one group than another, conditional on the truth $(Y = y)$. Also note that Equalized Odds is a distributional constraint; it does not guarantee fairness per individual. It’s fairness in aggregate conditional error rates (read: on average, not per person).

Now for the part statisticians have learned to expect:

💡 constraints are never free.

Equalized Odds may force you to give up some predictive performance. To see why, and understand how to apply Equalized Odds, we first take a look at our model probability score, which indicates the likelihood of an individual falling into one class or another. Your logistic regression (say) outputs a score (probability) rather than a rigid decision (classification). Usually for binary classification we use a 0.5 threshold to determine the class (e.g. qualified or rejected for a loan). Since Equalized Odds means the error rates across groups are identical, we look at the two types of errors (typically called Type 1 and Type 2 errors) for each group:

False Positives (e.g., mistakenly approving a risky candidate) and
False Negatives (e.g., mistakenly denying a good candidate).
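
As a small illustration of how the cutoff trades one error type for the other, the sketch below scores six made-up individuals and reports both error rates at three cutoffs. The helper `error_rates` and the numbers are hypothetical, chosen only to show the mechanics.

```python
def error_rates(scores, labels, threshold):
    """Return (FPR, FNR) when predicting 1 for score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    return fp / negatives, fn / positives

# Made-up model scores and true labels, for illustration only
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

for t in (0.35, 0.5, 0.8):
    fpr, fnr = error_rates(scores, labels, t)
    print(f"threshold={t:.2f}  FPR={fpr:.3f}  FNR={fnr:.3f}")
```

Raising the cutoff lowers the false positive rate and raises the false negative rate, which is why moving it separately for each group can bring the groups' error rates together.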

Enter the Receiver Operating Characteristic (ROC) curve of the score. The ROC curve captures the false positive and true positive (equivalently, false negative) rates at different cutoff points. What we need to search for is a threshold which would deliver the same error rates for both error types and for both groups, like so:

In the hypothetical graph above we see that there is a point at which the error rates of the two groups are the same: the false positive rate is approximately 30%, and the false negative rate (which is the complement of the TPR on the Y-axis) is about 20%. The precise classification cutoff is not shown on this chart (so don’t get confused), and in general each group reaches that point with its own cutoff, but whatever those cutoffs are, they would satisfy the Equalized Odds criterion of fairness. At that point both demographic groups have identical error rates, so gender no longer affects how likely the model is to err.

The tricky part is of course to find that cutoff point.
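
One way to picture the search, as a rough sketch: treat each group's ROC curve as a piecewise-linear path and look for the points where the two paths cross. The ROC vertices below are invented for illustration, and `crossings` is a hypothetical helper written in plain Python; it follows the same segment-intersection idea as the prep code further down but is not that code.

```python
def crossings(roc_a, roc_b, eps=1e-12):
    """Intersections of two piecewise-linear ROC curves given as lists of (FPR, TPR) vertices.

    Parallel segments are skipped; the trivial corners (0,0) and (1,1) are dropped at the end.
    """
    found = []
    for (x1, y1), (x2, y2) in zip(roc_a, roc_a[1:]):
        ux, uy = x2 - x1, y2 - y1                 # direction of the segment on curve a
        for (x3, y3), (x4, y4) in zip(roc_b, roc_b[1:]):
            vx, vy = x4 - x3, y4 - y3             # direction of the segment on curve b
            det = vx * uy - ux * vy               # determinant of the 2x2 system p*u - q*v = r
            if abs(det) < eps:
                continue
            rx, ry = x3 - x1, y3 - y1
            p = (vx * ry - rx * vy) / det         # fraction along the segment of curve a
            q = (ux * ry - uy * rx) / det         # fraction along the segment of curve b
            if 0 <= p <= 1 and 0 <= q <= 1:
                found.append((x1 + p * ux, y1 + p * uy))
    return [(x, y) for x, y in found if 1e-9 < x < 1 - 1e-9]

# Hypothetical ROC vertices for two groups (not estimated from any data)
roc_group0 = [(0.0, 0.0), (0.2, 0.6), (1.0, 1.0)]
roc_group1 = [(0.0, 0.0), (0.5, 0.9), (1.0, 1.0)]

interior = crossings(roc_group0, roc_group1)
for fpr, tpr in interior:
    print(f"common point: FPR={fpr:.3f}, TPR={tpr:.3f}")
```

With these made-up curves there is a single interior crossing, at roughly FPR 0.385 and TPR 0.692. Any classifier pair operating at that point has identical error rates in both groups.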

Practical example

Let’s use the Adult dataset for our example. It’s a classic supervised-learning benchmark. Each row is one person with demographic and employment features (e.g., age, education, occupation, hours per week), and the label is whether annual income is above $50K. Gender is also one of the attributes used for prediction, and while it may matter, we would like to disregard that sensitive attribute as a predictor of high income. So we can use Equalized Odds to make the two types of error rates match across groups. Below is the Python code used to calculate the specific cutoff point where the error rates of the two groups (males and females) intersect.

The prep code creates some needed functions, loads the data and estimates a basic logistic regression (use it for replication if you want). The code after it shows the analysis.

Prep code:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve
from sklearn.model_selection import train_test_split

np.set_printoptions(precision=3,suppress=True)
pd.set_option("display.precision",3)
pd.options.display.float_format="{:.3f}".format

# Load the Adult data; the label becomes 1 for income above 50K, else 0
X,y=fetch_openml("adult",version=2,as_frame=True,return_X_y=True)
df=pd.concat([X,y.rename("income")],axis=1).dropna()
df["income"]=(df["income"]==">50K").astype(int)
X=df.drop(columns=["income"])
y=df["income"]
# One-hot encode the categorical columns; this creates the sex_Male indicator used later
X=pd.get_dummies(X,drop_first=True)
# Hold out a test set, then split the remainder into training and calibration sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0,
    stratify=y)
X_tr,X_cal,y_tr,y_cal=train_test_split(X_train,y_train,test_size=0.3,random_state=1,
    stratify=y_train)
# Fit the base classifier and score the calibration and test sets
model=LogisticRegression(max_iter=2000,solver="liblinear")
model.fit(X_tr,y_tr)
s_cal=model.predict_proba(X_cal)[:,1]
s_test=model.predict_proba(X_test)[:,1]
df_cal=X_cal.copy()
df_cal["y"]=y_cal.values
df_cal["s"]=s_cal
df_test=X_test.copy()
df_test["y"]=y_test.values
df_test["s"]=s_test
# Baseline decision with the default 0.5 cutoff
df_test["y_hat"]=(df_test["s"]>=0.5).astype(int)

def roc_by_group(df,group_col,group_val):
    d=df[df[group_col]==group_val]
    y=d["y"].to_numpy()
    s=d["s"].to_numpy()
    fpr,tpr,thr=roc_curve(y,s)
    return fpr,tpr,thr,d

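def segment_intersections(fpr0,tpr0,thr0,fpr1,tpr1,thr1):
    # Intersect every ROC segment of group 0 with every ROC segment of group 1.
    # Each candidate is (fpr,tpr,i,p,j,q): the point lies a fraction p along
    # segment i of group 0 and a fraction q along segment j of group 1.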
    cand=[]
    for i in range(len(fpr0)-1):
        a0=np.array([fpr0[i],tpr0[i]])
        b0=np.array([fpr0[i+1],tpr0[i+1]])
        u=b0-a0
        for j in range(len(fpr1)-1):
            a1=np.array([fpr1[j],tpr1[j]])
            b1=np.array([fpr1[j+1],tpr1[j+1]])
            v=b1-a1
            M=np.column_stack([u,-v])
            det=np.linalg.det(M)
            if abs(det)<1e-12:
                continue
            rhs=a1-a0
            sol=np.linalg.solve(M,rhs)
            p=float(sol[0])
            q=float(sol[1])
            if 0<=p<=1 and 0<=q<=1:
                pt=a0+p*u
                cand.append((pt[0],pt[1],i,p,j,q))
    return cand

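def total_error_at_point(df,group_col,fpr_star,tpr_star):
    # Size-weighted misclassification rate if both groups operate at
    # (fpr_star,tpr_star); pi is the share of positives within each group.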
    errs=[]
    wts=[]
    for gv in [0,1]:
        d=df[df[group_col]==gv]
        y=d["y"].to_numpy()
        pi=y.mean()
        err=pi*(1-tpr_star)+(1-pi)*fpr_star
        errs.append(err)
        wts.append(len(d))
    wts=np.array(wts,dtype=float)
    wts=wts/wts.sum()
    return float(wts[0]*errs[0]+wts[1]*errs[1])

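def apply_random_threshold(scores,ta,tb,p):
    # Each observation is thresholded at ta with probability p and at tb otherwise.
    # Relies on the module-level generator rng, which must be defined before the call.
    z=rng.random(len(scores))<p
    yhat=np.empty(len(scores),dtype=int)
    yhat[z]=(scores[z]>=ta).astype(int)
    yhat[~z]=(scores[~z]>=tb).astype(int)
    return yhat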

def rates(d):
    y=d["y"].to_numpy()
    h=d["y_hat"].to_numpy()
    tp=((h==1)&(y==1)).sum()
    fp=((h==1)&(y==0)).sum()
    fn=((h==0)&(y==1)).sum()
    tn=((h==0)&(y==0)).sum()
    tpr=tp/(tp+fn) if (tp+fn)>0 else np.nan
    fpr=fp/(fp+tn) if (fp+tn)>0 else np.nan
    return float(tpr),float(fpr)

# About 25% earn more than 50K:
print(df_test[["y"]].mean())
y   0.248

dtype: float64
# And accuracy is 
df_base=df_test.copy()
df_base["y_hat"]=(df_base["s"]>=0.5).astype(int)
acc_before=accuracy_score(df_base["y"],df_base["y_hat"])
print(f"Accuracy before EO: {acc_before:.3f}")
Accuracy before EO: 0.789
# around 80%

overall_tpr,overall_fpr=rates(df_test)
print("Overall TPR:",overall_tpr) # missing about 75% of the rich people
print("Overall FPR:",overall_fpr) # but almost never labeling a poor person as rich

# The following is without any correction applied:
Overall TPR: 0.262
Overall FPR: 0.037

if "sex_Male" in df_test.columns:
    g0=df_test[df_test["sex_Male"]==0]
    g1=df_test[df_test["sex_Male"]==1]
    tpr0,fpr0=rates(g0)
    tpr1,fpr1=rates(g1)
    print("Female TPR:",tpr0)
    print("Female FPR:",fpr0)
    print("Male TPR:",tpr1)
    print("Male FPR:",fpr1)
Female TPR: 0.291
Female FPR: 0.035
Male TPR: 0.257
Male FPR: 0.038

fpr0,tpr0,thr0,_=roc_by_group(df_cal,"sex_Male",0)
fpr1,tpr1,thr1,_=roc_by_group(df_cal,"sex_Male",1)

cand=segment_intersections(fpr0,tpr0,thr0,fpr1,tpr1,thr1)

best=None
best_err=np.inf

for fpr_star,tpr_star,i,p,j,q in cand:
    err=total_error_at_point(df_cal,"sex_Male",fpr_star,tpr_star)
    if err<best_err:
        best_err=err
        best=(fpr_star,tpr_star,i,p,j,q)

fpr_star,tpr_star,i,p,j,q=best

t0_a=float(thr0[i])
t0_b=float(thr0[i+1])
t1_a=float(thr1[j])
t1_b=float(thr1[j+1])

rng=np.random.default_rng(0)

df_out=df_test.copy()
g0=df_out["sex_Male"].to_numpy()==0
g1=~g0
yhat=np.zeros(len(df_out),dtype=int)
# p (resp. q) is the fraction of the way from ROC point i to i+1 (j to j+1),
# so the threshold of the later point, t0_b (t1_b), gets probability p (q)
yhat[g0]=apply_random_threshold(df_out.loc[g0,"s"].to_numpy(),t0_b,t0_a,p)
yhat[g1]=apply_random_threshold(df_out.loc[g1,"s"].to_numpy(),t1_b,t1_a,q)
df_out["y_hat"]=yhat

# Error rates for everyone (test set)
tpr_all,fpr_all=rates(df_out)

# Error rates per group (test set)
tpr_f,fpr_f=rates(df_out[df_out["sex_Male"]==0])
tpr_m,fpr_m=rates(df_out[df_out["sex_Male"]==1])

print(f"Female avg cutoff: {((1-p)*t0_a+p*t0_b):.3f}")
print(f"Male avg cutoff  : {((1-q)*t1_a+q*t1_b):.3f}")
Female avg cutoff: 0.708
Male avg cutoff  : 0.687

print("Target (calibration) common point FPR,TPR:",fpr_star, tpr_star)
print("Test Overall TPR,FPR:",tpr_all,fpr_all)
print("Test Female TPR,FPR:",tpr_f,fpr_f)
print("Test Male   TPR,FPR:",tpr_m,fpr_m)

Target (calibration) common point FPR,TPR: 0.008 0.181
Test Overall TPR,FPR: 0.173 0.007
Test Female TPR,FPR: 0.203 0.006
Test Male   TPR,FPR: 0.167 0.007

Results:
For easier exposition, and since the goal is to predict whether an individual earns more than 50K or not, let us simply refer to those earning more than 50K as rich, and to the rest as poor. If you would like to equalize the odds, so to speak, the chart above shows that you need to deviate from the “built-in” 0.5 cutoff point. How? Such that the FPR is more or less zero (almost no poor individual is classified as rich) and the false negative rate is about 80% (so regrettably many rich people are wrongly classified as poor). The model’s accuracy is not bad at around 80%, but of course it’s driven by classifying the majority as poor (which they are...).

You can read the exact details in the code, but in short:

We moved from the original TPR of about 26% to a TPR of about 17%; the FPR dropped from about 3.7% to about 0.7%.
We moved from a threshold of 0.5 for everyone (both males and females) to a separate threshold per group of about 0.71 (females) and 0.69 (males).
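
As a quick sanity check on these numbers, overall accuracy can be recovered from the share of positives, the TPR and the FPR. The sketch below plugs in the rounded values printed in the outputs (0.248, and the overall TPR/FPR before and after the correction); `pooled_accuracy` is a helper name made up here, and because the inputs are rounded the results are approximate.

```python
def pooled_accuracy(base_rate, tpr, fpr):
    """Accuracy implied by the share of positives, the TPR and the FPR of a binary classifier."""
    return base_rate * tpr + (1 - base_rate) * (1 - fpr)

base_rate = 0.248  # share earning more than 50K in the test set, as printed in the output

# Overall rates at the default 0.5 cutoff and after the Equalized Odds correction
acc_before = pooled_accuracy(base_rate, tpr=0.262, fpr=0.037)
acc_after = pooled_accuracy(base_rate, tpr=0.173, fpr=0.007)

print(f"before: {acc_before:.3f}  after: {acc_after:.3f}")
```

The first figure reproduces the 0.789 accuracy reported before the correction. The second comes out at about the same level, which suggests that in this example the lost true positives are roughly offset by the avoided false positives.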

More things to say:

As you can see from the graph, there are other points at which the ROC curves intersect, but the underlying rule is to choose the one with the best overall accuracy (makes sense).
Although it seems logical that accuracy would decrease overall, this is not necessarily the case. The reason is simple: we pick the cutoffs on a calibration split carved out of the training data, while accuracy is computed out-of-sample.
The code includes several other nuances. Each time you move the cutoff, a “bunch” of observations “flip” simultaneously rather than a single observation, so the ROC curve is not smooth but rather a step function, and we always have two error rates: one before and one after the change in the cutoff. But since we have many observations (so not continuous, but almost) we can simply consider the average here.
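
The idea of sitting between two neighbouring cutoffs can be sketched with a randomized rule: each person is thresholded at the lower cutoff with some probability and at the higher one otherwise, and the resulting error rates land, on average, at the matching weighted mix of the two ROC points. The data, the cutoffs and the helper names below are invented for illustration and use only the standard library; the prep code does something similar with `apply_random_threshold`.

```python
import random

def fpr_tpr(labels, preds):
    """(FPR, TPR) from parallel lists of 0/1 labels and 0/1 predictions."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for y, h in zip(labels, preds) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(labels, preds) if y == 0 and h == 1)
    return fp / neg, tp / pos

def threshold_preds(scores, t):
    """Deterministic rule: predict 1 when the score reaches the cutoff t."""
    return [int(s >= t) for s in scores]

def randomized_preds(scores, t_hi, t_lo, weight_lo, rng):
    """Use cutoff t_lo with probability weight_lo, otherwise t_hi, independently per person."""
    return [int(s >= (t_lo if rng.random() < weight_lo else t_hi)) for s in scores]

# Made-up data, repeated many times so the simulation settles near its expectation
scores = [0.2, 0.4, 0.55, 0.6, 0.8, 0.9] * 5000
labels = [0, 0, 1, 0, 1, 1] * 5000
t_hi, t_lo, weight_lo = 0.7, 0.5, 0.25

fpr_hi, tpr_hi = fpr_tpr(labels, threshold_preds(scores, t_hi))
fpr_lo, tpr_lo = fpr_tpr(labels, threshold_preds(scores, t_lo))

# Expected operating point: the weighted mix of the two ROC points
expected = ((1 - weight_lo) * fpr_hi + weight_lo * fpr_lo,
            (1 - weight_lo) * tpr_hi + weight_lo * tpr_lo)

rng = random.Random(0)
simulated = fpr_tpr(labels, randomized_preds(scores, t_hi, t_lo, weight_lo, rng))

print("expected  (FPR, TPR):", expected)
print("simulated (FPR, TPR):", simulated)
```

The simulated rates should sit close to the expected mix, which is what lets a step-shaped ROC curve be treated as if every point on the connecting segments were reachable.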

Summary

This implementation of Equalized Odds is useful. I find it particularly appealing in that it operates purely as a post-processing step, so we are free to use whichever classifier we choose, without any added complexity.

Of course, there are other notions of fairness and more sophisticated ways to implement them, but this is a good enough starting point for understanding what research recommends we do, if we fare to be fair. Below are a couple of papers that served as inspiration.