Fool's Gold 2: Is There Really a Low-Cost Accuracy-Fairness Trade-off?

Richard Pace, PhD
Oct 25, 2023

Updated: Mar 31, 2025

The existence of similarly-predictive credit models with improved fairness is a foundational pillar of algorithmic debiasing. Is it actually true?

Earlier this year, I released Fool's Gold? Assessing the Case For Algorithmic Debiasing - an empirical assessment of claims made by certain fintechs, regtechs, and consumer advocates that credit model fairness can be relatively easily improved via machine learning techniques that search for less discriminatory alternative ("LDA") model configurations. These techniques, commonly referred to as algorithmic debiasing ("AD"), are based on a fundamental premise that, for a given credit model, a multiplicity of similarly-predictive, yet fairer, alternative models exist. Accordingly, to manage fair lending risk effectively, proponents urge lenders to regularly perform AD on existing and new credit models and adopt LDA versions of such models when improved fairness can be achieved with small sacrifices in model accuracy.[1]

On its face, AD appears to be an obvious advancement in expanding responsible credit access to underserved consumers - particularly given the fundamental premise on which it is based. However, given these clear apparent benefits, there is a surprising lack of transparency from promoters and adopters as to how AD specifically achieves its goal, and whether there may be important collateral risks associated with its model changes that risk and compliance professionals should take heed of.

This lack of AD transparency inspired my original Fool's Gold article where I created a simple credit scoring model (the "Base Model") from publicly-available Home Mortgage Disclosure Act ("HMDA") data and used this model - along with HMDA's applicant race data - to generate LDA Models using popular AD techniques such as fairness regularization and adversarial debiasing. I then investigated how the debiasing process specifically altered my Base Model to improve fairness, and identified some important risk considerations that lenders should consider if pursuing this approach to mitigate credit model disparate impact risk - for example:

Algorithmic debiasing may distort the Base Model's estimated credit risk relationships so seriously that certain LDA Models become counterintuitive, conceptually unsound, and expose the lender to elevated safety-and-soundness risk.

Algorithmic debiasing effectively "overrides" the Base Model's probability of default ("PD") estimates for certain credit risk profiles - making them more or less risky than their actual credit performance - in order to improve relative approval rates (i.e., the AIR fairness metric)[2]. Such distortions in these credit risk signals in the interest of improved fairness mean that the LDA Model no longer produces reliable default rate estimates - particularly on the subset of approved loans - thereby challenging the lender's ability to manage portfolio credit risk effectively.

Because of AD's dual competing objectives (i.e., improving lending outcome fairness while still accurately differentiating defaulters from payers), standard approaches for determining Adverse Action reasons may yield denial reasons that are not compliant with ECOA. This is because the LDA Model produces some lower-risk denials and some higher-risk approvals solely to improve the model's fairness performance. Accordingly, standard methods to determine ECOA denial reasons may attribute credit risk factors as the reason(s) for these lower-risk denials when, in fact, such denials were driven solely by fairness improvement considerations of the applicant demographics associated with those credit risk factors.

However, with further analysis and thought, I realize that this assessment was incomplete as it excluded one very important consideration - whether the foundational premise of algorithmic debiasing is actually true. That is, my analyses were performed using the standard model fairness and accuracy metrics (AIR and AUC, respectively) used by AD proponents - metrics that are foundational to the resulting "low-cost" fairness-accuracy trade-off supporting proponents' calls for LDA Model search and adoption.

But are these the right metrics? And, if not, what is the implication for the foundational premise of algorithmic debiasing - i.e., the existence of a multiplicity of similarly predictive, yet fairer models?

As I have written rather extensively about my issues with the Adverse Impact Ratio, that will not be my focus here although I summarize some of its apparent flaws a little later in this post. Instead, I focus my analyses on two other dimensions of AD's foundational premise:

Model Accuracy Measurement - I evaluate whether the typical measure of "predictive accuracy" used in AD approaches (the Area Under the Curve ("AUC") statistic) sufficiently captures the accuracy metrics most important for traditional credit risk management. What I find is that the answer to this question is No and, accordingly, reliance solely on AUC (or other similar measures of the model's "rank-order" prediction accuracy) may result in loan originations whose actual default rates are much worse than predicted by the LDA Model - thereby compromising effective portfolio credit risk management and significantly weakening AD's foundational premise.

How LDA Outcome Fairness is Achieved - improved model fairness is generally summarized by an outcomes-based metric like the AIR - with little if anything said about exactly how such improved fairness actually occurs, or whether the credit risk profiles of the applicants whose credit decisions are changed by the LDA Model are intuitively related to traditional disparate impact drivers. To help expand this dimension of model explainability / transparency, I take a closer look at the "swap sets" of applicants generated by the LDA Model (i.e., applicants denied under the original Base Model but approved under the LDA Model, and vice versa). What I find is that - to reduce approval rate disparities (and thereby improve the AIR metric) - the debiasing algorithms encode a reverse disparate impact into the mathematical structure of the LDA model.

While this swapping of Base Model approvals and denials impacts both demographic groups due to its implementation via the weights on "objective" model attributes, it is clear that these attributes are selected by the debiasing algorithm due to their correlation with the applicants' underlying demographics and their ability to effect the desired change in relative approval rates. While this may represent what many consider to be the inherent purpose of algorithmic debiasing (i.e., altering the model to down-weight certain attributes that create artificial barriers to protected class loan approval), my analysis shows that this favorable treatment is not necessarily focused on otherwise qualified applicants whose credit risk is improperly overestimated by the Base Model.

Given the significance of each of these areas, I have split this Fool's Gold "prequel" into two parts. The first, which you are currently reading, addresses the model accuracy metric. The second part - to be released next month - will focus on salient characteristics of the LDA Model "swap sets" and raise some important legal and reputational considerations associated with the credit risk profiles of these applicants.

Alright, let's dive in.

AD's Foundational Premise: The Accuracy-Fairness Trade-off

As noted above, algorithmic debiasing assumes that, for a given credit model, there exists a "multiplicity" of alternative models with similar predictive performance but with improved fairness - as measured by the outcomes-based AIR fairness metric. Therefore, according to proponents, lenders can in many cases significantly improve model fairness by deploying an LDA Model with relatively little sacrifice in model predictive accuracy.[3] Figure 1 below, excerpted from my original Fool's Gold article, illustrates this premise by plotting the LDA Models generated by a fairness regularization algorithm applied to my simple HMDA-based credit scoring model.

Figure 1: LDA Credit Model Accuracy-Fairness Trade-off (Source: Figure 9 - Original Fool's Gold Article)

In this chart, Model Accuracy, displayed on the vertical axis, is measured using the AUC accuracy metric - a standard model performance metric employed for classification models, such as credit scoring, where the model is designed to differentiate between two mutually-exclusive "good" and "bad" events - such as payment and default. Model Fairness, displayed on the horizontal axis, is instead measured using the AIR outcomes-based fairness metric - calculated as the ratio of the protected class loan approval rate relative to the corresponding control group loan approval rate - with values closer to one indicating better fairness performance, and vice versa. As I have discussed in previous blog posts, this measure of fairness is based on the presumption that approval rates should be equal between demographic groups regardless of differences in underlying applicant qualifications.[4]
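As a minimal sketch of the AIR calculation just described, the Python snippet below divides a protected class approval rate by a control group approval rate. The function names and the approval counts are hypothetical illustrations and are not taken from the HMDA sample.

```python
def approval_rate(decisions):
    """Share of applicants approved; `decisions` is a list of booleans."""
    return sum(decisions) / len(decisions)


def adverse_impact_ratio(protected_decisions, control_decisions):
    """AIR = protected class approval rate / control group approval rate."""
    return approval_rate(protected_decisions) / approval_rate(control_decisions)


# Hypothetical credit decisions (True = approved), for illustration only.
protected = [True] * 72 + [False] * 28   # 72% approved
control = [True] * 90 + [False] * 10     # 90% approved

air = adverse_impact_ratio(protected, control)
print(round(air, 2))  # 0.8
```

Note that nothing in this calculation conditions on applicant credit qualifications; only the two approval rates enter the ratio.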

Figure 1 illustrates the oft-described "low-cost" accuracy-fairness trade-off underlying the viability and value of LDA Model search - according to proponents. This can be seen by the relatively flat profile of the potential LDA Models to the right of the Base Model - indicating that modest gains in AIR-measured fairness can be achieved at relatively negligible cost of AUC-measured model accuracy.

But an important question is: Is AUC the relevant (or only) model accuracy metric for credit risk management purposes?

To answer this, let's first take a deeper dive into what the AUC accuracy metric really measures.

AUC Rank-Ordering Accuracy: A Credit Scoring Model's First Use Case

Fundamentally, model accuracy metrics should be tailored to the intended use(s) of a model. In the case of credit scoring, for example, effective credit risk management requires these models to satisfy two important use cases: (1) the ability to sort or rank-order applicants accurately according to their relative credit risk to facilitate the lender's credit decision strategy (e.g., approval decisions and pricing), and (2) the ability to predict accurately the expected default rates of certain applicant subsets - primarily, approved applicants - to measure potential portfolio credit exposure and losses relative to company policies and risk limits.

In the context of algorithmic debiasing, such as in Figure 1, the model's predictive accuracy is typically measured by the AUC accuracy metric.[5] And, as also shown in Figure 1, the AD process typically generates a number of LDA Models with significantly improved outcome fairness (as measured by the AIR metric) at a "cost" of (initially) small reductions in this measure of model accuracy. For example, in my simple credit scoring model, the selected LDA Model's AUC value is 0.814 vs. the Base Model's AUC value of 0.843 - a reduction of only 0.029, or 3.4% on a relative basis - a cost consistent with AD's foundational premise.

I note, however, that the AUC metric displayed in Figure 1 is designed to address only the first credit scoring use case - that is, measuring how well the model differentiates between "good" applicants and "bad" applicants - where "good/bad" represents whether the applicant repays the loan or not. From a practical perspective, strong "rank-ordering" or "differentiation" power means that the model tends to assign high PDs to actual defaulters and low PDs to actual payers - such that a single probability threshold can largely separate these "defaulter" and "payer" applicants from each other - an important feature for operationalizing a lender's credit decision strategy.

Importantly, for this specific credit risk management use case, only the relative accuracy of these PDs is important for this accuracy metric; that is, as long as the estimated PDs for defaulters are greater than the estimated PDs for payers, the precise values of these PDs are largely irrelevant. In fact, one can scale these PDs up or down and not affect the model's AUC accuracy metric - so long as the scaling does not alter the ordering of the original PDs. To illustrate this visually, consider the following diagram in which a sample of applicants is sorted from lowest (0.0) to highest (1.0) estimated PD:

Figure 2: Rank-Order Accuracy (Source: Google Machine Learning Concepts)

In this diagram, Actual Positives are the defaulters (green P dots) while Actual Negatives are the payers (red N dots). A credit scoring model possessing high "rank-ordering" accuracy would assign a large percentage of its actual defaulters higher PDs than those of its actual payers, and this is - in fact - what we see here. Specifically, a large percentage of actual defaulters (green P dots) reside toward the right of the diagram (i.e., closer to 1.0) and a large percentage of payers (red N dots) reside towards the left of the diagram (i.e., closer to 0.0). Models with a high degree of rank-ordering accuracy are desirable as their ability to separate defaulters from payers allows a lender to fairly easily select a single PD threshold or cutoff to determine the set of applicants to approve (i.e., approve applications with PDs less than this threshold) - consistent with its credit decision strategy and risk limits.

How do we translate the model's rank-ordering predictive performance into an accuracy metric?

The most common way to express the model's rank-ordering power is via the AUC - an accuracy metric that in practice varies in value between 0.50 and 1.0 - with 1.0 representing a model with perfect rank-ordering ability (i.e., all defaulters have higher PDs than all payers) and 0.50 representing a model whose rank-ordering ability is no better than a random coin flip. In the diagram above, we see that the model lies somewhere between these two extremes with a large percentage of defaulters (green P dots) having higher PDs than all of the payers (red N dots). However, towards the middle of the PD range, we see less than perfect rank-ordering with some payers possessing higher PDs than some defaulters. This less than perfect rank-ordering accuracy is typical of most credit scoring models and results in an AUC value that lies somewhere between 0.50 and 1.0 - with values closer to 1.0 having fewer of these inconsistent rankings and those closer to 0.50 having a significant number of such inconsistencies.
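To make the rank-ordering interpretation concrete, here is a minimal sketch that computes the AUC as the share of defaulter-payer pairs in which the defaulter receives the higher PD (ties counted as half). The sample PDs and outcomes are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
def auc(pds, defaulted):
    """Share of (defaulter, payer) pairs ranked correctly by PD; ties count 0.5."""
    bad = [p for p, d in zip(pds, defaulted) if d]        # defaulters' PDs
    good = [p for p, d in zip(pds, defaulted) if not d]   # payers' PDs
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad) * len(good))


# Hypothetical sample: one payer (PD 0.30) outranks one defaulter (PD 0.25).
pds = [0.02, 0.05, 0.08, 0.30, 0.25, 0.60, 0.90]
defaulted = [0, 0, 0, 0, 1, 1, 1]

base_auc = auc(pds, defaulted)
print(round(base_auc, 3))  # 0.917 (11 of 12 pairs ordered correctly)

# Order-preserving rescalings change every PD level but not the AUC.
halved = [p / 2 for p in pds]
squared = [p ** 2 for p in pds]
print(auc(halved, defaulted) == base_auc, auc(squared, defaulted) == base_auc)  # True True
```

The single mis-ordered pair is what pulls the value below 1.0, while the rescaled PDs show that the metric is blind to the PD levels themselves.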

So, in terms of Figure 1, my Base Model has an AUC accuracy value of 0.843 - a relatively high value and indicative of relatively strong rank-ordering power. The selected LDA Model has an AUC value of 0.814 - a relatively minor difference and a value still indicative of strong credit risk differentiation ability. But let's put a pin in this comparison for now. I have some additional context to provide below that will help us further understand why this AUC change is relatively minor - and why this may be a misleading indicator of how AD impacts credit model predictive performance.

Calibration Accuracy: A Credit Scoring Model's Second Use Case

While the AUC's rank-ordering accuracy metric is certainly important for operationalizing a scorecard-based credit decision strategy, it represents only a partial measure of credit model performance relevant to lenders. What it doesn't capture is the model's accuracy in predicting the lender's expected credit risk exposure levels associated with its credit decisions - such as expected default rates and expected losses on approved loans. Accurate estimates of these metrics are critically necessary for the lender to remain within established credit risk limits, and for evaluating whether subsequent observed loan losses are consistent with original LDA Model estimates.

To understand the difference between rank-ordering accuracy and "calibration accuracy", let's return to Figure 2. Notice that the absolute values of the PDs are not actually relevant to the model's rank-ordering accuracy - only the relative values, such as whether a defaulter's PD is higher or lower than a payer's PD. And as I discussed above, these PDs can be adjusted upward or downward in value - yet have minimal to no impact on their relative positioning or the resulting AUC accuracy metric. For example, if most of the payers have PD values less than 0.10 and most of the defaulters have PD values greater than 0.90, then adjusting some of the defaulters' PDs lower - say, to 0.40 - would still keep them above most of the payers and therefore minimally impact the AUC accuracy metric. However, this rank-order-preserving adjustment to these estimated PDs would significantly underestimate the actual default rate of these high-risk applicants - assigning a 40% expected default rate to applicants who have a much higher actual rate of default.
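The 0.90-to-0.40 example above can be sketched in a few lines. The twelve applicants below are hypothetical, and the AUC helper is a simple pairwise implementation assumed for illustration; the point is only that the AUC is unchanged while the adjusted applicants' predicted default rate falls far below their actual one.

```python
def auc(pds, defaulted):
    """Share of (defaulter, payer) pairs ranked correctly by PD; ties count 0.5."""
    bad = [p for p, d in zip(pds, defaulted) if d]
    good = [p for p, d in zip(pds, defaulted) if not d]
    wins = sum(1.0 if b > g else 0.5 if b == g else 0.0 for b in bad for g in good)
    return wins / (len(bad) * len(good))


# Hypothetical sample: 8 payers with PDs below 0.10, 4 defaulters above 0.90.
defaulted = [0] * 8 + [1] * 4
original = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.91, 0.93, 0.95, 0.97]

# Lower two defaulters' PDs to 0.40; they still sit above every payer.
adjusted = original[:8] + [0.40, 0.40, 0.95, 0.97]

auc_original = auc(original, defaulted)
auc_adjusted = auc(adjusted, defaulted)
print(auc_original, auc_adjusted)  # 1.0 1.0

# Calibration of the two adjusted applicants: predicted vs. actual default rate.
predicted_rate = sum(adjusted[8:10]) / 2   # 0.40
actual_rate = sum(defaulted[8:10]) / 2     # 1.0 in this toy sample
print(predicted_rate, actual_rate)  # 0.4 1.0
```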

And herein lies the problem with algorithmic debiasing's underlying premise.

Figure 3 below displays visually the calibration accuracy of my Base and LDA Models.

Figure 3: Comparison of Calibration Accuracy Curves - Base vs LDA Models

The horizontal axis measures the applicants' average expected default rate (PD) and the vertical axis measures their average actual default rates. The thin black diagonal line is a reference line indicating equality (or perfect calibration) between expected and actual default rates. The green line represents the Base Model's calibration accuracy for groups of applicants with successively higher estimated PDs, and the red line represents the same for the LDA Model. If these models were perfectly calibrated, these lines would lie on top of the 45 degree black reference line - indicating that all PDs were equal to actual default rates.

Before focusing on local regions of these two "calibration curves", I want to point out that both models - overall - have zero calibration error. That is, across all sample applicants, the average expected PD of 7.86% is exactly equal to the average actual default rate of 7.86% thereby yielding 0 overall calibration error in both models. However, I further note that such perfect calibration does not hold universally across credit risk sub-segments - which is evidenced by the fact that neither the green nor the red calibration curves lie exactly on the black 45 degree line. In fact, over some sub-segments, each model underestimates actual default rates while over other sub-segments each model overestimates actual default rates.[6] Overall, however, these underestimations offset the overestimations to yield zero aggregate model error.
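The offsetting behavior described above can be illustrated with a small, hypothetical sample. The sketch below builds a binned calibration table (mean PD vs. actual default rate per bin); the equal-sized binning scheme and the data are assumptions made for illustration, not the method used to produce Figure 3.

```python
def calibration_table(pds, defaulted, n_bins):
    """Sort by PD, split into equal-sized bins, return (mean PD, actual rate) per bin."""
    ranked = sorted(zip(pds, defaulted))
    size = len(ranked) // n_bins
    rows = []
    for i in range(n_bins):
        start = i * size
        end = (i + 1) * size if i < n_bins - 1 else len(ranked)  # last bin takes the remainder
        chunk = ranked[start:end]
        mean_pd = sum(p for p, _ in chunk) / len(chunk)
        actual = sum(d for _, d in chunk) / len(chunk)
        rows.append((mean_pd, actual))
    return rows


# Hypothetical sample: a low-PD segment that defaults more than predicted
# and a high-PD segment that defaults less than predicted.
pds = [0.10] * 10 + [0.50] * 10
defaulted = [1, 1] + [0] * 8 + [1] * 4 + [0] * 6

overall_error = sum(pds) / len(pds) - sum(defaulted) / len(defaulted)
print(round(overall_error, 6))  # 0.0 -> zero aggregate calibration error

rows = calibration_table(pds, defaulted, n_bins=2)
for mean_pd, actual in rows:
    print(round(mean_pd, 2), round(actual, 2))
# 0.1 0.2  -> underestimates actual defaults
# 0.5 0.4  -> overestimates actual defaults
```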

So, if overall model calibration error is zero, what is the problem?

From a credit risk management perspective, lenders need to have an accurate estimate of expected default rates and losses for the loans they plan to originate. Overall model calibration error - while relevant for model validation purposes - is an insufficient accuracy metric for this particular model use. What is also needed is an accurate default rate estimate on the subset of applications the lender intends to approve with its credit decision strategy. To delineate these subsets, I have included two additional reference lines in Figure 3. Specifically, the green dashed line represents the PD cutoff for the Base Model that yields a 90% approval rate, while the red dashed line represents the PD cutoff for the LDA model that also yields a 90% approval rate - with the 90% approval rate assumption retained from my original Fool's Gold article.
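As a rough sketch of how such an approval subset might be delineated, the snippet below ranks applicants by PD, approves the lowest-PD share needed to reach a target approval rate, and compares expected and actual default rates on those approvals only. The function name, the rounding rule for the number of approvals, and the ten-applicant sample are hypothetical.

```python
def approved_calibration(pds, defaulted, approval_rate):
    """Approve the lowest-PD applicants up to `approval_rate`; return
    (PD cutoff, expected default rate, actual default rate) for the approvals."""
    ranked = sorted(zip(pds, defaulted))
    n_approved = int(round(approval_rate * len(ranked)))
    approved = ranked[:n_approved]
    cutoff = approved[-1][0]                                 # highest approved PD
    expected = sum(p for p, _ in approved) / n_approved      # mean estimated PD
    actual = sum(d for _, d in approved) / n_approved        # observed default rate
    return cutoff, expected, actual


# Hypothetical applicants; the riskiest one is denied at a 90% approval rate.
pds = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10, 0.60]
defaulted = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

cutoff, expected, actual = approved_calibration(pds, defaulted, approval_rate=0.90)
print(cutoff, round(expected, 4), round(actual, 4))  # 0.1 0.0456 0.1111
```

Running the same calculation separately for each model, each with its own cutoff, gives the approval-region comparison that an aggregate calibration error would hide.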

When we narrow our focus to just these subsets, we see that the Base Model actually has relatively good local calibration accuracy as the green calibration curve lies almost completely on top of the 45 degree reference line in this region. In fact, the expected default rate for these approvals is 3.50% while the actual default rate is 3.42% - yielding a calibration error of only 0.08 percentage points (or 2.3% on a relative basis).

On the other hand, for the LDA Model and its corresponding subset of approvals, we see something very different. The red calibration curve shows extreme departures from the 45 degree reference line in this local region - with significant underestimates of default rates followed by material overestimates. In fact, unlike the Base Model, the LDA Model possesses significant "non-monotonicities" in its calibration curve in the approval region in which higher-defaulting applicants are assigned lower PDs than lower-defaulting applicants.[7] Such localized inconsistencies in rank-ordering are what drive the LDA Model's somewhat lower AUC metric; however, since the overall AUC metric is calculated over the entire PD range (i.e., within and outside of the approval region), such inconsistencies have a muted effect on the overall AUC metric (i.e., the Base Model's rank-ordering is still largely preserved by the LDA Model at PDs above the approval threshold even though the absolute PD values have changed).
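One simple way to screen a calibration curve for such non-monotonicities is to walk through its bins in order of estimated PD and flag any step where the actual default rate falls. The sketch below assumes the curve is available as (mean PD, actual default rate) pairs; the values shown are hypothetical and are not read from Figure 3.

```python
def non_monotonic_bins(curve):
    """Return index pairs (i, i + 1), in PD order, where the actual
    default rate decreases even though the mean estimated PD increases."""
    ordered = sorted(curve)  # sort bins by mean estimated PD
    return [
        (i, i + 1)
        for i in range(len(ordered) - 1)
        if ordered[i + 1][1] < ordered[i][1]
    ]


# Hypothetical calibration curve: (mean PD, actual default rate) per bin.
curve = [(0.01, 0.010), (0.02, 0.045), (0.03, 0.025), (0.05, 0.050), (0.10, 0.110)]

print(non_monotonic_bins(curve))  # [(1, 2)]
```

Here the bin with a 2% estimated PD defaults more often than the bin with a 3% estimated PD, which is the kind of local inversion the footnote on monotonicity describes.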

As discussed in my original Fool's Gold article, the LDA Model miscalibrations associated with these local PD inconsistencies are not surprising when you consider the brute force manner in which algorithmic debiasing achieves its goal when fairness is measured by the AIR. That is, when fairness is measured by approval rate equality - regardless of underlying credit qualifications - an automated process to improve such fairness will simply lower the estimated PDs of credit profiles that are correlated with specific demographic groups to push those approval rates closer to those of the control group. In the example from my original Fool's Gold article, the debiasing algorithm primarily achieved greater fairness by significantly lowering estimated PDs on applicants with CLTVs > 95% and raising estimated PDs on applicants with CLTVs between 80% and 85% - as illustrated in the chart below reproduced from that article (Figure 10a).

Figure 10a (from Fool's Gold? Assessing the Case For Algorithmic Debiasing): LDA Credit Model - Impact of Fairness Regularization

While this lowering of estimated PDs for the former group, and the raising of estimated PDs on the latter group, did not materially change the rank-ordering of actual defaults and non-defaults in the overall sample (as evidenced by the relatively small AUC metric change from 0.843 to 0.814 - a 3.4% relative reduction) or the model's overall calibration error, it did materially change the calibration of these applicants' estimated default rate levels to their actual default performance. Specifically, the expected default rate of LDA Model approvals is now estimated at 3.65% (vs. 3.50% in the Base Model); however, the actual default rate of these approvals is a much higher 4.57% - yielding a calibration error of 0.92 percentage points (nearly 12x the size of the Base Model's calibration error, and a 20% relative underestimation of actual default rates).
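For readers who want to retrace the arithmetic, the short snippet below recomputes these comparisons from the figures quoted in this post. The sign convention (expected minus actual, so that a negative value means underestimation) is an assumption adopted for the illustration.

```python
# Default rates on approvals, in percent, as quoted in the text.
base_expected, base_actual = 3.50, 3.42
lda_expected, lda_actual = 3.65, 4.57

# Calibration error = expected - actual (negative means underestimation).
base_error = base_expected - base_actual      # about +0.08 percentage points
lda_error = lda_expected - lda_actual         # about -0.92 percentage points

error_ratio = abs(lda_error) / abs(base_error)   # about 11.5x
base_relative = base_error / base_actual         # about +2.3%
lda_relative = lda_error / lda_actual            # about -20.1%

# Relative change in rank-ordering accuracy (AUC), for comparison.
auc_relative = (0.814 - 0.843) / 0.843           # about -3.4%

print(round(base_error, 2), round(lda_error, 2), round(error_ratio, 1))
print(round(base_relative, 3), round(lda_relative, 3), round(auc_relative, 3))
```

On these figures the Base Model slightly overestimates defaults on its approvals, while the LDA Model underestimates them by roughly a fifth.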

So, when viewed from a calibration accuracy perspective, the LDA Model underestimates actual default rates on approved loan applications by roughly 20% - a far larger and more troubling deterioration relative to the Base Model than the 3.4% reduction in rank-ordering accuracy as measured by the AUC.

In fact, if we recast Figure 9 from my original Fool's Gold article to consider calibration accuracy as the primary model accuracy metric, we would obtain Figure 4 below.

Figure 4: Calibration Accuracy-Fairness Trade-off

Here we see a very different trade-off between fairness and accuracy. First, we see that algorithmic debiasing always introduces negative calibration error (i.e., causing the underestimation of actual default rates) when AIR is used as the fairness metric. This occurs in my credit model because improved approval rate equity can only be achieved by swapping into the approval region higher risk credit profiles previously denied under the Base Model (e.g., CLTVs > 95%) in exchange for swapping out (and denying) lower risk credit profiles (e.g., CLTVs between 80% and 85%) that were previously approved under the Base Model. Because of the relative demographic compositions of these credit profile "swap sets", LDA-based approval rates for Black applicants rise relatively significantly while those for Whites actually drop slightly.
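The swap mechanics described above can be sketched as follows, assuming each applicant has a PD under both models and is approved when that PD falls below the relevant cutoff. The applicant labels, PDs, and cutoffs are hypothetical.

```python
def swap_sets(applicants, base_cutoff, lda_cutoff):
    """Split applicants whose decision changes between models.
    `applicants` is a list of (label, base_pd, lda_pd); approve when PD < cutoff."""
    swap_in, swap_out = [], []
    for label, base_pd, lda_pd in applicants:
        base_approved = base_pd < base_cutoff
        lda_approved = lda_pd < lda_cutoff
        if lda_approved and not base_approved:
            swap_in.append(label)    # denied by Base Model, approved by LDA Model
        elif base_approved and not lda_approved:
            swap_out.append(label)   # approved by Base Model, denied by LDA Model
    return swap_in, swap_out


# Hypothetical applicants: (label, Base Model PD, LDA Model PD).
applicants = [
    ("A", 0.03, 0.03),   # approved under both models
    ("B", 0.06, 0.12),   # PD raised by debiasing -> swapped out
    ("C", 0.15, 0.07),   # PD lowered by debiasing -> swapped in
    ("D", 0.30, 0.28),   # denied under both models
]

swap_in, swap_out = swap_sets(applicants, base_cutoff=0.10, lda_cutoff=0.10)
print(swap_in, swap_out)  # ['C'] ['B']
```

Comparing the actual default rates and demographic composition of the two lists is one way to examine what the fairness improvement is built on.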

At this point, I would expect some to argue that these miscalibrations are a natural part of scorecard models that are easily addressable through a second-stage default-odds "re-calibration" process in which the estimated PDs or scores are re-aligned with their true default rates via a logistic regression model (or other estimation process). However, I note that such a re-calibration process would not appear to solve the issue here since:

The miscalibrations are intentionally non-monotonic in order to improve relative approval rates, and undoing this non-monotonicity through traditional scorecard re-calibration would either undo the debiasing or yield much higher estimated default rates on the LDA "approvals". In the latter case, this would require a re-calibration of the PD approval threshold to maintain compliance with credit risk limits - thereby compromising the fairness improvement; and

Traditional scorecard re-calibration approaches are not really meant to address non-monotonicity issues - which are considered structural model issues typically requiring model re-development.
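To illustrate the second point, the sketch below assumes the common logistic form of re-calibration, in which the re-calibrated PD is a logistic function of the original PD's log-odds with a positive slope. The intercept and slope are hypothetical stand-ins for fitted coefficients, and the bin values are illustrative. Because this mapping is strictly increasing, it shifts PD levels but leaves the ordering of the bins, and hence any non-monotonicity, in place.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(pd, intercept, slope):
    """Logistic re-calibration of a PD on the log-odds scale (assumed form)."""
    return sigmoid(intercept + slope * logit(pd))


# Hypothetical bins: (mean estimated PD, actual default rate).
# The second bin has a higher PD than the first but a lower actual default rate.
bins = [(0.02, 0.045), (0.03, 0.025), (0.05, 0.050)]

intercept, slope = 0.3, 1.1   # hypothetical coefficients with slope > 0
recalibrated = [(recalibrate(pd, intercept, slope), actual) for pd, actual in bins]

# The PD ordering of the bins is unchanged by the re-calibration ...
new_pds = [pd for pd, _ in recalibrated]
print(new_pds == sorted(new_pds))  # True

# ... so the inversion between the first two bins is still there.
still_inverted = recalibrated[1][0] > recalibrated[0][0] and recalibrated[1][1] < recalibrated[0][1]
print(still_inverted)  # True
```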

So what does this all mean?

When the AUC is used as the sole model accuracy metric, algorithmic debiasing can act in counterintuitive ways that may create heightened risks and challenges for the lender - specifically:

Solely focusing on the credit model's rank-ordering accuracy via the AUC metric provides a misleading signal of how the model's overall accuracy for credit risk management purposes is impacted by the algorithmic debiasing process. In my example, while the LDA Model seemed to incur a relatively small cost in rank-order "accuracy" to achieve a moderate improvement in fairness, it actually suffered a material loss of calibration accuracy, as evidenced by a roughly 20% underprediction of approved loan default rates.

Risk professionals should do independent due diligence on the LDA Model's calibration accuracy - focusing particularly on the local PD region associated with the lender's planned approvals. When performing this due diligence, keep in mind that some AD proponents might suggest that the relatively high actual default rates of some LDA Model loan approvals (i.e., those helping to improve fairness) are actually caused by the very historical discrimination that the LDA Model is designed to address. That is, these default rates are due to past discriminatory actions and, therefore, are not really indicative of the applicants' true underlying credit risk. While this is certainly possible, appropriate empirical analysis and validation of this hypothesis would be prudent.

If anything, model users should only rely on the LDA Model's rank-order accuracy metrics and place no reliance on unvalidated accuracy metrics or other model outputs that are derived from the absolute PD estimates. As shown above, these absolute PD estimates are intentionally distorted by the algorithmic debiasing process to improve relative approval rates under the AIR fairness metric. But this also means that the LDA Model may no longer serve both of the lender's credit scoring model use cases.

Are these results guaranteed for all credit scoring models subject to algorithmic debiasing?

Unfortunately, we do not know. And, yet, we should if these LDA Models are being deployed into production.

As I mentioned at the beginning of this post, there is a surprising lack of transparency into the popular AD processes being promoted to consumer lenders. While the claimed benefits are very compelling, there is still a fair amount we don't yet know about the unintended collateral impacts these processes may have on a lender's credit and compliance risk management objectives. Yes, my analyses are based on a simple credit scoring model that may not be representative of the more complex models actually being used. However, if even such a simple model as mine suggests some very concerning risk issues, doesn't this make it even more important that further research is performed and more transparency is provided? And if such research shows that the risks are much less than what is stated here - all the better. However, my experience suggests that there is rarely a free lunch and, accordingly, lenders should do their due diligence to determine whether their specific LDA Models may inadvertently be fool's gold.

Teaser Trailer - Fool's Gold 3 - Do LDA Credit Models Really Improve Fairness?

Algorithmic debiasing is most typically associated with the mitigation of disparate impact - that is, the effect of facially-neutral credit policies or practices that adversely impact one or more protected class groups and whose objectives - even if sufficiently justified by business necessity - could be achieved in a less discriminatory manner. In the context of credit scoring, many people interpret this to mean that unnecessary barriers to credit access - particularly those that disproportionately impact underserved groups - should be identified and reduced / eliminated to rightfully increase credit access to otherwise qualified individuals.

But is this what LDA Models are actually doing?

Stay tuned.

* * *

ENDNOTES:

[1] “Rigorous searches for less discriminatory alternatives are a critical component of fair lending compliance management,” Ficklin said during a panel discussion at the National Community Reinvestment Coalition’s (NCRC) annual conference. “I worry that firms may sometimes shortchange this key component of testing. Testing and updating models regularly to reflect less discriminatory alternatives is critical to ensure that models are fair lending compliant.” CFPB Puts Lenders & FinTechs On Notice: Their Models Must Search For Less Discriminatory Alternatives Or Face Fair Lending Non-Compliance Risk, National Community Reinvestment Coalition, April 5, 2023.

[2] The Adverse Impact Ratio ("AIR") fairness metric applies to credit decision outcomes and is calculated by dividing the protected class group's loan approval rate by the corresponding control group's loan approval rate. AIR values less than one indicate that the protected class group's approval rate is less than that of the control group, and vice versa. For loan pricing outcomes, a similar outcomes-based fairness metric - the Standardized Mean Difference - is used which, in many cases, is simply a normalized difference in average interest rates between the protected class and control groups.

[3] Some AD proponents reframe this trade-off as a profitability-fairness trade-off - with the implication that certain changes in model accuracy may not matter as much when expressed in terms of profitability impacts. In my opinion, significant caution should be exercised here as unexpectedly high consumer default rates that are largely offset by collateral or other recoveries (and, therefore, have little adverse impact on profits) may expose the lender to claims of asset-based lending and/or predation. See, for example, "CFPB and New York Attorney General Sue Credit Acceptance for Hiding Auto Loan Costs, Setting Borrowers Up to Fail".

[4] While some AIR proponents may dispute this statement, the facts are that:

The AIR is frequently described as an unconditional measure of potential disparate impact; that is, I have seen no attempts by AIR proponents to use a conditional AIR metric that exempts or controls for the effects of certain legitimate credit risk attributes (some of which may be required by federal law or regulation) on loan approval rate disparities across demographic groups. Without such conditional AIR measures, approval rate disparities driven by clearly unqualified applicants may falsely indicate the potential presence of disparate impact - and, as I will discuss in more detail in Part 2 - may result in the creation and deployment of an LDA Model that exposes the lender to certain legal / compliance and reputational risks.

While certain "rule of thumb" thresholds have been proposed by proponents to delineate "problematic" AIR values (such as 0.8 or 0.9), these thresholds have no basis in any empirical analysis related to legitimate differences in underlying applicant credit qualifications.

The algorithms typically used to debias credit models operate in an automated manner across the dozens, hundreds, or even thousands of model attributes contained in such models - without consideration as to whether an attribute is a legitimate underwriting factor or not (such as debt-to-income ratio, loan-to-value ratio, etc.).

While it is true that certain algorithmic debiasing tools may permit the user to "exempt" certain model attributes from the tool's LDA modifications (e.g., forcing the algorithm to keep the attribute in the final model specification, or prohibiting the algorithm from changing the weight of the attribute), to my knowledge such "controls" do not also convert the unconditional AIR fairness metric into a conditional metric by removing the differential effect of such attributes from the AIR's approval rates. For example, if the unconditional AIR value is 0.75, but the debt-to-income ratio drives a significant portion of the approval rate differences between the demographic groups, then a conditional AIR value that controls for this legitimate approval rate difference may, in fact, be 0.88 - indicating a much greater degree of fairness than the unconditional AIR value.

[5] See, for example:

"Explainability & Fairness in Machine Learning for Credit Underwriting: Policy & Empirical Findings Overview," FinRegLab, July 2023.

"Why The CFPB Should Encourage The Use Of AI In Underwriting," Zest AI, December 18, 2020.

[6] Of course, whether these departures from equality are "statistically significant" depends on sample sizes and underlying variances of the sub-segment's average PDs. I will say, however, that such sample sizes are quite large in the approval regions that I will discuss momentarily.

[7] "Monotonicity" refers to whether a function or curve is consistently increasing or decreasing over its full range. In the context of credit scoring, we would expect a calibration curve to be monotonically increasing over its probability range - that is, actual default rates should increase along with estimated default rates. This logical relationship is indicative of a credit scoring model with good predictive performance. "Non-monotonicities" are regions of the calibration curve that are illogical - that is where higher estimated PDs are associated with lower actual default rates, or vice versa. Typically, non-monotonicities in credit scoring models are an undesirable feature that developers work hard to avoid through rigorous data collection, feature selection, and model specification refinement.