Fool's Gold? Assessing the Case For Algorithmic Debiasing

Richard Pace
Jan 27, 2023

Updated: Mar 31

Can we really achieve increased credit model fairness with minimal trade-offs?

Author's Note: This is a fairly lengthy blog post due to the analysis needed to understand how algorithmic debiasing actually works within a credit model context. For those readers less familiar with the topic, I would recommend starting at the beginning as this post is designed to walk readers through the algorithmic debiasing process at a level of detail likely not previously encountered. Alternatively, for those readers comfortable with the topic and interested in moving directly to my conclusions, you may wish to start reading here.

Over the last few years, an increasing chorus of voices has emerged decrying the state of U.S. consumer lending as discriminatory and unfair. Some readers may pause at this statement, thinking "that's not really new - fair lending has been a top civil rights issue for a long while" as they reflect upon the long list of public fair lending enforcement actions brought against U.S. lenders over the last several decades. And they would be right.

However, the concerned voices that dominate today's headlines, fintech webinars, and industry conferences are expressing something different. Rather than focusing on the pernicious discrimination arising from insufficiently governed discretionary decision-making - such as underwriting exceptions, redlining, or variable loan pricing - this new breed of fair lending criticism is aimed at a more fundamental component of the consumer lending process - the credit score. And this is new.

Historically, credit scores - such as FICO - received only cursory fair lending attention due to their grounding in objective credit bureau data, their powerful evidence of predictive accuracy, and their central role in modern consumer lending. Even for lenders who constructed their own proprietary credit scores to automate consumer credit decisioning, the fair lending risks of these tools were considered well known, there was formal regulatory guidance around these risks, these tools were well-integrated into bank fair lending compliance programs, and they were subject to regular oversight and audit from Corporate Compliance, Model Risk Management, Internal Audit, and bank supervisors.

So what changed? Why are there now fair lending concerns about these credit scores?

Simple. Credit scores (and consumer lending, generally) became the next big business ripe for disruption by the growing commercialization of AI/ML algorithms, the relatively easy availability of "alternative data", and an increasing social focus on financial access and inclusion. This convergence created an opening for fintech entrepreneurs to challenge FICO's reigning leadership position via a new breed of AI-based credit scores that they purport are more accurate, more inclusive, and - importantly - fairer. And with well-funded marketing campaigns, these challengers are quickly making the staid credit scoring models of yesteryear technically and socially obsolete - with claims of their inherent bias quickly becoming one of the top compliance risks most lenders didn't know they had.

Their alleged defect? They produce "disparate impact" across demographic groups - measured by the new "it" metric called the Adverse Impact Ratio ("AIR") - a buzzy new measure of credit fairness that some might think was designed to sell a solution to those who didn't know they had a problem. That's because, according to the AIR, a credit scoring model is now considered biased if its corresponding loan approval rates are not sufficiently equal across demographic groups - regardless of any underlying differences in the groups' credit qualifications - a stretching of consumer lending disparate impact theory that goes beyond that of even the most aggressive regulator. And - no surprise - this is the outcome of most traditional credit scoring models.

Thankfully, the fintech entrepreneurs also have a relatively easy solution to this new form of credit model bias - an AI-based process called "algorithmic debiasing" that uses a black box algorithm to "debias" both black box and traditional credit scoring models. And such debiasing can generally occur with relatively little sacrifice to the model's underlying predictive performance - thereby making the process a de facto requirement under their interpretation of disparate impact law. And, let me be clear, developing and promoting fairer credit scoring models is an admirable goal that all lenders should support.

Great. Sounds like a win-win situation. Is there an issue?

Potentially. While there has been much discussion and promotion of these new credit models and debiasing processes among the conference circuit and webinar crowds, the simple truth is that we don't yet know very much about how algorithmic debiasing actually works in practice. How exactly does it achieve the improved fairness and expanded financial inclusion? How specifically is the predictive integrity of the credit scoring models affected? And are any notable risks created that may mitigate or even outweigh the promoted benefits?

It's unfortunate that such important information has not yet been widely shared by the promoters - perhaps because this information can be quite difficult to glean from AI-based credit scoring models containing hundreds (or thousands) of predictive variables within a complex non-linear model architecture. Nevertheless, such information is sorely needed by lenders to make sound compliance risk management - and safety-and-soundness - decisions in the current environment. Which now brings me to the focus of this blog post.

In what follows, I explore these questions using a simplified credit scoring model that helps to illustrate precisely what happens during the most common type of algorithmic debiasing process, and permits an analysis of how the fairness-driven model adjustments impact key credit scoring model performance properties. What I find from this research is that all that glitters may not be gold, and - accordingly - lenders need to be very cautious in their application of current algorithmic debiasing methods to their credit models. And our quest for improved fairness and inclusion from AI-based consumer credit models may require further multi-disciplinary research from technologists, compliance professionals, credit risk managers, and model risk specialists.

Let's dive in.

A Simple Credit Scoring Model

I begin with a simple credit scoring model built using loan-level residential mortgage loan application data from the 2019 Home Mortgage Disclosure Act ("HMDA") public file. Those familiar with HMDA data might be surprised at this choice as this data does not reflect loan performance but, rather, the lenders' initial underwriting decisions. And you would be right.

However, for the purposes of this analysis, I simply need a synthetic credit performance dataset with a "good" and "bad" credit outcome, and the HMDA dataset meets this requirement if we assume that loan approval broadly correlates with "good" credit performance (i.e., non-default) and loan denial broadly correlates with "bad" credit performance (i.e., default). While clearly an imperfect assumption (some approvals will default, and some denials would not default), it's ultimately a moot point for this analysis as I am simply seeking a synthetic dataset that: (1) differentiates between a "good" and "bad" credit outcome, (2) possesses a set of predictive variables related to these credit outcomes, and (3) identifies each applicant's corresponding racial group. The HMDA data satisfy all three of these requirements.[1]

Figures 1 and 2 below summarize the main properties of this synthetic credit performance dataset. As detailed in Figure 1, the sample comprises over 450,000 records - 5.8% of which are Black and the remaining 94.2% of which are White. The Black "default rate" is 18.4% and the White "default rate" is 7.2% - with an overall sample "default rate" of 7.9%.

Figure 1: Synthetic Credit Performance Dataset Record Counts

Figure 2: Synthetic Credit Performance Dataset "Default Rates"

To construct the simple credit scoring model, I selected the three main credit risk attributes available in the public HMDA data - the applicant's debt-to-income ("DTI") ratio, the application's combined loan-to-value ("CLTV") ratio, and the requested loan amount - to help differentiate statistically the "good" and "bad" credit outcomes. Both the CLTV and loan amount variables are provided in continuous formats, while the DTI variable is provided in an ordered categorical format.

To capture the well-known non-linear relationships between these credit risk attributes (particularly, CLTV and DTI) and the likelihood of default, I pre-processed all three variables into a set of "binned" dummy variables in which each individual variable represents a specific range of the credit risk attribute - thereby providing the estimation algorithm the flexibility to assign different predictive weights to different ranges of each attribute. This yielded a set of 26 predictive credit risk attributes (excluding the constant term and the three reference attributes) whose coefficient values were estimated via standard logistic regression. I specifically selected this model architecture for its analytical tractability - which is important to answer the questions posed in the Introduction, and because it has long been the work-horse of traditional credit scoring models. While simple in structure, we'll see that it is well-suited to provide us with the analytical insights I seek.
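To make the "binned" dummy-variable idea concrete, here is a minimal sketch in plain Python. The bin edges and the helper name `bin_dummies` are hypothetical (the exact cut points behind the 26 attributes are not listed here), so treat this as an illustration of the technique rather than the actual pre-processing.

```python
from bisect import bisect_right

# Hypothetical CLTV bin edges (in percent); assumed for illustration only.
CLTV_EDGES = [60, 70, 80, 85, 90, 95]

def bin_dummies(value, edges, reference_bin=0):
    """Return 0/1 indicators for every bin except the reference bin.

    Bins are left-closed: [edge_i, edge_{i+1}). The reference bin is
    dropped so the dummies are not collinear with the constant term.
    """
    idx = bisect_right(edges, value)   # bin index in 0 .. len(edges)
    n_bins = len(edges) + 1
    return [1 if b == idx else 0 for b in range(n_bins) if b != reference_bin]

# A CLTV of 82 falls in the [80, 85) bin; a CLTV of 50 falls in the
# reference bin, so all of its dummies are zero.
print(bin_dummies(82, CLTV_EDGES))
print(bin_dummies(50, CLTV_EDGES))
```

Each dummy then receives its own logistic regression coefficient, which is what lets the model assign different weights to different ranges of the same attribute.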

Figure 3 below presents this "Base Model" along with the estimated odds ratios for each predictive attribute.[2]

Figures 4a-c below illustrate the Base Model's estimated predictive relationships graphically - confirming that each credit risk attribute has a logical and conceptually-sound estimated relationship with the likelihood of credit default - consistent with what one would obtain with a "real" credit performance dataset. For example, Figure 4a clearly shows that the likelihood of credit default increases steadily as the loan's CLTV increases - with very high CLTVs (i.e., >95%) having significant risk.

Now, let's assume that we use the Base Model for automated underwriting - targeting an expected default rate of 3.5% (set by credit policy). Assuming that the model training data is representative of future applications, we can find the appropriate probability of default ("PD") cut-off point by sorting the training data from lowest to highest estimated PD and finding the probability level at which all loan applications below this level (which would be approved) collectively yield a 3.5% expected default rate. As shown in Figure 5 below, this "PD Approval Threshold" is equal to 9.42% and yields an expected overall application approval rate of 89.7%.[3]
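The cut-off search just described can be sketched as follows. This is a simplified illustration on made-up PDs, not the code used for Figure 5; the function name and the toy inputs are assumptions, and ties at the cut-off are ignored for brevity.

```python
def pd_approval_threshold(pds, target_rate):
    """Find the highest PD cut-off whose approved pool (all applications at or
    below the cut-off) has an average estimated PD no greater than target_rate.

    Returns (threshold, approval_rate). Because PDs are sorted ascending, the
    running average can only rise, so we can stop at the first breach.
    """
    ordered = sorted(pds)
    running_sum = 0.0
    threshold, approved = None, 0
    for k, pd_value in enumerate(ordered, start=1):
        running_sum += pd_value
        if running_sum / k > target_rate:
            break
        threshold, approved = pd_value, k
    approval_rate = approved / len(ordered) if ordered else 0.0
    return threshold, approval_rate

# Toy PDs (hypothetical) with the 3.5% target from the text.
toy_pds = [0.01, 0.02, 0.03, 0.04, 0.10, 0.30]
threshold, approval_rate = pd_approval_threshold(toy_pds, 0.035)
print(threshold, approval_rate)
```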

From an approval rate fairness perspective, however, I note that these approvals are not equally distributed between the Black and White applicants. In particular, at the 9.42% PD Approval Threshold, 90.5% of White applicants are approved while only 78.3% of Black applicants are approved - yielding an Adverse Impact Ratio ("AIR") of 0.865 (=0.783/0.905) or 86.5%.
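For readers who want the arithmetic spelled out, the AIR is simply the protected group's approval rate divided by the control group's approval rate. The function name below is my own; the two rates are the ones reported above.

```python
def adverse_impact_ratio(protected_rate, control_rate):
    """Approval rate of the protected group divided by that of the control group."""
    if control_rate <= 0:
        raise ValueError("control group approval rate must be positive")
    return protected_rate / control_rate

# Approval rates at the 9.42% PD Approval Threshold, as reported in the text.
air = adverse_impact_ratio(0.783, 0.905)
print(round(air, 3))  # 0.865
```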

What is the driver of this approval rate disparity?

Figure 6 below compares the distributions of the Base Model's estimated PDs for Black and White applicants. From this chart, we can see that Black applicants' PDs (in blue) are skewed somewhat to the right of the White applicants' PDs (in yellow) - leading to a relatively smaller percentage of Black applicants that fall below the Base Model's PD Approval Threshold.

But why are the Black applicants' PDs skewed toward higher values?

Figures 7a-c provide an answer to this question by comparing the credit risk profiles of the two demographic groups across the three attributes used in the Base Model.

Logically and empirically, the observed credit risk profile differences drive the estimated PD differences which drive the estimated approval rate differences. Traditionally, these facts - along with the conceptual and technical soundness of these estimated relationships and the objective credit risk behaviors on which they are based - would be used by Compliance Officers to support the lender's "business necessity" defense under HUD's 2013 disparate impact burden shifting framework or the "legitimate business need" defense under Regulation B's effects test.[4]

However, proponents of algorithmic debiasing have re-interpreted the disparate impact burden-shifting framework - suggesting that there is, or should be, a regulatory expectation that lenders proactively search for potentially less discriminatory alternative ("LDA") models when the AIR-based evidence indicates the presence of potential disparate impact [5] - regardless of any "business necessity" or "legitimate business need" claims.

What approval rate disparity level is considered indicative of potential disparate impact?

At first, debiasing proponents leveraged the Equal Employment Opportunity Commission's ("EEOC's") four-fifths rule of thumb as the threshold at which actionable bias was present. That is, if a credit decisioning model's AIR value was 80% or less, then a potential disparate impact would be deemed present - thereby requiring a pre-emptive search for LDA models. However, that 80% trigger point now appears to be in flux - with some recent publications suggesting an even smaller disparity:

a "more conservative version of the [EEOC's] 'four-fifths' rule of thumb" is appropriate "when dealing with a fully automated model where the model is relatively easy to validate and the effects of model inputs on the model outcome can be defined and adjusted with some precision." In such cases, a 90% trigger point is recommended versus "the more forgiving 80% 'four-fifths' threshold sometimes used in employment."[6]

Consistent with the above positions, and for the purposes of this assessment, I assume that the Base Model's 0.865 (86.5%) AIR value (Figure 5) would trigger a search for less discriminatory alternative models using algorithmic debiasing regardless of the business necessity evidence presented above.
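As a quick illustration of why the choice of trigger point matters, the sketch below checks the Base Model's AIR against both thresholds. Treating an AIR strictly below the trigger as the condition is an assumption on my part; how a value exactly at the boundary is handled may differ in practice.

```python
def triggers_lda_search(air, trigger):
    # Assumption: an AIR strictly below the trigger point prompts an LDA search.
    return air < trigger

base_air = 0.865  # Base Model AIR from Figure 5

print(triggers_lda_search(base_air, 0.80))  # four-fifths rule of thumb
print(triggers_lda_search(base_air, 0.90))  # more conservative 90% trigger
```

Under the 80% rule of thumb the Base Model would not be flagged, while under the 90% trigger it would be.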

Let's see how this specifically works and what it produces.

Algorithmic Debiasing: The Search For Less Discriminatory Alternative Models

While there are a number of different outcomes-based algorithmic debiasing approaches, I will focus here on what I consider to be the simplest one - fairness regularization - in order to maximize the intuition and explainability of how debiasing achieves its fairness-enhancing effects.[7]

So what is fairness regularization?

Intuitively, a machine learning model - such as logistic regression - is designed to find an optimal set of model weights such that the weighted combination of the model's inputs (in our example, CLTV, DTI, and Loan Amount) yields default rate predictions that have the least amount of predictive error on a given data sample.

However, as we saw in the prior section, such a model - while being optimal from a predictive error perspective - may, according to the AIR fairness metric, produce outcomes that are not considered equitable across certain demographic groups. Therefore, to find less discriminatory alternative models, fairness regularization modifies the model's training objective - as shown below - to consider both of these desired properties (i.e., predictive error and fairness).
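One way to write such an objective, offered as a sketch consistent with the description that follows rather than as the exact formula used, is given below. Here θ denotes the model weights, λ the relative weight on fairness, and the two function names are placeholders for whichever error and fairness measures are chosen.

```latex
\theta^{*} \;=\; \arg\min_{\theta} \Big[\, \mathrm{PredictionError}(\theta) \;+\; \lambda \cdot \mathrm{FairnessPenalty}(\theta) \,\Big],
\qquad \lambda \ge 0
```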

More specifically, under fairness regularization, the model's training objective is modified to include an additional term that I refer to as a "fairness penalty". That is, rather than solely searching for the specific set of model weights, 𝞱, that minimizes the model's prediction error, the algorithm now searches for the set of model weights, 𝞱*, that minimizes the weighted sum of the model's prediction error and a model fairness penalty - with the hyperparameter λ capturing the relative weight assigned to the fairness penalty. For example, setting λ to 0 yields an objective function that only considers prediction error (like the Base Model). However, as the λ value increases, the fairness penalty increases in relative importance. As we will see below, these two objectives generally compete against each other; that is, if we increase the weight of the fairness penalty, we typically will get more predictive error, and vice versa. The search for LDA models then proceeds by optimizing the dual model training objective above over a range of relative fairness weights λ - with each λ value generating a different LDA model.

Theoretically, the fairness penalty can be measured in any number of ways. However, for my analysis, I focus on an outcomes-based fairness penalty based on the popular AIR fairness metric typically used by credit model debiasing proponents. More specifically, my fairness penalty is defined as the inverse AIR value associated with a 90% overall approval rate. As Figure 8 illustrates, this fairness penalty becomes increasingly severe the lower the AIR value and converges to a minimum value as approval rates tend to equality.[8]
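A minimal sketch of an inverse-AIR penalty, assuming "inverse" means the reciprocal 1/AIR (any additional scaling or smoothing used to produce Figure 8 is not reflected here):

```python
def fairness_penalty(air):
    """Reciprocal of the AIR: grows as the AIR falls, equals 1.0 at parity."""
    if air <= 0:
        raise ValueError("AIR must be positive")
    return 1.0 / air

# The penalty rises as the AIR falls further below 1.
for air_value in (0.50, 0.865, 0.927, 1.00):
    print(air_value, round(fairness_penalty(air_value), 3))
```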

Figure 8: AI Credit Model Fairness Penalty

Under this algorithmic debiasing approach, for each set of potential model weights that the algorithm considers, 𝞱, it not only calculates the model's overall PD predictions (and, therefore, its overall prediction error), but also: (1) applies the appropriate PD threshold to achieve a 90% overall approval rate, and (2) calculates the associated AIR value using the individual Black and White borrowers' estimated approval rates. It then determines whether that potential set of model weights, 𝞱, yields a minimum combined value of both the model prediction error and the fairness penalty based on the assumed relative weight of the two objectives, λ. If the combined weighted value of the two training objectives can be further reduced by modifying the model weights 𝞱 - that is, there is a trade-off between predictive error and fairness that achieves a lower combined value - then the debiasing algorithm will continue its search. Only when there are no further opportunities to minimize the combined training objective (for the given λ value) will the algorithm stop - yielding its optimal set of LDA model weights, 𝞱*, associated with that λ value.
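The evaluation performed for a single candidate set of weights can be sketched as below. This only scores one candidate; it is not the optimizer itself. The use of average log loss as the prediction error, the function names, and the toy data are all assumptions for illustration, and the sketch assumes both groups are non-empty and that the control group has at least one approval.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combined_objective(theta, X, y, is_protected, lam, approval_rate=0.90):
    """Return (objective, prediction_error, air) for one candidate weight vector."""
    n = len(y)
    pds = [sigmoid(sum(w * x for w, x in zip(theta, row))) for row in X]

    # Prediction error: average log loss (an assumed choice of error measure).
    eps = 1e-12
    error = -sum(
        yi * math.log(max(p, eps)) + (1 - yi) * math.log(max(1.0 - p, eps))
        for yi, p in zip(y, pds)
    ) / n

    # (1) PD cut-off that approves the lowest-PD 90% of applications.
    n_approve = max(1, int(round(approval_rate * n)))
    cutoff = sorted(pds)[n_approve - 1]
    approved = [p <= cutoff for p in pds]

    # (2) AIR from the two groups' approval rates.
    prot = [a for a, g in zip(approved, is_protected) if g]
    ctrl = [a for a, g in zip(approved, is_protected) if not g]
    air = (sum(prot) / len(prot)) / (sum(ctrl) / len(ctrl))

    # Inverse-AIR fairness penalty, weighted by lambda.
    penalty = 1.0 / air if air > 0 else math.inf
    objective = error + (lam * penalty if lam > 0 else 0.0)
    return objective, error, air

# Hypothetical toy data: a constant term plus one risk feature.
X = [[1.0, i / 10] for i in range(1, 11)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
is_protected = [0] * 6 + [1] * 4
theta = [-2.0, 1.0]

for lam in (0.0, 2.0):
    print(lam, combined_objective(theta, X, y, is_protected, lam))
```

Note that the hard approve/deny cut-off makes the AIR a step function of the weights, so a practical implementation would need a search method that copes with that (or a smoothed approximation); this sketch does not address that choice.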

Figure 9 below illustrates the range of less discriminatory alternative models I produced using this fairness regularization approach - with model fairness (as measured by the AIR value) on the horizontal axis and model accuracy (as measured by the Area Under the Curve ("AUC") value) on the vertical axis. I use the AUC measure of predictive accuracy since it is the model performance measure most consistent with how credit scorecards are actually used.[9] Technically, the AUC varies between 0 and 1 - although a value of 0.50 typically represents a model whose rank-ordering power is no better than random chance. An AUC value of 1 represents perfect rank-ordering ability; however, most decent credit scoring models typically have AUC values between 0.80 and 0.90.
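For intuition, the AUC can be computed as the share of (defaulted, non-defaulted) pairs in which the defaulted loan received the higher estimated PD, with ties counted as half. The brute-force sketch below uses that pairwise definition; it is fine for small samples but far too slow for a dataset of this size, where a rank-based routine would be used instead.

```python
def auc(pds, defaults):
    """Pairwise AUC: P(PD of a defaulter > PD of a non-defaulter), ties count 0.5."""
    bad = [p for p, d in zip(pds, defaults) if d == 1]
    good = [p for p, d in zip(pds, defaults) if d == 0]
    if not bad or not good:
        raise ValueError("need at least one default and one non-default")
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad) * len(good))

print(auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # perfect rank-ordering
print(auc([0.5, 0.5], [0, 1]))                  # no better than chance
```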

The Base Model is represented in Figure 9 below by the labeled green dot and the LDAs associated with different λ values are represented by the grey dots. LDA models (points) to the right of the Base Model reflect greater relative fairness penalties, have more significant model coefficient adjustments, and generally suffer from larger sacrifices to predictive accuracy.

Figure 9: Credit Model Accuracy-Fairness Tradeoff

As we can see in this figure, the Base Model produces the highest predictive accuracy with an AUC value of 0.843; however, it also has the lowest AIR value of 0.865 (86.5%). Moving to the right of the Base Model, we see that by initially adding weight to the fairness penalty, we can obtain LDA models with slightly greater fairness at the expense of de minimis decreases in predictive performance. For example, the first two LDA models (to the immediate right of the Base Model) have AIR values of 87.3-87.8% (+0.8% to +1.3% from the Base Model) and AUC values of 0.842 (-0.001 from the Base Model) - at expected approval rates of 90%.

However, to achieve more desired levels of fairness, we must sacrifice more predictive accuracy. For example, the LDA Candidate Model required to increase the AIR value above the 90% trigger point with a minimum decrease in predictive accuracy is highlighted in red. This LDA Candidate Model achieves an AIR value of 92.7% (+6.2% from the Base Model) with an AUC accuracy level of 0.814 (-0.029 from the Base Model). To some, this may represent an acceptable trade-off - obtaining a relatively large increase in equitable outcomes (i.e., loan approval rates) at the expense of a relatively minor decrease in overall accuracy.
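The selection rule just described (among models that clear the 90% AIR trigger, keep the one with the highest AUC) can be sketched as follows. Only the four models whose values are quoted in the text are included, the labels "LDA 1" and "LDA 2" are mine, and the remaining grey dots in Figure 9 are omitted.

```python
# AIR and AUC values quoted in the text; Figure 9 contains more models than these.
models = [
    {"name": "Base Model", "air": 0.865, "auc": 0.843},
    {"name": "LDA 1", "air": 0.873, "auc": 0.842},
    {"name": "LDA 2", "air": 0.878, "auc": 0.842},
    {"name": "LDA Candidate", "air": 0.927, "auc": 0.814},
]

def select_candidate(models, min_air=0.90):
    """Among models meeting the AIR floor, return the one with the highest AUC."""
    eligible = [m for m in models if m["air"] >= min_air]
    return max(eligible, key=lambda m: m["auc"]) if eligible else None

base = models[0]
best = select_candidate(models)
print(best["name"])
print(round(best["air"] - base["air"], 3))  # change in AIR
print(round(best["auc"] - base["auc"], 3))  # change in AUC
```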

And, unfortunately, this is where the public debiasing presentations typically stop.

But how does the debiasing algorithm actually obtain these results?

What happens to the model such that relative Black approval rates improve so much?

What amount of sacrificed predictive accuracy is considered "acceptable" for a given change in model fairness?

And do the answers to any of these questions weaken the case for algorithmic debiasing?

Let's find out.

Inside the Black Box: How is Algorithmic Fairness Actually Achieved?

To understand how algorithmic debiasing achieves improved credit decision fairness, let's compare the LDA Candidate Model identified in the previous section to the Base Model (i.e., with no fairness adjustment). Since fairness regularization operates by modifying the estimated model weights to minimize a new model training objective comprised of both predictive error and fairness, I start this assessment by comparing the estimated weights for each credit risk attribute (i.e., CLTV, DTI, and Loan Amount) from the two models - shown below in Figures 10a-c.

Figure 10a: Impact of Fairness Regularization

As we observed previously in Figures 7a-c, Black applicants were more likely than White applicants to have higher CLTV ratios, higher DTI ratios, and lower loan amounts. And, as we observed in Figure 6, these differences in underlying credit risk profiles drove the differences in estimated approval rates between the two groups and, therefore, the 86.5% AIR value.

So how does fairness regularization achieve more equitable approval rates between Black and White applicants?

As clearly shown in Figures 10a-c above, the fairness regularizer accomplishes this goal by altering the estimated credit risk profiles of our three predictor variables in a manner that disproportionately lowers the estimated PDs for Black applicants relative to White applicants. Specifically,

The estimated default odds ratio for CLTVs above 95% - i.e., the most risky CLTV segment according to the Base Model - is significantly reduced in the LDA Candidate Model, while that for CLTVs between 80% and 85% is increased.

The estimated odds ratios for DTIs between 50% and 60%, and DTIs greater than 60% - i.e., the most risky DTI segments according to the Base Model - are slightly reduced in the LDA Candidate Model.

The estimated odds ratios for loan amounts less than $150,000 - i.e., the most risky loan amount segments according to the Base Model - are reduced, while those for loan amounts greater than $250,000 are slightly increased.

Not surprisingly, in our example, algorithmic debiasing improves model fairness simply by lowering the estimated riskiness of certain credit profiles that are disproportionately associated with Black applicants, and vice versa. The only magic here is: (1) the algorithm's ability to sift efficiently through the various applicant credit profiles to find those that are both higher risk (i.e., above the PD approval threshold) and disproportionately Black, and (2) conditional on the weight assigned to the fairness penalty, λ, the amount of risk distortion needed to meet the dual model training objectives.[10] While this is certainly an effective mathematical solution to the outcomes-based fairness issue, it clearly raises some important risk issues of which lenders need to be aware.

Assessing The Case For Algorithmic Debiasing

On the surface, the case for algorithmic debiasing appears strong and intuitively appealing. In particular, under the premise that outcomes-based fairness metrics are a reasonable measure of credit model disparate impact or bias, popular debiasing approaches do appear to achieve the laudable goal of producing less discriminatory credit scoring model alternatives with more equitable approval rates.

However, digging below the surface - as I have above - reveals the presence of several risk considerations that begin to erode this case when considered within the context of actual credit model use and prudent enterprise risk management. Let's investigate these further.

Algorithmic debiasing may distort the Base Model's estimated credit risk profiles so seriously that they become counterintuitive, conceptually unsound, and expose the lender to safety-and-soundness risk.

When unconditional relative approval rates (i.e., AIR) or score demographics are used to measure model fairness, algorithmic debiasing simply searches for mathematical ways to adjust the Base Model's estimated credit risk relationships to increase the relative approval rates of the protected class group (or to "de-correlate" the estimated PDs with demographic group membership under adversarial debiasing). Not surprisingly, this search focuses on those credit risk attributes that are disadvantageous to the protected class group regardless of what those attributes represent.

In my example, as shown in Figure 10a, the fairness regularizer solves the mathematical fairness problem primarily by distorting the core CLTV credit risk relationship - i.e., all else equal, CLTVs above 85% are now associated with lower estimated PDs than those for CLTVs between 80% and 85%. This produces an LDA Candidate Model that - while superior according to the outcomes-based fairness metric - is no longer conceptually sound, appearing to come from some Stranger Things "upside-down" universe in which credit risk behaviors operate very differently.
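A simple conceptual-soundness check of this kind can be automated: walk the ordered bins of an attribute and flag any place where the estimated odds ratio falls as the attribute moves into a riskier range. The odds ratios below are invented purely to mimic the pattern described (they are not the values in Figure 10a), and the function name is my own.

```python
def monotonic_violations(bins, odds_ratios):
    """Return adjacent bin pairs where the odds ratio decreases as risk should rise."""
    return [
        (bins[i - 1], bins[i])
        for i in range(1, len(bins))
        if odds_ratios[i] < odds_ratios[i - 1]
    ]

# Ordered CLTV bins with HYPOTHETICAL odds ratios (illustration only).
cltv_bins = ["<=80%", "80-85%", "85-90%", "90-95%", ">95%"]
base_or = [1.0, 1.4, 1.9, 2.6, 4.0]   # rises steadily with CLTV
lda_or = [1.0, 2.2, 1.6, 1.7, 1.8]    # 80-85% now looks riskier than >85%

print(monotonic_violations(cltv_bins, base_or))
print(monotonic_violations(cltv_bins, lda_or))
```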

Now, to be fair, my Base Model is - by design - a simple model. And, perhaps, more "real-world" credit scoring models incorporating hundreds of credit risk attributes may not see these core credit risk attributes distorted as much since the model relies on so many other attributes whose weights could also be adjusted. While that may be true, the fact is that no such analysis has been publicly presented by algorithmic debiasing proponents to dispel this concern - probably due to the difficulty in detecting and describing the precise weight adjustments occurring in a highly complex model with hundreds of interacting features.

Additionally, some may point out that my LDA Candidate Model is too extreme - and that alternative LDA models closer to the Base Model may not experience these counterintuitive credit risk relationships. That is, there may be some "low-hanging fruit" that could easily be harvested without this issue. However, at least in my example, this would not necessarily be true. For example, if we defined our LDA Candidate Model to be the right-most model within the initial cluster (see Figure 9), we would still obtain a counterintuitive credit risk relationship for one of the credit risk attributes - Loan Amount. Alternatively, we could select one of the first two LDAs to the right of the Base Model in Figure 9; however, while such LDAs may avoid the significant weight distortions of the other LDA candidates, such "low-hanging fruit" would likely not provide the needed increase in fairness to achieve the desired >=90% AIR (or a sufficiently reduced Adversary AUC value) - nor would the minimal increase in fairness provided justify the increased "governance tax" associated with the use of algorithmically debiased models.[11]

Notwithstanding these considerations, my central point still holds. Regardless of the number of predictive features or model complexity, outcomes-based algorithmic debiasing processes seek ways to alter the estimated credit risk of certain risky credit profiles in order to push those PD estimates into the lender's loan approval region (or to reduce their correlation with protected class membership) - thereby improving relative outcomes. However, as we will see in the next section, this risk distortion is likely not "righting a wrong" of a Base Model that is somehow unable to quantify accurately the risk of these credit profiles - and thereby fixing an important model failure that drives the relatively lower approval rates for protected class consumers. No - the debiasing algorithm is much more superficial. It simply looks for mathematical ways to drive increased relative approval rates of certain demographic groups regardless of the predictive accuracy of the credit profiles whose risks it chooses to adjust.[12] Therefore, counterintuitively, rather than fixing a model failure, the debiasing algorithm actually introduces a model failure in order to achieve the desired outcome-based results. For lending institutions desiring rigorous and reliable credit models - and for banks subject to rigorous model risk management and other safety-and-soundness requirements, such debilitations make LDA Candidate Models generated under an AIR or other outcome-based fairness metric potentially non-viable.

To achieve improved outcomes-based fairness, algorithmic debiasing effectively "overrides" the Base Model's PD estimates for certain credit profiles - making them more or less risky than actual credit performance. In doing so, the LDA Candidate Model no longer produces reliable default rate estimates for credit risk management purposes.

In the previous section, I compared the relative risk estimates of different attribute ranges to those from the Base Model to focus on how the LDA Candidate Model's weight distortions can create illogical, conceptually unsound estimated credit risk profiles. Here, I focus instead on the LDA Candidate Model's absolute default rate estimates and how these estimates affect the model's absolute predictive accuracy.

To illustrate this risk, Figure 11 below compares the LDA Candidate Model's decision metrics with those previously discussed for the Base Model in Figure 5.

Here we see that with essentially the same estimated overall approval rate of 90%, the LDA Candidate Model underestimates the actual default rate on approved loans by almost a full percentage point - which represents a significant change from the Base Model's absolute error rate (0.08%) as well as a significant change in the model's relative error rate (-25.2% vs. +2.3% for the Base Model).[13]
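The error figures above can be tied together with a little arithmetic. The sketch assumes that the relative error is measured as (estimated minus actual) divided by the estimated default rate, and that the 3.50% and 3.65% expected default rates belong to the Base Model and the LDA Candidate Model respectively; both are my reading of the reported numbers rather than stated definitions.

```python
def relative_error(abs_error_pp, estimated_rate_pp):
    # Assumption: relative error = (estimated - actual) / estimated.
    return abs_error_pp / estimated_rate_pp

# Base Model: 3.50% expected default rate, 0.08 percentage-point absolute error.
base_rel = relative_error(0.08, 3.50)
print(f"{base_rel:+.1%}")

# LDA Candidate Model: 3.65% expected default rate, -25.2% relative error,
# which implies the absolute error in percentage points.
lda_abs_error_pp = -0.252 * 3.65
print(round(lda_abs_error_pp, 2))
```

Under these assumptions the implied shortfall is a little over nine-tenths of a percentage point, in line with the "almost a full percentage point" underestimate.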

The root cause of the LDA Candidate Model's prediction bias is:

The suppression of estimated PDs (average LDA PD = 9.0%) on a subset of bad loans (actual default rate = 24.4%) that were denied under the Base Model (average Base PD = 23.1%) - thereby pushing them into the LDA Candidate Model's approval region,

in exchange for:

The inflation of estimated PDs (average LDA PD = 20.2%) on a subset of good loans (actual default rate = 5.4%) that were approved by the Base Model (estimated Base PD = 5.0%) - thereby pushing them into the LDA Candidate Model's denial region.

This "rough justice" swap set has credit profiles consistent with those discussed in Figures 10a-c above and maintains about the same overall expected default rate for approved loan applications under both models (i.e., 3.65% vs. 3.50% - see Figure 11 above); however, the LDA Candidate Model's approved loans have worse actual credit risk and its denied loans have better actual credit risk - than observed under the Base Model - thereby causing the variance in the LDA Candidate Model's predictive accuracy.
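The swap set can be identified mechanically by comparing each application's decision under the two models. The sketch below uses a handful of invented loans and a hypothetical LDA cut-off of 10% (only the 9.42% Base Model threshold comes from the text); the field and function names are my own.

```python
def swap_sets(loans, base_cutoff, lda_cutoff):
    """Split out loans whose decision flips between the two models."""
    swapped_in = [l for l in loans
                  if l["base_pd"] > base_cutoff and l["lda_pd"] <= lda_cutoff]
    swapped_out = [l for l in loans
                   if l["base_pd"] <= base_cutoff and l["lda_pd"] > lda_cutoff]
    return swapped_in, swapped_out

def summarize(group):
    """Average PDs under each model and the actual default rate of a group."""
    n = len(group)
    return {
        "avg_base_pd": sum(l["base_pd"] for l in group) / n,
        "avg_lda_pd": sum(l["lda_pd"] for l in group) / n,
        "actual_default_rate": sum(l["default"] for l in group) / n,
    }

# HYPOTHETICAL loans for illustration only.
loans = [
    {"base_pd": 0.23, "lda_pd": 0.09, "default": 1},  # denied -> approved
    {"base_pd": 0.22, "lda_pd": 0.08, "default": 0},  # denied -> approved
    {"base_pd": 0.05, "lda_pd": 0.20, "default": 0},  # approved -> denied
    {"base_pd": 0.04, "lda_pd": 0.03, "default": 0},  # approved under both
    {"base_pd": 0.30, "lda_pd": 0.35, "default": 1},  # denied under both
]

swapped_in, swapped_out = swap_sets(loans, base_cutoff=0.0942, lda_cutoff=0.10)
print(summarize(swapped_in))
print(summarize(swapped_out))
```

Comparing each group's average LDA PD with its actual default rate is what exposes the suppression and inflation described above.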

By breaking the Base Model's strong alignment between estimated and actual default rates across the spectrum of credit risk profiles, outcomes-based algorithmic debiasing has impaired the ability of the LDA Candidate Model to be a reliable predictor of credit risk on approved loan applications - as well as for certain critical credit risk profiles (e.g., high CLTV loans, low loan amounts, etc.) - thereby exposing the lender to both credit and model safety-and-soundness risks.

It also raises an interesting question as to the accuracy of Adverse Action Notices - to which I now turn.

It is unclear whether standard approaches for determining Adverse Action reasons are consistent with the underlying drivers of incremental denials generated by the LDA Candidate Model.

As noted in the section above, in our example, outcomes-based algorithmic debiasing created a swap-set of loan applications to improve the relative approval rate of Blacks. One component of this swap set is a group of relatively low-risk applicants who were previously approved by the Base Model, but whose estimated PDs are now inflated by the LDA Candidate Model and, therefore, denied. In exchange for these denials, the LDA Candidate Model suppresses the estimated PDs on a second group of higher-risk credit profiles whose approvals under the LDA Candidate Model provide the needed AIR improvement.

For those lower-risk applicants in the swap set who are now denied by the LDA Candidate Model, the standard approach to generating Adverse Action notification reasons would likely identify their CLTV level and/or loan amount as the primary denial reasons. This would align with Figures 10a and 10c which show these applicants have CLTVs in the 80-85% range and/or loan amounts above $250,000. However, while technically correct, these Adverse Action factors may not be considered the specific reasons for denial within the context of the CFPB's Circular 2022-03 - nor would they be objectively actionable by the consumers receiving them.

Such risks arise because these individuals are really being denied for a different reason - specifically, to improve the lender's credit scoring model fairness measures. Consider the case of one of these applicants who has a CLTV between 80% and 85%, a DTI equal to 37%, and a loan amount between $150,000 and $250,000. Under the Base Model, this applicant had an estimated PD of 5.01% and was therefore approved as it fell below the PD Approval Threshold of 9.42% (see Figure 11). Under the LDA Candidate Model, however, this applicant receives an estimated PD of 19.54% and is, therefore, denied as it exceeds that model's PD Approval Threshold of 17.15%.

Contrast that applicant with another applicant who has the same DTI and loan amount profiles, but whose CLTV exceeds 95%. Under the Base Model, this applicant had an estimated PD of 58.87% and was therefore denied as it well exceeded the PD Approval Threshold of 9.42%. Under the LDA Candidate Model, however, this applicant receives an estimated PD of 8.47% and is, therefore, approved as it falls below that model's PD Approval Threshold of 17.15%.
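The decision reversal for these two applicants can be reproduced directly from the estimated PDs and thresholds quoted above. The helper function and dictionary layout below are my own illustrative choices, not part of any model implementation.

```python
# PD Approval Thresholds quoted in the text above.
BASE_THRESHOLD = 0.0942
LDA_THRESHOLD = 0.1715


def decide(estimated_pd, threshold):
    """Approve when the estimated PD falls below the model's threshold."""
    return "approved" if estimated_pd < threshold else "denied"


# Estimated PDs for the two example applicants (same DTI and loan amount range).
applicants = {
    "CLTV 80-85%": {"base_pd": 0.0501, "lda_pd": 0.1954},
    "CLTV > 95%": {"base_pd": 0.5887, "lda_pd": 0.0847},
}

decisions = {
    name: {
        "base": decide(pds["base_pd"], BASE_THRESHOLD),
        "lda": decide(pds["lda_pd"], LDA_THRESHOLD),
    }
    for name, pds in applicants.items()
}

for name, outcome in decisions.items():
    print(f"{name}: Base Model {outcome['base']}, "
          f"LDA Candidate Model {outcome['lda']}")
```

The lower-CLTV applicant flips from approved to denied while the higher-CLTV applicant flips from denied to approved, which is the pattern that makes a CLTV-based denial reason hard to defend.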

Was the first applicant really denied because of his/her excessive CLTV?

Is knowledge that the application was denied because of the CLTV value practically actionable to this borrower?

These would seem to be relevant questions in the current regulatory environment; however, their answers require the advice of legal and compliance professionals. Lenders may wish to keep this in mind.

Most algorithmic debiasing processes applied to credit models employ outcomes-based fairness metrics - such as AIR - because of their focus on credit decision outcomes, as well as their past usage in employment discrimination matters. However, in my opinion, these metrics are conceptually flawed measures of credit model bias, and their use exposes lenders to potentially significant regulatory and safety-and-soundness risks.

Suppose that a bank examiner is reviewing the credit policies of a consumer lender and learns that the lender requires a minimum FICO score of 580, and a maximum debt-to-income ratio of 55% for an application to be eligible for approval.

What is the likelihood that the examiner would raise a disparate impact issue over these credit policy rules because they lead to disproportionately lower approval rates for a protected class group?

Not likely - because these credit policy rules are grounded in direct measures of an individual's credit risk as evidenced by the long-term, consistent empirical relationship of these attributes with future credit performance. It would not matter that reducing the minimum FICO score to 540 and/or raising the maximum DTI to 60% would result in greater approval rate equity across demographic groups, since such incremental approvals would surely yield much greater credit losses than under current credit policy rules. Accordingly, the lender (and, presumably, the examiner) implicitly asserts Regulation B's "legitimate business need" defense for these credit policy rules.

Now, would the answer differ if the lender had a credit policy rule that denied applications if the applicant did not own a motor vehicle?

Likely yes - because of the tenuous and unusual relationship between this attribute and future credit performance, and the fact that the underlying credit risk attribute it is likely capturing - potential employment instability - could be measured more directly in a manner that may have a less disproportionate impact on protected class groups.

My point here is that there are certain credit risk attributes such as CLTV, DTI, PTI, recent bankruptcy, recent credit derogatories, and more that are standard and direct measures of applicant credit risk used ubiquitously throughout the banking industry, that form the foundations of prudent consumer credit risk management policies, and that have not been criticized by federal bank regulators for disparate impact risk. In fact, federal regulation actually requires some of these credit risk attributes to be considered by lenders during loan underwriting. For example, Regulation Z (Truth in Lending) states:

"A card issuer must not open a credit card account for a consumer under an open-end (not home-secured) consumer credit plan, or increase any credit limit applicable to such account, unless the card issuer considers the consumer's ability to make the required minimum periodic payments under the terms of the account based on the consumer's income or assets and the consumer's current obligations. ... Card issuers must establish and maintain reasonable written policies and procedures to consider the consumer's ability to make the required minimum payments under the terms of the account based on a consumer's income or assets and a consumer's current obligations. ...Reasonable policies and procedures also include consideration of at least one of the following: The ratio of debt obligations to income; the ratio of debt obligations to assets; or the income the consumer will have after paying debt obligations." [emphasis added by author]

Accordingly, using a completely unconditional outcomes-based fairness metric - such as AIR, or the correlation of estimated PDs with applicant demographics - to measure credit model fairness would not seem to align with certain bank regulatory requirements (such as the Reg Z requirement above), nor with long-standing bank regulatory supervision and enforcement activities that have not cited standard credit risk attributes for illegal disparate impact.

Outcomes-based debiasing methodologies may also be inconsistent with controlling disparate impact legal doctrine as they are based on the premise that a lender's "legitimate interests" (as referenced in HUD's 2013 Disparate Impact Burden Shifting Framework) or "legitimate business needs" (as referenced in Regulation B) solely consist of acceptably-accurate credit decisions. That is, so long as an LDA Candidate Model can make loan approval rates more equitable with an "acceptable" reduction in predictive accuracy[14], the third stage of the discriminatory effects test is deemed satisfied by debiasing proponents - thereby requiring the LDA Candidate Model to be adopted. However, as I have described here, a lender's "legitimate interests" are multidimensional in this context, and it is not enough that an LDA Candidate Model is fairer and doesn't diminish predictive accuracy "too much". Any high-impact model used by a lender must also meet several other regulatory compliance and safety-and-soundness requirements, and all of these collectively represent the lender's "legitimate interests" that should be satisfied for an LDA Candidate Model to be deemed viable.

Finally, by suppressing the estimated PDs for protected class consumers in order to achieve higher approval rates, outcomes-based debiasing may increase a lender's legal, regulatory, and reputational risks under UDAAP (i.e., unfair, deceptive, or abusive acts and practices). In my example, the LDA Candidate Model - while approving more Blacks than the Base Model - does so by approving higher-risk applicants whose estimated PD levels are intentionally suppressed by the outcomes-based debiasing process. In fact, the actual default rates of the LDA Candidate Model's incremental approvals are 24.4% versus an estimated default rate of 9.0% (and, more relevantly, an estimated default rate of 23.1% under the Base Model). By knowingly approving high-risk loans with suppressed PD estimates, a lender may face allegations of making loans to individuals they knew were likely not viable - and, as a result, harmed those borrowers' abilities to obtain future credit at reasonable prices.

In addition to the concerns above, outcomes-based fairness metrics may provide inconsistent signals relative to alternative fairness metrics - creating uncertainty as to whether disparate impact risk is truly present, or how to address it.

Two other measures traditionally used to evaluate credit scoring model fairness are relative AUC values and relative model errors between protected class and control group borrowers. Rather than focusing on equity in decision outcomes, these performance-based measures seek to identify true "model failures" - that is, statistical evidence showing that the model performs relatively less accurately for certain protected class groups, such as overestimating their credit risk relative to actual credit performance (and/or underestimating the credit risk of control group members relative to their actual credit performance). Accordingly, fairness adjustments / debiasing using these fairness metrics are seen as ways to mitigate these "model failures" and put the two groups on a more equal footing with respect to accurate credit risk quantification.

Figure 12 below presents these two alternative fairness measures for both the Base Model and the LDA Candidate Model.

What's interesting about these alternative fairness metrics is:

When evaluating the model's ability to predict accurately each demographic group's actual default rates, we see that - in fact - the Base Model actually favors Black applicants as it estimates their default rate to be almost 400 bps lower than their actual default rate, while disfavoring White applicants by overestimating their default rate by 25 bps.

Essentially, the Base Model's "model failure" (i.e., its inability to predict accurately the default rates of each demographic group) actually improves the AIR - increasing it to a level (86.5%) that is more than 20 points higher than the level that would be obtained if the model produced accurate risk quantifications for both groups (63.3% - details not displayed).

In terms of rank-ordering predictive power, we see - again - that Blacks are favored in the Base Model as their AUC value (0.844) is slightly better than the AUC value for Whites (0.839) - although this difference may not be statistically significant.

So, when viewed within the context of alternative performance-based fairness metrics, the AIR can present a glaringly inconsistent assessment of credit scoring model bias. But this should not be a surprise since outcomes-based fairness metrics such as the AIR are unconditional - that is, they do not consider directly whether differences in actual credit performance may partially or fully explain the differences in predicted credit performance and, therefore, differences in decision outcomes. In fact, if we were to use the alternative performance-based metrics for our fairness assessment, we would conclude that the Base Model - if anything - actually favors Black applicants since it estimates better PDs for this group relative to their true underlying credit performance.[15]
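To show how an outcomes-based metric and the performance-based metrics can point in different directions, here is a small sketch that computes all three on the same data. Everything in it is hypothetical: the function names, the 10% threshold, and the toy PDs and default flags (which are not the Figure 12 values). The AUC is computed by direct pairwise comparison, which is fine for small samples.

```python
def auc(scores, labels):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_report(pds, defaults, threshold):
    """Approval rate, model error, and AUC for one demographic group."""
    avg_pd = sum(pds) / len(pds)
    actual = sum(defaults) / len(defaults)
    return {
        "approval_rate": sum(p < threshold for p in pds) / len(pds),
        "model_error": avg_pd - actual,  # negative = credit risk underestimated
        "auc": auc(pds, defaults),
    }


# Hypothetical toy data: estimated PDs and actual default flags by group.
protected = group_report([0.04, 0.06, 0.08, 0.12, 0.30], [0, 0, 0, 1, 1],
                         threshold=0.10)
control = group_report([0.03, 0.05, 0.07, 0.09, 0.86], [0, 0, 0, 0, 1],
                       threshold=0.10)

# Outcomes-based view: ratio of approval rates (unconditional on performance).
air = protected["approval_rate"] / control["approval_rate"]

print(f"AIR: {air:.2f}")
print(f"protected model error: {protected['model_error']:+.2f}")
print(f"control model error:   {control['model_error']:+.2f}")
```

In this toy data the AIR is below 1 (suggesting the protected group is disfavored), while the model errors show the protected group's risk is underestimated and the control group's is slightly overestimated - the same kind of conflicting signal discussed above.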

The LDA Candidate Models produced by in-processing algorithmic debiasing approaches may not be considered consistent with applicable fair lending laws and regulations.

There are two primary schools of thought as to whether LDA Candidate Models violate federal fair lending laws and regulations by using applicant demographics during the model training process. The first viewpoint is that there is no fair lending violation since the in-processing methodologies yield LDA models that are demographically blind. That is, they achieve "fairness through unawareness" - meaning that when deployed into production, the model does not require, nor use, any demographic information to make its credit decision. A second viewpoint rejects this distinction between training use and production use - arguing that any use of demographic data during model development may be a violation.

What's interesting about these two perspectives is how they drive very different algorithmic debiasing processes. Specifically,

Under the first view, demographic data is used to generate a continuum of LDA Candidate Models that typically retains the same set of predictive attributes as the Base Model - but whose estimated model weights on these attributes deviate in ways designed to improve decision outcome fairness. While demographic data impacts the model weights, it is not used directly in the model and, therefore, is not needed to generate model predictions. Fairness regularization and adversarial debiasing are two examples of this approach.

In contrast, under the second view, the LDA Candidate Models are formed without any knowledge of demographics; that is, they are formed by varying the set of predictive attributes included in the model, as well as by varying certain model hyperparameter values that also influence variable selection. For Base Models containing hundreds of predictive attributes, this process could involve the estimation of a vast number of LDA Candidate Models - which is why users typically employ "smart" algorithms that narrow this generation down to a smaller relevant subset. The fairness of the LDA Candidate Models is then assessed in a separate process after model training (perhaps by a separate compliance team using demographic data) and the LDA Candidate Models are then summarized by their predictive accuracy and fairness metrics for final model selection.
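The first approach (fairness regularization) can be sketched as a logistic regression whose training objective adds a λ-weighted fairness penalty to the usual log loss. This is a simplified illustration under loud assumptions: the penalty used here (the squared gap in average estimated PD between groups) is a stand-in I chose because it is smooth, not the penalty used in my analysis; the single attribute, the toy data, and all function names are hypothetical; and the optimizer is a plain numerical-gradient descent.

```python
import math


def predict(weights, x):
    """Estimated PD; weights[0] is the intercept, x a list of attribute values."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, rows):
    total = 0.0
    for x, y, _ in rows:
        p = min(max(predict(weights, x), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


def fairness_penalty(weights, rows):
    # ASSUMPTION: squared gap in average estimated PD between the two groups.
    prot = [predict(weights, x) for x, _, g in rows if g == 1]
    ctrl = [predict(weights, x) for x, _, g in rows if g == 0]
    return (sum(prot) / len(prot) - sum(ctrl) / len(ctrl)) ** 2


def objective(weights, rows, lam):
    # Demographics enter only here, during training.
    return log_loss(weights, rows) + lam * fairness_penalty(weights, rows)


def fit(rows, lam, steps=1000, lr=0.5, h=1e-5):
    """Gradient descent using central-difference numerical gradients."""
    weights = [0.0] * (1 + len(rows[0][0]))
    for _ in range(steps):
        grad = []
        for i in range(len(weights)):
            up, down = weights[:], weights[:]
            up[i] += h
            down[i] -= h
            grad.append((objective(up, rows, lam)
                         - objective(down, rows, lam)) / (2 * h))
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical rows: ([attribute], defaulted, protected_flag).
ctrl_rows = [([x], y, 0) for x, y in
             zip([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 0, 0, 0, 1, 0])]
prot_rows = [([x], y, 1) for x, y in
             zip([0.4, 0.5, 0.6, 0.7, 0.8, 0.9], [0, 1, 0, 1, 1, 1])]
rows = ctrl_rows + prot_rows

w_base = fit(rows, lam=0.0)    # no fairness penalty
w_fair = fit(rows, lam=20.0)   # heavily penalized
print("Base weights:", w_base, "penalty:", fairness_penalty(w_base, rows))
print("LDA weights: ", w_fair, "penalty:", fairness_penalty(w_fair, rows))
```

Note that `predict` never sees the group flag, so the fitted model is demographically blind in production even though the flag shaped its weights during training.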

While I can appreciate the nuance of both viewpoints, ultimately I believe that neither is free from potential regulatory / legal challenge. For the debiasing methods in the first group, let's remember that the credit scoring model's weights are being adjusted in a manner designed to favor (or to reduce the disfavor of) one or more protected class groups. While the final models do not directly consider an individual's demographic group membership and, therefore, operate in a demographically-blind manner, one could argue that the adjustments made to the model weights during the debiasing process have encoded a latent form of demographic bias to the model, and this latent bias certainly impacts the estimated PDs generated during model use.

But these model weight adjustments are favorable to protected class groups?

Yes - but as many fair lending practitioners know, considerations of borrower demographics in the credit origination process - even in a manner favorable to protected class groups - may still be considered a potential violation of applicable fair lending laws unless done within a legally-compliant special purpose credit program.

For the second debiasing approach that searches through various subsets of the model's predictive attributes and hyperparameter values to find the combinations that yield an acceptable accuracy-fairness trade-off, I have a similar concern. Here, rather than maintaining a fixed set of predictive attributes and adjusting their weights to achieve improved fairness, the algorithm adds and drops predictive factors to generate a large number of alternative models. This collection of models is then separately evaluated for fairness performance, and the lender selects a specific LDA Candidate Model that provides an acceptable accuracy-fairness trade-off.

While the estimated model weights of the LDA Candidate Model were not explicitly adjusted by a demographically-driven fairness penalty - as in the first group - they were still effectively adjusted by the ex post fairness evaluation of the lender. That is, rather than having the algorithm tell you the specific model weights needed to achieve a given fairness level (as is done by the first group), this approach uses brute force to find the LDA Candidate Model: it generates a long list of alternative models with different model weight values, calculates each model's fairness value separately after all training has completed, and then selects the specific LDA Candidate Model whose accuracy-fairness trade-off is considered acceptable. While demographic data was not used during model development, it still influenced the specific LDA Candidate Model chosen by the lender, and the estimated model weights of that model still reflect a latent form of demographic bias - this time driven indirectly by demographically-correlated omitted variable bias rather than direct model weight distortion (as in the first approach).
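The structure of this second approach can be sketched in two steps: enumerate the candidate attribute sets, then select among the resulting models using fairness metrics computed separately. In the sketch below the three attribute names mirror the ones used in this post, but the AUC and AIR values attached to each candidate, the tolerance, and the selection rule (highest AIR within an allowed AUC drop) are all hypothetical placeholders.

```python
from itertools import combinations

ATTRIBUTES = ["cltv", "dti", "loan_amount"]


def candidate_attribute_sets(attributes):
    """Enumerate every non-empty subset of attributes (the brute-force search space)."""
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            yield subset


# Hypothetical results: subset -> (AUC from training, AIR computed ex post
# by a separate team using demographic data).
results = {
    ("cltv", "dti", "loan_amount"): (0.840, 0.80),  # Base Model
    ("cltv", "dti"): (0.832, 0.84),
    ("cltv", "loan_amount"): (0.815, 0.83),
    ("dti", "loan_amount"): (0.760, 0.93),
    ("cltv",): (0.800, 0.86),
    ("dti",): (0.700, 0.95),
    ("loan_amount",): (0.620, 0.97),
}


def select_lda(results, base_subset, max_auc_drop):
    """Pick the highest-AIR candidate whose AUC is within max_auc_drop of the Base Model."""
    base_auc = results[base_subset][0]
    eligible = {subset: metrics for subset, metrics in results.items()
                if metrics[0] >= base_auc - max_auc_drop}
    return max(eligible, key=lambda subset: eligible[subset][1])


search_space = list(candidate_attribute_sets(ATTRIBUTES))
chosen = select_lda(results, base_subset=tuple(ATTRIBUTES), max_auc_drop=0.01)
print(f"{len(search_space)} candidates searched; selected attributes: {chosen}")
```

Even though no demographic data touches model training here, the AIR column drives which model survives selection, which is the indirect influence described above.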

Ultimately, whether either of these two approaches is inconsistent with federal fair lending laws and regulations is up to lawyers and the courts to resolve. However, it appears that reasonable arguments could be made for both sides of this question and, therefore, lenders should proceed cautiously.[16]

Final Thoughts

So where do we go from here?

Based on my analysis, I believe it is clear that lenders need to evaluate more holistically the potential benefits, risks, and costs associated with algorithmic debiasing - and I hope this analysis has shed some needed light on the considerations involved in that evaluation.

More specifically, however, my takeaway is that outcomes-based credit model debiasing appears to create a number of problematic risk management issues that, frankly, may be difficult for some lenders to overcome. Whether these tools are actually legally-required under proponents' interpretations of disparate impact laws and regulations - well, I leave that to the lawyers to advise.

Nevertheless, if a lender's policy goals are explicitly to promote greater approval rate equity in their consumer lending, and they find an effective way to manage the associated credit and model risks, and they wrap their debiased models appropriately within a legally- and regulatory-compliant program (perhaps a special purpose credit program?), then these tools can be an important technological facilitator of such policy goals.

For other lenders, and without further regulatory guidance in this area, I would suggest the following:

Explore alternative performance-based measures of model fairness as discussed above.

Based on these performance-based measures, determine where your Base Model stands with respect to the relative accuracy of credit risk quantification. Does the model suffer from a true "model failure" that causes it to favor or disfavor one or more protected class groups relative to corresponding control groups?

If the evidence indicates the Base Model disfavors a protected class group, can the model be directly fixed to address the model failure without the use of demographic data (e.g., inclusion of important omitted variables, consideration of alternative functional forms)? If not, consider an alternative performance-based debiasing approach - such as one whose fairness regularizer is based on relative model errors or other predictive performance measures. However, further research may be needed here on effective methodologies, and be sure to evaluate the resulting LDA Candidate Models thoroughly along the lines of my analysis - including a legal review of potential disparate treatment risks.

If the evidence indicates the Base Model favors a protected class group, be sure to evaluate with appropriate legal and compliance advisors whether there may be UDAAP risks present, whether a conflicting AIR-based fairness signal is a relevant mitigant, and whether a potential reverse-debiasing or other model fix may be needed to mitigate UDAAP concerns. Beware - this is a tricky area.

And finally, if anything, it's clear to me that further publicly-shared research into algorithmic debiasing is specifically needed for credit models due to the highly regulated industry in which they operate. Such research should be holistic in design - focusing not only on technical methods, but also on the multidimensional objectives that comprise lenders' legitimate business needs.

While the state of AI-based lending in achieving the promise of increased financial access and inclusion is not yet where it likely could be, additional multi-disciplinary research and an openness to consider alternatives may just yet yield a new "gold standard".

For my further research on this topic, see the following follow-up posts:

* * *

ENDNOTES:

[1] I created the synthetic credit performance dataset from 30-year conventional conforming, rate-term refinance loan applications on single-family, owner-occupied properties in order to homogenize the underlying credit applications as much as possible. Additionally, because the public HMDA file does not contain data on the applicant's credit history, I only included denied loan applications where the denial reason was DTI, collateral, or Other so as to align the sample's denials with the specific set of credit risk attributes available.

[2] Using binned dummy variables also provides more opportunity for the algorithmic debiasing process to find specific patterns across the credit risk attributes that lead to "fairer" outcomes. That is, if all three variables were each entered in a continuous and linear manner, there would only be three coefficient values (excluding the constant term) that could be varied to improve model fairness.

[3] The reason this PD Approval Threshold value is relatively low (why not 50%?) is the unbalanced nature of the sample - with 92% of the sample being non-defaults.

[4] For the purposes of this post, I assume that "loan amount" would not be considered a controversial credit risk attribute under a disparate impact assessment and would, therefore, also meet the "business necessity" defense. However, this is far from certain and, perhaps, unlikely. Nevertheless, given the limited available credit risk attributes in the public HMDA dataset, I proceed under this assumption in order to focus on the main point of this post.

[5] See, for example, "CFPB Should Encourage Lenders To Look For Less Discriminatory Models," March 11, 2022 Letter From NCRC, Upturn, and Zest AI to Director Rohit Chopra of the Consumer Financial Protection Bureau.

I note that, under HUD's 2013 disparate impact burden-shifting framework, the "burden" of proving the existence of less discriminatory alternatives lies with the individual / entity alleging illegal disparate impact - not with the lender.

[6] See "Fair Lending Monitorship of Upstart Network’s Lending Model: Second Report of the Independent Monitor," Relman Colfax PLLC, November 10, 2021.

[7] I have also performed this analysis using another popular outcomes-based debiasing methodology - adversarial debiasing - and achieved qualitatively similar results. This is not unexpected as both fairness regularization and adversarial debiasing achieve improved fairness by adjusting the Base Model's estimated weights. Fairness regularization does this in a more direct manner (i.e., including a direct fairness penalty during model training), while adversarial debiasing does this more indirectly (i.e., by effectively de-correlating estimated PDs with applicant demographics in a two-stage competing estimation approach). Later sections of this post will share more information about the adversarial debiasing results.

[8] This is not the only way to define a fairness penalty under fairness regularization. However, for the purposes of this analysis, it possesses the desired properties and - as we shall see - generates LDA models with desired properties.

[9] Specifically, credit scoring models are primarily used to: (1) "rank-order" the credit risk of loan applications - that is, to sort applications from lowest to highest estimated risk of default, and (2) predict accurately at the overall portfolio level (i.e., in the aggregate across a large number of applications). While other studies may also evaluate model performance metrics such as false positive rates, false negative rates, F1 scores, accuracy rates, etc., these metrics are really most relevant when a model is being used (and monitored) for individual classification accuracy - in our context, being able to predict accurately the specific loan applications that will default or not default. While that may be of interest, it is not reflective of how credit scoring models are used in practice.

[10] Adversarial debiasing works in a similar manner in that it also adjusts the estimated model weights to achieve a combined accuracy-fairness objective. However, in this methodology, the fairness objective is expressed differently - with a second model (the "adversary") used to measure the degree to which the credit scoring model's PD estimates are predictive of the applicant's protected class membership. Essentially, this approach looks to maximize the credit scoring model's predictive strength while simultaneously minimizing the adversary model's predictive strength (i.e., we want estimated PDs that are predictive of default but de-correlated with the applicant's protected class membership). The advantage of this debiasing approach over the AIR-based approach is that the fairness objective is not conditional on a specific approval threshold. Nevertheless, both approaches still measure fairness based on the scoring model's outputs - as opposed to its relative predictive performance across demographic groups.

Depending on the relative weight assigned to the two competing model training objectives (similar to the λ hyperparameter used in fairness regularization), adversarial debiasing can generate different LDA Candidate Models with varying degrees of PD model accuracy and fairness where fairness is now measured by the predictive strength (i.e., AUC value) of the adversary.

[11] By "governance tax", I am referring to the additional model risk management testing, oversight, and monitoring associated with both model performance and model fairness over the model's life-cycle. See my post "Are Rising Interest Rates Impairing AI Credit Model Fairness" for a related discussion and example.

[12] For credit scoring models containing alternative data that may have a tenuous relationship with consumer credit behaviors, or containing highly complex non-linear interactions that may effectively proxy for borrower demographics, it is possible that the risk distortions introduced by the algorithmic debiasing are "righting a wrong" by suppressing the relative adverse effects of these factors on protected class PD estimates. However, I am unaware of a practical and effective way to analyze and validate this. Additionally, as I discuss further below, even if one could target the debiasing algorithm's adjustments to such questionable factors and weights, adjusting them solely to improve relative approval rates (as opposed to other fairness-related measures of model performance) may create alternative, unintended risk issues.

[13] These default rates are calculated at the overall aggregate portfolio level - not at the individual loan level - to be consistent with lender credit model usage.

[14] There is also no commonly-accepted standard, or regulatory guidance, on what is an "acceptable" reduction in predictive accuracy for a given change in model fairness. One would expect that this trade-off is best assessed by the lender given the totality of the lender's risks and costs associated with this decision, as well as the lender's specific risk appetite. However, we have already begun to see challenges to these decision rights - most recently in "Fair Lending Monitorship of Upstart Network’s Lending Model: Third Report of the Independent Monitor," where the independent monitor created its own methodology (described across eleven pages) to determine whether an LDA Candidate Model's reduction in predictive accuracy would be acceptable to the lender in exchange for an improved AIR (Spoiler Alert - they believed it was). According to the Fourth and Final Report of the Independent Monitor, though, Upstart disagreed with this methodology and declined to adopt it.

[15] This is true whether we look over the entire training sample, or just the approved loans. Figure 12 reports the results for the entire training sample. For approved loans, the Base Model predicts an average default rate of 4.1% for Blacks (3.5% for Whites) versus actual default rates of 7.5% for Blacks (3.2% for Whites) - yielding a -3.4% favorable model variance for Blacks and a +0.3% unfavorable model variance for Whites.

[16] One interesting wrinkle to this issue is the DOJ's recent settlement with Meta / Facebook ("Meta") in which Meta was required to develop a Variance Reduction System ("VRS") in order to effectively debias the algorithms that govern the display of its customers' digital ads to Meta users. As part of the DOJ-approved VRS design, announced in January 2023, Meta was permitted to use aggregated customer demographic data (currently, race/ethnicity and sex) to adjust the set of users to whom future ad impressions will be displayed - with the overall goal of aligning the targeted demographics of a customer's ad campaign with the demographics of the users to whom the ads were actually displayed. While a deeper dive into the system architecture is needed to understand more specifically how it operates, the fact that the DOJ permitted the use of aggregate demographics as part of an algorithmic debiasing process is important in assessing the risk of this issue.