Richard Pace

Fool's Gold 3: Do LDA Credit Models Really Improve Fairness?

Updated: Sep 13



Earlier this year, I released Fool's Gold? Assessing the Case For Algorithmic Debiasing - an empirical assessment of claims made by certain fintechs, regtechs, and consumer advocates that credit model fairness can be improved via automated searches for less discriminatory alternative ("LDA") model configurations - a process otherwise known as algorithmic debiasing ("AD"). The success of AD as a fair lending compliance risk mitigation tool is grounded in the premise that, for a given credit model, a multiplicity of similarly-predictive, yet fairer, alternative models exist. And the recent emergence of ML-based tools to find these LDA Models now permits lenders to address credit model disparate impact much more effectively than ever before. Accordingly, proponents urge lenders to perform AD regularly on existing and new credit models, and to adopt LDAs when improved fairness can be achieved without a "significant" sacrifice in model predictive accuracy - which ostensibly occurs often.

On its face, AD appears to be an important advancement in expanding responsible credit access by preventing "algorithmic redlining" via discriminatory credit scoring models. However, given these significant benefits, there is a surprising lack of transparency from promoters and adopters as to how AD specifically achieves this goal. And given its relatively recent emergence from the academic realm, there is also a surprising lack of basic research on whether its application to the highly-regulated consumer lending area may create unintended, yet important, risk and compliance side effects for lenders and consumers.

This lack of AD transparency and research inspired my original Fool's Gold article where I created a simple credit scoring model (the "Base Model") from publicly-available Home Mortgage Disclosure Act ("HMDA") data, and used this model - along with HMDA's applicant race data - to generate LDA Models using popular AD techniques such as fairness regularization and adversarial debiasing. I then investigated how the AD process specifically altered my Base Model to improve fairness, and identified some important risk considerations that lenders should consider if pursuing this approach to mitigate credit model disparate impact risk.
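For readers unfamiliar with how these debiasing techniques actually modify model training, the following is a minimal sketch of one common form of fairness regularization: the usual log-loss objective is augmented with a penalty on the gap in average estimated PDs between the protected class and control groups, so that the fitted weights are pulled toward more equal group-level scores. This is an illustration under my own assumptions (a simple logistic model, a squared-gap penalty, and arbitrary tuning values) - not the exact algorithm used to generate the LDA Models discussed in this series:

    import numpy as np

    def fairness_regularized_logit(X, y, group, lam=5.0, lr=0.1, n_iter=2000):
        """Fit a logistic default model whose loss = log-loss + lam * (gap in
        mean estimated PD between group==1 and group==0) ** 2.
        Illustrative sketch only - not the author's exact AD configuration."""
        n, k = X.shape
        w, b = np.zeros(k), 0.0
        n1, n0 = (group == 1).sum(), (group == 0).sum()
        for _ in range(n_iter):
            pd_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # estimated PDs
            gap = pd_hat[group == 1].mean() - pd_hat[group == 0].mean()
            c = np.where(group == 1, 1.0 / n1, -1.0 / n0)      # d(gap)/d(pd_hat_i)
            d_gap = c * pd_hat * (1.0 - pd_hat)                # chain rule through the logit
            grad_w = X.T @ (pd_hat - y) / n + 2.0 * lam * gap * (X.T @ d_gap)
            grad_b = (pd_hat - y).mean() + 2.0 * lam * gap * d_gap.sum()
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    # Usage sketch: w, b = fairness_regularized_logit(X_train, y_default, race_flag)
    # Larger lam values trade predictive fit for smaller group-level PD gaps.

Adversarial debiasing pursues the same end through a different mechanism - a second model is trained to predict group membership from the credit model's outputs, and the credit model is penalized whenever that adversary succeeds.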

However, with further analysis and thought, I have come to realize that this earlier assessment was incomplete, as it excluded one very important consideration - whether the foundational premise of algorithmic debiasing is actually true. That is, my analyses were performed using AD proponents' standard model fairness and accuracy metrics (AIR and AUC, respectively) - metrics that are foundational to the resulting "low-cost" fairness-accuracy tradeoff underlying proponents' calls for regulatory endorsement of proactive LDA Model search and adoption.[1]

But are these the right metrics for lenders concerned about potential disparate impact risk - considering the high-stakes, highly-regulated environment in which they operate?

And, if not, what are the implications for lenders who adopt LDA Models based on such metrics?

Last month, I explored the "accuracy" side of this foundational premise in Fool's Gold 2: Is There Really A Low-Cost Accuracy-Fairness Trade-off? - pointing out that there are actually two distinct model accuracy metrics important for effective credit risk management. And, once both accuracy metrics are considered, improvements in AIR-based model fairness can result in much larger reductions in model accuracy - thereby calling into question the validity of AD's foundations, and highlighting important safety-and-soundness risks of which lenders should be aware.

In this third Fool's Gold post, I turn my attention to what AD's "improved fairness" actually means at a more fundamental level. Some may think the answer is rather obvious as the Adverse Impact Ratio ("AIR") - the typical AD fairness metric - measures the relative approval rates of protected class and control group applicants. However, I am actually seeking a deeper understanding than that. In particular, I want to explore the following:

  • Conceptually, what type of "unfairness" is AD really addressing? And how does this unfairness concept reconcile with traditional concerns with, and measurement approaches for, credit model disparate impact discrimination?


  • Analytically, how exactly does AIR-based AD impact individual credit decisions to improve approval rate equity? What are the characteristics of applicants who are newly-approved or newly-declined under the LDA Model? And how do the characteristics of these "swap sets" align with the traditional focus on non-causal or improperly-specified model attributes, or model predictive bias, that drive disparate impact?


  • From a risk management standpoint, based on the "swap set" analysis, do AIR-based LDA Models expose lenders to any potential legal, compliance, or reputational risks that - while unintended - may require the development and implementation of new risk assessment and mitigation processes for compliance officers and other senior management?

More generally, beyond exploring these specific questions, this post seeks to advance the important goal of AI credit model explainability and transparency - but this time from a fairness perspective. In my opinion, given the highly-complex, high-stakes environment in which these LDA Models operate, transparency arising solely from global model explainability techniques and algorithmically-derived adverse action reasons is no longer sufficient for proper model governance and responsible use. What is now needed is a similar degree of transparency into how LDAs globally achieve improved model-level fairness, and greater insights and explanations for the local credit decision changes experienced by individual applicants impacted by the LDA. With this additional transparency,

  • A lender may gain comfort with the specific credit access barriers identified by AD and the corresponding LDA Model mechanisms designed to reduce or eliminate the resulting disparate impact, or


  • A lender may discover the need to mitigate some very real risks created by the credit decision "overrides" latently inserted into the LDA Model's complex structure that may not conform to traditional notions of disparate impact - or to the lender's risk and compliance policies.


And, with that, let's dive in.

What Type of Unfairness is AD Really Addressing?

In last month's post, I introduced the concept of "calibration accuracy" - a metric capturing a credit scoring model's ability to predict accurately the specific default rate levels of applicants with different credit risk profiles. In Figure 1 below, I show my Base Model's "calibration curve" (the green line) - a visual device that displays a credit model's calibration accuracy over the range of credit risk profiles within its development / training sample.


LDA Credit Model Calibration Accuracy

Within this chart, the horizontal axis measures an applicant group's average estimated default rate ("PD") and the vertical axis measures its associated average actual default rate. Points that are further right on the horizontal axis (further up on the vertical axis) correspond to applicants with higher estimated (actual) default rates, and vice versa. The thin black diagonal line is a reference line indicating equality (or perfect calibration) between the applicants' PDs and actual default rates. Accordingly, if the Base Model were perfectly calibrated, the green calibration curve would lie on top of the 45 degree black reference line - indicating that average PDs for all credit risk profiles are exactly equal to their average actual default rates.

Overall (i.e., across the entire training sample), the Base Model exhibits zero calibration error as the average estimated PD of 7.86% is calibrated exactly to the average actual default rate of 7.86%. However, since the green calibration curve does not lie exactly on the black 45 degree line, such perfect global calibration does not hold locally across all credit risk sub-segments. In fact, for some credit risk profiles (i.e., above the 45 degree line), the model underestimates actual default rates while for other credit risk profiles (i.e., below the 45 degree line), the model overestimates actual default rates. When aggregated across the entire sample, however, the underestimations offset the overestimations to yield zero global model calibration error.
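For readers who want to reproduce this type of diagnostic on their own models, both the calibration curve and the global calibration error can be computed in a few lines of code. The sketch below bins applicants by estimated PD and compares each bin's average estimated PD to its average actual default rate; the column names and ten-bin choice are my own assumptions, not the exact construction behind Figure 1:

    import numpy as np
    import pandas as pd

    def calibration_curve(pd_hat, defaulted, n_bins=10):
        """Bin applicants by estimated PD and compare each bin's average
        estimated PD to its average actual default rate."""
        df = pd.DataFrame({"pd_hat": pd_hat, "defaulted": defaulted})
        df["bin"] = pd.qcut(df["pd_hat"], q=n_bins, duplicates="drop")
        return df.groupby("bin", observed=True).agg(
            avg_estimated_pd=("pd_hat", "mean"),
            avg_actual_default=("defaulted", "mean"),
            n_applicants=("pd_hat", "size"),
        )

    def global_calibration_error(pd_hat, defaulted):
        """Overall average estimated PD minus overall actual default rate
        (zero for the Base Model: 7.86% vs. 7.86%)."""
        return np.mean(pd_hat) - np.mean(defaulted)

Plotting avg_actual_default against avg_estimated_pd, with a 45 degree reference line, yields a calibration curve of the type shown in Figure 1.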

OK, but what about fairness? Does each demographic group also have zero global calibration error?

No.

Figure 2 below shows the Base Model's calibration curves calculated separately for Black (blue line) and White (gold line) applicant groups.

LDA Credit Model Calibration Accuracy by Race

Here, we see something very different. Although the global model calibration error across the whole training sample is zero (see Figure 1), the global model calibration error for each demographic group is NOT zero and, in fact, varies significantly. Specifically, Figure 2 shows that Black applicants experience much greater model calibration error than White applicants - as displayed visually by the greater departure of the blue calibration curve from the 45 degree line.


Ah, I see. So these different calibration errors are the type of model unfairness (or disparate impact) that algorithmic debiasing is designed to address?

Actually, no.

Looking more closely at these two calibration curves, we can see that the Base Model is fairly well-calibrated for White applicants (the gold curve) with initially close calibration at the lowest PD levels and modest fluctuations around the 45 degree reference line at higher PD levels - indicating no significant global model bias towards over- or under-estimation. In fact, the Base Model overestimates the White average default rate by a mere +25 basis points (i.e., 7.45% estimated vs. 7.20% actual) or by +3.5% on a relative basis. However, for Black applicants (the blue curve), we see something entirely different. Overall, the Base Model actually underestimates the Black average default rate by a whopping -397 basis points (i.e., 14.45% estimated vs. 18.43% actual - rounded) or by -21.6% on a relative basis - and this underestimation is consistent for all credit risk profiles except the very riskiest.

Furthermore, if we focus specifically on those applicants within the Base Model's approval region (i.e., to the left of the green reference line), we see similar group-level model calibration differences - with the Base Model overestimating average default rates on approved White applicants by +26 basis points (i.e., 3.46% estimated vs. 3.20% actual) or by +8.1% on a relative basis and underestimating average default rates on approved Black applicants by -338 basis points (i.e., 4.11% estimated vs. 7.49% actual) or by -45.1% on a relative basis.
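The group-level comparisons above follow the same logic - the calibration error is simply computed separately by race and, for the approval-region figures, restricted to applicants whose estimated PDs fall below the approval threshold. A minimal sketch (again with illustrative column names):

    import pandas as pd

    def group_calibration_report(pd_hat, defaulted, race, approval_threshold):
        """Average estimated vs. actual default rates by group, both overall
        and within the model's approval region (pd_hat below the threshold)."""
        df = pd.DataFrame({"pd_hat": pd_hat, "defaulted": defaulted, "race": race})
        segments = [("all applicants", df),
                    ("approved applicants", df[df["pd_hat"] < approval_threshold])]
        rows = []
        for label, sub in segments:
            for grp, g in sub.groupby("race"):
                est, act = g["pd_hat"].mean(), g["defaulted"].mean()
                rows.append({"segment": label, "race": grp,
                             "avg_estimated_pd": est, "avg_actual_default": act,
                             "calibration_error_bps": 1e4 * (est - act)})
        return pd.DataFrame(rows)

Applied to my Base Model and training sample, this type of report is what produces figures like the +26 / -338 basis point differences cited above.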

Hmmm - now I'm confused. This analysis indicates that the credit scoring model underestimates the credit risk of Black applicants. Aren't credit scoring models biased against protected class applicants - which is why AD is advocated for virtually all such models? Isn't this the exact opposite of that? How can this be so?

Well, there are two responses to these questions.

  • First, while the underestimation of protected class credit risk is not a guaranteed outcome for all consumer credit scoring models, it happens frequently enough that one should not simply assume the opposite - that is, that a credit model necessarily disfavors protected class applicants.[2] Instead, lenders should always test their specific models to assess what kind of predictive biases may be present.


  • Second, when one hears that credit scoring models are "biased" against protected class applicants, it is important to understand what type of bias is being referenced.


Let's explore both of these points further.

Point 1: Testing For Credit Model Bias

In general, when a lender has a loan performance dataset in which the protected class group exhibits a higher default rate than the corresponding control group, a credit scoring model built on this dataset will likely produce default rate estimates that underestimate the higher default rate for the smaller protected class group, and overestimate the lower default rate on the much larger control group. This is because the model will be calibrated to the blended default rate of the total sample (which lies between the two groups' individual default rates), and because typical credit scoring models fail to capture all the variability in observed default rates across borrower groups. Figure 3 below shows this pattern in my HMDA-based credit scoring model (as discussed above).

AI Credit Model Calibration Accuracy Metrics
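A quick back-of-the-envelope calculation shows why the pooled model's calibration target sits where it does. Treating the training sample, for simplicity, as composed of just these two groups, and using the sample composition from Endnote 3 (a 5.8% Black share) together with each group's actual average default rate from the analysis above:

    # The pooled Base Model calibrates to the sample-weighted (blended) default rate,
    # not to each group's own rate - pulling the higher-default group's PDs down
    # and the lower-default group's PDs up.
    black_share, white_share = 0.058, 0.942        # training-sample composition (Endnote 3)
    black_default, white_default = 0.1843, 0.0720  # actual average default rates by group

    blended = black_share * black_default + white_share * white_default
    print(f"{blended:.2%}")   # ~7.85% - essentially the Base Model's 7.86% overall calibration target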

Traditionally, many fair lending compliance professionals interpreted this favorable calibration accuracy bias as meaning that the credit scoring model favors the protected class group since: (1) on average and across relevant sub-groups, their credit risk tends to be underpredicted by the model (and the credit risk of the control group is overpredicted), and (2) their approval rate (and AIR) tends to be higher than it would be without this underprediction. Accordingly, such results traditionally formed the lender's evidentiary basis for the model's lack of disparate impact against protected class applicants.

To further demonstrate this latter point empirically, consider the two calibration curves in Figure 4 below where, counterfactually, each demographic group's PDs are estimated using a separate Base Model (i.e., a White-only Base Model and a Black-only Base Model containing the same predictive attributes).

AI Credit Model Calibration Accuracy by Race

With a separate Base Model for each demographic group, PDs are now calibrated to each group's specific default behavior (rather than the groups' blended default behavior under the single Base Model) - thereby producing zero global calibration error for each group as shown visually in Figure 4 and numerically in the third column of Figure 5 below. Now, compared to the traditional Base Model that combines both demographic groups together in a racially-blind manner (Figure 3), treating each group separately according to its specific default behavior results in:

  • Globally accurate PD estimates for each group and, therefore, no overall model biases,


  • A significantly lower loan approval rate and AIR for Black applicants - specifically, a -20 percentage point lower approval rate (i.e., 58.2% vs. 78.3%) and a -0.24 reduction in AIR value (i.e., 0.63 vs. 0.87), and


  • A slight improvement in the loan approval rate for White applicants (i.e., 91.9% vs. 90.5%).


""

So what does this all mean in terms of whether credit models are inherently biased against protected class groups?

As shown in the above analysis, and contrary to conventional narratives, traditional credit scoring models can, and frequently do, underestimate the credit risk of certain protected class groups - leading to protected class approval rates and AIR fairness metrics that are higher than if those models were calibrated to each group's specific default behavior. Traditionally, many compliance professionals interpreted these results as evidence against disparate impact risk since the protected class group was impacted favorably by the credit model's predictive bias.[3]

Additionally, this analysis indicates that even though a credit scoring model's AIR fairness metric may fall below common thresholds (i.e., 0.8 or 0.9), such models may still favor certain protected class groups by underestimating their observed default rates and increasing their approval rates from levels that would otherwise be obtained if one were to calibrate each group's PDs to its specific default behavior (Figure 5).
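For reference, the AIR itself is just a ratio of group-level approval rates, and the values discussed above can be reproduced directly from the approval rates reported in Figures 3 and 5:

    def adverse_impact_ratio(protected_approval_rate, control_approval_rate):
        """AIR = protected class approval rate / control group approval rate."""
        return protected_approval_rate / control_approval_rate

    # Single pooled Base Model (Figure 3): Black 78.3% vs. White 90.5% approval rates
    print(round(adverse_impact_ratio(0.783, 0.905), 2))   # -> 0.87

    # Separate group-calibrated models (Figure 5): Black 58.2% vs. White 91.9%
    print(round(adverse_impact_ratio(0.582, 0.919), 2))   # -> 0.63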

So, given these results, why is there a presumption that most credit scoring models need algorithmic debiasing (i.e., a search for LDA Models)?[4]

The answer to this question hinges crucially on how AD proponents define "bias".

Point 2: Credit Model Bias vs. Modern-Day Disparate Impact

Building on the discussion above, traditional disparate impact analysis typically focuses on two potential drivers of credit model "unfairness":

  • The identification of specific predictive attributes whose causal connections to borrower default behavior are considered questionable, and whose high correlations with applicant demographics adversely impact the protected class's estimated PDs relative to its corresponding control group. This driver of disparate impact is effectively an attribute bias (or proxy bias) - not a model bias, and the typical cure is the removal of such attributes from the model training dataset.[5]


  • The identification of differential model predictive accuracy (model bias) that adversely impacts protected class applicants (i.e., overestimates their average PDs relative to their control groups) - even in the presence of purely causal predictive factors tied specifically to the lender's objective underwriting criteria and grounded empirically in borrower repayment behaviors (i.e., no attribute bias). The cure for such model bias typically is the exploration of alternative model specifications - including additional causal predictive factors - that significantly reduce or eliminate such bias.


However, with the emergence of machine learning technologies and the growing complexity of credit models, an increasing chorus of voices - including many technologists and regtech entrepreneurs - has advocated for a different type of disparate impact analysis. In their view,

  • Attribute-level analysis is no longer feasible in "cutting-edge" ML-based credit models with hundreds, if not thousands, of variables and complex interaction terms.[6]


  • Model bias - as defined in the previous section - is no longer relevant for measuring disparate impact. In fact, rather than focusing on the identification of model flaws that create differential predictive accuracy, this view shifts its focus to group-level differences in the model's PD estimates - even though such differences are primarily driven by unequal group-level model input distributions.[7] This then leads to an odd mitigation process in which the AD solution does not focus on specific problematic model inputs driving the unequal PD distributions, but instead mutes these input effects indirectly through complex model changes impacting potentially a broader swath of the model.[8]


  • The "business necessity" justification for individual predictive attributes is now a moot point (it was a central element of traditional credit scoring disparate impact analysis). Now, any statistically-based change in such attributes' weights that generate LDA Models with more aligned approval rates (with "acceptable" decreases in the model's relative predictive accuracy) would override this causally-based justification - even for such causal and commonly-used credit attributes as debt-to-income ratios, recent bankruptcies, recent serious delinquencies, etc.


Indeed, this modern-day distinction between model bias and disparate impact was recently well-summarized in the following social media post focusing on how one should measure credit model discrimination:

"When considering the right metric to use, I separate them into measuring one of two things: model bias and model disparate impact. Model bias is where the model systematically under or over-predicts risk for a group of borrowers. What we see in practice is that models built by good modeling teams rarely show much bias. To the extent that they do, it typically does not disfavor Black or Hispanic borrowers.
Disparate impact, on the other hand, captures historical and present patterns of discrimination, and we see evidence of it in almost all consumer credit models. Disparate impact is where a model gives, on average, worse scores to one group despite the model only including otherwise reasonable factors. ... we should focus on measures of disparate impact because minimizing these minimizes the patterns of discrimination that we see in consumer credit and housing."

Consistent with my points above, this view holds that even though a credit scoring model may be biased in favor of certain protected class applicants - thereby yielding approval rates and AIR values that are higher than ones calibrated specifically to their true underlying credit risk - such models would still be subject to LDA Model search if they generate average PD estimates for such applicants that are higher than those of their corresponding control groups.[9]

Accordingly, in the Base Model analysis above, even though Black applicants receive an average PD of 14.45% (Figure 3) which is -397 basis points lower (i.e., favorable) than their underlying average default rate, the model would still be deemed unfair and therefore likely require algorithmic debiasing because this favorable average PD (14.45%) is still higher than the average PD for White applicants (7.45%).[10]

While some AD proponents may argue that this is not a new interpretation of fairness as it relates to disparate impact theory, I believe it is. And, beyond the definitional semantics, it raises the following more practical questions:

What is actually happening at the individual applicant level? Are the LDA Model's changes to individual credit decisions really reducing or eliminating the types of improper credit access barriers that disparate impact law and regulation were designed to address?

And, if not, do such models create unintended legal, compliance, or reputational risks for lenders?

Let's find out.


How Exactly Does AIR-based Debiasing Impact Individual Credit Decisions to Improve Approval Rate Equity?

Given what we learned above about AD proponents' distinction between model bias and disparate impact, I now take a closer look at what an LDA Model's "improved fairness" actually means at the individual applicant level - focusing specifically on the following questions:

  • What are the characteristics of applicants who are newly-approved and newly-declined under the LDA Model?


  • Do the characteristics of these credit profiles reflect the type of questionable non-causal attributes that may reduce credit access to otherwise credit-qualified applicants?


  • Based on these results, do AIR-based LDA Models expose lenders to any potential legal, compliance, or reputational risks that may require appropriate mitigation from compliance officers and other senior management?


To answer these applicant-level questions, I take a closer look at the "swap set" of credit decisions created by the AIR-based LDA Model - that is:

  • Applicants approved by the Base Model but now denied by the LDA Model (the "swap-outs"), and


  • Applicants denied by the Base Model but now approved by the LDA Model (the "swap-ins").


These swap sets are valuable debiasing artifacts as they represent, at a micro level, the collection of specific applicants whose estimated PDs were altered most significantly by the debiasing algorithm to better align the relative PDs (and approval rates) of Black and White credit applicants.[11]

Wait - why is this analysis necessary?

I get that a credit model may already exhibit a favorable bias toward protected class applicants. But what's wrong with the modern disparate impact view? Aren't more equalized approval rates across demographic groups a good thing and, therefore, a sufficient reason to support regular LDA Credit Model searches despite any existing favorable model bias?

In my opinion, no.

It is insufficient due diligence for a lender in a high-stakes, highly-regulated area like consumer lending to adopt a complex, algorithmically-altered credit model simply based on surface-level improvements in high-level credit outcome comparisons.

Instead, as lenders incorporate fairness regularization into the credit model development process, model transparency and explainability need to expand further to provide key stakeholders with robust and accurate information as to how the LDA Model specifically achieves its improved fairness performance both globally and locally. Such transparency helps these stakeholders evaluate whether the bases for expanded protected class approvals (as well as the likely reduction in control group approvals) are consistent with their fair lending compliance objectives as well as with applicable laws, regulations, company policies, and company values. Not only is this prudent, but it is also consistent with the model risk management principles embedded in long-standing bank regulatory guidance.[12]

So how can swap sets provide this enhanced transparency?

By analyzing the credit profiles of applicants whose credit decisions are changed by the LDA Model, we can better understand how the debiasing algorithm altered the Base Model's estimated credit risk relationships to improve the AIR-based fairness metric.

For example, Figure 5 below segments my total swap set graphically with three primary "clusters" identified by the red and green ovals.[13] Each swap set member's Base Model PD is plotted on the horizontal axis, and its LDA Model PD is plotted on the vertical axis. Points shaded darker represent a larger number of applicants with the same Base Model-LDA Model PD combination, and vice versa.

LDA Credit Model Swap Set Analysis

I have also added two reference lines to this diagram. The vertical green dashed line represents the Base Model's PD approval threshold (i.e., all applications with Base Model PD estimates less than this threshold would be approved), while the horizontal red dashed line represents the LDA Model's PD approval threshold. These PD approval thresholds were selected to yield an overall 90% approval rate - which is retained from my original Fool's Gold article and represents the assumed target approval rate used for credit risk management.

These two PD approval thresholds create four quadrants of which two are relevant for our swap sets: (1) the Upper Left quadrant that reflects 23,719 applicants approved under the Base Model but denied under the LDA Model (the "swap-outs"), and (2) the Lower Right quadrant that reflects 24,870 applicants who were denied under the Base Model but approved under the LDA Model (the "swap-ins"). Based on my analysis of the swap set members in these two quadrants, I identified three relevant swap set segments (circled above in Figure 5 and summarized below in Figure 6).

""

Below, I analyze these three swap set segments in further detail to gain more transparency into how the LDA Model specifically achieves its improved fairness performance.

Swap-Outs

Segment 1: LDA Denial of Relatively Low CLTV Borrowers

The first swap set segment, in red in the upper left of Figure 5, consists of 23,719 applicants who were approved under the Base Model with average estimated PDs of 5.0%, but denied under the LDA Model with average estimated PDs of 20.2%. As can be seen in Figure 6:

  • All of these swap outs have CLTVs between 80-85% - considered a relatively low-risk level and one of the Base Model attributes most changed by the AD process (see Figure 10a from my original Fool's Gold article).


  • The LDA Model increased the PD estimates on this segment over 4x from 5.0% to 20.2% - a level far exceeding its observed default rate of 5.4%. This not only significantly overestimates these applicants' true credit risk, but also directly leads to what many would consider to be improper denial decisions from a credit perspective.


  • While these swap-outs adversely impact both Black and White applicants, the denials are disproportionately skewed toward White applicants (94% of this swap set) as compared to the other two swap set segments whose LDA approvals are disproportionately skewed towards Black applicants (i.e., only 85-90% White).


Swap-Ins

Segment 2: LDA Approval of Very High CLTV Borrowers

The second swap set segment, in green in the lower right of Figure 5, consists of 5,477 applicants who were denied under the Base Model with average estimated PDs of 61.4%, but approved under the LDA Model with average estimated PDs of 9.0%. As can be seen in Figure 6:

  • All but one of these applicants have CLTVs > 95% - considered a very high risk level and one of the Base Model attributes most changed by the AD process (see Figure 10a from my original Fool's Gold article).


  • The LDA Model decreased the PD estimates on this segment by -85% from 61.4% to 9.0% - a level far below its observed default rate of 62.1%. This not only significantly underestimates these applicants' true credit risk, but also directly leads to what many would consider to be improper approval decisions from both a credit and compliance perspective - granting credit to an applicant segment with a history of high defaults (62.1%) that may be considered predatory and unsafe / unsound.


  • While these swap-ins favorably impact both Black and White applicants, the approvals are disproportionately skewed toward Black applicants (15.5% of this swap set) as compared to Segment 1 whose swap outs are only 6% Black.


  • While the incremental LDA denials from Segment 1 had approximately the same adverse impact on Black and White approval rates (see the last two rows in Figure 6), the incremental approvals from this segment disproportionately favor Black applicants as their approval rate increases by +3.2 percentage points - nearly 3x the increase for White applicants.


Segment 3: LDA Approval of High CLTV Borrowers

The third swap set segment, in green in the lower left of Figure 5, consists of 19,393 applicants who were denied under the Base Model with average estimated PDs of 12.2%, but approved under the LDA Model with average estimated PDs of 9.0%. As can be seen in Figure 6:

  • Most of these applicants have CLTVs between 90-95% - considered a high risk level - but mitigated somewhat by relatively larger loan sizes and lower DTI levels.[14]


  • The LDA Model decreased the PD estimates on this segment by -26% from 12.2% to 9.0% - a level 33% below its observed default rate of 13.7%. This not only significantly underestimates these applicants' true credit risk, but also directly leads to what many might consider to be improper approval decisions from both a credit and compliance perspective - granting credit to an applicant segment with a history of above average defaults (4x the level of defaults associated with Base Model approvals) that may be considered predatory and potentially unsafe / unsound.


  • While these swap-ins favorably impact both Black and White applicants, the approvals are disproportionately skewed toward Black applicants (10.6% of this swap set) as compared to Segment 1 whose swap outs are only 6% Black.


  • While the incremental LDA denials from Segment 1 had approximately the same negative impact on Black and White approval rates (see the last two rows in Figure 6), the incremental approvals from this segment disproportionately favor Black applicants as their approval rate increases by +7.7 percentage points - nearly 2x the increase for White applicants.


Based on the Swap Set Analysis, Do AIR-Based LDA Models Expose Lenders to Any Potential Legal, Compliance, or Reputational Risks?

OK, so now that we have more transparency into how the AIR-based AD process changes the Base Model and its individual credit decisions, what are the main take-aways from a risk management perspective?


""

While AIR-based AD may improve the Base Model's relative approval rates, this "fairness" improvement may NOT be due to the lender's remediation of improper credit access barriers typically associated with disparate impact. Instead, the LDA Model may simply act as a tool for affirmative, policy-driven credit access expansion targeted to certain protected class groups via a latently-encoded reverse disparate impact - thereby exposing the lender and its consumers to certain unintended risks.

As discussed previously, traditional credit model disparate impact analysis focused, in part, on specific model attributes that were highly correlated with applicant demographics and whose causal connections to an applicant's credit performance were considered questionable.[15] From a fair lending perspective, the primary concern was that such attributes would improperly penalize protected class applicants' credit access and borrowing costs by overestimating their credit risk relative to: (1) their historical default behavior, or (2) credit risk estimates based on more "legitimate" credit risk attributes linked more directly and causally to their repayment behavior.

In the swap set analysis performed on my LDA Model, we learned that the AD process achieved its AIR-based fairness improvements NOT by mitigating the impact of an improper predictive attribute on the estimated PDs of otherwise qualified applicants, but by artificially making certain unqualified applicants appear qualified, and vice versa. For example, high risk applicants with CLTVs > 95% (which skew demographically to the protected class and were denied under the Base Model) are now approved by the LDA Model by assigning them artificially and significantly lower PDs (i.e., 9% on average vs. 61.4% under the Base Model - and an actual average default rate of 62.1%). And to keep the overall approval rate at the lender's desired 90% level and to alter the relative approval rates in the desired demographic direction, the algorithm then offset these additional LDA approvals with denials of an approximately similar number of lower risk, qualified applicants with CLTVs between 80% and 85% (which skewed demographically to the control group). These denials were "justified" by assigning them artificially and significantly higher estimated PDs (i.e., 20.2% under the LDA Model vs. 5.0% under the Base Model and an actual average default rate of 5.4%).[16]

In none of these swap set segments was the LDA Model mitigating the effects of a model failure or a questionable, non-causal attribute that was distorting the applicants' true credit qualifications based on observed repayment performance. Rather, the LDA Model implemented credit decision overrides for applicants with credit profiles exhibiting certain demographic correlations in order to achieve the AD objective of improved approval rate equity.[17] In essence, the LDA Model implemented a latent ECOA-like Special Purpose Credit Program - but without the explicit identification of the target group and without the formal underlying legal structure required by law and regulation (among other things).[18]

Furthermore, approving unqualified applicants as a means to manage disparate impact risk is ostensibly inconsistent with regulator / enforcement views on appropriate fair lending policy - as most recently stated by Assistant Attorney General Kristen Clarke at an industry conference. According to AAG Clarke:

"Lenders should proactively and consistently review and test their underwriting process – including their automated steps – to ensure that credit decisions do not turbocharge discrimination by disproportionately rejecting loans to applicants of color for reasons unrelated to creditworthiness." (emphasis mine)

This is why LDA credit model transparency from a fairness perspective is so important. For many lenders, AD's promise as a tool to eliminate illegal lending discrimination makes it too good to turn down. However, because AD itself is based on complex machine learning algorithms, and is applied - in many cases - to complex machine learning-based credit models, and because there has been a dearth of publicly available research focused on its potential risks, many lenders take it on faith that the resulting LDA Models are addressing directly the types of fair lending risks that the AD proponents publicly tout (i.e., credit access barriers). But this isn't necessarily so and, in my view, may lead to future problems as these models become better understood by consumers, regulators, and private litigants.

""

AIR-based debiasing processes may create LDA Models that expose lenders to UDAAP claims, predatory lending allegations, and/or safety and soundness risk.


According to the swap set analysis, the LDA Model may be approving higher-defaulting (i.e., unqualified) applicant segments to improve the AIR-based fairness metric. However, without expanded fairness transparency, the lender may inadvertently expose itself to heightened legal, compliance, safety and soundness, and reputational risks for targeting such applicants with loans that many may not be able to repay. For example, my LDA Model swapped in 5,477 applicants with CLTVs > 95% for approval despite such applicants having a 62.1% historic default rate. While, in the real world, a lender's risk management team would likely (and sensibly) identify and potentially prevent such blatantly improper approvals, this safeguard becomes much more difficult to execute in models with hundreds or thousands of predictive variables and where no swap set transparency analysis of this type is performed.

""

LDA Models may counterintuitively perpetuate diminished credit access to underserved populations.

While improved approval rate equity may be a strong driver for a lender's AIR-based LDA Model adoption, this analysis suggests that lenders should also consider the longer-term potential for certain LDA Models to perpetuate the very societal problem that they are trying to address. Continuing the discussion from the previous take-away, while LDA Model swap-ins help to improve approval rate equity, if these swap-ins (like my Segment 3 and, especially, Segment 2) are not otherwise creditworthy, then a high percentage of these newly-approved borrowers may default on their loans - leading to impaired credit reports and an extended future period of diminished and higher-priced credit access.

""

Because of AD's dual competing objectives (i.e., improving outcome fairness while still accurately differentiating defaulters from payers), standard approaches for determining Adverse Action reasons may yield denial reasons that are not compliant with ECOA.

This is because the LDA Model produces some lower-risk denials and some higher-risk approvals solely to improve the model's fairness performance. Accordingly, standard methods to determine ECOA denial reasons may attribute certain credit model attributes as the reason(s) for these lower-risk denials when, in fact, the only role played by such attributes in this decision was their correlation with the applicant pool's demographics.

For example, in my LDA Model, 23,719 applicants - approved under the Base Model with an average PD of 5% - are denied under the LDA Model with an average PD of 20.2% (see Segment 1 in Figure 6 above). Standard explainability processes would attribute this denial to the applicant's CLTV attribute since the LDA Model significantly increased the risk of CLTV 80-85% applicants. That is, while CLTV 80-85% applicants were estimated as only 19% as risky as CLTV > 95% applicants in the Base Model, they artificially became 167% riskier than CLTV > 95% applicants in the LDA Model - thereby driving many of their estimated PDs to levels above the lender's approval threshold.

However, the higher estimated PDs for these applicants have nothing to do with their repayment behavior; in fact, their repayment behavior (i.e., average default rate) is 90% better than the repayment behavior of applicants with CLTVs > 95% which the LDA Model now approves. Accordingly, attributing these applicants' credit denial decisions to their CLTVs may not be considered accurate as they were, in reality, denied due to the racial composition of their credit profile as part of a rough justice swap set to improve the lender's approval rate equity.
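To make this concern concrete, the sketch below shows a simplified version of a common contribution-ranking approach to adverse action reasons for a linear scorecard: rank the attributes by how much they push the applicant's score above that of a reference, well-qualified profile, and report the largest adverse contributors. All attribute names, coefficients, and profiles are hypothetical; the point is simply that this logic will mechanically cite CLTV for a Segment 1 denial even when the elevated CLTV weight was introduced by the debiasing step rather than by observed repayment behavior:

    import numpy as np

    def adverse_action_reasons(applicant_x, reference_x, coefs, feature_names, top_k=2):
        """'Points below reference' logic: attribute a denial to the features
        contributing most to the applicant's score shortfall versus a
        reference (well-qualified) profile, given a linear / logit score."""
        contributions = coefs * (np.asarray(applicant_x) - np.asarray(reference_x))
        order = np.argsort(contributions)[::-1]       # largest adverse contribution first
        return [feature_names[i] for i in order[:top_k] if contributions[i] > 0]

    # Hypothetical post-debiasing weights: the CLTV 80-85% indicator now carries a
    # large positive (risk-increasing) coefficient even though this band's actual
    # default rate is low.
    names = ["cltv_80_85", "dti_over_43", "low_loan_amount"]
    coefs_lda = np.array([1.7, 0.6, 0.3])
    applicant = np.array([1, 0, 1])       # a Segment 1-style applicant
    reference = np.array([0, 0, 0])       # reference "most qualified" profile
    print(adverse_action_reasons(applicant, reference, coefs_lda, names))
    # -> ['cltv_80_85', 'low_loan_amount']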

""

LDA Models may expose lenders to the risk of reverse discrimination claims in light of SCOTUS's Students For Fair Admission ("SFFA") decision.

Before I discuss this point, I will remind you that I am not a lawyer and that you should seek advice from legal counsel. However, in my interactions with certain members of the legal community, I am aware that there are concerns that the SFFA decision could have implications for consumer lending,[19] and that some consumer lenders are evaluating risks related thereto.

Accordingly, to the extent that an LDA Model achieves improved AIR-based fairness via the affirmative, policy-driven swapping in (i.e., approval) of less qualified applicants and the swapping out (i.e., denial) of more qualified applicants, a lender may be at risk of reverse discrimination claims if the SFFA decision is deemed to apply. Additionally, even if a lender is not debiasing its credit models, there could still be risk since, as I discussed previously, many credit models may have an inherent predictive bias in favor of certain protected class groups. I would therefore recommend that lenders take prudent action to evaluate such risks with legal counsel and mitigate them as warranted.

Final Thoughts

As a long-time practitioner in the field of bank regulatory risk management, I have learned that one must be cautious in adopting emerging technologies and analytical tools that promise big benefits, but are short on transparency and rigorous public vetting. The devil is always in the details - and the area of algorithmic debiasing, to me, is another example where we may be getting ahead of ourselves in the rush to address a worthy social justice goal with what appears to be a clear-cut solution made feasible by newfound technological advancements.

Accordingly, given the high-stakes nature of consumer credit decisioning - and the significant financial and social impacts these decisions can have on individuals, families, and communities - we should take the time to ensure that this shiny new technological tool is the real thing. This means asking questions - even those that may be uncomfortable or unpopular - and seeking answers to ensure that the promise of this new solution doesn't have unintended consequences that may do harm as well as good. That's simply the "professional skepticism" nature of scientific inquiry and progress, and the way we truly ensure "responsible" AI.

And that is my goal with this series of articles.

Now, some may dispute the generalization of my findings due to the use of a simple HMDA-based credit scoring model. And they may be right. But we don't know - do we? No alternative analyses have been shared publicly that would add to this quest for transparency or alleviate concerns about unintended risks. And, the fact is that the analyses presented in these three articles - at a minimum - indicate that these unintended risks can be present and can be problematic. That doesn't mean that they are always present, but one doesn't know until one looks.

So my final recommendation is that you look. Consider the importance of your credit model's calibration accuracy when deciding on potential LDA Models. Analyze the swap set members and ask yourself whether the changes in credit decisions caused by your LDA Model make sense to you, are consistent with your goals to reduce illegal discrimination and responsibly expand financial access, and do not create unintended risks of UDAAP, predatory lending, ECOA Adverse Action Notice violations, or reverse discrimination.

And, if you can, share some of your insights with the rest of us so we can build the validation evidence needed to assess whether we are dealing with Fool's Gold and, if unfortunately so, how we can find alternative solutions that are the real thing.


* * *

ENDNOTES:

[1] CFPB Should Encourage Lenders To Look For Less Discriminatory Models, National Community Reinvestment Coalition, April 15, 2022.

[2] I note that this favorable model bias is not an artifact of my simple HMDA-based credit scoring model nor an engineered outcome. Many consumer credit scoring models exhibit an underprediction of certain protected class groups' default rates when their actual average default rates are higher than those of their corresponding control groups and when direct or indirect demographic proxies are kept out of the models. For real-world examples of such credit performance differences, see:

"Student Loan Borrowers With Certain Demographic Characteristics More Likely to Experience Default," January 25, 2023 - a survey conducted by the Pew Charitable Trust.

"Unequal Distribution of Delinquencies by Gender, Race, and Education," Federal Reserve Bank of New York, November 17, 2021.

Furthermore, while the magnitude of these model biases could certainly be exacerbated by the limitation of just three predictive attributes in my Base Model, I note that: (1) two of these attributes - CLTV and DTI - are two of the most important credit risk drivers in mortgage underwriting, and (2) the HMDA-based training sample was designed to be as homogeneous as possible with respect to transaction type (see my original Fool's Gold article). Nevertheless, including additional relevant predictive attributes could certainly reduce the model bias magnitudes; however, in my experience, a fair amount of directionally-consistent bias will still remain.

[3] These results also show that an under-representation of protected class borrowers in a credit scoring model training sample, by itself, does not create an adverse model bias for such applicants - as many in the industry presume. In fact, a favorable model bias is more likely if a larger-sized lower-defaulting control group dominates the model calibration process - thereby potentially producing lower estimated PDs for the smaller-sized higher-defaulting protected class group. This is what we see in my Base Model (Figure 3) where Black borrowers comprise only 5.8% of the model training sample.

[4] “Rigorous searches for less discriminatory alternatives are a critical component of fair lending compliance management,” Ficklin said during a panel discussion at the National Community Reinvestment Coalition’s (NCRC) annual conference. “I worry that firms may sometimes shortchange this key component of testing. Testing and updating models regularly to reflect less discriminatory alternatives is critical to ensure that models are fair lending compliant.” CFPB Puts Lenders & FinTechs On Notice: Their Models Must Search For Less Discriminatory Alternatives Or Face Fair Lending Non-Compliance Risk, National Community Reinvestment Coalition, April 5, 2023.

[5] This driver of disparate impact also includes causal attributes whose specific measurement improperly disadvantages protected class applicants - for example, including income as an attribute rather than debt-to-income ratio.

[6] Rather than conclude that - perhaps - lenders shouldn't deploy credit models containing such a high degree of complexity if they cannot: (1) explain certain predictive attributes (i.e., complex interactions), and (2) effectively evaluate the model for important regulatory compliance risks, these proponents oddly adopt - instead - the position that attribute-level analysis should be largely abandoned in favor of aggregate-level outcome-focused analysis.

[7] In their view, model predictive accuracy only comes into play in evaluating whether an LDA Model still "serves the entity's business needs" (i.e., has sufficient predictive accuracy) - NOT, importantly, in the actual assessment of whether a disparate impact is even present (as is the case for traditional disparate impact testing, which is grounded in differential predictive accuracy (i.e., model bias)).

[8] And, again oddly, AD will incorporate these changes into the LDA Model even if the model attribute contributing most to the PD prediction differences is, in fact, a causal driver of credit repayment behavior (e.g., debt-to-income ratio) and there is no less discriminatory alternative way to measure the attribute relative to this behavior.

[9] Technically, in addition to having higher average PDs than the control group, AD proponents also consider whether the AIR value is lower than a particular threshold value - such as 0.8 or 0.9. Additionally, one potential motivation for this viewpoint is the impact of historical discrimination on borrowers' credit profiles (i.e., making them appear riskier than they actually are). However, if this is the underlying rationale, it calls into question the reliability of any credit model if the underlying data for such models contain pervasive errors of unknown frequency and magnitude.

[10] In fact, if the Base Model were perfectly calibrated to each group's actual default behavior, there would be an even stronger argument for AD under this modern view since the AIR value would be even lower - i.e., 0.63 (without the predictive bias) in Figure 5 vs. 0.87 (with the predictive bias) in Figure 3.

[11] While these are not the only applicants impacted by the debiasing algorithm (all applicants have some change in their estimated PDs), these applicants had significant enough changes to their PDs to reverse their credit decisions from those under the Base Model.

[12] See, for example, the OCC's Model Risk Management Handbook - "Transparency and explainability are key considerations that are typically evaluated as part of effective risk management regarding the use of complex models. The appropriate level of explainability of a model outcome depends on the specific use and level of risk associated with that use. Models applied to significant operations or decisions (e.g., credit underwriting decisions) should be supported by thorough understanding of how the model arrived at its conclusions and validation that it is operating as intended." (emphasis mine)

[13] The k-Means clustering algorithm - or other similar analytical techniques - can be applied to the swap set members to separate them into meaningful segments with distinct profiles.

[14] More specifically, 71.3% of segment members have CLTVs between 90% and 95% and 16.4% have CLTVs between 85% and 90%.

[15] "By 'disparate impact' we mean that a variable’s predictive power might arise not from its ability to predict future performance within any demographic group, but rather from acting as a surrogate for group membership." - "Does Credit Scoring Produce a Disparate Impact?" Board of Governors of the Federal Reserve System, October 12, 2010.

[16] One would think that these PD overrides would cause the LDA Model to be significantly less accurate. However, under AD's traditional model accuracy metric - AUC - the model's rank-order accuracy only decreases by -3.4%. Yet such a small reduction in this accuracy metric masks much larger reductions in the LDA Model's calibration accuracy. Specifically, the LDA Model underpredicts the average default rates of approved loans by -22% (vs. a slight +2.9% overprediction in the Base Model) - and underpredicts default rates on Black approvals by -60% (vs. -45% in the Base Model) and default rates on White approvals by -14% (vs. a +9.4% overprediction in the Base Model). See "Fool's Gold 2: Is There Really a Low-Cost Accuracy-Fairness Tradeoff?" for further discussion of these different model accuracy metrics.

[17] This is, in fact, not so different from Meta's Variance Reduction System which also embeds a reverse disparate impact into the system's mathematical structure. See "Meta's Variance Reduction System: Is This the AI Fairness Solution We've Been Waiting For?"

[18] Whether or not the actions of an LDA Model fit the legal definition of a Special Purpose Credit Program is a question for a lender's counsel.

[19] See, for example, "Affirmative Action in Lending: The Implications of the Harvard Decision on Financial Institutions." Alston & Bird Financial Services & Products Advisory, October 10, 2023 and "CFPB, HUD Risk Litigation Over Fair Lending Enforcement", Mortgage Banker Magazine, March 2024.

© Pace Analytics Consulting LLC, 2023.
