The pressure on lenders to adopt less discriminatory alternative ("LDA") credit models is accelerating - hastened by the CFPB's more intensive examinations of their policies and procedures for disparate impact testing and the search for fairer alternatives.[1] Interestingly, however, despite the growth in algorithmic LDA solutions - as well as public entreaties by multiple advocacy groups for the CFPB to require their use[2] - the CFPB has yet to publicly endorse these technological innovations, stating recently:
"Further, as the CFPB continues to monitor markets and institutions for fair lending compliance, the CFPB will also continue to review the fair lending testing regimes of financial institutions. Robust fair lending testing of models should include regular testing for disparate treatment and disparate impact, including searches for and implementation of less discriminatory alternatives using manual or automated techniques. CFPB exam teams will continue to explore the use of open-source automated debiasing methodologies to produce potential alternative models to the institutions’ credit scoring models." - June 2024 Fair Lending Report of the Consumer Financial Protection Bureau (emphases mine)
What are we to make of this statement?
In my opinion, by referencing both automated and manual LDA search methods - as well as ongoing research into "open-source automated debiasing tools"[3] - the CFPB is communicating - at least publicly - a position of technological agnosticism as to how LDA credit models should be derived. While some may be impatient with the CFPB's reluctance to endorse these new tools - particularly given its oft-stated concerns about algorithmic discrimination - I find this measured approach both admirable and prudent given the dearth of publicly-available, objective research on the potential risks these complex new tools pose to consumers and lenders. And, as readers of this blog know, this caution is well-placed given the potentially serious safety-and-soundness and compliance risks I have found in my own testing.
But even my testing - broad as it has been - hasn't fully explored the totality of risks associated with this complex black-box technology. For example, while my prior research focused on the fundamental technical and legal foundations of these algorithmic fairness tools, it was silent on a critical risk dimension that's now becoming increasingly relevant to many lenders' risk and compliance managers - the questionable robustness and stability of these LDA Models.[4] In fact, with more algorithmically-driven LDA Models moving into production, recent industry chatter suggests that some may not be performing as expected on real-time application data - with lower-than-expected fairness performance and higher-than-expected default rates[5] necessitating risk mitigations and raising the following questions.
Are these adverse performance outcomes simply a matter of significant "data drift" coincidentally encountered at or near the time of LDA Model deployment, or do algorithmically-debiased credit models have more fundamental weaknesses that we have yet to identify?
To explore these important questions, I extend the analytical framework from my previous Fool's Gold research studies to evaluate the inherent stability / robustness of algorithmically-derived LDA Models using standard model validation techniques. For those unfamiliar with the field of model risk management, a "stable" or "robust" model is one whose estimated predictive relationships and overall predictive performance are highly-consistent across different data input samples.[6] This important property of a conceptually and technically sound credit model provides users with confidence that: (1) the model's predictive performance will generalize outside of the training data to which it was calibrated, and (2) its estimated credit risk relationships are fundamentally sound as they are likely grounded in more causal, theoretically-based borrower behaviors, as opposed to reflecting mere statistical artifacts present in the specific training sample used.
Although I walk through my stability / robustness analyses in the main section below with a broad audience in mind, I realize that some readers may simply wish to know the major findings - which I summarize here for convenience.
Common LDA Models relying on outcomes-based fairness metrics - such as the Adverse Impact Ratio ("AIR") - appear to be inherently unstable and brittle in the presence of relatively small random training data differences. As I show below, during LDA Model training, relatively small random variations in the training sample can cause widely different solutions to the algorithmic debiasing ("AD") process for the same Fairness Weight - thereby yielding significantly different: (1) LDA Model structures (i.e., different estimated credit risk relationships), (2) fairness and accuracy performance measures, and (3) primary disparate impact factors (i.e., the credit model attributes whose risk weights are altered the most by the AD process to improve model fairness - much more on this later). In effect, the "multiplicity of models" concept on which algorithmic debiasing conceptually rests operates at even deeper levels than commonly understood - also applying to each of the LDA Models generated by common algorithmic fairness tools at a given Fairness Weight.
This LDA Model brittleness appears to increase as larger Fairness Weights are used in the algorithmic debiasing process. That is, as the Fairness Weight increases to generate LDA Models with increasingly higher AIRs, the estimated credit risk relationships within these models become ever more brittle, unstable, and less conceptually sound. Additionally, the model's AIR and Area Under the Curve ("AUC") performance metrics also become more volatile - creating a risk of misleading fairness and accuracy performance expectations once deployed into production.
LDA Model brittleness also imparts statistical bias into estimated risk coefficients, AIR-based fairness values, and AUC-based accuracy values relative to those estimated from a larger, population-based LDA Model with the same predictive factors. For example, using 100 training data samples randomly drawn from the same population, I found that the average values of these LDA model outputs across these samples were, in general, materially different from those generated by an LDA Model estimated on the whole population. I note that such statistical biases did not exist when these same samples were used to estimate Base Models (i.e., with no algorithmic debiasing).
In addition to these safety-and-soundness risks, my analyses also identified further compliance-related concerns linked to the use of algorithmic LDA Models. Specifically,
AIR-based fairness metrics for Base Models also vary non-trivially in response to small random training data differences. Accordingly, if a lender's fair lending policy requires a search for LDA Models only when the Base Model's AIR falls below a specific threshold, the lender should be cautious in relying on an AIR performance metric derived from a single model training, validation, or test sample.
For a given training sample, the primary disparate impact factors selected by the AD process to improve the LDA Model's fairness performance can vary with the specific Fairness Weight used. Such brittleness in the specific set of model attributes that are "de-biased" to improve fairness raises, in my opinion, a significant concern that these algorithmic fairness tools are not mitigating the impact of specific "artificial, arbitrary, and unnecessary" credit decision factors per relevant federal law, regulation, and Supreme Court opinion.[7] Instead, the AD process appears to select the primary model attributes to "de-bias" simply based on whether they possess the statistical and demographic properties needed to achieve the desired degree of AIR improvement through reverse disparate impact.
While the LDA Model improves AIR-based fairness between demographic groups, a deeper review of the associated credit decision changes reveals that this inter-group fairness improvement comes at the expense of diminished fairness among individuals within demographic groups. For example, in my analysis: (1) Approximately 30% of Black applicants whose credit decisions are changed by the LDA Models are actually "swapped out" - that is, they are denied under the LDA Model even though they were approved by the Base Model. (2) Given the brittleness in the primary disparate impact factors selected at different Fairness Weights, the specific set of Black applicant swap-ins can change notably depending on the specific Fairness Weight used during training. (3) The Black swap-ins selected for approval by the LDA Model may possess risk attributes that are intuitively worse than the risk attributes of the Black swap-outs (e.g., higher CLTVs). Accordingly, while overall group-level fairness may improve with LDA Models, this aggregate improvement abstracts from the significant number of winners and losers among the individuals within the protected class group - many of whom may not view their new credit decisions as fair.[8]
In the next section, I provide a quick summary of my LDA Model analytical framework that regular readers of this blog should recognize. Thereafter, I provide a deeper dive into the analyses supporting each of the key findings above.
Let's dive in.
The LDA Credit Model Analytical Framework
My analyses were performed using the same credit scoring model, underlying consumer credit application data, and algorithmic debiasing techniques I used in my three previous "Fool's Gold" studies. For those unfamiliar with those articles, I recommend starting with Fool’s Gold: Assessing the Case For Algorithmic Debiasing (the "Fool's Gold article") as a primer to the data, methods, and analyses I use here.
In summary, I created a synthetic credit performance dataset from a 2019 sample of residential home mortgage loan applications obtained from public Home Mortgage Disclosure Act ("HMDA") data.[9] As illustrated in Figure 1 below (excerpted from Figures 1 and 2 of the Fool’s Gold article), this synthetic credit performance dataset contains 456,761 total records - 94% of which are White and 6% of which are Black. Additionally, while the overall “default” rate in the dataset is 7.9%, it varies between the two demographic groups with Black borrower default rates over 2.5x that of White borrowers.
The following table labeled as Figure 2 is excerpted from Figure 3 of the first Fool’s Gold article and presents the credit scoring model (the "Base Model") I trained on the full synthetic credit performance dataset via logistic regression.[10] It relies on three primary credit risk attributes - combined loan-to-value ratio (“CLTV”), debt-to-income ratio (“DTI”), and loan amount - all three of which are represented with a series of dummy variables to capture potential non-linearities in their relationships with borrower default behavior. As expected, the estimated odds ratios for each predictive attribute capture a monotonically-increasing risk of default for borrowers with larger CLTV and DTI values and a monotonically-decreasing risk of default for borrowers with larger Loan Amounts.
Using the standard fairness (AIR) and accuracy (AUC) metrics of common algorithmic fairness tools, I obtain an AIR value of 0.865 and an AUC value of 0.843 for this Base Model (plotted in green in Figure 3 below which is excerpted from Figure 9 of the first Fool’s Gold article). However, to improve the model's fairness performance, I apply an algorithmic debiasing methodology commonly referred to as “fairness regularization” (aka “dual optimization”) to the Base Model. In general, this debiasing approach adds an AIR-based "unfairness" penalty to the Base Model's training objective to create a broader definition of model performance that includes both predictive accuracy and fairness. LDA Models can now be trained by optimizing this dual training objective using different "Fairness Weights" (i.e., a number that governs the relative importance of predictive accuracy and fairness in the model's dual training objective).
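To make the mechanics a bit more concrete, here is a minimal sketch of what such a dual training objective can look like for a logistic-regression-style scoring model. This is not the code used in my analyses: the fairness penalty below uses a smooth, sigmoid-based stand-in for the approve/deny decision (the AIR itself is not differentiable), the PD approval cutoff is a purely illustrative constant, and the derivative-free optimizer simply keeps the sketch short. All function and variable names (train_lda_model, dual_objective, cutoff, etc.) are mine.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_objective(beta, X, y, is_black, is_white, fairness_weight, cutoff):
    """Log-loss (accuracy) plus a Fairness-Weighted penalty on the approval-rate disparity."""
    pd_hat = sigmoid(X @ beta)                     # estimated probabilities of default
    eps = 1e-12
    log_loss = -np.mean(y * np.log(pd_hat + eps) + (1 - y) * np.log(1 - pd_hat + eps))

    # Smooth stand-in for the approve/deny decision: applicants with PDs below the
    # cutoff are (approximately) approved. A steep sigmoid replaces the step function.
    approved = sigmoid((cutoff - pd_hat) * 50.0)
    air_proxy = approved[is_black].mean() / (approved[is_white].mean() + eps)

    # Penalize shortfalls of the (smoothed) AIR below parity.
    unfairness = max(0.0, 1.0 - air_proxy)
    return log_loss + fairness_weight * unfairness

def train_lda_model(X, y, is_black, is_white, fairness_weight, cutoff=0.20):
    """Fit model coefficients by minimizing the dual (accuracy + fairness) objective."""
    beta0 = np.zeros(X.shape[1])
    result = minimize(
        dual_objective, beta0,
        args=(X, y, is_black, is_white, fairness_weight, cutoff),
        method="Nelder-Mead",            # derivative-free; fine for a sketch, slow in practice
        options={"maxiter": 50000, "maxfev": 50000},
    )
    return result.x

# A Fairness Weight of 0 reproduces an (unpenalized) Base Model fit; increasing the
# weight trades AUC-based accuracy for AIR-based fairness, as described above.
```

Actual production implementations typically rely on gradient-based optimization with a carefully chosen differentiable surrogate for the fairness penalty, but the structure of the objective - accuracy loss plus a Fairness Weight times an unfairness term - is the same.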
By varying the relative importance of accuracy and fairness in the training process, we obtain a spectrum of LDA Models with increasing AIR-based fairness and, generally, decreasing degrees of AUC accuracy as displayed below in Figure 3.
While many LDA Models are generated for a given training sample using different Fairness Weights, a lender typically selects one specific LDA Model from this set (such as that highlighted in red) in which the improved fairness is achieved with an "acceptable" tradeoff in predictive accuracy.
With this general background on my LDA Model analytical framework, I now turn to how I extended this framework to investigate the stability of the LDA Models derived therefrom.
Exploring LDA Model Performance Stability: Evidence From 100 Training Samples
My 100 Training Samples
For the purposes of my previous analyses, I trained the models and performed the testing using the entire synthetic credit performance dataset (456,761 records). Now, however, for the purpose of evaluating model robustness and stability, my focus is on how the structure and performance of these models vary across multiple random training samples drawn from the original population.[11]
To this end, I created 100 training samples - each of size 100,000 - by randomly sampling my synthetic credit performance dataset 100 times.[12] Given each sample's relatively large sampling rate (i.e., 22% of the full dataset), I expect that the Base Models trained on these 100 samples using identical sets of predictive attributes would exhibit a high degree of structural and performance similarity. This is because each of the 100 training samples should differ only slightly from each other in terms of credit risk profiles and default outcomes - which I show in Figure 4 below using basic sample descriptive statistics.
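For readers who want to replicate this type of resampling exercise, a minimal sketch is below. It assumes the synthetic dataset sits in a pandas DataFrame (here called population), that each draw is taken without replacement, and that the file name and seed handling are purely illustrative.

```python
import pandas as pd

def draw_training_samples(population: pd.DataFrame, n_samples: int = 100,
                          sample_size: int = 100_000, base_seed: int = 42):
    """Draw repeated random samples (each without replacement) from the full
    synthetic credit performance dataset."""
    return [population.sample(n=sample_size, random_state=base_seed + i)
            for i in range(n_samples)]

# Example usage (file name is hypothetical):
# population = pd.read_parquet("synthetic_credit_performance.parquet")  # 456,761 records
# training_samples = draw_training_samples(population)
```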
As expected, the 100 random training samples are, on average, representative of the full synthetic credit population (i.e., 456,761 records) with negligible mean differences (numerical column 2 vs. numerical column 1) and relatively small variability around the sample means (numerical column 3).
Using the 100 Training Samples to Create a Set of Performance Stability Benchmarks
To properly evaluate the LDA Models' structural and performance stability across these 100 training samples, I first create a set of benchmarks that can be used as "apples-to-apples" comparators for the LDA Models' stability tests. More specifically, these benchmarks will help me determine whether the various measures of the LDA Models' structural and performance variability across the 100 training samples are consistent with those of the corresponding Base Models (and, therefore, simply due to the small random differences across the underlying training data), or indicative of something more problematic.
To create these benchmarks, I estimate a set of 100 Base Models - one for each of the 100 training samples - and calculate the following Base Model stability metrics (a brief code sketch of these calculations follows the list):
The variability of the Base Models' estimated credit risk profiles across the 100 samples (i.e., the standard deviations of the estimated coefficients for each risk attribute).
The statistical bias in the Base Models' estimated credit risk profiles (i.e., the difference between the Base Model's average estimated coefficients (across the 100 samples) versus those estimated on the entire synthetic credit performance dataset).
The variability of the AIR-based fairness and AUC-based accuracy metrics across the 100 samples (i.e., the standard deviations of the AIR and AUC metric values).
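Here is a compact sketch of these three benchmark calculations, assuming the 100 sets of fitted Base Model coefficients and their AIR/AUC metrics have already been collected into arrays; the function and argument names are mine.

```python
import numpy as np

def stability_benchmarks(coef_samples, coef_population,
                         air_samples, auc_samples,
                         air_population, auc_population):
    """Summarize Base Model stability across the resampled fits.

    coef_samples    : array of shape (n_samples, n_coefficients), one row per Base Model
    coef_population : coefficients of the Base Model fit on the full dataset
    air_samples, auc_samples : per-sample AIR and AUC values
    """
    coef_samples = np.asarray(coef_samples)
    return {
        # (1) coefficient variability across the 100 samples
        "coef_std": coef_samples.std(axis=0, ddof=1),
        # (2) statistical bias: mean sample coefficient minus population coefficient
        "coef_bias": coef_samples.mean(axis=0) - np.asarray(coef_population),
        # (3) variability (and bias) of the fairness and accuracy metrics
        "air_std": np.std(air_samples, ddof=1),
        "auc_std": np.std(auc_samples, ddof=1),
        "air_bias": np.mean(air_samples) - air_population,
        "auc_bias": np.mean(auc_samples) - auc_population,
    }
```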
The next sections provide further detail on the derivation of these benchmarks.
The Variability of the Base Models' Estimated Credit Risk Profiles Across the 100 Samples
After estimating the 100 Base Models using the credit model specification contained in Figure 2, I analyzed the variability of the estimated CLTV, DTI, and Loan Amount credit risk profiles - displayed below in Figures 5a-c. For each credit risk attribute, the blue lines reflect the corresponding set of model coefficient estimates (i.e., the estimated credit risk profile) for each of the 100 Base Models while the green line represents the estimated profile from the population-based Base Model.
Based on these estimated credit risk profiles, I note that, in general, the Base Models' profiles exhibit a relatively high degree of stability across the 100 training samples. That is, we can clearly see in these charts that the Base Models' estimated credit risk relationships have a relatively high degree of directional and value consistency across the 100 training samples - clustering relatively closely together, albeit to varying degrees.
More quantitatively, Figure 6 below presents both the average coefficient values (numerical column 2) as well as the standard deviations of these values across the 100 samples (numerical column 3). Later, I will use these standard deviations as a benchmark to evaluate the structural stability of the LDA Models' estimated credit risk profiles on these same training samples.
The Statistical Bias in the Base Models' Estimated Credit Risk Profiles
Based on the data presented in Figure 6, I also note that the 100 Base Models yield largely unbiased estimates of the population-based Base Model's credit risk relationships - where the estimated bias (numerical column 4) is calculated as the difference between the average estimated model coefficients for the 100 Base Models (numerical column 2) and those of the population-based Base Model (numerical column 1).[13] Later, I will use these bias estimates as a benchmark to evaluate the bias of the LDA Models' estimated credit risk profiles on these same training samples.[14]
The Variability of the Base Models' AIR-Based Fairness and AUC-Based Accuracy Metrics Across the 100 Samples
Finally, Figure 7 below displays the variability of the Base Models' fairness and accuracy metrics across the 100 training samples (blue dots) - with AIR-based fairness performance measured on the horizontal axis and AUC-based accuracy performance measured on the vertical axis. For reference, the performance metrics for the population-based Base Model are depicted with a green dot.
Based on this data, I note that the sample-based Base Model AIR and AUC vary non-trivially across the 100 training samples - particularly the AIR - but, nevertheless, provide unbiased measures of the Base Model's population-based fairness and accuracy performance. Later, I will use these variability estimates as a benchmark to evaluate the stability of the LDA Models' fairness and accuracy performance on these same training samples.
I also note that the 100 Base Models' fairness performance varies twice as much as their accuracy performance as measured by their corresponding standard deviations. This feature is likely driven, in part, by the smaller Black applicant sample sizes (relative to Whites) that create inherently larger variability in Black approval rates (i.e., a given change in loan approvals for each group has a larger impact on the Black approval rate than the White approval rate).[15]
One important implication of this behavior is the following.
If a lender's fair lending policy requires a search for LDA Models when the Base Model's AIR falls below a specific threshold, the lender should be cautious in relying on a single AIR performance metric from its model development process.
Due to the underlying AIR variability noted above, reliance on a single sample's AIR metric may provide a misleading signal of the Base Model's more general fairness performance. This becomes particularly troublesome when the underlying AIR variability straddles common AIR fairness thresholds - such as 0.80 or 0.90. In such cases, the lender may conclude that the Base Model does not trigger LDA search requirements when, in fact, it may - and vice versa.[16]
Evaluating the LDA Model's Performance Stability Using the Base Model Benchmarks
With these benchmarks in hand, I now apply the algorithmic debiasing method of fairness regularization (aka "dual" or "joint" optimization) to create a set of 20 LDA Models for each of the 100 Base Models and their corresponding training samples. Each of the 20 LDA Models corresponds to a different Fairness Weight (i.e., the relative importance of fairness vs. accuracy in the model training process) ranging between 0.1 and 2.0 in increments of 0.1.[17] Overall, this yields a total analysis sample of 2,000 LDA Models.
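Schematically, the full experimental grid can be expressed as the following sketch. It reuses the hypothetical train_lda_model helper from the earlier fairness-regularization sketch and assumes a build_design_matrix helper and a race column that are illustrative only.

```python
import numpy as np

fairness_weights = np.round(np.arange(0.1, 2.01, 0.1), 1)     # 0.1, 0.2, ..., 2.0

lda_models = {}                                                # keyed by (sample index, Fairness Weight)
for i, sample in enumerate(training_samples):                  # the 100 training samples drawn earlier
    X, y = build_design_matrix(sample)                         # hypothetical helper: dummy-coded CLTV / DTI / Loan Amount
    is_black = (sample["race"] == "Black").to_numpy()          # assumed demographic column
    is_white = (sample["race"] == "White").to_numpy()
    for w in fairness_weights:                                  # the 20 Fairness Weights
        lda_models[(i, w)] = train_lda_model(X, y, is_black, is_white, fairness_weight=w)

# 100 samples x 20 Fairness Weights = 2,000 LDA Models in total
```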
The Variability of the LDA Models' Estimated Credit Risk Profiles Across the 100 Samples
For my first analysis, I evaluate the variability of the LDA Models' estimated credit risk relationships relative to the corresponding Base Model variability benchmarks. Figures 8a-c below illustrate these estimated relationships - separately for CLTV, DTI, and Loan Amount - for LDA Models estimated using 12 of the 20 Fairness Weights.[18] I also include the Base Model benchmarks for reference and discuss my interpretation of these charts further below.
Each set of charts depicts how the fairness regularization process algorithmically alters the set of estimated coefficients for one of the Base Model's credit risk attributes as the Fairness Weight increases from a low of 0.1 to a high of 2.0. Additionally, each chart shows how this fairness-driven Base Model alteration varies across the 100 training samples.
Because there is a lot of interesting information in these charts, I think it's worthwhile to do a detailed walkthrough. Accordingly, consider Figure 9 below that focuses on the estimated CLTV relationship for LDA Models trained with a Fairness Weight of 0.1 (obtained from the first chart in Figure 8a).
This chart, as with all other charts in Figures 8a-c, can be interpreted as follows:
The red lines represent the estimated CLTV credit risk profiles for the 100 LDA Models trained using a Fairness Weight of 0.1.
The dark blue lines represent the estimated CLTV credit risk profiles for the 100 Base Models trained on the same data samples and are presented here as a benchmark.
The yellow line represents the estimated CLTV credit risk profile of the population-based LDA Model (i.e., trained on the full synthetic credit performance dataset with a Fairness Weight = 0.1) and is presented here as a benchmark.
The green line represents the estimated CLTV credit risk profile of the population-based Base Model (i.e., trained on the full synthetic credit performance dataset) and is presented here as a benchmark.
Upon review of Figure 9 - as well as all the charts in Figures 8a-c - it is clear that the random training sample variability has a greater impact on the LDA Models' estimated credit risk relationships than it does on the Base Models'. In particular,
As the Fairness Weight increases (to achieve better AIR-based fairness performance), algorithmic debiasing tends to alter the credit model in ways that create increasing LDA Model "brittleness".
That is, the AD process causes the LDA Models' risk coefficient estimates to become increasingly unstable as the Fairness Weight increases. For example, even with a relatively low Fairness Weight = 0.3, Figure 10 below shows the significant volatility in the LDA Models' estimated CLTV risk profiles across the 100 training samples:
Rather than being tightly clustered together as we see with the Base Models (the blue lines), the CLTV risk coefficient estimates now appear to separate into distinct "clusters" with materially different profiles. Furthermore, as the Fairness Weight continues to increase, this volatility grows even larger (see Figures 8a-c).
Figures 11a-c below illustrate this LDA Model brittleness more quantitatively for each of the three credit risk attributes. In each chart, I calculate - for each risk coefficient - the ratio of: (1) the LDA Models' risk coefficient variability across the 100 training samples (as measured by its standard deviation) to (2) the Base Models' risk coefficient variability benchmark (also a standard deviation and contained in Figure 6 above). These ratios are depicted by the colored lines. I also include a dashed horizontal line at a value of 1.0 - indicating, as a reference, a neutral ratio value where the LDA and Base Model risk coefficient standard deviations are the same.
Using the CLTV risk attribute as an example, here we see - in general - that all CLTV risk coefficients possess significantly more variability in the LDA Models (i.e., all ratios are greater than 1 at Fairness Weights greater than 0). As a more specific example, we see that with a Fairness Weight = 0.3 (see Figure 10), the 100 LDA Models' CLTV 80-85% risk coefficients have a standard deviation of 0.615 - 12.3x higher (depicted by the green line) than the standard deviation of the same coefficients estimated for the corresponding 100 Base Models. At even larger Fairness Weights (i.e., above 1.3), this coefficient's instability grows even more rapidly - reaching a maximum value of 31.6x that of the Base Models at a Fairness Weight = 2.0.[19]
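The ratio calculation itself is straightforward; a minimal sketch (with illustrative argument names) is below.

```python
import numpy as np

def coef_std_ratio(lda_coefs, base_coefs):
    """Ratio of LDA to Base Model coefficient variability across the 100 samples.

    lda_coefs, base_coefs : arrays of shape (n_samples, n_coefficients) holding the
    fitted coefficients for one Fairness Weight (LDA) and for the Base Models.
    A ratio of 1.0 means the LDA coefficients are no more variable than the Base Models'.
    """
    return np.asarray(lda_coefs).std(axis=0, ddof=1) / np.asarray(base_coefs).std(axis=0, ddof=1)
```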
Due to this increasing LDA Model structural brittleness / instability, I note that:
AIR-based algorithmic debiasing increasingly distorts the conceptual soundness of estimated LDA credit risk profiles by reducing their monotonicity as the Fairness Weight increases.
For each Fairness Weight (including the 0 Fairness Weight for Base Models) and for each credit risk factor (i.e., CLTV, DTI, and Loan Amount), Figure 12 below indicates the percentage of the 100 models - estimated at each Fairness Weight - that exhibit monotonically increasing or decreasing credit risk profiles.[20]
For example, across the 100 Base Models (Fairness Weight = 0), 90% of the estimated CLTV credit risk profiles (the blue line) are monotonically increasing. That is, at each successively higher CLTV range, the Base Models' estimated default risk is at least as great as the estimated default risk associated with the previous (i.e., lower) CLTV range. However, as the Fairness Weight increases to produce LDA Models with greater AIR-based fairness, we see that these models exhibit decreasing levels of risk profile monotonicity - with only 56% of LDA Models estimated with a Fairness Weight = 0.3 having monotonic CLTV risk profiles (see Figure 10 above), only 25% of LDA Models estimated with a Fairness Weight = 1.0 having monotonic CLTV risk profiles, and only 8% of LDA Models estimated with a Fairness Weight of 2.0 having monotonic CLTV risk profiles. This pattern is generally consistent with that exhibited by the two other credit risk factors in Figure 12 - DTI (the red line) and Loan Amount (the green line).
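For readers who want to reproduce this type of check, a minimal monotonicity test is sketched below. The bucket ordering and the use of a non-strict (greater-than-or-equal) comparison are my assumptions about how such a check would typically be implemented.

```python
import numpy as np

def is_monotone(profile, increasing=True, tol=0.0):
    """Check whether an ordered set of risk coefficients is monotonic.

    profile : coefficients for one attribute's ordered buckets
              (e.g., CLTV 80-85%, 85-90%, 90-95%, >95%), lowest bucket first.
    """
    diffs = np.diff(np.asarray(profile, dtype=float))
    return bool(np.all(diffs >= -tol)) if increasing else bool(np.all(diffs <= tol))

def pct_monotone(profiles, increasing=True):
    """Share (in percent) of the 100 fitted models whose profile for this attribute is monotonic."""
    return 100.0 * np.mean([is_monotone(p, increasing) for p in profiles])
```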
What is the cause of this structural brittleness?
As I discuss in more detail later in this post, it appears that small random differences in training data can cause the AD process - for a given Fairness Weight - to select different sets of credit risk attributes as the primary means to drive greater AIR-based fairness performance (I call these attributes the "primary disparate impact factors"). These different sets of primary disparate impact factors, which appear as different credit risk profile "clusters" in Figures 8a-c, are really just different solutions to the same LDA Model training exercise whose emergence is triggered by random perturbations in the model training data. Furthermore, the emergence of these multiple solutions is facilitated by a decreasing relative weight on model accuracy during model training - permitting the AD algorithm even more freedom to select different sets of primary disparate impact factors to achieve the desired level of fairness without the limiting constraint of high model accuracy.[21]
The Statistical Bias of LDA Models' Estimated Credit Risk Profiles
Looking across the charts in Figures 8a-c, we can see a few other interesting features.
AIR-based algorithmic debiasing appears to inject material statistical bias into the LDA Models' estimated credit risk profiles.
Unlike the Base Models' estimated credit risk coefficients that were near-universally unbiased relative to the population-based Base Model's estimates (see Figure 6), Figures 8a-c show that the LDA Models' estimated credit risk profiles (the red lines) differ materially from those of the population-based LDA Model (the yellow line) - indicating some type of inherent statistical bias in the LDA estimation process. This is quantitatively confirmed in Figures 13a-c below where I display the calculated statistical bias levels for all three credit risk factors for each LDA Fairness Weight.
Using the CLTV risk profile as an example, Figure 13a plots - for each of the five CLTV ranges and for each Fairness Weight - the ratio of: (1) the average risk coefficient value across the 100 LDA Models relative to (2) the corresponding population-based LDA risk coefficient value. Ratio values equal to 1 (highlighted by the middle dashed horizontal reference line) correspond to the absence of estimation bias - that is, on average, the 100 risk coefficients estimated from the training samples equal the population-based estimate.
Consistent with my discussion of Figure 6, we see that all CLTV risk coefficients estimated with a Fairness Weight = 0 (i.e., the Base Models) cluster close to this reference line - consistent with the general absence of statistical bias for the 100 sample-based Base Models. However, across the range of LDA Fairness Weights (i.e., from 0.1 to 2.0), we can see many instances of material statistical bias in the sample-based LDA Models' risk coefficient estimates. In fact, in some cases, the bias estimates are so large that I truncated them so as not to distort the visual display of the results for the remaining data points. These dashed truncation lines can be seen at the top and bottom of the charts.
One of the reasons for this statistical bias can be seen in the following chart (Figure 14 - obtained from Figure 8a) for the estimated CLTV risk profiles using a Fairness Weight = 1.1.
Similar to my discussion of Figure 10 where the Fairness Weight = 0.3, here we can see how the LDA Models' estimated CLTV risk profiles separate into even more distinct "clusters" at this higher Fairness Weight. For example, the CLTV>95% coefficient varies across 3-4 distinct clusters with values varying around 4 and around 1. We can also see the impact of these clusters on some of the other CLTV coefficients - such as CLTV 80-85% and CLTV 90-95%.
As I discussed at the end of the previous section (and will discuss in more detail later), these clusters appear to be formed when different solutions to the LDA Model training exercise emerge - triggered by the random perturbations in the training data. However, with the existence of multiple solutions, statistical bias is sure to be present since the distinct risk profile patterns (from the different solutions) cannot average out to the single population-based CLTV credit risk profile (i.e., the yellow line in Figure 14) that reflects only one solution.
The Variability and Bias of the LDA Models' Fairness Metrics Across the 100 Samples
Next, I evaluated the variability of the 100 LDA Models' fairness and accuracy performance metrics - for each Fairness Weight - relative to the Base Model benchmarks. For the AIR-based fairness metric, I simply calculated - for each of the 2,000 LDA Models - the expected approval rates for Black and White applicant groups assuming a 90% overall approval rate.[22] These expected approval rates were then converted to the AIR fairness metric, segmented by LDA Model Fairness Weight, and plotted in Figure 15 below.
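A minimal sketch of this AIR calculation is below. The exact cutoff convention (how ties are handled, strict versus non-strict comparison) is my assumption, and the argument names are illustrative.

```python
import numpy as np

def air_at_overall_approval(pd_hat, is_black, is_white, overall_approval=0.90):
    """Adverse Impact Ratio at a cutoff chosen to approve a fixed share of all applicants.

    pd_hat   : model-estimated default probabilities for all applicants
    is_black : boolean mask for Black applicants; is_white likewise
    """
    # Approve the overall_approval share of applicants with the lowest estimated PDs.
    cutoff = np.quantile(pd_hat, overall_approval)
    approved = np.asarray(pd_hat) <= cutoff

    black_rate = approved[is_black].mean()
    white_rate = approved[is_white].mean()
    return black_rate / white_rate          # AIR: protected-class rate over control-group rate
```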
In this figure, for each Fairness Weight, the AIRs of the 100 LDA Models are plotted using green dots - with darker green dots associated with a greater concentration of LDA Models at that particular AIR value. Overlaid onto this figure, with a solid green line, is the average AIR value across the 100 training samples for each Fairness Weight (i.e., the mean AIR for all 100 LDA Models estimated with this Fairness Weight). For reference, the AIR values associated with the population-based LDA Model - for each Fairness Weight - are denoted by the blue line.[23]
Based on these results, I note the following observations.
While the LDA Models' mean AIRs generally increase with higher Fairness Weights, the individual models exhibit increasing AIR variability within a Fairness Weight - with a growing rightward tail toward higher AIR values.
For example, with a Fairness Weight = 0.3, the mean AIR = 0.891 with a standard deviation of 0.008. At a higher Fairness Weight = 1.0, the mean AIR = 0.917 with a standard deviation of 0.021. And at a Fairness Weight = 1.8, the mean AIR = 0.946 with a standard deviation of 0.036.[24] These standard deviations are all larger than the Base Models' AIR standard deviation benchmark of 0.006 and increase with higher Fairness Weights.
This widening (not just shifting) of the LDA Models' AIR distribution can be seen even more clearly in Figure 16 below where I show the joint distribution of the 100 LDA Models' AIRs and AUCs for four of these Fairness Weights (0.1, 0.3, 1.0, and 1.8) along with associated descriptive statistics.
When compared to the Base Model AIR-AUC joint distribution benchmark in Figure 7, we see that as the Fairness Weight increases, the joint AIR-AUC distribution rotates and elongates - with increasing standard deviations for both metrics. What this means is that while higher Fairness Weights drive a greater average AIR and typically a lower average AUC (see Figure 17 below), there can be significant variability around these mean values due to random training sample variability. Similar to previous discussions, a primary contributor to this volatility is the emergence of multiple LDA Model solutions triggered by random training sample perturbations. These solutions involve the selection of different sets of primary disparate impact factors which drive different approval rates across demographic groups and, therefore, different fairness metrics. As the Fairness Weight increases, more solutions may become viable due to the down-weighting of accuracy in the model training objective - resulting in even more variability in LDA Model fairness and accuracy metrics across the 100 samples.
In terms of statistical bias, I note that:
The LDA Models' mean AIR at a given Fairness Weight can depart materially from that of the population-based LDA Model - indicating the risk of statistical bias in a lender's AIR fairness metric.
The blue line in Figure 15 reflects the AIR value associated with the population-based LDA Model at each Fairness Weight. When compared to the corresponding average AIR values from the sample-based LDA Models (the green line), we can see notable divergences between the two - particularly at lower Fairness Weights (i.e., between 0.3 and 1.0). As I will discuss later, the divergence here is due to the AD process's selection of different sets of primary disparate impact factors for the sample-based LDA Models versus the population-based model at the same Fairness Weight. There is a greater divergence between these sets of factors at Fairness Weights less than 1.1 and closer alignment at 1.1 and above.
The Variability and Bias of the LDA Models' Accuracy Metrics
Turning now to the LDA Models' AUC-based accuracy results, Figure 17 below plots, for each Fairness Weight, the AUCs of the 100 LDA Models using red dots - with darker red dots associated with a greater concentration of LDA Models at that particular AUC value. Overlaid onto this figure, with a solid red line, is the average AUC value across the 100 training samples for each Fairness Weight (i.e., the mean AUC for all 100 LDA Models estimated with this Fairness Weight). For reference, the AUC values associated with the population-based LDA Model - for each Fairness Weight - are denoted by the blue line.
Based on these results, I note the following observations.
While the LDA Models' mean AUCs generally decrease with higher Fairness Weights, the individual models exhibit increasing AUC variability within a Fairness Weight - with a growing leftward tail toward lower AUC values.
For example, with a Fairness Weight = 0.3, the mean AUC = 0.837 with a standard deviation of 0.005. At a higher Fairness Weight = 1.0, the mean AUC = 0.810 with a standard deviation of 0.023. And at a Fairness Weight = 1.8, the mean AUC = 0.766 with a standard deviation of 0.051.[25] These standard deviations are all larger than the Base Models' AUC standard deviation benchmark of 0.003 and increase with higher Fairness Weights.
This widening (not just shifting) of the LDA Models' AUC distribution can be seen even more clearly in Figure 16 above where I show the joint distribution of LDA Model AIRs and AUCs for these four Fairness Weights (0.1, 0.3, 1.0, and 1.8) along with associated descriptive statistics. My conclusions here are similar to those for the AIR metric discussed above.
The LDA Models' mean AUC at a given Fairness Weight can depart materially from that of the population-based LDA Model - indicating the risk of statistical bias in a lender's AUC accuracy metric.
The blue line in Figure 17 reflects the AUC value associated with the population-based LDA Model at each Fairness Weight. When compared to the corresponding average AUC values from the sample-based LDA Models (the red line), we can see notable divergences between the two - with the individual LDA Models, on average, achieving lower AUC accuracy rates than the population-based LDA Model at the same Fairness Weight. As I will discuss later, the divergence here is due to the AD process's selection of different sets of primary disparate impact factors for the sample-based LDA Models versus the population-based model at the same Fairness Weight.
So What Does This All Mean For Lenders?
From a lender's perspective, common AIR-based AD processes can create LDA Models with significant model risk due to the instability of their model structures in response to relatively small variations in training data. Such "brittleness" - if undetected during model development or model validation - may cause model accuracy and fairness performance metrics obtained during model development to be highly idiosyncratic - potentially resulting in materially different model performance results during production.
In the final section of my analysis below, I turn my attention back to the core reasons driving LDA Model adoption for lenders - the elimination or mitigation of illegal credit model disparate impact. However, now - with the findings above in hand - I explore what this brittleness implies about the use of AIR-based algorithmic debiasing for such purposes.
The Impact of LDA Model Brittleness on Disparate Impact Remediation: Identifying and Analyzing the Primary Disparate Impact Factors
According to its proponents, algorithmic debiasing creates fairer AI credit models by reducing or eliminating the illegal disparate impact driving lending outcome disparities for protected class applicants. However, as I have written extensively elsewhere - see, for example, Algorithmic Justice: What's Wrong With the Technologists' Credit Model Disparate Impact Framework - whether a credit model's lending outcome disparity is evidence of an illegal disparate impact is a fact-based analysis governed by federal law, regulation, and associated Supreme Court opinions. Among other important things, this legal framework requires the identification of the specific "artificial, arbitrary, and unnecessary" credit attribute(s) allegedly causing the illegal lending outcome disparity.
Nevertheless, despite its widespread promotion as a tool to remediate "disparate impact" discrimination, algorithmic debiasing performs its task without the transparency consistent with this legal framework (or, frankly, with typical model risk management requirements). That is, because algorithmic debiasing is, itself, an automated black-box process - and because many modern-day AI credit models are driven by hundreds or thousands of predictive attributes - most lenders are unaware of how the debiasing algorithm actually alters the Base Model to produce the less discriminatory alternatives they are encouraged to adopt - let alone whether such attributes are really "artificial, arbitrary, and unnecessary".
While one would think that identification of the credit model's "disparate impact factors" should matter a great deal to highly-regulated lenders, such concerns do not appear to be widely or publicly shared in the many conference panels, white papers, blog posts, and podcasts advocating for AD adoption. Instead, the focus of discussions is ultimately on the results - not on the process - with the existence of demonstrably "better" fairness performance proof enough to justify many lenders' LDA Model adoption decisions.
But what would a lender learn if it did seek greater transparency? And how might such learnings impact its decision to adopt an algorithmic LDA Model?[26]
Let's see.
To answer these questions, I analyzed my 2,000 LDA Models one final time to understand more specifically:
Which LDA Model risk attributes experience the greatest PD-altering impacts (both positively and negatively) relative to their Base Model? I refer to these attributes as the LDA Model's "primary disparate impact factors".
How do these primary disparate impact factors vary, if at all, across LDA Models estimated with the same Fairness Weight? And
How do these primary disparate impact factors vary, if at all, across Fairness Weights for the same LDA Model (i.e., estimated on the same training sample)?
I address each of these in turn below.
Identification of the LDA Models' Primary Disparate Impact Factors
To identify the primary disparate impact factors for each of my 2,000 LDA Models, I performed the following steps on each model:
I first calculated the change in each individual's probability of default ("PD") relative to the Base Model.
Using ordinary least squares ("OLS"), I performed regression analysis on these individual-level PD differentials using the original set of credit risk attributes as the set of explanatory variables (i.e., those listed in Figure 2).[27]
I identified the primary disparate impact factors as the credit risk attributes with the largest positive and negative OLS regression coefficients.
Since all risk attributes in my credit model take the form of indicator (dummy) variables, their coefficients are already normalized - meaning that the attributes with the largest positive and negative coefficients are the risk attributes most impacted by the debiasing algorithm.[28] Additionally, I note that these primary disparate impact factors come in pairs - one positive and one negative - because one cannot "de-bias" or reduce the impact of one risk attribute without offsetting this effect with an increase to other risk attributes; otherwise, the LDA Credit Model would no longer generate PD estimates with zero overall expected error. Therefore, one should think of credit model disparate impact remediation as a reallocation of estimated credit risk across model attributes.
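A minimal sketch of this identification procedure is below. Whether an intercept is included and how ties between coefficients are broken are my assumptions; the function and argument names are illustrative.

```python
import numpy as np

def primary_disparate_impact_factors(pd_lda, pd_base, X_dummies, attribute_names):
    """Identify the dummy attributes whose coefficients most decrease / increase PDs under the LDA Model.

    pd_lda, pd_base : estimated default probabilities from the LDA and Base Models
    X_dummies       : (n_applicants, n_attributes) 0/1 design matrix of credit risk dummies
    attribute_names : labels for the columns of X_dummies
    """
    delta_pd = np.asarray(pd_lda) - np.asarray(pd_base)

    # OLS of the individual-level PD differentials on the dummy attributes (with an intercept).
    X = np.column_stack([np.ones(len(delta_pd)), X_dummies])
    coefs, *_ = np.linalg.lstsq(X, delta_pd, rcond=None)
    attr_coefs = dict(zip(attribute_names, coefs[1:]))        # drop the intercept

    de_risked = min(attr_coefs, key=attr_coefs.get)           # largest PD decrease
    up_risked = max(attr_coefs, key=attr_coefs.get)           # largest PD increase
    return de_risked, up_risked, attr_coefs
```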
As an example of this identification process, Figure 18 below summarizes the primary disparate impact factors identified across the 100 LDA Models trained with a relatively low Fairness Weight = 0.1.
The top table summarizes the 100 primary disparate impact factors responsible for decreasing the LDA Models' PD estimates (i.e., de-risking PD estimates relative to the Base Models) to improve AIR-based fairness, while the bottom table identifies the corresponding primary disparate impact factors responsible for increasing the LDA Models' PD estimates (i.e., up-risking PD estimates relative to the Base Models) to improve fairness.
Reviewing Figure 18, I notice two immediate results. First, at a Fairness Weight of 0.1, the majority of LDA Models (62%) achieve improved AIR-based fairness by de-risking (i.e., decreasing) the estimated PDs for smaller loan sizes (i.e., loan amounts <= $50,000). Second, to counterbalance this de-risking, the AD process primarily up-risks (i.e., increases) the PDs for applicants with CLTVs between 90% and 95% for a majority of training samples (79%).
Why these specific factors?
While not a definitive explanation, Figures 19 and 20 below suggest a reason for this particular selection of primary disparate impact factors at this Fairness Weight level.
In Figure 19, I plot the estimated loan amount risk profiles for the 62% of LDA Models where loan amounts <= $50,000 were identified as the primary de-risked disparate impact factor. For reference, I also add the population-based Base Model's loan amount risk profile as a benchmark. Here we can see that the lowering of the relative riskiness of small loan sizes (<=$50K) for these models still largely preserves the monotonicity in the estimated loan amount risk profile we observe in the population-based Base Model. This is consistent with the low Fairness Weight of 0.1 in which relative accuracy (i.e., AUC) is much more important than AIR-based fairness. Accordingly, the AD process appears to search for a risk attribute whose de-risking incrementally improves AIR-based fairness by: (1) skewing PD reductions disproportionately to protected class applicants in order to improve their relative approval rates, but (2) still maximally preserving the LDA Model's AUC-based accuracy.
Similarly, in Figure 20, we can see that the LDA Models' up-risking of loans with CLTV 90-95% also largely preserves the monotonicity present in the estimated CLTV risk profile of the population-based Base Model while contributing to improved AIR-based fairness. Here, the AD process appears to search for a risk attribute whose up-risking incrementally improves AIR-based fairness by: (1) skewing PD increases disproportionately to control group applicants in order to suppress their relative approval rates, but (2) still maximally preserving the LDA Model's AUC-based accuracy.
Evaluating How These Primary Disparate Impact Factors Vary Across LDA Models Estimated With the Same Fairness Weight
More broadly, Figures 21 and 22 below summarize the primary de-risked and up-risked disparate impact factors identified across all 2,000 LDA Models in my analysis - segmented by Fairness Weight. To reduce the clutter of these charts, I only include a primary disparate impact factor if it is used in at least 10% of LDA Models estimated at any Fairness Weight.
A review of these charts leads to the following observations.
At the same Fairness Weight, small differences in the underlying training samples can cause the AD process to select different sets of primary disparate impact factors.
For example, according to Figure 21, at a Fairness Weight = 1.0, Loan Amount <= $50K and CLTV>95% account for 33% and 29% of the 100 primary de-risked disparate impact factors, respectively - with DTI=49% accounting for another 11%. According to Figure 22, the primary up-risked disparate impact factors selected at this Fairness Weight are CLTV 90-95% and CLTV 80-85% - which account for 36% and 31% of the total, respectively. Based on these results, we see that LDA Model brittleness is clearly associated with different sets of primary disparate impact factors selected at the same Fairness Weight, and these different primary disparate impact factors create the different "clusters" of LDA Model credit risk profiles as discussed in relation to Figure 14.
This implies that LDA Model brittleness not only raises safety-and-soundness concerns related to model non-robustness and instability, but it also raises equally important concerns about whether algorithmic debiasing is truly remediating illegal disparate impact driven by "artificial, arbitrary, and unnecessary" predictive factors per the governing legal framework. That is, if it is really remediating true illegal disparate impact, then why - at the same Fairness Weight - are different model attributes being selected for "de-risking" in response to small random differences in training samples?
Evaluating How These Primary Disparate Impact Factors Vary Within Training Samples For Different Fairness Weights
In addition to analyzing how primary disparate impact factors change across LDA Models estimated at the same Fairness Weights, I also flipped this analysis to explore how the primary disparate impact factors change within a given training sample as the Fairness Weights increase. What I found was the following:
Even within a given training sample, algorithmic debiasing can switch among different primary disparate impact factors as the Fairness Weight changes.
For example, Figures 23a-e below present the primary disparate impact factors identified within 5 random training samples (out of the 100) for each Fairness Weight between 0.1 and 2.0. That is, for each training sample, I estimated 20 LDA Models - each associated with a different Fairness Weight. Then, for each of the 20 LDA Models, I used my method described above to identify the set of primary disparate impact factors used to improve AIR-based fairness in this sample.
For each of the 5 randomly-selected training samples below, the first column denotes the LDA Model's Fairness Weight, the second column identifies the primary de-risked disparate impact factor, the third column identifies the counterbalancing primary up-risked disparate impact factor, and the fourth and fifth columns contain the corresponding LDA Model's AIR and AUC - which can be compared to the training sample's Base Model AIR and AUC at the top of the respective columns.
Reviewing these tables individually, we can see clearly that within a given training sample, the AD process can select different risk attributes as the primary disparate impact factors depending on the degree of fairness improvement sought and the relative importance of model accuracy. For example, at low Fairness Weights, the AD process tends to focus its de-risking on smaller loan sizes. However, as the Fairness Weight increases from these low levels, it tends to switch - generally - to DTI attributes. And at even higher Fairness Weights, the AD process - again, generally - switches to higher CLTV loans. Such attribute switching is also generally evident in the primary up-risked disparate impact factors.
Additionally, the fairness improvement path for each training sample (i.e., the primary disparate impact factors selected as the Fairness Weight increases sequentially from 0.1 to 2.0) contains differing degrees of instability - which I define here as a lack of consistency in disparate impact factor selection. Although this instability affects all samples to some degree, it is most apparent in Samples 1 and 3 where 6 and 9 different primary de-risked disparate impact factors, respectively, are leveraged along the samples' fairness improvement paths.
Overall, this instability in LDA Models' primary disparate impact factors calls into question whether these identified model attributes are truly discriminatory according to the disparate impact legal framework, or simply model attributes whose statistical and demographic properties are being leveraged to achieve higher levels of AIR-based fairness through reverse disparate impact.
Why Does Any of This Matter?
For those who would ask what difference any of this makes since improved fairness is nevertheless achieved, I point out that - apart from the potentially serious risk and compliance issues noted above - such fairness improvements are not without real customer impacts.
In particular, keep in mind that the LDA Model's expanded protected class approvals are driven by swap-sets of applicants from both demographic groups - i.e., Black and White.[29] While, as a group, Black applicants experience net positive swap-ins (and, therefore, higher relative approval rates) under all these LDA Models regardless of the primary disparate impact factors employed, this group-level result masks significant intra-group credit decision "churn" in which many individual Black applicants can be harmed.
To illustrate and expand on this point, Figure 24 shows the average number of Black "swap-ins" at each Fairness Weight (the green bars - averaged across the 100 LDA Models estimated at each Fairness Weight) along with the corresponding average number of Black "swap-outs" (the red bars). I note that at each Fairness Weight the Black swap-ins are larger than the Black swap-outs - thereby leading to a net positive increase in Black approvals and AIR-based fairness.
Based on this analysis, I note that:
At each LDA Model Fairness Weight, although Black applicants - as a group - achieve higher net approvals, this group-level fairness improvement comes at the expense of significant intra-group "unfairness".
That is, because algorithmic debiasing does not directly de-risk only Black applicant PDs, the increase in Black net approvals occurs as a form of "rough justice" - that is, while the Black applicants possessing the primary de-risked disparate impact factor have their PDs reduced (as do other non-Black applicants with this attribute), other Black applicants possessing the primary up-risked disparate impact factor have their PDs increased. As the former group dominates the latter group in size, overall net swap-ins of Black applicants are positive - leading to increased group-level approvals. However, not all Black applicants are likely happy with this group-level result.
For example, at a Fairness Weight = 1.0, the 100 LDA Models, on average, swap-in 449 new Black approvals (i.e., Black applicants who would be denied under the corresponding Base Models). However, these new approvals are partially offset by 210 Black swap-outs (i.e., Black applicants who would be approved under the Base Model, but would now be denied). Therefore, while Black net approvals increase by an average of 239 at a Fairness Weight = 1.0, this net increase masks an average 659 individual credit decision changes within this demographic group - which is 11.3% of all Black applicants.
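For clarity, the swap-set accounting underlying Figure 24 can be sketched as follows, assuming boolean approve/deny vectors from the Base and LDA Models and a protected-class indicator (all names are illustrative).

```python
import numpy as np

def black_swap_sets(approved_base, approved_lda, is_black):
    """Count protected-class applicants whose approve/deny decision flips under the LDA Model."""
    approved_base = np.asarray(approved_base, dtype=bool)
    approved_lda = np.asarray(approved_lda, dtype=bool)
    is_black = np.asarray(is_black, dtype=bool)

    swap_in  = np.sum(~approved_base &  approved_lda & is_black)   # denied by Base, approved by LDA
    swap_out = np.sum( approved_base & ~approved_lda & is_black)   # approved by Base, denied by LDA
    return {"swap_ins": int(swap_in),
            "swap_outs": int(swap_out),
            "net_new_approvals": int(swap_in - swap_out),
            "total_decision_changes": int(swap_in + swap_out)}
```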
While, by themselves, the Black swap-outs would likely view the adverse change in their credit decision to be unfair, Figure 25 below further illustrates the potential harm to these individuals. Here, I calculated - across all 100 training samples - the average actual default rate of Black swap-ins and swap-outs across the LDA Model Fairness Weights. The Black swap-ins' actual default rates are plotted in green while the Black swap-outs' actual default rates are plotted in red and are displayed as negatives for presentation purposes.
At all Fairness Weights, the Black swap-ins, on average, have worse credit risk (i.e., higher actual default rates) than the Black swap-outs. Additionally, for specific training samples, the characteristics of those swapped-in vs swapped-out may be non-intuitive. Take, for example, Training Sample 3 from Figure 23c above. For the LDA Model estimated with a Fairness Weight = 1.3, the primary de-risked disparate impact factor is CLTV>95% while the primary up-risked disparate impact factor is CLTV 80-85%. In this case, Black applicants with lower CLTVs (and lower actual default rates) would now be denied under this LDA Model while Black applicants with higher CLTVs (and higher actual default rates) would be approved.[30],[8]
And this is why this matters.
An LDA Model's "better" fairness performance is not all that matters when deciding whether to adopt it or not. As we have seen here, even putting aside the potentially serious safety-and-soundness risks discussed previously, there are still important legal, compliance, and reputational risks to consider due to:
Whether the LDA Model is actually remediating an illegal disparate impact as governed by applicable law, regulation, and Supreme Court opinion OR whether the AD process is, instead, simply altering the LDA Model structure to embed an arguably illegal and latent reverse disparate impact to achieve greater approval rate equity. That is, are we remediating a legal violation or potentially creating one?
The apparent intra-group unfairness created within demographic groups - including the protected class groups - by potentially swapping-in higher risk applicants for new credit approvals at the expense of swapping-out (i.e., denying) lower risk applicants within the same demographic group.
The conceptual complications related to Adverse Action Notice accuracy. For example, many of the swap-outs in Training Sample 3 would receive Adverse Action Notices indicating their credit denial was based on "insufficient collateral" or some other reason related to their CLTV value. However, that is not exactly accurate since other applicants with higher CLTVs (i.e., the swap-ins) are now approved. In reality, these applicants are being denied solely to achieve a specific fairness goal - but that is not what would be communicated to them.
Additionally, the existence of LDA Model multiplicity at a given Fairness Weight raises another interesting point to consider. Given that random differences in training samples can produce LDA Models with different primary up-risked disparate impact factors at the same Fairness Weight, applicants swapped out (and, therefore, denied) by a given LDA Model may not have been denied had the LDA Model been trained on a different random sample. That is, while it may be technically true that these swap-outs were denied due to their primary up-risked disparate impact factor, the underlying brittleness of LDA Models means that - randomly - some of these applicants could have been approved had a slightly different training sample been used.[31] (A simple way to quantify this decision instability is sketched after this list.) In such cases, what really is the specific reason for denial? And is it truly a causal reason that the applicant can address?
The poorer credit performance of the swapped-in protected class applicants (relative to the swapped-out applicants), which raises concerns as to whether the lender is engaged in responsible lending.[32]
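To make the multiplicity point above more concrete, the sketch below shows one simple way to quantify decision instability across training samples. It assumes a hypothetical fit_lda_model callable that trains an LDA Model at a fixed Fairness Weight on a given sample and scores a common holdout set of applicants; the interface is invented for illustration and does not reproduce my actual estimation code.

```python
import numpy as np

def decision_flip_rates(train_samples, holdout_X, fit_lda_model, fairness_weight):
    """Estimate how often each holdout applicant's decision flips across
    LDA Models trained on different random samples at the same Fairness Weight.

    train_samples : list of training datasets (e.g., bootstrap samples)
    holdout_X     : feature matrix for a fixed set of holdout applicants
    fit_lda_model : callable(sample, fairness_weight) -> model exposing a
                    .predict_approval(X) method that returns a boolean array
                    (hypothetical interface, for illustration only)
    """
    decisions = []
    for sample in train_samples:
        model = fit_lda_model(sample, fairness_weight)
        decisions.append(np.asarray(model.predict_approval(holdout_X), dtype=bool))
    decisions = np.vstack(decisions)            # shape: (n_models, n_applicants)

    approval_share = decisions.mean(axis=0)     # fraction of models approving each applicant
    # Applicants approved by some models and denied by others are "unstable"
    unstable = (approval_share > 0) & (approval_share < 1)
    return approval_share, float(unstable.mean())
```

An applicant with an approval share of, say, 0.6 would be approved by 60 of 100 equally plausible LDA Models and denied by the other 40 - which makes any single stated "reason for denial" difficult to defend as causal.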
Final Thoughts
As readers of this blog know, my primary concern with current algorithmic fairness tools is the rush by many to embrace them as an extraordinary technological innovation that can easily solve an important and complex fair lending compliance issue. While the value proposition behind these tools is certainly compelling, the reality is that there is a dearth of publicly-available, objective, and rigorous research supporting their use in a consumer lending context - research that explores more deeply the complex operations occurring within their automated "black box" processes to identify overlooked features or behaviors that may pose important risks to lenders and consumers.
And why is that?
Every month we see what seems like hundreds of new research papers investigating every nuance of large language models, yet we still don't have a proper vetting of the complex models that highly regulated institutions are deploying today into high-risk use cases such as consumer lending.
Sigh.
In any case, I don't claim that the research in this post - or my prior Fool's Gold analyses - satisfies this objective, nor that it should be considered the decisive word on current algorithmic debiasing tools. As I have stated many times, my research is limited to the public data and tools available to me, which may or may not limit the broader applicability of my findings. However, I do strongly believe that these analyses clearly highlight important risks associated with these tools - risks that should be investigated more thoroughly, whether as part of a formal research program or a lender's model risk management process, to determine their applicability and presence in specific lenders' credit model applications.
And, lastly, for those who may believe that my results are specific to the "fairness regularization" method of algorithmic debiasing and not to other common AD methodologies such as adversarial debiasing - well, that remains to be seen. Certainly, adversarial debiasing - although not calibrating directly to the AIR - does calibrate to an outcome-based fairness penalty term involving the correlation between PDs (which drive approvals) and borrower demographics. And as the weight of this penalty term increases to drive greater LDA Model fairness performance, the AD process will likely distort the estimated credit risk relationships in ways similar to what I have described here. Whether these distortions are of similar magnitude and cause similar brittleness is a research topic I leave for another time (or another researcher). However, I think many of us - including the CFPB and, importantly, lenders employing this methodology - would like to know the answer. The sooner the better.
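For readers who want to see what such an outcome-based penalty looks like mechanically, here is a minimal sketch of a correlation-style fairness penalty added to a logistic regression loss. It is an illustrative formulation of the general fairness-regularization idea only - the function fit_fairness_regularized_logit and its exact penalty are my own simplification, not the objective used by any particular vendor tool or in my earlier analyses.

```python
import numpy as np
from scipy.optimize import minimize

def fit_fairness_regularized_logit(X, y, group, fairness_weight):
    """Fit a logistic regression whose loss adds a penalty on the correlation
    between predicted PDs and a demographic indicator (illustrative only).

    X : (n, k) feature matrix; y : (n,) default flags (0/1);
    group : (n,) demographic indicator (e.g., 1 = protected class);
    fairness_weight : scalar weight on the fairness penalty term.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    group = np.asarray(group, dtype=float)
    n, k = X.shape
    Xb = np.column_stack([np.ones(n), X])              # add an intercept column

    def objective(w):
        logits = np.clip(Xb @ w, -30.0, 30.0)          # numerical safety
        p = 1.0 / (1.0 + np.exp(-logits))              # predicted PDs
        eps = 1e-9
        bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        # Outcome-based fairness penalty: |corr(predicted PD, group membership)|
        if p.std() < 1e-12:                            # guard against a constant-PD start
            penalty = 0.0
        else:
            penalty = abs(np.corrcoef(p, group)[0, 1])
        return bce + fairness_weight * penalty

    w0 = np.zeros(k + 1)
    result = minimize(objective, w0, method="BFGS")    # numerical gradients
    return result.x                                    # fitted coefficients (intercept first)
```

As fairness_weight grows, the optimizer trades predictive fit for a weaker PD-demographic correlation - the same basic mechanism through which estimated credit risk coefficients can be pushed away from their Base Model values.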
* * *
ENDNOTES:
[1] See, for example, the June 2024 Fair Lending Report of the Consumer Financial Protection Bureau which states, "In 2023, the CFPB issued several fair lending-related Matters Requiring Attention and entered Memoranda of Understanding directing entities to take corrective actions that the CFPB will monitor through follow-up supervisory actions. ... [T]he CFPB ... directed the institutions to test credit scoring models for prohibited basis disparities and to require documentation of considerations the institutions will give to how to assess those disparities against the stated business needs. To ensure compliance with ECOA and Regulation B, institutions were directed to develop a process for the consideration of a range of less discriminatory models."
[2] See, for example,
"Urgent Call for Regulatory Clarity on the Need to Search for and Implement Less Discriminatory Algorithms," June 26, 2024 Letter From The Consumer Federation of America and Consumer Reports to Director Rohit Chopra of the CFPB. In particular, "While rudimentary techniques exist to modify models to mitigate disparate impact (such as “drop-one” techniques), a range of more advanced tools and techniques are emerging, including adversarial debiasing techniques, joint optimization, and Bayesian methods that use automated processes to more effectively search for modifications to reduce disparate impacts. ... Companies should be expected to utilize emerging good practices in terms of techniques for searching for LDAs."
"CFPB Should Encourage Lenders To Look For Less Discriminatory Models," April 22, 2022 Letter From the National Community Reinvestment Coalition to Director Rohit Chopra of the CFPB.
[3] Their explicit use of the term "open-source" is interesting as it indicates a potential aversion to proprietary algorithmic fairness tools.
[4] This omission from my previous studies is due to my use of a single - although large - model training sample and my focus on the performance of a single LDA Model derived therefrom.
[5] As just one example, see "How to Choose a Less Discriminatory Alternative" at FairPlay.ai
[6] The importance of model robustness / stability is reinforced by the federal financial regulators. For example, according to OCC 2011-12 "Supervisory Guidance on Model Risk Management" - considered to be the authoritative guidance on model risk management:
"Model quality can be measured in many ways: precision, accuracy, discriminatory power, robustness, stability, and reliability, to name a few."
"An integral part of model development is testing, in which the various components of a model and its overall functioning are evaluated to determine whether the model is performing as intended. Model testing includes checking the model's accuracy, demonstrating that the model is robust and stable, assessing potential limitations, and evaluating the model’s behavior over a range of input values."
[7] This is the primary theme of my prior post "Algorithmic Justice: What's Wrong With the Technologists' Credit Model Disparate Impact Framework?"
[8] This intra-group unfairness also affects the control group applicants. While the aggregate level metrics for this group suggest that they are trivially impacted (e.g., the LDA Model approval rate for White applicants is at most about 1% less than their Base Model approval rate of 90.4% at the highest Fairness Weights), this result masks the thousands of individual White applicants whose credit decisions are changed by the LDA Model - about 49% swapped-in and 51% swapped-out. As with the protected class applicants so affected, many of these applicants may not consider the new credit decisions to be fair.
[9] I refer to this HMDA dataset as a "synthetic credit performance dataset" as - for the purpose of these analyses - I treat denied credit applications as "defaults" and approved credit applications as "non-defaults". See my further discussion of this assumption in the first Fool's Gold article.
[10] I use logistic regression for the credit model as: (1) it is the most common machine learning ("ML") algorithm used in consumer credit scoring, (2) it allows me to abstract away from the unnecessary complexities and intractabilities of more advanced ML techniques such as gradient boosted trees, neural networks, etc., and (3) it makes the overall analysis feasible from a computational resource perspective (I ultimately need to estimate 2,100 Base and LDA Models). Nevertheless, my results should be considered within the context of this choice.
[11] I may perform future analyses using out-of-time HMDA data to assess the impact of such data on a given LDA Model's fairness and accuracy performance.
[12] I base my analysis on 100 samples to balance the need for a relatively large number of random samples to provide sufficiently precise results with the significant computational time associated with generating 20 LDA Models for each of the 100 samples. While I believe that my results and conclusions would be qualitatively unchanged with a larger sample size, they should be interpreted within the context of this choice.
[13] More formally, only 2 of the 26 risk attributes (DTI=39% & DTI=44%) have population-based coefficients that are statistically different than the corresponding sample-based average estimates at a 5% level of significance. The values of these differences are -0.019 and 0.014, respectively.
[14] I note that the use of the term "bias" here refers to its statistical definition - not its consumer compliance definition.
[15] That is, a one standard deviation change in the number of Black approvals changes the Black approval rate by 1.13%, while a similar change in White approvals changes the White approval rate by 0.27% - a sensitivity roughly 4.2 times larger. Given that the AIR is a ratio of these approval rates, the greater inherent variability of Black approval rates will amplify the variability in the AIR fairness metric.
[16] A similar phenomenon is discussed in Emily Black, Talia Gillis, and Zara Hall, "D-hacking," 2024, pp. 602-615, https://doi.org/10.1145/3630106.3658928. In their article, they identify this as a risk for opportunistic fair-washing of algorithmic credit models.
[17] For a more in-depth discussion of this algorithmic debiasing method, see my discussion in Fool's Gold: Assessing the Case For Algorithmic Debiasing.
[18] I display only 12 of the 20 Fairness Weights solely to conserve space. There is nothing abnormal or atypical about the results for the remaining 8 Fairness Weights that are not displayed.
[19] These results are qualitatively unchanged if one instead uses the relative coefficients of variation as the volatility measure.
[20] I measured monotonicity as the percentage of risk profiles in which each successive risk coefficient was greater than or equal to (or less than or equal to, for loan amount) the preceding risk coefficient. Given the number of coefficients for DTI, I focused on only the following subset: 37%, 40-41%, 44%, 46%, 48%, 50-60%, and >60%.
[21] This raises the question as to whether algorithmic debiasing - by introducing a second component (fairness) to the model training objective - causes the LDA Model solution to be under-identified.
[22] This is the same approval rate assumption that I have used throughout the previous three Fool's Gold analyses.
[23] While it appears that the population-based LDA Models have a different AIR behavior than the sample-based LDA Models due to the large jump at a Fairness Weight = 1.1 (vs. the smooth AIR increase of the sample-based LDA Models), keep in mind that the latter is an average value across 100 LDA Model AIR paths while the former is a single AIR path.
[24] Technically, these AIR distributions are skewed so means and standard deviations are less meaningful. However, the conclusion is unchanged if I instead measure the central tendency and spread of the AIR distributions with the median and interquartile ranges. In particular, the latter increases from 0.011 to 0.034 to 0.058 as the Fairness Weight increases from 0.3 to 1.0 to 1.8, respectively.
[25] Technically, these AUC distributions are skewed so means and standard deviations are less meaningful. However, the conclusion is unchanged if I instead measure the central tendency and spread of the AUC distributions with the median and interquartile ranges. In particular, the latter increases from 0.007 to 0.027 to 0.068 as the Fairness Weight increases from 0.3 to 1.0 to 1.8, respectively.
[26] I distinguish here between LDA Models created algorithmically through automated black-box machine learning methods and those created manually through more traditional disparate impact testing and credit model modification. This "traditional" approach is discussed in more detail in my prior post "Algorithmic Justice: What's Wrong With the Technologists' Credit Model Disparate Impact Framework?"
[27] I note that different estimation approaches may be required for more complex LDA model structures.
[28] To be clear, these attributes are not the only attributes affected by the AD process and, therefore, they are not the only attributes affecting the LDA Model's PD estimates. However, they have the largest effects and, because of that, I designate them as the primary disparate impact factors.
[29] See Fool's Gold 3: Do LDA Models Really Improve Fairness? for a detailed discussion of applicant swap-sets in the context of LDA Model fairness improvement.
[30] This observation regarding the sacrifice of intra-group fairness for greater inter-group outcome equality is consistent with a similar theme in the recent working paper: Spencer Caro, Talia Gillis, and Scott Nelson, Modernizing Fair Lending, Working Paper No. 2024-18, Becker Friedman Institute for Economics, University of Chicago, August 2024.
[31] This point is similar, though not identical, to one made in the paper: Emily Black, Manish Raghavan, and Solon Barocas, "Model Multiplicity: Opportunities, Concerns, and Solutions," 2022, https://doi.org/10.1145/3531146.3533149.
[32] These swap-ins are different from those who may be swapped in due to the use of new alternative data attributes that provide more inclusive and accurate credit risk assessments for individuals with sparse traditional credit histories.
© Pace Analytics Consulting LLC, 2024.