Sigh.
I was hopeful. I really was.
I was looking forward to what was promised to be "groundbreaking research" on a new algorithmic fairness methodology designed to address troubling disparities in credit access and pricing for historically disadvantaged groups. A methodology - the developers teased - that did not require a trade-off in credit model predictive accuracy. My hopes were high given the apparent flaws and insufficient vetting of other popular algorithmic fairness solutions - summarized in my recent article "The Road to Fairer AI Credit Models: Are We Heading in the Right Direction?". Would this new approach take a different path with a more thorough due diligence process?
Sadly, no.
In the new research, the authors provide a trove of detailed technical information about the large and rich consumer lending data underlying their analyses. We are treated to the intricacies of loss functions and neural network architectures. The research results are replete with comparisons of AUCs and AIRs as related to the search for less discriminatory alternative ("LDA") credit models. And, in the end, lofty claims are made as to how all of this technical alchemy - accessible to few fair lending compliance professionals and attorneys - yields a long-awaited solution to the pervasive problem of credit model disparate impact.
But amid all these flowcharts, mathematical equations, Greek letters, and colorful charts - I found a troubling omission. There was no meaningful discussion of credit model disparate impact from a legal or compliance practitioner perspective - nor any intuition as to how all of this technical wizardry mitigated the underlying fairness issue. For example, what exactly caused the disparate impact they identified? How did these causal factor(s) specifically disadvantage the protected class group(s)? How did their new algorithmic fairness methodology specifically mitigate the effects of these causal factor(s)? Did the LDA model create any problematic side-effects - such as safety-and-soundness or compliance risks?
To be fair, this criticism is not new or specific to this research - it has been one of my top issues with algorithmic fairness approaches for some time. And its persistence is likely due to one simple fact - our current algorithmic fairness toolbox is driven largely by a cadre of technologists whose work in this field is purely clinical - largely developed independent of specific use cases and reduced to sterile mathematical equations, dense theorems and proofs, generic models and predictive variables, and theoretical measurements of "discrimination" or "fairness" expressed dispassionately in mathematical notation. Their "justice" is developed and dispensed algorithmically - lacking any specific connection to the real individuals impacted by the lender's allegedly faulty credit policies and existing largely untethered from the deep body of law and interpretive opinions that govern lenders' fairness responsibilities.
Nevertheless, well-funded, technology-savvy entrepreneurs have warmly embraced these solutions for use in consumer lending - proclaiming a new era of fairer AI credit models and building businesses to deploy them at scale. But this deployment, like their earlier development, is also largely clinical - treating credit scoring inputs as simply a set of generic predictive factors, measuring a lender's legitimate business interests with a single rank-order accuracy metric, reducing the identification of disparate impact to a simple disparity statistic, and "remediating" such disparities robotically using black box algorithmic methods that are executed with the click of a button.
Lost in all of this technical wizardry is the original credit model's due process rights - being algorithmically rehabilitated without reference to the underlying law and associated legal requirements against which it is deemed in violation. And this is largely because the lawyers and regulators have been fairly absent in the growing deployment of algorithmic fairness technologies to consumer lending - passively ceding this new age of disparate impact compliance risk management to the technologists, and allowing its adoption to run ahead of the risk and compliance guardrails typically employed for such highly-regulated and high-stakes uses.
But are the lawyers really necessary? I mean, it is AI we are dealing with here - right? And the technologists’ algorithmic fairness solutions appear to work (or so we’re told).
In my opinion - yes. And I will now enter some dangerous territory explaining why - with a clear warning that I am not a lawyer and you should consult with one on the information I present below. You see, we must bring greater balance to the teams designing and implementing credit model disparate impact tests and remediation solutions. This will require adding appropriate representation from the legal, compliance, and regulator communities to the technology teams that - up to now - have largely driven the approaches and the dialogue. In the remainder of this article, I will provide my practitioner's perspectives on why this diversity and representation matters, where these new participants may want to focus their contributions, and what changes their involvement could bring to current disparate impact frameworks for credit models.
Let's dive in.
The Technologists' Three-Step Credit Model Disparate Impact Framework: What's Missing?
Most modern-day credit model disparate impact frameworks[1] are grounded in the three-step burden-shifting framework contained in HUD's 2013 Discriminatory Effects Standard.[2] In general, this three-step framework is structured as follows:
Does the credit model have an adverse impact on a credit outcome for one or more protected class groups (relative to their corresponding control groups)? For credit decisions, this step is typically implemented by evaluating the Adverse Impact Ratio ("AIR") for a sample of credit applicants. The AIR is simply the ratio of the protected class group's loan approval rate to the control group's loan approval rate - based on the scores generated by the credit model.[3] (A brief computational sketch of this calculation follows this list.) If the AIR value falls below a specific "practically significant" threshold value[4] and, for some practitioners, is also statistically significant, then the answer to this first question is yes.
Does the credit model serve the lender's legitimate business need?[5] In the technologists' implementation, this second step is effectively irrelevant because the standard by which they evaluate the credit model is tautological - i.e., is the credit model used to predict applicant creditworthiness with acceptable accuracy? If the answer to this question is yes (which it virtually always is), one proceeds directly to Step 3. As I will discuss further below, such a facile implementation of an important lender safeguard raises questions about the adherence of this overall framework to its underlying legal foundations.
Does a less discriminatory alternative ("LDA") credit model exist that has an acceptable level of predictive accuracy? This is the crux of the algorithmic debiasing process - requiring the lender to search proactively for alternative model configurations that largely meet the lender's legitimate business need but are less discriminatory (as measured by the AIR).
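For readers who prefer to see the mechanics, here is a minimal sketch - in Python, using entirely hypothetical data and a hypothetical approval cutoff - of how the Step 1 AIR screen is typically computed from model scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical applicant sample: two demographic groups with slightly
# different score distributions (illustrative only).
group = rng.choice(["protected", "control"], size=n)
scores = np.where(group == "protected",
                  rng.normal(648, 50, n),
                  rng.normal(665, 50, n))

def adverse_impact_ratio(scores, group, cutoff):
    """AIR = protected-group approval rate / control-group approval rate,
    where 'approval' means the model score clears the lender's cutoff."""
    approved = scores >= cutoff
    return approved[group == "protected"].mean() / approved[group == "control"].mean()

air = adverse_impact_ratio(scores, group, cutoff=640)
print(f"AIR = {air:.2f}")  # flagged under Step 1 if below the chosen threshold (0.8 or 0.9)
```

Note that the single number this calculation produces depends on the lender's approval cutoff and on the composition of the applicant sample as much as on the model itself - a point that becomes important below.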
From a practical perspective, the technologists' disparate impact framework effectively reduces to the following simplistic process:
Is there a "practically significant" statistical disparity (i.e., the AIR)?
If yes, proceed to algorithmic debiasing.
What I find troubling here is that this important compliance testing and remediation process appears to largely ignore the explicit requirements and safeguards contained in the underlying anti-discrimination laws and regulations - i.e., the Fair Housing Act (“FHA”), HUD's 2013 Discriminatory Effects Standard, and other federal civil rights statutes - as well as associated Supreme Court of the United States ("SCOTUS") opinions related thereto - particularly, Texas Department of Housing and Community Affairs v. The Inclusive Communities Project, Inc. ("Inclusive Communities”).
In this landmark case, decided in 2015, not only did SCOTUS affirm that disparate impact claims were cognizable under the FHA, but it also recognized that such claims, if not properly limited, could lead to the widespread adoption of race-based quotas as entities seek to proactively avoid the simple statistical disparities on which abusive disparate impact claims may be predicated.
"Without adequate safeguards at the prima facie stage, disparate-impact liability might cause race to be used and considered in a pervasive way and “would almost inexorably lead” governmental or private entities to use “numerical quotas,” and serious constitutional questions then could arise." - Inclusive Communities (emphasis mine)
To mitigate such risks, SCOTUS laid out a set of rigorous standards for such claims to proceed – standards that, in many ways, were based on those from related federal anti-discrimination laws - such as Title VII of the Civil Rights Act and the Age Discrimination in Employment Act - along with relevant SCOTUS interpretations of these laws such as Griggs v. Duke Power Co. ("Griggs") and Wards Cove Packing Co. v. Atonio ("Wards Cove").[6] Consider, for example, the following statements from the Inclusive Communities opinion related broadly to the three-step disparate impact framework:
"But disparate-impact liability has always been properly limited in key respects to avoid serious constitutional questions that might arise under the FHA, e.g., if such liability were imposed based solely on a showing of a statistical disparity." (emphasis mine)
"A disparate-impact claim relying on a statistical disparity must fail if the plaintiff cannot point to a defendant's policy or policies causing that disparity. A robust causality requirement ensures that “[r]acial imbalance . . . does not, without more, establish a prima facie case of disparate impact” and thus protects defendants from being held liable for racial disparities they did not create. Wards Cove Packing Co. v. Antonio, 490 U. S. 642, 653 (1989), superseded by statute on other grounds, 42 U. S. C. § 2000e–2(k)." (emphasis mine)
"Policies, whether governmental or private, are not contrary to the disparate-impact requirement unless they are “artificial, arbitrary, and unnecessary barriers.” Griggs, supra, at 431." (emphasis mine)
“Remedial orders in disparate-impact cases should concentrate on the elimination of the offending practice that “arbitrar[ily] . . . operate[s] invidiously to discriminate on the basis of rac[e].” (emphasis mine)
As these statements illustrate, once we consider the relevant legal frameworks underlying the technologists' algorithmic approach, we see a much more nuanced disparate impact framework - one requiring important (and reasonable) judgments and considerations that don't fit neatly into the technologists' world of one-click, automated compliance testing and remediation.
In the next sections, I explore these considerations (and others) in further detail and assess their impacts on each of the three steps in the technologists' framework. Once again, however, I point out that a professional legal interpretation may find more ambiguity here than I do, or may find that my interpretations fail to incorporate other relevant legal considerations. And that's fine, as my goal here is to highlight certain features of the technologists' three-step framework that appear inconsistent with the underlying legal foundations. I'll happily leave it to the lawyers to debate these apparent inconsistencies and advance us toward a proper, legally-grounded credit model disparate impact framework.
Evaluating Step 1 - Does the credit model have an adverse impact on a credit outcome for one or more protected class groups?
In Inclusive Communities, SCOTUS articulates important conditions for a statistical disparity to rise to the level of a prima facie disparate impact claim.[7] However, the technologists' Step 1 process appears inconsistent with these conditions - in particular:
The technologists rely solely on a broad statistical disparity - such as the AIR - as their prima facie evidence of credit model disparate impact.[8]
The technologists fail to include a requirement that a specific policy be identified as the cause of the statistical disparity.
Let's evaluate these inconsistencies in more detail.
Issue 1: Flaws in the Measurement of Potential Disparate Impact
With respect to the first point, I note that the AIR is simply the ratio of the percentage of protected class applicants who are approved under the lender's credit policies to the percentage of control group applicants who are approved. Not only does such a broad-based "fairness" statistic fail to isolate the disparity caused by the specific credit policy being challenged, it also fails to focus on the specific applicant populations relevant for such a disparate impact analysis.[9] As SCOTUS pointed out in the Wards Cove decision in relation to the Plaintiff's alleged statistical disparity in job categories:
"It is clear to us that the Court of Appeals' acceptance of the comparison between the racial composition of the cannery workforce and that of the noncannery workforce, as probative of a prima facie case of disparate impact in the selection of the latter group of workers, was flawed for several reasons. Most obviously, with respect to the skilled noncannery jobs at issue here, the cannery workforce in no way reflected "the pool of qualified job applicants" or the "qualified population in the labor force." Measuring alleged discrimination in the selection of accountants, managers, boat captains, electricians, doctors, and engineers -- and the long list of other "skilled" noncannery positions found to exist by the District Court, ... - by comparing the number of nonwhites occupying these jobs to the number of nonwhites filling cannery worker positions is nonsensical. If the absence of minorities holding such skilled positions is due to a dearth of qualified nonwhite applicants (for reasons that are not petitioners' fault), petitioners' selection methods or employment practices cannot be said to have had a "disparate impact" on nonwhites." (emphasis mine)
But this type of faulty logic is exactly what the technologists employ with the use of the AIR metric for credit decisions. By basing the statistical disparity (i.e., the AIR) on the overall population of applicants in each demographic group, they ignore whether such credit applicants are considered otherwise credit-qualified (i.e., without consideration of the specific policy being challenged for disparate impact). This flaw is further compounded by their practice of assuming that the AIR disparity is solely attributable to the credit model - even in cases where the disparity may be caused by other non-model factors such as the lender's minimum credit score threshold required for credit approval, or the lender's applicant sourcing policies and processes that may result in differentially-qualified applicant pools across demographic groups.[10] In the latter example, while disparate impact may, in fact, exist due to the lender’s discriminatory applicant sourcing policies and processes, the technologists' broad-based AIR disparity metric would erroneously attribute this discrimination to the credit model – thereby causing a misguided model-focused remediation that has nothing to do with the underlying disparate impact cause.
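To make the otherwise-qualified distinction concrete, consider the following hedged sketch. The data, the "qualified" flag, and the approval mechanics are all hypothetical; the point is simply that a raw AIR can signal a "model" disparity even when, among otherwise-qualified applicants, the model approves both groups at identical rates - i.e., when the disparity is created upstream of the model:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000
group = rng.choice(["protected", "control"], size=n)

# Hypothetical construction: upstream sourcing policies leave the protected
# group less likely to be "otherwise qualified," but among otherwise-qualified
# applicants the credit model approves both groups at the same rate.
qualified = rng.random(n) < np.where(group == "protected", 0.60, 0.80)
approved = qualified & (rng.random(n) < 0.75)

df = pd.DataFrame({"group": group, "qualified": qualified, "approved": approved})

def air(frame):
    rates = frame.groupby("group")["approved"].mean()
    return rates["protected"] / rates["control"]

print(f"raw AIR         = {air(df):.2f}")                   # ~0.75: looks like a model problem
print(f"conditional AIR = {air(df[df['qualified']]):.2f}")  # ~1.00: the model is not the cause
```

In this stylized example, algorithmically "debiasing" the credit model would do nothing to address the actual source of the disparity.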
Issue 2: Failure to Identify the Specific Policy Causing the Alleged Disparate Impact
The above example – whereby the disparate impact is actually caused by upstream policies differentially impacting the credit quality of the lender's applicant pools (rather than caused by the credit model) - reinforces SCOTUS's wisdom of requiring identification of the specific policy causing the alleged disparate impact.
“… a disparate-impact claim that relies on a statistical disparity must fail if the plaintiff cannot point to a defendant's policy or policies causing that disparity.” – Inclusive Communities.
However, even where the credit model is the likely source of the statistical disparity, we still must dig deeper to understand the specific causes. This is because the model is simply a collection of many different credit “policies” (i.e., predictive factors) based on applicant- and transaction-level attributes that either positively or negatively impact an applicant’s estimated credit risk and, therefore, the lender's credit decision. Across these attributes, some may have no (or minimal) impact on the statistical disparity at issue, some may contribute significantly to the disparity but have a strong business necessity (and/or are outside of the lender’s control), and some may contribute significantly to the disparity but are considered potentially “artificial, arbitrary, and unnecessary”. The differences across these three categories are extremely important in the Inclusive Communities opinion, but they are effectively absent in the technologists’ disparate impact framework which, instead, focuses on the credit model algorithm - as a whole - as the specific “policy”[11] – thereby precluding the need to evaluate the model’s individual predictive factors.
Based on these Step 1 considerations, I note the following:
The technologists' failure to identify the individual predictive factors responsible for the alleged statistical disparity appears inconsistent with SCOTUS's requirement that "specific" and "causal" policies be asserted for a prima facie claim under Inclusive Communities.
Even if the technologists’ model-level "policy" is considered legally compliant, I am unaware of any reason preventing a lender from identifying an individual predictive factor as the specific and causal policy. In fact, this would appear to be a preferred approach as it provides the lender with a greater range of defenses to a prima facie disparate impact case by: (1) providing a logical (and legal) basis to adjust the statistical disparity calculation to a conditional value that controls for the effects of all other factors not deemed causally linked to the alleged disparate impact, and (2) permitting an evaluation of whether the specific causal predictive factors have a strong business necessity (and/or are outside of the lender’s control) as outlined in Inclusive Communities. Overall, under this alternative approach, identifying and remediating disparate impact risk is more straightforward (and transparent) for the lender and also more consistent with more traditional credit model disparate impact assessment processes.[12]
If, because of its sheer complexity, the technologists identify the credit model – as a whole - as the specific “policy” causing the statistical disparity, then this position may be in conflict with other regulatory compliance requirements. That is, if a credit model is considered too complex to allocate accurately the statistical disparity among the hundreds or thousands of individual model predictive factors, then such a lack of transparency would also seem inconsistent with the lender's simultaneous position that sufficient explainability exists to produce accurate Adverse Action Notifications under Regulation B. Given the recent CFPB focus on this latter issue[13], this may be an unwise position to take.
Evaluating Step 2 - Does the credit model serve the lender's legitimate business need?
As discussed previously, Step 2 of the technologists’ disparate impact framework is governed by a “decision” rendered moot by a tautology that leads all credit models with a significant AIR disparity to a single outcome – algorithmic debiasing (i.e., the automated search for LDA credit models).[14] This occurs primarily due to two positions taken by the technologists:
The technologists identify the credit model in its entirety as the specific “policy” causing the disparate impact – thereby keeping the analysis of disparate impact at the aggregate model level and precluding the need to deal with the messy complexity of the individual predictive factors created by their AI methodologies. This practice effectively opens up all of the model’s predictive factors to the Step 3 LDA analysis whether or not such factors are the specific causes of the statistical disparity or have valid individual business necessity defenses.
The technologists interpret Step 2 of HUD’s 2013 Discriminatory Effects Standard as requiring the lender not only to evaluate the business necessity of the credit model – but also requiring that the lender show proactively that these business needs cannot be satisfied with a less discriminatory alternative model.[15]
Let's evaluate these positions in more detail.
Issue 1: Failure to Narrow the Step 2 Analysis to the Specific Factors Causing the Alleged Statistical Disparity
The technologists’ practice of performing the Step 2 analysis at the overall model level significantly dilutes the legal objective of the disparate impact framework – which is the identification and remediation of specific lender policies that create “artificial, arbitrary, and unnecessary barriers” to credit for one or more protected class groups. As SCOTUS recognized in its Inclusive Communities opinion, safeguards are required to prevent abusive disparate impact claims based solely on statistical disparities.
“Governmental or private policies are not contrary to the disparate-impact requirement unless they are “artificial, arbitrary, and unnecessary barriers.” Griggs, 401 U. S., at 431.”
“If a statistical discrepancy is caused by factors other than the defendant's policy, a plaintiff cannot establish a prima facie case, and there is no liability.”
“Remedial orders in disparate-impact cases should concentrate on the elimination of the offending practice that “arbitrar[ily] . . . operate[s] invidiously to discriminate on the basis of rac[e].”
By requiring the identification of specific causal policies driving the alleged statistical disparity, SCOTUS clearly is not requiring all credit policies within a lender's decision process to be subject to LDA analysis – which is what the technologists’ approach effectively does. Rather, the intent is to identify the specific credit policies within this overall decision process whose “business necessity” is considered tenuous and whose relationship to the business interest at issue (e.g., credit risk management) is questionable. Supporting this position is the following statement from HUD’s 2013 Discriminatory Effects Standard:
“ … when the Inclusive Communities Court quoted Griggs' decades-old formulation that disparate impact claims require the removal of artificial, arbitrary, and unnecessary barriers, it did so as part of restating the safeguards and requirements that it found (and HUD agrees) have always been a part of disparate impact jurisprudence. In this context, the Court quoted Griggs' short-hand formulation for the type of policy that traditionally has been held to create an unjustified discriminatory effect at the end of the burden shifting analysis. HUD believes that Inclusive Communities, following Griggs as well as earlier Fair Housing Act cases, went on to describe policies invalidated by longstanding precedent as either “arbitrary” or “artificial” as a shorthand for those found to violate the Fair Housing Act under traditional jurisprudence.” (emphases mine)
The assessment of a credit policy's "business necessity" is inherently a judgmental exercise requiring deep knowledge of the decision process, the individual “policies” comprising the decision process, and the business interest(s) these policies are designed to achieve – which is likely why this step is the lender’s burden.
By failing to perform this deeper analysis, the technologists’ framework contradicts SCOTUS’s desire to place reasonable limitations on actionable disparate impact claims – allowing all of the lender’s credit policies as embedded within the credit model algorithm to be open to “second guessing” - rather than just the subset of credit policies whose relationship to the borrower's repayment behavior is considered causally tenuous.
Furthermore, by opening up the entire credit model to LDA analysis via model-level algorithmic debiasing processes, the technologists' Step 2 process: (1) appears inconsistent with SCOTUS’s requirement that "remedial orders" be concentrated specifically on the arbitrary “offending practice” and (2) may generate an LDA credit model that may not, in fact, be the “least” discriminatory alternative. That is, should the Step 3 algorithmic debiasing process find an LDA credit model with an improved AIR (and “acceptable” predictive accuracy), this LDA may not necessarily be the least discriminatory if the AIR improvement comes at the expense of fairness to the control group (that is, if it remediates the original statistical disparity through model changes that unjustifiably disadvantage the control group).[16]
Issue 2: How Is a Lender to Prove a Negative?
Disparate impact’s burden-shifting framework is decidedly litigation-oriented. That is, it is expressed in terms of the requirements or responsibilities of two parties - a plaintiff and a defendant (hence, the “shifting” of the burden from one party to another). However, within a fair lending compliance risk management process, there is only one party – the lender. Accordingly, applying this framework to a proactive compliance risk management realm is not straightforward – particularly where, as in Steps 2 and 3, the responsibilities of the two parties are in opposition. For example:
If a meaningful statistical disparity exists and the lender has identified the specific causal factors within the credit model responsible for these disparities, is the lender supposed to assume the conflicting roles of both the Plaintiff and the Defendant for Steps 2 and 3?
If the lender believes it has a sufficient basis to support the business necessity of these factors, are they supposed to self-challenge this belief by looking for potential less discriminatory alternatives?
If yes, how is the lender supposed to prove a negative (i.e., there are no less discriminatory alternatives)?
The technologists effectively ignore these conceptual issues by simply requiring the lender to be responsible for both Steps 2 and 3 which, as I have stated previously, appears to render toothless the entire business necessity defense and corresponding safeguard that is the whole purpose of Step 2. And, no, HUD doesn’t appear to require or necessarily endorse this interpretation as can be seen in the Commentary accompanying HUD’s 2013 Discriminatory Effects Standard:
“HUD declines to place the step three burden on defendants. As explained in 2013, this rule's burden-shifting scheme is consistent with the majority view of courts interpreting the [Fair Housing] Act as well as the Title VII discriminatory effects standard codified by Congress in 1991, and the discriminatory effects standard under ECOA, which borrows from Title VII's burden-shifting framework. As HUD has explained, all but one of the federal appeals courts to address the issue have placed the burden at the third step on the plaintiff. HUD additionally notes the significant overlap in coverage between ECOA, which prohibits discrimination in any aspect of a credit transaction, and the Fair Housing Act, which prohibits discrimination in housing and residential real estate-related transactions. Thus, under the rule's framework, in litigation involving claims brought under both the Fair Housing Act and ECOA, the parties and the court will not face the burden of applying inconsistent methods of proof to claims based on the same underlying facts. Having the same allocation of burdens under the Fair Housing Act and ECOA will provide for less confusion and more consistent decision making by courts. Moreover, HUD continues to believe that this framework makes the most sense because it does not require either party to prove a negative.” (emphases mine)
The last sentence here is important. As I have raised previously in my article, “Six Unanswered Fair Lending Questions Hindering AI Credit Model Adoption”, how far must a lender go in searching for LDA credit models?
Must the lender search through all possible combinations of the predictive attributes for LDAs? What if there are 50 predictive attributes and roughly 1,125,900,000,000,000 (2^50) possible combinations? (See the arithmetic sketch following this list.)
Must the lender include all potential attribute interactions in its search - even though the original model relationships are strictly linear? What about other potential functional forms?
Must the lender search for and consider additional data attributes outside of the original model development sample? For example, if there are 20 predictive attributes in the original model development dataset, is the lender expected to search for additional attributes to add to the original dataset? If so, how much search is expected?
Must the lender explore additional model methodologies outside of the one used for the original model? That is, if the lender employed logistic regression, is the lender now expected to consider a random forest model? a gradient boosted tree model? If so, for how many different model methodologies is the lender expected to search?[17]
How robust must the LDA's improved fairness be? How many samples must be evaluated? What if the LDA's improved fairness is unstable across samples? How much compute cost is considered reasonable in evaluating the fairness of multiple LDA candidates across multiple samples?[18]
How much hyperparameter tuning / searching is expected? Must the lender explore all possible hyperparameter configurations?
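As a rough arithmetic sketch of the first question in the list above (the 50-attribute figure is purely hypothetical), an exhaustive subset search scales as 2 to the power of the number of attributes:

```python
k = 50                 # hypothetical number of predictive attributes
subsets = 2 ** k       # every possible include/exclude combination
print(f"{subsets:,}")  # 1,125,899,906,842,624 -- roughly 1.1 quadrillion candidate models

# Even if a lender could fit and fairness-test one candidate model per
# millisecond, exhausting the search would take on the order of 35,000 years.
years = subsets / 1_000 / (60 * 60 * 24 * 365)
print(f"{years:,.0f} years")
```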
Ultimately, under this framework, the lender is put into the impossible position of proving a negative (i.e., there are no acceptable LDA credit models) – which is different than HUD’s framework where the Plaintiff is responsible for showing that there is an acceptable LDA (i.e., proving a positive). These two responsibilities are very different. And lenders hesitating to adopt this Step 3 responsibility due to the absence of legal guardrails that make this a finite and reasonable exercise are right to be cautious. Nevertheless, if a lender wishes to voluntarily adopt a policy to perform an LDA search, that’s certainly their prerogative; however, this ultimately may be an elective choice – not a compliance responsibility.[19]
Evaluating Step 3 - Does an LDA credit model exist that also meets the lender's legitimate business need?
Under the technologists’ disparate impact framework, Step 3 involves the use of algorithmic debiasing methodologies to remediate the statistical disparities identified in Step 1. As I have written extensively in prior articles, algorithmic debiasing is a black box computational methodology that alters the original credit model’s estimated relationships (i.e., how each specific predictive factor influences the likelihood of borrower default) to find an alternative set of predictive relationships that yield a lower statistical disparity and an “acceptable” degree of model accuracy. For those unfamiliar with this approach, I refer you to my article “Fool’s Gold? Assessing the Case For Algorithmic Debiasing” for background and discussion.
With respect to disparate impact remediation, SCOTUS stated the following in Inclusive Communities:
“And when courts do find liability under a disparate-impact theory, their remedial orders must be consistent with the Constitution. Remedial orders in disparate-impact cases should concentrate on the elimination of the offending practice, and courts should strive to design race neutral remedies. Remedial orders that impose racial targets or quotas might raise difficult constitutional questions.” (emphasis mine)
And HUD offered the following comment in its 2013 Discriminatory Effects Standard:
“As HUD explained in the 2013 Rule, the framework in this rule does not allow plaintiffs to impose untenable policies upon defendants because it still requires the less discriminatory alternative to “serve the defendant's [substantial, legitimate, nondiscriminatory stated] interests.”” (emphasis mine)
Let's explore these considerations in more detail.
Issue 1: Algorithmic Debiasing Fails to Narrow the Step 3 LDA Analysis to the Specific Factors Causing the Alleged Statistical Disparity
As I have stated previously, because it operates at the overall model level, algorithmic debiasing has free rein to alter the original credit model's predictive relationships in whatever manner achieves the objective of producing a "fairer" set of model outputs (typically measured by the AIR) with minimal impacts to the credit model's overall predictive accuracy (typically measured by the AUC statistic). As such, there is no assurance that algorithmic debiasing focuses its remediation on the specific credit policies (model factors) causing the original conditional disparity - particularly when, as in the technologists' Step 1 process, no specific causal factors are even identified.
Issue 2: Algorithmic Debiasing Appears to be Inconsistent With Certain Elements of Inclusive Communities
Elimination of the Offending Practice
As I discuss in more detail in the article “The Road to Fairer AI Credit Models: Are We Heading in the Right Direction?”:
“Debiasing algorithms typically operate unfettered - that is, they ruthlessly seek an optimal mathematical solution to the fairness problem by adjusting the weights on whatever predictive variables are necessary. It matters not to the algorithm what each variable represents conceptually from a credit risk perspective, how critical that variable is to the lender's credit underwriting policy, whether that variable has a strong causal connection to borrower default behavior, whether the variable has previously been criticized (or not) by regulators for use in credit underwriting, or what true form the estimated credit risk relationship of that variable should inherently take (e.g., positively or negatively related to default risk, monotonically increasing or decreasing, etc.).”
Given this unfettered freedom to adjust the original credit model and the lack of transparency into what specifically the debiasing algorithm has changed within the LDA model, lenders generally do not know exactly how the original credit model’s statistical disparity was remediated. All they tend to know is how much the original disparity was reduced and what the quantitative impact was on a certain measure of the model’s predictive accuracy. But this raises the important question as to whether this black box remediation approach actually remediates the specific actionable disparate impact identified in Steps 1 and 2, or whether it inadvertently remediates other non-actionable contributors to the statistical disparity. Additionally,
Because many algorithmic debiasing approaches keep all of the original predictive factors in the LDA credit model (albeit with changed weights), one cannot be sure that the remediation has actually addressed the specific causal credit factor(s) responsible for the disparity.[20]
Because the AIR is calculated based on an assumed credit score threshold, the algorithmic debiasing process may actually be adjusting the credit model to remediate AIR disparities wholly or partially driven by the lender's assumed credit score threshold - an improper solution (illustrated in the sketch below).
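Here is a small sketch of that threshold dependence, again with hypothetical score distributions: the credit model is held fixed while the lender's approval cutoff - a credit policy that sits outside the model - is varied, and the measured AIR moves with it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
group = rng.choice(["protected", "control"], size=n)

# Hypothetical scores from one fixed model, with modestly different
# distributions across the two groups (illustrative only).
scores = np.where(group == "protected",
                  rng.normal(650, 45, n),
                  rng.normal(662, 45, n))

def air_at(cutoff):
    approved = scores >= cutoff
    return approved[group == "protected"].mean() / approved[group == "control"].mean()

for cutoff in (600, 640, 680, 720):
    print(f"cutoff {cutoff}: AIR = {air_at(cutoff):.2f}")
# The scores never change, yet the measured disparity does -- so a debiasing
# routine tuned to the AIR may end up "fixing" the lender's cutoff policy
# through the model.
```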
All of these features of the technologists' Step 3 remediation process appear inconsistent with SCOTUS's Inclusive Communities safeguards noted above.
Race Neutral Remedies
Because algorithmic debiasing – in its typical form – does not constrain the LDA credit model to focus on the specific causal credit factor(s) driving the disparity, it is possible and, perhaps, likely that the LDA credit model may simply improve fairness via a latently-encoded reverse disparate impact - thereby exposing the lender and its consumers to certain unintended legal and compliance risks. That is, the LDA credit model may be overriding the original credit model's default risk estimates for applicants with credit profiles exhibiting certain demographic correlations - even if such estimates are accurate - in order to achieve the algorithm's objective of improved approval rate equity. Such a remedial approach would appear to implicate precisely the concerns SCOTUS raised about non-race-neutral remedies that pose “difficult constitutional questions” - and to run counter to HUD’s comment in its 2013 Discriminatory Effects Standard that:
“… there is nothing in step three, or any other part of the proposed rule that requires the plaintiff or anyone else to resort to any type of discrimination. To the contrary, step three encourages defendants to utilize practices that have the least discriminatory effect because of any protected characteristic.” (emphasis mine)
That is, the LDA credit model would likely not meet the definition of the least discriminatory alternative if remediation is implemented via reverse disparate impact.
Issue 3: Contrary to Popular Belief, the CFPB May Not Expect Lenders to Perform Algorithmic Debiasing
Some proponents of algorithmic debiasing make reference to public remarks made by Ms. Patrice Ficklin, Assistant Director of the CFPB’s Office of Fair Lending and Equal Opportunity, in which she reminded lenders of the need to conduct rigorous fair lending testing of their models – including for disparate treatment and disparate impact – as well as searching for less discriminatory alternative models.[21] Seizing on the latter phrase, these proponents have interpreted this to mean the use of algorithmic debiasing methodologies.
Having read and listened to these remarks, I believe that this interpretation may be misleading, and I wish the CFPB would publicly clarify Assistant Director Ficklin’s remarks as they carry significant weight in the industry and bear upon an area with considerable risk and uncertainty right now. With that said, I believe these remarks could be interpreted in a much less revolutionary manner.
Credit model disparate impact testing has been part of bank fair lending compliance risk management programs for decades, and Assistant Director Ficklin's remarks appear to acknowledge this by stating: “regardless of whether firms use more traditional methods or newer techniques they're still responsible for robust fair lending testing.” Additionally, a lender's consideration of less discriminatory alternative models has been an integral part of these compliance testing programs over these years – as lenders identify specific predictive model factors that create elevated disparate impact risk, evaluate their business necessity (e.g., their causal relationship to borrower default behavior, the strength of their statistical relationship with borrower default, and how important they are to the model’s overall predictive accuracy), and consider less discriminatory alternatives where warranted – such as removing the factor from the model or measuring the model factor in a different way that has less disparate impact.
My point here is that, without further clarification from the CFPB, these remarks could also simply be reminding lenders of these traditional requirements for disparate impact testing - independent of any specific methodological approach.
Issue 4: Algorithmic Debiasing is a Relatively New and Untested Methodology For LDA Search
As I previously wrote in “The Road to Fairer AI Credit Models: Are We Heading in the Right Direction?”, algorithmically-driven LDA credit models have not undergone the rigorous, in-depth due diligence process that would be expected for such a high-stakes use within a highly-regulated industry. And, based on my own research, there appear to be a number of important risks – from both a safety-and-soundness and a compliance perspective – that have been largely ignored by those promoting the methodology.
Issue 5: Some Proponents Advocate LDA Acceptance Criteria That are Not Conceptually Sound and May Amount to “Second-Guessing”
Clearly, there exists great uncertainty about how to measure whether an LDA credit model serves a lender’s business necessity (or “substantial, legitimate, non-discriminatory interests,” per HUD). And I do not claim to have an answer to this. However, I do note that one potential answer that has been making the rounds is troubling both in its lack of conceptual soundness and in its seeming “second-guessing” of the lender’s stated interests.
In April 2024, the Fair Lending Monitor (“Monitor”) for Upstart Network (“Upstart”) released its Final Report containing an in-depth analysis of Upstart’s credit model for potential fair lending issues – including disparate impact. As part of this analysis, the Monitor engaged technical consultants to search for LDA credit models using algorithmic debiasing in order to remediate a “significant” AIR disparity for Black applicants. According to the Monitor’s Third Report - which laid out its methodologies in more detail - the Monitor developed the following approach to determine whether an LDA credit model would be acceptable to Upstart (i.e., meet its legitimate business interests):
“Presumably, before it deploys a model Upstart understands there is some uncertainty regarding how that model will perform and implicitly makes a business decision that performance within that range would reasonably achieve its legitimate business interests. We believe there is a likelihood that a court would find that Upstart’s legitimate business interests could “reasonably be achieved as well” by a model whose expected range of predicted Error Metric values generally falls within that expected range of the Baseline Model.”
“Understanding that the viability of an approach is context-dependent, we ultimately recommended an approach based on the belief that there is a significant likelihood that a court would find that a less discriminatory alternative model could serve Upstart’s legitimate business needs as well as the Baseline Model if there is a reasonable probability that the performance of that alternative would fall within the likely performance range of the Baseline Model. Accordingly, we identified a measure of that range for Upstart’s Baseline Model—what we referred to as the Uncertainty Interval—and presumed an alternative model within that range would reasonably achieve Upstart’s business interests as well as the Baseline.” (emphasis mine)
“We rely on the narrower 68% Uncertainty Interval (here, +/–10 points) because the counterarguments that an alternative model within that range would not be viable are weaker. … Accordingly, we would recommend an alternative model that has a less discriminatory effect if its average Error Metric falls within the 68% Uncertainty Interval (equal to the standard error) of the Baseline cross-validation.” (emphasis mine)
Since publication, some industry participants have referenced this approach as a potentially useful way - in more general contexts - to evaluate the acceptability of LDA credit models. However, in my opinion, this would be a mistake as the approach (as I understand it from the report) appears to contain an important conceptual flaw.
In general, the Monitor’s use of an “Uncertainty Interval” is predicated on the familiar statistical tool called a “confidence interval” in which the uncertainty of a statistical metric – such as an accuracy metric – is quantified. The figure below is a general example of a confidence interval.
Intuitively, the confidence interval quantifies the likelihood of observing a range of values that a given metric may take across different data samples. For example, in the context of a credit model, a rank-order accuracy metric such as the AUC may be 0.86, on average, across the data samples used to validate the model, but differ slightly from this 0.86 average value when computed for specific data sample subsets - such as monthly loan origination samples - due to random influences. The key here is that, on average, the lender expects an AUC accuracy value of 0.86 - but acknowledges that there is a certain probability (or likelihood) of observing AUC values that are either higher or lower than this average. The confidence interval quantifies this uncertainty – indicating, for example, that 68% of the time the lender should expect AUC values that fall within the range of 0.76 to 0.96.
With this context, the Monitor’s logic is as follows. Since the lender accepts that the original credit model's accuracy statistic (AUC) can vary between 0.76 and 0.96, then an LDA credit model with an average AUC that falls within this range should also be acceptable to the lender. For example, if an LDA credit model exhibits an average AUC of 0.80, then this model should meet the lender’s legitimate business interests even though it is lower than the 0.86 average AUC of the lender’s original credit model.
The primary conceptual flaw with this logic is that the Monitor ignores the probabilities associated with this “acceptable” model performance range. That is, the AUC values within this range are not all equally likely to occur. In fact, AUC values that are further away from the 0.86 average are increasingly unlikely – which is why the lender is willing to tolerate periodic performance measures as low as 0.76: such outcomes should be fairly infrequent and non-persistent.
Now, if you tell a lender that you have an alternative credit model whose expected AUC value is 0.80, how do you think it would respond? The Monitor would say that the lender would find it acceptable, if not ideal, since it falls within the Uncertainty Interval range of the original model. The lender knew such a performance level could occur and yet still adopted the original model. However, having an alternative model whose expected AUC value is 0.8 is very different from having a model that has a relatively low likelihood of periodically generating a 0.8 AUC value. In the LDA case, the 0.8 AUC value is the most likely accuracy performance the lender would experience, while in the original credit model context, the 0.8 AUC value occurs relatively infrequently.
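A quick simulation sketch - using only the stylized AUC figures from the discussion above, not any actual model results - makes this distinction concrete:

```python
import numpy as np

rng = np.random.default_rng(3)
draws = 100_000

# Stylized sampling distributions: baseline model AUC centered at 0.86, the
# candidate LDA centered at 0.80, both with a standard error of 0.10 (matching
# the +/-0.10 "68% Uncertainty Interval" described above).
baseline = rng.normal(0.86, 0.10, draws)
lda = rng.normal(0.80, 0.10, draws)

print(f"baseline: mean AUC = {baseline.mean():.2f}, "
      f"P(AUC <= 0.80) = {(baseline <= 0.80).mean():.0%}")  # an occasional bad sample (~27%)
print(f"LDA:      mean AUC = {lda.mean():.2f}, "
      f"P(AUC <= 0.80) = {(lda <= 0.80).mean():.0%}")       # the typical outcome (~50%)

# Both models "fall within" the baseline's 0.76-0.96 interval, but the lender's
# expected performance drops from 0.86 to 0.80 -- precisely the probability
# difference the Uncertainty Interval criterion glosses over.
```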
For this reason, I believe this is a flawed approach. Additionally, I note that Upstart pushed back on this approach and provided specific arguments for why it would not adopt an LDA credit model whose performance fell within this range. The Monitor ultimately did not agree with Upstart and continued to recommend their approach. While I certainly understand that each side of the dispute believes it has valid reasons for its position, I fear that approaches such as these lead us down a path of the type of second-guessing that SCOTUS desired to limit in its disparate impact decisions.
What's the Alternative?
Taking all of the above into consideration leads me to an alternative framework - depicted in Figure 1 below - that represents a more nuanced and reasoned disparate impact compliance evaluation - not a purely clinical data exercise that can be executed with a single mouse click.
Let's break down this alternative approach.
Feature 1: Disparate Impact Analysis Occurs at the Individual Model Factor Level
For the many reasons discussed previously, I believe that an analysis occurring at the specific factor level has a number of advantages:
It is consistent with the Inclusive Communities requirement to identify specific factors causing the statistical disparity.
It reinforces the need for lenders to know what data/information they are using to assess applicant credit risk - helping to provide more transparency into black box model methodologies.
It facilitates specific actions in Steps 2 and 3 consistent with Inclusive Communities' and HUD's focus on eliminating "artificial, arbitrary, and unnecessary barriers" to credit access.
It is consistent with long-standing traditional credit model disparate impact testing processes.
Cue the push-back.
But my credit model has hundreds (or thousands) of predictive factors! This is infeasible!
My best response to this push-back is to ask: Why does your credit model have hundreds or thousands of predictive factors?
While such a large number of factors may be a winning marketing position (Our models consider thousands of data points! FICO only uses 20-30!), the reality is that such a vast number of predictive factors is seldom, if ever, statistically or conceptually justified. Frankly, in nearly all cases, there is an incredible degree of redundancy across these factors - redundancy that could easily be reduced through various types of variable reduction techniques used for decades in credit scoring model development. Even for the non-redundant factors, it is highly likely that many may provide little, if any, statistically relevant predictive power.
However, winnowing down the master database of model development data to a non-redundant, statistically relevant subset of core predictive factors requires extra work, manual intervention, and professional judgment - whereas many automated machine-learning methodologies make it easy to just let the algorithm decide how to weight the importance of all the factors - even if many of the weights are de minimis. So, to those who suggest factor-specific disparate impact testing is impractical in this new age of AI where practically unlimited data can be used for prediction, I would suggest that such models can also create elevated safety-and-soundness risks for the lender - about which I've previously written - as well as more elevated risks of disparate impact due to the sheer number of redundant or unnecessary predictive factors. With more attention given to prudent model risk management, model pruning, and parsimony, such models would likely contain significantly fewer predictive factors and, therefore, facilitate a more effective factor-specific disparate impact testing process - with a side benefit of being relatively easier to explain and manage.
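For illustration, here is a minimal sketch - with simulated, hypothetical data - of one such long-standing variable-reduction step (greedy correlation pruning), showing how a few hundred largely redundant attributes can collapse to a much smaller core set:

```python
import numpy as np
import pandas as pd

def prune_redundant(X: pd.DataFrame, max_corr: float = 0.90) -> list:
    """Greedily keep a factor only if its absolute pairwise correlation with every
    already-kept factor is below max_corr. This is one of many variable-reduction
    techniques used in credit scoring development for decades; a real build would
    also use clustering, information value, VIF, and judgmental review."""
    kept = []
    corr = X.corr().abs()
    for col in X.columns:
        if all(corr.loc[col, k] < max_corr for k in kept):
            kept.append(col)
    return kept

# Hypothetical example: 200 raw attributes that are mostly noisy copies of
# 20 underlying signals.
rng = np.random.default_rng(4)
base = rng.normal(size=(5_000, 20))
X = pd.DataFrame(base[:, rng.integers(0, 20, size=200)]
                 + rng.normal(scale=0.1, size=(5_000, 200)),
                 columns=[f"attr_{i}" for i in range(200)])

print(len(X.columns), "->", len(prune_redundant(X)))  # 200 -> roughly 20 core factors
```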
Feature 2: The Model's Differential Predictive Accuracy Across Demographic Groups is Used to Measure Disparate Impact
To avoid the flaws described above for the AIR, the most relevant fairness metric for a credit model, in my opinion, is the model's relative predictive error for a protected class group relative to its corresponding control group. This is because a credit model is designed to estimate an applicant's credit risk - usually measured as the probability of default (or serious delinquency) over a certain period of time since loan origination. A disparate impact for such a model would arise if its overprediction (underprediction) error for a protected class group's default rate is relatively larger (smaller) than its overprediction (underprediction) error for the associated control group. It is this specific model output that may directly disadvantage a protected class group - estimating them to be relatively more risky than they actually are and, therefore, creating a model-driven credit access barrier that should be remediated from a credit model perspective. While other credit decision policies - such as credit score thresholds - can also have a disparate impact, the remediation for those policies should focus on relevant causal features of those policies - not the credit model itself.
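A minimal sketch of what this group-level comparison could look like in practice (hypothetical portfolio and column names; there are certainly other reasonable implementations) is to compare each group's average predicted default probability to its observed default rate:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 30_000

# Hypothetical scored portfolio with observed outcomes. By construction, the
# model over-predicts default for the protected group by roughly 20%.
df = pd.DataFrame({
    "group": rng.choice(["protected", "control"], size=n),
    "predicted_pd": rng.uniform(0.01, 0.25, n),
})
true_pd = np.where(df["group"] == "protected",
                   df["predicted_pd"] / 1.2,
                   df["predicted_pd"])
df["defaulted"] = rng.random(n) < true_pd

calib = df.groupby("group").agg(mean_predicted=("predicted_pd", "mean"),
                                observed_rate=("defaulted", "mean"))
calib["overprediction"] = calib["mean_predicted"] / calib["observed_rate"] - 1
print(calib.round(3))
# A materially larger overprediction error for the protected group than for the
# control group is the model-driven credit barrier this metric is meant to surface.
```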
If the specific model factor does not contribute to a "significant" disparity using this metric, then the disparate impact assessment concludes for this factor. Alternatively, if there is a significant disparity contribution, then the disparate impact analysis advances to Step 2 to assess the "business necessity" of this factor.
Feature 3: The Lender is Not Required to Prove a Negative
By piercing the surface of the credit model to focus on the specific predictive factors contributing to the Step 1 disparity, we obtain a more meaningful Step 2 process whereby the “business necessity” of each predictive factor is evaluated for conceptual soundness and statistical validity. Further, rather than all predictive factors proceeding to Step 3 for potential LDA remediation, this alternative Step 2 process allows certain causal factors to be excluded consistent with SCOTUS’s goal to: (1) limit disparate impact liability to prevent abuses and (2) not displace valid private or governmental policies or requirements.
In the present context of credit underwriting, this means that standard, commonly-used, core indicia of borrower creditworthiness – such as proof of identity, certain ability-to-pay requirements, and considerations of adequate loan collateralization – would be excluded from Step 3 LDA remediation even if they contribute to the Step 1 disparity value. I note, however, that the lender should still ensure that: (1) it maintains a list of approved core credit underwriting factors for which sound justification has been documented - such as requirements by law or regulation, or standard credit policy factors from investors to whom the loans will be sold, and (2) consistent with the 1994 Joint Policy Statement on Discrimination in Lending, the lender periodically considers whether less discriminatory versions of these core credit indicia are feasible.
"Lenders will not have to justify every requirement and practice every time that they face a compliance examination. The Agencies recognize the relevance to credit decisions of factors related to the adequacy of the borrower's income to carry the loan, the likely continuation of that income, the adequacy of the collateral to secure the loan, the borrower's past performance in paying obligations, the availability of funds to close, and the existence of adequate reserves. While lenders should think critically about whether widespread, familiar requirements and practices have an unjustifiable disparate impact, they should look especially carefully at requirements that are more stringent than customary. Lenders should also stay informed of developments in underwriting and portfolio performance evaluation so that they are well positioned to consider all options by which their business objectives can be achieved." - 1994 Joint Policy Statement on Discrimination in Lending
Accordingly, Step 3 LDA remediation efforts would focus only on those causal policies / factors that are the most questionable when it comes to business necessity and, therefore, more likely to fall within the “artificial, arbitrary, and unnecessary” category that both SCOTUS and HUD highlight.[22] As, admittedly, there is no formal delineation of this category of which I am aware, this will require lenders to establish such a definition in their fair lending compliance policies (with appropriate governance) and require credit model owners (in conjunction with Compliance / Legal professionals) to perform reasonable evaluations of predictive model factors – and maintain appropriate documentation of these evaluations – to identify those whose risk and uncertainty from a business necessity perspective require a further Step 3 LDA analysis.
Feature 4: A Reconsideration of Algorithmic Debiasing
For those model factors demonstrating a "significant" disparate impact as defined above, and lacking a strong conceptual foundation and/or technical validity, Step 3 looks to see: (1) whether less discriminatory alternative measures of these factors are feasible, and (2) if so, whether such alternatives create unacceptable risks or costs for the lender. Notably, the LDA credit models created in this Step 3:
Involve either the elimination or modification of the factor(s) at issue - consistent with Inclusive Communities language that remediation be focused on the "offending practice".
Unlike algorithmic debiasing, do not involve the modification of other model predictive factors unrelated to the specifically-identified disparate impact.
Do not remediate the specifically-identified disparate impact via model modifications that effectively inject a reverse disparate impact to the model results.
With respect to whether a potential LDA creates unacceptable risks or costs to the lender, that is something that should also be clearly delineated in the lender's fair lending compliance policy and be subject to appropriate governance. In particular, the lender should be transparent about the specific criteria it uses to evaluate these risks and costs - as well as about the thresholds or indicia that may indicate that such incremental risks and/or costs are unacceptable. I also note that - consistent with my previous thoughts on this topic - the acceptability of LDA credit models cannot reasonably be assessed based on a single measure of model predictive accuracy, nor on predictive accuracy alone. There are other very important considerations involved in the safe and sound, legally-compliant deployment of a credit model that also need to factor into this assessment.
Final Thoughts
So is this alternative disparate impact assessment process the one that all consumer lenders should use?
Not necessarily, but I think it’s a sorely needed alternative viewpoint grounded - not in technology - but in the applicable legal frameworks that govern credit model disparate impact. And I do believe we need to consider and discuss alternatives - if only to ensure we perform proper due diligence on new technologies being deployed in high-stakes uses. As I have written previously, there are many unresolved issues with the technologists’ current disparate impact remediation approach, even though it continues to be promoted by fintech entrepreneurs, certain consumer protection organizations, and well-meaning fair lending compliance professionals. But ask yourself this – have you seen any marketing materials or research papers analyzing the post-origination results of LDA credit models? Is there any reliable, objective analysis of how these loans have actually performed - both absolutely and relative to initial expectations - at the aggregate and individual demographic group levels? Why not?
In my opinion, the current path we are on is fraught with peril for both lenders and consumers – mainly because it has been largely technology-driven. But the world of consumer lending is more complex and nuanced than that framing allows, and the environment we operate within is highly regulated, with specific legal requirements that must be considered.
Is this proposed credit model disparate impact framework the right answer? Maybe or maybe not, but I believe its focus on the underlying legal framework is a step in the right direction.
* * *
ENDNOTES:
[1] See, for example:
Fair Lending Monitorship of Upstart Network’s Lending Model: Fourth and Final Report of the Independent Monitor, March 27, 2024 - "There are three steps involved in determining whether a policy or practice—here, the use of a model—has an unlawful disparate impact:"
An Introduction to Artificial Intelligence and Solutions to the Problems of Algorithmic Discrimination, 2019 - "Our general approach focuses on evaluating model outcomes for disparate impact discrimination risk. The approach mirrors the legal burden-shifting test that originated in employment law and has become the relevant legal test for disparate impact discrimination."
[2] Throughout this article, my references to HUD's 2013 Discriminatory Effects Standard specifically means the 2023 reinstatement of the 2013 Standard.
[3] Technically, the AIR also depends on certain other factors - including the lender's specific credit score threshold for loan approval (i.e., the minimum acceptable credit score) which is governed by the lender's credit risk management policy. I will discuss this point further later in the article.
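As a simple illustration of this threshold dependence (using hypothetical, normally-distributed score distributions of my own construction, not actual lending data):

```python
# Illustrative sketch: with the model's score distributions held fixed, the AIR
# still varies with the lender's approval cutoff. Distributions and cutoffs are
# hypothetical.

import numpy as np
rng = np.random.default_rng(0)

control_scores = rng.normal(680, 60, 100_000)      # hypothetical score distributions
protected_scores = rng.normal(660, 60, 100_000)

for cutoff in (600, 640, 680, 720):
    air = (protected_scores >= cutoff).mean() / (control_scores >= cutoff).mean()
    print(f"cutoff {cutoff}: AIR = {air:.2f}")
```

In this construction, the AIR falls as the cutoff tightens, even though the model itself has not changed.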
[4] The technologists initially adopted an AIR threshold of < 0.8 to determine whether a "practically significant" adverse impact is present - based on the EEOC's "four-fifths" rule. More recently, however, Upstart's Fair Lending Monitor adopted a more aggressive AIR threshold of < 0.9, arguing: "In addition to prompting more frequent searches for less discriminatory models, a more conservative practical significance threshold (i.e., 90% AIR) is sensible when dealing with a fully automated model where the model is relatively easy to validate and the effects of model inputs on the model outcome can be defined and adjusted with some precision." See Fair Lending Monitorship of Upstart Network’s Lending Model: Second Report of the Independent Monitor, November 10, 2021. I find this higher threshold questionable, as it rests on a judgmental and largely unsupported basis.
[5] In this article, the terms “business necessity”, “legitimate business need”, “valid interest”, and "substantial, legitimate, nondiscriminatory interests" are used interchangeably with acknowledgement that these terms may vary slightly in their precise legal interpretations due to their derivation from different legal artifacts.
[6] In what follows, I highlight certain language in SCOTUS’s Inclusive Communities opinion and interpret this language within the context of consumer credit model disparate impact. However, it is important to note that – since the 2015 opinion – certain federal courts, HUD, and legal professionals have offered various interpretations of this language - making clear that there is uncertainty as to its precise meaning. Accordingly, my critique of the technologists’ credit model disparate impact framework, and my recommendations for an alternative framework, should be understood to lie within this range of interpretive variability.
[7] In practical terms, a prima facie case requires sufficient evidence of an actionable disparate impact - that is, a meaningful adverse lending disparity that can be traced to a specific policy within the lender's control that is considered artificial, arbitrary, and unnecessary to the lender's credit decision. These specific elements of a prima facie case provide a check on potentially abusive disparate impact claims that may solely rest on the presence of an adverse statistical disparity.
[8] While some technologists would likely push back on this and say that they do not consider a "significant" AIR to be "evidence" of disparate impact, I believe that they do. This is because, under their implementation of the three-step process, you always proceed to Step 3 and search for LDA credit models whenever a "significant" AIR value is obtained. See, for example, Fair Lending Monitorship of Upstart Network’s Lending Model: Fourth and Final Report of the Independent Monitor, March 27, 2024 ("... approval / denial disparities are considered “practically significant” and warrant a search for less discriminatory alternatives if AIR for any tested protected class group is less than 0.90.")
[9] A third flaw is the assumption that approval rate equity (i.e., an AIR value of one) represents a non-discriminatory outcome. However, in the context of credit decisions, it is certainly possible that a protected class group of applicants is better credit-qualified than its corresponding control group – in which case an AIR value of one would falsely indicate an absence of discrimination.
[10] The AIR metric confounds two potential sources of disparate impact - the credit model itself and the specific credit score threshold used by the lender to make credit decisions. A "significant" AIR could be due to one or both of these factors, and knowledge of this attribution is critical for implementing appropriate remedial actions. Alternatively, an AIR that is not "significant" could, in fact, be masking an underlying "significant" disparity from one of these sources - with that disparity offset at the aggregate AIR level by an opposing disparity in the other source.
[11] The reasons for this position are not clear. They may be related to EEOC guidance that interprets employment-related algorithmic decision tools as "selection procedures" under Title VII. An alternative reason may be to align their disparate impact framework with a remediation tool (algorithmic debiasing) that operates at the overall model level. Regardless of the specific reason, this position appears to be at odds with Inclusive Communities.
[12] See “Fool’s Gold 3 – Do LDA Credit Models Really Improve Fairness?” for my discussion of traditional approaches to evaluate credit model disparate impact.
[13] See, for example, CFPB Circular 2022-03 “Adverse action notification requirements in connection with credit decisions based on complex algorithms”.
[14] Typically, the technologists evaluate a credit model’s “legitimate business need” by: (1) evaluating whether the model is predictive of a relevant measure of creditworthiness, and (2) evaluating whether the model achieves a sufficient level of predictive accuracy. Given that these are two fundamental attributes of any basic enterprise-grade predictive model, they are nearly always satisfied by the time the model is submitted for fair lending compliance review.
[15] Some would say that federal banking regulators have always expected lenders to evaluate less discriminatory alternatives as part of their fair lending compliance risk management processes. And they would be correct. However, as I will discuss in more detail further below, LDA searches for individual model factors under traditional credit model disparate impact frameworks were much more limited and straightforward than model-level LDA searches utilizing algorithmic debiasing approaches.
[16] See, for example, my discussions of remediations resulting in apparent reverse disparate impact in “Meta’s Variance Reduction System: Is This the AI Fairness Solution We’ve Been Waiting For?” and “Fool’s Gold 3: Do LDA Credit Models Really Improve Fairness?”
[17] It can actually get even more complicated than this. According to one fintech's recent patent for algorithmic debiasing, "The new model can be a version of the initial model with new model parameters, a new model constructed by combining the initial model with one or more additional models in an ensemble, a new model constructed by adding one or more transformations to an output of the initial model, a new model having a different model type from the initial model, or any other suitable new model having a new construction and/or model parameters..." See "Systems and Methods For Model Fairness", Zest Finance, Inc., Pub No. US 2024/0127125 A1, April 18, 2024.
[18] See, for example, Black, E., Gillis, T., & Hall, Z. Y. (2024, June). D-hacking. In The 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 602-615).
[19] In the next section, I specifically address the oft-stated claim that the CFPB expects lenders to search for LDA credit models (frequently in the context of algorithmic debiasing) – thereby supporting the technologists’ claims that Step 3 is, in fact, the lender’s legal responsibility.
[20] Even for algorithmic debiasing methodologies that do exclude certain model factors, the lender cannot be sure if the right model factors were excluded if the specific causes of the original disparate impact were never individually identified.
[21] NCRC Just Economy Conference, March 2023. Ms. Ficklin also made similar remarks during the January 17, 2024 webinar "Getting Ahead of the Curve: Emerging Issues in the Use of AI and Machine Learning in Financial Services" sponsored by FinRegLab.
[22] This will also preclude potentially unsafe or non-compliant LDA credit model configurations in which core credit risk criteria - such as LTV or debt-to-income ratio - are either removed or substantially down-weighted by the debiasing algorithm in order to yield greater approval rate equity.
© Pace Analytics Consulting LLC, 2024.