Are Rising Interest Rates Impairing AI Credit Model Fairness?

How are macroeconomic conditions impacting AI credit model decisions, and how might this change our views on current fairness metrics and algorithmic de-biasing practices?

Since the beginning of 2022, consumer debt burdens have been rising as the Federal Reserve, responding to the highest inflation rates in four decades, quickly raised interest rates to levels not seen since late 2007. The result has been a surge in interest rates on new consumer loans and existing adjustable-rate consumer debt. For example,

1) The average 30-year fixed conforming mortgage interest rate has approximately doubled in 2022 - rising from just under 3.5% to recent highs of over 7.0%.

30-Year Fixed Rate Conforming Mortgage Rate Index - Source: Federal Reserve Bank of St. Louis

2) The U.S. Prime Rate - which is the basis for many short- to medium-term consumer loans - has more than doubled in 2022 - rising from 3.25% at the start of the year to 7.5% in December.

U.S. Bank Prime Rate - Source: Federal Reserve Bank of St. Louis

3) According to bankrate.com, average interest rates have increased "1.77 percentage points for a 60-month new car loan and 1.78 percentage points for a 48-month used loan". The former represents an almost 50% increase in the interest rate since January 2022.

4) According to CNN, "The national average APR for credit cards has climbed by 2.74 percentage points so far this year, the biggest increase in a single year on record, according to Bankrate.com." This represents a 17% increase in the interest rate since January 2022.
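The examples above mix percentage-point changes with percent (relative) changes, which are easy to confuse. The short sketch below shows the arithmetic for two of the quoted figures. The function names are illustrative, and the implied starting credit card APR is only a back-of-the-envelope inference from the quoted numbers, not a figure reported by Bankrate.com or CNN.

```python
def relative_increase(start_rate, end_rate):
    """Percent (relative) change between two rate levels, as a fraction."""
    return (end_rate - start_rate) / start_rate


def implied_starting_rate(point_change, relative_change):
    """Starting rate implied by a percentage-point change and a relative change."""
    return point_change / relative_change


# Prime Rate: 3.25% at the start of 2022, 7.5% in December (figures from the text).
prime_increase = relative_increase(3.25, 7.5)

# Credit cards: +2.74 percentage points described as a 17% increase (from the text).
# The implied January APR is an inference for illustration only.
card_start_apr = implied_starting_rate(2.74, 0.17)

print(f"Prime Rate relative increase: {prime_increase:.1%}")
print(f"Implied starting card APR:   {card_start_apr:.1f}%")
```

Read this way, a 4.25 percentage-point move in the Prime Rate is a relative increase of roughly 131%, while a 2.74 percentage-point move in card APRs is a much smaller relative increase because the starting level was far higher.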

While U.S. consumers are now facing the highest interest rate levels in almost 15 years, the economic impact of these rate increases may not be the same for all demographic groups. Indeed, the CFPB's recent 2022 mortgage lending research - based on an analysis of federal Home Mortgage Disclosure Act ("HMDA") data through 2022Q2 - notes the following about applicant debt-to-income ratios ("DTIs"):

"... by the end of our observation period [June 2022], the average DTI for Hispanic white borrowers reached over 40 percent, while the average DTI for Black borrowers rose to 39.4 percent. Although we don’t have third quarter filings for 2022 HMDA data at this time, given the trends observed in the first half of the year and the fact that the mortgage interest rates continued to rise in the third and fourth quarters of 2022 to the highest level in more than 20 years, we predict that the increase in DTI will likely continue through the rest of 2022. This means that consumers taking out new mortgages are likely devoting a higher share of their income towards servicing debts, especially mortgage payments, with the potential that average DTI for Hispanic white and Black borrowers will approach levels not seen since DTI information was first collected in HMDA in 2018." [emphasis by this author]

While the CFPB researchers had yet to detect a notable change in mortgage application denial rates for home purchase loans through 2022Q2, they did find an increasing share of denied applications reporting the applicant's DTI as a denial reason, with the share of Black and Hispanic applicants' DTI-based denials at historical highs - as shown in the Figure below.[1]

Percent of HMDA Denials Due to DTI — Percentage of denials with DTI reported as reason - Source: CFPB November 30, 2022 Office of Research Blog

According to the CFPB,

"Notably, by the end of the second quarter of 2022, over 45 percent of all Black and Hispanic white applicants who were denied had DTI reported as a denial reason [relative to about 35% for non-Hispanic white applicants]. That is the highest since the revised HMDA data collection and reporting requirements took effect in 2018." [emphasis by this author]

Clearly, these continuing macroeconomic changes are affecting consumer credit risk profiles in a meaningful, and historically different, way[2] - and likely driving or presaging notable credit model outcome divergences between certain demographic groups - such as credit approval rates, interest rates, and/or line amounts. Traditional fair lending analyses of these outcome divergences would naturally point to the rising market interest rates - and corresponding debt burden increases - as the root-cause of these disparities. And credit model disparate impact concerns would be defended by the "business necessity" of prudent credit risk management policies associated with traditional measures of a consumer's ability to pay (i.e., DTI and payment-to-income ("PTI") credit policy maximums). However, these traditional analyses have recently fallen out of favor - with a growing number of industry participants adopting a new breed of fintech-driven fair lending assessments that view disparate impact and credit model fairness very differently.

The New Breed of AI Credit Models and Fairness Analysis

As I discussed earlier this year, we are in the midst of a fierce "score war" - a public battle between a new cadre of AI/ML-driven fintechs armed with troves of alternative data and armies of data scientists that seek to end the long reign of traditional credit score providers with new AI-based credit risk models that they profess are more accurate, inclusive, and fair.

However, despite the purported benefits of these new AI-based credit models, the response of federal regulators, and the adoption by many top- and middle-tier banks, has been more cautious[3] - with concerns over model transparency, explainability, the robustness of model performance claims, and - perhaps most importantly - the potential hidden biases of such models against the very demographic groups they are meant to benefit.

To address these latter "algorithmic bias" or "robo-discrimination" concerns head-on, academic and industry researchers have been quite active in developing AI-based technologies to purge these credit models of their demographic biases - albeit at a cost of some predictive accuracy as well as a potential increase in certain other model risks.[4] These novel algorithmic "de-biasing" approaches - the most popular of which are typically implemented during credit model training - generally rely on the following two critical - yet subjective - premises adopted by their advocates:

1) The presence and magnitude of the algorithmic bias, or disparate impact, for credit decisions is best measured using the Adverse Impact Ratio ("AIR") - which measures the relative "approval" rates of protected and non-protected class applicants. Essentially, the AIR simply reflects whether there is equality of decision outcomes between groups - that is, are protected class groups approved at the same rate as non-protected class groups. If the protected class group has a lower approval rate, then the AIR will be less than one, and vice versa. Advocates of this disparate impact metric point to its long-standing use in similar employment discrimination matters, and typically suggest that an AIR value less than 0.80 represents problematic model bias (disparate impact) - relying on the EEOC's "four-fifths rule" as support for this threshold.[5]

2) Lenders have an obligation to search preemptively for less discriminatory alternative ("LDA") models that reduce disparate impact (i.e., increase the AIR closer to one) in exchange for a "modest" or "reasonable" decrease in the model's predictive accuracy. For example, if the model's AIR falls below the 0.80 EEOC-based threshold, then it is incumbent upon the model owner to search for other model configurations that reduce the approval rate disparities - even if the lender firmly believes that the drivers of the AIR disparity satisfy a "business necessity" defense (e.g., the disparities are clearly driven by differences in legitimate credit risk characteristics between the two groups).[6] Depending on the specific algorithmic de-biasing approach adopted, these LDA models may involve a different set of predictive attributes, changes to the weights of the model's predictive attributes, or both.
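To make the AIR calculation concrete, here is a minimal sketch of the ratio described in the first premise, together with the 0.80 "four-fifths" screen. The applicant counts are hypothetical and chosen only for illustration; the function and constant names are my own and do not come from any regulatory guidance or vendor tool.

```python
FOUR_FIFTHS_THRESHOLD = 0.80  # threshold cited by de-biasing advocates


def adverse_impact_ratio(protected_approved, protected_total,
                         control_approved, control_total):
    """Protected-class approval rate divided by control-group approval rate."""
    protected_rate = protected_approved / protected_total
    control_rate = control_approved / control_total
    return protected_rate / control_rate


# Hypothetical counts: 300 of 500 protected-class applicants approved (60%)
# versus 400 of 500 control-group applicants approved (80%).
air = adverse_impact_ratio(300, 500, 400, 500)
flagged = air < FOUR_FIFTHS_THRESHOLD

print(f"AIR = {air:.2f}, below four-fifths threshold: {flagged}")
```

Note that the calculation uses only approval outcomes by group. Nothing in it reflects why the approval rates differ, which is the point at issue in the second premise.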

Based on these two critical premises, credit model fairness is achieved once the underlying predictive algorithms are modified to produce decision outcomes that are approximately equal (as measured by an AIR close to one) between the protected and non-protected class groups - at an acceptable cost of predictive performance.

But what happens after the de-biased credit model is deployed in production and, say, macroeconomic conditions - such as market interest rates - change markedly?

Unfortunately, advocates of algorithmic de-biasing have provided little public information as to the on-going stability of a de-biased credit model's AIR-based fairness over the model's life-cycle. In the absence of such information, one is left with the impression that - once purged of biases - a newly-deployed credit model will continue to operate with sustained fairness performance so long as the algorithm remains unchanged. But this seems unlikely for a number of reasons - one of which pertains to the recent interest rate hikes discussed above.

What seems more likely is that - as with a model's predictive performance - once deployed, a de-biased credit model's fairness performance (as measured by the AIR or, for continuous outcomes such as interest rates, the Standardized Mean Difference ("SMD") metric) may also drift over time in response to changing consumer credit risk profiles.[7] Such "fairness drift" under these metrics makes a company's fair lending compliance a moving target - requiring vigilant monitoring and, where necessary, updated de-biasing. In fact, the macroeconomic changes this year are a great case study in such fairness drift as lenders experience rising consumer DTI and PTI ratios flowing through their AI credit models and, potentially, a corresponding deterioration in the "equality of outcome" fair lending risk metrics (i.e., AIR for credit decisions and SMD for interest rates).
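A stylized sketch of how such fairness drift could arise is shown below. It holds a simple decision rule fixed (a maximum DTI) and changes only the market mortgage rate. Everything in it is an assumption made for illustration: the 43% DTI cap, the $300,000 loan amount, and the two small sets of hypothetical applicants (monthly income, other monthly debt payments). It does not represent any actual lender's model or any real applicant data.

```python
def monthly_payment(principal, annual_rate, years=30):
    """Standard fixed-rate amortizing loan payment."""
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)


def approval_rate(applicants, annual_rate, loan_amount, dti_cap):
    """Share of applicants whose DTI stays at or below the cap."""
    payment = monthly_payment(loan_amount, annual_rate)
    approved = sum(
        1 for income, other_debt in applicants
        if (payment + other_debt) / income <= dti_cap
    )
    return approved / len(applicants)


# Hypothetical applicants: (monthly income, other monthly debt payments).
control_group = [(8000, 500), (7000, 600), (6000, 400), (5000, 500)]
protected_group = [(7000, 500), (6500, 400), (5500, 500), (4500, 300)]

LOAN_AMOUNT = 300_000  # assumed for illustration
DTI_CAP = 0.43         # assumed credit policy maximum


def air_at(annual_rate):
    """AIR under the same fixed decision rule at a given market rate."""
    protected = approval_rate(protected_group, annual_rate, LOAN_AMOUNT, DTI_CAP)
    control = approval_rate(control_group, annual_rate, LOAN_AMOUNT, DTI_CAP)
    return protected / control


air_low = air_at(0.035)   # roughly the early-2022 mortgage rate level
air_high = air_at(0.070)  # roughly the late-2022 mortgage rate level

print(f"AIR at 3.5%: {air_low:.2f}")
print(f"AIR at 7.0%: {air_high:.2f}")
```

In this toy setup the decision rule never changes, yet the AIR falls from 1.00 to about 0.67 once the higher rate pushes more of the hypothetical lower-income applicants over the DTI cap. The size of the effect depends entirely on the assumed inputs; the sketch only shows the mechanism.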

While the advocates of algorithmic de-biasing haven't publicized their next steps in these situations, the natural response to this fairness deterioration would appear to be an update of their algorithmic de-biasing procedure to counteract the deterioration in the AIR and/or SMD fairness metrics using a more recent data sample reflecting the higher consumer debt burdens.

However, this logical extension of their approach appears to create a conundrum.

Specifically, in this case study, we know that the driver of the growing model disparate impact (as measured by the decreasing AIR / increasing SMD) is rising market interest rates - a purely macroeconomic driver. However, under the de-biasing advocates' current premises, no predictive variables are safe from the LDA requirement (i.e., business necessity does not preclude the requirement for an LDA search). This means that an updated de-biasing procedure will be needed to ameliorate the impact of the growing DTI and PTI ratios on applicant credit scores (and, by extension, relative loan approval rates (AIR) and relative interest rates (SMD)). But this would likely raise some potentially significant safety-and-soundness concerns due to the potential weakening of these ability-to-pay measures during a period when they are increasingly important from a credit risk management perspective.

Can't we just exempt macroeconomic-driven AIR-/SMD-based fairness deterioration from the LDA requirement?

We could; however, this violates the advocates' second premise of algorithmic de-biasing. If the fairness effects of rising interest rates are deemed immune from the LDA search requirement (due, perhaps, to business necessity?), then this opens the door to other potential exceptions, and - frankly - calls into question the underlying basis for the first premise's use of the AIR / SMD fairness metrics. But this may not be such a bad thing.

Is There a Solution?

Perhaps by recognizing the "business necessity" of certain - but not necessarily all - credit risk drivers, these algorithmic de-biasing approaches can evolve to be more consistent with traditional model disparate impact assessments where LDAs are considered only for predictive attributes that: (1) create a potential disparate impact AND (2) lack sufficient business necessity or justification (as with some alternative data).

That is, the foundations of algorithmic de-biasing could evolve away from an "equality of outcomes" objective and toward a fairness metric that accounts for fundamental, objective, and direct differences in credit risk attributes across customers. If they did, AI credit model fairness would exhibit more stability over the credit model life-cycle, safety-and-soundness risks would be reduced, model risk management challenges may de-escalate, and - just maybe - some of the caution exhibited by mid- and top-tier banks toward AI-based credit models would begin to recede, thereby expanding the use of these AI technologies to achieve broader and fairer access to credit.

* * *

ENDNOTES:

[1] A couple of additional points about these observations. First, rising interest rates are just one of the drivers of increasing consumer debt burdens. These burdens are further fueled by recent inflation in residential real estate prices and automobile prices that exacerbate the impacts of current macroeconomic conditions on consumer credit risk profiles. Second, while CFPB researchers have yet to detect notable increases in HMDA loan denial rates, it is certainly possible that the impact of these rising interest rates on loan denial rates is not immediate, as applicants mitigate the impact by lowering their requested loan amounts to remain within the lenders' approval criteria (or lenders counteroffer at lower loan amounts than requested). However, to the extent that this reduction in loan amounts disproportionally impacts protected class consumers, it still represents an adverse effect on their credit outcomes - just one that is not traditionally focused on from a fair lending perspective. Additionally, as interest rates continue to rise, there will be a point where interest rate levels drive applicants' loan requests outside the lenders' credit policies - even at reduced loan amounts - thereby causing relative denial rates to eventually rise.

[2] "Historically different" is relative to the past 15 years over which the predictive and fairness performances of many AI credit models have been tuned.

[3] Interestingly, while many top- and middle-tier banks have had a relatively tepid response to these models, the models appear to have made strong in-roads into the credit union sector where concentration continues to expand.

[4] There have also been broadly comparable efforts to address other regulator and industry concerns - such as model explainability. However, while much progress has been made, this concern remains unsettled.

[5] See, for example, Hall, Patrick, et al. "A United States Fair Lending Perspective on Machine Learning", Frontiers in Artificial Intelligence, 07 June 2021, Sec. Artificial Intelligence in Finance, https://doi.org/10.3389/frai.2021.695301.

The AIR metric is used to measure disparate impact in discrete outcomes - such as approval rates, while another metric - referred to as the Standardized Mean Difference ("SMD") is used to measure disparate impact in continuous outcomes - such as loan interest rates. The key point here is that both disparate impact metrics are based on the "equality of outcomes" fairness standard in which disparate impact is deemed present whenever the lending outcome (e.g., average approval rate or average loan interest rate) is "sufficiently" different between a protected class group and a corresponding control group - regardless of the underlying objective differences in the credit risk profile of each group.
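For readers who want to see the SMD as a calculation, the sketch below uses one common formulation: the difference in group mean outcomes divided by the standard deviation of the outcome across both groups combined. The choice of denominator is an assumption on my part, since practitioners differ on it, and the interest rates are hypothetical values chosen only for illustration.

```python
from statistics import mean, pstdev


def standardized_mean_difference(protected_outcomes, control_outcomes):
    """Difference in group means, scaled by the combined population std. dev."""
    combined = list(protected_outcomes) + list(control_outcomes)
    return (mean(protected_outcomes) - mean(control_outcomes)) / pstdev(combined)


# Hypothetical loan interest rates (in percent) for each group.
protected_rates = [7.0, 7.5, 8.0]
control_rates = [6.5, 7.0, 7.5]

smd = standardized_mean_difference(protected_rates, control_rates)
print(f"SMD = {smd:.2f}")
```

As with the AIR, the inputs are outcomes by group only. A positive value here means the protected class group received higher average interest rates, without regard to any differences in the underlying credit risk profiles.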

I also note that the use of an AIR threshold of 0.80 has not been formally endorsed by the federal financial regulators, and one well-known law firm suggests using an even more conservative 0.90 AIR threshold value under some circumstances. For my further analysis of the AIR bias metric, see my blog post "Six Unanswered Fair Lending Questions Hindering AI Credit Model Adoption".

[6] See my above blog post for six important unanswered questions about this preemptive LDA model search. I note here, in particular, that this LDA search "obligation" is not a settled regulatory expectation. In fact, consumer lenders have developed credit scoring models for decades - with attendant bank regulatory scrutiny - and have never been under an expectation or requirement to search for LDA models (although they may have searched for an LDA attribute within the model if they were concerned about the potential disparate impact of that attribute and its associated "business necessity" justification). Nevertheless, advocates of algorithmic de-biasing view a pre-emptive LDA search as a "likely" expectation of bank regulators - although such regulators have yet to opine one way or the other on it, and such a practice would appear to be at odds with HUD's 2013 disparate impact burden shifting framework.

[7] I note that this "fairness drift" is specific to the manner by which "fairness" is measured. Where AIR (or SMD) is used to measure fairness, then any model input change that drives relatively lower loan approval rates (or relatively higher interest rates) for protected class groups will create "fairness drift". However, if - alternatively - fairness is measured by the relative accuracy of model predictions, then the same model input change may not lead to "fairness drift" under that definition.