Richard Pace

Four Potential Pitfalls of AI Credit Scoring Models

Updated: May 6



The last few years have seen rapid advances in the application of AI/ML technologies to the consumer credit underwriting process. In fact, we are currently in the midst of a fierce “score war” in which new AI/ML-driven fintechs – armed with troves of alternative data and armies of data scientists – launch repeated salvos against the traditional credit score, seeking to end its long reign over the consumer lending landscape with new credit risk scores they profess are more accurate, inclusive, and fair.

While the defeat of traditional credit scores is far from certain, the newer scores are gaining ground. Nevertheless, before lenders race to embrace these new tools, it would be prudent to consider a few potential "under-the-radar" risks of the underlying technology and data that – in my view – should be evaluated to ensure their safe, sound, and compliant adoption in consumer underwriting processes.

In what follows, I focus on three key features of newer AI/ML-based credit scoring models that are linked to these risks: (1) the tendency to use hundreds or even thousands of predictive variables (i.e., "high dimensional data") often derived from various types of alternative data, (2) the tendency of some popular AI/ML model architectures to be "underspecified" during model training - thereby generating multiple solutions with the same, or very similar, predictive performance, and (3) the increasingly-common use of "de-biasing" techniques to alter the trained model to produce "fairer" credit underwriting outcomes while preserving predictive accuracy.

To be clear, I am not suggesting that all AI credit scoring models are afflicted by the risks I discuss below. Rather, I believe that the typical features of such models may make these risks more applicable and relevant; therefore, incorporating these risks into a lender's due diligence and model validation processes may be both warranted and prudent from a risk management perspective.


Risk 1: AI Credit Scoring Models Based on High-Dimensional Data May Not Generalize Well on New Data


On its face, one would expect that models containing more data attributes should perform better since there are many more useful “signals” to predict credit performance – particularly for individuals with sparse credit bureau data for whom traditional models may either underperform or produce no score at all. While there certainly is validity to this argument, improved performance from high-dimensional alternative data may also come with unintended risks to the model's robustness and stability.

As a simplified example, consider a credit scoring model based on 10 binary predictive factors (i.e., so-called "dummy variables" that take on only two possible values). Theoretically, these 10 factors create 1,024 potential credit profiles across a sample of borrowers – that is, there are 1,024 combinations of the 10 dummy variables that can describe an individual’s credit risk profile.[1] If these credit profiles are equally likely to be observed, and we have a dataset of 10,000 individuals, then we should observe almost 10 individuals for each possible credit profile in the dataset – thereby providing both representativeness and depth relative to the underlying borrower population. Representativeness ensures that the training and test samples correspond to the individuals we will likely encounter in a production environment, while depth ensures that the predictions for each credit profile are based on multiple individuals with that profile. In general, this model would tend to have low variance, a low risk of overfitting, and, therefore, good generalization power.


How do these model performance attributes change with high-dimensional data?

Suppose I double the number of binary predictive factors to 20 in order to expand the number of useful signals related to borrower credit performance. In this case, the number of possible credit profiles increases to 1,048,576 – more than 1,000 times larger than the model with 10 predictive factors (see the table below). Now, with the same training dataset of 10,000 individuals, we cover less than 1% of the potential credit profiles – making the dataset both non-representative and extremely sparse – with many credit profiles lacking any training/test data whatsoever, and others having only a single record on which to base a prediction. To retain the same representativeness and depth as the previous model (about 10 data records per profile), we would need to expand the dataset more than 1,000x to include 10,240,000 individuals. And this is just for a credit scoring model with 20 binary predictive factors. As the table below demonstrates, the number of potential credit profiles increases exponentially as the number of data dimensions increases.

[Table: Curse of Dimensionality – number of potential credit profiles and required dataset size as the number of binary predictive factors grows]

As these simplified examples illustrate, while increasing the number of predictive factors may benefit the credit scoring model by expanding the number of useful signals to predict future credit performance, it also leads to an exponential increase in required dataset size (i.e., number of borrower records) to avoid a rapid decrease in data representativeness and density. Clearly, even in this age of Big Data, procuring a high-quality credit risk dataset larger than 100 million individuals is unlikely. However, even with a dataset of such vast size, a target data density of about 10 would still limit the number of predictive factors to a maximum of 23. Yet we are observing AI credit scoring models with hundreds or even thousands of predictive factors! Even at just 100 predictive factors, a 100 million record dataset would cover an infinitesimally small 7.89E-21% of potential credit profiles! Just think how much smaller this number would become with 1,000 or 2,000 predictive factors.[2]
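
For readers who want to reproduce the arithmetic above, the short Python sketch below (my own illustration, not taken from any vendor's methodology) computes the number of distinct credit profiles implied by k binary factors, the dataset size needed for roughly 10 records per profile, and the share of profiles covered by the hypothetical 100-million-record dataset discussed in the text.

```python
# Illustrative sketch of the curse-of-dimensionality arithmetic discussed above.
# The 100-million-record dataset and target density of 10 are the assumptions
# used in the text, not properties of any actual credit scoring model.

DATASET_SIZE = 100_000_000   # hypothetical "large" development dataset
TARGET_DENSITY = 10          # desired number of records per credit profile

for k in (10, 20, 23, 100):
    profiles = 2 ** k                                # distinct binary credit profiles
    records_needed = profiles * TARGET_DENSITY       # records for ~10 per profile
    coverage_pct = min(DATASET_SIZE / profiles, 1.0) * 100
    print(f"{k:>4} factors: {profiles:.3e} profiles; "
          f"{records_needed:.3e} records needed; "
          f"coverage by a 100M-record dataset: {coverage_pct:.2e}%")
```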

The point of these examples, and the corresponding math, is to highlight a key risk of “high-dimensional” credit scoring models: the underlying model development dataset, even though it may seem “large” in absolute terms, (1) may not sufficiently represent the potential credit profiles of the broader population of potential loan applicants, and (2) may contain only a single record for many of the credit profiles that are represented. Such data sparsity – which may be more problematic for credit scoring models because loan defaults (the primary event of interest) are relatively rare – may produce a model whose predictive performance does not generalize well in production due to overfitting of the sparse data. In fact, with a sufficiently large number of data dimensions, some AI model architectures may achieve remarkable predictive power in differentiating defaults from non-defaults during model development; however, such "power" – derived as it may be from high-dimensional overfitting of the training data – may prove short-lived once the model is deployed into production and exposed to a larger sample of the vast input space. Claims of production "data drift" may then emerge to diagnose the production model's underperformance, with frequent re-trainings executed as remediation, even though the "data drift" may simply be the appearance of additional credit profiles from the vast high-dimensional input space, and the sub-par model performance may be due to overfitting rather than a fundamental change in the underlying data distribution.

While some would point to the use of test data as a mitigant against this generalization risk, I note that test datasets for new AI credit scoring models are typically sampled from the same historical borrower population as is used to derive the training data. This is because the model’s dependent variable – credit performance – has a 1-2 year time dimension associated with it (i.e., we are predicting credit performance over a 1-2 year observation window), making more recent “out-of-sample / out-of-time” test data very limited or even infeasible – particularly for new credit products. For this reason, acceptable test dataset performance may be a misleading indicator of the model's predictive performance in a production environment.
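
To make the time-dimension constraint concrete, here is a minimal sketch (with hypothetical file name, column names, dates, and a 24-month outcome window) of how an out-of-time holdout would be carved from a loan-level dataset: because the default outcome is only observable after the performance window closes, the most recent vintages cannot be labeled at all, and the usable out-of-time slice may be thin.

```python
# Minimal sketch of an out-of-time holdout under an assumed 24-month window.
# File name, column names, and dates are hypothetical.
import pandas as pd

PERFORMANCE_WINDOW_MONTHS = 24
as_of_date = pd.Timestamp("2023-01-01")                        # model build date (assumed)
label_cutoff = as_of_date - pd.DateOffset(months=PERFORMANCE_WINDOW_MONTHS)

loans = pd.read_csv("loan_level_data.csv", parse_dates=["origination_date"])

# Only loans whose full performance window has elapsed can be labeled.
labelable = loans[loans["origination_date"] <= label_cutoff]

# Conventional practice: train and test drawn from the same historical vintages.
shuffled = labelable.sample(frac=1.0, random_state=0)
train = shuffled.iloc[: int(0.8 * len(shuffled))]
test_in_time = shuffled.iloc[int(0.8 * len(shuffled)):]

# A true out-of-time holdout is limited to the most recent labelable vintages,
# which may be a thin slice (or empty) for new credit products.
test_out_of_time = labelable[
    labelable["origination_date"] > label_cutoff - pd.DateOffset(months=6)
]
```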

To be clear, this potential risk is by no means guaranteed, nor an inherent flaw of present-day AI credit scoring models. Nevertheless, proper due diligence and model risk governance should require further exploration of this risk prior to model implementation. It is certainly possible that, unlike my simplified examples, real-world credit profiles are much more concentrated in the selected input space (even those of higher dimensions) and, therefore, data representativeness and density are better than my simple math would imply. It is also possible that the specific AI model architecture used may, itself, mitigate this risk. However, such assessments should be backed by hard empirical evidence, and any weaknesses and limitations of the model due to data sparsity and potential overfitting should be clearly disclosed to model validators and users – with appropriate remediations and risk mitigants.
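
One way to gather that empirical evidence is to measure how concentrated the development data's credit profiles actually are. The sketch below (my own illustration; the file name and decile binning are assumptions) counts distinct profiles and their observed depth, which speaks directly to the representativeness and density concerns above.

```python
# Illustrative diagnostic: how many distinct credit profiles appear in the
# development data, and how many records back each one?
import pandas as pd

features = pd.read_csv("model_development_features.csv")   # hypothetical file

coarse = features.copy()
for col in coarse.select_dtypes(include="number").columns:
    # Bin continuous attributes into deciles so that a "profile" is well-defined.
    coarse[col] = pd.qcut(coarse[col], q=10, duplicates="drop", labels=False)

profile_counts = coarse.value_counts()        # records observed per distinct profile
print("Distinct profiles observed:", len(profile_counts))
print("Median records per profile:", profile_counts.median())
print("Share of records with a singleton profile:",
      (profile_counts == 1).sum() / len(coarse))
```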


Risk 2: Alternative Data's Greater Degree of Missing Values May Adversely Impact a Material Percentage of Credit Scores


While the vast majority of the U.S. adult population has traditional credit bureau data and credit scores, the population coverage rate for alternative data can be much lower due to how it is collected, to whom it pertains, the impact of privacy laws and regulations, and how it is identified and matched to individuals. Accordingly, even assuming that one can collect a large enough alternative data set to provide sufficient representativeness and depth for the target customer population, it is fairly common to find that one or more alternative data attributes are missing for each individual.

Rather than simply exclude the training records of individuals with missing data – which may cause a significant reduction in the size of the remaining dataset – typical model development practice is to replace missing data with synthetic replacement values that are computed using various methods, including: (1) the mean or median of the non-missing continuous variable values, and (2) the most frequent non-missing value for categorical variables.[3] While normally not a material issue when using 10-20 traditional credit bureau variables, the higher frequency of missing values across hundreds or thousands of alternative data attributes can result in a significant presence of synthetic replacement values in the model development dataset.[4] In my experience, since the prevalence of missing data is rarely disclosed by vendors, it would be prudent risk management to evaluate the following potential risks prior to model implementation (a simple illustration of this imputation practice follows the list below):

  • It is possible that many individuals' credit profiles are so tainted by synthetic replacement values that the model’s predicted credit risk probabilities for these individuals are not individually-meaningful. That is, rather than truly reflecting the individual’s specific credit profile, the estimated credit risk probabilities are effectively, in the limit, an “average” cohort-level probability. This effectively becomes the alternative data version of a “thin file” segment – raising the question as to whether these individuals are really “scorable” and, accordingly, (1) whether they should be processed through the model or production algorithms at all, and (2) how much the AI credit scoring model truly increases the "scorability" of certain consumer groups above and beyond the levels of traditional credit score models.


  • If synthetic replacement values are based on medians, means, or most frequent values, they likely skew toward lower-risk values given the typical class imbalance observed in most consumer credit performance data (i.e., defaults are relatively rare while non-defaults are dominant). In such cases, depending on the specific credit score threshold employed in decisioning, impacted individuals may be assigned credit scores that fall within the lender’s credit approval region (even though such individuals may be unscorable by traditional credit scoring models, or receive traditional credit scores that may result in credit decline decisions). Such artificial boosting of AI-based credit scores may suggest that the model is approving a larger percentage of applicants than “traditional” credit scoring models when, in fact, the increased financial access (and, potentially, improved fair lending metrics such as adverse impact ratios) may be an illusion created by the synthetic replacement values.


  • For individuals with missing data whose true underlying credit profiles are actually stronger than the synthetic replacement values, and who may be declined by the lender’s credit score thresholds, traditional local explainability methods may not be entirely accurate as to the real reasons for credit denial - thereby potentially increasing the lender's consumer compliance risk.
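
As referenced above, here is a minimal sketch of the imputation practice described in this section, along with a per-applicant measure of how much of each credit profile is synthetic rather than observed. The file name, column handling, and 50% threshold are illustrative assumptions, not any vendor's actual procedure.

```python
# Illustrative sketch: median fills for continuous attributes, most-frequent
# fills for categorical attributes, plus a per-applicant share of imputed values.
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.read_csv("alternative_data_attributes.csv")     # hypothetical file

numeric_cols = data.select_dtypes(include="number").columns
categorical_cols = data.select_dtypes(exclude="number").columns

# Share of each applicant's profile that will be synthetic after imputation.
data["pct_imputed"] = data.isna().mean(axis=1)

data[numeric_cols] = SimpleImputer(strategy="median").fit_transform(data[numeric_cols])
data[categorical_cols] = (SimpleImputer(strategy="most_frequent")
                          .fit_transform(data[categorical_cols]))

# Applicants whose profiles are heavily synthetic may warrant separate treatment
# (e.g., an "unscorable" designation) rather than an individually-reported score.
heavily_imputed = data[data["pct_imputed"] > 0.5]          # illustrative threshold
```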


Risk 3: The AI Model Training Solution May Not Be Unique – Resulting in Potential Model Instability and Volatility in Global and Local Model Explanations

Recent research has shown that AI-based models are prone to produce more than one training solution with effectively the same predictive performance – but with potentially very different sets of weights - if the data inputs are slightly changed, or if different random initializations are used. This risk may be heightened even further in the presence of high dimensional data. Whether referred to as the "Rashômon effect"[5] or model “underspecification”[6], the non-uniqueness of the resulting model solution has important implications for the robustness of AI credit scoring models - specifically,

  • Although multiple solutions may produce effectively the same training, validation, and test set predictive performance, the manner in which that performance is generated (i.e., the final model weights) can be very different across solutions - thereby creating potentially significant challenges for model explainability and transparency, as well as for validation testing of conceptual soundness. Accordingly, model developers and validators should first investigate whether multiple solutions have been sufficiently explored and evaluated. This can be accomplished in different ways - one of the simplest being the use of different random seeds during model training runs (see the sketch following this list).


  • As discussed more fully in [6] below, although multiple solutions may generate effectively the same model performance measures on training, validation, and test samples, such models can behave and perform very differently on new, unseen data that lie outside these samples - an important model limitation that may be quite relevant for high-dimensional credit scoring models based on alternative data. Such a risk suggests the importance of appropriate model stress testing to help developers (and validators) identify model solutions whose performance on a broader sample of high-dimensional credit profiles is less robust than others.


  • In the absence of material performance differences, the existence of multiple solutions also means that global and local model explanations may be very different - that is, there is no consistent "truth" in how the model inputs impact estimated credit scores on a global and local basis.[7] Additionally, the existence of multiple explanations can create real issues with adverse action compliance requirements - that is, even if the multiple models produce a consistent credit decline decision for an individual, how meaningful is the explanation for that decline decision if it can vary significantly across multiple model solutions? So which version of this "truth" should developers select and on what basis? This is where conceptual soundness reviews - or interpretable AI and Causal AI techniques - may help by narrowing down multiple solutions to those that also meet specific interpretability / explainability requirements related to the underlying consumer behaviors that drive credit performance differences.


  • One frequently-cited feature of AI-based modeling methodologies is the ease by which models can be re-trained over time based on new data. However, such frequent re-training can also exacerbate the inherent instability in model explanations due to underspecification. That is, as new data is added to the training dataset, this perturbation may result in a material change in the distribution of model weights across input variables - thereby causing a material change in global and local model explanations. Assuming model transparency and explanations are subject to formal risk governance controls, this may indicate a need to place specific additional controls around the model re-training process to ensure appropriate governance and oversight by model risk management and compliance functions.
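
A simple version of the seed-based check mentioned in the first bullet above might look like the following sketch. The synthetic dataset and the choice of a gradient-boosted classifier are illustrative assumptions, not the architecture of any particular credit scoring model; the point is simply that near-identical test performance can coexist with materially different model internals.

```python
# Illustrative underspecification check: re-train the same architecture under
# different random seeds and compare test performance and feature importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for a credit dataset: rare "default" class, many candidate features.
X, y = make_classification(n_samples=20_000, n_features=50, n_informative=10,
                           weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

aucs, importances = [], []
for seed in range(10):
    model = GradientBoostingClassifier(subsample=0.8, random_state=seed)
    model.fit(X_train, y_train)
    aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    importances.append(model.feature_importances_)

# Similar AUCs with materially different importance vectors suggest multiple,
# observationally equivalent solutions (i.e., underspecification).
print("Test AUC range across seeds:", min(aucs), max(aucs))
print("Std. dev. of feature importances across seeds:",
      np.std(np.vstack(importances), axis=0).round(4))
```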


Risk 4: The Use of AI-Based Methodologies May Complicate Fair Lending Compliance


One of the important benefits ascribed to AI credit scoring models is that they improve fairness and inclusivity through three primary features: (1) the expansion of new (alternative) data types used to evaluate applicant credit risk, (2) the incorporation of complex non-linear features generated by the AI methodology that capture more nuanced credit risk behaviors than traditional credit scoring models, and (3) the use of AI-based "de-biasing" techniques during model training to generate "fairer" model scores (i.e., credit scores with lower levels of disparate impact).[8] While the goals of improved fairness and inclusivity are of great importance, caution needs to be exercised during model development, validation, and compliance assessments to avoid the following pitfalls.

  • In Risk 1 above, I discussed the potential for complex AI credit scoring models - particularly those based on high-dimensional data - to overfit the data, thereby compromising the model's ability to generalize well in production. However, there is also an important secondary impact of model overfitting: it may mask disparate impact measurements. That is, since the model is able to predict credit outcomes in the model development data with an artificially high level of accuracy, traditional fair lending disparity metrics - such as relative predictive accuracy - may indicate little to no bias when, in fact, potential bias may be present on samples containing larger and/or more diverse credit profiles.


  • In Risk 3 above, I discussed how the existence of multiple model solutions with similar performance measures can create issues with model transparency and explainability. However, I also note that certain "de-biasing" techniques may actually benefit from the presence of multiple solutions to identify a less discriminatory alternative ("LDA") that has effectively the same (or very similar) predictive performance - but, quite possibly, very different global and local model explanations. What this suggests is that current de-biasing techniques should expand their evaluation criteria to consider more than just accuracy and bias - conceptual soundness, robustness, and explainability would also seem relevant to the ultimate "de-biased" solution (a simplified illustration of such an LDA search follows this list).


  • Finally, I note that model instability can adversely impact de-biasing results in much the same way that it affects overall model predictive performance. Specifically, an LDA model selected from multiple solutions may not exhibit the same fair lending performance in production as it does on training, validation, and test samples for the same reasons discussed previously - indicating the need to evaluate fair lending performance on a wider range of potential credit profiles, particularly for high-dimensional models. Additionally, model re-trainings may materially impact fair lending performance due to inherent model instability - thereby reinforcing the need for additional controls to ensure appropriate governance and oversight by compliance personnel.
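
To illustrate the interplay between multiple solutions and LDA searches described above, here is a simplified sketch (my own illustration, not any vendor's de-biasing method): among candidate models with near-equivalent accuracy, compute a basic disparity measure such as the adverse impact ratio at a chosen score cutoff and favor the least discriminatory candidate. As argued above, a real selection process should also weigh conceptual soundness, robustness, and explainability.

```python
# Simplified LDA search over near-equivalent candidate models. The candidate
# models, score cutoff, and AUC tolerance are illustrative assumptions.
from sklearn.metrics import roc_auc_score

def adverse_impact_ratio(default_probs, protected, cutoff):
    """Approval-rate ratio: protected-class approvals vs. control-group approvals."""
    approved = default_probs <= cutoff          # approve applicants below the risk cutoff
    return approved[protected].mean() / approved[~protected].mean()

def search_lda(candidate_models, X_test, y_test, protected, cutoff, auc_tolerance=0.005):
    results = []
    for model in candidate_models:              # e.g., the different-seed runs shown earlier
        probs = model.predict_proba(X_test)[:, 1]
        results.append({"model": model,
                        "auc": roc_auc_score(y_test, probs),
                        "air": adverse_impact_ratio(probs, protected, cutoff)})

    # Keep candidates within a small accuracy tolerance of the best performer,
    # then pick the one with the least disparity against the protected class.
    best_auc = max(r["auc"] for r in results)
    eligible = [r for r in results if r["auc"] >= best_auc - auc_tolerance]
    return max(eligible, key=lambda r: r["air"])
```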


In summary, while the application of emerging AI technologies to a growing library of alternative data shows promise in expanding safe, sound, and compliant financial access to U.S. consumers, it is still relatively early days in this revolution. While certainly promising, these new technologies and alternative data attributes are inherently novel and complex, and our understanding of both their benefits and risks continues to evolve. What's more, the regulatory environment for these models is still largely unsettled: (1) federal financial regulators have limited experience and comfort with such models from both a safety-and-soundness and a consumer compliance perspective, and (2) we are entering a period of aggressive federal and state consumer compliance enforcement activity in which algorithmic bias is a top concern. For these reasons, I recommend that consumer lenders adopt a more measured approach to the current "score wars" - meaning that they should continue to innovate, evaluate, and potentially adopt new credit scoring tools - but do so with appropriate and prudent risk management activities commensurate with the currently elevated - and shifting - risk landscape.


* * *


ENDNOTES:

[1] In practice, some of the model variables may have more than 2 discrete values while others may be continuous. The presence of these multi-valued variables actually exacerbates the issues described in this section since the theoretical number of credit profiles becomes even larger.

[2] According to one large fintech lender whose AI credit model uses "over 1500 traditional and non-traditional variables", "Given the sheer number of variables used by our model there are never two applicants that are the same." See Upstart Response to Information Requested in February 13 Letter, February 28, 2020.

[3] Another common practice is to add a "Missing" category to the variable and use this as the replacement value. The risks discussed in this section are also relevant under this practice.

[4] For simplicity, I assume that the missing data is random. However, for alternative data, there are legitimate reasons why such missing data may not be random - such as age effects (younger or older people may be more likely to have missing data for certain types of alternative data) or geographic effects (variations in data privacy laws and regulations at the state or local level). Missing data driven by such non-random factors may introduce other risks into the AI-based model estimates (which are outside the scope of this article).

[5] See Breiman, Leo. "Statistical Modeling: The Two Cultures," Statistical Science, Vol. 16, No. 3 (Aug., 2001), pp. 199-215.

[6] See D'Amour, et al., "Underspecification Presents Challenges for Credibility in Modern Machine Learning," arXiv:2011.03395

[7] In addition, as high dimensional data are more prone to multicollinearity, certain global and local model explanations may be impacted by random training conditions - thereby impacting explanation stability (e.g., Factors A and B are highly correlated and related to credit performance, yet Factor A receives a strong weight during one model run, and Factor B receives a strong weight during another model run).

[8] I focus here solely on de-biasing techniques embedded within the model training process. However, I note that there are also de-biasing techniques that can be deployed during the data pre-processing stage as well as during post-processing (i.e., when the model is in production).

© Pace Analytics Consulting LLC, 2023.
