Using Explainable AI to Produce ECOA Adverse Action Reasons: What Are The Risks?

Richard Pace
Sep 15, 2022

Updated: May 6, 2024

"ECOA and Regulation B do not permit creditors to use complex algorithms when doing so means they cannot provide the specific and accurate reasons for adverse actions." - CFPB Circular 2022-03, "Adverse action notification requirements in connection with credit decisions based on complex algorithms"

In May 2022, the CFPB - through its Circular 2022-03 - stunned the consumer credit industry with its announcement that Adverse Action Notification reasons produced by AI-based credit models may not be compliant with ECOA requirements - a marked turnaround from its seeming embrace and encouragement of expanded AI-based lending access just two years earlier. In my July article, "Dark Skies Ahead: The CFPB's Brewing Algorithmic Storm", I examined this Circular more closely - offering my analysis of the underlying factors driving the CFPB's stated concerns with these AI-based Adverse Action reasons.

However, there is still more to say on this topic. In particular, I want to dive deeper into one of the most surprising and important statements in the CFPB's Circular - notably contained in Footnote 1 - that focuses on the analytical tools typically applied to AI-based credit models to extract the specific reasons/factors responsible for an applicant's adverse credit decision.

"While some creditors may rely upon various post-hoc explanation methods, such explanations approximate models and creditors must still be able to validate the accuracy of those approximations, which may not be possible with less interpretable models." - Footnote 1

As I stated in my prior article:

"This statement is critical. It implies that if the lender's model is inherently non-interpretable and, therefore, requires a post-hoc explainability tool to deduce the drivers of its predictions, then the lender must validate the accuracy of these analytically-derived explanations. However, these explanations - which are typically generated by explainability tools such as SHAP and LIME - are not exact, they are approximations." Furthermore, "...how can a lender evaluate their accuracy if such an evaluation requires knowledge of the "true" exact explanations - which are unknowable due to the black box nature of the model?"

Public reaction to the CFPB's Circular has been mixed. Some have stated that there is nothing new here - the Bureau is simply reminding the industry that existing federal fair lending law and regulation also extend to new AI-based credit models. Others, like me, believe that the Bureau is communicating a new position - specifically taking aim at popular AI explainability tools. However, some in this camp believe that the Bureau's issue is really with the LIME explainability approach (which most lenders do not use) as it relies on a surrogate model to produce Adverse Action reasons - not with the more widely-adopted SHAP and Integrated Gradients tools that possess stronger theoretical foundations and, therefore, are presumed to produce valid and accurate model prediction explanations.

In my opinion, this latter position understates the potential risk to industry participants.

And to support this opinion, the remainder of this article focuses on the following:

I perform a deeper dive on the three most common AI explainability tools - SHAP, LIME, and Integrated Gradients - but do so (hopefully!) in an intuitive manner (no math!) to focus on what I see as the core risk issues.

As I agree that SHAP and Integrated Gradients both possess strong theoretical foundations, I focus instead on where I see the real risk issues for lenders - in their implementation. It is here that the presence of random sampling, simplifying assumptions, and other user-driven choices can introduce risk to the technical robustness and the regulatory alignment of the resulting Adverse Action Notification reasons.

Finally, because this is an area of intersecting governance within a Company - specifically, among Compliance/Legal, Model Developers/Users, and Model Risk Management - I offer some specific recommendations for these key stakeholders to consider when evaluating and mitigating these technical and compliance risk issues.

To be clear, I am not suggesting that these tools are unsuitable for Adverse Action Notifications, nor am I suggesting that they are inherently inaccurate. In fact, industry participants and researchers continue to make refinements to these tools that address some of the risks I note below. Rather, I am pointing out that these tools do not produce infallibly-certain decision explanations as some non-technical stakeholders may inadvertently assume. As with any analytically-based estimate, prudent risk management requires that relevant stakeholders assess the key risks impacting estimate reliability - both technical and regulatory - and take appropriate steps to address such risks consistent with their institution's risk appetite.

Why Are AI Explainability Tools Necessary?

Over the last several decades, traditional credit scoring models were built using tried-and-true statistical models whose underlying architecture permitted easy decomposition of an applicant's credit score into the individual sub-components associated with each predictive factor - thereby providing an exact and straightforward means to provide a denied credit applicant with the specific factors most contributing to his/her insufficient score level.
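To make the exactness of this traditional decomposition concrete, here is a minimal sketch of a linear scorecard. The factor names, weights, intercept, and averages are all invented for illustration and do not come from any real scoring system. Because the score is a simple weighted sum, each factor's contribution relative to the average applicant can be read off directly and the contributions add up exactly to the score difference.

```python
# Hypothetical linear scorecard: names, weights, and averages are illustrative only.
WEIGHTS = {"utilization": -150.0, "delinquencies": -30.0, "years_of_history": 4.0}
AVERAGES = {"utilization": 0.30, "delinquencies": 0.5, "years_of_history": 12.0}
INTERCEPT = 732.0


def score(applicant):
    """Linear score: intercept plus weight times value for each factor."""
    return INTERCEPT + sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)


def contributions(applicant):
    """Exact per-factor contribution relative to the average applicant."""
    return {k: WEIGHTS[k] * (applicant[k] - AVERAGES[k]) for k in WEIGHTS}


applicant = {"utilization": 0.90, "delinquencies": 3.0, "years_of_history": 2.0}
parts = contributions(applicant)

# Factors ordered from most negative contribution to least negative
ranked = sorted(parts, key=parts.get)
print(ranked, round(sum(parts.values()), 6))
```

No sampling or approximation is involved here, which is precisely what is lost once the model is no longer a simple weighted sum.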

In today's increasingly AI-enabled world, however, newer credit scoring models exploit more complex architectures in which: (1) significantly more predictive factors may be included (e.g., 100s or 1000s), (2) the large number of predictive factors may introduce some redundant or overlapping predictive signals that create difficulty in measuring the predictive contributions of such correlated factors (e.g., multiple measures of recent credit derogatories), and (3) the algorithm creates complex interactions among two or more predictive factors to improve predictive accuracy - thereby complicating and obscuring the specific drivers of an applicant's overall credit score.

The presence of such high dimensional data and significant model complexity turns what was a straightforward process of explaining individual credit decisions into a highly complicated analytical and computational exercise.

What AI Explainability Tools Are Typically Used For Credit Decisioning?

To address the challenge of determining Adverse Action reasons from AI-enabled credit models, lenders have adopted a set of AI Explainability Tools - that is, analytically-based processes that are overlaid onto an AI/ML algorithm to decompose individual predictions into their fundamental input-based drivers. However, as I will discuss below, while these tools are based on sound theoretical principles and methodologies that provide support for their accuracy, several challenges arise in their operational implementation that introduce sources of potential imprecision and instability - thereby, perhaps, feeding the CFPB's concerns when such post-hoc explanation methods are used to produce Adverse Action Notifications under ECOA.

In the next sections, I provide an overview of the three most common AI Explainability Tools and discuss some of the risks and limitations of which model developers, model validators, and Compliance/Legal personnel should be aware. I note that this article is not meant to be an authoritative technical resource on these tools, nor an exhaustive analysis of all such risks; rather, it is meant to facilitate a high-level understanding of these methodologies, and highlight perhaps the most important or well-known risks, in order to demonstrate the need for appropriate engagement among Company control functions to meet evolving regulatory expectations associated with a complex, high-stakes consumer-facing business process.

Shapley Values

Figure 1: Illustration of Shapley Values

What are Shapley Values?

The present-day use of "Shapley values" to explain AI model predictions relies most fundamentally on the 2017 paper by Lundberg and Lee: "A Unified Approach to Interpreting Model Predictions" - with additional enhancements and extensions by themselves and others since that time. Methodologically, Shapley values are the outputs of an algorithm that uniquely decomposes an individual model prediction into the individual contributions of each model input that drives the model prediction away from a baseline value (e.g., the average model prediction). I refer to this difference between the individual model prediction and the baseline value as the "baseline model prediction difference".

For example, Figure 1 above illustrates the Shapley values associated with an individual applicant's estimated credit score of 600 and a "baseline" average predicted credit score of 720. The individual Shapley values decompose this applicant's baseline credit score difference of -120 points (=600-720) into individual input contributions whose sum equals -120. Input contributions that are positive in value (i.e., the red and blue inputs) have a corresponding positive effect on the individual's estimated credit score - while those input contributions that are negative in value (i.e., the orange and yellow inputs) have corresponding negative effects on the individual's estimated credit score.[1] If this individual's credit application were declined, then the yellow input would be identified as the top contributor to the individual's lower than average credit score followed by the pair of orange inputs.

While there are many ways in which such a decomposition can be made, Shapley values have the advantage of doing so in a manner that yields a number of desirable properties. For example,[2]

For a given model prediction, the Shapley values for the associated model inputs will naturally sum to the baseline model prediction difference. This property yields a very simple and intuitive set of explanations for a given model prediction (e.g., the red input is the most important model input for this prediction as it has the largest individual contribution to the baseline model prediction difference; however, the yellow input is most responsible for the applicant's insufficient score level should the application be declined).

Shapley values "fairly" decompose the baseline model prediction difference across model inputs using the strong theoretical foundations of cooperative game theory.

Shapley values will be zero for attributes that do not impact the baseline model prediction difference.

Shapley values represent "contrastive" or "counterfactual" explanations - that is, in their original form, they explain which inputs are most responsible for causing a given model prediction to differ from a baseline model prediction (which, for credit scoring models, could be a minimum score for loan approval).

At the most general level, Shapley values are essentially the average baseline model prediction difference associated with the presence and absence of a model input - answering the question: How does the presence of this input for this individual affect the model's prediction (relative to the baseline) compared to a counterfactual scenario in which this input is not considered? While these calculations are extremely simple in traditional linear credit scoring models, they are much more complicated in today's AI-enabled models due to a complex architecture that causes the individual predictive effect of a given input's presence or absence to depend intricately on what other inputs are also present or absent.
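For readers who want to see the "presence / absence" averaging spelled out, the following is a minimal sketch of an exact Shapley calculation on a three-input toy model. The model, the applicant values, and the use of a single baseline point for "absent" inputs are all assumptions made for illustration. The interaction term between the first two inputs is what makes each input's effect depend on which other inputs are present.

```python
from itertools import combinations
from math import factorial


def model(x):
    # Hypothetical score with an interaction between inputs 0 and 1
    return 700.0 - 20.0 * x[0] - 10.0 * x[1] - 15.0 * x[0] * x[1] + 5.0 * x[2]


def value(coalition, applicant, baseline):
    """Model output when inputs in `coalition` are present and the rest are absent."""
    mixed = [applicant[i] if i in coalition else baseline[i] for i in range(len(applicant))]
    return model(mixed)


def exact_shapley(applicant, baseline):
    """Weighted average of each input's marginal effect over every coalition of the others."""
    m = len(applicant)
    phi = [0.0] * m
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i}, applicant, baseline)
                without_i = value(set(subset), applicant, baseline)
                phi[i] += weight * (with_i - without_i)
    return phi


applicant = [3.0, 2.0, 1.0]   # hypothetical applicant
baseline = [0.0, 0.0, 4.0]    # hypothetical baseline point
phi = exact_shapley(applicant, baseline)
print(phi, sum(phi), model(applicant) - model(baseline))
```

In this toy case the interaction effect is split equally between the two interacting inputs, and the contributions sum to the baseline model prediction difference.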

How are Shapley Values calculated / estimated?

From an implementation standpoint, it turns out that computing exact Shapley values for most AI/ML model predictions is computationally impractical as the process requires the calculation of baseline model prediction differences for every possible combination of present and absent model inputs. While such computations may not be so burdensome for simple models with few variables, they grow explosively in volume as the number of model inputs increases. For example, a model with 20 inputs would have over 1 million present / absent input combinations; a model with 30 inputs would have over 1 billion. Because of this computational complexity, Lundberg and Lee introduced the well-known SHAP (Shapley Additive Explanations) approach to compute approximate Shapley values in a more computationally tractable and efficient manner.[3]

While new variants and enhancements to the SHAP approach continue to emerge, I focus my discussion here on the widely-used, model-agnostic KernelSHAP approach discussed in the Lundberg and Lee paper. KernelSHAP leverages samples of present / absent input combinations, certain simplifying assumptions, and linear regression analysis to reduce significantly the computational time and resources needed to estimate Shapley values - called "SHAP values" under this approach. However, this improved computational efficiency comes at a cost, as discussed below.

What risks should key stakeholders know about SHAP values?

While the technical details of the SHAP methodology are outside the scope of this article, I highlight below some of the relevant features of this popular explainability tool that may underlie the concerns expressed by the CFPB in Circular 2022-03 (i.e., when post-hoc explanation methods are used to explain individual credit model predictions for Adverse Action Notification purposes).

1. Reliance on Random Sampling

As described above, calculating Shapley values requires one to compute baseline model prediction differences for every possible combination of present / absent model inputs which - for a model with only 30 inputs - would require over 1 billion such calculations. KernelSHAP instead reduces these computational requirements by randomly sampling the population of input combinations and their associated baseline model prediction differences - but in a manner that focuses disproportionately on the subset of input combinations that provides the greatest explanatory information.

For each of these sampled input combinations, the "present" model inputs take the values associated with the individual whose model prediction we seek to explain. In contrast, the "absent" model input values are iteratively assigned from a "reference" dataset (discussed further below) that frequently is (but is not required to be) a representative random sample of the original training dataset. By averaging the resulting baseline model prediction differences over these representative values for "absent" inputs, we obtain an overall average baseline model prediction difference for each sampled input combination.

A weighted linear regression model is then applied to the set of baseline model prediction differences associated with the sampled input combinations to estimate simultaneously all the SHAP values associated with a given individual's model prediction.
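The three steps just described can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the shap library's implementation: the four-input model and the reference rows are hypothetical, the weights are the Shapley kernel weights from the Lundberg and Lee paper, and the "sampling" is a plain random subset of coalitions rather than the library's more refined scheme.

```python
import itertools
from math import comb

import numpy as np


def model(X):
    # Hypothetical score with an interaction between inputs 0 and 1
    return (700.0 - 20.0 * X[:, 0] - 10.0 * X[:, 1]
            - 15.0 * X[:, 0] * X[:, 1] + 5.0 * X[:, 2] - 8.0 * X[:, 3])


def coalition_value(mask, applicant, reference):
    """Average prediction with present inputs fixed and absent inputs taken from reference rows."""
    data = reference.copy()
    data[:, mask] = applicant[mask]
    return model(data).mean()


def kernel_shap(applicant, reference, coalitions):
    m = len(applicant)
    base = coalition_value(np.zeros(m, dtype=bool), applicant, reference)
    total = coalition_value(np.ones(m, dtype=bool), applicant, reference) - base
    Z = np.array(coalitions, dtype=float)
    y = np.array([coalition_value(z.astype(bool), applicant, reference) for z in Z]) - base
    sizes = Z.sum(axis=1)
    weights = np.array([(m - 1) / (comb(m, int(s)) * s * (m - s)) for s in sizes])
    # Force the SHAP values to sum to the prediction difference by eliminating the last input
    A = Z[:, :-1] - Z[:, [-1]]
    b = y - Z[:, -1] * total
    sw = np.sqrt(weights)
    rest, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return np.append(rest, total - rest.sum())


applicant = np.array([3.0, 2.0, 1.0, 5.0])                 # hypothetical applicant
reference = np.array([[0.0, 0.0, 4.0, 1.0], [1.0, 0.0, 6.0, 2.0],
                      [0.0, 1.0, 3.0, 0.0], [2.0, 1.0, 5.0, 3.0]])  # hypothetical reference rows

# All 14 coalitions with at least one present and one absent input
all_coalitions = [z for z in itertools.product([0, 1], repeat=4) if 0 < sum(z) < 4]
phi_all = kernel_shap(applicant, reference, all_coalitions)

# A random subset of coalitions stands in for sampling on a larger model
rng = np.random.default_rng(0)
subset = rng.choice(len(all_coalitions), size=8, replace=False)
phi_sampled = kernel_shap(applicant, reference, [all_coalitions[i] for i in subset])
print(phi_all, phi_sampled)
```

With every coalition included, the regression reproduces the exact Shapley values for this toy model. With only a subset, the estimates still sum to the prediction difference by construction, but the individual values can shift, which is the sampling variability discussed next.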

Key Stakeholder Considerations: As with all sample-based estimation methodologies, selecting an appropriate sample size is critical as there is always a trade-off between the size of the sample and the potential variability in the resulting SHAP estimates. For example, using a larger sample of input combinations, and/or using a larger size reference dataset, may be desirable to minimize the impact of sampling variability on the estimated SHAP values; however, the trade-off is a longer KernelSHAP run-time with no guarantee that the increased sample sizes will provide incremental improvements in the accuracy / stability of such estimates.[4]

Moreover, for credit models with 100s or 1000s of model inputs, even practical sample sizes will represent only a tiny fraction of possible input combinations - creating risk that samples will be non-representative. For example, KernelSHAP's default sample size of input combinations for a model with 50 inputs is 2,148 - which represents only 0.0000000002% of all possible input combinations. While these 2,148 input combinations are focused disproportionately on the subset of input combinations that provides the greatest explanatory information, they still represent just a tiny percentage of the population.
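The arithmetic behind these figures is easy to reproduce. The sketch below assumes the default sample size follows the rule 2 x (number of inputs) + 2,048, which is consistent with the 2,148 figure cited for 50 inputs and with my understanding of the shap library's "auto" setting; the helper function names are my own.

```python
def default_kernelshap_samples(num_inputs):
    # Assumed default rule: 2 * M + 2048 (matches 2,148 for M = 50)
    return 2 * num_inputs + 2048


def coverage_percent(num_samples, num_inputs):
    """Share of all present / absent input combinations covered by the sample, in percent."""
    return 100.0 * num_samples / 2 ** num_inputs


for m in (20, 30, 50):
    n = default_kernelshap_samples(m)
    print(f"{m} inputs: {2 ** m:,} combinations, {n:,} sampled, "
          f"{coverage_percent(n, m):.10f}% coverage")
```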

For these reasons, it is important to understand - and to provide appropriate documented support for - how the Company selected the specific sample sizes to use in its AI explainability tool. At a minimum, the Company may wish to demonstrate that the top Adverse Action reasons produced under the selected sample sizes are invariant to larger samples - when applied to a sufficiently broad and deep set of "declined" credit applications. That is, through reasonable and appropriate sensitivity analysis, the Company can support that its chosen sample sizes appear sufficient to generate stable and robust Adverse Action reasons.
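One way such a sensitivity analysis might be organized is sketched below. It assumes the Company has already produced two sets of SHAP values for the same declined applications, one at the chosen sample size and one at a larger sample size. The arrays shown are made-up placeholders, and the choice to compare the ordered top reasons is just one possible agreement measure.

```python
import numpy as np


def top_reasons(shap_row, k):
    """Indices of the k most negative contributions, most negative first."""
    order = np.argsort(shap_row)
    return [int(i) for i in order[:k] if shap_row[i] < 0]


def agreement_rate(shap_small, shap_large, k=4):
    """Share of applications whose ordered top-k reasons match across the two runs."""
    matches = [top_reasons(a, k) == top_reasons(b, k)
               for a, b in zip(shap_small, shap_large)]
    return float(np.mean(matches))


# Placeholder SHAP values for two declined applications and five inputs
shap_large = np.array([[-60.0, -40.0, -35.0, 10.0, 5.0],
                       [-5.0, -50.0, -20.0, 15.0, -1.0]])
shap_small = np.array([[-58.0, -41.0, -36.0, 9.0, 6.0],
                       [-6.0, -49.0, -2.0, 14.0, -21.0]])

rate = agreement_rate(shap_small, shap_large, k=2)
print(rate)
```

A low agreement rate on a broad set of declined applications would suggest the chosen sample sizes are too small to support stable Adverse Action reasons.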

2. Reliance on a Specifically-Defined "Baseline" Scenario to Which an Individual's Model Prediction Will Be Compared

Figure 1 above showed graphically how Shapley values are used to explain the difference between a given individual's credit score prediction (i.e., 600) and a corresponding average credit score prediction (i.e., 720). Additionally, as discussed, the yellow input had the largest contribution to this individual's below average credit score estimate - followed by the orange inputs.[5]

From this example, we can see that Shapley / SHAP values, by definition, are counterfactual - that is, they are measured relative to a "baseline" scenario (and corresponding input combination). Indeed, in its default implementation, the KernelSHAP algorithm defines this baseline scenario - similar to Figure 1 - as the average input values associated with a given reference dataset (typically, a subset of the training dataset). However, other baseline scenarios can also be used - which will necessarily change the resulting SHAP value estimates and their interpretations.

Key Stakeholder Considerations: Since Shapley / SHAP values were not specifically designed to produce Adverse Action Notification reasons, it is important that the Company be intentional in selecting the appropriate baseline scenario for such a purpose. While regulatory guidance in this area is far from specific, one must still ensure there is no inconsistency with such guidance - such as the Official Staff Interpretation to Regulation B (12 CFR Part 1002 Supplement I Comment 1002.9(b)(2)(5)):

"The regulation does not require that any one method be used for selecting reasons for a credit denial or other adverse action that is based on a credit scoring system. Various methods will meet the requirements of the regulation. One method is to identify the factors for which the applicant's score fell furthest below the average score for each of those factors achieved by applicants whose total score was at or slightly above the minimum passing score. Another method is to identify the factors for which the applicant's score fell furthest below the average score for each of those factors achieved by all applicants. These average scores could be calculated during the development or use of the system. Any other method that produces results substantially similar to either of these methods is also acceptable under the regulation."

According to this guidance, two potential baseline scenarios are: (1) applicants with model scores at or slightly above the minimum passing score, or (2) the average model score of all applicants (calculated using the model training sample or based on live application data within the model production system). Another common baseline used in the industry is the maximum achievable score associated with the model.

As there is no "one size fits all" choice of the baseline scenario, and because different baselines produce different SHAP values and Adverse Action reasons with different interpretations, it is important that this choice be well considered by key stakeholders - such as Compliance/Legal, the model developers, and the Company's Model Risk Management function. In particular,

Selection of an appropriate baseline scenario involves multiple considerations. It should involve consideration of regulatory requirements, the range of industry practices, technical feasibility and validity, and consumer benefit. That is, baseline scenarios that are considered computationally impractical, technically unsound, or of limited information value to the consumers receiving the Adverse Action Notifications should be avoided.

Consider the operational risks of dynamic baseline scenarios. While baseline scenarios that are calibrated to score cut-offs may be considered logical and appropriate, they also carry heightened operational risks that need to be considered. In particular, as it is not uncommon for score cut-offs to be revised from time to time, the Company will need to ensure that adequate monitoring processes are in place to alert appropriate parties of such revisions, and to ensure corresponding updates to the baseline scenarios are made. Similar monitoring and change management processes will also be needed should baseline scenarios be calibrated to live application averages (whether of all applications or only approved applications) - rather than training data averages.

Consider the risks of how certain baseline scenarios are implemented. Selecting a fixed baseline scenario - such as one associated with maximum score points or the average score from the training sample - has the advantage of stability over time (assuming the underlying model does not change) and precludes the types of operational risks discussed above. However, the method by which such fixed scenarios are implemented can introduce an additional risk/limitation.

For example, if the Company desires that explanations of individual model predictions are based on a comparison to the model's average score (in the model training dataset), then there are two ways to implement this choice. The simplest would be to calculate the average input values in the model training dataset and use these averages to represent consistently the "absent" input values in the KernelSHAP estimation process. However, this approach - while computationally fast - assumes that the model is linear in the model inputs which is not true for most complex AI-credit models. In this case, the average input scenario will not accurately represent the average model score and the resulting SHAP values will be different than intended. The technically correct approach is to use a "reference" dataset for calculating the KernelSHAP values (as described above). While this approach is more computationally costly due to its iterative nature, it is true to the intended baseline scenario.
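A small numerical sketch shows why the two implementations can diverge. The model below is a hypothetical nonlinear score in which a utilization penalty applies only above 50%, and the four reference rows are invented for illustration.

```python
import numpy as np


def model(X):
    # Hypothetical nonlinear score: the utilization penalty applies only above 50%
    return 720.0 - 200.0 * np.maximum(X[:, 0] - 0.5, 0.0) - 25.0 * X[:, 1]


# Hypothetical reference rows: [utilization, delinquencies]
reference = np.array([[0.1, 0.0], [0.2, 0.0], [0.9, 2.0], [0.8, 2.0]])

# Shortcut: score a single "average applicant"
score_at_average_inputs = model(reference.mean(axis=0, keepdims=True))[0]

# Intended baseline: average of the scores across the reference rows
average_score = model(reference).mean()

print(score_at_average_inputs, average_score)
```

Here the "average applicant" sits at the 50% threshold and incurs no utilization penalty, so the shortcut baseline is noticeably higher than the true average score. SHAP values measured against the shortcut would therefore explain a different gap than the one intended.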

3. The Use of Illogical Input Combinations to Calculate SHAP Values

As discussed above, SHAP values are calculated using an "on-off" methodology in which - for a given model prediction to be explained: (1) a sample of baseline model prediction differences is generated based on the presence ("on") and absence ("off") of specific model inputs, and (2) a weighted regression model is applied to this sample of baseline model prediction differences to estimate the corresponding SHAP values. The "presence" of an input in these input combinations is simply the value associated with the individual prediction we are seeking to explain. For example, referring back to Figure 1, if the yellow input represents the applicant's total number of serious delinquencies in the last 90 days (let's assume, 3), then an input combination where the yellow input is "present" would contain the same 3 serious delinquencies.

But what value would the yellow input take if we have an input combination where the yellow input is absent ("off")? Under the traditional KernelSHAP estimation approach, the yellow input's serious delinquency value is iteratively replaced with a value from a sample of applicants contained in the model training data (i.e., the "reference" dataset) to simulate an average baseline serious delinquency count for the yellow input when it is "absent".

The risk / limitation here is that the KernelSHAP estimation approach breaks the underlying correlations among the model inputs when replacing the "absent" values.[6] Continuing with our example, suppose we have an input combination in which the yellow input is "absent" and, accordingly, its value (i.e., the applicant's 3 serious delinquencies) is replaced by a value of 0 from the reference dataset and the corresponding baseline model prediction difference is calculated. The issue here is that, in addition to serious delinquency, there may be other model inputs that reflect the applicant's credit history - such as number of recent minor delinquencies, the presence of a recent bankruptcy, etc., and this set of credit history attributes will certainly be correlated (e.g., the applicant's 3 serious delinquencies may also be associated with a recent bankruptcy and several recent minor delinquencies). However, "new" input combinations are formed by independently changing the values of specific "absent" inputs without consideration of this underlying correlation structure.

This creates two issues. First, by breaking the correlation among these credit history attributes, KernelSHAP can produce what appear to be logically inconsistent input combinations - for example, a recent bankruptcy, several recent minor delinquencies, but 0 serious delinquencies. Second, the model predictions we obtain from these new illogical input combinations are likely to be "out-of-sample" - and, therefore, subject to potentially significant uncertainty and imprecision - as there were likely no applicants with these input combinations in the original model training set used to estimate the model.

These two issues can adversely affect the robustness of the resulting SHAP values. At best, this feature of the methodology merely introduces noise into the SHAP value estimates. At worst, it may impact the accurate identification and ordering of the true underlying Shapley values - thereby impacting the accuracy of the Adverse Action reasons.
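The mechanics of this issue can be seen in a short sketch. The three credit history columns and the reference rows below are hypothetical. The code builds the hybrid rows that result from treating serious delinquencies as "absent" and then checks whether each hybrid row actually occurs in the reference data.

```python
import numpy as np

# Hypothetical columns: serious_delinquencies, minor_delinquencies, recent_bankruptcy
reference = np.array([[0, 0, 0], [0, 1, 0], [1, 2, 0], [3, 4, 1]])
applicant = np.array([3, 4, 1])


def hybrids(applicant, reference, absent):
    """Applicant rows with the absent inputs replaced, independently, by reference values."""
    rows = np.tile(applicant, (len(reference), 1))
    rows[:, absent] = reference[:, absent]
    return rows


def seen_in_reference(rows, reference):
    """Flag whether each hybrid row matches a combination observed in the reference data."""
    known = {tuple(r) for r in reference}
    return [tuple(r) in known for r in rows]


rows = hybrids(applicant, reference, absent=[0])
flags = seen_in_reference(rows, reference)
print(rows.tolist(), flags)
```

Three of the four hybrid rows pair a recent bankruptcy and four minor delinquencies with zero or one serious delinquencies, combinations that never appear in the reference data. The model is nonetheless asked to score them.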

Key Stakeholder Considerations: The issue described here is generally present for all "perturbation-based" AI explainability tools - not just SHAP - and may be one of the reasons for the CFPB's requirement that the outputs from these tools be validated. It is presently unclear how much of a practical impact these illogical values have on the resulting SHAP value estimates. Nevertheless, the Company should ensure that it sufficiently evaluates and mitigates this risk / limitation of the SHAP approach with respect to identifying and ordering accurately the resulting Adverse Action Notification reasons produced therefrom.

4. Appropriately Treating SHAP Values For Related / Correlated Inputs

One of the advantages of AI-based credit models is the ability to consider efficiently a larger number of potential predictive signals. Indeed, some fintechs explicitly advertise their usage of hundreds if not thousands of predictive inputs in their AI credit models. However, one of the corresponding features of such high-dimensional datasets is the presence of high correlations - or even redundancy - among subsets of these inputs - a feature that is especially common when leveraging granular credit bureau data. In such cases, the common predictive effect represented by these inputs may end up being divided and spread across the individual correlated inputs. While this may not have a practical effect on the model's predictive accuracy, it can have a significant effect on the estimated SHAP values and, accordingly, the identification of Adverse Action Notification reasons.

For example, if recent credit derogatories have a significant negative contribution to an applicant's credit score - but the predictive effect of this factor is spread out across 10 correlated model inputs, then the overall "SHAP value" of this factor will similarly be disaggregated into smaller SHAP values for each of these 10 inputs. Now, when identifying the top reasons for why the applicant's credit score is insufficient for loan approval, it is possible that none of the 10 inputs related to credit derogatories will be in the top reasons since - individually - they are too small in magnitude, even though - in the aggregate - they may be a top reason.

This phenomenon is illustrated in Figure 1 with the two orange inputs - which are considered to be highly correlated / redundant. As we can see from their estimated SHAP values, they both contribute -40 points to the applicant's below average credit score. From an Adverse Action Notification perspective, these two inputs would be considered of secondary importance relative to the yellow input with a -60 point contribution. However, this ordering changes if one considers that the two orange inputs are really measuring the same predictive signal and, therefore, their collective contribution is -80 points - now greater than the yellow input and now the top reason for the adverse action. In more general cases, this phenomenon may cause important reasons for loan denial to be diluted in magnitude and - therefore - not included in the Adverse Action reasons despite, collectively, being one of the top reasons.

Key Stakeholder Considerations: It is important for a Company to have an explicit policy related to how individual model inputs are mapped to Adverse Action Notification reasons (under both ECOA and FCRA) with more recent consideration of the CFPB's positions outlined in Circular 2022-03. This could mean that SHAP values for related inputs are aggregated when ranking model prediction explanations[7], or it could suggest that model developers handle such correlations and redundancies on the front end through appropriate variable selection processes or model architecture selection.
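The aggregation option can be illustrated with the negative contributions from Figure 1 (-60 for the yellow input and -40 for each orange input). The input-to-reason mapping below is hypothetical; in practice it would come from the Company's documented policy.

```python
from collections import defaultdict

# Negative contributions from Figure 1
shap_values = {"yellow": -60.0, "orange_1": -40.0, "orange_2": -40.0}

# Hypothetical mapping of model inputs to Adverse Action reason groups
reason_group = {
    "yellow": "serious delinquencies",
    "orange_1": "recent derogatories",
    "orange_2": "recent derogatories",
}


def rank_individual(values):
    """Inputs ordered from most negative contribution to least negative."""
    return sorted(values, key=values.get)


def rank_grouped(values, groups):
    """Reason groups ordered by their summed contribution, most negative first."""
    totals = defaultdict(float)
    for name, contribution in values.items():
        totals[groups[name]] += contribution
    return sorted(totals.items(), key=lambda item: item[1])


individual = rank_individual(shap_values)
grouped = rank_grouped(shap_values, reason_group)
print(individual, grouped)
```

Ranked individually, the yellow input leads. Ranked by group, the combined -80 points from the two orange inputs moves to the top, which is the reordering described above.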

LIME-Based Explanation Factors

Another popular AI explainability tool is called LIME (Local Interpretable Model-Agnostic Explanations) which relies on the creation of a surrogate model to explain a given model prediction. More specifically, LIME first identifies the input values associated with the prediction to be explained. Then, like SHAP, it perturbs the values of these inputs to create additional input combinations and associated model predictions. However, (1) the new input combinations are not designed to have the same "on-off" structure as the SHAP input combinations - they just vary more naturally, and (2) greater weight is placed on those new input combinations that have "similar" or "close" input values to the original data point. LIME then approximates the original model around the original data point by estimating an inherently interpretable model (e.g., linear regression model or decision tree) on the weighted data points. Given its interpretable form, this local surrogate model can be used to determine which model inputs are most important in differentiating the original model prediction from those "nearby".
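The local surrogate idea can be sketched as follows. This is a simplified illustration and not the lime package's implementation: the Gaussian perturbations, the exponential proximity weighting, the kernel width, and the toy model are all assumptions chosen to keep the example short.

```python
import numpy as np


def model(X):
    # Hypothetical nonlinear score with an interaction between the two inputs
    return 700.0 - 20.0 * X[:, 0] - 10.0 * X[:, 1] - 15.0 * X[:, 0] * X[:, 1]


def lime_sketch(predict, x, scale, n_samples=500, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around the point x."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(0.0, scale, size=(n_samples, len(x)))   # perturbed neighbors
    y = predict(X)
    dist = np.linalg.norm((X - x) / scale, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)               # closer points weigh more
    A = np.hstack([np.ones((n_samples, 1)), X - x])            # intercept + local slopes
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]


x = np.array([3.0, 2.0])   # hypothetical applicant
local_prediction, local_slopes = lime_sketch(model, x, scale=0.25)
print(local_prediction, local_slopes)
```

The slopes describe how the score moves for small changes around this one applicant. They say nothing directly about why the applicant's score differs from a distant baseline such as the average score.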

Most credit model practitioners agree that LIME is not a suitable AI explainability tool to identify Adverse Action Notification reasons - primarily because it explains an individual's model prediction relative to its closest "neighbors" which - by design - are similar to the data point one is trying to explain. This type of explanation is really more useful to understand how the original model behaves, approximately, within a relatively small neighborhood of a given data point - perhaps for local sensitivity analysis - rather than to explain why the model's prediction for that data point differs from a broader baseline scenario (such as the average model prediction in a given dataset).

Key Stakeholder Considerations: For the reasons stated above, the use of a LIME explainability tool to generate Adverse Action Notification reasons may be considered inconsistent with the CFPB's positions as described in Circular 2022-03.

Integrated Gradient-Based Explanation Factors

What are they?

As discussed previously, explaining individual model predictions using SHAP values can be computationally intensive for AI-based credit models using 100s or 1000s of predictive factors - requiring substantial sampling of input combinations that, ultimately, may not be representative of the billions of input combinations that may exist. In the quest to find an AI explainability tool for these high-dimensional models that possesses SHAP's desirable theoretical properties, but with a much lower sampling requirement and, therefore, lower sampling risk, Google researchers devised an approach called Integrated Gradients ("IG").[8]

Recall from our SHAP discussion associated with Figure 1 that SHAP's objective is to attribute fairly to each of the model inputs the difference between an individual applicant's estimated credit score and the average estimated credit score within a given sample of applicants (e.g., the model training data). In general, KernelSHAP does this by computing the average baseline model prediction differences for a sufficient sample of input combinations in which the individual's input values are present or absent. Intuitively, KernelSHAP is trying to understand how sensitive the model's output is to each model input - within a space defined by: (1) the input combination associated with the model prediction we seek to explain, and (2) the input combination(s) associated with the baseline scenario. A large number of input combination samples within this space are required to discern these individual input sensitivities because KernelSHAP is a model-agnostic method (i.e., it does not rely on access to the underlying mathematical structure of the model) and, therefore, such input sensitivities have to be deduced indirectly from the model's predictions under different input value combinations.

How are they calculated / estimated?

Alternatively, IG leverages complete access to the model's underlying mathematical structure to calculate these input sensitivities (called "gradients") directly - which, when compared to KernelSHAP: (1) precludes the need for input sensitivity deduction based on substantial input combination sampling, (2) is much faster and more efficient, and (3) produces a much finer-grained set of input sensitivities. Along with these gradients, the tool requires a baseline scenario - similar to SHAP; however, in the typical case, this baseline scenario is a single input point (e.g., the medians or averages of the input values within a reference dataset). Figure 2 below illustrates a simplified two-input IG process.

Figure 2: Illustration of Two-Input Integrated Gradient Process

With these two components (the model gradients and a baseline input scenario), the IG tool then does the following:

It creates a "straight-line" path between the baseline input scenario and the individual's input combination for which we are seeking an explanation (see Figure 2 above). This straight-line path represents a sequence of "linear" input combinations that connects the baseline input scenario (e.g., the inputs' average values in the training data) to the individual applicant's specific input values. The number of "steps" along this straight-line path (i.e., the sample of "linear" input combinations - see the dashed red line in Figure 2) affects the accuracy of the resulting explanations - with a larger number of smaller steps (i.e., a larger number of "linear" input combinations) preferred; however, in general, typical required sample sizes are in the low hundreds.

For each of the "linear" input combinations along this straight-line path, the IG tool applies the corresponding input sensitivity (i.e., gradient measure) to calculate the change in the model's prediction associated with the change in the "linear" input combination since the last step. This tells us how much the model prediction has changed since the last step and how much each input contributed to this change in prediction. These incremental changes are accumulated (i.e., integrated) over the entire straight-line path to provide - at the end - a decomposition of the total change in the model prediction relative to each of the model inputs. These represent our IG-based explanations.
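The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the two-input logistic `model`, its weights `W` and `B`, the applicant and baseline values, and the choice to evaluate gradients at the midpoint of each step are all hypothetical.

```python
import numpy as np

# Hypothetical two-input logistic "credit model"; weights are illustrative only.
W = np.array([1.2, -0.7])
B = 0.1

def model(X):
    """Predicted score for each row of X (shape: n_points x n_inputs)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + B)))

def gradients(X):
    """Exact input sensitivities of the logistic model at each row of X."""
    p = model(X)
    return (p * (1.0 - p))[:, None] * W

def integrated_gradients(x, baseline, n_steps=200):
    # Step 1: "linear" input combinations along the straight-line path
    # (here, the midpoint of each of n_steps equal segments).
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    path = baseline + alphas[:, None] * (x - baseline)
    # Step 2: accumulate gradient x input-change over the path; averaging the
    # gradients and scaling by the total input change is equivalent.
    return (x - baseline) * gradients(path).mean(axis=0)

baseline = np.array([0.5, 0.5])    # e.g., average input values (assumed)
applicant = np.array([-0.5, 1.5])  # hypothetical applicant
ig = integrated_gradients(applicant, baseline)
actual_diff = model(applicant[None, :])[0] - model(baseline[None, :])[0]
print("IG factors:", np.round(ig, 4))
print("sum of IG factors:", round(float(ig.sum()), 4))
print("actual prediction difference:", round(float(actual_diff), 4))
```

In this sketch the IG factors sum (approximately) to the difference between the applicant's prediction and the baseline prediction, which is the decomposition described above.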

What risks should key stakeholders know about Integrated Gradient values?

As IGs have some conceptual similarity to SHAP values, they also possess some similar features that may underlie the concerns expressed in the CFPB's 2022-03 Circular (i.e., when post-hoc explanation methods are used to explain individual credit model predictions for Adverse Action Notification purposes). In particular,

1. Reliance on Sampling

As described above and illustrated in Figure 2, IGs are calculated along a straight-line path between the baseline input scenario and the individual input combination associated with the model prediction one is seeking to explain. This distance can be segmented into any number of steps - depending on the specific step size. Larger step sizes will involve fewer calculations (e.g., 10 total steps and calculations) while smaller step sizes will involve more calculations (e.g., 100 total steps and calculations). In general, smaller step sizes typically yield more precise IG-based explanations - as measured by the difference (i.e., "estimation error") between the estimated baseline model prediction difference - calculated by summing up the estimated IG factors across the model inputs - and the actual baseline model prediction difference we are trying to explain. Larger steps, on the other hand, tend to yield less precise IG-based explanations as the IG estimates are "cruder" due to the larger range of input values covered by each step.

Key Stakeholder Considerations: The number of steps / step sizes used in implementation should be appropriately supported by sufficient evidence that the resulting sample yields IG factors that - collectively - explain the actual baseline model prediction difference within an acceptable accuracy tolerance.
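One way to generate such supporting evidence is to measure the estimation error directly at several step counts. The sketch below assumes a hypothetical three-input logistic model, a simple Riemann-sum approximation, and an illustrative tolerance of 0.001; none of these come from a specific lender's implementation.

```python
import numpy as np

W = np.array([3.0, -2.0, 1.5])  # hypothetical model weights

def model(X):
    return 1.0 / (1.0 + np.exp(-(X @ W)))

def gradients(X):
    p = model(X)
    return (p * (1.0 - p))[:, None] * W

def ig_riemann(x, baseline, n_steps):
    # Gradients evaluated at the end of each step along the straight-line path.
    alphas = np.arange(1, n_steps + 1) / n_steps
    path = baseline + alphas[:, None] * (x - baseline)
    return (x - baseline) * gradients(path).mean(axis=0)

def estimation_error(x, baseline, n_steps):
    """|sum of IG factors - actual baseline model prediction difference|."""
    actual = model(x[None, :])[0] - model(baseline[None, :])[0]
    return abs(ig_riemann(x, baseline, n_steps).sum() - actual)

def smallest_adequate_steps(x, baseline, tol,
                            candidates=(10, 20, 50, 100, 200, 300, 500)):
    """First candidate step count whose estimation error is within tol."""
    for m in candidates:
        if estimation_error(x, baseline, m) <= tol:
            return m
    return None

baseline = np.array([-1.0, 0.5, 0.0])
applicant = np.array([1.0, -1.0, 0.5])
for m in (10, 50, 100, 300):
    print(m, "steps -> estimation error", estimation_error(applicant, baseline, m))
chosen = smallest_adequate_steps(applicant, baseline, tol=1e-3)
print("smallest candidate step count within tolerance:", chosen)
```

Because the required step count depends on the individual applicant and the baseline, a check of this kind would presumably need to be run across a representative sample of applicants rather than a single data point.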

2. Reliance on a Specifically-Defined "Baseline" Scenario to Which an Individual's Model Prediction Will Be Compared

As noted above for SHAP, individual model prediction explanations depend crucially on the baseline scenario to which the model prediction is compared. Different baselines will yield different model prediction explanations and, therefore, different Adverse Action Notification reasons and interpretations. The same risk is present for Integrated Gradients.

Key Stakeholder Considerations: As there is no "one size fits all" choice of baseline scenario for producing Adverse Action Notification reasons, it is important that this choice be well considered by Compliance/Legal, the model developers, and the Company's Model Risk Management function. See my further recommendations in the SHAP section above.
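The sensitivity to the baseline can be illustrated with a small sketch. Everything here is hypothetical: the three standardized inputs (coded so that higher values raise the score), the logistic model weights, the applicant, and the two candidate baselines ("population average" and "approved-applicant average").

```python
import numpy as np

NAMES = ["payment_history", "credit_age", "income"]  # hypothetical inputs
W = np.array([1.0, 0.8, 0.6])                        # hypothetical weights
B = -0.2

def model(X):
    return 1.0 / (1.0 + np.exp(-(X @ W + B)))

def gradients(X):
    p = model(X)
    return (p * (1.0 - p))[:, None] * W

def integrated_gradients(x, baseline, n_steps=200):
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    path = baseline + alphas[:, None] * (x - baseline)
    return (x - baseline) * gradients(path).mean(axis=0)

def adverse_ranking(x, baseline):
    """Inputs with negative IG factors, most negative first."""
    attr = integrated_gradients(x, baseline)
    order = np.argsort(attr)
    return [NAMES[i] for i in order if attr[i] < 0]

applicant = np.array([-1.0, -0.6, 0.2])
population_avg = np.array([0.0, 0.0, 0.0])  # baseline option 1 (assumed)
approved_avg = np.array([0.2, 1.5, 0.5])    # baseline option 2 (assumed)

rank_population = adverse_ranking(applicant, population_avg)
rank_approved = adverse_ranking(applicant, approved_avg)
print("vs. population average:", rank_population)
print("vs. approved average:  ", rank_approved)
```

In this example the same applicant and the same model produce a different leading adverse factor, and a different number of adverse factors, depending solely on the baseline selected.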

3. The Effect of Illogical Input Combinations on IG Reliability

Integrated Gradients use a straight-line path between the baseline input scenario and the input combination associated with the model prediction we are seeking to explain. I note, however, that there is nothing magical about this path, and it is certainly possible that some of the "linear" input combinations along this path may be illogical or unobserved in the model training data.

Key Stakeholder Considerations: The Company should ensure that it sufficiently evaluates this risk with respect to the accurate identification and ordering of the resulting Adverse Action Notification reasons produced therefrom. To the extent that existing research reliably supports the robustness of IG explanations produced with a straight-line path, the Company should evaluate such claims and, if in agreement, include them as support in its documentation.
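As a starting point for such an evaluation, one simple diagnostic is to measure how many points on the straight-line path take values that could never be observed. The sketch below checks only one kind of illogical combination (fractional values for inputs that are counts or 0/1 indicators) and uses hypothetical inputs and values; it is not a complete test of path plausibility.

```python
import numpy as np

def straight_line_path(x, baseline, n_steps=50):
    """Midpoints of n_steps equal segments between baseline and x."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    return baseline + alphas[:, None] * (x - baseline)

def share_non_integer(path, integer_columns, tol=1e-9):
    """Share of path points where any integer-coded input is fractional."""
    vals = path[:, integer_columns]
    off_grid = np.abs(vals - np.round(vals)) > tol
    return float(off_grid.any(axis=1).mean())

# Hypothetical inputs: [number of delinquencies, has_mortgage (0/1), utilization]
baseline = np.array([0.0, 1.0, 0.30])
applicant = np.array([3.0, 0.0, 0.85])

path = straight_line_path(applicant, baseline)
share = share_non_integer(path, integer_columns=[0, 1])
print("share of path points with fractional counts/indicators:", share)
```

Here every intermediate point has a fractional mortgage indicator, so the model is being evaluated at input combinations that no real applicant could have. Note that a baseline built from input averages will often be fractional for such inputs to begin with.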

4. Appropriately Treating IG Factors For Related / Correlated Inputs

As noted previously, one of the features of high-dimensional credit risk datasets is the presence of high correlations - or even redundancy - among subsets of these inputs. In such cases, the common predictive effect represented by these inputs may end up being divided and spread across the individual correlated inputs. While this may not have a practical effect on the model's predictive accuracy, it can have a significant effect on the estimated IG values and, accordingly, the identification of Adverse Action Notification reasons. See my further discussion and recommendations in the SHAP section above.

Key Stakeholder Considerations: It is important for a Company to have an explicit policy related to how individual model inputs are mapped to Adverse Action Notification reasons (under both ECOA and FCRA) with more recent consideration of the CFPB's positions outlined in Circular 2022-03. This could mean that IG values for related inputs are aggregated when ranking model prediction explanations, or it could suggest that model developers handle such correlations and redundancies on the front end through appropriate variable selection processes or model architecture selection.
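The aggregation option can be sketched as follows. The input names, IG values, and reason descriptions below are entirely hypothetical and are meant only to show how grouping related inputs can change the ordering of reasons.

```python
from collections import defaultdict

# Hypothetical IG factors for one applicant (negative = lowers the score).
ig_values = {
    "util_revolving_3m": -0.04,
    "util_revolving_6m": -0.04,
    "util_revolving_12m": -0.04,
    "num_recent_inquiries": -0.07,
    "months_on_file": 0.05,
}

# Hypothetical policy mapping of model inputs to notification reasons.
reason_map = {
    "util_revolving_3m": "Revolving utilization too high",
    "util_revolving_6m": "Revolving utilization too high",
    "util_revolving_12m": "Revolving utilization too high",
    "num_recent_inquiries": "Too many recent inquiries",
    "months_on_file": "Insufficient length of credit history",
}

def rank_adverse(values):
    """Keys with negative values, most negative first."""
    negative = [(k, v) for k, v in values.items() if v < 0]
    return [k for k, _ in sorted(negative, key=lambda kv: kv[1])]

def aggregate_by_reason(values, mapping):
    """Sum the IG factors of all inputs mapped to the same reason."""
    totals = defaultdict(float)
    for name, value in values.items():
        totals[mapping[name]] += value
    return dict(totals)

by_input = rank_adverse(ig_values)
by_reason = rank_adverse(aggregate_by_reason(ig_values, reason_map))
print("top factor by individual input:", by_input[0])
print("top factor after aggregation:  ", by_reason[0])
```

With these illustrative numbers, inquiries rank first when inputs are considered individually, while utilization ranks first once its three correlated inputs are combined.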

5. Ensuring That the IG Methodology is Applied to the Correct Models

For technical reasons, the Integrated Gradient methodology is not suitable for tree-based machine learning algorithms - such as Random Forest, Decision Trees, Boosted Trees, etc.[9] Additionally, for models where the methodology is suitable, it is important that all model inputs are continuous-valued (i.e., no discrete-valued categorical variables are permitted).

Key Stakeholder Considerations: The Company - particularly its Model Risk Management functions - should ensure that the application of the IG methodology conforms with its underlying technical requirements - notably, no tree-based algorithms and the transformation of categorical model inputs into appropriate embedding values.
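One commonly cited reason for the tree-based restriction (my illustration, not spelled out above) is that tree predictions are piecewise constant, so their gradients are zero almost everywhere. The sketch below uses a hypothetical one-split "stump" and finite-difference gradients to show that the resulting IG factors can fail to explain any of the actual prediction difference.

```python
import numpy as np

def stump_model(X):
    """Hypothetical one-split tree: score 0.8 if input 0 > 0.5, else 0.3."""
    return np.where(X[:, 0] > 0.5, 0.8, 0.3)

def numerical_gradients(model, X, eps=1e-4):
    """Central finite-difference gradient of model at each row of X."""
    grads = np.zeros_like(X)
    for j in range(X.shape[1]):
        step = np.zeros(X.shape[1])
        step[j] = eps
        grads[:, j] = (model(X + step) - model(X - step)) / (2 * eps)
    return grads

def integrated_gradients(model, x, baseline, n_steps=50):
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    path = baseline + alphas[:, None] * (x - baseline)
    return (x - baseline) * numerical_gradients(model, path).mean(axis=0)

baseline = np.array([0.1, 0.6])
applicant = np.array([0.9, 0.2])

ig = integrated_gradients(stump_model, applicant, baseline)
actual_diff = stump_model(applicant[None, :])[0] - stump_model(baseline[None, :])[0]
print("IG factors:", ig)
print("actual prediction difference:", actual_diff)
```

The model's prediction changes by 0.5 between the baseline and the applicant, yet every sampled gradient is zero, so the IG factors attribute none of that change to any input.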

Additional Considerations

Lastly, I want to highlight a couple of additional regulatory compliance considerations for Adverse Action Notification reasons derived from these methodologies:

The CFPB released two Circulars in 2022 and 2023 that lenders should heed. My prior post "Dark Skies Ahead: The CFPB's Brewing Algorithmic Storm" contains an in-depth discussion of the 2022 Circular with some important considerations for Compliance Officers.

AI credit models that have gone through algorithmic debiasing (also known as LDA credit models) raise particularly thorny issues for the generation of Adverse Action Notification reasons using the above methodologies. My more recent posts "Fool's Gold: Assessing the Case for Algorithmic De-biasing" and "The Road to Fairer AI Credit Models: Are We Heading in the Right Direction" contain a more detailed discussion of these specific issues.

* * *

ENDNOTES:

[1] For the astute reader, yes - there are two orange model inputs. I will revisit this intentional feature a little later in the article.

[2] There are additional properties other than these; however, they are more technical in nature and I accordingly abstract from them here.

[3] This article focuses solely on the SHAP Interventional Approach (vs. the SHAP Observational Approach) as it aligns with the use case of determining Adverse Action Notification reasons for credit decisions. See Chen, Hugh, et al. "True to the Model or True to the Data", arXiv:2006.16234v1, June 2020, for further details of this distinction.

[4] Keep in mind that this sampling process is run for every individual model prediction that requires explanation. Accordingly, for a credit scoring model that is processing tens of thousands of applications a day, this can be a significant amount of compute time and resources.

[5] While the red and blue inputs had larger impacts on the individual's baseline credit score difference than the yellow and orange inputs, those inputs' contributions were positive (i.e., increasing the estimated credit score) and, therefore, do not correspond to the requirements of Adverse Action reasons.

[6] This risk/limitation is specific to the model-agnostic KernelSHAP approach (which is the focus of this article). However, there are additional model-specific implementations of SHAP - such as TreeSHAP - in which this particular risk is mitigated. While some will say that TreeSHAP (or other model-specific implementations) produces "exact" SHAP values, this is subject to debate and, even if true, does not mean that there are no other risks / limitations of the model-specific implementation approach that would be of concern to key stakeholders. See, for example, this section of Molnar, Christoph. Interpretable Machine Learning. Second Edition.

[7] One complication of such aggregation occurs if the weighted linear regression model used to estimate the KernelSHAP values employs regularization. For example, if L1 regularization is employed to reduce the number of SHAP values to a smaller, more relevant, subset, then care should be exercised to ensure that: (1) the relatively "insignificant" SHAP values that get excluded do not possess any common aggregate explainability that is considered "significant", and (2) that these exclusions also do not materially impact the ordering of the "significant" SHAP values that survive regularization.

[8] See Sundararajan, Mukund, et al. "Axiomatic Attribution for Deep Networks", arXiv:1703.01365v2, June 2017.

[9] There is a variant of Integrated Gradients - called Generalized Integrated Gradients - that purports to handle tree-based algorithms. See Merrill, John, et al., "Generalized Integrated Gradients: A practical method for explaining diverse ensembles", arXiv:1909.01869v2, September 2019, for further details and discussion.