SHAP Importance is calculated differently depending on the algorithm you select in Qlik AutoML.
As explained in Chapter Five of the SONAR© Guide, SHAP Importance offers important insight into the predictions created in AutoML. This article explains how SHAP Importance is calculated for the various algorithms in AutoML.
SHAP Importance values are SHAP values
AutoML uses the SHAP (SHapley Additive exPlanations) package to calculate SHAP values for a variety of algorithms. SHAP is based on the game-theoretically optimal Shapley values. These SHAP values appear in AutoML prediction datasets as "K_" columns, which we call SHAP Importance. SHAP Importance needs to be interpreted differently depending on the algorithm that AutoML is using. Take care to interpret the Shapley value correctly: it is the average contribution of a feature value to the prediction across different coalitions of features. It is NOT the difference in the prediction that would result from removing the feature from the model.
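To make the distinction concrete, here is a minimal sketch that computes exact Shapley values by enumerating every coalition. It is not how AutoML or the SHAP package computes them; `toy_model`, the instance, and the all-zero baseline that stands in for "absent" features are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(instance)

    def value(coalition):
        # Features in the coalition take the instance's values; the rest
        # fall back to the baseline (an assumed stand-in for "absent").
        x = [instance[j] if j in coalition else baseline[j] for j in range(n)]
        return model(x)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi += weight * (value(members | {i}) - value(members))
        phis.append(phi)
    return phis


def toy_model(x):
    # Hypothetical model: one additive term plus an interaction term.
    return 2 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phis = shapley_values(toy_model, instance, baseline)

# Difference in prediction when feature 1 alone is reset to its baseline.
removal_effect = toy_model(instance) - toy_model([1.0, 0.0, 3.0])

print(phis)
print(removal_effect)
```

In this toy case the Shapley value of the second feature is about 3, while resetting that feature alone changes the prediction by 6. The interaction term is shared between the two features involved, which is why the two numbers differ.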
How AutoML calculates SHAP Importance
AutoML produces SHAP Importance for prediction datasets with up to 100,000 rows and for training datasets with up to 500,000 rows. For training data, SHAP Importance is calculated on the holdout data, which is 1/5 of the training rows. SHAP Importance is available for various algorithms in both classification and regression models, and is calculated using one of two methods:
1. Tree SHAP
Tree SHAP is a fast and exact method to estimate SHAP values for tree models and ensembles of trees, under several different possible assumptions about feature dependence. More academic information can be found here. Tree SHAP is used to calculate SHAP Importance with the following models:
- Classification
  - RandomForestClassifier
  - XGBClassifier
- Regression
  - RandomForestRegressor
  - XGBRegressor
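As a rough illustration of the quantity Tree SHAP computes, the sketch below evaluates it by brute force on a single hand-written tree. It assumes the "path-dependent" treatment of missing features, in which a split on an unknown feature is averaged over both children in proportion to how many training rows reached each one. The tree, its `cover` counts, and the `<=` split convention are hypothetical, and the enumeration is exponential in the number of features, whereas Tree SHAP obtains the same values in polynomial time.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy tree. Internal nodes split on `feature` at `threshold`;
# `cover` is the assumed number of training rows that reached the node.
TREE = {
    "feature": 0, "threshold": 5.0, "cover": 10,
    "left": {"value": 1.0, "cover": 4},
    "right": {
        "feature": 1, "threshold": 2.0, "cover": 6,
        "left": {"value": 3.0, "cover": 3},
        "right": {"value": 7.0, "cover": 3},
    },
}


def expected_value(node, x, known):
    """Expected tree output when only the features in `known` are observed."""
    if "value" in node:
        return node["value"]
    left, right = node["left"], node["right"]
    if node["feature"] in known:
        # Known feature: follow the split like a normal prediction.
        child = left if x[node["feature"]] <= node["threshold"] else right
        return expected_value(child, x, known)
    # Unknown feature: average both branches, weighted by training cover.
    total = left["cover"] + right["cover"]
    return (left["cover"] * expected_value(left, x, known)
            + right["cover"] * expected_value(right, x, known)) / total


def tree_shap_bruteforce(tree, x):
    """Shapley values of the game v(S) = expected_value(tree, x, S)."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                known = set(subset)
                phi += weight * (expected_value(tree, x, known | {i})
                                 - expected_value(tree, x, known))
        phis.append(phi)
    return phis


x = [8.0, 3.0]
base_value = expected_value(TREE, x, set())    # no features known
prediction = expected_value(TREE, x, {0, 1})   # all features known
phis = tree_shap_bruteforce(TREE, x)

print(base_value, phis, prediction)
```

The values are additive: the base value plus the sum of the per-feature SHAP values reproduces the tree's prediction for the row.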
2. Linear SHAP
Linear SHAP computes the SHAP values for a linear model and can account for the correlations among the input features. More academic information can be found here. Linear SHAP is used to calculate SHAP Importance with the following models:
- Classification
  - LogisticRegression
- Regression
  - LinearRegression
  - SGDRegressor
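For intuition, the simplest case of Linear SHAP can be written in a few lines. The sketch below assumes the input features are independent, in which case the SHAP value of a feature is its coefficient times the feature's deviation from its mean in the background data. It does not show the correlation-aware variant mentioned above, and the weights, intercept, and background rows are made-up values rather than anything produced by AutoML.

```python
from statistics import mean


def linear_shap(weights, intercept, background, instance):
    """SHAP values for a linear model, assuming independent features."""
    # Per-feature mean over the background rows.
    means = [mean(column) for column in zip(*background)]
    # Each feature contributes its coefficient times its deviation from the mean.
    phis = [w * (x - m) for w, x, m in zip(weights, instance, means)]
    # The base value is the model's output at the feature means.
    base_value = intercept + sum(w * m for w, m in zip(weights, means))
    return phis, base_value


# Hypothetical linear model and data.
weights = [2.0, -1.0]
intercept = 0.5
background = [[1.0, 4.0], [3.0, 6.0], [5.0, 8.0]]
instance = [4.0, 5.0]

phis, base_value = linear_shap(weights, intercept, background, instance)
prediction = intercept + sum(w * x for w, x in zip(weights, instance))

print(phis, base_value, prediction)
```

As with any SHAP method, the base value plus the sum of the SHAP values equals the model's prediction for the row.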
NOTE: If AutoML doesn't create SHAP Importance on your training data because you trained with more than 500,000 rows, AutoML will still produce SHAP Importance values on apply datasets with up to 100,000 rows as long as you're predicting using a model that supports SHAP Importance in AutoML.
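The row limits and model list described in this article can be summarized as a small helper. This is only an illustrative sketch: the function and constant names are hypothetical and are not part of any AutoML API, and rounding the holdout size down with integer division is an assumption.

```python
# Limits and supported models as stated in this article.
MAX_TRAINING_ROWS = 500_000
MAX_PREDICTION_ROWS = 100_000
HOLDOUT_DIVISOR = 5  # 1/5 of the training rows are used as holdout data

SHAP_SUPPORTED_MODELS = {
    "RandomForestClassifier", "XGBClassifier",
    "RandomForestRegressor", "XGBRegressor",
    "LogisticRegression", "LinearRegression", "SGDRegressor",
}


def holdout_rows(training_rows):
    """Rows of holdout data (rounding down is an assumption)."""
    return training_rows // HOLDOUT_DIVISOR


def shap_on_training(model_name, training_rows):
    """True if SHAP Importance would be expected on the training data."""
    return (model_name in SHAP_SUPPORTED_MODELS
            and training_rows <= MAX_TRAINING_ROWS)


def shap_on_predictions(model_name, prediction_rows):
    """True if SHAP Importance would be expected on an apply dataset."""
    return (model_name in SHAP_SUPPORTED_MODELS
            and prediction_rows <= MAX_PREDICTION_ROWS)


# A model trained on 600,000 rows gets no SHAP Importance on training data,
# but an 80,000-row apply dataset still qualifies.
print(shap_on_training("XGBClassifier", 600_000))
print(shap_on_predictions("XGBClassifier", 80_000))
print(holdout_rows(500_000))
```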