Improved: target selection, driver inclusion/exclusion and categorical encoding
As of June 2020, it is easier than ever to select your target, to see which drivers Kraken automatically excludes from model training, and to manually include or exclude drivers yourself. We've also improved the way high-cardinality categorical values are encoded.
Target Selection: In Schema View, after you select the target column in your training dataset, the remaining columns become selectable so that you can easily exclude columns from training.
Exclude All/Include All checkbox: In Schema View, after you select the target column in your training dataset, an “Exclude All” checkbox now appears at the top of the view. Use it to quickly deselect all columns except the target column, and then choose just the columns to include in model training. This is helpful when you want to include only a few of many columns. After you exclude all columns, the checkbox changes to “Include All” so that you can quickly include all columns in model training again.
Visual indication of excluded columns: In both Schema View and Data View, columns excluded from analysis are greyed out.
Data Insights in Data View: As in Schema View, Kraken now marks any column that is automatically excluded from analysis with a gold triangle in Data View. Hover over the triangle for more information about the automatic exclusion.
Categorical Encoding: Kraken now uses an improved method of transforming high-cardinality categorical values. This improvement reduces overfitting and increases generalization. NOTE: Models built before June 2020 are not affected unless they are retrained. All new models use the new encoding. When a model built before June 2020 is retrained, the new encoding may produce lower scores on some datasets, but those scores will more accurately represent how the model will perform in production.
For all models built before June 2020, we strongly recommend that you create new versions of your existing analyses so that you can compare the model results under the new categorical encoding with the results of your existing models.
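To illustrate why the choice of encoding for high-cardinality categorical values affects overfitting, here is a minimal Python sketch of one common technique, smoothed target (mean) encoding. These notes do not say which method Kraken uses, so this is an assumption chosen purely for illustration and not a description of Kraken's implementation. The function names and the `smoothing` parameter are hypothetical.

```python
from collections import defaultdict


def smoothed_target_encode(categories, targets, smoothing=10.0):
    """Return ({category: encoded value}, global_mean).

    Each category is mapped to a blend of its own target mean and the
    global target mean. Rare categories are pulled toward the global
    mean, which limits how much the model can memorize them.
    NOTE: illustrative only; not Kraken's actual encoding.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if not categories:
        return {}, 0.0

    global_mean = sum(targets) / len(targets)

    # Accumulate the per-category target sum and row count.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    # Blend: `smoothing` acts like that many pseudo-rows at the global mean.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def apply_encoding(categories, encoding, global_mean):
    """Encode values, falling back to the global mean for unseen categories."""
    return [encoding.get(cat, global_mean) for cat in categories]
```

With a larger `smoothing` value, a category seen only once or twice is encoded close to the global mean instead of its own (noisy) target mean. Scores measured on training data can drop as a result, while performance on new data tends to be more stable, which is the kind of trade-off the note above describes.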