One of the largest challenges in predictive analytics is knowing how a trained model will perform on data it has never seen before. In other words, how well has the model learned true patterns rather than simply memorizing the training data? There are several ways to approach this problem. Cross validation is one of the most common, though it can also be computationally expensive. The practice of cross validation is to take a dataset and randomly sort it into multiple buckets called “folds”. For the purpose of this example, imagine splitting a dataset into 4 pieces at random. Cross validation then tests each fold against a model trained on the other three folds. In other words, each trained model is tested on a piece of the data that the model has never seen before, and this is done in rotation for all of the segments of the data. The outcome of cross validation is a set of test metrics that gives a reasonable forecast of how a model trained on all of the data will do at predicting a record it has never seen before. Computers are exceptionally good at helping a math equation memorize the data so that the equation fits the historical data perfectly; cross validation is a good technique to make sure that the model isn’t just memorizing, but is learning generalized patterns.
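The shuffle-split-and-rotate procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not Kraken’s implementation: the “model” here is a hypothetical toy that predicts the mean of its training values, and the score is mean absolute error, chosen just to make the fold rotation concrete.

```python
import random

def k_fold_splits(data, k=4, seed=0):
    """Shuffle the records and deal them into k roughly equal folds."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate(data, k=4):
    """Hold out each fold in turn, train on the rest, and score on the holdout."""
    folds = k_fold_splits(data, k)
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        # Toy "model": predict the mean of the training values.
        prediction = sum(train) / len(train)
        # Score the held-out fold (data the model never saw) with mean absolute error.
        mae = sum(abs(y - prediction) for y in test_fold) / len(test_fold)
        scores.append(mae)
    return scores
```

With 4 folds, `cross_validate` returns 4 scores, one per rotation; averaging them gives the kind of summary test metric described above.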
Cross validation is what generates all of the scoring metrics in Kraken. These metrics may also be called “test accuracy” or “test metrics,” and are often referred to as “model score” or “model accuracy.”