Scoring a classification model largely comes down to the confusion matrix: most other metrics are calculated by combining its quadrants in slightly different ways, each combination telling us about a different strength or weakness of the model. Scoring a classification model involves the following metrics:
- Confusion Matrix - The confusion matrix shows how the model's predictions compare with the actual truth. It sorts each record into one of four quadrants, making it immediately clear where the predictions were aligned with the actual values and where they diverged:
|                 | Actual True    | Actual False   |
|-----------------|----------------|----------------|
| Predicted True  | True Positive  | False Positive |
| Predicted False | False Negative | True Negative  |
- Overall Accuracy - The proportion of all predictions that the model got right = (True Positives + True Negatives)/(All Predictions).
- Precision - When the model predicts that something is True, how often is it actually True = True Positives/(True Positives + False Positives).
- Recall - When something was actually True, how often did the model correctly predict that it was True = True Positives/(True Positives + False Negatives). An easy way to remember this is the phrase "If I recall correctly", which is essentially asking whether one's memory is good; recall measures how well the model "remembers" the positive class.
- F1 Score - F1 is a metric that accounts for class imbalance by focusing on the accuracy of positive predictions and the coverage of actual positive records. It does this by taking the harmonic mean of precision and recall = 2 * ((Precision * Recall)/(Precision + Recall)). It's important to note that the more imbalanced a dataset is, the lower the F1 score is likely to be, even at the same overall accuracy.
- AUC - AUROC or AUC ROC - This is a more involved metric that measures how well the model separates the two classes. It is defined as the area under the ROC curve, which plots the True Positive Rate (Y-axis) against the False Positive Rate (X-axis) as the classification threshold is varied. The closer this value is to 1 (the maximum possible area under the curve), the better the model is at ranking positives above negatives; a value of 0.5 is equivalent to random guessing.
- MCC - Matthews Correlation Coefficient - MCC is another single-number summary of model quality that uses all four quadrants of the confusion matrix. The score ranges from -1 to 1, where 1 means the model predicted every sample correctly and 0 is no better than random guessing. = ((TP * TN) - (FP * FN))/[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)]^(1/2), where TP, TN, FP, and FN are True Positives, True Negatives, False Positives, and False Negatives. Each of these metrics is computed in the sketch after this list.
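To make the arithmetic concrete, here is a minimal sketch, assuming Python with NumPy and scikit-learn and hypothetical y_true/y_pred/y_score arrays, that builds the confusion matrix quadrants and computes each of the metrics above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical example data: actual labels, hard 0/1 predictions, and predicted probabilities.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred  = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.05])

# The four quadrants of the confusion matrix.
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

# Metrics derived from the quadrants.
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * (precision * recall) / (precision + recall)
mcc       = ((tp * tn) - (fp * fn)) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# AUC is computed from the predicted probabilities, not the hard 0/1 predictions.
auc = roc_auc_score(y_true, y_score)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1:.2f}")
print(f"MCC:       {mcc:.2f}")
print(f"AUC:       {auc:.2f}")
```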
Interpreting the score on classification models is intuitive, but it takes a little mental gymnastics in some cases. In particular, a great overall accuracy score doesn't mean that the model is great. Suppose a business only had a 10% conversion rate: the model could get a 90% accuracy score simply by saying that no leads would ever convert. That's where F1, recall, and precision come into play and help weigh the strengths and weaknesses of a model. If the model did assume 100% of the leads would not convert, F1 would be 0, as the sketch below shows. The important thing to grasp is that each of these scoring metrics exposes different strengths and weaknesses, and none of them on its own is a true measure of goodness of fit; sometimes a model can have low accuracy but still be much better than anything the business has done before, and therefore improve the bottom line.
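A quick sketch of that conversion scenario, assuming Python with NumPy and scikit-learn and a hypothetical set of 1,000 leads with a 10% conversion rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical data: 1,000 leads, 10% of which actually convert.
y_true = np.array([1] * 100 + [0] * 900)

# A "model" that simply predicts that no lead will ever convert.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks great on its own
print(recall_score(y_true, y_pred))    # 0.0 -- misses every actual converter
print(f1_score(y_true, y_pred))        # 0.0 -- scikit-learn warns that precision is
                                       #        undefined (0/0) and reports F1 as 0
```

The 90% accuracy comes entirely from the class imbalance, while recall and F1 immediately expose that the model never identifies a single converting lead.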