Imputation is the practice of filling in missing or null values. These nulls can arise for many reasons: a new data point starts being collected and older records never had it, a user skips part of a form, or a database or ETL error drops the value. There are many methods for imputing these nulls, but it is more important to understand the effects that imputation can have on the data, effects that persist through the model and its predictions. The problem with imputation is that it can create patterns that did not previously exist or add noise to a pattern that did, reducing the validity of the data. For example, NUMERICAL-IMPUTATION-1 shows a scatter plot and best-fit line with some numerical value along the x axis and some target value along the y axis. It appears to be a good fit, but it only takes into account points where x is known.
NUMERICAL-IMPUTATION-2 shows what happens when the mean is imputed but the values are not missing at random. The imputed points skew the line of best fit and add noise, reducing the accuracy the model can achieve. NUMERICAL-IMPUTATION-3 shows the mean being imputed when the missing values are randomly distributed, in other words, when there is no additional knowledge to be gained by knowing whether the value was present or missing. When the missing values are random, the line of best fit does not change. If zero were imputed instead of the mean, however, the line would shift in much the same way it did in NUMERICAL-IMPUTATION-2.

A real-world example: suppose x is annual income and y is the loan amount someone can qualify for. It stands to reason that the more money someone makes, the more they can borrow, as seen in NUMERICAL-IMPUTATION-1. Now imagine that people who make less money are less likely to report their annual income; that is what NUMERICAL-IMPUTATION-2 looks like. Here the x value is missing for a reason, and imputing the mean for those records will skew the results of the model. Instead, suppose some income records were randomly deleted from the database while the rest of each person's record survived, so that x is null at random; that effect is pictured in NUMERICAL-IMPUTATION-3 and is a case where it is safe to impute the value. One more thing to consider: what if 0 were imputed instead of the mean? It is probably not accurate to say that someone makes 0 dollars annually, and it would skew the model in a way similar to NUMERICAL-IMPUTATION-2.
NUMERICAL-IMPUTATION-1
NUMERICAL-IMPUTATION-2
NUMERICAL-IMPUTATION-3
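This effect is easy to reproduce on synthetic data. The sketch below is illustrative only; the income and loan numbers, the 60,000 cutoff, and the 30% missing rate are made up for the example. It fits a simple line before and after mean imputation, once with low values missing non-randomly and once with values missing at random; the non-random case shifts the fitted line, while the random case leaves it roughly unchanged.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(20_000, 150_000, size=1_000)      # e.g., annual income
y = 0.5 * x + rng.normal(0, 5_000, size=1_000)    # e.g., loan amount qualified for

def fit_line(x_vals, y_vals):
    slope, intercept = np.polyfit(x_vals, y_vals, 1)
    return round(slope, 3), round(intercept)

print("complete data:           ", fit_line(x, y))

# Non-random missingness: lower incomes are more likely to be unreported.
x_mnar = np.where(x < 60_000, np.nan, x)
x_mnar = np.where(np.isnan(x_mnar), np.nanmean(x_mnar), x_mnar)   # impute the mean
print("mean imputed, non-random:", fit_line(x_mnar, y))

# Random missingness: every record is equally likely to be missing.
x_mcar = np.where(rng.random(1_000) < 0.3, np.nan, x)
x_mcar = np.where(np.isnan(x_mcar), np.nanmean(x_mcar), x_mcar)   # impute the mean
print("mean imputed, random:    ", fit_line(x_mcar, y))
```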
If a simple way to impute a missing numerical value is to take the mean, what is the equivalent for categorical values? A simple method is to take another form of average, the mode, which is the most frequently occurring value. Imputing the mode carries some of the same challenges as imputing the mean for a numerical value. In CATEGORICAL-IMPUTATION-1, think of `Missing1` as analogous to NUMERICAL-IMPUTATION-2 and `Missing2` as analogous to NUMERICAL-IMPUTATION-3: there is a pattern to `Missing1` in relation to the target, while `Missing2` is randomly distributed, which can be seen because its average target matches the overall average. The differences are that here both `Missing1` and `Missing2` skew the pattern recognition, and the skew lands on only a subset of the data, Category B, which would be the imputed value when imputing with the mode.
CATEGORICAL-IMPUTATION-1
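A small synthetic sketch of the same idea follows; the category labels, target levels, and proportions are made up rather than taken from the figure. Once the missing rows are imputed with the mode, Category B's average target is pulled away from its true level.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
raw = rng.choice(["A", "B", "B", "C", "?"], size=1_000)        # "?" stands in for a null
df = pd.DataFrame({"category": pd.Series(raw).replace("?", np.nan)})

# Each category has its own target level; rows with a missing category sit at a
# distinctly lower level, mimicking a `Missing1`-style pattern tied to the target.
levels = {"A": 10.0, "B": 20.0, "C": 30.0}
df["target"] = df["category"].map(levels).fillna(5.0) + rng.normal(0, 1, size=1_000)

print(df.groupby("category", dropna=False)["target"].mean())

# Impute the mode ("B") and watch Category B's average target get dragged down.
df["category_imputed"] = df["category"].fillna(df["category"].mode()[0])
print(df.groupby("category_imputed")["target"].mean())
```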
Skewing the relationship between a single value and the target is not the only risk in imputation, but it is among the most common and the easiest to evaluate. Other risks tend to be of the same kind: they add noise and reduce accuracy.
Fortunately, this risk can be greatly mitigated by dropping dimensions that are relatively sparse. There is always debate about exactly what sparsity level (percent missing) to allow in a column or a row, but there is no debate that imputing values can distort the data so that it no longer truly represents what it is trying to describe. Despite this, imputation is still common practice, because dropping records over a single missing value could drastically reduce the size of the data set and cause patterns across other dimensions to be missed.
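In pandas, such a sparsity cutoff might look like the sketch below. The 50% column threshold and 80% row threshold are arbitrary illustrations rather than recommendations, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":  [52_000, np.nan, 61_000, np.nan, np.nan, 48_000],
    "age":     [34, 29, np.nan, 41, 38, 50],
    "segment": [np.nan, np.nan, np.nan, np.nan, "b", np.nan],
})

# Drop columns where more than 50% of the values are missing.
col_missing = df.isna().mean()
df = df.loc[:, col_missing <= 0.5]

# Drop rows with fewer than 80% of the remaining columns populated.
min_populated = int(np.ceil(0.8 * df.shape[1]))
df = df.dropna(thresh=min_populated)

print(df)
```

Note how aggressively the row filter shrinks even this tiny example; that trade-off is exactly why imputation remains common despite its risks.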
Now that the risk has been addressed, how should values actually be imputed? There are many methods beyond the ones addressed here, but these are very common in practice.
Numerical values: A good default is to simply impute the mean. If the missing values are randomly distributed, imputing the mean is like adding weight to the very center of a teeter-totter: it does not tip the balance to either side, and everything continues normally. However, always consider what the value is trying to represent. If there is sales data with some null sale prices, is that an ETL error, or did the person use a coupon or a loyalty punch card? Does 0, or some value other than the mean, make more sense?
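A minimal pandas sketch of that default, with hypothetical column names: impute the mean, except where a missing price plausibly means the item was free.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "sale_price":  [19.99, 24.50, np.nan, 12.00, np.nan],
    "used_coupon": [False, False, True, False, False],
})

sales["price_imputed"] = (
    sales["sale_price"]
    .mask(sales["sale_price"].isna() & sales["used_coupon"], 0.0)  # coupon sale -> price of 0
    .fillna(sales["sale_price"].mean())                            # otherwise -> column mean
)
print(sales)
```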
Categorical values: A good default here is actually to add another category and call it ‘Other’, ‘Unknown’, or even ‘Missing’. This category can absorb the additional variance if the missing values are randomly distributed, or gain feature importance if there is a meaningful relationship between the value being missing and the target. A fallback is to impute the categorical form of the average, the mode. As with numerical values, it is always important to assess what the value is actually trying to represent and why it might be missing. If most of the missing values are probably the mode, then impute the mode; this becomes harder and harder to judge as the number of unique values in the column grows. If it is not known, or there is uncertainty, impute the missing values with filler text that is a new, unique value in the dataset.
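Both categorical defaults are one-liners in pandas; the column name below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"channel": ["web", "store", None, "web", None, "phone"]})

# Default: treat "missing" as its own category so the model can learn from it.
df["channel_flagged"] = df["channel"].fillna("Missing")

# Fallback: impute the mode when missing values are believed to mostly be the common case.
df["channel_mode"] = df["channel"].fillna(df["channel"].mode()[0])

print(df)
```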