Much of the model preprocessing is performed by default, but it is important to understand what is done and why: the defaults fit many cases, but sometimes the data should be handled differently. Knowing the defaults, along with the concepts behind them, helps a user decide what to do with the data before using it for analyses.
Preprocessing Steps
After the user has selected a target column, rows where the target is null are separated into another dataset, called the apply dataset, leaving the rows where the target is known as the training set. Only data from the training set is used to make the decisions in the following steps; the steps and their metadata are saved and applied to any new data the model makes predictions on.
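The split described above can be sketched as follows; the DataFrame and its column names are illustrative, not part of the source:

```python
import pandas as pd

# Hypothetical data with a user-selected target column.
df = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0],
    "target": [10.0, None, 30.0, None],
})

# Rows with a null target become the apply dataset;
# rows with a known target become the training set.
apply_df = df[df["target"].isna()]
train_df = df[df["target"].notna()]
```

All subsequent statistics (means, modes, scaling parameters) would be computed from `train_df` only.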
- Columns are classified as categorical or numerical:
- FLOAT, DOUBLE, and DECIMAL data types are always considered numerical.
- STRING data types are always considered categorical.
- INTEGER data types are considered categorical or numerical depending on the column's cardinality: a column with 50 or fewer unique values is considered categorical; otherwise it is considered numerical.
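The classification rules above can be sketched as a small function; the function name and the use of pandas dtypes are illustrative assumptions:

```python
import pandas as pd

# Threshold from the rules above: INTEGER columns with 50 or fewer
# unique values are treated as categorical.
CATEGORICAL_CARDINALITY_LIMIT = 50

def classify_column(series: pd.Series) -> str:
    """Classify a column as 'numerical' or 'categorical'."""
    if pd.api.types.is_float_dtype(series):
        # FLOAT / DOUBLE / DECIMAL -> always numerical.
        return "numerical"
    if pd.api.types.is_integer_dtype(series):
        # INTEGER -> depends on cardinality.
        if series.nunique() <= CATEGORICAL_CARDINALITY_LIMIT:
            return "categorical"
        return "numerical"
    # STRING (and anything else) -> categorical.
    return "categorical"
```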
- Each column is checked for sparsity and cardinality (high cardinality = high number of unique values) to make the following decisions:
- If the column is 50% null or more, then the entire column is dropped. Remember that 0 is never considered null.
- If the column has only one unique value (i.e., it is constant), the column is dropped.
- If the column is categorical and is 90% unique or more, or has more than 1000 unique values, the column will be dropped.
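The drop rules above can be sketched as a predicate; the function name and exact signature are illustrative assumptions:

```python
import pandas as pd

def should_drop(series: pd.Series, is_categorical: bool) -> bool:
    """Return True if the column would be dropped under the rules above."""
    n = len(series)
    # 50% or more null -> drop. Note that 0 counts as a value, not a null.
    if series.isna().sum() / n >= 0.5:
        return True
    n_unique = series.nunique(dropna=True)
    # Only one unique value (constant column) -> drop.
    if n_unique <= 1:
        return True
    # Categorical and 90%+ unique, or more than 1000 unique values -> drop.
    if is_categorical and (n_unique / n >= 0.9 or n_unique > 1000):
        return True
    return False
```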
- Calculates and saves the mean for numerical columns and the mode for categorical columns.
- Imputes missing values.
- Encodes categorical variables.
- Calculates and saves summary statistics for each column to use for feature scaling.
- Standardizes (feature scales) each column, using the saved summary statistics.
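The final steps (saving statistics, imputing, encoding, and scaling) can be sketched end to end. The data and column names are illustrative; one-hot encoding and z-score standardization are assumptions, since the source does not name the exact encoder or scaling formula:

```python
import pandas as pd

# Hypothetical training data after the apply/train split.
train = pd.DataFrame({
    "age": [20.0, None, 40.0, 20.0],        # numerical
    "color": ["red", "blue", None, "red"],  # categorical
})

# Save the mean for the numerical column and the mode for the categorical one.
age_mean = train["age"].mean()
color_mode = train["color"].mode()[0]

# Impute missing values using the saved statistics.
train["age"] = train["age"].fillna(age_mean)
train["color"] = train["color"].fillna(color_mode)

# Encode the categorical column (one-hot encoding assumed).
train = pd.get_dummies(train, columns=["color"])

# Save summary statistics and standardize the numerical column
# (z-score assumed: subtract the mean, divide by the standard deviation).
age_std = train["age"].std()
train["age"] = (train["age"] - age_mean) / age_std
```

The same saved statistics (`age_mean`, `color_mode`, `age_std`) would later be applied unchanged to the apply dataset and to any new prediction data.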