Your dataset should be mostly ready for machine learning; however, we apply some basic pre-processing steps to the data before building models:
- Imputation of nulls
- Encoding categorical features (a.k.a. creating "dummy variables")
- Feature scaling (a.k.a. normalization)
- Five-fold cross validation and automatic holdout
  - Extract 20% of your dataset to be used for final model evaluation (the automatic holdout).
  - Randomly sample the remaining training data for five-fold cross validation.
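The steps above can be sketched with scikit-learn. This is a minimal illustration, not the pipeline's actual implementation; the toy data, the choice of median imputation, and the logistic-regression model are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with nulls and a categorical feature (illustrative only)
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "city": rng.choice(["NY", "SF", "LA"], n),
})
df["label"] = (df["age"] > 40).astype(int)
df.loc[rng.choice(n, 5, replace=False), "age"] = np.nan  # inject some nulls

X, y = df[["age", "city"]], df["label"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # imputation of nulls
    ("scale", StandardScaler()),                   # feature scaling
])
categorical = OneHotEncoder(handle_unknown="ignore")  # dummy variables

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# Automatic holdout: reserve 20% of the rows for final evaluation
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Five-fold cross validation on the training portion only
scores = cross_val_score(
    model, X_train, y_train,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
print("CV accuracy per fold:", scores)

# Final evaluation on the untouched holdout
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_hold, y_hold))
```

Keeping the imputer, encoder, and scaler inside the pipeline ensures they are fit only on each fold's training split, so no information from the holdout or validation folds leaks into pre-processing.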
Each of these pre-processing steps is governed by thresholds set in our pipeline. We adjust these thresholds as we learn more about the accuracy of the models being created.
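Such thresholds might be represented as a simple configuration. The names and values below are purely illustrative assumptions, not the pipeline's actual settings:

```python
# Hypothetical pipeline thresholds -- names and defaults are illustrative,
# not the pipeline's real configuration.
PREPROCESSING_THRESHOLDS = {
    "max_null_fraction": 0.5,  # drop a column if over 50% of its values are null
    "max_cardinality": 50,     # skip one-hot encoding above this many categories
    "holdout_fraction": 0.20,  # share of rows reserved for final evaluation
    "cv_folds": 5,             # number of cross-validation folds
}

def should_encode(n_categories: int,
                  thresholds: dict = PREPROCESSING_THRESHOLDS) -> bool:
    """Encode a categorical column only at or below the cardinality threshold."""
    return n_categories <= thresholds["max_cardinality"]

print(should_encode(10))   # a low-cardinality column is encoded
print(should_encode(200))  # a very high-cardinality column is skipped
```

Centralizing thresholds like this makes it easy to tune them as model accuracy feedback comes in, without touching the pre-processing code itself.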