Kraken requires a dataset that is mostly ready for machine learning. However, we do apply some basic pre-processing steps to the data before building models.
- Imputation of nulls
- Encoding categorical features (also known as creating "dummy variables")
- Feature scaling, or normalization
- Handling high correlation between a Driver and the predicted Metric, or between Drivers themselves
- Random sampling of the data, followed by five-fold cross-validation
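
The steps above can be sketched with a standard scikit-learn pipeline. This is an illustrative example only, not Kraken's actual implementation: the toy data, column names, and imputation/scaling choices are assumptions, and the correlation-handling step is omitted for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with nulls and a categorical Driver (illustrative only).
df = pd.DataFrame({
    "driver_a": [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "region": ["east", "west", "east", None, "west",
               "east", "west", "east", "west", "east"],
    "metric": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.9, 10.2],
})

numeric = ["driver_a"]
categorical = ["region"]

# Impute nulls, one-hot encode categoricals ("dummy variables"),
# and scale numeric features.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([("prep", preprocess), ("fit", LinearRegression())])

# Five-fold cross-validation over the full pipeline, so imputation,
# encoding, and scaling are re-fit inside each training fold.
scores = cross_val_score(model, df[numeric + categorical], df["metric"], cv=5)
print(len(scores))
```

Running the pre-processing inside the cross-validation pipeline (rather than before the split) avoids leaking information from the held-out fold into the fitted imputers and scalers.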
Each of these pre-processing steps is governed by thresholds set in our pipeline. We adjust the thresholds as we learn more about the accuracy of the models Kraken creates.
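
A threshold-driven pipeline of this kind might be configured along the following lines. Every name and value here is hypothetical, invented to show the shape of the idea; none of it reflects Kraken's actual configuration.

```python
# Hypothetical thresholds (names and values are illustrative, not Kraken's).
THRESHOLDS = {
    "max_null_fraction": 0.5,        # drop features with >50% nulls
    "max_driver_correlation": 0.95,  # drop one of any Driver pair above this
    "max_cardinality": 20,           # skip encoding very wide categoricals
}

def feature_passes(stats, thresholds=THRESHOLDS):
    """Return True if a feature's summary stats clear every threshold."""
    return (stats["null_fraction"] <= thresholds["max_null_fraction"]
            and stats["max_corr"] <= thresholds["max_driver_correlation"]
            and stats["cardinality"] <= thresholds["max_cardinality"])

print(feature_passes({"null_fraction": 0.1, "max_corr": 0.8, "cardinality": 5}))
print(feature_passes({"null_fraction": 0.7, "max_corr": 0.8, "cardinality": 5}))
```

Centralizing the thresholds in one place like this makes them easy to tune as model accuracy is evaluated over time.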