Kraken performs a number of preprocessing steps by default, and it is important to understand what those steps do and why: the defaults fit many cases, but sometimes the data should be handled differently. Knowing Kraken's defaults, along with the concepts behind them, will help a user decide what to do with the data before sending it to Kraken.
Kraken Preprocessing Steps
After the user selects a target column, Kraken finds the rows where the target is null and separates them into a second dataset, the apply dataset, leaving the rows where the target is known as the training set. Only the training set is used to make the decisions in the following steps; the steps and their metadata are saved and applied to any new data the model makes predictions on.
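The split described above can be sketched in a few lines of pandas. This is an illustration, not Kraken's actual code; the function and column names are ours.

```python
import pandas as pd

def split_on_target(df, target):
    """Rows with a null target become the apply set;
    rows with a known target become the training set."""
    apply_set = df[df[target].isna()]
    train_set = df[df[target].notna()]
    return train_set, apply_set

df = pd.DataFrame({"target": [1.0, None, 3.0], "x": [10, 20, 30]})
train, apply_ = split_on_target(df, "target")
```

Here the middle row, whose target is unknown, ends up in the apply set and the other two rows form the training set.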
- Kraken classifies a column as categorical or numerical:
- FLOAT, DOUBLE, and DECIMAL data types are always considered numerical.
- STRING data types are always considered categorical.
- INTEGER data types are considered categorical or numerical depending on the cardinality of the column: if the column has 50 or fewer unique values, it is considered categorical; if it has more, it is considered numerical.
- Kraken checks each column for sparsity and cardinality (high cardinality = high number of unique values) to make the following decisions:
- If the column is 50% null or more, then the entire column is dropped. Remember that 0 is never considered null.
- If the column has only one unique value (i.e., it is constant), the column will be dropped.
- If the column is categorical and is 90% unique or more, or has more than 1000 unique values, the column will be dropped.
- Calculates and saves the mean for numerical columns and the mode for categorical columns.
- Imputes missing values.
- Encodes categorical variables.
- Calculates and saves summary statistics for each column to use for feature scaling.
- Standardizes (feature scales) each column, using the saved summary statistics.
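The rules above can be sketched in pandas. This is a rough illustration under stated assumptions, not Kraken's implementation: the function names are ours, the thresholds (50 unique values, 50% null, 90% unique, 1000 categories) come from the text, and one-hot encoding is an assumption since the text only says categorical variables are "encoded."

```python
import pandas as pd

def classify_column(dtype, n_unique):
    """Categorical vs. numerical, per the type rules above."""
    dtype = dtype.upper()
    if dtype in ("FLOAT", "DOUBLE", "DECIMAL"):
        return "numerical"
    if dtype == "STRING":
        return "categorical"
    if dtype == "INTEGER":
        return "categorical" if n_unique <= 50 else "numerical"
    raise ValueError(f"unhandled type: {dtype}")

def should_drop(col, is_categorical):
    """Sparsity/cardinality drop rules (0 is never treated as null)."""
    if col.isna().mean() >= 0.5:        # 50% or more null
        return True
    if col.nunique(dropna=True) == 1:   # constant column
        return True
    if is_categorical:
        n_unique = col.nunique(dropna=True)
        if n_unique / len(col) >= 0.9 or n_unique > 1000:
            return True
    return False

def fit_transform(train, numeric_cols, categorical_cols):
    """Save mean/mode/std, impute, encode, then standardize."""
    stats = {
        "mean": {c: train[c].mean() for c in numeric_cols},
        "std": {c: train[c].std() for c in numeric_cols},
        "mode": {c: train[c].mode()[0] for c in categorical_cols},
    }
    out = train.copy()
    for c in numeric_cols:
        out[c] = (out[c].fillna(stats["mean"][c]) - stats["mean"][c]) / stats["std"][c]
    for c in categorical_cols:
        out[c] = out[c].fillna(stats["mode"][c])
    out = pd.get_dummies(out, columns=categorical_cols)
    return out, stats
```

The saved `stats` dictionary plays the role of the persisted metadata: the same means, modes, and standard deviations computed on the training set would be reused to transform any new data at prediction time.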
Kraken Pipeline Errors
This is a list of things that cause errors in Kraken's data pipeline as of August 2018. Most stem from general dataset problems that affect most platforms; some are specific to common machine learning packages and to Kraken.
- Commas within fields act as delimiters
- If a field is processed as a string and has commas, the dataset will not load.
- User Fix: use the REPLACE() function to substitute another separator, or cast the string to a numerical type (when applicable).
- Column Names with Periods
- Columns whose names carry schema information, like ‘customer.revenue’, cause errors.
- User Fix: rename the columns if periods exist.
- Datasets larger than 500 MB
- Preprocessing and parsing categorical variables consumes a lot of memory. To prevent errors, we limit datasets that will be parsed to 500 MB.
- User Fix: submit a feature request for larger sets, drop unnecessary features, or break the data into multiple models.
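The user fixes above can be sketched in pandas. The REPLACE() mentioned in the text is presumably a SQL function; `str.replace` is the pandas equivalent. The column names here are invented for illustration, and in-memory size is only a proxy for the on-disk size that the 500 MB limit applies to.

```python
import pandas as pd

df = pd.DataFrame({"customer.notes": ["red,large", "blue,small"],
                   "customer.revenue": [100.0, 200.0]})

# Fix 1: swap commas in string fields for another separator
# so they are not mistaken for delimiters.
df["customer.notes"] = df["customer.notes"].str.replace(",", ";", regex=False)

# Fix 2: rename columns so names contain no periods.
df.columns = [c.replace(".", "_") for c in df.columns]

# Fix 3: estimate the dataset's size and drop unneeded
# columns if it is too large to submit.
size_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
if size_mb > 500:
    df = df.drop(columns=["customer_notes"])
```

After these steps the fields parse cleanly, the column names are period-free, and the dataset is only submitted once it is under the size limit.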