Data Pipeline gives you flexibility in selecting and refining data in your training datasets. Use the Data Pipeline to quickly create multiple versions of model Analyses, each with different data transformations.
Data Pipeline window: After selecting your target column, the Data Pipeline window automatically appears on the right. You can show or hide the window by clicking the “curvy arrows” button to the left of the “Analyze” button.
List of selected algorithms: After selecting your target column, the Data Pipeline window provides an expandable list in the “Analysis Setup” section showing the algorithms that are automatically selected for training based on the data type of your target column.
Preprocessing information: After selecting your target column, the Data Pipeline window provides an expandable list in the “Analysis Setup” section showing the preprocessing actions that will be performed during model training. Preprocessing occurs AFTER all user-defined actions in the Data Pipeline are complete.
Action Menu (“the three dots”): You can perform a variety of data transformations with this menu, available by clicking the three dots.
- Accessing the Action Menu: In Schema View, the Action Menu is available on the right side of the dataset column list. In Data View, the Action Menu is available to the right of each column name.
- Creating Actions: Creating an action will create an entry in the “Data Prep” section of the Data Pipeline where you can complete the action. To add the action to the Data Pipeline, click “Apply” in the lower right. Actions you create will be displayed in real time in both Schema View and Data View. Available Actions include:
- Exclude from Analysis/Include in Analysis: Available only in Data View, where the Action Menu has an option to exclude a column from analysis or include it again. In Schema View, this action is accomplished using the checkbox to the left of each column name.
- Set as Target: Select the target column that will be used for analysis; this overrides the target column you have previously selected. The selected target column will appear at the top of the “Analysis Setup” section in the Data Pipeline window.
- Rename Column: Rename a column to the name of your choice. You cannot rename a column to an existing column name in the dataset.
- Filter: Apply a filter on the values in your dataset. Add an additional filter action for each filter you want to apply. If you apply multiple filters that together exclude all column data (e.g., State not equal to ‘UT’ and State equal to ‘UT’), AutoML will grey out the “Analyze” button in the upper right.
- Change Type: Change the data type of a column in your dataset. For example, change an integer column to a string column.
- Bin Data: Group two or more values together in a string column. For example, group states in the southeast into a single “Southeast” bin.
- Replace Nulls: Replace null values in a column with one of four options: Median, Mode, 0.0 or “Other”.
- Split Column: Split a column into two columns using a delimiter found in the column values. For example, a column called “Location” with values of “city, state” can be split, using a comma as the delimiter, to produce a second column called “State”. The new column appears in both Schema View and Data View. You cannot create a column with the same name as an existing column in the dataset. A space is a valid delimiter for splitting columns. You do NOT need to include quotes (‘’ or “”) around characters or spaces when specifying a delimiter.
- Find and Replace: Replace column values with different values of your choice. For example, replace UT with Utah. You do NOT need to include quotes (‘’ or “”) around characters or spaces when specifying a replacement value.
- Modifying Actions: If you make a change to an existing Data Prep action, click the “Revert” circular arrow to undo your change. You can drag the Data Prep actions to reorder them as you wish; take care when reordering actions to avoid creating a situation where the action logic cannot be successfully applied to the Data Pipeline.
- Removing Actions from Processing: To prevent AutoML from processing an action while keeping it in the Data Pipeline for future analysis, uncheck the box to the left of the action in the Data Pipeline.
- Deleting Actions: To permanently remove the action from the Data Pipeline, click the trash can icon in the lower left.
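To make the effect of a few of these actions concrete, here is a minimal sketch in plain Python of what Split Column, Find and Replace, Bin Data and Replace Nulls do to a small table. This is an illustration only, not AutoML's implementation: the function names are hypothetical, the sample rows are made up, and the sketch assumes that the original column keeps the text before the delimiter and that surrounding spaces are trimmed.

```python
from statistics import median

def split_column(rows, source, new_column, delimiter):
    """Split `source` at the first delimiter; the remainder goes to `new_column`."""
    # Mirrors the rule that a new column cannot reuse an existing column name.
    if rows and new_column in rows[0]:
        raise ValueError(f"column {new_column!r} already exists")
    out = []
    for row in rows:
        left, _, right = row[source].partition(delimiter)
        # Assumption: trim whitespace left over around the delimiter.
        out.append({**row, source: left.strip(), new_column: right.strip()})
    return out

def find_and_replace(rows, column, old, new):
    """Replace exact matches of `old` in `column` with `new`."""
    return [{**r, column: new if r[column] == old else r[column]} for r in rows]

def bin_values(rows, column, values, bin_name):
    """Group several string values into a single named bin."""
    return [{**r, column: bin_name if r[column] in values else r[column]} for r in rows]

def replace_nulls_with_median(rows, column):
    """Fill nulls (None) in `column` with the median of the non-null values."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

rows = [
    {"Location": "Atlanta, GA", "Revenue": 10},
    {"Location": "Miami, FL", "Revenue": None},
    {"Location": "Provo, UT", "Revenue": 30},
    {"Location": "Boise, ID", "Revenue": 20},
]

# Actions run in order, top to bottom, like entries in the Data Prep section.
prepared = split_column(rows, "Location", "State", ",")
prepared = find_and_replace(prepared, "State", "UT", "Utah")
prepared = bin_values(prepared, "State", {"GA", "FL"}, "Southeast")
prepared = replace_nulls_with_median(prepared, "Revenue")
```

The order matters in the same way it does when you drag actions in the Data Pipeline: in this sketch, running `find_and_replace` on "State" before `split_column` would fail because the "State" column does not exist yet.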
Important Notes about the Data Pipeline Action Menu
- Actions that are not available for a column (based on the column type and/or data values) will be greyed out for that column.
- In Data View, the Action Menu will be limited to “Include in Analysis” for columns you have excluded from analysis; you must first include the column in analysis to access the full Action Menu.
- In Schema View, the Action Menu is not available for columns that are excluded from analysis. You must first include the column in analysis to access the Action Menu by checking the box on the left for the column.
- The Action Menu cannot be accessed for columns that AutoML has automatically excluded from analysis.
- Columns excluded from analysis do NOT appear in the list of Data Prep actions in Data Pipeline; you can see which columns are excluded in both Schema View and Data View by observing which columns are greyed out.
- When you apply an action that requires processing time, the “Analyze” button in the upper right will be greyed out until the processing is complete; this prevents AutoML from starting an analysis before all Data Prep actions have been fully processed. If the “Analyze” button remains greyed out for a long period of time, you may have two or more Data Prep actions that are conflicting; inspect your Data Prep actions carefully in this case.
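One kind of conflict is easy to reason about: filters that, taken together, leave no rows to analyze, such as State not equal to ‘UT’ combined with State equal to ‘UT’. The sketch below shows why that pair empties the dataset. It is a hypothetical illustration in plain Python, not how AutoML evaluates filters; the `apply_filters` helper, the operator names and the sample rows are all assumptions.

```python
import operator

# Hypothetical operator names for this sketch only.
OPS = {"equal to": operator.eq, "not equal to": operator.ne}

def apply_filters(rows, filters):
    """Apply each (column, operator, value) filter in order and return the surviving rows."""
    for column, op_name, value in filters:
        keep = OPS[op_name]
        rows = [row for row in rows if keep(row[column], value)]
    return rows

rows = [{"State": "UT"}, {"State": "GA"}, {"State": "FL"}]

compatible = [("State", "not equal to", "UT"), ("State", "not equal to", "GA")]
conflicting = [("State", "not equal to", "UT"), ("State", "equal to", "UT")]

remaining_ok = apply_filters(rows, compatible)
remaining_conflict = apply_filters(rows, conflicting)

# With no rows left there is nothing to train on, which corresponds to
# the "Analyze" button being greyed out.
can_analyze = len(remaining_conflict) > 0
```

Checking a small sample of your data this way, one filter at a time, is a quick way to find which pair of filters removes every row.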
Important Notes about training and prediction datasets
- Your prediction input dataset should have the same columns (and respective column types) as your training dataset as it exists BEFORE any Data Pipeline steps are applied. For example, if your training dataset has a column called "Territory" that you rename to "State" in a Data Pipeline step, your prediction input dataset should have a column called "Territory" (NOT "State") as AutoML will apply the same Data Pipeline steps to your prediction dataset that were applied to your training dataset.
- When using the Prediction API to generate predictions, the Prediction API input data records should have the same columns (and respective column types) as your training dataset as it exists AFTER any Data Pipeline steps are applied. For example, if your training dataset has a column called "Territory" that you rename to "State" in a Data Pipeline step, your Prediction API input data records should have a column called "State" (NOT "Territory") as AutoML does not apply Data Pipeline steps to your Prediction API input data records.
- Changes you make to the data schema in Data Pipeline steps will be reflected in your prediction dataset output. For example, if you include a Data Pipeline step that renames the "Territory" column to "State", then your prediction dataset output will include a column called "State" instead of "Territory".
- Your prediction dataset output WILL include ALL columns that were included in the prediction dataset input, regardless of exclusion from analysis in the Data Pipeline. For example, if the "Account ID" column is excluded from analysis due to high cardinality, AutoML will ignore it for the purposes of modeling but it will be included in the prediction dataset output.
- Your prediction dataset output will NOT include any rows of data that you have filtered out in Data Pipeline steps. For example, if you apply a filter to the "State" column to exclude any rows that have a value of "Utah", your prediction dataset output will NOT include any rows where State=Utah.
- Occasionally you may encounter a "Column type mismatch" error when applying a prediction dataset, even though the structures of your training and apply datasets are identical.
- This usually happens when one or more INT columns in your apply dataset contain null values. Preprocessing converts these null values to "NaN", which inadvertently converts those INT columns to FLOAT and results in a column type mismatch when applying your prediction dataset.
- To avoid this problem, you can create Data Pipeline "Replace Nulls" steps on your training dataset to apply MEAN, MODE or 0.0 values to null values in non-target INT columns; the steps will work even if you don't have nulls in your training dataset. The Data Pipeline steps will be applied automatically to your apply dataset when you create predictions.
- For more information, consult your friendly neighborhood Customer Success Manager or Customer Support.
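Two of the notes above lend themselves to a short illustration: Prediction API records must already match the training schema as it exists AFTER the Data Pipeline steps, and a single null turned into NaN is enough to change an INT column to FLOAT. The following is a minimal sketch in plain Python under stated assumptions. `to_api_record` and `column_type` are hypothetical helpers, not part of the Prediction API; the "Employees" column and the choice to fill its nulls with 0 are made up for the example; and the type check only mimics the general rule that NaN exists solely as a floating-point value.

```python
def to_api_record(raw, renames, null_fill):
    """Mirror Data Pipeline steps client-side before calling the Prediction API.

    raw:       record keyed by the ORIGINAL training column names
    renames:   {old_name: new_name} for Rename Column steps
    null_fill: {old_name: value} for Replace Nulls steps
    """
    record = {}
    for name, value in raw.items():
        if value is None and name in null_fill:
            value = null_fill[name]
        record[renames.get(name, name)] = value
    return record

def column_type(values):
    """Report INT only if every value is an integer; NaN is a float, so it forces FLOAT."""
    return "INT" if all(isinstance(v, int) for v in values) else "FLOAT"

# The training data renamed "Territory" to "State", so the API record must say "State".
raw = {"Territory": "UT", "Employees": None}
record = to_api_record(raw, renames={"Territory": "State"}, null_fill={"Employees": 0})

clean_column = [3, 0, 7]               # null replaced with 0: stays INT
nan_column = [3, float("nan"), 7]      # null converted to NaN: becomes FLOAT
clean_type = column_type(clean_column)
nan_type = column_type(nan_column)
```

For a prediction dataset applied inside AutoML you would not do any of this yourself, since the Data Pipeline steps are applied for you; the helper is only relevant when you build Prediction API records by hand.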