If you modify the source of data you have already loaded, "Refresh Dataset" makes it easy to use that data immediately without having to create a new data provider from scratch.
If you have data loaded and you later make changes at the source, such as adding a column to your Snowflake table, changing a column type in your SQL Server view, or adding rows of data to your CSV file, the "Refresh Dataset" option lets you ensure you are using the latest data for your analyses and predictions. The method for refreshing the data depends on the source.
Read this first: Important considerations when using the "Refresh Dataset" option
- When working with existing analyses and predictions, if you have made ANY changes to the target column in your source data since the model was trained, you should create a new analysis instead of refreshing your dataset. Any edit to a target column in your source data, no matter how minor, from a simple name change to data value changes that could change the type of analysis you are running (e.g. from binary to regression or multiclass), will cause your model to fail.
- When working with existing analyses and predictions, consider carefully the edits you make to feature columns in your source data before you use the "Refresh Dataset" option. If your edits change the schema or the data types of any feature columns, you could break your existing analyses (for example, pipeline steps might fail if they refer to columns or column types that no longer exist) as well as your existing predictions (predictions will fail if the apply dataset schema no longer matches the training dataset schema). If that happens, an error message is shown, with details on the specific error where possible.
- It is also important to consider that refreshing a dataset will affect ALL analyses and/or predictions that use that dataset. If your dataset is used by multiple models and you want to make a dataset change to only one model, you should use a different dataset instead of refreshing the existing one. This is especially true if you have made any schema changes to feature columns - as well as ANY changes to your target column - in your source dataset after you created one or more analyses with that dataset.
- Finally, it is worth noting that the "Refresh Dataset" option is for on-demand use when you want to immediately use the latest version of your source dataset. This does not apply to - and is not needed for - nightly scheduled predictions (when you have prediction syncing turned on). With prediction syncing enabled, the defined data provider will always be queried when generating nightly scheduled predictions; in other words, the apply dataset is automatically "refreshed" when scheduled predictions are run.
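The considerations above can be turned into a quick check that you run yourself before refreshing. The following is a minimal sketch, not part of AutoML: the function name `find_refresh_risks` and the idea of describing a schema as a mapping from column name to type name are assumptions made for illustration. It flags target column changes and feature columns that were removed or changed type. It does not detect changes to target data values, and it does not flag newly added columns.

```python
from typing import Dict, List


def find_refresh_risks(
    trained: Dict[str, str], current: Dict[str, str], target: str
) -> List[str]:
    """Compare the schema a model was trained on with the current source schema.

    Both schemas are hypothetical {column name: type name} mappings that you
    build yourself; AutoML does not provide them in this form.
    """
    risks: List[str] = []

    # Any change to the target column calls for a new analysis, not a refresh.
    if target not in current:
        risks.append(f"target '{target}' is missing or renamed")
    elif current[target] != trained.get(target):
        risks.append(f"target '{target}' changed type")

    # Feature columns that disappeared or changed type can break pipeline
    # steps and cause predictions to fail.
    for column, col_type in trained.items():
        if column == target:
            continue
        if column not in current:
            risks.append(f"feature '{column}' no longer exists")
        elif current[column] != col_type:
            risks.append(
                f"feature '{column}' changed type from {col_type} to {current[column]}"
            )
    return risks


if __name__ == "__main__":
    trained_schema = {"churned": "boolean", "age": "integer", "plan": "string"}
    current_schema = {"churned": "boolean", "age": "float", "region": "string"}
    for risk in find_refresh_risks(trained_schema, current_schema, "churned"):
        print(risk)
```

An empty result only means that this particular check found nothing; it is not a guarantee that the refresh is safe.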
Refreshing datasets for all data providers EXCEPT for CSV files:
When training a new model or refining an existing model with data you have already loaded into AutoML, the "Refresh Dataset" option is available by clicking the gear icon next to the Analyze button and then clicking "Refresh Dataset". This instructs AutoML to clear the cache of existing data and re-query the source using the definition you provided when you created the data provider.
When applying data to create predictions with data you have already loaded into AutoML, and when applying a different dataset to existing predictions (using the Deploy option from the Analysis overview screen), the "Refresh Dataset" option is available by clicking the gear icon next to the Predict button and then clicking "Refresh Dataset". As above, this instructs AutoML to clear the cache of existing data and re-query the source using the definition you provided when you created the data provider.
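The "clear the cache and re-query" behavior described above can be pictured with a small sketch. This is a conceptual illustration only, not AutoML's implementation: the `CachedDataset` class and its `query` callable are hypothetical names standing in for a data provider definition.

```python
from typing import Callable, List, Optional

Rows = List[dict]


class CachedDataset:
    """Hypothetical stand-in for a dataset backed by a data provider."""

    def __init__(self, query: Callable[[], Rows]) -> None:
        self._query = query  # the definition given when the provider was created
        self._cache: Optional[Rows] = None

    def rows(self) -> Rows:
        # Normal use reads from the cache and only queries the source once.
        if self._cache is None:
            self._cache = self._query()
        return self._cache

    def refresh(self) -> Rows:
        # A refresh discards the cached copy and runs the same query again.
        self._cache = None
        return self.rows()


if __name__ == "__main__":
    source = [{"id": 1}]
    dataset = CachedDataset(lambda: list(source))
    dataset.rows()              # loads and caches the single row
    source.append({"id": 2})    # the source changes after loading
    print(len(dataset.rows()))     # still the cached copy
    print(len(dataset.refresh()))  # re-queried, includes the new row
```

The point of the sketch is that the query definition stays the same; only the cached result is replaced, which is why every analysis and prediction that uses the dataset sees the refreshed data.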
Refreshing datasets for CSV file data providers:
Unlike other AutoML data providers, once a CSV is loaded and cached, the source is no longer tracked. From AutoML's perspective, there is no longer a source that can be refreshed without reloading the data from a file. Consequently, the process for refreshing CSV data differs from the process for other data providers.
When working with models and predictions created from CSV data you have already loaded into AutoML, it is not currently possible to refresh the CSV data UNLESS you are training and applying from the same dataset AND you have already created predictions.
To update the predictions dataset, click on Settings from the Predictions overview.
Click the "Update predictions" button. You will be prompted to select the CSV file you want to use to replace the existing data cached in AutoML. After you choose the file, your predictions will be updated.
If you want to use the refreshed data to retrain your existing model (assuming the refreshed file contains training data and was originally used to train the model), navigate back to the Analysis Overview screen, then click the "Refine" button.
As with other data providers, the "Refresh Dataset" option is available by clicking the gear icon next to the Analyze button and then clicking "Refresh Dataset". This instructs AutoML to use the refreshed CSV data you loaded in the previous steps.