It all starts with the data – garbage in, garbage out. Some factors to consider that may have an impact on model quality:
- If data is unclean or unreliable, consider removing it from the set. "Unclean or unreliable" could mean that a majority of a column's values are null, that one value dominates a column (e.g. a column whose values are 'red', 'green', and 'blue', where 90% of the values are 'red'), or that a column's values are almost entirely unique.
- If the nature of the data you are gathering has experienced a significant change (for example, a major policy change may mean the previous data no longer resembles new data), consider excluding the data from before that change.
- A larger volume of data tends to produce more reliable models, so any additional relevant data points will help, whether those are new observations gathered as time passes, or historical ones gathered from a previously untapped source.
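The column-level checks above can be sketched as a quick screen over a pandas DataFrame. The function name and thresholds here are illustrative, not a standard API; tune the cutoffs for your own dataset.

```python
import pandas as pd

def flag_suspect_columns(df, null_thresh=0.5, dominance_thresh=0.9, unique_thresh=0.9):
    """Flag columns matching the data-quality heuristics above.

    - mostly_null: more than null_thresh of the values are missing
    - dominant_value: a single value accounts for more than dominance_thresh of rows
    - highly_unique: the ratio of distinct values to rows exceeds unique_thresh
    (Thresholds are illustrative assumptions; adjust them for your data.)
    """
    flags = {}
    n = len(df)
    for col in df.columns:
        reasons = []
        if df[col].isna().mean() > null_thresh:
            reasons.append("mostly_null")
        non_null = df[col].dropna()
        if len(non_null) > 0:
            if non_null.value_counts(normalize=True).iloc[0] > dominance_thresh:
                reasons.append("dominant_value")
            if non_null.nunique() / n > unique_thresh:
                reasons.append("highly_unique")
        if reasons:
            flags[col] = reasons
    return flags
```

Columns the function flags are candidates for a closer look, not automatic removal; a mostly-null column may still carry signal in whether the value is present at all.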
If your model still isn’t scoring well, it may also be because the metrics that truly have a relationship with the predicted Metric are not yet captured in the dataset. It could be time to brainstorm what other things might have an effect on the predicted Metric, and see if that data can be gathered and included in the dataset. Remember – the predictive algorithms can only identify patterns that are there to be found!
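One cheap way to screen brainstormed columns is to rank them by their correlation with the predicted Metric. This is a rough sketch, not part of any particular tool: plain linear correlation misses nonlinear relationships, so treat a low score as "worth questioning," not "worthless."

```python
import pandas as pd

def rank_candidate_features(df, target):
    """Rank numeric columns by absolute correlation with the target column.

    A quick screen for newly gathered data: a near-zero correlation suggests
    the column adds little linear signal for predicting the target, while a
    higher one is worth keeping for the model to consider.
    """
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False)
```

For example, calling `rank_candidate_features(df, "sales")` on a DataFrame of candidate columns puts the columns most linearly related to `sales` at the top of the result.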