It all starts with the data: garbage in, garbage out. The following factors may also affect model quality:
- Dirty data: If some of the data is unclean or unreliable, consider removing it from the set. A column may be unclean or unreliable if a majority of its values are null, if a single value dominates it (e.g. a column with the values 'red', 'green', and 'blue' where 90% of the entries are 'red'), or if its values are highly unique.
- Sample size: A larger volume of data tends to produce more reliable models, so any additional relevant data points will help, whether those are new observations gathered as time passes, or historical ones gathered from a previously untapped source.
- Eliminate outliers: Consider removing observations with outlier values in the feature columns. You can also drop observations so that the model works with a smaller subset that has a tighter spread in the target column.
- Target range: If the distribution of your target data is too spread out relative to your sample size, it may be hard to find signal in your data.
- Pattern changes: If the nature of the data you are gathering has changed significantly (for example, a major policy change may mean that earlier data no longer resembles new data), you may need to create a new training dataset using only data gathered since the change occurred.
- Data grouping: Splitting the data into high-level groups defined by one or more columns, and building a separate model for each group, may improve your results.
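The dirty-data and outlier checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the threshold values (50% null, 90% dominant, 90% unique), and the 1.5 × IQR outlier rule are hypothetical choices made for the example, not settings prescribed by this guide, and a column is represented as a plain list with `None` standing in for null.

```python
from collections import Counter
from statistics import quantiles


def profile_column(values, null_share=0.5, dominant_share=0.9, unique_share=0.9):
    """Return warning flags for one column (None marks a null value).

    All three thresholds are illustrative assumptions, not fixed rules.
    """
    if not values:
        return ["empty"]
    non_null = [v for v in values if v is not None]
    flags = []
    # Flag columns where a majority of values are missing.
    if (len(values) - len(non_null)) / len(values) > null_share:
        flags.append("mostly_null")
    if non_null:
        counts = Counter(non_null)
        # Flag columns dominated by a single value.
        if counts.most_common(1)[0][1] / len(non_null) >= dominant_share:
            flags.append("dominant_value")
        # Flag columns where nearly every value is distinct.
        if len(non_null) > 1 and len(counts) / len(non_null) >= unique_share:
            flags.append("highly_unique")
    return flags


def drop_iqr_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles.

    The IQR rule is one common outlier heuristic, assumed here for illustration.
    """
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]


# Made-up example columns.
colors = ["red"] * 9 + ["blue"]        # 90% of entries are 'red'
row_ids = list(range(10))              # every value is distinct
sparse = [None] * 6 + [1, 1, 2, 2]     # 60% of entries are null
print(profile_column(colors))
print(profile_column(row_ids))
print(profile_column(sparse))

feature = [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]
print(drop_iqr_outliers(feature))      # the value 100 falls outside the bounds
```

A flagged column is a candidate for review rather than automatic removal; whether to drop it still depends on what the column means in your dataset.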
When evaluating Regression model scores, remember that your data does not have to follow a linear relationship. If linear regression produces poor results while other algorithms perform better, that does not necessarily mean your model is bad; it may simply mean that a linear relationship cannot model your data well enough.
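To make this concrete, here is a small, self-contained sketch that fits a straight line and a simple non-linear model to the same synthetic data and compares their R² scores on held-out points. The data (y = x²), the helper names, and the choice of a 2-nearest-neighbour regressor as the "other algorithm" are all assumptions made for illustration; they are not the algorithms or scoring method used by any particular product.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def fit_linear(xs, ys):
    """Ordinary least squares for one feature; returns a predict function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=2):
    """k-nearest-neighbour regression; returns a predict function."""
    pairs = list(zip(xs, ys))

    def predict(x):
        # Average the targets of the k training points closest to x.
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


# Synthetic data with a clearly non-linear (quadratic) relationship.
xs = [i / 4 for i in range(-20, 21)]
ys = [x ** 2 for x in xs]

# Hold out every second point for scoring.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

linear = fit_linear(train_x, train_y)
knn = fit_knn(train_x, train_y)

linear_r2 = r2_score(test_y, [linear(x) for x in test_x])
knn_r2 = r2_score(test_y, [knn(x) for x in test_x])
print(f"linear R^2: {linear_r2:.3f}")
print(f"k-NN R^2:   {knn_r2:.3f}")
```

On this data the straight line scores poorly while the nearest-neighbour model scores close to 1, even though both saw exactly the same observations. The low linear score reflects the shape of the relationship, not a problem with the data.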
If your model still isn’t scoring well, it may be because the metrics that truly have a relationship with the predicted Metric are not yet captured in the dataset. It could be time to brainstorm what else might affect the predicted Metric, and to see whether that data can be gathered and included in the dataset. Remember: the predictive algorithms can only identify patterns that are there to be found.