Another important practice in machine learning is standardization, one common form of feature scaling. Data often has very different ranges across its dimensions. For example, when trying to predict whether a homeowner will default on their mortgage, interest rate and home value have drastically different ranges and magnitudes. Standardizing each feature relative to its own distribution puts all of them on the same scale, which can improve both model accuracy and training speed. To picture this, imagine an archer given several targets placed at drastically varying distances that are unknown to the archer. The archer will do his best to hit all of the targets but will take his time between shots to visually judge the distance to each new target. Now imagine the same archer with all of the targets placed at the same distance; this time he can make judgements more quickly, and probably more accurately, because he doesn’t have to reevaluate the distance. Although it isn’t a perfect metaphor, it illustrates a few reasons why feature scaling is a basic practice when preparing data for machine learning.
A common approach to feature scaling is to calculate the mean and standard deviation of each column and then, for each row, express the value as the number of standard deviations it lies from that column's mean.
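Written as a formula, the rule looks roughly like the following. The index notation is an assumption made here for illustration: $x_{ij}$ is the value in row $i$ of column $j$, and $\mu_j$ and $\sigma_j$ are that column's mean and standard deviation.

```latex
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}
```

Because the numerator and denominator share the column's units, the result $x'_{ij}$ is unitless.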
To see this practice applied to actual data, look at the columns `InitialOrderValue` and `DaysToConvertDOUBLE` in DATA-PREVIEW-2, then at their means and standard deviations in DATA-SUMMARY-1, and finally at the feature-scaled data in DATA-TRANSFORMED-1:

| PersonID | InitialOrderValue | DaysToConvertDOUBLE |
| --- | --- | --- |
| Person_1 | $45.37 | 0.0 |
| Person_2 | $22.21 | 1.0 |
| Person_3 | $17.35 | 2.0 |
| Person_4 | $34.83 | 0.0 |
| Person_5 | $28.41 | 1.0 |
| Person_6 | $64.18 | 2.0 |
| Person_7 | $33.61 | 0.0 |
| Person_8 | $26.88 | 1.0 |
| Person_9 | $22.49 | 2.0 |
| Person_10 | $32.81 | 0.0 |

DATA-PREVIEW-2


|  | InitialOrderValue | DaysToConvertDOUBLE |
| --- | --- | --- |
| Mean | $32.81 | 0.9 |
| Standard Deviation | $13.58 | 0.88 |

DATA-SUMMARY-1


| PersonID | InitialOrderValue | DaysToConvertDOUBLE |
| --- | --- | --- |
| Person_1 | 0.925 | -1.028 |
| Person_2 | -0.781 | 0.114 |
| Person_3 | -1.139 | 1.256 |
| Person_4 | 0.148 | -1.028 |
| Person_5 | -0.324 | 0.114 |
| Person_6 | 2.310 | 1.256 |
| Person_7 | 0.059 | -1.028 |
| Person_8 | -0.437 | 0.114 |
| Person_9 | -0.760 | 1.256 |
| Person_10 | 0.000 | -1.028 |

DATA-TRANSFORMED-1

To walk through the calculation for one sample, consider `InitialOrderValue` for Person_1: the actual value is $45.37, the mean `InitialOrderValue` is $32.81, and the standard deviation is $13.58. The feature-scaled value is then x’ = ($45.37 - $32.81) / $13.58 = $12.56 / $13.58 = 0.925. Note that the division in the last step cancels the units: 0.925 is no longer in dollars but in unitless standard deviations from the mean, so both columns are now on the same scale. Look at BOX-PLOT-ORIGINAL and BOX-PLOT-TRANSFORMED to visualize the difference.
BOX-PLOT-ORIGINAL
BOX-PLOT-TRANSFORMED
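
The walk-through for Person_1 can be repeated for every row of both columns. The following is a minimal sketch using only Python's standard library, not necessarily how the tables above were produced. The function name `standardize` and the variable names are assumptions for illustration. It uses the sample standard deviation (n - 1 divisor) because that choice is consistent with the values in DATA-SUMMARY-1.

```python
from statistics import mean, stdev

# Column values copied from DATA-PREVIEW-2 (dollar signs dropped).
initial_order_value = [45.37, 22.21, 17.35, 34.83, 28.41,
                       64.18, 33.61, 26.88, 22.49, 32.81]
days_to_convert = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0]


def standardize(values):
    """Express each value as its distance from the column mean,
    measured in sample standard deviations (hypothetical helper)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, n - 1 divisor
    return [(v - mu) / sigma for v in values]


scaled_order_value = standardize(initial_order_value)
scaled_days = standardize(days_to_convert)

# Print one row per person, rounded to three decimals like DATA-TRANSFORMED-1.
for person, (order, days) in enumerate(zip(scaled_order_value, scaled_days), start=1):
    print(f"Person_{person}: {order:.3f} {days:.3f}")
```

After scaling, each column has a mean of 0 and a sample standard deviation of 1, which is what places the two columns on the same scale. Some libraries divide by n rather than n - 1 (scikit-learn's `StandardScaler` does, for example), so their output would differ slightly from the table on a dataset this small.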