Another important step in machine learning is standardization, a common form of feature scaling. The features in a dataset usually span very different ranges. For example, when predicting whether a homeowner will default on a mortgage, interest rate and home value differ drastically in range and magnitude. Standardizing each feature relative to its own distribution puts them all on a common scale, which can improve both model accuracy and training speed. To picture this, imagine an archer facing several targets placed at drastically different distances that the archer does not know in advance. The archer will do his best to hit every target but must pause before each shot to judge the distance to the new target. Now imagine the same archer with all of the targets placed at the same distance: he can aim more quickly, and probably more accurately, because he doesn't have to reevaluate the distance each time. The metaphor isn't perfect, but it illustrates a few reasons why feature scaling is a basic practice in machine learning.
A common approach is to calculate the mean and standard deviation of each column and then, for each row, express the value as the number of standard deviations it lies from that column's mean (often called a z-score).
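Written as a formula, this is the usual z-score. The symbols are notation chosen for this sketch rather than taken from the article: x is a raw value, x̄ is the column mean, s is the column standard deviation, n is the number of rows, and x′ is the scaled value. The figures in DATASUMMARY1 are consistent with the sample standard deviation (dividing by n − 1), so that form is assumed here.

```latex
x' = \frac{x - \bar{x}}{s},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```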
To see this applied to actual data, look at the columns `InitialOrderValue` and `DaysToConvertDOUBLE` in DATAPREVIEW2, then at their means and standard deviations in DATASUMMARY1, and finally at the feature-scaled data in DATATRANSFORMED1:

| PersonID | InitialOrderValue | DaysToConvertDOUBLE |
| --- | --- | --- |
| Person_1 | $45.37 | 0.0 |
| Person_2 | $22.21 | 1.0 |
| Person_3 | $17.35 | 2.0 |
| Person_4 | $34.83 | 0.0 |
| Person_5 | $28.41 | 1.0 |
| Person_6 | $64.18 | 2.0 |
| Person_7 | $33.61 | 0.0 |
| Person_8 | $26.88 | 1.0 |
| Person_9 | $22.49 | 2.0 |
| Person_10 | $32.81 | 0.0 |

DATAPREVIEW2

| | InitialOrderValue | DaysToConvertDOUBLE |
| --- | --- | --- |
| Mean | $32.81 | 0.9 |
| Standard Deviation | $13.58 | 0.88 |

DATASUMMARY1

| PersonID | InitialOrderValue | DaysToConvertDOUBLE |
| --- | --- | --- |
| Person_1 | 0.925 | −1.028 |
| Person_2 | −0.781 | 0.114 |
| Person_3 | −1.139 | 1.256 |
| Person_4 | 0.148 | −1.028 |
| Person_5 | −0.324 | 0.114 |
| Person_6 | 2.310 | 1.256 |
| Person_7 | 0.059 | −1.028 |
| Person_8 | −0.437 | 0.114 |
| Person_9 | −0.760 | 1.256 |
| Person_10 | 0.000 | −1.028 |

DATATRANSFORMED1
To walk through the calculation for one sample, consider `InitialOrderValue` for Person_1: the actual value is $45.37, the mean `InitialOrderValue` is $32.81, and the standard deviation is $13.58. The feature-scaled value is then x’ = ($45.37 − $32.81)/$13.58 = $12.56/$13.58 = 0.925. Note that the division in the last step cancels the units: 0.925 is no longer in dollars but in unitless standard deviations from the mean, so both columns are now on the same scale. A value below the mean comes out negative; for Person_2, ($22.21 − $32.81)/$13.58 = −0.781. Look at BOXPLOTORIGINAL and BOXPLOTTRANSFORMED to visualize the difference.
BOXPLOTORIGINAL
BOXPLOTTRANSFORMED
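The calculation can also be reproduced in a few lines of code. The following is a minimal sketch using only Python's standard library, not necessarily the tooling used to produce the tables in this article. The function name `standardize` and the variable names are hypothetical, and the sample standard deviation (`statistics.stdev`, which divides by n − 1) is assumed because it matches the $13.58 and 0.88 shown in DATASUMMARY1.

```python
from statistics import mean, stdev

# Values copied from the DATAPREVIEW2 table (dollar signs dropped).
initial_order_value = [45.37, 22.21, 17.35, 34.83, 28.41,
                       64.18, 33.61, 26.88, 22.49, 32.81]
days_to_convert = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0]


def standardize(values):
    """Express each value as a number of sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]


scaled_order_value = standardize(initial_order_value)
scaled_days = standardize(days_to_convert)

# Print one row per person, rounded to three decimals.
for person, (order_z, days_z) in enumerate(zip(scaled_order_value, scaled_days), start=1):
    print(f"Person_{person}: {order_z:.3f} {days_z:.3f}")
```

After scaling, each column has a mean of 0 and a sample standard deviation of 1, and values below the column mean come out negative. Some libraries divide by n rather than n − 1 when computing the standard deviation, which would give slightly different numbers on a dataset this small.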