As mentioned earlier, numerical and categorical values differ: a categorical value has no clearly measurable relationship with the other values in its column. What has not yet been addressed is how a categorical column is converted into a numeric representation that math can operate on. There are several ways to do this, but the most common is called one hot encoding.
One hot encoding pivots the categorical column into n columns, where n is the number of unique values in the column. For each row, it assigns a one to the column that matches the row's value and a zero to every other generated column. CATEGORICAL-ENCODING-1 shows how `MarketingSource` from DATA-PREVIEW-1 would be one hot encoded; a blank `MarketingSource` is treated as its own value, so n is 4.
| PersonID | MarketingSource | (blank) | Google Paid Search | Organic Search | Customer Referral |
|---|---|---|---|---|---|
| Person_1 | | 1 | 0 | 0 | 0 |
| Person_2 | | 1 | 0 | 0 | 0 |
| Person_3 | | 1 | 0 | 0 | 0 |
| Person_4 | Google Paid Search | 0 | 1 | 0 | 0 |
| Person_5 | Google Paid Search | 0 | 1 | 0 | 0 |
| Person_6 | Organic Search | 0 | 0 | 1 | 0 |
| Person_7 | Organic Search | 0 | 0 | 1 | 0 |
| Person_8 | Customer Referral | 0 | 0 | 0 | 1 |
| Person_9 | Customer Referral | 0 | 0 | 0 | 1 |
| Person_10 | | 1 | 0 | 0 | 0 |
CATEGORICAL-ENCODING-1
Categorical encoding allows the math to evaluate each unique value independently of the others, unlike a numerical value, which is evaluated relative to the other values in the column, unique or not.
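The encoding in CATEGORICAL-ENCODING-1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `one_hot_encode` and the list `marketing_source` are hypothetical names chosen for this example, and the empty string stands in for a blank `MarketingSource`. The column order follows the order in which values first appear, which is an assumption of this sketch.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 flag per category for each value."""
    # Unique values in order of first appearance; each becomes one column.
    categories = list(dict.fromkeys(values))
    # For every row, put a 1 in the matching column and a 0 everywhere else.
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# MarketingSource values for Person_1 through Person_10, taken from
# CATEGORICAL-ENCODING-1. "" represents a blank source (an assumption
# about how the blank is stored).
marketing_source = [
    "", "", "",
    "Google Paid Search", "Google Paid Search",
    "Organic Search", "Organic Search",
    "Customer Referral", "Customer Referral",
    "",
]

categories, encoded = one_hot_encode(marketing_source)

print(categories)
for person_number, flags in enumerate(encoded, start=1):
    print(f"Person_{person_number}", flags)
```

Every row contains exactly one 1, so no row implies that one source is larger or smaller than another. In practice a library routine such as `pandas.get_dummies` is typically used for this step instead of hand-written code.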