When looking at the types of data that are being prepared for machine learning, it is a critical skill to be able to discern what each variable is actually measuring and how to properly assign a representation for that variable. This section will cover some basic concepts through a few examples of how to properly assess the data type and the actual meaning of that point of data in the real world and should give you the understanding of how to mathematically represent things that don’t have a measurable difference like the colors red and blue.
Most of the examples in this section will be based on the following dataset that has already been aggregated to answer the distinct question: “Will a user repurchase any product within thirty days of the first purchase date?”
PersonReorder30DaysToKraken |
|
COLUMN |
TYPE |
FirstPurchaseDate |
DATE |
PersonID |
STRING |
Reorder30DaysAfterP1 |
BOOLEAN |
FirstOrderShippingState |
STRING |
BillingName |
STRING |
MinutesOnWebBeforeP1 |
DOUBLE |
FirstProductInCartOrdered |
STRING |
MarketingSource |
STRING |
SourceCostPerAcquisition |
STRING |
InitialOrderValue |
DOUBLE |
DaysToConvertINT |
INT |
DaysToConvertDOUBLE |
DOUBLE |
DATA-SCHEMA-1
Ultimately, the data contains two types of data: categorical and numerical. In general, machine learning would consider anything that is a FLOAT or DOUBLE data type to be numerical, and anything that is a STRING to be categorical. That, however, may not always be the correct interpretation. Consider the columns and values in DATA-PREVIEW-1:
PersonID |
MarketingSource |
SourceCostPerAcquisition |
Person_1 |
|
High |
Person_2 |
|
High |
Person_3 |
|
High |
Person_4 |
Google Paid Search |
Medium |
Person_5 |
Google Paid Search |
Medium |
Person_6 |
Organic Search |
Low |
Person_7 |
Organic Search |
Low |
Person_8 |
Customer Referral |
Low |
Person_9 |
Customer Referral |
Low |
Person_10 |
|
High |
DATA-PREVIEW-1
It is clear that there is not a measurable difference between the different values for `MarketingSource`. It is not possible to solve for x when x = Facebook - Google Paid Search. There are things that can be understood to be different between the two sources. In this example, one of the differences is that Facebook ads are display ads and Google Paid Search is search advertising, but that’s not something that can easily be communicated mathematically without additionally transforming the data. The fact that `MarketingSource` does not have a numerical or scalar relationship, is why it should be a string and considered a categorical variable. On the other hand, the values of `SourceCostPerAcquisition` have a more measurable difference between them. It’s not as exact as having the average cost per source acquisition in dollars, but it can distinctly be said that (High - Low) > (Medium - Low) or in other words, there is a distinct order of magnitude between the values even though that order of magnitude may not be perfectly decipherable. In a situation like this, it could easily be debated whether “SourceCostPerAcquisition` should be considered categorical or numerical. A recommended practice in this and similar situations is to give numerical mapping to the different values as one column, converting it to a DOUBLE so that Low = 1.0, Medium = 2.0, and High = 3.0 , and keeping the original STRING column. The data can now be ran through Kraken in three combinations, once with only the STRING or categorical representation, once with only the DOUBLE or numerical representation, and once with both.
The results of this would generally be expected to be minuscule, but if it is a very important feature, it is possible to have a high impact. In the case that there is a very noticeable difference in model scoring, it would be best to use the best combination; otherwise, it’s not going to be significant which set of the columns are chosen.
Numerical values should only be used if there is a measurable difference of what it is describing. Things like money, cost, revenue, profit, counts, sums, and averages are generally all very measurable and great examples of numerical values. There are times, however, that the dataset might have a numerical value that is not actually indicative of anything because it is a dictionary or incremental value like an ID. In DATA-PREVIEW-1, if the values of `PersonID` did not start with ‘Person_’ it might be mistaken for a numerical value that is not an accurate representation of the difference between those people. Even if the database did automatically increment so that Person_1 purchased before Person_2 and Person_2 purchased before Person_3, the best practice is to use ID as a STRING value. In some instances, this means that it is requisite to evaluate numerical values to determine if they should be converted to categorical values, or possibly even dropped from the analysis.
Lastly, the DATE types. DATE field types, in this context, are categorical variables. Although DATES and TIMESTAMPS can be subtracted from one another, there are several components to a date and it leaves the question of which of those components should be considered and how. Should it be the year? Month? Day of the month? Day of the Year? Hour of the day? Or should it be a time difference in minutes? There are so many components and ways to compare to a date value that to get the most predictive power it should be parsed into at least some of those components distinctly. Some of those components could be categorical and some of those components could be numerical. For example, a date can be parsed to the day of the week. This could be represented as a numerical value, or a string. Like so:
Day of Week Number |
Day of Week String |
1 |
Sunday |
2 |
Monday |
3 |
Tuesday |
4 |
Wednesday |
5 |
Thursday |
6 |
Friday |
7 |
Saturday |
There are a couple of problems with assessing the day of the week as a numerical value. First, when is the start of the week? Is it Monday? Or is is Sunday? Two, there isn’t necessarily going to be a trend that can be deciphered by knowing how far into the week the date was, even if the day of the week is actually important because it can be hidden in a cyclical trend. In some instances this may be very telling like in CHART-1(Monday first) but in CHART-2(Sunday First), which is the same data as CHART-1, the significance of the data is much lower. CHART-3 illustrates another problem with using that parsed component of a date as a numerical value; there may be a significant pattern but if it isn’t linear it will be missed. Because of this, and the cyclical trends that are often present in date components, whether daily seasonality, quarterly seasonality, or any other interval of seasonality, it is recommended to use these components as categorical values and not as numerical values.
CHART-1
CHART-2 (CHART-1 Shifted by one day)
CHART-3
Earlier in this module, it was stated that an incremental ID shouldn’t be used in an analysis. Date differences are a better substitute for this. The first reason that date differences are a better measure is because if you are using an incremental ID it is only descriptive of the order the person came in and can easily be skewed by the change in volume (customers per day in many examples), whereas a date difference is agnostic of that velocity. This date difference could still be one of several measures (seconds, hours, days, months, quarters), pick the one that is the most relevant to the granularity of the data that you have. For example, when looking at monthly data there isn’t going to be any real value in a date difference of days; in that case, a date difference of months would be more relevant and should be selected over a date difference value of days.