# Preparing Data for Predictive Lead Scoring – a Case Study

This is the second in a series of posts working through an example of predictive lead scoring. In this post I focus on getting the data ready for predictive machine learning techniques.

Please see "Selecting Data for Predictive Lead Scoring" for the first post in the series which covers initial data exploration and identifying useful features. If you want to view and edit the Displayr document containing my analysis, follow the link here.

## Feature engineering - categorical variables

In the first table of our exploration of the data, duplicated below, we saw that the "Lead Stage" outcome variable has 9 categories.

Learning to decide between many categories is a harder task than deciding between a few. Since we're not actually interested in the distinction between some of those categories, the easiest thing to do is to merge them. The breakdown after merging is shown below.

I have used 3 categories, preferring to maintain a distinction between warm leads which are "Interested" and "Qualified" and the cold leads. Other approaches are possible such as:

- Separating out more categories. "Unreachable" could be a 4th category, since it is 30% of the population and potentially better than junk.
- Combining everything that is not closed into a single category. This makes the problem simpler, but we lose the distinction between the other lead types. Note also that "Closed" is only 6% of the population. An imbalanced population like this may require special treatment, which I return to later.
- Converting the categories to a numeric scale. We could use 2 for "Closed", 1 for "Interested + Qualified" and 0 for everything else. This has the advantage that we have fine control over the numeric scale. Maybe "Closed" is significantly better and is worth 5, or "Not Called" should be better than "Junk" (0.5).
- Refining the numeric scale further by weighting the values by their actual or expected profit, so that a closed lead resulting in a big order is better than a closed lead for a small order.

Similarly, I check the tables of the other predictor variables. There are no other obvious transformations we need to make.
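As a rough illustration of the merging and the numeric-scale alternative, here is a minimal sketch in plain Python. The merged label names and the `merge_stage` helper are my own placeholders, not the names used in the Displayr document, and since only some of the 9 stage names appear in this post, any stage not listed as closed or warm is assumed to be cold.

```python
# Assumption: "Interested" and "Qualified" are the only warm stages;
# every stage other than these and "Closed" is treated as cold.
WARM_STAGES = {"Interested", "Qualified"}

def merge_stage(stage):
    """Collapse a raw lead stage into one of three merged categories."""
    if stage == "Closed":
        return "Closed"
    if stage in WARM_STAGES:
        return "Interested + Qualified"
    return "Other"

# Numeric alternative from the list above: 2, 1 and 0.
STAGE_SCORE = {"Closed": 2, "Interested + Qualified": 1, "Other": 0}

# Made-up sample of raw stages, for illustration only.
stages = ["Closed", "Junk", "Interested", "Unreachable", "Qualified", "Not Called"]
merged = [merge_stage(s) for s in stages]
scores = [STAGE_SCORE[m] for m in merged]

print(merged)
print(scores)
```

Changing the values in `STAGE_SCORE` is where the finer control over the numeric scale would come in, for example giving "Closed" a value of 5.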

## Feature engineering - dates

For date variables, we can consider various transformations depending on what the relevant information actually is. For example, if season might be useful, a date could be converted into a categorical variable of winter, spring, summer and autumn. Alternatively, you could split the date into some or all of 4 variables: year, month, day and weekday.
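A minimal sketch of both options in plain Python might look like the following. The `date_features` helper is hypothetical, and the month-to-season mapping assumes Northern Hemisphere meteorological seasons, which may not suit every data set.

```python
from datetime import date

# Assumption: Northern Hemisphere meteorological seasons, keyed by month.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def date_features(d):
    """Split a date into year, month, day, weekday and season."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": WEEKDAYS[d.weekday()],
        "season": SEASON_BY_MONTH[d.month],
    }

features = date_features(date(2016, 4, 1))
print(features)
```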

In this case it is reasonable to expect that the length of time since an event (such as last activity) is useful, so I convert each of the dates to the number of days since a reference date (April 1st 2016, the earliest date in the data). Histograms of the converted variables are shown below. Given the strong similarity between "Last Activity Days" and "Last Visit Days", I choose to remove the latter from our predictor variables.
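The conversion to days since the reference date can be sketched as below. The function name and the sample dates are made up for illustration; only the reference date comes from the data.

```python
from datetime import date

# Earliest date in the data, used as the reference point.
REFERENCE_DATE = date(2016, 4, 1)

def days_since_reference(d, reference=REFERENCE_DATE):
    """Number of whole days between the reference date and d."""
    return (d - reference).days

# Made-up sample of last-activity dates.
last_activity = [date(2016, 4, 1), date(2016, 4, 15), date(2016, 6, 30)]
last_activity_days = [days_since_reference(d) for d in last_activity]

print(last_activity_days)
```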

## Feature engineering - numeric variables

Next, I check histograms of the other numeric predictors to understand their distributions. "Lead Score" has an average of around 100 but a long tail of a few values up to 4000.

It would be useful to know how lead score was calculated. Since we don't have that background, let's try to reverse-engineer what the very high lead scores mean. We would like to find out whether they are errors or truly significant indicators of purchase. To do so, here is a scatter plot of lead score against the numeric transform of lead stage (0 = dead lead, 2 = deal closed).

Note that the very high lead scores do not correspond to closed deals, so they do not appear to be strong indicators of purchase. Let's remove any values above 1000 (there are only 4) and perform a log transformation of the remainder, to produce the more compact distribution below.

Next, I apply a similar log transformation to "Engagement Score", "Total Visits" and "Average Time Per Visit".
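The outlier removal and log transformation could be sketched roughly as follows. I have not stated the exact form of the transformation used, so the choice of `log(1 + x)` here is an assumption, made so that zero values stay defined. The `log_transform` helper and the sample values are hypothetical.

```python
import math

LEAD_SCORE_CUTOFF = 1000  # threshold applied to lead score above

def log_transform(values, cutoff=None):
    """Drop values above the cutoff (if given), then apply log(1 + x).

    Assumption: log1p is used so that zeros map to zero rather than
    failing; a plain log would need strictly positive inputs.
    """
    kept = [v for v in values if cutoff is None or v <= cutoff]
    return [math.log1p(v) for v in kept]

# Made-up lead scores with one extreme value in the long tail.
lead_scores = [0, 50, 100, 250, 4000]
transformed = log_transform(lead_scores, cutoff=LEAD_SCORE_CUTOFF)

print(transformed)
```

For the other numeric predictors the same function can be called without a cutoff, which leaves every value in place and only compresses the scale.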

## Splitting data for cross validation

The last stage before modelling is to split the data into a part to be used for training and a part for testing. This holdout split is a simple form of cross-validation and is important for detecting and guarding against overfitting. Overfitting is when a model learns the specifics of the training data so well that it is poor at generalizing to new (previously unseen) examples.

We will use a random 70% of the cases in our data for training, and the remaining 30% for testing.
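A random 70/30 split can be sketched in a few lines of plain Python. The `split_train_test` helper and the fixed seed are illustrative assumptions rather than the mechanism used in the Displayr document. Fixing the seed simply makes the split reproducible.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the cases in the data: 100 row identifiers.
rows = list(range(100))
train, test = split_train_test(rows)

print(len(train), len(test))
```

Every case ends up in exactly one of the two parts, so nothing the model is tested on has been seen during training.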

## Summary and next steps

The main points to take away from this post are:

- Combine small and irrelevant categories.
- Transform date and numeric variables to a relevant scale.
- Split the data into a training set and a test set.

**Stay tuned for the next post in which we build the models for predictive lead scoring!**