Selecting Data for Predictive Lead Scoring – a Case Study
This is the first of a series of posts where I step through a worked example of predictive lead scoring. The process will involve preparing the data, then building and comparing machine learning models. Along the way I will establish some general rules that apply to predictive lead scoring.
The data set that I will be using can be downloaded here. You can see and edit the Displayr document containing my analysis in this post here. Check out this post for a more general overview of what data you need for predictive lead scoring.
Exploring the data
As my starting point, I assume that the data is a collation of all available and relevant information. The first stage of analyzing a new data set should always be to understand the "shape" of the data. Relevant questions and their answers are:
- How many cases (rows) are there? 9240
- How many variables (columns) are there? 122
- What is the target outcome variable that we want to predict and its type? "Lead Stage", a categorical variable with breakdown as follows:
- What does a sample of the data look like? See below for the first 5 rows (note that you need to scroll right to see all the columns)
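These four checks can be sketched in pandas. The tiny data frame below is an illustrative stand-in for the real data set; the column names follow the post but the values are made up:

```python
import pandas as pd

# Illustrative stand-in for the real leads data set (values are made up)
df = pd.DataFrame({
    "Lead Stage": ["Closed", "Open", "Closed", "Lost"],
    "Lead Source": ["Google", "Referral", "Google", "Direct"],
})

n_rows, n_cols = df.shape                    # how many cases and variables?
breakdown = df["Lead Stage"].value_counts()  # breakdown of the target variable
sample = df.head()                           # look at the first few rows

print(n_rows, n_cols)
print(breakdown)
print(sample)
```

The same three lines against the full data set would report 9240 rows, 122 columns, and the "Lead Stage" breakdown shown above.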
What did we learn from these four simple questions?
- We have enough rows (at least a few hundred is a good start for building a model).
- Many models can handle 121 predictors, but it is unlikely that all are useful. It is generally more insightful to start with a simple model with fewer predictors and build back up.
- We are interested in the "Closed" category of the outcome. The distinction between some of the other categories does not seem to be relevant.
- There are many predictors with missing data. Actually looking at some data is an important and often overlooked step that allows us to identify problems early.
With the points above in mind, we will remove variables with a lot of missing data. This may throw away some useful information, but we can revisit that after building a solid model. Also removed are variables with no expected predictive power, such as "Prospect ID" and "Lead Number". These variables are unique to each lead, so provide no information that can be generalized to learn about other cases. Along the way I have also fixed the type of some variables by ensuring that dates and categories are recognized as such, and not as text.
This leaves us with a more manageable 19 predictor variables.
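The cleanup steps above can be sketched as follows. The 50% missing-data threshold, the column names, and the values are assumptions for illustration; only "Prospect ID" and "Lead Number" are named in the post:

```python
import pandas as pd

# Illustrative frame; "Lead Created" and "Sparse Field" are hypothetical columns
df = pd.DataFrame({
    "Prospect ID": ["a1", "b2", "c3"],
    "Lead Number": [601, 602, 603],
    "Lead Created": ["2020-01-05", "2020-02-10", "2020-03-15"],
    "Lead Stage": ["Closed", "Open", "Lost"],
    "Sparse Field": [None, None, "x"],
})

# 1. Remove variables where most values are missing (threshold is an assumption)
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Remove identifiers that cannot generalize to other cases
df = df.drop(columns=["Prospect ID", "Lead Number"])

# 3. Fix types: recognize dates and categories rather than leaving them as text
df["Lead Created"] = pd.to_datetime(df["Lead Created"])
df["Lead Stage"] = df["Lead Stage"].astype("category")

print(df.dtypes)
```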
The next stage in feature selection is to check for redundant variables that are collinear. Essentially this means that the variables contain the same information. At a minimum such variables are unnecessary, but they can also cause problems with some models, particularly when describing how important each variable is for prediction. To do this, look at the correlation matrix below (excluding date variables).
Note that I have split each categorical variable into indicator variables for each category (minus one "constant" reference). This means there are a lot of cells and the labels are a little difficult to read. However, hovering on the darkest blue cells reveals very high correlations between "Last Notable Activity" and "Last Activity", and between "Lead Source" and "Lead Origin".
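A minimal sketch of this check, assuming pandas: each categorical variable is split into indicator columns with one reference category dropped, and the correlation matrix is computed on the result. The category values here are made up for illustration:

```python
import pandas as pd

# Made-up values: the two "Activity" variables deliberately duplicate each other
df = pd.DataFrame({
    "Last Activity": ["Email Opened", "SMS Sent", "Email Opened", "SMS Sent"],
    "Last Notable Activity": ["Email Opened", "SMS Sent", "Email Opened", "SMS Sent"],
    "Lead Origin": ["API", "Form", "Form", "API"],
})

# Split each categorical variable into indicator columns,
# dropping one reference category per variable
dummies = pd.get_dummies(df, drop_first=True, dtype=float)

# Correlation matrix of the indicator variables
corr = dummies.corr()
print(corr.round(2))
```

Duplicated variables show up as correlations at or near 1.0, which is what the darkest cells in the matrix above indicate.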
Taking a closer look at these variables confirms what their names imply: they contain very similar information. Thus, we'll remove "Last Notable Activity", "Last Notable Activity Date" and "Lead Source", leaving us with 16 predictors for our models, as per the table below.
Note that I have included "Lead Score" and "Engagement Score" in our predictors. This makes an important point for predictive lead scoring: it can and should use traditional lead scoring as an input where possible. Although traditional lead scores may be flawed, they contain significant information. In that sense predictive scoring is an enhancement of, and not a replacement for, traditional scoring.
As a final check for this stage, we can see that only 1.5% of cases now have missing data. This is small enough that we can just remove those cases when building models.
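This final check can be sketched as below; the two score columns and their values are illustrative:

```python
import pandas as pd

# Illustrative frame with a couple of incomplete cases
df = pd.DataFrame({
    "Lead Score": [80, None, 65, 90],
    "Engagement Score": [0.7, 0.4, None, 0.9],
})

# Share of cases with at least one missing value
missing_share = df.isna().any(axis=1).mean()
print(f"{missing_share:.1%} of cases have missing data")

# Small enough share: simply remove incomplete cases before modelling
df = df.dropna()
```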
Summary and next steps
The main points identified so far are:
- The first stage of analyzing a new data set should always be to understand the "shape" of the data.
- Remove variables that have significant missing data, that have no predictive power, or that are collinear.
- Incorporate existing traditional lead scores where possible.
In the next post I will continue with the data preparation by transforming the variables to be more relevant.