Planning Ahead: What Data Do You Need for Predictive Lead Scoring?

by Matt Steele

A key benefit of predictive lead scoring over traditional lead scoring is that it brings objectivity to your data, because it uses machine learning to come up with its predictions. The flip side is that you need to put in “good” data to get a good model. Put “bad” data in and you’re not going to get a good model out of it. So understanding what constitutes “the right” data is important to implementing an effective predictive scoring system.

Make sure you check out "Pros and Cons of Predictive Lead Scoring" if you haven't already!

What I’m not going to do in this article is get into the nitty-gritty of the technical requirements for predictive lead scoring. In subsequent articles, we'll get into more concrete and detailed technical specifications for how exactly the data should be formatted. What I can say is that, if you’re going DIY with your predictive lead scoring (as you can with Displayr), you should aim to consolidate your data into a flat format. This is typically stored as a .csv file, with columns as your variables and rows as your cases. We have a specifications page that provides guidelines as to how the information should be stored in flat .csv format.
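To make the "flat format" idea concrete, here is a minimal sketch of what such a file looks like when read in Python. The column names and values are made up for illustration, and this is not the Displayr specification, just the general shape: one row per lead, one column per variable.

```python
import csv
import io

# Hypothetical flat file: one row per lead (case), one column per variable.
# Every column name and value here is invented for illustration.
FLAT_CSV = """lead_id,industry,employees,page_visits,won
1001,Retail,250,12,1
1002,Finance,40,3,0
1003,Retail,900,27,1
"""


def load_flat_file(text):
    """Parse flat CSV text into a list of dicts, one dict per case (row)."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


leads = load_flat_file(FLAT_CSV)
variables = list(leads[0].keys())

print(f"{len(leads)} cases, {len(variables)} variables")
print(variables)
```

In practice you would read from a file on disk rather than a string, but the structure is the same: every lead has a value (or a blank) for every variable.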

My aim in this article is to stimulate your thinking about the type of data you should be compiling. The purpose is to get you proactively thinking about what data you're going to be using in predictive lead scoring, rather than working with an existing pile of data. Or if you do have a pile of data, perhaps this article can help you distill it down into being "good" data as you embark on setting up predictive lead scoring for the first time.

Start small and then grow the model

In a previous article, we spoke about the minimum number of cases (rows in a flat file format). When it comes to the variables (columns), more isn't necessarily better. Consider the quotation below from an earlier post on machine learning that my colleague Jake wrote:

Machine learning is a problem of trade-offs. The classic issue is overfitting versus underfitting. Overfitting happens when a model memorizes its training data so well that it is learning noise on top of the signal. Underfitting is the opposite: the model is too simple to find the patterns in the data.

When it comes to planning what variables you want to use in your predictive lead scoring model, you want to avoid both overfitting and underfitting. There is no hard rule about the appropriate number of variables, but perhaps select 20 variables (give or take) to start with. Remember, the great thing about machine learning is that you can always incrementally incorporate new variables in the future. Start humbly, add more later.

Consider data points most relevant to your business priorities

We’ve spoken in previous articles about explicit, implicit and external data. All these data types are useful, but the implicit (i.e. behavioral) data is particularly potent in a predictive model. But even if you’ve got a rich trove of behavioral data… which behaviors to choose? You might have hundreds of variables to pick from. Ideally, you want to select the data that is most relevant. And by “relevant” I mean data that is going to lead to meaningful action (which in this case means having good discriminating power in delivering our predictive lead score). This requires some qualitative assessment of which variables to use as both the outcome and predictor variables, at least to kick off the model. It’s not about just “chucking it all in and seeing what happens”.

Start by being clear about what business problem(s) you’re looking to solve with predictive lead scoring. Is it just about closing a deal (won/lost)? Or is it about nurturing a lead (e.g. the time they spend engaging with your content or a trial of your product)? This helps you to determine the outcome variables of your model. Are your outcome variable(s) binary (won/lost) or something more continuous (like the revenue generated or the number of units sold)?
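As a rough illustration of the binary versus continuous distinction, the sketch below derives both kinds of outcome variable from the same records. The field names (`stage`, `revenue`) and the "Closed Won" label are hypothetical; your CRM will have its own.

```python
# Hypothetical closed opportunities exported from a CRM.
opportunities = [
    {"lead_id": 1001, "stage": "Closed Won", "revenue": 12000.0},
    {"lead_id": 1002, "stage": "Closed Lost", "revenue": 0.0},
    {"lead_id": 1003, "stage": "Closed Won", "revenue": 4500.0},
]


def binary_outcome(opportunity):
    """Return 1 if the deal was won, 0 otherwise (a classification target)."""
    return 1 if opportunity["stage"] == "Closed Won" else 0


def continuous_outcome(opportunity):
    """Return the revenue generated (a regression target)."""
    return opportunity["revenue"]


won_flags = [binary_outcome(o) for o in opportunities]
revenues = [continuous_outcome(o) for o in opportunities]

print(won_flags)
print(revenues)
```

Which one you pick depends on the business problem: a won/lost flag answers "will this lead close?", while revenue answers "how much is this lead worth?".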

Likewise, an analysis of your sales funnel can help uncover a good selection of predictor variables. Consider where in the sales funnel the sales teams are making key decisions (e.g. converting or disqualifying a lead, or moving them through the opportunity stages). Look at what information is available on your leads immediately before and after these steps. A good place for inspiration and feedback might be talking to the top sales reps.

Picking the right level of granularity

It can be tempting to include as many itty-bitty pieces of information in your model as possible, particularly if you're capturing information in a highly comprehensive and granular way. But a bit like the overfitting point above, it isn't always best to have lots of stuff, however specific it may be. You may be better off having geographic regions in your model rather than discrete postcodes. Or perhaps it's more powerful to aggregate data from different points in time together. For example, if you record each of a lead's visits to particular web pages, is it sufficient and appropriate just to work with the total number of times they've visited the website? Perhaps you could further aggregate into categories (low visits vs. high visits). But beware of over-aggregating too! Again, a bit like underfitting, it can lead to a loss of predictive accuracy.

So there is a trade-off in granularity too, and you want to strike a good balance. The guiding principle should be picking a level of granularity that is going to be practical, useful, and meaningful. If you are going to want to look at things at a day-by-day level down the track, then you'll need your time data aggregated no more coarsely than by day. Again, if things don't work out perfectly, you can adjust the granularity of predictor variables down the track.
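The aggregation choices described in this section can be sketched in a few lines. Everything here is an assumption for illustration: the event log layout, the postcode-to-region lookup, and especially the threshold of 3 visits for the "high" band, which you would want to choose based on your own data.

```python
from collections import Counter

# Hypothetical raw event log: one (lead_id, page) pair per page view.
page_views = [
    (1001, "/pricing"), (1001, "/pricing"), (1001, "/blog/post-a"),
    (1002, "/blog/post-a"),
    (1003, "/pricing"), (1003, "/demo"), (1003, "/demo"), (1003, "/blog/post-b"),
]

# Hypothetical lookup from a postcode to a coarser region label.
POSTCODE_TO_REGION = {"2000": "Region A", "2010": "Region A", "3000": "Region B"}


def total_visits(events):
    """Collapse page-level events into one visit count per lead."""
    return Counter(lead_id for lead_id, _page in events)


def visit_band(count, threshold=3):
    """Coarsen a visit count into two bands (the threshold is arbitrary)."""
    return "high" if count >= threshold else "low"


def region_for(postcode):
    """Map a postcode to its region, or 'Unknown' if it is not in the lookup."""
    return POSTCODE_TO_REGION.get(postcode, "Unknown")


visits = total_visits(page_views)
bands = {lead_id: visit_band(n) for lead_id, n in visits.items()}

print(dict(visits))
print(bands)
print(region_for("2010"))
```

Note that each step throws information away: the counts no longer say which pages were visited, and the bands no longer say how many visits there were. That is the trade-off in action.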

Work diligently towards getting clean data

For a machine learning algorithm, you need to have a clean data file. The two most essential aspects of cleaning are completeness and consistency.

Completeness. You don’t want to have missing data in your variables. Missing data can cause issues for the predictive model, leading to erroneous conclusions. A key source of missing data is unfilled fields in your CRM. So it’s important to have the sales team vigilantly filling in all the relevant data fields. If it looks like you are going to have variables with lots of missing data, then perhaps you should consider whether to include these variables in the first place. Another strategy is to think about having auto-fill fields in your databases (some CRM platforms offer this and data-cleaning services, albeit at a cost).
Consistency. You need to have your fields filled out or captured correctly. For example, if you’re capturing postcodes they need to be in the same format, and not mixed in with other information (for example, conflating postcode, state, and region in the same variable, or perhaps mixing hours and days in the one variable).
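Both checks are easy to automate before you build anything. The sketch below is one possible approach, not a prescribed one: it reports the share of blank values per field and lists values that don't match an expected format. The field names are hypothetical, and the four-digit postcode pattern is an assumption you would adjust for your own locale.

```python
import re

# Hypothetical CRM export; empty strings stand in for unfilled fields.
leads = [
    {"lead_id": "1001", "postcode": "2000", "job_title": "CTO"},
    {"lead_id": "1002", "postcode": "NSW 2010", "job_title": ""},
    {"lead_id": "1003", "postcode": "", "job_title": ""},
    {"lead_id": "1004", "postcode": "3000", "job_title": "Analyst"},
]


def missing_rate(rows, column):
    """Share of rows where the column is empty or whitespace only."""
    missing = sum(1 for row in rows if not row[column].strip())
    return missing / len(rows)


def inconsistent_values(rows, column, pattern):
    """Non-empty values that do not fully match the expected format."""
    regex = re.compile(pattern)
    return [
        row[column]
        for row in rows
        if row[column].strip() and not regex.fullmatch(row[column].strip())
    ]


# Completeness: how much of each field is blank?
rates = {col: missing_rate(leads, col) for col in ("postcode", "job_title")}

# Consistency: assumes postcodes should be exactly four digits.
bad_postcodes = inconsistent_values(leads, "postcode", r"\d{4}")

print(rates)
print(bad_postcodes)
```

A field with a high missing rate is a candidate for exclusion, and a list of inconsistent values tells you what needs fixing at the point of data entry.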

Thinking about transforming the data

Along with cleaning come aspects of data checking, where you look for flaws in how your variables are distributed or related. You may decide to exclude variables altogether on the basis of some of these flaws. But you can also consider how to address some of these issues by transforming your data. I cover (non-exhaustively) a few below:

Outliers. A common way to identify outliers is by how many standard deviations they sit from the mean. If a value is beyond some threshold (say 3 standard deviations), you could exclude it or perhaps cap its value.
Skewness. Many predictive models work better if continuous variables are roughly normally distributed. You should inspect your variables to see if they are heavily skewed (negatively or positively). There are a variety of statistical techniques that can help reduce skew (log transformations and so forth).
Collinearity. Sometimes variables overlap in the information they are providing, which makes it hard for the model to determine which variable is contributing to the outcome. Ideally, you want your variables to be as unrelated to each other as possible. Something that is completely correlated would be, say, postcode/zip code and state. Something that is moderately correlated could be time spent on various pages of your website. So you could consider picking one or the other, or perhaps time spent on those pages could be combined into just one variable.
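Here is a minimal sketch of the three checks above using only Python's standard library. The data is invented, the 3 standard deviation cap and the use of a log(1 + x) transform are just one reasonable choice each, and in a real project you would more likely reach for a statistics package.

```python
import math
import statistics


def cap_outliers(values, z=3.0):
    """Cap values lying more than z sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    low, high = mean - z * sd, mean + z * sd
    return [min(max(v, low), high) for v in values]


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def log_transform(values):
    """Apply log(1 + x), which is defined at zero and compresses large values."""
    return [math.log1p(v) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


# Outliers: hypothetical seconds on site, with one extreme session.
seconds_on_site = [8, 9, 10, 11, 12] * 4 + [500]
capped = cap_outliers(seconds_on_site)

# Skewness: hypothetical deal sizes with a long right tail.
deal_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
raw_skew = skewness(deal_sizes)
log_skew = skewness(log_transform(deal_sizes))

# Collinearity: hypothetical minutes spent on two related pages.
pricing_minutes = [1, 2, 3, 4, 5]
demo_minutes = [2, 4, 6, 8, 11]
r = pearson(pricing_minutes, demo_minutes)

print(f"largest value after capping: {max(capped):.1f}")
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
print(f"correlation between page times: {r:.3f}")
```

If two predictors turn out to be as strongly correlated as the two page-time variables here, that is a cue to keep only one of them or to combine them into a single variable.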

Conclusion: Quality over quantity

All of the above is an investment of time, money, and effort. That's not just in the checking and cleaning of data; it also requires some qualitative thinking and analysis. Yes, developing a model is a bit of trial and error (as I've said, you can adjust the model later)... but you want to start off on the best foot possible, so it's worth investing the time to give it the best chance of success. When it comes to that initial investment, focus your energies on a small subset of variables that you prioritize.

Check out this post for a technical breakdown of how to select the right data for your predictive lead scoring model with examples and a case study!