Feature engineering refers to a process of selecting and transforming variables when creating a predictive model using machine learning or statistical modeling (such as deep learning, decision trees, or regression). The process involves a combination of data analysis, applying rules of thumb, and judgement. It is sometimes referred to as pre-processing, although that term can have a more general meaning.
The goal of feature engineering
The data used to create a predictive model consists of an outcome variable, which contains data that needs to be predicted, and a series of predictor variables that contain data believed to be predictive of the outcome variable. For example, in a model predicting property prices, the data showing the actual prices is the outcome variable. Data on things such as the size of the house, the number of bedrooms, and the location are the predictor variables; these are believed to determine the value of the property.
A "feature" in the context of predictive modeling is just another name for a predictor variable. Feature engineering is the general term for creating and manipulating predictors so that a good predictive model can be created.
The first step in feature engineering is to identify all the relevant predictor variables to be included in the model. Identifying these variables is a theoretical rather than practical exercise and can be achieved by consulting the relevant literature, talking to experts about the area, and brainstorming.
A common mistake people make when they start predictive modeling is to focus on data already available. Instead, they should be considering what data is required. This mistake often leads to two practical problems:
- Essential predictor variables end up being left out of the model. For example, in a model predicting property prices, knowledge of the type of property (e.g., house, apartment, condo, retail, office, industrial) is crucially important. If this data is not available, it needs to be sourced well before any attempt is made at building a predictive model.
- Variables that should be created from available data are not created. For example, a good predictor of many health outcomes is the Body Mass Index (BMI). To calculate BMI, you have to divide a person's weight by the square of their height. To build a good predictive model of health outcomes you need to know enough to work out that you need to create this variable as a feature for your model. If you just include height and weight in the model, the resulting model will likely perform worse than a model that includes BMI, height, and weight as predictors, along with other relevant variables (e.g., diet, a ratio of waist to hip circumference).
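The BMI example above can be sketched in a few lines. This is a minimal illustration with made-up data; the column names (`height_m`, `weight_kg`) are assumptions, not part of any particular dataset.

```python
# Creating a BMI feature from raw height and weight columns.
# The data and column names here are purely illustrative.
import pandas as pd

df = pd.DataFrame({
    "height_m": [1.70, 1.82, 1.55],   # height in metres
    "weight_kg": [68.0, 90.0, 52.0],  # weight in kilograms
})

# BMI = weight (kg) divided by the square of height (m)
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

print(df.round(1))
```

A model can then be given `bmi` alongside `height_m` and `weight_kg`, rather than being left to discover the ratio on its own.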
Feature transformation involves manipulating a predictor variable in some way so as to improve its performance in the predictive model. A variety of considerations come into play when transforming variables, including:
- The flexibility of machine learning and statistical models in dealing with different types of data. For example, some techniques require that the input data be in numeric format, whereas others can deal with other formats, such as categorical, text, or dates.
- Ease of interpretation. A predictive model where all the predictors are on the same scale (e.g., have a mean of 0 and a standard deviation of 1), can make interpretation easier.
- Predictive accuracy. Some transformations of variables can improve the accuracy of prediction (e.g., rather than including a numeric variable as a predictor, instead include both it and a second variable that is its square).
- Theory. For example, economic theory dictates that in many situations the natural logarithm of data representing prices and quantities should be used.
- Computational error. Many algorithms are written in such a way that "large" numbers cause them to give the wrong result, where "large" may not be so large (e.g., more than 10 or less than -10).
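Three of the transformations mentioned above (standardizing to a mean of 0 and a standard deviation of 1, adding a squared term, and taking the natural logarithm of prices) can be sketched as follows. The data is illustrative, not drawn from any real property dataset.

```python
# Common feature transformations, using illustrative price data.
import numpy as np

prices = np.array([120000.0, 250000.0, 410000.0, 900000.0])

# Standardize: putting predictors on the same scale eases interpretation
# and keeps values small enough to avoid computational error
standardized = (prices - prices.mean()) / prices.std()

# Squared term: lets a linear model capture a simple curved relationship
prices_squared = prices ** 2

# Natural log: often dictated by economic theory for prices and quantities
log_prices = np.log(prices)

print(standardized.round(2), log_prices.round(2))
```

In a model, the standardized or logged version would typically replace the raw variable, while the squared term is included alongside the original.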
Transformations involve creating a new variable by manipulating an existing variable in some way. Feature extraction, by contrast, involves creating variables by extracting them from some other data. For example, using:
- Principal components analysis (PCA) to create a small number of predictor variables from a much larger number.
- Orthogonal rotations of predictor variables to minimize the effect of them being highly correlated.
- Cluster analysis to create a categorical variable from multiple numeric variables.
- Text analytics to extract numeric variables, such as sentiment scores, from text data.
- Edge detection algorithms to identify shapes in images.
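The first of these, PCA, can be sketched directly with NumPy's SVD: a set of correlated predictors is compressed into a small number of components that capture most of the variation. The simulated data below is an assumption for illustration only.

```python
# Feature extraction with PCA via SVD: compress six correlated
# predictors into two principal components. Data is simulated.
import numpy as np

rng = np.random.default_rng(0)
# 100 observations of 6 predictors that really vary along only 2 dimensions
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(100, 6))

Xc = X - X.mean(axis=0)             # centre each predictor
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt[:2].T          # keep the first two principal components

# Proportion of total variance captured by the two components
explained = (s**2 / (s**2).sum())[:2].sum()
print(components.shape, round(explained, 3))
```

The two extracted components can then stand in for the original six predictors, at the cost of some interpretability.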
Feature selection refers to the decision about which predictor variables should be included in a model. To a novice, it might seem obvious to include all the available features in the model and let the predictive model automatically select which ones are appropriate. Sadly, it is not so simple in reality. Sometimes the computer you are using will crash if you select all the possible predictor variables. Sometimes the algorithm being used may not have been designed to take all available variables. And if you include every possible feature, the model may end up identifying spurious relationships. Just like people, models given a whole lot of data can often come up with predictions that seem accurate but are just coincidences.
Feature selection in practice involves a combination of common sense, theory, and testing the effectiveness of different combinations of features in a predictive model.
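The "testing different combinations" part of feature selection can be sketched as follows: fit a least-squares model on each subset of candidate predictors and keep the subset with the lowest held-out error. The data is simulated (the third predictor is pure noise), and exhaustive search like this is only feasible for small numbers of candidates.

```python
# Feature selection by testing combinations of predictors on held-out data.
# Simulated data: predictors 0 and 1 drive y; predictor 2 is noise.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)

train, test = slice(0, 150), slice(150, n)

def holdout_error(cols):
    """Mean squared error on held-out data for a given predictor subset."""
    A = np.column_stack([X[train][:, list(cols)], np.ones(150)])
    beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    B = np.column_stack([X[test][:, list(cols)], np.ones(50)])
    return np.mean((B @ beta - y[test]) ** 2)

subsets = [c for r in (1, 2, 3) for c in itertools.combinations(range(3), r)]
best = min(subsets, key=holdout_error)
print(best)
```

In practice this brute-force search is combined with the common sense and theory mentioned above: candidate subsets are narrowed down before any testing begins.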