Feature engineering is the process of selecting and transforming variables when creating a predictive model using machine learning or statistical modeling (such as deep learning, decision trees, or regression). The process involves a combination of data analysis, applying rules of thumb, and judgement. It is sometimes referred to as pre-processing, although that term can have a more general meaning.
The goal of feature engineering
The data used to create a predictive model consists of an outcome variable, which contains data that needs to be predicted, and a series of predictor variables that contain data believed to be predictive of the outcome variable. For example, in a model predicting property prices, the data showing the actual prices is the outcome variable, and the data showing things that are believed to determine property values, such as size of the house, number of bedrooms, and location, are the predictor variables.
A "feature" in the context of predictive modeling is just another name for a predictor variable, and feature engineering is the general term for the process of creating and manipulating predictors, so that a good predictive model can be created.
The first step in feature engineering is to identify all the relevant predictor variables (features) that should be included in the model. This is a theoretical rather than practical exercise and can be achieved by consulting the relevant literature, talking to experts about the area, and brainstorming.
A common mistake people make when they start predictive modeling is to focus on data already available, rather than to consider what data is required. This mistake often leads to two practical problems:
- Essential predictor variables are left out of the model. For example, in a model predicting property prices, knowledge of the type of property (e.g., house, apartment, condo, retail, office, industrial) is crucially important. If this data is not available, it needs to be sourced well before any attempt is made at building a predictive model.
- Variables that should be created from available data are not created. For example, a good predictor of many health outcomes is the BMI, which is a person's weight divided by the square of their height. To build a good predictive model of health outcomes you need to know enough to work out that you need to create this variable as a feature for your model. If you instead just include height and weight in the model, the resulting model will likely perform worse than a model that includes BMI, height, and weight as predictors, along with other variables known to be relevant (e.g., diet, ratio of waist to hip circumference).
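The BMI example above can be sketched in a few lines of code. This is a minimal illustration; the column names and the data are made up for the sketch.

```python
# Hypothetical illustration: deriving a BMI feature from raw height and weight.
# The column names and values here are invented for the example.
import pandas as pd

df = pd.DataFrame({
    "height_m": [1.70, 1.85, 1.60],
    "weight_kg": [68.0, 90.0, 55.0],
})

# BMI = weight (kg) divided by the square of height (m)
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

print(df["bmi"].round(1).tolist())   # → [23.5, 26.3, 21.5]
```

A model could then be given `bmi` alongside `height_m` and `weight_kg`, rather than forcing it to discover the ratio on its own.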
Feature transformation involves manipulating a predictor variable in some way so as to improve its performance in the predictive model. A variety of considerations come into play when transforming variables, including:
- The flexibility of the machine learning and statistical models at dealing with different types of data. For example, some techniques require that the input data be in numeric format, whereas others can deal with other formats, such as categorical, text, or dates.
- Ease of interpretation. A predictive model is easier to interpret when all the predictors are on the same scale (e.g., each with a mean of 0 and a standard deviation of 1).
- Predictive accuracy. Some transformations of variables can improve the accuracy of prediction (e.g., rather than including a numeric variable as a predictor, instead include both it and a second variable that is its square).
- Theory. For example, economic theory dictates that in many situations the natural logarithm of data representing prices and quantities should be used.
- Computational error. Many algorithms are written in such a way that "large" numbers cause them to give the wrong result, where "large" may not be so large (e.g., more than 10 or less than -10).
Whereas transformations involve creating a new variable by manipulating one variable in some way, feature extraction involves creating variables by extracting them from some other data. This can be achieved by using:
- Principal components analysis (PCA) to create a small number of predictor variables from a much larger number.
- Orthogonal rotations of predictor variables to minimize the effect of them being highly correlated.
- Cluster analysis to create a categorical variable from multiple numeric variables.
- Text analytics to extract numeric variables, such as sentiment scores, from text data.
- Edge detection algorithms to identify shapes in images.
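The first of these, PCA, can be sketched with plain NumPy via the singular value decomposition. The data here is randomly generated and exists only to show the shapes involved.

```python
# Minimal PCA sketch via the singular value decomposition (NumPy only).
# The data is random and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 cases, 10 original predictors

# Center each predictor, then take the SVD of the centered data
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first 3 principal components: 10 predictors become 3
components = X_centered @ Vt[:3].T

print(components.shape)   # → (100, 3)
```

The three extracted components could then be used as predictor variables in place of the original ten.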
Feature selection refers to the decision about which predictor variables should be included in a model. To a novice it seems obvious that you should just include all the available features in the model and let the predictive model automatically select which ones are appropriate. Sadly, it is not so simple in reality. Sometimes the computer you are using will crash if you select all the possible predictor variables. Sometimes the algorithm is not designed to take all of the available variables. And sometimes, given all the possible features, the model ends up identifying spurious relationships: like people, models given a whole lot of data can often come up with predictions that seem accurate but are just coincidences.
Feature selection in practice involves a combination of common sense, theory, and testing the effectiveness of different combinations of features in a predictive model.
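One simple form of the testing described above is to fit a model on a training split and compare the holdout error of candidate feature sets. The sketch below does this with a linear model solved by least squares; the data and feature sets are synthetic and for illustration only.

```python
# Sketch of testing feature subsets: fit on a training split, then
# compare holdout error. The data is synthetic: only x1 truly predicts y.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)   # x2 is pure noise

def holdout_mse(features):
    """Mean squared error on a holdout split for a given feature set."""
    X = np.column_stack([np.ones(n)] + features)   # intercept + features
    train, test = slice(0, 150), slice(150, n)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    resid = y[test] - X[test] @ beta
    return float(np.mean(resid ** 2))

mse_with_x1 = holdout_mse([x1])
mse_with_x2 = holdout_mse([x2])
print(mse_with_x1 < mse_with_x2)   # the informative feature predicts better
```

In practice this comparison is usually done with proper cross-validation over many candidate feature sets, but the principle is the same: retain the features that actually improve out-of-sample prediction.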