Feature Engineering in Displayr
Feature engineering refers to the process of manipulating predictor variables (features) with the goal of improving a predictive model. In this post I outline some of the key tools and processes for feature engineering in Displayr.
Switching between categorical and numeric treatment of predictor variables
Perhaps the most fundamental form of feature engineering when building a predictive model is deciding whether to treat a particular predictor as categorical or numeric. In Displayr, the way that a variable is treated in a model is determined by its structure. Displayr has 15 different structures, but the two most relevant to predictive models are Numeric and Mutually exclusive categories (nominal); the latter means that the data is treated as categorical.
The structure of a variable is changed by selecting the variable in the Data Sets Tree (bottom-left) and changing Object Inspector > Properties > INPUTS > Structure. Sometimes a variable will be grouped into a variable set with other variables. The variable set can be split by selecting Data Manipulation > Split.
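To see why this choice matters, the following is a minimal sketch in plain R, outside Displayr's menus. The data frame and the variable names Spend and Plan are hypothetical, and factor() stands in for the Mutually exclusive categories (nominal) structure.

```r
# Hypothetical data: Plan is stored as the numeric codes 1, 2 and 3.
customers <- data.frame(
  Spend = c(20, 22, 35, 37, 28, 30),
  Plan  = c(1, 1, 2, 2, 3, 3)
)

# Numeric treatment: a single slope, which assumes Spend changes
# linearly as the code goes from 1 to 2 to 3.
numeric_fit <- lm(Spend ~ Plan, data = customers)

# Categorical treatment: a separate effect for each non-reference category.
categorical_fit <- lm(Spend ~ factor(Plan), data = customers)

length(coef(numeric_fit))      # intercept plus one slope
length(coef(categorical_fit))  # intercept plus two category effects
```

The categorical model estimates more parameters, which gives it more flexibility at the cost of needing more data per category.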
Creating a new numeric variable
There are many tools in Displayr for creating new variables. The most flexible approach is to select Insert > New R (Variables) > Numeric Variable, which lets you create a new variable using the R language. For example, to create a new variable containing the natural logarithm of an existing variable called Tenure, type log(Tenure) in the R CODE box. See Feature Engineering for Numeric Variables for examples of code to winsorize, cap, normalize, and calculate polynomials.
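As a rough illustration of the kinds of transformation mentioned above, here is a sketch in base R. The Tenure values, the cap of 60, and the 5th and 95th percentile limits are all assumptions chosen for the example. In Displayr, Tenure would already exist in the data set, and each expression would typically go into its own R variable.

```r
# Hypothetical Tenure values; in Displayr this would be an existing variable.
Tenure <- c(1, 3, 6, 12, 24, 48, 120)

# Natural logarithm, as in the text.
log_tenure <- log(Tenure)

# Cap at a fixed upper limit (60 is an arbitrary choice).
capped <- pmin(Tenure, 60)

# Winsorize: pull values outside the 5th and 95th percentiles
# in to those percentiles.
limits <- quantile(Tenure, c(0.05, 0.95))
winsorized <- pmin(pmax(Tenure, limits[1]), limits[2])

# Normalize to a mean of 0 and a standard deviation of 1.
normalized <- as.numeric(scale(Tenure))

# Squared term, for fitting a quadratic (polynomial) effect.
tenure_squared <- Tenure^2
```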
Creating a new categorical variable
Categorical variables are created as follows:
- Start by creating a numeric variable: Insert > New R (Variables) > Numeric Variable and enter code in the R CODE box.
- Change the type to categorical with Object Inspector > Properties > INPUTS > Structure: Mutually exclusive categories (nominal).
- Labels and values can be modified by clicking on the various options in Object Inspector > Properties > DATA VALUES.
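The first step above might look something like the following sketch, which bins a hypothetical Tenure variable into three numeric codes. The cut points (12 and 36) and the labels are assumptions for illustration. The final lines show the plain R equivalent of the second and third steps, which in Displayr are done through the Object Inspector rather than in code.

```r
# Hypothetical Tenure values; in Displayr this would be an existing variable.
Tenure <- c(1, 3, 6, 12, 24, 48, 120)

# Step 1: numeric codes. 1 = up to 12, 2 = over 12 and up to 36, 3 = over 36.
tenure_band <- as.numeric(cut(Tenure, breaks = c(0, 12, 36, Inf), labels = FALSE))

# Plain R equivalent of changing the structure and editing the labels.
tenure_band_labelled <- factor(
  tenure_band,
  levels = 1:3,
  labels = c("Up to 12", "13 to 36", "More than 36")
)
```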
Missing value settings
To modify which values of a variable are treated as missing, select the variable and then press Object Inspector > Properties > DATA VALUES > Missing values.
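The effect of a missing value setting can be sketched in R as follows. The variable Satisfaction and the use of 99 as a "Don't know" code are hypothetical. Displayr's menu setting achieves this without code.

```r
# Hypothetical ratings where 99 is a code for "Don't know".
Satisfaction <- c(4, 5, 99, 3, 99, 2)

# Treat the 99s as missing rather than as very high ratings.
satisfaction_clean <- ifelse(Satisfaction == 99, NA_real_, Satisfaction)

mean(Satisfaction)                      # distorted by the 99 codes
mean(satisfaction_clean, na.rm = TRUE)  # based on the four valid ratings
```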
Merging categories of categorical variables
Categories of categorical variables can be merged by dragging and dropping:
- Drag the variable from the Data Sets Tree onto the page. This will create a table.
- Click on the table and then click on one of the categories you wish to merge. When three grey lines appear to the right, click on them and drag the category onto another category to merge the two. Alternatively, hold Ctrl or Shift to select multiple categories and merge them using Data Manipulation > Merge (Rows/Columns).
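For readers who prefer code, the same kind of merge can be sketched in base R by giving two factor levels the same name. The Region variable and its categories are hypothetical, and this is an R analogue of the drag-and-drop steps rather than what Displayr runs internally.

```r
# Hypothetical categorical variable.
Region <- factor(c("North", "South", "East", "West", "North", "East"))

# Assigning the same name to two levels merges them into one.
merged <- Region
levels(merged)[levels(merged) %in% c("North", "South")] <- "North + South"

table(merged)
```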
Reordering categories of categorical variables
Categories can be reordered by clicking on them (see the previous section) and dragging them.
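In plain R, the equivalent is to respecify the order of a factor's levels, as in the sketch below. The Income variable is hypothetical. One reason the order can matter is that R's default contrasts in functions such as lm() treat the first level as the reference category.

```r
# Hypothetical categorical variable; R orders the levels alphabetically by default.
Income <- factor(c("Medium", "Low", "High", "Low", "Medium"))
levels(Income)

# Reorder the levels so that they run from Low to High.
reordered <- factor(Income, levels = c("Low", "Medium", "High"))
levels(reordered)
```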
Feature extraction
Displayr contains a large number of tools for feature extraction. For example:
- Principal components analysis (PCA), for extracting dimensions from numeric variables: Insert > Dimension Reduction > Principal Components Analysis. Once the analysis has been run, select the output and then click Insert > Dimension Reduction > Save Variable(s), which will add the variables to the data set.
- t-SNE, which is a highly nonlinear dimension reduction technique: Insert > Dimension Reduction > t-SNE. Once the analysis has been run, select the output and then click Insert > Dimension Reduction > Save Variable(s).
- Multiple correspondence analysis, for extracting dimensions from categorical variables: Insert > Dimension Reduction > Multiple Correspondence Analysis. Once the analysis has been run, select the output and then click Insert > Dimension Reduction > Save Variable(s).
- The various cluster analysis and latent class analysis tools in Insert > Group/Segment.
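As an illustration of what saving PCA variables amounts to, here is a minimal sketch using base R's prcomp() on the built-in iris data. This is not necessarily the routine or the defaults that Displayr's menu item uses (for example, it applies no rotation). Keeping two components is an arbitrary choice for the example.

```r
# The four numeric measurements in the built-in iris data stand in for predictors.
predictors <- iris[, 1:4]

# Standardize the variables and extract principal components.
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Keep the scores on the first two components as new features.
components <- pca$x[, 1:2]

# Proportion of variance captured by the first two components.
summary(pca)$importance["Cumulative Proportion", 2]
```

The columns of components are uncorrelated with each other, which is one reason they can be useful as predictors in place of a set of correlated numeric variables.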
You can do anything…
Displayr supports all the main R packages, so it can perform any feature engineering that you require. If you cannot figure out how to do something, please contact us.
About Tim Bock
Tim Bock is the founder of Displayr. Tim is a data scientist, who has consulted, published academic papers, and won awards, for problems/techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q (www.qresearchsoftware.com), a data science product designed for survey research, which is used by all the world’s seven largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.