02 February 2017
10 Ways to Create New Variables in Displayr
Most data scientists have a pretty clear picture of how variables should be created, and it almost certainly involves writing code. While you can take this approach in Displayr, there are often much smarter ways. By “smarter”, I mean faster and less error-prone.
Recap: What is a variable? And a derived variable?
Just in case you are completely new to data science, let me quickly explain precisely what I mean by a variable. The first table below shows the average values of each of five variables. The raw data from these five variables appears in the second table below, labeled optus, orange, telstra, vodaphone and SUM. Each of these variables contains a value for each of 725 people. The first variable shows an NaN for the first person, which means that there is no data (Not a Number), 90 for the second and third person, 99 for the fourth person and so on.
The first four variables are in the original data file. They appear in the data set, which you find in the Data tree (at the bottom-left of the screen in Displayr). The fifth variable, SUM, is a derived variable (also known as a constructed or computed variable). It is the sum of the first four variables – the value in each row is generated by adding up the values from the other four variables in that row. It is a new variable, in that it was not in the original data file.
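As a rough illustration of what a derived variable like SUM contains, here is a minimal sketch in plain JavaScript that adds four columns row by row. The values are invented for illustration (only the NaN, 90, 90, 99 pattern for the first variable comes from the description above). Letting a missing value make the whole row's sum missing is an assumption of this sketch, not a statement about how Displayr treats missing data.

```javascript
// Hypothetical raw data: one array per variable, one entry per person.
// Only the optus values follow the text; the rest are invented.
const optus     = [NaN, 90, 90, 99];
const orange    = [10, 0, 5, 1];
const telstra   = [20, 5, 5, 0];
const vodaphone = [30, 5, 0, 0];

// Derived variable: for each person (row), add the values of the four variables.
// In plain JavaScript, NaN + anything is NaN, so a missing value propagates.
const SUM = optus.map((value, i) => value + orange[i] + telstra[i] + vodaphone[i]);

console.log(SUM); // [ NaN, 100, 100, 100 ]
```

The point is simply that SUM has one value per person, computed from the other values in the same row.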
1. Grouping variables into a variable set
You have already seen the first way of creating a new variable. If you create a table using a variable set that contains multiple numeric or binary variables, Displayr will automatically create one or more new variables. For numeric variables, this will be the SUM variable, as in the example above.
For binary variables, where the data has values of 0 and 1, you will get a NET. The data in the NET variable takes a value of 1 for people who have a 1 in any of the variables, and a 0 for those people who do not. The percentage shown in the table for the NET then indicates the proportion of people who have a 1 among any of the variables. See below for an example. Three key things to note about the NET:
- The NET is not guaranteed to be 100%. With a nominal variable set (i.e., a variable set with mutually exclusive and exhaustive categories), the NET will always be 100%, but this is not the case for NETs of binary data. In the example below, the table is showing the proportion of people that like each of the different brands shown. 2% of the people like none of the brands, and so the NET is only 98%.
- The NET is not the sum of the other percentages in the table. It is computed as an OR operation. For a nominal variable set, the OR operation is the same as the sum, but this is the exception rather than the rule.
- Where you have missing values, a NET is only computed based on people that have no missing data. If you want to change this, the trick is to recode the data (e.g., Data Manipulation > Data Values > Missing Values). This lets you treat the missing values as either a 0 or a 1.
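The three points above can be sketched in plain JavaScript. The data, the `netForPerson` helper, and the use of `null` to mark missing values are all hypothetical. The sketch assumes, following the description above, that a person with any missing value is left out of the NET.

```javascript
// Hypothetical binary data: rows are people, columns are brands (1 = likes the brand).
// null marks a missing value.
const likes = [
  [1, 0, 1],
  [0, 0, 0],
  [1, 1, 1],
  [0, null, 1],
];

// NET for one person: 1 if any variable is 1, 0 if none are (an OR operation).
// Assumption of this sketch: a person with any missing value gets a missing NET.
function netForPerson(row) {
  if (row.includes(null)) return null;
  return row.some((v) => v === 1) ? 1 : 0;
}

const NET = likes.map(netForPerson); // [1, 0, 1, null]

// NET percentage, based only on people with complete data.
const completeNet = NET.filter((v) => v !== null);
const netPercent = (100 * completeNet.filter((v) => v === 1).length) / completeNet.length;

// Percentage for each brand over the same complete cases, for comparison.
const completeRows = likes.filter((row) => !row.includes(null));
const brandPercents = [0, 1, 2].map(
  (col) => (100 * completeRows.filter((row) => row[col] === 1).length) / completeRows.length
);
const sumOfBrandPercents = brandPercents.reduce((total, p) => total + p, 0);
```

Here the NET is about 67%, which is neither 100% nor the sum of the brand percentages (about 167%).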
2. Creating NET variables
We can also manually create new NETs and SUMs for a variable set. This is done by selecting the row or column headings of a table, and then selecting Data Manipulation > Rows/Columns > Create NET. In the table above, a combined Vodafone + Optus + Telstra category has been created.
3. Changing the Structure of a Variable Set
Many of the most common types of new variables that people need can be created in Displayr by changing the structure of a variable set. For example, if you have…:
- …numeric data, and you wish to aggregate it into categories, you can do so by changing the variable set structure to Percentages (Nominal or Nominal – Multi).
- …rating scales, such as a 5-point scale measuring agreement, you can change their structure so that you get top 2 box scores by changing to Percentages (Binary – Multi).
- …categorical data, and you want to treat it as numeric data, you can change the Structure from Percentages (Nominal or Nominal – Multi) to Average (Numeric or Numeric – Multi), and then modify the values by pressing Recode Values. Check out the post on computing NPS for more detail.
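To make the first two transformations concrete, here is a minimal sketch in plain JavaScript of what the resulting data looks like. It is not how Displayr does it internally. The data, the cut points, and the `spendBand` helper are all hypothetical.

```javascript
// Hypothetical 5-point agreement ratings (1 = strongly disagree, 5 = strongly agree).
const rating = [5, 3, 4, 1, 2];

// Top 2 box: 1 if the rating is 4 or 5, otherwise 0.
const top2Box = rating.map((r) => (r >= 4 ? 1 : 0)); // [1, 0, 1, 0, 0]

// The top 2 box score is the percentage of people with a 1.
const top2BoxScore = (100 * top2Box.filter((v) => v === 1).length) / top2Box.length; // 40

// Hypothetical numeric data aggregated into categories (the cut points are invented).
const spend = [12, 48, 75, 130];
function spendBand(amount) {
  if (amount < 25) return "Under 25";
  if (amount < 100) return "25 to 99";
  return "100 or more";
}
const spendCategory = spend.map(spendBand);
// ["Under 25", "25 to 99", "25 to 99", "100 or more"]
```

In both cases the underlying values are unchanged. Only the way they are grouped and summarized differs, which is why changing the structure is usually quicker than writing code like this.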
4. Duplicating data and modifying it
Often, changing the structure of a variable set does not quite get you where you want to be. While it allows you to create data in the format you want, you lose the data that was already there. The solution is to first duplicate the data (Data Manipulation > Variables > Duplicate), which adds new copies of the same variables to the data set. You can then modify the copies and keep the originals.
5. Using the menus and buttons in the ribbon
There are lots of automatic ways of creating variables available from the different menus and buttons in Displayr. For example, if you have created a regression model, you can select that model, and choose Insert > Analysis > More > Regression > Save Variable(s) and choose one of the options (e.g., Predicted Values).
6. Creating an R variable
Custom variables can be created by selecting Insert > Variables > R and entering R CODE. For example, if we type Q1 + Q2 we will create a new variable that contains the sum of the values of these two variables for each case in the data set. If Q1 and Q2 cannot be added, e.g., if they are text or categorical variables (referred to as factors in R), you will get an error.
Displayr calculates the values of the new variable instantly, while you type. So, if what you type is invalid or incorrect, an error appears.
You can even drag variables from the Data tree in the bottom left into the code window as a shortcut to writing formulas.
See Introduction to Displayr 4: Simple calculations for a gentle introduction to using R in Displayr for other types of calculations.
var result = [N] // Creating an array that will store the result
// Looping through the observations in the database (N means all observations)
for (var i = 0; i < N; i++)
    result[i] = Q1[i] + Q2[i] // Adding the cases for each case
result // Returning the result. This is used as the newQ1 variable.
9. Creating a filter
When you create a filter by selecting Home > Data Selection > New Filter, a new variable is created that takes a value of 1 if the case is included in the filter and 0 otherwise. Although its purpose is to filter, you can also use the resulting variable in any other analyses that you need to perform.
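A filter variable is just a column of 1s and 0s, as this minimal sketch in plain JavaScript shows. The data and the "under 35" rule are hypothetical, and the code illustrates the idea rather than Displayr's own implementation.

```javascript
// Hypothetical data: age and a satisfaction score for five people.
const age = [22, 45, 31, 60, 19];
const satisfaction = [7, 9, 6, 8, 10];

// Filter variable: 1 if the case is included (here, aged under 35), 0 otherwise.
const under35 = age.map((a) => (a < 35 ? 1 : 0)); // [1, 0, 1, 0, 1]

// Using the filter variable in another calculation:
// the mean satisfaction of the included cases only.
const included = satisfaction.filter((_, i) => under35[i] === 1); // [7, 6, 10]
const filteredMean = included.reduce((total, v) => total + v, 0) / included.length;
```

Because the filter is an ordinary variable, it can be reused anywhere a binary variable is accepted.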
10. Creating a weight
Lastly, when you create a weight for your data set (to adjust the contribution of each case to table statistics and other analyses), this results in a new variable being created in the data set which contains the weight value for each case. Create weights from Home > Data Selection > New Weight.
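To show how a weight variable changes the contribution of each case, here is a minimal sketch of a weighted mean in plain JavaScript. The scores and weights are invented. The formula is the standard weighted average, which I am assuming for illustration rather than quoting from Displayr.

```javascript
// Hypothetical data: a score and a weight for each of four cases.
const score  = [10, 20, 30, 40];
const weight = [0.5, 1.5, 1.0, 1.0];

// Weighted mean: sum of (weight * value) divided by the sum of the weights.
const weightedTotal = score.reduce((total, v, i) => total + weight[i] * v, 0); // 105
const weightSum = weight.reduce((total, w) => total + w, 0);                   // 4
const weightedMean = weightedTotal / weightSum;                                // 26.25

// Unweighted mean, for comparison.
const unweightedMean = score.reduce((total, v) => total + v, 0) / score.length; // 25
```

The second case counts three times as much as the first, which pulls the weighted mean above the unweighted one.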
Author: Tim Bock
Tim Bock is the founder of Displayr. Tim is a data scientist, who has consulted, published academic papers, and won awards, for problems/techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q (www.qresearchsoftware.com), a data science product designed for survey research, which is used by all the world’s seven largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.