Creating R Variables from Multiple Input Variables Using Code
Most of the time, when wanting to create new variables, the trick is to either change the structure of the variables or use one of the in-built functions (e.g., Insert > New Transform). However, it is sometimes necessary to write code. This post lists the key concepts necessary for creating new variables by writing R code.
All the traditional mathematical operators (e.g., +, -, /, and *) work in R in the way that you would expect when performing math on variables.
For example, to add two numeric variables called q2a_1 and q2b_1, select Insert > New R > Numeric Variable (top of the screen), paste in the code
q2a_1 + q2b_1, and click CALCULATE. That will create a numeric variable that, for each observation, contains the sum of the values of the two variables. Similarly, the following code computes a proportion for each observation:
q2a_1 / (q2a_1 + q2b_1).
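To try these two expressions outside Displayr, here is a minimal sketch on made-up data. Only the names q2a_1 and q2b_1 come from this post; the values are hypothetical.

```r
# Hypothetical data: two numeric variables, one value per observation
q2a_1 <- c(2, 0, 5, 1)
q2b_1 <- c(1, 3, 0, 1)

# Element-wise sum: one result per observation
total <- q2a_1 + q2b_1

# Proportion of each observation's total that comes from q2a_1
share <- q2a_1 / (q2a_1 + q2b_1)

print(total)
print(share)
```

Each operator works observation by observation, so the result always has the same length as the inputs.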
To see the name of a variable, hover over it in the Variable Sets tree. Or, drag the variable into the R CODE box.
One of the great strengths of using R is that you can use vector arithmetic. Consider the expression
q2a_1 / sum(q2a_1). This tells R to divide the value of
q2a_1 by the sum of all the values that all observations take for this variable. That is, when computing the denominator, R sums the values of every observation in the data set. Other programs, such as SPSS, would instead treat this expression as meaning to divide
q2a_1 by itself.
Similarly, if we wished to standardize
q2a_1 to have a mean of 0 and a standard deviation of 1, we can use
(q2a_1 - mean(q2a_1)) / sd(q2a_1).
In these two examples, there are also specialist functions we can use:
- q2a_1 / sum(q2a_1) is equivalent to writing prop.table(q2a_1)
- (q2a_1 - mean(q2a_1)) / sd(q2a_1) is equivalent to scale(q2a_1)
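A quick way to convince yourself is to compare the hand-written expressions with the base R functions prop.table and scale. This is a sketch on hypothetical values, not Displayr data.

```r
# Hypothetical values for a single numeric variable
q2a_1 <- c(2, 0, 5, 1)

# Share of the overall total: by hand and with prop.table()
by_hand_share <- q2a_1 / sum(q2a_1)
builtin_share <- prop.table(q2a_1)

# Standardized scores: by hand and with scale()
by_hand_z <- (q2a_1 - mean(q2a_1)) / sd(q2a_1)
builtin_z <- as.numeric(scale(q2a_1))  # scale() returns a one-column matrix

print(all.equal(by_hand_share, builtin_share))
print(all.equal(by_hand_z, builtin_z))
```

Note that scale returns a matrix with extra attributes, which is why the sketch wraps it in as.numeric before comparing.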
As shown in the previous section,
sum will add up all the observations in a variable. If we want to calculate the average of a set of variables, resulting in a new variable, we do so as follows:
rowMeans(cbind(q2a, q2b, q2c, q2d, q2e, q2f))
- cbind groups the variables together in a table with one row for each observation and one column for each variable
- rowMeans computes the mean of each row in the table.
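The two steps can be seen separately in a small sketch. It uses three made-up variables rather than the six in the expression above.

```r
# Hypothetical data: three variables, three observations
q2a <- c(1, 4, 2)
q2b <- c(3, 4, 0)
q2c <- c(2, 4, 1)

# Step 1: bind the variables into a table (one row per observation)
wide <- cbind(q2a, q2b, q2c)
print(dim(wide))

# Step 2: average across the columns of each row
row_avg <- rowMeans(wide)
print(row_avg)
```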
Missing values in vector arithmetic
Most in-built R functions, such as
rowSums, will return a missing value if any of the values in the vector (variable in this case) passed to them is missing. In most cases, the trick is to use
na.rm = TRUE. For example:
(q2a_1 - mean(q2a_1, na.rm = TRUE)) / sd(q2a_1, na.rm = TRUE)
Sadly, there is no shortage of exotic exceptions to this rule. For example,
prop.table cannot deal with missing values (a single missing value makes every result missing), and
scale automatically ignores them when computing the mean and standard deviation.
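The behaviors described in this section can be checked on a small vector that contains one missing value. The data is hypothetical.

```r
# Hypothetical variable with one missing observation
q2a_1 <- c(2, NA, 5, 1)

# Without na.rm, a single NA makes the whole result NA
plain_mean <- mean(q2a_1)
safe_mean  <- mean(q2a_1, na.rm = TRUE)

# Standardizing with na.rm = TRUE; the missing observation stays missing
z <- (q2a_1 - mean(q2a_1, na.rm = TRUE)) / sd(q2a_1, na.rm = TRUE)

# prop.table() has no na.rm argument: the total is NA, so every share is NA
shares <- prop.table(q2a_1)

# scale() skips missing values when computing the mean and sd
z_scale <- as.numeric(scale(q2a_1))

print(plain_mean)
print(safe_mean)
print(z)
print(shares)
print(z_scale)
```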
The data file used in this post contains 12 variables showing the frequency of consumption for six different colas on two usage occasions. When Displayr imports this data, it automatically works out that these variables belong together (based on their having consistent metadata). The variables are then automatically grouped together as a variable set, which is represented in the Data Sets tree, as shown below.
When your mouse pointer is positioned over the variable set, it shows the raw data for the variables. In addition to showing the 12 variables, you can also see nine automatically constructed additional variables:
- One variable which shows the sum of the variables, called SUM, SUM. This is the right-most of the variables.
- Six showing the sum of each of the cola brands: Coca-Cola, SUM, Diet Coke, Sum, etc.
- Two showing the sum of the variables pertaining to each occasion: Sum, 'out and about' and Sum, 'at home'.
These automatically constructed variables can considerably reduce the amount of code required to perform calculations. For example, to compute Coca-Cola's share of category requirements, we can use the expression:
(q2a_1 + q2a_2) / `Q2 - No. of colas consumed`[,"SUM, SUM"]
Note that the denominator has two aspects:
- The Label of the variable set, which is surrounded by backticks (the key that looks a bit like an apostrophe but isn't; on my keyboard it's above the Tab key, but this can vary depending on your keyboard's region).
- [,"SUM, SUM"], which means to take the column named SUM, SUM.
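To see the same syntax at work in plain R, here is a sketch that builds a small stand-in for the variable set. It assumes the variable set behaves like a table with named columns. The column labels other than SUM, SUM and all of the values are made up for illustration.

```r
# Hypothetical stand-in for the variable set: a table with named columns.
# The backticks let the object name contain spaces and punctuation.
`Q2 - No. of colas consumed` <- cbind(
  "Coca-Cola, out and about" = c(2, 0, 3),
  "Coca-Cola, at home"       = c(1, 1, 0),
  "Diet Coke, out and about" = c(0, 2, 1),
  "Diet Coke, at home"       = c(1, 1, 0),
  "SUM, SUM"                 = c(4, 4, 4)
)

# Pull out the two Coca-Cola columns by name
q2a_1 <- `Q2 - No. of colas consumed`[, "Coca-Cola, out and about"]
q2a_2 <- `Q2 - No. of colas consumed`[, "Coca-Cola, at home"]

# Share of category requirements: Coca-Cola total over the grand total
share <- (q2a_1 + q2a_2) / `Q2 - No. of colas consumed`[, "SUM, SUM"]
print(share)
```

Leaving the space before the comma empty inside the square brackets means "all rows", so the result has one value per observation.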
At first glance, this may seem somewhat strange and unguessable. However, if you create a table with the variable set, you can get a better understanding of what is happening and why. The table below shows the variable set, and you can see that the SUM variables correspond to the totals. With categorical variable sets, NET appears instead of SUM. And, if you delete these categories from the table, it will also delete them from the data set itself.
R has a super-cool function called
apply. It is a little tricky to get your head around it if you're new to writing R code, so if your head is already swimming, skip this section!
Earlier we looked at
rowMeans(cbind(q2a, q2b, q2c, q2d, q2e, q2f)). We can rewrite this as
apply(cbind(q2a, q2b, q2c, q2d, q2e, q2f), 1, mean). This is doing exactly the same thing, except that:
- We are telling R to compute the average with the mean function.
- The 1 tells R to perform the calculation by rows. If we instead had a 2, we would compute the mean of the columns.
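The difference between the 1 and the 2 is easiest to see side by side. This is a minimal sketch with three hypothetical variables.

```r
# Hypothetical table: three observations (rows), three variables (columns)
wide <- cbind(q2a = c(1, 4, 2), q2b = c(3, 4, 0), q2c = c(2, 4, 3))

# 1 = work across each row: one mean per observation
by_row <- apply(wide, 1, mean)

# 2 = work down each column: one mean per variable
by_col <- apply(wide, 2, mean)

print(by_row)
print(by_col)
```

When creating a new variable you almost always want the row version, because it returns one value per observation.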
The useful thing about
apply is that we can add in any function we want. For example, to compute the minimum, we replace mean with min:
apply(cbind(q2a, q2b, q2c, q2d, q2e, q2f), 1, min)
And, we can even write custom functions to apply for each row. The example below identifies flatliners (also known as straightliners), who are people with the same answer to each of a set of variables:
apply(cbind(q2a, q2b, q2c, q2d, q2e, q2f), 1, function(x) length(unique(x)) == 1)
The way it works is that:
- The function(x) part is boilerplate, telling R that you are going to be creating a custom function, and to represent each row as x
- unique identifies all the unique values in x (i.e., each row)
- length(unique(x)) counts the number of unique values for each row
- length(unique(x)) == 1 returns a TRUE for each row that contains only one unique value (i.e., flatlining) and a FALSE otherwise
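Here is the flatliner check run on a tiny made-up table, so you can see which rows are flagged. The values are hypothetical.

```r
# Hypothetical answers: rows are respondents, columns are variables
wide <- cbind(q2a = c(3, 1, 0), q2b = c(3, 2, 0), q2c = c(3, 1, 0))

# TRUE when a respondent gave the same answer to every variable
flatliner <- apply(wide, 1, function(x) length(unique(x)) == 1)

print(flatliner)
```

The first and third respondents gave identical answers throughout, so only they are flagged.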
We can make the code simpler by referring to variable set labels rather than variable names, as done below. But, when doing this, keep in mind that any automatically constructed SUM or NET variables will be in the calculation. This is fine for working out flatlining (as in this example), but will lead to double-counting in other situations (e.g., if computing a sum or average).
apply(`Q2 - No. of colas consumed`, 1, function(x) length(unique(x)) == 1)
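The double-counting risk can be sketched on a small made-up table. The column names and values are hypothetical, and the sketch assumes the variable set behaves like a matrix in which the automatically constructed total appears as an extra column called SUM.

```r
# Hypothetical stand-in for a variable set that carries an automatic SUM column
colas <- cbind(q2a = c(1, 2), q2b = c(3, 4), SUM = c(4, 6))

# Summing every column counts each answer twice
double_counted <- rowSums(colas)

# Dropping the SUM column first gives the intended totals
inputs_only   <- colas[, colnames(colas) != "SUM", drop = FALSE]
correct_total <- rowSums(inputs_only)

print(double_counted)
print(correct_total)
```

If the total columns really are passed through, the same filtering may be worth applying before a uniqueness check too, since a total will usually differ from the individual answers.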
This section returns to basics and looks at all the steps that go into recoding a numeric variable into a categorical variable. In this example, we will illustrate various aspects of how the program works by recoding age into a new variable with four categories. If all you are really wanting to do is recode, there is a much better way: see How to Recode into Existing or New Variables.
- Create a table by dragging the variable onto the page. This shows us the labels that we need to reference in our code.
- Insert > New R > Numeric Variable, which will cause a new variable to appear in the Data Sets tree on the left side of the screen.
- Type or copy and paste the code shown below into INPUTS > R CODE (on the right of the screen) and click CALCULATE (at the top-right of the screen).
- Check the new variable by cross-tabbing it with the original variable. That is, drag the new variable (probably called newvariable) over the original table, releasing it in the Columns slot. You will see the values that have been recoded to each of the categories, showing as averages.
- Click back on the new variable in the Data Sets tree, and give it an appropriate Label and Name (top-right of the screen; e.g., Age groupings, and age, respectively).
- Optional: change the structure of the data so that it is categorical, by setting INPUTS > Structure to Nominal: Mutually exclusive categories (at the bottom) and set the labels by clicking DATA VALUES > Labels.
Looking at the code above, note that:
- For a single category, we use the == operator.
- For multiple categories, we list them surrounded by c() and use the %in% operator.
- The values are assigned at the end of the line, after a ~.
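As a rough sketch of what such a recode can look like, the code below uses case_when from the dplyr package on a made-up age variable. This is not the post's original code. The category labels beyond 18 to 24 and 25 to 29 are hypothetical, as is the choice of four groups, and the library call is only needed if dplyr is not already loaded.

```r
library(dplyr)

# Hypothetical categorical age variable
Age <- factor(c("18 to 24", "30 to 34", "45 to 49", "65 or more", "25 to 29"))

age_group <- case_when(
  Age == "18 to 24"                                ~ 1,  # single category
  Age %in% c("25 to 29", "30 to 34", "35 to 39")   ~ 2,  # several categories
  Age %in% c("40 to 44", "45 to 49", "50 to 54")   ~ 3,
  Age %in% c("55 to 64", "65 or more")             ~ 4
)

print(age_group)
```

Each line reads as "condition ~ value", and the first condition that is true for an observation determines its value.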
Automatic updating: benefits and gotchas
When your original data updates, the code is automatically re-run. This is mainly a good thing. However, if you merge the categories of the input age variable, the labels referenced in the code will no longer match, which will cause problems for the new variable. Here are two ways to avoid this:
- Duplicate the original variable (Home > Duplicate) and merge its categories.
- Modify the code to use the label of the merged categories.
In R, the way you write "not" (as in, "not under 40") is to use an exclamation mark (
!). So, we can write:
Variable labels containing punctuation
Rather than typing variable labels, we can drag them from the data set into the R code. Where the variable label contains punctuation, it will be surrounded by backticks, which look a bit like an apostrophe. On my keyboard, the backtick key is above the Tab key.
Using variable names
When you hover over a variable in the Data Sets tree, you will see a preview which includes its name. In my data set, "living arrangement" has a variable name of d4, and we can refer to that in the code as well in place of the label.
You can also use the or operator, which is a pipe (i.e., a single vertical line). On my keyboard, I hold down the shift key and click the button above Enter to get the pipe.
In this example, note that I've used parentheses around the expression that is preceded by the not operator (
!), as otherwise it would be read as "not living with partner and children or living with children only", rather than "not(living with partner and children or living with children only)."
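The effect of the parentheses can be checked with a small sketch. The variable name living and the label Living alone are hypothetical; the other two labels follow the wording above.

```r
# Hypothetical living-arrangement answers
living <- c("Living with partner and children",
            "Living with children only",
            "Living alone")

# Intended meaning: NOT (either of the two "with children" categories)
no_kids <- !(living == "Living with partner and children" |
             living == "Living with children only")

# Without parentheses, ! only applies to the first comparison
no_parens <- !living == "Living with partner and children" |
             living == "Living with children only"

print(no_kids)
print(no_parens)
```

The second respondent lives with children only, yet the version without parentheses returns TRUE for them, which is the misreading described above.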
In the example above, line 3 is a very verbose way of writing "everybody else". We can instead use the code snippet below. The
case_when function evaluates each expression in turn, so when it gets to line 3, R reads this as "everybody else" or "other".
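One common way to express "everybody else" in case_when is a final condition of TRUE, which matches any observation not already matched. The sketch below assumes that is the idea being described; it is not the post's original code. The variable name, labels, and values are hypothetical, and the library call is only needed if dplyr is not already loaded.

```r
library(dplyr)

# Hypothetical living-arrangement answers
living <- c("Living with partner and children",
            "Living with children only",
            "Living alone",
            "Sharing accommodation")

household <- case_when(
  living == "Living with partner and children" ~ 1,
  living == "Living with children only"        ~ 2,
  TRUE                                         ~ 3   # everybody else
)

print(household)
```

Because conditions are evaluated in order, the TRUE line has to come last; placed first, it would capture every observation.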
Missing values (NA)
If our categories are not exhaustive, we will end up with missing values. For example, this code creates a variable with a 1 for people with children and missing values for others.
Recoding after creating the R variable
It might look like the missing values caused by the example above are a mistake. But it can be an efficient way to work because you can later recode the variable using Displayr's GUI. Simply click DATA VALUES > Values, change the Missing data in the Missing Values setting to Include in analyses, and set your desired value in the Value field.
The example below uses the and operator,
&, to compute a respondent's family life stage. The green bits, preceded by a
#, are optional comments which help make the code easier to understand.
Temporary variables within the code used to create a variable
A much nicer way of computing a household structure variable is shown in the code below. This approach initially creates four variables as inputs to the main variable of interest, and these variables are not accessible anywhere else in Displayr. They exist for the sole purpose of computing household structure.
Line 1 computes a variable that contains TRUE and
FALSE values for each row of data, as do lines 2 through 4. Then,
case_when evaluates these using standard Boolean logic for each row of data.
What makes this better code? It improves on the earlier example because:
- Calculations are performed once. In the earlier example, the definition of younger appeared six times, but in this example, it only appears once.
- It is simpler to read.
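The pattern can be sketched as follows. This is a simplified illustration rather than the post's original code: it uses two temporary variables instead of four, and the variable names other than younger, the labels, and the data are all hypothetical. The library call is only needed if dplyr is not already loaded.

```r
library(dplyr)

# Hypothetical inputs
age    <- c(25, 52, 33, 61)
living <- c("Living alone",
            "Living with partner and children",
            "Living with children only",
            "Living with partner only")

# Temporary variables: each definition is written exactly once
younger  <- age < 40
children <- living %in% c("Living with partner and children",
                          "Living with children only")

# Main variable, built from the temporary ones with & and !
life_stage <- case_when(
  younger  & !children ~ "Younger, no children",
  younger  & children  ~ "Younger, with children",
  !younger & children  ~ "Older, with children",
  TRUE                 ~ "Older, no children"
)

print(life_stage)
```

If the cut-off for younger ever changes, only one line needs editing.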
Earlier we looked at this example:
A much shorter way of writing it is to use ifelse.
You can nest these if you wish, as shown below. The use of two lines and the spacing is a matter of personal preference; they are not required.
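Assuming the shorter form meant here is base R's ifelse function, a single and a nested version might look like the sketch below. The ages, cut-offs, and labels are hypothetical.

```r
# Hypothetical numeric ages
age <- c(22, 45, 70)

# One test, two outcomes
two_groups <- ifelse(age < 40, "Under 40", "40 or more")

# Nested: the "no" branch of the first test is itself another ifelse
three_groups <- ifelse(age < 40, "Under 40",
                       ifelse(age < 65, "40 to 64", "65 or more"))

print(two_groups)
print(three_groups)
```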
Using the numeric values of variables in computations
It can be more convenient to refer to values rather than labels when doing computations. But there's a good way and a bad way to do this. I'm going to start with the bad way because it is an obvious (but not the smartest) approach for many people new to writing code using R (particularly those used to SPSS).
The example below uses
as.numeric to convert the categorical data into numeric data. A value of 1 is automatically assigned to the first label, a value of 2 to the second, and so on. These values will not necessarily match the values that have been set in the raw data file. For example, if the data file contains values of 1 Male and 2 Female, but no respondent selected male, then the value of 1 would be assigned to Female.
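The Male and Female problem is easy to reproduce in plain R. The sketch below builds a made-up categorical variable in which nobody selected Male.

```r
# Hypothetical data: the raw file codes 1 = Male, 2 = Female,
# but every respondent in this sample is Female
gender <- factor(c("Female", "Female", "Female"))

# Only one label is present, so it becomes the first (and only) level
print(levels(gender))

# as.numeric() returns positions in the level list, not the raw values
codes <- as.numeric(gender)
print(codes)
```

Every respondent ends up with a 1, even though the raw data file stores Female as 2.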
The safer way to work is to click on the variable set, and then select a numeric structure from Inputs > Structure (on the right side of the screen). For example, you would change the age variable to a structure of Numeric. Or, better yet, first duplicate the variable (Home > Duplicate), and then change the structure of the duplicate so that the original variable remains unchanged.
In my example, the age variable in the data has midpoints assigned to each category (e.g., 21 for 18 to 24, 27 for 25 to 29, etc.). You can see these by clicking on the variable and selecting DATA VALUES > Values on the right of the screen.
An alternative approach to recoding is to use subscripting, as done below. Why this works is actually a little complex, but it does work!
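I can't reproduce the original snippet here, but one common form of recoding by subscripting looks like the sketch below. It is an assumption about the general technique, using hypothetical categories and codes.

```r
# Hypothetical categorical age variable with an explicit category order
Age <- factor(c("18 to 24", "40 to 44", "25 to 29"),
              levels = c("18 to 24", "25 to 29", "40 to 44"))

# One new code per category, in the same order as the categories
new_codes <- c(1, 1, 2)

# Each observation's category position picks out its new code
recoded <- new_codes[as.numeric(Age)]

# A variant that looks codes up by label rather than by position
lookup         <- c("18 to 24" = 1, "25 to 29" = 1, "40 to 44" = 2)
recoded_by_lab <- unname(lookup[as.character(Age)])

print(recoded)
print(recoded_by_lab)
```

The positional version depends on the order of the categories, so the label-based lookup is the safer of the two if the categories might change.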
Mathematical operations on categorical variables
This next approach is a wonderful time saver, but is a little harder on the brain.
Earlier we looked at recoding age into two categories in a few different ways, including via an ifelse.
The code below does the same thing. Let's unpack it:
- `Age 2` is the numeric version of age, created in the way described in the previous section.
- `Age 2` >= 40 creates a variable with a TRUE value for people with an age of 40 or more, and FALSE for people under 40.
- + 1 adds a 1 to the TRUE and FALSE values. This may seem odd, but it is a standard thing in computing: when you use a TRUE or a FALSE in calculations, the TRUE is treated as a 1 and the FALSE as a 0.
- The parentheses tell us to first compute the TRUE and FALSE. Without them, the analysis would then be checking to see who is aged 41 or more.
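Here is the same idea on a few made-up ages, with and without the parentheses. The values are hypothetical; `Age 2` is the name used above.

```r
# Hypothetical numeric ages (backticks are needed because of the space)
`Age 2` <- c(21, 40, 42)

# TRUE/FALSE is computed first, then 1 is added: under 40 -> 1, 40 or more -> 2
age_group <- (`Age 2` >= 40) + 1

# Without parentheses, R adds first and then compares against 41
no_parens <- `Age 2` >= 40 + 1

print(age_group)
print(no_parens)
```

The respondent aged exactly 40 shows the difference: they fall in group 2 in the first version but fail the test in the second.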
This next example can be particularly useful. This code creates 18 categories representing all the combinations of age and gender, where:
- as.numeric(Age) converts the categorical variable into numeric values, as described above in the "bad approach" sub-section. This means that the youngest category gets a value of 1, the second a value of 2, etc.
- max(as.numeric(Age)) * (as.numeric(Gender) - 1) assigns a value of 0 to Males and 9 to Females, where the 9 is the number of age categories.
- By adding the two together, we get values of 1 through 9 for the age categories of males, and 10 through 18 for females.
- If your goal is to create a new variable to use in tables, a better approach is Insert > New Banner.
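The arithmetic is easier to follow on a scaled-down example. The sketch below uses two age categories instead of nine, so it produces four combinations rather than 18. The data is hypothetical.

```r
# Hypothetical categorical variables with explicit category orders
Age    <- factor(c("18 to 24", "25 to 29", "18 to 24", "25 to 29"),
                 levels = c("18 to 24", "25 to 29"))
Gender <- factor(c("Male", "Male", "Female", "Female"),
                 levels = c("Male", "Female"))

# Offset: 0 for Males, number of age categories (here 2) for Females
offset <- max(as.numeric(Age)) * (as.numeric(Gender) - 1)

# Age position plus offset gives a distinct code per combination
combo <- as.numeric(Age) + offset

print(combo)
```

One caveat: max(as.numeric(Age)) is the highest category position that actually occurs in the data, so it only equals the number of age categories when at least one respondent is in the last category. Using nlevels(Age) instead would avoid relying on that.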
Returning to our household structure example, we can write it as:
When you insert an R variable, you get a preview of the resulting values whenever you click CALCULATE. However, if doing anything remotely complicated, it is usually a good idea to:
- First check the code by creating an R OUTPUT (Insert > R Output), as these are better for debugging.
- Click on the R Output and check Inputs > OUTPUT > Show raw R Output, which will show all the steps in processing the code, line by line.
- Use R functions like summary and table to show the values of intermediate calculations, as shown in the example below.
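As a rough sketch of the kind of checking meant here, the code below prints summaries of the input and of each intermediate step. The data and variable names are hypothetical.

```r
# Hypothetical numeric ages, including one missing value
age <- c(21, 27, 42, 67, NA)

# Check the input before using it
print(summary(age))

# Intermediate step: who is 40 or more?
older <- age >= 40
print(table(older, useNA = "ifany"))

# Final variable: 1 = under 40, 2 = 40 or more
age_group <- older + 1
print(table(age_group, useNA = "ifany"))
```

Adding useNA = "ifany" makes table show missing values, which it would otherwise leave out silently.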