You have a data set and you need to quickly organize it to perform your data analysis. Where do you start? You could create a table of statistics which summarizes data by aggregating it. In this post, I'll show you how you can easily aggregate data in Displayr.  

data aggregateThe process involves two stages. First, collate individual cases of raw data together with a grouping variable. Second, perform which calculation you want on each group of cases. These two stages are wrapped into a single function.

To perform aggregation, we need to specify three things in the code:

  • The data that we want to aggregate
  • The variable to group by within the data
  • The calculation to apply to the groups (what you want to find out)

An example of aggregating data

The raw data shown below consists of one row per case. Each case is an employee at a restaurant.

Say you want to sort your data by the average salary and age per role. You can perform aggregation with the following R code.

In Displayr, navigate to Insert > R Output to type or paste in the code.

agg = aggregate(data,
                by = list(data$Role),
                FUN = mean)

This produces the following table, allowing you to easily see the average salary and age of each group.

You can easily replicate this table yourself

The first argument to the function is a table. In Displayr, this could be pasted in with Home > Data > Paste Table or created from a data set with Insert > More > Tables > Raw Data.

In Displayr, load the example data by running the following in a R output:

 
library(flipAPI)
data = DownloadXLSX("https://wiki.q-researchsoftware.com/images/1/1b/Aggregation_data.xlsx", want.row.names = FALSE, want.data.frame = TRUE)

The by argument is a list of variables to group by. This must be a list even if there is only one variable. For example, in our data set, the by argument is role.

The FUN argument is the function which is applied to all columns (i.e., variables) in the grouped data. In this case, we want to find out the mean or average so this is our function. As we cannot calculate the average of categorical variables such as Name and Shift, they result in empty columns. These are removed for clarity.

Aggregating your own data and other aggregation functions

Any function that can be applied to a numeric variable can be done within aggregate. Maximum, minimum, count, standard deviation and sum are some of the most popular ways of aggregating data.

But what if you want to do something else with your data? For more specific purposes, it is also possible to write your own function in R and refer to that within aggregate. Let's say you want to discover the story of the second largest group. This is demonstrated below where the second largest value of each group is returned, or the largest if the group has only one case. Note also that the groups are formed by Role and by Shift together.

second = function(x) {
            if (length(x) == 1)
                return(x)
            return(sort(x, decreasing = TRUE)[2])}

agg = aggregate(data,
                by = list(data$Role, data$Shift),
                FUN = second)

Additional features

The aggregate function has a few more features to be aware of:

  • Grouping variable(s) and variables to be aggregated can be specified with R’s formula notation.
  • Setting drop = TRUE means that any groups with zero count are removed.
  • na.action controls the treatment of missing values within the data.

TRY IT OUT
The analysis in this post was performed in Displayr using R. You can repeat or amend this analysis for yourself in Displayr. Don't have Displayr yet? Start your free trial here.