How to Aggregate Data in R (With Examples)

Need to group and summarize data in R? The `aggregate()` function is one of the easiest ways to calculate statistics, such as the mean, sum, or count, across categories. In this guide, we’ll walk through how to aggregate data in R using clear, practical examples. Whether you’re summarizing sales by region or analyzing survey data by demographic, you’ll learn how to use the `aggregate()` function step-by-step.


The process involves two stages. First, collate individual cases of raw data together with a grouping variable. Second, perform whichever calculation you want on each group of cases. These two stages are wrapped into a single function.

To perform aggregation, we need to specify three things in the code:

  • The data that we want to aggregate
  • The variable to group by within the data
  • The calculation to apply to the groups (what you want to find out)

What Is Data Aggregation in R?

Data aggregation in R is the process of summarizing data by grouping it based on one or more variables. It’s commonly used to calculate statistics like the mean, sum, or count for each group—for example, finding average sales by region.

The most popular tool for this in base R is the aggregate() function. It lets you apply a summary function to each group in your dataset.

Example:

aggregate(Sales ~ Region, data = sales_data, FUN = mean)

This groups the data by region and calculates the average sales per group. Aggregation helps simplify complex datasets and uncover meaningful patterns.
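The one-line example above assumes a data frame called sales_data already exists. As a minimal sketch, the version below builds a small, made-up sales_data (the regions and figures are invented for illustration, not taken from a real dataset) so the call can be run as is:

```r
# Hypothetical sales data; the regions and values are invented for illustration.
sales_data <- data.frame(
  Region = c("North", "North", "South", "South", "South"),
  Sales  = c(100, 200, 150, 250, 350)
)

# Average Sales within each Region, using formula notation.
avg_sales <- aggregate(Sales ~ Region, data = sales_data, FUN = mean)
print(avg_sales)
# North averages 150 and South averages 250.
```

The result is itself a data frame with one row per region, so it can be passed on to further analysis or plotting.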

Example: How to Aggregate Data in R with Sample Dataset

The raw data shown below consists of one row per case. Each case is an employee at a restaurant.

Load the example data by running the following R code:

 
library(flipAPI)
data = DownloadXLSX("https://wiki.q-researchsoftware.com/images/1/1b/Aggregation_data.xlsx", want.row.names = FALSE, want.data.frame = TRUE)

Perform aggregation with the following R code.

agg = aggregate(data,
                by = list(data$Role),
                FUN = mean)

This produces a table of the average salary and age by role, as below.


How to Use the aggregate() Function in R

The first argument to the function is usually a data.frame.

The by argument is a list of variables to group by. This must be a list even if there is only one variable, as in the example.

The FUN argument is the function which is applied to all columns (i.e., variables) in the grouped data. Because we cannot calculate the average of categorical variables such as Name and Shift, R returns NA for those columns (with a warning). I have removed them for clarity.
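One way to avoid the NA columns altogether is to pass only the numeric columns to aggregate(). The sketch below uses a small, hypothetical stand-in for the restaurant data (the staff data frame and all of its values are invented, though the column names follow the ones mentioned in this post), so it runs without downloading the example file:

```r
# Hypothetical stand-in for the restaurant data; names and values are invented.
staff <- data.frame(
  Name   = c("Ann", "Bob", "Cy", "Dee", "Eve"),
  Role   = c("Chef", "Chef", "Waiter", "Waiter", "Waiter"),
  Shift  = c("Day", "Night", "Day", "Day", "Night"),
  Salary = c(50000, 54000, 30000, 32000, 34000),
  Age    = c(40, 36, 22, 25, 31)
)

# Keep only the numeric columns so mean() is never applied to text.
numeric_cols <- sapply(staff, is.numeric)

# Naming the list element labels the grouping column "Role"
# instead of the default "Group.1".
agg <- aggregate(staff[, numeric_cols],
                 by = list(Role = staff$Role),
                 FUN = mean)
print(agg)
```

Naming the element of the by list is optional, but it tends to make the output easier to read than the default Group.1 label.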

Alternative Aggregation Functions in R: Beyond the Mean

Any function that can be applied to a numeric variable can be used within aggregate. Maximum, minimum, count, standard deviation and sum are all popular.
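As a rough illustration, the sketch below swaps different summary functions into the same call. The staff data frame is hypothetical and its values are invented. Here length is used as the "count" function, which is one common choice in base R.

```r
# Hypothetical data; values are invented for illustration.
staff <- data.frame(
  Role   = c("Chef", "Chef", "Waiter", "Waiter", "Waiter"),
  Salary = c(50000, 54000, 30000, 32000, 34000)
)

groups <- list(Role = staff$Role)

# Same data and grouping each time; only FUN changes.
max_salary   <- aggregate(staff["Salary"], by = groups, FUN = max)
n_staff      <- aggregate(staff["Salary"], by = groups, FUN = length)  # count of cases
sd_salary    <- aggregate(staff["Salary"], by = groups, FUN = sd)
total_salary <- aggregate(staff["Salary"], by = groups, FUN = sum)

print(max_salary)
print(n_staff)
```

Note that the count appears in a column still named Salary, because aggregate() keeps the name of the column being summarized.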

For more specific purposes, it is also possible to write your own function in R and refer to that within aggregate. I’ve demonstrated this below where the second largest value of each group is returned, or the largest if the group has only one case. Note also that the groups are formed by Role and by Shift together.

# Return the second largest value, or the only value if the group has one case.
second = function(x) {
    if (length(x) == 1)
        return(x)
    return(sort(x, decreasing = TRUE)[2])
}

agg = aggregate(data,
                by = list(data$Role, data$Shift),
                FUN = second)
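To see what this custom function returns without the downloaded file, here is a minimal sketch on a hypothetical staff data frame (all values invented). It repeats the definition of second so that the block runs on its own, and aggregates only the Salary column.

```r
# Second largest value of x, or the only value if there is just one.
second <- function(x) {
  if (length(x) == 1) {
    return(x)
  }
  sort(x, decreasing = TRUE)[2]
}

# Hypothetical data; values are invented for illustration.
staff <- data.frame(
  Role   = c("Chef", "Chef", "Chef", "Waiter", "Waiter", "Waiter"),
  Shift  = c("Day", "Day", "Night", "Day", "Day", "Day"),
  Salary = c(50000, 54000, 52000, 30000, 32000, 34000)
)

# Groups are formed by every observed combination of Role and Shift.
agg <- aggregate(staff["Salary"],
                 by = list(Role = staff$Role, Shift = staff$Shift),
                 FUN = second)
print(agg)
```

In this made-up data the Chef/Night group has a single case, so its own salary is returned, while the Chef/Day and Waiter/Day groups return their second highest salaries.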

Advanced Features of the aggregate() Function in R

The aggregate function has a few more features to be aware of:

  • Grouping variable(s) and variables to be aggregated can be specified with R’s formula notation.
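  • Setting drop = TRUE (the default) means that any groups with zero count are removed.
  • na.action controls the treatment of missing values within the data when formula notation is used. The default, na.omit, drops rows with missing values before aggregating.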
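The three features listed above can be sketched as follows. The staff data frame is hypothetical (values invented), with one missing Salary and no Waiter on the Night shift, so that na.action and drop each have something to act on.

```r
# Hypothetical data; values are invented for illustration.
staff <- data.frame(
  Role   = c("Chef", "Chef", "Waiter", "Waiter", "Waiter"),
  Shift  = c("Day", "Night", "Day", "Day", "Day"),
  Salary = c(50000, NA, 30000, 32000, 34000),
  Age    = c(40, 36, 22, 25, 31)
)

# Formula notation: aggregate Salary, grouped by Role.
# The default na.action = na.omit drops the row with the missing Salary first.
mean_omit <- aggregate(Salary ~ Role, data = staff, FUN = mean)

# na.pass keeps the missing value, so the Chef mean becomes NA.
mean_pass <- aggregate(Salary ~ Role, data = staff, FUN = mean,
                       na.action = na.pass)

# drop = TRUE (the default) returns only observed Role/Shift combinations;
# drop = FALSE also returns the unobserved Waiter/Night combination.
groups <- list(Role = staff$Role, Shift = staff$Shift)
age_dropped <- aggregate(staff["Age"], by = groups, FUN = mean)
age_kept    <- aggregate(staff["Age"], by = groups, FUN = mean, drop = FALSE)

print(mean_omit)
print(mean_pass)
print(age_kept)
```

With drop = FALSE, the extra row comes from crossing the observed values of each grouping variable, which is why two grouping variables are used in this sketch.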

FAQs About Data Aggregation in R

How to Use the Aggregate Function in R?

The aggregate() function in R lets you group your data by one or more variables and apply summary functions like mean(), sum(), or custom calculations. The syntax is:

aggregate(x = data,
          by = list(data$GroupVar),
          FUN = mean)

What is data aggregation in R?

Data aggregation in R refers to the process of summarizing or combining data based on groupings. It’s commonly used to compute statistics (e.g., mean, sum, count) for subsets of data grouped by one or more variables, using functions like aggregate(), tapply(), or dplyr::summarize().

What’s the difference between aggregate() and tapply()?

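aggregate() is used for data frames and allows grouping by multiple variables. tapply() works on a single vector and returns an array rather than a data frame. It can also group by several factors, but it is most convenient for simple, one-variable groupings. Both apply a function to grouped data but differ in input and output structure.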
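A small side-by-side sketch, using a hypothetical staff data frame with invented values, shows the difference in what each function returns:

```r
# Hypothetical data; values are invented for illustration.
staff <- data.frame(
  Role   = c("Chef", "Chef", "Waiter", "Waiter", "Waiter"),
  Salary = c(50000, 54000, 30000, 32000, 34000)
)

# tapply(): takes a vector and a grouping vector, returns a named array.
by_role_array <- tapply(staff$Salary, staff$Role, mean)

# aggregate(): takes a data frame, returns a data frame.
by_role_df <- aggregate(Salary ~ Role, data = staff, FUN = mean)

print(by_role_array)
print(by_role_df)
```

The numbers are the same in both cases. The choice mostly depends on whether a named array or a data frame is more convenient for the next step.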

Can I aggregate multiple variables at once in R?

Yes. You can put multiple columns on the left-hand side of a formula in aggregate() by using the cbind() function, or work with dplyr::group_by() and summarize() to aggregate multiple variables more cleanly.
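As a minimal base R sketch of the cbind() approach (the staff data frame and its values are hypothetical):

```r
# Hypothetical data; values are invented for illustration.
staff <- data.frame(
  Role   = c("Chef", "Chef", "Waiter", "Waiter", "Waiter"),
  Salary = c(50000, 54000, 30000, 32000, 34000),
  Age    = c(40, 36, 22, 25, 31)
)

# cbind() on the left-hand side aggregates Salary and Age in one call.
agg <- aggregate(cbind(Salary, Age) ~ Role, data = staff, FUN = mean)
print(agg)
```

The output has one row per role and one column for each aggregated variable.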

What packages are best for data aggregation in R?

Besides base R’s aggregate(), popular packages include dplyr, data.table, and plyr. dplyr is widely used for its readability and chaining capabilities.


TRY IT OUT
The analysis in this post was performed in Displayr using R. You can repeat or amend this analysis for yourself in Displayr.


 
