How to Calculate Summary Statistics in R Using dplyr

The dplyr package in R is a powerful tool for summarizing data. Its summarise() function lets you quickly calculate summary statistics such as the mean, median, standard deviation, and quartiles, collapsing a data frame into a single row that holds one value per statistic. Combined with group_by(), it can also calculate the same statistics separately for each group defined by one or more variables.


You can use the following syntax to calculate summary statistics for all numeric variables in a data frame in R using functions from the dplyr package:

library(dplyr)
library(tidyr)

df %>% summarise(across(where(is.numeric), .fns = 
                     list(min = min,
                          median = median,
                          mean = mean,
                          stdev = sd,
                          q25 = ~quantile(., 0.25),
                          q75 = ~quantile(., 0.75),
                          max = max))) %>%
  pivot_longer(everything(), names_sep='_', names_to=c('variable', '.value'))

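The summarise() function comes from the dplyr package and is used to calculate summary statistics for variables. Here it is combined with across(where(is.numeric), ...), which applies every function in the list to each numeric column.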

The pivot_longer() function comes from the tidyr package and is used to format the output to make it easier to read.

This particular syntax calculates the following summary statistics for each numeric variable in a data frame:

  • Minimum value
  • Median value
  • Mean value
  • Standard deviation
  • 25th percentile
  • 75th percentile
  • Maximum value
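
As a rough sketch of how the two steps fit together, the snippet below runs them separately on a small made-up data frame (the name toy and its values are assumptions for illustration, not part of the example in this article). It relies on the default behavior of across(), which names each output column as the variable name, an underscore, and the statistic name.

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame used only for illustration
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(10, 20, 30, 40))

# Step 1: summarise() alone returns a single wide row,
# with one column per variable/statistic combination
wide <- toy %>%
  summarise(across(where(is.numeric), list(min = min, max = max)))

names(wide)
# "x_min" "x_max" "y_min" "y_max"

# Step 2: pivot_longer() splits each column name at the underscore.
# The first piece goes into the 'variable' column and the second piece
# ('.value') becomes the name of a statistic column.
long <- wide %>%
  pivot_longer(everything(), names_sep = '_', names_to = c('variable', '.value'))

long
```

The result has one row per variable (x and y) and one column per statistic (min and max). Because the column names are split at the underscore, this approach assumes that the original variable names contain no underscores themselves.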

The following example shows how to use this syntax in practice.

Example: Calculate Summary Statistics in R Using dplyr

Suppose we have the following data frame in R that contains information about various basketball players:

#create data frame
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(12, 15, 19, 14, 24, 25, 39, 34),
                 assists=c(6, 8, 8, 9, 12, 6, 8, 10),
                 rebounds=c(9, 9, 8, 10, 8, 4, 3, 3))

#view data frame
df

  team points assists rebounds
1    A     12       6        9
2    A     15       8        9
3    A     19       8        8
4    A     14       9       10
5    B     24      12        8
6    B     25       6        4
7    B     39       8        3
8    B     34      10        3

We can use the following syntax to calculate summary statistics for each numeric variable in the data frame:

library(dplyr)
library(tidyr)

#calculate summary statistics for each numeric variable in data frame
df %>% summarise(across(where(is.numeric), .fns = 
                     list(min = min,
                          median = median,
                          mean = mean,
                          stdev = sd,
                          q25 = ~quantile(., 0.25),
                          q75 = ~quantile(., 0.75),
                          max = max))) %>%
  pivot_longer(everything(), names_sep='_', names_to=c('variable', '.value'))

# A tibble: 3 x 8
  variable   min median  mean stdev   q25   q75   max
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 points      12   21.5 22.8   9.74 14.8  27.2     39
2 assists      6    8    8.38  2.00  7.5   9.25    12
3 rebounds     3    8    6.75  2.92  3.75  9       10

From the output we can see:

  • The minimum value in the points column is 12.
  • The median value in the points column is 21.5.
  • The mean value in the points column is 22.8.

And so on.
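
The same approach can be extended to grouped summaries. The following is a minimal sketch, assuming we want the mean and standard deviation of each numeric variable separately for each team. The choice of statistics and the name by_team are illustrative assumptions.

```r
library(dplyr)

# Recreate the data frame from the example so the snippet runs on its own
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(12, 15, 19, 14, 24, 25, 39, 34),
                 assists=c(6, 8, 8, 9, 12, 6, 8, 10),
                 rebounds=c(9, 9, 8, 10, 8, 4, 3, 3))

# group_by() splits the rows by team, so summarise() returns
# one row per team instead of a single row for the whole data frame
by_team <- df %>%
  group_by(team) %>%
  summarise(across(where(is.numeric), list(mean = mean, stdev = sd)))

by_team
```

The result contains one row for team A and one for team B, with columns such as points_mean and points_stdev. For instance, team A averages 15 points and team B averages 30.5 points.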

Note: In this example, we used the dplyr across() function. You can find the complete documentation for this function in the dplyr package reference, or by running ?across in R after loading dplyr.
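
One caveat worth noting: names_sep='_' splits a column name at every underscore, so the reshaping step will not split names as intended when the original variable names already contain underscores (for example total_points). One possible workaround, sketched below on a hypothetical data frame df2, is to use the names_pattern argument of pivot_longer() with a regular expression that splits only at the last underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame whose column names contain underscores
df2 <- data.frame(total_points = c(12, 15, 19, 14),
                  total_assists = c(6, 8, 8, 9))

res <- df2 %>%
  summarise(across(where(is.numeric),
                   list(min = min, mean = mean, max = max))) %>%
  # The first (.*) is greedy, so the match splits at the LAST underscore:
  # 'total_points_min' gives variable = 'total_points' and statistic = 'min'
  pivot_longer(everything(),
               names_pattern = '(.*)_(.*)',
               names_to = c('variable', '.value'))

res
```

This works as long as the names given to the statistics in the list (min, mean, max here) do not contain underscores themselves.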
