The dplyr package in R is a powerful tool for summarizing data. Combined with base R functions such as mean(), median(), sd(), and quantile(), it lets you quickly calculate summary statistics such as the mean, median, standard deviation, and quartiles. The summarise() function collapses a data frame into a single row that holds each statistic you request. Additionally, the group_by() function lets you group your data by specific variables and calculate the same summary statistics for each group.
You can use the following syntax to calculate summary statistics for all numeric variables in a data frame in R using functions from the dplyr and tidyr packages:
    library(dplyr)
    library(tidyr)

    df %>%
      summarise(across(where(is.numeric), .fns =
                         list(min = min,
                              median = median,
                              mean = mean,
                              stdev = sd,
                              q25 = ~quantile(., 0.25),
                              q75 = ~quantile(., 0.75),
                              max = max))) %>%
      pivot_longer(everything(), names_sep='_', names_to=c('variable', '.value'))
The summarise() function comes from the dplyr package and is used to calculate summary statistics for variables.
The pivot_longer() function comes from the tidyr package and is used to format the output to make it easier to read.
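To make the role of each function concrete, the sketch below runs the two steps separately. The data frame `toy` and its columns `x` and `y` are hypothetical and exist only for illustration. The sketch assumes column names without underscores, because `names_sep='_'` splits at every underscore and a column such as `field_goals` would be split in the wrong place.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-column data frame, used only for illustration
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))

# Step 1: summarise() with across() returns a single wide row
# whose columns are named <column>_<statistic>
wide <- toy %>%
  summarise(across(where(is.numeric), list(min = min, max = max)))
names(wide)

# Step 2: pivot_longer() splits each name at '_'; the first piece goes
# into the 'variable' column and the second piece ('.value') becomes
# the name of a statistic column
long <- wide %>%
  pivot_longer(everything(), names_sep = '_', names_to = c('variable', '.value'))
long
```

Here `wide` has one row and four columns (x_min, x_max, y_min, y_max), while `long` has one row per variable and one column per statistic, which is usually easier to scan.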
This particular syntax calculates the following summary statistics for each numeric variable in a data frame:
- Minimum value
- Median value
- Mean value
- Standard deviation
- 25th percentile
- 75th percentile
- Maximum value
The following example shows how to use this syntax in practice.
Example: Calculate Summary Statistics in R Using dplyr
Suppose we have the following data frame in R that contains information about various basketball players:
    #create data frame
    df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                     points=c(12, 15, 19, 14, 24, 25, 39, 34),
                     assists=c(6, 8, 8, 9, 12, 6, 8, 10),
                     rebounds=c(9, 9, 8, 10, 8, 4, 3, 3))

    #view data frame
    df

      team points assists rebounds
    1    A     12       6        9
    2    A     15       8        9
    3    A     19       8        8
    4    A     14       9       10
    5    B     24      12        8
    6    B     25       6        4
    7    B     39       8        3
    8    B     34      10        3
We can use the following syntax to calculate summary statistics for each numeric variable in the data frame:
    library(dplyr)
    library(tidyr)

    #calculate summary statistics for each numeric variable in data frame
    df %>%
      summarise(across(where(is.numeric), .fns =
                         list(min = min,
                              median = median,
                              mean = mean,
                              stdev = sd,
                              q25 = ~quantile(., 0.25),
                              q75 = ~quantile(., 0.75),
                              max = max))) %>%
      pivot_longer(everything(), names_sep='_', names_to=c('variable', '.value'))

    # A tibble: 3 x 8
      variable   min median  mean stdev   q25   q75   max
    1 points      12   21.5 22.8   9.74 14.8  27.2     39
    2 assists      6    8    8.38  2.00  7.5   9.25    12
    3 rebounds     3    8    6.75  2.92  3.75  9       10
From the output we can see:
- The minimum value in the points column is 12.
- The median value in the points column is 21.5.
- The mean value in the points column is 22.8.
And so on.
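As a quick sanity check, the same figures can be reproduced with base R functions applied directly to one column. The sketch below copies the points values from the example into a standalone vector named `points`. Note that a tibble prints roughly three significant digits by default, which is why the exact mean of 22.75 appears as 22.8 in the output.

```r
# points column copied from the example data frame
points <- c(12, 15, 19, 14, 24, 25, 39, 34)

min(points)             # 12
median(points)          # 21.5
mean(points)            # 22.75 (printed as 22.8 in the tibble)
sd(points)              # about 9.74
quantile(points, 0.25)  # 14.75 (printed as 14.8 in the tibble)
max(points)             # 39
```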
Note: In this example, we used the across() function from dplyr. You can find the complete documentation for this function in the dplyr reference, or by running ?across in R.
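The introduction mentions that dplyr can also calculate summary statistics for each group. A minimal sketch of that idea, using the same basketball data and grouping by team, might look like the following. The choice of mean and max is only for illustration, and the result name `by_team` is hypothetical.

```r
library(dplyr)

# same data frame as in the example above
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(12, 15, 19, 14, 24, 25, 39, 34),
                 assists=c(6, 8, 8, 9, 12, 6, 8, 10),
                 rebounds=c(9, 9, 8, 10, 8, 4, 3, 3))

# group_by() splits the rows by team; summarise() then returns
# one row per team instead of one row for the whole data frame
by_team <- df %>%
  group_by(team) %>%
  summarise(across(where(is.numeric), list(mean = mean, max = max)))
by_team
```

The result has one row for team A and one for team B, with columns such as points_mean and points_max. The grouping column itself is left out of across(), so only points, assists, and rebounds are summarised.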