How to perform stratified sampling in R?

Base R's sample() function has no 'strata' argument, so stratified sampling in R is usually done by grouping the data on a factor variable that divides it into strata and then sampling each stratum independently of the others. With the dplyr package, this means calling group_by() on the stratum variable, followed by either sample_n(), which draws a fixed number of rows from each stratum, or sample_frac(), which draws a fixed fraction of each stratum so that the sample sizes are proportional to the stratum sizes. Proportional sampling keeps the sample representative of the population with respect to the stratifying variable.


Researchers often take samples from a population and use the data from the sample to draw conclusions about the population as a whole.

One commonly used sampling method is stratified random sampling, in which a population is split into groups and a certain number of members from each group are randomly selected to be included in the sample.

This tutorial explains how to perform stratified random sampling in R.

Example: Stratified Sampling in R

A high school is composed of 400 students who are either Freshmen, Sophomores, Juniors, or Seniors, with 100 students in each grade. Suppose we'd like to take a stratified sample of 40 students such that 10 students from each grade are included in the sample.

The following code shows how to generate an example data frame of 400 students:

#make this example reproducible
set.seed(1)

#create data frame
df <- data.frame(grade = rep(c('Freshman', 'Sophomore', 'Junior', 'Senior'), each=100),
                 gpa = rnorm(400, mean=85, sd=3))

#view first six rows of data frame
head(df)

     grade      gpa
1 Freshman 83.12064
2 Freshman 85.55093
3 Freshman 82.49311
4 Freshman 89.78584
5 Freshman 85.98852
6 Freshman 82.53859

Stratified Sampling Using Number of Rows

The following code shows how to use the group_by() and sample_n() functions from the dplyr package to obtain a stratified random sample of 40 total students with 10 students from each grade:

library(dplyr)

#obtain stratified sample
strat_sample <- df %>%
                  group_by(grade) %>%
                  sample_n(size=10)

#find frequency of students from each grade
table(strat_sample$grade)

 Freshman    Junior    Senior Sophomore 
       10        10        10        10 
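
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). The sketch below assumes a recent dplyr version and repeats the same stratified draw with the newer function; it rebuilds the example data frame so that it can be run on its own. The ungroup() call at the end is an optional addition that is not part of the original example.

```r
library(dplyr)

#recreate the example data so this block runs on its own
set.seed(1)
df <- data.frame(grade = rep(c('Freshman', 'Sophomore', 'Junior', 'Senior'), each=100),
                 gpa = rnorm(400, mean=85, sd=3))

#slice_sample() respects the grouping, so n applies to each grade separately
strat_sample <- df %>%
                  group_by(grade) %>%
                  slice_sample(n=10) %>%
                  ungroup()

#find frequency of students from each grade (10 per grade)
table(strat_sample$grade)
```

slice_sample() also accepts a prop argument in place of n for sampling a fraction of each group.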

Stratified Sampling Using Fraction of Rows

The following code shows how to use the group_by() and sample_frac() functions from the dplyr package to obtain a stratified random sample in which we randomly select 15% of students from each grade:

library(dplyr)

#obtain stratified sample
strat_sample <- df %>%
                  group_by(grade) %>%
                  sample_frac(size=.15)

#find frequency of students from each grade
table(strat_sample$grade)

 Freshman    Junior    Senior Sophomore 
       15        15        15        15 
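
For readers who prefer not to load dplyr, the same idea can be sketched in base R: split the row indices by stratum, sample within each stratum, and then subset the data frame. This is a minimal sketch, not the method used above; the names rows_by_grade, sampled_rows, and strat_sample_base are made up for illustration, and rounding the per-stratum size with round() is an assumption about how to handle fractions that are not whole numbers.

```r
#recreate the example data so this block runs on its own
set.seed(1)
df <- data.frame(grade = rep(c('Freshman', 'Sophomore', 'Junior', 'Senior'), each=100),
                 gpa = rnorm(400, mean=85, sd=3))

#split the row indices by grade: one vector of indices per stratum
rows_by_grade <- split(seq_len(nrow(df)), df$grade)

#draw 15% of the rows within each stratum, without replacement
#(size is rounded to a whole number; this rounding rule is an assumption)
sampled_rows <- lapply(rows_by_grade, function(idx) {
  idx[sample.int(length(idx), size = round(0.15 * length(idx)))]
})

#combine the sampled indices and subset the data frame
strat_sample_base <- df[unlist(sampled_rows), ]

#find frequency of students from each grade (15 per grade)
table(strat_sample_base$grade)
```

Sampling positions with sample.int() and then indexing into idx avoids the behavior of sample() when it is handed a single number, which would matter if a stratum contained only one row.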

