How to Perform Stratified Sampling in Pandas (With Examples)

Stratified sampling is a sampling method in which a population is divided into strata (groups) based on a shared characteristic, and observations are then drawn from each stratum. pandas, a data analysis library for Python, does not have a single dedicated stratified sampling function; instead, you split a DataFrame into strata with groupby() and draw rows from each group with sample(). The examples below show this approach using both fixed counts and proportions.


Researchers often take samples from a population and use the data from the sample to draw conclusions about the population as a whole.

One commonly used sampling method is stratified random sampling, in which a population is split into groups and a certain number of members from each group are randomly selected to be included in the sample.

This tutorial explains two methods for performing stratified random sampling in Python using pandas.

Example 1: Stratified Sampling Using Counts

Suppose we have the following pandas DataFrame that contains data about 8 basketball players on 2 different teams:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#view DataFrame
df

        team	position assists rebounds
0	A	G	 5	 11
1	A	G	 7	 8
2	A	F	 7	 10
3	A	G	 8	 6
4	B	F	 5	 6
5	B	F	 7	 9
6	B	C	 6	 6
7	B	C	 9	 10

The following code shows how to perform stratified random sampling by randomly selecting 2 players from each team to be included in the sample:

df.groupby('team', group_keys=False).apply(lambda x: x.sample(2))

        team	position assists rebounds
0	A	G	 5	 11
3	A	G	 8	 6
6	B	C	 6	 6
5	B	F	 7	 9

Notice that two players from each team are included in the stratified sample. Because the rows are selected at random, the specific players returned will vary from run to run.
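As a possible alternative, pandas 1.1 and later also offer a sample() method directly on grouped objects, which avoids the lambda. The sketch below rebuilds the DataFrame from this example; the random_state value of 0 and the name sample_counts are arbitrary choices made here for repeatability and are not part of the original example.

```python
import pandas as pd

# rebuild the DataFrame from Example 1
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

# randomly select 2 players from each team
# random_state=0 is an arbitrary seed so the same rows come back on every run
sample_counts = df.groupby('team').sample(n=2, random_state=0)

# confirm that each team contributes exactly 2 rows
print(sample_counts['team'].value_counts())
```

Dropping random_state gives a fresh random draw each time, which matches the behavior of the code above.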

Example 2: Stratified Sampling Using Proportions

Suppose we have the following pandas DataFrame that contains data about 8 basketball players on 2 teams, this time with 2 players on team A and 6 players on team B:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#view DataFrame
df

        team	position assists rebounds
0	A	G	 5	 11
1	A	G	 7	 8
2	B	F	 7	 10
3	B	G	 8	 6
4	B	F	 5	 6
5	B	F	 7	 9
6	B	C	 6	 6
7	B	C	 9	 10

Notice that 2 of the 8 players (25%) in the DataFrame are on team A and 6 of the 8 players (75%) are on team B.
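The team shares can also be checked in code rather than counted by hand. A minimal sketch, rebuilding the DataFrame defined in this example (the name team_share is introduced here for illustration only):

```python
import pandas as pd

# rebuild the DataFrame from Example 2
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

# fraction of rows that belong to each team
team_share = df['team'].value_counts(normalize=True)

# team A accounts for 0.25 of the rows and team B for 0.75
print(team_share)
```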

The following code shows how to perform stratified random sampling such that the proportion of players in the sample from each team matches the proportion of players from each team in the larger DataFrame:

import numpy as np

#define total sample size desired
N = 4

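#perform stratified random sampling
#each team contributes round(N * team size / total size) rows,
#then the sampled rows are shuffled and the index is reset
(df.groupby('team', group_keys=False)
   .apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df)))))
   .sample(frac=1)
   .reset_index(drop=True))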

        team	position  assists  rebounds
0	B	F	  7	   9
1	B	G	  8	   6
2	B	C	  6	   6
3	A	G	  7	   8

Notice that the proportion of players from team A in the stratified sample (25%) matches the proportion of players from team A in the larger DataFrame.

Similarly, the proportion of players from team B in the stratified sample (75%) matches the proportion of players from team B in the larger DataFrame.
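The same proportional draw can likely be written more compactly by sampling an equal fraction from every team, since taking the same fraction of each group preserves the group proportions. The sketch below assumes pandas 1.1 or later (for sample() on grouped objects); the random_state value and the name sample_prop are arbitrary choices made here, not part of the original example.

```python
import pandas as pd

# rebuild the DataFrame from Example 2
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

# total sample size desired
N = 4

# draw the same fraction (N / total rows) from each team
sample_prop = df.groupby('team').sample(frac=N / len(df), random_state=0)

# shuffle the sampled rows and reset the index
sample_prop = sample_prop.sample(frac=1, random_state=0).reset_index(drop=True)

# check the team proportions in the sample
print(sample_prop['team'].value_counts(normalize=True))
```

With either approach, the per-team counts are rounded to whole rows, so when group sizes do not divide evenly the total number of sampled rows may differ slightly from N.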

The following tutorials explain how to select other types of samples using pandas:

How to Perform Cluster Sampling in Pandas
