Stratified sampling is a sampling method that divides a population into strata, or groups, based on a shared characteristic and then draws a sample from each group. pandas, a data analysis library for Python, has no dedicated stratified sampling function, but grouping a DataFrame with groupby() and sampling within each group with sample() achieves the same result. The code examples below show how to do this, first with a fixed count per group and then with counts proportional to group size.
Researchers often take samples from a population and use the data from the sample to draw conclusions about the population as a whole.
One commonly used sampling method is stratified random sampling, in which a population is split into groups and a certain number of members from each group are randomly selected to be included in the sample.
This tutorial explains two methods for performing stratified random sampling in Python.
Example 1: Stratified Sampling Using Counts
Suppose we have the following pandas DataFrame that contains data about 8 basketball players on 2 different teams:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#view DataFrame
df

  team position  assists  rebounds
0    A        G        5        11
1    A        G        7         8
2    A        F        7        10
3    A        G        8         6
4    B        F        5         6
5    B        F        7         9
6    B        C        6         6
7    B        C        9        10
The following code shows how to perform stratified random sampling by randomly selecting 2 players from each team to be included in the sample:
df.groupby('team', group_keys=False).apply(lambda x: x.sample(2))

  team position  assists  rebounds
0    A        G        5        11
3    A        G        8         6
6    B        C        6         6
5    B        F        7         9
Notice that two players from each team are included in the stratified sample.
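The same kind of sample can also be drawn without apply() and a lambda. The following is a minimal sketch, assuming pandas 1.1 or newer, where groupby objects have their own sample() method. The random_state value of 0 is an arbitrary choice, added only so that the result is reproducible:

```python
import pandas as pd

#create DataFrame (same data as in Example 1)
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#randomly select 2 players from each team (seed chosen arbitrarily)
sample = df.groupby('team').sample(n=2, random_state=0)

#count the sampled players on each team
counts = sample['team'].value_counts().to_dict()
print(counts)
```

Because the rows are chosen at random, the specific players selected depend on the seed, but each team always contributes exactly 2 players.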
Example 2: Stratified Sampling Using Proportions
Now suppose we have a pandas DataFrame that contains data about the same 8 basketball players, but this time the players are split unevenly between the 2 teams:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#view DataFrame
df

  team position  assists  rebounds
0    A        G        5        11
1    A        G        7         8
2    B        F        7        10
3    B        G        8         6
4    B        F        5         6
5    B        F        7         9
6    B        C        6         6
7    B        C        9        10
Notice that 2 of the 8 players (25%) in the DataFrame are on team A and 6 of the 8 players (75%) are on team B.
The following code shows how to perform stratified random sampling such that the proportion of players in the sample from each team matches the proportion of players from each team in the larger DataFrame:
import numpy as np

#define total sample size desired
N = 4

#perform stratified random sampling
df.groupby('team', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)

  team position  assists  rebounds
0    B        F        7         9
1    B        G        8         6
2    B        C        6         6
3    A        G        7         8
Notice that the proportion of players from team A in the stratified sample (25%) matches the proportion of players from team A in the larger DataFrame.
Similarly, the proportion of players from team B in the stratified sample (75%) matches the proportion of players from team B in the larger DataFrame.
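The proportions can also be checked in code rather than by eye. The following is a minimal sketch, assuming pandas 1.1 or newer, that draws the same fraction of players from every team with the groupby sample() method and then compares the team proportions in the sample with those in the full DataFrame. The names frac, sample_props and full_props are illustrative, and the random_state value of 0 is an arbitrary seed:

```python
import pandas as pd

#create DataFrame (same data as in Example 2)
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'C', 'C'],
                   'assists': [5, 7, 7, 8, 5, 7, 6, 9],
                   'rebounds': [11, 8, 10, 6, 6, 9, 6, 10]})

#define total sample size desired
N = 4

#fraction of each team to sample (4/8 = 0.5 here)
frac = N / len(df)

#sample the same fraction from each team, then shuffle the rows
sample = (df.groupby('team')
            .sample(frac=frac, random_state=0)
            .sample(frac=1, random_state=0)
            .reset_index(drop=True))

#compare team proportions in the sample and in the full DataFrame
sample_props = sample['team'].value_counts(normalize=True).to_dict()
full_props = df['team'].value_counts(normalize=True).to_dict()
print(sample_props)
print(full_props)
```

In this data the group sizes divide evenly, so the sample contains exactly N rows. When a group's size multiplied by the fraction is not a whole number, the per-group counts are rounded, so the total sample size can differ slightly from N. The same is true of the int(np.rint(...)) approach shown above.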