How to select random sample of rows in PySpark

You can use the sample function in PySpark to select a random sample of rows from a DataFrame.

This function uses the following syntax:

sample(withReplacement=None, fraction=None, seed=None)

where:

  • withReplacement: Whether to sample with replacement or not (default=False)
  • fraction: Fraction of rows to include in sample
  • seed: An integer that specifies the random seed for sampling

Note that you should set the seed argument to a specific integer value if you want to generate the exact same sample each time you run the code.

Also note that the value specified for the fraction argument is treated as a per-row sampling probability rather than an exact proportion, so the sample is not guaranteed to contain exactly that fraction of the total rows of the DataFrame.
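For example, here is a minimal sketch (using a hypothetical 1,000-row DataFrame built with spark.range, purely for illustration) showing that the number of sampled rows varies around, but rarely equals, fraction times the total row count:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#create a simple 1,000-row DataFrame for illustration
df_demo = spark.range(1000)

#sample 10% of rows using three different seeds and count the results
for seed in [1, 2, 3]:
    n = df_demo.sample(fraction=0.1, seed=seed).count()
    print(f'seed {seed}: sampled {n} rows')

Each count will hover near 100 (10% of 1,000 rows) without necessarily hitting it exactly.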

The following example shows how to use the sample function in practice to select a random sample of rows from a PySpark DataFrame:

Example: How to Select Random Sample of Rows in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Suppose we would like to select a random sample containing 30% of the total rows in the DataFrame.

We can use the following syntax to do so:

#select random sample of 30% of rows in DataFrame
df_sample = df.sample(withReplacement=False, fraction=0.3)

#view random sample
df_sample.show()

+-----+------+
| team|points|
+-----+------+
| Mavs|    18|
| Nets|    33|
|Kings|    15|
+-----+------+

The resulting DataFrame contains a random sample of 3 of the 10 rows from the original DataFrame. Because we did not specify a seed, you will likely see a different set of rows each time you run this code.
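Because fraction acts as a probability, the sample will not always contain exactly 3 rows, either. If you need an exact number of randomly chosen rows, one common workaround (separate from the sample function itself) is to sort the DataFrame by a random column and take the first n rows:

from pyspark.sql.functions import rand

#select exactly 3 random rows by sorting on a random column and limiting
df_exact = df.orderBy(rand(seed=42)).limit(3)

df_exact.show()

Keep in mind that this approach sorts the entire DataFrame, so it can be expensive on very large datasets.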

Since we specified withReplacement=False, this guarantees that each row from the original DataFrame can only occur once in the random sample.

However, if we specify withReplacement=True, then it’s possible for each row from the original DataFrame to occur more than once in the random sample:

#select random sample (with replacement) of 30% of rows in DataFrame
df_sample = df.sample(withReplacement=True, fraction=0.3)

#view random sample
df_sample.show()

+-----+------+
| team|points|
+-----+------+
|Magic|    28|
|Spurs|    13|
|Magic|    28|
+-----+------+

Note that the team name Magic occurred twice in the random sample since we used sampling with replacement in this example.
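Finally, recall that setting the seed argument makes the sample reproducible. For example, the following sketch (the seed value of 42 is arbitrary) returns the same set of rows every time it runs:

#select a reproducible random sample of 30% of rows in DataFrame
df_sample = df.sample(withReplacement=False, fraction=0.3, seed=42)

#running this code again returns the exact same rows
df_sample.show()

The specific rows returned can vary across Spark versions and data partitionings, which is why no sample output is shown here, but they will be stable across repeated runs in the same environment.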

You can find the complete documentation for the PySpark sample function in the official PySpark API reference.
