How to check if a DataFrame is empty in PySpark

In PySpark, you can check whether a DataFrame is empty with the isEmpty() method (available on DataFrames in Spark 3.3 and later), which returns True if the DataFrame contains no rows and False otherwise. Alternatively, you can compare the result of count(), which returns the number of rows in the DataFrame, to zero.


You can use the following syntax to check if a PySpark DataFrame is empty:

print(df.count() == 0)

This will return True if the DataFrame is empty or False if the DataFrame is not empty.

Note that df.count() returns the number of rows in the DataFrame, so we’re effectively checking whether the total number of rows is zero.

The following examples show how to use this syntax in practice.

Example 1: Check if Empty DataFrame is Empty

Suppose we create the following empty PySpark DataFrame with specific column names:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.types import StructType, StructField, StringType, FloatType

#create empty RDD
empty_rdd = spark.sparkContext.emptyRDD()

#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]

#create DataFrame from empty RDD with specific column names
df = spark.createDataFrame(empty_rdd, schema=StructType(my_columns))

#view DataFrame
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
+----+--------+------+

We can use the following syntax to check if the DataFrame is empty:

#check if DataFrame is empty
print(df.count() == 0)

True

We receive a value of True, which indicates that the DataFrame is indeed empty.

Example 2: Check if Non-Empty DataFrame is Empty

Suppose we create the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Mavs', 15], 
        ['Cavs', 19],
        ['Wizards', 24],]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|   Mavs|    15|
|   Cavs|    19|
|Wizards|    24|
+-------+------+

We can use the following syntax to check if the DataFrame is empty:

#check if DataFrame is empty
print(df.count() == 0)

False

We receive a value of False, which indicates that the DataFrame is not empty.
