How can I count the number of null values in PySpark, and what are some examples of how to do so?

To count the number of null values in PySpark, combine the Column method isNull() with an aggregate function such as count() or sum(). Note that isNull() is a method on a Column, not a standalone function you call on a column name, so a common pattern is to cast the boolean result to an integer and sum it, or to wrap it in when() and count the result. For example, given a DataFrame "df" with a column "A", the code df.select(sum(col("A").isNull().cast("int"))).show() returns the total count of null values in column "A", and applying the same expression to every entry in df.columns returns the null count for each column in the DataFrame. By using this method, we can effectively identify and handle null values in our PySpark data analysis.
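As a minimal sketch (assuming a SparkSession is already running and df is an existing DataFrame with a hypothetical column "A"):

from pyspark.sql.functions import col, sum as spark_sum

#count null values in column 'A': isNull() yields a boolean, so cast it to 0/1 and sum
df.select(spark_sum(col("A").isNull().cast("int")).alias("A_nulls")).show()

Applying the same expression to each entry in df.columns yields a per-column null count, as the methods below show.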

Count Null Values in PySpark (With Examples)


You can use the following methods to count null values in a PySpark DataFrame:

Method 1: Count Null Values in One Column

#count number of null values in 'points' column
df.where(df.points.isNull()).count()

Method 2: Count Null Values in Each Column

from pyspark.sql.functions import when, count, col

#count number of null values in each column of DataFrame
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show() 
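This works because when() with no otherwise() clause returns null whenever the condition is false, and count() ignores nulls, so each column's count includes only the rows where that column is null.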

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', None, 11], 
        ['A', 4, 8], 
        ['A', 2, 22], 
        ['A', 10, None], 
        ['B', 8, None], 
        ['B', 11, 14],
        ['B', 14, 13],
        ['B', 6, 7],
        ['C', 2, 8],
        ['C', 2, 5]] 
  
#define column names
columns = ['team', 'assists', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+-------+------+
|team|assists|points|
+----+-------+------+
|   A|   null|    11|
|   A|      4|     8|
|   A|      2|    22|
|   A|     10|  null|
|   B|      8|  null|
|   B|     11|    14|
|   B|     14|    13|
|   B|      6|     7|
|   C|      2|     8|
|   C|      2|     5|
+----+-------+------+

Example 1: Count Null Values in One Column

We can use the following syntax to count the number of null values in just the points column of the DataFrame:

#count number of null values in 'points' column
df.where(df.points.isNull()).count()

2

From the output we can see there are 2 null values in the points column of the DataFrame.

Note that if we wanted to view these rows with null values in the points column then we could replace count() with show() as follows:

#display rows with null values in 'points' column
df.where(df.points.isNull()).show()

+----+-------+------+
|team|assists|points|
+----+-------+------+
|   A|     10|  null|
|   B|      8|  null|
+----+-------+------+

The resulting DataFrame contains only the two rows with null values in the points column.
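Note that filter() is an alias for where() in PySpark, so the following equivalent syntax would return the same count:

#equivalent syntax using filter(), an alias for where()
df.filter(df.points.isNull()).count()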

Example 2: Count Null Values in Each Column

We can use the following syntax to count the number of null values in each column of the DataFrame:

from pyspark.sql.functions import when, count, col

#count number of null values in each column of DataFrame
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+----+-------+------+
|team|assists|points|
+----+-------+------+
|   0|      1|     2|
+----+-------+------+
From the output we can see that:

  • There are 0 null values in the team column.
  • There is 1 null value in the assists column.
  • There are 2 null values in the points column.
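An equivalent sketch that casts the boolean result of isNull() to an integer and sums it for each column should produce the same output:

from pyspark.sql.functions import col, sum as spark_sum

#sum the 0/1 values produced by casting isNull() to an integer in each column
df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()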
