How to Replace Zero with Null in PySpark

PySpark offers an easy way to replace 0 values with null values. The DataFrame replace() method returns a new DataFrame in which every matching value has been swapped for the replacement, and it can be applied across all columns or restricted to specific ones. This is useful for data cleaning and other data-wrangling tasks, such as marking zeros as missing values before computing averages.


You can use the following syntax to replace zeros with null values in a PySpark DataFrame:

df_new = df.replace(0, None)
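
By default, replace() scans every column whose type is compatible with the value being replaced. If you only want to target certain columns, you can pass the optional subset argument (a minimal sketch, assuming a column named points as in the example below):

#replace zeros with null only in the 'points' column
df_new = df.replace(0, None, subset=['points'])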

The following examples show how to use this syntax in practice.

Example: Replace Zero with Null in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 0],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 0],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     0|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|     0|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

We can use the following syntax to replace each zero with a null in the DataFrame:

#create new DataFrame that replaces all zeros with null
df_new = df.replace(0, None)

#view new DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|  null|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|  null|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Notice that each zero in the points column has been replaced with a value of null.
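
If you'd rather be explicit about exactly which column gets rewritten, one equivalent approach is to rebuild the column with the when and otherwise functions from pyspark.sql.functions (a sketch, not the only way to do this):

from pyspark.sql import functions as F

#set 'points' to null where it equals zero, otherwise keep the existing value
df_new = df.withColumn('points', F.when(F.col('points') == 0, None).otherwise(F.col('points')))

This produces the same result as replace() for this DataFrame but leaves every other column untouched by construction.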

If we’d like, we can use the following syntax to count the number of null values present in the points column of the new DataFrame:

#count number of null values in 'points' column
df_new.where(df_new.points.isNull()).count()

2

From the output we can see that there are 2 null values in the points column of the new DataFrame.
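
As an alternative, if you need null counts for every column at once, one option (a sketch using count and when from pyspark.sql.functions) is to aggregate over all columns in a single pass:

from pyspark.sql import functions as F

#count null values in each column of the DataFrame
df_new.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_new.columns]).show()

For this DataFrame, the result would show 2 under the points column and 0 under team and position.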
