How can I replace zero values with null in PySpark?

PySpark is the Python API for Apache Spark, a framework for distributed data processing. To replace zero values with null in a PySpark DataFrame, you can either call the DataFrame’s ‘replace’ method or build a conditional expression with the ‘when’ and ‘otherwise’ functions. Converting zeros to null is useful during data cleaning when a zero stands for a missing or invalid measurement rather than a real value, because aggregations such as the mean skip nulls but include zeros.

PySpark: Replace Zero with Null

You can use the following syntax to replace zeros with null values in a PySpark DataFrame:

df_new = df.replace(0, None)

The following example shows how to use this syntax in practice.

Example: Replace Zero with Null in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 0],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 0],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
#define column names
columns = ['team', 'position', 'points'] 
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
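#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     0|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|     0|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+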

We can use the following syntax to replace each zero with a null in the DataFrame:

#create new DataFrame that replaces all zeros with null
df_new = df.replace(0, None)

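#view new DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|  null|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|  null|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+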

Notice that each zero in the points column has been replaced with a value of null.

If we’d like, we can use the following syntax to count the number of null values present in the points column of the new DataFrame:

#count number of null values in 'points' column
df_new.where(df_new.points.isNull()).count()

2

From the output we can see that there are 2 null values in the points column of the new DataFrame.

