This tutorial describes how to replace zero values with null in PySpark, a Python-based framework for distributed data processing. The simplest approach is the DataFrame.replace method, though you can also use conditional functions such as when and otherwise to replace values column by column. Converting zeros to null is a common step when zeros actually represent missing or incorrect data, and it makes DataFrames easier to clean and prepare before analysis and modeling.
PySpark: Replace Zero with Null
You can use the following syntax to replace zeros with null values in a PySpark DataFrame:
df_new = df.replace(0, None)
The following examples show how to use this syntax in practice.
Example: Replace Zero with Null in PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 0],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 0],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     0|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|     0|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
We can use the following syntax to replace each zero with a null in the DataFrame:
#create new DataFrame that replaces all zeros with null
df_new = df.replace(0, None)

#view new DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|  null|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|  null|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
Notice that each zero in the points column has been replaced with a value of null.
If we’d like, we can use the following syntax to count the number of null values present in the points column of the new DataFrame:
#count number of null values in 'points' column
df_new.where(df_new.points.isNull()).count()

2
From the output we can see that there are 2 null values in the points column of the new DataFrame.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark: