How can “Is Not Null” be implemented in PySpark?

In PySpark, an “is not null” check is performed with the “isNotNull()” method on a DataFrame column. The method returns True for rows where the column holds a non-null value and False where it is null, so it can be passed to “filter()” (on its own or combined with other conditions) to drop rows with null values and keep only the desired data. This makes it a convenient tool for cleaning and preparing data before analysis.

How to Use “Is Not Null” in PySpark (With Examples)


You can use the following methods in PySpark to filter DataFrame rows where a value in a particular column is not null:

Method 1: Filter for Rows where Value is Not Null in Specific Column

#filter for rows where value is not null in 'points' column
df.filter(df.points.isNotNull()).show()

Method 2: Filter for Rows where Value is Not Null in Any Column

#filter for rows where value is not null in any column
df.dropna().show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', None, 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', None, 12], 
        ['B', 'West', None, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   B|      West|  null|     12|
|   B|      West|  null|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Filter for Rows where Value is Not Null in Specific Column

We can use the following syntax to filter the DataFrame to only show rows where the value in the points column is not null:

#filter for rows where value is not null in 'points' column
df.filter(df.points.isNotNull()).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   C|      East|     5|      2|
+----+----------+------+-------+

The resulting DataFrame only contains rows where the value in the points column is not null.

Example 2: Filter for Rows where Value is Not Null in Any Column

We can use the following syntax to filter the DataFrame to only show rows where there are no null values in any column:

#filter for rows where value is not null in any column
df.dropna().show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|    10|      3|
|   C|      East|     5|      2|
+----+----------+------+-------+

The resulting DataFrame only contains rows where there are no null values in any column.
