How can I use the “Not Equal” operator in PySpark, and what are some examples of its usage?

The “Not Equal” operator in PySpark is a comparison operator, written “!=”, that tests whether two values or expressions differ. Applied to a DataFrame column, it produces a Boolean column expression that is true for rows where the values are not equal and false for rows where they are equal. It is most often used to filter rows, for example to keep every row where a column is not equal to a given value or to exclude one category from a dataset, and it can also drive conditional operations. Several “Not Equal” conditions can be combined with the logical operators & (and) and | (or) to build more complex conditions.

Use “Not Equal” Operator in PySpark (With Examples)


There are two common ways to filter a PySpark DataFrame by using a “Not Equal” operator:

Method 1: Filter Using One “Not Equal” Operator

#filter DataFrame where team is not equal to 'A'
df.filter(df.team != 'A').show()

Method 2: Filter Using Multiple “Not Equal” Operators

#filter DataFrame where team is not equal to 'A' and points is not equal to 5
df.filter((df.team != 'A') & (df.points != 5)).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Filter Using One “Not Equal” Operator

We can use the following syntax to filter the DataFrame to only contain rows where the team column is not equal to A:

#filter DataFrame where team is not equal to 'A'
df.filter(df.team != 'A').show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Notice that each row in the resulting DataFrame contains a value in the team column that is not equal to A.

Example 2: Filter Using Multiple “Not Equal” Operators

We can use the following syntax to filter the DataFrame to only contain rows where the team column is not equal to A and the points column is not equal to 5. Each condition is wrapped in parentheses because & binds more tightly than != in Python:

#filter DataFrame where team is not equal to 'A' and points is not equal to 5
df.filter((df.team != 'A') & (df.points != 5)).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
+----+----------+------+-------+

Notice that each row in the resulting DataFrame contains a value in the team column that is not equal to A and a value in the points column that is not equal to 5.
