The “Not Equal” operator in PySpark is a comparison operator, written “!=”, that tests whether two values or expressions differ. When it is applied to a DataFrame column, it does not return a single Python Boolean. Instead it produces a Boolean column expression that is evaluated row by row: true where the two values are not equal and false where they are equal. It is most often used to filter a dataset, for example to keep only the rows where a column is not equal to a given value or to exclude a particular category. It can also be combined with logical operators such as “&” (and) and “|” (or) to build more complex conditions.
Use “Not Equal” Operator in PySpark (With Examples)
There are two common ways to filter a PySpark DataFrame by using a “Not Equal” operator:
Method 1: Filter Using One “Not Equal” Operator
#filter DataFrame where team is not equal to 'A'
df.filter(df.team!='A').show()
Method 2: Filter Using Multiple “Not Equal” Operators
#filter DataFrame where team is not equal to 'A' and points is not equal to 5
df.filter((df.team!='A') & (df.points!=5)).show()
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
Example 1: Filter Using One “Not Equal” Operator
We can use the following syntax to filter the DataFrame to only contain rows where the team column is not equal to A:
#filter DataFrame where team is not equal to 'A'
df.filter(df.team!='A').show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
Notice that each row in the resulting DataFrame has a value in the team column that is not equal to A.
Example 2: Filter Using Multiple “Not Equal” Operators
We can use the following syntax to filter the DataFrame to only contain rows where the team column is not equal to A and the value in the points column is not equal to 5:
#filter DataFrame where team is not equal to 'A' and points is not equal to 5
df.filter((df.team!='A') & (df.points!=5)).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
+----+----------+------+-------+
Notice that each row in the resulting DataFrame has a value in the team column that is not equal to A and a value in the points column that is not equal to 5.