How can I check if a value exists in a column in PySpark?

To check if a value exists in a column in PySpark, filter the DataFrame with a condition on that column and test whether any rows remain. The condition can be an equality comparison, the Column method “contains” for substring matches, or the Column method “isin”, which takes a list of values and returns a boolean column that is true for each row whose value appears in the list. None of these returns a single True or False on its own, so combine the “filter” function with an action such as count() to get the final answer.

PySpark: Check if Value Exists in Column


You can use the following syntax to check if a specific value exists in a column of a PySpark DataFrame:

df.filter(df.position.contains('Guard')).count()>0

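This particular example checks whether the string ‘Guard’ appears in any value of the column named position and returns either True or False. Note that contains performs a substring match, so a value such as ‘Point Guard’ would also count as a match.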

The following example shows how to use this syntax in practice.

Example: Check if Value Exists in Column in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 4], 
        ['A', 'Forward', 8, 5], 
        ['B', 'Guard', 22, 6], 
        ['A', 'Forward', 22, 7], 
        ['C', 'Guard', 14, 12], 
        ['A', 'Guard', 14, 8],
        ['B', 'Forward', 13, 9],
        ['B', 'Center', 7, 9]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      4|
|   A| Forward|     8|      5|
|   B|   Guard|    22|      6|
|   A| Forward|    22|      7|
|   C|   Guard|    14|     12|
|   A|   Guard|    14|      8|
|   B| Forward|    13|      9|
|   B|  Center|     7|      9|
+----+--------+------+-------+

We can use the following syntax to check if the value ‘Guard’ exists in the position column:

#check if 'Guard' exists in position column
df.filter(df.position.contains('Guard')).count()>0

True

The output returns True, which indicates that the value ‘Guard’ does exist in the position column.

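Note that we can also use similar syntax to check if a specific value exists in a numeric column. For a numeric column, compare with == rather than contains, since contains is a substring test meant for string columns and a search for ‘14’ would also match values such as 114 or 140.

For example, we can use the following syntax to check if the value 14 exists in the points column:

#check if 14 exists in points column
df.filter(df.points == 14).count()>0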

True

The output returns True, which indicates that the value 14 does exist in the points column.
