How can I convert a column to lowercase in PySpark?

To convert a column to lowercase in PySpark, use the lower() function from pyspark.sql.functions. This function takes a column as input and returns a new column in which every string value is converted to lowercase. This is useful for cleaning and standardizing text data, for example before comparing or grouping values that differ only in case.

PySpark: Convert Column to Lowercase


You can use the following syntax to convert a column to lowercase in a PySpark DataFrame:

from pyspark.sql.functions import lower

df = df.withColumn('my_column', lower(df['my_column']))

The following example shows how to use this syntax in practice.

Example: How to Convert Column to Lowercase in PySpark

Suppose we create the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Suppose we would like to convert all strings in the conference column to lowercase.

We can use the following syntax to do so:

from pyspark.sql.functions import lower

#convert 'conference' column to lowercase
df = df.withColumn('conference', lower(df['conference']))

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      east|    11|      4|
|   A|      east|     8|      9|
|   A|      east|    10|      3|
|   B|      west|     6|     12|
|   B|      west|     6|      4|
|   C|      east|     5|      2|
+----+----------+------+-------+

Notice that all strings in the conference column of the updated DataFrame are now lowercase.

Note #1: We used the withColumn function to return a new DataFrame with the conference column modified and all other columns left unchanged.

Note #2: You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
