How can leading zeros be removed in a PySpark column?


You can use the following syntax to remove leading zeros from a column in a PySpark DataFrame:

from pyspark.sql import functions as F

#remove leading zeros from values in 'employee_ID' column
df_new = df.withColumn('employee_ID', F.regexp_replace('employee_ID', r'^[0]*', ''))

This particular example removes all leading zeros from values in the employee_ID column and leaves all other zeros untouched.
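One edge case worth noting: because the pattern matches every leading zero, a value that is all zeros (such as '0000') collapses to an empty string. Since regexp_replace uses Java-style regular expressions, you can sanity-check the pattern with Python's re module, which behaves the same way for these patterns; the lookahead variant shown below is a sketch of one way to keep a single zero in that case, not part of the original example:

```python
import re

# '^[0]*' removes every leading zero, so an all-zero
# string collapses to an empty string
print(re.sub(r'^[0]*', '', '000501'))   # 501
print(re.sub(r'^[0]*', '', '0000'))     # (empty string)

# variant that keeps one zero when the value is all zeros:
# '^0+(?!$)' matches leading zeros only when at least one
# character remains after them
print(re.sub(r'^0+(?!$)', '', '0000'))  # 0
```

Spark's regexp_replace also supports this lookahead syntax, so the same variant pattern can be passed to it directly if you need that behavior.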

The following example shows how to use this syntax in practice.

Example: How to Remove Leading Zeros from Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made by various employees at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['000501', 18], 
        ['000034', 33], 
        ['009230', 12], 
        ['000451', 15], 
        ['000239', 19],
        ['002295', 24],
        ['011543', 28]] 
  
#define column names
columns = ['employee_ID', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----------+-----+
|employee_ID|sales|
+-----------+-----+
|     000501|   18|
|     000034|   33|
|     009230|   12|
|     000451|   15|
|     000239|   19|
|     002295|   24|
|     011543|   28|
+-----------+-----+

Notice that each string in the employee_ID column contains leading zeros.

We can use the following syntax to remove the leading zeros from each string in this column:

from pyspark.sql import functions as F

#remove leading zeros from values in 'employee_ID' column
df_new = df.withColumn('employee_ID', F.regexp_replace('employee_ID', r'^[0]*', ''))

#view updated DataFrame
df_new.show()

+-----------+-----+
|employee_ID|sales|
+-----------+-----+
|        501|   18|
|         34|   33|
|       9230|   12|
|        451|   15|
|        239|   19|
|       2295|   24|
|      11543|   28|
+-----------+-----+

Notice that the leading zeros have been removed from each string in the employee_ID column.

Note that we used the PySpark regexp_replace function to replace the leading zeros in each string with nothing.

You can find the complete documentation for the PySpark regexp_replace function in the official Spark API reference.
