How can I remove leading zeros in a column using PySpark?

Removing leading zeros from a column in PySpark is a common data-cleaning step, since ID codes are often stored as zero-padded strings. The most direct approach is the regexp_replace function, which replaces every match of a regular expression in a string column. By anchoring the pattern to the start of the string with ^, only the leading zeros are removed, while zeros elsewhere in the value are preserved. This allows for efficient and streamlined data cleaning and preparation before further analysis or processing.

PySpark: Remove Leading Zeros in Column


You can use the following syntax to remove leading zeros from a column in a PySpark DataFrame:

from pyspark.sql import functions as F

#remove leading zeros from values in 'employee_ID' column
df_new = df.withColumn('employee_ID', F.regexp_replace('employee_ID', r'^[0]*', ''))

This particular example removes all leading zeros from values in the employee_ID column; because the pattern is anchored with ^, zeros elsewhere in each string are left untouched.
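regexp_replace uses Java regular expressions, which behave the same as Python's re module for this simple pattern, so you can sanity-check the pattern locally before running it on a cluster. The sample values below are made up for illustration:

```python
import re

# the same anchored pattern passed to regexp_replace
pattern = r'^0*'

# only zeros at the start of the string are removed
print(re.sub(pattern, '', '000501'))  # 501
print(re.sub(pattern, '', '009230'))  # 9230 (interior zero kept)
print(re.sub(pattern, '', '120'))     # 120 (trailing zero kept)
```

Checking the pattern this way is much faster than iterating against a live Spark session.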

The following example shows how to use this syntax in practice.

Example: How to Remove Leading Zeros from Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made by various employees at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['000501', 18], 
        ['000034', 33], 
        ['009230', 12], 
        ['000451', 15], 
        ['000239', 19],
        ['002295', 24],
        ['011543', 28]] 
  
#define column names
columns = ['employee_ID', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----------+-----+
|employee_ID|sales|
+-----------+-----+
|     000501|   18|
|     000034|   33|
|     009230|   12|
|     000451|   15|
|     000239|   19|
|     002295|   24|
|     011543|   28|
+-----------+-----+

Notice that each string in the employee_ID column contains leading zeros.

We can use the following syntax to remove the leading zeros from each string in this column:

from pyspark.sql import functions as F

#remove leading zeros from values in 'employee_ID' column
df_new = df.withColumn('employee_ID', F.regexp_replace('employee_ID', r'^[0]*', ''))

#view updated DataFrame
df_new.show()

+-----------+-----+
|employee_ID|sales|
+-----------+-----+
|        501|   18|
|         34|   33|
|       9230|   12|
|        451|   15|
|        239|   19|
|       2295|   24|
|      11543|   28|
+-----------+-----+

Notice that the leading zeros have been removed from each string in the employee_ID column.
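One edge case worth noting: a value made up entirely of zeros (for example a hypothetical ID of '000000', which does not appear in the data above) would become an empty string under the ^0* pattern. If you would rather keep a single zero in that case, a lookahead strips only zeros that are followed by at least one more character. Both the Java regex flavor used by regexp_replace and Python's re support this syntax; the sketch below uses re to illustrate the pattern:

```python
import re

# strip leading zeros, but only those followed by another character,
# so a string of all zeros keeps its final digit
pattern = r'^0+(?=.)'

print(re.sub(pattern, '', '000000'))  # 0
print(re.sub(pattern, '', '000501'))  # 501
```

The same pattern string can be passed directly to F.regexp_replace in place of r'^[0]*'.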

Note that we used the PySpark regexp_replace function to replace the leading zeros in each string with nothing.

You can find the complete documentation for the PySpark regexp_replace function in the official PySpark API reference.
