How can I remove spaces from column names in PySpark?

The simplest way to remove spaces from column names in PySpark is the "withColumnRenamed" function, which renames a column by taking the old name and the new name as plain strings: pass the old name (spaces and all) along with a new name that contains no spaces. Note that backticks around a column name are only required when you reference it inside a SQL expression, not when renaming it. To rename many columns at once, you can instead use the "select" function with a list comprehension that aliases each column to a cleaned-up name, which produces a new DataFrame whose column names contain no spaces.

PySpark: Remove Spaces from Column Names


You can use the following syntax to remove spaces from each column name in a PySpark DataFrame:

from pyspark.sql import functions as F

#replace all spaces in column names with underscores
df_new = df.select([F.col(x).alias(x.replace(' ', '_')) for x in df.columns])

The following example shows how to use this syntax in practice.

Example: How to Remove Spaces from Column Names in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18, 4], 
        ['Nets', 33, 9], 
        ['Hawks', 12, 7], 
        ['Thunder', 15, 3], 
        ['Lakers', 19, 2],
        ['Cavs', 24, 5],
        ['Magic', 28, 7]] 
  
#define column names
columns = ['team name', 'points scored', 'total assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---------+-------------+-------------+
|team name|points scored|total assists|
+---------+-------------+-------------+
|     Mavs|           18|            4|
|     Nets|           33|            9|
|    Hawks|           12|            7|
|  Thunder|           15|            3|
|   Lakers|           19|            2|
|     Cavs|           24|            5|
|    Magic|           28|            7|
+---------+-------------+-------------+

Notice that each of the column names contains a space.

We can use the following syntax to replace the spaces in each column name with an underscore instead:

from pyspark.sql import functions as F

#replace all spaces in column names with underscores
df_new = df.select([F.col(x).alias(x.replace(' ', '_')) for x in df.columns])

#view new DataFrame
df_new.show()

+---------+-------------+-------------+
|team_name|points_scored|total_assists|
+---------+-------------+-------------+
|     Mavs|           18|            4|
|     Nets|           33|            9|
|    Hawks|           12|            7|
|  Thunder|           15|            3|
|   Lakers|           19|            2|
|     Cavs|           24|            5|
|    Magic|           28|            7|
+---------+-------------+-------------+

Notice that the spaces in each column name have been replaced with an underscore.

Also note that you could simply remove the spaces from each column name without replacing them with another character by using the following syntax:

from pyspark.sql import functions as F

#remove all spaces in column names
df_new = df.select([F.col(x).alias(x.replace(' ', '')) for x in df.columns])

#view new DataFrame
df_new.show()

+--------+------------+------------+
|teamname|pointsscored|totalassists|
+--------+------------+------------+
|    Mavs|          18|           4|
|    Nets|          33|           9|
|   Hawks|          12|           7|
| Thunder|          15|           3|
|  Lakers|          19|           2|
|    Cavs|          24|           5|
|   Magic|          28|           7|
+--------+------------+------------+

Notice that the spaces in each column name have simply been removed.

Note that the renaming itself uses Python's built-in str.replace string method, not a PySpark function: df.columns is simply a list of Python strings, so any string manipulation works there.
