What is the simplest way to remove spaces from column names in a PySpark dataframe?

The simplest way to remove spaces from column names in a PySpark dataframe is by using the .withColumnRenamed() method. This method takes two parameters: the existing column name and the desired column name, and allows you to rename the column without having to manually edit the column name. This can be done in a single line of code, making it a quick and easy way to remove spaces from column names.


You can use the following syntax to remove spaces from each column name in a PySpark DataFrame:

from pyspark.sql import functions as F

#replace all spaces in column names with underscores
df_new = df.select([F.col(x).alias(x.replace(' ', '_')) for x in df.columns])

The following example shows how to use this syntax in practice.

Example: How to Remove Spaces from Column Names in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18, 4], 
        ['Nets', 33, 9], 
        ['Hawks', 12, 7], 
        ['Thunder', 15, 3], 
        ['Lakers', 19, 2],
        ['Cavs', 24, 5],
        ['Magic', 28, 7]] 
  
#define column names
columns = ['team name', 'points scored', 'total assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---------+-------------+-------------+
|team name|points scored|total assists|
+---------+-------------+-------------+
|     Mavs|           18|            4|
|     Nets|           33|            9|
|    Hawks|           12|            7|
|  Thunder|           15|            3|
|   Lakers|           19|            2|
|     Cavs|           24|            5|
|    Magic|           28|            7|
+---------+-------------+-------------+

Notice that each of the column names contains a space.

We can use the following syntax to replace the spaces in each column name with an underscore instead:

from pyspark.sql import functions as F

#replace all spaces in column names with underscores
df_new = df.select([F.col(x).alias(x.replace(' ', '_')) for x in df.columns])

#view new DataFrame
df_new.show()

+---------+-------------+-------------+
|team_name|points_scored|total_assists|
+---------+-------------+-------------+
|     Mavs|           18|            4|
|     Nets|           33|            9|
|    Hawks|           12|            7|
|  Thunder|           15|            3|
|   Lakers|           19|            2|
|     Cavs|           24|            5|
|    Magic|           28|            7|
+---------+-------------+-------------+

Notice that the spaces in each column name have been replaced with an underscore.

Also note that you could simply remove the spaces from each column name without replacing them with another character by using the following syntax:

from pyspark.sql import functions as F

#remove all spaces in column names
df_new = df.select([F.col(x).alias(x.replace(' ', '')) for x in df.columns])

#view new DataFrame
df_new.show()

+--------+------------+------------+
|teamname|pointsscored|totalassists|
+--------+------------+------------+
|    Mavs|          18|           4|
|    Nets|          33|           9|
|   Hawks|          12|           7|
| Thunder|          15|           3|
|  Lakers|          19|           2|
|    Cavs|          24|           5|
|   Magic|          28|           7|
+--------+------------+------------+

Notice that the spaces in each column name have simply been removed.

Note that we used the PySpark replace function to replace the spaces in the column names.

You can find the complete documentation for the PySpark replace function .

The following tutorials explain how to perform other common tasks in PySpark:

x