How can I add a string to each value in a column using PySpark?


You can use the following syntax to add a string to each value in a column of a PySpark DataFrame:

from pyspark.sql.functions import concat, col, lit

#add the string 'team_name_' to each string in the team column
df_new = df.withColumn('team', concat(lit('team_name_'), col('team')))

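This particular example adds the string ‘team_name_’ to the beginning of each string in the team column of the DataFrame. The lit function wraps the literal string as a column so that concat can join it with the existing values.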

The following example shows how to use this syntax in practice.

Example: Add String to Each Value in Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5],
        ['C', 'East', 15],
        ['C', 'West', 31],
        ['D', 'West', 24]]
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
|   C|      East|    15|
|   C|      West|    31|
|   D|      West|    24|
+----+----------+------+

Suppose we would like to add the string ‘team_name_’ to the beginning of each string in the team column.

We can use the following syntax to do so:

from pyspark.sql.functions import concat, col, lit

#add the string 'team_name_' to each string in the team column
df_new = df.withColumn('team', concat(lit('team_name_'), col('team')))

#view new DataFrame
df_new.show()

+-----------+----------+------+
|       team|conference|points|
+-----------+----------+------+
|team_name_A|      East|    11|
|team_name_A|      East|     8|
|team_name_A|      East|    10|
|team_name_B|      West|     6|
|team_name_B|      West|     6|
|team_name_C|      East|     5|
|team_name_C|      East|    15|
|team_name_C|      West|    31|
|team_name_D|      West|    24|
+-----------+----------+------+

Notice that the string ‘team_name_’ has been added to the beginning of each existing string in the team column of the DataFrame, while the conference and points columns are unchanged.

Note: The complete documentation for the PySpark concat function is available in the official PySpark API reference.
