What is the process for joining columns with different names in PySpark?


You can use the following syntax to join two DataFrames together based on different column names in PySpark:

from pyspark.sql.functions import col

df3 = df1.withColumn('id', col('team_id')).join(df2.withColumn('id', col('team_name')), on='id')

Here is what this syntax does:

  • First, it adds a new column named id to df1 containing the values from the team_id column.
  • Then, it adds a new column named id to df2 containing the values from the team_name column.
  • Lastly, it joins df1 and df2 together based on matching values in the id columns.

The following example shows how to use this syntax in practice.

Example: How to Join on Different Column Names in PySpark

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28]] 
  
#define column names
columns = ['team_ID', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data, columns) 
  
#view dataframe
df1.show()

+-------+------+
|team_ID|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
+-------+------+

And suppose we have another DataFrame named df2:

#define data
data = [['Hawks', 4], 
        ['Wizards', 5], 
        ['Raptors', 5], 
        ['Kings', 12], 
        ['Mavs', 7],
        ['Nets', 11],
        ['Magic', 3]] 
  
#define column names
columns = ['team_name', 'assists'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data, columns) 
  
#view dataframe
df2.show()

+---------+-------+
|team_name|assists|
+---------+-------+
|    Hawks|      4|
|  Wizards|      5|
|  Raptors|      5|
|    Kings|     12|
|     Mavs|      7|
|     Nets|     11|
|    Magic|      3|
+---------+-------+

We can use the following syntax to perform an inner join between these two DataFrames by adding a new id column to each DataFrame (containing the values from its team column) and then joining on values in the id column:

from pyspark.sql.functions import col

#join df1 and df2 on different column names
df3 = df1.withColumn('id', col('team_id')).join(df2.withColumn('id', col('team_name')), on='id')

#view resulting DataFrame
df3.show()

+-------+-------+------+---------+-------+
|     id|team_ID|points|team_name|assists|
+-------+-------+------+---------+-------+
|  Hawks|  Hawks|    19|    Hawks|      4|
|  Kings|  Kings|    15|    Kings|     12|
|  Magic|  Magic|    28|    Magic|      3|
|   Mavs|   Mavs|    18|     Mavs|      7|
|   Nets|   Nets|    33|     Nets|     11|
|Wizards|Wizards|    24|  Wizards|      5|
+-------+-------+------+---------+-------+

We have successfully joined the two DataFrames into one DataFrame based on matching values in the new id column.

Note that you can also use the select function to only display certain columns in the resulting joined DataFrame.

For example, we can use the following syntax to only display the id, points and assists columns in the joined DataFrame:

from pyspark.sql.functions import col

#join df1 and df2 on different column names
df3 = (df1.withColumn('id', col('team_id'))
          .join(df2.withColumn('id', col('team_name')), on='id')
          .select('id', 'points', 'assists'))

#view resulting DataFrame
df3.show()

+-------+------+-------+
|     id|points|assists|
+-------+------+-------+
|  Hawks|    19|      4|
|  Kings|    15|     12|
|  Magic|    28|      3|
|   Mavs|    18|      7|
|   Nets|    33|     11|
|Wizards|    24|      5|
+-------+------+-------+

Notice that only the id, points and assists columns are shown in the joined DataFrame.
