How can I perform a Right Join in PySpark? Can you provide an example?

A right join in PySpark is a join operation that combines two DataFrames on a common key, keeping all rows from the right DataFrame and only the matching rows from the left DataFrame. This is useful for merging data while ensuring that nothing from the right DataFrame is dropped. To perform a right join in PySpark, you call .join() on the left DataFrame, pass the right DataFrame as the first argument, and set the how argument to 'right' (or the equivalent 'right_outer') to indicate the join type. For example, joining a customer table (left) with a sales table (right) on a customer ID would return every sale along with the matching customer information, including sales whose customer ID has no match in the customer table.

How to Do a Right Join in PySpark (With Example)

You can use the following basic syntax to perform a right join in PySpark:

df_joined = df1.join(df2, on=['team'], how='right')
df_joined.show()

This particular example will perform a right join using the DataFrames named df1 and df2 by joining on the column named team.

All rows from df2 will be returned in the final DataFrame but only the rows from df1 that have a matching value in the team column will be returned.

The following example shows how to use this syntax in practice.

Example: How to Do a Right Join in PySpark

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['Mavs', 11], 
       ['Hawks', 25], 
       ['Nets', 32], 
       ['Kings', 15],
       ['Warriors', 22],
       ['Suns', 17]]

#define column names
columns1 = ['team', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data1, columns1) 
  
#view dataframe
df1.show()

+--------+------+
|    team|points|
+--------+------+
|    Mavs|    11|
|   Hawks|    25|
|    Nets|    32|
|   Kings|    15|
|Warriors|    22|
|    Suns|    17|
+--------+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['Mavs', 4], 
       ['Nets', 7], 
       ['Suns', 8], 
       ['Grizzlies', 12],
       ['Kings', 7]]

#define column names
columns2 = ['team', 'assists'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+---------+-------+
|     team|assists|
+---------+-------+
|     Mavs|      4|
|     Nets|      7|
|     Suns|      8|
|Grizzlies|     12|
|    Kings|      7|
+---------+-------+

We can use the following syntax to perform a right join between these two DataFrames by joining on values from the team column:

#perform right join using 'team' column
df_joined = df1.join(df2, on=['team'], how='right')
df_joined.show()

+---------+------+-------+
|     team|points|assists|
+---------+------+-------+
|     Mavs|    11|      4|
|     Nets|    32|      7|
|     Suns|    17|      8|
|Grizzlies|  null|     12|
|    Kings|    15|      7|
+---------+------+-------+

Notice that the resulting DataFrame contains all rows from the right DataFrame (df2) but only the rows from the left DataFrame (df1) that had a matching value in the team column.

Note that if the left DataFrame did not contain a matching team value for any team in the right DataFrame, a value of null is used in the points column.

For example, the team name “Grizzlies” did not exist in df1, so this row received a value of null in the points column of the final joined DataFrame.
