Do an Inner Join in PySpark (With Example)


You can use the following basic syntax to perform an inner join in PySpark:

df_joined = df1.join(df2, on=['team'], how='inner').show()

This particular example will perform an inner join using the DataFrames named df1 and df2 by joining on the column named team.

The following example shows how to use this syntax in practice.

Example: How to Do an Inner Join in PySpark

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['Mavs', 11], 
       ['Hawks', 25], 
       ['Nets', 32], 
       ['Kings', 15],
       ['Warriors', 22],
       ['Suns', 17]]

#define column names
columns1 = ['team', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data1, columns1) 
  
#view dataframe
df1.show()

+--------+------+
|    team|points|
+--------+------+
|    Mavs|    11|
|   Hawks|    25|
|    Nets|    32|
|   Kings|    15|
|Warriors|    22|
|    Suns|    17|
+--------+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['Mavs', 4], 
       ['Nets', 7], 
       ['Suns', 8], 
       ['Grizzlies', 12],
       ['Kings', 7]]

#define column names
columns2 = ['team', 'assists'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+---------+-------+
|     team|assists|
+---------+-------+
|     Mavs|      4|
|     Nets|      7|
|     Suns|      8|
|Grizzlies|     12|
|    Kings|      7|
+---------+-------+

We can use the following syntax to perform an inner join between these two DataFrames by joining on values from the team column:

#perform inner join using 'team' column
df_joined = df1.join(df2, on=['team'], how='inner').show()

+-----+------+-------+
| team|points|assists|
+-----+------+-------+
|Kings|    15|      7|
| Mavs|    11|      4|
| Nets|    32|      7|
| Suns|    17|      8|
+-----+------+-------+

Notice that the resulting DataFrame only contains the rows where the team value occurred in both DataFrames.

For example, the team names Kings, Mavs, Nets and Suns all occurred in both DataFrames, which is why these were the four rows that were included in the final joined DataFrame.

x