How to Use partitionBy() with Multiple Columns in PySpark?

The partitionBy() method of the Window class in PySpark allows data to be partitioned by one or more columns. The rows of a DataFrame are split into groups based on the values in the specified columns, and any window function you apply is then computed separately within each group. Because the method accepts multiple columns, you can partition the data on several criteria at once, which is useful when performing group-level calculations or further analysis.
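For example, assuming a DataFrame named df with columns col1, col2 and points (illustrative names only), a window function such as avg() computed over a window partitioned by col1 and col2 returns each group's average alongside every row in that group:

from pyspark.sql.functions import avg
from pyspark.sql.window import Window

#hypothetical DataFrame df with columns col1, col2 and points
w = Window.partitionBy('col1', 'col2')

#each row receives the average of 'points' within its (col1, col2) group
df = df.withColumn('group_avg', avg('points').over(w))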


You can use the following syntax to use Window.partitionBy() with multiple columns in PySpark:

from pyspark.sql.window import Window

partition_cols = ['col1', 'col2']

w = Window.partitionBy(*partition_cols)

This particular example passes the columns named col1 and col2 to the partitionBy function.

Note that the * operator is used to unpack an iterable into a function call.

Thus, by using the * operator we’re able to pass each of the elements in partition_cols without specifying each element individually.
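For example, the two window specifications below define the same partitioning:

from pyspark.sql.window import Window

partition_cols = ['col1', 'col2']

#unpacking the list with * is equivalent to listing the columns individually
w1 = Window.partitionBy(*partition_cols)
w2 = Window.partitionBy('col1', 'col2')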

The following example shows how to use this syntax in practice.

Example: How to Use partitionBy() with Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 21], 
        ['A', 'Forward', 22],
        ['A', 'Forward', 30], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    21|
|   A| Forward|    22|
|   A| Forward|    30|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Suppose we would like to add a new column named id that contains row numbers for each row in the DataFrame, grouped by the team and position columns.

To do so, we can use the following syntax to pass each of these columns to the partitionBy function and then add a new column that contains row numbers:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

#specify columns to partition by
partition_cols = ['team', 'position']

#specify window
w = Window.partitionBy(*partition_cols).orderBy(lit('A'))

#add column called 'id' that contains row numbers
df = df.withColumn('id', row_number().over(w))

#view updated DataFrame
df.show()

+----+--------+------+---+
|team|position|points| id|
+----+--------+------+---+
|   A| Forward|    21|  1|
|   A| Forward|    22|  2|
|   A| Forward|    30|  3|
|   A|   Guard|    11|  1|
|   A|   Guard|     8|  2|
|   B| Forward|    13|  1|
|   B| Forward|     7|  2|
|   B|   Guard|    14|  1|
|   B|   Guard|    14|  2|
+----+--------+------+---+

The resulting DataFrame contains row numbers for each row, grouped by the team and position columns.
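Note that we ordered by lit('A') only because row_number() requires some ordering and we didn't care about the order of the rows within each group. If you want the row numbers to follow a meaningful order, you can order by an existing column instead. For example, the following sketch numbers the rows within each team and position group from highest to lowest points value:

from pyspark.sql.functions import row_number, desc
from pyspark.sql.window import Window

#order rows within each team/position group by points, largest first
w = Window.partitionBy('team', 'position').orderBy(desc('points'))

#the row with the most points in each group receives id 1
df = df.withColumn('id', row_number().over(w))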

Note #1: In this example, we passed two column names to the partitionBy function but you can include as many column names as you’d like.
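For instance, if the DataFrame also contained a column named season (a hypothetical column used here purely for illustration), you could partition by all three columns at once:

from pyspark.sql.functions import lit
from pyspark.sql.window import Window

#partition by three columns (assumes a hypothetical 'season' column exists)
w = Window.partitionBy('team', 'position', 'season').orderBy(lit('A'))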

Note #2: You can find the complete documentation for the PySpark Window.partitionBy function in the official PySpark documentation.
