How can I use the partitionBy() function with multiple columns in PySpark?

The partitionBy() function in PySpark partitions data based on a set of columns: rows are grouped into a separate partition for each unique combination of values in those columns, and each partition can then be processed independently. To use partitionBy() with multiple columns, simply pass the column names as arguments to the function. This enables more efficient and targeted data analysis.

PySpark: Use partitionBy() with Multiple Columns


You can use the following syntax to use Window.partitionBy() with multiple columns in PySpark:

from pyspark.sql.window import Window

partition_cols = ['col1', 'col2']

w = Window.partitionBy(*partition_cols)

This particular example passes the columns named col1 and col2 to the partitionBy function.

Note that the * operator is used to unpack an iterable into a function call.

Thus, by using the * operator we’re able to pass each of the elements in partition_cols without specifying each element individually.
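For example, since partition_cols contains 'col1' and 'col2', the following two window specifications are equivalent:

#these two window specifications are equivalent
w1 = Window.partitionBy(*partition_cols)
w2 = Window.partitionBy('col1', 'col2')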

The following example shows how to use this syntax in practice.

Example: How to Use partitionBy() with Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 21], 
        ['A', 'Forward', 22],
        ['A', 'Forward', 30], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    21|
|   A| Forward|    22|
|   A| Forward|    30|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Suppose we would like to add a new column named id that contains row numbers for each row in the DataFrame, grouped by the team and position columns.

To do so, we can use the following syntax to pass each of these columns to the partitionBy function and then add a new column that contains row numbers:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

#specify columns to partition by
partition_cols = ['team', 'position']

#specify window; order by a constant literal since row_number() requires an ordering
w = Window.partitionBy(*partition_cols).orderBy(lit('A'))

#add column called 'id' that contains row numbers
df = df.withColumn('id', row_number().over(w))

#view updated DataFrame
df.show()

+----+--------+------+---+
|team|position|points| id|
+----+--------+------+---+
|   A| Forward|    21|  1|
|   A| Forward|    22|  2|
|   A| Forward|    30|  3|
|   A|   Guard|    11|  1|
|   A|   Guard|     8|  2|
|   B| Forward|    13|  1|
|   B| Forward|     7|  2|
|   B|   Guard|    14|  1|
|   B|   Guard|    14|  2|
+----+--------+------+---+

The resulting DataFrame contains row numbers for each row, grouped by the team and position columns.
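The same multi-column partition can also be reused with other window functions. For instance, here is a minimal sketch (reusing the df and partition_cols defined above) that adds the maximum points within each team and position group:

from pyspark.sql.functions import max as max_

#add column containing the max points within each team/position group
df = df.withColumn('max_points', max_('points').over(Window.partitionBy(*partition_cols)))

Because no orderBy is specified here, the max is computed over every row in each partition.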

Note #1: In this example, we passed two column names to the partitionBy function, but you can include as many column names as you'd like.

Note #2: You can find the complete documentation for the PySpark Window.partitionBy function in the official PySpark API reference.
