Table of Contents
The partitionBy() method in PySpark allows for data to be partitioned by one or more columns. This means that the data is split into groups based on the values of the columns that are specified in the partitionBy() method. The data can then be processed in parallel, making it faster and more efficient. This method can be used with multiple columns, meaning that it can be used to partition data based on multiple criteria. This can be useful when looking for patterns, or for further analysis.
You can use the following syntax to use Window.partitionBy() with multiple columns in PySpark:
from pyspark.sql.window import Window partition_cols = ['col1', 'col2'] w = Window.partitionBy(*partition_cols)
This particular example passes the columns named col1 and col2 to the partitionBy function.
Note that the * operator is used to unpack an iterable into a function call.
Thus, by using the * operator we’re able to pass each of the elements in partition_cols without specifying each element individually.
The following example shows how to use this syntax in practice.
Example: How to Use partitionBy() with Multiple Columns in PySpark
Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:
from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() #define data data = [['A', 'Guard', 11], ['A', 'Guard', 8], ['A', 'Forward', 21], ['A', 'Forward', 22], ['A', 'Forward', 30], ['B', 'Guard', 14], ['B', 'Guard', 14], ['B', 'Forward', 13], ['B', 'Forward', 7]] #define column names columns = ['team', 'position', 'points'] #create dataframe using data and column names df = spark.createDataFrame(data, columns) #view dataframe df.show() +----+--------+------+ |team|position|points| +----+--------+------+ | A| Guard| 11| | A| Guard| 8| | A| Forward| 21| | A| Forward| 22| | A| Forward| 30| | B| Guard| 14| | B| Guard| 14| | B| Forward| 13| | B| Forward| 7| +----+--------+------+
Suppose we would like to add a new column named id that contains row numbers for each row in the DataFrame, grouped by the team and position columns.
To do so, we can use the following syntax to pass each of these columns to the partitionBy function and then add a new column that contains row numbers:
from pyspark.sql.functions import row_number,lit from pyspark.sql.window import Window #specify columns to partition by partition_cols = ['team', 'position'] #specify window w = Window.partitionBy(*partition_cols).orderBy(lit('A')) #add column called 'id' that contains row numbers df = df.withColumn('id', row_number().over(w)) #view updated DataFrame df.show() +----+--------+------+---+ |team|position|points| id| +----+--------+------+---+ | A| Forward| 21| 1| | A| Forward| 22| 2| | A| Forward| 30| 3| | A| Guard| 11| 1| | A| Guard| 8| 2| | B| Forward| 13| 1| | B| Forward| 7| 2| | B| Guard| 14| 1| | B| Guard| 14| 2| +----+--------+------+---+
The resulting DataFrame contains row numbers for each row, grouped by the team and position columns.
Note #1: In this example, we passed two column names to the partitionBy function but you can include as many column names as you’d like.
Note #2: You can find the complete documentation for the PySpark partitionBy function .
The following tutorials explain how to perform other common tasks in PySpark: