How can I multiply two columns in PySpark? Can you provide some examples?

In PySpark, you can multiply two columns using the "withColumn" function. This function takes two arguments: the name of the new column and a column expression that computes its values. For multiplication, the "*" operator is applied to the two columns. Typical examples of this operation would be multiplying a "price" column by a "quantity" column to get the total cost of a product, or multiplying a "sales" column by a "profit margin" column to get the profit earned from sales.

Multiply Two Columns in PySpark (With Examples)


You can use the following methods to multiply two columns in a PySpark DataFrame:

Method 1: Multiply Two Columns

df_new = df.withColumn('revenue', df.price * df.amount)

This particular example creates a new column called revenue that multiplies the values in the price and amount columns.

Method 2: Multiply Two Columns Based on Condition

from pyspark.sql.functions import when

df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                       .otherwise(df.price * df.amount))

This particular example creates a new column called revenue that returns 0 if the value in the type column is ‘refund’, otherwise it returns the product of the values in the price and amount columns.

The following examples show how to use each method in practice.

Example 1: Multiply Two Columns

Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store and the amount sold:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 2], 
        [10, 3], 
        [20, 4], 
        [12, 3], 
        [7, 3],
        [12, 5],
        [10, 2],
        [10, 3]]
  
#define column names
columns = ['price', 'amount'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+
|price|amount|
+-----+------+
|   14|     2|
|   10|     3|
|   20|     4|
|   12|     3|
|    7|     3|
|   12|     5|
|   10|     2|
|   10|     3|
+-----+------+

We can use the following syntax to create a new column called revenue that multiplies the values in the price and amount columns:

#create new column called 'revenue' that multiplies price by amount
df_new = df.withColumn('revenue', df.price * df.amount)

#view new DataFrame
df_new.show()

+-----+------+-------+
|price|amount|revenue|
+-----+------+-------+
|   14|     2|     28|
|   10|     3|     30|
|   20|     4|     80|
|   12|     3|     36|
|    7|     3|     21|
|   12|     5|     60|
|   10|     2|     20|
|   10|     3|     30|
+-----+------+-------+

Notice that the values in the new revenue column are the product of the values in the price and amount columns.

Example 2: Multiply Two Columns Based on Condition

Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store, the amount sold, and whether or not the transaction was a sale or a refund:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 2, 'sale'], 
        [10, 3, 'sale'], 
        [20, 4, 'refund'], 
        [12, 3, 'sale'], 
        [7, 3, 'refund'],
        [12, 5, 'refund'],
        [10, 2, 'sale'],
        [10, 3, 'sale']]
  
#define column names
columns = ['price', 'amount', 'type'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+------+
|price|amount|  type|
+-----+------+------+
|   14|     2|  sale|
|   10|     3|  sale|
|   20|     4|refund|
|   12|     3|  sale|
|    7|     3|refund|
|   12|     5|refund|
|   10|     2|  sale|
|   10|     3|  sale|
+-----+------+------+

We can use the following syntax to create a new column called revenue that returns 0 for refunds and the product of the price and amount columns otherwise:

from pyspark.sql.functions import when

#create new column called 'revenue'
df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                       .otherwise(df.price * df.amount))

#view new DataFrame
df_new.show()

+-----+------+------+-------+
|price|amount|  type|revenue|
+-----+------+------+-------+
|   14|     2|  sale|     28|
|   10|     3|  sale|     30|
|   20|     4|refund|      0|
|   12|     3|  sale|     36|
|    7|     3|refund|      0|
|   12|     5|refund|      0|
|   10|     2|  sale|     20|
|   10|     3|  sale|     30|
+-----+------+------+-------+

Notice that the values in the new revenue column are dependent on the corresponding values in the type column.
