How to Multiply Two Columns in PySpark (With Examples)

In PySpark, you can multiply two columns by applying the multiplication operator (*) to two Column objects, since arithmetic operators are overloaded for columns. Combining this with the withColumn() method creates a new DataFrame that contains the result of the multiplication as a new column, which is useful for quickly performing element-wise calculations during data analysis tasks.

You can use the following methods to multiply two columns in a PySpark DataFrame:

Method 1: Multiply Two Columns

df_new = df.withColumn('revenue', df.price * df.amount)

This particular example creates a new column called revenue that multiplies the values in the price and amount columns.
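
Note that the same multiplication can also be written with the col() function or with bracket indexing, both of which avoid issues when a column name clashes with a DataFrame attribute. A minimal equivalent sketch:

from pyspark.sql.functions import col

#equivalent ways to multiply the price and amount columns
df_new = df.withColumn('revenue', col('price') * col('amount'))
df_new = df.withColumn('revenue', df['price'] * df['amount'])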

Method 2: Multiply Two Columns Based on Condition

from pyspark.sql.functions import when

df_new = df.withColumn('revenue', when(df.type == 'refund', 0)
                       .otherwise(df.price * df.amount))

This particular example creates a new column called revenue that returns 0 if the value in the type column is 'refund' and otherwise returns the product of the values in the price and amount columns.
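
If you need more than two outcomes, when() calls can be chained before otherwise(). A minimal sketch, assuming a hypothetical 'exchange' type that should also produce zero revenue:

from pyspark.sql.functions import when

#return 0 for refunds and exchanges, otherwise return price * amount
df_new = df.withColumn('revenue', when(df.type == 'refund', 0)
                       .when(df.type == 'exchange', 0)
                       .otherwise(df.price * df.amount))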

The following examples show how to use each method in practice.

Example 1: Multiply Two Columns

Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store and the amount sold:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 2], 
        [10, 3], 
        [20, 4], 
        [12, 3], 
        [7, 3],
        [12, 5],
        [10, 2],
        [10, 3]]
  
#define column names
columns = ['price', 'amount'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+
|price|amount|
+-----+------+
|   14|     2|
|   10|     3|
|   20|     4|
|   12|     3|
|    7|     3|
|   12|     5|
|   10|     2|
|   10|     3|
+-----+------+

We can use the following syntax to create a new column called revenue that multiplies the values in the price and amount columns:

#create new column called 'revenue' that multiplies price by amount
df_new = df.withColumn('revenue', df.price * df.amount)

#view new DataFrame
df_new.show()

+-----+------+-------+
|price|amount|revenue|
+-----+------+-------+
|   14|     2|     28|
|   10|     3|     30|
|   20|     4|     80|
|   12|     3|     36|
|    7|     3|     21|
|   12|     5|     60|
|   10|     2|     20|
|   10|     3|     30|
+-----+------+-------+

Notice that the values in the new revenue column are the product of the values in the price and amount columns.
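
As a quick sanity check, you can total the new column with the sum() function from pyspark.sql.functions. A minimal sketch:

from pyspark.sql.functions import sum

#calculate total revenue across all rows
#(note: this import shadows Python's built-in sum)
df_new.agg(sum('revenue').alias('total_revenue')).show()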

Example 2: Multiply Two Columns Based on Condition

Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store, the amount sold, and whether the transaction was a sale or a refund:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 2, 'sale'], 
        [10, 3, 'sale'], 
        [20, 4, 'refund'], 
        [12, 3, 'sale'], 
        [7, 3, 'refund'],
        [12, 5, 'refund'],
        [10, 2, 'sale'],
        [10, 3, 'sale']]
  
#define column names
columns = ['price', 'amount', 'type'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+------+
|price|amount|  type|
+-----+------+------+
|   14|     2|  sale|
|   10|     3|  sale|
|   20|     4|refund|
|   12|     3|  sale|
|    7|     3|refund|
|   12|     5|refund|
|   10|     2|  sale|
|   10|     3|  sale|
+-----+------+------+

We can use the following syntax to create a new column called revenue that returns 0 for refunds and otherwise multiplies the values in the price and amount columns:

from pyspark.sql.functions import when

#create new column called 'revenue'
df_new = df.withColumn('revenue', when(df.type == 'refund', 0)
                       .otherwise(df.price * df.amount))

#view new DataFrame
df_new.show()

+-----+------+------+-------+
|price|amount|  type|revenue|
+-----+------+------+-------+
|   14|     2|  sale|     28|
|   10|     3|  sale|     30|
|   20|     4|refund|      0|
|   12|     3|  sale|     36|
|    7|     3|refund|      0|
|   12|     5|refund|      0|
|   10|     2|  sale|     20|
|   10|     3|  sale|     30|
+-----+------+------+-------+

Notice that the values in the new revenue column are dependent on the corresponding values in the type column.
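
One way to confirm the conditional logic is to total revenue by transaction type, since every refund row should contribute zero. A minimal sketch:

from pyspark.sql.functions import sum

#calculate total revenue for each transaction type
df_new.groupBy('type').agg(sum('revenue').alias('total_revenue')).show()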
