How can I use PySpark to create a new column with random numbers?


You can use the following methods to create a new column in a PySpark DataFrame that contains random numbers:

Method 1: Create New Column with Random Decimal Numbers

from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()

Method 2: Create New Column with Random Integers

#note: this import shadows Python's built-in round() in the current namespace
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random whole numbers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()

The examples below show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Example 1: Create New Column with Random Decimal Numbers

We can use the following syntax to add a new column named rand to the DataFrame that contains random decimal numbers between 0 and 100:

from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()

+-------+------+------------------+
|   team|points|              rand|
+-------+------+------------------+
|   Mavs|    18| 93.88044512577216|
|   Nets|    33|39.432553969527554|
| Lakers|    12|23.260361399084918|
|  Kings|    15| 2.339183228862929|
|  Hawks|    19| 82.53753350983487|
|Wizards|    24| 88.94415403143505|
|  Magic|    28| 80.81524027081029|
|   Jazz|    40| 59.56629641640896|
|Thunder|    24| 27.62195585886885|
|  Spurs|    13| 70.43214981152886|
+-------+------+------------------+

Notice that the new rand column contains random decimal numbers between 0 and 100.

Note #1: By specifying a value for seed within the rand() function, we can generate the same random numbers each time we run the code, provided the DataFrame is partitioned the same way on each run.

Note #2: The rand() function returns a value that is at least 0 and strictly less than 1. Thus, the number that we multiply the rand() function by sets the upper bound of the values that can be returned. In this example, we set the upper bound to 100.

Example 2: Create New Column with Random Integers

We can use the following syntax to add a new column named rand to the DataFrame that contains random whole numbers between 0 and 100:

#note: this import shadows Python's built-in round() in the current namespace
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random whole numbers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()


+-------+------+----+
|   team|points|rand|
+-------+------+----+
|   Mavs|    18|94.0|
|   Nets|    33|39.0|
| Lakers|    12|23.0|
|  Kings|    15| 2.0|
|  Hawks|    19|83.0|
|Wizards|    24|89.0|
|  Magic|    28|81.0|
|   Jazz|    40|60.0|
|Thunder|    24|28.0|
|  Spurs|    13|70.0|
+-------+------+----+

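Notice that the new rand column contains random whole numbers between 0 and 100. Because round() returns a double here, the values are displayed with a trailing .0 (for example, 94.0) rather than as true integers.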
