How can I use PySpark to create a new column in a dataframe with random numbers?

PySpark can add a column of random numbers to a DataFrame by combining the withColumn() method with the rand() function from pyspark.sql.functions. This is useful for tasks such as data sampling, feature engineering, and creating synthetic datasets. The sections below show how to generate random decimal numbers and random whole numbers, and how to make the results reproducible with a seed.

PySpark: Create New Column with Random Numbers


You can use the following methods to create a new column in a PySpark DataFrame that contains random numbers:

Method 1: Create New Column with Random Decimal Numbers

from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()

Method 2: Create New Column with Random Integers

#note: importing round from pyspark.sql.functions shadows Python's built-in round()
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random integers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Example 1: Create New Column with Random Decimal Numbers

We can use the following syntax to add a new column to the DataFrame named rand that contains random decimal numbers between 0 and 100:

from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()

+-------+------+------------------+
|   team|points|              rand|
+-------+------+------------------+
|   Mavs|    18| 93.88044512577216|
|   Nets|    33|39.432553969527554|
| Lakers|    12|23.260361399084918|
|  Kings|    15| 2.339183228862929|
|  Hawks|    19| 82.53753350983487|
|Wizards|    24| 88.94415403143505|
|  Magic|    28| 80.81524027081029|
|   Jazz|    40| 59.56629641640896|
|Thunder|    24| 27.62195585886885|
|  Spurs|    13| 70.43214981152886|
+-------+------+------------------+

Notice that the new rand column contains random decimal numbers between 0 and 100.

Note #1: By specifying a value for seed within the rand() function, we generate the same random numbers each time we run the code, provided the DataFrame is partitioned the same way on each run.

Note #2: The rand() function returns a value drawn uniformly from the range 0 (inclusive) to 1 (exclusive). Thus, the number that we multiply the rand() function by sets the upper bound of the values that can be returned. In this example, we set the upper bound to 100.

Example 2: Create New Column with Random Integers

We can use the following syntax to add a new column to the DataFrame named rand that contains random integers between 0 and 100:

from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random integers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()

+-------+------+----+
|   team|points|rand|
+-------+------+----+
|   Mavs|    18|94.0|
|   Nets|    33|39.0|
| Lakers|    12|23.0|
|  Kings|    15| 2.0|
|  Hawks|    19|83.0|
|Wizards|    24|89.0|
|  Magic|    28|81.0|
|   Jazz|    40|60.0|
|Thunder|    24|28.0|
|  Spurs|    13|70.0|
+-------+------+----+

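Notice that the new rand column contains random whole numbers between 0 and 100. Because round() returns a double here, each value is displayed with a trailing .0 (for example, 94.0).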
