PySpark makes it straightforward to add a new column of random numbers to a DataFrame using built-in functions such as rand(). A random column is useful for tasks such as data sampling, feature engineering, and creating synthetic datasets.
PySpark: Create New Column with Random Numbers
You can use the following methods to create a new column in a PySpark DataFrame that contains random numbers:
Method 1: Create New Column with Random Decimal Numbers
from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()
Method 2: Create New Column with Random Integers
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random integers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Lakers', 12],
        ['Kings', 15],
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Example 1: Create New Column with Random Decimal Numbers
We can use the following syntax to add a new column named rand to the DataFrame that contains random decimal numbers between 0 and 100:
from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()

+-------+------+------------------+
|   team|points|              rand|
+-------+------+------------------+
|   Mavs|    18| 93.88044512577216|
|   Nets|    33|39.432553969527554|
| Lakers|    12|23.260361399084918|
|  Kings|    15| 2.339183228862929|
|  Hawks|    19| 82.53753350983487|
|Wizards|    24| 88.94415403143505|
|  Magic|    28| 80.81524027081029|
|   Jazz|    40| 59.56629641640896|
|Thunder|    24| 27.62195585886885|
|  Spurs|    13| 70.43214981152886|
+-------+------+------------------+
Notice that the new rand column contains random decimal numbers between 0 and 100.
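Note #1: By specifying a value for seed within the rand() function, we generate the same random numbers each time we run the code, provided the DataFrame's rows and partitioning stay the same. If the data is repartitioned, the values assigned to each row can change even with the same seed.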
Note #2: The rand() function returns a value drawn uniformly from the range 0 (inclusive) to 1 (exclusive). Thus, the number that we multiply rand() by sets the upper bound of the values that can be returned. In this example, we set the upper bound to 100.
Example 2: Create New Column with Random Integers
We can use the following syntax to add a new column named rand to the DataFrame that contains random integers between 0 and 100:
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random integers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()

+-------+------+----+
|   team|points|rand|
+-------+------+----+
|   Mavs|    18|94.0|
|   Nets|    33|39.0|
| Lakers|    12|23.0|
|  Kings|    15| 2.0|
|  Hawks|    19|83.0|
|Wizards|    24|89.0|
|  Magic|    28|81.0|
|   Jazz|    40|60.0|
|Thunder|    24|28.0|
|  Spurs|    13|70.0|
+-------+------+----+
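Notice that the new rand column contains random whole numbers between 0 and 100. Because round() returns a double here, the values are displayed as 94.0, 39.0, and so on, and the column type is double rather than integer.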