How can I convert an RDD to a DataFrame in PySpark, and can you provide an example?

Converting an RDD (Resilient Distributed Dataset) to a DataFrame in PySpark transforms the data stored in the RDD into a structured, tabular format, which allows for easier and more efficient data analysis and manipulation. To convert an RDD to a DataFrame, use the rdd.toDF() method. Called with no arguments, toDF() assigns default column names (_1, _2, and so on); you can also pass it a list of column names. An example of converting an RDD to a DataFrame in PySpark (assuming sc is an active SparkContext) would be as follows:

rdd = sc.parallelize([(1, "John", 25), (2, "Mary", 30), (3, "Bob", 35)])

df = rdd.toDF(["id", "name", "age"])

This converts the RDD into a DataFrame with columns “id”, “name”, and “age” and allows for further data operations to be performed on the DataFrame.
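If the data starts out as a plain Python list, spark.createDataFrame() builds the same DataFrame without creating an RDD first. Here is a minimal, self-contained sketch of that alternative, assuming nothing beyond a standard PySpark installation:

from pyspark.sql import SparkSession

#start (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

#create the DataFrame directly from a list of tuples
df = spark.createDataFrame([(1, "John", 25), (2, "Mary", 30), (3, "Bob", 35)],
                           ["id", "name", "age"])

#view the result
df.show()

Both approaches yield the same result; toDF() is convenient when the data already lives in an RDD, while createDataFrame() skips the intermediate step.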

PySpark: Convert RDD to DataFrame (With Example)


You can use the toDF() function to convert an RDD (Resilient Distributed Dataset) to a DataFrame in PySpark:

my_df = my_RDD.toDF()

This particular example will convert the RDD named my_RDD to a DataFrame called my_df.

The following example shows how to use this syntax in practice.

Example: How to Convert RDD to DataFrame in PySpark

First, let’s create the following RDD:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [('A', 11), 
        ('B', 19), 
        ('C', 22), 
        ('D', 25), 
        ('E', 12), 
        ('F', 41)] 
  
#create RDD using data
my_RDD = spark.sparkContext.parallelize(data)

We can verify that this object is an RDD by using the type() function:

#check object type
type(my_RDD)

pyspark.rdd.RDD

We can see that the object my_RDD is indeed an RDD.

We can then use the following syntax to convert the RDD to a PySpark DataFrame:

#convert RDD to DataFrame
my_df = my_RDD.toDF()

#view DataFrame
my_df.show()

+---+---+
| _1| _2|
+---+---+
|  A| 11|
|  B| 19|
|  C| 22|
|  D| 25|
|  E| 12|
|  F| 41|
+---+---+

We can see that the RDD has been converted to a DataFrame.

We can verify that the my_df object is a DataFrame by using the type() function once again:

#check object type
type(my_df)

pyspark.sql.dataframe.DataFrame

We can see that the object my_df is indeed a DataFrame.

Note that the toDF() function uses the column names _1 and _2 by default. To specify your own column names, you can pass a list of names to toDF():

#convert RDD to DataFrame with specific column names
my_df = my_RDD.toDF(['player', 'assists'])

#view DataFrame
my_df.show()

+------+-------+
|player|assists|
+------+-------+
|     A|     11|
|     B|     19|
|     C|     22|
|     D|     25|
|     E|     12|
|     F|     41|
+------+-------+

Notice that the RDD has now been converted to a DataFrame with the column names player and assists.
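Also note that passing only column names to toDF() leaves Spark to infer the column types from the data. If you want explicit control over the types, you can define a StructType schema and pass the RDD to spark.createDataFrame() instead. The following is a minimal sketch under that assumption, reusing the my_RDD object created above:

from pyspark.sql.types import StructType, StructField, StringType, LongType

#define an explicit schema: a string column and a long column
schema = StructType([
    StructField('player', StringType(), True),
    StructField('assists', LongType(), True)
])

#convert RDD to DataFrame using the explicit schema
my_df = spark.createDataFrame(my_RDD, schema)

#view the column names and types
my_df.printSchema()

Specifying the schema up front avoids an extra pass over the data for type inference and makes the resulting column types predictable.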
