PySpark: Convert RDD to DataFrame (With Example)


You can use the toDF() function to convert an RDD (resilient distributed dataset) to a DataFrame in PySpark:

my_df = my_RDD.toDF()

This particular example will convert the RDD named my_RDD to a DataFrame called my_df.

The following example shows how to use this syntax in practice.

Example: How to Convert RDD to DataFrame in PySpark

First, let’s create the following RDD:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [('A', 11), 
        ('B', 19), 
        ('C', 22), 
        ('D', 25), 
        ('E', 12), 
        ('F', 41)] 
  
#create RDD using data
my_RDD = spark.sparkContext.parallelize(data)

We can verify that this object is an RDD by using the type() function:

#check object type
type(my_RDD)

pyspark.rdd.RDD

We can see that the object my_RDD is indeed an RDD.

We can then use the following syntax to convert the RDD to a PySpark DataFrame:

#convert RDD to DataFrame
my_df = my_RDD.toDF()

#view DataFrame
my_df.show()

+---+---+
| _1| _2|
+---+---+
|  A| 11|
|  B| 19|
|  C| 22|
|  D| 25|
|  E| 12|
|  F| 41|
+---+---+

We can see that the RDD has been converted to a DataFrame.

We can verify that the my_df object is a DataFrame by using the type() function once again:

#check object type
type(my_df)

pyspark.sql.dataframe.DataFrame

We can see that the object my_df is indeed a DataFrame.

Note that the toDF() function uses the column names _1 and _2 by default. To use your own column names instead, pass a list of names to toDF():

#convert RDD to DataFrame with specific column names
my_df = my_RDD.toDF(['player', 'assists'])

#view DataFrame
my_df.show()

+------+-------+
|player|assists|
+------+-------+
|     A|     11|
|     B|     19|
|     C|     22|
|     D|     25|
|     E|     12|
|     F|     41|
+------+-------+

Notice that the RDD has now been converted to a DataFrame with the column names player and assists.
