You can use the toPandas() function to convert a PySpark DataFrame to a pandas DataFrame:
pandas_df = pyspark_df.toPandas()
This particular example will convert the PySpark DataFrame named pyspark_df to a pandas DataFrame named pandas_df.
The following example shows how to use this syntax in practice.
Example: How to Convert PySpark DataFrame to Pandas DataFrame
Suppose we create the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create DataFrame using data and column names
pyspark_df = spark.createDataFrame(data, columns)
#view PySpark DataFrame
pyspark_df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
We can verify that this object is a PySpark DataFrame by using the type() function:
#check object type
type(pyspark_df)

pyspark.sql.dataframe.DataFrame
We can see that the object pyspark_df is indeed a PySpark DataFrame.
We can then use the following syntax to convert the PySpark DataFrame to a pandas DataFrame:
#convert PySpark DataFrame to pandas DataFrame
pandas_df = pyspark_df.toPandas()
#view first five rows of pandas DataFrame
print(pandas_df.head())
  team conference  points  assists
0    A       East      11        4
1    A       East       8        9
2    A       East      10        3
3    B       West       6       12
4    B       West       6        4
We can see that the PySpark DataFrame has been converted to a pandas DataFrame.
We can verify that the pandas_df object is a pandas DataFrame by using the type() function once again:
#check object type
type(pandas_df)

pandas.core.frame.DataFrame
We can see that the object pandas_df is indeed a pandas DataFrame.
Note: The complete documentation for the PySpark toPandas function is available in the official PySpark API reference.