How can I convert a PySpark DataFrame to Pandas, and can you provide an example?

Converting a PySpark DataFrame to pandas lets you manipulate and analyze the data with the pandas library. You can do this with the `toPandas()` method, which returns the contents of the PySpark DataFrame as a pandas DataFrame. For example, given a PySpark DataFrame named `df`, the command `pandas_df = df.toPandas()` creates a new pandas DataFrame called `pandas_df` that can then be used for further analysis.

Convert PySpark DataFrame to Pandas (With Example)


You can use the toPandas() method to convert a PySpark DataFrame to a pandas DataFrame:

pandas_df = pyspark_df.toPandas()

This particular example will convert the PySpark DataFrame named pyspark_df to a pandas DataFrame named pandas_df.

The following example shows how to use this syntax in practice.

Example: How to Convert PySpark DataFrame to Pandas DataFrame

Suppose we create the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create DataFrame using data and column names
pyspark_df = spark.createDataFrame(data, columns) 
  
#view PySpark DataFrame
pyspark_df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

We can verify that this object is a PySpark DataFrame by using the type() function:

#check object type
type(pyspark_df)

pyspark.sql.dataframe.DataFrame

We can see that the object pyspark_df is indeed a PySpark DataFrame.

We can then use the following syntax to convert the PySpark DataFrame to a pandas DataFrame:

#convert PySpark DataFrame to pandas DataFrame
pandas_df = pyspark_df.toPandas()

#view first five rows of pandas DataFrame
print(pandas_df.head())

  team conference  points  assists
0    A       East      11        4
1    A       East       8        9
2    A       East      10        3
3    B       West       6       12
4    B       West       6        4

We can see that the PySpark DataFrame has been converted to a pandas DataFrame.

We can verify that the pandas_df object is a pandas DataFrame by using the type() function once again:

#check object type
type(pandas_df)

pandas.core.frame.DataFrame

We can see that the object pandas_df is indeed a pandas DataFrame.
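
Once converted, the object behaves like any other pandas DataFrame. As a minimal sketch, the following code rebuilds the same six rows directly in pandas (a stand-in for the output of toPandas(), so that the snippet runs without a Spark session) and calculates the mean points per team. The variable name mean_points is used only for illustration.

```python
import pandas as pd

# stand-in for the result of pyspark_df.toPandas(): same rows, built directly in pandas
pandas_df = pd.DataFrame(
    [['A', 'East', 11, 4],
     ['A', 'East', 8, 9],
     ['A', 'East', 10, 3],
     ['B', 'West', 6, 12],
     ['B', 'West', 6, 4],
     ['C', 'East', 5, 2]],
    columns=['team', 'conference', 'points', 'assists'],
)

# calculate mean points for each team using ordinary pandas syntax
mean_points = pandas_df.groupby('team')['points'].mean()

# view result
print(mean_points)
```

Any other pandas operation, such as filtering or sorting, could be applied to pandas_df in the same way.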

Note: You can find the complete documentation for the PySpark toPandas method in the official PySpark API reference, under pyspark.sql.DataFrame.toPandas.
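
Keep in mind that toPandas() loads every row into the memory of the driver, so it is best suited to DataFrames that are small enough to fit on one machine. One possible safeguard, sketched below, is to cap the number of rows with limit() before converting. The helper name to_pandas_capped and the default of 10,000 rows are assumptions made for illustration and are not part of PySpark.

```python
def to_pandas_capped(spark_df, max_rows=10_000):
    """Convert at most max_rows rows of a PySpark DataFrame to pandas.

    Hypothetical helper: the name and the default cap are illustrative only.
    """
    if max_rows < 0:
        raise ValueError("max_rows must be non-negative")

    # limit() returns a new PySpark DataFrame with at most max_rows rows,
    # so only those rows are collected to the driver by toPandas()
    return spark_df.limit(max_rows).toPandas()


# example usage with the DataFrame from this tutorial (requires a Spark session):
# pandas_df = to_pandas_capped(pyspark_df, max_rows=3)
```

Note that limit() does not guarantee which rows are kept unless the DataFrame has been sorted first, so this approach is most useful for previews and quick checks.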
