How can I create a new DataFrame from an existing DataFrame in PySpark?

In PySpark, you create a new DataFrame from an existing one by applying transformations such as selecting specific columns, filtering rows, or aggregating data. Because DataFrames are immutable, each transformation returns a new DataFrame, which you can assign to a new variable or write to an output file for further analysis. The two most common approaches are selecting the columns to keep or dropping the columns to exclude, as shown in the tutorial below.
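For example, a filter and an aggregation each return a new DataFrame. The following is a minimal sketch, assuming a DataFrame df with team and points columns like the one created later in this tutorial; it is not part of the examples below:

from pyspark.sql import functions as F

#keep only rows where 'points' is greater than 7
df_filtered = df.filter(F.col('points') > 7)

#compute the total points scored by each team
df_agg = df.groupBy('team').agg(F.sum('points').alias('total_points'))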

PySpark: Create New DataFrame from Existing DataFrame


There are two common ways to create a PySpark DataFrame from an existing DataFrame:

Method 1: Specify Columns to Keep From Existing DataFrame

#create new dataframe using 'team' and 'points' columns from existing dataframe
df_new = df.select('team', 'points')

Method 2: Specify Columns to Drop From Existing DataFrame

#create new dataframe using all columns from existing dataframe except 'conference'
df_new = df.drop('conference')

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
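As a side note, createDataFrame infers the column types from the Python values when no schema is supplied. You can confirm the inferred schema with printSchema:

#view the schema inferred from the data
df.printSchema()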

Example 1: Specify Columns to Keep From Existing DataFrame

We can use the following syntax to create a new PySpark DataFrame that contains only the team and points columns from the existing DataFrame:

#create new dataframe using 'team' and 'points' columns from existing dataframe
df_new = df.select('team', 'points')

#view new dataframe
df_new.show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the new DataFrame only contains the team and points columns from the existing DataFrame, just as we specified.
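Note that select also accepts Column objects, so the following sketch is equivalent to the syntax above:

#equivalent syntax using Column objects instead of column name strings
df_new = df.select(df.team, df.points)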

Example 2: Specify Columns to Drop From Existing DataFrame

We can use the following syntax to create a new PySpark DataFrame that contains all columns from the existing DataFrame except the conference column:

#create new dataframe using all columns from existing dataframe except 'conference'
df_new = df.drop('conference')

#view new dataframe
df_new.show()

+----+------+-------+
|team|points|assists|
+----+------+-------+
|   A|    11|      4|
|   A|     8|      9|
|   A|    10|      3|
|   B|     6|     12|
|   B|     6|      4|
|   C|     5|      2|
+----+------+-------+

Notice that the new DataFrame contains all columns from the existing DataFrame except the conference column.

Note: In this example we only excluded one column from the existing DataFrame, but you can drop several columns at once by passing multiple comma-separated column names to the drop function.
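For example, the following line creates a new DataFrame that excludes both the conference and assists columns:

#create new dataframe that excludes the 'conference' and 'assists' columns
df_new = df.drop('conference', 'assists')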
