How can I create an empty DataFrame in PySpark with specified column names?

To create an empty DataFrame in PySpark with specified column names, we can use the `createDataFrame()` method of the SparkSession. This method takes in the data along with a schema. Since we want an empty DataFrame, we can pass in an empty list for the data parameter, and because Spark cannot infer column types from zero rows, we must also pass an explicit schema that defines each column's name and type. This will create a DataFrame with the specified column names and no data.


You can use the following syntax to create an empty PySpark DataFrame with specific column names:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.types import StructType, StructField, StringType, FloatType

#create empty RDD
empty_rdd = spark.sparkContext.emptyRDD()

#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]

#create DataFrame with specific column names
df = spark.createDataFrame(empty_rdd, schema=StructType(my_columns))

This particular example creates a DataFrame called df with three columns: team, position and points.

The following example shows how to use this syntax in practice.

Example: Create Empty PySpark DataFrame with Column Names

We can use the following syntax to create an empty PySpark DataFrame with specific column names:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.types import StructType, StructField, StringType, FloatType

#create empty RDD
empty_rdd = spark.sparkContext.emptyRDD()

#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]

#create DataFrame with specific column names
df = spark.createDataFrame(empty_rdd, schema=StructType(my_columns))

#view DataFrame
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
+----+--------+------+

We can see that an empty PySpark DataFrame has been created with the following column names: team, position and points.

We can also use the following syntax to view the schema of the DataFrame:

#view schema of DataFrame
df.printSchema()

root
 |-- team: string (nullable = true)
 |-- position: string (nullable = true)
 |-- points: float (nullable = true)

From the output we can see:

  • The team field is a string.
  • The position field is a string.
  • The points field is a float.

Note: You can find a complete list of data types that you can specify for columns in a PySpark DataFrame in the pyspark.sql.types module documentation.
