How can I create an empty DataFrame in PySpark with specific column names?

Creating an empty DataFrame in PySpark with specific column names can be achieved by importing the necessary libraries and then using the `createDataFrame()` method. This method takes two parameters: the data and a schema that defines the column names and the data type of each column. By passing in an empty list (or an empty RDD) for the data, an empty DataFrame with the specified column names is created. This allows you to define the DataFrame structure before adding any data, which makes it a useful tool for data manipulation and analysis in PySpark.
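For instance, a compact variant that passes the schema as a DDL-style string instead of building a `StructType` also works (a minimal sketch; the column names mirror the example below):

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#schema given as a DDL-style string instead of a StructType
df = spark.createDataFrame([], schema='team STRING, position STRING, points FLOAT')
df.printSchema()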

PySpark: Create Empty DataFrame with Column Names


You can use the following syntax to create an empty PySpark DataFrame with specific column names:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.getOrCreate()

#create empty RDD
empty_rdd = spark.sparkContext.emptyRDD()

#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]

#create DataFrame with specific column names
df = spark.createDataFrame(empty_rdd, schema=StructType(my_columns))

This particular example creates an empty DataFrame called df with three columns: team, position, and points.
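In recent PySpark versions the empty RDD step is optional; passing an empty list with the same schema produces an identical result:

#equivalent shortcut: pass an empty list instead of an empty RDD
df = spark.createDataFrame([], schema=StructType(my_columns))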

The following example shows how to use this syntax in practice.

Example: Create Empty PySpark DataFrame with Column Names

We can use the following syntax to create an empty PySpark DataFrame with specific column names:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.getOrCreate()

#create empty RDD
empty_rdd = spark.sparkContext.emptyRDD()

#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]

#create DataFrame with specific column names
df = spark.createDataFrame(empty_rdd, schema=StructType(my_columns))

#view DataFrame
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
+----+--------+------+

We can see that an empty PySpark DataFrame has been created with the following column names: team, position and points.
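Because the empty DataFrame already carries a schema, we can append rows to it later with union(). For example (the team data here is made up purely for illustration):

#hypothetical rows that match the schema (team, position, points)
new_rows = spark.createDataFrame([('Hawks', 'Guard', 22.0),
                                  ('Celtics', 'Forward', 18.5)],
                                 schema=df.schema)

df_with_rows = df.union(new_rows)
df_with_rows.show()

+-------+--------+------+
|   team|position|points|
+-------+--------+------+
|  Hawks|   Guard|  22.0|
|Celtics| Forward|  18.5|
+-------+--------+------+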

We can also use the following syntax to view the schema of the DataFrame:

#view schema of DataFrame
df.printSchema()

root
 |-- team: string (nullable = true)
 |-- position: string (nullable = true)
 |-- points: float (nullable = true)

From the output we can see:

  • The team field is a string.
  • The position field is a string.
  • The points field is a float.
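If you need this type information programmatically rather than as printed output, the dtypes attribute returns it as a list of (column name, type) tuples:

#view column names and types as a list of tuples
print(df.dtypes)

[('team', 'string'), ('position', 'string'), ('points', 'float')]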

Note: A complete list of data types that you can specify for columns in a PySpark DataFrame is available in the PySpark documentation.
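The same pattern extends to other field types. For instance, a schema mixing integer, date, and boolean columns might look like this (the field names here are just placeholders):

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DateType, BooleanType)

#hypothetical schema mixing several common column types
other_columns = [StructField('player_id', IntegerType(), True),
                 StructField('name', StringType(), True),
                 StructField('debut_date', DateType(), True),
                 StructField('is_active', BooleanType(), True)]

df2 = spark.createDataFrame([], schema=StructType(other_columns))
df2.printSchema()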
