How can I drop the first column in a PySpark DataFrame?

To drop the first column in a PySpark DataFrame, call the DataFrame's drop method and pass the name of the column to remove. The method returns a new DataFrame that contains the remaining columns; the original DataFrame is left unchanged. It works the same way regardless of the column's data type, so numeric and string columns are handled identically.

Drop First Column in PySpark DataFrame


You can use the following methods to drop the first column from a PySpark DataFrame:

Method 1: Drop First Column by Index Position

#create new DataFrame that drops first column by index position
df_new = df.drop(df.columns[0])

Method 2: Drop First Column by Name

#create new DataFrame that drops first column by name
#('col1' is a placeholder for the actual name of the first column)
df_new = df.drop('col1')

The examples below show how to use each of these methods in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Drop First Column in PySpark by Index Position

We can use the following syntax to drop the first column in the DataFrame by index position:

#create new DataFrame that drops first column by index position
df_new = df.drop(df.columns[0])

#view new DataFrame
df_new.show()

+----------+------+-------+
|conference|points|assists|
+----------+------+-------+
|      East|    11|      4|
|      East|     8|      9|
|      East|    10|      3|
|      West|     6|     12|
|      West|     6|      4|
|      East|     5|      2|
+----------+------+-------+

Notice that only the first column (the team column) has been dropped from the DataFrame.
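
If this step is repeated across several DataFrames, it can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the PySpark API: the function name drop_first_column is hypothetical, and the check for a DataFrame with no columns is an added assumption to avoid an IndexError from df.columns[0].

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only, not needed at runtime
    from pyspark.sql import DataFrame


def drop_first_column(df: DataFrame) -> DataFrame:
    """Return a new DataFrame without its first column (hypothetical helper)."""
    # df.columns is a plain Python list of column names, in order
    if not df.columns:
        raise ValueError("DataFrame has no columns to drop")
    # drop returns a new DataFrame; df itself is not modified
    return df.drop(df.columns[0])


# usage with the df defined earlier:
# df_new = drop_first_column(df)
# df_new.columns  ->  ['conference', 'points', 'assists']
```

Because the helper looks up the name from df.columns at call time, it keeps working if the first column is later renamed.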

Example 2: Drop First Column in PySpark by Name

We can use the following syntax to drop the first column in the DataFrame by name:

#create new DataFrame that drops first column by name
df_new = df.drop('team')

#view new DataFrame
df_new.show()

+----------+------+-------+
|conference|points|assists|
+----------+------+-------+
|      East|    11|      4|
|      East|     8|      9|
|      East|    10|      3|
|      West|     6|     12|
|      West|     6|      4|
|      East|     5|      2|
+----------+------+-------+

Notice that the result matches Example 1: only the first column (the team column) has been dropped from the DataFrame.
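
One caveat when dropping by name: drop is a no-op if the DataFrame does not contain a column with the given name, so a misspelled name returns the DataFrame unchanged without raising an error. The sketch below shows one way to guard against that. The function name drop_existing_column is hypothetical, and its check is an exact, case-sensitive string match, which may be stricter than Spark's own name resolution depending on configuration.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only, not needed at runtime
    from pyspark.sql import DataFrame


def drop_existing_column(df: DataFrame, name: str) -> DataFrame:
    """Drop the column `name`, raising if it is absent (hypothetical helper)."""
    # fail loudly instead of silently returning the DataFrame unchanged
    if name not in df.columns:
        raise KeyError(f"column {name!r} not found; available: {df.columns}")
    return df.drop(name)


# usage with the df defined earlier:
# df_new = drop_existing_column(df, 'team')   # drops the team column
# drop_existing_column(df, 'teem')            # raises KeyError
```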
