PySpark: Select Columns by Index in DataFrame


You can use the following methods to select columns by index in a PySpark DataFrame. Each one indexes into df.columns, a Python list of the column names, so positions start at 0:

Method 1: Select Specific Column by Index

#select first column in DataFrame
df.select(df.columns[0]).show()

Method 2: Select All Columns Except Specific One by Index

#select all columns except first column in DataFrame
df.drop(df.columns[0]).show()

Method 3: Select Range of Columns by Index

#select columns at index positions 0 and 1 (the slice end, 2, is excluded)
df.select(df.columns[0:2]).show()

The examples below show how to use each of these methods in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11],
        ['A', 'East', 8],
        ['A', 'East', 10],
        ['B', 'West', 6],
        ['B', 'West', 6],
        ['C', 'East', 5]]

#define column names
columns = ['team', 'conference', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

Example 1: Select Specific Column by Index

We can use the following syntax to select only the first column in the DataFrame:

#select first column in DataFrame
df.select(df.columns[0]).show()

+----+
|team|
+----+
|   A|
|   A|
|   A|
|   B|
|   B|
|   C|
+----+

Notice that only the first column (the team column) has been selected from the DataFrame.
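
Because df.columns is an ordinary Python list, negative indexes should work as well, and an out-of-range index fails in Python before Spark is involved. The sketch below checks that logic on the column list alone. Note that column_at is a hypothetical helper name (not part of PySpark), and the list is copied from the example DataFrame rather than read from a live session.

```python
# Column names copied from the example DataFrame; df.columns returns a list like this
columns = ['team', 'conference', 'points']

def column_at(columns, index):
    """Hypothetical helper: return the column name at a position,
    raising a descriptive IndexError when the position is out of range."""
    if not -len(columns) <= index < len(columns):
        raise IndexError(
            f"column index {index} is out of range for {len(columns)} columns"
        )
    return columns[index]

first_col = column_at(columns, 0)    # first column name
last_col = column_at(columns, -1)    # negative index counts from the end

print(first_col, last_col)

# With a live DataFrame, the same helper could be used as:
# df.select(column_at(df.columns, -1)).show()
```

Validating the index up front is a design choice: it reports the bad position and the number of columns in the message, which is easier to read than a bare "list index out of range".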

Example 2: Select All Columns Except Specific One by Index

We can use the following syntax to select all columns in the DataFrame except for the first column:

#select all columns except first column in DataFrame
df.drop(df.columns[0]).show()

+----------+------+
|conference|points|
+----------+------+
|      East|    11|
|      East|     8|
|      East|    10|
|      West|     6|
|      West|     6|
|      East|     5|
+----------+------+

Notice that all columns except the first column (the team column) have been selected from the DataFrame.
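
The same idea extends to dropping several columns at once, since drop accepts multiple column names. The sketch below only builds the list of names to drop from a list of index positions; names_at is a hypothetical helper, and the column list is copied from the example DataFrame so the logic can be checked without a Spark session.

```python
# Column names copied from the example DataFrame; df.columns returns a list like this
columns = ['team', 'conference', 'points']

def names_at(columns, indices):
    """Hypothetical helper: map index positions to column names."""
    return [columns[i] for i in indices]

# Names of the first and third columns
to_drop = names_at(columns, [0, 2])

# Columns that would remain after the drop, in their original order
remaining = [c for c in columns if c not in to_drop]

print(to_drop, remaining)

# With a live DataFrame, the names can be unpacked into drop:
# df.drop(*to_drop).show()
```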

Example 3: Select Range of Columns by Index

We can use the following syntax to select the columns at index positions 0 and 1. The slice 0:2 starts at index 0 and stops before index 2:

#select columns at index positions 0 and 1 (the slice end, 2, is excluded)
df.select(df.columns[0:2]).show()

+----+----------+
|team|conference|
+----+----------+
|   A|      East|
|   A|      East|
|   A|      East|
|   B|      West|
|   B|      West|
|   C|      East|
+----+----------+

Notice that only the columns at index positions 0 and 1 (team and conference) have been selected from the DataFrame. The points column at index 2 is excluded because the end of a slice is not included.
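
Since the range is a standard Python slice of df.columns, other slice forms should behave the same way. The sketch below shows a few variants on the column list alone. The variable names are illustrative, and the list is copied from the example DataFrame rather than read from a live session.

```python
# Column names copied from the example DataFrame; df.columns returns a list like this
columns = ['team', 'conference', 'points']

# Open-ended slice: everything from index 1 to the end
from_second = columns[1:]

# Negative slice: the last two columns
last_two = columns[-2:]

# Step slice: every other column, starting at index 0
every_other = columns[::2]

print(from_second, last_two, every_other)

# With a live DataFrame, any of these lists can be passed to select:
# df.select(df.columns[-2:]).show()
```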
