Select Multiple Columns in PySpark (With Examples)


There are three common ways to select multiple columns in a PySpark DataFrame:

Method 1: Select Multiple Columns by Name

#select 'team' and 'points' columns
df.select('team', 'points').show()

Method 2: Select Multiple Columns Based on List

#define list of columns to select
select_cols = ['team', 'points']

#select all columns in list 
df.select(*select_cols).show()

Method 3: Select Multiple Columns Based on Index Range

#select all columns between index 0 and 2 (not including 2)
df.select(df.columns[0:2]).show()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Select Multiple Columns by Name

We can use the following syntax to select the team and points columns of the DataFrame:

#select 'team' and 'points' columns
df.select('team', 'points').show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame only contains the team and points columns, just as we specified.

Example 2: Select Multiple Columns Based on List

We can use the following syntax to specify a list of column names and then select all columns in the DataFrame that belong to the list:

#define list of columns to select
select_cols = ['team', 'points']

#select all columns in list
df.select(*select_cols).show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame only contains the columns whose names we specified in the list.
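
The list of column names does not have to be typed by hand; it can be built with ordinary Python. The sketch below assumes a hypothetical helper named columns_except (not part of PySpark) and uses a plain list as a stand-in for df.columns so that it runs without a Spark session:

```python
# Hypothetical helper (not part of PySpark): keep every column name
# except the ones listed in `exclude`, preserving the original order.
def columns_except(all_columns, exclude):
    excluded = set(exclude)
    return [name for name in all_columns if name not in excluded]

# Stand-in for df.columns from the example DataFrame
all_columns = ['team', 'conference', 'points', 'assists']

select_cols = columns_except(all_columns, ['conference', 'assists'])
print(select_cols)  # ['team', 'points']

# With a live DataFrame, the list would then be unpacked into select():
# df.select(*select_cols).show()
```

This approach can be convenient when it is easier to name the columns to drop than the columns to keep.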

Example 3: Select Multiple Columns Based on Index Range

We can use the following syntax to select all columns that fall within a range of index positions in the DataFrame:

#select all columns between index positions 0 and 2 (not including 2)
df.select(df.columns[0:2]).show()

+----+----------+
|team|conference|
+----+----------+
|   A|      East|
|   A|      East|
|   A|      East|
|   B|      West|
|   B|      West|
|   C|      East|
+----+----------+

Notice that the resulting DataFrame only contains the columns in index positions 0 and 1.
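
Because df.columns is an ordinary Python list of column names, the index range follows standard list slicing rules. The sketch below uses a plain list as a stand-in for df.columns (an assumption made so the snippet runs without a Spark session) to show which names different slices would pass to select():

```python
# Stand-in for df.columns from the example DataFrame
columns = ['team', 'conference', 'points', 'assists']

# Slice [0:2] keeps index positions 0 and 1; the stop index 2 is excluded
first_two = columns[0:2]
print(first_two)  # ['team', 'conference']

# A negative start index counts from the end: the last two columns
last_two = columns[-2:]
print(last_two)  # ['points', 'assists']

# Non-contiguous positions can be picked with a list comprehension
picked = [columns[i] for i in (0, 2)]
print(picked)  # ['team', 'points']
```

Any of these lists could then be passed to df.select() in the same way as df.columns[0:2] above.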
