How can I select multiple columns in PySpark with examples?

PySpark is a powerful data processing framework for working with large datasets. A common task in data analysis is selecting multiple columns from a DataFrame, which is done with the select() function and the desired column names. For example, to select the 'Name' and 'Age' columns from a DataFrame, the syntax is df.select('Name', 'Age'). The same function can select every column (df.select('*')) or a subset built programmatically from a list or an index range, making it easy to extract only the data needed for further analysis and manipulation.

Select Multiple Columns in PySpark (With Examples)


There are three common ways to select multiple columns in a PySpark DataFrame:

Method 1: Select Multiple Columns by Name

#select 'team' and 'points' columns
df.select('team', 'points').show()

Method 2: Select Multiple Columns Based on List

#define list of columns to select
select_cols = ['team', 'points']

#select all columns in list 
df.select(*select_cols).show()

Method 3: Select Multiple Columns Based on Index Range

#select all columns between index 0 and 2 (not including 2)
df.select(df.columns[0:2]).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Select Multiple Columns by Name

We can use the following syntax to select the team and points columns of the DataFrame:

#select 'team' and 'points' columns
df.select('team', 'points').show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame only contains the team and points columns, just as we specified.

Example 2: Select Multiple Columns Based on List

We can use the following syntax to specify a list of column names and then select all columns in the DataFrame that belong to the list:

#define list of columns to select
select_cols = ['team', 'points']

#select all columns in list
df.select(*select_cols).show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame only contains the column names that we specified in the list.

Example 3: Select Multiple Columns Based on Index Range

We can use the following syntax to select all columns between certain index positions in the DataFrame:

#select all columns between index positions 0 and 2 (not including 2)
df.select(df.columns[0:2]).show()

+----+----------+
|team|conference|
+----+----------+
|   A|      East|
|   A|      East|
|   A|      East|
|   B|      West|
|   B|      West|
|   C|      East|
+----+----------+

Notice that the resulting DataFrame only contains the columns in index positions 0 and 1.
