Keep Certain Columns in PySpark (With Examples)


You can use either of the following methods to keep only certain columns in a PySpark DataFrame:

Method 1: Specify Columns to Keep

from pyspark.sql.functions import col

#only keep columns 'col1' and 'col2'
df.select(col('col1'), col('col2')).show() 

Method 2: Specify Columns to Drop

#drop columns 'col3' and 'col4'
#(pass the names as strings; older PySpark releases raise a TypeError
#when several Column objects are passed to drop())
df.drop('col3', 'col4').show()

The examples below show how to use each method with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Specify Columns to Keep

The following code shows how to define a new DataFrame that only keeps the team and points columns:

from pyspark.sql.functions import col

#create new DataFrame and only keep 'team' and 'points' columns
df.select(col('team'), col('points')).show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame keeps only the two columns that we specified. The select function returns a new DataFrame, so the original df is left unchanged.

Example 2: Specify Columns to Drop

The following code shows how to define a new DataFrame that drops the conference and assists columns from the original DataFrame:

#create new DataFrame that drops 'conference' and 'assists' columns
#(names are passed as strings, which works across PySpark versions)
df.drop('conference', 'assists').show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame drops the conference and assists columns from the original DataFrame and keeps the remaining columns.
