You can use either of the following methods to keep only certain columns in a PySpark DataFrame:
Method 1: Specify Columns to Keep
from pyspark.sql.functions import col
#only keep columns 'col1' and 'col2'
df.select(col('col1'), col('col2')).show()
Method 2: Specify Columns to Drop
#drop columns 'col3' and 'col4'
#(names are passed as strings because older PySpark releases raise a TypeError
#when several Column objects are passed to drop())
df.drop('col3', 'col4').show()
The examples below show how to use each method with this PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
Example 1: Specify Columns to Keep
The following code shows how to define a new DataFrame that only keeps the team and points columns:
from pyspark.sql.functions import col
#create new DataFrame and only keep 'team' and 'points' columns
df.select(col('team'), col('points')).show()
+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+
Notice that the resulting DataFrame only keeps the two columns that we specified.
Example 2: Specify Columns to Drop
The following code shows how to define a new DataFrame that drops the conference and assists columns from the original DataFrame:
#create new DataFrame that drops 'conference' and 'assists' columns
#(names are passed as strings because older PySpark releases raise a TypeError
#when several Column objects are passed to drop())
df.drop('conference', 'assists').show()
+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+
Notice that the resulting DataFrame drops the conference and assists columns from the original DataFrame and keeps the remaining columns.
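The two examples produce the same result because dropping a set of columns is equivalent to keeping every other column. The sketch below illustrates that relationship with plain Python lists. The list all_columns stands in for what df.columns returns for the example DataFrame, and the names to_drop and keep are illustrative assumptions:

```python
# column names of the example DataFrame, in order (what df.columns would return)
all_columns = ['team', 'conference', 'points', 'assists']

# columns we would otherwise pass to drop()
to_drop = {'conference', 'assists'}

# keep every column that is not in the drop set, preserving the original order
keep = [c for c in all_columns if c not in to_drop]

print(keep)  # ['team', 'points']
```

With the DataFrame from the examples, df.select(*keep) should then return the same columns as df.drop('conference', 'assists').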