How can I print a single column of a PySpark DataFrame?

To print a single column of a PySpark DataFrame, pass the column name to the select method, which returns a new DataFrame containing only that column, and then call show on the result. The selectExpr method works the same way, except that it interprets the column name as a SQL expression string.

Print One Column of a PySpark DataFrame


You can use the following methods to print one specific column of a PySpark DataFrame:

Method 1: Print Column Values with Column Name

df.select('my_column').show()

Method 2: Print Column Values Only

df.select('my_column').rdd.flatMap(list).collect()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]

# define column names
columns = ['team', 'conference', 'points', 'assists']

# create dataframe using data and column names
df = spark.createDataFrame(data, columns)

# view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Print Column Values with Column Name

We can use the following syntax to print the column values along with the column name for the conference column of the DataFrame:

#print 'conference' column (with column name)
df.select('conference').show()

+----------+
|conference|
+----------+
|      East|
|      East|
|      East|
|      West|
|      West|
|      East|
+----------+

Notice that the output contains only the conference column, shown with both its column name and its values.

Example 2: Print Column Values Only

We can use the following syntax to print only the column values of the conference column of the DataFrame:

#print values only from 'conference' column
df.select('conference').rdd.flatMap(list).collect() 

['East', 'East', 'East', 'West', 'West', 'East']

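Notice that only the values from the conference column are returned and the name of the column is not included. Because collect() brings every value back to the driver as a Python list, this method is best suited to columns small enough to fit in driver memory. In a script (as opposed to an interactive shell), wrap the expression in print() to display the list.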
