How can I use PySpark to select columns with an alias?

PySpark lets you manipulate and analyze large datasets from Python. When you select columns from a DataFrame, you can also give them an alias, which is a new name that appears in the result. To do this, pass a column to the select function and call alias on it with the name you want. The original DataFrame is not modified; a new DataFrame with the aliased column name is returned. This is a simple way to give columns more descriptive names before further processing and analysis.

PySpark: Select Columns with Alias


There are two common ways to select columns and return aliased names in a PySpark DataFrame:

Method 1: Return One Column with Aliased Name

#select 'team' column and display using aliased name of 'team_name'
df.select(df.team.alias('team_name')).show()

Method 2: Return One Column with Aliased Name Along with All Other Columns

#select all columns and display 'team' column using aliased name of 'team_name'
df.withColumnRenamed('team', 'team_name').show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Return One Column with Aliased Name

We can use the following syntax to select the team column from the DataFrame and display it using the aliased name of team_name:

#select 'team' column and display using aliased name of 'team_name'
df.select(df.team.alias('team_name')).show()

+---------+
|team_name|
+---------+
|        A|
|        A|
|        A|
|        B|
|        B|
|        C|
+---------+

Notice that only the values from the team column are shown in the results and the column name is shown using the alias team_name.

Example 2: Return One Column with Aliased Name Along with All Other Columns

We can use the following syntax to return all columns from the DataFrame, with the team column displayed under the aliased name team_name:

#select all columns and display 'team' column using aliased name of 'team_name'
df.withColumnRenamed('team', 'team_name').show()

+---------+----------+------+-------+
|team_name|conference|points|assists|
+---------+----------+------+-------+
|        A|      East|    11|      4|
|        A|      East|     8|      9|
|        A|      East|    10|      3|
|        B|      West|     6|     12|
|        B|      West|     6|      4|
|        C|      East|     5|      2|
+---------+----------+------+-------+

Notice that all columns from the DataFrame are returned, and only the team column is displayed under the aliased name that we specified.

The function withColumnRenamed is particularly useful when you only want to display an aliased name for one column but you still want to include all other columns from the DataFrame in the output.

Note: You can find the complete documentation for the PySpark alias function in the official PySpark API reference, under Column.alias.
