Table of Contents
Pandas provides a method called ‘coalesce’ which allows you to combine values from multiple columns into one. It takes a list of column names as an argument and returns a new column which contains the combined values. This is useful when you want to combine related information from multiple columns into one column for analysis.
You can use the following methods to coalesce the values from multiple columns of a pandas DataFrame into one column:
Method 1: Coalesce Values by Default Column Order
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
Method 2: Coalesce Values Using Specific Column Order
df['coalesce'] = df[['col3', 'col1', 'col2']].bfill(axis=1).iloc[:, 0]
The following examples show how to use each method in practice with the following pandas DataFrame:
import pandas as pd import numpy as np #create DataFrame df = pd.DataFrame({'points': [np.nan, np.nan, 19, np.nan, 14], 'assists': [np.nan, 7, 7, 9, np.nan], 'rebounds': [3, 4, np.nan, np.nan, 6]}) #view DataFrame print(df) points assists rebounds 0 NaN NaN 3.0 1 NaN 7.0 4.0 2 19.0 7.0 NaN 3 NaN 9.0 NaN 4 14.0 NaN 6.0
Method 1: Coalesce Values by Default Column Order
The following code shows how to coalesce the values in the points, assists, and rebounds columns into one column, using the first non-null value across the three columns as the coalesced value:
#create new column that contains first non-null value from three existing columns
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
#view updated DataFrame
print(df)
points assists rebounds coalesce
0 NaN NaN 3.0 3.0
1 NaN 7.0 4.0 7.0
2 19.0 7.0 NaN 19.0
3 NaN 9.0 NaN 9.0
4 14.0 NaN 6.0 14.0
Here’s how the value in the coalesce column was chosen:
- First row: The first non-null value was 3.0.
- Second row: The first non-null value was 7.0.
- Third row: The first non-null value was 19.0.
- Fourth row: The first non-null value was 9.0.
- Fifth row: The first non-null value was 14.0.
Method 2: Coalesce Values Using Specific Column Order
The following code shows how to coalesce the values in the three columns by analyzing the columns in the following order: assists, rebounds, points.
#coalesce values in specific column order
df['coalesce'] = df[['assists', 'rebounds', 'points']].bfill(axis=1).iloc[:, 0]
#view updated DataFrame
print(df)
points assists rebounds coalesce
0 NaN NaN 3.0 3.0
1 NaN 7.0 4.0 7.0
2 19.0 7.0 NaN 7.0
3 NaN 9.0 NaN 9.0
4 14.0 NaN 6.0 6.0
Here’s the logic that was used to decide which value to place in the coalesce column:
- If the value in the assists column is non-null then use that value.
- Otherwise, if the value in the rebounds column is non-null then use that value.
- Otherwise, if the value in the points column is non-null then use that value.