New rows can be added to a PySpark DataFrame by using the `union()` function. Because DataFrames are immutable, `union()` does not modify either input. It returns a new DataFrame that contains the rows of the first DataFrame followed by the rows of the second. The syntax is `df1.union(df2)`, where `df1` is the initial DataFrame and `df2` is the DataFrame containing the new rows.
Here are some examples of how new rows can be added to a PySpark DataFrame:
1. Adding a single row:
```
# Create a DataFrame with initial rows
df1 = spark.createDataFrame([(1, "John"), (2, "Jane")], ["id", "name"])
# Create a new row to be added
new_row = (3, "Bob")
# Add the new row to the DataFrame
new_df = df1.union(spark.createDataFrame([new_row], ["id", "name"]))
```
2. Adding multiple rows from a separate DataFrame:
```
# Create a DataFrame with initial rows
df1 = spark.createDataFrame([(1, "John"), (2, "Jane")], ["id", "name"])
# Create a separate DataFrame with new rows
df2 = spark.createDataFrame([(3, "Bob"), (4, "Sarah")], ["id", "name"])
# Add the new rows to the initial DataFrame
new_df = df1.union(df2)
```
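3. Adding rows that contain an extra column:
```
# Create a DataFrame with initial rows
df1 = spark.createDataFrame([(1, "John"), (2, "Jane")], ["id", "name"])
# Create a new row that has an extra "department" column
new_row = (3, "Bob", "Marketing")
# union() requires both DataFrames to have the same number of columns,
# so combine by column name with unionByName() instead.
# allowMissingColumns=True (Spark 3.1+) fills "department" with null for the existing rows.
new_df = df1.unionByName(spark.createDataFrame([new_row], ["id", "name", "department"]), allowMissingColumns=True)
```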
In summary, the `union()` function can be used to add new rows to a PySpark DataFrame, either one at a time or in batches from a separate DataFrame. Note that `union()` matches columns by position rather than by name, so the new rows must have the same number of columns as the initial DataFrame, in the same order and with compatible data types.
Add New Rows to PySpark DataFrame (With Examples)
You can use the following methods to add new rows to a PySpark DataFrame:
Method 1: Add One New Row to DataFrame
```
#define new row to add with values 'C', 'Guard' and 14
new_row = spark.createDataFrame([('C', 'Guard', 14)], columns)

#add new row to DataFrame
df_new = df.union(new_row)
```
Method 2: Add Multiple New Rows to DataFrame
```
#define multiple new rows to add
new_rows = spark.createDataFrame([('C', 'Guard', 14),
                                  ('C', 'Forward', 32),
                                  ('D', 'Forward', 21)], columns)

#add new rows to DataFrame
df_new = df.union(new_rows)
```
The following examples show how to use each method in practice with the following PySpark DataFrame:
```
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
```
Example 1: Add One New Row to DataFrame
We can use the following syntax to add one new row to the end of the existing DataFrame:
```
#define new row to add
new_row = spark.createDataFrame([('C', 'Guard', 14)], columns)

#add new row to DataFrame
df_new = df.union(new_row)

#view updated DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
|   C|   Guard|    14|
+----+--------+------+
```
Notice that one new row has been added to the end of the DataFrame with the values C, Guard and 14 just as we specified.
Example 2: Add Multiple New Rows to DataFrame
We can use the following syntax to add three new rows to the end of the existing DataFrame:
```
#define multiple new rows to add
new_rows = spark.createDataFrame([('C', 'Guard', 14),
                                  ('C', 'Forward', 32),
                                  ('D', 'Forward', 21)], columns)

#add new rows to DataFrame
df_new = df.union(new_rows)

#view updated DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
|   C|   Guard|    14|
|   C| Forward|    32|
|   D| Forward|    21|
+----+--------+------+
```
Notice that three new rows have been added to the end of the DataFrame.
Note that we used the union function in these examples to return a new DataFrame that contained the union of the rows in the existing DataFrame and the values for the new row(s) that we specified.
You can find the complete documentation for the PySpark `union` function in the official PySpark API reference.