How can new rows be added to a PySpark DataFrame?

New rows can be added to a PySpark DataFrame with the `union()` method. DataFrames are immutable, so `union()` does not modify either input; it returns a new DataFrame that contains the rows of the first DataFrame followed by the rows of the second. The syntax is `df1.union(df2)`, where `df1` is the initial DataFrame and `df2` is the DataFrame containing the new rows.

Here are some examples of how new rows can be added to a PySpark DataFrame:

1. Adding a single row:
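```
from pyspark.sql import SparkSession

# Get (or start) a Spark session; the later examples reuse it
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame with initial rows
df1 = spark.createDataFrame([(1, "John"), (2, "Jane")], ["id", "name"])

# Create a new row to be added
new_row = (3, "Bob")

# Add the new row to the DataFrame
new_df = df1.union(spark.createDataFrame([new_row], ["id", "name"]))
```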

2. Adding multiple rows from a separate DataFrame:
```
# Create a DataFrame with initial rows
df1 = spark.createDataFrame([(1, "John"), (2, "Jane")], ["id", "name"])

# Create a separate DataFrame with new rows
df2 = spark.createDataFrame([(3, "Bob"), (4, "Sarah")], ["id", "name"])

# Add the new rows to the initial DataFrame
new_df = df1.union(df2)
```

3. Adding rows with an extra column:
```
# Create a DataFrame with initial rows
df1 = spark.createDataFrame([(1, "John"), (2, "Jane")], ["id", "name"])

# Create a new row that has an extra "department" column
new_row = (3, "Bob", "Marketing")
df2 = spark.createDataFrame([new_row], ["id", "name", "department"])

# union() raises an error when the column counts differ, so use
# unionByName() with allowMissingColumns=True (Spark 3.1+);
# "department" is filled with null for the rows that come from df1
new_df = df1.unionByName(df2, allowMissingColumns=True)
```

In summary, the `union()` function can be used to add new rows to a PySpark DataFrame, either one at a time or in batches from a separate DataFrame. Because `union()` matches columns by position rather than by name, the new rows must have the same number of columns as the initial DataFrame, in the same order and with compatible data types.

Add New Rows to PySpark DataFrame (With Examples)


You can use the following methods to add new rows to a PySpark DataFrame:

Method 1: Add One New Row to DataFrame

#define new row to add with values 'C', 'Guard' and 14
new_row = spark.createDataFrame([('C', 'Guard', 14)], columns)
#add new row to DataFrame
df_new = df.union(new_row)

Method 2: Add Multiple New Rows to DataFrame

#define multiple new rows to add
new_rows = spark.createDataFrame([('C', 'Guard', 14),
                                  ('C', 'Forward', 32),
                                  ('D', 'Forward', 21)], columns)

#add new rows to DataFrame
df_new = df.union(new_rows)

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Add One New Row to DataFrame

We can use the following syntax to add one new row to the end of the existing DataFrame:

#define new row to add
new_row = spark.createDataFrame([('C', 'Guard', 14)], columns)

#add new row to DataFrame
df_new = df.union(new_row)

#view updated DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
|   C|   Guard|    14|
+----+--------+------+

Notice that one new row has been added to the end of the DataFrame with the values C, Guard and 14, just as we specified.

Example 2: Add Multiple New Rows to DataFrame

We can use the following syntax to add three new rows to the end of the existing DataFrame:

#define multiple new rows to add
new_rows = spark.createDataFrame([('C', 'Guard', 14),
                                  ('C', 'Forward', 32),
                                  ('D', 'Forward', 21)], columns)

#add new rows to DataFrame
df_new = df.union(new_rows)

#view updated DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
|   C|   Guard|    14|
|   C| Forward|    32|
|   D| Forward|    21|
+----+--------+------+

Notice that three new rows have been added to the end of the DataFrame.

Note that the union function does not modify the existing DataFrame. In these examples it returns a new DataFrame that contains the rows of the existing DataFrame together with the new row(s) that we specified.

You can find the complete documentation for the PySpark union function in the official PySpark API reference.
