How can I read a CSV file into a PySpark DataFrame?

Reading a CSV file into a PySpark DataFrame starts with a SparkSession, the entry point to PySpark functionality. The session's read interface loads the CSV file into a DataFrame given the file path and any options you need, such as header or delimiter settings. The resulting DataFrame can then be manipulated and analyzed with standard PySpark operations, which makes this a convenient and efficient way to work with large datasets in a distributed computing environment.

Read CSV File into PySpark DataFrame (3 Examples)


You can use the spark.read.csv() method to read a CSV file into a PySpark DataFrame.

Here are three common ways to do so:

Method 1: Read CSV File 

df = spark.read.csv('data.csv')

Method 2: Read CSV File with Header

df = spark.read.csv('data.csv', header=True) 

Method 3: Read CSV File with Specific Delimiter

df = spark.read.csv('data.csv', header=True, sep=';')

The following examples show how to use each method in practice.

Example 1: Read CSV File

Suppose I have a CSV file called data.csv with the following contents:

team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14

I can use the following syntax to read this CSV file into a PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv')

#view resulting DataFrame
df.show()

+----+-------+--------+
| _c0|    _c1|     _c2|
+----+-------+--------+
|team| points| assists|
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+

By default, PySpark assumes there is no header in the CSV file and simply uses _c0, _c1, _c2 as the column names.

Example 2: Read CSV File with Header

Once again suppose I have a CSV file called data.csv with the following contents:

team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14

I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the first row should be used as the header:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True)

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+

Since we specified header=True, PySpark used the first row in the CSV file as the header row in the resulting DataFrame.

Example 3: Read CSV File with Specific Delimiter

Suppose I have a CSV file called data.csv with the following contents:

team; points; assists
'A'; 78; 12
'B'; 85; 20
'C'; 93; 23
'D'; 90; 8
'E'; 91; 14

I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the values in the file are separated by semi-colons:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True, sep=';')

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+

Since we used the sep argument, PySpark knew to use semi-colons as the delimiter for the values when reading the CSV file into the DataFrame.
