Reading a CSV file into a PySpark DataFrame starts with a SparkSession, the entry point to PySpark functionality. The session's read interface loads the CSV file into a DataFrame given the file path and any options, such as a header row or a custom delimiter. The resulting DataFrame can then be manipulated and analyzed with standard PySpark operations, which makes this a convenient and efficient way to process large datasets in a distributed computing environment.
Read CSV File into PySpark DataFrame (3 Examples)
You can use the spark.read.csv() method to read a CSV file into a PySpark DataFrame.
Here are three common ways to do so:
Method 1: Read CSV File
df = spark.read.csv('data.csv')
Method 2: Read CSV File with Header
df = spark.read.csv('data.csv', header=True)
Method 3: Read CSV File with Specific Delimiter
df = spark.read.csv('data.csv', header=True, sep=';')
The following examples show how to use each method in practice.
Example 1: Read CSV File
Suppose I have a CSV file called data.csv with the following contents:
team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14
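If you want to recreate this sample file locally to follow along, a minimal sketch using only the Python standard library (writing the raw lines directly preserves the quoting and spacing shown above):

```python
# Recreate the sample data.csv shown above, byte-for-byte,
# including the single quotes and the space after each comma
rows = [
    "team, points, assists",
    "'A', 78, 12",
    "'B', 85, 20",
    "'C', 93, 23",
    "'D', 90, 8",
    "'E', 91, 14",
]

with open('data.csv', 'w') as f:
    f.write('\n'.join(rows) + '\n')
```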
I can use the following syntax to read this CSV file into a PySpark DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv')

#view resulting DataFrame
df.show()

+----+-------+--------+
| _c0|    _c1|     _c2|
+----+-------+--------+
|team| points| assists|
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+
By default, PySpark assumes there is no header in the CSV file and simply uses _c0, _c1, _c2 as the column names.
Example 2: Read CSV File with Header
Once again suppose I have a CSV file called data.csv with the following contents:
team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14
I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the first row should be used as the header:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True)

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+
Since we specified header=True, PySpark used the first row in the CSV file as the header row in the resulting DataFrame.
Example 3: Read CSV File with Specific Delimiter
Suppose I have a CSV file called data.csv with the following contents:
team; points; assists
'A'; 78; 12
'B'; 85; 20
'C'; 93; 23
'D'; 90; 8
'E'; 91; 14
I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the values in the file are separated by semi-colons:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True, sep=';')

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+
Since we used the sep argument, PySpark knew to use semi-colons as the delimiter for the values when reading the CSV file into the DataFrame.
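The delimiter idea itself is independent of Spark. As a quick sanity check you can parse the same semicolon-separated lines with Python's standard csv module, where delimiter=';' plays the role of sep=';' (this sketch uses an in-memory string rather than a file, purely for illustration):

```python
import csv
import io

# A semicolon-delimited sample mirroring the file above
raw = "team; points; assists\n'A'; 78; 12\n'B'; 85; 20\n"

# delimiter=';' tells the parser to split on semicolons,
# analogous to sep=';' in spark.read.csv
rows = list(csv.reader(io.StringIO(raw), delimiter=';'))

print(rows[0])  # ['team', ' points', ' assists']
print(rows[1])  # ["'A'", ' 78', ' 12']
```

Note that the leading spaces after each semicolon survive parsing, just as they do in the PySpark output above; trimming them is a separate step in either tool.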