How can we calculate summary statistics in PySpark?

PySpark is designed for analyzing and manipulating large datasets, and computing summary statistics is one of the most common first steps. A PySpark DataFrame has a built-in summary() method that reports the count, mean, standard deviation, minimum, maximum and approximate percentiles (including the median) for each column. It can be applied to the entire DataFrame or to selected columns only. When you need something beyond these defaults, the same statistics are available as aggregate functions, and you can also write user-defined functions for custom calculations.

Calculate Summary Statistics in PySpark


You can use the following methods to calculate summary statistics for columns in a PySpark DataFrame:

Method 1: Calculate Summary Statistics for All Columns

df.summary().show()

Method 2: Calculate Specific Summary Statistics for All Columns

df.summary('min', '25%', '50%', '75%', 'max').show()

Method 3: Calculate Summary Statistics for Only Numeric Columns

from pyspark.sql.types import NumericType
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]

df.select(*numeric_cols).summary().show()

The examples below show how to use each method in practice with the following PySpark DataFrame, which contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Calculate Summary Statistics for All Columns

We can use the following syntax to calculate summary statistics for all columns in the DataFrame:

#calculate summary statistics for each column in DataFrame
df.summary().show()

+-------+----+----------+-----------------+------------------+
|summary|team|conference|           points|           assists|
+-------+----+----------+-----------------+------------------+
|  count|   6|         6|                6|                 6|
|   mean|null|      null|7.666666666666667| 5.666666666666667|
| stddev|null|      null|2.422120283277993|3.9327683210007005|
|    min|   A|      East|                5|                 2|
|    25%|null|      null|                6|                 3|
|    50%|null|      null|                6|                 4|
|    75%|null|      null|               10|                 9|
|    max|   C|      West|               11|                12|
+-------+----+----------+-----------------+------------------+

The output displays the following summary statistics for each column in the DataFrame:

  • count: The number of values in the column
  • mean: The mean value
  • stddev: The standard deviation of values
  • min: The minimum value
  • 25%: The 25th percentile
  • 50%: The 50th percentile (this is also the median)
  • 75%: The 75th percentile
  • max: The max value

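Note that many of these values don't make sense to interpret for string columns: the mean, standard deviation and percentiles are null, and the min and max are simply the first and last values in alphabetical order.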
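
As a sanity check, the values reported for the points column can be reproduced in plain Python. This is only a sketch: the nearest_rank_percentile helper is a hypothetical name introduced here, and the nearest-rank rule it uses is an assumption that happens to match the table for this small dataset. PySpark computes its percentiles with an approximate algorithm, so results on large data may differ slightly.

```python
import math
import statistics

# Values copied from the points column of the example DataFrame
points = [11, 8, 10, 6, 6, 5]

def nearest_rank_percentile(values, pct):
    """Hypothetical helper: smallest value with at least pct of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

check = {
    'count': len(points),
    'mean': statistics.mean(points),
    'stddev': statistics.stdev(points),  # sample standard deviation (divides by n - 1)
    'min': min(points),
    '25%': nearest_rank_percentile(points, 0.25),
    '50%': nearest_rank_percentile(points, 0.50),
    '75%': nearest_rank_percentile(points, 0.75),
    'max': max(points),
}

for name, value in check.items():
    print(name, value)
```

Note that the 50% value is 6, an actual value from the column, whereas an interpolated median such as statistics.median(points) returns 7.0. Keep this in mind when comparing PySpark output with other tools.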

Example 2: Calculate Specific Summary Statistics for All Columns

We can use the following syntax to calculate specific summary statistics for all columns in the DataFrame:

#calculate specific summary statistics for each column in DataFrame
df.summary('min', '25%', '50%', '75%', 'max').show()

+-------+----+----------+------+-------+
|summary|team|conference|points|assists|
+-------+----+----------+------+-------+
|    min|   A|      East|     5|      2|
|    25%|null|      null|     6|      3|
|    50%|null|      null|     6|      4|
|    75%|null|      null|    10|      9|
|    max|   C|      West|    11|     12|
+-------+----+----------+------+-------+

Example 3: Calculate Summary Statistics for Only Numeric Columns

We can use the following syntax to calculate summary statistics only for the numeric columns in the DataFrame:

#identify numeric columns in DataFrame
from pyspark.sql.types import NumericType
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]

#calculate summary statistics for only the numeric columns
df.select(*numeric_cols).summary().show()

+-------+-----------------+------------------+
|summary|           points|           assists|
+-------+-----------------+------------------+
|  count|                6|                 6|
|   mean|7.666666666666667| 5.666666666666667|
| stddev|2.422120283277993|3.9327683210007005|
|    min|                5|                 2|
|    25%|                6|                 3|
|    50%|                6|                 4|
|    75%|               10|                 9|
|    max|               11|                12|
+-------+-----------------+------------------+

Notice that summary statistics are displayed only for the two numeric columns in the DataFrame – the points and assists columns.

Note: You can find the complete documentation for the PySpark summary function in the official PySpark API reference, under DataFrame.summary.
