How to Calculate Summary Statistics in PySpark

PySpark is the Python API for Apache Spark, and it provides several methods for calculating summary statistics. You can use the select() function to pick columns from a DataFrame and the agg() function to calculate summary statistics such as the mean, median, standard deviation, and count of a column. You can also use other built-in functions and user-defined functions to calculate summary statistics.


You can use the following methods to calculate summary statistics for columns in a PySpark DataFrame:

Method 1: Calculate Summary Statistics for All Columns

df.summary().show()

Method 2: Calculate Specific Summary Statistics for All Columns

df.summary('min', '25%', '50%', '75%', 'max').show()

Method 3: Calculate Summary Statistics for Only Numeric Columns

numeric_cols = [c for c, t in df.dtypes if not t.startswith('string')]

df.select(*numeric_cols).summary().show()

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Calculate Summary Statistics for All Columns

We can use the following syntax to calculate summary statistics for all columns in the DataFrame:

#calculate summary statistics for each column in DataFrame
df.summary().show()

+-------+----+----------+-----------------+------------------+
|summary|team|conference|           points|           assists|
+-------+----+----------+-----------------+------------------+
|  count|   6|         6|                6|                 6|
|   mean|null|      null|7.666666666666667| 5.666666666666667|
| stddev|null|      null|2.422120283277993|3.9327683210007005|
|    min|   A|      East|                5|                 2|
|    25%|null|      null|                6|                 3|
|    50%|null|      null|                6|                 4|
|    75%|null|      null|               10|                 9|
|    max|   C|      West|               11|                12|
+-------+----+----------+-----------------+------------------+

The output displays the following summary statistics for each column in the DataFrame:

  • count: The number of values in the column
  • mean: The mean value
  • stddev: The standard deviation of values
  • min: The minimum value
  • 25%: The 25th percentile
  • 50%: The 50th percentile (this is also the median)
  • 75%: The 75th percentile
  • max: The max value

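Note that many of these values don’t make sense to interpret for string columns: the mean, standard deviation, and percentiles are null, and the min and max simply reflect the sort order of the strings.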

Example 2: Calculate Specific Summary Statistics for All Columns

We can use the following syntax to calculate specific summary statistics for all columns in the DataFrame:

#calculate specific summary statistics for each column in DataFrame
df.summary('min', '25%', '50%', '75%', 'max').show()

+-------+----+----------+------+-------+
|summary|team|conference|points|assists|
+-------+----+----------+------+-------+
|    min|   A|      East|     5|      2|
|    25%|null|      null|     6|      3|
|    50%|null|      null|     6|      4|
|    75%|null|      null|    10|      9|
|    max|   C|      West|    11|     12|
+-------+----+----------+------+-------+

Example 3: Calculate Summary Statistics for Only Numeric Columns

We can use the following syntax to calculate summary statistics only for the numeric columns in the DataFrame:

#identify numeric columns in DataFrame (every column that is not a string)
numeric_cols = [c for c, t in df.dtypes if not t.startswith('string')]

#calculate summary statistics for only the numeric columns
df.select(*numeric_cols).summary().show()

+-------+-----------------+------------------+
|summary|           points|           assists|
+-------+-----------------+------------------+
|  count|                6|                 6|
|   mean|7.666666666666667| 5.666666666666667|
| stddev|2.422120283277993|3.9327683210007005|
|    min|                5|                 2|
|    25%|                6|                 3|
|    50%|                6|                 4|
|    75%|               10|                 9|
|    max|               11|                12|
+-------+-----------------+------------------+

Notice that summary statistics are displayed only for the two numeric columns in the DataFrame – the points and assists columns.

Note: You can find the complete documentation for the PySpark summary function in the official PySpark API reference.
