Table of Contents
To calculate summary statistics for a Pandas DataFrame, use the DataFrame’s describe() method. This returns a summary of the data, including the count, mean, standard deviation, minimum, maximum, and quartiles. The describe() method can be used on both numerical and categorical data, and is useful for getting a quick overview of the data.
You can use the following methods to calculate summary statistics for variables in a pandas DataFrame:
Method 1: Calculate Summary Statistics for All Numeric Variables
df.describe()
Method 2: Calculate Summary Statistics for All String Variables
df.describe(include='object')
Method 3: Calculate Summary Statistics Grouped by a Variable
df.groupby('group_column').mean() df.groupby('group_column').median() df.groupby('group_column').max() ...
The following examples show how to use each method in practice with the following pandas DataFrame:
import pandas as pd import numpy as np #create DataFrame df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'], 'points': [18, 22, 19, 14, 14, 11, 20, 28, 30], 'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5], 'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]}) #view DataFrame print(df) team points assists rebounds 0 A 18 5.0 11.0 1 A 22 NaN 8.0 2 A 19 7.0 10.0 3 A 14 9.0 6.0 4 B 14 12.0 6.0 5 B 11 9.0 5.0 6 B 20 9.0 9.0 7 B 28 4.0 NaN 8 B 30 5.0 6.0
Example 1: Calculate Summary Statistics for All Numeric Variables
The following code shows how to calculate the summary statistics for each numeric variable in the DataFrame:
df.describe()
points assists rebounds
count 9.000000 8.000000 8.000000
mean 19.555556 7.500000 7.625000
std 6.366143 2.725541 2.199838
min 11.000000 4.000000 5.000000
25% 14.000000 5.000000 6.000000
50% 19.000000 8.000000 7.000000
75% 22.000000 9.000000 9.250000
max 30.000000 12.000000 11.000000
We can see the following summary statistics for each of the three numeric variables:
- count: The count of non-null values
- mean: The mean value
- std: The standard deviation
- min: The minimum value
- 25%: The value at the 25th percentile
- 50%: The value at the 50th percentile (also the median)
- 75%: The value at the 75th percentile
- max: The maximum value
Example 2: Calculate Summary Statistics for All String Variables
The following code shows how to calculate the summary statistics for each string variable in the DataFrame:
df.describe(include='object') team count 9 unique 2 top B freq 5
- count: The count of non-null values
- unique: The number of unique values
- top: The most frequently occurring value
- freq: The count of the most frequently occurring value
Example 3: Calculate Summary Statistics Grouped by a Variable
The following code shows how to calculate the mean value for all numeric variables, grouped by the team variable:
df.groupby('team').mean() points assists rebounds team A 18.25 7.0 8.75 B 20.60 7.8 6.50
The output displays the mean value for the points, assists, and rebounds variables, grouped by the team variable.
Note that we can use similar syntax to calculate a different summary statistic, such as the median:
df.groupby('team').median() points assists rebounds team A 18.5 7.0 9.0 B 20.0 9.0 6.0
The output displays the median value for the points, assists, and rebounds variables, grouped by the team variable.
Note: You can find the complete documentation for the describe function in pandas .