How can I use describe() for Categorical Variables?

The describe() function in pandas summarizes the columns of a DataFrame. For numeric variables it reports central tendency and dispersion (count, mean, standard deviation, minimum, quartiles and maximum). For categorical variables it reports the number of non-missing values (count), the number of distinct categories (unique), the most frequent category (top) and how often that category occurs (freq). This makes it a quick way to understand the makeup of a dataset.

By default, the describe() function in pandas calculates descriptive statistics for all numeric variables in a DataFrame.
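
As a quick illustration of this default, the sketch below builds a small made-up DataFrame (the column names and values are assumptions chosen for the example) and shows that the text column is left out of the summary:

```python
import pandas as pd

# Hypothetical mixed DataFrame: one text column, one numeric column
demo = pd.DataFrame({'team': ['A', 'B', 'A'],
                     'points': [18, 22, 19]})

# With no arguments, describe() summarizes only the numeric columns
default_summary = demo.describe()

print(default_summary.columns.tolist())  # ['points']
print(default_summary.index.tolist())
```

The index of the result lists the numeric statistics (count, mean, std, min, the quartiles and max), and team does not appear among the columns.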

However, you can use the following methods to calculate descriptive statistics for categorical variables as well:

Method 1: Calculate Descriptive Statistics for Categorical Variables

df.describe(include='object')

This method will calculate count, unique, top and freq for each categorical variable in a DataFrame.

Method 2: Calculate Categorical Descriptive Statistics for All Variables

df.astype('object').describe()

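This method first converts every column to the object dtype, so describe() treats every variable as categorical and calculates count, unique, top and freq for every variable in a DataFrame.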

The examples below show how to use each method with the following pandas DataFrame, which contains information about various basketball players:

import pandas as pd

# create DataFrame
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

#view DataFrame
print(df)

  team  points  assists  rebounds
0    A      18        5        11
1    B      22        7         8
2    C      19        7        10
3    D      14        9         6
4    E      14       12         6
5    F      11        9         5
6    G      20        9         9
7    H      28        4        12

Example 1: Calculate Descriptive Statistics for Categorical Variables

We can use the following syntax to calculate descriptive statistics for each categorical variable in the DataFrame:

#calculate descriptive statistics for categorical variables only
df.describe(include='object')

       team
count     8
unique    8
top       A
freq      1

The output shows various descriptive statistics for the only categorical variable (team) in the DataFrame.

Here’s how to interpret the output:

• count: There are 8 values in the team column.
• unique: There are 8 unique values in the team column.
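• top: The “top” value is the most frequently occurring value. Every team appears exactly once here, so all values are tied and pandas simply reports one of them (A).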
• freq: This top value occurs 1 time.
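
Because every team in this DataFrame appears only once, top and freq are not very informative here. As a rough sketch of how they behave when categories repeat, the following uses a small made-up Series (the values are illustrative assumptions, not part of the dataset above) and compares describe() with value_counts(), which reports the count and the share of each individual category:

```python
import pandas as pd

# Hypothetical column with repeated categories (illustrative values only)
team = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'], dtype='object', name='team')

# describe() reports only the single most frequent category
summary = team.describe()
print(summary['top'], summary['freq'])  # A 3

# value_counts() gives the count of every category...
counts = team.value_counts()
print(counts)

# ...and normalize=True turns those counts into proportions
shares = team.value_counts(normalize=True)
print(shares)
```

In this sketch, A is the top value because it occurs 3 times out of 6, which matches the first row of value_counts() and a share of 0.5.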

Example 2: Calculate Categorical Descriptive Statistics for All Variables

We can use the following syntax to calculate categorical descriptive statistics for every variable in the DataFrame:

#calculate categorical descriptive statistics for all variables
df.astype('object').describe()

       team points assists rebounds
count     8      8       8        8
unique    8      7       5        7
top       A     14       9        6
freq      1      2       3        2

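The output shows count, unique, top and freq for every variable in the DataFrame, including the numeric variables. For example, the points column contains 7 unique values, and its most frequent value is 14, which occurs 2 times.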
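
One way to sanity-check these numbers is to compare them with nunique() and value_counts(), which compute the same quantities directly. The sketch below rebuilds the DataFrame from this article so that it runs on its own; the variable names cat_summary and top_assists are only illustrative:

```python
import pandas as pd

# Same DataFrame as above, rebuilt so this block is self-contained
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

# Treat every column as categorical
cat_summary = df.astype('object').describe()

# The 'unique' row should agree with the number of distinct values per column
print(cat_summary.loc['unique'].tolist())  # [8, 7, 5, 7]
print(df.nunique().tolist())               # [8, 7, 5, 7]

# The 'top' and 'freq' rows should agree with the first row of value_counts()
top_assists = df['assists'].value_counts().head(1)
print(top_assists)
```

For the assists column, both approaches report 9 as the most frequent value with 3 occurrences.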
