Pandas: How to Use dropna() with thresh?

The thresh argument of the pandas dropna() function lets you specify the minimum number of non-null values that a row or column must contain in order to be kept. This is useful for removing rows or columns that have too many missing values while keeping those that are only partially incomplete.


You can use the dropna() function to drop rows from a pandas DataFrame that contain missing values.

You can also use the thresh argument to specify the minimum number of non-NaN values that a row or column must have in order to be kept in the DataFrame.

Here are the most common ways to use the thresh argument in practice:

Method 1: Only Keep Rows with Minimum Number of non-NaN Values

#only keep rows with at least 2 non-NaN values
df.dropna(thresh=2)

Method 2: Only Keep Rows with Minimum % of non-NaN Values

#only keep rows with at least 70% non-NaN values
df.dropna(thresh=0.7*len(df.columns))

Method 3: Only Keep Columns with Minimum Number of non-NaN Values

#only keep columns with at least 6 non-NaN values
df.dropna(thresh=6, axis=1)

Method 4: Only Keep Columns with Minimum % of non-NaN Values

#only keep columns with at least 70% non-NaN values
df.dropna(thresh=0.7*len(df), axis=1)

The following examples show how to use each method in practice with the following pandas DataFrame:

import pandas as pd
import numpy as np

#create DataFrame
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, np.nan, 19, 14, 14, 11, 20, np.nan],
                   'assists': [5, np.nan, np.nan, 9, np.nan, 9, 9, 4],
                   'rebounds': [11, np.nan, 10, 6, 6, 5, 9, np.nan]})

#view DataFrame
print(df)

  team  points  assists  rebounds
0    A    18.0      5.0      11.0
1    B     NaN      NaN       NaN
2    C    19.0      NaN      10.0
3    D    14.0      9.0       6.0
4    E    14.0      NaN       6.0
5    F    11.0      9.0       5.0
6    G    20.0      9.0       9.0
7    H     NaN      4.0       NaN

Example 1: Only Keep Rows with Minimum Number of non-NaN Values

We can use the following syntax to only keep the rows in the DataFrame that have at least 2 non-NaN values:

#only keep rows with at least 2 non-NaN values
df.dropna(thresh=2)

  team  points  assists  rebounds
0    A    18.0      5.0      11.0
2    C    19.0      NaN      10.0
3    D    14.0      9.0       6.0
4    E    14.0      NaN       6.0
5    F    11.0      9.0       5.0
6    G    20.0      9.0       9.0
7    H     NaN      4.0       NaN

Notice that the row in index position 1 has been dropped since it only had 1 non-NaN value in the entire row.
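To see why only row 1 is dropped, it can help to count the non-NaN values in each row directly. The following is a minimal sketch that rebuilds the same DataFrame; the variable names non_nan_per_row and filtered are illustrative and not part of the original example. It compares dropna(thresh=2) against a plain boolean filter on that count.

```python
import numpy as np
import pandas as pd

# same DataFrame as in the example above
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, np.nan, 19, 14, 14, 11, 20, np.nan],
                   'assists': [5, np.nan, np.nan, 9, np.nan, 9, 9, 4],
                   'rebounds': [11, np.nan, 10, 6, 6, 5, 9, np.nan]})

# count the non-NaN values in each row (illustrative helper variable)
non_nan_per_row = df.notna().sum(axis=1)
print(non_nan_per_row.tolist())  # [4, 1, 3, 4, 3, 4, 4, 2]

# keep rows whose count is at least 2, then compare with dropna(thresh=2)
filtered = df[non_nan_per_row >= 2]
print(filtered.equals(df.dropna(thresh=2)))  # True
```

Row 1 is the only row with a count below 2, since its only non-NaN value is the team name.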

Example 2: Only Keep Rows with Minimum % of non-NaN Values

We can use the following syntax to only keep the rows in the DataFrame that have at least 70% non-NaN values:

#only keep rows with at least 70% non-NaN values
df.dropna(thresh=0.7*len(df.columns))

  team  points  assists  rebounds
0    A    18.0      5.0      11.0
2    C    19.0      NaN      10.0
3    D    14.0      9.0       6.0
4    E    14.0      NaN       6.0
5    F    11.0      9.0       5.0
6    G    20.0      9.0       9.0

Notice that the rows in index positions 1 and 7 have been dropped. Since 70% of 4 columns is 2.8, a row needs at least 3 non-NaN values to be kept, and those two rows have only 1 and 2 non-NaN values, respectively.
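The pandas documentation describes thresh as an integer, while 0.7*len(df.columns) evaluates to a float. One possible way to make the threshold explicit is to round it up to a whole number first. The sketch below assumes this rounding approach and uses the illustrative names min_non_nan and result, which do not appear in the original example.

```python
import math

import numpy as np
import pandas as pd

# same DataFrame as in the example above
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, np.nan, 19, 14, 14, 11, 20, np.nan],
                   'assists': [5, np.nan, np.nan, 9, np.nan, 9, 9, 4],
                   'rebounds': [11, np.nan, 10, 6, 6, 5, 9, np.nan]})

# 70% of 4 columns is 2.8, so round up to a whole number of values
min_non_nan = math.ceil(0.7 * len(df.columns))
print(min_non_nan)  # 3

# only keep rows with at least that many non-NaN values
result = df.dropna(thresh=min_non_nan)
print(result.index.tolist())  # [0, 2, 3, 4, 5, 6]
```

For this DataFrame the rounded threshold keeps the same rows as the float version shown above.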

Example 3: Only Keep Columns with Minimum Number of non-NaN Values

We can use the following syntax to only keep the columns in the DataFrame that have at least 6 non-NaN values:

#only keep columns with at least 6 non-NaN values
df.dropna(thresh=6, axis=1)

  team  points  rebounds
0    A    18.0      11.0
1    B     NaN       NaN
2    C    19.0      10.0
3    D    14.0       6.0
4    E    14.0       6.0
5    F    11.0       5.0
6    G    20.0       9.0
7    H     NaN       NaN

Notice that the ‘assists’ column has been dropped because it has only 5 non-NaN values, which is below the required minimum of 6.
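Before choosing a column threshold, it may be useful to check how many non-NaN values each column actually has. The sketch below uses the count() method for this; the names counts and kept_columns are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# same DataFrame as in the example above
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, np.nan, 19, 14, 14, 11, 20, np.nan],
                   'assists': [5, np.nan, np.nan, 9, np.nan, 9, 9, 4],
                   'rebounds': [11, np.nan, 10, 6, 6, 5, 9, np.nan]})

# count() returns the number of non-NaN values in each column
counts = df.count()
print(counts)

# columns that meet the threshold of 6 non-NaN values
kept_columns = counts[counts >= 6].index.tolist()
print(kept_columns)  # ['team', 'points', 'rebounds']
```

The columns listed in kept_columns are the same ones that remain after df.dropna(thresh=6, axis=1).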

Example 4: Only Keep Columns with Minimum % of non-NaN Values

We can use the following syntax to only keep the columns in the DataFrame that have at least 70% non-NaN values:

#only keep columns with at least 70% non-NaN values
df.dropna(thresh=0.7*len(df), axis=1)

  team  points  rebounds
0    A    18.0      11.0
1    B     NaN       NaN
2    C    19.0      10.0
3    D    14.0       6.0
4    E    14.0       6.0
5    F    11.0       5.0
6    G    20.0       9.0
7    H     NaN       NaN

Notice that the ‘assists’ column has been dropped. Since 70% of 8 rows is 5.6, a column needs at least 6 non-NaN values to be kept, and the ‘assists’ column has only 5.
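Keep in mind that dropna() returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a variable if you want to keep it. The sketch below illustrates this with a hypothetical variable named cleaned and a threshold rounded up to a whole number, which is an assumption on top of the original example.

```python
import math

import numpy as np
import pandas as pd

# same DataFrame as in the example above
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'points': [18, np.nan, 19, 14, 14, 11, 20, np.nan],
                   'assists': [5, np.nan, np.nan, 9, np.nan, 9, 9, 4],
                   'rebounds': [11, np.nan, 10, 6, 6, 5, 9, np.nan]})

# 70% of 8 rows is 5.6, rounded up to 6 non-NaN values per column
cleaned = df.dropna(thresh=math.ceil(0.7 * len(df)), axis=1)

print(df.shape)       # (8, 4), the original still has all 4 columns
print(cleaned.shape)  # (8, 3), the 'assists' column is gone
```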

Note: You can find the complete documentation for the dropna() function in the official pandas documentation.
