Table of Contents

Pandas is an incredibly powerful library for data manipulation and analysis in Python. It can be used to read HTML tables from webpages and then manipulate the data within them. To do so, the read_html() method can be used to turn the HTML tables into a list of DataFrames. This example shows how to read a simple HTML table and then print it to the console. Once the DataFrames are created, they can be manipulated and processed just like any other Pandas DataFrame.

You can use the pandas function to read HTML tables into a pandas DataFrame.

This function uses the following basic syntax:

df = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')

The following example shows how to use this function to read in a table of NBA team names from .

Example: Read HTML Table with Pandas

Before using the read_html() function, you’ll likely have to install lxml:

pip install lxml

Note: If you’re using a Jupyter notebook, you need to restart the kernel after performing this installation.

Next, we can use the read_html() function to read every HTML table on :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

#read all HTML tables from specific URL
tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')

#display total number of tables read
len(tabs)

44

We can see that a total of 44 HTML tables were found on this page.

I know that the table I’m interested in has the word “Division” in it, so I can use the match argument to only retrieve HTML tables that contain this word:

#read HTML tables from specific URL with the word "Division" in them
tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association',
                    match='Division')

#display total number of tables read
len(tabs)

1

I can then of the columns of the table:

#define table
df = tabs[0]

#list all column names of table
list(df)

[('Division', 'Eastern Conference'),
 ('Team', 'Eastern Conference'),
 ('Location', 'Eastern Conference'),
 ('Arena', 'Eastern Conference'),
 ('Capacity', 'Eastern Conference'),
 ('Coordinates', 'Eastern Conference'),
 ('Founded', 'Eastern Conference'),
 ('Joined', 'Eastern Conference'),
 ('Unnamed: 8_level_0', 'Eastern Conference')]

I’m only interested in the first two columns, so I can the DataFrame to only contain these columns:

#filter DataFrame to only contain first two columns
df_final = df.iloc[:, 0:2]

#rename columns
df_final.columns = ['Division', 'Team']

#view first few rows of final DataFrame
print(df_final.head())

   Division                Team
0  Atlantic      Boston Celtics
1  Atlantic       Brooklyn Nets
2  Atlantic     New York Knicks
3  Atlantic  Philadelphia 76ers
4  Atlantic     Toronto Raptors

The following tutorials explain how to read other types of files in pandas:

How to read HTML tables with Pandas (Including Example)

Example: Read HTML Table with Pandas

Requst a

Scale

Example: Read HTML Table with Pandas

Related terms:

Requst a

Scale