Pandas read_html

Pandas read_html#

The pandas package provides us with one of the easiest ways to get data from web pages: .read_html().

However, this method only works if the data is already nicely formatted in an HTML table. For example, the unemployment rate data from BLS:
https://data.bls.gov/timeseries/LNS14000000

If we pass this URL to .read_html(), it will return a list of all of the <table> elements it found on the page.

import pandas as pd
url = 'https://data.bls.gov/timeseries/LNS14000000'
df_unemp = pd.read_html(url)

len(df_unemp)
2

In this case it found two tables. We can print them out and see that the unemployment data we want is contained in the second table.

df_unemp[0]
0 1
0 Series Id: LNS14000000
1 Seasonally Adjusted Seasonally Adjusted
2 Series title: (Seas) Unemployment Rate
3 Labor force status: Unemployment rate
4 Type of data: Percent or rate
5 Age: 16 years and over
df_unemp[1]
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 2014 6.6 6.7 6.7 6.2 6.3 6.1 6.2 6.1 5.9 5.7 5.8 5.6
1 2015 5.7 5.5 5.4 5.4 5.6 5.3 5.2 5.1 5.0 5.0 5.1 5.0
2 2016 4.8 4.9 5.0 5.1 4.8 4.9 4.8 4.9 5.0 4.9 4.7 4.7
3 2017 4.7 4.6 4.4 4.4 4.4 4.3 4.3 4.4 4.3 4.2 4.2 4.1
4 2018 4.0 4.1 4.0 4.0 3.8 4.0 3.8 3.8 3.7 3.8 3.8 3.9
5 2019 4.0 3.8 3.8 3.7 3.6 3.6 3.7 3.6 3.5 3.6 3.6 3.6
6 2020 3.6 3.5 4.4 14.8 13.2 11.0 10.2 8.4 7.8 6.8 6.7 6.7
7 2021 6.4 6.2 6.1 6.1 5.8 5.9 5.4 5.1 4.7 4.5 4.1 3.9
8 2022 4.0 3.8 3.6 3.7 3.6 3.6 3.5 3.6 3.5 3.6 3.6 3.5
9 2023 3.4 3.6 3.5 3.4 3.7 3.6 3.5 3.8 3.8 3.8 3.7 3.7
10 2024 3.7 3.9 3.8 3.9 NaN NaN NaN NaN NaN NaN NaN NaN

The same thing works for the treasury yield curve data, though it requires a bit of post-scraping clean up.

url = 'https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value_month=202301'
df_rates = pd.read_html(url)
df_rates[0]
Date 20 YR 30 YR Extrapolation Factor 8 WEEKS BANK DISCOUNT COUPON EQUIVALENT 17 WEEKS BANK DISCOUNT COUPON EQUIVALENT.1 52 WEEKS BANK DISCOUNT COUPON EQUIVALENT.2 ... 4 Mo 6 Mo 1 Yr 2 Yr 3 Yr 5 Yr 7 Yr 10 Yr 20 Yr 30 Yr
0 01/03/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.70 4.77 4.72 4.40 4.18 3.94 3.89 3.79 4.06 3.88
1 01/04/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.69 4.77 4.71 4.36 4.11 3.85 3.79 3.69 3.97 3.81
2 01/05/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.75 4.81 4.78 4.45 4.18 3.90 3.82 3.71 3.96 3.78
3 01/06/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.74 4.79 4.71 4.24 3.96 3.69 3.63 3.55 3.84 3.67
4 01/09/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.74 4.83 4.69 4.19 3.93 3.66 3.60 3.53 3.83 3.66
5 01/10/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.77 4.85 4.74 4.24 3.94 3.72 3.67 3.61 3.91 3.74
6 01/11/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.82 4.84 4.73 4.20 3.90 3.66 3.61 3.54 3.84 3.67
7 01/12/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.74 4.76 4.66 4.12 3.79 3.53 3.48 3.43 3.73 3.56
8 01/13/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.73 4.77 4.69 4.22 3.88 3.60 3.55 3.49 3.79 3.61
9 01/17/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.74 4.82 4.67 4.18 3.86 3.60 3.57 3.53 3.81 3.64
10 01/18/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.74 4.79 4.63 4.06 3.72 3.43 3.40 3.37 3.65 3.54
11 01/19/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.74 4.79 4.65 4.09 3.76 3.48 3.43 3.39 3.69 3.57
12 01/20/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.75 4.80 4.68 4.14 3.83 3.56 3.51 3.48 3.77 3.66
13 01/23/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.76 4.82 4.70 4.21 3.88 3.61 3.56 3.52 3.80 3.69
14 01/24/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.75 4.84 4.70 4.12 3.86 3.58 3.52 3.46 3.73 3.62
15 01/25/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.75 4.79 4.67 4.11 3.84 3.54 3.51 3.46 3.74 3.62
16 01/26/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.76 4.79 4.68 4.17 3.88 3.58 3.54 3.49 3.75 3.62
17 01/27/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.76 4.81 4.68 4.19 3.90 3.62 3.58 3.52 3.77 3.64
18 01/30/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.76 4.82 4.71 4.25 3.96 3.68 3.62 3.55 3.79 3.66
19 01/31/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.74 4.80 4.68 4.21 3.90 3.63 3.59 3.52 3.78 3.65

20 rows × 23 columns

df_clean = df_rates[0].drop(columns=['20 YR', '30 YR', 'Extrapolation Factor',
       '8 WEEKS BANK DISCOUNT', 'COUPON EQUIVALENT', '17 WEEKS BANK DISCOUNT',
       'COUPON EQUIVALENT.1', '52 WEEKS BANK DISCOUNT', 'COUPON EQUIVALENT.2'])
df_clean
Date 1 Mo 2 Mo 3 Mo 4 Mo 6 Mo 1 Yr 2 Yr 3 Yr 5 Yr 7 Yr 10 Yr 20 Yr 30 Yr
0 01/03/2023 4.17 4.42 4.53 4.70 4.77 4.72 4.40 4.18 3.94 3.89 3.79 4.06 3.88
1 01/04/2023 4.20 4.42 4.55 4.69 4.77 4.71 4.36 4.11 3.85 3.79 3.69 3.97 3.81
2 01/05/2023 4.30 4.55 4.66 4.75 4.81 4.78 4.45 4.18 3.90 3.82 3.71 3.96 3.78
3 01/06/2023 4.32 4.55 4.67 4.74 4.79 4.71 4.24 3.96 3.69 3.63 3.55 3.84 3.67
4 01/09/2023 4.37 4.58 4.70 4.74 4.83 4.69 4.19 3.93 3.66 3.60 3.53 3.83 3.66
5 01/10/2023 4.41 4.62 4.73 4.77 4.85 4.74 4.24 3.94 3.72 3.67 3.61 3.91 3.74
6 01/11/2023 4.42 4.62 4.72 4.82 4.84 4.73 4.20 3.90 3.66 3.61 3.54 3.84 3.67
7 01/12/2023 4.57 4.59 4.66 4.74 4.76 4.66 4.12 3.79 3.53 3.48 3.43 3.73 3.56
8 01/13/2023 4.58 4.59 4.67 4.73 4.77 4.69 4.22 3.88 3.60 3.55 3.49 3.79 3.61
9 01/17/2023 4.60 4.63 4.71 4.74 4.82 4.67 4.18 3.86 3.60 3.57 3.53 3.81 3.64
10 01/18/2023 4.59 4.62 4.69 4.74 4.79 4.63 4.06 3.72 3.43 3.40 3.37 3.65 3.54
11 01/19/2023 4.69 4.66 4.71 4.74 4.79 4.65 4.09 3.76 3.48 3.43 3.39 3.69 3.57
12 01/20/2023 4.69 4.64 4.72 4.75 4.80 4.68 4.14 3.83 3.56 3.51 3.48 3.77 3.66
13 01/23/2023 4.69 4.65 4.73 4.76 4.82 4.70 4.21 3.88 3.61 3.56 3.52 3.80 3.69
14 01/24/2023 4.70 4.67 4.72 4.75 4.84 4.70 4.12 3.86 3.58 3.52 3.46 3.73 3.62
15 01/25/2023 4.67 4.65 4.72 4.75 4.79 4.67 4.11 3.84 3.54 3.51 3.46 3.74 3.62
16 01/26/2023 4.61 4.65 4.71 4.76 4.79 4.68 4.17 3.88 3.58 3.54 3.49 3.75 3.62
17 01/27/2023 4.61 4.64 4.73 4.76 4.81 4.68 4.19 3.90 3.62 3.58 3.52 3.77 3.64
18 01/30/2023 4.60 4.64 4.72 4.76 4.82 4.71 4.25 3.96 3.68 3.62 3.55 3.79 3.66
19 01/31/2023 4.58 4.64 4.70 4.74 4.80 4.68 4.21 3.90 3.63 3.59 3.52 3.78 3.65