How to Scrape Hypertext from Tables Using Beautiful Soup

To get some web scraping practice, I wanted to obtain a large list of animal names. Colorado State University has a list of links to this data as a part of the Warren and Genevieve Garst Photographic Collection. The data is stored in table format. Here’s how to scrape hypertext data from HTML tables using Beautiful Soup.

Inspect the Data

First, visit the web page and inspect the data you would like to scrape. On Windows, this means either right-clicking the desired element and selecting ‘Inspect’ or pressing Ctrl+Shift+I to open the browser’s developer tools.

Figure: Context menu that appears after right-clicking an element on the web page.
Figure: Inspected element in the Developer Tools window, showing the surrounding HTML that we'll need to scrape the data.

After inspecting the element, we see that it is in an HTML table and each row holds an entry for an animal name.

Scrape the Data

Before beginning, import the packages we’ll need now (requests and Beautiful Soup) and later on (pandas).

import pandas as pd
import requests
from bs4 import BeautifulSoup

Then, we’ll use a request to gather the HTML from the webpage.

page = requests.get('https://lib2.colostate.edu/wildlife/atoz.php?letter=ALL')
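A request can fail or hang, so it may be worth checking the response before parsing it. The sketch below is one way to do that. The helper name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the original walkthrough.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML, raising an error if the request failed.

    fetch_html and the default timeout are illustrative choices.
    """
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html with the URL above should return the same text as page.text, or raise an exception if the request did not succeed.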

Next, we’ll create a Beautiful Soup object referencing the page variable.

soup = BeautifulSoup(page.text, 'html.parser')

Using the object we just created, let’s gather all the row data by appending it to a new list.

rows = soup.find_all('tr')  # every table row on the page
list_animals = []
for row in rows:
    instance = row.get_text()  # text of all cells in the row, newlines included
    list_animals.append(instance)
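
Since each row holds separate cells, another option is to read the cells one at a time instead of taking the whole row's text. The sketch below runs on a small made-up HTML snippet (sample_html, its two animals, and the link targets are placeholders, not the real page's markup) and also picks up the link target behind each name.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page; the actual markup may differ.
sample_html = """
<table>
  <tr><th>Animal</th><th>Genus Species</th></tr>
  <tr><td><a href="/wildlife/aardvark">Aardvark</a></td><td>Orycteropus afer</td></tr>
  <tr><td><a href="/wildlife/lion">Lion</a></td><td>Panthera leo</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

records = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) < 2:
        # Header rows use <th>, so they have no <td> cells and are skipped.
        continue
    link = cells[0].find('a')
    records.append({
        'common_name': cells[0].get_text(strip=True),
        'genus_species': cells[1].get_text(strip=True),
        # href is None when the cell has no link
        'href': link.get('href') if link else None,
    })

print(records)
```

Reading cell by cell avoids the later newline stripping and splitting, at the cost of assuming every data row has at least two td cells.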

Afterwards, I create a pandas dataframe from the list we just generated and use the head() method to preview the output.

list1 = pd.DataFrame(list_animals)
print(list1.head(10))
Figure: Initial dataframe preview after scraping, showing animal names and their associated genus and species classification.

Clean the Data

Based on our output, I want to refine the dataframe so that each row entry can be split into two columns. First, I remove the rows at index positions 0 through 2, which leaves only the animal entries.

list1 = list1.drop(list1.index[:3])  # drop the first three rows
Figure: Dataframe after the row drop, leaving only animal names and genus species.

Then, I strip the newline characters from the start and end of each cell entry using the lstrip() and rstrip() string methods. I split the remaining column into two columns, one for the animal name and one for the genus and species, by using str.split() on the newline that separates them.

list1['v1'] = list1[0].map(lambda x: x.lstrip('\n').rstrip('\n'))
list1[['v1','v2']] = list1['v1'].str.split('\n',expand=True)
Figure: Dataframe with the original column, a new column with animal names, and another column with genus and species.
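
The same strip-and-split step can be tried on a couple of made-up rows to see what it does. This is only a sketch: the two sample strings are assumptions about the shape of the scraped text, and .str.strip('\n') with n=1 are substitutions for the lstrip()/rstrip() and plain split used above.

```python
import pandas as pd

# Made-up rows in the assumed shape of the scraped text:
# newline, name, newline, species, newline.
raw = pd.DataFrame(['\nAardvark\nOrycteropus afer\n', '\nLion\nPanthera leo\n'])

# .str.strip('\n') removes leading and trailing newlines in one call
cleaned = raw[0].str.strip('\n')

# n=1 splits on the first newline only, so there are never more than two columns
parts = cleaned.str.split('\n', n=1, expand=True)
parts.columns = ['v1', 'v2']

print(parts)
```

Limiting the split with n=1 is a defensive choice: if a row ever contained an extra newline, the assignment to exactly two columns would still work.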

Next, I reset the index, then drop both the ‘index’ column that reset_index() adds and the original column 0. I rename columns ‘v1’ and ‘v2’ to more descriptive names.

list1 = list1.reset_index()
list1 = list1.drop(columns = ['index', 0])
list1 = list1.rename(columns={"v1": "common_name", "v2": "genus_species"})
Figure: Final dataframe after cleaning, with two columns titled 'common_name' and 'genus_species'.
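
As a possible shortcut, reset_index(drop=True) renumbers the rows without adding an ‘index’ column, so only the original column needs dropping. The sketch below uses a made-up two-row frame (frame and tidy are illustrative names) that mimics the state after the split.

```python
import pandas as pd

# Made-up frame mimicking the state after the split: original column 0 plus
# v1 and v2, with an index starting at 3 because three rows were dropped.
frame = pd.DataFrame(
    {0: ['\nAardvark\nOrycteropus afer\n', '\nLion\nPanthera leo\n'],
     'v1': ['Aardvark', 'Lion'],
     'v2': ['Orycteropus afer', 'Panthera leo']},
    index=[3, 4],
)

tidy = (
    frame
    .drop(columns=[0])        # remove the original unsplit column
    .reset_index(drop=True)   # renumber from 0 without keeping the old index
    .rename(columns={'v1': 'common_name', 'v2': 'genus_species'})
)

print(tidy)
```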

Lastly, I save the dataframe in a comma-separated values file for later use.

list1.to_csv(r'Path\...\animal_names_list.csv')  # to_csv returns None when given a path, so there is nothing to assign
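
By default, to_csv also writes the row index as an extra first column. If that is not wanted, index=False leaves it out. The sketch below shows a round trip on a made-up two-row frame; the temporary folder and the file name animal_names_example.csv are placeholders chosen for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up two-row frame standing in for the cleaned data.
animals = pd.DataFrame({
    'common_name': ['Aardvark', 'Lion'],
    'genus_species': ['Orycteropus afer', 'Panthera leo'],
})

# Write to a temporary folder; the file name here is only an example.
out_path = os.path.join(tempfile.mkdtemp(), 'animal_names_example.csv')

# index=False leaves the row numbers out of the file
animals.to_csv(out_path, index=False)

# Reading the file back confirms the two columns survived the round trip
restored = pd.read_csv(out_path)
print(restored)
```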

1 comment

  1. Just remember that, on a page with multiple data tables, you may have to qualify the rows better than `soup.find_all('tr')` …

    I think the BeautifulSoup selectors may be different than “jQuery-style” selectors, but there you might do something like `('table.the_class_name_here > tr')`

    to get *just* rows from a table with a certain class name.

    And if it was a page where the table needed to “render” – as in, the plain response doesn’t have it – there’s a library from the same guy as the normal requests one which does that too. https://pypi.org/project/requests-html/
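
For what it's worth, Beautiful Soup does accept CSS selectors through its select() method. Below is a rough sketch on a made-up page with two tables; the class name animals and the HTML itself are placeholders, not the real page's markup.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; only the one with class "animals" is wanted.
sample_html = """
<table class="nav"><tr><td>Home</td></tr></table>
<table class="animals">
  <tr><td>Aardvark</td><td>Orycteropus afer</td></tr>
  <tr><td>Lion</td><td>Panthera leo</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# "table.animals tr" matches rows anywhere inside the table, including inside
# a <tbody>; "table.animals > tr" would match only direct children of the table.
rows = soup.select('table.animals tr')

names = [row.find('td').get_text(strip=True) for row in rows]
print(names)
```

The descendant form is used here because some pages wrap rows in a tbody element, in which case the child selector would find nothing.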
