How to Scrape Hypertext from Tables Using Beautiful Soup

To get some web scraping practice, I wanted to obtain a large list of animal names. Colorado State University has a list of links to this data as a part of the Warren and Genevieve Garst Photographic Collection. The data is stored in table format. Here’s how to scrape hypertext data from HTML tables using Beautiful Soup.

Inspect the Data

First, visit the web page and inspect the data you would like to scrape. On Windows, this means either right-clicking a desired element and selecting ‘Inspect’ or pressing Ctrl+Shift+I to open the browser’s developer tools.

Screen that appears after right-clicking an element on the web page.
Inspected element in the Developer Tools window showing the HTML behind the page.

After inspecting the element, we see that it is in an HTML table and each row holds an entry for an animal name.

Scrape the Data

Before beginning, import the packages we’ll need now (requests and Beautiful Soup) and later on (pandas).

import pandas as pd
import requests
from bs4 import BeautifulSoup

Then, we’ll use requests to fetch the HTML from the web page.

page = requests.get('https://lib2.colostate.edu/wildlife/atoz.php?letter=ALL')

Next, we’ll create a Beautiful Soup object referencing the page variable.

soup = BeautifulSoup(page.text, 'html.parser')

Using the object we just created, let’s gather the text from every table row by appending it to a new list.

rows = soup.find_all('tr')
list_animals = []
for row in rows:         
    instance = row.get_text()
    list_animals.append(instance)
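To see what get_text() actually returns for each row, here’s a minimal, self-contained sketch of the same loop run against a sample table. The markup below is hypothetical, shaped like the page’s structure; note that the whitespace between cells ends up inside the extracted strings, which is what the cleaning step later has to deal with.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure of the page's table
html = """<table>
<tr>
<td>Aardvark</td>
<td>Orycteropus afer</td>
</tr>
<tr>
<td>Aardwolf</td>
<td>Proteles cristata</td>
</tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
list_animals = []
for row in soup.find_all('tr'):
    # get_text() concatenates all text inside the row, including the
    # newlines between the <td> tags
    list_animals.append(row.get_text())

# list_animals[0] is '\nAardvark\nOrycteropus afer\n'
```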

Afterwards, I create a pandas dataframe from the list we just generated and use the head() function to preview the output.

list1 = pd.DataFrame(list_animals)
print(list1.head(10))
Preview of output from initial scraping of animal names and associated genus and species classification.

Clean the Data

Based on the output, I want to refine the dataframe so each row entry can be split into two columns. First, I remove the rows at index locations 0 through 2.

list1 = list1.drop(index=[0, 1, 2])
Dataframe following row drop to leave only animal names and genus species.

Then, I strip the newline characters from the start and end of each cell entry using the lstrip() and rstrip() functions. I split the remaining column into an ‘Animal’ column and a ‘Genus Species’ column by using str.split() on the interior newline.

list1['v1'] = list1[0].map(lambda x: x.lstrip('\n').rstrip('\n'))
list1[['v1','v2']] = list1['v1'].str.split('\n',expand=True)
Separation of original column based on animal name and genus species.
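The same cleaning steps can be verified on a small stand-in dataframe. The sample values below are hypothetical, shaped like what get_text() produces for each row: a leading newline, the common name, a newline, the genus and species, and a trailing newline.

```python
import pandas as pd

# Hypothetical stand-in for the scraped column
list1 = pd.DataFrame(['\nAardvark\nOrycteropus afer\n',
                      '\nAardwolf\nProteles cristata\n'])

# Strip the outer newlines from each entry
list1['v1'] = list1[0].map(lambda x: x.lstrip('\n').rstrip('\n'))

# Split on the interior newline into two columns
list1[['v1', 'v2']] = list1['v1'].str.split('\n', expand=True)

# 'v1' now holds common names, 'v2' holds genus and species
print(list1[['v1', 'v2']])
```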

Next, I reset the index and drop both the old index column and the original column. Then I rename ‘v1’ and ‘v2’ to their appropriate names.

list1 = list1.reset_index()
list1 = list1.drop(columns = ['index', 0])
list1 = list1.rename(columns={"v1": "common_name", "v2": "genus_species"})
Final dataframe after cleaning, with columns properly titled ‘common_name’ and ‘genus_species’.

Lastly, I save the dataframe to a comma-separated values file for later use, passing index=False so the row index isn’t written as an extra column. (Note that to_csv() returns None when given a path, so there’s no need to assign its result.)

list1.to_csv(r'Path\...\animal_names_list.csv', index=False)
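As a quick sanity check, the round trip can be sketched in memory with io.StringIO standing in for a file path; index=False keeps the row index out of the file so the columns come back exactly as saved.

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'common_name': ['Aardvark', 'Aardwolf'],
                   'genus_species': ['Orycteropus afer', 'Proteles cristata']})

# Write to an in-memory buffer instead of a file path
buf = StringIO()
df.to_csv(buf, index=False)

# Read it back to confirm the columns survive intact
buf.seek(0)
restored = pd.read_csv(buf)
print(restored.columns.tolist())  # ['common_name', 'genus_species']
```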

The Role of Open Access

Open access research gives the public a free opportunity to learn from and use data as needed, but it is still far from the norm. For researchers outside of academia, pulling together useful data can be difficult given these accessibility barriers.

About two months ago, I began looking for data to create a model of biological inputs and energy requirements in the United States food system. Open data resources such as FAOSTAT, the Economic Research Service, and the Bureau of Transportation Statistics provided helpful figures on land use, food imports, and food transportation. Beyond these resources, much of the information I wanted to reference in building a model came from scientific papers that require journal subscriptions or charge a per-article fee.

Three articles that may have been helpful in my research illustrated the cost of access.

Upon closer investigation, Appetite claims it ‘supports open access,’ yet publisher Elsevier charges authors $3,000 to make an article available to everyone. Affordable open access clearly isn’t a priority for these publishers.

There may have been useful data in the articles mentioned above. However, I won’t find out because I’m sticking with open access resources for my food systems project.

Public government databases are great, but specific scientific studies may hold more value for independent researchers. Journals like PLOS ONE lead the way in open access articles for those looking for specific research to complement information from public databases. A 2016 article by Paul Basken in The Chronicle of Higher Education, ‘As an Open-Access Megajournal Cedes Some Ground, a Movement Gathers Steam,’ shows a rise in open access papers — but I had to get the figures via Boston College because the article itself is ‘premium content for subscribers.’

Rise of published open access articles between 2008 and 2015. Data from: Basken, P. 2016. As an open-access megajournal cedes some ground, a movement gathers steam. The Chronicle of Higher Education, 62(19), 5-5.

Charging fees for access creates an elitist barrier between academia and those who want to learn more about certain topics. I’m not proposing that everyone would take advantage of open access research articles if publishing were cheaper or access fees disappeared. But if more studies were open access, members of the public would have more opportunities to digest scientific studies on their own terms.

There’s immense value in the open-source, collaborative culture of the tech community that I hope spills over into academia. I’m optimistic about a continued increase in open access publications in the science community. For now, I’m looking forward to creating open source projects that take advantage of public data.