To get some web scraping practice, I wanted to obtain a large list of animal names. Colorado State University has a list of links to this data as a part of the Warren and Genevieve Garst Photographic Collection. The data is stored in table format. Here’s how to scrape hypertext data from HTML tables using Beautiful Soup.
Inspect the Data
First, visit the web page and inspect the data you would like to scrape. On Windows, this means either right-clicking the desired element and selecting ‘Inspect’ or pressing Ctrl+Shift+I to open the browser’s developer tools.
After inspecting the element, we see that it is in an HTML table and each row holds an entry for an animal name.
Scrape the Data
Before beginning, import the packages we’ll need now (requests and Beautiful Soup) and later on (pandas).
import pandas as pd
import requests
from bs4 import BeautifulSoup
Then, we’ll use a request to gather the HTML from the webpage.
page = requests.get('https://lib2.colostate.edu/wildlife/atoz.php?letter=ALL')
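A request can fail or hang, so it may be worth checking the response before parsing it. The following is a minimal sketch, not part of the original walkthrough: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails.

    `fetch_html` and the default timeout are hypothetical choices,
    not something the scraped site requires.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html('https://lib2.colostate.edu/wildlife/atoz.php?letter=ALL')
```

The returned string could then be passed to `BeautifulSoup` in place of `page.text`.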
Next, we’ll create a Beautiful Soup object referencing the page variable.
soup = BeautifulSoup(page.text, 'html.parser')
Using the object we just created, let’s gather all the row data by appending each row’s text to a new list.
rows = soup.find_all('tr')
list_animals = []
for row in rows:
    instance = row.get_text()
    list_animals.append(instance)
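As an alternative to collecting each row’s full text, the cells can be pulled out one by one, which avoids some of the string cleanup later. This is only a sketch on a made-up HTML snippet; the real page’s markup may differ, and `sample_html` and `records` are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = """
<table>
  <tr><th>Animal</th><th>Genus Species</th></tr>
  <tr><td>Aardvark</td><td>Orycteropus afer</td></tr>
  <tr><td>Aardwolf</td><td>Proteles cristata</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

records = []
for row in soup.find_all('tr'):
    # get_text(strip=True) trims surrounding whitespace from each cell
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:  # rows containing only <th> cells yield an empty list
        records.append(cells)

print(records)
```

Each entry in `records` is already a list of cell values, so it could be handed to `pd.DataFrame` with explicit column names.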
Afterwards, I create a pandas dataframe from the list we just generated and use the head() method to preview the output.
list1 = pd.DataFrame(list_animals)
print(list1.head(10))
Clean the Data
Based on our output, I want to refine the dataframe so the row entries are in a position to be split into two columns. First, I remove rows in index locations 0 through 2.
list1 = list1.drop([list1.index[0], list1.index[1], list1.index[2]])
Then, I strip the newline characters from the start and end of each cell entry using the lstrip() and rstrip() string methods. The remaining text holds the ‘Animal’ and ‘Genus Species’ values separated by a newline, so I use str.split() to separate it into two columns.
list1['v1'] = list1[0].map(lambda x: x.lstrip('\n').rstrip('\n'))
list1[['v1','v2']] = list1['v1'].str.split('\n',expand=True)
Next, I reset the index, then drop both the old index column that reset_index() leaves behind and the original column 0. I rename columns ‘v1’ and ‘v2’ to more descriptive names.
list1 = list1.reset_index()
list1 = list1.drop(columns = ['index', 0])
list1 = list1.rename(columns={"v1": "common_name", "v2": "genus_species"})
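The cleaning steps above can also be condensed into one chained expression. The sketch below runs on made-up rows that mimic what `row.get_text()` might return; the first three entries are placeholders for the non-data rows, and `raw_rows` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Made-up rows mimicking row.get_text() output; the first three stand in
# for the non-data rows that get dropped.
raw_rows = [
    '\nheader\n',
    '\nnavigation\n',
    '\nAnimal\nGenus Species\n',
    '\nAardvark\nOrycteropus afer\n',
    '\nAardwolf\nProteles cristata\n',
]

cleaned = (
    pd.Series(raw_rows)
    .iloc[3:]                            # skip the three non-data rows
    .str.strip('\n')                     # trim leading/trailing newlines
    .str.split('\n', n=1, expand=True)   # one split gives two columns
    .reset_index(drop=True)              # renumber rows from 0
)
cleaned.columns = ['common_name', 'genus_species']

print(cleaned)
```

Passing `drop=True` to `reset_index()` discards the old index directly, so there is no leftover ‘index’ column to remove afterwards.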
Lastly, I save the dataframe in a comma-separated values file for later use.
list1.to_csv(r'Path\...\animal_names_list.csv')
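By default, to_csv() also writes the row index as an unnamed first column. If that is not wanted, `index=False` leaves it out. Here is a small sketch that writes to a temporary directory and reads the file back; the one-row dataframe and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up dataframe standing in for the cleaned animal list.
df = pd.DataFrame({
    'common_name': ['Aardvark'],
    'genus_species': ['Orycteropus afer'],
})

# Write to a temporary location (hypothetical path, not the author's).
out_path = os.path.join(tempfile.mkdtemp(), 'animal_names_list.csv')
df.to_csv(out_path, index=False)  # index=False omits the row index column

# Read the file back to confirm only the two named columns were written.
round_trip = pd.read_csv(out_path)
print(round_trip)
```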
Just remember that, on a page with multiple data tables, you may have to qualify the rows more precisely than `soup.find_all('tr')` …
I think the BeautifulSoup selectors may be different from “jQuery-style” selectors, but there you might do something like `('table.the_class_name_here > tr')` to get *just* the rows from a table with a certain class name.
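For what it’s worth, Beautiful Soup’s `select()` method does accept CSS selectors, so the idea can be sketched roughly like this. The HTML and the class names `nav` and `animals` are invented for the example. The descendant form `table.animals tr` is used instead of `>` because some parsers or pages place rows inside a `<tbody>`, in which case a direct-child selector would match nothing.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; only the second holds the data we want.
html = """
<table class="nav"><tr><td>Home</td></tr></table>
<table class="animals">
  <tr><td>Aardvark</td><td>Orycteropus afer</td></tr>
  <tr><td>Aardwolf</td><td>Proteles cristata</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Select only the rows that sit inside the table with class "animals".
rows = soup.select('table.animals tr')
names = [row.get_text(' ', strip=True) for row in rows]

print(names)
```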
And if it were a page where the table needed to “render” (as in, the plain response doesn’t contain it), there’s a library from the same author as the regular requests one which handles that too: https://pypi.org/project/requests-html/