In this tutorial, I review ways to take raw categorical survey data and create new variables for analysis and visualizations with Python using pandas and GeoPy. I’ll show how to make new pandas columns from encoding complex responses, geocoding locations, and measuring distances.
Here’s the associated GitHub repository for this workshop, which includes the data set and a Jupyter Notebook for the code.
Thanks to the St. Lawrence Eastern Lake Ontario Partnership for Regional Invasive Species Management (SLELO PRISM), I was able to use boat launch steward data from 2016 for this virtual workshop. The survey data was collected by boat launch stewards around Lake Ontario in upstate New York. Boaters were asked a series of survey questions and their watercrafts were inspected for aquatic invasive species.
This tutorial was originally designed for the Syracuse Women in Machine Learning and Data Science (Syracuse WiMLDS) Meetup group.
Here’s an overview of how to map the coordinates of cities mentioned in song lyrics using Python. In this example, I used Lana Del Rey’s lyrics for my data and focused on United States cities. The full code for this is in a Jupyter Notebook on my GitHub under the lyrics_map repository.
Gather Bulk Song Lyrics Data
First, create an account with Genius to obtain an API key. This is used for making requests to scrape song lyrics data from a desired artist. Store the key in a text file. Then, follow the tutorial steps from this blog post by Nick Pai and reference the API key text file within the code.
You can customize the code to cater to a certain artist and number of songs. To be safe, I put in a request for lyrics from 300 songs.
Find Cities and Countries in the Data
After getting the song lyrics in a text file, open the file and use geotext to grab city names. Append these to a new pandas dataframe.
Use GeoText to gather country mentions and put these in a column. Then, clean the raw output and create a new dataframe querying only on the United States.
Personally, I focus only on United States cities to reduce errors from geotext reading common words such as ‘Born’ as foreign city names.
In my example, I corrected Newport and Venice to include ‘Beach’. I understand this can be cumbersome with larger datasets, but I did not see it imperative to automate this task for my example.
city_mentions = city_mentions.replace(to_replace ='Newport', value ='Newport Beach')
city_mentions = city_mentions.replace(to_replace ='Venice', value ='Venice Beach')
Next, save a list and a dataframe with value counts for each city to be used later for the map. Reset the index as well to have the two columns as city and mentions.
Use GeoPy to geocode the cities from the unique list, which pulls associated coordinates and location data. The user agent needs to be specified to avoid an error. Create a dataframe from this output.
chrome_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36"
geolocator = Nominatim(timeout=10,user_agent=chrome_user_agent)
lat_lon = []
for city in unique_list:
try:
location = geolocator.geocode(city)
if location:
lat_lon.append(location)
except GeocoderTimedOut as e:
print("Error: geocode failed on input %s with message %s"%
(city, e))
city_data = pd.DataFrame(lat_lon, columns=['raw_data','raw_data2'])
city_data = city_data[['raw_data2', 'raw_data']]
This yields one column as the latitude and longitude and another with comma separated location data.
Reduce the Geocode Data to Desired Columns
I cleaned my data to have only city names and associated coordinates. The output from GeoPy allows for more information such as county and state, if desired.
To split the location data (raw_data) column, convert it to a string and then split it and create a new column (city) from the first indexed object.
Then, convert the coordinates column (raw_data2) into a string type to remove the parentheses and finally split on the comma.
#change the coordinates to a string
city_data['raw_data2'] = city_data['raw_data2'].astype(str)
#split the coordinates using the comma as the delimiter
city_data[['lat','lon']] = city_data.raw_data2.str.split(",",expand=True,)
#remove the parentheses
city_data['lat'] = city_data['lat'].map(lambda x:x.lstrip('()'))
city_data['lon'] = city_data['lon'].map(lambda x:x.rstrip('()'))
Convert the latitude and longitude columns back to floats because this is the usable type for plotly.
Create an account with MapBox to obtain an API key to plot my song lyric locations in a Plotly Express bubble map. Alternatively, it is also possible to generate the map without an API key if you have Dash installed. Customize the map for visibility by adjusting variables such as the color scale, the zoom extent, and the data that appears when hovering over the data.
px.set_mapbox_access_token(open("mapbox_token.txt").read())
df = px.data.carshare()
fig = px.scatter_mapbox(merged, lat='lat', lon='lon', color='mentions', size='mentions',
color_continuous_scale=px.colors.sequential.Agsunset, size_max=40, zoom=3,
hover_data=['city'])
fig.update_layout(
title={
'text': 'US Cities Mentioned in Lana Del Rey Songs',
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
#save graph as html
with open('plotly_graph.html', 'w') as f:
f.write(fig.to_html(include_plotlyjs='cdn'))