Here’s an overview of how to map the coordinates of cities mentioned in song lyrics using Python. In this example, I used Lana Del Rey’s lyrics for my data and focused on United States cities. The full code for this is in a Jupyter Notebook on my GitHub under the lyrics_map repository.
Gather Bulk Song Lyrics Data
First, create an account with Genius to obtain an API key. This is used for making requests to scrape song lyrics data from a desired artist. Store the key in a text file. Then, follow the tutorial steps from this blog post by Nick Pai and reference the API key text file within the code.
You can customize the code to cater to a certain artist and number of songs. To be safe, I put in a request for lyrics from 300 songs.
Find Cities and Countries in the Data
After getting the song lyrics in a text file, open the file and use geotext to grab city names. Append these to a new pandas dataframe.
places = GeoText(content)
cities_from_text = places.cities
city_mentions = pd.DataFrame(cities_from_text, columns=['city'])
Use GeoText to gather country mentions and put these in a column. Then, clean the raw output and create a new dataframe querying only on the United States.
Personally, I focus only on United States cities to reduce errors from geotext reading common words such as ‘Born’ as foreign city names.
f = lambda x: GeoText(x).country_mentions
origin = city_mentions['city'].apply(f)
city_mentions['country_raw'] = origin
fn = lambda x: list(x)[0]
city_mentions['country'] = city_mentions['country_raw'].apply(fn)
city_mentions = city_mentions[city_mentions['country'] == 'US']
Afterwards, remove the country columns and manually clean the city data. I removed city names that seemed inaccurate.
city_mentions.drop(columns=['country_raw', 'country'], inplace=True)
cities_to_remove = ['Paris','Mustang','Palm','Bradley','Sunset','Pontiac','Green','Paradise',
'Mansfield','Eden','Crystal','Monroe','Columbia','Laredo','Joplin','Adrian',
'York','Golden','Oklahoma','Kansas','Coachella','Kokomo','Woodstock']
city_mentions = city_mentions[~city_mentions['city'].isin(cities_to_remove)]
In my example, I corrected Newport and Venice to include ‘Beach’. I understand this can be cumbersome with larger datasets, but I did not see it imperative to automate this task for my example.
city_mentions = city_mentions.replace(to_replace ='Newport', value ='Newport Beach')
city_mentions = city_mentions.replace(to_replace ='Venice', value ='Venice Beach')
Next, save a list and a dataframe with value counts for each city to be used later for the map. Reset the index as well to have the two columns as city and mentions.
city_val_counts = city_mentions['city'].value_counts()
city_counts = pd.DataFrame(city_val_counts)
city_counts = city_counts.reset_index()
city_counts.columns = ['city', 'mentions']
Then, create a list of the unique city values.
unique_list = (city_mentions['city'].unique().tolist())
Geocode the City Names
Use GeoPy to geocode the cities from the unique list, which pulls associated coordinates and location data. The user agent needs to be specified to avoid an error. Create a dataframe from this output.
chrome_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36"
geolocator = Nominatim(timeout=10,user_agent=chrome_user_agent)
lat_lon = []
for city in unique_list:
try:
location = geolocator.geocode(city)
if location:
lat_lon.append(location)
except GeocoderTimedOut as e:
print("Error: geocode failed on input %s with message %s"%
(city, e))
city_data = pd.DataFrame(lat_lon, columns=['raw_data','raw_data2'])
city_data = city_data[['raw_data2', 'raw_data']]
This yields one column as the latitude and longitude and another with comma separated location data.
Reduce the Geocode Data to Desired Columns
I cleaned my data to have only city names and associated coordinates. The output from GeoPy allows for more information such as county and state, if desired.
To split the location data (raw_data) column, convert it to a string and then split it and create a new column (city) from the first indexed object.
city_data['city'] = city_data['raw_data'].str.split(',').str[0]
Then, convert the coordinates column (raw_data2) into a string type to remove the parentheses and finally split on the comma.
#change the coordinates to a string
city_data['raw_data2'] = city_data['raw_data2'].astype(str)
#split the coordinates using the comma as the delimiter
city_data[['lat','lon']] = city_data.raw_data2.str.split(",",expand=True,)
#remove the parentheses
city_data['lat'] = city_data['lat'].map(lambda x:x.lstrip('()'))
city_data['lon'] = city_data['lon'].map(lambda x:x.rstrip('()'))
Convert the latitude and longitude columns back to floats because this is the usable type for plotly.
city_data = city_data.astype({'lat': 'float64', 'lon': 'float64'})
Next, drop all the unneeded columns.
city_data.drop(['raw_data2', 'raw_data'], axis = 1, inplace=True)
Drop any duplicates and end up with a clean set of city, latitude, and longitude.
city_data.drop_duplicates(keep=False,inplace=True)
Create the Final Merged DataFrame and Map
Merge the city coordinates dataframe and city mentions dataframe using a left join on city names.
merged = pd.merge(city_data, city_counts, on='city', how='left')
Create an account with MapBox to obtain an API key to plot my song lyric locations in a Plotly Express bubble map. Alternatively, it is also possible to generate the map without an API key if you have Dash installed. Customize the map for visibility by adjusting variables such as the color scale, the zoom extent, and the data that appears when hovering over the data.
px.set_mapbox_access_token(open("mapbox_token.txt").read())
df = px.data.carshare()
fig = px.scatter_mapbox(merged, lat='lat', lon='lon', color='mentions', size='mentions',
color_continuous_scale=px.colors.sequential.Agsunset, size_max=40, zoom=3,
hover_data=['city'])
fig.update_layout(
title={
'text': 'US Cities Mentioned in Lana Del Rey Songs',
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
#save graph as html
with open('plotly_graph.html', 'w') as f:
f.write(fig.to_html(include_plotlyjs='cdn'))