Mapping Song Lyric Locations in Python

Here’s an overview of how to map the coordinates of cities mentioned in song lyrics using Python. In this example, I used Lana Del Rey’s lyrics for my data and focused on United States cities. The full code for this is in a Jupyter Notebook on my GitHub under the lyrics_map repository.

A map with Lana Del Rey’s Lust for Life album booklet.

Gather Bulk Song Lyrics Data

First, create an account with Genius to obtain an API key. The key authorizes the requests that pull song lyrics for a chosen artist. Store the key in a text file. Then, follow the tutorial steps from this blog post by Nick Pai and reference the API key text file within the code.
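
As a minimal sketch of the key-file step, assuming the key sits on its own in a small text file, a helper like the one below reads it and strips stray whitespace. The function name read_api_key and the file name in the usage comment are my own placeholders, not part of the tutorial mentioned above.

```python
from pathlib import Path


def read_api_key(path):
    """Return the API key stored in a text file, without surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key


# Hypothetical usage; point this at wherever the key was saved:
# genius_key = read_api_key("genius_key.txt")
```

Keeping the key in a separate file means the notebook can be shared without exposing it.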

You can customize the code to cater to a certain artist and number of songs. To be safe, I put in a request for lyrics from 300 songs.

Find Cities and Countries in the Data

After getting the song lyrics in a text file, open the file and use geotext to grab city names. Append these to a new pandas dataframe.

from geotext import GeoText
import pandas as pd

# content is the lyrics text read from the file
places = GeoText(content)
cities_from_text = places.cities
city_mentions = pd.DataFrame(cities_from_text, columns=['city'])

Use GeoText to gather country mentions and put these in a column. Then, clean the raw output and create a new dataframe querying only on the United States.

Personally, I focus only on United States cities to reduce errors from geotext reading common words such as ‘Born’ as foreign city names.

The results from geotext city and country mentions in a dataframe, with a cleaned country column.
f = lambda x: GeoText(x).country_mentions
origin = city_mentions['city'].apply(f)
city_mentions['country_raw'] = origin

fn = lambda x: list(x)[0]
city_mentions['country'] = city_mentions['country_raw'].apply(fn)

city_mentions = city_mentions[city_mentions['country'] == 'US']

Afterwards, remove the country columns and manually clean the city data. I removed city names that seemed inaccurate.

city_mentions.drop(columns=['country_raw', 'country'], inplace=True)

# extend this list with any other names that are not real city mentions
cities_to_remove = ['Paris','Mustang','Palm','Bradley','Sunset','Pontiac','Green','Paradise']

city_mentions = city_mentions[~city_mentions['city'].isin(cities_to_remove)]

In my example, I corrected Newport and Venice to include ‘Beach’. This can be cumbersome with larger datasets, but I did not find it necessary to automate the task for my example.

city_mentions = city_mentions.replace(to_replace ='Newport', value ='Newport Beach')
city_mentions = city_mentions.replace(to_replace ='Venice', value ='Venice Beach')
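
If the list of corrections grows, one option is to keep them in a single dictionary and apply them in one pass. This is only a sketch: the sample dataframe below is a stand-in for city_mentions, and 'Malibu' is an illustrative value rather than something taken from my data.

```python
import pandas as pd

# Corrections from the example above; extend the mapping as needed.
corrections = {"Newport": "Newport Beach", "Venice": "Venice Beach"}

# A tiny stand-in for the city_mentions dataframe.
sample = pd.DataFrame({"city": ["Newport", "Venice", "Malibu", "Venice"]})

# Series.replace with a dict swaps whole-cell matches only,
# so values that are not keys of the mapping are left alone.
sample["city"] = sample["city"].replace(corrections)
print(sample["city"].tolist())
```

Names that are not in the dictionary pass through unchanged, so the same mapping can be reused across runs.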

Next, save a list and a dataframe with value counts for each city to be used later for the map. Reset the index as well to have the two columns as city and mentions.

city_val_counts = city_mentions['city'].value_counts()
city_counts = pd.DataFrame(city_val_counts)

city_counts = city_counts.reset_index()
city_counts.columns = ['city', 'mentions']
A pandas dataframe shows city and number of song mentions.

Then, create a list of the unique city values.

unique_list = (city_mentions['city'].unique().tolist())

Geocode the City Names

Use GeoPy to geocode the cities from the unique list, which pulls associated coordinates and location data. The user agent needs to be specified to avoid an error. Create a dataframe from this output.

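from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

chrome_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36"
geolocator = Nominatim(timeout=10, user_agent=chrome_user_agent)

# collect (address, (latitude, longitude)) for every city that geocodes
lat_lon = []
for city in unique_list:
    try:
        location = geolocator.geocode(city)
        if location:
            lat_lon.append((location.address, (location.latitude, location.longitude)))
    except GeocoderTimedOut as e:
        print("Error: geocode failed on input %s with message %s" %
              (city, e))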

city_data = pd.DataFrame(lat_lon, columns=['raw_data','raw_data2'])
city_data = city_data[['raw_data2', 'raw_data']]

This yields one column with the latitude and longitude and another with comma-separated location data.

The raw output of GeoPy’s geocode function in a pandas dataframe, showing the coordinates and associated location fields in a list.
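
The geocoding loop above only reports a timeout and moves on. A possible extension, sketched here with names of my own (geocode_with_retry and its parameters are hypothetical, not part of GeoPy), is to retry a failed lookup a few times before giving up. The geocoder is passed in as a plain function so the helper does not depend on any particular service.

```python
import time


def geocode_with_retry(geocode, query, retry_on=(TimeoutError,), attempts=3, delay=1.0):
    """Call geocode(query), retrying when one of the retry_on exceptions is raised.

    Returns the geocoder's result, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return geocode(query)
        except retry_on as error:
            print(f"Attempt {attempt} failed for {query!r}: {error}")
            if attempt < attempts:
                # pause before the next try so the service is not hammered
                time.sleep(delay)
    return None
```

With GeoPy, I would expect to call it as geocode_with_retry(geolocator.geocode, city, retry_on=(GeocoderTimedOut,)), though I have not wired it into the notebook.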

Reduce the Geocode Data to Desired Columns

I cleaned my data to have only city names and associated coordinates. The output from GeoPy allows for more information such as county and state, if desired.

To split the location data column (raw_data), use the pandas string methods to split each value on commas and create a new column (city) from the first element.

city_data['city'] = city_data['raw_data'].str.split(',').str[0]
A dataframe with the outputs from GeoPy geocoder with one new column for string split city names.

Then, convert the coordinates column (raw_data2) into a string type to remove the parentheses and finally split on the comma.

#change the coordinates to a string
city_data['raw_data2'] = city_data['raw_data2'].astype(str)

#split the coordinates using the comma as the delimiter
city_data[['lat','lon']] = city_data.raw_data2.str.split(",",expand=True,)

#remove the parentheses
city_data['lat'] = city_data['lat'].map(lambda x:x.lstrip('()'))
city_data['lon'] = city_data['lon'].map(lambda x:x.rstrip('()'))

Convert the latitude and longitude columns back to floats, because Plotly needs numeric values to place the points.

city_data = city_data.astype({'lat': 'float64', 'lon': 'float64'})
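
As an alternative to stripping parentheses and splitting on the comma, a coordinate string can be parsed in one step. This is a sketch under the assumption that each value looks like a Python tuple such as '(34.05, -118.24)'; parse_coordinates is a hypothetical helper and the numbers are illustrative.

```python
import ast


def parse_coordinates(text):
    """Turn a string such as '(34.05, -118.24)' into a (lat, lon) tuple of floats."""
    # literal_eval only accepts Python literals, so it is safe on untrusted text
    lat, lon = ast.literal_eval(text)
    return float(lat), float(lon)


print(parse_coordinates("(34.0536909, -118.242766)"))
```

This returns floats directly, so no separate type conversion is needed afterwards.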

Next, drop all the unneeded columns.

city_data.drop(['raw_data2', 'raw_data'], axis = 1, inplace=True)

Drop any duplicates and end up with a clean set of city, latitude, and longitude.
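
The step above does not show the call itself, so here is one way it could look. The dataframe is a small stand-in for city_data, and its coordinates are made-up placeholders rather than geocoder output.

```python
import pandas as pd

# Illustrative rows only; the coordinates are placeholders.
city_data_demo = pd.DataFrame({
    "city": ["Malibu", "Malibu", "Venice Beach"],
    "lat": [34.0, 34.0, 33.9],
    "lon": [-118.7, -118.7, -118.4],
})

# Keep the first row for each city and renumber the index.
deduped = city_data_demo.drop_duplicates(subset="city").reset_index(drop=True)
print(deduped)
```

Deduplicating on the city column alone keeps one row per city even if two lookups returned slightly different coordinates.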

The cleaned dataframe for the city, latitude, and longitude.

Create the Final Merged DataFrame and Map

Merge the city coordinates dataframe and city mentions dataframe using a left join on city names.

merged = pd.merge(city_data, city_counts, on='city', how='left')
The final merged dataframe with city, latitude, longitude, and number of song mentions.
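
With a left join, a city whose geocoded name does not match any name in the mentions table stays in the result with an empty mentions value. A quick way to spot such rows is the indicator option of pd.merge, sketched below on two small stand-in dataframes with placeholder values.

```python
import pandas as pd

# Stand-ins for city_data and city_counts; the values are placeholders.
coords = pd.DataFrame({
    "city": ["Malibu", "Venice Beach"],
    "lat": [34.0, 33.9],
    "lon": [-118.7, -118.4],
})
counts = pd.DataFrame({"city": ["Malibu"], "mentions": [3]})

# indicator=True adds a _merge column saying where each row came from.
checked = pd.merge(coords, counts, on="city", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "city"].tolist()
print(unmatched)
```

Any city listed in unmatched has coordinates but no mention count, which usually points to a spelling difference between the two tables.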

Create an account with Mapbox to obtain an API key for plotting the song lyric locations in a Plotly Express bubble map. Alternatively, it is also possible to generate the map without an API key if you have Dash installed. Customize the map for visibility by adjusting settings such as the color scale, the zoom level, and the data that appears when hovering over a point.

import plotly.express as px

fig = px.scatter_mapbox(merged, lat='lat', lon='lon', color='mentions', size='mentions',
                  color_continuous_scale=px.colors.sequential.Agsunset, size_max=40, zoom=3)
fig.update_layout(title={
        'text': 'US Cities Mentioned in Lana Del Rey Songs',
        'xanchor': 'center',
        'yanchor': 'top'})

#save graph as html
with open('plotly_graph.html', 'w', encoding='utf-8') as f:
    f.write(fig.to_html())

Spotify Web API: How to Pull and Clean Top Song Data using Python


I used the Spotify Web API to pull the top songs from my personal account. I’ll go over how to get the fifty most popular songs from a user’s Spotify account using spotipy, clean the data, and produce visualizations in Python.

Top 50 Spotify Songs

Top 50 songs from my personal Spotify account, extracted using the Spotify API.
| # | Song | Artist | Album | Popularity |
|---|------|--------|-------|------------|
| 1 | Borderline | Tame Impala | Borderline | 77 |
| 2 | Groceries | Mallrat | In the Sky | 64 |
| 3 | Fading | Toro y Moi | Outer Peace | 48 |
| 4 | Fanfare | Magic City Hippies | Hippie Castle EP | 57 |
| 5 | Limestone | Magic City Hippies | Hippie Castle EP | 59 |
| 6 | High Steppin' | The Avett Brothers | Closer Than Together | 51 |
| 7 | I Think Your Nose Is Bleeding | The Front Bottoms | Ann | 43 |
| 8 | Die Die Die | The Avett Brothers | Emotionalism (Bonus Track Version) | 44 |
| 9 | Spice | Magic City Hippies | Modern Animal | 42 |
| 10 | Bleeding White | The Avett Brothers | Closer Than Together | 53 |
| 11 | Prom Queen | Beach Bunny | Prom Queen | 73 |
| 12 | Sports | Beach Bunny | Sports | 65 |
| 13 | February | Beach Bunny | Crybaby | 51 |
| 14 | Pale Beneath The Tan (Squeeze) | The Front Bottoms | Ann | 43 |
| 15 | 12 Feet Deep | The Front Bottoms | Rose | 49 |
| 16 | Au Revoir (Adios) | The Front Bottoms | Talon Of The Hawk | 50 |
| 17 | Freelance | Toro y Moi | Outer Peace | 57 |
| 18 | Spaceman | The Killers | Day & Age (Bonus Tracks) | 62 |
| 19 | Destroyed By Hippie Powers | Car Seat Headrest | Teens of Denial | 51 |
| 20 | Why Won't They Talk To Me? | Tame Impala | Lonerism | 59 |
| 21 | Fallingwater | Maggie Rogers | Heard It In A Past Life | 71 |
| 22 | Funny You Should Ask | The Front Bottoms | Talon Of The Hawk | 48 |
| 23 | You Used To Say (Holy Fuck) | The Front Bottoms | Going Grey | 47 |
| 24 | Today Is Not Real | The Front Bottoms | Ann | 41 |
| 25 | Father | The Front Bottoms | The Front Bottoms | 43 |
| 26 | Broken Boy | Cage The Elephant | Social Cues | 60 |
| 28 | Laugh Till I Cry | The Front Bottoms | Back On Top | 47 |
| 29 | Nobody's Home | Mallrat | Nobody's Home | 56 |
| 30 | Apocalypse Dreams | Tame Impala | Lonerism | 60 |
| 31 | Fill in the Blank | Car Seat Headrest | Teens of Denial | 56 |
| 32 | Spiderhead | Cage The Elephant | Melophobia | 57 |
| 33 | Tie Dye Dragon | The Front Bottoms | Ann | 47 |
| 34 | Summer Shandy | The Front Bottoms | Back On Top | 43 |
| 35 | At the Beach | The Avett Brothers | Mignonette | 51 |
| 36 | Motorcycle | The Front Bottoms | Back On Top | 41 |
| 37 | The New Love Song | The Avett Brothers | Mignonette | 42 |
| 38 | Paranoia in B Major | The Avett Brothers | Emotionalism (Bonus Track Version) | 49 |
| 39 | Aberdeen | Cage The Elephant | Thank You Happy Birthday | 54 |
| 40 | Losing Touch | The Killers | Day & Age (Bonus Tracks) | 51 |
| 41 | Four of a Kind | Magic City Hippies | Hippie Castle EP | 46 |
| 42 | Cosmic Hero (Live at the Tramshed, Cardiff, Wa... | Car Seat Headrest | Commit Yourself Completely | 34 |
| 43 | Locked Up | The Avett Brothers | Closer Than Together | 49 |
| 44 | Bull Ride | Magic City Hippies | Hippie Castle EP | 49 |
| 45 | The Weight of Lies | The Avett Brothers | Emotionalism (Bonus Track Version) | 51 |
| 46 | Heat Wave | Snail Mail | Lush | 60 |
| 47 | Awkward Conversations | The Front Bottoms | Rose | 42 |
| 48 | Baby Drive It Down | Toro y Moi | Outer Peace | 47 |
| 49 | Your Love | Middle Kids | Middle Kids EP | 29 |
| 50 | Ordinary Pleasure | Toro y Moi | Outer Peace | 58 |

Using Spotipy and the Spotify Web API

First, I create an account with Spotify for Developers and register an application from the dashboard. This provides both a client ID and a client secret for the application to use when making requests to the API.

Next, from the application page, in ‘Edit Settings’, under Redirect URIs, I add http://localhost:8888/callback. This will come in handy later when logging into a specific Spotify account to pull data.

Then, I write the code to make the request to the API. This will pull the data and write it to a JSON file.

I import the following libraries:

  • Python’s os library to pass the client ID, client secret, and redirect URI to the code through the operating system. This temporarily sets the credentials as environment variables.
  • Python’s json library to encode the data.
  • Spotipy to provide an authorization flow for logging in to a Spotify account and obtain current top tracks for export.
import os
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util

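Next, I set the client ID and secret to the values assigned to my application by Spotify. Then, I set the environment variables to include the client ID, client secret, and the redirect URI.

cid = ""     # client ID from the Spotify dashboard
secret = ""  # client secret from the Spotify dashboard

os.environ['SPOTIPY_CLIENT_ID'] = cid
os.environ['SPOTIPY_CLIENT_SECRET'] = secret
os.environ['SPOTIPY_REDIRECT_URI'] = 'http://localhost:8888/callback'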
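
Before starting the authorization flow, it can help to confirm that the credentials are actually set. The check below is a small sketch of my own (missing_credentials is a hypothetical helper), using the environment variable names that Spotipy reads.

```python
import os

REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET", "SPOTIPY_REDIRECT_URI")


def missing_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Report any gaps before trying to log in.
missing = missing_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

An empty string counts as missing here, which catches the case where the placeholder values were never filled in.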

Then, I work through the authorization flow from the Spotipy documentation. The first time this code is run, the user will have to provide their Spotify username and password when prompted in the web browser.

username = ""
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret) 
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
scope = 'user-top-read'
token = util.prompt_for_user_token(username, scope)

if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)

In the results section, I specify the information to pull. The arguments I provide indicate 50 songs as the limit, the index of the first item to return, and the time range. The time range options, as specified in Spotify’s documentation, are:

  • short_term : approximately last 4 weeks of listening
  • medium_term : approximately last 6 months of listening
  • long_term : last several years of listening
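
Since a misspelled time range is easy to miss, the arguments can be checked before the request is made. This is a sketch with a helper name of my own (top_tracks_params); it only validates against the three options listed above and builds the keyword arguments.

```python
VALID_TIME_RANGES = ("short_term", "medium_term", "long_term")


def top_tracks_params(limit=50, offset=0, time_range="medium_term"):
    """Build the keyword arguments for a top-tracks request, rejecting unknown ranges."""
    if time_range not in VALID_TIME_RANGES:
        raise ValueError(
            f"time_range must be one of {VALID_TIME_RANGES}, got {time_range!r}"
        )
    return {"limit": limit, "offset": offset, "time_range": time_range}


print(top_tracks_params(time_range="short_term"))
```

The resulting dictionary can be unpacked into the Spotipy call, for example sp.current_user_top_tracks(**top_tracks_params()).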

For my query, I decided to use the medium term argument because I thought that would give the best picture of my listening habits for the past half year. Lastly, I create a list to append the results to and then write them to a JSON file.

if token:
    sp = spotipy.Spotify(auth=token)
    results = sp.current_user_top_tracks(limit=50,offset=0,time_range='medium_term')
    # wrap the response in a list so the file holds one top-level array
    top_tracks = [results]
    with open('top50_data.json', 'w', encoding='utf-8') as f:
        json.dump(top_tracks, f, ensure_ascii=False, indent=4)
else:
    print("Can't get token for", username)

After saving this code in a Python file, I run it from the command line. The output is top50_data.json, which needs to be cleaned before it can be used to create visualizations.

Cleaning JSON Data for Visualizations

The top song data JSON file output is nested according to different categories, as seen in the sample below.

"artists": [
    {
        "external_urls": {
            "spotify": ""
        },
        "href": "",
        "id": "5PbpKlxQE0Ktl5lcNABoFf",
        "name": "Car Seat Headrest",
        "type": "artist",
        "uri": "spotify:artist:5PbpKlxQE0Ktl5lcNABoFf"
    }
],
"disc_number": 1,
"duration_ms": 303573,
"explicit": true,
"href": "",
"id": "5xy3350chgFfFcdTET4xz3",
"is_local": false,
"name": "Destroyed By Hippie Powers",
"popularity": 51,
"preview_url": "",
"track_number": 3,
"type": "track",
"uri": "spotify:track:5xy3350chgFfFcdTET4xz3"
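
To make the nesting concrete, here is a small sketch that loads a trimmed-down version of the sample above and reaches into it. Only a handful of the fields are kept, and the variable names are my own.

```python
import json

# A trimmed-down track object using values from the sample above.
sample_track = json.loads("""
{
    "artists": [
        {"name": "Car Seat Headrest", "type": "artist"}
    ],
    "duration_ms": 303573,
    "explicit": true,
    "name": "Destroyed By Hippie Powers",
    "popularity": 51
}
""")

# "artists" is a list of objects, so the first artist is at index 0.
first_artist = sample_track["artists"][0]["name"]

# Convert the duration from milliseconds to minutes and seconds.
minutes, seconds = divmod(sample_track["duration_ms"] // 1000, 60)
print(first_artist, "-", sample_track["name"], f"({minutes}:{seconds:02d})")
```

Track-level fields such as the name and popularity sit at the top level, while artist details need the extra list index.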

Before cleaning the JSON data and creating visualizations in a new file, I import json, pandas, matplotlib, and seaborn. Next, I load the JSON file with the top 50 song data.

import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

with open('top50_data.json') as f:
  data = json.load(f)

I create a full list of all the data to start. Next, I create lists where I will append the specific JSON data. Using a loop, I access each of the items of interest for analysis and append them to the lists.

list_of_results = data[0]["items"]
list_of_artist_names = []
list_of_artist_uri = []
list_of_song_names = []
list_of_song_uri = []
list_of_durations_ms = []
list_of_explicit = []
list_of_albums = []
list_of_popularity = []

for result in list_of_results:
    # a track can list several artists; keep the first one
    list_of_artist_names.append(result["artists"][0]["name"])
    list_of_artist_uri.append(result["artists"][0]["uri"])
    list_of_song_names.append(result["name"])
    list_of_song_uri.append(result["uri"])
    list_of_durations_ms.append(result["duration_ms"])
    list_of_explicit.append(result["explicit"])
    list_of_albums.append(result["album"]["name"])
    list_of_popularity.append(result["popularity"])

Then, I create a pandas DataFrame, name each column and populate it with the above lists, and export it as a CSV for a backup copy.

all_songs = pd.DataFrame(
    {'artist': list_of_artist_names,
     'artist_uri': list_of_artist_uri,
     'song': list_of_song_names,
     'song_uri': list_of_song_uri,
     'duration_ms': list_of_durations_ms,
     'explicit': list_of_explicit,
     'album': list_of_albums,
     'popularity': list_of_popularity
    })

all_songs.to_csv('top50_songs.csv')

Using the DataFrame, I create two visualizations. The first is a count plot using seaborn to show how many top songs came from each artist represented in the top 50 tracks.

top50 = all_songs  # the DataFrame created above
descending_order = top50['artist'].value_counts().sort_values(ascending=False).index
ax = sb.countplot(y = top50['artist'], order=descending_order)

sb.despine(fig=None, ax=None, top=True, right=True, left=False, trim=False)

ax.set_title('Songs per Artist in Top 50', fontsize=16, fontweight='heavy')
sb.set(font_scale = 1.4)

y = top50['artist'].value_counts()
for i, v in enumerate(y):
    ax.text(v + 0.2, i + .16, str(v), color='black', fontweight='light', fontsize=14)
plt.savefig('top50_songs_per_artist.jpg', bbox_inches="tight")
A countplot shows the number of songs per artists in the top 50 tracks from greatest to least.

The second graph is a seaborn box plot that shows the spread of song popularity for each artist in the top 50.

popularity = top50['popularity']
artists = top50['artist']


ax = sb.boxplot(x=popularity, y=artists, data=top50)
plt.xlabel('Popularity (0-100)')
plt.title('Song Popularity by Artist', fontweight='bold', fontsize=18)
plt.savefig('top50_artist_popularity.jpg', bbox_inches="tight")
A boxplot shows the different levels of song popularity per artist in top 50 Spotify tracks.
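
The line inside each box marks the median popularity for that artist. As a rough cross-check, the medians can be computed by hand. The sketch below uses the popularity values for two artists copied from the top 50 table, and the variable names are my own.

```python
from collections import defaultdict
from statistics import median

# (artist, popularity) pairs copied from the top 50 table for two artists.
rows = [
    ("Beach Bunny", 73), ("Beach Bunny", 65), ("Beach Bunny", 51),
    ("Tame Impala", 77), ("Tame Impala", 59), ("Tame Impala", 60),
]

# Group the popularity scores by artist.
by_artist = defaultdict(list)
for artist, popularity in rows:
    by_artist[artist].append(popularity)

# The median is the value the box plot draws as the center line.
medians = {artist: median(scores) for artist, scores in by_artist.items()}
print(medians)
```

Comparing these numbers with the plot is a quick way to confirm the right column ended up on the right axis.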

Further Considerations

For future interactions with the Spotify Web API, I would like to complete requests that pull top song data for each of the three term options and compare them. This would give a comprehensive view of listening habits and could lead to pulling further information from each artist.
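
A rough sketch of that comparison, which I have not run against the live API: the first helper loops over the three time ranges with the same call used earlier, and the second finds the songs that show up in every range. Both function names are hypothetical, and client is assumed to be an authorized Spotipy client such as sp.

```python
TIME_RANGES = ("short_term", "medium_term", "long_term")


def top_tracks_by_range(client, limit=50):
    """Return {time_range: list of track names} using a Spotipy-like client."""
    names = {}
    for time_range in TIME_RANGES:
        results = client.current_user_top_tracks(
            limit=limit, offset=0, time_range=time_range
        )
        names[time_range] = [item["name"] for item in results["items"]]
    return names


def in_every_range(names_by_range):
    """Return the track names that appear in all of the ranges, sorted."""
    sets = [set(names) for names in names_by_range.values()]
    return sorted(set.intersection(*sets)) if sets else []
```

Songs returned by in_every_range would be the long-running favorites, while songs that appear only in short_term would point to more recent listening.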