
[This is the second in a series of three articles]
In Part 1, we looked at how to retrieve data from the web. In this article, we will focus more on interacting with an API and on using Pandas.
Getting data from the Spotify API.
Previously we used web scraping to get a dataframe containing all of Rolling Stone’s 500 greatest songs of all time. What if we wanted to create a Spotify playlist? Or, better yet, get more data from Spotify to supplement ours? We can do that by querying and posting to Spotify’s API. In the simplest terms, an Application Programming Interface (or ‘API’) is the part of a server that your browser or app talks to when sending data to, or retrieving data from, the Internet. When you add a song to a playlist in Spotify or look at how long a song lasts, you’re interacting with its API.
This is a good starting point to learn more:
Step 1: Setting up
Head to Spotify’s API dashboard, log in with your standard account and click on ‘create an app’. From there you should copy the client id and the client secret. These will be the credentials you need to log in to the API. It’s a good idea to paste them in a .txt file (mine is called Spotify.txt; see below).
Now we can connect to the API as follows (the with open(file, 'r') as f line is a neat way to avoid forgetting to close the file):
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

file = 'Spotify.txt'  # replace this with the path and file name you use
with open(file, 'r') as f:
    lines = f.read().splitlines()
# each line is split on ':' and the part after it is used as the value
cid = lines[0].split(':')[1].strip()
secret = lines[1].split(':')[1].strip()
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
Python has a great library called spotipy (that’s the sp in the code above) which makes this much easier than usual. Here you can read the documentation. Or, like I did, you can get acquainted by reading an article here on Medium. I suggest this one: https://medium.com/@samlupton/spotipy-get-features-from-your-favourite-songs-in-python-6d71f0172df0.
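If you would like the credentials file to be a little more forgiving about spaces and line order, here is a minimal sketch of a parser. It assumes each line of the file looks like label:value; the labels client_id and client_secret, the function name parse_credentials and the sample values are all made up for illustration.

```python
def parse_credentials(text):
    """Parse 'label:value' lines into a dict, ignoring blank or malformed lines."""
    creds = {}
    for line in text.splitlines():
        if ':' not in line:
            continue  # skip anything that is not a label:value pair
        # split only once, in case the value itself contains a colon
        key, value = line.split(':', 1)
        creds[key.strip()] = value.strip()
    return creds


# Hypothetical file content, not real credentials
sample = "client_id: abc123\n\nclient_secret: def456\n"
creds = parse_credentials(sample)
print(creds['client_id'], creds['client_secret'])
```

With the real file you would pass f.read() to the function and then hand creds['client_id'] and creds['client_secret'] to SpotifyClientCredentials.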
Step 2: Retrieve the data
Now let’s get to the fun part. In order to download data from Spotify we first retrieve the track id for each one of our 500 songs. Let’s try this out for one example, My Girl by The Temptations.
We will extract the artist and the song’s title from our dataframe and pass the results to spotipy’s .search() method.
artist = df.Artist[412]
track = df['Song title'][412]
artist, track
('The Temptations', 'My Girl')
track_id = sp.search(q='artist:' + artist + ' track:' + track, type='track')
track_id
{'tracks': {'href': 'https://api.spotify.com/v1/search?query=artist%3AThe+Temptations+track%3AMy+Girl&type=track&offset=0&limit=10',
'items': [{'album': {'album_type': 'album',
'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/3RwQ26hR2tJtA8F9p2n7jG'},
'href': 'https://api.spotify.com/v1/artists/3RwQ26hR2tJtA8F9p2n7jG',
'id': '3RwQ26hR2tJtA8F9p2n7jG',
'name': 'The Temptations',
'type': 'artist',
'uri': 'spotify:artist:3RwQ26hR2tJtA8F9p2n7jG'}],
'available_markets': ['AD',
'AE',
'AL',
'AR',
'AT',
'AU',
'BA',
[...]
'previous': None,
'total': 143}}
While this may look scary, it’s just the JSON response, which spotipy hands back as a nested Python dictionary. A little inspection shows us that, for example, the actual id is nested as the value of the key id within the first item of a list, which is itself the value of the key items inside the top-level key-value pair (i.e. {tracks : ...}). Similarly, a little digging allows us to find the song’s popularity attribute.
id_ = track_id['tracks']['items'][0]['id']
popularity = track_id['tracks']['items'][0]['popularity']
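Indexing with [0] raises an IndexError when the search comes back with an empty items list. As a small sketch of a more defensive lookup, the hypothetical helper below returns None instead. The two dictionaries are hand-made stand-ins shaped like the response above, not real API output.

```python
def first_track_field(search_result, field):
    """Return `field` from the first track of a search result, or None if there is none."""
    items = search_result.get('tracks', {}).get('items', [])
    if not items:
        return None  # the search matched nothing
    return items[0].get(field)


# Hand-made examples that mimic the structure of a search response
mock_result = {'tracks': {'items': [{'id': 'abc', 'popularity': 70}], 'total': 1}}
empty_result = {'tracks': {'items': [], 'total': 0}}

print(first_track_field(mock_result, 'id'))
print(first_track_field(empty_result, 'id'))
```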
Now we can put everything into a function. Because some of the searches might not have a result, we’ll handle the exceptions with a try... except... block. Note the use of the built-in zip() function in the for loop, which pairs each artist with the matching song title and avoids more complex iterations. We then retrieve the data and add it to our dataframe.
def get_spotify_data(dataframe):
    """
    Takes a dataframe as input.
    Returns a list of track ids and a list of popularity scores from the Spotify API.
    """
    from numpy import nan  # we import np.nan to handle empty queries
    track_ids = []
    popularities = []
    # zip() pairs each artist with the matching song title
    for artist, song in zip(dataframe['Artist'], dataframe['Song title']):
        try:
            song_data = sp.search(q='artist:' + artist + ' track:' + song, type='track')
            track_id = song_data['tracks']['items'][0]['id']
            popularity = song_data['tracks']['items'][0]['popularity']
            track_ids.append(track_id)
            popularities.append(popularity)
        except Exception:  # e.g. IndexError when the search returns no items
            track_ids.append(nan)
            popularities.append(nan)
    return track_ids, popularities
track_ids, popularities = get_spotify_data(df)
df['Spotify id'] = track_ids
df['Popularity'] = popularities
df.head()

Good. Let’s also check whether our function has returned any missing values using pandas’ .isnull():
df.isnull().sum()
Artist 0
Song title 0
Writers 0
Producer 0
Year 0
Spotify id 13
Popularity 13
dtype: int64
We have 13 items that were not returned by our query. For brevity’s sake I will just drop those from the dataframe. I may add a workaround at a later stage.
df.dropna(inplace=True)
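If you would rather know which songs came back empty before dropping them, one option is to record the misses during the lookup. The sketch below is only an illustration: lookup_tracks and fake_search are hypothetical names, and the search function is passed in as a parameter so the example can run offline without credentials.

```python
def lookup_tracks(pairs, search):
    """Split (artist, song) pairs into those found by `search` and those missed."""
    found, missed = [], []
    for artist, song in pairs:
        result = search(q='artist:' + artist + ' track:' + song, type='track')
        items = result['tracks']['items']
        if items:
            found.append((artist, song, items[0]['id'], items[0]['popularity']))
        else:
            missed.append((artist, song))
    return found, missed


def fake_search(q, type):
    """Offline stand-in for sp.search: only 'My Girl' returns a hit."""
    if 'My Girl' in q:
        return {'tracks': {'items': [{'id': 'abc', 'popularity': 70}]}}
    return {'tracks': {'items': []}}


pairs = [('The Temptations', 'My Girl'), ('Made Up Band', 'No Such Song')]
found, missed = lookup_tracks(pairs, fake_search)
print(missed)
```

With the real client you would call lookup_tracks(zip(df['Artist'], df['Song title']), sp.search) and inspect missed to decide whether a differently worded query is worth trying.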
Step 3: Get audio features in the dataframe
Spotify also tracks audio features such as danceability, time signature and tempo. We can access those by passing a track’s id to spotipy’s audio_features method. I will show you two ways to do it. In the first, I use a list comprehension, a slightly more advanced Python feature that lets you build a list in a single, succinct expression. But you can use a standard for loop if it feels more comfortable.
To find out more about Spotify’s audio features head here.
# using list comprehension
features = [sp.audio_features(id_) for id_ in df['Spotify id']]
features[0]
# using a 'for' loop
features_2 = []
for id_ in df['Spotify id']:
    feature = sp.audio_features(id_)
    features_2.append(feature)
# Look at an example
features_2[0]
[{'danceability': 0.365,
'energy': 0.668,
'key': 7,
'loudness': -12.002,
'mode': 1,
'speechiness': 0.817,
'acousticness': 0.836,
'instrumentalness': 2.58e-05,
'liveness': 0.911,
'valence': 0.216,
'tempo': 53.071,
'type': 'audio_features',
'id': '2eOFGf5MOA5QHGLY9ZlOfl',
'uri': 'spotify:track:2eOFGf5MOA5QHGLY9ZlOfl',
'track_href': 'https://api.spotify.com/v1/tracks/2eOFGf5MOA5QHGLY9ZlOfl',
'analysis_url': 'https://api.spotify.com/v1/audio-analysis/2eOFGf5MOA5QHGLY9ZlOfl',
'duration_ms': 217720,
'time_signature': 4}]
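Calling audio_features once per song means one request per track. As far as I can tell from the documentation, the method also accepts a list of ids, and Spotify caps that list at 100 ids per request. Here is a minimal sketch of batching under that assumption. The names chunked, fetch_features_batched and fake_fetch are hypothetical, and fake_fetch stands in for sp.audio_features so the example runs offline.

```python
def chunked(seq, size=100):
    """Yield consecutive slices of `seq`, each with at most `size` elements."""
    seq = list(seq)
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def fetch_features_batched(ids, fetch, size=100):
    """Call `fetch` once per batch of ids and return one flat list of results."""
    features = []
    for batch in chunked(ids, size):
        features.extend(fetch(batch))
    return features


calls = []  # records the size of every batch the stand-in receives

def fake_fetch(batch):
    """Offline stand-in for sp.audio_features: one small dict per id."""
    calls.append(len(batch))
    return [{'id': track_id} for track_id in batch]


ids = [str(i) for i in range(250)]
result = fetch_features_batched(ids, fake_fetch)
print(calls, len(result))
```

Note that this returns a flat list of dictionaries, one per song. The rest of this article keeps working with the features list built above, where each song’s dictionary sits inside its own one-element list.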
We have a list of lists, each corresponding to a song and containing a single dictionary. Here’s how we are going to add each of the features to a corresponding column in the dataframe:
- create a list of the feature names (the dictionary’s keys)
- iterate over that list to create a dictionary. Keys will correspond to column names and values will be lists, empty for now, that will hold the actual audio features
- next, iterate over the list of features and retrieve each audio feature’s label and value pair with the dictionary’s items() method. Append each value to the matching list in the dictionary we created
- add each item of the dictionary as a new column with a for loop
# STEP 1
k = list(features[0][0].keys())
# STEP 2
dict_list = {}
for key in k:
    dict_list[key] = []
# STEP 3
for i in features:
    item = i[0].items()
    for pair in item:
        key, value = pair
        dict_list[key].append(value)
# STEP 4
for key in dict_list.keys():
    df[key] = dict_list[key]
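The first three steps can also be folded into one small function. The sketch below is a hypothetical helper, not part of spotipy. It assumes the first song has valid features, and that a song Spotify has no features for comes back as None, which it records as None in every column instead of raising an error. The sample data is made up.

```python
def features_to_columns(features, skip=()):
    """Turn a list of one-element lists of feature dicts into {column name: values}."""
    # Column names come from the first song, minus anything we want to skip
    keys = [key for key in features[0][0] if key not in skip]
    columns = {key: [] for key in keys}
    for entry in features:
        track = entry[0] or {}  # treat a missing song (None) as an empty dict
        for key in keys:
            columns[key].append(track.get(key))
    return columns


# Made-up sample shaped like the `features` list above
sample = [
    [{'danceability': 0.5, 'energy': 0.7, 'type': 'audio_features'}],
    [{'danceability': 0.3, 'energy': 0.9, 'type': 'audio_features'}],
    [None],
]
columns = features_to_columns(sample, skip=('type',))
print(columns)
```

Each key of the returned dictionary can then be assigned to the dataframe exactly as in step 4.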
Because we won’t be needing some of the columns I will just get rid of them and take a look at our dataframe.
columns_to_drop = ['tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url']
df.drop(columns_to_drop, axis=1, inplace=True)
df.head()

We’re all set. In the next article, we are finally going to start exploring and visualizing our dataset. I hope you enjoyed this one and learned about some useful tools.