Methods shared:
Extracting data from the CKAN API using Requests (storage efficiency).
Reading the data using Pandas (memory efficiency).
Purpose

Whilst sneezing and coughing on a relatively cool evening, I decided to explore seasonal allergic rhinitis (hay fever) prescribing data. The far-off aim is to build a model using pollen count data and the prescribing data.
In this ongoing process, I have faced multiple challenges. Here, I describe the methods used to extract the data within the constraints of a 2017 MacBook Air (8GB RAM, 22GB free storage, dual-core i5 processor).
Extracting Data from the CKAN API
The first step requires obtaining enough data to split into dev, train and test samples. It seemed reasonable to use a whole year of data to develop my future model, as whole seasons would then be accounted for.
The first problem: storage. A single CSV file on the NHSBSA website (one month of data) is at least 6GB in size, which accounts for nearly 30% of my remaining free storage.
The solution? Read the CSVs into a Pandas dataframe in memory through the API used by the NHSBSA. Code below.
#Import the requests module.
import requests
#The url, described within the API documentation, is used to view the datasets available.
url = "http://opendata.nhsbsa.net/api/3/action/package_list"
#Request to obtain data from a url.
response = requests.get(url)
#Status code 200 means the request was successful.
response.status_code
#Display the response as a dictionary.
response.json()

#Print the values in the 'result' key.
#The 'result' key contains the names of datasets which are available.
print(response.json()['result'])

#Using the API documentation, the url containing the medicines data was deduced and tested.
url2 = "http://opendata.nhsbsa.net/api/3/action/package_show?id=english-prescribing-data-epd"
response2 = requests.get(url2)
response2_json = response2.json()
response2.status_code
#The keys within the output of the results were investigated.
response2_json['result'].keys()

#Each monthly prescribing dataset is stored as a CSV file downloadable through a url.
response2_json['result']['resources'][0]['url']

#Each url is passed into a list.
datasets = []
for num in range(len(response2_json['result']['resources'])):
    datasets.append(response2_json['result']['resources'][num]['url'])
datasets
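Positional slices such as datasets[60:61] (used later) rely on the order in which CKAN returns the resources. As a rough sketch, and assuming each resource dictionary carries a 'name' field that embeds the year and month (e.g. 'EPD_201901'), a whole year's worth of URLs could instead be selected by name:
#A sketch only: this assumes each resource dictionary includes a 'name' field such as
#'EPD_201901'. Inspect response2_json['result']['resources'][0] to confirm the actual keys.
datasets_2019 = [
    resource['url']
    for resource in response2_json['result']['resources']
    if resource.get('name', '').startswith('EPD_2019')
]
datasets_2019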
Reading the Data
Attempting to read every CSV in the list above wastes hours of waiting and ends in a Jupyter Notebook crash. To begin, I investigated the memory usage of a single dataframe.
#Here, an entire CSV (one month) is read into memory.
import pandas as pd
test_df = pd.read_csv(datasets[60], sep=",")
#The info() method outputs information about the dataframe. Additional arguments such as
#'memory_usage' and 'null_counts' control how much detail is returned, at the cost of extra
#computation.
test_df.info(memory_usage='deep', null_counts=True, verbose=True)

23.8GB. Astonishing! After exploring the data further to determine the most suitable features, records and data types, I created the function below. It saves memory and time by reading only the required columns, converting data types to ones that use less memory where appropriate, and filtering the records as required.
from datetime import datetime

def LargeCSVsChop(listofCSV_URLs):
    dfs = []
    for num in range(len(listofCSV_URLs)):
        #'feats' lists the columns/features from each CSV that Pandas should read into memory.
        feats = ['YEAR_MONTH', 'PCO_NAME', 'PRACTICE_NAME', 'CHEMICAL_SUBSTANCE_BNF_DESCR',
                 'BNF_DESCRIPTION', 'TOTAL_QUANTITY', 'ACTUAL_COST']
        #'custom_date_parser' converts the dtype of the YEAR_MONTH feature to date format.
        custom_date_parser = lambda x: datetime.strptime(x, "%Y%m")
        df = pd.read_csv(listofCSV_URLs[num], sep=",", usecols=feats,
                         date_parser=custom_date_parser, parse_dates=['YEAR_MONTH'])
        """
        A dictionary such as types = {'TOTAL_QUANTITY': int, 'ACTUAL_COST': float} could be
        passed to read_csv via dtype=types to control the datatypes the columns are read as.
        The inferred types are already optimal here, so no changes are made, but the dictionary
        can be used in future.
        'df' is reused as the variable name for the filtered dataframe to prompt deallocation of
        the unfiltered dataframe: its reference count drops to zero and the garbage collector can
        reclaim the memory.
        Oral antihistamines are the focus of this analysis. 'CHEMICAL_SUBSTANCE_BNF_DESCR' is used
        to ensure all forms and brands of the drugs are captured; 'BNF_DESCRIPTION' is then used
        to remove unwanted products such as injections.
        """
        df = df[
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Cetirizine hydrochloride') |
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Loratadine') |
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Desloratadine') |
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Fexofenadine hydrochloride') |
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Acrivastine') |
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Bilastine') |
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Levocetirizine') |
            (df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Mizolastine') |
            ((df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Chlorphenamine maleate') &
             (df['BNF_DESCRIPTION'] != 'Chlorphenamine 10mg/1ml solution for injection ampoules')) |
            ((df['CHEMICAL_SUBSTANCE_BNF_DESCR'] == 'Promethazine hydrochloride') &
             (df['BNF_DESCRIPTION'] != 'Promethazine 25mg/1ml solution for injection ampoules') &
             (df['BNF_DESCRIPTION'] != 'Phenergan 25mg/1ml solution for injection ampoules'))
        ]
        dfs.append(df)
    df_large = pd.concat(dfs)
    return df_large
The size of the resulting dataframe can be seen below: a huge difference.
Jan_2019_df = LargeCSVsChop(datasets[60:61])
Jan_2019_df.info(memory_usage='deep', null_counts=True, verbose=True)
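For future reference, the 'types' dictionary mentioned in the function's docstring could be passed to read_csv to trim memory further. The snippet below, reusing the imports above, is only a sketch: the chosen dtypes are assumptions and should be checked against the data (32-bit floats may lose precision, for example), with info(memory_usage='deep') used to confirm any saving.
#Assumed dtypes: numeric columns downcast to 32-bit floats, repeated text columns as 'category'.
feats = ['YEAR_MONTH', 'PCO_NAME', 'PRACTICE_NAME', 'CHEMICAL_SUBSTANCE_BNF_DESCR',
         'BNF_DESCRIPTION', 'TOTAL_QUANTITY', 'ACTUAL_COST']
types = {'TOTAL_QUANTITY': 'float32', 'ACTUAL_COST': 'float32',
         'PCO_NAME': 'category', 'CHEMICAL_SUBSTANCE_BNF_DESCR': 'category'}
df_typed = pd.read_csv(datasets[60], sep=",", usecols=feats, dtype=types,
                       parse_dates=['YEAR_MONTH'],
                       date_parser=lambda x: datetime.strptime(x, "%Y%m"))
df_typed.info(memory_usage='deep')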

Time
Unfortunately, I found that the time taken to read the CSVs remains similar. I hypothesise that this is because the documents are read from urls, so the process is limited by bandwidth.
As a result, in part 2 of this series, I will run this particular task on the virtual machines made available to students by AWS. This should free up my machine whilst EDA is undertaken.
The function below, which I learnt in my Algorithmic Data Science module, can be used to measure the time taken to read the CSVs with and without the function.
import time
import numpy as np

def timefunc(function, arg, repeats=20):
    """
    Time 'function(arg)' over a number of repetitions and return the mean time taken.
    Unless specified, the number of repetitions will be 20.
    """
    alltime = []
    while repeats > 0:
        starttime = time.time()   # Record the start time.
        result = function(arg)    # Run the function; the result is kept in case it is needed.
        endtime = time.time()     # Record the end time.
        timetaken = endtime - starttime
        alltime.append(timetaken)
        repeats -= 1
    mean = np.mean(alltime)
    std = np.std(alltime)
    error = std / (len(alltime) ** 0.5)  # Standard error of the mean, available if needed.
    return mean
Without_ncols = timefunc(FuncReadCSV, datasets[60:62], repeats=1)
With_ncols = timefunc(LargeCSVsChop_MINI, datasets[60:62], repeats=1)
#LargeCSVsChop_MINI reads the CSVs and parses the YEAR_MONTH column, but does not filter the
#dataframe. This facilitates a like-for-like comparison.
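FuncReadCSV and LargeCSVsChop_MINI are not shown in this post; the sketches below indicate what they are assumed to do, purely to make the comparison easier to follow (the actual definitions are in the repository).
#Assumed definitions of the two helpers, for illustration only; they reuse the imports above.
def FuncReadCSV(listofCSV_URLs):
    #Baseline: read each CSV in full, with no column selection or date parsing.
    return pd.concat(pd.read_csv(url, sep=",") for url in listofCSV_URLs)

def LargeCSVsChop_MINI(listofCSV_URLs):
    #Same column selection and YEAR_MONTH parsing as LargeCSVsChop, but without the filtering step.
    feats = ['YEAR_MONTH', 'PCO_NAME', 'PRACTICE_NAME', 'CHEMICAL_SUBSTANCE_BNF_DESCR',
             'BNF_DESCRIPTION', 'TOTAL_QUANTITY', 'ACTUAL_COST']
    return pd.concat(
        pd.read_csv(url, sep=",", usecols=feats, parse_dates=['YEAR_MONTH'],
                    date_parser=lambda x: datetime.strptime(x, "%Y%m"))
        for url in listofCSV_URLs
    )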
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax1 = plt.subplots(figsize=(10, 6))
plt.xlabel('CSV Reading Methods')
plt.ylabel('Mean Time Taken to Read (minutes)')
plt.title('Comparison of Time Taken to Read Two CSVs')
plt.yticks(range(100))
plt.bar(height=Without_ncols/60, x='Without Function', color='red')
plt.bar(height=With_ncols/60, x='LargeCSVsChop_MINI', color='blue')
plt.savefig('Time taken')

I hope you have found this useful. I would appreciate feedback on my code. The link to my GitHub repository is here.
Articles that helped in this process are:
[1] Vincent Teyssier, Optimizing the size of a Pandas dataframe for low memory environment (2018), https://vincentteyssier.medium.com/optimizing-the-size-of-a-pandas-dataframe-for-low-memory-environment-5f07db3d72e
[2] B. Chen, 4 tricks you should know to parse date columns with Pandas read_csv() (2020), https://towardsdatascience.com/4-tricks-you-should-know-to-parse-date-columns-with-pandas-read-csv-27355bb2ad0e