13 ways to access data in Python

How to get data from local files, databases, APIs and data access libraries into your Python environment

John Micah Reid
Towards Data Science


Introduction

Most Python analysis starts by importing data into your environment. But what if that data is stuck in a database? Or behind an API? Or in a bunch of tiny files?

Luckily, Python is incredibly flexible and has a lot of open-source libraries for accessing and processing data. In this tutorial we’ll look at 13 methods for getting data into a pandas DataFrame, after which it can be cleaned, analyzed and visualized. We can group the methods into 4 main categories:

  1. Local files
  2. Databases
  3. APIs
  4. Dataset access libraries

The only major requirement is installing the pandas library:

$ pip install pandas

With that, let’s get cracking!

📁 Local files

Often the data you need is stored in a local file on your computer. Depending on where you’re running your Python environment, you can either specify the filename as a relative or absolute path:

# Absolute path
file1 = "/Users/johnreid/Documents/my_project/data/example.csv"
# Relative path, assuming current working directory is my_project
file2 = "./data/example.csv"

1. CSV files

CSVs are a popular choice for storing tabular data, and the simplest way to get started. Let’s assume you’ve downloaded this population dataset from Our World in Data:

import pandas as pd

csv_file = "/Users/johnreid/Downloads/population-by-country.csv"
df_from_csv = pd.read_csv(csv_file)
df_from_csv.info()

After importing the data, it’s helpful to run df.info() to understand how your data is structured e.g. how many rows, columns and non-null values you have. Running that code gives us the following output:

This method also works for files accessible by URL, like a public Google Sheet or a CSV file in a public GitHub repo. Also, if you keep getting a FileNotFoundError, try renaming your file to replace spaces with underscores, e.g. "Financial Sample.xlsx" becomes "Financial_Sample.xlsx".
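The same read_csv call works when the path is a URL. As a quick sketch (the URL below is just a placeholder, not a real dataset):

import pandas as pd

# Hypothetical raw CSV link - replace with the URL you actually want to read
csv_url = "https://raw.githubusercontent.com/some-user/some-repo/main/population-by-country.csv"
df_from_url = pd.read_csv(csv_url)
df_from_url.head()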

2. Excel files

You need to be a bit more cautious with Excel files, because they may contain more than one sheet of data and complex visual formatting e.g. extra header rows. Otherwise the syntax is pretty similar — here’s a financial data example:

import pandas as pd

excel_file = "/Users/johnreid/Downloads/Financial_Sample.xlsx"
df_from_excel = pd.read_excel(excel_file, sheet_name = "Sheet1")
df_from_excel.info()
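If the workbook has several sheets, or a few rows of decorative formatting above the real header, the sheet_name and skiprows arguments help. Here’s a sketch assuming a hypothetical sheet called "Monthly" whose data starts on the third row (note that reading .xlsx files also requires the openpyxl package):

import pandas as pd

excel_file = "/Users/johnreid/Downloads/Financial_Sample.xlsx"

# sheet_name=None reads every sheet into a dict of DataFrames keyed by sheet name
all_sheets = pd.read_excel(excel_file, sheet_name=None)
print(all_sheets.keys())

# skiprows drops extra header rows above the real column names (hypothetical layout)
df_monthly = pd.read_excel(excel_file, sheet_name="Monthly", skiprows=2)
df_monthly.head()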

3. Text files

Text files often need more data processing — start by looking at how the data is stored and how you’d like to represent it in Python. From there, you can write code to transform textual input into a dataframe. Let’s use a shopping list example, with each line containing an item and a quantity:

To convert that to a dataframe, you can run the following:

import pandas as pd

shopping_list = "/Users/johnreid/Downloads/shopping_list.txt"
results = []

with open(shopping_list) as f:
    line = f.readline()
    while line:
        results.append(line.strip().split(" "))
        line = f.readline()

df_from_textfile = pd.DataFrame(results, columns=["Item", "Quantity"])

We read the lines one by one, strip extra whitespace and split each line into two parts. When we create the dataframe, we also assign column names.
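If the file is consistently delimited (exactly one space between item and quantity, and no spaces inside item names), you can let pandas do the parsing instead. This is only a sketch under that assumption:

import pandas as pd

# Works only if items never contain spaces themselves
df_from_textfile = pd.read_csv(
    "/Users/johnreid/Downloads/shopping_list.txt",
    sep=" ",
    names=["Item", "Quantity"],
)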

4. Multiple files / folders

What happens if you need to extract data from multiple stored files? Let’s combine a couple of things that we’ve learned to extract data from the BBC Sport text dataset.

We have 5 subfolders, each with around 100 files. Each file starts with a headline, followed by the body of the article. Our goal will be to combine all these files into a single dataframe with ‘Title’, ‘Subtitle’, ‘Body’ and ‘Genre’ columns. The glob library really comes in handy here for listing all the matching filenames:

import glob
import pandas as pd

base_path = "/Users/johnreid/Downloads/bbcsport/"
genres = ["athletics", "cricket", "football", "rugby", "tennis"]

def read_and_split_file(filename):
    with open(filename, 'r', encoding="latin-1") as f:
        lines = f.readlines()                # Get lines as a list of strings
        lines = list(map(str.strip, lines))  # Remove \n characters
        lines = list(filter(None, lines))    # Remove empty strings
    return lines

def get_df_from_genre(path, genre):
    files = glob.glob(path + genre + "/*.txt")
    titles = []
    subtitles = []
    bodies = []
    for f in files:
        lines = read_and_split_file(f)
        titles.append(lines[0])              # First line is the title
        subtitles.append(lines[1])           # Second line is the subtitle
        bodies.append(' '.join(lines[2:]))   # Combine all the rest
    return pd.DataFrame({
        'genre': genre,
        'title': titles,
        'subtitle': subtitles,
        'body': bodies
    })

final_df = pd.concat([get_df_from_genre(base_path, g) for g in genres])
final_df

We use the * operator with glob to get all possible filenames ending in .txt. Note that you can concatenate multiple dataframes together using pd.concat. Running that code gives us the following output:
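As an aside, the standard-library pathlib module can do the same file listing if you prefer working with Path objects. A rough equivalent of the glob call above, for a single genre, would be:

from pathlib import Path

base_path = Path("/Users/johnreid/Downloads/bbcsport/")

# Equivalent to glob.glob(base_path + "athletics/*.txt")
athletics_files = sorted(base_path.glob("athletics/*.txt"))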

🗄 Databases

Most organizations store their business-critical data in a relational database like Postgres or MySQL, and you’ll need to know Structured Query Language (SQL) to access or update the data stored there. Databases have a number of advantages, like data normalization.

5. SQLite

SQLite is an embedded database that is stored as a single file, so it’s a great place to start testing out queries. Here we’ll show an example of connecting to a SQLite file of the Chinook database:

import pandas as pd
import sqlite3 as sql

conn = sql.connect('/Users/johnreid/Downloads/chinook.db')

# First pattern - turn query directly into dataframe:
df1 = pd.read_sql_query("SELECT * FROM invoice", conn)

# Second pattern - get row-level data, but no column names
cur = conn.cursor()
results = cur.execute("SELECT * FROM invoice LIMIT 5").fetchall()
df2 = pd.DataFrame(results)
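If you use the second pattern but still want column names, the DB-API cursor exposes them through cursor.description. A small sketch:

# cursor.description holds one tuple per column; the first element is the column name
column_names = [desc[0] for desc in cur.description]
df2 = pd.DataFrame(results, columns=column_names)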

If you’re curious, read my full tutorial on building an interactive dashboard using SQL here:

6. Remote databases

Connecting to a remote database like Postgres, Redshift, or SQL Server uses mostly the same syntax but requires access credentials. For security reasons, it’s best to store these credentials in a config file and load them into your Python script. You can create a separate config.py file with the following info:

host = "localhost"
database= "suppliers"
user = "postgres"
password = "SecurePas$1"

and then import it into your Python script as follows (you’ll also need the psycopg2 library):

import pandas as pd
import psycopg2

import config

conn = psycopg2.connect(
    host=config.host,
    database=config.database,
    user=config.user,
    password=config.password,
)
df1 = pd.read_sql_query("SELECT * FROM invoice", conn)

Make sure to keep your config.py file safe and don't upload it elsewhere - you can add it to your .gitignore to make sure it doesn't get included in git commits.
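A common alternative (shown here only as a sketch, not part of the original example) is to keep the credentials in environment variables and read them with os.environ; the variable names below are just examples:

import os

import psycopg2

# Assumes you've exported PGHOST, PGDATABASE, PGUSER and PGPASSWORD beforehand
conn = psycopg2.connect(
    host=os.environ["PGHOST"],
    database=os.environ["PGDATABASE"],
    user=os.environ["PGUSER"],
    password=os.environ["PGPASSWORD"],
)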

7. SQLAlchemy

If you want a more ‘pythonic’ way of querying a database, try the SQLAlchemy library, which is an Object-Relational-Mapper. It’s typically used for applications so that developers don’t have to write pure SQL to update their database, but you can use it for querying data too!

Here’s an example using the same Chinook music store database:

import pandas as pd
import sqlalchemy as db

engine = db.create_engine('sqlite:///chinook.db')
connection = engine.connect()
metadata = db.MetaData()
invoice = db.Table('invoice', metadata, autoload=True, autoload_with=engine)

# Get the first 10 invoices from the USA
query = (db.select([invoice])
         .filter_by(billing_country='USA')
         .limit(10))

df = pd.read_sql(query, engine)

In this code we connect to the database, then set up some tables & metadata in SQLAlchemy. Once that’s defined, we can write a query in a more ‘pythonic’ way and read the results directly to a Pandas dataframe. Running that code gives the following output:
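Note that the reflection and select syntax above is the older 1.x style. If you’re on SQLAlchemy 1.4+ or 2.0, a rough equivalent (assuming the same lowercase table and column names) looks like this:

import pandas as pd
import sqlalchemy as db

engine = db.create_engine('sqlite:///chinook.db')
metadata = db.MetaData()

# autoload_with replaces autoload=True / autoload_with in newer versions
invoice = db.Table('invoice', metadata, autoload_with=engine)

# select() takes the table directly instead of a list in 2.0-style code
query = (
    db.select(invoice)
    .where(invoice.c.billing_country == 'USA')
    .limit(10)
)

df = pd.read_sql(query, engine)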

📶 APIs

Sometimes you’ll need to access data from a particular platform your company uses, like Hubspot, Twitter or Trello. These platforms often have a public API that you can pull data from, directly inside your Python environment.

The basic idea is you send a request (which may include query parameters and access credentials) to an endpoint. That endpoint will return a response code plus the data you asked for (hopefully). You’ll need to look at the API documentation to understand what data fields are available. The data will usually be returned in JSON format, which allows for deeply-nested data.

8. Without credentials

Let’s do a minimal example using the OpenNotify API, which tracks all the people currently in space:

import pandas as pd
import requests

response = requests.get("http://api.open-notify.org/astros.json")
print(response.status_code)
print(response.json())

res = pd.DataFrame(response.json()["people"])
res.head()

Running that code gives us the following output:

The response code tells you the result of your API call — according to Dataquest the most common are:

  • 200: Everything went okay, and the result has been returned (if any).
  • 301: The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
  • 400: The server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.
  • 403: The resource you’re trying to access is forbidden: you don’t have the right permissions to see it.
  • 404: The resource you tried to access wasn’t found on the server.
  • 503: The server is not ready to handle the request.
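In practice you rarely check these codes by hand. The requests library can raise an exception for any 4xx/5xx response, for example:

import requests

response = requests.get("http://api.open-notify.org/astros.json")
response.raise_for_status()  # Raises requests.HTTPError for 4xx/5xx responses
data = response.json()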

9. With credentials & query parameters

Sometimes you may need more specific information from the API, or have to authenticate. There are several ways to do this, but one of the most common is adding URL parameters to your request.

Let’s assume we have a config.py file with our API key in it:

personal_api_key = "wouldntyouliketoknow"

Then we create a dictionary for all the parameters (this is a made-up example) and pass it in:

import config
import pandas as pd
import requests

url = "https://api.example.com/v1/data"  # Placeholder endpoint for this made-up example

parameters = {
    "personal_api_key": config.personal_api_key,
    "date": "2021-09-22"
}

response = requests.get(url, params=parameters)

print(response.status_code)
print(response.json())

# The key to extract ("people" here) depends on the API's response structure
res = pd.DataFrame(response.json()["people"])
res.head()
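Many APIs expect the key in a request header rather than in the URL. A hedged sketch of that pattern (the header name and Bearer convention vary by API, and the URL is again a placeholder):

import config
import requests

url = "https://api.example.com/v1/data"  # Placeholder endpoint

headers = {"Authorization": f"Bearer {config.personal_api_key}"}
response = requests.get(url, headers=headers, params={"date": "2021-09-22"})
print(response.status_code)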

If you don’t want to deal with JSON you can try searching for a Python library for that API — these are usually open-source and maintained by the company or third parties.

📚 Dataset access libraries

What if you need some reference data for a comparison or adding context? There are a bunch of libraries for downloading public datasets straight into your environment — think of it as pulling from APIs without having to manage all the extra complexity.

10. Pandas_datareader

Pandas_datareader is a great way to pull data from the internet into your Python environment. It is particularly suited to financial data, but also has some World Bank datasources. To get Zoom’s daily share price over the past few years, try the following:

import datetime as dt

from pandas_datareader import data

zm = data.DataReader(
    "ZM",
    start='2019-1-1',
    end=dt.datetime.today(),
    data_source='yahoo'
).reset_index()

zm.head()

Running that code gives us the following output:
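The library also wraps some World Bank data. Here’s a hedged sketch using its wb module (SP.POP.TOTL is the World Bank indicator code for total population; the country list is arbitrary):

from pandas_datareader import wb

# Total population for a few countries, 2010-2020
pop = wb.download(
    indicator="SP.POP.TOTL",
    country=["US", "GB", "KE"],
    start=2010,
    end=2020,
)
pop.head()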

11. DataCommons

Datacommons is a project by Google providing access to standardized and cleaned public datasets. The underlying data is represented in a graph format, making it really easy to query and join data from many different data sources, e.g. the US Census, World Bank, Wikipedia, the Centers for Disease Control and more. Here’s a basic example:

!pip install datacommons datacommons_pandas --upgrade --quiet

import datacommons_pandas as dc
import pandas as pd

city_dcids = dc.get_property_values(["CDC500_City"], "member", limit=500)["CDC500_City"]

cdc500_df = dc.build_multivariate_dataframe(
    city_dcids,
    [
        "Percent_Person_Obesity",  # Prevalence of obesity from CDC
        "Median_Income_Person",
        "Median_Age_Person",
        "UnemploymentRate_Person",  # Unemployment rate from BLS
        "Count_Person_BelowPovertyLevelInThePast12Months",  # Persons living below the poverty line from Census
        "Count_Person",  # Total population from Census
    ],
)
cdc500_df.info()

Running that code gives us the following:

If you want to learn how to use DataCommons, read my full tutorial here:

12. PyTrends (Google Trends)

PyTrends is an unofficial but useful library for querying Google Trends data — here’s a simple example:

import pandas as pd
from pytrends.request import TrendReq

pytrends = TrendReq()
keywords = ["oat milk", "soy milk", "almond milk"]
pytrends.build_payload(keywords, cat=0, geo='', gprop='')  # Get data from the last 5 years

top_queries = pytrends.interest_over_time()[keywords]
top_queries.head()

Running that code gives us the following output:

13. Kaggle

Kaggle is a data science community that hosts a lot of datasets and competitions for learning Python. You can download some of these datasets to play around with through their command-line interface (note: you’ll need to sign up for a Kaggle account). For example, to download some Zillow economics data, we can run the following commands in our terminal (Jupyter users: replace the $ with ! in your Python code):

$ pip install kaggle
$ export KAGGLE_USERNAME=datadinosaur
$ export KAGGLE_KEY=xxxxxxxxxxxxxx
$ kaggle datasets download zillow/zecon
$ unzip zecon.zip

This will download a zipped file of the datasets, and then uncompress them. From there, you can open them as local files with Pandas:

import pandas as pd

csv_file = "/Users/johnreid/Downloads/Zip_time_series.csv"
df_from_csv = pd.read_csv(csv_file)
df_from_csv.info()

To learn more, check out the Kaggle API documentation.
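If you’d rather stay inside Python than shell out to the CLI, the kaggle package also exposes a Python API. A sketch, assuming your credentials are already configured (e.g. in ~/.kaggle/kaggle.json or via the environment variables above):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Download and unzip the same Zillow economics dataset into ./data
api.dataset_download_files("zillow/zecon", path="data", unzip=True)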

Conclusion


You made it! Now you can use your newfound powers to access multiple data sources and join them together with pd.merge or pd.concat, then visualize them with an interactive library like Altair, Pandas or Folium.

Are there any methods I missed? Let me know in the comments.
