The world’s leading publication for data science, AI, and ML professionals.

Data Integration with Pandas

How to easily import Data

Photo by Dušan veverkolog on Unsplash
Photo by Dušan veverkolog on Unsplash

For adhoc data analysis via Jupyter Notebook you often need external data via CSV, Excel or database to process them with Python. Reading the data is relatively easy thanks to the handy Pandas library. I have summarized the standard procedures for you in this article. Of course, you can also use the same errors for later standardized and integrated Data Integration processes, such as an ETL for your Data Warehouse.

Main Use Cases

Here is my selection of commands and sources that I use most often. If you want to try it out as well, you can simply use Jupyter Notebook Online – Click here [1].

Read from a CSV File

A very typical first use case is of course the famous CSV file, which was either provided to you as a dump of a source system, and can be retrieved online or was automatically stored on a file folder.

import Pandas as pd
pd.read_csv('yourfilename.csv', header=None, nrows=5)

After importing Pandas and **** importing the file via _read_cs_v you will already see success:

Imported Data - Image by Author
Imported Data – Image by Author

Read Data from Google Sheet

Same like above but with a Google Sheets Link – be aware that you published the Sheet to the web before. (Go to file →publish to web [2])

import pandas as pd
new_google_sheet_url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8lv36OQPiqgfFjdgJF9clvwhhh2Ao2xDzkIzYp2g1CqBbpLsnW5nefPsPVOKKBX2PdFVX6Mweyyut/pub?gid=1696994381&single=true&output=csv'
df=pd.read_csv(new_google_sheet_url)

Et voilà:

Imported Data - Image by Author
Imported Data – Image by Author

Read Data from a Database

Beside CSV and other similar files you will often connect data to a database – here we need beside Pandas also the sqlite3 module[3].

import pandas as pd
import sqlite3

After the successful import of the data, we now can easily query tables via SQL String:

# Read via SQLite databases
con = sqlite3.connect("your.database.link")
#Read table via Select Statement
player = pd.read_sql_query("SELECT * from Table", con)
#close the connection
con.close()

Other possible Data Sources

In addition to the standard data sources mentioned above, there are of course many other possible data sources, such as data warehouse technologies like Google Big Query and Amazon’s Redshift or even NoSQL databases. Here, you will often find downloadable Pythons libraries [4].

Summary

Whether for an adhoc data analysis via e.g. Jupyter Notebook or later for standardized and automated data integration processes, external data sources like files or databases are often needed. These can be easily connected and queried with the toolset Python, Pandas and possibly other libraries. In this article, the most commonly used data sources were mentioned as an overview, further data sources can usually be accessed with other third-party libraries.

Sources and Further Readings

[1] Jupyter.org, Mainpage (2021)

[2] zhukovgreen, Stackoverflow (2016)

[3] pythoncentral.io, Introduction to SQLite in Python (2013)

[4] Google, Downloading BigQuery data to pandas using the BigQuery Storage API (2021)


Related Articles