An Introduction to SQLAlchemy in Python

Riley Predum
Towards Data Science
6 min read · Feb 10, 2019

SQL is simpler and more readable than Pandas, which is why many people use it, aside from its long legacy in existing systems. It's fast, too, and it's the language for talking to databases and extracting data from data warehouses. It's the stuff of data science at scale!

In this article, I'll walk through my notebook, which roughly emulates the core workflow of this scaled-up data science: ETL. ETL stands for Extract, Transform, and Load. While this example is a notebook on my local computer, if the database files came from a source system, extraction would involve moving them into a data warehouse. From there, the data would be transformed using SQL queries. The final step would be loading the data into something like Python and Pandas to do machine learning and other cool stuff.

I used SQLAlchemy to connect to a SQLite database. SQLAlchemy has its own set of classes and methods for running SQL queries, but I wrote out raw SQL instead, for readers who are more interested in or more familiar with that style. I also did this to show the logic of my queries, since SQLAlchemy would abstract all of that away.

Setting Up

I downloaded the chinook.db database file here to get started. It comes as a zip archive, so I unzipped it and copied it to my notebook's working directory with the following shell command:

!cd ..\..\..\Downloads && powershell.exe -NoP -NonI -Command "Expand-Archive '.\chinook.zip' '.\chinook\'" && cd C:\Users\riley\Downloads\chinook && copy chinook.db C:\Users\riley\Documents\Coding\ETL

The command above will be different for you because you have different folders, but it's easy to modify. It's also a PowerShell command, so it only works on Windows. But it's a quick and nifty way to unzip a file and move a copy of the contents to the target directory. The '!' tells Jupyter that the line is a shell command. Pretty neat!

“Extract”

With the database file in the working directory, I went ahead and wrote my import statements and created the engine that connects to the SQLite database via the SQLAlchemy library.

Import statements and connection to the database via sqlite
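Since the original code appeared as a screenshot, here is a minimal sketch of that cell, assuming chinook.db sits in the notebook's working directory:

import sqlalchemy as db
import pandas as pd

# Create an engine pointed at the local SQLite database file
engine = db.create_engine('sqlite:///chinook.db')
connection = engine.connect()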

Before I did anything, I looked up the table schemas diagrammed below. This gave me a sense of what variables were available and the relationships between tables. In SQL, relations and schema are predefined and must be declared before creating a table and subsequently populating it with data.

Chinook.db table schemas

Next, I needed to define the metadata object, which holds the table definitions reflected from the database. I also took a look at the columns of the 'employees' table, since that's the one I was curious about starting off.

Metadata instantiation and column inspection
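Roughly, that cell would look like this (a sketch; passing autoload_with tells SQLAlchemy to reflect the table's structure from the live database):

# Container for the reflected table definitions
metadata = db.MetaData()

# Reflect the employees table so its columns are known
employees = db.Table('employees', metadata, autoload_with=engine)

# List the column names
print(employees.columns.keys())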

The code above returns the columns for the ‘employees’ table.

Getting down to it, I called my first query of the ‘employees’ table and took a look at the result.

Connects to the engine, which is connected to the database, then returns the result of the query written in raw SQL.
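A sketch of that query cell (wrapping the raw string in text() keeps it compatible with newer SQLAlchemy versions):

# Run a raw SQL query and fetch every row
with engine.connect() as connection:
    results = connection.execute(db.text('SELECT * FROM employees')).fetchall()

print(results)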
The output of the code above for the employees in that table

The SELECT * FROM table_name is basically the hello world of SQL. You can think of it as printing an entire DataFrame: df. I don't recommend running it without a LIMIT n clause (equivalent to df.head(n)), where n is the number of rows to return. This will protect you and your computer in case your table is gigantic.
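For example, to pull just the first five rows:

# Fetch only 5 rows, mirroring df.head(5)
with engine.connect() as connection:
    print(connection.execute(db.text('SELECT * FROM employees LIMIT 5')).fetchall())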

To look at the other tables in the database, I called inspector.get_table_names(), which returns a list. It’s a handy way of seeing what’s available to explore.
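The creation of the inspector isn't shown above; a sketch, assuming SQLAlchemy's inspect() helper:

from sqlalchemy import inspect

# Build an inspector bound to the engine and list every table
inspector = inspect(engine)
print(inspector.get_table_names())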

I became interested in a few things based on these data:

  1. How many employees are there?
  2. How many customers did each sales rep help?
  3. Is there an association between how senior a sales rep is at the company, and how many customers they helped?
  4. How can I utilize the information in different tables to answer another question: how many minutes of music were purchased, aggregated by country?

Transform

The first question is answered quite easily. I didn't limit the output of SELECT * FROM employees, so it showed me all of them: there are clearly 8 entries. But knowing this was only easy because of the table's size. What if there were 3,000,000 employees? The code to count the number of employees regardless of table size is below.

Returns the number of employees in the employees table
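In raw SQL, that's a single COUNT; a sketch:

# COUNT scales to any table size without pulling every row
with engine.connect() as connection:
    print(connection.execute(db.text('SELECT COUNT(*) FROM employees')).fetchall())  # [(8,)]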

To answer my second question, I looked at the 'SupportRepId' column in the 'customers' table, counted the rows, and grouped by that column to see how many customers each rep had.

Returns the number of customers employees 3, 4, and 5 helped respectively
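A sketch of that query; the counts it returns match the output described below:

# Customers handled per support rep
query = db.text('''
SELECT SupportRepId, COUNT(CustomerId)
FROM customers
GROUP BY SupportRepId
''')
with engine.connect() as connection:
    print(connection.execute(query).fetchall())  # [(3, 21), (4, 20), (5, 18)]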

Employee 3 helped 21 customers, employee 4 helped 20, and employee 5 helped 18. To answer my third question, I next examined the employees' hire dates.

Returns the hire date and employee ID for employees 3–5 ordered in ascending order (note: the BETWEEN clause in SQL is inclusive of the lower and upper bound of the range given)
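A sketch of that query:

# Hire dates for employees 3 through 5; BETWEEN includes both bounds
query = db.text('''
SELECT EmployeeId, HireDate
FROM employees
WHERE EmployeeId BETWEEN 3 AND 5
ORDER BY HireDate ASC
''')
with engine.connect() as connection:
    print(connection.execute(query).fetchall())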
The output of the code above shows the difference in hiring date for employees 3–5

To find the most senior employee, I wrote the query below. On a table this small it’s unnecessary, but again I wrote it out because that’s how to scale!

Returns the hire date and employee ID of the most senior employee (the “smallest” hire date — which translates to furthest date back)
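A sketch; SQLite allows a bare column next to MIN(), returning values from the row that holds the minimum:

# The earliest hire date marks the most senior employee
query = db.text('''
SELECT EmployeeId, MIN(HireDate)
FROM employees
''')
with engine.connect() as connection:
    print(connection.execute(query).fetchall())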

It seems that there is some kind of linear relationship, at least in this very limited data set. As tenure lengthens, so does the number of customers helped. This is obviously not enough information to infer skill: the longer someone works somewhere, the more [insert task here] they will complete.

Relating it all together

I wanted to answer a more complex/fun question, and write the equally more complex SQL query to do so. I also wanted to use the data from different tables, since there are so many to choose from. To do that, I needed to join the tables.

In a relational database, there are, well, relations between tables. The ‘customers’ table has a primary key ‘CustomerId’. This is a unique identifier for each customer. This is important because while there could be more than one Joe, there wouldn’t be more than one ‘CustomerId’ number 2355.

With that principle in place, I began to tackle the challenge of examining how many minutes of music were sold to each country.

The ‘tracks’ table has song lengths in milliseconds. The ‘customers’ table tells us the countries.

I joined the 'tracks' table to the 'invoice_items' table, which shares a common key: 'TrackId'. 'TrackId' is the primary key of the 'tracks' table and a foreign key in the 'invoice_items' table. Because the tables share that column, they can be joined on it.

Here’s a nice Medium article detailing joins further if you like.

With that in mind, I joined 'tracks' to 'invoice_items', 'invoice_items' to 'invoices', and 'invoices' to 'customers' using the foreign keys I needed for each.

Returns a table of attributes drawn from several tables, the last column of which is the total length of music sold, in minutes, to each country
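A sketch of that three-join query; the exact columns selected in the original may have differed, and the millisecond-to-minute conversion (dividing by 60,000) is my assumption:

# Join tracks -> invoice_items -> invoices -> customers,
# then total up track length per country
query = db.text('''
SELECT c.Country, SUM(t.Milliseconds) / 60000 AS Minutes
FROM tracks t
JOIN invoice_items ii ON t.TrackId = ii.TrackId
JOIN invoices i ON ii.InvoiceId = i.InvoiceId
JOIN customers c ON i.CustomerId = c.CustomerId
GROUP BY c.Country
ORDER BY Minutes DESC
''')
with engine.connect() as connection:
    print(connection.execute(query).fetchall())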
Minutes of music sold to customers in the US, Canada, and so on

3,277 minutes of music were sold to 494 Americans! Cool! 🎵🎶

Load

With the transformation aspect done, I went ahead with the final step: loading the result into Pandas. Once that's done, you can dive into the more complex analyses that Pandas and the Python universe are famous for!

The code is the same query, passed as the first argument to Pandas' pd.read_sql_query() method.

Creates a Pandas dataframe from the SQL query (note: the engine.connect() argument must be included)
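A sketch, reusing the join query from above; pd.read_sql_query() accepts a SQLAlchemy text object plus a live connection:

# Load the query results straight into a DataFrame
df = pd.read_sql_query(query, engine.connect())
df.head()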

And finally here’s the output of calling df.head():

Back to the world of Pandas dataframes

Conclusion

I gave a brief overview of ETL (Extract, Transform, and Load) and its role in the big data world. This was a walkthrough of my code, with explanations of key SQL concepts sprinkled in. I worked in SQLAlchemy for Python, which has its own abstracted set of classes and methods, so the queries wouldn't look quite the same had I used those instead of raw SQL. You can see that syntax on the documentation page.

The full code for this project is available on my GitHub.

I hope you enjoyed the read!

If you learned something new and would like to pay it forward to the next learner, consider donating any amount you’re comfortable with, thanks!

Happy coding!

Riley
