An Introduction to SQLAlchemy in Python
Many people find SQL simpler and more readable than Pandas, which is part of why it remains so widely used, beyond the fact that so many existing systems are built on it. It’s fast, and it’s the language for talking to databases and extracting data from data warehouses. It’s the stuff of data science at scale!
In this article, I’ll walk through my notebook that roughly emulates the core workflow of this scaled up data science: ETL. ETL stands for Extract, Transform, and Load. While this example is a notebook on my local computer, if the database file(s) were from a source system, extraction would involve moving it into a data warehouse. From there it would be transformed using SQL queries. The final step would be loading the data into something like Python and Pandas to do machine learning and other cool stuff.
I used SQLAlchemy, connecting to a SQLite database under the hood. SQLAlchemy has its own classes and methods for building queries, but I wrote out raw SQL instead for readers who are more interested in seeing that or more familiar with it. Doing so also shows the logic of my queries, since SQLAlchemy’s query builder would abstract all of that away.
Setting Up
I downloaded the chinook.db database file here to get started. It comes as a zip archive, so I unzipped it and copied the database into my notebook’s working directory with the following shell command:
!cd ..\..\..\Downloads && powershell.exe -NoP -NonI -Command "Expand-Archive '.\chinook.zip' '.\chinook\'" && cd C:\Users\riley\Downloads\chinook\ && copy chinook.db C:\Users\riley\Documents\Coding\ETL
The command above will be different for you because your folders differ, but it’s easy to modify. It also invokes PowerShell, so it’s Windows-only. Still, it’s a quick and nifty way to unzip a file and move a copy of the contents to the target directory. The ‘!’ tells Jupyter to run that line as a shell command. Pretty neat!
“Extract”
With the database file in the working directory, I went ahead and wrote my import statements and created the engine that connects to the database using SQLite via the SQLAlchemy library.
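That setup might look like the sketch below. The connection URL assumes chinook.db sits in the notebook’s working directory, and the sanity-check query is my own addition, not part of the original notebook:

```python
from sqlalchemy import create_engine, text

# Create the engine. SQLAlchemy talks to SQLite through a connection URL;
# "sqlite:///chinook.db" points at the chinook.db file in the working directory.
engine = create_engine("sqlite:///chinook.db")

# Quick sanity check that the connection works
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()
print(result)  # 1
```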
Before I did anything, I looked up the table schemas diagrammed below. This gave me a sense of what variables were available and the relationships between tables. In SQL, relations and schema are predefined and must be declared before creating a table and subsequently populating it with data.
Next, I needed to define the metadata, which instantiates the objects that compose the tables. I also took a look at the columns of the ‘employees’ table, since that’s the one I was curious about starting off.
The code above returns the columns for the ‘employees’ table.
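A rough sketch of that reflection step is below. For a self-contained example it runs against a tiny in-memory stand-in for chinook.db; the real ‘employees’ table has more columns than this:

```python
from sqlalchemy import MetaData, create_engine, inspect, text

# In-memory stand-in for chinook.db; the real 'employees' table has more columns
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE employees ("
        "EmployeeId INTEGER PRIMARY KEY, LastName TEXT, "
        "FirstName TEXT, Title TEXT, HireDate TEXT)"
    ))

# Reflect the existing tables into MetaData objects
metadata = MetaData()
metadata.reflect(bind=engine)

# List the columns of the 'employees' table
inspector = inspect(engine)
columns = [col["name"] for col in inspector.get_columns("employees")]
print(columns)  # ['EmployeeId', 'LastName', 'FirstName', 'Title', 'HireDate']
```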
Getting down to it, I called my first query of the ‘employees’ table and took a look at the result.
SELECT * FROM table_name is basically the hello world of SQL. You can think of it as df. I don’t recommend running it without a LIMIT n clause (equivalent to df.head(n)) though, where n is the number of rows to return. This will protect you and your computer in case your table is gigantic.
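Here’s what that first query looks like in practice, sketched against a toy in-memory table with made-up names standing in for the real ‘employees’ table:

```python
from sqlalchemy import create_engine, text

# Toy 'employees' table standing in for the real one in chinook.db
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE employees (EmployeeId INTEGER PRIMARY KEY, FirstName TEXT)"))
    conn.execute(text("INSERT INTO employees (FirstName) VALUES ('Andrew'), ('Nancy'), ('Jane')"))

# SELECT * with a LIMIT: the safe version of the hello world query
with engine.connect() as conn:
    rows = conn.execute(text("SELECT * FROM employees LIMIT 2")).fetchall()
print(rows)  # at most two rows, no matter how big the table is
```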
To look at the other tables in the database, I called inspector.get_table_names(), which returns a list. It’s a handy way of seeing what’s available to explore.
I became interested in a few things based on these data:
- How many employees are there?
- How many customers did each sales rep help?
- Is there an association between how senior a sales rep is at the company, and how many customers they helped?
- How can I utilize the information in different tables to answer another question: how many minutes of music were purchased, aggregated by country?
Transform
The first question is answered quite easily. I didn’t limit the output of SELECT * FROM employees, so it showed me all of them: there are clearly 8 entries. But that was only easy because of the table’s size. What if there were 3,000,000 employees? The code below counts the employees regardless of table size.
To answer my second question, I looked at the ‘SupportRepId’ column in the ‘customers’ table, counted the rows, and grouped them to see how many customers there were per sales rep.
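The grouped count might look like the sketch below. Note the column is named ‘SupportRepId’ in the chinook schema; the toy counts here (3, 2, 1) stand in for the real ones:

```python
from sqlalchemy import create_engine, text

# Tiny stand-in for the 'customers' table; the real counts differ
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE customers (CustomerId INTEGER PRIMARY KEY, SupportRepId INTEGER)"))
    conn.execute(text("INSERT INTO customers (SupportRepId) VALUES (3),(3),(4),(5),(4),(3)"))

# Customers helped, grouped per sales rep
with engine.connect() as conn:
    counts = [tuple(row) for row in conn.execute(text(
        "SELECT SupportRepId, COUNT(*) AS n_customers "
        "FROM customers GROUP BY SupportRepId "
        "ORDER BY n_customers DESC"
    ))]
print(counts)  # [(3, 3), (4, 2), (5, 1)]
```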
Employee 3 helped 21 customers, employee 4 helped 20, and employee 5 helped 18. To answer my third question, I next examined the employees’ hire dates.
To find the most senior employee, I wrote the query below. On a table this small it’s unnecessary, but again I wrote it out because that’s how to scale!
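One way to write that query is to sort by hire date and take the top row, as sketched here with made-up names and dates standing in for the real table:

```python
from sqlalchemy import create_engine, text

# Made-up employees and hire dates standing in for the real table
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE employees (EmployeeId INTEGER PRIMARY KEY, FirstName TEXT, HireDate TEXT)"))
    conn.execute(text(
        "INSERT INTO employees (FirstName, HireDate) VALUES "
        "('Andrew', '2002-08-14'), ('Nancy', '2002-05-01'), ('Jane', '2002-04-01')"
    ))

# Most senior employee = earliest hire date
with engine.connect() as conn:
    senior = conn.execute(text(
        "SELECT FirstName, HireDate FROM employees ORDER BY HireDate ASC LIMIT 1"
    )).fetchone()
print(tuple(senior))  # ('Jane', '2002-04-01')
```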
It seems that there is some kind of linear relationship, at least on this very limited data set. As tenure lengthens, so does the number of customers helped. This is obviously not enough information to infer skill — the longer someone works somewhere, the more [insert task here] they will complete.
Relating it all together
I wanted to answer a more complex/fun question, and write the equally more complex SQL query to do so. I also wanted to use the data from different tables, since there are so many to choose from. To do that, I needed to join the tables.
In a relational database, there are, well, relations between tables. The ‘customers’ table has a primary key ‘CustomerId’. This is a unique identifier for each customer. This is important because while there could be more than one Joe, there wouldn’t be more than one ‘CustomerId’ number 2355.
With that principle in place, I began to tackle the challenge of examining how many minutes of music were sold to each country.
The ‘tracks’ table has song lengths in milliseconds. The ‘customers’ table tells us the countries.
I joined the ‘tracks’ table to the ‘invoice_items’ table, which share a common key: ‘TrackId’. ‘TrackId’ is the primary key of the ‘tracks’ table and a foreign key in the ‘invoice_items’ table. Because the tables share that column, they can be merged with respect to it.
Here’s a nice Medium article detailing joins further if you like.
With that in mind, I joined ‘tracks’ to ‘invoice_items’, ‘invoice_items’ to ‘invoices’, and ‘invoices’ to ‘customers’ using the foreign keys I needed for each.
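The chained joins might look like the sketch below. To keep it self-contained it builds minimal in-memory stand-ins for the four chinook tables, so the toy totals (7.0 and 3.0 minutes) are not the real ones:

```python
from sqlalchemy import create_engine, text

# Minimal stand-ins for the four chinook tables involved in the join
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Milliseconds INTEGER)"))
    conn.execute(text("CREATE TABLE invoice_items (InvoiceId INTEGER, TrackId INTEGER)"))
    conn.execute(text("CREATE TABLE invoices (InvoiceId INTEGER PRIMARY KEY, CustomerId INTEGER)"))
    conn.execute(text("CREATE TABLE customers (CustomerId INTEGER PRIMARY KEY, Country TEXT)"))
    conn.execute(text("INSERT INTO tracks VALUES (1, 180000), (2, 240000)"))
    conn.execute(text("INSERT INTO invoice_items VALUES (1, 1), (1, 2), (2, 1)"))
    conn.execute(text("INSERT INTO invoices VALUES (1, 1), (2, 2)"))
    conn.execute(text("INSERT INTO customers VALUES (1, 'USA'), (2, 'Canada')"))

# Chain the joins on the shared keys, then aggregate minutes per country
query = """
SELECT c.Country, SUM(t.Milliseconds) / 60000.0 AS Minutes
FROM tracks t
JOIN invoice_items ii ON t.TrackId = ii.TrackId
JOIN invoices i ON ii.InvoiceId = i.InvoiceId
JOIN customers c ON i.CustomerId = c.CustomerId
GROUP BY c.Country
ORDER BY Minutes DESC
"""
with engine.connect() as conn:
    minutes_by_country = [tuple(row) for row in conn.execute(text(query))]
print(minutes_by_country)  # [('USA', 7.0), ('Canada', 3.0)]
```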
3,277 minutes of music were sold to 494 Americans! Cool! 🎵🎶
Load
With the transformation aspect done, I went ahead and loaded the result as the final step. Once that’s done, you can dive into the more complex analyses that Pandas and the Python ecosystem are famous for!
The code is the same query, passed as the first argument to Pandas’ pd.read_sql_query() function. And finally, here’s the output of calling df.head():
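The load step can be sketched like this; a tiny in-memory table stands in for chinook.db here, and with the real database you would pass the full join query instead of the SELECT shown:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Tiny stand-in table; with chinook.db you would pass the full join query instead
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE customers (CustomerId INTEGER PRIMARY KEY, Country TEXT)"))
    conn.execute(text("INSERT INTO customers (Country) VALUES ('USA'), ('Canada')"))

# The query string goes in as the first argument, the engine as the second
df = pd.read_sql_query("SELECT * FROM customers", engine)
print(df.head())
```

From here the result is an ordinary DataFrame, ready for whatever Pandas analysis comes next.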
Conclusion
I gave a brief overview of ETL (Extract, Transform, and Load) and its role in the big data world. This was a walkthrough of my code, with explanations of key SQL concepts sprinkled in. I worked with SQLAlchemy for Python, which offers its own abstracted classes and methods, so queries written that way wouldn’t look quite the same as the raw SQL here. You can see that syntax on the documentation page.
The full code for this project is available on my GitHub.
I hope you enjoyed the read!
If you learned something new and would like to pay it forward to the next learner, consider donating any amount you’re comfortable with, thanks!
Happy coding!
Riley