DATA ENGINEERING

Creating a Database Front End with Deepnote

Here’s how I automated my reports with Python Notebooks in the cloud

Benjamin Dornel
Towards Data Science
6 min read · Apr 1, 2022


Photo by Sebastian Coman Photography on Unsplash

Handling reporting as a data professional is like being a chef in a kitchen. You have an extremely large amount of raw material that needs to be sourced, delivered, and transformed into meals for customers.

Except in this case, the raw material is data, and the output tends to be reports for business stakeholders. As you can imagine, doing this over and over with Excel and third-party platforms is repetitive and not fun at all.

Fortunately, a lot of the issues around repetitive, low-value reporting can be resolved with a mix of SQL and Python (e.g. pandas). We’re also seeing a massive push towards self-service reporting with products like Power BI and Tableau, with companies developing their data infrastructure to support this.

However, the main problem for a lot of these companies is that they fall flat at the crucial step of operationalizing their data.

Sure — they might have data sitting in a warehouse somewhere, but end users still need a flexible way to access that data as they see fit. If your users don’t have strong SQL skills and dashboards aren’t enough, a database front-end could be what you need.

In this article, I’ll be covering Deepnote and how I used it to create a database front-end for my company.

Enter Deepnote

If you haven’t heard of it before, Deepnote is a free collaborative Python notebook that runs in your browser — it’s pretty similar to Jupyter Notebook but can be run in the cloud without needing to install Python.

Deepnote (Source)

You can go beyond Deepnote’s basic SQL query functionality by using a Python client library to connect directly to data warehouses like Google BigQuery, with a code snippet like this:

import json
import os
from google.oauth2 import service_account
from google.cloud import bigquery

# Load the service account key stored in a Deepnote environment variable
bq_credentials = service_account.Credentials.from_service_account_info(
    json.loads(os.environ["INTEGRATION_NAME_SERVICE_ACCOUNT"])
)
client = bigquery.Client(
    credentials=bq_credentials, project=bq_credentials.project_id
)
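
Once the client is set up, a query can be run and loaded straight into a pandas DataFrame. A minimal example (the project, dataset and table names below are placeholders, not an actual schema):

# Run a query against BigQuery and load the results into pandas
# (the project/dataset/table name is a placeholder for illustration)
query = """
    SELECT campaign_name, SUM(spend) AS total_spend
    FROM `my_project.marketing.facebook_ads`
    GROUP BY campaign_name
"""
df = client.query(query).to_dataframe()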

Deepnote also has a GitHub integration, meaning that you can easily create and store functions that can be re-used within multiple notebooks. This makes it a lot easier to automate reporting processes for your users.
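
For example, assuming the linked repository is cloned into the project’s file system (the path and module names below are hypothetical), shared helpers can be imported like any other module:

import sys

# Hypothetical path to the cloned GitHub repo inside the Deepnote project
sys.path.append("/work/reporting-utils")

# Re-usable helpers versioned in GitHub, importable across notebooks
from pipeline import Pipeline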

All of this put together allowed me to create a Python library (or “Ramsey” as I like to call it) to automate reporting by pulling data from our various marketing platforms.

No complicated SQL queries — just fill in the parameters, add in a few optional lambda expressions, then hit execute and download.

Automated data extraction & transformation with Ramsey

For my particular use-case (basically the GIF above), data was being pulled from various platforms into a data warehouse, but wasn’t actually being used to help with reporting. I took the additional step of creating a front-end for my team to easily extract, transform and download data.

Reporting process flowchart

Implementing the Front End

The front-end above looks deceptively simple, but there is in fact a lot going on under the hood.

The front-end is built around a parent class called Pipeline, which holds a SQL query template with parameters that can be swapped out depending on the table being used.

Parent class with SQL template

The Pipeline class also holds the function that generates the final SQL query, which looks something like this —

Generating SQL code with Python
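
Since the original screenshot isn’t reproduced here, here is a minimal sketch of what such a parent class could look like. Everything other than the Pipeline class, its SQL template, and generate_query() is an assumption made for illustration:

import pandas as pd
from google.cloud import bigquery


class Pipeline:
    """Parent class holding a parameterised SQL query template."""

    # Placeholder table name -- each child class overrides this
    table = None

    query_template = """
        SELECT *
        FROM `{table}`
        WHERE campaign_id = '{campaign_identifier}'
          AND date BETWEEN '{start_date}' AND '{end_date}'
    """

    def __init__(self, **params):
        self.params = params

    def generate_query(self) -> str:
        # Swap the user-supplied parameters into the SQL template
        # (str.format ignores extra keys such as output_file_type)
        return self.query_template.format(table=self.table, **self.params)

    def extract(self) -> pd.DataFrame:
        # Run the generated query and return the results as a DataFrame
        # (assumes BigQuery credentials are configured as shown earlier)
        client = bigquery.Client()
        return client.query(self.generate_query()).to_dataframe()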

If you’ve ever tried comparing data across platforms like Facebook and Twitter, you’ll realize that almost all the metrics and dimensions are named differently.

This necessitates quite a bit of cleaning and grouping of data. Accordingly, each marketing data provider (or table in BigQuery) is then represented by a child class, that contains various dictionaries that will be used to help rename or group dimensions and metrics.

Child class with specific dimensions and metrics
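
Again, only as a sketch building on the Pipeline class above: the renaming and grouping dictionaries below are made-up examples, since the real mappings depend on the warehouse schema.

class FacebookAdsPipeline(Pipeline):
    """Child class for the Facebook Ads table in BigQuery."""

    # BigQuery table backing this provider (placeholder name)
    table = "my_project.marketing.facebook_ads"

    # Map platform-specific column names onto the common reporting schema
    rename_map = {
        "adset_name": "ad_group",
        "spend": "cost",
    }

    # Dimensions used to group metrics during transformation
    group_by = ["date", "campaign_name", "ad_group"]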

In a nutshell, the FacebookAdsPipeline child class inherits the query template and generate_query() function from the Pipeline parent class. The user then creates a dictionary storing the necessary parameters, before creating and running the pipeline.

import pandas as pd
from pipeline import Pipeline
from platforms import FacebookAdsPipeline

# Creating a dictionary to pass into the pipeline
params = {
    "start_date": "2021-10-01",
    "end_date": "2021-12-24",
    "campaign_identifier": "123123123",
    "campaign_name": "2022_Q1_Campaign_Name",
    "output_file_type": "csv",  # Options are .csv / .xlsx
}

# Passing in parameters & extracting data via BigQuery SQL
fb = FacebookAdsPipeline(**params)
df = fb.extract()
df.to_csv("output.csv")

To avoid having to manually instantiate each child class, I created a grandparent class — Sandwich — that automatically detects and pulls data from each platform that was used by a marketing campaign.

This removed the need to keep track of which platform a marketing campaign was run on. Overall, it saved my team (well, honestly, me) countless hours of either copy-pasting SQL code or pulling data directly from each marketing platform.

Extracting data with the grandparent class

The data is extracted through the extract_all() function, which is stored in the Sandwich class. The function queries an aggregated table of top-level campaign data across all marketing platforms and returns a list of platforms on which a campaign is found to have run.

Pipelines for each provider are then instantiated, and the extract() method is individually run on each pipeline. Data is then grouped by marketing channel, and saved as a dictionary within the instantiated classes.

Lastly, the data is transformed with a transform_all() function and saved to an output directory as a CSV or XLSX file.
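
Piecing that description together with the sketches above, the grandparent class might look roughly like this. Only Sandwich, extract_all() and transform_all() come from the article; every other name, table and helper here is an assumption:

import os

from google.cloud import bigquery


class Sandwich:
    """Grandparent class that orchestrates pipelines across platforms."""

    # Maps platform names in the aggregated campaign table to pipeline classes
    # (FacebookAdsPipeline is from the article; others would be added here)
    PIPELINES = {
        "facebook_ads": FacebookAdsPipeline,
    }

    def __init__(self, **params):
        self.params = params
        self.data = {}

    def detect_platforms(self) -> list:
        # Query the aggregated top-level campaign table for the platforms
        # on which this campaign ran (table name is a placeholder)
        query = f"""
            SELECT DISTINCT platform
            FROM `my_project.marketing.all_campaigns`
            WHERE campaign_id = '{self.params['campaign_identifier']}'
        """
        client = bigquery.Client()
        return client.query(query).to_dataframe()["platform"].tolist()

    def extract_all(self):
        # Instantiate a pipeline per detected platform and run its extract()
        for platform in self.detect_platforms():
            pipeline = self.PIPELINES[platform](**self.params)
            self.data[platform] = pipeline.extract()

    def transform_all(self, output_dir="output"):
        # Clean each platform's data and save it to the output directory
        # (CSV shown here; XLSX would use DataFrame.to_excel instead)
        os.makedirs(output_dir, exist_ok=True)
        for platform, df in self.data.items():
            df.to_csv(os.path.join(output_dir, f"{platform}.csv"), index=False)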

It’s all pretty complex, but that’s a consequence of designing this front-end with the user’s preferences in mind. I wanted to give users the option of using it as a low-code or high-code tool, depending on whether they want to use Excel or Python for data cleaning & transformation.

Front-end options: low code & high code

Preparing for the Future

To be honest, a lot of this could have been accomplished with SQL alone, but that brings back the issue of SQL code complexity and having to join a massive number of tables in a single query. It also doesn’t take into account the size of a fully aggregated table (probably several terabytes), which would be slow and expensive to query.

Hosting individual notebooks and bite-sized functions for different reports & projects is, in my opinion, more feasible compared to writing thousands of lines of SQL code.

I also wanted real-time collaboration, with the option for users to create new functions on the go. Lastly, the Python Notebook format paves the way for more advanced analytics with packages like scikit-learn, or more advanced visualizations with plotly/seaborn.

At the end of the day, there’s no single “right” way to automate reporting. Dashboards, Python Notebooks, and Excel/Google Sheets reports are all great to have. Having more options available benefits a wide spectrum of end users.

That’s it for this article! I hope it was insightful and helpful for you. If you have any feedback or thoughts, feel free to reach out through LinkedIn.
