A Simple Guide to Improving Your Skills in Scheduling Queries

Some basic principles and strategies you can follow when creating your scheduled queries.

Astandri Koesriputranto
Towards Data Science



In my daily job as a data professional, writing queries is one of the most common tasks you will face. When these queries are used repeatedly to produce reports, dashboards, or even training data, they need to be scheduled and managed efficiently.

I want to share some principles and strategies that you can use to create a scheduled query for your specific scenario. If you’re new to your role as a Data Analyst or Data Scientist, I hope this knowledge comes in handy.

Some Principles to follow

In my day-to-day activities, I have found that these principles help me a lot in writing and managing my scheduled queries. The habit doesn’t come overnight; start applying them little by little and make yourself comfortable with them.

Declare your variables/parameters

These can be start and end dates or timestamps, specific values to be used in filters, and so on. It’s good practice because you can change them quickly based on your needs.

Also, when you query from many tables, the same date filters may appear in multiple subqueries. With this approach you only need to change the parameters rather than edit the WHERE clause in ALL of the subqueries.

Start and end date variable/parameter declaration to be used in the query
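As a minimal sketch of this idea (assuming BigQuery scripting syntax and a hypothetical orders table; adapt to your own SQL dialect):

```sql
-- Assumed BigQuery scripting syntax; table name is hypothetical.
DECLARE start_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
DECLARE end_date   DATE DEFAULT CURRENT_DATE();

-- Every subquery can now reuse the same parameters in its filter.
SELECT
  order_id,
  amount
FROM `project.dataset.orders`
WHERE order_date BETWEEN start_date AND end_date;
```

To re-run the query for a different period, you only touch the two DECLARE lines instead of every WHERE clause.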

Work on small subset of data first

Compose your query using a small subset of the data first (if possible). For example, use data from the past day or the past 7 days, then create the full scheduled query based on it (following the 1st principle, you may only have to change the parameters 😃).

Use partitions!

The most common approach is partitioning by date. Furthermore, I have found it quite helpful to use the local date as the partition column in your scheduled query. Reports often need to be filtered on a certain period in your local time, and with a matching partition that filter saves both the time and the cost of running the query.

To find out more about partitions in the table you can visit this link: https://www.sqlshack.com/database-table-partitioning-sql-server/
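As an illustrative sketch (assuming BigQuery syntax; the table names and the Asia/Jakarta timezone are hypothetical), a date-partitioned table with a local-date column could look like this:

```sql
-- Assumed BigQuery syntax; names and timezone are examples only.
CREATE TABLE `project.dataset.daily_sales`
PARTITION BY date_local AS
SELECT
  DATE(created_at, 'Asia/Jakarta') AS date_local,  -- local-date partition column
  order_id,
  amount
FROM `project.dataset.orders`;

-- Filtering on the partition column scans only the matching partitions.
SELECT SUM(amount) AS total_amount
FROM `project.dataset.daily_sales`
WHERE date_local BETWEEN '2021-01-01' AND '2021-01-31';
```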

Avoid using tables created by other scheduled queries

If possible, always use the main table as the source in your SQL queries. It can be a table extracted directly from a transactional database, a data warehouse, a datamart, or any table that you and everyone else agree is the main source for the data you need.

If you use a table produced by another scheduled query, and keep doing it, you may lose track of changes and execution times, and it can be chaotic if some of the schedules fail and cause your dashboards to stop working.

Be clear, be clean

As in good programming practice, write clear SQL queries, add comments where necessary, and use descriptive variable names and aliases where possible. This will help you and your teammates understand what is going on in the query, and make it easier to revisit it for updates in the future.

Example of not so clear vs quite clear SQL queries
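As a small illustration of the difference (both queries and all table names are hypothetical):

```sql
-- Not so clear: cryptic aliases, positional grouping, no comments.
SELECT a.id, SUM(b.amt) FROM t1 a JOIN t2 b ON a.id = b.uid GROUP BY 1;

-- Quite clear: total spending per user, with descriptive names.
SELECT
  users.user_id,
  SUM(orders.amount) AS total_spending
FROM users
JOIN orders
  ON orders.user_id = users.user_id
GROUP BY users.user_id;
```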

Some Strategies you can use

After you create your query, it’s time to schedule it. If your query doesn’t need to produce a table, you can ignore these strategies. Usually, though, you will need to store the result in a table to be used by reporting tools, dashboards, or even basic Machine Learning pipelines.

Replace All Strategy

This is the simplest strategy: just run your query, then overwrite the destination table with the result. It can be the daily generated leads/sales list, the list of customers who churned during the previous month, etc.

I don’t think I need to explain more about this strategy; it’s quite obvious already.
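A one-statement sketch of the idea (assuming BigQuery syntax and hypothetical table names):

```sql
-- Replace-all: the destination table is fully overwritten on every run.
CREATE OR REPLACE TABLE `project.dataset.daily_leads` AS
SELECT
  lead_id,
  created_at,
  source
FROM `project.dataset.leads`
WHERE DATE(created_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```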

Simple Append Strategy

This strategy can be used when you need to continuously append new data to your table, for example for monitoring dashboards.

First, create your initial table as the base table, then append more data with your scheduled query based on your needs: daily, weekly, or monthly.

Simple append strategy (image by Author)

Just don’t forget to set the partition if needed 👍
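A minimal sketch of one appending run (assumed BigQuery syntax; table and column names are hypothetical):

```sql
-- Simple append: insert only yesterday's rows into the partitioned base table.
INSERT INTO `project.dataset.sales_history` (date_local, order_id, amount)
SELECT
  DATE(created_at) AS date_local,
  order_id,
  amount
FROM `project.dataset.orders`
WHERE DATE(created_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```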

Windowed Append Strategy

Sometimes you’re working with continuously updated data, such as data from event trackers or other systems where we expect delays in extraction from the data source, Firebase event data for example.

Let’s say you have already made a snapshot of yesterday’s data using a scheduled query and stored it in your table, but some updates arrive in the main source after your scheduled query has run, and you don’t know exactly when the data gets updated or inserted. This can make the data in your table invalid or incomplete, so the pieces of information produced in your reports become invalid as well.

In this kind of situation, it’s best to run your append strategy in a windowed manner. Because every run replaces the windowed data with newer versions, we reduce the possibility of invalid or incomplete data. You can set your window period based on what makes the most sense in your data environment; I usually set the window to 3 to 7 days for a daily scheduled query.

Windowed append strategy (image by Author)
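One way to sketch a windowed run is a delete-then-reinsert pair over the window (assumed BigQuery scripting syntax; names are hypothetical):

```sql
-- Windowed append: re-materialize the last 7 days on every run.
DECLARE window_start DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

-- 1. Drop the window from the destination table...
DELETE FROM `project.dataset.event_history`
WHERE date_local >= window_start;

-- 2. ...then re-insert it from the source, picking up late-arriving rows.
INSERT INTO `project.dataset.event_history` (date_local, event_id, user_id)
SELECT
  DATE(event_ts) AS date_local,
  event_id,
  user_id
FROM `project.dataset.raw_events`
WHERE DATE(event_ts) >= window_start;
```

Rows older than the window are assumed stable and are never touched, which keeps each run cheap.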

Calculate then Replace Strategy

In some use cases, you’ll need to calculate some numbers based on previous numbers, just like a running calculation.

For example, we want a continuously updating total weekly spending per user; based on this spending amount, we want to give special treatment to our customers.

This kind of problem can actually be solved by simply summing up all of each user’s spending from our daily/weekly append tables (from the previous strategy). A problem arises, however, when the data gets too big: such an aggregation query becomes costly to run.

Therefore, you can capture each user’s latest total weekly spending once as the base table, then gradually update the value by simply adding previous_total_spending + current_week_total_spending.

Calculate then replace strategy (image by Author)
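One way to express this incremental update is a MERGE that adds only the latest week’s spending to the stored totals (a sketch assuming BigQuery syntax; all names are hypothetical):

```sql
-- Calculate then replace: fold the latest week into the running totals.
MERGE `project.dataset.user_total_spending` AS base
USING (
  SELECT
    user_id,
    SUM(amount) AS weekly_spending
  FROM `project.dataset.orders`
  WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  GROUP BY user_id
) AS week
ON base.user_id = week.user_id
WHEN MATCHED THEN
  UPDATE SET total_spending = base.total_spending + week.weekly_spending
WHEN NOT MATCHED THEN
  INSERT (user_id, total_spending) VALUES (week.user_id, week.weekly_spending);
```

Each run scans only one week of orders instead of the whole history, which is the cost saving the strategy is after.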

Closing Thoughts

Working with queries is indeed one of the most common tasks a Data Professional will perform. As time goes by, we have to make our work more efficient so we can deliver results and insights faster. At the end of the day, we can start spending more time solving more and more business problems.

Some of the strategies above are based on my experience; there may be many more strategies to explore for other use cases. I’m very open to feedback, and if you want to add other strategies that I missed here, I’d appreciate it!

Finally, happy learning!!! 🚀
