
Databases: How to Get Better Performance from Your Data Pipeline

A relational database can be one of your biggest allies in wrangling data. Let’s move off the beaten path to learn how.


Person trying to figure out why their query is running slow – Photo by cottonbro from Pexels

I’ve worked with developers and engineers for a long time, and there are a few commonalities I’ve come to expect and count on. One of the more consistent truisms I’ve observed is that a kind of tunnel vision tends to develop. By that I mean, if a developer is a front-end person, they try to solve everything on the front end. If the developer is more middleware focused, they want to use that layer for everything. If a developer is "full stack", they revert to their area of greatest comfort.

This isn’t a bad thing, per se. It’s just the modern application of the old maxim that if all you have is a hammer, everything looks like a nail. But, we don’t just have to be limited to hammers. We can add more tools to the proverbial toolbox.

One thing that tends to be lacking in software is good old-fashioned relational database fluency. It’s like the 10mm socket of our toolbox: we probably have something in the box that will work in a pinch and get us by, but it’s not the perfect tool for the job.

So, before I run off the rails with the analogies, let’s dig in.

Note: For this discussion, I’ll be in a mental state of thinking about PostgreSQL. I like Postgres quite a bit. I encourage everyone to gain some fluency with it. However, most of these concepts will transfer to other database engines to varying degrees.

Queries

If you are a software or data person and don’t know the basic SQL query set (SELECT, INSERT, UPDATE, DELETE) then stop here and go brush up. Come back when you have that understanding. I promise, I won’t pull the article.
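If you need a quick refresher, the core four look something like this (the customers table and its columns are hypothetical stand-ins):

```sql
-- Read rows
SELECT id, name, signup_date FROM customers WHERE signup_date >= '2021-01-01';

-- Add a row
INSERT INTO customers (name, signup_date) VALUES ('Ada', '2021-03-14');

-- Change existing rows
UPDATE customers SET name = 'Ada Lovelace' WHERE id = 1;

-- Remove rows
DELETE FROM customers WHERE id = 1;
```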

So many people involved with data seem to have a narrow view of the relational database. This includes more than a few people tasked with data management and even some database administrators I’ve met along the way.

Sure, their SQL-fu is strong. They know when to use a left join over a natural join. They might even have most of the syntax for a UNION memorized. But there is so much more.

If you are in the realm of Data Science (data scientist, data analyst, data engineer, etc.), there is no more powerful tool in your productivity tool box than becoming proficient with your database.

That’s a bold statement – I know. I’m sure the ire of the middleware gods is forging some thunderbolts now to strike me down. But, here’s my rebuttal:

I took my first computer class back in 1985 and one of the first things that teacher taught us was GIGO. Garbage in, garbage out. Since that day so long ago, I’ve been fighting with garbage on both ends. As have most people working with data.

The database gives you a place to control the flow of garbage. It can pre-process data on the way in and refine it on the way out. That capability, by its very nature and design, spares you much of the wrangling we so often do after the data has been retrieved for analysis.

Imagine if a few tricks could make your life easier. Isn’t that worth knowing?

Concept

Way back when, I was a database admin working at a very small dev shop. We had very limited resources and it was quite a long time before AWS was even a thing.

That kind of scarcity forces you to be more creative to get the most mileage out of the resources you have. This situation was no different.

At the time, we were pushing a lot of processing and front-end transformations through our application servers, and some of that work was driving the load on those machines to a concerning level.

One day, as we stepped back to look at the bigger picture, we realized that while our application servers were being hammered, our database server was as cool as the other side of the pillow. It’s obvious to say now, but in the heat of the moment we often forget that databases tend to be IO-bound rather than CPU-bound.

That sparked an idea: what if we offloaded some of our more difficult transformations to the database and had the data come back closer to a finished product, rather than doing that work in application logic?

Almost 20 years later, here I am writing this article.

Views

How many of you use views? You can be honest, I won’t tell anyone.

Many people have the understanding that a view is simply a stored query. Happily, they aren’t wrong, but that only goes so far.

One of the big advantages of a view isn’t just that a complex query can be stored and referenced easily; it’s that the parsed, rewritten query definition lives inside the database, ready for the planner to fold into whatever statement references it. What’s the query plan, you say?

Behind every database engine is the magic sauce that makes it what it is. That really comes down to the planner, which works out the most efficient way to retrieve the data you asked for. So, for each query executed, the planner has to evaluate what’s being asked and calculate an optimal path.

As you might imagine, complex queries can take a not-insignificant amount of computational effort to parse, rewrite, and plan. Keeping the view’s definition inside the database means each query against it picks up where that work leaves off, so results come back with little friction beyond the data retrieval itself.

Speed is our friend.


Views are also a wonderful way to restructure data to better fit our needs without resorting to huge queries articulated in analysis code. Given a level of effort, we can create entire layers of new data formats for access. Apply some permissions to each view and you have the makings of a very robust interface that lives almost at the metal.
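As a minimal sketch, assuming hypothetical customers and orders tables and an analyst_role, a reshaped and permissioned layer might look like this:

```sql
-- Reshape raw tables into an analyst-friendly view
CREATE VIEW customer_orders AS
SELECT c.id   AS customer_id,
       c.name AS customer_name,
       o.ordered_at,
       o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id;

-- Expose only the view, not the underlying tables
GRANT SELECT ON customer_orders TO analyst_role;
```

Analysts can then query customer_orders like any other table, without ever writing the join or touching the base tables.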

If you are working with junior analysts or front-end developers who have limited experience with SQL not served up by a framework, this can be the difference between sanity and insanity.

Functions

I love database functions. They are often some of the best avenues for processing data, but they are also widely misunderstood.

When we think of databases, we tend to think in a contained manner. We are trained to conceptualize transactional functions. A select query returns a result set. An update modifies existing rows. An insert adds new rows. A delete removes them.

We usually think of functions in this manner as well: text manipulation functions that do things to data, aggregate and math functions that work on retrieved data.

But, there are two key concepts that are game changers when we talk about user-defined functions:

  1. Functions can return arbitrary sets of data. Have a single result – the function can return it. Have a row of data – a function can return it. Need a function that will join all your tables and return the whole database in one go? Yep, a function can do that. But don’t do that – for real.
  2. Functions can perform a logical, programmatic flow. Just like you might do with one or more dataframes, your function can pull from various tables, without regard to joins, and perform whatever operations you need. Then you can do whatever you want with that data (see the sketch just after this list).
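As a minimal sketch of both ideas at once, assuming hypothetical customers and orders tables, a set-returning PL/pgSQL function can pull from several tables, apply some logic, and hand back a table-shaped result:

```sql
-- Assumes customers(id int, name text) and orders(customer_id int, total numeric)
CREATE OR REPLACE FUNCTION high_value_customers(min_total numeric)
RETURNS TABLE (customer_id int, customer_name text, lifetime_value numeric) AS $$
BEGIN
    -- Pull from several tables, aggregate, and filter in one pass;
    -- callers can treat the result like any other table.
    RETURN QUERY
    SELECT c.id, c.name, sum(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    HAVING sum(o.total) >= min_total;
END;
$$ LANGUAGE plpgsql STABLE;

-- Called like a table:
-- SELECT * FROM high_value_customers(1000);
```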

Think about that for a second and what that means. There is a HUGE amount of preprocessing you can perform internally using functions. And you’re not just limited to SQL.

For Postgres, there is certainly the plain SQL function, which is great for simple tasks. Then there is PL/pgSQL, Postgres’s own take on Oracle’s PL/SQL. But there is also support for functions written in Python, Perl, Java, PHP, and a whole host of other languages.
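For instance, assuming the plpython3u extension is installed on your server, a Python-bodied function is just another CREATE FUNCTION statement (the function and its cleanup rule here are invented for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION clean_email(raw text)
RETURNS text AS $$
    # The body is plain Python: normalize an email address
    if raw is None:
        return None
    return raw.strip().lower()
$$ LANGUAGE plpython3u;

-- SELECT clean_email('  Ada@Example.COM ');  -- returns 'ada@example.com'
```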

Given the breadth of language libraries and capabilities, there is no excuse for anyone in the data field not to dive in. It can make your life easier. It also carries major nerd cred among those poor engineers stuck in front of an ORM.

Triggers

Functions are great when you run them, but they are even better when they run themselves on events. That’s our friend the trigger.

A trigger, in its simplest form, is a function that doesn’t return data but performs a task. That function is bound to an object and given conditions for when it should run.

Bad data be gone!

I’ve had quite a few occasions where I was stuck dealing with data that was annoyingly inconsistent. The problems are usually difficult to catch on intake, but simple to fix once the data is in the database.

For example: I have a data collection process that runs. For some reason, it occasionally spews data that should only go into one table into another table. Sounds weird, but I think it’s a timing issue on the system running it.

Now, I could trace it down and figure out what’s going on. That would be the responsible thing to do. However, I know it will be at least an hour of time to figure it out and it’s easily fixed in the database layer. So, I created a trigger and moved on with my life.

My trigger simply runs a query to see if the insert process messed up and if so, it isolates the offending records and removes them. This occurs after insert and the bad data is never seen.
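A sketch of that kind of cleanup is below. The events table, the payload_type column, and the rule for what counts as misrouted are hypothetical; the shape is what matters.

```sql
CREATE OR REPLACE FUNCTION scrub_misrouted_rows()
RETURNS trigger AS $$
BEGIN
    -- Rows that belong in another table are quietly removed
    DELETE FROM events WHERE payload_type = 'sensor_reading';
    RETURN NULL;  -- the return value is ignored for AFTER statement-level triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER scrub_after_insert
AFTER INSERT ON events
FOR EACH STATEMENT
EXECUTE FUNCTION scrub_misrouted_rows();
-- On PostgreSQL versions before 11, write EXECUTE PROCEDURE instead
```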

Total time to fix was about 10 minutes, which included refilling my coffee.

But think about it. The power of a function married with the power of event-driven execution. I get chills thinking about the possibilities and I’ve been doing this forever. I can only hope you are catching on to why diving into the database realm is far from a waste of time.

Even More Advanced

I guess it’s the strategic side of my brain, but when I design programs and processes, I always try to ensure I have a path forward. This also applies to database work.

It’s not uncommon for a quick database to be created that has a limited application or lifespan. However, it often winds up growing legs and becoming a long-term resident of your data lake. Inevitably, over time it continues to grow and your adolescent database is no longer as quick and flexible as it used to be.

Growth tends to slow performance.

This is why it’s critical to understand the performance path. How can you rejuvenate your lumbering database?

Do you know about EXPLAIN? It’s a means of running your query as a what-if to see what the planner thinks about it. It will show the process the system goes through to get to the data and return it. But, it’s just a hypothetical. It’s quick, but doesn’t really do anything.

Do you know about EXPLAIN ANALYZE? Adding the ANALYZE keyword tells the planner to not only pontificate on what it would do, but to actually run the query and report the real timings and row counts alongside the plan.

This will show where indexes are being hit – or not hit as it may be. You can step through and re-optimize your basic and complex queries.
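The invocations are as simple as they sound; the orders table and filters below are hypothetical. In the output, look for whether the planner chose a sequential scan or an index scan, and how the estimated row counts compare with the actual ones.

```sql
-- Shows the plan without running the query
EXPLAIN
SELECT * FROM orders WHERE customer_id = 42;

-- Actually runs the query and reports real timings and row counts
EXPLAIN ANALYZE
SELECT customer_id, sum(total) AS lifetime_value
FROM orders
GROUP BY customer_id;
```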

Then you can see how to put those queries into views. You might get some quicker returns on your data.

Are your views no longer giving you the pep you need? It might be time to wrap those views in functions. Functions are planned and handled differently; PL/pgSQL, for example, caches the plans for the statements inside it, and a function can do a whole batch of work in a single round trip. You will likely see better performance.
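As a rough sketch, wrapping a hypothetical orders_enriched view in a parameterized, set-returning SQL function might look like this:

```sql
-- Reuses the view's row type and pushes the date filter into the function
CREATE OR REPLACE FUNCTION recent_orders(lookback_days integer DEFAULT 7)
RETURNS SETOF orders_enriched AS $$
    SELECT *
    FROM orders_enriched
    WHERE ordered_at >= now() - make_interval(days => lookback_days);
$$ LANGUAGE sql STABLE;

-- SELECT * FROM recent_orders(30);
```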

Finally, before re-tooling your entire data model, there is one more path. If you are semi-competent with C/C++, you can create a function that is loaded into the database (Postgres, of course) as a compiled object. This is where the pure essence of performance lives, especially if you just need to brute-force some horsepower out of a poorly designed data model that is struggling to keep up.
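The C side itself is beyond this article but, assuming you have compiled a shared library the server can find, registering it is just more SQL (the add_one function here is hypothetical):

```sql
-- Assumes add_one.so was built against the server's headers (e.g. via PGXS)
-- and installed in the server's library path ($libdir)
CREATE FUNCTION add_one(integer) RETURNS integer
AS 'add_one', 'add_one'
LANGUAGE C STRICT;
```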

Ideally, you’ll do it right the first time, but designing data models is well beyond what I wanted to cover today.

Conclusions

Hopefully, this was all review for you and you have been nodding along in agreement. If it wasn’t, that’s okay. Hopefully you learned a few things worth researching further.

The important part is that no matter what we work with, there are many tools at our disposal. If we use them well, we can often make our lives a lot easier and smoother.

One of the best practical investments a data person can make is learning how to deal with their database more effectively. It’s estimated we spend 80% of the job wrangling data. If we can cut that down by 10% or more, I trust you can do the math to determine the impact.

