I’d love to say the title is click-bait, but in this case it’s actually true. Let me explain.
Background
Several months ago, a CTO I had done some business with years ago reached out to me on LinkedIn. He was in a new company and needed some help.
"Hey Brad – I remember that you were experienced with Postgres. If you have some time in your schedule and you’re interested, would you be interested in having a call?"
One of those simple messages that had no real detail, but seemed interesting. We set a meeting and he explained.
His company was pushing a lot of data through a hosted solution managed by their vendor. The problem was that performance and stability issues were creeping in, making his customer guarantees a major chore. He wanted someone who could act as an interpreter: verify that the information the vendor provided was reliable, and translate the deep technical stuff.
It all sounded easy enough, so why not?
I am in the formal process of nominating that assumption as the understatement of the year.
When I finally got in to see exactly what I was dealing with, I was a bit shocked. The database had grown to around 80TB, with some individual tables (and indexes) exceeding 15TB. Woof.
Normally, when I get a call out of the blue on a scenario like this, it involves what I would consider "normal"-sized databases – somewhere in the 0.5–2TB range. By the time a company has a datastore of this size, there are usually teams of people managing it to keep everything humming along.
Not this time. I had to be the subject matter expert and wrangle some very talented people – each with deep expertise in other areas – into appreciating the data challenge in front of them.
Where to Begin?
The first logical step is to make sure the system is at least reasonably tuned for the size of the data and the hardware it's running on.
Postgres is a funny beast. It is very user-friendly but it is picky about its friends. It will happily allow an admin and a user to do all manner of bad things and will try to accommodate as best as it can.
This poses a considerable problem. There are not a lot of guides or recommendations out there for tuning a database to account for the massive amount of data in a store of this size.
If you have a web app with a few hundred GB of data in the database – search away; there are plenty of guides that will get you right in the neighborhood and let you run with fairly decent performance. If you need to fight your way through a non-optimal datamodel to get your result, you can brute-force your way through an example from Stack Overflow and get what you need. You probably won’t have too many problems.
Postgres will try to accommodate you. It has a great planner and a great engine to execute those plans.
The downside is that, out of the box, it leaves far more of the tuning to you than more self-configuring databases like MySQL / MariaDB or the like.
If you get it dialed in correctly though, Postgres can be the Indy car you’ve always dreamed of. Even with big data coursing through its schemas.
So, my first step was to get it in the neighborhood. Maybe not that Indy car just yet, but at least a stock car.
I spent a lot of time finding optimal values for shared_buffers, work_mem, and maintenance_work_mem. I also tuned the vacuum and autovacuum settings to keep things clean and tidy.
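For illustration only, the changes looked broadly like the following. The values here are placeholders, not the ones I actually used – the right numbers depend entirely on the RAM, the workload, and the table sizes involved.

    -- Illustrative placeholders only; derive real values from your own hardware and workload.
    ALTER SYSTEM SET shared_buffers = '64GB';                 -- typically a fraction of total RAM; requires a restart
    ALTER SYSTEM SET work_mem = '256MB';                      -- applies per sort/hash operation, so size it with concurrency in mind
    ALTER SYSTEM SET maintenance_work_mem = '4GB';            -- speeds up VACUUM and index builds on very large tables
    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.01;   -- vacuum huge tables before dead tuples pile up
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;     -- let autovacuum keep pace with heavy write activity
    SELECT pg_reload_conf();                                  -- apply the reloadable settings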
Before long, the system was right-sized for the data and the hardware. It stopped having reliability issues and its resource usage began looking a lot better.
While this isn’t the thrust of this article, I do want to mention that, in context, performance and stability were major items to check off the list. That alone helped the business units keep their customers happy. If the customers aren’t happy – no one is happy.
The Big Problem
As I dug deeper into the issues, another annoyance came up: generally poor query performance.
There were queries that, even accounting for the size of the data, were taking way longer than I expected. Some of this was design and some of it was just activity, but something seemed off.
I want to take one second here and explain that last sentence. In my very long time dealing with various systems – whether it be networks, operating systems, or databases – there is a quality I like to refer to as the SysAdmin 6th Sense. It’s a function of experience, both generally and specifically with the system you’re dealing with. You begin to know by gut feel when something is behaving poorly.
It’s not really magic or anything mystical. It’s just being observant of the various systems and understanding what certain inputs are likely to produce under certain conditions. You might not immediately know what is wrong, but you can usually tell when it’s just not right. It’s all about attention to detail.
Using my spidey-sense, I knew there was a query issue.
For the experienced among you reading this, you might be thinking, "well, duh." However, the enhanced challenge I faced is that I was working on a database with highly confidential and protected information. While I was cleared to work with the system, I was not cleared to work with the data itself.
Another joy of being a consultant.
So, to explore this problem, I needed to work with the internal team to provide me with some anonymized samples of queries. Within about 10 minutes of seeing the queries, I realized the problem.
The Reveal
What good is a click-bait-ish headline if you don’t have to wait for the reveal?
Looking through the primary queries on the system, I saw the kiss of death for Postgres performance: the dreaded SELECT DISTINCT.
Before I have a whole gaggle of angry DBAs and Data Analysts marching upon my abode with pitchforks and torches, let me offer a defense.
Simply put, Postgres specifically does not handle DISTINCT well at all. That is not a general statement about databases – it’s a statement about Postgres.
One of the things I love about Postgres is the adherence to the SQL standards. It tends to default to a standard instead of creating its own flavor of the language. This also means that in some cases, while the system will comply with what you want, the implementation is not perfect – or in this case, good.
Other database engines such as SQLServer, Oracle, MySQL, etc., handle DISTINCTs fine with minimal performance impact. Postgres does not. It’s an exception and one that makes all the difference.
You see, in simple terms, the other RDBMS engines can skip through an index to pull out unique values (often called an index skip scan or loose index scan). It is highly efficient and effective.
Postgres, on the other hand, has to first retrieve the entire set and then perform a secondary pass through it to identify and return the unique values. It doesn’t seem like much, but in the data world it makes all the difference. Start adding columns to the query and you wind up with multiple passes.
So every query on a big database like this that used that one keyword was burning huge amounts of time and resources just to return its set.
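You can see this shape on any Postgres instance with a plain EXPLAIN. The table and columns below are made up purely for illustration:

    -- Hypothetical table and columns, for illustration only.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT DISTINCT customer_id, region
    FROM   orders
    WHERE  order_date >= DATE '2024-01-01';
    -- The plan will typically show a scan over every matching row feeding a
    -- Sort + Unique or HashAggregate node: the whole set is produced first
    -- and only then de-duplicated.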
How do we get around it?
There are multiple methods. First, if the model is in development, I personally try to ensure the data is stored in a fashion where unique values can be returned properly without resorting to DISTINCT. Sometimes there are exceptions, but most datamodels can be constructed in this way.
Second, if my first preference isn’t feasible, I will attempt to use other constructs to provide a unique set on problematic queries. This may mean strategic use of views or even set-returning functions that reduce the ambiguity and allow better queries to be constructed (a sketch of one such construct follows below). The downside is that this often takes more time and attention to the database and how it is used in order to identify where and how they can be employed.
Third, there is usually some other way to write the query.
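As a sketch of that second approach, when the unique values come from an indexed column, a recursive CTE can emulate the skip scan the other engines perform, walking the index one distinct value at a time. The table, column, and index here are hypothetical:

    -- Assumes a hypothetical orders table with an index on orders(customer_id).
    WITH RECURSIVE distinct_customers AS (
        SELECT min(customer_id) AS customer_id          -- first value, fetched via the index
        FROM   orders
        UNION ALL
        SELECT (SELECT min(customer_id)                 -- jump to the next distinct value
                FROM   orders
                WHERE  customer_id > d.customer_id)
        FROM   distinct_customers d
        WHERE  d.customer_id IS NOT NULL
    )
    SELECT customer_id
    FROM   distinct_customers
    WHERE  customer_id IS NOT NULL;

Wrapped in a view, a construct like this keeps the application queries simple while avoiding the full-set scan.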
In our modern world of development, we often tend to take fast development over fast performance. If you’re in the cloud, resources are infinite (right?) so as long as something is kind of fast we can always throw more credits to the instance and brute force it. I’m kind of joking, but not completely.
There is usually a means to optimize performance just by thinking about the data in a slightly different way. It just takes a little creativity.
The Easy Way
The quickest and easiest way of removing the DISTINCT is simply by duplicating the SELECT clause into a GROUP BY clause.
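In practice the rewrite looks like this, with made-up table and column names; every column in the SELECT list simply gets repeated in the GROUP BY:

    -- Before: unique rows via DISTINCT
    SELECT DISTINCT customer_id, region
    FROM   orders
    WHERE  order_date >= DATE '2024-01-01';

    -- After: the same result set, expressed as a GROUP BY
    SELECT customer_id, region
    FROM   orders
    WHERE  order_date >= DATE '2024-01-01'
    GROUP BY customer_id, region;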
On the other databases I mentioned, if you run a query analysis to view the plan, you will often see that a DISTINCT and an equivalent GROUP BY produce almost the same query plan. That is the planner recognizing that these are mostly equivalent operations.
Postgres is not like this. The DISTINCT and the GROUP BY will have very different plans and ultimately will result in very different execution times in most cases. On complex queries against large data sets, this can have substantial effects.
I will note that there are exceptions and caveats. When aren’t there? But as general guidance, I’ve personally found this to be true. Given the simplicity of the change, it is often low impact to test and implement in code.
Proof
After all that, let’s talk proof.
My client had a query that was used to move a lot of data from one location to another. Essentially, they were extracting the transactional data from the database and moving it into a separate store for analytics and reporting.
This was a nightly process that, before I started, took an average of 25 hours to complete. That raises the question of whether it was truly daily reporting if it takes more than a day, but that’s a whole other semantic discussion.
The stabilization of the environment and the configuration changes had some minor impact on the average, but mostly they settled this operation at a consistent 25 hours.
Modifying the SELECT DISTINCT to a SELECT … GROUP BY reduced the process to 2 hours.
The simple change reduced the time spent by 92%. And all it took was a copy and paste from the SELECT clause into a GROUP BY clause.
Conclusions
As I always try to point out, not only to my clients but also in this article – which turned out to be an overly long explanation of a simple concept – it all comes down to knowing your systems.
Knowing how the system works is not enough. You also need to know its nuances and oddities. Where does it excel, and where does it struggle? Is there somewhere it doesn’t adhere to conventional wisdom?
If you work with a specific set of technology for long enough, you will slowly learn these lessons. If, like me, you happen to do a lot of this type of consulting, you just have to enter each new environment knowing that every piece of technology in the stack has something to it that can be a headache or a savior.
Pay close attention, do your research, and you will find that sometimes even ctrl+c / ctrl+v can give you results you would have never expected.