Why I chose BigQuery for events analytics and why probably you should too
In every startup comes the time when you have to design a large-scale ingestion of analytics events.
The main idea is that you have a backend logging events and you want to have analytics insights about them.
This is pretty much one of the classic data engineering tasks.
The requirements you have in startups might sound like this:
- You want a managed solution
You do not want to spend a great amount of engineering effort and time. You want to iterate…and you want to do it fast. You want your focus to be on building a product, building it well and building it fast. The less time you spend managing infrastructure, probably the better.
- You want to be able to query the events as fast as possible after they happened
This is not always a requirement. Usually, you want to do analytics on past events and you don’t really care if you are able to do it the day after the events happened. A scenario when you actually have to do it almost immediately is for example when you need to provide user assistance and you need to know what happened to that particular user.
- Keep the cost contained
Obviously, no one wants to pay more than necessary.
- Ability to greatly scale up in a small amount of time if needed
That’s a good one. Usually, you want to build the simplest thing which solves your problem but not simpler. Also, you want to have a good way to eventually scale-up without having to rebuild everything from scratches. This sometimes happens, though and it’s fine.
If you look at the market there is almost an infinite amount of solutions to the problem…which one is the best? Well, it really depends on your use case.
For me there really were three main players:
- Amazon Redshift
- Amazon Redshift Spectrum
BigQuery is Google analytics DB.
It’s powered by Colossus (Google distributed filesystem), each query is transformed into an execution tree by Dremel (Google query engine), the data is retrieved from Colossus and aggregated…Everything runs on Google Jupiter high-speed network.
Being an analytics DB, BigQuery uses a columnar storage. If you are not familiar with this kind of storage, it differs from the classic transactional DB which instead stores the data by rows. This is done to reflect the kind of queries that are generally done on analytics DBs.
Usually, you have something like this:
SELECT foo, bar
WHERE timestamp between '2018–01–05' and '2018–01–09'
As you can see, we query a small number of columns (2 in this case), but we process potentially a very big amount of data (a full quarter of sales events). Storing the data by columns will make this easier.
Redshift is the other big player in the domain of analytics DBs. It’s Amazon’s analytics DB, it uses columnar storage (like BigQuery) and it’s fully managed (like BigQuery).
Redshift Spectrum is a very interesting solution for analytics provided by Amazon. The idea behind Spectrum is to use S3 as the source of data allowing to run SQL queries on it. To have a performance comparable with Redshift and BigQuery, you actually need to “manually” store the data in a columnar format (usually the choice is Apache Parquet format).
What about Athena?
Amazon Athena in principle gives the same functionality of Redshift Spectrum, there are some differences.
I found Spectrum more interesting because we already had Redshift running, and that integrates better with Spectrum rather than Athena.
In the end I chose BigQuery and that’s way!
BigQuery does not use indexes: as data size grows, BigQuery just adds more servers, in this way performance is kept almost constant. In BigQuery each query is a table scan and that’s fine I’d say. Despite that, the performance is amazingly good.
Redshift is probably more performant, BUT you need to wisely choose primary keys and sort keys. It’s really common to see a poorly performing Redshift instance, only because keys were not really well-planned.
Sort keys are used to speed up queries and improve compression in columnar DBs, this will have a great impact on your costs and on your system performances.
Redshift Spectrum and Athena will require even more engineering effort, due to the fact that we need to manually store the data in the right format to have a good performance.
Potential to Scale Up
At first you can just stream insert data into the databases.
In BigQuery you can use stream insert, while with Redshift it’s probably a good idea to use an Amazon Kinesis Data Firehose. Kinesis Firehoses are already set up to work with Amazon storages (like Redshift) and continuously write the data to them providing also some queuing mechanism for fault tolerance.
Eventually, you will generate too much data to directly streaming into the DBs and probably you will hit some limits (either in cost or some quotas in the service)
I think that BigQuery provides the easiest solution to the problem.
You can basically decide which data you want to be immediately able to query and which you don’t want, and you put them on two different paths.
The first one (hot path) is directly streamed to the DB, the other one is stored somewhere else and then it’s batch-inserted in the DB (which is an operation provided for free on Google Cloud Platform).
The figure shows a nice representation of this taken from the Google blog.
The red path is the hot one, while the blue one is the cold one.
With Redshift, I don’t think it’s so straightforward (but I may be wrong).
One solution would be to store the incoming data on DynamoDB for queries that are needed immediately and later you can batch transfer everything to Redshift or S3 for Spectrum.
The architecture of the whole pipeline will look like this (here Athena is used instead of Spectrum or Redshift, but the principle is the same).
BiqQuery it’s charged per query and storage…everything else is free (almost). Loading data is free, but you pay streaming insert.
This means that in some use cases it can be really cheap. Probably in lots of analytics application you will be scanning/returning a huge amount of rows for few columns which usually are not very big in size (data structure-wise).
Redshift is charged by node. The more data you have, the more node you will need. On the other hand, Amazon has reserved instances payment plan, where basically you commit to using the service for between 1 and 3 years and you can pay some money up-front. This can save you up to 70% of the original price. Still, for a startup may not be a good idea to commit to a long-term plan.
Redshift Spectrum can be cheap as well, basically, it’s charged (kind of like BigQuery) according to the amount of data you process.
Conclusions: We had a look at how I designed an events ingestion problem in real-life. Different solutions are available on the marker, each one has its own advantages and disadvantages. We had a look at how BigQuery can be a good solution in certain use cases, comparing it to alternatives such as Redshift and Redshift Spectrum.
If you enjoy the article and you found it useful feel free to 👏 or share.