The Complete Picture

Deep dive into Redshift Spectrum and its internals

The internal functioning of Redshift Spectrum, and when to choose between Athena, Spectrum, or S3-Select to query data in S3

Shubham Jain
Towards Data Science
6 min read · Jul 29, 2020


AWS launched Redshift in 2013, and with its success came the need to declutter clusters occupied by cold data. But what if you want to access that cold data too? Not frequently, maybe once a year. Would you still want to pay for cluster space to hold cold data you hardly ever use, and which keeps growing year after year?

No, right? No one wants to fill up their cluster with cold data. So the biggest problem with Redshift clusters was querying cold data at minimum cost. To solve it, AWS launched Redshift Spectrum in 2017, which lets you query data stored in S3 and also join that S3 data (your cold data) with the data in your Redshift cluster (your hot data).

This frees up a lot of cluster space, which lowers the overall cost of the cluster, and with more space available queries have more room to execute, improving performance.

Internals of Redshift Spectrum:

AWS Redshift’s query processing engine works the same for both internal tables (tables residing within the Redshift cluster, i.e. hot data) and external tables (tables residing in an S3 bucket, i.e. cold data).

To access data residing in S3 using Spectrum, we need to perform the following steps:

  1. Create a Glue catalog.
  2. Create an external table pointing to your S3 data.

There is no need to run crawlers, and if you ever want to update partition information you can run msck repair table table_name (or, from within Redshift itself, ALTER TABLE ... ADD PARTITION).
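The two steps above might look like this in practice. This is only a sketch: the schema, database, table, bucket, and IAM role names are all hypothetical.

```sql
-- Register the Glue catalog database as an external schema in Redshift.
-- (spectrum_schema, spectrum_db, and the IAM role are hypothetical names.)
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Create an external table pointing at cold data in S3.
CREATE EXTERNAL TABLE spectrum_schema.sales_archive (
    order_id   BIGINT,
    amount     DECIMAL(10,2),
    order_date DATE
)
PARTITIONED BY (sale_year INT)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales-archive/';

-- Register a new partition as data lands in S3.
ALTER TABLE spectrum_schema.sales_archive
ADD IF NOT EXISTS PARTITION (sale_year = 2019)
LOCATION 's3://my-bucket/sales-archive/sale_year=2019/';
```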

When we query an external table using Spectrum, the query lifecycle goes like this:

  1. The query is submitted to the cluster’s leader node, where it is optimized; the leader node determines which parts run locally against hot data and which parts go to Spectrum.
  2. The query plan is sent to the compute nodes, where the table’s partition information and metadata are fetched from the Glue catalog.
  3. Multiple managed compute nodes, known as the Spectrum fleet, are associated with the cluster for executing the query over external datasets.
  4. The Spectrum fleet processes the data and sends it back to the leader node, where the join with hot data takes place.
  5. The leader node returns the required output.
Query Lifecycle in Redshift Spectrum
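The payoff of this lifecycle is that hot and cold data can be queried together. A sketch, assuming a hypothetical hot table public.recent_sales in the cluster and the cold external table spectrum_schema.sales_archive on S3:

```sql
-- Daily revenue across both hot (cluster) and cold (S3) data.
SELECT order_date,
       SUM(amount) AS revenue
FROM (
    SELECT order_date, amount FROM public.recent_sales           -- hot data
    UNION ALL
    SELECT order_date, amount FROM spectrum_schema.sales_archive -- cold data
) AS all_sales
GROUP BY order_date;
```

The leader node splits this plan: the scan of sales_archive runs on the Spectrum fleet, the scan of recent_sales runs on the cluster’s own compute nodes, and the results are combined at the end.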

Spectrum Fleet

The Spectrum fleet is a little tricky, and we need to understand it to choose the best strategy for managing our workloads.

The Spectrum fleet consists of multiple managed compute nodes that are made available only when you execute a query on external data. The spawning of these compute nodes is handled entirely by AWS behind the scenes.

Now the question arises: how many compute nodes are made available to run these queries? Can you run a Spectrum query over 10 TB of data with a 2-node Redshift cluster? Is the number of compute nodes unlimited for external tables?

The answer is NO.

You don’t get unlimited compute; the number of nodes assigned to a particular Spectrum query is up to 10x your Redshift cluster size.

If you are using a 2-node Redshift cluster, AWS will assign no more than 20 nodes to run your Spectrum query. Similarly, for a 20-node cluster you get at most 200 nodes.

The assignment of the number of nodes is determined in the following ways:

  1. When we query external data, the leader node generates an optimized logical plan, and from that a physical plan is generated.
  2. Based on this physical plan, Redshift determines the amount of compute required to process the result and assigns the necessary compute nodes to the query.
  3. If your query requires more nodes than the max limit, Redshift assigns the maximum allowed; if that still doesn’t fulfill the compute requirement, the query fails.

S3 File Formats and Compression

Redshift Spectrum can query data in ORC, RCFile, Avro, JSON, CSV, SequenceFile, Parquet, and text files, with support for gzip, bzip2, and Snappy compression. Amazon recommends using a columnar file format, since it takes less storage space, processes and filters data faster, and lets you read only the columns you need. To know more about the supported file formats, compression, and encryption, visit here.
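Since columnar formats are recommended, cold data can be exported from the cluster straight into Parquet on S3 using UNLOAD. A sketch, with hypothetical table, bucket, and role names:

```sql
-- Offload pre-2018 rows from a hot cluster table to Parquet files on S3.
-- (public.recent_sales, the bucket, and the IAM role are hypothetical.)
UNLOAD ('SELECT * FROM public.recent_sales WHERE order_date < ''2018-01-01''')
TO 's3://my-bucket/sales-archive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
FORMAT AS PARQUET;
```

The unloaded files can then be exposed through an external table, so the archived rows remain queryable through Spectrum.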

Query Optimization

To optimize query performance, you should consider the following:

  1. Use a columnar file format; this saves Spectrum from unnecessarily scanning columns.
  2. Keep your Glue catalog updated with the correct partition information.
  3. Write your queries to use filters and aggregations that are eligible to be pushed down to the Redshift Spectrum layer. Examples of operations that can be pushed down: GROUP BY clauses; comparison and pattern-matching conditions, such as LIKE; and aggregate functions, such as COUNT, SUM, AVG, MIN, and MAX.
  4. Avoid operations that can’t be pushed to the Redshift Spectrum layer, such as DISTINCT and ORDER BY.
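As a sketch (with hypothetical table and column names), a Spectrum-friendly query keeps the filter and aggregation pushdown-eligible and leaves nothing like DISTINCT or ORDER BY for the Spectrum layer:

```sql
-- The WHERE filter, GROUP BY, and aggregates below are pushdown-eligible,
-- so only per-group aggregates travel from the Spectrum layer to the cluster.
SELECT sale_year,
       COUNT(*)    AS orders,
       SUM(amount) AS revenue
FROM spectrum_schema.sales_archive
WHERE sale_year BETWEEN 2015 AND 2017
  AND amount > 100
GROUP BY sale_year;
-- DISTINCT and ORDER BY, by contrast, run on the cluster's own compute nodes.
```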

To know more about the query optimization visit here.

Monitoring

There are two system views available in Redshift to inspect the performance of your external queries:

  1. SVL_S3QUERY: Provides details about Spectrum queries at the segment and node-slice level.
  2. SVL_S3QUERY_SUMMARY: Keeps track of all Spectrum queries run on the cluster, including the number of files processed and the number of bytes scanned (which helps identify the cost incurred by a query).
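For example, a quick look at the heaviest recent Spectrum scans might use SVL_S3QUERY_SUMMARY like this (a sketch; run it on a cluster where Spectrum queries have executed):

```sql
-- Recent Spectrum queries, heaviest S3 scans first.
SELECT query,
       elapsed,
       files,
       s3_scanned_rows,
       s3_scanned_bytes
FROM svl_s3query_summary
ORDER BY s3_scanned_bytes DESC
LIMIT 10;
```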

To know more about monitoring, visit here. And to troubleshoot query errors, visit here.

Cost

Spectrum charges for the amount of data scanned: $5 per TB.
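That rate can be turned into a rough per-query cost estimate directly from the monitoring view. A sketch, assuming the $5/TB price (pricing can vary; check your region):

```sql
-- Approximate per-query Spectrum cost at $5 per TB scanned.
-- A query can produce several summary rows (one per segment/step), so sum first.
SELECT query,
       SUM(s3_scanned_bytes) / (1024.0 * 1024 * 1024 * 1024) * 5 AS approx_cost_usd
FROM svl_s3query_summary
GROUP BY query
ORDER BY approx_cost_usd DESC;
```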

To know more about Redshift Spectrum performance, see this blog: https://aws.amazon.com/blogs/aws/amazon-redshift-spectrum-exabyte-scale-in-place-queries-of-s3-data/

Choosing between Spectrum, Athena, and S3-Select

Athena:

Consider Athena when you have no Redshift cluster already running and you want to execute analytical queries over data residing in S3, or when your data is unrelated to the data in your Redshift cluster and you don’t need to join against cluster data. Some points about Athena:

  1. Athena is serverless.
  2. Athena uses the Presto query engine under the hood.
  3. Athena tables are usually created by crawling the data with Glue crawlers, which adds to its overall cost.
  4. The overall cost for Athena is $5/TB of data scanned, plus $0.44 per DPU-hour for crawling the data with Glue crawlers.

S3-Select:

S3-Select is very useful if you want to filter the contents of a single S3 object: it queries one object at a time and filters the data as it reads it. S3-Select features include:

  1. Requires no servers to run a query over an S3 object.
  2. Can only query a single object at a time.
  3. Can be used in Spark applications for predicate pushdown. (Believe me, this gives you a real speed boost if you are reading CSV data.)
  4. Charges are $0.8/TB of data returned and $2.23/TB of data scanned.
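S3-Select speaks a small SQL dialect of its own, run against one object (for example via the AWS CLI or an SDK). A sketch for a headerless CSV object, with hypothetical column positions:

```sql
-- S3 Select dialect: the single target object is addressed as S3Object.
-- For CSV data without headers, columns are positional: _1, _2, _3, ...
SELECT s._1, s._3
FROM S3Object s
WHERE s._2 = 'US'
LIMIT 100;
```

Only the matching rows leave S3, which is where the speed boost for large CSV files comes from.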

Redshift Spectrum:

Redshift Spectrum should only be considered if you are already a Redshift user. If you are already running workloads on a Redshift cluster, Spectrum is the natural choice: it fills the gap of querying data residing in S3 alongside your cluster’s data.

Comparison between Spectrum, Athena and s3-select

Summary

Redshift Spectrum is a great choice if you wish to query data residing in S3 and relate it to your Redshift cluster data. It’s fast, powerful, and very cost-efficient. You can query S3 data using BI tools or SQL Workbench.

Redshift Spectrum is a very powerful tool, yet it is often overlooked.

I hope you liked this article.

Stay tuned for more content.

