Ditch the Database

Published in

Towards Data Science

5 min readJun 3, 2020

How to use AWS S3 Select to query smarter and maybe cheaper

Amazon S3 Select, a feature of Amazon S3, became generally available in April of 2018. By using Amazon S3 Select it’s possible to run an application without a “classical” database making app architecture simpler.

Anyone who works with data has probably used a cloud-based storage solution at some point in their data pipeline. Personally, Amazon S3 is my preferred choice as it’s relatively simple to use, easy to scale, couples with other AWS services, and stores objects of nearly any file type.

While Amazon S3 is AWESOME for storing datasets it, like other traditional cloud-based object storage solutions, doesn't allow retrieving subsets of a dataset (i.e. the entire dataset must be retrieved). For example, if you need the first 1000 lines of a 1 million line dataset you have to retrieve the entire dataset and filter locally. This makes it impractical to use Amazon S3 as the sole database for applications, particularly apps that constantly need to retrieve subsets of a larger dataset (i.e. web app retrieving customer records).

Enter Amazon S3 Select 🙌

Overview — Amazon S3 Select

Amazon S3 Select lets you use simple SQL expressions to pull out only the data you need from an Amazon S3 object. This is HUGE!!!

With S3 Select you no longer need to retrieve an entire dataset only to retain a few lines. Instead, you can use classic SQL expressions to query data in place and retrieve only a subset, such as data between two dates or above a certain price. Currently Amazon S3 Select works with objects in CSV, JSON, and Apache Parquet format.

By reducing the volume of data that has to be loaded and processed by your applications, S3 Select can improve the performance of most applications that frequently access data from S3 by up to 400%. — Amazon Web Services

Simplify with Amazon S3 Select

The best part of Amazon S3 Select is how it simplifies application structure. Prior to Amazon S3 Select, the typical structure of a web app may have looked like the following.

Store data in a classic SQL (i.e. MySQL or PostgreSQL) or NoSQL (i.e. DynamoDB) database where it can be queried.
Keep backup copies of the data in persistent object storage (i.e. Amazon S3)
Transfer new data into Amazon S3 and then add to the database.

With Amazon S3 Select the structure of the web app looks like the following.

Store data in Amazon S3 where it can be queried.

SO MUCH BETTER! Now you can build an end-to-end application using only Amazon S3. It’s fast, cheap, and simple to interface with! 👍

Example — Amazon S3 Select

Recently I built and deployed a web application to make downloading and visualizing historical Canadian weather data easier and faster. This was the first web application I built where I didn’t use a classic database and instead stored and queried static and dynamic data directly from Amazon S3. Using Amazon S3 Select I was able to quickly and easily take user inputs from my dynamic dashboard, pass them into a SELECT expression, and query and retrieve only the data requested.

Below are some example data and a function to query a CSV in Amazon S3 and return a CSV format that can be read into a Pandas Dataframe. Amazon S3 SQL syntax for CSV can be a bit tricky. You can either query using column names or column index and based on which one you choose the SQL expression will be different.

Function to query CSV in Amazon S3 and unpack data to CSV format

How to Query By Column Name?

Let's query the above dataset between two dates using column names and return only the corresponding columns for date, wind speed, and wind direction.

Query CSV in Amazon S3 using column names

How to Query By Column Index?

Let’s query the above dataset when the temperature is greater than 20 degrees Celsius using column index and return only the corresponding columns for date and temperature.

Notice that the index starts at 1 (i.e. s._1) in S3 Select whereas Python would be 0. Make sure you are comparing common data types in the SQL expression by using the “cast” command to convert one data type to another. If my temperature data had been string format I would have needed “cast(s._4 as float) > 20”.

Query CSV in Amazon S3 using column index

How to Query Header Names?

Let’s query only the header names from the above dataset so that we can use them to append to other query results or as inputs to a dynamic dashboard dropdown. In this case, I set use_header to False so that column names are returned in the first row and use the S3 Select LIMIT feature to limit the number of rows returned to 1.

Query CSV in Amazon S3 for column headers

#  query result
['Date/Time', 'Longitude', 'Latitude', 'Temperature', 'Wind Speed', 'Wind Direction']

Cost of Amazon S3 Select

Amazon S3 Select is CHEAP! Here is a breakdown of the starting costs for my application (current as of June 2020).

Amazon S3 object storage — $0.023 per GB (first 50 TB/Month)
Amazon S3 Select — $0.0004 per 1000 SELECT commands
Amazon S3 Data Transfer— $0.01 per GB (S3 → U.S. East Region)

I store approximately 37 GB of data in S3, perform nearly 100,000 SELECT commands a month, and transfer approximately 50 GB of data out of S3 a month. That brings my estimated monthly costs to….$1.39 USD!

Conclusion

Although Amazon S3 Select has been available for more than 2-years (in code years that’s a millennium 😆) it’s a great tool for simplifying applications, particularly microservice frameworks.

Happy querying!