
The Most for Your Money from Amazon S3

Leverage the power of cloud storage with frugality in mind.

The Most for Your Money From Amazon S3 – Image via Shutterstock (standard license)

Amazon S3 is a file-based cloud storage service under the AWS umbrella. It can be used for files of any type and size. It is infinitely scalable, durable, and secure to industry standards (when configured correctly). It can serve as a host for website assets, an actual static website, or a data lake, and it is plug-and-play with just about any AWS service.

For the most part, Amazon S3 is extremely cheap. But that can be deceiving when data keeps piling up and nobody is monitoring it with the budget in mind. Whether you are a data engineer, data scientist, or webmaster, you can fall into this pitfall of unnecessary cost. In this overview, I will show you the most common ways of overspending on Amazon S3 and how to avoid them from the get-go.

* Amazon S3 is a very versatile system with many use cases. In this overview, I will go over the most common ones and what to look out for in each.


Storage Classes

Storage Classes are the core of S3 and are a mostly overlooked option, as people go with the default – the S3 Standard general-purpose class that just does the job. Putting (or moving) data into the correct class is a major driver of cost. What you need to consider when choosing a storage class is the lifecycle of the data and how frequently it will be accessed over the long run.

Quick Overview of Storage Classes:

S3 Standard – General Purpose: This is the most common class and the one you start with by default. It is meant for frequently accessed data.

S3 Intelligent-Tiering: This tier is used for unknown or changing access patterns. It automatically determines which data is accessed less frequently and moves it for you. You pay a small extra charge for the monitoring, but it is worth it if you really don't know your data and haven't set up Lifecycle Rules (more on those below). You also pay per 1,000 lifecycle transition requests, so if you have a lot of data that would bounce back and forth between storage classes, this might not be the best option for you.

S3 Standard-IA (Infrequent Access): For data that is retrieved less frequently but requires rapid access when it is needed. The storage cost is lower, but you pay for data retrieval.

S3 One Zone-IA: A place for less frequently accessed data. As the name suggests, it is stored in a single Availability Zone with no copy in a second one, so it should be used only for data that is reproducible, something like a secondary backup.

S3 Glacier: An archival tier for infrequent access. Data can be retrieved in anywhere from a few minutes to a few hours. This is the place to store backups or old versions. Storage is cheap, but you pay for retrieval.

S3 Glacier Deep Archive: Designed for very infrequent access, something like 1–2 times a year. Data retrieval can take hours. Storage is the cheapest of all options, but you pay for retrieval.

S3 Outposts: Used for on-premises storage and accessed through the S3 APIs. Ideal when you have in-house data storage requirements.
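To make this concrete, the storage class is just a per-object attribute you can set at upload time or change later with a copy. Here is a minimal sketch using boto3; the bucket and key names are placeholders, not anything from a real setup:

```python
import boto3

s3 = boto3.client("s3")

# Upload a file straight into Standard-IA instead of the default S3 Standard.
s3.upload_file(
    "monthly-report.csv",
    "my-example-bucket",
    "reports/2021/monthly-report.csv",
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)

# An existing object can be moved to another class by copying it onto itself
# with a new StorageClass.
s3.copy_object(
    Bucket="my-example-bucket",
    Key="reports/2020/old-report.csv",
    CopySource={"Bucket": "my-example-bucket", "Key": "reports/2020/old-report.csv"},
    StorageClass="GLACIER",
)
```

In practice you rarely want to move objects one by one like the second call does; that is exactly the job Lifecycle Rules (next section) automate for you.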

More Info on Storage Classes and Storage Class Pricing


Lifecycle Rules

Create Lifecycle rule under the Management tab (image by author)

The main idea of Lifecycle Rules is to move data between storage classes automatically and to delete old data or versions. They are very powerful and super easy to use, yet in my experience they are most often not used at all. Lifecycle Rules are set-and-forget, the "serverless" way to abstract your data management and keep cost low by design.
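Besides the console shown above, a rule can be created programmatically. Below is a rough sketch with boto3 that tiers data down over time and eventually deletes it; the bucket name, prefix, and day counts are placeholder assumptions:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under "logs/" to cheaper classes over time, then delete them.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-and-expire-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```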


Versioning

Amazon S3 Versioning Setting (image by author)

This is a very neat feature and good for auditing requirements. If you need to use it, the best thing you can do is transition old versions to a cheaper long-term storage class, or delete them after a few days, via Lifecycle Rules.
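Versioning itself is a bucket-level switch. A minimal boto3 sketch for turning it on (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning on the bucket; every overwrite or delete now keeps
# the previous version as a noncurrent object that you keep paying for.
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```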

Example Lifecycle Rules for Versioning

You will pay the price here for storing the data multiple times, which is why I suggest setting up lifecycle rules such as the following:

Example for Lifecycle Rules for Versioning (image by author)
  1. From S3 Standard (assuming it is the default), move the old version to S3 Standard-IA after 1 day.
  2. From S3 Standard-IA, move the file to S3 Glacier after 4 days. Be aware of the per-file transition cost associated with moving to S3 Glacier. A rough code sketch of this pattern follows below.
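Expressed as a lifecycle configuration, the idea looks roughly like this boto3 sketch (bucket name and day counts are placeholders). Because transitions into Standard-IA are typically subject to a minimum object age, this sketch sends noncurrent versions straight to Glacier and expires them later:

```python
import boto3

s3 = boto3.client("s3")

# Move old (noncurrent) versions to Glacier shortly after they are replaced,
# and delete them entirely after a year. Day counts are illustrative.
# Note: this call replaces any existing lifecycle configuration on the bucket,
# so in practice combine all rules into a single call.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-versions",
                "Filter": {},  # apply to the whole bucket
                "Status": "Enabled",
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": 4, "StorageClass": "GLACIER"},
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
            }
        ]
    },
)
```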

More Info on Versioning


Encryption Types & Cost

Amazon S3 Encryption Setting (image by author)

In S3 you have the option to encrypt data at rest. You can encrypt the whole bucket or an individual folder/prefix. You can use the Amazon S3-managed key (SSE-S3) or an AWS Key Management Service key (SSE-KMS). If you use a custom (your own) KMS key, every time you read an object from S3 it also calls the KMS API, which incurs cost. If you are using S3 for a data lake, or with Athena for example, this piles up unnecessary cost for every single file read. You do have the option to enable a Bucket Key, which reduces cost by minimizing the API calls made to KMS, but you will still end up paying a bit.

I would suggest going with SSE-S3 unless you have other requirements.
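For reference, here is a minimal sketch of setting SSE-S3 as the bucket default with boto3; the bucket name and key alias are placeholders, and the commented-out block shows the SSE-KMS variant with a Bucket Key enabled:

```python
import boto3

s3 = boto3.client("s3")

# Default all new objects in the bucket to SSE-S3 (AES-256, S3-managed keys),
# which adds no per-read KMS API cost.
s3.put_bucket_encryption(
    Bucket="my-example-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)

# If SSE-KMS is a hard requirement, enabling the Bucket Key reduces KMS calls:
# "Rules": [
#     {
#         "ApplyServerSideEncryptionByDefault": {
#             "SSEAlgorithm": "aws:kms",
#             "KMSMasterKeyID": "alias/my-key",  # placeholder key alias
#         },
#         "BucketKeyEnabled": True,
#     }
# ]
```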

*Encryption can be a specific project requirement, and in that case you need to go with the best architecture for it and present the cost to the stakeholders.

More Info on Amazon Key Management Service (KMS)


S3 for Data Lakes

When storing data on S3 for a data lake, partitioning and file format are the key considerations. You want to organize your files in a structure that lets queries scan fewer files. It will be faster and cheaper. This is where, yet again, Lifecycle Rules come into play to help you purge stale data or archive it to long-term storage automatically.

Additionally, make sure you are storing your data in the most compact format possible. For example, Parquet or ORC are ideal for querying with AWS Athena and crawling with AWS Glue. Both formats are supported out of the box by most AWS services. This is not only a best practice but also minimizes the storage size.
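As one way to put both points into practice, the AWS SDK for pandas (awswrangler) can write a DataFrame to S3 as compressed, partitioned Parquet in a single call. This is a rough sketch; the S3 path, columns, and partition keys are placeholder assumptions:

```python
import awswrangler as wr
import pandas as pd

# A tiny example dataset with columns we want to partition on.
df = pd.DataFrame(
    {
        "year": [2021, 2021, 2021],
        "month": [1, 1, 2],
        "user_id": [10, 11, 12],
        "amount": [9.99, 15.00, 7.50],
    }
)

# Write partitioned Parquet so Athena/Glue scan only the partitions
# a query actually touches, instead of the whole dataset.
wr.s3.to_parquet(
    df=df,
    path="s3://my-example-data-lake/sales/",
    dataset=True,
    partition_cols=["year", "month"],
)
```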


S3 for Web Services

The most common use cases here are hosting website images or other assets like JavaScript, stylesheets (CSS), thumbnails, etc. You can even host a static website on S3. In all of these use cases you pay for data transfer out to the Internet. If this is your intended use case, you should definitely use Amazon CloudFront to cache any long-lived assets and serve them from there. This will also greatly improve the speed of asset loading and reduce the cost of reading data directly from S3. Yes, you will have to pay for CloudFront, but it will be significantly less than serving from S3 directly.
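One small, related habit that helps CloudFront do its job is setting long cache headers when you upload assets. A minimal boto3 sketch, where the bucket, key, and max-age are placeholder assumptions:

```python
import boto3

s3 = boto3.client("s3")

# Upload a static asset with a long Cache-Control header so CloudFront
# (and browsers) can keep serving it from cache instead of hitting S3
# on every request.
s3.upload_file(
    "logo.png",
    "my-example-assets-bucket",
    "static/img/logo.png",
    ExtraArgs={
        "ContentType": "image/png",
        "CacheControl": "public, max-age=31536000, immutable",
    },
)
```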


Conclusion

S3 is super cheap for the most part. Standard storage for the first 50 TB is only $0.023 per GB per month (*pricing at the time of writing), and whether you are a big or small operation, it might not seem like a substantial amount to you. I personally do not like leaky pockets, so why not prevent any loss from the start of my projects. When building on AWS, I always follow the 5 Pillars of the AWS Well-Architected Framework, and cost plays a big role there. Hopefully, I gave you some good insights on how to save from a few bucks to a few hundred on Amazon S3.

Also Read

The Most for Your Money from Amazon RDS

Contacts

I am a Software, Data, and Machine Learning Consultant, certified in AWS Machine Learning & AWS Solutions Architect – Professional. Get in touch if you need help with your next Machine Learning project. To stay updated with my latest articles and projects, follow me on Medium.

