Managing the Cloud Storage Costs of Big-Data Applications

Tips for Reducing the Expense of Using Cloud-Based Storage

Chaim Rand
Towards Data Science


Photo by JOSHUA COLEMAN on Unsplash

With the growing reliance on ever-increasing amounts of data, modern-day companies are more dependent than ever on high-capacity and highly scalable data-storage solutions. For many companies this solution comes in the form of a cloud-based storage service, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage, each of which comes with a rich set of APIs and features (e.g., multi-tier storage) supporting a wide variety of data-storage designs. Of course, cloud storage services also have an associated cost. This cost usually comprises a number of components, including the overall size of the storage space you use, as well as activities such as transferring data into, out of, or within cloud storage. The price of Amazon S3, for example, includes (as of the time of this writing) six cost components, each of which needs to be taken into consideration. It’s easy to see how managing the cost of cloud storage can get complicated, and designated calculators (e.g., here) have been developed to assist with this.

In a recent post, we expanded on the importance of designing your data and your data usage so as to reduce the costs associated with data storage. Our focus there was on using data compression as a way to reduce the overall size of your data. In this post we focus on a sometimes overlooked cost component of cloud storage — the cost of API requests made against your cloud storage buckets and data objects. We will demonstrate, by example, why this component is often underestimated and how it can become a significant portion of the cost of your big-data application if not managed properly. We will then discuss a couple of simple ways to keep this cost under control.

Disclaimers

Although our demonstrations will use Amazon S3, the contents of this post are just as applicable to any other cloud storage service. Please do not interpret our choice of Amazon S3, or of any other tool, service, or library we mention, as an endorsement of their use. The best option for you will depend on the unique details of your own project. Furthermore, please keep in mind that any design choice regarding how you store and use your data will have its pros and cons that should be weighed carefully based on the details of your own project.

This post will include a number of experiments that were run on an Amazon EC2 c5.4xlarge instance (with 16 vCPUs and “up to 10 Gbps” of network bandwidth). We will share their outputs as examples of the comparative results you might see. Keep in mind that the outputs may vary greatly based on the environment in which the experiments are run. Please do not rely on the results presented here for your own design decisions. We strongly encourage you to run these as well as additional experiments before deciding what is best for your own projects.

A Simple Thought Experiment

Suppose you have a data transformation application that acts on 1 MB data samples from S3 and produces 1 MB data outputs that are uploaded to S3. Suppose that you are tasked with transforming 1 billion data samples by running your application on an appropriate Amazon EC2 instance (in the same region as your S3 bucket in order to avoid data transfer costs). Now let’s assume that Amazon S3 charges $0.0004 for every 1000 GET operations and $0.005 for every 1000 PUT operations (as of the time of this writing). At first glance, these costs might seem so low as to be negligible compared to the other costs of the data transformation. However, a simple calculation shows that our Amazon S3 API calls alone will tally a bill of $5,400!! This can easily be the most dominant cost factor of your project, even more than the cost of the compute instance. We will return to this thought experiment at the end of the post.
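To make the arithmetic explicit, here is the back-of-the-envelope calculation behind that number, using the per-request prices quoted above:

# cost of the S3 API calls in the thought experiment
num_samples = 1_000_000_000      # one billion 1 MB samples
cost_per_get = 0.0004 / 1000     # price of a single GET request
cost_per_put = 0.005 / 1000      # price of a single PUT request

api_cost = num_samples * (cost_per_get + cost_per_put)
print(f'${api_cost:,.0f}')       # prints $5,400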

Batch Data into Large Files

The obvious way to reduce the costs of the API calls is to group samples together into files of a larger size and run the transformation on batches of samples. Denoting our batch size by N, this strategy could potentially reduce our cost by a factor of N (assuming that multi-part file transfer is not used — see below). This technique would save money not just on the PUT and GET calls but on all of the cost components of Amazon S3 that are dependent on the number of object files rather than the overall size of the data (e.g., lifecycle transition requests).

There are a number of disadvantages to grouping samples together. For example, when you store samples individually, you can freely access any one of them at will. This becomes more challenging when samples are grouped together. (See this post for more on the pros and cons of batching samples into large files.) If you do opt for grouping samples together, the big question is how to choose the size N. A larger N could reduce storage costs but might introduce latency, increase the compute time, and, by extension, increase the compute costs. Finding the optimal number may require some experimentation that takes into account these and additional considerations.

But let’s not kid ourselves. Making this kind of change will not be easy. Your data may have many consumers (both human and artificial) each with their own particular set of demands and constraints. Storing your samples in separate files can make it easier to keep everyone happy. Finding a batching strategy that satisfies everyone will be difficult.

Possible Compromise: Batched Puts, Individual Gets

A compromise you might consider is to upload large files of grouped samples while enabling access to individual samples. One way to do this is to maintain an index file with the location of each sample (the file in which it is grouped, the start offset, and the end offset) and expose a thin API layer to each consumer that enables them to freely download individual samples. The API would be implemented using the index file and an S3 API that supports extracting specific byte ranges from object files (e.g., Boto3’s get_object function with its Range parameter). While this kind of solution would not save any money on GET calls (since we are still pulling the same number of individual samples), the more expensive PUT calls would be reduced since we would be uploading a smaller number of larger files. Note that this kind of solution imposes some limitations on the library we use to interact with S3, as it depends on an API that allows for extracting partial chunks of the large file objects. In previous posts (e.g., here) we have discussed the different ways of interfacing with S3, many of which do not support this feature.
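As a rough illustration of this idea, the snippet below sketches how such a lookup might work. The index format (a JSON file mapping each sample id to its host file and byte range) and the get_sample helper are hypothetical placeholders, not part of any particular library:

import json
import boto3

s3 = boto3.client('s3')
bucket = '<s3 bucket>'

# hypothetical index file mapping each sample id to the large file that
# contains it and to its byte range within that file, e.g.
# {"0": {"key": "<path in s3>/0.bin", "start": 0, "end": 1048575}, ...}
with open('index.json') as f:
    index = json.load(f)

def get_sample(sample_id):
    entry = index[str(sample_id)]
    # S3 byte ranges are inclusive on both ends
    response = s3.get_object(
        Bucket=bucket,
        Key=entry['key'],
        Range=f"bytes={entry['start']}-{entry['end']}"
    )
    return response['Body'].read()

# retrieve a single sample without downloading the full file that contains it
sample = get_sample(42)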

The code block below demonstrates how to implement a simple PyTorch dataset (with PyTorch version 1.13) that uses the Boto3 get_object API to extract individual 1 MB samples from large files of grouped samples. We compare the speed of iterating over the data in this manner with the speed of iterating over samples stored in individual files.

import os, boto3, time, numpy as np
import torch
from torch.utils.data import Dataset
from statistics import mean, variance

KB = 1024
MB = KB * KB
GB = KB ** 3

sample_size = MB
num_samples = 100000

# modify to vary the size of the files
samples_per_file = 2000  # for 2 GB files
num_files = num_samples // samples_per_file
bucket = '<s3 bucket>'
single_sample_path = '<path in s3>'
large_file_path = '<path in s3>'


class SingleSampleDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.bucket = bucket
        self.path = single_sample_path
        self.client = boto3.client("s3")

    def __len__(self):
        return num_samples

    def get_bytes(self, key):
        response = self.client.get_object(
            Bucket=self.bucket,
            Key=key
        )
        return response['Body'].read()

    def __getitem__(self, index: int):
        key = f'{self.path}/{index}.image'
        image = np.frombuffer(self.get_bytes(key), np.uint8)
        return {"image": image}


class LargeFileDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.bucket = bucket
        self.path = large_file_path
        self.client = boto3.client("s3")

    def __len__(self):
        return num_samples

    def get_bytes(self, file_index, sample_index):
        # request only the byte range of the requested sample
        response = self.client.get_object(
            Bucket=self.bucket,
            Key=f'{self.path}/{file_index}.bin',
            Range=f'bytes={sample_index*MB}-{(sample_index+1)*MB-1}'
        )
        return response['Body'].read()

    def __getitem__(self, index: int):
        file_index = index // samples_per_file
        sample_index = index % samples_per_file
        image = np.frombuffer(self.get_bytes(file_index, sample_index),
                              np.uint8)
        return {"image": image}


# toggle between single sample files and large files
use_grouped_samples = True

if use_grouped_samples:
    dataset = LargeFileDataset()
else:
    dataset = SingleSampleDataset()

# set the number of parallel workers according to the number of vCPUs
dl = torch.utils.data.DataLoader(dataset, shuffle=True,
                                 batch_size=4, num_workers=16)

stats_lst = []
t0 = time.perf_counter()
for batch_idx, batch in enumerate(dl, start=1):
    if batch_idx % 100 == 0:
        t = time.perf_counter() - t0
        stats_lst.append(t)
        t0 = time.perf_counter()

mean_calc = mean(stats_lst)
var_calc = variance(stats_lst)
print(f'mean {mean_calc} variance {var_calc}')

The table below summarizes the speed of data traversal for different choices of the sample grouping size, N.

Impact of Different Grouping Strategies on Data Traversal Time (by Author)

Note that although these results strongly imply that grouping samples into large files has a relatively small impact on the performance of extracting them individually, we have found that the comparative results vary based on the sample size, the file size, the values of the file offsets, the number of concurrent reads from the same file, and more. Although we are not privy to the internal workings of the Amazon S3 service, it is not surprising that considerations such as memory size, memory alignment, and throttling would impact performance. Finding the optimal configuration for your data will likely require a bit of experimentation.

One significant factor that could interfere with the money-saving grouping strategy we have described here is the use of multi-part downloading and uploading, which we will discuss in the next section.

Use Tools that Enable Control Over Multi-part Data Transfer

Many cloud storage service providers support the option of multi-part uploading and downloading of object files. In multi-part data transfer, files that are larger than a certain threshold are divided into multiple parts that are transferred concurrently. This is a critical feature if you want to speed up the data transfer of large files. AWS recommends using multi-part upload for files larger than 100 MB. In the following simple example, we compare the download time of a 2 GB file with the multi-part threshold and chunk-size set to different values:

import boto3, time
from statistics import mean

KB = 1024
MB = KB * KB
GB = KB ** 3

s3 = boto3.client('s3')
bucket = '<bucket name>'
key = '<key of 2 GB file>'
local_path = '/tmp/2GB.bin'
num_trials = 10

for size in [8*MB, 100*MB, 500*MB, 2*GB]:
    print(f'multi-part size: {size}')
    stats = []
    for i in range(num_trials):
        config = boto3.s3.transfer.TransferConfig(multipart_threshold=size,
                                                  multipart_chunksize=size)
        t0 = time.time()
        s3.download_file(bucket, key, local_path, Config=config)
        stats.append(time.time() - t0)
    print(f'multi-part size {size} mean {mean(stats)}')

The results of this experiment are summarized in the table below:

Impact of Multi-part chunk size on Download Time (by Author)

Note that the relative comparison will greatly depend on the test environment, and specifically on the speed and bandwidth of communication between the instance and the S3 bucket. Our experiment was run on an instance in the same region as the bucket. However, as the distance between them increases, so will the impact of using multi-part downloading.

With regard to the topic of our discussion, it is important to note the cost implications of multi-part data transfer. Specifically, when you use multi-part data transfer, you are charged for an API operation on each one of the file parts. Consequently, using multi-part uploading/downloading will limit the cost-savings potential of batching data samples into large files.

Many APIs use multi-part downloading by default. This is great if your primary interest is reducing the latency of your interaction with S3. But if your concern is limiting cost, this default behavior does not work in your favor. Boto3, for example, is a popular Python API for uploading and downloading files from S3. If not specified, the boto3 S3 APIs such as upload_file and download_file will use a default TransferConfig, which applies multi-part uploading/downloading with a chunk-size of 8 MB to any file larger than 8 MB. If you are responsible for controlling the cloud costs in your organization, you might be unhappily surprised to learn that these APIs are being widely used with their default settings. In many cases, you might find these settings to be unjustified and that increasing the multi-part threshold and chunk-size values, or disabling multi-part data transfer altogether, will have little impact on the performance of your application.
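If, for example, your files never exceed a few hundred megabytes and you want each transfer to cost exactly one GET or PUT request, one option is to raise the multi-part threshold above the size of your largest file. The minimal sketch below assumes placeholder bucket and key names:

import boto3
from boto3.s3.transfer import TransferConfig

GB = 1024 ** 3

s3 = boto3.client('s3')

# setting the threshold (and chunk size) above the size of the largest
# file effectively disables multi-part transfer; note that Amazon S3
# limits a single (non-multi-part) upload to 5 GB
config = TransferConfig(multipart_threshold=5 * GB,
                        multipart_chunksize=5 * GB)

s3.download_file('<s3 bucket>', '<key>', '/tmp/data.bin', Config=config)
s3.upload_file('/tmp/data.bin', '<s3 bucket>', '<output key>', Config=config)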

Example — Impact of Multi-part File Transfer Size on Speed and Cost

In the code block below we create a simple multi-process transform function and measure the impact of the multi-part chunk size on its performance and cost:

import os, boto3, time, math
from multiprocessing import Pool
from statistics import mean, variance

KB = 1024
MB = KB * KB

sample_size = MB
num_files = 64
samples_per_file = 500
file_size = sample_size * samples_per_file
num_processes = 16
bucket = '<s3 bucket>'
large_file_path = '<path in s3>'
local_path = '/tmp'
num_trials = 5
cost_per_get = 4e-7
cost_per_put = 5e-6

for multipart_chunksize in [1*MB, 8*MB, 100*MB, 200*MB, 500*MB]:
    def empty_transform(file_index):
        s3 = boto3.client('s3')
        config = boto3.s3.transfer.TransferConfig(
            multipart_threshold=multipart_chunksize,
            multipart_chunksize=multipart_chunksize
        )
        s3.download_file(bucket,
                         f'{large_file_path}/{file_index}.bin',
                         f'{local_path}/{file_index}.bin',
                         Config=config)
        s3.upload_file(f'{local_path}/{file_index}.bin',
                       bucket,
                       f'{large_file_path}/{file_index}.out.bin',
                       Config=config)

    stats = []
    for i in range(num_trials):
        with Pool(processes=num_processes) as pool:
            t0 = time.perf_counter()
            pool.map(empty_transform, range(num_files))
            transform_time = time.perf_counter() - t0
            stats.append(transform_time)

    num_chunks = math.ceil(file_size / multipart_chunksize)
    num_operations = num_files * num_chunks
    transform_cost = num_operations * (cost_per_get + cost_per_put)
    if num_chunks > 1:
        # if multi-part is used, add the cost of the
        # CreateMultipartUpload and CompleteMultipartUpload API calls
        transform_cost += 2 * num_files * cost_per_put
    print(f'chunk size {multipart_chunksize}')
    print(f'transform time {mean(stats)} variance {variance(stats)}')
    print(f'cost of API calls {transform_cost}')

In this example we have fixed the file size to 500 MB and applied the same multi-part settings to both the download and upload. A more complete analysis would vary the size of the data files and the multi-part settings.

In the table below we summarize the results of the experiment.

Impact of Multi-part Chunk Size on Data Transformation Speed and Cost (by Author)

The results indicate that, up to a multi-part chunk size of 500 MB (the size of our files), the impact on the data transformation time is minimal. On the other hand, the potential savings in cloud storage API costs are significant, up to 98.4% when compared with using Boto3’s default chunk size (8 MB). Not only does this example demonstrate the cost benefit of grouping samples together, but it also implies an additional opportunity for savings through appropriate configuration of the multi-part data transfer settings.

Conclusion

Let’s apply the results of our last example to the thought experiment we introduced at the top of this post. We showed that applying a simple transformation to 1 billion data samples would cost $5,400 if the samples were stored in individual files. If we were to group the samples into 2 million files, each with 500 samples, and apply the transformation without multi-part data transfer (as in the last trial of the example above), the cost of the API calls would be reduced to $10.8!! At the same time, assuming the same test environment, the impact we would expect (based on our experiments) on the overall runtime would be relatively low. I would call that a pretty good deal. Wouldn’t you?

Summary

When developing cloud-based big-data applications it is vital that we be fully familiar with all of the details of the costs of our activities. In this post we focused on the “Requests & data retrievals” component of the Amazon S3 pricing strategy. We demonstrated how this component can become a major part of the overall cost of a big-data application. We discussed two of the factors that can affect this cost: the manner in which data samples are grouped together and the way in which multi-part data transfer is used.

Naturally, optimizing one cost component in isolation may increase other components in a way that raises the overall cost. An appropriate design for your data storage will need to take into account all potential cost factors and will greatly depend on your specific data needs and usage patterns.

As usual, please feel free to reach out with comments and corrections.


I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.