
Amazon Simple Storage Service (S3) provides users with cheap, secure, and easy-to-manage storage infrastructure. It is possible to move files in and out of S3 buckets through the AWS console itself, but AWS also offers the option to perform these operations with code.
For Python, the AWS software development kit (SDK) offers boto3, which allows users to create, configure, and utilize S3 buckets and objects with code. Here, we delve into the basic boto3 functions for S3 resources and consider how they can be used to automate operations.
Why Boto3?
You might be wondering if there is any point in learning to use yet another tool when you can already access the S3 service through the AWS console. Granted, with AWS’s simple and user-friendly user interface (UI), it is easy to move files to and from S3 buckets.
However, the typical click-and-drag approach isn't viable when operations need to scale up. It's one thing to handle 1-10 files at a time, but how would you fare with hundreds or thousands of files? Such an undertaking would naturally be time-consuming if done manually.
Furthermore, manual tasks leave users prone to making mistakes. When moving large quantities of data around manually, can you guarantee that you won't mistakenly omit or include the wrong files?
Those who value efficiency or consistency will no doubt see the merit in using Python scripts to automate the use of S3 resources.
Prerequisites
Let’s start with the prerequisites for using boto3.
1. Installation
To use boto3, you'll have to install the AWS SDK for Python, if you haven't already, with the following command.
pip install boto3
2. IAM user
You will also need an IAM user account with permission to use S3 resources.
To get a user identity, log in to AWS with the root user account. Head to the Identity and Access Management (IAM) section and add a new user. Assign this user identity a policy that will grant access to S3 resources. The simplest option is to select the "AmazonS3FullAccess" permission policy, but you can find or create policies that are more tailored to your needs.

After picking the policy, complete the remaining prompts and create the user identity. You should be able to see your newly created identity in the console. In this example, we use a user identity called "s3_demo".

Next, click on the user name (not the check box) and go to the security credentials tab. In the access key section, select "create access key" and answer all prompts. You will then receive your access key and secret access key.

These keys are required for accessing the S3 resources with boto3, so they will naturally be incorporated into the Python script.
Basic commands
Boto3 comprises functions that can provision and configure various AWS resources, but the focus of this article will be on handling S3 buckets and objects. The key benefit of boto3 is that it executes tasks like uploading and downloading files with a simple one-liner.
Creating a client
To access S3 resources (or any other AWS resources) with boto3, a low-level service client needs to be created first.
Creating a client requires inputting the service name, region name, access key, and secret access key.
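A minimal sketch of what this might look like (the region and key values below are placeholders to be swapped for your own credentials):

import boto3

# Create a low-level S3 client using the IAM user's credentials
s3_client = boto3.client(
    "s3",
    region_name="us-east-1",                         # replace with your region
    aws_access_key_id="YOUR_ACCESS_KEY",             # placeholder
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
)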
Creating a bucket
Let’s start by creating a bucket, which we will call "a-random-bucket-2023". This can be achieved with the create_bucket function.
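A sketch of the call, reusing the s3_client created above:

# Create the bucket; in regions other than us-east-1, you would also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}
s3_client.create_bucket(Bucket="a-random-bucket-2023")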
If you refresh the S3 section on the AWS console, you should see the newly created bucket.

Listing buckets
To list the available buckets, users can use the list_buckets function. This returns a dictionary with many key-value pairs. To see just the names of the buckets, retrieve the value of the ‘Buckets’ key in the dictionary.
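For example:

# List all buckets and pull out just their names
response = s3_client.list_buckets()
bucket_names = [bucket["Name"] for bucket in response["Buckets"]]
print(bucket_names)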

Uploading a file to a bucket
Files can be uploaded with the upload_file function. In this case, let’s upload "mock_data.xlsx" to the bucket.
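A sketch of the upload, assuming "mock_data.xlsx" sits in the working directory:

# Upload the local file "mock_data.xlsx" and store it as "First Mock Data.xlsx"
s3_client.upload_file(
    Filename="mock_data.xlsx",
    Bucket="a-random-bucket-2023",
    Key="First Mock Data.xlsx",
)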
It’s worth noting the difference between the Filename and Key parameters: Filename refers to the name of the local file that is being transferred, while Key refers to the name assigned to the object stored in the bucket.
Since "First Mock Data.xlsx" was assigned to the Key parameter, that will be the name of the object when it is uploaded to the bucket.

Uploading a Pandas data frame to a bucket
Since we’re working in Python, it’s worth knowing how to upload a Pandas data frame directly to the bucket. This can be achieved with the io module. In the following snippet, we upload the data frame "df" to the bucket.
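A minimal sketch using an in-memory CSV buffer (the stand-in data frame and object name below are hypothetical; the same idea works for other formats):

import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # stand-in data frame for illustration

# Write the data frame to an in-memory buffer, then push its contents to S3
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3_client.put_object(
    Bucket="a-random-bucket-2023",
    Key="df_upload.csv",                       # hypothetical object name
    Body=csv_buffer.getvalue(),
)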

Listing objects
The objects in a given bucket can be listed with the list_objects function.
The output of the function itself is a large dictionary with a bunch of metadata, but the object names can be found by accessing the "Contents" key.
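For instance:

# List the objects in the bucket and extract their key names
response = s3_client.list_objects(Bucket="a-random-bucket-2023")
object_names = [obj["Key"] for obj in response.get("Contents", [])]
print(object_names)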

Downloading files
Objects can be downloaded from a bucket with the download_file function.
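A sketch of the call (the local file name is hypothetical):

# Download the object and save it locally as "downloaded_mock_data.xlsx"
s3_client.download_file(
    Bucket="a-random-bucket-2023",
    Key="First Mock Data.xlsx",
    Filename="downloaded_mock_data.xlsx",
)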
Deleting objects
Objects can be deleted with the delete_object function.
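For example, removing the object uploaded earlier:

# Remove the object from the bucket
s3_client.delete_object(Bucket="a-random-bucket-2023", Key="First Mock Data.xlsx")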

Deleting buckets
Finally, users can delete buckets with the delete_bucket function.
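A sketch of the call:

# The bucket must be empty before it can be deleted
s3_client.delete_bucket(Bucket="a-random-bucket-2023")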

Case Study
We’ve explored some of the basic boto3 functions for using S3 resources. However, performing a case study is the best way to demonstrate why boto3 is such a powerful tool.
Problem Statement 1: We are interested in the books being published on different dates. Obtain data on published books using the NYT Books API and upload it to an S3 bucket.
Once again, we start by creating a low-level service client.
Next, we create a bucket that will contain all of these files.

Next, we need to pull the data with the NYT Books API and upload it to the bucket. We can perform a data pull for a given date with the following get_data function:
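A sketch of such a function, here assuming the hardcover-fiction best-seller list endpoint (the list name, API key, and exact fields returned may differ from your setup):

import requests
import pandas as pd

API_KEY = "YOUR_NYT_API_KEY"  # placeholder -- substitute your own NYT API key

def get_data(published_date):
    """Pull the NYT hardcover-fiction best-seller list for a date (YYYY-MM-DD)
    and return it as a Pandas data frame."""
    url = f"https://api.nytimes.com/svc/books/v3/lists/{published_date}/hardcover-fiction.json"
    response = requests.get(url, params={"api-key": API_KEY})
    response.raise_for_status()
    books = response.json()["results"]["books"]
    return pd.DataFrame(books)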
Here is a preview of what the output of the get_data function looks like:

There are many ways to utilize this function for data collection. One option is to run this function and upload the output to the S3 bucket every day (or use a job scheduler). Another option is to use loops to collect data for books published over multiple days.
If we are interested in books published in the last 7 days, we can iterate over each day with a loop. For each day, we:
- make a call with the API
- store the response into a data frame
- upload the data frame to the bucket
These steps are executed with the following snippet:
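A sketch of that loop, reusing get_data and the s3_client from earlier (the bucket and object names are hypothetical):

import io
from datetime import date, timedelta

BUCKET = "nyt-books-demo-bucket"  # hypothetical bucket created for the case study

for offset in range(7):
    pull_date = (date.today() - timedelta(days=offset)).strftime("%Y-%m-%d")
    df = get_data(pull_date)               # make a call with the API for this date
    csv_buffer = io.StringIO()
    df.to_csv(csv_buffer, index=False)     # store the response as an in-memory CSV
    s3_client.put_object(
        Bucket=BUCKET,
        Key=f"nyt_books_{pull_date}.csv",  # one object per date
        Body=csv_buffer.getvalue(),
    )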

With just a few lines of code, we are able to make multiple API calls and upload the responses to the bucket in one fell swoop!
At some point, there might be a need to extract some insight from the data uploaded to a bucket. It’s natural for data analysts to receive ad hoc requests pertaining to the data they collect.
Problem Statement 2: Find the book title, author, and publisher of the highest-ranking books for all dates and store the results locally.
With Python, we can store multiple S3 objects into Pandas data frames, process the data frames, and load the output into a flat file.
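A sketch of how that could look, assuming the objects were stored as CSVs with the column names returned by the API (rank, title, author, publisher); the bucket name is again hypothetical:

import io
import pandas as pd

BUCKET = "nyt-books-demo-bucket"  # hypothetical bucket from the previous step

# Read each object in the bucket into a data frame and keep the top-ranked books
top_books = []
response = s3_client.list_objects(Bucket=BUCKET)
for obj in response.get("Contents", []):
    body = s3_client.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
    df = pd.read_csv(io.BytesIO(body.read()))
    top_books.append(df.loc[df["rank"] == 1, ["title", "author", "publisher"]])

# Combine the results and store them locally as a flat file
pd.concat(top_books).to_csv("top_ranked_books.csv", index=False)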
There are two key advantages of using boto3 showcased in this case study. The first is that it scales well: the difference in time and effort needed to upload or download 1 file versus 1,000 files is negligible. The second is that it allows users to seamlessly tie in other processes (e.g., data collection, filtering) when moving data to or from S3 buckets.
Conclusion

Hopefully, this brief boto3 primer has not only introduced you to the basic Python commands for managing S3 resources but also shown how they can be used to automate processes.
With the AWS SDK for Python, users will be able to move data to and from the cloud with greater efficiency and consistency. Even if you are currently content with provisioning and utilizing resources with AWS’s UI, you’ll no doubt run into cases where scalability is a priority. Knowing the basics of boto3 will ensure that you are well prepared for such situations.
Thank you for reading!