
How You Can (and Why You Should) Access Amazon S3 Resources with Python

Use automation to move data to and from the cloud

Photo by Lia Trevarthen on Unsplash

Amazon Simple Storage Service (S3) provides users with cheap, secure, and easy-to-manage storage infrastructure. You can move files in and out of S3 buckets with the AWS console itself, but AWS also offers the option to perform these operations with code.

For Python, the AWS software development kit (SDK) offers boto3, which allows users to create, configure, and utilize S3 buckets and objects with code. Here, we delve into the basic boto3 functions for S3 resources and consider how they can be used to automate operations.


Why Boto3?

You might be wondering if there is any point in learning to use yet another tool when you can already access the S3 service through the AWS console. Granted, with AWS’s simple and intuitive user interface (UI), it is easy to move files to and from S3 buckets.

However, the typical click-and-drag approach isn’t viable when operations need to scale up. It’s one thing to handle 1-10 files at a time, but how would you fare with hundreds or thousands of files? Such an undertaking would naturally be time-consuming if done manually.

Furthermore, manual tasks also leave users prone to making mistakes. When moving large quantities of data around manually, can you guarantee that you won’t mistakenly omit or include the wrong files?

Those who value efficiency or consistency will no doubt see the merit in using Python scripts to automate the use of S3 resources.


Prerequisites

Let’s start with the prerequisites for using boto3.

1. Installation

To use boto3, you’ll have to install the AWS SDK for Python, if you haven’t already, with the following command:

pip install boto3

2. IAM user

You will also need an IAM user account with permission to use S3 resources.

To get a user identity, log in to AWS with the root user account. Head to the Identity and Access Management (IAM) section and add a new user. Assign this user identity a policy that will grant access to S3 resources. The simplest option is to select the "AmazonS3FullAccess" permission policy, but you can find or create policies that are more tailored to your needs.

AmazonS3FullAccess Policy (Created By Author)

After picking the policy, complete the remaining prompts and create the user identity. You should be able to see your newly created identity in the console. In this example, we use a user identity called "s3_demo".

Next, click on the user name (not the check box) and go to the security credentials tab. In the access key section, select "create access key" and answer all prompts. You will then receive your access key and secret access key.

These keys are required for accessing the S3 resources with boto3, so they will naturally be incorporated into the Python script.


Basic commands

Boto3 comprises functions that can provision and configure various AWS resources, but the focus of this article will be on handling S3 buckets and objects. The key benefit of boto3 is that it executes tasks like uploading and downloading files with a simple one-liner.

Creating a client

To access S3 resources (or any other AWS resource) with boto3, a low-level service client needs to be created first.

Creating a client requires inputting the service name, region name, access key, and secret access key.
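A minimal setup might look like the following (the region name and key values are placeholders to replace with your own):

import boto3

# Create a low-level S3 service client.
# Replace the placeholders with your region and the IAM user's access keys.
client = boto3.client(
    's3',
    region_name='us-east-1',
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY'
)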

Creating a bucket

Let’s start by creating a bucket, which we will call "a-random-bucket-2023". This can be achieved with the create_bucket function.
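With the client from above, the call can be a one-liner (note that outside of us-east-1, the region must be supplied explicitly):

# Create the bucket; bucket names must be globally unique across AWS.
client.create_bucket(Bucket='a-random-bucket-2023')

# In regions other than us-east-1, add a location constraint, e.g.:
# client.create_bucket(
#     Bucket='a-random-bucket-2023',
#     CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'}
# )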

If you refresh the S3 section on the AWS console, you should see the newly created bucket.

Creating a Bucket (Created By Author)

Listing buckets

To list the available buckets, users can use the list_buckets function. This returns a dictionary with many key-value pairs. To see just the names of the buckets, retrieve the value of the ‘Buckets’ key in the dictionary.
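For example:

# list_buckets returns a dictionary of metadata; the bucket details sit under 'Buckets'.
response = client.list_buckets()
for bucket in response['Buckets']:
    print(bucket['Name'])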

Code Output (Created By Author)

Upload a file to a bucket

Files can be uploaded with the upload_file function. In this case, let’s upload "mock_data.xlsx" to the bucket.
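The call looks something like this:

# Upload a local file to the bucket.
client.upload_file(
    Filename='mock_data.xlsx',       # local file to send
    Bucket='a-random-bucket-2023',   # destination bucket
    Key='First Mock Data.xlsx'       # name given to the object in the bucket
)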

It’s worth clarifying the difference between the Filename and Key parameters. Filename refers to the name of the local file that is being transferred, while Key refers to the name that is assigned to the object stored in the bucket.

Since "First Mock Data.xlsx" was assigned to the Key parameter, that will be the name of the object when it is uploaded to the bucket.

Adding an Object (Created By Author)

Uploading a Pandas data frame to a bucket

Since we’re working in Python, it’s worth knowing how to upload a Pandas data frame directly to the bucket. This can be achieved with the io module. In the following snippet, we upload the data frame "df" to the bucket.
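A minimal sketch of this approach, with a toy data frame standing in for df and openpyxl assumed as the Excel writer, looks something like this:

import io
import pandas as pd

# A small data frame standing in for the real "df".
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90, 85]})

# Write the data frame to an in-memory buffer, then upload the buffer's contents.
buffer = io.BytesIO()
df.to_excel(buffer, index=False)  # writing .xlsx requires the openpyxl package

client.put_object(
    Bucket='a-random-bucket-2023',
    Key='Second Mock Data.xlsx',  # example object name
    Body=buffer.getvalue()
)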

Uploading a Pandas Data Frame (Created By Author)

List objects

The objects in a given bucket can be listed with the list_objects function.

The output of the function itself is a large dictionary with a bunch of metadata, but the object names can be found by accessing the "Contents" key.
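For example:

# Print the name of every object in the bucket.
response = client.list_objects(Bucket='a-random-bucket-2023')
for obj in response.get('Contents', []):
    print(obj['Key'])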

Code Output (Created By Author)

Download files

Objects can be downloaded from a bucket with the download_file function.
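For example, the object uploaded earlier can be pulled down to a local file (the local file name is just an example):

# Download an object from the bucket to a local file.
client.download_file(
    Bucket='a-random-bucket-2023',
    Key='First Mock Data.xlsx',
    Filename='downloaded_mock_data.xlsx'
)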

Deleting objects

Objects can be deleted with the delete_object function.
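For example:

# Remove a single object from the bucket.
client.delete_object(
    Bucket='a-random-bucket-2023',
    Key='First Mock Data.xlsx'
)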

Deleting an Object (Created By Author)

Deleting buckets

Finally, users can delete buckets with the delete_bucket function.
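For example:

# A bucket must be empty before it can be deleted.
client.delete_bucket(Bucket='a-random-bucket-2023')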

Deleting a Bucket (Created By Author)

Case Study

We’ve explored some of the basic boto3 functions for using S3 resources. However, performing a case study is the best way to demonstrate why boto3 is such a powerful tool.

Problem Statement 1: We are interested in the books that are being published on different dates. Obtain data on published books using the NYT Books API and upload it to an S3 bucket.

Once again, we start by creating a low-level service client.

Next, we create a bucket that will contain all of these files.
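A sketch of these two steps, using a made-up bucket name, might look like this:

import boto3

# Low-level S3 client (same placeholders as before).
client = boto3.client(
    's3',
    region_name='us-east-1',
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY'
)

# Bucket that will hold the NYT Books API pulls (the name is just an example).
client.create_bucket(Bucket='nyt-books-demo-2023')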

Creating a Bucket (Created By Author)

Next, we need to pull data from the NYT Books API and upload it to the bucket. We can perform a data pull for a given date with the following get_data function:
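A sketch of such a function, assuming the NYT Books API "overview" endpoint and a placeholder API key, might look like this:

import requests
import pandas as pd

NYT_API_KEY = 'YOUR_NYT_API_KEY'  # placeholder for your own key

def get_data(published_date):
    # Pull the best-seller lists for a given date (YYYY-MM-DD) into a data frame.
    url = 'https://api.nytimes.com/svc/books/v3/lists/overview.json'
    response = requests.get(
        url,
        params={'published_date': published_date, 'api-key': NYT_API_KEY}
    )
    response.raise_for_status()
    results = response.json()['results']

    records = []
    for book_list in results['lists']:
        for book in book_list['books']:
            records.append({
                'published_date': results['published_date'],
                'list_name': book_list['list_name'],
                'rank': book['rank'],
                'title': book['title'],
                'author': book['author'],
                'publisher': book['publisher'],
            })
    return pd.DataFrame(records)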

Here is a preview of what the output of the get_data function looks like:

Code Output (Created By Author)

There are many ways to utilize this function for data collection. One option is to run this function and upload the output to the S3 bucket every day, either manually or with a job scheduler. Another option is to use loops to collect data for books published on multiple days.

If we are interested in books published in the last 7 days, we can parse through each day with a loop. For each day, we:

  • make a call with the API
  • store the response into a data frame
  • upload the data frame to the bucket

These steps are executed with the following snippet:
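A sketch of the loop, reusing the get_data function and the example bucket name from above, might look like this:

from datetime import date, timedelta
import io

for days_back in range(7):
    pull_date = (date.today() - timedelta(days=days_back)).strftime('%Y-%m-%d')

    # 1. Make a call with the API.
    df = get_data(pull_date)

    # 2. Store the response as an Excel file in an in-memory buffer.
    buffer = io.BytesIO()
    df.to_excel(buffer, index=False)

    # 3. Upload the data frame to the bucket.
    client.put_object(
        Bucket='nyt-books-demo-2023',
        Key=f'nyt_books_{pull_date}.xlsx',
        Body=buffer.getvalue()
    )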

Uploading Files to the Bucket (Created By Author)

With just a few lines of code, we are able to make multiple API calls and upload the responses to the bucket in one fell swoop!

At some point, there might be a need to extract some insight from the data uploaded to a bucket. It’s natural for data analysts to receive ad hoc requests pertaining to the data they collect.

Problem Statement 2: Find the book title, author, and publisher of the highest-ranking books for all dates and store the results locally.

With Python, we can store multiple S3 objects into Pandas data frames, process the data frames, and load the output into a flat file.
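A sketch of this workflow, assuming the column names produced by the get_data function above, might look like this:

import io
import pandas as pd

# Read every object in the bucket into a data frame.
frames = []
response = client.list_objects(Bucket='nyt-books-demo-2023')
for obj in response.get('Contents', []):
    body = client.get_object(Bucket='nyt-books-demo-2023', Key=obj['Key'])['Body'].read()
    frames.append(pd.read_excel(io.BytesIO(body)))

# Keep only the top-ranked books and store the result locally as a flat file.
combined = pd.concat(frames, ignore_index=True)
top_books = combined.loc[combined['rank'] == 1,
                         ['published_date', 'title', 'author', 'publisher']]
top_books.to_csv('top_ranked_books.csv', index=False)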

There are two key advantages of using boto3 that are showcased in this case study. The first advantage of this method is that it scales well; the difference in time and effort needed to upload/download 1 file and 1000 files is negligible. The second advantage is that it allows users to seamlessly tie in other processes (e.g., data collection, filtering) when moving data to or from S3 buckets.


Conclusion

Photo by Prateek Katyal on Unsplash

Hopefully, this brief boto3 primer has not only introduced you to the basic Python commands for managing S3 resources but also shown how they can be used to automate processes.

With the AWS SDK for Python, users will be able to move data to and from the cloud with greater efficiency and consistency. Even if you are currently content with provisioning and utilizing resources with AWS’s UI, you’ll no doubt run into cases where scalability is a priority. Knowing the basics of boto3 will ensure that you are well prepared for such situations.

Thank you for reading!

