Setting up a PyPI mirror in an AWS private environment with Terraform

Florent Pajot
Towards Data Science
Mar 6, 2024

How do you install a Python package in your environment if you don’t have any internet access? I recently ran into this issue when creating an AWS SageMaker Studio environment for my team.

Building an AWS private environment for SageMaker

For this particular project, I set up SageMaker in VPC Only mode with the constraint of keeping the architecture private, which means creating a VPC and private subnets, but no access to the internet.

So all network communications, including the applications’ calls to AWS APIs, must go through VPC interface endpoints. This keeps connections secure: data sent and received never crosses the public internet and travels over the AWS network backbone instead.

This setup is well suited to limiting exposure to security risks, especially when you process personal information or must comply with security standards.

Photo by Nadir sYzYgY on Unsplash

Accessing the PyPI package repository from AWS SageMaker

In my team, data scientists use Python as their primary language and sometimes need Python packages that are not provided in SageMaker’s pre-built Python images, so I’ll focus on this use case. Fortunately, the solution also works for other languages and repositories, such as npm.

Your users will typically try to install whatever package they need with the pip command. But, as no internet access is allowed, this command fails because pip cannot reach the pypi.org servers.

Opening internet

One option is to open access to the internet and allow outbound HTTPS connections to the Fastly CDN IPs used by the pypi.org servers. But this is not viable in our case, as we don’t want any internet connection in the architecture.

Using a dedicated PyPI server

The AWS blog also provides an example using a Python package named Bandersnatch. The article describes how to set up a server, acting like a bastion host, which mirrors PyPI and is accessible only from your private subnets.

This is not a viable option either: you have to know in advance which Python packages you need to provide, and you still have to create public subnets and give the PyPI mirror server access to the internet.

Using AWS CodeArtifact

This is the solution I ultimately settled on, and it works in my case.

AWS CodeArtifact is the artifact management service provided by AWS. It is compatible with other AWS services, such as AWS Service Catalog, to control access to resources within an organization.

To use it, you’ll have to create a “domain” which serves as an umbrella to manage access and apply policies across your organization. Then, you’ll have to create a repository that will serve your artifacts to your different applications.

A repository can also have upstream repositories. If a Python package is not available in the target repository, the request is forwarded to the upstream repository to be fulfilled.

More precisely, this workflow takes into account package versions. Official documentation provides a detailed workflow:

If my_repo contains the requested package version, it is returned to the client.

If my_repo does not contain the requested package version, CodeArtifact looks for it in my_repo's upstream repositories. If the package version is found, a reference to it is copied to my_repo, and the package version is returned to the client.

If neither my_repo nor its upstream repositories contain the package version, an HTTP 404 Not Found response is returned to the client.
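To make the three cases above concrete, here is a minimal in-memory sketch of that lookup order. It is only a toy model, not how CodeArtifact is implemented: the Repository class and its resolve method are hypothetical names introduced for illustration, and the upstream is pre-populated by hand instead of fetching from the public index.

```python
from typing import Iterable, Optional, Set, Tuple

PackageVersion = Tuple[str, str]  # (package name, version)


class Repository:
    """Toy stand-in for a repository with optional upstreams (illustrative only)."""

    def __init__(
        self,
        name: str,
        upstreams: Optional[Iterable["Repository"]] = None,
        packages: Optional[Iterable[PackageVersion]] = None,
    ) -> None:
        self.name = name
        self.upstreams = list(upstreams or [])
        self.packages: Set[PackageVersion] = set(packages or [])

    def resolve(self, package: str, version: str) -> Optional[str]:
        """Return the name of the repository that served the request, or None (the 404 case)."""
        key = (package, version)
        # Case 1: this repository already holds the requested package version.
        if key in self.packages:
            return self.name
        # Case 2: ask each upstream in order; keep a reference locally on a hit.
        for upstream in self.upstreams:
            source = upstream.resolve(package, version)
            if source is not None:
                self.packages.add(key)
                return source
        # Case 3: nobody has it.
        return None


# Hypothetical setup mirroring the names used later in this article.
pypi_store = Repository("pypi-store", packages={("requests", "2.31.0")})
my_repo = Repository("my_repository", upstreams=[pypi_store])

print(my_repo.resolve("requests", "2.31.0"))  # pypi-store (served by the upstream)
print(my_repo.resolve("requests", "2.31.0"))  # my_repository (reference now kept locally)
print(my_repo.resolve("left-pad", "1.0.0"))   # None (the 404 case)
```

The second call is answered by my_repository itself, which is the behaviour that makes the upstream chain act like a mirror that fills up on demand.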

Cool, right? Because a reference to the package version is kept in my_repo, future requests for it are served directly.

This is precisely the strategy we are going to use, as AWS CodeArtifact allows us to define a repository that has an external connection to PyPI and use it as an upstream repository.

Creating AWS CodeArtifact resources with Terraform

As AWS CodeArtifact is an AWS service, you can create a VPC endpoint in your environment’s VPC to connect to it.

Note: I’m using Terraform v1.6.4 and aws provider v5.38.0

locals {
  region = "us-east-1"
}

# Security group attached to the VPC endpoint
resource "aws_security_group" "vpce_sg" {
  name        = "AllowTLS"
  description = "Allow TLS inbound traffic from the VPC"
  vpc_id      = aws_vpc.your_vpc.id

  tags = {
    Name = "allow_tls_for_vpce"
  }
}

# Allow HTTPS (443) from any address inside the VPC
resource "aws_vpc_security_group_ingress_rule" "allow_tls_ipv4" {
  security_group_id = aws_security_group.vpce_sg.id
  cidr_ipv4         = aws_vpc.your_vpc.cidr_block
  from_port         = 443
  ip_protocol       = "tcp"
  to_port           = 443
}

# Endpoint policy: only the SageMaker execution role may use the endpoint
data "aws_iam_policy_document" "codeartifact_vpce_base_policy" {
  statement {
    sid    = "EnableRoles"
    effect = "Allow"
    actions = [
      "codeartifact:GetAuthorizationToken",
      "codeartifact:GetRepositoryEndpoint",
      "codeartifact:ReadFromRepository",
      "sts:GetServiceBearerToken"
    ]
    resources = [
      "*",
    ]
    principals {
      type = "AWS"
      identifiers = [
        aws_iam_role.your_sagemaker_execution_role.arn
      ]
    }
  }
}

resource "aws_vpc_endpoint" "codeartifact_api_vpce" {
  vpc_id            = aws_vpc.your_vpc.id
  service_name      = "com.amazonaws.${local.region}.codeartifact.api"
  vpc_endpoint_type = "Interface"
  # aws_subnets is a data source, so it must be referenced with the data. prefix
  subnet_ids        = data.aws_subnets.your_private_subnets.ids

  security_group_ids = [
    aws_security_group.vpce_sg.id,
  ]

  private_dns_enabled = true
  policy              = data.aws_iam_policy_document.codeartifact_vpce_base_policy.json
  tags                = { Name = "codeartifact-api-vpc-endpoint" }
}
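Since the endpoint is created with private_dns_enabled = true, one quick sanity check from inside the VPC is to confirm that the regional CodeArtifact hostname resolves to private addresses only. The sketch below is one possible way to do that with the Python standard library; the helper names resolve_ipv4 and all_private are hypothetical, and the hostname in the usage note assumes the us-east-1 region used above.

```python
import ipaddress
import socket
from typing import Iterable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its distinct IPv4 addresses (needs working DNS)."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def all_private(addresses: Iterable[str]) -> bool:
    """True only if at least one address is given and every one is in a private range."""
    addresses = list(addresses)
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )
```

From a notebook in the private subnets, all_private(resolve_ipv4("codeartifact.us-east-1.amazonaws.com")) should then return True; a False result suggests that private DNS is not taking effect for the endpoint.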

Then, you’ll have to create the resources CodeArtifact needs to handle your requests for new Python packages by mirroring PyPI: a domain, a PyPI repository with an external connection, and a repository that defines the PyPI repository as its upstream.

resource "aws_codeartifact_domain" "my_domain" {
  domain = "my-domain"

  # Placeholder: set this to your KMS key ARN, or remove the line
  # to let CodeArtifact use the AWS managed key
  encryption_key = ""

  tags = { Name = "my-codeartifact-domain" }
}

# Repository connected to the public PyPI index
resource "aws_codeartifact_repository" "public_pypi" {
  repository = "pypi-store"
  domain     = aws_codeartifact_domain.my_domain.domain

  external_connections {
    external_connection_name = "public:pypi"
  }

  tags = { Name = "pypi-store-repository" }
}

# Repository your users query; misses are forwarded to pypi-store
resource "aws_codeartifact_repository" "my_repository" {
  repository = "my_repository"
  domain     = aws_codeartifact_domain.my_domain.domain

  upstream {
    repository_name = aws_codeartifact_repository.public_pypi.repository
  }

  tags = { Name = "my-codeartifact-repository" }
}

# Repository policy: the SageMaker execution role may read packages
data "aws_iam_policy_document" "my_repository_policy_document" {
  statement {
    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = [aws_iam_role.your_sagemaker_execution_role.arn]
    }

    actions   = ["codeartifact:ReadFromRepository"]
    resources = [aws_codeartifact_repository.my_repository.arn]
  }
}

resource "aws_codeartifact_repository_permissions_policy" "my_repository_policy" {
  repository      = aws_codeartifact_repository.my_repository.repository
  domain          = aws_codeartifact_domain.my_domain.domain
  policy_document = data.aws_iam_policy_document.my_repository_policy_document.json
}

There you have it! You now have a PyPI mirror for your private environment.

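To make this usable, you also have to point pip at the new index. Fortunately, the AWS CLI provides a command that does the heavy lifting for you: it fetches an authorization token and updates pip’s configuration. Note that --repository and --domain expect the repository and domain names (here my_repository and my-domain), not ARNs. Run it in your environment before installing packages:

aws codeartifact login --tool pip --repository $CODE_ARTIFACT_REPOSITORY_NAME --domain $CODE_ARTIFACT_DOMAIN_NAME --domain-owner $ACCOUNT_ID --region $REGION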
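If you would rather do the same thing from Python, for example in a notebook startup script, the login step roughly amounts to fetching a token, fetching the repository endpoint, and handing pip an index URL that embeds the token. The sketch below assumes boto3 is available in your image; build_pip_index_url and fetch_pip_index_url are hypothetical helper names, while get_authorization_token and get_repository_endpoint are the boto3 CodeArtifact client calls.

```python
from urllib.parse import quote


def build_pip_index_url(repository_endpoint: str, auth_token: str) -> str:
    """Embed the token as basic-auth credentials (user "aws") and target the /simple/ index."""
    scheme, rest = repository_endpoint.split("://", 1)
    return f"{scheme}://aws:{quote(auth_token, safe='')}@{rest.rstrip('/')}/simple/"


def fetch_pip_index_url(domain: str, domain_owner: str, repository: str, region: str) -> str:
    """Ask CodeArtifact for a token and the PyPI endpoint, then build the index URL."""
    import boto3  # imported lazily so build_pip_index_url works without boto3 installed

    client = boto3.client("codeartifact", region_name=region)
    token = client.get_authorization_token(
        domain=domain, domainOwner=domain_owner
    )["authorizationToken"]
    endpoint = client.get_repository_endpoint(
        domain=domain, domainOwner=domain_owner, repository=repository, format="pypi"
    )["repositoryEndpoint"]
    return build_pip_index_url(endpoint, token)
```

The resulting URL can be passed to pip install --index-url, or stored with pip config set global.index-url. The token is temporary, so the URL has to be regenerated once it expires.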

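Last but not least, make sure the VPC endpoint for AWS CodeArtifact defined at the start of this section is deployed in your VPC; without it, the login command cannot reach the service from your private subnets. Depending on your setup, you will likely also need the companion com.amazonaws.<region>.codeartifact.repositories interface endpoint and an Amazon S3 gateway endpoint, which the AWS documentation lists as requirements for downloading packages inside a VPC.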

Blogging about Data Science, Cloud Engineering, Data Engineering, Machine Learning and side hustles. Connect with me: https://www.linkedin.com/in/florentpajot/