
Snowflake S3 integration

Access files stored in S3 as a regular table in Snowflake

Photo by Kelly Sikkema on Unsplash

Snowflake integration objects enable us to connect to external systems from Snowflake. In my last blog, we went through how to create an API integration to invoke an AWS Lambda function. In this blog, we will see how to integrate with AWS S3 so that we can access and query data stored there directly from Snowflake.

Image by author

Implementation Details

AWS IAM Role

For Snowflake to be able to talk to our AWS account, we need to add a cross-account role. This role will be assumed by an IAM identity in the Snowflake account to perform actions in our AWS account.

  • Create a cross-account IAM role in your AWS account. Put a dummy account ID in the trust policy for now; we will update the trust policy later.
  • Attach an inline policy to the role created above.
Version: '2012-10-17'
Statement:
- Effect: Allow
  Action:
  - s3:GetObject
  - s3:GetObjectVersion
  Resource: arn:aws:s3:::my-snowflake-data-bucket/dev/gold/*
- Effect: Allow
  Action: s3:ListBucket
  Resource: arn:aws:s3:::my-snowflake-data-bucket

The above policy allows the role to access the S3 bucket only under a specific prefix. As a security practice, grant only the minimum access Snowflake needs.

Snowflake Integration Object

  • Create a Storage integration object
create or replace storage integration my_s3_int_01
  type = external_stage
  storage_provider = s3
  enabled = true
  storage_aws_role_arn = '<IAM_ROLE_ARN>'
  storage_allowed_locations = ('s3://my-snowflake-data-bucket/dev/gold/');

When run in Snowflake, the above command creates an S3 storage integration object. This object creates the IAM identity in the Snowflake account. All API actions in our AWS account will be performed by this user by assuming the role referenced in storage_aws_role_arn.

As mentioned, the integration object creates a user and an external ID. We need those to add a trust relationship between the AWS and Snowflake accounts. To get them, we describe the integration object we just created and note down the values of STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID.

describe integration my_s3_int_01;

OUTPUT:

Image by author
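
If you prefer to pull just these two values with SQL instead of scanning the full describe output, a query over RESULT_SCAN of the last statement works. This is an optional convenience, not a required step:

describe integration my_s3_int_01;

-- keep only the two properties needed for the AWS trust policy
select "property", "property_value"
from table(result_scan(last_query_id()))
where "property" in ('STORAGE_AWS_IAM_USER_ARN', 'STORAGE_AWS_EXTERNAL_ID');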

Once we have the user ARN and external ID, we update the trust relationship of the role created in the first step: add the value of STORAGE_AWS_IAM_USER_ARN under Principal and STORAGE_AWS_EXTERNAL_ID under Condition.

Version: '2012-10-17'
Statement:
- Effect: Allow
  Principal:
    AWS: "<STORAGE_AWS_IAM_USER_ARN>"
  Action: sts:AssumeRole
  Condition:
    ForAnyValue:StringLike:
      sts:ExternalId:
      - "<STORAGE_AWS_EXTERNAL_ID>"

We need to tell Snowflake about any special formatting in the files stored in the S3 bucket. This is only needed if the file formats Snowflake provides out of the box are not enough.

  • (Optional) Create file format
CREATE OR REPLACE FILE FORMAT my_csv_format_01
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY='"';
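
If your CSV files carry a header row or empty strings that should be treated as NULL, the file format can handle that as well. The variant below is only an illustrative sketch; adjust the options to your data:

CREATE OR REPLACE FILE FORMAT my_csv_format_01
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY='"'
  SKIP_HEADER = 1
  NULL_IF = ('', 'NULL');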

Snowflake External Stage

Once the access and format setup is done, we create the stage. A stage stores the metadata of the external files, in our case the files in S3. It is used to find the data that needs to be loaded into Snowflake table(s). We have created a simple stage; you can also look at other options like encryption.

  • Create External stage
create or replace stage my_s3_stage_01
  storage_integration = my_s3_int_01
  url = 's3://my-snowflake-data-bucket/dev/gold/'
  file_format = my_csv_format_01;

The above command creates a mapping between Snowflake and the S3 file prefix. It also tells Snowflake which file format is suitable for the data stored in S3.
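
Before defining the external table, a quick check that the stage actually resolves to files under the prefix can save debugging time. This is optional and assumes the data files already exist in the bucket:

list @my_s3_stage_01;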

  • Create External table
create or replace external table ext_ccfinput_test_01
  with location = @my_s3_stage_01/
  auto_refresh = true
  file_format = (type = CSV)
  pattern='.*ganalytics.*[.]csv';
describe stage my_s3_stage_01;

Similar to how the stage stores information on where to find the data, external tables store file-level metadata, such as the filename, a version identifier, and related properties. This adds an abstraction layer that allows us to query the data as if it were present in Snowflake.
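
Since the external table exposes each CSV record as a VARIANT in its VALUE column (with positional fields c1, c2, ...), individual fields can be cast to proper types. The column aliases below are placeholders, not real column names from the data:

-- cast positional fields out of the VALUE variant; aliases are hypothetical
select
  value:c1::string as col_1,
  value:c2::string as col_2
from ext_ccfinput_test_01
limit 10;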

Checking Results

Now we are ready to query the data from S3 in Snowflake. Here we issue a select statement directly against the stage, referencing columns by position.

select t.$1, t.$2, t.$3, t.$4, t.$5, t.$6 from @my_s3_stage_01 as t;

And voila, we get a result that matches the content of the S3 data files.
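
If the goal is to also materialize this data into a native Snowflake table rather than only query it in place, the same stage and file format can feed a COPY INTO statement. The target table below is hypothetical; the column names and all-string types are placeholders:

-- hypothetical landing table; adjust column names/types to the actual CSV layout
create or replace table ganalytics_raw (
  c1 string, c2 string, c3 string,
  c4 string, c5 string, c6 string
);

copy into ganalytics_raw
  from @my_s3_stage_01
  pattern = '.*ganalytics.*[.]csv'
  file_format = (format_name = 'my_csv_format_01');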

In this blog, we saw how we can access and query data stored in S3 from Snowflake.

Happy Snowflaking!!

