Making Sense of Big Data

Loading CSV data into Azure Synapse Analytics by using PolyBase

A step-by-step guide to importing CSV data from ADLS Gen2 to Azure Synapse Analytics by using PolyBase

Dhyanendra Singh Rathore
Towards Data Science
7 min read · Nov 11, 2020


Photo by Markus Winkler on Unsplash

Azure Synapse Analytics SQL pool supports various data loading methods. The fastest and most scalable way to load data is through PolyBase. PolyBase is a data virtualization technology that can access external data stored in Hadoop or Azure Data Lake Storage via the T-SQL language.

PolyBase shifts the data loading paradigm from ETL to ELT: the data is first loaded into a staging table, then transformed, and finally loaded into the production tables.

In this article, we load a CSV file from an Azure Data Lake Storage Gen2 account to an Azure Synapse Analytics data warehouse by using PolyBase. We will look at the detailed steps to carry out the loading procedure.

Caution: Microsoft Azure is a paid service, and following this article can cause financial liability to you or your organization.

Please read our terms of use before proceeding with this article: https://dhyanintech.medium.com/disclaimer-disclosure-terms-of-use-fb3bfbd1e0e5

Prerequisites

  1. An active Microsoft Azure subscription
  2. Azure Data Lake Storage Gen2 account with CSV files
  3. Azure Synapse Analytics data warehouse

To use PolyBase, you need to define external tables in your SQL pool before loading. PolyBase uses external tables to define and access the data in Azure storage. An external table contains the table schema and points to data that is stored outside the SQL pool. Defining external tables involves creating three objects: the external data source, the external file format, and the external table definition.

External tables store only metadata; the data itself is not persisted in the SQL pool but remains in external storage and is read at query time. External tables can be queried like any other table.

External Data Source

The external data source object provides the connection information required to connect to the external data source itself, in our case, the ADLS account. Let’s begin by collecting the URL and the access key. Sign in to the Azure Portal, and navigate to your storage account. Click on Access Keys and copy the Key and the Storage account name to a notepad.

Azure Storage Account: Access Keys (Image by author)

Create a database scoped credential

The next step is to create a database scoped credential to secure the credentials to the ADLS account. Create a database master key if one does not already exist, and then encrypt the database-scoped credential named ADLS_Credential using the master key.

The database master key is used to encrypt the private keys of certificates and keys that are present in the database.

A database scoped credential contains the authentication information required to connect to an external resource.

You can run the T-SQL statements below in SSMS, in another client tool of your choice, or in the Query editor (preview) found under the Common Tasks section of your Synapse instance in the Azure portal. Replace the storage account name, storage account key, and password with your own values.

-- Create a database master key if one does not already exist, using your own password.
-- This key is used to encrypt the credential secret in the next step.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Pa$$word123';

-- IDENTITY = '<storage_account_name>' , SECRET = '<storage_account_key>'
CREATE DATABASE SCOPED CREDENTIAL ADLS_Credential
WITH
    IDENTITY = 'dlspolybasestorage',
    SECRET = 'zN9S8mggblYX/sEiz5DVRmcZWjPw65A393bzD9T5rQjo+LnI5GAGrjdLvt4iqK5YEWMSMqV82IsVm3Bww6uw=='
;

Follow our article for instructions on how to connect and access Azure Synapse Analytics from SSMS:

Azure Synapse Analytics: Query editor (Image by author)

Further reading on the database master key and the database scoped credentials:

Create an external data source connection

Use the database scoped credential to create an external data source named AzureStorage. The location URL points to the container named csvstore in the ADLS Gen2 account. The type HADOOP is used for both Hadoop-based and Azure Blob storage-based external sources. Modify the location to refer to your own storage account and container.

-- Note this example uses a ADLS Gen2 secured endpoint (abfss)
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH
( LOCATION = 'abfss://csvstore@dlspolybasestorage.dfs.core.windows.net' ,
CREDENTIAL = ADLS_Credential ,
TYPE = HADOOP
);

Further reading on creating external data source:

External File Format

The external file format object specifies how the data in the file is structured: how rows are terminated and which column separators are used. Run the query below to define an external file format named csvFile. For this exercise, we’re using a CSV file available here. The file has 4,167 data rows and a header row.

FORMAT_TYPE indicates to PolyBase that the format of the text file is DelimitedText. FIELD_TERMINATOR specifies the column separator. STRING_DELIMITER specifies the delimiter for data of type string. FIRST_ROW specifies the row number that is read first during the PolyBase load; setting it to 2 skips the header row when the data is loaded. USE_TYPE_DEFAULT specifies how to handle missing values; FALSE means all missing values are stored as NULL. ENCODING specifies the encoding of the external file.

-- Create an external file format for DELIMITED (CSV/TSV) files. 
CREATE EXTERNAL FILE FORMAT csvFile
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2,
USE_TYPE_DEFAULT = FALSE,
ENCODING = 'UTF8' )
);

Further reading on creating an external file format:

External Tables

The external table object uses the external data source and external file format objects to define the external table structure within Azure Synapse Analytics. You can then use the external table as a basis for loading data into your data warehouse.

Create an external table named dbo.FIPSLOOKUP_EXT with the column definition corresponding to your CSV file. Use a WITH clause to call the external data source definition (AzureStorage) and the external file format (csvFile) we created in the previous steps. The location denotes that the file to load is in the root folder of the data source.

-- Create an external table over the CSV data in the ADLS account
CREATE EXTERNAL TABLE dbo.FIPSLOOKUP_EXT (
UID INT NOT NULL,
iso2 VARCHAR(2) NULL,
iso3 VARCHAR(3) NULL,
code3 INT NULL,
FIPS INT NULL,
Admin2 VARCHAR(255) NULL,
provincestate VARCHAR(255) NULL,
countryregion VARCHAR(255) NULL,
latitude DECIMAL(12,9) NULL,
longitude DECIMAL(12,9) NULL,
combined_key VARCHAR(255) NULL,
population INT NULL
)
WITH (
LOCATION='./',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=csvFile
);

Run the query and refresh the Object Explorer in SSMS. At this point, we can see all three external objects we have created so far.
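If you aren’t using SSMS, you can also verify the objects through the catalog views. The view names below are standard; the object names are the ones created earlier in this article:

```sql
-- List the external objects created so far
SELECT name FROM sys.external_data_sources;  -- expect: AzureStorage
SELECT name FROM sys.external_file_formats;  -- expect: csvFile
SELECT name FROM sys.external_tables;        -- expect: FIPSLOOKUP_EXT
```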

SSMS: External objects (Image by author)

Let’s run a quick SELECT query to see the data from our external table and test our external objects.
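A query along these lines, using the external table defined above, is enough to test the setup; it reads directly from the CSV file in the ADLS account (the column list is a subset of the table definition):

```sql
-- Read a few rows straight from the file in ADLS via the external table
SELECT TOP 10 UID, iso2, iso3, countryregion, population
FROM dbo.FIPSLOOKUP_EXT;
```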

SSMS: Data from the external table (Image by author)

Further reading on creating an external table:

Load destination table

We have set up PolyBase correctly; however, the data is not yet physically stored in our data warehouse. It still exists only in the ADLS account. We need to load the data into a physical table to persist it in the data warehouse.

Create a physical table in Azure Synapse Analytics. The table will have a clustered columnstore index defined on all the columns and use round-robin distribution, because round-robin is generally the best distribution method for loading data.

-- Load the data from Azure Data Lake Store Gen2 to Azure Synapse Analytics data warehouse
CREATE TABLE dbo.FIPSLOOKUP
WITH (
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM dbo.FIPSLOOKUP_EXT;

SSMS: Creating a physical table from an external table (Image by author)

Let’s quickly examine the number of rows loaded into the physical table. The total number of rows loaded is 4,167, meaning none of the rows were discarded during the load.
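The row count can be checked with a simple aggregate against the physical table created above:

```sql
-- Verify that all 4,167 data rows made it into the physical table
SELECT COUNT(*) AS row_count FROM dbo.FIPSLOOKUP;
```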

SSMS: Data from the physical table (Image by author)

That’s all, folks. You’ve successfully loaded a CSV file into your Azure Synapse Analytics data warehouse using PolyBase. If you need to transform or cleanse the data you’ve just loaded, the best place to do so is in the query that loads the data from the external table into the physical table. This is the ETL-to-ELT paradigm shift enabled by PolyBase that we discussed earlier.
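As a sketch of what that looks like, the CTAS statement from the previous step can carry the cleansing logic directly. The target table name, filter, and expressions here are purely illustrative assumptions, not part of the original load:

```sql
-- Hypothetical example: cleanse and transform while loading (ELT)
CREATE TABLE dbo.FIPSLOOKUP_CLEAN
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
    UID,
    UPPER(iso2) AS iso2,                   -- normalize country codes
    countryregion,
    COALESCE(population, 0) AS population  -- replace missing populations
FROM dbo.FIPSLOOKUP_EXT
WHERE countryregion IS NOT NULL;           -- drop incomplete rows
```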

Conclusion

We loaded CSV data into our Azure Synapse Analytics data warehouse by using PolyBase. We created three external objects and loaded a file stored in an ADLS Gen2 account into a physical data table.

Like this post? Connect with Dhyan

Let’s be friends! You can find me on LinkedIn or join me on Medium.
