And also building a robust pipeline to move data from AWS S3 into Azure File Share using Azure Data Factory

Motivation:
A long-standing problem in machine learning is that when we have multiple VMs for training, we have to download all the training files onto each VM. That eats up a lot of disk space, and we end up attaching large drives to every VM just to hold copies of the same dataset. Azure File Share overcomes this problem by sharing one storage drive across multiple VMs using the industry-standard SMB protocol. I will also show how to move data from AWS S3 directly into the Azure File Share. So without further ado, let’s get started.
Pitfall 1: There is also an NFS protocol supported on Azure Blob Storage, but it is in preview, which means you shouldn’t use it in production; you are free to use it for testing purposes.
Prerequisites:
- An Azure account with services like Azure Storage and Azure Virtual Machines enabled.
- (Optional) Azure Data Factory and an AWS account with access to S3.
Azure File Share and VM:
Everything in Azure must be created inside a logical container called a resource group. Resource groups keep related resources together so they are easy to find and manage later. So, let’s create an Azure resource group. Search for ‘Resource groups’ and click on it.

After clicking on the ‘New’ button you will be redirected to a new page. Enter the basic information and click on ‘Review + create’.
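If you prefer the command line over the portal, the same resource group can be created with one Azure CLI command; the name and region here are just placeholders that I will reuse throughout this walkthrough:

```bash
# Create a resource group (name and region are example values)
az group create \
  --name fileshare-demo-rg \
  --location eastus
```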

You have created a resource group. Now it’s time to create an Azure storage account, which will contain the Azure File Share, and two Azure virtual machines. Search for ‘Storage accounts’.

After clicking on it, fill in the basic details and hit ‘Review + create’.
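For reference, the CLI equivalent is a single command; the account name below is just a placeholder (it must be globally unique, lowercase, and 3 to 24 characters):

```bash
# Create a general-purpose v2 storage account in the resource group
az storage account create \
  --name filesharedemostore \
  --resource-group fileshare-demo-rg \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2
```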

Do the same thing for the virtual machines as well.

You need to repeat this process twice if you want to create two VMs and test the share across both; creating a single VM is fine as well. After you are done entering the basic details, hit ‘Review + create’.
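As a rough sketch, the two Ubuntu VMs could also be created from the CLI; the VM names, image alias, and size below are assumptions I made for this walkthrough, so check what is available with `az vm image list --output table` before running it:

```bash
# Create two small Ubuntu VMs (names, image alias, and size are example values)
for vm in demo-vm-1 demo-vm-2; do
  az vm create \
    --resource-group fileshare-demo-rg \
    --name "$vm" \
    --image Ubuntu2204 \
    --size Standard_B2s \
    --admin-username azureuser \
    --generate-ssh-keys
done
```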


Now, go to your storage account and select ‘File shares’. Create one file share by adding these basic details and hit ‘Create’.
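If you want to script this step too, a rough CLI equivalent looks like this; the share name and quota are arbitrary choices for this walkthrough, and the uploaded file is just a throwaway test CSV:

```bash
# Create the file share (quota is in GiB)
az storage share create \
  --name trainingdata \
  --account-name filesharedemostore \
  --quota 100

# Upload a small test file from the local machine
echo "hello,world" > test.csv
az storage file upload \
  --share-name trainingdata \
  --source ./test.csv \
  --account-name filesharedemostore
```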

Add one file inside the file share for testing purposes. The SMB protocol works on port 445, so we need to open that port on our VMs. Let’s go to the VMs that you just created and then select ‘Networking’.

Add an inbound rule for port 445 on both of your VMs by hitting ‘Add inbound port rule’.
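The same rule can be added from the CLI with `az vm open-port`, which creates the NSG rule for you; the priority just needs to be a value that is not already in use on that NSG:

```bash
# Open port 445 on both VMs (each VM created above has its own NSG)
for vm in demo-vm-1 demo-vm-2; do
  az vm open-port \
    --resource-group fileshare-demo-rg \
    --name "$vm" \
    --port 445 \
    --priority 1010
done
```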

Fill in the details as above and then hit ‘Add’. Now, SSH into your VM, because it’s time to mount the file share.

Now enter these commands into your VM.
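The exact commands depend on your setup; here is a minimal sketch for an Ubuntu VM, assuming the Azure CLI is not yet installed (the install script URL is Microsoft’s documented one for Debian/Ubuntu):

```bash
# Install the CIFS/SMB client utilities
sudo apt-get update && sudo apt-get install -y cifs-utils

# Install the Azure CLI (Debian/Ubuntu install script) and authenticate
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login
```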
At the ‘az login’ line you have to authenticate your VM so it can access the storage account. There are many ways to do this, but ‘az login’ is the quick and easy way. After you enter the command, a prompt will show up.

Go to the browser where you are logged in to the portal, open the link, and enter the code.

After you do that, you will be authenticated and will be able to type in the rest of the commands.
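The remaining commands look roughly like this; the resource group, storage account, and share names are the placeholder values used earlier in this walkthrough, so substitute your own:

```bash
# Names used earlier in this post (replace with yours)
RESOURCE_GROUP=fileshare-demo-rg
STORAGE_ACCOUNT=filesharedemostore
SHARE_NAME=trainingdata

# Fetch the storage account key
STORAGE_KEY=$(az storage account keys list \
  --resource-group "$RESOURCE_GROUP" \
  --account-name "$STORAGE_ACCOUNT" \
  --query "[0].value" --output tsv)

# Mount the file share over SMB 3.0
sudo mkdir -p /mnt/$SHARE_NAME
sudo mount -t cifs "//$STORAGE_ACCOUNT.file.core.windows.net/$SHARE_NAME" /mnt/$SHARE_NAME \
  -o "vers=3.0,username=$STORAGE_ACCOUNT,password=$STORAGE_KEY,dir_mode=0777,file_mode=0777,serverino"

# Check that the test file is visible
ls /mnt/$SHARE_NAME
```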

As you can see, the share contains the files that you wanted. To test it, create a file and check the storage account. In my case, I will create a CSV file called ‘ability_name.csv’ on the VM.
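Something as simple as this works for the write test; the CSV contents are made up purely for illustration:

```bash
# Create a small CSV on the mounted share and confirm it shows up
echo "ability,name" > /mnt/trainingdata/ability_name.csv
ls -l /mnt/trainingdata/
```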

As you can see, the file is created and is displayed here. We have done it: we have created a file share and shared it with two of our VMs.
Azure Data Factory and AWS S3 (Optional):
S3 is really cheap, so a lot of data resides there even when the production workload runs on Azure. Azure Data Factory is the perfect tool to create a pipeline between these two services to move data. So, let’s jump into it by searching for ‘Data factories’ and creating the service.

Add these basic details and hit ‘Review + create’.

Go to the newly created service and click on ‘Author & Monitor’.

Click on ‘Copy Data’.

Give it a name and leave the rest as default.

Click on ‘Create new connection’ and then select ‘Amazon S3’.

Now, let’s head over to S3. Open the AWS Management Console in a separate tab, go to IAM users, and add a user.

Give it a name and allow programmatic access.

Select ‘Attach existing policies directly’, select ‘AmazonS3FullAccess’, and create the user.

You will now be taken to a new page with your Access key ID and Secret access key. Copy these two keys to a safe place, because this is the only time you will be able to see them.
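For completeness, the same IAM user can be created from the AWS CLI; the user name is an arbitrary example, and the policy ARN is the AWS-managed AmazonS3FullAccess policy:

```bash
# Create an IAM user for Data Factory, attach S3 access, and generate keys
aws iam create-user --user-name adf-s3-copy
aws iam attach-user-policy \
  --user-name adf-s3-copy \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Prints the AccessKeyId and SecretAccessKey once; store them somewhere safe
aws iam create-access-key --user-name adf-s3-copy
```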
Now select a bucket to copy files from S3 to the file share. I already had a bucket called ‘copytofileshare’ that contains multiple CSV files. We will be copying this bucket into the file share.
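You can double-check what is in the bucket from the AWS CLI before wiring it into the pipeline:

```bash
# List the CSV files sitting in the source bucket
aws s3 ls s3://copytofileshare/
```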

Go to your Azure tab, add the access key ID and secret access key in the blanks, and hit ‘Test connection’. You should be able to connect to your S3 account.

After the test succeeds, hit ‘Create’. On the next prompt, select the bucket that you want to copy.

Now we have to create the destination. Select ‘Azure File Storage’ and then hit ‘Continue’.

Select your subscription account and test your connection.

Hit ‘Next’, leaving the default values, and you will see the pipeline run immediately.

Now check your storage account as well as your VM to see whether the files were copied.
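A quick way to verify the copy without clicking through the portal, assuming the same placeholder names as before:

```bash
# List the files now sitting in the Azure file share...
az storage file list \
  --share-name trainingdata \
  --account-name filesharedemostore \
  --output table

# ...and confirm they are visible on the VM through the mount
ls /mnt/trainingdata
```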


Conclusion:
We have done it. We successfully moved data from AWS S3 into an Azure file share and used that file share to make the files available to the Azure VMs. Those VMs can now train on the CSV data without having to download their own copies from either the Azure file share or AWS S3.
This is just one example of loading files from S3 into a file share. Data Factory supports many other sources from which you can pull data into the file share, so the choices are endless and it’s up to you what you build with it. If you encounter any problems or have difficulty following the steps, comment below on this post or message me at [email protected]. You can also connect with me on LinkedIn and GitHub.
Resources:
[1] Mounting File Share: https://docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-linux
[2] Azure Data Factory: https://docs.microsoft.com/en-us/azure/data-factory/