Train a GAN and generate faces using AWS SageMaker | PyTorch

Published in Towards Data Science

8 min read · Apr 5, 2020

Photo by “My Life Through A Lens” on Unsplash

I assume you have already heard of or worked with GANs. If you have not, a Generative Adversarial Network (GAN) is a type of neural network architecture that lets us create synthetic data, images, or videos. GANs have become an interesting subfield of deep learning. Some of the different types of GANs are DCGAN, CycleGAN, GauGAN, StyleGAN, and Pix2Pix. Because the field is so popular, new GAN papers and architectures emerge as we speak!

Although there are many different GAN architectures, they all have one thing in common: training them takes a lot of computing power, and they are GPU hungry. It is therefore difficult to train a GAN in a local environment unless you have a good distributed GPU setup, along with the time and money that requires. Otherwise, you can leverage the cloud. Cloud environments can be used to train all kinds of neural networks, not just GANs. I ran into issues while training in my local environment, so I moved to the cloud, where I was able to train easily and deploy to production quickly!

There are several cloud providers, but I feel AWS is ahead of the others in many areas. In the machine learning space in particular, AWS offers several services that can be leveraged. In this blog, we are going to look at the SageMaker service provided by AWS.

Amazon SageMaker is a fully managed service that lets us build, train, and deploy machine learning (ML) models quickly. Another big advantage of SageMaker is that models can be deployed to production faster and with much less effort. Yes, some cloud providers are cheaper than AWS; however, SageMaker offers other advantages for deployment. You can also leverage a local GPU environment, if you have one, while developing models.

In this blog, we will generate new faces (again!) by training on a celebrity faces dataset. I will use my local GPU environment (to save some bucks) for development and sanity testing, and SageMaker for training the full-fledged model. I will also show how to create an endpoint for deployment.

As there are plenty of articles on AWS account setup and local environment setup, I am going to skip that part. If you have any questions, please feel free to ask in the comments section. SageMaker can be accessed from the AWS services console page.

Now there are two options for Jupyter notebooks:

1. Local environment
2. SageMaker environment

Local Environment:
If you have a local environment with Jupyter Notebook, then congrats! You can save some bucks by using it for development and sanity testing. Install the SageMaker Python package and you can use SageMaker functions locally. If you have a CUDA-enabled GPU, you can use it to test the entire code and then submit your job to SageMaker. Below are the steps for setting up a local environment. The entire code is available on my GitHub page.

Step 1: Install package
Install the SageMaker Python package in your virtual environment: https://pypi.org/project/sagemaker/

Step 2: Connect to your AWS account
This step assumes you have created an AWS account and have access to SageMaker and an S3 bucket. You can also set the access key, secret key, and region in your .aws/config file.

You also need an IAM role for SageMaker execution. It needs full access to SageMaker.

import boto3
import sagemaker

# Create a SageMaker session from explicit AWS credentials
sagemaker_session = sagemaker.Session(boto3.session.Session(
    aws_access_key_id='xxxxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    region_name='update your AWS region'))

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/dcgan'
role = 'sagemaker_execution_role'

You can test the connection by uploading test data to the S3 bucket with the following command:

input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
input_data

If you did not get any errors and the data is in the S3 bucket, you are good to start. If you get an error, debug and correct the issue first. Alternatively, you can provide the S3 bucket link here and download the data from S3 for local testing.

Sagemaker Environment:
If you do not have a local environment, you can start a SageMaker Jupyter notebook. This will spin up a compute instance and deploy the containers required for Jupyter notebooks.

Step 1: Launch Notebook
Go to the Notebook instances section of SageMaker and create a notebook instance.

On the next screen, you can set up the S3 bucket and IAM role. Select the size of the cloud instance and the other technical details depending on your needs.

Now, we can go ahead and click “Create”. AWS takes some time to get the notebook ready. In the meantime, the console shows the notebook instance as “Pending”.

Once it is ready, click “Open Jupyter”. You are now ready to start training your GAN.

GAN Model Training:

I am using PyTorch to train the GAN model. Before training, the data requires some pre-processing. If you are using a local environment, you need to upload the data to the S3 bucket. Below are the processing steps you need to perform.

1. Transform the input images to a common size

import torch
from torchvision import datasets, transforms


def get_dataloader(batch_size, image_size, data_dir):
    """
    Batch the neural network data using DataLoader
    :param batch_size: The size of each batch; the number of images in a batch
    :param image_size: The square size of the image data (x, y)
    :param data_dir: Directory where image data is located
    :return: DataLoader with batched data
    """

    # Resize each image and convert it to a tensor with values in [0, 1]
    transform = transforms.Compose([transforms.Resize(image_size),
                                    transforms.ToTensor()])

    dataset = datasets.ImageFolder(data_dir, transform=transform)

    # For quick sanity tests, sample a small random subset instead:
    # rand_sampler = torch.utils.data.RandomSampler(dataset, num_samples=32, replacement=True)
    # dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=rand_sampler)

    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

    return dataloader

While testing, you can use a random sampler on the input dataset and pass it to the data loader.
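As a rough sketch of that idea, the snippet below draws a fixed number of random samples per pass using `RandomSampler`. The `TensorDataset` of random tensors is only a stand-in for the `ImageFolder` dataset, and the sizes (200 images, 32 samples, batches of 8) are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Stand-in dataset (assumption): 200 fake 3x32x32 "images" with dummy labels.
images = torch.rand(200, 3, 32, 32)
labels = torch.zeros(200, dtype=torch.long)
dataset = TensorDataset(images, labels)

# Draw only 32 random samples (with replacement) per pass for a quick sanity test.
sampler = RandomSampler(dataset, replacement=True, num_samples=32)

# A sampler and shuffle=True are mutually exclusive, so shuffle keeps its default.
sample_loader = DataLoader(dataset, batch_size=8, sampler=sampler)

# Collect the shape of every batch produced in one pass over the loader.
batch_shapes = [tuple(batch.shape) for batch, _ in sample_loader]
```

With these numbers, one pass yields four batches of eight images, which is usually enough to confirm that the training loop runs end to end before paying for cloud time.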

2. Scale the images

Scaling the images is an important step when training a neural network, and this is particularly true when training a GAN.

def scale(x, feature_range=(-1, 1)):
    ''' Scale takes in an image x and returns that image, scaled
       with a feature_range of pixel values from -1 to 1.
       This function assumes that the input x is already scaled from 0-1.'''
    # assume x is scaled to (0, 1)
    # scale to feature_range and return scaled x

    # use distinct names to avoid shadowing the built-in min() and max()
    range_min, range_max = feature_range
    x = x * (range_max - range_min) + range_min

    return x
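To make the arithmetic concrete, here is a small worked example of the same scaling. It uses plain floats rather than image tensors (an assumption made to keep it self-contained); the same expression applies elementwise to a tensor. A range of -1 to 1 is commonly chosen because DCGAN-style generators typically end in a tanh layer, which outputs values in that range.

```python
def scale(x, feature_range=(-1, 1)):
    """Map values assumed to lie in [0, 1] onto feature_range."""
    low, high = feature_range
    return x * (high - low) + low


# Plain floats are used here; the same arithmetic applies elementwise to tensors.
darkest = scale(0.0)    # 0.0 * 2 - 1 = -1.0
mid_gray = scale(0.5)   # 0.5 * 2 - 1 =  0.0
brightest = scale(1.0)  # 1.0 * 2 - 1 =  1.0
```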

3. Create the model

When training a GAN, two networks need to be trained: a generator and a discriminator. The input to the generator comes from a latent space, that is, random noise. The generator is trained to generate an image, and the discriminator is trained to detect whether an image is real or fake. The final outcome of playing this game between the generator and the discriminator is a generator whose output looks like real images.
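This game is usually written as the minimax objective from the original GAN formulation. Here D(x) is the discriminator's estimated probability that an image x is real, and G(z) is the image the generator produces from noise z. The notation is the standard one and is not taken from this article's code.

```latex
\[
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
\]
```

The discriminator tries to make both terms large by scoring real images near 1 and generated images near 0, while the generator tries to make the second term small by fooling the discriminator.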

As mentioned before, there are other GAN architectures, but this is the core idea behind all of them. The model code is provided in model.py in the GitHub repo. I have written a DCGAN model using convolution t

4. Training the model

This is the step where we are going to leverage the cloud. Before running many epochs in SageMaker, test the complete workflow in a local environment with sample data.

Some hyperparameters need to be tuned, such as the learning rate, beta1, and beta2. I have taken the values from this paper: https://arxiv.org/pdf/1511.06434.pdf
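As a minimal sketch of how these values might be wired up, the snippet below creates one Adam optimizer per network. The learning rate of 0.0002 and beta1 of 0.5 are the values reported in the DCGAN paper; keeping beta2 at Adam's default of 0.999 is an assumption, and the two `nn.Linear` layers are hypothetical stand-ins for the real networks in model.py.

```python
import torch.nn as nn
import torch.optim as optim

# lr and beta1 as reported in the DCGAN paper; beta2 is Adam's default (assumption).
lr, beta1, beta2 = 0.0002, 0.5, 0.999

# Tiny stand-in networks (hypothetical); the real ones live in model.py.
D = nn.Linear(4, 1)
G = nn.Linear(2, 4)

# One optimizer per network, so each side of the game is updated separately.
d_optimizer = optim.Adam(D.parameters(), lr=lr, betas=(beta1, beta2))
g_optimizer = optim.Adam(G.parameters(), lr=lr, betas=(beta1, beta2))
```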

Once sanity testing is done, it is time to submit the job to SageMaker. Create an estimator object using the SageMaker PyTorch API and call the fit method.

from sagemaker.pytorch import PyTorch

# Note: SageMaker Python SDK v2 renamed train_instance_count/train_instance_type
# to instance_count/instance_type.
estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=4,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 15,
                        'conv_dim': 64,
                    })

estimator.fit({'training': input_data})

Some points to note on the above code:

You can change the ML framework. SageMaker supports all major frameworks, such as PyTorch and TensorFlow.
The source directory containing all the code needs to be specified, as shown in my GitHub repository.
The PyTorch framework version needs to be specified. The train directory should also contain a requirements.txt file listing all the packages used in data processing and training.
The instance type depends on how large a compute instance you need. For training a GAN, I would pick at least ml.p2.xlarge, as it comes with a GPU. A GPU-enabled instance is recommended; otherwise, the model will take forever to train.
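On the script side, SageMaker passes the estimator's hyperparameters to the entry point as command-line flags and exposes paths through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING`. The sketch below shows how a train.py could read them. It is an illustration under those assumptions, not the actual script from the repo; the default values and the `parse_args` helper are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths for the training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters from the estimator arrive as command-line flags.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--conv_dim", type=int, default=32)

    # SageMaker sets these environment variables inside the training container;
    # the fallbacks (hypothetical local folders) make the script runnable locally.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "data"))

    return parser.parse_args(argv)


# Simulate the flags SageMaker would pass for the estimator shown above.
args = parse_args(["--epochs", "15", "--conv_dim", "64"])
```

The `training` key used in the call to fit is what determines the `SM_CHANNEL_TRAINING` variable name, so the channel name and the environment variable need to stay in sync.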

Once you call the fit method, it should print logs showing that it is starting the compute instances and training the model.

In the log output, different colors indicate different compute instances. We also print the discriminator and generator losses. Now you can leave it to train until it completes.

If training takes a long time, your kernel session will likely end. Worry not: since we are training in the cloud, we can easily re-attach to the training job with the code below. The job name can be found in the SageMaker console.

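# attach is a class method, so it also works in a fresh kernel (PyTorch comes from sagemaker.pytorch)
estimator = PyTorch.attach('sagemaker-job-name-2020-xxxxx')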

Once the model is trained, you are good to deploy it.

5. Deploy the model

Deploy the model to another compute instance with less compute power. However, if you need a GPU for prediction, use ml.p2.xlarge or above. The model can also be served in a distributed fashion by increasing the instance count parameter.

predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')

After deployment, you can get the endpoint name from the predictor or from the SageMaker console.

6. Results: Generate faces

After deploying the model, it is time to generate faces from our trained model.

import numpy as np
import torch

# Generate random noise
fixed_z = np.random.uniform(-1, 1, size=(16, 100))
fixed_z = torch.from_numpy(fixed_z).float()

sample_y = predictor.predict(fixed_z)
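To view the results, the generator output has to be converted back into ordinary image arrays. The sketch below assumes the endpoint returns values in the range -1 to 1 with shape (N, C, H, W), which is typical for a DCGAN generator but is not confirmed by the code shown here. The `to_uint8_images` helper is hypothetical, and random noise stands in for `sample_y`.

```python
import numpy as np


def to_uint8_images(samples):
    """Convert generator output in [-1, 1], shape (N, C, H, W), to uint8 (N, H, W, C)."""
    samples = np.asarray(samples, dtype=np.float32)
    samples = np.clip(samples, -1.0, 1.0)
    # Undo the (-1, 1) scaling and stretch to the 0-255 pixel range.
    samples = (samples + 1.0) * 255.0 / 2.0
    # Move the channel axis last, the layout most image viewers expect.
    return samples.round().astype(np.uint8).transpose(0, 2, 3, 1)


# Stand-in for sample_y (assumption): 16 random "images" of size 3x32x32.
fake_output = np.random.uniform(-1, 1, size=(16, 3, 32, 32))
images = to_uint8_images(fake_output)
```

Each `images[i]` is then a height x width x 3 array that can be displayed, for example with matplotlib's `imshow`.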

I have added all the files to my GitHub repo.

Production:

Once the model and endpoint are deployed, we can create an AWS Lambda function that is invoked via API Gateway. The resulting API can be used to generate images from any application.

All the code and packages can be found on my GitHub. I hope you can make use of the repo along with this story.
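A Lambda function for this could look roughly like the sketch below. It is not code from the repo. It assumes the endpoint accepts and returns JSON (the endpoint's input and output handlers would need to support that), that the endpoint name is supplied through a hypothetical `ENDPOINT_NAME` environment variable, and that the caller passes a hypothetical query parameter `n` for the number of faces.

```python
import json
import os
import random

# Hypothetical endpoint name, read from an environment variable on the Lambda function.
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "dcgan-endpoint")
Z_SIZE = 100  # latent vector length, matching the noise used earlier


def make_noise(n_samples, z_size=Z_SIZE):
    """Return n_samples latent vectors with values drawn uniformly from [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(z_size)] for _ in range(n_samples)]


def generate(runtime_client, n_samples):
    """Send noise to the endpoint and return the decoded JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",  # assumes the endpoint accepts JSON
        Body=json.dumps(make_noise(n_samples)),
    )
    return json.loads(response["Body"].read())


def lambda_handler(event, context):
    """Entry point for an API Gateway proxy request such as GET /faces?n=4."""
    import boto3  # imported here so the helpers above can be tested without AWS

    params = event.get("queryStringParameters") or {}
    n_samples = int(params.get("n", 1))
    client = boto3.client("sagemaker-runtime")
    images = generate(client, n_samples)
    return {"statusCode": 200, "body": json.dumps({"images": images})}
```

Passing the runtime client into `generate` keeps the AWS call in one place, so the request-building logic can be exercised locally with a fake client before the function is deployed.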

Questions? Comments? Feel free to leave your feedback in the comments section.

Get Code

To get the complete working code for the article and other updates, please subscribe to my newsletter.


Written by Shyam BV