
Train Neural Network on Cloud with One Line of Code

The easiest serverless model training tool you may have ever seen

Training a Tensorflow Model on AWS

Hi everyone,

This is Yuki from Aipaca Inc., a 4th-year Computer Engineering student at the University of Waterloo. During my data science internship, I found it very time-consuming to set up infrastructure and train models on the cloud. As a result, I teamed up with Cody, Xin, Lounan, and Sam to develop a serverless model training tool – Aibro. As you can see from the code snippet above, Aibro makes cloud model training as easy as a single line of Python code. This post will walk you through the tool step by step.

Colab Demo Link & Our Website

Note: the current Aibro version only supports Tensorflow-based model training. We will add support for more frameworks, such as PyTorch and scikit-learn, in the near future.


About Aibro

Aibro is an MLOps API that connects machine learning models to any Cloud server. It helps data scientists and machine learning researchers easily train & deploy models on cloud platforms without worrying about the infrastructure setup.


Workflow

Step 1: Prepare model & Data

As an example, we use MNIST as the training data, then process it into a shape that fits the following CNN.

from tensorflow.keras.datasets import mnist
import tensorflow as tf

# load data
TRAINING_SIZE = 10000
(train_X, train_Y), (test_X, test_Y) = mnist.load_data()
validation_X = train_X[TRAINING_SIZE:(TRAINING_SIZE + 40)].reshape((40, 28, 28, 1))
validation_Y = train_Y[TRAINING_SIZE:(TRAINING_SIZE + 40)]

# reshape the training images and one-hot encode the labels
train_X = train_X[:TRAINING_SIZE].reshape((TRAINING_SIZE, 28, 28, 1))
train_Y = train_Y[:TRAINING_SIZE]
train_Y = tf.keras.utils.to_categorical(train_Y, 10)
validation_Y = tf.keras.utils.to_categorical(validation_Y, 10)
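If you want to sanity-check the one-hot encoding step without TensorFlow installed, `to_categorical` has a minimal NumPy equivalent. The helper name `to_one_hot` below is ours, not part of any library:

```python
import numpy as np

def to_one_hot(labels, num_classes):
    # minimal equivalent of tf.keras.utils.to_categorical for integer labels
    out = np.zeros((len(labels), num_classes), dtype="float32")
    out[np.arange(len(labels)), labels] = 1.0
    return out

y = to_one_hot(np.array([3, 0, 9]), 10)
print(y.shape)  # (3, 10)
```

Each row contains a single 1.0 at the index of the class label, which is exactly the target format the softmax output layer and categorical cross-entropy loss below expect.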

The CNN model:

from tensorflow.keras import layers, models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation="relu"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation="relu"))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.CategoricalAccuracy()],
)

You can plug any Tensorflow-based model and data into the Colab demo. Just keep in mind that if your model is compiled with metrics=["accuracy"], the metrics may be transported incorrectly due to a known model save-and-reload issue in Tensorflow. To avoid this, specify explicit metric objects when compiling the model, as shown in the demo.

Step 2: Install Aibro Library

The current public release of Aibro is version 1.1.0; you can install it with the command below. We may ask for your email address when you call its methods, which helps us trace job logs and improve future development.

!pip install aibro -i https://test.pypi.org/simple

Step 3: Select a cloud machine

Before launching a training job, you need to select a cloud machine to train on. Currently, Aibro only supports spot instances on AWS because they are about four times cheaper than on-demand deep learning instances (e.g. SageMaker).

By the way, don’t worry about interruptions from spot instances: Aibro has a dedicated system that automatically recovers and resumes an interrupted training job.

Now, let’s get back to the demo. By calling available_machines(), you can check the rental limit of each machine and its availability. Availability is defined as the probability that a specified spot instance type will be successfully launched (because spot instance requests are not guaranteed to be fulfilled by AWS). In general, more powerful instances are less likely to be available.

from aibro.train import available_machines
available_machines()

Sample output:

Available Resources: 
Machine Id: g4dn.4xlarge   GPU Type: 1xT4     num_vCPU: 16    cost: $0.36/hr  limit: 4   availability: 96.0% 
Machine Id: g4dn.8xlarge   GPU Type: 1xT4     num_vCPU: 32    cost: $0.67/hr  limit: 2   availability: 84.0% 
Machine Id: p2.8xlarge     GPU Type: 8xK80    num_vCPU: 32    cost: $2.16/hr  limit: 2   availability: 49.0% 
Machine Id: p2.xlarge      GPU Type: 1xK80    num_vCPU: 4     cost: $0.27/hr  limit: 23  availability: 61.0% 
Machine Id: p3.2xlarge     GPU Type: 1xV100   num_vCPU: 8     cost: $0.92/hr  limit: 11  availability: 24.0% 
Machine Id: p3.8xlarge     GPU Type: 4xV100   num_vCPU: 32    cost: $3.67/hr  limit: 2   availability: 12.0%
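One way to read this table: a low hourly price means little if the launch rarely succeeds. As a rough heuristic (our own, not part of Aibro), you could rank machines by availability-adjusted cost:

```python
# (machine_id, cost_per_hr, availability) copied from the sample output above
machines = [
    ("g4dn.4xlarge", 0.36, 0.96),
    ("g4dn.8xlarge", 0.67, 0.84),
    ("p2.8xlarge",   2.16, 0.49),
    ("p2.xlarge",    0.27, 0.61),
    ("p3.2xlarge",   0.92, 0.24),
    ("p3.8xlarge",   3.67, 0.12),
]

# availability-adjusted hourly cost: cheaper machines with poor
# availability get penalized, since launch attempts may fail
ranked = sorted(machines, key=lambda m: m[1] / m[2])
best = ranked[0][0]
print(best)  # g4dn.4xlarge
```

Under this metric, g4dn.4xlarge ($0.36/hr at 96% availability) beats the nominally cheaper p2.xlarge, whose 61% availability makes a successful launch less certain.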

Good news: since we want to test the tool in as many edge cases as possible, any machine rental fee you generate will be covered by us. That said, please don’t run it in a loop! Students’ pockets are fragile 😳 .

Step 4: Launch a training job

The code snippet at the very top shows exactly how a training job is launched. When designing the method, we tried to keep the same style as fitting a Tensorflow model normally. On top of that, you only need to pass the model and the cloud machine_id to fill the function arguments, plus a description to help you identify the job.

from aibro import Training
job_id, trained_model, history = Training.online_fit(
    model=model,
    train_X=train_X,
    train_Y=train_Y,
    validation_data=(validation_X, validation_Y),
    machine_id=machine_id,
    description='Demo Run 0.0.41',
    epochs=10
)

There are 7 in-order stages once a training job starts:

  1. [LAUNCHING]: send a request to the cloud platform to open an instance
  2. [SENDING]: transmit model and data to the instance
  3. [MIRRORING]: clone your local deep learning environment to the cloud
  4. [TRAINING]: the model starts training
  5. [SUMMARY]: return a brief summary of the training job
  6. [RECEIVING]: send the trained model and other results back to the user
  7. [CLOSING]: destroy user training data and terminate the instance
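Since the stages are strictly ordered, a job's lifecycle can be modeled as a simple linear state sequence. This is our own sketch of the concept, not Aibro's internals:

```python
STAGES = ["LAUNCHING", "SENDING", "MIRRORING", "TRAINING",
          "SUMMARY", "RECEIVING", "CLOSING"]

def next_stage(current):
    # advance to the following stage; the final stage has no successor
    i = STAGES.index(current)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None

# walk a job through its whole lifecycle
stage = "LAUNCHING"
trace = [stage]
while (stage := next_stage(stage)) is not None:
    trace.append(stage)
```

A job always passes through all seven stages in order; there are no branches or skipped states in the normal flow.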

In the [TRAINING] stage, a TensorBoard address will be shown so you can track progress in real time.

After the training job is completed, the job_id, trained_model, and history objects will be returned.

You can also review your historical jobs with list_trial() and check the corresponding details by downloading TensorBoard logs with get_tensorboard_logs(job_id).

Design Principle – Saving by turning it on and off

Image by Luis Quintero from Pexels

A cloud machine is just like a tap: you are billed for every second the "water" is running, which wastes both energy and money. Aibro is designed not only to help data scientists train models on the cloud but also to minimize that waste. Every time a new training job is launched, Aibro opens a new cloud server; once the job is completed or canceled, the server is automatically terminated. The tradeoff is that you may need 2–3 minutes of patience while the instance launches.
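Because billing is per second, the saving from terminating the instance right after the job is easy to quantify. Using the g4dn.4xlarge spot rate of $0.36/hr from the sample output above (the idle scenario is our own illustration):

```python
rate_per_hr = 0.36                 # g4dn.4xlarge spot rate from the sample output
rate_per_sec = rate_per_hr / 3600

training_secs = 30 * 60            # a 30-minute training job
idle_secs = 8 * 3600               # a full workday left running afterwards

billed_active = training_secs * rate_per_sec                 # terminate right away
billed_if_left_on = (training_secs + idle_secs) * rate_per_sec  # forget the tap
print(round(billed_active, 2), round(billed_if_left_on, 2))  # 0.18 3.06
```

Leaving the "tap" running for a workday multiplies the bill for this half-hour job by 17x, which is exactly the waste the open-on-demand, terminate-on-completion design avoids.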

Note: We are adding features to mitigate the launch time. For example, in the next version we will allow the server to be held for a certain amount of time after job completion, for quick model debugging and modification. If a new job is submitted within the holding period, it will skip the [LAUNCHING] stage entirely.
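The planned holding-period behavior boils down to a timestamp check when a job arrives. This is hypothetical logic of our own, with an assumed 10-minute hold, not the actual implementation:

```python
HOLD_SECONDS = 10 * 60   # assumed 10-minute hold after a job completes

def starting_stage(now, last_job_finished_at):
    """If a new job arrives while the previous server is still held,
    it can reuse that warm server and skip the LAUNCHING stage."""
    if last_job_finished_at is not None and now - last_job_finished_at <= HOLD_SECONDS:
        return "SENDING"     # warm server available: skip LAUNCHING
    return "LAUNCHING"       # cold start: open a fresh instance

warm = starting_stage(now=1000.0, last_job_finished_at=900.0)   # 100 s after previous job
cold = starting_stage(now=1000.0, last_job_finished_at=None)    # first job ever
```

A job submitted 100 seconds after the previous one finishes would start at [SENDING], saving the 2–3 minute instance launch.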


What’s Coming Soon

Following the demo, we are developing an alpha version of Aibro, which is coming in early October. In the next release, you will see the following updates:

  • User registration: open for users to register and manage accounts.
  • Documentation: explain more details about the library.
  • Holding period: pre-open a server before a training job starts (warm-up period) and keep holding a server after a training job (cool-down period).
  • Offline fit: fire a training job and pick it up later by job_id.
  • Retrain: resume a paused training job from checkpoints.
  • Big data: support transmission and training of large datasets.

Contact Us

If you find Aibro helpful, please join our waitlist at [aipaca.ai](https://aipaca.ai), where we will share the latest Aibro updates. Meanwhile, we welcome any feedback. You can contact us by joining the Aipaca community via aipaca.ai, sending an email to [email protected], or using aibro.comm.send_message() to send us a direct message!

