MACHINE LEARNING

Pre-story
I learnt about TPU availability on Kaggle quite recently and wanted to run one of my old notebooks on a TPU. I thought it would just be a matter of switching the accelerator and making a couple of slight changes, but it turned out to be a full journey during which I learnt a lot. I would like to share it to help others take advantage of TPUs to dramatically speed up their training and, with it, their ability to iterate through experiments.
Audience
⚠️ Warning! This article assumes that you’re familiar with the basics of ML, CNNs and training them with TensorFlow and Keras, and that you’re just new to using TPUs or have some doubts about them.
📝 Note. The practical example notebook is done on Kaggle but the vast majority of the discussed points are applicable to the other environments as well.
I want this article to focus on practical application without becoming waaaay too long so let’s touch the "theory" very briefly.
Some TPUs background
Neural network training can be sped up significantly with TPUs (Tensor Processing Units), accelerators for deep learning that implement matrix multiplication in hardware and thereby drastically reduce computation time.
For more background I would recommend going through this introduction to TPUs by Google. Those who prefer a video format can refer to this YouTube video by Kaggle.
Kaggle environment
Here are some details of the environment for you to check the relevance to the one of your own:
- TPU v3-8
- Tensorflow version 2.4.1 (with TPUs)
- At the time of writing, 20 hours per week of TPU time (up to 9 hours in a single session) are available on Kaggle for free
Links to complete Jupyter notebooks
Main points
Let’s go through the main points first and later review them with some details and code snippets.
TPU is a game-changer, you need to use them if they’re available
I cannot stress it enough, TPU will crazily improve your training times (provided it works at all for your task at hand of course). Below is a table with comparisons of time per epoch for the experiments I went through.
But everything good comes with a price and to be able to use the benefits of TPUs one needs to adapt their code and data pipelines. Sometimes it would mean a complete re-write of the data pre-processing.
To be able to use TPUs
- TPUs read data exclusively from Google Cloud Storage (GCS). So, you’ll need to put your data there and/or read it from GCS if it’s already available (as in the case of Kaggle).
- The use of the tf.data.Dataset API is required as input for model.fit().
- The model must be defined within the TPUStrategy scope.
- Data augmentation must be done during data preparation (i.e. on CPU); it cannot be part of the training code because some TensorFlow operations are not supported on TPU.
- Data augmentation layers that change the width and height of the input (e.g. RandomWidth, RandomHeight with square images, or data augmentation methods that swap height and width in the case of non-square images) cannot be used.
- Models from TensorFlow Hub need to be read uncompressed or loaded directly to TPU because Cloud TPUs do not have access to the local file system which TFHub relies on.
- etc.
To optimize the usage of TPU
There are lots of things that can be done to optimize training on TPU; the list below is not exhaustive. Mainly, with TPUs being as fast as they are, the data pipeline can easily become a bottleneck:
- Adjust the batch size depending on the available hardware and tune the learning rate accordingly
- Use the steps_per_execution parameter when compiling your model
- Use dataset caching, prefetching and other data load optimizations to maximize TPU load.
- Make sure your feature dimension sizes are a multiple of 8 or 128 (depends on the chosen batch size)
- Organize your data on GCS in a certain way (I won’t do it here though)
- etc.
The dataset
The dataset used in this tutorial is from the Plant Pathology 2020 competition on Kaggle. The goal of the competition was to classify pictures of apple leaves into 4 different categories: healthy, rust, scab or multiple diseases. The dataset is relatively small: there are 1821 JPEG images in the training set and the same number in the test set.
Okay, time to get our hands dirty
⚠️ Important note
Below I will mostly omit the code that’s not relevant to TPU usage. You can see the full version in the notebooks following the provided links.
Locate the TPUs on the network
First we need to locate the TPUs on the network and instantiate a TPUStrategy that takes care of the distributed computation. If no TPU is available, the code falls back on GPU or CPU.
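A minimal sketch of this detect-and-fall-back pattern is shown below. It is not the notebook's exact code: the helper name get_strategy is hypothetical, and the except branch assumes that TPUClusterResolver raises ValueError when no TPU can be found. The default strategy returned in that case uses whatever GPU or CPU is available.

```python
import tensorflow as tf


def get_strategy():
    """Return a TPUStrategy if a TPU is reachable, else the default strategy."""
    try:
        # With no arguments the resolver looks for a TPU in the environment.
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    except ValueError:
        # Assumption: no TPU found, so fall back to GPU/CPU.
        return tf.distribute.get_strategy()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)


strategy = get_strategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)
```

On a TPU v3-8 the number of replicas would be expected to be 8; on a CPU-only machine it is 1.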
Data loading
The best practice is:
📝 For TPU training, organize your data in GCS in a reasonable number (10s to 100s) of reasonably large files (10s to 100s of MB). With too few files, GCS will not have enough streams to get max throughput. With too many files, time will be wasted accessing each individual file.
To implement this one needs to use the TFRecord file format. I won’t do it here since the dataset is small and we can just cache it in RAM, but I’d recommend following this detailed tutorial to do it.
Here, instead of using TFRecord, we’ll create a tf.data.Dataset from the filenames on GCS.
Why would we use this approach
With the current directory structure (i.e. all the images in one directory) we could have used the ImageDataGenerator.flow_from_dataframe method.
But it doesn’t seem to work with GCS and fails with the following message (unlike pd.read_csv(), which works perfectly with GCS, at least on Kaggle):
UserWarning: Found 1821 invalid image filename(s) in x_col="filename". These filename(s) will be ignored.
When called with a path to the local input, i.e. ../input/plant-pathology-2020-fgvc7/images, it works with no issues, but TPUs don’t have access to the local drive.
And besides, ImageDataGenerator has been deprecated and it’s recommended to use tf.keras.utils.image_dataset_from_directory, which is not convenient in our case with all the image files in one directory.
The steps to follow
1. Setting up and helper functions
Batch size. The ideal would be a batch size of 128*strategy.num_replicas_in_sync, but since our dataset is very small we’ll use a smaller one that still follows the general rule of thumb: your batch size should be evenly divisible by 8 or 128. Another advantage of using a small batch size is that we won’t need to tweak the learning rate; the default one should work just fine (and it looks like it does).
Steps per execution. Additionally, we’ll mitigate the small-ish batch size by using the steps_per_execution parameter of model.compile(), following this description:
📝 steps_per_execution instructs Keras to send multiple batches to the TPU at once. With this option, it is no longer necessary to push batch sizes to very high values to optimize TPU performance.
Locate training data. Also, we need to get a path to the current dataset in Google Cloud Storage. Note that the second line from the snippet below is Kaggle-specific.
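As a rough illustration of this setup (not the notebook's own snippet), the sketch below picks a batch size and resolves the dataset path. The per-replica batch size of 8 is a hypothetical value chosen only to satisfy the divisible-by-8 rule, and NUM_REPLICAS stands in for strategy.num_replicas_in_sync. The Kaggle-specific part is the kaggle_datasets lookup; outside Kaggle the sketch assumes the import fails and falls back to the local input path.

```python
# Stand-in for strategy.num_replicas_in_sync (8 cores on a TPU v3-8).
NUM_REPLICAS = 8
# Hypothetical per-replica batch size; keeps the global batch divisible by 8.
PER_REPLICA_BATCH = 8
BATCH_SIZE = PER_REPLICA_BATCH * NUM_REPLICAS

DATASET_NAME = "plant-pathology-2020-fgvc7"

try:
    # Kaggle-specific: resolves the gs:// path of an attached dataset.
    from kaggle_datasets import KaggleDatasets
    DATA_PATH = KaggleDatasets().get_gcs_path(DATASET_NAME)
except ImportError:
    # Assumed local fallback; a TPU would not be able to read from it.
    DATA_PATH = "../input/" + DATASET_NAME

IMAGES_DIR = DATA_PATH + "/images"
print(BATCH_SIZE, IMAGES_DIR)
```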
2. Prepare stratified train validation split combining the labels with the filenames
3. One-hot encode the labels to match the competition required submission format
4. Create training and validation tf.data.Dataset objects by zipping filenames and labels
5. Map the filenames Dataset to get images Dataset
6. Optimize Datasets for training
I used the code above to improve the performance of data input. I definitely recommend checking the Better performance with the tf.data API guide to make your own input pipeline efficient and thereby reduce the data loading bottleneck as much as possible.
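For readers who do not have the notebook open, steps 2 to 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the tiny random JPEGs written to a temporary directory stand in for the gs:// filenames, the image size of 64 is arbitrary, and the helpers stratified_split, load_image and make_dataset are hypothetical names.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

IMG_SIZE = 64     # hypothetical; small so the sketch runs quickly
NUM_CLASSES = 4   # healthy, multiple diseases, rust, scab

# Stand-in data: a few random JPEGs instead of the real files on GCS.
tmp_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
filenames, class_ids = [], []
for i in range(40):
    pixels = rng.integers(0, 256, size=(IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
    path = os.path.join(tmp_dir, f"img_{i}.jpg")
    tf.io.write_file(path, tf.io.encode_jpeg(pixels))
    filenames.append(path)
    class_ids.append(i % NUM_CLASSES)
filenames = np.array(filenames)
class_ids = np.array(class_ids)


def stratified_split(labels, val_fraction, seed=0):
    """Step 2: split indices so each class keeps its share in both parts."""
    split_rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = split_rng.permutation(np.where(labels == cls)[0])
        n_val = int(round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)


def load_image(filename, label):
    """Step 5: read and decode a JPEG, resize it, scale pixels to [0, 1]."""
    image = tf.io.decode_jpeg(tf.io.read_file(filename), channels=3)
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE]) / 255.0
    return image, label


def make_dataset(idx):
    """Steps 3-5: one-hot labels, zip with filenames, map to images."""
    names = tf.data.Dataset.from_tensor_slices(filenames[idx])
    labels = tf.data.Dataset.from_tensor_slices(
        tf.one_hot(class_ids[idx], NUM_CLASSES))
    ds = tf.data.Dataset.zip((names, labels))
    return ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)


train_idx, val_idx = stratified_split(class_ids, val_fraction=0.2)
train_ds = make_dataset(train_idx)
val_ds = make_dataset(val_idx)
```

The resulting datasets yield (image, one-hot label) pairs and still need the optimizations of step 6 (caching, shuffling, batching, prefetching) before training.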
⚠️ Caching and data augmentation. One important note worth stressing: make sure to cache() the dataset before data augmentation. If you cache after it, the augmented images from the first epoch are stored and reused, so the augmentation won’t be re-applied each epoch and there will be almost no point in the augmentation step. One symptom of this is the training metrics pulling away from the validation ones (overfitting), though of course overfitting can happen for other reasons too.
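The effect of the ordering can be checked on a toy dataset. In the sketch below the function noisy is a hypothetical stand-in for a random augmentation, and the dataset of eight zeros stands in for the images; it only illustrates the cache placement and is not taken from the notebook.

```python
import tensorflow as tf


def noisy(x):
    """Stand-in for a random augmentation: adds uniform noise."""
    return x + tf.random.uniform([], 0.0, 1.0)


base = tf.data.Dataset.from_tensor_slices(tf.zeros([8]))

# Cache first, augment afterwards: the augmentation re-runs every epoch.
cache_then_augment = base.cache().map(noisy)
# Augment first, cache afterwards: the first epoch's output is frozen.
augment_then_cache = base.map(noisy).cache()


def run_epoch(ds):
    """Iterate the whole dataset once and collect the values."""
    return [float(v) for v in ds]


good_epoch_1 = run_epoch(cache_then_augment)
good_epoch_2 = run_epoch(cache_then_augment)
bad_epoch_1 = run_epoch(augment_then_cache)
bad_epoch_2 = run_epoch(augment_then_cache)

print("cache -> augment differs between epochs:", good_epoch_1 != good_epoch_2)
print("augment -> cache repeats the same values:", bad_epoch_1 == bad_epoch_2)
```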
At the end of this step we have stratified and performant training and validation datasets.
Model creation
Steps per execution. Now is the time to use the steps_per_execution parameter that we talked about earlier. A value of 32 seems to work pretty well in this case.
Learning rate. And again, if you decide to use a bigger batch size you’ll most likely need to tune the learning rate instead of using the default one.
Feature dimensions. Another important thing to pay attention to when creating a model to train on TPU is feature dimensions.
📝 Note: Feature dimension refers to the hidden size of a fully-connected layer or the number of output channels in a convolution. Not all layers can conform to this rule, especially the first and last layers of the network. This is fine, most models require some amount of padding.
You can read in detail about setting feature dimensions in the Cloud TPU performance guide to decide what works best for your model and data.
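When choosing layer sizes, a small helper that rounds a width up to the next multiple of 8 or 128 can be handy. The function round_up below is a hypothetical convenience, not something provided by TensorFlow or used in the notebook.

```python
def round_up(value, multiple=8):
    """Round value up to the nearest multiple (e.g. 8 or 128)."""
    return ((value + multiple - 1) // multiple) * multiple


for units in (100, 250, 512):
    print(units, "->", round_up(units, 8), "or", round_up(units, 128))
```

Whether padding a layer up to such a size is worth it depends on the model; the performance guide mentioned above discusses the trade-offs.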
Other than that, you’d create your model as usual using the API of your preference (i.e. Sequential or Functional) and instantiate it in the scope of TPUStrategy:
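A minimal sketch of creating and compiling a model inside the strategy scope is given below. The architecture is a hypothetical small CNN chosen only for illustration (with channel and unit counts that are multiples of 8), not the model from the notebook, and the default strategy is used as a stand-in so that the example also runs without a TPU.

```python
import tensorflow as tf

# Stand-in: on Kaggle this would be the TPUStrategy created earlier.
strategy = tf.distribute.get_strategy()


def build_model(img_size=64, num_classes=4):
    """Hypothetical small CNN; feature dimensions are multiples of 8."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(img_size, img_size, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])


# Both model creation and compilation happen inside the scope.
with strategy.scope():
    model = build_model()
    model.compile(
        optimizer="adam",                  # default learning rate
        loss="categorical_crossentropy",   # labels are one-hot encoded
        metrics=["accuracy"],
        steps_per_execution=32,            # value mentioned in the text
    )
```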
Model training
Train your model as usual.
If you used ds.repeat() in the data preparation pipeline, you’ll need to provide steps_per_epoch and validation_steps to the model.fit() method. This applies to any training, not only on TPU.
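The two values are simply the number of batches needed to see each set once. The sketch below assumes a hypothetical 80/20 split of the 1821 training images and a hypothetical batch size of 64; the helper steps_for is an illustrative name.

```python
import math


def steps_for(num_examples, batch_size):
    """Number of batches needed to cover num_examples once."""
    return math.ceil(num_examples / batch_size)


NUM_TRAIN, NUM_VAL = 1456, 365   # hypothetical 80/20 split of 1821 images
BATCH_SIZE = 64                  # hypothetical

steps_per_epoch = steps_for(NUM_TRAIN, BATCH_SIZE)
validation_steps = steps_for(NUM_VAL, BATCH_SIZE)
print(steps_per_epoch, validation_steps)

# Usage with repeated datasets (model, train_ds and val_ds defined elsewhere):
# history = model.fit(train_ds, validation_data=val_ds, epochs=60,
#                     steps_per_epoch=steps_per_epoch,
#                     validation_steps=validation_steps)
```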
The first model, trained for 60 epochs, reaches approximately 80% accuracy (and it takes about 2 minutes to train on TPU!!!), but we can notice some overfitting happening.

One of the ways to fight the overfitting is to use data augmentation. Let’s dive in.
Data augmentation
There are two ways of applying data augmentation (and preprocessing in general) with TensorFlow (more details here):
- Add data augmentation layers into the model
- Include data augmentation to the dataset preparation pipeline
When running on GPU you’d want to use the first option to benefit from GPU acceleration. When running on TPU you’d have to use the second option because data augmentation layers (except for Normalization and Rescaling) are not supported.
Initially I used the first option and of course it didn’t work. Don’t be like me:
💔 Troubleshooting. If you try to .fit() a model that includes data augmentation layers on TPU you’ll get a TPU compilation failed message with an error that looks something like this:
NotFoundError: 9 root error(s) found.
(0) Not found: {{function_node __inference_train_function_183323}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
[[{{node TPUVariableReshard/reshard/_5957770828868222171/_22}}]]
(1) Invalid argument: {{function_node __inference_train_function_183323}} Compilation failure: Detected unsupported operations when trying to compile graph while/cluster_while_body_181553_10451082903516086736[] on XLA_TPU_JIT: ImageProjectiveTransformV3 (No registered 'ImageProjectiveTransformV3' OpKernel for XLA_TPU_JIT devices compatible with node {{node while/body/_1/while/sequential_28/sequential_27/random_rotation_8/transform/ImageProjectiveTransformV3}}){{node while/body/_1/while/sequential_28/sequential_27/random_rotation_8/transform/ImageProjectiveTransformV3}}One approach is to outside compile the unsupported ops to run on CPUs by enabling soft placement `tf.config.set_soft_device_placement(True)`. This has a potential performance penalty.
TPU compilation failed
etc....
I also tried [tf.config.set_soft_device_placement(True)](https://www.tensorflow.org/versions/r2.4/api_docs/python/tf/config/set_soft_device_placement)
to keep the augmentation layers within the main model with no effect whatsoever. It looks like this option works only with GPU.
⚠️ For the reasons stated above, if you want your model that uses data augmentation to train efficiently on any accelerator you’ll need to have 2 versions of it. One for GPU and the other for TPU.
Data augmentation transformations
You should not use data augmentation layers that change the width or height of the output tensor (e.g. RandomWidth, RandomHeight).
💔 Troubleshooting. Otherwise you’ll get a NotFoundError with a TPU compilation failed message again, with an error that looks something like this:
(6) Invalid argument: {{function_node __inference_train_function_248407}} Compilation failure: Dynamic Spatial Convolution is not supported: lhs shape is f32[<=8,<=274,<=286,3]
[[{{node conv2d/Conv2D}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_3495234026254819081/_5]]
[[TPUVariableReshard/default_shard_state/_4480269216609879393/_8/_123]]
(7) Not found: {{function_node __inference_train_function_248407}} No proto found for key <<NO PROGRAM AS COMPILATION FAILED>>
[[{{node TPUVariableReshard/reshard/_11705470937593059017/_16}}]]
(8) Invalid argument: {{function_node __inference_train_function_248407}} Compilation failure: Dynamic Spatial Convolution is not supported: lhs shape is f32[<=8,<=274,<=286,3]
[[{{node conv2d/Conv2D}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_3495234026254819081/_5]]
And finally (important and relevant to any hardware):
⚠️ Apply data augmentation only to the training set
It doesn’t make sense to apply it to the validation set. It doesn’t make sense to apply it to the test set either, unless you’re going to use the Test Time Augmentation (TTA) ensembling method (but then you’d do it differently).
Here’s how I implemented it:
Using data augmentation with Early Stopping (patience set to 20 epochs) improves accuracy by about 10 percentage points, to ~90% on the validation set, in about 120 epochs and without any obvious signs of overfitting.
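A dataset-level augmentation restricted to the training set might look like the sketch below. It is an assumption-laden illustration rather than the notebook's implementation: the specific transformations (flips and a brightness shift), the helper names augment and prepare, the random stand-in images and the monitored metric of the callback are all hypothetical choices. All transformations keep the image width and height unchanged.

```python
import tensorflow as tf


def augment(image, label):
    """Shape-preserving random augmentations, run in the input pipeline."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label


def prepare(ds, batch_size, training):
    """Cache first, then augment (training only), batch and prefetch."""
    ds = ds.cache()
    if training:
        ds = ds.shuffle(1024)
        ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)


# Stand-in data: 16 random images with one-hot labels for 4 classes.
images = tf.random.uniform([16, 64, 64, 3])
labels = tf.one_hot(tf.range(16) % 4, 4)
raw = tf.data.Dataset.from_tensor_slices((images, labels))

train_ds = prepare(raw, batch_size=8, training=True)
val_ds = prepare(raw, batch_size=8, training=False)   # no augmentation

# Early stopping with the patience mentioned in the text;
# monitoring val_loss is an assumption.
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)
```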

Transfer learning
There are two ways of doing Transfer learning with TensorFlow and Keras:
- Using layers available from TensorFlow Hub.
- Using the tf.keras.applications API
As far as I know, the latter can be used as usual when training on TPU. But there’s one caveat: on Kaggle the TensorFlow version that comes with TPU is 2.4.1, and the latest models (e.g. the EfficientNetV2 family) are not available in it.
To use TensorFlow Hub though some adjustments need to be made.
The easiest way to train a model that uses a TensorFlow Hub layer on TPU is to instruct TensorFlow Hub to read uncompressed models from GCS; a layer can then be created and used as usual. By default TensorFlow Hub downloads a compressed model and caches it on the local file system, which the TPU doesn’t have access to.
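One way to request uncompressed loading is the TFHUB_MODEL_LOAD_FORMAT environment variable, as in the sketch below. The helper make_hub_layer is a hypothetical name, the model handle is left for the caller to supply, and the tensorflow_hub package is assumed to be installed when the helper is actually called.

```python
import os

# Set this before any model is loaded through TensorFlow Hub.
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"


def make_hub_layer(handle, trainable=False):
    """Create a Keras layer from a TF Hub handle supplied by the caller."""
    import tensorflow_hub as hub  # imported lazily; needs tensorflow_hub
    return hub.KerasLayer(handle, trainable=trainable)
```

The returned layer can then be placed in a model built inside the strategy scope like any other Keras layer.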
An alternative would be to use tf.saved_model.LoadOptions and load the model directly to TPU:
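A sketch of this alternative is shown below. The device string "/job:localhost" is the value commonly used for this purpose and is an assumption here, as is the helper name make_hub_layer_with_options; the model handle is again left to the caller and tensorflow_hub is assumed to be installed when the helper is called.

```python
import tensorflow as tf

# Assumption: access the file system from the local host when loading.
load_options = tf.saved_model.LoadOptions(
    experimental_io_device="/job:localhost")


def make_hub_layer_with_options(handle, trainable=False):
    """Create a TF Hub Keras layer that loads with the options above."""
    import tensorflow_hub as hub  # imported lazily; needs tensorflow_hub
    return hub.KerasLayer(handle, trainable=trainable,
                          load_options=load_options)
```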
Conclusion
TPUs can mind-blowingly decrease the time required to execute one training step. However, there is some setup required to use them. Also, one needs to remember that with TPUs being very fast, I/O operations can easily become a limiting factor. Hence, achieving peak performance on TPU requires an efficient input pipeline. This article is mainly focused on CNNs, but the majority of the discussed points are also applicable to other types of deep neural networks.
That’s it for today, I hope you’ll find it useful to accelerate your training.
Do not hesitate to leave a comment or ask a question.
Further reading
Obviously, I haven’t covered everything you need to know to successfully train your ML models on TPU. Below is a list of some sources that will help you deepen your knowledge and adapt and optimize your own pipelines and models even further.
- Keras and modern convnets, on TPUs codelab by Martin Görner (I’d recommend it in general, not only in the context of training on TPU)
- How to use Kaggle/Tensor Processing Units (TPUs) on Kaggle (a nice starting point)
- Cloud TPU performance guide by Google (quite brief in my opinion)
- Better performance with the tf.data API (must read for performant input pipelines)
- Working with preprocessing layers in Keras
- List of available and unavailable Python APIs on TPU (IMHO it’s not very informative but good for a sanity check in case of doubt)