
PyTorch is one of the most popular deep learning libraries out there. It strikes one of the best balances between being easy to learn and being powerful enough to create and train models quickly. Extensions of it like PyTorch Lightning make it even easier to write up and scale up networks.
While it’s great that you can easily create and train complex models, constructing a model architecture will only get you part of the way to performing well on your task.
Achieving high performance on your task depends just as much on improvements to your data as it does on improvements to your model architecture.
Depending on your task, refining a dataset can get difficult, especially as the dimensionality of your data increases. Working with tabular data is often more straightforward than working with an image or video dataset that you can't even load into memory all at once.
PyTorch datasets provide a great starting point for loading complex datasets, letting you define a class to load individual samples from disk and then create data loaders to efficiently supply the data to your model. Problems arise when you want to start iterating on the dataset itself. PyTorch datasets are rigid, with two notable limitations. First, any change to the dataset requires a heavy code rewrite, which can waste dozens of hours over the lifetime of a project. Second, they are merely a way to load data from disk; they do not support the data visualization or exploration that can help you construct better datasets.
FiftyOne, the open-source, pandas-like tool for visual datasets that I have been working on, can work in conjunction with PyTorch (and many other tools) to help you get closer to and interact with your datasets. It provides a much more flexible representation for image and video datasets, allowing you to search, slice, and visualize them with the FiftyOne API and App without any rewrites for frequent changes. Although this article is primarily about PyTorch, other common frameworks, like TensorFlow, suffer from similar challenges.

The magic that makes FiftyOne so flexible for overcoming these PyTorch dataset limitations is in FiftyOne Views. Basically, from a general FiftyOne dataset, you can create a specific view into your dataset with one line of code; the view is then directly used to create a PyTorch Dataset.
For example, say that you trained an object detection model that is getting confused between cars, trucks, and buses. It could be beneficial to first train the model to predict them all as "vehicles". Incorporating FiftyOne into your training workflow can make this as easy as:
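For instance, here is a minimal sketch, assuming a FiftyOne dataset with labels in a `ground_truth` field has already been loaded (more on loading below):

# Map the three fine-grained classes to a single "vehicle" label
# in a view, without modifying the underlying data on disk
vehicle_view = dataset.map_labels(
    "ground_truth",
    {"car": "vehicle", "truck": "vehicle", "bus": "vehicle"},
)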

vehicle_view in the FiftyOne App (Image by author)

PyTorch datasets synergize well with FiftyOne datasets for hard Computer Vision problems like classification, object detection, segmentation, and more, since you can use FiftyOne to visualize, understand, and select the data that you then use to train your PyTorch model. The flexibility of FiftyOne datasets lets you easily experiment with and finetune the datasets you use for training and testing to create better-performing models, faster. In this blog post, I am focusing on object detection since that is one of the most common vision tasks while also being fairly complex. However, these methods work for most ML tasks. Specifically, in this post I cover:
- Loading your labeled dataset into FiftyOne
- Writing a PyTorch Object Detection dataset that utilizes your loaded FiftyOne dataset
- Exploring views into your FiftyOne dataset for training and evaluation
- Training a Torchvision object detection model on your FiftyOne dataset views
- Evaluating your models in FiftyOne to refine your dataset
- Training models outside of PyTorch on FiftyOne datasets
Follow along in Colab
You can follow along with this blog post directly in your browser through this Google Colab notebook!
Add your data to FiftyOne
Getting your data into FiftyOne is oftentimes actually easier than getting it into a PyTorch dataset. Additionally, once the data is in FiftyOne, it is much more flexible, allowing you to easily find and access even the most specific subsets of data that you can then use to train or evaluate your model.
The goal of this blog is to train an object detection model so I am just using a standard object detection dataset as an example. Specifically, I am using the COCO 2017 dataset which I can load directly from the FiftyOne dataset zoo. In the spirit of making this post easy to follow along with, I am only using a subset of the COCO dataset (the 5000 validation images and labels) and later creating custom training and validation splits from that subset.
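A minimal sketch of that zoo load:

import fiftyone.zoo as foz

# Load the 5,000 COCO 2017 validation images and labels from the zoo
dataset = foz.load_zoo_dataset("coco-2017", split="validation")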

We will need the heights and widths of the images later in this post, so we need to compute metadata for the images in our dataset:
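In FiftyOne, that is a one-liner:

# Populate each sample's metadata field (image width, height, etc.)
dataset.compute_metadata()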
Loading your custom data
If you have data that follows a certain format on disk (for example a directory tree for classification, the COCO detection format, or many more), then you can load it into FiftyOne in one line of code:
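For example, a sketch for a dataset stored in COCO detection format (the path is a placeholder):

import fiftyone as fo

dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/coco-formatted-dataset",
    dataset_type=fo.types.COCODetectionDataset,
)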
If your dataset doesn’t follow a standard format, don’t fret, it’s still really easy to get it into FiftyOne. You just need to create a FiftyOne dataset and iteratively parse your data into FiftyOne samples that are then added to the dataset.
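A sketch of that pattern for detections, where `my_data` is a hypothetical iterable of image paths and their annotations:

import fiftyone as fo

dataset = fo.Dataset("my-custom-dataset")

samples = []
for filepath, annotations in my_data:  # my_data is hypothetical
    sample = fo.Sample(filepath=filepath)
    sample["ground_truth"] = fo.Detections(
        detections=[
            # FiftyOne expects relative [x, y, width, height] boxes in [0, 1]
            fo.Detection(label=label, bounding_box=box)
            for label, box in annotations
        ]
    )
    samples.append(sample)

dataset.add_samples(samples)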
Define a PyTorch Dataset
A PyTorch dataset is a class that defines how to load a static dataset and its labels from disk via a simple iterator interface. They differ from FiftyOne datasets which are flexible representations of your data geared towards visualization, querying, and understanding.
The symbiosis between the two dataset representations comes from the fact that FiftyOne datasets are optimized for helping you gather and curate datasets for training, while PyTorch datasets are designed to encapsulate a static dataset in a standard interface that can be efficiently loaded during training.
Using the flexible representation of FiftyOne datasets to understand and select the best training data for your task, then passing that data on to PyTorch datasets for efficient loading results in better models, faster.
Select a model
Every PyTorch model expects data and labels to be passed to it in a certain format. Before writing a PyTorch dataset class, you first need to understand the format that the model requires. Namely, we need to know exactly what the data loader is expected to output when iterating through the dataset so that we can properly define the `__getitem__` method in the PyTorch dataset.
In this example, I am following the Torchvision object detection tutorial and constructing a PyTorch dataset to work with their RCNN-based models. If you are following along, this code uses some of the tutorial's utilities and methods for training and evaluation, so you need to clone that code from the PyTorch vision git repository:
# Download TorchVision repo to use some files from
# references/detection
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.3.0
cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../
These object detection models expect our PyTorch dataset to output an `(image, target)` tuple for each sample, where `target` is a dictionary containing the following fields:
- `boxes (FloatTensor[N, 4])`: the coordinates of the `N` bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to `W` and `0` to `H`
- `labels (Int64Tensor[N])`: the label for each bounding box; `0` always represents the background class
- `image_id (Int64Tensor[1])`: an image identifier; it should be unique between all the images in the dataset and is used during evaluation
- `area (Tensor[N])`: the area of the bounding box; this is used during evaluation with the COCO metric, to separate the metric scores between small, medium, and large boxes
- `iscrowd (UInt8Tensor[N])`: instances with `iscrowd=True` will be ignored during evaluation (if your dataset doesn't support crowds, then this tensor will always just be `0`s)
The following code loads Faster-RCNN with a ResNet50 backbone from Torchvision and modifies the classifier for the number of classes we are training on:
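Here is a sketch of that setup following the tutorial, wrapped in a hypothetical `get_model` helper:

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    # Load Faster-RCNN with a ResNet50-FPN backbone, pretrained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=True
    )

    # Replace the box classifier head with one sized for our classes
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model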
_(source of code and field descriptions presented above)_
In general, no matter what model you use, the corresponding PyTorch dataset needs to output the loaded image data along with relevant annotations and metadata for each sample. For example, for classification tasks, `target` would just be a single integer representing the class to which the sample belongs.
Write a PyTorch dataset
Now that we have decided on the model and understand the format that the loaded data needs to follow, we can write a PyTorch dataset class that takes a FiftyOne dataset as input and parses the relevant information. Since the FiftyOne API is designed to be easy to use, parsing samples out of a FiftyOne dataset is generally easier than parsing them from disk.
The constructor for the dataset class needs to take in our FiftyOne dataset and create a list of `image_paths` that can be used by the `__getitem__` method to index into individual samples and also access the corresponding FiftyOne sample by filepath. FiftyOne stores class labels as strings, so we also need a mapping of these strings back to integers to be used by the model.
The `__getitem__` method then needs to take in a unique integer `idx` and use it to access the corresponding FiftyOne sample. Then, since these models are trained on the COCO dataset, we can reformat each detection in the sample into the COCO format for boxes, labels, areas, and crowds. We also add `__len__` and `get_classes` methods for usability.
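Here is a minimal sketch of such a class. For simplicity, it converts FiftyOne's relative `[x, y, width, height]` boxes by hand rather than via FiftyOne's COCO utilities, and it assumes labels live in a `ground_truth` field and that `compute_metadata()` has been run:

import torch
from PIL import Image
from torchvision.transforms import functional as func

class FiftyOneTorchDataset(torch.utils.data.Dataset):
    """A PyTorch dataset backed by a FiftyOne dataset or view."""

    def __init__(self, fiftyone_dataset, transforms=None,
                 gt_field="ground_truth", classes=None):
        self.samples = fiftyone_dataset
        self.transforms = transforms
        self.gt_field = gt_field

        # Filepaths double as indices back into the FiftyOne dataset
        self.img_paths = self.samples.values("filepath")

        self.classes = classes or self.samples.distinct(
            "%s.detections.label" % gt_field
        )
        if self.classes[0] != "background":
            self.classes = ["background"] + list(self.classes)
        self.labels_map_rev = {c: i for i, c in enumerate(self.classes)}

    def __getitem__(self, idx):
        img_path = self.img_paths[idx]
        sample = self.samples[img_path]
        metadata = sample.metadata
        img = Image.open(img_path).convert("RGB")

        boxes, labels, area, iscrowd = [], [], [], []
        for det in sample[self.gt_field].detections:
            # FiftyOne boxes are relative [x, y, w, h]; convert to
            # absolute [x0, y0, x1, y1] as the model expects
            x, y, w, h = det.bounding_box
            boxes.append([
                x * metadata.width,
                y * metadata.height,
                (x + w) * metadata.width,
                (y + h) * metadata.height,
            ])
            labels.append(self.labels_map_rev[det.label])
            area.append(w * metadata.width * h * metadata.height)
            iscrowd.append(int(getattr(det, "iscrowd", 0) or 0))

        target = {
            "boxes": torch.as_tensor(boxes, dtype=torch.float32),
            "labels": torch.as_tensor(labels, dtype=torch.int64),
            "image_id": torch.as_tensor([idx]),
            "area": torch.as_tensor(area, dtype=torch.float32),
            "iscrowd": torch.as_tensor(iscrowd, dtype=torch.int64),
        }

        if self.transforms is not None:
            img, target = self.transforms(img, target)
        else:
            # Default to a bare tensor conversion if no transforms are given
            img = func.to_tensor(img)

        return img, target

    def __len__(self):
        return len(self.img_paths)

    def get_classes(self):
        return self.classes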
Construct a FiftyOne view
Once a PyTorch dataset is constructed for your data and model combination, you need to create a PyTorch data loader. These data loaders are the iterables that use the dataset code you wrote to import your data. They are fairly simple but provide some useful functionality like shuffling, batching, and loading data in parallel.
Since we decided to back our datasets with FiftyOne, we can also perform actions like splitting and shuffling our data by creating views into our dataset. The [take](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html?highlight=take#fiftyone.core.collections.SampleCollection.take) method on a FiftyOne dataset returns a subset containing random samples from the dataset. The [exclude](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html?highlight=exclude#fiftyone.core.collections.SampleCollection.exclude) method ensures that no samples from the training split end up in our validation split:
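A sketch of such a split, along with a data loader built from the resulting PyTorch dataset (`utils.collate_fn` comes from the cloned tutorial code):

import torch
import utils  # from the cloned references/detection code

# Randomly split our 5,000 samples roughly 80/20
train_view = dataset.take(4000, seed=51)
test_view = dataset.exclude(train_view.values("id"))

torch_dataset = FiftyOneTorchDataset(train_view)
torch_dataset_test = FiftyOneTorchDataset(test_view)

data_loader = torch.utils.data.DataLoader(
    torch_dataset, batch_size=4, shuffle=True, collate_fn=utils.collate_fn
)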
Now say that you wanted to get more specific with the data that you use to train and test your model. For example, you might want to train on a specific subset of classes or remap some of the labels. If you are just using PyTorch code, then you need to go back and rewrite your Dataset class and possibly even go back to change your dataset files on disk.
FiftyOne can do much more than just splitting and shuffling data and makes it easy to get exactly the data that you need for your model. One of the primary ways of interacting with your FiftyOne dataset is through different views into your dataset. These are constructed by applying operations like filtering, sorting, and slicing that result in a specific view into certain labels/samples of your dataset. These operations make it easier to experiment with different subsets of data and continue to finetune your dataset to train better models.
For example, cluttered images make it difficult for models to localize objects. We can use FiftyOne to create a view containing only samples with more than, say, 10 objects. You can perform the same operations on views as datasets, so we can create an instance of our PyTorch dataset from this view:
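For instance, a sketch using `match` and FiftyOne's `ViewField`:

from fiftyone import ViewField as F

# Only keep samples with more than 10 ground truth objects
busy_view = dataset.match(F("ground_truth.detections").length() > 10)

# Views can be used anywhere datasets can
busy_torch_dataset = FiftyOneTorchDataset(busy_view)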

Another example is if we want to train a model that is used primarily for road vehicle detection. We can easily create training and testing views (and corresponding PyTorch datasets) that only contain the classes `car`, `truck`, and `bus`:
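A sketch of that view using `filter_labels`:

from fiftyone import ViewField as F

vehicle_classes = ["car", "truck", "bus"]

# Keep only car/truck/bus labels; samples left with no labels are dropped
vehicles_view = dataset.filter_labels(
    "ground_truth", F("label").is_in(vehicle_classes)
)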

vehicles_view in the FiftyOne App (Image by author)

Train the model
Now that we have decided on the data we want to use to train and test our model, the next step is to construct the training pipeline. This varies depending on what you want to accomplish with your model. The specifics of constructing and training models in PyTorch are out of the scope of this blog post, and for that, I refer you to other sources [1,2,3,4].
For this example, we are writing a simple training loop following the PyTorch object detection tutorial. This function takes a model and our PyTorch datasets as input and uses the `train_one_epoch()` and `evaluate()` functions from the Torchvision object detection code:
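A sketch of that loop, using the `engine` and `utils` modules copied from the tutorial and hyperparameters that follow its defaults:

import torch

import utils  # from references/detection
from engine import train_one_epoch, evaluate  # from references/detection

def do_training(model, torch_dataset, torch_dataset_test, num_epochs=4):
    data_loader = torch.utils.data.DataLoader(
        torch_dataset, batch_size=2, shuffle=True,
        num_workers=2, collate_fn=utils.collate_fn,
    )
    data_loader_test = torch.utils.data.DataLoader(
        torch_dataset_test, batch_size=1, shuffle=False,
        num_workers=2, collate_fn=utils.collate_fn,
    )

    device = (
        torch.device("cuda") if torch.cuda.is_available()
        else torch.device("cpu")
    )
    model.to(device)

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(
        params, lr=0.005, momentum=0.9, weight_decay=0.0005
    )
    lr_scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=3, gamma=0.1
    )

    for epoch in range(num_epochs):
        train_one_epoch(
            model, optimizer, data_loader, device, epoch, print_freq=10
        )
        lr_scheduler.step()
        # Prints COCO-style metrics on the test split every epoch
        evaluate(model, data_loader_test, device=device)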
Let’s continue with the vehicle example from the previous section. We can use the `torch_dataset` and `torch_dataset_test` to define and train a model:
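Concretely, a sketch that splits the `vehicles_view` from the previous section:

# Split the vehicles-only view into train/test
n_train = int(0.8 * len(vehicles_view))
train_vehicles = vehicles_view.take(n_train, seed=51)
test_vehicles = vehicles_view.exclude(train_vehicles.values("id"))

torch_dataset = FiftyOneTorchDataset(train_vehicles, classes=vehicle_classes)
torch_dataset_test = FiftyOneTorchDataset(test_vehicles, classes=vehicle_classes)

# +1 for the background class that FiftyOneTorchDataset prepends
model = get_model(num_classes=len(vehicle_classes) + 1)

do_training(model, torch_dataset, torch_dataset_test, num_epochs=4)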
Every epoch, this training loop prints evaluation metrics on the test split. After 4 epochs on the samples that we selected from the COCO dataset, we have reached 36.7% mAP. This is not surprising since this model was pretrained on COCO (but not this subset of classes).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 ] = 0.367
Evaluate the model
Printing evaluation metrics is useful during training so that you can see how quickly your model is learning and when it begins to saturate. However, this Torchvision evaluation protocol only returns dataset-wide mAP metrics. To best understand where your model performs well and poorly, and thus have any hope of improving it, you need to see how the model performs on individual samples.
One of the main draws of FiftyOne is the ability to find failure modes of your model. The built-in evaluation protocols tell you exactly where your model got things right and where it got things wrong. Before we can evaluate the model, though, we need to run it on our test set and store the results in FiftyOne. Doing this is fairly simple and just requires us to run inference on the test images, get their corresponding FiftyOne samples, and add a new field called `predictions` to each sample to store the detections.
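A sketch of that inference loop, converting the model's absolute boxes back to FiftyOne's relative format, followed by FiftyOne's built-in detection evaluation:

import fiftyone as fo
import torch
from PIL import Image
from torchvision.transforms import functional as func

device = (
    torch.device("cuda") if torch.cuda.is_available()
    else torch.device("cpu")
)
model.to(device).eval()

classes = torch_dataset_test.get_classes()

with torch.no_grad():
    for sample in test_vehicles:
        image = Image.open(sample.filepath).convert("RGB")
        w, h = image.size
        preds = model([func.to_tensor(image).to(device)])[0]

        detections = []
        for label, score, box in zip(
            preds["labels"], preds["scores"], preds["boxes"]
        ):
            # Convert absolute [x0, y0, x1, y1] to relative [x, y, w, h]
            x0, y0, x1, y1 = box.cpu().tolist()
            detections.append(fo.Detection(
                label=classes[int(label)],
                bounding_box=[x0 / w, y0 / h, (x1 - x0) / w, (y1 - y0) / h],
                confidence=float(score),
            ))

        sample["predictions"] = fo.Detections(detections=detections)
        sample.save()

# Evaluate the predictions against the ground truth
results = test_vehicles.evaluate_detections(
    "predictions", gt_field="ground_truth", eval_key="eval", compute_mAP=True
)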
The [DetectionResults](https://voxel51.com/docs/fiftyone/api/fiftyone.utils.eval.detection.html#fiftyone.utils.eval.detection.DetectionResults) object that is returned stores information like the mAP and contains functions that let you plot confusion matrices, precision-recall curves, and more. Also, these evaluation runs are tracked in FiftyOne and can be managed through functions like [list_evaluations()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.list_evaluations).
What we are interested in, though, is the fact that [evaluate_detections()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.evaluate_detections) updates the detections in the `predictions` field with attributes indicating whether they are true positives, false positives, or false negatives. Using FiftyOne views and the App, we can quickly find the samples that the model performed worst on by sorting by false positives:
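For example, since we used `eval_key="eval"` above, per-sample counts like `eval_fp` are available to sort on (a sketch):

import fiftyone as fo

# Put the samples with the most false positives first
fp_view = test_vehicles.sort_by("eval_fp", reverse=True)

session = fo.launch_app(view=fp_view)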

Looking through some of these samples, a pattern emerges. A number of truck and car annotations are actually incorrect in the ground truth. There seems to be confusion for things like vans and SUVs that are interchangeably annotated as cars or trucks.

It would be interesting to see a confusion matrix between these classes. By default, the evaluation only matches predictions with ground truth objects of the same class (`classwise=True`). We can rerun the evaluation with `classwise=False` and plot that confusion matrix.
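A sketch of that rerun:

# Allow predictions to match ground truth objects of other classes
results = test_vehicles.evaluate_detections(
    "predictions", gt_field="ground_truth", classwise=False
)

plot = results.plot_confusion_matrix(classes=["car", "truck", "bus"])
plot.show()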

It would be best to get this data reannotated to fix these mistakes, but in the meantime, we can easily remedy this by simply creating a new view that remaps the labels `car`, `truck`, and `bus` all to `vehicle` (using the same `map_labels` call shown at the start of this post) and then retraining the model. Since our training data is backed by a FiftyOne dataset, this transformation is easy!
Due to our ability to easily visualize and manage our dataset with FiftyOne, we were able to spot and take action on a dataset issue that would otherwise have gone unnoticed if we only concerned ourselves with dataset-wide evaluation metrics and fixed dataset representations. Through these efforts, we managed to increase the mAP of the model to 43%. Even though this example workflow may not work in all situations, this kind of class-merging strategy can be effective in cases where more fine-grained discrimination is not called for.
Alternative training frameworks
If you are primarily focused on developing a novel model architecture, then you would likely want to start directly in PyTorch. However, if your goal is to train a model on a custom dataset and a common task, then there are a number of training frameworks that can make that even easier for you.
Some notable training frameworks for object detection are Detectron2 and MMDetection. But no matter which framework you choose, it will likely not use PyTorch datasets and data loaders directly, instead requiring you to format your data on disk in a certain way. Detectron2 and MMDetection, for example, expect your data to be stored in the COCO format. Once you have reformatted your dataset, it can be loaded directly into these frameworks and trained on.
Even though you are not loading the datasets yourself, FiftyOne can still help you with this process. With FiftyOne, any dataset or view that you have loaded into it can be exported in over a dozen different formats (COCO included). This means that you can parse your dataset into FiftyOne and export it in one line of code instead of needing to write numerous scripts to convert it yourself.
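For example, a sketch of exporting a view in COCO format (the path is a placeholder):

import fiftyone as fo

# Export any dataset or view to disk in COCO format
vehicles_view.export(
    export_dir="/path/to/export",
    dataset_type=fo.types.COCODetectionDataset,
    label_field="ground_truth",
)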
Summary
PyTorch and related frameworks provide quick and easy methods to bootstrap your model development and training pipelines. However, they largely overlook the need to massage and finetune datasets to efficiently improve performance. FiftyOne is an ML developer tool designed to make it easy to load your datasets into a flexible format that works well with existing tools allowing you to provide better data for training and testing. As they say, "garbage in, garbage out".
About Voxel51
High-quality, intentionally-curated data is critical to training great computer vision models. At Voxel51, we have over 25 years of CV/ML experience and care deeply about enabling the community to bring their AI solutions to life. That’s why we developed FiftyOne, an open-source tool that helps engineers and scientists to do better ML, faster.
Want to learn more? Check us out at fiftyone.ai.