Tips to avoid wasting time scripting in your ML projects

Tips and tools to speed up machine learning workflows by reducing excess scripting and data wrangling

Eric Hofesmann
Towards Data Science


Photo by Mahir Uysal on Unsplash

When you imagine the day-to-day work of a data scientist, you might imagine servers with huge datasets, multiple monitors displaying various graphs and analytics, and a rack of GPUs heating up the room with a massive artificial neural network being trained on them. What most people don’t think about is the hours spent having to put on your cowboy or cowgirl hat and saddling up to wrangle your data.

25% of time spent on ML projects is collecting and cleaning data — Kaggle Poll

Full-time scripting

Data wrangling is a common term used to describe the process of collecting, transforming, and cleaning your data to make it ready to use for training or evaluation of a machine learning model. Since nearly every machine learning task and model uses a different format to store data and labels, it’s difficult to avoid spending tons of time writing scripts between every part of your pipeline.

For example, if you download an open-source dataset, you need to write a script to parse it, write a script to convert it to your annotation tool’s format, write a script to parse the output of your annotations, write a script to perform quality assurance on your annotations, write a script to split the data into training and testing splits, write a script to parse the data to train your model, write a script to evaluate your model, write a script to visualize and analyze your results…. UGH!

The worst part is, this is all specific to one project. Any time you want to work with new data or a new model you have to rewrite all of these scripts!

Photo by Steve Johnson on Unsplash

Scripting Time Sinks

There are certain portions of an ML project that typically require a significant amount of scripting. I’ve put together some tips for each portion that can help reduce the required amount of scripting.

1) Label schema development

Developing a labeling schema seems like a pretty easy task: just make a set of rules and follow them until every sample in your dataset is annotated. Inevitably, though, when you look through samples that your model is underperforming on, you’ll start to find lots of edge cases you didn’t consider before, which will require you to go back, modify your schema, and most likely even reannotate data.

When edge cases arise, going back to find the exact samples, creating an annotation task, and updating your dataset will require a lot of scripts. After fixing them and retraining your model, you’ll likely find more edge cases that require yet another update to the schema and dataset, and even more scripts.

Tips: The best way to avoid this is to plan ahead for this situation because it WILL come up. At one company I worked for, a focused team of 3 people (out of 15) spent weeks developing the schema alone. They wrote a detailed Google doc with specific examples of common and edge cases. This led to some intense discussion because no assumptions could remain tacit. Just like when testing code, spending more time refining your labeling schema to tackle edge cases before they arise will save you time in the future.

When writing your data processing pipeline, add functionality that makes it easy to identify and update samples. For example, add a UUID for every sample and label so that you can easily tag an incorrect annotation and update it (a minimal sketch of this idea follows the list below). Tools that could help are:

  • Cookiecutter — provides a file structure to help organize your code
  • CVAT — a computer vision annotation tool that can help you figure out the structure and available annotation options for vision tasks
  • pandas — use a pandas DataFrame to store your labels to reduce the amount of scripting needed to update your dataset
Example of annotation tool CVAT
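
For instance, here is a minimal sketch of the UUID idea using pandas; the column names are purely illustrative:

import uuid

import pandas as pd

# One row per label; the column names here are illustrative
labels = pd.DataFrame([
    {"label_id": str(uuid.uuid4()), "sample": "img_0001.jpg", "class": "cat"},
    {"label_id": str(uuid.uuid4()), "sample": "img_0002.jpg", "class": "dog"},
])

# Later, when a reviewer flags a bad annotation, update it by its UUID
bad_id = labels.iloc[0]["label_id"]
labels.loc[labels["label_id"] == bad_id, "class"] = "dog"

Because every label carries a stable ID, a reannotation task can reference exact labels rather than re-matching them by file name and coordinates.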

2) Data loading and format conversion

Parsers are the most common kind of script you’ll end up writing when wrangling data. Every time you need to view, transform, load, or evaluate your data, you need to parse it, and the output format will likely differ from what came before. This means that every time you interact with your data, you are writing a new script.

Tips: One way to cut down the time it takes to load and convert datasets is to work in an environment designed to let you easily manipulate data, with models readily available to train and deploy.

  • Google Cloud AutoML and Amazon SageMaker — cloud tools that let you prepare and store datasets, then train and deploy models

Alternatively, if your task is a common one, there are tools that can automatically convert your data from one format to another, provided it follows a supported layout (a minimal sketch follows the list below).

  • FiftyOne: an open-source Python library that supports automatic Python or command-line conversion of over a dozen computer vision formats including COCO, Pascal VOC, YOLO, BDD100K, TFRecords, etc.
  • RoboFlow: an annotation service with a tool to convert between computer vision formats like COCO, TFRecords, and Pascal VOC
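
For instance, FiftyOne can do the conversion from Python as well as from the command line. Here is a minimal sketch, assuming a CVAT-format image dataset on disk (the paths are placeholders):

import fiftyone as fo

# Load an existing CVAT-format image dataset from disk (paths are placeholders)
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/cvat-dataset",
    dataset_type=fo.types.CVATImageDataset,
)

# Re-export the same samples and labels in COCO detection format
dataset.export(
    export_dir="/path/to/coco-export",
    dataset_type=fo.types.COCODetectionDataset,
)

The fiftyone convert command shown later in this post does the same thing from the shell.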

Finally, if all else fails, the best advice is to understand your data pipeline well before starting a project so that you can structure your data in a format that loads easily into the models and evaluation tools you want to use. Dataset and model zoos are the best places to look for ideas about which data formats to follow (a quick sketch follows the list below).

  • TensorFlow Hub: a collection of models that you can use to understand the input/output formats your desired model expects
  • Papers with code: find the current state-of-the-art datasets and models and follow their formats
Example of models available on TensorFlow Hub
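
For example, you can pull a model from TensorFlow Hub and probe the input/output format it expects before settling on your own data format. A minimal sketch follows; the model handle is only an illustrative example, so swap in whichever model you actually need:

import tensorflow as tf
import tensorflow_hub as hub

# The handle below is illustrative; browse tfhub.dev for the model you need
layer = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/5"
)

# Probe the expected input/output format with a dummy batch
dummy = tf.zeros([1, 224, 224, 3])  # this model family expects 224x224 RGB images
outputs = layer(dummy)
print(outputs.shape)  # class logits, one row per input image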

3) Model versioning

You’ll likely end up retraining a variety of models when working on an ML project, resulting in many variations of training scripts, hyperparameter setups, and experiment logs. Managing these can get unwieldy, especially as you constantly change your model architecture and training pipeline.

Tips: Luckily, there are many tools designed specifically for versioning and experiment tracking (a minimal MLflow sketch follows the list). Notable examples are:

  • TensorBoard: probably the most popular lightweight experiment-tracking tool, letting you visualize metrics like loss and accuracy over the course of training
  • MLflow: an open-source tool for tracking the parameters and results of your experiments while also providing functionality to package and deploy your models
  • Weights & Biases: cloud-based solution for experiment tracking and model/dataset versioning that allows for easy sharing with collaborators
Experiment tracking with Weights & Biases
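
As one example, MLflow’s tracking API logs the parameters and metrics of a run in a few lines. A minimal sketch (the parameter names and values here are made up):

import mlflow

# Each run records its hyperparameters and results so runs can be compared later
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 32)

    # ... train the model here ...

    mlflow.log_metric("val_accuracy", 0.87)
    mlflow.log_metric("val_loss", 0.42)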

4) Evaluation

The complexity of your evaluation greatly depends on the problem you are trying to solve. Finding false positives in an image classification problem is much easier than finding false positives in an object detection problem.

While libraries do exist that compute performance metrics across your entire dataset, the best way to improve your model’s performance is to find the specific samples where it exhibits recurring failure modes. These failure modes and biases are often very model- and dataset-specific, meaning the only way to find them is by writing very specific scripts to query your results.

Tips: Start by analyzing dataset-wide metrics, then dig in to find individual false positives or poorly performing samples (a sketch of the dataset-wide starting point follows the list below). Useful tools for this are:

  • scikit-learn: compute aggregate metrics like accuracy, precision, recall, F1 score, MSE, Jaccard similarity, etc.
  • pycocotools: automatically compute mAP for object detection problems following the COCO evaluation protocol
  • NLTK: Natural Language Toolkit to compute metrics like BLEU to evaluate machine translation, image captioning, and other NLP models
  • FiftyOne: load your labeled visual dataset and model results and write custom queries to find problematic samples
Computing mAP with scikit-learn and Matplotlib
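
Here is a minimal sketch of the dataset-wide starting point with scikit-learn; the labels are toy values:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted labels
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")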

5) Visualization

As datasets have grown over the years, many now containing millions of samples, visualizing your dataset and results in a meaningful way often takes a pile of scripts just to generate images that you then have to view individually on disk. The best way to cut down on the scripting needed for visualization is to use the relevant tools.

Tips: A picture is worth a thousand words; try to visualize as many metrics as you can. For example, a confusion matrix is much more insightful than a single aggregate accuracy number (a minimal sketch follows the list below).

However, you really need to visualize not only aggregate metrics but also subsets of your data to uncover systematic avenues for improvement. Try to explore individual samples and results to understand patterns and biases in annotations or model predictions that may be difficult to find through dataset-wide evaluation metrics.

  • Matplotlib: the most popular Python library for creating plots and graphs
  • seaborn: built on Matplotlib, with a higher-level interface and its own styling
  • Plotly: an easy-to-use Python library for building interactive plots
  • Bokeh: a visualization library for creating interactive plots in the browser
  • FiftyOne: open-source tool to visualize and search labeled image or video datasets with model predictions in the FiftyOne App
  • Scale Nucleus: cloud-based tool to visualize and explore labeled image datasets
  • Weights & Biases: cloud-based tool to visualize various data types including images, videos, and 3D objects
Visualizing a dataset with Scale Nucleus
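
For instance, the confusion matrix mentioned above takes only a few lines with scikit-learn (1.0+) and Matplotlib. A minimal sketch with toy labels:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Toy ground-truth and predicted class labels
y_true = ["cat", "dog", "dog", "cat", "bird", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "dog", "dog"]

# A per-class breakdown is far more informative than a single accuracy number
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()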

Our Solution for Visual Data

While the above tips and tools are mostly data-agnostic, if you’re working with image or video data, then check out our free, open-source Python tool FiftyOne for managing your datasets and results.

pip install fiftyone

It provides an API with ways to easily load datasets for problems like object detection, classification, segmentation, keypoint detection, and more, while also providing an App to visualize your entire dataset and analyze model predictions.

Full disclosure — I am an engineer at Voxel51 working on FiftyOne. We have been developing this open-source library specifically because of all the time we used to spend writing scripts for the following tasks:

  • Dataset Collection — The FiftyOne Dataset Zoo provides a host of datasets for various computer vision tasks like object detection, classification, and segmentation in one line of code.
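For example, a minimal sketch, assuming you want the COCO-2017 validation split:

import fiftyone.zoo as foz

# Download (if needed) and load the COCO-2017 validation split in one line
dataset = foz.load_zoo_dataset("coco-2017", split="validation")
print(dataset)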
The output of the previous code block (image by author)
  • Manage Datasets — Once you get your data and labels into FiftyOne, you can update them as you collect and annotate more raw data, find mistakes, produce model predictions, and balance your dataset. There is no need to figure out how to store your data anymore; FiftyOne takes care of that and lets you access and load it in whatever way you need.
  • Convert Data Formats — Import and export datasets with annotations and model predictions in more than a dozen different formats (e.g. CVAT, MSCOCO, YOLO, TensorFlow Object Detection, Classification, and more), for example:
fiftyone convert \
--input-dir /path/to/input \
--input-type fiftyone.types.CVATImageDataset \
--output-dir /path/to/output \
--output-type fiftyone.types.COCODetectionDataset
  • Generate Predictions — Stop writing scripts to try to get someone else's code working just to evaluate their model on your dataset. The FiftyOne Model Zoo lets you load and apply a model in one line of code to get a performance benchmark on your task.
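A minimal sketch, assuming the dataset loaded above and a detection model from the zoo (the model name below is just one example of what is available):

import fiftyone.zoo as foz

# Load a pretrained detector from the FiftyOne Model Zoo (one example of many)
model = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch")

# Run it on the dataset and store the predictions on each sample
dataset.apply_model(model, label_field="predictions")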
  • Visualize Data — Once your dataset is in FiftyOne, you can load it into the App and visualize your data and labels with no additional scripting. Find annotation mistakes, prepare training data, and compare model results with minimal additional code.
Visualizing dataset in FiftyOne App (image by author)
  • Evaluate Models — Model evaluation is one of the most script-heavy portions of an ML project. Computing evaluation metrics, visualizing false positives, and querying specific failure cases of your model all normally require numerous scripts. FiftyOne provides these functionalities in a few lines of code.
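A minimal sketch of detection evaluation, assuming ground truth in a ground_truth field and model predictions in a predictions field:

# COCO-style evaluation of the predictions against the ground truth
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
)

# Aggregate metrics, plus per-sample counts (e.g. eval_fp) for querying failure cases
results.print_report()
print(dataset.sort_by("eval_fp", reverse=True).first())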
  • Notebook Support — Remove scripts entirely by working from Jupyter or Colab notebooks. FiftyOne is fully compatible with notebooks and the App can be launched within a notebook cell.
FiftyOne running in a Colab notebook (image by author)

Summary

Data wrangling takes up a massive portion of the time spent on machine learning projects. Nearly every dataset and model comes with a different data format that you need to write scripts to convert to and from. Training models means writing scripts to track dozens of configurations and results. Finding annotation mistakes requires writing scripts to go back and update your label schema and reannotate your dataset. After finally getting results, you need to write scripts to visualize your data and results as well as a new script for any query that you might come up with to interpret the results.

Planning ahead and writing your code in a reusable way can help reduce the number of scripts that you need to write in the long run. However, the most direct way to reduce the number of scripts is knowing what tools exist for your task and using them.
