7 Things We Looked for in a Video Labelling Tool

Here are the gotchas to look out for and questions to ask:

Alberto Rizzoli
Towards Data Science

--

In July 2020, V7 Labs released its video annotation tool after a six-month development journey. Here's what we had to look out for along the way, and what will inevitably affect the quality of your machine learning projects.

1. Video Frame Quality: Avoiding video compression

Video compression, whether H.264, H.265, or whatever codec sits behind a .MP4, .AVI, or .MOV extension, is almost always lossy. An H.264 video, the codec used by most streaming services, is 20 to 200 times lighter than an uncompressed series of images, and while the footage is in motion the difference is hard for the human eye to notice.

When you pause on a still frame, however, the difference is palpable, and it will affect your annotation quality and training data.

Source: Geeksforgeeks

Video compression mostly degrades elements in motion, and it does terribly in the presence of noise (common in dark scenes). Modern codecs like H.265/HEVC do a better job of handling small moving items, but these still come out choppier and more pixelated than they should be. This will ultimately affect how your model detects small items in motion and how it learns to handle motion blur.

Source: Frame.io

Something you may also want to look out for is frame sampling. Most videos are shot at 30 FPS, but unless you are testing fast object-tracking algorithms, you won't need to label that frequently. Most computer vision applications need either 2 FPS for high-variety datasets or 15 FPS for real-time tracking at human speed (walking or interacting people, not moving cars). We've built in a frame sampler that lets you tweak the frame rate of a video for labelling. This doesn't affect the video quality, but it saves you from annotating scores of almost-identical frames.

Source: V7 Labs
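To make the sampling trade-off concrete, here is a minimal sketch of frame sampling with OpenCV. It is not V7's frame sampler; the paths and target rate are placeholders. It keeps every Nth decoded frame and writes it out as lossless PNG.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, target_fps: float, out_dir: str) -> int:
    """Keep roughly `target_fps` frames per second from `video_path`."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / target_fps))  # keep every Nth frame

    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # PNG is lossless, so nothing is lost beyond what the
            # video codec already discarded.
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.png", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# e.g. label a 30 FPS clip at roughly 2 FPS:
# sample_frames("clip.mp4", target_fps=2.0, out_dir="frames")
```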

Ensure that whatever video annotation system you are using lets you label video frames uncompressed and at full resolution. Tools often skip this to keep video playback smooth. Here's what to look out for:

QUESTION: What image quality loss can we expect when labelling videos from the original file?
ANSWER: You’re looking for a clear answer stating that you’ll have access to video frames at full quality when pausing a video, and upon export.

2. Z-Stacking labels: What sets video editing tools apart

When I was 11, back in 2004, my school purchased some licenses for the now-defunct Macromedia Flash MX. What set Flash apart back then, and led to the creative outburst of animations and games of the mid-2000s, was its ability to handle many simultaneously moving objects and its keyframe interpolation of motion graphics.

This is Adobe Flash Professional 2016, the most modern incarnation before it was deprecated. Source: Alan Becker Flash tutorials

Similarly, professional video editing software like Adobe Premiere Pro or After Effects sets itself apart from free or open-source alternatives by the sheer number of simultaneous tracks it can sustain. This poses both a UX and an engineering challenge, as video annotation may demand hundreds of simultaneous objects, each with independent entry and exit points on a timeline.

A busy timeline in Adobe Premiere Pro, with 11 audio and video tracks. We will have to deal with 100+. Source: Adobe

This meant we couldn't use a track-based timeline like video editors do, as the vertical space needed would be immense. Instead, we defined overlapping annotations in a separate layer bar and made the timeline adjust automatically to fit as many annotations as possible within a narrow band.

Artboard sketches of our video annotation system. Source: V7 Labs, design

We opted for an automatically adjusting timeline ordered by first appearance, in which horizontal tracks can be shared between annotations: when an object exits the frame, the next one can take its place on the same track (a sketch of this packing follows below). Keyframes also remember their position in time, so if the duration of an annotation is adjusted, its keyframes adjust accordingly.

Source: V7 Video Annotation
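For illustration, here is a hedged sketch of that kind of track sharing as a greedy interval-packing problem. It is not V7's timeline code, just one simple way to assign annotations to a minimal number of shared rows.

```python
from typing import List, Tuple

def pack_tracks(spans: List[Tuple[int, int]]) -> List[int]:
    """spans: (first_frame, last_frame) per annotation.
    Returns a track index per annotation; non-overlapping spans share a track."""
    order = sorted(range(len(spans)), key=lambda i: spans[i][0])
    track_last_frame: List[int] = []           # last occupied frame per track
    assignment = [0] * len(spans)
    for i in order:
        start, end = spans[i]
        for t, last in enumerate(track_last_frame):
            if last < start:                   # track is free again: reuse it
                track_last_frame[t] = end
                assignment[i] = t
                break
        else:                                  # every track overlaps: open a new one
            track_last_frame.append(end)
            assignment[i] = len(track_last_frame) - 1
    return assignment

# Three boxes where the first leaves before the third enters -> 2 tracks, not 3:
# pack_tracks([(0, 40), (10, 90), (50, 120)])  ->  [0, 1, 0]
```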

QUESTION: How many simultaneous objects can appear in a video?
ANSWER: Make sure each object can have a start and end point that is clearly visible in the interface. This helps with QA, as temporal errors are among the most common mistakes in video annotation and are very bad for the performance of the network you train.

3. Keyframes: See where changes happen

Keyframes are used in every video editing and animation software on the planet. As they say: if it ain't broke, don't fix it.
Depending on the software's purpose, they may be handled differently. In video montage software like Adobe Premiere Pro, for example, they take a back-seat position and are only visible by opening a clip's settings. The same goes for Final Cut, where they're only visible for one clip at a time. In motion-graphics software like After Effects or Adobe Animate (ye olde Macromedia Flash), they take centre stage as the animator's primary mode of interaction.

Sources: V7 Labs, Adobe.com, Apple

Image annotation is a lot closer to motion graphics than to video editing.
In V7 we wanted to ensure that one shape's keyframes were independent of another's. Each annotation therefore has its own set of keyframes, which can define its position, shape, attributes, or any other supported property such as visible/hidden or occluded.

Source: V7 Video Annotation
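As a rough sketch of that idea (not V7's actual data model), each annotation can carry its own keyframe dictionary, with "hold" semantics between keyframes:

```python
class Annotation:
    """Each annotation owns its keyframes; editing one never touches another."""

    def __init__(self):
        self.keyframes = {}                      # frame index -> property dict

    def set_keyframe(self, frame: int, **props):
        self.keyframes.setdefault(frame, {}).update(props)

    def value_at(self, frame: int) -> dict:
        """Merge keyframed properties up to `frame` (hold the last value set)."""
        state = {}
        for f in sorted(self.keyframes):
            if f > frame:
                break
            state.update(self.keyframes[f])
        return state

person = Annotation()
person.set_keyframe(0, box=(10, 10, 50, 80), occluded=False)
person.set_keyframe(45, occluded=True)           # attribute change only
print(person.value_at(50))                       # box unchanged, occluded is now True
```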

Something I found frustrating in video editing suites is mistakenly editing a property on the wrong frame because you've missed where a keyframe sits, so we made each keyframe a clickable shortcut that jumps straight to its frame.

QUESTION: How do keyframes work? And how are events that happen to annotations, such as attribute or shape changes, visibly marked?
ANSWER: Make sure it's easy to spot changes, otherwise your QA will become messy and time-consuming. Also make sure that every annotation type supports some form of keyframing.

4. Interpolation: How far does it go?

Let’s start with linear vs cubic interpolation. Does it really matter?

Source: V7 Labs, research

Not really, no. We're talking about spatial interpolation here. Most objects, such as the ball on the left, move along a cubic spline. This isn't true in manufacturing, however, where movement is almost always linear due to the nature of gears. Even so, the biggest difference isn't in spatial interpolation but in temporal interpolation: it's not the arched movement of objects that's hard to handle, it's the acceleration and deceleration. Temporal interpolation confuses users, so much so that "temporal interpolation" is one of the most-Googled concepts in After Effects tutorials. We found that keeping time linear did a much better job of avoiding user frustration, at the cost of a few extra clicks.

Source: V7 Darwin
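To see the spatial difference in code, here is a small comparison of linear versus cubic-spline interpolation over the same keyframed positions (NumPy and SciPy assumed available; the keyframe values are made up):

```python
import numpy as np
from scipy.interpolate import CubicSpline

key_frames = np.array([0, 10, 20, 30])                              # frames with a keyframe
key_xy = np.array([[0, 0], [40, 60], [80, 75], [120, 20]], float)   # keyframed positions

query = np.arange(0, 31)
linear = np.stack(
    [np.interp(query, key_frames, key_xy[:, d]) for d in range(2)], axis=1
)
cubic = CubicSpline(key_frames, key_xy, axis=0)(query)

# `linear` moves in straight segments between keyframes; `cubic` follows a smooth
# arc, closer to a thrown ball, but harder for an annotator to predict and correct.
```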

Bounding Box Interpolation

Bounding boxes are the easiest to handle: essentially the movement of two coordinate points. The only thing you'll want to ensure is that boxes can change status over time, for example gaining or losing attributes, or becoming hidden. A minimal example follows below.

Source: V7 Labs, research
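Here is that two-point interpolation as a minimal sketch, assuming (x1, y1, x2, y2) boxes and made-up frame numbers:

```python
def interpolate_box(frame, kf_a, box_a, kf_b, box_b):
    """Linearly interpolate an (x1, y1, x2, y2) box between two keyframes."""
    t = (frame - kf_a) / (kf_b - kf_a)
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

# Keyframes at frames 10 and 40; the box halfway through, at frame 25:
print(interpolate_box(25, 10, (100, 50, 180, 140), 40, (160, 70, 240, 160)))
# -> (130.0, 60.0, 210.0, 150.0)

# Attributes and hidden/visible flags aren't interpolated; they simply hold
# their last keyframed value until the next keyframe changes them.
```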

Polygon Interpolation

Things get much harder when you have a potentially unlimited number of points to handle, and your origin and target shapes use different numbers of points. We found that in polygon annotation, this applied to practically every single case.

In mathematics this is known as the Stable Marriage Problem. Shapes have the added challenge that corresponding origin and target points need to be close to one another, and computing this in real time is not trivial.

We included real-time polygon interpolation because segmentation masks are a primary component of our tool, and we think we did a good job of solving it, though it's still not perfect.
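One simple way to interpolate between polygons with different point counts (an illustrative approach, not necessarily the one V7 ships) is to resample both outlines to the same number of points by arc length, align them with the cyclic shift that minimises total point distance, and then interpolate linearly:

```python
import numpy as np

def resample(poly: np.ndarray, n: int) -> np.ndarray:
    """Resample a closed polygon to n points, evenly spaced by arc length."""
    closed = np.vstack([poly, poly[:1]])
    seg_lengths = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg_lengths)])
    targets = np.linspace(0.0, cum[-1], n, endpoint=False)
    return np.stack([np.interp(targets, cum, closed[:, d]) for d in range(2)], axis=1)

def interpolate_polygon(src, dst, t: float, n: int = 64) -> np.ndarray:
    a = resample(np.asarray(src, float), n)
    b = resample(np.asarray(dst, float), n)
    # Pick the cyclic shift of b whose points sit closest to a's points.
    shifts = [np.linalg.norm(np.roll(b, k, axis=0) - a) for k in range(n)]
    b = np.roll(b, int(np.argmin(shifts)), axis=0)
    return (1 - t) * a + t * b
```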

If you’re interested in a deep-dive on why polygon interpolation is tricky, this excellent tutorial by Flash animator Alan Becker demonstrates some of its shortcomings.

Keypoint Interpolation

Source: V7 Changelog

Keypoint skeletons are a workaround for polygon interpolation, since they have a defined number of points. You'll want to make sure you can interpolate skeletons or custom polygons too, and that you're able to mark occluded joints if your data needs it. Here's a video on how we handled keypoint skeletons in video annotation on V7.

Neural Network-based Interpolation

Finally, the most effective way we found of smoothing a polygon across a video is to run a neural network on each frame (or every 2–5 frames, interpolating between them) and to feed any needed adjustments back in as inputs to the net. Here, we're using Auto-Annotate, which is class-agnostic and treats user clicks as hints to re-draw a segmentation mask.
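In sketch form, the loop looks roughly like this; `predict_polygon` stands in for whatever model you call (V7's Auto-Annotate being one option), and `interpolate_polygon` is the sketch from the previous section:

```python
def label_video(frames, predict_polygon, k=5):
    """Run the model on every k-th frame and interpolate polygons in between."""
    anchors = {i: predict_polygon(frames[i]) for i in range(0, len(frames), k)}
    anchor_ids = sorted(anchors)

    labels = {}
    for a, b in zip(anchor_ids, anchor_ids[1:]):
        for f in range(a, b):
            t = (f - a) / (b - a)
            labels[f] = interpolate_polygon(anchors[a], anchors[b], t)
    labels[anchor_ids[-1]] = anchors[anchor_ids[-1]]
    # Frames after the last anchor (if any) are left to the annotator, and any
    # manual correction can be fed back in as a new anchor for the next pass.
    return labels
```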

5. Scrubbing performance: Why video editors spend thousands on the right editing suite

Because a lagging video drives people crazy. When people buy video editing software, it's not for the presets, effects, or transitions (those are downloaded from marketplaces after the purchase); it's for the fidelity of playback and real-time rendering. You are going to play and re-play the same three-second sequence a dozen times, so make sure it's smooth.

There are two elements you want to watch out for:

1. How well does the video play, and how quickly can you jump across parts of the video and play a sequence?
2. How smoothly and faithfully do the annotations render while a video is playing? Make sure you test this with as many labels as you foresee using in production.

And possibly:

3. How quickly can you process videos after upload, and generate exports once they are complete?

Don't underestimate the performance loss once a video is fully labelled. Most annotation systems run in the browser, which has its own performance limitations, and they can become frustratingly slow when it's your job as a data scientist or reviewer to check the quality of the work.

6. Slow connection performance: Will it work in every region of the world?

CDNs are the unsung heroes of SaaS. They deliver content faster by bringing parts of it closer to the user, while (if done right) still remaining GDPR or HIPAA compliant.
What this means is: if a demo works well for you in California on a MacBook Pro, it may not work well for users in Peru. What loads in 2 seconds for you may take over 60 seconds for some users because of their location alone. On top of that, people labelling data aren't always on blazing-fast connections.

You know something is boring when the only illustrations you can find on the topic are clip art. Source: ZDNet
Source: Google Chrome Browser

Test the performance of video annotation using a VPN connected to a distant country. You can also simulate slower connections from the Network tab in your browser's Developer Tools (Ctrl+Shift+I in Chrome). Pictured on the left are Chrome's.

QUESTION: Do you have any users in [region] and what network performance are they reporting?
ANSWER: You should be able to test the performance of the service in that region and get some concrete numbers.

7. Video Dataset Management: What happens when you’re neck-deep in data?

Things crash at scale, and it's very hard, if not impossible, to patch a product that wasn't built for your end-goal ambitions. When looking for a labelling partner, you're putting your business's deliverables in their hands. Make sure they are frank about the long-term, large-scale performance of their tools.

We too encounter scale issues, and never stop encountering edge cases. Loading dozens of feature-length videos? Check. Videos with 8:1 aspect ratios? Check. Video datasets with thousands of classes and tens of thousands of attributes? Check. These are tricky to support if you haven't built the prerequisites to sustain them, and they can cost you dearly.

A series of videos pending annotation in a dataset management interface. Source: V7 Darwin

We've encountered far too many businesses that switched away from internal tools or open-source labelling platforms because of poor performance at scale. It's a painful switch, as it tends to happen when you are in full production. With video, these performance issues are multiplied. Here's what to look out for, and what to ask:

QUESTION: Dataset size - How will performance degrade when I reach 100,000 videos in a dataset? How about 1 million? Can you provide some examples of how you overcame scale issues?
ANSWER: Nothing is completely scale-invariant; expect some slow-down in retrieving and searching for videos. Make sure the tool's developers have faced this challenge with another customer and aren't brushing it away.

QUESTION: Dataset management - Can I search for or retrieve any video by name, status, label, or user who labelled it?
ANSWER: You’ll need this. There comes a time in which your dataset needs to be split, repurposed, cleaned, or troubleshot after a training session.

QUESTION: Dataset integrity - Are you keeping a history of the changes in each video, in case we encounter issues with our ground truth? Do you back up any of these assets?
ANSWER: At the very least you want the chance to spot a bad batch of data, since errors tend to cluster around the same timeframe or the same user. You also want to ensure that annotations, data, and any performance and history metrics are backed up daily. Accidental deletions are something we encounter constantly; they are far too common.

I hope this has been a useful guide for what to look out for. Make sure you’ve covered every plausible scenario when starting your machine learning data labelling, as the tool you use will be the bedrock of your ML progress. If you’d like to take a look at what we ended up building in V7, here’s a video summary below:
