Speedy Computer Vision Pipelines using Parallelism

R S Nikhil Krishna
Towards Data Science
5 min read · Nov 22, 2018


If you’ve ever written your own code to process videos using OpenCV, MoviePy, or any of the other gazillion libraries out there, you’ve probably faced the problem of terribly slow processing.

Unfortunate is ze who has to wait 20 minutes to watermark a 5 minute 4k video

This is because the code we write fundamentally operates on just a single core. We have a task with a monumental amount of data to churn through, and no out-of-the-box solution that simplifies our life. What we’d like, however, is an entirely different story, best summarized by some of Daft Punk’s lyrics:

Work it, Make it ….. Harder, Better, FASTER, Stronger

On seeing the word faster, your first thought was probably parallelism. And when we think of parallelizing video processing, this is how it must be done, right?

Approach #1

We could just keep assigning frames one after the other to whichever core is free. Seems like a simple and obvious solution, doesn’t it?

First thoughts: How to NOT parallelize video processing across cores

Unfortunately, though, this doesn’t quite work out. In fact, in a lot of cases it ends up performing worse than our simple, single-core code. To understand why, we have to dig into how videos themselves are stored. Most frames in a compressed video are encoded as differences from the frames before them, so decoding is inherently sequential: while one core is decoding a frame, the other cores have to sit idle, since they cannot start on the next frames until the preceding frame has been decoded (Here’s a tech-quickie (literally) about how video compression works).

Libraries like joblib parallelize jobs in a similar way, and so don’t give the kind of speed-up we’d like to see; a minimal sketch of this dead end follows below. This means we have to look for a better alternative.
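To make the bottleneck concrete, here’s a minimal sketch of the naive approach, assuming joblib, a placeholder input.mp4, and a Gaussian blur standing in for the per-frame operation (none of this is from the original snippets):

```python
import cv2
from joblib import Parallel, delayed

def frames(path):
    # Decoding happens serially, frame by frame, in the parent process
    cap = cv2.VideoCapture(path)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        yield frame
    cap.release()

def process_frame(frame):
    # Stand-in for any per-frame operation
    return cv2.GaussianBlur(frame, (25, 25), 0)

# The pool can only run as fast as the serial decode loop feeding it,
# and shipping every frame to a worker adds pickling overhead on top.
processed = Parallel(n_jobs=4)(
    delayed(process_frame)(f) for f in frames("input.mp4"))
```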

Approach #2

Another approach to speeding up video processing was highlighted by the brilliant Adrian Rosebrock on his blog, pyimagesearch. In that post, he discusses shifting frame decoding to a separate thread and storing the decoded frames until the main processing thread needs to retrieve them.

New frames are decoded in a dedicated thread and enqueued at the back of a queue. They are then dequeued from the front as and when needed by the main processing thread (source: pyimagesearch)
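For illustration, here’s a minimal sketch of this producer-consumer pattern, loosely modeled on pyimagesearch’s FileVideoStream; the class name and input.mp4 are placeholders, not Adrian’s actual code:

```python
import cv2
from queue import Queue
from threading import Thread

class ThreadedReader:
    def __init__(self, path, maxsize=128):
        self.cap = cv2.VideoCapture(path)
        self.q = Queue(maxsize=maxsize)
        Thread(target=self._decode, daemon=True).start()

    def _decode(self):
        # Producer: decode frames ahead of time and enqueue them
        while True:
            ret, frame = self.cap.read()
            self.q.put((ret, frame))  # blocks while the queue is full
            if not ret:
                break
        self.cap.release()

    def read(self):
        # Consumer: dequeue the next decoded frame
        return self.q.get()

stream = ThreadedReader("input.mp4")
while True:
    ret, frame = stream.read()
    if not ret:
        break
    blurred = cv2.GaussianBlur(frame, (25, 25), 0)
```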

Though this is a brilliant idea in itself, a big limitation of this approach is that it can exploit only two threads/cores of your machine simultaneously. Also, for some reason, the code in Adrian’s blog doesn’t seem to extend directly to Python 3 (see the comments on the post; perhaps Python has changed the internal workings of its multithreading library?). Either way, with some effort in this direction, we should at least end up slightly better off than where we started.

Still not quite the speed-up we’re looking for, though. What we’d like is to let a Threadripper or a Xeon massively distribute the workload within the same video. To be able to do something like that, let’s look back at Approach #1. The main issue with our first approach was the interdependence between the cores, which is unavoidable because of how video encoding is fundamentally “blocking”.

However, it should logically be easy to work around this by forcing the cores to work on completely different pre-allocated fragments. And this brings us to our final approach.

Approach #3

To be able to do this, all we have to do is get each core to seek to a completely different segment of the video and operate on it.

A better way to distribute video processing across your cores

Say we had 10,000 frames and 4 cores. This would mean dedicating each core to a fixed, consecutive quarter (2,500 frames) of those 10,000. This way, the cores don’t have to wait on each other to decode the next frame. The only task that remains after they’re done is to re-assemble the processed frames in the right order. Following this methodology lets us easily parallelize our video processing pipeline.

Doing this in Python is as easy as it gets. For reference, this is probably what your normal video processing code would look like:
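A minimal sketch, with input.mp4, output.mp4, and the blur operation as placeholders:

```python
import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("output.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"),
                      fps, (width, height))

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:  # ** (see the notes at the end)
        break
    # Per-frame operation: a heavy blur stands in for the real work
    out.write(cv2.GaussianBlur(frame, (25, 25), 0))

cap.release()
out.release()
```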

A normal code to perform an operation on a video

If we had to parallelize this code, we’d use the Python multiprocessing library:
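Here’s a minimal sketch of the idea, reusing the placeholder names from above: each worker opens its own capture, seeks to the start of its fragment with CAP_PROP_POS_FRAMES, and writes a separate output file.

```python
import cv2
from multiprocessing import Pool

INPUT = "input.mp4"
NUM_PROCESSES = 4

def process_fragment(group_number):
    # Each worker opens its own handle and jumps to its fragment
    cap = cv2.VideoCapture(INPUT)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Remainder frames are dropped here for simplicity
    frames_per_worker = frame_count // NUM_PROCESSES
    cap.set(cv2.CAP_PROP_POS_FRAMES, group_number * frames_per_worker)

    out = cv2.VideoWriter("output_{}.mp4".format(group_number),
                          cv2.VideoWriter_fourcc(*"mp4v"),
                          cap.get(cv2.CAP_PROP_FPS),
                          (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                           int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))))

    for _ in range(frames_per_worker):
        ret, frame = cap.read()
        if not ret:  # **
            break
        out.write(cv2.GaussianBlur(frame, (25, 25), 0))

    cap.release()
    out.release()

if __name__ == "__main__":
    with Pool(NUM_PROCESSES) as pool:
        pool.map(process_fragment, range(NUM_PROCESSES))
```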

Parallelizing the video processing by distributing video fragments across cores

And the big part is done. One important step is still left, however: all we have now is processed fragments, and we still need to merge them into a single video. This can be done easily using ffmpeg in the following manner:
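Here’s a minimal sketch that drives ffmpeg’s concat demuxer from Python; the fragment names match the placeholders above, and -c copy stitches the pieces together without re-encoding:

```python
import subprocess

NUM_PROCESSES = 4

# The concat demuxer takes a text file listing the fragments in order
with open("fragments.txt", "w") as f:
    for i in range(NUM_PROCESSES):
        f.write("file 'output_{}.mp4'\n".format(i))

# "-c copy" concatenates the streams without decoding them again
subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                "-i", "fragments.txt", "-c", "copy", "merged.mp4"],
               check=True)
```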

Merging the video fragments that are generated by our parallel processing code

And there we have it! A method to parallelize our video processing pipelines that scales with the number of cores we throw at it.

A speed-up of 6x is observed with 6 cores and 2x with 2 cores for the simple task of video blurring

You don’t have to take my word for it :D Try it out yourself and let me know in the comments below. All the code is available on GitHub at https://github.com/rsnk96/fast-cv

Notes:

  • If you’re wondering how cap.set(cv2.CAP_PROP_POS_FRAMES, n) jumps to a particular frame without having to decode all the frames before it, it’s because it jumps to the nearest keyframe. Check out the video linked above to understand this better.
  • cap.set(cv2.CAP_PROP_POS_FRAMES, n) is known not to seek to the specified frame accurately, perhaps because it seeks to the nearest keyframe. This means a couple of frames might be repeated at the points where the video is fragmented, so it is not advisable to adopt this method for frame-critical use cases.
  • The best performance might actually be obtained by a combination of Approaches 2 and 3, but it would involve a bit more effort to code up. Let me know in the comments if you manage to piece it together! :)
  • **: It is normal practice to include a check along the lines of if not ret: break when reading frames; the sketches above include it at the lines marked **.
  • This post is a summary of a talk I gave recently at PySangamam and PyCon India: https://www.youtube.com/watch?v=v29nBvfikcE
