
YOLO Object Detection on the Raspberry Pi

Running an object detection model on low-power devices

YOLO object detection results, Image by author

In the first part of this article, I tested "retro" versions of YOLO (You Only Look Once), a popular object detection library. The possibility to run a deep learning model using only OpenCV, without "heavy" frameworks like PyTorch or Keras, is promising for low-power devices, and I decided to go deeper into this topic and see how the latest YOLO v8 model works on a Raspberry Pi.

Let’s get into it.

Hardware

It is usually not a problem to run any model in the cloud, where the resources are virtually unlimited. But for hardware "in the field," there are many more constraints: limited RAM and CPU power, a different CPU architecture, older or incompatible software versions, a lack of a high-speed internet connection, and so on. Another big issue with cloud infrastructure is its cost. Let's say we are making a smart doorbell, and we want to add person detection to it. We can run a model in the cloud, but every API call costs money, and who will pay for that? Not every customer would be happy to have a monthly subscription for a doorbell or any similar "smart" device, so it can be essential to run a model locally, even if the results are not as good.

For this test, I will run the YOLO v8 model on a Raspberry Pi:

Raspberry Pi 4, Image source https://en.wikipedia.org/wiki/Raspberry_Pi

The Raspberry Pi is a cheap credit-card-size single-board computer that runs Raspbian or Ubuntu Linux. I will test two different versions:

  • Raspberry Pi 3 Model B, made in 2015. It has a 1.2 GHz Cortex-A53 ARM CPU and 1 GB of RAM.
  • Raspberry Pi 4, made in 2019. It has a 1.8 GHz Cortex-A72 ARM CPU and 1, 2, 4, or 8 GB of RAM.

Raspberry Pi computers are widely used nowadays, not only for hobby and DIY projects but also for embedded industrial applications (a Raspberry Pi Compute Module was designed especially for that). So, it is interesting to see how these boards can handle such computationally demanding operations as Object Detection. For all further tests, I will use this image:

Test image, made by author

Now, let’s see how it works.

A "Standard" YOLO v8 Version

As a warm-up, let’s try the standard version, as it is described on the official GitHub page:

from ultralytics import YOLO
import cv2
import time

model = YOLO('yolov8n.pt')

img = cv2.imread('test.jpg')

# First run to 'warm-up' the model
model.predict(source=img, save=False, save_txt=False, conf=0.5, verbose=False)

# Second run
t_start = time.monotonic()
results = model.predict(source=img, save=False, save_txt=False, conf=0.5, verbose=False)
dt = time.monotonic() - t_start
print("dT:", dt)

# Show results
boxes = results[0].boxes
names = model.names
confidence, class_ids = boxes.conf, boxes.cls.int()
rects = boxes.xyxy.int()
for ind in range(boxes.shape[0]):
    print("Rect:", names[class_ids[ind].item()], confidence[ind].item(), rects[ind].tolist())

In a "production" system, images would be taken from a camera; for our test, I am using the "test.jpg" file described before. I also executed the "predict" method twice to make the time estimation more accurate (the first run usually takes more time because the model needs to "warm up" and allocate all the needed memory). The Raspberry Pi is working in "headless" mode without a monitor, so I am using the console as an output; this is a more-or-less standard way most embedded systems work.
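For completeness, here is a minimal sketch of how the same prediction could be fed from a camera instead of a file (a USB camera on device index 0 is an assumption here; a CSI camera exposed through V4L2 would work the same way):

import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Grab a single frame from the first camera (device index 0 is an assumption)
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

if ret:
    results = model.predict(source=frame, save=False, conf=0.5, verbose=False)
    print(len(results[0].boxes), "objects detected")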

On the Raspberry Pi 3 with a 32-bit OS, this version does not work: pip cannot install an "ultralytics" module because of this error:

ERROR: Cannot install ultralytics

The conflict is caused by:
    ultralytics 8.0.124 depends on torch>=1.7.0

It turned out that PyTorch is only available for 64-bit ARM operating systems.
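A quick way to check which kind of OS is installed before attempting the installation is to look at the machine architecture reported by Python (this is a generic check, not something specific to YOLO):

import platform

# 'aarch64' means a 64-bit ARM OS, where PyTorch wheels are available;
# 'armv7l' means a 32-bit OS, where the "pip install ultralytics" step will fail
print(platform.machine())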

On the Raspberry Pi 4 with a 64-bit OS, the code indeed works, and the calculation took about 0.9 s.

The console output shows the detection time and each detected object with its class name, confidence, and bounding rectangle.

I also did the same experiment on the desktop PC to visualize the results:

YOLO v8 Nano detection results, Image by author

As we can see, even for a model of "nano" size, the results are pretty good.
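For readers who want to reproduce such a picture, a minimal sketch (assuming the Ultralytics package is installed on the desktop PC) can use the plot() helper of the results object:

import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
results = model.predict(source='test.jpg', conf=0.5, verbose=False)

# plot() returns a BGR image with the detected boxes and labels drawn on it
annotated = results[0].plot()
cv2.imwrite('test_annotated.jpg', annotated)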

Python ONNX Version

ONNX (Open Neural Network Exchange) is an open format built to represent machine learning models. It is also supported by OpenCV, so we can easily run our model this way. YOLO developers have already provided a command-line tool to make this conversion:

yolo export model=yolov8n.pt imgsz=640 format=onnx opset=12

Here, "yolov8n.pt" is the PyTorch model file that will be converted. The last letter "n" in the filename means "nano". Different models are available ("n" – nano, "s" – small, "m" – medium, "l" – large); for the Raspberry Pi, I will obviously use the smallest and fastest one.
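The same conversion can also be started from Python instead of the command line; a minimal sketch using the export method of the model object (the parameters mirror the command-line call above):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# Creates 'yolov8n.onnx' next to the original .pt file
model.export(format='onnx', imgsz=640, opset=12)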

Conversion can be done on the desktop PC, and a model can be copied to a Raspberry Pi using the "scp" command:

scp yolov8n.onnx pi@raspberrypi:/home/pi/Documents/YOLO

Now we are ready to prepare the source. I used an example from the Ultralytics repository, which I slightly modified to work on the Raspberry Pi:

import cv2
import numpy as np
import time

model: cv2.dnn.Net = cv2.dnn.readNetFromONNX("yolov8n.onnx")
names = ("person;bicycle;car;motorbike;aeroplane;bus;train;truck;boat;traffic light;fire hydrant;stop sign;parking meter;bench;bird;"
         "cat;dog;horse;sheep;cow;elephant;bear;zebra;giraffe;backpack;umbrella;handbag;tie;suitcase;frisbee;skis;snowboard;sports ball;kite;"
         "baseball bat;baseball glove;skateboard;surfboard;tennis racket;bottle;wine glass;cup;fork;knife;spoon;bowl;banana;apple;sandwich;"
         "orange;broccoli;carrot;hot dog;pizza;donut;cake;chair;sofa;pottedplant;bed;diningtable;toilet;tvmonitor;laptop;mouse;remote;keyboard;"
         "cell phone;microwave;oven;toaster;sink;refrigerator;book;clock;vase;scissors;teddy bear;hair dryer;toothbrush").split(";")

img = cv2.imread('test.jpg')
height, width, _ = img.shape
length = max((height, width))
image = np.zeros((length, length, 3), np.uint8)
image[0:height, 0:width] = img
scale = length / 640

# First run to 'warm-up' the model
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255, size=(640, 640), swapRB=True)
model.setInput(blob)
model.forward()

# Second run
t1 = time.monotonic()
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255, size=(640, 640), swapRB=True)
model.setInput(blob)
outputs = model.forward()
print("dT:", time.monotonic() - t1)

# Show results
outputs = np.array([cv2.transpose(outputs[0])])
rows = outputs.shape[1]

boxes = []
scores = []
class_ids = []
output = outputs[0]
for i in range(rows):
    classes_scores = output[i][4:]
    minScore, maxScore, minClassLoc, (x, maxClassIndex) = cv2.minMaxLoc(classes_scores)
    if maxScore >= 0.25:
        box = [output[i][0] - 0.5 * output[i][2], output[i][1] - 0.5 * output[i][3],
               output[i][2], output[i][3]]
        boxes.append(box)
        scores.append(maxScore)
        class_ids.append(maxClassIndex)

result_boxes = cv2.dnn.NMSBoxes(boxes, scores, 0.25, 0.45, 0.5)
for index in result_boxes:
    box = boxes[index]
    box_out = [round(box[0]*scale), round(box[1]*scale),
               round((box[0] + box[2])*scale), round((box[1] + box[3])*scale)]
    print("Rect:", names[class_ids[index]], scores[index], box_out)

As we can see, we no longer use PyTorch and the original Ultralytics library, but the required amount of code is bigger. We need to convert the image to a blob, which is required by the YOLO model. Before printing the result, we also need to convert the output rectangles back to the original image coordinates. But as an advantage, this code runs on "pure" OpenCV without any additional dependencies.
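As a small sanity check of what this code expects, the raw network output for a 640×640 input and the 80-class COCO model has the shape (1, 84, 8400), where 84 is 4 box coordinates plus 80 class scores per candidate; this is why the output is transposed before iterating over the rows. A minimal standalone check:

import cv2
import numpy as np

model = cv2.dnn.readNetFromONNX("yolov8n.onnx")

# Feed a dummy 640x640 blob just to inspect the output tensor layout
blob = np.zeros((1, 3, 640, 640), np.float32)
model.setInput(blob)
out = model.forward()
print(out.shape)  # expected: (1, 84, 8400) for the 80-class "nano" model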

On the Raspberry Pi 3, the computation time is 28 seconds. Just for fun, I also loaded the "medium" model (it’s a 101 MB ONNX file!) to see what would happen. Surprisingly, the application did not crash, but the calculation time was 224 seconds (almost 4 minutes). It looks obvious that the hardware from 2015 is not well suited for running SOTA models from 2023, but it was interesting to see how it works.

On the Raspberry Pi 4 the computation time is 1.08 seconds.

C++ ONNX Version

Finally, let's try the "heaviest guns" in our toolset and write the same code in C++. But before doing this, we need to install the OpenCV libraries and headers for C++. The easiest way is to run a command like "sudo apt install libopencv-dev". But, at least on Raspbian, this does not work: the latest version available via "apt" is 4.2, and the minimum OpenCV version required for loading a YOLO model is 4.5. So, we will need to build OpenCV from source.

I will use OpenCV 4.7, the same version that was used in my Python tests:

sudo apt update
sudo apt install g++ cmake libavcodec-dev libavformat-dev libswscale-dev libgstreamer-plugins-base1.0-dev libgstreamer1.0-dev 
sudo apt install libgtk2.0-dev libcanberra-gtk* libgtk-3-dev libpng-dev libjpeg-dev libtiff-dev
sudo apt install libxvidcore-dev libx264-dev libgtk-3-dev libgstreamer1.0-dev gstreamer1.0-gtk3

wget https://github.com/opencv/opencv/archive/refs/tags/4.7.0.tar.gz
tar -xvzf 4.7.0.tar.gz
rm 4.7.0.tar.gz
cd opencv-4.7.0
mkdir build && cd build

cmake -D WITH_QT=OFF -D WITH_VTK=OFF -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_FFMPEG=ON -D PYTHON3_PACKAGES_PATH=/usr/lib/python3/dist-packages -D BUILD_EXAMPLES=OFF ..
make -j2 && sudo make install && sudo ldconfig

The Raspberry Pi is not the fastest Linux computer in the world, and the compilation process takes about 2 hours. And for a Raspberry Pi 3 with 1 GB of RAM, the swap file size should be increased to at least 512 MB; otherwise, the compilation will fail.

The C++ code itself is short:

#include <opencv2/opencv.hpp>
#include <vector>
#include <ctime>
#include "inference.h"

int main(int argc, char **argv) {
    Inference inf("yolov8n.onnx", cv::Size(640, 640), "", false);

    cv::Mat frame = cv::imread("test.jpg");

    // First run to 'warm-up' the model
    inf.runInference(frame);

    // Second run
    const clock_t begin_time = clock();

    std::vector<Detection> output = inf.runInference(frame);

    printf("dT: %fn",  float(clock() - begin_time)/CLOCKS_PER_SEC);

    // Show results
    for (auto &amp;detection : output) {
        cv::Rect box = detection.box;

        printf("Rect: %s %f: %d %d %d %dn", detection.className.c_str(), detection.confidence,
                                             box.x, box.y, box.width, box.height);        
    }

    return 0;
}

In this code, I used the "inference.h" and "inference.cpp" files from the Ultralytics GitHub repository; these files should be placed in the same folder. I also executed the "runInference" method twice, the same way as in the previous tests. We can now compile the source using this command:

c++ yolo1.cpp inference.cpp -I/usr/local/include/opencv4 -L/usr/local/lib -lopencv_core -lopencv_dnn -lopencv_imgcodecs -lopencv_imgproc -O3 -o yolo1

The results were surprising: the C++ version was significantly slower than the previous one! On the Raspberry Pi 3, the execution time was 110 seconds, which is more than 3 times longer than the Python version. On the Raspberry Pi 4, the computation time was 1.79 seconds, which is about 1.5 times longer. It is hard to say exactly why. The OpenCV library for Python was installed using pip, but OpenCV for C++ was built from source, and maybe some ARM CPU optimizations were not enabled. If some readers know the reason, please write in the comments below. Anyway, it was interesting to see that such an effect can happen.
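One way to compare the two builds is to look at OpenCV's build information, which lists the CPU baseline, dispatched optimizations (NEON and others), and the parallel framework; a minimal check in Python (an equivalent cv::getBuildInformation() call exists in C++):

import cv2

# Print only the lines describing CPU and parallelization-related options
for line in cv2.getBuildInformation().splitlines():
    if any(key in line.lower() for key in ("cpu", "neon", "parallel", "dispatch")):
        print(line)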

Conclusion

I can make an "educated guess" that most data scientists and data engineers are using their models in the cloud or at least on high-end equipment and have never tried running code "in the field" on embedded hardware. The goal of this text was to give readers some insights into how it works. In this article, we tried to run a YOLO v8 model on different versions of the Raspberry Pi, and the results were pretty interesting.

  • Running deep learning models on low-power devices can be a challenge. Even a Raspberry Pi 4, the most powerful Raspberry Pi model at the moment of writing this article, was able to provide only ~1 FPS with the YOLO v8 Nano model. Of course, there is room for improvement. Some optimizations may be possible, like converting the model into FP16 (a floating-point format with less accuracy) or even INT8 format. Finally, a simpler model trained on a limited dataset can be used. Last but not least, if more computing power is still required, the code can run on special single-board computers like the NVIDIA Jetson Nano, which has CUDA support and can be much faster.
  • At the beginning of this article, I wrote that "the possibility to run a deep learning model using only OpenCV, without heavy frameworks like PyTorch or Keras, is promising for low-power devices". In practice, it turned out that PyTorch is an effective and highly optimized framework. The original YOLO version, based on PyTorch, was the fastest one, and the OpenCV ONNX code was 10–20% slower. But at the moment of writing this article, PyTorch is not available for 32-bit ARM CPUs, so on some platforms, there may just be no other choice.
  • The results with the C++ version were even more interesting. As we can see, it can be a challenge to turn on the proper optimizations, especially for an embedded architecture. And without going deep into these nuances, custom-built OpenCV C++ code can run even slower than a prebuilt Python version installed with pip.

Thanks for reading. If anyone is interested in testing FP16 or INT8 YOLO models on the same hardware or on an NVIDIA Jetson Nano board, please write in the comments, and I will write the next part of this article about it.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.

