Face Mask Detection using darknet’s YOLOv3

COVID-19: Tutorial on how to build a face mask detector using YOLOv3. For inference, both video streams and images can be used.

Stefanos Kanellopoulos
Towards Data Science


This article aims to offer a complete, step-by-step guide for anyone who wants to train an object detector from the YOLO family on custom data. Due to the pandemic, such a task is quite topical.

For this tutorial, I am going to use YOLOv3, one of the most widely used versions of the YOLO family: a state-of-the-art object detection system for real-time scenarios that is remarkably accurate and fast. Newer versions such as YOLOv4 and YOLOv5 might achieve even better results, and in my next articles I am going to experiment with these architectures as well and share my findings with you.

Assuming that you already have a grasp of object detection with deep learning techniques, and specifically that you know the basics of YOLO, let’s dive into our task…

Photo by Anshu A on Unsplash

You can find this project in my GitHub repo.

Environment 🌠

In order to implement this project, I exploited Google Colab’s resources. I ran my first pre-processing experiments on my laptop, since they were not computationally expensive, but the model was trained on Colab using a GPU.

GPU can be activated on Colab via Edit->Notebook Settings, Image by author

Dataset 📚

First things first: in order to build a mask detector we need relevant data and, because of the nature of YOLO, that data must be annotated with bounding boxes. One option is to build our own dataset by gathering images either from the web or by taking pictures of friends/acquaintances and annotating them by hand with a tool such as LabelImg. However, both ideas would be exceptionally tedious and time-consuming (especially the latter). The other option, by far the most viable for my purpose, is to use a publicly available dataset. I chose the Face Mask Detection dataset from Kaggle and I downloaded it directly to my Google Drive (you can check out how to do so here). The downloaded dataset consists of two folders:

  • images, which comprises 853 .png files
  • annotations, which comprises 853 corresponding .xml annotations.

After we download the dataset, we need to convert the .xml files into .txt files in the YOLO format in order to train our model. An example is shown below:

Assume an image annotated with 3 bounding boxes (one <object> … </object> block per box) in .xml format, like the files in the annotations folder.
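Such a Pascal VOC annotation looks roughly like the following (a reconstructed, illustrative example for maksssksksss0.png, assuming an image size of 512×366; the box coordinates match the .txt example further below, while the class names of the two non-“Good” faces are illustrative and irrelevant fields are omitted):

<annotation>
    <filename>maksssksksss0.png</filename>
    <size>
        <width>512</width>
        <height>366</height>
        <depth>3</depth>
    </size>
    <object>
        <name>without_mask</name>
        <bndbox>
            <xmin>79</xmin>
            <ymin>105</ymin>
            <xmax>109</xmax>
            <ymax>142</ymax>
        </bndbox>
    </object>
    <object>
        <name>with_mask</name>
        <bndbox>
            <xmin>185</xmin>
            <ymin>100</ymin>
            <xmax>226</xmax>
            <ymax>144</ymax>
        </bndbox>
    </object>
    <object>
        <name>without_mask</name>
        <bndbox>
            <xmin>325</xmin>
            <ymin>90</ymin>
            <xmax>360</xmax>
            <ymax>141</ymax>
        </bndbox>
    </object>
</annotation>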

To create a .txt file we need 5 things from each .xml file. For each <object> … </object> block, fetch the class (the <name>…</name> field) and the coordinates of the bounding box (the 4 values in <bndbox>…</bndbox>), and convert the corner coordinates into a centre, width, and height normalised by the image dimensions. The desired format looks like this:

<class_name> <x_center> <y_center> <width> <height>

To achieve that, I created a script that fetches the aforementioned 5 attributes for each object in each .xml file and creates the corresponding .txt files. (Note: more detailed steps about the conversion approach can be found in my script).
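My actual conversion script lives in the repo, but a minimal sketch of the logic could look like this (the folder name and the Good/Bad class mapping are assumptions for illustration; the mapping reflects the category fusion described in the Final Step section below):

import os
import xml.etree.ElementTree as ET

# Assumed mapping after fusing the original 3 Kaggle categories into Good/Bad
CLASS_MAP = {"with_mask": 0, "without_mask": 1, "mask_weared_incorrect": 1}

def xml_to_yolo(xml_path, txt_path):
    root = ET.parse(xml_path).getroot()
    img_w = int(root.find("size/width").text)
    img_h = int(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls = CLASS_MAP[obj.find("name").text]
        box = obj.find("bndbox")
        xmin, ymin = int(box.find("xmin").text), int(box.find("ymin").text)
        xmax, ymax = int(box.find("xmax").text), int(box.find("ymax").text)
        # YOLO wants the box centre and size, normalised by the image dimensions
        x_c = (xmin + xmax) / 2 / img_w
        y_c = (ymin + ymax) / 2 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{cls} {x_c} {y_c} {w} {h}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines) + "\n")

for xml_name in os.listdir("annotations"):
    xml_to_yolo(os.path.join("annotations", xml_name), xml_name.replace(".xml", ".txt"))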

For example, an image1.jpg must have an associated image1.txt containing:

1 0.18359375 0.337431693989071 0.05859375 0.10109289617486339
0 0.4013671875 0.3333333333333333 0.080078125 0.12021857923497267
1 0.6689453125 0.3155737704918033 0.068359375 0.13934426229508196

And this is the exact conversion of the above .xml file into a .txt file. (Note: it is crucial to keep the images and their corresponding .txt annotations in the same folder).

Of course, before proceeding with the training we need to be absolutely sure that the conversion was correct and that we are going to feed our network valid data. To do so, I created a script that takes an image and its corresponding .txt annotation from a given folder and displays the image with the ground-truth bounding boxes. For the above example, the result is shown below:

maksssksksss0.png from Kaggle's publicly available Face Mask Detection dataset
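For reference, a minimal sketch of such a verification step could look like this (my actual show_bb.py is in the repo; the example paths and the colour choices are illustrative):

import cv2

def show_boxes(img_path, txt_path):
    img = cv2.imread(img_path)
    img_h, img_w = img.shape[:2]
    with open(txt_path) as f:
        for line in f:
            cls, x_c, y_c, w, h = line.split()
            x_c, y_c, w, h = float(x_c) * img_w, float(y_c) * img_h, float(w) * img_w, float(h) * img_h
            x1, y1 = int(x_c - w / 2), int(y_c - h / 2)
            x2, y2 = int(x_c + w / 2), int(y_c + h / 2)
            colour = (0, 255, 0) if cls == "0" else (0, 0, 255)  # Good in green, Bad in red
            cv2.rectangle(img, (x1, y1), (x2, y2), colour, 2)
    cv2.imshow("ground truth", img)  # on Colab, use cv2_imshow from google.colab.patches instead
    cv2.waitKey(0)

show_boxes("images/maksssksksss0.png", "maksssksksss0.txt")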

And this is when we know that we are doing well so far, but let’s go on…

Train-Test Split ❇️

In order to train our model and validate it during the training phase, we have to split our data into two sets, the training and the validation set. The proportion was 90%–10% respectively, so I created two new folders and put 86 images with their corresponding annotations into the test folder and the remaining 767 images into the train folder.
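A minimal sketch of how this split can be scripted (the folder names, the seed, and the location of the .txt annotations are assumptions; 853 images * 0.9 ≈ 767):

import os
import random
import shutil

random.seed(42)  # arbitrary seed, only to make the split reproducible

images = sorted(f for f in os.listdir("images") if f.endswith(".png"))
random.shuffle(images)
split = int(0.9 * len(images))  # 90% train, 10% test

for folder, subset in [("mask_yolo_train", images[:split]), ("mask_yolo_test", images[split:])]:
    os.makedirs(folder, exist_ok=True)
    for name in subset:
        shutil.copy(os.path.join("images", name), folder)
        # keep each YOLO .txt annotation next to its image
        shutil.copy(name.replace(".png", ".txt"), folder)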

Bear with me for a little more, we need some final touches and we are ready to train our model 😅

Photo by bruce mars on Unsplash

Clone the darknet framework ⬇️

The next step is to clone the darknet repo by running:

!git clone https://github.com/AlexeyAB/darknet

and after that, we need to download the weights of the pre-trained model in order to apply transfer learning and not train the model from scratch.

!wget https://pjreddie.com/media/files/darknet53.conv.74

darknet53.conv.74 is the backbone of the YOLOv3 network. It was originally trained for classification on the ImageNet dataset and plays the role of the feature extractor. To use it for detection, the additional detection layers of the YOLOv3 network are randomly initialized prior to training and, of course, get their proper values during the training phase.
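Before training, darknet also has to be compiled inside the cloned folder. A typical sequence on Colab looks roughly like this (enabling GPU, cuDNN, and OpenCV in the Makefile is an assumption about the setup and may need adjusting to your environment):

%cd darknet
# enable GPU, cuDNN and OpenCV support in the Makefile before compiling (assumed setup)
!sed -i 's/GPU=0/GPU=1/; s/CUDNN=0/CUDNN=1/; s/OPENCV=0/OPENCV=1/' Makefile
!make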

Final Step 🚨

We need to create 5 files in order to complete our preparations and start training the model.

  1. face_mask.names: create a _.names file which contains the classes of the problem. In our case, the original Kaggle dataset has 3 categories: with_mask, without_mask, and mask_weared_incorrect. To simplify the task a little, I fused the two latter categories into one. Thus, for our task, we have two categories, Good and Bad, based on whether someone wears their mask appropriately.
Good
Bad

2. face_mask.data: create a _.data file that includes information relevant to our problem; it is going to be used by the program:

classes = 2
train = data/train.txt
valid = data/test.txt
names = data/face_mask.names
backup = backup/

Note: In case a backup folder does not exist, create one, because the weights are going to be saved there every 1000 iterations. These are effectively your checkpoints: in case of an unexpected interruption, you can continue the training process from them.
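From Colab, creating it inside the darknet folder is a one-liner:

!mkdir backup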

3. face_mask.cfg: This configuration file has to be adjusted to our problem: we need to copy yolov3.cfg, rename it to _.cfg, and apply the amendments described below:

  • change line batch to batch=64
  • change line subdivisions to subdivisions=16 (Note: in case an out-of-memory error occurs, increase this value to 32 or 64)
  • set the input dimensions to the default width=416, height=416. (Note: I started with this resolution and trained the model for 4000 iterations, but in order to achieve more accurate predictions I then increased the resolution and continued training for 3000 more iterations).
  • change line max_batches to (#classes * 2000), thus 4000 iterations for our task (Note: in case you have only one category, you should not train your model for only 2000 iterations; 4000 iterations are suggested as the minimum for the model).
  • change line steps to 80% and 90% of max_batches, i.e. steps=3200,3600 for our case (80% * 4000 = 3200, 90% * 4000 = 3600).
  • Use ctrl+F and search for the word “yolo”. This will lead you straight to the [yolo] layers, where you want to do 2 things: change the number of classes (classes=2 in our case) and change the number of filters in the [convolutional] layer right above each [yolo] section. The filters value has to be (classes + 5) * 3, namely filters = (2 + 5) * 3 = 21 for our task. There are 3 [yolo] layers in the .cfg file, so you have to make these changes 3 times (see the excerpt below).
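For reference, after the edits each of the three [yolo] sections and the [convolutional] layer right above it should contain lines like the ones below (only the changed lines are shown; the rest of each layer definition stays as in yolov3.cfg):

[convolutional]
# filters = (classes + 5) * 3 = (2 + 5) * 3
filters=21

[yolo]
classes=2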

4. train.txt & test.txt files: These two files are referenced in the face_mask.data file and provide the model with the absolute path of each image. For example, a snippet of my train.txt file looks like this:

/content/gdrive/MyDrive/face_mask_detection/mask_yolo_train/maksssksksss734.png
/content/gdrive/MyDrive/face_mask_detection/mask_yolo_train/maksssksksss735.png
/content/gdrive/MyDrive/face_mask_detection/mask_yolo_train/maksssksksss736.png
/content/gdrive/MyDrive/face_mask_detection/mask_yolo_train/maksssksksss737.png
/content/gdrive/MyDrive/face_mask_detection/mask_yolo_train/maksssksksss738.png
...

(Note: As I mentioned earlier, the .png files should be located in the same folder as their corresponding .txt annotations)
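A minimal sketch of how these two listing files can be generated (the Drive base path follows the project structure shown below and will differ per setup):

import os

base = "/content/gdrive/MyDrive/face_mask_detection"
for folder, out_name in [("mask_yolo_train", "train.txt"), ("mask_yolo_test", "test.txt")]:
    folder_path = os.path.join(base, folder)
    with open(out_name, "w") as f:
        for name in sorted(os.listdir(folder_path)):
            if name.endswith(".png"):
                # one absolute image path per line, exactly as darknet expects
                f.write(os.path.join(folder_path, name) + "\n")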

Hence, our project is structured like this:

MyDrive
├── darknet
│   ├── ...
│   ├── backup
│   ├── ...
│   ├── cfg
│   │   ├── face_mask.cfg
│   │   └── ...
│   └── data
│       ├── face_mask.data
│       ├── face_mask.names
│       ├── train.txt
│       └── test.txt
└── face_mask_detection
    ├── annotations (contains the original .xml files)
    ├── images (contains the original .png images)
    ├── mask_yolo_test (contains .png & .txt files for testing)
    ├── mask_yolo_train (contains .png & .txt files for training)
    ├── show_bb.py
    └── xml_to_yolo.py

Let the par…training begin 📈

After compiling darknet, we need to make the binary executable, as shown below:

!chmod +x ./darknet

and finally, we can start training by running the command below. (Note: the command as written resumes from the last saved checkpoint in backup/; for the very first run, pass the downloaded darknet53.conv.74 weights in its place):

!./darknet detector train data/face_mask.data cfg/face_mask.cfg backup/face_mask_last.weights -dont_show -i 0 -map

The -map flag will inform us about the progress of the training by printing out important metrics such as the average Loss, Precision, Recall, Average Precision (AP), mean Average Precision (mAP), etc.

Image by author

The mAP indicator in the console is considered a better metric than the Loss, so keep training as long as the mAP increases.

(Note: The training process might take many hours depending on various parameters… this is normal. For this project, training the model up to this point took me about 15 hours, but I got my first impressions of the model after about 7 hours, when 4000 iterations had been completed).

It’s testing (and discussion) time 🎉

And yes… the model is ready to be demonstrated!!! Let’s try it out on some images it has never seen before. To do so we need to run the following (darknet will then ask for the path of the image to test):

!./darknet detector test data/face_mask.data cfg/face_mask.cfg backup/face_mask_best.weights

Have you noticed that we used face_mask_best.weights and not face_mask_final.weights? Fortunately, our model saves the best weights (an mAP@0.5 of 87.16% was achieved) in the backup folder, in case we train it for more iterations than we should (something that could lead to overfitting).

The examples shown below were taken from Pexels. They are high-resolution images and, to the naked eye, quite different from the training/testing data in several respects, so they effectively come from a different distribution. I chose such pictures to see how well the model generalizes.

(Left) Model’s prediction on a photo by Charlotte May from Pexels | (Middle) Model’s prediction on a photo by Tim Douglas from Pexels | (Right) Model’s prediction on a photo by Anna Shvets from Pexels

In the above examples, the model is accurate and fairly confident about its predictions. Something noteworthy is that, in the image on the right, the model is not confused by the mask placed on the globe. This reveals that the predictions are not based exclusively on the presence of a mask but also on the context around it.

(Left) Model’s prediction on a photo by Kindel Media from Pexels | (Right) Model’s prediction on a photo by Aleksandar Pasaric from Pexels

These two examples obviously show people who are not wearing a mask, and it seems pretty easy for the model to discern that too.

(Left) Model’s prediction on a photo by Norma Mortenson from Pexels | (Right) Model’s prediction on a photo by Life Matters from Pexels

In the two examples above, we can test the performance of the model in cases where both categories appear. The fact that the model can even identify faces in the blurry background is admirable. I also observe that the foreground prediction it is less sure about (only 38% on a clear region), compared to the prediction right behind it (100% on a blurry region), might be connected with the quality of the dataset the model was trained on, so it seems to be affected to a certain extent (at least it is not inaccurate 😅).

One final test 🐵

Of course, a big advantage of YOLO is its speed. For this reason, I also want to show you how it works when it takes as input a video:

!./darknet detector demo data/face_mask.data cfg/face_mask.cfg backup/face_mask_best.weights -dont_show vid1.mp4 -i 0 -out_filename res1.avi
Inference on a video stream, Image by author

Conclusion 👏

This was my first step-by-step tutorial on how to build your own detector using YOLOv3 on a custom dataset. I hope you found it useful. Feel free to give me feedback or ask any relevant questions.

Thank you very much for your time! See you soon… 😜
