
By: Ajay Arunachalam – Senior Data Scientist & Researcher (AI)
Hello, friends. In this blog post, I would like to share our work on the autonomous, machine-driven generation of data labels using AI.
Our full article is available here – https://lnkd.in/gJDKQCY
Before we peek into our approach, let's first understand what data labeling is, in layman's terms. In machine learning, data labeling is simply the process of identifying raw data (images, videos, audio files, text files, etc.) and adding one or more meaningful, informative labels to provide context, so that a machine learning model can learn and infer from it. Most state-of-the-art machine learning models rely heavily on the availability of large amounts of labeled data, making labeling an essential step in supervised tasks. Data labeling is required for a variety of use cases, including computer vision, natural language processing, and speech recognition. Traditionally, this tedious and mundane labeling process has largely been done by humans. To help minimize that insanely hard work and the effort of labeling data from scratch, we suggest an automated algorithmic solution that aims to remove much of the manual work.

Let's go through a reference example of where such labeled data is actually needed. Here, I will talk about computer vision tasks. Computer vision is all about replicating the complexity of human vision and our understanding of the surroundings. Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and for extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions. There are many different computer vision tasks, e.g., classification, detection, and segmentation; I will not go into their details here, but the diagram below provides a crisp overview of the goals of these tasks with an example object in context: a banana.
AN EXAMPLE CONTEXT – NEED OF LABELED DATA
![Classification vs. Detection vs. Semantic Segmentation vs Instance Segmentation, [Copyright & Image Adapted from https://www.cloudfactory.com/image-annotation-guide. Reposted with permission]](https://towardsdatascience.com/wp-content/uploads/2021/08/1ONE_rzuHuoGnHnbZRsRsPg.png)
For a supervised model to detect the object (the banana), annotated labels are fed to the model so that it can learn the representation of banana pixels and localize them within the context; the trained model can then infer on unseen/new data. The instance segmentation task aims to detect objects, localize them, and provide information about their count, size, and shape. We use one such state-of-the-art instance segmentation model, Mask R-CNN, as the core backbone of our framework, but one can use any other network architecture as per their requirement and objective. We stuck with Mask R-CNN due to its efficacy in detecting objects in an image while generating high-quality segmentation masks for each object. For our tested use case of COVID infection detection, the precise location of the infected regions is critical, so pixel-level detection was more appropriate in this context.
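As a tiny illustration of the count and size information an instance segmentation model yields: given the stack of boolean per-instance masks that a typical Mask R-CNN implementation returns (the (H, W, N) shape here is an assumption about the output convention), these statistics fall out directly:

```python
import numpy as np

def instance_stats(masks: np.ndarray):
    """masks: boolean array of shape (H, W, N), one channel per detected
    object (a common Mask R-CNN output convention; an assumption here)."""
    count = masks.shape[-1]                        # how many objects were found
    areas = masks.reshape(-1, count).sum(axis=0)   # pixel area (size) per object
    return count, areas
```

Shape information can similarly be recovered from each mask's contour, which Step 2 below exploits.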
OUR METHOD
The pipeline of our tool mainly consists of the detector & tracker, the auto-label module, and an I/O module for outputting and saving the machine-annotated labels to disk.
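To make the flow concrete, here is a rough, hypothetical sketch of how these modules could be wired together; `detector`, `tracker`, and `writer` stand in for the components described in the two steps that follow, and their interfaces are assumptions for illustration only:

```python
import cv2

def autolabel_video(video_path, detector, tracker, writer):
    """Hypothetical driver loop: detect and track infected regions per
    frame, then hand the results to the auto-label / I/O modules."""
    cap = cv2.VideoCapture(video_path)
    frame_id = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # detector & tracker module (Step 1)
        r = detector.detect([frame], verbose=0)[0]   # masks, boxes, classes
        tracked = tracker.update(r["rois"])          # stable IDs per region
        # auto-label + I/O module (Step 2)
        writer.save(frame_id, frame, r, tracked)
        frame_id += 1
    cap.release()
```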

Step 1: Object Detection & Tracking for Pixel-Level Classification
A custom, weakly trained Mask R-CNN model was used for COVID infection detection with very few labeled instances (< 10 samples). To label the infected regions, we used the VGG Image Annotator (VIA). VIA is a simple, standalone manual annotation tool for image, audio, and video data. It runs in a web browser and does not require any installation or setup: the complete software fits in a single self-contained HTML page of less than 400 kilobytes that runs as an offline application in most modern web browsers. VIA is an open-source project based solely on HTML, JavaScript, and CSS (no dependency on external libraries), developed at the Visual Geometry Group (VGG) and released under the BSD-2-Clause license, which allows it to be used in both academic projects and commercial applications. The detector is used to obtain the localized mask, bounding box, and class. Next, to uniformly track and label the multiple infected regions along the input video stream, we used the centroid tracking algorithm. A snippet of our Mask R-CNN COVID detector is given below.
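In sketch form, a minimal inference setup built on the open-source matterport Mask_RCNN library (linked in the references) might look like the following; the configuration values and weight-file name are illustrative assumptions rather than our exact training setup:

```python
import cv2
from mrcnn.config import Config
from mrcnn import model as modellib

class CovidInferenceConfig(Config):
    # Illustrative values; the actual training configuration may differ.
    NAME = "covid"
    NUM_CLASSES = 1 + 1            # background + infected region
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1             # run inference one frame at a time
    DETECTION_MIN_CONFIDENCE = 0.9

model = modellib.MaskRCNN(mode="inference",
                          config=CovidInferenceConfig(),
                          model_dir="./logs")
model.load_weights("mask_rcnn_covid.h5", by_name=True)  # hypothetical weights file

# Load one CT-scan frame as RGB (hypothetical file name).
image = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)
r = model.detect([image], verbose=0)[0]
# r["rois"]      -> bounding boxes as (y1, x1, y2, x2)
# r["masks"]     -> boolean masks, shape (H, W, N)
# r["class_ids"] -> class index per detection
# r["scores"]    -> confidence per detection
```

For the tracking side, a simplified sketch of centroid tracking is shown below: each new detection greedily inherits the ID of the nearest previously seen centroid, otherwise it is registered as a new object (a full implementation would also handle disappearances and enforce one-to-one matching):

```python
import numpy as np

class CentroidTracker:
    """Minimal centroid tracker: keep region IDs stable across frames by
    matching each new box centroid to the closest known centroid."""
    def __init__(self, max_dist=50):
        self.next_id = 0
        self.objects = {}          # object ID -> last known centroid
        self.max_dist = max_dist   # matching threshold in pixels

    def update(self, boxes):
        assigned = {}
        for (y1, x1, y2, x2) in boxes:
            c = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
            # find the nearest previously tracked centroid within threshold
            best_id, best_d = None, self.max_dist
            for oid, prev in self.objects.items():
                d = np.linalg.norm(c - prev)
                if d < best_d:
                    best_id, best_d = oid, d
            if best_id is None:    # no match: register a new object
                best_id = self.next_id
                self.next_id += 1
            self.objects[best_id] = c
            assigned[best_id] = c
        return assigned
```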
Step 2: Frame-by-Frame Data Labeling
Inference from the pre-trained detector model is used to get the positions of the bounding boxes and to create JSON metadata. Once a frame is segmented using Mask R-CNN, the corresponding regions of interest (ROIs) are generated. Next, the mask for each ROI is generated, followed by contour detection over the entire image frame. Then, the (x, y) coordinates are extracted from the contours. Finally, these shape, region, and coordinate attributes are saved to disk frame by frame. A snippet of our auto-labeling algorithm is given below.
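In sketch form (assuming OpenCV ≥ 4 for the `findContours` signature, and following the VIA 2.x project format as we understand it; the label name and file layout are assumptions), the conversion from masks to saved labels could look like this:

```python
import json
import os
import cv2
import numpy as np

def save_via_labels(frame_path, masks, out_dir, label="infection"):
    """Convert instance masks (H, W, N) into VIA-style polygon regions
    and save one JSON metadata file per frame."""
    regions = []
    for i in range(masks.shape[-1]):
        mask = masks[:, :, i].astype(np.uint8)
        # detect the outer contour of each segmented region
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for cnt in contours:
            pts = cnt.squeeze(1)                 # (N, 1, 2) -> (N, 2)
            if pts.ndim != 2 or len(pts) < 3:    # a polygon needs >= 3 points
                continue
            regions.append({
                "shape_attributes": {
                    "name": "polygon",
                    "all_points_x": pts[:, 0].tolist(),
                    "all_points_y": pts[:, 1].tolist(),
                },
                "region_attributes": {"name": label},
            })
    # VIA keys image metadata by "<filename><filesize>"
    size = os.path.getsize(frame_path)
    name = os.path.basename(frame_path)
    metadata = {f"{name}{size}": {
        "filename": name,
        "size": size,
        "regions": regions,
        "file_attributes": {},
    }}
    with open(os.path.join(out_dir, name + ".json"), "w") as f:
        json.dump(metadata, f)
```

A JSON file produced this way can then be loaded back into VIA for human review or correction of the machine-generated labels.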
EXAMPLE – COVID-19 INFECTION DETECTION & AUTO-LABELING
We tested our method with the goal of generating automated computer labels for COVID-infected regions. The machine-generated label and the human-annotated label are shown side by side below. It can be seen that the auto-annotation engine generates synthetic labels of reasonably good quality, which can be used to re-train the object detection model or to generate more annotated data for different tasks.
![Comparison of Machine-Generated label vs. Human-Annotated label of Covid Infected Region in lung CT scans, [Image adapted from & Copyright: doi: 10.1109/TEM.2021.3094544. Reposted with permission]](https://towardsdatascience.com/wp-content/uploads/2021/08/1dwceVJd9yjU3blbPGVQF-g.png)
SUMMARY
Data labeling is a non-trivial task and one of the critical components of the supervised learning pipeline, and it demands a lot of manual effort. So, can we get the bulk of this mundane, labour-intensive, and time-consuming effort to be driven autonomously by machines, minimizing the human workload? We focus on this generic, universal problem with an intuitive approach that largely alleviates the bottlenecks of having limited labels or needing to label tons of instances all by yourself from scratch.
Note:- Our tool is currently in the alpha testing phase. Presently, our framework is based on Mask R-CNN and the VIA annotation format. We aim to generalize the prototype to include different state-of-the-art detectors, such as YOLO, along with the corresponding YOLO-compatible annotation format. Further, we also plan to integrate the COCO annotation format. It would be worthwhile to integrate all the different image-annotation formats into our framework while providing support for the different libraries, i.e., Torch, TensorFlow, Caffe, etc.
CONTACT ME
You can reach me at [email protected] or connect with me on LinkedIn.
Thanks for reading.
Keep learning!!! Check out my GitHub page here.
References:-
https://whatis.techtarget.com/definition/data-labeling
https://aws.amazon.com/sagemaker/groundtruth/what-is-data-labeling/
https://www.geeksforgeeks.org/object-detection-vs-object-recognition-vs-image-segmentation/
https://www.robots.ox.ac.uk/~vgg/software/via/
https://github.com/matterport/Mask_RCNN
Image Annotation for Computer Vision – https://www.cloudfactory.com/image-annotation-guide
IEEE Transactions on Engineering Management – doi: 10.1109/TEM.2021.3094544