Machine Learning is Not All You Need: A Case Study on Signature Detection

Machine learning should not be your go-to solution for every task. Consider the KISS principle, like I did for signature detection.

Toon Beerten
Towards Data Science


Image by author

In this article, I present a case study demonstrating that machine learning should not be your go-to solution for every task. Simpler techniques can give equally good results and are often easier to implement.

Case Study: Signature Detection

Imagine we have a pile of contracts and we need to know whether they are signed or not. This scenario involves signature detection — reliably identifying whether a signature appears in a specific location or not — assuming you already know the rough location where a signature should be (e.g. south-east). In ancient times this task was done by binarizing the image and counting the black pixels in an area. If a signature is present, the black pixel count would surpass a threshold. But in 2023, how could we do this differently?
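For reference, here is a minimal sketch of that classic approach (the region coordinates and the pixel-count threshold are made-up placeholders):

import cv2

def has_signature_naive(image_path, box, min_ink_pixels=500):
    # 'box' is the (x, y, w, h) region where a signature is expected,
    # e.g. the south-east corner of the page; 'min_ink_pixels' is a guess
    x, y, w, h = box
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    roi = gray[y:y + h, x:x + w]
    # Otsu binarization with inversion: ink pixels become non-zero
    _, binary = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return cv2.countNonZero(binary) > min_ink_pixels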

The Machine Learning Approach

For this, we can use Grounding DINO, a state-of-the-art zero-shot object detection model. The input to the model is an image combined with a text prompt, and the output consists of rectangles indicating potential locations, with associated confidence scores. While this may seem like an ideal solution at first glance, there are limitations worth considering. Let’s try it out with three different prompts: ‘signature’, ‘handwriting’ and ‘scribble’.
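Running such a zero-shot query only takes a few lines. Below is a minimal sketch, assuming the Hugging Face transformers port of Grounding DINO; the model id, file name and thresholds are illustrative:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("contract.png").convert("RGB")
text = "signature."  # prompts are lowercase phrases ending with a period

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the raw outputs into boxes, scores and matched phrases
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]]
)
print(results[0]["boxes"], results[0]["scores"], results[0]["labels"])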

Prompt results with ‘signature’, ‘handwriting’ and ‘scribble’ respectively. Images by author.

You can see that the results heavily depend on the prompt, not to mention that it takes about 30 seconds on CPU before the results show up. That is because this is a foundation model, trained extensively on thousands of categories beyond signatures alone. What can we do to make it more accurate and faster? We could use Autodistill (tutorial), which uses Grounding DINO to train a YOLOv8 model, effectively using a foundation model to train a lighter, supervised one. The workflow would be to collect a sizeable dataset of signed documents, find a good prompt to generate labelled data, and ultimately train a YOLOv8 model on it.
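A rough sketch of that workflow, based on the Autodistill documentation at the time of writing (folder names, prompt and epoch count are placeholders):

from autodistill.detection import CaptionOntology
from autodistill_grounding_dino import GroundingDINO
from autodistill_yolov8 import YOLOv8

# Map a text prompt to the class name we want in the trained model
ontology = CaptionOntology({"signature": "signature"})

# Use Grounding DINO to auto-label a folder of scanned contracts
base_model = GroundingDINO(ontology=ontology)
base_model.label(input_folder="./contracts", extension=".jpg", output_folder="./dataset")

# Train a lightweight YOLOv8 model on the auto-labelled data
target_model = YOLOv8("yolov8n.pt")
target_model.train("./dataset/data.yaml", epochs=50)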

You can imagine this takes some serious time and effort. But is there another way?

Alternative Approach: OpenCV

OpenCV is an open-source computer vision library that provides a wide range of functionalities for real-time image and video processing, analysis, and understanding, using optimized algorithms.

The connectedComponentsWithStats function in OpenCV is used to label and analyze image regions (connected components) based on their pixel connectivity, and additionally calculates various statistics such as area and bounding box dimensions for each labeled region.
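A quick look at what the function returns (the file name is a placeholder):

import cv2

gray = cv2.imread("contract.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8, ltype=cv2.CV_32S)

# 'labels' assigns every pixel a component id; 'stats' holds one row per component:
# [CC_STAT_LEFT, CC_STAT_TOP, CC_STAT_WIDTH, CC_STAT_HEIGHT, CC_STAT_AREA], label 0 being the background
for i in range(1, num_labels):
    print(i, stats[i, cv2.CC_STAT_AREA], tuple(centroids[i]))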

To make it more comprehensible, I created this image. It is a cutout of the area with the signature. Each island of connected pixels has a color representing a single connected component (or label).

Image by author.

Knowing the above, let’s dive into the intuition behind this computer vision approach. The key idea here is: can we identify which label(s) make up the signature?
Running this function on a typical document would produce hundreds if not thousands of unique labels for:

  • each letter (because it is not connected to other letters)
  • bigger things like signatures and logos
  • smaller things like tiny specks of noise and dots

To filter out irrelevant labels, we can use the median area of all components, which roughly corresponds to the area of a single character (assuming the image contains more letters than noise), as a minimum reference; in the code below, candidates must be a few times larger than this median. We can also set a maximum threshold by assuming that a signature will not take up more than x times the area of a letter. What we are left with are actual candidates for our signature. But what about logos? They may be about the same size as signatures; however, a signature typically has lots of whitespace between its strokes, so a black pixel ratio filter weeds logos out. The labels that remain should be actual signatures.
Turning the above into code results in this:

import math
import time

import cv2
import numpy as np


def find_signature_bounding_boxes(image):
    # Start measuring time
    start_time = time.time()

    if image is None:
        raise ValueError("Could not open or find the image")

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Binarize the image using Otsu's thresholding method (ink becomes non-zero)
    _, binary_image = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Find connected components
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_image, connectivity=8, ltype=cv2.CV_32S)

    # Calculate median area of components
    areas = stats[1:, cv2.CC_STAT_AREA]  # Exclude background
    median_area = np.median(areas)
    print('median_area: ' + str(median_area))
    median_character_width = int(math.sqrt(median_area))
    print('median_character_width: ' + str(median_character_width))

    # Define area thresholds
    min_area_threshold = median_area * 4
    max_area_threshold = median_area * 50

    # Filter components based on area thresholds
    possible_signatures = []
    for i in range(1, num_labels):  # Exclude background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area_threshold < area < max_area_threshold:
            left = stats[i, cv2.CC_STAT_LEFT]
            top = stats[i, cv2.CC_STAT_TOP]
            width = stats[i, cv2.CC_STAT_WIDTH]
            height = stats[i, cv2.CC_STAT_HEIGHT]
            print('Found candidate with area: ' + str(area))
            # Filter out horizontal lines
            if height < median_character_width * 5 and width > median_character_width * 30:
                print(' -> candidate is horizontal line with width, height: ' + str(width) + ',' + str(height))
                continue
            # Filter out vertical lines
            if width < median_character_width * 5 and height > median_character_width * 30:
                print(' -> candidate is vertical line with width, height: ' + str(width) + ',' + str(height))
                continue
            # Filter on the ratio of ink pixels (logos, for example, have a higher ratio); the guesstimate is 0.3
            roi = binary_image[top:top + height, left:left + width]
            num_black_pixels = cv2.countNonZero(roi)  # Ink pixels are non-zero in the inverted binary image
            total_pixels = width * height  # Total number of pixels in the ROI
            ratio = num_black_pixels / total_pixels  # Ratio of ink pixels within the bounding box
            print(' -> candidate has black pixel ratio: ' + str(ratio))
            if ratio > 0.30:
                print(' -> candidate has too high black pixel ratio')
                continue
            possible_signatures.append((left, top, width, height))

    print('Nr of signatures found before merging: ' + str(len(possible_signatures)))
    possible_signatures = merge_nearby_rectangles(possible_signatures, nearness=median_character_width * 4)

    # End measuring time
    end_time = time.time()
    print(f"Function took {end_time - start_time:.2f} seconds to process the image.")

    return possible_signatures

def merge_nearby_rectangles(rectangles, nearness):
    def is_near(rect1, rect2):
        # True when the two rectangles overlap or lie within 'nearness' pixels of each other
        left1, top1, width1, height1 = rect1
        left2, top2, width2, height2 = rect2
        right1, bottom1 = left1 + width1, top1 + height1
        right2, bottom2 = left2 + width2, top2 + height2
        return not (right1 < left2 - nearness or left1 > right2 + nearness or
                    bottom1 < top2 - nearness or top1 > bottom2 + nearness)

    def merge(rect1, rect2):
        # Return the smallest rectangle enclosing both input rectangles
        left1, top1, width1, height1 = rect1
        left2, top2, width2, height2 = rect2
        right1, bottom1 = left1 + width1, top1 + height1
        right2, bottom2 = left2 + width2, top2 + height2
        min_left = min(left1, left2)
        min_top = min(top1, top2)
        max_right = max(right1, right2)
        max_bottom = max(bottom1, bottom2)
        return (min_left, min_top, max_right - min_left, max_bottom - min_top)

    merged = []
    while rectangles:
        current = rectangles.pop(0)
        has_merged = False

        # First try to merge the current rectangle into one that was already merged
        for i, other in enumerate(merged):
            if is_near(current, other):
                merged[i] = merge(current, other)
                has_merged = True
                break

        if not has_merged:
            # Absorb any remaining rectangles that are near the current one
            for i in range(len(rectangles) - 1, -1, -1):
                if is_near(current, rectangles[i]):
                    current = merge(current, rectangles.pop(i))

        if not has_merged:
            merged.append(current)

    return merged
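Calling this on a scanned contract and visualizing the candidates could look like the following minimal sketch (the file names are placeholders):

import cv2

image = cv2.imread("signed_contract.png")  # placeholder file name
signatures = find_signature_bounding_boxes(image)

# Draw the candidate signature regions and save the annotated image
for (left, top, width, height) in signatures:
    cv2.rectangle(image, (left, top), (left + width, top + height), (0, 255, 0), 3)
cv2.imwrite("signed_contract_annotated.png", image)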

I only spent a fraction of the time I would have needed to implement the machine learning approach. Besides the time saved, it works remarkably well: it handles both high- and low-DPI scans. Other advantages of this approach are that it is easily integrated into existing C++ or Python code and that it is blazingly fast. The parameters can surely be tweaked further; for that, I invite you to open my shared Colab notebook and tinker yourself. If you’d rather try it out online, have a go at my demo on Hugging Face.

Image by author.

Conclusion

When confronted with a technical challenge, don’t immediately go full ‘machine learning mode’; be open to other, simpler techniques. While machine learning is exciting and opens up a lot of new possibilities, it is not necessary for every task. It is important to consider factors like development time, ease of deployment, accuracy trade-offs, and processing speed when choosing an approach for your challenge.
