Neural Networks Intuitions: 6. EAST: An Efficient and Accurate Scene Text Detector — Paper Explanation

Hey folks! It’s good to be back again :-)

Raghul Asokan
Towards Data Science


It has been a while since I published my last article. In this sixth installment of my series "Neural Networks Intuitions", I will be talking about one of the most widely used scene text detectors: EAST (Efficient and Accurate Scene Text Detector). As the name suggests, it is not just accurate but also much more efficient than its text-detector counterparts.

Firstly, let us look at the problem of Scene Text Detection in general and then dive deep into the working of EAST :-)

Scene Text Detection:

Problem: The problem, as already mentioned above, is to detect text in natural scene images. Scene Text Detection is a special case of Object Detection where the object boils down to a single entity: "text".

Scene Text Detection

But at what level are we detecting text?

We can detect text either at the character level or at the word level. It all depends on how the text detection dataset has been annotated and what we want the network to learn. Generally, natural scene images are tagged at the word level, thereby making the network learn to detect words (as well as the spaces between them, in order to separate any two word instances).

Solution: Since the problem here is object detection, and we already know the fundamentals of object detection (Neural Networks Intuitions: 5. Anchors and Object Detection), we can simply use one of the existing object detectors, say SSD, Faster-RCNN or RetinaNet, and they should pretty much do the job for us.

But do we really need multiple anchor boxes per feature map cell? Or do we need to use the concept of anchors at all for a more specific & simpler task such as text detection?

The reason I say text detection (or more specifically, word detection) is a simpler task is because:

a. The pattern/feature of a word is not that complex. Especially when the language is fixed (e.g. English), we basically have 26*2 (lower- and upper-case letters) + 10 (digits) = 62 characters.

But we are not detecting at the character level, are we? Then how is this task simple, given that there can be a very large number of word combinations?

Let us look at the next point to see the reason!

b. If you have prior experience working on text detection, you will have seen that text detectors learn to predict words of languages they were not trained on (even if not perfectly), e.g. an English word detector detecting Japanese/Tamil words.

This makes it pretty evident that the pattern learnt is not precisely the "language's words". So what is the network really learning? It could be learning shapes occurring together in groups, separated by a defined amount of white space, i.e. a connected component.

At least this is my intuition of what word detectors could be learning :-)

How about we formulate a solution for the above problem?

Pseudo Code:

  1. Pass an input scene text image to a ConvNet (feature extractor).
  2. Merge multi-scale features in the feature fusion stage.
  3. Run a 1x1 conv filter (class head) on the feature volume to get an activation map/heatmap in the range 0–1, where 1 represents the presence of text and 0 represents background.
  4. Threshold the activation map and use some over-the-top logic like cv2.findContours() to find the text regions and eventually the individual words within them (a minimal sketch of steps 3 and 4 follows below).
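
To make steps 3 and 4 concrete, here is a minimal sketch in Python, assuming the 0–1 heatmap is already available as a NumPy array (the 0.5 threshold and the OpenCV 4.x findContours signature are my assumptions, not something prescribed by the paper):

```python
import cv2
import numpy as np

def words_from_heatmap(heatmap, score_thresh=0.5):
    """Naive post-processing: threshold the 0-1 text/background heatmap and
    treat every connected blob as one word region (axis-aligned boxes only)."""
    # binarize the per-pixel scores
    binary = (heatmap > score_thresh).astype(np.uint8) * 255
    # each connected component is roughly one word candidate (OpenCV 4.x API)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # (x, y, w, h) boxes in feature-map coordinates; scale them up by the
    # network's output stride to get boxes in image coordinates
    return [cv2.boundingRect(c) for c in contours]
```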

If we pay attention to step 3, we can notice the usage of anchors: a filter of size 1x1 with only one anchor box per feature map cell. This is key because it lets us draw similarities with a traditional single-shot object detector.

This is pretty much what EAST does, except that instead of using some over-the-top logic to find text regions, it has another head, a box head, which outputs four distance values (for every pixel in the feature map) to the nearest [minx, miny, maxx, maxy] box boundaries.

EAST — An Efficient and Accurate Scene Text Detector:

a. Architecture: Every single-shot object detector involves 3 major stages:

  1. Feature extraction stage.
  2. Feature fusion stage.
  3. Prediction network.

All variants of single-shot detectors differ in one (or more) of the above three stages. EAST follows the same paradigm.

b. Input-Output:

  1. The network takes an input image, which is passed through a set of conv layers (the feature-extractor stem) to get four levels of feature maps: f1, f2, f3, f4.
  2. The feature maps are then unpooled (x2), concatenated (along the channel dimension) and passed through a 1x1 conv followed by a 3x3 conv. The reason for merging features from different spatial resolutions is to be able to predict smaller word regions.
Feature merging

  3. The final feature volume is then used to make the score and box predictions: a 1x1 filter of depth=1 generates the score map, another 1x1 filter of depth=5 generates RBOX (rotated box) geometry, i.e. four box offsets plus a rotation angle, and another 1x1 filter of depth=8 generates QUAD (quadrangle) geometry with 8 offsets.
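
Here is a rough PyTorch-style sketch of one feature-merging step and the prediction heads. The channel count, the 512-pixel distance scaling and the angle range are assumptions borrowed from typical open-source implementations, not values quoted from the paper (and the QUAD head is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_step(coarse, fine, conv1x1, conv3x3):
    """One feature-merging step: unpool the coarser map x2, concatenate it
    with the finer map along the channel dimension, then apply 1x1 and 3x3 convs."""
    up = F.interpolate(coarse, scale_factor=2, mode="bilinear", align_corners=False)
    return conv3x3(conv1x1(torch.cat([up, fine], dim=1)))

class EastHeads(nn.Module):
    """Prediction network: plain 1x1 convs on the fused feature volume."""
    def __init__(self, channels=32, input_size=512):
        super().__init__()
        self.input_size = input_size
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # text vs background score map
        self.dists = nn.Conv2d(channels, 4, kernel_size=1)  # distances to the 4 box boundaries
        self.angle = nn.Conv2d(channels, 1, kernel_size=1)  # rotation angle (RBOX: 4 + 1 = depth 5)

    def forward(self, x):
        score = torch.sigmoid(self.score(x))                          # values in [0, 1]
        dists = torch.sigmoid(self.dists(x)) * self.input_size        # distances in pixels
        angle = (torch.sigmoid(self.angle(x)) - 0.5) * 3.141592 / 2   # roughly [-45, +45) degrees
        return score, dists, angle
```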

c. Loss function:

Before jumping into loss functions, let us try to interpret the network output first.

1. The class head output can be interpreted similarly to a traditional detector's class output, except that here there is only one anchor box per feature map cell, hence the output is of shape HxWx1, where 1 is the number of anchor boxes per cell.

2. In the case of the box head, however, the output (of shape HxWx4) should be interpreted at the "pixel" level and there is no concept of an anchor box. Every pixel has 4 numbers associated with it: the distances to the nearest [minx, miny, maxx, maxy] box boundaries. The important thing to note here is that the final word-level output is later derived from this per-pixel output.
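
As a toy illustration of this per-pixel interpretation (ignoring the rotation angle and EAST's actual post-processing, which comes later), one pixel's four distances can be decoded into an axis-aligned box as follows; the output stride of 4 is an assumption on my part:

```python
import numpy as np

def pixel_to_box(px, py, dists, stride=4):
    """Decode one feature-map pixel into an axis-aligned box.
    dists = (d_top, d_right, d_bottom, d_left) in input-image pixels;
    `stride` maps feature-map coordinates back to image coordinates."""
    d_top, d_right, d_bottom, d_left = dists
    cx, cy = px * stride, py * stride                # pixel location in the input image
    return np.array([cx - d_left, cy - d_top,        # minx, miny
                     cx + d_right, cy + d_bottom])   # maxx, maxy
```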

c1. Classification Loss

We are all well aware of the class imbalance problem present in object detection datasets. The number of samples for the background class is generally very high, and now that we are treating every 1x1 box (basically every pixel) as an output, the number of background samples becomes huge.

In order to tackle this problem of class imbalance, EAST uses a modified version of cross entropy called Balanced/Weighted Cross Entropy.

Balanced Cross Entropy (BCE)

In BCE, the fraction of the highly-represented class multiplies the under-represented class's loss term (and vice versa for the highly-represented class's loss term), in order to control the contribution of the two classes. Note: background class := highly-represented, foreground (text) class := under-represented.
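
As a sketch (assuming per-pixel prediction and ground-truth maps with values in [0, 1]), balanced cross entropy can be written as:

```python
import torch

def balanced_cross_entropy(pred, gt, eps=1e-6):
    """Weight the text (positive) term by the fraction of background pixels
    and the background (negative) term by the fraction of text pixels."""
    beta = 1.0 - gt.mean()   # fraction of background (highly-represented) pixels
    pos_term = -beta * gt * torch.log(pred + eps)
    neg_term = -(1.0 - beta) * (1.0 - gt) * torch.log(1.0 - pred + eps)
    return (pos_term + neg_term).mean()
```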

Check this blog for a detailed explanation of BCE (Neural Networks Intuitions: 1.Balanced Cross Entropy).

c2. IOU Loss: The IOU loss used here is different from the traditional bounding-box regression loss.

IOU Loss

Every pixel has 4 numbers associated with it, the distances to the nearest top, left, bottom and right box boundaries, from which the IOU between the predicted and ground-truth boxes is computed; the negative log of this IOU is used as the loss, which penalizes any IOU less than 1.

It is pretty evident that the width and height of the gt/pred box can be computed by simply summing the corresponding x and y distances (width = left + right, height = top + bottom), from which the gt and pred box areas are obtained.

To find the width and height of the intersected rectangle, we take the minimum of the predicted and ground-truth distances on each side and sum them: intersected width = min(pred_left, gt_left) + min(pred_right, gt_right), intersected height = min(pred_top, gt_top) + min(pred_bottom, gt_bottom).

Now that we have area of gt box, area of predicted box and area of intersected box, we can compute IOU!
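
Putting the above together, here is a per-pixel IOU loss sketch, assuming 4-channel distance maps ordered (top, left, bottom, right); in practice the loss is averaged only over text pixels:

```python
import torch

def iou_loss(pred, gt, eps=1e-6):
    """pred, gt: tensors of shape (4, H, W) holding per-pixel distances
    to the top, left, bottom and right box boundaries."""
    p_t, p_l, p_b, p_r = pred
    g_t, g_l, g_b, g_r = gt
    area_pred = (p_t + p_b) * (p_l + p_r)                # height * width
    area_gt   = (g_t + g_b) * (g_l + g_r)
    h_inter = torch.min(p_t, g_t) + torch.min(p_b, g_b)  # intersected height
    w_inter = torch.min(p_l, g_l) + torch.min(p_r, g_r)  # intersected width
    inter = h_inter * w_inter
    union = area_pred + area_gt - inter
    iou = (inter + eps) / (union + eps)
    return -torch.log(iou).mean()                        # zero only when IOU == 1
```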

We then compute the angle loss:

1 - cos(predicted angle - gt angle)

The total loss is written as,

A weighted sum of the classification loss and the geometry loss (the geometry loss itself being a weighted sum of the IOU and angle terms).
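
As a sketch of how the pieces combine (reusing the iou_loss sketch above; the lambda weights here are placeholders, not necessarily the paper's exact values):

```python
import torch

def total_loss(score_loss, pred_dists, gt_dists, pred_angle, gt_angle,
               lambda_theta=10.0, lambda_geo=1.0):
    """Total loss = classification loss + weighted geometry loss, where the
    geometry loss combines the IOU term and the 1 - cos(angle difference) term."""
    l_iou   = iou_loss(pred_dists, gt_dists)                   # per-pixel IOU term (sketch above)
    l_angle = (1.0 - torch.cos(pred_angle - gt_angle)).mean()  # rotation term
    l_geo   = l_iou + lambda_theta * l_angle
    return score_loss + lambda_geo * l_geo
```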

In my upcoming post (an edit to this same post), I will explain EAST's post-processing, i.e. how bounding boxes are computed from the network's output, for the sake of completeness.

Link to the paper: https://arxiv.org/abs/1704.03155

Cheers!
