
Vision Transformers for Femur Fracture Classification

A brief summary of two years of my Ph.D. research


Table of contents

  1. Introduction
  2. Background
  3. The full pipeline of our system
  4. Results
  5. Conclusion
  6. References

1. Introduction

Since the beginning of my Ph.D., I have been collaborating with the orthopedic team of the CTO (Center for Orthopaedic Trauma) of Turin, Italy, to develop an algorithm able to assist physicians in fracture diagnosis. We selected the femur as a starting point because its fractures are among the most common and their correct classification strongly affects patients’ treatment and prognosis. We began our journey with Convolutional Neural Networks (CNNs) and recently, for the first time in the literature, applied a Vision Transformer (ViT) to surpass the state of the art on this task and provide a Deep Learning-based support tool to specialists.

In this short article, I am going to summarize our system (in the simplest way I can) and demonstrate its effectiveness; it is described in more detail in the original arXiv paper.


2. Background

2.1 The problem

Musculoskeletal diseases, and in particular hip fractures, are among the most common causes of severe, long-term disability worldwide. Due to the progressive aging of the population, the prevalence and incidence of fragility fractures are increasing and will continue to rise.

To put this in perspective: in 2010, the estimated incidence of hip fractures was 2.7 million patients per year globally. That is an enormous number of diagnoses to perform.

Of course, a great share of this responsibility lies with physicians, who may have to evaluate dozens of X-ray images a day. This is challenging for several reasons:

  • X-rays can hide certain particularities of the bone
  • long experience is needed to correctly classify the different types of fractures
  • doctors often have to act in emergency situations, constrained by time and fatigue

In this context, integrating a CAD (Computer-Aided Diagnosis) system into doctors’ workflow could have a direct impact on patients’ outcomes.

This idea is the core of our work: to develop a fast, intuitive, and accurate system to classify femur fractures, relying solely on 2D X-rays.

In order to build a supervised classifier, the first necessary step is to understand the specific classes you want to recognize. In standard classification problems, the ultimate goal is usually to train a network that performs at least as well as an average human (almost anyone can tell, for example, a dog from a cat). Unfortunately, this often does not apply to the medical domain, and particularly to fractures, which are very tricky to evaluate and require a "non-average" human with vast experience in the field. So, how do experts classify femur fractures?

2.2 The AO classification

One of the answers is the AO/OTA classification of the proximal femur, which is hierarchical and determined by the localization and configurations of the fracture lines.

It distinguishes three main fracture types, named A, B, and C. Each type is then divided into several levels of subgroups according to the complexity of the fracture, considering the number of fracture lines as well as the displacement of the fragments.

From the figure above, this might seem like a very easy problem. But let’s look at some real samples!

After two years, I still struggle to distinguish the different subgroups by eye. Fortunately, the network we implemented is much smarter than I am.

2.3 Hierarchical CNN

In 2019, when I started this work with my research group, I was completely new to the topic, so the first step was a literature review, which we later published here. From our analysis, it became clear that the problem we wanted to address was not yet solved: the majority of existing methods focused on the binary classification between Broken and Unbroken bones, which unfortunately has very little impact on physicians’ diagnoses. Only two research groups had tried to classify bones into different sub-fractures, and their results were still sub-optimal.

The first method we tried, explained in this paper, proposed a multistage approach that classifies fractures into 5 classes (at that time, we had not yet obtained the labels for the B subgroups, and C fractures were, and still are, excluded due to the low number of samples), mirroring the hierarchical structure of the AO classification. The original X-rays were cropped with a semi-automated method to build a dataset of 2878 samples, divided into A1, A2, A3, B, and Unbroken classes. The hierarchical method consisted of a cascade of three stages: the first network recognized Unbroken versus Broken bones, the second classified the images predicted as Broken into A and B, and the third dealt with the A subgroups (see the sketch below). The method was then compared with three classic CNNs, namely ResNet50, InceptionV3, and VGG16.
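
To make the cascade concrete, here is a minimal sketch of its routing logic in Python. The stage functions are placeholders standing in for the trained classifiers (names and signatures are illustrative, not the paper’s actual code):

    from typing import Callable

    def classify_femur(crop, stage1: Callable, stage2: Callable,
                       stage3: Callable) -> str:
        # Stage 1: is the bone broken at all?
        if stage1(crop) == "Unbroken":
            return "Unbroken"
        # Stage 2: broken bones are split into type A and type B
        if stage2(crop) == "B":
            return "B"
        # Stage 3: type A fractures are resolved into their subgroups
        return stage3(crop)  # "A1", "A2", or "A3"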

This rather simple approach surpassed the three CNNs but was still far from optimal. We were struggling to improve these results when Vision Transformers appeared!

2.4 Vision Transformer

In recent times, a new paradigm called the Transformer, originally introduced for Natural Language Processing (NLP), has demonstrated exemplary performance on a broad range of language tasks.

Transformer architectures are based on a self-attention mechanism that learns the relationships between the elements of a sequence. They:

  • can deal with complete sequences, thus learning long-range relationships
  • can be easily parallelized
  • can be scaled to high-capacity models and large-scale datasets
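
For intuition, here is a bare-bones sketch of (single-head) scaled dot-product self-attention in PyTorch; the sizes are arbitrary, and real Transformers stack many multi-head layers of this kind:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 10, 64)  # 1 sequence of 10 tokens, 64-dim embeddings
    Wq, Wk, Wv = (torch.nn.Linear(64, 64) for _ in range(3))

    q, k, v = Wq(x), Wk(x), Wv(x)               # queries, keys, values
    scores = q @ k.transpose(-2, -1) / 64**0.5  # (1, 10, 10) pairwise affinities
    out = F.softmax(scores, dim=-1) @ v         # each token mixes all the others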

The success of Transformer networks in the NLP domain has aroused great interest in the computer vision community. However, visual data have their own structure, thus requiring new network designs and training schemes.

As a result, different authors have proposed their own implementations of Transformer models applied to vision, but the state of the art was achieved by the Vision Transformer (ViT), whose distinguishing feature is that it splits the image into small patches, which are treated as tokens.
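
In code, this patch tokenization boils down to a strided convolution followed by a flatten. The sketch below assumes the standard ViT-Base setup (224×224 inputs, 16×16 patches, 768-dim embeddings):

    import torch

    img = torch.randn(1, 3, 224, 224)  # one (batched) RGB image
    to_tokens = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16)

    tokens = to_tokens(img)                     # (1, 768, 14, 14)
    tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens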

A deeper explanation of self-attention and Vision Transformer can be found in my first article on Medium.

Let’s finally discuss how we used the superpowers of ViT to surpass the state of the art on this task.


3. The full pipeline of our system

The preprocessing stages were quite similar to the former approach, with two main differences: the cropping phase was fully automated using a YOLOv3 network, and the dataset now contained 4207 samples divided into 7 different classes (still excluding C fractures, for the same reason). A CNN and a Hierarchical CNN (discussed in Section 2.3) were used as the two baselines against which ViT was compared. Two dense layers were attached on top of the pre-trained ViT for classification (for more information on the architecture, I suggest reading the paper on arXiv).
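
As a rough idea of such an architecture, here is how a classification head could be attached to a pre-trained ViT using the timm library; the hidden width and other details are assumptions for illustration, not the exact setup from the paper:

    import timm
    import torch.nn as nn

    backbone = timm.create_model("vit_base_patch16_224",
                                 pretrained=True, num_classes=0)  # features only
    head = nn.Sequential(
        nn.Linear(backbone.num_features, 256),  # hidden width is an assumption
        nn.ReLU(),
        nn.Linear(256, 7),                      # one logit per fracture class
    )
    model = nn.Sequential(backbone, head)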

The attention maps were also visualized to demonstrate that the network was indeed focusing on the correct areas of the bones, and a clustering experiment was performed to assess ViT’s feature-extraction capabilities.


4. Results

As you can see from the table below, ViT outperformed both baselines! For more results, you can read (guess what?) the arXiv pre-print.

4.1 Clustering

We were satisfied with the results, but we also wanted to demonstrate that the network was actually good at feature extraction. To prove this, we switched to unsupervised learning: if a network is able to separate different classes without labels, then the features it extracts must be very distinctive. Three clustering approaches were tested, and the results are shown below: in the first, the initial dataset of images was clustered using a Convolutional Autoencoder (a). In the second and third, the Convolutional Autoencoder was replaced by an Autoencoder that took as input a vector of 1024 values, extracted in one case from a CNN (b) and in the other from the ViT encoder (c). Clearly, ViT was the only one able to extract meaningful features, although understandably it still struggled with the sub-fractures.
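
As a toy stand-in for this kind of sanity check, the same idea can be reproduced along these lines with scikit-learn (random tensors replace the real 1024-dim feature vectors and AO labels):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 1024))  # stand-in for encoder features
    labels = rng.integers(0, 7, size=500)    # stand-in for the 7 true classes

    clusters = KMeans(n_clusters=7, n_init=10,
                      random_state=0).fit_predict(features)
    print("ARI:", adjusted_rand_score(labels, clusters))  # higher is better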

4.2 Visualization

We also visualized the attention maps to highlight where the network focuses during inference. These maps were evaluated by specialists, who confirmed that ViT correctly locates the fracture area.
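
A common recipe for producing such maps from a ViT is attention rollout (Abnar & Zuidema, 2020); the snippet below is a minimal, generic sketch of it, run here on dummy attention tensors:

    import torch

    def attention_rollout(attns):
        """attns: per-layer attention maps, each (heads, tokens, tokens)."""
        tokens = attns[0].shape[-1]
        rollout = torch.eye(tokens)
        for attn in attns:
            a = attn.mean(dim=0)                   # average over the heads
            a = 0.5 * a + 0.5 * torch.eye(tokens)  # account for skip connections
            a = a / a.sum(dim=-1, keepdim=True)    # keep the rows normalized
            rollout = a @ rollout                  # accumulate layer by layer
        return rollout

    # Dummy input: 12 layers, 12 heads, 197 tokens (196 patches + 1 CLS token)
    attns = [torch.softmax(torch.randn(12, 197, 197), dim=-1) for _ in range(12)]
    cls_map = attention_rollout(attns)[0, 1:]  # CLS-to-patch relevance scores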

4.3 Specialists Evaluation

Finally, to demonstrate that this tool could actually be used in the daily hospital routine, we asked 7 residents and 4 radiologists to evaluate 150 images without and (two weeks later) with the network’s predictions, obtaining an average improvement of 29%!


5. Conclusion

The novelties introduced by this work are four-fold:

  1. we introduced the largest and richest labeled dataset for femur fracture classification to date, with 4207 images divided into 7 different classes;
  2. we applied for the first time a Vision Transformer (ViT) for the classification task, surpassing the two baselines of a classic CNN and a hierarchical CNN;
  3. we visualized the attention maps of ViT and clustered the output of the Transformer’s encoder in order to understand the potential of this architecture;
  4. we carried out a final evaluation, asking 11 specialists to classify 150 images by means of an online survey, with and without the help of our system.

That’s it! For the first time, we achieved very good results while reaching a deep level of the AO classification.

The main limitation of this tool is that some classes in the dataset are underrepresented. For this reason, we are working with Generative Adversarial Networks (GANs) to produce new artificial but reliable samples. Can you spot the fake samples in the image below?

They’re all fakes! Amazing, isn’t it?


6. References

An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

Vision Transformer for femur fracture classification

Hierarchical fracture classification of proximal femur X-ray images


If you enjoyed this story, you can also check out my first ever article on Medium, where I explained one of the most interesting and recent architectures for vision, CoAtNet!

CoAtNet: how to perfectly combine CNNs and Transformers

