How to Use fastai to Evaluate DICOM Medical Files

Charlie Craine
Towards Data Science
5 min read · Feb 20, 2021

Get started in Kaggle medical-based competitions

Machine learning means you can see like a radiologist. Photo: CDC

Kaggle competitions can be intense, and just gaining the domain knowledge can take a good amount of upfront work.

I have quite a bit of experience using fastai. That doesn’t mean I always use it for a final competition entry, but it is a great tool for quickly prototyping and learning about a dataset.

This article shows some helpful tips for getting up and running quickly with DICOM medical files and the data associated with them. It won’t contain all of the code; the full notebook is shared on Kaggle with everyone, so I’ll add some snippets here and point you to it. I also learned a lot from the amazing fastai tutorial on medical imaging.

The first question you might be asking yourself (or not if you were Googling “fastai dicom files”): what are DICOM files?

What are DICOMs?

DICOM stands for Digital Imaging and COmmunications in Medicine. It is the de facto standard that establishes rules allowing medical images (X-ray, MRI, CT) and associated information to be exchanged between imaging equipment from different vendors, computers, and hospitals. The DICOM format provides a means of transmitting health-related data among facilities that meets health information exchange (HIE) standards and HL7 standards; HL7 is the messaging standard that enables clinical applications to exchange data.

DICOM files generally use the .dcm extension. What is really amazing about them is that they store data under separate ‘tags’, covering patient information as well as image/pixel data. A DICOM file consists of a header and the image data packed into a single file.
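To make the tag idea concrete, here is a minimal sketch in plain Python with no DICOM library involved. Each tag is a (group, element) pair of 16-bit numbers, usually printed in hexadecimal, and each entry carries a value representation (VR) that names its data type. The `header` dictionary and the `format_tag` helper are illustrative assumptions, not part of pydicom or fastai; the three tags themselves are standard DICOM attributes.

```python
# Illustrative only: a DICOM header modeled as a plain dictionary.
# Keys are (group, element) tag pairs; values are (VR, name, value).
header = {
    (0x0010, 0x0040): ("CS", "Patient's Sex", "M"),
    (0x0028, 0x0010): ("US", "Rows", 2952),
    (0x0028, 0x0011): ("US", "Columns", 2744),
}


def format_tag(tag):
    """Render a (group, element) pair the way DICOM tools print it."""
    group, element = tag
    return f"({group:04x}, {element:04x})"


for tag, (vr, name, value) in header.items():
    print(format_tag(tag), name, f"{vr}:", value)
```

The group number hints at what the tag describes: 0010 holds patient attributes and 0028 holds image attributes, while the pixel data itself lives under (7fe0, 0010).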

This is a good chance to see how fastai lets you quickly view the information stored in a .dcm file. If you are used to fastai, you’ll be familiar with a few of these imports, but note the medical import. It is required for working with DICOM files.

from fastai.basics import *
from fastai.callback.all import *
from fastai.vision.all import *
from fastai.medical.imaging import *

import pydicom

import pandas as pd

The dataset I’m using is on Kaggle: VinBigData Chest X-ray Abnormalities Detection. This is an interesting competition; you can read the information on Kaggle to learn more. To keep the tutorial simple, the code below just points at the files. The structure is very straightforward: a parent folder, “vinbigdata-chest-xray-abnormalities-detection”, with a training folder inside it that holds the DICOM images:

path = Path('../input/vinbigdata-chest-xray-abnormalities-detection')
train_imgs = path/'train'

Next, you can set up your images so they can be read.

items = get_dicom_files(train_imgs)
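For intuition about what this step collects, here is a rough standard-library stand-in that searches a folder recursively for .dcm files. It is a sketch of the idea only: `find_dicom_files` is a hypothetical helper, not how fastai implements get_dicom_files, and it returns plain pathlib paths without the extra methods fastai adds.

```python
from pathlib import Path


def find_dicom_files(root):
    """Recursively collect files ending in .dcm (case-insensitive) under root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".dcm"
    )
```

Calling `len(find_dicom_files(train_imgs))` is a quick way to confirm that the path is right before loading anything.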

Pydicom is a Python package for parsing DICOM files. It makes it easier to access the DICOM header and converts the raw pixel_data into Pythonic structures for easier manipulation. fastai.medical.imaging uses pydicom.dcmread to load the DICOM file.

To plot an X-ray, we can select an entry in the items list and load the DICOM file with dcmread. Here we can write a simple line of code to see interesting, and potentially valuable, data associated with the dcm file.

# change this index to pick a different file (one patient's X-ray)
patient = 3
xray_sample = items[patient].dcmread()

Now we can view the header metadata within the DICOM file.

xray_sample

Output:
Dataset.file_meta -------------------------------
(0002, 0000) File Meta Information Group Length UL: 160
(0002, 0001) File Meta Information Version OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID UI: Digital X-Ray Image Storage - For Presentation
(0002, 0003) Media Storage SOP Instance UID UI: 7ecd6f67f649f26c05805c8359f9e528
(0002, 0010) Transfer Syntax UID UI: JPEG 2000 Image Compression (Lossless Only)
(0002, 0012) Implementation Class UID UI: 1.2.3.4
(0002, 0013) Implementation Version Name SH: 'OFFIS_DCMTK_360'
-------------------------------------------------
(0010, 0040) Patient's Sex CS: 'M'
(0010, 1010) Patient's Age AS: '061Y'
(0028, 0002) Samples per Pixel US: 1
(0028, 0004) Photometric Interpretation CS: 'MONOCHROME2'
(0028, 0010) Rows US: 2952
(0028, 0011) Columns US: 2744
(0028, 0030) Pixel Spacing DS: [0.127, 0.127]
(0028, 0100) Bits Allocated US: 16
(0028, 0101) Bits Stored US: 14
(0028, 0102) High Bit US: 13
(0028, 0103) Pixel Representation US: 0
(0028, 1050) Window Center DS: "8190.0"
(0028, 1051) Window Width DS: "7259.0"
(0028, 1052) Rescale Intercept DS: "0.0"
(0028, 1053) Rescale Slope DS: "1.0"
(0028, 2110) Lossy Image Compression CS: '00'
(0028, 2112) Lossy Image Compression Ratio DS: "2.0"
(7fe0, 0010) Pixel Data OB: Array of 5827210 elements

There is a lot of information here and the good news is there is an excellent resource to learn more about these:

http://dicom.nema.org/medical/dicom/current/output/chtml/part03/sect_C.7.6.3.html#sect_C.7.6.3.1.4
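A few of these header values become easier to read with a little arithmetic. The sketch below uses the numbers from the sample header above in plain Python. The `age_in_years` helper is hypothetical, and the physical size is only the nominal value implied by Rows, Columns and Pixel Spacing, so treat it as an estimate.

```python
def age_in_years(age_string):
    """Convert a DICOM Age String such as '061Y' to whole years.

    Only the 'Y' (years) unit is handled here; the standard also allows
    D, W and M, which this sketch rejects.
    """
    value, unit = age_string[:-1], age_string[-1]
    if unit != "Y":
        raise ValueError(f"unsupported age unit: {unit!r}")
    return int(value)


# Values copied from the sample header.
rows, columns = 2952, 2744
pixel_spacing_mm = (0.127, 0.127)  # (row spacing, column spacing)
bits_stored = 14

# Nominal image size in millimetres.
height_mm = rows * pixel_spacing_mm[0]
width_mm = columns * pixel_spacing_mm[1]

# Pixel Representation is 0 (unsigned), so 14 stored bits cover 0..2**14 - 1.
max_pixel_value = 2 ** bits_stored - 1

print(age_in_years("061Y"))                      # 61
print(round(height_mm, 1), round(width_mm, 1))   # 374.9 348.5
print(max_pixel_value)                           # 16383
```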

Finally, you can view the actual X-ray.

xray_sample.show()
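How bright the X-ray looks depends on how raw pixel values are mapped to display intensities, and the Window Center and Window Width tags in the header suggest one such mapping. The sketch below applies the linear window function described in the DICOM standard to individual values in plain Python. `apply_window` is a hypothetical helper for illustration, not what xray_sample.show() does internally, and the 0 to 255 output range is an assumption.

```python
def apply_window(value, center, width, out_min=0.0, out_max=255.0):
    """Map a raw pixel value to a display value with a linear window.

    Values below the window clamp to out_min, values above it clamp to
    out_max, and values inside are scaled linearly. Requires width > 1.
    """
    lower = center - 0.5 - (width - 1) / 2
    upper = center - 0.5 + (width - 1) / 2
    if value <= lower:
        return out_min
    if value > upper:
        return out_max
    scaled = (value - (center - 0.5)) / (width - 1) + 0.5
    return scaled * (out_max - out_min) + out_min


# Window Center and Window Width from the sample header.
center, width = 8190.0, 7259.0
for raw in (4000, 8190, 12000):
    print(raw, apply_window(raw, center, width))
```

For a whole image, the same formula would normally be applied to the pixel array in one vectorized step rather than value by value.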

Remember all that metadata above? You may have wondered how you could make it useful. The good news is that you can pull it into a dataframe.

A quick note: there are two versions of this code. The simple one below is for Google Colab; click the link to see a more complex version for Kaggle. Anyone who has used Kaggle knows that you sometimes have to change things up a bit to make them work there.

Here is the simple way to pull your metadata into a dataframe:

dicom_dataframe = pd.DataFrame.from_dicoms(items)
dicom_dataframe[:5]
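Conceptually, each file’s header becomes one row and each tag becomes one column. The sketch below shows that flattening with a plain dictionary standing in for a parsed header, using values from the sample file above. The `flatten_header` helper and its numbered column names are illustrative assumptions, not fastai’s implementation, so the real dataframe’s columns may be named differently.

```python
def flatten_header(header):
    """Turn one header into a flat row, expanding multi-valued fields."""
    row = {}
    for name, value in header.items():
        if isinstance(value, (list, tuple)):
            # One column per component, e.g. PixelSpacing0 and PixelSpacing1.
            for i, component in enumerate(value):
                row[f"{name}{i}"] = component
        else:
            row[name] = value
    return row


# Stand-in for one parsed header (values from the sample file).
sample_header = {
    "PatientSex": "M",
    "PatientAge": "061Y",
    "Rows": 2952,
    "Columns": 2744,
    "PixelSpacing": [0.127, 0.127],
}

row = flatten_header(sample_header)
print(row)
```

A list of such rows, one per file, is the kind of input that pd.DataFrame accepts directly.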

I’ll add a screenshot below, since the data has 29 columns and would run off the page.

Hopefully, this will be helpful for some of you. Next, I’m going to set up bounding boxes to detect the various diseases within the X-rays.

If anyone does something great with fastai and/or medical data, I want to hear about it! Please let everyone know what you’ve created in the responses below or reach out any time on LinkedIn.
