Applied Computer Vision

Classifying Parkinson’s disease through image analysis: Part 1

Pre-processing and exploratory data analysis

Robin T. White, PhD
Towards Data Science
9 min read · Sep 8, 2020


Photo by Halacious on Unsplash

Introduction

Parkinson’s disease is often associated with movement disorder symptoms such as tremors and rigidity. These can have a noticeable effect on the handwriting and sketching (drawing) of a person suffering from the early stages of the disease [1]. Micrographia (abnormally small undulations in a person’s handwriting) has, however, been claimed to be difficult to interpret due to the variability in one’s developed handwriting, language, proficiency, education, etc. [1]. As such, a study conducted in 2017 aimed to improve diagnosis through a standardized analysis using spiral and wave sketches. In this series of posts, we will analyze the raw images collected in that study, see if we can create a classifier for whether a patient has Parkinson’s, and draw some conclusions along the way. The data we will be using is hosted on Kaggle [2], with special thanks to Kevin Mader for sharing the dataset upload.

Image from author. Sample images of the data we will be using.

In this part 1, we will conduct some exploratory data analysis and pre-process the images to create features that will hopefully be helpful for classification. I am choosing NOT to use a convolutional neural network (CNN) to simply classify the images, as this would be a black box, offering no metric on the underlying differences between the curves/sketches. Instead of treating this purely as a classification task, we will use image processing to understand and quantify those differences. In a subsequent post, I will compare with a CNN.


Before we begin, disclaimer that this is not meant to be any kind of medical study or test. Please refer to the original paper for details on the actual experiment, which I was not a part of.
Zham P, Kumar DK, Dabnichki P, Poosapadi Arjunan S, Raghav S. Distinguishing Different Stages of Parkinson’s Disease Using Composite Index of Speed and Pen-Pressure of Sketching a Spiral. Front Neurol. 2017;8:435. Published 2017 Sep 6. doi:10.3389/fneur.2017.00435

Exploratory Data Analysis

First, let us take a look at the images, perform some basic segmentation, and start poking around for potential features of interest. We will use pandas throughout to store the images and information. If you are wondering whether this section is worth reading, here is what we will get into:
- Thresholding and cleaning
- Thickness quantification through nearest neighbors
- Skeletonization
- Intersections and edge points

Thresholding and cleaning

We use a modified read and threshold function, mostly obtained from the original notebook by Kevin Mader on Kaggle [2]. Here, we have the option to perform a resizing when wanting to view the montage style images, like above and below. We first read in and invert the image so the drawings are white on black background, as well as resize if necessary. We also apply a small median filter.

This project has quite a fair bit of code; I won’t be posting it all here, so please see the GitHub link if you want to view the notebook or Python script for more details.

from skimage.io import imread
from skimage.util import montage as montage2d
from skimage.filters import threshold_yen as thresh_func
from skimage.filters import median
from skimage.morphology import disk
import numpy as np

def process_imread(in_path, resize=True):
    """read images, invert and scale them"""
    c_img = 1.0 - imread(in_path, as_gray=True)
    max_dim = np.max(c_img.shape)
    if not resize:
        return c_img
    if c_img.shape == (256, 256):
        return c_img
    if max_dim > 256:
        big_dim = 512
    else:
        big_dim = 256
    # pad with zeros and center the image, sizing to either 256 or 512
    out_img = np.zeros((big_dim, big_dim), dtype='float32')
    c_offset = (big_dim - c_img.shape[0]) // 2
    d_offset = c_img.shape[0] + c_offset
    e_offset = (big_dim - c_img.shape[1]) // 2
    f_offset = c_img.shape[1] + e_offset
    out_img[c_offset:d_offset, e_offset:f_offset] = c_img[:(d_offset - c_offset), :(f_offset - e_offset)]
    return out_img

def read_and_thresh(in_path, resize=True):
    c_img = process_imread(in_path, resize=resize)
    c_img = (255 * c_img).clip(0, 255).astype('uint8')
    c_img = median(c_img, disk(1))
    c_thresh = thresh_func(c_img)
    return c_img > c_thresh

Lastly, for the read in, we also clean up the images by removing any small objects disconnected from the main sketch.

from skimage.morphology import label as sk_label

def label_sort(in_img, cutoff=0.01):
    total_cnt = np.sum(in_img > 0)
    lab_img = sk_label(in_img)
    new_image = np.zeros_like(lab_img)
    remap_index = []
    for k in np.unique(lab_img[lab_img > 0]):
        cnt = np.sum(lab_img == k)  # get area of labelled object
        if cnt > total_cnt * cutoff:
            remap_index += [(k, cnt)]
    sorted_index = sorted(remap_index, key=lambda x: -x[1])  # reverse sort - largest is first
    for new_idx, (old_idx, idx_count) in enumerate(sorted_index, 1):  # enumerate starting at id 1
        new_image[lab_img == old_idx] = new_idx
    return new_image

This works by keeping only components that are large enough, i.e. more than 1% of the activated pixels (defined by cutoff). First, label each separate object in the image and sum the area for each label identified (that isn’t 0). Keep the index if the count is more than 1% of the total. Perform a negative sort so the largest object gets label 1. Finally, replace each old label number with its new ordered id.
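As a self-contained illustration of the keep-largest-components idea, here is a toy version using scipy.ndimage.label instead of the scikit-image labeling above, with a larger cutoff to suit the tiny image (a sketch of the technique, not the exact code used here):

```python
import numpy as np
from scipy import ndimage

# Toy binary image: one large blob plus one disconnected speck.
img = np.zeros((10, 10), dtype=bool)
img[1:6, 1:6] = True   # 25-pixel blob
img[8, 8] = True       # 1-pixel speck

lab, n = ndimage.label(img)
total = img.sum()
# Keep only components larger than 4% of the activated pixels.
keep = [k for k in range(1, n + 1) if (lab == k).sum() > 0.04 * total]
cleaned = np.isin(lab, keep)  # only the 25-pixel blob survives
```

The same thresholded comparison against a fraction of the total foreground area is what label_sort does, with the extra step of re-ordering the surviving labels by size.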

Image from author. Sample images of the data after thresholding and cleaning.

As an initial view of how different the drawings are, we can create a skeleton image and form a new data frame where each row is the coordinate of a single non-zero pixel in each image. We can then plot each of these curves on a single graph, after normalizing the positions. We won’t be using this format of the data frame; it is just for visualization.
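A minimal sketch of how such a long-format frame could be built; the function and column names here are my own choices, not necessarily those of the original notebook:

```python
import numpy as np
import pandas as pd
from skimage.morphology import skeletonize

def skeleton_coords(img, img_id):
    """One row per non-zero skeleton pixel, positions normalized to [0, 1]."""
    skel = skeletonize(img)
    ys, xs = np.nonzero(skel)
    h, w = img.shape
    return pd.DataFrame({'img_id': img_id, 'x': xs / w, 'y': ys / h})

# Stacking every image then gives one frame to plot all curves together:
# coords = pd.concat([skeleton_coords(im, i) for i, im in enumerate(images)])
```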

Image from author. All drawings on one plot.

As we can see, there is strong consistency between the healthy sketches, while the Parkinson’s sketches vary far more. This makes sense considering the random motion that Parkinson’s symptoms can cause.

Thickness quantification

Next, we will try to quantify the line thickness. For this, we will use a distance map to approximate the width of the drawings. The medial axis transform also returns a distance map, but skeletonize gives a cleaner skeleton as it does some pruning.

from skimage.morphology import medial_axis
from skimage.morphology import skeletonize

def stroke_thickness_img(in_img):
    skel, distance = medial_axis(in_img, return_distance=True)
    skeleton = skeletonize(in_img)
    # distance to the background for pixels of the skeleton
    return distance * skeleton

Image from author. Line thickness calculation.

Plotting the mean and standard deviation, we see some correlation with the diagnosis, mostly in the standard deviation, which again makes sense considering the random movements; the fact that it is greater, and not less, than for healthy patients also makes sense.
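Those per-image statistics could be computed from the map returned by stroke_thickness_img along these lines (the function name thickness_stats is my own, and this is a sketch rather than the notebook's exact code):

```python
import numpy as np

def thickness_stats(thickness_img):
    """Mean and standard deviation of the stroke width proxy.

    Only skeleton pixels carry a non-zero distance-to-background value,
    so we summarize over those.
    """
    vals = thickness_img[thickness_img > 0]
    return vals.mean(), vals.std()
```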

Image from author. Thicknesses of drawings.

Intersections and edge-points

Due to the way that skeletonizing an image works, more undulating curves will, depending on the ‘smoothness’ of the line, produce a higher number of end-points. These therefore give some measure of the random movement compared with a smooth line. In addition, we can calculate the number of intersection points; a perfect curve would have no intersections and only two end-points. These measures are also useful in other image-processing applications, such as road mapping.

I won’t go into the code too much here as there is a fair bit of cleaning required, however, I will try to explain what I did using some images. First, we calculate a nearest neighbors image of the skeleton of the curve. This gives us values of 2 everywhere, except a value of 1 at edge-points, and a value of 3 at intersections; this is when using a connectivity of 2 (the 8 nearest neighbors). Here is an image of the result, zoomed in.

Image from author. Intersection and edge points.
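One way to compute such a neighbor-count image is a 3x3 convolution over the skeleton with the center weight zeroed; this is my own sketch of the technique, not necessarily the implementation used in the notebook:

```python
import numpy as np
from scipy import ndimage

def neighbor_count(skel):
    """Count the 8-connected neighbors (connectivity 2) of each skeleton pixel.

    End-points get 1, mid-line pixels get 2, intersections get 3 or more.
    """
    kernel = np.ones((3, 3), dtype=int)
    kernel[1, 1] = 0  # don't count the pixel itself
    counts = ndimage.convolve(skel.astype(int), kernel, mode='constant')
    return counts * skel  # keep counts only on the skeleton
```

On a straight one-pixel line this yields 1 at the two tips and 2 everywhere else, matching the description above.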

As you can see, this works as expected, except we run into some issues if we want a single intersection pixel so we can properly count the number of intersections. The image on the right has three pixels with value 3, although this is only one intersection. We can clean this up with the following pseudo-algorithm, isolating these regions because the neighbor sum over such a cluster is greater than over a ‘correct’ single intersection. We sum the nearest-neighbors (NN) image and threshold to isolate them. Connectivity 1 includes only direct neighbors; connectivity 2 also includes diagonals.

  • From original NN, sum using connectivity 1 and 2 separately.
  • Isolate intersections which have value ≥ 8 in connectivity 2.
  • Label each edge that is connected to the intersection pixels.
  • Isolate the intersection pixels whose sum is between 3 and 5 in the connectivity-1 image. These are the ones we don’t want.
  • Overwrite incorrect intersection pixels.

Here is the result:

Image from author. Intersection and edge points, corrected.
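A simpler variant of that cleanup, sketched here under my own assumptions rather than as the exact algorithm above, is to label each 8-connected cluster of junction pixels and keep a single representative per cluster:

```python
import numpy as np
from scipy import ndimage

def clean_intersections(nn_img):
    """Collapse each cluster of adjacent junction pixels (value >= 3) to one pixel."""
    junctions = nn_img >= 3
    # 8-connectivity so diagonally touching junction pixels merge into one cluster
    lab, n = ndimage.label(junctions, structure=np.ones((3, 3)))
    cleaned = nn_img.copy()
    cleaned[junctions] = 2              # demote every junction pixel...
    for k in range(1, n + 1):
        ys, xs = np.nonzero(lab == k)
        cleaned[ys[0], xs[0]] = 3       # ...then restore one per cluster
    return cleaned
```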

As you can see we now have a single pixel with value 3 at the intersection location. We can now simply sum these locations to quantify the number of intersection points. If we draw these on a curve we can see the result, below yellow are intersections and green are edge-points.

Image from author. Skeleton curve with drawn intersection and edge-points.

If we plot the number of these for each drawing type we can see the fairly strong correlation:

Image from author. Mean number of edge-points.
Image from author. Mean number of intersections.

We can also see that our initial expectation of a healthy curve having approximately two edge-points is correct. There is a very high number of edge-points for Parkinson’s wave drawings because these are usually quite ‘sharp’ instead of smoothly curved, which creates many edge-points at the tips of the waves.

Lastly, we can also check the correlation with the overall number of pixels in the skeleton image. This relates to the ‘indirect’ nature of the drawings. The quantification is very simple: we just sum the pixels greater than zero in the skeleton image.
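That quantification amounts to a one-liner (the function name skeleton_length is my own):

```python
import numpy as np

def skeleton_length(skel_img):
    """Total number of skeleton pixels: a proxy for how 'indirect' the drawing is."""
    return int(np.sum(skel_img > 0))
```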

Image from author. Mean number of pixels.

Not as strong a relationship, but still something.

Summary

So far we have read in, cleaned and obtained some potentially useful metrics that help us not only understand the degree of micrographia but can also be used as inputs to a classifier such as logistic regression or random forest. In part 2 of this series, we will perform classification using these metrics and also compare with a more powerful, but black-box, neural network. The advantage of using a random forest is that we will also be able to see which features have the strongest impact on the model. Based on the above observations, do you have an intuition already about which features may be most important?

If you felt that any part of this post provided some useful information or just a bit of inspiration please follow me for more.

You can find the source code on my github. This project is currently still under construction.


Bonus

Not used here, but for fun, we can also create a graph representation of each image since we have nodes and edges: the nodes are the intersections or edge-points, and the edges are the sections of the drawing connecting them.

Image from author. Network graph of drawing.

We used the NetworkX library for this; each number is a node (an intersection or edge-point), with the colors corresponding to the length of the section of drawing connecting the nodes.
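A rough sketch of how such a graph could be assembled with NetworkX; the node ids and segment lengths below are made up for illustration:

```python
import networkx as nx

# Nodes are intersections or edge-points; each edge carries the pixel
# length of the stroke segment joining its two nodes.
G = nx.Graph()
G.add_edge(1, 2, length=57)
G.add_edge(2, 3, length=31)
G.add_edge(2, 4, length=44)

# Edges can then be colored by segment length when drawing, e.g.:
# lengths = [d['length'] for _, _, d in G.edges(data=True)]
# nx.draw(G, with_labels=True, edge_color=lengths, edge_cmap=plt.cm.viridis)
```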

References

[1] Zham P, Kumar DK, Dabnichki P, Poosapadi Arjunan S, Raghav S. Distinguishing Different Stages of Parkinson’s Disease Using Composite Index of Speed and Pen-Pressure of Sketching a Spiral. Front Neurol. 2017;8:435. Published 2017 Sep 6. doi:10.3389/fneur.2017.00435
[2] Mader, K. Parkinson’s Sketch Overview. https://www.kaggle.com/kmader/parkinsons-drawings. Obtained: 09/07/2020
