Removing Spikes from Raman Spectra with Anomaly Detection: Whitaker-Hayes algorithm in Python

Published in

Towards Data Science

9 min readNov 19, 2019

Detecting and removing spikes from Raman spectra

Abstract. In this article, I would like to comment on a new approach to remove spikes from Raman spectra, presented in the Chemometrics and Intelligent Laboratory Systems journal by D. Whitaker and K. Hayes¹. In their publication, the authors use a modified Z-scores outlier detection based algorithm to locate such spikes, when present, followed by a simple moving average to remove them. Instead of calculating the Z-scores of the spectrum intensity, they calculate the Z-scores of the once-differenced spectrum. They also provided the code for the algorithm implementation in R. Here, I present the implementation of the algorithm in Python and show how I applied it to a different data set.

Raman spectroscopy is a widely used analytical technique which provides structural and electronic information from molecules and solids. It is applicable at both laboratory and mass-production scales, and has applications in many different fields such as physics, chemistry, biology, medicine or industry.

A typical issue known in Raman spectroscopy is that Raman spectra are sometimes ‘contaminated’ by spikes. Spikes are positive, narrow bandwidth peaks present at random position on the spectrum. They originate when a high-energy cosmic ray impacts in the charge-couple device detector used to measure Raman spectra. These spikes are problematic as they might hinder subsequent analysis, particularly if multivariate data analysis is required. Therefore, one of the first steps in the treatment of Raman spectral data is the cleaning of spikes.

First, the Python packages that will be needed are loaded:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Figure 1 shows the Raman spectrum of graphene. In the previous years, graphene has become a very popular material due to its remarkable physical properties, including superior electronic, thermal, optical and mechanical properties. Its characteristic Raman spectrum consists of several peaks as shown in the figure. From their shape and related intensity, a large amount of information such as doping, strain or grain boundaries can be learned. This spectrum is a clear example of a spectrum contaminated with a spike.

# load the data as data frame
df = pd.read_csv(‘folder_name.../spectrum.txt’, delimiter = ‘\t’)# Transform the data to a numpy array
wavelength = np.array(df.Wavelength)
intensity = np.array(df.Intensity)# Plot the spectrum:
plt.plot(wavelength, intensity)
plt.title(‘Spectrum’, fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel(‘Wavelength (nm)’, fontsize = 20)
plt.ylabel(‘Intensity (a.u.)’, fontsize = 20)
plt.show()

**Figure 1.** A spike was recorded while measuring the typical Raman spectrum of graphene (characterized by the G and 2D bands).

Z-score based approach for spike detection in Raman spectra

Spikes intensities are normally above the intensities of other Raman peaks in the spectrum, therefore using a z-scores based method could be a good starting point. The z-scores tell how far a value is from the average in units of standard deviation. So, if the population mean and population standard deviation are known, the standard score of a raw score x(i) is calculated as:

z(i) = (x(i)-μ) / σ

where μ is the mean and σ is the standard deviation of the population x (x(i) represent the values of a single Raman spectrum). More in detail information on how to detect outliers using Z-score approaches can be found in reference [3].

Let’s calculate the z-scores for the points in our spectrum:

def z_score(intensity):
 mean_int = np.mean(intensity)
 std_int = np.std(intensity)
 z_scores = (intensity — mean_int) / std_int
 return z_scoresintensity_z_score = np.array(z_score(intensity))
plt.plot(wavelength, intensity_z_score)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( ‘Z-Score’ ,fontsize = 20)
plt.show()

**Figure 2.** Z-scores of the spectrum plotted in figure 1.

A threshold is then needed in order to tell whether a value is an outlier or not. A typical value for this threshold is 3.5, as proposed as guideline by the American Society of Quality control as the basis of an outlier-labeling rule², while the authors of the publication used 6. In order to apply a threshold to exclude spikes, the absolute value of the Z-score must be taken:

|z(i)| = |(x(i)-μ) / σ|

calculated as

def z_score(intensity):
 mean_int = np.mean(intensity)
 std_int = np.std(intensity)
 z_scores = (intensity — mean_int) / std_int
 return z_scoresthreshold = 3.5intensity_z_score = np.array(abs(z_score(intensity)))
plt.plot(wavelength, intensity_z_score)
plt.plot(wavelength, threshold*np.ones(len(wavelength)), label = ‘threshold’)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( ‘|Z-Score|’ ,fontsize = 20)
plt.show()

**Figure 2b.** Z-scores of the spectrum plotted in figure 1. The threshold still cuts the 2D Raman peak.

However, the z-score pretty much resembles the original spectrum and the threshold still cut off the main Raman peak. Let’s plot what was detected as spikes using a 3.5 threshold:

threshold = 3.5# 1 is assigned to spikes, 0 to non-spikes:
spikes = abs(np.array(z_score(intensity))) > thresholdplt.plot(wavelength, spikes, color = ‘red’)
plt.title(‘Spikes: ‘ + str(np.sum(spikes)), fontsize = 20)
plt.grid()
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( 'Z-scores > ' + str(threshold) ,fontsize = 20)
plt.show()

**Figure 3.**Using modified Z-Scores, 18 spectral point are above the threshold.

Modified Z-score based approach for spike detection in Raman spectra

A second option would be making use of robust statistics and calculate the modified z-scores of the spectrum. This modified Z-score method uses the median (M) and median absolute deviation (MAD) rather than the mean and standard deviation:

z(i) = 0.6745 (x(i)-M) / MAD

where the MAD = median(|x-M|) and |…| represents the absolute value. Both the median and the MAD are robust measures of the central tendency and dispersion, respectively. The multiplier 0.6745 is the 0.75th quartile of the standard normal distribution, to which the MAD converges to⁴.

def modified_z_score(intensity):
 median_int = np.median(intensity)
 mad_int = np.median([np.abs(intensity — median_int)])
 modified_z_scores = 0.6745 * (intensity — median_int) / mad_int
 return modified_z_scoresthreshold = 3.5intensity_modified_z_score = np.array(abs(modified_z_score(intensity)))
plt.plot(wavelength, intensity_modified_z_score)
plt.plot(wavelength, threshold*np.ones(len(wavelength)), label = 'threshold')
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)' ,fontsize = 20)
plt.ylabel('|Modified Z-Score|' ,fontsize = 20)
plt.show()

**Figure 4.** Modified Z-scores of the spectrum plotted in figure 1. The threshold still cuts the 2D Raman peak.

The same issue can still be observed:

# 1 is assigned to spikes, 0 to non-spikes:
spikes = abs(np.array(modified_z_score(intensity))) > thresholdplt.plot(wavelength, spikes, color = ‘red’)
plt.title(‘Spikes: ‘ + str(np.sum(spikes)), fontsize = 20)
plt.grid()
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
plt.ylabel( 'Modified Z-scores > ' + str(threshold) ,fontsize = 20)
plt.show()

**Figure 5.** Using modified Z-Scores, 24 spectral points are above the threshold.

The 2D Raman peak is still detected as spike, so a more sensitive approach is required.

Whitaker and Hayes’ modified Z-score based approach for spike detection in Raman spectra

Whitaker and Hayes propose to make advantage of the high intensity and small width of spikes and therefore use the difference between consecutive spectrum points ∇x(i) = x(i)-x(i-1) to calculate the z-scores, where x(1), …, x(n) are the values of a single Raman spectrum recorded at equally spaced wavenumbers and i = 2, …, n. This step annihilates linear and slow moving curve linear trends, while sharp thin spikes will be preserved. Now,

z(i) = 0.6745 (∇x(i)-M) / MAD

So there is only an extra step included in which the difference between consecutive values are included.

# First we calculated ∇x(i):
dist = 0
delta_intensity = [] 
for i in np.arange(len(intensity)-1):
 dist = intensity[i+1] — intensity[i]
 delta_intensity.append(dist)delta_int = np.array(delta_intensity)
# Alternatively to the for loop one can use: 
# delta_int = np.diff(intensity)
intensity_modified_z_score = np.array(modified_z_score(delta_int))
plt.plot(wavelength[1:], intensity_modified_z_score)
plt.title('Modified Z-Score using ∇x(i)')
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)', fontsize = 20)
plt.ylabel('Score', fontsize = 20)
plt.show()

**Figure 6.** Modified Z-scores using ∇x(i) of the spectrum plotted in figure 1.

Again, in order to apply a threshold to exclude spikes, the absolute value of the modified Z-score must be taken:

|z(i)| =|0.6745 (∇x(i)-M) / MAD|

Resulting in

threshold = 3.5intensity_modified_z_score = np.array(np.abs(modified_z_score(delta_int)))
plt.plot(wavelength[1:], intensity_modified_z_score)
plt.plot(wavelength[1:], threshold*np.ones(len(wavelength[1:])), label = 'threshold')
plt.title('Modified Z-Score of ∇x(i)', fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)', fontsize = 20)
plt.ylabel('Score', fontsize = 20)
plt.show()

**Figure 7.** Abosolute value of the modified Z-scores of ∇x(i) of the spectrum plotted in figure 1.

And once more, the number of detected spikes can be calculated as

# 1 is assigned to spikes, 0 to non-spikes:
spikes = abs(np.array(modified_z_score(intensity))) > thresholdplt.plot(wavelength, spikes, color = ‘red’)
plt.title(‘Spikes: ‘ + str(np.sum(spikes)), fontsize = 20)
plt.grid()
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( ‘Wavelength’ ,fontsize = 20)
#plt.ylabel( ‘Spike(1) or not(0)?’ ,fontsize = 20)
plt.show()

**Figure 8.** Using modified Z-Scores with ∇x(i), 17 spectral points are above the threshold = 3.5.

For the 3.5 recommended threshold many false spikes are assigned. However, the value scored by this approach is much higher in comparison with the Raman peaks than before.

plt.plot(wavelength[1:],
np.array(abs(modified_z_score(delta_int))), color='black', label = '|Modified Z-Score using ∇x(i)|')
plt.plot(wavelength, np.array(abs(modified_z_score(intensity))), label = '|Modified Z-Score|', color = 'red')
plt.plot(wavelength, np.array(abs(z_score(intensity))), label = '|Z-Score|', color = 'blue')
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel( 'Wavelength' ,fontsize = 20)
plt.ylabel( 'Score' ,fontsize = 20)
plt.legend()
plt.show()

**Figure 9.** Comparison between the three different approaches.

In general, the right threshold must be chosen depending on the data set. For this case, a threshold = 7 is already enough to obtain a clear selection.

**Figure 10.** Absolute values of the modified Z-scores with ∇x(i) of the spectrum plotted in figure 1.

**Figure 11.** Using modified Z-Scores with ∇x(i), 3 spectral points are above the threshold = 7.

Fixing the Raman spectrum

Once the spikes are detected, the next step is to remove them and fix the spectra. For this, interpolated values are obtained at each candidate wavenumber by calculating the mean of its immediate neighbors (within a 2m+1 values window).

def fixer(y,m):
 threshold = 7 # binarization threshold. 
 spikes = abs(np.array(modified_z_score(np.diff(y)))) > threshold
 y_out = y.copy() # So we don’t overwrite y for i in np.arange(len(spikes)):
  if spikes[i] != 0: # If we have an spike in position i
   w = np.arange(i-m,i+1+m) # we select 2 m + 1 points around our spike
   w2 = w[spikes[w] == 0] # From such interval, we choose the ones which are not spikes
   y_out[i] = np.mean(y[w2]) # and we average their values
 
return y_out# Does it work?
plt.plot(wavelength, intensity, label = 'original data')
plt.plot(wavelength, fixer(intensity,m=3), label = 'fixed spectrum')
plt.title('Despiked spectrum',fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Wavelength (nm)' ,fontsize = 20)
plt.ylabel('Intensity (a.u.)' ,fontsize = 20)
plt.legend()
plt.show()

**Figure 12.** Original and despiked Raman spectrum using modified Z-Scores with ∇x(i) with the threshold = 7.

So after calculation of the modified Z scores of ∇x, and thresholding by setting an appropiate threshold value, the spike is removed and smoothed by applying a moving average filter.

Other examples

To wrap up, two examples of spikes with different intensity signal are presented, showing the strenght of this approach.

**Figure 13.** Examples of two more spectra in the first row of graphs, their Z-scores (blue lines) , modified Z-scores (red lines) and modified Z-scores using ∇x(i) (black lines) in the second row of graphs and thei despiked spectra in the third row, are presented.

Aknowledgements. I would like to thank Jorge Luis Hita and Ana Solaguren-Beascoa for taking the time to proofread this notes.

[1] Whitaker, Darren A., and Kevin Hayes. “A simple algorithm for despiking Raman spectra.” Chemometrics and Intelligent Laboratory Systems 179 (2018): 82–84.

[2] Iglewicz, B., and D. Hoaglin. “The ASQC basic references in quality control: statistical techniques.” How to detect and handle outliers 16 (1993): 1–87.

[3] Colin Gorrie’s Data Story: Three ways to detect outliers. http://colingorrie.github.io/outlier-detection.html

[4] Joao Rodrigues. “Outliers make us go MAD: Univariate Outlier Detection”. Medium. https://medium.com/james-blogs/outliers-make-us-go-mad-univariate-outlier-detection-b3a72f1ea8c7