Scatter correction and outlier detection in NIR spectroscopy

Shravankumar Hiregoudar
Towards Data Science
7 min readMay 19, 2020

--

Recent advancements in the field of AI have given rise to the integration of NIR sensor data and machine learning techniques to achieve the results. In this article, The NIR sensor is interfaced with a PC and the samples were scanned. We performed ten scans per sample with ten seconds scan time to reduce the errors and include all the parts of the sample. In our case, the scan results were scattered and, we had outliers due to change in the light, working distance, and human errors during the scanning. To reduce the scattering and to eliminate the outliers, we implemented Multiplicative Scatter Correction (MSC) and Standard Normal Variate (SNV) on near-infrared (NIR) data.

Data pre-processing is an essential step to build most types of calibration models in NIR and most spectroscopy analysis. With a well-designed pre-processing step, the performance of the model can be greatly improved.

The raw data; sample 1 and sample 2 containing replicate readings

Fig.1. Raw Absorbance spectra of sample 1 with 10 replicate readings. x-axis: wavelengths and y-axis: absorbance units
Fig.2. Raw Absorbance spectra of sample 2 with 10 replicate readings. x-axis: wavelengths and y-axis: absorbance units

Multiplicative Scatter Correction in Python

MSC requires a reference spectrum to start with. The reference spectrum is ideally a spectrum free of scattering effects. In practice, we consider the mean of the sample data as the reference. This is the most important difference between MSC and SNV.

Now, as you can gather, getting our hands on a spectrum that is free of unwanted scattering effects is not easy, not across all wavelengths we are interested in. For this reason, if the data is reasonably well behaved, we can take the average spectrum to be a close approximation to the ideal spectrum we are after. Particle size and path length effects should vary randomly from sample to sample, and therefore the average should reasonably reduce these effects, at least in the approximations that these effects are genuinely random. This is the main assumption behind MSC.

Let us consider Xm to be the mean spectrum. We first regress each spectrum Xi against the mean spectrum. This is done by ordinary least squares Xi ≈ ai + biXm and then We calculate the corrected spectrum Xmsc = (Xi-ai)/bi

def msc(input_data, reference=None):
"""
:msc: Scatter Correction technique performed with mean of the sample data as the reference.
:param input_data: Array of spectral data
:type input_data: DataFrame
:returns: data_msc (ndarray): Scatter corrected spectra data
"""
eps = np.finfo(np.float32).eps
input_data = np.array(input_data, dtype=np.float64)
ref = []
sampleCount = int(len(input_data))

# mean centre correction
for i in range(input_data.shape[0]):
input_data[i,:] -= input_data[i,:].mean()

# Get the reference spectrum. If not given, estimate it from the mean
# Define a new array and populate it with the corrected data
data_msc = np.zeros_like(input_data)
for i in range(input_data.shape[0]):
for j in range(0, sampleCount, 10):
ref.append(np.mean(input_data[j:j+10], axis=0))
# Run regression
fit = np.polyfit(ref[i], input_data[i,:], 1, full=True)
# Apply correction
data_msc[i,:] = (input_data[i,:] - fit[0][1]) / fit[0][0]

return (data_msc)

Further, we can plot the MSC values

def linegraph(df,name="data"):
"""
:linegraph: Plot absorbance unit vs wavelength number graph
:param df: Spectral data containing absorbance units and wavelengths
:type df: ndarray
:returns: Absorbance unit vs wavelength graph
"""
# Read the spectral data file
neospectraDataFile= neoSpectraSensorData.getNeospectraDataFile()
# Get the wavelengths
wavelengths =
np.array(neospectraDataFile.columns).astype(np.float)
x = wavelengths
y = df
ys = [i for i in y] #put into the format for LineCollection
# We need to set the plot limits, they will not autoscale
fig, ax = plt.subplots(figsize=(10,10), dpi=150)
ax.set_xlim(np.min(x), np.max(x))
ax.set_ylim(np.min(ys), np.max(ys))
# Make a sequence of x,y pairs
line_segments = LineCollection([np.column_stack([x, y]) for y in ys],linewidths=(0.5),linestyles='solid')
ax.add_collection(line_segments)
ax.set_title('Absorbance unit graph of '+name, fontsize = 15)
ax.set_xlabel(r'Wavenlength', fontsize=15)
ax.set_ylabel('Absorbance', fontsize=15)

return(plt.show())

Results of MSC :

Fig.3. MSC Corrected data graph of sample 1; x-axis: wavelengths and y-axis: absorbance units
Fig.4. MSC Corrected data graph of sample 2; x-axis: wavelengths and y-axis: absorbance units

Standard Normal Variate in Python

Unlike MSC, SNV correction is done on each individual spectrum, and a reference spectrum is not required. The SNV correction can be divided into two conceptual steps as well.

Mean center each spectrum Xi by taking away its mean Ximean. Divide each mean-centered spectrum by its own standard deviation: Xisnv = (Xi — Ximean)/sigma

def snv(input_data):
"""
:snv: A correction technique which is done on each
individual spectrum, a reference spectrum is not
required
:param input_data: Array of spectral data
:type input_data: DataFrame

:returns: data_snv (ndarray): Scatter corrected spectra
"""

input_data = np.asarray(input_data)

# Define a new array and populate it with the corrected data
data_snv = np.zeros_like(input_data)
for i in range(data_snv.shape[0]):
# Apply correction
data_snv[i,:] = (input_data[i,:] - np.mean(input_data[i,:])) / np.std(input_data[i,:])

return (data_snv)

Further plot the SNV plot by calling the above linegraph function

Results of SNV :

Fig.5. SNV Corrected data graph of sample 1; x-axis: wavelengths and y-axis: absorbance units
Fig.6. SNV Corrected data graph of sample 2; x-axis: wavelengths and y-axis: absorbance units

The result after applying MSC and SNV on raw data:

Fig.7. a. Raw spectra b. SNV spectra c. MSC spectra of sample 1
FIg.8. a. Raw spectra b. SNV spectra c. MSC spectra of sample 2

Now, The results are scatter corrected. The outliers in the MSC can be eliminated by using a simple z score technique.

Outlier detection on scatter corrected data

The outliers in the case of MSC(raw data) are considerably low. So we consider spectra to be an outlier if it has more than 40% points as outliers

zscore = (individualSample — individualSample.median( ))/ individualSample.std()

Define the z score function:

def zscorefunction(arrayMatrix, threshold=1):
"""
:zscorefunction: Compute the z score of arrayMatrix
(Individual sample), relative to the sample median and
standard deviation.
:param arrayMatrix: Array file containing spetral data of
one sample
:type arrayMatrix: array
:returns: The coordinates of the points which are
considered to be outliers.
We are interested in x coordinate of the results.
Here, In our case, the x coordinate is the spectra number.
- Example of the output:: output:(array([1,2,1,2,3,4]), array([1,5,8,70,85,143]))

Read as, (spectra number, point in the spectra)
(1,1),(2,5),(1,8),(2,70),(3,85) and (4,143).

[1,2,1,2,3,4] are the spectra number of the sample
and [1,5,8,70,85,143] are the point in the spectra

The 1st and 8th wavelength point in the 1st spectra
are outliers.similarly, The 5th and 70th wavelength
point in the 2nd spectra are outliers
"""
# A z-score is the number of standard deviations away from a
mean for a data point.
zscore = (arrayMatrix - np.median(arrayMatrix))/
arrayMatrix.std()

return (np.where(np.abs(zscore) > threshold))

The z score function returns the coordinates of the points which are considered to be outliers.

def deleteOutliersSummary(X,Y,summary = True):
"""
:deleteOutliersSummary: Calls the z score function to get
the outlier spectra numbers.We are interested in x
coordinate of the results. In our case, the x coordinate is
the spectra number.So, we apply the counter function on the
result to get the total count of outlier points for
spectra.and delete the spectra if the spectra has 75% of
its points as outliers
:param X: Training spectral file (usually MSCTrain)
:type X: array
:param Y: Training target file
:type Y: array
:returns: individualX (list) and y (list),
New spectral & target train files with outliers eliminated
""" # A z-score is the number of standard deviations away from a
mean for a data point.

# We define a deleteSpectra where we store the Spectra number
with more than 75% (you can change this based on your MSC data)
points as outliers
deleteSpectra = []
individualX = neoSpectraSensorData.getIndividualSamples(X)
y= getMeanData.getMeanTarget(Y)
out = 0
noOut = 0
sampleCount = len(individualX)
for i in range(0,sampleCount):
# call the function
x = zscorefunction(individualX[i])[0]
# print sample number and spectra number with its
corresponding number of outlier points

if summary == True:
# Counter gives the spectra number(index): number of
outlier points
print("\nSAMPLE",i+1)
print(Counter(x))
threshold = 0.75*X[1].shape[0]
for j in range(0,individualX[i].shape[0]):
# If the sepctra contains more than 75% of points as
outliers, delete the spectra

if (Counter(x)[j] > threshold):
deleteSpectra.append(j)
# Delete the outlier spectra from the sample
individualX[i] = np.delete(individualX[i], deleteSpectra, 0)
y[i] = np.delete(y[i], deleteSpectra, 0)

# If the sample has no outliers in all it's spectra then
display "No outliers detected"
if deleteSpectra != []:
out +=1
else:
noOut +=1

if noOut == sampleCount:
print("No outliers detected")

if summary == True:
print ("Delete Spectra:",deleteSpectra)
del deleteSpectra[:]

return(individualX,y)

Output:

Counter gives the spectra number(index): number of outlier points

Plotting the data after eliminating outliers from MSC data :

Fig.9. Graph after eliminating outliers from MSC data of sample 1
Fig.10. Graph after eliminating outliers on MSC data of sample 2

Here you can see that the outliers are eliminated on the scatter corrected data. now, we can use this data for model building.

Final results:

Raw data — MSC of raw data — Outliers elimination on MSC of raw data

Fig.11. Raw data — MSC data — Outliers elimination on MSC of sample 1
Fig.12. Raw data — MSC data — Outliers elimination on MSC of sample 2

Thank you for your time. I hope it helped you.

Reference : https://nirpyresearch.com/two-scatter-correction-techniques-nir-spectroscopy-python/

--

--