Introduction
Attack and anomaly detection is a rising concern for Internet of Things (IoT) infrastructure. As IoT deployments grow, so does the number of attacks against them, so there is a need for a smart, secure IoT environment that can detect when it is under attack. This article proposes a machine learning-based solution that identifies the type of attack so that the IoT system can be protected.

Design and workflow of the solution
The workflow of the solution is described below, along with a pictorial representation.

The steps involved in the workflow are as follows:
- Dataset collection and description: A virtual IoT environment is created using the Distributed Smart Space Orchestration System (DS2OS), which provides a set of IoT services such as a temperature controller, a window controller, and a light controller. The communication between the user and the services is captured and stored in CSV format. The dataset has 357,952 samples and 13 features: 347,935 samples are normal and 10,017 are anomalous. Each sample is labeled with one of eight classes: seven attack types (Denial of Service (DoS), Data Type Probing, Malicious Control, Malicious Operation, Scan, Spying, Wrong Setup) and Normal. The dataset is free to use and is available on Kaggle at https://www.kaggle.com/francoisxa/ds2ostraffictraces. The file "mainSimulationAccessTraces.csv" contains the dataset and is read using the pandas library:
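import pandas as pd  # pandas for reading the CSV file
import numpy as np  # NumPy for array handling
Dataset = pd.read_csv('mainSimulationAccessTraces.csv')
# Features: every column except the last two (the timestamp, per the
# feature-selection step described below, and the class label)
x = Dataset.iloc[:, :-2].values
# Target: the class label in the last column (index 12)
y = Dataset.iloc[:, 12].values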
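Before preprocessing, it can help to check how unbalanced the classes are and which columns have gaps. With the counts quoted above, a model that always predicts Normal would already reach 347,935 / 357,952 ≈ 97.2% accuracy, which is a useful baseline to keep in mind. The following is a minimal sketch on a tiny hand-made frame. The column names, the toy values, and the helper `summarize` are assumptions for illustration and are not taken from the real CSV.

```python
import numpy as np
import pandas as pd


def summarize(df, label_col):
    """Return per-class sample counts and per-column missing-value counts."""
    class_counts = df[label_col].value_counts()
    missing_counts = df.isna().sum()
    return class_counts, missing_counts


# Toy stand-in for the real traces; column names and values are hypothetical.
toy = pd.DataFrame({
    "node_type": ["sensor", np.nan, "light", "sensor"],
    "value": [21.5, np.nan, 0.0, 19.0],
    "label": ["normal", "scan", "normal", "normal"],
})

class_counts, missing_counts = summarize(toy, "label")

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = class_counts.max() / class_counts.sum()

print(class_counts)
print(missing_counts)
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the real data, the same helper could be called as `summarize(Dataset, <label column name>)`, where the label column name has to be read from the CSV header.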
- Data preprocessing: The first step in data preprocessing is to handle the missing values in the dataset. The "Accessed Node Type" and "Value" columns contain missing data caused by anomalies raised during data transfer. Since "Accessed Node Type" is categorical, I fill it with a constant value. The "Value" column is numerical, so I use the mean strategy to fill its missing values. The next step is feature selection, which removes the timestamp feature because it carries no useful information for classification. The final step converts the nominal categorical data into integer codes using label encoding.
from sklearn.impute import SimpleImputer
# Column 8 ("Accessed Node Type"): fill missing entries with a constant placeholder
imputer = SimpleImputer(missing_values=np.nan, strategy='constant')
imputer = imputer.fit(x[:, [8]])
x[:, [8]] = imputer.transform(x[:, [8]])
# Column 10 ("Value"): fill missing entries with the column mean
imputer1 = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer1 = imputer1.fit(x[:, [10]])
x[:, [10]] = imputer1.transform(x[:, [10]])
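from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
# Label-encode feature columns 0-9; column 10 ("Value") is treated as numeric
for i in range(0, 10):
    x[:, i] = labelencoder_X.fit_transform(x[:, i])
# Convert the object array to a float array
x = np.array(x, dtype=float)
# Encode the class labels as integers
y = labelencoder_X.fit_transform(y)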
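The imputation and encoding steps above can also be bundled into a single preprocessing object, so that the fill values and category codes learned from one set of rows are reused unchanged on new rows. This is a sketch of one possible arrangement, not the method used in this article: it swaps `LabelEncoder` for scikit-learn's `OrdinalEncoder` (which is designed for feature columns and can map unseen categories to a fixed code), and the column names and toy rows are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column names, used only for this illustration.
categorical_cols = ["node_type", "operation"]
numeric_cols = ["value"]

# Categorical columns: constant fill, then integer codes.
# Categories not seen during fit are mapped to -1 instead of raising an error.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])

# Numeric columns: fill gaps with the column mean learned during fit.
numeric_steps = SimpleImputer(strategy="mean")

preprocess = ColumnTransformer([
    ("cat", categorical_steps, categorical_cols),
    ("num", numeric_steps, numeric_cols),
])

# Toy rows standing in for the real traces.
train_rows = pd.DataFrame({
    "node_type": ["sensor", np.nan, "light"],
    "operation": ["read", "write", "read"],
    "value": [20.0, np.nan, 22.0],
})
new_rows = pd.DataFrame({
    "node_type": ["door"],   # category never seen during fit
    "operation": ["read"],
    "value": [np.nan],       # filled with the mean learned from train_rows
})

x_train_toy = preprocess.fit_transform(train_rows)
x_new_toy = preprocess.transform(new_rows)
print(x_train_toy)
print(x_new_toy)
```

Like label encoding, ordinal encoding assigns arbitrary integer codes to nominal values; tree-based models usually cope with this, but it is a design choice worth revisiting for other model types.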
- Sampling: This stage splits the dataset into training and test sets. I assign 80% of the dataset to training and the remaining 20% to testing.
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
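Because only 10,017 of the 357,952 samples are anomalous, a purely random split can leave the rarer attack classes unevenly represented in the test set. One option, shown here as a sketch on toy data rather than as the split used in this article, is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both parts. The arrays `x_toy` and `y_toy` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 90 "normal" (0) and 10 "anomalous" (1) samples.
y_toy = np.array([0] * 90 + [1] * 10)
x_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 90/10 class ratio in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

train_anomaly_rate = y_tr.mean()
test_anomaly_rate = y_te.mean()
print(train_anomaly_rate, test_anomaly_rate)
```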
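- Normalization: The training and test sets are standardized using the StandardScaler class, which rescales each feature to zero mean and unit variance so that all features lie in a similar range. The scaler is fitted on the training set only and then applied to the test set.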
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
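To make the normalization step concrete, the sketch below reproduces what `StandardScaler` computes by hand: each feature is shifted by the training mean and divided by the training standard deviation, and those same training statistics are reused for the test data. The toy arrays are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (toy values).
train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_toy = np.array([[4.0, 400.0]])

scaler = StandardScaler().fit(train_toy)
scaled_train = scaler.transform(train_toy)
scaled_test = scaler.transform(test_toy)

# The same transformation written out by hand.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std().
train_mean = train_toy.mean(axis=0)
train_std = train_toy.std(axis=0)
manual_test = (test_toy - train_mean) / train_std

print(scaled_test)
print(manual_test)
```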
- Building an ML model: The training set is passed to a random forest classifier, which produces the trained model (predictor). I use the scikit-learn library for this step, with 10 trees in the forest. After training, the test set is passed to the model, which predicts for each sample whether it is normal or, if not, which type of attack it belongs to.

from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)
classifier.fit(x_train,y_train)
y_pred = classifier.predict(x_test)
- Model evaluation: The last step is to measure how well the model performs, using the confusion matrix and the accuracy score.
Using the random forest algorithm, I got an accuracy of 99.37%. A snapshot of the confusion matrix is included below.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy=accuracy_score(y_test, y_pred)
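Since most samples in this dataset are normal, overall accuracy can look high even when a rare attack class is often missed, so per-class recall is a useful companion to the accuracy score. The sketch below uses made-up labels (not results from this article) to show how the two can differ.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Made-up labels: class 0 is the majority class, class 1 is rare.
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 8 + [0, 1]  # one of the two rare samples is missed

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# average=None returns one recall value per class instead of a single number.
per_class_recall = recall_score(y_true_toy, y_pred_toy, average=None)

print(acc_toy)
print(cm_toy)
print(per_class_recall)
print(classification_report(y_true_toy, y_pred_toy))
```

Applied to the real predictions, the same calls would take `y_test` and `y_pred` in place of the toy lists.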

Conclusion
The random forest algorithm delivered an accuracy of 99.37% on a virtual IoT environment dataset. K-fold cross-validation can also be performed on top of this to check that the model is not overfitting. To learn more about the IoT environment used to generate the data, please refer to https://www.researchgate.net/publication/330511957_Machine_Learning-Based_Adaptive_Anomaly_Detection_in_Smart_Spaces
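The cross-validation mentioned above could look like the following sketch. It is not part of the original experiment: the data comes from scikit-learn's `make_classification` as a synthetic, unbalanced stand-in for the DS2OS traces, and the fold count of 5 is an arbitrary choice. Wrapping the scaler and the classifier in one pipeline means the scaler is refitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unbalanced stand-in data (roughly 90% / 10% class split).
x_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Same classifier settings as in the article, preceded by standardization.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0),
)

# Stratified folds keep the class ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, x_syn, y_syn, cv=folds, scoring="accuracy")

print(scores)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

A large spread between fold scores, or a mean well below the single-split accuracy, would suggest that the single train/test split was optimistic.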