Machine Learning: Supervised Learning

Updated on Jan 10, 2021
"If you live 5 minutes away from Bill Gates, I bet you are rich."
Introduction
In the realm of Machine Learning, K-Nearest Neighbors (KNN) makes the most intuitive sense and is therefore easily accessible to Data Science enthusiasts who want to break into the field. To decide the classification label of an observation, KNN looks at its neighbors and assigns the neighbors’ label to the observation of interest. This is the underlying idea of the KNN method, just like the Bill Gates analogy at the beginning: no higher-dimensional calculations are needed to understand how the algorithm works.
But there is a catch. Looking at only one neighbor may introduce bias and inaccuracy into the model, so we have to set up several "rules of engagement" for the method. For example, KNN can adopt the majority label among its k nearest neighbors, as the name suggests.

To decide the label for new observations, we look at the closest neighbors.
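To make the voting rule concrete, here is a tiny hypothetical example (the feature names and numbers are made up for illustration and are not part of the banking data used later): we classify one new observation by a majority vote among its k = 3 nearest neighbors.
library(class)
# five labeled neighbors described by two made-up features
train_x <- data.frame(income = c(1, 2, 9, 10, 11),
                      miles_from_gates = c(8, 9, 1, 2, 1))
train_y <- factor(c("not rich", "not rich", "rich", "rich", "rich"))
# one new observation that sits close to the "rich" cluster
new_x <- data.frame(income = 8, miles_from_gates = 2)
# majority vote among its 3 nearest neighbors returns the label "rich"
knn(train = train_x, test = new_x, cl = train_y, k = 3)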
Measure of Distance
To identify the nearest neighbors, we need a single number quantifying the similarity or dissimilarity between observations (Practical Statistics for Data Scientists). For that purpose, KNN uses two different distance metrics depending on the data type.
For discrete variables, KNN adopts the Hamming distance. It measures the minimum number of substitutions needed to turn one string into the other (Wikipedia).
For continuous variables, we use the Euclidean distance. To measure the distance between two vectors (x1, x2, …, xp) and (μ1, μ2, …, μp), we take their element-wise differences, square them, sum them up, and take the square root:

d = √((x1 − μ1)² + (x2 − μ2)² + … + (xp − μp)²)
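Both metrics are easy to compute in base R; a minimal sketch with illustrative vectors (not taken from the banking data):
# Hamming distance: the number of positions at which two equal-length sequences differ
x <- c("a", "b", "c", "d")
y <- c("a", "c", "c", "a")
sum(x != y)            # 2
# Euclidean distance between two numeric vectors
a <- c(1, 2, 3)
b <- c(4, 6, 3)
sqrt(sum((a - b)^2))   # 5
dist(rbind(a, b))      # same value via the built-in dist()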
As a side note, selecting distance metrics for KNN is a heavily tested topic in Data Science interviews. Interviewers typically ask for the justification for choosing one metric over the others, along with the tradeoffs. I’ve elaborated on how to approach this type of question in a related post:
Crack Data Science Interviews: Essential Machine Learning Concepts
What is K-Fold Cross Validation?
As noted, the key to KNN is to settle on the number of neighbors, and we resort to cross-validation (CV) to decide the optimal number. (Note that the K in "K-fold" refers to the number of folds in CV, a separate quantity from the k neighbors in KNN.)
Cross-validation can be briefly described in the following steps (a compact sketch in R follows the list):
- Divide the data into K roughly equal chunks/folds
- Choose one chunk/fold as the test set and the remaining K-1 chunks as the training set
- Develop a KNN model based on the training set
- Apply the model to the test set and compare the predicted values against the actual values
- Repeat the process K times so that each chunk/fold serves as the test set exactly once
- Average the metric scores over the K folds
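Here is the promised compact sketch of these steps, run on the small built-in iris dataset with class::knn, 5 folds, and k = 5 neighbors; every name and value in it is illustrative and not part of the banking pipeline below.
library(class)
set.seed(1)
nfold <- 5
X <- iris[, 1:4]   # numeric predictors
Y <- iris$Species  # class labels
# step 1: randomly assign every row to one of the 5 folds
fold_id <- sample(cut(seq_len(nrow(X)), breaks = nfold, labels = FALSE))
# steps 2-6: hold each fold out once, fit KNN on the rest, record the error
cv_error <- sapply(1:nfold, function(f) {
  hold_out <- fold_id == f
  pred <- knn(train = X[!hold_out, ], test = X[hold_out, ],
              cl = Y[!hold_out], k = 5)
  mean(pred != Y[hold_out])  # misclassification rate on the held-out fold
})
mean(cv_error)               # CV estimate of the test error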
How to Choose K?
Technically speaking, we can set K to any value between 1 and the sample size n. Setting K = n, CV holds out one observation as the test set and uses the remaining n-1 cases as the training set, repeating the process over the entire dataset. This type of CV is called "leave-one-out cross-validation" (LOOCV).
LOOCV makes intuitive sense but requires a ton of computational power; it runs forever on a significantly large dataset.
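For KNN specifically, class::knn.cv() implements exactly this leave-one-out scheme. A quick illustration on the small iris dataset (running it on the full banking data would take much longer):
library(class)
set.seed(1)
loocv_pred <- knn.cv(train = iris[, 1:4], cl = iris$Species, k = 5)
mean(loocv_pred != iris$Species)  # LOOCV estimate of the misclassification rate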
To choose the best number of folds K, we have to balance the tradeoff between bias and variance. For a small K, the estimate of the test error has a high bias but a low variance; for a big K, it has a low bias but a high variance (Crack Data Science Interviews).

Code Implementation in R
1. Software Preparation
# install.packages("ISLR")
# install.packages("ggplot2") # install.packages("plyr")
# install.packages("dplyr") # install.packages("class")
# Load libraries
library(ISLR)
library(ggplot2)
library(reshape2)
library(plyr)
library(dplyr)
library(class)
library(car) # recode() with rule strings such as "'no'=1;'yes'=2" comes from car (loaded last so it masks dplyr::recode)
# load data and clean the dataset
banking=read.csv("bank-additional-full.csv",sep =";",header=T)
##check for missing data and make sure no missing data
banking[!complete.cases(banking),]
#re-code qualitative (factor) variables into numeric
banking$job= recode(banking$job, "'admin.'=1;'blue-collar'=2;'entrepreneur'=3;'housemaid'=4;'management'=5;'retired'=6;'self-employed'=7;'services'=8;'student'=9;'technician'=10;'unemployed'=11;'unknown'=12")
#recode variable again
banking$marital = recode(banking$marital, "'divorced'=1;'married'=2;'single'=3;'unknown'=4")
banking$education = recode(banking$education, "'basic.4y'=1;'basic.6y'=2;'basic.9y'=3;'high.school'=4;'illiterate'=5;'professional.course'=6;'university.degree'=7;'unknown'=8")
banking$default = recode(banking$default, "'no'=1;'yes'=2;'unknown'=3")
banking$housing = recode(banking$housing, "'no'=1;'yes'=2;'unknown'=3")
banking$loan = recode(banking$loan, "'no'=1;'yes'=2;'unknown'=3")
banking$contact = recode(banking$contact, "'cellular'=1;'telephone'=2")
banking$month = recode(banking$month, "'mar'=1;'apr'=2;'may'=3;'jun'=4;'jul'=5;'aug'=6;'sep'=7;'oct'=8;'nov'=9;'dec'=10")
banking$day_of_week = recode(banking$day_of_week, "'mon'=1;'tue'=2;'wed'=3;'thu'=4;'fri'=5")
banking$poutcome = recode(banking$poutcome, "'failure'=1;'nonexistent'=2;'success'=3")
#remove variable "pdays", b/c it has no variation
banking$pdays=NULL
#remove variable "duration", b/c itis collinear with the DV
banking$duration=NULL
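As an optional sanity check (not in the original post), it is worth confirming the dimensions and column types of the cleaned data and making sure no missing values slipped through, since knn() expects numeric predictors:
dim(banking)         # rows and columns after cleaning
str(banking)         # column types after recoding
sum(is.na(banking))  # should be 0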
After loading and cleaning the original dataset, it is a common practice to visually examine the distribution of our variables, checking for seasonality, patterns, outliers, etc.
#EDA of the DV
plot(banking$y,main="Plot 1: Distribution of Dependent Variable")

As can be seen, the outcome variable (banking service subscription) is not evenly distributed, with many more "No"s than "Yes"s, which makes the classification problem harder for Machine Learning. With such an imbalance, many minority ("Yes") cases tend to be classified as the majority ("No") class. For a rare event with such an unequal distribution, a non-parametric classification method is the preferred choice.
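A quick numeric check of this imbalance (using the same outcome column, banking$y, as in the code below):
# share of "no" vs "yes" in the outcome variable
prop.table(table(banking$y))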
2. Data Split
We split the dataset into a training set and a test set. As a rule of thumb, we stick to the "80–20" division: 80% of the data as the training set and 20% as the test set.
#split the dataset into training and test sets randomly, but we need to set seed so as to generate the same value each time we run the code
set.seed(1)
#create an index to split the data: 80% training and 20% test
index = round(nrow(banking)*0.2,digits=0)
#sample randomly throughout the dataset and keep the total number equal to the value of index
test.indices = sample(1:nrow(banking), index)
#80% training set
banking.train=banking[-test.indices,]
#20% test set
banking.test=banking[test.indices,]
#Select the training set except the DV
YTrain = banking.train$y
XTrain = banking.train %>% select(-y)
# Select the test set except the DV
YTest = banking.test$y
XTest = banking.test %>% select(-y)
3. Train Models
Let’s create a new function ("calc_error_rate") to record the misclassification rate. The function calculates the rate at which the predicted labels from the trained model do not match the actual outcome labels; in other words, it is one minus the classification accuracy.
#define an error rate function and apply it to obtain test/training errors
calc_error_rate <- function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
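As a quick check of the helper on made-up labels, where one of four predictions is wrong:
calc_error_rate(predicted.value = c("yes", "no", "no", "no"),
                true.value      = c("yes", "yes", "no", "no"))
# returns 0.25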
Then, we need another function, "do.chunk()", to run k-fold cross-validation. For each fold, the function returns a data frame containing the training error and the validation error.
The main purpose of this step is to select the best K value for KNN.
nfold = 10
set.seed(1)
# cut() divides the row indices into nfold intervals (fold ids), and sample() shuffles the assignment
folds = seq.int(nrow(banking.train)) %>%
  cut(breaks = nfold, labels=FALSE) %>%
  sample
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef!=chunkid) # training index
  Xtr = Xdat[train,]   # training set by the index
  Ytr = Ydat[train]    # true labels in the training set
  Xvl = Xdat[!train,]  # validation set
  Yvl = Ydat[!train]   # true labels in the validation set
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, k = k) # predict training labels
  predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k) # predict validation labels
  data.frame(fold = chunkid,                              # fold id
             train.error = calc_error_rate(predYtr, Ytr), # training error per fold
             val.error = calc_error_rate(predYvl, Yvl))   # validation error per fold
}
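As a single illustrative call (not in the original post, and it may take a moment on the full training set), holding out fold 1 and using 10 neighbors:
do.chunk(chunkid = 1, folddef = folds, Xdat = XTrain, Ydat = YTrain, k = 10)
# returns one row with the fold id, training error, and validation error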
# set error.folds to save validation errors
error.folds=NULL
# candidate numbers of neighbors: 1, then 10 to 50 in steps of 10
kvec = c(1, seq(10, 50, length.out=5))
set.seed(1)
for (j in kvec){
  tmp = ldply(1:nfold, do.chunk,  # apply do.chunk() to each fold
              folddef=folds, Xdat=XTrain, Ydat=YTrain, k=j) # required arguments
  tmp$neighbors = j               # track each value of neighbors
  error.folds = rbind(error.folds, tmp) # combine the results
}
#melt() in the package reshape2 melts wide-format data into long-format data
errors = melt(error.folds, id.vars=c("fold","neighbors"), value.name= "error")
The next step is to find the value of k that minimizes the average validation error.
val.error.means = errors %>%
  #select all rows of validation errors
  filter(variable == "val.error") %>%
  #group the selected data by neighbors
  group_by(neighbors, variable) %>%
  #calculate the average CV error for each k
  summarise_each(funs(mean), error) %>%
  #remove existing grouping
  ungroup() %>%
  filter(error == min(error))
# Best number of neighbors
# if there is a tie, pick the larger number of neighbors for a smoother, simpler model
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 20
Therefore, the best number of neighbors is 20 after using 10-fold cross-validation.
4. Some Model Metrics
#training error
set.seed(20)
pred.YTtrain = knn(train=XTrain, test=XTrain, cl=YTrain, k=20)
knn_training_error <- calc_error_rate(predicted.value=pred.YTtrain, true.value=YTrain)
knn_training_error
[1] 0.101214
The training error is 0.10.
#test error
set.seed(20)
pred.YTest = knn(train=XTrain, test=XTest, cl=YTrain, k=20)
knn_test_error <- calc_error_rate(predicted.value=pred.YTest, true.value=YTest)
knn_test_error
[1] 0.1100995
The test error is 0.11.
#confusion matrix
conf.matrix = table(predicted=pred.YTest, true=YTest)

Based on the confusion matrix above, we can calculate the following values and prepare for plotting the ROC curve (a direct computation from conf.matrix follows the list).
Accuracy = (TP +TN)/(TP+FP+FN+TN)
TPR/Recall/Sensitivity = TP/(TP+FN)
Precision = TP/(TP+FP)
Specificity = TN/(TN+FP)
FPR = 1 – Specificity = FP/(TN+FP)
F1 Score = 2TP/(2TP+FP+FN) = 2 × Precision × Recall/(Precision + Recall)
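These quantities can be computed directly from conf.matrix (rows are predicted labels, columns are true labels), assuming "yes" is treated as the positive class; the object names below are illustrative:
TP <- conf.matrix["yes", "yes"]; TN <- conf.matrix["no", "no"]
FP <- conf.matrix["yes", "no"];  FN <- conf.matrix["no", "yes"]
accuracy    <- (TP + TN) / (TP + FP + FN + TN)
recall      <- TP / (TP + FN)   # TPR / sensitivity
precision   <- TP / (TP + FP)
specificity <- TN / (TN + FP)
fpr         <- 1 - specificity
f1          <- 2 * precision * recall / (precision + recall)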
# Test accuracy rate
sum(diag(conf.matrix)/sum(conf.matrix))
[1] 0.8899005
# Test error rate
1 - sum(diag(conf.matrix)/sum(conf.matrix))
[1] 0.1100995
As you may notice, test accuracy rate + test error rate = 1, and I’m providing multiple ways of calculating each value.
# ROC and AUC
library(ROCR) # prediction() and performance() come from the ROCR package
knn_model = knn(train=XTrain, test=XTrain, cl=YTrain, k=20, prob=TRUE)
prob <- attr(knn_model, "prob") # vote share of the winning class for each observation
# convert the winning-class vote share into a score for the positive class ("yes")
prob <- ifelse(knn_model == "yes", prob, 1 - prob)
pred_knn <- prediction(prob, YTrain)
performance_knn <- performance(pred_knn, "tpr", "fpr")
# AUC
auc_knn <- performance(pred_knn,"auc")@y.values
auc_knn
[1] 0.8470583
plot(performance_knn,col=2,lwd=2,main="ROC Curves for KNN")

In conclusion, we have learned what KNN is and walked through the pipeline of building a KNN model in R. We have also covered how to conduct k-fold cross-validation and how to implement it in R.
The complete code is available on my GitHub.
Medium recently evolved its Writer Partner Program, which supports ordinary writers like myself. If you are not a subscriber yet and sign up via the following link, I’ll receive a portion of the membership fees.
My Data Science Interview Sequence
Essential SQL Skills for Data Scientists in 2021
Enjoy reading this one?
Please find me on LinkedIn and YouTube.
Also, check my other posts on Artificial Intelligence and Machine Learning.