Data Science in R
(Part 1)

Table of Contents
· Library
· Dataset
· Data Cleaning
· Exploratory Data Analysis
· Cross-validation
· Metrics
· Modeling
∘ Naive Bayes
∘ Decision Tree
∘ k-Nearest Neighbors
∘ Random Forest
· Conclusion
A modern smartphone is equipped with sensors such as an accelerometer and gyroscope to give advanced capabilities and facilitate a better user experience. The accelerometer in a smartphone is used to detect the orientation of the phone. The gyroscope adds an additional dimension to the information supplied by the accelerometer by tracking rotation or twist.
There have been studies that utilize these sensors, for instance to estimate road surface roughness. What we're doing, however, is closer to this study by Harvard University. Concretely, this project builds a model that accurately predicts human activities (Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, or Laying) from smartphone measurements.
Library
We will use the R language. These are the libraries to be imported.
library(dplyr) # data wrangling
library(ggplot2) # visualization
library(Rtsne) # EDA
library(caret) # machine learning functions
library(MLmetrics) # machine learning metrics
library(e1071) # naive bayes
library(rpart) # decision tree
library(rattle) # tree visualization
library(class) # k-NN
library(randomForest) # random forest
Dataset
The dataset used to train the model was collected from 30 people performing different activities with a smartphone attached to their waists, recorded with the help of the smartphone's sensors. The experiment was video-recorded so the data could be labeled manually. To see more details, please refer to this link.
Let’s read the dataset.
uci_har <- read.csv("UCI HAR.csv")
dim(uci_har)
#> [1] 10299 563
As can be seen above, this dataset contains 10299 observations of 563 columns. That's a lot! For now, we don't need to fully understand what the measurements mean. In short, we have these:
- subject denotes the identifier of the subject who carried out the experiment. There are 30 unique ids, one for each of the 30 people.
- Activity represents the activity a subject was doing, one of: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING.
- A 561-feature vector with time and frequency domain variables.
Based on the problem statement, Activity will be our target feature.
Data Cleaning
First, convert the subject and Activity features into factor, and the others into numeric.
uci_har <- uci_har %>%
mutate_at(c('subject', 'Activity'), as.factor) %>%
mutate_at(vars(-subject, -Activity), as.numeric)
lvl <- levels(uci_har$Activity)
lvl
#> [1] "LAYING" "SITTING" "STANDING" "WALKING" "WALKING_DOWNSTAIRS" "WALKING_UPSTAIRS"
Let’s check if there are any duplicated observations or missing values.
cat("Number of duplicated rows:", sum(duplicated(uci_har)))
#> Number of duplicated rows: 0
cat("Number of missing values:", sum(is.na(uci_har)))
#> Number of missing values: 0
Great! Neither duplicates nor missing values exist. Now let's check for class imbalance.
ggplot(uci_har %>%
group_by(subject, Activity) %>%
count(name = 'activity_count'),
aes(x = subject, y = activity_count, fill = Activity)) +
geom_bar(stat = 'identity')

There is no significant difference in activity counts across subjects, so the target classes are roughly balanced.
Exploratory Data Analysis
We can categorize Activity into two groups: stationary activities (such as laying, sitting, or standing) and moving activities (such as walking, walking downstairs, or walking upstairs). Let's see the distribution of the mean of the triaxial magnitude of the body acceleration signals (phew, that's a mouthful! what I really mean is tBodyAccMagmean).
ggplot(uci_har,
aes(x = tBodyAccMagmean, group = Activity, fill = Activity)) +
geom_density(alpha = .5) +
annotate('text', x = -.8, y = 25, label = "Stationary activities") +
annotate('text', x = -.0, y = 5, label = "Moving activities")

ggplot(uci_har,
aes(y = Activity, x = tBodyAccMagmean, group = Activity, fill = Activity)) +
geom_boxplot(show.legend = FALSE)

We can see a clear distinction between the two groups:
- stationary activities have very small body movements compared to those of moving activities.
- if tBodyAccMagmean > -0.5, then the activity will probably be either walking, walking upstairs, or walking downstairs.
- if tBodyAccMagmean < -0.5, then the activity is most probably either laying, standing, or sitting (we check this cutoff numerically below).
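As a rough sanity check (the -0.5 cutoff is only eyeballed from the plots above), we can measure how often this simple rule agrees with the stationary/moving split:
# rough check of the eyeballed -0.5 cutoff on tBodyAccMagmean
moving <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS")
mean((uci_har$tBodyAccMagmean > -0.5) == (uci_har$Activity %in% moving))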
Now, there should also be a distinction between laying and the other activities in terms of phone orientation. While laying, unlike during the other activities, people tend to have the phone fairly horizontal at their waist. So, let's see whether this hypothesis holds by comparing the angle between each axis (X, Y, and Z) and the mean of the gravity acceleration signal on that axis (angleXgravityMean, angleYgravityMean, and angleZgravityMean).
for (coor in c('angleXgravityMean', 'angleYgravityMean', 'angleZgravityMean')) {
print(
ggplot(uci_har,
aes_string(y = 'Activity', x = coor, group = 'Activity', fill = 'Activity')) +
geom_boxplot(show.legend = FALSE)
)
}



It’s apparent that:
- phone orientation while laying is significantly different from phone orientation while doing other activities.
- if angleXgravityMean > 0, then the activity is most probably laying; otherwise it is one of the other activities.
- if angleYgravityMean < -0.25 or angleZgravityMean < -0.25, then the activity will probably be laying; otherwise one of the other activities.
- we can predict the laying activity with minimal error using only angleXgravityMean, or probably other gravity-related features on the X-axis (checked below).
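As before, a rough check (the cutoff of 0 is just read off the boxplots) of how well this single feature identifies laying:
# rough check: how often does angleXgravityMean > 0 coincide with LAYING?
mean((uci_har$angleXgravityMean > 0) == (uci_har$Activity == "LAYING"))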
Lastly, we can perform t-SNE on the dataset to reduce its dimensionality for visualization, in the hope that each Activity will be grouped into its own region. t-SNE is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. Basically, t-SNE gives us a feel or intuition of how the data is arranged in high-dimensional space. We won't dive deeper into t-SNE in this article. We will run t-SNE with perplexity values of 5, 10, and 20 as a sensitivity check, to make sure the low-dimensional embedding really does group observations by Activity.
for (perp in c(5, 10, 20)) {
tsne <- Rtsne(uci_har %>% select(-c(subject, Activity)), perplexity = perp)
tsne_plot <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2], Activity = uci_har$Activity)
print(
ggplot(tsne_plot) +
geom_point(aes(x = x, y = y, color = Activity))
)
}



As can be seen, all activities can easily be separated except for Standing and Sitting. This makes sense since Standing and Sitting don’t have much difference in terms of phone orientation.
Cross-validation
We cannot apply normal k-fold cross-validation to this problem. Recall that our objective is to predict human activity based on phone sensors. This means that when a new, unseen subject comes in, we don't know their behavior with the phone. If we use k-fold cross-validation with randomly chosen observations, there is a chance that the same subject appears in both the train and test datasets, which amounts to data leakage. To avoid this, during cross-validation we split the dataset such that the subjects in the train and test datasets don't intersect.
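As an aside, caret provides a helper for exactly this kind of subject-disjoint split; the sketch below should produce folds in the same spirit (shown only for reference, since we build the folds manually right after):
# aside (for reference only): caret's groupKFold keeps every subject within a single fold;
# it returns a list of training-row indices, one element per fold
grp_folds <- groupKFold(uci_har$subject, k = 5)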
set.seed(2072) # for reproducibility
subject_id <- unique(uci_har$subject)
folds <- sample(1:5, 30, replace = TRUE)
d <- data.frame(col1 = c(subject_id), col2 = c(folds))
uci_har$folds <- d$col2[match(uci_har$subject, d$col1)]
uci_har <- uci_har %>%
mutate(folds = as.factor(folds)) %>%
select(-subject)
Please note that after creating folds for each observation, we discarded the subject feature as it's no longer needed in the analysis. Lastly, we can also see below that the data is distributed evenly among the folds, so no fold is imbalanced.
ggplot(uci_har %>%
group_by(folds, Activity) %>%
count(name = 'activity_count'),
aes(x = folds, y = activity_count, fill = Activity)) +
geom_bar(stat = 'identity')

Metrics
We use accuracy to quantify the performance of our models for the following reasons:
- as per the problem statement, we are interested in predicting every class equally and accurately, without preferring one class above the others, so class-specific recall and precision metrics add little here.
- this problem is a multiclass classification, where accuracy is more interpretable than the ROC-AUC metric.
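For reference, multiclass accuracy is simply the proportion of predictions that match the true labels; a one-liner equivalent in spirit to MLmetrics::Accuracy, which we use later, would be:
# accuracy = share of predictions equal to the true labels (same idea as MLmetrics::Accuracy)
accuracy <- function(y_pred, y_true) mean(y_pred == y_true)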
Modeling
First, as a sanity check, let's look at the dimensions of the dataset and the range of values it contains.
dim(uci_har)
#> [1] 10299 563
max(apply(uci_har %>% select(-c(Activity, folds)), 1, max))
#> [1] 1
min(apply(uci_har %>% select(-c(Activity, folds)), 1, max))
#> [1] 0.795525
min(apply(uci_har %>% select(-c(Activity, folds)), 1, min))
#> [1] -1
max(apply(uci_har %>% select(-c(Activity, folds)), 1, min))
#> [1] -0.9960928
The values roughly span -1 to 1: the minima are close to -1, while the maxima range between about 0.8 and 1. We won't do any normalization, because this range is already small and, more importantly, we don't want to lose information about the correlation between features.
The dataset has 563 columns, two of which will be excluded during modeling: Activity (since it is the target variable) and folds (since it adds no predictive information).
We will build four models to tackle the problem statement: Naive Bayes, Decision Tree, k-Nearest Neighbors, and Random Forest. To simplify, below is a function to cross-validate all of them. In this function, we iterate over each of the cross-validation folds built before and, in each iteration:
- create X_train, y_train, X_test, and y_test, which are the predictor variables for training, the target variable for training, the predictor variables for testing, and the target variable for testing, respectively.
- build the model and predict the output as y_pred.
- calculate the model accuracy by comparing y_pred to y_test.
The accuracy results are then averaged across all folds, producing a single number for comparing the models.
Note that for the Random Forest model we also use cross-validation accuracy instead of the OOB error, so that the comparison with the other models is apples to apples. However, later we treat the Random Forest model separately to emphasize the importance of hyperparameter tuning.
crossvalidate <- function(data, k, model_name,
                          tuning = FALSE, mtry = NULL, nodesize = NULL) {
  # 'data' is the training set containing the 'folds' column
  # 'k' is the number of folds we have
  # 'model_name' is a string describing the model being used
  # 'tuning = TRUE' means we are doing hyperparameter tuning
  # 'mtry' and 'nodesize' are used only for Random Forest hyperparameter tuning

  # initialize empty vectors for recording performance
  acc_train <- c()
  acc_test <- c()
  y_preds <- c()
  y_tests <- c()
  models <- c()

  # one iteration per fold
  for (fold in 1:k) {
    # create the training set for this iteration:
    # all the datapoints whose fold does not match the current fold
    training_set <- data %>% filter(folds != fold)
    X_train <- training_set %>% select(-c(Activity, folds))
    y_train <- training_set$Activity

    # create the test set for this iteration:
    # all the datapoints whose fold matches the current fold
    testing_set <- data %>% filter(folds == fold)
    X_test <- testing_set %>% select(-c(Activity, folds))
    y_test <- testing_set$Activity

    # train & predict
    switch(model_name,
      nb = {
        model <- naiveBayes(x = X_train, y = y_train, laplace = 1)
        y_pred <- predict(model, X_test, type = 'class')
        y_pred_train <- predict(model, X_train, type = 'class')
      },
      dt = {
        model <- rpart(formula = Activity ~ ., data = training_set %>% select(-folds), method = 'class')
        y_pred <- predict(model, X_test, type = 'class')
        y_pred_train <- predict(model, X_train, type = 'class')
      },
      knn = {
        # rule-of-thumb number of neighbors, kept separate from the fold count 'k'
        k_nn <- round(sqrt(nrow(training_set)))
        y_pred <- knn(train = X_train, test = X_test, cl = y_train, k = k_nn)
        y_pred_train <- knn(train = X_train, test = X_train, cl = y_train, k = k_nn)
      },
      rf = {
        if (tuning == FALSE) {
          model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test)
        } else {
          model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test,
                                mtry = mtry, nodesize = nodesize)
        }
        y_pred <- model$test$predicted
        y_pred_train <- model$predicted
      },
      {
        print("Model is not recognized. Try to input 'nb', 'dt', 'knn', or 'rf'.")
        return()
      }
    )

    # populate the corresponding vectors
    acc_train[fold] <- Accuracy(y_pred_train, y_train)
    acc_test[fold] <- Accuracy(y_pred, y_test)
    y_preds <- append(y_preds, y_pred)
    y_tests <- append(y_tests, y_test)
    # k-NN (from the 'class' package) returns predictions only, so there is no model object to keep
    if (model_name != 'knn') models <- c(models, list(model))
  }

  # convert the pooled predictions back to factors with the original labels
  y_preds <- factor(y_preds, labels = lvl)
  y_tests <- factor(y_tests, labels = lvl)

  # get the accuracy between the predicted and the observed
  cm <- confusionMatrix(y_preds, y_tests)
  cm_table <- cm$table
  acc <- cm$overall['Accuracy']

  # return the results
  if (model_name == 'knn') {
    return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test))
  } else {
    return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test, 'models' = models))
  }
}
Now we are ready.
Naive Bayes
nb <- crossvalidate(uci_har, 5, 'nb')
cat("Naive Bayes Accuracy:", nb$acc)
#> Naive Bayes Accuracy: 0.7258957
The Naive Bayes model gives poor results, with only 73% accuracy. This is mainly due to the model's underlying assumption that the predictors are independent of one another, which is not the case in our dataset. For instance, the plot below shows high correlations between some predictors, which the model cannot capture.
set.seed(3)
col <- c(sample(names(uci_har), 6))
GGally::ggcorr(uci_har[, col], hjust = 1, layout.exp = 3, label = T)

Let’s see the confusion matrix below.
nb$cm
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS WALKING_UPSTAIRS
#> LAYING 1623 14 4 0 0 0
#> SITTING 286 1550 1188 0 0 0
#> STANDING 0 187 677 0 0 0
#> WALKING 2 0 1 1181 93 40
#> WALKING_DOWNSTAIRS 4 0 0 234 1065 124
#> WALKING_UPSTAIRS 29 26 36 307 248 1380
The Naive Bayes model still confuses several trivially different activities, such as laying and sitting (around 300 wrong predictions). It also struggles to differentiate walking upstairs from the stationary activities (around 90 wrong predictions).
Lastly, as can be seen from the train and test accuracies below, the model neither underfits nor overfits the train dataset (except for the fourth fold). Hence, we can't rely much on trading off bias and variance to improve the model's performance.
print(nb$acc_train)
#> [1] 0.7349193 0.7647131 0.7280039 0.7635290 0.7612536
print(nb$acc_test)
#> [1] 0.7475728 0.7491702 0.7276636 0.6863137 0.7169533
Decision Tree
dt <- crossvalidate(uci_har, 5, 'dt')
cat("Decision Tree Accuracy:", dt$acc)
#> Decision Tree Accuracy: 0.8599864
The Decision Tree model gives better results, with 86% accuracy. We can see why by recalling from the t-SNE result that our dataset is highly separable (except for the sitting and standing activities), a characteristic that a tree-based model can exploit. To give a sense of how the Decision Tree works, observe the five tree diagrams below, one for each cross-validation fold.
for (model in dt$models) {
fancyRpartPlot(model, sub = NULL)
}





One can easily see that tGravityAccminX or tGravityAccmeanX becomes a crucial feature on the first split after the root. If this feature is less than a certain threshold, the models can predict perfectly that the corresponding activity is laying, which accounts for 19% of all train dataset observations. This is consistent with our EDA result that the laying activity can be distinguished by observing a single gravity-related feature on the X-axis.
On the second split, based on a body acceleration signal, the models perfectly separate sitting and standing from the moving activities (sitting and standing account for 36%, and the moving activities for 45%, of the train dataset observations). This finding confirms our previous analysis that stationary and moving activities can be separated fairly easily.
Then, on the third split down to the leaves, the models separate sitting from standing, and the moving activities from one another, with some errors. This signifies that sitting and standing, as well as the moving activities, are quite hard for the models to distinguish. To see this clearly, here's the confusion matrix.
dt$cm
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS WALKING_UPSTAIRS
#> LAYING 1942 15 0 0 0 0
#> SITTING 2 1564 334 0 0 0
#> STANDING 0 196 1571 8 0 0
#> WALKING 0 0 1 1549 151 296
#> WALKING_DOWNSTAIRS 0 1 0 39 1149 166
#> WALKING_UPSTAIRS 0 1 0 126 106 1082
A small note from this table: unlike the Naive Bayes model, the Decision Tree model rarely mispredicts stationary activities as walking upstairs. In fact, only one such prediction is wrong here, compared to 91 for the Naive Bayes model.
Next, let's see which variables are the most important. The tables below once again confirm the importance of gravity-related features on the X-axis (tGravityAccminX and angleXgravityMean) to our models.
for (model in dt$models) {
var_imp <- varImp(model)
var_imp <- var_imp %>% slice_max(Overall, n = 10)
print(var_imp)
}
#> Overall
#> tGravityAccminX 1647.338
#> tGravityAccmeanX 1522.782
#> tGravityAccenergyX 1521.581
#> angleXgravityMean 1521.581
#> tGravityAccmaxX 1505.993
#> fBodyAccJerkbandsEnergy116 1374.982
#> fBodyAccJerkbandsEnergy124 1374.982
#> fBodyAccmadX 1374.982
#> fBodyAccmeanX 1374.982
#> tBodyAccJerkmadX 1374.178
#> Overall
#> angleXgravityMean 1494.691
#> tGravityAccenergyX 1494.691
#> tGravityAccmeanX 1494.691
#> tGravityAccminX 1494.691
#> tGravityAccmaxX 1483.883
#> fBodyAccJerkentropyX 1376.017
#> fBodyAccmadX 1376.017
#> fBodyAccmeanX 1376.017
#> tBodyAccJerkmadX 1376.017
#> tBodyAccJerkMagenergy 1376.017
#> Overall
#> angleXgravityMean 1504.419
#> tGravityAccenergyX 1504.419
#> tGravityAccmeanX 1504.419
#> tGravityAccminX 1504.419
#> tGravityAccmaxX 1488.823
#> fBodyAccJerkenergyX 1370.566
#> fBodyAccmadX 1370.566
#> fBodyAccmeanX 1370.566
#> tBodyAccJerkenergyX 1370.566
#> tBodyAccJerkstdX 1370.566
#> Overall
#> tGravityAccminX 1528.936
#> tGravityAccmeanX 1527.734
#> angleXgravityMean 1526.532
#> tGravityAccenergyX 1526.532
#> tGravityAccmaxX 1508.549
#> tBodyAccJerkenergyX 1387.769
#> tBodyAccJerkmadX 1387.769
#> tBodyAccJerkstdX 1387.769
#> tBodyAccmaxX 1387.769
#> tBodyAccstdX 1387.769
#> Overall
#> tGravityAccminX 1531.881
#> tGravityAccmeanX 1530.679
#> angleXgravityMean 1529.478
#> tGravityAccenergyX 1529.478
#> tGravityAccmaxX 1512.692
#> fBodyAccJerkbandsEnergy116 1379.583
#> fBodyAccJerkbandsEnergy124 1379.583
#> fBodyAccmadX 1379.583
#> fBodyAccmeanX 1379.583
#> tBodyAccJerkmadX 1378.776
Normally a Decision Tree model tends to overfit the train dataset because it can easily fit noise in the data. Luckily, that's not the case for us: as can be seen below, the accuracies on the train and test datasets are close, except for the fourth fold. Hence, tree pruning is not necessary.
print(dt$acc_train)
#> [1] 0.8897925 0.8959707 0.8811845 0.8897192 0.8691917
print(dt$acc_test)
#> [1] 0.8660194 0.8914177 0.8643096 0.7902098 0.8855037
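Had the gap between train and test accuracy been larger, pruning would be the usual remedy. Here is a hedged sketch (not applied in this analysis) of how one could prune one of the fitted trees using rpart's complexity parameter table:
# hedged sketch (not applied): prune the first fold's tree at the cp minimizing the cross-validated error
m <- dt$models[[1]]
best_cp <- m$cptable[which.min(m$cptable[, "xerror"]), "CP"]
pruned <- prune(m, cp = best_cp)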
k-Nearest Neighbors
Since k-NN is a distance-based model, the dataset usually has to be normalized prior to modeling so that the model treats each feature equally. In other words, if a feature has relatively large values compared to the others, it will dominate the choice of a datapoint's neighbors. In our case, however, we didn't normalize the dataset, for the reason already explained at the beginning of this Modeling section.
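For completeness, here is a hedged sketch of how one could min-max scale the predictors with caret had the feature ranges differed widely; we skip this step since our features already lie in roughly the same [-1, 1] range:
# hedged sketch (skipped here): rescale every predictor to [0, 1] before running k-NN
pp <- preProcess(uci_har %>% select(-c(Activity, folds)), method = "range")
uci_har_scaled <- predict(pp, uci_har %>% select(-c(Activity, folds)))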
knn <- crossvalidate(uci_har, 5, 'knn')
cat("k-Nearest Neighbors Accuracy:", knn$acc)
#> k-Nearest Neighbors Accuracy: 0.8925138
The k-NN model gives even better results than the Decision Tree model, with 89% accuracy. Again, this is because our dataset is highly separable, so the k-NN algorithm can easily group each activity.
knn$cm
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS WALKING_UPSTAIRS
#> LAYING 1926 13 0 0 0 0
#> SITTING 5 1369 245 0 0 0
#> STANDING 5 390 1659 0 0 0
#> WALKING 1 0 1 1658 98 90
#> WALKING_DOWNSTAIRS 0 0 0 35 1178 52
#> WALKING_UPSTAIRS 7 5 1 29 130 1402
However, the k-NN model is worse than the Decision Tree model at distinguishing between the sitting and standing activities. Also, some stationary activities are predicted as walking upstairs. On the other hand, more moving activities are predicted correctly than with the Decision Tree model.
By comparing the accuracies on the train and test datasets, we can see that the model is just right, with low bias and variance.
print(knn$acc_train)
#> [1] 0.9299672 0.9269841 0.9332196 0.9308184 0.9335673
print(knn$acc_test)
#> [1] 0.9169903 0.9184448 0.8781653 0.8516484 0.8958231
Random Forest
rf <- crossvalidate(uci_har, 5, 'rf')
cat("Random Forest Accuracy:", rf$acc)
#> Random Forest Accuracy: 0.9336829
The Random Forest model gives the best result so far, with a whopping 93% accuracy. In particular, Random Forest is almost always better than a single Decision Tree for the following reasons:
- Random Forest is an ensemble of many Decision Trees. Its prediction is based on the majority vote of those trees (illustrated below) and hence tends to reduce the error of a single Decision Tree's prediction.
- Random Forest performs bootstrap aggregation, which combines many weak learners into a strong learner and hence mitigates the overfitting problem of a single Decision Tree.
However, these advantages don't come without shortcomings: a Random Forest model is slower to train and harder to interpret than a Decision Tree.
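To make the majority-voting idea concrete, here is a rough illustration (a small forest trained on one training fold, separate from the benchmark above) showing the individual trees' votes next to the aggregated prediction:
# rough illustration of majority voting, separate from the benchmark above
set.seed(42)
train_small <- uci_har %>% filter(folds != 1) %>% select(-folds)
test_small <- uci_har %>% filter(folds == 1) %>% select(-folds)
rf_small <- randomForest(Activity ~ ., data = train_small, ntree = 25)
votes <- predict(rf_small, test_small[1:3, ], predict.all = TRUE)
votes$individual[1, ] # what each of the 25 trees predicts for the first test observation
votes$aggregate[1] # the majority vote, i.e. the forest's prediction for that observation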
Now let’s see the confusion matrix.
rf$cm
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS WALKING_UPSTAIRS
#> LAYING 1942 15 0 0 0 0
#> SITTING 0 1626 189 0 0 0
#> STANDING 0 135 1717 0 0 0
#> WALKING 0 0 0 1585 18 37
#> WALKING_DOWNSTAIRS 0 0 0 27 1293 54
#> WALKING_UPSTAIRS 2 1 0 110 95 1453
The Random Forest model still has a hard time telling sitting from standing, and distinguishing the moving activities from one another. Nevertheless, besides those errors, this model misclassifies only 18 other observations, a small portion of the total.
Based on the variable importance tables below, we see that the Random Forest model favors gravity-related features over body-related ones.
for (model in rf$models) {
var_imp <- varImp(model)
var_imp <- var_imp %>% slice_max(Overall, n = 10)
print(var_imp)
}
#> Overall
#> tGravityAccmeanX 233.5234
#> tGravityAccminX 212.9809
#> angleXgravityMean 190.1192
#> tGravityAccmaxX 185.7354
#> angleYgravityMean 168.6782
#> tGravityAccenergyX 158.2211
#> tGravityAccminY 152.9756
#> tGravityAccmaxY 149.8530
#> tGravityAccmeanY 128.6168
#> tGravityAccenergyY 115.0822
#> Overall
#> tGravityAccmeanX 215.90652
#> tGravityAccminX 199.06699
#> tGravityAccenergyX 187.14571
#> tGravityAccmaxX 174.64894
#> angleXgravityMean 170.14726
#> tGravityAccmaxY 148.36554
#> angleYgravityMean 147.43523
#> tGravityAccmeanY 136.34275
#> tGravityAccminY 132.14115
#> tGravityAccenergyY 83.03708
#> Overall
#> angleXgravityMean 211.0124
#> tGravityAccminX 193.4731
#> tGravityAccmaxX 183.6834
#> tGravityAccenergyX 178.2531
#> tGravityAccmaxY 175.3123
#> tGravityAccmeanX 170.4459
#> tGravityAccmeanY 166.4416
#> tGravityAccminY 164.2081
#> angleYgravityMean 159.2264
#> tGravityAccenergyY 113.4814
#> Overall
#> tGravityAccmaxX 214.2470
#> tGravityAccminX 201.6110
#> tGravityAccenergyX 198.3143
#> angleXgravityMean 191.6710
#> tGravityAccmeanY 185.8804
#> tGravityAccmeanX 182.7646
#> tGravityAccmaxY 179.4252
#> angleYgravityMean 172.8559
#> tGravityAccminY 171.5347
#> tGravityAccenergyY 102.5362
#> Overall
#> tGravityAccmeanX 208.67569
#> angleXgravityMean 202.43801
#> tGravityAccminX 192.91251
#> tGravityAccenergyX 185.74270
#> tGravityAccmaxX 158.31243
#> tGravityAccmaxY 148.26482
#> angleYgravityMean 145.74691
#> tGravityAccmeanY 142.97585
#> tGravityAccminY 126.27075
#> tGravityAccenergyY 95.61133
Lastly, from the train and test accuracies below, it's apparent that the model performs really well on both, even though for the third and fourth folds the test accuracies are still slightly below 91%. We already knew that the tendency to overfit should decrease when switching from a Decision Tree to a Random Forest (thanks to bagging and random feature selection). However, the generalization error will not go to zero: its variance approaches zero as more trees are added, but its bias does not.
print(rf$acc_train)
#> [1] 0.9815512 0.9805861 0.9837923 0.9865011 0.9842691
print(rf$acc_test)
#> [1] 0.9631068 0.9720247 0.9096990 0.8971029 0.9248157
We can further improve the model with hyperparameter tuning. It's expected that by pruning the trees we can trade some variance for bias and bring the overall error down. This is done by increasing the nodesize parameter of the Random Forest model. According to the R documentation, nodesize is the minimum size of terminal nodes; setting it larger causes smaller trees to be grown (and thus takes less time). The default value is 1 for classification, which tends to make the model overfit to noise in the data.
Besides nodesize, we will also tune mtry (the number of variables randomly sampled as candidates at each split, which defaults to floor(sqrt(561)) = 23 in our case). We will do a grid search, varying nodesize over c(3, 5, 7) and mtry over c(11, 16).
# establish a list of possible values for nodesize and mtry
nodesize <- c(3, 5, 7)
mtry <- c(11, 16)
# create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize)
# initialize empty vectors to store the results
rf_acc <- c()
rf_acc_train <- c()
rf_acc_test <- c()
# loop over the rows of hyper_grid
for (i in 1:nrow(hyper_grid)) {
  # cross-validate with this combination of hyperparameters
  rf_tuning <- crossvalidate(uci_har, 5, 'rf',
                             tuning = TRUE, mtry = hyper_grid$mtry[i], nodesize = hyper_grid$nodesize[i])
  # store the results
  rf_acc[i] <- rf_tuning$acc
  rf_acc_train <- c(rf_acc_train, list(rf_tuning$acc_train))
  rf_acc_test <- c(rf_acc_test, list(rf_tuning$acc_test))
}
# identify optimal set of hyperparameters based on accuracy
opt_i <- which.max(rf_acc)
print(hyper_grid[opt_i,])
#> mtry nodesize
#> 5 11 7
The best hyperparameters found are nodesize = 7 and mtry = 11. With these, the accuracy improves a little, towards 94%, as we can see below. Also, the test accuracies are more uniform across folds, all above 91%.
print(rf_acc[opt_i])
#> [1] 0.9370813
print(rf_acc_train[opt_i])
#> [[1]]
#> [1] 0.9815512 0.9788767 0.9820863 0.9849343 0.9831801
print(rf_acc_test[opt_i])
#> [[1]]
#> [1] 0.9684466 0.9687055 0.9182991 0.9130869 0.9154791
Conclusion
rbind("Naive Bayes" = nb$acc, "Decision Tree" = dt$acc, "k-Nearest Neighbors" = knn$acc, "Random Forest" = max(rf_acc))
#> Accuracy
#> Naive Bayes 0.7258957
#> Decision Tree 0.8599864
#> k-Nearest Neighbors 0.8925138
#> Random Forest 0.9370813
Based on the accuracy table above, Random Forest clearly wins as the best model. It recognizes human activities from smartphone sensor measurements with an outstanding 94% accuracy. On the other hand, Random Forest is slow to run, as it is an ensemble of 500 Decision Trees by default. Of course, we can also try simpler models such as One-vs-Rest Logistic Regression, or a boosting model that's fairly standard in the industry these days such as XGBoost or LightGBM, and compare the results.
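As a teaser, here is a hedged sketch of how one might fit such a boosting model with the classic xgboost interface (an assumption: the xgboost package is installed; it is not used elsewhere in this article). A fair comparison would of course reuse the subject-wise folds rather than the in-sample fit shown here:
library(xgboost) # assumption: the classic xgboost R interface
X <- as.matrix(uci_har %>% select(-c(Activity, folds)))
y <- as.integer(uci_har$Activity) - 1 # xgboost expects 0-based class labels
xgb <- xgboost(data = X, label = y, objective = "multi:softmax", num_class = length(lvl),
nrounds = 100, max_depth = 6, eta = 0.3, verbose = 0)
y_pred <- factor(lvl[predict(xgb, X) + 1], levels = lvl)
Accuracy(y_pred, uci_har$Activity) # in-sample only; use the subject-wise folds for a fair comparison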
