
Smartphone for Activity Recognition (Part 2)

Handling the curse of dimensionality

Data Science in R

Photo by Grace Madeline on Unsplash
Table of Contents
· Problem Statement 1
· Library
· Dataset
· Data Cleaning
· Principal Component Analysis
· Cross-Validation
· Modeling
· Problem Statement 2
· Clustering
· Conclusion

Before reading this article, you’re encouraged to read the previous one to get some context for what we are going to do. This article examines two problems: dimensionality reduction and clustering. Enjoy!


Problem Statement 1

In the previous article, we performed classification on the Human Activity Recognition dataset. We know that this dataset has many features (561 to be exact), and some of them are strongly correlated with each other. The Random Forest model can classify human activities with accuracy as high as 94% on this dataset. However, it takes forever to do so. The second candidate is the k-NN model with 89% accuracy, followed by the Decision Tree model with 86% accuracy and the Naive Bayes model with 73% accuracy. This makes us think,

Can we trade some accuracy for computation time? Or can we actually improve some models’ performance even further?

The idea is to reduce the dimension of the dataset just enough so that most of the information is still retained. Of course, this may also reduce the performance of the model, but the good news is that the model can run much faster. We actually performed dimensionality reduction in the previous article using t-SNE, in order to create a visualization showing the relative position of each observation from the higher-dimensional space. However, t-SNE is a non-deterministic algorithm, so we varied the perplexity across three different values to make sure that the result was not due to random chance. Also, t-SNE only tries to preserve the local structure of the data and cannot preserve its variance.

In this article, we will introduce a new algorithm called Principal Component Analysis (PCA) as a substitute for t-SNE for dimensionality reduction. PCA is a deterministic algorithm, it preserves the global structure of the data, and with PCA we can decide how much variance to preserve using its eigenvalues.
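
To make that last point concrete, here is a minimal sketch of how the eigenvalues let us decide how many components to keep. The matrix `m` is just a placeholder for any numeric feature matrix; we will run the real PCA on our dataset shortly.

pc <- prcomp(m, scale. = TRUE)                      # PCA on a generic numeric matrix m
var_explained <- pc$sdev^2 / sum(pc$sdev^2)         # eigenvalues as proportions of total variance
n_pc <- which(cumsum(var_explained) >= 0.90)[1]     # smallest number of PCs covering 90% of the variance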

Library

As usual, we will use R. Load the needed libraries.

library(tidymodels)     # collection of best packages
library(caret)          # Machine Learning functions
library(MLmetrics)      # machine learning metrics
library(e1071)          # naive bayes
library(rpart)          # decision tree
library(randomForest)   # random forest
library(class)          # kNN
library(FactoMineR)     # PCA algorithm
library(factoextra)     # PCA and clustering visualization

Dataset

Let’s read the dataset. Detailed explanations about the dataset can be found in the previous article.

uci_har <- read.csv("UCI HAR.csv")
dim(uci_har)
#> [1] 10299   563

As can be seen above, this dataset contains 563 columns (561 features plus the subject and Activity columns) across 10299 observations. That’s a lot!

Data Cleaning

Convert each feature to its appropriate type.

uci_har <- uci_har %>% 
  mutate_at(c('subject', 'Activity'), as.factor) %>% 
  mutate_at(vars(-subject, -Activity), as.numeric)

lvl <- levels(uci_har$Activity)
lvl
#> [1] "LAYING"             "SITTING"            "STANDING"           "WALKING"            "WALKING_DOWNSTAIRS" "WALKING_UPSTAIRS"

There are six activities as the target variable: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING. We won’t do any further data cleaning since it’s known in the previous article that the dataset can be considered clean.

Principal Component Analysis

The PCA algorithm summarizes the information (variance) of the initial features using new dimensions called Principal Components (PCs). Our dataset is well suited to PCA because it has many features, many of which are correlated with each other. So let’s perform PCA right away. The barplot below shows that most of the variance of the dataset can be explained by the first few principal components. For the purpose of this analysis, we will only take 10 principal components.

principal_component <- prcomp(x = uci_har %>% select(-c(subject, Activity)), scale. = TRUE, rank. = 10)
plot(principal_component)
Image by author

Below are the resulting principal components and their summary.

head(as.data.frame(principal_component$x))
#>        PC1       PC2        PC3        PC4      PC5        PC6        PC7         PC8          PC9        PC10
#> 1 16.38018 -1.994986 -3.4155244  0.6498264 7.824682 -2.7718293  2.2981737 -5.22746872 -1.335465100  3.75994762
#> 2 15.58142 -1.182536  0.3211912 -2.7479498 4.729307 -1.5887459 -0.3340349 -1.62109884 -0.006348803 -0.07202641
#> 3 15.42324 -2.243058  1.2377235 -4.0026866 4.402521 -1.0350383 -0.1297623 -1.27912309  0.190738822  0.78085331
#> 4 15.64705 -3.762700  1.2752210 -2.8065268 3.238953 -0.7434877  0.3260538 -1.74289880  0.912173701  1.59466696
#> 5 15.84155 -4.438682  1.8081439 -3.1603532 3.331010 -0.9115065 -0.8618895 -0.09012166  0.521603541 -1.01580251
#> 6 15.64401 -4.619950  2.1287441 -2.7722016 2.462343 -0.8805438 -1.1847999  1.40402104  0.692842819 -1.48599448
summary(principal_component)
#> Importance of first k=10 (out of 561) components:
#>                            PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10
#> Standard deviation     16.8713 5.91623 3.88655 3.70953 3.25529 3.02525 2.81701 2.61208 2.35101 2.30763
#> Proportion of Variance  0.5074 0.06239 0.02693 0.02453 0.01889 0.01631 0.01415 0.01216 0.00985 0.00949
#> Cumulative Proportion   0.5074 0.56977 0.59670 0.62123 0.64012 0.65643 0.67058 0.68274 0.69259 0.70208

We can see that these 10 principal components capture 70% of the original dataset’s variance. Note that the principal components are orthogonal (perpendicular) to each other and hence uncorrelated with one another, as shown below.

GGally::ggcorr(data = principal_component$x, label = TRUE)
Image by author

Lastly, keep in mind that each of these principal components is a combination of information from every feature in the original dataset. As a result, they are not directly interpretable. However, we can approximately reconstruct the original features using the loadings (eigenvectors), even though the result is not exactly the same as the original dataset.
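
As a rough sketch of that reconstruction, using the `principal_component` object created by prcomp() above: multiply the scores by the transposed rotation matrix, then undo the scaling and centering. Since we kept only 10 components, the result is only an approximation of the original features.

approx_scaled   <- principal_component$x %*% t(principal_component$rotation)   # back to the scaled feature space
approx_original <- sweep(approx_scaled,   2, principal_component$scale,  "*")  # undo the scaling
approx_original <- sweep(approx_original, 2, principal_component$center, "+")  # undo the centering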

We can take the first two principal components and plot them to gain some insight. Let’s perform the PCA once again, this time with FactoMineR’s PCA() function so we can use its plotting utilities.

uci_har_pca <- PCA(X = uci_har %>% select(-subject),
                   scale.unit = TRUE,
                   quali.sup = 562,
                   graph = FALSE,
                   ncp = 10)

head(uci_har_pca$ind$coord)
#>       Dim.1    Dim.2      Dim.3     Dim.4    Dim.5     Dim.6      Dim.7       Dim.8        Dim.9     Dim.10
#> 1 -16.38098 1.995083  3.4156902  0.649858 7.825062 2.7719639  2.2982853  5.22772253  1.335529939 -3.7601302
#> 2 -15.58217 1.182594 -0.3212068 -2.748083 4.729536 1.5888231 -0.3340511  1.62117755  0.006349111  0.0720299
#> 3 -15.42399 2.243166 -1.2377836 -4.002881 4.402735 1.0350886 -0.1297686  1.27918520 -0.190748083 -0.7808912
#> 4 -15.64781 3.762882 -1.2752829 -2.806663 3.239111 0.7435238  0.3260696  1.74298342 -0.912217989 -1.5947444
#> 5 -15.84232 4.438897 -1.8082316 -3.160507 3.331172 0.9115508 -0.8619313  0.09012604 -0.521628866  1.0158518
#> 6 -15.64477 4.620174 -2.1288474 -2.772336 2.462463 0.8805866 -1.1848574 -1.40408921 -0.692876458  1.4860666

Then, pick the first two principal components and plot.

plot.PCA(x = uci_har_pca,
         choix = "ind",
         invisible = "quali",
         select = "contrib 10",
         habillage = "Activity")
Image by author

There are several things to unpack here:

  1. Dim 1 explains 51% of the variance of the original dataset, whereas Dim 2 explains 6% of it.
  2. Dim 1 clearly separates stationary activities (LAYING, SITTING, and STANDING) from moving activities (WALKING, WALKING_DOWNSTAIRS, and WALKING_UPSTAIRS): Dim 1 below 0 indicates stationary activities and Dim 1 above 0 indicates moving activities.
  3. There are several outliers for the WALKING_DOWNSTAIRS category. Perhaps some people really do have different styles of walking downstairs?
  4. SITTING and STANDING activities are stacked on top of each other, making them hard to classify.

Next, we can see that many body-related features contribute to Dim 1, and some additional gravity-related features contribute to Dim 2.

dim <- dimdesc(uci_har_pca)
head(as.data.frame(dim$Dim.1$quanti))
#>                     correlation p.value
#> fBodyAccsma           0.9889282       0
#> fBodyAccJerksma       0.9882851       0
#> fBodyGyrosma          0.9878669       0
#> tBodyAccJerksma       0.9877968       0
#> tBodyAccJerkMagsma    0.9868902       0
#> tBodyAccJerkMagmean   0.9868902       0
head(as.data.frame(dim$Dim.2$quanti))
#>                        correlation p.value
#> fBodyAccmeanFreqZ        0.7359410       0
#> tBodyGyroMagarCoeff1     0.7133977       0
#> fBodyAccMagmeanFreq      0.7080972       0
#> tGravityAccarCoeffZ1     0.7075568       0
#> tGravityAccMagarCoeff1   0.7068318       0
#> tBodyAccMagarCoeff1      0.7068318       0

Cross-Validation

As before, we do 5-fold cross-validation by splitting the dataset such that the subjects in the train and test datasets don’t intersect.

set.seed(2072) # for reproducibility
subject_id <- unique(uci_har$subject)
folds <- sample(1:5, 30, replace = TRUE)
d <- data.frame(col1 = c(subject_id), col2 = c(folds))
uci_har$folds <- d$col2[match(uci_har$subject, d$col1)]
uci_har <- uci_har %>% 
  mutate(folds = as.factor(folds)) %>% 
  select(-subject)

Please note that after creating folds for each observation, we discarded the subject feature as it’s no longer needed in the analysis. Lastly, we can also see below that the data is distributed fairly evenly among the folds, so there is no serious class imbalance.

ggplot(uci_har %>%
         group_by(folds, Activity) %>%
         count(name = 'activity_count'),
       aes(x = folds, y = activity_count, fill = Activity)) +
  geom_bar(stat = 'identity')
Image by author

Modeling

Now, execute the modeling step using 5-fold cross-validation. The function below is edited from the crossvalidate function in the previous article. The only differences are that this time we:

  1. remove any feature that has near-zero variance using step_nzv function,
  2. center and scale the data (since this is necessary for PCA) using step_center and step_scale functions, and
  3. incorporate PCA using step_pca function.

All four functions mentioned above come from the tidymodels ecosystem (specifically its recipes package). The recipe is fitted on the training dataset, and the test dataset is then transformed using the parameters learned during fitting. We will keep as many principal components as are needed to explain at least 90% of the cumulative variance of the dataset.

crossvalidate <- function(data, k, model_name) {
  # 'data' is the training set with the 'folds' column
  # 'k' is the number of folds we have
  # 'model_name' is a string describing the model being used

  # initialize empty lists for recording performances
  acc_train <- c()
  acc_test <- c()
  y_preds <- c()
  y_tests <- c()
  models <- c()

  # one iteration per fold
  for (fold in 1:k) {

    # create training set for this iteration
    # subset all the datapoints where folds does not match the current fold
    train <- data %>% filter(folds != fold)
    y_train <- train$Activity

    # create test set for this iteration
    # subset all the datapoints where folds matches the current fold
    test <- data %>% filter(folds == fold)
    y_test <- test$Activity

    # create PCA pipeline
    rec <- recipe(Activity~., data = train) %>%
       step_rm(folds) %>% 
       step_nzv(all_predictors()) %>%
       step_center(all_numeric()) %>%
       step_scale(all_numeric()) %>%
       step_pca(all_numeric(), threshold = 0.90) %>%
       prep()

    # perform PCA pipeline
    X_train_dt <- juice(rec)
    X_test_dt <- bake(rec, test)
    X_train <- X_train_dt %>% select(-Activity)
    X_test <- X_test_dt %>% select(-Activity)

    print(glue::glue("Fold {fold} dataset reduced to {ncol(X_train)} dimensions."))

    # train & predict
    switch(model_name,
      nb = {
        model <- naiveBayes(x = X_train, y = y_train, laplace = 1)
        y_pred <- predict(model, X_test, type = 'class')
        y_pred_train <- predict(model, X_train, type = 'class')
        models <- c(models, list(model))
      },
      dt = {
        model <- rpart(formula = Activity ~ ., data = X_train_dt, method = 'class')
        y_pred <- predict(model, X_test_dt, type = 'class')
        y_pred_train <- predict(model, X_train_dt, type = 'class')
        models <- c(models, list(model))
      },
      knn = {
        k_nn <- round(sqrt(nrow(X_train)))  # rule of thumb: number of neighbors = sqrt(n)
        y_pred <- knn(train = X_train, test = X_test, cl = y_train, k = k_nn)
        y_pred_train <- knn(train = X_train, test = X_train, cl = y_train, k = k_nn)
      },
      rf = {
        model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test)
        y_pred <- model$test$predicted
        y_pred_train <- model$predicted
        models <- c(models, list(model))
      },
      {
        print("Model is not recognized. Try to input 'nb', 'dt', 'knn', or 'rf'.")
        return()
      }
    )

    # populate corresponding lists
    acc_train[fold] <- Accuracy(y_pred_train, y_train)
    acc_test[fold] <- Accuracy(y_pred, y_test)
    y_preds <- append(y_preds, y_pred)
    y_tests <- append(y_tests, y_test)
  }

  # convert back to factor
  y_preds <- factor(y_preds, labels = lvl)
  y_tests <- factor(y_tests, labels = lvl)

  # get the accuracy between the predicted and the observed
  cm <- confusionMatrix(y_preds, y_tests)
  cm_table <- cm$table
  acc <- cm$overall['Accuracy']

  # return the results
  if (model_name == 'knn') {
    return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test))
  } else {
    return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test, 'models' = models))
  }
}

Run the function on several models.

nb <- crossvalidate(uci_har, 5, 'nb')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("Naive Bayes Accuracy:", nb$acc)
#> Naive Bayes Accuracy: 0.8010486
dt <- crossvalidate(uci_har, 5, 'dt')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("Decision Tree Accuracy:", dt$acc)
#> Decision Tree Accuracy: 0.7174483
knn <- crossvalidate(uci_har, 5, 'knn')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("k-Nearest Neighbors Accuracy:", knn$acc)
#> k-Nearest Neighbors Accuracy: 0.8750364
rf <- crossvalidate(uci_har, 5, 'rf')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("Random Forest Accuracy:", rf$acc)
#> Random Forest Accuracy: 0.8714438

We are not going into details about analyzing these models since this was already done in the previous article. One thing to note here is that the PCA algorithm can capture 90% of the original dataset’s variance with only around 65 features. That’s roughly 88% fewer than the original 561 features! This comes with consequences, some good, others bad:

  1. Naive Bayes accuracy improves from 73% to 80%. This is because the features resulting from PCA are orthogonal in vector space and hence don’t correlate with one another, which fits the Naive Bayes independence assumption much better.
  2. Tree-based models such as Decision Tree and Random Forest get worse. Decision Tree accuracy decreased from 86% to 72% and Random Forest accuracy decreased from 94% to 87%. This may be because each Principal Component is a combination of information from all of the original features, which blurs the class boundaries and makes it harder for tree-based models to place axis-aligned decision boundaries between class distributions.
  3. The k-NN model isn’t affected that much. There’s only a minor decrease in accuracy from 89% to 88%, which is expected since we only capture 90% of the original dataset’s variance.

Problem Statement 2

Assuming we don’t know what the target variable looks like, let’s drop Activity entirely.

uci_har <- uci_har %>% select(-c(Activity, folds))

Now we are tasked to cluster each observation in the dataset into several categories. How do we do this? How many clusters do we make?

Clustering

Cluster analysis involves finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. We will use k-means clustering to categorize the observations. Basically, k-means is a centroid-based clustering algorithm that works as follows (a short code sketch follows the list).

  1. Random initialization: place k centroids randomly.
  2. Cluster assignment: assign each observation to the closest cluster, based on the distance to centroids.
  3. Centroid update: move centroids to the means of observations of the same cluster.
  4. Repeat steps 2 and 3 until convergence is reached.
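
Here is a minimal, illustrative sketch of those four steps. We will of course use R’s built-in kmeans() for the actual analysis; `simple_kmeans`, `X`, and `k` are just illustrative names, Euclidean distance is assumed, and edge cases such as empty clusters are ignored for brevity.

simple_kmeans <- function(X, k, max_iter = 100) {
  # 1. random initialization: pick k observations as the starting centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # 2. cluster assignment: squared Euclidean distance from every observation to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))  # n x k matrix
    new_cluster <- max.col(-d)               # index of the nearest centroid for each observation
    if (all(new_cluster == cluster)) break   # 4. stop once the assignments no longer change
    cluster <- new_cluster
    # 3. centroid update: mean of the observations assigned to each cluster
    centroids <- t(sapply(seq_len(k), function(j) colMeans(X[cluster == j, , drop = FALSE])))
  }
  list(cluster = cluster, centers = centroids)
}

Calling, say, simple_kmeans(scale(uci_har), k = 2) would mimic the kmeans() run we do below, up to the randomness of initialization.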

There are 2 hyperparameters that must be determined by the user when implementing k-means clustering: the number of centroids k, and the distance metric used (usually Euclidean distance or Manhattan distance). In this article, we will use the Euclidean distance, given by the formula

d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)

where x = (x₁, x₂, …, xₙ) and y = (y₁, y₂, …, yₙ) are observations with dimension n. Hence, we only need to determine k.
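
As a quick numerical illustration of this formula (x and y here are just made-up three-dimensional observations):

x <- c(1, 2, 3)
y <- c(4, 6, 3)
sqrt(sum((x - y)^2))   # 5, applying the formula directly
dist(rbind(x, y))      # 5, the same distance via base R's dist()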

Since cluster analysis involves finding groups of objects such that the objects in a group are similar to one another, a good clustering is one with a low Within Sum of Squares (WSS), that is, a low sum of squared distances from each observation to the centroid of the cluster it belongs to. Also, since objects in a group need to be different from the objects in other groups, a good clustering also has a large Between Sum of Squares (BSS). Here, BSS is just the sum of squared distances from each centroid to the global mean, weighted by the number of observations in each cluster. It’s easy to see that a higher k corresponds to a lower WSS. However, k has to be determined from the business point of view, or more objectively using what’s called the elbow method. Mathematically, WSS and BSS are formulated as

WSS = Σⱼ Σ_{x ∈ Cⱼ} ‖x − μⱼ‖²  and  BSS = Σⱼ nⱼ ‖μⱼ − μ‖²

where the sums run over the clusters j = 1, …, k, Cⱼ is the j-th cluster, μⱼ is its centroid, nⱼ is the number of observations assigned to it, and μ is the global mean of the data.
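
As a sanity check of these definitions, WSS and BSS can be recomputed by hand from a k-means result. The sketch below assumes `X` is the (scaled) data matrix and `km` is the output of kmeans() on it; the values should match km$tot.withinss and km$betweenss up to floating-point error.

wss <- sum(sapply(seq_len(nrow(km$centers)), function(j) {
  sum(sweep(X[km$cluster == j, , drop = FALSE], 2, km$centers[j, ])^2)        # squared distances to centroid j
}))
bss <- sum(table(km$cluster) * rowSums(sweep(km$centers, 2, colMeans(X))^2))  # cluster sizes times squared centroid-to-mean distances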

uci_har_scaled <- scale(uci_har)
fviz_nbclust(x = uci_har_scaled, FUNcluster = kmeans, method = "wss")
Image by author

The elbow method says that we should pick k at the point where increasing it further results in no significant decrease in WSS. Hence, from the plot above, we choose k = 2. Since k-means clustering is a non-deterministic algorithm (due to the random initialization of centroids), we will run k-means 5 times and then take the run with the lowest WSS.

set.seed(42) # for reproducibility
result <- c()
for (i in 1:5) {
  uci_har_km <- kmeans(x = uci_har_scaled, centers = 2)
  wss <- uci_har_km$tot.withinss
  bss <- uci_har_km$betweenss
  print(glue::glue("Trial {i} WSS: {wss} \t BSS: {bss}"))
  result <- c(result, list(uci_har_km))
}
#> Trial 1 WSS: 3272538.83119699   BSS: 2504639.16880301
#> Trial 2 WSS: 3272538.83119699   BSS: 2504639.16880301
#> Trial 3 WSS: 3272538.83119699   BSS: 2504639.16880301
#> Trial 4 WSS: 3272538.83119699   BSS: 2504639.16880301
#> Trial 5 WSS: 3272538.83119699   BSS: 2504639.16880301

Since BSS and WSS give the same number for each trial, the clustering process most probably gives the same result for all trials. We may pick any trial, so let’s just pick the fifth one. We can visualize this clustering as follows.

fviz_cluster(object = uci_har_km, data = uci_har_scaled, labelsize = 0)
Image by author

We can easily see that the two clusters most probably correspond to stationary activities (cluster 2) and moving activities (cluster 1), separated by Dim 1 at around zero.
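
To check this interpretation (stepping outside the “unknown target” assumption for a moment), we could have saved the Activity column before dropping it, say as a hypothetical `activity_labels` vector, and cross-tabulated it against the cluster assignments:

table(cluster = uci_har_km$cluster, activity = activity_labels)  # activity_labels is assumed to hold the saved Activity column

If the interpretation above holds, each moving activity should fall almost entirely into one cluster and each stationary activity into the other.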

Conclusion

PCA is a dimensionality reduction technique that is useful for working with high-dimensional data for two reasons: visualization and simplification. In this article, we saw that applying PCA to the Human Activity Recognition dataset before fitting the models gives different results: some models get better, some get worse, and some are not much affected. We also saw that clustering this dataset without knowing the target variable gives us a good separation between stationary and moving activities.

