
AutoWorkout: How to Improve Motion Activity Classifier Predictions?


Recently I started working on the AutoWorkout iOS/watchOS app. It automatically recognizes fitness exercises and reports the time spent on each exercise and the reps made (in a future release). The idea for this app naturally arose from my first attempts to recognize motion activity types with machine learning: collecting motion data, designing a neural network classifier, and then using it in an iOS app with the help of Core ML.

In this article, I intentionally skip the app implementation details and concentrate on making motion activity classifier predictions useful, as that turned out to be the most challenging problem.

No UI is the Best UI

Before importing the pre-trained models into AutoWorkout, I was pretty sure they would do a good job predicting fitness exercises. They actually did, but each in its own original way. Here I will take a small step aside and describe how I was going to recognize motion activity in the first place.

Since wrist wearables are capable of providing enough sensor data to predict human motion activity, and I already specialize in iOS/watchOS development, I selected the Apple Watch as the target device for my app. Because Core ML is supported on watchOS, I could run the classifiers directly on the Apple Watch, avoiding the costly and not always reliable communication between iOS and watchOS that would otherwise be needed to transfer sensor data and make predictions on the iPhone. The iOS app contained mainly UI presentation code to display the workout statistics, which were collected and calculated on the Apple Watch.

I’m a fan of embracing the "No UI is the best UI" approach, so the only UI elements in the watchOS app were the "Start" and "Stop" buttons. Just tap "Start", start your workout, do your exercises, and tap "Stop" at the end. All your exercises will be recognized, the time spent on each of them calculated, and the repetitions recorded.

Interface of the Apple Watch app
Workout statistics

Pre-Trained Neural Network Classifiers

To make this all possible, I needed to utilize pre-trained classifiers which make predictions based on sensor data every ~1.7 seconds. Noticed "classifiers"? Yes, I needed one for each source of sensor data – the X, Y, and Z accelerometer axes – 3 classifiers in total, each predicting 5 activities (jumping jacks, push-ups, squats, lunges, and pauses between exercises) in real time.
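To give a feel for what this looks like in code, here is a minimal sketch of how a single per-axis Core ML classifier could be invoked on a ~1.7-second window of accelerometer samples. The Exercise labels match the five activities above, but the input shape and the feature names ("window", "label") are assumptions made for illustration rather than the exact ones used in AutoWorkout:

```swift
import CoreML

// The five activities each classifier predicts.
enum Exercise: String {
    case jumpingJacks, pushUps, squats, lunges, pause
}

// A thin wrapper around one per-axis Core ML model (X, Y, or Z accelerometer axis).
struct AxisClassifier {
    let model: MLModel

    // Classify a ~1.7-second window of samples from a single accelerometer axis.
    func predict(window: [Double]) throws -> Exercise {
        // Pack the samples into an MLMultiArray with the shape the model expects.
        let input = try MLMultiArray(shape: [NSNumber(value: window.count)], dataType: .double)
        for (index, sample) in window.enumerated() {
            input[index] = NSNumber(value: sample)
        }
        // "window" and "label" are assumed input/output feature names.
        let features = try MLDictionaryFeatureProvider(dictionary: ["window": input])
        let output = try model.prediction(from: features)
        let label = output.featureValue(for: "label")?.stringValue ?? Exercise.pause.rawValue
        return Exercise(rawValue: label) ?? .pause
    }
}
```

With three such classifiers, one per axis, every ~1.7-second window yields three predictions that then need to be reconciled.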

After countless hours tuning their performance I ended up with acceptable accuracy:

  • X-axis classifier: 89.09%
  • Y-axis classifier: 85.01%
  • Z-axis classifier: 81.66%

Those numbers may not seem particularly impressive for a classification problem today, but taking into account the time and data collection constraints I had, I find them quite good.

N Classifiers, N Opinions

After I integrated the classifiers into the app and it was clear that they were able to predict fitness exercises, I had to check the quality of those predictions.

Ideally, the app should log, say, a push-up exercise only if all 3 classifiers predict that it is a push-up. In real life, I had less than 5% of cases where all 3 models predicted the same class. In nearly 50% of all cases, predictions from 2 of the 3 models matched. In the remaining 45% of cases, all 3 models predicted different classes.

It's not easy to make a decision when you have 3 sources of data and all of them tell you different things, right? This is a well-known issue in machine learning setups where more than a single classifier is present, and there are techniques to make use of predictions delivered by multiple sources.

Voting

In classification problems, voting is probably the simplest yet most powerful approach to derive a single class out of multiple predictions. The first question you ask yourself when implementing a voting algorithm is whether it will be majority voting or weighted voting.

In my case, it was clear that model X had the highest accuracy, but testing against the ground truth labels showed that cases where model X predicted class A (incorrectly) while models Y and Z both predicted class B (correctly) were not rare. Therefore, I couldn't just give model X the most influence and use weighted voting – it had to be a mix of weighted and majority voting to cover such cases.

Real-life testing also showed that if at least one model predicted the "Pause" class, the user very likely did pause at that moment.

All those observations gave me a simple set of rules to be considered in my voting algorithm:

  1. If at least one model predicts "Pause", log "Pause"
  2. If ≥ 2 models predict class "A", log "A"
  3. If all models predict different classes, use the prediction from model X

As you can see, rule #2 implements a majority voting approach, while rule #3 utilizes weighted voting by giving model X the right to be chosen in times of uncertainty.
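Here is a minimal sketch of those three rules in Swift, reusing the Exercise type from the earlier snippet; the convention that model X's prediction comes first in the array is an assumption I use for the tie-break:

```swift
// A minimal sketch of the voting rules above.
// `predictions` holds one prediction per model, with model X's prediction first.
func vote(_ predictions: [Exercise]) -> Exercise {
    // Rule 1: if at least one model predicts "Pause", log "Pause".
    if predictions.contains(.pause) {
        return .pause
    }
    // Rule 2: if two or more models agree on a class, log that class.
    var counts: [Exercise: Int] = [:]
    for prediction in predictions {
        counts[prediction, default: 0] += 1
    }
    if let (majorityClass, votes) = counts.max(by: { $0.value < $1.value }), votes >= 2 {
        return majorityClass
    }
    // Rule 3: all models disagree, so fall back to model X (the first element).
    return predictions.first ?? .pause
}
```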

After the voting algorithm was included in the classification pipeline, I was able to note, both subjectively and objectively, that the predictions became more correct and less confusing.

Outliers Are Everywhere

Nevertheless, the workout statistics visible to the end user were still not perfect: here and there, exercises which were not performed during the workout session were popping up.

It all felt like there was some room for improvement, so I started to compare the predictions made by the voting algorithm with the ground truth. When I knew that I did squats for 15 seconds, took a 20-second pause after that, and then did a straight 30 seconds of lunges, there should be no other classes predicted besides those 3 in the workout session.

Cells with a red background contain incorrect predictions. Rows with a black background and bold text are the latest exercises of each kind in a row.

In the example above, you can see that it's very unlikely that the user did a couple of lunge reps while doing squats the whole time (rows #6, 9, 11), or paused or did squats multiple times while doing lunges (rows #24, 26, 29). Row #21 tells us lunges were detected too early compared to the ground truth.

Of course, exceptions are possible, and people can do exercises in any order and sequence they want, which makes the problem of detecting outlying (and most likely incorrect) predictions hardly solvable in general.

I made an educated guess, however, and decided to design an algorithm that would detect and remove outlying predictions from the array of all predictions for a given workout session.

No Single Pattern to Detect

When talking about any human activity, it's tremendously hard to pick out patterns that repeat from user to user. All people are unique, and each of them does their workouts differently.

I had a very hard time deriving the rules for this outlying-prediction detection algorithm because the variability of what to consider an outlying prediction is really overwhelming. But having the ground truth, I could detect the most common outlier patterns and came up with the following rules:

  1. If the current prediction is the same as the next one, do nothing.
  2. If the current prediction is the same as the previous one, do nothing.
  3. If both the previous and next predictions are not equal to the current one, use the previous prediction.
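A minimal sketch of a single pass over a session's predictions could look like this, again reusing the Exercise type from earlier; the choice to take the previous prediction from the original rather than the already-corrected array is my own assumption:

```swift
// One pass of the outlier removal rules over a session's predictions.
// The first and last predictions have only one neighbour, so they are kept as-is.
func removeOutliers(_ predictions: [Exercise]) -> [Exercise] {
    guard predictions.count > 2 else { return predictions }
    var cleaned = predictions
    for index in 1..<(predictions.count - 1) {
        let current = predictions[index]
        let previous = predictions[index - 1]
        let next = predictions[index + 1]
        // Rules 1 & 2: keep the prediction if it matches either neighbour.
        if current == previous || current == next { continue }
        // Rule 3: both neighbours differ from the current prediction, use the previous one.
        cleaned[index] = previous
    }
    return cleaned
}
```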

See how it helped to eliminate the red spots in rows #6, 9, 21, 24, 26 after a single algorithm execution:

The outlier removal algorithm, however, introduced another red spot in row #10 and didn't solve the issue in row #11 either, so I decided to apply the same algorithm multiple times:

By experimenting with the exercise patterns I had, I found that 3 passes of the algorithm over the "raw" predictions produced by the voting algorithm are enough to remove the majority of outliers and keep the resulting predictions as close as possible to the ground truth.
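In code that boils down to simply running the same pass in a loop; the predictions below are made up just to show the call:

```swift
// Hypothetical voted predictions for one session, with outlying entries.
var cleaned: [Exercise] = [.squats, .squats, .lunges, .squats, .pause, .pause, .lunges, .lunges]

// Three passes proved to be enough in practice.
for _ in 0..<3 {
    cleaned = removeOutliers(cleaned)
}
// cleaned is now [.squats, .squats, .squats, .squats, .pause, .pause, .lunges, .lunges]
```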

To gain even more confidence in the algorithm, I conducted a test which compared the standard deviation from the ground truth of the predictions to which only the voting algorithm was applied against that of the predictions obtained after 3 passes of the outlier detection algorithm:

  • Standard deviation of "raw" predictions from ground truth: 15.41
  • Standard deviation of predictions obtained from the outlier detection algorithm from ground truth: 8.09

An almost 2x reduction in deviation from the ground truth, not bad!

Conclusion

Of course, there are corner cases where the outlier detection algorithm introduces distortions to the original data, making one or another exercise last a couple of seconds longer. Nevertheless, it does a good job of making the workout statistics look adequate in cases where the classifiers don't provide enough accuracy.

The voting algorithm is also an essential addition to the classification pipeline, as no single classifier can provide near-ideal accuracy in detecting fitness exercises.

Those two approaches allowed the AutoWorkout app to detect motion activity more accurately and provide a better user experience.

Follow me on Medium and Twitter to see how AutoWorkout app learned to count exercise reps using machine learning.

