
Computer Vision in Sports


Detecting puck possession in ice hockey using an LSTM. Classification of Puck Possession Events in Ice Hockey (91).

I was recently browsing CVPR’s website and came across its Computer Vision in Sports workshop. I think sports are an interesting application of many machine learning algorithms, since sports are generally very fast-paced and many involve a group dynamic. Therefore, algorithms tailored to sports may help push the boundary of what is possible in CV.

I have always been interested in applying machine learning to sports and recreational activities. In particular, at PaddleSoft, I have wanted to use CV algorithms to detect different strokes and maneuvers throughout rapids and to use that to predict whether paddlers would have a successful line or whether they would flip, swim…etc. But due to my limited knowledge, I did not get very far; in addition, until recently little literature existed about the subject. But I’m glad to see many new papers dealing with this kind of complex event recognition.

Here are some of the papers from the workshop that I found particularly interesting and that I believe address important problems within CV as a whole. In the interest of being concise, I have chosen not to go into too much implementation detail and instead focus on the broad themes presented in these works.

Learning To Score Olympic Events:

Authors: Paritosh Parmar and Brendan Tran Morris

Although at first glance scoring Olympic events might seem to be a niche area of research, using Computer Vision to score or provide feedback on an action or activity is useful to many different fields and problems. For instance, the authors state that methods similar to those presented in their paper could provide feedback to patients performing physical therapy on their own or people training for a competition without a coach. They also state that their algorithm could help remove some of the subjectivity from the judging of Olympic events.

Before jumping into the paper it is useful to clarify some terminology:

Action recognition: classifying the activity taking place in a video (e.g., the person is running).

Action quality assessment: assigning a numerical value based on how well the action was performed (e.g., the person’s running economy is 8.0).

Action quality assessment is a difficult problem because the differences between high-scoring and low-scoring actions are often very subtle, and the entire sequence (not just a segment) must be considered. Additionally, unlike action recognition datasets, action quality datasets are relatively scarce, and the ones that do exist are very small. Prior to this paper, only a few other authors had explored this problem.

The authors propose a multilevel approach to this task. At the first level, they use a 3D convnet (C3D) to extract features from clips of the video. These features are then passed to one of three possible "second levels": the first is a simple average of the clip-level features passed to an L-SVR; the second is an LSTM with a fully connected layer; and the third is an LSTM that extracts features and then passes them on to the L-SVR.
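To make the averaging option more concrete, here is a minimal sketch of how clip-level features could be collapsed into one video-level vector before regression. This is my own illustration, not the authors' code: the 16-frame clip length, the function names, and `toy_extractor` (a stand-in for the 3D convnet) are all assumptions.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Frame = Sequence[float]   # stand-in for an image; real frames would be pixel arrays
Feature = List[float]


def split_into_clips(frames: Sequence[Frame], clip_len: int = 16) -> List[Sequence[Frame]]:
    """Cut a video into consecutive, non-overlapping clips, dropping a short tail."""
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def video_feature(
    frames: Sequence[Frame],
    extract_clip_feature: Callable[[Sequence[Frame]], Feature],
    clip_len: int = 16,
) -> Feature:
    """Average the per-clip feature vectors into a single video-level vector."""
    clips = split_into_clips(frames, clip_len)
    if not clips:
        raise ValueError("video is shorter than one clip")
    clip_features = [extract_clip_feature(clip) for clip in clips]
    # zip(*...) groups the values of each feature dimension across all clips.
    return [fmean(dimension) for dimension in zip(*clip_features)]


def toy_extractor(clip: Sequence[Frame]) -> Feature:
    """Hypothetical placeholder for a 3D convnet: mean and max of each frame's first value."""
    values = [frame[0] for frame in clip]
    return [fmean(values), max(values)]


if __name__ == "__main__":
    fake_video = [[float(i)] for i in range(40)]    # 40 one-pixel "frames"
    print(video_feature(fake_video, toy_extractor))
```

The resulting vector has the same length for every video, which is what lets a regressor such as the SVR map it to a single score. It also hints at why this variant cannot localize mistakes: averaging discards where in the sequence each clip occurred.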

Action quality assessment model overview (taken from page 20 of Learning to Score Olympic Events)

The authors evaluate their model on diving, figure skating, and the vault in gymnastics. The diving dataset that they used is a small dataset originally from MIT's "Assessing the Quality of Actions" paper; it can be found [here](http://rtis.oit.unlv.edu/datasets.html). They also tested their model on the figure skating and gymnastics datasets. Another dataset that they tested on, called UNLV, can be found here. The results and the evaluation of performance are fairly complicated and cannot be summarized well without reading the full paper. However, the TL;DR is that C3D-SVR gave the best overall results but was unable to detect specific errors in the course of an action. This is fine if you are only interested in scoring, but to provide feedback you also need to identify the "problem" areas of the action. To compensate for this, they added the LSTM (i.e., C3D-LSTM-SVR), which increased the error between the predicted and actual scores but was able to detect specific "mistakes" by the individual performing the action.

Altogether, I think this paper is an important contribution to a seemingly under-researched field (I just wish that they had posted their code somewhere). It is puzzling that there has not been more research directed at action quality assessment, as it could directly aid any type of coaching or training.

Continuous Video to Simple Signals for Swimming Stroke Detection with Convolutional Neural Networks:

Authors: Brandon Victor, Zhen He, Stuart Morgan, Dino Miniutti

Action recognition, as mentioned previously, focuses on classifying a whole video as a single action. In contrast, event detection involves detecting the start and end frames of actions (in a continuous video) and then classifying them. This paper focuses on the detection of swimming strokes. Specifically, the authors propose a method for discrete event detection and then use it to determine when swimming strokes occur in video.

"Stroke rate is an important metric used in swimming training, and currently, experts spend a significant amount of time manually labelling each stroke in a video in order to provide statistical feedback to the swimmers. We call this task discrete event detection (distinct from, event detection; which is detecting the beginning and end of an action)." 1

Discrete event detection, in contrast to simple event detection, involves locating the exact frames at which an event occurs.
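Once strokes are pinned to exact frames, the stroke rate mentioned in the quote falls out of simple arithmetic. Here is a small sketch of that calculation; the function name and the frames-per-second value are my own assumptions, not something taken from the paper.

```python
from typing import Sequence


def stroke_rate_per_minute(stroke_frames: Sequence[int], fps: float) -> float:
    """Average strokes per minute, given the frame index of each detected stroke."""
    if len(stroke_frames) < 2:
        raise ValueError("need at least two strokes to measure a rate")
    ordered = sorted(stroke_frames)
    elapsed_seconds = (ordered[-1] - ordered[0]) / fps
    if elapsed_seconds <= 0:
        raise ValueError("strokes must occur on different frames")
    # Count the intervals between strokes, not the strokes themselves.
    intervals = len(ordered) - 1
    return 60.0 * intervals / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical detections: one stroke per second in a 30 fps video.
    print(stroke_rate_per_minute([0, 30, 60, 90], fps=30.0))
```

This is also why frame-level precision matters here: with only start and end frames of a long "swimming" segment, there would be nothing to count.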

Figure 2 from Continuous Video to Simple Signals for Swimming Stroke Detection with Convolutional Neural Networks

The authors use a CNN to detect these discrete events and map the video onto a 1D signal whose peaks denote the location of a swimming stroke (see their figure 1 if you are confused). Their CNN also performs very well at detecting both swimming and tennis strokes ("F-Score = 0.92 and 0.97, respectively, at 3-frame tolerance").
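To show what "peaks" and "3-frame tolerance" could mean in practice, here is a rough sketch that picks peaks from a 1D signal and scores them against labelled stroke frames. The threshold, the greedy one-to-one matching, and both function names are my own assumptions; the paper's exact evaluation protocol may differ.

```python
from typing import List, Sequence


def pick_peaks(signal: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Indices of local maxima at or above the threshold (first frame of a plateau)."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if signal[i] >= threshold and signal[i - 1] < signal[i] >= signal[i + 1]:
            peaks.append(i)
    return peaks


def f_score_with_tolerance(
    predicted: Sequence[int], actual: Sequence[int], tolerance: int = 3
) -> float:
    """F1 score where a prediction is correct if it lies within `tolerance`
    frames of a labelled event that has not been matched already."""
    unmatched = sorted(actual)
    true_positives = 0
    for frame in sorted(predicted):
        match = next((a for a in unmatched if abs(a - frame) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)   # each labelled event can be matched only once
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up network output, one value per frame.
    signal = [0.0, 0.2, 0.9, 0.3, 0.1, 0.6, 0.8, 0.4, 0.0, 0.3, 0.1]
    strokes = pick_peaks(signal)
    print(strokes, f_score_with_tolerance(strokes, actual=[3, 12]))
```

Matching each labelled stroke at most once keeps a burst of nearby predictions from all being counted as hits for the same stroke.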

This paper piqued my interest for its ability to detect sequences of strokes. I also liked its explanations of the different types of action detection and its introduction of the idea of discrete event detection. However, the most interesting part for me was how well it generalized to tennis. I would love to test it on identifying paddling strokes.

Hockey Action Recognition via Integrated Stacked Hourglass Network

http://openaccess.thecvf.com/content_cvpr_2017_workshops/w2/papers/Fani_Hockey_Action_Recognition_CVPR_2017_paper.pdf

Authors: Mehrnaz Fani (Shiraz University), Helmut Neher (University of Waterloo), David A. Clausi, Alexander Wong, and John Zelek (University of Waterloo)

As a hockey fan, I found this paper particularly fascinating. The authors tackle the problem of action recognition in ice hockey, arguing that it could provide valuable feedback.

"Action recognition provides a benefit to coaches, analysts and spectators by providing content for coaches and analysts to evaluate player performance" (pg. 29)

However, the authors note that there are many challenges specific to ice hockey (which could be generalized to other sports as well).

"Pose estimation and action recognition are challenging problems in hockey which can be scaled to other types of sports. Action recognition challenges specific to hockey include bulky clothing that deforms a player’s body-shape, a team’s jersey (white) that is highly similar to the background (the ice and boards…" (pg. 29)

The authors term their model ARHN, or Action Recognition Hourglass Network. The details of their model are fairly complex, as it uses many different components. But at the most basic level, their model works by converting a video clip to a sequence of images, estimating the player’s pose with the Stacked Hourglass Network (and outputting it as a feature), transforming it with a latent transformer, and then classifying the action. You can read the full details in their paper.

Figure 7 on page 35 of Hockey Action Recognition via Integrated Stacked Hourglass Network

They classify four distinct actions: straight skating, crossover skating, pre-shot, and post-shot. They achieve precision and recall scores generally in the upper-60s to mid-70s.
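For anyone unsure how per-action precision and recall are computed, here is a small sketch over the four action labels. The example labels and predictions are made up for illustration and are not the paper's data.

```python
from typing import Dict, Sequence, Tuple

ACTIONS = ("straight skating", "crossover skating", "pre-shot", "post-shot")


def precision_recall_per_class(
    y_true: Sequence[str], y_pred: Sequence[str], classes: Sequence[str] = ACTIONS
) -> Dict[str, Tuple[float, float]]:
    """Return {action: (precision, recall)}, using 0.0 when a ratio is undefined."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = {}
    for label in classes:
        true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)   # how often the model said `label`
        actual = sum(t == label for t in y_true)      # how often `label` really occurred
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


if __name__ == "__main__":
    # Hypothetical clips: the model confuses one pre-shot with a post-shot.
    truth = ["pre-shot", "pre-shot", "post-shot", "straight skating"]
    guess = ["pre-shot", "post-shot", "post-shot", "straight skating"]
    for action, (p, r) in precision_recall_per_class(truth, guess).items():
        print(f"{action}: precision={p:.2f} recall={r:.2f}")
```

Reporting both numbers per action matters with a small dataset, since a class with few examples can look fine on overall accuracy while being missed most of the time.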

Altogether, they achieved fairly good results considering their limitations with respect to data availability. It would be interesting to see how well this action recognition could pick up other events like backward skating, body checking, and passing. Also, as in the Olympic scoring paper, it would be valuable to grade the effectiveness of these plays, but that would likely require a more sophisticated variation of the action quality assessment techniques described there. Finally, as we will see in the following paper, hockey has a major team component that the authors did not really touch upon here.

Classification of Puck Possession Events in Ice Hockey.

Authors: Moumita Roy Tora, Jianhui Chen, Jim Litt

This paper attempts to tackle the problem of group activity recognition in sports. Group activity recognition, as the name implies, looks at multiple people (or in this case players) and determines what they are doing as a group. While the utility of group activity recognition in team sports is rather obvious, it could be useful outside of the sports world as well. One immediate use case is security (e.g., determining whether a group of people is robbing a place or attacking someone). More broadly, group activity recognition could be useful for almost any task that involves a lot of footage of multiple individuals interacting. As one might expect, the "key," so to speak, to group activity recognition is understanding the relationship between the actions of individuals.

Now, keeping these ideas in mind, let’s move on to the actual work. In this paper, the authors look at puck possession events in ice hockey. Specifically, they want to detect and classify different types of puck possession events. They mainly focus on puck dump-ins, dump-outs, loose puck recovery (LPR), shots, and passes. To do this, they use the fc7 layer of AlexNet to extract each player’s features from the player’s bounding-box sub-image (players are pre-identified with bounding boxes in their algorithm). These features are then max-pooled in order to account for player interactions. Finally, they are fed to an LSTM, which performs the prediction. They test several different configurations of features and LSTMs, which you can read more about in the paper.
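The max-pooling step is easy to picture with a tiny example. Below is a sketch of element-wise max-pooling over per-player feature vectors; the three-dimensional features are made-up stand-ins for the much larger fc7 vectors, and the function names are my own.

```python
from typing import List, Sequence


def max_pool_players(player_features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over all players in one frame.

    The output has the same length as a single player's feature vector,
    no matter how many players are visible.
    """
    if not player_features:
        raise ValueError("need at least one player")
    dim = len(player_features[0])
    if any(len(features) != dim for features in player_features):
        raise ValueError("all players must have feature vectors of the same length")
    return [max(column) for column in zip(*player_features)]


def pool_sequence(frames: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Pool every frame, giving the fixed-size sequence a recurrent model would consume."""
    return [max_pool_players(players) for players in frames]


if __name__ == "__main__":
    # Hypothetical clip: two players in the first frame, three in the second.
    clip = [
        [[0.1, 0.9, 0.3], [0.4, 0.2, 0.8]],
        [[0.5, 0.1, 0.0], [0.2, 0.7, 0.6], [0.3, 0.3, 0.9]],
    ]
    print(pool_sequence(clip))
```

Because the maximum does not depend on the order or number of players, every frame yields a vector of the same size, which is what makes it possible to feed a varying number of players into an LSTM.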

I really enjoyed this paper because it attempts to tackle a very difficult task that involves multiple player interactions. As the authors mention, I think getting good features to describe the relationship between players is crucial for good performance. As usual, I would really like to see code or a supplement that describes their implementation in more detail, as recreating their setup would be very difficult (at least for me) based purely on the information in the paper.

There were a number of other good papers from the workshop that I didn’t mention here simply because of space, but I encourage you to check out the workshop’s official website. I mainly chose these four because they discussed relatively distinct challenges across different areas of CV in sports: action quality assessment, pose estimation, discrete event detection, and group action detection. I think these papers all showed ways in which advances in seemingly specialized sports problems have applications to the field as a whole. They also inspired me to revisit my goal of applying CV to whitewater boating.

Other resources on the application of computer vision to sports

Computer Vision in Sports CVPR

Computer Vision in Sports: Published by Springer

Domain specific activity recognition in Tennis

Action quality assessment MIT

