AutoVideo: An Automated Video Action Recognition System

Identifying human actions such as brush hair, sit, and run with neural networks and automated machine learning

Henry Lai
Towards Data Science

--

Some commonly used actions that can be predicted by video action recognition pipelines using AutoVideo. Source: HMDB (CC BY 4.0)

In this article, I will show how to build a neural network that automatically identifies human actions. While this seems simple and trivial for a human, it is difficult for an artificial system. Video-based action recognition addresses this problem by identifying different actions from video clips. It is a crucial task for video understanding with broad applications in areas such as security (Meng, Pears, and Bailey 2007), healthcare (Gao et al. 2018), and behavior analysis (Poppe 2010). Practical applications of video action recognition include elderly behaviour monitoring to enhance assisted living, automated video surveillance systems, and much more.

https://github.com/datamllab/autovideo. Image by author.

The aim of this article is to provide a tutorial for our open-source project AutoVideo (GitHub), a comprehensive and easy-to-use toolkit for Video Action Recognition. In this tutorial, you will learn:

(1) how to train a neural network that can predict human actions;

(2) how to use auto-tuning tools to save you the effort of tuning hyperparameters;

(3) how to train a model on your customized datasets.

You will be able to use your trained model to identify actions in any new video clip, as in the following demo of detecting the Brush Hair action.

Predicting Action on a demo video (Brushing Hair). Image by author.

Overview

AutoVideo is a highly modular and extensible system that wraps data loading, data processing, and state-of-the-art action recognition models in a standard pipeline language. Practitioners can therefore easily add new modules for action recognition or beyond, such as other video understanding tasks. AutoVideo additionally introduces data-driven searchers that automatically tune models and hyperparameters to reduce human effort.

Figure 1. System overview. Each module in AutoVideo is wrapped as a primitive with some hyperparameters. A pipeline consists of a series of primitives from pre-processing to action recognition. AutoVideo is equipped with tuners to search models and hyperparameters. Image by author.

Getting Started

Let’s get started with our tutorial that will help you get familiar with the package and perform action recognition on custom/benchmark datasets.

Installation:

To install the package, make sure you have Python 3.6 and pip installed on your Linux/macOS system. First, install the following packages:

pip3 install torch
pip3 install torchvision

The AutoVideo package can then be installed simply using:

pip3 install autovideo

Preparing Datasets:

The datasets must follow the D3M format, which consists of a csv file and a media folder. The csv file should contain three columns specifying the instance indices, video file names, and labels. Here is a sample csv file:

d3mIndex,video,label
0,Brushing_my_waist_lenth_hair_brush_hair_u_nm_np1_ba_goo_0.avi,0
1,brushing_raychel_s_hair_brush_hair_u_cm_np2_ri_goo_2.avi,0
2,Haarek_mmen_brush_hair_h_cm_np1_fr_goo_0.avi,0
3,Haarek_mmen_brush_hair_h_cm_np1_fr_goo_1.avi,0
4,Prelinger_HabitPat1954_brush_hair_h_nm_np1_fr_med_26.avi,0
5,brushing_hair_2_brush_hair_h_nm_np1_ba_med_2.avi,0

The media folder should contain the video files. You may refer to our example hmdb6 dataset on Google Drive. We have also prepared hmdb51 and ucf101 on Google Drive for benchmarking. To try out the tutorial, download the hmdb6 dataset from here, then unzip it and place it in the datasets folder.

Conducting Experiments:

The interface is based upon Axolotl, our high-level abstraction of D3M. A minimal example of running an action recognition task on a sample video using pre-trained weights (trained on the sub-sampled hmdb6 dataset) is given below. Let’s get started!

Load Dataset:

Prepare your dataset in the D3M format as mentioned above and place it in the datasets folder. In this example, we have used hmdb6 (Google Drive), subsampled from HMDB-51 and containing only 6 classes. We use the utility function set_log_path() to set up the logging system for the experiment. The train_table_path is the path to the training csv file with video file names and label information, train_media_dir indicates the media folder containing the video files, and target_index specifies the index of the column containing the output labels.
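
Below is a minimal sketch of this step, modeled on examples/fit.py in the repository; the dataset paths and target index are placeholders for the hmdb6 layout, and the exact import location of set_log_path may differ in your installed version.

import pandas as pd

from autovideo.utils import set_log_path

# Set up logging for the experiment
set_log_path('log.txt')

# Paths assume hmdb6 has been unzipped into the datasets folder
train_table_path = 'datasets/hmdb6/train.csv'
train_media_dir = 'datasets/hmdb6/media'
target_index = 2  # index of the `label` column in the csv

# Load the training table (d3mIndex, video, label)
train_dataset = pd.read_csv(train_table_path)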

Preprocess:

After reading the train csv file, we next preprocess the dataset by extracting frames from the videos in train_media_dir and storing them in a frames directory. To achieve this, we use the extract_frames utility function from the autovideo package.
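
Here is a minimal sketch of that call; it assumes extract_frames accepts the media directory and the video file extension (check examples/fit.py in the repository for the exact signature):

from autovideo import extract_frames

# Decode every video under train_media_dir into per-frame images,
# which the pipeline reads from a `frames` directory during training
video_ext = 'avi'
extract_frames(train_media_dir, video_ext)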

Build Model:

We support 7 action recognition models in our package. In this example, we use TSN as the video recognition model. The build_pipeline() function returns a pipeline that wraps the end-to-end trainable model. Below is an example of building the pipeline.
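
The sketch below follows the config keys used in the repository's README example (algorithm, load_pretrained, epochs); treat the exact keys and values as indicative rather than exhaustive.

from autovideo import build_pipeline

# `algorithm` selects the action recognition primitive (TSN here);
# the other keys set training hyperparameters
config = {
    "algorithm": "tsn",
    "load_pretrained": False,
    "epochs": 5,
}
pipeline = build_pipeline(config)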

Tuning Hyperparameters:

In the above function, users can customize the configuration dictionary (config) passed to build_pipeline() to train different algorithms with different hyperparameters. We have specified some hyperparameter values (such as the learning rate and the number of epochs) to demonstrate the usage. The complete list of supported hyperparameters can be found here. Each model in AutoVideo is wrapped as a primitive, which contains some hyperparameters; an example for TSN is here. All of the hyperparameters can be specified when building the pipeline by passing them in the config dictionary above.

Train:

The model can be trained end-to-end using the configuration specified above. Here we use the fit function, which trains the model and returns the fitted pipeline/model; this can then be saved for future use. Here is an example of fitting the pipeline.
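
A minimal sketch of the fit step; the keyword arguments mirror examples/fit.py and may differ slightly across versions, and saving via torch.save is just one possible way to persist the fitted pipeline.

import torch

from autovideo import fit

# Train end-to-end; fit returns the training output and the fitted pipeline
_, fitted_pipeline = fit(
    train_dataset=train_dataset,
    train_media_dir=train_media_dir,
    target_index=target_index,
    pipeline=pipeline,
)

# Persist the fitted pipeline for later inference
torch.save(fitted_pipeline, 'fitted_pipeline.pth')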

The above function trains on the training set and validates on the validation set, saving the best-performing model (on the validation set) for the inference stage. The output of the fit function looks like this:

Confidence Scores:
[[0.0692555  0.06158188 0.02618745 0.05211503 0.10426781 0.68659234]
 ...
 [0.00917702 0.01555088 0.00744944 0.00688883 0.02226333 0.9386706 ]]

The output above shows the confidence scores for the videos in the validation set after the model has been trained on the training set.

Finally, you can find the complete example code for the above code snippets at examples/fit.py. To train the model using the fit() function, simply run:

python3 examples/fit.py

Inference:

We can load the trained model weights obtained with the fit() function to detect the action in a sample video or compute accuracy on the test set. A sample demo video along with pre-trained weights can be downloaded here. An example of detecting the action in the demo video is given below.
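
The following is a minimal sketch of running the fitted pipeline on a single demo video. The one-row table, the demo file names, and the produce() keyword arguments are illustrative assumptions; see the examples in the repository for the exact interface.

import pandas as pd
import torch

from autovideo import extract_frames, produce

# Load the fitted pipeline saved after training (or the pre-trained weights)
fitted_pipeline = torch.load('fitted_pipeline.pth')

# Build a one-row table in the D3M format for the demo video;
# the label value is a dummy and is ignored at inference time
demo_media_dir = 'demo'  # folder containing demo.avi
demo_dataset = pd.DataFrame({'d3mIndex': [0], 'video': ['demo.avi'], 'label': [0]})
extract_frames(demo_media_dir, 'avi')

# Produce confidence scores for each action class
predictions = produce(
    test_dataset=demo_dataset,
    test_media_dir=demo_media_dir,
    target_index=2,
    fitted_pipeline=fitted_pipeline,
)
print(predictions)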

Here is a GIF illustrating the output on the demo video:

Image by author.

Alternatively, we can use the trained model to compute accuracy on the test set. Here we use the produce() function from autovideo to get predictions on the test set and then compute the accuracy using the compute_accuracy_with_preds() utility function.

Here, test_table_path points to the test csv file containing the video file names along with their label information, and test_media_dir indicates the media directory containing the video files.
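
A minimal sketch of this step; the produce() and compute_accuracy_with_preds() argument names and import locations are assumptions based on examples/produce.py and may not match the exact interface.

import pandas as pd
import torch

from autovideo import extract_frames, produce
from autovideo.utils import compute_accuracy_with_preds

test_table_path = 'datasets/hmdb6/test.csv'
test_media_dir = 'datasets/hmdb6/media'
target_index = 2

# Read the test table and keep the ground-truth labels aside
test_dataset = pd.read_csv(test_table_path)
test_labels = test_dataset['label']

extract_frames(test_media_dir, 'avi')

# Load the fitted pipeline and produce predictions on the test set
fitted_pipeline = torch.load('fitted_pipeline.pth')
predictions = produce(
    test_dataset=test_dataset,
    test_media_dir=test_media_dir,
    target_index=target_index,
    fitted_pipeline=fitted_pipeline,
)

# Compare predicted labels against the ground truth
accuracy = compute_accuracy_with_preds(predictions['label'], test_labels)
print('Test accuracy:', accuracy)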

The complete code for computing accuracy on the test set can be found at examples/produce.py. You can run this example using:

python3 examples/produce.py

Searcher Module:

In addition to fitting models with specified configurations, users can also use tuners to automatically search for pipelines (i.e., models and hyperparameters) in a data-driven manner. Our current system supports two types of tuners: random search and Hyperopt. To use automated searching, you need to install ray[tune] and hyperopt with:

pip3 install 'ray[tune]' hyperopt

An example interface for running the searcher module is given below.

Load Data:

Let’s load the train and validation datasets by specifying the csv file paths in train_table_path and valid_table_path. The searcher trains the model on train_dataset with the set of hyperparameters drawn in each trial; the best configuration at the end of searching is the one that achieves the highest accuracy on valid_dataset.
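
A minimal sketch of loading both splits; the paths below are placeholders for the hmdb6 layout, with the test split reused as the validation set for illustration.

import pandas as pd

train_table_path = 'datasets/hmdb6/train.csv'
train_media_dir = 'datasets/hmdb6/media'
valid_table_path = 'datasets/hmdb6/test.csv'
valid_media_dir = 'datasets/hmdb6/media'
target_index = 2

# Load the train and validation tables (d3mIndex, video, label)
train_dataset = pd.read_csv(train_table_path)
valid_dataset = pd.read_csv(valid_table_path)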

Initialise Searcher Module:

Here, users can initialise the searcher module for the datasets specified above using RaySearcher. Users can also define a search space (both continuous and discrete spaces are supported) in addition to specifying the search algorithm (random search vs. Hyperopt). In the example below, we use Hyperopt search.
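
A minimal sketch of initialising the searcher and defining a search space with ray.tune; the constructor arguments and the particular search-space keys are modeled on examples/search.py and the documented hyperparameters, so treat them as indicative rather than exact.

from ray import tune

from autovideo.searcher import RaySearcher

# Initialise the searcher with the training and validation splits
searcher = RaySearcher(
    train_dataset=train_dataset,
    train_media_dir=train_media_dir,
    valid_dataset=valid_dataset,
    valid_media_dir=valid_media_dir,
)

# Searcher configuration: Hyperopt search over a handful of trials
search_config = {
    "searching_algorithm": "hyperopt",
    "num_samples": 10,
}

# Search space mixing discrete choices and continuous ranges
search_space = {
    "algorithm": tune.choice(["tsn"]),
    "learning_rate": tune.uniform(0.0001, 0.001),
    "momentum": tune.uniform(0.9, 0.99),
    "weight_decay": tune.uniform(5e-4, 1e-3),
    "num_segments": tune.choice([8, 16, 32]),
}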

Search:

The tuner then searches for the best hyperparameter combination within the search space to improve performance. Here is an example of searching with the specified configuration and returning the best set of hyperparameters.
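
Assuming the searcher exposes a search() method that takes the search space and the searcher configuration (as in examples/search.py), the call would look roughly like this:

# Run the search and return the hyperparameter configuration that
# performed best on the validation set
best_config = searcher.search(
    search_space=search_space,
    config=search_config,
)
print(best_config)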

The complete code for the Searcher can be found at examples/search.py, which can be run using the following command:

python3 examples/search.py

Summary:

To learn more about this project, check it out here. The team is actively developing more features, including a graphical user interface with visualisation tools and support for more video understanding tasks. The goal is to make video recognition accessible and easy for everyone. I hope you enjoyed reading this article.
