Understanding the Backbone of Video Classification: The I3D Architecture

Madeline Schiappa
Towards Data Science
4 min read · Jun 7, 2020


One of the key differences between a single image and a video is the temporal dimension: a video also carries information in how its frames change over time. This has led deep learning architectures to incorporate 3D processing, such as convolutions that span time as well as height and width, so that temporal information is captured alongside spatial information. This article summarizes the architectural changes from image models to video models through the I3D model.
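To make the difference between 2D and 3D processing concrete, here is a minimal NumPy sketch. It is not the I3D implementation; the toy clip size, the random kernels, and the helper `conv3d_valid` are all assumptions made for illustration. It only shows that a kernel with temporal extent 1 treats each frame independently, while a kernel spanning several frames mixes information across time.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive single-channel 3D cross-correlation, stride 1, no padding.

    video:  array of shape (T, H, W)    -> frames, height, width
    kernel: array of shape (kt, kh, kw) -> temporal and spatial extent
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                window = video[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(window * kernel)
    return out

# Hypothetical toy clip: 8 grayscale frames of 16x16 pixels.
rng = np.random.default_rng(0)
video = rng.random((8, 16, 16))

# A 2D kernel applied frame by frame is a 3D kernel with temporal extent 1:
# each output frame depends on exactly one input frame.
kernel_2d = rng.random((3, 3))
per_frame = conv3d_valid(video, kernel_2d[None, :, :])

# A 3D kernel with temporal extent 3 combines three consecutive frames,
# so each output value also reflects how the pixels change over time.
kernel_3d = rng.random((3, 3, 3))
spatiotemporal = conv3d_valid(video, kernel_3d)

print(per_frame.shape)       # (8, 14, 14)
print(spatiotemporal.shape)  # (6, 14, 14)
```

The temporal axis shrinks from 8 to 6 in the second case because the kernel needs three consecutive frames for every output step. In practice a deep learning framework's 3D convolution layer would replace the explicit loops.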

I3D

Figure 1. The training process for the two-stream I3D on the Kinetics dataset. Image by author, adapted from Carreira and Zisserman (2017) [1].
