The Video Search Engine — My Journey Into Computer Vision

Creating video is easy, but who’s got time to watch it all? I propose a video search engine to find relevant moments (prototype included).

Rod Fuentes
Towards Data Science


Creating video content has been my lifelong hobby. I remember making stop-motion movies in middle school and graduating to 30-minute short films in high school and college. My ‘films’ are more family-oriented these days thanks to my kids, but I’m always pondering new projects.

As I reflect on projects I’d love to do, I keep returning to the same problem: recording footage is far easier than making sense of it. Just think about your phone’s camera roll. It’s likely filled with hundreds, if not thousands, of videos that are unedited, too long, and unwatchable.

This imbalance between creating and consuming video is part of the “quintessential modern problem”, resulting from cheap recording devices and even cheaper digital storage. Quickly summarized, the problem is that we can point 15 cameras from 15 different angles at the same sporting event for two hours and produce 30 hours of unedited video. But no human has 30 hours to watch that content.

What’s needed is a way to extract the interesting moments from video content.

Let’s consider a more practical example. Your local retailer likely records thousands of hours of video per month to identify shoplifters. Yet that video is unedited, lengthy, and terribly unwatchable despite the obvious business value. What’s needed is a “search engine” for video. A system that takes image or video “key moments” as search term inputs and outputs relevant video segments as search results.

This problem reminded me of my 2004 summer internship at one of our nation’s premier R&D labs, IBM T.J. Watson Research Center. There, I saw early applications of computer vision projects, such as drawing bounding boxes on cars entering and exiting a parking lot.

Bounding boxes identifying cars and persons

It’s been 15 years since my time at IBM, but I’ve never forgotten that magic feeling — watching the screen flicker with boxes around cars and people. Back then, such output was a breakthrough in machine learning and compute power. I can now explore whether a general-purpose video search engine can be built.

The Quest: A Video Search Engine

My objective is to build a prototype of the video search engine described above. The system will video record a table tennis match, extract the video clips when the ball is in play, and show just the relevant clips to a user. The first key result is to video record a ping pong match using a portable device and send the video to cloud storage for analysis. The second key result is to train an object detection model that finds a ping pong ball in play. The third key result is to output the extracted video clips to a web UI.

I imposed various limitations on this project to keep it within the scope of nights and weekends. For instance, I don’t intend to build a fully contained object detection system on a Raspberry Pi. Instead, I will use AWS to retrieve stored video clips, process them through the object detector, and return the results to a web UI. Real-time processing of the video is also beyond the scope of this project. That said, these limitations present exciting future opportunities for this project.

Recording Video on a Portable Device

So far, I’ve accomplished 70% of the first key result by video recording content using a Raspberry Pi and sending that video to AWS S3 for future analysis.

From the start, I imagined a Raspberry Pi (with the Pi Camera module) would be ideal for exploring portable video capture. I’ve learned that there are many options, including IP cameras, webcams, and more. But in hindsight I’m glad I chose the Raspberry Pi thanks to its form factor, well-documented code, and ardent community.

Once I had the Raspberry Pi booted, I configured an SSH environment so I could execute code from my laptop to capture images and video.
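
Before wiring anything to AWS, a quick sanity check helped. The short picamera snippet below is a minimal sketch of that check, grabbing a single still over SSH to confirm the camera module works; the resolution and output file name are arbitrary placeholders, not part of the final pipeline.

# Minimal sanity check for the Pi Camera module, run on the Pi over SSH.
# The resolution and output file name here are arbitrary placeholders.
from time import sleep
from picamera import PiCamera

camera = PiCamera()
camera.resolution = (1280, 720)
camera.start_preview()
sleep(2)                      # let the sensor warm up and adjust exposure
camera.capture('test.jpg')    # write a single still image to disk
camera.stop_preview()
camera.close()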

Then I had to send that video to AWS S3. I started with a simple design using Python: (1) open a video stream from the Pi Camera and (2) send one frame every two seconds to S3.

import imutils
from imutils.video import VideoStream
import boto3
import cv2
import datetime
import time
from decimal import Decimal
import uuid
import json
import pytz
from pytz import timezone
import os

# init interface to AWS S3 via Python
s3_client = boto3.client('s3',
                         aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
                         aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"])

# init where to send the captured frames
s3_bucket = "img-from-raspi-for-web"
s3_key_frames_root = "frames/"

# init video stream from Raspberry Pi Camera
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()
# vs = VideoStream(usePiCamera=True).start()
time.sleep(2.0)  # warm up the sensor

def convert_ts(ts, selected_timezone):
    # Converts a timestamp to the configured timezone. Returns a localized datetime object
    tz = timezone(selected_timezone)
    utc = pytz.utc
    utc_dt = utc.localize(datetime.datetime.utcfromtimestamp(ts))
    localized_dt = utc_dt.astimezone(tz)
    return localized_dt

# loop over frames from the video stream
while True:
    # grab the frame from the video stream and resize it
    frame = vs.read()
    approx_capture_ts = time.time()
    if frame is None:
        continue  # useful to skip blank frames from sensor

    frame = imutils.resize(frame, width=500)
    ret, jpeg = cv2.imencode('.jpg', frame)
    photo = jpeg.tobytes()
    # bytes are used in a later section to call the Amazon Rekognition API

    now_ts = time.time()
    now = convert_ts(now_ts, "US/Eastern")
    year = now.strftime("%Y")
    mon = now.strftime("%m")
    day = now.strftime("%d")
    hour = now.strftime("%H")

    # build s3_key using a UUID and other unique identifiers
    frame_id = str(uuid.uuid4())
    s3_key = (s3_key_frames_root + '{}/{}/{}/{}/{}.jpg').format(year, mon, day, hour, frame_id)

    # Store frame image in S3
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=s3_key,
        Body=photo
    )
    time.sleep(2.0)  # essentially, a frame rate of 0.5

Images started appearing in my S3 bucket, so I began designing the database for the project.

My design stores each image, its timestamp, and a prediction result in a NoSQL table. Later, I will query that database for the predictions and fetch the corresponding timestamps to trim the video into relevant clips.

For now, I set up a stub for the predictions, relying on the AWS Rekognition API to detect objects. Here is how I saved the data to a DynamoDB table:

# init interfaces to Rekognition and DynamoDB (the table name below is a placeholder)
rekog_client = boto3.client('rekognition')
ddb_table = boto3.resource('dynamodb').Table('frames-from-raspi')

def detect_labels_local_file(photo):
    response = rekog_client.detect_labels(Image={'Bytes': photo},
                                          MaxLabels=5,
                                          MinConfidence=60.0)
    for label in response['Labels']:
        print(label['Name'] + ' : ' + str(label['Confidence']))
    return response

### In the while loop ###
rekog_response = detect_labels_local_file(photo)

# Persist frame data in DynamoDB
ddb_item = {
    'frame_id': frame_id,
    'processed_timestamp': now_ts,  # beware of putting floats into DDB. Had to use json.loads as a workaround
    'approx_capture_timestamp': approx_capture_ts,
    'rekog_labels': rekog_response['Labels'],
    'rekog_orientation_correction':
        rekog_response['OrientationCorrection']
        if 'OrientationCorrection' in rekog_response else 'ROTATE_0',
    'processed_year_month': year + mon,  # To be used as a Hash Key for DynamoDB GSI
    's3_bucket': s3_bucket,
    's3_key': s3_key
}

ddb_data = json.loads(json.dumps(ddb_item), parse_float=Decimal)
ddb_table.put_item(Item=ddb_data)

Success! I have a NoSQL table that references an S3 image, a timestamp, and its corresponding prediction:

Items in DynamoDB table, showing Rekognition API results for captured frames from Pi Camera
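
To give a flavor of how I plan to use this table later, here is a rough sketch (not production code) of the query step: pull a month of frames via the processed_year_month GSI and keep the capture timestamps of frames whose Rekognition labels match a target. The table and index names are placeholders, and pagination is omitted.

import boto3
from boto3.dynamodb.conditions import Key

# Table and GSI names below are placeholders for illustration
ddb_table = boto3.resource('dynamodb').Table('frames-from-raspi')

def timestamps_with_label(year_month, target_label, min_confidence=60.0):
    """Return capture timestamps of frames whose Rekognition labels include target_label."""
    resp = ddb_table.query(
        IndexName='processed_year_month-index',  # GSI keyed on processed_year_month
        KeyConditionExpression=Key('processed_year_month').eq(year_month)
    )
    hits = []
    for item in resp['Items']:
        for label in item.get('rekog_labels', []):
            if label['Name'] == target_label and float(label['Confidence']) >= min_confidence:
                hits.append(float(item['approx_capture_timestamp']))
                break
    return sorted(hits)

# e.g. timestamps_with_label('201912', 'Person')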

Building a Machine Learning Model for Ping Pong Ball Detection

With the camera pointed at me, AWS Rekognition detected a ‘person’ with 99.13% confidence. But could Rekognition detect a ping pong ball in play to help me achieve my second key result?

Sadly, no. After testing many ping pong images, I found that Rekognition performs admirably with respect to detecting scenes — such as labeling images pertaining to “Ping Pong”. But when it comes to finding a ping pong ball, Rekognition did not perform well. In most cases, it did not distinguish the ball as an identifiable object at all. When it did find the ball, it labeled it as the “Moon” 😎

Rekognition results for an image of a ping pong match

Using the Rekognition API was convenient but limiting with respect to my project. Fortunately, Amazon offers the SageMaker object detection API if you want to customize your own model.

I began with a video feed of a table tennis match. Here are some sample frames:

Preparing & Labeling Data

My first task was to label video frames with the ping pong ball in play to build training, validation, and test data sets. The FFmpeg library was useful to convert the video into images that I could label:

# from Terminal after installing the ffmpeg library
# Get one frame to identify the area I care about
ffmpeg -i tt-video-1080p.mp4 -ss 00:00:59.000 -vframes 1 thumb.jpg

# I found an area I care about in a fixed-position video feed using a photo-editing program: a 512x512 image with a top-left corner at x:710, y:183
ffmpeg -i tt-video-1080p.mp4 -filter:v "crop=512:512:710:183" cropped.mp4

# output the video as a series of jpg images @ 10 frames per second
ffmpeg -i cropped.mp4 -vf fps=10 thumb%04d.jpg -hide_banner

The snippet above generated thousands of images on my machine. The next step was to add ‘bounding boxes’ to the ping pong balls in play.

There are many services that perform this arduous task for you, but I opted to personally label the images to get a deeper understanding and appreciation for computer vision. I turned to RectLabel, an image annotation tool to label images for bounding box object detection and segmentation:

I spent about four hours on this task, averaging 8.3 labels per minute, to get 2000 labeled images. It was mind-numbing work.

About half-way through my labeling work, I wondered whether tight or loose bounding boxes would be better to balance model accuracy versus model generalization given JPEG compression artifacts and motion blur on the ping pong ball. After phoning a friend, Paul Blankley, and consulting the interwebs, I learned that “bounding boxes are usually drawn tightly around every [object] in an image” because:

Without accurately drawn bounding boxes, an entire algorithm can be affected causing it to inaccurately identify [objects]. This is why quality checking and ensuring a high level of attention is paid to the accuracy of every bounding box, resulting in a strong AI engine.

If I had to do this project again, I would use a lossless image format (*.png) and draw tighter bounding boxes to improve my training data. Yet I recognize this optimization is not free. My average labeling speed decreased by ~50% when I started labeling images with tighter bounding boxes.

Once I finished labeling images, RectLabel output the data in a JSON file suitable for computer vision tasks. Here’s a sample of the output:

{"images":[
{"id":1,"file_name":"thumb0462.png","width":0,"height":0},
{"id":2,"file_name":"thumb0463.png","width":0,"height":0},
# ...
{"id":4582,"file_name":"thumb6492.png","width":0,"height":0}],
"annotations":[
{"area":198,"iscrowd":0,"id":1,"image_id":1,"category_id":1,"segmentation":[[59,152,76,152,76,142,59,142]],"bbox":[59,142,18,11]},
{"area":221,"iscrowd":0,"id":2,"image_id":2,"category_id":1,"segmentation":[[83,155,99,155,99,143,83,143]],"bbox":[83,143,17,13]},
# ... {"area":361,"iscrowd":0,"id":4,"image_id":4582,"category_id":1,"segmentation":[[132,123,150,123,150,105,132,105]],"bbox":[132,105,19,19]},
"categories":[{"name":"pp_ball","id":1}]
}

Then I created a function to convert the COCO-style annotations into the per-image JSON files expected by Amazon SageMaker’s input channels (the split into train and validation folders comes next). Note this important tip from Ryo Kawamura if you’re following my code:

Though ‘category_id’ in the COCO JSON file starts from 1, ‘class_id’ in the Amazon SageMaker JSON file starts from 0.

import json
import os

def fixCategoryId(category_id):
    # COCO 'category_id' starts at 1; SageMaker 'class_id' starts at 0
    return category_id - 1

file_name = 'annotations.json'  # the COCO-style JSON exported by RectLabel (placeholder path)
os.makedirs('generated', exist_ok=True)

with open(file_name) as f:
    js = json.load(f)
    images = js['images']
    categories = js['categories']
    annotations = js['annotations']

for i in images:
    jsonFile = i['file_name']
    jsonFile = jsonFile.split('.')[0] + '.json'
    line = {}
    line['file'] = i['file_name']
    line['image_size'] = [{
        'width': int(i['width']),
        'height': int(i['height']),
        'depth': 3
    }]
    line['annotations'] = []
    line['categories'] = []
    for j in annotations:
        if j['image_id'] == i['id'] and len(j['bbox']) > 0:
            line['annotations'].append({
                'class_id': fixCategoryId(int(j['category_id'])),
                'top': int(j['bbox'][1]),
                'left': int(j['bbox'][0]),
                'width': int(j['bbox'][2]),
                'height': int(j['bbox'][3])
            })
            class_name = ''
            for k in categories:
                if int(j['category_id']) == k['id']:
                    class_name = str(k['name'])
            assert class_name != ''
            line['categories'].append({
                'class_id': fixCategoryId(int(j['category_id'])),
                'name': class_name
            })
    if line['annotations']:
        with open(os.path.join('generated', jsonFile), 'w') as p:
            json.dump(line, p)

jsons = os.listdir('generated')
print('There are {} images that have annotation files'.format(len(jsons)))

Next, I moved the files into an Amazon S3 bucket with four folders as required by the SageMaker object detection algorithm’s input channels: /train, /validation, /train_annotation, and /validation_annotation. I used a 70% train vs. validation split and shuffled the data:

import os
import shutil
import random

# make sure the four destination folders exist
for d in ['train', 'validation', 'train_annotation', 'validation_annotation']:
    os.makedirs(d, exist_ok=True)

num_annotated_files = len(jsons)
train_split_pct = 0.70
num_train_jsons = int(num_annotated_files * train_split_pct)
random.shuffle(jsons)  # randomize/shuffle the JSONs to reduce reliance on *sequenced* frames
train_jsons = jsons[:num_train_jsons]
val_jsons = jsons[num_train_jsons:]

# Moving training files to the training folders
for i in train_jsons:
    image_file = './images/' + i.split('.')[0] + '.png'
    shutil.move(image_file, './train/')
    shutil.move('./generated/' + i, './train_annotation/')

# Moving validation files to the validation folders
for i in val_jsons:
    image_file = './images/' + i.split('.')[0] + '.png'
    shutil.move(image_file, './validation/')
    shutil.move('./generated/' + i, './validation_annotation/')


### Upload to S3
import sagemaker
from sagemaker import get_execution_role

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

from sagemaker.amazon.amazon_estimator import get_image_uri
training_image = get_image_uri(sess.boto_region_name, 'object-detection', repo_version="latest")

bucket = 'pp-object-detection'  # custom bucket name.
# bucket = sess.default_bucket()
prefix = 'rect-label-test'

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'
train_annotation_channel = prefix + '/train_annotation'
validation_annotation_channel = prefix + '/validation_annotation'

sess.upload_data(path='train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='validation', bucket=bucket, key_prefix=validation_channel)
sess.upload_data(path='train_annotation', bucket=bucket, key_prefix=train_annotation_channel)
sess.upload_data(path='validation_annotation', bucket=bucket, key_prefix=validation_annotation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)
s3_train_annotation = 's3://{}/{}'.format(bucket, train_annotation_channel)
s3_validation_annotation = 's3://{}/{}'.format(bucket, validation_annotation_channel)

Training the Model

In the next step, I created a SageMaker object detector with certain hyperparameters, such as using ‘resnet-50’ as the base network, a single class (my ping pong ball), and images sized 512x512 pixels.

s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

od_model = sagemaker.estimator.Estimator(training_image,
                                         role,
                                         train_instance_count=1,
                                         train_instance_type='ml.p3.2xlarge',
                                         train_volume_size=50,
                                         train_max_run=360000,
                                         input_mode='File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

od_model.set_hyperparameters(base_network='resnet-50',
                             use_pretrained_model=0,
                             num_classes=1,
                             mini_batch_size=15,
                             epochs=30,
                             learning_rate=0.001,
                             lr_scheduler_step='10',
                             lr_scheduler_factor=0.1,
                             optimizer='sgd',
                             momentum=0.9,
                             weight_decay=0.0005,
                             overlap_threshold=0.5,
                             nms_threshold=0.45,
                             image_shape=512,
                             label_width=600,
                             num_training_samples=num_train_jsons)

I then set the train/validate location for the object-detector, called the .fit function, and deployed the model to an endpoint:

train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='image/png', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', content_type='image/png', s3_data_type='S3Prefix')
train_annotation = sagemaker.session.s3_input(s3_train_annotation, distribution='FullyReplicated', content_type='image/png', s3_data_type='S3Prefix')
validation_annotation = sagemaker.session.s3_input(s3_validation_annotation, distribution='FullyReplicated', content_type='image/png', s3_data_type='S3Prefix')

data_channels = {'train': train_data, 'validation': validation_data,
                 'train_annotation': train_annotation, 'validation_annotation': validation_annotation}

od_model.fit(inputs=data_channels, logs=True)

object_detector = od_model.deploy(initial_instance_count=1,
                                  instance_type='ml.m4.xlarge')

Given that I had only 2000 images, my Amazon box (ml.p3.2xlarge) took about 10 minutes to train the model. Deploying the endpoint often took even longer, and the anticipation of testing a model was agonizing!

Finally, the moment of truth. I invoked my model by passing it the bytes of a PNG file it had never seen:

file_with_path = 'test/thumb0695.png'
with open(file_with_path, 'rb') as image:
    f = image.read()
    b = bytearray(f)
    ne = open('n.txt', 'wb')
    ne.write(b)

results = object_detector.predict(b)
detections = json.loads(results)
print(detections)

I got output like this:

[1.0, 0.469, 0.566, 0.537, 0.605, 0.595]

Here’s how to interpret this output according to AWS SageMaker:

Each of these object arrays consists of a list of six numbers. The first number is the predicted class label. The second number is the associated confidence score for the detection. The last four numbers represent the bounding box coordinates [xmin, ymin, xmax, ymax]. These output bounding box corner indices are normalized by the overall image size. Note that this encoding is different than that use by the input .json format. For example, in the first entry of the detection result, 0.3088374733924866 is the left coordinate (x-coordinate of upper-left corner) of the bounding box as a ratio of the overall image width, 0.07030484080314636 is the top coordinate (y-coordinate of upper-left corner) of the bounding box as a ratio of the overall image height, 0.7110607028007507 is the right coordinate (x-coordinate of lower-right corner) of the bounding box as a ratio of the overall image width, and 0.9345266819000244 is the bottom coordinate (y-coordinate of lower-right corner) of the bounding box as a ratio of the overall image height.

Cool 😏
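
To make that concrete, here is a tiny worked example (assuming the 512x512 crop used earlier in this project) converting the single detection above into pixel coordinates:

# Decode the detection above into pixel coordinates,
# assuming the 512x512 crop used earlier in this project.
klass, score, x0, y0, x1, y1 = [1.0, 0.469, 0.566, 0.537, 0.605, 0.595]
width, height = 512, 512

box = (
    int(x0 * width),    # left   ~ 290 px
    int(y0 * height),   # top    ~ 275 px
    int(x1 * width),    # right  ~ 310 px
    int(y1 * height),   # bottom ~ 305 px
)
print(int(klass), round(score, 2), box)  # roughly a 20x30 px box for the ball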

Visualizing the Results

Frankly, I needed something more tangible to appreciate the results. So I used this function to visualize each prediction:

def visualize_detection(img_file, dets, classes=[], thresh=0.6):
    import random
    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg

    img = mpimg.imread(img_file)
    plt.imshow(img)
    height = img.shape[0]
    width = img.shape[1]
    colors = dict()
    for det in dets:
        (klass, score, x0, y0, x1, y1) = det
        if score < thresh:
            continue
        cls_id = int(klass)
        if cls_id not in colors:
            colors[cls_id] = (random.random(), random.random(), random.random())
        xmin = int(x0 * width)
        ymin = int(y0 * height)
        xmax = int(x1 * width)
        ymax = int(y1 * height)
        rect = plt.Rectangle((xmin, ymin), xmax - xmin,
                             ymax - ymin, fill=False,
                             edgecolor=colors[cls_id],
                             linewidth=3.5)
        plt.gca().add_patch(rect)
        class_name = str(cls_id)
        if classes and len(classes) > cls_id:
            class_name = classes[cls_id]
        plt.gca().text(xmin, ymin - 2,
                       '{:s} {:.3f}'.format(class_name, score),
                       bbox=dict(facecolor=colors[cls_id], alpha=0.5),
                       fontsize=12, color='white')
    plt.show()

# then I used the function like this:
object_categories = ['pp_ball']
threshold = 0.40
visualize_detection(file_name_of_image, detections['prediction'], object_categories, threshold)

I shuddered in excitement and relief when I saw this output:

I spent an hour hitting the model with various test images it had never seen. Some were great predictions. Others were plain goofy, such as the model confusing white spots on a player’s uniform for a ping pong ball 🙄

Fortunately, I was able to remove most of the false positives by raising the confidence threshold to 0.40.

Future Directions

I am quite happy with my results so far, but future work is required to evaluate and optimize my model. For example, I intend to calculate the mean average precision (mAP) as a performance metric. That mAP metric will help me evaluate different optimizations, such as adding more training images, experimenting with transfer learning, and trying other deep learning topologies. I’ll leave those tasks for my 2020 roadmap (and future posts).
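
For reference, mAP is built on intersection-over-union (IoU) between predicted and ground-truth boxes: a detection typically counts as a true positive when its IoU clears a threshold such as 0.5, and precision is then averaged over recall levels (and classes). Below is a minimal IoU sketch for boxes in [xmin, ymin, xmax, ymax] pixel form, just to illustrate the building block, not a full mAP implementation; the example boxes are hypothetical.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [xmin, ymin, xmax, ymax]."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# e.g. a hypothetical predicted box vs. a labeled ground-truth box
print(iou([59, 142, 77, 153], [62, 144, 80, 155]))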

I’m also excited to tackle my third key result in 2020 — showing the relevant video clips to a user via a web UI. When that key result is complete, I will test the entire setup in a real-world environment:

  • video record a live table tennis match with my Raspberry Pi
  • export the video as image frames to S3
  • use the object detection model to identify when the ball is in play
  • store the timestamps when the ball is in play
  • provide a web UI to the user
  • permit the user to filter the video to just the moments when the ball is in play (see the sketch after this list)
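
For the clip-extraction step, I expect something along the lines of the sketch below: group the stored “ball in play” timestamps (expressed as seconds from the start of the video) into segments, then cut each segment out of the source video with FFmpeg. The paths, padding value, and grouping heuristic are placeholders, not the final implementation.

import subprocess

def cut_clips(source_video, play_timestamps, pad_seconds=2.0, out_prefix='clip'):
    """Group nearby 'ball in play' timestamps (seconds from video start) and cut one clip per group."""
    # Merge timestamps that are close together into (start, end) segments
    segments = []
    for ts in sorted(play_timestamps):
        if segments and ts - segments[-1][1] <= pad_seconds:
            segments[-1][1] = ts
        else:
            segments.append([ts, ts])

    for idx, (start, end) in enumerate(segments):
        out_file = '{}_{:03d}.mp4'.format(out_prefix, idx)
        duration = (end - start) + 2 * pad_seconds
        subprocess.run([
            'ffmpeg', '-y',
            '-ss', str(max(0.0, start - pad_seconds)),  # seek to just before the segment
            '-i', source_video,
            '-t', str(duration),                        # keep the segment plus padding
            '-c', 'copy',                               # stream copy: fast, cuts snap to keyframes
            out_file
        ], check=True)

# e.g. cut_clips('tt-video-1080p.mp4', [61.2, 61.7, 62.3, 95.0, 95.4])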

Stay tuned for more learning and development in this direction.

Closing Thoughts

It is common wisdom among data scientists that

algorithms are cheap;

data is king.

This project has given me a deep appreciation for this truism. Indeed, changing deep learning topologies was trivial on AWS SageMaker, yet the results did not change significantly. I also leveraged transfer learning in another model with minimal effort; again, the results were not much better. Then I remembered the difficult, painstaking work of collecting and labeling images for my project…

My mind boggles when I compare model-related work against data-related work in terms of effort and cross-project applicability. For instance, switching between deep learning topologies is relatively easy, and many projects can reuse those topologies across a variety of computer vision tasks. In contrast, my image labeling work required significant effort and will likely benefit only my project.

Facing this reality, I’m feeling a tad pessimistic about the viability of a general-purpose video search engine that can produce results from a user’s arbitrary input of (1) video and (2) images-as-search-terms. To be sure, a special-purpose video search engine is within reach. But significant, non-trivial work lies ahead to explore how a model can generalize to detect whatever a user wants to find based on a few image examples.

Here’s to a fun year of learning ahead on that front!

Many thanks to Sandeep Arneja, Paul Blankley, Ryo Kawamura, and Thea Zimnicki for their feedback and contributions to this project.
