I’ve written about the Mean Shift Algorithm before, but I never gave a practical application of it. I ultimately felt unsatisfied with that article, so I wanted to revisit the topic and apply it to a real-world situation.
The Mean Shift algorithm can track objects in real time by using color. The more distinct the color is from the background, the better the tracking works. While object tracking is a huge topic in Computer Vision, I wanted to focus on gaming.
Motion control, the art of tracking the physical movement of the player and translating it into a computer input, has been used in gaming on several occasions. While Nintendo’s Wii comes to most people’s minds, it used infrared sensors to communicate with the console. By contrast, Xbox’s Kinect and PlayStation’s Move use an actual camera to track movement.
While consoles employ specialized hardware, a regular, consumer-grade computer or laptop with a camera can apply the same principles.
The Theory Behind the Mean Shift Algorithm
Like all clustering algorithms, mean shift attempts to find densely packed areas within a dataset. Unlike the more popular K-Means clustering, mean shift doesn’t require an estimate of the number of clusters.
Instead, it creates a kernel density estimate (KDE) for the dataset. The algorithm iteratively shifts every data point a small amount toward the nearest KDE peak until a termination criterion is met. The final result is a well-defined set of clusters.
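To make the iteration concrete, here is a minimal NumPy sketch of mean shift with a Gaussian kernel. The bandwidth, tolerance, and iteration cap are illustrative choices, not tuned values.

import numpy as np

def mean_shift(points, bandwidth=1.0, max_iter=50, tol=1e-3):
    # Shift every point toward the nearest peak of the kernel density estimate
    shifted = points.astype(float)
    for _ in range(max_iter):
        max_move = 0.0
        for i, p in enumerate(shifted):
            # Weight all data points by a Gaussian kernel of their distance to p
            dists = np.linalg.norm(points - p, axis=1)
            weights = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))
            # Move p to the weighted mean of its neighborhood
            new_p = (weights[:, None] * points).sum(axis=0) / weights.sum()
            max_move = max(max_move, np.linalg.norm(new_p - p))
            shifted[i] = new_p
        if max_move < tol:  # termination criterion: points stopped moving
            break
    return shifted

# Two synthetic blobs collapse onto roughly two cluster centers
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers = mean_shift(data, bandwidth=1.5)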
Obviously, the mean shift algorithm works well for unsupervised learning, but its application to computer vision isn’t as clear.
Essentially, every pixel in an image may be considered a data point. By default, each pixel conveys three pieces of color information: the red, green, and blue channels, collectively known as the RGB color space (although most computer vision practitioners prefer HSV or CIELAB, because those spaces vary less under different lighting conditions).
When the mean shift algorithm is applied to an image, the resulting clusters represent the major colors present. For example, consider the image below:

When applied, mean shift will produce clusters for red, yellow, green, blue, purple, and white. This process is known as image segmentation.
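This segmentation behavior is available directly in OpenCV through its mean-shift-based filter. Below is a quick sketch; the file names are placeholders, and the spatial radius (sp) and color radius (sr) are illustrative values.

import cv2

# Mean shift filtering: sp is the spatial window radius, sr the color radius
img = cv2.imread("colors.png")  # placeholder file name
segmented = cv2.pyrMeanShiftFiltering(img, sp=21, sr=51)
cv2.imwrite("segmented.png", segmented)

Each region of similar color is smoothed toward its cluster’s color, which makes the segments easy to see.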
To use it for object tracking, however, a color profile is taken from an initial image and compared to the clusters. The region of interest is defined as the points within the closest-matching cluster.
Strengths, Weaknesses, and Other Considerations
As mentioned before, the closest cluster to a pre-defined color is selected and tracked within the video frame. This raises the question of how to select a pre-defined color.
Usually, a region of interest is drawn around the object to track in the first frame of the video. This method captures the exact color of the object under the precise lighting conditions of the video. Unfortunately, it also implies that the location and size of the object are known beforehand.
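For reference, OpenCV ships a built-in helper for drawing that box interactively. A minimal sketch, assuming the default webcam at index 0:

import cv2

video = cv2.VideoCapture(0)
_, frame = video.read()
# Drag a box around the object; returns (x, y, w, h) once Enter or Space is pressed
x, y, w, h = cv2.selectROI("Select object", frame, showCrosshair=False)
cv2.destroyWindow("Select object")
roi = frame[y:y + h, x:x + w]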
The object to track also needs to be a single, distinctive color. For example, the mean shift algorithm will be able to track a plain red ball on a white background; however, if the red ball is on a red background, the algorithm will fail to track it.
Finally, the size of the region of interest never changes. When tracking an object, the tracking box remains the same size regardless of how near or far the object is to the camera.
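One common remedy is CamShift (Continuously Adaptive Mean Shift), OpenCV’s variant that resizes and reorients the window every frame. As a sketch, it is a near drop-in replacement for the meanShift call that appears in the tracking loop later in this article, reusing the same back projection (dst), window, and termination criteria:

# CamShift adapts the window instead of keeping it fixed
# rot_rect is a rotated rectangle ((cx, cy), (w, h), angle) that can be drawn
rot_rect, track_window = cv2.CamShift(dst, track_window, term_crit)

This article sticks with plain mean shift, so the fixed-size window remains.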
Using Mean Shift for a Motion Controller

Object tracking has many applications, but in this instance, it will be used to translate motion into computer input. The mean shift algorithm will track a tennis ball, chosen because it’s ubiquitous, monochrome, and distinctively colored, and transform its location into simulated presses of the arrow keys.
The end user will have to calibrate the algorithm by placing the tennis ball close to the camera. Once done, the user should be able to move the tennis ball up, down, left, or right to control the on-screen action.
While the arrow keys may be used to input commands in any program, this demonstration will play Pacman.
Code Implementation
import numpy as np
import cv2
Before doing any computer vision, two libraries are imported: NumPy to handle efficient calculations and OpenCV to handle computer vision tasks.
# Capture the current frame from the webcam
video = cv2.VideoCapture(0)
_, init_image = video.read()
The VideoCapture method turns on the default webcam and the read method reads the first frame it captures. At this point, the user should hold the tennis ball close to the camera.
# set up initial coordinates for the tracking window
x, y = 100, 100
# Set up initial size of the tracking window
height, width = 100, 100
track_window = (x,y,width,height)
# set up region of interest (roi)
roi = init_image[y:y + height, x:x + width]
An x- and y-coordinate pair as well as height and width variables are defined to create the Region of Interest (ROI), which is simply the location of the tennis ball.
In this case, the ROI starts at point (100, 100) and extends 100 pixels down and right. Since the end user is asked to place the tennis ball close to the camera, this ROI will probably capture a subsection of it. If the user, however, doesn’t place the ball appropriately, the ROI will capture the color of something in the background, such as a hand or a wall.
Note that because the tracking window doesn’t change size in the mean shift algorithm, the ROI will always be a 100-pixel square. Depending on the resolution of the camera or the user’s distance from it, this may need to be tuned.
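If the fixed anchor proves awkward, one possible tweak is to center the calibration box in the frame instead of anchoring it at (100, 100). A sketch reusing the variables defined above:

# Alternative: center the calibration box in the frame
frame_h, frame_w = init_image.shape[:2]
x = frame_w // 2 - width // 2
y = frame_h // 2 - height // 2
track_window = (x, y, width, height)
roi = init_image[y:y + height, x:x + width]

The rest of the walkthrough keeps the original (100, 100) anchor.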
# Convert the ROI to the HSV color space
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# Apply a mask to exclude pixels outside a reasonable range
mask = cv2.inRange(hsv_roi, np.array((0, 20, 20)), np.array((180, 250, 250)))
This section converts the ROI into the HSV color space and ensures that all of its pixels fall within a reasonable range; any under- or oversaturated pixel values are excluded. The conversion is applied to the ROI rather than the full frame so that the histogram below profiles only the tennis ball.
# Calculate the HSV histogram of the ROI
hist_frame = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(hist_frame, hist_frame, 0, 255, cv2.NORM_MINMAX)
A histogram is calculated for the ROI and normalized. This value will be used to track the closest cluster in every subsequent video frame.
# Terminate after 10 iterations or once the window moves less than 1 pixel
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
As mentioned previously, the mean shift algorithm shifts points iteratively until it meets a termination criterion. The tuple defined in this line creates the two conditions for the algorithm to stop and declare a new ROI: mean shift completes if either the centroid of the ROI moves by less than one pixel or 10 iterations finish.
The former condition is intuitive. If the tennis ball doesn’t move, there’s no need to compute a new location. The latter condition concerns processing time. More iterations of mean shift will create a more accurate ROI at the cost of time. Setting this value too high might produce a lagged result, but setting it too low might set the ROI off the mark.
In the end, setting a proper termination condition is a Goldilocks problem.
# Get the dimensions of the video
frame_height = init_image.shape[0]
frame_width = init_image.shape[1]
These few lines simply retrieve the width and height of the video frame. These will be used later.
from pynput.keyboard import Key, Controller
# Initialize the keyboard controller
keyboard = Controller()
# Alt + Tab over to another window
with keyboard.pressed(Key.alt):
    keyboard.press(Key.tab)
    keyboard.release(Key.tab)
A motion controller needs to create input for a device. Consequently, the Pynput library, which simulates key presses, is imported and initialized. Using a with statement, the alt key is held down while the tab key is pressed and released. This simulates the alt + tab keyboard shortcut to activate another window.
In this example, the other window is an online version of Pacman which must be opened before running the code. While a better approach might involve code to open the browser or run a command to open an executable, this provides some additional practice using Pynput.
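For comparison, that alternative takes only a couple of lines with Python’s standard webbrowser module; the URL here is a placeholder.

import time
import webbrowser

# Open an online Pacman game in the default browser (placeholder URL)
webbrowser.open("https://example.com/pacman")
time.sleep(5)  # give the page a moment to load before sending key presses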
# Initialize background subtractor
subtraction = cv2.createBackgroundSubtractorKNN()
Mean shift works best when there’s minimal background noise. This line initializes OpenCV’s background subtraction method. While the specifics of implementation could warrant another article, the basic idea is that this algorithm considers anything stationary as the background and anything moving as the foreground. Given that a motion controller only needs to observe the moving input, this works ideally.
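To see what the subtractor produces on its own, here is a small standalone sketch that displays the raw foreground mask from the webcam; press q to quit.

import cv2

cap = cv2.VideoCapture(0)
subtractor = cv2.createBackgroundSubtractorKNN()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # White pixels mark moving foreground; gray pixels mark detected shadows
    fg_mask = subtractor.apply(frame)
    cv2.imshow("Foreground mask", fg_mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()

With the subtractor’s behavior in mind, the main tracking loop follows.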
while True:
    ret, frame = video.read()
    if ret:
An infinite loop is initialized and the video feed is read. The ret variable is a boolean that indicates whether the camera is streaming correctly, and frame is the current video frame from the stream.
If the ret variable is true, the code proceeds.
        # Apply background subtraction
        mask = subtraction.apply(frame)
        # Create a 3-channel alpha mask
        mask_stack = np.dstack([mask] * 3)
        # Ensure the data types match up
        mask_stack = mask_stack.astype('float32') / 255.0
        frame = frame.astype('float32') / 255.0
        # Blend the image and the mask
        masked = (mask_stack * frame) + ((1 - mask_stack) * (0.0, 0.0, 0.0))
        frame = (masked * 255).astype('uint8')
The first line applies the background subtraction to the video frame and creates a simple binary mask. Since the binary mask is a two-dimensional matrix and the video frame is a three-dimensional matrix, they can’t meaningfully interact until their dimensions agree.
The mask_stack variable simply replicates the mask three times and "stacks" the copies on top of each other. The result is a three-dimensional matrix that can interact with the video frame. Next, both the mask and the video frame undergo some necessary type conversions, and then they’re blended together.
If the resulting image was shown, it would look something like this:

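Incidentally, because the subtractor’s mask is essentially binary, a leaner alternative to the stack-and-blend approach is a bitwise AND, using the frame and mask from the loop above. One caveat: the KNN subtractor can also mark shadows with a gray value, which the blend scales proportionally while bitwise_and keeps outright.

# Equivalent masking for a strictly 0/255 mask: zero out the background
masked_frame = cv2.bitwise_and(frame, frame, mask=mask)

Either way, the loop continues with the masked frame.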
        # Convert to HSV
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Apply back projection
        dst = cv2.calcBackProject([hsv], [0], hist_frame, [0, 180], 1)
        # Apply mean shift to get the new location
        track_window = cv2.meanShift(dst, track_window, term_crit)[1]
With the background removed, the video frame is converted into HSV. Back projection is applied to the frame and then the actual mean shift algorithm is applied. The track_window variable now holds the x- and y-coordinates of the top-left corner of the tracking window along with its width and height.
        # Calculate the center point of the tracking window
        center = (track_window[0] + width / 2, track_window[1] + height / 2)
Instead of using the top-left corner, this line calculates the center of the tracking window. If mean shift can find the tennis ball and center on it correctly, this will allow the program to give input based on where the middle of the ball is moving. This is more intuitive to the end user.
        if center[1] > (frame_height * 0.5):
            keyboard.press(Key.down)
            keyboard.release(Key.down)
            print("down")
        elif center[1] < (frame_height * 0.2):
            keyboard.press(Key.up)
            keyboard.release(Key.up)
            print("up")
        if center[0] > (frame_width * 0.5):
            keyboard.press(Key.left)
            keyboard.release(Key.left)
            print("left")
        elif center[0] < (frame_width * 0.2):
            keyboard.press(Key.right)
            keyboard.release(Key.right)
            print("right")
Now that the ball is tracked, its location can be translated into computer input. Image coordinates start at the top-left corner, so a larger y-value means the ball sits lower in the frame: if the center of the ball is in the bottom half of the frame, the down arrow is pressed, and if it’s in the top fifth, the up arrow is pressed. The horizontal checks look reversed, but the camera’s view mirrors the user, so a ball on the right side of the frame has actually been moved to the user’s left, triggering the left arrow.
These proportions might seem odd, but they generally returned better results than symmetric thresholds. Again, they may be tweaked for a particular user or camera setup.
    else:
        break

video.release()
The last few lines clean up the code. The else connects back to the ret variable and will break the loop if the camera is unable to properly read video. The final line will simply shut the camera off.
Results and Conclusions
The ultimate goal of this code was to play a game of Pacman with a tennis ball. It sounded like a fun, silly, and seemingly impossible challenge, but it proved quite feasible. The video below demonstrates my playing experience.

Although there’s no obvious visual cue, the mean shift algorithm successfully tracks the location of the tennis ball. As evidenced by Pacman moving accordingly, the program also translates the ball’s position and movement into input. Consequently, a game of Pacman may be played.
Unfortunately, the experience was lacking. As is somewhat noticeable in the video, there’s sometimes a lag between the movement of the ball and the input on the screen. For any sort of gaming, the lag should be near zero. Likewise, the motion of the ball wasn’t tracked perfectly. Several times the ROI got stuck in a corner, leaving Pacman trapped until a ghost ended the game.
While this program offers a successful proof of concept, it needs more refinement before being offered as an end product.