Mouse automatically navigating to a coordinate according to eye position (Image by author)

Controlling a Mouse With Your Eyes

A Machine Learning approach to eye pose estimation from just a single front-facing perspective as input

Ryan Rudes
Towards Data Science
6 min read · Sep 21, 2020


In this project, we’ll write code that crops an image of your eyes each time you click the mouse. Using this data, we can then train a model in the reverse direction, predicting the position of the mouse from an image of your eyes.

We’ll need a few libraries

# For monitoring the web camera and performing image manipulations
import cv2
# For performing array operations
import numpy as np
# For creating and removing directories
import os
import shutil
# For recognizing and performing actions on mouse presses
from pynput.mouse import Listener

Let’s first learn how pynput’s Listener works.

pynput.mouse.Listener creates a background thread that records mouse movements and mouse clicks. Here’s a simplified snippet that, upon a mouse press, prints the coordinates of the mouse:

from pynput.mouse import Listener

def on_click(x, y, button, pressed):
    """
    Args:
      x: the x-coordinate of the mouse
      y: the y-coordinate of the mouse
      button: the button that was used (e.g. Button.left or Button.right)
      pressed: True if the mouse was pressed, False if it was released
    """
    if pressed:
        print(x, y)

with Listener(on_click=on_click) as listener:
    listener.join()

Now, let’s expand this framework for our purposes. First, though, we need to write the code that crops the bounding box of your eyes out of the frame. We’ll call this function from within the on_click function later.

We use Haar cascade object detection to determine the bounding box of the user’s eyes. You can download the detector file here. Let’s make a simple demonstration to show how this works:

import cv2
import numpy as np

# Load the cascade classifier detection object
cascade = cv2.CascadeClassifier("haarcascade_eye.xml")
# Turn on the web camera
video_capture = cv2.VideoCapture(0)
# Read data from the web camera (get the frame)
_, frame = video_capture.read()
# Convert the image to grayscale
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Detect the bounding boxes of the eyes
# (1.3 is the scale factor, 10 is the minimum-neighbors threshold)
boxes = cascade.detectMultiScale(gray, 1.3, 10)
# Filter out images taken from a bad angle with errors
# We want to make sure both eyes were detected, and nothing else
if len(boxes) == 2:
    eyes = []
    for box in boxes:
        # Get the rectangle parameters for the detected eye
        x, y, w, h = box
        # Crop the bounding box from the frame
        eye = frame[y:y + h, x:x + w]
        # Resize the crop to 32x32
        eye = cv2.resize(eye, (32, 32))
        # Normalize to the range [0, 1]
        eye = (eye - eye.min()) / (eye.max() - eye.min())
        # Further crop to just around the eyeball (32x32 becomes 12x22)
        eye = eye[10:-10, 5:-5]
        # Scale back to [0, 255] and convert to an integer datatype
        eye = (eye * 255).astype(np.uint8)
        # Add the current eye to the list of 2 eyes
        eyes.append(eye)
    # Concatenate the two eye images into one 12x44 image
    eyes = np.hstack(eyes)

Now, let’s use this knowledge to write a function for cropping the eye image. First, we’ll need a helper function for normalization:

def normalize(x):
    minn, maxx = x.min(), x.max()
    return (x - minn) / (maxx - minn)

Here’s our eye cropping function. It returns the image if the eyes were found. Otherwise, it returns None:

def scan(image_size=(32, 32)):
    _, frame = video_capture.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, 1.3, 10)
    if len(boxes) == 2:
        eyes = []
        for box in boxes:
            x, y, w, h = box
            eye = frame[y:y + h, x:x + w]
            eye = cv2.resize(eye, image_size)
            eye = normalize(eye)
            eye = eye[10:-10, 5:-5]
            eyes.append(eye)
        return (np.hstack(eyes) * 255).astype(np.uint8)
    else:
        return None

Now, let’s write our automation, which will run each time we press the mouse button (assume we have already defined the variable root earlier in our code as the directory where we would like to store the images):

def on_click(x, y, button, pressed):
    # If the action was a mouse PRESS (not a RELEASE)
    if pressed:
        # Crop the eyes
        eyes = scan()
        # If the function returned None, something went wrong
        if eyes is not None:
            # Save the image
            filename = root + "{} {} {}.jpeg".format(x, y, button)
            cv2.imwrite(filename, eyes)

Now, we can bring back our pynput Listener from earlier and assemble the full data-collection script:
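Below is a minimal sketch of how the pieces above fit together into one script. The value of root is an assumption, so point it at whatever directory you like; shutil (imported earlier for removing directories) can be used to wipe it if you want to start over:

import cv2
import numpy as np
import os
import shutil
from pynput.mouse import Listener

# Directory where the eye crops will be saved (the name is arbitrary)
root = "eye_data/"
if not os.path.isdir(root):
    os.mkdir(root)
# To discard previously collected images, you could call shutil.rmtree(root)
# before recreating the directory

cascade = cv2.CascadeClassifier("haarcascade_eye.xml")
video_capture = cv2.VideoCapture(0)

def normalize(x):
    minn, maxx = x.min(), x.max()
    return (x - minn) / (maxx - minn)

def scan(image_size=(32, 32)):
    _, frame = video_capture.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, 1.3, 10)
    if len(boxes) == 2:
        eyes = []
        for box in boxes:
            x, y, w, h = box
            eye = frame[y:y + h, x:x + w]
            eye = cv2.resize(eye, image_size)
            eye = normalize(eye)
            eye = eye[10:-10, 5:-5]
            eyes.append(eye)
        return (np.hstack(eyes) * 255).astype(np.uint8)
    else:
        return None

def on_click(x, y, button, pressed):
    if pressed:
        eyes = scan()
        if eyes is not None:
            filename = root + "{} {} {}.jpeg".format(x, y, button)
            cv2.imwrite(filename, eyes)

with Listener(on_click=on_click) as listener:
    listener.join()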

When we run this, each time we click the mouse (if both of our eyes are in view), it will automatically crop the webcam frame and save the image to the directory we chose. The filename of the image will contain the mouse coordinates, as well as whether it was a right or left click.

Here’s an example image. In this image, I am performing a left-click at coordinate (385, 686) on a monitor with resolution 2560x1440:

An example (Image by author)

The cascade classifier is highly accurate, and I have not seen any mistakes in my own data directory so far.

Now, let’s write the code for training a neural network to predict the mouse position, given the image of your eyes.

Let’s import some libraries

import numpy as np
import os
import cv2
import pyautogui
from tensorflow.keras.models import *
from tensorflow.keras.layers import *
from tensorflow.keras.optimizers import *

Now, let’s add our cascade classifier:

cascade = cv2.CascadeClassifier("haarcascade_eye.xml")
video_capture = cv2.VideoCapture(0)

Let’s add our helper functions.

Normalization:

def normalize(x):
    minn, maxx = x.min(), x.max()
    return (x - minn) / (maxx - minn)

Capturing the eyes:

def scan(image_size=(32, 32)):
    _, frame = video_capture.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, 1.3, 10)
    if len(boxes) == 2:
        eyes = []
        for box in boxes:
            x, y, w, h = box
            eye = frame[y:y + h, x:x + w]
            eye = cv2.resize(eye, image_size)
            eye = normalize(eye)
            eye = eye[10:-10, 5:-5]
            eyes.append(eye)
        return (np.hstack(eyes) * 255).astype(np.uint8)
    else:
        return None

Let’s define the dimensions of our monitor. You’ll have to change these parameters according to the resolution of your own computer screen:

# Note that there are actually 2560x1440 pixels on my screen
# I am simply recording one less, so that when we divide by these
# numbers, we will normalize between 0 and 1. Note that mouse
# coordinates are reported starting at (0, 0), not (1, 1)
width, height = 2559, 1439

Now, let’s load in our data (again, assuming you already defined root). We don’t really care whether it was a right or left click, because our goal is just to predict the mouse position:

filepaths = os.listdir(root)
X, Y = [], []
for filepath in filepaths:
    x, y, _ = filepath.split(' ')
    x = float(x) / width
    y = float(y) / height
    X.append(cv2.imread(root + filepath))
    Y.append([x, y])
X = np.array(X) / 255.0
Y = np.array(Y)
print(X.shape, Y.shape)
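Given the 12x44 crops we saved earlier, X should come out with shape (number of images, 12, 44, 3) and Y with shape (number of images, 2), which matches the input shape of the model defined below.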

Let’s define our model architecture:

model = Sequential()
model.add(Conv2D(32, 3, 2, activation = 'relu', input_shape = (12, 44, 3)))
model.add(Conv2D(64, 2, 2, activation = 'relu'))
model.add(Flatten())
model.add(Dense(32, activation = 'relu'))
model.add(Dense(2, activation = 'sigmoid'))
model.compile(optimizer = "adam", loss = "mean_squared_error")
model.summary()

Here’s our summary:

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 5, 21, 32) 896
_________________________________________________________________
conv2d_1 (Conv2D) (None, 2, 10, 64) 8256
_________________________________________________________________
flatten (Flatten) (None, 1280) 0
_________________________________________________________________
dense (Dense) (None, 32) 40992
_________________________________________________________________
dense_1 (Dense) (None, 2) 66
=================================================================
Total params: 50,210
Trainable params: 50,210
Non-trainable params: 0
_________________________________________________________________

Let’s train our model. We’ll add some noise to the image data:

epochs = 200
for epoch in range(epochs):
    # Add a small amount of random noise to the images each epoch
    # (the noise scale of 0.02 is a choice; adjust it as you see fit)
    noise = np.random.normal(0, 0.02, X.shape)
    model.fit(X + noise, Y, batch_size = 32)

Now, let’s use our model to move the mouse with our eyes, live. Note that this requires a lot of data to work well. However, as a proof of concept, you’ll notice that with just around 200 images, it does, in fact, move the mouse to the general region you are looking at. It’s certainly not controllable until you have much more data, though.

while True:
    eyes = scan()
    if eyes is not None:
        eyes = np.expand_dims(eyes / 255.0, axis = 0)
        x, y = model.predict(eyes)[0]
        pyautogui.moveTo(x * width, y * height)

Here’s a proof-of-concept example. Note that I trained with very little data before taking this screen recording. This is a video of my mouse automatically moving to the Terminal application window according to my eyes. As I said, it’s jumpy because there’s very little data; with much more data, it will hopefully be stable enough to control with higher precision. With just a few hundred images, you’ll only be able to move the mouse to within the general region of your gaze. Also, if no images were captured of you looking at a particular region of the screen (say, the edges) during data collection, the model is unlikely to ever predict within that region. This is one of the many reasons we need more data.

If you are testing the code yourself, remember to change the values of width and height to your monitor’s resolution in the code file prediction.py.
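If you prefer not to hardcode the resolution, pyautogui (which we already import for moving the mouse) can report it for you. Here is a small sketch; the variable names are my own, not from the original code:

import pyautogui

# pyautogui.size() returns the screen resolution as (width, height)
# Subtract 1 so that dividing the 0-indexed mouse coordinates by these
# values normalizes them to the range [0, 1], as above
screen_width, screen_height = pyautogui.size()
width, height = screen_width - 1, screen_height - 1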

You can view the code from this tutorial here:


I am a student at the California Institute of Technology, majoring in Electrical Engineering and Computer Science