Real-Time Hand Tracking and Gesture Recognition with MediaPipe: Rerun Showcase

How to visualise MediaPipe’s Hand Tracking and Gesture Recognition with Rerun

Andreas Naoum
Towards Data Science


Hand Tracking and Gesture Recognition | Image by Author

In this post, I’m presenting an example of Hand Tracking and Gesture Recognition using MediaPipe Python and Rerun SDK.

If you’re interested in delving deeper, I’ll guide you through installing MediaPipe Python and the Rerun SDK, tracking a hand, recognising different gestures, and visualising the data.

Specifically, you’ll learn:

  • How to install MediaPipe Python and Rerun
  • How to use MediaPipe Gesture Recognition for Hand Tracking and Gesture Recognition
  • How to visualise the results of the hand-tracking and gesture recognition in the Rerun Viewer

If you’re just eager to give the example a try, simply use the provided code:

# Clone the rerun GitHub repository to your local machine.
git clone https://github.com/rerun-io/rerun

# Navigate to the rerun repository directory.
cd rerun

# Install the required Python packages specified in the requirements file
pip install -r examples/python/gesture_detection/requirements.txt

# Run the main Python script for the example
python examples/python/gesture_detection/main.py

# Run the main Python script for a specific image
python examples/python/gesture_detection/main.py --image path/to/your/image.jpg

# Run the main Python script for a specific video
python examples/python/gesture_detection/main.py --video path/to/your/video.mp4

# Run the main Python script with camera stream
python examples/python/gesture_detection/main.py --camera

Hand Tracking and Gesture Recognition Technology

Before we proceed, let’s give credit to the technology that makes this possible. Hand tracking and gesture recognition technology aims to enable devices to interpret hand movements and gestures as commands or inputs. At the core of this technology, a pre-trained machine-learning model analyses the visual input and identifies hand landmarks and hand gestures. The real-world applications of such technology vary, as hand movements and gestures can be used to control smart devices. Human-Computer Interaction, Robotics, Gaming, and Augmented Reality are a few of the fields where the potential applications of this technology appear most promising.

However, we should always be conscious of how we use such technology. It’s challenging to deploy it in sensitive and critical systems because the model can misinterpret gestures, and the potential for false positives or negatives is not negligible. Ethical and legal challenges also arise, as users may not want their gestures to be recorded, especially in public spaces. If you intend to implement this technology in real-world scenarios, it’s important to take any ethical and legal considerations into account.

Prerequisites & Setup

First, you need to install the necessary libraries, including OpenCV, MediaPipe and Rerun. MediaPipe Python is a handy tool for developers looking to integrate on-device ML solutions for computer vision and machine learning, and Rerun is an SDK for visualizing multimodal data that changes over time.

# Install the required Python packages specified in the requirements file
pip install -r examples/python/gesture_detection/requirements.txt

Then, you have to download the pre-trained model from here: HandGestureClassifier
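If you prefer to script that step, here is a minimal sketch using Python’s standard library. The URL is the float16 model location listed in the Gesture Recognition Task Guide at the time of writing, so verify it against the guide before relying on it.

# Minimal sketch: fetch the pre-trained gesture recognizer model.
# The URL is taken from the Gesture Recognition Task Guide and may change.
import urllib.request

MODEL_URL = (
    "https://storage.googleapis.com/mediapipe-models/gesture_recognizer/"
    "gesture_recognizer/float16/1/gesture_recognizer.task"
)
urllib.request.urlretrieve(MODEL_URL, "gesture_recognizer.task")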

Hand Tracking and Gesture Recognition using MediaPipe

Image via Gesture Recognition Task Guide by Google

“The MediaPipe Gesture Recognizer task lets you recognize hand gestures in real time, and provides the recognized hand gesture results along with the landmarks of the detected hands. You can use this task to recognize specific hand gestures from a user, and invoke application features that correspond to those gestures.” from Gesture Recognition Task Guide

Now, let’s try the MediaPipe pre-trained gesture recognition model on a sample image. Overall, the code below sets the foundation for initialising and configuring a MediaPipe Gesture Recognition solution.

import mediapipe as mp
import numpy as np
import numpy.typing as npt
from mediapipe.tasks import python
from mediapipe.tasks.python import vision


class GestureDetectorLogger:

    def __init__(self, video_mode: bool = False):
        self._video_mode = video_mode

        base_options = python.BaseOptions(
            model_asset_path='gesture_recognizer.task'
        )
        options = vision.GestureRecognizerOptions(
            base_options=base_options,
            running_mode=mp.tasks.vision.RunningMode.VIDEO if self._video_mode else mp.tasks.vision.RunningMode.IMAGE
        )
        self.recognizer = vision.GestureRecognizer.create_from_options(options)

    def detect(self, image: npt.NDArray[np.uint8]) -> None:
        image = mp.Image(image_format=mp.ImageFormat.SRGB, data=image)

        # Get results from the Gesture Recognition model
        recognition_result = self.recognizer.recognize(image)

        for i, gesture in enumerate(recognition_result.gestures):
            # Get the top gesture from the recognition result
            print("Top Gesture Result: ", gesture[0].category_name)

        if recognition_result.hand_landmarks:
            # Obtain hand landmarks from MediaPipe
            hand_landmarks = recognition_result.hand_landmarks
            print("Hand Landmarks: " + str(hand_landmarks))

            # Obtain hand connections from MediaPipe
            mp_hands_connections = mp.solutions.hands.HAND_CONNECTIONS
            print("Hand Connections: " + str(mp_hands_connections))

The detect function within the GestureDetectorLogger class accepts an image as its argument and prints the model results, highlighting the top recognised gesture and the detected hand landmarks. For additional details regarding the model, refer to its model card.

Image via Gesture Recognition Task Guide by Google

You can try it by yourself using the code:

def run_from_sample_image(path) -> None:
    image = cv2.imread(str(path))
    show_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    logger = GestureDetectorLogger(video_mode=False)
    logger.detect(show_image)

# Run the gesture recognition on a sample image
run_from_sample_image(SAMPLE_IMAGE_PATH)

Verify, Debug and Demo using Rerun

This step allows you to ensure the reliability and effectiveness of your solution. With the model now prepared, visualise the results to verify its accuracy, debug any potential issues, and demonstrate its capabilities. Visualising the results is simple and fast with the Rerun SDK.

How do we use Rerun?

Image via Rerun Docs by Rerun
  1. Stream multimodal data from your code by logging it with the Rerun SDK
  2. Visualise and interact with live or recorded streams, whether local or remote
  3. Interactively build layouts and customize visualisations
  4. Extend Rerun when you need to

Before getting into the code, you should visit the Installing the Rerun Viewer page to install the Viewer. Then, I highly suggest getting familiar with the Rerun SDK by reading the Python Quick Start and Logging Data in Python guides. These initial steps will ensure a smooth setup and help you get started with the upcoming code implementation.
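To make the workflow concrete before we dive into the example, here is a minimal sketch of the logging pattern used throughout this post; the application ID and the dummy frame are purely illustrative.

# Minimal sketch of the Rerun logging workflow.
# The application ID and the dummy frame are illustrative, not part of the example.
import numpy as np
import rerun as rr

rr.init("gesture_detection_demo", spawn=True)  # spawn and connect to the Rerun Viewer

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real video frame
rr.set_time_sequence("frame_nr", 0)              # place the data on a timeline
rr.log("Media/Video", rr.Image(frame))           # log the frame to an entity path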

Run from Video or Real-Time

For video streaming, OpenCV is employed. You can select either a file path for a specific video or access your own camera by providing an argument of 0 or 1 (use 0 for the default camera; on Mac, you may use 1).

It’s worth highlighting the introduction of timelines here. Rerun’s timeline functions let you associate each piece of logged data with one or more timelines. Consequently, each frame of the video is associated with its corresponding timestamp.

def run_from_video_capture(vid: int | str, max_frame_count: int | None) -> None:
    """
    Run the detector on a video stream.

    Parameters
    ----------
    vid:
        The video stream to run the detector on. Use 0/1 for the default camera or a path to a video file.
    max_frame_count:
        The maximum number of frames to process. If None, process all frames.
    """
    cap = cv2.VideoCapture(vid)
    fps = cap.get(cv2.CAP_PROP_FPS)

    detector = GestureDetectorLogger(video_mode=True)

    try:
        it: Iterable[int] = itertools.count() if max_frame_count is None else range(max_frame_count)

        for frame_idx in tqdm.tqdm(it, desc="Processing frames"):
            ret, frame = cap.read()
            if not ret:
                break

            if np.all(frame == 0):
                continue

            frame_time_nano = int(cap.get(cv2.CAP_PROP_POS_MSEC) * 1e6)
            if frame_time_nano == 0:
                frame_time_nano = int(frame_idx * 1000 / fps * 1e6)

            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

            rr.set_time_sequence("frame_nr", frame_idx)
            rr.set_time_nanos("frame_time", frame_time_nano)
            detector.detect_and_log(frame, frame_time_nano)
            rr.log(
                "Media/Video",
                rr.Image(frame)
            )

    except KeyboardInterrupt:
        pass

    cap.release()
    cv2.destroyAllWindows()

Logging Data for Visualisation

Logging 2D data using Rerun SDK | Image by Author

To visualise the data in the Rerun Viewer, it’s essential to log it using the Rerun SDK. The guides mentioned earlier provide insights into this process. In this context, we extract the hand landmark points as normalized values and then use the image’s width and height to convert them into image coordinates. These coordinates are then logged as 2D points with the Rerun SDK. Additionally, we identify the connections between the landmarks and log them as 2D line strips.

For gesture recognition, the results are printed to the console. However, within the source code, you can explore a method to present these results to the viewer using TextDocument and emojis.
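As a rough idea of how that can look, the sketch below logs the recognised gesture as a Markdown text document; the entity path and formatting here are illustrative rather than the exact implementation from the source code.

# Minimal sketch: show the recognised gesture as text in the Viewer.
# The entity path and the Markdown formatting are illustrative only.
def log_gesture_text(gesture_category: str) -> None:
    rr.log(
        "Detection",
        rr.TextDocument(
            f"Recognised gesture: **{gesture_category}**",
            media_type=rr.MediaType.MARKDOWN,
        ),
    )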

class GestureDetectorLogger:

    def detect_and_log(self, image: npt.NDArray[np.uint8], frame_time_nano: int | None) -> None:
        # Recognize gestures in the image
        height, width, _ = image.shape
        image = mp.Image(image_format=mp.ImageFormat.SRGB, data=image)

        recognition_result = (
            self.recognizer.recognize_for_video(image, int(frame_time_nano / 1e6))
            if self._video_mode
            else self.recognizer.recognize(image)
        )

        # Clear the previously logged values
        for log_key in ["Media/Points", "Media/Connections"]:
            rr.log(log_key, rr.Clear(recursive=True))

        for i, gesture in enumerate(recognition_result.gestures):
            # Get the top gesture from the recognition result
            gesture_category = gesture[0].category_name if recognition_result.gestures else "None"
            print("Gesture Category: ", gesture_category)  # Log the detected gesture

        if recognition_result.hand_landmarks:
            hand_landmarks = recognition_result.hand_landmarks

            # Convert normalized coordinates to image coordinates
            points = self.convert_landmarks_to_image_coordinates(hand_landmarks, width, height)

            # Log points to the image and Hand Entity
            rr.log(
                "Media/Points",
                rr.Points2D(points, radii=10, colors=[255, 0, 0])
            )

            # Obtain hand connections from MediaPipe
            mp_hands_connections = mp.solutions.hands.HAND_CONNECTIONS
            points1 = [points[connection[0]] for connection in mp_hands_connections]
            points2 = [points[connection[1]] for connection in mp_hands_connections]

            # Log connections to the image and Hand Entity
            rr.log(
                "Media/Connections",
                rr.LineStrips2D(
                    np.stack((points1, points2), axis=1),
                    colors=[255, 165, 0]
                )
            )

    @staticmethod
    def convert_landmarks_to_image_coordinates(hand_landmarks, width, height):
        return [(int(lm.x * width), int(lm.y * height)) for hand_landmark in hand_landmarks for lm in hand_landmark]

3D Points

Finally, we examine how to present the hand landmarks as 3D points. We first define the connections between the keypoints with an Annotation Context in the init function, and then we log the landmarks as 3D points.

Logging 3D data using Rerun SDK | Image by Author

class GestureDetectorLogger:

    def __init__(self, video_mode: bool = False):
        # ... existing code ...
        rr.log(
            "/",
            rr.AnnotationContext(
                rr.ClassDescription(
                    info=rr.AnnotationInfo(id=0, label="Hand3D"),
                    keypoint_connections=mp.solutions.hands.HAND_CONNECTIONS
                )
            ),
            timeless=True,
        )
        rr.log("Hand3D", rr.ViewCoordinates.RIGHT_HAND_X_DOWN, timeless=True)

    def detect_and_log(self, image: npt.NDArray[np.uint8], frame_time_nano: int | None) -> None:
        # ... existing code ...

        if recognition_result.hand_landmarks:
            hand_landmarks = recognition_result.hand_landmarks

            landmark_positions_3d = self.convert_landmarks_to_3d(hand_landmarks)
            if landmark_positions_3d is not None:
                rr.log(
                    "Hand3D/Points",
                    rr.Points3D(
                        landmark_positions_3d,
                        radii=20,
                        class_ids=0,
                        keypoint_ids=[i for i in range(len(landmark_positions_3d))]
                    ),
                )

        # ... existing code ...
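The convert_landmarks_to_3d helper isn’t shown above; as a rough sketch of what it might do, it can simply collect the normalized x, y and relative z values that MediaPipe already provides for each landmark (see the source code on GitHub for the actual implementation).

# Rough sketch only; in the source code this is a method on GestureDetectorLogger.
def convert_landmarks_to_3d(hand_landmarks):
    if not hand_landmarks:
        return None
    # Each MediaPipe landmark carries normalized x, y and a relative z value
    return [(lm.x, lm.y, lm.z) for hand_landmark in hand_landmarks for lm in hand_landmark]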

You’re ready! Let the magic begin:

# For an image
run_from_sample_image(IMAGE_PATH)

# For a saved video
run_from_video_capture(VIDEO_PATH, max_frame_count=None)

# For real-time camera input
run_from_video_capture(0, max_frame_count=None)  # on Mac, you may need 1

The full source code for this example is available on GitHub. Feel free to explore, modify, and understand the inner workings of the implementation.

Beyond Hand-Tracking and Gesture Recognition

Rerun Examples | Image by Rerun

Finally, if you have a keen interest in visualising streams of multimodal data across a diverse range of applications, I encourage you to explore the Rerun Examples. These examples highlight potential real-world cases and provide valuable insights into the practical applications of such visualisation techniques.

If you found this article useful and insightful, there’s more coming! I regularly share in-depth posts on robotics and computer vision visualisation that you won’t want to miss. For future updates and exciting projects, follow me!

Also, you can find me on LinkedIn.

Portions of this page are reproduced from work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License.
