
Building Visual Agents that can Navigate the Web Autonomously

A step-by-step guide to creating visual agents that can navigate the web autonomously

This post was co-authored with Rafael Guedes.

Introduction

In the age of exponential growth in artificial intelligence, the topic of the moment is the rise of agentic AI. These AI systems leverage large language models (LLMs) to make decisions, plan, and collaborate with other agents or humans.

When we wrap an LLM with a role, a set of tools, and a specific goal, we create what we call an agent. By focusing on a well-defined objective and having access to relevant APIs or external tools (like search engines, databases, or even browser interfaces – more about this later), agents can autonomously explore paths to achieve their targets. Thus, agentic AI opens up a new paradigm where multiple agents can tackle complex, multi-step workflows.

John Carmack and Andrej Karpathy recently discussed a topic on X (formerly Twitter) that inspired this article. Carmack mentioned that AI-powered assistants could push applications to expose their features through text-based interfaces. In that world, LLMs talk to a command-line interface wrapped under the graphical user interface (a.k.a. GUI), sidestepping some of the complexity of pure vision-based navigation (which exists only because we humans need it). Karpathy raised the valid point that advanced AI systems may become better at automating GUIs before developers can provide comprehensive textual interfaces for every application. We agree with Karpathy on this one.

Building on these ideas, this article explores how to implement and empower AI agents with visual navigation capabilities. We will walk through how to build agents that can navigate the web autonomously by relying solely on their vision skills (no APIs or scraping). They can browse websites, move through pages to achieve a pre-defined goal, and retrieve the necessary information without human intervention.

Figure 1: Visual Agentic AI (image by author with DALL-E)

As always, the code is available on our GitHub.

Multimodal LLMs: how do they work?

Multimodal LLMs (MLLMs) were developed to address the limitations of LLMs. The latter perform well in zero-shot reasoning on most NLP tasks but fall short when dealing with visual inputs. Large Vision Models (LVMs), on the other hand, can process visual elements but lack the advanced reasoning capabilities of LLMs. By combining both, MLLMs integrate LLM reasoning with LVM visual processing, enabling the analysis of mixed inputs such as text and images [1]. Figure 2 shows the current state-of-the-art MLLMs and their evolution over time.

Figure 2: Existing MLLMs landscape (source)

Architecture

A typical MLLM architecture consists of three elements:

1. The pre-trained modality encoder is responsible for understanding the relationship between text and any other modality, such as audio or image. It aligns their respective representations in a shared latent space.

In this case, our model receives images and text as input. Thus, it has two encoders, one for images and the other for text. An image encoder is typically a convolutional neural network (CNN) or a vision transformer (ViT) that converts the image into a high-dimensional vector representation, i.e., an embedding. A text encoder is usually a transformer-based language model that likewise converts text into an embedding representation.

Afterward, the model aligns the outputs of both encoders in the shared latent space so that the embeddings of similar images and text descriptions are closer in that space.

This alignment is crucial for the model to understand which images match which text descriptions, and it is achieved by training the model with a contrastive loss that (a minimal sketch follows Figure 3 below):

  • Computes the similarity (dot product) between every image-text pair in the batch.
  • Applies a softmax function to create a probability distribution over the pairs.
  • Optimizes the model using, for example, cross-entropy loss. It maximizes the similarity between correct image-text pairs and minimizes the similarity between unrelated image-text pairs.
Figure 3: Training a multimodal model to align text and image embeddings (image by author)
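To make this training objective concrete, below is a minimal PyTorch sketch of a CLIP-style contrastive loss. The tensor shapes and the temperature value are illustrative assumptions, not taken from any particular model.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """CLIP-style loss: matching image-text pairs lie on the diagonal of the similarity matrix."""
    # Normalize the embeddings so that the dot product equals cosine similarity.
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # logits[i, j] = similarity between image i and text j in the batch.
    logits = image_embeddings @ text_embeddings.T / temperature

    # The correct text for image i sits in column i (and vice versa).
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

Minimizing this loss pulls matching image-text pairs together in the shared latent space while pushing unrelated pairs apart.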

2. The modality interface consists of a learnable connector that bridges the gap between modalities. Training this connector is faster and cheaper than training an MLLM end-to-end, and its objective is to align the output of the visual/audio encoder with the input text. It can be implemented in two ways (a minimal sketch follows Figure 4 below):

  • Token-level fusion is where the output of the image/audio encoders is transformed into tokens (through query-based learning or a simple linear projection/MLP) and concatenated with the text tokens.
  • Feature-level fusion adds extra modules to capture deeper interactions between text and visual/audio features through cross-attention layers.
Figure 4: Token and Feature level fusion (image by author)
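To illustrate token-level fusion, the sketch below projects image features into the LLM's embedding space with a simple linear connector and concatenates them with the text tokens. The dimensions and class names are hypothetical.

import torch
import torch.nn as nn

class TokenLevelFusion(nn.Module):
    """Map image features into the LLM token space and prepend them to the text embeddings."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # A linear connector is the simplest option; query-based learners (e.g., Q-Former) are an alternative.
        self.connector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, num_patches, vision_dim)
        # text_embeddings: (batch, num_text_tokens, llm_dim)
        image_tokens = self.connector(image_features)
        # The concatenated sequence is what the pre-trained LLM receives as input.
        return torch.cat([image_tokens, text_embeddings], dim=1)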

3. Pre-trained LLMs receive the aligned representations of the different input modalities and use them to reason and generate a text answer. One can also add an optional generator to produce outputs in modalities other than text.

We can use any LLM in this layer, such as GPT-4o, LLaMA, Mixtral, Gemini, or Qwen. The choice of LLM depends on the specific use case, as these models come in varying sizes (larger models usually mean better performance). Some models are multilingual, while others focus on a single language, most commonly English. Certain models, such as Mixtral, achieve faster inference by using a Mixture of Experts (MoE). This technique scales up model capacity while keeping the number of parameters active for each token, and therefore the inference cost, relatively low (a toy routing sketch follows Figure 5 below).

Figure 5: MLLM architecture (source)
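As a toy illustration of the MoE idea (not Mixtral's actual implementation), the sketch below routes each token to its top-k experts only, so the compute per token stays bounded even as the total number of experts grows. Real implementations add load-balancing losses and capacity limits.

import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Route each token to its top-k experts; only those experts run for that token."""
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, dim)
        gate_scores = torch.softmax(self.router(x), dim=-1)
        weights, expert_ids = gate_scores.topk(self.top_k, dim=-1)
        output = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, k] == e
                if mask.any():
                    # Weight each selected expert's output by its gating score.
                    output[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return output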

Google GenAI SDK

Google recently launched its GenAI SDK, which can be installed by running pip install google-generativeai. This new package is the easiest way for developers to interact with Gemini, the multimodal model developed by Google DeepMind [2].
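As a minimal example of what a call looks like (assuming a GEMINI_API_KEY environment variable and the gemini-1.5-flash-002 model used later in this article):

import os
import google.generativeai as genai

# Configure the SDK with an API key loaded from the environment.
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

# Instantiate a model and send a plain text prompt.
model = genai.GenerativeModel(model_name="gemini-1.5-flash-002")
response = model.generate_content("In one sentence, what is a multimodal LLM?")
print(response.text)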

The team at Google developed several notebooks showcasing different use cases for this new package [3]. Some of the most interesting ones are:

  • Object Detection can be easily performed with gemini-1.5-flash-002 by simply sending the image and the object you want to extract as input. In the example below, the user targeted the umbrellas in the image.
Figure 6: Extracting bounding boxes with Gemini (source)

  • Video Analysis is possible by leveraging Gemini's multimodal capabilities. One can use it to identify elements of interest in different frames or to summarize the video's content.

  • Entity extraction from a large corpus can be a hard and time-consuming task for a human, but not for Gemini. In this example, the team at Google shows how to extract streets and means of transportation from a long text by just prompting the following:
directions_prompt = f"""
  From the given text, extract the following entities and return a list of them.
  Entities to extract: street name, form of transport.
  Text: {directions}
  Street = []
  Transport = [] """

{ "Street": [ "Via dei Fori Imperiali", "Via del Corso", "Via della Lungara" ], "Transport": [ "train", "metro", "bus", "Terravision shuttle", "taxi", "private transfer service" ] }

  • Interpreting Charts, Graphs, and Slides. As we saw earlier, being multimodal means that Gemini can process both text and image inputs and reason across both input types. In this example, Gemini is asked to explain the different pub/sub patterns using the image below:

The image shows three different Pub/Sub patterns:

Many-to-One (Fan-In): In this pattern, multiple publishers send messages to a single topic. A single subscriber listens to this topic and consumes all the messages.

Many-to-Many: This pattern allows multiple publishers to send messages to a single topic and multiple subscribers can consume messages from that topic. It is useful for load balancing and ensuring that messages are distributed across multiple consumers.

One-to-Many (Fan-Out): This pattern has a single publisher sending messages to a topic, and multiple subscribers are listening to that topic. This pattern is used to distribute messages to a large number of consumers.

Figure 7: Slide with Cloud Pub/Sub explanation (source)

Navigating web pages using Google GenAI SDK

In this section, we work through our use case using Google GenAI SDK. We create an agent capable of navigating web pages and extracting content from them. The goal is to have the agent navigate to ZAAI's web page, search for its blog, and get some metadata from the last article published. This agent leverages Gemini's multimodal capabilities to extract the necessary information based only on screenshots from the website (image) and the instructions we provide through text.

These agents will open the door to many use cases that were impossible before. Even a few months ago, agents only relied on APIs to get information from third-party systems. In this case, there is no need to spend time building an API since the agent can navigate and extract information as humans do. Unlike traditional scraping, our agent will adapt to changes in the UI/UX of the websites and won’t require custom code to find specific elements on the page.

We start by importing the libraries, defining global variables, and loading the Gemini API Key from an env file (the API key can be obtained here):

import subprocess
import time
import pyautogui
import base64
import google.generativeai as genai
import json
import re
import os
from dotenv import load_dotenv
from PIL import Image, ImageDraw
CHROME_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
SCREENSHOT_PATH = "assets/zaai_homepage.png"
SCREENSHOT_BBOXED_PATH = "assets/zaai_homepage_bboxed.png"
SCREENSHOT_BLOG_PATH = "assets/zaai_lab.png"

ZAAI_URL = "https://zaai.ai"

load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise ValueError("API key not found. Please set GEMINI_API_KEY in your .env file.")
genai.configure(api_key=api_key)

We create three utility functions to handle the input and output for Gemini. The first function reads an image file and converts it into a Base64-encoded string (necessary for Gemini’s API). The second function extracts JSON data from the LLM’s response and performs the necessary pre-processing to ensure the data is usable. The third function transforms bounding box coordinates, which Google normalizes to a range of 0 to 1000, into actual pixel values.

def encode_image_to_base64(image_path: str) -> str:
    """ Read an image file and return its Base64-encoded string. """
    with open(image_path, "rb") as img_file:
        image_data = img_file.read()
    return base64.b64encode(image_data).decode("utf-8")

def extract_json_from_response(response) -> dict:
    """
    Extract JSON content from the LLM response, removing any code fence markers.
    """
    if not hasattr(response, "candidates") or not response.candidates:
        raise ValueError("Response does not contain valid candidates.")
    raw_text = response.candidates[0].content.parts[0].text
    json_str = re.sub(r"^```json|```$", "", raw_text.strip(), flags=re.MULTILINE)

    try:
        parsed_data = json.loads(json_str)
        return parsed_data
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse JSON: {e}nRaw LLM Response:n{raw_text}")

def update_coordinates_to_pixels(detection_info: dict, width: int, height: int) -> None:
    """
    Convert normalized bounding box coordinates ([0..1000]) to actual pixel values.
    """
    for key, value in detection_info.items():
        coords = value["coordinates"]
        xmin, ymin, xmax, ymax = coords
        value["coordinates"] = [
            (xmin / 1000.0) * width,
            (ymin / 1000.0) * height,
            (xmax / 1000.0) * width,
            (ymax / 1000.0) * height
        ]

Then, we develop functions that make tools available to the agent. The first tool overlays bounding boxes on an image and saves the result. The second captures a screenshot of the screen. The third launches the Chrome browser and navigates to a specified URL. Finally, the last one finds and clicks a bounding box, enabling interaction with specific UI elements.

def draw_bounding_boxes(image_path: str, detection_info: dict, output_path: str, color: str = "red") -> None:
    """
    Draw bounding boxes on the image and save to output_path.
    Expects detection_info to have pixel coordinates already.
    """
    try:
        with Image.open(image_path) as img:
            draw = ImageDraw.Draw(img)

            for label, details in detection_info.items():
                coords = details["coordinates"]
                description = details.get("description", "")  # kept for readability; not drawn
                draw.rectangle(coords, outline=color, width=2)
                draw.text((coords[0], coords[1] - 10), label, fill=color)

            img.save(output_path)
            print(f"Image saved with bounding boxes at: {output_path}")
    except Exception as e:
        print(f"Failed to create image with bounding boxes: {e}")

def take_screenshot(output_path: str):
    """Take a screenshot of the main screen (or the active window)."""
    time.sleep(2)  # wait a bit for the page to load
    screenshot = pyautogui.screenshot()
    screenshot.save(output_path)
    print(f"Screenshot saved to {output_path}.")

def open_chrome(url: str):
    """Open Chrome to a specific URL using a subprocess."""
    print(f"Opening Chrome at {url} ...")
    subprocess.Popen([CHROME_PATH, url])
    time.sleep(5)

def find_and_click_lab_element(bounding_box_data: dict):
    """
    Click the bounding box that should lead to the 'Lab' (or blog page).
    For simplicity, let's assume we pick the bounding box whose label
    or description references "Lab" or "Blog"
    """
    target_label = None
    for label, info in bounding_box_data.items():
        lower_desc = info["description"].lower()
        lower_label = label.lower()
        if "lab" in lower_desc or "lab" in lower_label:
            target_label = label
            break
        if "blog" in lower_desc or "blog" in lower_label:
            target_label = label
            break

    if not target_label:
        print("Could not find a bounding box that references Lab/Blog in the description.")
        return
    coords = bounding_box_data[target_label]["coordinates"]

    # coords is [xmin, ymin, xmax, ymax]; we click at the top-left corner of the box.
    # The division by 2 converts Retina (2x) screenshot pixels into the logical
    # screen coordinates that pyautogui expects.
    click_x = coords[0] / 2
    click_y = coords[1] / 2
    print(f"Clicking element: {target_label}")

    pyautogui.moveTo(click_x, click_y, duration=0.5)
    pyautogui.click()

Finally, we implement capabilities for our agent to extract specific information from what it is "seeing."

def identify_elements_with_descriptions(image_path: str) -> list:
    """
    Step 1: Ask the model to identify clickable elements and include descriptions.
    Return a list of objects, each containing 'label' and 'description'.
    """
    model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
    encoded_image = encode_image_to_base64(image_path)

    prompt = """
    You are given a screenshot of a website homepage.
    Identify all the relevant clickable elements (text, buttons, icons, tabs, images, etc.)
    on the website page only (discard browser elements if they appear in the image) and provide:
      - A semantically rich name as the label (e.g., "Lab Link" or "Blog Tab")
      - A short description of its purpose on the page and any relevant visual details

    Output JSON in this format:
    {
      "elements": [
        {
          "label": "some descriptive label",
          "description": "short description with visual nuances"
        },
        ...
      ]
    }
    """

    response = model.generate_content([
        {"mime_type": "image/png", "data": encoded_image},
        prompt
    ])

    parsed_data = extract_json_from_response(response)
    if "elements" not in parsed_data:
        raise ValueError("No 'elements' field found in the JSON response.")

    return parsed_data["elements"]

def propose_bounding_boxes(image_path: str, identified_elements: list) -> dict:
    """
    Step 2: Provide the list of elements from Step 1 (labels + descriptions).
    Ask the model to propose bounding boxes in [xmin, ymin, xmax, ymax] with 0..1000 scale.
    Return a dict where keys are labels, and values have 'coordinates' + 'description'.
    We also copy the 'description' from the elements so that we keep it in the final output.
    """
    model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
    encoded_image = encode_image_to_base64(image_path)

    elements_json_str = json.dumps(identified_elements, indent=2)

    prompt = f"""
    The following clickable elements were identified (labels + descriptions):
    {elements_json_str}

    Propose a bounding box (in [xmin, ymin, xmax, ymax], 0..1000 scale) for each element
    so we can locate them on the screenshot.

    Output JSON in the format:
    {{
      "<element_label>": {{
        "coordinates": [xmin, ymin, xmax, ymax],
        "description": "<the same description from above>"
      }},
      ...
    }}
    """

    response = model.generate_content([
        {"mime_type": "image/png", "data": encoded_image},
        prompt
    ])

    parsed_data = extract_json_from_response(response)
    return parsed_data

def retrieve_latest_blog_info(image_path: str) -> tuple[str, str]:
    """
    Get the title and publication date of the latest blog post from the screenshot.
    """
    model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
    encoded_image = encode_image_to_base64(image_path)

    prompt = """
    You are given a screenshot of a website blog page.
    Identify the latest article and get its title and date.

    Output JSON in this format:
    {
      "title": "title of the article",
      "pub_date": "date of publishing of the article"
    }
    """

    response = model.generate_content([
        {"mime_type": "image/png", "data": encoded_image},
        prompt
    ])

    parsed_data = extract_json_from_response(response)
    if "title" not in parsed_data:
        raise ValueError("No 'title' field found in the JSON response.")

    return parsed_data['title'], parsed_data['pub_date']

The visual workflow of the process can be seen in Figure 8.

Figure 8: Steps performed by the Visual Agent to extract the metadata from the latest blog article (image by author).
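For reference, one possible way to chain these functions into the workflow of Figure 8 is sketched below. It is a sketch rather than the exact script from the repository: it assumes the functions and constants defined above are in scope and that the Retina scaling assumption inside find_and_click_lab_element holds on your machine.

if __name__ == "__main__":
    # 1. Open the browser on the target page and capture the homepage.
    open_chrome(ZAAI_URL)
    take_screenshot(SCREENSHOT_PATH)

    # 2. Ask Gemini to identify clickable elements and propose bounding boxes.
    elements = identify_elements_with_descriptions(SCREENSHOT_PATH)
    boxes = propose_bounding_boxes(SCREENSHOT_PATH, elements)

    # 3. Convert the normalized (0..1000) coordinates to pixels and save an annotated screenshot.
    with Image.open(SCREENSHOT_PATH) as img:
        width, height = img.size
    update_coordinates_to_pixels(boxes, width, height)
    draw_bounding_boxes(SCREENSHOT_PATH, boxes, SCREENSHOT_BBOXED_PATH)

    # 4. Click the element that leads to the Lab/blog page and capture it.
    find_and_click_lab_element(boxes)
    time.sleep(3)  # give the page time to load before the next screenshot
    take_screenshot(SCREENSHOT_BLOG_PATH)

    # 5. Extract the metadata of the latest article.
    title, pub_date = retrieve_latest_blog_info(SCREENSHOT_BLOG_PATH)
    print(f"Latest article: {title} (published {pub_date})")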

Conclusion

Agentic AI opens up many new possibilities. As we make these agents more autonomous, their ability to act on goals and adapt to new information makes old and complex problems suddenly tractable. They are also gaining more and more human-like capabilities. For instance, interpreting an image and extracting information from it was, until recently, something only humans could do. A few months ago, if we wanted a machine to extract this level of detail from an image, we would either 1) train a model for a specific goal or 2) program the machine to perform a very specific task. Neither approach was scalable: if something changed, we would most likely need to retrain the model or reprogram the task.

Nowadays, multimodal models can interpret and analyze visual information by simply following text instructions and reasoning over the inputs. This means that, without any modifications, the model can still function even if the context or the visual appearance of the desired information changes.

Agentic AI is still taking the first steps, and we look forward to seeing what comes next!

About me

Serial entrepreneur and leader in the AI space. I develop AI products for businesses and invest in AI-focused startups.

Founder @ ZAAI | LinkedIn | X/Twitter

References

[1] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen. (2024). A Survey on Multimodal Large Language Models. arXiv:2306.13549.

[2] https://github.com/google-gemini/generative-ai-python/blob/main/README.md

[3] https://github.com/google-gemini/cookbook/tree/main/examples

