AutoGen MultimodalWebSurfer: A Complete Guide

1. Overview and Role

MultimodalWebSurfer is a multimodal agent capable of browsing the web, visiting webpages, and performing a wide range of interactions (search, click, scroll, form input, etc.). This agent automatically launches a Chromium browser and controls it via Playwright. It is primarily used with multimodal models such as GPT-4o, and provides features including webpage screenshots, interactive element extraction, text summarization, and question answering.

"A helpful assistant with access to a web browser. You can ask it to search the web, open pages, and interact with content (click links, scroll, fill forms, etc.). It can also summarize entire pages or answer questions based on page content." — From the DEFAULT_DESCRIPTION

2. Installation

pip install "autogen-ext[web-surfer]"

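The Playwright Python package and related dependencies are installed automatically with this extra. The browser binaries are a separate download: if Chromium is not yet present on the machine, run playwright install --with-deps chromium once after the pip step.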
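A quick way to confirm the installation is to check that the two top-level modules can be found. This is a minimal sketch: the module names autogen_ext and playwright are taken from the import lines and package names used in this guide, and check_install is a hypothetical helper, not part of either library. It does not verify that the Chromium binary itself has been downloaded.

```python
from importlib.util import find_spec

# Top-level module names assumed for the packages this guide relies on.
REQUIRED_MODULES = ("autogen_ext", "playwright")


def check_install() -> dict[str, bool]:
    """Report which of the required top-level modules can be located."""
    return {name: find_spec(name) is not None for name in REQUIRED_MODULES}


if __name__ == "__main__":
    for name, found in check_install().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```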

3. Core Operational Flow (in order)

1) Agent Creation and Initialization

When creating a MultimodalWebSurfer object, you can configure various parameters (see below).
The actual browser is launched only on the first call and is reused thereafter.

"The browser is launched when the agent is first called, and reused in subsequent calls."
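The launch-on-first-call behavior can be pictured with a small stand-in class. This is only a sketch of the lazy-initialization pattern: LazyBrowserAgent, launch_count, and the placeholder "browser" object are hypothetical, and the real agent's internals may differ.

```python
import asyncio


class LazyBrowserAgent:
    """Hypothetical stand-in for an agent that launches its browser lazily."""

    def __init__(self) -> None:
        self.launch_count = 0   # how many times the "browser" was started
        self._browser = None    # nothing is launched at construction time

    async def _lazy_init(self) -> None:
        # Only the first call pays the start-up cost; later calls reuse it.
        if self._browser is None:
            await asyncio.sleep(0)  # placeholder for the real browser launch
            self._browser = object()
            self.launch_count += 1

    async def on_messages(self, text: str) -> str:
        await self._lazy_init()
        return f"handled: {text}"

    async def close(self) -> None:
        self._browser = None


async def demo_lazy_launch() -> int:
    agent = LazyBrowserAgent()
    assert agent.launch_count == 0  # constructing the agent launches nothing
    await agent.on_messages("first")
    await agent.on_messages("second")
    count = agent.launch_count
    await agent.close()
    return count


if __name__ == "__main__":
    print(asyncio.run(demo_lazy_launch()))
```

One practical consequence of this pattern is that launch failures, such as a missing browser binary, surface on the first message rather than at construction time.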

Key Parameters

name: Agent name
model_client: Multimodal model client (e.g., GPT-4o)
downloads_folder: Folder for downloaded files
description: Agent description
debug_dir: Folder for debug information
headless: Whether to run the browser without a display (default: True)
start_page: Starting page (default: https://www.bing.com/)
animate_actions: Whether to animate actions
to_save_screenshots: Whether to save screenshots
use_ocr: Whether to use OCR
browser_channel: Browser channel
browser_data_dir: Browser data directory
to_resize_viewport: Whether to resize the viewport
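As an illustration of how these parameters might be combined, the sketch below builds two keyword-argument sets, one for unattended runs and one for debugging with a visible browser. The parameter names come from the list above; the values, the ./debug path, the pairing of to_save_screenshots with debug_dir, and the helper names surfer_kwargs and build_surfer are assumptions made for this example. The import is deferred so the file loads even where the extra is not installed.

```python
from typing import Any


def surfer_kwargs(debug: bool = False) -> dict[str, Any]:
    """Keyword arguments for MultimodalWebSurfer, using names from the list above."""
    kwargs: dict[str, Any] = {
        "name": "web_surfer",
        "headless": not debug,             # show the browser window while debugging
        "start_page": "https://www.bing.com/",
        "animate_actions": debug,          # animations only help a human observer
        "to_save_screenshots": debug,
    }
    if debug:
        # Hypothetical folder; assumed here to be where screenshots end up.
        kwargs["debug_dir"] = "./debug"
    return kwargs


def build_surfer(model_client: Any, debug: bool = False) -> Any:
    """Create the agent; requires the web-surfer extra to be installed."""
    # Imported lazily so that importing this module has no hard dependency.
    from autogen_ext.agents.web_surfer import MultimodalWebSurfer

    return MultimodalWebSurfer(model_client=model_client, **surfer_kwargs(debug))
```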

2) Message Handling and Web Interaction

Behavior when on_messages() / on_messages_stream() is called

Browser initialization and page load
- On the first call, the browser and page are initialized via _lazy_init().
- The browser stays open until close() is called.
Response generation
- _generate_reply() is called to produce the final response.
Screenshot and interactive element extraction
- A screenshot of the page is taken and interactive elements (click/input targets, etc.) are extracted.
- A screenshot annotated with bounding boxes around those elements is also prepared.
Model invocation and response handling
- The set-of-mark (SOM) screenshot, message history, and available tool list are passed to the model.
- If the model returns a string, that string becomes the final response.
- If the model returns a list of tool calls, _execute_tool() dispatches them to PlaywrightController for execution.
- The final response includes a screenshot, page metadata, a description of the actions taken, and the webpage text.
Error handling
- If an error occurs during execution, an error message is returned as the final response.
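The branching in the steps above (plain string, list of tool calls, or error) can be sketched without a browser. Everything here is a simplified stand-in: ToolCall, TOOLS, the tool names visit_url and click, and handle_model_output are hypothetical and are not the library's actual types or tool set.

```python
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class ToolCall:
    """Hypothetical, minimal representation of one tool call from the model."""
    name: str
    arguments: dict = field(default_factory=dict)


ModelOutput = Union[str, list[ToolCall]]

# Hypothetical tool table; a real controller would drive the browser instead.
TOOLS: dict[str, Callable[..., str]] = {
    "visit_url": lambda url: f"visited {url}",
    "click": lambda target_id: f"clicked element {target_id}",
}


def handle_model_output(output: ModelOutput) -> str:
    """Turn the model's output into the final response text."""
    try:
        if isinstance(output, str):
            # A plain string is already the final response.
            return output
        # Otherwise execute each tool call and describe what was done.
        actions = [TOOLS[call.name](**call.arguments) for call in output]
        return "; ".join(actions)
    except Exception as exc:
        # Any failure is reported as the final response instead of being raised.
        return f"Web surfing error: {exc!r}"


if __name__ == "__main__":
    print(handle_model_output("The page title is AutoGen."))
    print(handle_model_output([ToolCall("visit_url", {"url": "https://example.com"})]))
    print(handle_model_output([ToolCall("unknown_tool")]))
```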

"The agent takes a screenshot of the page, extracts interactive elements, and prepares a screenshot with bounding boxes."

3) State Management and Reset

The agent is stateful — it remembers previous message history and incorporates it into subsequent responses.
Calling on_reset() resets the agent to its initial state.

"Because the agent is stateful, messages passed to this method should be new messages since the previous call."
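Why only new messages should be passed can be seen with a toy model of a stateful agent. StatefulAgent below is a hypothetical class that simply accumulates what it receives; it is not the real agent, but it shows how resending the whole transcript duplicates history.

```python
class StatefulAgent:
    """Hypothetical agent that remembers every message it has been given."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def on_messages(self, new_messages: list[str]) -> int:
        # The agent appends to its own history, so callers pass only the delta.
        self.history.extend(new_messages)
        return len(self.history)

    def on_reset(self) -> None:
        self.history.clear()


conversation = ["Open bing.com", "Search for AutoGen"]

agent = StatefulAgent()
agent.on_messages(conversation[:1])
agent.on_messages(conversation[1:])   # correct: only the new message
correct_history = list(agent.history)

agent.on_reset()                      # back to an empty history
agent.on_messages(conversation[:1])
agent.on_messages(conversation)       # mistake: whole transcript resent
duplicated_history = list(agent.history)

if __name__ == "__main__":
    print(correct_history)
    print(duplicated_history)
```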

4) Browser Shutdown

When the agent is no longer needed, call close() to shut down the browser and page.
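To make sure close() runs even when a task fails partway through, the call can be wrapped in a small async context manager. This is a sketch: closing_agent is a hypothetical helper that works with any object exposing an async close() method, and FakeAgent exists only so the example runs without a browser.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_agent(agent):
    """Yield the agent and always await agent.close() on exit."""
    try:
        yield agent
    finally:
        await agent.close()


class FakeAgent:
    """Stand-in with the same async close() shape as the real agent."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def demo_close() -> bool:
    agent = FakeAgent()
    try:
        async with closing_agent(agent):
            raise RuntimeError("simulated failure mid-task")
    except RuntimeError:
        pass
    return agent.closed  # close() ran despite the exception


if __name__ == "__main__":
    print(asyncio.run(demo_close()))
```

A plain try/finally around the task achieves the same thing; the context manager just keeps the cleanup next to the place where the agent is created.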

4. PlaywrightController: Web Interaction Helper

PlaywrightController is a helper class that performs various actions on real webpages via Playwright.

Key Features and Methods

add_cursor_box: Add a red cursor box to a specific element
back: Navigate to the previous page
click_id: Click a specific element
fill_id: Enter a value into an input field
get_focused_rect_id: Return the ID of the currently focused element
get_interactive_rects: Extract information about interactive regions
get_page_markdown: Extract page content as Markdown (not yet implemented)
get_page_metadata: Extract page metadata
get_visible_text: Extract text from the current viewport
get_visual_viewport: Extract viewport information
get_webpage_text: Extract the full page text (number of lines can be specified)
gradual_cursor_animation: Animate cursor movement
hover_id: Hover the mouse over a specific element
on_new_page: Handle actions on a new page
page_down / page_up: Scroll the page
remove_cursor_box: Remove the cursor box
scroll_id: Scroll a specific element
sleep: Wait for a specified duration
visit_page: Navigate to a specific URL

"Adds a red cursor box to a specific element." "Scrolls the page down by one viewport height." "Navigates to the specified URL."

5. Example Code

Below is a practical example of using MultimodalWebSurfer.

import asyncio
from autogen_agentchat.ui import Console
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.web_surfer import MultimodalWebSurfer

async def main() -> None:
    # Define the agent
    web_surfer_agent = MultimodalWebSurfer(
        name="MultimodalWebSurfer",
        model_client=OpenAIChatCompletionClient(model="gpt-4o-2024-08-06"),
    )
    # Define the team
    agent_team = RoundRobinGroupChat([web_surfer_agent], max_turns=3)
    # Run the team and stream messages to the console
    stream = agent_team.run_stream(task="Navigate to the AutoGen readme on GitHub.")
    await Console(stream)
    # Close the browser
    await web_surfer_agent.close()

asyncio.run(main())

6. Key Constants and Configuration Values

DEFAULT_DESCRIPTION:

"A helpful assistant with access to a web browser. ... (abbreviated)"

DEFAULT_START_PAGE: https://www.bing.com/
VIEWPORT_WIDTH / HEIGHT: Default viewport size (1440 x 900)
MLM_WIDTH / HEIGHT: Screenshot size for multimodal models (1224 x 765)
SCREENSHOT_TOKENS: Screenshot token count (1105)
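The numbers above imply a fixed relationship between the browser viewport and the image the model sees. The sketch below works through that arithmetic; the constant values are the ones listed, while the helper functions and the reading of SCREENSHOT_TOKENS as a per-screenshot cost are assumptions for illustration, not a description of the agent's internal code.

```python
VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1440, 900
MLM_WIDTH, MLM_HEIGHT = 1224, 765
SCREENSHOT_TOKENS = 1105


def scale_factor() -> float:
    """Ratio between the model-facing screenshot and the viewport (same on both axes)."""
    return MLM_WIDTH / VIEWPORT_WIDTH


def to_viewport(x: float, y: float) -> tuple[int, int]:
    """Map a point on the model-facing screenshot back to viewport pixels."""
    return (
        round(x * VIEWPORT_WIDTH / MLM_WIDTH),
        round(y * VIEWPORT_HEIGHT / MLM_HEIGHT),
    )


def screenshot_token_cost(num_screenshots: int) -> int:
    """Token budget if each screenshot is assumed to cost SCREENSHOT_TOKENS."""
    return num_screenshots * SCREENSHOT_TOKENS


if __name__ == "__main__":
    print(scale_factor())
    print(to_viewport(612, 0))
    print(screenshot_token_cost(3))
```

Both axes shrink by the same factor, so the aspect ratio of the page is preserved in the image sent to the model.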

7. Windows Environment Notes

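When running on Windows, set the event loop policy to WindowsProactorEventLoopPolicy before starting the event loop. Playwright launches the browser as a subprocess, and on Windows only the proactor event loop supports subprocesses: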

import sys
import asyncio
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

8. Security and Precautions

Because the agent interacts with a digital world designed for humans, it may occasionally attempt risky actions without human involvement, such as accepting cookie agreements or recruiting humans for help.
Always monitor the agent and run it in a controlled environment.
The agent may also be vulnerable to prompt injection: text on a webpage can contain instructions aimed at the model, so treat all page content as untrusted input.

"When using MultimodalWebSurfer, keep in mind that it is interacting with a digital world designed for humans. The agent may occasionally attempt dangerous actions, so always monitor it and run it in a controlled environment."
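One possible building block for a controlled environment is a host allow-list that is checked before a URL is handed to the agent, for example when choosing start_page or vetting links in a task. This is a minimal sketch: ALLOWED_HOSTS and is_allowed are hypothetical, the host names are placeholders, and nothing here is a built-in feature of the agent. It also does not stop the agent from following links on its own once it is browsing.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; replace with the hosts your task actually needs.
ALLOWED_HOSTS = {"github.com", "www.bing.com"}


def is_allowed(url: str) -> bool:
    """Return True only for http(s) URLs whose host is on the allow-list."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Accept an exact match or a true subdomain, never a look-alike suffix.
    return any(
        host == allowed or host.endswith("." + allowed)
        for allowed in ALLOWED_HOSTS
    )


if __name__ == "__main__":
    for candidate in (
        "https://github.com/microsoft/autogen",
        "https://github.com.evil.example/login",
        "file:///etc/passwd",
    ):
        print(candidate, is_allowed(candidate))
```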

9. Configuration and Extension

_from_config(config): Create a new instance from a configuration object
_to_config(): Return the current instance's configuration as an object
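The idea behind this pair of methods is a round trip: an instance can be described by a plain configuration object, and an equivalent instance can be rebuilt from it. The sketch below shows that pattern with stand-in classes; SurferConfig and ConfigurableSurfer are hypothetical, and the real configuration object has its own fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SurferConfig:
    """Hypothetical configuration object holding a few constructor settings."""
    name: str
    headless: bool = True
    start_page: str = "https://www.bing.com/"


class ConfigurableSurfer:
    """Stand-in class showing the _to_config / _from_config round trip."""

    def __init__(self, name: str, headless: bool = True,
                 start_page: str = "https://www.bing.com/") -> None:
        self.name = name
        self.headless = headless
        self.start_page = start_page

    def _to_config(self) -> SurferConfig:
        # Capture only what is needed to rebuild an equivalent instance.
        return SurferConfig(self.name, self.headless, self.start_page)

    @classmethod
    def _from_config(cls, config: SurferConfig) -> "ConfigurableSurfer":
        return cls(config.name, config.headless, config.start_page)


original = ConfigurableSurfer("web_surfer", headless=False)
config = original._to_config()
serialized = json.dumps(asdict(config))  # the config is plain data
restored = ConfigurableSurfer._from_config(SurferConfig(**json.loads(serialized)))

if __name__ == "__main__":
    print(serialized)
    print(restored.name, restored.headless, restored.start_page)
```

Runtime state such as the open browser or the message history is not part of this kind of configuration; only the settings used to construct the instance are.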

10. Key Keyword Summary

Multimodal agent
Web browser automation
Playwright
GPT-4o
Screenshot / interactive element extraction
Stateful
Tool calling (Function/Tool Calling)
Security considerations (prompt injection, etc.)

11. Closing Thoughts

MultimodalWebSurfer is a powerful tool that combines web automation with AI, enabling real navigation and interaction with webpages. In practice, pay close attention to security and monitoring, and take advantage of the many parameters and methods available to customize it to your needs.
