1. Overview and Role
MultimodalWebSurfer is a multimodal agent capable of browsing the web, visiting webpages, and performing a wide range of interactions (search, click, scroll, form input, etc.). This agent automatically launches a Chromium browser and controls it via Playwright. It is primarily used with multimodal models such as GPT-4o, and provides features including webpage screenshots, interactive element extraction, text summarization, and question answering.
"A helpful assistant with access to a web browser. You can ask it to search the web, open pages, and interact with content (click links, scroll, fill forms, etc.). It can also summarize entire pages or answer questions based on page content." (from DEFAULT_DESCRIPTION)
2. Installation
pip install "autogen-ext[web-surfer]"
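- The Playwright Python package and related dependencies are installed automatically. If no Chromium binary is present afterwards, Playwright's own installer (playwright install chromium) can download one.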
3. Core Operational Flow (in order)
1) Agent Creation and Initialization
- When creating a MultimodalWebSurfer object, you can configure various parameters (see below).
- The actual browser is launched only on the first call and is reused thereafter.
"The browser is launched when the agent is first called, and reused in subsequent calls."
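The launch-on-first-call behavior can be pictured with a small sketch. This is not the library's code: LazyBrowserHolder, on_call, and launch_count are hypothetical names, and a plain Python object stands in for the Playwright browser.

```python
import asyncio
from typing import Optional


class LazyBrowserHolder:
    """Hypothetical stand-in illustrating launch-on-first-call; not the real agent."""

    def __init__(self) -> None:
        self._browser: Optional[object] = None
        self.launch_count = 0

    async def _lazy_init(self) -> None:
        # Pretend to launch a browser; a real agent would start Chromium here.
        self._browser = object()
        self.launch_count += 1

    async def on_call(self) -> object:
        # Launch only if no browser exists yet, then reuse it.
        if self._browser is None:
            await self._lazy_init()
        return self._browser

    async def close(self) -> None:
        # In this sketch, closing simply drops the stand-in browser.
        self._browser = None


async def demo() -> int:
    holder = LazyBrowserHolder()
    first = await holder.on_call()
    second = await holder.on_call()
    assert first is second  # the same browser object is reused
    await holder.close()
    return holder.launch_count


launch_count = asyncio.run(demo())
```

Two calls lead to a single launch, which is the point of deferring initialization: constructing the agent stays cheap, and the browser cost is paid only when the agent is actually used.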
Key Parameters
- name: Agent name
- model_client: Multimodal model client (e.g., GPT-4o)
- downloads_folder: Folder for downloaded files
- description: Agent description
- debug_dir: Folder for debug information
- headless: Whether to run the browser without a display (default: True)
- start_page: Starting page (default: https://www.bing.com/)
- animate_actions: Whether to animate actions
- to_save_screenshots: Whether to save screenshots
- use_ocr: Whether to use OCR
- browser_channel: Browser channel
- browser_data_dir: Browser data directory
- to_resize_viewport: Whether to resize the viewport
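As a rough sketch of how some of these parameters might be passed, the helper below collects keyword arguments and hands them to the constructor. The keyword names come from the list above; the folder paths and the helper names surfer_settings and build_surfer are illustrative assumptions. The import is deferred so the settings can be inspected even where autogen-ext is not installed.

```python
from typing import Any, Dict


def surfer_settings() -> Dict[str, Any]:
    # Keyword names come from the parameter list above; the values are illustrative.
    return {
        "name": "MultimodalWebSurfer",
        "headless": True,
        "start_page": "https://www.bing.com/",
        "downloads_folder": "./downloads",
        "debug_dir": "./debug",
        "to_save_screenshots": True,
        "animate_actions": False,
    }


def build_surfer(model_client: Any) -> Any:
    # Imported lazily so this module loads even where autogen-ext is not installed.
    from autogen_ext.agents.web_surfer import MultimodalWebSurfer

    return MultimodalWebSurfer(model_client=model_client, **surfer_settings())


settings = surfer_settings()
```

Setting debug_dir alongside to_save_screenshots gives the saved screenshots an obvious place to go.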
2) Message Handling and Web Interaction
Behavior when on_messages() / on_messages_stream() is called
- Browser initialization and page load
  - On the first call, the browser and page are initialized via _lazy_init().
  - The browser stays open until close() is called.
- Response generation
  - _generate_reply() is called to produce the final response.
- Screenshot and interactive element extraction
  - A screenshot of the page is taken and interactive elements (click/input targets, etc.) are extracted.
  - A screenshot annotated with bounding boxes around those elements is also prepared.
- Model invocation and response handling
  - The SoM (Set-of-Mark) screenshot, message history, and available tool list are passed to the model.
  - If the model returns a string, that string becomes the final response.
  - If the model returns a list of tool calls, _execute_tool() dispatches them to PlaywrightController for execution.
  - The final response includes a screenshot, page metadata, a description of the actions taken, and the webpage text.
- Error handling
  - If an error occurs during execution, an error message is returned as the final response.
"The agent takes a screenshot of the page, extracts interactive elements, and prepares a screenshot with bounding boxes."
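The branching described above (string reply, list of tool calls, or error) can be sketched in a few lines. This is a simplified model under stated assumptions, not the agent's implementation: ToolCall, generate_reply, and the two toy tools are hypothetical, and the tools return a text description instead of driving a browser.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Union


@dataclass
class ToolCall:
    """Hypothetical, simplified stand-in for a model-issued tool call."""

    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


ModelOutput = Union[str, List[ToolCall]]


def visit_page(url: str) -> str:
    return f"Navigated to {url}."


def page_down() -> str:
    return "Scrolled down one page."


TOOLS: Dict[str, Callable[..., str]] = {"visit_page": visit_page, "page_down": page_down}


def generate_reply(output: ModelOutput, tools: Dict[str, Callable[..., str]]) -> str:
    try:
        # A plain string from the model is already the final response.
        if isinstance(output, str):
            return output
        # Otherwise run each tool call and collect the action descriptions.
        descriptions = [tools[call.name](**call.arguments) for call in output]
        return " ".join(descriptions)
    except Exception as exc:
        # Any failure is reported as the final response instead of being raised.
        return f"Web surfing error: {exc!r}"


text_reply = generate_reply("The page is about AutoGen.", TOOLS)
tool_reply = generate_reply([ToolCall("visit_page", {"url": "https://example.com"})], TOOLS)
error_reply = generate_reply([ToolCall("unknown_tool")], TOOLS)
```

Returning the error as text keeps the conversation going, so a calling team or orchestrator can react to the failure in the next turn.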
3) State Management and Reset
- The agent is stateful — it remembers previous message history and incorporates it into subsequent responses.
- Calling on_reset() resets the agent to its initial state.
"Because the agent is stateful, messages passed to this method should be new messages since the previous call."
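One way to picture the "new messages only" rule is a caller that tracks how much of the conversation the agent has already seen. The stub below is hypothetical (StatefulAgentStub is not part of the library, and plain strings stand in for message objects); it only illustrates the bookkeeping.

```python
from typing import List


class StatefulAgentStub:
    """Hypothetical stub that keeps history across calls, like a stateful agent."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def on_messages(self, new_messages: List[str]) -> int:
        # Append only what the caller passes; earlier messages are already stored.
        self.history.extend(new_messages)
        return len(self.history)

    def on_reset(self) -> None:
        # Forget everything and return to the initial state.
        self.history.clear()


conversation = ["Open bing.com", "Search for AutoGen"]
agent = StatefulAgentStub()
sent = 0  # index of the first message the agent has not seen yet

agent.on_messages(conversation[sent:])
sent = len(conversation)

conversation.append("Click the first result")
agent.on_messages(conversation[sent:])  # sends only the one new message
sent = len(conversation)

history_len = len(agent.history)
agent.on_reset()
history_after_reset = len(agent.history)
```

Passing the whole conversation on every call would duplicate entries in the stored history, which is why the slice starts at the sent index.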
4) Browser Shutdown
- When the agent is no longer needed, call close() to shut down the browser and page.
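A common way to make sure close() runs even when a task fails is a try/finally block. The sketch below uses a hypothetical SurferStub with a deliberately failing run() method. Only the async close() call mirrors the agent described here.

```python
import asyncio


class SurferStub:
    """Hypothetical stand-in exposing an async close(), like the agent."""

    def __init__(self) -> None:
        self.closed = False

    async def run(self, task: str) -> str:
        # Simulate a task that fails partway through.
        raise RuntimeError(f"simulated failure during: {task}")

    async def close(self) -> None:
        self.closed = True


async def run_and_close(agent: SurferStub, task: str) -> str:
    try:
        return await agent.run(task)
    except RuntimeError as exc:
        return f"error: {exc}"
    finally:
        # Runs on success and on failure, so the browser is never left open.
        await agent.close()


stub = SurferStub()
outcome = asyncio.run(run_and_close(stub, "open a page"))
```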
4. PlaywrightController: Web Interaction Helper
PlaywrightController is a helper class that performs various actions on real webpages via Playwright.
Key Features and Methods
- add_cursor_box: Add a red cursor box to a specific element
- back: Navigate to the previous page
- click_id: Click a specific element
- fill_id: Enter a value into an input field
- get_focused_rect_id: Return the ID of the currently focused element
- get_interactive_rects: Extract information about interactive regions
- get_page_markdown: Extract page content as Markdown (not yet implemented)
- get_page_metadata: Extract page metadata
- get_visible_text: Extract text from the current viewport
- get_visual_viewport: Extract viewport information
- get_webpage_text: Extract the full page text (number of lines can be specified)
- gradual_cursor_animation: Animate cursor movement
- hover_id: Hover the mouse over a specific element
- on_new_page: Handle actions on a new page
- page_down / page_up: Scroll the page
- remove_cursor_box: Remove the cursor box
- scroll_id: Scroll a specific element
- sleep: Wait for a specified duration
- visit_page: Navigate to a specific URL
"Adds a red cursor box to a specific element."
"Scrolls the page down by one viewport height."
"Navigates to the specified URL."
5. Example Code
Below is a practical example of using MultimodalWebSurfer.
import asyncio
from autogen_agentchat.ui import Console
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.web_surfer import MultimodalWebSurfer


async def main() -> None:
    # Define the agent
    web_surfer_agent = MultimodalWebSurfer(
        name="MultimodalWebSurfer",
        model_client=OpenAIChatCompletionClient(model="gpt-4o-2024-08-06"),
    )
    # Define the team
    agent_team = RoundRobinGroupChat([web_surfer_agent], max_turns=3)
    # Run the team and stream messages to the console
    stream = agent_team.run_stream(task="Navigate to the AutoGen readme on GitHub.")
    await Console(stream)
    # Close the browser
    await web_surfer_agent.close()


asyncio.run(main())
6. Key Constants and Configuration Values
- DEFAULT_DESCRIPTION: "A helpful assistant with access to a web browser. ... (abbreviated)"
- DEFAULT_START_PAGE: https://www.bing.com/
- VIEWPORT_WIDTH / HEIGHT: Default viewport size (1440 x 900)
- MLM_WIDTH / HEIGHT: Screenshot size for multimodal models (1224 x 765)
- SCREENSHOT_TOKENS: Screenshot token count (1105)
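The numbers above are related by simple arithmetic, worked through below. The scale factor follows directly from the listed sizes. The token calculation is an assumption: it applies the tile-based accounting OpenAI documents for high-detail images (a base of 85 tokens plus 170 per 512 x 512 tile), simplified by skipping the preliminary resize steps, which do not change an image of this size. Whether the library derives SCREENSHOT_TOKENS this way is not confirmed by the text here, and tile_tokens is a hypothetical helper.

```python
import math

VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1440, 900
MLM_WIDTH, MLM_HEIGHT = 1224, 765

# The model-facing screenshot is the viewport scaled uniformly in both directions.
scale_x = MLM_WIDTH / VIEWPORT_WIDTH
scale_y = MLM_HEIGHT / VIEWPORT_HEIGHT


def to_viewport(x_mlm: float, y_mlm: float) -> tuple:
    # Map a point on the scaled screenshot back to viewport coordinates.
    return (x_mlm / scale_x, y_mlm / scale_y)


def tile_tokens(width: int, height: int, tile: int = 512,
                per_tile: int = 170, base: int = 85) -> int:
    # Assumed formula: a base cost plus a fixed cost per 512-pixel tile.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles


screenshot_tokens = tile_tokens(MLM_WIDTH, MLM_HEIGHT)
```

Under these assumptions the screenshot covers 3 x 2 tiles, giving 85 + 6 x 170 = 1105 tokens, which agrees with the SCREENSHOT_TOKENS value listed above.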
7. Windows Environment Notes
When using on Windows, you must configure the event loop policy as follows:
import sys
import asyncio

if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
8. Security and Precautions
- Because the agent interacts with a digital world designed for humans, it may occasionally attempt potentially dangerous actions, such as accepting cookie consent dialogs or requesting human assistance on its own.
- Always monitor the agent and run it in a controlled environment.
- Be aware that it may be vulnerable to prompt injection attacks: text on a visited webpage can be written to steer the model, so treat page content as untrusted input.
"When using MultimodalWebSurfer, keep in mind that it is interacting with a digital world designed for humans. The agent may occasionally attempt dangerous actions, so always monitor it and run it in a controlled environment."
9. Configuration and Extension
- _from_config(config): Create a new instance from a configuration object
- _to_config(): Return the current instance's configuration as an object
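The round trip between an instance and its configuration can be illustrated with a toy class. Everything here is a hypothetical sketch: SurferConfigSketch and ConfigurableSketch are not library types, only three of the parameters are modeled, and serialization of the model client is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class SurferConfigSketch:
    """Hypothetical config object; field names mirror a few constructor parameters."""

    name: str
    headless: bool = True
    start_page: str = "https://www.bing.com/"


class ConfigurableSketch:
    """Hypothetical class showing a _to_config / _from_config round trip."""

    def __init__(self, name: str, headless: bool = True,
                 start_page: str = "https://www.bing.com/") -> None:
        self.name = name
        self.headless = headless
        self.start_page = start_page

    def _to_config(self) -> SurferConfigSketch:
        # Capture the current settings as a plain configuration object.
        return SurferConfigSketch(self.name, self.headless, self.start_page)

    @classmethod
    def _from_config(cls, config: SurferConfigSketch) -> "ConfigurableSketch":
        # Build a fresh instance from a previously captured configuration.
        return cls(config.name, config.headless, config.start_page)


original = ConfigurableSketch("surfer", headless=False)
restored = ConfigurableSketch._from_config(original._to_config())
```

The restored instance carries the same settings but none of the runtime state, such as an open browser or message history.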
10. Key Keyword Summary
- Multimodal agent
- Web browser automation
- Playwright
- GPT-4o
- Screenshot / interactive element extraction
- Stateful
- Tool calling (Function/Tool Calling)
- Security considerations (prompt injection, etc.)
11. Closing Thoughts
MultimodalWebSurfer is a powerful tool that combines web automation with AI, enabling real navigation and interaction with webpages. In practice, pay close attention to security and monitoring, and take advantage of the many parameters and methods available to customize it to your needs. Feel free to ask if you have any questions! 😊