Overview
MultimodalWebSurfer is a multimodal web browsing agent built on Playwright/Chromium. It can search, click, scroll, fill forms, take screenshots, extract interactive elements, summarize pages, and answer questions about web content.
Key Features
- The browser launches on the first call and is reused on subsequent calls
- Extracts interactive elements and draws their bounding boxes on screenshots (Set-of-Mark, or SoM, approach)
- The model decides whether to return a text response or issue tool calls (click, fill, navigate, scroll)
- Stateful: retains conversation history across calls
- Configurable: headless mode, OCR, viewport sizing, animation, debug logging
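The launch-once, stateful, text-or-tool-call behaviour listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the MultimodalWebSurfer implementation: `SketchSurfer`, `ToolCall`, and `fake_model` are hypothetical names, the "browser" is a plain dictionary, and the model is replaced by a simple function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class ToolCall:
    """Hypothetical stand-in for a tool call returned by the model."""
    name: str
    arguments: Dict[str, str]


class SketchSurfer:
    """Illustrative only; not the MultimodalWebSurfer implementation."""

    def __init__(self, decide: Callable[[List[str]], Union[str, ToolCall]]) -> None:
        self._decide = decide                  # stand-in for the model client
        self._browser: Optional[dict] = None   # created lazily on first use
        self.launch_count = 0
        self.history: List[str] = []           # conversation state kept across calls

    def _ensure_browser(self) -> dict:
        # Launch once, then hand back the same object on later calls.
        if self._browser is None:
            self.launch_count += 1
            self._browser = {"url": "about:blank"}
        return self._browser

    def run(self, message: str) -> str:
        browser = self._ensure_browser()
        self.history.append(message)
        decision = self._decide(self.history)
        if isinstance(decision, str):
            reply = decision                   # plain text answer
        else:
            if decision.name == "navigate":    # only one tool is modelled here
                browser["url"] = decision.arguments["url"]
            reply = f"executed {decision.name}"
        self.history.append(reply)
        return reply


def fake_model(history: List[str]) -> Union[str, ToolCall]:
    """Toy policy: navigate when asked to, otherwise answer in text."""
    last = history[-1]
    if last.startswith("go to "):
        return ToolCall("navigate", {"url": last[len("go to "):]})
    return f"seen {len(history)} messages"


surfer = SketchSurfer(fake_model)
first = surfer.run("go to https://example.com")
second = surfer.run("what is this page?")
```

In this sketch the second call reuses the browser object created by the first, and the history grows with every message and reply, which is what lets later answers depend on earlier turns.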
PlaywrightController Methods
The controller exposes click_id, fill_id, visit_page, back, page_down, page_up, scroll_id, hover_id, get_visible_text, get_webpage_text, get_page_metadata, screenshots, and more.
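The ID-based methods (click_id, fill_id, hover_id, scroll_id) take an element identifier rather than a CSS selector. As a rough sketch of how such an identifier could map back to a location on the page, assuming each marked element carries a bounding box: `MarkedElement`, `assign_marks`, and `click_point` are hypothetical helpers written for illustration, not part of PlaywrightController.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MarkedElement:
    """Hypothetical record for one interactive element found on the page."""
    role: str
    label: str
    box: Tuple[int, int, int, int]  # x, y, width, height in pixels


def assign_marks(elements: List[MarkedElement]) -> Dict[str, MarkedElement]:
    """Give each element a short numeric ID, as a screenshot overlay might."""
    return {str(i): el for i, el in enumerate(elements, start=1)}


def click_point(marks: Dict[str, MarkedElement], element_id: str) -> Tuple[int, int]:
    """Resolve an ID to the centre of its bounding box."""
    if element_id not in marks:
        raise KeyError(f"unknown element id: {element_id}")
    x, y, w, h = marks[element_id].box
    return (x + w // 2, y + h // 2)


marks = assign_marks([
    MarkedElement("button", "Search", (10, 20, 100, 40)),
    MarkedElement("textbox", "Query", (0, 0, 200, 30)),
])
search_target = click_point(marks, "1")
query_target = click_point(marks, "2")
```

The point of the indirection is that the model only has to name an ID it can see on the annotated screenshot. Translating that ID into coordinates or a DOM handle is left to the controller.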
Security Considerations
Because the agent interacts with a digital world designed for humans, it may attempt risky actions and is vulnerable to prompt injection from the pages it reads. Always monitor the agent and run it in a controlled environment.
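One way to make an environment more controlled is to check every URL against an allowlist before the agent is permitted to navigate to it. The following is a minimal sketch of such a guard, assuming you wrap navigation yourself. `ALLOWED_HOSTS` and `is_allowed` are hypothetical names, and nothing here is a built-in MultimodalWebSurfer feature.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts your task actually needs.
ALLOWED_HOSTS = frozenset({"example.com", "docs.python.org"})


def is_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs on an allowed host or its subdomains."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # rejects file:, javascript:, data:, and similar schemes
    host = (parsed.hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)
```

A guard like this narrows where the agent can go but does not address prompt injection on allowed pages, so human monitoring and sandboxing remain necessary.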