Overview

MultimodalWebSurfer is a multimodal web browsing agent built on Playwright/Chromium. It can search, click, scroll, fill forms, take screenshots, extract interactive elements, summarize pages, and answer questions about web content.

Key Features

  • Browser launches on the first call and is reused thereafter
  • Extracts interactive elements and marks them with bounding boxes on screenshots (Set-of-Mark, or SoM, approach)
  • Model decides between a text response and tool calls (click, fill, navigate, scroll)
  • Stateful: remembers conversation history
  • Configurable: headless mode, OCR, viewport sizing, animation, debug logging
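
To make the Set-of-Mark idea concrete, here is a minimal sketch of the labelling step only: visible elements get sequential numeric marks that a model can refer to. The names Rect, Element, mark_elements, and legend are hypothetical and are not part of the agent's API; the real agent also draws the boxes onto a screenshot, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float


@dataclass(frozen=True)
class Element:
    role: str   # e.g. "button", "link", "textbox"
    name: str   # accessible name or visible label
    rect: Rect  # bounding box in viewport coordinates


def in_viewport(rect: Rect, vw: int, vh: int) -> bool:
    """Return True if any part of the rectangle overlaps the viewport."""
    return (
        rect.left < vw and rect.left + rect.width > 0
        and rect.top < vh and rect.top + rect.height > 0
    )


def mark_elements(elements: Iterable[Element], vw: int, vh: int) -> Dict[int, Element]:
    """Assign marks 1, 2, 3, ... to the elements visible in the viewport."""
    visible = [e for e in elements if in_viewport(e.rect, vw, vh)]
    return {mark: e for mark, e in enumerate(visible, start=1)}


def legend(marks: Dict[int, Element]) -> str:
    """Render a text legend that could accompany the annotated screenshot."""
    return "\n".join(f"[{m}] {e.role}: {e.name}" for m, e in marks.items())


if __name__ == "__main__":
    page = [
        Element("button", "Search", Rect(10, 10, 80, 30)),
        Element("link", "Footer", Rect(10, 2000, 80, 30)),  # below the fold
        Element("textbox", "Query", Rect(100, 10, 200, 30)),
    ]
    print(legend(mark_elements(page, vw=1280, vh=720)))
```

The model can then answer with a mark number ("click 2") instead of pixel coordinates, which is the main appeal of this style of prompting.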

PlaywrightController Methods

Methods include click_id, fill_id, visit_page, back, page_down, page_up, scroll_id, hover_id, get_visible_text, get_webpage_text, get_page_metadata, screenshot capture, and more.
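
One way to picture how model tool calls could map onto controller methods is a small dispatch table. The sketch below uses a hypothetical StubController that only records calls; its signatures, and the tool-call dictionary shape with "name" and "arguments" keys, are assumptions for illustration. The real controller methods drive a Playwright page and their signatures may differ.

```python
from typing import Any, Callable, Dict, List, Tuple


class StubController:
    """Hypothetical stand-in that records calls instead of driving a browser."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, tuple]] = []

    def click_id(self, element_id: str) -> None:
        self.log.append(("click_id", (element_id,)))

    def fill_id(self, element_id: str, text: str) -> None:
        self.log.append(("fill_id", (element_id, text)))

    def visit_page(self, url: str) -> None:
        self.log.append(("visit_page", (url,)))


def dispatch(controller: StubController, tool_call: Dict[str, Any]) -> None:
    """Route one tool call to the matching controller method."""
    handlers: Dict[str, Callable[[Dict[str, Any]], None]] = {
        "click": lambda a: controller.click_id(a["target_id"]),
        "fill": lambda a: controller.fill_id(a["target_id"], a["text"]),
        "navigate": lambda a: controller.visit_page(a["url"]),
    }
    name = tool_call["name"]
    if name not in handlers:
        # Fail loudly rather than silently ignoring an unexpected tool name.
        raise ValueError(f"unknown tool: {name}")
    handlers[name](tool_call["arguments"])


if __name__ == "__main__":
    controller = StubController()
    dispatch(controller, {"name": "navigate", "arguments": {"url": "https://example.com"}})
    dispatch(controller, {"name": "fill", "arguments": {"target_id": "7", "text": "hello"}})
    print(controller.log)
```

Rejecting unknown tool names explicitly keeps a malformed model output from turning into an unintended browser action.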

Security Considerations

The agent interacts with a digital world designed for humans. It may attempt risky actions and is vulnerable to prompt injection from the pages it visits. Always monitor it and run it in a controlled environment.
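
As one example of what a controlled environment might include, the sketch below checks a URL against a host allowlist before any navigation is attempted. This is an illustrative guard, not a feature of the agent: ALLOWED_HOSTS and is_allowed are hypothetical names, and a real deployment would likely combine such a check with sandboxing and human review.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts your task actually needs.
ALLOWED_HOSTS = frozenset({"example.com", "docs.example.com"})


def is_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs whose host is exactly on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"}:
        return False  # blocks file:, javascript:, data:, and similar schemes
    host = (parsed.hostname or "").lower()
    # Exact match, so "example.com.evil.io" does not pass as "example.com".
    return host in allowed_hosts


if __name__ == "__main__":
    for candidate in (
        "https://example.com/page",
        "https://example.com.evil.io/",
        "file:///etc/passwd",
    ):
        print(candidate, is_allowed(candidate))
```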
