Merge 99ad7f455a into 4de0bae45a

2026-01-30 13:06:43 +08:00 · 4d31387bef
commit 4d31387bef
parent 4de0bae45a 99ad7f455a
2 changed files with 136 additions and 0 deletions
--- a/docs/BROWSER_ARCHITECTURE.md
+++ b/docs/BROWSER_ARCHITECTURE.md
@@ -0,0 +1,104 @@
+# Clawdbot Browser Architecture
+
+This document outlines the architecture of the Clawdbot browser automation system and shows how agent commands are translated into browser actions through a client-server model built on Playwright and the Chrome DevTools Protocol (CDP).
+
+## Workflow Diagram
+
+The following sequence diagram illustrates the end-to-end flow from an agent command to a browser action.
+
+```mermaid
+sequenceDiagram
+    participant Agent
+    participant Client as BrowserClient (HTTP)
+    participant Server as BrowserServer (Express)
+    participant Context as ProfileContext
+    participant PW as PlaywrightSession
+    participant Chrome as Chrome Instance
+
+    Note over Agent, Chrome: Opening a new tab (Example)
+
+    Agent->>Client: browser.open(url)
+    Client->>Server: POST /tabs/open?profile=clawd
+    
+    Server->>Context: ctx.forProfile("clawd").openTab(url)
+    
+    alt Remote/Persistent Profile
+        Context->>PW: createPageViaPlaywright(cdpUrl, url)
+        PW->>Chrome: chromium.connectOverCDP()
+        PW->>Chrome: context.newPage() & page.goto()
+        Chrome-->>PW: Target Created
+        PW-->>Context: { targetId, title, url }
+    else Local/Direct CDP
+        Context->>Chrome: PUT /json/new?url=...
+        Chrome-->>Context: { id: targetId }
+    end
+
+    Context->>Context: Update lastTargetId
+    Context-->>Server: BrowserTab
+    Server-->>Client: 200 OK (BrowserTab)
+    Client-->>Agent: Result { targetId: "..." }
+
+    Note over Agent, Chrome: Taking a snapshot (Example)
+
+    Agent->>Client: browser.snapshot(targetId)
+    Client->>Server: GET /snapshot?targetId=...
+    Server->>Context: ctx.forProfile("clawd").ensureTabAvailable(targetId)
+    Context-->>Server: BrowserTab
+    Server->>PW: getPageForTargetId(cdpUrl, targetId)
+    PW->>Chrome: Attach via CDP / Match URL
+    PW-->>Server: Playwright Page Object
+    Server->>PW: page.accessibility.snapshot() / page.content()
+    PW-->>Server: Snapshot Data
+    Server-->>Client: SnapshotResult
+    Client-->>Agent: Result
+```
+
+## Component Analysis
+
+The architecture is split into four layers (client, server, context, and automation), which separates the agent interface, the control plane, per-profile state, and the low-level automation.
+
+### 1. Client Layer (`src/browser/client.ts`)
+The **Client** provides a strongly typed abstraction for agents to interact with the browser. It handles:
+-   **HTTP Communication**: Sends requests to the Browser Server (e.g., `/start`, `/snapshot`, `/tabs`).
+-   **Query Construction**: Serializes options like `profile`, `targetId`, and `format` into URL parameters.
+-   **Type Safety**: Ensures responses match expected interfaces (`BrowserTab`, `SnapshotResult`).
+
+### 2. Server Layer (`src/browser/server.ts`)
+The **Server** is the control plane. It is an Express-based HTTP server that handles:
+-   **Lifecycle Management**: Starts and stops the browser automation service.
+-   **State Management**: Maintains a global `BrowserServerState` containing active profiles and configuration.
+-   **Routing**: Delegates incoming HTTP requests to the appropriate handlers in `src/browser/routes/`.
+
+### 3. Context & Profiles (`src/browser/server-context.ts`)
+The **Context** layer manages the state and logic for individual browser profiles (e.g., "clawd" for automation, "extension" for user-driven sessions).
+-   **Profile Isolation**: Each profile (e.g., `clawd`, `extension`, or a custom profile) has its own `ProfileContext`.
+-   **Target Resolution**: Manages `lastTargetId` to provide continuity when an agent doesn't specify a target.
+-   **Abstraction**: Hides the difference between local loopback CDP interactions (via raw HTTP) and remote/persistent connections (via Playwright).
+
+### 4. Automation Layer (`src/browser/pw-session.ts`)
+The **Automation** layer (Playwright Session) manages the direct connection to the browser instance.
+-   **CDP Connection**: Uses `chromium.connectOverCDP` to maintain a persistent WebSocket connection to Chrome.
+-   **Page Resolution**: The critical `getPageForTargetId` function maps a CDP `targetId` to a Playwright `Page` object, enabling the use of Playwright's rich API (locators, snapshots) on specific tabs.
+-   **State Tracking**: Maintains `PageState` (console logs, network requests, errors) in WeakMaps keyed by Playwright `Page` objects.
+-   **Resilience**: Includes fallbacks for finding pages by URL when CDP attachment fails (common with extension-based relays).
+
+## Key Concepts
+
+### Target IDs
+A `targetId` is the unique identifier for a specific browser tab or window (CDP Target).
+-   **Origin**: Generated by Chrome/CDP.
+-   **Usage**: Agents use this ID to direct actions (click, type, snapshot) to a specific page.
+-   **Flow**: The Server returns a `targetId` when a tab is created or listed. The Agent should pass this ID in subsequent actions; if it is omitted, the system falls back to `lastTargetId`.
+
+### Profiles
+Clawdbot supports multiple isolated browser profiles:
+-   **clawd**: A fully automated, headless (or headed) Chrome instance managed by the bot.
+-   **extension**: A relay mode that connects to a user's existing Chrome instance via the Clawdbot Browser Extension.
+-   **custom**: User-defined profiles with specific CDP endpoints.
+
+### Error Handling
+Errors are propagated up the stack:
+1.  **Playwright/CDP**: Low-level connection or timeout errors are caught in `pw-session.ts`.
+2.  **Context**: Logic errors (e.g., "tab not found", "ambiguous target") are handled in `server-context.ts`.
+3.  **Server**: Maps exceptions to HTTP status codes (404 for missing tabs, 409 for ambiguous requests).
+4.  **Client**: Throws typed errors that the Agent can catch and handle (e.g., retrying a snapshot).
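
The target fallback and error mapping described in `docs/BROWSER_ARCHITECTURE.md` can be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `ProfileState`, `contextError`, `resolveTargetId`, and `toHttpStatus` are hypothetical names, and the rule that several open tabs with no usable `lastTargetId` counts as "ambiguous" is an assumption. Only the `lastTargetId` fallback and the 404/409 mapping come from the document.

```typescript
// Hypothetical sketch; real types in server-context.ts may differ.
type ContextErrorCode = "tab_not_found" | "ambiguous_target";

interface ContextError extends Error {
  code: ContextErrorCode;
}

// Build an Error tagged with a machine-readable code.
function contextError(code: ContextErrorCode, message: string): ContextError {
  const err = new Error(message) as ContextError;
  err.code = code;
  return err;
}

function isContextError(err: unknown): err is ContextError {
  return err instanceof Error && "code" in err;
}

interface ProfileState {
  tabs: string[];        // CDP target IDs currently known for the profile
  lastTargetId?: string; // most recently used target
}

// Pick the target for an action: explicit ID first, then lastTargetId.
function resolveTargetId(state: ProfileState, requested?: string): string {
  if (requested !== undefined) {
    if (state.tabs.indexOf(requested) === -1) {
      throw contextError("tab_not_found", "tab not found");
    }
    state.lastTargetId = requested;
    return requested;
  }
  const last = state.lastTargetId;
  if (last !== undefined && state.tabs.indexOf(last) !== -1) {
    return last;
  }
  // Assumption: a single open tab is unambiguous; several are not.
  const only = state.tabs[0];
  if (only === undefined) {
    throw contextError("tab_not_found", "tab not found");
  }
  if (state.tabs.length > 1) {
    throw contextError("ambiguous_target", "ambiguous target");
  }
  state.lastTargetId = only;
  return only;
}

// Map context-level errors to the HTTP status codes named in the document.
function toHttpStatus(err: unknown): number {
  if (isContextError(err)) {
    if (err.code === "tab_not_found") return 404;
    if (err.code === "ambiguous_target") return 409;
  }
  return 500; // assumption: anything else is treated as an internal error
}
```

Tagging errors with a `code` field rather than relying on subclass `instanceof` checks keeps the mapping stable across compilation targets. The same code could be serialized into the HTTP response body so that the client can rebuild a typed error.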
--- a/docs/architecture/browser-automation.md
+++ b/docs/architecture/browser-automation.md
@@ -0,0 +1,32 @@
+# Clawdbot Browser Automation Architecture
+
+## 1. Libraries Used
+*   **Playwright (`playwright-core`)**: The primary engine for browser automation. Used for session management, page interactions (click, type, scroll), and capturing snapshots.
+*   **Chromium**: The underlying browser instance managed by Playwright.
+*   **Chrome DevTools Protocol (CDP)**: Used internally by Playwright and for specific low-level control where needed.
+*   **Express**: Hosts the "Browser Bridge" server (`src/browser/bridge-server.ts`), exposing a REST API for the agent to control the browser instance.
+
+## 2. Command Construction & Execution
+The AI controls the browser through a structured tool definition and dispatch pipeline:
+
+1.  **Tool Definition**: The `browser` tool is defined in `src/agents/tools/browser-tool.schema.ts` (using TypeBox). It exposes high-level actions like `navigate`, `act` (click/type/etc.), `snapshot`, and `screenshot`.
+2.  **Tool Invocation**: The AI calls the tool with specific parameters (e.g., `action="act"`, `request={ kind: "click", ref: "42" }`).
+3.  **Client Dispatch**:
+    *   The tool implementation (`src/agents/tools/browser-tool.ts`) handles the request.
+    *   It determines the target: **Host** (local) or **Node** (remote/sandbox).
+    *   For **local execution**, it calls client functions in `src/browser/client-actions-core.ts`.
+4.  **Bridge Request**: The client sends an HTTP POST request to the local Bridge Server (e.g., `POST /act`).
+5.  **Execution**: The Bridge Server receives the request and executes the corresponding Playwright command (e.g., `page.click()`) on the active browser page.
+
+## 3. "Device Nodes" Architecture
+Clawdbot uses a distributed architecture to control browsers across different environments (local machine, Docker sandbox, or remote devices).
+
+*   **Node Registry**: `src/gateway/node-registry.ts` tracks connected nodes and their capabilities (e.g., `caps: ["browser"]`).
+*   **Proxying**:
+    *   When the AI targets a remote node (e.g., `target="node"`), the tool uses `callBrowserProxy`.
+    *   This sends a `node.invoke` event to the Gateway with the command `browser.proxy`.
+    *   The Gateway forwards this event to the target node via WebSocket.
+*   **Node Execution**:
+    *   The target node (running `src/node-host/runner.ts`) receives the `node.invoke` event.
+    *   It handles the `browser.proxy` command by dispatching it to its *own* local browser control service.
+    *   This effectively allows any running Clawdbot instance to act as a remote browser driver for the main agent.
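
The proxying path described in `docs/architecture/browser-automation.md` can be sketched in TypeScript as below. It is an in-memory illustration under stated assumptions, not the project's code: `NodeInvoke`, `ConnectedNode`, `NodeRegistry`, `forwardToNode`, `makeNodeHost`, and the `params` shape are all hypothetical, and the `send` method stands in for the WebSocket link. Only the `node.invoke` event, the `browser.proxy` command, and the `caps: ["browser"]` capability come from the document.

```typescript
// Hypothetical message shape for a proxied browser call.
interface NodeInvoke {
  event: "node.invoke";
  command: string; // e.g. "browser.proxy"
  params: { method: string; path: string; body?: unknown };
}

interface ConnectedNode {
  id: string;
  caps: string[];
  send(msg: NodeInvoke): Promise<unknown>; // stands in for the WebSocket
}

// Minimal registry of connected nodes, keyed by node ID.
class NodeRegistry {
  private nodes: { [id: string]: ConnectedNode } = {};

  register(node: ConnectedNode): void {
    this.nodes[node.id] = node;
  }

  get(id: string): ConnectedNode | undefined {
    return this.nodes[id];
  }
}

// Gateway side: forward the invocation to a node that advertises "browser".
async function forwardToNode(
  registry: NodeRegistry,
  nodeId: string,
  msg: NodeInvoke,
): Promise<unknown> {
  const node = registry.get(nodeId);
  if (node === undefined) {
    throw new Error("unknown node: " + nodeId);
  }
  if (node.caps.indexOf("browser") === -1) {
    throw new Error("node lacks browser capability: " + nodeId);
  }
  return node.send(msg);
}

type LocalBrowser = (method: string, path: string, body?: unknown) => Promise<unknown>;

// Node side: hand browser.proxy calls to the node's own local browser service.
function makeNodeHost(localBrowser: LocalBrowser): (msg: NodeInvoke) => Promise<unknown> {
  return async (msg: NodeInvoke): Promise<unknown> => {
    if (msg.command !== "browser.proxy") {
      throw new Error("unsupported command: " + msg.command);
    }
    return localBrowser(msg.params.method, msg.params.path, msg.params.body);
  };
}
```

Checking capabilities at the gateway means a request for a node without browser support fails before anything is sent to that node.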