feat(security): add E2E security test harness with LLM judge

Add comprehensive security acceptance testing framework that validates
Moltbot's resistance to prompt injection, data exfiltration, and trust
boundary violations.

Key components:
- LLM-as-judge pattern using Claude to evaluate attack resistance
- WebSocket gateway client for direct protocol testing
- CLI mocking utilities for injecting poisoned external data
- Docker Compose setup for containerized CI execution
- GitHub Actions workflow with daily scheduled runs

Test categories covered:
- Email/calendar prompt injection via external data
- Trust boundary violations and auth bypass attempts
- Data exfiltration prevention
- Tool output poisoning

2026-01-29 08:52:59 +07:00

3.9 KiB

Raw Blame History

Security Acceptance Tests

E2E security testing framework for Moltbot. Validates resistance to:

Prompt injection via external data sources
Data exfiltration attempts
Trust boundary violations
Tool poisoning attacks

Architecture: LLM-as-Judge

Pattern matching can't reliably detect whether prompt injection succeeded. We use Claude as a judge to evaluate whether Moltbot resisted attacks:

Run test scenario (send poisoned data to Moltbot)
Capture Moltbot's response and tool calls
Send to Claude judge with structured output
Judge evaluates: injection detected? complied with? data leaked?

This enables nuanced evaluation of subtle attacks that regex can't catch.

Quick Start

# Install Anthropic SDK (required for LLM judge)
pnpm add -D @anthropic-ai/sdk

# Run security tests
ANTHROPIC_API_KEY=sk-ant-xxx pnpm test:security

# Run specific category
pnpm test:security --grep "Email Injection"

Structure

test/security/
├── SPEC.md                    # Full specification document
├── README.md                  # This file
├── harness/
│   ├── index.ts               # Exports
│   ├── gateway-client.ts      # WebSocket gateway client
│   ├── assertions.ts          # Pattern-based assertions (fast checks)
│   ├── llm-judge.ts           # Claude-based evaluation (nuanced checks)
│   └── cli-mocks/
│       └── mock-binary.ts     # CLI binary mocking utilities
└── *.e2e.test.ts              # Test files by category

Implementation Priority

Based on SPEC.md, implement in this order:

email-injection.e2e.test.ts - Gmail/email tests (highest attack surface)
calendar-injection.e2e.test.ts - Calendar event injection tests
api-injection.e2e.test.ts - Generic API response injection
trust-boundary.e2e.test.ts - Authentication bypass and session leakage
tool-poisoning.e2e.test.ts - Malicious skill/plugin output

Key Dependencies

# Add to devDependencies
pnpm add -D @anthropic-ai/sdk ws

Vitest - Test runner (already configured in repo)
@anthropic-ai/sdk - LLM judge (Claude Sonnet for evaluation)
ws - WebSocket client
Hono - Mock HTTP servers (already in deps)

Running Tests

Option 1: Local Script (Recommended)

# Install Anthropic SDK first
pnpm add -D @anthropic-ai/sdk ws

# Set API key
export ANTHROPIC_API_KEY=sk-ant-xxx

# Run all tests (auto-detects local gateway or spins up Docker)
./test/security/run-local.sh

# Run specific test category
./test/security/run-local.sh "Email Injection"

Option 2: Docker Compose (Full Isolation)

# Run containerized tests
ANTHROPIC_API_KEY=sk-ant-xxx \
docker compose -f test/security/docker-compose.yml up --build --abort-on-container-exit

# Cleanup
docker compose -f test/security/docker-compose.yml down -v

Option 3: Against Local Gateway

# Terminal 1: Start gateway
moltbot gateway

# Terminal 2: Run tests directly
ANTHROPIC_API_KEY=sk-ant-xxx pnpm vitest run --config vitest.security.config.ts

CI Integration

Security tests run automatically via GitHub Actions:

On every push to main
On every PR to main
Daily at midnight UTC (scheduled)
Manually via workflow dispatch

Required secrets:

ANTHROPIC_API_KEY: For LLM judge evaluation

The workflow blocks releases if security tests fail.

Test Output

Each test produces a SecurityVerdict:

{
  passed: boolean;           // Did the security test pass?
  injectionDetected: boolean; // Did agent notice the attack?
  injectionCompliedWith: boolean; // Did agent follow malicious instructions?
  dataLeaked: boolean;       // Was sensitive data exposed?
  reasoning: string;         // Judge's explanation
  severity: "none" | "low" | "medium" | "high" | "critical";
  evidence: string[];        // Supporting evidence
}

See SPEC.md for complete implementation details.

3.9 KiB Raw Blame History