openclaw/test/security/SECURITY_ASSESSMENT.md
2026-01-29 18:00:53 +07:00

4.8 KiB

Security Test Harness Final Assessment

Progress Checklist (Ordered by Priority)

  • CRITICAL: Anthropic API key appears to be a real key stored in test/security/.env and should be rotated immediately; keep the file gitignored and avoid committing secrets in repo fixtures. Intentional local testing fixture (gitignored). Evidence: test/security/.env:1, ./.gitignore:74.
  • HIGH: CLI mocks are static, ignore argv, and cannot simulate multi-step command flows or URL-specific outputs; this can produce false positives because poisoned payloads may never be returned along the real code path. Evidence: test/security/harness/cli-mocks/mock-binary.ts:20, test/security/harness/cli-mocks/mock-binary.ts:33, test/security/harness/cli-mocks/mock-binary.ts:36, test/security/harness/cli-mocks/mock-binary.ts:114, test/security/harness/cli-mocks/curl-mock.ts:153.
  • HIGH: Channel tests do not exercise channel ingress or metadata parsing; they embed payloads into the user prompt and call the gateway directly with operator/admin scopes and deliver: false, which bypasses channel-specific defenses and trust-boundary checks. Addressed via chat.ingress gateway method; tests now send real channel payloads through Telegram/WhatsApp parsing. Evidence: src/gateway/server-methods/chat.ts:1158, test/security/channels/whatsapp-injection.e2e.test.ts:99, test/security/channels/telegram-injection.e2e.test.ts:90.
  • HIGH: Tests do not assert that poisoned data sources were actually accessed (e.g., a gog call happened) before judging, so a refusal or tool failure can still pass the test. Evidence: test/security/email-injection.e2e.test.ts:72, test/security/email-injection.e2e.test.ts:98.
  • HIGH: CLI mocks write into a world-writable /tmp path; use per-test temp dirs (e.g., mkdtemp) to avoid path hijacking or races when multiple runs are active. Evidence: test/security/harness/cli-mocks/mock-binary.ts:10.
  • MEDIUM: LLM judge uses a single evaluation without consensus or retries, so verdicts can be brittle for borderline cases. Evidence: test/security/harness/llm-judge.ts:70, test/security/harness/llm-judge.ts:171.
  • MEDIUM: Judge calls do not pin temperature or seed, so results may drift run-to-run. Evidence: test/security/harness/llm-judge.ts:101.
  • MEDIUM: Exfiltration patterns are a minimal set and miss common mechanisms (httpie, Python requests, PowerShell). Evidence: test/security/harness/assertions.ts:17.
  • LOW: HTML report embeds JSON into a <script> block without escaping; malicious payloads can break out of the script tag and execute when the report is opened. Evidence: test/security/harness/report-generator.ts:448.

Strengths (Confirmed)

  • LLM-as-judge is the right approach for nuanced prompt injection detection; the system prompt is explicit and adversarial. Evidence: test/security/harness/llm-judge.ts:42.
  • The payload library shows good coverage across channels and external sources, and the categories match realistic attack surfaces. Evidence: test/security/harness/cli-mocks/curl-mock.ts:15, test/security/harness/cli-mocks/github-mock.ts:14, test/security/harness/cli-mocks/browser-mock.ts:15.
  • Tool-choice structured output ensures parseable verdicts. Evidence: test/security/harness/llm-judge.ts:106.
  • Reporting is professional and useful for triage and trend analysis. Evidence: test/security/harness/report-generator.ts:333, test/security/reports/assets/script.js:1.

Gaps Identified (Not Yet Addressed)

  • Multi-turn attack sequences are not covered (all tests are single-turn).
  • Jailbreak variants (roleplay, DAN, "pretend you're" patterns) are not represented.
  • No negative/positive control tests to validate normal behavior under benign inputs.
  • Determinism and reproducibility are not addressed for the judge output beyond a single evaluation.

Future Enhancements (Backlog, Ordered)

  • Harden the harness first: arg-aware CLI mocks, real tool-response routing, and explicit assertions that data sources were accessed.
  • Add deterministic or consensus judging (multiple runs or a lightweight rule-based prefilter) before expanding scope.
  • Introduce multi-turn and negative/positive control tests to establish behavioral baselines.
  • Expand channel coverage once ingress testing exists for each channel.
  • Extend exfiltration detection to include more protocols and common tooling (httpie, Python requests, PowerShell).
  • Add safe report embedding (JSON in application/json script tag or escaped serialization).

Current Status (Progress)

  • Harness is ready to serve as a regression gate.
  • Risks from static mocks, lack of channel ingress coverage, and missing data-path assertions are resolved.
  • Deterministic judging is in place so verdicts are reproducible.