openclaw/test/security/SECURITY_ASSESSMENT.md

# Security Test Harness Final Assessment

## Progress Checklist (Ordered by Priority)
- [x] ~~CRITICAL: Anthropic API key appears to be a real key stored in `test/security/.env` and should be rotated immediately; keep the file gitignored and avoid committing secrets in repo fixtures.~~ Intentional local testing fixture (gitignored). Evidence: `test/security/.env:1`, `./.gitignore:74`.
- [x] HIGH: CLI mocks are static, ignore argv, and cannot simulate multi-step command flows or URL-specific outputs; this can produce false positives because poisoned payloads may never be returned along the real code path. Evidence: `test/security/harness/cli-mocks/mock-binary.ts:20`, `test/security/harness/cli-mocks/mock-binary.ts:33`, `test/security/harness/cli-mocks/mock-binary.ts:36`, `test/security/harness/cli-mocks/mock-binary.ts:114`, `test/security/harness/cli-mocks/curl-mock.ts:153`.
- [x] ~~HIGH: Channel tests do not exercise channel ingress or metadata parsing; they embed payloads into the user prompt and call the gateway directly with operator/admin scopes and `deliver: false`, which bypasses channel-specific defenses and trust-boundary checks.~~ Addressed via `chat.ingress` gateway method; tests now send real channel payloads through Telegram/WhatsApp parsing. Evidence: `src/gateway/server-methods/chat.ts:1158`, `test/security/channels/whatsapp-injection.e2e.test.ts:99`, `test/security/channels/telegram-injection.e2e.test.ts:90`.
- [ ] HIGH: Tests do not assert that poisoned data sources were actually accessed (e.g., a `gog` call happened) before judging, so a refusal or tool failure can still pass the test. Evidence: `test/security/email-injection.e2e.test.ts:72`, `test/security/email-injection.e2e.test.ts:98`.
- [ ] HIGH: CLI mocks write into a world-writable `/tmp` path; use per-test temp dirs (e.g., `mkdtemp`) to avoid path hijacking or races when multiple runs are active. Evidence: `test/security/harness/cli-mocks/mock-binary.ts:10`.
- [ ] MEDIUM: LLM judge uses a single evaluation without consensus or retries, so verdicts can be brittle for borderline cases. Evidence: `test/security/harness/llm-judge.ts:70`, `test/security/harness/llm-judge.ts:171`.
- [ ] MEDIUM: Judge calls do not pin temperature or seed, so results may drift run-to-run. Evidence: `test/security/harness/llm-judge.ts:101`.
- [ ] MEDIUM: Exfiltration patterns are a minimal set and miss common mechanisms (httpie, Python requests, PowerShell). Evidence: `test/security/harness/assertions.ts:17`.
- [ ] LOW: HTML report embeds JSON into a `<script>` block without escaping; malicious payloads can break out of the script tag and execute when the report is opened. Evidence: `test/security/harness/report-generator.ts:448`.

## Strengths (Confirmed)

- [x] LLM-as-judge is the right approach for nuanced prompt injection detection; the system prompt is explicit and adversarial. Evidence: `test/security/harness/llm-judge.ts:42`.
- [x] The payload library shows good coverage across channels and external sources, and the categories match realistic attack surfaces. Evidence: `test/security/harness/cli-mocks/curl-mock.ts:15`, `test/security/harness/cli-mocks/github-mock.ts:14`, `test/security/harness/cli-mocks/browser-mock.ts:15`.
- [x] Tool-choice structured output ensures parseable verdicts. Evidence: `test/security/harness/llm-judge.ts:106`.
- [x] Reporting is professional and useful for triage and trend analysis. Evidence: `test/security/harness/report-generator.ts:333`, `test/security/reports/assets/script.js:1`.

## Gaps Identified (Not Yet Addressed)

- [ ] Multi-turn attack sequences are not covered (all tests are single-turn).
- [ ] Jailbreak variants (roleplay, DAN, "pretend you're" patterns) are not represented.
- [ ] No negative/positive control tests to validate normal behavior under benign inputs.
- [ ] Determinism and reproducibility are not addressed for the judge output beyond a single evaluation.

## Future Enhancements (Backlog, Ordered)

- [ ] Harden the harness first: arg-aware CLI mocks, real tool-response routing, and explicit assertions that data sources were accessed.
- [ ] Add deterministic or consensus judging (multiple runs or a lightweight rule-based prefilter) before expanding scope.
- [ ] Introduce multi-turn and negative/positive control tests to establish behavioral baselines.
- [ ] Expand channel coverage once ingress testing exists for each channel.
- [ ] Extend exfiltration detection to include more protocols and common tooling (httpie, Python requests, PowerShell).
- [ ] Add safe report embedding (JSON in `application/json` script tag or escaped serialization).

## Current Status (Progress)

- [ ] Harness is ready to serve as a regression gate.
- [ ] Risks from static mocks, lack of channel ingress coverage, and missing data-path assertions are resolved.
- [ ] Deterministic judging is in place so verdicts are reproducible.