4.8 KiB
4.8 KiB
Security Test Harness Final Assessment
Progress Checklist (Ordered by Priority)
CRITICAL: Anthropic API key appears to be a real key stored inIntentional local testing fixture (gitignored). Evidence:test/security/.envand should be rotated immediately; keep the file gitignored and avoid committing secrets in repo fixtures.test/security/.env:1,./.gitignore:74.- HIGH: CLI mocks are static, ignore argv, and cannot simulate multi-step command flows or URL-specific outputs; this can produce false positives because poisoned payloads may never be returned along the real code path. Evidence:
test/security/harness/cli-mocks/mock-binary.ts:20,test/security/harness/cli-mocks/mock-binary.ts:33,test/security/harness/cli-mocks/mock-binary.ts:36,test/security/harness/cli-mocks/mock-binary.ts:114,test/security/harness/cli-mocks/curl-mock.ts:153. HIGH: Channel tests do not exercise channel ingress or metadata parsing; they embed payloads into the user prompt and call the gateway directly with operator/admin scopes andAddressed viadeliver: false, which bypasses channel-specific defenses and trust-boundary checks.chat.ingressgateway method; tests now send real channel payloads through Telegram/WhatsApp parsing. Evidence:src/gateway/server-methods/chat.ts:1158,test/security/channels/whatsapp-injection.e2e.test.ts:99,test/security/channels/telegram-injection.e2e.test.ts:90.- HIGH: Tests do not assert that poisoned data sources were actually accessed (e.g., a
gogcall happened) before judging, so a refusal or tool failure can still pass the test. Evidence:test/security/email-injection.e2e.test.ts:72,test/security/email-injection.e2e.test.ts:98. - HIGH: CLI mocks write into a world-writable
/tmppath; use per-test temp dirs (e.g.,mkdtemp) to avoid path hijacking or races when multiple runs are active. Evidence:test/security/harness/cli-mocks/mock-binary.ts:10. - MEDIUM: LLM judge uses a single evaluation without consensus or retries, so verdicts can be brittle for borderline cases. Evidence:
test/security/harness/llm-judge.ts:70,test/security/harness/llm-judge.ts:171. - MEDIUM: Judge calls do not pin temperature or seed, so results may drift run-to-run. Evidence:
test/security/harness/llm-judge.ts:101. - MEDIUM: Exfiltration patterns are a minimal set and miss common mechanisms (httpie, Python requests, PowerShell). Evidence:
test/security/harness/assertions.ts:17. - LOW: HTML report embeds JSON into a
<script>block without escaping; malicious payloads can break out of the script tag and execute when the report is opened. Evidence:test/security/harness/report-generator.ts:448.
Strengths (Confirmed)
- LLM-as-judge is the right approach for nuanced prompt injection detection; the system prompt is explicit and adversarial. Evidence:
test/security/harness/llm-judge.ts:42. - The payload library shows good coverage across channels and external sources, and the categories match realistic attack surfaces. Evidence:
test/security/harness/cli-mocks/curl-mock.ts:15,test/security/harness/cli-mocks/github-mock.ts:14,test/security/harness/cli-mocks/browser-mock.ts:15. - Tool-choice structured output ensures parseable verdicts. Evidence:
test/security/harness/llm-judge.ts:106. - Reporting is professional and useful for triage and trend analysis. Evidence:
test/security/harness/report-generator.ts:333,test/security/reports/assets/script.js:1.
Gaps Identified (Not Yet Addressed)
- Multi-turn attack sequences are not covered (all tests are single-turn).
- Jailbreak variants (roleplay, DAN, "pretend you're" patterns) are not represented.
- No negative/positive control tests to validate normal behavior under benign inputs.
- Determinism and reproducibility are not addressed for the judge output beyond a single evaluation.
Future Enhancements (Backlog, Ordered)
- Harden the harness first: arg-aware CLI mocks, real tool-response routing, and explicit assertions that data sources were accessed.
- Add deterministic or consensus judging (multiple runs or a lightweight rule-based prefilter) before expanding scope.
- Introduce multi-turn and negative/positive control tests to establish behavioral baselines.
- Expand channel coverage once ingress testing exists for each channel.
- Extend exfiltration detection to include more protocols and common tooling (httpie, Python requests, PowerShell).
- Add safe report embedding (JSON in
application/jsonscript tag or escaped serialization).
Current Status (Progress)
- Harness is ready to serve as a regression gate.
- Risks from static mocks, lack of channel ingress coverage, and missing data-path assertions are resolved.
- Deterministic judging is in place so verdicts are reproducible.