# Security Test Harness Final Assessment

## Progress Checklist (Ordered by Priority)

- [x] ~~CRITICAL: Anthropic API key appears to be a real key stored in `test/security/.env` and should be rotated immediately; keep the file gitignored and avoid committing secrets in repo fixtures.~~ Intentional local testing fixture (gitignored). Evidence: `test/security/.env:1`, `./.gitignore:74`.
- [x] HIGH: CLI mocks are static, ignore argv, and cannot simulate multi-step command flows or URL-specific outputs; this can produce false positives because poisoned payloads may never be returned along the real code path. Evidence: `test/security/harness/cli-mocks/mock-binary.ts:20`, `test/security/harness/cli-mocks/mock-binary.ts:33`, `test/security/harness/cli-mocks/mock-binary.ts:36`, `test/security/harness/cli-mocks/mock-binary.ts:114`, `test/security/harness/cli-mocks/curl-mock.ts:153`.
- [x] ~~HIGH: Channel tests do not exercise channel ingress or metadata parsing; they embed payloads into the user prompt and call the gateway directly with operator/admin scopes and `deliver: false`, which bypasses channel-specific defenses and trust-boundary checks.~~ Addressed via `chat.ingress` gateway method; tests now send real channel payloads through Telegram/WhatsApp parsing. Evidence: `src/gateway/server-methods/chat.ts:1158`, `test/security/channels/whatsapp-injection.e2e.test.ts:99`, `test/security/channels/telegram-injection.e2e.test.ts:90`.
- [ ] HIGH: Tests do not assert that poisoned data sources were actually accessed (e.g., a `gog` call happened) before judging, so a refusal or tool failure can still pass the test. Evidence: `test/security/email-injection.e2e.test.ts:72`, `test/security/email-injection.e2e.test.ts:98`.
- [ ] HIGH: CLI mocks write into a world-writable `/tmp` path; use per-test temp dirs (e.g., `mkdtemp`) to avoid path hijacking or races when multiple runs are active. Evidence: `test/security/harness/cli-mocks/mock-binary.ts:10`.
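
  A minimal sketch of the per-test temp dir approach suggested in the `mkdtemp` item, using Node's built-in `fs.mkdtempSync`. The helper name `withMockDir` and the `cli-mocks-` prefix are hypothetical and do not come from the harness; the actual layout in `mock-binary.ts` may differ.

  ```typescript
  import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
  import { tmpdir } from "node:os";
  import { join } from "node:path";

  // Hypothetical helper (not part of the harness): runs `run` with a freshly
  // created, uniquely named directory and removes that directory afterwards.
  function withMockDir<T>(run: (dir: string) => T): T {
    // mkdtempSync appends random characters to the prefix, so concurrent
    // runs never share a path the way a fixed /tmp location would.
    const dir = mkdtempSync(join(tmpdir(), "cli-mocks-"));
    try {
      return run(dir);
    } finally {
      // Clean up even if the callback throws.
      rmSync(dir, { recursive: true, force: true });
    }
  }

  // Usage: write a mock binary into the private directory for one test.
  const wrote = withMockDir((dir) => {
    const mockPath = join(dir, "gog"); // "gog" mirrors the CLI named in the checklist
    writeFileSync(mockPath, "#!/bin/sh\necho mock\n", { mode: 0o755 });
    return existsSync(mockPath);
  });
  console.log(wrote); // true
  ```

  This version is synchronous; a test that awaits inside the callback would need an async variant that awaits `run` before cleaning up.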
- [ ] MEDIUM: LLM judge uses a single evaluation without consensus or retries, so verdicts can be brittle for borderline cases. Evidence: `test/security/harness/llm-judge.ts:70`, `test/security/harness/llm-judge.ts:171`.
- [ ] MEDIUM: Judge calls do not pin temperature or seed, so results may drift run-to-run. Evidence: `test/security/harness/llm-judge.ts:101`.
- [ ] MEDIUM: Exfiltration patterns are a minimal set and miss common mechanisms (httpie, Python requests, PowerShell). Evidence: `test/security/harness/assertions.ts:17`.
- [ ] LOW: HTML report embeds JSON into a `