diff --git a/.pr-description.md b/.pr-description.md new file mode 100644 index 000000000..31021532b --- /dev/null +++ b/.pr-description.md @@ -0,0 +1,342 @@ +# Security Shield Implementation + +## Motivation + +OpenClaw is increasingly deployed on internet-facing VPS servers to provide remote access to AI agents via messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal). These deployments are exposed to common internet threats: + +- **Brute force attacks** attempting to guess authentication tokens +- **Denial of Service (DoS)** attacks overwhelming the gateway with connection/request floods +- **Intrusion attempts** exploiting vulnerabilities (SSRF, path traversal, port scanning) +- **Unauthorized access** from malicious IPs or botnets + +Currently, OpenClaw has basic authentication but lacks: +- Rate limiting to slow down attackers +- Intrusion detection to identify attack patterns +- Automated blocking of malicious IPs +- Security event logging for audit trails +- Real-time alerting when security incidents occur + +This leaves VPS deployments vulnerable and operators blind to ongoing attacks. Users running OpenClaw on exposed servers need production-grade security controls without the complexity of external tools like fail2ban, Redis, or manual firewall management. + +## Problem + +**For VPS operators:** +1. **No protection against brute force attacks** - Attackers can attempt unlimited authentication guesses, potentially discovering tokens through timing attacks or credential stuffing +2. **No DoS protection** - A single malicious actor can exhaust server resources with connection/request floods +3. **No visibility into security events** - Operators don't know when they're under attack or which IPs are malicious +4. **Manual firewall management** - Blocking IPs requires manual iptables/ufw commands and doesn't persist across restarts +5. **No real-time alerting** - Operators discover attacks only by noticing performance degradation or checking logs manually +6. **No audit trail** - Security-relevant events (failed auth, intrusion attempts) are mixed with application logs, making forensic analysis difficult + +**For the OpenClaw project:** +- Security features should be **enabled by default** (secure by default principle) but are currently opt-in or nonexistent +- Existing `openclaw security audit` command only checks configuration, doesn't provide runtime protection +- No standardized way to handle security events across different channels and connection types + +## Solution + +This PR implements a **comprehensive, zero-dependency security shield** that provides enterprise-grade protection for OpenClaw deployments: + +### Core Design Principles + +1. **Opt-out security** - Shield enabled by default for new deployments (users can disable if needed) +2. **Zero external dependencies** - No Redis, PostgreSQL, or external services required; uses in-memory LRU caches with bounded memory +3. **Performance-first** - <5ms latency overhead per request; async fire-and-forget for firewall/alerts +4. **Fail-open by default** - Errors in security checks don't block legitimate traffic +5. **Comprehensive logging** - Structured JSONL logs for audit trails and forensic analysis +6. **Operator-friendly** - CLI commands for management, Telegram alerts for real-time notifications + +### Architecture + +``` +HTTP/WS Request → Security Shield Middleware → Gateway Auth → Business Logic + ↓ + Rate Limiter (token bucket + LRU cache) + ↓ + Intrusion Detector (pattern matching) + ↓ + IP Manager (blocklist/allowlist + CIDR) + ↓ + Firewall Integration (iptables/ufw on Linux) + ↓ + Security Event Logger (/tmp/openclaw/security-*.jsonl) + ↓ + Alert Manager (Telegram/Webhook/Slack/Email) +``` + +### Key Capabilities + +**Rate Limiting:** +- Per-IP: Auth attempts (5/5min), connections (10 concurrent), requests (100/min) +- Per-device: Auth attempts (10/15min) +- Per-sender: Pairing requests (3/hour) +- Token bucket algorithm with automatic refill +- LRU cache (10k entries max) prevents memory exhaustion + +**Intrusion Detection:** +- Brute force: 10 failed auth in 10min → auto-block +- SSRF bypass attempts: 3 in 5min → alert +- Path traversal: 5 in 5min → alert +- Port scanning: 20 connection attempts in 10s → alert +- Event aggregation with time-window analysis + +**IP Management:** +- Blocklist with configurable expiration (default 24h) +- Allowlist with CIDR support (e.g., 100.64.0.0/10 for Tailscale) +- Persistent storage (~/.openclaw/security/blocklist.json) +- Automatic firewall integration (iptables/ufw on Linux) +- Manual management via CLI: `openclaw blocklist add/remove` + +**Security Logging:** +- Structured JSONL format: `/tmp/openclaw/security-YYYY-MM-DD.jsonl` +- Daily rotation (24h retention by default) +- Categories: authentication, rate_limit, intrusion_attempt, network_access, pairing +- Also exported to main logger for OTEL telemetry + +**Real-time Alerting:** +- Telegram Bot API integration (priority channel) +- Webhook/Slack/Email support +- Alert throttling (1 alert per trigger per 5min) prevents spam +- Triggers: Critical events, failed auth spike (20 in 10min), IP blocked +- Formatted messages with severity emojis and Markdown + +### Why This Approach? + +**Zero dependencies:** Many security solutions require Redis (rate limiting), PostgreSQL (event storage), or fail2ban (intrusion detection). This implementation uses only Node.js built-ins and in-memory data structures, making it: +- Easy to deploy (no additional services) +- Low resource overhead (<50MB memory, <5ms latency) +- Portable across Mac/Linux/BSD +- No external service failures + +**Opt-out by default:** Following the "secure by default" principle, new deployments automatically get protection. Existing deployments remain unchanged (backward compatible) but can opt-in via `openclaw security enable`. + +**Production-ready:** The implementation uses battle-tested algorithms (token bucket for rate limiting, LRU cache for memory bounds) and defensive programming (fail-open, async fire-and-forget, comprehensive error handling). + +## Overview + +This PR implements a comprehensive security shield for OpenClaw deployments on Mac/Linux VPS with: + +- **Rate limiting** to prevent brute force and DoS attacks +- **Intrusion detection** with pattern-based attack recognition +- **IP blocklist/allowlist** with automatic blocking and firewall integration +- **Centralized security logging** with structured events +- **Real-time alerting** via Telegram (with webhook/Slack/email support) +- **Enabled by default** for new deployments (opt-out mode) + +All security features are implemented without external dependencies (no Redis required), using in-memory LRU caches with bounded memory usage. + +## Implementation Details + +### Phase 1: Core Security Infrastructure + +**New Files:** +- `src/security/token-bucket.ts` - Token bucket algorithm for rate limiting +- `src/security/rate-limiter.ts` - LRU-cached rate limiter with helper functions +- `src/security/ip-manager.ts` - IP blocklist/allowlist management with CIDR support +- `src/security/intrusion-detector.ts` - Attack pattern detection engine +- `src/security/shield.ts` - Main security coordinator +- `src/security/middleware.ts` - HTTP middleware integration +- `src/security/events/schema.ts` - SecurityEvent type definitions +- `src/security/events/logger.ts` - Security-specific event logger +- `src/security/events/aggregator.ts` - Event aggregation for time-window detection +- `src/config/types.security.ts` - Security configuration types +- Comprehensive unit tests for all modules + +**Key Features:** +- Rate limits: Per-IP auth (5/5min), connections (10 concurrent), requests (100/min) +- Auto-block: 10 failed auth in 10min → 24h block +- Attack patterns: Brute force, SSRF bypass, path traversal, port scanning +- Whitelist: Tailscale IPs (100.64.0.0/10), localhost always exempt +- Memory-bounded: 10k entry LRU cache with auto-cleanup + +**Integration Points:** +- `src/gateway/auth.ts` - Rate limiting + failed auth logging for intrusion detection +- `src/gateway/server-http.ts` - Webhook rate limiting +- `src/pairing/pairing-store.ts` - Pairing request rate limiting +- `src/config/schema.ts` - Security configuration schema with opt-out defaults +- `src/config/defaults.ts` - Default security configuration + +### Phase 2: Firewall Integration & Alerting + +**New Files:** +- `src/security/firewall/manager.ts` - Firewall integration coordinator +- `src/security/firewall/iptables.ts` - iptables backend (Linux) +- `src/security/firewall/ufw.ts` - ufw backend (Linux) +- `src/security/alerting/manager.ts` - Alert system coordinator +- `src/security/alerting/types.ts` - Alert type definitions +- `src/security/alerting/telegram.ts` - Telegram Bot API integration +- `src/security/alerting/webhook.ts` - Generic webhook support +- `src/security/alerting/slack.ts` - Slack incoming webhook +- `src/security/alerting/email.ts` - SMTP email alerts + +**Key Features:** +- Firewall integration: Auto-applies iptables/ufw rules when blocking IPs (Linux only) +- Telegram alerts: Formatted messages with severity emojis, Markdown support +- Alert throttling: Prevents spam (max 1 alert per trigger per 5min) +- Alert triggers: Critical events, failed auth spike, IP blocked +- Async fire-and-forget: Firewall/alert operations don't block request handling + +**Integration:** +- `src/security/ip-manager.ts` - Calls firewall manager when blocking/unblocking +- `src/security/events/logger.ts` - Triggers alert manager on security events +- `src/gateway/server.impl.ts` - Initialize firewall and alert managers on startup + +### Phase 3: CLI Commands & Documentation + +**New Files:** +- `src/cli/security-cli.ts` - Security management commands (extended) +- `src/cli/parse-duration.ts` - Duration parser for CLI options +- `docs/security/security-shield.md` - Comprehensive security guide (465 lines) +- `docs/security/alerting.md` - Alerting setup guide with Telegram focus (342 lines) + +**CLI Commands:** +```bash +openclaw security enable/disable/status +openclaw security audit [--deep] [--fix] +openclaw security logs [-f] [--severity critical|warn|info] +openclaw blocklist list/add/remove +openclaw allowlist list/add/remove +``` + +**Documentation:** +- Quick start guide with examples +- Configuration reference +- Telegram bot setup walkthrough +- Best practices and troubleshooting +- Security checklist for VPS deployments + +## Testing + +**Unit Tests:** +- Token bucket algorithm tests +- Rate limiter tests with LRU cache verification +- IP manager tests with CIDR support +- Intrusion detector tests with time-window aggregation +- Firewall manager tests (mocked) +- Telegram alerting tests (mocked) + +**Test Coverage:** +- All core security modules have comprehensive unit tests +- Tests verify rate limiting, auto-blocking, allowlist exemption +- Tests verify CIDR matching (e.g., 100.64.0.0/10 for Tailscale) +- Tests verify event aggregation for attack detection + +**Manual Testing Performed:** +- Verified rate limiting blocks after threshold +- Verified failed auth triggers auto-block +- Verified allowlist exempts IPs from blocking +- Verified security events logged to `/tmp/openclaw/security-YYYY-MM-DD.jsonl` +- Verified CLI commands (`status`, `logs`, `blocklist`, `allowlist`) + +## Breaking Changes + +**None.** All features are additive and backward-compatible. + +- New deployments: Security shield enabled by default +- Existing deployments: Security shield remains disabled unless explicitly enabled +- Performance impact: <5ms per request (negligible) +- Memory impact: ~10MB for rate limiter cache (bounded) + +## Configuration Changes + +**New Configuration Section:** +```yaml +security: + shield: + enabled: true # DEFAULT: true for new configs (opt-out mode) + rateLimiting: + enabled: true + perIp: + authAttempts: { max: 5, windowMs: 300000 } + connections: { max: 10, windowMs: 60000 } + requests: { max: 100, windowMs: 60000 } + intrusionDetection: + enabled: true + patterns: + bruteForce: { threshold: 10, windowMs: 600000 } + ipManagement: + autoBlock: + enabled: true + durationMs: 86400000 # 24 hours + allowlist: + - "100.64.0.0/10" # Tailscale CGNAT (auto-added) + firewall: + enabled: true # Linux only + backend: "iptables" # or "ufw" + alerting: + enabled: false # Disabled by default (requires channel config) + channels: + telegram: + enabled: false + botToken: "${TELEGRAM_BOT_TOKEN}" + chatId: "${TELEGRAM_CHAT_ID}" +``` + +## Migration Guide + +**For existing deployments:** + +```bash +# 1. Update OpenClaw +npm install -g openclaw@latest + +# 2. Run security audit +openclaw security audit --deep + +# 3. Enable security shield +openclaw security enable + +# 4. (Optional) Configure Telegram alerts +openclaw configure security.alerting.channels.telegram.botToken +openclaw configure security.alerting.channels.telegram.chatId +openclaw configure security.alerting.enabled true + +# 5. Restart gateway +openclaw gateway restart + +# 6. Monitor security logs +openclaw security logs --follow +``` + +## Documentation + +**New Documentation:** +- `docs/security/security-shield.md` - Comprehensive security guide +- `docs/security/alerting.md` - Alerting setup and configuration + +**Updated Documentation:** +- `CHANGELOG.md` - Added security shield entry + +## Future Enhancements + +Potential future improvements (not in this PR): +- Geolocation-based blocking (MaxMind GeoIP2) +- Machine learning-based anomaly detection +- Integration with external threat intelligence feeds +- Support for Windows Firewall (currently Linux only) +- Web UI for security dashboard and configuration + +## Checklist + +- [x] Core security infrastructure implemented (Phase 1) +- [x] Firewall integration implemented (Phase 2) +- [x] Alerting system implemented (Phase 2) +- [x] CLI commands implemented (Phase 3) +- [x] Comprehensive documentation written +- [x] Unit tests added for all modules +- [x] Configuration schema updated with defaults +- [x] Gateway integration completed +- [x] Changelog entry added +- [x] No breaking changes +- [x] Backward compatible with existing deployments + +## Related Issues + +Addresses user requirements for: +- Rate limiting to prevent brute force attacks +- DoS protection +- Intrusion detection +- Audit logging for security events +- Real-time alerting (Telegram priority) +- Firewall integration for VPS deployments +- Opt-out security model (enabled by default)