Docs: Document comprehensive gateway stability infrastructure

Added new section "Gateway Stability Infrastructure" covering: - Multi-layer stability design (system, PM2, startup hooks, health monitoring) - All monitoring commands with examples - Recovery scenarios and automated responses - What problems this prevents This comprehensive infrastructure ensures: - No more crashes from Telegram message processing - Automatic detection and recovery from hangs - Prevention of inotify exhaustion hangs - Memory limit protection - Clean lock file management - Full visibility into gateway health Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-01-29 20:03:31 +00:00 · 2026-01-29 20:03:31 +00:00 · 5900a08626
commit 5900a08626
parent dd7f826d0a
1 changed files with 94 additions and 0 deletions
--- a/README_Tech.md
+++ b/README_Tech.md
@ -265,6 +265,100 @@ pm2 restart moltbot-gateway

 ---

+## Gateway Stability Infrastructure (Jan 29, 2026)
+
+### Multi-Layer Stability Design
+
+**Layer 1: System Level**
+- Inotify watcher limit: 524288 (prevents file monitoring exhaustion)
+  - Config: `/etc/sysctl.d/99-moltbot-inotify.conf`
+  - Verify: `cat /proc/sys/fs/inotify/max_user_watches`
+
+**Layer 2: PM2 Process Management**
+- Automatic restart on crash
+- Memory limit: 500MB (auto-restart if exceeded)
+- Min uptime: 10 seconds (prevents restart storms)
+- Kill timeout: 5 seconds (graceful shutdown before force kill)
+- Config: `/root/moltbot/ecosystem.config.cjs`
+
+**Layer 3: Startup Hooks**
+- `scripts/gateway-start.sh`: Wrapper script that runs on every startup
+  - Automatically cleans stale lock files (`~/.clawdbot/*.lock`)
+  - Prevents "gateway already running" errors
+  - Runs before `node dist/entry.js gateway`
+
+**Layer 4: Health Monitoring**
+- `scripts/pm2-health-monitor.js`: Standalone health check app managed by PM2
+  - Runs every 5 minutes (configurable)
+  - Tests port 18789 connectivity (detects hung processes)
+  - Monitors inotify watcher usage (warns at 80% of limit)
+  - Force-restarts via `killall -9 moltbot` if unresponsive
+  - Logs to `/tmp/moltbot/pm2-health-monitor.log`
+  - Isolated from gateway in same PM2 daemon
+
+### Monitoring Commands
+
+```bash
+# View both gateway and health monitor status
+pm2 list
+
+# View gateway logs (real-time)
+pm2 logs moltbot-gateway
+
+# View health monitor logs
+pm2 logs moltbot-health-monitor
+
+# View last 50 lines of either
+pm2 logs moltbot-gateway -n 50
+pm2 logs moltbot-health-monitor -n 50
+
+# Monitor health checks in real-time
+tail -f /tmp/moltbot/pm2-health-monitor.log
+
+# Force restart gateway
+pm2 restart moltbot-gateway
+
+# Emergency restart (if stuck)
+killall -9 moltbot && pm2 restart moltbot-gateway
+```
+
+### Recovery Scenarios
+
+**Scenario 1: Gateway Becomes Unresponsive (Process Running but Port Hung)**
+- Symptom: `pm2 status` shows `online`, but `nc -zv 127.0.0.1 18789` fails
+- Response: Health monitor detects this within 5 minutes
+- Action: Auto-kills process, PM2 restarts it
+- Result: Bot responds to next Telegram message
+
+**Scenario 2: Lock File Left Behind**
+- Symptom: `Gateway failed to start: gateway already running`
+- Cause: Previous process crashed without cleaning locks
+- Response: `gateway-start.sh` cleans locks on startup
+- Result: Gateway starts cleanly
+
+**Scenario 3: Inotify Exhaustion**
+- Symptom: `Error: ENOSPC: System limit for number of file watchers reached`
+- Cause: Too many config/skill files being watched
+- Response: Health monitor logs warning at 80% threshold
+- Solution: Delete unused skills or increase limit further (requires code review)
+
+**Scenario 4: Memory Exhaustion**
+- Symptom: Process becomes slow/unresponsive, memory climbing
+- Response: PM2 auto-restart when hitting 500MB limit
+- Result: Clean restart, memory reset
+
+### What This Prevents
+
+✅ Telegram messages causing gateway hang
+✅ Stale lock files blocking restarts
+✅ Inotify limit exhaustion going unnoticed
+✅ Memory leaks causing slowness
+✅ Restart storms from `Restart=always`
+✅ Systemd conflicts with PM2
+✅ Lack of visibility into gateway health
+
+---
+
 ## Configuration Summary

 ### Model Fallback Chain