Docs: Document comprehensive gateway stability infrastructure

Added new section "Gateway Stability Infrastructure" covering:
- Multi-layer stability design (system, PM2, startup hooks, health monitoring)
- All monitoring commands with examples
- Recovery scenarios and automated responses
- What problems this prevents

This comprehensive infrastructure ensures:
- No more crashes from Telegram message processing
- Automatic detection and recovery from hangs
- Prevention of inotify exhaustion hangs
- Memory limit protection
- Clean lock file management
- Full visibility into gateway health

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
valtterimelkko 2026-01-29 20:03:31 +00:00
parent dd7f826d0a
commit 5900a08626

@@ -265,6 +265,100 @@ pm2 restart moltbot-gateway
---
## Gateway Stability Infrastructure (Jan 29, 2026)
### Multi-Layer Stability Design
**Layer 1: System Level**
- Inotify watch limit (`fs.inotify.max_user_watches`): 524288 (prevents file-watcher exhaustion)
- Config: `/etc/sysctl.d/99-moltbot-inotify.conf`
- Verify: `cat /proc/sys/fs/inotify/max_user_watches`
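
The verify step can be scripted. The following is a minimal sketch, assuming the drop-in file holds the single line `fs.inotify.max_user_watches=524288` and was loaded with `sudo sysctl --system` (the file's actual contents are not shown here). The function name `inotify_limit_ok` is hypothetical.

```bash
# Assumed contents of /etc/sysctl.d/99-moltbot-inotify.conf:
#   fs.inotify.max_user_watches=524288
# Assumed to be applied with: sudo sysctl --system

# Succeeds when the limit stored in the given file is at least the expected value.
# Both arguments are optional and default to the values documented above.
inotify_limit_ok() {
  local limit_file="${1:-/proc/sys/fs/inotify/max_user_watches}"
  local expected="${2:-524288}"
  local current
  current="$(cat "$limit_file")" || return 2   # file missing or unreadable
  [ "$current" -ge "$expected" ]
}
```

Calling `inotify_limit_ok || echo "inotify limit below 524288"` after a reboot gives a quick confirmation that the sysctl setting persisted.
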
**Layer 2: PM2 Process Management**
- Automatic restart on crash
- Memory limit: 500MB (auto-restart if exceeded)
- Min uptime: 10 seconds (a process that exits sooner counts as an unstable start, which lets PM2 cut off restart storms)
- Kill timeout: 5 seconds (graceful shutdown before force kill)
- Config: `/root/moltbot/ecosystem.config.cjs`
**Layer 3: Startup Hooks**
- `scripts/gateway-start.sh`: Wrapper script that runs on every startup
- Automatically cleans stale lock files (`~/.clawdbot/*.lock`)
- Prevents "gateway already running" errors
- Runs before `node dist/entry.js gateway`
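
The lock cleanup can be sketched as follows. This is a hypothetical illustration, not the contents of the real `scripts/gateway-start.sh`. It assumes that any `*.lock` file present at startup is stale, on the grounds that PM2 only runs the wrapper when no gateway process is alive. The function name `clean_stale_locks` is invented for the example.

```bash
# Remove every *.lock file in the given directory (default: ~/.clawdbot)
# and print how many were removed.
# Assumption: a lock found at startup belongs to a dead process.
clean_stale_locks() {
  local lock_dir="${1:-$HOME/.clawdbot}"
  local removed=0
  local lock
  for lock in "$lock_dir"/*.lock; do
    [ -e "$lock" ] || continue          # glob matched nothing
    rm -f -- "$lock" && removed=$((removed + 1))
  done
  echo "$removed"
}
```

A wrapper built on this would call `clean_stale_locks` and then hand over to the gateway with `exec node dist/entry.js gateway`, so that PM2 supervises the Node process directly rather than the shell.
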
**Layer 4: Health Monitoring**
- `scripts/pm2-health-monitor.js`: Standalone health check app managed by PM2
- Runs every 5 minutes (configurable)
- Tests port 18789 connectivity (detects hung processes)
- Monitors inotify watcher usage (warns at 80% of limit)
- If the port is unresponsive, force-kills the gateway with `killall -9 moltbot` so that PM2 restarts it
- Logs to `/tmp/moltbot/pm2-health-monitor.log`
- Runs as its own PM2 app, separate from the gateway process but under the same PM2 daemon
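
The port check at the core of this layer can be sketched in shell. The real monitor is a Node.js script whose code is not shown here, so this is only an equivalent illustration. The function names are hypothetical, and the 5-second connection timeout is an assumption.

```bash
# Succeeds if something accepts a TCP connection on host:port within 5 seconds.
port_is_open() {
  nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}

# Prints "healthy" or "unresponsive" for the gateway port (default 127.0.0.1:18789).
gateway_health() {
  local host="${1:-127.0.0.1}"
  local port="${2:-18789}"
  if port_is_open "$host" "$port"; then
    echo "healthy"
  else
    echo "unresponsive"
  fi
}

# One monitoring pass: log the result and force-kill the gateway when the
# port is dead, leaving the restart itself to PM2.
run_health_check() {
  local status
  status="$(gateway_health "$@")"
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) gateway=$status"
  if [ "$status" = "unresponsive" ]; then
    killall -9 moltbot
  fi
}
```

Probing the port rather than the process state is what catches the "online but hung" case: PM2 only knows that the process exists, not whether it still accepts connections.
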
### Monitoring Commands
```bash
# View both gateway and health monitor status
pm2 list
# View gateway logs (real-time)
pm2 logs moltbot-gateway
# View health monitor logs
pm2 logs moltbot-health-monitor
# View last 50 lines of either
pm2 logs moltbot-gateway --lines 50
pm2 logs moltbot-health-monitor --lines 50
# Monitor health checks in real-time
tail -f /tmp/moltbot/pm2-health-monitor.log
# Force restart gateway
pm2 restart moltbot-gateway
# Emergency restart (if stuck); `;` ensures the restart runs even if killall finds no process
killall -9 moltbot; pm2 restart moltbot-gateway
```
### Recovery Scenarios
**Scenario 1: Gateway Becomes Unresponsive (Process Running but Port Hung)**
- Symptom: `pm2 status` shows `online`, but `nc -zv 127.0.0.1 18789` fails
- Response: Health monitor detects this within 5 minutes
- Action: Auto-kills process, PM2 restarts it
- Result: Bot responds to next Telegram message
**Scenario 2: Lock File Left Behind**
- Symptom: `Gateway failed to start: gateway already running`
- Cause: Previous process crashed without cleaning locks
- Response: `gateway-start.sh` cleans locks on startup
- Result: Gateway starts cleanly
**Scenario 3: Inotify Exhaustion**
- Symptom: `Error: ENOSPC: System limit for number of file watchers reached`
- Cause: Too many config/skill files being watched
- Response: Health monitor logs warning at 80% threshold
- Solution: Delete unused skills or increase limit further (requires code review)
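
To see how close the system is to the limit before the error appears, the 80% check can be approximated by hand. This is a rough sketch with hypothetical function names. It counts watches across every process whose `/proc` entries are readable, while the kernel limit is applied per user, so the figure is an upper bound when run as root and an undercount otherwise.

```bash
# Count inotify watches held by readable processes.
# Each line starting with "inotify" in /proc/<pid>/fdinfo/<fd> is one watch.
count_inotify_watches() {
  find /proc/[0-9]*/fdinfo -type f -exec cat {} + 2>/dev/null | grep -c '^inotify' || true
}

# Integer percentage of the limit currently in use.
inotify_usage_percent() {
  local used="$1"
  local limit="$2"
  echo $(( $used * 100 / $limit ))
}

# Prints "warn" at or above the threshold (default 80), otherwise "ok".
inotify_status() {
  local used="$1"
  local limit="$2"
  local threshold="${3:-80}"
  if [ "$(inotify_usage_percent "$used" "$limit")" -ge "$threshold" ]; then
    echo "warn"
  else
    echo "ok"
  fi
}
```

A one-off check would be `inotify_status "$(count_inotify_watches)" "$(cat /proc/sys/fs/inotify/max_user_watches)"`.
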
**Scenario 4: Memory Exhaustion**
- Symptom: Process becomes slow/unresponsive, memory climbing
- Response: PM2 restarts the process automatically once it exceeds the 500MB limit
- Result: Clean restart, memory reset
### What This Prevents
✅ Telegram messages causing gateway hangs
✅ Stale lock files blocking restarts
✅ Inotify limit exhaustion going unnoticed
✅ Memory leaks causing slowness
✅ Restart storms from systemd's `Restart=always`
✅ Systemd conflicts with PM2
✅ Lack of visibility into gateway health
---
## Configuration Summary
### Model Fallback Chain