valtterimelkko
|
eec556c71e
|
Fix: Resolve gateway crash loop and inotify exhaustion
Problem: Gateway was stuck in a restart loop (1200+ restarts), which stopped
the Telegram bot from responding. Root cause: the per-user inotify watch
limit was exhausted while monitoring config/skill files.
Solutions implemented:
1. **Inotify limit increase** (/etc/sysctl.d/99-moltbot-inotify.conf)
- Increased fs.inotify.max_user_watches from 65536 to 524288
- Prevents "ENOSPC: System limit for number of file watchers reached"
- Persistent across reboots
2. **Improved systemd service** (/etc/systemd/system/moltbot-gateway.service)
- Changed Restart=always → Restart=on-failure
- Increased RestartSec=5 → RestartSec=10 (reduce CPU churn)
- Reduced StartLimitBurst=10 → StartLimitBurst=5
- Added ExecStartPre to auto-clean stale locks on startup
- Service remains isolated from other services (code-server, ssh, etc.)
3. **Health check automation** (new files)
- scripts/health-check-gateway.sh: detects hang/lock issues, auto-recovers
- /etc/systemd/system/moltbot-health-check.service: runs health checks
- /etc/systemd/system/moltbot-health-check.timer: runs every 5 minutes
- Logs to /tmp/moltbot-health-check.log
4. **Documentation** (README_Tech.md)
- Added section on crash loop root cause and preventative measures
- Added Architecture section documenting service isolation
- Updated troubleshooting with health check steps
- Updated file locations with new monitoring files
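
Item 1 raises `fs.inotify.max_user_watches` to 524288. A minimal sketch of how one might confirm the raised limit is actually in effect after a reboot is shown here. The function name `inotify_limit_ok` and its optional file argument are hypothetical and not part of this commit; only the procfs path and the 524288 value come from the kernel and the change above.

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): succeed if the inotify
# watch limit is at least the required value given as $1.
# $2 optionally overrides the file to read; by default the kernel's
# procfs entry for fs.inotify.max_user_watches is used.
inotify_limit_ok() {
    required=$1
    limit_file=${2:-/proc/sys/fs/inotify/max_user_watches}
    # Return 2 if the limit cannot be read at all (e.g. not a Linux host).
    current=$(cat "$limit_file" 2>/dev/null) || return 2
    # Exit status 0 when the current limit meets the requirement, 1 otherwise.
    [ "$current" -ge "$required" ]
}
```

Usage might look like `inotify_limit_ok 524288 || echo "inotify watch limit too low"`. Running `sysctl fs.inotify.max_user_watches` reports the same value directly.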
Testing: Gateway now starts cleanly, health checks pass, other services
(code-server, ssh) remain unaffected. Timer runs every 5 minutes to detect
and recover from future hangs.
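
The stale-lock cleanup mentioned for `ExecStartPre` and the health check can be illustrated with a rough sketch. This is not the contents of scripts/health-check-gateway.sh. The function names and the lock-file format (a file holding a single PID) are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical sketch: assumes the lock file contains only the owner's PID.
# Returns 0 if the lock exists but its owner is gone, 1 otherwise.
lock_is_stale() {
    lock_file=$1
    [ -f "$lock_file" ] || return 1      # no lock file, so nothing is stale
    pid=$(cat "$lock_file" 2>/dev/null)
    case $pid in
        ''|*[!0-9]*) return 0 ;;         # empty or malformed: treat as stale
    esac
    # kill -0 sends no signal; it only tests whether the process can be
    # signalled. It also fails for a live process owned by another user
    # when run unprivileged, so this check is meant to run as root or as
    # the gateway's own user.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                         # owner is still running
    fi
    return 0
}

# Remove the lock only when it is stale, and report what was done.
clean_stale_lock() {
    if lock_is_stale "$1"; then
        rm -f "$1"
        echo "removed stale lock $1"
    fi
}
```

Checking the owning PID rather than the file's age avoids deleting the lock of a gateway that is slow but still alive.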
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-01-29 18:55:41 +00:00 |
|