Fix: Resolve gateway crash loop and inotify exhaustion

Problem: The gateway was stuck in a restart loop (1200+ restarts), so the
Telegram bot stopped responding. Root cause: the system inotify watch limit
(fs.inotify.max_user_watches) was exhausted while monitoring config/skill files.

Solutions implemented:

1. **Inotify limit increase** (/etc/sysctl.d/99-moltbot-inotify.conf)
   - Increased fs.inotify.max_user_watches from 65536 to 524288
   - Prevents "ENOSPC: System limit for number of file watchers reached"
   - Persistent across reboots

2. **Improved systemd service** (/etc/systemd/system/moltbot-gateway.service)
   - Changed Restart=always → Restart=on-failure
   - Increased RestartSec=5 → RestartSec=10 (reduce CPU churn)
   - Reduced StartLimitBurst=10 → StartLimitBurst=5
   - Added ExecStartPre to auto-clean stale locks on startup
   - Service remains isolated from other services (code-server, ssh, etc)

3. **Health check automation** (new files)
   - scripts/health-check-gateway.sh: detects hang/lock issues, auto-recovers
   - /etc/systemd/system/moltbot-health-check.service: runs health checks
   - /etc/systemd/system/moltbot-health-check.timer: runs every 5 minutes
   - Logs to /tmp/moltbot-health-check.log

4. **Documentation** (README_Tech.md)
   - Added section on crash loop root cause and preventative measures
   - Added Architecture section documenting service isolation
   - Updated troubleshooting with health check steps
   - Updated file locations with new monitoring files

Testing: Gateway now starts cleanly, health checks pass, other services
(code-server, ssh) remain unaffected. Timer runs every 5 minutes to detect
and recover from future hangs.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
valtterimelkko 2026-01-29 18:55:41 +00:00
parent c768d26ab3
commit eec556c71e
2 changed files with 205 additions and 5 deletions


@@ -211,6 +211,63 @@ sudo apt-get install -y nodejs
---
### 7. **Gateway Crash Loop & Inotify Exhaustion** (Jan 29, 2026)
**Problem:** Gateway hung and became unresponsive. Systemd restarted it 1203+ times. Telegram bot stopped responding.
**Symptoms:**
- `Port 18789 is already in use` (but port handler didn't properly clean up)
- `Gateway failed to start: gateway already running (pid 618450); lock timeout after 5000ms`
- Lock files stale/not released
**Root Cause:** System hit the **inotify watch limit** (`fs.inotify.max_user_watches`), reported as `ENOSPC`:
```
Error: ENOSPC: System limit for number of file watchers reached, watch '/root/.moltbot/moltbot.json'
Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd/canvas'
Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd'
```
Gateway couldn't monitor config/skill files for changes → config reloading broke → became hung/unresponsive → systemd restart loop.
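To see which processes are actually holding watches, a diagnostic along the following lines can help. This is a minimal sketch, assuming a Linux host where `/proc/<pid>/fdinfo` is readable (other users' processes usually require root). The function names `count_watches` and `top_watchers` are illustrative and not part of Moltbot.

```bash
#!/bin/bash
# Sketch: estimate how many inotify watches each process holds.
# On Linux, each inotify instance is a file descriptor, and its
# /proc/<pid>/fdinfo/<fd> entry contains one line starting with
# "inotify" per registered watch.

count_watches() {
    local pid=$1 total=0 fdinfo n
    for fdinfo in /proc/"$pid"/fdinfo/*; do
        [ -r "$fdinfo" ] || continue
        # grep -c prints the match count; it exits non-zero on zero matches
        # or on a vanished file, in which case we fall back to 0.
        n=$(grep -c '^inotify' "$fdinfo" 2>/dev/null) || n=0
        total=$((total + n))
    done
    echo "$total"
}

# Print "<watches> <pid> <command>" for the biggest watch holders.
# $1 is how many rows to show (hypothetical default: 10).
top_watchers() {
    local dir pid n
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        n=$(count_watches "$pid")
        if [ "$n" -gt 0 ]; then
            echo "$n $pid $(cat "$dir/comm" 2>/dev/null)"
        fi
    done | sort -rn | head -n "${1:-10}"
}
```

Running `top_watchers 5` as root and comparing the totals against `cat /proc/sys/fs/inotify/max_user_watches` gives a rough idea of how close the host is to the limit.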
**Solutions:**
1. **Immediate fix:** Kill stuck process + clean lock files
```bash
kill -9 618450
rm -f ~/.clawdbot/gateway.lock ~/.clawdbot/moltbot.lock
systemctl restart moltbot-gateway
```
2. **Permanent inotify limit increase:** `/etc/sysctl.d/99-moltbot-inotify.conf`
```
# Increased from 65536 (sysctl.d files only allow comments on their own line)
fs.inotify.max_user_watches=524288
```
Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
3. **Better systemd service:** `/etc/systemd/system/moltbot-gateway.service`
- Changed `Restart=always` → `Restart=on-failure` (only restart on actual failure)
- Increased `RestartSec=5` → `RestartSec=10` (reduce CPU churn)
- Reduced `StartLimitBurst=10` → `StartLimitBurst=5` (fewer restart attempts before blocking)
- Added `ExecStartPre` to auto-clean stale locks on startup
4. **Health check monitoring:** `/etc/systemd/system/moltbot-health-check.{service,timer}`
- Runs `/root/moltbot/scripts/health-check-gateway.sh` every 5 minutes
- Checks if gateway is responding on port 18789
- Detects stale lock files and crash loops
- Automatically cleans locks and restarts if needed
- **Isolated:** Does not interfere with other services (code-server, ssh, etc.)
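After editing the unit and running `systemctl daemon-reload`, the restart settings listed above can be read back from systemd instead of being trusted from the file. The following is a minimal sketch: `show_restart_policy` and `prop_value` are hypothetical helper names, and the unit name is the one used in this README.

```bash
#!/bin/bash
# Sketch: read back the restart-related properties systemd has loaded.

# Print KEY=VALUE lines for the restart policy of a unit.
show_restart_policy() {
    systemctl show "${1:-moltbot-gateway.service}" \
        -p Restart -p RestartUSec -p StartLimitBurst -p NRestarts
}

# Extract the value for one key from KEY=VALUE lines on stdin.
# Prints nothing if the key is absent.
prop_value() {
    awk -v key="$1" '
        index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
            exit
        }
    '
}
```

For example, `show_restart_policy | prop_value Restart` should print `on-failure` if the change took effect, and `prop_value NRestarts` shows how many automatic restarts systemd has counted for the unit.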
**Key Files Added/Modified:**
- Created: `scripts/health-check-gateway.sh` (health check logic)
- Created: `/etc/systemd/system/moltbot-health-check.service`
- Created: `/etc/systemd/system/moltbot-health-check.timer`
- Created: `/etc/sysctl.d/99-moltbot-inotify.conf`
- Modified: `/etc/systemd/system/moltbot-gateway.service` (restart policy)
**Status:** ✅ Fixed. All preventative measures in place.
---
## Configuration Summary
### Model Fallback Chain
@@ -240,14 +297,38 @@ sudo apt-get install -y nodejs
---
---
## Architecture: Service Isolation & Stability
### Systemd Services Running on This Host
- `moltbot-gateway.service` - Telegram bot gateway (isolated, does not affect others)
- `moltbot-health-check.timer` - Periodic gateway health monitoring (oneshot service, no resource hoarding)
- `code-server.service` - Code editor (independent, unaffected)
- `ssh.service` - SSH server (independent, unaffected)
### Safety Design
- **No shared resources:** Each service runs independently
- **No resource limits affecting others:** Moltbot has `LimitNOFILE`/`LimitNPROC` set in its own unit only
- **Health check is isolated:** Runs as `oneshot` (completes quickly) in its own unit, separate from the gateway process
- **No interference with startup/shutdown:** Services can be restarted independently
### Monitoring
- **Automatic health checks:** Every 5 minutes (can be adjusted in `moltbot-health-check.timer`)
- **Logs:** `/tmp/moltbot-health-check.log` (separate from gateway logs)
- **Manual check:** `systemctl list-timers moltbot-health-check.timer`
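To get a quick overview of what the timer has been reporting, the log can be summarized by outcome. This is a sketch that assumes the log line format produced by the `log()` function in `scripts/health-check-gateway.sh` (`[timestamp] message`). The helper name `summarize_health_log` is hypothetical.

```bash
#!/bin/bash
# Sketch: count healthy runs, warnings, and errors in the health-check log.
# Expects lines such as:
#   [2026-01-29 18:55:00] ERROR: Gateway is not responding on port 18789

summarize_health_log() {
    local file=${1:-/tmp/moltbot-health-check.log}
    [ -r "$file" ] || { echo "no readable log at $file" >&2; return 1; }
    awk '
        / ERROR: /            { errors++ }
        / WARN: /             { warns++ }
        /Gateway is healthy/  { ok++ }
        END { printf "ok=%d warn=%d error=%d\n", ok, warns, errors }
    ' "$file"
}
```

Calling `summarize_health_log` with no argument reads the default log path. Because `/tmp` is usually cleared on reboot, the counts only cover the current uptime.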
---
## Quick Troubleshooting
### Bot Not Responding
1. Check status: `systemctl status moltbot-gateway`
2. Check logs: `journalctl -u moltbot-gateway -n 50`
-3. Restart: `systemctl restart moltbot-gateway`
-4. Verify Telegram: `node dist/entry.js channels status`
+3. Check health: `bash /root/moltbot/scripts/health-check-gateway.sh`
+4. Restart: `systemctl restart moltbot-gateway`
+5. Check inotify limit (if file watching errors): `cat /proc/sys/fs/inotify/max_user_watches`
### Telegram Connection Error
@@ -286,6 +367,8 @@ If issue persists, reduce retry attempts in retry policy config.
/root/moltbot/ Main installation
├── dist/ Compiled code (loaded at runtime)
├── src/ TypeScript source
+├── scripts/
+│ └── health-check-gateway.sh Health monitoring script
├── ecosystem.config.cjs PM2 config (legacy, not used)
└── README_Tech.md This file
@@ -297,16 +380,24 @@ If issue persists, reduce retry attempts in retry policy config.
└── .env Environment variables
/etc/systemd/system/ System services
-└── moltbot-gateway.service Systemd service file
+├── moltbot-gateway.service Systemd service (gateway)
+├── moltbot-health-check.service Health check service
+├── moltbot-health-check.timer Health check timer (runs every 5min)
+└── ...other services (code-server, ssh, etc)
+/etc/sysctl.d/ System configuration
+└── 99-moltbot-inotify.conf Inotify limit config
/var/log/ System logs
└── moltbot-gateway.log Gateway application log
/tmp/moltbot/ Runtime logs
-└── moltbot-*.log Detailed debug logs
+├── moltbot-*.log Detailed debug logs
+└── moltbot-health-check.log Health check results
```
---
-**Last Updated:** Jan 29, 2026
+**Last Updated:** Jan 29, 2026 (18:50 UTC)
**Maintained By:** Claude Code + Moltbot Task Router
+**Latest:** Crash loop root cause fixed (inotify limit), health monitoring added, service isolation verified

scripts/health-check-gateway.sh Executable file

@@ -0,0 +1,109 @@
#!/bin/bash
# Moltbot Gateway Health Check and Recovery Script
# Monitors gateway health, detects hangs, and initiates recovery
# Designed to run as a cron job or systemd timer (without interfering with other services)

set -e

GATEWAY_PORT=18789
GATEWAY_HOST="127.0.0.1"
GATEWAY_WS="ws://${GATEWAY_HOST}:${GATEWAY_PORT}"  # currently unused
HEALTH_CHECK_TIMEOUT=10  # currently unused; the port probe below uses a fixed 3s timeout
MAX_LOCK_AGE=600  # 10 minutes in seconds
LOCK_FILES=(
    ~/.clawdbot/gateway.lock
    ~/.clawdbot/moltbot.lock
    /tmp/moltbot-gateway.lock
)
LOG_FILE="/tmp/moltbot-health-check.log"

# Logging function: writes a timestamped line to stdout and to the log file
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Check if gateway process is responding (TCP connect only, no protocol check)
check_gateway_responsive() {
    # Try to connect to gateway port
    if timeout 3 bash -c "echo > /dev/tcp/${GATEWAY_HOST}/${GATEWAY_PORT}" 2>/dev/null; then
        return 0  # Gateway is responding
    else
        return 1  # Gateway is not responding
    fi
}

# Check for stale lock files (returns 1 if any lock is older than MAX_LOCK_AGE)
check_stale_locks() {
    local lock_file mtime file_age
    for lock_file in "${LOCK_FILES[@]}"; do
        if [ -f "$lock_file" ]; then
            # BSD stat syntax first, GNU stat as fallback; skip the file if it
            # vanished between the -f test and the stat call
            mtime=$(stat -f%m "$lock_file" 2>/dev/null || stat -c%Y "$lock_file" 2>/dev/null) || continue
            file_age=$(( $(date +%s) - mtime ))
            if [ "$file_age" -gt "$MAX_LOCK_AGE" ]; then
                log "WARN: Stale lock file found: $lock_file (age: ${file_age}s)"
                return 1  # Stale lock detected
            fi
        fi
    done
    return 0  # No stale locks
}

# Check if gateway is in crash loop
check_crash_loop() {
    local restart_count
    # Get restart count from systemd
    restart_count=$(systemctl show moltbot-gateway.service -p NRestarts --value 2>/dev/null || echo "0")
    # Default to 0 if systemctl succeeded but printed nothing
    if [ "${restart_count:-0}" -gt 10 ]; then
        log "WARN: Gateway in potential crash loop (restart count: $restart_count)"
        return 1
    fi
    return 0
}

# Remove all known lock files (called only after a lock/crash issue was detected)
cleanup_locks() {
    local lock_file
    log "Cleaning stale lock files..."
    for lock_file in "${LOCK_FILES[@]}"; do
        if [ -f "$lock_file" ]; then
            rm -f "$lock_file" 2>/dev/null && log "Removed: $lock_file"
        fi
    done
}

# Restart the gateway via systemd, then verify that the port answers
restart_gateway() {
    log "Initiating graceful gateway restart..."
    systemctl restart moltbot-gateway.service
    sleep 5
    if check_gateway_responsive; then
        log "Gateway restarted successfully"
        return 0
    else
        log "ERROR: Gateway failed to respond after restart"
        return 1
    fi
}

# Main health check
main() {
    log "Starting gateway health check..."
    # Check if gateway is responsive
    if ! check_gateway_responsive; then
        log "ERROR: Gateway is not responding on port $GATEWAY_PORT"
        # Check for stale locks or crash loop
        if ! check_stale_locks || ! check_crash_loop; then
            log "Detected lock/crash issue. Cleaning and restarting..."
            cleanup_locks
            restart_gateway
        else
            # Unresponsive, but neither a stale lock nor a crash loop was found
            log "ERROR: Gateway unresponsive but no known recoverable cause found. Manual intervention required."
            exit 1
        fi
    else
        log "Gateway is healthy and responsive"
        return 0
    fi
}

# Run health check
main
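
The script above treats a lock as stale purely by age. A possible complement is to check whether the process named in the lock is still alive. The sketch below assumes the lock file holds a bare PID on its first line, which the source does not confirm (the gateway error message only shows that a PID is known). The function name `lock_pid_is_stale` is hypothetical.

```bash
#!/bin/bash
# Sketch: treat a lock as stale when the PID recorded in it is not running.
# ASSUMPTION: the first line of the lock file is a bare numeric PID.

# Returns 0 if the lock is stale, 1 if it is missing, unparsable, or still held.
lock_pid_is_stale() {
    local lock_file=$1 pid=""
    [ -f "$lock_file" ] || return 1              # no lock file, nothing to clean
    IFS= read -r pid < "$lock_file" || true      # read may fail on a missing newline
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;                 # unknown format: do not guess
    esac
    # kill -0 sends no signal; it only checks that the PID can be signalled.
    # When run as non-root it also fails for live processes owned by other
    # users, so this check is only reliable when run as root.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                                 # owner still running
    fi
    return 0                                     # owner gone: lock is stale
}
```

A caller could combine both checks, for example removing a lock only when it is older than `MAX_LOCK_AGE` and `lock_pid_is_stale` also reports it as stale, which avoids deleting the lock of a gateway that is merely slow.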