Fix: Resolve gateway crash loop and inotify exhaustion

Problem: Gateway was stuck in a restart loop (1200+ restarts), causing the Telegram bot to stop responding.
Root cause: the system inotify watch limit (fs.inotify.max_user_watches) was exhausted while monitoring config/skill files.

Solutions implemented:

1. **Inotify limit increase** (/etc/sysctl.d/99-moltbot-inotify.conf)
   - Increased fs.inotify.max_user_watches from 65536 to 524288
   - Prevents "ENOSPC: System limit for number of file watchers reached"
   - Persistent across reboots

2. **Improved systemd service** (/etc/systemd/system/moltbot-gateway.service)
   - Changed Restart=always → Restart=on-failure
   - Increased RestartSec=5 → RestartSec=10 (reduce CPU churn)
   - Reduced StartLimitBurst=10 → StartLimitBurst=5
   - Added ExecStartPre to auto-clean stale locks on startup
   - Service remains isolated from other services (code-server, ssh, etc)

3. **Health check automation** (new files)
   - scripts/health-check-gateway.sh: detects hang/lock issues, auto-recovers
   - /etc/systemd/system/moltbot-health-check.service: runs health checks
   - /etc/systemd/system/moltbot-health-check.timer: runs every 5 minutes
   - Logs to /tmp/moltbot-health-check.log

4. **Documentation** (README_Tech.md)
   - Added section on crash loop root cause and preventative measures
   - Added Architecture section documenting service isolation
   - Updated troubleshooting with health check steps
   - Updated file locations with new monitoring files

Testing: Gateway now starts cleanly, health checks pass, and other services (code-server, ssh) remain unaffected. The timer runs every 5 minutes to detect and recover from future hangs.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-01-29 18:55:41 +00:00 · eec556c71e
commit eec556c71e
parent c768d26ab3
2 changed files with 205 additions and 5 deletions
--- a/README_Tech.md
+++ b/README_Tech.md
@@ -211,6 +211,63 @@ sudo apt-get install -y nodejs
 ---
 ### 7. **Gateway Crash Loop & Inotify Exhaustion** (Jan 29, 2026)
 **Problem:** Gateway hung and became unresponsive. Systemd restarted it 1203+ times, and the Telegram bot stopped responding.
 **Symptoms:**
 - `Port 18789 is already in use` (the previous gateway instance never released the port)
 - `Gateway failed to start: gateway already running (pid 618450); lock timeout after 5000ms`
 - Lock files stale/not released
 **Root Cause:** The system hit the **inotify watch limit** (`fs.inotify.max_user_watches`), reported as `ENOSPC`:
 ```
 Error: ENOSPC: System limit for number of file watchers reached, watch '/root/.moltbot/moltbot.json'
 Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd/canvas'
 Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd'
 ```
 Gateway couldn't monitor config/skill files for changes → config reloading broke → became hung/unresponsive → systemd restart loop.
 **Solutions:**
 1. **Immediate fix:** Kill stuck process + clean lock files
 ```bash
 kill -9 618450
 rm -f ~/.clawdbot/gateway.lock ~/.clawdbot/moltbot.lock
 systemctl restart moltbot-gateway
 ```
 2. **Permanent inotify limit increase:** `/etc/sysctl.d/99-moltbot-inotify.conf`
 ```
 # Increased from 65536
 fs.inotify.max_user_watches=524288
 ```
 Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
 3. **Better systemd service:** `/etc/systemd/system/moltbot-gateway.service`
   - Changed `Restart=always` → `Restart=on-failure` (only restart on actual failure)
   - Increased `RestartSec=5` → `RestartSec=10` (reduce CPU churn)
   - Reduced `StartLimitBurst=10` → `StartLimitBurst=5` (fewer restart attempts before blocking)
   - Added `ExecStartPre` to auto-clean stale locks on startup
 4. **Health check monitoring:** `/etc/systemd/system/moltbot-health-check.{service,timer}`
   - Runs `/root/moltbot/scripts/health-check-gateway.sh` every 5 minutes
   - Checks if gateway is responding on port 18789
   - Detects stale lock files and crash loops
   - Automatically cleans locks and restarts if needed
   - **Isolated:** Does not interfere with other services (code-server, ssh, etc.)
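
The restart-policy changes in step 3 could look roughly like the drop-in below. This is a sketch, not the deployed unit: the `render_gateway_dropin` helper, the use of a drop-in file, and the exact `ExecStartPre` command are assumptions for illustration. Only the `Restart`, `RestartSec`, and `StartLimitBurst` values and the lock file names come from the notes above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a systemd drop-in with the restart settings
# described in step 3. The ExecStartPre line is an assumed implementation.
render_gateway_dropin() {
  cat <<'EOF'
[Unit]
# Stop retrying after 5 start attempts within the rate-limit window.
StartLimitBurst=5

[Service]
# Restart only after an unclean exit (non-zero status, signal, timeout);
# a clean stop stays stopped.
Restart=on-failure
RestartSec=10
# Assumed cleanup command. The leading "-" tells systemd to ignore its exit
# status. Absolute paths are used because systemd does not expand "~".
ExecStartPre=-/bin/rm -f /root/.clawdbot/gateway.lock /root/.clawdbot/moltbot.lock
EOF
}

render_gateway_dropin
```

To try it, the output could be written to `/etc/systemd/system/moltbot-gateway.service.d/override.conf`, followed by `systemctl daemon-reload` and a restart of the service.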
 **Key Files Added/Modified:**
 - Created: `scripts/health-check-gateway.sh` (health check logic)
 - Created: `/etc/systemd/system/moltbot-health-check.service`
 - Created: `/etc/systemd/system/moltbot-health-check.timer`
 - Created: `/etc/sysctl.d/99-moltbot-inotify.conf`
 - Modified: `/etc/systemd/system/moltbot-gateway.service` (restart policy)
 **Status:** ✅ Fixed. All preventative measures in place.
 ---
 ## Configuration Summary
 ### Model Fallback Chain
@@ -240,14 +297,38 @@ sudo apt-get install -y nodejs
 ---
 ---
 ## Architecture: Service Isolation & Stability
 ### Systemd Services Running on This Host
 - `moltbot-gateway.service` - Telegram bot gateway (isolated, does not affect others)
 - `moltbot-health-check.timer` - Periodic gateway health monitoring (oneshot service, no resource hoarding)
 - `code-server.service` - Code editor (independent, unaffected)
 - `ssh.service` - SSH server (independent, unaffected)
 ### Safety Design
 - **No shared resources:** Each service runs independently
 - **No resource limits affecting others:** Moltbot sets `LimitNOFILE`/`LimitNPROC` in its own unit only
 - **Health check is isolated:** Runs as `oneshot` (completes quickly) and exits after each check instead of staying resident alongside the gateway
 - **No interference with startup/shutdown:** Services can be restarted independently
 ### Monitoring
 - **Automatic health checks:** Every 5 minutes (can be adjusted in `moltbot-health-check.timer`)
 - **Logs:** `/tmp/moltbot-health-check.log` (separate from gateway logs)
 - **Manual check:** `systemctl list-timers moltbot-health-check.timer`
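
For reference, a oneshot service and timer pair matching the description above might look like the following. This is a minimal sketch: the two `render_*` helpers, the `Description` strings, and the `OnBootSec` value are assumptions. Only the unit names, the `oneshot` type, the script path, and the 5-minute interval come from this README.

```bash
#!/usr/bin/env bash
# Hypothetical helpers that print a oneshot service and a 5-minute timer.

render_health_check_service() {
  cat <<'EOF'
[Unit]
Description=Moltbot gateway health check

[Service]
# oneshot: systemd runs the script once and the unit goes inactive again.
Type=oneshot
ExecStart=/bin/bash /root/moltbot/scripts/health-check-gateway.sh
EOF
}

render_health_check_timer() {
  cat <<'EOF'
[Unit]
Description=Run the Moltbot gateway health check every 5 minutes

[Timer]
# First run shortly after boot (assumed value), then 5 minutes after the
# service was last activated. Change OnUnitActiveSec to adjust the interval.
OnBootSec=2min
OnUnitActiveSec=5min
Unit=moltbot-health-check.service

[Install]
WantedBy=timers.target
EOF
}

render_health_check_service
echo
render_health_check_timer
```

After editing the timer, `systemctl daemon-reload` followed by `systemctl enable --now moltbot-health-check.timer` would pick up the new interval.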
 ---
 ## Quick Troubleshooting
 ### Bot Not Responding
 1. Check status: `systemctl status moltbot-gateway`
 2. Check logs: `journalctl -u moltbot-gateway -n 50`
-3. Restart: `systemctl restart moltbot-gateway`
+3. Check health: `bash /root/moltbot/scripts/health-check-gateway.sh`
-4. Verify Telegram: `node dist/entry.js channels status`
+4. Restart: `systemctl restart moltbot-gateway`
 5. Check inotify limit (if file watching errors): `cat /proc/sys/fs/inotify/max_user_watches`
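
Step 5 shows the limit but not how close the host is to it. A rough way to estimate current usage is to count the `inotify` entries under `/proc/<pid>/fdinfo`. This is a sketch under assumptions: `count_inotify_watches` is a hypothetical helper, it only sees processes readable by the current user, and it sums across all of them, whereas the kernel limit applies per user.

```bash
#!/usr/bin/env bash
# Estimate how many inotify watches are currently registered.

count_inotify_watches() {
  local total=0 n dir
  for dir in /proc/[0-9]*/fdinfo; do
    # Skip processes whose fdinfo directory we are not allowed to read.
    [ -r "$dir" ] || continue
    # Each registered watch is one "inotify wd:..." line in an fdinfo file.
    n=$(grep -hs '^inotify' "$dir"/* | wc -l)
    total=$((total + n))
  done
  echo "$total"
}

limit=$(cat /proc/sys/fs/inotify/max_user_watches 2>/dev/null || echo "unknown")
echo "inotify watches in use: $(count_inotify_watches) (per-user limit: $limit)"
```

Run it as root to include the gateway's own watches. A count near the limit suggests the `ENOSPC` errors from issue 7 may return.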
 ### Telegram Connection Error
@@ -286,6 +367,8 @@ If issue persists, reduce retry attempts in retry policy config.
 /root/moltbot/                          Main installation
 ├── dist/                               Compiled code (loaded at runtime)
 ├── src/                                TypeScript source
 ├── scripts/
 │   └── health-check-gateway.sh         Health monitoring script
 ├── ecosystem.config.cjs                PM2 config (legacy, not used)
 └── README_Tech.md                      This file
@@ -297,16 +380,24 @@ If issue persists, reduce retry attempts in retry policy config.
 └── .env                                Environment variables
 /etc/systemd/system/                    System services
-└── moltbot-gateway.service             Systemd service file
+├── moltbot-gateway.service             Systemd service (gateway)
 ├── moltbot-health-check.service        Health check service
 ├── moltbot-health-check.timer          Health check timer (runs every 5min)
 └── ...other services (code-server, ssh, etc)
 /etc/sysctl.d/                          System configuration
 └── 99-moltbot-inotify.conf            Inotify limit config
 /var/log/                               System logs
 └── moltbot-gateway.log                 Gateway application log
 /tmp/moltbot/                           Runtime logs
-└── moltbot-*.log                       Detailed debug logs
+├── moltbot-*.log                       Detailed debug logs
 └── moltbot-health-check.log            Health check results
 ```
 ---
-**Last Updated:** Jan 29, 2026
+**Last Updated:** Jan 29, 2026 (18:50 UTC)
 **Maintained By:** Claude Code + Moltbot Task Router
 **Latest:** Crash loop root cause fixed (inotify limit), health monitoring added, service isolation verified
--- a/scripts/health-check-gateway.sh
+++ b/scripts/health-check-gateway.sh
@@ -0,0 +1,109 @@
 #!/bin/bash
 # Moltbot Gateway Health Check and Recovery Script
 # Monitors gateway health, detects hangs, and initiates recovery
 # Designed to run as a cronjob or systemd timer (not interfering with other services)
 set -e
 GATEWAY_PORT=18789
 GATEWAY_HOST="127.0.0.1"
 GATEWAY_WS="ws://${GATEWAY_HOST}:${GATEWAY_PORT}"
 HEALTH_CHECK_TIMEOUT=10
 MAX_LOCK_AGE=600  # 10 minutes in seconds
 LOCK_FILES=(
  ~/.clawdbot/gateway.lock
  ~/.clawdbot/moltbot.lock
  /tmp/moltbot-gateway.lock
 )
 LOG_FILE="/tmp/moltbot-health-check.log"
 # Logging function
 log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
 }
 # Check if gateway process is responding
 check_gateway_responsive() {
  # Try to connect to gateway port
  if timeout 3 bash -c "echo > /dev/tcp/${GATEWAY_HOST}/${GATEWAY_PORT}" 2>/dev/null; then
    return 0  # Gateway is responding
  else
    return 1  # Gateway is not responding
  fi
 }
 # Check for stale lock files
 check_stale_locks() {
  for lock_file in "${LOCK_FILES[@]}"; do
    if [ -f "$lock_file" ]; then
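      # GNU stat first (Linux), BSD stat as a fallback; skip the file if both fail
      mtime=$(stat -c %Y "$lock_file" 2>/dev/null || stat -f %m "$lock_file" 2>/dev/null) || continue
      file_age=$(( $(date +%s) - mtime ))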
      if [ "$file_age" -gt "$MAX_LOCK_AGE" ]; then
        log "WARN: Stale lock file found: $lock_file (age: ${file_age}s)"
        return 1  # Stale lock detected
      fi
    fi
  done
  return 0  # No stale locks
 }
 # Check if gateway is in crash loop
 check_crash_loop() {
  # Get restart count from systemd
  restart_count=$(systemctl show moltbot-gateway.service -p NRestarts --value 2>/dev/null || echo "0")
  if [ "$restart_count" -gt "10" ]; then
    log "WARN: Gateway in potential crash loop (restart count: $restart_count)"
    return 1
  fi
  return 0
 }
 # Clean stale lock files
 cleanup_locks() {
  log "Cleaning stale lock files..."
  for lock_file in "${LOCK_FILES[@]}"; do
    if [ -f "$lock_file" ]; then
      rm -f "$lock_file" 2>/dev/null && log "Removed: $lock_file"
    fi
  done
 }
 # Graceful restart of gateway
 restart_gateway() {
  log "Initiating graceful gateway restart..."
  systemctl restart moltbot-gateway.service
  sleep 5
  if check_gateway_responsive; then
    log "Gateway restarted successfully"
    return 0
  else
    log "ERROR: Gateway failed to respond after restart"
    return 1
  fi
 }
 # Main health check
 main() {
  log "Starting gateway health check..."
  # Check if gateway is responsive
  if ! check_gateway_responsive; then
    log "ERROR: Gateway is not responding on port $GATEWAY_PORT"
    # Check for stale locks or crash loop
    if ! check_stale_locks || ! check_crash_loop; then
      log "Detected lock/crash issue. Cleaning and restarting..."
      cleanup_locks
      restart_gateway
    else
      log "ERROR: Gateway unresponsive, but no stale locks or crash loop detected. Manual intervention required."
      exit 1
    fi
  else
    log "Gateway is healthy and responsive"
    return 0
  fi
 }
 # Run health check
 main
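
The stale-lock test in `check_stale_locks` can be tried in isolation without touching real lock files. The sketch below assumes GNU `stat` and GNU `touch -d` (as on the Linux host described here). The helper names `lock_age_seconds` and `is_stale_lock` are illustrative and not part of the script above.

```bash
#!/usr/bin/env bash
# Sketch: decide whether a lock file is older than a threshold (GNU tools assumed).

MAX_LOCK_AGE=600  # seconds, the same threshold the health check script uses

# Prints the age of a file in seconds; returns non-zero if it cannot be read.
lock_age_seconds() {
  local mtime
  mtime=$(stat -c %Y "$1" 2>/dev/null) || return 1
  echo $(( $(date +%s) - mtime ))
}

# Returns 0 (true) only when the file exists and is older than MAX_LOCK_AGE.
is_stale_lock() {
  local age
  age=$(lock_age_seconds "$1") || return 1
  [ "$age" -gt "$MAX_LOCK_AGE" ]
}

# Demonstration on throwaway files instead of the real lock files.
demo_dir=$(mktemp -d)
touch "$demo_dir/fresh.lock"
touch -d '20 minutes ago' "$demo_dir/old.lock"

for f in "$demo_dir"/*.lock; do
  if is_stale_lock "$f"; then
    echo "stale: $(basename "$f")"
  else
    echo "fresh: $(basename "$f")"
  fi
done
rm -rf "$demo_dir"
```

Treating an unreadable or missing file as "not stale" keeps the check from deleting anything it could not inspect.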