Fix: Remove systemd conflicts, clarify PM2-based process management

- Removed conflicting systemd service files (moltbot-gateway.service, moltbot-health-check.*)
- Removed redundant health-check script (PM2 handles restarts natively)
- Updated README_Tech.md to document PM2 as actual process manager
- Clarified that inotify fix (524288 limit) is permanent solution
- Documented PM2 commands for troubleshooting and monitoring
- Added safety note: Never use systemd for moltbot-gateway (causes port conflicts)
- Fixed architecture documentation to reflect PM2 daemon isolation model

Gateway now running cleanly via PM2 (PID 661291) without systemd interference.
Inotify limit verified at 524288 (prevents file watcher exhaustion).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
valtterimelkko 2026-01-29 19:51:16 +00:00
parent eec556c71e
commit a37c9cad6d
2 changed files with 95 additions and 176 deletions

@ -213,12 +213,13 @@ sudo apt-get install -y nodejs
### 7. **Gateway Crash Loop & Inotify Exhaustion** (Jan 29, 2026) ### 7. **Gateway Crash Loop & Inotify Exhaustion** (Jan 29, 2026)
**Problem:** Gateway hung/became unresponsive. Systemd crashed 1203+ times. Telegram bot stopped responding. **Problem:** Gateway hung/became unresponsive. PM2 restarted 448+ times. Telegram bot stopped responding.
**Symptoms:** **Symptoms:**
- `Port 18789 is already in use` (but port handler didn't properly clean up) - `Port 18789 is already in use` (process stuck, wouldn't release port)
- `Gateway failed to start: gateway already running (pid 618450); lock timeout after 5000ms` - `Gateway failed to start: gateway already running; lock timeout after 5000ms`
- Lock files stale/not released - Lock files stale/not released
- Port conflict between different PM2 daemons and attempted systemd service
**Root Cause:** System hit the **inotify watch limit** (`ENOSPC`): **Root Cause:** System hit the **inotify watch limit** (`ENOSPC`):
``` ```
@ -227,44 +228,40 @@ Error: ENOSPC: System limit for number of file watchers reached, watch '/root/cl
Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd' Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd'
``` ```
Gateway couldn't monitor config/skill files for changes → config reloading broke → became hung/unresponsive → systemd restart loop. Gateway couldn't monitor config/skill files for changes → config reloading broke → became hung/unresponsive → PM2 restart loop (448+ restarts).
**Solutions:** **Solutions:**
1. **Immediate fix:** Kill stuck process + clean lock files 1. **Permanent inotify limit increase:** `/etc/sysctl.d/99-moltbot-inotify.conf`
```bash
kill -9 618450
rm -f ~/.clawdbot/gateway.lock ~/.clawdbot/moltbot.lock
systemctl restart moltbot-gateway
```
2. **Permanent inotify limit increase:** `/etc/sysctl.d/99-moltbot-inotify.conf`
``` ```
fs.inotify.max_user_watches=524288 # Increased from 65536 fs.inotify.max_user_watches=524288 # Increased from 65536
``` ```
Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf` Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
3. **Better systemd service:** `/etc/systemd/system/moltbot-gateway.service` Verify: `cat /proc/sys/fs/inotify/max_user_watches` (should show 524288)
- Changed `Restart=always` → `Restart=on-failure` (only restart on actual failure)
- Increased `RestartSec=5` → `RestartSec=10` (reduce CPU churn)
- Reduced `StartLimitBurst=10` → `StartLimitBurst=5` (fewer restart attempts before blocking)
- Added `ExecStartPre` to auto-clean stale locks on startup
4. **Health check monitoring:** `/etc/systemd/system/moltbot-health-check.{service,timer}` 2. **Process management:** Gateway is managed by **PM2 (separate daemon)**, not systemd
- Runs `/root/moltbot/scripts/health-check-gateway.sh` every 5 minutes ```bash
- Checks if gateway is responding on port 18789 pm2 status # Check gateway status
- Detects stale lock files and crash loops pm2 restart moltbot-gateway # Restart gateway (PM2-managed)
- Automatically cleans locks and restarts if needed pm2 logs moltbot-gateway # View real-time logs
- **Isolated:** Does not interfere with other services (code-server, ssh, etc.) ```
**Key Files Added/Modified:** 3. **Manual recovery (if stuck):**
- Created: `scripts/health-check-gateway.sh` (health check logic) ```bash
- Created: `/etc/systemd/system/moltbot-health-check.service` killall -9 moltbot
- Created: `/etc/systemd/system/moltbot-health-check.timer` pm2 restart moltbot-gateway
- Created: `/etc/sysctl.d/99-moltbot-inotify.conf` ```
- Modified: `/etc/systemd/system/moltbot-gateway.service` (restart policy)
**Status:** ✅ Fixed. All preventative measures in place. **Key Files Modified:**
- Created: `/etc/sysctl.d/99-moltbot-inotify.conf` (inotify limit increase)
**Architecture Note:**
- PM2 runs multiple independent daemons: `si_project/dashboard`, `ai_product_visualizer`, `moltbot-gateway`
- Each daemon is separate to prevent process interference
- **Never** use systemd for moltbot-gateway (causes port conflicts with PM2)
**Status:** ✅ Fixed. Inotify limit increased. PM2 managing gateway cleanly.
--- ---
@ -299,24 +296,39 @@ Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
--- ---
## Architecture: Service Isolation & Stability ## Architecture: Process Management
### Systemd Services Running on This Host ### PM2 Daemons on This Host
- `moltbot-gateway.service` - Telegram bot gateway (isolated, does not affect others) - **Moltbot Gateway** (`pm2 start ecosystem.config.cjs` or `pm2 restart moltbot-gateway`)
- `moltbot-health-check.timer` - Periodic gateway health monitoring (oneshot service, no resource hoarding) - Managed by: PM2 (separate daemon, independent from other PM2 instances)
- `code-server.service` - Code editor (independent, unaffected) - Port: 18789
- `ssh.service` - SSH server (independent, unaffected) - PID file: `/root/.pm2/pids/moltbot-gateway-0.pid`
- Config: `/root/moltbot/ecosystem.config.cjs`
- **Other PM2 Daemons** (separate instances)
- `si_project/dashboard` - Frontend dashboard
- `ai_product_visualizer` - Backend visualizer
- **Each runs in its own PM2 daemon to prevent interference**
### Systemd Services (Independent)
- `code-server.service` - Code editor (not PM2-managed)
- `ssh.service` - SSH server (not PM2-managed)
- These do not interact with moltbot or other PM2 processes
### Safety Design ### Safety Design
- **No shared resources:** Each service runs independently - **Process isolation:** Each PM2 daemon is independent (separate instances)
- **No resource limits affecting others:** Moltbot has `LimitNOFILE/NOPROC` set locally only - **No port conflicts:** Moltbot never uses systemd (only PM2)
- **Health check is isolated:** Runs as `oneshot` (completes quickly), doesn't run concurrently with gateway - **Independent logging:** Moltbot logs to `/tmp/moltbot/` (separate from other services)
- **No interference with startup/shutdown:** Services can be restarted independently - **Inotify limit:** System-wide increased to prevent file watcher exhaustion
### Monitoring ### Monitoring & Control
- **Automatic health checks:** Every 5 minutes (can be adjusted in `moltbot-health-check.timer`) ```bash
- **Logs:** `/tmp/moltbot-health-check.log` (separate from gateway logs) pm2 list # List processes managed by the current user's PM2 daemon
- **Manual check:** `systemctl list-timers moltbot-health-check.timer` pm2 status # Alias of pm2 list (includes moltbot-gateway)
pm2 restart moltbot-gateway # Restart gateway
pm2 logs moltbot-gateway # Real-time logs
pm2 logs moltbot-gateway -n 100 # Last 100 lines
```
--- ---
@ -324,16 +336,25 @@ Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
### Bot Not Responding ### Bot Not Responding
1. Check status: `systemctl status moltbot-gateway` 1. Check status: `pm2 status`
2. Check logs: `journalctl -u moltbot-gateway -n 50` 2. Check logs: `pm2 logs moltbot-gateway` or `pm2 logs moltbot-gateway -n 50`
3. Check health: `bash /root/moltbot/scripts/health-check-gateway.sh` 3. Restart: `pm2 restart moltbot-gateway`
4. Restart: `systemctl restart moltbot-gateway` 4. Check port: `nc -zv 127.0.0.1 18789`
5. Check inotify limit (if file watching errors): `cat /proc/sys/fs/inotify/max_user_watches` 5. Check inotify limit (if file watching errors): `cat /proc/sys/fs/inotify/max_user_watches` (should be 524288)
6. If stuck, force kill and restart:
```bash
killall -9 moltbot
pm2 restart moltbot-gateway
```
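
The port check in step 4 can also be done without `nc`, using the same bash `/dev/tcp` probe that the removed health-check script relied on. This is a sketch only; `port_open` is a hypothetical helper name, and it assumes the gateway listens on `127.0.0.1:18789` as documented here.

```bash
# Sketch: report whether a TCP port accepts connections (bash /dev/tcp).
# port_open is an illustrative helper, not a moltbot or PM2 command.

port_open() {
  host="$1"
  port="$2"
  # The subshell opens the socket and closes it on exit; a non-zero
  # status means nothing accepted the connection on host:port.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if port_open 127.0.0.1 18789; then
  echo "port 18789: listening"
else
  echo "port 18789: not listening"
fi
```

If the port reports as listening while the bot stays silent, a stuck process is probably still holding it, which is the case the force-kill step is meant for.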
### Telegram Connection Error ### Telegram Connection Error
Check logs for `setMyCommands failed` or network errors. Check logs for `setMyCommands failed` or network errors:
If command limit error: Verify `native: false` in both config files. ```bash
pm2 logs moltbot-gateway | grep -i telegram
```
If command limit error: Verify `native: false` in `/root/.clawdbot/moltbot.json` and `/root/.clawdbot/agents/main/config.json`.
### High Latency (>1 minute) ### High Latency (>1 minute)
@ -343,6 +364,14 @@ If consistent, check model health: `node dist/entry.js models status`
### Duplicate Responses ### Duplicate Responses
Check `streamMode: "block"` is set in `/root/.clawdbot/moltbot.json`. Check `streamMode: "block"` is set in `/root/.clawdbot/moltbot.json`.
### Gateway Crashes Frequently
Check PM2 restart count: `pm2 list | grep moltbot-gateway`
- If `↺` count is high (>10), check logs for root cause:
- Inotify exhaustion: `dmesg | grep -i inotify`
- Memory pressure: `pm2 logs moltbot-gateway | grep -i memory`
- Telegram errors: `pm2 logs moltbot-gateway | grep -i telegram`
If issue persists, reduce retry attempts in retry policy config. If issue persists, reduce retry attempts in retry policy config.
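
The restart-count check can be automated. The sketch below assumes `jq` is installed (an assumption, it is not mentioned elsewhere in this README) and reads the counter from `pm2 jlist`, which prints the process list as JSON; `restart_verdict` and `RESTART_THRESHOLD` are hypothetical names.

```bash
# Sketch: flag a likely crash loop from a PM2 restart count.
# restart_verdict and RESTART_THRESHOLD are illustrative names.

RESTART_THRESHOLD=10

# Prints "crash-loop" when the count exceeds the threshold, else "ok".
restart_verdict() {
  count="$1"
  threshold="${2:-$RESTART_THRESHOLD}"
  if [ "$count" -gt "$threshold" ]; then
    echo "crash-loop"
  else
    echo "ok"
  fi
}

# Live check: runs only when both pm2 and jq are available.
if command -v pm2 >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then
  restarts=$(pm2 jlist \
    | jq -r '.[] | select(.name == "moltbot-gateway") | .pm2_env.restart_time' \
    | head -n 1)
  if [ -n "$restarts" ]; then
    echo "moltbot-gateway restarts: $restarts ($(restart_verdict "$restarts"))"
  fi
fi
```

The threshold of 10 mirrors the `>10` rule of thumb above. Note that the counter is cumulative, so a high value may reflect an earlier incident rather than a current loop.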
--- ---
@ -367,9 +396,7 @@ If issue persists, reduce retry attempts in retry policy config.
/root/moltbot/ Main installation /root/moltbot/ Main installation
├── dist/ Compiled code (loaded at runtime) ├── dist/ Compiled code (loaded at runtime)
├── src/ TypeScript source ├── src/ TypeScript source
├── scripts/ ├── ecosystem.config.cjs PM2 config (moltbot-gateway process)
│ └── health-check-gateway.sh Health monitoring script
├── ecosystem.config.cjs PM2 config (legacy, not used)
└── README_Tech.md This file └── README_Tech.md This file
~/.clawdbot/ Config directory ~/.clawdbot/ Config directory
@ -379,25 +406,26 @@ If issue persists, reduce retry attempts in retry policy config.
│ └── auth-profiles.json API key storage │ └── auth-profiles.json API key storage
└── .env Environment variables └── .env Environment variables
/etc/systemd/system/ System services /root/.pm2/ PM2 daemon directory
├── moltbot-gateway.service Systemd service (gateway) ├── pids/moltbot-gateway-0.pid Process ID file (moltbot-gateway)
├── moltbot-health-check.service Health check service └── logs/ PM2 logs directory
├── moltbot-health-check.timer Health check timer (runs every 5min)
└── ...other services (code-server, ssh, etc)
/etc/sysctl.d/ System configuration /etc/sysctl.d/ System configuration
└── 99-moltbot-inotify.conf Inotify limit config └── 99-moltbot-inotify.conf Inotify limit increase (filesystem watchers)
/var/log/ System logs
└── moltbot-gateway.log Gateway application log
/tmp/moltbot/ Runtime logs /tmp/moltbot/ Runtime logs
├── moltbot-*.log Detailed debug logs ├── moltbot-*.log Detailed application debug logs
└── moltbot-health-check.log Health check results ├── pm2-out.log PM2 stdout log
└── pm2-error.log PM2 stderr/error log
/etc/systemd/system/ Systemd services (NOT used for moltbot)
├── code-server.service Code editor (independent)
├── ssh.service SSH server (independent)
└── ...other services
``` ```
--- ---
**Last Updated:** Jan 29, 2026 (18:50 UTC) **Last Updated:** Jan 29, 2026 (19:50 UTC)
**Maintained By:** Claude Code + Moltbot Task Router **Maintained By:** Claude Code + Moltbot Task Router
**Latest:** Crash loop root cause fixed (inotify limit), health monitoring added, service isolation verified **Latest:** Crash loop root cause fixed (inotify limit increased to 524288). PM2 process manager confirmed as correct, systemd conflicts removed. Gateway now running cleanly via PM2.

@ -1,109 +0,0 @@
#!/bin/bash
# Moltbot Gateway Health Check and Recovery Script
# Monitors gateway health, detects hangs, and initiates recovery
# Designed to run as a cronjob or systemd timer (not interfering with other services)
set -e
GATEWAY_PORT=18789
GATEWAY_HOST="127.0.0.1"
GATEWAY_WS="ws://${GATEWAY_HOST}:${GATEWAY_PORT}"
HEALTH_CHECK_TIMEOUT=10
MAX_LOCK_AGE=600 # 10 minutes in seconds
LOCK_FILES=(
~/.clawdbot/gateway.lock
~/.clawdbot/moltbot.lock
/tmp/moltbot-gateway.lock
)
LOG_FILE="/tmp/moltbot-health-check.log"
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Check if gateway process is responding
check_gateway_responsive() {
# Try to connect to gateway port
if timeout 3 bash -c "echo > /dev/tcp/${GATEWAY_HOST}/${GATEWAY_PORT}" 2>/dev/null; then
return 0 # Gateway is responding
else
return 1 # Gateway is not responding
fi
}
# Check for stale lock files
check_stale_locks() {
for lock_file in "${LOCK_FILES[@]}"; do
if [ -f "$lock_file" ]; then
file_age=$(($(date +%s) - $(stat -f%m "$lock_file" 2>/dev/null || stat -c%Y "$lock_file" 2>/dev/null)))
if [ "$file_age" -gt "$MAX_LOCK_AGE" ]; then
log "WARN: Stale lock file found: $lock_file (age: ${file_age}s)"
return 1 # Stale lock detected
fi
fi
done
return 0 # No stale locks
}
# Check if gateway is in crash loop
check_crash_loop() {
# Get restart count from systemd
restart_count=$(systemctl show moltbot-gateway.service -p NRestarts --value 2>/dev/null || echo "0")
if [ "$restart_count" -gt "10" ]; then
log "WARN: Gateway in potential crash loop (restart count: $restart_count)"
return 1
fi
return 0
}
# Clean stale lock files
cleanup_locks() {
log "Cleaning stale lock files..."
for lock_file in "${LOCK_FILES[@]}"; do
if [ -f "$lock_file" ]; then
rm -f "$lock_file" 2>/dev/null && log "Removed: $lock_file"
fi
done
}
# Graceful restart of gateway
restart_gateway() {
log "Initiating graceful gateway restart..."
systemctl restart moltbot-gateway.service
sleep 5
if check_gateway_responsive; then
log "Gateway restarted successfully"
return 0
else
log "ERROR: Gateway failed to respond after restart"
return 1
fi
}
# Main health check
main() {
log "Starting gateway health check..."
# Check if gateway is responsive
if ! check_gateway_responsive; then
log "ERROR: Gateway is not responding on port $GATEWAY_PORT"
# Check for stale locks or crash loop
if ! check_stale_locks || ! check_crash_loop; then
log "Detected lock/crash issue. Cleaning and restarting..."
cleanup_locks
restart_gateway
else
log "ERROR: Gateway unresponsive but no recovery needed. Manual intervention required."
exit 1
fi
else
log "Gateway is healthy and responsive"
return 0
fi
}
# Run health check
main