Fix: Remove systemd conflicts, clarify PM2-based process management

- Removed conflicting systemd service files (moltbot-gateway.service, moltbot-health-check.*)
- Removed redundant health-check script (PM2 handles restarts natively)
- Updated README_Tech.md to document PM2 as the actual process manager
- Clarified that the inotify fix (524288 limit) is the permanent solution
- Documented PM2 commands for troubleshooting and monitoring
- Added safety note: Never use systemd for moltbot-gateway (causes port conflicts)
- Fixed architecture documentation to reflect PM2 daemon isolation model

Gateway now running cleanly via PM2 (PID 661291) without systemd interference.
Inotify limit verified at 524288 (prevents file watcher exhaustion).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
valtterimelkko 2026-01-29 19:51:16 +00:00
parent eec556c71e
commit a37c9cad6d
2 changed files with 95 additions and 176 deletions


@ -213,12 +213,13 @@ sudo apt-get install -y nodejs
### 7. **Gateway Crash Loop & Inotify Exhaustion** (Jan 29, 2026)
**Problem:** Gateway hung/became unresponsive. Systemd crashed 1203+ times. Telegram bot stopped responding.
**Problem:** Gateway hung/became unresponsive. PM2 restarted 448+ times. Telegram bot stopped responding.
**Symptoms:**
- `Port 18789 is already in use` (but port handler didn't properly clean up)
- `Gateway failed to start: gateway already running (pid 618450); lock timeout after 5000ms`
- `Port 18789 is already in use` (process stuck, wouldn't release port)
- `Gateway failed to start: gateway already running; lock timeout after 5000ms`
- Lock files stale/not released
- Port conflict between different PM2 daemons and attempted systemd service
**Root Cause:** System hit **inotify file descriptor limit** (`ENOSPC`):
```
@ -227,44 +228,40 @@ Error: ENOSPC: System limit for number of file watchers reached, watch '/root/cl
Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd'
```
Gateway couldn't monitor config/skill files for changes → config reloading broke → became hung/unresponsive → systemd restart loop.
Gateway couldn't monitor config/skill files for changes → config reloading broke → became hung/unresponsive → PM2 restart loop (448+ restarts).
**Solutions:**
1. **Immediate fix:** Kill stuck process + clean lock files
```bash
kill -9 618450
rm -f ~/.clawdbot/gateway.lock ~/.clawdbot/moltbot.lock
systemctl restart moltbot-gateway
```
2. **Permanent inotify limit increase:** `/etc/sysctl.d/99-moltbot-inotify.conf`
1. **Permanent inotify limit increase:** `/etc/sysctl.d/99-moltbot-inotify.conf`
```
# Increased from 65536 (sysctl.d only treats whole lines starting with # or ; as comments)
fs.inotify.max_user_watches=524288
```
Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
3. **Better systemd service:** `/etc/systemd/system/moltbot-gateway.service`
- Changed `Restart=always` → `Restart=on-failure` (only restart on actual failure)
- Increased `RestartSec=5` → `RestartSec=10` (reduce CPU churn)
- Reduced `StartLimitBurst=10` → `StartLimitBurst=5` (fewer restart attempts before blocking)
- Added `ExecStartPre` to auto-clean stale locks on startup
Verify: `cat /proc/sys/fs/inotify/max_user_watches` (should show 524288)
4. **Health check monitoring:** `/etc/systemd/system/moltbot-health-check.{service,timer}`
- Runs `/root/moltbot/scripts/health-check-gateway.sh` every 5 minutes
- Checks if gateway is responding on port 18789
- Detects stale lock files and crash loops
- Automatically cleans locks and restarts if needed
- **Isolated:** Does not interfere with other services (code-server, ssh, etc.)
2. **Process management:** Gateway is managed by **PM2 (separate daemon)**, not systemd
```bash
pm2 status # Check gateway status
pm2 restart moltbot-gateway # Restart gateway (PM2-managed)
pm2 logs moltbot-gateway # View real-time logs
```
**Key Files Added/Modified:**
- Created: `scripts/health-check-gateway.sh` (health check logic)
- Created: `/etc/systemd/system/moltbot-health-check.service`
- Created: `/etc/systemd/system/moltbot-health-check.timer`
- Created: `/etc/sysctl.d/99-moltbot-inotify.conf`
- Modified: `/etc/systemd/system/moltbot-gateway.service` (restart policy)
3. **Manual recovery (if stuck):**
```bash
killall -9 moltbot
pm2 restart moltbot-gateway
```
**Status:** ✅ Fixed. All preventative measures in place.
**Key Files Modified:**
- Created: `/etc/sysctl.d/99-moltbot-inotify.conf` (inotify limit increase)
**Architecture Note:**
- PM2 runs multiple independent daemons: `si_project/dashboard`, `ai_product_visualizer`, `moltbot-gateway`
- Each daemon is separate to prevent process interference
- **Never** use systemd for moltbot-gateway (causes port conflicts with PM2)
**Status:** ✅ Fixed. Inotify limit increased. PM2 managing gateway cleanly.
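
The verification step above (`cat /proc/sys/fs/inotify/max_user_watches`) can be wrapped in a small check. This is a minimal sketch, not part of the commit: the function name `check_inotify_limit` and its optional second argument (a path override, useful for testing) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare the live inotify watch limit against an expected minimum.
# check_inotify_limit is a hypothetical helper, not part of moltbot or PM2.

check_inotify_limit() {
  local expected="${1:-524288}"
  local limit_file="${2:-/proc/sys/fs/inotify/max_user_watches}"
  local current

  # Bail out with a distinct code if the limit cannot be read at all.
  if [ ! -r "$limit_file" ]; then
    echo "cannot read $limit_file" >&2
    return 2
  fi

  current="$(cat "$limit_file")"
  if [ "$current" -ge "$expected" ]; then
    echo "ok: max_user_watches=$current"
    return 0
  fi

  echo "low: max_user_watches=$current (expected >= $expected)"
  return 1
}
```

Calling `check_inotify_limit` with no arguments checks the running kernel value against 524288; a non-zero exit code suggests the sysctl file was not applied.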
---
@ -299,24 +296,39 @@ Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
---
## Architecture: Service Isolation & Stability
## Architecture: Process Management
### Systemd Services Running on This Host
- `moltbot-gateway.service` - Telegram bot gateway (isolated, does not affect others)
- `moltbot-health-check.timer` - Periodic gateway health monitoring (oneshot service, no resource hoarding)
- `code-server.service` - Code editor (independent, unaffected)
- `ssh.service` - SSH server (independent, unaffected)
### PM2 Daemons on This Host
- **Moltbot Gateway** (`pm2 start ecosystem.config.cjs` or `pm2 restart moltbot-gateway`)
- Managed by: PM2 (separate daemon, independent from other PM2 instances)
- Port: 18789
- PID file: `/root/.pm2/pids/moltbot-gateway-0.pid`
- Config: `/root/moltbot/ecosystem.config.cjs`
- **Other PM2 Daemons** (separate instances)
- `si_project/dashboard` - Frontend dashboard
- `ai_product_visualizer` - Backend visualizer
- **Each runs in its own PM2 daemon to prevent interference**
### Systemd Services (Independent)
- `code-server.service` - Code editor (not PM2-managed)
- `ssh.service` - SSH server (not PM2-managed)
- These do not interact with moltbot or other PM2 processes
### Safety Design
- **No shared resources:** Each service runs independently
- **No resource limits affecting others:** Moltbot has `LimitNOFILE/NOPROC` set locally only
- **Health check is isolated:** Runs as `oneshot` (completes quickly), doesn't run concurrently with gateway
- **No interference with startup/shutdown:** Services can be restarted independently
- **Process isolation:** Each PM2 daemon is independent (separate instances)
- **No port conflicts:** Moltbot never uses systemd (only PM2)
- **Independent logging:** Moltbot logs to `/tmp/moltbot/` (separate from other services)
- **Inotify limit:** System-wide increased to prevent file watcher exhaustion
### Monitoring
- **Automatic health checks:** Every 5 minutes (can be adjusted in `moltbot-health-check.timer`)
- **Logs:** `/tmp/moltbot-health-check.log` (separate from gateway logs)
- **Manual check:** `systemctl list-timers moltbot-health-check.timer`
### Monitoring & Control
```bash
pm2 list # List all processes managed by the current user's PM2 daemon
pm2 status # Same table as pm2 list (includes moltbot-gateway status and restart count)
pm2 restart moltbot-gateway # Restart gateway
pm2 logs moltbot-gateway # Real-time logs
pm2 logs moltbot-gateway -n 100 # Last 100 lines
```
---
@ -324,16 +336,25 @@ Apply: `sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf`
### Bot Not Responding
1. Check status: `systemctl status moltbot-gateway`
2. Check logs: `journalctl -u moltbot-gateway -n 50`
3. Check health: `bash /root/moltbot/scripts/health-check-gateway.sh`
4. Restart: `systemctl restart moltbot-gateway`
5. Check inotify limit (if file watching errors): `cat /proc/sys/fs/inotify/max_user_watches`
1. Check status: `pm2 status`
2. Check logs: `pm2 logs moltbot-gateway` or `pm2 logs moltbot-gateway -n 50`
3. Restart: `pm2 restart moltbot-gateway`
4. Check port: `nc -zv 127.0.0.1 18789`
5. Check inotify limit (if file watching errors): `cat /proc/sys/fs/inotify/max_user_watches` (should be 524288)
6. If stuck, force kill and restart:
```bash
killall -9 moltbot
pm2 restart moltbot-gateway
```
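
Where `nc` is not installed, the port check in step 4 can be approximated with bash's `/dev/tcp` redirection, the same technique the removed health-check script used. A minimal sketch: the function name `port_is_open` and its defaults are assumptions for illustration, and it needs bash plus coreutils `timeout`.

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability probe using bash's /dev/tcp pseudo-device.
# port_is_open is a hypothetical helper name; defaults mirror the gateway port in this README.

port_is_open() {
  local host="${1:-127.0.0.1}"
  local port="${2:-18789}"
  local wait_seconds="${3:-3}"

  # Returns 0 if a TCP connection succeeds within wait_seconds, non-zero otherwise.
  timeout "$wait_seconds" bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null
}
```

For example, `port_is_open && echo "gateway port accepts connections"`. A successful connection only shows that something is listening, not that the gateway is healthy.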
### Telegram Connection Error
Check logs for `setMyCommands failed` or network errors.
If command limit error: Verify `native: false` in both config files.
Check logs for `setMyCommands failed` or network errors:
```bash
pm2 logs moltbot-gateway | grep -i telegram
```
If command limit error: Verify `native: false` in `/root/.clawdbot/moltbot.json` and `/root/.clawdbot/agents/main/config.json`.
### High Latency (>1 minute)
@ -343,6 +364,14 @@ If consistent, check model health: `node dist/entry.js models status`
### Duplicate Responses
Check `streamMode: "block"` is set in `/root/.clawdbot/moltbot.json`.
### Gateway Crashes Frequently
Check PM2 restart count: `pm2 list | grep moltbot-gateway`
- If `↺` count is high (>10), check logs for root cause:
- Inotify exhaustion: `dmesg | grep -i inotify`
- Memory pressure: `pm2 logs moltbot-gateway | grep -i memory`
- Telegram errors: `pm2 logs moltbot-gateway | grep -i telegram`
If issue persists, reduce retry attempts in retry policy config.
---
@ -367,9 +396,7 @@ If issue persists, reduce retry attempts in retry policy config.
/root/moltbot/ Main installation
├── dist/ Compiled code (loaded at runtime)
├── src/ TypeScript source
├── scripts/
│ └── health-check-gateway.sh Health monitoring script
├── ecosystem.config.cjs PM2 config (legacy, not used)
├── ecosystem.config.cjs PM2 config (moltbot-gateway process)
└── README_Tech.md This file
~/.clawdbot/ Config directory
@ -379,25 +406,26 @@ If issue persists, reduce retry attempts in retry policy config.
│ └── auth-profiles.json API key storage
└── .env Environment variables
/etc/systemd/system/ System services
├── moltbot-gateway.service Systemd service (gateway)
├── moltbot-health-check.service Health check service
├── moltbot-health-check.timer Health check timer (runs every 5min)
└── ...other services (code-server, ssh, etc)
/root/.pm2/ PM2 daemon directory
├── pids/moltbot-gateway-0.pid Process ID file (moltbot-gateway)
└── logs/ PM2 logs directory
/etc/sysctl.d/ System configuration
└── 99-moltbot-inotify.conf Inotify limit config
/var/log/ System logs
└── moltbot-gateway.log Gateway application log
└── 99-moltbot-inotify.conf Inotify limit increase (filesystem watchers)
/tmp/moltbot/ Runtime logs
├── moltbot-*.log Detailed debug logs
└── moltbot-health-check.log Health check results
├── moltbot-*.log Detailed application debug logs
├── pm2-out.log PM2 stdout log
└── pm2-error.log PM2 stderr/error log
/etc/systemd/system/ Systemd services (NOT used for moltbot)
├── code-server.service Code editor (independent)
├── ssh.service SSH server (independent)
└── ...other services
```
---
**Last Updated:** Jan 29, 2026 (18:50 UTC)
**Last Updated:** Jan 29, 2026 (19:50 UTC)
**Maintained By:** Claude Code + Moltbot Task Router
**Latest:** Crash loop root cause fixed (inotify limit), health monitoring added, service isolation verified
**Latest:** Crash loop root cause fixed (inotify limit increased to 524288). PM2 process manager confirmed as correct, systemd conflicts removed. Gateway now running cleanly via PM2.


@ -1,109 +0,0 @@
#!/bin/bash
# Moltbot Gateway Health Check and Recovery Script
# Monitors gateway health, detects hangs, and initiates recovery
# Designed to run as a cronjob or systemd timer (not interfering with other services)
set -e
GATEWAY_PORT=18789
GATEWAY_HOST="127.0.0.1"
GATEWAY_WS="ws://${GATEWAY_HOST}:${GATEWAY_PORT}"
HEALTH_CHECK_TIMEOUT=10
MAX_LOCK_AGE=600 # 10 minutes in seconds
LOCK_FILES=(
~/.clawdbot/gateway.lock
~/.clawdbot/moltbot.lock
/tmp/moltbot-gateway.lock
)
LOG_FILE="/tmp/moltbot-health-check.log"
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Check if gateway process is responding
check_gateway_responsive() {
# Try to connect to gateway port
if timeout 3 bash -c "echo > /dev/tcp/${GATEWAY_HOST}/${GATEWAY_PORT}" 2>/dev/null; then
return 0 # Gateway is responding
else
return 1 # Gateway is not responding
fi
}
# Check for stale lock files
check_stale_locks() {
for lock_file in "${LOCK_FILES[@]}"; do
if [ -f "$lock_file" ]; then
file_age=$(($(date +%s) - $(stat -f%m "$lock_file" 2>/dev/null || stat -c%Y "$lock_file" 2>/dev/null)))
if [ "$file_age" -gt "$MAX_LOCK_AGE" ]; then
log "WARN: Stale lock file found: $lock_file (age: ${file_age}s)"
return 1 # Stale lock detected
fi
fi
done
return 0 # No stale locks
}
# Check if gateway is in crash loop
check_crash_loop() {
# Get restart count from systemd
restart_count=$(systemctl show moltbot-gateway.service -p NRestarts --value 2>/dev/null || echo "0")
if [ "$restart_count" -gt "10" ]; then
log "WARN: Gateway in potential crash loop (restart count: $restart_count)"
return 1
fi
return 0
}
# Clean stale lock files
cleanup_locks() {
log "Cleaning stale lock files..."
for lock_file in "${LOCK_FILES[@]}"; do
if [ -f "$lock_file" ]; then
rm -f "$lock_file" 2>/dev/null && log "Removed: $lock_file"
fi
done
}
# Graceful restart of gateway
restart_gateway() {
log "Initiating graceful gateway restart..."
systemctl restart moltbot-gateway.service
sleep 5
if check_gateway_responsive; then
log "Gateway restarted successfully"
return 0
else
log "ERROR: Gateway failed to respond after restart"
return 1
fi
}
# Main health check
main() {
log "Starting gateway health check..."
# Check if gateway is responsive
if ! check_gateway_responsive; then
log "ERROR: Gateway is not responding on port $GATEWAY_PORT"
# Check for stale locks or crash loop
if ! check_stale_locks || ! check_crash_loop; then
log "Detected lock/crash issue. Cleaning and restarting..."
cleanup_locks
restart_gateway
else
log "ERROR: Gateway unresponsive but no recovery needed. Manual intervention required."
exit 1
fi
else
log "Gateway is healthy and responsive"
return 0
fi
}
# Run health check
main
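
The stale-lock detection in the removed script can be isolated into a reusable check for manual recovery. The following is a minimal sketch, assuming GNU `stat` (Linux only, unlike the original's BSD fallback). The function names `lock_age_seconds` and `is_lock_stale` are hypothetical, and the 600-second default mirrors `MAX_LOCK_AGE` above.

```shell
#!/usr/bin/env bash
# Sketch: detect stale lock files by modification time (GNU stat assumed).
# Both function names are illustrative, not part of moltbot.

lock_age_seconds() {
  local lock_file="$1"
  local mtime now

  # stat -c %Y prints the last modification time as epoch seconds.
  mtime="$(stat -c %Y "$lock_file" 2>/dev/null)" || return 2
  now="$(date +%s)"
  echo $(( now - mtime ))
}

is_lock_stale() {
  local lock_file="$1"
  local max_age="${2:-600}"
  local age

  # A missing lock file is never considered stale.
  [ -f "$lock_file" ] || return 1
  age="$(lock_age_seconds "$lock_file")" || return 1
  [ "$age" -gt "$max_age" ]
}
```

A possible use is `is_lock_stale ~/.clawdbot/gateway.lock && echo "stale lock"`, run before deciding whether to remove the file and call `pm2 restart moltbot-gateway`.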