openclaw/README_Tech.md
valtterimelkko eec556c71e Fix: Resolve gateway crash loop and inotify exhaustion
Problem: Gateway was hung in 1200+ restart loop, causing Telegram bot to stop
responding. Root cause: system inotify file descriptor limit exhausted when
monitoring config/skill files.

Solutions implemented:

1. **Inotify limit increase** (/etc/sysctl.d/99-moltbot-inotify.conf)
   - Increased fs.inotify.max_user_watches from 65536 to 524288
   - Prevents "ENOSPC: System limit for number of file watchers reached"
   - Persistent across reboots

2. **Improved systemd service** (/etc/systemd/system/moltbot-gateway.service)
   - Changed Restart=always → Restart=on-failure
   - Increased RestartSec=5 → RestartSec=10 (reduce CPU churn)
   - Reduced StartLimitBurst=10 → StartLimitBurst=5
   - Added ExecStartPre to auto-clean stale locks on startup
   - Service remains isolated from other services (code-server, ssh, etc)

3. **Health check automation** (new files)
   - scripts/health-check-gateway.sh: detects hang/lock issues, auto-recovers
   - /etc/systemd/system/moltbot-health-check.service: runs health checks
   - /etc/systemd/system/moltbot-health-check.timer: runs every 5 minutes
   - Logs to /tmp/moltbot-health-check.log

4. **Documentation** (README_Tech.md)
   - Added section on crash loop root cause and preventative measures
   - Added Architecture section documenting service isolation
   - Updated troubleshooting with health check steps
   - Updated file locations with new monitoring files

Testing: Gateway now starts cleanly, health checks pass, other services
(code-server, ssh) remain unaffected. Timer runs every 5 minutes to prevent
future hangs.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-01-29 18:55:41 +00:00


Moltbot Technical Documentation

Format Guidelines for Contributors

  • Style: Concise, technical, action-oriented.
  • Brevity: One sentence per command/concept. Use bullet points, not paragraphs.
  • Problem Log: Keep entries short (problem → symptom → solution). Add date and who fixed it if known.
  • Commands: Always include the command first, explanation after (e.g., systemctl restart moltbot-gateway # Restarts the gateway service).
  • Sections: Group by topic. Use ## for major sections, ### for subsections.
  • Updates: When adding new problems/solutions, add to the end of the Problem Log section with date.


Process Architecture

Core Components

  1. Moltbot Gateway (moltbot-gateway)

    • Service: /etc/systemd/system/moltbot-gateway.service
    • Runs: /usr/bin/node dist/entry.js gateway --port 18789
    • Manager: systemd (isolated from PM2)
    • Handles: Telegram integration, message routing, model selection
  2. Supporting Processes

    • Dashboard (si_project/dashboard) - PM2 managed, separate from bot
    • AI Product Visualizer (ai_product_visualizer) - PM2 managed, separate from bot
    • Telegram Relay - Embedded in gateway (grammY framework)
    • Task-Type Router - Compiled TypeScript module in gateway
  3. Configuration Files

    • Global: /root/.clawdbot/moltbot.json
    • Agent-specific: /root/.clawdbot/agents/main/config.json
    • Environment: /root/.clawdbot/.env

Process Management

Moltbot Gateway (Systemd)

# Check status
systemctl status moltbot-gateway

# Restart (reloads config + code)
systemctl restart moltbot-gateway

# Stop gracefully
systemctl stop moltbot-gateway

# Start if stopped
systemctl start moltbot-gateway

# View live logs
journalctl -u moltbot-gateway -f

# View last 100 lines
journalctl -u moltbot-gateway -n 100

Auto-restart: Enabled. If the process exits with a failure, systemd restarts it after 10 seconds (Restart=on-failure, RestartSec=10; see Problem Log entry 7).

Boot persistence: Enabled. Starts automatically on system reboot.

From Telegram Chat

Send /restart command in Telegram to restart the bot gracefully without terminal access.

Dashboard (PM2)

# Check status
pm2 list

# Restart
pm2 restart dashboard

# Logs
pm2 logs dashboard

# Stop
pm2 stop dashboard

Isolation: Managed by PM2, separate from the systemd-managed gateway. Does not interfere with Moltbot.

Logs Location

# Moltbot systemd logs
journalctl -u moltbot-gateway -n 200

# Moltbot app logs (most detailed)
tail -f /var/log/moltbot-gateway.log

# Application debug logs
tail -f /tmp/moltbot/moltbot-*.log

Problem Log & Solutions

1. Duplicate Telegram Responses (Jan 28, 2026)

Problem: Bot sending same message 2-3 times.

Root Cause: streamMode: "partial" in Telegram config caused responses to stream as chunks, each sent separately.

Solution: Changed streamMode from "partial" to "block" in /root/.clawdbot/moltbot.json.

"telegram": {
  "streamMode": "block"  // Single unified message
}

Status: Fixed. Single responses now.


2. Unknown Model Error (Jan 28, 2026)

Problem: Error: Unknown model: openrouter/mistralai/mistral-devstral-2

Root Cause: Incorrect OpenRouter model ID format. Used old naming convention.

Solution: Updated model IDs to correct OpenRouter format:

  • mistralai/devstral-2512 (Mistral Devstral 2)
  • google/gemini-2.0-flash-001 (Gemini 2.0 Flash)
  • meta-llama/llama-3.3-70b-instruct:free (Llama 3.3 70B)

Status: Fixed. Models now load correctly.


3. PM2 Process Isolation Conflict (Jan 28, 2026)

Problem: Dashboard PM2 restarting 140+ times. Gateway conflicting with dashboard in same PM2 daemon.

Root Cause: Moltbot gateway was added to default PM2 instance, sharing resources with dashboard.

Solution: Moved Moltbot from PM2 to systemd service (isolated).

  • Moltbot: systemd only
  • Dashboard: PM2 only
  • No shared daemon = no conflicts

Status: Fixed. Processes now isolated.

Files changed:

  • Created: /etc/systemd/system/moltbot-gateway.service
  • Removed: Moltbot from PM2 list

4. Missing Task-Type Router Compilation (Jan 28, 2026)

Problem: Bot said it implemented task-type routing but nothing changed.

Root Cause: TypeScript source files modified but not compiled to dist/.

Solution:

  1. Fixed import error in src/agents/task-type-router.ts (DEFAULT_PROVIDER location)
  2. Compiled: npm run build
  3. Restarted gateway to load new dist/ code

Status: Fixed. Task-type router now active.
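
The failure mode above (source edited, dist/ not rebuilt) can be caught before restarting. The following is a minimal sketch, assuming dist/entry.js is a reasonable stand-in for the whole build output; dist_is_stale is a hypothetical helper, not part of Moltbot.

```shell
# Hypothetical helper (not part of Moltbot): succeed when any TypeScript
# source file is newer than the compiled entry point, i.e. dist/ is stale.
dist_is_stale() (
  src_dir="$1"
  built_file="$2"
  # A missing build output always counts as stale.
  [ -f "$built_file" ] || exit 0
  # find prints every .ts file modified after the build output.
  newer="$(find "$src_dir" -type f -name '*.ts' -newer "$built_file" | head -n 1)"
  [ -n "$newer" ]
)

# Usage (run from /root/moltbot):
#   dist_is_stale src dist/entry.js && npm run build && systemctl restart moltbot-gateway
```

Comparing modification times is only a heuristic: it will not notice a build that ran but failed partway.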


5. Telegram Command Limit Exceeded (Jan 29, 2026)

Problem: Error: setMyCommands failed: BOT_COMMANDS_TOO_MUCH (Telegram API limit = 100 commands).

Root Cause: Both config files had "native": "auto" trying to register all skills + commands with Telegram.

Solution: Disabled native command auto-registration:

// /root/.clawdbot/moltbot.json
"commands": {
  "native": false,
  "nativeSkills": false
}

// /root/.clawdbot/agents/main/config.json
"commands": {
  "native": false,
  "text": true,
  "restart": true
}

Status: Fixed. Telegram now connects without errors.
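
Because the setting must be off in both files, a quick check helps after any config edit. This is a rough sketch that uses grep so it does not depend on a JSON tool being installed; native_commands_disabled is a hypothetical name, and the pattern assumes the key is written as "native" in double quotes.

```shell
# Hypothetical helper: succeed when the given config file contains
# "native": false. The quoted key does not match "nativeSkills".
native_commands_disabled() {
  grep -Eq '"native"[[:space:]]*:[[:space:]]*false' "$1"
}

# Usage: report any config file that still auto-registers commands.
#   for f in /root/.clawdbot/moltbot.json /root/.clawdbot/agents/main/config.json; do
#     native_commands_disabled "$f" || echo "native commands still enabled in $f"
#   done
```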


6. Node.js Version Too Old (Jan 28, 2026)

Problem: Moltbot requires Node.js 24+ but only v20 was installed.

Root Cause: package.json specifies engines: { node: ">=24" }.

Solution: Upgraded Node.js:

curl -fsSL https://deb.nodesource.com/setup_24.x | sudo -E bash -
sudo apt-get install -y nodejs

Verified: node --version → v24.13.0

Status: Fixed.
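
The version requirement can also be checked from a script before building. A minimal sketch, assuming the version string has the usual vMAJOR.MINOR.PATCH shape printed by node --version; node_major_at_least is a hypothetical helper.

```shell
# Hypothetical helper: succeed when a version string such as "v24.13.0"
# has a major version of at least the given minimum.
node_major_at_least() (
  version="${1#v}"        # strip the leading "v"
  major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge "$2" ] 2>/dev/null
)

# Usage:
#   node_major_at_least "$(node --version)" 24 || echo "Node.js 24+ required"
```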


7. Gateway Crash Loop & Inotify Exhaustion (Jan 29, 2026)

Problem: Gateway hung and became unresponsive. Systemd restarted it 1203+ times. Telegram bot stopped responding.

Symptoms:

  • Port 18789 is already in use (but port handler didn't properly clean up)
  • Gateway failed to start: gateway already running (pid 618450); lock timeout after 5000ms
  • Lock files stale/not released

Root Cause: System hit the inotify watch limit (fs.inotify.max_user_watches), reported as ENOSPC:

Error: ENOSPC: System limit for number of file watchers reached, watch '/root/.moltbot/moltbot.json'
Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd/canvas'
Error: ENOSPC: System limit for number of file watchers reached, watch '/root/clawd'

Gateway couldn't monitor config/skill files for changes → config reloading broke → became hung/unresponsive → systemd restart loop.

Solutions:

  1. Immediate fix: Kill stuck process + clean lock files
kill -9 618450
rm -f ~/.clawdbot/gateway.lock ~/.clawdbot/moltbot.lock
systemctl restart moltbot-gateway
  2. Permanent inotify limit increase: /etc/sysctl.d/99-moltbot-inotify.conf
# Increased from 65536 (comments must be on their own line in sysctl.d files)
fs.inotify.max_user_watches=524288

Apply: sysctl -p /etc/sysctl.d/99-moltbot-inotify.conf

  3. Better systemd service: /etc/systemd/system/moltbot-gateway.service

    • Changed Restart=always → Restart=on-failure (only restart on actual failure)
    • Increased RestartSec=5 → RestartSec=10 (reduce CPU churn)
    • Reduced StartLimitBurst=10 → StartLimitBurst=5 (fewer restart attempts before blocking)
    • Added ExecStartPre to auto-clean stale locks on startup
  4. Health check monitoring: /etc/systemd/system/moltbot-health-check.{service,timer}

    • Runs /root/moltbot/scripts/health-check-gateway.sh every 5 minutes
    • Checks if gateway is responding on port 18789
    • Detects stale lock files and crash loops
    • Automatically cleans locks and restarts if needed
    • Isolated: Does not interfere with other services (code-server, ssh, etc.)
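
The restart-policy changes listed above can be sketched as a unit-file fragment. The values come from this entry; the exact ExecStartPre command and the placement of StartLimitBurst under [Unit] (where current systemd versions document it) are assumptions, since the deployed unit file is not reproduced here. print_restart_policy is a hypothetical helper that only prints the fragment.

```shell
# Print a unit-file fragment with the restart policy described above.
# The lock paths are taken from the manual fix; the ExecStartPre line
# is an assumed form, not a copy of the deployed unit.
print_restart_policy() {
  cat <<'EOF'
[Unit]
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=10
ExecStartPre=/bin/rm -f /root/.clawdbot/gateway.lock /root/.clawdbot/moltbot.lock
EOF
}

# Usage: compare the fragment with what systemd has actually loaded.
#   print_restart_policy
#   systemctl show moltbot-gateway -p Restart -p RestartUSec -p StartLimitBurst
```

With Restart=on-failure, a clean exit (for example after systemctl stop) no longer triggers a restart, which is what breaks the loop when the gateway shuts itself down deliberately.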

Key Files Added/Modified:

  • Created: scripts/health-check-gateway.sh (health check logic)
  • Created: /etc/systemd/system/moltbot-health-check.service
  • Created: /etc/systemd/system/moltbot-health-check.timer
  • Created: /etc/sysctl.d/99-moltbot-inotify.conf
  • Modified: /etc/systemd/system/moltbot-gateway.service (restart policy)

Status: Fixed. All preventative measures in place.
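
For orientation, the two checks the health script performs (port responding, stale lock) might look roughly like the sketch below. This is not the contents of scripts/health-check-gateway.sh: the function names are hypothetical, and it assumes the lock file stores the owning PID on its first line.

```shell
# Hypothetical sketch; not the actual health-check script.

# Succeed when the lock file exists but names no live process.
# Assumes the first line of the lock file holds the owner's PID.
lock_is_stale() (
  lock_file="$1"
  [ -f "$lock_file" ] || exit 1
  pid="$(head -n 1 "$lock_file" | tr -cd '0-9')"
  # A lock without a readable PID is treated as stale.
  [ -n "$pid" ] || exit 0
  # kill -0 sends no signal; it only tests whether the PID exists.
  # Intended to run as root, so a permission error is not mistaken for "dead".
  ! kill -0 "$pid" 2>/dev/null
)

# Succeed when something is listening on the given TCP port (uses ss from iproute2).
port_is_listening() {
  ss -ltn "sport = :$1" 2>/dev/null | grep -q LISTEN
}

# Usage: recover only when the gateway is down or its lock is orphaned.
#   if ! port_is_listening 18789 || lock_is_stale /root/.clawdbot/gateway.lock; then
#     rm -f /root/.clawdbot/gateway.lock /root/.clawdbot/moltbot.lock
#     systemctl restart moltbot-gateway
#   fi
```

A listening port only shows that the socket is open; a hung process can still hold it, so a real check may also need an application-level probe.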


Configuration Summary

Model Fallback Chain

Primary: Mistral Devstral 2 (mistralai/devstral-2512, agentic specialist)

Fallbacks:

  1. Gemini 2.0 Flash (long-context, 1M tokens)
  2. Llama 3.3 70B (creative/pedagogical)
  3. Moonshot Kimi K2.5 (language model)
  4. Claude Sonnet 4.5 (escalation)
  5. Claude Opus 4.5 (complex reasoning)

Task-Type Routing

  • File Analysis → Gemini Flash
  • Creative Content → Llama 3.3 70B
  • Debugging → Claude Sonnet 4.5
  • CLI/Commands → Mistral Devstral 2
  • General → Mistral Devstral 2 (default)
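
The routing table above amounts to a lookup with a default. The sketch below only illustrates that shape; the real router is a compiled TypeScript module, and the task-type labels used here (file-analysis, creative-content, and so on) are hypothetical identifiers.

```shell
# Illustrative only: mirrors the table above, not the TypeScript router.
# Task-type labels are hypothetical.
route_model() {
  case "$1" in
    file-analysis)    echo "Gemini Flash" ;;
    creative-content) echo "Llama 3.3 70B" ;;
    debugging)        echo "Claude Sonnet 4.5" ;;
    cli-commands)     echo "Mistral Devstral 2" ;;
    *)                echo "Mistral Devstral 2" ;;  # General (default)
  esac
}

# Usage:
#   route_model debugging
```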

Telegram Settings

  • Streaming Mode: block (single message per response)
  • Commands Native: false (avoid API limit)
  • Restart Command: true (allows /restart from chat)
  • User ID Allowlist: 876311493 (owner only)


Architecture: Service Isolation & Stability

Systemd Services Running on This Host

  • moltbot-gateway.service - Telegram bot gateway (isolated, does not affect others)
  • moltbot-health-check.timer - Periodic gateway health monitoring (oneshot service, no resource hoarding)
  • code-server.service - Code editor (independent, unaffected)
  • ssh.service - SSH server (independent, unaffected)

Safety Design

  • No shared resources: Each service runs independently
  • No resource limits affecting others: Moltbot has LimitNOFILE/LimitNPROC set in its own unit only
  • Health check is isolated: Runs as a oneshot service that exits after each check, so it does not stay resident alongside the gateway
  • No interference with startup/shutdown: Services can be restarted independently

Monitoring

  • Automatic health checks: Every 5 minutes (can be adjusted in moltbot-health-check.timer)
  • Logs: /tmp/moltbot-health-check.log (separate from gateway logs)
  • Manual check: systemctl list-timers moltbot-health-check.timer
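
To show where the interval lives, here is a sketch of what a 5-minute timer unit could look like. Only the 5-minute interval and the unit names come from this document; the Description text, the OnBootSec value, and the choice of OnUnitActiveSec over OnCalendar are assumptions. print_health_check_timer is a hypothetical helper that only prints the unit.

```shell
# Print an example timer unit. OnUnitActiveSec is the line to change
# when adjusting the interval; OnBootSec=2min is an assumed value.
print_health_check_timer() {
  cat <<'EOF'
[Unit]
Description=Run the Moltbot gateway health check periodically

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=moltbot-health-check.service

[Install]
WantedBy=timers.target
EOF
}

# After editing the real timer file, reload and restart it:
#   systemctl daemon-reload
#   systemctl restart moltbot-health-check.timer
```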

Quick Troubleshooting

Bot Not Responding

  1. Check status: systemctl status moltbot-gateway
  2. Check logs: journalctl -u moltbot-gateway -n 50
  3. Check health: bash /root/moltbot/scripts/health-check-gateway.sh
  4. Restart: systemctl restart moltbot-gateway
  5. Check inotify limit (if file watching errors): cat /proc/sys/fs/inotify/max_user_watches
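
Reading the limit alone does not show how close the host is to it. The sketch below estimates current usage by counting inotify entries under /proc; both function names are hypothetical. It counts watches across all processes, whereas the limit is per user, so on a host where nearly everything runs as root it is a reasonable approximation rather than an exact figure.

```shell
# Count inotify watches currently registered by all visible processes.
# Each watch appears as a line starting with "inotify" in /proc/<pid>/fdinfo/<fd>.
# Run as root to see every process.
count_inotify_watches() {
  cat /proc/[0-9]*/fdinfo/* 2>/dev/null | grep -c '^inotify'
}

# Integer percentage of the limit in use (rounded down).
watch_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Usage:
#   used="$(count_inotify_watches)"
#   limit="$(cat /proc/sys/fs/inotify/max_user_watches)"
#   echo "inotify watches: $used / $limit ($(watch_usage_percent "$used" "$limit")%)"
```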

Telegram Connection Error

Check logs for setMyCommands failed or network errors. If command limit error: Verify native: false in both config files.

High Latency (>1 minute)

Expected for first API call to OpenRouter. Check OpenRouter API status. If consistent, check model health: node dist/entry.js models status

Duplicate Responses

Check streamMode: "block" is set in /root/.clawdbot/moltbot.json. If issue persists, reduce retry attempts in retry policy config.


Deployment Checklist

  • Node.js 24+ installed
  • Moltbot cloned and built (npm run build)
  • Systemd service created and enabled
  • Config files populated (moltbot.json, agents/main/config.json)
  • API keys in environment or .env
  • Telegram bot token configured
  • Gateway started: systemctl start moltbot-gateway
  • Telegram connection verified: node dist/entry.js channels status
  • Test message sent in Telegram

Key File Locations

/root/moltbot/                          Main installation
├── dist/                               Compiled code (loaded at runtime)
├── src/                                TypeScript source
├── scripts/
│   └── health-check-gateway.sh         Health monitoring script
├── ecosystem.config.cjs                PM2 config (legacy, not used)
└── README_Tech.md                      This file

~/.clawdbot/                            Config directory
├── moltbot.json                        Global gateway config
├── agents/main/
│   ├── config.json                     Agent-specific config
│   └── auth-profiles.json              API key storage
└── .env                                Environment variables

/etc/systemd/system/                    System services
├── moltbot-gateway.service             Systemd service (gateway)
├── moltbot-health-check.service        Health check service
├── moltbot-health-check.timer          Health check timer (runs every 5min)
└── ...other services (code-server, ssh, etc)

/etc/sysctl.d/                          System configuration
└── 99-moltbot-inotify.conf            Inotify limit config

/var/log/                               System logs
└── moltbot-gateway.log                 Gateway application log

/tmp/                                   Runtime logs
├── moltbot/moltbot-*.log               Detailed debug logs
└── moltbot-health-check.log            Health check results

Last Updated: Jan 29, 2026 (18:50 UTC)
Maintained By: Claude Code + Moltbot Task Router
Latest: Crash loop root cause fixed (inotify limit), health monitoring added, service isolation verified