openclaw/HIGH-AVAILABILITY.md
Claude Code e274d4d781 feat: add high availability and automation (v2.2)
This commit adds comprehensive high availability, disaster recovery,
and automation capabilities for enterprise-grade deployment.

High Availability Features:
- Keepalived integration for Virtual IP (38.14.254.100)
- Automatic failover monitoring and recovery
- PostgreSQL streaming replication support
- Health check scripts with auto-restart
- State change notifications

Disaster Recovery:
- Complete system backup script (database, configs, Docker volumes)
- Automated backup with retention policies
- Recovery manifest with step-by-step instructions
- Off-site backup support (S3, rsync ready)

Automation Tools:
- auto-deploy-server.sh - Deploy to remote server from local
- auto-deploy-server.bat - Windows version with WSL/Git Bash support
- deploy-oneclick.sh - One-click deployment on fresh server
- docker-compose-full.yml - Complete containerized stack

Container Orchestration:
- Full Docker Compose setup with all services
- Service dependencies and health checks
- Persistent volumes for data
- Network isolation with dedicated network
- Production-ready configuration

Deployment Automation:
- Automated dependency installation
- Database initialization with tables and indexes
- Monitoring stack auto-deployment
- Service auto-start via systemd
- Firewall auto-configuration
- Cron job automation

New Services:
- moltbot-failover.service - Auto-recovery monitor
- moltbot-metrics.service - Metrics exporter (9101)
- moltbot-log-analyzer.service - Log aggregation (9102)
- keepalived.service - VIP management

Documentation:
- HIGH-AVAILABILITY.md - Complete HA and automation guide

Architecture Improvements:
- Virtual IP for transparent failover
- Health-based service routing
- Automated disaster recovery backups
- Zero-touch server deployment
- Complete container orchestration support

Service Ports:
- Database API: 18800
- Metrics Exporter: 9101
- Log Analyzer: 9102
- Virtual IP: 38.14.254.100

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-29 20:17:59 +08:00

# 🏗️ Moltbot High Availability and Automation Guide

**Version**: v2.2

**Last updated**: 2026-01-29

---

## 📋 High Availability (HA) Architecture

### Architecture Overview
```
              ┌───────────────────┐
              │    Virtual IP     │
              │  (38.14.254.100)  │
              └─────────┬─────────┘
           ┌────────────┴────────────┐
           │                         │
    ┌──────▼──────┐           ┌──────▼──────┐
    │   Master    │           │   Backup    │
    │   Server    │           │   Server    │
    │             │           │             │
    │  Gateway    │           │  Gateway    │
    │  PostgreSQL │           │  PostgreSQL │
    │  Monitoring │           │  Monitoring │
    └──────┬──────┘           └──────┬──────┘
           │                         │
           └────────────┬────────────┘
           ┌────────────▼───────────┐
           │     Shared Storage     │
           │       (Optional)       │
           └────────────────────────┘
```
---
## 🚀 Quick Start

### One-Click Deployment on a New Server

Run the following on a fresh server:
```bash
# Option 1: using curl
curl -fsSL https://raw.githubusercontent.com/flowerjunjie/moltbot/main/deploy-oneclick.sh | bash
# Option 2: using git
git clone https://github.com/flowerjunjie/moltbot.git /opt/moltbot
cd /opt/moltbot
bash deploy-oneclick.sh
```
### Remote Server Deployment

Deploy from a local machine to a remote server:
```bash
# Linux/Mac
bash auto-deploy-server.sh root@192.168.1.100
# Windows
auto-deploy-server.bat root@192.168.1.100
```
---
## 🔧 High Availability Components

### 1. Keepalived (Virtual IP)

**Purpose**: automatic failover and virtual IP management

**Installation**:
```bash
apt-get install keepalived
```
**Configuration file**: `/etc/keepalived/keepalived.conf`
```conf
vrrp_script chk_moltbot_gateway {
    script "curl -f http://localhost:18789 || exit 1"
    interval 2
    weight 2
}
vrrp_instance VI_MOLTBOT {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass moltbot2024
    }
    virtual_ipaddress {
        38.14.254.100/24
    }
    track_script {
        chk_moltbot_gateway
    }
}
```
**Status check**:
```bash
systemctl status keepalived
ip addr show eth0 | grep 38.14.254.100
```
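The configuration above covers only the master node. As a rough sketch, the standby's file can be derived from it by switching `state` to `BACKUP` and lowering `priority`. The value 99 and the helper name `make_backup_conf` are assumptions, not taken from the repository. The gap is kept smaller than the `weight 2` bonus on purpose: with a positive weight, Keepalived adds 2 to a node's priority while its check passes, so a master whose check fails (100) drops below a healthy standby (99 + 2 = 101).

```bash
#!/usr/bin/env bash
# Sketch: derive a BACKUP-node keepalived.conf from the MASTER config.
# Assumption: the standby differs only in `state` and `priority`;
# the default of 99 is illustrative, not taken from the repository.

make_backup_conf() {
  local backup_priority="${1:-99}"
  # Reads the master config on stdin, writes the standby config to stdout.
  sed -E \
    -e 's/^([[:space:]]*state[[:space:]]+)MASTER/\1BACKUP/' \
    -e "s/^([[:space:]]*priority[[:space:]]+)[0-9]+/\1${backup_priority}/"
}

# Usage (on the standby server, with a copy of the master's file):
#   make_backup_conf 99 < keepalived.master.conf > /etc/keepalived/keepalived.conf
```

All other settings, including `virtual_router_id` and `auth_pass`, must stay identical on both nodes.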
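### 2. Automatic Failover

**Script**: `/usr/local/bin/moltbot-failover.sh`

**Features**:
- Health checks (every 10 seconds)
- Automatic restart of failed services
- Failure counting with a threshold
- Logging

**Service**: `moltbot-failover.service`

**Enable**: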
```bash
systemctl enable moltbot-failover
systemctl start moltbot-failover
```
**View logs**:
```bash
journalctl -u moltbot-failover -f
cat /var/log/moltbot-failover.log
```
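The failover script itself is not reproduced in this guide. The following is only a minimal sketch of how a health check with a failure counter and restart threshold could look. The threshold of 3, the function names, and the unit name `moltbot-gateway` are assumptions; the 10-second interval, port 18789, and the log path come from this document.

```bash
#!/usr/bin/env bash
# Sketch of a health-check loop with failure counting (not the shipped script).
# Assumptions: threshold of 3 and the unit name moltbot-gateway.

FAIL_THRESHOLD="${FAIL_THRESHOLD:-3}"
fail_count=0
action="ok"

# Record one probe result (0 = healthy) and set $action to ok, wait, or restart.
record_result() {
  if (( $1 == 0 )); then
    fail_count=0
    action="ok"
    return
  fi
  fail_count=$(( fail_count + 1 ))
  if (( fail_count >= FAIL_THRESHOLD )); then
    fail_count=0          # start counting again after the restart
    action="restart"
  else
    action="wait"
  fi
}

# Probe the gateway every 10 seconds and restart it after repeated failures.
monitor_loop() {
  while true; do
    curl -fsS --max-time 5 -o /dev/null "http://localhost:18789"
    record_result "$?"
    if [[ "$action" == "restart" ]]; then
      echo "$(date '+%F %T') gateway unhealthy, restarting" >> /var/log/moltbot-failover.log
      systemctl restart moltbot-gateway
    fi
    sleep 10
  done
}
```

Requiring several consecutive failures before restarting avoids reacting to a single slow response; calling `monitor_loop` at the end of the script would start the monitor.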
### 3. PostgreSQL Streaming Replication

**Configuration**: `/etc/postgresql/14/main/conf.d/replication.conf`

**Set up the primary**:
```sql
-- Create the replication user
CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'replicator_pass';
-- Create a replication slot
SELECT * FROM pg_create_physical_replication_slot('replica_slot');
```
**Set up the replica**:
```bash
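# On the replica: clone the primary. -R writes standby.signal and the connection
# settings; -S streams from the replication slot created above.
pg_basebackup -h master-server -D /var/lib/postgresql/data -P -U replicator --wal-method=stream -R -S replica_slot
# PostgreSQL 12 and later no longer read recovery.conf, and standby_mode was removed.
# -R stores the settings in postgresql.auto.conf, equivalent to:
#   primary_conninfo = 'host=master-server port=5432 user=replicator'
#   primary_slot_name = 'replica_slot'
# Optional WAL archive fallback (set in postgresql.conf):
#   restore_command = 'cp /var/lib/postgresql/archive/%f %p'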
```
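Once the replica is running, it helps to confirm on the primary that it is actually streaming. The sketch below is one possible check, not part of the repository: the function names and the 16 MiB lag threshold are assumptions, while `pg_stat_replication`, `pg_current_wal_lsn()` and `pg_wal_lsn_diff()` are standard PostgreSQL 10+ objects.

```bash
#!/usr/bin/env bash
# Sketch: verify on the primary that a standby is streaming and not far behind.
# Assumptions: function names and the 16 MiB threshold are illustrative.
# Run as a user allowed to connect locally, for example postgres.

# Print the largest replay lag in bytes, or -1 if no standby is streaming.
replication_lag_bytes() {
  psql -X -At -d postgres -c \
    "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), -1)
       FROM pg_stat_replication WHERE state = 'streaming';"
}

# Return 0 when a standby is streaming and its lag is within the limit.
replication_healthy() {
  local max_bytes="${1:-16777216}"
  local lag
  lag="$(replication_lag_bytes)" || return 1
  lag="${lag%%.*}"                       # numeric result; drop any fraction
  [[ "$lag" =~ ^[0-9]+$ ]] && (( lag <= max_bytes ))
}
```

A call such as `replication_healthy || echo "replication problem"` could be added to a cron job or to the failover monitor.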
### 4. Disaster Recovery Backup

**Script**: `/usr/local/bin/moltbot-dr-backup.sh`

**Backup contents**:
- Full PostgreSQL dump
- Configuration files
- Docker volume data
- System package list
- Firewall rules

**Run a backup**:
```bash
/usr/local/bin/moltbot-dr-backup.sh
```
**Backup location**: `/opt/moltbot-backup/disaster-recovery/`

**Automatic backups**: every Sunday at 03:00
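
The weekly schedule and a retention policy could be wired up roughly as follows. This is a sketch under assumptions: the 30-day window, the `*.gz` file pattern, and the helper name `prune_old_backups` are illustrative and not taken from the backup script.

```bash
#!/usr/bin/env bash
# Sketch: prune disaster-recovery archives older than a retention window.
# Assumptions: the 30-day default and the *.gz naming are illustrative.
#
# A crontab entry matching "every Sunday at 03:00" would be:
#   0 3 * * 0 /usr/local/bin/moltbot-dr-backup.sh

prune_old_backups() {
  local dir="$1" keep_days="${2:-30}"
  [[ -d "$dir" ]] || return 1
  # -mtime +N selects files whose age in whole days is greater than N.
  find "$dir" -maxdepth 1 -type f -name '*.gz' -mtime +"$keep_days" -print -delete
}

# Usage:
#   prune_old_backups /opt/moltbot-backup/disaster-recovery 30
```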
---
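## 🤖 Automation Tools

### 1. Automated Deployment Tool

**Files**: `auto-deploy-server.sh` (Linux) / `auto-deploy-server.bat` (Windows)

**Features**:
- Installs all dependencies automatically
- Configures the database
- Deploys the monitoring stack
- Sets up the firewall
- Configures scheduled tasks

**Usage**: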
```bash
# Deploy to a new server
bash auto-deploy-server.sh root@192.168.1.100
```
### 2. One-Click Deployment Script

**File**: `deploy-oneclick.sh`

**Scenario**: run directly on a fresh server

**Usage**:
```bash
# SSH into the server
ssh root@your-server
# Run the deployment
curl -fsSL https://raw.githubusercontent.com/flowerjunjie/moltbot/main/deploy-oneclick.sh | bash
```
**Deployment time**: about 5-10 minutes

### 3. Container Orchestration Support

**File**: `docker-compose-full.yml`

**Included services**:
- Moltbot Gateway
- Database API
- PostgreSQL
- Redis
- Prometheus
- Grafana
- Node Exporter
- Metrics Exporter
- Log Analyzer
- Nginx
**Start**:
```bash
docker-compose -f docker-compose-full.yml up -d
```
---
## 📊 Monitoring and Alerting

### Service Ports

| Service | Port | Description |
|---------|------|-------------|
| Database API | 18800 | REST API |
| Metrics | 9101 | Prometheus metrics |
| Log Analyzer | 9102 | Log analysis API |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Visualization |

### Health Check Endpoints
```bash
# Database API
curl http://localhost:18800/api/health
# Metrics
curl http://localhost:9101/metrics
# Log summary
curl http://localhost:9102/api/logs/summary
# Service status
curl http://localhost:18800/api/devices
```
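
For routine checks, the individual `curl` calls above can be wrapped in a small loop that reports each endpoint as UP or DOWN. This is a sketch, not a shipped tool: the function names are hypothetical, and treating any HTTP status below 400 as healthy is an assumption.

```bash
#!/usr/bin/env bash
# Sketch: probe each documented endpoint and report UP/DOWN.
# Assumption: an HTTP status below 400 (curl -f) means the service is healthy.

ENDPOINTS=(
  "http://localhost:18800/api/health"
  "http://localhost:9101/metrics"
  "http://localhost:9102/api/logs/summary"
)

# Default probe; redefine this function to add headers or authentication.
probe() {
  curl -fsS --max-time 3 -o /dev/null "$1"
}

# Print one status line per URL; return 1 if any endpoint is down.
check_endpoints() {
  local url failed=0
  for url in "$@"; do
    if probe "$url" 2>/dev/null; then
      echo "UP   $url"
    else
      echo "DOWN $url"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_endpoints "${ENDPOINTS[@]}"
```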
---
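## 🛠️ Maintenance

### Routine Maintenance

**Check service status**:
```bash
# All Moltbot services (quote the pattern so the shell does not expand it)
systemctl status 'moltbot-*'
# Docker containers
docker ps
# Monitoring stack
cd /opt/moltbot-monitoring && docker-compose ps
```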
**View logs**:
```bash
# Service logs
journalctl -u moltbot-db-api -f
journalctl -u moltbot-failover -f
# Application log
tail -f /var/log/moltbot-failover.log
```
### Backup Operations

**Manual backup**:
```bash
# Database backup
/usr/local/bin/moltbot-backup-auto.sh
# Disaster recovery backup
/usr/local/bin/moltbot-dr-backup.sh
```
**Restore the database**:
```bash
# List backups
ls -lh /opt/moltbot-backup/database/daily/
# Restore the latest backup
gunzip -c /opt/moltbot-backup/database/daily/moltbot_latest.sql.gz | psql -d moltbot
```
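Before piping a dump into `psql`, it is worth confirming that the archive is present and intact, since a truncated file would otherwise fail partway through the restore. A minimal sketch, with `verify_backup` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: refuse to restore from a missing, empty, or corrupt gzip dump.
# Assumption: verify_backup is an illustrative helper, not a shipped script.

verify_backup() {
  local file="$1"
  # -s is true only for an existing file with a size greater than zero.
  [[ -s "$file" ]] || { echo "missing or empty: $file" >&2; return 1; }
  # gzip -t checks the compressed stream without writing any output.
  gzip -t "$file" 2>/dev/null || { echo "corrupt archive: $file" >&2; return 1; }
}

# Usage:
#   backup=/opt/moltbot-backup/database/daily/moltbot_latest.sql.gz
#   verify_backup "$backup" && gunzip -c "$backup" | psql -d moltbot
```

This only validates the gzip container; it does not prove that the SQL inside restores cleanly, so periodic test restores are still needed.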
### Troubleshooting

**A service fails to start**:
```bash
# Check whether the port is already in use
netstat -tlnp | grep <port>
# Check the logs
journalctl -u <service> -n 50
# Restart the service
systemctl restart <service>
```
**Keepalived problems**:
```bash
# Validate the configuration
keepalived -t
# Follow the logs
journalctl -u keepalived -f
# Check the virtual IP
ip addr show eth0
```
---
## 🔐 Security Configuration

### Firewall Rules

**View the current rules**:
```bash
iptables -L -n -v
```
**Add a rule**:
```bash
iptables -A INPUT -p tcp --dport 18789 -s 192.168.1.0/24 -j ACCEPT
netfilter-persistent save
```
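### Security Recommendations

1. **Use key-based authentication**: disable password login
2. **Configure fail2ban**: protect against brute-force attacks
3. **Update regularly**: `apt-get update && apt-get upgrade`
4. **Monitor logs**: review them regularly for unusual access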
---
## 📈 Performance Optimization

### System Tuning

**Run the optimization script**:
```bash
/usr/local/bin/moltbot-optimize.sh
```
**What it tunes**:
- Network parameters
- PostgreSQL configuration
- Docker resource limits
- Log rotation
### Performance Monitoring

**View system metrics**:
```bash
# CPU
top -bn1 | grep "Cpu(s)"
# Memory
free -h
# Disk
df -h
# Load average
cat /proc/loadavg
```
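
The manual checks above can also be turned into a simple threshold test, for example for a cron job. The sketch below covers disk usage only; the 85% default and both function names are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: warn when a filesystem crosses a usage threshold.
# Assumptions: the 85% default and the function names are illustrative.

# Print the used-space percentage (integer, no % sign) for a mount point.
disk_usage_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is below the limit, 1 otherwise (with a warning on stderr).
disk_below_threshold() {
  local path="${1:-/}" limit="${2:-85}"
  local used
  used="$(disk_usage_pct "$path")"
  [[ "$used" =~ ^[0-9]+$ ]] || return 1
  if (( used >= limit )); then
    echo "WARNING: $path is ${used}% full (limit ${limit}%)" >&2
    return 1
  fi
}

# Usage:
#   disk_below_threshold / 85 || echo "disk space low on /"
```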
---
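## 🚨 Emergency Response

### All Services Down

1. **Check the server status**
```bash
ping <server-ip>
ssh root@<server-ip> "systemctl status 'moltbot-*'"
```
2. **Start the critical services**
```bash
systemctl start moltbot-db-api
systemctl start moltbot-gateway
```
3. **Fail over to the backup server** (if HA is configured)
```bash
# The backup server is promoted to master automatically
# The virtual IP migrates automatically
```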
### Database Corruption

1. **Restore from backup**
```bash
latest=$(ls -t /opt/moltbot-backup/disaster-recovery/pg_all_*.sql.gz | head -n 1)
gunzip -c "$latest" | psql -d postgres
```
2. **Check data integrity**
```bash
psql -d moltbot -c "SELECT COUNT(*) FROM conversations;"
psql -d moltbot -c "SELECT COUNT(*) FROM devices;"
```
### Network Problems

1. **Check network connectivity**
```bash
ping 8.8.8.8
traceroute 8.8.8.8
```
2. **Check the firewall**
```bash
iptables -L -n
ufw status
```
---
## 📚 Related Documentation

- `DEPLOYMENT-COMPLETE.md` - Complete deployment guide
- `EXTENSIONS.md` - Extensions documentation
- `ROADMAP.md` - Feature roadmap
- `docker-compose-full.yml` - Container orchestration configuration
---
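## 🎯 Best Practices

1. **Test backup restoration regularly**
   - Run through the disaster recovery procedure once a month
   - Verify backup integrity
2. **Monitoring and alerting**
   - Configure email or webhook alerts
   - Set sensible alert thresholds
3. **Keep documentation current**
   - Record every configuration change
   - Maintain the operations runbook
4. **Capacity planning**
   - Monitor resource usage trends
   - Plan capacity increases ahead of time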
---
**🎉 High availability and automation setup complete!**