openclaw/docs/providers/llmproxy.md
ztao 99d571cab1 docs: add LLMProxy provider integration guide
Add documentation for LLMProxy, a high-performance LLM reverse proxy with
zero-buffer streaming, native token metering, and load balancing support.

Changes:
- Add docs/providers/llmproxy.md with full setup guide
- Add LLMProxy to providers index (Community tools section)
- Add LLMProxy section to model-providers.md

Co-Authored-By: Warp <agent@warp.dev>
2026-01-30 16:36:12 +08:00

---
summary: "Use LLMProxy as a high-performance reverse proxy for LLM backends"
read_when:
  - You want to route OpenClaw through a local LLM proxy
  - You need zero-buffer streaming with token metering
  - You want load balancing across multiple LLM backends
---

LLMProxy

LLMProxy is a high-performance reverse proxy for LLM inference services — like nginx for web servers, but built specifically for LLM workloads.

Why LLMProxy?

| Feature          | LLMProxy               | Generic API Gateway |
| ---------------- | ---------------------- | ------------------- |
| SSE Streaming    | Zero-buffer forwarding | Buffer causes delay |
| Token Metering   | Native support         | Plugin required     |
| Deployment       | Single binary          | Requires database   |
| LLM Optimization | Built for LLM          | General purpose     |

Performance:

  • First token latency overhead: < 1ms
  • Memory usage: < 50MB
  • Concurrent connections: 10,000+

Quick start

1. Start LLMProxy

# Download config
curl -o config.yaml https://raw.githubusercontent.com/aiyuekuang/LLMProxy/main/config.yaml.example

# Edit backend URL (point to your vLLM/TGI/Ollama instance)
vim config.yaml

# Start with Docker
docker run -d -p 8000:8000 \
  -v "$(pwd)/config.yaml:/home/llmproxy/config.yaml" \
  ghcr.io/aiyuekuang/llmproxy:latest

2. Configure OpenClaw

{
  models: {
    providers: {
      llmproxy: {
        baseUrl: "http://localhost:8000/v1",
        apiKey: "optional-key",  // LLMProxy handles auth separately
        api: "openai-completions",
        models: [
          {
            id: "qwen-coder",
            name: "Qwen Coder via LLMProxy",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 128000,
            maxTokens: 8192
          }
        ]
      }
    }
  },
  agents: {
    defaults: {
      model: { primary: "llmproxy/qwen-coder" }
    }
  }
}

3. Test the connection

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-coder",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
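
With "stream": true, an OpenAI-compatible endpoint answers with Server-Sent Events: one `data:` line per chunk, ending with `data: [DONE]`. The sketch below shows how a client could reassemble the reply text from such lines. It assumes the standard OpenAI chunk shape (`choices[].delta.content`); the helper name `collect_stream_text` and the sample lines are made up for illustration and are not captured LLMProxy output.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the text deltas from an OpenAI-style SSE chat completion stream."""
    parts = []
    for raw in lines:
        line = raw.strip()
        # SSE payload lines start with "data:"; skip blank lines and comments.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Hypothetical stream, shaped like OpenAI chat completion chunks.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # prints "Hello"
```

To use it against a live proxy, feed it the decoded lines of the HTTP response body instead of `sample`.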

Configuration

LLMProxy config (config.yaml)

server:
  listen: ":8000"

backends:
  - url: "http://vllm:8000"
    weight: 5
  - url: "http://ollama:11434"
    weight: 3

# Optional: API Key authentication
auth:
  enabled: true
  header_names: ["Authorization", "X-API-Key"]
  skip_paths: ["/health", "/metrics"]

# Optional: Usage reporting (for billing/monitoring)
usage:
  enabled: true
  reporters:
    - name: billing
      type: webhook
      enabled: true
      webhook:
        url: "https://your-billing.com/llm-usage"
        timeout: 3s

# Optional: Rate limiting
rate_limit:
  enabled: true
  per_key:
    requests_per_minute: 60
    max_concurrent: 3
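
To make the two `per_key` limits concrete, here is a minimal sketch of how a per-key limiter with a requests-per-minute cap and a concurrency cap could behave. It is not LLMProxy's implementation: the class `PerKeyLimiter`, the sliding 60-second window, and the injectable clock are all assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque


class PerKeyLimiter:
    """Illustrative per-key limiter: sliding 60 s window plus a concurrency cap."""

    def __init__(self, requests_per_minute=60, max_concurrent=3, clock=time.monotonic):
        self.rpm = requests_per_minute
        self.max_concurrent = max_concurrent
        self.clock = clock
        self.recent = defaultdict(deque)   # key -> timestamps of admitted requests
        self.in_flight = defaultdict(int)  # key -> requests currently running

    def try_acquire(self, key):
        """Admit the request if both limits allow it; return True on success."""
        now = self.clock()
        window = self.recent[key]
        # Forget requests that started 60 or more seconds ago.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.rpm or self.in_flight[key] >= self.max_concurrent:
            return False
        window.append(now)
        self.in_flight[key] += 1
        return True

    def release(self, key):
        """Mark one request for this key as finished."""
        if self.in_flight[key] > 0:
            self.in_flight[key] -= 1


# A fake clock keeps the example deterministic.
fake_now = [0.0]
limiter = PerKeyLimiter(clock=lambda: fake_now[0])
admitted = [limiter.try_acquire("key-a") for _ in range(4)]
print(admitted)  # [True, True, True, False]: the fourth exceeds max_concurrent
```

With the values from the config above, a key that keeps three requests open is refused a fourth until one of them finishes, even if it is far below 60 requests per minute.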

Multiple backends with load balancing

LLMProxy supports load balancing across multiple LLM backends:

backends:
  - url: "http://vllm-1:8000"
    weight: 10
  - url: "http://vllm-2:8000"
    weight: 10
  - url: "http://ollama:11434"
    weight: 5

routing:
  load_balance: least_connections  # or: round_robin, latency_based
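
As an illustration of what the weights mean under a round-robin style policy, the sketch below applies smooth weighted round-robin (the scheme nginx uses) to the three backends above. Whether LLMProxy uses this exact algorithm is an assumption; the function name and the short backend labels are hypothetical.

```python
from collections import Counter


def smooth_weighted_round_robin(weights, picks):
    """Return `picks` backend names chosen by smooth weighted round-robin."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    order = []
    for _ in range(picks):
        # Every backend gains its weight, then the leader is chosen and
        # pushed back by the total so that others get their turn.
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        order.append(chosen)
    return order


weights = {"vllm-1": 10, "vllm-2": 10, "ollama": 5}
order = smooth_weighted_round_robin(weights, 25)
print(order[:5])  # ['vllm-1', 'vllm-2', 'ollama', 'vllm-1', 'vllm-2']
counts = Counter(order)
```

Over 25 picks each vLLM backend is chosen 10 times and Ollama 5 times, so traffic is split 40/40/20 while picks for the same backend stay spread out rather than bunched together.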

OpenClaw config for multiple models

{
  models: {
    providers: {
      llmproxy: {
        baseUrl: "http://localhost:8000/v1",
        api: "openai-completions",
        models: [
          {
            id: "qwen-coder-32b",
            name: "Qwen 2.5 Coder 32B",
            reasoning: false,
            input: ["text"],
            contextWindow: 128000,
            maxTokens: 8192
          },
          {
            id: "deepseek-r1",
            name: "DeepSeek R1",
            reasoning: true,
            input: ["text"],
            contextWindow: 64000,
            maxTokens: 8192
          }
        ]
      }
    }
  },
  agents: {
    defaults: {
      model: {
        primary: "llmproxy/qwen-coder-32b",
        fallbacks: ["llmproxy/deepseek-r1"]
      }
    }
  }
}
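
Model references such as "llmproxy/qwen-coder-32b" combine the provider key with a model `id`, so a typo in `primary` or `fallbacks` is easy to miss. The following sketch checks a config, loaded as a Python dict, for references that no provider defines. The helper `undefined_model_refs` is hypothetical and not part of OpenClaw; it only assumes the `provider/id` form shown above.

```python
def undefined_model_refs(config):
    """Return model references in agents.defaults that no provider defines."""
    defined = set()
    for provider, spec in config.get("models", {}).get("providers", {}).items():
        for model in spec.get("models", []):
            defined.add(f"{provider}/{model['id']}")
    model_cfg = config.get("agents", {}).get("defaults", {}).get("model", {})
    refs = [model_cfg.get("primary")] + list(model_cfg.get("fallbacks", []))
    return [ref for ref in refs if ref and ref not in defined]


config = {
    "models": {
        "providers": {
            "llmproxy": {
                "models": [{"id": "qwen-coder-32b"}, {"id": "deepseek-r1"}],
            }
        }
    },
    "agents": {
        "defaults": {
            "model": {
                "primary": "llmproxy/qwen-coder-32b",
                # "deepseek-r2" is a deliberate typo for the example.
                "fallbacks": ["llmproxy/deepseek-r1", "llmproxy/deepseek-r2"],
            }
        }
    },
}
print(undefined_model_refs(config))  # ['llmproxy/deepseek-r2']
```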

Use cases

Self-hosted AI coding assistant

Route Cursor, Aider, or other coding tools through LLMProxy to your private vLLM instance:

Developer IDE → LLMProxy → vLLM (Qwen2.5-Coder-32B)

Benefits:

  • Fully private code data
  • Tool calling support
  • Unified API key management
  • Response latency < 500ms

Cost optimization

Use backend weights to send most traffic to the cheaper backend and keep the more capable one for the remainder:

# The higher-weighted backend receives the larger share of requests
backends:
  - url: "http://ollama:11434"      # Free, local
    weight: 10
  - url: "http://vllm-large:8000"   # More capable
    weight: 3

Monitoring and billing

LLMProxy sends usage data to your webhook:

{
  "request_id": "req_abc123",
  "user_id": "user_alice",
  "model": "qwen-coder",
  "prompt_tokens": 15,
  "completion_tokens": 42,
  "total_tokens": 57,
  "is_stream": true,
  "timestamp": "2026-01-30T10:30:00Z"
}
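
On the receiving side, a billing service would typically aggregate these records. The sketch below sums `total_tokens` per `user_id` and rejects records whose counts do not add up. It assumes each webhook call carries one JSON object with the fields shown above; the function name and the second sample record are hypothetical.

```python
import json
from collections import defaultdict


def total_tokens_by_user(payloads):
    """Sum total_tokens per user_id, checking each record's arithmetic."""
    totals = defaultdict(int)
    for raw in payloads:
        record = json.loads(raw)
        expected = record["prompt_tokens"] + record["completion_tokens"]
        if record["total_tokens"] != expected:
            raise ValueError(f"inconsistent token counts in {record['request_id']}")
        totals[record["user_id"]] += record["total_tokens"]
    return dict(totals)


payloads = [
    # First record mirrors the example payload; the second is made up.
    '{"request_id": "req_abc123", "user_id": "user_alice", "model": "qwen-coder",'
    ' "prompt_tokens": 15, "completion_tokens": 42, "total_tokens": 57}',
    '{"request_id": "req_abc124", "user_id": "user_alice", "model": "qwen-coder",'
    ' "prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}',
]
print(total_tokens_by_user(payloads))  # {'user_alice': 72}
```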

Monitoring

LLMProxy exposes Prometheus metrics at /metrics:

| Metric                      | Description     |
| --------------------------- | --------------- |
| llmproxy_requests_total     | Total requests  |
| llmproxy_latency_ms         | Request latency |
| llmproxy_usage_tokens_total | Token usage     |

Access metrics:

curl http://localhost:8000/metrics
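
The endpoint returns the Prometheus text exposition format: one sample per line, with comment lines starting with `#`. For a quick check without a Prometheus server, a parser along these lines may be enough. It assumes samples carry no trailing timestamp, and the labels in the sample text are invented for illustration; only the metric names come from the table above.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field (no timestamps assumed).
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical scrape output; real label names may differ.
scrape = """\
# HELP llmproxy_requests_total Total requests
# TYPE llmproxy_requests_total counter
llmproxy_requests_total 128
llmproxy_usage_tokens_total{type="prompt"} 2048
llmproxy_usage_tokens_total{type="completion"} 4096
"""
metrics = parse_metrics(scrape)
print(metrics["llmproxy_requests_total"])  # 128.0
```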

Troubleshooting

Connection refused

Check that LLMProxy is running:

curl http://localhost:8000/health

Backend not responding

Verify your backend is accessible from LLMProxy:

# From the LLMProxy host
curl http://vllm:8000/v1/models

Token counts missing

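Ensure your backend returns usage data. OpenAI-compatible servers such as vLLM include usage in non-streaming responses by default, but in streaming responses only when the request sets "stream_options": {"include_usage": true}, so confirm that your client sends it. The server itself can be started without any extra flag:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000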

See also