---
summary: "Use LLMProxy as a high-performance reverse proxy for LLM backends"
read_when:
  - You want to route OpenClaw through a local LLM proxy
  - You need zero-buffer streaming with token metering
  - You want load balancing across multiple LLM backends
---
# LLMProxy

[LLMProxy](https://github.com/aiyuekuang/LLMProxy) is a high-performance reverse proxy for LLM inference services. It fills the role nginx fills for web servers, but is built specifically for LLM workloads.

## Why LLMProxy?

| Feature | LLMProxy | Generic API Gateway |
|---------|----------|---------------------|
| SSE Streaming | Zero-buffer forwarding | Buffering can delay tokens |
| Token Metering | Native support | Plugin required |
| Deployment | Single binary | Often requires a database |
| LLM Optimization | Built for LLM | General purpose |

**Performance** (figures stated by the LLMProxy project; actual numbers depend on hardware and workload):
- First token latency overhead: < 1ms
- Memory usage: < 50MB
- Concurrent connections: 10,000+

## Quick start

### 1. Start LLMProxy

```bash
# Download config
curl -o config.yaml https://raw.githubusercontent.com/aiyuekuang/LLMProxy/main/config.yaml.example

# Edit backend URL (point to your vLLM/TGI/Ollama instance)
vim config.yaml

# Start with Docker
docker run -d -p 8000:8000 \
  -v "$(pwd)/config.yaml:/home/llmproxy/config.yaml" \
  ghcr.io/aiyuekuang/llmproxy:latest
```

### 2. Configure OpenClaw

```json5
{
  models: {
    providers: {
      llmproxy: {
        baseUrl: "http://localhost:8000/v1",
        apiKey: "optional-key",  // placeholder; set a real key if LLMProxy auth is enabled
        api: "openai-completions",
        models: [
          {
            id: "qwen-coder",
            name: "Qwen Coder via LLMProxy",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 128000,
            maxTokens: 8192
          }
        ]
      }
    }
  },
  agents: {
    defaults: {
      model: { primary: "llmproxy/qwen-coder" }
    }
  }
}
```

### 3. Test the connection

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-coder",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```
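
With `"stream": true`, the response arrives as server-sent events, one `data:` line per chunk. The following is a minimal sketch of how a client could reassemble the text, assuming the backend emits OpenAI-style chunks and ends the stream with `data: [DONE]`. The `extract_text` helper and the sample lines are hypothetical and only illustrate the format.

```python
import json


def extract_text(sse_lines):
    """Concatenate delta content from OpenAI-style streaming chunks."""
    parts = []
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Hypothetical sample of what the curl command above might print.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":" there"}}]}',
    "",
    "data: [DONE]",
]
print(extract_text(sample))
```

If the chunks arrive one at a time rather than in a single burst at the end, streaming is passing through the proxy unbuffered.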

## Configuration

### LLMProxy config (config.yaml)

```yaml
server:
  listen: ":8000"

backends:
  - url: "http://vllm:8000"
    weight: 5
  - url: "http://ollama:11434"
    weight: 3

# Optional: API Key authentication
auth:
  enabled: true
  header_names: ["Authorization", "X-API-Key"]
  skip_paths: ["/health", "/metrics"]

# Optional: Usage reporting (for billing/monitoring)
usage:
  enabled: true
  reporters:
    - name: billing
      type: webhook
      enabled: true
      webhook:
        url: "https://your-billing.com/llm-usage"
        timeout: 3s

# Optional: Rate limiting
rate_limit:
  enabled: true
  per_key:
    requests_per_minute: 60
    max_concurrent: 3
```
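
To make the two `per_key` settings concrete, here is a rough sketch of what a per-key limiter with a 60-second sliding window and a concurrency cap could look like. This is an illustration of the semantics only, not LLMProxy's implementation; the `PerKeyLimiter` class and its methods are hypothetical.

```python
import time
from collections import defaultdict, deque


class PerKeyLimiter:
    """Illustrative limiter: sliding one-minute window plus a concurrency cap."""

    def __init__(self, requests_per_minute=60, max_concurrent=3):
        self.rpm = requests_per_minute
        self.max_concurrent = max_concurrent
        self.recent = defaultdict(deque)   # key -> timestamps of admitted requests
        self.in_flight = defaultdict(int)  # key -> requests currently running

    def try_acquire(self, key, now=None):
        now = time.monotonic() if now is None else now
        window = self.recent[key]
        # Drop timestamps that are 60 seconds old or older.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.rpm or self.in_flight[key] >= self.max_concurrent:
            return False
        window.append(now)
        self.in_flight[key] += 1
        return True

    def release(self, key):
        # Call when a request finishes so a concurrency slot frees up.
        if self.in_flight[key] > 0:
            self.in_flight[key] -= 1


limiter = PerKeyLimiter(requests_per_minute=60, max_concurrent=3)
results = [limiter.try_acquire("key-a", now=0.0) for _ in range(4)]
print(results)  # the fourth call is rejected because three requests are in flight
```

In this sketch, a long-running streaming request holds its concurrency slot until `release` is called, so `max_concurrent: 3` caps simultaneous streams per key regardless of the per-minute budget.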

### Multiple backends with load balancing

LLMProxy supports load balancing across multiple LLM backends:

```yaml
backends:
  - url: "http://vllm-1:8000"
    weight: 10
  - url: "http://vllm-2:8000"
    weight: 10
  - url: "http://ollama:11434"
    weight: 5

routing:
  load_balance: least_connections  # or: round_robin, latency_based
```
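
As a quick sanity check on the weights, the sketch below computes the share of traffic each backend would receive if weights are applied proportionally, as in weighted round-robin. That proportionality is an assumption; under `least_connections` or `latency_based` the real split also depends on in-flight requests and observed latency. The `expected_share` helper is hypothetical.

```python
def expected_share(backends):
    """Return each backend's fraction of traffic under proportional weighting."""
    total = sum(b["weight"] for b in backends)
    return {b["url"]: b["weight"] / total for b in backends}


# Same backends and weights as the YAML example above.
backends = [
    {"url": "http://vllm-1:8000", "weight": 10},
    {"url": "http://vllm-2:8000", "weight": 10},
    {"url": "http://ollama:11434", "weight": 5},
]

shares = expected_share(backends)
for url, share in shares.items():
    print(f"{url}: {share:.0%}")
```

With weights 10, 10, and 5, the two vLLM instances would each take 40% of requests and Ollama the remaining 20%.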

### OpenClaw config for multiple models

```json5
{
  models: {
    providers: {
      llmproxy: {
        baseUrl: "http://localhost:8000/v1",
        api: "openai-completions",
        models: [
          {
            id: "qwen-coder-32b",
            name: "Qwen 2.5 Coder 32B",
            reasoning: false,
            input: ["text"],
            contextWindow: 128000,
            maxTokens: 8192
          },
          {
            id: "deepseek-r1",
            name: "DeepSeek R1",
            reasoning: true,
            input: ["text"],
            contextWindow: 64000,
            maxTokens: 8192
          }
        ]
      }
    }
  },
  agents: {
    defaults: {
      model: {
        primary: "llmproxy/qwen-coder-32b",
        fallbacks: ["llmproxy/deepseek-r1"]
      }
    }
  }
}
```

## Use cases

### Self-hosted AI coding assistant

Route Cursor, Aider, or other coding tools through LLMProxy to your private vLLM instance:

```
Developer IDE → LLMProxy → vLLM (Qwen2.5-Coder-32B)
```

Benefits:
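- Code and prompts stay on your own infrastructure
- Tool calling support
- Unified API key management
- Low response latency (the project cites < 500ms; actual latency depends on model size and hardware)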

### Cost optimization

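Use backend weights to send most traffic to a cheaper backend and a smaller share to a more capable one. Weights split traffic by proportion; they do not inspect request complexity:

```yaml
# Higher weight = larger share of requests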
backends:
  - url: "http://ollama:11434"      # Free, local
    weight: 10
  - url: "http://vllm-large:8000"   # More capable
    weight: 3
```

### Monitoring and billing

When a `webhook` reporter is configured, LLMProxy sends a usage record like this to your endpoint for each request:

```json
{
  "request_id": "req_abc123",
  "user_id": "user_alice",
  "model": "qwen-coder",
  "prompt_tokens": 15,
  "completion_tokens": 42,
  "total_tokens": 57,
  "is_stream": true,
  "timestamp": "2026-01-30T10:30:00Z"
}
```
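
On the receiving side, a billing service would typically roll these records up per user and model. The following is a minimal sketch of that aggregation, assuming each record carries the fields shown above. The `aggregate_usage` helper and the second record (`req_def456`) are hypothetical.

```python
import json
from collections import defaultdict

TOKEN_FIELDS = ("prompt_tokens", "completion_tokens", "total_tokens")


def aggregate_usage(events):
    """Sum token counts per (user_id, model) pair."""
    totals = defaultdict(lambda: dict.fromkeys(TOKEN_FIELDS, 0))
    for event in events:
        bucket = totals[(event["user_id"], event["model"])]
        for field in TOKEN_FIELDS:
            bucket[field] += event[field]
    return dict(totals)


# The record from the example above.
first = json.loads("""
{
  "request_id": "req_abc123",
  "user_id": "user_alice",
  "model": "qwen-coder",
  "prompt_tokens": 15,
  "completion_tokens": 42,
  "total_tokens": 57,
  "is_stream": true,
  "timestamp": "2026-01-30T10:30:00Z"
}
""")
# A made-up second record for the same user and model.
second = dict(first, request_id="req_def456",
              prompt_tokens=10, completion_tokens=20, total_tokens=30)

totals = aggregate_usage([first, second])
print(totals[("user_alice", "qwen-coder")])
```

A real receiver would also want to deduplicate on `request_id` in case a delivery is retried; whether LLMProxy retries webhook calls is not covered here.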

## Monitoring

LLMProxy exposes Prometheus metrics at `/metrics`:

| Metric | Description |
|--------|-------------|
| `llmproxy_requests_total` | Total requests |
| `llmproxy_latency_ms` | Request latency in milliseconds |
| `llmproxy_usage_tokens_total` | Token usage |

Access metrics:

```bash
curl http://localhost:8000/metrics
```
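
For a quick check without a full Prometheus setup, the text output can be parsed directly. The sketch below sums all series of one metric, assuming the standard Prometheus text exposition format without trailing timestamps. The `sum_metric` helper, the label names, and the sample values are hypothetical; the actual labels LLMProxy emits may differ.

```python
def sum_metric(text, name):
    """Sum the values of every series belonging to the metric `name`."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]  # strip the label set, if any
        if metric_name == name:
            total += float(value)
    return total


# Hypothetical scrape output; label names and values are made up.
sample_metrics = """\
# HELP llmproxy_requests_total Total requests
# TYPE llmproxy_requests_total counter
llmproxy_requests_total{backend="vllm"} 120
llmproxy_requests_total{backend="ollama"} 30
llmproxy_usage_tokens_total{type="prompt"} 4500
"""

print(sum_metric(sample_metrics, "llmproxy_requests_total"))
```

For anything beyond ad hoc checks, point a Prometheus server at `/metrics` and query it there.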

## Troubleshooting

### Connection refused

Check that LLMProxy is running and reachable on port 8000:

```bash
curl http://localhost:8000/health
```

### Backend not responding

Verify your backend is accessible from LLMProxy:

```bash
# From the LLMProxy host
curl http://vllm:8000/v1/models
```

### Token counts missing

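Ensure your backend returns usage data. vLLM's OpenAI-compatible server includes a `usage` object in non-streaming responses; for streaming requests, the request generally needs `stream_options: {"include_usage": true}` so that a final usage chunk is sent. No extra server flag is needed for this:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000
```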

## See also

- [LLMProxy GitHub](https://github.com/aiyuekuang/LLMProxy) - Source code and full documentation
- [Model Providers](/concepts/model-providers) - Overview of all providers
- [Ollama](/providers/ollama) - Local LLM runtime (can be used as LLMProxy backend)
- [Configuration](/gateway/configuration) - Full OpenClaw config reference