Add documentation for LLMProxy, a high-performance LLM reverse proxy with zero-buffer streaming, native token metering, and load balancing support.

Changes:
- Add docs/providers/llmproxy.md with full setup guide
- Add LLMProxy to providers index (Community tools section)
- Add LLMProxy section to model-providers.md

Co-Authored-By: Warp <agent@warp.dev>

---
summary: "Use LLMProxy as a high-performance reverse proxy for LLM backends"
read_when:
  - You want to route OpenClaw through a local LLM proxy
  - You need zero-buffer streaming with token metering
  - You want load balancing across multiple LLM backends
---

# LLMProxy

[LLMProxy](https://github.com/aiyuekuang/LLMProxy) is a high-performance reverse proxy for LLM inference services. It plays the role nginx plays for web servers, but is built specifically for LLM workloads.

## Why LLMProxy?

| Feature | LLMProxy | Generic API gateway |
|---------|----------|---------------------|
| SSE streaming | Zero-buffer forwarding | Response buffering can delay tokens |
| Token metering | Native support | Plugin required |
| Deployment | Single binary | Often requires a database |
| LLM optimization | Built for LLM workloads | General purpose |

**Performance** (figures quoted for the proxy itself; verify on your own hardware):
- First-token latency overhead: < 1 ms
- Memory usage: < 50 MB
- Concurrent connections: 10,000+

## Quick start

### 1. Start LLMProxy

```bash
# Download the example config (-f makes curl fail on HTTP errors instead of saving an error page)
curl -fsSL -o config.yaml https://raw.githubusercontent.com/aiyuekuang/LLMProxy/main/config.yaml.example

# Edit the backend URL (point it at your vLLM/TGI/Ollama instance)
vim config.yaml

# Start with Docker, mounting the edited config into the container
# (the quotes keep the mount working when the current path contains spaces)
docker run -d -p 8000:8000 \
  -v "$(pwd)/config.yaml:/home/llmproxy/config.yaml" \
  ghcr.io/aiyuekuang/llmproxy:latest
```

### 2. Configure OpenClaw

```json5
{
  models: {
    providers: {
      llmproxy: {
        baseUrl: "http://localhost:8000/v1",
        apiKey: "optional-key", // LLMProxy handles auth separately
        api: "openai-completions",
        models: [
          {
            id: "qwen-coder",
            name: "Qwen Coder via LLMProxy",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 128000,
            maxTokens: 8192
          }
        ]
      }
    }
  },
  agents: {
    defaults: {
      model: { primary: "llmproxy/qwen-coder" }
    }
  }
}
```

### 3. Test the connection

```bash
# -N disables curl's output buffering so streamed chunks print as they arrive
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-coder",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```

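With `"stream": true`, an OpenAI-compatible endpoint answers with server-sent events: `data: {...}` lines carrying JSON chunks, followed by a final `data: [DONE]`. The sketch below shows one way to reassemble the reply text from such a stream in Python. It assumes the backend behind LLMProxy emits the standard chat-completions chunk shape (`choices[].delta.content`); `collect_stream_text` and the sample lines are illustrative, not part of LLMProxy.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the content deltas from OpenAI-style SSE lines into one string."""
    parts = []
    for raw in lines:
        line = raw.strip()
        # SSE payload lines start with "data:"; skip blanks and other fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            content = delta.get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Made-up sample of what a streamed response could look like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # prints: Hello!
```

To use it against a live proxy, pass it the decoded lines of the HTTP response body instead of `sample`.
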
## Configuration

### LLMProxy config (config.yaml)

```yaml
server:
  listen: ":8000"

backends:
  - url: "http://vllm:8000"
    weight: 5
  - url: "http://ollama:11434"
    weight: 3

# Optional: API Key authentication
auth:
  enabled: true
  header_names: ["Authorization", "X-API-Key"]
  skip_paths: ["/health", "/metrics"]

# Optional: Usage reporting (for billing/monitoring)
usage:
  enabled: true
  reporters:
    - name: billing
      type: webhook
      enabled: true
      webhook:
        url: "https://your-billing.com/llm-usage"
        timeout: 3s

# Optional: Rate limiting
rate_limit:
  enabled: true
  per_key:
    requests_per_minute: 60
    max_concurrent: 3
```

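The `rate_limit.per_key` settings describe two independent limits: a request budget per minute and a cap on in-flight requests per API key. As a rough mental model only, a sliding-window limiter with those two knobs could look like the sketch below. How LLMProxy actually enforces the limits is not stated here and may differ (a token bucket is a common alternative); `KeyLimiter` and its methods are hypothetical names.

```python
import time
from collections import defaultdict, deque


class KeyLimiter:
    """Hypothetical per-key limiter: N requests per 60 s plus a concurrency cap."""

    def __init__(self, requests_per_minute=60, max_concurrent=3, clock=time.monotonic):
        self.rpm = requests_per_minute
        self.max_concurrent = max_concurrent
        self.clock = clock
        self.recent = defaultdict(deque)   # key -> timestamps of accepted requests
        self.in_flight = defaultdict(int)  # key -> requests currently being served

    def try_acquire(self, key):
        """Return True and reserve a slot if the request is within both limits."""
        now = self.clock()
        window = self.recent[key]
        # Drop timestamps that have aged out of the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.rpm or self.in_flight[key] >= self.max_concurrent:
            return False
        window.append(now)
        self.in_flight[key] += 1
        return True

    def release(self, key):
        """Call when a request finishes so its concurrency slot frees up."""
        if self.in_flight[key] > 0:
            self.in_flight[key] -= 1


limiter = KeyLimiter(requests_per_minute=60, max_concurrent=3)
results = [limiter.try_acquire("key-a") for _ in range(4)]
print(results)  # [True, True, True, False]: the fourth call exceeds max_concurrent
```

For long streaming completions the concurrency cap tends to be the limit that matters, because each request holds its slot until the stream ends.
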
### Multiple backends with load balancing

LLMProxy supports load balancing across multiple LLM backends:

```yaml
backends:
  - url: "http://vllm-1:8000"
    weight: 10
  - url: "http://vllm-2:8000"
    weight: 10
  - url: "http://ollama:11434"
    weight: 5

routing:
  load_balance: least_connections  # or: round_robin, latency_based
```

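To make the `least_connections` strategy concrete, here is a small Python sketch of one common interpretation: pick the backend with the fewest in-flight requests relative to its weight. This is an assumption about how weights and connection counts combine, not a description of LLMProxy's internals; `Backend` and `pick_least_connections` are hypothetical names, and the `active` counts are made up.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    url: str
    weight: int
    active: int = 0  # requests currently in flight


def pick_least_connections(backends):
    """Choose the backend with the lowest active/weight ratio.

    On a tie, min() returns the earliest backend in the list.
    """
    return min(backends, key=lambda b: b.active / b.weight)


backends = [
    Backend("http://vllm-1:8000", weight=10, active=4),   # ratio 0.4
    Backend("http://vllm-2:8000", weight=10, active=2),   # ratio 0.2
    Backend("http://ollama:11434", weight=5, active=2),   # ratio 0.4
]
chosen = pick_least_connections(backends)
print(chosen.url)  # prints: http://vllm-2:8000
```

Dividing by the weight lets a larger backend carry more concurrent requests before it stops being preferred, which suits LLM traffic where request durations vary widely.
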
### OpenClaw config for multiple models

```json5
{
  models: {
    providers: {
      llmproxy: {
        baseUrl: "http://localhost:8000/v1",
        api: "openai-completions",
        models: [
          {
            id: "qwen-coder-32b",
            name: "Qwen 2.5 Coder 32B",
            reasoning: false,
            input: ["text"],
            contextWindow: 128000,
            maxTokens: 8192
          },
          {
            id: "deepseek-r1",
            name: "DeepSeek R1",
            reasoning: true,
            input: ["text"],
            contextWindow: 64000,
            maxTokens: 8192
          }
        ]
      }
    }
  },
  agents: {
    defaults: {
      model: {
        primary: "llmproxy/qwen-coder-32b",
        fallbacks: ["llmproxy/deepseek-r1"]
      }
    }
  }
}
```

## Use cases

### Self-hosted AI coding assistant

Route Cursor, Aider, or other coding tools through LLMProxy to your private vLLM instance:

```
Developer IDE → LLMProxy → vLLM (Qwen2.5-Coder-32B)
```

Benefits:
- Code stays on your own infrastructure
- Tool calling support
- Unified API key management
- Response latency < 500ms (depends on model size and hardware)

### Cost optimization

Use backend weights to send most requests to the cheaper backend while keeping a more capable one in rotation. Weights split traffic between backends; they do not inspect how complex a request is:

```yaml
# The higher-weighted (cheaper, faster) backend receives most of the traffic
backends:
  - url: "http://ollama:11434"     # Free, local
    weight: 10
  - url: "http://vllm-large:8000"  # More capable
    weight: 3
```

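As a quick sanity check on what weights of 10 and 3 mean, the Python sketch below computes the expected share of requests per backend. It assumes traffic is split in proportion to weight, as weighted round-robin would do; other `load_balance` strategies may shift the split. `traffic_shares` is an illustrative helper, not an LLMProxy API.

```python
def traffic_shares(weights):
    """Map each backend URL to its expected fraction of requests."""
    total = sum(weights.values())
    return {url: weight / total for url, weight in weights.items()}


shares = traffic_shares({
    "http://ollama:11434": 10,
    "http://vllm-large:8000": 3,
})
for url, share in shares.items():
    print(f"{url}: {share:.0%}")
# prints:
# http://ollama:11434: 77%
# http://vllm-large:8000: 23%
```

With these weights, roughly three out of four requests land on the local Ollama backend.
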
### Monitoring and billing

LLMProxy sends usage data to your webhook:

```json
{
  "request_id": "req_abc123",
  "user_id": "user_alice",
  "model": "qwen-coder",
  "prompt_tokens": 15,
  "completion_tokens": 42,
  "total_tokens": 57,
  "is_stream": true,
  "timestamp": "2026-01-30T10:30:00Z"
}
```

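On the receiving side, a billing service mostly needs to fold these payloads into per-user totals. The Python sketch below shows that step in isolation, using only the fields from the sample payload above. `record_usage` and the second sample event are illustrative; a real receiver would call the function from its HTTP handler and persist the totals.

```python
import json
from collections import defaultdict


def record_usage(totals, body):
    """Add one webhook payload (JSON as bytes or str) to per-user token totals."""
    event = json.loads(body)
    user = event.get("user_id", "unknown")
    totals[user]["prompt_tokens"] += event.get("prompt_tokens", 0)
    totals[user]["completion_tokens"] += event.get("completion_tokens", 0)
    totals[user]["requests"] += 1
    return totals


totals = defaultdict(lambda: defaultdict(int))

# First event mirrors the sample payload; the second is made up for the demo.
events = [
    {"request_id": "req_abc123", "user_id": "user_alice", "model": "qwen-coder",
     "prompt_tokens": 15, "completion_tokens": 42, "total_tokens": 57,
     "is_stream": True, "timestamp": "2026-01-30T10:30:00Z"},
    {"request_id": "req_abc124", "user_id": "user_alice", "model": "qwen-coder",
     "prompt_tokens": 10, "completion_tokens": 8, "total_tokens": 18,
     "is_stream": False, "timestamp": "2026-01-30T10:31:00Z"},
]
for event in events:
    record_usage(totals, json.dumps(event))

print(dict(totals["user_alice"]))
# prints: {'prompt_tokens': 25, 'completion_tokens': 50, 'requests': 2}
```

Keeping the aggregation separate from the HTTP layer makes it easy to test and to swap the in-memory dictionary for a database later.
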
## Monitoring

LLMProxy exposes Prometheus metrics at `/metrics`:

| Metric | Description |
|--------|-------------|
| `llmproxy_requests_total` | Total requests |
| `llmproxy_latency_ms` | Request latency |
| `llmproxy_usage_tokens_total` | Token usage |

Access metrics:

```bash
curl http://localhost:8000/metrics
```

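The `/metrics` endpoint returns the Prometheus text exposition format, which is simple enough to read without a client library for quick checks. The Python sketch below sums one metric across all of its label sets. The metric names come from the table above, but the `backend` label and the sample values are assumptions made for illustration; `sum_metric` is a hypothetical helper.

```python
def sum_metric(exposition, name):
    """Sum every sample of `name` in Prometheus text-format output."""
    total = 0.0
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        if "{" in line:
            # Labelled sample: name{label="value",...} value [timestamp]
            metric = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Unlabelled sample: name value [timestamp]
            metric, _, rest = line.partition(" ")
        if metric != name:
            continue
        fields = rest.split()
        if fields:
            total += float(fields[0])
    return total


# Made-up scrape output; real label names may differ.
sample = """\
# HELP llmproxy_requests_total Total requests
# TYPE llmproxy_requests_total counter
llmproxy_requests_total{backend="vllm"} 120
llmproxy_requests_total{backend="ollama"} 30
llmproxy_usage_tokens_total 5700
"""
print(sum_metric(sample, "llmproxy_requests_total"))      # prints: 150.0
print(sum_metric(sample, "llmproxy_usage_tokens_total"))  # prints: 5700.0
```

For anything beyond a spot check, scrape the endpoint with Prometheus itself and query the counters there.
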
## Troubleshooting

### Connection refused

Check that LLMProxy is running:

```bash
curl http://localhost:8000/health
```

### Backend not responding

Verify your backend is accessible from LLMProxy:

```bash
# From the LLMProxy host
curl http://vllm:8000/v1/models
```

### Token counts missing

Ensure your backend returns usage data. With vLLM's OpenAI-compatible server, non-streaming responses include a `usage` object by default, while streaming responses include it only when the request sets `stream_options: {"include_usage": true}`. A typical server launch looks like:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000
```

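To check whether a backend reports usage on streamed responses, it helps to see where the numbers appear. In the OpenAI streaming format, a request that sets `stream_options.include_usage` receives one extra chunk before `data: [DONE]` with an empty `choices` list and a populated `usage` object. The Python sketch below builds such a request body and pulls the usage out of a stream; `build_request`, `extract_usage`, and the sample lines are illustrative, and whether LLMProxy adds this option on your behalf is not assumed here.

```python
import json


def build_request(model, prompt):
    """Chat-completions body that asks the backend to report usage when streaming."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "stream_options": {"include_usage": True},
    }


def extract_usage(lines):
    """Return the usage dict from OpenAI-style SSE lines, or None if absent."""
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):  # null on ordinary content chunks
            usage = chunk["usage"]
    return usage


# Made-up stream: one content chunk, then the usage-only chunk.
sample = [
    'data: {"choices": [{"delta": {"content": "Hi"}}], "usage": null}',
    'data: {"choices": [], "usage": {"prompt_tokens": 15, "completion_tokens": 42, "total_tokens": 57}}',
    "data: [DONE]",
]
print(json.dumps(build_request("qwen-coder", "Hello")["stream_options"]))
# prints: {"include_usage": true}
print(extract_usage(sample))
# prints: {'prompt_tokens': 15, 'completion_tokens': 42, 'total_tokens': 57}
```

If `extract_usage` returns `None` for a real response, the backend is not emitting usage and the proxy has nothing to meter.
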
## See also

- [LLMProxy GitHub](https://github.com/aiyuekuang/LLMProxy) - Source code and full documentation
- [Model Providers](/concepts/model-providers) - Overview of all providers
- [Ollama](/providers/ollama) - Local LLM runtime (can be used as LLMProxy backend)
- [Configuration](/gateway/configuration) - Full OpenClaw config reference