feat(tools): add Jina Reader as web_fetch fallback provider

Add Jina Reader (https://jina.ai/reader/) as a native fallback provider
for web_fetch, similar to how Firecrawl is integrated.

Jina provides high-quality content extraction with:
- PDF support (native text extraction)
- Image captioning (via vision language models)
- JavaScript rendering (browser engine option)
- Token-based pricing (10M free tokens, more affordable than Firecrawl)
- Markdown-optimised output for LLM consumption

Changes:
- Add ToolsWebFetchJinaSchema to zod-schema.agent-runtime.ts
- Add fetchJinaContent() and tryJinaFallback() to web-fetch.ts
- Update fallback chain: Readability -> Jina -> Firecrawl -> error
- Add UI hints for Jina config options in schema.ts
- Add docs/tools/jina.md documentation
- Update docs/tools/web.md to reference Jina

Configuration example:
  tools.web.fetch.jina.apiKey or JINA_API_KEY env var

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Nathan Schram 2026-01-29 14:54:15 +11:00
parent 699784dbee
commit 2f22e1a88b
5 changed files with 443 additions and 11 deletions

108
docs/tools/jina.md Normal file
View File

@ -0,0 +1,108 @@
---
summary: "Jina Reader fallback for web_fetch (PDF support + browser rendering)"
read_when:
- You want Jina-backed web extraction
- You need a Jina API key
- You want PDF extraction for web_fetch
- You want browser-rendered extraction for JS-heavy sites
---
# Jina Reader
Moltbot can use **Jina Reader** as a fallback extractor for `web_fetch`. It is a
content extraction service optimised for LLM consumption, with excellent PDF support
and optional browser rendering for JavaScript-heavy sites.
## Get an API key
1) Sign up at https://jina.ai/?sui=apikey (10M free tokens included)
2) Store it in config or set `JINA_API_KEY` in the gateway environment.
## Configure Jina
```json5
{
tools: {
web: {
fetch: {
jina: {
apiKey: "JINA_API_KEY_HERE",
baseUrl: "https://r.jina.ai",
engine: "browser",
timeoutSeconds: 30
}
}
}
}
}
```
Notes:
- `jina.enabled` defaults to true when an API key is present.
- `engine` can be "browser" (quality), "direct" (speed), or "cf-browser-rendering" (JS-heavy).
## Engine Options
| Engine | Best For | Speed |
|--------|----------|-------|
| `direct` | Simple HTML pages | Fastest |
| `browser` | Most pages (default) | Medium |
| `cf-browser-rendering` | JS-heavy SPAs | Slowest |
## Additional Options
| Option | Description |
|--------|-------------|
| `noCache` | Bypass Jina's cache for fresh content |
| `withLinksSummary` | Include a summary of all links at end of content |
| `withImagesSummary` | Include a summary of all images at end of content |
| `returnFormat` | Output format: "markdown", "text", or "html" |
## How `web_fetch` uses Jina
`web_fetch` extraction order:
1) Readability (local)
2) **Jina** (if configured)
3) Firecrawl (if configured)
4) Basic HTML cleanup (last fallback)
Jina is tried before Firecrawl because:
- Token-based pricing is more affordable for high-volume use
- Better PDF extraction support
- No anti-bot circumvention overhead
See [Web tools](/tools/web) for the full web tool setup.
## Comparison with Firecrawl
| Feature | Jina | Firecrawl |
|---------|------|-----------|
| **Pricing** | Token-based (10M free) | Credit-based |
| **PDF Support** | Excellent | Basic |
| **Image Captioning** | Yes (via VLM) | No |
| **Anti-bot Bypass** | No | Yes (stealth proxy) |
| **JS Rendering** | Yes (browser engine) | Yes |
**Recommendation:** Use Jina for most use cases; add Firecrawl if you frequently
hit bot detection on specific sites.
## Environment Variables
- `JINA_API_KEY` - API key for Jina Reader (fallback when not set in config)
## Rate Limits
| Tier | RPM |
|------|-----|
| No API Key | 20 |
| Free API Key | 500 |
| Premium | 5,000 |
## Pricing
Jina uses token-based pricing:
- **Free tier**: 10 million tokens included
- **Paid**: ~$0.02 per 1M tokens (varies by plan)
This is typically more cost-effective than Firecrawl's credit-based model for
high-volume extraction.

View File

@ -209,6 +209,7 @@ Fetch a URL and extract readable content.
### Requirements
- `tools.web.fetch.enabled` must not be `false` (default: enabled)
- Optional Jina fallback: set `tools.web.fetch.jina.apiKey` or `JINA_API_KEY`.
- Optional Firecrawl fallback: set `tools.web.fetch.firecrawl.apiKey` or `FIRECRAWL_API_KEY`.
### Config
@ -225,6 +226,13 @@ Fetch a URL and extract readable content.
maxRedirects: 3,
userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
readability: true,
jina: {
enabled: true,
apiKey: "JINA_API_KEY_HERE", // optional if JINA_API_KEY is set
baseUrl: "https://r.jina.ai",
engine: "browser", // "browser", "direct", or "cf-browser-rendering"
timeoutSeconds: 30
},
firecrawl: {
enabled: true,
apiKey: "FIRECRAWL_API_KEY_HERE", // optional if FIRECRAWL_API_KEY is set
@ -246,12 +254,14 @@ Fetch a URL and extract readable content.
- `maxChars` (truncate long pages)
Notes:
- `web_fetch` uses Readability (main-content extraction) first, then Firecrawl (if configured). If both fail, the tool returns an error.
- `web_fetch` uses Readability (main-content extraction) first, then Jina (if configured), then Firecrawl (if configured). If all fail, the tool returns an error.
- Jina is tried before Firecrawl because it has better PDF support and more affordable token-based pricing.
- Firecrawl requests use bot-circumvention mode and cache results by default.
- `web_fetch` sends a Chrome-like User-Agent and `Accept-Language` by default; override `userAgent` if needed.
- `web_fetch` blocks private/internal hostnames and re-checks redirects (limit with `maxRedirects`).
- `web_fetch` is best-effort extraction; some sites will need the browser tool.
- See [Firecrawl](/tools/firecrawl) for key setup and service details.
- See [Jina](/tools/jina) for Jina Reader setup and service details.
- See [Firecrawl](/tools/firecrawl) for Firecrawl key setup and service details.
- Responses are cached (default 15 minutes) to reduce repeated fetches.
- If you use tool profiles/allowlists, add `web_search`/`web_fetch` or `group:web`.
- If the Brave key is missing, `web_search` returns a short setup hint with a docs link.

View File

@ -40,6 +40,7 @@ const DEFAULT_FETCH_MAX_REDIRECTS = 3;
const DEFAULT_ERROR_MAX_CHARS = 4_000;
const DEFAULT_FIRECRAWL_BASE_URL = "https://api.firecrawl.dev";
const DEFAULT_FIRECRAWL_MAX_AGE_MS = 172_800_000;
const DEFAULT_JINA_BASE_URL = "https://r.jina.ai";
const DEFAULT_FETCH_USER_AGENT =
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36";
@ -78,6 +79,20 @@ type FirecrawlFetchConfig =
}
| undefined;
type JinaFetchConfig =
| {
enabled?: boolean;
apiKey?: string;
baseUrl?: string;
engine?: "browser" | "direct" | "cf-browser-rendering";
returnFormat?: "markdown" | "text" | "html";
timeoutSeconds?: number;
noCache?: boolean;
withLinksSummary?: boolean;
withImagesSummary?: boolean;
}
| undefined;
function resolveFetchConfig(cfg?: MoltbotConfig): WebFetchConfig {
const fetch = cfg?.tools?.web?.fetch;
if (!fetch || typeof fetch !== "object") return undefined;
@ -147,6 +162,33 @@ function resolveFirecrawlMaxAgeMsOrDefault(firecrawl?: FirecrawlFetchConfig): nu
return DEFAULT_FIRECRAWL_MAX_AGE_MS;
}
// ===== Jina Configuration Resolvers =====
function resolveJinaConfig(fetch?: WebFetchConfig): JinaFetchConfig {
if (!fetch || typeof fetch !== "object") return undefined;
const jina = "jina" in fetch ? fetch.jina : undefined;
if (!jina || typeof jina !== "object") return undefined;
return jina as JinaFetchConfig;
}
function resolveJinaApiKey(jina?: JinaFetchConfig): string | undefined {
const fromConfig =
jina && "apiKey" in jina && typeof jina.apiKey === "string" ? jina.apiKey.trim() : "";
const fromEnv = (process.env.JINA_API_KEY ?? "").trim();
return fromConfig || fromEnv || undefined;
}
function resolveJinaEnabled(params: { jina?: JinaFetchConfig; apiKey?: string }): boolean {
if (typeof params.jina?.enabled === "boolean") return params.jina.enabled;
return Boolean(params.apiKey);
}
function resolveJinaBaseUrl(jina?: JinaFetchConfig): string {
const raw =
jina && "baseUrl" in jina && typeof jina.baseUrl === "string" ? jina.baseUrl.trim() : "";
return raw || DEFAULT_JINA_BASE_URL;
}
function resolveMaxChars(value: unknown, fallback: number): number {
const parsed = typeof value === "number" && Number.isFinite(value) ? value : fallback;
return Math.max(100, Math.floor(parsed));
@ -329,6 +371,83 @@ export async function fetchFirecrawlContent(params: {
};
}
export async function fetchJinaContent(params: {
url: string;
extractMode: ExtractMode;
apiKey: string;
baseUrl: string;
engine?: "browser" | "direct" | "cf-browser-rendering";
noCache?: boolean;
withLinksSummary?: boolean;
withImagesSummary?: boolean;
timeoutSeconds: number;
}): Promise<{
text: string;
title?: string;
finalUrl?: string;
status?: number;
}> {
const headers: Record<string, string> = {
Authorization: `Bearer ${params.apiKey}`,
Accept: "application/json",
"Content-Type": "application/json",
};
// Optional Jina headers
if (params.engine) {
headers["X-Engine"] = params.engine;
}
if (params.noCache) {
headers["X-No-Cache"] = "true";
}
if (params.withLinksSummary) {
headers["X-With-Links-Summary"] = "true";
}
if (params.withImagesSummary) {
headers["X-With-Images-Summary"] = "true";
}
if (params.timeoutSeconds) {
headers["X-Timeout"] = String(params.timeoutSeconds);
}
// Determine return format based on extractMode
const returnFormat = params.extractMode === "text" ? "text" : "markdown";
headers["X-Return-Format"] = returnFormat;
const res = await fetch(params.baseUrl, {
method: "POST",
headers,
body: JSON.stringify({ url: params.url }),
signal: withTimeout(undefined, params.timeoutSeconds * 1000),
});
const payload = (await res.json()) as {
code?: number;
status?: number;
data?: {
title?: string;
content?: string;
url?: string;
};
error?: string;
};
if (!res.ok || (payload?.code && payload.code !== 200)) {
const detail = payload?.error || res.statusText;
throw new Error(`Jina fetch failed (${res.status}): ${detail}`.trim());
}
const data = payload?.data ?? {};
const text = typeof data.content === "string" ? data.content : "";
return {
text,
title: data.title,
finalUrl: data.url,
status: payload.code ?? res.status,
};
}
async function runWebFetch(params: {
url: string;
extractMode: ExtractMode;
@ -338,6 +457,14 @@ async function runWebFetch(params: {
cacheTtlMs: number;
userAgent: string;
readabilityEnabled: boolean;
jinaEnabled: boolean;
jinaApiKey?: string;
jinaBaseUrl: string;
jinaEngine?: "browser" | "direct" | "cf-browser-rendering";
jinaNoCache?: boolean;
jinaWithLinksSummary?: boolean;
jinaWithImagesSummary?: boolean;
jinaTimeoutSeconds: number;
firecrawlEnabled: boolean;
firecrawlApiKey?: string;
firecrawlBaseUrl: string;
@ -381,6 +508,42 @@ async function runWebFetch(params: {
if (error instanceof SsrFBlockedError) {
throw error;
}
// Try Jina first (cheaper, better PDF support)
if (params.jinaEnabled && params.jinaApiKey) {
try {
const jina = await fetchJinaContent({
url: finalUrl,
extractMode: params.extractMode,
apiKey: params.jinaApiKey,
baseUrl: params.jinaBaseUrl,
engine: params.jinaEngine,
noCache: params.jinaNoCache,
withLinksSummary: params.jinaWithLinksSummary,
withImagesSummary: params.jinaWithImagesSummary,
timeoutSeconds: params.jinaTimeoutSeconds,
});
const truncated = truncateText(jina.text, params.maxChars);
const payload = {
url: params.url,
finalUrl: jina.finalUrl || finalUrl,
status: jina.status ?? 200,
contentType: "text/markdown",
title: jina.title,
extractMode: params.extractMode,
extractor: "jina",
truncated: truncated.truncated,
length: truncated.text.length,
fetchedAt: new Date().toISOString(),
tookMs: Date.now() - start,
text: truncated.text,
};
writeCache(FETCH_CACHE, cacheKey, payload, params.cacheTtlMs);
return payload;
} catch {
// Fall through to Firecrawl
}
}
// Then try Firecrawl (bot circumvention)
if (params.firecrawlEnabled && params.firecrawlApiKey) {
const firecrawl = await fetchFirecrawlContent({
url: finalUrl,
@ -417,6 +580,42 @@ async function runWebFetch(params: {
try {
if (!res.ok) {
// Try Jina first (cheaper, better PDF support)
if (params.jinaEnabled && params.jinaApiKey) {
try {
const jina = await fetchJinaContent({
url: params.url,
extractMode: params.extractMode,
apiKey: params.jinaApiKey,
baseUrl: params.jinaBaseUrl,
engine: params.jinaEngine,
noCache: params.jinaNoCache,
withLinksSummary: params.jinaWithLinksSummary,
withImagesSummary: params.jinaWithImagesSummary,
timeoutSeconds: params.jinaTimeoutSeconds,
});
const truncated = truncateText(jina.text, params.maxChars);
const payload = {
url: params.url,
finalUrl: jina.finalUrl || finalUrl,
status: jina.status ?? res.status,
contentType: "text/markdown",
title: jina.title,
extractMode: params.extractMode,
extractor: "jina",
truncated: truncated.truncated,
length: truncated.text.length,
fetchedAt: new Date().toISOString(),
tookMs: Date.now() - start,
text: truncated.text,
};
writeCache(FETCH_CACHE, cacheKey, payload, params.cacheTtlMs);
return payload;
} catch {
// Fall through to Firecrawl
}
}
// Then try Firecrawl (bot circumvention)
if (params.firecrawlEnabled && params.firecrawlApiKey) {
const firecrawl = await fetchFirecrawlContent({
url: params.url,
@ -475,20 +674,29 @@ async function runWebFetch(params: {
title = readable.title;
extractor = "readability";
} else {
const firecrawl = await tryFirecrawlFallback({ ...params, url: finalUrl });
if (firecrawl) {
text = firecrawl.text;
title = firecrawl.title;
extractor = "firecrawl";
// Try Jina first (cheaper, better PDF support)
const jina = await tryJinaFallback({ ...params, url: finalUrl });
if (jina) {
text = jina.text;
title = jina.title;
extractor = "jina";
} else {
throw new Error(
"Web fetch extraction failed: Readability and Firecrawl returned no content.",
);
// Then try Firecrawl (bot circumvention)
const firecrawl = await tryFirecrawlFallback({ ...params, url: finalUrl });
if (firecrawl) {
text = firecrawl.text;
title = firecrawl.title;
extractor = "firecrawl";
} else {
throw new Error(
"Web fetch extraction failed: Readability, Jina, and Firecrawl returned no content.",
);
}
}
}
} else {
throw new Error(
"Web fetch extraction failed: Readability disabled and Firecrawl unavailable.",
"Web fetch extraction failed: Readability disabled and Jina/Firecrawl unavailable.",
);
}
} else if (contentType.includes("application/json")) {
@ -554,6 +762,37 @@ async function tryFirecrawlFallback(params: {
}
}
async function tryJinaFallback(params: {
url: string;
extractMode: ExtractMode;
jinaEnabled: boolean;
jinaApiKey?: string;
jinaBaseUrl: string;
jinaEngine?: "browser" | "direct" | "cf-browser-rendering";
jinaNoCache?: boolean;
jinaWithLinksSummary?: boolean;
jinaWithImagesSummary?: boolean;
jinaTimeoutSeconds: number;
}): Promise<{ text: string; title?: string } | null> {
if (!params.jinaEnabled || !params.jinaApiKey) return null;
try {
const jina = await fetchJinaContent({
url: params.url,
extractMode: params.extractMode,
apiKey: params.jinaApiKey,
baseUrl: params.jinaBaseUrl,
engine: params.jinaEngine,
noCache: params.jinaNoCache,
withLinksSummary: params.jinaWithLinksSummary,
withImagesSummary: params.jinaWithImagesSummary,
timeoutSeconds: params.jinaTimeoutSeconds,
});
return { text: jina.text, title: jina.title };
} catch {
return null;
}
}
function resolveFirecrawlEndpoint(baseUrl: string): string {
const trimmed = baseUrl.trim();
if (!trimmed) return `${DEFAULT_FIRECRAWL_BASE_URL}/v2/scrape`;
@ -576,6 +815,22 @@ export function createWebFetchTool(options?: {
const fetch = resolveFetchConfig(options?.config);
if (!resolveFetchEnabled({ fetch, sandboxed: options?.sandboxed })) return null;
const readabilityEnabled = resolveFetchReadabilityEnabled(fetch);
// Jina config
const jina = resolveJinaConfig(fetch);
const jinaApiKey = resolveJinaApiKey(jina);
const jinaEnabled = resolveJinaEnabled({ jina, apiKey: jinaApiKey });
const jinaBaseUrl = resolveJinaBaseUrl(jina);
const jinaEngine = jina?.engine;
const jinaNoCache = jina?.noCache;
const jinaWithLinksSummary = jina?.withLinksSummary;
const jinaWithImagesSummary = jina?.withImagesSummary;
const jinaTimeoutSeconds = resolveTimeoutSeconds(
jina?.timeoutSeconds ?? fetch?.timeoutSeconds,
DEFAULT_TIMEOUT_SECONDS,
);
// Firecrawl config
const firecrawl = resolveFirecrawlConfig(fetch);
const firecrawlApiKey = resolveFirecrawlApiKey(firecrawl);
const firecrawlEnabled = resolveFirecrawlEnabled({ firecrawl, apiKey: firecrawlApiKey });
@ -586,6 +841,7 @@ export function createWebFetchTool(options?: {
firecrawl?.timeoutSeconds ?? fetch?.timeoutSeconds,
DEFAULT_TIMEOUT_SECONDS,
);
const userAgent =
(fetch && "userAgent" in fetch && typeof fetch.userAgent === "string" && fetch.userAgent) ||
DEFAULT_FETCH_USER_AGENT;
@ -609,6 +865,14 @@ export function createWebFetchTool(options?: {
cacheTtlMs: resolveCacheTtlMs(fetch?.cacheTtlMinutes, DEFAULT_CACHE_TTL_MINUTES),
userAgent,
readabilityEnabled,
jinaEnabled,
jinaApiKey,
jinaBaseUrl,
jinaEngine,
jinaNoCache,
jinaWithLinksSummary,
jinaWithImagesSummary,
jinaTimeoutSeconds,
firecrawlEnabled,
firecrawlApiKey,
firecrawlBaseUrl,

View File

@ -199,6 +199,15 @@ const FIELD_LABELS: Record<string, string> = {
"tools.web.fetch.cacheTtlMinutes": "Web Fetch Cache TTL (min)",
"tools.web.fetch.maxRedirects": "Web Fetch Max Redirects",
"tools.web.fetch.userAgent": "Web Fetch User-Agent",
"tools.web.fetch.jina.enabled": "Enable Jina Reader",
"tools.web.fetch.jina.apiKey": "Jina API Key",
"tools.web.fetch.jina.baseUrl": "Jina Base URL",
"tools.web.fetch.jina.engine": "Jina Engine",
"tools.web.fetch.jina.returnFormat": "Jina Return Format",
"tools.web.fetch.jina.timeoutSeconds": "Jina Timeout (sec)",
"tools.web.fetch.jina.noCache": "Jina Disable Cache",
"tools.web.fetch.jina.withLinksSummary": "Jina Include Links Summary",
"tools.web.fetch.jina.withImagesSummary": "Jina Include Images Summary",
"gateway.controlUi.basePath": "Control UI Base Path",
"gateway.controlUi.allowInsecureAuth": "Allow Insecure Control UI Auth",
"gateway.controlUi.dangerouslyDisableDeviceAuth": "Dangerously Disable Control UI Device Auth",
@ -463,6 +472,17 @@ const FIELD_HELP: Record<string, string> = {
"tools.web.fetch.firecrawl.maxAgeMs":
"Firecrawl maxAge (ms) for cached results when supported by the API.",
"tools.web.fetch.firecrawl.timeoutSeconds": "Timeout in seconds for Firecrawl requests.",
"tools.web.fetch.jina.enabled": "Enable Jina Reader fallback for web_fetch (if configured).",
"tools.web.fetch.jina.apiKey": "Jina API key (fallback: JINA_API_KEY env var).",
"tools.web.fetch.jina.baseUrl": "Jina base URL (default: https://r.jina.ai).",
"tools.web.fetch.jina.engine":
'Jina engine ("browser" for quality, "direct" for speed, "cf-browser-rendering" for JS-heavy sites).',
"tools.web.fetch.jina.returnFormat": 'Jina return format ("markdown", "text", or "html").',
"tools.web.fetch.jina.timeoutSeconds": "Timeout in seconds for Jina requests.",
"tools.web.fetch.jina.noCache": "Bypass Jina cache for fresh content (default: false).",
"tools.web.fetch.jina.withLinksSummary": "Include a summary of all links at the end of content.",
"tools.web.fetch.jina.withImagesSummary":
"Include a summary of all images at the end of content.",
"channels.slack.allowBots":
"Allow bot-authored messages to trigger Slack replies (default: false).",
"channels.slack.thread.historyScope":

View File

@ -182,6 +182,33 @@ export const ToolsWebSearchSchema = z
.strict()
.optional();
export const ToolsWebFetchFirecrawlSchema = z
.object({
enabled: z.boolean().optional(),
apiKey: z.string().optional(),
baseUrl: z.string().optional(),
onlyMainContent: z.boolean().optional(),
maxAgeMs: z.number().int().nonnegative().optional(),
timeoutSeconds: z.number().int().positive().optional(),
})
.strict()
.optional();
export const ToolsWebFetchJinaSchema = z
.object({
enabled: z.boolean().optional(),
apiKey: z.string().optional(),
baseUrl: z.string().optional(),
engine: z.enum(["browser", "direct", "cf-browser-rendering"]).optional(),
returnFormat: z.enum(["markdown", "text", "html"]).optional(),
timeoutSeconds: z.number().int().positive().optional(),
noCache: z.boolean().optional(),
withLinksSummary: z.boolean().optional(),
withImagesSummary: z.boolean().optional(),
})
.strict()
.optional();
export const ToolsWebFetchSchema = z
.object({
enabled: z.boolean().optional(),
@ -190,6 +217,9 @@ export const ToolsWebFetchSchema = z
cacheTtlMinutes: z.number().nonnegative().optional(),
maxRedirects: z.number().int().nonnegative().optional(),
userAgent: z.string().optional(),
readability: z.boolean().optional(),
firecrawl: ToolsWebFetchFirecrawlSchema,
jina: ToolsWebFetchJinaSchema,
})
.strict()
.optional();