
FIM One is provider-agnostic — any OpenAI-compatible endpoint works. This page helps you pick the best model combination for your use case. For configuration details, see Environment Variables.

How FIM One Uses Models

FIM One has three model roles:
| Role | Env Variable | Used For |
| --- | --- | --- |
| General | LLM_MODEL | Planning, analysis, ReAct agent, complex reasoning |
| Fast | FAST_LLM_MODEL | DAG step execution, context compaction (cheaper, faster) |
| Reasoning | REASONING_LLM_MODEL | Deep analysis, complex planning, mathematical proofs |
Fast and Reasoning fall back to General if not configured. For production deployments, splitting into at least two models (General + Fast) gives the best cost/quality balance. These roles can be configured via ENV variables or through the admin UI’s Model Groups feature, which allows one-click switching between model sets. See Model Management for the full admin UI guide.
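For example, a deployment that splits all three roles might look like this (model choices are illustrative; see the provider sections below):

```bash
# .env — all three model roles split (illustrative choices)
LLM_MODEL=gpt-5.4              # General: planning, ReAct agent, analysis
FAST_LLM_MODEL=gpt-5.4-nano    # Fast: DAG steps, context compaction
REASONING_LLM_MODEL=o3         # Reasoning: deep analysis, complex planning
```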

Quick Selection Matrix

| Provider | Main LLM | Fast LLM | Reasoning | Vision | Notes |
| --- | --- | --- | --- | --- | --- |
| OpenAI | gpt-5.4 | gpt-5.4-mini / gpt-5.4-nano | reasoning_effort | ✅ All | Best native tool-calling; GPT-5.4 is latest flagship (Mar 2026) |
| Anthropic | claude-sonnet-4-6 | claude-haiku-4-5 | ✅ via LiteLLM | ✅ All | Native API routing; full reasoning_content support; 1M context GA |
| Google Gemini | gemini-2.5-pro / gemini-3.1-pro-preview | gemini-2.5-flash / gemini-3-flash-preview | reasoning_effort | ✅ All | 2.5 is stable GA; 3.x is preview; gemini-3-pro-preview shut down Mar 9 |
| DeepSeek | deepseek-chat (V3.2) | deepseek-chat | deepseek-reasoner | ❌ | Text-only; V4 (Apr 2026) will add vision |
| Qwen (Alibaba) | qwen3.5-plus / qwen3-max | qwen3.5-flash / qwen-turbo | enable_thinking on qwen3-max | ⚠️ qwen3.5 only | Strongest Chinese language; qwq/reasoning text-only |
| ChatGLM (Zhipu) | glm-4.7 | glm-4.7-flash | glm-5 | ⚠️ GLM-4.6V | Forced FC not supported; vision requires separate VLM model |
| MiniMax | MiniMax-M2.7 | MiniMax-M2.5 | Always-on | ❌ | Text-only; M2.7 latest (Mar 2026); 80.2% SWE-Bench |
| Kimi (Moonshot) | kimi-k2.5 | kimi-k2 | kimi-k2-thinking | ⚠️ K2.5 only | K2-thinking text-only; forced FC not supported with thinking |
| Ollama (local) | qwen3.5 / llama4 | qwen3.5:9b | Varies | ⚠️ llama4 | Fully offline, no API key; Llama 4 supports vision |
Vision indicates whether the model accepts image input. This is required for Intelligent Document Processing (IDP) — if your model doesn’t support vision, IDP will fall back to text-only extraction. Providers marked ⚠️ have vision on some models but not others; check the specific model you’re using.

Structured Output Compatibility

FIM One’s DAG planner needs the model to return valid structured JSON. Internally, it tries three extraction levels in order:
  1. Native Function Calling — forces the model to output JSON matching a schema via the tool-call API. Most reliable.
  2. JSON Mode — requests response_format: json_object. Guarantees valid JSON, but does not enforce schema compliance.
  3. Plain Text Extraction — parses JSON from free-form text as a last resort.
Models that support Level 1 (native FC with forced tool_choice) give the best planning reliability. If a model only reaches Level 2, its output quality depends on how well it follows prompt instructions — weaker models may produce valid JSON that doesn’t match the expected structure.
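As a sketch, the chain behaves roughly like the following (hypothetical llm.chat helper and names; this mirrors the described behavior, not FIM One's actual internals):

```python
import json

def extract_plan(llm, messages: list, plan_schema: dict) -> dict:
    """Three-level structured output extraction, as described above."""
    # Level 1: native function calling with forced tool_choice (most reliable)
    try:
        call = llm.chat(messages, tools=[plan_schema], tool_choice="required")
        return json.loads(call.tool_calls[0].function.arguments)
    except Exception:
        pass
    # Level 2: JSON mode guarantees valid JSON but not schema compliance
    try:
        resp = llm.chat(messages, response_format={"type": "json_object"})
        return json.loads(resp.content)
    except Exception:
        pass
    # Level 3: scrape the outermost {...} block out of free-form text
    text = llm.chat(messages).content
    return json.loads(text[text.find("{"): text.rfind("}") + 1])
```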
| Provider | Forced Function Calling | JSON Mode | Planning Reliability |
| --- | --- | --- | --- |
| OpenAI (GPT-5.x, o3) | ✅ Full support | ✅ | ⭐⭐⭐ Excellent |
| Anthropic (Claude 4.x) | ⚠️ Conflicts with thinking mode | ✅ | ⭐⭐⭐ Excellent (strong instruction following compensates) |
| Google Gemini (2.5/3.x) | ✅ Full support | ✅ | ⭐⭐⭐ Excellent |
| Mistral | ✅ Full support | ✅ | ⭐⭐ Good |
| DeepSeek (V3.2) | ⚠️ Unstable (tool_choice="required" works, "auto" unreliable) | ✅ | ⭐⭐ Good |
| Qwen (3.x) | ⚠️ Partial | ✅ | ⭐⭐ Good |
| Kimi (K2.5) | ⚠️ Partial — auto only when thinking enabled | ✅ | ⭐ Fair — may produce malformed plans |
| ChatGLM (GLM-4.7/5) | ❌ Not supported (auto only) | ✅ | ⭐ Fair |
| MiniMax (M2.5/M2.7) | ✅ Full support | ✅ | ⭐⭐ Good |
| Local (Ollama) | Varies by model | Varies | ⭐ Fair — 32B+ recommended |
If you see the error “failed to generate a valid task plan”, the model’s structured output capability is insufficient for DAG planning. Switch your Main LLM to a model rated ⭐⭐⭐ or ⭐⭐ above, or disable DAG mode and use the simpler ReAct agent instead.

Thinking / Reasoning Compatibility

Different providers implement “thinking” (chain-of-thought reasoning) in fundamentally different ways. This matters because thinking mode can conflict with tool calling, and the output appears in different places depending on the provider. FIM One handles all of these transparently — this table helps you understand what’s happening under the hood.

Key Concepts

  • Opt-in — thinking is off by default; you enable it via an API parameter (e.g., reasoning_effort). Can be selectively disabled per call.
  • Always-on — the model always thinks; no API parameter to turn it off. You’d need to switch to a non-thinking model variant to avoid it.
  • Model-level — thinking is determined by which model ID you choose (e.g., deepseek-reasoner vs deepseek-chat), not by a parameter.
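In request terms, the three styles differ roughly as follows (a hedged sketch; provider parameter names are taken from the matrix below):

```python
# Opt-in: a request parameter turns thinking on; omit it to disable per call.
openai_style = {"model": "gpt-5.4", "reasoning_effort": "medium"}

# Model-level: the model ID itself selects thinking; there is no parameter.
deepseek_style = {"model": "deepseek-reasoner"}  # vs. "deepseek-chat"

# Always-on: nothing to send and nothing to turn off; to avoid thinking you
# must switch to a non-thinking model variant or family.
minimax_style = {"model": "MiniMax-M2.7"}
```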

Compatibility Matrix

| Provider | How to Enable | Can Disable? | Thinking Output | Forced FC Conflict? |
| --- | --- | --- | --- | --- |
| OpenAI (GPT-5.x) | reasoning_effort param | ✅ Opt-in | Internal (not visible to user) | ⚠️ API drops reasoning_effort when tools present |
| OpenAI (o-series) | Always-on | ❌ | Internal (tokens counted, not returned) | ✅ No conflict |
| Anthropic (Claude 4.x) | reasoning_effort → thinking | ✅ Opt-in | API reasoning_content field → Reasoning panel | ❌ Forced FC + thinking = 400 error |
| Google Gemini (2.5/3.x) | reasoning_effort param | ✅ Opt-in | Internal | ✅ No conflict |
| DeepSeek | Model variant (deepseek-reasoner) | Model-level | API reasoning_content field → Reasoning panel | ⚠️ Forced FC unreliable |
| Qwen (3.x) | enable_thinking param | ✅ Opt-in | <think> tags in content | ⚠️ Partial FC support |
| MiniMax (M2.7) | Always-on | ❌ | <think> tags in content | ✅ No conflict |
| ChatGLM (GLM-5) | Model variant | Model-level | Not externalized | N/A — forced FC not supported |
| Kimi (K2-thinking) | Model variant | Model-level | API field | ❌ Forced FC + thinking = conflict |

How FIM One Handles Each Case

  • API-level reasoning_content (Claude, DeepSeek): The reasoning field is read directly from the API response and displayed in the UI Reasoning panel. No post-processing needed.
  • <think> tags in content (MiniMax, Qwen, QwQ, and other open-source derivatives): FIM One automatically strips <think>...</think> tags from the content field and reroutes the thinking text to the Reasoning panel. This works for both streaming and non-streaming responses.
  • Forced FC + thinking conflicts (Claude, Kimi): When FIM One needs forced function calling (e.g., during DAG planning’s structured output extraction), it temporarily disables thinking for that specific call by passing reasoning_effort=None. This works because Claude’s thinking is opt-in — not sending the parameter means no thinking, which avoids the 400 error. For providers where thinking cannot be disabled (MiniMax), forced FC works fine since those providers don’t reject the combination.
  • Fallback chain: If forced function calling fails for any reason, FIM One falls back automatically: native FC → JSON mode → plain text extraction. This three-tier approach ensures planning works even with providers that have partial tool-calling support.
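For the second case, the tag-stripping step can be pictured with a minimal non-streaming sketch (the helper name is illustrative; the real implementation also handles tags split across streamed chunks):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(content: str) -> tuple[str, str]:
    """Separate <think>...</think> blocks from the visible answer."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(content))
    answer = THINK_RE.sub("", content).strip()
    return thinking, answer

thinking, answer = split_thinking("<think>User wants X.</think>It is X.")
# thinking → routed to the Reasoning panel; answer → shown as the reply
```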
If you’re using a model that always thinks (MiniMax M2.7, DeepSeek R1) as your Main LLM, the thinking output will appear in every agent iteration’s Reasoning panel. This is normal — it doesn’t affect functionality, and you get to see the model’s reasoning process.

Provider Details

OpenAI

The most battle-tested option. OpenAI models have the best native function calling (tool-calling) support, which directly impacts agent reliability. The GPT-5 family (August 2025+) is a major generational leap over GPT-4. Recommended models:
  • Main: gpt-5.4 (latest flagship, Mar 2026 — 1M+ context, computer use) or o3 (best reasoning accuracy)
  • Fast: gpt-5.4-mini (0.75/0.75/4.50 per MTok) or gpt-5.4-nano (cheapest at 0.20/0.20/1.25 per MTok)
  • Budget Fast: gpt-5-mini (0.25/0.25/2.00) and gpt-5-nano (0.05/0.05/0.40) remain available at lower prices
  • Legacy: gpt-4.1 (still in API, 1M context, good for coding)
Reasoning: Set LLM_REASONING_EFFORT=medium — works natively with o-series and GPT-5.x models. GPT-5.4 supports reasoning_effort with levels none, low, medium, high, xhigh. The o-series requires max_completion_tokens instead of max_tokens, which LiteLLM handles automatically. Note: GPT-5.x still drops reasoning_effort when tools are present in /v1/chat/completions — FIM One silently drops it during agent tool-use steps so workflows run uninterrupted. GPT-5.4 requires temperature=1 — FIM One handles this automatically via LiteLLM’s parameter filtering (drop_params).
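For reference, this is roughly what the underlying call looks like with the OpenAI SDK (a sketch; FIM One builds the equivalent request via LiteLLM):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-5.4",
    reasoning_effort="medium",  # dropped by the API if `tools` is also present
    temperature=1,              # GPT-5.4 only accepts the default temperature
    messages=[{"role": "user", "content": "Outline a 3-step data migration."}],
)
print(resp.choices[0].message.content)
```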
| Model | Input $/MTok | Output $/MTok | Context |
| --- | --- | --- | --- |
| gpt-5.4 | $2.50 | $15.00 | 1,050K (surcharge >272K) |
| gpt-5.4-mini | $0.75 | $4.50 | 400K |
| gpt-5.4-nano | $0.20 | $1.25 | 400K |
| o3 | $2.00 | $8.00 | 200K |
| o4-mini | $1.10 | $4.40 | 200K |
| gpt-5-mini | $0.25 | $2.00 | 400K |
| gpt-5-nano | $0.05 | $0.40 | 400K |
# .env — OpenAI (production with reasoning)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-5.4
FAST_LLM_MODEL=gpt-5.4-nano
LLM_REASONING_EFFORT=medium
# .env — OpenAI (budget reasoning)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=o3
FAST_LLM_MODEL=gpt-5.4-nano
LLM_REASONING_EFFORT=medium

Anthropic (Claude)

Claude excels at nuanced reasoning and complex multi-step tasks. FIM One connects via LiteLLM, which routes Anthropic models through their native API automatically. The current generation is Claude 4.6 (February 2026). Recommended models:
  • Main: claude-sonnet-4-6 (best balance of capability and cost — 3/3/15 per MTok)
  • Fast: claude-haiku-4-5 (fast and cheap — 1/1/5 per MTok)
  • Premium: claude-opus-4-6 (most capable, 128K max output — 5/5/25 per MTok)
Base URL: https://api.anthropic.com/v1/

Opus 4.6 and Sonnet 4.6 have a 1M context window (GA since March 13, 2026 — no beta header needed). Haiku 4.5 has a 200K context window.

Reasoning: Set LLM_REASONING_EFFORT=medium — LiteLLM routes Anthropic models through the native API, so reasoning_content (extended thinking) is fully returned and visible in the UI “thinking” step. Claude 4.6 models support Adaptive Thinking (thinking: {type: "adaptive"}), which replaces manual budget_tokens — LiteLLM handles the translation automatically. When extended thinking is enabled, Anthropic requires temperature=1 — set LLM_TEMPERATURE=1 in your .env or model configuration. See Extended Thinking for details.
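The call FIM One makes through LiteLLM looks roughly like this (a sketch, not the actual internals):

```python
import litellm

# LiteLLM routes anthropic/ models through the native API, so extended
# thinking comes back on the reasoning_content field.
resp = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    reasoning_effort="medium",  # translated to Anthropic's thinking parameter
    temperature=1,              # Anthropic requires this with extended thinking
    messages=[{"role": "user", "content": "Compare two deployment strategies."}],
)
msg = resp.choices[0].message
print(msg.reasoning_content)  # thinking text, shown in the UI Reasoning panel
print(msg.content)            # final answer
```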
# .env — Anthropic Claude
LLM_API_KEY=sk-ant-...
LLM_BASE_URL=https://api.anthropic.com/v1/
LLM_MODEL=claude-sonnet-4-6
FAST_LLM_MODEL=claude-haiku-4-5
LLM_REASONING_EFFORT=medium

Google Gemini

Gemini models offer strong performance at competitive pricing via Google’s OpenAI-compatible endpoint. The 3.x generation (late 2025+) is a major leap — Gemini 3 Flash outperforms 2.5 Pro while being 3x faster. Note: gemini-3-pro-preview was shut down March 9, 2026 — use gemini-3.1-pro-preview instead. Recommended models:
  • Stable (GA): gemini-2.5-pro (main) + gemini-2.5-flash (fast) — production-ready
  • Latest (Preview): gemini-3.1-pro-preview (main) + gemini-3-flash-preview (fast) + gemini-3.1-flash-lite-preview (budget fast) — best performance, but preview status
Base URL: https://generativelanguage.googleapis.com/v1beta/openai/

Reasoning: reasoning_effort is supported on the compatibility endpoint — set LLM_REASONING_EFFORT=medium and it works out of the box.
| Model | Input $/MTok | Output $/MTok | Status |
| --- | --- | --- | --- |
| gemini-3.1-pro-preview | $2.00 | $12.00 | Preview |
| gemini-3-flash-preview | $0.50 | $3.00 | Preview |
| gemini-3.1-flash-lite-preview | $0.25 | $1.50 | Preview (Mar 2026) |
| gemini-2.5-pro | $1.25 | $10.00 | Stable GA |
| gemini-2.5-flash | $0.30 | $2.50 | Stable GA |
| gemini-2.5-flash-lite | $0.10 | $0.40 | Stable GA |
# .env — Gemini (stable)
LLM_API_KEY=AIza...
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemini-2.5-pro
FAST_LLM_MODEL=gemini-2.5-flash
LLM_REASONING_EFFORT=medium
# .env — Gemini (latest preview)
LLM_API_KEY=AIza...
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemini-3.1-pro-preview
FAST_LLM_MODEL=gemini-3-flash-preview
LLM_REASONING_EFFORT=medium

DeepSeek

DeepSeek offers the best cost/performance ratio in the market. V3.2 (December 2025) unified the chat and reasoning lineages into a single model, with incredibly low pricing. Model IDs (both backed by V3.2):
  • deepseek-chat — general purpose (non-thinking mode)
  • deepseek-reasoner — chain-of-thought reasoning mode, returns reasoning_content
Base URL: https://api.deepseek.com

Pricing: 0.28/0.28/0.42 per MTok (cache hit: $0.028) — by far the cheapest frontier-class API.

Output limits: deepseek-chat max output is 8K tokens (must be set explicitly via max_tokens). deepseek-reasoner max output is 64K tokens (includes chain-of-thought).
V4 expected April 2026: trillion-parameter multimodal model with 1M context window. Expect new model IDs when it launches.
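A minimal sketch of reading the reasoning output with the OpenAI SDK (reasoning_content is a DeepSeek-specific field, so it is read defensively here):

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    max_tokens=32768,  # reasoner allows up to 64K; deepseek-chat caps at 8K
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # chain of thought → Reasoning panel
print(msg.content)                              # final answer
```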
# .env — DeepSeek (budget-friendly)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.deepseek.com
LLM_MODEL=deepseek-chat
FAST_LLM_MODEL=deepseek-chat
# .env — DeepSeek (with reasoning)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.deepseek.com
LLM_MODEL=deepseek-reasoner
FAST_LLM_MODEL=deepseek-chat

Chinese Domestic Models

All major Chinese model providers expose OpenAI-compatible endpoints. These are particularly strong for Chinese-language tasks and offer competitive local pricing.

Qwen / 通义千问 (Alibaba Cloud)

Qwen 3.5 (February 2026) is the latest generation — the 397B MoE flagship outperforms GPT-5.2 on MMLU-Pro. Strongest Chinese language support and cheapest frontier-class pricing (~$0.11/MTok input).
  • Base URL (China): https://dashscope.aliyuncs.com/compatible-mode/v1
  • Base URL (Global): https://dashscope-intl.aliyuncs.com/compatible-mode/v1
  • Main: qwen3.5-plus (flagship, 1M context, 0.11/0.11/0.66 per MTok) or qwen3-max (256K, strongest)
  • Fast: qwen3.5-flash (0.055/0.055/0.22 per MTok) or qwen-turbo (0.04/0.04/0.08 per MTok)
  • Reasoning: qwen3-max with enable_thinking: true parameter (there is no separate qwen3-max-thinking model ID)
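A sketch of enabling thinking through the compatibility endpoint (extra_body is used because enable_thinking is not part of the standard OpenAI request schema):

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3-max",
    extra_body={"enable_thinking": True},  # DashScope-specific flag
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)
# Per the compatibility matrix above, the thinking text arrives as
# <think>...</think> tags inside content; FIM One strips and reroutes it.
print(resp.choices[0].message.content)
```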
# .env — Qwen (China)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
LLM_MODEL=qwen3.5-plus
FAST_LLM_MODEL=qwen3.5-flash
# .env — Qwen (Global)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
LLM_MODEL=qwen3.5-plus
FAST_LLM_MODEL=qwen3.5-flash

ChatGLM / 智谱

GLM-4.7 and GLM-5 (2026) are the latest models. GLM-5 is the 745B MoE flagship approaching Claude Opus-level on coding/agent tasks.
  • Base URL (Domestic): https://open.bigmodel.cn/api/paas/v4
  • Base URL (Z.AI International): https://api.z.ai/api/paas/v4
  • Main: glm-4.7 (strong coding, 0.60/0.60/2.20 on Z.AI)
  • Fast: glm-4.7-flash (free tier!) or glm-4.7-flashx (0.07/0.07/0.40, higher throughput)
  • Reasoning: glm-5 (745B MoE flagship, 1.00/1.00/3.20)
Forced tool_choice is not supported — only "auto" works.
Some HTTP clients auto-append /v1 to base URLs. Zhipu uses /v4 — ensure your client does not force an OpenAI-style path suffix or you’ll get 404 errors.
# .env — ChatGLM (domestic)
LLM_API_KEY=...
LLM_BASE_URL=https://open.bigmodel.cn/api/paas/v4
LLM_MODEL=glm-4.7
FAST_LLM_MODEL=glm-4.7-flash
# .env — ChatGLM (Z.AI international)
LLM_API_KEY=...
LLM_BASE_URL=https://api.z.ai/api/paas/v4
LLM_MODEL=glm-4.7
FAST_LLM_MODEL=glm-4.7-flash

MiniMax

MiniMax M2.7 (March 18, 2026) is the latest model: open-weight, scoring 80.2% on SWE-Bench. M2.5 remains available as a fast/budget option. MiniMax provides two separate API endpoints for different regions:
  • Base URL (Global/海外版): https://api.minimax.io/v1 — for users outside mainland China
  • Base URL (China/国内版): https://api.minimaxi.com/v1 — for users in mainland China (note the extra i in minimaxi)
  • Main: MiniMax-M2.7
  • Fast: MiniMax-M2.5
  • Speed: MiniMax-M2.7-highspeed (2x cost, lower latency)
| Model | Input $/MTok | Output $/MTok |
| --- | --- | --- |
| MiniMax-M2.7 | $0.30 | $1.20 |
| MiniMax-M2.7-highspeed | $0.60 | $2.40 |
| MiniMax-M2.5 | $0.30 | $1.20 |
| MiniMax-M2.5-highspeed | $0.60 | $2.40 |
# .env — MiniMax (global endpoint)
LLM_API_KEY=...
LLM_BASE_URL=https://api.minimax.io/v1
LLM_MODEL=MiniMax-M2.7
FAST_LLM_MODEL=MiniMax-M2.5
# .env — MiniMax (China mainland endpoint)
LLM_API_KEY=...
LLM_BASE_URL=https://api.minimaxi.com/v1
LLM_MODEL=MiniMax-M2.7
FAST_LLM_MODEL=MiniMax-M2.5

Kimi / 月之暗面 (Moonshot)

Kimi K2.5 (January 2026) has 256K context and strong coding performance (76.8% on SWE-Bench, among the best open-source results).
  • Base URL (Global): https://api.moonshot.ai/v1
  • Base URL (China): https://api.moonshot.cn/v1
  • Main: kimi-k2.5
  • Fast: kimi-k2 (non-thinking, function calling works)
  • Reasoning: kimi-k2-thinking (0.47/0.47/2.00 per MTok)
Forced tool_choice only works when thinking mode is off. When thinking is enabled, only "auto" is supported.
# .env — Kimi (Global)
LLM_API_KEY=...
LLM_BASE_URL=https://api.moonshot.ai/v1
LLM_MODEL=kimi-k2.5
FAST_LLM_MODEL=kimi-k2
# .env — Kimi (China)
LLM_API_KEY=...
LLM_BASE_URL=https://api.moonshot.cn/v1
LLM_MODEL=kimi-k2.5
FAST_LLM_MODEL=kimi-k2

Local Models (Ollama)

Run models entirely on your own hardware — no API key needed, fully offline. Ollama exposes an OpenAI-compatible endpoint out of the box. The open-source landscape has changed dramatically — Qwen 3.5, Llama 4, and GPT-OSS (OpenAI’s first open-weight models) are all available.

Base URL: http://localhost:11434/v1

Recommended models by VRAM:
| VRAM | Main LLM | Fast LLM | Notes |
| --- | --- | --- | --- |
| 8 GB | qwen3.5:9b / gemma3:4b | qwen3.5:4b | Qwen 3.5 9B is the standout at this tier |
| 16 GB | gpt-oss:20b / deepseek-r1:14b | qwen3.5:9b | GPT-OSS 20B is agent-optimized |
| 24 GB | qwen3:32b / deepseek-r1:32b | qwen3.5:9b | Qwen 3 32B is best for tool-calling |
| 48 GB+ | llama3.3:70b / gpt-oss:120b | qwen3.5:14b | Near-frontier quality |
Best for tool-calling: Qwen 3/3.5 (32B+), GLM-4.7, GPT-OSS, Mistral — these have explicit function-calling training. Models with 14B+ parameters are the minimum for reliable tool calling; 32B+ is strongly preferred.
Tool-calling quality varies significantly across local models. Not all models reliably generate valid function calls. Test your chosen model with agent workflows before using it in production. The general rule: 14B minimum, 32B+ recommended for agent tasks.
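Before committing to a local model, a quick smoke test like this (hypothetical tool schema) shows whether it emits well-formed tool calls at all:

```python
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3:32b",
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)
calls = resp.choices[0].message.tool_calls
if calls:
    print(calls[0].function.name, calls[0].function.arguments)
else:
    print("No tool call emitted; this model may be too weak for agent use")
```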
# .env — Ollama (balanced, 16GB VRAM)
LLM_API_KEY=ollama
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=gpt-oss:20b
FAST_LLM_MODEL=qwen3.5:9b
LLM_CONTEXT_SIZE=32768
LLM_MAX_OUTPUT_TOKENS=8192
# .env — Ollama (agent-optimized, 24GB VRAM)
LLM_API_KEY=ollama
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=qwen3:32b
FAST_LLM_MODEL=qwen3.5:9b
LLM_CONTEXT_SIZE=32768
LLM_MAX_OUTPUT_TOKENS=8192

Third-Party Relay Platforms

Many users access multiple model providers through a single relay (proxy) service. FIM One automatically detects the correct API protocol based on URL path patterns — just fill in the LLM_BASE_URL and it works.

How It Works

When your base URL points to a third-party relay, FIM One inspects the URL path to determine which protocol to use:
| URL Path Contains | Detected Protocol | Auth Header | Key Benefit |
| --- | --- | --- | --- |
| /v1 (or no match) | OpenAI compatible | Authorization: Bearer | Universal fallback, works with most relays |
| /claude or /anthropic | Anthropic native | x-api-key | Full reasoning_content (extended thinking) support |
| /gemini | Google native | x-goog-api-key | Native Gemini parameter translation |
Resolution order: Explicit DB provider field > domain match (official APIs) > URL path hint (relay platforms) > OpenAI compatible fallback.
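A hypothetical mirror of that resolution order (not FIM One's actual source, just the same precedence made concrete):

```python
from urllib.parse import urlparse

def detect_protocol(base_url: str, db_provider: str | None = None) -> str:
    if db_provider:                                   # 1. explicit DB provider field
        return db_provider
    parsed = urlparse(base_url)
    host, path = parsed.hostname or "", parsed.path
    if host.endswith("api.anthropic.com"):            # 2. official domain match
        return "anthropic"
    if host.endswith("generativelanguage.googleapis.com"):
        return "gemini"
    if "/claude" in path or "/anthropic" in path:     # 3. relay URL path hint
        return "anthropic"
    if "/gemini" in path:
        return "gemini"
    return "openai"                                   # 4. OpenAI-compatible fallback

assert detect_protocol("https://relay.example.com/anthropic") == "anthropic"
assert detect_protocol("https://my-relay.example.com/v1") == "openai"
```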

Example: One Relay, Three Protocols

With a single relay account, you can access different providers by simply changing the base URL path:
# .env — Claude via relay (Anthropic native protocol)
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://relay.example.com/anthropic
LLM_MODEL=claude-sonnet-4-6
# .env — Gemini via relay (Google native protocol)
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://relay.example.com/gemini
LLM_MODEL=gemini-2.5-pro
# .env — GPT via relay (OpenAI compatible protocol)
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://relay.example.com/v1
LLM_MODEL=gpt-5.4
No extra configuration needed — authentication headers, parameter formats, and response parsing all switch automatically.

Step-by-Step: How Path Detection Works

Here’s a concrete example showing what happens internally when you configure a relay:
# .env — Claude via a relay platform
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://my-relay.example.com/claude
LLM_MODEL=claude-sonnet-4-6
LLM_REASONING_EFFORT=medium
  1. FIM One sees /claude in the URL path → detects Anthropic native protocol
  2. Model is prefixed as anthropic/claude-sonnet-4-6 for LiteLLM routing
  3. Requests use Anthropic’s /v1/messages format with x-api-key auth header
  4. reasoning_effort=medium is translated to Anthropic’s native thinking parameter (not OpenAI’s reasoning_effort)
If the same relay URL were https://my-relay.example.com/v1 instead, the /claude hint would be missing — FIM One would fall back to OpenAI-compatible protocol, sending /v1/chat/completions requests to a Claude-native endpoint, which would fail. The URL path matters.

Why This Matters

  • Anthropic native endpoint gives you proper reasoning_content support (extended thinking visible in the UI), correct tool-calling format, and x-api-key authentication — features lost when using OpenAI-compatible translation.
  • Google native endpoint gives you native Gemini parameters and x-goog-api-key authentication.
  • OpenAI compatible is the universal fallback and works with any relay, but provider-specific features (like extended thinking output) may be unavailable.
If your relay platform uses non-standard path conventions (e.g., no /claude or /anthropic in the URL), FIM One falls back to OpenAI compatible protocol — which works for most use cases. For full native protocol support, you can set the provider field explicitly via the admin model configuration UI.

Configuration Strategy

Main vs Fast: When to Split

  • Split when your main model is expensive or slow (e.g., gpt-5.4 + gpt-5.4-nano). DAG mode runs many parallel steps — using a cheaper fast model saves significant cost.
  • Same model when your model is already cheap (e.g., deepseek-chat for both). The overhead of managing two models isn’t worth it.

When to Enable Reasoning

  • Enable for complex analytical tasks, multi-step planning, and tasks requiring careful judgment
  • Disable (default) for routine tasks, simple Q&A, and cost-sensitive deployments
  • Reasoning typically increases cost 2-5x per request — medium effort is a good starting point

Context Window Sizing

Set LLM_CONTEXT_SIZE to match your model’s actual window:
| Model | Context Window |
| --- | --- |
| GPT-5.4 | 1,050K (surcharge >272K) |
| o3 / o4-mini | 200K |
| Claude Opus 4.6 | 1M |
| Claude Sonnet 4.6 | 1M |
| Claude Haiku 4.5 | 200K |
| Gemini 2.5 Pro | 1M |
| Gemini 3.1 Pro | 1M |
| DeepSeek V3.2 | 128K |
| Qwen 3.5 Plus | 1M |
| Local (Ollama) | 4K–128K (varies) |
For local models, set both LLM_CONTEXT_SIZE and LLM_MAX_OUTPUT_TOKENS explicitly — defaults assume cloud-scale context windows that local models cannot support.