Documentation Index
Fetch the complete documentation index at: https://docs.fim.ai/llms.txt
Use this file to discover all available pages before exploring further.
FIM One is provider-agnostic — any OpenAI-compatible endpoint works. This page helps you pick the best model combination for your use case. For configuration details, see Environment Variables.
How FIM One Uses Models
FIM One has three model roles:
| Role | Env Variable | Used For |
|---|---|---|
| General | LLM_MODEL | Planning, analysis, ReAct agent, complex reasoning |
| Fast | FAST_LLM_MODEL | DAG step execution, context compaction (cheaper, faster) |
| Reasoning | REASONING_LLM_MODEL | Deep analysis, complex planning, mathematical proofs |
Fast and Reasoning fall back to General if not configured. For production deployments, splitting into at least two models (General + Fast) gives the best cost/quality balance.
These roles can be configured via ENV variables or through the admin UI’s Model Groups feature, which allows one-click switching between model sets. See Model Management for the full admin UI guide.
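The fallback behavior can be sketched as a small resolver. This is an illustrative helper, not FIM One’s actual code; only the environment variable names come from the table above:

```python
import os

def resolve_model(role: str) -> str:
    """Resolve a model role to a concrete model ID.

    Fast and Reasoning fall back to the General model (LLM_MODEL)
    when their environment variables are unset.
    """
    general = os.environ["LLM_MODEL"]  # required; no fallback
    if role == "fast":
        return os.environ.get("FAST_LLM_MODEL", general)
    if role == "reasoning":
        return os.environ.get("REASONING_LLM_MODEL", general)
    return general

# Example: only General and Fast are configured (the recommended
# two-model production split)
os.environ["LLM_MODEL"] = "gpt-5.4"
os.environ["FAST_LLM_MODEL"] = "gpt-5.4-nano"
os.environ.pop("REASONING_LLM_MODEL", None)

print(resolve_model("fast"))       # gpt-5.4-nano
print(resolve_model("reasoning"))  # falls back to gpt-5.4
```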
Quick Selection Matrix
| Provider | Main LLM | Fast LLM | Reasoning | Vision | Notes |
|---|---|---|---|---|---|
| OpenAI | gpt-5.4 | gpt-5.4-mini / gpt-5.4-nano | ✅ reasoning_effort | ✅ All | Best native tool-calling; GPT-5.4 is latest flagship (Mar 2026) |
| Anthropic | claude-sonnet-4-6 | claude-haiku-4-5 | ✅ via LiteLLM | ✅ All | Native API routing; full reasoning_content support; 1M context GA |
| Google Gemini | gemini-2.5-pro / gemini-3.1-pro-preview | gemini-2.5-flash / gemini-3-flash-preview | ✅ reasoning_effort | ✅ All | 2.5 is stable GA; 3.x is preview; gemini-3-pro-preview shut down Mar 9 |
| DeepSeek | deepseek-chat (V3.2) | deepseek-chat | ✅ deepseek-reasoner | ❌ | Text-only; V4 (Apr 2026) will add vision |
| Qwen (Alibaba) | qwen3.5-plus / qwen3-max | qwen3.5-flash / qwen-turbo | ✅ enable_thinking on qwen3-max | ⚠️ qwen3.5 only | Strongest Chinese language; qwq/reasoning text-only |
| ChatGLM (Zhipu) | glm-4.7 | glm-4.7-flash | glm-5 | ⚠️ GLM-4.6V | Forced FC not supported; vision requires separate VLM model |
| MiniMax | MiniMax-M2.7 | MiniMax-M2.5 | ❌ | ❌ | Text-only; M2.7 latest (Mar 2026); 80.2% SWE-Bench |
| Kimi (Moonshot) | kimi-k2.5 | kimi-k2 | ✅ kimi-k2-thinking | ⚠️ K2.5 only | K2-thinking text-only; forced FC not supported with thinking |
| Ollama (local) | qwen3.5 / llama4 | qwen3.5:9b | ❌ | Varies | Fully offline, no API key; Llama 4 supports vision |
Vision indicates whether the model accepts image input. This is required for Intelligent Document Processing (IDP) — if your model doesn’t support vision, IDP will fall back to text-only extraction. Providers marked ⚠️ have vision on some models but not others; check the specific model you’re using.
Structured Output Compatibility
FIM One’s DAG planner needs the model to return valid structured JSON. Internally, it tries three extraction levels in order:
1. Native Function Calling — forces the model to output JSON matching a schema via the tool-call API. Most reliable.
2. JSON Mode — requests response_format: json_object. Guarantees valid JSON, but does not enforce schema compliance.
3. Plain Text Extraction — parses JSON from free-form text as a last resort.
Models that support Level 1 (native FC with forced tool_choice) give the best planning reliability. If a model only reaches Level 2, its output quality depends on how well it follows prompt instructions — weaker models may produce valid JSON that doesn’t match the expected structure.
| Provider | Forced Function Calling | JSON Mode | Planning Reliability |
|---|---|---|---|
| OpenAI (GPT-5.x, o3) | ✅ Full support | ✅ | ⭐⭐⭐ Excellent |
| Anthropic (Claude 4.x) | ⚠️ Conflicts with thinking mode | ✅ | ⭐⭐⭐ Excellent (strong instruction following compensates) |
| Google Gemini (2.5/3.x) | ✅ Full support | ✅ | ⭐⭐⭐ Excellent |
| Mistral | ✅ Full support | ✅ | ⭐⭐ Good |
| DeepSeek (V3.2) | ⚠️ Unstable (tool_choice="required" works, "auto" unreliable) | ✅ | ⭐⭐ Good |
| Qwen (3.x) | ⚠️ Partial | ✅ | ⭐⭐ Good |
| Kimi (K2.5) | ⚠️ Partial — auto only when thinking enabled | ✅ | ⭐ Fair — may produce malformed plans |
| ChatGLM (GLM-4.7/5) | ❌ Not supported (auto only) | ✅ | ⭐ Fair |
| MiniMax (M2.5/M2.7) | ✅ Full support | ✅ | ⭐⭐ Good |
| Local (Ollama) | Varies by model | Varies | ⭐ Fair — 32B+ recommended |
If you see the error “failed to generate a valid task plan”, the model’s structured output capability is insufficient for DAG planning. Switch your Main LLM to a model rated ⭐⭐⭐ or ⭐⭐ above, or disable DAG mode and use the simpler ReAct agent instead.
Thinking / Reasoning Compatibility
Different providers implement “thinking” (chain-of-thought reasoning) in fundamentally different ways. This matters because thinking mode can conflict with tool calling, and the output appears in different places depending on the provider. FIM One handles all of these transparently — this table helps you understand what’s happening under the hood.
Key Concepts
- Opt-in — thinking is off by default; you enable it via an API parameter (e.g., reasoning_effort) and can selectively disable it per call.
- Always-on — the model always thinks; no API parameter to turn it off. You’d need to switch to a non-thinking model variant to avoid it.
- Model-level — thinking is determined by which model ID you choose (e.g., deepseek-reasoner vs deepseek-chat), not by a parameter.
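The three styles translate into different request shapes. A minimal sketch, assuming the parameter and model names from the tables on this page (the helper itself is hypothetical):

```python
def thinking_request(provider: str, enable: bool) -> dict:
    """Build the provider-specific request fragment that controls thinking."""
    if provider == "openai":
        # Opt-in: a per-call parameter; omit it (None) to disable
        return {"model": "gpt-5.4",
                "reasoning_effort": "medium" if enable else None}
    if provider == "qwen":
        # Opt-in: a different per-call parameter
        return {"model": "qwen3-max", "enable_thinking": enable}
    if provider == "deepseek":
        # Model-level: pick a different model ID
        return {"model": "deepseek-reasoner" if enable else "deepseek-chat"}
    if provider == "minimax":
        # Always-on: nothing to toggle
        return {"model": "MiniMax-M2.7"}
    raise ValueError(f"unknown provider: {provider}")
```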
Compatibility Matrix
| Provider | How to Enable | Can Disable? | Thinking Output | Forced FC Conflict? |
|---|---|---|---|---|
| OpenAI (GPT-5.x) | reasoning_effort param | ✅ Opt-in | Internal (not visible to user) | ⚠️ API drops reasoning_effort when tools present |
| OpenAI (o-series) | Always-on | ❌ | Internal (tokens counted, not returned) | ✅ No conflict |
| Anthropic (Claude 4.x) | reasoning_effort → thinking | ✅ Opt-in | API reasoning_content field → Reasoning panel | ❌ Forced FC + thinking = 400 error |
| Google Gemini (2.5/3.x) | reasoning_effort param | ✅ Opt-in | Internal | ✅ No conflict |
| DeepSeek | Model variant (deepseek-reasoner) | Model-level | API reasoning_content field → Reasoning panel | ⚠️ Forced FC unreliable |
| Qwen (3.x) | enable_thinking param | ✅ Opt-in | <think> tags in content | ⚠️ Partial FC support |
| MiniMax (M2.7) | Always-on | ❌ | <think> tags in content | ✅ No conflict |
| ChatGLM (GLM-5) | Model variant | Model-level | Not externalized | N/A — forced FC not supported |
| Kimi (K2-thinking) | Model variant | Model-level | API field | ❌ Forced FC + thinking = conflict |
How FIM One Handles Each Case
API-level reasoning_content (Claude, DeepSeek): The reasoning field is read directly from the API response and displayed in the UI Reasoning panel. No post-processing needed.
<think> tags in content (MiniMax, Qwen, QwQ, and other open-source derivatives): FIM One automatically strips <think>...</think> tags from the content field and reroutes the thinking text to the Reasoning panel. This works for both streaming and non-streaming responses.
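Tag stripping of this kind can be sketched with a regular expression. An illustrative version, not FIM One’s actual implementation:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(content: str) -> tuple[str, str]:
    """Separate <think> blocks from the visible answer.

    Returns (answer, reasoning): reasoning is rerouted to the
    Reasoning panel; the answer keeps only the user-facing text.
    """
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(content))
    answer = THINK_RE.sub("", content).strip()
    return answer, reasoning

answer, reasoning = split_thinking(
    "<think>The user wants a summary.</think>Here is the summary."
)
```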
Forced FC + thinking conflicts (Claude, Kimi): When FIM One needs forced function calling (e.g., during DAG planning’s structured output extraction), it temporarily disables thinking for that specific call by passing reasoning_effort=None. This works because Claude’s thinking is opt-in — not sending the parameter means no thinking, which avoids the 400 error. For providers where thinking cannot be disabled (MiniMax), forced FC works fine since those providers don’t reject the combination.
Fallback chain: If forced function calling fails for any reason, FIM One falls back automatically: native FC → JSON mode → plain text extraction. This three-tier approach ensures planning works even with providers that have partial tool-calling support.
If you’re using a model that always thinks (MiniMax M2.7, DeepSeek R1) as your Main LLM, the thinking output will appear in every agent iteration’s Reasoning panel. This is normal — it doesn’t affect functionality, and you get to see the model’s reasoning process.
Provider Details
OpenAI
The most battle-tested option. OpenAI models have the best native function calling (tool-calling) support, which directly impacts agent reliability. The GPT-5 family (August 2025+) is a major generational leap over GPT-4.
Recommended models:
- Main: gpt-5.4 (latest flagship, Mar 2026 — 1M+ context, computer use) or o3 (best reasoning accuracy)
- Fast: gpt-5.4-mini ($0.75/$4.50 per MTok) or gpt-5.4-nano (cheapest at $0.20/$1.25 per MTok)
- Budget Fast: gpt-5-mini ($0.25/$2.00) and gpt-5-nano ($0.05/$0.40) remain available at lower prices
- Legacy: gpt-4.1 (still in the API, 1M context, good for coding)
Reasoning: Set LLM_REASONING_EFFORT=medium — works natively with o-series and GPT-5.x models. GPT-5.4 supports reasoning_effort with levels none, low, medium, high, xhigh. The o-series requires max_completion_tokens instead of max_tokens, which LiteLLM handles automatically. Note: GPT-5.x still drops reasoning_effort when tools are present in /v1/chat/completions — FIM One silently drops it during agent tool-use steps so workflows run uninterrupted. GPT-5.4 requires temperature=1 — FIM One handles this automatically via LiteLLM’s parameter filtering (drop_params).
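The tool-use caveat amounts to a parameter filter. A behavior sketch (the function is hypothetical; only the parameter names are real):

```python
def prepare_openai_params(params: dict) -> dict:
    """Mirror the GPT-5.x quirks described above (illustrative sketch).

    - When tools are present, /v1/chat/completions drops reasoning_effort,
      so strip it up front instead of letting the API discard it silently.
    - GPT-5.4 requires temperature=1, so pin it.
    """
    out = dict(params)
    if out.get("tools"):
        out.pop("reasoning_effort", None)  # would be ignored anyway
    out["temperature"] = 1
    return out

# Agent tool-use step: reasoning_effort is stripped before sending
p = prepare_openai_params({
    "model": "gpt-5.4",
    "reasoning_effort": "medium",
    "tools": [{"type": "function", "name": "search"}],
})
```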
| Model | Input $/MTok | Output $/MTok | Context |
|---|---|---|---|
| gpt-5.4 | $2.50 | $15.00 | 1,050K (surcharge >272K) |
| gpt-5.4-mini | $0.75 | $4.50 | 400K |
| gpt-5.4-nano | $0.20 | $1.25 | 400K |
| o3 | $2.00 | $8.00 | 200K |
| o4-mini | $1.10 | $4.40 | 200K |
| gpt-5-mini | $0.25 | $2.00 | 400K |
| gpt-5-nano | $0.05 | $0.40 | 400K |
# .env — OpenAI (production with reasoning)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-5.4
FAST_LLM_MODEL=gpt-5.4-nano
LLM_REASONING_EFFORT=medium
# .env — OpenAI (budget reasoning)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=o3
FAST_LLM_MODEL=gpt-5.4-nano
LLM_REASONING_EFFORT=medium
Anthropic (Claude)
Claude excels at nuanced reasoning and complex multi-step tasks. FIM One connects via LiteLLM, which routes Anthropic models through their native API automatically. The current generation is Claude 4.6 (February 2026).
Recommended models:
- Main: claude-sonnet-4-6 (best balance of capability and cost — $3/$15 per MTok)
- Fast: claude-haiku-4-5 (fast and cheap — $1/$5 per MTok)
- Premium: claude-opus-4-6 (most capable, 128K max output — $5/$25 per MTok)
Base URL: https://api.anthropic.com/v1/
Opus 4.6 and Sonnet 4.6 have a 1M context window (GA since March 13, 2026 — no beta header needed). Haiku 4.5 has a 200K context window.
Reasoning: Set LLM_REASONING_EFFORT=medium — LiteLLM routes Anthropic models through the native API, so reasoning_content (extended thinking) is fully returned and visible in the UI “thinking” step. Claude 4.6 models support Adaptive Thinking (thinking: {type: "adaptive"}) which replaces manual budget_tokens — LiteLLM handles the translation automatically. When extended thinking is enabled, Anthropic requires temperature=1 — set LLM_TEMPERATURE=1 in your .env or model configuration. See Extended Thinking for details.
# .env — Anthropic Claude
LLM_API_KEY=sk-ant-...
LLM_BASE_URL=https://api.anthropic.com/v1/
LLM_MODEL=claude-sonnet-4-6
FAST_LLM_MODEL=claude-haiku-4-5
LLM_REASONING_EFFORT=medium
Google Gemini
Gemini models offer strong performance at competitive pricing via Google’s OpenAI-compatible endpoint. The 3.x generation (late 2025+) is a major leap — Gemini 3 Flash outperforms 2.5 Pro while being 3x faster. Note: gemini-3-pro-preview was shut down March 9, 2026 — use gemini-3.1-pro-preview instead.
Recommended models:
- Stable (GA):
gemini-2.5-pro (main) + gemini-2.5-flash (fast) — production-ready
- Latest (Preview):
gemini-3.1-pro-preview (main) + gemini-3-flash-preview (fast) + gemini-3.1-flash-lite-preview (budget fast) — best performance, but preview status
Base URL: https://generativelanguage.googleapis.com/v1beta/openai/
Reasoning: reasoning_effort is supported on the compatibility endpoint — set LLM_REASONING_EFFORT=medium and it works out of the box.
| Model | Input $/MTok | Output $/MTok | Status |
|---|---|---|---|
| gemini-3.1-pro-preview | $2.00 | $12.00 | Preview |
| gemini-3-flash-preview | $0.50 | $3.00 | Preview |
| gemini-3.1-flash-lite-preview | $0.25 | $1.50 | Preview (Mar 2026) |
| gemini-2.5-pro | $1.25 | $10.00 | Stable GA |
| gemini-2.5-flash | $0.30 | $2.50 | Stable GA |
| gemini-2.5-flash-lite | $0.10 | $0.40 | Stable GA |
# .env — Gemini (stable)
LLM_API_KEY=AIza...
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemini-2.5-pro
FAST_LLM_MODEL=gemini-2.5-flash
LLM_REASONING_EFFORT=medium
# .env — Gemini (latest preview)
LLM_API_KEY=AIza...
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemini-3.1-pro-preview
FAST_LLM_MODEL=gemini-3-flash-preview
LLM_REASONING_EFFORT=medium
DeepSeek
DeepSeek offers the best cost/performance ratio in the market. V3.2 (December 2025) unified the chat and reasoning lineages into a single model, with incredibly low pricing.
Model IDs (both backed by V3.2):
- deepseek-chat — general purpose (non-thinking mode)
- deepseek-reasoner — chain-of-thought reasoning mode, returns reasoning_content
Base URL: https://api.deepseek.com
Pricing: $0.28/$0.42 per MTok (cache hit: $0.028) — by far the cheapest frontier-class API.
Output limits: deepseek-chat max output is 8K tokens (must set explicitly via max_tokens). deepseek-reasoner max output is 64K tokens (includes chain-of-thought).
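Since the limits differ per model ID, it helps to set max_tokens from a small lookup. A hedged sketch, assuming 8K = 8192 and 64K = 65536 tokens (the helper is illustrative, not part of FIM One):

```python
def deepseek_payload(model: str, messages: list) -> dict:
    """Apply each DeepSeek model's output ceiling explicitly.

    deepseek-chat caps output at 8K tokens and must be given
    max_tokens explicitly; deepseek-reasoner allows 64K, which
    includes the chain-of-thought.
    """
    limits = {"deepseek-chat": 8192, "deepseek-reasoner": 65536}
    return {
        "model": model,
        "messages": messages,
        "max_tokens": limits[model],
    }
```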
V4 expected April 2026: trillion-parameter multimodal model with 1M context window. Expect new model IDs when it launches.
# .env — DeepSeek (budget-friendly)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.deepseek.com
LLM_MODEL=deepseek-chat
FAST_LLM_MODEL=deepseek-chat
# .env — DeepSeek (with reasoning)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://api.deepseek.com
LLM_MODEL=deepseek-reasoner
FAST_LLM_MODEL=deepseek-chat
Chinese Domestic Models
All major Chinese model providers expose OpenAI-compatible endpoints. These are particularly strong for Chinese-language tasks and offer competitive local pricing.
Qwen / 通义千问 (Alibaba Cloud)
Qwen 3.5 (February 2026) is the latest generation — the 397B MoE flagship outperforms GPT-5.2 on MMLU-Pro. Strongest Chinese language support and cheapest frontier-class pricing (~$0.11/MTok input).
- Base URL (China): https://dashscope.aliyuncs.com/compatible-mode/v1
- Base URL (Global): https://dashscope-intl.aliyuncs.com/compatible-mode/v1
- Main: qwen3.5-plus (flagship, 1M context, $0.11/$0.66 per MTok) or qwen3-max (256K, strongest)
- Fast: qwen3.5-flash ($0.055/$0.22 per MTok) or qwen-turbo ($0.04/$0.08 per MTok)
- Reasoning: qwen3-max with the enable_thinking: true parameter (there is no separate qwen3-max-thinking model ID)
# .env — Qwen (China)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
LLM_MODEL=qwen3.5-plus
FAST_LLM_MODEL=qwen3.5-flash
# .env — Qwen (Global)
LLM_API_KEY=sk-...
LLM_BASE_URL=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
LLM_MODEL=qwen3.5-plus
FAST_LLM_MODEL=qwen3.5-flash
ChatGLM / 智谱
GLM-4.7 and GLM-5 (2026) are the latest models. GLM-5 is the 745B MoE flagship approaching Claude Opus-level on coding/agent tasks.
- Base URL (Domestic): https://open.bigmodel.cn/api/paas/v4
- Base URL (Z.AI International): https://api.z.ai/api/paas/v4
- Main: glm-4.7 (strong coding, $0.60/$2.20 per MTok on Z.AI)
- Fast: glm-4.7-flash (free tier!) or glm-4.7-flashx ($0.07/$0.40, higher throughput)
- Reasoning: glm-5 (745B MoE flagship, $1.00/$3.20)
Forced tool_choice is not supported — only "auto" works.
Some HTTP clients auto-append /v1 to base URLs. Zhipu uses /v4 — ensure your client does not force an OpenAI-style path suffix or you’ll get 404 errors.
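A quick sanity check on the base URL catches this class of mistake early. An illustrative guard, not part of FIM One:

```python
def check_zhipu_base_url(url: str) -> str:
    """Reject base URLs where a client forced an OpenAI-style /v1
    suffix onto Zhipu's /v4 path (which would 404)."""
    normalized = url.rstrip("/")
    if normalized.endswith("/v1"):
        raise ValueError(
            f"{url} looks like an OpenAI-style path; "
            "Zhipu expects .../api/paas/v4"
        )
    return normalized

check_zhipu_base_url("https://open.bigmodel.cn/api/paas/v4")  # OK
```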
# .env — ChatGLM (domestic)
LLM_API_KEY=...
LLM_BASE_URL=https://open.bigmodel.cn/api/paas/v4
LLM_MODEL=glm-4.7
FAST_LLM_MODEL=glm-4.7-flash
# .env — ChatGLM (Z.AI international)
LLM_API_KEY=...
LLM_BASE_URL=https://api.z.ai/api/paas/v4
LLM_MODEL=glm-4.7
FAST_LLM_MODEL=glm-4.7-flash
MiniMax
MiniMax M2.7 (released March 18, 2026) is the latest model; it is open-weight and scores 80.2% on SWE-Bench. M2.5 remains available as a fast/budget option.
MiniMax provides two separate API endpoints for different regions:
- Base URL (Global): https://api.minimax.io/v1 — for users outside mainland China
- Base URL (China): https://api.minimaxi.com/v1 — for users in mainland China (note the extra i in minimaxi)
- Main: MiniMax-M2.7
- Fast: MiniMax-M2.5
- Speed: MiniMax-M2.7-highspeed (2x cost, lower latency)
| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| MiniMax-M2.7 | $0.30 | $1.20 |
| MiniMax-M2.7-highspeed | $0.60 | $2.40 |
| MiniMax-M2.5 | $0.30 | $1.20 |
| MiniMax-M2.5-highspeed | $0.60 | $2.40 |
# .env — MiniMax (global endpoint)
LLM_API_KEY=...
LLM_BASE_URL=https://api.minimax.io/v1
LLM_MODEL=MiniMax-M2.7
FAST_LLM_MODEL=MiniMax-M2.5
# .env — MiniMax (China mainland endpoint)
LLM_API_KEY=...
LLM_BASE_URL=https://api.minimaxi.com/v1
LLM_MODEL=MiniMax-M2.7
FAST_LLM_MODEL=MiniMax-M2.5
Kimi / 月之暗面 (Moonshot)
Kimi K2.5 (January 2026) has 256K context and strong coding performance (76.8% on SWE-Bench, among the strongest open-source models).
- Base URL (Global): https://api.moonshot.ai/v1
- Base URL (China): https://api.moonshot.cn/v1
- Main: kimi-k2.5
- Fast: kimi-k2 (non-thinking, function calling works)
- Reasoning: kimi-k2-thinking ($0.47/$2.00 per MTok)
Forced tool_choice only works when thinking mode is off. When thinking is enabled, only "auto" is supported.
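This constraint can be encoded as a tiny helper that caps tool_choice at "auto" for thinking models. An illustrative sketch; the model-naming heuristic (a "-thinking" suffix) is an assumption for the example:

```python
def kimi_tool_choice(model: str, want_forced: bool) -> str:
    """Pick a tool_choice value Kimi will accept: forced tool_choice
    only works when thinking is off, so thinking model IDs are
    capped at "auto"."""
    thinking = model.endswith("-thinking")  # heuristic for this sketch
    if want_forced and not thinking:
        return "required"
    return "auto"
```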
# .env — Kimi (Global)
LLM_API_KEY=...
LLM_BASE_URL=https://api.moonshot.ai/v1
LLM_MODEL=kimi-k2.5
FAST_LLM_MODEL=kimi-k2
# .env — Kimi (China)
LLM_API_KEY=...
LLM_BASE_URL=https://api.moonshot.cn/v1
LLM_MODEL=kimi-k2.5
FAST_LLM_MODEL=kimi-k2
Local Models (Ollama)
Run models entirely on your own hardware — no API key needed, fully offline. Ollama exposes an OpenAI-compatible endpoint out of the box. The open-source landscape has changed dramatically — Qwen 3.5, Llama 4, and GPT-OSS (OpenAI’s first open-weight models) are all available.
Base URL: http://localhost:11434/v1
Recommended models by VRAM:
| VRAM | Main LLM | Fast LLM | Notes |
|---|---|---|---|
| 8 GB | qwen3.5:9b / gemma3:4b | qwen3.5:4b | Qwen 3.5 9B is the standout at this tier |
| 16 GB | gpt-oss:20b / deepseek-r1:14b | qwen3.5:9b | GPT-OSS 20B is agent-optimized |
| 24 GB | qwen3:32b / deepseek-r1:32b | qwen3.5:9b | Qwen 3 32B is best for tool-calling |
| 48 GB+ | llama3.3:70b / gpt-oss:120b | qwen3.5:14b | Near-frontier quality |
Best for tool-calling: Qwen 3/3.5 (32B+), GLM-4.7, GPT-OSS, Mistral — these have explicit function-calling training. Models with 14B+ parameters are the minimum for reliable tool calling; 32B+ is strongly preferred.
Tool-calling quality varies significantly across local models; not all of them reliably generate valid function calls. Test your chosen model with agent workflows before using it in production. The general rule: 14B minimum, 32B+ recommended for agent tasks.
# .env — Ollama (balanced, 16GB VRAM)
LLM_API_KEY=ollama
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=gpt-oss:20b
FAST_LLM_MODEL=qwen3.5:9b
LLM_CONTEXT_SIZE=32768
LLM_MAX_OUTPUT_TOKENS=8192
# .env — Ollama (agent-optimized, 24GB VRAM)
LLM_API_KEY=ollama
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=qwen3:32b
FAST_LLM_MODEL=qwen3.5:9b
LLM_CONTEXT_SIZE=32768
LLM_MAX_OUTPUT_TOKENS=8192
Relay Platforms
Many users access multiple model providers through a single relay (proxy) service. FIM One automatically detects the correct API protocol based on URL path patterns — just fill in the LLM_BASE_URL and it works.
How It Works
When your base URL points to a third-party relay, FIM One inspects the URL path to determine which protocol to use:
| URL Path Contains | Detected Protocol | Auth Header | Key Benefit |
|---|---|---|---|
| /v1 (or no match) | OpenAI compatible | Authorization: Bearer | Universal fallback, works with most relays |
| /claude or /anthropic | Anthropic native | x-api-key | Full reasoning_content (extended thinking) support |
| /gemini | Google native | x-goog-api-key | Native Gemini parameter translation |
Resolution order: Explicit DB provider field > domain match (official APIs) > URL path hint (relay platforms) > OpenAI compatible fallback.
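The path-hint stage of that resolution order can be sketched as follows. This illustrates the table above; it is not FIM One’s actual resolver, which also consults the explicit DB provider field and official domains first:

```python
from urllib.parse import urlparse

def detect_protocol(base_url: str) -> tuple[str, str]:
    """Infer (protocol, auth header) from a relay URL's path."""
    path = urlparse(base_url).path.lower()
    if "/claude" in path or "/anthropic" in path:
        return "anthropic", "x-api-key"
    if "/gemini" in path:
        return "google", "x-goog-api-key"
    # /v1 or no match: universal OpenAI-compatible fallback
    return "openai", "Authorization: Bearer"

print(detect_protocol("https://relay.example.com/anthropic"))
# ('anthropic', 'x-api-key')
```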
Example: One Relay, Three Protocols
With a single relay account, you can access different providers by simply changing the base URL path:
# .env — Claude via relay (Anthropic native protocol)
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://relay.example.com/anthropic
LLM_MODEL=claude-sonnet-4-6
# .env — Gemini via relay (Google native protocol)
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://relay.example.com/gemini
LLM_MODEL=gemini-2.5-pro
# .env — GPT via relay (OpenAI compatible protocol)
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://relay.example.com/v1
LLM_MODEL=gpt-5.4
No extra configuration needed — authentication headers, parameter formats, and response parsing all switch automatically.
Step-by-Step: How Path Detection Works
Here’s a concrete example showing what happens internally when you configure a relay:
# .env — Claude via a relay platform
LLM_API_KEY=your-relay-key
LLM_BASE_URL=https://my-relay.example.com/claude
LLM_MODEL=claude-sonnet-4-6
LLM_REASONING_EFFORT=medium
1. FIM One sees /claude in the URL path → detects Anthropic native protocol
2. The model is prefixed as anthropic/claude-sonnet-4-6 for LiteLLM routing
3. Requests use Anthropic’s /v1/messages format with the x-api-key auth header
4. reasoning_effort=medium is translated to Anthropic’s native thinking parameter (not OpenAI’s reasoning_effort)
If the same relay URL were https://my-relay.example.com/v1 instead, the /claude hint would be missing — FIM One would fall back to OpenAI-compatible protocol, sending /v1/chat/completions requests to a Claude-native endpoint, which would fail. The URL path matters.
Why This Matters
- Anthropic native gives you proper reasoning_content support (extended thinking visible in the UI), the correct tool-calling format, and x-api-key authentication — features lost when using OpenAI-compatible translation.
- Google native gives you native Gemini parameters and x-goog-api-key authentication.
- OpenAI compatible is the universal fallback and works with any relay, but provider-specific features (like extended thinking output) may be unavailable.
If your relay platform uses non-standard path conventions (e.g., no /claude or /anthropic in the URL), FIM One falls back to OpenAI compatible protocol — which works for most use cases. For full native protocol support, you can set the provider field explicitly via the admin model configuration UI.
Configuration Strategy
Main vs Fast: When to Split
- Split when your main model is expensive or slow (e.g., gpt-5.4 + gpt-5.4-nano). DAG mode runs many parallel steps — using a cheaper fast model saves significant cost.
- Use the same model for both when it is already cheap (e.g., deepseek-chat); the overhead of managing two models isn’t worth it.
When to Enable Reasoning
- Enable for complex analytical tasks, multi-step planning, and tasks requiring careful judgment
- Disable (the default) for routine tasks, simple Q&A, and cost-sensitive deployments
- Reasoning typically increases cost 2–5x per request; medium effort is a good starting point
Context Window Sizing
Set LLM_CONTEXT_SIZE to match your model’s actual window:
| Model | Context Window |
|---|---|
| GPT-5.4 | 1,050K (surcharge >272K) |
| o3 / o4-mini | 200K |
| Claude Opus 4.6 | 1M |
| Claude Sonnet 4.6 | 1M |
| Claude Haiku 4.5 | 200K |
| Gemini 2.5 Pro | 1M |
| Gemini 3.1 Pro | 1M |
| DeepSeek V3.2 | 128K |
| Qwen 3.5 Plus | 1M |
| Local (Ollama) | 4K–128K (varies) |
For local models, set both LLM_CONTEXT_SIZE and LLM_MAX_OUTPUT_TOKENS explicitly — defaults assume cloud-scale context windows that local models cannot support.
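One way to apply the table is a small lookup that falls back to a conservative local default. An illustrative helper, not part of FIM One; the window values come from the table above, and the fallback size is an assumption:

```python
CONTEXT_WINDOWS = {  # tokens, from the sizing table above (partial)
    "gpt-5.4": 1_050_000,
    "claude-sonnet-4-6": 1_000_000,
    "claude-haiku-4-5": 200_000,
    "gemini-2.5-pro": 1_000_000,
    "deepseek-chat": 128_000,
}

def context_size_env(model: str, local_default: int = 32_768) -> int:
    """Pick an LLM_CONTEXT_SIZE value for a model.

    Unknown models (e.g., local Ollama builds) get a conservative
    default rather than a cloud-scale window they cannot support.
    """
    return CONTEXT_WINDOWS.get(model, local_default)
```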