Vision mode is automatic by default. If the model you’ve configured has vision support enabled, uploaded documents will use the richest processing pipeline available for that format.
Format Support Matrix
Each document format has a dedicated processing pipeline. The behavior changes depending on whether the model supports vision.

| Format | Text Extraction | Vision Mode (supports_vision=ON) | Fallback (supports_vision=OFF) |
|---|---|---|---|
| PDF | pdfplumber (page-by-page text) | Smart processing: text-rich pages extract text + embedded images (token-efficient); scanned/image-only pages render as full-page PNG via PyMuPDF | Text extraction only; agent reads via read_uploaded_file tool |
| DOCX / DOC | markitdown (Markdown conversion) | Embedded images extracted with [Figure N] positional markers via python-docx | Text extraction only; images lost |
| PPTX / PPT | markitdown (text from each slide) | Embedded images extracted with [Figure N] markers and slide separators via python-pptx | Text extraction only; slide visuals lost |
| XLSX / XLS | markitdown (table structure preserved) | Same as text mode (tables do not benefit from vision) | Same |
| Images (JPG, PNG, GIF, WebP) | N/A | Sent as image_url vision content blocks | Annotated as [Attached image: filename] — model is aware but cannot see the content |
| Text files (TXT, MD, PY, JS, HTML, CSV, JSON) | Direct read / parse | N/A (text is text) | Same |
How It Works
When a user uploads a document to a conversation, FIM One runs a processing pipeline based on the file type and model capabilities:

File Type Detection
The system identifies the document format by file extension and MIME type, then selects the appropriate extraction pipeline. Each uploaded file is tagged with a UUID file_id, which is injected into the message context so the agent can access it directly via the read_uploaded_file tool.

Text Extraction
Regardless of vision support, the system always extracts text content. PDF uses pdfplumber for page-by-page text. Office formats use markitdown for Markdown conversion. Text files are read directly.
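The extension-based routing described above can be sketched as a simple lookup; the extractor names and the exact extension map are illustrative assumptions, not FIM One's actual identifiers:

```python
from pathlib import Path

# Illustrative extension -> extractor map; names are assumptions for this
# sketch, not FIM One's actual internal identifiers.
EXTRACTORS = {
    ".pdf": "pdfplumber",
    ".docx": "markitdown", ".doc": "markitdown",
    ".pptx": "markitdown", ".ppt": "markitdown",
    ".xlsx": "markitdown", ".xls": "markitdown",
    ".txt": "direct", ".md": "direct", ".csv": "direct", ".json": "direct",
}

def select_pipeline(filename: str) -> str:
    """Pick an extraction pipeline from the file extension (case-insensitive)."""
    return EXTRACTORS.get(Path(filename).suffix.lower(), "unsupported")
```

In the real system this lookup is cross-checked against the MIME type, so a mislabeled file does not silently pick the wrong extractor.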
Vision Processing (if supported)
When the model has supports_vision=true and the document is a supported type:

- PDF (smart processing): Each page is analyzed. Text-rich pages extract text plus any embedded images separately (saving tokens), while scanned or image-only pages render as full-page PNG at the configured DPI for maximum fidelity
- DOCX / PPTX: Embedded images are extracted from the document XML with [Figure N] positional markers
- Images: Passed through directly as vision content blocks
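The per-page decision for PDFs can be sketched as a text-density check; the 50-character threshold here is an illustrative assumption, and the actual PNG rendering (done via PyMuPDF in the real pipeline) is omitted:

```python
def classify_pdf_page(extracted_text: str, min_chars: int = 50) -> str:
    """Decide how a PDF page is processed: pages with enough extractable text
    keep that text (plus any embedded images); near-empty extractions are
    treated as scanned pages and rendered to a full-page PNG.
    The min_chars threshold is an assumption for this sketch."""
    return "extract_text" if len(extracted_text.strip()) >= min_chars else "render_png"
```

This is why a photographed contract ends up as images while a born-digital report stays mostly text: the classification is per page, not per document.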
Content Assembly
The extracted text and vision content are assembled into the message format expected by the model. Text and images are interleaved so the model can correlate visual and textual information.
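The interleaving can be pictured as an OpenAI-style multimodal message; the exact field names FIM One emits are an assumption modeled on that common format:

```python
import base64

def assemble_message(page_chunks):
    """Interleave text and image chunks into one multimodal user message.
    page_chunks is a list of ("text", str) or ("image", bytes) tuples,
    in document order, so the model sees images next to the related text."""
    content = []
    for kind, payload in page_chunks:
        if kind == "text":
            content.append({"type": "text", "text": payload})
        else:
            b64 = base64.b64encode(payload).decode()
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            })
    return {"role": "user", "content": content}
```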
Vision Mode Configuration
There are three ways to control how documents are processed, listed from most specific to most general.

1. Per-Model Toggle
Navigate to Admin > Models > Edit > Advanced and toggle the Vision Support checkbox. This is the primary control: it tells the system whether a specific model can accept image content.

2. Global Environment Variable
Set DOCUMENT_PROCESSING_MODE in your environment to override the default behavior system-wide:
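For example, to force text-only processing system-wide (a .env-style fragment; valid values are listed in the Environment Variables table):

```
# Force text-only processing regardless of model vision support
DOCUMENT_PROCESSING_MODE=text
```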
3. Per-Request Parameter
Pass the doc_mode parameter in the chat API to control processing for a single request:
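A hedged example of the per-request override; only doc_mode is documented here, so the surrounding payload shape and field names are assumptions modeled on typical chat APIs:

```python
import json

# Hypothetical chat request body; doc_mode is the documented parameter,
# the other fields are assumed for illustration.
payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Summarize the attached PDF."}],
    "doc_mode": "vision",  # "auto" (default) | "vision" | "text"
}
body = json.dumps(payload)
```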
The auto mode (default) uses vision when the model has supports_vision=true and the document is a type that benefits from vision processing. This is the recommended setting for most deployments.

Environment Variables
| Variable | Default | Description |
|---|---|---|
| DOCUMENT_PROCESSING_MODE | auto | Processing strategy: auto (use vision when available), vision (always render), text (never render) |
| DOCUMENT_VISION_DPI | 150 | PDF rendering resolution in dots per inch. Higher values produce better quality but consume more tokens |
| DOCUMENT_VISION_MAX_PAGES | 20 | Maximum number of PDF pages to render as images. Pages beyond this limit fall back to text extraction |
Token Cost Considerations
Vision content consumes significantly more tokens than plain text. Understanding the cost tradeoffs helps you configure the system appropriately. Rough estimates:

| Scenario | Approximate Token Cost |
|---|---|
| One PDF page at 150 DPI | 1,000 — 2,000 tokens |
| 10-page PDF in vision mode | 10,000 — 20,000 tokens |
| Same 10-page PDF as text only | 2,000 — 5,000 tokens |
| One embedded DOCX image | 500 — 1,500 tokens |
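Using the mid-range figures above, a quick back-of-envelope comparison for a 10-page PDF (estimates only, derived from the table):

```python
pages = 10
vision_per_page = (1000 + 2000) // 2       # midpoint of per-page vision range
text_total_range = (2000, 5000)            # whole-document text-only range

vision_total = pages * vision_per_page     # 15000 tokens
text_midpoint = sum(text_total_range) // 2 # 3500 tokens
# Vision mode costs roughly 4x the text-only extraction for this document.
```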
Cost Optimization Tips
- Set DOCUMENT_VISION_MAX_PAGES to a reasonable limit (e.g., 10) for general use
- Lower DOCUMENT_VISION_DPI from 150 to 100 if image quality is acceptable; this reduces token consumption by roughly 40%
- Use text mode for documents where layout does not matter (e.g., plain-text reports, spreadsheets)
- Use vision mode selectively for documents where visual layout is critical (e.g., invoices, forms, diagrams)
Design Decisions and Limitations
Why Not LibreOffice for Full-Page Rendering?
LibreOffice can produce pixel-perfect page renders of DOCX and PPTX files, but it adds approximately 4 GB to the Docker image. Instead, FIM One extracts embedded images directly from the document XML using python-docx and python-pptx; both are already transitive dependencies of markitdown, so this adds zero additional installation overhead. The tradeoff: we get the actual embedded images at full quality but lose page layout context. The [Figure N] positional markers help the LLM correlate text and images, but the spatial relationship is approximate rather than exact.
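The underlying idea can be illustrated with the standard library alone: a DOCX file is a ZIP archive whose embedded images live under word/media/. FIM One uses python-docx for this; the sketch below only demonstrates the mechanism, and the positional [Figure N] marker placement is omitted:

```python
import io
import zipfile

def extract_docx_images(docx_bytes: bytes):
    """Return (name, data) pairs for images embedded in a DOCX.
    DOCX is a ZIP package; embedded media sits under word/media/.
    In the real pipeline each image also gets a [Figure N] marker
    at its approximate position in the extracted text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return [(n, zf.read(n)) for n in sorted(zf.namelist())
                if n.startswith("word/media/")]
```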
What Gets Lost Without LibreOffice?
| Lost Element | Impact |
|---|---|
| Text formatting (bold, italic, font sizes) | LLM receives plain text only |
| Image-text spatial positioning | [Figure N] markers approximate but do not show exact placement |
| Charts generated by Office (not embedded as images) | XML-defined charts are not extracted |
| Page headers and footers in DOCX | Partially preserved by markitdown |
PDF Vision vs. DOCX/PPTX Vision
The quality of vision processing varies by format:

- PDF: Smart page-by-page processing. Text-rich pages extract text content plus any embedded images separately, which is significantly more token-efficient. Scanned or image-only pages (e.g., photographed documents, scanned contracts) render as full-page PNG images for maximum visual fidelity. This adaptive approach balances quality and token cost automatically.
- DOCX / PPTX: Text content plus extracted embedded images. Good for most business documents, but page layout and formatting are not preserved.
Automatic Fallback
If a model is configured with vision support but actually rejects image content at runtime, the system automatically retries the request without document images. User-uploaded images (e.g., screenshots attached by the user) are preserved in the retry; only document-derived images are stripped. This fallback mechanism prevents task failures caused by misconfigured vision settings. If you see a model consistently falling back, check its vision support configuration in Admin > Models.
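The retry can be sketched as a filter over the message's content blocks; the origin field that distinguishes document-derived from user-attached images is a hypothetical tag used here for illustration:

```python
def strip_document_images(content_blocks):
    """Drop document-derived images but keep user-attached ones, mirroring
    the fallback described above. The 'origin' field is a hypothetical tag,
    not a documented FIM One attribute."""
    return [
        b for b in content_blocks
        if not (b.get("type") == "image_url" and b.get("origin") == "document")
    ]
```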
Safety Guardrails
File Integrity Protection
When the agent cannot read a file (e.g., an image-based PDF without vision enabled), a system-level guardrail prevents the agent from substituting content from other files. Without this protection, the agent could read a different accessible file and present its content as if it came from the target document. The guardrail ensures that when a file is unreadable, the agent reports the limitation honestly rather than fabricating an answer from unrelated sources.

Descriptive Error Guidance
When a file cannot be read by the read_uploaded_file tool, the error message includes:
- The detected file type and why it could not be processed
- A suggestion to enable vision on the model if the file is image-based
- Alternative approaches the user can try (e.g., exporting to a different format)
Best Practices
For Administrators
- Enable vision selectively. Only enable supports_vision on models that genuinely support image input. Misconfiguration wastes tokens on the fallback retry cycle.
- Start with auto mode. The default behavior is correct for most deployments: vision is used when beneficial and available.
- Monitor token usage. After enabling vision, watch the token consumption dashboard. If costs spike unexpectedly, adjust DOCUMENT_VISION_MAX_PAGES or DOCUMENT_VISION_DPI.
- Use the pre-built sandbox image. The Dockerfile.sandbox includes common data-science packages (pdfplumber, Pillow, pandas, etc.) needed for AI code execution against documents. Build it via docker compose or manually with docker build -f Dockerfile.sandbox -t fim-sandbox . to ensure code execution works in --network=none containers.
For End Users
- PDF gives the best results. When visual fidelity matters, export your Office documents to PDF before uploading.
- Spreadsheets are fine as-is. XLSX files are extracted as structured tables — vision adds no benefit.
- Large PDFs may be truncated. If your document exceeds the DOCUMENT_VISION_MAX_PAGES limit, only the first N pages will be rendered as images. The remaining pages are still available as extracted text.
- Image quality matters. For standalone image uploads, use high-resolution originals when possible. Compressed or low-resolution images reduce the model's ability to extract details.