Document Intelligence
FE-50 · v5.4
Convert any file (PDF, DOCX, PPTX, XLSX, HTML, images, audio, ZIP) to Markdown via Microsoft MarkItDown, then pipe the extracted text through the full 9-layer Warden security pipeline before it touches an LLM or a corporate data store.
Processing Pipeline
PDF / DOCX / PPTX / XLSX / HTML / image / audio / ZIP
Microsoft MarkItDown → clean Markdown
SecretRedactor — 15 patterns + entropy scan
Data class: PHI / PII / FINANCIAL / CLASSIFIED / GENERAL
9-layer pipeline — Topology → Brain → Causal → Decision
ALLOW · MEDIUM · HIGH · BLOCK + STIX audit entry
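The "entropy scan" step can be illustrated with a generic Shannon-entropy check. This is a minimal sketch of the general technique, not SecretRedactor's actual implementation: the function names, the 20-character minimum, and the 4.0-bit threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Return the Shannon entropy of `token` in bits per character."""
    if not token:
        return 0.0
    counts = Counter(token)
    length = len(token)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def looks_like_secret(token: str, min_length: int = 20, threshold: float = 4.0) -> bool:
    """Flag long, high-entropy tokens.

    Both limits are hypothetical; the real scanner's thresholds are not documented here.
    """
    return len(token) >= min_length and shannon_entropy(token) > threshold
```

Random-looking API keys tend to score well above natural-language words, which is why an entropy check is commonly paired with fixed patterns to catch secrets that no pattern anticipates.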
Three Ways to Use It
Portal — No-code UI
- Visit /doc-scanner/ in the tenant portal.
- Drag and drop any file.
- Get a verdict, data class, secrets found, and extracted Markdown — instantly.
REST API
- POST /document-intel/convert-and-scan
- POST /document-intel/convert
- POST /document-intel/convert-batch
- GET /document-intel/stats
Filter Hook
- Add file_base64 + file_filename to any POST /filter request.
- The gateway converts the file to Markdown and substitutes the result for content before the pipeline runs.
- Fail-open: conversion errors fall back to original content.
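A request body for the filter hook could be assembled roughly as follows. This sketch assumes the body is JSON, that `content` is the name of the text field, and that the file is encoded with standard (not URL-safe) base64. The helper name `build_filter_payload` is hypothetical, and the code only builds the body; it does not send it.

```python
import base64
import json


def build_filter_payload(content: str, file_bytes: bytes, filename: str) -> str:
    """Serialize a POST /filter body that carries an attached file."""
    body = {
        # Original text; per the docs it is used if conversion fails (fail-open).
        "content": content,
        "file_base64": base64.b64encode(file_bytes).decode("ascii"),
        "file_filename": filename,
    }
    return json.dumps(body)
```

Sending `content` alongside the file is worthwhile under the fail-open rule, since it is what the pipeline would scan if the conversion does not succeed.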
File-Type Cache TTLs
| File Types | Cache TTL | Reason |
|---|---|---|
| PDF, DOCX, PPTX, XLSX | 24 h | Office documents rarely change |
| MP3, WAV, FLAC, M4A | 7 days | Transcription is expensive |
| JPG, PNG, WEBP, GIF | 1 h | Images update more frequently |
| All others | DOC_INTEL_CACHE_TTL | Configurable fallback |
Cache key: doc_intel:md:{sha256_of_file_bytes}. The key depends only on the file bytes, so identical files are never converted twice while the cached entry is live, regardless of filename.
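The cache key and the per-type TTLs from the table can be sketched as below. The hex encoding of the SHA-256 digest, the extension-based lookup, and the function names are assumptions; the TTL values and key prefix come from the table and key format above.

```python
import hashlib
import os

HOUR = 3600
DAY = 24 * HOUR

# TTLs in seconds, taken from the File-Type Cache TTLs table.
TTL_BY_EXTENSION = {
    **dict.fromkeys(("pdf", "docx", "pptx", "xlsx"), DAY),
    **dict.fromkeys(("mp3", "wav", "flac", "m4a"), 7 * DAY),
    **dict.fromkeys(("jpg", "png", "webp", "gif"), HOUR),
}


def cache_key(file_bytes: bytes) -> str:
    """Build the cache key from the file bytes only (hex digest is assumed)."""
    return "doc_intel:md:" + hashlib.sha256(file_bytes).hexdigest()


def cache_ttl(filename: str, fallback_ttl: int = 3600) -> int:
    """Pick the TTL by extension, falling back to DOC_INTEL_CACHE_TTL's value."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return TTL_BY_EXTENSION.get(ext, fallback_ttl)
```

Because the filename never enters the key, renaming a file does not cause a second conversion; the filename here only selects how long the entry lives.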
Configuration
| Variable | Default | Description |
|---|---|---|
| DOC_INTEL_MAX_BYTES | 52428800 | Max file size before rejection (50 MB default) |
| DOC_INTEL_TIMEOUT_S | 30 | Per-conversion thread timeout in seconds |
| DOC_INTEL_CACHE_TTL | 3600 | Fallback Redis cache TTL (overridden by file type) |
| REDIS_URL | redis://… | Cache and stats store; set to memory:// for tests |
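A service reading these variables might load them along these lines. This is a sketch under assumptions: the `DocIntelConfig` class and both function names are hypothetical, the size check assumes a file exactly at the limit is accepted, and no default is supplied for REDIS_URL because the table elides it.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DocIntelConfig:
    max_bytes: int
    timeout_s: float
    cache_ttl: int
    redis_url: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> DocIntelConfig:
    """Read the documented variables, applying the documented defaults."""
    return DocIntelConfig(
        max_bytes=int(env.get("DOC_INTEL_MAX_BYTES", "52428800")),
        timeout_s=float(env.get("DOC_INTEL_TIMEOUT_S", "30")),
        cache_ttl=int(env.get("DOC_INTEL_CACHE_TTL", "3600")),
        # The default URL is truncated in the table, so none is assumed here.
        redis_url=env.get("REDIS_URL"),
    )


def exceeds_size_gate(file_bytes: bytes, config: DocIntelConfig) -> bool:
    """Return True when the file should be rejected for size (assumed strict >)."""
    return len(file_bytes) > config.max_bytes
```

Passing a plain dict as `env`, for example `load_config({"REDIS_URL": "memory://"})`, keeps tests independent of the process environment.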
Observability
| Metric | Labels | Description |
|---|---|---|
| warden_doc_intel_convert_total | ext, data_class | Total conversions by file type and inferred data class |
| warden_doc_intel_convert_errors_total | ext, error | Conversion errors; use for SLO alerting on error rate |
| warden_doc_intel_cache_hits_total | none | Redis cache hits; use to track cache efficiency |
Stats are also available at GET /document-intel/stats, which returns total, cache_hits, errors, sensitive, and secrets_found from Redis.
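The stats fields can be turned into the two ratios the metrics are meant to track. This sketch assumes the response is a flat JSON object, that values may arrive as strings (Redis hash fields are stored as strings), and that `total` counts every conversion attempt including failures. The function name is hypothetical.

```python
def summarize_stats(stats: dict) -> dict:
    """Derive cache-hit ratio and error rate from a /document-intel/stats payload."""
    total = int(stats.get("total", 0))
    if total == 0:
        # Avoid dividing by zero before any document has been scanned.
        return {"cache_hit_ratio": 0.0, "error_rate": 0.0}
    return {
        "cache_hit_ratio": int(stats.get("cache_hits", 0)) / total,
        "error_rate": int(stats.get("errors", 0)) / total,
    }
```

If `total` turns out to exclude failed conversions, the error-rate denominator would need to be `total + errors` instead.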
Shipped Features
Filter Pipeline — file_base64 Hook
POST /filter accepts file_base64 + file_filename fields. Before the 9-layer pipeline runs, the file is converted to Markdown via MarkItDown and replaces content. Fail-open: conversion errors fall back to original content. Supports PDF, DOCX, PPTX, XLSX, HTML, images, audio, ZIP, EPUB.
MarkItDown Converter
Microsoft MarkItDown converts any office or media file to clean Markdown. File-type-aware cache TTLs: PDF/DOCX/PPTX/XLSX 24 h, audio 7 days, images 1 h. 50 MB size gate (DOC_INTEL_MAX_BYTES). 30 s thread-pool timeout (DOC_INTEL_TIMEOUT_S). Redis cache keyed by SHA-256 hash.
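A thread-pool timeout of this kind is typically built on `concurrent.futures`. The sketch below shows the pattern under assumptions: the pool size, the `ConversionTimeout` exception, and the `convert` callback are hypothetical, and the callback stands in for the real MarkItDown call.

```python
import concurrent.futures
from typing import Callable

# Shared pool; the worker count is an illustrative assumption.
_EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class ConversionTimeout(Exception):
    """Raised when a conversion does not finish within the allowed time."""


def convert_with_timeout(
    convert: Callable[[bytes], str],
    file_bytes: bytes,
    timeout_s: float = 30.0,
) -> str:
    """Run `convert` on a worker thread and stop waiting after `timeout_s` seconds."""
    future = _EXECUTOR.submit(convert, file_bytes)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only prevents a task that has not started yet;
        # a conversion that is already running continues in the background.
        future.cancel()
        raise ConversionTimeout(f"conversion exceeded {timeout_s} s") from None
```

Note that this bounds how long the caller waits, not how long the worker runs, so a bounded pool size matters when many slow files arrive at once.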
Prometheus Metrics
3 Grafana-ready counters: warden_doc_intel_convert_total{ext,data_class}, warden_doc_intel_convert_errors_total{ext,error}, warden_doc_intel_cache_hits_total. Ready for SLO alerting on conversion error rate and cache efficiency.
Document Intel API — 6 Endpoints
/document-intel: POST /convert, POST /convert-and-scan (SecretRedactor + SemanticGuard), POST /convert-batch, GET /health, GET /formats, GET /stats. Gated at Community Business+. Stats endpoint reads Redis hash (total, cache_hits, errors, sensitive, secrets_found).
SOVA Tool #50 — scan_document
SOVA agent can scan base64-encoded files through the full pipeline. Converts PDF/DOCX/PPTX to Markdown via MarkItDown, then runs SecretRedactor + SemanticGuard + HyperbolicBrain + CausalArbiter. Returns full FilterResponse: allowed, risk_level, secrets_found, semantic_flags.
SOC Dashboard — Document Scans Widget
5-metric row on the SOC overview: Total Scanned, Cache Hits, Sensitive Docs, Secrets Found, Errors. Queries GET /document-intel/stats via DocScanStats type. Widget hidden gracefully when endpoint is unreachable.
Ready to scan your first document?
No API key needed — the portal handles authentication and proxies securely.