
📄

Document Intelligence

FE-50 · v5.4

Convert any supported file (PDF, DOCX, PPTX, XLSX, HTML, images, audio, ZIP) to Markdown via Microsoft MarkItDown, then pass the extracted text through the full 9-layer Warden security pipeline before it reaches an LLM or a corporate data store.

MarkItDown · SecretRedactor · SemanticGuard · Redis Cache · Prometheus · SOVA Tool #50


Processing Pipeline

1. Upload: PDF / DOCX / PPTX / XLSX / HTML / image / audio / ZIP
2. Convert: Microsoft MarkItDown → clean Markdown
3. Redact: SecretRedactor, 15 patterns + entropy scan
4. Classify: data class is one of PHI / PII / FINANCIAL / CLASSIFIED / GENERAL
5. Filter: 9-layer pipeline, Topology → Brain → Causal → Decision
6. Verdict: ALLOW · MEDIUM · HIGH · BLOCK + STIX audit entry
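The entropy scan in the Redact step is not specified on this page. As a rough illustration only, a common way to flag secret-like tokens is Shannon entropy over the token's characters. The function names, the minimum length of 20, and the 4.0-bit threshold below are all hypothetical and are not taken from SecretRedactor.

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Bits of entropy per character, based on the token's own character frequencies."""
    if not token:
        return 0.0
    counts = Counter(token)
    n = len(token)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_like_secret(token: str, min_length: int = 20, threshold: float = 4.0) -> bool:
    """Hypothetical heuristic: long tokens with high per-character entropy are suspicious."""
    return len(token) >= min_length and shannon_entropy(token) >= threshold
```

Natural-language text repeats characters and scores low, while random keys and tokens use many distinct characters and score high. A real redactor would pair a check like this with the pattern matching mentioned in the same step.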

Three Ways to Use It

🖥️

Portal — No-code UI

Visit /doc-scanner/ in the tenant portal.
Drag and drop any file.
Get a verdict, data class, secrets found, and extracted Markdown — instantly.


🔗

REST API

POST /document-intel/convert-and-scan
POST /document-intel/convert
POST /document-intel/convert-batch
GET /document-intel/stats


⚡

Filter Hook

Add file_base64 + file_filename to any POST /filter request.
The gateway converts the file and replaces content before the pipeline.
Fail-open: conversion errors fall back to original content.
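A minimal sketch of building such a request with the Python standard library. It assumes a JSON request body, a text field named `content`, and a placeholder base URL; authentication headers are omitted because they are not described here. Only `file_base64` and `file_filename` come from the description above.

```python
import base64
import json
import urllib.request


def build_filter_request(
    base_url: str, content: str, file_bytes: bytes, filename: str
) -> urllib.request.Request:
    """Build (but do not send) a POST /filter request carrying an attached file."""
    payload = {
        # Assumed field name for the original text, used if conversion fails.
        "content": content,
        "file_base64": base64.b64encode(file_bytes).decode("ascii"),
        "file_filename": filename,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/filter",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping a meaningful `content` value matters because of the fail-open behaviour: if conversion fails, that text is what the pipeline sees.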

File-Type Cache TTLs

| File Types | Cache TTL | Reason |
| --- | --- | --- |
| PDF, DOCX, PPTX, XLSX | 24 h | Office documents rarely change |
| MP3, WAV, FLAC, M4A | 7 days | Transcription is expensive |
| JPG, PNG, WEBP, GIF | 1 h | Images update more frequently |
| All others | DOC_INTEL_CACHE_TTL | Configurable fallback |

Cache key: doc_intel:md:{sha256_of_file_bytes} — identical files are never converted twice, regardless of filename.
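The key and TTL rules can be sketched as follows. This assumes the SHA-256 digest is hex-encoded and that the file type is taken from the filename extension. The helper names are illustrative, and the 3600-second fallback stands in for `DOC_INTEL_CACHE_TTL`.

```python
import hashlib
import os

HOUR = 3600
DAY = 24 * HOUR

# TTLs in seconds, taken from the file-type table.
TTL_BY_EXT = {
    **dict.fromkeys(("pdf", "docx", "pptx", "xlsx"), 24 * HOUR),
    **dict.fromkeys(("mp3", "wav", "flac", "m4a"), 7 * DAY),
    **dict.fromkeys(("jpg", "png", "webp", "gif"), 1 * HOUR),
}


def cache_key(file_bytes: bytes) -> str:
    """Key depends only on the file's bytes, never on its name."""
    return f"doc_intel:md:{hashlib.sha256(file_bytes).hexdigest()}"


def cache_ttl(filename: str, fallback_ttl: int = 3600) -> int:
    """Pick the TTL by extension, falling back to the configurable default."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    return TTL_BY_EXT.get(ext, fallback_ttl)
```

Because the key is content-addressed, the same bytes uploaded as `report.pdf` and `copy.pdf` map to one cache entry.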

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `DOC_INTEL_MAX_BYTES` | `52428800` | Maximum file size in bytes; larger files are rejected (default 50 MB) |
| `DOC_INTEL_TIMEOUT_S` | `30` | Per-conversion thread timeout in seconds |
| `DOC_INTEL_CACHE_TTL` | `3600` | Fallback Redis cache TTL in seconds (1 h); file-type TTLs take precedence |
| `REDIS_URL` | `redis://…` | Cache and stats store; set to `memory://` for tests |
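A sketch of how a service might read these variables, assuming they arrive as environment variables and are parsed as numbers. The `DocIntelConfig` class and `load_config` helper are hypothetical. `REDIS_URL` is left out because its default is elided above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DocIntelConfig:
    max_bytes: int    # size gate, in bytes
    timeout_s: float  # per-conversion timeout, in seconds
    cache_ttl: int    # fallback cache TTL, in seconds


def load_config(env: Optional[Mapping[str, str]] = None) -> DocIntelConfig:
    """Read settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    return DocIntelConfig(
        max_bytes=int(env.get("DOC_INTEL_MAX_BYTES", "52428800")),
        timeout_s=float(env.get("DOC_INTEL_TIMEOUT_S", "30")),
        cache_ttl=int(env.get("DOC_INTEL_CACHE_TTL", "3600")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in tests.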

Observability

warden_doc_intel_convert_total {ext, data_class}

Total conversions by file type and inferred data class

warden_doc_intel_convert_errors_total {ext, error}

Conversion errors — use for SLO alerting on error rate

warden_doc_intel_cache_hits_total

Redis cache hits — use to track cache efficiency

Stats also available at GET /document-intel/stats — returns total, cache_hits, errors, sensitive, secrets_found from Redis.
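As a rough example of using the stats payload, the sketch below derives a cache hit rate and an error rate from the five fields. It assumes values may arrive as strings (as Redis hash values usually do) and that `total` counts every conversion request, including cache hits and errors. The `summarize_stats` helper is hypothetical.

```python
from typing import Mapping, Union

Number = Union[int, str]


def summarize_stats(stats: Mapping[str, Number]) -> dict:
    """Derive simple ratios from the /document-intel/stats fields."""
    total = int(stats.get("total", 0))
    cache_hits = int(stats.get("cache_hits", 0))
    errors = int(stats.get("errors", 0))
    # Guard against division by zero on a fresh deployment.
    return {
        "cache_hit_rate": cache_hits / total if total else 0.0,
        "error_rate": errors / total if total else 0.0,
        "sensitive": int(stats.get("sensitive", 0)),
        "secrets_found": int(stats.get("secrets_found", 0)),
    }
```

For alerting, the Prometheus counters are the better source because they support rates over time windows; the stats endpoint only gives running totals.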

Shipped Features

FE-50-0

Filter Pipeline — file_base64 Hook

✅ Shipped

POST /filter accepts file_base64 + file_filename fields. Before the 9-layer pipeline runs, the file is converted to Markdown via MarkItDown and replaces content. Fail-open: conversion errors fall back to original content. Supports PDF, DOCX, PPTX, XLSX, HTML, images, audio, ZIP, EPUB.
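The fail-open behaviour can be pictured with a small sketch. The `resolve_content` helper and the `convert` callable are hypothetical stand-ins for the gateway's internals, which are not shown on this page.

```python
import base64
from typing import Callable, Optional


def resolve_content(
    content: str,
    file_base64: Optional[str],
    file_filename: Optional[str],
    convert: Callable[[bytes, str], str],
) -> str:
    """Return converted Markdown if possible, otherwise the original content."""
    if not file_base64 or not file_filename:
        return content
    try:
        raw = base64.b64decode(file_base64, validate=True)
        return convert(raw, file_filename)
    except Exception:
        # Fail-open: any decoding or conversion error keeps the original text.
        return content
```

The trade-off is availability over strictness: a broken attachment never blocks the request, but it also means the file itself goes unscanned, so callers should make sure `content` is something they want filtered.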

All v5.4

FE-50-1

MarkItDown Converter

✅ Shipped

Microsoft MarkItDown converts office and media files to clean Markdown. File-type-aware cache TTLs: PDF/DOCX/PPTX/XLSX 24 h, audio 7 days, images 1 h. Files above 50 MB are rejected (DOC_INTEL_MAX_BYTES). Each conversion runs under a 30 s thread-pool timeout (DOC_INTEL_TIMEOUT_S). Results are cached in Redis, keyed by the SHA-256 hash of the file bytes.
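A minimal sketch of a size gate plus a thread-pool timeout, using only the standard library. The exception classes and the `convert` callable are hypothetical, and treating a file of exactly the limit as allowed is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


class FileTooLarge(ValueError):
    """Raised when the upload exceeds the size gate."""


class ConversionTimeout(RuntimeError):
    """Raised when conversion does not finish in time."""


def convert_with_limits(
    file_bytes: bytes,
    convert: Callable[[bytes], str],
    max_bytes: int = 52428800,
    timeout_s: float = 30.0,
) -> str:
    # Reject oversized input before spending any conversion time on it.
    if len(file_bytes) > max_bytes:
        raise FileTooLarge(f"{len(file_bytes)} bytes exceeds limit of {max_bytes}")
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(convert, file_bytes)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise ConversionTimeout(f"conversion exceeded {timeout_s} s") from None
    finally:
        # Do not block the caller; a timed-out worker thread keeps running
        # in the background until its task returns.
        executor.shutdown(wait=False)
```

Python threads cannot be killed, so a timeout here only stops the caller from waiting. A production service would more likely share one long-lived pool and cap its size so that stuck conversions cannot pile up.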

Community+ v5.4

FE-50-2

Prometheus Metrics

✅ Shipped

3 Grafana-ready counters: warden_doc_intel_convert_total{ext,data_class}, warden_doc_intel_convert_errors_total{ext,error}, warden_doc_intel_cache_hits_total. Ready for SLO alerting on conversion error rate and cache efficiency.

All v5.4

FE-50-3

Document Intel API — 6 Endpoints

✅ Shipped

Six endpoints under /document-intel: POST /convert, POST /convert-and-scan (SecretRedactor + SemanticGuard), POST /convert-batch, GET /health, GET /formats, GET /stats. Gated at Community+. The stats endpoint reads a Redis hash (total, cache_hits, errors, sensitive, secrets_found).

Community+ v5.4

FE-50-4

SOVA Tool #50 — scan_document

✅ Shipped

SOVA agent can scan base64-encoded files through the full pipeline. Converts PDF/DOCX/PPTX to Markdown via MarkItDown, then runs SecretRedactor + SemanticGuard + HyperbolicBrain + CausalArbiter. Returns full FilterResponse: allowed, risk_level, secrets_found, semantic_flags.

Pro+ v5.4

FE-50-5

SOC Dashboard — Document Scans Widget

✅ Shipped

5-metric row on the SOC overview: Total Scanned, Cache Hits, Sensitive Docs, Secrets Found, Errors. Queries GET /document-intel/stats via DocScanStats type. Widget hidden gracefully when endpoint is unreachable.

All v5.4

Ready to scan your first document?

No API key needed — the portal handles authentication and proxies securely.
