Document Processing Pipeline Documentation
The batch-workflow coordinates all document processing stages.
Orchestrates the full document processing pipeline: parse, classify, extract, second-pass, normalize, and rename.
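As a rough illustration of the stage ordering described above, the following sketch runs a document through the six stages in sequence and stops at the first failure. Only the stage names come from this documentation; `DocState`, `StageHandler`, and `runPipeline` are hypothetical names, and handlers are modeled as synchronous functions for brevity (the real services are presumably asynchronous).

```typescript
// Stage names come from the pipeline description; everything else is hypothetical.
const STAGES = ["parse", "classify", "extract", "second-pass", "normalize", "rename"] as const;
type Stage = (typeof STAGES)[number];

// Hypothetical per-document state handed from one stage to the next.
interface DocState {
  id: string;
  completed: Stage[];
  error?: { stage: Stage; message: string };
}

// A stage handler receives the current state and returns the updated state.
type StageHandler = (doc: DocState) => DocState;

function runPipeline(doc: DocState, handlers: Record<Stage, StageHandler>): DocState {
  let state: DocState = { ...doc, completed: [...doc.completed] };
  for (const stage of STAGES) {
    // Skip stages already recorded as done, so a partially processed document can resume.
    if (state.completed.includes(stage)) continue;
    try {
      state = handlers[stage](state);
      state = { ...state, completed: [...state.completed, stage] };
    } catch (err) {
      // Stop at the first failing stage and record where it happened.
      const message = err instanceof Error ? err.message : String(err);
      return { ...state, error: { stage, message } };
    }
  }
  return state;
}
```

Tracking completed stages on the document state is one possible design choice: it lets a failed batch be retried without repeating stages that already succeeded.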
Individual processing services called by the orchestrator.
PDF and document parsing via LlamaCloud API with sync, async, and queue-based processing.
Classifies documents against the Fibery artifact catalog using hybrid search, reranking, and LLM agent review.
Structured data extraction from documents using LlamaCloud Extract with auto-generated schemas.
Gap-finding second-pass extraction that discovers missing data points by comparing a naive extraction against the first-pass results.
Cross-document field normalization and conflict resolution with heuristic and LLM auto-resolution.
Renames documents in R2 storage based on classification and normalized data, with revert support.
Syncs AdviceOS artifacts and attributes from the Fibery workspace to the D1 database.
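The gap-finding comparison used by the second-pass service can be pictured with a small sketch. It assumes both extractions are available as flat key/value records and treats a field as a gap when the naive extraction has a value but the first pass left it missing or empty. That interpretation, and the names `Extraction`, `isEmpty`, and `findGaps`, are assumptions made for illustration and are not taken from the service itself.

```typescript
// Hypothetical flat representation of an extraction result.
type Extraction = Record<string, string | number | null | undefined>;

// A value counts as empty if it is null, undefined, or a blank string.
function isEmpty(value: unknown): boolean {
  return value === null || value === undefined ||
    (typeof value === "string" && value.trim() === "");
}

// Return the field names that the naive extraction found
// but the first pass left missing or empty, sorted by name.
function findGaps(naive: Extraction, firstPass: Extraction): string[] {
  return Object.keys(naive)
    .filter((key) => !isEmpty(naive[key]) && isEmpty(firstPass[key]))
    .sort();
}
```

Under these assumptions, the returned field names would be the candidates for a targeted second extraction pass.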