Project Flow Architecture
End-to-end system blueprint with tools, mechanisms, and purpose for each stage.
Core Engine
OCR + LLM structuring
Data Layer
PostgreSQL system of record
RAG Stage
Planned next phase
Training Mode
Not training custom LLM yet
Live deployment status (March 5, 2026)
Backend is live on Cloud Run at ml-docintel-be-dev-1007650653347.us-central1.run.app. Frontend Cloud Run deployment and OAuth callback cutover are the current in-progress tasks.
3 Core Features (Project USP)
1) Document Reader / Extractor
File ingestion, OCR/text parsing, LLM structured extraction, and normalized persistence.
2) RAG Pipeline
Chunking, embeddings, vector indexing, retriever logic, and grounded answer generation with citations.
3) ML Training over Data
Evaluation metrics, human correction loop, training dataset curation, and controlled fine-tuning rollout.
USP Focus: RAG + ML Intelligence
This architecture intentionally separates core extraction from intelligence layers. Extraction gives structured data; RAG and ML quality loops create the differentiator for decision support and domain-specific insights.
RAG USP
Context retrieval + citations + tenant-safe filtering for trustworthy answers.
ML USP
Measured quality improvement through benchmarking, corrections, and retraining readiness.
High-Level Flow Diagram (Architecture)
Stage-Wise Tools and Mechanisms Matrix
Active Architecture Steps
Step 1
User Upload
Tool/Mechanism: Next.js UI + multipart upload
Purpose: Capture tenant documents in mixed formats (PDF, DOCX, DOC, images).
Step 2
API Ingestion
Tool/Mechanism: FastAPI upload endpoint + auth/tenant checks
Purpose: Validate type/size and start async extraction pipeline.
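The type/size gate the upload endpoint runs before dispatching the async pipeline can be sketched as plain Python. The extension list and size cap here are illustrative, not the project's actual values.

```python
# Hypothetical validation the FastAPI upload endpoint would run before
# kicking off the async extraction pipeline.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".doc", ".png", ".jpg", ".jpeg"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # illustrative 25 MB cap

def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload should be rejected."""
    suffix = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or filename}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds size limit")
```

In FastAPI this check would sit at the top of the upload route, after the auth/tenant checks and before the file is persisted.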
Step 3
Storage + Metadata
Tool/Mechanism: Local/GCP file storage + PostgreSQL documents table
Purpose: Persist source file and link with tenant/user context.
Step 4
Text Acquisition Layer
Tool/Mechanism: pdfplumber + pytesseract + python-docx + antiword + Pillow
Purpose: Convert unstructured files into normalized raw text.
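A minimal sketch of the routing logic for this layer, assuming dispatch by file extension. In the real pipeline each value would be a function wrapping the named library; here the table just names the tool each suffix maps to.

```python
from pathlib import Path

# Illustrative routing table for the text-acquisition layer:
# pdfplumber for digital PDFs, pytesseract OCR for images (and as a
# fallback for scanned PDFs), python-docx for .docx, antiword for .doc.
EXTRACTORS = {
    ".pdf": "pdfplumber",
    ".docx": "python-docx",
    ".doc": "antiword",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
    ".jpeg": "pytesseract",
}

def pick_extractor(path: str) -> str:
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor registered for {suffix!r}")
```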
Step 5
LLM Structuring
Tool/Mechanism: Claude prompt-driven JSON extraction with model fallback
Purpose: Generate deterministic structured output from raw text.
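The fallback loop around prompt-driven JSON extraction might look like the sketch below. The model IDs are placeholders and `call_model` is injected so the sketch stays runnable without an API client; only a reply that parses as JSON counts as success.

```python
import json

# Hypothetical model-fallback loop around LLM JSON extraction.
MODEL_CHAIN = ["primary-model", "fallback-model"]  # placeholder IDs

def structure_text(raw_text, call_model, models=MODEL_CHAIN):
    last_error = None
    for model in models:
        try:
            reply = call_model(model, raw_text)  # returns the model's text reply
            return json.loads(reply)             # must be valid JSON to succeed
        except (RuntimeError, json.JSONDecodeError) as exc:
            last_error = exc                     # try the next model in the chain
    raise RuntimeError(f"all models failed: {last_error}")
```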
Step 6
Validation + Normalization
Tool/Mechanism: Rule-based checks in Python services
Purpose: Apply reference range checks, severity flags, and data cleanup.
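A minimal version of one such rule-based check: compare a numeric result to a reference range and emit a severity flag. The "critical" threshold (one full range-width beyond the bound) is an illustrative rule, not the project's actual policy.

```python
# Sketch of a reference-range check with severity flags.
def severity_flag(value, low, high):
    """Return 'normal', 'low', 'high', or 'critical' for a result value."""
    if low <= value <= high:
        return "normal"
    span = high - low  # illustrative: one range-width past a bound is critical
    if value < low:
        return "critical" if value < low - span else "low"
    return "critical" if value > high + span else "high"
```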
Step 7
Persistence + UI
Tool/Mechanism: PostgreSQL structured tables + Dashboard/Extracted Data pages
Purpose: Store, review, and operationalize extracted data.
RAG Extension Steps
Step 8
RAG Ingestion Worker
Tool/Mechanism: Async worker pipeline (LangChain runnable compatible)
Purpose: Process extracted text into retrieval-ready chunks.
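The shape of that worker, reduced to a queue and a consumer coroutine. The real pipeline would call the chunking and embedding stages where the placeholder status is appended.

```python
import asyncio

# Toy asyncio worker: documents enter a queue, a consumer drains it.
async def worker(queue: asyncio.Queue, out: list) -> None:
    while True:
        doc = await queue.get()
        if doc is None:            # sentinel: shut the worker down
            queue.task_done()
            return
        # Real pipeline: chunk + embed + index here.
        out.append({"doc_id": doc["doc_id"], "status": "chunked"})
        queue.task_done()

async def run_pipeline(docs):
    queue, out = asyncio.Queue(), []
    task = asyncio.create_task(worker(queue, out))
    for doc in docs:
        await queue.put(doc)
    await queue.put(None)
    await task
    return out
```

Keeping the worker decoupled behind a queue is what makes the stage LangChain-runnable compatible: the body can be swapped for a composed runnable without touching ingestion.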
Step 9
Chunking + Metadata
Tool/Mechanism: Recursive or semantic splitter + tenant/doc tags
Purpose: Create context chunks with strict metadata filters.
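A fixed-size sliding-window sketch of this step, with tenant/doc metadata attached to every chunk so the retriever can filter strictly by tenant. A recursive or semantic splitter would replace the slicing logic, not the metadata.

```python
# Sliding-window chunker with tenant/doc tags (sizes are illustrative).
def chunk_text(text, tenant_id, doc_id, size=500, overlap=50):
    step = size - overlap
    chunks = []
    for i in range(0, len(text), step):
        piece = text[i:i + size]
        if not piece.strip():
            continue
        chunks.append({
            "text": piece,
            "tenant_id": tenant_id,   # strict filter key at query time
            "doc_id": doc_id,
            "chunk_index": len(chunks),
        })
    return chunks
```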
Step 10
Embeddings
Tool/Mechanism: Embedding model API/local model
Purpose: Transform chunks and queries into vector representations.
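The contract of this step (text in, fixed-length vector out) can be shown with a deterministic hashed bag-of-words stand-in. This is only an illustration of the interface; retrieval quality comes from a real embedding model, not this toy.

```python
import hashlib

# Toy stand-in for an embedding model: deterministic hashed bag-of-words,
# L2-normalized. Illustrates the interface only.
def embed(text: str, dims: int = 64) -> list:
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec
```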
Step 11
Vector Index
Tool/Mechanism: PostgreSQL + pgvector (first choice)
Purpose: Enable semantic retrieval with low ops complexity.
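The pgvector setup might look like the SQL below (table and column names are hypothetical). `<=>` is pgvector's cosine-distance operator; keeping chunks, metadata, and the index in the same PostgreSQL instance is what makes this the low-ops first choice.

```python
# Illustrative pgvector DDL and query, held as strings for reference.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS rag_chunks (
    id BIGSERIAL PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    doc_id BIGINT NOT NULL,
    chunk_text TEXT NOT NULL,
    embedding vector(768)  -- must match the embedding model's dimension
);
CREATE INDEX IF NOT EXISTS rag_chunks_embedding_idx
    ON rag_chunks USING hnsw (embedding vector_cosine_ops);
"""

QUERY = """
SELECT chunk_text, embedding <=> %(query_vec)s AS distance
FROM rag_chunks
WHERE tenant_id = %(tenant_id)s      -- tenant-safe filtering
ORDER BY embedding <=> %(query_vec)s
LIMIT %(k)s;
"""
```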
Step 12
Retriever + Reranker
Tool/Mechanism: Top-k + metadata filters + optional rerank
Purpose: Fetch the most relevant grounded context.
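In-memory sketch of the retriever logic: apply the metadata filter first, then rank by cosine similarity and keep the top k. An optional reranker would reorder these k candidates; pgvector performs the same ranking in SQL.

```python
# Top-k retrieval with a strict tenant filter (pure-Python illustration).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, chunks, tenant_id, k=3):
    candidates = [c for c in chunks if c["tenant_id"] == tenant_id]
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c["embedding"]),
                    reverse=True)
    return ranked[:k]
```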
Step 13
Grounded Response
Tool/Mechanism: LLM answer chain with citation constraints
Purpose: Return explainable answers linked to source chunks.
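The citation constraint can be sketched as a two-part contract: number the retrieved chunks inside the prompt, then verify the model's reply actually cites valid chunk numbers. The prompt wording is illustrative.

```python
import re

# Sketch of grounded answering: numbered sources in, cited answer out.
def build_prompt(question, chunks):
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the sources below and cite them as [n].\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

def cited_sources(answer, n_chunks):
    """Return the set of valid chunk numbers the answer cites."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {i for i in cited if 1 <= i <= n_chunks}
```

`cited_sources` also feeds the citation-coverage KPI: answers citing no valid source can be flagged or regenerated.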
Vector DB Decision
Recommended start
Use PostgreSQL + pgvector with your current DB for lower operational complexity.
When to move to dedicated vector DB
Consider migration only when scale, latency, or concurrency exceeds pgvector's comfort limits.
Best practice
Keep PostgreSQL as the system of record even if you add a specialized vector engine later.
LangChain Mapping
Prompt + Output Parser
Use structured prompts + Pydantic parsers for deterministic schema output.
Doc loaders + splitters + retrievers
Wrap current parsers into LangChain docs, then apply chunking and retriever policies.
Runnable chains + tools
Compose ingestion and QA pipelines using runnables with traceable, testable chain steps.
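The value of the prompt + output-parser pattern can be shown with stdlib only: the model's reply must parse into a fixed schema or the chain fails loudly. LangChain's Pydantic output parser plays this role in the real mapping; the field names below are hypothetical.

```python
import json
from dataclasses import dataclass

# Stdlib stand-in for a Pydantic output parser: reply -> typed schema.
@dataclass
class LabResult:
    test_name: str
    value: float
    unit: str

def parse_reply(reply: str) -> LabResult:
    data = json.loads(reply)  # raises if the model did not return JSON
    return LabResult(
        test_name=str(data["test_name"]),  # raises KeyError on missing fields
        value=float(data["value"]),
        unit=str(data["unit"]),
    )
```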
Flow architecture aligned with the current implementation and the RAG expansion path.
KPI Targets for USP Validation
RAG Citation Coverage
Target a high percentage of responses with source-backed citations.
Retrieval Recall@k
Measure whether the correct supporting chunks appear consistently in the top-k results.
Answer Groundedness
Track unsupported statements and drive them toward near-zero.
Extraction Precision/Recall
Track field-level extraction reliability by tenant and document domain.
Review Burden
Reduce the human correction rate across releases.
Latency + Cost
Track end-to-end query time and per-document/per-query cost.
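The Recall@k target above reduces to a small computation over a labeled evaluation set: for each query, what fraction of its gold supporting chunks appears in the retrieved top-k list, averaged across queries.

```python
# Recall@k over an evaluation set of (retrieved_ids, gold_ids) pairs.
def recall_at_k(retrieved_ids, gold_ids):
    """retrieved_ids: top-k list; gold_ids: set of relevant chunk ids."""
    if not gold_ids:
        return 1.0  # no relevant chunks means nothing was missed
    return len(set(retrieved_ids) & set(gold_ids)) / len(set(gold_ids))

def mean_recall_at_k(eval_set):
    scores = [recall_at_k(r, g) for r, g in eval_set]
    return sum(scores) / len(scores) if scores else 0.0
```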