AI Doc Engine POC
Multi-Tenant Extraction

Project Flow Architecture

End-to-end system blueprint with tools, mechanisms, and purpose for each stage.

Core Engine

OCR + LLM structuring

Data Layer

PostgreSQL system of record

RAG Stage

Planned next phase

Training Mode

Not training a custom LLM yet

Live deployment status (March 5, 2026)

Backend is live on Cloud Run at ml-docintel-be-dev-1007650653347.us-central1.run.app. Frontend Cloud Run deployment and OAuth callback cutover are the current in-progress tasks.

3 Core Features (Project USP)

1) Document Reader / Extractor

File ingestion, OCR/text parsing, LLM structured extraction, and normalized persistence.

2) RAG Pipeline

Chunking, embeddings, vector indexing, retriever logic, and grounded answer generation with citations.

3) ML Training over Data

Evaluation metrics, human correction loop, training dataset curation, and controlled fine-tuning rollout.

USP Focus: RAG + ML Intelligence

This architecture intentionally separates core extraction from intelligence layers. Extraction gives structured data; RAG and ML quality loops create the differentiator for decision support and domain-specific insights.

RAG USP

Context retrieval + citations + tenant-safe filtering for trustworthy answers.

ML USP

Measured quality improvement through benchmarking, corrections, and retraining readiness.

High-Level Flow Diagram (Architecture)

AI Document Platform Architecture (Current + RAG + ML Training)

Stage-Wise Tools and Mechanisms Matrix

Project flow stages with tools, technology, and mechanism per stage

Active Architecture Steps

Step 1

User Upload

Active

Tool/Mechanism: Next.js UI + multipart upload

Purpose: Capture tenant documents in mixed formats (PDF, DOCX, DOC, images).

Step 2

API Ingestion

Active

Tool/Mechanism: FastAPI upload endpoint + auth/tenant checks

Purpose: Validate type/size and start async extraction pipeline.

Step 3

Storage + Metadata

Active

Tool/Mechanism: Local/GCP file storage + PostgreSQL documents table

Purpose: Persist source file and link with tenant/user context.

Step 4

Text Acquisition Layer

Active

Tool/Mechanism: pdfplumber + pytesseract + python-docx + antiword + Pillow

Purpose: Convert unstructured files into normalized raw text.
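
One way to structure Step 4 is an extension-to-parser dispatch followed by whitespace normalization, so downstream prompts see consistent text regardless of source parser. The mapping below is illustrative; the real pipeline would invoke pdfplumber, pytesseract, python-docx, and antiword at these points.

```python
# Sketch of the dispatch-and-normalize step; parser assignments are illustrative.
import re

EXTENSION_PARSERS = {
    ".pdf": "pdfplumber",    # text-layer PDFs; OCR fallback via pytesseract
    ".docx": "python-docx",
    ".doc": "antiword",
    ".png": "pytesseract",   # image formats go straight to OCR
    ".jpg": "pytesseract",
}

def pick_parser(filename: str) -> str:
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    return EXTENSION_PARSERS.get(ext, "unsupported")

def normalize_raw_text(text: str) -> str:
    """Collapse space/tab runs and strip blank edges so every parser's
    output reaches the LLM in the same shape."""
    return re.sub(r"[ \t]+", " ", text).strip()
```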

Step 5

LLM Structuring

Active

Tool/Mechanism: Claude prompt-driven JSON extraction with model fallback

Purpose: Generate deterministic structured output from raw text.
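
The model-fallback mechanism in Step 5 can be sketched as a loop over an ordered model list, accepting the first response that parses as JSON. The model IDs and the `call_model` callable here are placeholders, not the real Claude client.

```python
# Minimal sketch of prompt-driven extraction with model fallback.
# `call_model` and the model IDs are placeholders for the real client.
import json

def extract_with_fallback(text, call_model, models=("primary-model", "fallback-model")):
    """Try each model in order; return the first response that parses as JSON."""
    last_error = None
    for model in models:
        try:
            raw = call_model(model, text)
            return json.loads(raw)  # structured output or move to the next model
        except (json.JSONDecodeError, RuntimeError) as exc:
            last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")
```

Treating a non-JSON response the same as an API error keeps the output deterministic: the caller either gets schema-shaped data or a single, explicit failure.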

Step 6

Validation + Normalization

Active

Tool/Mechanism: Rule-based checks in Python services

Purpose: Apply reference range checks, severity flags, and data cleanup.
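
A reference-range check with severity flags, as described in Step 6, might look like this. The critical-margin heuristic and any ranges used with it are invented for the example, not clinical values.

```python
# Illustrative rule check in the spirit of Step 6; the critical margin
# is an invented heuristic, not a clinical threshold.
def flag_value(name, value, low, high, critical_margin=0.25):
    """Return 'normal', 'low'/'high', or a 'critical_*' flag when the value
    falls outside the reference range by more than critical_margin of its span."""
    span = high - low
    if value < low:
        return "critical_low" if value < low - critical_margin * span else "low"
    if value > high:
        return "critical_high" if value > high + critical_margin * span else "high"
    return "normal"
```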

Step 7

Persistence + UI

Active

Tool/Mechanism: PostgreSQL structured tables + Dashboard/Extracted Data pages

Purpose: Store, review, and operationalize extracted data.

RAG Extension Steps

Step 8

RAG Ingestion Worker

Planned

Tool/Mechanism: Async worker pipeline (LangChain runnable compatible)

Purpose: Process extracted text into retrieval-ready chunks.

Step 9

Chunking + Metadata

Planned

Tool/Mechanism: Recursive or semantic splitter + tenant/doc tags

Purpose: Create context chunks with strict metadata filters.
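
Step 9's chunk-plus-metadata shape can be sketched as below. Sizes are illustrative, and a production splitter (e.g. LangChain's RecursiveCharacterTextSplitter) would split on separators rather than raw character offsets; the point here is that every chunk carries tenant/doc tags for strict filtering.

```python
# Sketch of overlap chunking with tenant/document tags on every chunk.
# Chunk sizes are illustrative; a real splitter respects text boundaries.
def chunk_with_metadata(text, tenant_id, doc_id, size=400, overlap=50):
    chunks = []
    step = size - overlap
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        body = text[start:start + size]
        if not body:
            break
        chunks.append({
            "chunk_id": f"{doc_id}:{i}",
            "tenant_id": tenant_id,  # strict filter key for tenant isolation
            "doc_id": doc_id,
            "text": body,
        })
    return chunks
```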

Step 10

Embeddings

Planned

Tool/Mechanism: Embedding model API/local model

Purpose: Transform chunks and queries into vector representations.

Step 11

Vector Index

Planned

Tool/Mechanism: PostgreSQL + pgvector (first choice)

Purpose: Enable semantic retrieval with low ops complexity.

Step 12

Retriever + Reranker

Planned

Tool/Mechanism: Top-k + metadata filters + optional rerank

Purpose: Fetch the most relevant grounded context.
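
An in-memory stand-in for the planned retriever: apply the tenant filter first, then score by cosine similarity and take top-k. With pgvector this maps to a `WHERE tenant_id = ...` clause plus `ORDER BY embedding <=> :query LIMIT :k` (`<=>` is pgvector's cosine-distance operator); the Python version below just makes the logic explicit.

```python
# Top-k retrieval with a hard tenant filter applied before scoring.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, chunks, tenant_id, k=3):
    """chunks: dicts with at least 'embedding' and 'tenant_id' keys."""
    candidates = [c for c in chunks if c["tenant_id"] == tenant_id]  # filter first
    candidates.sort(key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return candidates[:k]
```

Filtering before scoring is what makes the retrieval tenant-safe: another tenant's chunks can never appear in the candidate set, regardless of similarity.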

Step 13

Grounded Response

Planned

Tool/Mechanism: LLM answer chain with citation constraints

Purpose: Return explainable answers linked to source chunks.
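
The citation constraint in Step 13 is mostly prompt assembly: number each retrieved chunk and instruct the model to cite by number. The wording below is illustrative, not the production prompt.

```python
# Sketch of citation-constrained prompt assembly; wording is illustrative.
def build_grounded_prompt(question, chunks):
    """Number each chunk and require the model to cite claims as [n]."""
    context = "\n".join(
        f"[{i + 1}] ({c['doc_id']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer using ONLY the sources below. Cite every claim as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```

Because each `[n]` maps back to a stored chunk ID, the UI can render every citation as a link to the exact source passage.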

Vector DB Decision

Recommended start

Use PostgreSQL + pgvector with your current DB for lower operational complexity.

When to move to dedicated vector DB

Consider migrating only when scale, latency, or concurrency exceed what pgvector handles comfortably.

Best practice

Keep PostgreSQL as system-of-record even if you add a specialized vector engine later.

LangChain Mapping (current flow to LangChain components)

Prompt + Output Parser

Use structured prompts + Pydantic parsers for deterministic schema output.
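
In LangChain this pattern is a prompt template plus a PydanticOutputParser bound to a Pydantic model; the stdlib sketch below shows the same idea with a dataclass, rejecting any extra or missing keys so the schema stays deterministic. Field names are illustrative.

```python
# Stdlib stand-in for the prompt + output-parser pattern; in LangChain this
# would be PydanticOutputParser over a Pydantic model. Field names illustrative.
import json
from dataclasses import dataclass, fields

@dataclass
class ExtractedField:
    name: str
    value: str
    unit: str

def parse_structured(raw: str) -> list[ExtractedField]:
    """Parse the model's JSON list, rejecting extra or missing keys."""
    expected = {f.name for f in fields(ExtractedField)}
    out = []
    for item in json.loads(raw):
        if set(item) != expected:
            raise ValueError(f"schema mismatch: {sorted(item)}")
        out.append(ExtractedField(**item))
    return out
```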

Doc loaders + splitters + retrievers

Wrap current parser output into LangChain Document objects, then apply chunking and retriever policies.

Runnable chains + tools

Compose ingestion and QA pipelines using runnables with traceable, testable chain steps.
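
LangChain composes runnables with the `|` operator (`prompt | model | parser`); the hand-rolled analogue below shows why that shape keeps chain steps traceable and testable: each stage is a plain callable you can unit-test alone. The stages here are toy placeholders.

```python
# Hand-rolled analogue of LangChain's runnable composition (a | b | c).
# Each stage is an ordinary callable, so steps stay individually testable.
from functools import reduce

def pipeline(*steps):
    return lambda x: reduce(lambda acc, step: step(acc), steps, x)

# Toy stages standing in for real ingestion steps
lowercase = str.lower
tokenize = str.split
count = len

ingest = pipeline(lowercase, tokenize, count)
```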

Flow architecture aligned with current implementation and RAG expansion path

KPI Targets for USP Validation

RAG Citation Coverage

Target a high percentage of responses with source-backed citations.

Retrieval Recall@k

Measure if correct supporting chunks are retrieved consistently.
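
Recall@k reduces to a simple count once evaluation data exists: the fraction of queries whose gold chunk appears in the top-k retrieved IDs. The data shapes below are assumptions for illustration.

```python
# One way to compute Retrieval Recall@k; input shapes are assumed.
def recall_at_k(results, k=5):
    """results: list of (retrieved_ids_in_rank_order, gold_id) pairs."""
    hits = sum(1 for retrieved, gold in results if gold in retrieved[:k])
    return hits / len(results) if results else 0.0
```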

Answer Groundedness

Track unsupported statements and drive them toward near-zero.

Extraction Precision/Recall

Field-level reliability by tenant and document domain.

Review Burden

Reduce human correction rate over releases.

Latency + Cost

Track end-to-end query time and per-document/per-query cost.