RAG Pipeline Architecture

System Overview Diagram

                              RAG SYSTEM ARCHITECTURE
═══════════════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────────────────┐
│                            OFFLINE INDEXING                                 │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
  │ Source Docs │     │  Chunking   │     │  Embedding  │     │   Vector    │
  │ PDF/DOCX/   │────▶│  Strategy   │────▶│   Model     │────▶│   Store     │
  │ HTML/TXT    │     │ 512 tokens  │     │  e5-large   │     │  (Milvus)   │
  └──────┬──────┘     │ 50 overlap  │     └─────────────┘     └─────────────┘
         │            └─────────────┘
         ▼                                                    ┌─────────────┐
  ┌─────────────┐     ┌─────────────┐                         │    BM25     │
  │ Extraction  │────▶│  Metadata   │────────────────────────▶│   Index     │
  │ + Cleaning  │     │ Enrichment  │                         │ (Elastic)   │
  └──────┬──────┘     └─────────────┘                         └─────────────┘
         │
         ▼
  ┌─────────────┐
  │  Quality    │
  │   Gates     │
  └─────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                           ONLINE RETRIEVAL                                  │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐      ┌─────────────┐
  │    User     │     │   Query     │     │   Vector    │────┐ │   Hybrid    │
  │   Query     │──┬─▶│  Embedding  │────▶│   Search    │    ├▶│  Fusion     │
  └─────────────┘  │  │  e5-large   │     │  (top 100)  │    │ │   (RRF)     │
                   │  └─────────────┘     └─────────────┘    │ └──────┬──────┘
                   │                                         │        │
                   │                      ┌─────────────┐    │        │
                   └─────────────────────▶│    BM25     │────┘        │
                                          │   Search    │             │
                                          │  (top 100)  │             │
                                          └─────────────┘             │
                                                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              GENERATION                                     │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
  │   Hybrid    │     │   Reranker  │     │   Context   │     │     LLM     │
  │   Results   │────▶│ bge-reranker│────▶│  Assembly   │────▶│  (Claude)   │
  │  (top 100)  │     │ top 20 → 5  │     │  + Prompt   │     │             │
  └─────────────┘     └─────────────┘     └─────────────┘     └──────┬──────┘
                                                                     │
                                                                     ▼
                                                              ┌─────────────┐
                                                              │  Response   │
                                                              │ + Citations │
                                                              └──────┬──────┘
                                                                     │
                                   ┌─────────────────────────────────┘
                                   ▼    
┌─────────────────────────────────────────────────────────────────────────────┐
│                             FEEDBACK LOOP                                   │
│        User Clicks  │  Corrections  │  Ratings  │  Latency Metrics          │
└─────────────────────────────────────────────────────────────────────────────┘

Mermaid Diagram (for rendered viewing)

flowchart TB
    subgraph Offline["OFFLINE INDEXING"]
        direction LR
        Docs[("Source Docs<br/>PDF/DOCX/HTML/TXT")] --> Extract["Extraction<br/>+ Cleaning"]
        Extract --> Chunk["Chunking<br/>512 tokens"]
        Chunk --> Embed["Embedding<br/>e5-large"]
        Embed --> VectorStore[("Vector Store<br/>Milvus")]
        Extract --> Meta["Metadata<br/>Enrichment"]
        Meta --> BM25Store[("BM25 Index<br/>Elasticsearch")]
    end

    subgraph Online["ONLINE RETRIEVAL"]
        direction LR
        Query["User Query"] --> QueryEmbed["Query<br/>Embedding"]
        QueryEmbed --> VectorSearch["Vector Search<br/>top 100"]
        Query --> BM25Search["BM25 Search<br/>top 100"]
        VectorSearch --> Fusion["Hybrid Fusion<br/>RRF"]
        BM25Search --> Fusion
    end

    subgraph Generation["GENERATION"]
        direction LR
        Fusion --> Rerank["Reranker<br/>top 20 → 5"]
        Rerank --> Context["Context<br/>Assembly"]
        Context --> LLM["LLM<br/>Claude"]
        LLM --> Response["Response<br/>+ Citations"]
    end

    VectorStore -.-> VectorSearch
    BM25Store -.-> BM25Search

    Response --> Feedback["Feedback Loop<br/>Clicks, Corrections, Ratings, Latency"]

    style Offline fill:#e1f5fe
    style Online fill:#fff3e0
    style Generation fill:#e8f5e9

Key Design Decisions

Component      Choice                   Rationale
-------------  -----------------------  ------------------------------------------------
Chunk size     512 tokens, 50 overlap   Balance retrieval precision vs context coherence
Embedding      e5-large-v2              Best quality/latency tradeoff for English
Vector DB      Milvus                   Self-hosted, scales to 100M+ vectors
Reranker       bge-reranker-large       Cross-encoder accuracy, acceptable latency
Hybrid weight  0.7 vector / 0.3 BM25    Tuned on held-out queries
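
The chunk size and overlap above can be illustrated with a sliding window. This is a minimal sketch, not the pipeline's actual chunker: `chunk_tokens` is a hypothetical helper, and it assumes the document has already been tokenized (the example uses placeholder string tokens rather than output from the embedding model's tokenizer).

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, each window starts 462 tokens later
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens standing in for real tokenizer output (assumption).
tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 76]
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap.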

Data Flow Summary

  1. Ingestion: Source docs → Extraction → Cleaning → Metadata enrichment → Quality gates
  2. Indexing: Chunking → Embedding → Vector store + BM25 index
  3. Retrieval: Query embed → Vector search (top 100), in parallel with BM25 search on the query text (top 100) → Hybrid fusion (RRF) → Rerank top 20 → top 5
  4. Generation: Context assembly → Prompt template → LLM → Response with citations
  5. Feedback: User clicks, corrections, ratings, and latency metrics are collected to guide system improvement
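
The retrieval step can be sketched as reciprocal rank fusion followed by a rerank cut. This is a sketch under stated assumptions, not the system's implementation: the document does not say how the 0.7 / 0.3 hybrid weight interacts with RRF, so weighted RRF is one possible reading; `k = 60` is the commonly used RRF constant and is not given in this document; `rrf_fuse`, `retrieve`, and `score_fn` are hypothetical names, with `score_fn` standing in for the cross-encoder reranker.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[List[str]], weights: Sequence[float], k: int = 60) -> List[str]:
    """Weighted reciprocal rank fusion: score(d) = sum_i w_i / (k + rank_i(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(
    query: str,
    vector_hits: List[str],
    bm25_hits: List[str],
    score_fn: Callable[[str, str], float],  # hypothetical reranker: (query, doc_id) -> relevance
    rerank_n: int = 20,
    final_n: int = 5,
) -> List[str]:
    """Fuse both result lists, rerank the top `rerank_n`, and keep `final_n`."""
    fused = rrf_fuse([vector_hits, bm25_hits], weights=[0.7, 0.3])
    candidates = fused[:rerank_n]
    reranked = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return reranked[:final_n]


# Toy ranked id lists standing in for the two top-100 result sets (assumption).
vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
print(rrf_fuse([vector_hits, bm25_hits], weights=[0.7, 0.3]))  # ['a', 'c', 'b', 'd']
```

Because RRF uses only rank positions, the vector similarity scores and BM25 scores never need to be normalized against each other. The expensive cross-encoder then runs on 20 candidates rather than the full fused list.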