RAG (Retrieval-Augmented Generation)

RAG Architecture: Overview

RAG (Retrieval-Augmented Generation) combines two main AI capabilities: retrieval, which finds relevant documents in a knowledge base via semantic search, and generation, where an LLM writes an answer grounded in those retrieved documents.

RAG Workflow: Step-by-Step

Simple Flow

  1. User Query: User sends a question or prompt.
  2. Query Embedding: An embedding model converts the query into a vector.
  3. Vector Search: The vector DB compares the query vector to stored document vectors, retrieving the most semantically similar documents.
  4. Context Assembly: Retrieved documents are compiled into a prompt along with the original question.
  5. LLM Inference: The inference model (LLM) takes the prompt (query + context) and generates a coherent, context-aware response.
  6. Return Response: The answer is returned to the user.
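To make the flow concrete, here is a minimal Python sketch of steps 1–6. The sample documents, the all-MiniLM-L6-v2 model choice, and the `generate_answer` stub are placeholders; swap in your own corpus, embedding model, and LLM client.

```python
# Minimal end-to-end RAG sketch (steps 1-6 above).
# Assumes `sentence-transformers` and `numpy` are installed; `generate_answer`
# is a placeholder for whatever LLM client you actually use.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Pinecone is a fully managed vector database.",
    "Redis is often used as a KV cache in RAG systems.",
    "vLLM is a high-throughput LLM serving engine.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Done once, offline: embed the knowledge base.
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def generate_answer(prompt: str) -> str:
    # Placeholder for the real LLM call (step 5).
    return f"[LLM answer based on prompt of {len(prompt)} chars]"

def rag_answer(query: str, top_k: int = 2) -> str:
    query_vec = embedder.encode([query], normalize_embeddings=True)  # step 2
    scores = doc_vectors @ query_vec.T                               # step 3: cosine similarity
    top_idx = np.argsort(scores.ravel())[::-1][:top_k]
    context = "\n".join(documents[i] for i in top_idx)               # step 4
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate_answer(prompt)                                   # steps 5-6

print(rag_answer("What serves LLMs efficiently?"))
```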

Optional: Agent/Orchestrator

Step-by-Step Flow
  1. User/API sends a query.
  2. The API Gateway / Ingress receives the query and routes it to the Agent/Orchestrator.
  3. The Agent/Orchestrator:
    • Embeds the query using an embedding model service (can be a pod or external service).
    • Searches the Vector DB with the query embedding to retrieve relevant documents.
    • Assembles the prompt: combines user query + retrieved docs.
    • Optionally checks the KV Cache (e.g., Redis) for repeated queries/answers.
  4. The LLM Serving Engine (e.g., vLLM, TGI, ONNX Runtime, TensorRT-LLM) receives the prompt and generates the answer.
  5. The response is returned to the user via the API gateway.
  6. If caching is enabled, frequent queries/responses are stored in the KV Cache for speed.
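Here is a sketch of what the orchestrator's request handler might look like, assuming a Redis cache and a vLLM server behind an in-cluster `llm-serving` Service exposing the OpenAI-compatible `/v1/completions` API. The `embed` and `vector_search` helpers are stubs standing in for the embedding service and vector DB calls; URLs and the model name are illustrative.

```python
# Orchestrator loop sketch: cache check -> retrieve -> prompt -> LLM -> cache.
import hashlib
import redis
import requests

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
VLLM_URL = "http://llm-serving:8000/v1/completions"  # assumed Service name

def embed(text: str) -> list:
    # Stub: call the embedding model service here.
    return [0.0] * 384

def vector_search(vector: list, top_k: int) -> list:
    # Stub: query the vector DB here.
    return ["(retrieved document)"] * top_k

def handle_query(query: str) -> str:
    # Optional: check the KV cache for an earlier answer to the same query.
    key = "rag:" + hashlib.sha256(query.encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached

    # Embed, retrieve, and assemble the prompt.
    docs = vector_search(embed(query), top_k=3)
    prompt = "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {query}\nAnswer:"

    # Call the LLM serving engine (vLLM's OpenAI-compatible completions API).
    resp = requests.post(VLLM_URL, json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 256,
    }, timeout=60)
    answer = resp.json()["choices"][0]["text"]

    # Cache the answer for an hour so repeated queries skip the LLM entirely.
    cache.set(key, answer, ex=3600)
    return answer
```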

Where Each Technology Sits:

  • API Gateway / Ingress: entry point that receives and routes queries.
  • Agent/Orchestrator: coordinates embedding, retrieval, caching, and prompt assembly.
  • Embedding model service: converts queries (and documents, at indexing time) into vectors.
  • Vector DB: stores document vectors and answers similarity searches.
  • KV Cache (e.g., Redis): stores recent query/answer pairs.
  • LLM Serving Engine: hosts the inference model and generates the final answer.

Embedding Model vs. Inference Model

| | Embedding Model | Inference Model (LLM) |
|---|---|---|
| Purpose | Converts text to vectors for semantic search | Generates text responses/answers |
| Examples | OpenAI text-embedding-ada-002, HuggingFace all-MiniLM-L6-v2, Cohere embed-english-v3.0, Google gemini-embedding-001 | OpenAI GPT-4, GPT-3.5, Meta Llama-3, Anthropic Claude, Mistral Mixtral, Google Gemini |
| Usage | Used at retrieval step to find relevant docs | Used at generation step to write answer |
| Interplay | Selects which documents are retrieved | Uses the retrieved documents as context in answer generation |
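
The difference in call shape is easy to see in code: the embedding model returns a fixed-size vector that is only useful for comparison, while the LLM returns text. A small sketch using all-MiniLM-L6-v2 as an example embedding model:

```python
# Embedding model output: a fixed-size vector, not text.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vec_a = embedder.encode("How do I reset my password?", normalize_embeddings=True)
vec_b = embedder.encode("Steps to recover account access", normalize_embeddings=True)
vec_c = embedder.encode("Best pizza toppings", normalize_embeddings=True)

print(vec_a.shape)           # (384,) -- a vector, not an answer
print(float(vec_a @ vec_b))  # higher similarity: related meaning
print(float(vec_a @ vec_c))  # lower similarity: unrelated meaning
```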


Here’s a table of popular vector DBs and their main characteristics:

| Name | Open Source | Cloud-Managed | Key Features |
|---|---|---|---|
| Pinecone | No | Yes | Fully managed, fast, scalable, easy API. Used for production RAG at scale. |
| Weaviate | Yes | Yes (SaaS) | Supports hybrid search (vector + keyword), RESTful API, built-in ML modules. |
| Qdrant | Yes | Yes | Strong filtering, gRPC support, real-time updates, written in Rust for speed. |
| Milvus | Yes | Yes (Zilliz Cloud) | Very scalable, built for high-volume use, supports hybrid search, often used in enterprise. |
| FAISS | Yes (library) | No | Not a DB but a library for vector search; commonly used for fast local or in-memory similarity search. |
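
Because FAISS is a library rather than a server, it typically runs in-process. A minimal sketch, with random vectors standing in for real embeddings:

```python
# FAISS in its typical role: a local, in-memory similarity index.
import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatIP(dim)  # inner-product index (cosine if vectors are normalized)

vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(vectors)     # normalize in place so IP == cosine similarity
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest neighbours
print(ids[0], scores[0])
```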

Common Use Cases That Connect All the Above

Typical examples are Q&A chatbots over private documents, enterprise knowledge search, and customer-support assistants; each combines an embedding model, a vector DB, an LLM, and a serving stack. KV caching is crucial in all of these to reduce repeated LLM calls for similar queries, dramatically improving speed and cost.
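
One practical detail is the cache key: hashing a lightly normalized query instead of the raw string raises hit rates for trivially different phrasings. A small sketch:

```python
# Cache-key normalization: casing and whitespace differences map to one key.
import hashlib

def cache_key(query: str) -> str:
    normalized = " ".join(query.lower().split())
    return "rag:" + hashlib.sha256(normalized.encode()).hexdigest()

assert cache_key("What is RAG?") == cache_key("  what is RAG?  ")
```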

How This Is Served on Kubernetes (Managed or Self-Hosted)

Typical Architecture:

  • Ingress / API Gateway: cluster entry point (e.g., NGINX Ingress or a cloud load balancer).
  • Agent/Orchestrator and embedding model: stateless Deployments, scaled horizontally.
  • Vector DB: a StatefulSet when self-hosted (Qdrant, Weaviate, Milvus) or an external managed service (Pinecone).
  • KV Cache: Redis as a Deployment/StatefulSet, or a managed cache.
  • LLM Serving Engine: GPU-backed Deployment (vLLM, TGI, TensorRT-LLM), scaled independently of the rest.
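
As one illustration, the LLM serving engine can be created as a GPU-backed Deployment with the official `kubernetes` Python client. The image tag, namespace, model, and resource values below are placeholders, not recommendations.

```python
# Sketch: create a GPU Deployment for the LLM serving engine.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

container = client.V1Container(
    name="llm-serving",
    image="vllm/vllm-openai:latest",  # assumed image tag
    args=["--model", "meta-llama/Meta-Llama-3-8B-Instruct"],
    ports=[client.V1ContainerPort(container_port=8000)],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="llm-serving", namespace="rag"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "llm-serving"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "llm-serving"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="rag", body=deployment)
```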
TL;DR Summary Table

| Component | Example Choices | Role |
|---|---|---|
| Embedding Model | OpenAI ada-002, all-MiniLM-L6-v2, Cohere | Semantic vectorization for search |
| Vector DB | Pinecone, Weaviate, Qdrant, Milvus, FAISS | Fast similarity search over knowledge base |
| Inference Model (LLM) | GPT-4, Llama-3, Gemini, Claude, Mixtral | Generates final response using retrieved context |
| KV Cache | Redis, Memcached | Speeds up repeated/long queries |
| Serving Engine | vLLM, TGI, ONNX Runtime, TensorRT-LLM | Efficient model hosting for LLMs |
| Kubernetes Platform | Self-hosted K8s, GKE, Vertex AI | Scalable, production-grade deployment |