AI
RAG (Retrieval-Augmented Generation)
Glossary
- Embedding(s): Turning text into numeric vectors so that semantically similar texts end up close together in vector space.
RAG Architecture: Overview
RAG (Retrieval-Augmented Generation) combines two main AI capabilities:
- Retrieval: Fetch relevant info from an external knowledge base (using semantic search in a vector database).
- Generation: Use an LLM to generate a natural language answer using both the user query and the retrieved info.
RAG Workflow: Step-by-Step
Simple
- User Query: User sends a question or prompt.
- Query Embedding: An embedding model converts the query into a vector.
- Vector Search: The vector DB compares the query vector to stored document vectors, retrieving the most semantically similar documents.
- Context Assembly: Retrieved documents are compiled into a prompt along with the original question.
- LLM Inference: The inference model (LLM) takes the prompt (query + context) and generates a coherent, context-aware response.
- Return Response: The answer is returned to the user.
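A minimal sketch of these six steps end to end, assuming the `sentence-transformers` and `openai` packages; the tiny corpus and in-memory array are stand-ins for a real knowledge base and vector DB, and the model names are simply the ones used as examples later in these notes:

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

docs = [
    "Employees accrue 25 vacation days per year.",
    "The VPN must be used on public Wi-Fi.",
    "Expense reports are due by the 5th of each month.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")           # embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # pre-indexed corpus

def answer(query: str, k: int = 2) -> str:
    # 1-2. User query arrives and is embedded into a vector.
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # 3. Vector search: cosine similarity is a dot product on normalized vectors.
    scores = doc_vecs @ q_vec
    top_docs = [docs[i] for i in np.argsort(scores)[::-1][:k]]
    # 4. Context assembly: retrieved docs + original question in one prompt.
    prompt = ("Answer using only this context:\n"
              + "\n".join(top_docs)
              + f"\n\nQuestion: {query}")
    # 5. LLM inference (reads OPENAI_API_KEY from the environment).
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # 6. Return the response to the user.
    return resp.choices[0].message.content

print(answer("How many vacation days do I get?"))
```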
Optional
- KV Caching: Frequently used prompts and their generated responses are cached for speed, reducing repetitive computation.
- Agentic Workflow: If multi-step actions are needed (planning, API calls), an agentic layer may manage additional steps, each repeating some or all of the above process.
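A conceptual sketch of that agentic layer: split the request into steps, then run a retrieval + generation round per step. `plan_steps()` is a hypothetical placeholder (in practice usually another LLM call), and `rag_answer` is whatever single-shot RAG function you already have, e.g. the `answer()` helper in the sketch above.

```python
def plan_steps(request: str) -> list[str]:
    # Placeholder planner: a real agent would typically ask an LLM to break the
    # request into sub-tasks (retrieve data, summarize, call an API, ...).
    return [request]  # trivial plan: the whole request as a single step

def run_agent(request: str, rag_answer) -> list[str]:
    results = []
    for step in plan_steps(request):
        results.append(rag_answer(step))  # each step repeats the RAG workflow
    return results
```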
Agent/Orchestrator
Step-by-Step Flow
- User/API sends a query.
- The API Gateway / Ingress receives the query and routes it to the Agent/Orchestrator.
- The Agent/Orchestrator:
- Embeds the query using an embedding model service (can be a pod or external service).
- Searches the Vector DB with the query embedding to retrieve relevant documents.
- Assembles the prompt: combines user query + retrieved docs.
- Optionally checks the KV Cache (e.g., Redis) for repeated queries/answers.
- The LLM Serving Engine (e.g., vLLM, TGI, ONNX Runtime, TensorRT-LLM) receives the prompt and generates the answer.
- The response is returned to the user via the API gateway.
- If caching is enabled, frequent queries/responses are stored in the KV Cache for speed.
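A sketch of the orchestrator's request path as a small FastAPI app. The in-cluster service URLs and payload shapes below are hypothetical; they depend on which embedding service, vector DB, and serving engine you actually deploy, and the in-memory dict stands in for Redis.

```python
import hashlib
import requests
from fastapi import FastAPI

app = FastAPI()
cache: dict[str, str] = {}  # stand-in for a Redis KV cache

@app.post("/ask")
def ask(query: str) -> dict:
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in cache:                                   # repeated query: cache hit
        return {"answer": cache[key], "cached": True}

    # Embed the query via the embedding model service.
    vec = requests.post("http://embedder:8080/embed",
                        json={"text": query}).json()["vector"]
    # Retrieve the most similar documents from the vector DB service.
    docs = requests.post("http://vector-db:6333/search",
                         json={"vector": vec, "k": 3}).json()["docs"]
    # Assemble the prompt and call the LLM serving engine.
    prompt = "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {query}"
    answer = requests.post("http://llm-server:8000/generate",
                           json={"prompt": prompt}).json()["text"]

    cache[key] = answer                                # store for repeated queries
    return {"answer": answer, "cached": False}
```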
Where Each Technology Sits:
- Embedding Model: Pod/service (HuggingFace, OpenAI endpoint, or similar)
- Vector DB: Pod/service (Weaviate, Qdrant, Milvus, Pinecone, FAISS, etc.)
- LLM Serving Engine: Pod/service (vLLM, TGI, etc.) or cloud LLM endpoint
- KV Cache: Pod/service (Redis, Memcached)
- Agent/Orchestrator: Pod/service (your business logic, maybe a FastAPI app, LangChain agent, etc.)
- API Gateway / Ingress: K8s ingress controller, NGINX, Istio, etc.
Embedding Model vs. Inference Model
| | Embedding Model | Inference Model (LLM) |
|---|---|---|
| Purpose | Converts text to vectors for semantic search | Generates text responses/answers |
| Examples | OpenAI text-embedding-ada-002, HuggingFace all-MiniLM-L6-v2, Cohere embed-english-v3.0, Google gemini-embedding-001 | OpenAI GPT-4, GPT-3.5, Meta Llama-3, Google Gemini, Anthropic Claude, Mistral Mixtral |
| Usage | Used at retrieval step to find relevant docs | Used at generation step to write the answer |
| Interplay | Selects which documents to show | Uses those documents as context when generating the answer |
- They are separate models but work together: embedding model finds what to show, LLM decides how to use it to answer your question.
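A short sketch contrasting the two call shapes with the OpenAI client (the models named in the table above); the same split applies to any embedding/LLM provider pair:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Embedding model: text in, vector of floats out (used for similarity search).
emb = client.embeddings.create(
    model="text-embedding-ada-002",
    input="How do I reset my password?",
)
vector = emb.data[0].embedding  # a 1536-dimensional list of floats

# Inference model (LLM): prompt in, natural-language answer out.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain password resets in one sentence."}],
)
print(len(vector), chat.choices[0].message.content)
```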
Popular Vector Databases
Here’s a table of popular vector DBs and their main characteristics:
| Name | Open Source | Cloud-Managed | Key Features |
|---|---|---|---|
| Pinecone | ❌ | ✅ | Fully managed, fast, scalable, easy API. Used for production RAG at scale. |
| Weaviate | ✅ | ✅ | Open source and SaaS, supports hybrid search (vector + keyword), RESTful API, built-in ML modules. |
| Qdrant | ✅ | ✅ | Open source and managed, strong filtering, gRPC support, real-time updates, written in Rust for speed. |
| Milvus | ✅ | ✅ | Very scalable, built for high-volume use, supports hybrid search, often used in enterprise. |
| FAISS | ✅ | ❌ | Not a DB, but a library for vector search; commonly used for fast local or in-memory similarity search. |
- Cloud-managed options (Pinecone, Weaviate, Qdrant Cloud) are easiest to get started with for production, but self-hosting (Weaviate, Qdrant, Milvus) is common for privacy/compliance.
- FAISS is often used for prototyping or when you want to tightly control everything in-memory.
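A minimal FAISS sketch of that prototyping use. FAISS only handles the similarity search, so the embeddings still come from a separate model (all-MiniLM-L6-v2 here); the documents are illustrative.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Reset your password in the account settings page.",
    "VPN access requires a hardware token.",
    "Invoices are processed every Friday.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(docs, normalize_embeddings=True).astype("float32")

# Inner-product index; on normalized vectors this is cosine similarity.
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

query = model.encode(["How do I change my password?"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)   # top-2 nearest documents
print([docs[i] for i in ids[0]])
```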
Common Use Cases That Connect All the Above
- AI Chatbots with Company Knowledge: Users ask questions; RAG retrieves and summarizes internal docs, policies, FAQs.
- Semantic Search Applications: Employees or users search a knowledge base using meaning, not just keywords.
- Document Q&A: RAG finds and reads long legal, scientific, or technical documents to answer questions.
- Code/Support Copilots: RAG retrieves relevant support tickets, code snippets, or documentation, and LLM explains, summarizes, or troubleshoots.
- Agentic Workflow Example: “Book a meeting with Alice next week and summarize her last three emails”—an agent retrieves data, generates summaries, and takes action step by step.
KV caching is crucial in all of these: it avoids repeated LLM calls for similar queries, which dramatically improves speed and reduces cost.
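A sketch of that cache layer with Redis, assuming an instance reachable at `redis:6379`; `call_llm` is a placeholder for whichever serving engine or API the deployment uses.

```python
import hashlib
import redis

r = redis.Redis(host="redis", port=6379)

def cached_generate(prompt: str, call_llm, ttl: int = 3600) -> str:
    # Key the cache on a hash of the full prompt (query + retrieved context).
    key = "rag:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:            # repeated query: skip the LLM call entirely
        return hit.decode()
    answer = call_llm(prompt)      # cache miss: generate, then store with a TTL
    r.set(key, answer, ex=ttl)
    return answer
```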
How This Is Served on Kubernetes (Managed or Self-Hosted)
Typical Architecture:
- Pods/Deployments:
- Embedding Model Service: Converts text to vectors (can be REST API, open-source model, or cloud embedding endpoint).
- Vector DB Service: (Pinecone/Weaviate/Qdrant/Milvus pod, or external managed service).
- LLM Inference Service: (TGI, vLLM, ONNX Runtime, or cloud LLM endpoint).
- Agent/Orchestrator: (If using agentic workflows, runs logic, manages prompts).
- API Gateway/Ingress: Receives user queries and routes requests.
- Managed (e.g., GKE, Vertex AI):
- Kubernetes provides scaling, monitoring, and deployment.
- Vertex AI can host managed LLMs and vector search endpoints as a service, which can be integrated with other K8s services.
- Secrets, autoscaling, logging, and monitoring handled by the cloud provider.
- Self-Hosted:
- All components (vector DB, embedding service, LLM server, orchestrator) run as containers/pods.
- Use K8s tools for config, scaling, rolling updates, and networking.
- KV Caching:
- Add a cache layer (e.g., Redis, Memcached) as a Kubernetes service.
- Used by agent/orchestrator and LLM inference pods to store and retrieve frequent prompt/response pairs or token sequences.
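One small illustration of how the orchestrator pod might locate the other services in-cluster: endpoints injected through the Deployment/ConfigMap as environment variables, with cluster-DNS defaults. The names and defaults below are illustrative, not a fixed convention; the same code works whether they point at in-cluster pods or managed cloud endpoints.

```python
import os

# Service endpoints for the orchestrator, resolved from env vars set in the pod spec.
EMBEDDER_URL  = os.getenv("EMBEDDER_URL", "http://embedding-service:8080")
VECTOR_DB_URL = os.getenv("VECTOR_DB_URL", "http://qdrant:6333")
LLM_URL       = os.getenv("LLM_URL", "http://vllm-server:8000")
REDIS_HOST    = os.getenv("REDIS_HOST", "redis")
```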
TL;DR Summary Table
| Component | Example Choices | Role |
|---|---|---|
| Embedding Model | OpenAI ada-002, all-MiniLM-L6-v2, Cohere | Semantic vectorization for search |
| Vector DB | Pinecone, Weaviate, Qdrant, Milvus, FAISS | Fast similarity search over knowledge base |
| Inference Model (LLM) | GPT-4, Llama-3, Gemini, Claude, Mixtral | Generates final response using retrieved context |
| KV Cache | Redis, Memcached | Speeds up repeated/long queries |
| Serving Engine | vLLM, TGI, ONNX Runtime, TensorRT-LLM | Efficient model hosting for LLMs |
| Kubernetes Platform | Self-hosted K8s, GKE, Vertex AI | Scalable, production-grade deployment |