AI
RAG (Retrieval-Augmented Generation)
Glossary
- Embedding(s): Turning text into numeric vectors so that semantically similar texts end up close together in vector space.
RAG Architecture: Overview
RAG (Retrieval-Augmented Generation) combines two main AI capabilities:
- Retrieval: Fetch relevant info from an external knowledge base (using semantic search in a vector database).
- Generation: Use an LLM to generate a natural language answer using both the user query and the retrieved info.
RAG Workflow: Step-by-Step
Simple
- User Query: User sends a question or prompt.
- Query Embedding: An embedding model converts the query into a vector.
- Vector Search: The vector DB compares the query vector to stored document vectors, retrieving the most semantically similar documents.
- Context Assembly: Retrieved documents are compiled into a prompt along with the original question.
- LLM Inference: The inference model (LLM) takes the prompt (query + context) and generates a coherent, context-aware response.
- Return Response: The answer is returned to the user.
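A minimal sketch of these six steps end to end, assuming the `sentence-transformers` and `openai` packages; the tiny corpus and in-memory array are stand-ins for a real knowledge base and vector DB, and the model names are simply the ones used as examples later in these notes:

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

docs = [
    "Employees accrue 25 vacation days per year.",
    "The VPN must be used on public Wi-Fi.",
    "Expense reports are due by the 5th of each month.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")           # embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # pre-indexed corpus

def answer(query: str, k: int = 2) -> str:
    # 1-2. User query arrives and is embedded into a vector.
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    # 3. Vector search: cosine similarity is a dot product on normalized vectors.
    scores = doc_vecs @ q_vec
    top_docs = [docs[i] for i in np.argsort(scores)[::-1][:k]]
    # 4. Context assembly: retrieved docs + original question in one prompt.
    prompt = ("Answer using only this context:\n"
              + "\n".join(top_docs)
              + f"\n\nQuestion: {query}")
    # 5. LLM inference (reads OPENAI_API_KEY from the environment).
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # 6. Return the response to the user.
    return resp.choices[0].message.content

print(answer("How many vacation days do I get?"))
```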
Optional
- KV Caching: Frequently used prompts and their generated responses are cached for speed, reducing repetitive computation.
- Agentic Workflow: If multi-step actions are needed (planning, API calls), an agentic layer may manage additional steps, each repeating some or all of the above process.
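A conceptual sketch of that agentic layer: split the request into steps, then run a retrieval + generation round per step. `plan_steps()` is a hypothetical placeholder (in practice usually another LLM call), and `rag_answer` is whatever single-shot RAG function you already have, e.g. the `answer()` helper in the sketch above.

```python
def plan_steps(request: str) -> list[str]:
    # Placeholder planner: a real agent would typically ask an LLM to break the
    # request into sub-tasks (retrieve data, summarize, call an API, ...).
    return [request]  # trivial plan: the whole request as a single step

def run_agent(request: str, rag_answer) -> list[str]:
    results = []
    for step in plan_steps(request):
        results.append(rag_answer(step))  # each step repeats the RAG workflow
    return results
```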
Agent/Orchestrator
Step-by-Step Flow
- User/API sends a query.
- The API Gateway / Ingress receives the query and routes it to the Agent/Orchestrator.
- The Agent/Orchestrator:
- Embeds the query using an embedding model service (can be a pod or external service).
- Searches the Vector DB with the query embedding to retrieve relevant documents.
- Assembles the prompt: combines user query + retrieved docs.
- Optionally checks the KV Cache (e.g., Redis) for repeated queries/answers.
- The LLM Serving Engine (e.g., vLLM, TGI, ONNX Runtime, TensorRT-LLM) receives the prompt and generates the answer.
- The response is returned to the user via the API gateway.
- If caching is enabled, frequent queries/responses are stored in the KV Cache for speed.
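A sketch of the orchestrator's request path as a small FastAPI app. The in-cluster service URLs and payload shapes below are hypothetical; they depend on which embedding service, vector DB, and serving engine you actually deploy, and the in-memory dict stands in for Redis.

```python
import hashlib
import requests
from fastapi import FastAPI

app = FastAPI()
cache: dict[str, str] = {}  # stand-in for a Redis KV cache

@app.post("/ask")
def ask(query: str) -> dict:
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in cache:                                   # repeated query: cache hit
        return {"answer": cache[key], "cached": True}

    # Embed the query via the embedding model service.
    vec = requests.post("http://embedder:8080/embed",
                        json={"text": query}).json()["vector"]
    # Retrieve the most similar documents from the vector DB service.
    docs = requests.post("http://vector-db:6333/search",
                         json={"vector": vec, "k": 3}).json()["docs"]
    # Assemble the prompt and call the LLM serving engine.
    prompt = "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {query}"
    answer = requests.post("http://llm-server:8000/generate",
                           json={"prompt": prompt}).json()["text"]

    cache[key] = answer                                # store for repeated queries
    return {"answer": answer, "cached": False}
```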
Where Each Technology Sits:
- Embedding Model: Pod/service (HuggingFace, OpenAI endpoint, or similar)
- Vector DB: Pod/service (Weaviate, Qdrant, Milvus, Pinecone, FAISS, etc.)
- LLM Serving Engine: Pod/service (vLLM, TGI, etc.) or cloud LLM endpoint
- KV Cache: Pod/service (Redis, Memcached)
- Agent/Orchestrator: Pod/service (your business logic, maybe a FastAPI app, LangChain agent, etc.)
- API Gateway / Ingress: K8s ingress controller, NGINX, Istio, etc.
Embedding Model vs. Inference Model
| | Embedding Model | Inference Model (LLM) |
|---|---|---|
| Purpose | Converts text to vectors for semantic search | Generates text responses/answers |
| Examples | OpenAI text-embedding-ada-002, HuggingFace all-MiniLM-L6-v2, Cohere embed-english-v3.0, Google gemini-embedding-001 | OpenAI GPT-4, GPT-3.5, Meta Llama-3, Google Gemini, Anthropic Claude, Mistral Mixtral |
| Usage | Used at retrieval step to find relevant docs | Used at generation step to write the answer |
| Interplay | Selects which documents to show | Uses those documents as context when generating the answer |
- They are separate models but work together: embedding model finds what to show, LLM decides how to use it to answer your question.
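A short sketch contrasting the two call shapes with the OpenAI client (the models named in the table above); the same split applies to any embedding/LLM provider pair:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Embedding model: text in, vector of floats out (used for similarity search).
emb = client.embeddings.create(
    model="text-embedding-ada-002",
    input="How do I reset my password?",
)
vector = emb.data[0].embedding  # a 1536-dimensional list of floats

# Inference model (LLM): prompt in, natural-language answer out.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain password resets in one sentence."}],
)
print(len(vector), chat.choices[0].message.content)
```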
Popular Vector Databases
Here’s a table of popular vector DBs and their main characteristics:
| Name | Open Source | Cloud-Managed | Key Features |
|---|---|---|---|
| Pinecone | ❌ | ✅ | Fully managed, fast, scalable, easy API. Used for production RAG at scale. |
| Weaviate | ✅ | ✅ | Open source and SaaS, supports hybrid search (vector + keyword), RESTful API, built-in ML modules. |
| Qdrant | ✅ | ✅ | Open source and managed, strong filtering, gRPC support, real-time updates, written in Rust for speed. |
| Milvus | ✅ | ✅ | Very scalable, built for high-volume use, supports hybrid search, often used in enterprise. |
| FAISS | ✅ | ❌ | Not a DB, but a library for vector search; commonly used for fast local or in-memory similarity search. |
- Cloud-managed options (Pinecone, Weaviate, Qdrant Cloud) are easiest to get started with for production, but self-hosting (Weaviate, Qdrant, Milvus) is common for privacy/compliance.
- FAISS is often used for prototyping or when you want to tightly control everything in-memory.
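A minimal FAISS sketch of that prototyping use. FAISS only handles the similarity search, so the embeddings still come from a separate model (all-MiniLM-L6-v2 here); the documents are illustrative.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Reset your password in the account settings page.",
    "VPN access requires a hardware token.",
    "Invoices are processed every Friday.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(docs, normalize_embeddings=True).astype("float32")

# Inner-product index; on normalized vectors this is cosine similarity.
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

query = model.encode(["How do I change my password?"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)   # top-2 nearest documents
print([docs[i] for i in ids[0]])
```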
Common Use Cases That Connect All the Above
- AI Chatbots with Company Knowledge: Users ask questions; RAG retrieves and summarizes internal docs, policies, FAQs.
- Semantic Search Applications: Employees or users search a knowledge base using meaning, not just keywords.
- Document Q&A: RAG finds and reads long legal, scientific, or technical documents to answer questions.
- Code/Support Copilots: RAG retrieves relevant support tickets, code snippets, or documentation, and LLM explains, summarizes, or troubleshoots.
- Agentic Workflow Example: “Book a meeting with Alice next week and summarize her last three emails”—an agent retrieves data, generates summaries, and takes action step by step.
KV caching is crucial in all of these: it avoids repeated LLM calls for similar queries, which dramatically improves speed and reduces cost.
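A sketch of that cache layer with Redis, assuming an instance reachable at `redis:6379`; `call_llm` is a placeholder for whichever serving engine or API the deployment uses.

```python
import hashlib
import redis

r = redis.Redis(host="redis", port=6379)

def cached_generate(prompt: str, call_llm, ttl: int = 3600) -> str:
    # Key the cache on a hash of the full prompt (query + retrieved context).
    key = "rag:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:            # repeated query: skip the LLM call entirely
        return hit.decode()
    answer = call_llm(prompt)      # cache miss: generate, then store with a TTL
    r.set(key, answer, ex=ttl)
    return answer
```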
How This Is Served on Kubernetes (Managed or Self-Hosted)
Typical Architecture:
- Pods/Deployments:
- Embedding Model Service: Converts text to vectors (can be REST API, open-source model, or cloud embedding endpoint).
- Vector DB Service: (Pinecone/Weaviate/Qdrant/Milvus pod, or external managed service).
- LLM Inference Service: (TGI, vLLM, ONNX Runtime, or cloud LLM endpoint).
- Agent/Orchestrator: (If using agentic workflows, runs logic, manages prompts).
- API Gateway/Ingress: Receives user queries and routes requests.
- Managed (e.g., GKE, Vertex AI):
- Kubernetes provides scaling, monitoring, and deployment.
- Vertex AI can host managed LLMs and vector search endpoints as a service, which can be integrated with other K8s services.
- Secrets, autoscaling, logging, and monitoring handled by the cloud provider.
- Self-Hosted:
- All components (vector DB, embedding service, LLM server, orchestrator) run as containers/pods.
- Use K8s tools for config, scaling, rolling updates, and networking.
- KV Caching:
- Add a cache layer (e.g., Redis, Memcached) as a Kubernetes service.
- Used by agent/orchestrator and LLM inference pods to store and retrieve frequent prompt/response pairs or token sequences.
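One small illustration of how the orchestrator pod might locate the other services in-cluster: endpoints injected through the Deployment/ConfigMap as environment variables, with cluster-DNS defaults. The names and defaults below are illustrative, not a fixed convention; the same code works whether they point at in-cluster pods or managed cloud endpoints.

```python
import os

# Service endpoints for the orchestrator, resolved from env vars set in the pod spec.
EMBEDDER_URL  = os.getenv("EMBEDDER_URL", "http://embedding-service:8080")
VECTOR_DB_URL = os.getenv("VECTOR_DB_URL", "http://qdrant:6333")
LLM_URL       = os.getenv("LLM_URL", "http://vllm-server:8000")
REDIS_HOST    = os.getenv("REDIS_HOST", "redis")
```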
TL;DR Summary Table
| Component | Example Choices | Role |
|---|---|---|
| Embedding Model | OpenAI ada-002, all-MiniLM-L6-v2, Cohere | Semantic vectorization for search |
| Vector DB | Pinecone, Weaviate, Qdrant, Milvus, FAISS | Fast similarity search over knowledge base |
| Inference Model (LLM) | GPT-4, Llama-3, Gemini, Claude, Mixtral | Generates final response using retrieved context |
| KV Cache | Redis, Memcached | Speeds up repeated/long queries |
| Serving Engine | vLLM, TGI, ONNX Runtime, TensorRT-LLM | Efficient model hosting for LLMs |
| Kubernetes Platform | Self-hosted K8s, GKE, Vertex AI | Scalable, production-grade deployment |