AI Agent and Application Orchestration
Overview
This document is about the layer above model serving.
It covers the software that coordinates agents, tools, memory, sessions, handoffs, guardrails, and workflow state.
It does not cover the runtime layer that actually executes model weights. That belongs to a different document.
A useful stack is:
Model runtime → serving API → agent orchestration SDK → protocol layers → app/backend/frontend → platform
Examples:
- Model runtime: Ollama, llama.cpp, vLLM, TGI
- Serving API: local HTTP API, OpenAI-compatible API, workflow API
- Agent orchestration SDK: OpenAI Agents SDK, LangGraph, AutoGen, Semantic Kernel
- Protocol layers: MCP, A2A, AG-UI
- App/backend/frontend: your business logic, API, UI backend, frontend, worker system
- Platform: auth, queues, storage, observability, deployment, scaling
The important boundary is this:
Runtimes execute models. Agent frameworks decide how model calls, tools, and state are coordinated.
Mental model
What this layer is responsible for
This layer usually owns:
- tool invocation
- workflow state
- memory and session handling
- approval checkpoints and human-in-the-loop steps
- multi-agent routing and delegation
- guardrails and policy checks
- retry logic and fallbacks
- tracing and debugging of agent runs
What this layer usually does not own
This layer usually does not own:
- low-level inference scheduling
- GPU allocation
- model-weight loading strategy
- token batching in the serving engine
- ingress, autoscaling, reverse proxies, or cluster scheduling
Those belong lower in the stack.
Product categories
Agent orchestration frameworks
These define how agents are composed, routed, resumed, and observed.
Tool-calling and function-execution layer
This is the layer that connects the model to real actions.
Common patterns:
- JSON-schema or typed function calling
- local tools
- remote tools over HTTP
- MCP-backed tools
- code-execution tools
- connector-backed tools for files, email, calendars, CRMs, and internal systems
Memory and session layer
This handles continuity across turns and across runs.
Common patterns:
- short-term conversation state
- working memory for the current task
- long-term memory or retrieved context
- thread/session persistence
- durable workflow state
- resumable execution state
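The last two patterns can be made concrete with a checkpoint file. The sketch below is illustrative, not tied to any framework: step names, the checkpoint layout, and the handler signature are assumptions. The point is that completed steps are persisted after each one, so a crashed or paused run resumes where it left off.

```python
import json
from pathlib import Path

# Hypothetical step list for a multi-step workflow.
STEPS = ["fetch", "analyze", "draft", "review"]

def load_state(path: Path) -> dict:
    """Load checkpointed state, or start fresh if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": [], "results": {}}

def save_state(path: Path, state: dict) -> None:
    # Checkpoint after every step so a crash loses at most one step.
    path.write_text(json.dumps(state))

def run_workflow(path: Path, handlers: dict) -> dict:
    """Run all steps, skipping any that a previous run already completed."""
    state = load_state(path)
    for step in STEPS:
        if step in state["completed"]:
            continue  # resume: this step was already checkpointed
        state["results"][step] = handlers[step](state["results"])
        state["completed"].append(step)
        save_state(path, state)
    return state
```

On a second invocation against the same checkpoint file, every completed step is skipped without re-running its handler, which is the essence of resumable execution state.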
Guardrails and policy layer
This is where agent behavior is constrained or checked.
Common patterns:
- input validation
- output validation
- tool-use restrictions
- approval requirements
- policy checks before side effects
- content safety and compliance checks
Multi-agent routing and handoffs
This is the control-flow layer for specialization.
Common patterns:
- planner agent delegates to specialists
- router agent chooses a domain expert
- one agent escalates to a human or approval queue
- one agent transfers control to another with shared or partial context
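A minimal sketch of the router pattern, assuming a keyword-based route table (real routers typically use a model call for classification; the agent names and keywords here are hypothetical):

```python
# Route table: specialist name -> trigger keywords. "support" is the default.
SPECIALISTS = {
    "billing": ["invoice", "refund", "charge"],
    "technical": ["error", "crash", "bug"],
    "support": [],
}

def route(message: str) -> str:
    """Pick a specialist agent, with human escalation as a first-class route."""
    text = message.lower()
    for agent, keywords in SPECIALISTS.items():
        if any(k in text for k in keywords):
            return agent
    if "speak to a human" in text:
        return "human_escalation"
    return "support"
```

The design point is that escalation to a human is just another route, not an error path.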
Quick comparison
| Name | Primary role | Main strength | State model | Best fit |
|---|---|---|---|---|
OpenAI Agents SDK | Agent orchestration SDK | Clean agent, tool, handoff, guardrail model | Sessions and run-level state | Applications built around tool use and handoffs |
LangGraph | Graph-based orchestration framework | Durable execution and explicit stateful workflows | Strong explicit graph and checkpoint state | Long-running, resumable, production-style agent workflows |
AutoGen | Multi-agent framework | Conversational multi-agent patterns | Agent/chat-centric state | Existing AutoGen users and research/prototyping patterns |
Semantic Kernel | AI middleware + agent framework | Enterprise-oriented integration and plugin model | App/service-oriented state | Teams building AI features into larger business systems |
Tool-calling and execution design
What matters
In practice, tool-calling design is one of the main determinants of whether an agent system is robust or fragile.
The important questions are:
- how are tool contracts defined?
- how are arguments validated?
- how are side effects approved?
- what happens when a tool fails or times out?
- can tools be retried safely?
- is the output structured enough for downstream logic?
- can the system distinguish read-only tools from write tools?
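The last two questions suggest a registry shape like the following. This is a sketch under assumptions: the `Tool` fields and the approval callback are hypothetical, but they show how read-only tools can run freely while write tools gate on approval.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    fn: Callable[..., object]
    read_only: bool = True  # write tools must opt out explicitly

def invoke(tool: Tool, approve: Callable[[str], bool], **kwargs):
    """Run a tool, requiring approval before any tool with side effects."""
    if not tool.read_only and not approve(tool.name):
        raise PermissionError(f"approval required for write tool: {tool.name}")
    return tool.fn(**kwargs)
```

Marking read-only vs write at registration time also makes retry policy simple: read-only tools can be retried blindly, write tools cannot.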
Common implementation patterns
Typed function calling
The model selects from a defined set of functions with structured arguments.
Best when:
- tools are deterministic
- argument validation matters
- you want clean logging and auditing
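A minimal dispatch sketch for this pattern. The schema shape loosely mirrors the JSON-schema style used in function calling, but the validation here is a simplified stand-in (required keys plus Python type checks), not a full JSON Schema implementation; the tool name and signature are invented for illustration.

```python
import json

# Hypothetical tool table: parameter types, required keys, and the function.
TOOLS = {
    "get_weather": {
        "parameters": {"city": str, "units": str},
        "required": ["city"],
        "fn": lambda city, units="metric": f"{city}:{units}",
    }
}

def call_tool(name: str, raw_args: str):
    """Validate model-produced JSON arguments before executing the tool."""
    spec = TOOLS[name]
    args = json.loads(raw_args)  # model output arrives as a JSON string
    for key in spec["required"]:
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        expected = spec["parameters"].get(key)
        if expected is None or not isinstance(value, expected):
            raise ValueError(f"bad argument: {key}")
    return spec["fn"](**args)
```

Rejecting malformed arguments before execution is what makes this pattern auditable: every accepted call is known to match its contract.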
MCP-backed tools
A protocol layer exposes tools from external systems in a standard way.
Best when:
- tool inventory changes often
- you want reuse across clients and agent runtimes
- tools come from connector-backed or remote systems
Sandboxed execution tools
The agent can run code or shell commands inside an isolated environment.
Best when:
- tasks need real computation or file manipulation
- you need bounded execution and auditability
Risk:
- the side-effect surface expands fast, so approvals and isolation matter a lot
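The bounded-execution part can be sketched with a subprocess and a hard timeout. This is only the shape, not a sandbox: a real setup also needs filesystem, network, and resource isolation (containers, seccomp, resource limits), which this deliberately omits.

```python
import subprocess
import sys

def run_snippet(code: str, timeout_s: float = 2.0) -> str:
    """Run model-generated Python in a child process with a hard timeout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # kill runaway executions
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()
```

Capturing stderr and surfacing it on failure is what makes the tool auditable rather than silently lossy.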
Memory and session layers
Short-term vs long-term memory
A useful distinction:
- short-term memory: current thread state, recent messages, active task context
- working memory: scratchpad-like task state, plan state, intermediate results
- long-term memory: retrieved history, user preferences, prior artifacts, stored facts
- durable workflow state: checkpointed execution state needed to pause, resume, recover, or continue a multi-step process
These should not automatically be treated as the same thing.
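One way to keep the distinction honest is to model the stores separately, so each can have its own retention rule. The field names below are illustrative, not from any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    short_term: list = field(default_factory=list)     # recent messages
    working: dict = field(default_factory=dict)        # current-task scratchpad
    long_term: dict = field(default_factory=dict)      # durable facts, preferences
    workflow_state: dict = field(default_factory=dict) # checkpointed execution state

    def end_turn(self) -> None:
        # Short-term memory can be trimmed aggressively without touching
        # working memory, long-term memory, or durable workflow state.
        self.short_term = self.short_term[-10:]
```

With one blob instead, trimming chat history risks silently destroying workflow state, which is a common production bug.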
Session design questions
Important design questions:
- what belongs in the current session state?
- what can be re-derived from source systems?
- what must persist across runs?
- what should never be persisted due to privacy or compliance constraints?
- how do you resume a partially completed task safely?
Engineering reality
Most bad agent memory systems fail because they mix all of this together:
- chat history
- business state
- retrieved context
- durable workflow state
- user profile data
Those should be modeled separately.
Guardrails
What guardrails are actually for
Guardrails are not magic safety dust.
They are explicit checks around inputs, outputs, tool calls, and side effects.
Useful guardrails include:
- schema validation
- tool allowlists and denylists
- approval checkpoints before writes
- policy checks before external actions
- output validation for format, scope, or risk
- fallback behavior when confidence is low
Where guardrails belong
Good systems usually place guardrails at multiple points:
- before the model call
- before tool execution
- after tool output is returned
- before committing side effects
- before showing a final answer to the user in sensitive flows
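These checkpoints can be plain functions wrapped around the call path. The check names, the allowlist, and the policy below are assumptions for illustration; the structure is what matters: input check, tool check, output check, each able to stop the run.

```python
# Hypothetical tool allowlist for this agent.
ALLOWED_TOOLS = {"search", "summarize"}

def check_input(user_input: str) -> None:
    """Guardrail before the model call."""
    if len(user_input) > 4000:
        raise ValueError("input too long")

def check_tool_call(tool: str, is_write: bool, approved: bool) -> None:
    """Guardrail before tool execution."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowlisted: {tool}")
    if is_write and not approved:
        raise PermissionError("write requires approval")

def check_output(text: str) -> str:
    """Guardrail after output is returned, before showing it to the user."""
    if not text:
        raise ValueError("empty output")
    return text

def guarded_call(user_input: str, tool: str, run_tool) -> str:
    check_input(user_input)
    check_tool_call(tool, is_write=False, approved=False)
    return check_output(run_tool(user_input))
```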
Multi-agent routing and handoffs
When multi-agent actually helps
Multi-agent designs help when there is real specialization, for example:
- billing vs support vs technical troubleshooting
- planner vs executor vs reviewer
- retrieval specialist vs action-taking specialist
- human escalation as a first-class route
When it does not help
It does not help when one agent could do the job and the system is split into many agents just to look advanced.
That usually adds:
- more latency
- more token cost
- more debugging pain
- more state-transfer bugs
Handoff design questions
When one agent transfers control to another, you need to define:
- what context is transferred?
- what is redacted?
- who owns the next action?
- can control return to the original agent?
- how is the handoff traced and audited?
- does the next agent inherit the same tool permissions?
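Several of these questions can be answered in the shape of the handoff payload itself. The field names and the redaction list below are hypothetical; the sketch shows context transfer with redaction, explicit ownership of the next action, and whether control may return.

```python
# Keys that must never cross an agent boundary (illustrative list).
SENSITIVE_KEYS = {"payment_token", "ssn"}

def build_handoff(context: dict, from_agent: str, to_agent: str) -> dict:
    """Build a handoff payload with redacted context and explicit ownership."""
    transferred = {k: v for k, v in context.items() if k not in SENSITIVE_KEYS}
    return {
        "from": from_agent,
        "to": to_agent,
        "owner": to_agent,        # the receiving agent owns the next action
        "context": transferred,   # redacted copy, never the raw context
        "return_allowed": True,   # control may flow back to the origin
    }
```

Because the payload is a plain structure, it is also the natural unit to log for tracing and auditing handoffs.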
Protocol layers
Protocols are not the same thing as orchestration frameworks.
A useful split is:
- frameworks define control flow, state, and orchestration behavior
- protocols define interoperability boundaries between parts of the system
MCP
Category: Tool and context protocol
What it is
- A protocol for connecting AI applications and agents to external tools, resources, and context exposed by MCP servers
- Best thought of as the integration boundary for tool use and context access, not an orchestration framework
Engineering strengths
- standardizes tool and context integration
- reduces one-off connector glue
- useful when tools need to be reused across different clients and agent runtimes
- creates a cleaner boundary between agent logic and external capabilities
Operational concerns
- protocol standardization does not remove the need for auth, authorization, auditing, rate limits, and side-effect controls
- poorly designed MCP servers can still expose messy or unsafe tool surfaces
- transport and trust boundaries still need careful design
Best fit
- shared tool ecosystems
- reusable connectors across multiple agent clients
- systems that want cleaner separation between orchestration and external capability access
Poor fit
- tiny single-app systems where direct function calls are simpler and fully sufficient
A2A
Category: Agent-to-agent protocol
What it is
- A protocol for interoperability and collaboration between independent agent systems
- Best thought of as the communication boundary between agents or agentic applications, not a replacement for orchestration inside one system
Engineering strengths
- creates a cleaner contract for inter-agent collaboration
- useful when agents are owned by different teams, vendors, or systems
- makes specialization and delegation easier to reason about across boundaries
Operational concerns
- inter-agent communication can still become expensive, slow, and hard to debug
- capability discovery and trust boundaries need discipline
- cross-agent state transfer remains a design problem even with a protocol
Best fit
- independently owned agent systems
- cross-team or cross-vendor delegation
- architectures where agent boundaries are real and organizationally meaningful
Poor fit
- single-process agent systems where internal orchestration is enough
- designs splitting agents purely for novelty
AG-UI
Category: Agent-to-frontend interaction protocol
What it is
- A protocol for connecting agent backends to user-facing applications through events, shared state, streaming, tool rendering, and interaction flow
- Best thought of as the boundary between the agent/backend and the frontend experience
Engineering strengths
- cleaner frontend/backend contract for agent experiences
- useful for streaming stateful interaction patterns
- helps expose interrupts, tool calls, agent steps, and handoffs to the UI in a structured way
- reduces bespoke event wiring between agent backends and frontend apps
Operational concerns
- frontend protocol cleanliness does not solve orchestration quality underneath
- event richness can become UI complexity if not designed carefully
- trust, auth, and state ownership still need explicit decisions
Best fit
- rich frontend agent experiences
- applications where streaming events, tool rendering, and human-in-the-loop interaction matter
- teams that want a cleaner contract between frontend and agent backend
Poor fit
- very simple chat UIs
- systems where a plain response stream is enough
Framework profiles
OpenAI Agents SDK
Category: Agent orchestration SDK
What it is
- A framework for building agents around a clear set of primitives: agents, runner, tools, handoffs, guardrails, sessions, and tracing
- Best thought of as an application-layer orchestration SDK, not a model runtime
Engineering strengths
- clean mental model
- strong fit for tool-calling agents
- handoffs are first-class
- sessions and tracing are built into the shape of the framework
- works well when you want a direct path from model call to tool use to delegated control
Operational concerns
- you still need to design memory, persistence, retries, and side-effect controls carefully
- the SDK helps with orchestration, but it does not replace platform concerns like auth, queueing, rollout, or observability outside agent traces
- tool design quality matters more than framework marketing
Best fit
- tool-using assistants
- routed specialist-agent systems
- apps where handoffs and policy checks are part of the core design
Poor fit
- workflows that are better modeled as explicit deterministic graphs
- teams expecting the SDK to solve platform engineering for them
LangGraph
Category: Graph-based orchestration framework
What it is
- A framework for building stateful agent systems as graphs with durable execution, resumability, and explicit control flow
Engineering strengths
- explicit graph and state model
- durable execution
- pause/resume and human-in-the-loop patterns fit naturally
- strong for long-running or failure-prone workflows
- easier to reason about than free-form agent loops when systems become operationally serious
Operational concerns
- more structure means more design work up front
- can feel heavier than needed for simple assistants
- graph complexity can become its own maintenance burden if the workflow is poorly designed
Best fit
- long-lived workflows
- resumable systems
- production agent systems where state and recovery matter a lot
Poor fit
- very small agent features
- teams that only need a simple request-response tool-calling layer
AutoGen
Category: Multi-agent framework
What it is
- A framework centered on agent-to-agent interaction patterns, especially conversational and multi-agent coordination styles
Engineering strengths
- strong historical mindshare in multi-agent examples
- useful patterns for agent collaboration and decomposition
- can still be relevant when working from an existing AutoGen codebase or research prototype style
Operational concerns
- the project is in maintenance mode, which matters for long-term framework bets
- many real systems need tighter control over state, retries, tool contracts, and platform integration than naive chat-between-agents designs provide
- conversational multi-agent patterns can become expensive and hard to debug if left too loose
Best fit
- existing AutoGen users
- researchy multi-agent experiments
- teams maintaining a codebase already built around its abstractions
Poor fit
- greenfield framework choice when long-term evolution matters
- systems that need strong deterministic control over workflow state
Semantic Kernel
Category: AI middleware + agent framework
What it is
- A development kit that sits comfortably inside broader application architectures, with strong emphasis on plugins, integrations, and enterprise-style composition
Engineering strengths
- good fit for integrating AI behavior into existing services
- strong plugin and function model
- practical for teams already building around Microsoft-oriented or enterprise integration patterns
- works well when AI is one subsystem inside a larger application, not the whole product
Operational concerns
- can feel middleware-heavy if all you want is a lightweight agent loop
- abstraction surface is broader than some teams need
- framework choice should align with the host architecture, not just agent feature checklists
Best fit
- enterprise apps
- integrated service architectures
- teams that care about plugins, connectors, and business-system integration
Poor fit
- minimal prototypes where a smaller SDK would do
- teams wanting the most explicit graph-centric workflow model
Reference architectures
Thin agent layer over an existing model API
Typical stack:
- serving layer: OpenAI-compatible API or hosted model endpoint
- orchestration layer: OpenAI Agents SDK or Semantic Kernel
- tools: internal HTTP services, DB access, file retrieval, business actions
- app/backend: REST API, web app backend, worker
- platform: auth, logs, tracing, secrets
Good for:
- customer support agents
- internal productivity tools
- assistant features inside an existing product
Durable workflow agent system
Typical stack:
- serving layer: vLLM, TGI, or hosted model API
- orchestration layer: LangGraph
- memory/state: checkpoint store, DB, vector store, thread store
- human oversight: approval nodes, interrupt/resume points
- platform: queues, observability, rollout controls
Good for:
- long-running tasks
- workflows that pause and resume
- agents with explicit control flow and recovery needs
Multi-agent specialist system
Typical stack:
- serving layer: one or more model endpoints
- orchestration layer: OpenAI Agents SDK, LangGraph, AutoGen, or Semantic Kernel depending on style
- agents: router, planner, specialist agents, human escalation path
- tools: search, retrieval, code execution, internal services
- platform: tracing, cost controls, quotas, audit logging
Good for:
- domain-routed assistants
- task decomposition across specialists
- applications where one general agent is too messy
Enterprise app integration pattern
Typical stack:
- serving layer: hosted or self-hosted model API
- orchestration layer: Semantic Kernel or similar middleware-heavy framework
- integrations: plugins, connectors, enterprise services, vector stores
- app/backend: existing .NET, Python, or Java service layer
- platform: identity, governance, deployment controls
Good for:
- enterprise applications
- existing service-oriented architectures
- teams that care as much about integration shape as agent behavior
Failure modes
Agent systems usually break in boring ways, not magical ones.
Common failure modes:
- tool contracts are vague or underspecified
- memory types get mixed together
- retries repeat unsafe side effects
- multi-agent routing adds cost and latency without real specialization
- handoffs lose context or leak too much context
- tool outputs are not validated before downstream use
- agent traces are too weak to debug what actually happened
- business state is hidden inside chat history instead of modeled explicitly
In practice, most production pain comes from control-flow ambiguity, state ambiguity, and side-effect ambiguity.
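The "retries repeat unsafe side effects" failure has a standard fix: idempotency keys. The sketch below derives a key from the step and its arguments and records results, so a retried step returns the recorded result instead of re-running the write. The key derivation and in-memory store are illustrative; production systems persist this.

```python
import hashlib
import json

# Illustrative in-memory record of executed side effects (persist in practice).
_executed = {}

def idempotency_key(step: str, args: dict) -> str:
    """Stable key from the step name and its arguments."""
    payload = json.dumps({"step": step, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_once(step: str, args: dict, effect):
    """Execute a side effect at most once per (step, args) pair."""
    key = idempotency_key(step, args)
    if key in _executed:
        return _executed[key]  # retry: return recorded result, no re-run
    result = effect(**args)
    _executed[key] = result
    return result
```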
What actually matters in framework selection
When comparing frameworks, the real engineering questions are:
- State model: implicit loop, explicit graph, or middleware-driven orchestration?
- Durability: can the workflow pause, resume, recover, and survive failures?
- Tooling model: typed tools, MCP tools, plugins, code execution, connectors?
- Guardrails: are checks first-class or bolted on later?
- Observability: can you trace runs, decisions, tool calls, and handoffs cleanly?
- Integration fit: does it match your existing backend architecture?
- Operational discipline: does it encourage structure, or let the system become agent spaghetti?
- Longevity: is the framework actively evolving in a direction that matches your needs?