AI Execution Tools
Overview
An execution tool is the runtime layer that loads model weights, allocates compute, executes inference, and returns outputs.
That runtime layer is only one part of the system.
It does not by itself define:
- where the workload runs
- how requests enter the system
- how concurrency is handled
- how auth, rate limits, retries, observability, and scaling are implemented
Those are deployment and platform concerns.
A useful separation is:
- Model artifact: the weights and related assets
- Runtime layer: the software that can actually run those artifacts
- Serving layer: the API surface, batching, queuing, concurrency model
- Platform layer: scheduling, scaling, networking, auth, storage, observability, CI/CD
Same runtime, different deployment:
- Ollama on a laptop
- Ollama on a dedicated edge box
- Ollama behind a reverse proxy on a cloud VM
The runtime is the same. The operational profile is not.
Artifacts are what you run. Runtime is what executes them. Serving is how requests reach them. Platform is everything needed to operate them reliably.
Mental model
Artifacts → Runtime → Serving → Platform → Hardware
1. Artifacts
The actual model files and related assets:
- LLM weights: GGUF, safetensors, etc.
- Image checkpoints: .ckpt, .safetensors
- Adapters: LoRA
- Auxiliary components: VAE, ControlNet, tokenizer files, config files
2. Runtime
The process that can load the model and execute inference.
Examples:
- llama.cpp
- Ollama
- ComfyUI execution engine
- vLLM worker/runtime
- TGI (Text Generation Inference) runtime
3. Serving
The layer that turns inference into a usable service.
Examples:
- local CLI
- local HTTP endpoint
- OpenAI-compatible API
- workflow API
- queue-backed async jobs
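To make "OpenAI-compatible API" concrete, here is a minimal client sketch. It assumes a server such as vLLM is listening on localhost:8000; the base URL and model name are placeholders to adjust for your deployment.

```python
# Minimal client for an OpenAI-compatible serving layer.
# Assumes a server (for example vLLM) is running on localhost:8000;
# base URL, model name, and auth are deployment-specific placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-model",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```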
4. Platform
The surrounding operational system:
- Docker / Compose / Kubernetes
- ingress / reverse proxy
- auth and API keys
- autoscaling
- metrics, logs, traces
- model distribution and caching
- persistent volumes / object storage
- CI/CD and rollout strategy
5. Hardware
The constraint that dominates latency, throughput, concurrency ceiling, and cost.
- CPU can be viable for small quantized LLMs
- GPU is usually mandatory for serious image generation
- Memory and VRAM often matter more than raw core count once models are large
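A back-of-envelope weight-memory estimate makes the VRAM point concrete: parameter count times bytes per parameter. This ignores KV cache, activations, and runtime overhead, all of which add real headroom requirements.

```python
# Rough weight-memory estimate: params * bytes-per-param.
# Excludes KV cache, activations, and runtime overhead.
def weight_gib(n_params_b: float, bits_per_param: float) -> float:
    return n_params_b * 1e9 * (bits_per_param / 8) / 2**30

print(f"7B fp16 : {weight_gib(7, 16):.1f} GiB")   # ~13.0 GiB
print(f"7B q4   : {weight_gib(7, 4):.1f} GiB")    # ~3.3 GiB
print(f"70B fp16: {weight_gib(70, 16):.1f} GiB")  # ~130.4 GiB
```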
Engineering reality
For infra work, the key question is usually not "Can this tool run the model?" but "What operating model does this tool force on the system?"
That means asking:
- Does it support multi-user concurrency well?
- Is batching built in?
- Is the API stable enough for production integration?
- How easy is horizontal scaling?
- How are models pulled, cached, and upgraded?
- What is the observability story?
- How painful is cold start?
- Can it be isolated per tenant or per workload?
- Does it fit request-response traffic, batch jobs, or interactive human use?
Product categories
1. User-facing apps
These optimize for humans operating the tool directly.
- ComfyUI
- Automatic1111
- InvokeAI
Common traits:
- UI-first
- state often lives inside app conventions
- good for experimentation and creative workflows
- less ideal as the primary multi-tenant production serving layer
2. Local runtimes and lightweight engines
These optimize for running models with minimal ceremony.
- Ollama
- llama.cpp
Common traits:
- easy local bootstrap
- useful for edge, private, or embedded scenarios
- weaker platform primitives than full serving stacks
3. Inference servers
These optimize for API-serving behavior.
- vLLM
- TGI
Common traits:
- designed for sustained request traffic
- better concurrency model
- stronger fit for app backends
- usually paired with external platform tooling
4. Managed cloud execution
These outsource infrastructure ownership.
- Replicate
- Modal
- commercial hosted model APIs
Common traits:
- low ops burden
- less control over underlying scheduling and cost shape
- faster path to delivery, weaker control over infra details
Quick comparison
| Name | Primary role | Operational sweet spot | Main interface | Concurrency posture |
|---|---|---|---|---|
| ComfyUI | Workflow app + execution backend | Image pipelines and reproducible graph workflows | Browser UI, API | Low to moderate, usually controlled externally |
| Automatic1111 | UI app | Interactive image experimentation | Browser UI | Low |
| InvokeAI | Structured image app | More organized image workflows | App or browser UI | Low to moderate |
| Ollama | Local runtime + local API | Private local inference, simple internal services | CLI, local HTTP API | Moderate at best, not its main advantage |
| llama.cpp | Low-level engine | Efficient local and edge inference | CLI, minimal server | Low to moderate depending on wrapper |
| vLLM | Inference server | High-throughput LLM serving | HTTP API | Strong |
| TGI | Inference server | Standardized HF-style LLM serving | HTTP API | Strong |
| Replicate | Managed execution API | Fast externalized inference | Cloud API | Managed for you |
| Modal | Serverless compute platform | Custom endpoints, jobs, GPU workloads | Python SDK, endpoints, jobs | Platform-managed |
Reference architectures
A. Local dev / private single-user setup
Typical stack:
- runtime: Ollama or llama.cpp
- frontend: Open WebUI or custom app
- storage: local disk
- ingress: none or localhost only
- ops: manual
Good for:
- experimentation
- offline/private use
- building app logic before production hardening
Bad for:
- reliable multi-user service
- predictable throughput
- clean tenant isolation
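As a concrete example of this setup, a minimal call to Ollama's local HTTP API looks like the sketch below. It assumes Ollama is running on its default port 11434 and that the model (here "llama3", an example name) has already been pulled.

```python
# Calling Ollama's local HTTP API; assumes Ollama is running on its
# default port 11434 and the named model has been pulled already.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```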
B. Internal team service on one VM
Typical stack:
- runtime/serving: Ollama, vLLM, or TGI
- reverse proxy: Nginx, Caddy, Traefik
- auth: basic API gateway or proxy auth
- observability: Prometheus + Grafana + logs
- deployment: Docker Compose or systemd-managed containers
Good for:
- internal copilots
- low-to-medium traffic tools
- fast delivery with some operational control
Failure mode:
- the VM becomes both runtime host and bottleneck
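A sketch of the auth layer in this architecture: a thin API-key gateway in front of the runtime on the same VM. This is illustrative only; a real deployment would add rate limits, streaming passthrough, and structured logging, and the upstream address and key store are placeholders.

```python
# Minimal API-key gateway in front of a local runtime. Sketch only:
# the upstream address and key set are placeholders, and a real
# deployment needs rate limits, streaming, and logging.
import httpx
from fastapi import FastAPI, Header, HTTPException, Request

UPSTREAM = "http://127.0.0.1:11434"   # assumed runtime address
VALID_KEYS = {"team-key-1"}           # placeholder; load from secrets

app = FastAPI()

@app.post("/api/{path:path}")
async def proxy(path: str, request: Request, x_api_key: str = Header(default="")):
    if x_api_key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    async with httpx.AsyncClient() as client:
        upstream = await client.post(
            f"{UPSTREAM}/api/{path}",
            content=await request.body(),
            timeout=120,
        )
    return upstream.json()
```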
C. Production backend service
Typical stack:
- serving: vLLM or TGI
- scheduler: Kubernetes or equivalent
- ingress: API gateway / ingress controller
- autoscaling: HPA/KEDA/custom metrics
- model storage: object storage + local cache volumes
- observability: metrics, logs, traces
- rollout: blue/green, canary, versioned model endpoints
- queueing: optional, for async or burst smoothing
Good for:
- app backends
- agent systems
- RAG APIs
- multi-user traffic
Hard parts:
- VRAM fragmentation
- model startup time
- request burst handling
- model version pinning
- per-model cost control
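Model startup time is the hard part most easily mishandled. One common mitigation, sketched below under the assumption that a Kubernetes readiness probe points at /healthz, is to report not-ready until weights are loaded so the orchestrator never routes traffic to a cold replica.

```python
# Cold-start gating sketch: /healthz returns 503 until the model is
# loaded, so an orchestrator's readiness probe holds traffic back.
import threading
from fastapi import FastAPI, Response

app = FastAPI()
model_ready = threading.Event()

def load_model() -> None:
    # ... load weights into VRAM here; this can take minutes ...
    model_ready.set()

threading.Thread(target=load_model, daemon=True).start()

@app.get("/healthz")
def healthz(response: Response):
    if not model_ready.is_set():
        response.status_code = 503
        return {"status": "loading"}
    return {"status": "ready"}
```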
D. Async job architecture for image generation
Typical stack:
- UI/API submits generation request
- queue stores job
- worker pulls job and runs ComfyUI or other image runtime
- artifacts stored in object storage
- status tracked in DB / cache
- results returned via polling or callback
Why this often wins:
- image generation is bursty and slow relative to standard HTTP expectations
- async jobs isolate user experience from GPU execution time
- retries and prioritization are easier
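A minimal skeleton of this architecture, assuming Redis as both queue and status store. The queue key and status-hash layout are illustrative conventions, not a fixed protocol.

```python
# Async-job skeleton with Redis as queue and status store.
# Key names and the status hash layout are illustrative only.
import json, uuid
import redis

r = redis.Redis()

def submit(prompt: str) -> str:
    job_id = str(uuid.uuid4())
    r.hset(f"job:{job_id}", mapping={"status": "queued"})
    r.lpush("image-jobs", json.dumps({"id": job_id, "prompt": prompt}))
    return job_id

def worker_loop() -> None:
    while True:
        _, raw = r.brpop("image-jobs")  # blocks until a job arrives
        job = json.loads(raw)
        r.hset(f"job:{job['id']}", "status", "running")
        # ... invoke ComfyUI or another image runtime here ...
        # ... upload the resulting artifact to object storage ...
        r.hset(f"job:{job['id']}", "status", "done")
```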
Typical stack
A more complete infra-oriented stack often looks like this:
- Model registry / artifact source: Hugging Face, Civitai, internal registry
- Runtime layer: Ollama, llama.cpp, ComfyUI engine, vLLM, TGI
- Serving layer: REST API, OpenAI-compatible API, workflow API, job queue
- Gateway: auth, rate limits, routing, tenancy, quotas
- Orchestration: Kubernetes, Nomad, Compose, batch workers
- Storage:
- local NVMe cache for hot models
- object storage for model and output artifacts
- relational DB / Redis for metadata, state, jobs
- Observability:
- latency histograms
- tokens/sec or images/minute
- GPU utilization and VRAM pressure
- queue depth
- cache hit/miss
- error classes and timeout rate
- Delivery: CI/CD, config management, secrets, rollout policy
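As a sketch of the observability bullet points, here is how two of those signals could be exposed with prometheus_client. Metric names are illustrative; align them with your own conventions.

```python
# Exposing latency and queue depth with prometheus_client.
# Metric names here are illustrative, not a standard.
from prometheus_client import Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "inference_request_seconds", "End-to-end inference latency"
)
QUEUE_DEPTH = Gauge("inference_queue_depth", "Jobs waiting in the queue")

start_http_server(9100)  # scrape target for Prometheus

with REQUEST_LATENCY.time():
    pass  # ... run one inference request here ...
QUEUE_DEPTH.set(3)  # update from your queue on a timer or per event
```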
Image model assets
What a checkpoint is
A checkpoint is the primary weight artifact for an image model.
Operationally, it is the main file the runtime loads into memory before inference.
Without it, the pipeline does not have a base model to execute.
Typical formats:
- .ckpt for older ecosystems
- .safetensors for safer modern packaging
Examples:
- Stable Diffusion 1.5
- SDXL
- community fine-tunes
- newer image model families where supported by the runtime and surrounding ecosystem
How it relates to other assets
- Checkpoint: base weight set
- LoRA: delta-style adaptation layered on top of the base model
- VAE: decoder component that affects reconstruction behavior
- ControlNet: structural conditioning module
- IP-Adapter: image-conditioned guidance component
Infra implication:
Loading one base checkpoint with optional adapters is not just a UX choice. It affects memory pressure, startup time, artifact distribution, and cache strategy.
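One concrete instance of this layering, sketched with the diffusers library: a base checkpoint loaded from a single file plus a LoRA adapter on top. The file paths are placeholders, and exact loader support varies by diffusers version and model family.

```python
# Base checkpoint + LoRA layering with diffusers. Paths are
# placeholders; loader support varies by version and model family.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "models/base-checkpoint.safetensors",  # base weight set
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("models/style-lora.safetensors")  # delta adapter

image = pipe("a watercolor lighthouse at dusk").images[0]
image.save("out.png")
```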
Tool profiles
ComfyUI
Category: Workflow app + execution backend
What it is
- A graph-based image runtime wrapped in a node-oriented workflow UI
- Useful both as an interactive tool and as a deterministic execution backend for image pipelines
Engineering strengths
- explicit pipeline structure
- reproducible multi-step workflows
- easy separation between graph design and execution
- good fit for queue-worker image systems
Operational concerns
- plugin/node sprawl can become an environment-management problem
- dependency drift is common in fast-moving image ecosystems
- not inherently a full production platform, usually needs external job control and storage patterns
Best fit
- internal image platform
- async image workers
- controlled workflow execution
Poor fit
- simplest end-user prompt box
- high-scale multi-tenant API without extra platform layers
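Using ComfyUI as an execution backend typically means submitting a workflow graph over HTTP rather than clicking through the UI. A sketch, assuming ComfyUI on its default port 8188 and a workflow exported in API-format JSON:

```python
# Submitting a workflow graph to ComfyUI's HTTP API. Assumes the
# default port 8188 and a workflow exported in API format.
import json
import requests

with open("workflow_api.json") as f:
    graph = json.load(f)

resp = requests.post("http://localhost:8188/prompt", json={"prompt": graph})
resp.raise_for_status()
print(resp.json())  # includes a prompt_id you can poll for status
```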
AUTOMATIC1111
Category: UI app
What it is
- A classic web UI centered on interactive image generation
Engineering strengths
- huge ecosystem familiarity
- fast interactive experimentation
- broad community documentation and extensions
Operational concerns
- extension compatibility becomes operational debt
- less clean for deterministic pipeline management
- weaker fit for disciplined service architecture
Best fit
- workstation usage
- lab environment
- exploratory prompt iteration
Poor fit
- reproducible production workflows
- service-oriented backend architecture
InvokeAI
Category: UI app
What it is
- A more structured image-generation application with a more product-like operating model than A1111
Engineering strengths
- cleaner workflow organization
- more disciplined user experience
- easier fit for teams that want less chaos than extension-heavy setups
Operational concerns
- less flexible than graph-native approaches for complex orchestration
- still not the primary answer for large-scale serving
Best fit
- internal creative tooling
- organized team sandboxing
Poor fit
- maximal graph control
- large multi-tenant image backend
Ollama
Category: Local runtime + local API
What it is
- A local-first model runtime with a very low-friction packaging and API experience
- Good at turning model execution into a simple local service quickly
Engineering strengths
- fast bootstrap
- consistent local API surface
- easy model pulls and simple developer experience
- strong fit for internal tools, demos, private assistants, and edge boxes
Operational concerns
- not the strongest choice for high-concurrency serving
- abstraction is convenient, but it hides low-level control you may want in tuned serving setups
- lifecycle, scaling, and isolation patterns usually need surrounding infrastructure
Best fit
- developer workstations
- edge deployment
- private internal service with modest traffic
Poor fit
- throughput-optimized central serving tier
- highly tuned inference platform
llama.cpp
Category: Low-level engine
What it is
- A compact inference engine that prioritizes efficient local execution, especially for quantized models
Engineering strengths
- small operational footprint
- excellent for CPU-centric and edge scenarios
- flexible building block for custom wrappers and embedded products
Operational concerns
- by itself, it is more engine than platform
- production features usually come from what you wrap around it, not from the engine alone
- model format and build-target choices matter a lot
Best fit
- embedded systems
- local agents
- custom products that need tight control over local inference
Poor fit
- large shared serving layer out of the box
- teams wanting a polished turnkey platform
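The "building block" role is easiest to see from the Python bindings (llama-cpp-python), which embed the engine directly in a host process. The GGUF path and generation settings below are placeholders.

```python
# Embedding llama.cpp in-process via the llama-cpp-python bindings.
# Model path and generation settings are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="models/model-q4.gguf", n_ctx=4096)
out = llm("Q: What is a VAE? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```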
vLLM
Category: Inference server
What it is
- A serving system built around high-throughput LLM inference
- Strong fit for request-heavy backend workloads
Engineering strengths
- efficient batching and scheduling behavior
- good concurrency posture
- common choice for OpenAI-compatible serving internally
- strong fit for agent, chat, and RAG backends
Operational concerns
- infra complexity is still yours: ingress, scaling, rollout, secrets, observability
- large models shift the bottleneck to VRAM, startup, and placement strategy
- serving efficiency does not remove the need for traffic shaping and admission control
Best fit
- central inference backend
- multi-user internal or external APIs
- performance-sensitive production serving
Poor fit
- simplest local-first experimentation
- teams that do not want to own serving infrastructure
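Besides the HTTP server, vLLM exposes an offline batch API that shows its throughput orientation directly. A sketch; the model name and sampling values are placeholders.

```python
# vLLM's offline batch API, useful for throughput-oriented work
# outside the HTTP server. Model and sampling values are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize what a serving layer does."], params)
print(outputs[0].outputs[0].text)
```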
TGI
Category: Inference server
What it is
- Hugging Face’s production-oriented LLM serving stack, centered on standard API deployment and integration patterns
Engineering strengths
- familiar fit for HF-oriented workflows
- solid containerized deployment story
- good service boundary for teams standardizing around Hugging Face ecosystem conventions
Operational concerns
- same platform burden as any serious serving stack: networking, rollout, metrics, quotas, scaling
- the runtime is only one layer, not the whole service architecture
Best fit
- production model endpoints
- teams already invested in HF model workflows
- standardized containerized serving
Poor fit
- minimal local tool usage
- zero-ops expectations
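A sketch of calling TGI's native generate endpoint. The port reflects a common containerized deployment mapping and is an assumption; recent TGI versions also expose OpenAI-compatible routes.

```python
# Calling TGI's native /generate endpoint. Port and parameters are
# assumptions about a typical containerized deployment.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is speculative decoding?",
        "parameters": {"max_new_tokens": 128},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```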
Replicate
Category: Managed execution API
What it is
- A hosted execution layer exposed through an API
- You trade infrastructure control for faster delivery
Engineering strengths
- no serving stack to own
- low friction for trying models and exposing capabilities quickly
- useful for externalizing GPU operations during prototype or early product phases
Operational concerns
- limited control over deep runtime behavior
- cost shape is externalized and can become painful at scale
- data locality, compliance, and latency topology may become blockers
Best fit
- prototypes
- low-ops product experiments
- external model access without infra investment
Poor fit
- cost-sensitive high-volume serving
- regulated or strongly self-hosted environments
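The low-friction tradeoff is visible in the client itself: one call replaces the entire serving stack. This sketch assumes REPLICATE_API_TOKEN is set in the environment; the model slug and version hash are placeholders, not a recommendation.

```python
# Running a hosted model through the Replicate Python client.
# Requires REPLICATE_API_TOKEN; the model slug/version below is a
# placeholder.
import replicate

output = replicate.run(
    "owner/some-image-model:versionhash",
    input={"prompt": "a watercolor lighthouse at dusk"},
)
print(output)
```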
Modal
Category: Serverless compute platform
What it is
- A platform for running Python-centric compute workloads, including inference services, jobs, scheduled tasks, and GPU-backed execution
Engineering strengths
- low friction from code to deployed compute
- good for custom endpoints and batch systems
- strong fit for small teams that want serious execution without owning the full substrate
Operational concerns
- platform constraints are still someone else’s constraints
- less direct control over the underlying infra envelope than self-hosting
- may not match teams that need bespoke networking, residency, or infra policy models
Best fit
- lean infra teams
- batch AI systems
- custom APIs without building a platform team first
Poor fit
- organizations that need full infra ownership
- workloads requiring deep platform-level customization
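A minimal Modal sketch of the operating model: ordinary Python annotated for remote, GPU-backed execution. The app name, GPU type, and function body are assumptions; check Modal's current API for exact options.

```python
# Minimal Modal function sketch: Python code annotated for remote
# GPU-backed execution. GPU type and names are assumptions.
import modal

app = modal.App("inference-sketch")

@app.function(gpu="A10G")
def generate(prompt: str) -> str:
    # ... load a model and run inference on the remote GPU ...
    return f"echo: {prompt}"

@app.local_entrypoint()
def main():
    print(generate.remote("hello"))
```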
Decision guide
- Want a local-first private runtime: Ollama
- Want an efficient engine for embedded or edge use: llama.cpp
- Want a high-throughput shared LLM serving tier: vLLM
- Want a HF-oriented production serving stack: TGI
- Want an async image worker backend: ComfyUI plus queue/storage/platform layers
- Want a human-facing image sandbox: Automatic1111 or InvokeAI
- Want fast externalized execution with low ops: Replicate or Modal
Adjacent layers and out of scope
This document is about the runtime, serving, and platform layers for model execution.
It does not try to cover higher-level orchestration frameworks in depth, for example:
- agent SDKs
- workflow orchestration frameworks
- tool-calling abstractions
- memory and session layers
- multi-agent coordination frameworks
Those sit above the serving layer and usually consume model APIs rather than executing model weights directly.
Examples include agent/application-layer frameworks such as OpenAI Agents SDK, LangGraph, AutoGen, and Semantic Kernel.
What actually matters in infra selection
When comparing tools, the decision usually comes down to these engineering questions:
- Concurrency model: interactive single-user, modest shared service, or serious multi-user traffic?
- Latency target: best-effort, human-tolerable, or strict API SLO?
- Execution pattern: request-response, streaming, async jobs, or batch?
- Artifact handling: how are models stored, distributed, pinned, warmed, and rolled back?
- Isolation: shared runtime, per-model runtime, per-tenant runtime?
- Observability: can you measure queue depth, startup time, throughput, VRAM pressure, and failure classes?
- Operations burden: do you want to own the stack, or buy the operating model from someone else?
- Cost shape: steady baseline traffic, bursty workloads, or spiky GPU-heavy jobs?
That is usually the real decision surface. Not the marketing category of the tool.