DevOps Interview Prep
Observability Stack Summary
Data types and core needs
- Metrics: Numeric measurements sampled over time - think CPU usage percentage, requests per second, or error rates - that let you track trends and set alerts.
- Logs: Time-stamped, structured or free-form text records of events or actions - like web server access lines or application error stacks - that help you diagnose what happened.
- Traces: Chains of timed spans representing a single transaction’s path through multiple services - e.g. HTTP request → auth service → database call, with per-span latency - so you can pinpoint where delays occur.
Instrumentation & collection: OpenTelemetry
- OpenTelemetry SDKs: Embed in your code (Java, Python, Go, JavaScript etc.) to emit traces, custom metrics and enriched logs in a standard format.
- OpenTelemetry Collector: Acts as a unified agent or gateway: receives, batches and optionally transforms all three telemetry types. It can replace lightweight collectors (Promtail, Fluentd) or push gateways by exporting via Prometheus remote-write, the Loki HTTP API, Jaeger/Tempo gRPC, Kafka, and more.
- Does not replace: Storage backends (Prometheus, Loki, Elasticsearch, Jaeger, Tempo) or UIs (Grafana, Kibana) - those remain your systems of record and visualization.
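A Collector configuration along these lines wires one pipeline per signal type. This is a sketch only: endpoint addresses are placeholders, and exact exporter component names vary by Collector distribution and version (e.g. the Loki exporter lives in the contrib build):

```yaml
receivers:
  otlp:                      # accept OTLP from SDKs over gRPC and HTTP
    protocols:
      grpc:
      http:

processors:
  batch: {}                  # batch telemetry before export

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write   # placeholder address
  loki:
    endpoint: http://loki:3100/loki/api/v1/push     # placeholder address
  otlp/tempo:
    endpoint: tempo:4317                            # placeholder address

service:
  pipelines:
    metrics: {receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite]}
    logs:    {receivers: [otlp], processors: [batch], exporters: [loki]}
    traces:  {receivers: [otlp], processors: [batch], exporters: [otlp/tempo]}
```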
Metrics: Prometheus and Grafana
- Prometheus: Scrapes /metrics endpoints or a Pushgateway, stores time series on disk, and runs PromQL queries.
- Grafana: Builds dashboards and alert rules on top of Prometheus (and other data sources).
- Alternatives: VictoriaMetrics, Thanos or Cortex for long-term or highly available metric storage; Graphite + Grafana; InfluxDB + Chronograf.
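The scrape model is configured in prometheus.yml; a minimal sketch (job names and target addresses are placeholders):

```yaml
global:
  scrape_interval: 15s              # how often each target is sampled

scrape_configs:
  - job_name: myapp                 # placeholder job name
    static_configs:
      - targets: ['myapp:8080']     # app exposing /metrics
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']   # host-level metrics
```

In Kubernetes you would typically replace `static_configs` with service discovery (`kubernetes_sd_configs`) rather than listing targets by hand.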
Logs: Promtail, Fluentd, Logstash, Loki and OT Collector
- Promtail: Tails log files or accepts syslog, attaches labels (e.g. Kubernetes metadata) and ships straight to Loki.
- Fluentd / Fluent Bit: General-purpose log/event forwarders with plugin ecosystems; send to Loki, Elasticsearch, Kafka etc.
- Logstash: Heavyweight ETL pipeline: ingest, parse (grok), enrich and route logs/events to Elasticsearch, Kafka and more.
- Grafana Loki: “Prometheus for logs” - indexes only labels to keep storage costs low; integrates natively with Grafana.
- OT (OpenTelemetry) Collector: Can ingest logs (alongside metrics and traces) and forward to any compatible backend.
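All of these forwarders have an easier job when the application emits structured logs. A sketch of one-JSON-object-per-line logging using only the standard library (the `app` field and `JsonFormatter` name are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line - trivially parseable
    by Promtail, Fluentd or Logstash without grok patterns."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": record.created,                 # epoch seconds
            "level": record.levelname.lower(),
            "app": "myapp",                       # illustrative static label
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("user login ok")  # emits one parseable JSON line
```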
Traces and distributed telemetry
- OpenTelemetry SDK & Collector: Instrumentation and transport for spans in a vendor-neutral format.
- Jaeger or Tempo: Stores and visualizes traces (flame graphs, span details).
- Alternatives: Zipkin; commercial APMs (Datadog, New Relic, Lightstep).
Streaming and buffering: Kafka (+ Zookeeper / KRaft)
- Kafka: High-throughput event bus for buffering and fan-out of telemetry.
- Zookeeper or KRaft: Manages broker metadata and consumer groups (KRaft can replace Zookeeper in newer Kafka versions).
- Alternatives: RabbitMQ, NATS, Pulsar; managed services like Amazon Kinesis or Google Pub/Sub.
Use-Case Recipes
Basic Metrics Monitoring
Goal: Track service health and fire alerts.
- (Optional) Add OpenTelemetry SDK counters/gauges and expose /metrics.
- Prometheus scrapes app and node exporters.
- Grafana dashboards use PromQL (e.g. rate(http_requests_total[5m])).
- Alert on thresholds via Grafana notification channels.
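In a real app you would expose /metrics via a client library such as prometheus_client, but the exposition format itself is just plain text. A dependency-free sketch of rendering a counter by hand (metric name, labels and values are illustrative):

```python
# Render a counter in the Prometheus text exposition format by hand.
# Real applications should use an official client library instead.
counters = {("get", "200"): 1027, ("post", "500"): 3}  # illustrative data

def render_metrics() -> str:
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(counters.items()):
        lines.append(f'http_requests_total{{method="{method}",code="{code}"}} {value}')
    return "\n".join(lines) + "\n"

print(render_metrics())
```

This is exactly the text Prometheus expects when it scrapes the endpoint, which is why rate(http_requests_total[5m]) works against any conforming app.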
Gotchas: Ensure service discovery for scrape targets and secure /metrics endpoints.
Lightweight Logs + Metrics Correlation
Goal: Link simple logs with metric spikes at low cost.
- Promtail DaemonSet tails /var/log/containers/*.log, labels by namespace/pod.
- Promtail pushes to Loki (/loki/api/v1/push).
- Prometheus scrapes metrics as above.
- Grafana mixed panels show metric time series alongside log streams ({app="myapp", level="error"}).
Gotchas: Plan Loki labels well; Promtail needs file-read permissions.
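A minimal Promtail configuration along these lines (the Loki address and label names are placeholders) - note the small, bounded label set, since every unique label combination creates a separate Loki stream:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push   # placeholder Loki address

scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: containerlogs
          __path__: /var/log/containers/*.log  # Promtail needs read access here
    # Keep labels low-cardinality (namespace, app, level) -
    # never per-request or per-user IDs.
```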
Full-Stack Observability for Microservices
Goal: End-to-end traces, metrics and logs.
- Instrument services with OpenTelemetry SDK (or auto-instrument).
- Deploy Collector with pipelines for OTLP → Prometheus, Loki and Jaeger exporters.
- (Optional) Buffer bursts via Collector → Kafka → downstream exporters.
- Prometheus scrapes Collector; Loki and Jaeger receive logs and spans.
- Grafana dashboards combine metrics, log panels and trace timelines.
Gotchas: Tune Collector batch sizes/limits and choose an appropriate trace sampling rate.
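On the sampling-rate gotcha: the simplest scheme is head-based probabilistic sampling, which keeps a fixed fraction of traces by hashing the trace ID so that every service reaches the same keep/drop decision for a given trace. A sketch (the 10% rate and function names are illustrative):

```python
import hashlib

SAMPLE_RATE = 0.10  # keep ~10% of traces; illustrative value

def keep_trace(trace_id: str) -> bool:
    """Deterministic decision derived from the trace ID alone: every
    service that sees this ID reaches the same verdict, so sampled
    traces stay complete end to end."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < SAMPLE_RATE

kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 traces")  # roughly 1000
```

Tail-based sampling (deciding after the trace completes, e.g. keep all errors) is more powerful but requires buffering whole traces in the Collector.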
High-Volume IoT Telemetry
Goal: Ingest millions of device metrics/logs reliably.
- Edge Collector receives OTLP (OpenTelemetry Protocol) or HTTP from devices, adds device metadata.
- Collector exports to Kafka topic iot-telemetry.
- Consumers read from Kafka:
- Prometheus remote-write (or Thanos) for metrics
- Fluentd for logs → Loki or Elasticsearch
- Collector for traces → Tempo
- Grafana device health maps, global trends, error heatmaps and trace explorer.
Gotchas: Partition Kafka by region; tune retention and compaction.
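Partitioning by region boils down to a deterministic key hash, so all telemetry from one region lands on the same partition and stays ordered. A dependency-free sketch of the idea (partition count is illustrative; Kafka's default partitioner actually uses murmur2 on the message key, not MD5):

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative partition count for the iot-telemetry topic

def partition_for(region: str) -> int:
    """Stable mapping: the same region always hashes to the same partition,
    mirroring what a keyed Kafka producer does with the message key."""
    digest = hashlib.md5(region.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

for region in ["eu-west", "us-east", "eu-west"]:
    print(region, "->", partition_for(region))  # eu-west maps consistently
```

In practice you get this for free by setting the region as the Kafka message key; the sketch only shows why that guarantees per-region ordering.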
Heavy Log-Processing Pipeline
Goal: Multi-stage parsing, enrichment and routing to multiple sinks.
- Logstash ingests via the Beats input and applies grok, geoip and mutate filters.
- Output to Elasticsearch for full-text search and to Kafka for real-time consumers.
- Optionally use Fluentd/Logstash to consume from Kafka → normalize → push to Loki.
- Kibana for ad-hoc searches; Grafana + Loki panels for dashboards and log-based alerts.
Gotchas: Allocate sufficient JVM heap for Logstash; manage Elasticsearch index lifecycles.
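A Logstash pipeline along these lines sketches the ingest → parse → enrich → route flow (hosts, topic name and the grok pattern choice are placeholders):

```
input {
  beats { port => 5044 }                  # receive from Filebeat/Beats agents
}
filter {
  grok {                                  # parse Apache-style access lines
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  geoip { source => "clientip" }          # enrich with client geolocation
}
output {
  elasticsearch { hosts => ["http://elasticsearch:9200"] }  # full-text search
  kafka { topic_id => "logs" }            # fan out to real-time consumers
}
```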
Key overlaps and alternatives
- Logstash vs Fluentd vs OT Collector for log ingestion and transformation
- Prometheus Pushgateway vs OT Collector for pushed metrics
- Zookeeper vs KRaft as Kafka coordinator
- Loki vs Elasticsearch for log indexing and querying
Bottom line: Standardize on OpenTelemetry for unified instrumentation and collection, then choose storage backends - Prometheus for metrics, Loki or Elasticsearch for logs, Jaeger/Tempo for traces - and visualize and alert on everything in Grafana (and Kibana where needed).