DevOps Interview Prep
Observability Stack Summary
Data types and core needs
- Metrics: Numeric measurements sampled over time - think CPU usage percentage, requests per second, or error rates - that let you track trends and set alerts.
- Logs: Time-stamped, structured or free-form text records of events or actions - like web server access lines or application error stacks - that help you diagnose what happened.
- Traces: Chains of timed spans representing a single transaction’s path through multiple services - e.g. HTTP request → auth service → database call, with per-span latency - so you can pinpoint where delays occur.
Instrumentation & collection: OpenTelemetry
- OpenTelemetry SDKs: Embed in your code (Java, Python, Go, JavaScript etc.) to emit traces, custom metrics and enriched logs in a standard format.
- OpenTelemetry Collector: Acts as a unified agent or gateway: receives, batches and optionally transforms all three telemetry types. It can replace lightweight collectors (Promtail, Fluentd) or push gateways by exporting via Prometheus remote-write, the Loki HTTP API, Jaeger/Tempo gRPC, Kafka, and more.
- Does not replace: Storage backends (Prometheus, Loki, Elasticsearch, Jaeger, Tempo) or UIs (Grafana, Kibana) - those remain your systems of record and visualization.
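A Collector configuration along these lines wires one pipeline per signal type. This is a sketch only: endpoint addresses are placeholders, and exact exporter component names vary by Collector distribution and version (e.g. the Loki exporter lives in the contrib build):

```yaml
receivers:
  otlp:                      # accept OTLP from SDKs over gRPC and HTTP
    protocols:
      grpc:
      http:

processors:
  batch: {}                  # batch telemetry before export

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write   # placeholder address
  loki:
    endpoint: http://loki:3100/loki/api/v1/push     # placeholder address
  otlp/tempo:
    endpoint: tempo:4317                            # placeholder address

service:
  pipelines:
    metrics: {receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite]}
    logs:    {receivers: [otlp], processors: [batch], exporters: [loki]}
    traces:  {receivers: [otlp], processors: [batch], exporters: [otlp/tempo]}
```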
Metrics: Prometheus and Grafana
- Prometheus: Scrapes /metrics endpoints or a Pushgateway, stores time series on disk, and runs PromQL queries.
- Grafana: Builds dashboards and alert rules on top of Prometheus (and other data sources).
- Alternatives: VictoriaMetrics, Thanos or Cortex for long-term or highly available metric storage; Graphite + Grafana; InfluxDB + Chronograf.
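The scrape model is configured in prometheus.yml; a minimal sketch (job names and target addresses are placeholders):

```yaml
global:
  scrape_interval: 15s              # how often each target is sampled

scrape_configs:
  - job_name: myapp                 # placeholder job name
    static_configs:
      - targets: ['myapp:8080']     # app exposing /metrics
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']   # host-level metrics
```

In Kubernetes you would typically replace `static_configs` with service discovery (`kubernetes_sd_configs`) rather than listing targets by hand.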
Logs: Promtail, Fluentd, Logstash, Loki and OT Collector
- Promtail: Tails log files or accepts syslog, attaches labels (e.g. Kubernetes metadata) and ships straight to Loki.
- Fluentd / Fluent Bit: General-purpose log/event forwarders with plugin ecosystems; send to Loki, Elasticsearch, Kafka etc.
- Logstash: Heavyweight ETL pipeline: ingest, parse (grok), enrich and route logs/events to Elasticsearch, Kafka and more.
- Grafana Loki: “Prometheus for logs” - indexes only labels to keep storage costs low; integrates natively with Grafana.
- OT (OpenTelemetry) Collector: Can ingest logs (alongside metrics and traces) and forward to any compatible backend.
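All of these forwarders have an easier job when the application emits structured logs. A sketch of one-JSON-object-per-line logging using only the standard library (the `app` field and `JsonFormatter` name are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line - trivially parseable
    by Promtail, Fluentd or Logstash without grok patterns."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": record.created,                 # epoch seconds
            "level": record.levelname.lower(),
            "app": "myapp",                       # illustrative static label
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("user login ok")  # emits one parseable JSON line
```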
Traces and distributed telemetry
- OpenTelemetry SDK & Collector: Instrumentation and transport for spans in a vendor-neutral format.
- Jaeger or Tempo: Stores and visualizes traces (flame graphs, span details).
- Alternatives: Zipkin; commercial APMs (Datadog, New Relic, Lightstep).
Streaming and buffering: Kafka (+ Zookeeper / KRaft)
- Kafka: High-throughput event bus for buffering and fan-out of telemetry.
- Zookeeper or KRaft: Manages broker metadata and consumer groups (KRaft can replace Zookeeper in newer Kafka versions).
- Alternatives: RabbitMQ, NATS, Pulsar; managed services like Amazon Kinesis or Google Pub/Sub.
Use-Case Recipes
Basic Metrics Monitoring
Goal: Track service health and fire alerts.
- (Optional) Add OpenTelemetry SDK counters/gauges and expose /metrics.
- Prometheus scrapes app and node exporters.
- Grafana dashboards use PromQL (e.g. rate(http_requests_total[5m])).
- Alert on thresholds via Grafana notification channels.
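In a real app you would expose /metrics via a client library such as prometheus_client, but the exposition format itself is just plain text. A dependency-free sketch of rendering a counter by hand (metric name, labels and values are illustrative):

```python
# Render a counter in the Prometheus text exposition format by hand.
# Real applications should use an official client library instead.
counters = {("get", "200"): 1027, ("post", "500"): 3}  # illustrative data

def render_metrics() -> str:
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(counters.items()):
        lines.append(f'http_requests_total{{method="{method}",code="{code}"}} {value}')
    return "\n".join(lines) + "\n"

print(render_metrics())
```

This is exactly the text Prometheus expects when it scrapes the endpoint, which is why rate(http_requests_total[5m]) works against any conforming app.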
Gotchas: Ensure service discovery for scrape targets and secure /metrics endpoints.
Lightweight Logs + Metrics Correlation
Goal: Link simple logs with metric spikes at low cost.
- Promtail DaemonSet tails /var/log/containers/*.log, labels by namespace/pod.
- Promtail pushes to Loki (/loki/api/v1/push).
- Prometheus scrapes metrics as above.
- Grafana mixed panels show metric time series alongside log streams ({app="myapp", level="error"}).
Gotchas: Plan Loki labels well; Promtail needs file-read permissions.
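A minimal Promtail configuration along these lines (the Loki address and label names are placeholders) - note the small, bounded label set, since every unique label combination creates a separate Loki stream:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push   # placeholder Loki address

scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: containerlogs
          __path__: /var/log/containers/*.log  # Promtail needs read access here
    # Keep labels low-cardinality (namespace, app, level) -
    # never per-request or per-user IDs.
```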
Full-Stack Observability for Microservices
Goal: End-to-end traces, metrics and logs.
- Instrument services with OpenTelemetry SDK (or auto-instrument).
- Deploy Collector with pipelines for OTLP → Prometheus, Loki and Jaeger exporters.
- (Optional) Buffer bursts via Collector → Kafka → downstream exporters.
- Prometheus scrapes Collector; Loki and Jaeger receive logs and spans.
- Grafana dashboards combine metrics, log panels and trace timelines.
Gotchas: Tune Collector batch sizes/limits and choose an appropriate trace sampling rate.
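On the sampling-rate gotcha: the simplest scheme is head-based probabilistic sampling, which keeps a fixed fraction of traces by hashing the trace ID so that every service reaches the same keep/drop decision for a given trace. A sketch (the 10% rate and function names are illustrative):

```python
import hashlib

SAMPLE_RATE = 0.10  # keep ~10% of traces; illustrative value

def keep_trace(trace_id: str) -> bool:
    """Deterministic decision derived from the trace ID alone: every
    service that sees this ID reaches the same verdict, so sampled
    traces stay complete end to end."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < SAMPLE_RATE

kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 traces")  # roughly 1000
```

Tail-based sampling (deciding after the trace completes, e.g. keep all errors) is more powerful but requires buffering whole traces in the Collector.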
High-Volume IoT Telemetry
Goal: Ingest millions of device metrics/logs reliably.
- Edge Collector receives OTLP (OpenTelemetry Protocol) or HTTP from devices, adds device metadata.
- Collector exports to Kafka topic iot-telemetry.
- Consumers read from Kafka:
- Prometheus remote-write (or Thanos) for metrics
- Fluentd for logs → Loki or Elasticsearch
- Collector for traces → Tempo
- Grafana device health maps, global trends, error heatmaps and trace explorer.
Gotchas: Partition Kafka by region; tune retention and compaction.
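Partitioning by region boils down to a deterministic key hash, so all telemetry from one region lands on the same partition and stays ordered. A dependency-free sketch of the idea (partition count is illustrative; Kafka's default partitioner actually uses murmur2 on the message key, not MD5):

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative partition count for the iot-telemetry topic

def partition_for(region: str) -> int:
    """Stable mapping: the same region always hashes to the same partition,
    mirroring what a keyed Kafka producer does with the message key."""
    digest = hashlib.md5(region.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

for region in ["eu-west", "us-east", "eu-west"]:
    print(region, "->", partition_for(region))  # eu-west maps consistently
```

In practice you get this for free by setting the region as the Kafka message key; the sketch only shows why that guarantees per-region ordering.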
Heavy Log-Processing Pipeline
Goal: Multi-stage parsing, enrichment and routing to multiple sinks.
- Logstash ingests via the Beats input and applies grok, geoip and mutate filters.
- Output to Elasticsearch for full-text search and to Kafka for real-time consumers.
- Optionally use Fluentd/Logstash to consume from Kafka → normalize → push to Loki.
- Kibana for ad-hoc searches; Grafana + Loki panels for dashboards and log-based alerts.
Gotchas: Allocate sufficient JVM heap for Logstash; manage Elasticsearch index lifecycles.
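A Logstash pipeline along these lines sketches the ingest → parse → enrich → route flow (hosts, topic name and the grok pattern choice are placeholders):

```
input {
  beats { port => 5044 }                  # receive from Filebeat/Beats agents
}
filter {
  grok {                                  # parse Apache-style access lines
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  geoip { source => "clientip" }          # enrich with client geolocation
}
output {
  elasticsearch { hosts => ["http://elasticsearch:9200"] }  # full-text search
  kafka { topic_id => "logs" }            # fan out to real-time consumers
}
```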
Key overlaps and alternatives
- Logstash vs Fluentd vs OT Collector for log ingestion and transformation
- Prometheus Pushgateway vs OT Collector for pushed metrics
- Zookeeper vs KRaft as Kafka coordinator
- Loki vs Elasticsearch for log indexing and querying
Bottom line: Standardize on OpenTelemetry for unified instrumentation and collection, then choose storage backends - Prometheus for metrics, Loki or Elasticsearch for logs, Jaeger/Tempo for traces - and visualize and alert on everything in Grafana (and Kibana where needed).