DevBrain

DevOps Interview Prep

Observability Stack Summary

Data types and core needs

  1. Metrics: Numeric measurements sampled over time - think CPU usage percentage, requests per second, or error rates - that let you track trends and set alerts.
  2. Logs: Time-stamped, structured or free-form text records of events or actions - like web server access lines or application error stacks - that help you diagnose what happened.
  3. Traces: Chains of timed spans representing a single transaction’s path through multiple services - e.g. HTTP request → auth service → database call, with per-span latency - so you can pinpoint where delays occur.
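The three data types can be illustrated with a small stdlib-only sketch. All class and field names below are made up for illustration - they are not from any real SDK - but they capture the shape of each signal: a metric is a timestamped number, a log is a structured event record, and a trace is a set of timed spans sharing one trace ID.

```python
import json
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class MetricSample:            # a numeric measurement at a point in time
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)

def log_event(level: str, message: str, **attrs) -> str:
    """Emit a structured (JSON) log line."""
    record = {"ts": time.time(), "level": level, "msg": message, **attrs}
    return json.dumps(record)

@dataclass
class Span:                    # one timed unit of work within a trace
    name: str
    trace_id: str              # shared by every span in the same transaction
    start: float = field(default_factory=time.perf_counter)
    duration: float = 0.0

    def end(self) -> None:
        self.duration = time.perf_counter() - self.start

# One request produces a metric sample, a log line, and two spans in one trace:
trace_id = uuid.uuid4().hex
parent = Span("http_request", trace_id)
child = Span("db_query", trace_id)
child.end()
parent.end()

cpu = MetricSample("cpu_usage_percent", 42.5)
line = log_event("info", "request served", trace_id=trace_id, status=200)
```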

Instrumentation & collection: OpenTelemetry

Metrics: Prometheus and Grafana

Logs: Promtail, Fluentd, Logstash, Loki and OTel Collector

Traces and distributed telemetry

Streaming and buffering: Kafka (+ Zookeeper / KRaft)

Use-Case Recipes

Basic Metrics Monitoring

Goal: Track service health and fire alerts.
Gotchas: Ensure service discovery for scrape targets and secure /metrics endpoints.
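For a feel of what Prometheus actually scrapes from a /metrics endpoint, here is a minimal sketch that renders a counter in the Prometheus text exposition format (HELP/TYPE lines followed by labelled samples). The metric name and labels are illustrative.

```python
def render_metrics(requests_total: dict) -> str:
    """Render a counter with per-status labels in Prometheus text format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for status, count in sorted(requests_total.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"

body = render_metrics({"200": 1024, "500": 3})
```

In practice you would expose this via a client library rather than hand-rolling it, but seeing the wire format makes it clear why the endpoint should be access-controlled: it leaks operational detail.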

Lightweight Logs + Metrics Correlation

Goal: Link simple logs with metric spikes at low cost.
Gotchas: Plan Loki labels well; Promtail needs file-read permissions.
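The Loki label gotcha comes from the fact that every unique label set creates a separate stream, so labels must stay low-cardinality. A sketch of the split - keep app/env/level as labels, push high-cardinality values like user IDs into the log line itself (key names here are illustrative):

```python
import json

LOW_CARDINALITY = {"app", "env", "level"}   # safe as Loki stream labels

def to_loki_entry(record: dict) -> tuple:
    """Split a log record into stream labels and the log line itself."""
    labels = {k: v for k, v in record.items() if k in LOW_CARDINALITY}
    line = json.dumps({k: v for k, v in record.items() if k not in LOW_CARDINALITY})
    return labels, line

labels, line = to_loki_entry(
    {"app": "checkout", "env": "prod", "level": "error",
     "user_id": "u-8812", "msg": "payment failed"}
)
```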

Full-Stack Observability for Microservices

Goal: End-to-end traces, metrics and logs.
Gotchas: Tune Collector batch sizes/limits and choose an appropriate trace sampling rate.
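A common way to choose a trace sampling rate is head-based, trace-ID-ratio sampling (the idea behind OpenTelemetry's TraceIdRatioBased sampler): derive the keep/drop decision from the trace ID itself, so every service independently makes the same choice for a given trace. The implementation details below are a simplified sketch, not the real sampler's algorithm:

```python
def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deterministically per trace ID."""
    # Interpret the first 16 hex chars (64 bits) of the trace ID as an integer
    # and sample it if it falls below ratio * 2^64.
    bound = int(ratio * (1 << 64))
    return int(trace_id[:16], 16) < bound
```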

High-Volume IoT Telemetry

Goal: Ingest millions of device metrics/logs reliably.
Gotchas: Partition Kafka by region; tune retention and compaction.
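Partitioning by region works because Kafka's default partitioner hashes the record key, so all messages with the same key land on the same partition and stay ordered relative to each other. A dependency-free sketch of the idea (Kafka actually uses murmur2; SHA-1 and the partition count here are illustrative):

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(region: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a region key to a stable partition index."""
    digest = hashlib.sha1(region.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("eu-west-1")
p2 = partition_for("eu-west-1")
# Same region key -> same partition, so per-region ordering is preserved.
```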

Heavy Log-Processing Pipeline

Goal: Multi-stage parsing, enrichment and routing to multiple sinks.
Gotchas: Allocate sufficient JVM heap for Logstash; manage Elasticsearch index lifecycles.
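The pipeline shape - parse, enrich, route to multiple sinks - is the same pattern a Logstash filter chain implements. A stdlib sketch with an in-memory "archive" sink and an "errors" sink (field names and sinks are illustrative):

```python
import re

ACCESS_RE = re.compile(r'(?P<ip>\S+) "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d{3})')

def parse(line: str) -> dict:
    """Stage 1: extract structured fields from a raw access-log line."""
    m = ACCESS_RE.search(line)
    return m.groupdict() if m else {"raw": line}

def enrich(event: dict) -> dict:
    """Stage 2: add static and derived fields."""
    event["service"] = "web"
    if "status" in event:
        event["is_error"] = event["status"].startswith("5")
    return event

def route(event: dict, sinks: dict) -> None:
    """Stage 3: fan out to one or more sinks."""
    sinks["all"].append(event)              # everything goes to the archive sink
    if event.get("is_error"):
        sinks["errors"].append(event)       # 5xx also goes to an alerting sink

sinks = {"all": [], "errors": []}
for raw in ['1.2.3.4 "GET /health" 200', '5.6.7.8 "POST /pay" 503']:
    route(enrich(parse(raw)), sinks)
```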

Key overlaps and alternatives


Bottom line: Standardize on OpenTelemetry for unified instrumentation and collection, then choose storage backends - Prometheus for metrics, Loki or Elasticsearch for logs, Jaeger/Tempo for traces - and visualize and alert on everything in Grafana (and Kibana where needed).