DB DevBrain

DevOps Interview Prep

DevOps Screening Cheat-sheet

Links

FAQ

In Linux, what is OOM?

Answer: OOM stands for Out Of Memory. It happens when the Linux kernel can’t allocate memory because both RAM and swap are exhausted. When that happens, the kernel triggers the OOM killer, which forcefully terminates one or more processes to free memory and keep the system alive instead of letting it completely hang or crash.
The OOM killer chooses which process to kill based on factors like how much memory it’s using, its priority, and whether it’s considered critical. You’ll usually see OOM events in dmesg or system logs. In containerized environments, OOM is very common when memory limits are too tight or when there’s a memory leak.

What are Linux cgroups?

Answer: cgroups, short for control groups, are a Linux kernel mechanism used to manage and limit resource usage for groups of processes. They allow you to control things like CPU usage, memory limits, disk IO, and the number of processes a group can create.
They’re heavily used by container runtimes and systemd. For example, when you set a memory limit on a container, that limit is enforced using cgroups. If a process exceeds its memory limit, it can be OOM-killed inside the cgroup without affecting the rest of the system.
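Example (a minimal sketch of how a container memory limit is expressed in Kubernetes; the image name and values are placeholders): the limit below is enforced through the memory cgroup, and exceeding it gets the container OOM-killed.

apiVersion: v1
kind: Pod
metadata:
  name: memory-limited-app
spec:
  containers:
    - name: app
      image: nginx:1.25          # placeholder image
      resources:
        requests:
          memory: "128Mi"        # used by the scheduler to place the pod
        limits:
          memory: "256Mi"        # enforced via the memory cgroup; exceeding it triggers an OOM kill of the container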

What are Linux namespaces?

Answer: Linux namespaces provide isolation by giving processes their own view of system resources. Each namespace makes a process think it’s running in its own environment, even though it’s sharing the same kernel with other processes.
There are namespaces for things like process IDs, networking, filesystem mounts, users, hostnames, and IPC. For example, a process in its own PID namespace sees itself as PID 1, and a process in a network namespace has its own network interfaces and routing table.

What is the difference between cgroups and namespaces?

Answer: They solve different problems. Namespaces are about isolation and visibility: what a process can see. cgroups are about resource control: how much CPU, memory, or IO a process is allowed to use.
Containers combine both. Namespaces isolate the container from the host and other containers, while cgroups ensure it doesn’t consume more resources than it’s allowed to.

What is the TCP handshake?

Answer: The TCP handshake is the process used to establish a reliable TCP connection between a client and a server. It’s a three-step exchange. First the client sends a SYN packet to request a connection. The server responds with SYN-ACK to acknowledge and share its own sequence number. Finally, the client sends an ACK, and the connection is established.
This process ensures both sides are reachable and synchronizes sequence numbers before any data is sent. Operationally, issues here show up as connections stuck in SYN-SENT or SYN-RECV states, and attacks like SYN floods exploit this phase.

What is a 502 error?

Answer: A 502 Bad Gateway error means that a proxy, load balancer, or gateway received an invalid or no response from an upstream server. The client successfully reached the gateway, but the gateway couldn’t successfully talk to the backend service.
This often happens when the backend crashes, times out, restarts during a request, or is misconfigured. You commonly see 502s in systems using Nginx, Envoy, cloud load balancers, or CDNs. It’s different from a 503, which usually means the service is unavailable, and a 504, which means the upstream didn’t respond in time.

ALB 502 Bad Gateway in EKS

Question: You’re getting 502 Bad Gateway from an AWS ALB in EKS. What do you check first?
Answer: A 502 from ALB usually means the ALB could not get a valid response from its targets, or the targets are unhealthy/unreachable. I follow a fast checklist to isolate whether it’s ALB config, Kubernetes wiring, networking, or app behavior.
Checklist:
Traffic flow: Client → ALB listener → Target Group → Node/Pod IP → Container port
Cause and effect checks:
Concrete example:
Required rules:
If the Node/Pod SG does not allow 8080, ALB fails to connect to targets and returns 502.
Follow-up: What’s the fastest signal to narrow it down?
Follow-up answer: If the target group shows unhealthy, it’s usually health check config or Kubernetes wiring. If targets look healthy but ALB still returns 502, it’s usually security groups, protocol mismatch, or app-level connection failures.

What is a Pod Disruption Budget?

Answer: A Pod Disruption Budget, or PDB, defines how many pods of an application must remain available during voluntary disruptions in Kubernetes. Voluntary disruptions include things like draining nodes, cluster upgrades, or manual pod evictions (it does not control rolling updates).
You define a PDB using either minAvailable or maxUnavailable. Kubernetes checks the PDB before allowing a pod eviction. It doesn’t protect against crashes, OOM kills, or node failures, only against controlled maintenance actions. If configured too strictly, it can block node drains and upgrades.
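Example (a minimal PDB sketch; the name and label are placeholders): at least 2 pods matching the selector must stay available during voluntary disruptions such as node drains.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # evictions that would drop availability below 2 are blocked
  selector:
    matchLabels:
      app: api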

Pod Disruption Budget, what is it? How to use it?

Answer: A PDB limits how many Pods can be unavailable during voluntary disruptions like node drains, upgrades, or autoscaler removals.
You set:
Kubernetes will block evictions that violate the budget.
Common mistakes:
Tradeoff: better availability during maintenance, but it can slow or block cluster operations if misconfigured.

What types of Services exist in Kubernetes?

Answer: Kubernetes has several Service types to expose applications. ClusterIP is the default and is used for internal communication inside the cluster. NodePort exposes a service on a static port on every node, which is simple but not ideal for production.
LoadBalancer provisions an external load balancer through the cloud provider and is the most common choice for production workloads. ExternalName maps a service to an external DNS name without proxying traffic.
There’s also the concept of a headless service, where no cluster IP is assigned and clients get the pod IPs directly. This is commonly used with StatefulSets.
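Example (a minimal headless Service sketch, assuming a StatefulSet labeled app: db; names and the port are placeholders): because clusterIP is None, DNS returns the individual pod IPs instead of a virtual IP.

apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None          # headless: no virtual IP, clients resolve pod IPs directly
  selector:
    app: db
  ports:
    - port: 5432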

Vertical scaling vs horizontal scaling

Vertical scaling: making one machine bigger. More CPU, more RAM. Easy to do, no code changes, but you hit a hard limit and it’s a single point of failure.
Horizontal scaling: adding more machines. More complex, but it scales much better, improves availability, and is how most production systems are built.
In practice, teams often start vertical for speed, but aim for horizontal long-term.

Load balancer routing by URL – Layer 4 or Layer 7?

Layer 7.
If the load balancer looks at URLs, paths, hostnames, or headers, it’s inspecting HTTP, which is application layer. Layer 4 only sees IPs and ports.

App on a VM accessing a storage bucket – best authentication?

Use the VM’s managed identity.
Attach an IAM role or service account to the VM and give it least-privilege access to the bucket. The app uses short-lived credentials automatically. No hardcoded keys, no secrets in config.
That means:

K8s Deployment succeeds, everything healthy?

Question: If a Deployment rolled out with no errors, can you assume everything is OK?
Quick Answer: No. It only means Kubernetes scheduled the pods. You still need readiness probes, logs, metrics, and functional checks to confirm the app actually works.
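Example (a minimal probe sketch inside a Deployment’s pod template; the image, path, and port are placeholders): without a readiness probe, a rollout can look successful while the app is not actually able to serve traffic.

containers:
  - name: app
    image: example/app:1.2.3        # placeholder image
    readinessProbe:                 # gates traffic: the pod only receives requests once this passes
      httpGet:
        path: /healthz              # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:                  # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20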

Is it OK to run RabbitMQ or MySQL in a StatefulSet in production?

Question: Is running databases or brokers inside Kubernetes acceptable?
Quick Answer: RabbitMQ: OK if using the operator and proper storage. MySQL: Usually no. Use managed DB unless you have a strong infra team and specific reasons, because backups, upgrades, failover, replication, and data safety are operationally complex and easy to get wrong in Kubernetes.

Slow message processing: should you increase threads?

Question: If consumers process messages slowly, is “increase threads” the solution?
Quick Answer: Not by default. More threads only help if you are actually under‑utilizing CPU. In practice, slow consumers are usually blocked on I/O, waiting on the database, or competing for locks. Adding threads in those cases just increases context switching, DB pressure, and lock contention. First check where time is spent, then tune prefetch (how many messages a consumer pulls and buffers ahead of processing), batch sizes, queries, or the real bottleneck before scaling concurrency.

MySQL replication lag high, CPU low: increase resources?

Question: Replica lagging but CPU almost idle. Should you scale up?
Quick Answer: No. Lag is usually I/O bottleneck, single-threaded replication, long queries, huge transactions, or slow storage. Fix the root cause, not CPU/RAM.

MySQL on Kubernetes: what actually breaks first?

Question: In real production setups, what are the first failure modes you usually hit when running MySQL on Kubernetes?
Quick Answer: Storage and failover. Disk latency and volume semantics cause slow queries and long crash recovery because MySQL is extremely sensitive to fsync latency, write ordering, and predictable disk behavior, which Kubernetes volumes and network‑backed storage often cannot guarantee. Pod restarts that are routine in Kubernetes become dangerous for MySQL, since a restart can trigger long crash recovery, replay large redo logs, or leave replicas temporarily inconsistent. Network partitions further break assumptions MySQL replication relies on, making leader election and replica promotion error‑prone and risking split‑brain or stale reads. Backups and upgrades then amplify the risk, because they require tight coordination with replication, disk snapshots, and write traffic. When these operations are performed during normal Kubernetes events like rescheduling or rolling updates, they can easily turn expected maintenance into real data loss or prolonged downtime.

Linux server slow, CPU idle: is it hardware?

Question: If the server is slow but CPU is low, should you suspect hardware failure?
Quick Answer: Not first. Usually it’s disk I/O, memory pressure, swap, network issues, or blocked processes. Hardware is the last guess unless logs show actual errors.

Puppet run has no changes: everything OK?

Question: If Puppet runs clean with no drift, does it guarantee the system is fine?
Quick Answer: No. It only means Puppet thinks the state matches the manifest. Services can be unhealthy, configs wrong, or dependencies broken without Puppet noticing.

What happens to the cluster if etcd stops responding?

What etcd is
The source of truth for all cluster state (pods, nodes, configs, secrets).
What happens
Common causes
Mitigation

On which node does kubeadm run?

Key point
kubeadm is a bootstrap tool, not something that manages the cluster day-to-day.

Reasons for ImagePullBackOff / ErrImagePull in Kubernetes

Most common causes
How to debug

Where is sensitive data stored in Kubernetes?

Primary mechanism: Secrets
What they are
Encryption
Best practice

Difference between Secret and ConfigMap

ConfigMap
Secret
Rule of thumb If leaking it is bad, it belongs in a Secret.
  1. ConfigMaps for usability
  2. Secrets for controlled access

Difference between VM and Container

VM
Container

Where can environment variables be stored in Linux?

Common locations

What are taint and cordon? Use cases

Cordon
Taint
Example use cases

Best practices for writing a Dockerfile

Key rules

Docker: difference between bind mount and named volume

Bind mount

Maps a host path → container path.
Why it exists
Why it’s risky
Use when

Named volume

Managed by Docker, stored under Docker’s data directory.
Why it’s better
Why it’s recommended
Rule of thumb If data should survive container restarts and move cleanly between hosts, use a named volume.

Core VPC Design Considerations

Non-overlapping CIDR

Why: Overlapping CIDRs break peering, VPNs, and future mergers. Fixing overlap later is extremely painful.
Follow-up: What mistake do teams commonly make here?
Follow-up answer: Choosing CIDRs that are too small or conflict with on-prem or future environments.

Enough IP space

Why

Why do we design subnets per Availability Zone?

Answer: For fault isolation, predictable routing, and easier debugging.
Follow-up: What breaks if subnets aren’t AZ-aligned?
Follow-up answer: Failover behavior becomes unclear and debugging network issues gets much harder.
Key point: High availability in AWS is achieved by creating multiple subnets, one per AZ, and spreading resources across them.

Private IP ranges (RFC1918) and why they matter

Why choose carefully

How should you size VPC subnets?

Question: What is a typical subnet CIDR configuration, and why does sizing matter?
Answer: Subnet CIDRs must be planned intentionally. The CIDR determines how many IPs are available in each AZ, and running out of IPs causes hard failures. This is especially critical for EKS, where nodes, pods, ENIs, load balancers, and VPC endpoints all consume IP addresses.
Typical configuration (example):
Quick sizing intuition:
Common pattern: One public subnet and one private subnet per AZ. If you run EKS at scale, private subnets are usually the first ones you oversize.

How subnets are used in multi-AZ EKS

Question: How does EKS use subnets when the cluster spans multiple Availability Zones?
Answer: In a multi-AZ EKS cluster, you provide multiple subnets, typically one private subnet per AZ. EKS spreads worker nodes across these subnets, and the Kubernetes scheduler places pods on nodes in whichever AZ has capacity.
Each node consumes IPs from its subnet, and each pod consumes an IP from the node’s ENI allocation. This means IP pressure is per-AZ, not global. If one AZ’s subnet runs out of IPs, pods cannot be scheduled there, even if other AZs still have free capacity.
Load balancers also follow this model: an ALB or NLB creates one node per AZ and attaches to the corresponding subnets. If a subnet is missing or exhausted, that AZ is excluded from load balancing.
Key implication: High availability in EKS requires both enough subnets and sufficiently large CIDRs in every AZ. Multi-AZ does not save you from IP exhaustion if one subnet is undersized.

Common CDN caching protocols

Note: All of the mechanisms above (Cache-Control, ETag, Last-Modified, Vary, and status-based caching) are defined by the HTTP protocol and apply across HTTP/1.1, HTTP/2, and HTTP/3. They do not exist outside HTTP. CDNs and proxies may extend or override their behavior, but they all rely on these HTTP-defined semantics as the foundation.
Key point: If you understand HTTP caching, all CDNs behave similarly.

EKS: Infrastructure vs Workloads

Infrastructure (AWS-side)

How managed
Platform inside the cluster
How managed

Workloads

How managed

Why separate node groups in EKS

Common node pools
Why
How enforced

Who actually scales nodes in EKS

Trigger
Key insight

Can you get a static IP with an Internet Gateway?

Question: Can an Internet Gateway (IGW) give you a static IP?
Answer: No. An IGW is a routing target, not a resource with an IP. It scales and load-balances globally, so there is nothing to pin a static IP to.
Follow-up: Why did AWS design it this way?
Follow-up Answer: Static IPs would break elasticity and fault tolerance. AWS wants IGW traffic to scale transparently without customers depending on fixed IPs.

Why does ALB not support static IPs?

Question: Why is an Application Load Balancer DNS-only?
Answer: Because ALBs scale horizontally. AWS constantly adds and removes backing nodes (the underlying EC2 instances or network endpoints that actually receive and handle traffic behind the load balancer), so IPs change as part of normal operation.
Follow-up: What is the recommended way to integrate with ALB?
Follow-up Answer: Always rely on DNS. AWS explicitly designs ALB to be consumed via DNS, not IP pinning.

If you truly need static IPs in AWS, what are your options?

Answer:
Follow-up: What are the tradeoffs?
Follow-up Answer: NLB is L4 only with less routing intelligence. CloudFront adds another layer and complexity but gives caching, TLS termination, and WAF.

Why is NAT Gateway expensive?

Question: Why does NAT Gateway cost so much?
Answer: Because it’s fully managed, highly available, and horizontally scalable. You pay hourly plus per-GB processing.
Follow-up: Why do teams often underestimate NAT cost?
Follow-up Answer: Because NAT charges per GB. Kubernetes, image pulls, logs, and retries silently push large volumes through NAT.

NAT Gateway vs NAT Instance

Question: When would you use a NAT Instance instead of NAT Gateway?
Answer: Only at small scale when cost matters more than operational simplicity.
Follow-up: Why is a NAT Instance risky?
Follow-up Answer: You manage patching, scaling, and HA yourself. It’s easy to create a single point of failure.

How does IPv6 reduce NAT usage?

Answer: IPv6 gives every instance a globally unique IP, so outbound traffic doesn’t require NAT.
Follow-up: Why isn’t IPv6 widely adopted yet?
Follow-up Answer: Many services are still IPv4-only. Dual-stack increases complexity and makes debugging harder.

What are VPC Endpoints and why use them?

Question: What are VPC Endpoints, why would you use them, and do you need them for every service?
Answer: VPC Endpoints allow private access to supported AWS services without routing traffic over the public internet or through NAT. Traffic stays on the AWS backbone, which improves security and makes costs and latency more predictable.
You do not need endpoints for every service. Only AWS services that support PrivateLink or Gateway Endpoints can use them, and they should be added selectively. High-volume, AWS-internal traffic like S3, ECR, STS, SSM, and CloudWatch is usually a good fit. Low-volume, infrequent, or highly variable traffic is typically fine through NAT.
Why they cost money: Interface Endpoints are implemented as managed network interfaces (ENIs) and are billed per hour plus per GB processed. As you add more services and Availability Zones, the number of ENIs grows, and so does the cost. Gateway Endpoints for S3 and DynamoDB are the exception: they are route-table entries and free.
Rule of thumb: If traffic is predictable, high-volume, and stays within AWS, use a VPC Endpoint. If traffic goes to many external or changing destinations, or volume is low, keep using NAT.

NAT vs VPC Endpoints at scale

Question: How does cost behavior differ between NAT and VPC Endpoints?
Answer: NAT scales with traffic volume, while endpoints are mostly flat and predictable. At large scale, endpoints are often cheaper.
Follow-up:  What’s the usual cost-optimized pattern?
Follow-up Answer: Use VPC Endpoints for AWS services and NAT only for true external internet access.
Example use case: A private EKS cluster pulling images from ECR and writing logs to CloudWatch. Without VPC Endpoints, all this traffic goes through NAT and scales linearly with usage. With Interface Endpoints for ECR, S3, and CloudWatch, the traffic stays inside the AWS network, costs become predictable, and NAT is only used for real outbound internet calls like external APIs.
Example where NAT is still the right choice: Your workloads need to reach many changing third‑party services on the public internet (payment providers, SaaS APIs, public package registries, webhooks), and you also want a single, controlled outbound egress point with stable source IPs for allowlisting. In that case, NAT is the simplest default: endpoints won’t help (they only cover specific AWS services), and trying to replace NAT with dozens of service-specific paths becomes operationally messy.

Why do some teams put instances in public subnets?

Answer: To avoid NAT cost and simplify routing.
Follow-up: Why is this dangerous?
Follow-up Answer: It increases the attack surface and is easy to misconfigure. One bad security group can expose everything.

When are public IPs acceptable?

Question: When is it acceptable to use public IPs?
Answer:  For stateless services with no SSH access, SSM only, and very tight security groups.
Follow-up: Would you ever do this in regulated environments?
Follow-up Answer: Rarely. Most regulated environments prefer private subnets with controlled egress.

Why use multiple AWS accounts or environments?

Answer: To reduce blast radius, enforce clean IAM boundaries, and simplify audits and billing.
Follow-up: What’s a common anti-pattern?
Follow-up answer: Putting prod, staging, and experiments in the same account and relying only on naming conventions.

Why consider IPv6 even if you don’t use it today?

Answer: It future-proofs networking and can reduce NAT dependency long term.
Follow-up: How should teams adopt IPv6?
Follow-up answer: Gradually, usually via dual-stack, starting with non-critical services.

Requests vs usage

Question: Why does Kubernetes scale nodes based on requests, not usage?
Answer: Because the scheduler guarantees capacity. If requests can’t be satisfied, pods can’t be scheduled safely.
Follow-up: What’s the common failure mode here?
Follow-up answer: Over-requesting CPU or memory causes unnecessary node scaling and higher cost.
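Example (a sketch of the container resources block; the values are placeholders): the scheduler and node autoscaling react to the requests, not to actual usage, so over-requesting directly translates into extra nodes.

resources:
  requests:
    cpu: "250m"          # reserved by the scheduler; this is what drives node scaling
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"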

What health checks does an ASG use by default?

Answer: EC2 health checks only: hardware, network reachability, and hypervisor status.
Follow-up: What does that mean in practice?
Follow-up answer: ASG only knows whether the EC2 instance is alive from AWS’s perspective. It has no understanding of Kubernetes health.

Why ASG alone is insufficient in EKS

Question: Why can’t you rely on ASG alone for EKS scaling?
Answer: Because ASG has no awareness of Kubernetes scheduling, pod health, or resource pressure.
Follow-up: What happens if you don’t run a Cluster Autoscaler?
Follow-up answer: Pods remain pending forever even though ASG is healthy and running.

What ASG health checks ignore

Question: What does ASG not consider unhealthy?
Answer:
Follow-up: Why is this dangerous in EKS?
Follow-up answer: Kubernetes may consider a node broken while ASG keeps it running, leading to stuck workloads and silent failures.

If a node is NotReady, will ASG replace it?

Answer: No. ASG only replaces instances if the EC2 health check fails.
Follow-up: So how are bad nodes handled in EKS?
Follow-up answer: Through Kubernetes controllers, Cluster Autoscaler logic, Karpenter, or manual remediation. ASG itself is unaware.

What is ASG’s role in EKS?

Answer: It manages EC2 instances only and ensures the desired number of nodes exist.
Follow-up: Does ASG understand Kubernetes at all?
Follow-up answer: No. ASG does not know about pods, scheduling, or cluster state.

Pods failing to schedule and Auto Scaling Groups

Question: If pods are failing to schedule, will that trigger the Auto Scaling Group (ASG)?
Answer: Not by itself. Unschedulable pods do not directly trigger an ASG scale-out. ASGs react to EC2-level signals (like CPU, memory via CloudWatch, or explicit scaling policies), not Kubernetes pod states.
What actually triggers scale-out: In EKS, scale-out happens only if you run a Cluster Autoscaler (or Karpenter). The autoscaler watches for pods that cannot be scheduled due to lack of resources and then requests new nodes by increasing the ASG desired capacity (or provisioning new instances).
Common failure case: Pods are pending, but the ASG does nothing because:
Mental model: Kubernetes decides what it wants to run. The autoscaler translates that into how many nodes are needed. The ASG only follows scaling instructions, it does not understand pods.

ASG vs HPA

Question: Does HPA affect ASG directly?
Answer: No. HPA scales pods, ASG scales nodes. They are decoupled.
Follow-up: How do they interact indirectly?
Follow-up answer: HPA creates more pods. If pods become unschedulable, the Cluster Autoscaler increases ASG desired capacity.
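Example (a minimal HPA sketch targeting a Deployment named web; names and numbers are placeholders): the HPA only changes replica counts, and it is the Cluster Autoscaler that later adds nodes if those replicas cannot be scheduled.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # scale out when average CPU usage vs requests exceeds 70%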

Typical ASG + EKS failure scenario

Question: Describe a common real-world failure involving ASG in EKS.
Answer: Nodes are EC2-healthy but Kubernetes NotReady due to kubelet or CNI issues. ASG does nothing, pods stay pending, and the cluster appears half-alive.
Follow-up: How do experienced teams mitigate this?
Follow-up answer: Monitoring node readiness, automated remediation, Karpenter, and clear operational runbooks.

Scaling failure patterns

Pods pending but no scale
ASG scaled but pods still pending

Well architected web application in AWS with a DB, what kind of services?

Answer: A typical setup is:
Tradeoffs: managed services reduce ops work but cost more. Aurora scales better but can be overkill early.

Can the Terraform state file contain sensitive data?

Answer: Yes. State often contains real values returned by providers, including passwords, tokens, endpoints, private IPs, and sometimes more.
Important nuance: marking something as sensitive only hides it in CLI output. It does not keep it out of state.
Best practice: treat state like a secret, restrict access hard, use encryption at rest, and avoid managing secrets directly in Terraform when possible.

If I lost the Terraform state file, what happens to the resources?

Answer: The resources keep running. Terraform just becomes blind.
What breaks:
Recovery:

What is Kubernetes and when to use it?

Answer: Kubernetes is a container orchestration platform. It schedules containers, restarts failed ones, scales them, and provides service discovery and networking with declarative configs.
Use it when:
Avoid or delay it when:
Tradeoff: power and flexibility vs operational complexity.

Kubernetes update strategies

Question: What is an update strategy in Kubernetes, what are the common ones, and how does each work in simple terms?
Answer: An update strategy defines how Kubernetes replaces existing Pods when you deploy a new version.
Common strategies and patterns:
Follow-up: Which update strategies are built in to Kubernetes?
Follow-up answer: Deployments support RollingUpdate (the default) and Recreate. StatefulSets and DaemonSets support RollingUpdate and OnDelete.
Blue-green and canary are rollout patterns built using Services, Ingress, or external controllers, not native strategies.
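Example (a minimal RollingUpdate sketch on a Deployment; names, image, and numbers are placeholders): maxSurge and maxUnavailable control how aggressively old pods are replaced.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most 1 extra pod above the desired count during the rollout
      maxUnavailable: 1      # at most 1 pod below the desired count at any time
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:2.0.0   # placeholder image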

Update strategy in Deployment vs StatefulSet

Question: How does update strategy behave differently in a Deployment compared to a StatefulSet?
Answer:
Follow-up: How do maxUnavailable and maxSurge affect the update strategy?
Follow-up answer: maxUnavailable caps how many pods can be unavailable during the rollout, and maxSurge caps how many extra pods can be created above the desired replica count.
These fields are set on a Deployment under spec.strategy.rollingUpdate. They control the speed vs safety tradeoff: higher values mean faster rollouts but more resource usage or risk.
Important:

Difference between a library Helm chart and an umbrella Helm chart?

Answer:
Mental model: library charts define patterns, umbrella charts assemble systems.

In Grafana, when having many clusters, how do you aggregate all of them?

Answer: You usually aggregate at the metrics backend, not in Grafana.
Common patterns:
Tradeoff: centralized backend is more infra, but scales and simplifies querying.

What backend exists for Prometheus?

Answer: Common Prometheus-compatible backends include:
Rule of thumb:

What’s the difference between monitoring and observability?

Monitoring
Observability
How they work together

What are load balancing algorithms?

Answer:
Load balancing algorithms decide which backend instance gets a request so traffic is spread efficiently and predictably.
Common algorithms:
Layer matters

Diagnosing an application that keeps missing its SLA

Question: An application is consistently missing its SLA. How do you approach fixing it?
Answer: I use a structured approach: clarify the SLA, measure the system, then apply fixes based on whether the issue is load-related or correctness-related.

Step 0: Clarify the SLA breach

First, I confirm what is actually broken:
Kubernetes metrics to check:
Example alert:

Step 1: Prove where the time or failures go

I don’t guess. I break metrics down per dependency, not just the app.
Golden signals per component:
Kubernetes metrics to check:
Key insight: Most SLA breaches come from downstream dependencies, not the application code.

Step 2: Separate load problems from correctness problems

If the SLA breaks only under load

Likely signals:
Kubernetes metrics and alerts:
Typical fixes:

If the SLA breaks even at low load

Likely signals:
Kubernetes metrics and alerts:
Typical fixes:

Security risk of running containers as root

Question: What’s the risk of running a container as root?
Answer: Running a container as root increases the blast radius of any isolation failure. Containers are isolated mainly by namespaces and cgroups, not full VM boundaries, and root inside the container can map to host-level root unless user namespaces are used.
Concrete risks:
Follow-up: Why do people still run containers as root?
Follow-up answer: Usually convenience or legacy image assumptions, but it’s rarely justified.
Best practice: run as non-root, drop capabilities, use read-only filesystems, and enforce policies with Pod Security / admission controls.
Where this is enforced (important distinction):
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]


Multi-tenant “two teams can deploy” in one cluster

Question: You have a multi-tenant Kubernetes cluster with two teams. Both teams need to deploy, but must not touch each other’s workloads. How do you set this up?
Answer: I set it up as a complete package: Namespaces for object grouping, RBAC for API boundaries, NetworkPolicies for traffic isolation, quotas for resource fairness, and pod security/policy to prevent privilege escalation.

Rule of thumb (interview line)

Namespaces isolate objects. RBAC isolates API access. NetworkPolicies isolate traffic. Quotas isolate resources. Pod Security isolates privilege.
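Example (a minimal RBAC sketch for one team, assuming a namespace team-a and an identity-provider group team-a-developers, both placeholders): the Role and RoleBinding are namespaced, so the team can deploy only inside its own namespace. ResourceQuotas and NetworkPolicies follow the same per-namespace pattern.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-deployer
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["deployments", "pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-deployer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers        # placeholder group from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-deployer
  apiGroup: rbac.authorization.k8s.io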

High-level architecture (K8s + Postgres + Mongo + Kafka)?

Core Position

Run stateless services on Kubernetes. Keep PostgreSQL, MongoDB, and Kafka managed unless constraints force self-hosting.
Managed data services reduce risk and toil around backups, failover, upgrades, and incident response.
Enforce platform standards across all services: RBAC, network policy, autoscaling, observability, secrets, and safe progressive deployments.
Goal: predictable operations and safe scaling, not short-term speed.

Why this split works

1) Stateless compute belongs in Kubernetes

Kubernetes is excellent at:
Why this matters: Kubernetes gives you repeatability for app runtime. Teams can ship faster with fewer manual steps, and operational behavior is consistent across services.

2) Stateful systems are operationally expensive

Databases and event brokers are failure-sensitive and operations-heavy. Running PostgreSQL, MongoDB, or Kafka yourself on Kubernetes means your team owns:
Why this matters: Most incidents in distributed systems are data-path incidents. Managed services reduce toil and reduce the chance your product team gets dragged into infrastructure firefighting.

3) Faster recovery and lower incident blast radius

Using managed PostgreSQL, Atlas, and managed Kafka usually improves:
Why this matters: You recover faster, and failures are isolated better. Reliability improves without forcing every application engineer to become a database SRE.

Unified Data Layer and Contract Strategy

In microservices, most failures happen at service and data boundaries. So storage choices and schema evolution must be designed together.

Data layer (managed) and AWS mapping

Guiding rule: Each service owns its data boundary. Avoid shared-write databases across services.

Decision rule: when to self-host stateful systems on Kubernetes

Only self-host PostgreSQL, MongoDB, or Kafka on K8s if one or more of these are true:
If none apply, managed is usually the better engineering and business decision.

Risks if you ignore this model

If everything runs in-cluster without strong standards, typical outcomes are:
You get short-term speed, then operational drag.

MongoDB connections keep climbing

If MongoDB connections keep climbing in Atlas, assume leak or pool misconfig until proven otherwise. Here’s what I’d check, in order.

1) Confirm what “connections increasing” actually means

In Atlas:
If it correlates with HPA or deploys, you probably have “pool per pod” explosion.

2) App-side: pooling and leaks (most common)

Look for these patterns:
Quick rule: pods * maxPoolSize should be within cluster capacity with headroom.

3) K8s scaling interaction

4) Atlas-side checks

In Atlas metrics and logs:

5) Identify the source quickly

Best move: attribute connections by app.

6) Concrete fixes you can say in interview


What value does ArgoCD bring?

Answer: ArgoCD gives GitOps control for Kubernetes. Git becomes the source of truth, and ArgoCD continuously reconciles cluster state to match it. Main value:

ArgoCD vs CircleCI: what is the difference?

Answer: They solve different stages of delivery.
A common setup is CircleCI builds and updates image tags in Git, then ArgoCD deploys.
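Example (a minimal ArgoCD Application sketch; the repo URL, path, and names are placeholders): ArgoCD watches this Git path and keeps the target namespace in sync with it.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-configs.git   # placeholder Git repo updated by CI
    targetRevision: main
    path: apps/my-service
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual drift in the cluster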

How can one web server host multiple websites (domains) on the same VM and same port 443?

Answer: A single web server (like Nginx or Apache) can host multiple websites on the same VM using virtual hosts (server blocks in Nginx). Even if 1.com and 2.com both resolve to the same public IP, the server can still serve the correct site.
When you type https://1.com:
  1. DNS resolves 1.com to the VM’s public IP.
  2. The browser connects to the VM on port 443.
  3. The server must start a TLS handshake and choose which SSL certificate to present.
  4. The problem is: the actual HTTP request (including the header Host: 1.com) is sent after TLS is established, meaning the hostname is inside the encrypted traffic.
  5. So the server cannot see the hostname early enough unless the client sends it during the handshake.
  6. That’s why the browser sends SNI, telling the server: “I’m connecting to this IP, but I want 1.com.”
  7. Nginx uses that hostname to select the right server_name block + certificate, then serves the correct website.
Example:
server {
    listen 443 ssl;
    server_name 1.com;
    ssl_certificate /etc/ssl/1.com.crt;
    ssl_certificate_key /etc/ssl/1.com.key;

    location / {
        proxy_pass http://app1;
    }
}

server {
    listen 443 ssl;
    server_name 2.com;
    ssl_certificate /etc/ssl/2.com.crt;
    ssl_certificate_key /etc/ssl/2.com.key;

    location / {
        proxy_pass http://app2;
    }
}

What is SNI?

Answer: SNI (Server Name Indication) is a TLS extension (defined in the TLS standard, originally RFC 3546 and later RFC 6066) that allows the client to include the hostname during the TLS handshake, before encryption is established.
Key point: The hostname (Host: 1.com) is part of the encrypted HTTP traffic, so the server needs SNI to know which site and certificate to serve before encryption is established.

Name common compliance frameworks

ISO 27001
A framework that defines how an organization should manage and protect information securely through policies and controls.
SOC 2
A standard that evaluates whether a company properly protects customer data based on security and availability principles.
PCI-DSS
A mandatory security standard for companies that store, process, or transmit credit card information.
HIPAA
A U.S. regulation that protects sensitive healthcare and medical information.

Follow-up: What does compliance mean from a DevOps perspective?

From a DevOps perspective, compliance means enforcing security technically, not just documenting it: implement least-privilege IAM, ensure all infrastructure changes go through auditable CI/CD pipelines, enforce encryption at rest and in transit, centralize logs with proper retention, store secrets in a secure manager, run regular vulnerability scans, test backups and disaster recovery, separate environments clearly, and maintain strict role-based access controls. For PCI and HIPAA in particular, you also need strong network segmentation, tighter access restrictions, detailed audit trails, and proper handling or masking of sensitive data.

What is KEDA?

Answer: KEDA (Kubernetes Event-Driven Autoscaling) allows workloads to scale based on external event sources such as Kafka lag, SQS queue length, RabbitMQ messages, Redis depth, or Prometheus queries, not just CPU or memory. It can also scale workloads down to zero.

Follow-up: How does it work internally?

KEDA creates a ScaledObject, polls the external source (for example Prometheus), converts the result into an external metric, and feeds it to an HPA. The HPA performs the actual scaling.
Mental model: KEDA reads external signals. HPA changes replica count.

Follow-up: How does KEDA work with Prometheus?

Answer: KEDA uses a Prometheus scaler. You define a PromQL query and threshold. KEDA periodically executes the query, exposes the result as an external metric, and HPA scales based on that metric.
Key point: KEDA does not replace HPA. It extends it.
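Example (a minimal ScaledObject sketch using the Prometheus scaler; the Deployment name, Prometheus address, and PromQL query are placeholders): KEDA runs the query, exposes the result as an external metric, and the HPA it creates scales the Deployment.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                   # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090            # assumed Prometheus service address
        query: 'sum(rate(http_requests_total{app="worker"}[2m]))'       # hypothetical PromQL query
        threshold: "100"           # target value per replica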

What are the ways to run startup commands in AWS and GCP?

Option 1: Startup Script

What is it?

A boot-time script hook used to run commands when a VM starts.
This is the most common and fastest way to bootstrap a VM.

Typical use cases

Format and supported languages

Does it run every boot?

Main tradeoffs

Pros
Cons

Platform-specific notes

GCP specifics
AWS specifics

Option 2: Cloud-init

What is it?

A Linux OS initialization system that reads boot-time configuration (user data) and applies it during startup.
This is usually the cleaner option for structured provisioning.

Typical use cases

Format and supported languages

Primary format
Also supports

Does it run every boot?

Main tradeoffs

Pros
Cons

Platform-specific notes

GCP specifics
AWS specifics

How do you choose between Startup Script and Cloud-init?

Use Startup Script when

Use Cloud-init when

Practical rule

If the boot logic is getting large, stop stuffing it into startup scripts. Bake more into the image or move provisioning to a proper config tool.

Service type LoadBalancer vs Ingress

In a nutshell

Service type LoadBalancer exposes a single Kubernetes Service directly through an external load balancer, usually at Layer 4, so it is great for simple exposure and also for non-HTTP protocols. Ingress is a Layer 7 HTTP/HTTPS routing resource that sits in front of multiple Services and routes traffic based on hostnames and paths, usually through an Ingress Controller. Use LoadBalancer for simple direct exposure or non-HTTP traffic, and use Ingress when you want centralized web routing, TLS termination, and one public entry point for multiple services.

Service of type LoadBalancer?

Exposes an application externally by asking the underlying cloud provider or load balancer integration to create a network load balancer for that Service.
Flow: external LB → Service → Pods

Ingress

A Kubernetes API object. It does not expose traffic by itself. It needs an Ingress Controller (e.g. NGINX Ingress, AWS Load Balancer Controller, Traefik, or Kong).
Flow: external LB → Ingress Controller → Service → Pods

Comparison table

| Feature | LoadBalancer Service | Ingress |
| --- | --- | --- |
| Use cases | Single app exposure, non‑HTTP protocols, internal LB, simple environments | Multiple web apps, shared endpoint, host/path routing, centralized TLS |
| Protocol level | L4 (TCP/UDP) | L7 (HTTP/HTTPS) |
| Routing | No routing, forwards to one Service | Host and path based routing |
| TLS management | Per service / per LB | Centralized |
| Pros | Simple, direct, no controller, easy debugging | One entry point, cheaper at scale, centralized TLS, flexible routing |
| Cons | One LB per service (i.e. expensive at scale), no L7 routing | Needs controller, more moving parts, HTTP/HTTPS only |
| Rule of thumb | Use for one service or non‑HTTP traffic | Use for many HTTP services |
| Complexity | Low | Medium |

Certificates and TLS handling with Ingress/LoadBalancer

With LoadBalancer Service

Certificate handling is more fragmented because each exposed service may handle TLS separately. TLS can terminate at the cloud load balancer, inside the app, or in a reverse proxy.
With many separately exposed services, TLS management is more distributed. You may need multiple certificates, or you may reuse wildcard/SAN certificates, but you still have multiple public endpoints, listeners, repeated DNS mappings, and repeated renewal setup.
With Service type LoadBalancer, teams often expose apps like this:
If each LB terminates TLS independently, then each LB needs a cert that covers its hostname. That often means:

With Ingress

TLS is centralized at the Ingress layer.
Common patterns are:
That is usually easier because:
Kubernetes TLS secret object
When using a service like Let’s Encrypt, cert-manager obtains the certificate from the ACME server and saves it in Kubernetes as a TLS Secret (type: kubernetes.io/tls). The Ingress Controller must have access to this Secret in order to terminate TLS, so during the handshake it loads the certificate and key and presents the correct certificate for the requested host.
Common certificate flow with Ingress
  1. You create an Ingress for app.example.com
  2. The Ingress includes TLS configuration
  3. cert-manager requests a certificate from an issuer such as Let's Encrypt
  4. The certificate is stored in a Kubernetes secret
  5. The Ingress Controller loads that certificate
  6. Client connects with HTTPS
  7. TLS terminates at the Ingress Controller
  8. The controller routes the request to the correct backend Service
  9. The Service sends traffic to the Pods
So the request flow is usually:
Client -> public DNS -> external load balancer -> Ingress Controller -> Service -> Pods
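Example (a minimal Ingress-with-TLS sketch, assuming an NGINX ingress class and a cert-manager ClusterIssuer named letsencrypt-prod; hostnames and names are placeholders): cert-manager stores the issued certificate in the referenced Secret, and the controller uses it to terminate TLS.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumed ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls    # cert-manager writes the certificate and key here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80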

What is the request path when using LoadBalancer vs Ingress?

LoadBalancer
Client → DNS → Load Balancer → Kubernetes Service → Pod
Ingress
Client → DNS → ALB / ingress load balancer → Ingress rules → Service → Pod

Why use IAM roles instead of users or hardcoded credentials?

An IAM user is a long-lived identity that can have permanent credentials like passwords or access keys, while an IAM role is a temporary identity that is assumed when needed and provides short-lived credentials through STS. In modern AWS design, users are mainly for human access, while roles are preferred for workloads, services, CI/CD, cross-account access, and automation.
This approach follows core security principles:

How is CI/CD given a role in AWS?

Usually through STS AssumeRole or AssumeRoleWithWebIdentity.
Traditional pattern: The CI system has some initial AWS credentials and uses them to call AssumeRole into a target role.
Modern preferred pattern: The CI platform uses OIDC federation. Example: GitHub Actions gets an OIDC token from GitHub, AWS verifies it, and AWS lets that workflow assume a role without stored AWS secrets.
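Example (a minimal CloudFormation-style sketch of the GitHub Actions OIDC pattern, assuming the OIDC provider for token.actions.githubusercontent.com already exists in the account; the account ID, org, and repo are placeholders): the trust policy decides who may assume the role, and separate permission policies decide what the role can do.

GitHubActionsDeployRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: github-actions-deploy           # placeholder role name
    AssumeRolePolicyDocument:                 # trust policy: who may assume this role
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Federated: arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com
          Action: sts:AssumeRoleWithWebIdentity
          Condition:
            StringEquals:
              "token.actions.githubusercontent.com:aud": sts.amazonaws.com
            StringLike:
              "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:*"   # placeholder repo filter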

How does AssumeRole authorization flow work?

Layers flow
IAM Identity (user / role / service) → tries to assume a role → Trust policy (who is allowed to assume) → STS gives temporary credentials → Permission policy (what the role can do) → AWS API authorization
Runtime flow
Caller (user / service / CI) → tries AssumeRole → AWS checks trust policy → if allowed → temporary credentials → permission policies evaluated → API allowed / denied

How should I structure CI/CD for 300 Lambda functions deployed with SAM so one change does not hurt the others?

Problem: If hundreds of Lambdas are bundled into a small number of deployment units, small changes create unnecessary builds and deploys, increase blast radius, slow down the pipeline, and make ownership unclear.
Best practice: Group functions by service or domain, not all together and not necessarily one pipeline per Lambda. Keep deployment units small enough to limit blast radius, but standardized enough to stay maintainable at scale.
Approach

How do I move from a model where one person deploys infrastructure from their local machine to a model where the whole team can deploy safely?

Problem: Person-based deployment is risky. It creates a single point of failure, weak auditability, and too much trust in one engineer’s laptop.
Best practice: Make deployments standardized, reviewable, auditable, and repeatable.
Approach

How do I move changes from dev to production safely?

Problem: If dev and prod are built or deployed differently, environments drift and releases become less trustworthy.
Best practice: Do not rebuild separately for production. Promote the exact same version forward.
Approach

A developer wants a new DynamoDB table and a new Lambda. What permissions should they get?

Problem: Giving developers direct AWS create permissions scales badly and increases the risk of unsafe or inconsistent infrastructure changes.
Best practice: Developers should be able to propose infrastructure, not create it directly in production.
Approach

If production stability is my responsibility, how should I collaborate with engineers on infrastructure changes?

Problem: If changes go directly from engineers into production, accountability is unclear and production risk rises.
Best practice: Engineers should build through code, and the pipeline should deploy. This avoids both chaos and bottlenecks.
Approach

How should I manage Lambda access to DynamoDB and SQS?

Problem: When several functions share data stores and queues, broad permissions and hardcoded config quickly become messy and unsafe.
Best practice: Use least privilege for access and a central config store for environment-specific values.
Approach

We are a health tech company in Germany and ISO 27001 and GDPR are mandatory. How should I implement this in the platform?

Problem: Compliance fails when it is treated as documentation only or left to each service team to implement ad hoc.
Best practice: Treat compliance as a platform capability: access control, encryption, logging, auditability, retention, and process.
Approach

We currently emit a metric every time code logs an error, and that metric raises a CloudWatch alarm. What are the limitations of that approach?

Problem: Alerting on every application error sounds safe, but at team scale it usually creates more noise than value.
Best practice: Alert on error rate, latency, throttling, backlog, DLQ depth, and other service-level indicators. Use logs and traces for diagnosis.
Approach

How would I improve observability beyond basic CloudWatch alarms?

Problem: Basic metrics and alarms tell you that something is wrong, but not why.
Best practice: Make alerts actionable and make investigation fast.
Approach

Which third-party observability tools could improve on basic CloudWatch?

Problem: CloudWatch is useful, but many teams outgrow it when they need stronger correlation and easier root-cause analysis.
Best practice: Pick the tool that matches team maturity and operating model, not just the one with the most features.
Approach

How would I implement tracing across Lambda, SQS, and DynamoDB?

Problem: Logs from separate services do not give an end-to-end view of where a request slowed down or failed.
Best practice: Tracing should show the full path across the event-driven system, not isolated service fragments.
Approach

How do I decide whether a Lambda, DynamoDB table, or queue is actually being used?

Problem: Unused infrastructure wastes money and increases complexity, but deleting the wrong thing can break hidden consumers.
Best practice: Do not guess. Verify usage, deprecate safely, then remove through IaC.
Approach

Can Terraform detect unused resources automatically?

Problem: Teams often assume Terraform can tell whether something is safe to delete, but that is not what Terraform does.
Best practice: Terraform manages what should exist, not whether the business still uses it.
Approach

In a fully serverless environment with tenant-specific components and third-party services that may run on EC2 or elsewhere, how should I design the network topology safely?

Problem: Mixing tenant-facing services, internal components, and third-party systems into one flat topology creates unnecessary risk and weak isolation.
Best practice: Segment by trust zone, minimize public exposure, and isolate third-party and tenant-sensitive components more aggressively than the rest of the stack.
Approach

When would I replace a synchronous API call with a queue?

Problem: Synchronous calls are simple, but they break down when latency, retries, spikes, or downstream instability become serious problems.
Best practice: Use a queue when you want buffering, decoupling, and more control over failure handling.
Approach

In a standard API stack with a client, load balancer, application server, and database, how do I spot hard latency problems?

Problem: Latency issues are easy to misdiagnose when you only look at averages or only watch one layer.
Best practice: Latency debugging works best when you move from system-level symptom to per-hop breakdown instead of guessing from one graph.
Approach

Shared scenario for workflow reliability, observability, and distributed systems

A multi-step agent workflow accepts inbound jobs from customers, stores metadata in MongoDB, calls an external enrichment API, and writes results back asynchronously. Load is bursty, retries happen automatically, some jobs are slow, and one downstream API has hard rate limits. The user-facing API should respond quickly, even when background processing is under stress.

How do you design agent or workflow systems so they stay reliable under real-world load?

I assume work will arrive in bursts, dependencies will fail, and messages may be delivered more than once. So I usually decouple ingestion from execution with queues, make workers idempotent, and define retry behavior explicitly instead of treating retries as a default safety net. I also add backpressure so the system can slow itself down instead of melting downstream dependencies. In practice that means concurrency limits, rate limits, timeouts, circuit breakers, and dead-letter handling for poison messages. The main goal is not just throughput. It is keeping the system predictable under stress. In production, I watch queue depth, message age, retry volume, failure rates, and saturation at each bottleneck.
In this scenario, what goes wrong: If inbound jobs are processed inline, a burst of traffic can push user-facing latency up immediately. If retries are blind and concurrency is uncapped, workers can hammer MongoDB and the external API at the same time, which makes the backlog worse and creates duplicate side effects.
How this answer helps in that scenario: Queueing separates ingestion from execution, idempotency makes duplicate delivery survivable, and bounded concurrency protects the real bottleneck. That turns a spike into a backlog you can manage instead of a cascade that spills into the whole system.

How do you prevent retries from causing duplicate side effects?

I make the operation idempotent at the business level, not just at the transport level. That usually means an idempotency key, a unique operation ID stored with the result, or a state transition model where the same step can be replayed safely. If I am calling an external system, I try to send a stable request identifier and store the outbound intent before the call so I can reconcile later.

Where would you apply backpressure in this kind of system?

At the real bottleneck. If the database is saturating, I cap worker concurrency there. If the external API is rate-limited, I shape outbound calls there. I also use bounded queues and admission control at the edge so the system can reject or defer work before overload becomes a cascade.

What metrics tell you the system is falling behind before users notice?

Queue depth is useful, but queue age is usually better because it shows whether work is actually being drained. I also watch retry rate, dead-letter growth, worker saturation, dependency latency, and user-facing latency at p95 or p99. Those usually surface stress before a full outage.

When would you keep a workflow synchronous instead of queue-based?

If the result is required immediately for the user-facing path and the work is short, predictable, and low-risk, synchronous is often the right tradeoff. I avoid adding async complexity unless I need buffering, isolation, retries, or long-running execution.

How do you deal with poison messages or permanently failing jobs?

I stop infinite retry loops quickly. After bounded retries, the job should go to a dead-letter queue or failure store with full context, payload metadata, and failure reason. Then I want triage tooling, replay controls, and usually classification between bad input, dependency failure, and code defect.

How do you make retries safe in distributed systems?

Retries are only safe if the operation is idempotent or if you have a clear deduplication mechanism. Otherwise retries can create duplicate payments, duplicate jobs, or conflicting state transitions. I usually design each step with an idempotency key, a unique operation identifier, or a state machine that makes repeated execution harmless. I also separate transient failures from permanent ones, because retrying validation errors or bad payloads just creates noise. Good retry policy includes bounded attempts, exponential backoff, and jitter so failures do not synchronize into a retry storm. If a step still fails after that, it should go to a dead-letter path with enough context for investigation.
In this scenario, what goes wrong: The enrichment API might time out after partially completing the request. If the worker retries blindly, the job may write duplicate results, send the same downstream event twice, or trigger conflicting state transitions.
How this answer helps in that scenario: Idempotency keys, operation tracking, and failure classification let you retry only when it is actually safe. That keeps transient failure from becoming duplicate business impact.

What makes an operation truly idempotent?

Running it multiple times has the same final business effect as running it once. That is stronger than saying the same HTTP request returns the same status code. If a payment, email, or state transition would happen twice, it is not truly idempotent.

How would you handle retries for an external API that is not idempotent?

I would avoid blind automatic retries. First choice is to see whether the API supports an idempotency token. If not, I would record outbound intent, detect ambiguous outcomes, and reconcile before retrying. In some cases the right answer is to fail safely and escalate instead of guessing.

What is the difference between at-least-once delivery and exactly-once processing?

At-least-once delivery means duplicates are possible. Exactly-once processing is the stronger business guarantee that the effect happens once. In real systems, infrastructure-level exactly-once is rare, so teams usually achieve exactly-once business effect through idempotency, deduplication, and controlled state transitions.

When do you stop retrying and surface failure?

When the failure is clearly permanent, like validation errors or malformed input, or when bounded retries are exhausted for a transient issue. After that, I want the failure surfaced to operators or downstream consumers with enough context to decide whether to replay, fix data, or patch code.

How do you think about production reliability?

I treat reliability as an engineering budget, not a vague goal. That means defining service-level objectives, understanding what level of errors or latency is acceptable, and then making design choices that fit inside those limits. For example, if the user-facing path has a strict latency budget, I avoid putting long-running or failure-prone work inline and move it to async processing where possible. I also look at dependency risk, redundancy, failure domains, and operational readiness before calling a system production-ready. Reliability work is not only about preventing outages. It is also about shortening detection time, narrowing blast radius, and making recovery predictable.
In this scenario, what goes wrong: If the system treats reliability as just uptime, it may ignore queue age, degraded dependency behavior, and user-facing latency until customers are already feeling the impact.
How this answer helps in that scenario: SLOs, latency budgets, and explicit failure-domain thinking force you to design for the failure modes that actually matter to users, not just whether the service is technically still up.

What is the difference between an SLA, SLO, and error budget?

An SLA is the external commitment, usually commercial. An SLO is the internal target you engineer against. The error budget is the allowed amount of unreliability implied by that SLO. The budget is useful because it turns reliability into a decision framework instead of a vague aspiration.

How do latency budgets influence architecture?

They force you to decide what belongs in the request path and what should move out of band. They also limit fan-out, shape timeout values, and expose which dependency hops are too expensive for the user journey.

When would you deliberately accept lower reliability?

When the feature is low criticality, internal-only, experimental, or too expensive to harden to the same level as a core path. The key is making that a conscious tradeoff instead of accidental neglect.

How do you reduce blast radius during incidents?

I like isolation boundaries, feature flags, progressive rollout, rate limiting, and the ability to disable or degrade one subsystem without taking everything else down. Small failure domains make recovery much easier.

How do you use latency budgets in system design?

A latency budget forces you to break the end-to-end response time into pieces and decide where time is allowed to go. That usually means budgeting for network hops, application processing, database calls, and external dependencies. Once that is visible, you can decide what belongs in the request path and what should move to async execution. It also helps with timeout design, because timeouts should reflect the budget instead of being random defaults. In practice, I use latency budgets to keep the critical path small, reduce fan-out, cache where it helps, and avoid hidden tail-latency traps. It is a good way to keep architecture decisions grounded in user-facing expectations.
In this scenario, what goes wrong: If the API waits on MongoDB, the external enrichment call, and multiple internal hops before responding, p95 and p99 latency can explode even when average latency looks fine.
How this answer helps in that scenario: A latency budget makes it obvious that enrichment belongs off the critical path. It also helps you set tighter internal timeouts so one slow dependency does not consume the whole request budget.

What is tail latency and why does it matter?

Tail latency is the slow end of the distribution, usually p95 or p99. Users often feel the tail more than the average, especially in fan-out systems where one slow dependency can dominate the whole request.

How do retries affect latency budgets?

Retries spend latency budget fast. If the retry is inline, it can easily turn a slow request into a timeout. That is why retry policy, timeout policy, and latency budgets need to be designed together, not separately.

How do you set timeouts between services?

I start from the end-to-end budget, reserve time for the full path, and then assign tighter budgets to internal hops. Timeouts should be deliberate and shorter than the caller’s timeout so failures surface cleanly instead of stacking.
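
A rough sketch of deriving per-hop timeouts from an end-to-end budget; the hop names, shares, and safety margin are hypothetical, and real shares should come from measured latency distributions.

```python
def hop_timeouts(total_budget_ms, hops, safety_margin=0.1):
    """Split an end-to-end latency budget across sequential hops.

    `hops` maps hop name -> share of the budget (shares should sum to <= 1).
    """
    usable = total_budget_ms * (1 - safety_margin)   # keep headroom for the caller
    return {name: int(usable * share) for name, share in hops.items()}

budget = hop_timeouts(
    total_budget_ms=800,
    hops={"auth": 0.1, "mongodb": 0.3, "enrichment_api": 0.4, "render": 0.2},
)
print(budget)   # every inner timeout stays below the caller's 800 ms limit
```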

When does caching help, and when does it just hide deeper issues?

Caching helps when the workload is read-heavy, data freshness requirements allow it, and the cache removes repeat expensive work. It hides deeper issues when it is masking bad query patterns, excessive fan-out, or poor dependency design without actually fixing them.

How do you approach incident response as an owner of production reliability?

During an incident, my first priority is restoring service or reducing impact, not proving root cause in real time. I want clear severity assessment, ownership, communication, and a fast view of blast radius. That usually means checking recent changes, dependency health, saturation signals, and whether rollback or traffic reduction is safer than continuing to debug live. After stabilization, I care about root cause analysis, timeline reconstruction, and action items that actually change the system, not just documentation theater. Good incident response is calm, structured, and focused on decision quality under pressure.
In this scenario, what goes wrong: Teams can lose time debating root cause while queue age rises, workers fail repeatedly, and user-visible latency keeps climbing.
How this answer helps in that scenario: It forces the first move toward mitigation: assess blast radius, identify the release as a likely trigger, and decide quickly whether rollback, traffic reduction, or feature disablement is the safest stabilizing action.

What would you check in the first 10 minutes?

User impact, blast radius, recent deploys or config changes, dependency health, saturation metrics, and whether rollback or traffic shedding is available. I want fast orientation before deep debugging.

When do you roll back versus fix forward?

I roll back when the change is clearly implicated and rollback is lower risk than live repair. I fix forward when rollback is unsafe, stateful migrations are involved, or the fix is smaller and faster than reversing the release.

How do you avoid noisy alerts during incidents?

By grouping related alerts, muting derived noise where appropriate, and focusing on a few primary signals tied to impact. During an incident, more alert volume is usually not more insight.

What makes a postmortem useful instead of ceremonial?

A real timeline, a clear explanation of why existing controls failed, and action items that change code, process, or observability. If the outcome is just “be more careful,” the postmortem was weak.

How do you approach capacity planning?

Capacity planning starts with workload shape, not instance count. I want to know peak versus average traffic, concurrency patterns, job duration, storage growth, dependency bottlenecks, and what happens during recovery events or batch spikes. Then I translate that into headroom targets and scaling behavior. I also care about the non-obvious bottlenecks, like connection pools, partitions, rate-limited APIs, queue consumers, and database write amplification. Good capacity planning is not guessing one big number. It is understanding where the system saturates first and what the cost of extra headroom is.

What signals tell you scaling is not solving the real bottleneck?

Throughput stops improving even as you add capacity, latency stays high, queue age keeps growing, or one dependency remains saturated. That usually means the bottleneck is elsewhere, like the database, connection pool, partition hot spots, or an external API.

How do you plan for bursty async workloads?

I look at arrival rate distribution, not just average load. Then I size for backlog absorption, drain time, worker concurrency, and recovery behavior after spikes. Bursty systems need queue-based thinking, not just steady-state scaling.
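
A back-of-the-envelope drain-time calculation helps here; the numbers below are made up, and the point is only that drain time depends on capacity minus arrival rate, not on capacity alone.

```python
def drain_time_seconds(backlog, arrival_rate, workers, per_worker_rate):
    """Rough time to drain a backlog while new work keeps arriving (jobs/second)."""
    capacity = workers * per_worker_rate
    if capacity <= arrival_rate:
        return float("inf")        # backlog never drains: scaling is not optional
    return backlog / (capacity - arrival_rate)

# 50k backlog, 20 jobs/s still arriving, 10 workers doing 5 jobs/s each
print(drain_time_seconds(50_000, 20, 10, 5))   # ~1667 seconds (~28 minutes)
```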

How much headroom is enough?

Enough to absorb expected spikes, recovery events, and small forecast errors without immediate instability. There is no universal number. It depends on workload volatility, scaling speed, and failure tolerance.

What changes in capacity planning for stateful systems?

You care much more about storage growth, replication overhead, failover behavior, write amplification, rebalancing cost, and hot partitions. Stateful systems are usually harder to scale and slower to recover than stateless ones.

How do you instrument a system so problems surface before users notice?

I start from the important user journeys and failure modes, not from whatever metrics the platform gives me for free. Then I instrument the stack with structured logs, traces across service boundaries, and metrics that reflect both user impact and system health. I want to know request rate, errors, latency, saturation, and workflow-specific signals like retry volume, queue age, and dead-letter growth. Good observability is not just data collection. It is being able to answer why a system is slow, failing, or falling behind without guessing. Alerting should focus on symptoms that matter and be specific enough that an engineer knows where to start.
In this scenario, what goes wrong: A job may be accepted by the API and then disappear somewhere between queue publish, worker execution, the external API call, and MongoDB write-back. Without correlation IDs and step-level visibility, the team ends up guessing where it died.
How this answer helps in that scenario: Traces, structured logs, and workflow-aware metrics let you narrow the failure to a specific hop or retry boundary. That turns async debugging from detective work into a normal operational task.

What is the difference between metrics, logs, and traces?

Metrics tell you that something changed. Traces show where time or failure happened across the request path. Logs give detailed local context inside a component. I want all three connected by shared identifiers.

How do you choose what deserves an alert?

If it affects users, burns reliability budget, indicates real service degradation, or predicts imminent failure, it probably deserves an alert. If it is only informational or needs human interpretation every time, it probably belongs in a dashboard, not a pager.

What makes structured logging better than plain text logs?

It makes filtering, aggregation, and correlation much easier. Fields like request ID, workflow ID, tenant, status, and error class become queryable instead of buried in free text.
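
A minimal example of emitting structured JSON log lines in Python; the field names (request_id, workflow_id, error_class) are illustrative, and the point is that they become queryable instead of buried in free text.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("worker")

def log_event(event, **fields):
    """Emit one JSON log line per event with structured, queryable fields."""
    log.info(json.dumps({"event": event, **fields}))

request_id = str(uuid.uuid4())
log_event(
    "job_failed",
    request_id=request_id,      # correlates API, queue, and worker logs
    workflow_id="wf-123",
    step="enrichment_call",
    error_class="TimeoutError",
    retry_count=3,
)
```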

How do you debug async workflows that hop across services?

Correlation IDs are mandatory. I want trace context carried through messages, logs enriched with workflow and step IDs, and dashboards that show queue state, retry history, and dead-letter events. Without cross-hop correlation, async debugging turns into guesswork.

What makes an alert actionable?

An alert is actionable when it points to a real symptom, has a clear owner, and gives enough context to start narrowing the problem immediately. Good alerts usually tie to user impact, SLO burn, or known failure patterns like queue backlog growth, sustained error rate increase, or dependency saturation. They should include thresholds that reflect meaningful degradation, not every transient blip. I also want routing, severity, and links to dashboards or runbooks. The best alert is one that wakes someone up only when a decision is needed.

What are examples of noisy alerts you would remove?

Single blip CPU alerts, isolated pod restarts with no impact, transient error spikes below user-visible thresholds, and duplicate alerts that all describe the same underlying issue.

How do you alert on slow degradation, not just hard failures?

Burn-rate alerts, backlog growth, saturation trends, and latency distribution shifts are good for that. Slow degradation usually shows up in trends before it becomes a full outage.

When would you use burn-rate alerts?

When I care about SLO consumption over time, especially for catching both fast severe outages and slower reliability leaks. Burn-rate alerting is useful because it ties pages to budget impact instead of raw metric noise.
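
A quick sketch of the burn-rate idea: compare the observed error ratio to the ratio the SLO allows. The paging thresholds teams choose (for example the commonly cited fast-burn multiplier of 14.4x over one hour for a 30-day window) are policy decisions, not universal constants.

```python
def burn_rate(observed_error_ratio, slo):
    """How fast the error budget is being consumed relative to plan.

    1.0 means exactly on budget; higher means the budget will be spent
    before the window ends.
    """
    budget_ratio = 1 - slo
    return observed_error_ratio / budget_ratio

print(burn_rate(0.014, 0.999))   # 1.4% errors against a 99.9% SLO -> 14.0x burn
```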

How do runbooks improve alert quality?

They force the alert to be grounded in a real action path. If nobody can explain what to do when the alert fires, the alert is probably weak.

Shared scenario for CI/CD, deployments, and incident response

A new release adds a schema change to MongoDB and a new worker behavior in the agent workflow. The deployment technically succeeds, but shortly after rollout, job failures rise, queue age starts climbing, and some workers are reading data in the new shape while others still expect the old shape.

What does a solid CI/CD pipeline look like to you?

A solid pipeline makes the path from local development to production predictable, repeatable, and low-friction without lowering safety standards. I want fast feedback on pull requests, automated tests at the right layers, consistent artifact creation, environment promotion rules, and infrastructure changes tracked as code. I also want preview or staging environments where they add value, especially for integration-heavy changes. The key is balancing speed with confidence. A pipeline should catch common regressions early, make deployments boring, and make rollback or rollback-equivalent actions straightforward when something goes wrong.
In this scenario, what goes wrong: If the pipeline validates only that code builds and deploys, it can still miss the dangerous part: mixed-version workers operating against an evolving data shape.
How this answer helps in that scenario: A stronger pipeline pushes you toward compatibility checks, safer promotion, and better release discipline. That reduces the chance of a technically successful deploy causing a real production incident.

What checks belong in PR validation versus later stages?

Fast and high-signal checks belong in PRs: linting, unit tests, static analysis, basic build validation, maybe lightweight integration tests. Slower or more environment-dependent checks can happen post-merge or in staging. The PR stage should protect quality without killing iteration speed.

How do you handle database changes safely in CI/CD?

Backward-compatible migrations first, application rollout second, destructive cleanup last. I try to avoid release patterns where code and schema must change in lockstep. For risky migrations, I want testing on production-like data shape and a rollback-aware plan.
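
A small sketch of the expand/contract idea from the reader's side: code tolerates both shapes while the migration is in flight, writers switch only after all readers do, and the old field is dropped only after backfill. Field names are illustrative.

```python
def read_customer_name(doc):
    """Tolerant reader used while both document shapes exist in MongoDB."""
    if "first_name" in doc:                      # new shape
        return f"{doc['first_name']} {doc['last_name']}"
    return doc["name"]                           # old shape still supported

print(read_customer_name({"name": "Ada Lovelace"}))
print(read_customer_name({"first_name": "Ada", "last_name": "Lovelace"}))
```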

When do preview environments make sense?

When changes are integration-heavy, involve UI or API contract validation, or need stakeholder review before merge. They are most useful when they reduce real uncertainty, not when they exist only because it sounds mature.

How do you avoid slow pipelines becoming a productivity drag?

Keep the fast path fast, parallelize where possible, cache builds responsibly, and separate required gates from informational checks. A slow pipeline trains people to bypass it.

How do you reduce deployment risk?

I try to reduce risk before deployment and also reduce blast radius after deployment. Before release, that means strong validation, reproducible artifacts, config review, and testing paths that reflect real integrations. During rollout, I prefer progressive delivery when possible, like canaries, phased rollout, or feature flags, because that limits exposure and makes regression detection easier. I also want rollback criteria to be defined before deployment, not invented during an incident. The idea is that deployment should be a controlled experiment, not a leap of faith.
In this scenario, what goes wrong: Rolling the whole fleet at once can turn a compatibility bug into a full outage. And if the release includes a risky schema change, rollback may not be clean anymore.
How this answer helps in that scenario: Progressive rollout, feature flags, and pre-defined rollback criteria reduce exposure early and make it easier to contain the blast radius before the backlog and failures become systemic.

When is a rollback dangerous?

When the release included irreversible state changes, destructive migrations, side effects already emitted to other systems, or data shape changes that older code cannot handle. In those cases rollback can make the incident worse.

How do feature flags help, and where do they create complexity?

They help decouple deploy from release, reduce blast radius, and let you disable behavior quickly. The downside is flag sprawl, stale code paths, hidden interactions, and extra testing matrix complexity.

What metrics do you watch right after a release?

Error rate, latency, saturation, resource usage, key business flows, and any feature-specific metrics tied to the change. I want both system health and product impact.

How do you validate infrastructure changes safely?

IaC review, plan output review, policy checks, lower-environment application where useful, and progressive rollout if the platform allows it. Infrastructure should have the same discipline as application code.

What is your view on preview, staging, and production environments?

I see environments as confidence tools, not as a ritual. Preview environments are useful for fast feedback on isolated changes, especially UI or integration-heavy work. Staging is useful when it is production-like enough to expose real integration or deployment issues. But fake staging can create false confidence if it does not reflect production topology, data shape, or traffic behavior. The right question is what uncertainty each environment is supposed to remove. If an environment does not reduce meaningful risk, it is just cost and operational drag.

What makes staging misleading?

Different scale, different data shape, missing dependencies, fake traffic, or different configuration. If staging removes the hard parts of production, it teaches the wrong lessons.

How close should staging be to production?

Close enough to exercise the risky integrations and deployment path realistically. It does not have to be identical in size, but it should be honest about the failure modes you care about.

When are ephemeral environments worth the cost?

When they speed up validation of integration-heavy changes or unblock collaboration across engineering, QA, and product. They are not worth much for changes that can already be validated cheaply with tests.

What should never be tested only in production?

Basic correctness, destructive migration logic, auth flows, critical integration contracts, and obvious failure paths. Production should still validate reality, but it should not be the first place you learn the basics are broken.

What does eventual consistency mean in practice?

Eventual consistency means different parts of the system may temporarily disagree, and your application has to be designed so that this is acceptable and understandable. In practice that affects read-after-write expectations, workflow timing, reconciliation logic, and how users experience state changes. You cannot assume every reader sees the newest value immediately, especially across replicas, caches, or event-driven pipelines. So you design around that reality with clear ownership of state transitions, idempotent consumers, compensating logic when needed, and user-facing behavior that does not depend on perfect immediacy unless the business case really requires it.
In this scenario, what goes wrong: A user may submit a job and immediately reload the UI, but the workflow status has not propagated yet. If the system assumes instant consistency, the UI may look broken or trigger duplicate user actions.
How this answer helps in that scenario: Designing for eventual consistency lets you use pending states, reconciliation, and idempotent updates instead of pretending every component sees the same truth at the same time.

When is eventual consistency acceptable, and when is it not?

It is acceptable when temporary staleness is tolerable and can be explained or reconciled. It is not acceptable for workflows where incorrect intermediate state causes financial loss, safety issues, or broken core business guarantees.

How do you explain eventual consistency to product stakeholders?

I would say the system will converge to the correct state, but not every screen or service will see the update instantly. Then I would translate that into user-visible behavior, like delayed status refresh or short-lived pending states.

What patterns help reconcile delayed or out-of-order updates?

Versioning, sequence numbers where possible, idempotent consumers, reconciliation jobs, and state machines that reject invalid transitions. The right choice depends on whether ordering can be enforced or only repaired.
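
As an illustration of a state machine that rejects invalid transitions, a small Python sketch; the job states and allowed transitions are hypothetical.

```python
# Allowed transitions for a job; an update that arrives late or out of
# order is rejected instead of silently overwriting newer state.
ALLOWED = {
    "queued":    {"running", "cancelled"},
    "running":   {"succeeded", "failed"},
    "failed":    {"queued"},          # re-queue after a fix
    "succeeded": set(),
    "cancelled": set(),
}

def apply_transition(current, incoming):
    """Return the new state, or keep the current one if the event is stale/invalid."""
    if incoming in ALLOWED.get(current, set()):
        return incoming
    return current

print(apply_transition("running", "succeeded"))   # succeeded
print(apply_transition("succeeded", "running"))   # succeeded (stale event ignored)
```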

How do you test for these edge cases?

By simulating duplicates, delayed events, out-of-order delivery, stale reads, and partial dependency failure. Happy-path tests are not enough here.

What failure modes do managed platforms hide until they do not?

Managed platforms remove a lot of undifferentiated heavy lifting, but they do not remove distributed systems realities. The failure modes are still there, just hidden behind cleaner APIs. For example, retries can still duplicate work, queues can still back up, cold starts can still affect tail latency, and abstractions around orchestration can hide state growth or timeout behavior until load increases. Rate limits, noisy neighbors, partitioning limits, and consistency assumptions also surface eventually. I like managed tools, but I do not treat them as magic. I want to understand the semantics underneath the abstraction so I know where it will break under scale, burstiness, or partial failure.

What questions would you ask before adopting a managed workflow platform?

What are the retry semantics, timeout limits, state retention behavior, ordering guarantees, throughput limits, debugging tools, cold start characteristics, and failure visibility? I want to know what happens under stress, not just on the product page.

How do you test platform limits before production?

Load tests, failure injection, long-running workflow tests, quota boundary tests, and recovery drills. I especially want to see how the platform behaves at the edges, not just in steady state.

What hidden assumptions around retries or ordering matter most?

Whether retries are automatic, whether duplicate execution is possible, whether ordering is guaranteed per key or not at all, and whether visibility timeouts or leases can cause reprocessing. Those assumptions change everything upstream.

When would you accept abstraction leakage instead of building lower-level yourself?

When the managed platform still buys enough speed, reliability, and operator leverage that the leaked complexity is manageable. I do not mind some abstraction leakage if the overall tradeoff is still good.

How do you reason about partial failure?

I assume parts of the system will fail independently and that success is often mixed, not binary. One dependency may be slow, another may be unavailable, and a third may succeed after a retry. That means the design has to handle timeout, fallback, compensation, and degraded modes explicitly. The key question is what the user or downstream system should observe when only part of the workflow succeeds. Good systems make partial failure visible, bounded, and recoverable instead of hiding it until state becomes inconsistent or operators lose track of what happened.

What is a good example of graceful degradation?

Serving a partial response, showing stale but clearly marked data, accepting work asynchronously instead of inline, or disabling a non-critical feature while keeping the core flow alive. The point is preserving value instead of chasing all-or-nothing behavior.

How do you keep partial success from corrupting state?

Clear ownership of state transitions, idempotency, durable workflow state, and compensating actions where needed. Partial success is manageable if the system can tell exactly which steps completed and which did not.

When should a workflow fail fast instead of continuing?

When a prerequisite is missing, a permanent validation error is detected, or continuing would produce unsafe or misleading state. Failing fast is usually better than digging a deeper hole.

What role does a saga or compensating action play here?

It gives you a structured way to unwind or counteract earlier successful steps when later steps fail. It is not magic, but it is a practical pattern for multi-step workflows without global transactions.
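
A toy sketch of the compensation pattern: run steps in order and unwind completed ones in reverse on failure. It ignores real concerns like durable saga state and idempotent compensations, so treat it as the shape of the idea, not an implementation; the step names are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order; compensate in reverse on failure."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()              # unwind earlier successful steps
        raise

def fail_shipping():
    raise RuntimeError("shipping service down")

try:
    run_saga([
        (lambda: print("reserve inventory"), lambda: print("release inventory")),
        (lambda: print("charge card"),       lambda: print("refund card")),
        (fail_shipping,                      lambda: print("cancel shipment")),
    ])
except RuntimeError:
    print("saga aborted; earlier steps were compensated")
```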

How do you think about workflow orchestration tools like Temporal, Inngest, or Step Functions?

I see workflow orchestrators as tools for making multi-step execution more explicit, durable, and observable. They are especially useful when steps span time, dependencies, retries, and human or external-system boundaries. But I still care a lot about the semantics underneath them, like retry behavior, step timeouts, state persistence, ordering assumptions, and what happens when a worker crashes mid-step. I do not want workflow code to become a black box that hides complexity. The orchestrator should make complexity manageable, not invisible. I usually evaluate them based on execution model, operational overhead, debugging experience, and how clearly they express failure and compensation.

When is an orchestrator overkill?

When the workflow is short, stateless, easy to retry safely, and does not need durable coordination across time. In that case a simple service plus queue may be enough.

What logic belongs in the orchestrator versus the worker?

The orchestrator should own sequencing, waiting, retries, and workflow state. The worker should own the actual business action or side effect. That keeps coordination visible and execution units testable.

How do retries differ between steps and whole workflows?

Step retries are usually local and targeted. Whole-workflow retries are broader and can re-execute more state unless carefully controlled. That is why retry boundaries matter.

What operational tradeoffs matter when choosing one?

Execution durability, developer ergonomics, debugging, vendor lock-in, throughput limits, pricing model, and how much operational burden the tool itself introduces.

Shared scenario for MongoDB design and performance

The system launched with a simple MongoDB model and small traffic. Six months later, data volume is much higher, some documents have grown large, new product requirements added more filters and sorting patterns, and a previously acceptable query is now slow enough to affect worker throughput and backlog recovery.

How do you think about schema design tradeoffs in MongoDB?

MongoDB gives you flexibility, but flexible schema does not mean schema should be accidental. I start from access patterns, document growth expectations, update frequency, and how related data is read together. Then I decide where embedding makes sense versus where references are safer. Embedding can simplify reads, but it can also create oversized documents, duplication, or painful update paths as the model evolves. The tradeoff is usually read efficiency versus long-term maintainability and write behavior. I try to make the document model reflect real query patterns rather than generic entity diagrams.
In this scenario, what goes wrong: A model that felt convenient early can become painful when documents keep growing, new fields get queried in different ways, and workers touch more of the document than they need.
How this answer helps in that scenario: Thinking in terms of access patterns and document growth helps you avoid a design that looks simple at launch but turns into a performance and maintenance problem under scale.

When would you embed versus reference?

I embed when the data is read together, bounded in size, and changes with the parent. I reference when the relationship is large, reused across entities, or updated independently.
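
Illustrative document shapes for both choices, written as Python dicts; the collections and fields are hypothetical.

```python
# Embedded: the address is small, bounded, read together with the user,
# and changes with the user -> embedding keeps reads simple.
user_embedded = {
    "_id": "u1",
    "name": "Ada",
    "address": {"city": "London", "zip": "EC1"},
}

# Referenced: orders are unbounded, queried independently, and have their
# own lifecycle -> keep them in their own collection and link by ID.
user = {"_id": "u1", "name": "Ada"}
order = {"_id": "o1", "user_id": "u1", "total": 42.0, "status": "shipped"}
```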

What document growth issues do you watch for?

Unbounded arrays, repeated embedded history, large nested blobs, and patterns where documents keep expanding with normal usage. Growth affects storage, index behavior, and update cost.

How do you evolve schema safely over time?

Backward-compatible reads first, then gradual writers, then cleanup. In practice that means code that can handle both old and new shapes while the migration is in progress.

What anti-patterns do you see often in MongoDB design?

Treating schemaless as designless, over-embedding because it looks convenient early, and ignoring query patterns until performance is already bad.

How do you approach indexing strategy in MongoDB?

Indexing should follow actual query patterns, not guesswork. I look at the most important reads, sort patterns, filters, and cardinality, then build indexes that support those paths efficiently. I also watch the write cost, because every extra index adds overhead to inserts and updates. Good indexing is a tradeoff, not a checklist. As usage evolves, I revisit slow queries, execution plans, and index usage so the index set stays aligned with reality instead of drifting into clutter. I also pay attention to compound index order, selective fields, and avoiding indexes that look useful but are rarely used.
In this scenario, what goes wrong: A query that was fine at launch may degrade once data volume grows or a new sort pattern appears. If the index set does not evolve with the workload, workers spend too long scanning and the backlog drains too slowly.
How this answer helps in that scenario: A query-pattern-driven indexing strategy makes the database support the workload you actually have, instead of the workload you had months ago.

How do you choose field order in a compound index?

Based on the query shape: equality filters first in many cases, then sort fields, then less selective range fields depending on access pattern. The right order depends on the real query plan, not a memorized rule alone.
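
A hedged example with pymongo, assuming a hypothetical jobs collection that is filtered by tenant and status and sorted by creation time; the connection string and names are placeholders.

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
jobs = client["app"]["jobs"]

# Query shape: {tenant_id: ..., status: ...} sorted by created_at descending.
# Equality fields lead and the sort field follows, so the index can satisfy
# both the filter and the sort without an in-memory sort.
jobs.create_index([
    ("tenant_id", ASCENDING),     # equality
    ("status", ASCENDING),        # equality
    ("created_at", DESCENDING),   # sort
])
```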

What signals tell you an index is missing or wrong?

Slow queries, collection scans, poor execution stats, low selectivity, or indexes that exist but are not chosen by the planner. If the plan is doing too much work, the index strategy is probably off.

How do indexes hurt write-heavy workloads?

Every insert, update, and delete has to maintain them. Too many indexes increase write latency, storage use, and memory pressure, so indexing has to be selective.

When would you remove an index?

When it has low usage, overlaps heavily with a better index, or its write cost no longer justifies the read benefit. Stale indexes are not harmless.

How do you diagnose MongoDB performance issues as data and query patterns evolve?

I first want to identify whether the bottleneck is query shape, indexing, document size, working set pressure, connection behavior, or infrastructure limits. Then I look at slow query logs, execution stats, index usage, and resource saturation. In many cases the issue is not MongoDB itself but an application pattern, like N+1 access, unbounded scans, over-fetching, or a data model that no longer matches how the system is used. The fix depends on what changed. Sometimes it is a new index. Sometimes it is rewriting the query. Sometimes it is changing the document model or partitioning strategy.
In this scenario, what goes wrong: Teams often blame MongoDB generically when the real issue is that the workload changed and the data model, query shape, or index strategy did not keep up.
How this answer helps in that scenario: It gives you a structured way to find the actual bottleneck instead of jumping straight to scaling or blaming the database without evidence.

What would you check first for a suddenly slow query?

Whether the plan changed, whether data volume or selectivity shifted, whether an index was dropped or became ineffective, and whether resource saturation increased at the same time.
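
A sketch of checking the plan with MongoDB's explain command via pymongo, requesting executionStats so you see how much work the plan did, not just which plan was chosen; the collection, filter, and connection string are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["app"]

plan = db.command({
    "explain": {
        "find": "jobs",
        "filter": {"tenant_id": "t1", "status": "pending"},
        "sort": {"created_at": -1},
    },
    "verbosity": "executionStats",
})

stats = plan["executionStats"]
ratio = stats["totalDocsExamined"] / max(stats["nReturned"], 1)
print(stats["executionTimeMillis"], "ms,", ratio, "docs examined per doc returned")
# A high examined/returned ratio or a COLLSCAN stage usually means the
# index set no longer matches the query shape.
```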

How do you distinguish database bottlenecks from application bottlenecks?

I compare query execution behavior with application traces and resource saturation. If the database is healthy but the app is slow, the bottleneck may be connection handling, serialization, chatty access patterns, or retry amplification.

What role does document size play in performance?

Large documents cost more to read, transfer, cache, and update. They also inflate the working set, which indirectly makes indexes and caching less effective because the overall workload gets heavier.

When does sharding actually help, and when does it just add complexity?

It helps when a single node is the real capacity limit and the shard key distributes load well. It adds complexity when the workload is not truly outgrowing simpler options or when the shard key creates hot spots.

What is one distributed systems mistake teams make often?

They treat successful happy-path integration tests as proof that the system is reliable. Real failures are usually around timeouts, retries, duplicates, out-of-order events, partial success, and saturation, not whether the basic API call worked once.

How would you test those failure cases?

Failure injection, chaos-style dependency disruption, duplicate message replay, delayed delivery, rate-limit simulation, timeout simulation, and load tests that push the system into recovery paths.

Which ones matter most in production?

The ones that match the real shape of the system: dependency latency, retry amplification, saturation, duplicate execution, and operator visibility gaps. Those are the ones that usually hurt first.

Personal answer framing

Use this pattern when answering out loud:
  1. Start with the principle
  2. Name the mechanisms you would use
  3. Mention the main tradeoff
  4. End with what you would watch in production
Example shape:
I usually start by assuming the system will see bursts, partial failure, and duplicate execution, so I design around that instead of around the happy path. In practice that means queues, idempotent workers, bounded retries with backoff and jitter, and clear backpressure at the actual bottleneck. The tradeoff is that you gain resilience but also add operational complexity and eventual consistency concerns. In production, I would watch queue depth, queue age, retry rate, saturation, and user-facing latency to see whether the design is holding under load.