DB DevBrain

DevOps Interview Prep

DevOps Screening Cheat-sheet

Links

FAQ

In Linux, what is OOM?

Answer: OOM stands for Out Of Memory. It happens when the Linux kernel can’t allocate memory because both RAM and swap are exhausted. When that happens, the kernel triggers the OOM killer, which forcefully terminates one or more processes to free memory and keep the system alive instead of letting it completely hang or crash.
The OOM killer chooses which process to kill based on factors like how much memory it’s using, its priority, and whether it’s considered critical. You’ll usually see OOM events in dmesg or system logs. In containerized environments, OOM is very common when memory limits are too tight or when there’s a memory leak.

What are Linux cgroups?

Answer: cgroups, short for control groups, are a Linux kernel mechanism used to manage and limit resource usage for groups of processes. They allow you to control things like CPU usage, memory limits, disk IO, and the number of processes a group can create.
They’re heavily used by container runtimes and systemd. For example, when you set a memory limit on a container, that limit is enforced using cgroups. If a process exceeds its memory limit, it can be OOM-killed inside the cgroup without affecting the rest of the system.
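Example (a minimal sketch of how a container memory limit is expressed in Kubernetes; the image name and values are placeholders): the limit below is enforced through the memory cgroup, and exceeding it gets the container OOM-killed.

apiVersion: v1
kind: Pod
metadata:
  name: memory-limited-app
spec:
  containers:
    - name: app
      image: nginx:1.25          # placeholder image
      resources:
        requests:
          memory: "128Mi"        # used by the scheduler to place the pod
        limits:
          memory: "256Mi"        # enforced via the memory cgroup; exceeding it triggers an OOM kill of the container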

What are Linux namespaces?

Answer: Linux namespaces provide isolation by giving processes their own view of system resources. Each namespace makes a process think it’s running in its own environment, even though it’s sharing the same kernel with other processes.
There are namespaces for things like process IDs, networking, filesystem mounts, users, hostnames, and IPC. For example, a process in its own PID namespace sees itself as PID 1, and a process in a network namespace has its own network interfaces and routing table.

What is the difference between cgroups and namespaces?

Answer: They solve different problems. Namespaces are about isolation and visibility: what a process can see. cgroups are about resource control: how much CPU, memory, or IO a process is allowed to use.
Containers combine both. Namespaces isolate the container from the host and other containers, while cgroups ensure it doesn’t consume more resources than it’s allowed to.

What is the TCP handshake?

Answer: The TCP handshake is the process used to establish a reliable TCP connection between a client and a server. It’s a three-step exchange. First the client sends a SYN packet to request a connection. The server responds with SYN-ACK to acknowledge and share its own sequence number. Finally, the client sends an ACK, and the connection is established.
This process ensures both sides are reachable and synchronizes sequence numbers before any data is sent. Operationally, issues here show up as connections stuck in SYN-SENT or SYN-RECV states, and attacks like SYN floods exploit this phase.

What is a 502 error?

Answer: A 502 Bad Gateway error means that a proxy, load balancer, or gateway received an invalid or no response from an upstream server. The client successfully reached the gateway, but the gateway couldn’t successfully talk to the backend service.
This often happens when the backend crashes, times out, restarts during a request, or is misconfigured. You commonly see 502s in systems using Nginx, Envoy, cloud load balancers, or CDNs. It’s different from a 503, which usually means the service is unavailable, and a 504, which means the upstream didn’t respond in time.

ALB 502 Bad Gateway in EKS

Question: You’re getting 502 Bad Gateway from an AWS ALB in EKS. What do you check first?
Answer: A 502 from ALB usually means the ALB could not get a valid response from its targets, or the targets are unhealthy/unreachable. I follow a fast checklist to isolate whether it’s ALB config, Kubernetes wiring, networking, or app behavior.
Checklist:
Traffic flow: Client → ALB listener → Target Group → Node/Pod IP → Container port
Cause and effect checks:
Concrete example:
Required rules:
If the Node/Pod SG does not allow 8080, ALB fails to connect to targets and returns 502.
Follow-up: What’s the fastest signal to narrow it down?
Follow-up answer: If the target group shows unhealthy, it’s usually health check config or Kubernetes wiring. If targets look healthy but ALB still returns 502, it’s usually security groups, protocol mismatch, or app-level connection failures.

What is a Pod Disruption Budget?

Answer: A Pod Disruption Budget, or PDB, defines how many pods of an application must remain available during voluntary disruptions in Kubernetes. Voluntary disruptions include things like draining nodes, cluster upgrades, or manual pod evictions (it does not control rolling updates).
You define a PDB using either minAvailable or maxUnavailable. Kubernetes checks the PDB before allowing a pod eviction. It doesn’t protect against crashes, OOM kills, or node failures, only against controlled maintenance actions. If configured too strictly, it can block node drains and upgrades.
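Example (a minimal PDB sketch; the name and label are placeholders): at least 2 pods matching the selector must stay available during voluntary disruptions such as node drains.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # evictions that would drop availability below 2 are blocked
  selector:
    matchLabels:
      app: api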

Pod Disruption Budget, what is it? How to use it?

Answer: A PDB limits how many Pods can be unavailable during voluntary disruptions like node drains, upgrades, or autoscaler removals.
You set:
Kubernetes will block evictions that violate the budget.
Common mistakes:
Tradeoff: better availability during maintenance, but it can slow or block cluster operations if misconfigured.

What types of Services exist in Kubernetes?

Answer: Kubernetes has several Service types to expose applications. ClusterIP is the default and is used for internal communication inside the cluster. NodePort exposes a service on a static port on every node, which is simple but not ideal for production.
LoadBalancer provisions an external load balancer through the cloud provider and is the most common choice for production workloads. ExternalName maps a service to an external DNS name without proxying traffic.
There’s also the concept of a headless service, where no cluster IP is assigned and clients get the pod IPs directly. This is commonly used with StatefulSets.
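Example (a minimal headless Service sketch, assuming a StatefulSet labeled app: db; names and the port are placeholders): because clusterIP is None, DNS returns the individual pod IPs instead of a virtual IP.

apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None          # headless: no virtual IP, clients resolve pod IPs directly
  selector:
    app: db
  ports:
    - port: 5432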

Vertical scaling vs horizontal scaling

Vertical scaling: making one machine bigger. More CPU, more RAM. Easy to do, no code changes, but you hit a hard limit and it’s a single point of failure.
Horizontal scaling: adding more machines. More complex, but it scales much better, improves availability, and is how most production systems are built.
In practice, teams often start vertical for speed, but aim for horizontal long-term.

Load balancer routing by URL – Layer 4 or Layer 7?

Layer 7.
If the load balancer looks at URLs, paths, hostnames, or headers, it’s inspecting HTTP, which is application layer. Layer 4 only sees IPs and ports.

App on a VM accessing a storage bucket – best authentication?

Use the VM’s managed identity.
Attach an IAM role or service account to the VM and give it least-privilege access to the bucket. The app uses short-lived credentials automatically. No hardcoded keys, no secrets in config.
That means:

K8s Deployment succeeds, everything healthy?

Question: If a Deployment rolled out with no errors, can you assume everything is OK?
Quick Answer: No. It only means Kubernetes scheduled the pods. You still need readiness probes, logs, metrics, and functional checks to confirm the app actually works.
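Example (a minimal probe sketch inside a Deployment’s pod template; the image, path, and port are placeholders): without a readiness probe, a rollout can look successful while the app is not actually able to serve traffic.

containers:
  - name: app
    image: example/app:1.2.3        # placeholder image
    readinessProbe:                 # gates traffic: the pod only receives requests once this passes
      httpGet:
        path: /healthz              # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:                  # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20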

Is it OK to run RabbitMQ or MySQL in a StatefulSet in production?

Question: Is running databases or brokers inside Kubernetes acceptable?
Quick Answer: RabbitMQ: OK if using the operator and proper storage. MySQL: Usually no. Use managed DB unless you have a strong infra team and specific reasons, because backups, upgrades, failover, replication, and data safety are operationally complex and easy to get wrong in Kubernetes.

Slow message processing: should you increase threads?

Question: If consumers process messages slowly, is “increase threads” the solution?
Quick Answer: Not by default. More threads only help if you are actually under‑utilizing CPU. In practice, slow consumers are usually blocked on I/O, waiting on the database, or competing for locks. Adding threads in those cases just increases context switching, DB pressure, and lock contention. First check where time is spent, then tune prefetch (how many messages a consumer pulls and buffers ahead of processing), batch sizes, queries, or the real bottleneck before scaling concurrency.

MySQL replication lag high, CPU low: increase resources?

Question: Replica lagging but CPU almost idle. Should you scale up?
Quick Answer: No. Lag is usually I/O bottleneck, single-threaded replication, long queries, huge transactions, or slow storage. Fix the root cause, not CPU/RAM.

MySQL on Kubernetes: what actually breaks first?

Question: In real production setups, what are the first failure modes you usually hit when running MySQL on Kubernetes?
Quick Answer: Storage and failover. Disk latency and volume semantics cause slow queries and long crash recovery because MySQL is extremely sensitive to fsync latency, write ordering, and predictable disk behavior, which Kubernetes volumes and network‑backed storage often cannot guarantee. Pod restarts that are routine in Kubernetes become dangerous for MySQL, since a restart can trigger long crash recovery, replay large redo logs, or leave replicas temporarily inconsistent. Network partitions further break assumptions MySQL replication relies on, making leader election and replica promotion error‑prone and risking split‑brain or stale reads. Backups and upgrades then amplify the risk, because they require tight coordination with replication, disk snapshots, and write traffic. When these operations are performed during normal Kubernetes events like rescheduling or rolling updates, they can easily turn expected maintenance into real data loss or prolonged downtime.

Linux server slow, CPU idle: is it hardware?

Question: If the server is slow but CPU is low, should you suspect hardware failure?
Quick Answer: Not first. Usually it’s disk I/O, memory pressure, swap, network issues, or blocked processes. Hardware is the last guess unless logs show actual errors.

Puppet run has no changes: everything OK?

Question: If Puppet runs clean with no drift, does it guarantee the system is fine?
Quick Answer: No. It only means Puppet thinks the state matches the manifest. Services can be unhealthy, configs wrong, or dependencies broken without Puppet noticing.

What happens to the cluster if etcd stops responding?

What etcd is
The source of truth for all cluster state (pods, nodes, configs, secrets).
What happens
Common causes
Mitigation

On which node does kubeadm run?

Key point
kubeadm is a bootstrap tool, not something that manages the cluster day-to-day.

Reasons for ImagePullBackOff / ErrImagePull in Kubernetes

Most common causes
How to debug

Where is sensitive data stored in Kubernetes?

Primary mechanism: Secrets
What they are
Encryption
Best practice

Difference between Secret and ConfigMap

ConfigMap
Secret
Rule of thumb If leaking it is bad, it belongs in a Secret.
  1. ConfigMaps for usability
  2. Secrets for controlled access

Difference between VM and Container

VM
Container

Where can environment variables be stored in Linux?

Common locations

What are taint and cordon? Use cases

Cordon
Taint
Example use cases

Best practices for writing a Dockerfile

Key rules

Docker: difference between bind mount and named volume

Bind mount

Maps a host path → container path.
Why it exists
Why it’s risky
Use when

Named volume

Managed by Docker, stored under Docker’s data directory.
Why it’s better
Why it’s recommended
Rule of thumb If data should survive container restarts and move cleanly between hosts, use a named volume.

Core VPC Design Considerations

Non-overlapping CIDR

Why: Overlapping CIDRs break peering, VPNs, and future mergers. Fixing overlap later is extremely painful.
Follow-up: What mistake do teams commonly make here?
Follow-up answer: Choosing CIDRs that are too small or conflict with on-prem or future environments.

Enough IP space

Why

Why do we design subnets per Availability Zone?

Answer: For fault isolation, predictable routing, and easier debugging.
Follow-up: What breaks if subnets aren’t AZ-aligned?
Follow-up answer: Failover behavior becomes unclear and debugging network issues gets much harder.
Key point: High availability in AWS is achieved by creating multiple subnets, one per AZ, and spreading resources across them.

Private IP ranges (RFC1918) and why they matter

Why choose carefully

How should you size VPC subnets?

Question: What is a typical subnet CIDR configuration, and why does sizing matter?
Answer: Subnet CIDRs must be planned intentionally. The CIDR determines how many IPs are available in each AZ, and running out of IPs causes hard failures. This is especially critical for EKS, where nodes, pods, ENIs, load balancers, and VPC endpoints all consume IP addresses.
Typical configuration (example):
Quick sizing intuition:
Common pattern: One public subnet and one private subnet per AZ. If you run EKS at scale, private subnets are usually the first ones you oversize.

How subnets are used in multi-AZ EKS

Question: How does EKS use subnets when the cluster spans multiple Availability Zones?
Answer: In a multi-AZ EKS cluster, you provide multiple subnets, typically one private subnet per AZ. EKS spreads worker nodes across these subnets, and the Kubernetes scheduler places pods on nodes in whichever AZ has capacity.
Each node consumes IPs from its subnet, and each pod consumes an IP from the node’s ENI allocation. This means IP pressure is per-AZ, not global. If one AZ’s subnet runs out of IPs, pods cannot be scheduled there, even if other AZs still have free capacity.
Load balancers also follow this model: an ALB or NLB creates one node per AZ and attaches to the corresponding subnets. If a subnet is missing or exhausted, that AZ is excluded from load balancing.
Key implication: High availability in EKS requires both enough subnets and sufficiently large CIDRs in every AZ. Multi-AZ does not save you from IP exhaustion if one subnet is undersized.

Common CDN caching protocols

Note: All of the mechanisms above (Cache-Control, ETag, Last-Modified, Vary, and status-based caching) are defined by the HTTP protocol and apply across HTTP/1.1, HTTP/2, and HTTP/3. They do not exist outside HTTP. CDNs and proxies may extend or override their behavior, but they all rely on these HTTP-defined semantics as the foundation.
Key point: If you understand HTTP caching, all CDNs behave similarly.

EKS: Infrastructure vs Workloads

Infrastructure (AWS-side)

How managed
Platform inside the cluster
How managed

Workloads

How managed

Why separate node groups in EKS

Common node pools
Why
How enforced

Who actually scales nodes in EKS

Trigger
Key insight

Can you get a static IP with an Internet Gateway?

Question: Can an Internet Gateway (IGW) give you a static IP?
Answer: No. An IGW is a routing target, not a resource with an IP. It scales and load-balances globally, so there is nothing to pin a static IP to.
Follow-up: Why did AWS design it this way?
Follow-up Answer: Static IPs would break elasticity and fault tolerance. AWS wants IGW traffic to scale transparently without customers depending on fixed IPs.

Why does ALB not support static IPs?

Question: Why is an Application Load Balancer DNS-only?
Answer: Because ALBs scale horizontally. AWS constantly adds and removes backing nodes (the underlying EC2 instances or network endpoints that actually receive and handle traffic behind the load balancer), so IPs change as part of normal operation.
Follow-up: What is the recommended way to integrate with ALB?
Follow-up Answer: Always rely on DNS. AWS explicitly designs ALB to be consumed via DNS, not IP pinning.

If you truly need static IPs in AWS, what are your options?

Answer:
Follow-up: What are the tradeoffs?
Follow-up Answer: NLB is L4 only with less routing intelligence. CloudFront adds another layer and complexity but gives caching, TLS termination, and WAF.

Why is NAT Gateway expensive?

Question: Why does NAT Gateway cost so much?
Answer: Because it’s fully managed, highly available, and horizontally scalable. You pay hourly plus per-GB processing.
Follow-up: Why do teams often underestimate NAT cost?
Follow-up Answer: Because NAT charges per GB. Kubernetes, image pulls, logs, and retries silently push large volumes through NAT.

NAT Gateway vs NAT Instance

Question: When would you use a NAT Instance instead of NAT Gateway?
Answer: Only at small scale when cost matters more than operational simplicity.
Follow-up: Why is a NAT Instance risky?
Follow-up Answer: You manage patching, scaling, and HA yourself. It’s easy to create a single point of failure.

How does IPv6 reduce NAT usage?

Answer: IPv6 gives every instance a globally unique IP, so outbound traffic doesn’t require NAT.
Follow-up: Why isn’t IPv6 widely adopted yet?
Follow-up Answer: Many services are still IPv4-only. Dual-stack increases complexity and makes debugging harder.

What are VPC Endpoints and why use them?

Question: What are VPC Endpoints, why would you use them, and do you need them for every service?
Answer: VPC Endpoints allow private access to supported AWS services without routing traffic over the public internet or through NAT. Traffic stays on the AWS backbone, which improves security and makes costs and latency more predictable.
You do not need endpoints for every service. Only AWS services that support PrivateLink or Gateway Endpoints can use them, and they should be added selectively. High-volume, AWS-internal traffic like S3, ECR, STS, SSM, and CloudWatch is usually a good fit. Low-volume, infrequent, or highly variable traffic is typically fine through NAT.
Why they cost money: Interface Endpoints are implemented as managed network interfaces (ENIs) and are billed per hour plus per GB processed. As you add more services and Availability Zones, the number of ENIs grows, and so does the cost. Gateway Endpoints for S3 and DynamoDB are the exception: they are route-table entries and free.
Rule of thumb: If traffic is predictable, high-volume, and stays within AWS, use a VPC Endpoint. If traffic goes to many external or changing destinations, or volume is low, keep using NAT.

NAT vs VPC Endpoints at scale

Question: How does cost behavior differ between NAT and VPC Endpoints?
Answer: NAT scales with traffic volume, while endpoints are mostly flat and predictable. At large scale, endpoints are often cheaper.
Follow-up:  What’s the usual cost-optimized pattern?
Follow-up Answer: Use VPC Endpoints for AWS services and NAT only for true external internet access.
Example use case: A private EKS cluster pulling images from ECR and writing logs to CloudWatch. Without VPC Endpoints, all this traffic goes through NAT and scales linearly with usage. With Interface Endpoints for ECR, S3, and CloudWatch, the traffic stays inside the AWS network, costs become predictable, and NAT is only used for real outbound internet calls like external APIs.
Example where NAT is still the right choice: Your workloads need to reach many changing third‑party services on the public internet (payment providers, SaaS APIs, public package registries, webhooks), and you also want a single, controlled outbound egress point with stable source IPs for allowlisting. In that case, NAT is the simplest default: endpoints won’t help (they only cover specific AWS services), and trying to replace NAT with dozens of service-specific paths becomes operationally messy.

Why do some teams put instances in public subnets?

Answer: To avoid NAT cost and simplify routing.
Follow-up: Why is this dangerous?
Follow-up Answer: It increases the attack surface and is easy to misconfigure. One bad security group can expose everything.

When are public IPs acceptable?

Question: When is it acceptable to use public IPs?
Answer:  For stateless services with no SSH access, SSM only, and very tight security groups.
Follow-up: Would you ever do this in regulated environments?
Follow-up Answer: Rarely. Most regulated environments prefer private subnets with controlled egress.

Why use multiple AWS accounts or environments?

Answer: To reduce blast radius, enforce clean IAM boundaries, and simplify audits and billing.
Follow-up: What’s a common anti-pattern?
Follow-up answer: Putting prod, staging, and experiments in the same account and relying only on naming conventions.

Why consider IPv6 even if you don’t use it today?

Answer: It future-proofs networking and can reduce NAT dependency long term.
Follow-up: How should teams adopt IPv6?
Follow-up answer: Gradually, usually via dual-stack, starting with non-critical services.

Requests vs usage

Question: Why does Kubernetes scale nodes based on requests, not usage?
Answer: Because the scheduler guarantees capacity. If requests can’t be satisfied, pods can’t be scheduled safely.
Follow-up: What’s the common failure mode here?
Follow-up answer: Over-requesting CPU or memory causes unnecessary node scaling and higher cost.
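Example (a sketch of the container resources block; the values are placeholders): the scheduler and node autoscaling react to the requests, not to actual usage, so over-requesting directly translates into extra nodes.

resources:
  requests:
    cpu: "250m"          # reserved by the scheduler; this is what drives node scaling
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"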

What health checks does an ASG use by default?

Answer: EC2 health checks only: hardware, network reachability, and hypervisor status.
Follow-up: What does that mean in practice?
Follow-up answer: ASG only knows whether the EC2 instance is alive from AWS’s perspective. It has no understanding of Kubernetes health.

Why ASG alone is insufficient in EKS

Question: Why can’t you rely on ASG alone for EKS scaling?
Answer: Because ASG has no awareness of Kubernetes scheduling, pod health, or resource pressure.
Follow-up: What happens if you don’t run a Cluster Autoscaler?
Follow-up answer: Pods remain pending forever even though ASG is healthy and running.

What ASG health checks ignore

Question: What does ASG not consider unhealthy?
Answer:
Follow-up: Why is this dangerous in EKS?
Follow-up answer: Kubernetes may consider a node broken while ASG keeps it running, leading to stuck workloads and silent failures.

If a node is NotReady, will ASG replace it?

Answer: No. ASG only replaces instances if the EC2 health check fails.
Follow-up: So how are bad nodes handled in EKS?
Follow-up answer: Through Kubernetes controllers, Cluster Autoscaler logic, Karpenter, or manual remediation. ASG itself is unaware.

What is ASG’s role in EKS?

Answer: It manages EC2 instances only and ensures the desired number of nodes exist.
Follow-up: Does ASG understand Kubernetes at all?
Follow-up answer: No. ASG does not know about pods, scheduling, or cluster state.

Pods failing to schedule and Auto Scaling Groups

Question: If pods are failing to schedule, will that trigger the Auto Scaling Group (ASG)?
Answer: Not by itself. Unschedulable pods do not directly trigger an ASG scale-out. ASGs react to EC2-level signals (like CPU, memory via CloudWatch, or explicit scaling policies), not Kubernetes pod states.
What actually triggers scale-out: In EKS, scale-out happens only if you run a Cluster Autoscaler (or Karpenter). The autoscaler watches for pods that cannot be scheduled due to lack of resources and then requests new nodes by increasing the ASG desired capacity (or provisioning new instances).
Common failure case: Pods are pending, but the ASG does nothing because:
Mental model: Kubernetes decides what it wants to run. The autoscaler translates that into how many nodes are needed. The ASG only follows scaling instructions, it does not understand pods.

ASG vs HPA

Question: Does HPA affect ASG directly?
Answer: No. HPA scales pods, ASG scales nodes. They are decoupled.
Follow-up: How do they interact indirectly?
Follow-up answer: HPA creates more pods. If pods become unschedulable, the Cluster Autoscaler increases ASG desired capacity.
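Example (a minimal HPA sketch targeting a Deployment named web; names and numbers are placeholders): the HPA only changes replica counts, and it is the Cluster Autoscaler that later adds nodes if those replicas cannot be scheduled.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # scale out when average CPU usage vs requests exceeds 70%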

Typical ASG + EKS failure scenario

Question: Describe a common real-world failure involving ASG in EKS.
Answer: Nodes are EC2-healthy but Kubernetes NotReady due to kubelet or CNI issues. ASG does nothing, pods stay pending, and the cluster appears half-alive.
Follow-up: How do experienced teams mitigate this?
Follow-up answer: Monitoring node readiness, automated remediation, Karpenter, and clear operational runbooks.

Scaling failure patterns

Pods pending but no scale
ASG scaled but pods still pending

Well architected web application in AWS with a DB, what kind of services?

Answer: A typical setup is:
Tradeoffs: managed services reduce ops work but cost more. Aurora scales better but can be overkill early.

Can the Terraform state file contain sensitive data?

Answer: Yes. State often contains real values returned by providers, including passwords, tokens, endpoints, private IPs, and sometimes more.
Important nuance: marking something as sensitive only hides it in CLI output. It does not keep it out of state.
Best practice: treat state like a secret, restrict access hard, use encryption at rest, and avoid managing secrets directly in Terraform when possible.

If I lost the Terraform state file, what happens to the resources?

Answer: The resources keep running. Terraform just becomes blind.
What breaks:
Recovery:

What is Kubernetes and when to use it?

Answer: Kubernetes is a container orchestration platform. It schedules containers, restarts failed ones, scales them, and provides service discovery and networking with declarative configs.
Use it when:
Avoid or delay it when:
Tradeoff: power and flexibility vs operational complexity.

Kubernetes update strategies

Question: What is an update strategy in Kubernetes, what are the common ones, and how does each work in simple terms?
Answer: An update strategy defines how Kubernetes replaces existing Pods when you deploy a new version.
Common strategies and patterns:
Follow-up: Which update strategies are built in to Kubernetes?
Follow-up answer: Deployments support RollingUpdate (the default) and Recreate. StatefulSets and DaemonSets support RollingUpdate and OnDelete.
Blue-green and canary are rollout patterns built using Services, Ingress, or external controllers, not native strategies.
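Example (a minimal RollingUpdate sketch on a Deployment; names, image, and numbers are placeholders): maxSurge and maxUnavailable control how aggressively old pods are replaced.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most 1 extra pod above the desired count during the rollout
      maxUnavailable: 1      # at most 1 pod below the desired count at any time
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:2.0.0   # placeholder image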

Update strategy in Deployment vs StatefulSet

Question: How does update strategy behave differently in a Deployment compared to a StatefulSet?
Answer:
Follow-up: How do maxUnavailable and maxSurge affect the update strategy?
Follow-up answer: maxUnavailable caps how many pods can be unavailable during the rollout, and maxSurge caps how many extra pods can be created above the desired replica count.
These fields are set on a Deployment under spec.strategy.rollingUpdate. They control the speed vs safety tradeoff: higher values mean faster rollouts but more resource usage or risk.
Important:

Difference between a library Helm chart and an umbrella Helm chart?

Answer:
Mental model: library charts define patterns, umbrella charts assemble systems.

In Grafana, when having many clusters, how do you aggregate all of them?

Answer: You usually aggregate at the metrics backend, not in Grafana.
Common patterns:
Tradeoff: centralized backend is more infra, but scales and simplifies querying.

What backend exists for Prometheus?

Answer: Common Prometheus-compatible backends include:
Rule of thumb:

What’s the difference between monitoring and observability?

Monitoring
Observability
How they work together

What are load balancing algorithms?

Answer:
Load balancing algorithms decide which backend instance gets a request so traffic is spread efficiently and predictably.
Common algorithms:
Layer matters

Diagnosing an application that keeps missing its SLA

Question: An application is consistently missing its SLA. How do you approach fixing it?
Answer: I use a structured approach: clarify the SLA, measure the system, then apply fixes based on whether the issue is load-related or correctness-related.

Step 0: Clarify the SLA breach

First, I confirm what is actually broken:
Kubernetes metrics to check:
Example alert:

Step 1: Prove where the time or failures go

I don’t guess. I break metrics down per dependency, not just the app.
Golden signals per component:
Kubernetes metrics to check:
Key insight: Most SLA breaches come from downstream dependencies, not the application code.

Step 2: Separate load problems from correctness problems

If the SLA breaks only under load

Likely signals:
Kubernetes metrics and alerts:
Typical fixes:

If the SLA breaks even at low load

Likely signals:
Kubernetes metrics and alerts:
Typical fixes:

Security risk of running containers as root

Question: What’s the risk of running a container as root?
Answer: Running a container as root increases the blast radius of any isolation failure. Containers are isolated mainly by namespaces and cgroups, not full VM boundaries, and root inside the container can map to host-level root unless user namespaces are used.
Concrete risks:
Follow-up: Why do people still run containers as root?
Follow-up answer: Usually convenience or legacy image assumptions, but it’s rarely justified.
Best practice: run as non-root, drop capabilities, use read-only filesystems, and enforce policies with Pod Security / admission controls.
Where this is enforced (important distinction):
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]


Multi-tenant “two teams can deploy” in one cluster

Question: You have a multi-tenant Kubernetes cluster with two teams. Both teams need to deploy, but must not touch each other’s workloads. How do you set this up?
Answer: I set it up as a complete package: Namespaces for object grouping, RBAC for API boundaries, NetworkPolicies for traffic isolation, quotas for resource fairness, and pod security/policy to prevent privilege escalation.

Rule of thumb (interview line)

Namespaces isolate objects. RBAC isolates API access. NetworkPolicies isolate traffic. Quotas isolate resources. Pod Security isolates privilege.
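Example (a minimal RBAC sketch for one team, assuming a namespace team-a and an identity-provider group team-a-developers, both placeholders): the Role and RoleBinding are namespaced, so the team can deploy only inside its own namespace. ResourceQuotas and NetworkPolicies follow the same per-namespace pattern.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-deployer
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["deployments", "pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-deployer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers        # placeholder group from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-deployer
  apiGroup: rbac.authorization.k8s.io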

High-level architecture (K8s + Postgres + Mongo + Kafka)?

Core Position

Run stateless services on Kubernetes. Keep PostgreSQL, MongoDB, and Kafka managed unless constraints force self-hosting.
Managed data services reduce risk and toil around backups, failover, upgrades, and incident response.
Enforce platform standards across all services: RBAC, network policy, autoscaling, observability, secrets, and safe progressive deployments.
Goal: predictable operations and safe scaling, not short-term speed.

Why this split works

1) Stateless compute belongs in Kubernetes

Kubernetes is excellent at:
Why this matters: Kubernetes gives you repeatability for app runtime. Teams can ship faster with fewer manual steps, and operational behavior is consistent across services.

2) Stateful systems are operationally expensive

Databases and event brokers are failure-sensitive and operations-heavy. Running PostgreSQL, MongoDB, or Kafka yourself on Kubernetes means your team owns:
Why this matters: Most incidents in distributed systems are data-path incidents. Managed services reduce toil and reduce the chance your product team gets dragged into infrastructure firefighting.

3) Faster recovery and lower incident blast radius

Using managed PostgreSQL, Atlas, and managed Kafka usually improves:
Why this matters: You recover faster, and failures are isolated better. Reliability improves without forcing every application engineer to become a database SRE.

Unified Data Layer and Contract Strategy

In microservices, most failures happen at service and data boundaries. So storage choices and schema evolution must be designed together.

Data layer (managed) and AWS mapping

Guiding rule: Each service owns its data boundary. Avoid shared-write databases across services.

Decision rule: when to self-host stateful systems on Kubernetes

Only self-host PostgreSQL, MongoDB, or Kafka on K8s if one or more of these are true:
If none apply, managed is usually the better engineering and business decision.

Risks if you ignore this model

If everything runs in-cluster without strong standards, typical outcomes are:
You get short-term speed, then operational drag.

MongoDB connections keep climbing

If MongoDB connections keep climbing in Atlas, assume leak or pool misconfig until proven otherwise. Here’s what I’d check, in order.

1) Confirm what “connections increasing” actually means

In Atlas:
If it correlates with HPA or deploys, you probably have “pool per pod” explosion.

2) App-side: pooling and leaks (most common)

Look for these patterns:
Quick rule: pods * maxPoolSize should be within cluster capacity with headroom.

3) K8s scaling interaction

4) Atlas-side checks

In Atlas metrics and logs:

5) Identify the source quickly

Best move: attribute connections by app.

6) Concrete fixes you can say in interview


What value does ArgoCD bring?

Answer: ArgoCD gives GitOps control for Kubernetes. Git becomes the source of truth, and ArgoCD continuously reconciles cluster state to match it. Main value:

ArgoCD vs CircleCI: what is the difference?

Answer: They solve different stages of delivery.
A common setup is CircleCI builds and updates image tags in Git, then ArgoCD deploys.
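Example (a minimal ArgoCD Application sketch; the repo URL, path, and names are placeholders): ArgoCD watches this Git path and keeps the target namespace in sync with it.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-configs.git   # placeholder Git repo updated by CI
    targetRevision: main
    path: apps/my-service
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual drift in the cluster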

How can one web server host multiple websites (domains) on the same VM and same port 443?

Answer: A single web server (like Nginx or Apache) can host multiple websites on the same VM using virtual hosts (server blocks in Nginx). Even if 1.com and 2.com both resolve to the same public IP, the server can still serve the correct site.
When you type https://1.com:
  1. DNS resolves 1.com to the VM’s public IP.
  2. The browser connects to the VM on port 443.
  3. The server must start a TLS handshake and choose which SSL certificate to present.
  4. The problem is: the actual HTTP request (including the header Host: 1.com) is sent after TLS is established, meaning the hostname is inside the encrypted traffic.
  5. So the server cannot see the hostname early enough unless the client sends it during the handshake.
  6. That’s why the browser sends SNI, telling the server: “I’m connecting to this IP, but I want 1.com.”
  7. Nginx uses that hostname to select the right server_name block + certificate, then serves the correct website.
Example:
server {
    listen 443 ssl;
    server_name 1.com;
    ssl_certificate /etc/ssl/1.com.crt;
    ssl_certificate_key /etc/ssl/1.com.key;

    location / {
        proxy_pass http://app1;
    }
}

server {
    listen 443 ssl;
    server_name 2.com;
    ssl_certificate /etc/ssl/2.com.crt;
    ssl_certificate_key /etc/ssl/2.com.key;

    location / {
        proxy_pass http://app2;
    }
}

What is SNI?

Answer: SNI (Server Name Indication) is a TLS extension (defined in the TLS standard, originally RFC 3546 and later RFC 6066) that allows the client to include the hostname during the TLS handshake, before encryption is established.
Key point: The hostname (Host: 1.com) is part of the encrypted HTTP traffic, so the server needs SNI to know which site and certificate to serve before encryption is established.

Name common compliance frameworks

ISO 27001
A framework that defines how an organization should manage and protect information securely through policies and controls.
SOC 2
A standard that evaluates whether a company properly protects customer data based on security and availability principles.
PCI-DSS
A mandatory security standard for companies that store, process, or transmit credit card information.
HIPAA
A U.S. regulation that protects sensitive healthcare and medical information.

Follow-up: What does compliance mean from a DevOps perspective?

From a DevOps perspective, compliance means enforcing security technically, not just documenting it: implement least-privilege IAM, ensure all infrastructure changes go through auditable CI/CD pipelines, enforce encryption at rest and in transit, centralize logs with proper retention, store secrets in a secure manager, run regular vulnerability scans, test backups and disaster recovery, separate environments clearly, and maintain strict role-based access controls. For PCI and HIPAA in particular, you also need strong network segmentation, tighter access restrictions, detailed audit trails, and proper handling or masking of sensitive data.

What is KEDA?

Answer: KEDA (Kubernetes Event-Driven Autoscaling) allows workloads to scale based on external event sources such as Kafka lag, SQS queue length, RabbitMQ messages, Redis depth, or Prometheus queries, not just CPU or memory. It can also scale workloads down to zero.

Follow-up: How does it work internally?

KEDA creates a ScaledObject, polls the external source (for example Prometheus), converts the result into an external metric, and feeds it to an HPA. The HPA performs the actual scaling.
Mental model: KEDA reads external signals. HPA changes replica count.

Follow-up: How does KEDA work with Prometheus?

Answer: KEDA uses a Prometheus scaler. You define a PromQL query and threshold. KEDA periodically executes the query, exposes the result as an external metric, and HPA scales based on that metric.
Key point: KEDA does not replace HPA. It extends it.
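Example (a minimal ScaledObject sketch using the Prometheus scaler; the Deployment name, Prometheus address, and PromQL query are placeholders): KEDA runs the query, exposes the result as an external metric, and the HPA it creates scales the Deployment.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                   # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090            # assumed Prometheus service address
        query: 'sum(rate(http_requests_total{app="worker"}[2m]))'       # hypothetical PromQL query
        threshold: "100"           # target value per replica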

What are the ways to run startup commands in AWS and GCP?

Option 1: Startup Script

What is it?

A boot-time script hook used to run commands when a VM starts.
This is the most common and fastest way to bootstrap a VM.

Typical use cases

Format and supported languages

Does it run every boot?

Main tradeoffs

Pros
Cons

Platform-specific notes

GCP specifics
AWS specifics

Option 2: Cloud-init

What is it?

A Linux OS initialization system that reads boot-time configuration (user data) and applies it during startup.
This is usually the cleaner option for structured provisioning.

Typical use cases

Format and supported languages

Primary format
Also supports

Does it run every boot?

Main tradeoffs

Pros
Cons

Platform-specific notes

GCP specifics
AWS specifics

How do you choose between Startup Script and Cloud-init?

Use Startup Script when

Use Cloud-init when

Practical rule

If the boot logic is getting large, stop stuffing it into startup scripts. Bake more into the image or move provisioning to a proper config tool.

Service type LoadBalancer vs Ingress

In a nutshell

Service type LoadBalancer exposes a single Kubernetes Service directly through an external load balancer, usually at Layer 4, so it is great for simple exposure and also for non-HTTP protocols. Ingress is a Layer 7 HTTP/HTTPS routing resource that sits in front of multiple Services and routes traffic based on hostnames and paths, usually through an Ingress Controller. Use LoadBalancer for simple direct exposure or non-HTTP traffic, and use Ingress when you want centralized web routing, TLS termination, and one public entry point for multiple services.

Service of type LoadBalancer?

Exposes an application externally by asking the underlying cloud provider or load balancer integration to create a network load balancer for that Service.
Flow: external LB → Service → Pods

Ingress

A Kubernetes API object. It does not expose traffic by itself. It needs an Ingress Controller (e.g. NGINX Ingress, AWS Load Balancer Controller, Traefik, or Kong).
Flow: external LB → Ingress Controller → Service → Pods

Comparison table

| Feature | LoadBalancer Service | Ingress |
| --- | --- | --- |
| Use cases | Single app exposure, non‑HTTP protocols, internal LB, simple environments | Multiple web apps, shared endpoint, host/path routing, centralized TLS |
| Protocol level | L4 (TCP/UDP) | L7 (HTTP/HTTPS) |
| Routing | No routing, forwards to one Service | Host and path based routing |
| TLS management | Per service / per LB | Centralized |
| Pros | Simple, direct, no controller, easy debugging | One entry point, cheaper at scale, centralized TLS, flexible routing |
| Cons | One LB per service (i.e. expensive at scale), no L7 routing | Needs controller, more moving parts, HTTP/HTTPS only |
| Rule of thumb | Use for one service or non‑HTTP traffic | Use for many HTTP services |
| Complexity | Low | Medium |

Certificates and TLS handling with Ingress/LoadBalancer

With LoadBalancer Service

Certificate handling is more fragmented because each exposed service may handle TLS separately. TLS can terminate at the cloud load balancer, inside the app, or in a reverse proxy.
With many separately exposed services, TLS management is more distributed. You may need multiple certificates, or you may reuse wildcard/SAN certificates, but you still have multiple public endpoints, listeners, repeated DNS mappings, and repeated renewal setup.
With Service type LoadBalancer, teams often expose apps like this:
If each LB terminates TLS independently, then each LB needs a cert that covers its hostname. That often means:

With Ingress

TLS is centralized at the Ingress layer.
Common patterns are:
That is usually easier because:
Kubernetes TLS secret object
When using a service like Let’s Encrypt, cert-manager obtains the certificate from the ACME server and saves it in Kubernetes as a TLS Secret (type: kubernetes.io/tls). The Ingress Controller must have access to this Secret in order to terminate TLS, so during the handshake it loads the certificate and key and presents the correct certificate for the requested host.
Common certificate flow with Ingress
  1. You create an Ingress for app.example.com
  2. The Ingress includes TLS configuration
  3. cert-manager requests a certificate from an issuer such as Let's Encrypt
  4. The certificate is stored in a Kubernetes secret
  5. The Ingress Controller loads that certificate
  6. Client connects with HTTPS
  7. TLS terminates at the Ingress Controller
  8. The controller routes the request to the correct backend Service
  9. The Service sends traffic to the Pods
So the request flow is usually:
Client -> public DNS -> external load balancer -> Ingress Controller -> Service -> Pods
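Example (a minimal Ingress-with-TLS sketch, assuming an NGINX ingress class and a cert-manager ClusterIssuer named letsencrypt-prod; hostnames and names are placeholders): cert-manager stores the issued certificate in the referenced Secret, and the controller uses it to terminate TLS.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumed ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls    # cert-manager writes the certificate and key here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80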

What is the request path when using LoadBalancer vs Ingress?

LoadBalancer
Client → DNS → Load Balancer → Kubernetes Service → Pod
Ingress
Client → DNS → ALB / ingress load balancer → Ingress rules → Service → Pod

Why use IAM roles instead of users or hardcoded credentials?

An IAM user is a long-lived identity that can have permanent credentials like passwords or access keys, while an IAM role is a temporary identity that is assumed when needed and provides short-lived credentials through STS. In modern AWS design, users are mainly for human access, while roles are preferred for workloads, services, CI/CD, cross-account access, and automation.
This approach follows core security principles:

How is CI/CD given a role in AWS?

Usually through STS AssumeRole or AssumeRoleWithWebIdentity.
Traditional pattern: The CI system has some initial AWS credentials and uses them to call AssumeRole into a target role.
Modern preferred pattern: The CI platform uses OIDC federation. Example: GitHub Actions gets an OIDC token from GitHub, AWS verifies it, and AWS lets that workflow assume a role without stored AWS secrets.
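Example (a minimal CloudFormation-style sketch of the GitHub Actions OIDC pattern, assuming the OIDC provider for token.actions.githubusercontent.com already exists in the account; the account ID, org, and repo are placeholders): the trust policy decides who may assume the role, and separate permission policies decide what the role can do.

GitHubActionsDeployRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: github-actions-deploy           # placeholder role name
    AssumeRolePolicyDocument:                 # trust policy: who may assume this role
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Federated: arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com
          Action: sts:AssumeRoleWithWebIdentity
          Condition:
            StringEquals:
              "token.actions.githubusercontent.com:aud": sts.amazonaws.com
            StringLike:
              "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:*"   # placeholder repo filter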

How does AssumeRole authorization flow work?

Layers flow
IAM Identity (user / role / service) → tries to assume a role → Trust policy (who is allowed to assume) → STS gives temporary credentials → Permission policy (what the role can do) → AWS API authorization
Runtime flow
Caller (user / service / CI) → tries AssumeRole → AWS checks trust policy → if allowed → temporary credentials → permission policies evaluated → API allowed / denied

How should I structure CI/CD for 300 Lambda functions deployed with SAM so one change does not hurt the others?

Problem: If hundreds of Lambdas are bundled into a small number of deployment units, small changes create unnecessary builds and deploys, increase blast radius, slow down the pipeline, and make ownership unclear.
Best practice: Group functions by service or domain, not all together and not necessarily one pipeline per Lambda. Keep deployment units small enough to limit blast radius, but standardized enough to stay maintainable at scale.
Approach

How do I move from a model where one person deploys infrastructure from their local machine to a model where the whole team can deploy safely?

Problem: Person-based deployment is risky. It creates a single point of failure, weak auditability, and too much trust in one engineer’s laptop.
Best practice: Make deployments standardized, reviewable, auditable, and repeatable.
Approach

How do I move changes from dev to production safely?

Problem: If dev and prod are built or deployed differently, environments drift and releases become less trustworthy.
Best practice: Do not rebuild separately for production. Promote the exact same version forward.
Approach

A developer wants a new DynamoDB table and a new Lambda. What permissions should they get?

Problem: Giving developers direct AWS create permissions scales badly and increases the risk of unsafe or inconsistent infrastructure changes.
Best practice: Developers should be able to propose infrastructure, not create it directly in production.
Approach

If production stability is my responsibility, how should I collaborate with engineers on infrastructure changes?

Problem: If changes go directly from engineers into production, accountability is unclear and production risk rises.
Best practice: Engineers should build through code, and the pipeline should deploy. This avoids both chaos and bottlenecks.
Approach

How should I manage Lambda access to DynamoDB and SQS?

Problem: When several functions share data stores and queues, broad permissions and hardcoded config quickly become messy and unsafe.
Best practice: Use least privilege for access and a central config store for environment-specific values.
Approach

We are a health tech company in Germany and ISO 27001 and GDPR are mandatory. How should I implement this in the platform?

Problem: Compliance fails when it is treated as documentation only or left to each service team to implement ad hoc.
Best practice: Treat compliance as a platform capability: access control, encryption, logging, auditability, retention, and process.
Approach

We currently emit a metric every time code logs an error, and that metric raises a CloudWatch alarm. What are the limitations of that approach?

Problem: Alerting on every application error sounds safe, but at team scale it usually creates more noise than value.
Best practice: Alert on error rate, latency, throttling, backlog, DLQ depth, and other service-level indicators. Use logs and traces for diagnosis.
Approach

How would I improve observability beyond basic CloudWatch alarms?

Problem: Basic metrics and alarms tell you that something is wrong, but not why.
Best practice: Make alerts actionable and make investigation fast.
Approach

Which third-party observability tools could improve on basic CloudWatch?

Problem: CloudWatch is useful, but many teams outgrow it when they need stronger correlation and easier root-cause analysis.
Best practice: Pick the tool that matches team maturity and operating model, not just the one with the most features.
Approach

How would I implement tracing across Lambda, SQS, and DynamoDB?

Problem: Logs from separate services do not give an end-to-end view of where a request slowed down or failed.
Best practice: Tracing should show the full path across the event-driven system, not isolated service fragments.
Approach

How do I decide whether a Lambda, DynamoDB table, or queue is actually being used?

Problem: Unused infrastructure wastes money and increases complexity, but deleting the wrong thing can break hidden consumers.
Best practice: Do not guess. Verify usage, deprecate safely, then remove through IaC.
Approach

Can Terraform detect unused resources automatically?

Problem: Teams often assume Terraform can tell whether something is safe to delete, but that is not what Terraform does.
Best practice: Terraform manages what should exist, not whether the business still uses it.
Approach

In a fully serverless environment with tenant-specific components and third-party services that may run on EC2 or elsewhere, how should I design the network topology safely?

Problem: Mixing tenant-facing services, internal components, and third-party systems into one flat topology creates unnecessary risk and weak isolation.
Best practice: Segment by trust zone, minimize public exposure, and isolate third-party and tenant-sensitive components more aggressively than the rest of the stack.
Approach

When would I replace a synchronous API call with a queue?

Problem: Synchronous calls are simple, but they break down when latency, retries, spikes, or downstream instability become serious problems.
Best practice: Use a queue when you want buffering, decoupling, and more control over failure handling.
Approach

In a standard API stack with a client, load balancer, application server, and database, how do I spot hard latency problems?

Problem: Latency issues are easy to misdiagnose when you only look at averages or only watch one layer.
Best practice: Latency debugging works best when you move from system-level symptom to per-hop breakdown instead of guessing from one graph.
Approach

Shared scenario for workflow reliability, observability, and distributed systems

A multi-step agent workflow accepts inbound jobs from customers, stores metadata in MongoDB, calls an external enrichment API, and writes results back asynchronously. Load is bursty, retries happen automatically, some jobs are slow, and one downstream API has hard rate limits. The user-facing API should respond quickly, even when background processing is under stress.

How do you design agent or workflow systems so they stay reliable under real-world load?

I assume work will arrive in bursts, dependencies will fail, and messages may be delivered more than once. So I usually decouple ingestion from execution with queues, make workers idempotent, and define retry behavior explicitly instead of treating retries as a default safety net. I also add backpressure so the system can slow itself down instead of melting downstream dependencies. In practice that means concurrency limits, rate limits, timeouts, circuit breakers, and dead-letter handling for poison messages. The main goal is not just throughput. It is keeping the system predictable under stress. In production, I watch queue depth, message age, retry volume, failure rates, and saturation at each bottleneck.
In this scenario, what goes wrong: If inbound jobs are processed inline, a burst of traffic can push user-facing latency up immediately. If retries are blind and concurrency is uncapped, workers can hammer MongoDB and the external API at the same time, which makes the backlog worse and creates duplicate side effects.
How this answer helps in that scenario: Queueing separates ingestion from execution, idempotency makes duplicate delivery survivable, and bounded concurrency protects the real bottleneck. That turns a spike into a backlog you can manage instead of a cascade that spills into the whole system.

How do you prevent retries from causing duplicate side effects?

I make the operation idempotent at the business level, not just at the transport level. That usually means an idempotency key, a unique operation ID stored with the result, or a state transition model where the same step can be replayed safely. If I am calling an external system, I try to send a stable request identifier and store the outbound intent before the call so I can reconcile later.

Where would you apply backpressure in this kind of system?

At the real bottleneck. If the database is saturating, I cap worker concurrency there. If the external API is rate-limited, I shape outbound calls there. I also use bounded queues and admission control at the edge so the system can reject or defer work before overload becomes a cascade.

What metrics tell you the system is falling behind before users notice?

Queue depth is useful, but queue age is usually better because it shows whether work is actually being drained. I also watch retry rate, dead-letter growth, worker saturation, dependency latency, and user-facing latency at p95 or p99. Those usually surface stress before a full outage.

When would you keep a workflow synchronous instead of queue-based?

If the result is required immediately for the user-facing path and the work is short, predictable, and low-risk, synchronous is often the right tradeoff. I avoid adding async complexity unless I need buffering, isolation, retries, or long-running execution.

How do you deal with poison messages or permanently failing jobs?

I stop infinite retry loops quickly. After bounded retries, the job should go to a dead-letter queue or failure store with full context, payload metadata, and failure reason. Then I want triage tooling, replay controls, and usually classification between bad input, dependency failure, and code defect.

How do you make retries safe in distributed systems?

Retries are only safe if the operation is idempotent or if you have a clear deduplication mechanism. Otherwise retries can create duplicate payments, duplicate jobs, or conflicting state transitions. I usually design each step with an idempotency key, a unique operation identifier, or a state machine that makes repeated execution harmless. I also separate transient failures from permanent ones, because retrying validation errors or bad payloads just creates noise. Good retry policy includes bounded attempts, exponential backoff, and jitter so failures do not synchronize into a retry storm. If a step still fails after that, it should go to a dead-letter path with enough context for investigation.
In this scenario, what goes wrong: The enrichment API might time out after partially completing the request. If the worker retries blindly, the job may write duplicate results, send the same downstream event twice, or trigger conflicting state transitions.
How this answer helps in that scenario: Idempotency keys, operation tracking, and failure classification let you retry only when it is actually safe. That keeps transient failure from becoming duplicate business impact.

What makes an operation truly idempotent?

Running it multiple times has the same final business effect as running it once. That is stronger than saying the same HTTP request returns the same status code. If a payment, email, or state transition would happen twice, it is not truly idempotent.

How would you handle retries for an external API that is not idempotent?

I would avoid blind automatic retries. First choice is to see whether the API supports an idempotency token. If not, I would record outbound intent, detect ambiguous outcomes, and reconcile before retrying. In some cases the right answer is to fail safely and escalate instead of guessing.

What is the difference between at-least-once delivery and exactly-once processing?

At-least-once delivery means duplicates are possible. Exactly-once processing is the stronger business guarantee that the effect happens once. In real systems, infrastructure-level exactly-once is rare, so teams usually achieve exactly-once business effect through idempotency, deduplication, and controlled state transitions.

When do you stop retrying and surface failure?

When the failure is clearly permanent, like validation errors or malformed input, or when bounded retries are exhausted for a transient issue. After that, I want the failure surfaced to operators or downstream consumers with enough context to decide whether to replay, fix data, or patch code.

How do you think about production reliability?

I treat reliability as an engineering budget, not a vague goal. That means defining service-level objectives, understanding what level of errors or latency is acceptable, and then making design choices that fit inside those limits. For example, if the user-facing path has a strict latency budget, I avoid putting long-running or failure-prone work inline and move it to async processing where possible. I also look at dependency risk, redundancy, failure domains, and operational readiness before calling a system production-ready. Reliability work is not only about preventing outages. It is also about shortening detection time, narrowing blast radius, and making recovery predictable.
In this scenario, what goes wrong: If the system treats reliability as just uptime, it may ignore queue age, degraded dependency behavior, and user-facing latency until customers are already feeling the impact.
How this answer helps in that scenario: SLOs, latency budgets, and explicit failure-domain thinking force you to design for the failure modes that actually matter to users, not just whether the service is technically still up.

What is the difference between an SLA, SLO, and error budget?

An SLA is the external commitment, usually commercial. An SLO is the internal target you engineer against. The error budget is the allowed amount of unreliability implied by that SLO. The budget is useful because it turns reliability into a decision framework instead of a vague aspiration.

How do latency budgets influence architecture?

They force you to decide what belongs in the request path and what should move out of band. They also limit fan-out, shape timeout values, and expose which dependency hops are too expensive for the user journey.

When would you deliberately accept lower reliability?

When the feature is low criticality, internal-only, experimental, or too expensive to harden to the same level as a core path. The key is making that a conscious tradeoff instead of accidental neglect.

How do you reduce blast radius during incidents?

I like isolation boundaries, feature flags, progressive rollout, rate limiting, and the ability to disable or degrade one subsystem without taking everything else down. Small failure domains make recovery much easier.

How do you use latency budgets in system design?

A latency budget forces you to break the end-to-end response time into pieces and decide where time is allowed to go. That usually means budgeting for network hops, application processing, database calls, and external dependencies. Once that is visible, you can decide what belongs in the request path and what should move to async execution. It also helps with timeout design, because timeouts should reflect the budget instead of being random defaults. In practice, I use latency budgets to keep the critical path small, reduce fan-out, cache where it helps, and avoid hidden tail-latency traps. It is a good way to keep architecture decisions grounded in user-facing expectations.
In this scenario, what goes wrong: If the API waits on MongoDB, the external enrichment call, and multiple internal hops before responding, p95 and p99 latency can explode even when average latency looks fine.
How this answer helps in that scenario: A latency budget makes it obvious that enrichment belongs off the critical path. It also helps you set tighter internal timeouts so one slow dependency does not consume the whole request budget.

What is tail latency and why does it matter?

Tail latency is the slow end of the distribution, usually p95 or p99. Users often feel the tail more than the average, especially in fan-out systems where one slow dependency can dominate the whole request.

How do retries affect latency budgets?

Retries spend latency budget fast. If the retry is inline, it can easily turn a slow request into a timeout. That is why retry policy, timeout policy, and latency budgets need to be designed together, not separately.

How do you set timeouts between services?

I start from the end-to-end budget, reserve time for the full path, and then assign tighter budgets to internal hops. Timeouts should be deliberate and shorter than the caller’s timeout so failures surface cleanly instead of stacking.
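
A rough sketch of deriving per-hop timeouts from an end-to-end budget; the hop names, shares, and safety margin are hypothetical, and real shares should come from measured latency distributions.

```python
def hop_timeouts(total_budget_ms, hops, safety_margin=0.1):
    """Split an end-to-end latency budget across sequential hops.

    `hops` maps hop name -> share of the budget (shares should sum to <= 1).
    """
    usable = total_budget_ms * (1 - safety_margin)   # keep headroom for the caller
    return {name: int(usable * share) for name, share in hops.items()}

budget = hop_timeouts(
    total_budget_ms=800,
    hops={"auth": 0.1, "mongodb": 0.3, "enrichment_api": 0.4, "render": 0.2},
)
print(budget)   # every inner timeout stays below the caller's 800 ms limit
```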

When does caching help, and when does it just hide deeper issues?

Caching helps when the workload is read-heavy, data freshness requirements allow it, and the cache removes repeat expensive work. It hides deeper issues when it is masking bad query patterns, excessive fan-out, or poor dependency design without actually fixing them.

How do you approach incident response as an owner of production reliability?

During an incident, my first priority is restoring service or reducing impact, not proving root cause in real time. I want clear severity assessment, ownership, communication, and a fast view of blast radius. That usually means checking recent changes, dependency health, saturation signals, and whether rollback or traffic reduction is safer than continuing to debug live. After stabilization, I care about root cause analysis, timeline reconstruction, and action items that actually change the system, not just documentation theater. Good incident response is calm, structured, and focused on decision quality under pressure.
In this scenario, what goes wrong: Teams can lose time debating root cause while queue age rises, workers fail repeatedly, and user-visible latency keeps climbing.
How this answer helps in that scenario: It forces the first move toward mitigation: assess blast radius, identify the release as a likely trigger, and decide quickly whether rollback, traffic reduction, or feature disablement is the safest stabilizing action.

What would you check in the first 10 minutes?

User impact, blast radius, recent deploys or config changes, dependency health, saturation metrics, and whether rollback or traffic shedding is available. I want fast orientation before deep debugging.

When do you roll back versus fix forward?

I roll back when the change is clearly implicated and rollback is lower risk than live repair. I fix forward when rollback is unsafe, stateful migrations are involved, or the fix is smaller and faster than reversing the release.

How do you avoid noisy alerts during incidents?

By grouping related alerts, muting derived noise where appropriate, and focusing on a few primary signals tied to impact. During an incident, more alert volume is usually not more insight.

What makes a postmortem useful instead of ceremonial?

A real timeline, a clear explanation of why existing controls failed, and action items that change code, process, or observability. If the outcome is just “be more careful,” the postmortem was weak.

How do you approach capacity planning?

Capacity planning starts with workload shape, not instance count. I want to know peak versus average traffic, concurrency patterns, job duration, storage growth, dependency bottlenecks, and what happens during recovery events or batch spikes. Then I translate that into headroom targets and scaling behavior. I also care about the non-obvious bottlenecks, like connection pools, partitions, rate-limited APIs, queue consumers, and database write amplification. Good capacity planning is not guessing one big number. It is understanding where the system saturates first and what the cost of extra headroom is.

What signals tell you scaling is not solving the real bottleneck?

Throughput stops improving even as you add capacity, latency stays high, queue age keeps growing, or one dependency remains saturated. That usually means the bottleneck is elsewhere, like the database, connection pool, partition hot spots, or an external API.

How do you plan for bursty async workloads?

I look at arrival rate distribution, not just average load. Then I size for backlog absorption, drain time, worker concurrency, and recovery behavior after spikes. Bursty systems need queue-based thinking, not just steady-state scaling.
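
A back-of-the-envelope drain-time calculation helps here; the numbers below are made up, and the point is only that drain time depends on capacity minus arrival rate, not on capacity alone.

```python
def drain_time_seconds(backlog, arrival_rate, workers, per_worker_rate):
    """Rough time to drain a backlog while new work keeps arriving (jobs/second)."""
    capacity = workers * per_worker_rate
    if capacity <= arrival_rate:
        return float("inf")        # backlog never drains: scaling is not optional
    return backlog / (capacity - arrival_rate)

# 50k backlog, 20 jobs/s still arriving, 10 workers doing 5 jobs/s each
print(drain_time_seconds(50_000, 20, 10, 5))   # ~1667 seconds (~28 minutes)
```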

How much headroom is enough?

Enough to absorb expected spikes, recovery events, and small forecast errors without immediate instability. There is no universal number. It depends on workload volatility, scaling speed, and failure tolerance.

What changes in capacity planning for stateful systems?

You care much more about storage growth, replication overhead, failover behavior, write amplification, rebalancing cost, and hot partitions. Stateful systems are usually harder to scale and slower to recover than stateless ones.

How do you instrument a system so problems surface before users notice?

I start from the important user journeys and failure modes, not from whatever metrics the platform gives me for free. Then I instrument the stack with structured logs, traces across service boundaries, and metrics that reflect both user impact and system health. I want to know request rate, errors, latency, saturation, and workflow-specific signals like retry volume, queue age, and dead-letter growth. Good observability is not just data collection. It is being able to answer why a system is slow, failing, or falling behind without guessing. Alerting should focus on symptoms that matter and be specific enough that an engineer knows where to start.
In this scenario, what goes wrong: A job may be accepted by the API and then disappear somewhere between queue publish, worker execution, the external API call, and MongoDB write-back. Without correlation IDs and step-level visibility, the team ends up guessing where it died.
How this answer helps in that scenario: Traces, structured logs, and workflow-aware metrics let you narrow the failure to a specific hop or retry boundary. That turns async debugging from detective work into a normal operational task.

What is the difference between metrics, logs, and traces?

Metrics tell you that something changed. Traces show where time or failure happened across the request path. Logs give detailed local context inside a component. I want all three connected by shared identifiers.

How do you choose what deserves an alert?

If it affects users, burns reliability budget, indicates real service degradation, or predicts imminent failure, it probably deserves an alert. If it is only informational or needs human interpretation every time, it probably belongs in a dashboard, not a pager.

What makes structured logging better than plain text logs?

It makes filtering, aggregation, and correlation much easier. Fields like request ID, workflow ID, tenant, status, and error class become queryable instead of buried in free text.
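
A minimal example of emitting structured JSON log lines in Python; the field names (request_id, workflow_id, error_class) are illustrative, and the point is that they become queryable instead of buried in free text.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("worker")

def log_event(event, **fields):
    """Emit one JSON log line per event with structured, queryable fields."""
    log.info(json.dumps({"event": event, **fields}))

request_id = str(uuid.uuid4())
log_event(
    "job_failed",
    request_id=request_id,      # correlates API, queue, and worker logs
    workflow_id="wf-123",
    step="enrichment_call",
    error_class="TimeoutError",
    retry_count=3,
)
```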

How do you debug async workflows that hop across services?

Correlation IDs are mandatory. I want trace context carried through messages, logs enriched with workflow and step IDs, and dashboards that show queue state, retry history, and dead-letter events. Without cross-hop correlation, async debugging turns into guesswork.

What makes an alert actionable?

An alert is actionable when it points to a real symptom, has a clear owner, and gives enough context to start narrowing the problem immediately. Good alerts usually tie to user impact, SLO burn, or known failure patterns like queue backlog growth, sustained error rate increase, or dependency saturation. They should include thresholds that reflect meaningful degradation, not every transient blip. I also want routing, severity, and links to dashboards or runbooks. The best alert is one that wakes someone up only when a decision is needed.

What are examples of noisy alerts you would remove?

Single blip CPU alerts, isolated pod restarts with no impact, transient error spikes below user-visible thresholds, and duplicate alerts that all describe the same underlying issue.

How do you alert on slow degradation, not just hard failures?

Burn-rate alerts, backlog growth, saturation trends, and latency distribution shifts are good for that. Slow degradation usually shows up in trends before it becomes a full outage.

When would you use burn-rate alerts?

When I care about SLO consumption over time, especially for catching both fast severe outages and slower reliability leaks. Burn-rate alerting is useful because it ties pages to budget impact instead of raw metric noise.
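
A quick sketch of the burn-rate idea: compare the observed error ratio to the ratio the SLO allows. The paging thresholds teams choose (for example the commonly cited fast-burn multiplier of 14.4x over one hour for a 30-day window) are policy decisions, not universal constants.

```python
def burn_rate(observed_error_ratio, slo):
    """How fast the error budget is being consumed relative to plan.

    1.0 means exactly on budget; higher means the budget will be spent
    before the window ends.
    """
    budget_ratio = 1 - slo
    return observed_error_ratio / budget_ratio

print(burn_rate(0.014, 0.999))   # 1.4% errors against a 99.9% SLO -> 14.0x burn
```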

How do runbooks improve alert quality?

They force the alert to be grounded in a real action path. If nobody can explain what to do when the alert fires, the alert is probably weak.

Shared scenario for CI/CD, deployments, and incident response

A new release adds a schema change to MongoDB and a new worker behavior in the agent workflow. The deployment technically succeeds, but shortly after rollout, job failures rise, queue age starts climbing, and some workers are reading data in the new shape while others still expect the old shape.

What does a solid CI/CD pipeline look like to you?

A solid pipeline makes the path from local development to production predictable, repeatable, and low-friction without lowering safety standards. I want fast feedback on pull requests, automated tests at the right layers, consistent artifact creation, environment promotion rules, and infrastructure changes tracked as code. I also want preview or staging environments where they add value, especially for integration-heavy changes. The key is balancing speed with confidence. A pipeline should catch common regressions early, make deployments boring, and make rollback or rollback-equivalent actions straightforward when something goes wrong.
In this scenario, what goes wrong: If the pipeline validates only that code builds and deploys, it can still miss the dangerous part: mixed-version workers operating against an evolving data shape.
How this answer helps in that scenario: A stronger pipeline pushes you toward compatibility checks, safer promotion, and better release discipline. That reduces the chance of a technically successful deploy causing a real production incident.

What checks belong in PR validation versus later stages?

Fast and high-signal checks belong in PRs: linting, unit tests, static analysis, basic build validation, maybe lightweight integration tests. Slower or more environment-dependent checks can happen post-merge or in staging. The PR stage should protect quality without killing iteration speed.

How do you handle database changes safely in CI/CD?

Backward-compatible migrations first, application rollout second, destructive cleanup last. I try to avoid release patterns where code and schema must change in lockstep. For risky migrations, I want testing on production-like data shape and a rollback-aware plan.
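
A small sketch of the expand/contract idea from the reader's side: code tolerates both shapes while the migration is in flight, writers switch only after all readers do, and the old field is dropped only after backfill. Field names are illustrative.

```python
def read_customer_name(doc):
    """Tolerant reader used while both document shapes exist in MongoDB."""
    if "first_name" in doc:                      # new shape
        return f"{doc['first_name']} {doc['last_name']}"
    return doc["name"]                           # old shape still supported

print(read_customer_name({"name": "Ada Lovelace"}))
print(read_customer_name({"first_name": "Ada", "last_name": "Lovelace"}))
```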

When do preview environments make sense?

When changes are integration-heavy, involve UI or API contract validation, or need stakeholder review before merge. They are most useful when they reduce real uncertainty, not when they exist only because it sounds mature.

How do you avoid slow pipelines becoming a productivity drag?

Keep the fast path fast, parallelize where possible, cache builds responsibly, and separate required gates from informational checks. A slow pipeline trains people to bypass it.

How do you reduce deployment risk?

I try to reduce risk before deployment and also reduce blast radius after deployment. Before release, that means strong validation, reproducible artifacts, config review, and testing paths that reflect real integrations. During rollout, I prefer progressive delivery when possible, like canaries, phased rollout, or feature flags, because that limits exposure and makes regression detection easier. I also want rollback criteria to be defined before deployment, not invented during an incident. The idea is that deployment should be a controlled experiment, not a leap of faith.
In this scenario, what goes wrong: Rolling the whole fleet at once can turn a compatibility bug into a full outage. And if the release includes a risky schema change, rollback may not be clean anymore.
How this answer helps in that scenario: Progressive rollout, feature flags, and pre-defined rollback criteria reduce exposure early and make it easier to contain the blast radius before the backlog and failures become systemic.

When is a rollback dangerous?

When the release included irreversible state changes, destructive migrations, side effects already emitted to other systems, or data shape changes that older code cannot handle. In those cases rollback can make the incident worse.

How do feature flags help, and where do they create complexity?

They help decouple deploy from release, reduce blast radius, and let you disable behavior quickly. The downside is flag sprawl, stale code paths, hidden interactions, and extra testing matrix complexity.

What metrics do you watch right after a release?

Error rate, latency, saturation, resource usage, key business flows, and any feature-specific metrics tied to the change. I want both system health and product impact.

How do you validate infrastructure changes safely?

IaC review, plan output review, policy checks, lower-environment application where useful, and progressive rollout if the platform allows it. Infrastructure should have the same discipline as application code.

What is your view on preview, staging, and production environments?

I see environments as confidence tools, not as a ritual. Preview environments are useful for fast feedback on isolated changes, especially UI or integration-heavy work. Staging is useful when it is production-like enough to expose real integration or deployment issues. But fake staging can create false confidence if it does not reflect production topology, data shape, or traffic behavior. The right question is what uncertainty each environment is supposed to remove. If an environment does not reduce meaningful risk, it is just cost and operational drag.

What makes staging misleading?

Different scale, different data shape, missing dependencies, fake traffic, or different configuration. If staging removes the hard parts of production, it teaches the wrong lessons.

How close should staging be to production?

Close enough to exercise the risky integrations and deployment path realistically. It does not have to be identical in size, but it should be honest about the failure modes you care about.

When are ephemeral environments worth the cost?

When they speed up validation of integration-heavy changes or unblock collaboration across engineering, QA, and product. They are not worth much for changes that can already be validated cheaply with tests.

What should never be tested only in production?

Basic correctness, destructive migration logic, auth flows, critical integration contracts, and obvious failure paths. Production should still validate reality, but it should not be the first place you learn the basics are broken.

What does eventual consistency mean in practice?

Eventual consistency means different parts of the system may temporarily disagree, and your application has to be designed so that this is acceptable and understandable. In practice that affects read-after-write expectations, workflow timing, reconciliation logic, and how users experience state changes. You cannot assume every reader sees the newest value immediately, especially across replicas, caches, or event-driven pipelines. So you design around that reality with clear ownership of state transitions, idempotent consumers, compensating logic when needed, and user-facing behavior that does not depend on perfect immediacy unless the business case really requires it.
In this scenario, what goes wrong: A user may submit a job and immediately reload the UI, but the workflow status has not propagated yet. If the system assumes instant consistency, the UI may look broken or trigger duplicate user actions.
How this answer helps in that scenario: Designing for eventual consistency lets you use pending states, reconciliation, and idempotent updates instead of pretending every component sees the same truth at the same time.

When is eventual consistency acceptable, and when is it not?

It is acceptable when temporary staleness is tolerable and can be explained or reconciled. It is not acceptable for workflows where incorrect intermediate state causes financial loss, safety issues, or broken core business guarantees.

How do you explain eventual consistency to product stakeholders?

I would say the system will converge to the correct state, but not every screen or service will see the update instantly. Then I would translate that into user-visible behavior, like delayed status refresh or short-lived pending states.

What patterns help reconcile delayed or out-of-order updates?

Versioning, sequence numbers where possible, idempotent consumers, reconciliation jobs, and state machines that reject invalid transitions. The right choice depends on whether ordering can be enforced or only repaired.
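
As an illustration of a state machine that rejects invalid transitions, a small Python sketch; the job states and allowed transitions are hypothetical.

```python
# Allowed transitions for a job; an update that arrives late or out of
# order is rejected instead of silently overwriting newer state.
ALLOWED = {
    "queued":    {"running", "cancelled"},
    "running":   {"succeeded", "failed"},
    "failed":    {"queued"},          # re-queue after a fix
    "succeeded": set(),
    "cancelled": set(),
}

def apply_transition(current, incoming):
    """Return the new state, or keep the current one if the event is stale/invalid."""
    if incoming in ALLOWED.get(current, set()):
        return incoming
    return current

print(apply_transition("running", "succeeded"))   # succeeded
print(apply_transition("succeeded", "running"))   # succeeded (stale event ignored)
```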

How do you test for these edge cases?

By simulating duplicates, delayed events, out-of-order delivery, stale reads, and partial dependency failure. Happy-path tests are not enough here.

What failure modes do managed platforms hide until they do not?

Managed platforms remove a lot of undifferentiated heavy lifting, but they do not remove distributed systems realities. The failure modes are still there, just hidden behind cleaner APIs. For example, retries can still duplicate work, queues can still back up, cold starts can still affect tail latency, and abstractions around orchestration can hide state growth or timeout behavior until load increases. Rate limits, noisy neighbors, partitioning limits, and consistency assumptions also surface eventually. I like managed tools, but I do not treat them as magic. I want to understand the semantics underneath the abstraction so I know where it will break under scale, burstiness, or partial failure.

What questions would you ask before adopting a managed workflow platform?

What are the retry semantics, timeout limits, state retention behavior, ordering guarantees, throughput limits, debugging tools, cold start characteristics, and failure visibility? I want to know what happens under stress, not just on the product page.

How do you test platform limits before production?

Load tests, failure injection, long-running workflow tests, quota boundary tests, and recovery drills. I especially want to see how the platform behaves at the edges, not just in steady state.

What hidden assumptions around retries or ordering matter most?

Whether retries are automatic, whether duplicate execution is possible, whether ordering is guaranteed per key or not at all, and whether visibility timeouts or leases can cause reprocessing. Those assumptions change everything upstream.

When would you accept abstraction leakage instead of building lower-level yourself?

When the managed platform still buys enough speed, reliability, and operator leverage that the leaked complexity is manageable. I do not mind some abstraction leakage if the overall tradeoff is still good.

How do you reason about partial failure?

I assume parts of the system will fail independently and that success is often mixed, not binary. One dependency may be slow, another may be unavailable, and a third may succeed after a retry. That means the design has to handle timeout, fallback, compensation, and degraded modes explicitly. The key question is what the user or downstream system should observe when only part of the workflow succeeds. Good systems make partial failure visible, bounded, and recoverable instead of hiding it until state becomes inconsistent or operators lose track of what happened.

What is a good example of graceful degradation?

Serving a partial response, showing stale but clearly marked data, accepting work asynchronously instead of inline, or disabling a non-critical feature while keeping the core flow alive. The point is preserving value instead of chasing all-or-nothing behavior.

How do you keep partial success from corrupting state?

Clear ownership of state transitions, idempotency, durable workflow state, and compensating actions where needed. Partial success is manageable if the system can tell exactly which steps completed and which did not.

When should a workflow fail fast instead of continuing?

When a prerequisite is missing, a permanent validation error is detected, or continuing would produce unsafe or misleading state. Failing fast is usually better than digging a deeper hole.

What role does a saga or compensating action play here?

It gives you a structured way to unwind or counteract earlier successful steps when later steps fail. It is not magic, but it is a practical pattern for multi-step workflows without global transactions.
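
A toy sketch of the compensation pattern: run steps in order and unwind completed ones in reverse on failure. It ignores real concerns like durable saga state and idempotent compensations, so treat it as the shape of the idea, not an implementation; the step names are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order; compensate in reverse on failure."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()              # unwind earlier successful steps
        raise

def fail_shipping():
    raise RuntimeError("shipping service down")

try:
    run_saga([
        (lambda: print("reserve inventory"), lambda: print("release inventory")),
        (lambda: print("charge card"),       lambda: print("refund card")),
        (fail_shipping,                      lambda: print("cancel shipment")),
    ])
except RuntimeError:
    print("saga aborted; earlier steps were compensated")
```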

How do you think about workflow orchestration tools like Temporal, Inngest, or Step Functions?

I see workflow orchestrators as tools for making multi-step execution more explicit, durable, and observable. They are especially useful when steps span time, dependencies, retries, and human or external-system boundaries. But I still care a lot about the semantics underneath them, like retry behavior, step timeouts, state persistence, ordering assumptions, and what happens when a worker crashes mid-step. I do not want workflow code to become a black box that hides complexity. The orchestrator should make complexity manageable, not invisible. I usually evaluate them based on execution model, operational overhead, debugging experience, and how clearly they express failure and compensation.

When is an orchestrator overkill?

When the workflow is short, stateless, easy to retry safely, and does not need durable coordination across time. In that case a simple service plus queue may be enough.

What logic belongs in the orchestrator versus the worker?

The orchestrator should own sequencing, waiting, retries, and workflow state. The worker should own the actual business action or side effect. That keeps coordination visible and execution units testable.

How do retries differ between steps and whole workflows?

Step retries are usually local and targeted. Whole-workflow retries are broader and can re-execute more state unless carefully controlled. That is why retry boundaries matter.

What operational tradeoffs matter when choosing one?

Execution durability, developer ergonomics, debugging, vendor lock-in, throughput limits, pricing model, and how much operational burden the tool itself introduces.

Shared scenario for MongoDB design and performance

The system launched with a simple MongoDB model and small traffic. Six months later, data volume is much higher, some documents have grown large, new product requirements added more filters and sorting patterns, and a previously acceptable query is now slow enough to affect worker throughput and backlog recovery.

How do you think about schema design tradeoffs in MongoDB?

MongoDB gives you flexibility, but flexible schema does not mean schema should be accidental. I start from access patterns, document growth expectations, update frequency, and how related data is read together. Then I decide where embedding makes sense versus where references are safer. Embedding can simplify reads, but it can also create oversized documents, duplication, or painful update paths as the model evolves. The tradeoff is usually read efficiency versus long-term maintainability and write behavior. I try to make the document model reflect real query patterns rather than generic entity diagrams.
In this scenario, what goes wrong: A model that felt convenient early can become painful when documents keep growing, new fields get queried in different ways, and workers touch more of the document than they need.
How this answer helps in that scenario: Thinking in terms of access patterns and document growth helps you avoid a design that looks simple at launch but turns into a performance and maintenance problem under scale.

When would you embed versus reference?

I embed when the data is read together, bounded in size, and changes with the parent. I reference when the relationship is large, reused across entities, or updated independently.
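
Illustrative document shapes for both choices, written as Python dicts; the collections and fields are hypothetical.

```python
# Embedded: the address is small, bounded, read together with the user,
# and changes with the user -> embedding keeps reads simple.
user_embedded = {
    "_id": "u1",
    "name": "Ada",
    "address": {"city": "London", "zip": "EC1"},
}

# Referenced: orders are unbounded, queried independently, and have their
# own lifecycle -> keep them in their own collection and link by ID.
user = {"_id": "u1", "name": "Ada"}
order = {"_id": "o1", "user_id": "u1", "total": 42.0, "status": "shipped"}
```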

What document growth issues do you watch for?

Unbounded arrays, repeated embedded history, large nested blobs, and patterns where documents keep expanding with normal usage. Growth affects storage, index behavior, and update cost.

How do you evolve schema safely over time?

Backward-compatible reads first, then gradual writers, then cleanup. In practice that means code that can handle both old and new shapes while the migration is in progress.

What anti-patterns do you see often in MongoDB design?

Treating schemaless as designless, over-embedding because it looks convenient early, and ignoring query patterns until performance is already bad.

How do you approach indexing strategy in MongoDB?

Indexing should follow actual query patterns, not guesswork. I look at the most important reads, sort patterns, filters, and cardinality, then build indexes that support those paths efficiently. I also watch the write cost, because every extra index adds overhead to inserts and updates. Good indexing is a tradeoff, not a checklist. As usage evolves, I revisit slow queries, execution plans, and index usage so the index set stays aligned with reality instead of drifting into clutter. I also pay attention to compound index order, selective fields, and avoiding indexes that look useful but are rarely used.
In this scenario, what goes wrong: A query that was fine at launch may degrade once data volume grows or a new sort pattern appears. If the index set does not evolve with the workload, workers spend too long scanning and the backlog drains too slowly.
How this answer helps in that scenario: A query-pattern-driven indexing strategy makes the database support the workload you actually have, instead of the workload you had months ago.

How do you choose field order in a compound index?

Based on the query shape: equality filters first in many cases, then sort fields, then less selective range fields depending on access pattern. The right order depends on the real query plan, not a memorized rule alone.
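
A hedged example with pymongo, assuming a hypothetical jobs collection that is filtered by tenant and status and sorted by creation time; the connection string and names are placeholders.

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
jobs = client["app"]["jobs"]

# Query shape: {tenant_id: ..., status: ...} sorted by created_at descending.
# Equality fields lead and the sort field follows, so the index can satisfy
# both the filter and the sort without an in-memory sort.
jobs.create_index([
    ("tenant_id", ASCENDING),     # equality
    ("status", ASCENDING),        # equality
    ("created_at", DESCENDING),   # sort
])
```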

What signals tell you an index is missing or wrong?

Slow queries, collection scans, poor execution stats, low selectivity, or indexes that exist but are not chosen by the planner. If the plan is doing too much work, the index strategy is probably off.

How do indexes hurt write-heavy workloads?

Every insert, update, and delete has to maintain them. Too many indexes increase write latency, storage use, and memory pressure, so indexing has to be selective.

When would you remove an index?

When it has low usage, overlaps heavily with a better index, or its write cost no longer justifies the read benefit. Stale indexes are not harmless.

How do you diagnose MongoDB performance issues as data and query patterns evolve?

I first want to identify whether the bottleneck is query shape, indexing, document size, working set pressure, connection behavior, or infrastructure limits. Then I look at slow query logs, execution stats, index usage, and resource saturation. In many cases the issue is not MongoDB itself but an application pattern, like N+1 access, unbounded scans, over-fetching, or a data model that no longer matches how the system is used. The fix depends on what changed. Sometimes it is a new index. Sometimes it is rewriting the query. Sometimes it is changing the document model or partitioning strategy.
In this scenario, what goes wrong: Teams often blame MongoDB generically when the real issue is that the workload changed and the data model, query shape, or index strategy did not keep up.
How this answer helps in that scenario: It gives you a structured way to find the actual bottleneck instead of jumping straight to scaling or blaming the database without evidence.

What would you check first for a suddenly slow query?

Whether the plan changed, whether data volume or selectivity shifted, whether an index was dropped or became ineffective, and whether resource saturation increased at the same time.
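
A sketch of checking the plan with MongoDB's explain command via pymongo, requesting executionStats so you see how much work the plan did, not just which plan was chosen; the collection, filter, and connection string are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["app"]

plan = db.command({
    "explain": {
        "find": "jobs",
        "filter": {"tenant_id": "t1", "status": "pending"},
        "sort": {"created_at": -1},
    },
    "verbosity": "executionStats",
})

stats = plan["executionStats"]
ratio = stats["totalDocsExamined"] / max(stats["nReturned"], 1)
print(stats["executionTimeMillis"], "ms,", ratio, "docs examined per doc returned")
# A high examined/returned ratio or a COLLSCAN stage usually means the
# index set no longer matches the query shape.
```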

How do you distinguish database bottlenecks from application bottlenecks?

I compare query execution behavior with application traces and resource saturation. If the database is healthy but the app is slow, the bottleneck may be connection handling, serialization, chatty access patterns, or retry amplification.

What role does document size play in performance?

Large documents cost more to read, transfer, cache, and update. They also inflate the working set, which indirectly makes indexes and caching less effective because the overall workload gets heavier.

When does sharding actually help, and when does it just add complexity?

It helps when a single node is the real capacity limit and the shard key distributes load well. It adds complexity when the workload is not truly outgrowing simpler options or when the shard key creates hot spots.

What is one distributed systems mistake teams make often?

They treat successful happy-path integration tests as proof that the system is reliable. Real failures are usually around timeouts, retries, duplicates, out-of-order events, partial success, and saturation, not whether the basic API call worked once.

How would you test those failure cases?

Failure injection, chaos-style dependency disruption, duplicate message replay, delayed delivery, rate-limit simulation, timeout simulation, and load tests that push the system into recovery paths.

Which ones matter most in production?

The ones that match the real shape of the system: dependency latency, retry amplification, saturation, duplicate execution, and operator visibility gaps. Those are the ones that usually hurt first.

Personal answer framing

Use this pattern when answering out loud:
  1. Start with the principle
  2. Name the mechanisms you would use
  3. Mention the main tradeoff
  4. End with what you would watch in production
Example shape:
I usually start by assuming the system will see bursts, partial failure, and duplicate execution, so I design around that instead of around the happy path. In practice that means queues, idempotent workers, bounded retries with backoff and jitter, and clear backpressure at the actual bottleneck. The tradeoff is that you gain resilience but also add operational complexity and eventual consistency concerns. In production, I would watch queue depth, queue age, retry rate, saturation, and user-facing latency to see whether the design is holding under load.