DB DevBrain

Devops Interview Prep

Database Architecture & Design

Glossary

Core workload and design

Database types (high-level)

Write operations (examples and short explanations)

Reliability and recovery

Performance fundamentals

Redis specific


Partitioning vs sharding

Summary

Partitioning

What it is

Splitting a single database’s data into smaller logical pieces, usually by:

Who decides?

Depends on the database.

Example: PostgreSQL

Postgres supports native table partitioning, but you define it:
PARTITION BY RANGE (created_at)
You decide:

Example: MySQL

You define partitioning strategy. It does not auto-partition your tables intelligently.

Key points

Partitioning is:

Sharding

What it is

Splitting data across multiple database instances or nodes (horizontal scaling).
Example:

Who decides?

In most systems: You. At the architecture level.
You decide:
Classic examples:

Databases That Auto-Shard

Some distributed databases manage sharding internally.

Example: MongoDB

You choose a shard key, but cluster handles:

Example: Apache Cassandra

Automatically distributes data by partition key across nodes using consistent hashing.
You choose partition key, but:

In modern systems

If you're running:

Real-World Database Design

Operating mindset

Design for p99 latency, failure recovery, and day-2 operations. Keep it simple until evidence forces complexity.
For every decision:

Constraints That Shape the Database

Workload truth (do not guess)

Capture:
Why this matters:
Examples:

Correctness requirements

Decide:
Examples:

Availability and recovery

Define:
Rule:

Database categories - fit and trade-offs

Relational (Postgres, MySQL)

When to use:
Why it fits:
Trade-off:

Wide-column and key-value (Cassandra, DynamoDB)

When to use:
Why it fits:
Trade-off:

Document (MongoDB)

When to use:
Why it fits:
Trade-off:

Columnar analytics (BigQuery, Snowflake, ClickHouse)

When to use:
Why it fits:
Trade-off:

Cache (Redis)

When to use:
Why it fits:
Trade-off:

Scaling strategy ladder (use the least complex that works)

  1. Single node + indexes + right hardware.
  2. Read replicas for read scale.
  3. Partitioning for huge tables and retention.
  4. Caching for hot reads.
  5. Sharding only when forced by write or storage ceilings.

Storage and I/O decisions

Why storage dominates

Most OLTP pain is random I/O + fsync. If storage p99 gets worse, DB p99 gets worse.

Local storage types, SSD vs HDD

Quick rule:

RAID basics (local disks)

RAID is about performance and fault tolerance. It does not replace backups.
Considerations:

Networked block storage (cloud disks)

Fit: durability and managed DB patterns.

What to size

Failure-mode table

Symptom
Likely root cause
Why it happens
What to check
Quick fix
Long-term fix
p99 latency jumps, CPU ok
Disk latency spike, checkpoint, fsync stalls
Queries wait on storage, not CPU
Disk latency, queue depth, checkpoint time, WAL fsync
Pause batch, reduce load
Faster storage, tune checkpoints/WAL
CPU pegged, slow gets worse
Missing index, bad plan, hot query
DB reads far more rows than needed
Slow query log, EXPLAIN, rows scanned vs returned
Add index, rewrite query
Query review, guardrails, partitioning
Timeouts, CPU moderate
Lock contention, long tx
Queries block waiting for locks
Lock waits, long tx, deadlocks
Kill blocker, reduce concurrency
Fix tx scope, add indexes
Unstable during deploy or failover
Connection storm, no pooling, retries
Connections and retries amplify load
Conn count and churn, retry rate
Pooling + throttling + backoff
Standardize pooling, limits, circuit breakers
Replica lag grows
Heavy writes, slow replica I/O
Replica cannot apply changes fast enough
Lag, WAL/binlog backlog, replica disk latency
Route reads to primary, stop heavy replica queries
Upgrade replica storage, move analytics to OLAP
Disk fills quickly
WAL/binlog retention, stuck slots
Logs append, cleanup blocked
WAL/binlog size, slots, retention
Free space, fix slot, expand
Fix retention, early alerts
OOMKilled or random restarts
Memory limits, per-query memory * concurrency
Memory multiplies under concurrency
OOMKilled events, memory graphs
Raise limit, lower per-query mem, reduce concurrency
Right-size, tune, isolate workloads
Pooling and throttling, concise:

Cross-engine operational patterns

Durability log

Background maintenance

Replication lag

Memory pressure

Disk growth

Connection and request pressure

Write stalls and latency spikes

Engine quick reference

Postgres

MySQL (InnoDB)

Redis

Cassandra