THN Interview Prep

DevOps & Cloud

Operations work is about repeatable paths: quality gates in the build, immutable artifacts, observable systems, and rollouts that can stop without drama. This hub assumes Node.js services, Docker, and AWS-class clouds, the same stack and story that interviews expect at senior level.

The problem DevOps solves is not "where do we deploy?" It is how a team changes production safely, proves the system is healthy, and recovers when a deploy, dependency, or traffic shape breaks assumptions.

How to use

  • Skim Topic study plan and pick one track per week (CI → container → K8s → AWS → incidents).
  • Cross-reference /backend (timeouts, pools) and /databases (RDS, connections): performance is a shared concern between application code and operations.
  • For interview prep, rehearse the same path every time: artifact → runtime → traffic → telemetry → rollback.

Topic study plan (deep pages)

Each /devops/topics/... page follows: Core details → Understanding → Senior understanding → Diagram.

Topic                                   | Focus
CI/CD pipelines & quality gates         | Stages, gates, promotion, idempotent deploys
Docker images & containers              | Multi-stage Node images, signals, health, security
Kubernetes deployment & health          | Pods, Deployments, probes, Services, Ingress
AWS compute patterns: ECS, EKS & Lambda | When which compute model; VPC, ALB, IAM
Observability, incidents & rollouts     | RED/USE, incidents, canary, flags, rollback

Core basics (vocabulary)

Term                     | One line
Immutable infrastructure | replace instances; don’t SSH-patch in place
GitOps / IaC             | desired state in repo; apply reconciles drift
Blast radius             | limit scope: accounts, VPCs, feature flags
MTTD / MTTR              | detect fast; mitigate before root-cause analysis is complete
Artifact                 | versioned deployable unit: image, package, migration bundle, config reference
Promotion                | moving the same artifact through environments, not rebuilding per environment
Readiness                | whether a workload should receive traffic right now
Rollback                 | returning users to a known safe path; often previous artifact + compatible schema

Operating model

Layer     | Senior question                      | Good default
Source    | Can we trust the change?             | Small PRs, code owners, tests, secret scan
Build     | Can we reproduce the bits?           | Pinned dependencies, image digest, SBOM, signed artifact where needed
Runtime   | Can the app start, stop, and drain?  | Non-root container, readiness check, graceful SIGTERM
Traffic   | Can we limit exposure?               | Canary, feature flag, weighted target groups, per-tenant rollout
Telemetry | Will failure page someone correctly? | SLO alerts, RED/USE dashboards, deploy markers
Recovery  | Can we undo safely?                  | Previous artifact, expand-contract migrations, runbook
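
To make the Runtime row concrete, here is a minimal sketch of drain logic for a Node.js service. DrainController and its method names are hypothetical, and the wiring to an HTTP server and to a SIGTERM handler is deliberately left out. The assumption is that the platform stops routing traffic once readiness fails, so the process keeps serving whatever still arrives and exits only when nothing is in flight.

```typescript
// Hypothetical helper: tracks in-flight requests and shutdown state.
// This is a sketch of the drain decision only, not a full server.
class DrainController {
  private inFlight = 0;
  private shuttingDown = false;

  // Readiness answer: stop advertising as ready once shutdown begins,
  // so the load balancer or orchestrator stops sending new traffic.
  isReady(): boolean {
    return !this.shuttingDown;
  }

  // Call when a request arrives. Requests that arrive during shutdown are
  // still counted, because routing changes take time to propagate.
  begin(): void {
    this.inFlight += 1;
  }

  // Call when a request finishes (success or failure).
  end(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  // Call from the SIGTERM handler.
  startShutdown(): void {
    this.shuttingDown = true;
  }

  // The process may exit once shutdown has started and nothing is in flight.
  canExit(): boolean {
    return this.shuttingDown && this.inFlight === 0;
  }
}
```

In a real service this would usually be paired with a hard deadline, so a stuck request cannot block shutdown forever.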

The simple mental model: CI decides whether code may merge; CD decides whether a specific artifact may receive production traffic. Do not mix those decisions. Rebuilding during rollback produces a new, unverified artifact, so you are no longer returning to the bits that were proven in production.
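
The promotion rule can be sketched as a check on artifact identity. Everything below is illustrative: the Artifact and PromotionRequest types, and the mayPromote and rollbackTarget functions, are hypothetical names rather than part of any real CD tool. The idea is that promotion and rollback both refer to a digest that already exists and never trigger a build.

```typescript
// All names here are hypothetical, for illustration only.
interface Artifact {
  name: string;
  digest: string; // e.g. an image digest of the form "sha256:<hex>"
}

interface PromotionRequest {
  artifact: Artifact;
  // Digests that already passed verification in the previous environment.
  verifiedDigests: ReadonlySet<string>;
}

// CD decision: only an artifact that was already verified may move forward.
// A rebuilt image has a different digest and is therefore rejected.
function mayPromote(req: PromotionRequest): boolean {
  return req.verifiedDigests.has(req.artifact.digest);
}

// Rollback re-deploys a previously released digest; nothing is rebuilt.
// The history is assumed to be ordered oldest to newest, with the last
// entry being the current release.
function rollbackTarget(releasedDigests: readonly string[]): string | undefined {
  return releasedDigests.length >= 2
    ? releasedDigests[releasedDigests.length - 2]
    : undefined;
}
```

Keying both decisions on the digest rather than on a mutable tag is the design choice that keeps the evidence from staging valid in production.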

Interview answer structure

When asked "How would you deploy this service safely?", answer in this order:

  1. Build one immutable artifact with tests, security scan, and version metadata.
  2. Promote the same artifact through environments with config injected at runtime.
  3. Expose traffic gradually with health checks, readiness gates, and canary metrics.
  4. Watch customer-facing SLOs and dependency saturation, not only CPU.
  5. Roll back by shifting traffic or returning to the previous artifact, with schema compatibility already planned.
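
Steps 3 to 5 can be sketched as a canary gate that compares the canary against the stable baseline. This is a simplified illustration under assumptions: the WindowStats shape, the evaluateCanary function, and all threshold values are hypothetical, and a real gate would also look at dependency saturation and SLO burn.

```typescript
interface WindowStats {
  requests: number;
  errors: number;
  p99LatencyMs: number;
}

type Verdict = "promote" | "hold" | "rollback";

// Thresholds are illustrative assumptions, not recommendations.
const CANARY_MIN_REQUESTS = 500;          // below this, there is too little signal
const CANARY_MAX_ERROR_RATE_DELTA = 0.01; // canary may exceed baseline by 1 percentage point
const CANARY_MAX_LATENCY_RATIO = 1.2;     // canary p99 may be up to 20% slower

function errorRate(s: WindowStats): number {
  return s.requests === 0 ? 0 : s.errors / s.requests;
}

function evaluateCanary(baseline: WindowStats, canary: WindowStats): Verdict {
  // Not enough traffic yet: keep the current traffic split and wait.
  if (canary.requests < CANARY_MIN_REQUESTS) return "hold";

  // Compare against the baseline rather than a fixed number, so a
  // dependency problem that hits both versions does not blame the canary.
  const errorDelta = errorRate(canary) - errorRate(baseline);
  if (errorDelta > CANARY_MAX_ERROR_RATE_DELTA) return "rollback";

  if (canary.p99LatencyMs > baseline.p99LatencyMs * CANARY_MAX_LATENCY_RATIO) {
    return "rollback";
  }
  return "promote";
}
```

A "rollback" verdict here means shifting traffic back to the previous artifact, which only works if the schema change was planned to be compatible with it.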

Common weak answer:

"Put it in Docker and deploy to Kubernetes."

That skips artifact discipline, traffic control, health semantics, observability, and rollback safety.

Common mistakes

  • Treating staging and production as different builds.
  • Using liveness checks for dependency health, causing restart loops during dependency outages.
  • Shipping a DB migration that cannot roll back or coexist with old code.
  • Alerting on noisy resource metrics that do not correspond to customer impact.
  • Keeping secrets in image layers, CI logs, or checked-in YAML.
  • Calling a deploy successful before canary metrics and error budgets are checked.
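
The liveness mistake in the list above is easier to see in code. The sketch below is a simplified illustration: ProbeState and its fields are hypothetical, and whether readiness should depend on a given dependency is a judgment call per service. The functions return the HTTP status a probe endpoint would answer with.

```typescript
// Hypothetical snapshot of what the service knows about itself.
interface ProbeState {
  processHealthy: boolean;    // e.g. an internal watchdog has not tripped
  startupComplete: boolean;   // config loaded, caches warmed, and so on
  databaseReachable: boolean; // a dependency, not the process itself
}

// Liveness: "should this process be restarted?"
// Dependencies are deliberately ignored, because restarting the process
// does not fix a database outage and only adds a restart loop on top of it.
function livenessStatus(s: ProbeState): number {
  return s.processHealthy ? 200 : 503;
}

// Readiness: "should this instance receive traffic right now?"
// Here it is assumed the service cannot do useful work without the database.
function readinessStatus(s: ProbeState): number {
  return s.startupComplete && s.databaseReachable ? 200 : 503;
}
```

During a database outage this returns 200 for liveness and 503 for readiness, so the instance is taken out of rotation without being restarted.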

Mind map (ASCII)

DevOps & Cloud
├── Ship
│   ├── CI/CD + gates
│   └── artifacts (images, versioned)
├── Run
│   ├── Docker → K8s / ECS / Lambda
│   └── config + secrets (not in image)
└── Operate
    ├── logs / metrics / traces
    └── incidents + rollouts (canary, rollback)
