DevOps & Cloud

Operations work is about repeatable paths: build quality gates, immutable artifacts, observable systems, and rollouts that can stop without drama. This hub assumes Node.js services, Docker, and AWS-class clouds, which is also the story senior-level interviews expect you to tell.

The problem DevOps solves is not "where do we deploy?" It is how a team changes production safely, proves the system is healthy, and recovers when a deploy, dependency, or traffic shape breaks assumptions.

How to use

Skim Topic study plan and pick one track per week (CI → container → K8s → AWS → incidents).
Cross-reference /backend (timeouts, pools) and /databases (RDS, connections): performance is a concern shared with operations.
For interview prep, rehearse the same path every time: artifact → runtime → traffic → telemetry → rollback.

Topic study plan (deep pages)

Each /devops/topics/... page follows: Core details → Understanding → Senior understanding → Diagram.

Topic	Focus
CI/CD pipelines & quality gates	Stages, gates, promotion, idempotent deploys
Docker images & containers	Multi-stage Node images, signals, health, security
Kubernetes deployment & health	Pods, Deployments, probes, Services, Ingress
AWS compute patterns — ECS, EKS & Lambda	When which compute model; VPC, ALB, IAM
Observability, incidents & rollouts	RED/USE, incidents, canary, flags, rollback

Core basics (vocabulary)

Term	One line
Immutable infrastructure	replace instances; don’t SSH-patch in place
GitOps / IaC	desired state in repo; apply reconciles drift
Blast radius	limit scope: accounts, VPCs, feature flags
MTTD / MTTR	detect fast; mitigate before root-cause analysis is complete
Artifact	versioned deployable unit: image, package, migration bundle, config reference
Promotion	moving the same artifact through environments, not rebuilding per environment
Readiness	whether a workload should receive traffic right now
Rollback	returning users to a known safe path; often previous artifact + compatible schema

Operating model

Layer	Senior question	Good default
Source	Can we trust the change?	Small PRs, code owners, tests, secret scan
Build	Can we reproduce the bits?	Pinned dependencies, image digest, SBOM, signed artifact where needed
Runtime	Can the app start, stop, and drain?	Non-root container, readiness check, graceful SIGTERM
Traffic	Can we limit exposure?	Canary, feature flag, weighted target groups, per-tenant rollout
Telemetry	Will failure page someone correctly?	SLO alerts, RED/USE dashboards, deploy markers
Recovery	Can we undo safely?	Previous artifact, expand-contract migrations, runbook
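
The Runtime row (readiness check, graceful SIGTERM) can be sketched for a Node.js service as below. This is a minimal illustration, not a definitive implementation: the `/readyz` path, the `Lifecycle` shape, the function names, and the 10-second drain window are all assumptions made for the example, and the right drain window depends on your load balancer's deregistration delay.

```typescript
import * as http from "node:http";

// Hypothetical state holder: flipped to not-ready as soon as shutdown begins.
interface Lifecycle {
  ready: boolean;
}

// Readiness answers "should this instance receive traffic right now?"
function readyzStatus(state: Lifecycle): number {
  return state.ready ? 200 : 503;
}

function createApp(state: Lifecycle): http.Server {
  return http.createServer((req, res) => {
    if (req.url === "/readyz") {
      res.writeHead(readyzStatus(state)).end(state.ready ? "ready" : "draining");
      return;
    }
    res.writeHead(200).end("ok");
  });
}

// Fail readiness first so the balancer stops routing here, then stop accepting
// new connections and let in-flight requests finish. If draining exceeds the
// window, give up and report a non-zero exit code.
function beginShutdown(
  server: http.Server,
  state: Lifecycle,
  onDone: (exitCode: number) => void,
  drainTimeoutMs = 10_000, // assumed value, tune per environment
): void {
  state.ready = false;
  const timer = setTimeout(() => onDone(1), drainTimeoutMs);
  server.close((err) => {
    clearTimeout(timer);
    onDone(err ? 1 : 0);
  });
}

// Wiring (left as comments so the sketch has no side effects when loaded):
//   const state: Lifecycle = { ready: true };
//   const server = createApp(state);
//   server.listen(3000);
//   process.on("SIGTERM", () => beginShutdown(server, state, (c) => process.exit(c)));
```

The ordering is the point: readiness goes red before the listener closes, so traffic moves away before connections are refused.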

The simple mental model: CI decides whether code may merge; CD decides whether a specific artifact may receive production traffic. Do not mix those decisions. Rebuilding during rollback changes the evidence: the rebuilt artifact is not the one that was tested and promoted.
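
One way to make "a specific artifact" concrete is to key every promotion decision on the artifact's digest rather than its name or tag. The sketch below is hypothetical: the `PromotionLedger` class, the three environment names, and the digest strings are illustrative assumptions, not part of any real CD tool.

```typescript
type Env = "dev" | "staging" | "prod";

interface Artifact {
  name: string;
  digest: string; // content identity, e.g. an image digest; must not change
}

// Each environment may only accept what its predecessor has already passed.
const PREDECESSOR: Record<Env, Env | null> = {
  dev: null,
  staging: "dev",
  prod: "staging",
};

class PromotionLedger {
  private passed = new Map<Env, Set<string>>();

  // Record that this exact digest passed its checks in `env`.
  recordPass(env: Env, artifact: Artifact): void {
    const digests = this.passed.get(env) ?? new Set<string>();
    digests.add(artifact.digest);
    this.passed.set(env, digests);
  }

  // A rebuilt artifact has a new digest, so it carries no evidence forward.
  canPromote(target: Env, artifact: Artifact): boolean {
    const previous = PREDECESSOR[target];
    if (previous === null) return true; // first environment has no predecessor
    return this.passed.get(previous)?.has(artifact.digest) ?? false;
  }
}
```

Under this model a rollback is just a promotion of an older digest that already has evidence recorded, which is why rebuilding at rollback time defeats the purpose.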

Interview answer structure

When asked "How would you deploy this service safely?", answer in this order:

Build one immutable artifact with tests, security scan, and version metadata.
Promote the same artifact through environments with config injected at runtime.
Expose traffic gradually with health checks, readiness gates, and canary metrics.
Watch customer-facing SLOs and dependency saturation, not only CPU.
Roll back by shifting traffic or returning to the previous artifact, with schema compatibility already planned.
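
Steps 3 to 5 above come down to a decision made on metrics, which can be sketched as a small pure function. Everything here is an assumption for illustration: the `WindowStats` fields, the policy thresholds, and the three-way verdict. A real rollout controller would read these numbers from your metrics backend and likely weigh more signals.

```typescript
interface WindowStats {
  requests: number;
  errors: number;
  p99Ms: number;
}

interface CanaryPolicy {
  minRequests: number; // do not judge on too little traffic
  maxErrorRateDelta: number; // allowed absolute increase over baseline
  maxP99Ratio: number; // allowed canary p99 divided by baseline p99
}

type Verdict = "promote" | "hold" | "rollback";

function errorRate(s: WindowStats): number {
  return s.requests === 0 ? 0 : s.errors / s.requests;
}

// Compare the canary against the baseline over the same time window.
function judgeCanary(
  baseline: WindowStats,
  canary: WindowStats,
  policy: CanaryPolicy,
): Verdict {
  if (canary.requests < policy.minRequests) return "hold";
  if (errorRate(canary) - errorRate(baseline) > policy.maxErrorRateDelta) {
    return "rollback";
  }
  if (baseline.p99Ms > 0 && canary.p99Ms / baseline.p99Ms > policy.maxP99Ratio) {
    return "rollback";
  }
  return "promote";
}
```

Comparing against a live baseline rather than a fixed threshold keeps the gate meaningful when a dependency is degrading both versions at once.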

Common weak answer:

"Put it in Docker and deploy to Kubernetes."

That skips artifact discipline, traffic control, health semantics, observability, and rollback safety.

Common mistakes

Treating staging and production as different builds.
Using liveness checks for dependency health, causing restart loops during dependency outages.
Shipping a DB migration that cannot roll back or coexist with old code.
Alerting on noisy resource metrics that have no customer impact.
Keeping secrets in image layers, CI logs, or checked-in YAML.
Calling a deploy successful before canary metrics and error budgets are checked.
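
The liveness mistake in the list above is easiest to see when the two probes are written side by side. This is a sketch under assumptions: the `ProbeInputs` fields and function names are hypothetical, and treating the database as a hard dependency for readiness is one reasonable choice, not the only one (some services prefer to stay ready and serve degraded responses).

```typescript
interface ProbeInputs {
  eventLoopResponsive: boolean; // process-local signal only
  dbReachable: boolean; // dependency signal
  shuttingDown: boolean;
}

// Liveness: "would restarting this process help?" A database outage is not
// fixed by a restart, so dependency state is deliberately ignored here.
function livenessStatus(p: ProbeInputs): number {
  return p.eventLoopResponsive ? 200 : 500;
}

// Readiness: "should traffic be routed here right now?" Dependency health
// and shutdown state belong here; failing it removes the instance from
// rotation without restarting it.
function readinessStatus(p: ProbeInputs): number {
  return !p.shuttingDown && p.dbReachable ? 200 : 503;
}
```

During a database outage this instance reports liveness 200 and readiness 503: it leaves rotation, stays up, and rejoins when the dependency recovers instead of entering a restart loop.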

Mind map (ASCII)

DevOps & Cloud
├── Ship
│   ├── CI/CD + gates
│   └── artifacts (images, versioned)
├── Run
│   ├── Docker → K8s / ECS / Lambda
│   └── config + secrets (not in image)
└── Operate
    ├── logs / metrics / traces
    └── incidents + rollouts (canary, rollback)

/performance — tails, saturation, profiling
/databases/topics/aws-data-services-nodejs — RDS, pools, ElastiCache
/backend/topics/observability-slos-alerts — SLO framing
/backend/topics/request-lifecycle-timeouts — request budgets and cancellation
