DevOps & Cloud
Operations work comes down to repeatable paths: build quality gates, immutable artifacts, observable systems, and rollouts that can stop without drama. This hub assumes Node.js services, Docker, and AWS-class clouds, the same ground senior-level interviews expect you to cover.
The problem DevOps solves is not "where do we deploy?" It is how a team changes production safely, proves the system is healthy, and recovers when a deploy, dependency, or traffic shape breaks assumptions.
How to use
- Skim Topic study plan and pick one track per week (CI → container → K8s → AWS → incidents).
- Cross-link /backend (timeouts, pools) and /databases (RDS, connections): performance is a joint concern with ops.
- For interview prep, rehearse the same path every time: artifact → runtime → traffic → telemetry → rollback.
Topic study plan (deep pages)
Each /devops/topics/... page follows: Core details → Understanding → Senior understanding → Diagram.
| Topic | Focus |
|---|---|
| CI/CD pipelines & quality gates | Stages, gates, promotion, idempotent deploys |
| Docker images & containers | Multi-stage Node images, signals, health, security |
| Kubernetes deployment & health | Pods, Deployments, probes, Services, Ingress |
| AWS compute patterns — ECS, EKS & Lambda | When which compute model; VPC, ALB, IAM |
| Observability, incidents & rollouts | RED/USE, incidents, canary, flags, rollback |
Core basics (vocabulary)
| Term | One line |
|---|---|
| Immutable infrastructure | replace instances; don’t SSH-patch in place |
| GitOps / IaC | desired state in repo; apply reconciles drift |
| Blast radius | limit scope: accounts, VPCs, feature flags |
| MTTD / MTTR | mean time to detect / to recover: detect fast; mitigate before root-cause analysis is complete |
| Artifact | versioned deployable unit: image, package, migration bundle, config reference |
| Promotion | moving the same artifact through environments, not rebuilding per environment |
| Readiness | whether a workload should receive traffic right now |
| Rollback | returning users to a known safe path; often previous artifact + compatible schema |
Operating model
| Layer | Senior question | Good default |
|---|---|---|
| Source | Can we trust the change? | Small PRs, code owners, tests, secret scan |
| Build | Can we reproduce the bits? | Pinned dependencies, image digest, SBOM, signed artifact where needed |
| Runtime | Can the app start, stop, and drain? | Non-root container, readiness check, graceful SIGTERM |
| Traffic | Can we limit exposure? | Canary, feature flag, weighted target groups, per-tenant rollout |
| Telemetry | Will failure page someone correctly? | SLO alerts, RED/USE dashboards, deploy markers |
| Recovery | Can we undo safely? | Previous artifact, expand-contract migrations, runbook |
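To make the Runtime row concrete, here is a minimal sketch of draining on SIGTERM in a Node.js service. The names `ClosableServer`, `DrainOptions`, `drain`, and `markNotReady` are hypothetical, and the exit function is injected only so the logic can be exercised without killing a process. It assumes the readiness check is controlled by a flag the app owns.

```typescript
// Minimal shape of the server this sketch needs; Node's http.Server fits it.
interface ClosableServer {
  close(callback: (err?: Error) => void): unknown;
}

// Hypothetical options bag; none of these names come from a library.
interface DrainOptions {
  server: ClosableServer;
  markNotReady: () => void; // flips the readiness check to failing
  graceMs: number; // hard deadline before forcing an exit
  exit: (code: number) => void; // injected; in a real service, process.exit
}

function drain(opts: DrainOptions): void {
  // 1. Stop advertising readiness so the load balancer sends no new traffic.
  opts.markNotReady();

  // 2. Force a non-zero exit if in-flight work does not finish in time.
  const timer = setTimeout(() => opts.exit(1), opts.graceMs);

  // 3. Stop accepting new connections and wait for open ones to finish.
  opts.server.close((err?: Error) => {
    clearTimeout(timer);
    opts.exit(err ? 1 : 0);
  });
}

// Assumed wiring in the service entry point:
// process.once("SIGTERM", () =>
//   drain({ server, markNotReady, graceMs: 10000, exit: process.exit }));
```

The hard deadline exists because closing a server can wait on connections that stay open; the right grace period depends on the platform's own termination timeout, which this sketch does not know.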
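The simple mental model: CI decides whether code may merge; CD decides whether a specific artifact may receive production traffic. Do not mix those decisions. Rebuilding during rollback changes the evidence: a rebuilt artifact is not the one that was tested, so roll back to the artifact that was already built and verified.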
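The "same artifact, not a rebuild" rule can be sketched as a promotion check keyed on the artifact digest. Everything here is illustrative: `Artifact`, `PromotionRecord`, and `canPromote` are hypothetical names, and the record of where a digest was verified is assumed to come from whatever the pipeline already stores.

```typescript
// Hypothetical record of what CI produced.
interface Artifact {
  name: string;
  digest: string; // for example an image digest of the form "sha256:..."
}

// Hypothetical record of where this exact digest has passed its checks.
interface PromotionRecord {
  artifact: Artifact;
  verifiedIn: string[];
}

type PromotionDecision =
  | { allowed: true }
  | { allowed: false; reason: string };

function canPromote(
  record: PromotionRecord,
  candidateDigest: string,
  requiredEnvs: string[],
): PromotionDecision {
  // A different digest means a different build, so earlier evidence does not apply.
  if (candidateDigest !== record.artifact.digest) {
    return { allowed: false, reason: "digest differs from the artifact that was tested" };
  }
  // The same digest must have passed every environment that gates this one.
  const missing = requiredEnvs.filter((env) => record.verifiedIn.indexOf(env) === -1);
  if (missing.length > 0) {
    return { allowed: false, reason: `not yet verified in: ${missing.join(", ")}` };
  }
  return { allowed: true };
}
```

Comparing digests rather than tags is a deliberate choice in this sketch: a tag can be moved to point at a new build, while a digest identifies the exact bits.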
Interview answer structure
When asked "How would you deploy this service safely?", answer in this order:
- Build one immutable artifact with tests, security scan, and version metadata.
- Promote the same artifact through environments with config injected at runtime.
- Expose traffic gradually with health checks, readiness gates, and canary metrics.
- Watch customer-facing SLOs and dependency saturation, not only CPU.
- Roll back by shifting traffic or returning to the previous artifact, with schema compatibility already planned.
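The gradual-exposure and watch steps above can be sketched as a small canary gate that compares the canary against the current baseline. The names `WindowStats`, `CanaryThresholds`, and `judgeCanary`, and the choice of error rate plus p99 latency as signals, are assumptions for illustration; a real gate would use whatever SLO signals the service already tracks.

```typescript
// Hypothetical summary of one observation window for one version.
interface WindowStats {
  requests: number;
  errors: number;
  p99Ms: number;
}

// Hypothetical thresholds; the values are a product decision, not a standard.
interface CanaryThresholds {
  minRequests: number; // below this, there is too little data to judge
  maxErrorRateDelta: number; // allowed increase in error rate over baseline
  maxP99Ratio: number; // allowed canary p99 as a multiple of baseline p99
}

type CanaryVerdict = "promote" | "hold" | "rollback";

function errorRate(stats: WindowStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

function judgeCanary(
  baseline: WindowStats,
  canary: WindowStats,
  thresholds: CanaryThresholds,
): CanaryVerdict {
  // Not enough traffic yet: keep the current weight and keep watching.
  if (canary.requests < thresholds.minRequests) {
    return "hold";
  }
  // Customer-facing errors regress beyond the allowed margin.
  if (errorRate(canary) - errorRate(baseline) > thresholds.maxErrorRateDelta) {
    return "rollback";
  }
  // Tail latency regresses beyond the allowed ratio.
  if (baseline.p99Ms > 0 && canary.p99Ms / baseline.p99Ms > thresholds.maxP99Ratio) {
    return "rollback";
  }
  return "promote";
}
```

Returning "hold" for low traffic keeps a quiet canary from being promoted on the strength of a handful of requests.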
Common weak answer:
"Put it in Docker and deploy to Kubernetes."
That skips artifact discipline, traffic control, health semantics, observability, and rollback safety.
Common mistakes
- Treating staging and production as different builds.
- Using liveness checks for dependency health, causing restart loops during dependency outages.
- Shipping a DB migration that cannot roll back or coexist with old code.
- Alerting on noisy resource metrics without customer impact.
- Keeping secrets in image layers, CI logs, or checked-in YAML.
- Calling a deploy successful before canary metrics and error budgets are checked.
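The liveness mistake in the list above is easier to see in code. This is a minimal sketch, assuming the service tracks a small state object; `ServiceState`, `ProbeResult`, `liveness`, and `readiness` are hypothetical names, and the handlers are plain functions so they can be attached to any HTTP framework.

```typescript
// Hypothetical state the service keeps about itself.
interface ServiceState {
  started: boolean; // the app finished booting
  draining: boolean; // shutdown has begun; finishing in-flight work
  dependenciesOk: boolean; // for example, the database is reachable
}

interface ProbeResult {
  status: number;
  body: string;
}

// Liveness answers "is this process wedged?" and deliberately ignores
// dependencies: restarting the app does not fix a database outage.
function liveness(): ProbeResult {
  return { status: 200, body: "alive" };
}

// Readiness answers "should this instance receive traffic right now?"
// Failing it removes the instance from rotation without restarting it.
function readiness(state: ServiceState): ProbeResult {
  if (!state.started) return { status: 503, body: "starting" };
  if (state.draining) return { status: 503, body: "draining" };
  if (!state.dependenciesOk) return { status: 503, body: "dependency unavailable" };
  return { status: 200, body: "ready" };
}
```

During a dependency outage this sketch keeps liveness at 200 while readiness returns 503, so the instance stops receiving traffic but is not restarted in a loop.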
Mind map (ASCII)
DevOps & Cloud
├── Ship
│   ├── CI/CD + gates
│   └── artifacts (images, versioned)
├── Run
│   ├── Docker → K8s / ECS / Lambda
│   └── config + secrets (not in image)
└── Operate
    ├── logs / metrics / traces
    └── incidents + rollouts (canary, rollback)
Related on this site
- /performance — tails, saturation, profiling
- /databases/topics/aws-data-services-nodejs — RDS, pools, ElastiCache
- /backend/topics/observability-slos-alerts — SLO framing
- /backend/topics/request-lifecycle-timeouts — request budgets and cancellation