The SRE Roadmap is more than a checklist — it's a living philosophy. After 10+ years in the trenches keeping distributed systems alive at scale, I've distilled what separates a reactive firefighter from a proactive reliability architect. This deep dive walks through every layer of the roadmap, backed by real patterns I've implemented across Nordstrom, Walmart, and enterprise cloud migrations.

SRE Skill Domain Map — Visualised
Six domains surround the SRE Engineer hub:
  • Observability — Metrics · Logs · Traces (Datadog · Prometheus · Grafana)
  • Linux & Systems — Kernel · Networking · FS (TCP/IP · DNS · cgroups)
  • CI/CD & GitOps — Pipelines · ArgoCD · Flux (GitHub Actions · ADO)
  • Incident Mgmt — SLO · Error Budget · RCA (MTTR · MTTD · Postmortems)
  • Cloud & IaC — AWS · Azure · Terraform (Kubernetes · Helm · Pulumi)
  • Security & Compliance — RBAC · SAST · Secrets (Vault · OPA · SOC2)

The Three Phases of SRE Mastery

Every SRE I've mentored or worked alongside follows a similar trajectory — from systems fundamentals, through automation and observability, to architectural influence. Here's how to think about the roadmap in layers:

Phase 1 · Foundation
Systems Thinking
  • Linux internals — processes, cgroups, namespaces
  • Networking fundamentals — TCP/IP, DNS, load balancing
  • Distributed systems theory — CAP, consensus, retries
  • Git workflows & scripting — Bash/Python automation
  • Containerisation — Docker, OCI specs, image optimisation
  • Monitoring primitives — metrics, logs, traces (the three pillars)
Phase 2 · Proficiency
Platform Engineering
  • Kubernetes — workloads, RBAC, networking, HPA
  • CI/CD pipelines — ArgoCD, GitOps, progressive delivery
  • IaC — Terraform, Helm, Pulumi at scale
  • Observability platforms — Datadog, Prometheus, Grafana
  • SLO/SLA definition & error budget management
  • Incident response — runbooks, war rooms, postmortems
Phase 3 · Architecture
Reliability Architecture
  • Chaos Engineering — failure injection, game days
  • Multi-cloud strategy — portability & cost governance
  • Platform as a Product — Internal Developer Platforms
  • AI-Ops — anomaly detection, intelligent alerting
  • Security by design — zero-trust, policy-as-code
  • Org-level reliability culture & blameless postmortems

Observability: Beyond "Is It Up?"

The three pillars — Metrics, Logs, and Traces — are just the start. True observability means you can ask any question about your system's behaviour without deploying new code. After instrumenting GraphQL APIs across 300+ microservices at Nordstrom, here's the architecture that actually works at scale:

Observability Stack Architecture
  • Data sources — Applications / APIs · Kubernetes cluster · Infra nodes / VMs · Databases / queues · Browser / mobile RUM · CDN / edge nodes
  • Collection pipeline — OpenTelemetry SDK · Prometheus scrape · Fluent Bit / Logstash · Jaeger / Zipkin agents · StatsD / DogStatsD · Event streaming (Kafka)
  • Storage & query — Datadog platform · Thanos / Cortex · Elasticsearch / Loki · Grafana Tempo · ClickHouse analytics · VictoriaMetrics TSDB
  • Action layer — Grafana dashboards · SLO burn-rate alerts · PagerDuty on-call · Auto-remediation · Runbook automation · Postmortem reports
💡 Pro tip from the field: Instrument with OpenTelemetry from day one. Vendor lock-in on observability is a silent tax that compounds over years. At Nordstrom, migrating partially to OTel reduced our Datadog bill by ~18% while improving trace coverage across GraphQL APIs.

SLOs, SLIs & Error Budgets: The Language of Reliability

SLOs are the single most impactful cultural shift an SRE can introduce. They turn subjective debates ("is this slow enough to page?") into objective, data-driven decisions. Here's the reliability framework I've used across every team:

  • 99.95% — availability SLO, checkout flow
  • 200ms — p95 latency SLO, API gateway
  • 21.9m — error budget per month at 99.95%
  • <4 min — MTTD target with burn-rate alerts

Error Budget Burn Rate — Alert Strategy
(Chart: error budget remaining over a 30-day window, showing normal burn, a slow-burn alert at 5%/day, a fast-burn alert at 2%/hour, and the accelerated burn during an incident.)
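The numbers above follow directly from the SLO arithmetic. A minimal sketch (the 30.4375-day window is 365.25/12, an average calendar month):

```python
def error_budget_minutes(slo: float, window_days: float = 30.4375) -> float:
    """Allowed downtime per window for an availability SLO."""
    return window_days * 24 * 60 * (1 - slo)

def hours_to_exhaustion(burn_pct_per_hour: float) -> float:
    """How long until the whole budget is gone at a constant burn rate."""
    return 100 / burn_pct_per_hour

print(round(error_budget_minutes(0.9995), 1))  # → 21.9 (minutes per month)
print(hours_to_exhaustion(2))                  # → 50.0 (fast burn: gone in ~2 days)
print(hours_to_exhaustion(5 / 24))             # → 480.0 (slow burn at 5%/day: 20 days)
```

This is why a fast-burn alert pages immediately while a slow-burn alert can open a ticket: at 2%/hour the month's budget is gone in roughly two days; at 5%/day it lasts twenty.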

How I Define SLOs That Actually Work

Kubernetes Reliability Patterns

Running Kubernetes at enterprise scale (300+ services, 15 clusters, AKS + EKS) teaches you things no certification covers. Here are the patterns that keep clusters healthy:

Kubernetes Reliability Architecture — Multi-Cluster Pattern
A global load balancer / DNS fronts two clusters:
  • AKS — Primary (East US): API gateway → NGINX ingress → workload pods (svc-checkout, svc-search, svc-auth). HPA + Cluster Autoscaler, PodDisruptionBudget, NetworkPolicy. Managed via ArgoCD · Sealed Secrets · OPA Gatekeeper.
  • EKS — DR (West US), standby: same gateway/ingress with replicated workloads. Velero backup · Flux sync. Failover trigger: 5xx > 2% for 3 min, via ArgoCD · Route53 health checks. RTO: 5 min · RPO: near-zero.
  • Shared platform: Datadog · Vault · Cert-Manager · External Secrets Operator
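The PodDisruptionBudget in that pattern is what keeps voluntary evictions (node drains, cluster upgrades) from taking out too many replicas at once. A minimal sketch, assuming a Deployment labelled `app: svc-checkout`:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: svc-checkout-pdb
spec:
  minAvailable: 2          # never drain below 2 ready replicas
  selector:
    matchLabels:
      app: svc-checkout    # illustrative label — match your own Deployment
```

Pair this with topology spread constraints so that "2 available" never means "2 on the same node."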

10 Kubernetes Reliability Wins — Applied

CI/CD & GitOps at Scale

A deployment pipeline should be a safety net, not a ritual. Here's the pattern I've refined across Azure DevOps, GitHub Actions, and Spinnaker — built so teams ship confidently rather than cautiously:

Production-Grade Deployment Pipeline
Code → Build → Test → Staging → Canary → Production → Observe
  • Code — PR + review, CODEOWNERS
  • Build — Docker build, SAST · Trivy scan
  • Test — unit + integration, contract tests
  • Staging — E2E · smoke, load-test gates
  • Canary — 5% traffic, 15-min SLO gate, auto-rollback
  • Production — blue-green flip, feature flags
  • Observe — Datadog monitors, SLO dashboard, auto-rollback on SLO breach
YAML — ArgoCD Canary Rollout with SLO Gate
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5            # 5% canary traffic
        - pause: {duration: 15m}
        - analysis:               # SLO validation gate
            templates:
              - templateName: slo-checkout-gate
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100          # Full traffic flip
  rollbackWindow:
    revisions: 5                  # Keep 5 previous versions
```
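The `slo-checkout-gate` the rollout references would be an Argo Rollouts AnalysisTemplate. A sketch of what one could look like — the Prometheus address, metric names, and thresholds here are assumptions, not values from the original:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-checkout-gate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 3            # three failing samples aborts the canary
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # hypothetical endpoint
          query: |
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m]))
      successCondition: result[0] < 0.005   # canary error rate must stay under 0.5%
```

The point of the gate: promotion to 50% traffic only happens if the canary's own error rate stays inside the SLO, with rollback as the automatic failure path.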

Incident Response: The War Room Playbook

Every production incident is a learning opportunity in disguise. Here's the structured response framework I've used to cut MTTR by over 40% — by eliminating the chaotic "everyone shouts at once" model:

T+0:00
Detection & Triage
Alert fires via PagerDuty with pre-populated runbook link. On-call acknowledges within 2 min SLA. Incident severity assigned (P1–P4) based on SLO impact. War room channel auto-created in Slack.
T+0:05
Incident Commander Assigned
IC takes coordination — separates responders into roles: diagnostics, comms, stakeholder updates. No one starts fixing before understanding the blast radius. "Contain before you fix."
T+0:10
Mitigation — Not Root Cause
Immediate goal is service restoration, not understanding why. Rollback the deployment, add capacity, disable a feature flag. Restore user experience first. The why comes in the postmortem.
T+0:30
Service Restored · Timeline Frozen
Document the blast radius, affected SLOs, customer impact, and every action taken with timestamps. This becomes the postmortem input. Don't lose context while it's fresh.
T+24:00
Blameless Postmortem Published
Five Whys analysis. Systems thinking — not finger-pointing. Action items with owners and due dates. Published to the entire engineering org. Shared in the next reliability review.
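The "severity assigned based on SLO impact" step at T+0:00 works best when it is mechanical, not a judgment call made under stress. One way to encode it — the thresholds below are illustrative, not the original's values:

```python
def assign_severity(error_budget_burned_pct: float, customer_facing: bool) -> str:
    """Map SLO impact to an incident priority (P1-P4).

    Thresholds are illustrative; tune them to your own error budgets.
    """
    if customer_facing and error_budget_burned_pct >= 10:
        return "P1"  # major budget burn on a user-facing SLO: full war room
    if error_budget_burned_pct >= 10:
        return "P2"  # same burn on an internal SLO
    if error_budget_burned_pct >= 2:
        return "P3"  # noticeable but contained
    return "P4"      # track it; fix in business hours
```

Codifying this in the alerting layer means the war room channel is created with the right urgency before any human has to argue about it.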

The most dangerous incident phase is the gap between "we rolled back" and "we wrote the postmortem." Context evaporates within hours. I mandate a live incident doc that everyone updates in real time — it becomes your postmortem draft automatically.

How to Level Up Your SRE Practice — Right Now

The roadmap is a map, not a destination. Here's the pragmatic sequence I'd use if I were starting over today — or advising a mid-level engineer wanting to accelerate:

This Week
Quick Wins
  • Audit every alert for actionability — if it doesn't have a runbook, delete it
  • Define one SLO for your most critical service
  • Set resource requests on every pod if you haven't
  • Install OpenTelemetry Collector in dev environment
  • Write a postmortem template for your team
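For the resource-requests item above, the container-spec fragment is small enough to ship this week. The values here are illustrative starting points, not recommendations — right-size them from actual usage data:

```yaml
resources:
  requests:
    cpu: 250m        # scheduler guarantee; the basis for bin-packing
    memory: 256Mi
  limits:
    memory: 512Mi    # cap memory; many teams omit CPU limits to avoid throttling
```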
This Month
Infrastructure
  • Implement GitOps for all cluster configuration
  • Add canary step to your deployment pipeline
  • Set up burn rate alerts on your top SLOs
  • Run a chaos game day — kill a pod in staging
  • Integrate Vault or External Secrets Operator
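For the burn-rate alert item above, a Prometheus-style rule sketch. Metric and job names are placeholders; the 14.4 multiplier is the 30-day burn rate that corresponds to consuming 2% of budget per hour (720 h / 50 h to exhaustion):

```yaml
groups:
  - name: checkout-slo-burn
    rules:
      - alert: CheckoutFastBurn
        # Fires when the error ratio implies a 14.4x burn rate on a 99.95% SLO
        expr: |
          ( sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="checkout"}[1h])) )
          > (14.4 * (1 - 0.9995))
        for: 5m
        labels:
          severity: page
```

A production setup would pair this with a second, slower window (e.g. 6 h at a lower multiplier) so short spikes page and sustained slow leaks ticket.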
This Quarter
Architecture
  • Build an Internal Developer Platform (IDP) foundation
  • Implement AI-assisted anomaly detection on key metrics
  • Design a multi-region DR strategy with RTO < 15 min
  • Establish reliability review cadence with product teams
  • Contribute a postmortem to a public reliability newsletter
🗺️ On the attached SRE Roadmap: The roadmap you shared maps the full spectrum from Linux fundamentals through cloud-native architecture. Think of it as a self-assessment tool: rate yourself 1–5 on each domain. Any area below 3 in your critical path is a reliability risk. Prioritise those first — depth over breadth, always.

The best SREs I know don't just keep systems running — they design systems that expect to fail gracefully. They write code for the 3am version of themselves. They automate the boring parts so humans can focus on the novel problems. That's what this roadmap is really pointing toward: engineering reliability as a practice, not a job title.