Shanthan Neelagiri — Sr Site Reliability Engineer

About Me

Engineering Reliability, One System at a Time

I'm a customer-focused SRE professional at Ahold Delhaize USA, supporting high-traffic e-commerce platforms that power some of America's most-loved grocery brands—Stop & Shop, Giant, Food Lion, Hannaford, and Peapod. I specialize in ensuring reliability, scalability, performance, and observability across web and mobile applications backed by GraphQL APIs and containerized microservices running in Azure.

My journey spans Nordstrom, AT&T, and Innova Solutions, from building CI/CD pipelines and containerizing microservices to standing up full observability stacks and driving incident management. I hold an M.S. in Computer Science from California State University, East Bay and a B.S. in Computer Science from Osmania University. I'm both an AWS Certified DevOps Engineer and a Certified Kubernetes Administrator.

When I'm not on-call, I'm exploring canyons, learning new cloud patterns, or mentoring the next generation of engineers.

Years Experience

Fortune 500 Clients

Cloud Certifications

Core Expertise

End-to-End SRE in Practice

A deep look at how I drive reliability across frontend, mobile, and e-commerce platforms.

⚙️

Platform & Infrastructure Engineering

Cloud & Container Orchestration

Managed Kubernetes workloads on Azure Kubernetes Service (AKS)

Designed autoscaling strategies using HPA and cluster autoscaler

Improved pod resiliency and rollout safety across environments

Optimized resource utilization and infrastructure cost efficiency

Enhanced deployment reliability with robust health checks and readiness probes

CI/CD & DevOps

Built and optimized pipelines in Azure DevOps with GitHub Actions integration

Reduced deployment failure rates through automated validation gates

Implemented rollback strategies and safe release mechanisms

Enabled blue-green and canary deployment patterns for zero-downtime releases

📊

Observability & Monitoring Excellence

Real User Monitoring (RUM)

Designed and implemented RUM in Datadog for mobile and web applications

Tracked LCP, TTFB, error rates, and session replay insights

Identified client-side performance bottlenecks impacting checkout and product flows

Application Performance Monitoring (APM)

Instrumented GraphQL services for distributed tracing across microservices

Identified latency contributors and reduced mean API response times

Correlated frontend performance with backend dependency health

Alerting & SLOs

Defined SLIs, SLOs, and error budgets for critical customer-facing services

Created actionable, high-signal alerts to reduce alert fatigue

Improved Mean Time to Detect (MTTD) across all domains

🛡️

Incident Management & Reliability Operations

Production Incident Response

Led incident triage across web, mobile, and API domains

Conducted thorough root cause analysis (RCA) for every major incident

Reduced Mean Time to Recovery (MTTR) through structured response runbooks

Improved cross-team communication during outages with clear escalation paths

Troubleshooting Expertise

Kubernetes pod crashes, OOM issues, and scaling misconfigurations

GraphQL resolver latency and frontend-to-backend timeout debugging

Cache inconsistencies, dependency failure isolation, and CI/CD pipeline failures

Post-Incident Improvements

Introduced preventive monitoring and automated diagnostics collection

Improved logging standards and strengthened deployment guardrails

🤖

AI-Driven Reliability Initiatives

Incident Triage Agent (Design Initiative)

Designed automated alert clustering and root cause suggestion engine

Built intelligent log summarization to reduce manual triage time

Improved MTTR through AI-powered pattern recognition across incidents

Observability Intelligence

Anomaly detection on latency and error rate patterns

AI-assisted noise reduction in alerting pipelines

Predictive capacity signals and deployment risk scoring

Business Impact

Results That Matter

Concrete outcomes from my focus on reliability, automation, and cross-team collaboration.

⚡

20%

Faster Incident Recovery

Led root cause analysis for major incidents and implemented long-term fixes that improved Mean Time to Recovery by 20%.

🛡️

25%

Fewer P1 Incidents

Built system resilience through preventive monitoring and deployment guardrails, reducing critical production incidents by a quarter.

📈

99.9%

SLA Compliance

Maintained 99.9% uptime SLAs while cutting infrastructure costs across multi-cloud environments.

🛒

↑

Checkout Conversion

Improved the frontend performance that drives checkout conversion rates, using targeted RUM insights and API optimization.

🚀

Zero

Downtime Releases

Led weekly cross-team releases using blue-green and canary patterns, with zero customer-facing downtime during deployments.

🤖

Agentic Operations

Pioneered AI-driven incident triage and observability intelligence, positioning the team for next-generation SRE automation.

Career Journey

Where I've Made My Mark

15+ years building infrastructure and reliability at scale for leading enterprises.

August 2023 — Present

Sr Site Reliability Engineer

Ahold Delhaize USA (Peapod Digital Labs)

One of the largest US grocery retailers — $96B+ revenue, 2,000+ stores, 23 states. Brands include Stop & Shop, Giant, Food Lion, Hannaford, and Peapod.

Delivered high-availability infrastructure and led weekly cross-team releases with zero downtime

Implemented SLOs/SLIs and built observability dashboards with Datadog APM, RUM, and Synthetics

Led RCA for major incidents — improved MTTR by 20% and reduced P1s by 25%

Supported the frontend domain (GraphQL APIs, mobile apps) on multi-cloud infrastructure

Pioneered AI-driven incident triage and anomaly detection initiatives

Azure AKS, Datadog, GraphQL, GitHub Actions, Kubernetes, SLOs/SLIs, AI/ML Ops

May 2015 — July 2023

DevOps Engineer

Nordstrom

8 years spanning Enterprise Product Management, Voice & Network Services — from CI/CD pipelines to store infrastructure upgrades.

Containerized applications using Docker and deployed to Kubernetes clusters on Amazon EKS

Implemented CI/CD pipelines driving microservice builds to Docker registries and Kubernetes

Developed Python and shell automation scripts for build and release processes

Managed AWS infrastructure with Terraform, CloudFormation, Chef, and Ansible

Configured Kafka and Tibco messaging across Kubernetes clusters

Docker, Kubernetes, EKS, Jenkins, Terraform, Python, Chef, Ansible, Kafka

January 2014 — August 2014

AWS DevOps / SRE

AT&T — Continuous Delivery Architecture Team

Full lifecycle AI management and Order Management System modernization across hybrid cloud environments.

Implemented containerized applications on Azure Kubernetes Service (AKS) with Ingress API Gateway

Deployed two-tier web applications to Azure DevOps CI/CD with Application Insights

Managed Kubernetes charts using Helm for reproducible builds and templatized manifests

Leveraged Azure, CloudWatch events, and Lambda for automated remediation

Azure AKS, Helm, Terraform, Jenkins, Ansible, PowerShell, Docker

April 2012 — November 2013

Systems Engineer

Innova Solutions, India

Near-zero downtime application migration to AWS and GCP with scalable cloud-ready architecture.

Built migration framework for electronic software delivery application from on-premises to AWS

Created Terraform scripts for AWS infrastructure provisioning and penetration testing

Ensured cloud readiness with API Gateway onboarding, SSO, and data encryption compliance

Managed CI/CD with Jenkins, SVN, and automated deployments using Puppet and Chef

AWS, GCP, Terraform, Puppet, Chef, Jenkins, Python, BigQuery

Technical Stack

Skills & Technologies

Cloud Platforms

Microsoft Azure, AWS (EC2, S3, EKS, Lambda, RDS, VPC), GCP (BigQuery, Dataproc, Cloud SQL), Multi-Cloud Architecture

Containers & Orchestration

Kubernetes (AKS / EKS), Docker, Helm Charts, Docker Swarm, Spinnaker, HPA / Cluster Autoscaler

CI/CD & Automation

Azure DevOps, GitHub Actions, Jenkins, Terraform, CloudFormation, Ansible, Chef, Puppet

Observability & SRE

Datadog (APM, RUM, Logs, Synthetics), Splunk, Prometheus & Grafana, ELK Stack, New Relic, CloudWatch, SLOs / Error Budgets, Chaos Engineering

Languages & Scripting

Python, Shell / Bash, PowerShell, Ruby, YAML, JSON

Architecture & Reliability

Microservices, GraphQL APIs, RCA / Incident Management, Blue-Green / Canary Deploys, AI-Driven Incident Automation, Anomaly Detection

Building Systems
That Never Sleep.

Education & Certifications

M.S. Computer Science

B.S. Computer Science

AWS Certified DevOps Engineer

Certified Kubernetes Administrator

Let's Connect
