Hi, I'm Shanthan

Building Reliable Systems at Scale

Sr Site Reliability Engineer · AWS Certified · Kubernetes Administrator

15+ years ensuring reliability, scalability, and performance across high-traffic e-commerce platforms serving millions of customers. I bridge development, platform engineering, and operations—enabling fast releases without compromising system stability.

Shanthan Neelagiri

Engineering Reliability, One System at a Time

I'm a customer-focused SRE professional at Ahold Delhaize USA, supporting high-traffic e-commerce platforms that power some of America's most-loved grocery brands—Stop & Shop, Giant, Food Lion, Hannaford, and Peapod. I specialize in ensuring reliability, scalability, performance, and observability across web and mobile applications backed by GraphQL APIs and containerized microservices running in Azure.

My journey spans Nordstrom, AT&T, and Innova Solutions, from building CI/CD pipelines and containerizing microservices to standing up full observability stacks and driving incident management. I hold an M.S. in Computer Science from California State University, East Bay and a B.S. in Computer Science from Osmania University. I'm both an AWS Certified DevOps Engineer and a Certified Kubernetes Administrator.

When I'm not on-call, I'm exploring canyons, learning new cloud patterns, or mentoring the next generation of engineers.

15+
Years Experience
3
Fortune 500 Clients
2
Cloud Certifications

Site Reliability Engineer specializing in large-scale e-commerce platforms, focused on Kubernetes reliability, observability engineering, AI-driven incident automation, and performance optimization across web and mobile ecosystems. Passionate about reducing MTTR, improving customer experience, and building resilient distributed systems in Azure cloud environments.

End-to-End SRE in Practice

A deep look at how I drive reliability across frontend, mobile, and e-commerce platforms.

⚙️
01
Platform & Infrastructure Engineering
Cloud & Container Orchestration
Managed Kubernetes workloads on Azure Kubernetes Service (AKS)
Designed autoscaling strategies using HPA and cluster autoscaler
Improved pod resiliency and rollout safety across environments
Optimized resource utilization and infrastructure cost efficiency
Enhanced deployment reliability with robust health checks and readiness probes
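For readers less familiar with HPA, the core scaling rule documented by Kubernetes is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below illustrates that arithmetic only. The function name, the replica bounds, and the CPU figures are hypothetical, and it leaves out the tolerance band and stabilization windows that a real controller applies.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Approximate the HPA scaling rule, clamped to [min_replicas, max_replicas].

    Illustrative only: omits the controller's tolerance and stabilization logic.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical example: 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))
```

With these assumed numbers the rule asks for 6 pods. A much larger spike would be capped at the assumed maximum of 10.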
CI/CD & DevOps
Built and optimized pipelines in Azure DevOps with GitHub Actions integration
Reduced deployment failure rates through automated validation gates
Implemented rollback strategies and safe release mechanisms
Enabled blue-green and canary deployment patterns for zero-downtime releases
📊
02
Observability & Monitoring Excellence
Real User Monitoring (RUM)
Designed and implemented RUM in Datadog for mobile and web
Tracked LCP, TTFB, error rates, and session replay insights
Identified client-side performance bottlenecks impacting checkout and product flows
Application Performance Monitoring (APM)
Instrumented GraphQL services for distributed tracing across microservices
Identified latency contributors and reduced mean API response times
Correlated frontend performance with backend dependency health
Alerting & SLOs
Defined SLIs, SLOs, and error budgets for critical customer-facing services
Created actionable, high-signal alerts to reduce alert fatigue
Improved Mean Time to Detect (MTTD) across all domains
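As a rough illustration of how an error budget follows from an SLO target, the sketch below converts an availability target into allowed downtime over a window. The function names, the 30-day window, and the 12 minutes of downtime are assumptions chosen for the example, not figures from any specific service.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


# Assumed example: a 99.9% availability SLO over 30 days.
print(round(error_budget_minutes(99.9), 1))   # about 43.2 minutes
print(round(budget_remaining(99.9, 12.0), 2))  # share of budget left after 12 minutes down
```

Framing the target as a budget makes the trade-off concrete: a 99.9% SLO leaves roughly 43 minutes per 30 days to spend on releases and incidents combined.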
🛡️
03
Incident Management & Reliability Operations
Production Incident Response
Led incident triage across web, mobile, and API domains
Conducted thorough root cause analysis (RCA) for every major incident
Reduced Mean Time to Recovery (MTTR) through structured response runbooks
Improved cross-team communication during outages with clear escalation paths
Troubleshooting Expertise
Kubernetes pod crashes, OOM issues, and scaling misconfigurations
GraphQL resolver latency and frontend-to-backend timeout debugging
Cache inconsistencies, dependency failure isolation, and CI/CD pipeline failures
Post-Incident Improvements
Introduced preventive monitoring and automated diagnostics collection
Improved logging standards and strengthened deployment guardrails
🤖
04
AI-Driven Reliability Initiatives
Incident Triage Agent (Design Initiative)
Designed automated alert clustering and root cause suggestion engine
Built intelligent log summarization to reduce manual triage time
Improved MTTR through AI-powered pattern recognition across incidents
Observability Intelligence
Anomaly detection on latency and error rate patterns
AI-assisted noise reduction in alerting pipelines
Predictive capacity signals and deployment risk scoring
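One simple way to approach anomaly detection on latency is a rolling z-score: compare each sample with the mean and standard deviation of the samples just before it. The sketch below is a generic illustration of that technique, not the production design. The function name, window size, threshold, and latency values are all assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(samples: list[float], window: int = 5,
                     threshold: float = 3.0) -> list[int]:
    """Return indexes whose value deviates from the trailing window by more
    than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: the z-score is undefined, so skip it.
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical p95 latencies in milliseconds with one spike at index 6.
latencies_ms = [100, 102, 98, 101, 99, 100, 400, 101]
print(zscore_anomalies(latencies_ms))
```

Note that a spike inflates the baseline for the samples that follow it, so production-grade detectors often use more robust statistics such as the median and median absolute deviation.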

Results That Matter

Concrete outcomes from my focus on reliability, automation, and cross-team collaboration.

20%

Faster Incident Recovery

Led root cause analysis for major incidents, implementing long-term fixes that improved Mean Time to Recovery (MTTR) by 20%.

🛡️
25%

Fewer P1 Incidents

Built system resilience through preventive monitoring and deployment guardrails, reducing critical production incidents by a quarter.

📈
99.9%

SLA Compliance

Maintained 99.9% uptime SLAs while simultaneously cutting infrastructure costs across multi-cloud environments.

🛒

Checkout Conversion

Improved frontend performance impacting checkout conversion rates through targeted RUM insights and API optimization.

🚀
Zero

Downtime Releases

Led weekly cross-team releases using blue-green and canary patterns, with zero customer-facing downtime.

🤖
AI

Agentic Operations

Pioneered AI-driven incident triage and observability intelligence, positioning the team for next-generation SRE automation.

Where I've Made My Mark

15+ years building infrastructure and reliability at scale for leading enterprises.

August 2023 — Present
Sr Site Reliability Engineer
Ahold Delhaize USA (Peapod Digital Labs)
One of the largest US grocery retailers — $96B+ revenue, 2,000+ stores, 23 states. Brands include Stop & Shop, Giant, Food Lion, Hannaford, and Peapod.
Delivered high-availability infrastructure and led weekly cross-team releases with zero downtime
Implemented SLOs/SLIs and built observability dashboards with Datadog APM, RUM, and Synthetics
Led RCA for major incidents — improved MTTR by 20% and reduced P1s by 25%
Supported frontend domain (GraphQL APIs, mobile apps) on multi-cloud infrastructure
Pioneered AI-driven incident triage and anomaly detection initiatives
Azure AKS · Datadog · GraphQL · GitHub Actions · Kubernetes · SLOs/SLIs · AI/ML Ops
May 2015 — July 2023
DevOps Engineer
Nordstrom
8 years spanning Enterprise Product Management, Voice & Network Services — from CI/CD pipelines to store infrastructure upgrades.
Containerized applications using Docker and deployed to Kubernetes clusters on Amazon EKS
Implemented CI/CD pipelines driving microservice builds to Docker registries and Kubernetes
Developed Python and shell automation scripts for build and release processes
Managed AWS infrastructure with Terraform, CloudFormation, Chef, and Ansible
Configured Kafka and Tibco messaging across Kubernetes clusters
Docker · Kubernetes · EKS · Jenkins · Terraform · Python · Chef · Ansible · Kafka
January 2014 — August 2014
AWS DevOps / SRE
AT&T — Continuous Delivery Architecture Team
Full lifecycle AI management and Order Management System modernization across hybrid cloud environments.
Implemented containerized applications on Azure Kubernetes Service (AKS) with Ingress API Gateway
Deployed two-tier web applications through Azure DevOps CI/CD pipelines with Application Insights
Managed Kubernetes charts using Helm for reproducible builds and templatized manifests
Leveraged Azure, CloudWatch events, and Lambda for automated remediation
Azure AKS · Helm · Terraform · Jenkins · Ansible · PowerShell · Docker
April 2012 — November 2013
Systems Engineer
Innova Solutions, India
Near-zero downtime application migration to AWS and GCP with scalable cloud-ready architecture.
Built migration framework for electronic software delivery application from on-premises to AWS
Created Terraform scripts for AWS infrastructure provisioning and penetration testing
Ensured cloud readiness with API Gateway onboarding, SSO, and data encryption compliance
Managed CI/CD with Jenkins, SVN, and automated deployments using Puppet and Chef
AWS · GCP · Terraform · Puppet · Chef · Jenkins · Python · BigQuery

Skills & Technologies

Cloud Platforms
Microsoft Azure · AWS (EC2, S3, EKS, Lambda, RDS, VPC) · GCP (BigQuery, Dataproc, Cloud SQL) · Multi-Cloud Architecture
Containers & Orchestration
Kubernetes (AKS / EKS) · Docker · Helm Charts · Docker Swarm · Spinnaker · HPA / Cluster Autoscaler
CI/CD & Automation
Azure DevOps · GitHub Actions · Jenkins · Terraform · CloudFormation · Ansible · Chef · Puppet
Observability & SRE
Datadog (APM, RUM, Logs, Synthetics) · Splunk · Prometheus & Grafana · ELK Stack · New Relic · CloudWatch · SLOs / Error Budgets · Chaos Engineering
Languages & Scripting
Python · Shell / Bash · PowerShell · Ruby · YAML · JSON
Architecture & Reliability
Microservices · GraphQL APIs · RCA / Incident Management · Blue-Green / Canary Deploys · AI-Driven Incident Automation · Anomaly Detection

Education & Certifications

🎓

M.S. Computer Science

California State University, East Bay, 2015

🎓

B.S. Computer Science

Osmania University, India, 2012

☁️

AWS Certified DevOps Engineer

Professional Level

Certified Kubernetes Administrator

CKA — CNCF

Let's Connect

Whether you want to discuss SRE best practices, Kubernetes reliability, AI-driven operations, or potential collaboration — I'd love to hear from you.