Naresh Erlapalli

Cary

Summary

Site Reliability Engineer with 10+ years of experience operating and supporting large-scale production systems. Strong background in observability platforms, Kubernetes, CI/CD, and automation across multi-cloud environments. Experienced in designing monitoring, alerting, and SLO-driven reliability practices, leading incident response, and reducing operational toil through automation and self-service tooling.

Work History

Site Reliability Engineer

NetApp Inc.

11.2021 - Current

Architect and operate Kubernetes-native observability platforms using Prometheus and Grafana to monitor microservices and infrastructure.
Design and maintain resilient metrics and logging pipelines that keep systems observable and debuggable.
Define and operationalize SLIs, SLOs, and alerting models to guide reliability decisions.
Lead incident response and root cause investigations; drive follow-up actions to prevent recurrence.
Led migration of 40+ microservices from Rancher to Kubernetes with standardized deployment and observability patterns.
Implement GitOps-based deployments using ArgoCD and Helm to automate application and platform management.
Collaborate closely with platform, infrastructure, security, and application teams on instrumentation and reliability best practices.
Automate operational tasks to reduce toil and improve on-call efficiency.

DevOps Engineer

Mindteck Inc.

02.2019 - 11.2021

Designed and implemented CI/CD pipelines for cloud-native applications across AWS, Azure, and GCP.
Built and supported monitoring solutions using Prometheus and Grafana for infrastructure and applications.
Defined SLAs/SLOs and supported incident response and production operations.
Automated build, deployment, and operational workflows using Jenkins, Python, Docker, and Ansible.
Managed large build server farms across multiple geographies and supported release operations.

Datacenter System Administrator

Mindteck Inc.

10.2017 - 01.2019

Administered Linux servers, VMware environments, and NetApp storage systems.
Managed Jenkins CI infrastructure and supported application deployments.
Automated provisioning and patching using Red Hat Satellite and scripting.
Supported storage, networking, and virtualization operations in production environments.

DevTools Engineer

NetApp Inc.

08.2014 - 09.2017

Built and maintained CI/CD systems using Jenkins for large development organizations.
Developed automation scripts in Python, Perl, and Shell for build and deployment workflows.
Supported build reliability, on-call rotations, and release management.
Worked closely with development and QA teams to resolve build and infrastructure issues.

Education

Master of Science - Electrical Engineering

University of Bridgeport

Bridgeport, CT

Skills

Observability tools: Prometheus, Grafana, Elasticsearch, LogScale, InfluxDB
Monitoring and reliability: SLIs, SLOs, incident response
Container orchestration: Kubernetes (AKS, GKE, Rancher)
Containerization: Docker
DevOps practices: GitOps, CI/CD (ArgoCD, Helm, Jenkins, Azure DevOps)
Cloud platforms: AWS, Azure, GCP
Automation and scripting: Python, Shell, Ansible
Operating systems: Linux
Networking fundamentals
Storage solutions: NetApp SAN/NAS

Current Projects

Building an Agentic SRE system that uses AI to analyze alerts, correlate metrics and logs, reduce false positives, and assist with incident triage and root cause analysis.
Designing a Metrics & Context Platform (MCP) to unify observability data from Prometheus, Elasticsearch, LogScale, InfluxDB, and relational databases into a single queryable context for AI-driven reasoning.
Developing AI-assisted workflows to automatically generate postmortems, RCA documents, and incident summaries from telemetry and alert data.
Implementing SLO-aware automation that uses SLI breaches to detect outages and drive reliability decisions programmatically.
Building Slack-based SRE bots that enrich alerts with contextual insights, suggest remediation steps, and improve on-call efficiency.
Exploring knowledge-graph-based models to represent service dependencies, telemetry locations, and remediation paths for autonomous SRE agents.
