Naresh Erlapalli

Cary

Summary

Site Reliability Engineer with 10+ years of experience operating and supporting large-scale production systems. Strong background in observability platforms, Kubernetes, CI/CD, and automation across multi-cloud environments. Experienced in designing monitoring, alerting, and SLO-driven reliability practices, leading incident response, and reducing operational toil through automation and self-service tooling.

Overview

11 years of professional experience

Work History

Site Reliability Engineer

NetApp Inc.
11.2021 - Current
  • Architect and operate Kubernetes-native observability platforms using Prometheus and Grafana to monitor microservices and infrastructure.
  • Design and maintain resilient metrics and logging pipelines to ensure systems are visible and debuggable.
  • Define and operationalize SLIs, SLOs, and alerting models to guide reliability decisions.
  • Lead incident response and root cause investigations; drive follow-up actions to prevent recurrence.
  • Led migration of 40+ microservices from Rancher to Kubernetes with standardized deployment and observability patterns.
  • Implement GitOps-based deployments using ArgoCD and Helm to automate application and platform management.
  • Collaborate closely with platform, infrastructure, security, and application teams on instrumentation and reliability best practices.
  • Automate operational tasks to reduce toil and improve on-call efficiency.

DevOps Engineer

Mindteck Inc.
02.2019 - 11.2021
  • Designed and implemented CI/CD pipelines for cloud-native applications across AWS, Azure, and GCP.
  • Built and supported monitoring solutions using Prometheus and Grafana for infrastructure and applications.
  • Defined SLAs/SLOs and supported incident response and production operations.
  • Automated build, deployment, and operational workflows using Jenkins, Python, Docker, and Ansible.
  • Managed large build server farms across multiple geographies and supported release operations.

Datacenter System Administrator

Mindteck Inc.
10.2017 - 01.2019
  • Administered Linux servers, VMware environments, and NetApp storage systems.
  • Managed Jenkins CI infrastructure and supported application deployments.
  • Automated provisioning and patching using Red Hat Satellite and scripting.
  • Supported storage, networking, and virtualization operations in production environments.

DevTools Engineer

NetApp Inc.
08.2014 - 09.2017
  • Built and maintained CI/CD systems using Jenkins for large development organizations.
  • Developed automation scripts in Python, Perl, and Shell for build and deployment workflows.
  • Supported build reliability, on-call rotations, and release management.
  • Worked closely with development and QA teams to resolve build and infrastructure issues.

Education

Master of Science - Electrical Engineering

University of Bridgeport
Bridgeport, CT

Skills

  • Observability tools: Prometheus, Grafana, Elasticsearch, LogScale, InfluxDB
  • Monitoring and reliability: SLIs, SLOs, incident response
  • Container orchestration: Kubernetes (AKS, GKE, Rancher)
  • Containerization: Docker
  • DevOps practices: GitOps, CI/CD (ArgoCD, Helm, Jenkins, Azure DevOps)
  • Cloud platforms: AWS, Azure, GCP
  • Automation and scripting: Python, Shell, Ansible
  • Operating systems: Linux
  • Networking fundamentals
  • Storage solutions: NetApp SAN/NAS

Current Projects

  • Building an Agentic SRE system that uses AI to analyze alerts, correlate metrics and logs, reduce false positives, and assist with incident triage and root cause analysis.
  • Designing a Metrics & Context Platform (MCP) to unify observability data from Prometheus, Elasticsearch, LogScale, InfluxDB, and relational databases into a single queryable context for AI-driven reasoning.
  • Developing AI-assisted workflows to automatically generate postmortems, RCA documents, and incident summaries from telemetry and alert data.
  • Implementing SLO-aware automation where SLI breaches are used to detect outages and drive reliability decisions programmatically.
  • Building Slack-based SRE bots that enrich alerts with contextual insights, suggest remediation steps, and improve on-call efficiency.
  • Exploring knowledge-graph-based models to represent service dependencies, telemetry locations, and remediation paths for autonomous SRE agents.
