About The Opportunity

We are an established player in Financial Technology and Enterprise Cloud Infrastructure, delivering resilient, high-throughput systems that support mission-critical institutional workloads. We operate large-scale distributed services and are investing in reliability, observability, and automation to meet aggressive SLAs across the business.

Location: United States (On-site)

Role & Responsibilities

• Lead and grow a high-performing Site Reliability Engineering team responsible for production availability, incident response, and operational excellence.

• Define and own SLIs, SLOs, and SLA frameworks, along with a reliability roadmap; translate business requirements into measurable reliability targets.

• Drive incident management and postmortem culture: lead major incidents, coordinate cross-functional response, and implement corrective actions to eliminate repeat failures.

• Architect and implement observability, monitoring, and alerting solutions that provide actionable signals (metrics, logs, traces) and reduce MTTD/MTTR.

• Improve platform scalability and resilience through automation, CI/CD pipelines, infrastructure-as-code, capacity planning, and performance testing.

• Partner with Engineering, Security, and Product teams to influence architecture, deploy robust runbooks, and bake reliability into the development lifecycle.

Skills & Qualifications

Must-Have

• Kubernetes

• Docker

• Prometheus

• Grafana

• Terraform

• AWS

Preferred

• Go

• Python

• Jenkins

Qualifications & Experience

• Proven experience leading SRE/Platform teams in production, with a track record of owning reliability for distributed systems.

• Strong understanding of incident management, postmortem discipline, capacity planning, and on-call rotations.

• Hands-on experience with cloud-native architectures, IaC, and CI/CD practices; able to both lead strategy and contribute technically.

Benefits & Culture Highlights

• Opportunity to shape reliability for large-scale, mission-critical systems with measurable business impact.

• Collaborative engineering culture that prioritizes automation, continuous improvement, and transparent postmortems.

• On-site team environment focused on mentorship, career growth, and technical leadership.

We seek a strategic SRE leader who combines deep operational expertise with people leadership to drive measurable uptime and velocity improvements. If you are passionate about observability, incident prevention, and building reliable cloud platforms, we want to hear from you.

Skills: Kubernetes, Docker, Prometheus, Grafana, Terraform, AWS, CI/CD, cloud, reliability

SRE Manager

About Black Rock Solutions INC