Minimum: USD $111,700.00/Yr.
Maximum: USD $145,200.00/Yr.
As an SRE Architect with a specialization in DevOps, monitoring, and diagnostics, you will play a critical role in ensuring the reliability, availability, and performance of our mission-critical services. You will design and implement end-to-end monitoring solutions, build observability pipelines, and help create scalable systems for proactive incident detection, diagnostics, and root cause analysis. In this role, you will work closely with engineering, product, and operations teams to drive a culture of reliability and continuous improvement.
Monitoring & Observability:
- Design and implement comprehensive monitoring and alerting solutions for production systems across multiple environments (cloud, on-prem, hybrid).
- Develop and refine metrics collection and visualization strategies using tools like Prometheus, Grafana, OpenTelemetry, and others.
- Build dashboards and custom monitoring solutions to ensure system health, performance, and security.
- Establish SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements) that align with business goals.
Incident Management & Diagnostics:
- Develop and implement tools and systems for real-time diagnostics and root cause analysis during incidents.
- Lead post-mortem analysis and drive remediation of systemic issues to prevent future incidents.
- Design diagnostic tools and automation to reduce mean time to detection (MTTD) and mean time to resolution (MTTR).
- Collaborate with engineering teams to define monitoring standards and ensure that new features and services meet reliability and observability requirements.
System Design & Architecture:
- Architect scalable, resilient, and highly available systems with observability baked in from the start.
- Apply SRE principles to design and optimize services for reliability, availability, and performance.
- Identify and address single points of failure, bottlenecks, and other operational risks in production environments.
Automation & Tooling:
- Create, maintain, and improve automation tools that enhance monitoring, diagnostics, and incident response.
- Integrate monitoring and observability tools into CI/CD pipelines for proactive issue detection and remediation.
- Contribute to the development of custom diagnostic tools for troubleshooting complex, distributed systems.
Collaboration & Knowledge Sharing:
- Collaborate with software engineering, platform engineering, and DevOps teams to ensure seamless integration of monitoring and diagnostics practices.
- Mentor and coach junior SREs and other team members on best practices for observability and incident management.
- Stay up to date with the latest industry trends and innovations in monitoring, diagnostics, and reliability engineering.
Education & Training Experience:
- Experience with advanced observability techniques, such as synthetic monitoring, canary deployments, and feature flags.
- Certification in cloud platforms (AWS, GCP, Azure) or monitoring tools (e.g., Prometheus Certified Associate).
- Previous experience in an SRE or DevOps leadership role.
- Knowledge of serverless architecture, microservices, and edge computing environments.
- Strong experience in distributed systems, cloud platforms (AWS, GCP, Azure), and containerization and orchestration (Docker, Kubernetes).
- Deep knowledge of monitoring tools such as Datadog and Cloud Monitoring.
- Proficiency in instrumentation techniques (e.g., OpenTelemetry, StatsD, custom metrics).
- Experience with log aggregation and analysis tools like ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, or similar.
- Expertise in alerting and notification systems, including PagerDuty, Opsgenie, or VictorOps.
Architect position
This is an individual contributor position.
Travel required: 5%
Job Will Remain Open Until Filled