Posted on Apr 4, 2025

Senior Site Reliability Engineer

Boston
Mid-Senior ICs
CarGurus
Public
1001-5000
Consumer Products & Tech

Headquartered in Boston, Massachusetts, CarGurus is the all-in-one platform that’s moving the entire car shopping journey online and guiding customers through each step. This includes everything from selling an old car to financing, purchasing, and delivering a new one. Today, millions of consumers visit cargurus.com each month, and more than 30,000 dealerships use our products. We have a people-first culture that fosters kindness, collaboration, and innovation, while empowering our Gurus with tools and resources to fuel their career growth. Our goal is to give all people—consumers, dealers, and our employees—the power to reach their destination. 

Job Description

Role overview

As a member of the CarGurus reliability team, the site reliability engineer will be responsible for defining, maintaining, and promulgating best practices and tools for SRE and observability.

What you’ll do

  • Performing Linux administration, applying site reliability best practices, managing incidents, and covering critical on-call duties
  • Collaborating with Engineering and Product Managers to define SLOs and monitor well-designed SLIs
  • Embedding with Engineering teams and independently addressing issues or collaborating to improve operational excellence
  • Serving as the primary point of escalation, and participating in the on-call rotation, for major engineering incidents
  • Owning our Incident Response Process, including conducting blameless Postmortems
  • Partnering with Engineering teams to ensure new services are production-ready
  • Championing our organizational standards for architecting, observing, deploying, and scaling our products
  • Evolving and maintaining our tracing, logging, monitoring, alerting, and other observability systems to increase observability and transparency
  • Educating the company on observability tools and troubleshooting techniques and practices
  • Making data-driven decisions to drive continuous improvement
  • Refusing to accept manual work as a solution to areas of weakness

What you’ll bring

  • Linux administration, SRE theory and vocabulary, basic coding and scripting, production experience, and incident management experience
  • A proven background in software engineering with multiple languages and significant relevant operational experience running revenue-critical services at scale
  • Understanding of technologies beyond coding such as Load Balancing, Configuration Management, Kubernetes, Terraform and Observability Systems
  • Comfort in dealing with Incidents and Availability Issues under pressure
  • Familiarity and experience working with cloud infrastructure in an AWS environment
  • Familiarity with modern Site Reliability Engineering best practices and theory
  • Comfort and skill in written and verbal communication across teams and organizations
  • Excitement in solving puzzles and discovering how a new service or tool works by identifying the individual components, libraries, and relationships it is built upon
  • A bias for action, but sufficient emotional intelligence to approach colleagues with positive regard and understand their challenges and decisions
  • Curiosity and the acceptance that there are always ways to learn and grow
  • The desire to be an active contributor in a collaborative and fast-paced environment


Boston
Java
JavaScript
jQuery
Node.js
Python
R
React
Redux
Ruby
Spring
SQL
AWS
Kubernetes
Docker
Jenkins
Elasticsearch
Kafka
In-Person