Posted on Oct 3, 2024

Engineer, Fleet Monitoring & Analysis

Roseland
Mid-Senior ICs
Engineering, Data Science + Analytics
CoreWeave
Private
101-250
Software, Security & Developer Tools

CoreWeave is a specialized cloud provider focused on GPU-accelerated use cases including VFX, AI/ML, Batch Processing and Real Time Experiences. We support countless AI/ML services in the text-to-image, NLP and broader AI/ML space, reducing clients’ infrastructure management requirements with our Kubernetes-based serverless GPU cloud offerings.

Job Description

About the role:

The Fleet Monitoring & Analysis Team contributes to the automated provisioning and management of CoreWeave’s ever-expanding fleet of hardware nodes and node types by continually improving node and environmental monitoring and observability. Playing a central role in CoreWeave’s growth strategy, this team is a critical piece of our cohesive, zero-touch, and high-reliability fleet management engine.

We seek an Engineer to join the Fleet Monitoring & Analysis team to help us build, run, and refine our metrics, alerts, visualizations, and data-driven insights. This individual will join a team of mixed-skill engineers focused on elevating the art of managing high-performance hardware at scale. As a team member, you would have the opportunity to:

  • Design and implement solutions for large-scale server observability to continually improve the stability of CoreWeave’s global hardware fleet.
  • Adapt, extend, and implement open-source solutions to augment the depth and breadth of our visibility into our operating environment.
  • Generate and maintain custom reports, alarms, and visualizations to help teams understand and respond to our growth and changes.
  • Create test plans, deployment automation, dashboards, alerts, and insights into our fleet operations, as well as participate in the Fleet Engineering Developers’ on-call rotation.
  • Grow, change, invest in your teammates, be invested in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself.

Wondering if you’re a good fit? We believe in investing in our people and value candidates who bring their own diverse experiences to our teams, even if they aren’t a 100% skill or experience match. Here are some qualities we’ve found compatible with our team. If a portion of this resonates with you, we’d love to talk.

  • You have 2 or more years of experience in software or infrastructure engineering.
  • You have experience with automation and orchestration workflows and are knowledgeable about server hardware, components, and related technologies, as well as strategies for managing physical infrastructure at scale.
  • You have experience implementing metrics collection and alerting on standard platforms.
  • You believe in the value of automation and will champion practices that drive reliability and prioritize the CoreWeave customer experience.

 

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $160,000 to $185,000. Pay is based on a number of factors, including market location, and may vary depending on job-related knowledge, skills, and experience.

In-Person