CoreWeave is The Essential Cloud for AI™, providing a platform for innovators to build and scale AI. The Operations Engineer will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics, ensuring their stability and performance.
Responsibilities
- Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes
- Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks
- Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams
- Perform routine maintenance and upgrades on InfiniBand switches and control plane components
- Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise
Skills
- At least 1 year of experience with InfiniBand or similar networking technologies
- Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting
- Experience with Linux system administration and maintenance
- Proficiency in at least one scripting language
- Hands-on experience with NVIDIA UFM or similar fabric management tools
- Familiarity with the Slurm job scheduler and its role in HPC environments
- Experience with monitoring and visualization platforms such as Grafana or Prometheus
- Experience with operational tooling and automation frameworks like Ansible
- Knowledge of data center operations, including server racks and cabling
- Python or Bash scripting
Benefits
- Medical, dental, and vision insurance (100% paid for by CoreWeave)
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short- and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Ability to participate in the Employee Stock Purchase Program (ESPP)
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations
- A casual work environment
- A work culture focused on innovative disruption
Company Overview
CoreWeave provides cloud infrastructure services designed to support artificial intelligence and high-performance computing workloads. It was founded in 2017 and is headquartered in Livingston, New Jersey, USA, with a workforce of 1,001 to 5,000 employees. Its website is https://www.coreweave.com.