Note: This is a remote role open to candidates in the USA. Cohere is a company focused on scaling intelligence to serve humanity through AI systems. They are seeking a Site Reliability Engineer to join their Model Serving team, which develops and operates the AI platforms that deliver large language models through API endpoints, ensuring high performance and reliability.
Responsibilities
- Build self-service systems that automate managing, deploying and operating services, including our custom Kubernetes operators that support language model deployments
- Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems
- Take steps required to ensure we hit defined SLOs, including participation in an on-call rotation
- Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback
- Develop our team through knowledge sharing and an active review process
Skills
- 5+ years of engineering experience running production infrastructure at a large scale
- Experience designing large, highly available distributed systems with Kubernetes, and GPU workloads on those clusters
- Experience with Kubernetes development, production coding, and support
- Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving
- Experience designing, deploying, supporting, and troubleshooting complex Linux-based computing environments
- Experience in compute/storage/network resource and cost management
- Excellent collaboration and troubleshooting skills to build mission-critical systems, and ensure smooth operations and efficient teamwork
- The grit and adaptability to solve complex technical challenges that evolve day to day
- Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference
- Strong understanding or working experience with distributed systems
- Experience in Golang, C++ or other languages designed for high-performance scalable servers
Benefits
- An open and inclusive culture and work environment
- Work closely with a team on the cutting edge of AI research
- Weekly lunch stipend, in-office lunches & snacks
- Full health and dental benefits, including a separate budget to take care of your mental health
- 100% Parental Leave top-up for up to 6 months
- Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
- Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
- 6 weeks of vacation (30 working days!)
Company Overview
Cohere develops enterprise artificial intelligence software and provides language models, retrieval tools, and workplace platforms. It was founded in 2019, and is headquartered in Toronto, Ontario, CAN, with a workforce of 201-500 employees. Its website is https://cohere.com.
Company H1B Sponsorship
Cohere has a track record of offering H1B sponsorships, with 11 in 2025, 14 in 2024, 13 in 2023, 5 in 2022, and 2 in 2021. Please note that this does not guarantee sponsorship for this specific role.