
Mistral Cloud - Site Reliability Engineer
Mistral AI
Amsterdam
3 weeks ago
Mistral AI is seeking highly experienced Site Reliability Engineers to shape the reliability, scalability, and performance of its cloud platform and customer-facing applications. The role involves designing and maintaining scalable infrastructure, implementing monitoring and incident response systems, and collaborating with software engineers and product teams. Candidates need 5+ years of experience in a DevOps/SRE role and a Master’s degree in Computer Science or a related field.
Hybrid
Full-time
Senior
DevOps
SRE
Salary
Not specified
Core Qualifications
Technical (Must-have)
DevOps, SRE, distributed systems, site reliability, reliability KPIs, observability, SLAs, CI/CD, containerization, orchestration
Soft Skills
problem-solving, communication, self-motivated, teamwork
Tools (Must-have)
Docker, Kubernetes, Prometheus, Grafana, ELK Stack, Datadog, Terraform, CloudFormation
Preferred Qualifications
Technical (Nice-to-have)
AI/ML, high-performance computing, HPC, workload managers, Slurm, AI-oriented solutions, Fluidstack, Coreweave, Vast
Key Responsibilities
- Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure
- Operate systems and troubleshoot issues in production environments
- Implement and improve monitoring, alerting, and incident response systems
- Implement and maintain workflows and tools for customer-facing APIs and large training runs
- Participate occasionally in on-call rotations to respond to incidents
- Drive continuous improvement in infrastructure automation, deployment, and orchestration
- Collaborate with software engineers to develop and implement solutions for model-training experiments
- Help build a cloud platform offering an abstraction layer between science, engineering and infrastructure
- Design and develop new workflows and tooling to improve reliability, availability and performance
- Collaborate with the security team to ensure infrastructure adheres to best security practices
- Document processes and procedures to ensure consistency and knowledge sharing
- Contribute to open-source projects, research publications, blog articles and conferences