Mistral Cloud - Site Reliability Engineer

Mistral AI

Amsterdam

3 weeks ago

Mistral AI is seeking highly experienced Site Reliability Engineers to shape the reliability, scalability and performance of its Cloud platform and customer-facing applications. The role involves designing and maintaining scalable infrastructures, implementing monitoring and incident response systems, and collaborating with software engineers and product teams. It requires 5+ years of experience in a DevOps/SRE role and a Master’s degree in Computer Science or a related field.

Hybrid

Full-time

Senior

DevOps

SRE

Salary

Not specified

Work Location

Amsterdam, North Holland, Netherlands, NL

Work Model

Remote with monthly office visits: at least 3 days per month in Paris office for remote hires

Experience Required

5 years

Employment Type

Full-time

Experience Level

5+ years of experience in a DevOps/SRE role

Core Qualifications

Technical (Must-have)

DevOps, SRE, distributed systems, site reliability, reliability KPIs, observability, SLAs, CI/CD, containerization, orchestration

Soft Skills

problem-solving, communication, self-motivated, teamwork

Tools (Must-have)

Docker, Kubernetes, Prometheus, Grafana, ELK Stack, Datadog, Terraform, CloudFormation

Preferred Qualifications

Technical (Nice-to-have)

AI/ML, high-performance computing, HPC, workload managers, Slurm, AI-oriented solutions, Fluidstack, Coreweave, Vast

Key Responsibilities

• Design, build, and maintain scalable, highly available and fault-tolerant infrastructures
• Operate systems and troubleshoot issues in production environments
• Implement and improve monitoring, alerting, and incident response systems
• Implement and maintain workflows and tools for customer-facing APIs and large training runs
• Participate occasionally in on-call rotations to respond to incidents
• Drive continuous improvement in infrastructure automation, deployment, and orchestration
• Collaborate with software engineers to develop and implement solutions for model-training experiments
• Help build a cloud platform offering an abstraction layer between science, engineering and infrastructure
• Design and develop new workflows and tooling to improve reliability, availability and performance
• Collaborate with the security team to ensure infrastructure adheres to best security practices
• Document processes and procedures to ensure consistency and knowledge sharing
• Contribute to open-source projects, research publications, blog articles and conferences

Site Reliability Engineer, SRE, DevOps, Cloud, AI, Technology, Engineering, Remote, Full-time, Senior
