Mistral Cloud - Site Reliability Engineer

Mistral

Amsterdam

2 days ago

Mistral AI seeks an experienced Site Reliability Engineer to ensure the reliability, scalability, and performance of its Cloud platform. The role combines operations, development, and collaboration with software engineers. Candidates need 5+ years of DevOps/SRE experience, strong infrastructure skills, and proficiency in scripting languages.

Hybrid

Full-time

Senior

Docker

Kubernetes

Salary

Not specified

Work Location

Amsterdam, North Holland, Netherlands, NL

Work Model

Remote, with an expectation of in-person collaboration; candidates may work from the European offices or remotely from the listed countries, with visits to the Paris HQ.

Experience Required

5+ years

Employment Type

Full-time

Experience Level

Senior: 5+ years in a DevOps/SRE role

Core Qualifications

Technical (Must-have)

Docker, Kubernetes, Terraform, CloudFormation, Python, Go, Bash, Prometheus, Grafana, ELK Stack, Datadog

Soft Skills

Problem-solving, communication, self-motivation, team collaboration

Preferred Qualifications

Technical (Nice-to-have)

Slurm, Fluidstack, CoreWeave, Vast

Key Responsibilities

• Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure
• Operate systems and troubleshoot issues in production environments
• Implement and improve monitoring, alerting, and incident response systems
• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging, and alerting systems)
• Participate in on-call rotations and perform root cause analysis
• Drive continuous improvement in infrastructure automation, deployment, and orchestration
• Collaborate with software engineers to develop solutions for safe and reproducible model-training experiments
• Help build a cloud platform offering an abstraction layer between science, engineering, and infrastructure
• Design and develop new workflows and tooling to improve reliability, availability, and performance
• Collaborate with the security team to ensure best practices and compliance
• Document processes and procedures
• Contribute to open-source projects, research publications, blog articles, and conferences

Site Reliability Engineer, SRE, Cloud, Kubernetes, Docker, Terraform, Python, CI/CD, Observability, AI
