Senior Site Reliability Engineer
Job Description
High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.
Role Overview
- Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
- Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
- Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.
Responsibilities
- Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements
- Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors
- Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost
- Troubleshooting across the full stack, including hardware, networking, and distributed systems
- Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency
- Participation in an on-call rotation (approximately one week per month)
Key Attributes
- Strong ownership mindset with focus on delivery and accountability
- Experience building maintainable, well-documented systems in complex environments
- Ability to operate effectively in ambiguous and rapidly evolving contexts
- Clear and effective communication skills with collaborative, low-ego approach
Minimum Requirements
- 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing
- Strong written and verbal communication skills in English
- Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)
- Programming or scripting experience in Go, Python, or Bash
- Familiarity with infrastructure automation and infrastructure-as-code tools
- Strong technical foundation in computing or a related discipline
Preferred Experience
- Experience operating large-scale machine learning or AI-compute workloads
- Background in multi-tenant distributed systems at scale
- Hands-on experience with data centre or bare-metal infrastructure
- Knowledge of high-performance networking technologies
- Experience managing large-scale storage systems (commercial or open-source)
Compensation & Benefits
- Competitive salary and equity package
- Retirement or pension contributions aligned with local standards
- Health coverage including medical, dental, and vision
- Generous paid time off policy
Location: London
Job Type: Full-time
Category: Technology