Site Reliability Engineer (CloudOps)

The Site Reliability Engineer plays a critical role in ensuring that our AI-driven, cloud-native platform is reliable, observable, secure, and able to scale with the organisation's growth. As we adopt intelligent agents, autonomous workflows, and increasingly complex distributed systems, the SRE ensures that resilience, performance, and operational excellence are built into everything we deliver. By partnering closely with Engineers, Architects, and the Engineering Manager, the SRE defines the patterns, tooling, and automation that enable fast, safe, and repeatable deployments.

This role safeguards our production environment, drives continuous improvement across CI/CD and observability, and establishes the reliability practices that empower autonomous squads to move quickly without compromising stability. The SRE is essential to maintaining customer trust, supporting AI-first innovation, and ensuring our platform remains robust, secure, and highly available at scale.

In this position, you will ensure the reliability, scalability, and security of our engineering systems. Working closely with the Engineering Manager and Head of Engineering, you will identify priorities that remove friction for engineering teams, streamline processes, and enhance operational excellence. The role combines software engineering principles with systems administration to deliver robust, automated, cost-effective, and secure-by-design solutions.

Key Responsibilities

Reliability, Performance & Security:

Design and implement strategies to improve system reliability, availability, and security.
Ensure all solutions follow secure-by-design principles, incorporating cybersecurity best practices from inception through deployment.
Conduct regular security reviews and collaborate with security teams to address vulnerabilities.

CI/CD Management:

Own and optimise Continuous Integration and Continuous Deployment pipelines.
Embed security checks (e.g., static analysis, dependency scanning) into CI/CD workflows.
Ensure secure, efficient, and automated deployment processes across environments.

Monitoring & Observability:

Implement and maintain monitoring solutions for infrastructure and applications.
Develop dashboards and alerting systems to ensure proactive incident and security event management.
Evaluate and integrate new observability tools as needed.

Automation & Tooling:

Automate repetitive tasks to improve efficiency and reduce human error.
Build and maintain internal tools that support engineering productivity and security compliance.
Champion Infrastructure as Code (IaC) practices using tools such as Terraform or ARM templates.

Cloud Infrastructure Management:

Manage and optimise services across AWS and Azure environments.
Ensure scalability, resilience, and security of service-based architectures.
Implement cost management strategies to optimise cloud spend without compromising performance or security.

Incident Response & Root Cause Analysis:

Lead incident response efforts, including security incidents, and conduct post-mortem reviews.
Drive continuous improvement through lessons learned and preventive measures.

Skills & Experience

Proven experience in AWS and Azure cloud environments.
Strong background in CI/CD tools (e.g., Azure DevOps Pipelines, GitHub Actions, Jenkins).
Expertise in monitoring and observability platforms (e.g., Prometheus, Grafana, Datadog).
Proficiency in scripting and automation (Python, Bash, PowerShell).
Familiarity with containerisation and orchestration (Docker, Kubernetes).
Solid understanding of networking, security, and cost optimisation in cloud environments.
Knowledge of cybersecurity principles, secure coding practices, and compliance frameworks.
A problem-solver with a proactive mindset.
Comfortable working in fast-paced, evolving environments.
Strong communicator who can bridge gaps between operations, development, and security teams.
Passionate about automation, scalability, cost efficiency, and security.

Location: Manchester
Job Type: Full Time
Category: I.T. & Communications