Site Reliability Engineer
Job Description
NOTE: VISA SPONSORSHIP IS NOT PROVIDED
Role: SRE
Location: London, UK (5 Days / Week Onsite)
Type: Contract Inside IR35 / Permanent
Exp: 8+ Years
Skills:
- SRE experience with Python-based applications (not Java)
- Exposure to cloud technologies
- Familiarity with Athena ecosystem or similar (SecDB, Quartz)
- Banking and risk domain exposure
SRE Role Description
We need an experienced SRE to focus predominantly on automation, optimization, and process re-engineering using AI for the Market Risk Platform. Success is measured by capacity created (toil eliminated, fewer manual steps, faster recovery, safer/faster changes), not by being the primary BAU support resource. Strong Python and provable agentic AI delivery are required.
Primary Objectives:
- Eliminate operational toil and recurring manual work through durable automation
- Re-engineer support/change processes to reduce handoffs, approval friction and rerun complexity
- Industrialize reliability operations so existing SREs spend less time firefighting and more time engineering
Key Responsibilities (Automation & Process first)
Automation Engineering (Core)
- Build production-grade automation in Python (tools, services, workflows) to remove repetitive work: environment checks, dependency validation, automated reruns/reprocessing, safe restarts, drift detection, remediation actions, and standardized operational tasks
- Create self-service capabilities for common requests (guardrailed, auditable, repeatable)
- Implement “automation with safety”: idempotency, dry-run modes, approval gates where needed, rollback/undo strategies, and clear audit trails
Process Re-engineering (Core)
- Map current operational processes (incident/problem/change, release readiness, rerun/recovery, access/entitlements, environment onboarding) and redesign them to remove waste and reduce cycle time.
- Standardize runbooks/playbooks into executable workflows; reduce tribal knowledge via templates, checklists, and automated pre-flight controls.
- Define and track operational KPIs (toil hours removed, alert volume reduction, MTTR improvements, change failure rate reduction, rerun time reduction).
Agentic AI
- Design and implement agentic workflows that take action using tools/runbooks (e.g., diagnostics, evidence gathering, correlation, guided remediation, change-risk checks, automated rerun orchestration)
- Put strong controls in place: scoped permissions, deterministic fallbacks, human-in-the-loop approvals for risky actions, evaluation harnesses and measurable outcomes.
- Productionize with monitoring, logging and post-incident learnings feeding back into the agent/tooling
Observability (enablement for automation)
Required Skills & Experience
- Senior SRE experience on distributed systems and batch/intraday workloads in a production environment.
- Strong Python
- Provable agentic AI experience showing:
  - Tool integration, guardrails, evaluation approach
  - Measurable impact (toil reduction, MTTR reduction, alert reduction, etc.)
- Demonstrated process optimization ability (removing steps/handoffs, standardizing workflows, implementing lightweight controls with metrics)
- Strong Linux and troubleshooting fundamentals across application/system/network layers
- Experience working across mixed estates (on-prem VMs + cloud, with some Kubernetes exposure for operational monitoring/reruns)
Differentiators
- Exposure to Banking/Finance Market Risk Domains
- Experience with and knowledge of the Athena ecosystem or similar (SecDB, Quartz)
Location: London
Job Type: Full Time
Category: Technology