Service Reliability Lead | SPD Technology

At SPD Technology, we bring together like-minded people who want their work to deliver real value and who share a commitment to high performance and to building custom, cutting-edge tech solutions that drive clients’ growth. We foster a culture of excellence and give our people the opportunity to take ownership and contribute at every level. We value humanity and collaboration, encourage professional and personal growth, and maintain a supportive and flexible work environment where everyone’s contribution is welcome.

We are looking for a Service Reliability Lead to join our team.

About the role

The Service Reliability Lead is the single technical owner accountable for the operational health, SLA compliance, and continuous improvement of the Utlx payment orchestration platform.

This role combines Site Reliability Engineering discipline with managed-service governance. The lead owns incident response end-to-end (from P1 triage to RCA delivery), builds and operates the monitoring and alerting stack, manages SLA measurement and penalty mechanics, and serves as the senior technical escalation point for the client.

The Service Reliability Lead reports to the Project Manager, who manages the support team as a whole (including engineers). The PM owns scheduling, administrative client communications, resource coordination, and project governance. The Service Reliability Lead operates as the technical authority within the team: a player-coach who directs engineering work, makes real-time incident decisions, sets technical standards, and participates in hands-on work alongside the engineers. Both the lead and the engineers report to the PM; the lead’s authority over the team is technical, not managerial.

About the project

You will work on a Payment Orchestration Platform, a greenfield project designed to optimize transaction processing, enhance operational efficiency, and deliver a seamless user experience. As part of this project, you will have the opportunity to influence its architecture and technical decisions.

The support team consists of the following roles:

Project Manager, Service Reliability Lead, Support / DevOps Engineers (3–4)

Tech Stack

Grafana, CloudWatch, PagerDuty, EKS, RDS, networking

Work Environment

Work within the EU time zone (UTC+1/UTC+2), which is 1 hour behind Ukraine.

As a qualified expert, you will

Incident Management and On-Call

Own the L2/L3/L4 escalation path: serve as the senior technical point of contact for all incidents, and coordinate with third-party vendors (AWS, payment gateways, infrastructure providers) when an external root cause is identified
Ensure incident acknowledgement and resolution in line with SLA targets across all priority levels
Make real-time decisions on hotfixes, rollbacks, and configuration changes under pressure
Build and maintain the on-call rotation; ensure zero coverage gaps
Manage workarounds through to permanent resolution and maintain the escalation matrix for the client

Observability and Monitoring

Deliver an operational monitoring dashboard (CloudWatch / Grafana)
Configure PagerDuty for automated alerting and on-call escalation aligned to SLA targets
Maintain instrumentation across availability, latency, and error rate metrics per service tier

SLA and Penalty Governance

Instrument and validate SLA clocks across response, workaround, and resolution targets
Prepare monthly service credit calculations and service performance reports
Provide metrics evidence during any client dispute review
Deliver monthly reports covering incident volumes, SLA performance, RCA status, and risk log

RCA and Service Improvement

Author Root Cause Analysis documents within 5 days of incident resolution
Identify recurring patterns and monitor for Service Improvement Plan triggers
Design and implement SIPs with corrective actions, owners, and delivery timelines
Proactively reduce incident frequency and improve mean time to resolution

Infrastructure and AWS Operations

Operate in line with the AWS Shared Responsibility Model
Distinguish SPD-caused from third-party failures; maintain evidence for availability exclusion claims
Coordinate planned and urgent maintenance windows with the client

We’re looking for you if you have

5–8 years in production operations / SRE
Hands-on incident command experience
AWS operational depth (CloudWatch, EKS, RDS, networking)
Monitoring stack: Grafana, CloudWatch, PagerDuty
PCI DSS awareness
RCA authorship and structured problem-solving
SLA management and service credit mechanics
Experience with hypercare / go-live stabilisation periods
Experience in fintech or payment systems

What’s in it for you

Reveal great tech solutions

Join the team of experts who create custom, cutting-edge tech solutions for world-renowned businesses, fueling client growth. Unleash your potential, tackle new challenges, and be part of a team that values your skills and contributions. Focus on long-term impact and building tailored, long-lasting partnerships with our clients.

Experience an agile and flexible working environment

Enjoy the freedom of fully remote work with a flexible working schedule. Benefit from a stable workload and a stable income, with a laptop and licensed software provided. We focus on lasting cooperation and bring together result-oriented individuals who share a high-performance approach to work.

Embrace the opportunity for personal and professional growth

Benefit from performance and merit reviews, elevate your skills with personal development plans, and keep learning through the corporate library, public speaking support, and more.

Be among like-minded people

Work with a like-minded team that cares about what they do and how they do it. Collaborate with top-notch experts who are always ready to help and support you through any challenges. Join company-wide tech and cultural events, and contribute to meaningful CSR initiatives that resonate with your values. Feel supported by your HR, and take advantage of our referral bonus program.

Interview steps

Pre-Screening with the recruiter (30 min)
Technical interview (up to 90 min)
Manager interview (30–45 min)

Oksana Shulha

Senior Talent Acquisition Specialist