At SPD Technology, we bring together like-minded people who want to create value through their work and who share a commitment to high performance and to delivering custom, cutting-edge tech solutions that drive clients’ growth. We empower our people with a culture of excellence and give them the opportunity to take ownership and contribute at every level. We value humanity and collaboration, encourage professional and personal growth, and foster a supportive and flexible work environment where everyone’s contribution is welcomed.
We are looking for a Service Reliability Lead to join our team.
About the role
The Service Reliability Lead is the single technical owner accountable for the operational health, SLA compliance, and continuous improvement of the Utlx payment orchestration platform.
This role combines Site Reliability Engineering discipline with managed-service governance. The lead owns incident response end-to-end (from P1 triage to RCA delivery), builds and operates the monitoring and alerting stack, manages SLA measurement and penalty mechanics, and serves as the senior technical escalation point for the client.
The Service Reliability Lead reports to the Project Manager, who manages the support team as a whole (including engineers). The PM owns scheduling, administrative client communications, resource coordination, and project governance. The Service Reliability Lead operates as the technical authority within the team: a player-coach who directs engineering work, makes real-time incident decisions, sets technical standards, and participates in hands-on work alongside the engineers. Both the lead and the engineers report to the PM; the lead’s authority over the team is technical, not managerial.
About the project
You will work on a Payment Orchestration Platform, a greenfield project designed to optimize transaction processing, enhance operational efficiency, and deliver a seamless user experience. As part of this project, you will have the opportunity to influence its architecture and technical decisions.
The support team consists of the following roles:
Project Manager, Service Reliability Lead, Support / DevOps Engineers (3–4)
Tech Stack
Grafana, CloudWatch, PagerDuty, EKS, RDS, networking
Work Environment
Work within the EU time zone (UTC+1/UTC+2), which is 1 hour behind Ukraine.
As a qualified expert, you will
Incident Management and On-Call
- Own the L2/L3/L4 escalation path: serve as the senior technical point of contact for all incidents, and coordinate with third-party vendors (AWS, payment gateways, infrastructure providers) when an external root cause is identified
- Ensure incident acknowledgement and resolution in line with SLA targets across all priority levels
- Make real-time decisions on hotfixes, rollbacks, and configuration changes under pressure
- Build and maintain the on-call rotation; ensure zero coverage gaps
- Manage workarounds through to permanent resolution and maintain the escalation matrix for the client
Observability and Monitoring
- Deliver an operational monitoring dashboard (CloudWatch / Grafana)
- Configure PagerDuty for automated alerting and on-call escalation aligned to SLA targets
- Maintain instrumentation across availability, latency, and error rate metrics per service tier
SLA and Penalty Governance
- Instrument and validate SLA clocks across response, workaround, and resolution targets
- Prepare monthly service credit calculations and service performance reports
- Provide metrics evidence during any client dispute review
- Deliver monthly reports covering incident volumes, SLA performance, RCA status, and risk log
RCA and Service Improvement
- Author Root Cause Analysis documents within 5 days of incident resolution
- Identify recurring patterns and monitor for Service Improvement Plan triggers
- Design and implement SIPs with corrective actions, owners, and delivery timelines
- Proactively reduce incident frequency and improve mean time to resolution
Infrastructure and AWS Operations
- Operate in line with the AWS Shared Responsibility Model
- Distinguish SPD-caused from third-party failures; maintain evidence for availability exclusion claims
- Coordinate planned and urgent maintenance windows with the client
We’re looking for you if you have
- 5–8 years in production operations / SRE
- Hands-on incident command experience
- AWS operational depth (CloudWatch, EKS, RDS, networking)
- Monitoring stack: Grafana, CloudWatch, PagerDuty
- PCI DSS awareness
- RCA authorship and structured problem-solving
- SLA management and service credit mechanics
- Experience with hypercare / go-live stabilisation periods
- Experience in fintech or payment systems
What’s in it for you
Reveal great tech solutions
Join the team of experts who create custom, cutting-edge tech solutions for world-renowned businesses, fueling client growth. Unleash your potential, tackle new challenges, and be part of a team that values your skills and contributions. Focus on long-term impact and building tailored, long-lasting partnerships with our clients.
Experience an agile and flexible working environment
Enjoy the freedom of fully remote work with a flexible working schedule. Benefit from a stable workload and a stable income, with a laptop and licensed software provided. We focus on lasting cooperation and unite result-oriented people who share a high-performance approach to work.
Embrace the opportunity for personal and professional growth
Benefit from performance and merit reviews, build your skills through personal development plans, and keep learning with the corporate library, public speaking support, and more.
Be among like-minded people
Work with a like-minded team that cares about what they do and how they do it. Collaborate with top-notch experts who are always ready to help and support you through any challenges. Join company-wide tech and cultural events, and contribute to meaningful CSR initiatives that resonate with your values. Feel supported by your HR, and take advantage of our referral bonus program.
Interview steps
- Pre-Screening with the recruiter (30 min)
- Technical interview (up to 90 min)
- Manager interview (30–45 min)