Healthcare SaaS
SRE practice built from the ground up
SRE consulting for healthtech infrastructure: SLOs, incident response, and on-call at scale
The Challenge
No SLOs, no runbooks, and fully reactive healthtech infrastructure
A Series A healthtech company was scaling from 5,000 to 50,000 patients, and their on-call process was a Slack DM to the CTO. They had no SLOs, no runbooks, and no alerting strategy, just Datadog dashboards nobody checked until something broke. Two major incidents in Q3 each caused 6+ hours of downtime and nearly cost them a hospital contract.
The Solution
SRE consulting: SLO framework, incident response runbooks, and error budget tracking
We ran a two-week SRE consulting engagement. We defined SLOs for their 4 critical user journeys (appointment booking, results delivery, prescriptions, and auth), built an error budget dashboard, and wrote 12 incident response runbooks for their most common failure modes. We set up PagerDuty with escalation policies and trained two engineers to become the first in-house SRE on-call responders. We also redesigned their Datadog alerting to be SLO-based rather than threshold-based, cutting alert noise by 85%.
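To make the difference between SLO-based and threshold-based alerting concrete, here is a minimal Python sketch of a multi-window burn-rate check. It is an illustration only, not the client's actual Datadog configuration: the `Slo` class, the function names, the 99.9% target, and the request counts are all assumptions. The 14.4 threshold is the commonly cited burn rate at which 2% of a 30-day error budget is consumed in one hour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An availability SLO for one user journey (names and targets are hypothetical)."""
    journey: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    @property
    def error_budget(self) -> float:
        # Fraction of requests that may fail without breaching the SLO.
        return 1.0 - self.target


def burn_rate(slo: Slo, good: int, total: int) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_rate = (total - good) / total
    return error_rate / slo.error_budget


def should_page(slo: Slo, long_window: tuple[int, int],
                short_window: tuple[int, int], threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn above the threshold.

    Each window is a (good_requests, total_requests) pair. Requiring the short
    window as well stops the alert from firing after the problem has recovered.
    """
    return (burn_rate(slo, *long_window) >= threshold
            and burn_rate(slo, *short_window) >= threshold)


# Assumed example: 2% of appointment-booking requests failing against a 99.9% SLO.
booking = Slo(journey="appointment booking", target=0.999)
print(should_page(booking, long_window=(9800, 10000), short_window=(980, 1000)))
```

A static threshold such as "page when error count exceeds N" fires regardless of traffic volume or SLO impact. A burn-rate check like this one pages only when the failure rate threatens the error budget, which is one plausible mechanism behind a large drop in alert noise.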
The Outcome
70% fewer incidents and 24/7 SRE coverage for healthtech infrastructure
In the 90 days after the SRE consulting engagement, they had zero incidents that breached their error budget. The two incidents that did occur were resolved in under 5 minutes using the runbooks, compared to a 6-hour average before. The hospital contract was retained. The CTO stopped being woken up at 2am. SRE coverage is now 24/7 via a rotation of 3 engineers.
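For a sense of scale, the arithmetic behind "breaching an error budget" can be sketched as follows. The 99.9% availability target and the 30-day window are assumptions chosen for illustration (the actual SLO targets are not stated here), and the function names are hypothetical. The incident durations use the upper bounds from the figures above.

```python
def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based availability SLO allows per window."""
    return (1.0 - target) * window_days * 24 * 60


def budget_consumed(downtime_minutes: float, target: float,
                    window_days: int = 30) -> float:
    """Fraction of the window's error budget used by the given downtime."""
    return downtime_minutes / downtime_budget_minutes(target, window_days)


ASSUMED_TARGET = 0.999  # hypothetical 99.9% SLO, not taken from the engagement

# Two incidents of at most 5 minutes each, worst case both in the same window.
after = budget_consumed(2 * 5, ASSUMED_TARGET)

# One 6-hour outage, as in the Q3 incidents before the engagement.
before = budget_consumed(6 * 60, ASSUMED_TARGET)

print(f"Budget per 30 days: {downtime_budget_minutes(ASSUMED_TARGET):.1f} min")
print(f"Two short incidents: {after:.0%} of budget")
print(f"One 6-hour outage: {before:.0%} of budget")
```

Under these assumptions the monthly budget is roughly 43 minutes, so two sub-5-minute incidents stay well inside it, while a single 6-hour outage overspends it several times over.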
Ready for SRE consulting for your startup?
Book a free 15-minute SRE infrastructure audit. We'll surface your top 3 reliability risks and DevOps opportunities. No pitch, no sales rep.
