Founder Story
The incident that changed how I think about reliability forever
The origin of Coneixedor: one misconfigured admission controller, 11 services, and the philosophy shift that followed
The Challenge
A green dashboard, silent failures, and a system never designed to fail safely
The dashboard was green. Every metric looked healthy. But customer transactions were failing silently across 11 services. A single misconfigured OPA admission controller was quietly starving pods of memory — invisible to every alert, invisible to every dashboard. The engineers were not bad. The tooling was not outdated. The system had simply never been designed to fail safely. That distinction changed everything.
The Solution
Design out failure: OPA/Kyverno, SLOs, and runbooks before the incident
The incident forced a philosophy shift: stop blaming the engineer, start designing out the failure. Over the following 18 months, that principle was put into practice with OPA and Kyverno policies to catch misconfigurations before they reached production, SLOs defined before incidents forced them, and runbooks written before the incidents that would need them. Every failure mode was documented before it could become an incident.
The Outcome
18 months, zero customer-impacting outages — and a consultancy built on what that taught us
18 months without a single customer-impacting outage on a 2,500-node Kubernetes cluster serving 500+ engineers. The insight from that one incident became the founding principle of Coneixedor: startups need this thinking more than enterprises do, because they are building systems under pressure with no safety net and no time to learn the hard way.
Need SRE consulting for your startup?
Book a free 15-minute SRE infrastructure audit. We'll surface your top 3 reliability risks and DevOps opportunities. No pitch, no sales rep.
