Thursday August 13, 2026 11:25 - 11:50 KST
We started running MCP servers in production about a year ago. Within weeks we got our first 3 AM page: a tool was silently failing, agents were retrying into the void, and we had zero visibility into what was going wrong. No dashboard caught it. No runbook existed. That incident kicked off a twelve-month journey of figuring out how to actually operate these things. Along the way we cut agent-related incidents by about 70% and brought tool invocation P99 latency under 500 ms.

This talk is the playbook we wish we had on day one. We'll cover the stuff that bit us: health checks that go beyond HTTP 200 and actually validate MCP protocol readiness, how to handle deployments when connections are stateful, what to put on your observability dashboards (and what turned out to be noise), and circuit breaker patterns for when an upstream tool starts misbehaving. We'll also do a quick demo of a monitoring setup we've been running.
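
As a rough illustration of the circuit breaker idea mentioned above, here is a minimal sketch in Python. It is not the speakers' implementation: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the `CircuitOpenError` exception are all hypothetical names chosen for this example. The sketch assumes a tool call is any Python callable that raises on failure.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast instead of hitting the upstream tool.
                raise CircuitOpenError("upstream tool is cooling down")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each upstream tool in its own breaker, for example `breaker.call(invoke_tool, request)` where `invoke_tool` is whatever function performs the call, means a misbehaving tool gets rejected quickly instead of being retried indefinitely.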
Speakers
Deep Poharkar

Site Reliability Engineer, Obmondo
I’m Deep, a Site Reliability Engineer at Obmondo working on production reliability and incident response. I’ve contributed to open source through GSoC and CNCF’s LFX Mentorship, including work on LitmusChaos, and have spoken at Open Source Summit Japan 2024.
Grand Ballroom 1 + 2
