Podcast Episode: Debugging Production Failures

Pip: Welcome to Azure Advice — where the authentication is broken, the memory is leaking, and the connection pool is set to self-destruct every thirty seconds. Christoph Corder has been busy.

Mara: He has. Today we're covering authentication failures at the protocol level, memory leaks driven by a cascade of zombie processes, a health check mystery that turned out to be one config value, and a practical framework for working with AI without being burned by it.

Pip: Let's start with the authentication problem that left no evidence it was even happening.

When Kerberos Goes Silent

Mara: The setup here is a specific kind of invisible failure: the application is running, the network is up, the credentials are correct — and authentication still isn't working right.

Pip: The post puts it plainly: "The customer reported a 383ms latency on authentication. There were no errors of any kind in the logs."

Mara: So nothing broke loudly. Kerberos was failing silently because the Service Principal Name registered in Active Directory didn't match the external hostname the client used to request a ticket.

Pip: The client asked for a ticket for the external hostname. Active Directory had the internal one. So Kerberos quietly gave up and the client fell back to NTLM — which succeeded, logged nothing, and added 383 milliseconds nobody could explain.

Mara: The fix was a single setspn command. The diagnosis required a network trace, because application logs only record that authentication succeeded. The protocol-level negotiation — what was requested, what the KDC returned, what the client tried next — happens below where most logging operates.
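To make the mismatch concrete, here is a minimal Python sketch of the decision the client effectively makes. It is a toy model, not how Windows implements the negotiation: the hostnames, the HTTP service class, and the function names are all hypothetical, and real clients may also canonicalize names through DNS before building the SPN.

```python
from urllib.parse import urlsplit


def requested_spn(url: str, service_class: str = "HTTP") -> str:
    """Build the SPN a client would ask for: <service class>/<hostname in the URL>."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no hostname in {url!r}")
    return f"{service_class}/{host}".lower()


def negotiate(url: str, registered_spns: set[str]) -> str:
    """Toy rule: Kerberos only if the requested SPN is registered, else NTLM fallback."""
    registered = {spn.lower() for spn in registered_spns}  # assumes case-insensitive matching
    return "kerberos" if requested_spn(url) in registered else "ntlm"


# Hypothetical names: the directory only knows the internal host.
registered = {"HTTP/app01.corp.internal"}
before = negotiate("https://app.example.com/login", registered)

# Registering the external name is what the setspn fix amounts to in this model.
registered.add("HTTP/app.example.com")
after = negotiate("https://app.example.com/login", registered)

print(before, after)
```

In this model the first request falls back to NTLM and the second gets Kerberos. Neither outcome would show up as an error in an application log, which is why the trace was needed.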

Pip: Which brings us to a different kind of invisible problem: memory that doesn't come back.

41 Zombie Domains and a 10.7GB Dump

Mara: The memory leak post opens with a .NET application on App Service that was recycling continuously and had produced a memory dump just under eleven gigabytes.

Pip: Eleven gigabytes. That's not a leak, that's a confession.

Mara: Working through it with WinDbg and SOS revealed 41 zombie AppDomains — the result of an FCN storm, where file system change notifications triggered restarts faster than the application could shut down cleanly, and each restart generated more file system events.

Pip: And three SDKs — StackExchange.Redis, Harness, and NewRelic — were all holding resources that should have released on shutdown but never got the signal because the shutdown sequence never completed.

Mara: The post's conclusion is precise: "The FCN storm was the ignition. The zombie domains were the evidence. The SDK leaks were the fuel that kept it burning." None of the three SDKs was individually at fault — they all behaved correctly under normal conditions. The storm created conditions none of them were designed for.
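The "restarts faster than shutdown" condition can be sketched as a toy timing model in Python. Everything here is an assumption for illustration: the restart timestamps, the five-second shutdown time, and the rule that a domain counts as a zombie when the next restart arrives before its shutdown would have finished. It is not taken from the dump analysis.

```python
def count_zombies(restart_times: list[float], shutdown_seconds: float) -> int:
    """Count shutdowns that were interrupted by the next restart.

    restart_times: sorted timestamps (seconds) at which a restart was triggered.
    shutdown_seconds: how long a clean shutdown needs (hypothetical constant).
    """
    zombies = 0
    for started, next_restart in zip(restart_times, restart_times[1:]):
        # The shutdown that began at `started` is cut short if the next
        # restart lands before it could complete.
        if next_restart - started < shutdown_seconds:
            zombies += 1
    return zombies


storm = [0, 1, 2, 3, 4, 5]       # a change notification every second
calm = [0, 10, 20, 30, 40, 50]   # the same number of restarts, spread out

print(count_zombies(storm, shutdown_seconds=5))
print(count_zombies(calm, shutdown_seconds=5))
```

With these made-up numbers the storm leaves five interrupted shutdowns and the calm sequence leaves none, even though both contain the same number of restarts. The spacing, not the count, is what does the damage.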

Pip: Next up is a health check failure with a similarly mundane ignition source.

The 30-Second Setting That Stopped Deployments

Mara: A Java application on App Service couldn't complete routine node replacement. The health check kept oscillating — instance up, check passes, check fails, instance pulled, repeat.

Pip: The post names the culprit directly: "The customer had set it to 30 seconds" — referring to HikariCP's max-lifetime, the property controlling how long a database connection lives in the pool before being retired.

Mara: What this means in practice: the pool was churning its entire connection inventory roughly twice a minute. When the health check probe arrived during a rotation window, the app couldn't complete a database round-trip. The check read that as unhealthy. Raising max-lifetime to several minutes gave the platform a stable application to evaluate, and deployment completed cleanly.
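The "twice a minute" figure is simple arithmetic, sketched here in Python. The pool size of 10 and the 300-second comparison value are assumptions chosen for illustration, since the episode only says "several minutes", and the model ignores any jitter or staggering the pool applies when retiring connections.

```python
def rotations_per_minute(max_lifetime_seconds: float) -> float:
    """How many times per minute the whole pool turns over at a given max lifetime."""
    if max_lifetime_seconds <= 0:
        raise ValueError("max lifetime must be positive")
    return 60.0 / max_lifetime_seconds


def retirements_per_minute(pool_size: int, max_lifetime_seconds: float) -> float:
    """Average number of connections retired (and reopened) per minute."""
    return pool_size * rotations_per_minute(max_lifetime_seconds)


POOL_SIZE = 10  # hypothetical pool size, not stated in the episode

for lifetime in (30, 300):  # the customer's value vs. an assumed five minutes
    print(
        f"max lifetime {lifetime:>3}s: "
        f"{rotations_per_minute(lifetime):.1f} rotations/min, "
        f"{retirements_per_minute(POOL_SIZE, lifetime):.0f} connections retired/min"
    )
```

Under these assumptions a 30-second lifetime retires about twenty connections a minute, against about two at five minutes, which leaves far fewer moments when a probe can arrive mid-rotation.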

Pip: One number, changed. That's the whole fix. Now — from diagnosing machines to interrogating the tools we use to diagnose them.

Assume It's Wrong, Then Make It Prove Otherwise

Mara: The AI post opens with a direct admission: early on, AI confidently gave wrong answers, fabricated references, and produced outputs that looked right until they were used. The response wasn't to stop using AI — it was to change the posture entirely.

Pip: The posture being: treat every output as a hypothesis, not a conclusion. The piece puts it this way: "I now operate from the assumption that AI is wrong." Which sounds like a rejection of the tool, but is actually the opposite.

Mara: Right — the argument is that deliberate friction produces better results than acceptance. Asking a model to justify its reasoning, show its work, and defend its conclusions forces it to surface uncertainty it left out of the first answer. That nuance, the post argues, is frequently where the real answer lives.

Pip: And in high-stakes diagnostic work — packet captures, memory dumps, root cause analyses a customer will act on — directionally right isn't good enough. The challenged answer has to be defensible.

Mara: The practical framing is that AI is excellent at acceleration, not at replacing professional judgment. Use it to reach the starting line faster, then apply expertise to close the last mile.

Pip: So the engineers getting the most out of it aren't the ones who trust it most. They're the ones who've built what the post calls a productive skepticism.

Mara: Assume it's wrong. Make it prove otherwise. That's the whole method.

Pip: Silent Kerberos failures, zombie AppDomains, a connection pool set to churn — and an argument that the right way to use AI is to argue with it.

Mara: There's a thread running through all of it: the failure that doesn't announce itself is the one worth understanding. Next time, more from Azure Advice.
