A few weeks ago I wrote about a mindset shift that changed how I use AI: I assume it’s wrong until it proves otherwise. That post got a lot of traction, and the most common response was some version of “okay, but how do you actually do that?”
Fair question. The mindset is one thing. The mechanics are another.
Here’s exactly what challenging AI looks like in practice: not as abstract advice, but as a working method I use daily doing performance engineering and root cause analysis on Azure App Service.
The Problem With Accepting the First Answer
AI models are optimized to be helpful. That sounds like a feature, and it is. Until it isn’t. A model that’s optimized to be helpful will give you an answer even when the honest answer is “I’m not sure.” It will fill gaps in its reasoning with plausible-sounding logic. It will present a confident conclusion when what it actually has is an educated guess.
This isn’t the model lying to you. It’s the model doing exactly what it was trained to do. The problem is that confident and correct are not the same thing, and the model’s output looks identical in both cases.
Your job is to tell the difference.
The Four Challenges I Use
Challenge 1: “What’s the strongest argument against what you just told me?”
This is the most powerful single prompt I’ve found. It forces the model to steelman the opposition to its own conclusion. A model that can’t articulate a strong counter-argument probably hasn’t reasoned through the problem fully. A model that produces a genuinely strong counter-argument is telling you something important: the answer is more contested than the first response suggested.
In RCA work, I use this constantly. AI gives me a hypothesis about why a connection is failing or why a process is crashing. Before I write anything up, I ask it to argue the other side. Sometimes it holds. Sometimes the counter-argument is better than the original hypothesis. Both outcomes are useful.
Challenge 2: “What assumptions is this answer based on?”
AI answers are built on assumptions the model doesn’t always declare. When I ask this question, I’m not looking for the model to admit it’s wrong. I’m looking for the load-bearing assumptions that, if incorrect, would invalidate the conclusion.
In a recent case involving a Kerberos SPN mismatch, the initial AI analysis assumed the authentication failure was inbound. The actual failure was outbound: the App Service worker connecting to a backend resource. The assumption wasn’t stated. I had to surface it by asking. Once I did, the entire analysis needed reframing.
Challenge 3: “Is this consistent with [specific evidence]?”
Generic AI analysis works from general principles. Your problem is specific. The most effective challenge is to hand the model a concrete data point (a log entry, an error code, a timing measurement) and ask whether the conclusion is consistent with it.
When a customer reported 383ms latency on authentication with no errors in the logs anywhere, I gave that specific detail to the AI and asked it to reconcile the conclusion with the symptom. A clean log with measurable latency points to a silent fallback, not a failure. The model had to revise its framing, and the revised framing was correct.
Challenge 4: “What would you need to see to change your answer?”
This is the falsifiability test. A good hypothesis is one that can be proven wrong. It makes predictions about what evidence should or shouldn’t exist. If the AI can’t tell you what evidence would falsify its conclusion, the conclusion isn’t a hypothesis. It’s speculation dressed up as analysis.
I ask this before I commit to any diagnostic path. If the answer is “I would change my conclusion if X were true,” I go look for X. Either I find it and the hypothesis is dead, or I don’t find it and the hypothesis gets stronger. Either way I’ve done actual engineering instead of just following a suggestion.
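The four prompts can be kept as reusable templates. The following is a minimal sketch, not a prescribed tool: the dictionary keys are illustrative labels, and `ask` is a hypothetical callable standing in for whatever sends a follow-up message in your current AI session and returns the reply.

```python
from typing import Callable, Dict

# Prompt wording comes from the four challenges; the keys are
# illustrative labels, not an established taxonomy.
CHALLENGES: Dict[str, str] = {
    "counter_argument": "What's the strongest argument against what you just told me?",
    "assumptions": "What assumptions is this answer based on?",
    "evidence": "Is this consistent with {evidence}?",
    "falsifiability": "What would you need to see to change your answer?",
}


def build_challenges(evidence: str) -> Dict[str, str]:
    """Fill in the evidence placeholder and return all four prompts in order."""
    return {
        name: text.replace("{evidence}", evidence)
        for name, text in CHALLENGES.items()
    }


def run_challenges(ask: Callable[[str], str], evidence: str) -> Dict[str, str]:
    """Send each challenge through `ask` and collect the replies by label.

    `ask` is a hypothetical function: it should post the prompt as a
    follow-up in the same conversation and return the model's reply.
    """
    return {name: ask(prompt) for name, prompt in build_challenges(evidence).items()}
```

The evidence string should be as concrete as the log entry, error code, or timing measurement you actually have. The replies still need a human read; the sketch only makes sure none of the four questions gets skipped.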
The Adversarial Challenge
The four challenges above work on the AI’s reasoning. The adversarial challenge works on the AI itself.
Here’s how it works: take the conclusion the AI just gave you, then start a fresh session and argue against it. Give the new instance the same evidence and explicitly ask it to build the strongest possible case that the first conclusion is wrong. No priming, no framing, no “my AI just told me X, is that right?” Just the raw evidence and the instruction to attack the hypothesis.
You are not looking for agreement. You are looking for what breaks.
This matters because a model in a running conversation develops a kind of momentum. It’s not confirmation bias in the human sense, but it’s something functionally similar: the model’s responses are influenced by the context it’s been given, including its own prior outputs. It will tend to build on what it said before rather than challenge it. Starting fresh removes that inertia entirely.
In practice I do this on any finding that’s going into a formal RCA or a customer-facing document. The original session builds the hypothesis. The adversarial session tries to destroy it. If the hypothesis survives both, I’m confident enough to put my name on it. If the adversarial session surfaces something the original missed, I have a better finding and I haven’t shipped a wrong answer to a customer.
What you’re building is a two-session review process. The first session is your analyst. The second is your skeptic. Neither one is authoritative on its own. The output that survives both is the one worth trusting.
One thing to watch: a well-prompted adversarial challenge will sometimes produce a genuinely better hypothesis than the original. When that happens, run the original session against the new hypothesis. Keep going until one position can’t counter the other. That’s your answer.
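The back-and-forth between the two sessions can be sketched as a small loop. This is an illustration under stated assumptions: `skeptic` and `analyst` are hypothetical callables that run a prompt in the fresh session and the original session respectively, and each returns a competing hypothesis you judged stronger, or `None` if the attack failed. The `max_rounds` cap is my own addition so the loop always terminates.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: takes a prompt and returns a stronger competing
# hypothesis, or None when the attack could not counter the current one.
Attacker = Callable[[str], Optional[str]]


def build_adversarial_prompt(evidence: Sequence[str], hypothesis: str) -> str:
    """Raw evidence plus the instruction to attack. No mention of where
    the hypothesis came from, so the new session is not primed."""
    lines = ["Evidence:"]
    lines += [f"- {item}" for item in evidence]
    lines += [
        "",
        f"Hypothesis: {hypothesis}",
        "",
        "Build the strongest possible case that this hypothesis is wrong.",
    ]
    return "\n".join(lines)


def adversarial_review(
    hypothesis: str,
    evidence: Sequence[str],
    skeptic: Attacker,
    analyst: Attacker,
    max_rounds: int = 6,
) -> Optional[str]:
    """Alternate attacks until one position cannot counter the other.

    The skeptic attacks first. If it produces a stronger hypothesis, the
    analyst attacks that one, and so on. Returns the surviving hypothesis,
    or None if nothing settled within max_rounds.
    """
    current = hypothesis
    attackers = (skeptic, analyst)
    for round_number in range(max_rounds):
        attacker = attackers[round_number % 2]
        stronger = attacker(build_adversarial_prompt(evidence, current))
        if stronger is None:
            return current  # the attack failed, so this position stands
        current = stronger
    return None  # unresolved: treat as "needs more evidence"
```

Deciding whether a counter-hypothesis is actually stronger remains an engineering judgment. In this sketch that judgment lives inside the two callables, which return `None` when you conclude the attack did not land.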
What This Looks Like End to End
Here’s a compressed version of a real workflow.
AI gives me an initial hypothesis: the application is crashing due to memory pressure from a dependency leak.
I challenge it: “What’s the strongest argument against this?” AI responds: high allocation rate could produce the same symptom without a leak, and there’s no evidence of retained object growth in the initial analysis.
I surface the assumption: “What are you assuming about the memory dump?” AI responds: it’s assuming the dump was taken at peak memory, not during a crash. If the dump was taken post-crash, the heap state doesn’t reflect the failure condition.
I apply specific evidence: “The dump was 10.7GB, taken at peak before the OOM kill. Is your conclusion consistent with that?” AI revises: at 10.7GB that’s consistent with retention, not just allocation rate. The leak hypothesis holds.
I test falsifiability: “What would change your answer?” AI responds: if the surviving objects were short-lived allocations rather than long-lived retained references, it would favor an allocation rate problem. Checking the gen2 heap distribution would distinguish the two.
I check the gen2 heap. Long-lived retained references confirmed. Leak hypothesis stands. We have a finding we can defend.
Four challenges. One defensible conclusion. The AI was directionally right in its first answer, but “directionally right” isn’t what you put in a root cause analysis.
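One way to keep that trail honest is to record each challenge and refuse to call the finding defensible until the falsifying check has actually been run. The sketch below is an assumption about how such a record could look; the class, field names, and status strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative labels for the four challenges.
REQUIRED = ("counter_argument", "assumptions", "evidence", "falsifiability")


@dataclass
class Finding:
    hypothesis: str
    answers: Dict[str, str] = field(default_factory=dict)
    falsifier_checked: bool = False
    falsifier_found: bool = False

    def record(self, kind: str, answer: str) -> None:
        """Store the AI's reply to one of the four challenges."""
        if kind not in REQUIRED:
            raise ValueError(f"unknown challenge: {kind}")
        self.answers[kind] = answer

    def check_falsifier(self, found: bool) -> None:
        """Record the result of looking for the falsifying evidence yourself."""
        self.falsifier_checked = True
        self.falsifier_found = found

    def status(self) -> str:
        missing = [kind for kind in REQUIRED if kind not in self.answers]
        if missing:
            return "incomplete: " + ", ".join(missing)
        if not self.falsifier_checked:
            return "untested"
        return "rejected" if self.falsifier_found else "defensible"


# The workflow above, replayed through the record.
finding = Finding("memory pressure from a dependency leak")
finding.record("counter_argument", "high allocation rate could produce the same symptom")
finding.record("assumptions", "dump was taken at peak memory, not post-crash")
finding.record("evidence", "10.7GB at peak is consistent with retention")
finding.record("falsifiability", "short-lived allocations in gen2 would favor allocation rate")
finding.check_falsifier(found=False)  # gen2 showed long-lived retained references
print(finding.status())
```

The design choice worth noting is that answering all four challenges only gets the finding to "untested". It takes an actual look at the evidence, not another AI reply, to move it to "defensible" or "rejected".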
The Shift in How You Think About AI
What I’ve described isn’t distrust. It’s the same discipline you’d apply to any analysis that matters. You don’t accept the first draft of a thesis. You don’t ship code without reviewing it. You don’t sign off on a root cause without testing whether the evidence actually supports it.
AI is a collaborator with extraordinary range and a specific failure mode: it can’t tell you when it doesn’t know. Your job isn’t to catch it being wrong. It’s to create the conditions where, if it is wrong, you’ll find out before it matters.
Assume it’s wrong. Make it prove otherwise. The four challenges and the adversarial session are how you do that.
One More Thing
The engineers I’ve watched get the most out of AI aren’t the ones who use it the most. They’re the ones who engage with it most deliberately. There’s a difference between running queries and running a process. The process is what produces reliable output.
If the first article was the mindset, this is the method. Use both.
Christopher Corder is a Senior Azure Technical Advisor at Microsoft specializing in Azure App Service performance engineering, diagnostics, and root cause analysis. He writes about AI, cloud engineering, and the practical realities of working at the intersection of both.