The Wireshark Cheat Sheet Every Azure App Service Engineer Needs (But Nobody Tells You About)
Why network traces feel like reading hieroglyphics—and how to actually understand them
After years of debugging critical Azure App Service escalations at 2 AM, I’ve learned something important: the gap between “my application is slow” and “here’s exactly why” is almost always hidden in a network trace.
But here’s the problem: Wireshark is intimidating. When you open a PCAP file and see 50,000 packets scrolling by, it feels like you need a PhD in networking just to find the problem. You don’t.
What you need is a translator—something that bridges the gap between application code (what we write) and network reality (what actually happens when your code runs in Azure).
So I built this cheat sheet. Not for network engineers, but for App Service engineers who need to answer one question fast: “Is this a network problem or an application problem?”
1. The Basics: Isolating Traffic
Filter: ip.addr == x.x.x.x
The Analogy: Tuning a radio to a specific station. Without this, you’re hearing static from every station at once.
Why This Matters:
App Service traces are noisy. They capture internal Azure health probes, Kudu site traffic, and every other piece of infrastructure chatter. You need to isolate your backend dependency—your SQL Server, your Redis cache, your external API—to see the truth.
What to Look For:
Use the IP address of your backend resource. If you see thousands of packets for 127.0.0.1 or irrelevant internal IPs, you’re still in the noise.
Good Data: A clean list of packets relevant to your investigation.
Bad Data: Noise that distracts you from the root cause.
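If the capture is huge, you can also pre-filter it before you ever open the GUI. Here’s a minimal Python sketch using tshark (Wireshark’s command-line twin), assuming tshark is on your PATH; the capture file name and backend IP below are placeholders for your own.

```python
import subprocess

# Placeholder capture file and backend IP -- replace with your own.
PCAP = "appservice_trace.pcap"
BACKEND_IP = "10.0.1.25"

# tshark -r reads a saved capture; -Y applies a standard Wireshark display filter.
result = subprocess.run(
    ["tshark", "-r", PCAP, "-Y", f"ip.addr == {BACKEND_IP}"],
    capture_output=True, text=True, check=True,
)

packets = [line for line in result.stdout.splitlines() if line.strip()]
print(f"{len(packets)} packets to/from {BACKEND_IP}")
```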
2. Connection Health: TCP Retransmissions
Filter: tcp.analysis.retransmission
The Analogy: You’re talking to someone on a cell phone with bad reception. You say “Hello?” They don’t hear you, so you have to shout “HELLO?” again.
Why This Matters:
This is the smoking gun for “Network Issues” vs. “Application Issues.” If there are no retransmissions, the network moved the data perfectly. The problem is elsewhere.
What to Look For:
Clumps of red or black highlighted lines in Wireshark. Look at the Time column—each retransmission adds delay, and the retry interval roughly doubles with every attempt (often starting at around 1-3 seconds).
Good Data: < 1% of total traffic (occasional retransmissions are normal on the internet).
Bad Data: > 3-5% of total traffic, or consecutive retransmissions for the same packet (indicating a complete blackout).
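To put a number on it, compare the retransmission count against total TCP traffic. A rough sketch, again assuming tshark is on your PATH and using a placeholder capture file:

```python
import subprocess

PCAP = "appservice_trace.pcap"  # placeholder capture file

def count(display_filter: str) -> int:
    """Count packets in the capture matching a Wireshark display filter."""
    out = subprocess.run(
        ["tshark", "-r", PCAP, "-Y", display_filter],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())

total = count("tcp")
retrans = count("tcp.analysis.retransmission")
rate = 100.0 * retrans / max(total, 1)

# Thresholds from above: under ~1% is normal internet noise, over ~3-5% is a problem.
print(f"{retrans} of {total} TCP packets are retransmissions ({rate:.2f}%)")
```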
3. The “Overwhelmed” Signal: ZeroWindow
Filter: tcp.analysis.zero_window
The Analogy: A waiter trying to bring more food to a table, but the table is completely full of dirty plates. The customer puts up a hand saying “Stop! I can’t take any more right now.”
Why This Matters:
This proves the receiver (your application) is the bottleneck, not the network. If your App Service sends a ZeroWindow, it means your code is processing data too slowly (usually high CPU or memory pressure).
What to Look For:
Packets coming from the device that’s being slow.
Good Data: 0 packets. Ideally, your application should never be so overwhelmed it stops network traffic.
Bad Data: Any occurrence usually indicates a performance problem on the receiving end.
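If you want to see what this looks like on the wire, a deliberately slow receiver will reproduce it. The sketch below is a toy, not App Service code: it shrinks the kernel receive buffer and reads slowly. Point any fast sender at it, capture the conversation, and tcp.analysis.zero_window lights up. The port number is arbitrary.

```python
import socket
import time

# A toy "overwhelmed" receiver: a tiny kernel receive buffer plus slow reads.
# The buffer fills up, TCP advertises a shrinking window, and eventually the
# sender sees ZeroWindow.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # small receive buffer
server.bind(("0.0.0.0", 9000))
server.listen(1)

conn, _addr = server.accept()
while True:
    data = conn.recv(256)   # read tiny chunks...
    if not data:
        break
    time.sleep(0.5)         # ...and pretend to be busy (high CPU, GC pause, etc.)
conn.close()
```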
4. The “Silent Failure”: SNAT Port Exhaustion
Filter: tcp.flags.syn == 1 && tcp.analysis.retransmission
The Analogy: Trying to make a phone call, but all the outgoing lines in the office are busy. You pick up the phone, hear silence, and hang up. You never actually connected to the other person.
Why This Matters:
In Azure App Service, this is the #1 symptom of SNAT Port Exhaustion. Your app is trying to open a new connection, but Azure has no ports left to give you.
What to Look For:
SYN packets leaving your App Service IP that never get a SYN-ACK response.
Good Data: 0 packets. Connection setup should be instant.
Bad Data: Any pattern of SYN retransmissions, especially during high load.
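The usual fix is on the client side: reuse connections instead of opening a new one per request. A minimal sketch with Python’s requests library (the backend URL is a placeholder); the same principle applies to shared HttpClient instances, SQL connection pooling, and any other outbound dependency.

```python
import requests

BACKEND = "https://backend.example.com/api/data"  # placeholder dependency URL

# Anti-pattern: every call opens (and tears down) a brand-new connection,
# burning one SNAT port per request.
def fetch_naive():
    return requests.get(BACKEND, timeout=5)

# Better: a single Session keeps a pool of connections alive and reuses them,
# so a burst of requests doesn't exhaust the outbound port allocation.
session = requests.Session()

def fetch_pooled():
    return session.get(BACKEND, timeout=5)
```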
5. Application Latency: Time to First Byte
Filter: http.time > 2
(Tip: enable “Calculate conversation timestamps” in Wireshark’s TCP protocol preferences; you’ll need it for the tcp.time_delta filters later in this list.)
The Analogy: You order a pizza (request). The driver drives perfectly fast (network), but the kitchen takes an hour to bake it (application). The delay is in the kitchen, not the road.
Why This Matters:
This isolates server processing time. If the network ping is 20ms but http.time is 5 seconds, the problem is 100% inside your application code (e.g., a slow SQL query or an infinite loop).
What to Look For:
The gap between the request (GET /api/data) and the first response packet.
Good Data: Typically < 0.5 seconds (depends on your app requirements).
Bad Data: > 2 seconds (or whatever your specific SLA is).
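You can corroborate the trace from the client side too. In Python’s requests library, response.elapsed measures the time from sending the request until the response headers arrive, which is a reasonable proxy for time to first byte. The endpoint below is a placeholder.

```python
import requests

URL = "https://myapp.azurewebsites.net/api/data"  # placeholder endpoint

# stream=True makes requests return as soon as the response headers arrive,
# so response.elapsed approximates time-to-first-byte rather than full download time.
with requests.get(URL, stream=True, timeout=30) as resp:
    ttfb = resp.elapsed.total_seconds()
    print(f"HTTP {resp.status_code}, time to first byte: {ttfb:.3f}s")
    if ttfb > 2:
        print("The delay is in the kitchen (server processing), not the road (network).")
```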
6. The “Ghost” Disconnect: Idle Resets
Filter: tcp.flags.reset == 1 && tcp.time_delta > 230
The Analogy: You’re on a phone call but stop talking for 5 minutes. The phone company automatically hangs up to save lines. You try to speak again, but the line is dead.
Why This Matters:
A common bug in Azure apps. The application tries to reuse an old connection that the Azure Load Balancer has already silently killed (timeout is 240 seconds / 4 minutes).
What to Look For:
A RST (Reset) packet that happens exactly ~4 minutes after the last packet in that conversation.
Good Data: Keep-Alive packets sent every 60 seconds to keep the connection open.
Bad Data: A Reset packet appearing immediately after your app tries to send data on an old connection.
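To confirm the ~240-second signature across a whole capture, you can pull the idle time of every suspicious RST out with tshark’s field output. A sketch, assuming tshark is on your PATH and using a placeholder capture file (tcp.time_delta needs the “Calculate conversation timestamps” preference mentioned earlier):

```python
import subprocess

PCAP = "worker_trace.pcap"  # placeholder capture file

# -T fields / -e print only the named fields for each matching packet.
out = subprocess.run(
    ["tshark", "-r", PCAP,
     "-Y", "tcp.flags.reset == 1 && tcp.time_delta > 230",
     "-T", "fields", "-e", "ip.src", "-e", "ip.dst", "-e", "tcp.time_delta"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    src, dst, delta = line.split("\t")
    print(f"RST from {src} to {dst} after {float(delta):.1f}s of idle time")
```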
7. Service Health: HTTP Errors
Filter: http.response.code >= 500
The Analogy: The mailman delivered the letter perfectly, but the letter inside says “I am sick and cannot work today.”
Why This Matters:
This distinguishes between “I can’t reach the server” (network issue) and “The server crashed” (application issue).
What to Look For:
503 errors (overload), 500 errors (code crash), 502 errors (bad gateway/upstream issue).
Good Data: 0 packets.
Bad Data: Any cluster of 5xx errors indicates a service outage, regardless of network health.
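A quick way to scan the service-health picture across a whole capture is to tally every HTTP response code it contains. A sketch, assuming tshark is on your PATH and using a placeholder capture file (this only works for plain HTTP or decrypted TLS traffic):

```python
import subprocess
from collections import Counter

PCAP = "appservice_trace.pcap"  # placeholder capture file

# Pull just the status code of every HTTP response in the capture and tally them.
out = subprocess.run(
    ["tshark", "-r", PCAP, "-Y", "http.response",
     "-T", "fields", "-e", "http.response.code"],
    capture_output=True, text=True, check=True,
).stdout

codes = Counter(line.strip() for line in out.splitlines() if line.strip())
for code, hits in codes.most_common():
    print(f"HTTP {code}: {hits} responses")
```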
8. The “Secret Handshake” Failure: TLS Alerts
Filter: tls.alert_message.desc or ssl.alert_message.desc
The Analogy: You try to enter a secure building with an ID card. The guard looks at it, shakes his head because it’s expired or the wrong color, and slams the door immediately.
Why This Matters:
Common when connecting to legacy APIs or strict third parties. If your App Service uses TLS 1.2 but the destination only speaks TLS 1.0, the connection dies here—before any HTTP data is exchanged.
What to Look For:
“Handshake Failure”, “Unknown CA”, or “Decrypt Error”.
Good Data: 0 packets.
Bad Data: Any alert messages indicate the two servers are speaking different encryption languages and cannot communicate.
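You can reproduce the failure outside the app with a quick handshake test. A minimal sketch using Python’s ssl module; the gateway hostname is hypothetical, and pinning the minimum version to TLS 1.2 mirrors App Service’s default minimum TLS setting.

```python
import socket
import ssl

HOST = "legacy-gateway.example.com"  # placeholder third-party endpoint
PORT = 443

# Pin the same floor App Service uses (TLS 1.2) so this test fails the same
# way the app does if the remote side is stuck on TLS 1.0/1.1.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

try:
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            print(f"Handshake OK, negotiated {tls.version()}")
except ssl.SSLError as exc:
    # Mirrors the fatal alert you'd see in the trace, e.g. protocol version.
    print(f"Handshake failed: {exc}")
```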
9. The “Phone Book” Delay: DNS Latency
Filter: dns.time > 0.5
The Analogy: You want to call a pizza place, but you forgot the number. You have to spend 5 seconds looking it up in the phone book before you can even dial. The call itself is fast, but the process felt slow.
Why This Matters:
In Azure, if you use custom VNET DNS servers, they can be slow. Your application logs will say “Connection Failed,” but the trace shows the TCP connection was never even attempted—because the DNS lookup timed out first.
What to Look For:
Long gaps between the “Standard query” and the “Standard query response”.
Good Data: < 0.1 seconds (100ms).
Bad Data: > 1.0 seconds. This is a massive “invisible” delay for users.
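A quick way to measure this from the app’s own environment (for example, from the Kudu console) is to time a lookup with the standard library. The hostname below is a placeholder for your real dependency.

```python
import socket
import time

HOSTNAME = "backend.example.com"  # placeholder dependency hostname

start = time.perf_counter()
try:
    addrs = socket.getaddrinfo(HOSTNAME, 443)
    elapsed = time.perf_counter() - start
    print(f"Resolved {HOSTNAME} in {elapsed * 1000:.0f} ms -> {addrs[0][4][0]}")
    if elapsed > 1.0:
        print("DNS is adding a large, invisible delay before any connection starts.")
except socket.gaierror as exc:
    print(f"Lookup failed after {time.perf_counter() - start:.1f}s: {exc}")
```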
10. The “Confused Customer”: HTTP 4xx Errors
Filter: http.response.code >= 400 && http.response.code < 500
The Analogy: A customer walks into a burger shop and orders a tire change. The shop is working perfectly fine, but they have to spend time telling the customer “We don’t do that here.”
Why This Matters:
These errors generate CPU load on your App Service but aren’t “crashes.” Engineers often mistake a spike in 404s for a server outage.
What to Look For:
High volumes of 401 (auth issues) or 404 (bot scanning/bad links).
Good Data: Occasional 404s are normal web noise.
Bad Data: A flood of 401s indicates a broken connection string or expired secret key.
11. The “Big Package” Problem: Fragmentation
Filter: ip.flags.mf == 1
The Analogy: Trying to mail a surfboard. It doesn’t fit in the standard mailbox (MTU), so you have to saw it in half and ship it in two boxes. The receiver has to glue it back together.
Why This Matters:
Critical for VPN/ExpressRoute scenarios. If packets are fragmenting, it increases CPU load and the risk of packet loss. If the “Don’t Fragment” flag is set, these packets will just be dropped entirely (black hole).
What to Look For:
“More Fragments” flag set to 1.
Good Data: 0 packets. Modern networks try to avoid fragmentation (Path MTU Discovery).
Bad Data: Consistent fragmentation suggests an MTU mismatch between Azure and your on-premise firewall.
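One way to confirm an MTU mismatch is to send full-size probes with the Don’t Fragment bit set and see where they start failing. A rough sketch for a Linux box (the flags are specific to Linux’s iputils ping, and the target IP is a placeholder for your on-premise endpoint):

```python
import subprocess

TARGET = "10.10.0.4"  # placeholder on-premise endpoint across the VPN/ExpressRoute

# Linux iputils ping: -M do sets the Don't Fragment bit, -s sets the ICMP payload
# size, -c limits the probe count. 1472 bytes of payload + 28 bytes of IP/ICMP
# headers = 1500, the standard Ethernet MTU. If 1472 fails but a smaller size
# works, something on the path has a smaller MTU than your servers assume.
for size in (1472, 1400, 1300):
    result = subprocess.run(
        ["ping", "-M", "do", "-s", str(size), "-c", "3", TARGET],
        capture_output=True, text=True,
    )
    status = "OK" if result.returncode == 0 else "FAILED (fragmentation needed)"
    print(f"{size}-byte payload with DF set: {status}")
```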
12. The “Pulse Check”: TCP Keep-Alives
Filter: tcp.analysis.keep_alive
The Analogy: You’re on a long, silent conference call. Every minute, you whisper “I’m still here” just so the line doesn’t disconnect.
Why This Matters:
This is the solution to the “Idle Reset” problem (Issue #6 above). If you see these happening every ~60 seconds, your configuration is solid.
What to Look For:
Packets with 0-1 byte length sent periodically on an idle stream.
Good Data: Consistent Keep-Alives (e.g., every 45-60 seconds).
Bad Data: Absence of Keep-Alives on long-running connections (like SQL or Service Bus listeners) usually leads to disconnects after 4 minutes.
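At the raw socket level, keep-alives are opt-in per connection. A minimal sketch for a Linux-based app (TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific socket options, and the SQL endpoint is a placeholder); in practice, most client libraries, such as the SQL and Service Bus SDKs, expose their own keep-alive settings, which is usually the better place to configure this.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable keep-alives: first probe after 50s of idle time, then every 30s,
# giving up after 5 missed probes -- comfortably inside the 240-second
# Azure idle timeout.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 50)   # Linux-only option
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # Linux-only option
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # Linux-only option

sock.connect(("backend.example.com", 1433))  # placeholder SQL endpoint
```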
Real-World Case Studies
How these filters saved the day in actual Azure support scenarios.
Case Study 1: The “Intermittent” Database Connection
The Scenario: A customer claims their App Service connecting to Azure SQL fails “randomly” about 5% of the time during peak hours. They blame network packet loss.
The Trap: The engineer initially looks for general packet loss (tcp.analysis.retransmission) but sees very little (<0.1%), so they tell the customer the network is fine. The customer is angry because the issue persists.
The Solution:
- Command Used: tcp.flags.syn == 1 && tcp.analysis.retransmission
- The “Aha!” Moment: The filter reveals hundreds of SYN packets (connection attempts) being retransmitted only during peak load. The trace shows the App Service trying to open a socket but getting no response.
- Diagnosis: SNAT Port Exhaustion. The app was opening a new connection for every single SQL query instead of using a connection pool, running out of the 128 available outbound ports.
- Resolution: The customer implemented Connection Pooling in their connection string, and the errors vanished.
Case Study 2: “Azure is Slow” (Blaming the Network)
The Scenario: A customer’s e-commerce site takes 8 seconds to load a product page. They insist the network routing between the App Service and their on-premise backend API is lagging.
The Trap: The engineer runs a Ping test and sees 20ms latency, but the customer argues that “Ping is different from data traffic.”
The Solution:
- Command Used: http.time > 5 (Time to First Byte)
- The “Aha!” Moment:
- The engineer finds the specific HTTP GET request.
- They see the TCP Handshake took 22ms (Network is fast).
- They see the HTTP Request packet go out.
- They see the HTTP Response packet come back 7.8 seconds later.
- Diagnosis: Application Processing Delay. The network delivered the request instantly. The backend API spent 7.8 seconds calculating the response (likely a slow database query inside the API) before sending a single byte back.
- Resolution: The customer optimized their SQL query on the backend. The load time dropped to 200ms.
Case Study 3: The 4-Minute Job Failure
The Scenario: A background worker process runs a report every night. It connects to a database, calculates for a while, and then tries to write the result. It fails exactly 4 minutes into the process every time with a “Connection Reset” error.
The Trap: The engineer thinks the database is crashing or the firewall is blocking the connection.
The Solution:
- Command Used: tcp.flags.reset == 1 && tcp.time_delta > 230
- The “Aha!” Moment: The filter isolates a single RST packet. The “Time Delta” (time since the last packet in this stream) is exactly 240.01 seconds.
- Diagnosis: Azure Load Balancer Idle Timeout. The application opened a connection, did 4 minutes of math (CPU work) without sending any data over the wire, and then tried to use the connection. The Azure Load Balancer had already silently deleted the idle route at the 240-second mark.
- Resolution: The customer adjusted their TCP Keep-Alive settings to send a dummy packet every 60 seconds. This kept the Load Balancer route active.
Case Study 4: The Legacy Upgrade Fail
The Scenario: A customer migrates their App Service from an old Windows 2012 server to a new Azure App Service (Windows 2019). Suddenly, it can no longer talk to an external 3rd party payment gateway.
The Trap: The engineer checks the Network Security Group (NSG) and Firewall logs, seeing “Allow,” so they assume the traffic is getting through.
The Solution:
- Command Used: tls.alert_message.desc
- The “Aha!” Moment: The trace shows the TCP handshake completes successfully (Traffic is allowed!). However, immediately after the “Client Hello,” the remote server sends a “Protocol Version” fatal alert.
- Diagnosis: TLS Mismatch. The new Azure App Service forces TLS 1.2 for security. The old 3rd party gateway only supported TLS 1.0. They could connect (TCP), but they couldn’t agree on a secure language (TLS).
- Resolution: The customer had to update their code to explicitly allow older TLS versions (temporary fix) while asking the vendor to upgrade their gateway.
Final Thoughts
Network traces don’t have to be scary. They’re just conversations between computers, and once you know what to listen for, they tell you exactly what went wrong.
The next time you’re staring at a customer escalation at midnight and someone says “it’s probably a network issue,” you’ll have the tools to prove (or disprove) that in under 5 minutes.
What’s your go-to Wireshark filter? Drop it in the comments—I’d love to see what other engineers are using to debug Azure issues.
Christopher Corder is a Senior Technical Advisor for Azure App Services at Microsoft, specializing in complex performance troubleshooting and customer escalations.