Avoiding Health Check Failures in Java Applications

Combination lock dial placed on broken glass shards
The 30-Second Misconfiguration That Held a Deployment Hostage

There is a special category of production bug that hides in plain sight. It doesn’t crash anything. It doesn’t throw an exception you can grep for. It just quietly prevents your infrastructure from doing what it’s supposed to do, and it waits patiently while you stare at metrics that almost make sense.

This is one of those stories.


The Symptom

The customer was running a Java application on Azure App Service. They needed to perform routine node replacement, a standard infrastructure operation. The problem: it wouldn’t complete. The health check was oscillating. Instance comes up, health check passes, health check fails, instance gets pulled, repeat. The deployment was stuck in a loop it couldn’t escape.

On the surface, this looks like a deployment issue. Maybe a startup probe misconfiguration. Maybe a cold start problem. Maybe the app just isn’t healthy. You start pulling on those threads and they all feel plausible.

None of them were the problem.


The Actual Problem

The application was using HikariCP, the most widely used Java database connection pool. HikariCP has a configuration property called max-lifetime — the maximum amount of time a connection is allowed to live in the pool before it gets retired and replaced.

The customer had set it to 30 seconds.

That number looks innocuous until you understand what it means operationally. HikariCP’s documentation recommends a max-lifetime value several minutes below whatever your database or network infrastructure enforces as a connection timeout, typically in the range of 10 to 30 minutes. The purpose is to proactively retire connections before they’re killed externally, so the pool never hands your application a dead connection.

At 30 seconds, you’ve essentially told HikariCP to churn its entire connection pool roughly twice a minute. Every 30 seconds, connections are expiring and being replaced. The pool is in a constant state of recycling.


Why the Health Check Failed

During node replacement on Azure App Service, the platform performs a health check against the application to verify it’s ready to receive traffic before completing the swap. This check hits a defined endpoint and expects a healthy response within a specific window.

Here’s the collision: the health check probe arrives during a moment when HikariCP is mid-cycle, retiring and reestablishing connections. The application, in that brief window, cannot successfully complete a database round-trip. The health check interprets this as unhealthy. The node gets pulled.

The next node comes up. Same config, same 30-second max-lifetime, same churn cycle. The health check rolls the dice again. Sometimes it passes. Sometimes it doesn’t. The oscillation isn’t random, it’s rhythmic. It’s synchronized to the connection pool rotation.

30 seconds is a short enough window that the health check was hitting the churn cycle with uncomfortable regularity.


The Fix

Increase max-lifetime to an appropriate value, in this case, several minutes. Once the connection pool stopped churning on a 30-second cycle, the health check had a stable application to evaluate, node replacement completed cleanly, and the deployment moved forward.

One configuration property. Four characters changed to bring the value in line with what the documentation recommends. Case closed.


What This Case Actually Teaches

The temptation in a case like this is to go looking for the dramatic explanation. Health check oscillation sounds serious. It feels like infrastructure. It pulls you toward platform logs, deployment configurations, network topology.

The real culprit was a Java connection pool setting that a developer likely copied from a blog post or a Stack Overflow answer without fully understanding what it controlled.

This is not a criticism. It’s the reality of how configuration spreads across engineering organizations. Properties get set, they work well enough in development or under low load, they get promoted to production, and they sit quietly until the exact wrong moment, which in this case was a routine maintenance operation.

A few things I take from cases like this:

Configuration debt is real debt. An application can run for months on a misconfigured pool setting without ever surfacing the problem. That doesn’t mean the setting is correct, it means the conditions that expose it haven’t arrived yet.

Health check failures are symptoms, not diagnoses. The oscillation was real. The health check was doing exactly what it was supposed to do. But treating the symptom, adjusting probe timing, extending grace periods, would have masked the problem without solving it. The oscillation was telling the truth about the application state; the application state was telling the truth about the pool configuration.

Read the documentation for the defaults you override. HikariCP’s recommended max-lifetime range exists for a reason. When you set a value that far outside the recommended bounds, the burden of understanding why is on you. That understanding doesn’t always happen at configuration time — sometimes it happens during a production incident at the worst possible moment.

The connection pool was just doing its job. It was told to retire connections every 30 seconds, and it did. The problem was never HikariCP.

It was the number someone typed into a config file.


Christopher Corder is a Senior Azure Technical Advisor at Microsoft specializing in Azure App Service performance engineering, diagnostics, and root cause analysis. He writes about the cases that look complicated until they aren’t — and the ones that look simple until they are.

Leave a comment