How Enabling a Health Check on our Azure App Service Caused >30s Response Times
TL;DR: If you use containers in Azure App Service and want to use the Health Check feature, be careful not to redirect HTTP to HTTPS inside your container.
The other day, we released what appeared to be a fairly low-risk update to an ASP.NET Core web application to an Azure App Service. All seemed well, but after a couple of minutes, average response times increased from the usual handful of milliseconds to >30s. It seemed fairly clear that something must be wrong with the new code, so I duly redeployed the old version of the application, and response times returned to their normal, healthy level.
We then picked out some features in the release and, although none of them should have caused a problem, allowed ourselves to believe that they might have. So we created a new build containing only a few features that surely couldn’t have such an impact. We deployed this, and response times went through the roof once more. What could be causing this?
We compared the configuration of the staging slot, which we were swapping with production, against the production slot to see if any differences could be causing the problem. Nothing. Except that Health Checks were enabled on the staging slot, and not on the production slot.
In 2020, Microsoft added a new Health Check feature to Azure App Services. In the Azure Portal, we are told:
Health check increases your application’s availability by removing unhealthy instances from the load balancer. If your instance remains unhealthy, it will be restarted.
Great! If some unknown bug causes my application to become unresponsive, Azure can restart the poorly instance, while users continue to be directed to healthy instances, leaving my application available, and us to diagnose and fix the problem when we can get around to it.
The health check feature should simply poll a health check endpoint, which we specify, on the application each minute. An instance is considered “unhealthy” if the endpoint returns a non-2xx response. The application already has a health check endpoint which we monitor, so this should be really easy.
So, to confirm that this was the issue, without deploying any new code, we enabled the health check feature on the production slot. Sure enough, within a few minutes, average response times shot up to over 30 seconds. Leaving this for some time, we could see that this was cyclical - response times would be “fine” for a couple of minutes, before shooting up again, recovering to a normal level, before going bad once more.
This looked as though the App Service was being considered unhealthy and restarted, but the health check monitoring did not report anything unhealthy, nor was the App Service instance restarted.
Diving into Docker
The application is running in a Docker container, so we looked at the container logs. Here we could see, with a cadence equal to the response time spikes we were seeing:
Container for XXXXX site XXXXX is unhealthy, recycling site.
But whenever we manually navigated to the health check endpoint, all was well, with a 200 response code. What was causing Azure to determine that the application was unhealthy?
Looking into the Application Insights logs, we could see that our “manual” hits of the health check endpoint were responding 200, exactly as we would expect. However, there were also requests to the health check endpoint, coming from localhost, with a 307 response code.
The obvious difference between these requests was that the external ones were via HTTPS, whereas the “localhost” requests were via HTTP. This seemed odd, since the documentation said:
If the site is HTTPS-Only enabled, the Health check request will be sent via HTTPS.
We did have the HTTPS-Only setting enabled. However, because the health check request was coming from inside the App Service, Azure was not making it via HTTPS, presumably because an HTTPS request would be relayed to the application over HTTP by the reverse proxy anyway. This wouldn’t seem to matter, since all requests, at the point they are received by the container, are over HTTP, regardless of whether the original request to the App Service was HTTP or HTTPS.
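One way to see what the container sees is to send the same kind of request yourself. The following is a minimal C# console sketch, not something from our codebase: it assumes the application is reachable at http://localhost:8080 and that the health check endpoint is /health (both are hypothetical, so substitute your own values). It sends one plain HTTP request with no forwarded headers, as the health check appears to, and one with an X-Forwarded-Proto header, as the reverse proxy would add for an external HTTPS request. Automatic redirects are switched off so that a 307 is reported rather than followed.

```csharp
using System;
using System.Net.Http;

// Hypothetical address and path: replace with your container's port and health endpoint.
const string healthUrl = "http://localhost:8080/health";

// Disable automatic redirects so a 3xx status is reported instead of being followed.
using var handler = new HttpClientHandler { AllowAutoRedirect = false };
using var client = new HttpClient(handler);

// 1. Plain HTTP with no forwarded headers, like the requests we saw coming from localhost.
using var bare = await client.GetAsync(healthUrl);
Console.WriteLine($"Without X-Forwarded-Proto: {(int)bare.StatusCode}");

// 2. Plain HTTP carrying the header a reverse proxy adds for an original HTTPS request.
using var request = new HttpRequestMessage(HttpMethod.Get, healthUrl);
request.Headers.Add("X-Forwarded-Proto", "https");
using var forwarded = await client.SendAsync(request);
Console.WriteLine($"With X-Forwarded-Proto:    {(int)forwarded.StatusCode}");
```

Whether the second request is treated as HTTPS depends on how the application's forwarded headers options are configured (in particular, which proxies it trusts), so treat the output as a diagnostic hint rather than proof.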
This caused the application to return an HTTP 307, since we are using the UseHttpsRedirection middleware. This works just fine for requests coming externally, since it is used in the request pipeline after UseForwardedHeaders. This means that the UseHttpsRedirection middleware can see that the original request was over HTTPS, and knows that there is no need to redirect. Not so for the health check requests coming from Azure. Although every request is received by the container over HTTP, what mattered was that, for these requests, the middleware had no record of an original request over HTTPS.
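To make the ordering concrete, here is a minimal sketch of a pipeline arranged in this way. It is an illustration rather than our actual startup code: it assumes the .NET 6+ minimal hosting model, a hypothetical /health path, and forwarded headers and HTTPS port settings chosen purely for the example.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.HttpOverrides;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks();

// Assumption: honour X-Forwarded-For / X-Forwarded-Proto from the platform's reverse proxy.
builder.Services.Configure<ForwardedHeadersOptions>(options =>
{
    options.ForwardedHeaders =
        ForwardedHeaders.XForwardedFor | ForwardedHeaders.XForwardedProto;
    // Assumption: the proxy is not on loopback, so the default trusted lists are cleared.
    options.KnownNetworks.Clear();
    options.KnownProxies.Clear();
});

// Assumption: the redirect middleware needs to know the HTTPS port before it will redirect.
builder.Services.AddHttpsRedirection(options => options.HttpsPort = 443);

var app = builder.Build();

// Runs first: rewrites Request.Scheme to "https" when X-Forwarded-Proto says so.
app.UseForwardedHeaders();

// Runs second: any request whose scheme is still "http" gets a 307 redirect by default.
app.UseHttpsRedirection();

// Hypothetical path. A probe arriving over plain HTTP with no forwarded
// headers is redirected above and never reaches this endpoint.
app.MapHealthChecks("/health");

app.Run();
```

An external HTTPS request arrives with X-Forwarded-Proto set to https, so it passes straight through the redirect middleware; a request with no such header does not.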
The health check returning 307 caused Azure to determine that the application was unhealthy, and so it recycled the container. During this time, requests were not responded to until the new container was up and running, hence the increased average response times during this period.
In order to get the application to work well with the health check feature in Azure App Services, we will remove the UseHttpsRedirection middleware, since HTTPS redirection is enforced at App Service level anyway.
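As a rough sketch of what that change might look like, assuming the .NET 6+ minimal hosting model and a hypothetical /health path, the pipeline simply no longer calls UseHttpsRedirection. The commented-out lines show an alternative that is not what we chose: keeping the redirect for everything except the health check path.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks();

var app = builder.Build();

// Still useful so the app sees the original scheme and client address
// (the forwarded headers options themselves are configured elsewhere).
app.UseForwardedHeaders();

// UseHttpsRedirection() is intentionally absent: the HTTPS-Only setting on the
// App Service already redirects external HTTP traffic before it reaches the container.

// Alternative (an assumption, not the approach taken here): redirect everything
// except the health check path.
// app.UseWhen(
//     context => !context.Request.Path.StartsWithSegments("/health"),
//     branch => branch.UseHttpsRedirection());

// Hypothetical path: plain HTTP probes now reach the endpoint directly.
app.MapHealthChecks("/health");

app.Run();
```

Removing the middleware relies on the platform to enforce HTTPS, so it only makes sense while the HTTPS-Only setting stays enabled.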
It took a lot of time to find that this site-killing issue, which cropped up when deploying new code, was caused by a feature meant to stop the site from going down and staying down!
When you are not using containers with App Service, all of the above would have worked just fine; when it’s the App Service itself that does the health check, the HTTPS setting is respected.
However, when you use containers, it looks like the Health Check setting is “passed down” to the container management layer. At that layer, there is only HTTP traffic, so the container management layer will always test the containers over HTTP. This makes sense, but it is very easy to get caught out by, and it took a lot of effort to get to the bottom of it.