Elevated HTTP 502 Errors in the EU-West Region
Incident Report for Websolr
Postmortem

At approximately 06:00 UTC on May 12th, users in the EU-West region began receiving HTTP 502 responses to all requests. The disruption lasted until roughly 14:20 UTC.

In preparing this postmortem, we identified two distinct problems: the disruption itself and the time it took to resolve it. This postmortem considers both.

Background

Websolr maintains a global fleet of servers, each of which is monitored by several different systems. These systems identify problems and alert our operations team when one occurs.

Part of the fleet is dedicated solely to acting as a proxy layer, capable of routing requests to specific Solr cores, enforcing throttling, managing authentication, and more. The proxy layer is a distributed service, partitioned by geographic region.

The decentralized nature of the proxy layer makes it resistant to node loss, as state is shared among all nodes in the layer. Thus, if a node fails for some reason, the load balancers will simply route traffic to the healthy nodes until the unhealthy node is fixed or replaced. This happens seamlessly so that users never even notice.
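
To illustrate the failover behavior, the sketch below shows the kind of health-check endpoint a load balancer might poll to decide whether a proxy node should keep receiving traffic. The path, port, and checks are assumptions for illustration, not our actual implementation.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HealthHandler(BaseHTTPRequestHandler):
        """Answers the load balancer's periodic health probes (hypothetical)."""

        def do_GET(self):
            if self.path == "/health":
                # A real node would also verify that the proxy process is running
                # and that its shared routing state is reachable before replying OK.
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"OK")
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        # Port and path are assumptions for this sketch.
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()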

Proximate Cause of the Disruption

Nodes are periodically replaced automatically for a variety of reasons. Last week, one of the proxy layer’s nodes in EU-West-1 was replaced. The replacement came online and began its bootstrapping process, which installs and configures all of the software needed to run an instance of the proxy service. However, the bootstrap failed due to a recent change in pip.

This failure prevented the bootstrap process from installing anything, including the monitoring services. As a result, no alert was sent to the team, and the failure went unnoticed by users because a healthy node was still online serving requests.

At 06:00 UTC on May 12th, the remaining healthy node was automatically replaced, and its replacement failed to bootstrap as well. This meant that the EU-West-1 region not only lost its functional routing layer, but also lacked the monitoring services that would otherwise have indicated a problem.
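
The specific pip change is an implementation detail of our bootstrap scripts, but the general failure mode is a common one: a bootstrap that installs whatever version of pip and its packages happens to be current at the time can break whenever upstream behavior changes. The sketch below shows one defensive pattern, pinning the installer and packages and failing loudly on any error; the package names and versions are illustrative, not our real manifest.

    #!/usr/bin/env python3
    """Hypothetical bootstrap step. This is not Websolr's actual bootstrap;
    it only illustrates pinning the installer and its packages so an
    upstream change in pip cannot silently break a node build."""
    import subprocess
    import sys

    # Illustrative package list; not Websolr's actual manifest.
    PINNED_PACKAGES = [
        "pip==21.1.1",              # pin pip itself first, so later installs are predictable
        "proxy-service==1.4.2",     # hypothetical proxy package
        "monitoring-agent==2.0.1",  # hypothetical monitoring package
    ]

    def install(packages):
        for pkg in packages:
            result = subprocess.run(
                [sys.executable, "-m", "pip", "install", pkg],
                capture_output=True,
                text=True,
            )
            if result.returncode != 0:
                # Fail loudly rather than leaving the node half-configured.
                raise RuntimeError(f"bootstrap failed while installing {pkg}:\n{result.stderr}")

    if __name__ == "__main__":
        install(PINNED_PACKAGES)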

Proximate Cause of Delay in Resolution

Because no alerts fired, the issue was not detected until our support team came online several hours later and noticed a massive number of support tickets. At that point, alarms were raised and the problem was resolved within about 30 minutes.

Up to that point, there had been a number of signals indicating that something was very wrong and warranted review. “Signals” in this context means departures from the norm so large that a reasonable person would know immediately that there is a problem.

The most obvious signal was an influx of support tickets. We received more tickets in the first few hours of the incident than we normally receive in two weeks across all of our products. We also received messages via social media regarding the disruption. Had anyone been monitoring these channels, it would have been obvious that there was a problem warranting investigation. However, since our support team is US-based, the messages were arriving around 1:00 AM our time, when everyone was asleep.

Another signal was that our system for monitoring request metrics suddenly stopped receiving any data from the EU-West region. A sudden loss of traffic across an entire region is a clear sign that something is amiss.

A third signal was that the load balancers saw a massive increase in HTTP 5XX errors, with nearly 100% of requests failing. That alone should have raised alarms.
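
Each of these signals could have been turned into an automated page. The sketch below shows the kind of per-region threshold checks we have in mind; get_request_count, get_5xx_rate, and page_oncall are hypothetical stand-ins for our metrics and paging systems, and the thresholds are illustrative.

    # Illustrative thresholds; real values would be tuned per region.
    REGIONS = ["us-east-1", "eu-west-1"]
    MIN_REQUESTS_PER_5_MIN = 100   # a busy region should never drop near zero
    MAX_5XX_RATE = 0.05            # more than 5% errors is a clear anomaly

    def check_region(region, get_request_count, get_5xx_rate, page_oncall):
        """Page the on-call engineer if a region goes quiet or starts erroring."""
        requests = get_request_count(region, window_minutes=5)
        if requests < MIN_REQUESTS_PER_5_MIN:
            page_oncall(f"{region}: request volume dropped to {requests} in 5 minutes")
            return
        error_rate = get_5xx_rate(region, window_minutes=5)
        if error_rate > MAX_5XX_RATE:
            page_oncall(f"{region}: HTTP 5XX rate at {error_rate:.0%}")

    def check_all_regions(get_request_count, get_5xx_rate, page_oncall):
        for region in REGIONS:
            check_region(region, get_request_count, get_5xx_rate, page_oncall)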

Root Cause of the Disruption & Delay

The series of events described above is not the root cause of this incident. The root cause was a lack of sufficient alerting, both direct and indirect. Not only was a vital piece of infrastructure able to fail its bootstrap without raising an alarm, but the failure remained undetected until the start of business hours because we lacked sufficient indirect monitoring.

If the monitoring service weren’t dependent on bootstrapping, or if the bootstrap process could page our team when it fails, this incident would not have occurred. We would have been made aware of the problem last week and could have fixed it then, preventing the recurrence on May 12th.
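
As a sketch of the second option, the bootstrap could be wrapped so that any failure pages a human directly, independent of the monitoring it is trying to install. The webhook URL and the run_bootstrap callable below are assumptions for illustration, not our actual tooling.

    import json
    import traceback
    import urllib.request

    # Hypothetical paging endpoint; any out-of-band alerting channel works here.
    PAGER_WEBHOOK = "https://example.com/hypothetical-paging-webhook"

    def page_oncall(message: str) -> None:
        body = json.dumps({"summary": message}).encode()
        request = urllib.request.Request(
            PAGER_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(request, timeout=10)

    def bootstrap_with_paging(run_bootstrap) -> None:
        """Run the node bootstrap; page a human directly if it fails."""
        try:
            run_bootstrap()
        except Exception:
            # The node is still broken, but now someone knows immediately,
            # without depending on monitoring that the bootstrap itself installs.
            page_oncall(f"proxy node bootstrap failed:\n{traceback.format_exc()}")
            raise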

Likewise, if we had had sufficient indirect monitoring (signals and thresholds that imply a problem, like those described in the previous section), the failed bootstrap would have mattered far less, because we still would have been paged to examine the region.

Resolution

Once the technical problem was identified, we were able to fix the bootstrapping issue with a single line of code. The nodes were re-bootstrapped and the routing layer was repopulated with data. The fix has also been deployed to several other regions, with US-East-1 being the final region, scheduled for May 18th.

We have also upgraded the proxy service to be managed with our latest deploy tools, which are in heavy use with our hosted Elasticsearch product, Bonsai.

Finally, we are implementing signal detection in our existing systems to alert us when there is a high probability of a problem somewhere in our fleet.

If you have any questions or concerns, please let us know at support@websolr.com.

Posted May 14, 2021 - 18:01 UTC

Resolved
This incident has been resolved.
Posted May 12, 2021 - 20:07 UTC
Update
We continue to monitor the fix for any further issues. We will publish a postmortem once internal discussions are complete.
Posted May 12, 2021 - 14:29 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 12, 2021 - 14:19 UTC
Identified
We have identified the root cause and are working on a fix.
Posted May 12, 2021 - 13:32 UTC
Investigating
We are currently investigating this issue.
Posted May 12, 2021 - 13:26 UTC
This incident affected: Region Health: AWS EU West (Ireland).