At approximately 12:30 UTC, one of the nodes in our US-East region stopped responding to requests. We first became aware of the issue when users began filing support tickets.
A review of CloudWatch metrics indicated that the server had experienced a sudden, massive spike in CPU utilization and had effectively locked up. Prior to the event, the server was reporting low load, with ample headroom in both memory and disk space. We have opened a ticket with Amazon to understand the circumstances under which this type of event could occur.
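One lesson from that review is that the same CloudWatch data which told the story after the fact could also page us while the event is unfolding, instead of waiting on support tickets. As a rough sketch only, assuming the node is an EC2 instance reporting the standard `AWS/EC2` `CPUUtilization` metric, and with a placeholder instance ID, SNS topic, and thresholds rather than anything we have actually deployed, an alarm on sustained CPU saturation might look like this in boto3:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="node-cpu-saturation",          # illustrative name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=60,                                # one-minute samples
    EvaluationPeriods=5,                      # sustained for five minutes
    Threshold=90.0,                           # percent CPU; tune to taste
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-pager"],  # placeholder SNS topic
    TreatMissingData="breaching",             # a wedged node may stop reporting entirely
)
```

The exact thresholds and the alerting target would come out of the mitigation work described below; the point is simply that a human should not be the first detector.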
Our immediate response was to attempt to repair the node. Fortunately, a reboot returned it to normal operation. Had that not worked, we would have fallen back on our established operations playbook for recovering lost nodes.
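For readers curious what that kind of recovery attempt looks like in practice, a health check followed by a reboot can be scripted. The snippet below is a simplified illustration using boto3, assuming an EC2 node and using a placeholder instance ID; it is not our actual playbook:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, not the affected node

# Inspect the EC2 status checks for the node (IncludeAllInstances also
# returns instances that are not in the running state).
resp = ec2.describe_instance_status(
    InstanceIds=[INSTANCE_ID], IncludeAllInstances=True
)
checks = resp["InstanceStatuses"][0]

unhealthy = (
    checks["InstanceStatus"]["Status"] != "ok"
    or checks["SystemStatus"]["Status"] != "ok"
)

if unhealthy:
    # A reboot is the least invasive recovery step; if the node stays
    # unhealthy afterwards, the next step is full node replacement.
    ec2.reboot_instances(InstanceIds=[INSTANCE_ID])
```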
Once the all-clear had been given, our operations and product teams held a standup to discuss how this could have been avoided, or at least mitigated. The impact to customers was quite limited, affecting 11 slices out of many thousands. Still, we take that impact seriously and have compiled a short punch list of tools and processes for our development pipeline, which we will be deploying over the coming weeks.