At approximately 12:30 UTC, one of the nodes in our US-East region stopped responding to requests. We first became aware of the issue when users began filing support tickets.
A review of CloudWatch metrics indicated that the server had experienced a sudden, massive spike in CPU utilization and had effectively locked up. Prior to the event, the server was reporting low load, with ample headroom in both memory and disk space. We have opened a ticket with Amazon to understand the circumstances under which this type of event could occur.
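One lesson from that review is that the same CloudWatch data which told the story after the fact could also page us while the event is unfolding, instead of waiting on support tickets. As a rough sketch only, assuming the node is an EC2 instance reporting the standard `AWS/EC2` `CPUUtilization` metric, and with a placeholder instance ID, SNS topic, and thresholds rather than anything we have actually deployed, an alarm on sustained CPU saturation might look like this in boto3:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="node-cpu-saturation",          # illustrative name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=60,                                # one-minute samples
    EvaluationPeriods=5,                      # sustained for five minutes
    Threshold=90.0,                           # percent CPU; tune to taste
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-pager"],  # placeholder SNS topic
    TreatMissingData="breaching",             # a wedged node may stop reporting entirely
)
```

The exact thresholds and the alerting target would come out of the mitigation work described below; the point is simply that a human should not be the first detector.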
Our immediate response was to attempt to repair the node. Fortunately, a reboot returned it to normal operation. Had that not worked, we would have fallen back on our established operations playbook for recovering lost nodes.
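For readers curious what that kind of recovery attempt looks like in practice, a health check followed by a reboot can be scripted. The snippet below is a simplified illustration using boto3, assuming an EC2 node and using a placeholder instance ID; it is not our actual playbook:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, not the affected node

# Inspect the EC2 status checks for the node (IncludeAllInstances also
# returns instances that are not in the running state).
resp = ec2.describe_instance_status(
    InstanceIds=[INSTANCE_ID], IncludeAllInstances=True
)
checks = resp["InstanceStatuses"][0]

unhealthy = (
    checks["InstanceStatus"]["Status"] != "ok"
    or checks["SystemStatus"]["Status"] != "ok"
)

if unhealthy:
    # A reboot is the least invasive recovery step; if the node stays
    # unhealthy afterwards, the next step is full node replacement.
    ec2.reboot_instances(InstanceIds=[INSTANCE_ID])
```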
Once the all-clear had been given, our operations and product teams held a standup to discuss how this could have been avoided, or at least mitigated. The impact to customers was quite limited, affecting 11 slices out of many thousands. Still, we take that impact seriously and have compiled a short punch list of tools and processes for our development pipeline, which we will be deploying over the coming weeks.