Our first observations showed AWS CloudWatch reporting higher-than-normal 5xx errors on our main region ELB, along with a reduction in total requests and a sharp increase in surge queue length. These correlated with our own internal metrics, which showed reduced search request volumes and lower system usage (CPU, network bandwidth, and loadavg).
The combination of these pointed strongly to our ELB being a bottleneck on incoming traffic. In particular, the growing surge queue strongly indicated a high volume of requests stuck at the ELB level.
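The diagnostic pattern above can be sketched as a simple heuristic over the CloudWatch metrics we were watching. The metric names below follow the AWS/ELB namespace, but the thresholds are hypothetical placeholders for illustration, not the values we actually alerted on:

```python
# Hypothetical sketch of the bottleneck signal described above.
# Metric names follow the CloudWatch AWS/ELB namespace; thresholds
# are illustrative placeholders, not production values.

def elb_looks_saturated(metrics, baseline):
    """Return True if the metric pattern matches an ELB bottleneck:
    elevated 5xx errors, reduced request volume, and a growing surge queue."""
    elevated_5xx = metrics["HTTPCode_ELB_5XX"] > 2 * baseline["HTTPCode_ELB_5XX"]
    reduced_requests = metrics["RequestCount"] < 0.8 * baseline["RequestCount"]
    surging_queue = metrics["SurgeQueueLength"] > 100  # placeholder threshold
    return elevated_5xx and reduced_requests and surging_queue

# Example shaped like the incident pattern described above:
incident = {"HTTPCode_ELB_5XX": 500, "RequestCount": 6000, "SurgeQueueLength": 900}
normal = {"HTTPCode_ELB_5XX": 20, "RequestCount": 10000, "SurgeQueueLength": 0}
print(elb_looks_saturated(incident, normal))  # → True
```

No single metric is conclusive on its own; it was the combination of all three moving together that made the ELB the prime suspect.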
Armed with that information, we decided to adjust our ELB settings (specifically, to temporarily disable connection draining). This allowed the queue to drain and overall traffic to return to normal, which we confirmed by watching CloudWatch and our own stats return to expected baselines.
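For the curious, the settings change amounts to flipping a single load balancer attribute. A minimal sketch of the request payload, in the shape the classic ELB ModifyLoadBalancerAttributes API expects (the load balancer name here is a placeholder, not our real one):

```python
# Sketch of the attribute change that let the queue drain. The load
# balancer name is a placeholder; with boto3, this payload would be
# passed to elb_client.modify_load_balancer_attributes(**params).
params = {
    "LoadBalancerName": "main-region-elb",  # placeholder name
    "LoadBalancerAttributes": {
        "ConnectionDraining": {
            "Enabled": False,  # temporarily disable connection draining
        },
    },
}
```

Disabling draining is a blunt instrument: in-flight requests to deregistering backends can be cut off, so it is a temporary mitigation rather than a steady-state configuration.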
With connection draining enabled, this kind of queuing can be caused by ELB backends not accepting traffic at a sufficient rate. Subsequent analysis indicates that we had fewer active backends registered to the ELB than originally planned. We've since expanded this capacity and will be referencing our logs and notes from this incident in an upcoming round of connection-management work for our routing proxy layer.
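The capacity shortfall comes down to back-of-the-envelope arithmetic: each backend sustains some request rate, so the registered backend count must cover peak traffic with headroom, or requests pile up in the surge queue. All numbers below are hypothetical illustrations, not our actual capacity figures:

```python
import math

# Hypothetical capacity sizing sketch; the rates and headroom factor
# are illustrative, not our production numbers.
def backends_needed(peak_rps, per_backend_rps, headroom=1.5):
    """Minimum backend count to absorb peak traffic with headroom,
    so requests don't queue up at the ELB."""
    return math.ceil(peak_rps * headroom / per_backend_rps)

# e.g. 3000 req/s peak, 250 req/s per backend:
print(backends_needed(3000, 250))  # → 18
```

If the registered count drops below this minimum (instances deregistered, failed health checks, a deploy gone sideways), the excess load has nowhere to go but the ELB's queue.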