Recently the Websolr team has been making improvements to our routing and load balancing systems. This morning, we deployed an update to release more thorough back-end health checks. These improvements will help our systems detect and route around many kinds of failures instantaneously.
This update, while well-tested and carefully deployed, ultimately developed a regression and failed under production workloads. This failure caused our proxy servers to erroneously report failed health statuses for multiple backends simultaneously. The result was connection failures and 503 errors for a large portion of our production indices.
This problem was relatively slow to manifest. However, once detected, we immediately rolled back the changes to an earlier known-good deploy. Following the rollback we initiated a thorough root cause analysis and have since identified the problem. A fix is being built and will be rolled out later today.