The API experienced intermittent outages (totalling around 40 minutes of downtime) beginning at 17:29 ET on the 18th and continuing until 13:21 on the 19th. During this window, many (but not all) API clients would receive an HTTP 504 Gateway Timeout response instead of the expected HTTP 200 OK response.

The root cause of the 504s was a pair of bugs in our software, each of which would have been innocuous on its own but which together caused the downtime.

The first was a bug in how our API handled TCP connection timeouts to our database servers. TCP connection timeouts are routine, but if one occurred at an inopportune moment, our API software would fail to reconnect to the database server correctly. When this happened, the API would return a 500 error to our load balancer, crash, and be restarted by automated process monitoring.
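
To illustrate the class of problem, here is a minimal Python sketch of the intended behaviour, not our actual API code; the db object and its query() and reconnect() methods are hypothetical stand-ins for our real database client. The point is that a connection timeout should lead to a reconnect and retry, not a crashed worker:

    import socket

    # Minimal sketch; "db" and its query()/reconnect() methods are
    # hypothetical stand-ins for the real database client.
    def query_with_reconnect(db, sql, retries=1):
        """Run a query, reconnecting and retrying once on a TCP timeout.

        The buggy behaviour was to leave the broken connection in place,
        so the request handler returned a 500 and the process crashed.
        """
        for attempt in range(retries + 1):
            try:
                return db.query(sql)
            except socket.timeout:
                if attempt == retries:
                    raise  # fail only this request; do not crash the worker
                db.reconnect()  # re-establish the TCP connection and retry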

The second bug was in how our load balancer responded to those 500 errors: instead of taking the single failing API backend server out of rotation (as it should have), it would take all of the API backend servers out of rotation. This manifested as an HTTP 504 error, because with no backends in rotation the load balancer would wait for one to come back up and time out before any backend was online again to serve the client. Since we have several load balancers, these 504 errors would affect a substantial fraction of our clients at once, though not all of them simultaneously.
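
To make the intended load-balancer behaviour concrete, here is an illustrative Python sketch, not our load balancer's actual code or configuration; the class, method names, and threshold value are invented for the example. The key property is that a failing backend is ejected individually, never the whole pool:

    from collections import defaultdict

    FAILURE_THRESHOLD = 3  # consecutive 5xx responses before ejection (illustrative value)

    class BackendPool:
        """Tracks which backends are in rotation, ejecting them one at a time."""

        def __init__(self, backends):
            self.in_rotation = set(backends)
            self.consecutive_failures = defaultdict(int)

        def record_response(self, backend, status):
            if status >= 500:
                self.consecutive_failures[backend] += 1
                if self.consecutive_failures[backend] >= FAILURE_THRESHOLD:
                    # Correct behaviour: remove only the failing backend.
                    # The bug removed every backend, leaving nothing to serve requests.
                    self.in_rotation.discard(backend)
            else:
                self.consecutive_failures[backend] = 0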

The first bug caused backend 500 errors to occur much more frequently than they should have, and the second bug caused them to be much more severe than they should have been (resulting in downtime rather than a transient loss of maximum capacity).

We have fixed both of these bugs, so this particular failure mode should not occur again.