The API was down for around six hours this morning, from 12:20 AM ET through 6:24 AM ET.
The initial cause of the downtime was the simultaneous failure of around 15 of our virtual machines. Because we run on cloud infrastructure, we expect hosts to be transient and try to design around failure, but we had never before seen so many fail at once. It so happened that our primary database server was nearing its memory limits; the network queues that filled up as a result of the failures pushed it over a threshold, causing it to fail, too. Our monitoring system caught the error immediately but, due to a misconfiguration, failed to notify the on-call staff. (Other staff had, by chance, noticed the downtime, but assumed our normal processes were working as intended and therefore did not escalate.) As a result, the downtime continued far longer than necessary.
As we see it, there were four problems that contributed to the downtime, any one of which, if mitigated, would have either prevented it entirely or greatly reduced its severity. We will be instituting changes to prevent each from occurring in the future:
- Our primary database never should have been running as close to its limits as it was. We have resized the database so that it has ample headroom, and we will be adding checks to ensure that we are alerted and can take action well before it runs out of memory in the future.
- Failures of our primary database never should have taken down the API. After the issues we had with our database servers previously, we have been testing a new client that would allow the API to continue working even if a transient database error occurs, but had been waiting on a particular development milestone to push it into production. After this morning’s error, we have pushed it into production regardless.
- Our monitoring software should have properly contacted the on-call staff. We will be performing an audit and test of the monitoring software to ensure that it is configured correctly, and we will be conducting periodic tests of the system to ensure that the appropriate people are contacted.
- We will be revising our staff guidelines so that there is a clear escalation path for issues found after hours by staff who are not on call.
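
To make the first two items more concrete, here is a minimal sketch of the general ideas: a memory headroom check and a retry on transient database errors. It is an illustration only, not our monitoring configuration or our production client. Every name in it (`memory_alert_level`, `call_with_retry`, `TransientDatabaseError`) and every threshold and delay value is an assumption made up for the example.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical error type standing in for a recoverable database failure."""


def memory_alert_level(used_bytes, limit_bytes, warn_at=0.75, page_at=0.90):
    """Classify memory usage against assumed warning and paging thresholds."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    fraction = used_bytes / limit_bytes
    if fraction >= page_at:
        return "page"  # close to the limit: contact the on-call staff
    if fraction >= warn_at:
        return "warn"  # headroom is shrinking: flag it during working hours
    return "ok"


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying on TransientDatabaseError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Wait base_delay, then 2x, then 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the backoff can be exercised in tests without real waiting. Retrying is just one way to ride out a transient error; serving cached results or degrading individual features are alternatives, and the sketch does not say which of these a real client should use.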
Thanks for bearing with us as we work on these issues.