Our monitoring system is detecting intermittent 502 Bad Gateway errors returned from the API. These errors are infrequent, affecting 0.04% of API traffic. (However, the distribution of these errors depends on DNS: lucky clients see no problems, while unlucky clients see a much greater share of them.)
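As a rough illustration of why an overall rate of 0.04% can feel much worse for some clients, here is a minimal sketch. It assumes (hypothetically) that traffic is spread evenly across a pool of hosts, that every error comes from a small number of affected hosts, and that DNS keeps a given client on the same host. The host counts are made up for the example; only the 0.04% figure comes from our monitoring.

```python
def per_client_error_rate(overall_rate: float, total_hosts: int, bad_hosts: int) -> float:
    """Error rate seen by a client whose DNS answer points at an affected host.

    Assumptions (hypothetical, for illustration only): traffic is evenly
    spread across total_hosts, and all errors originate on bad_hosts.
    """
    if not 0 < bad_hosts <= total_hosts:
        raise ValueError("bad_hosts must be between 1 and total_hosts")
    # All of the errors are concentrated on bad_hosts / total_hosts of the traffic.
    return overall_rate * total_hosts / bad_hosts


OVERALL_RATE = 0.0004  # 0.04% of API traffic, as reported above

# Hypothetical pool: 8 hosts, 1 of them affected.
unlucky = per_client_error_rate(OVERALL_RATE, total_hosts=8, bad_hosts=1)
print(f"Client on an affected host: {unlucky:.2%} of requests fail")
print("Client on a healthy host: 0.00% of requests fail")
```

Under these assumptions, the fewer hosts that are affected, the more sharply the errors concentrate on the clients that happen to reach them.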
Update 22 Jan 2018: We believe that these issues are related to resource contention on our cloud provider’s hosts in the wake of hotfixes related to the Meltdown CPU vulnerability announced earlier this month. Last Friday (the 19th), we pulled the VMs that we had identified as most affected by this issue out of rotation, and over the weekend (the 20th and 21st) we did the same with additional problem VMs as we identified them.
Update 28 Jan 2018: Our cloud provider continues to investigate the issues with their system stability. We are continuing to respond to individual VM failures as they occur. Notably, we experienced a partial outage this morning between 3:30 AM and 11:55 AM ET when one of our load-balancing servers became inaccessible in a way that our monitoring system didn’t catch. We have resolved the immediate issue and will be making our monitoring more aggressive for such failures so that they are detected and resolved more quickly.
Update 15 Mar 2018: We are aware of degraded performance and timeouts on some API requests occurring today. We are continuing to test fixes with our cloud provider to mitigate this issue.
Update 20 Apr 2018: We are aware of occasional “502 Bad Gateway” errors being returned by our system. These are related to another suite of tests with our cloud provider. We have rolled back the test and will continue working with them to find a permanent solution.
Thanks for bearing with us!