From 12:27 ET to 13:26 ET, approximately 25% of API requests lacked Hourly data. In preparation for an upcoming major API update, our engineering staff inadvertently changed a critical setting (the location of our master weather database) without also updating the version number of our server configurations; as a result, some of our systems began looking for data in the new (empty) location.
Thanks to careful monitoring, we caught the issue immediately, which prevented it from affecting a larger proportion of hosts. Unfortunately, since we make extensive use of local caching, the 25% of hosts that received the updated configuration deleted their local caches (believing them to be out of date), which caused them to lack data until the local caches were rebuilt.
In order to prevent this kind of error from occurring in the future, we will be doing the following:
- Revising our configuration management procedures so that multiple engineers must sign off on a change before it is deployed. (This is already the case for most kinds of software changes, but this kind slipped through our policies, allowing an insufficiently vetted change to affect our production infrastructure.)
- Improving our load balancing configuration to actively check for data integrity issues. Our servers already have health checks that catch issues like this; by having our load balancers issue active health checks that look for them, problematic hosts can be taken out of rotation automatically, preventing misconfigurations from affecting end users.
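
As a rough illustration of the second item, an active health check that looks at data integrity (and not only at whether a host is up) might resemble the sketch below. It is not our production configuration: the `HealthReport` shape, the `hourly_records` field, the `probe` callback, and the host names are all hypothetical, chosen only to show the idea of dropping a host whose local cache is empty.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealthReport:
    """What a host reports about itself (hypothetical shape)."""
    reachable: bool
    hourly_records: int  # hourly records currently held in the host's local cache


def is_healthy(report: HealthReport, min_hourly_records: int = 1) -> bool:
    # A host that responds but has an empty (or nearly empty) cache is
    # treated as unhealthy, not just a host that fails to respond.
    return report.reachable and report.hourly_records >= min_hourly_records


def hosts_in_rotation(
    hosts: List[str],
    probe: Callable[[str], HealthReport],
    min_hourly_records: int = 1,
) -> List[str]:
    """Return the hosts that should keep receiving traffic."""
    in_rotation = []
    for host in hosts:
        try:
            report = probe(host)
        except Exception:
            # A probe that raises counts as a failed check for that host.
            continue
        if is_healthy(report, min_hourly_records):
            in_rotation.append(host)
    return in_rotation


if __name__ == "__main__":
    # Simulated probe results; a real probe would query each host.
    reports = {
        "host-a": HealthReport(reachable=True, hourly_records=168),
        "host-b": HealthReport(reachable=True, hourly_records=0),  # cache wiped
    }
    print(hosts_in_rotation(list(reports), lambda h: reports[h]))
```

In this sketch, a host that had deleted its cache would report zero hourly records and be left out of the returned list until the cache was rebuilt, even though it was otherwise up and answering requests.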
We apologize for the error and thank you for your continued support.