At 14:57 UTC on April 3rd some customers began receiving TLS certificate warnings and "403 Forbidden" error messages when accessing their Status Pages and our Management UI.
Within a few minutes, the Sorry™ team had been paged, and were on scene to begin diagnosis.
Our initial investigations led us to believe this was a TLS provisioning issue on our CDN. However, after a deeper dive with help from our edge provider, Fastly, it was traced to one of our DNS hosts, which was intermittently serving an incorrect and very old record for our domain.
These incorrect DNS records led to traffic no longer being routed through our CDN and WAF, but instead straight to our origin servers, which for security purposes are not designed to receive direct traffic like this.
At 16:27 UTC the offending DNS provider was identified and removed from our domain. We then began to see traffic return to normal, and this trend continued over the following hour as the name server change propagated.
At 17:57 UTC we received final confirmation from customers to say they were no longer seeing any errors.
Whilst this only impacted a small number of customers, it was a substantial impact, with the application being intermittently unavailable to them for several hours.
For this, we're incredibly sorry.
During our post-impact assessment, we found the biggest contributing factor to the incident duration was the time it took us to identify the root cause as a DNS failure.
This was partly because DNS is generally very stable and not often the "likely cause" of an incident, so it didn't garner our immediate attention. However, there are certainly lessons we can learn.
With that in mind, we have taken several steps to minimize the risk of it happening in future, and to help us respond more quickly should the worst occur.
We have added OhDear as additional monitoring on all our critical endpoints. Their suite of tools includes DNS-specific checks, looking for changes in name servers and record types.
We have also added DNSChecker to our suite of diagnostic tools.
These new tools should make spotting and identifying DNS issues much faster and help narrow down which provider the failure stems from.
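As a rough illustration of the kind of check that helps here, the sketch below compares the records each DNS provider returns against the record we expect and reports which providers disagree. It is not how OhDear or DNSChecker work internally; the function names, provider labels, and record values are all hypothetical, and it assumes the answers have already been collected by querying each provider's name servers directly.

```python
from typing import Dict, Iterable, List


def normalize(record: str) -> str:
    """Lower-case a record value and drop surrounding whitespace and any trailing dot."""
    return record.strip().lower().rstrip(".")


def find_divergent_providers(
    expected: Iterable[str],
    answers: Dict[str, Iterable[str]],
) -> List[str]:
    """Return the providers whose answer set differs from the expected record set.

    `expected` is the record set we believe is correct (an assumption supplied
    by the caller). `answers` maps a provider label to the records that
    provider returned for the same query.
    """
    want = {normalize(r) for r in expected}
    return sorted(
        name
        for name, records in answers.items()
        if {normalize(r) for r in records} != want
    )


# Hypothetical data: one provider serves the expected CDN target,
# the other serves an old address pointing straight at an origin server.
observed = {
    "provider-a": ["CDN.example.net."],
    "provider-b": ["203.0.113.10"],
}
stale = find_divergent_providers(["cdn.example.net"], observed)
print(stale)
```

Comparing against a known-good record, rather than taking a majority vote, keeps the check meaningful even when only two providers are in use.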
We have also replaced our old Primary/Secondary DNS configuration with a more resilient Primary/Primary setup using DNSimple and AWS Route53.
Running two separate and disconnected DNS providers means that DNS is much less likely to be a single point of failure.
We are also in a much better position to drop the offending provider from our traffic flow should we need to.
All the signs show this issue has been resolved, and we no longer see intermittent privacy errors. We are still working on the post-incident report, which may take some back-and-forth with the network team before we post again.
We must work out the root cause and prevent a repeat of today. The relatively small number of sessions left in the dark doesn't change how sorry we are to everyone who was affected.
We have seen an increase in successful requests, and monitoring shows improvements. We will mark this notice as recovering while continuing to monitor and coordinate with our edge provider on further identifying the root cause.
Once again, thank you for your continued patience. If you or any subscribers experience further issues, please do not hesitate to contact support.
We continue to investigate with our edge provider and have identified a potential DNS routing issue with specific DNS resolvers. This issue is resulting in a small number of requests intermittently bypassing the cache. However, a large percentage of requests are working and being routed via the edge correctly.
Once again, we thank you for your continued patience, and we will endeavour to update this notice when we have new information.
Very sorry to say that we are not quite out of the woods with this. We are successfully serving hundreds of requests a second; however, a small number of requests are routing incorrectly. We are all hands on deck together with our edge provider. More updates to follow.
We have identified a possible fix for what's causing the intermittent errors and it's being deployed now. Thanks for waiting while we work on this one.
We continue to investigate, but monitoring has shown no further intermittent errors for the past 20 minutes. Once again, we apologise for any inconvenience caused and appreciate your patience as we work to resolve this issue promptly. Stay tuned for further updates.
We have been experiencing intermittent SSL privacy errors on our status pages over the past few minutes. Our team is actively investigating the issue and has engaged our edge provider for assistance.
We apologise for any inconvenience caused and appreciate your patience as we work to resolve this issue promptly. Stay tuned for further updates.