On April 10th at 14:42 UTC, we experienced a 51-minute platform-wide outage caused by a vendor suspending our account.
Once service was restored at 15:33 UTC, we began working with our vendor to understand why the account was suspended and to put measures in place to prevent a recurrence.
I'm sorry for the disruption this caused. We aim to be transparent with our customers, and we hope this post-incident review gives you the detail you need.
During the account provisioning process in 2023, our vendor issued a replacement contract due to an administrative error related to their terms and conditions. The replacement contract triggered a credit note to cover the initial contract; however, the credit amount was incorrect and did not cover the total amount. This left our account showing as overdue, although we had an understanding with the vendor that they would rectify the error.
Despite multiple assurances that the balance would be corrected, our account ended up with the vendor's suspension team, which led to the events detailed in the timeline below.
During this review process, we helped the vendor improve the workflows between their account management and credit control teams.
Our vendor has corrected the credit note and provided written confirmation that our account "will not be suspended needlessly again".
We've always stressed the importance of not pointing fingers when it comes to outages. Today's outage was caused by an administrative error made by one of our suppliers. However, we chose them as suppliers, so the buck stops with us.
The outage lasted for around 51 minutes and prevented everyone from accessing our backend application. In addition, the status pages would no longer load. I am very sorry for the trouble it caused you.
Naturally, we have more to learn about why this happened and work to do to prevent it from happening again.
Here is a rundown of the complete timeline.
14:42 UTC: Monitoring detected that the application was failing and immediately alerted the on-call engineer.
14:45 UTC: We confirmed a large-scale outage. On attempting to restart our services, we received a message from the vendor's API stating that the application had been suspended.
14:47 UTC: The team assembled and attempted to contact Heroku support.
14:52 UTC: The team confirmed that the outage appeared to be related to a billing issue.
15:01 UTC: Contact made with our Enterprise Account Manager.
15:08 UTC: Tweet posted and customer tickets updated.
15:13 UTC: A second call with the Enterprise Account Manager, who confirmed that the account had been suspended in error and that he was working to resolve the issue.
15:20 UTC: Tweet and customer tickets updated to state that the issue was an administrative error.
15:28 UTC: Third call with the Enterprise Account Manager, who confirmed that the request had been through the highest possible escalation and that everyone knew the account was not overdue. We expected to be back online shortly.
15:31 UTC: Fourth call with the Enterprise Account Manager, who confirmed that the account would be active again shortly.
15:33 UTC: All services confirmed UP by the team.