Total Outage of Sorry™ Status Pages, App & APIs

Uptime Impact: 51 minutes
Resolved

Post Incident Review

On April 10th at 14:42 UTC, we experienced a 51-minute platform-wide outage caused by a vendor suspending our account.

After service was restored at 15:33 UTC, we worked with our vendor to understand why the account was suspended and to put measures in place to prevent a recurrence.

I'm sorry for the disruption this caused. We aim to be transparent with our customers, and we hope this post-incident review gives a clear account of what happened.

Understanding What Happened

During the account provisioning process in 2023, our vendor issued a replacement contract due to an administrative error related to their terms and conditions. The replacement contract triggered a credit note to cover the initial contract; however, the credit amount was incorrect and did not cover the total amount. As a result, our account showed as overdue, although it was understood that the vendor would rectify the error.

Despite multiple assurances that the balance would be corrected, our account ended up with the vendor's suspension team, which led to the events detailed in the timeline below.

Preventative Measures

During this review process, we helped the vendor improve workflows between their account management and credit control team.

Our vendor has corrected the credit note and provided written confirmation that our account "will not be suspended needlessly again".

Robin Geall
Resolved

We've always stressed the importance of not pointing fingers when it comes to outages. Today's outage was caused by an administrative error made by one of our suppliers. However, we chose them as suppliers, so the buck stops with us.

The outage lasted for around 51 minutes and prevented everyone from accessing our backend application. In addition, the status pages would no longer load. I am very sorry for the trouble it caused you.

Naturally, we have more to learn about why this happened and work to do to prevent it from happening again.

Here is a rundown of the complete timeline.

14:42 UTC: Monitoring detected that the application was failing and immediately alerted the on-call engineer.

14:45 UTC: We confirmed a large-scale outage. On attempting to restart our services, we received a message from the vendor's API stating that the application had been suspended.

14:47 UTC: The team assembled and attempted to contact Heroku support.

14:52 UTC: The team confirmed that the outage appeared to be related to a billing issue.

15:01 UTC: Contact made with our Enterprise Account Manager.

15:08 UTC: Tweet posted and customer tickets updated.

15:13 UTC: A second call with the Enterprise Account Manager, who confirmed that the account had been suspended in error and that he was working to resolve the issue.

15:20 UTC: Tweet and customer tickets updated again, stating that the issue was an administrative error.

15:28 UTC: Third call with the Enterprise Account Manager, who confirmed that the request had been through the highest possible escalation and that everyone knew the account was not overdue. We expected to be back online shortly.

15:31 UTC: Fourth call with the Enterprise Account Manager, who confirmed that the account would be active again shortly.

15:33 UTC: All services confirmed up by the team.

Robin Geall

Affected components
  • Status Pages
  • Management UI
  • REST API
  • Monitoring Automation
    • Inbound Mail
    • Pingdom Sync
  • Message Distribution
    • Email
      • Email by Sorry™
      • MailChimp
      • Mailgun
      • SendGrid
      • Postmark
    • Microsoft Teams
    • Slack
    • SMS
    • Twitter
    • Website Plugin
    • Intercom Messenger App