At 19:12 UTC on April 12th, our monitoring detected several failures when loading our Management UI. The on-duty team member was alerted and responding within a few minutes.
They noted that the Management UI was intermittently unavailable, while Status Pages and the API remained unaffected.
After the initial assessment, the issue was identified as timeouts when connecting to our Postgres database, and the level-2 on-call team member was paged for assistance.
By 19:32 UTC reinforcements had arrived; however, the application was once again accessible, the initial issue having resolved itself.
While monitoring to confirm the application remained stable, the team began their initial investigation and reached out to our database provider for additional help in diagnosing what had happened.
The response from our provider suggested that our Postgres instance appeared to be experiencing an abnormal load, which was consuming all available CPU resources and causing connections to hang.
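For readers curious how this kind of problem surfaces, the sketch below shows one common way to spot it: listing the longest-running active queries from Postgres's pg_stat_activity view. The connection string is a placeholder, not our actual configuration.

```python
# Minimal diagnostic sketch: list the longest-running active queries on a
# Postgres instance -- the kind of check used to spot hung connections.
import psycopg2

DSN = "postgresql://user:password@db-host:5432/appdb"  # placeholder, not our real DSN

ACTIVE_QUERIES = """
    SELECT pid,
           state,
           now() - query_start AS runtime,
           left(query, 80)     AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC NULLS LAST;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(ACTIVE_QUERIES)
        for pid, state, runtime, query in cur.fetchall():
            print(f"{pid:>7} {state:<10} {runtime} {query}")
```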
To properly understand what had caused this, more detailed monitoring of the database would be required. Adding that monitoring, and responding to any insights it offered, would be the path forward.
We have added two layers of additional monitoring to our Postgres database servers: the first tracks the CPU and query load being placed on the server, while the second helps us identify specific queries and database configuration settings that may be negatively impacting performance.
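As an illustration of that second layer, the sketch below ranks statements by total execution time using the pg_stat_statements extension (column names follow Postgres 13+, and the extension must be enabled separately); the connection details are placeholders rather than our real setup.

```python
# Sketch of query-level monitoring: rank statements by total execution time
# using pg_stat_statements (requires the extension to be preloaded and created).
import psycopg2

DSN = "postgresql://user:password@db-host:5432/appdb"  # placeholder, not our real DSN

TOP_QUERIES = """
    SELECT calls,
           round(total_exec_time::numeric, 1) AS total_ms,
           round(mean_exec_time::numeric, 1)  AS mean_ms,
           left(query, 80)                    AS query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES)
        for calls, total_ms, mean_ms, query in cur.fetchall():
            print(f"{calls:>8} {total_ms:>12} {mean_ms:>10} {query}")
```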
This new monitoring immediately surfaced several queries related to sending notifications that were placing undue load on the server and impacting performance across the entire application.
The queries identified by the monitoring have been thoroughly improved: some were removed altogether or consolidated, while others were rewritten to be more efficient.
We also improved the indexing on some of the key database tables to further optimize performance.
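To give a concrete (though entirely hypothetical) flavour of the indexing work, the sketch below adds a partial index to an illustrative notifications table so that lookups for unsent notifications avoid scanning the whole table; the table and column names are not our actual schema.

```python
# Hypothetical indexing sketch: a partial index so queries for unsent
# notifications no longer scan the whole table. Names are illustrative only.
import psycopg2

DSN = "postgresql://user:password@db-host:5432/appdb"  # placeholder, not our real DSN

# CREATE INDEX CONCURRENTLY avoids blocking writes, but cannot run inside a
# transaction, so the connection is switched to autocommit first.
CREATE_INDEX = """
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_notifications_unsent
    ON notifications (subscriber_id, created_at)
    WHERE sent_at IS NULL;
"""

conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(CREATE_INDEX)
conn.close()
```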
As a result of these changes, we are no longer seeing any spikes in load on the database servers, and we have also seen improved application response times across the board.
We've monitored the Management UI and are confident it is now stable. We are conducting a root cause analysis as part of our post-incident process. Once again, we appreciate your understanding during this incident.
Between 19:10 UTC and 19:23 UTC, our Sorry™ Management UI encountered a temporary outage, returning a 503 error due to a database connection issue. Our on-call engineer promptly addressed the issue by scaling our nodes down and back up, restoring full functionality.
We apologise for any inconvenience caused and appreciate your patience. Rest assured, we're closely monitoring the situation to ensure stability.