At 19:12 UTC on April 12th, our monitoring detected several failures when loading our Management UI. The on-duty team member was alerted and responding within a few minutes.
They noted that the Management UI was intermittently unavailable, while Status Pages and the API remained unaffected.
After the initial assessment, the issue was identified as timeouts when connecting to our Postgres database, and the level-2 on-call team member was paged for assistance.
By 19:32 UTC reinforcements had arrived; however, the application was once again accessible, the initial issue having resolved itself.
While monitoring to confirm the application remained stable, the team began their initial investigation and reached out to our database provider for additional help in diagnosing what had happened.
The response from our provider suggested that our Postgres instance appeared to be experiencing an abnormal load, which was consuming all available CPU resources and causing connections to hang.
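For readers curious how this kind of problem surfaces, the sketch below shows one common way to spot it: listing the longest-running active queries from Postgres's pg_stat_activity view. The connection string is a placeholder, not our actual configuration.

```python
# Minimal diagnostic sketch: list the longest-running active queries on a
# Postgres instance -- the kind of check used to spot hung connections.
import psycopg2

DSN = "postgresql://user:password@db-host:5432/appdb"  # placeholder, not our real DSN

ACTIVE_QUERIES = """
    SELECT pid,
           state,
           now() - query_start AS runtime,
           left(query, 80)     AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC NULLS LAST;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(ACTIVE_QUERIES)
        for pid, state, runtime, query in cur.fetchall():
            print(f"{pid:>7} {state:<10} {runtime} {query}")
```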
To properly understand what had caused this, more detailed monitoring of the database would be required. Adding that monitoring, and responding to any insights it offered, would be the path forward.
We have added two layers of additional monitoring to our Postgres database servers: the first tracks the CPU and query load being placed on the server, while the second helps us identify specific queries and database configuration settings that may be negatively impacting performance.
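As an illustration of that second layer, the sketch below ranks statements by total execution time using the pg_stat_statements extension (column names follow Postgres 13+, and the extension must be enabled separately); the connection details are placeholders rather than our real setup.

```python
# Sketch of query-level monitoring: rank statements by total execution time
# using pg_stat_statements (requires the extension to be preloaded and created).
import psycopg2

DSN = "postgresql://user:password@db-host:5432/appdb"  # placeholder, not our real DSN

TOP_QUERIES = """
    SELECT calls,
           round(total_exec_time::numeric, 1) AS total_ms,
           round(mean_exec_time::numeric, 1)  AS mean_ms,
           left(query, 80)                    AS query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES)
        for calls, total_ms, mean_ms, query in cur.fetchall():
            print(f"{calls:>8} {total_ms:>12} {mean_ms:>10} {query}")
```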
This new monitoring immediately surfaced several queries related to sending notifications that were placing undue load on the server and impacting performance across the entire application.
The queries identified by the monitoring have been thoroughly improved: some were removed altogether or consolidated, while others were rewritten to be more efficient.
We also improved the indexing on some of the key database tables to further optimize performance.
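To give a concrete (though entirely hypothetical) flavour of the indexing work, the sketch below adds a partial index to an illustrative notifications table so that lookups for unsent notifications avoid scanning the whole table; the table and column names are not our actual schema.

```python
# Hypothetical indexing sketch: a partial index so queries for unsent
# notifications no longer scan the whole table. Names are illustrative only.
import psycopg2

DSN = "postgresql://user:password@db-host:5432/appdb"  # placeholder, not our real DSN

# CREATE INDEX CONCURRENTLY avoids blocking writes, but cannot run inside a
# transaction, so the connection is switched to autocommit first.
CREATE_INDEX = """
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_notifications_unsent
    ON notifications (subscriber_id, created_at)
    WHERE sent_at IS NULL;
"""

conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(CREATE_INDEX)
conn.close()
```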
As a result of these changes, we are no longer seeing any spikes in load on the database servers, and we have also seen improved application response times across the board.
We've monitored the Management UI and are confident it is now stable. We are conducting a root cause analysis as part of our post-incident process. Once again, we appreciate your understanding during this incident.
Between 19:10 UTC and 19:23 UTC, our Sorry™ Management UI encountered a temporary outage, returning a 503 error due to a database connection issue. Our on-call engineer promptly addressed the issue by scaling our nodes down and back up, restoring full functionality.
We apologise for any inconvenience caused and appreciate your patience. Rest assured, we're closely monitoring the situation to ensure stability.