Between 4:05PM and 4:45PM UTC on January 28, 2020, we had an API outage caused by performance degradation.
The event was triggered by a new release to our Chat API servers; shortly after the release went live, load on our database infrastructure increased, causing HTTP response times to spike and, in some cases, time out.
The issue was detected by our latency and error monitoring. The team responded by rolling back to the previous version at 4:20PM UTC; unfortunately, the rollback did not fully resolve the problem.
After another rollback attempt, we realised there were still pending queries from the previous release running on our PostgreSQL database. We manually terminated all of the pending queries at 4:40PM UTC, after which the error rate dropped back to 0%.
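For readers curious about that last step: PostgreSQL exposes currently running queries through the pg_stat_activity view, and individual backends can be stopped with pg_terminate_backend(). The snippet below is a minimal sketch of that approach, not our exact tooling; the connection string, database name, and 60-second threshold are placeholders for illustration.

```python
# Sketch: find long-running queries via pg_stat_activity and terminate them.
# Connection details and the runtime threshold are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("dbname=chat_api user=ops")
conn.autocommit = True

with conn.cursor() as cur:
    # List active queries that have been running for more than 60 seconds,
    # excluding this session's own backend.
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > interval '60 seconds'
          AND pid <> pg_backend_pid();
    """)
    for pid, runtime, query in cur.fetchall():
        print(f"Terminating pid={pid}, running for {runtime}: {query[:80]}")
        # pg_terminate_backend() cancels the query and closes that backend.
        cur.execute("SELECT pg_terminate_backend(%s);", (pid,))

conn.close()
```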
The outage affected 5% of HTTP requests at its peak (4:20PM to 4:27PM UTC).