High error rates and timeouts
Incident Report for getstream.io
Postmortem

Between 4:05PM and 4:45PM UTC on January 28, 2020, we had an API outage caused by performance degradation.

The event was triggered by a new release to our Chat API servers. Shortly after the new release went live, load on our database infrastructure increased, causing HTTP response times to spike and, in some cases, requests to time out.

The event was detected by our latency and error monitoring. The team responded by rolling back to the previous version at 4:20PM UTC. Unfortunately, the rollback did not resolve the problem entirely.

After another rollback attempt we realised there were still pending queries from the previous release running on our PostgreSQL database. We manually terminated all the pending tasks at 4:40PM UTC; after that the error rate dropped to 0% again.
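
For readers unfamiliar with this kind of cleanup, the following is a minimal Python sketch of how long-running queries can be identified and terminated in PostgreSQL. It is an illustration under stated assumptions, not the exact procedure used during this incident. `pg_stat_activity` and `pg_terminate_backend` are standard PostgreSQL features; the helper `select_stale_pids`, the 5-minute threshold, and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Lists active backends other than our own session.
# pg_stat_activity is PostgreSQL's built-in activity view.
LIST_ACTIVE_SQL = """
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
"""

# pg_terminate_backend(pid) ends the backend session and its running query.
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def select_stale_pids(rows, now, max_age):
    """Return pids of active queries that started more than max_age ago."""
    stale = []
    for row in rows:
        started = row["query_start"]
        if row["state"] == "active" and started is not None and now - started > max_age:
            stale.append(row["pid"])
    return stale


# Hypothetical sample rows shaped like the result of LIST_ACTIVE_SQL.
now = datetime(2020, 1, 28, 16, 40, tzinfo=timezone.utc)
sample_rows = [
    {"pid": 101, "state": "active", "query_start": now - timedelta(minutes=25), "query": "SELECT ..."},
    {"pid": 102, "state": "active", "query_start": now - timedelta(seconds=2), "query": "SELECT ..."},
    {"pid": 103, "state": "idle", "query_start": now - timedelta(minutes=30), "query": "SELECT ..."},
]

# Assumed threshold: anything running longer than 5 minutes counts as stale.
stale_pids = select_stale_pids(sample_rows, now, timedelta(minutes=5))
print(stale_pids)  # [101]
```

With a database driver, one would run `LIST_ACTIVE_SQL`, pass the rows to `select_stale_pids`, and execute `TERMINATE_SQL` once per returned pid. PostgreSQL also provides `pg_cancel_backend`, which cancels only the current query and leaves the session open, and may be the gentler first step.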

The outage affected 5% of HTTP requests at its peak (4:20PM to 4:27PM UTC).

Posted Jan 28, 2020 - 17:19 UTC

Resolved
This incident has been resolved.
Posted Jan 28, 2020 - 16:58 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 28, 2020 - 16:57 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 28, 2020 - 16:45 UTC
Monitoring
A recent release caused a load increase on part of the chat infrastructure, resulting in degraded performance and timeout errors. Remediation is in progress.
Posted Jan 28, 2020 - 16:12 UTC
This incident affected: Chat API.