Elevated error rates on Chat API
Incident Report for getstream.io
Postmortem

We have completed the postmortem for the December 9th incident. As the founder and CEO of Stream, I'd like to apologize to all of our customers impacted by this issue. Stream powers activity feeds and chat for a billion end users, and we recognize that our customers operating in important sectors, such as healthcare, education, finance, and social apps, rely on our technology. As such, we have a responsibility to ensure that these systems are always available.

Stability and performance are the cornerstones of what makes a hosted API like Stream work. Over the last five years it has been extremely rare for us to have stability issues, and our team spends a significant amount of time and resources to maintain that track record. On December 9th, however, we made some significant mistakes, and we need to learn from them as a team and do better in the future.

The Outage

A rolling deployment was made to chat shards in the US-East and Singapore regions between 11:28 GMT and 14:38 GMT. The code contained a bug in our Raft-based replication system that caused 66% of message events not to be delivered. Messages were still stored and retrievable via the API, and the event replay endpoint also still returned messages. At 17:00 GMT the issue was identified and the code was rolled back, resolving the issue for all shards by 17:38 GMT. The end-user impact on the chat experience depends on the SDK, the offline storage integration, and the API region, but for most apps this meant a very significant disruption to chat functionality.

What Went Wrong

As with most significant downtime events, this outage was caused by a combination of problems:

  1. The issue with the broken code should have been caught during our review process.
  2. The QA process should have identified the issue. Unfortunately, tests were run on a single-node setup and did not catch the bug.
  3. The drop in message events should have been visible during the rolling deploy.
  4. Monitoring and alerting should have picked up the issue before our customers reported it.

Resolution 1 - Monitoring

The biggest and most glaring issue here is monitoring. While we have extensive monitoring and alerting in place, we did not have a check that captured message propagation. The team is introducing monitoring to track message delivery and adding the corresponding alerting rules.
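To make this concrete, below is a minimal sketch of the kind of message-propagation metric we have in mind, using Prometheus counters exposed per shard. The metric names, labels, and alert threshold are illustrative examples, not our production configuration.

    package main

    import (
    	"log"
    	"net/http"

    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/promauto"
    	"github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
    	// Incremented when a message is accepted and persisted by the API.
    	messagesStored = promauto.NewCounterVec(prometheus.CounterOpts{
    		Name: "chat_messages_stored_total",
    		Help: "Messages accepted and persisted by the Chat API.",
    	}, []string{"region", "shard"})

    	// Incremented when the matching message.new event is pushed to clients.
    	messageEventsDelivered = promauto.NewCounterVec(prometheus.CounterOpts{
    		Name: "chat_message_events_delivered_total",
    		Help: "message.new events delivered over WebSockets.",
    	}, []string{"region", "shard"})
    )

    func main() {
    	// An alerting rule fires when delivery falls behind storage, e.g. in PromQL:
    	//   rate(chat_message_events_delivered_total[5m])
    	//     / rate(chat_messages_stored_total[5m]) < 0.95
    	http.Handle("/metrics", promhttp.Handler())
    	log.Fatal(http.ListenAndServe(":9100", nil))
    }

An alert on the ratio of delivered to stored events per region would have flagged the drop in message events shortly after the first bad shard was deployed, rather than after customer reports.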

Resolution 2 - QA

The second issue is that our extensive QA test suite didn't catch the bug, since it only occurred when running Stream in a multi-node cluster environment. We are updating our QA process to run against a clustered environment, so that it more closely resembles our production systems.
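As a rough illustration, the kind of test the updated QA process adds looks like the sketch below. startCluster and its helpers are hypothetical test-harness functions, not real Stream APIs; the point is that the suite now exercises the replicated, multi-node path instead of a single node.

    // Hypothetical clustered QA test: startCluster, Node, SendMessage, and
    // WaitForEvent are illustrative harness helpers, not real Stream APIs.
    package chat_test

    import (
    	"testing"
    	"time"
    )

    func TestMessageEventsPropagateAcrossNodes(t *testing.T) {
    	// Boot a three-node cluster with Raft replication enabled, instead of
    	// the single-node setup the old suite used.
    	cluster := startCluster(t, 3)
    	defer cluster.Shutdown()

    	// Send a message through node 0.
    	msgID := cluster.Node(0).SendMessage(t, "test-channel", "hello")

    	// Every node must deliver the message.new event to its connected
    	// clients; the December 9th bug dropped roughly two-thirds of these.
    	for i := 0; i < 3; i++ {
    		if !cluster.Node(i).WaitForEvent(msgID, 5*time.Second) {
    			t.Fatalf("node %d did not deliver the event for message %s", i, msgID)
    		}
    	}
    }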

Resolution 3 - Heartbeat Monitoring

The previous two resolutions would have been enough to avoid this incident or reduce it to a very minor one. That said, the Chat API is a complex system, and we think more end-to-end testing will make issues easier to notice. For this reason we are also going to introduce canary-like testing so that we can detect failures at the client-side level as well.
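Below is a simplified sketch of what such a canary check could look like. ChatClient, Event, and the channel and event names are assumptions made for illustration; the real canary would wrap one of our client SDKs and page on-call on failure.

    // Package canary sketches the heartbeat check: send a message through a
    // real client connection and alert if the matching event does not come
    // back within a deadline. ChatClient is a hypothetical interface.
    package canary

    import (
    	"errors"
    	"time"
    )

    // ChatClient abstracts the client-side operations the canary needs.
    type ChatClient interface {
    	SendMessage(channel, text string) (messageID string, err error)
    	Events() <-chan Event // events received over the WebSocket connection
    }

    // Event is the subset of a chat event the canary cares about.
    type Event struct {
    	Type      string
    	MessageID string
    }

    // CheckDelivery sends one canary message and waits for its message.new
    // event. A non-nil error means the client-side delivery path is broken.
    func CheckDelivery(c ChatClient, timeout time.Duration) error {
    	msgID, err := c.SendMessage("canary-channel", "heartbeat")
    	if err != nil {
    		return err
    	}

    	deadline := time.After(timeout)
    	for {
    		select {
    		case ev := <-c.Events():
    			if ev.Type == "message.new" && ev.MessageID == msgID {
    				return nil // delivered end to end
    			}
    			// Ignore unrelated events and keep waiting.
    		case <-deadline:
    			return errors.New("canary message event not delivered before timeout")
    		}
    	}
    }

Running a check like this on a schedule from each region, over the same WebSocket path real apps use, would surface a delivery failure within minutes rather than hours.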

Non-Technical Factors

Stream has been growing extremely rapidly over the last year. Our team grew from 31 to 93 people in the last 12 months, and usage of the Chat API has been growing even faster than that. Keeping up with this level of growth requires constant changes to processes and operations such as monitoring and deployment. This is something we have to reflect on as a team and do better.

Conclusion

Performance and stability are among our key focus areas and something we spend a significant part of our engineering effort on. Yesterday we let our customers down. For that, Tommaso and I would like to apologize. The entire team at Stream will strive to do better in the future.

Posted Dec 10, 2020 - 20:37 UTC

Resolved
This incident has been resolved.
Posted Dec 09, 2020 - 17:43 UTC
Identified
We identified an issue with the Chat API that caused some messages not to be delivered via WebSockets. The problem is already resolved for most applications, and the remediation should be completed for all apps shortly.
Posted Dec 09, 2020 - 17:20 UTC
Investigating
We are currently investigating an increase in errors on the Chat API.
Posted Dec 09, 2020 - 16:59 UTC
This incident affected: US-East (Chat - API).