We have completed the post-mortem for the December 9th incident. As the founder and CEO of Stream, I’d like to apologize to all of our customers impacted by this issue. Stream powers activity feeds and chat for a billion end users, and we recognize that our customers operating in important sectors such as healthcare, education, finance, and social apps rely on our technology. As such, we have a responsibility to ensure that these systems are always available.
Stability and performance are the cornerstone of what makes a hosted API like Stream work. Over the last 5 years it has been extremely rare for us to have stability issues, and our team spends a significant amount of time and resources to keep up that good track record. On December 9th, however, we made some significant mistakes, and we need to learn from them as a team and do better in the future.
The Outage
A rolling deployment was made to chat shards in the US East and Singapore regions between 11:28 GMT and 14:38 GMT. The code contained an issue with our Raft-based replication system that caused 66% of message events not to be delivered. Messages were still stored and retrievable via the API, and the event replay endpoint also still returned them. At 17:00 GMT the issue was identified and the code was rolled back, resolving the issue for all shards by 17:38 GMT. The end-user impact on the chat experience depended on the SDK, the offline storage integration, and the API region, but for most apps this meant a very significant disruption to chat functionality.
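For apps that needed an immediate workaround, the practical mitigation was to fetch missed messages through the API rather than rely on real-time events. The Go sketch below illustrates that pattern; the ChatClient type and its ReplayEvents method are illustrative placeholders, not the actual Stream SDK.

```go
// Hypothetical sketch: recovering messages when real-time events are not
// delivered, by periodically replaying events since the last known point.
package main

import (
	"fmt"
	"time"
)

type Message struct {
	ID, Text string
	SentAt   time.Time
}

// ChatClient stands in for an SDK client that can replay missed events.
type ChatClient struct{ /* connection details omitted */ }

// ReplayEvents returns messages created after `since` (placeholder body;
// a real integration would call the event replay endpoint).
func (c *ChatClient) ReplayEvents(channelID string, since time.Time) ([]Message, error) {
	return nil, nil
}

func main() {
	client := &ChatClient{}
	lastSeen := time.Now().Add(-10 * time.Minute)

	// Poll the replay endpoint as a fallback while real-time delivery is degraded.
	for range time.Tick(30 * time.Second) {
		msgs, err := client.ReplayEvents("general", lastSeen)
		if err != nil {
			continue // retry on the next tick
		}
		for _, m := range msgs {
			fmt.Println(m.Text)
			if m.SentAt.After(lastSeen) {
				lastSeen = m.SentAt
			}
		}
	}
}
```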
What Went Wrong
As with any significant downtime event, this outage was caused by a combination of problems.
Resolution 1 - Monitoring
The biggest and most glaring issue here is monitoring. While we do have extensive monitoring and alerting in place, none of it captured message propagation. The team is introducing monitoring that tracks message delivery and adding the corresponding alerting rules.
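As a concrete illustration of the kind of metric we mean, here is a minimal Go sketch using the Prometheus client library: one counter for messages stored and one for message events delivered, so an alert can fire when the delivered-to-stored ratio drops. The metric names and helper functions are illustrative, not our production code.

```go
// Minimal sketch of delivery monitoring with the Prometheus Go client.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Incremented when a message is accepted and persisted by the API.
	messagesStored = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "chat_messages_stored_total",
			Help: "Messages accepted and persisted, per region.",
		},
		[]string{"region"},
	)

	// Incremented when the corresponding message event is pushed to clients.
	messageEventsDelivered = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "chat_message_events_delivered_total",
			Help: "Message events delivered to connected clients, per region.",
		},
		[]string{"region"},
	)
)

func init() {
	prometheus.MustRegister(messagesStored, messageEventsDelivered)
}

// RecordStored and RecordDelivered are called from the write and delivery
// paths; an alerting rule compares the two counters and fires when the
// delivered/stored ratio drops below an expected bound.
func RecordStored(region string)    { messagesStored.WithLabelValues(region).Inc() }
func RecordDelivered(region string) { messageEventsDelivered.WithLabelValues(region).Inc() }
```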
Resolution 2 - QA
The second issue is that our extensive QA test suite didn’t catch the problem, since it only occurred when running Stream in a multi-cluster environment. We are updating our QA process to run in a clustered environment, so that it more closely resembles our production systems.
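A hedged sketch of what such a test looks like: a message written through one region must produce a delivery event in another region within a short window. The test helpers below (connectTo, SendMessage, WaitForEvent) are hypothetical stand-ins for our internal test tooling, not part of any released Stream SDK.

```go
package qa

import (
	"testing"
	"time"
)

// testClient is a stand-in for a test client bound to one region of the
// multi-node QA cluster.
type testClient struct{ region string }

type sentMessage struct{ ID string }
type deliveredEvent struct{ MessageID string }

func connectTo(t *testing.T, region string) *testClient { return &testClient{region: region} }

func (c *testClient) SendMessage(channel, text string) sentMessage {
	// Placeholder: would write through this region's API node.
	return sentMessage{ID: "msg-1"}
}

func (c *testClient) WaitForEvent(channel, msgID string, timeout time.Duration) (deliveredEvent, bool) {
	// Placeholder: would block until the message event arrives over websocket.
	return deliveredEvent{MessageID: msgID}, true
}

func TestMessageEventReplicatesAcrossRegions(t *testing.T) {
	usEast := connectTo(t, "us-east")
	singapore := connectTo(t, "singapore")

	sent := usEast.SendMessage("qa-channel", "replication probe")

	// The event must be delivered through the other region within a short window.
	event, ok := singapore.WaitForEvent("qa-channel", sent.ID, 5*time.Second)
	if !ok {
		t.Fatalf("message %s was stored but its event was never delivered cross-cluster", sent.ID)
	}
	if event.MessageID != sent.ID {
		t.Fatalf("unexpected event: got %s, want %s", event.MessageID, sent.ID)
	}
}
```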
Resolution 3 - Heartbeat Monitoring
The previous two resolutions would have been enough to avoid this incident, or at least reduce it to a very minor one. That said, the Chat API is a complex system, and we think more end-to-end testing will make issues like this easier to notice. For this reason we are also going to introduce canary-like testing, so that we can detect failures at the client-side level as well.
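A minimal sketch of the canary we have in mind, written in Go: a probe continuously sends a message over one client connection and verifies it arrives on a second, independent connection, paging on-call when delivery stalls. The sendProbe, awaitDelivery, and pageOnCall functions are placeholders for the real integrations.

```go
package main

import (
	"log"
	"time"
)

// sendProbe posts a canary message through a real client connection (placeholder).
func sendProbe() (messageID string) { return "canary-1" }

// awaitDelivery waits for the canary to arrive on a second, independent
// client connection, returning false on timeout (placeholder).
func awaitDelivery(messageID string, timeout time.Duration) bool { return true }

// pageOnCall raises an alert for the on-call engineer (placeholder).
func pageOnCall(msg string) { log.Println("ALERT:", msg) }

func main() {
	for range time.Tick(time.Minute) {
		id := sendProbe()
		if !awaitDelivery(id, 10*time.Second) {
			// Delivery failed end to end even though the API accepted the message,
			// which is exactly the failure mode of the December 9th incident.
			pageOnCall("canary message " + id + " was not delivered within 10s")
		}
	}
}
```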
Non-Technical Factors
Stream has been growing extremely rapidly over the last year. Our team grew from 31 to 93 people in the last 12 months, and chat API usage has been growing even faster than that. Keeping up with this level of growth requires constant changes to processes and operations such as monitoring and deployment. This is something we have to reflect on as a team so that we can do better.
Conclusion
Performance and stability are key focus areas for us and something we spend a significant part of our engineering effort on. Yesterday we let our customers down. For that, Tommaso and I would like to apologize. The entire team at Stream will strive to do better in the future.