getstream.io Status - Incident History

High error rate on Chat API service in ohio

2024-05-29T01:07:07Z

May 29, 01:07 UTC
Resolved - The incident has been resolved, and traffic is now served without any issue.

May 29, 01:06 UTC
Investigating - Our monitoring system has detected an incident affecting one of our shards in Ohio. Customers located on that shard have experienced a high error rate.

High error rate on one of our Feed shards in the us-east region

2024-05-09T14:38:11Z

May 9, 14:38 UTC
Resolved - The shard has been recovered and the incident has been resolved. Our team is currently conducting an internal investigation to determine the root cause of the issue.

May 9, 13:52 UTC
Identified - The issue has been identified and a fix is being implemented.

May 9, 13:51 UTC
Investigating - We are currently experiencing a high error rate on one of our Feed shards in the us-east region. Our team is actively working to resolve the situation.

High error rate on Chat API endpoints

2024-04-29T19:42:19Z

Apr 29, 19:42 UTC
Resolved - At 21:02 CET, a high error rate was recorded on the Chat API endpoints in one of our us-east shards. Our team resolved the issue at 21:18 CET, and the service is now operating normally.

High error rate for Chat Query Channels endpoint

2024-03-28T11:30:20Z

Mar 28, 11:30 UTC
Resolved - An issue with the QueryChannel endpoint led to some queries returning an HTTP 403 response code. This incident has been resolved.

Realtime connections outage

2023-05-09T08:33:10Z

May 9, 08:33 UTC
Resolved - This incident has been resolved.

May 9, 08:08 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

May 9, 07:35 UTC
Identified - The issue has been identified and a fix is being implemented.

May 9, 07:32 UTC
Investigating - We are currently investigating an issue with our Feed Realtime service.

Elevated error rate in our edge network

2023-03-27T15:12:07Z

Mar 27, 15:12 UTC
Resolved - This incident has been resolved.

Mar 27, 14:35 UTC
Identified - The issue has been identified and our team is working on a remediation.

Elevated error rate for Feed apps in dublin region

2023-03-21T17:09:27Z

Mar 21, 17:09 UTC
Resolved - This incident has been resolved.

Mar 21, 16:56 UTC
Identified - The issue has been identified and a fix is being implemented.

AWS connectivity issues

2022-07-28T18:37:04Z

Jul 28, 18:37 UTC
Resolved - This incident has been resolved.

Jul 28, 18:36 UTC
Update - The issue has been resolved and the service in the Ohio region is operating normally.

Jul 28, 18:27 UTC
Identified - We are experiencing a partial outage due to AWS connectivity issues for selected apps in the Ohio region.

Elevated API Errors on us-east

2021-12-08T12:43:43Z

Dec 8, 12:43 UTC
Resolved - The incident has been resolved. A post-mortem will follow.

Dec 8, 11:04 UTC
Update - This morning's issue propagated to an additional component of our infrastructure, which dispatches messages to end users via the WebSocket protocol. Our team worked to mitigate the issue, and the problem seems to be resolved now. We are still monitoring the situation closely.

Dec 8, 09:02 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Dec 8, 08:55 UTC
Identified - The issue has been identified and a fix is being implemented. A temporary remediation has been put in place to mitigate the ongoing issue.

Dec 8, 07:20 UTC
Investigating - We're experiencing an elevated level of API errors and are currently looking into the issue. This issue affects only one shard in our us-east region.

Elevated API error rate in Dublin

2021-08-31T21:30:00Z

Aug 31, 21:30 UTC
Resolved - Traffic to our Dublin infrastructure experienced an elevated error rate due to an AWS outage.
The incident started at 11:20PM, the error rate decreased at 11:38PM, and the incident was resolved by 11:59PM.

We are still performing impact and root-cause analysis; a postmortem with more information will be posted here.

Increased error rate on Chat API

2021-04-14T05:30:00Z

Apr 14, 05:30 UTC
Resolved - We experienced higher than normal error rates during database maintenance on Chat API. The error increase started at 5:24AM and was resolved at 5:42AM UTC.

Chat API

2021-03-15T18:30:00Z

Mar 15, 18:30 UTC
Resolved - High error rate on Chat HTTP APIs

High error rate on Feed Realtime endpoint

2021-01-06T15:35:16Z

Jan 6, 15:35 UTC
Resolved - This incident has been resolved.

Jan 6, 10:02 UTC
Update - Realtime updates for feeds are back to normal; we are still monitoring the traffic.

The previous patch unfortunately did not resolve the problem and was causing realtime clients to retry the connection via the `Client not found, please reconnect` response.

Jan 6, 09:53 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Jan 6, 05:00 UTC
Identified - The issue has been identified and a fix is being implemented.

Feed Realtime - SQS high error rate

2021-01-04T23:25:00Z

Jan 4, 23:25 UTC
Resolved - Millions of requests to the handshake endpoint of our feed realtime system broke the API. This issue has been resolved and a full post mortem will follow.

Jan 4, 19:45 UTC
Update - We are continuing to monitor for any further issues.

Jan 4, 17:38 UTC
Monitoring - A fix has been implemented and we are monitoring the results.

Jan 4, 16:57 UTC
Investigating - We are currently investigating an issue with AWS SQS: we are receiving a 100% error rate from SQS APIs.

Our feeds realtime endpoint is currently unable to push notifications to SQS.

Elevated error rates on Chat API

2020-12-09T17:43:15Z

Dec 9, 17:43 UTC
Resolved - This incident has been resolved.

Dec 9, 17:20 UTC
Identified - We identified an issue with Chat API that caused some messages not to be delivered via WebSockets. The problem is already resolved for most applications, and the remediation should be completed for all apps shortly.

Dec 9, 16:59 UTC
Investigating - We are currently investigating an increase in errors on Chat API.

Increased API latency

2020-07-28T12:12:17Z

Jul 28, 12:12 UTC
Resolved - The AWS networking issue is now resolved. We are cleaning up our temporary remediations since they are no longer needed. Traffic has been back to normal for the last hour.

Jul 28, 10:37 UTC
Identified - Due to a networking issue in the AWS us-east region, we are experiencing increased latency for some of the traffic in our US region. We are mitigating the problem while waiting for a final remediation on AWS infrastructure.

High error rates and timeouts

2020-01-28T16:58:54Z

Jan 28, 16:58 UTC
Resolved - This incident has been resolved.

Jan 28, 16:57 UTC
Update - We are continuing to monitor for any further issues.

Jan 28, 16:45 UTC
Update - We are continuing to monitor for any further issues.

Jan 28, 16:12 UTC
Monitoring - A recent release increased load on part of the chat infrastructure, causing degraded performance and timeout errors. Remediation is in progress.

Timeout Errors

2019-11-21T22:16:44Z

Nov 21, 22:16 UTC
Resolved - This incident has been resolved.

Nov 21, 22:04 UTC
Monitoring - Increased load on some API endpoints caused traffic to spike intermittently. Adding more capacity remediated the problem.

Nov 21, 21:27 UTC
Investigating - We are experiencing spikes of timeout errors; the team is investigating the root cause and working on a remediation.

Emails from Dashboard are not sent

2019-08-30T16:36:17Z

Aug 30, 16:36 UTC
Resolved - This incident has been resolved.

Aug 30, 14:46 UTC
Identified - Emails from Dashboard (invites, password resets and other notifications) are currently not being sent correctly.
We are talking to our SMTP provider (Mailgun) to resolve this issue as soon as possible.

Dashboard redirect issue

2019-05-10T21:34:11Z

May 10, 21:34 UTC
Resolved - This issue has been resolved.

May 10, 20:40 UTC
Investigating - The dashboard has a bug that's causing it to redirect some users to the homepage. Our team is investigating. APIs are fully operational; this only impacts the dashboard.

Elevated API Errors on US-EAST

2019-02-06T14:23:35Z

Feb 6, 14:23 UTC
Resolved - We were experiencing an elevated level of API errors on our us-east region. This incident lasted from 2:11PM UTC to 2:18PM.

Elevated API Errors on region EU-WEST

2019-01-02T13:22:46Z

Jan 2, 13:22 UTC
Resolved - This incident has been resolved.

Jan 2, 13:04 UTC
Monitoring - We were experiencing an elevated level of API errors because of an unsuccessful Redis upgrade. We have resolved the issue and are monitoring for further problems.

EU-WEST API downtime

2018-11-22T13:08:04Z

Nov 22, 13:08 UTC
Resolved - Due to an operational mistake, the API had a very high error rate in the Europe West region between 12:56PM UTC and 12:58PM UTC. The problem has been mitigated and resolved.

Detailed API error rate over time:

12:56PM 78%
12:57PM 93%
12:58PM 4%

Partial API outage

2018-08-23T15:59:01Z

Aug 23, 15:59 UTC
Resolved - Between 03:59PM and 04:32PM UTC, API traffic resulted in HTTP errors or timeouts. Only a subset of Stream applications hosted in the US was affected by this problem.

Realtime Redis Failover

2018-08-08T01:09:10Z

Aug 8, 01:09 UTC
Resolved - Our distributed realtime cluster uses Redis (on ElastiCache) for state management. A failover of the ElastiCache cluster caused realtime to be unavailable for 7 minutes. This issue has been resolved and we're investigating why the failover took 7 minutes. This impacted customers using Stream's websocket, SQS or Webhook firehose systems.