API downtime

Incident Report for getstream.io

Postmortem

We've received additional information from AWS about this outage. To summarize, the RDS monitoring process and the DB instance both failed, which delayed the automated failover.

=====

Thank you for contacting AWS Premium Support.

I understand that your RDS instance was not reachable from 1:23 to 1:41 UTC on the 9th of February 2017 and that you want to know the cause.

I have investigated your RDS instance, and the following is my analysis:

--> 2017-02-09 01:25:26 UTC: The external monitoring process was unable to communicate with the monitoring service on your instance.

--> Because of the communication issues with the monitoring process on the instance, the failover was held back until the external monitoring process reached its hard limit. Before the external monitoring process could force a failover, you performed a manual reboot with failover at around 2017-02-09 01:40:42 UTC.

--> That is why CloudWatch metrics were not available during that period; they resumed uploading after the failover to the standby DB instance.

--> After making sure the new primary DB instance was up to date with the old primary DB instance, RDS issued a Replace DB instance operation.

--> The Replace DB instance workflow deleted the faulty instance (old primary) and replaced it with a new instance, which then synced up with the primary DB instance.

--> This process (Replace DB instance) completed successfully at 2017-02-09 01:57:45 UTC. The DB instance remained available for reads and writes throughout this process.

Normally a failover is triggered within a few minutes, so this delay was indeed abnormal. It rarely happens, and we apologize for any inconvenience this issue might have caused in your environment. The RDS team always works hard on improving the stability and reliability of the RDS service, but failures do sometimes occur.

Our sincerest apologies for the operational pain this caused you. Please let me know if there is anything else I can assist with.
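
As a rough aid to reading the timeline above, the timestamps quoted in the AWS analysis can be expressed as offsets from the start of the outage. This is an illustrative sketch only: the event labels and the helper names `utc` and `elapsed_since_start` are made up for this example, and the report gives the start of the outage only to the minute, so its seconds are assumed to be zero.

```python
from datetime import datetime, timezone


def utc(hms):
    """Build a UTC datetime on 2017-02-09 from an 'HH:MM:SS' string."""
    h, m, s = (int(part) for part in hms.split(":"))
    return datetime(2017, 2, 9, h, m, s, tzinfo=timezone.utc)


# Timestamps taken from the AWS analysis; labels are paraphrased.
events = [
    # Assumption: reported as 1:23 UTC, seconds unknown and set to 00.
    ("instance reported unreachable", utc("01:23:00")),
    ("external monitor loses contact", utc("01:25:26")),
    ("manual reboot with failover", utc("01:40:42")),
    ("Replace DB instance completed", utc("01:57:45")),
]


def elapsed_since_start(events):
    """Return (label, seconds since the first event) for each event."""
    start = events[0][1]
    return [(label, (when - start).total_seconds()) for label, when in events]


for label, seconds in elapsed_since_start(events):
    minutes, secs = divmod(int(seconds), 60)
    print(f"+{minutes:02d}:{secs:02d}  {label}")
```

On these figures, a little over 15 minutes passed between the external monitor losing contact (01:25:26) and the manual reboot with failover (01:40:42).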

Posted Feb 09, 2017 - 18:25 UTC

Resolved

The problem was related to a hardware failure with one of our databases. The faulty server was replaced with a hot backup.

Posted Feb 09, 2017 - 01:49 UTC

Investigating

We are investigating an outage on the API.

Posted Feb 09, 2017 - 01:41 UTC