High API error rate

Incident Report for getstream.io

Postmortem

The issue

From 19:16 to 19:24 UTC and from 19:30 to 19:46 UTC we had a high number of HTTP 502 errors when connecting to the API.

The causes

A change was made to our servers' SSH configuration that was thought not to have any effect. However, on newly provisioned servers it prevented the server process from starting.

Normally this wouldn't have caused a big problem, because the load balancer should mark such a host as unhealthy and stop sending traffic to it. Unfortunately, that did not happen because of a recent change in the health check logic, which wrongly reported the server as healthy even though the server process was down.
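
To illustrate the failure mode, here is a minimal sketch of how a load balancer might choose which hosts receive traffic. It is not our actual load balancer configuration: the host names, the `routable_hosts` function and both check functions are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

def routable_hosts(hosts: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the hosts a load balancer would send traffic to."""
    return [host for host in hosts if is_healthy(host)]

# Hypothetical state: the server process failed to start on web-2.
process_up: Dict[str, bool] = {"web-1": True, "web-2": False}

def correct_check(host: str) -> bool:
    # Healthy only if the server process is actually running.
    return process_up[host]

def faulty_check(host: str) -> bool:
    # Mimics the bug: reports healthy regardless of the process state.
    return True

hosts = ["web-1", "web-2"]
good_pool = routable_hosts(hosts, correct_check)
bad_pool = routable_hosts(hosts, faulty_check)
```

With the correct check only `web-1` stays in the pool. With the faulty check `web-2` stays in as well, so requests routed to it have no server process to answer them.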

The fixes

First, we manually removed the bad servers from the load balancer. After that, we fixed the problem with the SSH configuration and added the servers back to the load balancer. Finally, we changed the health check so that it no longer reports a host as healthy when the server process is down.
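
As a rough sketch of the idea behind the health check fix, a check can probe the server process directly instead of assuming that it is running. This is not our production health check: the function names, the TCP probe and the 200/503 status codes are assumptions chosen for illustration.

```python
import socket

def process_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True only if a TCP connection to the server process succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable: treat the process as down.
        return False

def health_status(host: str, port: int) -> int:
    """Status code a health endpoint could return to the load balancer."""
    return 200 if process_is_listening(host, port) else 503
```

A load balancer that treats any non-200 response as unhealthy would then stop routing traffic to a host whose server process never started.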

Our apologies for the outage. Our team is hard at work to further improve stability.

Posted Oct 10, 2017 - 21:11 UTC

Resolved

The issue has been resolved. More information about the outage will follow shortly.

Posted Oct 10, 2017 - 20:00 UTC

Investigating

We're currently investigating a high error rate on the APIs. A percentage of requests to the API are returning 502s. The cause has not yet been identified.

Posted Oct 10, 2017 - 19:50 UTC