From 19:16 to 19:24 UTC and from 19:30 to 19:46 UTC we had a high number of HTTP 502 errors when connecting to the API.
A change was made to our servers' SSH configuration that was thought not to have any effect. However, on newly provisioned servers it caused the server process to fail to start.
Normally this wouldn't have caused a big problem, because the load balancer should mark such a host as unhealthy and stop sending traffic to it. Unfortunately, that did not happen because of a recent change in the health check logic. The changed health check wrongly reported the server as healthy even though the server process was down.
First, we manually removed the bad servers from the load balancer. After that, we fixed the problem with the SSH configuration and added the servers back to the load balancer. Finally, we changed the health check so that it no longer reports a host as healthy when the server process is down.
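As an illustration of the behaviour described above, here is a minimal sketch of a health check that reports healthy only when the server process is actually accepting connections. It is not the real implementation: the function names, the use of a plain TCP connect as the liveness probe, and the 200/503 status codes are all assumptions made for the example.

```python
import socket


def server_process_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True only if something accepts TCP connections on host:port.

    Hypothetical probe: a real check might call an application endpoint instead.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable: treat as down.
        return False


def health_check_status(host: str, port: int) -> int:
    """HTTP status the health endpoint would return to the load balancer.

    Assumption: 200 means "keep sending traffic", 503 means "take out of rotation".
    """
    return 200 if server_process_is_up(host, port) else 503
```

The point of the sketch is that the check depends on the server process itself, so a host where the process failed to start cannot be reported as healthy.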
Our apologies for the outage. Our team is hard at work to further improve stability.