High latency spike and increased error rate
Incident Report for getstream.io
Postmortem

January 23 and 24 outage postmortem report

Stream suffered two incidents of degraded performance in the past 24 hours. We take uptime very seriously, and would like to be transparent with our operations and to our customers. The spikes occurred on Jan 23 at 3:50PM UTC and on Jan 24 at 11:45AM UTC. Both spikes were caused by a sudden increase of pressure to one of our PostgreSQL databases. Because Postgresql was slow at serving queries, HTTP requests started to pile up and eventually saturated the API workers' connection backlogs. API clients using a very low timeout will have encountered timeout exceptions. Other users of Stream would see 5xx responses on part of their API calls.

I am going to add a little bit of background so that it is easier to elaborate on what went wrong.

Some of our internal operations rely on moving data from one PostgreSQL database to another. Thanks to psql such operation is routinely performed by pipe-ing COPY TO STDOUT and COPY FROM STDOUT together. In order to not pressure the destination database with writes: we also use pv so that we are sure we never end up consuming all our IOPS capacity.

The command looks more or less like this:

psql src_db -c '\copy (...) to stdout' | pv -p --rate-limit 5242880 | psql dst_db -c '\copy (...) from stdout'

By terminating the same copy command running on the source database we were able to remove write pressure on the disk. After that the high latency problem affecting the API service was automatically resolved. After researching on other possible causes, we concluded that the pressure created by the copy command combined with increased traffic was behind this outage and picked a different time to do it again.

The same operation was then restarted during low traffic hours. To our surprise write pressure increased again after a couple of hours on the source database and caused another, albeit shorter, outage.

After more digging we realized that in both occasions the command was running in the background and could not write to stdout, forcing the source database to store the query results on disk and then causing very high I/O on disk and slow response times to regular traffic.

The remediation for this is very straight forward and it simply requires to never block stdout on the source database.

Posted about 1 year ago. Jan 24, 2018 - 21:00 UTC

Resolved
Between 11:43 UTC and 11:47 API traffic had a spike in HTTP 5xx errors and an increase of latency.
Posted about 1 year ago. Jan 24, 2018 - 12:09 UTC