Due to what it seems to be a bug with EC2 Security Groups, the connectivity between one API server and one Redis backend was impaired. This connectivity issue resulted in API requests waiting until a hard-timeout occurred. At its peak 1% of all API calls were affected and either returned a 502 error code or raised client-side timeout exceptions.
Once the problem was clear, the EC2 server with the configuration problem was removed from the load balancer, this immediately resolved the problem.
We are talking with AWS support to isolate and validate this problem; in the meantime we instrumented all our API servers to pro-actively check for this specific issue and decommission servers experiencing the same problem.