I've inherited a setup that's a bit strange, and out of nowhere we're seeing random downtime. I've been trying to debug it all day and have gotten flat-out nowhere.
Here's the setup: [ELB] -> [Nginx/Proxy] -> [Varnish] -> [Nginx/Proxy] -> [ELB] -> [App Servers]
The symptom is seemingly random downtime (503 errors) on the frontend: the first ELB removes the two Nginx nodes from service, and then, almost as quickly as they're dropped, they come back into service. I've checked Nginx on those two servers and don't see any odd errors, but for some reason the ELB just seems to drop them. This usually happens in the morning, and I've watched for oddities like scripts running (not that there should be any on two proxy nodes) or anything else out of the ordinary, and found nothing.
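In case it helps, here's roughly how I've been pulling the ELB health-check settings and instance states while this happens. This is just a sketch with boto3 against a classic ELB; the load balancer name and region are placeholders, not my real values.

```python
# Sketch only: inspect the frontend ELB's health check and instance states.
# "frontend-elb" and the region are placeholders for my actual setup.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Health check settings (interval, timeout, unhealthy/healthy thresholds)
# control how quickly the ELB drops and re-adds the Nginx nodes.
lb = elb.describe_load_balancers(LoadBalancerNames=["frontend-elb"])
print(lb["LoadBalancerDescriptions"][0]["HealthCheck"])

# State of the registered instances, plus the reason the ELB gives
# when it marks one OutOfService.
health = elb.describe_instance_health(LoadBalancerName="frontend-elb")
for inst in health["InstanceStates"]:
    print(inst["InstanceId"], inst["State"], inst["ReasonCode"], inst["Description"])
```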
The Varnish logs are clear as well; nothing notable there aside from the occasional 404, 301, and 200. I did see some strange traffic that looked like someone scanning for phpMyAdmin and the like, but that's nothing I haven't seen before on a public-facing machine. Disks aren't full and the instances are operating normally. Neither Nginx nor Varnish has restarted or seems to have any issues at all.
My question: can anyone tell me where to start tracking this down? I'm currently using varnishlog, varnishncsa, iostat, and the Nginx logs, but since those aren't turning up anything, I don't have much of a clue. (The rough log pass I've been doing is sketched below.)
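For reference, this is the kind of quick pass I've been making over the Nginx access logs to see whether requests were erroring or going slow around the drop windows. The log path, the assumption that $request_time is appended to the default combined format, and the 2-second "slow" threshold are all guesses about my setup, so adjust to taste.

```python
# Sketch: per-minute counts of 5xx responses and slow requests in the Nginx
# access log. Assumes the combined log format with $request_time appended at
# the end of each line; the regex needs adjusting for other log_format setups.
import re
from collections import Counter

LOG = "/var/log/nginx/access.log"  # placeholder path
pattern = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) \S+.*?(?P<rt>\d+\.\d+)$'
)

errors = Counter()
slow = Counter()
with open(LOG) as fh:
    for line in fh:
        m = pattern.search(line)
        if not m:
            continue
        minute = m.group("ts")[:17]        # e.g. "21/Apr/2014:09:03"
        if m.group("status").startswith("5"):
            errors[minute] += 1
        if float(m.group("rt")) > 2.0:     # slower than the ELB health-check timeout?
            slow[minute] += 1

for minute in sorted(set(errors) | set(slow)):
    print(minute, "5xx:", errors[minute], "slow:", slow[minute])
```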