We have two web servers and a separate load balancer that uses ipvsadm rules in direct routing mode.
    load1:~ # uname -a
    Linux load1 2.6.37.6-0.20-default #1 SMP 2011-12-19 23:39:38 +0100 i686 i686 i386 GNU/Linux
    load1:~ # ipvsadm -L
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  www.xxx.com:http wlc
      -> web1.xxx.com:http            Route   1      0          0
      -> web2.xxx.com:http            Route   1      0          1
    TCP  www.xxx.com:https wlc
      -> web1.xxx.com:https           Route   1      18         200
      -> web2.xxx.com:https           Route   1      19         200
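For reference, a table like that is built with commands along these lines (a simplified sketch, not our exact setup script; -g selects direct routing):

    # Virtual services on the VIP, wlc scheduler, real servers added in
    # direct-routing ("gatewaying") mode with equal weights.
    VIP=www.xxx.com

    ipvsadm -A -t ${VIP}:80  -s wlc
    ipvsadm -a -t ${VIP}:80  -r web1.xxx.com:80  -g -w 1
    ipvsadm -a -t ${VIP}:80  -r web2.xxx.com:80  -g -w 1

    ipvsadm -A -t ${VIP}:443 -s wlc
    ipvsadm -a -t ${VIP}:443 -r web1.xxx.com:443 -g -w 1
    ipvsadm -a -t ${VIP}:443 -r web2.xxx.com:443 -g -w 1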
We have a hack to check the two web servers' availability. A cron job runs every minute, verifies that the software on web1 and web2 is working as expected, and, if not, takes the failed server out of the rules. E.g., if web1 fails, the above becomes:
    TCP  www.xxx.com:http wlc
      -> web2.xxx.com:http            Route   1      0          1
    TCP  www.xxx.com:https wlc
      -> web2.xxx.com:https           Route   1      19         200
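The cron job itself boils down to something like this (a simplified sketch; the /healthcheck URL is just a stand-in for whatever "working as expected" means for our software):

    #!/bin/bash
    # Runs every minute from cron on load1. Probes each web server and pulls
    # it out of both virtual services if the probe fails; re-adds it once it
    # passes again. (Sketch; the real probe and URL are placeholders.)
    VIP=www.xxx.com

    for RS in web1.xxx.com web2.xxx.com; do
        if curl -sf --max-time 5 "http://${RS}/healthcheck" >/dev/null; then
            # Healthy: make sure it is in the rules (ignore "already exists" errors).
            ipvsadm -a -t ${VIP}:80  -r ${RS}:80  -g -w 1 2>/dev/null
            ipvsadm -a -t ${VIP}:443 -r ${RS}:443 -g -w 1 2>/dev/null
        else
            # Failed: remove it from both virtual services.
            ipvsadm -d -t ${VIP}:80  -r ${RS}:80  2>/dev/null
            ipvsadm -d -t ${VIP}:443 -r ${RS}:443 2>/dev/null
        fi
    done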
The load balancer, load1, also runs Apache, which is normally masked by IPVS. However, if both web1 and web2 fail (or are down for maintenance, etc.), the IPVS rules are cleared and the Apache on load1 starts responding (it just shows a "we're sorry" message).
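That fallback is nothing more sophisticated than (again, a sketch):

    # If neither web1 nor web2 passes the check, wipe the whole IPVS table so
    # traffic to the VIP falls through to the local Apache on load1, which
    # serves the "we're sorry" page. The rules are recreated once at least
    # one backend comes back.
    ipvsadm -C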
So far this works. However, we've been seeing a problem: whenever the rules are cleared, say for a couple of minutes, and then restored, SOME connections keep going to the Apache running on load1 rather than to either web1 or web2. I've seen this happen even an hour or more after the incident: some people keep seeing the "we're sorry" page with the rules back in place.
Note that it's not just a client caching issue, because I see entries in the Apache log on load1.
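For what it's worth, this is what we look at when it happens (the access log path is the SUSE default; adjust as needed):

    # The rules are back in place...
    ipvsadm -Ln

    # ...the IPVS connection table can be inspected with:
    ipvsadm -Lnc

    # ...yet the local Apache on load1 keeps logging real client requests:
    tail -f /var/log/apache2/access_log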
I know we're doing many things wrong, but what are we doing wrong in this specific case?