Channel: linuxadmin: Expanding Linux SysAdmin knowledge

Pacemaker/Corosync, DRBD on ESX virtual machines issues - Split brains

Hello everyone, I've been struggling with the above for a while now and eventually decided to look for help on Reddit.

I have two VMs. The /home partition is managed by DRBD, and on top of that I run Pacemaker and Corosync, listening on a heartbeat VLAN.

Once in a while (every few weeks, sometimes months) I get an email about a split brain, and this is what I find in the logs:

Aug 5 23:38:07 fs0 kernel: [68355583.821768] block drbd0: sock was shut down by peer
Aug 5 23:38:07 fs0 kernel: [68355583.821778] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 5 23:38:07 fs0 kernel: [68355583.842257] block drbd0: Creating new current UUID
Aug 5 23:38:07 fs0 kernel: [68355583.866926] block drbd0: asender terminated
Aug 5 23:38:07 fs0 kernel: [68355583.866933] block drbd0: Terminating drbd0_asender
Aug 5 23:38:07 fs0 kernel: [68355584.507196] block drbd0: Connection closed
Aug 5 23:38:07 fs0 kernel: [68355584.507226] block drbd0: conn( BrokenPipe -> Unconnected )
Aug 5 23:38:07 fs0 kernel: [68355584.507234] block drbd0: receiver terminated
Aug 5 23:38:07 fs0 kernel: [68355584.507237] block drbd0: Restarting drbd0_receiver
Aug 5 23:38:07 fs0 kernel: [68355584.507240] block drbd0: receiver (re)started
Aug 5 23:38:07 fs0 kernel: [68355584.507245] block drbd0: conn( Unconnected -> WFConnection )
Aug 5 23:38:08 fs0 kernel: [68355584.612101] block drbd0: Handshake successful: Agreed network protocol version 91
Aug 5 23:38:08 fs0 kernel: [68355584.612111] block drbd0: conn( WFConnection -> WFReportParams )
Aug 5 23:38:08 fs0 kernel: [68355584.612140] block drbd0: Starting asender thread (from drbd0_receiver [4346])
Aug 5 23:38:08 fs0 kernel: [68355584.615955] block drbd0: data-integrity-alg: <not-used>
Aug 5 23:38:08 fs0 kernel: [68355584.615973] block drbd0: drbd_sync_handshake:
Aug 5 23:38:08 fs0 kernel: [68355584.615977] block drbd0: self 9C6012EC1D3A23DF:F838BBE1953EC251:38E221DC90C8B46B:2D25DFD8EF4F70D7 bits:1 flags:0
Aug 5 23:38:08 fs0 kernel: [68355584.615982] block drbd0: peer EA40E1F0BB878A3F:F838BBE1953EC250:38E221DC90C8B46A:2D25DFD8EF4F70D7 bits:0 flags:0
Aug 5 23:38:08 fs0 kernel: [68355584.615986] block drbd0: uuid_compare()=100 by rule 90
Aug 5 23:38:08 fs0 kernel: [68355584.621228] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Aug 5 23:38:11 fs0 kernel: [68355584.644234] block drbd0: conn( WFReportParams -> NetworkFailure )
Aug 5 23:38:11 fs0 kernel: [68355584.644369] block drbd0: asender terminated
Aug 5 23:38:11 fs0 kernel: [68355584.644377] block drbd0: Terminating drbd0_asender
Aug 5 23:38:12 fs0 notify-split-brain.sh[21452]: invoked for r0
Aug 5 23:38:12 fs0 kernel: [68355586.148682] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Aug 5 23:38:12 fs0 kernel: [68355586.148703] block drbd0: conn( NetworkFailure -> Disconnecting )
Aug 5 23:38:12 fs0 kernel: [68355586.149526] block drbd0: Connection closed
Aug 5 23:38:12 fs0 kernel: [68355586.149537] block drbd0: conn( Disconnecting -> StandAlone )
Aug 5 23:38:12 fs0 kernel: [68355586.149751] block drbd0: receiver terminated
Aug 5 23:38:12 fs0 kernel: [68355586.149754] block drbd0: Terminating drbd0_receiver
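To make the sequence easier to follow, the connection-state changes can be pulled out of a log like this with a short script. This is only a rough sketch: the regular expression and the `conn_transitions` helper are my own illustration (not part of DRBD or its tooling), and it assumes the kernel lines keep the `conn( Old -> New )` shape shown above. The sample lines are copied from the log.

```python
import re

# Assumed line shape, taken from the log above:
#   "... block drbd0: ... conn( Connected -> BrokenPipe ) ..."
CONN_RE = re.compile(r"block (drbd\d+): .*?conn\( (\w+) -> (\w+) \)")


def conn_transitions(lines):
    """Yield (device, old_state, new_state) for each connection-state change."""
    for line in lines:
        match = CONN_RE.search(line)
        if match:
            yield match.groups()


# Three lines copied from the log above.
sample = [
    "Aug 5 23:38:07 fs0 kernel: [68355583.821778] block drbd0: "
    "peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) "
    "pdsk( UpToDate -> DUnknown )",
    "Aug 5 23:38:07 fs0 kernel: [68355584.507226] block drbd0: "
    "conn( BrokenPipe -> Unconnected )",
    "Aug 5 23:38:12 fs0 kernel: [68355586.149537] block drbd0: "
    "conn( Disconnecting -> StandAlone )",
]

transitions = list(conn_transitions(sample))
for device, old, new in transitions:
    print(f"{device}: {old} -> {new}")
```

Run against the full log, this should show the path from Connected through BrokenPipe and WFConnection to StandAlone, which is the state the node stays in once the split brain is detected.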

It looks like the connection between the nodes timed out, even though the network was up the whole time on both hosts. The failure occurred moments after the VM was migrated by vMotion. I was told that some packets may be lost during vMotion. Is that the case?

Is there a Corosync setting I could tweak to mitigate these issues? The TTL in the Corosync config is currently set to 1.

Thanks for your help!

submitted by SysadminOfThings