Regarding the github cluster crash

I had a small discussion on Twitter about possible reasons – in my eyes the listed DRBD error was just a symptom of the upper cluster layers not correctly detecting that they would run into a split brain.
Some others agreed that STONITH is not enough of a safeguard – especially if your quorum resources are not perfectly chosen. One very well thought-out comment also pointed out that there’s an inherent issue if you use the same channel for everything, including the quorum decision (in GitHub’s case: the network for cluster HA, for data, for DRBD sync and for the quorum decision). Very true, that!

Enterprise cluster best practices normally demand that you use two distinct(!!!!) heartbeat networks, fall back to the public network, and do the fencing using SCSI reservations on ALL data disks, with a distinct, odd number of quorum disks. I hope the difference is quite visible? Of course we all heard that enterprise is not sexy and that five 9s are a myth, so let’s just keep best practices out of the picture and look at the options available if you have that shared network.

My UNIXy mind says: then why doesn’t the goddamn thing panic if it’s not sure it has a majority? And I constantly need to tell myself: oh, but it DID think it had one.
So how can you end up there? If your ping resources are also in a split brain. How can that happen? If the ping targets are themselves redundant component(s) in the first place.

And last Friday I found another nice story in the Q&A area of a Cisco UCS blog.
First the quote, then some thoughts:

We had a Chassis was Re-Ack’d and rather than the 30 or so seconds of expected outage we lost all VM’s and vCenter. What had happened was that there were two ESXi hosts in the same chassis which were clustered, they had a keepalive timeout of 30 secs, they were only referecing each others IP, and were set to power down all VM’s in case of being isolated. So you can see where this is going. So after an hours work of KVM’ing into ESXi hosts, disabling lockdown mode (as vCenter was a VM) powering up SQL servers, then AD then vCenter then all VM’s etc etc.. was a real pain.

Re-Acking as far as I understand it is just setting up a new management session and inventory from it. XenServer users will know that lol.

The post’s author went on to suggest always also using the default gateway as a quorum device.

Now, would that save you if your network layer runs into a software bug, split brain and the gateway is a virtual chassis or VRRP? Uh-Uh-OH!

I’ll give some examples of the quorum resources I used in the last (horrible heartbeat2 v1-mode) cluster I had to QA/fix.

The two cluster nodes were virtual machines on two separate hosts.

Each was presented a NIC for DRBD and a NIC for the public network. The first nightmare is that most virt platforms do not support passing on the link state. So even with all the nice bonding tricks in place on the VM hosts it would still be possible for all network-level comms to fail. (We had tried to pull the bonding into the VMs but that did not work due to some bug in CentOS5/Xen.)

So our clusters needed to safely detect a network failure and ideally panic their way out of it.

Each node had the following quorum resources(*):

  • Its own host’s LAN IP
  • Its partner’s host’s LAN IP
  • The first uplink switch of the hosts
  • The second uplink switch of the hosts
  • The default gateway (non-VRRP)

Both hosts were connected to both switches. The switches and the gateway formed a (R)STP ring. The gateway in this case is marked red since it would normally be the decision maker. And later I had to ease up on that config since I was going much further than Heartbeat can go. On VCS this would be a piece of cake.

Now, things to think about:

This is all math! I’m very bad at it, so I have to just mentally count / map out what happens if something breaks. Luckily I’m very fast at that 🙂

But if you’re deploying for a large site stuffed with 2-node clusters, think hard about this and consider DOING the math.
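If you would rather let a machine do the counting, here is a minimal sketch of it. Everything in it is an assumption for illustration: the component names (`vm_a`, `sw1`, `gw`, ...) are made up, the links are modelled on the setup described above, and the rule "a node only keeps running with a strict majority of its ping targets" is assumed, not something Heartbeat is known to do out of the box. It also only models plain connectivity, not STP convergence, timeouts or software bugs.

```python
from itertools import combinations

# Hypothetical topology: each VM sits on its host, both hosts connect to both
# switches, and the two switches plus the gateway form a ring.
LINKS = [
    ("vm_a", "host_a"), ("vm_b", "host_b"),
    ("host_a", "sw1"), ("host_a", "sw2"),
    ("host_b", "sw1"), ("host_b", "sw2"),
    ("sw1", "sw2"), ("sw1", "gw"), ("sw2", "gw"),
]
PING_TARGETS = ["host_a", "host_b", "sw1", "sw2", "gw"]


def reachable(start, failed):
    """Components reachable from start when everything in failed is down.

    failed may contain component names and/or link tuples from LINKS.
    """
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for link in LINKS:
            if link in failed or node not in link:
                continue
            other = link[1] if link[0] == node else link[0]
            if other not in failed and other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def has_quorum(vm, failed):
    """Assumed rule: strict majority of the ping targets must answer."""
    votes = len(reachable(vm, failed) & set(PING_TARGETS))
    return 2 * votes > len(PING_TARGETS)


def classify(failed):
    """Name the cluster-wide outcome for one set of failed parts."""
    a = has_quorum("vm_a", failed)
    b = has_quorum("vm_b", failed)
    if a and b and "vm_b" not in reachable("vm_a", failed):
        return "split-brain"
    if a and b:
        return "both up"
    if a or b:
        return "one up"
    return "both down"


def survey(max_failures=2):
    """Group every combination of up to max_failures broken parts by outcome."""
    parts = PING_TARGETS + LINKS
    outcomes = {}
    for count in range(1, max_failures + 1):
        for failed in combinations(parts, count):
            outcomes.setdefault(classify(set(failed)), []).append(failed)
    return outcomes


if __name__ == "__main__":
    for verdict, cases in survey(2).items():
        print(verdict, len(cases))
```

Under this assumed majority rule the survey never reports a split-brain, simply because two disconnected halves cannot both see three of the same five targets. The interesting part is the "both down" list, and what changes once you swap in the rule your cluster manager really applies.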

Consider, for example, that the ring might never come up due to some Ethernet loop on an edge switch – what would happen?

What happens if this network split-brains? Will STP block the route to your gateway? (It should.)

Keep asking those questions.

Experts will tell you that clustering is all about covering 1-Failures and cannot handle N-failure scenarios.

So first, test all single-component failures, and if that works, that’s a nice thing[tm].

As a practical person my piece of advice and the takeaway for you (I hope) is: Fuck that, test what you’re not expected to test.

  • Pull all network cables, see what happens
  • Then put all back, and see what happens

Because sorry, unless you’re at some wizkid shop you’re responsible for knowing what happens if something goes wrong. Not just if everything goes as planned.

And, that’s the great thing about testing more than you’re expected to: let’s assume both nodes panicked when you pulled all the cables.

  • They would come back up and both sit at the DRBD connection handler.
  • What’s next? Ah, you put in the cables – ah, they set up the DRBD connection and come up!

This is really worth it.

And of course, general advice:

  • If you can make the case for it financially, use VCS instead of Pacemaker (RH Cluster is said to also work, I just don’t have any experience with it)
  • Never use Heartbeat, imho it lacks an internal state machine (thus locks up at shutdown if it can’t inform the partner that the network channel is down because it’s still got the msg to deliver)
  • If you have (real, dedicated) SAN admins, you can have nice things
  • If not, really use different networks for different stuff. Not VLANs, not virtual fabrics. Different networks.

E.g. Ultraspeed in the UK does that; have a look at the following quote from their server specs page:

Servers are configured with two physical network cards; one of these cards is dedicated to connecting the server to the private network, the other to connecting to the public network (i.e., the Internet). Each card has multiple gigabit ports on it and each port is connected to physically diverse networks, providing insulation against multiple path failure.

Servers are connected to a separate lights out management network; this gives Ultraspeed Systems Engineers remote reboot ability, along with full KVM (Keyboard/Video/Mouse).

Let that settle for a moment. If you’re all jumpy saying “we have all of that too” then either congrats to you or “keep searching” 🙂

(And myself, I’ve been thinking about this “physically diverse” for quite a while now.

All my servers have either 2 or 4 GigE NICs plus KVM plus QDR InfiniBand. But I did not yet think beyond having those 4 NICs connected to 2 switches. Maybe that’s stopping just too early.)
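One way to think beyond "4 NICs on 2 switches" is to write down what each network path depends on and look for the smallest sets of parts that take out all paths at once. The following is only a sketch: the function name `cut_sets`, the NIC names and especially the shared `pdu_a` are hypothetical, invented for illustration and not a description of any real setup.

```python
from itertools import combinations


def cut_sets(paths, max_size=2):
    """Return minimal sets of components whose failure breaks every path.

    paths maps a path name to the set of components that path depends on.
    """
    components = sorted(set().union(*paths.values()))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(components, size):
            # Skip combos that only extend a smaller set we already reported.
            if any(set(prev) <= set(combo) for prev in found):
                continue
            # A combo is a cut set if it hits at least one part of every path.
            if all(set(combo) & deps for deps in paths.values()):
                found.append(combo)
    return found


# Hypothetical example: four NICs on two switches, both switches fed by one PDU.
PATHS = {
    "nic0": {"switch1", "pdu_a"},
    "nic1": {"switch1", "pdu_a"},
    "nic2": {"switch2", "pdu_a"},
    "nic3": {"switch2", "pdu_a"},
}

if __name__ == "__main__":
    for combo in cut_sets(PATHS):
        print(" + ".join(combo))
```

Any single-element result is a part that all "diverse" paths quietly share, which is the case worth hunting for before calling the paths physically diverse.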


2 thoughts on “Regarding the github cluster crash”

  1. Nice! IMHO DRBD is crap anyway if you just need HA replicated remote storage. The iSCSI/iSER/SRP initiator is a single system. So if you move the replication to the initiator, you don’t need a cluster manager. With that you can have PARALLEL network paths for storage (EQUAL LATENCY). With DRBD primary/secondary you have CHAINED network paths for writes – absolutely stupid and slow!

    This is why we’ve hacked MD RAID-1 for high-performance replication. It has a very very sophisticated write-intent bitmap, intelligent read-balancing, has a higher IO size limit (full 512 KiB instead of 128 KiB with DRBD), is much more stable and mature, has its config on disk,… We just needed to add some magic for VM live migration, scaling, etc.

    • With md raid you can also have N-way mirrors, i.e. attach one for migrations / backups.
      I see your point.

      I’ve done a lot of paper sketching for this kind of stuff. It’s a sad thing that the linux mpath prio callouts are all in C as far as I could find out, because that would have been my preference. Doesn’t really matter. 🙂

      I’ll think some more about md instead of drbd there.

      Regarding cluster manager, it was their “file servers” – I guess they wanted application level HA plus storage replication; we can also call it a live example of how this merged approach has more issues.

      But it’s good you made a point of this. Yes, if one just wants replicated storage then one should really use the most straightforward approach to it, and md comes in there nicely.
