What I had not expected was how hard it would be to decide on an actual solution.
Picking a Hypervisor
For a lab I would need:
- nested virt
- high performance
- low overhead, not least to keep power consumption down
- easy cloning of vms and labs
- flexible networking
- easy scripting
- wide storage options and easy migration
- thin provisioning of some kind
Knowing all the products and their drawbacks, it turned into a constant back-and-forth between the different hypervisors and ecosystems.
VMware kept sneaking back in thanks to feature reliability and performance consistency, and then got kicked back out for lacking essentials like an API and storage migration without a full vCenter install.
I knew it would deliver good (600-900 MB/s-ish) performance under any circumstance, whereas e.g. Xen can be all over the place, from 150 to 1900 MB/s…
Another downside: in VMware my SolarFlare 5122 will definitely never expose its 256 vNICs. And I'd like to have them.
Installing MegaCli in ESXi is also a bit annoying.
On the pro side there’s the Cisco Nexus1000V and many other similar *gasp* appliances.
And, the perfect emulation: no "half" disk drivers, no cheapass BIOS.
In the end, I like to have my stuff licensed, and to use the full power of a VMware setup I'd need vCenter plus an Enterprise license. No fun.
While XenServer has great features for VM cloning, it's just not my cup of tea. Too much very bad Python code. Too many Windows-user kludges. Broken networking all over.
Any expectation of storage flexibility would be in vain: you'd need to backport and recompile software against the dom0 kernel using their SDK. Definitely not an easy solution if you wanna be able to flip between iSCSI, InfiniBand, md and whatever else *looks* interesting. This should be a lab after all, and I don't see any chance of running something like the Storwize VSA on it. That would take nested ESXi, and that's not on the roadmap for XenServer. If anything still is.
It would probably work best for SolarFlare. I’ll admit that.
This is what will run in many VMs, but I don’t wanna break layering, so my underlying hypervisor and solution should not be the same as in the VMs. I am not yet sure if it’s the right decision.
This would be the prime KVM choice since they already deliver a well-tuned configuration.
What worries me is that, while MooseFS' FUSE client scales well enough on a single hypervisor node, it would end up with a lot of additional context switching / thrashing if I use it on the main node and in the clients. There might be smarter ways around this, i.e. by having a fat global pool in the "layer 1" hypervisor and using that from the above layers, too. More probably it'd turn into a large disaster :)
Pointless: no hypervisor, and one single kernel instance can't successfully pretend to be a bunch of OSDs and clients :)
This is what I already have and went with, especially to make use of tmem and run the Ceph labs as paravirt domUs. This way I know nothing will get in the way performance-wise.
There's one thing you should know though, comparing Xen vs. ESXi or a licensed VMware:
Xen's power management is brokenbrokenbroken:
- Newer deep-idle CPU states are all unsupported
- The utility to manage CPU power management is broken as well. Since 4.3 nothing works any more.
- Even if you free + shutdown a core from dom0 it’ll not be put to sleep
You can definitely tell from the power draw and fan speed that Xen, even idle, consumes more power than an idle Linux kernel would. Spinning up a PV domU has no impact; spinning up an HVM one also brings a noticeable increase in fan whoosh.
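One quick way to see what the kernel is (or isn't) being offered is the cpuidle sysfs tree. A small sketch, assuming a Linux system with sysfs mounted — the helper name is mine:

```shell
# list_cstates DIR: print the C-state names a cpuidle sysfs tree exposes.
# On bare metal: list_cstates /sys/devices/system/cpu/cpu0/cpuidle
list_cstates() {
  for f in "$1"/state*/name; do
    [ -e "$f" ] && cat "$f"
  done
}
```

Under Xen the deep C-states are the hypervisor's business, so dom0 typically shows only a short list; `xenpm get-cpuidle-states` is what should report them from dom0 — the very utility that, as of 4.3, doesn't work for me.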
ESXi is far better integrated there, so I'm expecting something like 100 Euro (personal unfounded estimate) per year of additional energy wasted compared to VMware.
My choice for Xen mostly comes down to:
- the bleeding edge features like tmem
- the really crazy stuff like vTPM and whatever of the cool features ain't broken at any given time
- leveraging any storage trick I want and have available in a (thanks to Alpine Linux) very recent Linux kernel
- putting ZFS in place, maybe in a dedicated driver domain
- also being able to use MooseFS and, last but most interesting,
- all the things that never work on other hypervisors: CPU hotplug, dynamic RAM changes…
- storage domUs!!!!!
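A storage domU in xl terms is mostly a normal PV guest config plus a `pci =` line handing it the controller. A minimal sketch — the names, paths and PCI address are all made up for illustration:

```
# /etc/xen/storage.cfg -- hypothetical paths and PCI BDF, adjust to taste
name   = "storage"
kernel = "/boot/vmlinuz-domU"         # PV kernel for the guest
memory = 4096
vcpus  = 2
pci    = ['0000:01:00.0']             # the RAID HBA, passed through
disk   = ['phy:/dev/vg0/storage-root,xvda,w']
vif    = ['bridge=xenbr0']
```

The controller also has to be made assignable in dom0 first (e.g. `xl pci-assignable-add 0000:01:00.0`, or binding it to pciback at boot) before the domU can grab it.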
I think in a lab running 20-30 loaded VMs it will be crucial to optimize the memory subsystem.
Same goes for having the least possible CPU overhead; under load this will help.
Last, concurrently being able to use different storage techs means I can choose different levels of availability and speed – albeit not _having to_ since there's a large SSD monster underneath it.
I’m also quite sure the disks will switch from Raid10 to Raid5. They just won’t see any random IO any more.
The “Raid5 is not OK” Disclaimer
Oh, and yes. Just to mention it. I’m aware I’m running green drives behind a controller. I know about Raid5 rebuild times (actually, they’re much lower on HW raid. About 30% of software raid) and the thing is…
If I see disk dropouts (yet to be seen), I’ll replace the dumb thing ASAP. It makes me cringe to read about people considering this a raid controller issue. If the damn disk can’t read a block for so long that the controller drops it out… Then I’m glad I have that controller and it did the right thing.
Such (block-errored) disks are nice as media in secondary NAS storage or as doorstops, but not for a raid. Maybe I just got extremely lucky in having no media errors at all off them? Definitely not what you'd see in a dedicated server at a mass hoster.
I've also patched my Check_MK SMART plugin to track the SMART stats from the raid PDisks, so anything SMART notices I'll immediately be aware of. Why the green disks in the first place? Well – power and noise benefits are huge. If I had some more space I'd consider a Raid6 of 8 of them, but not until I move to a bigger place.
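For disks behind a MegaRaid controller, smartctl can address the physical disks with `-d megaraid,N`. A tiny sketch of the kind of extraction such a plugin does — the helper name, the attribute picked and the device numbering are just examples:

```shell
# smart_attr NAME: pull one raw SMART attribute value out of `smartctl -A` output.
# On the live box (physical disk 0 behind the controller, as an example):
#   smartctl -A -d megaraid,0 /dev/sda | smart_attr Reallocated_Sector_Ct
smart_attr() {
  awk -v attr="$1" '$2 == attr { print $NF }'
}
```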
Coming up next:
A colleague offered me some company when setting up a final storage layout.
We built a dedicated storage domU with a PCI-passthrough'd MegaRaid controller and ZFS. The install had a little issue…
This is what the next posts will be about, one describing how to build a storage domU.
Also, what pitfalls to expect, and then a focus on losing data (sigh) and getting it back.
I’ll close with some lessons learned. :)