Some upgrades are special


No Xen kernel

Yeah, never forget when upgrading from an older Alpine Linux that Xen itself moved into the xen-*-hypervisor package. I didn’t catch that on update, and so I had no hypervisor left on the system.

Xen 4.6.0 crashes on boot

My experience: No you don’t need to compile a debug Xen kernel + Toolstack. No you don’t need a serial console. No you don’t need to attach it.

You need Google, to search for whatever random regression you hit.

In this case, if you set dom0_mem, it will crash instead of putting the memory in the unused pool.

4.6.1 fixes that, but isn’t in Alpine Linux stable so far.
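For context, dom0_mem is a Xen boot parameter; on Alpine it lives in the syslinux/extlinux config. A sketch with example values (paths and sizes are illustrative, not my actual settings):

```
# /boot/extlinux.conf -- example values, adjust to your box
LABEL xen
    KERNEL mboot.c32
    # the dom0_mem setting below is what triggers the 4.6.0 boot crash
    APPEND xen.gz dom0_mem=2048M,max:2048M --- vmlinuz-lts console=tty0 --- initramfs-lts
```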

So what I did was enable autoballoon in /etc/xen/xl.conf. That’s one of the worst things you can do, ever. It slows down VM startup, has NO benefit at all, and as far as I know also increases memory fragmentation to the max. Which is lovely, especially considering Xen doesn’t detect this NUMA machine as one thanks to IBM’s chipset “magic”.
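For reference, the workaround itself is a one-liner (option name as documented for xl.conf; check the file shipped with your Xen version):

```
# /etc/xen/xl.conf
# let xl balloon dom0 down on demand when starting VMs -- slow, but it boots
autoballoon="on"
```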

CPU affinity settings got botched

I had used a combination of the vcpu pinning / scheduling settings to make sure power management works, all while dedicating 2 cores to dom0 for good performance. Normally with dom0 VCPU pinning you get a problem:

dom0 is being a good citizen, only using the first two cores. But all other VMs may also use those, breaking some of the benefits…
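The dom0 side of that is done on the Xen command line; these are standard Xen boot options, the values are examples:

```
# Xen command line in the bootloader config:
# give dom0 exactly two VCPUs and pin them to pCPUs 0 and 1
dom0_max_vcpus=2 dom0_vcpus_pin
```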

So what you’d do is have settings like this:


memory = 8192
maxmem = 16384
name   = ""
vcpus  = 4
cpus = [ "^0,^1" ]

That tells Xen this VM can have 4 virtual CPUs, but they’ll never be scheduled on the first two cores (the dom0 ones).

Yeah, except in Xen 4.6 no VM can boot like that.
The good news is that the non-array syntax works:

cpus = "^0,^1,2-7"
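To verify where everything actually ended up, xl can show the per-VCPU affinity, and fix it at runtime without a reboot (the domain name below is hypothetical):

```shell
# list every domain's VCPUs, the pCPU each sits on, and its affinity mask;
# with the config above the domUs should show 2-7 as hard affinity
xl vcpu-list

# pin VCPU 0 of domain "mydomu" (hypothetical name) to cores 2-7 on the fly
xl vcpu-pin mydomu 0 2-7
```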



IBM’s certificates for this old clunker are expired. Solution to access?

Use Windows.

Oh, and if it’s stuck in the BIOS from configuring the RSA module it’ll NEVER display anything, even if you reset the ASM; you just get a white screen. You need to reset the server once more. The recommended fix is to reinstall the firmware, which also gets you that reboot.


Alpine Linux

My network config also stopped working. That’s the one part I’d like to change in Alpine: not using that challenged “interfaces” file from Debian, but instead something more friendly to working with VLANs, bridges and tunnels.

If you’re used to it day-to-day it might “look” just fine, but that’s only because you don’t go out and compare it to something else.

Bringing up a bridge interface was broken because one sysctl knob was no longer supported. The script tried to turn off ebtables there, that didn’t work, and so, hey, what should we do? Why not just stop bringing up any of the configured interfaces and completely mess up the IP stack?

I mean, what else would make sense to do than the WORST possible outcome?

If this were a cluster with a lost quorum I’d even agree. But you can bet the Linux kids will run that into a split brain with pride.


I’ll paste the actual config once I find out how to take WordPress out of idiot mode and edit HTML. Since, of course, pasting into a <PRE> section is utterly fucked up.


I removed my bonding config from this to be able to debug more easily. But the key points were to remove the echo calls and to make the pre-up / post-up parts more reliable.

auto lo
iface lo inet loopback

auto br0
iface br0 inet static
    pre-up brctl addbr br0
    pre-up ip link set dev eth0 up
    post-up brctl addif br0 eth0
    address your_local_ip
    netmask subnet_mask
    broadcast subnet_bcast
    gateway your_gw
    hostname waxh0012
    post-down brctl delif br0 eth0
    post-down brctl delbr br0
# 1 gbit to the outside
auto eth0             
iface eth0 inet manual      
    up ip link set $IFACE up    
    down ip link set $IFACE down

Xen VMs not booting

This is some annoying thing with Alpine only. Some VMs get stuck at the grub menu and just require you to press enter. It’s something with grub parsing the extlinux.conf that doesn’t work, most likely the “timeout” or “default” lines.
And of course the idiotic (hd0) default vs (hd0,0) from the grub-compat wrapper.
I think you can’t possibly shout too loud at any of the involved people since this all goes to the “why care?” class of issues.
(“Oh just use PV-Grub” … except that has another bunch of issues…)

Normally I don’t want to bother anymore reporting all the broken stuff I run into. It’s gotten just too much; basically I would spend 2/3 of my day on that. But since this concerns a super-recent release of Alpine & Xen (and even some Debian) I figured I’ll save people some of the PITA I spent my last hours on.
When able I dump them to my confluence at this url:
Adminspace – Fixlets

I also try really hard to not rant there 🙂

Nathanael Copa reached out to me and let me know that the newer bridge packages on Alpine include a lot more helper scripts. That way the icky settings from the config would not have been needed any more.
Another thing you can do is this:

post-up my-command || echo "didnt work"

You should totally not need to do that, but it helps.

No-copy extracting Xen VM tarballs to LVM

SUSE Studio delivers Xen VM images, which is really nice. They contain a sparse image and a (mostly incomplete) VM config file. Since I update them pretty often I needed a hack that saves on any unneeded copies and needs no scratch space either.

Goal: save copy times and improve life quality instead of copying and waiting…

First, let’s have a look at the contents, and then let’s check out how to extract them directly…

(Oh. Great. Shitbuntu won’t let me paste here)


Well, great.

In my case the disk image is called:


It’s located in a folder named:



So, what we can do is this:

First, set up some variables so we can shrink the command later on…


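Since the paste got eaten: the variables the extraction command expects look roughly like this (all values here are placeholders, the real names come from your Studio download):

```shell
# placeholder values only -- substitute whatever your appliance is called
url="http://example.com/downloads/my-appliance.tar.gz"   # download link
folder="my-appliance-0.0.1"                              # top dir inside the tarball
vmimage="my-appliance.x86_64-0.0.1.raw"                  # sparse disk image
lv="/dev/vgxen/lv_myvm"                                  # target logical volume
```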
Then, tie it together to store our VM data.

wget -O- $url | tar -O -xzf - ${folder}/${vmimage} | dd of=$lv bs=1024k

Storing to a file at the same time:

wget -O- $url | tee /dev/shm/myfile.tar.gz | tar -O -xzf - ${folder}/${vmimage} |\
dd of=$lv bs=1024k


Wget fetches the file and writes it to STDOUT; tar reads STDIN, extracts only the image file, and writes the extracted data to STDOUT, which is then buffered and written to the LV by dd.


If, like me, you’ll reuse the image for multiple VMs, you can also write it to /dev/shm and, if RAM allows, gunzip it there. The gzip extraction is actually what limits performance, and even tar itself seems to be a little slow. I only get around 150MB/s on this.
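A sketch of that variant, reusing the variables from the extraction command (assuming the uncompressed tarball fits in RAM):

```shell
# fetch and decompress once; /dev/shm is RAM-backed, so later reads are cheap
wget -O- "$url" | gunzip > /dev/shm/vmimage.tar

# per VM: extract straight from the uncompressed tar into the target LV,
# skipping the gzip bottleneck on every clone
tar -O -xf /dev/shm/vmimage.tar "${folder}/${vmimage}" | dd of="$lv" bs=1024k
```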

I do remember it needs to flatten out the sparse image while storing to LVM, but I’m not sure if / how that influences the performance.


(Of course none of this would be necessary if the OSS community hadn’t tried to ignore / block / destroy standards like OVF as much as they could. Instead OVF is complex, useless and unsupported. Here we are.)

Xen Powermanagement

Hi all,

this is a very hot week and the sun is coming down hard on my flat. Yet I’m not outside having fun: work has invaded this Sunday.

I ran into a problem: I need to run some more loaded VMs, but it’s going to be hotter than usual and I don’t wanna turn into a piece of barbecue. The only thing I could do was turn my Xen host’s powersaving features up to the max.

Of course I had to write a new article on power management in the more current Xen versions from that… 🙂

Find it here: Xen Power management – for current Xen.

When I saved it I found I also have an older one (which I wasn’t aware of anymore) that covers the Xen 3.4 era.

Xen full powersaving mode – for Xen 3.x




Did you know those settings only take a mouse click in VMWare?

Part three: Storage migration, disaster recovery and friends

All posts:

What I had not expected was how hard it would be to decide on an actual solution.

Picking a Hypervisor

For a lab I would need:

  • nested virt
  • high performance
  • low overhead, also to keep power draw and heat down
  • easy cloning of vms and labs
  • flexible networking
  • easy scripting
  • wide storage options and easy migration
  • thin provisioning of some kind


Knowing all the products and their drawbacks, it turned into a constant back-and-forth between the different hypervisors and ecosystems.



VMWare always sneaked back in due to feature reliability and performance consistency, and then got kicked back out for the lack of many features, like an API and storage migration, without a full vCenter install.

I knew it would deliver good (600-900MB-ish) performance under any circumstance, whereas Xen, for example, can be all over the place, from 150 to 1900MB/s…

Another downside was that under VMWare my SolarFlare 5122 will definitely never expose the 256 VNICs. And I’d like to have ’em.

Installing MegaCli in ESXi is also a bit annoying.

On the pro side there’s the Cisco Nexus1000V and many other similar *gasp* appliances.

And, the perfect emulation. No “half” disk drivers. no cheapass BIOS.

In the end, I like to have my stuff licensed and to use the full power of a VMWare setup I’d need to go with vCenter + Enterprise Lic. No fun.



Just LOL.

While XenServer has great features for VM Cloning it’s just not my cup of tea. Too much very bad python code. Too many windows-user cludges. Broken networking all over.

Any expectation of storage flexibility would be in vain, needing backporting and recompiling software against the dom0 kernel using their SDK. Definitely not an easy solution if you wanna be able to flip between iSCSI, Infiniband, md and whatever else *looks* interesting. This should be a lab after all, and I don’t see any chance of running something like the Storwize VSA in this. You’d need nested ESXi for that, and that’s not on the roadmap for XenServer. If anything still is.

It would probably work best for SolarFlare. I’ll admit that.



This is what will run in many VMs, but I don’t wanna break layering, so my underlying hypervisor and solution should not be the same as in the VMs. I am not yet sure if it’s the right decision.

This would be the prime KVM choice since they already deliver a well-tuned configuration.

What worries me is that, while MooseFS’ FUSE client scales well enough on a single hypervisor node, it would end up with a lot of additional context switching / thrashing if I use it on the main node and in the clients. There might be smarter ways around this, e.g. by having a fat global pool in the “layer1” hypervisor and using that from the above layers, too. More probably it’d turn into a large disaster 🙂



Pointless: with no hypervisor, one single kernel instance can’t successfully pretend to be a bunch of OSDs and clients 🙂


Plain Xen:

This is what I already have and went with, especially to make use of tmem and run the Ceph labs as paravirt domUs. This way I know nothing will get in the way performance wise.

There’s one thing you should know though, comparing Xen vs. ESXi or a licensed VMWare:

Xen’s powermanagement is brokenbrokenbroken:

  • Newer deep-idle CPU states are all unsupported
  • The utility to manage CPU power management is broken as well. Since 4.3 nothing works any more.
  • Even if you free + shutdown a core from dom0 it’ll not be put to sleep
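For reference, the utility in question is xenpm; these are its standard subcommands, whether they do anything useful on your build is another matter:

```shell
# what the hypervisor thinks it can do, per core
xenpm get-cpufreq-para
xenpm get-cpuidle-states

# request the powersave governor on all CPUs
xenpm set-scaling-governor powersave

# cap C-states if a deep one misbehaves
xenpm set-max-cstate 1
```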

You can definitely tell from the power draw and fan speed that Xen, even idle, consumes more power than an idle Linux kernel would. Spinning up a PV domU has no impact; spinning up an HVM one brings a noticeable increase in fan whoosh.

ESXi is far better integrated, so I’m expecting something like 100 Euro (personal, unfunded opinion) per year of additional energy wasted compared to VMWare.

My choice for Xen mostly comes down to:

  • the bleeding edge features like tmem
  • the really crazy stuff like vTPM and whatever of the cool features ain’t broken at any given time.
  • leverage any storage trick I want and have available in a (thanks to Alpine Linux) very recent Linux kernel
  • put in place ZFS, maybe in a dedicated driver domain
  • also be able to use MooseFS and last, but most interesting
  • all the things that never work on other hypervisors – CPU hotplug, dynamic ram changes…
  • storage domUs!!!!!


I think in a lab running 20-30 loaded VMs it will be crucial to optimize the memory subsystem.

Same goes for having the least possible CPU overhead, under load this will help.

Last, being able to use different storage techs concurrently means I can choose different levels of availability and speed – albeit not _having to_, since there’s a large SSD monster underneath it.

I’m also quite sure the disks will switch from Raid10 to Raid5. They just won’t see any random IO any more.

The “Raid5 is not OK” Disclaimer

Oh, and yes, just to mention it: I’m aware I’m running green drives behind a controller. I know about Raid5 rebuild times (actually, they’re much lower on HW raid, about 30% of software raid’s) and the thing is…

If I see disk dropouts (yet to be seen), I’ll replace the dumb thing ASAP. It makes me cringe to read about people considering this a raid controller issue. If the damn disk can’t read a block for so long that the controller drops it out… Then I’m glad I have that controller and it did the right thing.

Such (block-errored) disks are nice as media in secondary NAS storage or as doorstops, but not for a raid. Maybe I just got extremely lucky in having no media errors at all from them? Definitely not what you’d see in a dedicated server at a mass hoster.

I’ve also patched my Check_MK SMART plugin to track the SMART stats from the raid PDisks, so anything SMART notices I’ll immediately be aware of. Why the green disks in the first place? Well, the power and noise benefits are huge. If I had some more space I’d consider a Raid6 of 8 of them, but not until I move to a bigger place.


Coming up next:

A colleague offered me some company when setting up a final storage layout.

We built a dedicated storage domU with a PCI-passthrough’ed MegaRaid controller and ZFS. The install had a little issue…

This is what the next posts will be about, one describing how to build a storage domU.

Also, what pitfalls to expect, and then a focus on losing data (sigh) and getting it back.

I’ll close with some lessons learned. 🙂

Trip to OpenNebulaconf

This year also saw the first ever OpenNebula conference. I was there for a really short time only, since I was coming from the open source backup conference in Cologne.

Let me say it was a harder, longer trip than I could handle; two conferences in two days is already bad, but if you also need to prepare stuff it gets rough.

So, how was it?

Getting there: the (almost endless) ride

So: an almost sleepless ride to Cologne and then another pretty long one to Berlin, a short nap, and every free minute spent on the lab (the server failed the final test reboot like 2 hours before my 3am train departed…). A disaster, but at least people started being less rude (running into you, etc.) the closer I got to Berlin.

At some point there was a nice young consultant woman sitting next to me who *also* fought sleep while she frantically worked on some papers. Couldn’t help smiling.

By the time I arrived I had like 37 hours of work/talks/travel versus 3 hours of sleep. You bet I *love* the beds at my favourite Berlin hotel (Park Inn Alexanderplatz) when I arrive after a ride like that.

I’m in the wrong place and there’s a nazi for breakfast.

The next day started out bad: the hotel was *called*, but not located at, Alexanderplatz. Not fun, considering I had put down a lot of money to get a room at Alexanderplatz, had planned to save some time by being close to the venue, and got a kinda weird cab driver to take me to the other place. Being completely exhausted even in the morning, I really didn’t care to hear about the lower share of foreign population in East Berlin due to the non-exchangeable nature of the GDR mark.

Last to go

Having arrived, I found the conference reception desk, and apparently I was totally the last person to arrive: the guy at the desk immediately knew who I was. I browsed around a little, immediately caught sight of the super cool inovex OpenNebula lab (acrylic casing, 8 i5 nodes), then had some coffee and settled on the sofa.


I tried to get my “personal IT” working so I could drop a message to Carlo Daffara, who only had little time left till his flight, and at some point I realized the impatient guy around the corner was him, waiting. 🙂

With that sorted, we spent almost two hours chatting, and I was surprised at some of the stuff they’re doing at CloudWeavers. It doesn’t easily happen that you meet anyone up for a discussion of IO queue/latency/bandwidth issues. Like, no one. Less than that if you’re talking about CEOs. Now there he is, and he’s even got real solutions in the works that no one has ever worked on as methodically. And stuff like this is all just a little side quest for CloudWeavers. I’m amazed.

Lunch break? Slides!

So far I had seen no talks but at least got to watch the amazing lightning talks – once they’re online, watch all of them.

I tried to make my slides more useful, fixed bugs in my new opennebula nagios checks and, well, generally panicked.

Then it was time for the talk, and I tried to do well. 🙂

Slides suck!

Next time I’ll stick with 5 slides and just tell people what I think – I don’t need that bullshit PowerPoint to get people interested, so why bother.

I think I managed to make some minds *click* on the idea of monitoring the large scope of an infrastructure instead of just details. One of the key points was to monitor free capacity instead of usage. In a cloud env I think this is a must-have.

I didn’t get the time to add a single tiny BI rule for my setup, so I skipped most of the business intelligence part.

One sad/funny point was that I went on forever about fully dynamic configuration, but missed the main point:

This will be a downloadable selfconfiguring monitoring appliance you can get via the marketplace.

I just didn’t remember to say it.

The reception was good anyway and I hope I helped some of the people – not to mention that it was really hard to talk in front of so many of them! I’m still surprised when someone comes to me and says he liked the talk. Some day I’ll stop worrying.

0.25 days of conference left

I watched a few more talks, and it was hard to decide which ones to attend – for example I went to hear about rOCCI, which was very much worth it, but missed the talk from the BBC. I’m so looking forward to the recordings.

After that talk, there was another break and then the conference ended with a very short speech from the OpenNebula guys. Many people including me just kept sitting, still eager for more talks. Seems there’s room for a 3-day conference if the topic is that interesting 🙂

What else…

I think it was great that there were multiple companies behind hosting the conference; it seemed to open up discussions a little. I was surprised that the NetWays team really held back marketing-wise, which is far different from what I heard from (non-MK 🙂) visitors of other (mostly monitoring) conferences they have a role in. They did an incredibly good job at organizing stuff. It’s hard to describe – I’m used to the utter chaos of CCC and similar conferences, and what NetWays put up is the exact opposite. Everything worked. Everyone I talked to was happy with how smooth the conference went. Really, great work.

After the conference I had some sleep and then went for drinks with the opennebula guys. Sitting outside after a few burgers I had the second “unfun” event of the day when some old unhappy man started to insult, attack and shove around random people of the group. My first thought was just “yeah right, that’s what we get for being in Mitte instead of Kreuzberg”.

Since I was the only German, I tried to tell him to stop acting like a 12-year-old idiot, but with little success. After some time he finally left. I think this guy was actually just full of self-hate and wanted someone to hit him. Very weird.

How did I make it back?

After this unasked-for interruption we moved on a few corners and went to CCCP bar, which was still mostly a tourist place, but a lot more the Berlin I’m used to. Good drinks, a lot of OpenNebula and other chat, and a nice bartender(ess) made it very hard to leave.

At 3 or 4 I still somehow started walking back to my hotel. I have no clue how I actually got there.

The next day I got a lot more sleep and instead of getting drunk again I was already adding some more bugfixes to the KVM checks 🙂

I’ll miss the OpenNebula team though – they’re extremely interesting and nice people.

Final words

I missed some of the best talks, plus the hacking sessions, plus the get-together. Next year I shall not make that mistake!

Soon I’ll also do a writeup about the technical bits of the monitoring thingy.

Quickie: VirtualBox Xen Conversion

Converting a copy of a VirtualBox image for transfer to a Xen host.
This is just a brain dump, but if you run into this scenario it’ll be enough for you to work with.

I didn’t see a way to have VBoxManage write to stdout.

Clonehd and the UUID

floh@egal:/scratch/convert$ VBoxManage clonehd opennebula-3.8-sandbox.vdi\
 opennebula-3.8-sandbox.img -format RAW
VBoxManage: error: Cannot register the hard disk 
{fd6db816-whatever} because a hard disk '/scratch/opennebula-3.8-sandbox.vdi'
with UUID {fd6db816-whatever} already exists

floh@egal:/scratch/convert$ VBoxManage internalcommands \
sethduuid opennebula-3.8-sandbox.vdi 
UUID changed to: d3faa73c-blah

floh@egal:/scratch/convert$ VBoxManage clonehd opennebula-3.8-sandbox.vdi \
opennebula-3.8-sandbox.img -format RAW
Clone hard disk created in format 'RAW'. UUID: b58d3fc1-blah

Conversion is only IO-speed dependent; I did it on a single laptop HDD instead of an SSD or BBU’ed raid. If you’re working with more than 10GB, use something fast.

Create LV on Remote host

[root@localhost ~]# lvcreate -L 10240 -n lvonefe vgxen_raid10
  Logical volume "lvonefe" created


Transfer speed is always limited by SSH, unless you have Alpine Linux with HPN-SSH 🙂 You can also switch SSH ciphers for bulk transfer. My “home box” doesn’t use Alpine Linux yet; I switch to blowfish.

floh@egal:/scratch/convert$ dd if=opennebula-3.8-sandbox.img bs=1024k | 
ssh -c blowfish root@ "dd of=/dev/vgxen_raid10/lvonefe bs=1024k"
root@'s password:

Still only got a sad 36MB/s.
Retrying the same with compression enabled (-C) yielded no improvement.

SSH? NFS ftw.

Citrix XenServer and XAPI

Let me give them a chance to express their coding abilities in their very own words, which I found while working on a filter that suppresses “chatty” log output from XenServer.

[2012…|debug|xs1|14262 inet-RPC|dispatch:logout D:…|dispatcher] Unknown rpc “logout”

Yes, this piece of shit doesn’t even understand its own RPCs.

Can you find the DDOS?

This kept me a little busy on Friday night: a long-running DDOS hammering at my server, specifically at the VPS subnet, not caring whether the IPs were even allocated.

I reported it to my ISP almost immediately, but didn’t get an answer so far.

At some point I figured this (I guess some few hundred kpps) was just beyond what I could fix on my own, and that this, after all, had not been my weekend plan.

I throttled all traffic to somewhere around 2KB/s and went off to buy Batman Arkham City instead.

This is a weekly RRD that averages the numbers down, but makes for a better comparison. The small spikes are daily backups, a few GB give or take. On the long green one you’ll see how traffic went down after throttling, and you can see it took a full day till the attack finally wore out.

When I looked there was about 5MB/s of incoming SYNs with all kinds of funny options, and around 5MB/s of useless ICMP replies from my box. Gotta love comparing this to FreeBSD boxes, which simply auto-throttle such an attack right…

Lessons learned:

  • Syncookies are not optional, you WANT them enabled.
  • Your kernel will reply to anything it feels responsible for; that’s why I had to deal with the many MBs of ICMP replies for the unallocated IPs under attack.
  • Nullrouting unused IPs was the most helpful thing I did.
  • Throttling was the second most helpful, just next time it needs to be a lot more specific.
  • IPTables & tc syntax is a complete nightmare when compared to any router OS. I wonder what they took before designing their options. Every single thing it can do is twisted until it’s definitely non-straightforward.
  • Methodically working on shapers and drop rules was the wrong thing to do! Either have them prepared and ready to enable, or skip it and look at more powerful means right away. If someone is throwing nukes at you, then don’t spend the last minute setting up your air defences. 🙂
  • Enabling the kernel ARP filter might be the right thing to suppress unwanted responses – or it might break VM networking.
  • The check_mk/multisite idea of running quite a few distributed monitoring systems is great. Even when I lost livestatus connectivity to the system it still DID do the monitoring, so once I had reasonable bandwidth again all the recorded data was there to look at.
  • IMO this is much more crucial with IDS logs. It’s very rare, but there are cases where a big nasty DDOS is just used to hide the real attack.
  • It feels like a smart move to plan for real routers on the network. Of course, that has certain disadvantages on the “OPEX” side of things. I got the routers, but rack units are not free :/
  • If you see a sudden traffic spike and spend hours trying to find a software bug or a hacked system, you might be looking at a DDOS probe. Look at this, recorded roughly two weeks earlier: I noticed it because I already had quite well-tuned traffic monitoring, using the ISP’s standard tools. Even then my gut had been telling me this was someone probing the target’s performance etc. prior to a real attack.
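The first few bullets translate into a couple of one-liners worth keeping ready (the subnet below is an example from the TEST-NET range, use your own unallocated ranges):

```shell
# syncookies: not optional under a SYN flood
sysctl -w net.ipv4.tcp_syncookies=1

# null-route the unallocated IPs being hammered; the kernel then drops the
# traffic silently instead of generating ICMP replies for it
ip route add blackhole 192.0.2.0/28
```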

And, finally: I guess I now lost more sleep to playing Batman, even forgot I wanted to go to a party on Saturday. Those damn sidequests 🙂

xenheap settings

I had been looking for a bugfix in the RHEL5.7 Xen patches’ source and suddenly found the defaults of the xenheap, and that there is a command line option for it. Back in Xen 2 this was still set at compile time, and I had been trying to find out what the heap size is these days. I wanted to make sure this is not reported as “free” in xm info, so that my Xen plugin will not report that memory as available.

Googling for xenheap_megabytes led to a Citrix KB entry that even included a formula for the heap size:

• The “xenheap_megabytes=24” change is a workaround to a known issue with the version 5.5 release and its updates that would otherwise cause an artificial ceiling to be reached even though the host should be capable of starting more VMs. You would see error messages such as the following appearing in /var/log/xensource.log when trying to start a VM:
Xc.Error(“creating domain failed: hypercall 36 fail: 12: Cannot allocate memory (ret -1)”) even though the host has enough memory to start the VM.

• The default xenheap size is 16 megabytes. Changing it to 24 megabytes was done in this particular environment to start that number of VMs. In general, setting xenheap_megabytes to (12 + max-vms/10) is probably a good rule to use.
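Plugging numbers into that rule: a host meant for up to 100 VMs would get

```shell
# xenheap_megabytes = 12 + max_vms / 10
max_vms=100
echo "xenheap_megabytes=$(( 12 + max_vms / 10 ))"   # prints xenheap_megabytes=22
```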

Popcorn thread

A bug report about the Xen timer issues that were mostly Ubuntu-related.

It’s been opened in 2007, has multiple loops of “are you sure this is still broken?” / “yes it still is” / “for me too”, and of course not a single developer working on it, aside from informational messages like

“The Ubuntu kernel team is no longer assigned bugs as per decision of the Ubuntu kernel team” and the glorious “this bug was filed against a now unsupported release and will now be closed as wontfix”. As a cherry on top, the latter came right after someone verified the issue still existed in 2011.

A special prize for readers who notice the posting of the guy who actually traced down the problem and went unheard.