Xen Power Management


Hi all,

It is a very hot week and the sun is beating down hard on my flat. Yet I'm not outside having fun: work has invaded this Sunday.

I ran into a problem: I need to run some more loaded VMs, but it's going to be hotter than usual and I don't want to turn into a piece of barbecue. The only thing I could do was crank my Xen host's power-saving features to the max.

Of course that meant I had to write a new article on power management in the more current Xen versions… :)

Find it here: Xen Power management - for current Xen.

When I saved it, I found I also have an older one (which I wasn't even aware of anymore) that covers the Xen 3.4 era.

Xen full powersaving mode - for Xen 3.x
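For the impatient, the gist of it comes down to a few xenpm calls. This is only a hedged sketch; governor names and supported C-states depend on your hardware and Xen version, and the hypervisor wants cpufreq=xen on its command line. The articles have the details and caveats:

# show what the CPUs currently offer
xenpm get-cpufreq-para
# clamp all cores to the powersave governor and allow deeper C-states
xenpm set-scaling-governor powersave
xenpm set-max-cstate 3
# sample idle/frequency statistics for 10 seconds to see if it took effect
xenpm start 10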


Trivia:
Did you know those settings only take a mouse click in VMWare?

Check_MK support for Allnet 3481v2


A friend of mine has this thermometer and asked me to look into monitoring and setup.

I don't think I've ever put as much work into monitoring such a tiny device. Last evening (and most of the night) I stabbed at it some more and finally completed the setup and documentation. I literally went to bed at 5am because of this tiny sensor.

To save others from this (and to make sure I have reliable documentation for it…), I've turned the pretty tricky setup into a wiki article. Along the way I even found it still runs an old OpenSSL.

You can check it out here:

http://confluence.wartungsfenster.de/display/Adminspace/Monitoring+Allnet+3418v2

The Bitbucket version isn't committed yet; I hope to do that in a moment… :p
One interesting hurdle: I couldn't build a Check_MK package (using mkp list / mkp pack) since I also needed to include things from local/lib and similar folders. When I visit the MK guys again I'll nag them about this.
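For anyone who just wants a quick-and-dirty hook into Check_MK rather than the full setup from the wiki: a local check is the fastest route. This is only an illustrative sketch – the URL, XML tag and threshold below are all made up, the real endpoint and parsing are in the article:

#!/bin/sh
# hypothetical Check_MK local check for a network temperature sensor
TEMP=$(curl -s http://allnet.example.lan/xml | sed -n 's:.*<t1>\([0-9.]*\)</t1>.*:\1:p')
if [ -z "$TEMP" ]; then
    echo "3 Allnet_Temperature - UNKNOWN - could not read sensor"
elif awk -v t="$TEMP" 'BEGIN { exit !(t > 30) }'; then
    echo "1 Allnet_Temperature temp=$TEMP WARN - temperature is $TEMP C"
else
    echo "0 Allnet_Temperature temp=$TEMP OK - temperature is $TEMP C"
fi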


They have really pretty meters in their UI by the way.

I'd hope something like it makes it to the NagVis exchange some day.

Edit note: I initially wrote it has an "affected OpenSSL". It seems they built it back in 2012 without heartbeat support, which is a nice and caring thing to do.
It's still goddamn outdated.

Friday special: screenrc for automatic IRSSI start


Just wanted to share a little snippet.

This is my SSH+Screen config for my IRC box:

  • If I connect from any private system, I’ll get my irc window.
  • If it rebooted or something, the screen command will automatically re-create an IRSSI session for me.
  • If I detach the screen, I'm automatically logged out.
~$ cat .ssh/authorized_keys
command="screen -d -RR -S irc -U" ssh-[ key removed] me@mypc

The authorized_keys entry enforces running only _this_ command, and the screen options force-detach, re-attach or create a UTF-8 screen session named "irc".

~$ cat .screenrc 
startup_message off
screen -t irssi 1 irssi

The screenrc does the next step by auto-running irssi in window 1, with the title set accordingly.
(And it turns off the moronic GPL notice.)
Irssi itself is configured to autoconnect to the right networks and channels, of course. (To be honest: the irssi config is something I don't like to touch more than once every 2-3 years.)

On the clients I also have an alias in /etc/hosts for it, so if I type "ssh irc" I'm right back on IRC. Every time, immediately.

 

This is the tiny little piece of perfect world I was able to create, so I thought I’d share it.

FreeBSD periodic mails vs. monitoring


I love FreeBSD! Taking over a not-so-small infrastructure of around 75 FreeBSD servers was something I wouldn't have wanted to pass up.

The problematic bit is that I only do consulting, not pure ops. But there wasn't much of an ops team left…

Where they used to put around 10 man-days per week into the care and feeding of FreeBSD plus some actual development, I'm now trying to do it in one day. And I still want it to be a well-run, albeit slower, ship.

One of the biggest hurdles was the sheer volume of email.

Adding up Zabbix alerts (70% of which concerned nothing), the FreeBSD periodic mails, cron output and similar reporting, I would see weeks with 1500+ mails, or well into the higher thousands if there were any actual issues. Each week. Just imagine what it looked like when I didn't visit my customer for 3 weeks…

Many of those mails have no point at all once you're running more than -base:

The most typical example would be failed SSH logins. All those servers run software to block attackers and even feed that info back to a central authority, which logs it there. So why in hell would I want a mail about malicious SSH connects?

Would you like a mail that tells you no hardware device has failed, today?

  • And another one every day until 2032?
  • From all servers?

This makes no sense.

Same goes for the mails that tell me about necessary system updates.

What I've done so far falls into these 3 areas:

1. Periodics:

Turn off as many of the periodic mails as possible (i.e. anything that can be seen by other means). I tried to be careful about it, but that didn't get me very far. My periodic.conf looks like this now:

freebsd periodic.conf
I found that turning off certain things like the "security mail" also disables the portaudit DB updates. But I just changed my portaudit call to include the download. Somehow I had assumed that *update* would be separate from *report*.
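Not my literal periodic.conf (that's the file referenced above), but a hedged sketch of the kind of knobs involved – all of them standard variables documented in /etc/defaults/periodic.conf:

# /etc/periodic.conf - silence what the monitoring covers anyway
daily_show_success="NO"                  # don't mail about things that worked
daily_show_info="NO"
weekly_show_success="NO"
monthly_show_success="NO"
security_show_success="NO"
security_status_loginfail_enable="NO"    # SSH bruteforce noise is handled elsewhere
daily_status_include_submit_mailq="NO"
daily_output="/var/log/daily.log"        # keep the rest in a logfile instead of mail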

2. Fix issues:

Apply fixes for any bugs that really are bugs. At least where I can figure out how to fix them. More often than not I hit a wall somewhere between the NIH config management and bad Perl code.

3. Monitor harder, but also smarter:

Put in better monitoring, write custom plugins for anything I need (OpenSSH keys, Sendmail queues, OS updates) and set thresholds either to a baseline value for "normal" systems or to values derived from peak loads for "busy" systems.

Some of the checks are to be found at my bitbucket, and honestly, I’m still constantly working on them.

https://bitbucket.org/darkfader/nagios/src/cc233b93c106166a5494d7488c38880df0a5946b/check_mk/freebsd_updates/?at=default

The checked-in version might change quite often. E.g. I now think it won't hurt to have a stronger separation between reporting on OS issues and reporting on ports issues. And maybe a check that tells me whether a system still needs a reboot.

The most current area now is automating the updates.

I'm taming the VMWare platform and using some pysphere code to create VM snapshots on the fly. So there's an Ansible playbook that pulls updates. It then checks whether there is a mismatch between the version reported by uname -a and the "tag" file from freebsd-update. If so, it triggers a VM snapshot and does the install / reboot.
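The version comparison boils down to something like this (a hedged sketch using freebsd-version on 10.x rather than parsing the tag file directly, which is what the playbook actually does):

#!/bin/sh
# is the booted kernel older than what freebsd-update has installed on disk?
running=$(uname -r)              # patch level of the running kernel
ondisk=$(freebsd-version -k)     # patch level of the kernel installed on disk
if [ "$running" != "$ondisk" ]; then
    echo "reboot pending: running $running, installed $ondisk"
    exit 1
fi
echo "up to date at $running"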

Another piece of monitoring does a grep -R -e "^<<<<<" -e ">>>>>" /etc and thereby alerts me of unmerged files.
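Wrapped into a Check_MK local check, that grep looks roughly like this (the service name is made up):

#!/bin/sh
# alert on leftover merge conflict markers in /etc
count=$(grep -R -l -e '^<<<<<' -e '^>>>>>' /etc 2>/dev/null | wc -l | tr -d ' ')
if [ "$count" -gt 0 ]; then
    echo "2 etc_unmerged files=$count CRIT - $count file(s) in /etc contain merge markers"
else
    echo "0 etc_unmerged files=0 OK - no unmerged files in /etc"
fi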

I try to make do with tiny little pieces and have everything be dual-use (agriculture and weapons, you know) technology that gives me both status reporting and status improvement.

I've started a howto about the specifics of the monitoring setup, see
FreeBSD Monitoring at my adminspace wiki.

Ansible FreeBSD update fun…


Using Ansible to make my time at the laundry place more interesting…

 

[me@admin ~/playbooks]$ ansible-playbook -i hosts freebsd-updates.yml

PLAY [patchnow:&managed:&redacted-domain:!cluster-pri] *************

GATHERING FACTS ****************************************************
ok: [portal.dmz.redacted-domain.de]
ok: [carbon.dmz.redacted-domain.de]
ok: [irma-dev.redacted-domain-management.de]
ok: [lead.redacted-domain-intern.de]
ok: [polonium.redacted-domain-management.de]
ok: [silver.redacted-domain-management.de]
ok: [irma2.redacted-domain-management.de]
ok: [inoxml-89.redacted-domain-management.de]

TASK: [Apply updates] **********************************************
changed: [inoxml-89.redacted-domain-management.de]
changed: [carbon.dmz.redacted-domain.de]
changed: [portal.dmz.redacted-domain.de]
changed: [irma-dev.redacted-domain-management.de]
changed: [lead.redacted-domain-intern.de]
changed: [polonium.redacted-domain-management.de]
changed: [silver.redacted-domain-management.de]
changed: [irma2.redacted-domain-management.de]
 finished on lead.redacted-domain-intern.de
 finished on portal.dmz.redacted-domain.de
 finished on silver.redacted-domain-management.de
 finished on inoxml-89.redacted-domain-management.de
 finished on carbon.dmz.redacted-domain.de
 finished on polonium.redacted-domain-management.de
 finished on irma-dev.redacted-domain-management.de
 finished on irma2.redacted-domain-management.de

TASK: [Reboot] ****************************************************
changed: [carbon.dmz.redacted-domain.de]
changed: [portal.dmz.redacted-domain.de]
changed: [inoxml-89.redacted-domain-management.de]
changed: [irma-dev.redacted-domain-management.de]
changed: [lead.redacted-domain-intern.de]
changed: [polonium.redacted-domain-management.de]
changed: [silver.redacted-domain-management.de]
changed: [irma2.redacted-domain-management.de]

TASK: [wait for ssh to come back up] *******************************
ok: [portal.dmz.redacted-domain.de]
ok: [irma-dev.redacted-domain-management.de]

I now use a "patchnow" group as a decision gate because, *surprise*, I don't want to snapshot and patch all systems at once.
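The "Apply updates" task itself is nothing magical; per host it boils down to roughly this (hedged, the exact flags in my playbook may differ):

# non-interactive fetch + install, the way a playbook task would run it
env PAGER=cat freebsd-update --not-running-from-cron fetch install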

It's quite annoying that the most fundamental admin decisions are always the trickiest to put into automation systems (written by devs). Also, I'll need to kick my own ass since the playbook didn't trigger the snapshots anyway!

For the long-term solution I think I'll first define a proper policy based on factors like these:

  • How mature the installed OS version & patches are (the more mature, the lower the risk of patching)
  • How exposed the system is
  • The number of users affected by the downtime
  • The time needed for recovery

What factors do you look at?

Not enough desks to sufficiently bang your head on.


I’m not convinced. I have some LSI cards in several of my boxes and they
very consistently drop disks that are running SMART tests in the background.
I have yet to find a firmware or driver version that corrects this.

 

This is the kind of people I wish I never had to deal with.

(If it's not obvious: if your disks drop out while running SMART tests, look at the disks. There are millions of disks that handle this without issue. If yours drop out, they're the ones having a problem, even if you think it doesn't matter, or even if it only shows up with that controller. It doesn't matter. Stuff it up.

I’m utterly sick of dealing with “admins” like this.)

Part four: Storage migration, disaster recovery and friends


This article was also published a little too early… :)

 

A colleague (actually, my team lead) and I set out to build a new, FreeBSD-based storage domU.

 

The steps we did:

Updated Raid Firmware

Re-flashing my M5015 RAID controller to more current, non-IBM firmware. We primarily hoped this would enable the SSDs' write cache. That didn't work. It was a little easier than expected since I had already done parts of the procedure before.

Your most important command for this is "AdpAllInfo".
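For reference, that looks roughly like this (hedged; the output labels vary a bit between MegaCli versions):

# dump adapter info - the firmware lines tell you whether the re-flash took
MegaCli -AdpAllInfo -aALL | grep -i -e 'FW Package' -e 'FW Version'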

 

Created Raid Luns

We then created a large bunch of Raid10 luns over 4 of the SSDs.

  • 32GB for the storage domU OS
  • 512MB for testing a controller-ram buffered SLOG
  • 16GB ZIL
  • 16GB L2ARC
  • 600-odd GB of "rest"
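Creating such a LUN with MegaCli looks roughly like this (a hedged example: the enclosure:slot IDs are placeholders – check yours with "MegaCli -PDList -aALL" – and I believe -sz takes the size in MB):

# RAID10 LD over four SSDs (two spans of two), capped at 32GB so the rest
# of the array can be carved into the other LUNs
MegaCli -CfgSpanAdd -r10 -Array0[252:0,252:1] -Array1[252:2,252:3] -sz32768 -a0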

Configure PCI passthrough in Xen

There were a few hiccups; the kernel command line just wouldn't take effect, nor did using modprobe.d and /etc/modules do the job on their own.

This is what we actually changed…

First, we obtained the right PCI ID using lspci (apk add pciutils)

daveh0003:~# lspci | grep -i lsi

01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 03)

in /etc/modules:

xen_pciback

in /etc/modprobe.d/blacklist added:

blacklist megaraid_sas

in /etc/modprobe.d/xen-pciback.conf

options xen-pciback hide=(0000:01:00.0)

in /etc/update-extlinux.conf

default_kernel_opts="modprobe.blacklist=megaraid_sas quiet" – we had also tried

#default_kernel_opts="xen-pciback.hide='(01:00.0)' quiet"

(By the way, not escaping the parentheses can make the busybox/OpenRC init crash!)

And last but not least, I gave up, annoyed, and put some stuff in /etc/rc.local:

echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind

echo 0000:01:00.0 > /sys/bus/pci/drivers/pciback/new_slot

echo 0000:01:00.0 > /sys/bus/pci/drivers/pciback/bind

(And even this doesn't work without me calling it manually. It will take many more hours to get this to a state where it just works. If you ever wonder where the price of VMWare is justified… every-fucking-where.)
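A quick, hedged sanity check for whether the hide actually worked:

# the device should show up as assignable to guests...
xl pci-assignable-list
# ...and dom0 should report pciback, not megaraid_sas, as the driver in use
lspci -k -s 01:00.0 | grep 'driver in use'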

FreeBSD storage domU

The storage domU is a pretty default install of FreeBSD 10 onto a 32GB LUN on the raid.

During the install DHCP did not work ($colleague had also run into this issue), so we just used a static IP… While the VM is called "freesd3", I also added a CNAME "stor" for easier access.

The zpools are:

  • zroot (the VM itself)
  • zdata (SSD-only)
  • zdata2 (Disk fronted by SSD SLOG and L2ARC)

I turned on ZFS compression on most of those, per dataset, e.g.:

zfs set compression=lz4 zroot/var
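A quick way to double-check afterwards which datasets got it and what it actually saves:

# show compression setting and achieved ratio for all datasets in the pool
zfs get -r compression,compressratio zroot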

VMs can later access this using iSCSI or as a Xen block device (we’ll get to that later!)

Now, for the actual problem: during installation, the device IDs had shifted. On FreeBSD this is highly uncommon to see, and you *really* want to consider it a Linux-only issue. Well, not true.

Install

We selected "mfid0", which should have been the 32GB OS LUN…

This is what MegaCli shows:

<<<megaraid_ldisks>>>
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Size                : 32.0 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 1 (Target Id: 1)
Size                : 3.182 TB
Sector Size         : 512
State               : Optimal
Strip Size          : 64 KB
Number Of Drives per span:2
Virtual Drive: 2 (Target Id: 2)
Size                : 512.0 MB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 3 (Target Id: 3)
Size                : 16.0 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 4 (Target Id: 4)
Size                : 64.0 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 5 (Target Id: 5)
Size                : 630.695 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2

Note that the logical drive ID and Target:Lun match up just fine!


The OS side:

Please compare to what FreeBSD’s mfi driver assigns…

mfid0: 32768MB (67108864 sectors) RAID volume (no label) is optimal
mfid1: 512MB (1048576 sectors) RAID volume (no label) is optimal
mfid2: 16384MB (33554432 sectors) RAID volume (no label) is optimal
mfid3: 65536MB (134217728 sectors) RAID volume (no label) is optimal
mfid4: 645832MB (1322663936 sectors) RAID volume (no label) is optimal
mfid5: 3337472MB (6835142656 sectors) RAID volume 'raid10data' is optimal

At install time it was cute enough to *drum roll* assign the 3.x TB LUN as mfid0. So we installed FreeBSD 10 onto the LUN that stores my VMs.

That, of course, killed the LVM headers and a few gigabytes of data.
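A trivial, hedged sanity check from a live system before picking an install target (not something we thought of at the time, obviously) would have been to match devices to LUNs by size:

# print the size of every mfid device so it can be matched against the MegaCli list
for d in /dev/mfid[0-9]*; do
    echo "== $d"
    diskinfo -v "$d" | grep 'mediasize'
done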

 

My next post will skip over reinstalling onto the right LUN (identified from a live CD system) and instead describe how I went about getting the data back.

 

Heartbleed, or why the Internet isn't made with "Windows"…


Just talked with a chat acquaintance about his company's customer advisory on Heartbleed…
20:17 <darkfader> hah and then there's the ssl thing
20:17 <darkfader> the funny part is
20:18 <darkfader> you know, that's 7 days ago now
20:18 <darkfader> 6 days ago we unix people were more or less done with patching
20:18 <darkfader> and the audit before that
20:18 <darkfader> and so on
20:18 <darkfader> and the windows world is right now putting together summaries of the affected products

 

Reminds me a lot of a recent discussion about the patching needs of a Unix environment:

"If it's so stable, why do you have to update it so often?"

From my explanation, here's a little diagram:

Part three: Storage migration, disaster recovery and friends


All posts:

What I had not expected was how hard it would be to decide on an actual solution.

 Picking a Hypervisor

For a lab I would need:

  • nested virt
  • high performance
  • low overhead, not least because of power consumption etc.
  • easy cloning of vms and labs
  • flexible networking
  • easy scripting
  • wide storage options and easy migration
  • thin provisioning of some kind

 

Knowing all the products and their drawbacks, it turned into a constant back-and-forth between the different hypervisors and ecosystems.

 

VMWare:

VMWare always sneaked back in thanks to feature reliability and performance consistency, and then got kicked back out for the lack of features like the API and storage migration without a full vCenter install.

I knew it would deliver good (600-900MB/s-ish) performance under any circumstances, whereas e.g. Xen can be all over the place, anywhere from 150 to 1900MB/s…

Another downside was that under VMWare my SolarFlare 5122 would definitely never expose its 256 vNICs. And I'd like to have them.

Installing MegaCli in ESXi is also a bit annoying.

On the pro side there’s the Cisco Nexus1000V and many other similar *gasp* appliances.

And, the perfect emulation. No "half" disk drivers, no cheapass BIOS.

In the end, I like to have my stuff licensed, and to use the full power of a VMWare setup I'd need to go with vCenter + an Enterprise license. No fun.

 

XenServer:

Just LOL.

While XenServer has great features for VM cloning, it's just not my cup of tea. Too much very bad Python code. Too many Windows-user kludges. Broken networking all over.

Any expectation of storage flexibility would be in vain; it means backporting and recompiling software against the dom0 kernel using their SDK. Definitely not an easy solution if you want to be able to flip between iSCSI, Infiniband, md and whatever else *looks* interesting. This should be a lab after all, and I don't see any chance of running something like the Storwize VSA on it. You'd need nested ESXi for that, and that's not on the roadmap for XenServer. If anything still is.

It would probably work best for the SolarFlare, I'll admit that.

 

CloudWeavers:

This is what will run in many of the VMs, but I don't want to break layering, so the underlying hypervisor and solution should not be the same as the one inside the VMs. I'm not yet sure if that's the right decision.

This would be the prime KVM choice since they already deliver a well-tuned configuration.

What worries me is that, while MooseFS' FUSE client scales well enough on a single hypervisor node, it would end up with a lot of additional context switching / thrashing if I used it on the main node and in the clients. There might be smarter ways around this, e.g. having a fat global pool in the "layer1" hypervisor and using that from the layers above, too. More probably it'd turn into a large disaster :)

 

LXC:

Pointless; it's not a hypervisor, and one single kernel instance can't successfully pretend to be a bunch of OSDs and clients :)

 

Plain Xen:

This is what I already have and what I went with, especially to make use of tmem and run the Ceph labs as paravirtualized domUs. This way I know nothing will get in the way performance-wise.

There's one thing you should know though, comparing Xen vs. ESXi or a licensed VMWare:

Xen’s powermanagement is brokenbrokenbroken:

  • Newer deep-idle CPU states are all unsupported
  • The utility to manage CPU power management is broken as well; since 4.3 nothing works any more.
  • Even if you free up and shut down a core from dom0, it will not be put to sleep

You can definitely tell from the power draw and fan speed that Xen, even idle, consumes more power than an idle Linux kernel would. Spinning up a PV domU has no impact; spinning up an HVM one brings a noticeable increase in fan whoosh.

ESXi is far better integrated, so I'm expecting something like 100 Euro (personal unfunded opinion) per year of additional energy wasted compared to VMWare.

My choice for Xen is mostly

  • the bleeding edge features like tmem
  • the really crazy stuff like vTPM and whatever of the cool features ain’t broken at any given time.
  • leverage any storage trick I want and have available in a (thanks to Alpine Linux) very recent Linux kernel
  • put in place ZFS, maybe in a dedicated driver domain
  • also be able to use MooseFS and last, but most interesting
  • all the things that never work on other hypervisors – CPU hotplug, dynamic ram changes…
  • storage domUs!!!!!

 

I think in a lab running 20-30 loaded VMs it will be crucial to optimize the memory subsystem.

Same goes for having the least possible CPU overhead; under load this will help.

Last, being able to use different storage techs concurrently means I can choose different levels of availability and speed – albeit without _having to_, since there's a large SSD monster underneath it all.

I’m also quite sure the disks will switch from Raid10 to Raid5. They just won’t see any random IO any more.

The “Raid5 is not OK” Disclaimer

Oh, and yes, just to mention it: I'm aware I'm running green drives behind a controller. I know about Raid5 rebuild times (actually, they're much lower on HW raid, about 30% of software raid) and the thing is…

If I see disk dropouts (none so far), I'll replace the dumb thing ASAP. It makes me cringe to read about people considering this a raid controller issue. If the damn disk can't read a block for so long that the controller drops it out… then I'm glad I have that controller and that it did the right thing.

Such (block-errored) disks are fine as media in secondary NAS storage or as doorstops, but not in a raid. Maybe I just got extremely lucky in having no media errors at all from them? Definitely not what you'd see in a dedicated server at a mass hoster.

I've also patched my Check_MK SMART plugin to track the SMART stats from the raid PDisks, so I'll immediately be aware of anything SMART notices. Why the green disks in the first place? Well – the power and noise benefits are huge. If I had some more space I'd consider a Raid6 of 8 of them, but not until I move to a bigger place.
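The underlying call for disks hidden behind a MegaRAID controller looks roughly like this (hedged: Linux syntax, and the device-id 4 is a placeholder – list the physical disks with "MegaCli -PDList -aALL" first; on FreeBSD the device node differs):

# query SMART attributes of physical disk 4 behind the RAID controller
smartctl -a -d megaraid,4 /dev/sda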

 

Coming up next:

A colleague offered me some company when setting up a final storage layout.

We built a dedicated storage domU with a PCI-passthrough'ed MegaRaid controller and ZFS. The install had a little issue…

This is what the next posts will be about, one describing how to build a storage domU.

Also, what pitfalls to expect, and then a focus on losing data (sigh) and getting it back.

I’ll close with some lessons learned. :)

Part two: Storage migration, disaster recovery and friends


All posts:

 Go and find me a new host. Keep some money for food.

So, in March and April I set out to build a *home* server that could handle a Ceph lab and would behave mostly like real hardware. That equates to disks being slow, SSDs being fast, and RAM being, well, actual RAM. Writing to two disks should ideally also not immediately turn into an IO blender just because they reside on one (uncached) spindle.

I think overall I spent 30 hours on eBay and in shops to find good hardware for a cheap price.

 

This is what I gathered:

  • Xeon 2680V2 CPU (some ES model) with 8 instead of 10 cores but same 25MB of cache. It’s also overclockable, should I ever not resist that
  • Supermicro  X9SRL-F mainboard. There are better models with SAS and i350 NICs but I wanted to be a little more price-conservative there
  • 8x8GB DDR3 Ram which I recycled from other servers
  • 5x Hitachi SSD400M SSDs – serious business, enterprise SAS SSDs.
  • The old LSI 9260 controller
  • The old WD green disks

The other SSD option had been the Samsung SM843T, but that seller didn't want to give out a receipt. I'm really happy I opted for "legit" and ended up with a better deal just a week later:

The Hitachis are like the big brother of the Intel DC S3700 SSDs we all love. I had been looking for those on the cheap for about half a year and then got lucky. At 400GB capacity each, it meant I could make good use of VM cloning etc. and generally never look back at moving VMs from one pool to another for space.

 

I had (and still have) a lot of trouble with the power supply. Those Intel CPUs draw very little power at idle, even in the first stage of the boot. So the PSU, while on the Intel HCL, would actually turn off after half a second when only very few components were installed. A hell of a bug to understand, since you normally remove components to trace issues.

Why did I do that? Oh, because the Supermicro IPMI gave errors on some memory module. Which was OK, just not fully supported. Supermicro is just too cheap to have good IPMI code.

Meh.

Some benchmarking using 4(!) of the SSDs was done, and the results were incredible.

Using my LSI tuning script I was able to hit sustained 1.8GB/s writes and sustained 2.2GB/s reads.

After some more thinking I decided to check out Raid5, which (thanks to the controller using parity to calculate every 4th? block) still gave 1.8GB/s reads and 1.2GB/s writes.

Completely crazy performance.

To get the full Raid5 speed I had to turn on Adaptive Read Ahead. Otherwise it was around 500MB/s, i.e. a single SSD's read speed.
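For reference, the relevant MegaCli knobs look roughly like this (a hedged sketch, not my actual tuning script; -LAll/-aAll hit every logical drive on every adapter):

MegaCli -LDSetProp ADRA -LAll -aAll           # adaptive read ahead
MegaCli -LDSetProp WB -LAll -aAll             # write-back (wants a healthy BBU)
MegaCli -LDSetProp Cached -LAll -aAll         # route I/O through the controller cache
MegaCli -LDSetProp -EnDskCache -LAll -aAll    # ask it to enable the disks' own write cache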

One problem that stuck around was that the controller would not, and still will not, enable the SSDs' write cache, no matter what you tell it!

This is a huge issue considering each of those SSDs has 512MB(ytes) of well-protected cache.

The SSD is on LSI's HCL for this very controller, so this is a bit of a bugger. I'll get back to this in a later post since by now I *have* found something fishy in the controller's output that might be the cause.

Nonetheless: Especially in a raid5 scenario this will have a lot of impact on write latency and IOPS.

Oh, generally: this SSD model and latency? not a large concern :)