No-copy extracting Xen VM tarballs to LVM


SUSE Studio delivers Xen VM images, which is really nice. They contain a sparse disk image and a (mostly incomplete) VM config file. Since I’m updating them pretty often, I needed a hack that avoids any unneeded copies and needs no scratch space, either.

Goal: save copy time and improve quality of life, instead of copying and waiting…

First, let’s have a look at the contents, and then check out how to extract them directly…

(Oh. Great. Shitbuntu won’t let me paste here)

 

Well, great.

In my case the disk image is called:

SLES_11_SP3_JeOS_Rudder_client.x86_64-0.0.6.raw

It’s located in a folder named:

SLES_11_SP3_JeOS_Rudder_client-0.0.6/

 

So, what we can do is this:

First, set up some variables so we can shrink the command later on…

version=0.0.6
appliance=SLES_11_SP3_JeOS_Rudder_client
url=https://susestudio.com/...6_64-${version}.xen.tar.gz
folder=${appliance}-${version}
vmimage=${appliance}.x86_64-${version}.raw
lv=/dev/vgssdraid5/lvrudderc1

Then, tie it together to store our VM data.

wget -O- $url | tar -O -xzf - ${folder}/${vmimage} | dd of=$lv bs=1024k

Storing to a file at the same time:

wget -O- $url | tee /dev/shm/myfile.tar.gz | tar -O -xzf - ${folder}/${vmimage} |\
dd of=$lv bs=1024k

 

Wget will fetch the file and write it to STDOUT; tar will read STDIN, extract only the image file and write the extracted data to STDOUT, which is then buffered and written to the LV by dd.

 

If you’ll reuse the image for multiple VMs like me, you can also write it to /dev/shm and, if RAM allows, gunzip it there. The gzip extraction is actually what limits performance, and even tar itself seems to be a little slow; I only get around 150MB/s out of this.
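
As a minimal sketch of that reuse idea, assuming the variables from above and enough free RAM for /dev/shm (the tar filename is made up):

# gunzip once into RAM...
wget -O- $url | gunzip > /dev/shm/${appliance}.tar
# ...then re-extract for each additional VM without downloading or unpacking again
tar -O -xf /dev/shm/${appliance}.tar ${folder}/${vmimage} | dd of=$lv bs=1024k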

I do remember it needs to flatten out the sparse image while storing to LVM, but I’m not sure if / how that influences the performance.

 

(Of course none of this would be necessary if the OSS community hadn’t tried to ignore / block / destroy standards like OVF as much as they could. Instead OVF is complex, useless and unsupported. Here we are.)

Check_MK support for Allnet 3481v2


A friend of mine has this thermometer and asked me to look into monitoring and setup for it.

I don’t think I ever put as much work into monitoring such a tiny device. Yesterday evening, and well into the night, I stabbed at it some more and finally completed the setup and documentation. I literally went to bed at 5am because of this tiny sensor.

To save others from this (and to make sure I have reliable documentation for it…), I’ve made a wiki article out of the pretty tricky setup. Along the way I even found it still runs an old OpenSSL.

You can check it out here:

http://confluence.wartungsfenster.de/display/Adminspace/Monitoring+Allnet+3418v2

The Bitbucket version isn’t committed yet; I hope to do that in a moment… :p
One interesting hurdle: I couldn’t build a Check_MK package (using mkp list / mkp pack) since I also needed to include things from local/lib and similar folders. When I visit the MK guys again I’ll nag about this.

 

 

They have really pretty meters in their UI by the way.

I hope something like them makes it to the NagVis exchange some day.

edit note: I initially wrote it has an “affected OpenSSL”. It seems they had built it back in 2012 without heartbeat, which is a nice and caring thing to do.
It’s still goddamn outdated.

LVM Mirroring #2


Hmm, people still look at my ages-old post about LVM all the time.

So, just a note from end-2013:

The mirror consistency stuff is not your worst nightmare anymore.

Barriers work these days, and I think it’s more important to concentrate on ext4 settings like “block_validity”. The chance of losing data due to an LVM mirror issue is much lower than the chance of unnoticed data loss in ext4 🙂

My LVM pain points, as of today, would be:

  • lvm.conf is a huge patchwork of added features; there should be an LVM maintainer who oversees the structure as features are added. Instead it’s like a castle with a lot of wooden gangways (mirrorlog devices) and stairs (thin provisioning) bolted onto the outside and no windows (read up on the “fsck” utility for thin pools, and TRY what happens if a pool runs full and recover from it).

  • Some features require planning ahead, and the way it works now does not support that.

  • Reporting is still as bad as it used to be.

  • I’d be happy for someone to show me how they split out a snapshot + PV to a backup host, bring it back AND get a fast resync. (Note, the PV UUID wouldn’t change in this. So if it doesn’t work, it hints at design flaws.)

Those pieces I worry about. And really, the way the project is adding features without specs, documentation and (imho) oversight makes it look like some caricature of a volume manager.

How I feel about that:

Just look, I have added the feature the others were talking about.

And look at THIS: I now have an arm on my ass so I can scratch between my shoulders, too!

Example: LVM2 did away with the per-LV header that classic LVM had, so you don’t have a resource area to debug with, and there’s no BBR or per-LV mirror write consistency via the LV header. But instead they added an -optional- feature that wipes the start of an LV. So if you lose your config and manually rebuild an LV on the exact same sectors, but with a newer LVM config, it’ll wipe out the first meg of the LV.

A volume manager that introduces what amounts to an LV format change after the fact, and makes it WIPE DISK BLOCKS. I don’t care how smart you think you are: whoever came up with this should get the old Mitnick jail sentence: forbidden to use a computer.
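
For what it’s worth, if you ever have to re-create an LV on top of existing data, newer LVM at least lets you switch that wiping off explicitly – a sketch with made-up names (and these flags only exist on versions that do the wiping in the first place):

# -Z n: don't zero the start of the new LV, -W n: don't wipe old signatures
lvcreate -L 10G -n lvrestored -Z n -W n vg00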

The bad layering of PV/LV/VG I also still care about.

Storage management in the sense I’m used to is something I still don’t touch with LVM.

On the other hand I’m itching daily to actually make prod use of those exact modern features 🙂

Basically I just use it to carve out volumes, and instead of pvmove I might use more modern, powerful tools like blocksync/lvmsync and work with new VGs.

Also, just to be clear:

I’m not saying “don’t use LVM” – I have it on almost any system and hate those without it. I’m just saying it’s not delivering the quality for consistently successful datacenter usage. If you set up your laptop with a 1TB root volume and no LVM and then have some disk-related issue, well, I’ll probably laugh and get you some Schnaps.

That being said, I wanted to write about more modern software that is actually fun to look at, next post 🙂

OMD port to Debian/kFreeBSD


FreeBSD turned 20 yesterday – I have a little present!

You’re now able to run – with a little detour – a really modern monitoring environment (OMD) on FreeBSD.

Downloading:

Let’s start with the download link:

Wartungsfenster-OMD: omd-1.00_0.wheezy_kfreebsd-amd64.deb

How I got there

I found there’s a howto for running Debian/kFreeBSD on FreeBSD as a JAIL! See the howto here: kFreeBSD Jails

So the idea I came up with was:

Debian is a supported distro, so let’s port OMD to run in that jail!

I built a jail mostly via the howto – try it too!
Things that I changed:
The command to launch jails in the howto has a “--” in it – I removed that, or the jail wouldn’t launch.

My actual script (since I found nothing about running Linux jails from rc.conf):

#!/bin/sh

setup()
{
    # give the jail its IP and mount the Linux compat filesystems into it
    ifconfig xn0 alias 192.168.10.66/32
    mount -t linprocfs linprocfs /srv/jail/debjail/proc
    mount -t linsysfs  linsysfs  /srv/jail/debjail/sys
    mount -t tmpfs     tmpfs     /srv/jail/debjail/run
    mount -t devfs     devfs     /srv/jail/debjail/dev
}

launch()
{
    jail -J /var/run/jail/debjail.jid -c jid=66     \
        name=debjail path=/srv/jail/debjail          \
        host.hostname=debjail ip4.addr=192.168.10.66 \
        command=/bin/sh -c "/etc/init.d/rc S && /etc/init.d/rc 2"
    # needed for building deb packages
    jail -m jid=66 allow.sysvipc=1
    # needed for ping
    jail -m jid=66 allow.raw_sockets=1
}

# only do anything if the jail isn't already running
if ! jls -j debjail >/dev/null 2>&1; then
    setup
    launch
fi

Starting the jail!

Run like this, the jail should launch. You can enter it using jexec 66 /bin/bash. Once inside, add OpenSSH so you can get in from the outside. You’ll also need to configure authentication, namely root access.
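
Roughly what that looks like, as a sketch (assuming the jid 66 from the script above and Debian’s usual package names; wheezy’s default sshd config should still allow root logins with a password, lock that down later if you care):

jexec 66 /bin/bash                                # enter the jail from the FreeBSD host
apt-get update && apt-get install openssh-server
passwd root                                       # set a root password so you can actually log in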

To be able to use su inside the jail, make sure you didn’t put it on a filesystem that’s mounted nosuid (I do that by default, …)

Just a normal server, let’s build!

(if you just wanna run OMD, you can of course skip this part 🙂
Once SSH and su worked, I ssh’d into the jail, checked out the OMD sources and started building: first configure, then make && make pack, and as the last stage make deb.
I had to remove most of the perl modules due to incompletely specified dependencies. Due to that I also had to remove Thruk, Gearman and disable RRDtool’s perl support. Sorry, but after a handful of perl-related issues I got really sick of it. Fix your shit. Further weirdness was that the Check_MK tarball from git was corrupt; I replaced it with the upstream one.
(The make deb also needed sysvipc support enabled in the jail.)
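
Condensed, the build inside the jail looks roughly like this (run in the OMD source checkout; this is an outline of the steps named above, not a tested recipe):

./configure
make && make pack
make deb          # needs the jail's allow.sysvipc=1 (see the script above)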

The resulting package is what you found above at the download link.

OMD install

I could install it using gdebi like on a standard distro.

apt-get install gdebi && gdebi omd-1.00_0.wheezy_kfreebsd-amd64.deb

When creating a site (omd create sitename) I found I had to turn off the TMPFS setting

omd config sitename set TMPFS off

and then I could launch the site.
My setup won’t need TMPFS performance, so I didn’t do anything here. It could be fixable, I’m not sure.

After restarting the main Apache,

service apache2 restart

the Check_MK GUI greeted me with errors because it tried to read a UUID from /proc, but that’s not yet possible on FreeBSD (PR opened: kern/183615).

I made a tiny change to the Check_MK Multisite sources for this. The UUID is used for identifying sessions, so it’s _quite_ important that this works.

To fix it, change a little piece of weblib.py for your site (or actually, in /omd/versions/default):
/omd/sites/freemon/version/share/check_mk/web/htdocs/weblib.py

at line 117 and below:

    116 # Generates a selection id or uses the given one
    117 def selection_id():
    118     if not html.has_var('selection'):
    119         #sel_id = file('/proc/sys/kernel/random/uuid').read().strip()
    120         import commands
    121         sel_id = commands.getoutput("uuid")

(so, instead of reading from /proc, run “uuid”)

For this to work you also need to install uuid using

apt-get install uuid

After this I also needed to enable raw socket support for the jail so pinging hosts would work (already handled in the debjail script above).

If you use IPv6, don’t forget to switch to the Icinga monitoring core using

omd config sitename set CORE icinga

Start it using

omd start sitename

and hit the web interface at http://jailname/sitename – user: omdadmin pass: omd

root@debjail:/omd# omd status
Doing 'status' on site freemon:
mkeventd:       running
apache:         running
rrdcached:      running
npcd:           running
icinga:         running
crontab:        running
-----------------------
Overall state:  running

Not a full FreeBSD port, but a bullshit Linux system locked into a jail – still, this gets the job done.

Try it – and that’s it!

Happy birthday Beastie!

Zyxel NSA-325 NAS / Media Server


I just went shopping and, besides some proper going-anywhere shoes from New Rock, finally bought a home NAS box. The model is a Zyxel NSA-325, which is only a little less powerful than e.g. a Synology DS213+ or DS413, but costs only a third or a quarter, respectively, of what those do.
Being me, it’s not entirely unreasonable that I’ll want to run more than one of whichever NAS I get.

The downside of the Zyxel NAS is its average web interface and limited OS.
At work I chatted with $customers about this and found I actually just want something that lets me edit /etc/exports and maybe run targetcli; a GUI definitely isn’t high-ranking with me.

It seems these can run Arch Linux, but there are some hiccups. After my underground ride home, I’m quite sure I won’t “update” to Arch as a priority.

Why no ArchLinux?

Now, it’s like that:

The NAS will be used to store backups; I’m quite sure I can hack a Bacula Storage Daemon (SD) package for its native format.
The Bacula Director will be running on my “home cluster” that spans the Nexus 7 tablet and the Raspberry Pi.
By running the SD on the NAS I’ll make sure that only metadata traffic is directed to those tiny powerless cluster nodes.
By taking the backups off my home server I’ll stop wasting RAID10 space on it for storing *backups*.

While I was at it, I also grabbed two 2TB Caviar green disks.

So this is the plan:


Put a 1.5TB random disk into the NAS. (Not the greens, you’ll see why.)
Swap the two new 2TB greens in for the two 1.5TB ones in my server, so it’ll finally have only 2TB disks.
Online resize, tadaaaa, up to 4TB usable space. Taking out 600GB of backup storage means I’m looking at 1.6TB of free RAID10 space. 🙂

While at it, add a 2.5″ drive cage to the server and two Samsung 530s.
Replace the current PERC 5/i RAID controller in the server with one that has SAS 6G/SATA3 ports (an IBM M5015 off eBay).
This will also allow for better placement and ventilation in the server, since the PERC 5 didn’t really fit in my case.
Since I don’t have the “Performance Accelerator Key” (fucking $400 hardware dongle) for the controller, I can’t use LSI CacheCade and need to settle for something OS-based.

That means I shall upgrade the Xen host from the OracleVM 2 it currently runs to Alpine Linux.

Then I can use Flashcache (or possibly bcache, but I don’t like it) to enable read-caching via the SSDs.
And since it’s just read-caching, there’s nothing bad about running them as a RAID0 for *cough* added performance. After that, I doubt I’ll ever remember this box is backed by cheap and silent, but foremost *cool* WD green drives.
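
A rough sketch of how that could look (all device and volume names here are made up; -p thru is writethrough, so the SSD raid0 only ever serves reads back and can die without data loss):

# the two SSDs as a raid0 cache device
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
# put a flashcache mapping on top of the slow backing volume
flashcache_create -p thru fc_vms /dev/md0 /dev/vg_data/lv_vms
# VMs then use /dev/mapper/fc_vms instead of the backing volume directly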

Then, a not so fun step, some pisses-me-off-already chroot building so I can use the goddamned MegaCLI to monitor the raid controller on AlpineLinux.

Finally, put the two now-free 1.5TB WD greens in the NAS box.
I plan to also put the Cobbler distro mirror on it, plus those ISOs that are easily obtained.

End result:

  1. Replace disks and controller, add SSDs => A lot more performance and space in the home server.
  2. Able to fully use TMEM on the home server => more free RAM, thus longer lifecycle for the server.
  3. Dedicated storage for backups and Cobbler => *all* infrastructure can run on the few-watts-only Raspi/Nexus cluster.
  4. A shitload of storage migrations, HW replacements etc.  => I totally don’t want to bother with replacing the OS on the NAS.

Helpful change for OMD


The Open Monitoring Distribution (OMD) allows you to have multiple “sites”, each consisting of configurable elements: a Nagios (or Icinga, Shinken, Check_MK Microcore) instance, an Apache webserver and other tools.

Each site can be started/stopped individually, allowing you to take them offline for maintenance or have them in a cluster for failover.

The main Apache on a system uses reverse proxies to let you access the “sites” and has always been able to tell you if a site isn’t started at the moment.

This is done via a 503 ErrorDocument handler in the file “apache-own.conf”. It’s a nice feature but has a huge drawback if you run a kiosk mode browser for showing the monitoring dashboard on a TV or tablet (like me).

Once that page is displayed you’re out. You’ll never see that the site is back up.

I know 3 cases where this commonly becomes an issue:

  • Bootup of Nagios server with local terminal
  • Cluster failovers
  • Apache dies

The second one is the most annoying:

  • You have a GUI displaying valid info.
  • One of the servers has a problem and it triggers a failover.
  • Autorefresh kicks in and you get dropped to the 503 page.
  • The cluster failover finishes.
  • But nothing gets you back in.

Now, the fix is so easy you won’t believe it:

In apache-own.conf of your site, change the following:

from:

<Location /sitename>
ErrorDocument 503 "<h1>OMD: Site Not Started</h1>You need to start this site in order to access the web interface."

to:

<Location /sitename>
ErrorDocument 503 "<META HTTP-EQUIV=\"refresh\" CONTENT=\"30\"><h1>OMD: Site Not Started</h1>You need to start this site in order to access the web interface."

Restart the system apache (/etc/init.d/apache2 restart for most of us) and it’ll work.

file under:

I tried to develop a dev mindset, but found I like it when stuff really works.

About Disk partitioning


So, you always wondered why one would have a dozen logical volumes or filesystems on a server? And what benefit that brings?

Let’s look at this example output from a live system with a database growing out of control:

Filesystem    Size  Used  Avail  Capacity  Mounted on
/dev/da0s1a   495M  357M    98M       78%  /
devfs         1.0k  1.0k     0B      100%  /dev
/dev/da1s1d   193G  175G   2.3G       99%  /home
/dev/da0s1f   495M   28k   456M        0%  /tmp
/dev/da0s1d   7.8G  2.1G   5.1G       29%  /usr
/dev/da0s1e   3.9G  1.0G   2.6G       28%  /var
/dev/da2s1d   290G   64G   202G       24%  /home/server-data/postgresql-backup

I’ll now simply list all the problems that arise from this being mounted as one singular /home. Note, it would be even worse with one large root disk.

  • /home contains my admin home directory. So I cannot disable applications, comment out /home in fstab, reboot and do maintenance. Instead, all maintenance on this filesystem will have to start in single-user mode.
  • /home contains not just the one PostgreSQL database with the obesity issue, it also holds a few MySQL databases for web users. Effect: if it really runs full, it’ll also crash the other databases and all those websites.
  • /home being one thing for all applications, I cannot just stop the database, umount, run fsck and change the root reserve from its default 8% – so there’s a whopping 20GB I cannot _get to_ (see the sketch after this list).
  • /home being one thing also means I can’t easily do a UFS snapshot of just the database. Instead it’ll contain all the data on this box, meaning a higher change volume and less time to magically copy it off.
  • /home being the only big, fat filesystem also means I can’t just do fishy stuff and move some data out (oh, and yes, there’s the backup filesystem. Accept that I can’t use it).
  • PostgreSQL being in /home, I cannot even discern the actual IO coming from it. Well, maybe DTrace could, but all standard tools that use filesystem-level metrics don’t stand a chance.
  • PostgreSQL being in /home instead of its own filesystem *also* means I can’t use a dd from the block device + fsck for the initial sync – instead I’ll have to go file-level using rsync…
  • It also means I can’t just stop the DB, snapshot, start it again and pull the snapshot off to a different PC with a bunch of blazing fast low-quality SSDs for quick analysis.
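
The root reserve point, as a minimal sketch (FreeBSD UFS, device name taken from the df output above; the 2% is just an example) – exactly the kind of thing you can only do if the database has its own filesystem:

umount /home                 # only possible if nothing else lives on this filesystem
fsck -y /dev/da1s1d
tunefs -m 2 /dev/da1s1d      # drop the minfree reserve from the default 8% to 2%
mount /home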

 

I’m sure I missed a few points.

Any of them is going to cause hours and hours of workarounds.

 

OK, this is a FreeBSD box, and one not using GEOM or ZFS – I don’t get many options as it stands. So, even worse for me, this is one stupid bloated filesystem.

 

Word of advice:

IF you ever think about “why should I run LVM on my box”, don’t think about the advantages right now, or the puny overhead of increasing filesystems as you need space. Think about what real storage administration (so, one VG with one LV in it doesn’t count) can do for you if you NEED it.

Simply snapshotting a volume, adding PVs on the fly, attaching mirrors for migrations… this should be in your toolbox, and this should be on your systems.
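
For illustration, the kind of day-to-day operations meant here, as a sketch (all VG/LV/device names are made up):

lvcreate -s -n lvdb_snap -L 10G /dev/vg00/lvdb   # snapshot the database LV before risky work
vgextend vg00 /dev/sdd1                          # add a PV on the fly
lvconvert -m1 /dev/vg00/lvdb /dev/sdd1           # attach a mirror, e.g. for a migration
lvresize -r -L +50G /dev/vg00/lvdb               # grow LV and filesystem in one go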

 

Filesystem reliability and doing what my guts tell me…


On Linux you have a wide range of available filesystems – making a choice is never easy.

I just wanted to summarize what I’ve been telling my class attendees over the last years, what I’ve seen in live setups, and what I’m actually DOING.

  • EXT4 – generally, I DO hate the ext FS family. For me it’s hyped by people who will simply blame your hardware once you’ve lost your data. My rule of thumb is that I’ll use ext4 on recent Linux kernels where the block_validity option is available. Beyond this, I’ll also set the following options (roughly what the fstab sketch after this list shows):
  1. errors=panic – if we have a persistent read/write error or one that causes a journal abort, just ZAP the box.
  2. discard
  3. data=journal or ordered, depending on the importance of the server. It has shown up to 30% impact for me, but it’s a choice you can make.
  4. check interval / max mount count – both to 0. I’d rather trust checksumming, and wouldn’t resist a full fsck once a year.
  5. possibly also make the journal bigger. Ideally you’d be able to use an external journal – I recommend against it because you can never trust devs, and it would not be fun to see your fsck not supporting the external journal.
  6. journal_checksum is a lot more important, but also a work in progress, especially if your kernel still starts with 2.6. Without this option ext doesn’t really notice shit about aborted writes or a corrupted journal. But in some versions it’s also plain default. It’s a mess.
  • XFS – I noticed this is what I actually use if it’s my own system, meaning I have the highest trust in XFS. This is kind of funny since it’s a 1996 filesystem with a focus on performance. So far we’ve stayed friends. If the system is a 2.6 one I’ll definitely go for XFS. XFS has also turned out to be the most stable for the Ceph devs in their benchmarks, so it’s not just my gut, it’s also quite proven where others have indeed failed. For production use on RHEL, there’s the option to get the XFS feature channel and thus run XFS with Red Hat’s support.
  • JFS – JFS is what AIX users know as JFS2, and it’s the most modern of all the production-grade filesystems in my comparison here. It was new and shiny in the early 2000s, I think around 2004. It has proven to be superior in small-file performance, so if hundreds of thousands of files in a directory are part of your use case, JFS is something to look at. The problem is that JFS is badly integrated in most distros. If you find out it’s the best-performing one for you and you *need* it in production, my advice is to get your OS support via IBM and let them deal with it.
  • VxFS – this is what you commonly see used in serious environments that care about integrity and performance. It’s the most scalable and powerful of the lot and has the most features (heh, btrfs, go cry), but it DOES COST MONEY. If you might have use for extra features like split-mirror backups on a different host etc., then it is a good choice and the price is acceptable for what you’re getting.
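
Roughly what the EXT4 item above boils down to, as a sketch (device, mountpoint and options are examples; check which of them your particular kernel actually supports):

# /etc/fstab
/dev/vg00/lvdata  /data  ext4  defaults,errors=panic,discard,data=journal,block_validity,journal_checksum  0 2
# no time- or mount-count-based forced fsck
tune2fs -c 0 -i 0 /dev/vg00/lvdata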

Takeaway –

Old distros like RHEL/CentOS/OEL(*) or Debian: consider XFS.

New distros where you want a somewhat standard setup: consider EXT4, but _with_ the bells and whistles.

ZFS / Btrfs not included on purpose. If you think you can put your data on those already, then that’s fine for you, but not for everyone. (of course I run them for testing… silly)

VxFS – cool for your prod servers. If you are dealing with real data (let’s say a telco’s billing, or other places where they move a Netflix yearly revenue each day) you will most probably end up migrating to VxFS in the long run. So you might as well just start with it…

If it’s my system, my data – I just grab XFS. The main reason to pick something different was usually that there are other people who might need to handle an error and who don’t know anything but fsck.

Running Ceph? I just grab XFS, anything else is too shady – one of many, many similar experiences:

23:42 < someguy> otherguy: yes, I as well with BTRFS. I moved to XFS on dev.
Prod has always been XFS.

If it’s a prod system with data worth $$$$$$? I’d not think about anything but VxFS.

Next on this channel


Instead of a New Year’s resolution* I’ve looked into what things to work on next. Call it milestones, achievements, whatever.

  • I’ve already cleaned up my platform, based on Alpine Linux now.
  • IPv6 switchover is going well but not a prime concern. Much stuff just works and other stuff is heavily broken, so it’s best to not rush into a wall.
  • Bacula: I’ve invested a lot of time into my backup management routine again. This paid off and made clear it was stupid to decide per-system which VMs to back up and which not. If you want reliable backups, just back up everything and be done with it. Side quests still available are splitting catalogs and wondering why there is no real operations manual. (Rename a client? Move a client’s backup catalogs? All this stuff is still at a level I’d call grumpy retarded cleverness: “You can easily do that with a script” – yeah, but how come the prime open-source backup tool doesn’t bring along routine features that are handled elsewhere with ONE keypress (F2 to the rescue)?)
  • cfengine: This will be my big thing over the next 3 weeks, at home and on holiday. Same goal: coming to grips with it really well. Over the last years I’ve tried Puppet, liked but not used Chef, and glanced at Salt. Then I skipped all of them and decided that Ansible is good for the easy stuff, and for the not-easy stuff I want the BIG (F) GUN, aka cfengine.
  • Ganeti & job scheduling: In cleaning up the hosting platform I’ve seen that I’ve missed a whole topic automation-wise: scheduling VM migrations, snapshots etc. ahead of time. A friend is pushing me towards Ganeti and it sure fills a lot of gaps I currently see, but it doesn’t scale up to the larger goal (OpenNebula virtual datacenters). I’ll see if there is a reasonable way to pick pieces out of Ganeti. Still, the automation topic stays unresolved. There is still no powerful OSS scheduler – the existing ones are all aimed at HPC clusters, which is very easy turf compared to enterprise scheduling. So it seems I’ll need to come up with something really basic that does the job.
  • Confluence: That’s an easier topic; I’m working to complete my Confluence foo so that I’ll be able to make my own templates and use it really fast.

What else…. oh well that was quite a bit already 😉

Otherwise I’ve been late (in deciding) for one project that would have been a lovable start, and turned down two others because they were abroad. Being at the start of this new (self-employed) career I’m feeling a reasonable amount of panic. Over the weekends it usually turns into a less reasonable one 😉

But I’m also cheerful about being able to focus on building up my skills, and I also think it was the right decision. I went into the whole Unix line of work by accident but loved it, since “if it doesn’t work you know it’s your fault” – a maybe stale, yet basically bug-less environment where you concentrate on issues that come up in the interactions of large systems instead of bug after bug after bug. (See above at the platform makeover – switching to Alpine Linux has been so much fun for the same reason.)

My website makeover is in progress and I’m happy to know I’ll visually arrive in this decade-2.0 with it soon.


Of course I’ve also run into some bugs in DTC while trying to auto-deploy a WordPress site. DTC is the reason for the last Debian system I’m keeping. Guess who is REALLY at risk of being deprecated now 😉

*not causing doomsday worked for 2012 though

Nexus 7 followup


The Nexus tablet is getting more and more interesting for me as I come up with additional uses.

Today I installed DHCP and Bind on it, and am thinking about Cobbler, too.

The idea is to have the Nexus as the infrastructure core, with the tools to bootstrap the rest of my systems.

Power would come via an active USB hub; if that fails, the Nexus can still power its gadgets until its battery runs out. That way it’s e.g. possible to determine whether the power or all networking failed, and still alert.

And after a disaster crash you’d have an install server that is ready to bring the core systems back up.