No-copy extracting Xen VM tarballs to LVM


SUSE Studio delivers Xen VM images, which is really nice. They contain a sparse disk image and a (mostly incomplete) VM config file. Since I update them pretty often, I needed a hack that avoids any unneeded copies and needs no scratch space, either.

Goal: save copy time and improve quality of life instead of copying and waiting…

First, let’s have a look at the contents and then check out how to extract them directly…

(Oh. Great. Shitbuntu won’t let me paste here)

 

Well, great.

In my case the disk image is called:

SLES_11_SP3_JeOS_Rudder_client.x86_64-0.0.6.raw

It’s located in a folder named:

SLES_11_SP3_JeOS_Rudder_client-0.0.6/

 

So, what we can do is this:

First, set up some variables so we can shrink the command later on…

version=0.0.6
appliance=SLES_11_SP3_JeOS_Rudder_client
url=https://susestudio.com/...6_64-${version}.xen.tar.gz
folder=${appliance}-${version}
vmimage=${appliance}.x86_64-${version}.raw
lv=/dev/vgssdraid5/lvrudderc1

Then, tie it together to store our VM data.

wget -O- $url | tar -O -xzf - ${folder}/${vmimage} | dd of=$lv bs=1024k

Storing to a file at the same time:

wget -O- $url | tee /dev/shm/myfile.tar.gz | tar -O -xzf - ${folder}/${vmimage} |\
dd of=$lv bs=1024k

 

Wget fetches the file and writes it to stdout; tar reads stdin, extracts only the image file and writes the extracted data to stdout, which is then buffered and written to the LV by dd.

 

If you’ll reuse the image for multiple VMs like I do, you can also write it to /dev/shm and, if RAM allows, gunzip it there. The gzip extraction is actually the performance bottleneck, and even tar itself seems to be a little slow; I only get around 150MB/s with this.
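A sketch of that variant, assuming there is enough free RAM in the tmpfs and reusing the variables from above (the tarball name in /dev/shm is made up):

# download once and store the already-decompressed tarball in RAM
wget -O- $url | gunzip > /dev/shm/${appliance}.tar

# every further VM then only pays for tar + dd, not for gzip
tar -O -xf /dev/shm/${appliance}.tar ${folder}/${vmimage} | dd of=$lv bs=1024k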

I do remember it needs to flatten out the sparse image while storing to LVM, but I’m not sure if or how that influences the performance.

 

(Of course none of this would be necessary if the OSS community hadn’t tried to ignore / block / destroy standards like OVF as much as they could. Instead OVF is complex, useless and unsupported. Here we are.)


LVM Mirroring #2


Hmm, people still look at my ages-old post about LVM all the time.

So, just a note from end-2013:

The mirror consistency stuff is not your worst nightmare anymore.

Barriers work these days, and I think it’s more important to concentrate on ext4 settings like “block_validity”. The chance of losing data due to an LVM mirror issue is much lower than the chance of unnoticed data loss in ext4 🙂
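Just as a pointer, block_validity is only a mount option; roughly like this (device and mountpoint are made up):

# remount an existing ext4 filesystem with metadata block checking enabled
mount -o remount,block_validity /dev/vgdata/lvhome /home

# or persistently via /etc/fstab:
# /dev/vgdata/lvhome  /home  ext4  defaults,block_validity  0  2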

My LVM pain points, as of today, would be:

lvm.conf is a huge patchwork of added features; there should be an LVM maintainer who oversees the structure as features are added.

Instead it’s like a castle with a lot of wooden gangways (mirror log devices) and stairs (thin provisioning) bolted onto the outside, and no windows (read up on the “fsck” utility for thin pools, TRY what happens if a pool runs full and recover from it – a throwaway experiment for that is sketched below).
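If you want to see that for yourself, a minimal throwaway experiment could look roughly like this (VG and LV names are made up, sizes deliberately tiny):

# create a deliberately small thin pool and an over-provisioned thin LV in it
lvcreate -L 1G -T vgtest/tpool
lvcreate -V 10G -T vgtest/tpool -n lvthin
mkfs.ext4 /dev/vgtest/lvthin
mount /dev/vgtest/lvthin /mnt

# now write more than 1G of real data and watch how the pool, the filesystem
# and the tools behave -- and then try to recover from it
dd if=/dev/urandom of=/mnt/filler bs=1M count=2048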

Some features require planning ahead, and the way it is now does not support that.

Reporting is still as bad as it used to be.

I’d be happy for someone to show me how they split out a snapshot + PV to a backup host, bring it back AND get a fast resync.

(Note: the PV UUID wouldn’t change in this scenario. So if it doesn’t work, that hints at design flaws.)

Those are the pieces I worry about. And really, the way the project is adding features without specs, documentation and (imho) oversight makes it look like some caricature of a volume manager.

How I feel about that:

Just look, I have added the feature the others were talking about.

And look at THIS: I now have an arm on my ass so I can scratch between my shoulders, too!

Example: LVM2 did away with the per-LV header that classic LVM had, so you don’t have a resource area to debug with, and it doesn’t support BBR or per-LV mirror write consistency via the LV header. Instead they added an -optional- feature that wipes the start of an LV. So if you lose your config and manually rebuild an LV on the exact same sectors, but with newer LVM config, it’ll wipe out the first meg of the LV.

A volume manager that introduces a kind of LV format change after the fact, and makes it WIPE DISK BLOCKS. I don’t care how smart you think you are: whoever came up with this should get the old Mitnick jail sentence: forbidden to use a computer.

The bad layering of PV/LV/VG is also something I still care about.

Storage management in the sense I’m used to is something I still don’t touch with LVM.

On the other hand I’m itching daily to actually make prod use of those exact modern features 🙂

But basically I just use it to carve out volumes; instead of pvmove I might use more modern, powerful tools like blocksync/lvmsync and work with new VGs.

Also, just to be clear:

I’m not saying “don’t use LVM” – I have it on almost every system and hate the ones without it. I’m just saying it’s not delivering the quality needed for consistently successful datacenter usage. If you set up your laptop with a 1TB root volume and no LVM and then have some disk-related issue – well, I’ll probably laugh and get you some Schnaps.

That being said, I wanted to write about more modern software that is actually fun to look at, next post 🙂

About Disk partitioning


So, you always wondered why one would have a dozen logical volumes or filesystems on a server, and how that brings any benefit?

Let’s look at this example output from a live system with a database growing out of control:

Filesystem   Size  Used  Avail  Capacity  Mounted on
/dev/da0s1a  495M  357M    98M       78%  /
devfs        1.0k  1.0k     0B      100%  /dev
/dev/da1s1d  193G  175G   2.3G       99%  /home
/dev/da0s1f  495M   28k   456M        0%  /tmp
/dev/da0s1d  7.8G  2.1G   5.1G       29%  /usr
/dev/da0s1e  3.9G  1.0G   2.6G       28%  /var
/dev/da2s1d  290G   64G   202G       24%  /home/server-data/postgresql-backup

I’ll now simply list all the problems that arise from this being mounted as one single /home. Note, it would only be worse with one large root disk.

  • /home contains my admin home directory. So I cannot disable applications, comment out /home in fstab, reboot and do maintenance. Instead all maintenance on this filesystem will need to start in single-user mode.
  • /home contains not just the one PostgreSQL database with the obesity issues, it also holds a few MySQL databases for web users. Effect: if it really runs full, it’ll also crash the other databases and all those websites.
  • /home being one thing for all applications, I cannot just stop the database, umount, run fsck and change the root reserve from its default 8% – so there’s a whopping 20GB I cannot _get to_ (see the tunefs sketch after this list).
  • /home being one thing also means I can’t do a UFS snapshot of just the database with ease. Instead the snapshot would contain all the data on this box, meaning a higher change volume and less time to magically copy it off.
  • /home being the only big, fat filesystem also means I can’t just do fishy stuff and move some data out (oh and yes, there’s the backup filesystem – just accept that I can’t use it).
  • PostgreSQL being in /home, I cannot even discern the actual IO coming from it. Well, maybe DTrace could, but all standard tools that use filesystem-level metrics don’t stand a chance.
  • PostgreSQL being in /home instead of its own filesystem *also* means I can’t use a dd from the block device + fsck for the initial sync – instead I’ll have to run file-level using rsync…
  • It also means I can’t just stop the DB, snapshot, start it again and pull the snapshot off to a different PC with a bunch of blazing fast low-quality SSDs for quick analysis.
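For contrast, a rough sketch (FreeBSD UFS; device name and rc script name are made up) of what the root-reserve / fsck maintenance from the list above would look like if PostgreSQL had its own filesystem:

# stop only the database, not every application on the box
service postgresql stop

umount /home/pgsql
fsck -y /dev/da3s1d          # check just this one filesystem
tunefs -m 2 /dev/da3s1d      # drop the root reserve from 8% to 2% and get the space back
mount /home/pgsql
service postgresql start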

 

I’m sure I missed a few points.

Any of them is going to cause hours and hours of workarounds.

 

Ok, this is a FreeBSD box, and one not using GEOM or ZFS – I don’t get many options as it stands. So, even worse for me, this is one stupid bloated filesystem.

 

Word of advice:

IF you ever think “why should I run LVM on my box”, don’t think about the advantages right now, or the puny overhead of growing filesystems as you need space. Think about what real storage administration (so, one VG with one LV in it doesn’t count) does for you when you NEED it.

Simply snapshotting a volume, adding PVs on-the-fly, attaching mirrors for migrations… this should be in your toolbox, and this should be on your systems.
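As a reminder, that toolbox is only a handful of commands (names made up, just a sketch):

# snapshot a volume before risky maintenance
lvcreate -s -L 5G -n lvdata_snap /dev/vgdata/lvdata

# grow the VG on the fly with another PV
pvcreate /dev/sdd1
vgextend vgdata /dev/sdd1

# attach a mirror leg for an online migration, then drop the old leg
lvconvert -m1 vgdata/lvdata /dev/sdd1
lvconvert -m0 vgdata/lvdata /dev/sdc1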

 

Next Post


running pvscan…:

    Walking through all physical volumes
        /dev/ramdisk: Skipping (regex)
        /dev/loop0: Skipping (sysfs)
        /dev/sda: Skipping (regex)
        Opened /dev/md0 RO
      /dev/md0: size is 5860543488 sectors
        Closed /dev/md0
      /dev/md0: size is 5860543488 sectors
        Opened /dev/md0 RW O_DIRECT
        /dev/md0: block size is 4096 bytes
        Closed /dev/md0
        Using /dev/md0
        Opened /dev/md0 RW O_DIRECT
        /dev/md0: block size is 4096 bytes
      /dev/md0: No label detected
        Closed /dev/md0

The RAID is OK.
The VG is on that disk.
I already deleted my LVM cache and disabled writing it.

Let me say I am SO sick of this…
The Perc5 battery is fully charged; going to move off the LVM/MD setup ASAP.

Just need to find the data first.

Update1:
Found out that the LVM cache file was both the cause and the last straw. It was still around when the system called lvm.static during boot (which read it). Via the cache file the system found the volume groups, and then, once the system was up, it didn’t find them any more.
But since I had disabled write_cache_state (albeit after that one file had been written), it never deleted the old cache.
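For anyone hitting the same thing, the two relevant knobs on my box were roughly these (paths may differ per distro):

# in /etc/lvm/lvm.conf, devices section: stop LVM from writing the device cache at all
#     write_cache_state = 0
# ...and actually remove the stale cache file that lvm.static may still read during boot
rm -f /etc/lvm/cache/.cache /etc/lvm/.cache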

Getting closer to the read issue now, but I guess it’s time for a short excursion, making a work copy of the whole server :>

The LVM metadata is either incomplete, or “has a little offset” on the disk md0. I can see it using dd, but I’m not yet sure if it’s OK.
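To peek at the on-disk metadata without any LVM tooling in the way, something like this does the job (the text metadata normally lives within the first megabyte of the PV; the count here is a generous guess):

# dump the start of the device and look for the ASCII metadata (VG name, seqno, pv/lv sections)
dd if=/dev/md0 bs=512 count=2048 2>/dev/null | strings | less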

The other thing I found out is that the Perc5i will totally overheat in my current server case.

And the earth is a disc


I just managed to trigger a scenario that lets me take you on a ride through the layers of LVM and see which ones have issues. Usually I find it quite hard to pinpoint examples of where this thing misbehaves :>

What happened:

I have a second hard disk in my ThinkPad’s multibay slot.
On that disk is a volume group “vg01”, which holds 3 LVs that I normally do not mount or use.
I removed that hard disk earlier today after booting, because of the noise.
I didn’t think much about the VG since nothing was mounted anyway, so I didn’t do a vgchange or vgexport, leaving it activated.
Some 6 hours later I re-inserted the disk to mount something on it.

I waited till it spun up and then, well, somewhat naively assumed I would just have to activate the VG.
This (probably) isn’t working out since the disk now has a different device handle & name in the kernel.
Of course it still has the same PVID, and a simple pvscan would (probably) fix this. That is *not* the point now.

The point is that

this should never work:

root@klappstuhl:~# vgchange -a y vg01
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  3 logical volume(s) in volume group "vg01" now active

The correct thing to happen here would be:

– vgchange should be unable to change the VG’s activation state, since the code MUST abort an activation to the same or a higher level if no physical copy of the VGDA is visible.
– the implicit lvchange should fail. The LVs should be in UNAVAILABLE state. Their PV is missing, their VG has no quorum (and cannot have one, with 1 PV), their state never was “AVAILABLE R/W” or “ACTIVE” in this cycle… and not a single one of their LEs is in fact accessible. When no LE is backed by an available PE then it MUST NOT be activatable. It had probably been activated because I had the disk plugged in on boot, but:
– a vgchange -a y MUST re-run the VG activation procedure and then fail, report the error up through its layers and put things into the right (lower) availability state, while not dropping down to a new (same/higher/lower) activation state – since the operation in itself FAILED.

Instead what we see is: TRY ACTIVATION TO SAME/HIGHER LEVEL -> FAIL AND ABORT -> RE-RUN ACTIVATION TO SAME LEVEL.

Let’s verify what LVM thinks it is seeing:

root@klappstuhl:~# lvdisplay /dev/vg01/lvisos 
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  --- Logical volume ---
  LV Name                /dev/vg01/lvisos
  VG Name                vg01
  LV UUID                MucSwf-2reM-x9LV-cxHz-89Gk-oeTI-E5OdBT
  LV Write Access        read/write
  LV Status              available
  # open                 0

“No problem” here at all. Earth is a disc and the center of the universe.

Let’s give it a reality check.

root@klappstuhl:~# fsck.jfs /dev/vg01/lvisos
fsck.jfs version 1.1.12, 24-Aug-2007
processing started: 12/5/2011 17.50.30
Using default parameter: -p
The current device is:  /dev/vg01/lvisos
ujfs_rw_diskblocks: read 0 of 4096 bytes at offset 32768
ujfs_rw_diskblocks: read 0 of 4096 bytes at offset 61440
Superblock is corrupt and cannot be repaired 
since both primary and secondary copies are corrupt.  

 CANNOT CONTINUE.

So the block layer(s) had to deny the read requests since the block device was unavailable. Surely LVM has now flagged the LV as unavailable and stale?

root@klappstuhl:~# lvdisplay /dev/vg01/lvisos 
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  --- Logical volume ---
  LV Name                /dev/vg01/lvisos
  VG Name                vg01
  LV UUID                MucSwf-2reM-x9LV-cxHz-89Gk-oeTI-E5OdBT
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                32.00 GiB
  Current LE             8192
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           251:9

Note:
Nothing. Still thinks everything is 100% perfect.

This can be caused by two things:

1. The idiotic activation state chain we saw is overwriting the LV activation state when it re-runs & falls back to the “old OK” state.
2. Setting states might not be working at the lower layers AND the propagation of states is generally not understood. That means they can change states per component in a layer (VG, PV, LV), but there is no internal relation between them (PV > VG > LV). Yes, the config has “segments” and all that. That doesn’t mean this information is correctly USED.

If I only had one guess about the issue, I would guess it’s caused by something tied to the use of devmapper.

What this means for you:

It is not possible to safely monitor LVM *status* on any layer at all.
It is possible to monitor the LVM monitors (uevent based), but I feel pretty confident they will never, ever report an error.
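If you still want to hook into those uevents yourself, a minimal udev rule would be something like this (the script path is made up; whether a useful event ever arrives is a different story):

# /etc/udev/rules.d/99-dm-watch.rules
# log every change event on device-mapper devices so at least *something* gets recorded
ACTION=="change", KERNEL=="dm-*", RUN+="/usr/local/sbin/log-dm-event.sh %k"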

What this means for me:

For quite a long time I had assumed that some parts of the Linux LVM2 layering are broken. I had guessed that there are some deep, rotten issues causing you to not be able to get some of the info (i.e. lvdisplay -v with a cLVM mirror cannot show an LE + PE1, PE2 list, and pvdisplay -v cannot list which “segments” of which LVs are on a given PV).

I have to get used to the idea that I was wrong, because things are MUCH worse: layering seems to not exist except in the block-addressing LE:PE parts, and the internal state handling might be misdesigned.

This thing is so full of design flaws… you would think it wasn’t just a reimplementation of something that did not have these problems.

Addendum:

root@klappstuhl:~# pvscan
  /dev/dm-7: read failed after 0 of 4096 at 75161862144: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 75161919488: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 128848953344: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 128849010688: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 34359672832: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 34359730176: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  PV /dev/sdc    VG vg01      lvm2 [298.09 GiB / 76.09 GiB free]
  PV /dev/sda2   VG vgklapp   lvm2 [110.86 GiB / 13.09 GiB free]
  Total: 2 [408.95 GiB] / in use: 2 [408.95 GiB] / in no VG: 0 [0   ]
root@klappstuhl:~# vgchange -a y
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  3 logical volume(s) in volume group "vg01" now active
  7 logical volume(s) in volume group "vgklapp" now active
root@klappstuhl:~# dd if=/dev/sdc of=/dev/null bs=1024k count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.956136 s, 110 MB/s

root@klappstuhl:~# pvdisplay /dev/sdc
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  --- Physical volume ---
  PV Name               /dev/sdc
  VG Name               vg01
  PV Size               298.09 GiB / not usable 1.34 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              76311
  Free PE               19479
  Allocated PE          56832
  PV UUID               kNDYjk-ftrb-NWfx-p1dB-qKdt-jV9t-HEX7xb

root@klappstuhl:~# dd if=/dev/vg01/lvisos
dd: reading `/dev/vg01/lvisos': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000518152 s, 0.0 kB/s

In fact, the PV has no state.

Oh, and a small note on
how to fix it:

vgchange -a n /dev/vg01 ; vgexport /dev/vg01 ; vgimport /dev/vg01 && vgchange -a y /dev/vg01

I would assume you don’t need to re-import; I just wanted to make up with my LVM for ranting so much.
If it works without the re-import, then the need to de-activate is only caused by the mishandling of VG activation described above.
So in case you just had to stop all your applications, force-umount the filesystems and all that, although the disk came back long ago…:
Sorry to hear it. The LVM2 bugzilla at Red Hat should be your next stop now. Or the order form for Symantec / Veritas VxVM 🙂

Updating the kitchen (server) #2


I’m still working on my kitchen box, which needed substantially more RAM and disk space. The disk space is so I can run a Bacula storage daemon in there for offsite backups of other servers. Also I want to do some more tests with cobbler, which also take up quite some space. Faster networking (like, above 4MB/s scp speed) was also needed, and ideally I’d be able to run HVM domUs in the future.

Ok, to be honest I don’t really have many reasons for Xen HVM:

  • FreeBSD VMs since this is doing realllllly well – you get PV drivers without the bad maintenance of the freebsd-xen PV kernel
  • Running a VM with D3D would allow me to run my ROSE Online game shop without the PC being on. Not sure if that would really work 🙂
  • Being able to test really crazy hardware is easier at home too, like SolarFlare NICs that can create 2048 Xen netback devices for PV domUs. (Stuff like this is why you’ll still see me laughing a lot when people think that KVM is anywhere comparable to Xen.)
After most of the hardware work was already done, I used this weekend to make the hardware switch.

Hardware:

Ram:

The memory modules from the old box (2GB DDR2 533 Kingston) won’t work in the Intel board; it just doesn’t come up. I’ll have to get some replacement for the 4GB I’d given away to my GF 🙂

8GB total would be really needed. Otherwise I’ll have to ditch the whole thing and reinstall it inside ESXi to be able to overcommit. (noes!)

And disks:

The Transcend flash module arrived and it’s working fine now.
A horrible lot of kickstart hacking later, I can now also safely pick the OS install disk:

  • HW vendor wins first (certain servers are flagged as having USB and will never install anyplace other than “their designated media”)
  • Disk model wins next, if it’s one of the flash devices
After this worked I settled down to installing the server and putting it back in place.
But I also have a better version in the works that (still 70% in pseudocode 😦 ) can apply more criteria after the ones above:
  • a disk smaller than 16GB can win
  • find out whether we’re talking HW or SW raid
  • only mess with the first 2GB of a disk in all cases
  • you can define a size limit
  • or you can exclude hard disk models – either way, such disks will never be touched
The one thing that makes my brain get all fuzzy is the idea of being able to exclude all SCSI devices that sit behind “remote” ports such as iSCSI and FC. But tracing this through the /dev mess is a nightmare (kind of a devmapper reverse lookup) and I wish I could forget about this idea.
Anyway, here you can see the Transcend flash module (upper); it also has a real “read-only” switch that I’m not using.
Also note the Biostar cables becoming loose here.

…and here, too.

 

Here you can see them replaced with stock intel cables:

Also visible: the FLOPPY port. While it has no IDE port to hook up a spare DVD drive for when you’d want to install from DVD, it actually comes with a floppy connector. I don’t really get it.

 

While looking for some more docs for my PXE installing I actually found Intel’s PXE manuals; I should have looked at those many years earlier.

For example, if you want to set up iSCSI boot there’s a program that can set up the iSCSI target to use. Or even FCoE, according to the doc. In case anyone is really using that instead of FC. tehehe.
http://www.intel.com/support/network/adapter/pro100/bootagent/sb/cs-009350.htm

Here you can see the whole thing back in its place, well integrated with the kitchen utilities, but now with 4 disk slots instead of one.

On the right there are 3 temperature-controlled fans that “assist” the room climate in my small place.

Software:

Raid

What was left is all softwareish stuff, like for example migrating to Raid10 from the old Raid1.

A sad experience was finding that there are plenty of howtos out there that actually build something more like a concat of two Raid1s instead of a fully striped/mirrored Raid10.

On the other hand, without reading so much about this I wouldn’t have figured it out on my own.

In the end it’s just been this command:

[root@localhost ~]# mdadm -v --create /dev/md2 --level=raid10 --raid-devices=4 --layout=f2\
 /dev/sdd1 missing /dev/sde1 missing
mdadm: chunk size defaults to 64K
mdadm: /dev/sdd1 appears to contain an ext2fs file system
    size=1839408K  mtime=Tue Aug 30 00:46:17 2011
mdadm: /dev/sde1 appears to contain an ext2fs file system
    size=1839408K  mtime=Tue Aug 30 00:46:17 2011
mdadm: size set to 1465135936K
Continue creating array? y
mdadm: array /dev/md2 started.

I had unfortunately split my original raid beforehand, and with no Bacula backup at hand I decided to re-establish the Raid1 on the two old disks first and do the move to Raid10 tomorrow.

I’ll probably not be using pvmove since I’ve already had bad experiences with it, and since I get more than one hit per week in the logs here for “pvmove data loss” I DOUBT it’s just my mess and pretty much assume that pvmove on Linux really sucks as much as I always say it does.

That means a few hours of dd-based volume migration instead – quite a waste of time, but I figure I still have scripts around for it from the last time.

After that I can move the disks from the old to the new array. (A visit to mdadm.conf will be in order to be sure it doesn’t bring up something old from the past on the next reboot – roughly as sketched below.)
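The disk move and the mdadm.conf cleanup are then roughly this (device names are illustrative; the old disks get repartitioned first):

# once the data is off the old Raid1: stop it and recycle its disks into the Raid10
mdadm --stop /dev/md1
mdadm /dev/md2 --add /dev/sdb1 /dev/sdc1

# then make sure only the arrays that should exist come back on the next reboot
mdadm --detail --scan > /etc/mdadm.conf   # review the result before rebooting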

LVM filters, LVM filters!!!

During testing of one of the less useful howtos I wanted a small reality check at one point – had I already lost data? Oh, and I almost lost data right then. I used vgck, and as a result it switched paths to the NEW raid device because it still had the old LVM header intact on one member disk. So, for a moment, imagine now running pvmove from /old/raid to /new/raid where the data is being read from /new/raid … help! 🙂 Or running pvcreate on /new/raid… might not even be possible.

[root@localhost ~]# vgck
Found duplicate PV I3PllpQdYviurqaON12wuiV2FEEvScEm: using /dev/md4 not /dev/md1

We learn two things here:

  1. Check your LVM filters or it will f**k up (sketch below).
  2. Always wipe all metadata areas (MD, LVM and ext) if you recycle storage.
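A rough sketch of both lessons (device names made up, adjust the filter regex to your own layout):

# in /etc/lvm/lvm.conf, devices section: never scan the old array again
#     filter = [ "r|^/dev/md1$|", "a|.*|" ]

# and when recycling disks, wipe the MD, LVM and ext metadata before reuse
mdadm --zero-superblock /dev/sdd1
pvremove -ff /dev/sdd1                         # only if an old PV label is still on it
dd if=/dev/zero of=/dev/sdd1 bs=1M count=10    # blunt, but kills the ext/LVM headers at the start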

After this whole chaos I had my old Raid1 and the new Raid10:


Personalities : [raid1] [raid10]
md2 : active raid10 sde1[2] sdd1[0]
      2930271744 blocks 64K chunks 2 far-copies [4/2] [U_U_]

md1 : active raid1 sdb2[2] sdc2[0]
      1462196032 blocks [2/1] [U_]
      [>....................]  recovery =  2.2% (32834176/1462196032) finish=983.6min speed=24215K/sec

md0 : active raid1 sdb1[1]
      2939776 blocks [2/1] [_U]

In the monitoring it was quite easily visible that I will have to change something about the disks, though. They are getting quite warm. The first thing I did was turn the server, which seems to have worked for the 2 new Seagate disks. I’ll have to replace them with WD Green disks quite soon, as those are a lot more stable temperature-wise.

I also set speed limits for the mdadm resync to further cool things down.

This is done using the proc filesystem and the effect is quite visible.

[root@localhost ~]# cat /proc/sys/dev/raid/speed_limit_max
50000
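Setting the limit is just an echo into that same proc file (the value is in KB/s):

# throttle all md resyncs while the box is busy / hot
echo 10000 > /proc/sys/dev/raid/speed_limit_max

# and raise it again later
echo 100000 > /proc/sys/dev/raid/speed_limit_max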

One stupid detail is that the md resync speed didn’t go back up when I raised the limit to 100MB/s again.

Performance-wise the difference between those disks is HUGE: the Seagates read up to 135MB/s, the WD Green seems to top out around 80MB/s. But this being a home box, a Raid10 of these should do quite OK, even when running virtual machines over the network via CIFS or iSCSI.

Otherwise one could consider using --layout=f4 to trade disk space for better r/w performance. But since performance has gone up many times over with this upgrade, I’ll just be happy!

 

What’s left after the Raid is done?

  • Cable up so I can use LACP (the bond and bridges are already configured)
  • Add noise shielding
  • and a dust filter

a bunch of new OracleVM RPMs


I’ve done some specfile practicing today and was able to port a lot of the applications I needed over to OracleVM. In a way it was mostly specfile hunting plus some small fixes needed to make things work on old RHEL5-ish distros.

Special thanks to Brett Trotter at blackopsoft.com, who is the only one that has a src RPM for current versions of m4.

What I really cared about was the following packages:

  • pigz
  • ossec-hids-client
  • openvswitch
  • mercurial (haha, no, I won’t use git in my free time 🙂)
  • fail2ban
  • dtc-xen-firewall
  • dtc-xen
You’ll also find some other things that are just dependencies.
Since I needed them for my Xen boxes, I’ve added them to the black magic project at Bitbucket.
Look in /usr/src/redhat/RPMS there!
There’s still some more work to do – I want a yum channel that can deliver all the addons my Xen hosts need: a Bacula client, vncsnapshot, raid controller utilities (so far LSI and Adaptec) and IPMI management tools for Intel & IBM.
If you need the specfiles, you can ask me – or instead let me know how to build fresh SRPMs from my fixed specfiles.
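Building them is the usual rpmbuild dance; assuming the source tarball is already dropped into SOURCES, roughly:

# build a fresh SRPM only
rpmbuild -bs pigz.spec
# or build the SRPM plus binary RPMs in one go
rpmbuild -ba pigz.spec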
Ah right, and let’s not forget the really hard ones that I skipped:
flashcache-wt and
OpenNHRPD

Linux LVM mirroring comes at a price


You can find a nice article about cLVM mirroring here: http://www.joshbryan.com/blog/2008/01/02/lvm2-mirrors-vs-md-raid-1

A reader had already tried to warn people, but I think it went unheard:

LVM is not safe in a power failure, it does not respect write barriers and pass those down to the lower drives.

hence, it is often faster than MD by default, but to be safe you would have to turn off your drive’s write caches, which ends up making it slower than if you used write barriers.

First of all, he’s right. More on that below. Also, I find it kind of funny how he goes into turning off write caches. I was under the impression that NO ONE is crazy enough to have write caches enabled in their servers, unless they’re battery-backed and the local disk is only used for swap anyway. I mean, that was the one guy who at least knew about the barrier issue, and he thinks it’s safe to run with his cache turned on.
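For completeness, turning the drive write cache off really is a one-liner (SATA/ATA disks; making it stick across reboots is distro-dependent):

# disable the volatile write cache on the disk itself
hdparm -W0 /dev/sda
# and verify the current setting
hdparm -W /dev/sda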

All the pretty little Linux penguins look soooo much faster – as long as we just disable all those safeguards that people built into Unix over the last 20 years 🙂

Anyway, back to LVM mirrors!

We just learned: All devicemapper based IO layers in Linux can/will lose barriers.

Furthermore, LVM2 has its own set of issues, and it’s important to choose wisely – I think these are the most notable items that can give you lots of trouble in a mirror scenario:

  • no sophisticated mirror write consistency (and worse, people who are using --corelog – see the sketch after this list)
  • only trivial mirror policies
  • no good per LE-PE sync status handling
  • (no PV keys either? – PV keys are used to hash LE-PE mappings independently of the PVID)
  • limited number of mirrors (this can turn into a problem if you want to move data with added redundancy during the migration)
  • no safe physical volume status handling
  • too many userspace components that work fine as long as everything is OK, but can die on you if something is broken
  • no reliable behaviour on quorum loss (the VG should not activate, optionally the server should panic upon quorum loss, but at LEAST vgchange -a y should be able to re-establish the disks once they’re back). I sometimes wonder if LVM2 even knows a quorum?!
  • on standard distros nothing hooks into the lvm2 udev event handlers, so there are no reliable monitors for your status. Besides, the lvm2 monitors seem to be stuck in a proof-of-concept state…
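To make the first point concrete, this is the kind of choice I mean (names made up): the default keeps a persistent mirror log on disk, while --corelog keeps it only in memory and forces a full resync after every crash or reboot.

# mirror with an on-disk mirror log (the sane default)
lvcreate -m1 --mirrorlog disk -L 20G -n lvdata vgdata

# mirror with the log only in memory: no log device needed, but every restart means a full resync
lvcreate -m1 --corelog -L 20G -n lvdata vgdata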

Since barriers are simply dropped in the devicemapper (not in LVM itself, btw), you should choose wisely whether to use lvm2 mirrors for critical data mirroring.

Summary:

  • LVM mirror may look faster, but it comes at a price
  • Things tend to be slower if they do something the proper way.

Of course, if you’re using LVM on top of MD you *also* lose barriers.

Usually we can all live pretty well with either of those setups, but we should be aware that there are problems and that we opted for manageability / performance over integrity.

Personally I see the management advantages of LVM as high enough to accept the risk of FS corruption. I think the chance of losing data is much higher when I manually mess around with fdisk or parted and MD every time I add a disk, etc.

If it is very critical data you can either replicate in the storage array (without LVM and multipath??????) or scratch up the money for a Veritas FS/Volume Manager license (unless you’re a Xen user like me… 😦 )

either way…:

SET UP THE MONITORING.

 

A little update here:

According to the LVM article on Wikipedia, kernels from 2.6.31 on do handle barriers correctly even with LVM. On the downside, that article only covers Linux LVM and imho has a lot of factual errors, so I’m not sure I’ll just go and be a believer now.

Linux LVM2 design


Look at this … it says it all.

[root@davexh0001 ~]# vgchange -a n vgbacula
  /dev/vgbacula/lvmysql: read failed after 0 of 4096 at 10737352704: Input/output error
  /dev/vgbacula/lvmysql: read failed after 0 of 4096 at 10737410048: Input/output error
  /dev/vgbacula/lvmysql: read failed after 0 of 4096 at 0: Input/output error
  /dev/vgbacula/lvmysql: read failed after 0 of 4096 at 4096: Input/output error
  /dev/vgbacula/lvmysql: read failed after 0 of 4096 at 0: Input/output error
  /dev/vgbacula/lvbacstor00: read failed after 0 of 4096 at 471909466112: Input/output error
  /dev/vgbacula/lvbacstor00: read failed after 0 of 4096 at 471909523456: Input/output error
  /dev/vgbacula/lvbacstor00: read failed after 0 of 4096 at 0: Input/output error
  /dev/vgbacula/lvbacstor00: read failed after 0 of 4096 at 4096: Input/output error
  /dev/vgbacula/lvbacstor00: read failed after 0 of 4096 at 0: Input/output error
  /dev/vgbacula/lvmysql: read failed after 0 of 4096 at 0: Input/output error
  /dev/vgbacula/lvbacstor00: read failed after 0 of 4096 at 0: Input/output error
  Volume group "vgbacula" not found

Is that error handled properly…

on HP-UX?

Yes.

on AIX?

Yes.

in VxVM?

Yes.

in Linux LVM2?

Err…

losing data with pvmove, then automagically moving LVM Volumes between VGs


First some story about how I ended up copying volumes between VGs:

I was replacing my Xen host’s two disk drives (one 500GB, one 1.5TB) with a pair of WD Caviar Green 1.5TB drives to make it somewhat more silent than it already is.

Of course the new disks were supposed to be in a raid1 setup.

Now I’ve had my 2nd disaster using pvmove, and this time I had some proper data loss.

What happened?

I had set up one of the 1.5TB disks as a degraded raid1 in a USB case, because the system has only two SATA ports.

Then I added the resulting /dev/md1 device to the “vgxen2” volume group and started moving LVs using pvmove. Roughly 50GB into the data it turned out that the disk was faulty. The layering between md, lvm and the kernel sucks big time, so at some point I got “rejecting I/O to dead device” and some more messages of that kind.

Procedure to fix:

pvmove --abort (fails somewhat)

vgreduce --removemissing (will not really work because it can’t deal with the volumes created by pvmove – this was reported by someone back in 2006… lol)

identify the LVM metadata copy from *before* the pvmove by grepping in /etc/lvm/archive/

edit out the section for the volume that was being copied (I had a backup!)

vgcfgrestore -f /tmp/lvmdata_fixed

vgchange -a y vgxen2 (still getting the errors; vgchange is not properly implemented. vgchange -a y is supposed to reactivate the VG even if it’s already active. Same for vgscan – guess why the original came with -v and -p options and why you had to move the lvmtab away? And of course there’s no lvlnboot command to sync userland and kernel. Gosh, I so FUCKING HATE Linux lvm2. So many bugs and design flaws. If I could afford the power consumption, one of my HP-UX boxes would become the fileserver.)

Well, how do you fix that? Ah. A reboot.
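Condensed into commands, the recovery looked roughly like this (the archive and temp file names are from my box and will differ on yours):

pvmove --abort                               # try to stop the half-done move (only partially works)
vgreduce --removemissing vgxen2              # chokes on the pvmove_* volumes
ls -lrt /etc/lvm/archive/vgxen2_*.vg         # pick a metadata archive from *before* the pvmove
# copy that archive to /tmp/lvmdata_fixed, edit out the half-moved LV's section, then:
vgcfgrestore -f /tmp/lvmdata_fixed vgxen2
vgchange -a y vgxen2
reboot                                       # the only way left to really sync kernel and userland state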

Oh. err. and this is why I’m copying over my data instead of using pvmove.

So let’s go to the actual script:

#!/bin/bash
# script for offline copying of lvm volume group contents
# free to use, but no warranties / liabilities accepted.

OLDVG=/dev/vgxen2
NEWVG=/dev/vgxen

for lv in "$OLDVG"/*
do
  LVNAME=$(basename "$lv")

  # first check if we already have an existing copy, because I manually copied the large ones.
  if [ ! -r "$NEWVG/$LVNAME" ]
  then
    # figure out the LV size in MB via the LE count (assumes 4MB extents),
    # because lvdisplay has rounding errors and changes the unit.
    NUMLE=$(lvdisplay "$lv" | grep "Current LE" | awk '{print $3}')
    LVSZ=$(( NUMLE * 4 ))

    # now we have the LV size and can create the target LV and use dd to copy into it.
    lvcreate -L "${LVSZ}M" -n "$LVNAME" "$NEWVG"
    echo "copying $LVSZ MB for $LVNAME from $OLDVG to $NEWVG"
    # I had to specify obs because the USB bridge or something turned 300 input IOs into 5000 output IOs.
    # Also, if you want a performance counter or ETA display, just split the dd operation
    # into many smaller ones and take their times.
    dd if="$OLDVG/$LVNAME" of="$NEWVG/$LVNAME" bs=512k ibs=512k obs=512k &&
      echo "lv copy ok" &&
      echo "lvremove -f $OLDVG/$LVNAME" >> /tmp/lvremovescript
  fi
done

echo "copies are completed; if no major errors occurred you can remove the old LVs now using /tmp/lvremovescript"

This is running smoothly now and once done I will be able to work on the next checklist item:

The old 500GB disk will be attached via USB and contain my backup VM – which will be dual-bootable between Xen and real “iron”.
So if disaster strikes, I will be able to plug my backup server into any available system that can boot off USB, and the backups will be available.
And until then I will enjoy the benefits of using a Xen VM: it can even stay in suspend mode for all the “normal” hours of the day and only be brought up for the actual backup runs.

These options have existed for years now, it is time they see some more use.