And the earth is a disc


I just managed to trigger a scenario that lets me take you on a ride through the layers of LVM and see which ones are having issues. Usually I find it quite hard to pinpoint examples of where this thing misbehaves :>

What happened:

I have a second hard disk in my ThinkPad's multibay slot.
On that disk is a volume group “vg01”, which holds 3 LVs which I normally do not mount or use.
I removed that hard disk earlier today after booting because of the noise.
I didn’t think much about the VG since nothing was mounted anyway, so I didn’t do a vgchange or vgexport, leaving it activated.
Some 6 hours later I re-inserted the disk to mount something on it.

I waited till it spun up and then, somewhat naively, assumed I would just have to activate the VG.
This (probably) isn’t working out since the LVM disk now has a different device handle & name in the kernel.
Of course it still has the same PVID and a simple pvscan would (probably) fix this. This is *not* the point now.

The point is that

this should never work:

root@klappstuhl:~# vgchange -a y vg01
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  3 logical volume(s) in volume group "vg01" now active

The correct thing to happen here would be:

– vgchange to be unable to change the VG’s activation state, since the code MUST abort an activation to the same or a higher level if no physical copy of the VGDA is visible.
– implicit lvchange failing. The LVs should be in UNAVAILABLE state. Their PV is missing, their VG has no Quorum (and cannot, with 1 PV), their state never was “AVAILABLE R/W” or “ACTIVE” in this cycle… and not a single one of their LEs is in fact accessible. When no LE is backed by an available PE then the LV MUST NOT be activate-able. It had probably been activated b/c I had the disk plugged in on boot, but:
– a vgchange -a y MUST re-run the VG activation procedure and then fail, report the error up its layers and put things into the right (lower) Availability state, while not dropping down to a new (same/higher/lower) Activation state, since the operation in itself FAILED.

Instead what we see is TRY OF ACTIVATION TO SAME/HIGHER -> FAIL AND ABORT -> RE-RUN ACTIVATION TO SAME LEVEL

Let's verify what LVM is thinking right now:

root@klappstuhl:~# lvdisplay /dev/vg01/lvisos 
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  --- Logical volume ---
  LV Name                /dev/vg01/lvisos
  VG Name                vg01
  LV UUID                MucSwf-2reM-x9LV-cxHz-89Gk-oeTI-E5OdBT
  LV Write Access        read/write
  LV Status              available
  # open                 0

“No problem” here at all. Earth is a disc and the center of the universe.

Let's give it a reality check.

root@klappstuhl:~# fsck.jfs /dev/vg01/lvisos
fsck.jfs version 1.1.12, 24-Aug-2007
processing started: 12/5/2011 17.50.30
Using default parameter: -p
The current device is:  /dev/vg01/lvisos
ujfs_rw_diskblocks: read 0 of 4096 bytes at offset 32768
ujfs_rw_diskblocks: read 0 of 4096 bytes at offset 61440
Superblock is corrupt and cannot be repaired 
since both primary and secondary copies are corrupt.  

 CANNOT CONTINUE.

So the block layer(s) had to deny the read requests since the block device was unavailable. Surely LVM has flagged the LV as unavailable and stale by now?

root@klappstuhl:~# lvdisplay /dev/vg01/lvisos 
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  --- Logical volume ---
  LV Name                /dev/vg01/lvisos
  VG Name                vg01
  LV UUID                MucSwf-2reM-x9LV-cxHz-89Gk-oeTI-E5OdBT
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                32.00 GiB
  Current LE             8192
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           251:9

Note:
Nothing. Still thinks everything is 100% perfect.

This can be caused by two things:

1. The idiotic activation state chain we saw is overwriting the LV activation state when it re-runs & falls back to the “old OK” state.
2. Setting states might not be working at lower layers AND the propagation of states is generally not understood. That means they can change states per component in a layer (VG, PV, LV) but there is no internal relation between them (PV > VG > LV). Yes, the config has “segments” and all that. That doesn’t mean this information is correctly USED.

If I only had one guess about the issue, I would guess it’s caused by something tied to the use of devmapper.
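
One rough way to test that guess: dump the device-mapper tables and check whether the major:minor numbers the mappings point at still exist in the kernel. The sketch below reads `dmsetup table` output on stdin and only understands `linear` targets. The function name and the `SYSFS_BLOCK` override are made up here (the override exists only so the function can be tried without root), and the numbers in the sample line are illustrative, not taken from my machine.

```shell
#!/bin/sh
# Sketch: list device-mapper linear mappings whose backing device
# (major:minor) is no longer known to the kernel.
# Input (stdin): output of `dmsetup table`, lines such as
#   vg01-lvisos: 0 67108864 linear 8:16 384    (illustrative numbers)
# SYSFS_BLOCK is an assumption for testing; the default is /sys/dev/block.
stale_linear_backings() {
    # Keep only linear targets; print "<mapping name> <major:minor>".
    awk '$4 == "linear" { name = $1; sub(/:$/, "", name); print name, $5 }' |
    while read -r name devno; do
        # A backing device that is gone has no entry under /sys/dev/block.
        [ -e "${SYSFS_BLOCK:-/sys/dev/block}/$devno" ] || echo "$name $devno"
    done
}

# Usage (as root):
#   dmsetup table | stale_linear_backings
```

If the three vg01 mappings show up in that list while pvscan happily reports the PV under its new name, the tables are simply still pointing at the old device.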

What this means for you:

It is not possible to safely monitor LVM *status* on any layer at all.
It is possible to monitor the LVM monitors (uevent based), but I feel pretty confident they will never, ever report an error.
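
If the reported status cannot be trusted, about the only thing left is to probe the data path itself. A minimal sketch, assuming that a one-block read is an acceptable health check; `lv_readable` and `check_lvs` are names invented for this example. A read served from the page cache could still hide a dead disk, so treat this as a lower bound, not a full check.

```shell
#!/bin/sh
# Sketch: do not trust "LV Status"; try to read one 4 KiB block instead.
# Returns dd's exit status, which is non-zero on an I/O error.
lv_readable() {
    dd if="$1" of=/dev/null bs=4096 count=1 2>/dev/null
}

# Probe every device given as an argument and print OK/FAIL per device.
# Returns 1 if at least one device could not be read.
check_lvs() {
    rc=0
    for dev in "$@"; do
        if lv_readable "$dev"; then
            echo "OK   $dev"
        else
            echo "FAIL $dev"
            rc=1
        fi
    done
    return $rc
}

# Usage (as root):
#   check_lvs /dev/vg01/lvisos
```

Given that the plain dd run in the addendum fails with an I/O error, a probe like this should report FAIL for the LV even while lvdisplay still says “available”.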

What this means for me:

For quite a long time I had assumed that some parts of the Linux LVM2 layering are broken. I had guessed that there are some deep, rotten issues that keep you from getting some of the info (i.e. lvdisplay -v with a cLVM mirror cannot show a LE + PE1, PE2 list, and pvdisplay -v cannot list which “segments” of which LVs are on a given PV).

I have to get used to the idea that I was wrong, because things are MUCH worse: layering seems not to exist except in the LE:PE block addressing parts, and the internal state handling might be misdesigned.

This thing is so full of design flaws… you would think it wasn’t just a reimplementation of something that did not have these problems?

Addendum:

root@klappstuhl:~# pvscan
  /dev/dm-7: read failed after 0 of 4096 at 75161862144: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 75161919488: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 128848953344: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 128849010688: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 34359672832: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 34359730176: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  PV /dev/sdc    VG vg01      lvm2 [298.09 GiB / 76.09 GiB free]
  PV /dev/sda2   VG vgklapp   lvm2 [110.86 GiB / 13.09 GiB free]
  Total: 2 [408.95 GiB] / in use: 2 [408.95 GiB] / in no VG: 0 [0   ]
root@klappstuhl:~# vgchange -a y
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  3 logical volume(s) in volume group "vg01" now active
  7 logical volume(s) in volume group "vgklapp" now active
root@klappstuhl:~# dd if=/dev/sdc of=/dev/null bs=1024k count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.956136 s, 110 MB/s

root@klappstuhl:~# pvdisplay /dev/sdc
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  --- Physical volume ---
  PV Name               /dev/sdc
  VG Name               vg01
  PV Size               298.09 GiB / not usable 1.34 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              76311
  Free PE               19479
  Allocated PE          56832
  PV UUID               kNDYjk-ftrb-NWfx-p1dB-qKdt-jV9t-HEX7xb

root@klappstuhl:~# dd if=/dev/vg01/lvisos
dd: reading `/dev/vg01/lvisos': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000518152 s, 0.0 kB/s

In fact, the PV has no state.
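
What the PV does have is self-consistent metadata, which a bit of shell arithmetic shows. The figures are taken from the pvdisplay, pvscan and lvdisplay output above; the helper name `pe_to_gib` is made up for this sketch.

```shell
#!/bin/sh
# Sketch: convert an extent count to GiB.
# $1 = number of extents, $2 = extent size in MiB (4 on this PV).
pe_to_gib() {
    echo $(( $1 * $2 / 1024 ))
}

pe_to_gib 56832 4           # allocated PEs on /dev/sdc -> 222
pe_to_gib 8192 4            # LEs of lvisos             -> 32
echo $(( 19479 + 56832 ))   # free + allocated PEs      -> 76311
```

222 GiB allocated matches the 298.09 GiB minus 76.09 GiB free that pvscan reports, and 76311 matches Total PE. So LVM reads the on-disk metadata just fine from /dev/sdc; it just never connects that to the dead mappings.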

Oh, and small note:
to fix:

vgchange -a n /dev/vg01 ;  vgexport /dev/vg01 ;  vgimport /dev/vg01 && vgchange -a y /dev/vg01

I would assume you don't need to re-import. I just wanted to make up with my LVM for ranting so much.
If it works without re-import then the need to de-activate is only caused by the issue with the mis-handling of vg activation described above.
So in case you just had to stop all your applications, force-umount the filesystems and all that although the disk came back long ago…:
Sorry to hear. The LVM2 bugzilla at RedHat should be your next stop now. Or the order form for Symantec / Veritas VxVM 🙂


2 thoughts on “And the earth is a disc”

  1. TBH i did not read the whole post (yet).

    1. you can easily spin down the harddisk in the bay with hdparm -Y /dev/sdb or use hdparm -S40 /dev/sdb to let it spin down automatically after some idle time (“sdb” and “40” are just examples of course and the values i use). any access to the device will automatically spin it up (hence it is a bit tricky to use -Y in startup scripts at the right time).

    2. i think i have a similar problem with a snapshot that is created on boot and deleted on the next boot (before creating another one). an initramfs local-top script (cryptroot) runs vgchange before the snapshots are touched and spits out i/o errors. after the system has booted everything looks normal as far as the lvm commands are concerned with one exception*. i wonder if snapshots across boots are supported at all but could not find an obvious answer yet.

    this is with 3 lvs for root, data and root-snapshot in one big pv inside a luks volume “in” a normal 83-type partition. the idea behind this is to create backups of the root partition in a state before it is mounted. previously i removed the snapshot after doing the backup, but this does not “reserve” the LEs needed and i forgot this when rearranging the space and hence the backup process broke. so i decided to leave the snapshot alone until the next boot. i am not entirely sure that is the real cause of those warnings because i have rearranged the LEs at the same time.

    the i/o errors:
    Reading all physical volumes. This may take a while…
    /dev/dm-4: read failed after 0 of 4096 at 12884836352: Input/output error
    /dev/dm-4: read failed after 0 of 4096 at 12884893696: Input/output error
    /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
    /dev/dm-4: read failed after 0 of 4096 at 4096: Input/output error
    Found volume group “ssd” using metadata type lvm2
    /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
    3 logical volume(s) in volume group “ssd” now active

    * /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error

    • Hi,

      2) snapshots across boot are supported imho.
      3) sounds tricky with the snapshot-before-mount. interesting, too, but i have not become good friends with the init-top and init-bottom scripts. and yeah, removing the snapshot can’t work out 🙂
      i’d give it a try to look at the devices behind dm-1 and dm-4 during boot. maybe it would be helpful to bring up udev in the initrd to have reliable device names.
      *shudder* did i just suggest udev?
