Linux LVM mirroring comes at a price


You can find a nice article about LVM2 mirroring vs. MD RAID 1 here: http://www.joshbryan.com/blog/2008/01/02/lvm2-mirrors-vs-md-raid-1

A reader there had already tried to warn people, but I think it went unheard:

LVM is not safe in a power failure: it does not respect write barriers and pass them down to the underlying drives.

Hence it is often faster than MD by default, but to be safe you would have to turn off your drives’ write caches, which ends up making it slower than if you used write barriers.

First of all, he’s right. More on that below. I also find it kind of funny how he goes into turning off write caches. I was under the impression that NO ONE is crazy enough to have write caches enabled on their servers, unless they’re battery-backed and the local disk is only used for swap anyway. I mean, that was the one guy who at least knew about the barrier issue, and he thinks it’s safe to run with his cache turned on.
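
For reference, here’s a hedged sketch of what actually checking and turning off a drive’s volatile write cache looks like with hdparm (the device name is just an example; behind a hardware RAID controller you’d use the vendor’s tool instead):

    # query the current write-cache setting (example device /dev/sda)
    hdparm -W /dev/sda

    # disable the volatile write cache -- safe without barriers, but slow
    hdparm -W0 /dev/sda

    # re-enable it (only sane with working barriers or a battery-backed cache)
    hdparm -W1 /dev/sda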

All the pretty little Linux penguins look soooo much faster – as long as we just disable all those safeguards that people built into Unix over the last 20 years 🙂

Anyway, back to LVM mirrors!

We just learned: all device-mapper-based IO layers in Linux can/will lose barriers.

Furthermore, LVM2 has its own set of issues, and it’s important to choose wisely – I think these are the most notable items that can give you lots of trouble in a mirror scenario:

  • no sophisticated mirror write consistency (and worse, people who are using --corelog – see the sketch after this list)
  • only trivial mirror policies
  • no good per LE-PE sync status handling
  • (no PV keys either? – PV keys are used to hash LE-PE mappings independent of PVID)
  • limited number of mirrors (this can turn into a problem if you want to move data with added redundancy during the migration)
  • no safe physical volume status handling
  • too many userspace components that will work fine as long as everything is ok but can die on you if something is broken
  • no reliable behaviour on quorum loss (the VG should not activate; optionally the server should panic upon quorum loss; but at LEAST vgchange -a y should be able to re-establish the disks once they’re back). I sometimes wonder if LVM2 even knows what a quorum is?!
  • On standard distros nothing hooks into the lvm2 udev event handlers, so there are no reliable monitors for your status. Besides, the lvm2 monitors themselves still seem to be in a proof-of-concept state…
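
To make the --corelog point above concrete, here is a minimal sketch (VG, LV names and sizes are made up) of creating an LVM2 mirror with the default on-disk log versus a core log, plus the lvs call to watch the sync state:

    # mirror with a persistent on-disk mirror log (the default)
    lvcreate -m 1 -L 10G -n lv_data vg00

    # mirror with a core (in-memory) log: every reboot or crash
    # forces a full resync of the mirror
    lvcreate -m 1 --mirrorlog core -L 10G -n lv_scratch vg00

    # watch sync progress and which PVs back each mirror leg
    lvs -a -o lv_name,lv_attr,copy_percent,devices vg00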

Since barriers are simply dropped in the device mapper (not in LVM, btw), you should choose wisely whether to use LVM2 mirrors for critical data mirroring.
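
For completeness, the filesystem-side knobs look roughly like this (defaults differ per filesystem and kernel version, so treat it as a hedged example rather than a recommendation):

    # ext3/ext4: barriers are controlled via the barrier= mount option
    mount -o barrier=1 /dev/vg00/lv_data /mnt/data

    # XFS: barriers are on by default, nobarrier turns them off
    mount -o nobarrier /dev/vg00/lv_scratch /mnt/scratch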

Summary:

  • LVM mirror may look faster, but it comes at a price
  • Things tend to be slower if they do things the proper way.

Of course, if you’re using LVM on top of MD you *also* lose barriers.

Usually we can all live pretty well with either of those setups, but we should be aware that there are problems and that we opted for manageability/performance over integrity.

Personally, I see the management advantages of LVM as high enough to accept the risk of FS corruption. I think the chance of losing data is much higher when I manually mess around with fdisk or parted and MD every time I add a disk, etc.
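
To illustrate that management advantage, a hedged sketch (device, VG and LV names are examples) of adding a disk and growing a filesystem online with LVM – a handful of commands, no repartitioning, no MD reshaping:

    # the new disk shows up as /dev/sdc (example)
    pvcreate /dev/sdc                      # label it as a physical volume
    vgextend vg00 /dev/sdc                 # add it to the volume group
    lvextend -L +100G /dev/vg00/lv_data    # grow the logical volume
    resize2fs /dev/vg00/lv_data            # grow the ext3/ext4 filesystem online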

If it’s very critical data, you can either replicate in the storage array (without LVM and multipath??????) or scratch up the money for a Veritas FS/Volume Manager license (unless you’re a Xen user like me… 😦 )

Either way:

SET UP THE MONITORING.
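
What that can look like in practice – a minimal sketch only, with placeholder threshold and mail address – is a cron job that complains whenever a mirror leg is not fully in sync:

    #!/bin/sh
    # naive LVM mirror check -- run it from cron, mail root on trouble
    OUT=$(lvs --noheadings -o lv_name,vg_name,copy_percent 2>/dev/null \
          | awk '$3 != "" && $3+0 < 100 {print}')
    if [ -n "$OUT" ]; then
        echo "$OUT" | mail -s "LVM mirror not in sync on $(hostname)" root@example.com
    fi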

 

A little update here:

According to the LVM article on Wikipedia, kernels from 2.6.31 onwards do handle barriers correctly even with LVM. On the downside, that article only covers Linux LVM and imho has a lot of factual errors, so I’m not sure I’ll just go and be a believer now.


14 thoughts on “Linux LVM mirroring comes at a price”

  1. Hi,
    thanks for the article – it helped me understand the barrier issue with LVM. According to http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab4c1424882be9cd70b89abf2b484add355712fa barrier support has been in the 2.6 kernel for a while, but only for single-device DM devices. So the correct form of your statement would be:

    Of course, if you’re using LVM on top of mirrored MD you *also* lose barriers.

    But nevertheless within the given context your statement is absolutely correct 🙂

    Tom

    • Very glad it helped.
      I’m also convinced that this will come to a good solution over time. It’s only a matter of time, but at some point there will be a more clearly defined IO path through the kernel :> What I still haven’t found out is whether there is any benefit from turning off barriers: will the filesystem / buffer cache be “more careful” if it’s already warned that this functionality won’t be working?

  2. Sure, we all agree that (under normal circumstances with generally available hardware) disabling the write cache is the only _proven and absolutely safe_ way to go. Yet, relying on barriers (or similar functionality) could be a fine middle ground between reliability and performance.

    As the previous comment mentioned, barriers are supported on single-device DM devices[1] (as of 2.6.30 – 2.6.29 was buggy in that regard[2]). And as this blog post mentions, that does put LVM mirrors at risk. md can also handle barriers (on all RAID levels as of 2.6.33, previously only RAID1[3]). However, I did not come to the same conclusion as you did. Let’s look at the facts: 1) LVM requires DM, 2) an LV sitting on a single md device is a single-device DM mapping, 3) md handles barriers – hence LVM on an md device should handle barriers (as of 2.6.33).
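
    One way to sanity-check that chain on a given box (a hedged sketch – the volume and device names are examples) is to look at the device-mapper tables: an LV that sits entirely on one md device shows up as linear targets pointing at a single major:minor, which is exactly the single-device case the barrier patch covers:

      # dump the DM tables for all active LVs
      dmsetup table
      # example output for an LV living on /dev/md0 (major:minor 9:0):
      #   vg00-lv_data: 0 20971520 linear 9:0 384

      # confirm which major:minor the md device has
      ls -l /dev/md0
      cat /proc/mdstat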

    Something that is not noted here, though, is that these issues seem to be a thing of the past, as barriers are being replaced by a simpler interface[4][5].

    1: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab4c1424882be9cd70b89abf2b484add355712fa
    2: https://lwn.net/Articles/326635/
    3: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a2826aa92e2e14db372eda01d333267258944033
    4: http://www.linux-mag.com/id/7773/#comment-8337
    5: https://lkml.org/lkml/2010/9/3/199

    P.S.: I hope I did not come across as harsh; this is a confusing topic in general. In fact I haven’t been using LVM that much and I could very well be wrong. 🙂

    • This is not harsh but interesting, since I had hoped for things to change for the better.
      It’s a slowly moving topic, and once all layers work fine I’ll be just happy.
      My newest boxes are running on Alpine Linux and I’m looking forward to testing / seeing barriers working there. (This is a tall stack: a VM using LVM on an LV in a VG on a flashcache SSD-cached device on a DRBD device on a simple stupid disk.)

      The problem with “this only affected kernels before 2.6.33” is that most “critical” installs will be somewhere between RHEL5 and RHEL6.
      Whether it’s fixed on my Ubuntu 11.10 desktop has almost no practical relevance – but on the server OS, where acting on it costs me downtime (once I know it’s backported 100% and working), it does.
      You have far too many SW vendors who play the “this is not yet supported on RHEL6” game, and you also need a year or so until you can even remotely consider using a new RHEL release in production. I mean… remember RHEL5.0 – 5.2? That was a mess.
      So most production systems are stuck on something that isn’t fixed at all.

      If this were well-known and documented, it would be quite OK and a calculated risk, but I’d say it’s far from that.

      And I am very unhappy with this in the context of other mature OSes being flagged as useless and slow, when time after time their slowness is caused not by bad code but by doing things “properly”. Since this is just a blog I tend to express that 🙂

      Oh, and I use LVM a lot, but never for mirrors. I have had a look under that hood and no, that is beyond repair – which is a pity, since LVM mirroring (the way it is on Unix) is the most flexible and comfortable thing around.

      edit:

      The flush-cache thing looks interesting, but I want to see it in real life –
      especially whether it is protected against the aggressive buffer cache when people still run virtual machines on top of file-backed images.

      And comments like this one are what I meant above:
      “Activating write barriers or disabling the disk cache are both things that most readers would be reluctant do do because of the reduced performance that will most certainly result.”

      For heaven’s sake…

      • But don’t you agree, LVM on an md device sounds like it should be working, right? Unfortunately there is no strong confirmation from a credible source (such as an involved developer).

        I haven’t come across Alpine Linux before, but judging by their site it looks promising. Regarding flashcache, have you considered LVM Tiered Storage[1][2] instead? As I understand it, flashcache requires quite a bit of effort to get working, whereas LVMTS is more or less bolt-on.

        It’s hard to tell if Red Hat has backported the needed patches to RHEL6 but it wouldn’t surprise me. At least we know it’ll be there by RHEL7. And as you said, LVM is a mess under the hood and I think it’s not reasonable to expect it to ever be cleaned up. If anything we’ll get a viable replacement sooner or later.

        1: https://github.com/tomato42/lvmts
        2: https://bbs.archlinux.org/viewtopic.php?id=113529

  3. Tanel,
    your comments are a wealth of information!
    I had not even heard of LVMTS. Recently I read about the thin target, which is also a nice add-on. And yes, flashcache is a little tricky, but it has the positive karma of being in production for a few years already.

    I will definitely test LVMTS. Thanks for the pointer.

    • If you’re going to use LVMTS, make sure to have a decent backup plan. As I understand it, an LVMTS setup combining an HDD with an SSD provides the same level of reliability as a spanning RAID – hence it is wise to use mirrored HDDs and mirrored SSDs for LVMTS – a bit like RAID 10 instead of RAID 0.

      Also, you might want to have a look at Bcache[1]. It’s more akin to flashcache (but supposedly faster) than to LVMTS, and the developer likes to compare it to L2ARC on ARC – but FS-agnostic.

      I would personally stay away from flashcache as I have gotten the feeling that it is too intrusive (and has yet to be run on corner-case hardware) to be trusted – but hey, someone has to take the first step. Be sure to let us know what you find. 🙂

      1: http://bcache.evilpiepirate.org/

      • Update: “L2ARC on ARC” should have been “L2ARC on ZFS”.

      • Bcache: I liked what I read, but it suffers from being basically designed by a guy at home on his home PC. Although he’s made many good choices, this can still hurt.

        LVMTS:
        It’s like they’re just opening that small hole there and unknowingly opening the entrance to R’lyeh.
        From all that I can see, they are not aware of the issues ahead.

        There’s a nice, very old research paper about online data migration; google for “FAST2002-aqueduct.pdf”.
        Notes:
        * very old paper
        * concerned about limitation of bandwidth consumed by storage re-settling
        * the server used back then had a lot more IO bandwidth than a current PC
        * they had to make a huge mess
        * they didn’t just use pvmove
        * they picked an LVM that could attach up to 3 mirrors and easily resync mirror relations on a per-LE level (nowadays 4; a sad joke compared to VxVM)
        * HP-UX’ pvmove does not have a history of data loss

        Things that I think are missing in LVMTS (imho):
        * Being able to configure policies (this can go in, this can never go in)
        * Scalability (can’t add 1, then 2, then 5 mirrors and read round-robin across them)
        * Protection against a race of “moving” stuff in and out of the cache. From what I read (but have not tested), LVMTS would get into big trouble if I dared to use a 4MB PE size and write a script that constantly reads a 4MB chunk at random locations, say, 3 times each. Correction after re-reading – it would have to be N times per 5-minute (or kill -HUP) interval. But that doesn’t really matter. Imagine the havoc a novice database coder’s select * will create if it runs a few times within the interval (see the sketch right after this list).
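
        A hedged sketch of the access pattern meant here – repeatedly re-reading random 4MB (one-PE-sized) chunks so that a naive extent-based tiering daemon keeps promoting data that will never be worth migrating (device path and extent count are made up):

            #!/bin/bash
            # hammer random 4 MB (= one PE) chunks on an example LV
            DEV=/dev/vg00/lv_data
            EXTENTS=25600            # number of 4 MB extents on the LV (example)
            while true; do
                N=$((RANDOM % EXTENTS))
                dd if=$DEV of=/dev/null bs=4M count=1 skip=$N iflag=direct 2>/dev/null
            done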

        I had been thinking about doing a similar thing as poor man’s tiering in Veritas (free VxFS for home use or so 😉 ).
        The difference there is that I would have had a defined subdisk size and would only have worked by attaching additional mirrors on faster disk, not by moving data. The “do not like” point was that it is a userspace-driven approach (not desirable – like having a router where the control and routing planes live in the same process environment).
        My recent scrap-paper designs go towards doing it *only* via the mpath prio and backend syncing, because this is a far healthier scheme.

        Linux pvmove per se is a nightmare choice, because it has (had?) very entertaining features like not being able to resume after a crash or simply stopping to copy data – and the mere fact that instead of ADDING MIRRORS you always operate on a HOT set of data tells me this is not for me.

        The recent thread about SSD-loss handling in flashcache is giving me a chill. They’re patching it now, but they didn’t think to include it in the design phase. And the alternatives are something nicely designed by a guy at home and something designed by people who (apparently) have not been touched by admin responsibilities. In my world, if you lost data once, you were given hell for months. If you lost it twice, you got sued, and you could only hope it was still in the 6-digit range. A design that doesn’t add redundancy along the way can go to hell as far as I’m concerned.

        It’s not reassuring if the best choice is the software from a scale-out-based social network, but flashcache is definitely the only usable thing out of the three.

        Final question:
        If we run experimental stuff anyway…
        Ceph + flashcache. Ceph doesn’t have much in place for ensuring locality at the moment, so it might be fun to put a few SSDs in a flashcache on top of /dev/rbd on a system. (This is pretty much exactly what Amplidata does, even though they have different cache paths for read and write IO, so you can use SSDs for random reads and 15K disks for bumping up write bandwidth.)

        What do you think about the combination? Where would you put the caches?
        (/me has the advantage of low net bandwidth over IB, so it might be smarter to cache in each OSD node)

        p.s.: the typo with ARC/ZFS was autocorrected while reading 🙂

  4. Umm, you are aware that all serious modern drives have a capacitor that allows them to empty their cache and park the heads?

    Who would be crazy enough to buy hard drives that don’t (if they even still exist), and disable their write cache?
    But hey, what crazy person has servers, but no battery backup for them? (Except maybe the little home server, but that doesn’t count.)

    • Hi,

      First of all: the article is old, and while there’s still some time to go until everyone runs reasonably current kernels, barriers DO work quite well by now.

      The idea of barriers is being able to say “ALL stuff before this mark must be committed before we go on with ANY of the stuff behind it.”
      That way not only power failures are covered, but e.g. also
      reordering in the buffer cache (like with /dev/loop, sigh – but that’s really a home-box-server-only issue 🙂) or partial failures (let’s say one of your datacenters fails and you’d LOVE to have a pretty much consistent state on all LUNs in the other).
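
      As a toy sketch of the ordering a barrier buys you (a pseudo-journal in shell, with sync standing in for the flush/barrier):

          echo "begin txn 42"  >> journal    # 1. journal record for the transaction
          sync                               #    everything above is stable before we continue
          echo "new contents"  >  datafile   # 2. the actual data update
          sync                               #    data is stable before the commit record
          echo "commit txn 42" >> journal    # 3. only now does the transaction count as committed

      Without that ordering, the commit record can hit the platter before the data it refers to, and a power cut leaves you with a “committed” transaction pointing at garbage.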

      Greetings

    • Funnily enough – never trust disk vendors: Samsung is very successfully selling the 830 / 840 series of SSDs, which come without a cache-protecting capacitor 😉

  5. Pingback: How to put both copies of the Linux LVM mirror log on separate legs
