How to break LizardFS…

To start with:
MooseFS and LizardFS are the most forgiving, fault-tolerant filesystems I know.
I’ve been working with Unix and storage systems for many years, and I like running things stably.

What does running a stable system mean to you?

To me it means I’ve taken something to its breaking point and learned exactly how it behaves at that point. Suffice it to say, I won’t let it get to that point again.
That, put very bluntly, is stable operation.
If we were dealing with real science and real engineering, there would be a sheet of paper stating the tolerances. But the IT world isn’t like that, so we have to find out for ourselves.

I’ve run a lot of very mean tests against first MooseFS and later LizardFS.
My install is currently spread over 3 Zyxel NAS boxes and 1 VM. Most of the data is on the Zyxel NAS boxes (running Arch), one of which also has a local SSD using EnhanceIO to drive down latencies and CPU load. The VM is on Debian.
The mfsmaster runs on a single Cubietruck board that can just barely handle the compute load.

The setup is sweating and has handled a few migrations between hardware and software setups.
And this is the point: it has been operating rock-solid for over a year.

How I finally got to the breaking point:
A few weeks back I migrated my Xen host to Open vSwitch. I’m using LACP over two GigE ports, which together serve a bunch of VLANs to the host. The reason for switching was to get sFlow exports, plus the cool feature of running 802.1q VLANs directly into virtual machines.

After the last OS upgrade (the system had been crashing *booh*) I had an Open vSwitch bug for about a week or two.
No network connection would initially work, i.e. every ping command would drop the first packet and then work.

In terms of my shared filesystem, this affected only the Debian VM on the host, which held only 1TB of data.
I’ve got most of my data at goal 3, meaning two of the copies were not on that VM.

Now see for yourself:

root@cubie2 /mfsmount # mfsgetgoal cluster/www/vhosts/zoe/.htaccess
cluster/www/vhosts/zoe/.htaccess: 3

root@cubie2 /mfsmount # mfscheckfile cluster/www/vhosts/zoe/.htaccess
chunks with 0 copies: 1

I don’t understand how this happened.

  • The bug affected one of four mfs storage nodes.
  • The file(s) had a goal of 3.
  • The file(s) were never touched from the OS during that period.

Finally, don’t run mfsfilerepair on a file with 0 copies left. I was very blonde there – but it also doesn’t matter 🙂
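For the record, here’s a hedged sketch of how I’d sweep a mount for such files. `find_lost_files` is my own helper name; `mfscheckfile` and its “chunks with 0 copies” output are taken from the snippet above. The checker command is passed as a parameter purely so the loop can be tried without a live cluster:

```shell
# Hedged sketch: walk a tree and print files whose chunk check reports
# zero remaining copies. Pass the real tool (mfscheckfile) as $1.
find_lost_files() {
  checker="$1"
  root="$2"
  find "$root" -type f | while read -r f; do
    # flag the file if its check output mentions chunks with 0 copies
    if "$checker" "$f" | grep -q 'chunks with 0 copies'; then
      echo "$f"
    fi
  done
}

# usage: find_lost_files mfscheckfile /mfsmount/cluster
```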


Amazon and us Germans?

We Germans are at our most honest on Saturday mornings around half past eleven.
That’s when we stand in line at the post office to pick up our Amazon parcel.
We’re sleepy and feel a deep-seated fear:
somewhere an upright Prussian could be lying in wait for us,
ready to demand an explanation for why we were still asleep until just now.

German angst.

Heartbleed, or why the Internet isn’t built with “Windows”…

Just talked with a chat acquaintance about his company’s customer advisory on Heartbleed…
20:17 <darkfader> hah and about the ssl thing
20:17 <darkfader> the funny part is
20:18 <darkfader> you know, that was 7 days ago now
20:18 <darkfader> 6 days ago we unix people were more or less done patching
20:18 <darkfader> and with the audit before that
20:18 <darkfader> and so on
20:18 <darkfader> and the windows world is just now compiling summaries of the affected products


Reminds me a lot of a recent discussion about the patching needs of a Unix environment

“If it’s so stable, why does it need updating so often?”

From my explanation, here’s a little…


Cut Ubuntu login CPU usage by 90%

I was just testing check_by_ssh at scale, running around 5000 ssh-based checks against the desktop system here.

One thing that puzzled me was that after adding passive mode, load actually went UP instead of down.

I saw a load of up to 11 after this change just to run check_dummy.

You could see it wasn’t accounted to any of the long-running processes except polkitd, so the conclusion was that it was related to some desktop bullshit written by the Ubuntu devs.

After some research: most of this comes from dbus and policykit spinning up useless desktop sessions for the ssh-based login. Since today’s distros have a lot of things tied into dbus, I couldn’t do much about this.

The real solution was found in a Stack Exchange post.

I deleted all the crap in /etc/update-motd.d/
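For anyone who’d rather not delete outright, a hedged sketch of a gentler variant (`disable_motd_scripts` is my own helper name; stripping the execute bit makes pam_motd skip a script and is easy to revert):

```shell
# Hedged sketch: instead of deleting the motd scripts, strip their execute
# bit so they get skipped at login; chmod +x brings any of them back later.
disable_motd_scripts() {
  dir="$1"
  [ -d "$dir" ] || return 1
  chmod -x "$dir"/*
}

# usage: disable_motd_scripts /etc/update-motd.d
```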

Now the system load is down to under 1 most of the time.

I don’t want to think about how many kWh those useless scripts waste on a planet-wide scale.

People, PLEASE don’t use Ubuntu if you don’t have to.

Cloudweavers first look

I’m sitting in Udine, Italy after a day with the Cloudweavers team and trying to collect my first impressions.

In case you missed out about it, Cloudweavers is a ready-to-use private cloud “system”.

It can scale from one hypervisor host to many, meaning you can gracefully virtualize your whole infrastructure.

OpenNebula is used as the management stack because of its small footprint and easy customization.

Under this there’s a distributed filesystem. It’s not Ceph but one of the more mature ones (MooseFS), meaning it would scale less well to 1000 hosts, but in return the POSIX FS semantics already work well and you can actually expect the data to be accessible at all times 🙂

At a 1000-host scale I have also felt a certain pain with just having one large RADOS cluster; I’d segregate into more islands to avoid large-scale losses. Carlo from Cloudweavers told me about ideas that are more along this line.


The cool part with Cloudweavers is that you can bring up a setup on one node (in minutes) and really scale out from there, simply adding hosts as you need them. They then become hypervisor nodes and storage nodes. Systems with an older CPU or a broken BIOS that can’t run KVM? Well, they’ll just be storage nodes. Automagically 🙂


The part that makes me happy is that they have opted for resilience in all parts they touched.

Data is mirrored many times, writes are only committed once they’re safely on another host, and the whole architecture is made to be self-fencing and to recover gracefully.


I had originally inquired with them back in – I think – February, looking at it from a cloud-hosting angle, since buying well-integrated software would run cheaper than home-building the full ceph/alpinelinux/xen/opennebula stack.

Now, after having had a first look, I think that was a good decision, even though that very stack has matured a lot in the meantime.

I wanted to design a platform, and my goal was to have one that is truly resilient:

  • It should be able to tolerate failures.
  • It should be able to keep running for months even if I got hit by a bus.

Cloudweavers seems to have built it.

LVM Mirroring #2

Hmm, people still look at my ages-old post about LVM all the time.

So, just a note from end-2013:

The mirror consistency stuff is not your worst nightmare anymore.

Barriers work these days, and I think it’s more important to concentrate on ext4 settings like “block_validity”. The chance of losing data due to an LVM mirror issue is much lower than the chance of unnoticed data loss in ext4 🙂
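A minimal sketch of what I mean (the device and mountpoint are made up; block_validity is a real ext4 mount option that enables runtime sanity checks on metadata block extents):

```
# /etc/fstab fragment – hypothetical device/mountpoint, real ext4 option
/dev/vg0/data  /srv/data  ext4  defaults,block_validity  0  2
```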

My LVM pain points, as of today, would be:

lvm.conf is a huge patchwork of added features; there should be an LVM maintainer who oversees the structure as features are added.

Instead it’s like a castle with a lot of wooden gangways (mirrorlog devices) and stairs (thin provisioning) on the outside, and no windows (read up on the “fsck” utility for thin pools, and TRY what happens when one runs full and then recover from it).

Some features require planning ahead, and the way it is now does not support that.

Reporting is still as bad as it used to be.

I’d be happy for someone to show me how to split out a snapshot + PV to a backup host, bring it back AND get a fast resync.

(Note: the PV UUID wouldn’t change in this. So if it doesn’t work, that hints at design flaws.)

Those are the pieces I worry about. And really, the way the project is adding features without specs, documentation, and (imho) oversight makes it look like some caricature of a volume manager.

How I feel about that:

Just look, I have added the feature the others were talking about.

And look at THIS: I now have an arm on my ass so I can scratch between my shoulders, too!

Example: LVM2 did away with the per-LV header that classic LVM had, so you don’t have a resource area to debug with, and there’s no BBR or per-LV mirror write consistency via the LV header. Instead they added an -optional- feature that wipes the start of an LV. So if you lose your config and rebuild an LV manually on the exact same sectors, but with a newer LVM config, it’ll wipe out the first meg of the LV.

A volume manager that introduces, after the design, a kind of LV format change and makes it WIPE DISK BLOCKS. I don’t care how smart you think you are: whoever came up with this should get the old Mitnick jail sentence: forbidden to use a computer.

I also still care about the bad layering of PV/LV/VG.

Storage management in the sense I’m used to is something I still don’t touch with LVM.

On the other hand I’m itching daily to actually make prod use of those exact modern features 🙂

But basically I just use it to carve out volumes; instead of pvmove I might use more modern, powerful tools like blocksync/lvmsync and work with new VGs.

Also, just to be clear:

I’m not saying “don’t use LVM” – I have it on almost every system and hate the ones without it. I’m just saying it’s not delivering the quality needed for consistently successful datacenter usage. If you set up your laptop with a 1TB root volume and no LVM and then have some disk-related issue – well, I’ll probably laugh and get you some Schnaps.

That being said, I wanted to write about more modern software that is actually fun to look at, next post 🙂

Open Source Backup Conference – Impressions

After 4 years I finally managed to visit this conference.

The former Bacula conference was recently renamed the “Open Source Backup
Conference” and, under the new name, took place in Cologne on Wednesday.
The idea of the rename was to extend the scope to open source backup
solutions other than just Bacula and its younger cousin Bareos.
This worked out pretty well; my personal highlight was a talk on an entirely
different piece of software named “relax and recover” (rear). It’s a Linux-only
system like Ignite for HP-UX or mksysb on AIX, covering emergency restores of
critical systems.
This kind of software is tuned to store the recovery images on different media
and to be highly reliable at the time of restore, even when confronted with a
blank or different system. So it’ll do things beyond what normal backup
software handles, e.g. setting up your RAID devices.

The story about Bareos is an interesting and sad one – apparently Bacula, in
its open source form, is halfway dead. There aren’t any recent commits in the
git tree, and there was even an incident where a new feature was pulled out of
the OSS version so it was only available to users of the commercial fork.

The people and companies behind Bareos are more OSS-minded than that and have
made sure the feature set is much closer to the enterprise Bacula version.

As a user, that’s of course great; things like lighter compression algorithms
or the “relaxed” SSL mode are quite interesting from a practically minded point
of view. They have also spent a lot of time removing stale code and are now
looking at ways to move forward.

A nice, well-designed build chain ensures that the code you download will
actually work, by running a really large set of regression tests in various stages.

There was also a nice talk by someone from (or related to) NetApp who managed
to show off a really well-designed large-scale backup system without delving
into marketing *at all*. Still, I got shiny eyes once he mentioned there’s an
EqualLogic array that offers native InfiniBand storage access.

What impressed me is that the backup community there is far friendlier than, say, the monitoring community. There was no visible competition among the different projects.
Well, save for the one guy from the Collax community server who set the low point
by doing a 3-4 minute talk about his product during someone else’s presentation. He did a good job of making sure I’ll recommend people avoid his stuff.

My own talk on monitoring Bacula was quite fun for me and generated many
really interesting questions. I slipped at one point by giving a very personal opinion on old stale Perl code 🙂

The feedback itself was split: some people were unhappy since, well, I
presented the current state of affairs (“bugs, bugs, bugs”), while others were
motivated to improve things. I stuck with the second group – or actually, they
more or less trapped me in a corner later on, discussing ways to add crucial checks, etc.
It seems we’re on a path to improving the monitoring a lot.
I’ve started on a lot of the basics, like collecting all the old checks in one place.


After that I ended up with a Bacula book, which is awesome since it’ll help me debug some issues and even has a nice chapter on pool management.

Ah, yeah, that was something I missed – there was little exchange about best
practices, like “how do you name your pools, how this, how that”.

That was just perfect at the OpenNebula Conference the next day. By then I’d also gotten a few valued hours of sleep.

On being Freelance and castles in the sky

This week it dawned on me how much potential I wasted before becoming self-employed. Over the years there were countless times when I just slammed my head into the desk because bosses or workmates wouldn’t understand some far-fetched thing I came up with.

Basically, any time I’m working on something, my head will consider and spin around ideas for improving it. And improving the improved version. And the one after that. And then suddenly I see the castle in the air… The one that will do away with most of the issues involved with how things are done[tm] and replaces it with something cleaner.

Some people have let me go ahead with it and have usually ended up quite happy with the final result, but most of the time they’re too busy with the current issues at hand to even grasp that I’m pointing them at a way that _removes_ the issues.

That was a waste of my time.

Another thing that took me long to understand is actually really easy – other people have other goals, other visions. So if you work for someone, well, it’s obvious now:

You don’t get to choose, you don’t get to design, you don’t get to do the things that prove smarter in the long run. Someone else is choosing for you, and as long as you keep that role, it won’t change.

Design a SAN with an iSCSI frontend. After the design, $boss cuts out the two $300 LAN switches, handing you a leftover 8-port D-Link?

Fucking stupid – tell me in the first place and I’ll account for bad equipment, but don’t drop it in AFTER the design. It never works like this!


That’s what I’m talking about. That’s how people force decisions on you and turn your work into a piece of shit, wasting countless money on the future impacts and crashes of their own platform, where they could have had something that would have worked.


I really think in very long terms when it comes to system design and was always puzzled about decisions being made that would hurt the respective decider for years to come.

(Check_MK wise: not using PAXOS for the replication, no proper WATO api, …)


Often I had proposed a more reliable way in the start, to no avail.


Well to cut it short:

I can’t waste time on people who don’t see or share my visions. I’d rather go out and talk to people who understand them. They’re out there.

And this is where the one prerequisite comes in:

Evolving to your full potential and making your visions come true requires not being someone else’s employee.

It will also definitely require skipping a few nights’ sleep. So what.

NSA 325 write performance explained


Just found out some more about the benchmark-winning write performance. I should have noticed this earlier, but this is what check_mk picked up at inventory time:

('nsa', 'mounts', '/i-data/ef9c2dc9', ['barrier=0', 'data=writeback', 'noatime', 'rw', 'usrquota']),


Turning off barriers and setting writeback is _questionable_. On the other hand, I’m just storing backups on it, in a RAID0. Still, it makes me wonder how other owners would rate those settings if they knew…
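If you want to sweep your own boxes for this, a hedged sketch (`check_mount_opts` is my own helper name; the option names are straight from the inventory line above):

```shell
# Hedged sketch: warn about the risky ext4 options seen above, given a
# comma-separated mount-option string as found in /proc/mounts.
check_mount_opts() {
  opts=",$1,"
  case "$opts" in *,barrier=0,*) echo "WARN: write barriers disabled";; esac
  case "$opts" in *,data=writeback,*) echo "WARN: data=writeback journal mode";; esac
  return 0
}

# usage: check_mount_opts "barrier=0,data=writeback,noatime,rw,usrquota"
```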

Linux storage goodies

The #ceph guys asked why I picked Fedora for the storage classes.

This is my list of stuff that needs to work / be easily installable for attendees to make a distro useful.

  • parted
  • ceph
  • flashcache
  • iscsi initiator
  • lio
  • multipath
  • lvm thin provisioning (thinp)
  • cgroups
  • zfs
  • ideally even VxVM / VxFS