Nagios MooseFS Checks #2

For the curious, here you see the point where MooseFS is done detecting undergoal chunks and you can watch the rebalancing happen.


WARN – 9287 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9294 chunks of goal 3 lack replicas
WARN – 9295 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9287 chunks of goal 3 lack replicas
WARN – 9283 chunks of goal 3 lack replicas
WARN – 9279 chunks of goal 3 lack replicas
WARN – 9273 chunks of goal 3 lack replicas
WARN – 9262 chunks of goal 3 lack replicas
WARN – 9254 chunks of goal 3 lack replicas


As you can see, the number of undergoal chunks is dropping by the minute.
In Check_MK you could even use the internal counter functions to give an ETA for completion of the resync.
And that’s where real storage administration starts… 🙂
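The ETA idea can be sketched in a few lines of plain Python. This is not Check_MK's counter API, just the arithmetic behind it: take the first and last reading, derive a repair rate, and divide the remaining chunks by it. The function name and the 5 second spacing between readings are assumptions made for illustration; the counts are the decreasing readings from the log above.

```python
def eta_seconds(samples):
    """Estimate seconds until the undergoal count reaches zero.

    samples: list of (timestamp_seconds, undergoal_chunks), oldest first.
    Returns None while the count is not decreasing.
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return None
    rate = (c0 - c1) / elapsed  # chunks repaired per second
    if rate <= 0:
        return None
    return c1 / rate


# Decreasing readings from the log; the 5 s spacing is an assumption.
counts = [9295, 9291, 9287, 9283, 9279, 9273, 9262, 9254]
samples = [(i * 5, c) for i, c in enumerate(counts)]

eta = eta_seconds(samples)
print(f"ETA: {eta / 3600:.1f} h" if eta is not None else "ETA: unknown")
```

While the master is still scanning, the count grows and the function returns None, so an ETA only makes sense once the first loop is done.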

MooseFS Nagios Checks

MooseFS is a really robust filesystem, yet this shouldn’t be an excuse for bad docs and no monitoring.

So let’s see:


I just marked a disk on a chunkserver for removal by prefixing the path in /etc/mfs/mfshdd.cfg with an asterisk (*). Next, I started running the check in a loop, and after seeing the initial “OK” state, I proceeded with /etc/init.d/mfs-chunkserver restart. Now the cluster’s mfsmaster finds out about the pending removal:

This is what the output looks like after a moment:

dhcp100:moosefs floh$ while true ; do ./ ; sleep 5 ; done

OK – No errors
WARN – 11587 chunks of goal 3 lack replicas
WARN – 10 chunks of goal 3 lack replicas
WARN – 40 chunks of goal 3 lack replicas
WARN – 70 chunks of goal 3 lack replicas
WARN – 90 chunks of goal 3 lack replicas

As you can see, the number of undergoal chunks is growing – this is because we’re still in the first scan loop of the mfsmaster. The loop time is usually 300 or more seconds, and the number of chunks checked during one loop is usually also throttled, e.g. at 10000 (at 64MB per chunk that equals 640GB).

In my tiny setup this means after 300s I should see the final number – but during this time there will also be some re-balancing to free the marked-for-removal chunkserver. I already wish the check were outputting perfdata for some fun graphs.
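Since perfdata came up: a minimal sketch of how it could be bolted onto the existing output line. The function name, the `undergoal` label and the warn threshold of 1 are assumptions for illustration and not part of the actual check; only the `| label=value;warn;crit;min` layout is the standard Nagios plugin perfdata format.

```python
import re

OK, WARNING = 0, 1  # standard Nagios plugin exit codes


def with_perfdata(line, warn=1):
    """Append perfdata to one line of check output.

    Returns (output_line, exit_code). Lines without a chunk count
    (such as the OK line) are reported as 0 undergoal chunks.
    """
    match = re.search(r"(\d+) chunks of goal \d+ lack replicas", line)
    undergoal = int(match.group(1)) if match else 0
    state = WARNING if undergoal >= warn else OK
    # perfdata fields: value;warn;crit;min (crit left empty here)
    return f"{line} | undergoal={undergoal};{warn};;0", state


text, state = with_perfdata("WARN - 90 chunks of goal 3 lack replicas")
print(text)  # a real plugin would also exit with `state`
```

With that in place the graphing side gets one data point per check run, which is exactly what the growing and shrinking counts above are begging for.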


Lesson for you?

The check interval for my check should be equal to the loop time configured in mfsmaster.cfg.


Some nice person from TU Graz also pointed me at a forked repo of the mfs python bindings, and there are already some more Nagios checks:


Make sure to check out

I’ll also testride this, but probably turn it into real documentation in my wiki at Adminspace – Check_MK and Nagios

Mostly, I’m pondering how to really set up a nice storage farm based on MooseFS at home, so I’m totally distracted from just tuning this check 🙂

Filesystem reliability and doing what my guts tell me…

On Linux you have a wide range of available filesystems – making a choice is never easy.

I just wanted to summarize what I’ve been telling my class attendees over the last years, what I’ve seen in live setups, and what I’m actually DOING.

  • EXT4 – generally, I DO hate ext FS. For me it’s hyped by people who will simply blame your hardware once you lose your data. My rule of thumb is that I’m using ext on recent Linux kernels where the block_validity option is available. Beyond this, I’ll also set the following options:
  1. on error: panic – if we have a read/write error that is persistent or causes a journal abort, just ZAP the box.
  2. discard
  3. data=journal or ordered, depending on importance of the server. It has shown up to 30% impact for me, but it’s a choice you can make.
  4. checktime / check interval – both to 0. I’d rather trust checksumming, and I wouldn’t object to a full fsck once a year
  5. possibly also make the journal bigger. Ideally you’d be able to use an external journal – I recommend against it b/c you can never trust devs and it would not be fun to see your fsck not supporting the ext. journal
  6. journal_checksum is a lot more important but also a work in progress, especially if your kernel version still starts with 2.6. Without this option ext doesn’t really notice shit about aborted writes or a corrupted journal. But in some versions it’s also the plain default. It’s a mess.
  • XFS – I noticed this is what I actually use if it’s my own system, meaning I do have the highest trust in XFS. This is kinda funny since it’s a mid-1990s filesystem with a focus on performance. So far we’ve stayed friends. If the system is a 2.6 one I’ll definitely go for XFS. XFS has also turned out most stable for the Ceph devs in their benchmarks, so it’s not just my guts, it’s also quite proven where others have indeed failed. For production use on RHEL, there’s the option to get the XFS feature channel and thus run XFS with Red Hat’s support
  • JFS – JFS is what AIX users know as JFS2, and is the most modern of all prod-grade FS in my comparison here. It was new and shiny in the early 2000’s, I think around 2004. It has been proven to be superior in small-file performance, so if 100k’s of files in a directory is something that comes into your use case, JFS is something to look at. The problem is that JFS is badly integrated in most distros. If you find out it’s the best-performing for you and you *need* it in production, my advice is to get your OS support via IBM and let them deal with it.
  • VxFS – this is what you commonly see used in serious envs looking at integrity and performance. It’s the most scalable and powerful of the lot and has the most features (heh, btrfs, go cry), but it DOES COST MONEY. If you might have use for extra features like split-mirror backups on a different host etc. then it is a good choice and the price acceptable for what you’re getting.
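To make the EXT4 “bells and whistles” from the list above concrete, here is a small sketch that assembles them into an fstab line. The option names (errors=panic, discard, data=journal / data=ordered, block_validity, journal_checksum) are regular ext4 mount options; the helper names, the device /dev/sdb1 and the mountpoint /srv/data are made up for illustration. Check the options against your kernel before trusting them with data.

```python
def ext4_mount_options(important=True):
    """Build the ext4 option string described in the list above."""
    opts = [
        "errors=panic",                                   # zap the box on persistent errors
        "discard",
        "data=journal" if important else "data=ordered",  # slower, but safer
        "block_validity",
        "journal_checksum",
    ]
    return ",".join(opts)


def fstab_line(device, mountpoint, important=True):
    """Return a complete /etc/fstab entry for an ext4 filesystem."""
    return f"{device} {mountpoint} ext4 {ext4_mount_options(important)} 0 2"


# Hypothetical device and mountpoint.
print(fstab_line("/dev/sdb1", "/srv/data"))

# Mount count and check interval live in the superblock, not in fstab:
print("tune2fs -c 0 -i 0 /dev/sdb1")
```

The bigger journal from point 5 is also set with tune2fs / mke2fs rather than at mount time, so it is left out here.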

Takeaway –

old distros like RHEL/CentOS/OEL(*) or Debian: Consider XFS

new distros and you wanna have a somewhat standard setup: Consider EXT4 but _with_ the bells and whistles.

ZFS / Btrfs not included on purpose. If you think you can put your data on those already, then that’s fine for you, but not for everyone. (of course I run them for testing… silly)

VxFS – cool for your prod servers. If you are dealing with real data (let’s say, like a telco’s billing or other places where they move a netflix’ yearly revenue each day) you will most probably end up migrating to VxFS in the long run. So you might just start with it…

If it’s my system, my data – I just grab XFS. The main reason to pick something different was usually if there are other people who might need to handle an error and who don’t know anything but fsck.

Running Ceph? I just grab XFS, anything else is too shady – one of many many similar experiences.

23:42 < someguy> otherguy: yes, I as well with BTRFS. I moved to XFS on dev.
Prod has always been XFS.

If it’s a prod system with data worth $$$$$$? I’d not think about anything but VxFS.

SSD Failure statistics

I just got forwarded an article from c’t along the lines of “SSDs aren’t as problematic as they used to be”.

Which is true, but encouraged me to make a count of the ones I’ve actually used and how many of them really failed. In general I have made peace with SSDs by accepting that they are a part that wears out and just needs to be replaced sometimes. That way you put more thought into periodic maintenance, firmware updating etc. and less time into sobbing about lost data.

The models I’ve used / installed somewhere.

  • The start is made by two stone-age 60GB Samsungs. They used to be horrible, and even with a TRIM-supporting OS they were often stuttering and hanging for minutes. A few years later I needed two SSDs for a friend, gave these two a full wipe and new firmware, and they have been running fine ever since. This shows how much Samsung has learned in terms of firmware.
  • 7 OCZ Vertex II 60GB – all running OK
  • 1 OCZ Vertex II 120GB – immediate and total data loss. Irrecoverable, at that. I know two more people with the same experience. I would guess some metadata corruption, since there are documented ways of resetting the SSDs. The sad part is mostly that it’s a typical “what could go wrong there” issue. With some better design it would just drop to read-only and have some metadata consistency helpers.
  • 1 Intel 510 120GB – not quick by any means, but solid. Given that it uses a cheapo controller, it was not my best buy, but still … I like solid!
  • 2 Intel 320 120GB – everything great, quick one, too.
  • 3 OCZ Vertex III – 120GB – all doing fine
  • 2 Samsung 830 256GB – doing fine and I trust them.
  • 8 Samsung 830 120GB – 7 are doing fine and I trust them, one is having major hiccups and has trashed its LSI raid brothers and sisters while at it. Still testing with a lot of interest why this happens.
  • At work we have some more ‘cheap’ SSDs, one out of 10 seems to have issues.
  • $colleague also had a cheap SSD that failed, but it did so slowly and he had time to get his data off it and is now using one of the 320’s.

That leaves us with the following numbers:

Out of 36 total SSDs:

  • 1x died a slow death
  • 4x were constantly going nuts
  • 3x chose instant and complete dataloss[tm]

-> that gives roughly a 20% chance of issues. More than I ever felt it would be.
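For anyone who wants to redo the math, the tally above works out like this. The numbers are taken straight from the three bullets and the stated total of 36; the dictionary keys are just shorthand.

```python
total = 36
failures = {
    "slow death": 1,
    "constantly going nuts": 4,
    "instant and complete data loss": 3,
}

failed = sum(failures.values())
rate = failed / total
print(f"{failed} of {total} SSDs had issues: {rate:.1%}")
# prints: 8 of 36 SSDs had issues: 22.2%
```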

Fine print / details for those who care:

The SSDs are normally “underprovisioned”, i.e. I only partition something like 80% of their space. Sometimes I allocate just 40%, e.g. for ZIL on ZFS. On the downside, the SSDs that run in raid configs are of course not seeing TRIM support. There I sometimes run a SATA secure erase to freshen them, but not as often as I planned. On the other hand, they also don’t get heavy usage at all.

I had investigated and planned to make a larger sata reserve area (and I think I did it on ONE of them, in a raid that happens to do it for all 🙂) but got blank stares from some people running many times more SSDs, so I put it off.

As for why hardware raid – because the CPU overhead is so much lower with a good HBA over software raid. Lower CPU overhead means higher throughput if you’re intending to also run applications on the server. Normally SW Raid on Linux does better scaling out to multiple cores, but also needs substantially more power to move the bits – even on Raid0.

E.g. my desktop topped out at 1.2GB/s (due to controller limits I think) with a CPU usage of 3 Cores @ 100%, whereas the same box with an older LSI Raidcontroller + some of the onboard ports hit 2.4GB/s at 2 Cores @ 100%

(But it got sluggish, probably PCIe was totally exhausted)

Cloud downtimes as they should be

OrionVM of Australia, the cloud IaaS hosting company with the fastest cloud storage ever built, seems to have had their first outage today.

  • It was not affecting their cloud
  • Everything stayed available
  • no data was lost
  • It was a DDoS on a router on the frontend side of the network, not in the backend
  • In TOTAL it just took 4 minutes.
  • It was quickly detected and reported by them

This is how it should be.

This is what people should demand.



Harddisk prices skyrocketing everywhere

Everywhere? No, not exactly everywhere!

I had been all over Munich’s computer shops and the prices were really insane, 1TB over 100 Euros, and 2TB right there around EUR 200.

Well, then I remembered the Karstadt “mall” and well… external 2TB USB drives from 70 Euros and up. The one I bought was a 1.5TB Hitachi drive, sadly 5400rpm, but still cheap compared to the other places.

Karstadt obviously didn’t get the news, grab while they still got plenty.

persistent block device naming

This article at the Arch Linux wiki first impressed me since it gave a good overview of the options using LABEL, UUID and disk/by-path & by-uuid. They even covered doing these things in the initrd, which is a nice idea.

The caveat is that the article does not at all go into the topic of persistent block device naming, that means writing udev rules that make a certain device show up as /dev/block/emc_clariion_1234_lun0 or the other big issue with linux device names, which is generating names based on the remote target ports (iscsi, fc, ib, …).

But this is kind of a thing Apple would have done:
Instead of describing half-assed features that you may or may not get to work, they listed the options that DO actually work. So what the authors did there was probably the smartest thing to do.

On a side note, kudos to Data Direct Networks.

They in fact *did* find the holy grail of Linux storage device naming, and use udev to name their block devices by resolving the WWN into their internal LUN IDs. This is something that Veritas has done for ages (look up enclosure based naming) via vendor supplied libraries for the multipath layer VxDMP.

As far as I know DDN has been the first company to ever do this using native Linux tools, thus perfectly integrating into the OS.

Quoting from a mail about the multipath names with DDN storage:

Ah, sorry about that, yes, DDN did install a rule:

KERNEL=="sd?", SUBSYSTEM=="block", ATTRS{vendor}=="DDN*", SYSFS{model}=="S2A*", RUN+="/usr/local/sbin/tune_s2a %k"

This is what the resulting mpath device looked like:

SDDN_S2A_9900_1308xxxxxxxx dm-13 DDN,S2A 9900
\_ round-robin 0 [prio=0][active]
\_ 3:0:1:11 sdaj 66:48 [failed][undef]

Well OK, the individual paths cannot be recognized like this but they’re a million miles ahead of the pack anyway :>
Veritas of course can list the array ports, too, but DDN could easily add this.

About Linux IO drain #2

I’ll start writing down how I think Linux memory handling can be set up to not fuck up and go wrong. This is too much for one person to test in their evenings, but I think starting the checklist for this thing is something one guy can very well do 🙂

The article will best go to my adminspace wiki..

Our ingredients:

Page out / buffer dirty rate

cgroups memory limits:

Establish a strict separation between resources intended for the OS and resources for applications.

Check impact on buffer cache size

Form one cgroup for the OS

Form one more cgroup for the Apps

Document how the redhat tool for saving cgroup states over reboots is used

Make this one re-usable so if you add another App to the server you copy the cgroup settings and effectively break the resources in half by default.
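A rough sketch of the “break the resources in half” arithmetic for the memory side. The function name, the 25% reserve for the OS cgroup and the app names are assumptions for illustration; the resulting byte values are what one would write into each cgroup's memory limit.

```python
GIB = 1024 ** 3


def split_memory(total_bytes, apps, os_share=0.25):
    """Return a memory limit in bytes for the OS cgroup and each app cgroup.

    The OS gets a fixed share (assumed 25% here); the apps split the rest
    evenly, so adding a second app halves what the first one had.
    """
    os_limit = int(total_bytes * os_share)
    per_app = (total_bytes - os_limit) // len(apps)
    limits = {"os": os_limit}
    limits.update({name: per_app for name in apps})
    return limits


# Hypothetical 16 GiB box, first with one app, then with two.
print(split_memory(16 * GIB, ["app1"]))
print(split_memory(16 * GIB, ["app1", "app2"]))
```

The point of doing it by default is that nobody has to renegotiate limits when the second app shows up; the impact on buffer cache size still has to be measured, as noted above.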

IO scheduler:

Number of queues

The kernel scheduler only has 64 queues. Someone *really* needs to test how behaviour changes from

  1. 8 concurrent IOs to 4 disks to
  2. 256 concurrent IOs to 100 disks
I feel many things are tripping there. At least from what I’ve seen.


Imho people will be randomly recommended to use cfq or deadline if they report disk issues, but they’re rarely recommended to actually *use* cfq(2) to prioritize IO writers up or down.
The hard point with cfq is that it can’t really handle multiple writers in the same class killing each other (lacking a bucket mechanism).
Even if both VMs are run in the “idle” qos class, VM A running a DD will always impose horrible lag to VM B that is just web serving. CFQ can save your server from abusive processes, but not protect processes of the same kind / qos level.
So, when you really turn to deadline you’ll find it can’t do any priorities, meaning you suddenly lose the “save the server” feature. At least every process on the system will feel the same pain, in theory.
Then, according to the docs, for highend arrays etc. you’ll want to use NOOP, which is fine from a scheduler point of view, but of course means now you don’t have any classification any more, so a fast reader / writer or any disk with insanely high queue & serve time will now kill your server.
The usual “deadline” vs. “noop” vs. “cfq” benchmarks are concerned with desktop PCs using a single SSD.
I think it is becoming necessary to benchmark them against BBU-backed RAID HBAs with 512MB of cache first – but not only those.
The real challenge for answering if NOOP is beneficial is:
real storage with concurrent writers (that means, a dozen or more linux servers, smarty) pumping into 512GB cache – something that to my knowledge has never been looked into, although it’s far more important when it comes to scheduler tuning. (No, sorry, scale-out toys do not matter in this respect; any old DMX-3 will do just fine.)
If deadline handles with the usual <1% difference to NOOP in this scenario there might be a winner.
’nuff for the first few thoughts
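Before benchmarking any of this it helps to know which scheduler a disk is actually using. The kernel exposes that per device in /sys/block/<dev>/queue/scheduler, with the active one in square brackets. A minimal sketch for reading it; the function names are made up, and the sample string is typical output on a 2.6-era kernel.

```python
from pathlib import Path


def parse_scheduler(text):
    """Parse a queue/scheduler line such as 'noop deadline [cfq]'.

    Returns (active, available); active is None if nothing is bracketed.
    """
    names = text.split()
    active = next(
        (n[1:-1] for n in names if n.startswith("[") and n.endswith("]")),
        None,
    )
    available = [n.strip("[]") for n in names]
    return active, available


def active_scheduler(device):
    """Read the active scheduler for a block device, e.g. 'sda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())[0]


print(parse_scheduler("noop deadline [cfq]"))
# prints: ('cfq', ['noop', 'deadline', 'cfq'])
```

Writing one of the available names back into the same file switches the scheduler for that device, which makes it easy to loop over cfq, deadline and noop in one benchmark run.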

Food for reading:

About Linux IO drain

Found this at the linux-ha archives and loved it.
Perfect description of what Admins see and scheduler guys rarely understand.

Oh, and I should just comment: Big-memory Linux systems seem to get
into trouble with _large_ amounts of buffered writes. Even an
mke2fs on a big partition can show this. It seems that vast
amounts of memory are eaten by the write buffers so the system
starves for memory, and at the same time there is no disk cache
because it has been eaten by the write buffers, so every command
or library read has to go to disk (causing lots more I/O), but of
course the disk queue is full of writes, so reads are slow …
you end up with everything that does disk I/O showing up in state
‘D’ in a process listing.
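To watch this happen on a live box, the Dirty and Writeback fields in /proc/meminfo show how much memory is sitting in write buffers. A minimal sketch that computes their share of total RAM; the function names are made up and the sample numbers are invented to keep the example self-contained. On a real system you would feed it the contents of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def write_buffer_share(meminfo):
    """Fraction of total RAM that is dirty or currently being written back."""
    pending = meminfo.get("Dirty", 0) + meminfo.get("Writeback", 0)
    return pending / meminfo["MemTotal"]


# Invented sample values, in the layout of /proc/meminfo.
sample = """MemTotal:       16384000 kB
MemFree:          204800 kB
Dirty:           3276800 kB
Writeback:        819200 kB
"""

share = write_buffer_share(parse_meminfo(sample))
print(f"{share:.0%} of RAM is waiting to be written")
```

If that share keeps climbing during a big mke2fs or dd while everything else sits in state ‘D’, you are looking at exactly the situation described in the quote.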

Ceph on FreeBSD

comes to

Someone (on #ceph) is preparing patches for Ceph to successfully build on FreeBSD!

He sounds dedicated and knowledgeable enough to make it work.
While chatting he said he’d try to plug it into the GEOM block layer framework and also add some ZFS spice.

On Linux you have /dev/rbd for blocklayer access (“RADOS block device”) and use LVM (and/or md) to architect more layers of storage on top of this, but on FreeBSD you can really just tuck it into GEOM and build anything you want. GEOM is “*one* tool for all” written by *one* guy, so it’s a lot less messy than mixing all kinds of different stuff like LVM & multipath & ecryptfs & EFI & MDraid & ionice to get to the same goal. (Of course there has also been the usual share of bugs, made worse by the fact that there are fewer users and fewer devs to find / fix them. But the design leans heavily on Veritas, which makes the non-broken things a charm to work with.)

Now, to come back to Ceph on FreeBSD… you’ll have a production-stable & tested ZFS version, an advanced block layer via GEOM and the most modern object storage/filesystem mix Ceph in one box. If you mentally add in HAMMER then *BSD is finally getting to one of the top positions when it comes to the most modern filesystems.

I guess in 1-2 years time there will be a lot of performance tuning work for the FreeBSD Core team 😉