MooseFS is a really robust filesystem, but that shouldn't be an excuse for bad docs and no monitoring.
So let’s see:
I just marked a disk on a chunkserver for removal by prefixing its path in /etc/mfs/mfshdd.cfg with an asterisk (*). Next I started running the check in a loop, and after seeing the initial "OK" state I proceeded with /etc/init.d/mfs-chunkserver restart, which is what makes the cluster's mfsmaster find out about the pending removal.
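The marking step, with made-up mount points for illustration, looks roughly like this in /etc/mfs/mfshdd.cfg:

```
# /etc/mfs/mfshdd.cfg (mount points are hypothetical)
/mnt/mfschunks1
# the leading asterisk marks this disk for removal;
# its chunks will be replicated elsewhere before you pull it
*/mnt/mfschunks2
```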
This is what the output looks like after a moment:
dhcp100:moosefs floh$ while true ; do ./nagios-moosefs-replicas.py ; sleep 5 ; done
OK - No errors
WARN - 11587 chunks of goal 3 lack replicas
WARN - 10 chunks of goal 3 lack replicas
WARN - 40 chunks of goal 3 lack replicas
WARN - 70 chunks of goal 3 lack replicas
WARN - 90 chunks of goal 3 lack replicas
As you can see, the number of undergoal chunks is growing; this is because we're still in the first scan loop of the mfsmaster. The loop time is usually 300 seconds or more, and the number of chunks checked during one loop is usually also throttled, e.g. at 10000 (at 64MB per chunk, that equals 640GB).
In my tiny setup this means that after 300s I should see the final number, though during this time there will also be some rebalancing to free up the marked-for-removal chunkserver. I already wish the check were outputting perfdata for some fun graphs.
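Adding perfdata would be a small change. Here's a hypothetical sketch of how the check could format its output; the function name, thresholds, and the hard-coded undergoal count are all made up, and in the real check the count would come from parsing the mfsmaster chunk matrix:

```python
#!/usr/bin/env python
# Hypothetical sketch: Nagios-style output with perfdata appended after '|'.
# Perfdata format per plugin convention: 'label'=value[;warn;crit;min;max]

def nagios_output(undergoal, goal=3, warn=1, crit=100000):
    """Return (exit_code, plugin_line) for a given undergoal chunk count."""
    perfdata = "undergoal=%d;%d;%d;0" % (undergoal, warn, crit)
    if undergoal >= crit:
        return 2, "CRIT - %d chunks of goal %d lack replicas | %s" % (undergoal, goal, perfdata)
    if undergoal >= warn:
        return 1, "WARN - %d chunks of goal %d lack replicas | %s" % (undergoal, goal, perfdata)
    return 0, "OK - No errors | %s" % perfdata

if __name__ == "__main__":
    code, line = nagios_output(11587)  # count hard-coded for illustration
    print(line)
```

With that in place, pnp4nagios or similar would graph the undergoal count over time, which would have made the rebalancing after the restart nicely visible.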
Lesson for you?
The interval of my check should match the loop time configured in mfsmaster.cfg.
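If I recall the option names correctly, the relevant knobs in mfsmaster.cfg look like this; the values shown are the defaults as I remember them, so verify against your own mfsmaster.cfg.dist:

```
# /etc/mfs/mfsmaster.cfg (option names from memory; check mfsmaster.cfg.dist)
# seconds per full chunk-scan loop; set the check interval to match this
CHUNKS_LOOP_TIME = 300
# throttles on replication work per loop
CHUNKS_WRITE_REP_LIMIT = 2
CHUNKS_READ_REP_LIMIT = 10
```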
Some nice person from TU Graz also pointed me at a forked repo of the mfs Python bindings, which already contains some more Nagios checks. Make sure to check it out.
I'll also test-drive this, but probably turn it into real documentation in my wiki at Adminspace – Check_MK and Nagios
Mostly, I’m pondering how to really set up a nice storage farm based on MooseFS at home, so I’m totally distracted from just tuning this check 🙂