Check_MK support for Allnet 3418v2


A friend of mine has this thermometer and asked me to look into monitoring and setup.

I don’t think I ever put as much work into monitoring such a tiny device. Yesterday evening, and well into the night, I stabbed at it some more and finally completed the setup and documentation. I literally went to bed at 5am because of this tiny sensor.

To save others from this (and to make sure I have reliable documentation for it…), I’ve made a wiki article out of the pretty tricky setup. Along the way I even found it still runs an old OpenSSL.

You can check it out here:

http://confluence.wartungsfenster.de/display/Adminspace/Monitoring+Allnet+3418v2

The Bitbucket version isn’t committed yet; I hope to do that in a moment… :p
One interesting hurdle was that I couldn’t build a Check_MK package (using mkp list / mkp pack), since I also needed to include things from local/lib and similar folders. When I visit the MK guys again I’ll nag about this.

 

 

They have really pretty meters in their UI by the way.

Would hope something like it makes it to the nagvis exchange some day.

Edit note: I initially wrote that it has an “affected OpenSSL”. It seems they built it back in 2012 without the heartbeat extension, which is a nice and caring thing to do.
It’s still goddamn outdated.


LOPSA Mentorship & Monitoring


I’m THIS excited!

Mozilla

Recently someone asked on the lopsa-mentorship lists for some help with improving the monitoring for the community project he works for.

The one whose logo above _everyone_ knows 🙂

I offered to help since, well, monitoring!

Now I’m waiting to get in touch and then answer / guide him with any monitoring issues he finds.

Waiting. Excitedly. I already prepped a page of questions. Can’t wait. So excited.

I hope we can settle on Check_MK instead of anything outdated, but we’ll see. Not gonna push something on him, there are more interesting questions than what software to use.

For example: identifying the actual services provided and seeing their dependencies (e.g. if a build of this piece fails today, there’s no new version next week). And since I won’t be the person doing the work, it’ll be a much bigger challenge:

Finding the essence of why and how to monitor what.

 

About LOPSA Mentorship:

The League of Professional System Administrators (LOPSA) has a mentorship program where beginning sysadmins, or those starting on a new topic, can ask for help. This dates back to when things were called the System Administrators Guild (SAGE).

I remember I wrote there looking for help back in 2001/2002 when I got my first “serious” sysadmin job. I checked the options “hundreds of servers”, “production”, “lack of prior experience” and something like “HELP!”. No one replied.

I joined the mentorship program to help people not have this happen again.

 

About LOPSA:

LOPSA itself is the largest standing organization of system administrators.

It offers exchange of ideas and practices. This is extremely helpful for professional sysadmins, since we normally don’t have anyone outside of our current gig to compare our ideas with. And normally we tackle more complex tasks than most “DevOps” scenarios cover, so looking out on the internet will also just send you crying. LOPSA fills the gap, getting you in touch with more experienced and fresh sysadmins in an informal way.
Beyond that it also guides by setting some rules, e.g. with a Code of Ethics.

I’ve translated the latter into German – so I’m quite bound by the Code of Ethics. 🙂

 

 

Nagios MooseFS Checks #2


For the curious, here you see the bingo point where MooseFS is done detecting undergoal chunks and you can watch the rebalancing happening.

 

WARN – 9287 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9294 chunks of goal 3 lack replicas
WARN – 9295 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9287 chunks of goal 3 lack replicas
WARN – 9283 chunks of goal 3 lack replicas
WARN – 9279 chunks of goal 3 lack replicas
WARN – 9273 chunks of goal 3 lack replicas
WARN – 9262 chunks of goal 3 lack replicas
WARN – 9254 chunks of goal 3 lack replicas

 

As you can see the number of undergoal chunks is dropping by the minute.
In Check_MK you could even use the internal counter functions to give an ETA for completion of the resync.
And that’s where real storage administration starts… 🙂
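
As a rough sketch of that ETA idea (plain Python, not Check_MK's own counter functions): take the oldest and newest sample of the undergoal count and extrapolate linearly. The function name and the sample format are made up for illustration, and a real resync will rarely be this linear.

```python
def estimate_eta_seconds(samples):
    """Estimate seconds until the undergoal count reaches zero.

    samples: list of (timestamp_seconds, undergoal_chunks), oldest first.
    Returns None if there is not enough data or the count is not shrinking.
    """
    if len(samples) < 2:
        return None
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = float(t1 - t0)
    drained = c0 - c1
    if elapsed <= 0 or drained <= 0:
        # still in the detection phase, or nothing has been replicated yet
        return None
    rate = drained / elapsed          # chunks fixed per second
    return c1 / rate


# hypothetical samples: (seconds since start, undergoal chunks)
eta = estimate_eta_seconds([(0, 9295), (60, 9254)])
```

While the count is still growing (the detection phase shown in the first lines of the output above) the function returns None, which is probably what a check should report as "no ETA yet".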

MooseFS Nagios Checks


MooseFS is a really robust filesystem, yet this shouldn’t be an excuse for bad docs and no monitoring.

So let’s see:

 

I just marked a disk on a chunkserver for removal by prefixing the path in /etc/mfs/mfshdd.cfg with an asterisk (*).  Next, I started running the check in a loop, and after seeing the initial “OK” state, I proceeded with /etc/init.d/mfs-chunkserver restart. Now the cluster’s mfsmaster finds out about the pending removal:

This is what the output looks like after a moment:

dhcp100:moosefs floh$ while true ; do ./nagios-moosefs-replicas.py ; sleep 5 ; done

OK – No errors
WARN – 11587 chunks of goal 3 lack replicas
WARN – 10 chunks of goal 3 lack replicas
WARN – 40 chunks of goal 3 lack replicas
WARN – 70 chunks of goal 3 lack replicas
WARN – 90 chunks of goal 3 lack replicas

As you can see, the number of undergoal chunks is growing – this is because we’re still in the first scan loop of the mfsmaster. The loop time is usually 300 or more seconds, and the number of chunks checked during one loop is usually also throttled, e.g. at 10000 (at 64MB per chunk that equals 640GB).

In my tiny setup this means after 300s I should see the final number – but also during this time there will be some re-balancing to free the marked-for-removal chunkserver. I already wish I’d be outputting perfdata with the check for some fun graphs.
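
A minimal sketch of what perfdata output could look like, assuming the usual Nagios plugin convention of appending `| label=value;warn;crit;min` to the status line. The function name and both thresholds are placeholders for illustration; this is not what nagios-moosefs-replicas.py currently does.

```python
def format_result(undergoal, goal=3, warn=1, crit=10000):
    """Return (exit_code, status_line) in Nagios plugin style.

    warn/crit are illustrative thresholds on the number of undergoal chunks.
    """
    if undergoal >= crit:
        state, code = "CRIT", 2
    elif undergoal >= warn:
        state, code = "WARN", 1
    else:
        state, code = "OK", 0
    if undergoal > 0:
        text = "%d chunks of goal %d lack replicas" % (undergoal, goal)
    else:
        text = "No errors"
    # perfdata: label=value;warn;crit;min
    perfdata = "undergoal=%d;%d;%d;0" % (undergoal, warn, crit)
    return code, "%s - %s | %s" % (state, text, perfdata)


code, line = format_result(9287)
print(line)
```

For 9287 undergoal chunks this prints `WARN - 9287 chunks of goal 3 lack replicas | undergoal=9287;1;10000;0`, which is enough for a grapher to draw the curve.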

 

Lesson for you?

The interval with my check should be equal to the loop time configured in mfsmaster.cfg.
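
To keep the two in sync, the check interval could be read straight from the config. A small sketch, assuming the loop time is stored as a `CHUNKS_LOOP_TIME = <seconds>` line (that key name is an assumption, verify it against your own mfsmaster.cfg) and falling back to 300 seconds:

```python
def read_loop_time(cfg_text, key="CHUNKS_LOOP_TIME", default=300):
    """Return the loop time in seconds from mfsmaster.cfg-style text.

    The key name is an assumption; pass a different one if your version
    of MooseFS calls the setting something else.
    """
    for line in cfg_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return int(value.strip())
    return default


# usage, with the path from this post:
# with open("/etc/mfs/mfsmaster.cfg") as f:
#     interval = read_loop_time(f.read())
```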

 

Some nice person from TU Graz also pointed me at a forked repo of the mfs python bindings, and there are already some more Nagios checks in it:

 

Make sure to check out

https://github.com/richarson/python-moosefs/blob/master/check-mfs.py

I’ll also test-drive this, and will probably turn it into real documentation in my wiki at Adminspace – Check_MK and Nagios.

Mostly, I’m pondering how to really set up a nice storage farm based on MooseFS at home, so I’m totally distracted from just tuning this check 🙂

OMD port to Debian/kFreeBSD


FreeBSD turned 20 yesterday – I have a little present!

You’re now able to run – with a little detour – a really modern monitoring environment (OMD) on FreeBSD.

Downloading:

Let’s start with the download link:

Wartungsfenster-OMD: omd-1.00_0.wheezy_kfreebsd-amd64.deb

How I got there

I found there’s a howto for running Debian/kFreeBSD on FreeBSD as a JAIL! See the howto here: kFreeBSD Jails

So the idea I came up with was:

Debian is a supported distro, so let’s port OMD to run in that jail!

I built a jail mostly via the howto – try it too!
Things that I changed:
The command to launch jails has a “–” bit in it – I removed that, otherwise the jail didn’t launch.

My actual script (since I found nothing about running Linux jails from rc.conf):

#!/bin/sh

# add the jail's IP alias and mount the pseudo filesystems the Debian userland expects
setup()
{
    ifconfig xn0 alias 192.168.10.66/32
    mount -t linprocfs linprocfs /srv/jail/debjail/proc
    mount -t linsysfs linsysfs /srv/jail/debjail/sys
    mount -t tmpfs tmpfs /srv/jail/debjail/run
    mount -t devfs devfs /srv/jail/debjail/dev
}

# create the jail, run Debian's runlevels S and 2, then grant extra permissions
launch()
{
    jail -J /var/run/jail/debjail.jid -c jid=66     \
       name=debjail path=/srv/jail/debjail          \
       host.hostname=debjail ip4.addr=192.168.10.66 \
       command=/bin/sh -c "/etc/init.d/rc S && /etc/init.d/rc 2"
    # needed for building deb packages
    jail -m jid=66 allow.sysvipc=1
    # needed for ping
    jail -m jid=66 allow.raw_sockets=1
}


if [ $? = 0 ]; then
   setup
   launch
fi

Starting the jail!

With this, the jail should launch. You can enter it using jexec 66 /bin/bash. Once inside, add openssh so you can get in from the outside. You’ll also need to configure authentication, namely root access.

To be able to use su inside the jail, make sure you didn’t put it on a filesystem that’s mounted as nosuid (I do that by default, …)

Just a normal server, let’s build!

(if you just wanna run OMD, you can of course skip this part 🙂)
Once SSH and su worked, I ssh’ed into the jail, checked out the OMD sources and started building: first configure, then make && make pack, and in the last stage make deb.
I had to remove most of the perl-modules due to incompletely specified dependencies. Due to that I also had to remove Thruk, Gearman and disable RRDtool’s perl support. Sorry, but after a handful of perl-related issues I got really sick of it. Fix your shit. Further weirdness was that the Check_MK tarball from git was corrupt. I replaced it with the upstream one.
(The make deb also needed sysvipc support enabled in the jail)

The resulting package is what you found above at the download link.

OMD install

I could install it using gdebi like on a standard distro.

apt-get install gdebi && gdebi omd-1.00_0.wheezy_kfreebsd-amd64.deb

When creating a site (omd create sitename) I found I had to turn off the TMPFS setting

omd config sitename set TMPFS off

and then I could launch the site.
My setup won’t need TMPFS performance, so I didn’t do anything here. It could be fixable, I’m not sure.

After restarting the main apache,

service apache2 restart

The Check_MK GUI greeted me with errors because it tried to read a UUID from /proc, but that’s not yet possible on FreeBSD (PR opened: kern/183615).

I made a tiny change to the Check_MK Multisite sources for this. The UUID is used for identifying sessions, so it is _quite_ important that this works.

To fix it, change a little piece of the weblib.py for your site (or actually, in /omd/versions/default)
/omd/sites/freemon/version/share/check_mk/web/htdocs/weblib.py

at line 117 and below:

    116 # Generates a selection id or uses the given one
    117 def selection_id():
    118     if not html.has_var('selection'):
    119         #sel_id = file('/proc/sys/kernel/random/uuid').read().strip()
    120         import commands
    121         sel_id = commands.getoutput("uuid")

(so, instead of reading from /proc, run “uuid”)

for this to work you also need to install uuid using

apt-get install uuid
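
A possible alternative that avoids the external binary, as a sketch that has not been tested inside Multisite: Python’s standard uuid module can generate a random UUID directly. The helper name new_selection_id is hypothetical; it tries /proc first and falls back to uuid.uuid4() when that file is missing.

```python
import uuid


def new_selection_id(proc_path="/proc/sys/kernel/random/uuid"):
    """Return a random UUID string, preferring the kernel's generator.

    Falls back to the standard library when proc_path cannot be read,
    as is the case inside the kFreeBSD jail described above.
    """
    try:
        with open(proc_path) as f:
            return f.read().strip()
    except (IOError, OSError):
        return str(uuid.uuid4())


sel_id = new_selection_id()
```

Either way the result is a 36 character UUID string, so nothing else in selection_id() would need to change.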

After this I also needed to enable raw socket support for the jail so that pinging hosts would work (already in the debjail script)

If you use IPv6, don’t forget to switch to the Icinga monitoring core using

omd config sitename set CORE icinga

Start it using

omd start sitename

and hit the web interface at http://jailname/sitename – user: omdadmin pass: omd

root@debjail:/omd# omd status
Doing 'status' on site freemon:
mkeventd:       running
apache:         running
rrdcached:      running
npcd:           running
icinga:         running
crontab:        running
-----------------------
Overall state:  running

Not a full FreeBSD port, but a bullshit linux system locked into a jail – still – this is getting the job done.

Try it, and, that’s it!

Happy birthday Beastie!

Trip to OpenNebulaconf


This year also saw the first ever OpenNebula conference. I was there for a really short time only, since I’d been coming from the open source backup conference in Cologne.

Let me say it was a harder, longer trip than I could handle. Two conferences in two days is already bad, but if you also need to prepare stuff it gets rough.

So, how was it?

Getting there: the (almost endless) ride

So, an almost sleepless ride to Cologne and then another pretty long one to Berlin, a short nap, and every free minute spent on the lab (the server failed the final test reboot like 2 hours before my 3am train departed…). A disaster, but at least people got less rude (running into you, etc.) the closer I got to Berlin.

At some point there was a nice young consultant woman sitting next to me who *also* fought sleep while she frantically worked on some papers. Couldn’t help smiling.

By the time I arrived I had like 37 hours of work/talks/travel versus 3 hours of sleep. You bet I *love* the beds at my fav Berlin hotel (Park Inn Alexanderplatz) when I arrive after a ride like that.

I’m in the wrong place and there’s a nazi for breakfast.

The next day started out bad – the hotel was *called*, but not located at, Alexanderplatz. Not fun considering I had put down a lot of money to get a room at Alexanderplatz, had planned to save some time by being close to the venue, and then had a kinda weird cab driver take me to the other place. Being completely exhausted even in the morning, I really didn’t care to hear about the lower share of foreign population in East Berlin due to the non-exchangeable nature of the GDR mark.

Last to go

Having arrived I found the conference reception desk, and apparently I was totally the last person to arrive: the guy at the desk immediately knew who I was. I browsed around a little, immediately caught sight of the super cool inovex OpenNebula lab (acrylic casing, 8 i5 nodes), then had some coffee and settled for the sofa.

Oh, THERE!

I tried to get my “personal IT” working so I could drop a message to Carlo Daffara, who only had a little time left till his flight, and at some point I realized the impatient guy around the corner was him, waiting. 🙂

With that sorted we spent almost two hours chatting and I was surprised at some of the stuff they’re doing at CloudWeavers. You rarely meet anyone who is up for a discussion of IO queue/latency/bandwidth issues. Like, no one. Even less so among CEOs. Now, there he is, and he’s even got real solutions in the works that no one has ever worked on as methodically. And stuff like this is all just a little sidequest for CloudWeavers. I’m amazed.

Lunch break? Slides!

So far I had seen no talks but at least got to watch the amazing lightning talks – once they’re online, watch all of them.

I tried to make my slides more useful, fixed bugs in my new opennebula nagios checks and, well, generally panicked.

Then it was time for the talk, and I tried to do well. 🙂

Slides suck!

Next time I’ll stick with 5 slides and just tell what I think – I don’t need that bullshit powerpoint to get people interested so why bother.

I think I managed to have some minds *click* on the idea of monitoring the large scope of an infrastructure instead of just details. One of the key points was to monitor free capacity instead of usage. In a cloud env I think this is a must have.
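
To illustrate the free-capacity idea, here is a minimal sketch. The host list, field names and thresholds are invented for the example and are not taken from the OpenNebula checks mentioned above; it simply counts how many more VMs of a given size would still fit and alerts when that number gets low.

```python
def free_vm_slots(hosts, vm_mem_mb):
    """Count how many VMs of vm_mem_mb still fit, host by host."""
    return sum((h["total_mb"] - h["used_mb"]) // vm_mem_mb for h in hosts)


def check_capacity(hosts, vm_mem_mb=2048, warn=4, crit=1):
    """Return (exit_code, status_line); alert when FEW slots remain."""
    slots = free_vm_slots(hosts, vm_mem_mb)
    if slots <= crit:
        state, code = "CRIT", 2
    elif slots <= warn:
        state, code = "WARN", 1
    else:
        state, code = "OK", 0
    text = "room for %d more VMs of %d MB" % (slots, vm_mem_mb)
    return code, "%s - %s | free_slots=%d" % (state, text, slots)


# hypothetical two-node cloud
hosts = [
    {"total_mb": 16384, "used_mb": 10240},
    {"total_mb": 16384, "used_mb": 15360},
]
print(check_capacity(hosts)[1])
```

Counting per host matters: two hosts with 1 GB free each cannot take a 2 GB VM, even though the summed usage figure would suggest there is room.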

I didn’t get the time to add a single tiny BI rule for my setup, so I skipped most of the business intelligence part.

One sad/funny point was that I went on forever about fully dynamic configuration, but missed the main point:

This will be a downloadable selfconfiguring monitoring appliance you can get via the marketplace.

I just didn’t remember to say it.

The reception was good anyway and I hope I helped some of the people – not to mention that it was really hard to talk in front of so many of them! I’m still surprised if someone comes to me and says he liked the talk. Some day I’ll stop worrying.

0.25 days of conference left

I watched a few more talks and it was hard to decide which one to look at – for example I went to hear about rOCCI and it was well worth it, but I missed the talk from the BBC. I’m so looking forward to the recordings.

After that talk, there was another break and then the conference ended with a very short speech from the OpenNebula guys. Many people including me just kept sitting, still eager for more talks. Seems there’s room for a 3-day conference if the topic is that interesting 🙂

What else…

I think it was great that there were multiple companies behind hosting the conference; it seemed to open up discussions a little. I was surprised that the Netways team really held back marketing-wise, which is far different from what I heard from (non-MK 🙂) visitors of other (mostly monitoring) conferences they have a role in. They did an incredibly good job at organizing stuff. It’s hard to describe – I’m used to the utter chaos of CCC and such conferences and what Netways put up is the exact opposite. Everything worked. Everyone I talked to was happy with how smooth the conference went. Really, great work.

After the conference I had some sleep and then went for drinks with the opennebula guys. Sitting outside after a few burgers I had the second “unfun” event of the day when some old unhappy man started to insult, attack and shove around random people of the group. My first thought was just “yeah right, that’s what we get for being in Mitte instead of Kreuzberg”.

Since I was the only German I tried to tell him to stop acting like a 12-year-old idiot, but with little success. After some time he finally left. I think this guy was actually just full of self-hate and wanted someone to hit him. Very weird.

How did I make it back?

After this unwelcome interruption we moved on a few corners and went to CCCP bar, which was still mostly a tourist place, but a lot more the Berlin I’m used to. Good drinks, a lot of OpenNebula and other chat and a nice bartender(ess) made it very hard to leave.

At 3 or 4 I still somehow started walking back to my hotel. I have no clue how I actually got there.

The next day I got a lot more sleep and instead of getting drunk again I was already adding some more bugfixes to the KVM checks 🙂

I did miss the OpenNebula team, though – they’re extremely interesting and nice people.

Final words

I missed some of the best talks, plus the hacking sessions, plus the get-together. Next year I shall not make that mistake!

Soon I’ll also do a writeup about the technical bits of the monitoring thingy.

Check_MK and Bacula



I’ve prepared a small Xmas present for the Check_MK / Bacula users like me.

At this link you can find an overview of current monitoring options if you want to have Nagios tell you if “something is wrong with the backups”.

For my own use I’ve now decided on a minimalistic yet powerful option: Bacula’s Runscript plus flag files that are used to monitor job status from Check_MK.

I like this since it will also work if the Bacula server breaks down altogether.
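
The flag file side could look roughly like this as a Check_MK local check. It is a sketch under assumptions: a Runscript touches `<flag_dir>/<job>.ok` after a successful job (the directory, the file naming and the 26 hour limit are placeholders), and the check only looks at the file’s age. The output follows the local check line format of status, item name, perfdata, text.

```python
import os
import time


def check_backup_flag(job, flag_dir, max_age_hours=26, now=None):
    """Return one local-check line for a Bacula job based on its flag file.

    Assumes a Runscript touches <flag_dir>/<job>.ok on success; a stale or
    missing file means the job (or the whole Bacula server) is in trouble.
    """
    now = time.time() if now is None else now
    path = os.path.join(flag_dir, job + ".ok")
    try:
        age_hours = (now - os.path.getmtime(path)) / 3600.0
    except OSError:
        return "2 Bacula_%s - CRIT - no flag file %s" % (job, path)
    perf = "age_hours=%.1f;%d" % (age_hours, max_age_hours)
    if age_hours > max_age_hours:
        return "1 Bacula_%s %s WARN - last success %.1f hours ago" % (job, perf, age_hours)
    return "0 Bacula_%s %s OK - last success %.1f hours ago" % (job, perf, age_hours)


# usage with a hypothetical job name and directory:
# print(check_backup_flag("nightly", "/var/lib/bacula/flags"))
```

Because the check judges the age of the last success rather than asking Bacula anything, a dead director simply shows up as a flag that stops getting refreshed.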

Let me know any additions you think worthwhile.

Nexus 7 followup


The Nexus tablet is getting more and more interesting for me as I come up with additional uses.

Today I installed DHCP and Bind on it, and am thinking about Cobbler, too.

The idea is to have the Nexus as the infrastructure core, with the tools to bootstrap the rest of my systems.

Power would come via an active USB hub; if it fails, the Nexus can still power its gadgets until its battery runs out. That way it’s possible, e.g., to determine whether power or all networking failed and still alert.

And after a disaster crash you’d have an install server that is ready to bring the core systems back up.

OMD Nexus 7



Nexus 7 Tablet running OMD / Nagios / Check_MK

 

Now everyone can have a $250 Nagios appliance with builtin UPS and WiFi and a Quadcore to get things moving!

 

Thanks to the great omdistro package it was only half a weekend’s work to get the basic system up and running.

Currently I’m pulling the sources off it so I can upload the source tree and .deb package.

I’d love to extend this some more, things one could poke at:

  • wallmounting
  • a strong active USB hub for charging
  • new SIM for USB cell
  • firefox autostart & autologin
  • disable screen blanking
  • better dashboard
  • disable screen input

Anyone wanting to join?

Simulating lively Check_MK tcp agent outputs


Writing a new check for Check_MK often runs into two problems:

  • No live output which would show live traffic flow.
  • No “errors” because the systems normally aren’t broken.

For SNMP-based hosts:

There is a perfect solution named the “Agent Simulator”: just enable agent_simulator in main.mk and use {} in the stored output.

You make Check_MK run on stored output using usewalk_hosts += [ "hosttag" ] or by setting simulation_mode = True.

The data files are in either var/check_mk/snmpwalks or tmp/check_mk_cache/

Then the following is possible:

  • Auto-switching states – e.g. 1 1 1 1 1 2 5 would make a network interface that goes down every 6 minutes (how awesome is that)
  • Wave-forms, or plain and simply
  • continuous growth.

For TCP-based hosts:

There is nothing similar. Nobody had the time to build it.

I had to build a check for a host which I don’t have and, worse, for which I also couldn’t produce real traffic. But I needed to verify it gives me useful data. The solution was to make a little script that replaces my agent output (which is read in using datasource_programs) on the fly.

 

floh@klappstuhl:~$ cat fudge-counters
# counter values as they appear in the stored agent output
write=110747074454
read=392058695183
i=0
while true ; do
 sleep 10
 i=$(( i + 1 ))
 # grow both counters by 100000 per iteration
 newwrite=$(( write + i * 100000 ))
 newread=$(( read + i * 100000 ))
 # write a modified copy for Check_MK to read
 sed -e "s/$write/$newwrite/" -e "s/$read/$newread/" \
  ~/git/abcd/agent_output/aix-perflib > /tmp/customer/aix-perflib
done

Get the original counter values, then rewrite the output file continuously, changing the values only in the copy as the data passes through.

That way the original stays intact.
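
The same idea as a small Python function, in case the agent output needs more than two sed expressions. This is a sketch: the sample lines are invented (the real aix-perflib format is not shown here), and unlike sed’s s/// it replaces every occurrence of a counter value, not just the first one per line.

```python
def fudge_counters(agent_output, counters, iteration, step=100000):
    """Return agent_output with each original counter value increased.

    counters: the original integer values as found in the stored output.
    iteration: how many intervals have passed; each adds `step`.
    """
    for value in counters:
        agent_output = agent_output.replace(str(value), str(value + iteration * step))
    return agent_output


# hypothetical stored output using the two counter values from the script
stored = "write 110747074454\nread 392058695183\n"
print(fudge_counters(stored, [110747074454, 392058695183], iteration=3))
```

Since the function always starts from the stored original and the iteration number, repeated calls never compound on each other, which mirrors how the shell loop leaves the source file intact.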