I love FreeBSD! Taking over a non-small infrastructure of around 75 FreeBSD servers was something I wouldn’t have wanted to pass on.
The catch is that I do consulting only, not pure ops. But there wasn’t much of an ops team left…
Where they used to put around 10 man-days per week into the feeding and care of FreeBSD plus some actual development, I’m now trying to do something in 1 day. And I still want it to be a well-run, albeit slower, ship.
One of the biggest hurdles was the sheer volume of email.
Adding up Zabbix alerts (70% of which concerned nothing at all), the FreeBSD periodic mails, cron output, and similar reporting, I would see weeks with 1500+ mails, or in the higher 1000s if there were any actual issues. Each week. Just imagine what it looked like when I didn’t visit my customer for 3 weeks…
Many of those mails have no point at all once you’re running more than just base:
The most typical example would be bad SSH logins. All those servers run software to block attackers and even feed that info back to a central authority and log there. So, why in hell would I want to know about malicious SSH connects?
Would you like a mail that tells you no hardware device has failed, today?
- And another one every day until 2032?
- From all servers?
This makes no sense.
Same goes for the mails that tell me about necessary system updates.
What I’ve done so far can be put in these 3 areas:
1. Turn off mails:
Turn off as many of the periodic mails as possible (i.e. anything that can be seen by other means). I tried to be careful with it, but that didn’t quite work out. My periodic.conf looks like this now:
I found turning off certain things like the “security mail” also disables portaudit DB updates. But I just changed my portaudit call to include the download. Somehow I had assumed that *update* would be separate from *report*.
2. Fix issues:
Apply fixes for anything that really is a bug. At least if I can figure out how to fix it. More often than not I’ll hit a wall somewhere between the NIH config management and bad Perl code.
3. Monitor harder, but also smarter:
Put in better monitoring, write custom plugins for anything I need (OpenSSH Keys, Sendmail queues, OS Updates) and set thresholds to either a baseline value for “normal” systems or to values derived from peak loads for “busy” systems.
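To make the threshold idea concrete, here is a minimal sketch. It is not taken from my actual checks: the function names, the use of the median as the "baseline", and the 1.5x / 2x factors are all assumptions chosen for illustration.

```python
from statistics import median


def derive_thresholds(samples, busy=False, warn_factor=1.5, crit_factor=2.0):
    """Return (warning, critical) thresholds for one metric on one host.

    Hypothetical helper: "normal" systems use the median of past samples
    as their baseline, "busy" systems use the observed peak instead.
    The factors are illustrative defaults, not recommendations.
    """
    if not samples:
        raise ValueError("need at least one sample")
    reference = max(samples) if busy else median(samples)
    return reference * warn_factor, reference * crit_factor


def classify(value, warning, critical):
    """Map a measured value to a monitoring state string."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"
```

For a mail queue that usually holds a handful of messages, a "normal" host would warn much earlier than a "busy" relay whose thresholds come from its own peaks, which is the whole point of keeping the two classes apart.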
Some of the checks can be found on my Bitbucket, and honestly, I’m still constantly working on them.
The checked-in version might change quite often. For example, I now think it won’t hurt to have a stronger separation of reporting for OS and ports issues. And maybe a check that tells me whether a system still needs a reboot.
The area I’m currently working on is automating the updates.
I’m taming the VMware platform and using some Pysphere code to create VM snapshots on the fly. So there’s an Ansible playbook that pulls updates. It then checks whether there is a mismatch between the version reported by uname -a and the “tag” file from freebsd-update. If so, it triggers a VM snapshot and then installs and reboots.
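The version comparison could look roughly like the sketch below. This is not the playbook itself. It assumes the tag file is a single pipe-delimited line with the release in the third field and the patch level in the fourth, and that the running version has already been cut out of the uname output; both function names are made up for this example.

```python
def parse_tag(tag_line):
    """Turn a freebsd-update tag line into a version string.

    Assumed layout (not verified against every release):
    freebsd-update|ARCH|RELEASE|PATCHLEVEL|HASH|EOL
    """
    fields = tag_line.strip().split("|")
    if len(fields) < 4 or fields[0] != "freebsd-update":
        raise ValueError("unexpected tag format")
    release, patchlevel = fields[2], int(fields[3])
    # Patch level 0 is reported without a -pN suffix.
    return release if patchlevel == 0 else f"{release}-p{patchlevel}"


def needs_update(running_version, tag_line):
    """True if the running version differs from what the tag announces."""
    return running_version.strip() != parse_tag(tag_line)
```

In practice the first argument would come from something like uname -r and the second from the tag file in the freebsd-update working directory (by default /var/db/freebsd-update, as far as I know). One caveat: uname reports the kernel version, so a patch that only touches userland can leave the two permanently out of step.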
Another piece of monitoring does a grep -R -e "^<<<<<" -e ">>>>>" /etc and as such alerts me to unmerged files.
I try to work in tiny little pieces and make everything a dual-use technology (agriculture and weapons, you know) that gives me both status reporting and status improvement.
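If the check needs to live inside a monitoring plugin rather than a shell one-liner, the same test can be sketched in Python. This is an illustration only, with a made-up function name; it mirrors the two grep patterns above (a line starting with <<<<<, or >>>>> anywhere in a line).

```python
import os


def find_unmerged(root):
    """Return a sorted list of files under root that contain merge markers."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip FIFOs, sockets and dangling symlinks.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "r", errors="replace") as handle:
                    if any(line.startswith("<<<<<") or ">>>>>" in line
                           for line in handle):
                        hits.append(path)
            except OSError:
                # Unreadable file: ignore it here, the plugin could report it.
                continue
    return sorted(hits)
```

A plugin would call find_unmerged("/etc") and go to a warning state whenever the returned list is not empty, with the file names as the alert text.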
I started a howto about the specifics of what I did in monitoring; see FreeBSD Monitoring at my adminspace wiki.