FrOSCon Recap: Rudder (1)


In my talk I had mentioned that it is very handy that Normation has its own C developers and maintains its own version of the Rudder agent.

Case in point: a critical bug in CFEngine that had simply already been fixed. And 30 minutes after I asked, the patch was released, too…

15:29 < darkfader> amousset: i'm inclined to think we could also
use a backport of https://github.com/cfengine/core/pull/2643
15:30 < darkfader> unless someone tells me "oh no i tested with 
a few thousand clients for a few months and it doesn't affect us" 😉
15:33 < amousset> darkfader: it has already been backported 
(see http://www.rudder-project.org/redmine/issues/8875)
15:34 < Helmsman> Bug #8875: Backport patch to fix connection cache 
( Pending release issue assigned to  Jonathan CLARKE. 
URL: https://www.rudder-project.org/redmine//issues/8875 )
15:37 < darkfader> amousset: heh, pending release 
15:38 < Matya> hey you just released today 🙂
15:40 < amousset> yesterday actually 🙂
16:07 < jooooooon> darkfader: it's released now 😉

 

 

… because there is, in fact, a release process!


Time for 2016


Hi everyone.

 

I just thought a post was in order after having gone dark for quite a long time.

I’d been home sick for almost a month. First I snapped my back very badly, and then I caught a bad flu. This completely ruined the month I had meant to spend on posting things here. Lying on your back is BORING, and painkillers make you too numb to do anything useful.

In between I’ve also done a few fun projects and been to OpenNebulaConf (Oct), the Chaos Communication Congress (Dec) and Config Management Camp (Feb), and each time I came home with some nice ideas to throw around.

To be honest though, the highlight of the last months was watching Deadpool.

If you can handle some completely immature humor and the good old ultraviolence, go watch it.

 

For this year, there will be EuroBSDCon and OpenNebulaConf yet again.

One great thing about OpenNebula is the extremely friendly community. Compared to *any* other conf I’ve been to, which all felt pretty darn hostile and bro-ish, OpenNebula’s is such a nice community, and I really hope the others start trying to match that at some point.

Your network is probably owned – What to do next?


I’ll try to summarize my thoughts after the pretty shocking 31C3 talk.

The talk was this one: Reconstructing .Narratives.

This trip to 31C3 was meant to be a normal educational excursion, but it turned out just depressing. The holes the NSA & friends rip into the networks we are looking after are so deep it’s hard to describe.

Our democratic governments are using the gathered data for KILL LISTS of people, even assigning a “kill value”, as in how many bystanders are acceptable to kill if it helps the matter. This is something I can’t yet fit into my head. The political and technical aspects are covered on Spiegel.de.

Note that the info there will be extended in three weeks, when there will be another drop of material regarding the malware aspects.

Personally, I’m not feeling well just from what I heard there, and I’m grateful they didn’t get around to the malware list.

Now I’ll go ahead on the tech side and talk about what you should consider. We NEED to clean up our networks.

This is not a check list. It is a list to start from.

Your admin workstation:

  • Buy a new one. Install Qubes as per https://qubes-os.org/
  • If your box runs it nicely, submit it to their HCL.
  • I talked to Joanna before this shattering talk, and I’ll write about my “interview” at a later time.
  • Use the Tor VM or another box with Tails for your FW downloads.
  • I wish coreboot were actually usable; if you can help on that end, please do.

Point of Administration MATTERS

  • IPSEC VPN with preshared keys: Not safe
  • IPSEC VPN: Should be safe?
  • PPTP VPN: (Obviously) Not safe
  • SSH: VERY VERY questionable
  • ISDN Callback: Sorry, that was only safe before IP became the standard. And maybe not even then.

So basically, if your servers aren’t in the cloud but in your basement, THAT IS A GOOD THING.

Really sorry but it has to be said.

Re-keying:

  • Wipe your SSH host keys and regenerate them.
  • Don’t use less than 4k keys.
  • Include the routers and other networking equipment.
  • Drop ALL your admin keys.
  • Regenerate them monthly.
  • Be prepared to re-key once we find out which SSH ECDSA-style option is actually safe.
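The host-key part of this list can be sketched as follows. This writes into a scratch directory purely for illustration; on a real host you would operate on /etc/ssh and restart sshd afterwards (ed25519 is my own suggestion here, not something from the talk):

```shell
#!/bin/sh
# Sketch: wipe and regenerate SSH host keys.
# Uses a scratch directory for demonstration; replace with /etc/ssh in production.
set -e
dir=$(mktemp -d)

# wipe the old host keys
rm -f "$dir"/ssh_host_*

# regenerate: no less than 4k bits for RSA, plus an ed25519 key
ssh-keygen -q -t rsa -b 4096 -N '' -f "$dir/ssh_host_rsa_key"
ssh-keygen -q -t ed25519 -N '' -f "$dir/ssh_host_ed25519_key"

ls -l "$dir"
```

The same loop then has to be repeated for every router and switch that speaks SSH.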

SSH adjustments are now described very well on the following GitHub page:
stribika – Secure Secure Shell

Passwords:

Change your passwords!

This sounds funny and old, but since any connection you have ever made might get decrypted at a later time, you should consider all of them compromised.
I also think it would be a good thing[tm] to have different passwords on the first line of jump hosts than on the rest of the systems.

Yes, keys seem safer. But I’ve been talking about passwords, which includes issues like keystroke-timing attacks on password-based logins to systems further down the line.
Some of this of course also applies to public keys; i.e. don’t overly enjoy agent forwarding. I’d rather not allow my “jump host login” key on the inner ring of systems.
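One way to avoid agent forwarding entirely is to tunnel through the jump host with ProxyCommand, so the inner host sees a proxied direct connection and no agent ever sits on the jump host. A sketch for ~/.ssh/config (host names and key paths are illustrative):

```
# ~/.ssh/config (names are stand-ins)
Host jumphost
    HostName jump.example.com
    IdentityFile ~/.ssh/id_jump    # this key is only authorized on the jump host

Host inner-*
    # -W forwards stdio to the target; no agent forwarding needed
    ProxyCommand ssh -W %h:%p jumphost
    IdentityFile ~/.ssh/id_inner   # separate key for the inner ring
```

This keeps the “jump host login” key and the inner-ring key strictly apart, which is exactly the separation argued for above.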

Password management:

It seems the tool from Bruce Schneier (Password Safe) is rather safe; I’d move away from the “common” choices like KeePassX.

Info / Download: https://www.schneier.com/passsafe.html

Firmware:

Make BIOS reflashing a POLICY.

Random number generators:

Expect that you will need to switch them; personally, I THINK you should immediately drop the comforts of haveged.

GnuPG

It was recommended more than once.

Start using it more and more, putting more stuff through it than you have until today.

Switches and routers:

Your network is NOT your friend.

  • IP ACLs are really a good thing to consider, and they piss off intruders.
  • A good tool to set ACLs globally on your hardware is Google’s capirca. Find it at https://code.google.com/p/capirca/. Shorewall etc. are more on the “nice for a host” level. We have come a long way with host-based firewalls, but…
  • Think harder about how to secure your whole network. And about how to go about replacing parts of it.

We can’t be sure which of our LAN’s active components are safe; your WAN probably IS NOT.
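For reference, a capirca policy is a vendor-neutral description that the tool then renders into Cisco, Juniper, iptables, etc. syntax. A rough sketch (the network and service names like MGMT_NET and SSH live in capirca’s separate definition files; everything here is illustrative):

```
header {
  comment:: "allow admin SSH only from the mgmt net"
  target:: cisco mgmt-acl extended
}

term allow-mgmt-ssh {
  source-address:: MGMT_NET
  destination-port:: SSH
  protocol:: tcp
  action:: accept
}

term deny-rest {
  action:: deny
}
```

One policy file, rendered per platform, is what makes it feasible to keep ACLs consistent across all your switches and routers.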

Clients

We really need to make PFS (perfect forward secrecy) more widespread.

Talk it over with your clients: how much ongoing damage is acceptable for helping the helpless XP users?

Guest WIFI

Do NOT run a flat home network.

Additions welcome, comment if you know something to *advance* things.

Bacula version clash between 5 and 7


This is the second time I’ve run into the error “Authorization key rejected by Storage daemon.”

It makes backups and restores impossible. Most traces/explanations on the internet point at FD hostname, SD hostname, or key mismatch issues.

That is of course always possible, but if you had it working until a week ago when you updated, don’t let them discourage you. This error will also occur for any version 7 client connecting to a version 5 server. I’ve had it on my MacBook after running “port upgrade outdated”, and just now on my FreeBSD desktop during a migration restore.

The jobs will abort after the client is asked to send/receive files.

Debug output of the storage daemon shows that this is in fact a client error!

The red herring is a Bacula error message saying

Authorization key rejected by Storage daemon

which is completely wrong.

They just abstracted/objectified their logging a little too much. The SD received the error “client didn’t want me” and has to pass it on. Not helpful. Sorry guys 🙂

As a warning / example, here have a look at the log:

JobName: RestoreFiles
Bootstrap: /var/lib/bacula/mydir-dir.restore.1.bsr
Where:
Replace: always
FileSet: Full Set
Backup Client: Egal
Restore Client: Egal
Storage: PrimaryFileStorage-int
When: 2014-09-14 12:40:15
Catalog: MyCatalog
Priority: 10
Plugin Options: *None*
OK to run? (yes/mod/no): yes
Job queued. JobId=17300
*mess
14-Sep 12:40 waxu0604-dir JobId 17300: Start Restore Job RestoreFiles.
14-Sep 12:40 waxu0604-dir JobId 17300: Using Device "PrimaryFileDevice"
14-Sep 12:39 Egal JobId 17300: Fatal error: Authorization key rejected by Storage daemon.
Please see http://www.bacula.org/en/rel-manual/Bacula_Freque_As[...]
*status client=Egal
Connecting to Client Egal at 192.168.xxx:9102

Egal Version: 5.2.12 (12 September 2012)  amd64-portbld-freebsd10.0
Daemon started 14-Sep-14 12:43. Jobs: run=0 running=0.
 Heap: heap=0 smbytes=21,539 max_bytes=21,686 bufs=50 max_bufs=51
 Sizeof: boffset_t=8 size_t=8 debug=0 trace=0 
Running Jobs:
Director connected at: 14-Sep-14 12:43
No Jobs running.
====

As you saw, the restore aborts while a status client works just fine.
The same client then ran its restore without ANY issue after nothing more than downgrading the client to version 5.

*status client=Egal
Connecting to Client Egal at 192.168.xxx.xxx:9102

Egal Version: 5.2.12 (12 September 2012)  amd64-portbld-freebsd10.0
Daemon started 14-Sep-14 12:43. Jobs: run=0 running=0.
 Heap: heap=0 smbytes=167,811 max_bytes=167,958 bufs=96 max_bufs=97
 Sizeof: boffset_t=8 size_t=8 debug=0 trace=0 
Running Jobs:
JobId 17301 Job RestoreFiles.2014-09-14_12.49.00_41 is running.
      Restore Job started: 14-Sep-14 12:48
    Files=2,199 Bytes=1,567,843,695 Bytes/sec=10,812,715 Errors=0
    Files Examined=2,199
    Processing file: /home/floh/Downloads/SLES_11_SP3_JeOS_Rudder_[...]

All fine, soon my data will be back in place.

(Don’t be shocked by the low restore speed; my “server” runs the SDs off a large MooseFS share built out of $100 NAS boxes.
I used to have the SDs directly on NAS and got better speeds that way, but I like distributed storage better than speed.)

Blackhat 2014 talks you should really really look at


This is my watchlist compiled from the 2014 agenda; many of these talks are important if you want to be prepared for current and future issues.

It’s great to see that there are also a few talks that fall more into the “defense” category.

 

Talks concerning incredibly big and relevant issues. I filed those under “the world is gonna end”.

The first two are worthy of that and hopefully wake up people in the respective design bodies:

  • CELLULAR EXPLOITATION ON A GLOBAL SCALE: THE RISE AND FALL OF THE CONTROL PROTOCOL
  • ABUSING MICROSOFT KERBEROS: SORRY YOU GUYS DON’T GET IT

Also annoying to horrible threats

  • EXTREME PRIVILEGE ESCALATION ON WINDOWS 8/UEFI SYSTEMS
  • A PRACTICAL ATTACK AGAINST VDI SOLUTIONS
  • BADUSB – ON ACCESSORIES THAT TURN EVIL
  • A SURVEY OF REMOTE AUTOMOTIVE ATTACK SURFACES

Things that will actually help improve security practices and should be watched as food for thought:

  • OPENSTACK CLOUD AT YAHOO SCALE: HOW TO AVOID DISASTER
  • CREATING A SPIDER GOAT: USING TRANSACTIONAL MEMORY SUPPORT FOR SECURITY
  • BUILDING SAFE SYSTEMS AT SCALE – LESSONS FROM SIX MONTHS AT YAHOO
  • BABAR-IANS AT THE GATE: DATA PROTECTION AT MASSIVE SCALE
  • FROM ATTACKS TO ACTION – BUILDING A USABLE THREAT MODEL TO DRIVE DEFENSIVE CHOICES
  • THE STATE OF INCIDENT RESPONSE

What could end our world five years from now:

  • EVASION OF HIGH-END IPS DEVICES IN THE AGE OF IPV6

note, memorize, listen to recommendations

  • HOW TO LEAK A 100-MILLION-NODE SOCIAL GRAPH IN JUST ONE WEEK? – A REFLECTION ON OAUTH AND API DESIGN IN ONLINE SOCIAL NETWORKS
  • ICSCORSAIR: HOW I WILL PWN YOUR ERP THROUGH 4-20 MA CURRENT LOOP
  • MINIATURIZATION

scada / modbus / satellites

  • THE NEW PAGE OF INJECTIONS BOOK: MEMCACHED INJECTIONS
  • SATCOM TERMINALS: HACKING BY AIR, SEA, AND LAND
  • SMART NEST THERMOSTAT: A SMART SPY IN YOUR HOME
  • SVG: EXPLOITING BROWSERS WITHOUT IMAGE PARSING BUGS
  • THE BEAST WINS AGAIN: WHY TLS KEEPS FAILING TO PROTECT HTTP

I don’t recall what these two were about:

  • GRR: FIND ALL THE BADNESS, COLLECT ALL THE THINGS
  • LEVIATHAN: COMMAND AND CONTROL COMMUNICATIONS ON PLANET EARTH

Friday special: screenrc for automatic IRSSI start


Just wanted to share a little snippet.

This is my SSH+Screen config for my IRC box:

  • If I connect from any private system, I’ll get my IRC window.
  • If the box rebooted or something, the screen command will automatically re-create an Irssi session for me.
  • If I detach the screen, I’m automatically logged out.
~$ cat .ssh/authorized_keys
command="screen -d -RR -S irc -U" ssh-[ key removed] me@mypc

The authorized_keys settings enforce running only _this_ command, and the screen options force-detach any other attachment, reattach (or create) a session named “irc”, and enable UTF-8.

~$ cat .screenrc 
startup_message off
screen -t irssi 1 irssi

The screenrc does the next step by auto-running Irssi in window 1 with the title set accordingly.
(And it turns off the moronic GPL notice.)
Irssi itself is configured to autoconnect to the right networks and channels, of course. (To be honest: the Irssi config is something I don’t like to touch more than every 2-3 years.)

On the clients I also have an alias in /etc/hosts, so if I type “ssh irc”, I’m right back on IRC. Every time, immediately.
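The same shortcut can also live in ~/.ssh/config instead of /etc/hosts, which additionally pins the user and key (the host name here is a stand-in):

```
# ~/.ssh/config on the client (names are illustrative)
Host irc
    HostName ircbox.example.com
    User me
    IdentityFile ~/.ssh/id_irc
```

With that, “ssh irc” works per-user without touching system files.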

 

This is the tiny little piece of perfect world I was able to create, so I thought I’d share it.

FreeBSD periodic mails vs. monitoring


I love FreeBSD! Taking over a non-small infrastructure of around 75 FreeBSD servers was something I wouldn’t have wanted to pass on.

The problem bit is that I do consulting only, not pure ops. But there wasn’t much of an ops team left…

Where they used to put around 10 man-days per week into the feeding and care of FreeBSD, plus some actual development, I’m now trying to do something in one day. And I still want it to be a well-run, albeit slower, ship.

One of the biggest hurdles was the sheer volume of email.

Adding up Zabbix alerts (70% of which concern _nothing_), the FreeBSD periodic mails, cron outputs, and similar reporting, I would see weeks with 1500+ mails, or in the higher thousands if there were any actual issues. Each week. Just imagine what it looked like when I didn’t visit my customer for three weeks…

Many of those mails have no point at all once you’re running more than just -base:

The most typical example would be bad SSH logins. All those servers run software to block attackers, and they even feed that info back to a central authority which logs it. So why in hell would I want to be mailed about malicious SSH connects?

Would you like a mail that tells you no hardware device has failed today?

  • And another one every day until 2032?
  • From all servers?

This makes no sense.

The same goes for the mails that tell me about necessary system updates.

What I’ve done so far can be put in those 3 areas:

1. Periodics:

Turn off as many of the periodic mails as possible (i.e. anything that can be seen by other means). I tried to be careful about it, but that didn’t work out. My periodic.conf now looks like this:

freebsd periodic.conf
I found that turning off certain things like the “security mail” also disables the portaudit DB updates. But I just changed my portaudit call to include the download. Somehow I had assumed that *updating* would be separate from *reporting*.
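To give an idea of the kind of knobs involved, here is a sketch (not my full file; the exact variable set depends on your FreeBSD release, see periodic.conf(5)):

```
# /etc/periodic.conf (excerpt, illustrative)
daily_show_success="NO"        # only mail when something is actually wrong
daily_show_info="NO"
weekly_show_success="NO"
monthly_show_success="NO"
security_show_success="NO"
# failed-login noise is already handled by the blocker + central logging
daily_status_security_loginfail_enable="NO"
```

The pattern is always the same: suppress the “everything is fine” reports and keep only what a monitoring system can’t see anyway.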

2. Fix issues:

Apply fixes for any bugs that really are bugs. At least if I can figure out how to fix them. More often than not I hit a wall somewhere between the NIH config management and bad Perl code.

3. Monitor harder, but also smarter:

Put in better monitoring, write custom plugins for anything I need (OpenSSH keys, Sendmail queues, OS updates), and set thresholds either to a baseline value for “normal” systems or to values derived from peak loads for “busy” systems.

Some of the checks can be found in my Bitbucket repo, and honestly, I’m still constantly working on them.

https://bitbucket.org/darkfader/nagios/src/cc233b93c106166a5494d7488c38880df0a5946b/check_mk/freebsd_updates/?at=default

The checked-in version might change quite often. E.g. I now think it won’t hurt to separate the reporting of OS and ports issues more strictly. And maybe add a check that tells me if a system still needs a reboot.

The most current area right now is automating the updates.

I’m taming the VMware platform and using some pysphere code to create VM snapshots on the fly. So there’s an Ansible playbook that pulls updates. It then checks whether there is a mismatch between the version reported by uname -a and the “tag” file from freebsd-update. If so, it triggers a VM snapshot and installs/reboots.

Another piece of monitoring does a grep -R -e "^<<<<<" -e ">>>>>" /etc and thereby alerts me about unmerged files.
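That one-liner can be wrapped into a Nagios-style check; here is a sketch (a scratch directory with two fabricated files stands in for /etc, so the demo is self-contained):

```shell
#!/bin/sh
# Sketch: flag config files that still contain merge-conflict markers.
# A demo directory stands in for /etc; point it at /etc for real use.
dir=$(mktemp -d)
printf 'hostname="demo"\n' > "$dir/rc.conf"
printf '<<<<<<< current version\nfoo\n>>>>>>> new version\n' > "$dir/ntp.conf"

matches=$(grep -R -l -e '^<<<<<' -e '^>>>>>' "$dir" || true)
if [ -n "$matches" ]; then
    echo "CRITICAL - unmerged files: $matches"
else
    echo "OK - no unmerged files"
fi
```

grep -l keeps the output down to one line per offending file, which is exactly what you want in an alert.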

I try to do this with tiny little pieces, and to make everything a dual-use technology (agriculture and weapons, you know) that gives me both status reporting and status improvement.

I’ve started a howto about the monitoring specifics; see
FreeBSD Monitoring at my adminspace wiki.

Ansible FreeBSD update fun…


Using Ansible to make my time at the laundry place more interesting…

 

[me@admin ~/playbooks]$ ansible-playbook -i hosts freebsd-updates.yml

PLAY [patchnow:&managed:&redacted-domain:!cluster-pri] *************

GATHERING FACTS ****************************************************
ok: [portal.dmz.redacted-domain.de]
ok: [carbon.dmz.redacted-domain.de]
ok: [irma-dev.redacted-domain-management.de]
ok: [lead.redacted-domain-intern.de]
ok: [polonium.redacted-domain-management.de]
ok: [silver.redacted-domain-management.de]
ok: [irma2.redacted-domain-management.de]
ok: [inoxml-89.redacted-domain-management.de]

TASK: [Apply updates] **********************************************
changed: [inoxml-89.redacted-domain-management.de]
changed: [carbon.dmz.redacted-domain.de]
changed: [portal.dmz.redacted-domain.de]
changed: [irma-dev.redacted-domain-management.de]
changed: [lead.redacted-domain-intern.de]
changed: [polonium.redacted-domain-management.de]
changed: [silver.redacted-domain-management.de]
changed: [irma2.redacted-domain-management.de]
 finished on lead.redacted-domain-intern.de
 finished on portal.dmz.redacted-domain.de
 finished on silver.redacted-domain-management.de
 finished on inoxml-89.redacted-domain-management.de
 finished on carbon.dmz.redacted-domain.de
 finished on polonium.redacted-domain-management.de
 finished on irma-dev.redacted-domain-management.de
 finished on irma2.redacted-domain-management.de

TASK: [Reboot] ****************************************************
changed: [carbon.dmz.redacted-domain.de]
changed: [portal.dmz.redacted-domain.de]
changed: [inoxml-89.redacted-domain-management.de]
changed: [irma-dev.redacted-domain-management.de]
changed: [lead.redacted-domain-intern.de]
changed: [polonium.redacted-domain-management.de]
changed: [silver.redacted-domain-management.de]
changed: [irma2.redacted-domain-management.de]

TASK: [wait for ssh to come back up] *******************************
ok: [portal.dmz.redacted-domain.de]
ok: [irma-dev.redacted-domain-management.de]

I now use a “patchnow” group as the decision maker, because *surprise* I don’t want to snapshot and patch all systems at once.

It’s quite annoying that the most fundamental admin decisions are always really tricky to put into automation systems (written by devs). Also, I’ll need to kick my own ass, since the playbook didn’t trigger the snapshots anyway!
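The playbook behind that output isn’t shown here, but its skeleton would look roughly like this (a sketch; the module arguments are my assumptions, and the snapshot task is exactly the part that was missing):

```yaml
# freebsd-updates.yml (sketch; task details are assumptions)
- hosts: patchnow:&managed:&redacted-domain:!cluster-pri
  tasks:
    # the missing step: create a VM snapshot here before patching

    - name: Apply updates
      command: freebsd-update --not-running-from-cron fetch install

    - name: Reboot
      command: shutdown -r now "Ansible patch reboot"
      async: 1
      poll: 0

    - name: wait for ssh to come back up
      local_action: wait_for host={{ inventory_hostname }} port=22 delay=30 timeout=300
```

The async/poll trick on the reboot task keeps Ansible from waiting on a connection that is about to die.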

For the long term solution I think I’ll first define a proper policy based on factors like this:

  • How mature the installed OS version & patches are (less risk of patching)
  • How exposed the system is
  • The number of users affected by the downtime
  • The time needed for recovery

What factors do you look at?

Cfengine training


What’s coming

 

You should really be seeing a more interesting post here today.

From tomorrow till Friday I’ll be attending a CFEngine 3 class.

I’ve been so excited about this that I’ve been counting down the days and such…

Unfortunately, thanks to today’s OpenSSL nightmare, I don’t even have time to think about tomorrow.

 

new tools

Anyway: by next week I’ll have working knowledge of both Ansible and CFEngine 3.

This is what I consider a great toolset, or, as I described it to a friend, “having an excellent hammer for when I need a hammer”, and also having something to build whole cities with for when I need that.

Talking of cities:

One of my favorite books is “The City and the Stars” by Arthur C. Clarke, which takes place in a city enduring aeons.

This is kinda what I’d love my servers to do, too. I think a good overall system should be able to keep running and running and running, weathering disk failures, updates and power failures.

I think you don’t get this just by giving it “immutability”, but by teaching it how to serve its actual purpose…

 

cfengine

Cfengine, to me, feels closest to that goal.

(Notably, in that story the only normal-thinking guy in that city is a rare occurrence and really wants to get out.)

Sysadmin to manager translation guide


I just wrote this for fun, no liabilities taken! 🙂

Well, this is interesting:

  • “Something is definitely broken, but I don’t expect persistent data loss. Someone made a highly stupid bug”
  • “Puppet should do this (good thing), but it does that (bad thing)”
  • “I think this is gonna break once anyone touches it”
  • “I think this is gonna break in the next 24 hours”

Well, this is weird:

  • “This should never have happened. Something fucked up big time.”
  • “There might be logical data corruption.”
  • “I might soon tell you we’re doomed.”

This is not good:

  • “You lost service and/or data”

(Notice how “well” indicates undetermined data loss.)

I’ll need to have a look at that:

“This is all broken and was set up by someone who didn’t bother to think. We’ll need to take it apart just to find out what was broken by the setup and what broke recently so that you called me. It’s better if you don’t hear what I have to say about this setup.”

How do you do backups?

  • “What you’re asking might cause data loss”.
  • “I don’t yet trust you to do things right”
  • “How many layers of safety do we have?”

Asking about the “how” gives you a chance to make up excuses; or, if you can give details, we have a good chance of getting out of this safely. I like “safely”.

What date is your last backup?

“You have lost data. I’m planning a strategy for recovery, and if, incidentally, you don’t tell me your backups are broken, we can proceed quite successfully. I can probably fix this without needing the backups, saving a lot of time; but if you don’t *have* a backup, I can’t try this method, because suddenly I need to worry about rescuing all your uncovered data before trying to fix anything.”

I’ll need to look this up

  • “you threw something completely new at me”
  • “you designed things so creatively that it’ll need 1-2 hours of research and ideally a rebuild in lab to make sure there IS a workable path out of this. No sane person has a setup like this.”
  • “Last time someone did such a crazy thing I managed to fix it, but you need to go away right now because if you see how other people ended up in that situation, you’ll be depressed”

Could someone fetch the green book please?

“Your VxVM volumes are broken because you never properly configured anything. We need the best possible documentation before we even start typing.”