Bacula Packet size too big

So I ended up getting the dreaded “Packet size too big” error from Bacula.

I wasn’t sure when it started, either with the High Sierra update, or with some brew update.

The error looks like this:


20-Oct 12:53 my-bacula-dir JobId 0: Fatal error: bsock.c:579 Packet size=1073741835 too big from “Client: Utopia:192.168.xx.xx:9102. Terminating connection.


It can be reproduced simply by doing “status client”, and will also happen if you try to do a backup.

If you look into the error you’ll find an entry in the Bacula FAQ that covers Windows specifics and how to proceed if it’s not one of the known causes they explain.

Their advice: get a trace file using -d100 and report it on the list.

So, the first thing I found is that it won’t write a trace file, at least not on OSX.

You can alternatively use -d100 plus a tcpdump -c 1000 port 9102 to get reasonable debug info.

While looking at the mailing list I also found that the general community support for any mention of this error is horrible.

You’re being told you got a broken NIC, or that your network is mangling the data, etc.

All of which were very plausible scenarios back in, say, 2003, when Bacula was released.

Nowadays with LRO/GSO and 10G NICs it is not super unimaginable to receive a 1MB sized packet. For a high volume transfer application like backup, it is in fact the thing that SHOULD HAPPEN.

But in this case people seem to do anything they can to disrupt discussion and blame the issue on the user. In one case they did that with high effort, even when a guy proved he could reproduce it using his loopback interface, no network or corruption involved at all.

I’m pretty sick of those guys and so I also did everything I could – to avoid writing to the list.

Turns out the last OSX brew update went to version 9 of the bacula-fd while my AlpineLinux director is still on 7.x.

Downgrading using brew switch bacula-fd solved this for good.
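In hindsight, the number in the error message already looks more like a version mismatch than a mangling network. The sketch below is plain arithmetic on the value from the log line; it assumes nothing about Bacula’s wire format, and the helper name split_packet_size is made up for illustration.

```python
def split_packet_size(size):
    """Return (index of the highest set bit, value of the remaining lower bits)."""
    top_bit = size.bit_length() - 1
    remainder = size - (1 << top_bit)
    return top_bit, remainder


reported = 1073741835  # "Packet size" from the error message above
top_bit, remainder = split_packet_size(reported)
print(hex(reported))       # 0x4000000b
print(top_bit, remainder)  # 30 11
```

So the reported size is exactly 2^30 plus 11: an 11-byte payload with one single high bit set. Line noise rarely looks that tidy. It looks much more like one side putting a flag into the length field that the other side doesn’t expect, but that part is my guess and not something I verified in the source.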

Now the fun question is: is TSO somehow influencing Bacula 9 but not 7, and my disabling TSO via sysctl had no effect? Or did they at last allow more efficient transfers in newer versions, and that broke compatibility BECAUSE for 10 years they’ve been blaming their users, their users’ networks and anything else they could find?

Just so they’d not need to update the socket code?

There are other topics that have been stuck for a decade, and I wonder if they should just be put up as GSoC projects to benefit the community, but also anyone who can tackle them!

  • multisession fd’s (very old feature in commercial datacenter level backup, often it can even stream multiple segments of the same file to different destinations. Made sense for large arrays, and makes sense again with SSDs)
  • bugs in notification code that cause the interval to shorten after a while
  • fileset code that unconditionally triggers a full if you modify the fileset (even if, e.g., you exclude something that isn’t on the system)
  • base jobs not being interlinked and no smart global table
  • design limitations in nextpool directives (can’t easily stage and archive and virtual full at the same time for the same pool)
  • bad transmission error handling (“bad”? NONE!). At least now you could resume, but why can’t it just do a few retries, why does the whole backup need to abort in the first place, if you sent say 5 billion packets and one of them was lost?
  • Director config online reload failing if SSL enabled and @includes of wildcards exist.
  • Simplification of multiple-jobs at the same time to the same file storage, but all jobs to their own files. ATM it is icky to put it nicely. At times you wonder if it wouldn’t be simpler to use a free virtual tape library than deal with how bacula integrates file storage
  • Adding utilities like “delete all backups for this client”, “delete all failed backups, reliably and completely to the point where FS space is freed”


It would be nice if it didn’t take another 15 years until those few but critical bits are ironed out.

And if not that, it would be good for the project to just stand by its limitations. It’s not healthy or worthy if some community members play “blame the user” without being stopped. The general code quality of Bacula is so damn high there’s no reason why one could not admit to limitations. And it would probably be a good step towards solving them.


Some more Windows stuff?



FYI I did some more Windows things :)

Below a few lessons learned and some links that were helpful.


Seems Windows has broken handling of ICMP redirects since Win7 was introduced.

They’re bad, but they’re also turned on in Windows by default (can be configured via some special corner in GPO) and they are not respected. According to docs it should result in a 10-minute routing table entry, but it never does.

So, even as a temporary hack: no. Remove the redirects and rebuild the routing properly right away. Better than debugging a broken kernel!


So we found we needed to push some extra static routes to our test clients via DHCP.

How to do that, especially if your DHCPd is from the last decade?

This is how:
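One common way to do it, and it is an assumption on my part that this is the approach that fits here, is the classless static route option from RFC 3442 (option 121; Windows clients have historically also asked for a Microsoft variant, option 249, with the same layout). An old DHCPd that doesn’t know the option by name will usually still let you define it as raw bytes, so the real work is the encoding. A minimal sketch, with made-up example routes:

```python
import ipaddress


def encode_classless_routes(routes):
    """Encode (destination CIDR, gateway) pairs as RFC 3442 option payload bytes."""
    payload = bytearray()
    for cidr, gateway in routes:
        net = ipaddress.IPv4Network(cidr)  # raises ValueError if host bits are set
        significant_octets = (net.prefixlen + 7) // 8
        payload.append(net.prefixlen)                                # prefix length
        payload += net.network_address.packed[:significant_octets]  # only the needed octets
        payload += ipaddress.IPv4Address(gateway).packed            # router address
    return bytes(payload)


# Hypothetical routes for illustration, not taken from our network.
routes = [("10.1.2.0/24", "192.168.1.1"), ("0.0.0.0/0", "192.168.1.254")]
print(":".join(f"{b:02x}" for b in encode_classless_routes(routes)))
# 18:0a:01:02:c0:a8:01:01:00:c0:a8:01:fe
```

The default route is in the example on purpose: per RFC 3442 a client that gets this option ignores the plain router option, so the default gateway has to be repeated in here.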

Domain controller backups

Normally, Windows always keeps a backup in a configurable location. By default, the backup should also go to the NTDS folder. I recommend you check it out, because we reproducibly found the backup file is not there.

The most perfect howto / KB article for that whole kind of stuff seems to be here:

A secondary help could be this one:


Windows Repair

The repair mode is missing a few commands

Ah, and if you wanna chkdsk, remember to first use diskpart to assign a new drive letter and import your C:\ thing so you can test the right thing.



Still didn’t find any way to get the goddamn QEMU guest agent running well on windows.


SSH Key auth

I looked into being able to do key based auth and GSSAPI auth for SSH.

It seems doable, on the one end you store the key in a field named AltSecurityIdentities and prefix it with SSHKey: so it’ll match on the right data when queried.

That query is done using a helper that comes with sssd and is put in sshd_config (I think).

That means they’re not doing the plain SSH way, but I think many of the “support LDAP certs” things in SSH have stayed in a “here’s a patch” state, so rather something well-integrated via sssd.

The GSS part seems a bit questionable with multiple parties building patched versions of PuTTY. I hope by now the official one is good enough. It seems mostly about sending the right stuff from PuTTY, not a server side ickyness.

I found one guy who re-wired all that to go via LDAP because he didn’t know there’s a Kerberos master in his Windows AD. But good to know that’s also possible :)

A definite todo with this would be to properly put your host keys in DNS so it’s really a safe and seamless experience. DNS registration from Linux to AD *is* possible, and with kerberos set up it should also not include security nightmares. So it’s just about registering one more item (A, PTR and SSHFP)
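For the SSHFP part, the record content is just a hash over the public key blob. A minimal sketch of what would need to be registered, assuming SHA-256 fingerprints; the function name and the hostname in the usage note are made up:

```python
import base64
import hashlib

# SSHFP algorithm numbers as assigned in RFC 4255, RFC 6594 and RFC 7479.
KEY_ALGORITHMS = {
    "ssh-rsa": 1,
    "ssh-dss": 2,
    "ecdsa-sha2-nistp256": 3,
    "ecdsa-sha2-nistp384": 3,
    "ecdsa-sha2-nistp521": 3,
    "ssh-ed25519": 4,
}


def sshfp_record(hostname, pubkey_line):
    """Build an SSHFP record (fingerprint type 2 = SHA-256) from one public key line."""
    key_type, key_b64 = pubkey_line.split()[:2]
    blob = base64.b64decode(key_b64)           # the raw key blob is what gets hashed
    digest = hashlib.sha256(blob).hexdigest()
    return f"{hostname}. IN SSHFP {KEY_ALGORITHMS[key_type]} 2 {digest}"
```

Feeding it the contents of /etc/ssh/ssh_host_ed25519_key.pub and a name like web01.example.org gives a line that can go into the zone. ssh-keygen -r prints the same kind of records, so this is mostly useful if the registration is scripted anyway.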

I would like to get that set up nicely enough that it can be enabled anywhere. My biggest worry is in a cloud context you’re instantiating the new boxes and so you definitely would have a credential management issue.

Unless I do it the hard way and create the computer account from the ONE controller, and then put the credential into the VM context/env so it’ll be able to pick it up and work with this initial token to take over its own computer account.

At that point it would be “proper” and make me happy, but I’ve learned that THAT kind of thing is what you can only build if someone needs it and pays you for it.

(Hobby items should not go into the 4-week effort range. Yeah, you can build “something” in 2 days, but “proper” will take a lot longer).

I’m totally interested in some shortcut that would do a minimal thing instead of the whole.



Libvirt is hilariously stupid: we restored a VM backup image and found it unbootable. It went on like that for some time.

In the end it turned out it was a qcow2, not a raw image. I’m kinda pissed off about this since there’s a bazillion tools in the KVM ecosys that know how to deal with multiple image types, especially qemu itself. But it’s too fucking stupid to autodetect the type. A type that can be detected as simply as doing “file myimage.img”.



We also did a 10gbit upgrade (yes, of course SolarFlare NICs) and found that our disk IO is still limited – limited by the disks behind the SSD cache. So those disks need to go.

What’s vastly improved is live migration times (3-6 seconds for a 4GB VM) and interactive performance in RDP. Watching videos over RDP with multiple clients has become a no-brainer.

I have no idea why I’m not getting the same perf at home: 10g client, 2x10g server, but RDP is much slower. It might be something idiotic like the 4K screen downscaling. All I know is I have no idea :)

OTOH my server has a fraction of the CPU power, too.



Finally, I again managed to split-brain our cluster and GOD DAMN ME next time I’ll learn to just pull the plug instead of any, any other measure.

(How: Misconfigured VLAN tagging – the hosts run untagged and I had a tagAll in place. Should have put the whole port to defaults before starting)

As kids we would wonder, but we didn’t know 2016

As kids we would wonder why no one treated us seriously, like grown-ups, like someone capable of reasoning.

But it’s not about that.

It’s about letting someone keep a glimpse of trust, looking back to a few years where they didn’t yet see wars starting, not being able to do anything to stop an escalation.

Families getting wiped out as a necessity (to someone, after all) on a mountain roadside because they happened to watch a spies’ assassination.

Prisons that have been given up to the inmates and just patrolled from the outside.

Watching your favourite artists die.

Hearing that a friend committed suicide.

Actually, people getting so deeply hopeless that they willingly crash their own airplane, wiping out whole school classes.

Undercover investigators who were the actual enabler of the Madrid subway bombing. Knowing how they’ll also forever be lost in their guilt, not making anything better.

Seeing how Obama’s goodbye gets drowned in humanity wondering if Trump had women pissing on <whatever> for money or not.

Then asking yourself why that even would matter considering T’s definitely a BAD PERSON so who cares what kind of sex he’s into, why can one’s private details take attention from the actual fact that he’s absolutely not GOOD?

Watching a favorite place to be torn down for steel and glass offices.

Understanding what a burnt down museum means.

Life’s inevitable bits, to be confronted with them works only if you had a long peaceful period in your life.

And that, that’s what you really just shouldn’t see or rather understand too early.

After all, there’s an age where we all tried to get toothpaste back into the tube, just because we’d not believe it just doesn’t work that way.


Sorry for this seemingly moody post, it’s really been cooking since that 2012 murder case. Today



On the pro side, there’s movies, Wong Kar-Wai and so many more. There’s art, and the good news is that we can always add more art, and work more towards the world not being a shithole for the generations after us.

But, seriously, you won’t be able to do much good if you look at the burning mess right from the start.

happy about VM compression on ZFS

Over the course of the weekend I’ve switched one of my VM hosts to the newer recommended layout for NodeWeaver – this uses ZFS, with compression enabled.

First, let me admit some things you might find amusing:

  • I found I had forgotten to add back one SSD after my disk+l2arc experiment
  • I found I had one of the two nodes plugged into its 1ge ports, instead of using the 10ge ones.

The switchover looked like this

  1. pick a storage volume to convert, write down the actual block device and the mountpoint
  2. tell LizardFS I’m going to disable it (prefix it with a * and kill -1 the chunkserver)
  3. Wait a bit
  4. tell LizardFS to forget about it (prefix with a # and kill -1 the chunkserver)
  5. umount
  6. ssh into the setup menu
  7. select ‘local storage’ and pick the now unused disk, assign it to be a ZFS volume
  8. quit the menu after successful setup of the disk
  9. kill -1 the chunkserver to enable it
  10. it’ll be visible in the dashboard again, and you’ll also see it’s a ZFS mount.
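Steps 2, 4 and 9 are all the same edit with a different marker, so they’re easy to script. A minimal sketch of just the text edit, assuming the chunkserver’s disk list is a plain file with one mountpoint per line (mfshdd.cfg on my install); the paths in the example are hypothetical, and the kill -1 afterwards is left out:

```python
def mark_disk(cfg_text, mountpoint, prefix):
    """Set the marker in front of `mountpoint`: "*" (drain), "#" (forget) or "" (enable)."""
    wanted = mountpoint.rstrip("/")
    out = []
    for line in cfg_text.splitlines():
        bare = line.lstrip("*# \t").rstrip()  # the line without any old marker
        if bare.rstrip("/") == wanted:
            line = prefix + bare
        out.append(line)
    return "\n".join(out) + "\n"


cfg = "/srv/disk0\n/srv/disk1\n"          # hypothetical disk list
draining = mark_disk(cfg, "/srv/disk1", "*")
print(draining, end="")
```

Calling it again with "#" and finally with "" walks the same line through the forget and re-enable steps.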

Compression was automatically enabled (lz4).

I’ve so far only looked at the re-replication speed and disk usage.

Running on 1ge I only got around 117MB/s (one of the nodes is on LACP and the switch can’t do dst+tcp port hashing, so you end up in one channel).

Running on 10ge I saw replication network traffic go up to 370MB/s.

Disk IO was lower since the compression already kicked in, and the savings have been vast.
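To put the 117MB/s into perspective: that is pretty much all a single gigabit channel can carry. A back-of-the-envelope sketch, assuming a standard 1500 byte MTU, IPv4, TCP with the timestamp option and no VLAN tag:

```python
def tcp_goodput_mb_s(link_bits_per_s, mtu=1500, tcp_options=12):
    """Best-case TCP payload rate in MB/s (10^6 bytes) over Ethernet."""
    wire_overhead = 8 + 14 + 4 + 12        # preamble, Ethernet header, FCS, inter-frame gap
    payload = mtu - 20 - 20 - tcp_options  # minus IPv4 header, TCP header, TCP options
    frame_on_wire = mtu + wire_overhead
    return link_bits_per_s / 8 * payload / frame_on_wire / 1e6


print(round(tcp_goodput_mb_s(1e9), 1))   # 117.7
print(round(tcp_goodput_mb_s(10e9), 1))  # 1176.9
```

So on 1ge the wire was the limit, while on 10ge the same math leaves a lot of headroom and the bottleneck has to be somewhere else.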

[root@node02 ~]# zfs get all | grep -w compressratio
srv0  compressratio         1.38x                  -
srv1  compressratio         1.51x                  -
srv2  compressratio         1.76x                  -
srv3  compressratio         1.53x                  -
srv4  compressratio         1.57x                  -
srv5  compressratio         1.48x                  -

I’m pretty sure ZFS also re-sparsified all sparse files, the net usage on some of the storages went down from 670GB to around 150GB.
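A quick sanity check on those numbers, as a sketch: it turns the compressratio values from the output above into “space saved” percentages.

```python
def space_saved(ratio):
    """Fraction of space saved for a given compressratio (1.38 for "1.38x")."""
    return 1 - 1 / ratio


zfs_output = """\
srv0  compressratio         1.38x                  -
srv1  compressratio         1.51x                  -
srv2  compressratio         1.76x                  -
srv3  compressratio         1.53x                  -
srv4  compressratio         1.57x                  -
srv5  compressratio         1.48x                  -
"""

for line in zfs_output.splitlines():
    name, _prop, value, _source = line.split()
    ratio = float(value.rstrip("x"))
    print(f"{name}: {space_saved(ratio):.0%} saved")
```

That works out to somewhere between 28% and 43% per volume. Even the best ratio of 1.76x would only take 670GB down to about 380GB, so the rest of the drop to 150GB has to come from something other than lz4, which fits the re-sparsifying guess.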


I’ll probably share screenshots once the other node is also fully converted and rebalancing has happened.

Another thing I told a few people was that I ripped out the 6TB HDDs once I found that $customer’s OpenStack cloud performed slightly better than my home setup.

Consider that solved. ;)

vfile:~$ dd if=/dev/zero of=blah bs=1024k count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.47383 s, 434 MB/s

(this is a network-replicated write…)

Make sure to enable the noop disk scheduler before you do that.

Alternatively, if there’s multiple applications on the server (e.g. a container hosting VM), use the deadline disk scheduler with the nomerges option set. That matters. Seriously :)


Happy hacking and goodbye

FrOSCon recap: Rudder (1)

In my talk I had mentioned how handy it is that Normation has its own C developers and maintains its own version of the Rudder agent.

Case in point: a critical bug in CFEngine that had simply already been fixed. And 30 minutes after my question the patch was released as well…

15:29 < darkfader> amousset: i'm inclined to think we could also
use a backport of
15:30 < darkfader> unless someone tells me "oh no i tested with 
a few thousand clients for a few months and it doesn't affect us" ;)
15:33 < amousset> darkfader: it has already been backported 
15:34 < Helmsman> Bug #8875: Backport patch to fix connection cache 
( Pending release issue assigned to  Jonathan CLARKE. 
URL: )
15:37 < darkfader> amousset: heh, pending release 
15:38 < Matya> hey you just released today :)
15:40 < amousset> yesterday actually :)
16:07 < jooooooon> darkfader: it's released now ;)



… because there is, in fact, a release process!

FrOSCon recap: hosting



The two German “we’re serious about being different” web hosters were both at FrOSCon.

By that I mean Hostsharing eG and UBERSPACE.

Hostsharing was founded in 2001 after the Strato disaster; their current (or lifelong?) slogan is: Community Driven Webhosting. They are a cooperative, i.e. ideally you’re not just a customer, you also have a say. Of course there is a hard core of people who keep the gears turning.

I’ve known UBERSPACE for what feels like 2-3 years. They do “Hosting on Asteroids” ;)

Here you get a say in the price, and you get a flexible solution without all the bullshit the mass hosters build around it.

So these are two generations. The idea still seems similar to me:

  • A pricing model oriented towards the best possible service.
  • Openness. The concepts are understandable; a user can tell where their data will end up and can clearly see that it is well kept.
  • Serious server operations (i.e. multiple sites, backups somewhere else, reacting to problems, proactive management)
  • Only putting competent people on it! I.e. many years of real Unix experience, not just a bit of PHP and panel clicking. Everyone involved is roughly on the same level.

Contrast that with the mass market, where often enough you can read that customers are once again being punished for the providers’ security holes, that there were no backups, that no DR concept existed, that broken hardware is kept in production (“Why does the memory have ECC if I’m supposed to have it replaced?”) and so on.

I didn’t get to talk to UBERSPACE because I didn’t even notice they were there. I shared a car with one of them, but I was catching up on sleep. :)

To Hostsharing I gave one piece of advice above all: put prices on it.

Typical idealists :)

But not to be sneezed at: they started 15 years ago and have simply delivered quality ever since.

I wish UBERSPACE the same, and I wish both of them that they can scale their concepts well enough to put the other providers in the market under pressure through quality.


Weblinks: – yes, the first link goes to the tech. In the end only tech, team and processes matter. At least if you don’t need a website builder. – the info page – the admin API, which they hadn’t even told me about :)

FrOSCon recap: coreboot



I was really happy that someone from coreboot was there. The booth was organised by the BSI; apart from that they also help the project with test infrastructure.

I like that, it’s close to what I consider the “important” work of the BSI.

The coreboot developer showed me their new website (looks good, but isn’t online yet), and I was also able to refresh my many-years-old knowledge.

At Intel alone there are 25 people contributing to coreboot!

Native hardware support keeps growing (from 0 to 2 or so, but hey!)

There’s a pretty great Chromebook from HP that I should get.

The payloads are getting more diverse. I had said that I simply hate SeaBIOS and would love to have a Phoenix clone. He asked a few follow-up questions and made me realise that what I actually want is a much better PXE. (See VMware, VirtualBox; do not see KVM or PyPXEboot!)

And he had a tip: Petitboot, a PXE loader you can SSH into, which can do pretty much anything and everything we wish for as admins or QA engineers.

It’s definitely on the to-test list.


What else was at the BSI booth: GPG, OpenPGP and OpenVAS.

VAS would have interested me too, but I think one complete conversation is better than two half ones. :)

This is patching


  • There’s an OpenSSL issue
  • Fetch and rebuild the stable ports tree (2016Q3)
  • Find it won’t build php70 and mod_php70 anymore
  • Try to compare to your -current ports tree
  • Find there’s a php security issue in the version there, but not the one you had
  • Wait till it’s fixed so you can build
  • Type portsnap, then just to be safe first do a full world update to make sure your portsnap isn’t having security issues any more.
  • Updated portsnap has a metadata corruption
  • Remove your portsnap files, try again then just think “whatever” and fetch the ports from the ftp mirror and re-extract manually
  • Notice you just fetched an unsigned file via FTP and will use it to build e.g. your OpenSSL.
  • Rant about that.
  • Find you can’t build because it can’t fetch
  • Debug the reason it can’t fetch
  • Find it’s a bug in the ports tree from the fix of the above security issue
  • Make mental note no one seems to react within 1-2 days if the stable tree is broken
  • While searching for existing bugs, find a PR about pkg audit that tries to redefine the functionality in order to not fix an output regression
  • Open a bug report for the PHP bug, adjust your local file
  • Fetch your new package list
  • Do a pkg audit, find it reports not too much.
  • Do a pkg audit -F, find it gives an SSL error
  • Find the certificate expired 2 months ago.
  • Wonder how no one even reacted to that
  • Find that SSLlabs somehow can’t even properly process the site anymore.
  • Find out that SSLlabs is actually dead just now.
  • Notice in the last lines it had managed to print that the actual hostname points at a rdns-less v6 address and v4 cnames to a random test system.
  • Most likely the webserver ain’t heavily protected in that case, huh?
  • Give up and use http like the goddamn default
  • Random pkg update -> so update pkg first
  • In the end, just no-downtime slam all the new packages over the server because you’re sick of it all.



The next person who posts about “admins need to apply the fixes” I’ll just kick in the face. With my beer.

Diving into sysadmin-legalese

I’ve had the “fun” of writing outage notices like the current Google statement at times. The wording is interesting. IT systems fail, and once it’s time to write about it, management will make sure they are well-hidden. The notice will be written by some poor soul who just wants to go home to his family instead of being a bearer of bad news. If he writes something that is not correct, management will be back just in time for the crucifixion.

Things go wrong, we lose data, we lose money. It’s just something that happens.

But squeezing both the uncertainties and hard truths of IT into public statements is an art of its own.

Guess what? I’ve been in popcorn mood. Here are multiple potential translations for this status notice:

“The issue with Compute Engine network connectivity should have been resolved for nearly all instances. For the remaining few remaining instances we are working directly with the affected customers. No further updates will be posted, but we will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.”

should have been resolved:

we cannot take responsibility. it might come back even just now! we’re not able to safely confirm anything :(

but we spoke to the guy’s boss and they promised he’ll not do that thing again.

for nearly all instances:

the error has been contained, but we haven’t been able to fix it? we’ve been unable to undo the damage where it happened. we were lucky in most cases, but not in all.

the remaining few remaining instances

the remaining remaining… they aren’t going away! Still remaining!

these instances that are left we’ll not be able to fix like the majority where our fix worked.

Please, just let me go to bed and someone else do those? I can’t even recall how many 1000s we just fixed and it won’t stop.

working directly with the affected customers:

only if you’re affected and we certainly know we need to come out, we will involve you. we can’t ascertain if there’s damage for you. we need to run more checks and will tell you when / how long. you get to pick between restores and giving us time to debug/fix. (with a network related issue, that is unlikely)

no further updates will be posted

since it is safely contained, but we don’t understand it, we can’t make any statement that doesn’t have a high chance of being wrong

we only post here for issues currently affecting a larger than X number of users. Now that we fixed it for most, the threshold is not reached

we aren’t allowed to speak about this – we don’t need to be this open once we know the potential liabilities are under a threshold.

will conduct an internal investigation

we are unwilling to discuss any details until we completely know what happened

appropriate improvements  

we are really afraid now to change the wrong thing.

it will be the right button next time, it has to!

management has signed off the new switch at last!

to prevent or minimize future recurrence

we’re not even sure we can avoid this. right now, we would expect this to happen again. and again. and again. recurring, you know? if at least we’d get a periodic downtime to reboot shit before this happens, but we’re the cloud and no one gives us downtimes!

and please keep in mind minimized recurrence means a state like now, so only a few affected instances, which seems under the threshold where we notify on the tracker here.

we really hope it won’t re-occur so we don’t have to write another of those.


Don’t take this too seriously, but these are some of the alarms going off in my head if I see such phrasing.

Have a great day and better backups!

First look at the UP-board

I’ve finally got two UP Boards. After they arrived I also ordered another “Mean Well” dedicated 12V rail mount PSU, and some 2.1mm power cables.

The boards are nice little things with a lot of CPU power. Quad Atom with some cache, eMMC and enough RAM!

Photos, dmesg, hwinfo etc can be found here:

The basics:

My models have 2GB of ram which is shared with the onboard graphics.

They have a big front and back cooling plate, for hardcore usage there’s also an active fan in their shop.

Connectors: USB 2.0, 3.0, 3.0 OTG. The latter is a MacBook Air “style” Type-C flat connector. There’s also power (via a 2.1mm plug), HDMI and some other stuff I didn’t understand.

There’s one connector that has an industrial style plug. This port exposes 2x USB and a serial with BIOS forwarding. You should give in and just buy the cable, there’s no way you’ll easily find the plug on your own.

You’ll need this cable unless you only plan on a desktop use. It doesn’t come with an FTDI serial, so also make sure to get one of those.

The MMC reads at up to 141MB/s (pretty nice) and writes (fdatasync) up to 61MB/s (also pretty OK). TRIM does work.

The LAN interface is just a Realtek, connected via PCIe (2.5GT/s, x1).

BIOS stuff

On boot you’re greeted by a normal EFI shell, reminded me of my late HP-UX days, except here there is no SAN boot scan.

Pressing F7 gives you a boot menu which always also allows going to BIOS Setup, which is a normal phoenix-style menu. Very small and simple – that’s nice.

Serial forwarding is supported, I didn’t try netbooting yet.

OS (ubilinux)

I installed their “default” distro, which is done by flashing the ISO to a stick (or putting it on a CD). Take care to use a USB 2.0 connector if it’s a USB3 stick, or it won’t be detected(!)

The grub menu was really slow, while the BIOS had been quick.

Limiting the video ram to 64MB + UHD screen brought me a system that stopped working once X was up. I didn’t investigate that, instead I booted to single user mode and told systemd to make that a default (systemctl

Ubilinux is a Debian Jessie (sigh) but with some parts scrapped from Ubuntu (sigh).

It works and has all the stuff to e.g. access the GPIO connectors.

lm_sensors detected the coretemp CPU sensors, nothing else.

AES-NI was autoloaded.

The only thing I couldn’t make work yet was the hardware watchdog, which is an issue split between SystemD, packaging and probably something else.

This one gets a 9/10 which is rare :)