So I ended up getting the dreaded “Packet size too big” error from Bacula.
I wasn’t sure when it started: either with the High Sierra update or with some brew update.
The error looks like this:
20-Oct 12:53 my-bacula-dir JobId 0: Fatal error: bsock.c:579 Packet size=1073741835 too big from “Client: Utopia:192.168.xx.xx:9102. Terminating connection.
It can be reproduced simply by doing “status client”, and will also happen if you try to do a backup.
If you look into the error you’ll find an entry in the Bacula FAQ that covers Windows specifics, and says how to proceed if it’s not one of the known causes they explain:
Get a trace file using -d100, and report on the list.
So, the first thing I found is that it won’t make a trace file, at least not on OSX.
You can alternatively use -d100 plus a tcpdump -c 1000 port 9102 to get reasonable debug info.
While looking at the mailing list I also found that the general community support for any mention of this error is horrible.
You’re being told you got a broken NIC, or that your network is mangling the data, etc.
All of which were very plausible scenarios back in, say 2003, when bacula was released.
Nowadays with LRO/GSO and 10g nics it is not super unimaginable to receive a 1MB sized packet. For a high volume transfer application like backup, it is in fact the thing that SHOULD HAPPEN.
But in this case people seem to do anything they can to disrupt discussion and blame the issue on the user. In one case they did that with high effort, even when a guy proved he could reproduce it using his loopback interface, no network or corruption involved at all.
I’m pretty sick of those guys and so I also did everything I could – to avoid writing to the list.
Turns out the last OSX brew update went to version 9 of the bacula-fd while my AlpineLinux director still is on 7.x.
Downgrading using brew switch bacula-fd 7.xxx solved this for good.
Now the fun question is: is it that TSO is somehow influencing bacula 9, but not 7, and my disabling TSO via sysctl had no effect? Or is it that they did at last allow more efficient transfers in newer versions, and that broke compatibility BECAUSE for 10 years they’ve been blaming their users, their users’ networks and anything else they could find?
Just so they’d not need to update the socket code?
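One detail in the error message fits the second theory: 1073741835 is exactly 2^30 + 11. That is what you would get from a tiny 11-byte message with a single high bit set in the length word, for example if a newer client marks its packets with a flag the older director doesn't understand. That reading is a guess on my part; I have not checked it against the bacula source. The bit position and the helper name in this little sketch are assumptions, purely to show the arithmetic:

```python
# Size taken verbatim from the error message above.
reported = 1073741835

# Assumption: bit 30 of the length word could be a flag rather than part of
# the length. This position is hypothetical; it is simply the one that
# matches the number in the log.
FLAG_BIT = 30


def split_length(value, flag_bit=FLAG_BIT):
    """Split a length word into (flag_set, remaining_length)."""
    mask = 1 << flag_bit
    return bool(value & mask), value & (mask - 1)


flag_set, length = split_length(reported)
print(hex(reported))     # 0x4000000b
print(flag_set, length)  # True 11
```

If that is what is going on, nobody ever sent a gigabyte-sized packet. The director would just be reading a flagged 11-byte message as a plain length.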
There are other topics that have been stuck for a decade, and I wonder if they should just be put up as GSoC projects to benefit the community, but also anyone who can tackle them!
- multisession fd’s (very old feature in commercial datacenter level backup, often it can even stream multiple segments of the same file to different destinations. Made sense for large arrays, and makes sense again with SSDs)
- bugs in notification code that cause the interval to shorten after a while
- fileset code that unconditionally triggers a full if you modify the fileset (even, e.g., if you exclude something that isn’t on the system)
- base jobs not being interlinked and no smart global table
- design limitations in nextpool directives (can’t easily stage and archive and virtual full at the same time for the same pool)
- bad transmission error handling (“bad”? NONE!). At least now you could resume, but why can’t it just do a few retries, why does the whole backup need to abort in the first place, if you sent say 5 billion packets and one of them was lost?
- Director config online reload failing if SSL enabled and @includes of wildcards exist.
- Simplification of multiple jobs running at the same time to the same file storage, but each job writing to its own files. ATM it is icky, to put it nicely. At times you wonder if it wouldn’t be simpler to use a free virtual tape library than deal with how bacula integrates file storage.
- Adding utilities like “delete all backups for this client”, “delete all failed backups, reliably and completely to the point where FS space is freed”
It would be nice if it doesn’t need another 15 years till those few but critical bits are ironed out.
And if not that, it would be good for the project to just stand by its limitations. It’s not healthy or worthy if some community members play “blame the user” without being stopped. The general code quality of bacula is so damn high there’s no reason why one could not admit to limitations. And it would probably be a good step towards solving them.