Bacula version clash between 5 and 7


This is the second time I have run into the error “Authorization key rejected by Storage daemon.”

It makes backups and restores impossible. Most traces / explanations on the internet will point at FD hostname or SD hostname or key mismatch issues.

That is of course always possible, but if you had it working until a week ago when you updated – please don’t let them discourage you. This error will also occur for any version 7 client connecting to a version 5 server. I’ve had it on my Macbook after running “port upgrade outdated” and just now on my FreeBSD desktop during a migration restore.
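
If you want to rule this out quickly, the version check can be sketched in a few lines of Python. This is only a rough sketch under assumptions: it reads the “Version:” line that bconsole prints for `status client`, and the function names and the server version tuple are made up for illustration, not anything Bacula ships.

```python
import re

# Matches the version line of a bconsole "status client" report, e.g.
# "Egal Version: 5.2.12 (12 September 2012)  amd64-portbld-freebsd10.0"
VERSION_RE = re.compile(r"Version:\s+(\d+)\.(\d+)\.(\d+)")


def parse_version(status_output):
    """Return (major, minor, patch) from status output, or None if absent."""
    match = VERSION_RE.search(status_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def major_mismatch(client_status, server_version):
    """True if the client's major version is newer than the server's.

    server_version is a (major, minor, patch) tuple you supply yourself
    (hypothetical input, this sketch does not query the server).
    """
    client = parse_version(client_status)
    if client is None:
        raise ValueError("no version line found in status output")
    return client[0] > server_version[0]
```

Fed the status line from the log further down, `parse_version` yields `(5, 2, 12)`. A client reporting major version 7 against a version 5 server would make `major_mismatch` return True, which is the situation described here.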

The jobs will abort after the client is asked to send/receive files.

Debug output of the storage daemon shows that this is in fact a client error!

The red herring, a Bacula error message saying

Authorization key rejected by Storage daemon

is completely wrong.

They just abstracted / objectified their logging a little too much. The SD received the error “client didn’t want me” and has to pass it on. Not helpful. Sorry guys 🙂

As a warning / example, have a look at the log:

JobName: RestoreFiles
Bootstrap: /var/lib/bacula/mydir-dir.restore.1.bsr
Where:
Replace: always
FileSet: Full Set
Backup Client: Egal
Restore Client: Egal
Storage: PrimaryFileStorage-int
When: 2014-09-14 12:40:15
Catalog: MyCatalog
Priority: 10
Plugin Options: *None*
OK to run? (yes/mod/no): yes
Job queued. JobId=17300
*mess
14-Sep 12:40 waxu0604-dir JobId 17300: Start Restore Job RestoreFiles.
14-Sep 12:40 waxu0604-dir JobId 17300: Using Device "PrimaryFileDevice"
14-Sep 12:39 Egal JobId 17300: Fatal error: Authorization key rejected by Storage daemon.
Please see http://www.bacula.org/en/rel-manual/Bacula_Freque_As[...]
*status client=Egal
Connecting to Client Egal at 192.168.xxx:9102

Egal Version: 5.2.12 (12 September 2012)  amd64-portbld-freebsd10.0
Daemon started 14-Sep-14 12:43. Jobs: run=0 running=0.
 Heap: heap=0 smbytes=21,539 max_bytes=21,686 bufs=50 max_bufs=51
 Sizeof: boffset_t=8 size_t=8 debug=0 trace=0 
Running Jobs:
Director connected at: 14-Sep-14 12:43
No Jobs running.
====

As you saw, the restore aborts while a status client works just fine.
The same client is now running its restore without ANY issue after doing no more than downgrading the client to version 5.

*status client=Egal
Connecting to Client Egal at 192.168.xxx.xxx:9102

Egal Version: 5.2.12 (12 September 2012)  amd64-portbld-freebsd10.0
Daemon started 14-Sep-14 12:43. Jobs: run=0 running=0.
 Heap: heap=0 smbytes=167,811 max_bytes=167,958 bufs=96 max_bufs=97
 Sizeof: boffset_t=8 size_t=8 debug=0 trace=0 
Running Jobs:
JobId 17301 Job RestoreFiles.2014-09-14_12.49.00_41 is running.
      Restore Job started: 14-Sep-14 12:48
    Files=2,199 Bytes=1,567,843,695 Bytes/sec=10,812,715 Errors=0
    Files Examined=2,199
    Processing file: /home/floh/Downloads/SLES_11_SP3_JeOS_Rudder_[...]

All fine, soon my data will be back in place.

(Don’t be shocked by the low restore speed, my “server” is running the SDs off a large MooseFS share built out of $100 NAS storages.
I used to have the SDs directly on NAS and got better speeds with that but I like distributed storage better than speed)


Open Source Backup Conference – Impressions


After 4 years I finally managed to visit this conference.

The former Bacula conference ended up with a rename to “Open source
backup conference” recently and, under the new name, took place in Cologne
on Wednesday.
The idea of the rename was to extend the scope to other open source backup
solutions than just Bacula and its younger cousin Bareos.
This worked out pretty well, my personal highlight was a talk on an entirely
different software named “relax and recover” (rear). It’s a Linux-only system
like Ignite for HP-UX or mksysb on AIX, covering emergency restores of critical
systems.
This kind of software is tuned to store the recover images on different media
and to be highly reliable at the time of restore, even if confronted with a
blank or different system. So it’ll do things beyond what normal backup
software handles, e.g. setting up your RAID devices.

The story about Bareos is an interesting and sad one: apparently Bacula, in
its open source form, is halfway dead. There aren’t any recent commits in the
git tree and there was even an incident where a new feature was pulled out of
the OSS version so it only became available to the users of the commercial fork.

The people / companies behind Bareos are more OSS-minded than this and made sure
the feature set is much closer to that of the enterprise Bacula version.

As a user, that’s of course great, stuff like lighter compression algos or the
“relaxed” SSL mode are quite interesting from a practically minded point of
view. They also used a lot of time to remove stale code and are now looking at
ways to move forward.

A nice and well designed build chain ensures the code you download will actually
work by doing a really large bunch of regression testing in various stages.

There was also a nice talk by someone from (or related to) NetApp who managed
to show off a really well-designed large-scale backup system without delving
into marketing *at all*. Still, I got shiny eyes once he mentioned there’s an
EqualLogic array that offers native InfiniBand storage access.

What impressed me is that the backup community there is far friendlier than e.g. the monitoring community. There was no visible competition among the different
companies.
Well. Save for the one guy from collax community server who set the low point
by doing a 3-4 minute talk about his product during the presentation of another one. He did a good job of making sure I’ll recommend people to avoid his stuff.

My own talk on Monitoring Bacula was quite fun for me, and generated many
really interesting questions. I slipped at one point by giving a very personal opinion on old stale Perl code 🙂

The feedback itself was split: some people were unhappy since, well, I
presented the current state of affairs (“Bugs, bugs, bugs”), and others were
motivated to improve things. I stuck with the second group, or actually, they
more or less trapped me in a corner later on, discussing ways to add crucial checks, etc.
It seems we’re on the way to improving the monitoring a lot.
Doing a lot of basics like collecting all the old checks in one place is the
thing I started on.

 

After that I ended up with a Bacula book which is awesome since it’ll help me debug some issues and even has a nice chapter on pool management.

Ah, yeah, that was something I missed – there was little exchange about best
practices, like “how do you name your pools, how this, how that”.

That was just perfect at the OpenNebula Conference the next day. By then I’d also gotten a few valued hours of sleep.

A lesson in disaster restores


Hi guys, it’s been a long time.

3 days ago I started into a long-winded journey of creating a clustered setup for my home services.

For my birthday some months back I had gotten a Raspberry Pi from a good friend, and I had already been running a Nexus7 tablet as my home server since last year.

Now, since I sometimes take the Nexus7 along to show people how useful a monitoring system can become with Check_MK BI rules (especially when it is also portable!) I ran into problems:

Taking the Nexus7 outside of my flat meant I also lost:

  • DNS
  • DHCP
  • SSH Jumphost
  • Monitoring

So the really fun idea was to bring the Raspberry into this.

Gradually I’m turning the two into a Corosync/Pacemaker cluster!

The config is done via Ansible which really takes this devops toy/toolbox to its limits.

Few people have even configured HA clusters with it, and since Ansible playbooks are meant to be repeatable, it’s also an interesting feat to make sure your playbook doesn’t shut down a live cluster, etc. That’s where the challenges really start and I sometimes wonder if such devs are even aware what real sysadmin work is.

Automation, CI and proper setup of services are the basics, then comes the complex stuff 🙂

Anyway, all in all it’s a fun project to work on during my evenings.

2 days ago I started looking into how I can make Debian7 (RasPi) and Ubuntu12.10 (Nexus) more compatible since the cluster software has a problem using SysV scripts on the one box and Upstart on the other.

Along the way I noticed an ugly hack in /etc/init/nexus7… where they replace the /etc/apt/sources.list on every boot and turn off updates alongside.

Sure enough, there were 400+MB of updates I had missed. So, I triple-checked there wasn’t a kernel update along with those, since I figured a kernel update on an unsupported tablet might not be smart.

There was no kernel update. So I went ahead and ran the update.

Sadly, after the reboot my tablet had lost networking, the wifi module didn’t come up any more due to a firmware error. And yeah, I had no new kernel, but somehow I had a new initrd and firmware modules… WTF.

Now, how to get out of this mess?

USB Ethernet doesn’t work since those modules are missing in the Nexus7 kernel.

Run a bacula restore on another system, restore:

  • /boot/initrd.img
  • /lib/modules
  • /lib/firmware

Put it in a tarball on a USB pendrive, and with that you can then recover the files to your tablet.
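
The tarball step could look roughly like the Python sketch below. I did this by hand; the function name and the restore directory layout are assumptions for illustration, with the three paths taken from the list above and made relative to wherever Bacula restored them.

```python
import tarfile
from pathlib import Path

# Paths restored in this post, relative to the restore directory.
RESCUE_PATHS = ["boot/initrd.img", "lib/modules", "lib/firmware"]


def build_rescue_tarball(restore_root, tarball_path, paths=RESCUE_PATHS):
    """Pack the selected paths from restore_root into a gzip'd tarball.

    Returns the relative paths that were actually added; paths missing
    under restore_root are skipped rather than treated as errors.
    """
    root = Path(restore_root)
    added = []
    with tarfile.open(tarball_path, "w:gz") as tar:
        for rel in paths:
            source = root / rel
            if not source.exists():
                continue
            # arcname keeps archive members relative, so the tarball
            # can later be unpacked from / on the broken system.
            tar.add(str(source), arcname=rel)
            added.append(rel)
    return added
```

Keeping the member names relative is a deliberate choice: you decide on the target machine where to unpack, instead of the archive deciding for you.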

Reboot, et voilà, WiFi is back.

Now I also did a full restore of my main OMD Nagios site since I had deleted that during the whole mess (omd create sitename --bare, then ssh from nexus7 to backup server, add /opt/omd/sites/sitename/* and set it to restore to original location.)

Oh Ubuntu, why do you have to suck so badly?

And thanks for giving me a challenge.

Next, disabling the isc-dhcp upstart job and porting over the Debian init script from the Raspi.

(Via Ansible. Of course)

Oh – and the lesson:

It’s important to have a way to transfer OS level backups via different methods as a DR fallback.

Networking can be _gone_, especially during real disaster scenarios you’ll have a hard time if you assume your PXE server will work, routing will be there, etc.

(It’s a good way to look like an idiot if you have to sit and wait till the network is really working after an outage. HP-UX had that down quite nicely: you’d just prepare offline bootable tapes / ISOs on the main install server’s failover box, and with that you could’ve brought up your most critical boxes already)

Next on this channel


Instead of a New Year’s resolution* I’ve looked into what things to work on next. Call it milestones, achievements, whatever.

  • I’ve already cleaned up my platform, based on Alpine Linux now.
  • IPv6 switchover is going well but not a prime concern. Much stuff just works and other stuff is heavily broken, so it’s best to not rush into a wall.
  • Bacula: I’ve invested a lot of time into my backup management routine again. This paid off and made clear it was stupid to decide per-system which VMs to back up and which not. If you want to have reliable backups, just back up everything and be done with it. Sidequests still available are splitting catalogs and wondering why there is no real operations manual. (Rename a client? Move a client’s backup catalogs? All this stuff is still at a level that I’d call grumpy cleverness: “You can easily do that with a script”. Yeah, but how come the prime open source backup tool doesn’t bring along routine features that are elsewhere handled with ONE keypress? F2 to the rescue.)
  • cfengine: This will be my big thing over the next 3 weeks, at home and on holiday. Same goal, coming to grips real well. During the last years I’ve tried puppet, liked but not used Chef and glimpsed at SALT. Then I skipped all of them and decided that Ansible was good for the easy stuff and for the not easy stuff I want the BIG (F) GUN, aka cfengine.
  • Ganeti & Job scheduling: In cleaning up the hosting platform I’ve seen I’ve missed a whole topic automation-wise. Ahead scheduling of VM migrations, Snapshots etc. A friend is pushing me towards ganeti and it sure fills a lot of gaps I currently see, but it doesn’t scale up to the larger goal (OpenNebula virtual datacenters). I’ll see if there is a reasonable way for picking pieces out of Ganeti. Still, the automation topic stays unresolved. There is still no powerful OSS scheduler – the existing ones are all aimed at HPC clusters which is very easy turf compared to enterprise scheduling. So it seems I’ll need to come up with something really basic that does the job.
  • Confluence: That’s an easier topic, I’m working to complete my confluence foo so that I’ll be able to make my own templates and use it real fast.

What else…. oh well that was quite a bit already 😉

Otherwise I’ve been late (in deciding) for one project that would have been a lovable start and turned down two others because they were abroad. Being at the start of this new (selfemployed) career I’m feeling a reasonable amount of panic. Over the weekends it usually turns into a less reasonable one 😉

But yet also cheerful at being able to give a focus on building up my skills, and I also think it was the right decision. I went into the whole Unix line of work by accident but loved it since “if it doesn’t work you know it’s your fault” – a maybe stale, yet basically bug-less environment where you concentrate on issues that come up in interactions of large systems instead of bug after bug after bug. (See above at platform makeover – switching to alpine Linux has been so much fun for the same reason).

My website makeover is in progress and I’m happy to know I’ll visually arrive in this decade-2.0 with it soon.

Of course I’ve also run into some bugs in DTC while trying to auto-setup a WordPress setup. DTC is the reason for the last Debian system I’m keeping. Guess who is REALLY at risk of being deprecated now 😉

*not causing doomsday worked for 2012 though

Check_MK and Bacula


I’ve prepared a small Xmas present for the Check_MK / Bacula users like me.

At this link you can find an overview of current monitoring options if you want to have Nagios tell you if “something is wrong with the backups”.

For my own use I’ve now decided on a minimalistic yet powerful option using Bacula’s Runscript and flag files that are used to monitor job status from Check_MK.

I like this since it will also work if the Bacula server breaks down altogether.
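
To give an idea of the flag file approach, here is a minimal Python sketch of a Check_MK local check. It is not the code behind the link. It assumes that a Runscript touches one flag file per job after a successful run, that the file name doubles as the job name and contains no spaces, and the function name is made up for illustration.

```python
import os
import time


def check_flag_files(flag_dir, max_age_seconds, now=None):
    """Return Check_MK local-check lines, one per flag file in flag_dir.

    Each line has the form "<state> <service> <perfdata> <text>",
    where state 0 means OK and state 2 means CRIT.
    """
    now = time.time() if now is None else now
    lines = []
    for name in sorted(os.listdir(flag_dir)):
        # The flag file's mtime stands in for "last successful run".
        age = int(now - os.path.getmtime(os.path.join(flag_dir, name)))
        state = 0 if age <= max_age_seconds else 2
        verdict = "OK" if state == 0 else "CRIT"
        lines.append(
            f"{state} Bacula_{name} age={age} "
            f"{verdict} - last successful run {age}s ago"
        )
    return lines
```

Printing the returned lines from a script in the agent’s local check directory would be enough to get one service per job. Since the check only looks at file ages on the client, it goes critical on its own once the flags stop being refreshed, whatever happened to the Bacula server.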

Let me know any additions you think worthwhile.

Bacula Conference


I couldn’t bang my head against the table as often as I really ought to.
When I saw the date, I didn’t notice that the call for papers had already closed 4 days earlier (2 working days, to be more precise?)
If I hadn’t waited so long to finally find out whether I’d be going alone or not, then whether I could get the time off, then maybe also whether the sun would be shining that day and so on, it would surely all have worked out without a problem.

At the moment I can really rely on everything I touch, for once, not breaking but merely turning into a sticky, useless, grey-green-yellow checkered mass.

Bacula Conference


On 23.09.2009 a (the first!) conference on Bacula takes place in Cologne.

Here is the call for papers.
If at all possible I’ll go, and if a little time falls into my hands I’ll freshen up the Bacula Cluster Howto and present it there. For me alone it was always a bit too much anyway. 🙂
But first I need to find out whether I’m going there alone or not.