A lesson in disaster restores


Hi guys, it’s been a long time.

Three days ago I set out on a long-winded journey: creating a clustered setup for my home services.

For my birthday some months back I had gotten a Raspberry Pi from a good friend, and I had already been running a Nexus7 tablet as my home server since last year.

Now, since I sometimes take the Nexus7 along to show people how useful a monitoring system can become with Check_MK BI rules (especially when it is also portable!) I ran into problems:

Taking the Nexus7 outside of my flat meant I also lost:

  • DNS
  • DHCP
  • SSH Jumphost
  • Monitoring

So the really fun idea was to bring the Raspberry into this.

Gradually I’m turning the two into a Corosync/Pacemaker cluster!

The config is done via Ansible, which really takes this devops toy/toolbox to its limits.

Few people have even configured HA clusters with it, and since Ansible playbooks are meant to be repeatable, it’s also an interesting feat to make sure your playbook doesn’t shut down a live cluster, etc. That’s where the challenges really start and I sometimes wonder if such devs are even aware what real sysadmin work is.

Automation, CI and proper setup of services are the basics, then comes the complex stuff 🙂

Anyway, all in all it’s a fun project to work on during my evenings.

Two days ago I started looking into how I can make Debian 7 (RasPi) and Ubuntu 12.10 (Nexus) more compatible, since the cluster software has a problem using SysV scripts on the one box and Upstart on the other.

While doing that I noticed a nasty hack in /etc/init/nexus7… where they replace the /etc/apt/sources.list on every boot and turn off updates alongside.

Sure enough, there were 400+ MB of updates I had missed. So I triple-checked that there wasn’t a kernel update among them, since I figured a kernel update on an unsupported tablet might not be smart.

There was no kernel update. So I went ahead and ran the update.

Sadly, after the reboot my tablet had lost networking: the wifi module didn’t come up any more due to a firmware error. And yeah, I had no new kernel, but somehow I had a new initrd and new firmware modules… WTF.
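A minimal sketch of a broader pre-upgrade check, assuming you feed it the text printed by a simulated run such as `apt-get -s dist-upgrade`. It flags not only kernel images but also firmware and initramfs packages. The function name and the package prefixes are illustrative assumptions, not the actual package names on the Nexus7 image, and an initrd can also be rebuilt by triggers from other packages, so a clean result here is a hint rather than a guarantee.

```python
# Package-name prefixes worth a second look before upgrading.
# These are assumptions for illustration; adjust them to your distribution.
RISKY_PREFIXES = ("linux-image", "linux-firmware", "initramfs-tools")


def risky_upgrades(simulated_output, prefixes=RISKY_PREFIXES):
    """Return names of to-be-installed packages that touch kernel, firmware or initrd.

    `simulated_output` is the text of a simulated apt run, where every
    package that would be installed or upgraded appears on a line that
    starts with "Inst <package-name>".
    """
    risky = []
    for line in simulated_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Inst" and parts[1].startswith(prefixes):
            risky.append(parts[1])
    return risky
```

If the returned list is non-empty, that is a good moment to make sure an offline copy of /boot, /lib/modules and /lib/firmware exists before rebooting.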

Now, how to get out of this mess?

USB Ethernet doesn’t work since those modules are missing in the Nexus7 kernel.

Run a Bacula restore on another system and restore:

  • /boot/initrd.img
  • /lib/modules
  • /lib/firmware

Put them in a tarball on a USB pen drive, and from there you can recover the files onto your tablet.

Reboot, et voilà, wifi is back.
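The tarball step can be sketched roughly like this, assuming the Bacula restore was written to a scratch directory on the other system (called `restore_root` here, a hypothetical name) with the original directory layout underneath it. The helper name and the output path are made up for illustration.

```python
import tarfile
from pathlib import Path

# Paths relative to the restore directory, mirroring the list above.
# The initrd pattern also matches versioned file names.
DEFAULT_PATTERNS = ("boot/initrd.img*", "lib/modules", "lib/firmware")


def build_recovery_tarball(restore_root, out_path, patterns=DEFAULT_PATTERNS):
    """Pack the restored boot files into a gzip tarball.

    Archive member names are kept relative to `restore_root`, so the
    tarball can later be unpacked relative to / on the broken machine.
    Returns the relative paths that were actually found and packed.
    """
    restore_root = Path(restore_root)
    packed = []
    with tarfile.open(out_path, "w:gz") as tar:
        for pattern in patterns:
            for src in sorted(restore_root.glob(pattern)):
                rel = src.relative_to(restore_root).as_posix()
                tar.add(src, arcname=rel)  # directories are added recursively
                packed.append(rel)
    return packed
```

Copy the resulting file to the pen drive; on the tablet, something like `tar -xzpf recovery.tar.gz -C /` as root should put the files back in place. Checking the returned list first is cheap insurance against carrying an empty tarball across the room.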

I also did a full restore of my main OMD Nagios site, since I had deleted that during the whole mess (omd create sitename --bare, then ssh from the Nexus7 to the backup server, add /opt/omd/sites/sitename/* and set it to restore to the original location).

Oh Ubuntu, why do you have to suck so badly?

And thanks for giving me a challenge.

Next: disable the isc-dhcp Upstart job and port over the Debian init script from the RasPi.

(Via Ansible. Of course)

Oh – and the lesson:

It’s important to have a way to transfer OS level backups via different methods as a DR fallback.

Networking can be _gone_. Especially during real disaster scenarios, you’ll have a hard time if you assume your PXE server will work, routing will be there, etc.

(It’s a good way to look like an idiot if you have to sit and wait until the network is really working after an outage. HP-UX had that down quite nicely: you’d just prepare offline bootable tapes / ISOs on the main install server’s failover box, and with those you could already have brought up your most critical boxes.)
