FYI, I did some more Windows things 🙂
Below are a few lessons learned and some links that were helpful.
Non-Routing
Seems Windows has broken handling of ICMP redirects since Win7 was introduced.
They’re a bad idea to begin with, but they’re also turned on in Windows by default (configurable via some special corner in GPO) and then not respected. According to the docs, a redirect should result in a 10-minute routing table entry, but it never does.
So, even as a temporary hack: no. Remove them and rebuild the routing properly right away. Better than debugging a broken kernel!
Routing
We found we needed to push some extra static routes to our test clients via DHCP.
How to do that, especially if your DHCPd is from the last decade?
This is how:
http://thomasjaehnel.com/blog/2010/01/pushing-routes-via-dhcp.html
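For reference, a minimal sketch of what actually goes over the wire, assuming the routes are pushed as classless static routes (option 121 from RFC 3442, which Microsoft clients also know as option 249). Each route is encoded as the prefix length, then only the significant octets of the destination, then the gateway. The function names and example addresses below are made up for illustration; how you paste the resulting bytes into your DHCPd config depends on the server.

```python
import ipaddress


def encode_classless_routes(routes):
    """Encode (destination CIDR, gateway) pairs as an RFC 3442 payload."""
    payload = bytearray()
    for cidr, gateway in routes:
        net = ipaddress.ip_network(cidr, strict=True)
        # Only the octets covered by the prefix are sent.
        significant = (net.prefixlen + 7) // 8
        payload.append(net.prefixlen)
        payload += net.network_address.packed[:significant]
        payload += ipaddress.ip_address(gateway).packed
    return bytes(payload)


def as_hex(payload):
    """Colon-separated hex, handy for pasting into a config file."""
    return ":".join(f"{b:02x}" for b in payload)


if __name__ == "__main__":
    # Hypothetical route: 10.1.0.0/16 via 192.168.1.1
    print(as_hex(encode_classless_routes([("10.1.0.0/16", "192.168.1.1")])))
    # → 10:0a:01:c0:a8:01:01
```

Note that a default route is just prefix length 0 followed directly by the gateway, with no destination octets at all.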
Domain controller backups
Normally, Windows always keeps a backup in a configurable location. By default, the backup should also go to the NTDS folder. I recommend you check yours, because we reproducibly found that the backup file was not there.
The best howto / KB article for this whole topic seems to be this one:
Active Directory Database Maintenance
A secondary source that might help:
http://eniackb.blogspot.de/2009/06/active-directory-database.html
Windows Repair
The repair mode is missing a few commands.
Ah, and if you want to run chkdsk, remember to first use diskpart to assign a new drive letter and import your C:\ thing so you can test the right thing.
QEMU-GA
Still didn’t find any way to get the goddamn QEMU guest agent running well on Windows.
SSH Key auth
I looked into being able to do key based auth and GSSAPI auth for SSH.
It seems doable: on the one end, you store the key in a field named AltSecurityIdentities and prefix it with SSHKey: so it’ll match on the right data when queried.
That query is done using a helper that comes with sssd (sss_ssh_authorizedkeys) and is hooked into sshd_config (I think via AuthorizedKeysCommand).
That means they’re not doing it the plain SSH way, but I think many of the “support LDAP certs” things in SSH have stayed in a “here’s a patch” state, so I’d rather have something well-integrated via sssd.
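To make the prefix idea concrete, here is a tiny sketch of the filtering step, assuming the attribute comes back as a list of strings and that SSH keys are exactly the values starting with SSHKey:. This is not how sssd does it internally, just an illustration; the function name and the placeholder values are made up.

```python
PREFIX = "SSHKey:"


def ssh_keys_from_alt_security_identities(values):
    """Return the SSH public keys found among AltSecurityIdentities values."""
    keys = []
    for value in values:
        # Other mappings (e.g. certificate ones) share the attribute,
        # so only values carrying our prefix are considered.
        if value.startswith(PREFIX):
            key = value[len(PREFIX):].strip()
            if key:
                keys.append(key)
    return keys


if __name__ == "__main__":
    # Placeholder values, not real keys.
    example = ["X509:<placeholder>", "SSHKey:ssh-ed25519 <base64-blob> user@host"]
    for key in ssh_keys_from_alt_security_identities(example):
        print(key)
```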
The GSS part seems a bit questionable, with multiple parties building patched versions of PuTTY. I hope by now the official one is good enough. It seems to be mostly about sending the right stuff from PuTTY, not about server-side ickiness.
I found one guy who re-wired all that to go via LDAP because he didn’t know there’s a Kerberos master in his Windows AD. But good to know that’s also possible 🙂
A definite todo with this would be to properly put your host keys in DNS so it’s really a safe and seamless experience. DNS registration from Linux to AD *is* possible, and with Kerberos set up it should also not involve security nightmares. So it’s just about registering one more item (A, PTR and SSHFP).
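The SSHFP part is simple enough to sketch: the record carries an algorithm number, a fingerprint type, and a hash over the decoded public key blob. Below is a minimal version that only emits SHA-256 fingerprints (type 2), assuming an OpenSSH-style public key line as input. The hostname is a placeholder and the function name is made up; `ssh-keygen -r <hostname>` does the same job for real.

```python
import base64
import hashlib

# SSHFP algorithm numbers per key type.
ALGORITHMS = {
    "ssh-rsa": 1,
    "ssh-dss": 2,
    "ecdsa-sha2-nistp256": 3,
    "ecdsa-sha2-nistp384": 3,
    "ecdsa-sha2-nistp521": 3,
    "ssh-ed25519": 4,
}


def sshfp_record(hostname, pubkey_line):
    """Build an SSHFP record (SHA-256) from one line of a host key .pub file."""
    key_type, b64_blob = pubkey_line.split()[:2]
    blob = base64.b64decode(b64_blob)
    digest = hashlib.sha256(blob).hexdigest()
    # Fingerprint type 2 means SHA-256.
    return f"{hostname} IN SSHFP {ALGORITHMS[key_type]} 2 {digest}"
```

Usage would be something like `sshfp_record("box.example.com", open("/etc/ssh/ssh_host_ed25519_key.pub").read())`, with the result fed into whatever does the DNS update.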
I would like to get that set up nicely enough that it can be enabled anywhere. My biggest worry is that in a cloud context you’re instantiating the new boxes, so you definitely would have a credential management issue.
Unless I do it the hard way and create the computer account from the ONE controller, and then put the credential into the VM context/env so it’ll be able to pick it up and work with this initial token to take over its own computer account.
At that point it would be “proper” and make me happy, but I’ve learned that THAT kind of thing is what you can only build if someone needs it and pays you for it.
(Hobby items should not go into the 4-week effort range. Yeah, you can build “something” in 2 days, but “proper” will take a lot longer).
I’m totally interested in some shortcut that would do a minimal thing instead of the whole thing.
QEMU
Libvirt is hilariously stupid – we restored a VM backup image and found it unbootable. It went on like that for some time.
In the end it turned out it was a qcow2, not a raw image. I’m kinda pissed off about this since there’s a bazillion tools in the KVM ecosystem that know how to deal with multiple image types – especially qemu itself. But it’s too fucking stupid to autodetect the type. A type that can be detected as simply as doing “file myimage.img”.
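For the record, the detection really is that cheap. A minimal sketch, assuming you only care about telling qcow/qcow2 apart from everything else: the file starts with the magic bytes "QFI" followed by 0xfb, then a big-endian version number. The function names are made up, and `None` here just means "no qcow magic", which is not proof that the image is raw. `qemu-img info` gives you the real answer.

```python
QCOW_MAGIC = b"QFI\xfb"


def detect_format_from_header(header):
    """Classify an image by its first 8 bytes; None if no qcow magic."""
    if len(header) >= 8 and header[:4] == QCOW_MAGIC:
        version = int.from_bytes(header[4:8], "big")
        # Version 1 is the old qcow format; 2 and up are qcow2.
        return "qcow2" if version >= 2 else "qcow"
    return None


def detect_image_format(path):
    """Read the header of the file at `path` and classify it."""
    with open(path, "rb") as f:
        return detect_format_from_header(f.read(8))
```

A restore script could call `detect_image_format()` before defining the VM and refuse to continue if the result doesn’t match the configured disk type.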
10Gbit
We also did a 10Gbit upgrade (yes, of course SolarFlare NICs) and found that our disk IO is still limited – limited by the disks behind the SSD cache. So those disks need to go.
What’s vastly improved is live migration times (3-6 seconds for a 4GB VM) and interactive performance in RDP. Watching videos over RDP with multiple clients has become a no-brainer.
I have no idea why I’m not getting the same perf at home – 10g client, 2x10g server, but RDP is much slower. It might be something idiotic like the 4K screen downscaling. All I know is I have no idea 🙂
OTOH my server has a fraction of the CPU power, too.
Nobrains
Finally, I again managed to split-brain our cluster and GOD DAMN ME next time I’ll learn to just pull the plug instead of any, any other measure.
(How: Misconfigured VLAN tagging – the hosts run untagged and I had a tagAll in place. Should have reset the whole port to defaults before starting.)