Just to record the most important things from a day last week when things were special.
We set out to upgrade a forgotten 2003 domain to 2012, and along the way it would finally get a second DC.
So we expected these steps:
- Put domain in 2003 mode
- Add 2012 DC, if not happy, do forest prep or so
- Switch all FSMO
- Shut down 2003 DC, test if NAS shares are accessible
- Restart it
- Ready a 2012 VM as secondary DC
- Dcpromo the new VM
- Demote the old DC
- Clean up DNS/LDAP should we find any leftovers
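For the "Switch all FSMO" step, a quick sanity check helps before the old DC gets shut down. This is only a minimal sketch: the role labels are the ones I’d expect from netdom query fsmo, the `holders` mapping is assumed to be filled in by hand (or by your own parser), and the function name is made up for illustration.

```python
# The five FSMO roles, labelled the way `netdom query fsmo` is assumed to print them.
FSMO_ROLES = (
    "Schema master",
    "Domain naming master",
    "PDC",
    "RID pool manager",
    "Infrastructure master",
)


def roles_not_on(holders, new_dc):
    """Return FSMO roles that are missing or still held by another DC.

    `holders` maps role name -> DC hostname (hypothetical input format).
    Hostnames are compared case-insensitively.
    """
    target = new_dc.lower()
    return [role for role in FSMO_ROLES
            if holders.get(role, "").lower() != target]
```

An empty result means all five roles sit on the new DC; anything else is a reason not to power off the old one yet.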
It didn’t go anywhere near as smoothly.
First we hit a lot of DNS issues because not all clients used DHCP, so they ended up looking for a DNS server that wasn’t there.
Once that was fixed we found stuff still wasn’t OK at all (No logon servers available). On the new DC the management console (the shiny new one) didn’t complain about anything, but nothing worked.
First we found the time service wasn’t doing OK: w32tm needed a lot of manual config (and resetting thereof) before it finally synced its time. It didn’t take the domain time and it didn’t pull it in externally either. That made all Kerberos tickets involving the new DC worthless.
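Why a drifting clock kills Kerberos: by default, AD only tolerates a clock difference of 5 minutes between the parties. A minimal sketch of that rule, assuming you already have both timestamps from somewhere (the function name is made up, and fetching the times is left out on purpose):

```python
from datetime import datetime, timedelta

# Default Kerberos tolerance for clock differences in AD (5 minutes).
MAX_SKEW = timedelta(minutes=5)


def skew_ok(dc_time, reference_time, max_skew=MAX_SKEW):
    """True if the two clocks differ by no more than max_skew."""
    return abs(dc_time - reference_time) <= max_skew
```

If this returns False for the new DC against the domain time, there is little point in debugging logons before the time sync is fixed.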
Later we also noticed that netdom query fsmo suddenly hit an RPC error, and a lot of time went into trying to debug that. In fact, in one of the following links I found a reminder to work with dcdiag, which finally gave information that was actually useful to someone in IT. We had DNS issues (not critical) and a broken NTFRS. Basically I ran all the checks from the first link.
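This is not the checklist from that link, just a rough sketch of how one could pull the failures out of a saved dcdiag run instead of scrolling through pages of verbose output. It assumes English output and the usual "passed test" / "failed test" result lines; the regex and the function name are my own and may need adjusting for your version.

```python
import re

# Assumed shape of an English dcdiag result line, e.g.
#   "......................... DC01 failed test NetLogons"
# (DC01 is a made-up hostname.)
RESULT_LINE = re.compile(r"\.{3,}\s+(\S+)\s+(passed|failed) test (\S+)")


def failed_tests(dcdiag_output):
    """Return (target, test_name) pairs for every 'failed test' line."""
    failures = []
    for line in dcdiag_output.splitlines():
        match = RESULT_LINE.search(line)
        if match and match.group(2) == "failed":
            failures.append((match.group(1), match.group(3)))
    return failures
```

Feed it the text of a dcdiag run redirected to a file; an empty list means no result line reported a failure.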
Then I verified that, yes, indeed our netlogon shares were missing on the new DC and, in fact, it had never fully replicated. No surprise it wasn’t working. The next thread (German) turned up after I found the original issue and had the right KB links to fix it.
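The share check itself is trivial once you have the list of share names from the DC (for example copied from net share). A minimal sketch, with a made-up function name and the assumption that a healthy DC publishes both SYSVOL and NETLOGON:

```python
# Shares a working DC is expected to publish.
REQUIRED_SHARES = ("SYSVOL", "NETLOGON")


def missing_dc_shares(share_names):
    """Return the required DC shares absent from share_names (case-insensitive)."""
    present = {name.upper() for name in share_names}
    return [share for share in REQUIRED_SHARES if share not in present]
```

On our new DC this would have returned both names, which is exactly the "never fully replicated" state.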
So, what happened was that some dork at MS had used an MS JET DB driver for the rolling log of the file replication service. During a power outage, the JET DB wrote an invalid journal entry, and replication had been broken ever since.
What I had to do was re-initialize the replication, and after that everything was fine.
The KB entry for that is long, dangerous and something I hope you won’t have to do.
I hated that this KB actually has a few errors, e.g. they don’t even tell you when to start the service back up on the original DC. Since we only had *one* DC, it was also sometimes unclear whether I’d be doomed or fine.
In the end, it’s not that bad, GPOs are just files, which you can restore if needed. So even if you end up with a completely empty replication set, you can put your GPOs back in. And, from the infra side, all your GPOs are less important than this service running…
There are a lot of unclear warnings about the permissions of those files, so copying them in cmd might be OK. Otherwise you can also reset the perms later via one of the AD MMC things, so that’s actually not too horrible. I had no issue at all and *heh* they also hadn’t ever used a GPO.
Also, note that on the last goddamn page of the article they tell you how to make a temporary workaround.
Monitoring lessons for Windows
- Don’t skip the basic FSMO check Nagios had forever
- Have the Check_MK replication check running (it won’t catch this particular failure, though)
- Monitor the SYSVOL and NETLOGON shares via the matching check
- Monitor NTFRS EventID 13568, optionally 16508
- Set up LDAP checks against the global catalog
- Checks built on dcdiag (I’m afraid one has to ignore the event log test, since it shows historic entries). The command that finally got me rolling was dcdiag /v /c /d /e /s:
- Functional diagrams of AD for NagVis
- Pre-defined rulesets for AD monitoring in BI
I feel those could just be part of the base config, since there’s nothing site-specific about monitoring this. BI can do the AD monitoring strictly via autodetection, but NagVis is better for visual diagnosis.
With proper monitoring in place I wouldn’t have had to search for the issue at all…
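To make the NTFRS event item from the list a bit more concrete, here is a rough sketch of the decision logic for a Nagios-style check. The exit codes are the standard plugin ones; the severity mapping (13568 critical, 16508 warning) is my assumption, and collecting the event IDs from the event log is left to whatever agent you use.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes

# Event IDs taken from the list above; the severity mapping is an assumption.
SEVERITY = {13568: CRITICAL, 16508: WARNING}


def ntfrs_status(event_ids):
    """Map NTFRS event IDs seen recently to (exit_code, message)."""
    hits = sorted({eid for eid in event_ids if eid in SEVERITY})
    if not hits:
        return OK, "OK - no watched NTFRS events"
    code = max(SEVERITY[eid] for eid in hits)
    label = "CRITICAL" if code == CRITICAL else "WARNING"
    return code, "%s - NTFRS event(s) %s" % (label, ", ".join(map(str, hits)))
```

A wrapper script would print the message and exit with the code; unrelated event IDs are simply ignored.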
For my monitoring customers I’ll try to build this and include it in their configs. Others should basically demand the same thing.
- There’s always a way to fix things
- AD is a complex system, you can’t just run one DC. No matter if SBS exist(s|ed) or not, it’s just not sane to do. Do not run AD with just one DC. Be sane, be safer.
- Oh, and look at your Eventlogs. Heh.