FrOSCon Recap: Rudder (1)


In my talk I had mentioned that it’s very handy that Normation has its own C developers and maintains its own version of the Rudder agent.

Case in point: a critical bug in CFEngine that had simply already been fixed. And 30 minutes after my question, the patch was released, too…

15:29 < darkfader> amousset: i'm inclined to think we could also
use a backport of https://github.com/cfengine/core/pull/2643
15:30 < darkfader> unless someone tells me "oh no i tested with 
a few thousand clients for a few months and it doesn't affect us"😉
15:33 < amousset> darkfader: it has already been backported 
(see http://www.rudder-project.org/redmine/issues/8875)
15:34 < Helmsman> Bug #8875: Backport patch to fix connection cache 
( Pending release issue assigned to  Jonathan CLARKE. 
URL: https://www.rudder-project.org/redmine//issues/8875 )
15:37 < darkfader> amousset: heh, pending release 
15:38 < Matya> hey you just released today🙂
15:40 < amousset> yesterday actually🙂
16:07 < jooooooon> darkfader: it's released now😉


… because there is an actual release process!

FrOSCon Recap: hosting



The two German “we’re serious about being different” web hosters were both at FrOSCon.

By that I mean Hostsharing eG and UBERSPACE.

Hostsharing was founded in 2001 after the Strato meltdown; their current (or perhaps longstanding?) slogan is: Community Driven Webhosting. They are a cooperative, i.e. ideally you’re not just a customer, you also have a say. Naturally there’s still a hard core of people who keep the gears turning.

I’ve known UBERSPACE for what feels like 2-3 years. They do “Hosting on Asteroids”😉

There you can set your own price, and you get a flexible solution without all the bullshit the mass hosters build around theirs.

So that’s two generations. Still, the idea seems similar to me:

  • A pricing model oriented towards the best possible service.
  • Openness. Concepts are understandable; a user can see where their data will end up and can clearly tell that it is kept safe.
  • Serious server operation (i.e. multiple locations, off-site backups, actual reaction to problems, proactive management).
  • Only putting competent people on the job! That means many years of real Unix experience, not just a bit of PHP and panel clicking. Everyone involved is roughly on the same level.

In contrast there’s the mass market, where often enough you can read that customers are once again being punished for a provider’s security holes, that there were no backups, that no DR concept existed, that broken hardware keeps running (“Why does the RAM have ECC if you want me to get it replaced?”) and so on.

I didn’t get to talk to UBERSPACE because I didn’t even notice them being there. I did sit in a car with one of them, but I was catching up on sleep.🙂

To Hostsharing I mainly gave one piece of advice: put prices on your offerings.

Typical for idealists🙂

But not to be dismissed: their start was 15 years ago now, and they have simply delivered quality ever since.

I wish UBERSPACE the same, and I wish both of them that they can scale their concepts well enough to put the other providers on the market under pressure with quality.

 

Weblinks:

https://uberspace.de/tech – yes, the first link goes to the tech pages. In the end only tech, team and processes matter. At least if you don’t need a website construction kit.
https://www.hostsharing.net/ – the info page

https://www.hostsharing.net/events/hostsharing-veroeffentlicht-python-implementierung-des-api-zur-serveradministration – the admin API they hadn’t even told me about🙂

FrOSCon Recap: coreboot


 

Coreboot

I was really happy that someone from coreboot was there. The booth was organized by the BSI; beyond that, they also help the project with test infrastructure.

I like that – it’s close to what I consider the “important” part of the BSI’s work.

The coreboot developer showed me their new website (looks good, but isn’t online yet), and I also got to refresh my knowledge, which was many years out of date.

At Intel alone, 25 people are contributing to coreboot!

Native hardware support keeps growing (from 0 to 2 boards or so, but hey!)

There’s a pretty great Chromebook from HP that I should get.

The payloads are becoming more diverse – I said that I simply hate SeaBIOS and would love a Phoenix clone. He dug a little deeper and made me realize that what I actually want is a much better PXE. (See VMware and VirtualBox; do not see KVM or PyPXEboot!)

And he had a recommendation: Petitboot – an SSH-able PXE loader that can do just about everything we admins or QA engineers could wish for.

https://secure.raptorengineering.com/content/kb/1.html

It’s definitely on my test list.

 

What else the BSI booth had: GPG, OpenPGP and OpenVAS.

OpenVAS would have interested me too, but I think one complete conversation is better than two half ones.🙂

This is patching


 

  • There’s an OpenSSL issue
  • Fetch and rebuild the stable ports tree (2016Q3)
  • Find it won’t build php70 and mod_php70 anymore
  • Try to compare to your -current ports tree
  • Find there’s a PHP security issue in the version there, but not in the one you had
  • Wait till it’s fixed so you can build
  • Type portsnap, then just to be safe first do a full world update to make sure your portsnap isn’t having security issues any more.
  • Updated portsnap has a metadata corruption
  • Remove your portsnap files, try again then just think “whatever” and fetch the ports from the ftp mirror and re-extract manually
  • Notice you just fetched an unsigned file via FTP and will use it to build, of all things, your OpenSSL.
  • Rant about that.
  • Find you can’t build because it can’t fetch
  • Debug the reason it can’t fetch
  • Find it’s a bug in the ports tree from the fix of the above security issue
  • Make a mental note that no one seems to react within 1-2 days if the stable tree is broken
  • While searching for existing bugs, find a PR about pkg audit that tries to redefine the functionality in order to not fix an output regression
  • Open a bug report for the PHP bug, adjust your local file
  • Fetch your new package list
  • Do a pkg audit, find it reports not too much.
  • Do a pkg audit -F, find it gives an SSL error
  • Find the http://www.vuxml.org certificate expired 2 months ago.
  • Wonder how no one even reacted to that
  • Find that SSLlabs somehow can’t even properly process the site anymore.
  • Find out that SSLlabs is actually dead just now.
  • Notice in the last lines it had managed to print that the actual hostname points at an rDNS-less v6 address and v4 CNAMEs to a random FreeBSD.org test system.
  • Most likely the vuxml.org webserver ain’t heavily protected in that case, huh?
  • Give up and use http like the goddamn default
  • pkg randomly wants an update of itself -> so update pkg first
  • In the end, just no-downtime slam all the new packages over the server because you’re sick of it all.
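For contrast, here is the happy path this whole dance was supposed to be – a minimal sketch of a routine FreeBSD update run. It is guarded with command checks so it no-ops on systems without portsnap/pkg; treat it as an illustration, not a hardened script.

```shell
#!/bin/sh
# What the update run *should* look like on FreeBSD when nothing is broken.
# Guarded with command -v so this sketch is a harmless no-op elsewhere.
set -e
if command -v portsnap >/dev/null 2>&1; then
    portsnap fetch update            # pull a signed ports tree snapshot
fi
if command -v pkg >/dev/null 2>&1; then
    pkg update                       # refresh the package catalogue
    pkg upgrade -y                   # apply the fixed packages
    pkg audit -F || true             # fetch vuln DB; non-zero exit if anything is still affected
fi
echo done > /tmp/patch_sketch.ok     # marker so the sketch is verifiable
```

On the day described above, nearly every one of these steps failed in its own special way.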

 

 

The next person who posts about “admins need to apply the fixes” I’ll just kick in the face. With my beer.

Diving into sysadmin-legalese


I’ve had the “fun” of writing outage notices like the current Google statement at times. The wording is interesting. IT systems fail, and once it’s time to write about it, management will make sure they are well hidden. The notice will be written by some poor soul who just wants to go home to his family instead of being the bearer of bad news. If he writes something that is not correct, management will be back just in time for the crucifixion.

Things go wrong, we lose data, we lose money. It’s just something that happens.

But squeezing both the uncertainties and hard truths of IT into public statements is an art of its own.

Guess what? I’ve been in popcorn mood – Here are multiple potential translations for this status notice:

“The issue with Compute Engine network connectivity should have been resolved for nearly all instances. For the remaining few remaining instances we are working directly with the affected customers. No further updates will be posted, but we will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.”


should have been resolved:

we cannot take responsibility. it might come back even just now! we’re not able to safely confirm anything😦

but we spoke to the guy’s boss and they promised he’ll not do that thing again.

for nearly all instances:

the error has been contained, but we haven’t been able to fix it? we’ve been unable to undo the damage where it happened. we were lucky in most cases, but not in all.

the remaining few remaining instances

the remaining remaining… they aren’t going away! Still remaining!

these instances that are left we’ll not be able to fix like the majority where our fix worked.

Please, just let me go to bed and someone else do those? I can’t even recall how many 1000s we just fixed and it won’t stop.

working directly with the affected customers:

only if you’re affected and we certainly know we need to come out, we will involve you. we can’t ascertain if there’s damage for you. we need to run more checks and will tell you when / how long. you get to pick between restores and giving us time to debug/fix. (with a network related issue, that is unlikely)

no further updates will be posted

since it is safely contained, but we don’t understand it, we can’t make any statement that doesn’t have a high chance of being wrong

we only post here for issues currently affecting a larger than X number of users. Now that we fixed it for most, the threshold is not reached

we aren’t allowed to speak about this – we don’t need to be this open once we know the potential liabilities are under a threshold.

will conduct an internal investigation

we are unwilling to discuss any details until we completely know what happened

appropriate improvements  

we are really afraid now to change the wrong thing.

it will be the right button next time, it has to!

management has signed off the new switch at last!

to prevent or minimize future recurrence

we’re not even sure we can avoid this. right now, we would expect this to happen again. and again. and again. recurring, you know? if at least we’d get a periodic downtime to reboot shit before this happens, but we’re the cloud and no one gives us downtimes!

and please keep in mind minimized recurrence means a state like now, so only a few affected instances, which seems under the threshold where we notify on the tracker here.

we really hope it won’t re-occur so we don’t have to write another of those.


 

Don’t take this too seriously, but these are some of the alarms that go off in my head when I see such phrasing.

Have a great day and better backups!

First look at the UP-board


I’ve finally got two UP boards. Once they arrived I also ordered another “Mean Well” dedicated 12V rail-mount PSU, and some 2.1mm power cables.

The boards are nice little things with a lot of CPU power. Quad Atom with some cache, eMMC and enough RAM!

Photos, dmesg, hwinfo etc can be found here:

http://up.home.xenhosts.de/

The basics:

My models have 2GB of ram which is shared with the onboard graphics.

They have a big front and back cooling plate; for hardcore usage there’s also an active fan in their shop.

Connectors: USB 2.0, 3.0, 3.0 OTG. The latter is a MacBook Air “style” Type-C flat connector. There’s also power (via a 2.1mm plug), HDMI and some other stuff I didn’t understand.

There’s one connector with an industrial-style plug. This port exposes 2x USB and a serial with BIOS forwarding. You should give in and just buy the matching cable; there’s no way you’ll easily find the plug on your own.

You’ll need this cable unless you only plan on desktop use. It doesn’t come with an FTDI serial, so also make sure to get one of those.

The MMC reads at up to 141MB/s (pretty nice) and writes (fdatasync) up to 61MB/s (also pretty OK). TRIM does work.
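Numbers like these can be reproduced with plain dd runs. A rough sketch (the test file path is a placeholder; point it at the filesystem you want to measure):

```shell
#!/bin/sh
# Crude eMMC write-speed check; conv=fdatasync forces a flush to the device
# before dd reports the rate, so the number isn't just page-cache speed.
TESTFILE=/tmp/ddtest.bin    # placeholder: put this on the filesystem under test
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 conv=fdatasync
# Read-back check (note: served from cache unless you drop caches first)
dd if="$TESTFILE" of=/dev/null bs=1M
rm -f "$TESTFILE"
```

For honest read numbers, drop the page cache (or reboot) between the write and the read.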

The LAN interface is just a Realtek, connected via PCIe (2.5GT/s, x1).

BIOS stuff

On boot you’re greeted by a normal EFI shell, reminded me of my late HP-UX days, except here there is no SAN boot scan.

Pressing F7 gives you a boot menu, which also always allows going to BIOS setup, a normal Phoenix-style menu. Very small and simple – that’s nice.

Serial forwarding is supported, I didn’t try netbooting yet.

OS (ubilinux)

I installed their “default” distro, which is done by flashing the ISO to a stick (or putting it on a CD). Take care to use a USB 2.0 port if it’s a USB 3 stick, or it won’t be detected(!)

The grub menu was really slow, while the BIOS had been quick.

Limiting the video RAM to 64MB plus a UHD screen got me a system that stopped working once X was up. I didn’t investigate that; instead I booted to single-user mode and told systemd to make that the default (systemctl set-default multi-user.target).

Ubilinux is a Debian Jessie (sigh) but with some parts scrapped from Ubuntu (sigh).

It works and has all the stuff to e.g. access the GPIO connectors.

lm_sensors detected the coretemp CPU sensors, nothing else.

AES-NI was autoloaded.

The only thing I couldn’t make work yet was the hardware watchdog, which is an issue split between systemd, packaging and probably something else.

This one gets a 9/10 which is rare🙂

Active Directory burns


Hi,

 

just to leave the most important things from a day last week where things were special.

We set out to extend / upgrade a forgotten 2003 domain to 2012, with that it’d also finally get a second DC.

So we expected those steps:

  • Put domain in 2003 mode🙂
  • Add 2012 DC, if not happy, do forest prep or so
  • Switch all FSMO
  • Shut down 2003 DC, test if NAS shares are accessible
  • Restart it
  • ready a 2012 VM as secondary DC
  • dcpromo the new VM
  • demote the old DC
  • clean up DNS/LDAP should we find any leftovers

It didn’t go anywhere near as smoothly.

First we hit a lot of DNS issues, because not all clients used DHCP and so they’d end up looking for a DNS server that wasn’t there.

Once that was fixed we found stuff still wasn’t OK at all (No logon servers available). On the new DC the management console (the shiny new one) didn’t complain about anything, but nothing worked.

First we found the time service wasn’t doing OK: w32tm needed a lot of manual config (and resetting thereof) to finally sync its time; it took neither the domain time nor an external source. That made all Kerberos tickets involving the new DC worthless.
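The manual w32tm dance looked roughly like this (standard Windows commands; the exact flags depend on your topology, so treat this as a sketch):

```batch
REM Point the DC at the domain hierarchy for time (on the PDC emulator you would
REM instead use /syncfromflags:manual with a /manualpeerlist of external servers)
w32tm /config /syncfromflags:domhier /update

REM Restart the time service so the new config takes effect
net stop w32time
net start w32time

REM Force a resync and verify
w32tm /resync /rediscover
w32tm /query /status
```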

 

Later we also noticed that netdom query fsmo suddenly hit an RPC error. A lot of time went into debugging that RPC error. In fact, one of the following links reminded me to work with dcdiag, which finally gave information that was actually useful to someone in IT. We had DNS issues (not critical) and a broken NTFRS. Basically I ran all the checks from the first link:

When good Domain Controllers go bad!

Then I verified that, yes, indeed our netlogon shares were missing on the new DC and, in fact, it had never fully replicated. No surprise it wasn’t working. The next thread (German) turned up after I found the original issue and had the right KB links to fix it.

http://www.mcseboard.de/topic/115522-fehler-13508-kein-sysvol-und-probleme-mit-dateireplikationsdienst/

So, what happened was that some dork at MS had used a MS JET DB driver for the rolling log of the file replication service. During a power outage, the JET DB wrote an invalid journal entry. It had been broken since then.

What I had to do was re-initialize the replication, and then everything fixed itself just fine.

The KB entry for that is long, dangerous and something I hope you won’t have to do.

http://support.microsoft.com/kb/315457/en-us

I hated that this KB actually has a few errors, e.g. they don’t even tell you when to start the service back up on the original DC. Since we only had *one*, it was also sometimes unclear whether I’d be doomed or fine.

In the end, it’s not that bad, GPOs are just files, which you can restore if needed. So even if you end up with a completely empty replication set, you can put your GPOs back in. And, from the infra side, all your GPOs are less important than this service running…

There are a lot of unclear warnings about the permissions of those files, so copying in cmd might be OK. Otherwise you can also reset the perms later via one of the AD MMC tools, so that’s not too horrible. I had no issue at all and *heh* they also hadn’t ever used a GPO.

Also, note that on the last goddamn page of the article they tell you how to make a temporary workaround.

 

Monitoring lessons for Windows

  • Don’t skip the basic FSMO check Nagios had forever
  • Have the Check_MK replication check running (it’ll not see this)
  • Monitor the SYSVOL and NETLOGON shares via the matching check
  • Monitor NTFRS EventID 13568, optionally 16508
  • Set up LDAP checks against the global catalog

Monitoring gaps:

  • Checks built on dcdiag (afraid one has to ignore the event log since it shows historic entries). The command that finally got me rolling was dcdiag /v /c /d /e /s:
  • Functional diagrams of AD for NagVis
  • Pre-defined rulesets for AD monitoring in BI

I feel those could just be part of the base config; there’s nothing site-specific about them. BI can do the AD monitoring strictly via autodetection, but NagVis is better for visual diagnosis.

With proper monitoring in place I’d not have had to search for the issue at all…

For my monitoring customers I’ll try to build this and include it in their configs. Others should basically demand the same thing.

 

Furthermore:

  • There’s always a way to fix things
  • AD is a complex system, you can’t just run one DC. No matter if SBS exist(s|ed) or not, it’s just not sane to do. Do not run AD with just one DC. Be sane, be safer.
  • Oh, and look at your Eventlogs. Heh.

Ceph Training Overview


Someone on #ceph asked about training in Europe / major cities there.

So what I did is I googled the s*** out of “Ceph Training”…

I mean, I did a little browse of who’s currently offering any, partly as research (“do I wanna do that again?”) and also because I think alternatives are good.

 

Here’s the offerings I found, along with some comments.

All of them cover the basics pretty well now, meaning you get to look at CRUSH, make sure you understand PGs reasonably well, and most will let you do maintenance tasks like OSD add / remove…

I didn’t only look at classes in German, but it seems interest in Ceph in Germany is just pretty high. Rightfully so🙂

Ceph Trainings


CEPH – Cluster- und Dateisysteme

(obviously a German-speaking class)

https://www.medienreich.de/training/ceph-cluster-und-dateisysteme

They have lots of references for this course, and clearly cover CephFS. They offer a flexible duration, so you can choose how deeply some topics are covered. They handle common mistakes, which is very nice for the trainees.

One add-on they do is re-exporting, i.e. HA NFS or VMs in RADOS etc. Surely helpful, but make sure you either cut that part very short (like 2 hours) to only get the overview, or stretch it to 2 days of its own. Clustering isn’t a part of your toolkit, it’s a new toolkit you learn. If you cut it short, you end up worse off. And make no mistake, Ceph(FS) will be your new SPOF, so make sure you get in very close touch with it.

One thing I’d also recommend: don’t do the class with just a 3-node setup if you take a longer one. 3 nodes is really nice for your first days with Ceph, but the reliability and performance are completely unrelated to what you see in a larger setup.

hastexo – “Get in touch”

https://www.hastexo.com/services/training/

I found some past feedback from one of their classes, it seems to be very good.

Also keep in mind they’re among the really long-time Ceph users; they’ve hung out in #ceph at least as long as I have, and that is, what, 6 years now?

Unlike pure trainers, or people who run around bragging about Ceph but aren’t even in the community, hastexo has also spent years delivering Ceph setups large and small.

The only grudge I have with them is that they recommended consumer Samsung SSDs in a Ceph intro for iX magazine. That wasn’t funny; I met people who thought that was buying advice for something serious. Ignoring that any power outage could potentially fizzle all the journal SSDs ain’t something you do. But the author probably just tried to be nice and save people some money in their labs.

Anyway, due to their large number of installations, hastexo is the very best choice if your company is likely to have a few special requirements; let’s say you’re a company that might test with 8 boxes but later run 500+, and you want real-world experience and advice for scalability, even in your special context.

Afaik they’re in Germany, but they’re real IT people; as far as I can tell, any of them speaks perfectly fluent English🙂

Seminar Ceph

http://www.seminar-experts.de/seminare/ceph/

This is just a company re-selling trainings someone else is doing.

The trainer seems to have a good concept though, adding in benchmarking and spending a lot of time on the pros/cons of fuse vs kernel for different tasks.

This is the course you should take if you’ll be “the ceph guy” in your company and need to fight and win on your own.

Nothing fuzzy, no OpenStack or Samba “addons”. Instead you learn about Ceph to the max. I love that.

Price isn’t low even for 4 days, but I see the value in this, and in-house training generally ain’t cheap.

There’s also a “streaming” option which comes out cheaper, but a Ceph class without a lab is mostly useless. The page also doesn’t say anything about the trainer, so no idea whether he’d teach in a language other than German.

Red Hat Ceph Storage Architecture and Administration

http://www.flane.de/en/course/redhat-ceph125

Seriously, no. This is all about OpenStack. You can take this course if you have some extra time to learn Ceph in-depth or if you’re the OpenStack admin and do some Ceph on the side, and aren’t the only guy around.

Can also be partially interesting if you have other ideas for using the Rados Gateway.

 

Merrymack Ceph Training

http://www.ceph-training.com/index.html

A web-based / video-based training. Price-wise this beats them all if you just have 1-2 attendees and no prior knowledge.

Probably a very good refresher if your Ceph knowledge is dated, or if you want to learn at your own pace. That way you can spend a lot more lab time, which is rather nice.

If you have a few people on the team the price goes up and you should really negotiate a price.

Personally I’d prefer something with a trainer who looks at your test and tells you “try it like this and it’ll work”, but $499 is hard to argue with if you’ve got some spare time for the lab chores.

I think this is the actual launch info of the course:

https://www.linkedin.com/pulse/i-just-launched-on-demand-ceph-training-course-donald-talton

 

No longer available

Ceph 2-day workshop at Heise / iX magazine.

It was a bit expensive for 2 days with up to 15 people.

http://www.heise.de/ix/meldung/iX-Workshop-zum-Dateisystem-Ceph-2466563.html

Nice as a get-to-know thing; I would not recommend it as the only training before going into a prod deployment.

 

MK Linux Storage & LVM

That’s the original first Ceph training, the one I used to do🙂

Ceph was done on the final day of the class, because back then you wouldn’t find enough people to come around just for a Ceph training😉

But it’s no longer offered by them. As far as I know the interest was always a little too low, since this hardcore storage stuff seems to have a different audience than the generic Linux/Bash/Puppet classes do.

 

Summary

Which one would I recommend?

“Seminar Ceph” from that reseller is for storage admins who need to know their Ceph cluster as well as a seasoned SAN admin knows their VMAX etc. Also the best choice for people at IT shops who need to support Ceph in their customer base. You’ll be better off really understanding all parts of the storage layer; you might get your life sued away if you lose some data.

Go to hastexo if you generally know about Ceph, you’ve already read the Ceph paper and some more current docs, and your team is strong enough to basically set it up on its own at scale (so not “we can install that on 5 servers with Ansible”, but “we’ve brought up new applications sized at 100s of servers often enough, thank you”). You’d be able to strengthen some areas with them and benefit from their implementation experience.

Take the online Ceph Training if you want something quick and cheap and are super eager to tinker around and learn all the details. You’ll end up at the same level as with the pro training but need more time to get there.

Myself?

I still have no idea if I should do another training. I looked at all their outlines and they looked OK. Some more CRUSH rebuilds to flex the fingers, and add-/remove-/admin-socketify all the things🙂 So, that’s doable with a week of prep and slides.

Training is a lot more fun than anything else I do, too.

But, to be honest, the other stuff isn’t finished and is also pretty cool, with 1000s of servers and so on.

With the next iteration of my website (www.florianheigl.me) I’ll be adding classes and a schedule.

Some upgrades are special


 

No Xen kernel

Yeah, never forget when upgrading from older Alpine Linux that Xen itself moved into the xen-*-hypervisor package. I didn’t catch that on update, and so I had no hypervisor on the system any more.

Xen 4.6.0 crashes on boot

My experience: No you don’t need to compile a debug Xen kernel + Toolstack. No you don’t need a serial console. No you don’t need to attach it.

You need Google, to search for whatever random regression you hit.

In this case, if you set dom0_mem, it will crash instead of putting the memory in the unused pool: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=810070

4.6.1 fixes that, but isn’t in Alpine Linux stable so far.

So what I did was enable autoballoon in /etc/xen/xl.conf. That’s one of the worst things you can do, ever. It slows down VM startup, has NO benefit at all, and as far as I know also increases memory fragmentation to the max. Which is lovely, especially considering Xen doesn’t detect this NUMA machine as one, thanks to IBM’s chipset “magic”.
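Concretely, the workaround is a single switch in /etc/xen/xl.conf:

```
# /etc/xen/xl.conf
# Workaround for the 4.6.0 dom0_mem crash: let xl balloon dom0 down on each
# VM start, instead of reserving memory via dom0_mem on the hypervisor
# command line. Slow and fragmenting, but it boots.
autoballoon=1
```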

CPU affinity settings got botched

I had used a combination of the vCPU pinning / scheduling settings to make sure power management works, all while dedicating 2 cores to dom0 for good performance. Normally with dom0 VCPU pinning you have a problem:

dom0 is being a good citizen, only using the first two cores. But all other VMs may also use those, breaking some of the benefits…

So what you’d do was have settings like this

 

memory = 8192
maxmem = 16384
name   = "xen49.xenvms.de"
vcpus  = 4
cpus = [ "^0,^1" ]

That tells Xen this VM can have 4 virtual CPUs, but they’ll never be scheduled on the first two cores (the dom0 ones).

Yeah except in Xen 4.6 no VM can boot like that.
The good news is a non-array’ed syntax works:

cpus = "^0,^1,2-7"
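Putting it together, the same example VM config in the form that actually boots on 4.6:

```
memory = 8192
maxmem = 16384
name   = "xen49.xenvms.de"
vcpus  = 4
# string form instead of the list: keep off dom0's cores 0-1, allow 2-7
cpus   = "^0,^1,2-7"
```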

 

IBM RSA2

IBM’s certificates for this old clunker are expired. Solution to access?

Use Windows.

Oh, and if the server is sitting in the BIOS screens from configuring the RSA module, it’ll NEVER display anything, even if you reset the ASM. You need to reset the server once more; otherwise you get a white screen. The recommended fix is to reinstall the firmware, which also gets you that reboot.

 

Alpine Linux

My network config also stopped working. That’s the one part I’d like to change in Alpine – not using that challenged “interfaces” file from Debian, but instead something that is friendlier to working with VLANs, bridges and tunnels.

If you’re used to it day-to-day it might “look” just fine but that’s because you don’t go out and compare to something else.

Bringing up a bridge interface was broken, because one sysctl knob was no longer supported. So it tried to turn off ebtables there, that didn’t work, and so, hey, what should we do? Why not just stop bringing up any of the configured interfaces and completely mess up the IP stack?

I mean, what else would make sense to do than the WORST possible outcome?

If this were a cluster with a lost quorum I’d even agree. But you can bet the Linux kids will run that into a split brain with pride.

 

I’ll paste the actual config once I find out how to take WordPress out of idiot mode and edit HTML. Since, of course, pasting into a <PRE> section is utterly fucked up.

 

I removed my bonding config from this to be able to debug more easily. But the key points were removing the echo calls and making the pre-up / post-up parts more reliable.

auto lo
iface lo inet loopback

auto br0
iface br0 inet static
    pre-up brctl addbr br0
    pre-up ip link set dev eth0 up
    post-up brctl addif br0 eth0
    address your_local_ip
    netmask subnet_mask
    broadcast subnet_bcast
    gateway your_gw
    hostname waxh0012
    post-down brctl delif br0 eth0
    post-down brctl delbr br0
# 1 gbit to the outside
auto eth0             
iface eth0 inet manual      
    up ip link set $IFACE up    
    down ip link set $IFACE down

Xen VMs not booting

This is an annoying thing with Alpine only. Some VMs just require you to press Enter, being stuck at the grub menu. It’s something with grub parsing the extlinux.conf that doesn’t work, most likely the “timeout” or “default” lines.
And of course the idiotic (hd0) default vs (hd0,0) from the grub-compat wrapper.
I think you can’t possibly shout too loudly at any of the involved people, since this all goes into the “why care?” class of issues.
(“Oh, just use PV-Grub” … except that has another bunch of issues…)
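For context, a minimal extlinux.conf of the kind the grub-compat wrapper chokes on – the values here are made-up examples, not Alpine’s exact defaults:

```
# /boot/extlinux.conf (example values)
DEFAULT grsec            # one of the lines grub's parser reportedly trips over
TIMEOUT 30               # extlinux counts in tenths of a second, so 3s
LABEL grsec
    KERNEL /boot/vmlinuz-grsec
    INITRD /boot/initramfs-grsec
    APPEND root=/dev/xvda1 modules=ext4 console=hvc0
```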

Normally I don’t want to bother reporting all the broken stuff I run into any more. It’s gotten to be just too much; basically I would spend 2/3 of my day on that. But since this concerns a super-recent release of Alpine & Xen (and even some Debian), I figured I’d save people some of the PITA I spent my last hours on.
When able, I dump them into my Confluence at this URL:
Adminspace – Fixlets

I also try really hard to not rant there🙂

Update:
Nathanael Copa reached out to me and let me know that the newer bridge packages on Alpine include a lot more helper scripts. That way the icky settings from the config would not have been needed any more.
Another thing one can do is:

post-up my-command || echo "didnt work"

you should totally not … need to do that, but it helps.

Time for 2016


Hi everyone.

 

I just thought some post was in order after having gone dark for quite a long time.

I’d been home sick for almost a month. First I snapped my back very badly, and then I caught a strong flu. This completely ruined the month I had meant to spend posting things here. Lying on your back is BORING, and painkillers make you too numb to do anything useful.

In between I’ve also done a few fun projects and been to OpenNebulaConf (Oct), the chaos communication camp (Dec) and the config management camp (Feb), and each time I came home with some nice ideas to throw around.

To be honest though, the highlight of the last months was watching Deadpool.

If you can handle some completely immature humor and the good old ultra-violence, go watch it.

 

For this year, there will be EuroBSDCon and OpenNebulaConf yet again.

One great thing about OpenNebula is the extremely friendly community. Compared to *any* other conf I’ve been to, the rest are all pretty darn hostile and bro-ish. OpenNebula is such a nice community in comparison, and I really hope the others start trying to match that at some point.