As kids we would wonder, but we didn’t know 2016


As kids we would wonder why no one treated us seriously, like grown-ups, like someone capable of reasoning.

But it’s not about that.

It’s about letting someone keep a glimpse of trust: a few years to look back on in which they didn’t yet watch wars start, unable to do anything to stop the escalation.

Families getting wiped out as a necessity (to someone, after all) on a mountain roadside because they happened to watch a spy’s assassination.

Prisons that have been given up to the inmates and just patrolled from the outside.

Watching your favourite artists die.

Hearing that a friend committed suicide.

Actually, people getting so deeply hopeless that they willingly crash their own airplane, wiping out whole school classes.

Undercover investigators who were the actual enablers of the Madrid subway bombing. Knowing how they’ll also forever be lost in their guilt, not making anything better.

Seeing how Obama’s goodbye gets drowned out by humanity wondering whether Trump had women pissing on <whatever> for money or not.

Then asking yourself why that would even matter: he’s definitely a BAD PERSON, so who cares what kind of sex he’s into? Why can someone’s private details pull attention away from the plain fact that he’s absolutely not GOOD?

Watching a favourite place be torn down for steel-and-glass offices.

Understanding what a burnt down museum means.

Life’s inevitable bits: being confronted with them only works if you’ve had a long peaceful period in your life first.

And that, that’s what you really just shouldn’t see, or rather understand, too early.

After all, there’s an age where we all tried to get toothpaste back into the tube, just because we’d not believe it just doesn’t work that way.

 

Sorry for this seemingly moody post; it’s really been cooking since that 2012 murder case. Today

 

 

On the pro side, there’s movies, Wong Kar-Wai and so many more. There’s art, and the good news is that we can always add more art, and work more towards the world not being a shithole for the generations after us.

But, seriously, you won’t be able to do much good if you look at the burning mess right from the start.

happy about VM compression on ZFS


Over the course of the weekend I’ve switched one of my VM hosts to the newer recommended layout for NodeWeaver – this uses ZFS, with compression enabled.

First, let me admit some things you might find amusing:

  • I found I had forgotten to add back one SSD after my disk+l2arc experiment
  • I found I had one of the two nodes plugged into its 1ge ports, instead of using the 10ge ones.

The switchover looked like this:

  1. Pick a storage volume to convert, write down the actual block device and the mountpoint
  2. Tell LizardFS I’m going to disable it (prefix it with a * and kill -1 the chunkserver; see the command sketch after this list)
  3. Wait a bit
  4. Tell LizardFS to forget about it (prefix with a # and kill -1 the chunkserver)
  5. umount
  6. ssh into the setup menu
  7. Select ‘local storage’ and pick the now unused disk, assign it to be a ZFS volume
  8. Quit the menu after successful setup of the disk
  9. kill -1 the chunkserver to enable it
  10. It’ll be visible in the dashboard again, and you’ll also see it’s a ZFS mount.
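For steps 2 to 4, the chunkserver side is just its disk list plus a SIGHUP. Here is a minimal sketch, assuming the LizardFS defaults (disk list in /etc/mfs/mfshdd.cfg, a process named mfschunkserver) and an illustrative mountpoint of /srv/zstore3; NodeWeaver may keep these in other places:

# step 2: mark the volume for removal by prefixing its line with '*'
sed -i 's|^/srv/zstore3$|*/srv/zstore3|' /etc/mfs/mfshdd.cfg
kill -1 $(pgrep -o mfschunkserver)    # SIGHUP makes the chunkserver reload its disk list
# step 3: wait until the dashboard shows the chunks replicated away
# step 4: tell LizardFS to forget the disk by commenting the line out
sed -i 's|^\*/srv/zstore3$|#/srv/zstore3|' /etc/mfs/mfshdd.cfg
kill -1 $(pgrep -o mfschunkserver)
umount /srv/zstore3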

Compression was automatically enabled (lz4).

I’ve so far only looked at the re-replication speed and disk usage.

Running on 1ge I only got around 117MB/s (one of the nodes is on LACP and the switch can’t do dst+tcp port hashing, so you end up in one channel).
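For completeness: the hash policy on the Linux bonding side can be checked and changed as below, but the switch has to hash the same way, so this alone wouldn’t have helped here. A hedged sketch, assuming the bond is called bond0:

grep -i "hash policy" /proc/net/bonding/bond0
# layer3+4 hashes on IP plus TCP/UDP port; depending on kernel/driver you
# may need to take the bond down before changing the policy
echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy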

Running on 10ge I saw replication network traffic go up to 370MB/s.

Disk IO was lower since the compression had already kicked in, and the savings have been vast.

[root@node02 ~]# zfs get all | grep -w compressratio
srv0  compressratio         1.38x                  -
srv1  compressratio         1.51x                  -
srv2  compressratio         1.76x                  -
srv3  compressratio         1.53x                  -
srv4  compressratio         1.57x                  -
srv5  compressratio         1.48x                  -

I’m pretty sure ZFS also re-sparsified all sparse files; the net usage on some of the storages went down from 670GB to around 150GB.
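If you want to separate the compression win from the re-sparsifying, comparing allocated vs. apparent file sizes shows the holes, while the ZFS properties show the compression. A rough sketch, using srv2 from the output above and assuming it is mounted at /srv2:

# a large gap between these two means sparse files (holes)
du -sh /srv2
du -sh --apparent-size /srv2
# the gap between logicalused and used is what lz4 saves
zfs get used,logicalused,compressratio srv2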

 

I’ll probably share screenshots once the other node is also fully converted and rebalancing has happened.

Another thing I told a few people was that I ripped out the 6TB HDDs once I found that $customer’s OpenStack cloud performed slightly better than my home setup.

Consider that solved. 😉

vfile:~$ dd if=/dev/zero of=blah bs=1024k count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.47383 s, 434 MB/s

(this is a network-replicated write…)

Make sure to enable the noop disk scheduler before you do that.

Alternatively, if there are multiple applications on the server (e.g. a container-hosting VM), use the deadline disk scheduler with the nomerges option set. That matters. Seriously 🙂
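On Linux guests these are the usual sysfs knobs; a rough sketch, assuming the virtual disk shows up as /dev/vda (make it persistent via a udev rule or the kernel command line):

# single-purpose VM: let the host's storage layer do the reordering
echo noop > /sys/block/vda/queue/scheduler
# container-hosting VM: deadline, and disable merges
echo deadline > /sys/block/vda/queue/scheduler
echo 2 > /sys/block/vda/queue/nomerges    # 2 = no merge attempts at all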

 

Happy hacking and goodbye

FrOSCon Follow-up: Rudder (1)


In the talk I had mentioned that it is very handy that Normation has its own C developers and maintains its own version of the Rudder agent.

Case in point: a critical bug in CFEngine that had simply already been fixed. And 30 minutes after my question the patch was released, too…

15:29 < darkfader> amousset: i'm inclined to think we could also
use a backport of https://github.com/cfengine/core/pull/2643
15:30 < darkfader> unless someone tells me "oh no i tested with 
a few thousand clients for a few months and it doesn't affect us" 😉
15:33 < amousset> darkfader: it has already been backported 
(see http://www.rudder-project.org/redmine/issues/8875)
15:34 < Helmsman> Bug #8875: Backport patch to fix connection cache 
( Pending release issue assigned to  Jonathan CLARKE. 
URL: https://www.rudder-project.org/redmine//issues/8875 )
15:37 < darkfader> amousset: heh, pending release 
15:38 < Matya> hey you just released today 🙂
15:40 < amousset> yesterday actually 🙂
16:07 < jooooooon> darkfader: it's released now 😉

 

 

… because there is an actual release process!

FrOSCon Follow-up: hosting


 

 

The two German “we’re serious about being different” web hosters were both at FrOSCon.

By that I mean hostsharing eG and UBERSPACE.

hostsharing was founded in 2001 after the Strato meltdown; their current (or has it always been this?) slogan is: Community Driven Webhosting. They are a cooperative, i.e. ideally you are not just a customer, you also have a say. Of course there is still a hard core that keeps the gears turning.

I’ve known UBERSPACE for what feels like 2-3 years. They do “Hosting on Asteroids” 😉

Here you can help set your own price, and you get a flexible solution without all the bullshit the mass hosters build around theirs.

So these are two generations. Still, the idea seems similar to me:

  • A pricing model oriented towards the best possible service.
  • Openness. Concepts are understandable; a user can see where their data will end up and can clearly tell that it is in good hands.
  • Serious server operations (i.e. multiple locations, off-site backups, reacting when there are problems, proactive management).
  • Only putting competent people on the job! That means many years of real Unix experience, not just a bit of PHP and panel clicking. Everyone involved is roughly on the same level.

In contrast, there’s the mass market, where you can read often enough that customers are once again being punished for their provider’s security holes, that there were no backups, no DR concept, that broken hardware is kept in service (“why does the RAM have ECC if I’m then supposed to have it replaced?”) and so on.

I didn’t get to talk to UBERSPACE because I didn’t even notice they were there. I did sit in a car with one of them, but I was catching up on sleep. 🙂

To hostsharing I mainly gave one piece of advice: put prices on it.

Typical for idealists 🙂

But not to be dismissed: their start is now 15 years back, and since then they have simply delivered quality.

I wish UBERSPACE the same, and both of them that they can scale their concepts well enough to put the other providers on the market under pressure through quality.

 

Links:

https://uberspace.de/tech – yes, the first link goes to the tech page. In the end only tech, team and processes matter. At least if you don’t need a website construction kit.
https://www.hostsharing.net/ – the info page

https://www.hostsharing.net/events/hostsharing-veroeffentlicht-python-implementierung-des-api-zur-serveradministration – the admin API they hadn’t even told me about 🙂

FrOSCon Follow-up: coreboot


 

Coreboot

I was very happy that someone from coreboot was there. The booth was organized by the BSI; beyond that, they also help the project with test infrastructure.

I think that’s good; it’s close to what I consider the “important” work of the BSI.

The coreboot developer showed me their new website (looks good, but isn’t online yet), and I also got to refresh my many-years-old state of knowledge.

At Intel alone there are 25 people contributing to coreboot!

Natively supported hardware keeps growing (from 0 to 2 or so, but hey!)

There is a rather great Chromebook from HP that I should get.

The payloads are becoming more diverse – I had said that I simply hate SeaBIOS and would love a Phoenix clone. He then dug deeper and made me realize that what I actually want is a much better PXE. (See VMware and VirtualBox; do not see KVM or PyPXEboot!)

And he had a recommendation: Petitboot – an SSH-able PXE loader that can simply do anything and everything we admins or QA engineers could wish for.

https://secure.raptorengineering.com/content/kb/1.html

It’s definitely on my list of things to test.

 

What else there was at the BSI booth: GPG, OpenPGP and OpenVAS.

OpenVAS would have interested me too, but I think one complete conversation is better than two half ones. 🙂

This is patching


 

  • There’s an OpenSSL issue
  • Fetch and rebuild the stable ports tree (2016Q3)
  • Find it won’t build php70 and mod_php70 anymore
  • Try to compare to your -current ports tree
  • Find there’s a php security issue in the version there, but not the one you had
  • Wait till it’s fixed so you can build
  • Type portsnap, then just to be safe first do a full world update to make sure your portsnap doesn’t have security issues any more.
  • The updated portsnap hits metadata corruption
  • Remove your portsnap files, try again, then just think “whatever” and fetch the ports from the ftp mirror and re-extract manually
  • Notice you just fetched an unsigned file via FTP and will use it to build e.g. your OpenSSL.
  • Rant about that.
  • Find you can’t build because it can’t fetch
  • Debug the reason it can’t fetch
  • Find it’s a bug in the ports tree from the fix of the above security issue
  • Make a mental note that no one seems to react within 1-2 days if the stable tree is broken
  • While searching for existing bugs, find a PR about pkg audit that tries to redefine the functionality in order to not fix an output regression
  • Open a bug report for the PHP bug, adjust your local file
  • Fetch your new package list
  • Do a pkg audit, find it doesn’t report too much.
  • Do a pkg audit -F, find it gives an SSL error
  • Find the http://www.vuxml.org certificate expired 2 months ago.
  • Wonder how no one even reacted to that
  • Find that SSLlabs somehow can’t even properly process the site anymore.
  • Find out that SSLlabs is actually dead just now.
  • Notice in the last lines it had managed to print that the actual hostname points at an rDNS-less v6 address and the v4 CNAMEs to a random FreeBSD.org test system.
  • Most likely the vuxml.org webserver ain’t heavily protected in that case, huh?
  • Give up and use http like the goddamn default
  • Random pkg update -> so update pkg first
  • In the end, just no-downtime slam all the new packages over the server because you’re sick of it all (the fair-weather version of this loop is sketched below).
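For reference, the fair-weather version of that loop is only a handful of commands; a hedged sketch on a stock FreeBSD box with pkg, portsnap and the quarterly tree mentioned above:

freebsd-update fetch install       # base system first
portsnap fetch extract             # or grab the 2016Q3 tree from a mirror by hand
pkg update && pkg upgrade          # update pkg itself, then the binary packages
pkg audit -F                       # fetch the vulnerability DB, check what's installed
cd /usr/ports/lang/php70 && make reinstall clean   # rebuild the ports pkg doesn't cover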

 

 

The next person who posts about “admins need to apply the fixes” I’ll just kick in the face. With my beer.

Diving into sysadmin-legalese


I’ve at times had the “fun” of writing outage notices like the current Google statement. The wording is interesting. IT systems fail, and once it’s time to write about it, management will make sure they are well hidden. The notice will be written by some poor soul who just wants to go home to his family instead of being the bearer of bad news. If he writes something that is not correct, management will be back just in time for the crucifixion.

Things go wrong, we lose data, we lose money. It’s just something that happens.

But squeezing both the uncertainties and hard truths of IT into public statements is an art of its own.

Guess what? I’ve been in a popcorn mood – here are multiple potential translations for this status notice:

“The issue with Compute Engine network connectivity should have been resolved for nearly all instances. For the remaining few remaining instances we are working directly with the affected customers. No further updates will be posted, but we will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.”


should have been resolved:

we cannot take responsibility. it might come back even just now! we’re not able to safely confirm anything 😦

but we spoke to the guy’s boss and they promised he’ll not do that thing again.

for nearly all instances:

the error has been contained, but we haven’t been able to fix it? we’ve been unable to undo the damage where it happened. we were lucky in most cases, but not in all.

the remaining few remaining instances:

the remaining remaining… they aren’t going away! Still remaining!

these instances that are left we’ll not be able to fix like the majority where our fix worked.

Please, just let me go to bed and someone else do those? I can’t even recall how many 1000s we just fixed and it won’t stop.

working directly with the affected customers:

only if you’re affected and we certainly know we need to come out, we will involve you. we can’t ascertain if there’s damage for you. we need to run more checks and will tell you when / how long. you get to pick between restores and giving us time to debug/fix. (with a network related issue, that is unlikely)

no further updates will be posted:

since it is safely contained, but we don’t understand it, we can’t make any statement that doesn’t have a high chance of being wrong

we only post here for issues currently affecting a larger than X number of users. Now that we fixed it for most, the threshold is not reached

we aren’t allowed to speak about this – we don’t need to be this open once we know the potential liabilities are under a threshold.

will conduct an internal investigation:

we are unwilling to discuss any details until we completely know what happened

appropriate improvements:

we are really afraid now to change the wrong thing.

it will be the right button next time, it has to!

management has signed off the new switch at last!

to prevent or minimize future recurrence:

we’re not even sure we can avoid this. right now, we would expect this to happen again. and again. and again. recurring, you know? if at least we’d get a periodic downtime to reboot shit before this happens, but we’re the cloud and no one gives us downtimes!

and please keep in mind minimized recurrence means a state like now, so only a few affected instances, which seems under the threshold where we notify on the tracker here.

we really hope it won’t re-occur so we don’t have to write another of those.


 

Don’t take this too seriously, but these are some of the alarms that go off in my head when I see such phrasing.

Have a great day and better backups!

First look at the UP-board


I’ve finally got two UP Boards. After they arrived I also ordered another “Mean Well” dedicated 12V rail-mount PSU, and some 2.1mm power cables.

The boards are nice little things with a lot of CPU power. Quad Atom with some cache, eMMC and enough RAM!

Photos, dmesg, hwinfo etc can be found here:

http://up.home.xenhosts.de/

The basics:

My models have 2GB of RAM, which is shared with the onboard graphics.

They have a big front and back cooling plate; for hardcore usage there’s also an active fan in their shop.

Connectors: USB 2.0, 3.0, 3.0 OTG. The latter is a MacBook Air “style” Type-C flat connector. There’s also power (via a 2.1mm plug), HDMI and some other stuff I didn’t understand.

There’s one connector that has an industrial-style plug. This port exposes 2x USB and a serial with BIOS forwarding. You should give in and just buy it; there’s no way you’ll easily find the plug on your own.

You’ll need this cable unless you only plan on desktop use. It doesn’t come with an FTDI serial, so also make sure to get one of those.

The MMC reads at up to 141MB/s (pretty nice) and writes (fdatasync) up to 61MB/s (also pretty OK). TRIM does work.
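Those numbers should be reproducible with plain dd; a hedged sketch, assuming the eMMC shows up as /dev/mmcblk0 and the root filesystem lives on it:

# sequential read, bypassing the page cache
dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024 iflag=direct
# sequential write with a flush at the end
dd if=/dev/zero of=/root/ddtest bs=1M count=1024 conv=fdatasync && rm /root/ddtest
# TRIM: fstrim errors out if the device or filesystem doesn't support it
fstrim -v /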

The LAN interface is just a Realtek, connected via PCIe (2.5GT/s, x1).

BIOS stuff

On boot you’re greeted by a normal EFI shell; it reminded me of my late HP-UX days, except here there is no SAN boot scan.

Pressing F7 gives you a boot menu which always also allows going to BIOS Setup, which is a normal phoenix-style menu. Very small and simple – that’s nice.

Serial forwarding is supported, I didn’t try netbooting yet.

OS (ubilinux)

I installed their “default” distro, which is done by flashing the ISO to a stick (or putting it on a CD). Take care to use a USB 2.0 connector if it’s a USB3 stick, or it won’t be detected(!)
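Writing the image is the usual dd dance; a minimal sketch, assuming the image is called ubilinux.iso and the stick shows up as /dev/sdX (triple-check that device name):

dd if=ubilinux.iso of=/dev/sdX bs=4M conv=fsync status=progress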

The grub menu was really slow, while the BIOS had been quick.

Limiting the video RAM to 64MB plus a UHD screen got me a system that stopped working once X was up. I didn’t investigate that; instead I booted to single-user mode and told systemd to make that the default (systemctl set-default multi-user.target).

Ubilinux is a Debian Jessie (sigh), but with some parts pulled in from Ubuntu (sigh).

It works and has all the stuff to e.g. access the GPIO connectors.

lm_sensors detected the coretemp CPU sensors, nothing else.

AES-NI was autoloaded.

The only thing I couldn’t make work yet was the hardware watchdog, which is an issue split between SystemD, packaging and probably something else.

This one gets a 9/10 which is rare 🙂

Active Directory burns


Hi,

 

just to note down the most important things from a day last week where things got special.

We set out to extend / upgrade a forgotten 2003 domain to 2012; with that it would also finally get a second DC.

So we expected those steps:

  • Put domain in 2003 mode 🙂
  • Add 2012 DC, if not happy, do forest prep or so
  • Switch all FSMO
  • Shut down 2003 DC, test if NAS shares are accessible
  • Restart it
  • ready a 2012 VM as secondary DC
  • dcpromo the new VM
  • demote the old DC
  • clean up DNS/LDAP should we find any leftovers

It didn’t go anywhere near as smoothly.

First we hit a lot of DNS issues because not all clients used DHCP, so they’d end up looking for a DNS server that wasn’t there.

Once that was fixed we found stuff still wasn’t OK at all (No logon servers available). On the new DC the management console (the shiny new one) didn’t complain about anything, but nothing worked.

First we found the time service wasn’t doing OK: w32tm needed a lot of manual config (and resetting thereof) to finally sync its time; it didn’t take the domain time and it didn’t pull it in externally. That caused all Kerberos tickets involving the new DC to be worthless.
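The wrangling went roughly along these lines (illustrative, not a transcript of that day; run in an elevated cmd on the new DC):

rem reset the time service configuration
net stop w32time
w32tm /unregister
w32tm /register
net start w32time
rem point it back at the domain hierarchy and force a resync
w32tm /config /syncfromflags:domhier /update
w32tm /resync /rediscover
w32tm /query /status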

 

Later we also noticed that netdom query fsmo suddenly hit an RPC error, which was followed by a lot of time spent trying to debug that RPC error. In fact, on one of the following links I found a reminder to work with dcdiag, which finally gave information that was actually useful to someone in IT. We had DNS issues (not critical) and a broken NTFRS. Basically I ran all the checks from the first link:

When good Domain Controllers go bad!

Then I verified that, yes, indeed our netlogon shares were missing on the new DC and, in fact, it had never fully replicated. No surprise it wasn’t working. The next thread (German) turned up after I found the original issue and had the right KB links to fix it.

http://www.mcseboard.de/topic/115522-fehler-13508-kein-sysvol-und-probleme-mit-dateireplikationsdienst/

So what happened was that some dork at MS had used an MS JET DB driver for the rolling log of the file replication service. During a power outage, the JET DB wrote an invalid journal entry. It had been broken ever since.

What I had to do was re-initialize the replication, and then everything fixed itself just fine.

The KB entry for that is long, dangerous and something I hope you won’t have to do.

http://support.microsoft.com/kb/315457/en-us
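From memory, the core of that KB is the BurFlags registry value that tells NTFRS to reinitialize its replica sets on the next start; read the article before touching it, because picking authoritative vs. non-authoritative on the wrong DC is how you lose SYSVOL:

rem hedged sketch of the central KB 315457 step, no substitute for reading it
net stop ntfrs
rem D4 = authoritative restore (only on the DC holding the good copy),
rem D2 = non-authoritative restore (on all other DCs)
reg add "HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup" /v BurFlags /t REG_DWORD /d 0xD4 /f
net start ntfrs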

I hated that this KB actually has a few errors, e.g. they don’t even tell you when to start the service back up on the original DC. Since we only had *one*, it was also sometimes unclear whether I’d just be doomed or be fine.

In the end it’s not that bad: GPOs are just files, which you can restore if needed. So even if you end up with a completely empty replication set, you can put your GPOs back in. And, from the infra side, all your GPOs are less important than this service running…

There are a lot of unclear warnings about the permissions of those files; copying in cmd might be OK. Otherwise you can also reset the perms later via one of the AD MMC tools, so actually that’s not too horrible. I had no issue at all, and *heh*, they also hadn’t ever used a GPO.

Also, note that on the last goddamn page of the article they tell you how to make a temporary workaround.

 

Monitoring lessons for Windows

  • Don’t skip the basic FSMO check Nagios had forever (the raw commands behind these checks are sketched after this list)
  • Have the Check_MK replication check running (it’ll not see this)
  • Monitor the SYSVOL and NETLOGON shares via the matching check
  • Monitor NTFRS EventID 13568, optionally 16508
  • Set up LDAP checks against the global catalog
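Until such checks exist, the raw commands they would wrap are simple enough to run by hand on a DC (cmd; how you package them into checks is up to you):

rem building blocks a monitoring check could wrap
netdom query fsmo
net share | findstr /i "SYSVOL NETLOGON"
wevtutil qe "File Replication Service" /q:"*[System[(EventID=13568)]]" /c:1 /f:text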

Monitoring gaps:

  • Checks built on dcdiag (afraid one has to ignore the event log since it shows historic entries). Command that finally got me rolling was dcdiag /v /c /d /e /s:
  • Functional diagrams of AD for NagVis
  • Pre-defined rulesets for AD monitoring in BI

I feel those could just be part of the base config; there’s nothing site-specific about this kind of monitoring. BI can do the AD monitoring strictly via autodetection, but NagVis is better for visual diagnosis.

With proper monitoring in place I’d not have had to search for the issue at all…

For my monitoring customers I’ll try to build this and include it in their configs. Others should basically demand the same thing.

 

Furthermore:

  • There’s always a way to fix things
  • AD is a complex system, you can’t just run one DC. No matter if SBS exist(s|ed) or not, it’s just not sane to do. Do not run AD with just one DC. Be sane, be safer.
  • Oh, and look at your Eventlogs. Heh.

Ceph Training Overview


Someone on #ceph asked about training in Europe / major cities there.

So what I did is I googled the s*** out of “Ceph Training”…

I mean, I’ve had a little browse of who’s currently offering any, partly as research (“do I wanna do that again?”) and also because I think alternatives are good.

 

Here’s the offerings I found, along with some comments.

All of them cover the basics pretty nicely now, meaning you get to look at CRUSH, make sure you understand PGs reasonably well, and most will let you do maintenance tasks like OSD add / remove…

I didn’t only look at classes in German, but it seems the interest in Ceph in Germany is just pretty high. Rightfully so 🙂

Ceph Trainings


CEPH – Cluster- und Dateisysteme

(obviously a German-language class)

https://www.medienreich.de/training/ceph-cluster-und-dateisysteme

They have lots of references for this course, and clearly involve CephFS. They offer a flexible duration, so you can choose how deeply some topics are covered. They handle common mistakes, which is very nice for the trainees.

One add-on they do is re-exporting, e.g. HA NFS or VMs in RADOS etc. Surely helpful, but make sure you either cut that part very short (like 2 hours) to only get the overview, or stretch it to 2 days of its own. Clustering isn’t a part of your toolkit; it’s a new toolkit you learn. If you cut it short, you end up worse off. And make no mistake, Ceph(FS) will be your new SPOF, so make sure you get in very close touch with it.

One thing I’d also recommend is not to do the class with just a 3-node setup if you take a longer one. 3 nodes is really nice for your first days with Ceph but the reliability and performance are completely unrelated to what you see in a larger setup.

hastexo – “Get in touch”

https://www.hastexo.com/services/training/

I found some past feedback from one of their classes; it seems to be very good.

Also keep in mind they’re among the really long-time Ceph users; they’ve hung out in #ceph at least as long as I have, and that’s what now, idk, 6 years?

Unlike pure trainers or people that run around bragging about Ceph but aren’t even in the community, hastexo has also spent years delivering Ceph setups large and small.

The only grudge I have with them is when they recommended consumer Samsung SSDs in a Ceph intro for iX magazine. That wasn’t funny; I met people who thought that was buying advice for anything serious. Ignoring that any power outage could potentially fizzle all the journal SSDs isn’t something you do. But the author probably just tried to be nice and save people some money in their labs.

Anyway, due to their large number of installations hastexo is the very best choice if your company is likely to have a few special requirements; say you might test with 8 boxes but later run 500+, and you want real-world experience and advice on scalability, even in your special context.

Afaik they’re in Germany, but they’re real IT people; as far as I can tell any of them speaks perfectly fluent English 🙂

Seminar Ceph

http://www.seminar-experts.de/seminare/ceph/

This is just a company re-selling trainings someone else is doing.

The trainer seems to have a good concept though, adding in benchmarking and spending a lot of time on the pros/cons of fuse vs kernel for different tasks.

This is the course you should take if you’ll be “the ceph guy” in your company and need to fight and win on your own.

Nothing fuzzy, no OpenStack or Samba “addons”. Instead you learn about Ceph to the max. I love that.

Price isn’t low even for 4 days, but I see the value in this, and in-house training generally ain’t cheap.

There’s also a “streaming” option which comes out cheaper, but a Ceph class without a lab is mostly useless. The page also doesn’t say anything about the trainer, so no idea if he’d do it in a language other than German.

Red Hat Ceph Storage Architecture and Administration

http://www.flane.de/en/course/redhat-ceph125

Seriously, no. This is all about OpenStack. You can take this course if you have some extra time to learn Ceph in-depth or if you’re the OpenStack admin and do some Ceph on the side, and aren’t the only guy around.

Can also be partially interesting if you have other ideas for using the Rados Gateway.

 

Merrymack Ceph Training

http://www.ceph-training.com/index.html

A web-based / video-based training. Price-wise this beats them all if you just have 1-2 attendees and no prior knowledge.

Probably a very good refresher if your Ceph knowledge is dated or if you want to learn at your own pace. That way you can spend a lot more time in the lab, which is rather nice.

If you have a few people on the team the price goes up and you should really negotiate a price.

Personally I’d prefer something with a trainer who looks at your test setup and tells you “try it like this and it’ll work”, but $499 is hard to argue with if you’ve got some spare time to do the lab chores.

I think this is the actual launch info of the course:

https://www.linkedin.com/pulse/i-just-launched-on-demand-ceph-training-course-donald-talton

 

No longer available

Ceph 2-day workshop at Heise / iX magazine.

It was a bit expensive for 2 days with up to 15 people.

http://www.heise.de/ix/meldung/iX-Workshop-zum-Dateisystem-Ceph-2466563.html

Nice as a get-to-know thing; I would not recommend it as the only training before going into a prod deployment.

 

MK Linux Storage & LVM

That’s the original first Ceph training, the one I used to do 🙂

Ceph was done on the final day of the class, because back then you’d not find enough people to just come around for a Ceph training 😉

But it’s not offered by them any longer. As far as I know the interest was always a little bit too low since this hardcore storage stuff seems to have a different audience than the generic Linux/Bash/Puppet classes do.

 

Summary

Which one would I recommend?

“Seminar Ceph” from that reseller would be for storage admins who need to know their Ceph cluster as well as a seasoned SAN admin knows their VMAX etc. Also the best choice for people at IT shops who need to support Ceph in their customer base. You’ll be better off really understanding all parts of the storage layer; you might get your life sued away if you lose some data.

Go to hastexo if you generally know about Ceph, have already read the Ceph paper and some more current docs, and your team is strong enough to basically set it up on its own at scale (so not “we can install that on 5 servers with Ansible” but “we’ve brought up new applications sized at hundreds of servers often enough, thank you”). You’d be able to strengthen some areas with them and benefit from their implementation experience.

Take the online Ceph Training if you want something quick and cheap and are super eager to tinker around and learn all the details. You’ll end up at the same level as with the pro training but need more time to get there.

Myself?

I’ve still got no idea whether I should do another training. I looked at all their outlines and it all looked OK. Some more CRUSH rebuilds to flex the fingers and add/remove/admin-socketify all the things 🙂 So that’s fine with a week of prep and slides.

Training is a lot more fun than anything else I do, too.

But to be honest, the other stuff isn’t done yet and is also pretty cool, with 1000s of servers and so on.

With the next iteration of my website (www.florianheigl.me) I’ll be adding classes and a schedule.