Ceph Training Overview


Someone on #ceph asked about training in Europe / major cities there.

So what I did is I googled the s*** out of “Ceph Training”…

I had a look around at who’s currently offering any, partly as research (“do I wanna do that again?”) and partly because I think alternatives are good.

 

Here are the offerings I found, along with some comments.

All of them cover the basics pretty well now, meaning you get to look at CRUSH, make sure you understand PGs reasonably well, and most will let you do maintenance tasks like OSD add/remove…
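
For reference, the sort of OSD maintenance you’d practice there boils down to a handful of CLI calls; a minimal sketch (standard ceph CLI, exact flags and output vary by release):

# take OSD 3 out of the data distribution before working on it
ceph osd out 3
# suppress rebalancing while a whole host is down for maintenance
ceph osd set noout
# ... do the maintenance ...
ceph osd unset noout
# remove a dead OSD for good: CRUSH map, auth key, OSD map
ceph osd crush remove osd.3
ceph auth del osd.3
ceph osd rm 3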

I didn’t only look at classes in German, but it seems the interest in Ceph in Germany is just pretty high. Rightfully so 🙂

Ceph Trainings


CEPH – Cluster- und Dateisysteme

(obviously a German-language class)

https://www.medienreich.de/training/ceph-cluster-und-dateisysteme

They have lots of references for this course, and it clearly covers CephFS. They offer a flexible duration, so you can choose how deeply some topics are covered. They also cover common mistakes, which is very nice for the trainees.

One add-on they do is re-exporting, i.e. HA NFS or VMs in RADOS etc. Surely helpful, but make sure you either cut that part very short (like 2 hours) to only get the overview, or stretch it into 2 days of its own. Clustering isn’t a part of your toolkit, it’s a new toolkit you learn; if you cut it short, you end up worse off. And make no mistake, Ceph(FS) will be your new SPOF, so make sure you get in very close touch with it.

One thing I’d also recommend is not to do the class with just a 3-node setup if you take a longer one. 3 nodes is really nice for your first days with Ceph, but its reliability and performance behaviour are completely unrelated to what you see in a larger setup.

hastexo – “Get in touch”

https://www.hastexo.com/services/training/

I found some past feedback from one of their classes; it seems to be very good.

Also keep in mind they’re among the really long-time Ceph users; they’ve hung out in #ceph at least as long as I have, and that is now, what, 6 years?

Unlike pure trainers, or people that run around bragging about Ceph but aren’t even part of the community, hastexo has also spent years delivering Ceph setups large and small.

The only grudge I have with them is that they recommended consumer Samsung SSDs in a Ceph intro for iX magazine. That wasn’t funny; I met people who thought that was buying advice for anything serious. Ignoring that any power outage would potentially fizzle all the journal SSDs ain’t something you do. But the author probably just tried to be nice and save people some money in their labs.

Anyway, hastexo, due to their large number of installations, is the very best choice if your company is likely to have a few special requirements; let’s say you’re a company that might test with 8 boxes but later run 500+, and you want real-world experience and advice on scalability, even in your special context.

Afaik they’re based in Germany, but they’re real IT people; as far as I can tell, any of them speaks fluent English 🙂

Seminar Ceph

http://www.seminar-experts.de/seminare/ceph/

This is just a company re-selling trainings someone else is doing.

The trainer seems to have a good concept though, adding in benchmarking and spending a lot of time on the pros/cons of the FUSE vs. kernel clients for different tasks.

This is the course you should take if you’ll be “the ceph guy” in your company and need to fight and win on your own.

Nothing fuzzy, no OpenStack or Samba “addons”. Instead you learn about Ceph to the max. I love that.

Price isn’t low even for 4 days, but I see the value in this, and in-house training generally ain’t cheap.

There’s also a “streaming” option which comes out cheaper, but a Ceph class without a lab is mostly useless. The page also doesn’t say anything about the trainer, so no idea if he’d do it in a language other than German.

Red Hat Ceph Storage Architecture and Administration

http://www.flane.de/en/course/redhat-ceph125

Seriously, no. This is all about OpenStack. You can take this course if you have some extra time to learn Ceph in depth afterwards, or if you’re the OpenStack admin doing some Ceph on the side and aren’t the only guy around.

It can also be partially interesting if you have other ideas for using the RADOS Gateway.

 

Merrymack Ceph Training

http://www.ceph-training.com/index.html

A web-based / video-based training. Price-wise this beats them all if you just have 1-2 attendees and no prior knowledge.

Probably a very good refresher if your Ceph knowledge is dated, or if you want to learn at your own pace. That way you can spend a lot more time in the lab, which is rather nice.

If you have a few people on the team the price goes up, and you should really negotiate.

Personally I’d prefer something with a trainer who looks at your test setup and tells you “try it like this and it’ll work”, but $499 is hard to argue with if you’ve got some spare time to do the lab chores.

I think this is the actual launch info of the course:

https://www.linkedin.com/pulse/i-just-launched-on-demand-ceph-training-course-donald-talton

 

No longer available

Ceph 2-day workshop at Heise / iX magazine.

It was a bit expensive for 2 days with up to 15 people.

http://www.heise.de/ix/meldung/iX-Workshop-zum-Dateisystem-Ceph-2466563.html

Nice as a get-to-know-it thing, but I would not recommend it as the only training before going into a prod deployment.

 

MK Linux Storage & LVM

That’s the original first Ceph training, the one I used to do 🙂

Ceph was done on the final day of the class, because back then you’d not find enough people to just come around for a Ceph training 😉

But it’s not offered by them any longer. As far as I know the interest was always a little bit too low since this hardcore storage stuff seems to have a different audience than the generic Linux/Bash/Puppet classes do.

 

Summary

Which one would I recommend?

“Seminar Ceph” from that reseller would be for storage admins who need to know their Ceph cluster as well as a seasoned SAN admin knows their VMAX etc. It’s also the best choice for people at IT shops who need to support Ceph across their customer base. You’ll be better off really understanding all parts of the storage layer; you might get your life sued away if you lose some data.

Go to hastexo if you generally know about Ceph, you’ve already read the Ceph paper and some more current docs, and your team is strong enough to basically set it up on its own at scale (so not “we can install that on 5 servers with ansible” but “we’ve brought up new applications the size of 100s of servers often enough, thank you”). You’d be able to strengthen some areas with them and benefit from their implementation experience.

Take the online Ceph Training if you want something quick and cheap and are super eager to tinker around and learn all the details. You’ll end up at the same level as with the pro training but need more time to get there.

Myself?

I still have no idea if I should do another training. I looked at all their outlines and they looked OK. Some more CRUSH rebuilds to flex the fingers, and add/remove/admin-socketify all the things 🙂 So, that would be fine with a week of prep and slides.

Training is a lot more fun than anything else I do, too.

But, to be honest, the other stuff I do isn’t finished yet and is also pretty cool, with 1000s of servers and so on.

With the next iteration of my website (www.florianheigl.me) I’ll be adding classes and a schedule.


Part four: Storage migration, disaster recovery and friends


This article was also published a little too early… 🙂

 

A colleague (actually, my team lead) and I set out to build a new, FreeBSD-based storage domU.

 

The steps we took:

Updated Raid Firmware

Re-flashing my M5015 RAID controller to more current, non-IBM firmware. We primarily hoped this would enable the SSDs’ write cache. Didn’t work. It was a little easier than expected since I had already done parts of the procedure.

Your most important command for this is “AdpAllInfo”.
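
If you want to check the firmware level before and after flashing, it looks roughly like this (the usual binary name and install path; adjust to your packaging):

# dump controller info; the "FW Package Build" line shows the running firmware
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i -A3 'FW Package'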

 

Created Raid Luns

We then created a whole bunch of RAID10 LUNs over 4 of the SSDs (an example MegaCli invocation is sketched after the list).

  • 32GB for the storage domU OS
  • 512MB for testing a controller-ram buffered SLOG
  • 16GB ZIL
  • 16GB L2ARC
  • 600-odd GB “rest”
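
A hedged example of what the RAID10 creation looks like as a MegaCli call; the enclosure:slot IDs (252:0 etc.) are placeholders you’d look up first, and carving several differently sized LUNs out of the same drives additionally needs your MegaCli version’s size option:

# find enclosure and slot numbers of the physical disks
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number'
# RAID10 = striping over two RAID1 spans; WB = write-back, RA = read-ahead
/opt/MegaRAID/MegaCli/MegaCli64 -CfgSpanAdd -r10 -Array0[252:0,252:1] -Array1[252:2,252:3] WB RA -a0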

Configure PCI passthrough in Xen

There were a few hiccups; the kernel command line just wouldn’t activate, nor did using modprobe.d and /etc/modules do the job on their own.

This is what we actually changed…

First, we obtained the right PCI ID using lspci (apk add pciutils)

daveh0003:~# lspci | grep -i lsi

01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 03)

in /etc/modules:

xen_pciback

in /etc/modprobe.d/blacklist added:

blacklist megaraid_sas

in /etc/modprobe.d/xen-pciback.conf

options xen-pciback hide=(0000:01:00.0)

in /etc/update-extlinux.conf

default_kernel_opts="modprobe.blacklist=megaraid_sas quiet" – we had also tried

#default_kernel_opts="xen-pciback.hide='(01:00.0)' quiet"

(btw, not escaping the parentheses can cause busybox/openrc init to crash!!)

And, last but not least, I gave up in annoyance and put some stuff in /etc/rc.local:

echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind

echo 0000:01:00.0 > /sys/bus/pci/drivers/pciback/new_slot

echo 0000:01:00.0 > /sys/bus/pci/drivers/pciback/bind

(and even this isn’t working without me manually calling it. It will take many more hours to get this to a state where it just works. If you ever wonder where the price of VMWare is justified… every-fucking-where)
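
For completeness, this is roughly how I’d wrap those three echos so the script at least doesn’t fall over when run twice; a sketch only, and the module may be called xen-pciback or pciback depending on the kernel build:

#!/bin/sh
# rc.local snippet: hand the MegaRAID controller to xen-pciback at boot
DEV=0000:01:00.0
modprobe xen-pciback 2>/dev/null
# detach from whatever host driver currently owns the device (if any)
if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
    echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
fi
# register the slot with pciback and bind it
echo $DEV > /sys/bus/pci/drivers/pciback/new_slot
echo $DEV > /sys/bus/pci/drivers/pciback/bind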

FreeBSD storage domU

The storage domU is a pretty default install of FreeBSD10 to a 32GB LUN on the raid.

During install DHCP did not work ($colleague had also run into this issue) and so we just used a static IP… While the VM is called “freesd3” I also added a CNAME called “stor” for easier access.

The zpools are:

  • zroot (the VM itself)
  • zdata (SSD-only)
  • zdata2 (Disk fronted by SSD SLOG and L2ARC)

I turned on ZFS compression on most of those using the dataset names, i.e.:

zfs set compression=lz4 zroot/var

VMs can later access this using iSCSI or as a Xen block device (we’ll get to that later!)

Now, for the actual problem. During installation, the device IDs had shifted. On FreeBSD this is highly uncommon to see, and you’d *really* consider it a Linux-only issue. Well, not true.

Install

We selected “mfid0”, which should have been the 32GB OS LUN…

This is what MegaCli shows:

<<<megaraid_ldisks>>>
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Size                : 32.0 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 1 (Target Id: 1)
Size                : 3.182 TB
Sector Size         : 512
State               : Optimal
Strip Size          : 64 KB
Number Of Drives per span:2
Virtual Drive: 2 (Target Id: 2)
Size                : 512.0 MB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 3 (Target Id: 3)
Size                : 16.0 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 4 (Target Id: 4)
Size                : 64.0 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2
Virtual Drive: 5 (Target Id: 5)
Size                : 630.695 GB
Sector Size         : 512
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:2

Note that the logical drive ID and Target:Lun match up just fine!

 

 

The OS side:

Please compare to what FreeBSD’s mfi driver assigns…

mfid0: 32768MB (67108864 sectors) RAID volume (no label) is optimal
mfid1: 512MB (1048576 sectors) RAID volume (no label) is optimal
mfid2: 16384MB (33554432 sectors) RAID volume (no label) is optimal
mfid3: 65536MB (134217728 sectors) RAID volume (no label) is optimal
mfid4: 645832MB (1322663936 sectors) RAID volume (no label) is optimal
mfid5: 3337472MB (6835142656 sectors) RAID volume 'raid10data' is optimal

At install time it was cute enough to *drums* assign the 3.x TB LUN as mfid0. So we installed FreeBSD 10 on the LUN that stores my VMs.

That, of course, killed the LVM headers and a few gigabytes of data.
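
The obvious lesson for the re-install: cross-check the mfid numbering against the controller’s view before touching any disk. A sketch with FreeBSD’s stock tools (mfiutil comes with the mfi driver):

# virtual drives as the controller reports them: target ID, RAID level, size
mfiutil show volumes
# what the OS actually attached, with sizes, so you can match mfidN to the right LUN
geom disk list | grep -E 'Geom name|Mediasize'
diskinfo -v /dev/mfid0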

 

My next post will skip over reinstalling to the right lun (identified from live cd system) and instead describe how I went about getting the data back.

 

Part three: Storage migration, disaster recovery and friends



What I had not expected was how hard it would be to decide on an actual solution.

 Picking a Hypervisor

For a lab I would need:

  • nested virt
  • high performance
  • low overhead on top of that, also because of power consumption etc.
  • easy cloning of vms and labs
  • flexible networking
  • easy scripting
  • wide storage options and easy migration
  • thin provisioning of some kind

 

Knowing all the products and their drawbacks, it turned into a constant back-and-forth between the different hypervisors and ecosystems.

 

VMWare:

VMWare always sneaked back in due to feature reliability and performance consistency, and then got kicked back out for lacking features like an API and storage migration without a full vCenter install.

I knew it would deliver good (600-900MB/s-ish) performance under any circumstance, whereas e.g. Xen can be all over the place, from 150 to 1900MB/s…

Another downside: under VMWare my SolarFlare 5122 will definitely never expose its 256 VNICs. And I’d like to have them.

Installing MegaCli in ESXi is also a bit annoying.

On the pro side there’s the Cisco Nexus1000V and many other similar *gasp* appliances.

And the perfect emulation: no “half” disk drivers, no cheapass BIOS.

In the end, I like to have my stuff licensed, and to use the full power of a VMWare setup I’d need to go with vCenter + an Enterprise license. No fun.

 

XenServer:

Just LOL.

While XenServer has great features for VM cloning, it’s just not my cup of tea. Too much very bad Python code. Too many Windows-user kludges. Broken networking all over.

Any expectation of storage flexibility would be in vain; it needs backporting and recompiling software against the dom0 kernel using their SDK. Definitely not an easy solution if you wanna be able to flip between iSCSI, Infiniband, md and whatever else *looks* interesting. This should be a lab after all, and I don’t see any chance of running something like the Storwize VSA on this. You’d want nested ESXi for that, and that’s not on the roadmap for XenServer. If anything still is.

It would probably work best for SolarFlare. I’ll admit that.

 

CloudWeavers:

This is what will run in many VMs, but I don’t wanna break layering, so my underlying hypervisor and solution should not be the same as in the VMs. I am not yet sure if it’s the right decision.

This would be the prime KVM choice since they already deliver a well-tuned configuration.

What worries me is that, while MooseFS’ FUSE client scales well enough on a single hypervisor node, it would end up with a lot of additional context switching / thrashing if I used it both on the main node and in the clients. There might be smarter ways around this, e.g. by having a fat global pool in the “layer 1” hypervisor and using that from the layers above, too. More probably it’d turn into a large disaster 🙂

 

LXC:

Pointless; no hypervisor, and one single kernel instance can’t successfully pretend to be a bunch of OSDs and clients 🙂

 

Plain Xen:

This is what I already have and went with, especially to make use of tmem and run the Ceph labs as paravirt domUs. This way I know nothing will get in the way performance wise.
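
For the lab domUs themselves a plain PV guest config is all that’s needed; a minimal sketch of an xl config file where every name, path and size is made up for illustration:

# /etc/xen/ceph-osd1.cfg -- hypothetical paravirt Ceph lab node
name       = "ceph-osd1"
bootloader = "pygrub"          # PV is the default when no HVM builder is requested
memory     = 2048
vcpus      = 2
vif        = [ 'bridge=br0' ]
disk       = [ 'phy:/dev/vg0/ceph-osd1-root,xvda,w',
               'phy:/dev/vg0/ceph-osd1-osd,xvdb,w' ]
# start it with: xl create /etc/xen/ceph-osd1.cfg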

There’s one thing you should know though, when comparing Xen vs. ESXi or a licensed VMWare:

Xen’s power management is broken, broken, broken:

  • Newer deep-idle CPU states are all unsupported
  • The utility to manage CPU power management is broken as well. Since 4.3 nothing works any more.
  • Even if you free + shutdown a core from dom0 it’ll not be put to sleep

You can definitely tell from the power intake and fan speed that Xen, even when idle, consumes more power than an idle Linux kernel would. Spinning up a PV domU has no impact; spinning up an HVM one brings a noticeable increase in fan whoosh.

ESXi is far better integrated, so I’m expecting something like 100 Euro (personal unfounded opinion) per year of additional energy wasted compared to VMWare.

My choice for Xen is mostly

  • the bleeding edge features like tmem
  • the really crazy stuff like vTPM and whatever of the cool features ain’t broken at any given time.
  • leverage any storage trick I want and have available in a (thanks to Alpine Linux) very recent Linux kernel
  • put in place ZFS, maybe in a dedicated driver domain
  • also be able to use MooseFS and last, but most interesting
  • all the things that never work on other hypervisors – CPU hotplug, dynamic ram changes…
  • storage domUs!!!!!

 

I think in a lab running 20-30 loaded VMs it will be crucial to optimize the memory subsystem.

Same goes for having the least possible CPU overhead; under load this will help.

Last, concurrently being able to use different storage techs means I can choose different levels of availability and speed – albeit not _having to_, since there’s a large SSD monster underneath it.

I’m also quite sure the disks will switch from Raid10 to Raid5. They just won’t see any random IO any more.

The “Raid5 is not OK” Disclaimer

Oh, and yes, just to mention it: I’m aware I’m running green drives behind a controller. I know about RAID5 rebuild times (actually, they’re much lower on HW RAID: about 30% of software RAID’s) and the thing is…

If I see disk dropouts (yet to be seen), I’ll replace the dumb thing ASAP. It makes me cringe to read about people considering this a raid controller issue. If the damn disk can’t read a block for so long that the controller drops it out… Then I’m glad I have that controller and it did the right thing.

Such (block-errored) disks are nice as media in secondary NAS storage or as doorstops, but not for a RAID. Maybe I just got extremely lucky in having no media errors at all from them? Definitely not what you’d see in a dedicated server at a mass hoster.

I’ve also patched my Check_MK SMART plugin to track the SMART stats of the RAID PDisks, so anything SMART notices I’ll immediately be aware of. Why the green disks in the first place? Well, the power and noise benefits are huge. If I had some more space I’d consider a RAID6 of 8 of them, but not until I move to a bigger place.
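
If you don’t want to patch anything, smartctl can usually reach the disks behind a MegaRAID controller directly; a sketch, where the device IDs and the block device node are examples for my box:

# SMART data of the physical disk with controller device ID 4
smartctl -a -d megaraid,4 /dev/sda
# quick health check across all physical disks
for id in 0 1 2 3 4 5; do smartctl -H -d megaraid,$id /dev/sda; done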

 

Coming up next:

A colleague offered me some company when setting up a final storage layout.

We built a dedicated storage domU with a PCI-passthrough’ed MegaRAID controller and ZFS. The install had a little issue…

This is what the next posts will be about, one describing how to build a storage domU.

Also, what pitfalls to expect, and then a focus on losing data (sigh) and getting it back.

I’ll close with some lessons learned. 🙂

Part two: Storage migration, disaster recovery and friends



 Go and find me a new host. Keep some money for foods.

So, in march and april I set out to build a *home* server that could handle a Ceph lab, and would behave mostly like real hardware. That equates to disks being slow, SSDs being fast, and RAM being, well, actual RAM. Writing to two disks should ideally also not immediately turn into an IO blender because they reside on one (uncached) spindle.

I think overall I spent 30 hours on eBay and in shops to find good hardware at a cheap price.

 

This is what I gathered:

  • Xeon 2680V2 CPU (some ES model) with 8 instead of 10 cores but same 25MB of cache. It’s also overclockable, should I ever not resist that
  • Supermicro  X9SRL-F mainboard. There are better models with SAS and i350 NICs but I wanted to be a little more price-conservative there
  • 8x8GB DDR3 Ram which I recycled from other servers
  • 5x Hitachi SSD400M SSDs – serious business, enterprise SAS SSDs.
  • The old LSI 9260 controller
  • The old WD green disks

The other SSD option had been Samsung SM843T but their seller didn’t want to give out a receipt. I’m really happy I opted for “legit” and ended up with a better deal just a week later:

The Hitachis are like the big brother of the Intel DC S3700 SSD we all love. I had been looking for those on the cheap for like half a year and then got lucky. At 400GB capacity each, it meant I could make good use of VM cloning etc. and generally never have to move VMs from one pool to another for space again.

 

I had (and still have) a lot of trouble with the power supply. Those intel CPUs take very low power on idle, even at the first stage of the boot. So the PSU, while on the intel HCL, would actually turn off after half a second when you had very few components installed. A hell of a bug to understand since you normally remove components to trace issues.

Why did I do that? Oh, because the Supermicro IPMI gave errors about some memory module. Which was OK, just not fully supported. Supermicro is just too cheap to have good IPMI code.

Meh.

Some benchmarking was done using 4(!) of the SSDs, and the results were incredible.

Using my LSI tuning script I was able to hit sustained 1.8GB/s writes and sustained 2.2GB/s reads.

After some more thinking I decided to check out RAID5, which (thanks to the controller using parity to calculate every 4th? block) still gave 1.8GB/s reads and 1.2GB/s writes.

Completely crazy performance.

To get the full RAID5 speed I had to turn on Adaptive Read Ahead. Otherwise it was around 500MB/s, aka a single SSD’s read speed.
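
The knobs involved are all logical-drive properties; roughly these MegaCli calls (hedged sketch, binary path as usual, -LALL/-aALL hit every LD on every adapter):

# adaptive read-ahead and controller write-back cache on all logical drives
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp ADRA -LALL -aALL
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -LALL -aALL
# ask the controller to enable the drives' own write cache, then verify
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -EnDskCache -LALL -aALL
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -LALL -aALL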

One problem that stuck around was that the controller would / will not enable the SSDs’ write cache, no matter what you tell it!

This is a huge issue considering each of those SSDs has 512MB(ytes) of well-protected cache.

The SSD is on LSI’s HCL for this very controller, so this is a bit of a bugger. I’ll get back to this in a later post since by now I *have* found something fishy in the controller’s output that might be the cause.

Nonetheless: Especially in a raid5 scenario this will have a lot of impact on write latency and IOPS.

Oh, generally: this SSD model and latency? not a large concern 🙂

 

Ceph comes to FreeBSD


Someone (on #ceph) is preparing patches for Ceph to successfully build on FreeBSD!

He sounds dedicated and knowledgeable enough to make it work.
While chatting he said he’d try to plug it into the GEOM block layer framework and also add some ZFS spice.

On Linux you have /dev/rbd for block-layer access (“RADOS block device”) and use LVM (and/or md) to architect more layers of storage on top of this, but on FreeBSD you can really just tuck it into GEOM and build anything you want. GEOM is *one* tool for everything, written by *one* guy, so it’s a lot less messy than mixing all kinds of different stuff like LVM & multipath & ecryptfs & EFI & MDraid & ionice to get to the same goal. (Of course there has also been the usual share of bugs, made worse by the fact that there are fewer users and fewer devs to find / fix them. But the design leans heavily on Veritas, only done better, which makes the non-broken parts a charm to work with.)
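
To make the Linux side of that comparison concrete, the rbd-plus-LVM stacking looks roughly like this (pool, image and VG names invented for illustration):

# create and map a 10GiB RADOS block device, then layer LVM on top of it
rbd create --size 10240 rbd/labvol
rbd map rbd/labvol                    # appears as /dev/rbd0
pvcreate /dev/rbd0
vgcreate vg_rbd /dev/rbd0
lvcreate -L 5G -n data vg_rbd
mkfs.ext4 /dev/vg_rbd/data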

Now, to come back to Ceph on FreeBSD… you’ll have a production-stable & tested ZFS version, an advanced block layer via GEOM, and the most modern object storage/filesystem mix, Ceph, all in one box. If you mentally add in HAMMER, then *BSD is finally getting to one of the top positions when it comes to the most modern filesystems.

I guess in 1-2 years time there will be a lot of performance tuning work for the FreeBSD Core team 😉

my first ceph filesystem!


With a lot of help from someone on #ceph I managed to remove all the errors I made while copying the easy configuration from the Ceph wiki.
I had not created some mountpoints and had forgotten the entries for the MDS nodes!

My test setup consists of six Debian Lenny VMs, one per disk spindle in my Xen dom0s; each got one 100GB LV to work with.

waxu0026# ceph -s
10.07.17 00:06:43.441934 b73d3b70 monclient(hunting): found mon0
10.07.17 00:06:43.457934 pg v393: 1584 pgs: 1584 active+clean; 4084 MB data, 8176 MB used, 592 GB / 600 GB avail
10.07.17 00:06:43.478316 mds e5: 1/1/1 up, 1 up:standby, 1 up:active
10.07.17 00:06:43.478687 osd e15: 6 osds: 6 up, 6 in
10.07.17 00:06:43.479115 log 10.07.16 23:13:45.107332 mon0 192.168.19.26:6789/0 11 : [INF] mds0 192.168.19.241:6800/3785 up:active
10.07.17 00:06:43.479735 mon e1: 2 mons at 192.168.19.26:6789/0 192.168.19.241:6789/0
10.07.17 00:06:43.499271 b73d46d0 b73d46d0 strange, pid file /var/run/ceph.pid has 7970, not expected 8354

waxu0307:/ceph# df -h .
Filesystem Size Used Avail Use% Mounted on
192.168.19.26:/ 600G 7.7G 593G 2% /ceph

Write performance was horrible as I had not created journaling volumes.
I suspect even later these will be a performance hotspot no matter what.
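
Giving each OSD a journal is the obvious fix; in the old-style ceph.conf of that era it’s roughly a journal path (file or block device) plus a size per OSD, something like this (paths are just examples):

; ceph.conf excerpt -- give every OSD a dedicated journal
[osd]
        osd journal = /srv/ceph/journal/$name      ; or a raw device per OSD
        osd journal size = 1024                    ; in MB, for file-based journals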

Read performance was low for a single read (53MB/s). I very much hope that this will scale well above a single disk’s performance.

Here are my test files:

waxu0026# ls -l /ceph/
total 4194304
-rw-r--r-- 1 root root 1073741824 Jul 16 23:21 lala
-rw-r--r-- 1 root root 1073741824 Jul 16 23:39 lala2
-rw-r--r-- 1 root root 1073741824 Jul 16 23:54 lala3
-rw-r--r-- 1 root root 1073741824 Jul 17 00:06 lala4

The good thing is that even with this first-try-ever setup I can see it scales up very well when multiple nodes are involved; I get roughly 100-110MB/s there. Not absolutely perfect, considering the LACP trunks should allow cross-node traffic to go above a single GigE port’s performance.