Heartbleed, or why the Internet isn’t built with “Windows”…


Just talked with a chat acquaintance about his company’s customer advisory on Heartbleed…
20:17 <darkfader> hah and with the ssl
20:17 <darkfader> the funny thing is
20:18 <darkfader> you know, that was 7 days ago now
20:18 <darkfader> 6 days ago we unix people were more or less done patching
20:18 <darkfader> and the audit before that
20:18 <darkfader> and all that
20:18 <darkfader> and the windows world is just now putting together summaries of the affected products

 

Reminds me a lot of a recent discussion about how much patching a Unix environment needs:

“If it’s so stable, why do you have to update it so often?”

A little diagram to go with my explanation:

Part three: Storage migration, disaster recovery and friends



What I had not expected was how hard it would be to decide on an actual solution.

 Picking a Hypervisor

For a lab I would need:

  • nested virt
  • high performance
  • low overhead on top of that, due to power etc.
  • easy cloning of vms and labs
  • flexible networking
  • easy scripting
  • wide storage options and easy migration
  • thin provisioning of some kind

 

Knowing all the products and their drawbacks, it turned into a constant back-and-forth between the different hypervisors and ecosystems.

 

VMWare:

VMWare always sneaked back in thanks to feature reliability and performance consistency, and then got kicked back out because many features, like the API and storage migration, aren’t available without a full vCenter install.

I knew it would deliver good (600-900MB/s-ish) performance under any circumstances, whereas e.g. Xen can be all over the place, anywhere from 150 to 1900MB/s…

Another downside was that under VMWare my SolarFlare 5122 will definitely never expose its 256 VNICs. And I’d like to have them.

Installing MegaCli in ESXi is also a bit annoying.

On the pro side there’s the Cisco Nexus1000V and many other similar *gasp* appliances.

And, the perfect emulation: no “half” disk drivers, no cheapass BIOS.

In the end, I like to have my stuff licensed, and to use the full power of a VMWare setup I’d need to go with vCenter + an Enterprise license. No fun.

 

XenServer:

Just LOL.

While XenServer has great features for VM cloning, it’s just not my cup of tea. Too much very bad Python code. Too many Windows-user kludges. Broken networking all over.

Any expectation of storage flexibility would be in vain; it means backporting and recompiling software against the dom0 kernel using their SDK. Definitely not an easy solution if you want to be able to flip between iSCSI, Infiniband, md and whatever else *looks* interesting. This should be a lab after all, and I don’t see any chance of running something like the Storwize VSA on this. Nested ESXi would be needed for that, and that’s not on the roadmap for XenServer. If anything still is.

It would probably work best for SolarFlare. I’ll admit that.

 

CloudWeavers:

This is what will run in many VMs, but I don’t wanna break layering, so my underlying hypervisor and solution should not be the same as in the VMs. I am not yet sure if it’s the right decision.

This would be the prime KVM choice since they already deliver a well-tuned configuration.

What worries me is that, while MooseFS’ FUSE client scales well enough on a single hypervisor node, it would end up with a lot of additional context switching / thrashing if I use it on the main node and in the clients. There might be smarter ways around this, e.g. by having a fat global pool in the “layer 1” hypervisor and using that from the layers above, too. More probably it’d turn into a large disaster :)

 

LXC:

Pointless; it’s not a hypervisor, and one single kernel instance can’t successfully pretend to be a bunch of OSDs and clients :)

 

Plain Xen:

This is what I already have and went with, especially to make use of tmem and run the Ceph labs as paravirt domUs. This way I know nothing will get in the way performance-wise.

There’s one thing you should know though, comparing Xen vs. ESXi or a licensed VMWare:

Xen’s power management is broken, broken, broken:

  • Newer deep-idle CPU states are all unsupported
  • The utility to manage CPU power management is broken as well; since 4.3 nothing works any more.
  • Even if you free + shut down a core from dom0, it will not be put to sleep

You can definitely tell from the power draw and fan speed that Xen, even when idle, consumes more power than an idle Linux kernel would. Spinning up a PV domU has no impact; spinning up an HVM one brings a noticeable increase in fan whoosh.

ESXi is far better integrated, so I’m expecting something like 100 Euro (personal, unfounded estimate) per year of additional energy wasted compared to VMWare.
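As a rough back-of-the-envelope check of that number (the wattage delta and the electricity price below are purely illustrative assumptions of mine, not measurements):

# Back-of-the-envelope: what ~100 Euro/year of wasted power corresponds to.
# The extra idle draw and the kWh price are illustrative guesses, not measurements.
extra_watts = 45           # assumed extra idle draw of Xen vs. a well-integrated ESXi
price_per_kwh = 0.26       # assumed price in EUR per kWh
hours_per_year = 24 * 365

kwh_per_year = extra_watts * hours_per_year / 1000.0
cost = kwh_per_year * price_per_kwh
print("%.0f kWh/year ~ %.0f EUR/year" % (kwh_per_year, cost))
# -> roughly 394 kWh/year ~ 102 EUR/year, so the estimate is at least plausible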

My choice of Xen mostly comes down to:

  • the bleeding edge features like tmem
  • the really crazy stuff like vTPM and whatever of the cool features ain’t broken at any given time.
  • leverage any storage trick I want and have available in a (thanks to Alpine Linux) very recent Linux kernel
  • put in place ZFS, maybe in a dedicated driver domain
  • also be able to use MooseFS and last, but most interesting
  • all the things that never work on other hypervisors – CPU hotplug, dynamic ram changes…
  • storage domUs!!!!!

 

I think in a lab running 20-30 loaded VMs it will be crucial to optimize the memory subsystem.

The same goes for having the least possible CPU overhead; under load this will help.

Last, being able to use different storage techs concurrently means I can choose different levels of availability and speed – albeit not _having to_, since there’s a large SSD monster underneath it.

I’m also quite sure the disks will switch from Raid10 to Raid5. They just won’t see any random IO any more.

The “Raid5 is not OK” Disclaimer

Oh, and yes, just to mention it: I’m aware I’m running green drives behind a controller. I know about Raid5 rebuild times (actually, they’re much lower on HW raid – about 30% of software raid’s) and the thing is…

If I see disk dropouts (yet to be seen), I’ll replace the dumb thing ASAP. It makes me cringe to read about people considering this a raid controller issue. If the damn disk can’t read a block for so long that the controller drops it out… then I’m glad I have that controller and that it did the right thing.

Such (block-errored) disks are nice as media in secondary NAS storage or as doorstops, but not for a raid. Maybe I just got extremely lucky in having no media errors at all off them? Definitely not what you’d see in a dedicated server at a mass hoster.

I’ve also patched my Check_MK SMART plugin to track the SMART stats of the raid PDisks, so anything SMART notices, I’ll be immediately aware of. Why the green disks in the first place? Well – power and noise benefits are huge. If I had some more space I’d consider a Raid6 of 8 of them, but not until I move to a bigger place.
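About that SMART tracking: for illustration, here’s a minimal sketch of the idea as a Check_MK local check (this is not my actual patched plugin; device path, PDisk IDs and attributes are assumptions for my box, and it relies on smartctl’s megaraid passthrough).

#!/usr/bin/env python
# Sketch of a Check_MK local check reading SMART attributes of the raid
# PDisks via smartctl's megaraid passthrough. NOT my actual patched plugin;
# device path, pdisk IDs and watched attributes are assumptions.
import subprocess

DEVICE = "/dev/sda"                    # block device the controller exposes
PDISKS = range(0, 4)                   # physical disk IDs behind the LSI
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

for pd in PDISKS:
    out = subprocess.check_output(
        ["smartctl", "-A", "-d", "megaraid,%d" % pd, DEVICE],
        universal_newlines=True)
    values = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])          # RAW_VALUE column
    state = 0 if all(v == 0 for v in values.values()) else 1
    perf = "|".join("%s=%d" % kv for kv in sorted(values.items())) or "-"
    detail = ", ".join("%s=%d" % kv for kv in sorted(values.items())) or "no data"
    # Check_MK local check line: <state> <service name> <perfdata> <detail>
    print("%d SMART_pd%d %s %s" % (state, pd, perf, detail))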

 

Coming up next:

A colleague offered me some company when setting up a final storage layout.

We built a dedicated storage domU with a PCI-passthrough’ed MegaRaid controller and ZFS. The install had a little issue…

This is what the next posts will be about, one describing how to build a storage domU.

Also, what pitfalls to expect, and then a focus on losing data (sigh) and getting it back.

I’ll close with some lessons learned. :)

Part two: Storage migration, disaster recovery and friends



 Go and find me a new host. Keep some money for foods.

So, in March and April I set out to build a *home* server that could handle a Ceph lab and would behave mostly like real hardware. That equates to disks being slow, SSDs being fast, and RAM being, well, actual RAM. Writing to two disks should ideally also not immediately turn into an IO blender just because they reside on one (uncached) spindle.

I think overall I spent 30 hours on Ebay and in shops to find good hardware at a cheap price.

 

This is what I gathered:

  • Xeon 2680V2 CPU (some ES model) with 8 instead of 10 cores but same 25MB of cache. It’s also overclockable, should I ever not resist that
  • Supermicro  X9SRL-F mainboard. There are better models with SAS and i350 NICs but I wanted to be a little more price-conservative there
  • 8x8GB DDR3 Ram which I recycled from other servers
  • 5x Hitachi SSD400M SSDs – serious business, enterprise SAS SSDs.
  • The old LSI 9260 controller
  • The old WD green disks

The other SSD option had been Samsung SM843T but their seller didn’t want to give out a receipt. I’m really happy I opted for “legit” and ended up with a better deal just a week later:

The Hitachis are like the big brother of the Intel DC S3700 SSD we all love. I had been looking for those on the cheap for about half a year and then got lucky. At 400GB capacity each it meant I could make good use of VM cloning etc. and generally never go back to moving VMs from one pool to another for space.

 

I had (and still have) a lot of trouble with the power supply. These Intel CPUs draw very little power at idle, even in the first stage of the boot. So the PSU, while on the Intel HCL, would actually turn off after half a second when very few components were installed. A hell of a bug to understand, since you normally remove components to trace issues.

Why did I do that? Oh, because the Supermicro IPMI gave errors about some memory module, which was OK but not fully supported. Supermicro is just too cheap to have good IPMI code.

Meh.

Some benchmarking was done using 4 (!) of the SSDs, and the results were incredible.

Using my LSI tuning script I was able to hit sustained 1.8GB/s writes and sustained 2.2GB/s reads.

After some more thinking I decided to check out Raid5, which (thanks to the controller using parity to calculate every 4th(?) block) still gave 1.8GB/s reads and 1.2GB/s writes.

Completely crazy performance.
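A quick sanity check on that Raid5 read figure (a back-of-the-envelope sketch; the assumption that streaming reads simply skip the rotating parity blocks is my own simplification, not something the controller documents):

# Rough sanity check of the Raid5 read number. My own simplification:
# streaming reads skip the rotating parity blocks, so roughly (n-1)/n of
# the Raid0 rate should survive.
n_disks = 4
raid0_read = 2.2                                  # GB/s measured with 4 SSDs in Raid0
raid5_estimate = raid0_read * (n_disks - 1) / n_disks
print("expected ~%.2f GB/s vs. ~1.8 GB/s measured" % raid5_estimate)
# -> expected ~1.65 GB/s, same ballpark as the 1.8 GB/s observed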

To get the full Raid5 speed I had to turn on Adaptive Read Ahead. Otherwise it was around 500MB/s, aka a single SSD’s read speed.

One problem that stuck around was that the controller would not – and still will not – enable the SSDs’ write cache, no matter what you tell it!

This is a huge issue considering each of those SSDs has 512MB(ytes) of well-protected cache.

The SSD is on LSI’s HCL for this very controller, so this is a bit of a bugger. I’ll get back to this in a later post, since by now I *have* found something fishy in the controller’s output that might be the cause.
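For reference, this is roughly the sort of thing my tuning script pokes at (a sketch from memory, not the actual script; the MegaCli install path is an assumption, and the flags should be double-checked against MegaCli -h before use):

#!/usr/bin/env python
# Roughly what the LSI tuning pass does (sketch from memory, not my actual
# script; verify flags against "MegaCli -h" before trusting them).
import subprocess

MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"   # install path is an assumption

commands = [
    # Adaptive Read Ahead on all logical drives - needed for full Raid5 reads
    [MEGACLI, "-LDSetProp", "ADRA", "-LAll", "-aAll"],
    # Controller write-back cache
    [MEGACLI, "-LDSetProp", "WB", "-LAll", "-aAll"],
    # Try to enable the drives' own write cache - the setting the controller
    # keeps refusing to honour for these SSDs
    [MEGACLI, "-LDSetProp", "-EnDskCache", "-LAll", "-aAll"],
    # Show what actually got applied
    [MEGACLI, "-LDGetProp", "-DskCache", "-LAll", "-aAll"],
]

for cmd in commands:
    print(">", " ".join(cmd))
    subprocess.call(cmd)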

Nonetheless: especially in a Raid5 scenario this will have a lot of impact on write latency and IOPS.

Oh, generally: this SSD model and latency? not a large concern :)

 

Part one: Storage migration, disaster recovery and friends


This is the first post of a series describing recent changes I did, some data loss, recovering from it and evaluating damage.

 

Starting point.

I am building a new Xen Host for my home lab. It was supposed to handle one or two full Ceph labs at high load. The old machine just couldn’t do that.

 

What I had was a Core2 Q6600 quad-core CPU on an Intel S3210 board (IPMI, yay). It had 8GB of RAM, an IBM M5015 raid controller and dual NICs. For storage I had a Raid10 over 4x2TB WD Green drives, fronted by a Raid0 flashcache device built from two Samsung 830's. Due to the old chipset the SSDs were limited to somewhere around 730MB/s read/write speed.

The main problem was the lack of CPU support (nested paging etc.) for advanced or bleeding-edge Xen features:

  • Memory overcommit using XenPaging only works if you have a more recent CPU than mine. (Of course this defeats the point since a more recent Xeon can handle enough RAM in the first place. But still)
  • The second thing was that PVH mode for FreeBSD needed a more recent CPU, and last,
  • Nested virt with Xen is getting somewhere, which would be interesting for running ESXi or many CloudWeavers instances w/o performance impact

So, I couldn’t have many nice things!

Also I knew the consumer SSDs had too much latency for a highspeed cache.

For Ceph there was the added requirement of handling the Ceph Journals (SSD) for multiple OSDs and not exposing bottlenecks and IO variances from using the same SSD a dozen times.

 

I’m unhappy to replace the server when it has so far never really gone above 2-3% average CPU – but since I want to do A LOT more with Ceph and CloudWeavers, it was time to take a step forward. I spent some time calculating how big the step could be and found that I would have to settle somewhere around ~1600 Euro for everything.

Cfengine training


What’s coming

 

You should have seen a more interesting post here today.

From tomorrow till Friday I’ll be going to a cfengine3 class.

I’ve been so excited about this I’ve been counting down the days and such…

Unfortunately thanks to the OpenSSL nightmare of today I don’t even have time to think about tomorrow.

 

new tools

Anyway, by next week I’ll have working knowledge of both Ansible and cfengine3.

This is what I consider a great toolset, or, as I described it to a friend, “having an excellent hammer for when I need a hammer” and also having something to build whole cities with for when I need that.

Talking of cities:

One of my favorite books is “The City and the Stars” by Arthur C. Clarke, which takes place in a city enduring aeons.

This is kinda what I’d love my servers to do, too. I think a good overall system should be able to keep running and running and running. It should be weathering disk failures, updates and power failures.

I think this does not just work by giving it “immutability”, but by teaching it how to serve its actual purpose…

 

cfengine

Cfengine, to me, feels closest to that goal.

(Notably, in that story the only normal-thinking guy in that city is a rare occurrence and really wants to get out)

Sysadmin to manager translation guide


I just wrote this for fun, no liabilities taken! :)

Well, this is interesting:

  • “Something is definitely broken, but I don’t expect persistent data loss. Someone made a highly stupid bug”
  • “I think this is gonna break once anyone touches it”
  • “I think this is gonna break in the next 24 hours”

Well, this is weird:

  • “This should never have happened. Something fucked up big time.”
  • “There might be logical data corruption.”
  • “I might soon tell you we’re doomed.”

This is not good:

  • “You lost service and/or data”

(notice “well” indicates undetermined data loss)

I’ll need to have a look at that:

“This is all broken and was set up by someone who didn’t bother to think. We’ll need to take it apart just to find out what was broken by setup and what recently broke so you called me. It’s better if you won’t hear what I have to say about this setup.”

How do you do backups?

  • “What you’re asking might cause data loss”.
  • “I don’t yet trust you to do things right”
  • “How many layers of safety do we have?”

By asking about the “how” you get a chance to make up excuses, or, if you can give details, we’ll have a good chance of getting out of this safely. I like “safely”.

What date is your last backup?

“You have lost data; I’m planning a strategy for recovery, and if, incidentally, you don’t now tell me your backups are broken too, then we can proceed quite successfully”

I’ll need to look this up

  • “you threw something completely new at me”
  • “you designed things so creatively that it’ll need 1-2 hours of research and ideally a rebuild in lab to make sure there IS a workable path out of this. No sane person has a setup like this.”
  • “Last time someone did such a crazy thing I managed to fix it, but you need to go away right now because if you see how other people ended up in that situation, you’ll be depressed”

Could someone fetch the green book please?

“Your VxVM volumes are broken because you never properly configured anything. We need the best possible documentation before we even start typing.”

Cut Ubuntu login CPU usage by 90%


I was just testing check_by_ssh at scale, running around 5000 ssh-based checks against the desktop system here.

One thing that puzzled me was that after adding passive mode, load actually went UP instead of down.

I saw a load of up to 11 after this change just to run check_dummy.

You could see it wasn’t accounted to any of the long running processes except polkitd, so the conclusion was that this would be related to some desktop bullshit written by the Ubuntu devs.

After some research: most of this comes from dbus and policykit spinning up useless desktop sessions for the ssh-based login. Since today’s distros have a lot of things tied into dbus, I couldn’t do much about this.

The real solution was found in a Stack Exchange post.

I deleted all the crap in /etc/update-motd.d/

Now the system load is down to under 1 most of the time.
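If deleting feels too drastic, the same effect can be had by just stripping the execute bit from those scripts, since pam_motd only runs what is executable. A small sketch (run as root; the directory is the stock Ubuntu one):

#!/usr/bin/env python
# Less destructive variant of "delete all the crap": strip the execute bit
# from the update-motd scripts so pam_motd has nothing left to run.
import os
import stat

MOTD_DIR = "/etc/update-motd.d"

for name in sorted(os.listdir(MOTD_DIR)):
    path = os.path.join(MOTD_DIR, name)
    if not os.path.isfile(path):
        continue
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    print("disabled", path)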

I don’t wanna think about how many kWh those useless scripts waste on a planet-wide scale.

People, PLEASE don’t use Ubuntu if you don’t have to.

LOPSA Mentorship & Monitoring


I’m THIS excited!

Mozilla

Recently someone asked on the lopsa-mentorship lists for some help with improving the monitoring for the community project he works for.

The one whose logo above _everyone_ knows :)

I offered to help since, well, monitoring!

Now I’m waiting to get in touch and then answer / guide him with any monitoring issues he finds.

Waiting. Excitedly. I already prepped a page of questions. Can’t wait. So excited.

I hope we can settle on Check_MK instead of anything outdated, but we’ll see. Not gonna push something on him, there are more interesting questions than what software to use.

e.g. identifying the actual services provided, seeing their dependencies (e.g. if a build of this piece fails today, there’s no new version next week), and since I’ll not be the person doing the work, it’ll be a much bigger challenge:

Finding the essence of why and how to monitor what.

 

About LOPSA Mentorship:

The League of Professional System Administrators (LOPSA) has a mentorship program, where beginning sysadmins, or those starting into a new topic, can ask for help. This dates back to when things were still called the System Administrators Guild (SAGE).

I remember I wrote there looking for help back in 2001/2002 when I got my first “serious” sysadmin job. I checked the options “hundreds of servers”, “production”, “lack of prior experience” and something like “HELP!”. No one replied.

I joined the mentorship program to help people not have this happen again.

 

About LOPSA:

LOPSA itself is the largest standing organization of system administrators.

It offers exchange of ideas and practices. This is extremely helpful for professional sysadmins, since we normally don’t have anyone outside of our current gig to compare our ideas with. And normally we tackle more complex tasks than most “DevOps” scenarios cover, so looking out on the internet will also just send you crying. LOPSA fills the gap, getting you in touch with more experienced and fresh sysadmins in an informal way.
Beyond that it also guides by setting some rules i.e. with a Code of Ethics.

The latter I’ve translated to German – so I’m quite bound by the code of ethics. :)

 

 

Nagios MooseFS Checks #2


For the curious,  here you see the bingo point where MooseFS is done detecting undergoal chunks and you can watch the rebalancing happening.

 

WARN – 9287 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9294 chunks of goal 3 lack replicas
WARN – 9295 chunks of goal 3 lack replicas
WARN – 9291 chunks of goal 3 lack replicas
WARN – 9287 chunks of goal 3 lack replicas
WARN – 9283 chunks of goal 3 lack replicas
WARN – 9279 chunks of goal 3 lack replicas
WARN – 9273 chunks of goal 3 lack replicas
WARN – 9262 chunks of goal 3 lack replicas
WARN – 9254 chunks of goal 3 lack replicas

 

As you can see the number of undergoal chunks is dropping by the minute.
In Check_MK you could even use the internal counter functions to give an ETA for completion of the resync.
And that’s where real storage administration starts… :)
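As a standalone illustration of that ETA idea (this is not the Check_MK counter API, just the arithmetic it would do for you): take two samples of the undergoal count, derive a rate, and extrapolate to zero.

# Standalone sketch of the resync ETA idea (not the Check_MK counter API,
# only the arithmetic it would do for you from two samples).
def resync_eta(prev_count, prev_time, cur_count, cur_time):
    """Return estimated seconds until 0 undergoal chunks, or None."""
    rate = (prev_count - cur_count) / float(cur_time - prev_time)   # chunks/s
    if rate <= 0:
        return None          # not shrinking (yet), no ETA possible
    return cur_count / rate

# Example with two of the samples above, assuming they were one minute apart:
print(resync_eta(9291, 0, 9254, 60))   # -> ~15000 seconds, roughly 4 hours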

MooseFS Nagios Checks


MooseFS is a really robust filesystem, yet this shouldn’t be an excuse for bad docs and no monitoring.

So let’s see:

 

I just marked a disk on a chunkserver for removal by prefixing the path in /etc/mfs/mfshdd.cfg with an asterisk (*).  Next, I started running the check in a loop, and after seeing the initial “OK” state, I proceeded with /etc/init.d/mfs-chunkserver restart. Now the cluster’s mfsmaster finds out about the pending removal:

This is what the output looks like after a moment:

dhcp100:moosefs floh$ while true ; do ./nagios-moosefs-replicas.py ; sleep 5 ; done

OK – No errors
WARN – 11587 chunks of goal 3 lack replicas
WARN – 10 chunks of goal 3 lack replicas
WARN – 40 chunks of goal 3 lack replicas
WARN – 70 chunks of goal 3 lack replicas
WARN – 90 chunks of goal 3 lack replicas

As you can see, the number of undergoal chunks is growing – this is because we’re still in the first scan loop of the mfsmaster. The loop time is usually 300 or more seconds, and the number of chunks checked during one loop is usually also throttled, e.g. at 10000 (which, at 64MB per chunk, equals 640GB).

In my tiny setup this means after 300s I should see the final number – but also during this time there will be some re-balancing to free the marked-for-removal chunkserver. I already wish I’d be outputting perfdata with the check for some fun graphs.
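The check itself is nothing fancy; the skeleton below shows the general shape, including the perfdata I wish it already emitted (a sketch, not the real nagios-moosefs-replicas.py: where the undergoal count comes from is left as a stub, only the Nagios exit-code and perfdata conventions are the real thing).

#!/usr/bin/env python
# Skeleton of a Nagios-style undergoal check (a sketch, not the real
# nagios-moosefs-replicas.py; get_undergoal_chunks() is a placeholder for
# however you query your mfsmaster).
import sys

WARN = 1        # warn on any undergoal chunk
CRIT = 10000    # critical at roughly one scan loop's worth of undergoal chunks

def get_undergoal_chunks():
    """Placeholder: return the current number of undergoal chunks."""
    raise NotImplementedError("query your mfsmaster here")

def main():
    try:
        undergoal = get_undergoal_chunks()
    except Exception as err:
        print("UNKNOWN - %s" % err)
        return 3
    perfdata = "undergoal=%d;%d;%d" % (undergoal, WARN, CRIT)
    if undergoal >= CRIT:
        print("CRIT - %d chunks lack replicas | %s" % (undergoal, perfdata))
        return 2
    if undergoal >= WARN:
        print("WARN - %d chunks lack replicas | %s" % (undergoal, perfdata))
        return 1
    print("OK - No errors | %s" % perfdata)
    return 0

if __name__ == "__main__":
    sys.exit(main())

And, as the lesson below says, the service’s check interval should match the loop time configured in mfsmaster.cfg.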

 

Lesson for you?

The interval with my check should be equal to the loop time configured in mfsmaster.cfg.

 

Some nice person from TU Graz also pointed me at a forked repo of the mfs Python bindings, and there are already some more Nagios checks in there:

 

Make sure to check out

https://github.com/richarson/python-moosefs/blob/master/check-mfs.py

I’ll also test-ride this, but probably turn it into real documentation in my wiki at Adminspace – Check_MK and Nagios

Mostly, I’m pondering how to really set up a nice storage farm based on MooseFS at home, so I’m totally distracted from just tuning this check :)