About Linux IO drain #2

I’ll start writing down how I think Linux memory handling can be set up so it doesn’t go wrong. This is too much for one person to test in their evenings, but starting the checklist for it is something one guy can very well do 🙂

The finished article will probably end up on my adminspace wiki.

Our ingredients:

Page out / buffer dirty rate

cgroups memory limits:

Establish a strict separation between resources intended for the OS and resources for the applications.

Check impact on buffer cache size

Form one cgroup for the OS

Form one more cgroup for the Apps

Document how the Red Hat tool for persisting cgroup state across reboots is used

Make this one reusable, so that if you add another app to the server, you copy the cgroup settings and effectively split the resources in half by default.
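As a rough sketch of what that split could look like with the libcgroup tools (cgroups v1; the group names and limits below are placeholders, assuming a 16GB box, not a tested recipe):

```shell
# Hypothetical sketch: one memory cgroup for the OS, one for the apps.
# Sizes are assumptions -- adjust for your RAM and workload.
cgcreate -g memory:/os
cgcreate -g memory:/apps
cgset -r memory.limit_in_bytes=4G  memory:/os
cgset -r memory.limit_in_bytes=12G memory:/apps

# dump the current hierarchy so cgconfig can restore it on boot:
cgsnapshot -s > /etc/cgconfig.conf
```

Duplicating the /apps group for a second application is then just another cgcreate/cgset pair with the limits halved.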

IO scheduler:

Number of queues

The kernel scheduler only has 64 queues. Someone *really* needs to test how behaviour changes from

  1. 8 concurrent IOs to 4 disks to
  2. 256 concurrent IOs to 100 disks
I suspect many things trip up there – at least from what I’ve seen.


Imho, people who report disk issues get more or less random recommendations to use cfq or deadline, but they’re rarely told to actually *use* cfq(2) to prioritize IO writers up or down.
The hard point with cfq is that it can’t really stop multiple writers in the same class from killing each other (it lacks a bucket mechanism).
Even if both VMs run in the “idle” QoS class, VM A running a dd will always impose horrible lag on VM B that is just serving web traffic. CFQ can save your server from abusive processes, but it cannot protect processes of the same kind / QoS level from each other.
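For completeness, this is roughly what per-process prioritization looks like while cfq is active, using the standard util-linux ionice (the commands are illustrative examples, not a recommendation):

```shell
# idle class (-c3): only gets disk time when nobody else wants it
ionice -c3 dd if=/dev/zero of=/tmp/dump bs=1M count=1024

# best-effort class (-c2) with explicit priority: 0 = highest .. 7 = lowest
ionice -c2 -n7 tar cf /backup/home.tar /home
```

Note these priorities only mean anything under cfq; deadline and noop ignore them, which is exactly the trade-off described below.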
So when you actually turn to deadline, you’ll find it can’t do priorities at all, meaning you suddenly lose the “save the server” feature; at least, in theory, every process on the system will feel the same pain.
Then, according to the docs, for high-end arrays you’ll want to use noop, which is fine from a scheduler point of view, but of course means you no longer have any classification at all – so a fast reader/writer, or any disk with insanely high queue and service times, can now kill your server.
The usual “deadline” vs. “noop” vs. “cfq” benchmarks are concerned with desktop PCs using a single SSD.
I think it’s time to benchmark them against BBU-backed RAID HBAs with 512MB of cache first – but not only those.
The real challenge for answering if NOOP is beneficial is:
real storage with concurrent writers (that means a dozen or more Linux servers, smarty) pumping into 512GB of cache – something that, to my knowledge, has never been looked into, although it’s far more important when it comes to scheduler tuning. (And no, sorry, scale-out toys don’t matter here; any old DMX-3 will do just fine.)
If deadline stays within the usual <1% of noop in this scenario, there might be a winner.
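If someone wants to run that benchmark, switching the elevator at runtime is the easy part (sda below is just an example device):

```shell
# show the available schedulers; the active one is in brackets
cat /sys/block/sda/queue/scheduler

# switch this device to deadline at runtime
echo deadline > /sys/block/sda/queue/scheduler

# to make it the default across boots, add elevator=deadline
# to the kernel command line
```

The sysfs switch takes effect immediately, so you can flip schedulers mid-benchmark between runs.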
’nuff for the first few thoughts

Food for reading:


5 thoughts on “About Linux IO drain #2”

  1. I found deadline works a bit better for a server with 8 cores and around 10 disks (a few RAID1, one RAID10) on dom0. I still get some md hangs in D state, but not as frequently as with cfq.

    • I’m surprised you’re seeing that many issues with just 10 disks – and if there’s a difference in the frequency of D states, wouldn’t that mean there are some worse deadlocks in the scheduler that I didn’t even think about?

  2. I really dunno what could cause that – I haven’t dug (and lack the skills to dig) that deep into kernel stuff. Maybe it’s overloaded? Running about 30 VPS, 15-min load avg ~1.3. iostat -x shows less than 10% utilization for 5 disks, 15–20% for 3 disks, and 26% and 36% for the remaining two. Uptime ~23 days (failed a few weeks ago with md_sync in D state).

    One interesting thing is that in the RAID1 (SATA, 2x 640GB, 7200rpm), one disk shows 6% util and the other 36% o_O; in the RAID10 (SATA, 4x 320GB, 7200rpm), %util is 26%, 14%, 7%, 8%.

    On the SAS 15K and SAS 10K RAID1s, both disks’ %util are equal.

    • suggestion: stop using iostat, go for sar -d

      second suggestion: look up how to do continuous sar recording via cron, and let it run for at least the “mean time between crashes”, so you can see whether there was a load spike or not.

      (check_mk doesn’t yet track IO latencies, but we already have a good idea that it will be complicated to get right! 🙂)
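      A minimal sketch of that continuous recording, assuming a stock sysstat install (the sa1 path below is the Red Hat default; Debian puts it elsewhere):

```shell
# /etc/cron.d/sysstat -- sample system and disk activity every 10 minutes:
#   */10 * * * * root /usr/lib/sa/sa1 1 1

# later, read back the per-disk history for the current day:
sar -d -f /var/log/sa/sa$(date +%d)
```

      With that in place you can line up the recorded await times against the timestamp of the next md hang.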

    • Just figured out what you really need to try: raising the dom0 weight.
      (But do it on a test system with the same Xen+kernel version first. There is a bug… lol. sched-credit works fine, sched-credit2 panics.)
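      For reference, raising the dom0 weight with the credit scheduler looks like this (xm syntax; 512 versus the default 256 is just an example value, tune to taste):

```shell
# double dom0's share of CPU time under sched-credit
xm sched-credit -d Domain-0 -w 512

# verify the new weight and cap
xm sched-credit -d Domain-0
```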
