I’ll start writing down how I think Linux memory handling can be set up to not fuck up and go wrong. This is too much for one person to test in their evenings, but I think starting the checklist for this thing is something one guy can very well do 🙂
The article will best go to my adminspace wiki..
Our ingredients:
Page out / buffer dirty rate
cgroups memory limits:
Establish a strict separation between ressources intended for the OS and ressources for applications.
Check impact on buffer cache size
Form one cgroup for the OS
Form one more cgroup for the Apps
Document how the redhat tool for saving cgroup states over reboots is used
Make this one re-usable so if you add another App to the server you copy the cgroup settings and effictively break the ressources in half by default.
IO scheduler:
Number of queues
The kernel scheduler only has 64 queues. Someone *really* needs to test how behaviour changes from
- 8 concurrent IOs to 4 disks to
- 256 concurrent IOs to 100 disks
I feel many things are tripping there. At least from what I’ve seen.
Types:
Imho people will be randomly recommended to use cfq or deadline if they report disk issues, but they’re rarely recommended to actually *use* cfq(2) to priorize up/down IO writers.
The hard point with cfq is that it can’t really handle multiple writers in the same class killing each other (lacking a bucket mechanism).
Even if both VMs are run in the “idle” qos class, VM A running a DD will always impose horrible lag to VM B that is just web serving. CFQ can save your server from abusive processes, but not protect processes of the same kind / qos level.
So, when you really turn to deadline you’ll find it can’t do any priorities, meaning you suddenly lose the “save the server” feature. at least every process on the system will feel the same pain in theory.
Then, according to the docs, for highend arrays etc. you’ll want to use NOOP, which is fine from a scheduler point of view, but of course means now you don’t have any classification any more, so a fast reader / writer or any disk with insanely high queue & serve time will now kill your server.
The usual “deadline” vs. “noop” vs. “cfq” benchmarks are concerned with desktop PCs using a single SSD.
I think it is becoming to benchmark them against BBUed Raid HBAs with 512MB at first – but not only.
The real challenge for answering if NOOP is beneficial is:
real storage with concurrent writers (that means, a dozen or more linux servers, smarty) pumping into 512GB cache – something that to my knowledge has never been looked into, although it’s far more important when it comes to scheduler tuning. (no, sorry, scale out toys do not matter in this prospect, instead any old DMX-3 will do just fine)
If deadline handles with the usual <1% difference to NOOP in this scenario there might be a winner.
’nuff for the first few thoughts
Food for reading: