Squeezing the Raidcontroller

This is a good day for benchmarking galore!

I’m trying to collect performance data for my controllers so that I can fine-tune the monitoring based on real measures, not educated guessing. I want to know the actual IOPS and MB per second limits and set the levels accordingly.

Today’s victim is a

“Intel(R) RAID Controller SROMBSAS18E”

as found in the SR1550 servers on the active SAS backplane.

It is also very well known as the Dell PERC5…

With Intel servers you need add-ons for the 512MB RAM and the BBU. These came included with my server.

Right now we’re only doing read-only tests here. For one, the BBU is utterly discharged.

Test setup:

3x73GB 15K SAS drives in Raid0 config (IO Policy WB, Cached)

4x60GB OCZ Vertex2 in Raid10 config (IO Policy Direct)

Linux Settings: cfq scheduler, Readahead is set to 1MB for both Luns.
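The settings above live in sysfs, so they can be applied per block device. Here is a minimal sketch, not taken from the original setup: `set_queue` and the `SYSFS_ROOT` override are hypothetical names, and the device name `sdb` is only an example. The sysfs files `queue/scheduler` and `queue/read_ahead_kb` are the standard kernel interface.

```shell
# Hypothetical helper, not from the original post.
# Usage: set_queue DEVICE SCHEDULER READAHEAD_KB
# SYSFS_ROOT is an assumption for dry runs; it defaults to the real /sys.
set_queue() {
    queue_dir="${SYSFS_ROOT:-/sys}/block/$1/queue"
    # The kernel accepts the scheduler name as plain text.
    echo "$2" > "$queue_dir/scheduler" || return 1
    # read_ahead_kb takes kilobytes, so 1MB is 1024.
    echo "$3" > "$queue_dir/read_ahead_kb" || return 1
}
```

Run as root, `set_queue sdb cfq 1024` would select cfq and a 1MB readahead for `sdb`. The readahead part is equivalent to `blockdev --setra 2048 /dev/sdb`, since `--setra` counts 512-byte sectors.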

Test scenario: Pull data off the disks as fast as we can.

Write down the numbers from SAR afterwards.

[root@localhost ~]# dd if=/dev/sdc of=/dev/null bs=1024k count=10240 &
dd if=/dev/sdb of=/dev/null bs=1024k count=10240
[1] 4198
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 37.0208 seconds, 290 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 37.0224 seconds, 290 MB/s

Average:          sdb   1677.58 553577.58      0.00    329.99      1.32      0.79      0.46     77.04
Average:          sdc   1867.27 578368.77      0.00    309.74      1.44      0.77      0.45     83.96
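The SAR rows are easier to compare with the dd figures once they are converted to MB/s. The sketch below assumes these are `sar -d` "Average:" rows in the older sysstat layout, where field 2 is the device and field 4 is `rd_sec/s` in 512-byte sectors. The post does not show the header, so that column mapping is an assumption, and `sar_read_mbs` is a hypothetical helper name.

```shell
# Assumption: input lines look like the "Average:" rows of `sar -d`,
# i.e. field 2 is the device and field 4 is rd_sec/s in 512-byte sectors.
sar_read_mbs() {
    awk '$1 == "Average:" { printf "%s %.1f\n", $2, $4 * 512 / 1000000 }'
}
```

Fed the two rows above, this prints `sdb 283.4` and `sdc 296.1`, which together come to roughly 580MB/s across both LUNs.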



Other fun things to test now…

  • Switch to SATA SSD Raid0 instead of Raid10
  • Look at IO Overhead in Xen domU*
  • See how much faster the SR1625 will perform 🙂
  • Update the outdated firmware 🙂
  • Switch to deadline scheduler
* Already tried that one; still trying to really understand the results. Most important: enable readahead in dom0. It helps heaps; if I remember correctly it bumped me from 300 to 400MB/s.

Practical results, too:

If the controller peaks out at 580MB/s, I can now plan the number of 10k/15k/ssd…

6 thoughts on “Squeezing the Raidcontroller”

  1. The problem with your test is that it tests large sequential IO workload. This is a type of workload you will hardly ever find, and it has very little to do with real workload. Databases use small random IO (4K or 8K), and filesystem uses something with a mix of, I would guess (educated guess, but no more), 60% sequential and 40% random, so your fully sequential IO has nothing to do with real life. If you really want to stress the controller and disks, and not only the disks’ throughput, which you now test, I would recommend you check Oracle’s tool ‘Orion’, which can allow easy and meaningful tests. Try it instead. You can graph it later, and see how your system behaves under real workloads.
    And I guess that you don’t use four SSDs for regular filesystem access…


    • Well, suppose this is *exactly* what I wanted to test?

      I will set up monitoring for IO bandwidth per LUN, per Controller, for IOPS and for the service time. And this will be based on the real limits.

      The idea is to be able to make that data useful. Instead of just making “data gathering pr0n” like munin or collectd, I’ll reverse the data to show available capacity. That way I can react if a LUN is hitting 80% of the IOPS at which it maxes out.
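A minimal sketch of that "available capacity" check, assuming the measured maximum for a LUN is already known from a benchmark run. `capacity_check` is a hypothetical helper, and the 80% default comes from the paragraph above; the numbers in the usage note are placeholders, not measurements.

```shell
# Hypothetical helper: report how much of a measured limit is in use.
# Usage: capacity_check CURRENT MAX [WARN_PERCENT]
capacity_check() {
    awk -v cur="$1" -v max="$2" -v warn="${3:-80}" 'BEGIN {
        used = cur / max * 100
        # Compare without dividing so an exact 80% hit is not lost to rounding.
        state = (cur * 100 >= warn * max) ? "WARN" : "OK"
        printf "%s used=%.0f%% free=%.0f%%\n", state, used, 100 - used
    }'
}
```

For example, `capacity_check 900 1800` prints `OK used=50% free=50%`, while `capacity_check 1600 2000` prints `WARN used=80% free=20%`. The same function works for IOPS or MB/s as long as both arguments use the same unit.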

      I need to gather complete data to keep track of IO overheads such as those caused by misalignment (when someone still uses DOS partitions).

      The actual workload is not heavily sequential, but also different from a normal database. On the server there will be somewhere around 20-30 virtual machines that will all run anything the user chooses to. E.g. with Oracle I know that this FS will get 8K writes and that one 512-byte ones. Can’t do that here; this means cutting back on assumptions and trying to react faster if something peaks out.

      The SSDs will not be for filesystem use, yup. They’ll be for testing the alignment stuff (a difference of many thousand IOPS) and seeing how far I can push the controller. Later on, only one per server remains. It’ll be used as “write around flash cache” to buffer hot zone read IOs.

      (It’s something I have not tested yet, but it’s heavily used at Facebook so I presume it somewhat works. On the other hand, one has to remember they don’t handle more critical data than “IMs” and “likes”, so I’ll surely not use it for anything other than read caching 🙂)

      I’ll add the Oracle IO tester to the list.
      Right now I think of something like:

      • 1 vm has pkgsrc bulk compiles (very inefficient and disk thrashing)
      • another one runs blogbench
      • another one only does sequential reads/writes
      • one that is only cpu heavy on 4-5 cores to put pressure on the xen credit scheduler

      The idea is to definitely produce very unfriendly IO mixes, and then establish a baseline that makes everyone get “acceptable” performance. Otherwise, due to the cfq scheduler’s crappiness, the sequential read/write will starve everything else.

      Phew. This got lengthy, but your comment was reasonable and I wanted to explain why I’m going to lengths with a normally not helpful benchmark.

  2. Acceptable, and also interesting 🙂
    You can alter CFQ’s settings. It is a complex and very wide-use capable scheduler.
    BTW – for a powerful RAID setup, I have found, in some cases, that using the ‘noop’ scheduler produced better IO and faster results. Assuming you have (and you do!) a RAID controller.

  3. Yes, both deadline and noop are definitely faster with “real” hardware.
    (much more so when looking at real arrays with >100GB cache 😉)

    Deadline works better in general, but I’m still a bit afraid to go with it, as I think it will not be enough to keep massive abusers at bay. Although I could switch the scheduler and apply policies on demand, and run with deadline as long as nothing goes haywire.

    Hey… a new idea. thanks!
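Switching on demand starts with knowing which scheduler is currently active. A minimal sketch, assuming the usual sysfs format where the kernel lists all schedulers and marks the active one with square brackets, e.g. `noop deadline [cfq]`. `active_scheduler` is a hypothetical helper and `sdb` is only an example device.

```shell
# Print the active scheduler from a sysfs "scheduler" file, where the kernel
# marks the active one with square brackets, e.g. "noop deadline [cfq]".
# Usage: active_scheduler /sys/block/sdb/queue/scheduler
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}
```

A policy script could compare this value with the scheduler it wants and only write to the `scheduler` file when they differ, for example falling back from deadline to cfq when a LUN starts to get starved.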

  4. I have had *worse* results with deadline, so I cannot recommend it. I believe it depends on the real IO workload. In my case (usually – Oracle for high-throughput systems) – noop worked remarkably better.


  5. What I’ve learned so far is that the Linux scheduler authors generally ignore what servers look like: what looks good with 1 or 2 disks will look entirely different once you have over 64 disks (there’s some limit, I don’t remember if it’s the number of kernel IO queues or buckets, something like that), and it’ll all look even worse once you’re running over 2 or 4 disk paths.

    The schedulers would need to be attached to the mpath devices only. Instead they sit on the member disks, which turns into competition IMHO.
