OVM 3.0.2 SDK


Yay,

Oracle has released the SDK templates for the OVM 3 releases.

Seems it wasn’t held up by closed-door politics after all – it was just a slow release.

This means a lot of good things, I can bump up project black magic to 3.0 and also continue the OpenVSwitch and FlashCache and DMVPN tests with a much more current platform. ❤

I’m very happy, hopefully 3.0.2 will be able to boot on my SuperMicro box. (3.0.1 had hung right at the “loading Xen.gz” part of the bootloader)

Xen hackathon lab(ptop)


Tomorrow is the Xen hackathon in Munich!!!!

My goal for this day is to find a stable way for monitoring Xen including the performance data. Ideally (and with some help) by the end of the day I’ll have added thorough documentation about the performance counters to the Xen wiki.

That would also aid in creating a MIB extension for Xen – it’d surely make me happy to see an SNMP agent based on the check_mk agent plugin 😉

Speaking of which – of course I’ll mostly be trying to get perfect monitoring via check_mk. Right now the agent only tracks basic up/down status and RAM usage. I want to add per-VM counters and make sure things don’t go all jumpy when dynamic memory allocation is used.
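As a starting point, a per-VM section could come from a tiny agent plugin like the one below – just a hedged sketch: the <<<xen_domains>>> section name is something I made up here and would need a matching check on the server side.

#!/bin/sh
# Sketch of a check_mk agent plugin (e.g. /usr/lib/check_mk_agent/plugins/xen):
# dump the per-domain counters that xm already knows about.
if which xm >/dev/null 2>&1; then
    echo '<<<xen_domains>>>'
    # columns: Name  ID  Mem(MiB)  VCPUs  State  Time(s)
    xm list | tail -n +2
fi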

I’ll carry along a few Xen hosts in my laptop to go through the various combinations of agent output.

This was a good time to use check_mk’s WATO to set up a nice menu – if I select “in my laptop” as a VM’s location, it’ll automatically use the ssh datasource to connect to the systems via lan2 (see the sketch after the host list below).

ovm3

    lan1:    type NAT, dhcp
    lan2:    type hostonly, 192.168.56.130

centos54

    lan1:    type NAT, dhcp
    lan2:    type hostonly, 192.168.56.154

alpine

    lan1:    type NAT, dhcp
    lan2:    type hostonly, 192.168.56.122
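The WATO rule mentioned above essentially just wraps an agent call over ssh. Stripped down to a one-liner (assuming key-based root logins on the hostonly addresses above), it boils down to this:

# what the ssh datasource effectively runs instead of querying the agent on TCP port 6556:
ssh -o ConnectTimeout=5 root@192.168.56.130 check_mk_agent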

Additionally, I hooked up two of my real Xen hosts, running Debian Squeeze and OVM 2.2.

Short of Xen 2 (just kidding) and NetBSD, this allows for any combination of the Xen utilities that could be found out in the wild…


And I’ll also take along some Infiniband gear as a giveaway in case any of the devs wants to tackle RDMA live migration.

New infiniband benchmarks – ib_rdma_bw overflow :)


Last night I ran some more benchmarks to verify the Infiniband links are stable and to check whether there is any negative impact when you add a 4x SDR (10 Gbit) node to the other 4x DDR (20 Gbit) nodes.

I was mostly looking at the RDMA bandwidth in connected mode, as this is what should apply to GlusterFS. I turned off the firewall as usual, though technically I think it doesn’t matter with RDMA.
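For reference, the test itself is just a server/client pair – I’m not listing the exact option names here since they vary between perftest versions, so check the tool’s usage output for the iteration and message-size switches:

# on the first node: start ib_rdma_bw as the listening side
ib_rdma_bw
# on the second node: point it at the first one and read off the bandwidth average
# (nodeA is a placeholder hostname)
ib_rdma_bw nodeA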

I noticed that when I raised the iterations in ib_rdma_bw to 200000, the displayed bandwidth average would drop from 2.8GB/s down to 40-50MB/s.
What had happened?
I decided to run multiple tests over all the connections (A to B, B to C, A to C, …) and found the error kept coming up once I ran a longer test than the default.

After that it was either a bug in ib_rdma_bw or my switch. I found it unlikely to be the switch: at those data rates an error should show up almost immediately, like within 20GB, not after a few hundred.

Turns out there was an overflow in ib_rdma_bw. Problem solved 🙂
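The numbers are plausible once you see how quickly the byte totals blow past what a 32-bit counter can hold – a rough illustration only (the 64KiB message size is my assumption, not the bug’s actual location):

# total bytes moved by 200000 iterations of 64KiB messages
echo $(( 200000 * 65536 ))                 # 13107200000, ~12.2 GiB
# what is left once a 32-bit counter silently wraps around
echo $(( (200000 * 65536) % (1 << 32) ))   # 4517265408, ~4.2 GiB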

That’s an older picture of the Cisco SFS 3504 switch/gateway:
Cisco SFS 3504 infiniband gateway
Right now there are three Infiniband cables going in and no gigabit cable coming out.
I disabled the gateway function until I know how to use Infiniband partitions correctly. They’ll be mapped onto Ethernet VLANs so that non-IB hosts can reach the IPoIB networks, too. But when I just enabled this without using VLANs etc., the IB hosts would see their own and foreign IPs twice – once via IB and once via Ethernet. A lot of chaos resulted 🙂

Note: GlusterFS builds perfectly on Oracle VM SDK


I already wrote a bit about GlusterFS, and after the chat with the very experienced guru from OrionVM.au I got a little more curious about real-world usage of Gluster again.

For running a few Xen hosts with failover, good performance and mirrored storage, I need the following:

  • Redundant storage for every piece of data
  • No storage unavailability if a single node goes down
  • Online expansion of storage
  • Locality of storage – instead of having separate “storage” and “compute” nodes, have some parts of the storage local to the Xen dom0s.
  • The storage layer should run on the dom0s but should also be able to run on other systems, i.e. dedicated storage boxes.


GlusterFS has one very unique feature: it can optionally run on the Infiniband Verbs layer. Put simply, this means data can be read from its disk on one node RIGHT INTO the RAM of the node that wants the data.
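With the gluster CLI that ships in 3.1, selecting the verbs transport is just a flag at volume-creation time. A hedged sketch – the host names and brick paths are placeholders:

# replicated volume over the RDMA/verbs transport (node1/node2 and the brick paths are made up)
gluster volume create gv0 replica 2 transport rdma node1:/export/brick0 node2:/export/brick0
gluster volume start gv0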

The data transfer rates on Infiniband when using IP over IB or SDP (Sockets Direct Protocol) and all of those options always look a bit sucky. Surely a QDR (40 Gbit raw) link will stomp 10GbE, but “only” by 1.5 times or so.

When using IB Verbs we’re looking at almost full line rate, so a 3.4-3.5 times increase over 10GbE.

I’m a poor guy using only SDR and DDR Infiniband, so I’ll be getting 15 Gbit/s at best. But that’s still real-world IO performance with super-low latency that 10GbE doesn’t even scratch. And hey, the port cost for me is like $70.
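Back-of-the-envelope, the raw 4x rates minus the 8b/10b encoding overhead give the ballpark for those numbers:

echo "SDR: $(( 10 * 8 / 10 )) Gbit/s of data"   # 8
echo "DDR: $(( 20 * 8 / 10 )) Gbit/s of data"   # 16
echo "QDR: $(( 40 * 8 / 10 )) Gbit/s of data"   # 32, vs. roughly 10 for 10GbE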

Who would complain about that? All that’s left is making it run on Oracle VM!

After some searching I found the SDK VM I had downloaded while training for the Oracle VM certification in June.

It’s called the “OVM Build template”, i.e. OVM_BUILD_TEMPLATE_2.2.1. After downloading it from the Oracle E-Delivery site, you can basically boot it and build your RPMs with very little effort.

Steps if you wanna follow me:

  • Download & unpack the image.
  • Copy it to an existing Xen host (doesn’t have to be Oracle VM, but that might be easier).
  • I experienced a dom0 crash after the loop0 device lost around 2000 IOs. That had never happened to me before, but I have a very small host at home with low dom0 memory. To avoid it, I copied the image into an LVM volume, avoiding use of /dev/loop altogether.
  • Use tightvnc to connect to the VM console via the dom0’s IP.
  • Set up temporary networking if needed.
  • wget the GlusterFS CentOS or RHEL SRPM from their site at http://download.gluster.com/pub/gluster/glusterfs/3.1/LATEST/RHEL/
  • Add the Oracle VM and OEL 5 public yum repos to the yum config using the following command – or read up at https://deranfangvomende.wordpress.com/2010/09/07/using-the-public-oraclevm-repo/

cd /etc/yum.repos.d && wget http://public-yum.oracle.com/public-yum-ovm2.repo http://public-yum.oracle.com/public-yum-el5.repo

  • Install the Infiniband verbs headers & stuff using

yum -y --enablerepo=el5_u5_base --enablerepo=ovm22_2.2.1_base install libibverbs-devel

  • rpm -i the SRPM from wherever you downloaded it to.
  • cd to /usr/src/redhat
  • Run the following and go for a coffee.

rpmbuild -bb SPECS/glusterfs.spec

And voilà, I got these:

[root@localhost redhat]# ls /usr/src/redhat/RPMS/i386
glusterfs-core-3.1.0-1.i386.rpm       glusterfs-fuse-3.1.0-1.i386.rpm
glusterfs-debuginfo-3.1.0-1.i386.rpm  glusterfs-rdma-3.1.0-1.i386.rpm
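If you want to take them for a spin right away, installation and a first mount are the usual drill – a hedged sketch where the server, volume name and mountpoint are placeholders (and the fuse mount obviously needs a volume created on the server side first):

# install the freshly built packages (debuginfo is optional)
rpm -Uvh /usr/src/redhat/RPMS/i386/glusterfs-*.rpm
# then mount a volume from one of the storage nodes (node1 and gv0 are placeholders)
mkdir -p /mnt/gluster
mount -t glusterfs node1:/gv0 /mnt/gluster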

Let me know if it works for you, and also let me know your benchmark results, especially if you also use the stripe+replication translators in a stack.

Lastly, a word of warning:

GlusterFS isn’t Ceph. Do NOT expect magic things to happen if you ever use different-sized disks – one of your storage pieces will run full before the others, with very little warning.

Also, from what I’ve read and tested, never expect Gluster failover to just work[tm]. Infiniband bonding is not the easiest thing to tackle, and the GlusterFS timeout handling is very, very tricky.

Instead of fiddling around on your own, if you want ultimate performance and stability make use of the professional services/consulting options at gluster.com.

(Because: anyone can do downtime. Thinking first is the hard part.)

Pieces of software: Xen Infiniband support


– Xen-IB project: IB support even in domU, many features
– work mostly completed in 2005/2006 outside of the normal tree
– presented at XenSummit 2006
– never committed back because some last cleanup was needed?
– users settled on just asking for RDMA support in failover – would probably reduce the 60-600ms failover gap down to <1ms with good design
– 10-40 times the GigE bandwidth and no IP overhead would mean another performance boost
– RDMA support has been the most mentioned item on the Xen4 wishlist
– went COMPLETELY ignored

The poster in this list post summed it up quite well – from a higher point of view, such a beautiful option got wasted:

“imagine a small motherboard with just a good IB interface and 100MB ethernet
(for managing, not SAN/LAN), no disks, in a blade chassis, running Xen….
wouldn’t that be just great?”

I think what we have here (in the missing IB support, not in the above quote) is a failure to see the big picture.

  • 40 Gbit/s QDR Infiniband – fully redundant, unlike Ethernet
  • Let me do the math: that’s 5GB/s IO rate…
  • No need for local storage
  • More space for RAM -> better cost efficiency
  • Move on to distributed storage -> more widespread use of grid/cloud setups

Well, and Xen started off as a grid computing project, right?

This is kinda what Cisco has now come up with, with their special mucho-RAM servers + VMware + Data Center Ethernet (DCE) bundles.

Just that we were there first. Better. Faster. Leaner. Cheaper.
And we made nothing of it.

We’ve got a history of not picking up on 3rd-party additions and letting down even the most important ones until they’re so outdated they won’t merge any more, but ignoring (forgetting about?) even the user wishlist is saddening.

The community manager Stephen Spector even forwarded my email asking about IB support and the user wishlist to a developer in charge a few weeks ago.

Guess what – got no reply.