I already wrote a bit about GlusterFS, and after a chat with a very experienced guru from OrionVM.au I got curious about real-world usage of Gluster again.
For running a few Xen hosts with failover, good performance and mirrored storage, I need the following:
- Redundant storage for every piece of data
- No storage unavailability if a single node goes down
- Online expansion of storage
- Locality of storage – instead of having separate “storage” and “compute” nodes, have some parts of the storage local to the Xen dom0s.
- The storage layer should run on the dom0s but also be able to run on other systems, i.e. dedicated storage boxes.
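For illustration, a setup ticking those boxes could be sketched like this. All host, brick and volume names here are made up; this just shows how GlusterFS’s replicate translator and online brick addition map onto the requirements:

```shell
# Hypothetical sketch: a 2-way replicated volume across two Xen dom0s,
# each contributing a local brick (names are invented for the example).
gluster peer probe dom0-b
gluster volume create vmstore replica 2 \
    dom0-a:/bricks/vmstore dom0-b:/bricks/vmstore
gluster volume start vmstore

# Online expansion later: add another replica pair, then rebalance.
gluster volume add-brick vmstore dom0-c:/bricks/vmstore dom0-d:/bricks/vmstore
gluster volume rebalance vmstore start
```

Every piece of data lives on two bricks, a single node can die without losing availability, and the bricks sit right on the dom0s.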
GlusterFS has one fairly unique feature: it can optionally run on the Infiniband Verbs layer. Put simply, this means data can be read from its disk on one node RIGHT INTO the RAM of the node that wants it.
The data transfer rates on Infiniband when using IP over IB or SDP (Sockets Direct Protocol) always look a bit sucky. Sure, a QDR link (36 Gbit raw) will stomp 10GE, but “only” by 1.5 times or so.
When using IB Verbs we’re looking at almost full line rate, so a 3.4–3.5 times increase over 10GE.
I’m a poor guy using only SDR and DDR Infiniband, so I’ll be getting 15 Gbit/s at best. But that’s still real-world I/O performance with super-low latency that 10GE doesn’t even scratch. And hey, a port costs me about $70.
Who could complain about that? All that’s left is making it run in Oracle VM!
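If you want to sanity-check that the Verbs layer actually sees your HCA before blaming Gluster, `ibv_devinfo` (from the libibverbs utilities) is the quickest way. The grep pattern is just a convenience; the exact field layout can vary between driver versions:

```shell
# Show the adapter and its link parameters as seen by the Verbs layer.
ibv_devinfo | grep -E 'hca_id|state|active_width|active_speed'
# A DDR 4x port should report active_width 4X and active_speed 5.0 Gbps;
# SDR 4x shows 2.5 Gbps per lane.
```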
After some searching I found the SDK VM I had downloaded while training for the Oracle VM certification in June.
It’s called the “OVM Build template”, i.e. OVM_BUILD_TEMPLATE_2.2.1. After downloading from the Oracle E-delivery site, you can basically boot it and build your rpms with very little effort.
Steps if you wanna follow me:
- Download & unpack the image.
- copy it to an existing Xen host (doesn’t have to be Oracle VM, but it might be easier)
- I experienced a dom0 crash after the loop0 device lost around 2000 I/Os. That had never happened to me before, but my host at home is very small and has little dom0 memory. To avoid that, I copied the image into an LVM volume, avoiding /dev/loop altogether
- use tightvnc to connect to the VM console via the dom0’s IP.
- set up temporary networking if needed
- wget the GlusterFS CentOS or Red Hat SRPM from their site at http://download.gluster.com/pub/gluster/glusterfs/3.1/LATEST/RHEL/
- add the Oracle VM and OEL 5 public yum repos to the yum config using the following command – or read up at https://deranfangvomende.wordpress.com/2010/09/07/using-the-public-oraclevm-repo/
cd /etc/yum.repos.d && wget http://public-yum.oracle.com/public-yum-ovm2.repo http://public-yum.oracle.com/public-yum-el5.repo
- install infiniband verbs headers & stuff using
yum -y --enablerepo=el5_u5_base --enablerepo=ovm22_2.2.1_base install libibverbs-devel
- rpm -i the SRPM from wherever you downloaded it to
- cd to /usr/src/redhat
- run the following and go for a coffee:
rpmbuild -bb SPECS/glusterfs.spec
and voilà, I got these:
[root@localhost redhat]# ls /usr/src/redhat/RPMS/i386
glusterfs-core-3.1.0-1.i386.rpm       glusterfs-fuse-3.1.0-1.i386.rpm
glusterfs-debuginfo-3.1.0-1.i386.rpm  glusterfs-rdma-3.1.0-1.i386.rpm
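From there, installing the freshly built packages on each node is the usual rpm dance. This is a sketch; depending on what you need, the debuginfo package can stay behind, and the rdma package is the one carrying the ib-verbs transport:

```shell
cd /usr/src/redhat/RPMS/i386
rpm -ivh glusterfs-core-3.1.0-1.i386.rpm \
         glusterfs-fuse-3.1.0-1.i386.rpm \
         glusterfs-rdma-3.1.0-1.i386.rpm
```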
Let me know if it works for you, and also let me know your benchmark results, especially if you also use the stripe+replication translators in a stack.
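For reference, a stacked stripe+replicate volume over the rdma transport could be created roughly like this in 3.1. Host and brick names are invented; with stripe 2 and replica 2 the brick count has to be a multiple of 4:

```shell
# Hypothetical: 2-way stripe on top of 2-way replication, ib-verbs transport.
gluster volume create fastvol stripe 2 replica 2 transport rdma \
    node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 node4:/bricks/b1
gluster volume start fastvol
```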
Lastly, a word of warning:
GlusterFS isn’t Ceph. Do NOT expect magic things to happen if you ever use different-sized disks: one of your storage pieces will run full before the others, with very little warning.
Also, from what I’ve read and tested, never expect Gluster failover to just work[tm]. Infiniband bonding is not an easy thing to tackle, and GlusterFS timeout handling is very, very tricky.
Instead of fiddling around on your own: if you want ultimate performance and stability, make use of the professional services/consulting options at gluster.com.
(Because: anyone can do downtime. Thinking things through first is the hard part.)