happy about VM compression on ZFS


Over the course of the weekend I’ve switched one of my VM hosts to the newer recommended layout for NodeWeaver – this uses ZFS, with compression enabled.

First, let me admit some things you might find amusing:

  • I found I had forgotten to add back one SSD after my disk+l2arc experiment
  • I found I had one of the two nodes plugged into its 1ge ports, instead of using the 10ge ones.

The switchover looked like this:

  1. Pick a storage volume to convert; write down the actual block device and the mountpoint
  2. Tell LizardFS I’m going to disable it (prefix it with a * and kill -1 the chunkserver – see the sketch after this list)
  3. Wait a bit
  4. Tell LizardFS to forget about it (prefix with a # and kill -1 the chunkserver)
  5. umount the volume
  6. ssh into the setup menu
  7. Select ‘local storage’, pick the now unused disk, and assign it to be a ZFS volume
  8. Quit the menu after successful setup of the disk
  9. kill -1 the chunkserver to enable it
  10. It’ll be visible in the dashboard again, and you’ll also see it’s a ZFS mount.
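
For steps 2 and 4, a rough sketch of what that looks like on the shell – assuming the stock /etc/mfs/mfshdd.cfg location, which NodeWeaver may well keep elsewhere:

vi /etc/mfs/mfshdd.cfg              # prefix the volume’s line with * (drain) or # (forget)
kill -1 $(pgrep mfschunkserver)     # SIGHUP makes the chunkserver re-read mfshdd.cfg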

Compression was automatically enabled (lz4).

I’ve so far only looked at the re-replication speed and disk usage.

Running on 1ge I only got around 117MB/s (one of the nodes is on LACP and the switch can’t do dst+tcp port hashing, so you end up in a single channel).

Running on 10ge I saw replication network traffic go up to 370MB/s.

Disk IO was lower since the compression had already kicked in, and the savings have been vast.

[root@node02 ~]# zfs get all | grep -w compressratio
srv0  compressratio         1.38x                  -
srv1  compressratio         1.51x                  -
srv2  compressratio         1.76x                  -
srv3  compressratio         1.53x                  -
srv4  compressratio         1.57x                  -
srv5  compressratio         1.48x                  -

I’m pretty sure ZFS also re-sparsified all sparse files: the net usage on some of the storages went down from 670GB to around 150GB.
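
A quick way to eyeball that is to compare logical vs. allocated size (the path here is made up for illustration):

du -sh --apparent-size /srv3        # logical size, holes counted as data
du -sh /srv3                        # blocks actually allocated on ZFS
zfs list -o name,used,compressratio srv3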


I’ll probably share screenshots once the other node is also fully converted and rebalancing has happened.

Another thing I told a few people was that I ripped out the 6TB HDDs once I found that $customer’s OpenStack cloud performed slightly better than my home setup.

Consider that solved. 😉

vfile:~$ dd if=/dev/zero of=blah bs=1024k count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.47383 s, 434 MB/s

(this is a network-replicated write…)

Make sure to enable the noop disk scheduler before you do that.

Alternatively, if there are multiple applications on the server (i.e. a container-hosting VM), use the deadline disk scheduler with the nomerges option set. That matters. Seriously 🙂
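
For reference, a minimal sketch of both variants – sdX being whichever device backs the VM storage:

echo noop > /sys/block/sdX/queue/scheduler      # single-workload VM host
echo deadline > /sys/block/sdX/queue/scheduler  # shared / container-hosting host
echo 2 > /sys/block/sdX/queue/nomerges          # 2 = disable all merge attempts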


Happy hacking and goodbye


FroSCon recap: hosting



The two German “we’re serious about being different” web hosters were both at FroSCon.

By that I mean hostsharing eG and UBERSPACE.

hostsharing was founded in 2001 after the Strato meltdown; their current (or has it always been this?) slogan is: Community Driven Webhosting. They are a cooperative, i.e. ideally you’re not just a customer, you also have a say. Naturally, there’s still a hard core of people keeping the gears turning.

UBERSPACE I’ve known for what feels like 2-3 years. They do “Hosting on Asteroids” 😉

Here you can set your own price, and you get a flexible solution without all the bullshit the mass hosters build around theirs.

So they’re two generations apart. Still, the idea seems similar to me:

  • A pricing model oriented toward the best possible service.
  • Openness. The concepts are understandable; a user can see where their data will end up and can clearly tell it is kept safe.
  • Serious server operations (i.e. multiple locations, backups kept elsewhere, reacting when there are problems, proactive management).
  • Putting only competent people on the job! Meaning many years of real Unix experience, not just a bit of PHP and panel clicking. Everyone involved is roughly on the same level.

Contrast that with the mass market, where often enough you can read about customers once again being punished for the providers’ security holes, about backups that didn’t exist, about the absence of any DR concept, about broken hardware kept in service (“Why does the RAM have ECC if I’m supposed to have it replaced?”) and so on.

I didn’t get to talk with UBERSPACE because I didn’t even notice they were there. I did sit in a car with one of them, but was catching up on sleep. 🙂

To hostsharing I mostly gave one piece of advice: put prices on it.

Typical of idealists 🙂

But not to be dismissed: their start is now 15 years back, and they’ve simply delivered quality ever since.

I wish UBERSPACE the same, and both of them that they can scale their concepts well enough to put the other providers on the market under pressure with quality.


Links:

https://uberspace.de/tech – yes, the first link goes to the tech. In the end, only tech, team and processes matter. At least if you don’t need a website builder.
https://www.hostsharing.net/ – the info page

https://www.hostsharing.net/events/hostsharing-veroeffentlicht-python-implementierung-des-api-zur-serveradministration – the admin API they hadn’t even told me about 🙂

FroSCon recap: coreboot



Coreboot

I was really happy someone from coreboot was there. The booth was organized by the BSI, who otherwise also help the project with test infrastructure.

Good thing in my book; that’s close to what I consider the BSI’s “important” work.

The coreboot developer showed me their new website (looks good, but isn’t online yet), and I also got to refresh my knowledge, which was many years stale.

At Intel alone there are 25 people contributing to coreboot!

Natively supported hardware keeps growing (from 0 to 2 or so, but hey!)

There’s a rather great Chromebook from HP that I should get myself.

The payloads are becoming more diverse – I said that I simply hate SeaBIOS and would love a Phoenix clone. He asked a few follow-up questions and made me realize that what I actually want is a much better PXE. (See VMware, VirtualBox; don’t see KVM or PyPXEboot!)

And he had a recommendation: Petitboot – an SSH-able PXE loader that can simply do anything and everything we admins or QA engineers could wish for.

https://secure.raptorengineering.com/content/kb/1.html

It’s definitely on the test list.


What else was at the BSI booth: GPG, OpenPGP and OpenVAS.

OpenVAS would have interested me too, but I think one complete conversation is better than two half ones. 🙂

This is patching



  • There’s an OpenSSL issue
  • Fetch and rebuild the stable ports tree (2016Q3)
  • Find it won’t build php70 and mod_php70 anymore
  • Try to compare to your -current ports tree
  • Find there’s a PHP security issue in the version there, but not in the one you had
  • Wait till it’s fixed so you can build
  • Type portsnap, then just to be safe first do a full world update to make sure your portsnap isn’t having security issues any more.
  • Updated portsnap has a metadata corruption
  • Remove your portsnap files, try again then just think “whatever” and fetch the ports from the ftp mirror and re-extract manually
  • Notice you just fetched an unsigned file via FTP and will use it to build, of all things, your OpenSSL.
  • Rant about that.
  • Find you can’t build because it can’t fetch
  • Debug the reason it can’t fetch
  • Find it’s a bug in the ports tree from the fix of the above security issue
  • Make a mental note that no one seems to react within 1-2 days if the stable tree is broken
  • While searching for existing bugs, find a PR about pkg audit that tries to redefine the functionality in order to not fix an output regression
  • Open a bug report for the PHP bug, adjust your local file
  • Fetch your new package list
  • Do a pkg audit, find it doesn’t report much.
  • Do a pkg audit -F, find it gives an SSL error
  • Find the http://www.vuxml.org certificate expired 2 months ago.
  • Wonder how no one even reacted to that
  • Find that SSLlabs somehow can’t even properly process the site anymore.
  • Find out that SSLlabs is actually dead just now.
  • Notice in the last lines it had managed to print that the actual hostname points at an rDNS-less v6 address, and the v4 CNAMEs to a random FreeBSD.org test system.
  • Most likely the vuxml.org webserver ain’t heavily protected in that case, huh?
  • Give up and use http like the goddamn default
  • pkg wants a random pkg update first -> so update pkg first
  • In the end, just no-downtime slam all the new packages over the server because you’re sick of it all.
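
For the record, that last step is about as glamorous as it sounds (a sketch, assuming the standard pkg tooling):

pkg update -f && pkg upgrade -y     # refresh the catalog, then slam everything on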


The next person who posts about “admins need to apply the fixes” I’ll just kick in the face. With my beer.

Diving into sysadmin-legalese


I’ve had the “fun” of writing outage notices like the current Google statement a few times. The wording is interesting. IT systems fail, and once it’s time to write about it, management will make sure they are well hidden. The notice gets written by some poor soul who just wants to go home to his family instead of being the bearer of bad news. If he writes something that isn’t correct, management will be back just in time for the crucifixion.

Things go wrong, we lose data, we lose money. It’s just something that happens.

But squeezing both the uncertainties and hard truths of IT into public statements is an art of its own.

Guess what? I’ve been in a popcorn mood – here are multiple potential translations for this status notice:

“The issue with Compute Engine network connectivity should have been resolved for nearly all instances. For the remaining few remaining instances we are working directly with the affected customers. No further updates will be posted, but we will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.”


should have been resolved:

we cannot take responsibility. it might come back even just now! we’re not able to safely confirm anything 😦

but we spoke to the guy’s boss and they promised he’ll not do that thing again.

for nearly all instances:

the error has been contained, but we haven’t been able to fix it? we’ve been unable to undo the damage where it happened. we were lucky in most cases, but not in all.

the remaining few remaining instances:

the remaining remaining… they aren’t going away! Still remaining!

these instances that are left we’ll not be able to fix like the majority where our fix worked.

Please, just let me go to bed and someone else do those? I can’t even recall how many 1000s we just fixed and it won’t stop.

working directly with the affected customers:

only if you’re affected and we certainly know we need to come out, we will involve you. we can’t ascertain if there’s damage for you. we need to run more checks and will tell you when / how long. you get to pick between restores and giving us time to debug/fix. (with a network related issue, that is unlikely)

no further updates will be posted:

since it is safely contained, but we don’t understand it, we can’t make any statement that doesn’t have a high chance of being wrong

we only post here for issues currently affecting a larger than X number of users. Now that we fixed it for most, the threshold is not reached

we aren’t allowed to speak about this – we don’t need to be this open once we know the potential liabilities are under a threshold.

will conduct an internal investigation:

we are unwilling to discuss any details until we completely know what happened

appropriate improvements:

we are really afraid now to change the wrong thing.

it will be the right button next time, it has to!

management has signed off the new switch at last!

to prevent or minimize future recurrence:

we’re not even sure we can avoid this. right now, we would expect this to happen again. and again. and again. recurring, you know? if at least we’d get a periodic downtime to reboot shit before this happens, but we’re the cloud and no one gives us downtimes!

and please keep in mind minimized recurrence means a state like now, so only a few affected instances, which seems under the threshold where we notify on the tracker here.

we really hope it won’t re-occur so we don’t have to write another of those.



Don’t take this too seriously, but these are some of the alarms that go off in my head when I see such phrasing.

Have a great day and better backups!

Active Directory burns


Hi,


just writing down the most important things from a day last week when things got special.

We set out to extend / upgrade a forgotten 2003 domain to 2012; with that it’d also finally get a second DC.

So we expected those steps:

  • Put domain in 2003 mode 🙂
  • Add 2012 DC, if not happy, do forest prep or so
  • Switch all FSMO
  • Shut down 2003 DC, test if NAS shares are accessible
  • Restart it
  • ready a 2012 VM as secondary DC
  • dcpromo the new VM
  • demote the old DC
  • clean up DNS/LDAP should we find any leftovers

It didn’t go anywhere near as smoothly.

First we hit a lot of DNS issues, because not all clients used DHCP and so they’d end up looking for a DNS server that wasn’t there.

Once that was fixed we found stuff still wasn’t OK at all (“No logon servers available”). On the new DC the management console (the shiny new one) didn’t complain about anything, but nothing worked.

First we found the time service wasn’t doing OK: w32tm needed a lot of manual config (and resetting thereof) to finally sync its time; it didn’t take the domain time and it didn’t pull it in externally. That made all Kerberos tickets involving the new DC worthless.
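
Roughly the shape of that fiddling – these are the standard w32tm/net invocations, not the literal sequence from that night:

w32tm /config /syncfromflags:domhier /update    # take time from the domain hierarchy
net stop w32time && net start w32time
w32tm /resync /rediscover
w32tm /query /status                            # verify source and stratum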


Later we also noticed that netdom query fsmo suddenly hit an RPC error. This was followed by a lot of time spent trying to debug that RPC error. In fact, in one of the following links I found a reminder to work with dcdiag, which finally gave information that was actually useful to someone in IT. We had DNS issues (not critical) and a broken NTFRS. Basically I ran all the checks from the first link:

When good Domain Controllers go bad!

Then I verified that, yes, indeed our netlogon shares were missing on the new DC and that, in fact, it had never fully replicated. No surprise it wasn’t working. The next thread (German) turned up after I found the original issue, and it had the right KB links to fix it.

http://www.mcseboard.de/topic/115522-fehler-13508-kein-sysvol-und-probleme-mit-dateireplikationsdienst/

So, what happened was that some dork at MS had used an MS JET DB driver for the rolling log of the file replication service. During a power outage, the JET DB wrote an invalid journal entry. Replication had been broken ever since.

What I had to do was re-initialize the replication, and then everything fixed itself just fine.

The KB entry for that is long, dangerous and something I hope you won’t have to do.

http://support.microsoft.com/kb/315457/en-us
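
The core of the procedure is the BurFlags registry dance – shown here only as a sketch from memory; read the KB in full before touching a production DC (D4 = authoritative restore on the good copy, D2 = non-authoritative everywhere else):

net stop ntfrs
reg add "HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup" /v BurFlags /t REG_DWORD /d 0xD4 /f
net start ntfrs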

What I hated is that this KB actually contains a few errors, e.g. they don’t even tell you when to start the service back up on the original DC. Since we only had *one* DC it was also sometimes unclear whether I’d just be doomed or be fine.

In the end, it’s not that bad: GPOs are just files, which you can restore if needed. So even if you end up with a completely empty replication set, you can put your GPOs back in. And, from the infra side, all your GPOs are less important than this service running…

There are a lot of unclear warnings about the permissions of those files, so copying in cmd might be OK. Otherwise you can also reset the perms later via one of the AD MMC tools, so that’s actually not too horrible. I had no issue at all and *heh* they also hadn’t ever used a GPO.

Also, note that on the last goddamn page of the article they tell you how to make a temporary workaround.


Monitoring lessons for Windows

  • Don’t skip the basic FSMO check Nagios had forever
  • Have the Check_MK replication check running (it’ll not see this)
  • Monitor the SYSVOL and NETLOGON shares via the matching check
  • Monitor NTFRS Event ID 13568, optionally 13508
  • Set up LDAP checks against the global catalog

Monitoring gaps:

  • Checks built on dcdiag (afraid one has to ignore the event log since it shows historic entries). Command that finally got me rolling was dcdiag /v /c /d /e /s:
  • Functional diagrams of AD for NagVis
  • Pre-defined rulesets for AD monitoring in BI

I feel those could just be part of the base config; there’s nothing site-specific about them. BI can do the AD monitoring strictly via autodetection, but NagVis is better for visual diagnosis.

With proper monitoring in place I’d not have had to search for the issue at all…

For my monitoring customers I’ll try to build this and include it in their configs. Others should basically demand the same thing.


Furthermore:

  • There’s always a way to fix things
  • AD is a complex system, you can’t just run one DC. No matter if SBS exist(s|ed) or not, it’s just not sane to do. Do not run AD with just one DC. Be sane, be safer.
  • Oh, and look at your Eventlogs. Heh.

Ceph Training Overview


Someone on #ceph asked about training in Europe / major cities there.

So what I did is I googled the s*** out of “Ceph Training”…

I mean, I’ve done a little browsing of who’s currently offering any, partly as research (“do I wanna do that again?”) and also because I think alternatives are good.


Here’s the offerings I found, along with some comments.

All of them cover the basics pretty well now, meaning you get to look at CRUSH, make sure you understand PGs reasonably, and most will let you do maintenance tasks like OSD add / remove…
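
To give an idea of the level, these are the sort of everyday commands such a class drills – here the classic OSD removal sequence of that era:

ceph -s                        # overall cluster health
ceph osd tree                  # where each OSD sits in the CRUSH hierarchy
ceph osd out 12                # let the cluster drain the OSD first
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12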

I didn’t only look at classes in German, but it seems the interest in Ceph in Germany is just pretty high. Rightfully so 🙂

Ceph Trainings


CEPH – Cluster- und Dateisysteme

(obviously a German-speaking class)

https://www.medienreich.de/training/ceph-cluster-und-dateisysteme

They have lots of references for this course, and they clearly cover CephFS. They offer a flexible duration, so you can choose how deeply some topics are covered. They handle common mistakes, which is very nice for the trainees.

One add-on they do is re-exporting, i.e. HA NFS or VMs in RADOS etc. Surely helpful, but make sure you either cut that part very short (like 2 hours) to only get the overview, or stretch it into 2 days of its own. Clustering isn’t a part of your toolkit; it’s a new toolkit you learn. If you cut it short, you end up worse off. And make no mistake, Ceph(FS) will be your new SPOF, so make sure you get in very close touch with it.

One thing I’d also recommend is not to do the class with just a 3-node setup if you take a longer one. 3 nodes is really nice for your first days with Ceph, but the reliability and performance are completely unrelated to what you see in a larger setup.

hastexo – “Get in touch”

https://www.hastexo.com/services/training/

I found some past feedback from one of their classes, it seems to be very good.

Also keep in mind they’re among the really long-time Ceph users; they’ve hung out in #ceph at least as long as I have, and that’s now, what, 6 years?

Different from pure trainers or people who run around bragging about Ceph but aren’t even in the community, hastexo has also spent years delivering Ceph setups large and small.

The only grudge I have with them is when they recommended consumer Samsung SSDs in a Ceph intro for iX magazine. That wasn’t funny; I met people who thought it was buying advice for anything serious. Ignoring that any power outage could potentially fry all the journal SSDs isn’t something you do. But the author probably just tried to be nice and save people some money in their labs.

Anyway, hastexo, due to their large number of installations, is the very best choice if your company is likely to have a few special requirements; let’s say you’re a company that might test with 8 boxes but later run 500+, and you want real-world experience and advice for scalability, even in your special context.

Afaik they’re in Germany, but they’re real IT people; as far as I can tell, any of them would speak fluent English 🙂

Seminar Ceph

http://www.seminar-experts.de/seminare/ceph/

This is just a company re-selling trainings someone else is doing.

The trainer seems to have a good concept though, adding in benchmarking and spending a lot of time on the pros/cons of fuse vs kernel for different tasks.

This is the course you should take if you’ll be “the ceph guy” in your company and need to fight and win on your own.

Nothing fuzzy, no OpenStack or Samba “addons”. Instead you learn about Ceph to the max. I love that.

Price isn’t low even for 4 days, but I see the value in this, and in-house training generally ain’t cheap.

There’s also a “streaming” option which comes out cheaper, but a Ceph class without a lab is mostly useless. The listing also doesn’t say anything about the trainer, so no idea if he’d do it in a language other than German.

Red Hat Ceph Storage Architecture and Administration

http://www.flane.de/en/course/redhat-ceph125

Seriously, no. This is all about OpenStack. You can take this course if you have some extra time to learn Ceph in-depth or if you’re the OpenStack admin and do some Ceph on the side, and aren’t the only guy around.

Can also be partially interesting if you have other ideas for using the Rados Gateway.


Merrymack Ceph Training

http://www.ceph-training.com/index.html

A web-based / video-based training. Price-wise this beats them all if you just have 1-2 attendees and no prior knowledge.

Probably a very good refresher if your Ceph knowledge is dated, or if you want to learn at your own pace. That way you can spend a lot more lab time, which is rather nice.

If you have a few people on the team the price goes up and you should really negotiate a price.

Personally I’d prefer something with a trainer who looks at your test and tells you “try it like this and it’ll work”, but $499 is hard to argue with if you’ve got some spare time for the lab chores.

I think this is the actual launch info of the course:

https://www.linkedin.com/pulse/i-just-launched-on-demand-ceph-training-course-donald-talton


No longer available

Ceph 2-day workshop at Heise / iX magazine.

It was a bit expensive for 2 days with up to 15 people.

http://www.heise.de/ix/meldung/iX-Workshop-zum-Dateisystem-Ceph-2466563.html

Nice as a get-to-know thing; I would not recommend it as the only training before going into a prod deployment.


MK Linux Storage & LVM

That’s the original first Ceph training, the one I used to do 🙂

Ceph was done on the final day of the class, because back then you wouldn’t find enough people to come around just for a Ceph training 😉

But it’s no longer offered by them. As far as I know, the interest was always a little too low, since this hardcore storage stuff seems to have a different audience than the generic Linux/Bash/Puppet classes do.


Summary

Which one would I recommend?

“Seminar Ceph” from that reseller would be for storage admins who need to know their Ceph cluster as well as a seasoned SAN admin knows their VMAX etc. Also the best choice for people at IT shops who need to support Ceph in their customer base. You’ll be better off really understanding all parts of the storage layer; you might get your life sued away if you lose some data.

Go to hastexo if you generally know about Ceph, you’ve already read the Ceph paper and some more current docs, and your team is strong enough to basically set it up on your own at scale (so not “we can install that on 5 servers with ansible” but “we’ve brought up new applications the size of 100s of servers often enough, thank you”). You’d be able to strengthen some areas with them and benefit from their implementation experience.

Take the online Ceph Training if you want something quick and cheap and are super eager to tinker around and learn all the details. You’ll end up at the same level as with the pro training but need more time to get there.

Myself?

I still have no idea if I should do another training. I looked at all their outlines and they looked OK. Some more CRUSH rebuilds to flex the fingers, and add-/remove-/admin-socketify all the things 🙂 So, that’s fine with a week of prep and slides.

Training is a lot more fun than anything else I do, too.

But, to be honest, the other stuff isn’t done yet and is also pretty cool, with 1000s of servers and so on.

At the next iteration of my website (www.florianheigl.me) I’ll be adding classes and a schedule.

Custom NFS options for XenServer


Just had to spend half a night to move a XenServer lab behind a firewall.

Now, the servers are behind a NAT firewall, but my ISO repositories and the test NFS SR are not.

The two steps to get a solution were:

  • enable insecure mounting of the shares, because the NATting scrambles the ports
  • use TCP instead of UDP for the mount

The right hints came from this mailing list thread:

http://www.gossamer-threads.com/lists/xen/users/289442

So what I did was patch the NFS storage manager on both nodes.
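
The server-side half is the easy part to show – an /etc/exports sketch with made-up paths and networks; the XenServer half was editing the mount options in the NFS storage manager (/opt/xensource/sm/nfs.py) to use tcp:

/export/isos  10.0.0.0/24(rw,sync,insecure,no_subtree_check)   # insecure = allow source ports >1023, which NAT produces
# then reload the export table:
exportfs -ra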

Also in this thread: Kelvin Vanderlip and someone at SoftLayer – the internet is rather small at times.

The nicest thing: 100MB/s read throughput… I’m more than surprised!

tinker

This traffic comes into the Debian PV domU on XenServer via XenServer’s Open vSwitch. The XenServers are, like the pfSense, only VMs running on ESXi. So next come the pfSense firewall doing NAT, two of the VMware vSwitches, 3 real gigabit switches and again Open vSwitch until it finally hits the file server.

Except for the ISO repository, which comes re-exported from a LizardFS share.

Lost BNX2 Broadcom BCM5708 drivers after Ubuntu upgrade


Hey everyone,

this feels so important I’d rather leave a post here to save you the same troubles.

Networking nightmare:

On Ubuntu 14 LTS you’ll need not just the non-free firmware package, but also this last one called:

“linux-image-extra-3.13.0-63-generic 3.13.0-63.103                        amd64        Linux kernel extra modules for version 3.13.0 on 64 bit x86 SMP”

Otherwise you won’t have much more than the stock e1000e around for networking, meaning your servers may miss some NICs. This was extremely hard to figure out because at *first* my /lib/modules/3.13.0-63 included the bnx2 and bnx2x modules. Amusingly, I found it would still boot 3.13.0-61. After the install of the -63 kernel, the modules were gone. It seems there’s some stupid trimming hook.

Installing the linux-image-extra package made the modules stick, and I have all 4 NICs back.
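
To make that survive future kernel updates, match the extra package to the running kernel instead of hardcoding -63 (a sketch):

apt-get install linux-image-extra-$(uname -r)
modinfo bnx2        # verify the module really is back
update-initramfs -u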

Lost monitoring site:

A really nice feat is how OMD integrates with most distros by just having a zzz-omd.conf that includes the per-site config files. Now, funny enough, this has lived in /etc/apache2/conf.d for years. Ubuntu 14.04 doesn’t read that anymore; it only handles /etc/apache2/conf-enabled. Which is more aligned with the Debian way of things (not that I enjoy it, but at least it’s consistent), but HELL, why do you need to suddenly change it after you already fucked up?

I was looking for proxy module issues / misconfiguration for ages, until I decided to just add random crap to the config files and see if it would break Apache. No, it didn’t. After that, some greps verified conf.d isn’t read any longer. It’s beyond me why they don’t at least move the contents over.
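
The fix itself is short, assuming the file keeps its name – move the include to where Apache 2.4 on 14.04 actually looks, and enable it the Debian way:

mv /etc/apache2/conf.d/zzz-omd.conf /etc/apache2/conf-available/
a2enconf zzz-omd
service apache2 reload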

SAN nightmare:

One more thing that caused the server to not even boot:

Ubuntu has no concept of loading local disk drivers before SAN drivers. It scanned a lot of SAN LUNs, hit udev rule bugs by running a bad inquiry on each of them, and then finally hit the root device mount timeout.

Root was a local SAS attached SATA disk.

Drivers in this case were mptsas and mptfc. You get the idea, yes? Alphabetically, “FC” comes before “SAS”. And no, I don’t think Ubuntu is commonly used with SAN boot plus local disk…

I’m pretty sure once the devs notice the issue they’ll go with a highly professional solution like in /etc/grub.d, i.e. 10-mptsas and 40-mptfc. So clever 🙂

Anyway, to sort that out:

Blacklist the mptfc module in /etc/modprobe.d/blacklist.conf

fire up update-initramfs -u

Load it again from /etc/rc.local. Of course this also means you can’t really use your SAN LUNs for anything boot-critical anymore.
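
All three steps in one place, as a sketch:

echo "blacklist mptfc" >> /etc/modprobe.d/blacklist.conf
update-initramfs -u
# in /etc/rc.local, above the final "exit 0":
modprobe mptfc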

No, there’s no hook in their ramdisk framework to change the order. I don’t know what else to do about it.

If it weren’t for the udev issue, adding a script for the modprobe in init-bottom might work (prior to lvm and multipath).

But I also don’t have unlimited time in my life to properly fix shit like this. I searched for a few hours and found nothing that comes close to a clean solution for the udev or init order problem. And it’s not my box, so even an update-grub hook etc. just wouldn’t cut it.

If my friend whose server this is needs a more long-term fix, it would be to use an HBA from a vendor with an initial later in the alphabet than “M” – so in that case, switch to QLogic.

In summary, I think this story covers all there is to say about working with Ubuntu 14 on a server.

How to break LizardFS…


To start with:
MooseFS and LizardFS are the most forgiving, fault-tolerant filesystems I know.
I’ve been working with Unix/storage systems for many years; I like running things stable.

What does running a stable system mean to you?

To me it means I’ve taken something to its breaking point and learned exactly how it behaves at that point. Suffice to say, I’ll henceforth not allow it to get to that point.
That, put very bluntly, means stable operation.
If we were dealing with real science and real engineering, there would be a sheet of paper stating tolerances. But the IT world isn’t like that. So we need to find out ourselves.

I’d done a lot of very mean tests, first to MooseFS and later to LizardFS.
My install is currently spread over 3 Zyxel NAS and 1 VM. Most of the data is on the Zyxel NAS (running Arch), one of which also has a local SSD using EnhanceIO to drive down latencies and CPU load. The VM is on Debian.
The mfsmaster is running on a single Cubietruck board that can just barely handle the compute load.

The setup is sweating and has handled a few migrations between hardware and software setups.
And this is the point: it has been operating rock-solid for over a year.

How I finally got to the breaking point:
A few weeks back I migrated my Xen host to Open vSwitch. I’m using LACP over two GigE ports; they both serve a bunch of VLANs to the host. The reason for switching was to get sFlow exports, and also the cool feature of running .1q VLANs directly into virtual machines.

After the last OS upgrade (the system had been crashing *booh*) I had some Open vSwitch bug for about a week or two.
Any network connection would initially not work, i.e. every ping command would drop the first packet and then work.

In terms of my shared filesystem, this affected only the Debian VM on the host, which held only 1TB of data.
I’ve got most of my data at goal: 3, meaning two of the copies were not on that VM.

Now see for yourself:


root@cubie2 /mfsmount # mfsgetgoal cluster/www/vhosts/zoe/.htaccess
cluster/www/vhosts/zoe/.htaccess: 3

root@cubie2 /mfsmount # mfscheckfile cluster/www/vhosts/zoe/.htaccess
cluster/www/vhosts/zoe/.htaccess:
chunks with 0 copies: 1

I don’t understand how this happened.

  • The bug affected one of four mfs storage nodes
  • the file(s) had a goal of 3
  • the file was never touched by the OS during that period.
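
If you suspect more files are hit, a hypothetical sweep along these lines will flag them – assuming the mount lives at /mfsmount as in the transcript above:

find /mfsmount/cluster -type f -exec mfscheckfile {} + | grep -B1 'with 0 copies'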

Finally, don’t do an mfsfilerepair on a file with 0 copies left. I was very blonde – but it also doesn’t matter 🙂