I just got forwarded an article from c’t along the lines of “SSDs aren’t as problematic as they used to be”.
Which is true, but it encouraged me to take a count of the ones I've actually used and how many of them really failed. In general I have made peace with SSDs by accepting that they are a part that wears out and just needs to be replaced sometimes. That way you put more thought into periodic maintenance, firmware updates etc. and less into sobbing about lost data.
The models I've used or installed somewhere:
- The start was made by two stone-age 60GB Samsungs; they used to be horrible, and even with a TRIM-supporting OS they would often stutter and hang for minutes. A few years later I needed two SSDs for a friend, gave these two a full wipe and new firmware, and they've been running fine ever since. This shows how much Samsung has learned in terms of firmware.
- 7 OCZ Vertex II 60GB – all running OK
- 1 OCZ Vertex II 120GB – immediate and total data loss, irrecoverable at that. I know two more people with the same experience. My guess is some metadata corruption, since there are documented ways of resetting these SSDs. The sad part is mostly that it's a typical "what could go wrong there" issue: with better design it would just drop to read-only and offer some metadata consistency helpers.
- 1 Intel 510 120GB – not quick by any means, but solid. Given that it uses a cheap controller it was not my best buy, but still … I like solid!
- 2 Intel 320 120GB – everything great, quick ones, too.
- 3 OCZ Vertex III – 120GB – all doing fine
- 2 Samsung 830 256GB – doing fine and I trust them.
- 8 Samsung 830 120GB – 7 are doing fine and I trust them; one is having major hiccups and has trashed its LSI RAID brothers and sisters while at it. Still testing with a lot of interest why this happens.
- At work we have some more "cheap" SSDs; one out of 10 seems to have issues.
- $colleague also had a cheap SSD that failed, but it did so slowly, so he had time to get his data off it and is now using one of the 320s.
That leaves us with the following numbers:
Out of 36 total SSDs:
- 1x died a slow death
- 4x were constantly going nuts
- 3x chose instant and complete dataloss[tm]
-> that gives roughly a 20% chance of issues. More than I ever felt it would be.
Fine print / details for those who care:
The SSDs are normally "underprovisioned", i.e. I only partition something like 80% of their space. Sometimes I allocate just 40%, e.g. for a ZIL on ZFS. On the downside, the SSDs that run in RAID configs of course don't see TRIM. There I sometimes run a SATA secure erase to freshen them up, but not as often as I planned. On the other hand, they don't get heavy usage at all either.
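For the curious, the underprovisioning itself is nothing fancy: partition only part of the drive and leave the rest untouched. A minimal sketch (the device name /dev/sdX and the sector count are placeholders, and the destructive parted step is deliberately commented out):

```shell
# Sketch: partition only ~80% of an SSD, leave the rest as spare area.
# /dev/sdX is a placeholder - point this at the right disk or lose data.
DEV=/dev/sdX

# Example: a 120GB drive reports roughly this many 512-byte sectors.
# On a real system you'd query it with: blockdev --getsz "$DEV"
TOTAL=234441648
END=$(( TOTAL * 80 / 100 ))
echo "would partition up to sector $END of $TOTAL"

# The actual (destructive!) steps, commented out on purpose:
# parted -s "$DEV" mklabel gpt
# parted -s "$DEV" mkpart primary 0% 80%
```

The occasional secure erase mentioned above is the usual hdparm security dance (set a temporary password, then `hdparm --security-erase`), which hands every block back to the controller as free.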
I had investigated and planned to set up a larger SATA reserved area (and I think I did it on ONE of them, in a RAID that happens to do it for all 🙂), but I got blank stares from some people running many times more SSDs, so I put it off.
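What I mean by a reserved area can be done with a Host Protected Area, which hides sectors from the OS entirely so the controller can use them as spare flash. A sketch using hdparm's `-N` option (device name and sector count are hypothetical, and the commands are left commented because the change is destructive to existing partitions):

```shell
# Sketch: shrink the drive's visible size via a Host Protected Area.
# /dev/sdX is a placeholder.
DEV=/dev/sdX

# Show current vs. native max sector count:
# hdparm -N "$DEV"

# Permanently ('p' prefix) limit visible size to 80% of an example
# 234441648-sector drive. Do this BEFORE partitioning:
# hdparm -N p187553318 "$DEV"
```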
As for why hardware RAID: because the CPU overhead is so much lower with a good HBA than with software RAID. Lower CPU overhead means higher throughput if you also intend to run applications on the server. Normally SW RAID on Linux scales better across multiple cores, but it also needs substantially more power to move the bits – even on RAID0.
For example, my desktop topped out at 1.2GB/s (due to controller limits, I think) with three cores at 100%, whereas the same box with an older LSI RAID controller plus some of the onboard ports hit 2.4GB/s at two cores at 100% (though it got sluggish – probably the PCIe bus was totally exhausted).
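Throughput numbers like these typically come out of a sequential fio run while watching CPU load on the side; a sketch, with made-up path and sizes (left commented since it needs fio and a mounted array):

```shell
# Sketch: sequential-read throughput test with fio.
# /mnt/raid is a placeholder mount point for the array under test.
# fio --name=seqread --rw=read --bs=1M --size=4G \
#     --directory=/mnt/raid --numjobs=1 --direct=1

# In a second terminal, watch per-core CPU usage, e.g.:
# mpstat -P ALL 1
```

`--direct=1` bypasses the page cache so the numbers reflect the drives and controller rather than RAM.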