SSD Failure statistics


I just got forwarded an article from c’t along the lines of “SSDs aren’t as problematic as they used to be”.

Which is true, but it encouraged me to tally the ones I’ve actually used and how many of them really failed. In general I’ve made peace with SSDs by accepting that they are a part that wears out and simply needs replacing sometimes. That way you put more thought into periodic maintenance, firmware updates etc. and less time into sobbing over lost data.

The models I’ve used or installed somewhere:

  • The start is made by two stone-age 60GB Samsungs; they used to be horrible, and even with a TRIM-supporting OS they would often stutter and hang for minutes. A few years later I needed two SSDs for a friend, gave these two a full wipe and new firmware, and they have been running fine ever since. This shows how much Samsung has learned in terms of firmware.
  • 7 OCZ Vertex II 60GB – all running OK
  • 1 OCZ Vertex II 120GB – immediate and total data loss, irrecoverable at that. I know two more people with the same experience. I’d guess some metadata corruption, since there are documented ways of resetting these SSDs. The sad thing about this is mostly that it’s a typical “what could go wrong there” issue. With some better design it would just drop to read-only and have some metadata consistency helpers.
  • 1 Intel 510 120GB – not quick by any means, but solid. Given that it uses a cheapo controller it was not my best buy, but still … I like solid!
  • 2 Intel 320 120GB – everything great, quick ones, too.
  • 3 OCZ Vertex III 120GB – all doing fine
  • 2 Samsung 830 256GB – doing fine and I trust them.
  • 8 Samsung 830 120GB – 7 are doing fine and I trust them; one is having major hiccups and has trashed its LSI RAID brothers and sisters while at it. I’m still testing, with a lot of interest, why this happens.
  • At work we have some more ‘cheap’ SSDs, one out of 10 seems to have issues.
  • $colleague also had a cheap SSD that failed, but it did so slowly, so he had time to get his data off it; he’s now using one of the 320s.

That leaves us with the following numbers:

Out of 36 total SSDs:

  • 1x died a slow death
  • 4x were constantly going nuts
  • 3x chose instant and complete dataloss[tm]

-> that gives a roundabout 20% chance of issues. More than I would ever have guessed.
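
For the pedants, the tally as a quick sketch (the bucket assignments in the comments are my reading of the list above):

```python
# Sanity check of the failure tally above.
total_ssds = 36
slow_death = 1    # the colleague's cheap SSD that failed gracefully
going_nuts = 4    # constantly misbehaving, e.g. the hiccuping 830
instant_loss = 3  # instant and complete dataloss[tm]

failures = slow_death + going_nuts + instant_loss
print(f"{failures} / {total_ssds} = {failures / total_ssds:.0%}")
# -> 8 / 36 = 22%, i.e. the roundabout 20% above
```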

Fine print / details for those who care:

The SSDs are normally “underprovisioned”, i.e. I only partition something like 80% of their space. Sometimes I allocate just 40%, e.g. for a ZIL on ZFS. On the downside, the SSDs that run in RAID configs of course don’t see TRIM. There I sometimes run a SATA secure erase to freshen them up, but not as often as I had planned. On the other hand, they don’t get heavy usage at all either.
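
The sizing itself is trivial; a minimal sketch of the partition math (the 80% / 40% ratios are the ones above, the 120GB drive is just an example):

```python
def partition_bytes(drive_bytes: int, ratio: float) -> int:
    """Bytes to actually partition; the rest stays unpartitioned
    so the controller can use it as extra spare area."""
    return int(drive_bytes * ratio)

drive = 120 * 10**9  # a 120GB drive as sold (decimal GB)
print(partition_bytes(drive, 0.80))  # general use: ~96GB partitioned
print(partition_bytes(drive, 0.40))  # ZIL on ZFS:  ~48GB partitioned
```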

I had investigated and planned to set a larger SATA reserve area (and I think I did it on ONE of them, in a RAID that happens to do it for all 🙂 ), but I got blank stares from some people running many times more SSDs, so I put it off.
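
For reference, the reserve area means telling the drive to report fewer sectors, i.e. setting a Host Protected Area; hdparm’s -N option can do that. A hedged sketch only – the device name and sector count are placeholders, and the ‘p’ prefix makes the change permanent, so read hdparm(8) first:

```python
import subprocess

DEVICE = "/dev/sdX"    # placeholder: the SSD to shrink
VISIBLE = 187_500_000  # placeholder: ~96GB worth of 512-byte sectors

# 'hdparm -N' prints the current/native max sector count;
# 'hdparm -N p<count>' sets it permanently (creates an HPA).
subprocess.run(["hdparm", "-N", DEVICE], check=True)
subprocess.run(["hdparm", "-N", f"p{VISIBLE}", DEVICE], check=True)
```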

As for why hardware RAID – because the CPU overhead is so much lower with a good HBA than with software RAID. Lower CPU overhead means higher throughput if you intend to also run applications on the server. Normally SW RAID on Linux scales better across multiple cores, but it also needs substantially more power to move the bits – even on RAID0.

For example, my desktop topped out at 1.2GB/s (due to controller limits, I think) at a CPU usage of 3 cores @ 100%, whereas the same box with an older LSI RAID controller plus some of the onboard ports hit 2.4GB/s at 2 cores @ 100%.

(But it got sluggish; the PCIe bus was probably totally exhausted.)
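
Per fully loaded core that is quite a gap; assuming the first measurement was the software RAID setup, a quick sketch:

```python
# GB/s per fully loaded core, from the two measurements above.
sw = 1.2 / 3  # software RAID: 0.4 GB/s per core
hw = 2.4 / 2  # LSI controller: 1.2 GB/s per core
print(f"SW: {sw:.1f} GB/s/core, HW: {hw:.1f} GB/s/core ({hw / sw:.0f}x)")
```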

4 thoughts on “SSD Failure statistics”

  1. I also have a few Vertex2 losses to report. The first after three days, the second after about one and a half months, the third drive (a replacement, again) after 3 minutes (!!!). Now an Agility 3 (again a replacement) is lying here and I don’t dare install it… -.-

    • Late reply, since I was sick and scatterbrained.
      In any case, it’s quite simple, and it took me a while to accept it as well:
      SSDs are wear parts.

      (And I hate it, because I’m fundamentally incapable of sending things back)

  2. A little 2014 update:

    Still no 60GB Vertex2 failures.
    Two more failures did happen: 2 out of 7 Samsung 830.
    Added and not failed: 1 Samsung PM843T
    Added and not failed: 5 HGST SSD400M

    Hard disks:
    4x 1.5TB Seagate @2010 (all fine)
    4x 3TB WD Red @2012 (1 might be failing)
    4x WD Green 2TB @2012 (1 might be failing)
    4x Random disks @2009-2013 (all fine)
    12x 73GB SAS @2008 (2 failed, old shit, rarely in use)
    6x 146GB SAS @2009 (all fine)
    6x 300GB SAS @2011 (all fine)

    So far SSD failures still in the lead.

  3. 2015 update:
    1 1TB WD Blue disk failed, with data corruption on SATA, OS crashes and whatnot.
    9 HGST SSD400M running fine
    1 HGST SSD400S (B revision!) running fine
    2 Samsung SSD 850 running fine

    many of the 830 SSDs now hit steady state quickly and stay there for prolonged periods; the one I’d used in the MacBook has grown HORRIBLY slow.
