Permabits and Petabytes

January 20, 2009

Data Protection’s Black Swan: Seagate Drive Failures

Filed under: Jered Floyd — jeredfloyd @ 10:45 pm

At this point there’s been lots of press coverage of the very high failure rates on Seagate’s Barracuda 7200.11 desktop drives. Last Friday, Seagate came clean and admitted the problem is due to a firmware bug, and that the bug affects several other drive families as well. The good news is that the problem doesn’t affect the integrity of the data stored on the drive; the bad news is that if the bug has already hit, you’ll have to send your drive to Seagate for repair. Additionally, the updated firmware is not yet available (update: or, at least, updated firmware that doesn’t make the problem worse). This problem brings up an interesting question, though… where does data protection come into play in situations like these?

As I’ve written about before, all data protection schemes depend on statistical failure models. With hard drives, the model allows for two primary types of failure: total spindle failure (characterized by MTBF, Mean Time Between Failures) and unreadable blocks (characterized by BER, Bit Error Rate).

All drives shipping today have published MTBFs in the range of 1 million hours and up, or about 114 years. This doesn’t mean that the drives are actually expected to run for 114 years; rather, it means that this is the rate at which failures will occur across a large population of drives operating over their expected service life (say, 5 years). One hundred drives operating for 5 years is 500 drive-years, and an MTBF of 114 years means you should expect around 4 of those 100 drives, or 4%, to fail during those five years.
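
If you want to check the arithmetic, here’s a back-of-the-envelope sketch in Python (my own illustration, assuming the usual constant-failure-rate approximation where expected failures are simply exposure divided by MTBF):

```python
# Back-of-the-envelope MTBF math: expected failures in a drive population.
HOURS_PER_YEAR = 24 * 365

mtbf_hours = 1_000_000                 # published MTBF: 1 million hours
mtbf_years = mtbf_hours / HOURS_PER_YEAR
print(f"MTBF of {mtbf_hours:,} hours is about {mtbf_years:.0f} years")

drives = 100                           # population size
service_years = 5                      # expected service life
drive_years = drives * service_years   # total exposure

expected_failures = drive_years / mtbf_years
print(f"{drives} drives x {service_years} years = {drive_years} drive-years")
print(f"expected failures: {expected_failures:.1f} "
      f"({expected_failures / drives:.1%} of the population)")
```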

The bit error rate of drives today ranges between 1 in 10^14 and 1 in 10^16. This is the rate at which a single bit will be unreadable, a consequence of the statistical encodings modern drives use to avoid the performance penalty of read-after-write (verifying every bit written). In reality, these failures will cause a full block, or 512 bytes, to be unreadable. These error rates correspond to roughly one error per 12 TB to 1,200 TB read… much more common than MTBF failures, but also much less catastrophic. On a 1 TB drive with a BER of 1 in 10^14 you can expect a read error after only 12 full passes over the drive; with a 1 in 10^16 BER that goes up to 1,200 passes. As long as you’re reading the drives, these errors happen regularly. Luckily, as long as other failures don’t occur, they’re easily corrected. (The problem with RAID is that these errors are nigh certain to occur while recovering from a drive failure, the one time you can’t easily recover from an unreadable block!)
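
The conversion from bit error rates to terabytes read is just as mechanical. Another rough sketch (it assumes decimal terabytes and a 1 TB drive, and treats the BER as a simple expected value, so the results land within rounding of the figures above):

```python
# Rough translation of bit error rates into "how much can I read before an error?"
TB = 10**12          # decimal terabyte in bytes, as drive vendors count
DRIVE_TB = 1         # assumed drive capacity: 1 TB

for exp in (14, 16):                           # BER of 1 in 10^14 and 1 in 10^16
    bits_per_error = 10**exp
    tb_per_error = bits_per_error / 8 / TB     # bits -> bytes -> TB
    passes = tb_per_error / DRIVE_TB
    print(f"BER 1 in 10^{exp}: ~{tb_per_error:,.0f} TB read per unreadable block, "
          f"about {passes:,.0f} full passes over a {DRIVE_TB} TB drive")
```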

So, what does the current Seagate drive problem mean? Well, according to Seagate:

This condition is caused by a firmware bug that allows the drive’s “event log” pointer to be set to an invalid location. This condition is detected by the drive during power up, and the drive goes in to failsafe mode to prevent inadvertent corruption to or loss of user data. As a result, once the failure has occurred user data becomes inaccessible.

During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data fill pattern (dependent on the type of tester used during the drive manufacturing test process) had been present in the reserved-area system tracks when the drive’s reserved-area file system was created during manufacturing (note this is not the Operating System’s file system, but is instead an area reserved outside the drive’s logical block address space that is used for drive operating data structures and storage), firmware will incorrectly allow the Event Log pointer to increment past the end of the Event Log data structure. This error is detected and results in an “Assert Failure”, which causes the drive to hang as a failsafe measure. When the drive enters failsafe further updates to the counter become impossible and the condition will persist through all subsequent power cycles.

The problem can only occur if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, an end user will not be able to resolve/recover existing failed drives. Recovery of failed drive requires Seagate technical intervention. However, the problem can be prevented by updating drive firmware to a newer version and/or by keeping the drive powered on until a newer firmware version is available.

Note that in order for a drive to be susceptible to this issue, it must have both the firmware revision that contains the issue, have been tested through the specific manufacturing process, and be power cycled.

So, basically, the firmware has an off-by-one or sentinel value error that allows the drive’s Event Log (the thing that keeps track of occurrences like the unreadable blocks we talked about above) to advance too far, but the system is also smart enough to check for this instead of corrupting data.

After the first 320 event log entries the log presumably wraps around, at which point there are 256 possible values for the event log pointer, depending on how many loggable events have occurred. That means that, in the long run, each drive you power-cycle has around a 1 in 256 chance of becoming inaccessible.
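
Reading Seagate’s description literally, the fatal event-log positions are 320, 576, 832, and so on. Here’s a toy simulation (purely my interpretation of the description above, not Seagate’s actual firmware logic) that confirms the 1-in-256 figure:

```python
import random

def bricks_on_power_up(event_log_count):
    """Toy check: is this event-log position one of the fatal values?

    My reading of Seagate's description: the bad positions are 320, 576,
    832, ... in other words, 320 plus any multiple of 256.
    """
    return event_log_count >= 320 and (event_log_count - 320) % 256 == 0

# In steady state the event-log position at power-up is effectively uniform
# over the 256 residues mod 256, so about 1 power cycle in 256 is fatal.
random.seed(1)
trials = 1_000_000
hits = sum(bricks_on_power_up(random.randrange(320, 320 + 256_000))
           for _ in range(trials))
print(f"simulated brick rate per power cycle: {hits / trials:.5f} (1/256 = {1 / 256:.5f})")
```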

The good news is that it’s unlikely that event log entries are correlated, so these accidental “death counters” are unlikely to be synchronized. But you’re still rolling a 256-sided die (ok, a pair of 16s) each time you reboot, and that’s pretty bad. How does this affect RAID’s failure model?

If you have a RAID 5 with 7+1 drives, each time you reboot your RAID you have a 96.9% chance of everything being all right, a 3.0% chance of having a single drive not start up, and about a 0.04% chance of having multiple drives not start up. That’s roughly a 1 in 2,400 chance of failing to start your array on reboot, and a 1 in 33 chance of triggering a rebuild, which as we know with RAID 5 has a high potential to lose some data.

With a RAID 6 of 14+2 drives, you have a 93.9% chance of everything being all right, a 5.9% chance of losing a single drive, a 0.2% chance of losing two drives, and less than a 0.1% chance of worse. That’s more than a 6% chance of triggering a rebuild on reboot. Ouch!
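
For anyone who wants to check these numbers, they fall straight out of the binomial distribution. A quick sketch, assuming (as above) an independent 1-in-256 brick chance for each drive on every power cycle:

```python
from math import comb

P_BRICK = 1 / 256     # assumed per-drive chance of bricking on a power cycle

def reboot_odds(n_drives, failures_tolerated):
    """Print the chance of k drives failing to start on a single reboot."""
    p_array_loss = 0.0
    for k in range(n_drives + 1):
        p_k = comb(n_drives, k) * P_BRICK**k * (1 - P_BRICK)**(n_drives - k)
        if k <= 3:                     # only the small-k cases are interesting
            print(f"  {k} drive(s) dead: {p_k:8.4%}")
        if k > failures_tolerated:     # more failures than parity can cover
            p_array_loss += p_k
    print(f"  array loss on this reboot: {p_array_loss:.4%}")

print("RAID 5, 7+1 (tolerates 1 failure):")
reboot_odds(8, failures_tolerated=1)
print("RAID 6, 14+2 (tolerates 2 failures):")
reboot_odds(16, failures_tolerated=2)
```

The same script reproduces the figures above: roughly 3% of RAID 5 reboots and 6% of RAID 6 reboots kick off a rebuild, and a small but very real fraction lose the array outright.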

I’m sad to say that even advanced erasure coding techniques can offer only a few orders of magnitude of improvement over RAID technologies here, because these drive failures are all correlated at boot time. There’s no opportunity to repair damage between failures; the system goes from being off to having some number of failed drives. For data protection technologies, this is what Nassim Nicholas Taleb would call a Black Swan: a high-impact event outside the system’s threat model.

Data protection mechanisms are designed to make unreliable systems more reliable. They protect against failures by being able to rebuild before another failure occurs, and so there’s a base assumption that failures are generally uncorrelated. As we’ve seen before with RAID, this isn’t necessarily always the case — vibrational coupling can cause correlated failures within a single chassis, which is one way in which RAIN can offer better reliability. Spin-up or boot-time failures are another type of correlated failure, one much more difficult to deal with.

So, what can be done to protect against correlated failures of this type? A few thoughts:

  • Tried and true. Don’t be the very first to use any given model of drive, or any given firmware revision. Wait and see how things go after a few months, and make sure you know what you’re getting. At Permabit we lock down the revisions of all system components, so we don’t get surprises with new BIOS or firmware revisions. Many vendors don’t.
  • Mixed technologies. Use a storage platform that allows you to mix generations of technology, to distribute the risk. In Enterprise Archive, different nodes can always be of different generations, with different generation drives. Add new storage as you need it, and get the benefit of a heterogeneous storage environment at the same time.
  • Replication and Disaster Recovery. The biggest cause of correlated failure is site-related, be it environmental (such as a cooling fault, or bad power), or a disaster (i.e. site destruction). Replication technologies help protect in these cases.

Finally, a few additional risks to consider:

  • Is spin-up a loggable event? If at each boot an entry is made in the drive’s Event Log, suddenly there’s a lot more correlation between drives. If this is the case, the bug really is a “death counter”. After 320 boots the drive will die, barring an opportune event being logged before the next reboot.
  • Will MAID kill drives? Companies like COPAN use MAID technologies to spin down drives to save energy (although this isn’t really where most of the power is going today). Failures correlated with drive power-on could be disastrous in these systems.
  • Will software become the weakest link? As storage hardware technologies become more robust, software complexity becomes the challenge. For example, many of the recent advances in SSD storage are software-related, using software algorithms to hide some of the weaknesses of flash storage. This software must be rigorously tested and written in a strongly structured way to ensure it meets the level of quality necessary. In the past hardware companies have shown poor understanding of high-level software development practices — can they overcome their past?