Permabits and Petabytes

December 3, 2008

No Silver Bullet: Archive Challenges

Filed under: Jered Floyd — jeredfloyd @ 5:40 pm

After my post about dirty little secrets a few weeks ago, Joe Martins from Data Mobility Group wrote to point out the real “dirty little secret” about archive systems: even if your archival storage is reliable, it doesn’t mean you can do anything useful with your data once you retrieve it in the distant future.

There’s more to a digital archive than just being able to store and retrieve your bits from media. If your storage system has been designed properly then it will give you back your data, but it won’t necessarily give you the information that data represents. For several years I co-chaired the SNIA Long-Term Archive and Compliance Storage Initiative, and this was a problem we considered frequently. The challenges we encountered in trying to solve it led in part to the development of XAM, the new eXtensible Access Method standard for object-based information storage.

When it comes time to retrieve and process data that was written long ago, there are two major challenges: what I like to call physical readability and logical readability. Physical readability means that the archive system is able to retrieve and present the exact bitstream that was originally written, intact and complete, with no errors. Logical readability, on the other hand, means that I am able to extract the same semantic meaning from those bits as when they were originally processed. The first problem can be solved purely by technology; the second, sadly, cannot.

First, let’s look at physical readability. A guarantee of physical readability is expected from any modern storage system, though few effectively deliver on it over any appreciable period of time. Over 3 to 5 years nearly any system will do: data is written to media, and later read back off that same media. Technologies like RAID are used to protect data on disk against loss due to spindle failure, and multiple copies are made of data on tape. Even recently written tapes, however, have a terrible track record on reliability. I’m not going to quote any specific sources here since I’ve seen failure rates for tape access quoted anywhere from 20 to 45 to 71 percent (and 67.3% of statistics are made up on the spot), but I think it’s entirely fair to say that there are significant reliability concerns with tape. (For the purposes of archive data, even a 0.1% failure rate, 1 in 1000, is far too high.) Of course, I’ve already discussed at length the long-term challenges of RAID.
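To see why even a 1-in-1000 rate is far too high for archives, here’s a back-of-the-envelope sketch. It assumes independent failures at a fixed per-tape rate, which is generous to tape, and the tape counts are purely hypothetical:

```python
# Back-of-the-envelope sketch (hypothetical tape counts): cumulative risk of
# at least one unreadable tape, assuming independent failures at a fixed rate.

def prob_any_failure(per_tape_failure_rate, tape_count):
    """Probability that at least one of tape_count tapes is unreadable."""
    return 1.0 - (1.0 - per_tape_failure_rate) ** tape_count

# Even an optimistic 0.1% (1 in 1000) per-tape failure rate adds up quickly
# once an archive spans thousands of tapes.
for tapes in (100, 1_000, 10_000):
    risk = prob_any_failure(0.001, tapes)
    print(f"{tapes:>6} tapes: {risk:.1%} chance of at least one unreadable tape")
```

At 10,000 tapes the odds of a completely clean restore are essentially nil, which is why an archive system has to assume media failures will happen and design around them.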

Even worse, going beyond 5 years exceeds the functional life of the media or the recording technology, and maintaining physical readability becomes increasingly difficult. I’d be willing to bet that a number of my readers have boxes of QIC-80 tapes in the garage or basement with old data on them. Even if the tapes have a 50 year lifespan, do you have any idea where to find a working QIC-80 tape drive? NASA just recently went through an amazing project to recover old Lunar Orbiter image data, involving finding, refurbishing and interfacing with 40-year-old Ampex tape drives, an enormous project that took more than a decade to complete. Media life isn’t the problem with long-term data storage, and “archival-grade” media isn’t going to solve your physical readability problems, because the reader hardware will never last as long as the media.

The National Archives and Records Administration (NARA) recommends copying archive data to new, modern media every three to five years; this addresses both the danger of media degradation, through media refresh, and the danger of media obsolescence, through technology refresh. This is typically implemented as a recurring business responsibility or professional services contract. As the amount of data grows, however, media refresh operations must be occurring on an almost continuous basis. Additionally, coping with the exponentially increasing density of archive media means either reformatting the data and updating indexes on each refresh, or complicating the data format further with conventions like “tape stacking”, storing multiple tape images sequentially on the same new tape. To eliminate both the risk of error in this refresh process and its enormous cost, media and technology refresh should be handled automatically by an archive storage system, but few systems do this today.
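To get a feel for how quickly that refresh cycle turns into a continuous operation, here’s a rough estimate of the sustained copy rate the NARA cadence implies. The archive sizes and the 4-year interval are hypothetical, and this ignores verification passes and data growth during the cycle:

```python
# Rough, illustrative estimate (hypothetical sizes) of the sustained copy rate
# implied by re-copying an entire archive once every refresh interval.

def refresh_rate_mb_per_s(archive_tb, refresh_interval_years):
    """Sustained MB/s needed to re-copy the whole archive once per interval."""
    total_mb = archive_tb * 1_000_000                 # TB -> MB, decimal units
    seconds = refresh_interval_years * 365 * 24 * 3600
    return total_mb / seconds

for size_tb in (100, 1_000, 10_000):                  # 100 TB, 1 PB, 10 PB
    rate = refresh_rate_mb_per_s(size_tb, 4)          # mid-range NARA cadence
    print(f"{size_tb:>6} TB archive: ~{rate:.1f} MB/s of copying, around the clock")
```

At a petabyte, that works out to roughly 8 MB/s of copying every second of every day just to stand still, before any verification reads or growth in the archive itself.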

To address this challenge, Permabit Enterprise Archive includes automatic future-proof media migration. An Enterprise Archive deployment has some initial number of storage nodes, system components that include both processors and storage and that work collaboratively to maintain the integrity and persistence of all data stored in the system. Our RAIN-EC technology ensures that no data is lost even in the event of multiple drive or node failures. Initially, all storage nodes are of the same model (with the same number of drives, drive capacity, processing power, and so forth), and the system is sized to fulfill your initial capacity requirements. These storage nodes also carry up to a three-year warranty on all hardware components, the normal recommended service life for disk-based storage.
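The details of RAIN-EC itself aren’t covered here, but the general erasure-coding idea it builds on is easy to sketch: break an object into fragments, add redundant parity, scatter the fragments across nodes, and rebuild lost pieces from the survivors. The toy below uses a single XOR parity fragment, which survives only one lost fragment; a production code tolerates multiple simultaneous drive or node failures:

```python
# Toy illustration of the erasure-coding idea behind RAIN-style protection
# (not Permabit's actual algorithm): k data fragments plus one XOR parity
# fragment, scattered across nodes; any single lost fragment can be rebuilt.

def encode(data, k):
    """Split data into k equal-size fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)                                   # ceiling division
    frags = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = bytearray(size)
    for frag in frags:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return frags + [bytes(parity)]

def recover_missing(fragments):
    """Rebuild the single missing (None) fragment by XOR-ing the survivors."""
    size = len(next(f for f in fragments if f is not None))
    missing = bytearray(size)
    for frag in fragments:
        if frag is not None:
            for i, byte in enumerate(frag):
                missing[i] ^= byte
    return bytes(missing)

# One node's fragment is lost; the parity lets the system rebuild it.
pieces = encode(b"archive object payload", k=4)
survivors = [p if i != 2 else None for i, p in enumerate(pieces)]
assert recover_missing(survivors) == pieces[2]
```

Real systems use codes in the Reed-Solomon family so that several fragments can be lost at once, which is what lets an archive ride out multiple drive or node failures without losing data.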

As your storage needs grow, new storage nodes can be added at any time, non-disruptively, expanding the capacity available to all file systems in your Enterprise Archive. Particularly important is that these new storage nodes can be of whatever model is currently available, and thus the one that provides the best price for performance and capacity. New storage nodes may have larger hard drives, more drives, faster processors, or even entirely new storage technologies! The system automatically adds these to the existing storage infrastructure, with no need to match the existing components. New storage nodes automatically receive an amount of stored data proportional to their capacity as compared with the rest of the system.
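As a simple illustration of that capacity-proportional placement (this is not Permabit’s actual placement algorithm, and the node sizes are hypothetical), each node’s target share of stored data is just its capacity divided by the cluster total:

```python
# Illustrative sketch of capacity-proportional placement (hypothetical sizes,
# not Permabit's actual algorithm): each node's target share of stored data
# is its capacity divided by the cluster's total capacity.

def target_shares(node_capacities_tb):
    """Fraction of all stored data each node should hold, weighted by capacity."""
    total = sum(node_capacities_tb.values())
    return {node: cap / total for node, cap in node_capacities_tb.items()}

# Four older 4 TB nodes joined by one newer, denser 12 TB node.
cluster = {"node1": 4.0, "node2": 4.0, "node3": 4.0, "node4": 4.0, "node5": 12.0}
for node, share in target_shares(cluster).items():
    print(f"{node}: {share:.0%} of stored data")
```

The new, denser node ends up holding the largest share (about 43% in this example), so fresh capacity is put to work immediately without any manual data placement.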

More importantly, as storage nodes reach the end of their usable lives they can be removed from the system completely non-disruptively. A three-year-old or five-year-old storage node will be much less power-efficient and storage-dense than a modern one, and newer nodes will also provide an increase in overall system performance. Because the Enterprise Archive is architected around Ethernet and TCP/IP technologies as a backplane, new nodes that are backwards-compatible with existing deployments will be available for decades to come. Older nodes can be removed at any time, and the system automatically migrates data to the newer nodes with no operator intervention, no format translation, and no costly migration professional services.
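For a sense of what “non-disruptive” retirement looks like in practice, here’s a back-of-the-envelope estimate of how long draining a retiring node might take. The stored amount and the spare backplane bandwidth are hypothetical numbers, not measurements of Enterprise Archive:

```python
# Back-of-the-envelope estimate (hypothetical numbers) of how long it takes to
# drain a retiring storage node over spare Ethernet backplane bandwidth while
# the archive stays online.

def drain_hours(stored_tb, usable_gbit_per_s):
    """Hours needed to copy a node's stored data off at the given usable rate."""
    bytes_total = stored_tb * 1e12
    bytes_per_second = usable_gbit_per_s * 1e9 / 8
    return bytes_total / bytes_per_second / 3600

# A node holding 6 TB, drained at ~0.5 Gbit/s of spare bandwidth.
print(f"~{drain_hours(6, 0.5):.0f} hours to retire the node")   # about 27 hours
```

Because the copy happens in the background over the same Ethernet fabric the nodes already share, the rest of the system keeps serving requests while the old hardware empties out.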

Maintaining physical readability of data in an archive requires continual migration of data to new technology every few years to ensure that the original bits can be retrieved far into the future. With archives spanning hundreds of terabytes to petabytes of information, these migration processes must be built into the very core of the archive system to avoid enormous process costs and the high risk that data will fail to transfer accurately during one of these migrations. I have seen many migration projects where the migration services cost far more than the new storage system being put in place! Permabit addresses both of these problems with an architecture designed to handle future storage technologies automatically.

This is only half the problem, however. A proper storage architecture will allow you to retrieve the bits that you wrote 100 years ago, but will you be able to make sense of them? Consider a word processing document written 20 years ago on an Apple II computer, common at the time. Even if you could get the bits off your 5 1/4″ floppy, would you be able to extract and format your manuscript? What about data written by a proprietary internal application that ran on your enterprise mainframe 20 years ago; could you extract meaning from those files? This is the problem of logical readability, and the other dirty little secret Joe alluded to in his comment. In my next post I’ll talk more about the challenges here.

3 Comments »

  1. […] Filed under: Jered Floyd — jeredfloyd @ 5:32 pm In my last post in this series I introduced the concepts of physical versus logical readability and explained how getting back your bits in 100 years is a hard problem in itself but is not alone […]

    Pingback by No Silver Bullet: Logical Readability « Permabits and Petabytes — December 5, 2008 @ 5:54 pm

  2. Good, informative article, but I will respectfully disagree on some points. I hated tapes, but used them extensively prior to cd-r. I have admin’d many businesses (almost exclusively small or micro) and have never lost a mission critical dataset. I can actively show you files with late 1980s time/date stamps. Virtualization can allow me to update and maintain this data. One last thought, old hdd’s are a suitable, short to mid term storage medium for non-mission critical datasets.

    Comment by Htos1 — December 10, 2008 @ 11:05 am

  3. […] Filed under: Jered Floyd — jeredfloyd @ 3:47 pm In the first post of this series, I introduced the concepts of physical versus logical readability and explained how getting back your bits in 100 years is a hard problem, but one with solid product […]

    Pingback by No Silver Bullet: Format Best Practices « Permabits and Petabytes — December 11, 2008 @ 3:56 pm

