Permabits and Petabytes

December 11, 2008

No Silver Bullet: Format Best Practices

Filed under: Jered Floyd — jeredfloyd @ 3:47 pm

In the first post of this series, I introduced the concepts of physical versus logical readability and explained how getting back your bits in 100 years is a hard problem, but one with solid product and technology solutions. In my last post, I explained why there’s no simple solution to turning those bits back into information, but showed that careful planning can help you avoid the pitfalls.

So how can you solve the logical readability problem? Primarily by following best practices for data format preservation. Some best practices:

1. Use the simplest data format necessary for the information at hand.

The goal with archival data is long-term readability. Instead of trying to squeeze a few more bits out of your encoding and obfuscating it in the process, use the simplest format appropriate. Use CSV to represent tabular data instead of an Excel spreadsheet, for example.
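
For instance, here’s a minimal Python sketch of writing tabular data as plain CSV rather than a binary spreadsheet; the file name and columns are purely illustrative:

    import csv

    # Tabular data kept as plain text: any future tool that can read
    # comma-separated lines can recover it, no spreadsheet software required.
    rows = [
        ["invoice_id", "date", "amount_usd"],  # a header row documents the columns
        ["1001", "2008-12-11", "149.95"],
        ["1002", "2008-12-11", "89.00"],
    ]

    with open("invoices.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)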

2. Use broadly standardized, extensively documented formats.

Some data formats have been developed by industry standards organizations specifically for data interchange, and are backed by extensive documentation developed by a cross-industry team of experts. Other data formats have been designed by individual corporations to encourage you to buy their latest software, with no public documentation. Which format do you think will be more readable in 100 years?

Note that Microsoft Office Open XML doesn’t count here. It may have an ECMA/ISO standard number, but it short-circuited the normal standardization process and generated massive controversy. The documentation was written by a single organization and lacks significant details on how the format is to be interpreted; this standard was produced as a checkbox to meet purchasing requirements, nothing more.

When in doubt, take a look at the specification for the file format. How long is it? Can you understand it? Does it look complete? Do you think you would be able to write a program to read it (or have a team do so) if you had to? ISO used to be a good place to look before the OOXML debacle; now you’re forced to do a little more research on your own.

With image data, you have options like JPEG (lossy) and PNG (lossless). With audio you have MP3, Ogg Vorbis and FLAC. For tabular data, you can use a simple format like CSV, and make sure to include extensive documentation on how the data is to be interpreted. For most documents, PDF/A is reasonable to consider; it’s a subset of PDF developed in conjunction with the Association for Information and Image Management (AIIM) and standardized within ISO. For productivity documents, consider OpenDocument, a simpler, more complete, and more broadly supported format than OOXML. Even so, consider if a simpler standard might be more appropriate before using OpenDocument or PDF/A.
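
Taking the image case as an illustration, here’s a short Python sketch, assuming the Pillow imaging library is available and using placeholder file names, that converts a scanned TIFF to lossless PNG before it enters the archive:

    from PIL import Image  # Pillow imaging library; assumed to be installed

    # Convert a less common raster format to lossless PNG before archiving,
    # so no pixel data is discarded along the way.
    with Image.open("scan_0001.tif") as img:
        img.save("scan_0001.png", format="PNG")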

When developing a new structured data format, consider a well-standardized framework like XML. XML allows for extensive internal documentation of data, and has related standards like XML Schema that allow document structure to be described and enforced.
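
For example, here’s a brief Python sketch of enforcing document structure with a schema, assuming the third-party lxml library and illustrative file names:

    from lxml import etree  # third-party XML library; assumed to be available

    # Validate an archived XML document against the XML Schema that
    # documents and constrains its structure.
    schema = etree.XMLSchema(etree.parse("records.xsd"))
    document = etree.parse("records-2008.xml")

    if schema.validate(document):
        print("document conforms to the archived schema")
    else:
        print(schema.error_log)  # reports exactly where the structure diverges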

3. Resolve all external references, include format documentation.

If the format you have chosen does not already embed external references such as fonts and linked data, make sure you incorporate them, following the same format guidelines, before the data is archived. Your archived representation must be able to stand on its own.

Additionally, in any repository be sure to include documentation extensive enough that a reader or renderer for the data formats therein could be written solely from that documentation. Documentation in English and pseudocode would be a good start. If your data is in XML format, always include an XML Schema that describes the format, and perhaps XSLT stylesheets for common translations and semantic restrictions.
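
As a rough illustration of the idea, here’s a hypothetical Python sketch that bundles the data, its schema, and human-readable format notes into one self-describing directory before archiving; every name here is a placeholder:

    import shutil
    from pathlib import Path

    # A self-contained archival unit: the data, its schema, and its documentation
    # travel together so the record can stand on its own.
    bundle = Path("archive/records-2008")
    bundle.mkdir(parents=True, exist_ok=True)

    shutil.copy("records-2008.xml", bundle / "records-2008.xml")  # the data itself
    shutil.copy("records.xsd", bundle / "records.xsd")            # structure definition and constraints
    shutil.copy("FORMAT-NOTES.txt", bundle / "FORMAT-NOTES.txt")  # English and pseudocode description of the format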

4. Store your data through well-standardized interfaces.

Never use proprietary interfaces to your archive store! Don’t use a proprietary API, as you have no guarantee of its availability next year, let alone 100 years from now. Block-level access (e.g. iSCSI, FCoE) is also a bad idea: block storage requires a file system to be layered on top of it, and that file system is vulnerable to the same format compatibility challenges as the documents being stored.

Use file-level interfaces like NFS and CIFS, or object-level interfaces like XAM as they become more widely deployed in the industry. For preserving metadata, XAM provides a rich infrastructure for later retrieval and processing; if that metadata were simply placed alongside the document, it might be harder to retrieve in the future.
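
Even the simpler route requires nothing exotic. Here’s a rough Python sketch, assuming an NFS or CIFS share mounted at a placeholder path, that writes a document and a plain-text metadata sidecar using ordinary file operations:

    import json
    from pathlib import Path

    # Write the document and a metadata sidecar over a standard file-level
    # interface (e.g. an NFS or CIFS mount); every path here is illustrative.
    archive = Path("/mnt/archive/contracts/2008")
    archive.mkdir(parents=True, exist_ok=True)

    (archive / "contract-4711.pdf").write_bytes(Path("contract-4711.pdf").read_bytes())
    (archive / "contract-4711.metadata.json").write_text(json.dumps(
        {"title": "Service contract 4711", "created": "2008-12-11", "retention_years": 30},
        indent=2,
    ))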

5. Consider virtual machines only as a stop-gap for data already on the cusp of irretrievability.

A mechanism for preserving data readability that has become increasingly popular is the use of virtual machines: archiving VMs with legacy operating systems and software alongside the data files, so that those files can still be opened and read when the original hardware is no longer available. For data that is already in danger of becoming unreadable, this is a smart first step.

I don’t think it’s a great idea to use VM archiving for long-term logical readability, however. Even if you assume that the VM will remain runnable for a very long period of time, there are many more dependencies; this violates the first recommendation above. Will that archived version of your OS from 1994 still run properly in 100 years? Will the application? Or will a “year 2100” problem, a change in leap year rules, or some other external change mean that the software no longer functions properly in the future world?

VMs provide a sandbox in which you can (usually) run your legacy application again. This lets you process and view your data in that application, but what if you want to integrate it into your modern process flow? Unless the application can export to a newer data format (in which case you should do that for all of the data and dispose of the VM), you’re trapped looking at your data through the window of the VM. Your data remains locked away where you can see it, but not touch it. This is a dangerous place to find yourself many years later!

For these reasons I recommend VM archives as only a stop-gap measure. Your data should be written in a well-documented or self-documenting manner from the start, and older data should be translated to a future-proof format as quickly as possible.


As you can see, there’s no silver bullet for maintaining logical readability for data in an archive; good planning is necessary up front, and vigilance is required to spot upcoming format problems before they become overly expensive to solve.

Properly protecting your archived data over a long period of time requires smart storage, smart software, and smart planning. Permabit Enterprise Archive provides an architecture that completely solves your physical readability challenges, leaving only those of logical readability. Standard NFS, CIFS and WebDAV interfaces ensure that your data will always be accessible into the future. To help with logical readability we’ve partnered with companies like Atempo and Symantec, but for much of your data you must sit down, review it, and plan explicitly for the future.


2 Comments

  1. I agree that OOXML is trash but can’t agree with your endorsement of OpenDocument Format (“ODF”).

    ODF is in the same class of egregious under-specification as OOXML. For example, ODF v. 1.1 contains a maximum of 227 conformance requirements whilst containing up to 3,984 options and recommendations. See more detailed stats, methodology, and caveats in the appendix to the document downloadable from http://www.universal-interop-council.org/node/37

    All of those “may,” “optional” and “should” terms mask hard-coded programming decisions in implementing apps. They largely represent application dependencies on the overwhelmingly market-leading implementation, OpenOffice.org (“OOo”) and its various clones.

    ODF also allows application-specific elements and attributes and OOo itself writes some 150 extensions to ODF, largely in the form of document settings.

    Such under-specification is the major reason that non-lossy round-trip ODF interoperability has never been demonstrated except among OOo and its clones.

    What is missing from both ODF and OOXML is compliance with ISO/IEC JTC 1 Directives Annex I. International standards are to “specify clearly and unambiguously the conformity requirements that are essential to achieve the interoperability. Complexity and the number of options should be kept to a minimum and the implementability of the standards should be demonstrable.” http://www.jtc1sc34.org/repository/0856rev.pdf, pg. 145.

    The rot at JTC 1 did not begin with OOXML. The adoption of ODF as an international standard stood first as a giant precedent for adoption of grossly under-specified international standards in the IT sector.

    “ODF interoperability” is a complete and utter myth spun by IBM and its camp followers. How one may ethically characterize a standard as “open” when it is so grossly under-specified remains an unsolved mystery.

    Certainly, neither ODF nor OOXML fulfill the minimum requirements of the Agreement on Technical Barriers to Trade. An international standard must specify [i] all characteristics [ii] of an identifiable product or group of products [iii] only in mandatory “must” or “must not” terms. WTDS 135 EC – Asbestos, (World Trade Organization Appellate Body; March 12, 2001; HTML version), ¶¶ 66-70, http://www.wto.org/english/tratop_e/dispu_e/cases_e/ds135_e.htm; reaffirmed and further explained, WTDS 231 EC – Sardines, (World Trade Organization Appellate Body; 26 September 2002), pp. 41-51, http://www.wto.org/english/tratop_e/dispu_e/cases_e/ds231_e.htm.

    The above factors lend added weight to your sage advice to “consider if a simpler standard might be more appropriate before using OpenDocument[.]”

    Comment by Paul E. ("Marbux") Merrell, J.D. — December 12, 2008 @ 7:54 am

  2. Paul,

    I agree that ODF isn’t a perfect choice and warn that simpler formats are better than OpenDocument; perhaps I should emphasize that more.

    You’re right that ODF and OOXML both have the extensions and underspecification problem, and just because it’s 10 times better with ODF doesn’t mean it’s not still a problem for long-term archival data storage. My understanding is that ODF wasn’t “Fast Track” so shame on ISO for not working out these issues in committee.

    I’m a bit confused, though; in searching on your nom de plume I find a lot of vociferous advocacy for ODF. Have you since changed your mind on the standard, or are you just trying to moderate the enthusiasm? :-)

    Regards,
    –Jered

    Comment by jeredfloyd — December 15, 2008 @ 1:04 pm

