In the first post of this series, I introduced the concepts of physical versus logical readability and explained how getting back your bits in 100 years is a hard problem, but one with solid product and technology solutions. In the last post, I explained why there’s no simple solution for turning those bits back into information, but that careful planning can help you avoid the pitfalls.
So how can you solve the logical readability problem? Primarily by following best practices for data format preservation. Some best practices:
1. Use the simplest data format necessary for the information at hand.
The goal with archival data is long-term readability. Instead of trying to squeeze a few more bits out of your encoding and obfuscating it in the process, use the simplest format appropriate. Use CSV to represent tabular data instead of an Excel spreadsheet, for example.
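As a minimal sketch of what "simplest format" can look like in practice, the following Python writes a small table as CSV using only the standard library. The column names, the sample rows, and the `to_csv_text` helper are hypothetical and chosen purely for illustration.

```python
import csv
import io


def to_csv_text(header, rows):
    """Serialize a table as comma-separated text with a header row."""
    buffer = io.StringIO(newline="")
    # The default "excel" dialect is comma-delimited and ends lines with CRLF.
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data; units are spelled out in the column name.
header = ["sample_id", "collected_on", "temperature_c"]
rows = [
    ["A-001", "2009-03-14", 21.5],
    ["A-002", "2009-03-15", 19.0],
]

csv_text = to_csv_text(header, rows)
print(csv_text)
```

The result is plain text that can be read in any editor, which is the property that matters for an archive.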
2. Use broadly standardized, extensively documented formats.
Some data formats have been developed by industry standards organizations specifically for data interchange, and are backed by extensive documentation developed by a cross-industry team of experts. Other data formats have been designed by individual corporations to encourage you to buy their latest software, with no public documentation. Which format do you think will be more readable in 100 years?
Note that Microsoft Office Open XML doesn’t count here. It may have an ECMA/ISO standard number, but it took a shortcut around the normal standardization process, which resulted in massive controversy. The documentation was written by a single organization and lacks significant details on how the format is to be interpreted. This standard was done as a checkbox to meet purchasing requirements, nothing more.
When in doubt, take a look at the specification for the file format. How long is it? Can you understand it? Does it look complete? Do you think you would be able to write a program to read it (or have a team do so) if you had to? ISO used to be a good place to look before the OOXML debacle; now you’re forced to do a little more research on your own.
With image data, you have options like JPEG (lossy) and PNG (lossless). With audio you have MP3 and Ogg Vorbis (both lossy) and FLAC (lossless). For tabular data, you can use a simple format like CSV, and make sure to include extensive documentation as to how the data is to be interpreted. For most documents PDF/A is reasonable to consider; it’s a subset of PDF developed in conjunction with the Association for Information and Image Management (AIIM) and standardized within ISO. For productivity documents, consider OpenDocument, a simpler, more complete, and more broadly supported format than OOXML. Even so, consider whether a simpler standard might be more appropriate before using OpenDocument or PDF/A.
When developing a new structured data format, consider a well-standardized framework like XML. XML allows for extensive internal documentation of data, and has related standards like XML Schema that allow document structure to be described and enforced.
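To illustrate what internal documentation in XML might look like, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The element names (`measurement`, `documentation`, `sampleId`, `temperature`), the `formatVersion` and `unit` attributes, and the `build_record` helper are all invented for this example. Note that the standard library does not validate against XML Schema, so this shows only the self-describing part, not schema enforcement.

```python
import xml.etree.ElementTree as ET


def build_record(sample_id, temperature_c):
    """Build one self-describing record and return it as XML text."""
    root = ET.Element("measurement", attrib={"formatVersion": "1.0"})

    # Human-readable documentation travels inside the record itself.
    doc = ET.SubElement(root, "documentation")
    doc.text = "One temperature reading; 'temperature' is in degrees Celsius."

    ET.SubElement(root, "sampleId").text = sample_id

    # The unit is stated explicitly rather than left to convention.
    temperature = ET.SubElement(root, "temperature", attrib={"unit": "celsius"})
    temperature.text = str(temperature_c)

    return ET.tostring(root, encoding="unicode")


xml_text = build_record("A-001", 21.5)
parsed = ET.fromstring(xml_text)  # round-trip to confirm it is well-formed
print(xml_text)
```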
3. Resolve all external references, include format documentation.
If the format you have chosen does not already include external references like fonts and linked data, make sure you incorporate these, following the same format guidelines here, before the data is archived. Your archive representation must be able to stand on its own.
Additionally, in any repository be sure to include extensive documentation so that a reader or renderer for the data formats therein could be written solely from that documentation. Documentation in English and pseudocode would be a good start. If your data is in XML format, always include an XML Schema that describes the XML format, and perhaps XSLT for common translations and semantic restrictions.
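One way to spot unresolved references before archiving is to scan a document for links that point outside the archive. The sketch below does this for an XML document; it assumes that references live in attributes named `href` or `src` and that anything containing `://` is external. Both are simplifying assumptions for this example, and real formats will need their own rules. The sample document and the `find_external_references` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these attribute names carry references in the format at hand.
REFERENCE_ATTRIBUTES = ("href", "src")


def find_external_references(xml_text):
    """Return attribute values that look like URLs pointing outside the archive."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        for name, value in element.attrib.items():
            # Drop a namespace prefix such as "{http://www.w3.org/1999/xlink}".
            local_name = name.rsplit("}", 1)[-1]
            if local_name in REFERENCE_ATTRIBUTES and "://" in value:
                found.append(value)
    return found


# Hypothetical document: one local image, two external references.
document = """<report>
  <image src="figures/chart.png"/>
  <image src="http://example.com/logo.png"/>
  <link href="https://example.com/data.csv"/>
</report>"""

external = find_external_references(document)
for reference in external:
    print("unresolved external reference:", reference)
```

Anything this reports would need to be fetched and stored in the archive, in a format that follows the same guidelines, before the document can stand on its own.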
4. Store your data through well-standardized interfaces.
Never use proprietary interfaces to your archive store! Don’t use a proprietary API, as you have no guarantee of its availability next year, let alone 100 years from now. Block-level access (e.g. iSCSI, FCoE) is also a bad idea. Block storage requires a file system to be layered on top of it, and that file system is vulnerable to the same format compatibility challenges as the documents being stored.
Use file-level interfaces like NFS and CIFS, or object-level interfaces like XAM as they become more widely deployed in the industry. For preserving metadata, XAM provides rich infrastructure for later metadata retrieval and processing, whereas if this data were simply placed alongside the document it might be harder to retrieve in the future.
5. Consider virtual machines only temporarily for data already on the cusp of irretrievability.
A mechanism for data readability preservation that has become increasingly popular is the use of virtual machines: archiving VMs with legacy OSes and software along with the data files so that they can still be opened and read when the original hardware is no longer available. For data that is already in danger of becoming unreadable, this is a smart first step.
I don’t think it’s a great idea to use VM archiving for long term logical readability, however. Even if you assume that the VM will remain possible to run for a very long period of time, there are many more dependencies — this violates the first recommendation above. Will that archived version of your OS from 1994 still run properly in 100 years? Will the application? Or will a “year 2100” problem, or a change in leap year rules, or some other external change mean that the software no longer functions properly in the future world?
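To make the "year 2100" concern concrete, here is a small sketch comparing a hypothetical shortcut ("every fourth year is a leap year") with the actual Gregorian rule. The `naive_is_leap` function is an invented stand-in for the kind of assumption old software might carry; it is not taken from any particular product.

```python
import calendar


def naive_is_leap(year):
    """Hypothetical shortcut: treat every fourth year as a leap year."""
    return year % 4 == 0


def gregorian_is_leap(year):
    """Gregorian rule: century years are leap years only if divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Years between 1994 and 2200 where the shortcut gives the wrong answer.
disagreements = [
    year for year in range(1994, 2201)
    if naive_is_leap(year) != gregorian_is_leap(year)
]
print(disagreements)

# The standard library agrees with the full rule.
print(calendar.isleap(2000), calendar.isleap(2100))
```

The shortcut happens to be right for 2000, which is exactly why such a bug could sit unnoticed inside an archived VM until 2100.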
VMs provide a sandbox in which you can (usually) run your legacy application again. This allows you to process and view your data in that application, but what if you want to integrate it into your modern process flow? Unless the application can export to a new data format, in which case you should do that for all the data and dispose of the VM, you’re trapped looking at your data through the window of the VM. Your data remains locked away where you can see it but not touch it. This is a dangerous place to find yourself many years later!
For these reasons I recommend VM archives as only a stop-gap measure. Your data should be written in a well-documented or self-documenting manner from the start, and older data should be translated to a future-proof format as quickly as possible.
As you can see, there’s no silver bullet for maintaining logical readability for data in an archive; good planning is necessary up front, and vigilance is required to spot upcoming format problems before they become overly expensive to solve.
Properly protecting your archived data over a long period of time requires smart storage, smart software, and smart planning. Permabit Enterprise Archive provides an architecture that completely solves your physical readability challenges, leaving only those of logical readability. Standard NFS, CIFS and WebDAV interfaces ensure that your data will always be accessible into the future. To help with logical readability we’ve partnered with companies like Atempo and Symantec, but for much of your data you must sit down, review your data, and plan explicitly for the future.