In my last post in this series I introduced the concepts of physical versus logical readability and explained how getting back your bits in 100 years is a hard problem in itself but is not alone sufficient for a complete archive. Being able to store and retrieve bits accurately — maintaining physical readability — over a long period of time is critical to an archive, as is being able to do so cost-effectively, but it is not enough. Logical readability, the ability to interpret what those bits mean, must be maintained as well, and this is a much harder problem that cannot be solved by technological means alone.
Modern electronic storage consists of binary data, ones and zeros. The physical encodings are complex and analog in nature and change frequently with advances in technology, but the data represented is always binary. This has not always been the case, as in the analog tapes from the Lunar Orbiter that I wrote about last time, but for fundamental mathematical reasons data is almost certain to be representable in binary going forward. Storing and retrieving a bitstream is the physical readability challenge.
Beyond those bits, though, we have a Tower of Babel of conventions on what those bits mean. Today a byte is almost universally 8 bits long, but this has only been true since the 1980s. Computing architectures prior to the widespread adoption of the PC had byte sizes ranging from 6 to 9 bits. The influential Multics system, developed at MIT, used 9-bit bytes and 36-bit words, producing data formats incompatible with modern 8-bit-byte architectures. Even reading data off of old tapes is a significant challenge! Beyond this, different architectures have different endianness: the order in which the bytes of a longer word are written. All of this is before you even reach the application level.
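To make the endianness problem concrete, here is a minimal Python sketch showing how the same 32-bit integer is laid out as two different byte sequences, and how reading bytes under the wrong convention silently yields the wrong number:

```python
import struct

value = 0x01020304  # a 32-bit integer

# The same integer serialized under the two byte orders:
big = struct.pack(">I", value)     # big-endian: most significant byte first
little = struct.pack("<I", value)  # little-endian: least significant byte first

print(big.hex())     # 01020304
print(little.hex())  # 04030201

# Interpreting little-endian bytes as big-endian gives a different value
# entirely -- with no error or warning to tell you something went wrong:
misread = struct.unpack(">I", little)[0]
print(hex(misread))  # 0x4030201
```

An archive that stores raw binary records without documenting the byte order leaves future readers to guess which of these two interpretations was intended.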
Simple text documents seem like they should be easy to keep readable, but even there we have had two competing encodings: the more standard ASCII and IBM’s legacy mainframe EBCDIC. Reading an EBCDIC document on a modern computer without conversion just results in mumbo-jumbo. Beyond that, formats get even more complex. Word processor, spreadsheet, and presentation document formats change with every version so the vendor can add more features and, more importantly, force everyone else to upgrade to the latest version. In general, office productivity software has maintained good backwards compatibility, but some file formats eventually fall by the wayside and can no longer be read. Consider another place critical data frequently lives: databases. Database formats are outrageously complicated and change frequently as vendors try to eke out a little more performance.
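A short Python sketch illustrates the ASCII/EBCDIC split, using the standard cp037 codec (one common EBCDIC code page). The same five letters are entirely different bytes in the two encodings, and decoding with the wrong one produces nonsense rather than an error:

```python
# "HELLO" encoded in EBCDIC (code page 037)
ebcdic_bytes = "HELLO".encode("cp037")
print(ebcdic_bytes.hex())  # c8c5d3d3d6 -- nothing like the ASCII byte values

# Decoding those bytes as if they were Latin-1/ASCII-compatible text
# yields mumbo-jumbo, with no indication anything is wrong:
print(ebcdic_bytes.decode("latin-1"))  # ÈÅÓÓÖ

# Recovering the text requires knowing the original encoding:
print(ebcdic_bytes.decode("cp037"))  # HELLO
```

The lesson for archiving is that even "plain text" is only plain if the character encoding is recorded alongside the bytes.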
The problem gets even worse when you consider the software that generates much of the internal business data out there — home-grown software that writes in ad-hoc, undocumented formats that are supported by no other products. The source code may have been lost, and the original developers long since gone. When time comes to move to a new computing platform, this data can become orphaned.
Beyond this, most documents are not wholly self-contained. Word processing documents make external reference to fonts that are not included in the document itself. Images may also be included by external reference. Spreadsheets can be linked to databases. If one of these external components disappears, the document becomes incomplete.
I was speaking at a conference several years ago on “The 100-Year Archive” and the challenges above, and gave my recommendations on how best to avoid these problems, which I will repeat in this series. At the end there were a number of audience questions. The one that sticks in my head was this — one woman stood up and was very upset that the SNIA had not undertaken a technical project to solve the logical readability problem. “Why hasn’t SNIA developed an archival format for all data?” she asked. Her employer had many terabytes of research data generated ten years prior. The equipment and software that generated this data was no longer available, and there was no documentation of the data format. They were painstakingly reverse engineering the format and writing software to translate it to a new one. Why wasn’t there some software that would just do this for them?
There is no magic software, and no magic data format, that will never become obsolete. As with maintaining physical readability, maintaining logical readability is something you have to prepare for and plan right from the start. Unlike with physical readability, though, the pain of planning poorly is not obvious until it is too late. You can’t write your Rosetta Stone after the language has already gone extinct!
In the next post in this series, I will provide some best practices on how you can avoid being stuck deciphering legacy formats like a code without a key.