Proceedings

Beyond Eighteen Wheels

Considerations in Archiving Documents Represented Using the Extensible Markup Language

Liam R. E. Quin

XML Activity Lead

The World Wide Web Consortium

International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML
August 2, 2010

Copyright © Liam Quin, 2010

How to cite this paper

Quin, Liam R. E. “Beyond Eighteen Wheels: Considerations in Archiving Documents Represented Using the Extensible Markup Language.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). DOI: 10.4242/BalisageVol6.Quin01.

Abstract

When documents are stored for any significant length of time, or when they are used, whether continuously or occasionally, over an extended period, the original people and culture and context associated with their creation become unavailable. If the documents are to remain useful, it is necessary to retain sufficient knowledge about how they can be used that the future people involved can still gain value from them.

This document is a position paper for discussion.

Table of Contents

Introduction
Definitions
An XML Document
A Long Time
Storing
Documents as Communication
Document Context
Navigation and Finding Aids
The Politics of Selection
How to Archive?
The Physical Substrate
The Logical Layers
Coded Character Sets
Fonts and Glyphs
Extensible Markup Language (XML)
Ancillary Formats
Multiple Copies, Multiple Locations.
Summary
Designing XML-based Formats for Longevity
Avoid Implicit Content
Avoid Obscure Features
Avoid Cryptic Names
Document the significance of markup items
Validate the Data
Mean What You Mean To Mean
Check Links
Provide for Translations
Provide for Contextualization
Don't be Inventive
Summary
XML or Not XML?
Textual Format
Explicit End Markers
Embedded Usage
Device Independent
Open Specification
Conclusions

Introduction

The requirements for archiving a document for a hundred years are very different from the requirements for archiving the same document for one year, or for five years. As we try to prepare for even longer term storage, the number of unknowns increases greatly.

This paper suggests ways to manage some of those unknowns, and to prepare for as many of them as possible: in other words, ways to store XML-encoded documents in such a way as to have some reasonable expectation that they can be decoded at some unspecified time in the future and used.

Definitions

An XML Document

Before we can talk about storing documents for a long time, we must decide what we mean by the term document. This is not sophistry: an XML document generally consists of multiple parts, and, as will be discussed further, we must be careful in determining the boundary of the document for the purpose of preservation. Let the term XML Document, then, in this paper, denote a sequence of characters represented digitally on a computer such that the sequence of characters satisfies the productions and constraints of the XML specification. We shall be neither more nor less precise, but rather shall qualify or expand upon this base term as needed. Our definition excludes non-digital representations: if one were to print out an XML document onto paper, and then make a video of the paper, the video would not, by our definition, constitute an XML document.

A Long Time

The overview for the Balisage pre-conference symposium suggests that a long time, in terms of document storage, could be for as long as a thousand years. The author of this document, however, suggests that A Long Time, however long it may actually be, in any case starts now, in the present, and gradually lengthens. What is significant is that the people who created the document are not the people who decode it, and that the context of that decoding is not necessarily the same as the social, technological or political context in which the document was encoded.

Storing

By our definition an XML Document exists only within computer storage. A document will be said to have been stored for a given period of time if, at the end of that time, the same sequence of characters can be retrieved. This definition permits changes in the encoding of the document, for example from UTF-8 to UTF-16, as long as the sequence of encoded characters remains the same. Our definition is also silent on whether the document may be inaccessible at times during the stored period.
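The distinction between stored bytes and the stored character sequence can be illustrated with a minimal Python sketch. The function name same_stored_document and the sample text are hypothetical, chosen only for illustration; the sketch assumes that the encoding of each copy is known.

```python
def same_stored_document(bytes_a, encoding_a, bytes_b, encoding_b):
    """Compare two stored copies by the characters they decode to,
    not by their bytes (hypothetical helper, for illustration only)."""
    return bytes_a.decode(encoding_a) == bytes_b.decode(encoding_b)


# An illustrative document, stored once as UTF-8 and once as UTF-16.
original = "<greeting>Gr\u00fc\u00dfe</greeting>"
as_utf8 = original.encode("utf-8")
as_utf16 = original.encode("utf-16")  # Python prepends a byte order mark

bytes_differ = as_utf8 != as_utf16
characters_match = same_stored_document(as_utf8, "utf-8", as_utf16, "utf-16")
```

Under this definition the two copies count as the same stored document even though their bytes differ, which is why the encoding label has to travel with the bytes.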

One example of a document becoming inaccessible might be a so-called dark archive, created to hold copyrighted information until such time as the copyright expires. That is a political or social inaccessibility: the document might be accessible technically, but perhaps not legally. Documents can easily become technically inaccessible. A promise of XML is that the format is open, and, unlike, say, a proprietary word processing format, will still be readable in the future. The same promise was made of SGML documents, but few people make SGML software today. Fortunately, software exists to convert SGML documents to XML, but doing so correctly requires human expertise in both formats in order to select correct options. In order to retain technical accessibility, then, documents may need to be migrated between formats. If this is not done, however, keeping the format definition along with the documents may facilitate such work after a Very Long Time.

Documents as Communication

The Extensible Markup Language sees a great many uses in a wide range of applications. We may classify XML documents in many ways, but for the purpose of archiving, let us consider a document as a form of communication, with at least one speaker and zero or more listeners. Any combination of listener and speaker may be an automated process or may be a human (or perhaps some other sentient being). We shall use the term machine to denote an agent that is not sentient, and person to denote a sentient agent, regardless of whether the person or the machine is human, robotic, dolphin or even alien.

We can sort documents, then, by the creators and by the intended audience, in terms of one's likely interest in archiving the documents:

Table I

Creators    Audience    Interest
Machine     Machine     Very Low
Machine     Person      Low
Person      Machine     Low
Person      Person      High

For the purpose of this ranking, multiple agents are treated as equivalent to a single agent, but a heterogeneous group of agents containing at least one person is considered to be a person.

The interest in archiving machine-machine communication is low because the (human) programmers of our computers decided that the messages were not of interest to humans, and because instead of archiving actual messages, it is customary to archive logs of many such messages. For example, it is not usually of interest to record the mouse pointer location on a computer screen when a particular icon was clicked, except in human-computer interaction usability research; archiving the details of that event is unlikely to be of use to anyone, even where, as with XCB [XCB], it was in XML. Even a usability researcher would be unlikely to find it useful without further information, such as what icon was displayed there, what task the user was attempting, whether she was tired, and so forth. The machine-to-machine message in this example is not complete without its context, and does not constitute a sustained rhetoric or literature. Most machine-to-machine communication happens entirely without human intervention, or with only indirect intervention. Note that a log file, recording a summary of such communications, is an instance of machine to person communication, and is in a different category.

Machine to Person communication designates machine-generated documents, and these constitute a broad range of literature, ranging from error messages and log files to random poetry. In some cases it is more interesting to archive the algorithms used to generate the random poetry, perhaps with some examples. In this paper we will consider randomly generated literature to be a subset of person to person communication, with the creator being the person who wrote the program and therefore controlled the domain of discourse. Machine-generated documents such as error messages that are part of a larger context can be archived only as part of a larger context, as we shall see. It would make little sense to archive a document whose entire content was "File not found," although files containing such messages can be a great source of confusion to users.

Person to Machine communication might include computer programs and scripts, XSLT transformations, manually-generated SOAP messages, and much more. Computer programs can be archived, but are mostly outside the scope of this document. We shall consider XSLT documents in a later section.

Person to person communication, mediated through XML documents, includes all manner of electronic mail, instant message conversation, poetry, visual arts, music, virtual sculpture, prosody and erotica, research and reflection. This is the material that first comes to most people's mind when they consider archiving beyond a short-term computer backup. It is the stuff of libraries and of museums, the artifacts of our culture.

For the sake of completeness, let us be clear that our primary concern is archiving for subsequent human (Person) retrieval and study.

Document Context

Documents do not exist in isolation. They are created, stored and retrieved using digital computers. When a Person reads a document, or part of a document, what is understood is, as with any piece of literature or cultural artifact, bounded on all sides by that culture. Once the context of creation is lost, understanding of the artifact is necessarily incomplete. Documents are part of a culture that includes other documents as well as social conventions and shared knowledge. We shall use the term External Context to denote the environment, political, social, technological and cultural, surrounding a document.

How an ancient object was used is often a mystery. Similarly, there are documents in existence which appear to be works of literature of some kind, but which cannot now be deciphered at all. In some cases, such as the Phaistos Disc, there is almost no surviving context at all. In others, such as the one-time key encryption used by Dee and Kelly in Bohemia [Liu2005], there is some knowledge about the purpose of the documents, but not necessarily sufficient information to decode them. In other cases, knowledge is incomplete.

Misunderstandings, corruption in copying, ignorance, politics and flawed textual theories have all come into play with documents such as Biblical translations. Famously, “Peace on Earth and good will to all men” is now considered to be more accurately rendered, “Peace on Earth to all men of good will.” Was the genitive case not noticed by the translators of the King James Bible, or did it not exist in their primary texts, or did they choose a silent emendation? We might like to imagine that when we archive computer texts, we will not have problems with corruption. But in practice it is not corruption but a lost context that is the problem. Consider the way that the English language has changed in as short a time as three hundred years: “indifference” was a word that in the 1700s meant without difference, impartial; today a statement that God metes out justice with indifference would not be considered respectful.

In order to reconstruct the significance of a text, then, recipients of our putative archived XML Document, a Very Long Time from now, will need to understand not only the computer formats we have used but also the natural-language parts of our text. For XML documents, where it is common practice to embed natural-language terms in markup as element names or identifiers, natural language appears not only in document content, but also in the actual document format, in the markup.

Language context is only one part of the wider document context. A funeral oration might be perceived quite differently from a shopping list; a parody differently from a news article. The expected use and implicit shared understanding between document creator and audience in these examples can be lost by the Very Long Time; this tacit knowledge must therefore be documented and made explicit if the archived document is to be interpreted as it was intended.

As C. Michael Sperberg-McQueen pointed out in reviewing a draft of this paper, the effort for an author in adding extra background information may seem onerous, and may require a very different sort of skill than writing the document. In a corporate environment, sanctions are generally available to require additional information to be of at least minimal accuracy and completeness; in a university or library environment, the mandate for making tacit knowledge manifest as explicit knowledge (see [Applen2009], chh. 1 and 3) may fall to the archivist.

Socio-political contexts, organizational contexts, corporate cultures and fashion can also all affect the interpretation of documents. One cannot archive an entire culture in order to explain a single document, but neither can one understand an entire culture from a single document.

Every document necessarily stands in some relationship to other documents. That relationship can be implicit or explicit. An example of an implicit relationship might be that a dictionary gives definitions or explanations for words found in other documents. An explicit link might point from a person's name mentioned in one document to a biography of that person in another document.

Information about the external contexts in which a document was created, and was intended to be understood, then, are generally necessary in order to understand that document correctly. These external contexts include language and culture of the creators of documents as well as of the intended recipients, the other documents created or preexisting within those contexts, and also the intended purposes and audiences of the documents.

Navigation and Finding Aids

Large modern research libraries and rare book libraries often store the books away from the public: to see a book one must request it explicitly. In such a world there is no browsing, and serendipitous discoveries have been outlawed. One cannot discover tucked inside an otherwise uninteresting volume a transcription of a poem whose only extant Anglo-Saxon manuscript copy had been destroyed in a fire. Forgotten poets remain forgotten, sometimes for the good of mankind and sometimes regrettably.

A putative future patron of a digital library may be at a great disadvantage compared to today's visitor to a rare book collection: that of expectation. One might reasonably expect a rare book library to have a copy of Plato's Republic, of Moxon on printing [Moxon1683], or of something printed by Caxton or Aldus Manutius. But the digital collection might contain a thousand terabytes constituting an ultra-high resolution scan of the skin of an earthworm, or the accumulated income tax returns of retired Cornish clergymen, or the entirety of Moroccan twentieth-century literature. What files to request?

This question of how to make a selection is not new to anyone working in the fields of archiving, but it is certainly new to many computer engineers, the people most likely to be constructing digital archives. An overview of the contents of an archive is of critical importance and, in the end, may be a significant deciding factor in which archives survive. Controversial works might need to be hidden; at different times in history many works have been defaced or destroyed for ideological reasons. In a world of automated search it might seem that such strategies cannot succeed; after a Very Long Time what seems today Controversial may in any case seem banal or common-place. We should remember, however, that full-text search is generally accomplished using software that, over time, will probably cease to function unless it is actively maintained.

An archive, in any case, needs an overview, a Finding Aid, that gives the reader an idea of the sorts of thing that one might find in the collection, and perhaps delves down by category into subsections of the collection.

The Politics of Selection

George Landow writes (pp. 267ff) about the politics of hypertext; of particular relevance here is the idea that providing easier access to some documents implies harder access to others: the choice of which documents to archive is (or can be) a political decision every bit as much as decisions about which books to keep on the shelves in a public library.

It is not technically feasible to archive all documents. Even if we restrict ourselves to the domain of person-to-person communication, we still find that the sheer volume of electronic mail, especially when spam is included, simply makes it harder to find information later. In addition, privacy concerns mean that it is not always desirable to archive everything. Some public libraries now routinely delete book borrowing information, so that they cannot be required to identify which books a particular individual may have read. If not all documents are to be archived, some documents must be rejected. In a corporate research environment it might be that reports are archived, but not research notes, for example. Yet, in the future, those notes might be considered a highly valuable resource for understanding and validating (or otherwise) the findings of the reports.

When a Document is archived then, one should consider archiving secondary documents in the same collection; however, this increases the burden on Finding Aids and on Archive Structure.

How to Archive?

After selecting which documents are to be preserved, and (explicitly or implicitly) which are to be destroyed, or at best left to their own chances, after the decision to create an archive has been made, one must determine the methodologies to be employed. After the why and the what comes the how.

The Physical Substrate

It is not reasonable to expect modern computer storage devices to remain functional for A Very Long Time. Typical values for A Very Long Time for most computer equipment today are measured in thousands of hours, not thousands of years. Magnetic tapes degrade over time, as do optical storage media such as compact discs. Active devices such as rotating hard drives have dependencies on voltage and current levels, on specific versions of software drivers for specific operating systems, and, since they contain firmware, may also fail after a specific date.

It is possible to run a digital archive in such a way that data is periodically migrated to newer media. Such a strategy assumes a continued supply of funding and replacement media. It is also possible for an organization to rely on external archiving services, but this does not solve the question of ensuring that A Very Long Time is sufficiently great.

Finding suitable physical media for long-term storage of digital data remains an unsolved problem at this time.

The author of this document once unpacked a computer system; inside the box was also a manual in many ring-bound volumes, and it was necessary to open shrink-wrapped stacks of hole-punched paper and insert them into the proper binders. One of these manuals contained a chapter explaining how to off-load the box with a fork-lift truck, and how to open the box. Of course, in order to discover these instructions, it was necessary to open the box. When archiving for a Very Long Time, it is important to label the archive in multiple languages, with a pen, on the outside of the box. Who, in a thousand years from now, would guess that an object clearly marked 90 Minutes Audio Cassette actually contained a computer program? If the instructions for unpacking the archive are inside the archive, how will they be used?

The Logical Layers

Computer users are accustomed to metaphors presented by graphical user interfaces. For example, a Folder is used as a metaphor for a group of documents. But the actual implementation of a File System on most operating systems today involves a list of hard disk block numbers or storage extents. It might be that a future data archaeologist will have to inspect those individual disk blocks and piece them together. This process is made considerably easier if files larger than a single block are in plain text wherever possible, rather than (for example) being compressed. In addition, the fewer layers that must be penetrated, the easier the task, so store individual files in folders (directories) rather than in binary formats such as zip or tar archives. The process of reconstructing data from a damaged CD-ROM or hard drive is tedious, but today at least it is a known skill; many skills fall into disuse, and today few people can repair a hole in a saucepan, sharpen a wooden ploughshare, or correctly aim a ballista. If an important archive is to be stored for a long time, the layout of the storage system file systems must be documented on a separate physical medium.

A text file is in actual fact stored digitally in a way that could be thought of as a sequence of integers, with an implied mapping from integers to characters and from character sequences to visible representations known as glyphs. The mapping from integers to characters is known as an encoding; the mapping from characters to glyphs is implemented by fonts.
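As a minimal sketch of why the implied mapping matters, the following Python fragment decodes one stored sequence of integers under two different encodings. The byte values and the choice of ISO 8859-1 and ISO 8859-15 are assumptions made for illustration; the point is only that the integers alone do not determine the characters.

```python
# The stored file: nothing but a sequence of small integers.
stored = bytes([0x50, 0x72, 0x69, 0x63, 0x65, 0x3A, 0x20, 0x35, 0x20, 0xA4])

# The same integers read under two closely related encodings.
as_latin1 = stored.decode("iso8859_1")
as_latin9 = stored.decode("iso8859_15")

# Report every position at which the two readings disagree.
differences = [
    (index, a, b)
    for index, (a, b) in enumerate(zip(as_latin1, as_latin9))
    if a != b
]
```

Here the two readings agree on every character except the last, where the integer 0xA4 is a generic currency sign in one encoding and a Euro sign in the other. Without a label, a future reader has no way to tell which was meant.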

Coded Character Sets

A Coded Character Set, or Encoding, is a mapping from integers (or, more properly, codes of some sort) into characters. Some encodings are context sensitive, so that the same integer may map to different characters in different contexts; ISO-2022-JP is an example of such an encoding mechanism. Others are context-free, so that the same integer always maps to the same logical character. Over time, character encodings tend to be modified, for example by introducing the Euro sign, or by fixing minor bugs. There is no general concept of version numbers for encodings, however, so that there is no way to determine which historical version of a given encoding was in use when a document was created. Some encodings (most notably IBM EBCDIC) also have many variations, with no overall consistent, standard naming scheme.
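Context sensitivity can be seen in a small Python sketch using the standard iso2022_jp codec. The variable names are hypothetical; the escape sequences are the ones the encoding uses to switch between ASCII and the JIS X 0208 character set.

```python
# Escape sequences that change how the following bytes are interpreted.
ESC_TO_JIS = b"\x1b$B"    # switch to the two-byte JIS X 0208 set
ESC_TO_ASCII = b"\x1b(B"  # switch back to ASCII

pair = b'$"'  # the two integers 0x24 and 0x22

# Read in the initial (ASCII) context: two separate characters.
in_ascii_context = pair.decode("iso2022_jp")

# Read after switching context: the same two integers form one character.
in_jis_context = (ESC_TO_JIS + pair + ESC_TO_ASCII).decode("iso2022_jp")
```

The same two integers yield a dollar sign and a quotation mark in one context and a single hiragana character in the other, so a data archaeologist who finds a fragment of such a file without the preceding escape sequence cannot decode it reliably.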

For the purpose of archiving data for a long time, it is clearly essential to label all character encodings used, and to include, along with the archived data, copies of the specifications for the encodings. Note that in order to read these specifications, people may need to decipher at least some of the encodings used!

Fonts and Glyphs

An encoding transforms the stored computer file, which we consider to be a sequence of integers, into a sequence of logical characters. However, in order for a person to be able to make sense of the information, the characters must be presented in some readable form. In the West this is most often done using alphabetic symbols from the Latin script. The software that controls the mapping from characters to glyphs is a Text Layout Engine. This software generally reads tables to indicate that particular sequences of characters are to be displayed as particular sequences of character shapes; those replacement sequences and the corresponding character shapes (glyphs) are defined in Fonts. In order to read a computer file then, the integers must be mapped to characters, the characters to glyphs, and the glyphs rendered on a screen, paper, or other device.

A font is really a piece of software that implements a typeface design. Current font technology, especially OpenType, includes procedural machine code in a language called TrueType; it would be unreasonable to expect that software written today will still be runnable twenty years from now. Therefore, as part of archiving a document for A Very Long Time, we must also archive depictions of the glyphs, perhaps as bitmap images, along with documentation of the bitmap image format that was used.

Extensible Markup Language (XML)

Our subject is the archival storage of documents encoded in XML. This encoding should not be confused with a coded character set: the XML encoding is defined as a formal grammar whose input is a sequence of characters, not a sequence of integers. The characters are defined to be in the Unicode character set, although the particular version of the Unicode character set is not clearly defined. We have already noted that information on the coded character set should be stored along with the document; we now note that the version of XML used, and the corresponding XML specification itself, must also be stored. This is not the definition of the actual markup used, but rather the specification for XML itself.

Ancillary Formats

Photographs, illustrations, sound clips, 3D models, digital scent definitions, video and any other non-textual information must be archived in a file format that is documented. Where possible, declarative formats are to be preferred over procedural, and open, documented formats preferred over closed, undocumented formats.

Declarative and Procedural Formats

A format may be said to be Procedural if instances of that format give a complete specification of an algorithm, and Declarative if instead the format describes a desired result without giving a full algorithm.

An example of a Procedural format for graphics might be a computer program in the FORTRAN IV language using the Graphics Kernel System (GKS) to draw a series of five rectangles. The program might be several hundred lines long, and would deal with initializing a device context, then with querying which plotter pen colours were available, then issuing an instruction to select (say) the red pen, then telling the robotic plotter to lift the pen, move to a particular place on the paper, lower the pen, and move the pen horizontally by a certain distance, and so on. Running such a program even five or ten years after it was written may be difficult, as it will probably contain code that is specific to a particular operating environment, and possibly to a particular device. Deducing that a particular Calcomp plotter held the red pen in position five might or might not be trivial.

An example of a Declarative format might be an XML Rectangle Language, with five elements called Rectangle, each with a colour="Coates3801" attribute. In this example, although the recipient of the document might not know what Coates3801 means, it is not necessary to perform a computation or to run a program in order to comprehend the intent to draw five rectangles. The outcome has been described, and not the mechanism.
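A minimal sketch of such a document, and of how little machinery is needed to recover its intent, might look as follows in Python. The Rectangle element and colour attribute come from the example above; the enclosing Drawing element is an assumption added only so that the instance is well-formed.

```python
import xml.etree.ElementTree as ET

# A hypothetical instance of the XML Rectangle Language.
# The <Drawing> wrapper is assumed; it is not part of the example above.
drawing = """<Drawing>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
</Drawing>"""

root = ET.fromstring(drawing)
rectangles = root.findall("Rectangle")

# The intent can be read off without executing anything in the document.
rectangle_count = len(rectangles)
colours_used = {rectangle.get("colour") for rectangle in rectangles}
```

Even a reader with no parser at all could count the five Rectangle elements by eye, which is the property that makes a declarative format attractive for archiving.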

Open And Closed Format Specifications

We shall denote by Open Format Specification a specification with the following characteristics, listed with the most important first, from the perspective of Very Long Time Archiving of Documents:

  1. Conforming objects can be created and manipulated freely, without needing permission or payment of royalties;

  2. Conforming implementations can be created freely, without needing permission or payment of royalties;

  3. The specification itself is available and can be copied freely, without needing permission or payment of royalties;

  4. In addition to describing an Open Format, the Specification itself is available in a format defined by an Open Format Specification.

We shall describe each of these characteristics in turn. A specification which does not meet any of them, or that meets only the first, we shall denote as a Closed Format. It is necessary to consider that, after a Very Long Time, the organization that issued the Specification may or may not still exist. However, if copyright still pertains, it might be that it is no longer possible to use the Specification until copyright expires. Future changes in copyright law may mean that copyright no longer ever expires.

Objects can be created freely

If this is not the case, then explicit permission must be obtained from the controlling organization to create the archive, and also to give permission for the archive to be accessed and used. This permission must of course be stored along with the object.

Implementations can be created Freely

After A Very Long Time, it might be that no implementation exists that can still be run. In order to make use of the archive, a new implementation will need to be written. For example, digital hypertext literature written using Hypercard can often no longer be run or experienced; one way to preserve Hypercard-based literature might be to create an open source program to run it, but this is a difficult proposition [Liu2005].

The specification can be copied freely

A Very Long Time from now, a commercially sold specification may well be unavailable. For example, ISO SQL 92 [ref] has been withdrawn after less than 20 years, and is no longer for sale. Therefore, copies of the specification should be archived along with the documents, and that may require permission. When the archive is used, the copyright status of the specification must be very clear.

The Specification itself is written using an Open Format

An example of a closed format electronic document might be a Magic Wand file (Magic Wand was a word processor in the 1970s and 80s). The format is binary and proprietary, and is no longer available. Magic Wand is also no longer available, and all existing license keys will no doubt have expired. So a specification for a graphics file format (say) that was archived in Magic Wand format would not now be very useful.

Where the format is not entirely open, text-based formats are generally to be preferred over binary formats, as being easier to pick apart byte by byte, line by line, in the future.

In the case that the Specification is not available in an open format, multiple formats should be used, perhaps including a bitmap image for each page of a document, to maximise the chance that at least one of the formats can be read in the future: a sort of digital Rosetta Stone.

Multiple Copies, Multiple Locations.

Some digital documents, like antiquarian books, are scarcer than others. With antiquarian books, commercial value is related both to scarcity and to interest. With digital documents, commercial value is determined by ease of availability and interest. Documents that are widely copied are easier to access. The license or copyright by which digital documents are released determines how easy it is for other people to share copies of them. However, as digital documents are more widely disseminated, there is a greater chance of corrupted or changed copies emerging. This can be counteracted by providing a digital fingerprint, known as a signature or checksum hash, along with the file. This does not prevent alterations, but makes it possible for people to test to see if the file has been changed.
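The use of such a fingerprint can be sketched in Python with the standard hashlib module. The choice of SHA-256 and the function names fingerprint and is_unchanged are assumptions made for illustration; the text above does not prescribe any particular algorithm.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hexadecimal SHA-256 digest of the file contents.
    (SHA-256 is an illustrative choice, not a recommendation of the paper.)"""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, published_fingerprint: str) -> bool:
    """Test whether a copy still matches the fingerprint published with it."""
    return fingerprint(data) == published_fingerprint


# The archivist publishes the fingerprint alongside the document.
archived_copy = b"<note>Peace on Earth</note>"
published = fingerprint(archived_copy)

# Later, a reader checks a faithful mirror and an altered copy.
mirror_ok = is_unchanged(b"<note>Peace on Earth</note>", published)
altered_ok = is_unchanged(b"<note>Peace on Mars</note>", published)
```

Note that the fingerprint must itself be stored and documented, including the name of the algorithm, or the future reader will not be able to repeat the test.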

Making archived documents widely available in unchanged form is sometimes referred to as “mirroring.” A collection of documents that is widely disseminated in this way can survive even if only one of the mirror sites survives. However, for this to happen, the mirrors must be funded.

A private organization might decide to have a distributed archive, with entire copies of the archive at several disparate geographical locations. However, if the organization ceased operations, all of the archives would probably be closed. Mirroring by multiple organizations is more robust, but there has to be a suitably sustainable funding source. Sometimes this can be provided by government grants; in other cases, if the documents are suitably licensed, and can be made public for commercial use, advertising on Web sites hosting the archive can suffice. In some communities there are already mirroring and archiving initiatives such as Lots of Copies Keep Stuff Safe [LOCKSS2008].

Summary

An archive that is expected to outlast the archivist must be self-contained: it must contain not only a single document of interest, but also information about everything needed to decode and use that document, at both physical and logical levels.

In order to facilitate future decoding of an archive, use separate uncompressed files wherever possible. A hierarchical folder or directory structure can help to keep ancillary documents separate from the main document.

There is no definite answer to the question of physical storage formats at this time.

Where there is a choice of file formats, Open Formats should be used wherever possible.

Storing information in multiple parallel formats is wise for archiving, as is using multiple locations.

Designing XML-based Formats for Longevity

In this section we shall assume that the reader has some familiarity with the Extensible Markup Language and associated terminology.

By the term XML-based Format we intend to denote not only an XML Vocabulary or set of vocabularies, as might be specified in some Schema language such as XSD or DTD or RELAX NG, but also to include any usage guides, documentation, examples and associated social culture.

We must not assume that XML Processing models will remain the same for A Very Long Time. For example, xml:id, xml:base, XInclude and other low-level specifications have arisen within the past decade, and there is no reason to suppose that new specifications will not similarly come into being. We can create some guidelines that follow from this observation.

Avoid Implicit Content

Some XML Schema languages, including XSD and DTDs, have the ability to provide “default” attribute values, or, in some cases, even default element content. The document is augmented by this content after validation against a schema by a software process. Since we cannot assume that software processes will still be runnable A Very Long Time from now, we should make sure that the archived XML Document can be used without schema validation, and, in XML terminology, is a stand-alone document.
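One way to remove the dependency on a schema is to write the defaults into the document before archiving it. The sketch below assumes the defaults have already been extracted from the schema into a plain dictionary; the function name make_defaults_explicit, the item element and its kind attribute are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET


def make_defaults_explicit(root, defaults):
    """Set every defaulted attribute that is missing from the document.

    `defaults` maps an element name to a dictionary of attribute defaults,
    for example {"item": {"kind": "bulleted"}} (hypothetical names).
    """
    for elem in root.iter():
        for name, value in defaults.get(elem.tag, {}).items():
            if name not in elem.attrib:
                elem.set(name, value)
    return root


root = ET.fromstring('<list><item/><item kind="numbered"/></list>')
make_defaults_explicit(root, {"item": {"kind": "bulleted"}})

# Every item now carries its value explicitly; no schema is needed to read it.
print([item.get("kind") for item in root])
```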

Avoid Obscure Features

Any feature of XML, or of any other format, that is not widely implemented, or whose behaviour varies between implementations, or whose semantics are not clearly documented, should be avoided. For XML that might include, for example:

  • Notations, a feature whose semantics are not well-defined.

  • Use of parameter entities in the internal document type definition subset at the start of a document, as support for this feature is not required by the XML specification, and not all XML processors implement it.

  • The use of inline general entities to introduce markup, as this is not always supported.

  • Use of character encodings other than UTF-8 or UTF-16 or US-ASCII, the only encodings currently guaranteed to be supported.

  • Reliance on “well-known” sets of entity definitions such as those provided by ISO 8879:SGML or by HTML; even if these entity sets are included with the document, an XML processor that does not support DTD processing will be unable to handle the document.
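Some of the items in this list can be detected mechanically before a document is archived. The following rough sketch assumes the document is stored in an ASCII-compatible encoding, so that declarations can be found by a simple byte search; the function name obscure_features and the wording of the warnings are inventions for this example, and a real check would need a proper XML parser.

```python
import re

ALLOWED_ENCODINGS = {"utf-8", "utf-16", "us-ascii"}


def obscure_features(xml_bytes: bytes) -> list:
    """Return warnings for some constructs the list above advises against."""
    warnings = []
    # Look for an encoding name in the XML declaration at the very start.
    decl = re.match(rb"<\?xml[^>]*?encoding=[\"']([A-Za-z0-9._-]+)[\"']", xml_bytes)
    if decl:
        encoding = decl.group(1).decode("ascii")
        if encoding.lower() not in ALLOWED_ENCODINGS:
            warnings.append("encoding " + encoding)
    # Notation and entity declarations both depend on DTD processing.
    for keyword in (b"<!NOTATION", b"<!ENTITY"):
        if keyword in xml_bytes:
            warnings.append(keyword.decode("ascii")[2:].lower() + " declaration")
    return warnings


sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE d [<!ENTITY eacute "&#233;">]>\n'
    b"<d>caf&eacute;</d>"
)
print(obscure_features(sample))
```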

Avoid Cryptic Names

A necessary precondition of usability is comprehensibility. If someone else is to make use of our markup they must understand it. Consider the example in the following listing, in which short element names and a flat structure have been used for a novel (the actual content has been reduced to numbers to avoid accidentally introducing anything of interest into this paper):

<h>1</h><p>3</p><p>10</p><p>21</p><h>64</h><p>129</p>

In the listing, the relationship of the elements is not explicit. An improved version is given in listing 2:

<c><h>1</h><p>3</p><p>10</p><p>21</p></c><c><h>64</h><p>129</p></c>

Someone inspecting this document might not have enough information to deduce the intent of the markup, and we could make the further improvement of listing 3:

<chapter>
    <heading>1</heading>
    <paragraph>3</paragraph>
    <paragraph>10</paragraph>
    <paragraph>21</paragraph>
</chapter>
<chapter>
    <heading>64</heading>
    <paragraph>129</paragraph>
</chapter>

Of course, such lengthy element names may be inconvenient when editing a document. Plausible compromises include converting to an archival format, or making sure that each document includes sufficient information to allow someone to reconstruct this meaning. Of these, the first approach seems more likely to stand the test of time.
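The conversion to an archival format can be largely mechanical when the structure is already explicit. The sketch below renames the short element names of listing 2 to the long names of listing 3 using Python's standard ElementTree module. The wrapper element book is an assumption added so that the fragment has a single root, and the function name to_archival is hypothetical.

```python
import xml.etree.ElementTree as ET

# Mapping from the editing vocabulary to the archival vocabulary.
LONG_NAMES = {"c": "chapter", "h": "heading", "p": "paragraph"}


def to_archival(root, names=LONG_NAMES):
    """Rename every element that has a long form; leave the others alone."""
    for elem in root.iter():
        elem.tag = names.get(elem.tag, elem.tag)
    return root


working = ET.fromstring(
    "<book><c><h>1</h><p>3</p></c><c><h>64</h><p>129</p></c></book>"
)
archival = to_archival(working)
print(ET.tostring(archival, encoding="unicode"))
```

The mapping table is itself part of the documentation of the format and would be worth archiving alongside the converted document.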

Document the significance of markup items

Write clear descriptions of the purpose of each XML element and attribute; be careful not to rely on the element name in the description. For example, do not describe the bulletlist element as containing a bulleted list; describe it as containing a sequence of independent items that form a related group or sequence, and give an example. The meaning and usage of terms such as Paragraph, Section, List, Title, Bullet and Flower, Folio and Explication have not remained constant over the past few centuries, and even today are not constant across all cultures.

A project to archive a body of documents over a Very Long Time might well ensure that the documentation is translated into multiple languages, to increase the chance that a future user will be able to understand what is written. Of course, one might do the same with the actual documents as well as the documentation about the rhetorical and technical contexts of the document. Changing the actual XML Document is outside the scope of this paper, but the possibility of creating an archival format alongside the original document has already been suggested.

Validate the Data

If there is uncertainty about the ability of recipients and users of an archived document to use the XML format, there must be considerably more uncertainty about their ability to cope with errors in the use of the format. Make sure that all XML data is well-formed: not only the XML Document itself but all supporting information. Validate against XML Schemas, DTDs or other test documents wherever possible. Where data fails to validate, the archivist should attempt to ask the supplier to provide corrected content; where that is not possible, the original data must obviously be archived, but the archivist should consider including a corrected version of any documents that do not validate, along with information about the validity problems. Including detailed notes about the format employed is of little use if the documents do not correspond to the documentation.
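The well-formedness check, at least, is easy to automate. The following minimal sketch uses the parser in Python's standard library; the function name well_formedness_error is hypothetical. It checks well-formedness only: validation against an XSD, DTD or RELAX NG schema needs a schema-aware tool, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


print(well_formedness_error("<chapter><heading>1</heading></chapter>"))
print(well_formedness_error("<chapter><heading>1</chapter>"))
```

Any error message produced in this way is the kind of information about problems that could be archived next to a corrected copy.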

Mean What You Mean To Mean

There is no clear definition of Meaning that satisfies all users of a typical XML document. Some people are interested in denotational semantics from a semiotic viewpoint (what is signified?), some in behavioural semantics (what does it do?), some merely in human understanding (what is it?). A consequence of this is that there is no single universal way to denote the meaning of XML markup. W3C XML Schema documents can contain annotation elements for this purpose.

Check Links

If your document format has links, whether explicit or implicit, you should check them before committing the document or documents to the archive. As a minimum, make sure that all link targets exist. Better, make sure that the links go where they should. For example, check the titles of sections that are the targets of links, perhaps against the content of an element in the link that exists only for that purpose. For off-site links, for example to Web sites, consider archiving a surrogate, or including a description of the remote content. Archiving the actual remote Web page may require explicit permission, but for some projects this is worthwhile. Ensure that you are checking the linked documents in the collection that you are about to archive, and that your link checker is not reaching out to a production server elsewhere in your organization! All of this requires provision in the design of an XML document format.
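The minimum check, that every link target exists, might be sketched as follows. The vocabulary here is an assumption made for the example: a link element whose target attribute names the id attribute of another element in the same document. The function name missing_targets is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def missing_targets(root):
    """Return the target values of links that point at no element's id."""
    ids = {e.get("id") for e in root.iter() if e.get("id") is not None}
    return [
        link.get("target")
        for link in root.iter("link")
        if link.get("target") not in ids
    ]


doc = ET.fromstring(
    '<doc>'
    '<section id="s1"><p>See <link target="s2"/> and <link target="s9"/>.</p></section>'
    '<section id="s2"><p>Text</p></section>'
    '</doc>'
)
print(missing_targets(doc))
```

Because the check reads only the parsed document, it cannot reach out to a server elsewhere by accident.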

Although the World Wide Web Consortium has produced a specification for marking up hypertext links in XML, XLink, this specification is not likely to be sufficient even for explicit links, as it does not provide for a pattern to match against a given target element in order to validate a link.

The XLink specification does not attempt at all to support implicit links, where, for example, a part-number element within a step of a repair procedure is treated as a link to a database of parts, and for printing is augmented by a description of the part, and for online publication also becomes a hyperlink to an online catalogue. Like implicit content from default elements and attributes, implicit links are difficult to archive. The ISO SGML HyTime specification did have support, using a mechanism termed Architectural Forms, for at least some level of implicit linking, but HyTime has not been adopted by the XML community.

The best approach to linking in documents that are expected to be archived, then, is to include in each document some information about link targets. This could be a short textual description or an entire resource. For the purpose of link checking, static content in the document can be used, such as an attribute whose value specifies an XPath expression, and another attribute giving a regular expression (a text pattern) that the result of evaluating that XPath expression must match. For example, one might say that the nearest enclosing section title of the link target must contain the word Plastic. The utility of this is at archive creation time, and during the regular document maintenance life-cycle.
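A sketch of such a check follows. The attribute names check-path and check-pattern and the function name failed_link_checks are hypothetical. Python's standard ElementTree module supports only a subset of XPath and has no ancestor axis, so this example addresses the section title directly instead of finding the nearest enclosing one; a full XPath processor would be needed for the latter.

```python
import re
import xml.etree.ElementTree as ET


def failed_link_checks(root):
    """Return the check-path of every link whose target fails its pattern."""
    failures = []
    for link in root.iter("link"):
        path = link.get("check-path")
        pattern = link.get("check-pattern")
        if path is None or pattern is None:
            continue  # this link carries no self-check information
        found = root.find(path)
        text = "".join(found.itertext()) if found is not None else ""
        if found is None or not re.search(pattern, text):
            failures.append(path)
    return failures


doc = ET.fromstring(
    """<doc>
  <section id="s1"><title>Metal Parts</title>
    <p>See <link check-path=".//section[@id='s2']/title" check-pattern="Plastic"/>.</p>
  </section>
  <section id="s2"><title>Plastic Parts</title></section>
</doc>"""
)
print(failed_link_checks(doc))
```

If the title of the target section were later changed so that it no longer contained the word Plastic, the link would be reported, prompting a human to decide whether it still goes where it should.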

Provide for Translations

After A Very Long Time, language will have changed. Documents that have been translated into multiple languages clearly have a better chance of being understood. A single Rosetta Stone is more likely to survive in one place than three separate stones: consider archiving a single file containing all of the translations, or at least a fragment from each language, perhaps first a German paragraph and then a French one and then a Swedish one (in alphabetical order by their own names, but that breaks down for languages written with non-Latin scripts). Remember that any piece of human-readable text may need to contain markup in some language or other, whether to delimit right-to-left and left-to-right components, or for Ruby-style annotations, or for emphasis. As a result, all natural-language fragments should be in XML elements (not attributes), where they can be distinguished by xml:lang values. Designing XML documents for translation has been described elsewhere [e.g. ITS] in detail.

It can be helpful to have block-level translations, so that (for example) a paragraph or list item in one language can be immediately followed by corresponding text in another language. This does not always work, since the rhetorical structures in different cultures may suggest that material be most naturally presented in different sequences.
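Block-level translations distinguished by xml:lang might look like the fragment embedded in the sketch below, which also shows how one language can be selected. The element names para-group and para, the sample sentences and the function name text_in are assumptions made for this example; only xml:lang comes from the XML specification.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

group = ET.fromstring(
    '<para-group>'
    '<para xml:lang="de">Guten Tag</para>'
    '<para xml:lang="fr">Bonjour</para>'
    '<para xml:lang="sv">God dag</para>'
    '</para-group>'
)


def text_in(group, lang):
    """Return the text of the first block in the given language, or None."""
    for para in group:
        if para.get(XML_LANG) == lang:
            return "".join(para.itertext())
    return None


print(text_in(group, "fr"))
```

Because each translation is an element and not an attribute value, it can still contain further markup for emphasis, annotations or text direction.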

Provide for Contextualization

Include a place for authors to make explicit not only the purpose of the document, but how it is to be used and how it might fit into the ecosystem of the wider context in which it is created. For example, a dream journal might have an introduction that says the author wrote down memories of dreams each day for a year, but the wider context might include that this was part of a therapeutic exercise in working out resentment towards alien visitors, and that, after a year, the writer's perception of the visitors had changed. This sort of explanation is not generally considered necessary for a work of fiction, but the boundaries between perception and construction become less clearly discernible over time. Harry Potter is a fictional boy, but of course the train from King's Cross station in London, and the oddly numbered platform there, really exist. They were created after the success of the books, but that small detail may no longer be obvious a Very Long Time from now.

In a business context, the reasons behind particular decisions and documents may not be public; it may be desirable to store two versions of a document, one of which is to be made available only fifty years (say) after its original creation. Some governments have similar policies, although the details vary widely between countries.

As an example of contexts, consider the following amounts of money: all except one represent the same value.

Table II

Amount    1/3    6/8    0.33    80    327    0.17    0.19

Adding some context would make this clearer:

Table III

Amount    £1/3    £-/6/8    £0.33    80d    £327    $0.17    $0.19

To belabour the point, even here there is some difficulty. The second item, £-/6/8, is a now-obsolete notation for six shillings and eight pence in the pre-decimal British currency; with twelve pennies to the shilling that made 80d (denarii, pennies), which is one third of the 240 pennies in a pound. Italy, in pre-Euro days, used the Lira, and used the same symbol for currency, although devaluation meant there were of the order of a thousand Italian Lire to the British Pound (at some point in history); similarly, the two dollar figures are from two different countries using the dollar as a currency symbol. Thus, notations and symbols that we take for granted may lose their meaning over time, or not be clear to future readers. There is no clear way to determine how much context to retain, and yet we must persist in claiming that the more context is recorded, the greater the chance of successful communication.
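The arithmetic behind the pre-decimal figures can be made explicit. The sketch below relies only on the historical facts that there were twelve pence to the shilling and twenty shillings to the pound; the function name lsd_to_pence is hypothetical.

```python
from fractions import Fraction

PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20
PENCE_PER_POUND = PENCE_PER_SHILLING * SHILLINGS_PER_POUND  # 240


def lsd_to_pence(pounds, shillings, pence):
    """Convert a pounds/shillings/pence amount to a number of pence."""
    return (pounds * SHILLINGS_PER_POUND + shillings) * PENCE_PER_SHILLING + pence


pence = lsd_to_pence(0, 6, 8)                     # the amount written £-/6/8
fraction_of_pound = Fraction(pence, PENCE_PER_POUND)

print(pence)                                      # the amount written 80d
print(fraction_of_pound)                          # the amount written £1/3
print(round(float(fraction_of_pound), 2))         # the amount written £0.33
```

A future reader who has this much recorded context can recover the equivalences; one who has only the bare numbers cannot.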

Don't be Inventive

Use existing specifications where possible; if that is impractical, consider copying techniques from existing specifications, which can then be included in the archive. The more people who use a specific technique, the more likely it is to survive.

Summary

Use validated, well-documented XML that wherever possible relies on widespread practices.

XML or Not XML?

The scope of this paper is intended to be considerations for long-term storage of XML Documents. However, what if the best things to store are not XML Documents at all? Such considerations could easily fill volumes; in this paper, we have room only to delineate a small number of features of XML Documents that make them suitable.

Textual Format

As we have already discussed, textual formats tend to be more robust in the face of possible data corruption than compressed or binary formats. If a single storage device block becomes unreadable, the rest of the document after the lacuna will still be readable, albeit incomplete.

Explicit End Markers

Not all text formats use explicit markup to delimit objects, and some that do use an ambiguous symbol, such as a close brace or a closing parenthesis, to mark an ending. The possibility of data corruption means that the additional redundancy of repeating an element name can help to identify errors and limit the scope of corruption.

Embedded Usage

XML is used in devices ranging from automobile engines to television sets. Its use is very widespread, and it is found in devices expected to last for thirty years or more. These devices cannot easily be changed if XML becomes obsolete and is replaced by (say) Aldus PageMaker files. This means that the chance of XML technology surviving, or being historically retrievable, is statistically significant: 3% of 1000 years is 30 years.

Device Independent

XML Documents do not generally rely on specific hardware. For example, one does not embed Epson dot-matrix printer control sequences in XML documents in order to generate underlining. XML-based graphics formats are generally declarative, and do not generally rely on initializing a plotter or on the size of a sheet of paper.

Open Specification

The XML Specification meets all four of the criteria given in this paper for a specification to be considered fully Open. This is not to say that no other format does: many do.

Conclusions

This paper has outlined some considerations for long-term storage of XML documents.

A small amount of extra care and consideration may help to provide a framework for document creation with a Very Long Time in mind.

An archived document should be accompanied by an overview document that records which specifications were used and why, gives a high-level summary of the document itself, lists all copyrights and trademarks that may apply (and patents, non-disclosure agreements, licenses or other agreements or restrictions on republishing the document in the future), and lists all associated files and their purpose.

Since there may be hundreds or even tens of thousands of ancillary files for a single document, especially if specifications are included, use a hierarchical file structure to give prominence to the actual document.

Use open formats wherever possible, and prefer textual formats to binary formats.

Do not use complex compound file formats such as zip files, which are difficult or impossible to repair if they become corrupted. Similarly, store files uncompressed wherever possible.

For long-term archiving, use multiple organizations and multiple physical locations for the data.

Do not store data in a vault covered by a large pyramid with a lidless eye carved into it, as you will attract aliens.

References

[Applen2009] Applen, J. D. and McDaniel, Rudy, “The Rhetorical Nature of XML,” Routledge, 2009.

(to be supplied; a number of references were consulted)

[Liu2005] “Born-Again Bits: A Framework for Migrating Electronic Literature,” Alan Liu, David Durand, Nick Montfort, Merrilee Proffitt, Liam R. E. Quin, Jean-Hugues Réty, and Noah Wardrip-Fruin 2005, online at www.eliterature.org/pad/bab.html and accessed July 2010

[Moxon1683] Moxon, Joseph, “Mechanick exercises on the whole art of printing,” 1683/4.

[Wooley2002] Woolley, Benjamin, “The Queen's Conjurer: The Science and Magic of Dr. John Dee, Adviser to Queen Elizabeth I,” Holt, 2002 (not itself a scholarly book but a good and clear introduction to the topic).

[XCB] The X protocol C-language Binding (XCB), available at xcb.freedesktop.org, accessed July 2010.

Author's keywords for this paper: Archiving; Long Term Storage

Liam R. E. Quin

XML Activity Lead

The World Wide Web Consortium

Liam Quin is the XML Activity Lead at the World Wide Web Consortium, where he has worked since 2001; he also does consulting in his spare time. Prior to working for W3C, Quin was a full-time consultant. He has worked with structured markup since the early 1980s, with SGML since 1987, and was an Invited Expert for the original XML work at W3C.