International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML
Preliminary Program

Monday, August 2, 2010

9:00 am - 9:45 am

A brief history of markup of social science data: from punched cards to “the life cycle” approach

Laine Ruus

Traditional quantitative social science data analysis requires three ingredients: the raw data, metadata (what we used to call a codebook), and software. Software changes all the time, within some limits. Raw data without metadata is useless: it might as well be generated by a random number generator. And metadata without data is like the index to a periodical the last remaining copy of which was sent for recycling last month. Over time, metadata have been expected to support many different functions, and microsolutions have never quite satisfied many, much less all, of those functions. Until recently, that is: a roughly 25-year process of historical evolution has led to DDI, the Data Documentation Initiative, which unites several levels of metadata in one emerging standard.

9:45 am - 10:30 am

Sustainability of linguistic resources revisited

Georg Rehm, Oliver Schonefeld, Thorsten Trippel, & Andreas Witt

Data providers, users, and funders alike want and need sustainability of language resources (e.g. language corpora and grammars); sustainability requires making the resources available according to defined processes, platforms, or archives in a reproducible and reliable way. A three-year project on sustainability of linguistic resources conducted at Tübingen, Hamburg, and Potsdam illuminates some of the difficulties: the prevalence of stand-off markup (requiring a layer of specialized tools atop the XML stack), machine-generated XML of low clarity, ad hoc non-standard tag sets, discoverability, and selection criteria for long-term archiving. XML and other standards are necessary but not sufficient ingredients in the mix.

11:00 am - 11:45 am

Report from the field: PubMed Central, an XML-based archive of life science journal articles

Jeff Beck, National Library of Medicine (NLM)

PubMed Central (PMC) is an XML-based archive of life sciences journal literature that provides public access to the full text of more than two million articles. Structures above the article are built as collections of articles. PMC's article acceptance policies, ingestion processes, support tools and guiding philosophies are described, showing how its ongoing success as one of the fastest growing and most frequently consulted scientific resources is being achieved.

11:45 am - 12:30 pm

Portico: A case study in the use of XML for the long-term preservation of digital artifacts

Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, & Umadevi Thanneeru, ITHAKA

Portico is a not-for-profit digital preservation service providing a permanent archive of electronic journals, books, and other scholarly content, currently encompassing more than 15 million items comprising over 176 million files. Portico preserves publisher-provided digital artifacts, Portico-created XML metadata, PDF pages, figure and table graphics, supplemental files, and the header or the full text as XML in Portico's own XML DTDs, based on the National Library of Medicine's Archiving DTDs. The current challenges facing Portico include document validity, the slipperiness of semantics, handling generated text, preserving external links, and tag set versioning. We expect future challenges to include scaling, managing the preservation of supplementary material, and the long-term sustainability both of cultural memory institutions that preserve digital artifacts and of the XML community of development and use itself. The paper suggests some practices that can help assure the semantic stability of digital assets.

2:00 pm - 2:30 pm

The Sustainability of the Scholarly Edition in a Digital World

Cathy Moran Hajo, New York University

Scholarly editions must be used for generations; by nature they require a stable long-term publication format. Some editors have eagerly embraced digital editing and XML, but many more editors remain unconvinced that digital publications can last as long as printed books. Community standards and DTDs for editions have not been widely adopted and editors lack consensus about what a digital edition should be. XML’s stability and sustainability are critical to efforts to go beyond “the book,” and to develop new ways of presenting texts and scholarly commentary. To build 21st century editions, we need tools to make XML encoding easier, to encourage collaboration, to exploit social media, and to separate transcriptions of texts from the editorial scholarship applied to them.

2:30 pm - 3:00 pm

A formal approach to XML semantics: implications for archive standards

Andrew Dombrowski & Quinn Dombrowski, University of Chicago

Earlier work on markup semantics has assumed syntactically and semantically plausible schemas as a starting point. We can use markup semantics conversely to evaluate the plausibility of markup vocabularies. For purposes of long-term preservation, we cannot choose vocabularies by identifying some set of use cases to support: we cannot foresee in enough detail the goals to which the future will wish to put the data preserved today. The application of Montague semantics to markup languages may make it possible to distinguish vocabularies that can last from those which will not last. Ad hoc semantics, closely tied to transient features of the world and to our current world view, may easily lose meaning with time. What we need are vocabularies with semantics sufficiently independent of contingent facts to survive over the long haul.

3:00 pm - 3:30 pm

Metadata for long term preservation of product data

Joshua Lubell, National Institute of Standards and Technology

Product data can usefully be defined as structured information about objects which are produced by industrial and business processes. In terms of information types, data formats, usage, and lifespan, product data is both complex and diverse, encompassing proprietary 3D image modeling information, dimensions, tolerances, and other model annotations, supplementary material such as test analysis, videos, and datasets, and human-readable documentation. Although the metadata issues in this problem space present some unique features, there are valuable lessons to be learned from the library metadata and packaging standards and how they relate to product metadata. Extending the library standards to represent subsets of information from emerging product lifecycle management standards could help tame the complexity of long-term product data archival.

4:00 pm - 4:30 pm

Beyond eighteen wheels: considerations in archiving documents represented using the Extensible Markup Language (XML)

Liam R. E. Quin, World Wide Web Consortium (W3C)

How can archived documents remain useful even though, over time, the contexts in which they are used may change? The best hope is to retain knowledge of their original media and processing contexts and of their larger social, political, economic, linguistic and semantic contexts. With that principle in mind, strategies for extending the useful life of archived documents are suggested.
