How to cite this paper

Viglianti, Raffaele. “Encoding document and text in the Shelley-Godwin Archive.” Presented at Symposium on Cultural Heritage Markup, Washington, DC, August 10, 2015. In Proceedings of the Symposium on Cultural Heritage Markup. Balisage Series on Markup Technologies, vol. 16 (2015).

Symposium on Cultural Heritage Markup
August 10, 2015

Balisage Paper: Encoding document and text in the Shelley-Godwin Archive

Raffaele Viglianti

Research Programmer at the Maryland Institute for Technology in the Humanities

Copyright © 2015 by the author. Used with permission.


The Shelley-Godwin Archive uses TEI to encode manuscript text from two perspectives: one focused on the document and one focused on the text. This short presentation addresses issues of adopting stand-off markup as a technique for the project's encoding goals.

The Shelley-Godwin Archive (S-GA) is a project involving the Maryland Institute for Technology in the Humanities (MITH) and the Bodleian, British, Huntington, Houghton, and New York Public Libraries that began in 2011 and has now completed two phases of work. In October 2013, the project released a beta version of its online reading environment containing high-resolution images and accompanying TEI-encoded transcriptions of the three surviving manuscript notebooks containing Mary Shelley’s drafts of Frankenstein, or, The Modern Prometheus. In Summer 2015, the project released a faster and more stable online reading environment together with images and transcriptions of three manuscript notebooks in Percy Shelley’s hand containing, amongst other works, the fair copy of the dramatic poem Prometheus Unbound.

With the project’s latest phase completed, we are now planning for future research and development. This short presentation will describe the next phase of work on our markup scheme, which will embrace stand-off markup.

The design of the TEI markup scheme for S-GA coincided with the addition of the new "document-focused" elements to the TEI in the release of P5 version 2.0.1. This encoding approach switches focus from text as communicative act or linguistic content to text as sign on some physical support. This approach enables, for example, rigorous description of often complicated sets of additions, deletions, and emendations. The S-GA scheme primarily follows this approach; however, the archive is also meant to include clear "reading texts" for those readers who are primarily interested in the final state of each manuscript. This requires representing two different hierarchies, one documentary and one textual.

Our solution prioritizes the documentary hierarchy over the textual one and relies on an automatic process to convert the “document-focused” encoding into a “text-focused” one. While some transformations can be handled heuristically, others are explicitly encoded with the general purpose <milestone> and <anchor> element pairs, for example to indicate the start and end of a paragraph, or a verse of poetry.
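As an illustration of the kind of conversion described above, the following is a minimal sketch and not the S-GA pipeline itself. It assumes a hypothetical fragment in which a <milestone> carries a spanTo attribute pointing at the xml:id of its closing <anchor>; the element layout, attribute values, sample text, and the function names events and extract_spans are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document-focused fragment (placeholder text, not a real
# transcription): a milestone/anchor pair marks a paragraph that starts
# in the middle of one line and ends in the middle of another.
DOC = """<zone>
  <line>first line outside</line>
  <line>still outside <milestone unit="p" spanTo="#a1"/>paragraph begins</line>
  <line>and ends here<anchor xml:id="a1"/> outside again</line>
</zone>"""


def events(elem):
    """Yield (kind, value) pairs for elem and its descendants in document order."""
    if elem.tag == "milestone":
        yield ("start", elem.get("spanTo", "").lstrip("#"))
    elif elem.tag == "anchor":
        yield ("end", elem.get(XML_ID))
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from events(child)
        if child.tail:
            yield ("text", child.tail)


def extract_spans(root):
    """Collect the text found between each milestone and its matching anchor."""
    open_spans, done = {}, {}
    for kind, value in events(root):
        if kind == "start":
            open_spans[value] = []
        elif kind == "end" and value in open_spans:
            # Join the pieces and normalise the whitespace left by line breaks.
            done[value] = " ".join("".join(open_spans.pop(value)).split())
        elif kind == "text":
            for parts in open_spans.values():
                parts.append(value)
    return done


# Wrap each recovered span in a regular text-focused element.
for span_text in extract_spans(ET.fromstring(DOC)).values():
    p = ET.Element("p")
    p.text = span_text
    print(ET.tostring(p, encoding="unicode"))
```

Because the paragraph boundaries cut across the <line> elements, the sketch has to walk the whole tree in document order rather than rely on nesting, which hints at why this style of encoding becomes harder to author and process as more textual structures are added.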

This has proved effective for the Frankenstein manuscripts. When working on Prometheus Unbound, however, we found that the approach does not scale well. Shelley’s manuscripts complicate matters in two ways: there are additional dramatic and poetic textual structures, and there are fragments of other works interspersed with the fair copy of Prometheus Unbound. Authoring a valid encoding and processing it for publication have become a vexed process.

Our next phase of development will focus on separating the hierarchies more neatly by creating parallel documents for the documentary and textual hierarchies. To avoid redundancy, we do not intend to encode the text twice, as is done, for example, in the Digitale Faustedition project, one of the very first projects to adopt the new document-focused vocabulary of the TEI successfully. Rather, the S-GA primary encoding will remain the documentary one, while a parallel document will encode textual structures and use stand-off pointers to pull character data from the documentary encoding. Instead of overloaded milestone elements, regular TEI elements will encode paragraphs, poetic structures, etc., thus simplifying, we argue, the authoring and processing of the encoding. Targeting character data in an XML document now seems achievable, particularly given the recent revision of the TEI Pointer scheme proposed and implemented by Hugh Cayless (presented at the TEI Members Meeting in 2014).
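To make the idea of pulling character data through stand-off pointers concrete, here is a minimal sketch under stated assumptions. The pointer syntax is loosely modelled on the TEI string-range() scheme but simplified so that its first argument is always a bare xml:id; the two sample documents, the use of <ptr> elements inside <p>, and the function names resolve and reading_text are hypothetical and do not reflect a settled S-GA design.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical documentary encoding: the only place character data lives.
DOCUMENTARY = """<surface>
  <line xml:id="l1">first line of a para</line>
  <line xml:id="l2">graph, then a new one</line>
</surface>"""

# Hypothetical textual encoding: regular elements that hold only pointers.
TEXTUAL = """<body>
  <p><ptr target="#string-range(l1,0,20)"/><ptr target="#string-range(l2,0,5)"/></p>
  <p><ptr target="#string-range(l2,7,14)"/></p>
</body>"""

# Simplified pointer form (an assumption): #string-range(ID,OFFSET,LENGTH)
POINTER = re.compile(r"#string-range\(([^,()]+),(\d+),(\d+)\)")


def resolve(target, index):
    """Return the characters a pointer selects from the documentary encoding."""
    match = POINTER.fullmatch(target)
    if match is None:
        raise ValueError(f"unsupported pointer: {target}")
    elem_id, offset, length = match.group(1), int(match.group(2)), int(match.group(3))
    text = "".join(index[elem_id].itertext())
    return text[offset:offset + length]


def reading_text(textual_root, documentary_root):
    """Build one string per <p> by resolving its pointers in order."""
    index = {el.get(XML_ID): el for el in documentary_root.iter() if el.get(XML_ID)}
    return [
        "".join(resolve(ptr.get("target"), index) for ptr in p.iter("ptr"))
        for p in textual_root.iter("p")
    ]


for paragraph in reading_text(ET.fromstring(TEXTUAL), ET.fromstring(DOCUMENTARY)):
    print(paragraph)
```

In this arrangement the paragraph that straddles two manuscript lines is an ordinary element in the textual document, and the documentary document needs no milestone markup at all; the cost is that every pointer depends on character offsets in the source staying stable.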

This short presentation hopes to stir conversation about two main topics. 1) Authorship of stand-off markup, or how to simplify the creation of pointers to character data in XML documents. We envisage a web-based tool for visually selecting target elements and strings in order to build precise pointer expressions (e.g. in XPointer). We created an early prototype of such a tool, called coreBuilder,[1] which is currently used in another project to create a stand-off critical apparatus. 2) Updating pointers to changeable sources. Given our goal of making S-GA participatory, we expect the primary documentary TEI encoding to change over time; what would be necessary to update the pointers in the secondary textual encoding? At this point, we speculate that automatically monitoring a version control system may provide the information needed to update the pointers. This is also a necessary step towards our goal of enabling user participation and curation on the archive.
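Both topics can be sketched in a few lines, with the caveat that this is an illustration under assumptions and neither the coreBuilder implementation nor a decided S-GA approach. The function build_pointer stands in for what a selection tool might emit, using a simplified string-range form whose first argument is a bare xml:id; remap_range uses Python's difflib to carry a character range from an old version of an element's text to a new one, which is one possible way of using the before and after states that a version control system records. Both function names and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def build_pointer(elem_id, elem_text, selection):
    """Return a simplified string-range pointer for the first occurrence
    of the selected string within the element's text."""
    offset = elem_text.find(selection)
    if offset < 0:
        raise ValueError("selection not found in element text")
    return f"#string-range({elem_id},{offset},{len(selection)})"


def remap_range(old_text, new_text, offset, length):
    """Map a character range in old_text onto new_text.

    Returns (new_offset, length) when the whole range survives unchanged,
    or None when any part of it was edited and needs human review.
    """
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" and i1 <= offset and offset + length <= i2:
            return (offset - i1 + j1, length)
    return None


old = "a fair copy of the poem"
new = "this is a fair copy of the poem"

print(build_pointer("l2", old, "poem"))
print(remap_range(old, new, 19, 4))
print(remap_range("a fair copy", "a rough copy", 2, 4))
```

Returning None instead of guessing when the targeted characters themselves have changed reflects a conservative design choice: a pointer whose target was rewritten probably needs an editor's decision, whereas one whose target merely shifted can be updated automatically.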