Multiply these possibilities by the large number of ancillary texts, and you have a sense of what an assault this technology portends on a normative, atomistic conception of the act of reading. One doesn’t, it is true, exactly curl up with a good book here. Rather, one is faced with dozens of possibilities at once, literally replicating the ways intertextual allusions play against and within any literary work of dimension and intellectual ambition.
Curran clearly had an idea that his edition was breaking new ground: “We are hoping to create here for the first time an electronic variorum.” Perhaps he did not realize at that moment that the work he and Jack Lynch were putting into the project to encode variants with HTML was being replicated by the Text Encoding Initiative, which in the same year (1994) released a draft of its P3 Guidelines, a draft that contained its first modeling of apparatus criticus markup with the elements app and rdg and their corresponding attributes. The first half of the 1990s saw intense drafting for the TEI in defining a standard tagset in SGML, and, ironically, Curran’s venture with HTML seems to have placed his effort in a parallel universe, perhaps due to its emphasis on immediate sharing and distribution. Duplicating the efforts of the TEI in SGML, and bound by the semantic limitations of HTML 1.0 (itself a specific instantiation of SGML) for representing descriptively the variations between two texts, the PAEE Frankenstein editors pioneered their own system of pseudomarkup that appears in a third frame running beneath the 1818 and 1831 windows, as visible on a representative page. The pseudomarkup applies a system of square brackets and angle brackets to render variants inline together in the document.
What is immediately evident is the precision and care taken in the first edition, even in its apparent lack of awareness of the TEI as it was developing an alternate SGML form in the mid-1990s. Also evident, ironically, is the net loss of information in the translation of the PAEE into TEI for Romantic Circles in 2009, where the difficult negotiation between HTML 1.0 and the XML standard of TEI appears to have been settled in favor of expressing the presentation view of the texts while removing the more adventurous aspects of the pseudomarkup.
Although the editorial annotations were transferred from the HTML to the XML syntax, no effort was made in the conversion of 2009 to apply the apparatus criticus of TEI to curate the handiwork in pseudomarkup of the original PAEE editors.
Looking back on the origins of our electronic Frankenstein monster, we see something of a history of strained relations between “chunky” hypertext books and “creamy” TEI, which in those days applied SGML in favor of the semantics of document hierarchies to show interrelationships.
Here I am referring to Steven DeRose and David Durand’s contrast of atomized or chunky hypercards vs. creamy or supposedly more flexible text representation in hierarchies of the TEI from a timely article of 1995, in which they write: “A number of popular hypertext systems use a data model deemed inadequate for all but a few scholarly reference needs: this is the card-based or ‘chunky’ hypertext model, in which documents must be fragmented into data atoms of uniform size and minimal internal structure. Since few documents have historically been structured in this manner, the TEI hypertext guidelines use the more flexible text-based or ‘creamy’ approach to hypertext.” See Steven J. DeRose and David G. Durand, The TEI Hypertext Guidelines, Computers and the Humanities 29, 1995, pages 181-190.
The PAEE attempted an apparatus criticus by preparing tiny hypercard chunks in HTML 1.0 frames as an interface that prioritized a study of variation, and even coded that variation in markup of their own that used angle brackets, square brackets, boldface, and italics to provide through the web browser interface a synoptic view of a variorum edition. Fortunately the edition is still served from its original University of Pennsylvania URL, apparently unchanged over the years, but in a time of generational transference marked by rapid aging of electronic texts prepared in non-standard ways, the old edition’s fragility and ambition are worth contemplating now.
In 2017-2018, we preparers of a new edition (just like our predecessors) are standing on the proverbial shoulders of giants, and if we are single-minded about preparing the documents in a format we can readily process and publish, we stand potentially to lose the scholarship and the impressive mass of paratext surrounding that first edition. We underestimate these early editions at our peril (or to our potential cultural impoverishment). Indeed, the PAEE contains hundreds of paratext documents, including an impressive corpus of other literary texts in its Works Included in this Edition bordering on and relevant to Frankenstein, as well as an impressive array of Contexts pages covering religious, mythical, geographic, and scientific topics. Whether we can help preserve the vision of the first editors in interlinking contextual materials with their critical variorum edition, or whether we should try to do so, are open questions at this stage in our work on the Bicentennial Frankenstein project. Our own team, itself pressed for time and finding a need to invent a new expression of Frankenstein’s contexts, is not likely to curate the paratextual assemblage of documents in the PAEE, as we have prioritized for the bicentennial moment reconstructing the electronic variorum and adding more to it than was available in the original.
Stages in Up-Translating the Pennsylvania Electronic Edition
In December 2016, I harvested the nearly 500 individual HTML files of the PAEE representing the 1818 and 1831 editions and began a painstaking process of up-translation, with the following major stages in view:
1) converting the hundreds of hypercard documents of the PAEE from HTML 1.0 into two separate plain text documents representing the 1818 and 1831 editions;
2) up-converting these to a simple form of XML to prepare them for automated collation with collateX, software designed to compare multiple documents, locate and mark their points of variance, and output them tagged according to TEI’s critical apparatus markup;
3) preparing from carefully proofed OCR a digitized 1823 text, formatted in XML for collation with the 1818 and 1831 editions;
4) processing the three XML documents with collateX to produce an XML text, to be up-converted into TEI;
5) maintaining the three separate documents to process with an additional set of witnesses prepared from the Shelley-Godwin Archive edition of the 1816-1818 manuscript notebooks of Frankenstein.
At the time of this writing in July 2017, we are beginning work on the last two stages of this. That is, we have successfully prepared a fresh automated collation of the 1818, 1823, and 1831 printed editions of Frankenstein, and we are planning how to:
* re-process the collation after we prepare a similarly formatted unified file of the thousands of manuscript notebook files from the Shelley-Godwin Archive
* build a complete TEI edition from the collation
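To give a rough sense of what a collation tagged with critical apparatus markup involves, the sketch below compares two short witnesses word by word and wraps each point of variance in app and rdg elements. It is only an illustration under stated assumptions: it uses Python's standard `difflib` module rather than collateX (which has its own alignment algorithm and handles more than two witnesses), and the function name `collate_pair`, the witness identifiers, and the sample sentences are invented for this example.

```python
from difflib import SequenceMatcher
from xml.sax.saxutils import escape


def collate_pair(wit_a, text_a, wit_b, text_b):
    """Align two witnesses word by word and mark variants with app/rdg.

    Hypothetical helper: difflib stands in for a real collation engine.
    """
    tokens_a, tokens_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    parts = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        reading_a = escape(" ".join(tokens_a[a1:a2]))
        reading_b = escape(" ".join(tokens_b[b1:b2]))
        if tag == "equal":
            # Text shared by both witnesses is output once, untagged.
            parts.append(reading_a)
        else:
            # Any replacement, insertion, or deletion becomes one app element
            # holding one rdg per witness (an rdg may be empty).
            parts.append(
                f'<app><rdg wit="#{wit_a}">{reading_a}</rdg>'
                f'<rdg wit="#{wit_b}">{reading_b}</rdg></app>'
            )
    return "<p>" + " ".join(parts) + "</p>"


if __name__ == "__main__":
    # Invented sample sentences, not quotations from either edition.
    print(collate_pair("p1818", "I am by birth a Genevese",
                       "p1831", "I am by birth a Scotsman"))
```

A pairwise word alignment like this becomes unreliable over long texts with transpositions, which is one reason to rely on dedicated collation software for whole-novel comparison.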
Beyond these goals of moving from chunks to large unified files for our base texts, we of course also face the challenge of how to share the edition with readers, not only scholars of the novel and its time period but also enthusiasts of the Frankenstein narrative, who may, if we are successful, learn something new about its textual genetics: how this text transformed over time. That last goal is out of scope for this paper, as our current project is to create tractable, semantically meaningful TEI markup, building anew from a start in the PAEE files.
My concentration in this paper is on a stage that is already completed, and on my sustained interaction with the PAEE hypertext edition. We first decided to process a predictable plain text with pseudomarkup to preserve information from the markup, and we decided that each stage of our up-conversion would produce a distinct and re-usable edition in its own right, whether in plain text, or a preliminary stage of XML for collation, or ultimately the collated edition in TEI P5. In this context, the PAEE files take on multiple afterlives in plain text and XML forms.
At the time of preparation, we were uncertain whether plain text or simple XML formatting was best suited for collation processing, and in the course of that processing we discovered that XML was indeed preferable, since the collation software could be programmed to read, process, or ignore particular elements. The pseudomarkup in our plain text, however, served (and continues potentially to serve) as a viable intermediary format, trivially easy to up-translate into XML, and co-existing alongside its hierarchically organized kin.
We began, then, by producing plain text from the old PAEE hypercards. The following documents my decisions and actions taken on the 1990s files to prepare them as plain text editions, prior to collation.
Decisions for preserving and eliminating markup in plain text versions
* Using regex find-and-replace strategies, we prepared altered versions of the PA EE HTML files to reproduce simpler forms consistent with current XHTML 5 standards.
* In the PA EE some elements (like <p> and <br>) were not given close tags, while others were, making the code difficult to process with XSLT. Close tags were applied and the files were simplified to carry only the title page, prefatory material, and text of the novel.
* The elements holding navigational information in the PA EE were excluded. This is because the PA EE texts were prepared as 238 and 250 separate HTML files (for the 1818 and 1831 editions) in order to manually align them in small chunks as a means to compare them visually in HTML frames. Since our edition is uniting these hundreds of chunks into a single document, we will prepare new navigational elements at a later stage after we have prepared our new TEI edition and are ready to produce a new reading view.
* Renaming files and directories: The PA EE files were stored in three separate directories for each edition, associated with volumes 1, 2, and 3 of the 1818 edition, and the 1831 files were given names to assist with pairing them with associated chunks of the 1818 edition inside HTML frames. Since we need to process the files all together to output a single text of the 1818 and of the 1831 novel, we flattened the hierarchy: We removed the volume directories and held each edition’s set of 238 and 250 files respectively in its own directory. The files were renamed carefully to number their sequence in assembling the text, and to simplify their association with the text’s structural layers: the opening material, the Walton letters that frame the text at its beginning and end, and the internal chapters.
* Eliminating hyperlinked editorial annotations: We decided we must simply represent the nineteenth-century editions and that we cannot at this time properly curate the PAEE edition itself, so we did not port the links to editorial annotations coded in the PA EE.
* To ease the collation and up-conversion process later, we are preserving information from the presentation markup of the PA EE texts: its rendering of italics, square brackets, and centered text.
  * In the PA EE there is no distinction between italics for titles and italics for emphasized words. Because the asterisk is used to signal footnotes in the text, we use the underscore (`_`) instead to mark off italicized text of any kind.
  * Square brackets (`[ ]`) are placed around text marked as small caps. (We have commented out the one instance in the 1831 PA EE HTML in which square brackets were used to hold a normalized variant of a word, to suppress that from the output.)
  * Centered text is marked between percent symbols: `% %`. Note: some center tagging, such as in header tags, was lost in the conversion process and should be restored as we proof the texts.
  * Each unit of PA EE HTML texts marked with a structural element to indicate line break (<br>) or paragraph (<p>) is produced as a unit line in the plain text. Thus, an entire paragraph appears as a single line. Every unit line is followed by two newline characters.
* Documentation is generated at the head of the text files inside commented text marked with hashes (`# `), to indicate the derivation of the documents from the PA EE and to document the rendering decisions above.
Stages for processing the altered PA EE HTML to produce plain text editions
The following processing stages would certainly have been better accomplished entirely with XSLT, because after the first step they rely on a complex series of find-and-replace operations that could have been handled and documented more reliably with XSLT string-processing functions. The basic idea, though, is to prepare a consistently formatted set of documents that can reliably be compared with collation software.
* The conversion process relies on an XSLT transformation to prepare plain text from the updated HTML, running it over directories of the hundreds of files. The stylesheet is prepared to process from a collection organized unambiguously by filename and to output a single file; filenames were prefaced by a number so that they are processed in sequential order. The XSLT is featured in the following section and is stored in our GitHub repository.
* In oXygen, we processed the transformations in two batches, by uncommenting the appropriate variable pointing to either the 1818 or 1831 directory, commenting out the other, and running it over any dummy XML file (since oXygen requires that an XML file be associated with the transformation).
* Open the output in Text Wrangler (recently superseded by BBEdit) and in oXygen, and work on the following:
  * In Text Wrangler (or BBEdit), remove line breaks (option in the Text menu). This ensures that any text preceded by just one newline character is pulled into the preceding line, which unites the content of each paragraph inside a single line.
  * In oXygen, with regex find and replace, eliminate instances of more than two newline characters (`\n`), but ensure that two newlines appear between each line.
  * Add `\n\n` after VOLUME, LETTER, PREFACE, and CHAPTER headings and the Introduction heading in the 1831 edition. Search for `(PREFACE|VOLUME|LETTER|CHAPTER)\s+[IVXLC]+\.*` . Also check and restore newlines in letter headings.
  * In Text Wrangler (or BBEdit), educate the quotes: This produces curly apostrophes and quotes from the straight quotes of the PA EE.
  * Regularize white spaces using Find & Replace in oXygen, using the `\h` regex to indicate white space inside a line. Replace any instances of `\h\h` with a regular space.
  * Convert double hyphens (--) to em dashes (—).
XSLT for Translation of Up-Translated HTML to Pseudo-Marked Text
Our translation of the PAEE into XML involves a significant intermediary step that might be considered a destination in its own right: a not-so-plain text file featuring pseudo-markup that preserves information from the markup. Our XSLT produces a header of meta-information about the edition we are preparing, with an explanation of symbols. This edition of the 1818 and 1831 text files might well be a valuable output in its own right, prior to their collation. Preparing this text format and its pseudomarkup establishes the basis for our new preparation of a plain text edition for the OCR'd 1823 edition. All three editions must be prepared in the same plain text format to ensure a precise and accurate machine collation into a single file.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0">
<!--2016-12-28 ebb: Prepared to process from a collection organized unambiguously by filename and output a single file. Filenames were prefaced by a number to process in sequential order.-->
<xsl:strip-space elements="*"/>
<xsl:output method="text" encoding="UTF-8"/>
<!--ebb: Uncomment one of the following lines to process the appropriate edition, either 1818 or 1831.-->
<!--<xsl:variable name="paEdition" select="collection('../frankenTexts_HTML/PA_Electronic_Ed/1818_ed')"/>-->
<xsl:variable name="paEdition" select="collection('../frankenTexts_HTML/PA_Electronic_Ed/1831_ed')"/>
<xsl:template match="/">
<xsl:text>********************************************************************************
# FRANKENSTEIN; OR, THE MODERN PROMETHEUS
## The Pittsburgh Bicentennial Edition
### INTRODUCTORY NOTE ON THE TEXT:
This is a plain text edition of the </xsl:text><xsl:value-of select="($paEdition//head[1]/tokenize(title, ', ')[2])[1]"/> edition of _Frankenstein; or, the Modern Prometheus_ by Mary Shelley <xsl:text>prepared for the Frankenstein Bicentennial project, which commemorates the 200th anniversary of the first published edition of this novel in 1818.
</xsl:text>
<xsl:text>Frankenstein; or, the Modern Prometheus: Pittsburgh Bicentennial Edition is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. <!--ebb: Check with project team. Do we want this to be a free culture license, meaning we permit commercial uses of this work? If so, change this to read:
Frankenstein; or, the Modern Prometheus: Pittsburgh Bicentennial Edition is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
-->
</xsl:text>
<xsl:text>Date this text was produced: </xsl:text><xsl:value-of select="current-dateTime()"/><xsl:text>.
</xsl:text>
<xsl:text>This edition is part of the Pittsburgh research team's contribution to the Bicentennial Frankenstein Project, and is prepared by Elisa Beshero-Bondar of the University of Pittsburgh at Greensburg with assistance from Rikk Mulligan of Carnegie Mellon University. We are grateful for consultation from Wendell Piez, David J. Birnbaum, and Raffaele Viglianti, as well as Neil Fraistat and Dave Rettenmaier. This edition's stages of development are stored and documented in the Pittsburgh_Frankenstein GitHub repository: https://github.com/ebeshero/Pittsburgh_Frankenstein/ .
We have produced this plain text edition for two purposes:
1) To prepare for automated collation of the 1818, 1823, and 1831 editions of _Frankenstein_ using CollateX, in order to generate a TEI XML document that stores the variations of these texts.
2) To provide a reliable digital base text of each edition tractable for future projects.
</xsl:text>
<xsl:text>This plain text edition is one of two, representing the 1818 and 1831 editions of the novel. This pair of editions is based on the Pennsylvania Electronic Edition of _Frankenstein; or, the Modern Prometheus_ by Mary Shelley, edited by Stuart Curran and assisted by Jack Lynch, located at http://knarf.english.upenn.edu/ and hereafter referred to as PA EE. Elisa Beshero-Bondar and Rikk Mulligan *are correcting* these texts against photo facsimiles of the 1818 and 1831 texts.
* We will alter the previous sentence in this header when this phase of proof-checking is completed.
</xsl:text>
<xsl:text>Our plain text edition preserves the rendering of italics, square brackets, and centered text from the PA EE HTML texts.
* In the PA EE there is no distinction between italics for titles and italics for emphasized words. Because the asterisk is used to signal footnotes in the text, we use the underscore (`_`) instead to mark off italicized text of any kind.
* Square brackets (`[ ]`) are placed around text marked as small caps. (We have commented out the one instance in the 1831 PA EE HTML in which square brackets were used to hold a normalized variant of a word, to suppress that from the output.)
* Centered text is marked between percent symbols: `% %`.
* Each unit of PA EE HTML texts marked with a structural element to indicate line break (`<br>`) or paragraph (`<p>`) is produced as a unit line in the plain text. Thus, an entire paragraph appears as a single line. Every unit line is followed by two newline characters.
</xsl:text>
<xsl:text>Note for later processing: In the PA EE of this text, there are </xsl:text><xsl:value-of select="count(distinct-values($paEdition//body//a/@href))"/> encoded links, each pointing to an editorial annotation.
<xsl:text>********************************************************************************</xsl:text>
<xsl:apply-templates select="$paEdition//body"/>
</xsl:template>
<xsl:template match="br">
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="p">
<xsl:apply-templates/><xsl:text>
</xsl:text>
</xsl:template>
<!-- <xsl:template match="text()">
<xsl:apply-templates select="normalize-space(.)"/>
2016-12-28 ebb: normalize-space() causes problems: too much tightening up of the output so words are run together, also when applied at <p> template, child nodes aren't processed.
Using regex and Text-Wrangler on the output file to remove its excess lines.
</xsl:template>-->
<xsl:template match="i">
<xsl:text>_</xsl:text><xsl:apply-templates/><xsl:text>_</xsl:text>
</xsl:template>
<xsl:template match="small">
<xsl:text>[</xsl:text><xsl:apply-templates/><xsl:text>]</xsl:text>
</xsl:template>
<xsl:template match="center">
<xsl:text>%</xsl:text><xsl:apply-templates/><xsl:text>%</xsl:text>
</xsl:template>
</xsl:stylesheet>
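The find-and-replace stages that follow the XSLT transformation were carried out by hand in oXygen and Text Wrangler/BBEdit. As a hedged illustration only, the same stages might be scripted roughly as follows. The function name `normalize_plain_text` and the choice of Python are assumptions for this sketch, not part of the project's workflow; the heading pattern is adapted from the search expression given earlier, `[ \t]` stands in for oXygen's `\h`, and the "educate quotes" step is omitted because it depends on the editor's own algorithm.

```python
import re

# Heading pattern adapted from the search expression used in oXygen.
HEADING = re.compile(
    r"^((?:PREFACE|VOLUME|LETTER|CHAPTER)[ \t]+[IVXLC]+\.*)[ \t]*\n*",
    flags=re.MULTILINE,
)


def normalize_plain_text(text: str) -> str:
    """Approximate the manual cleanup applied to the XSLT text output."""
    # 1. "Remove line breaks": a lone newline is folded into the
    #    preceding line, so each paragraph occupies a single line.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # 2. Regularize runs of horizontal white space to one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # 3. Make sure each heading is followed by two newlines.
    text = HEADING.sub(r"\1\n\n", text)
    # 4. Collapse three or more newlines to exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 5. Convert double hyphens to em dashes (U+2014).
    text = text.replace("--", "\u2014")
    return text


if __name__ == "__main__":
    sample = (
        "CHAPTER I.\nI am by birth a Genevese;  and my family--one of\n"
        "the most distinguished.\n\n\n\nMy ancestors had been counsellors."
    )
    print(normalize_plain_text(sample))
```

Scripting the stages this way would also document their order, which matters: line breaks are folded before headings are separated, so a heading that was joined to its first paragraph is split off again in step 3.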
Collation Discoveries: Flattening and Chunking Again
At the time of this writing, two members of our team have completed proof-checking the plain text files generated by the above stages of processing against photofacsimiles of the nineteenth-century editions. We have undone the normalized spellings of the PAEE to restore the original spellings of the nineteenth-century texts, and have corrected transcription errors, improving what we can. We also painstakingly labored on correcting a plain text file of the 1823 text generated by OCR from ABBYY FineReader. We discovered it was best to convert these to XML (easily done with a series of find-and-replace operations from pseudomarkup to angle brackets). In May 2017, we processed our first collation of the documents, and discovered in the process that we could output plain text tables to align the outputs side by side, as well as XML output, both of which are useful. We also discovered two things that have caused us to return to revise substantially the documents we had prepared:
* Collating multiple versions of a novel is highly demanding of processing power: the process lags considerably and may not properly locate alignments and deviations if the entire novel is processed at once. I had to prepare my XML for chunking again, with larger, roughly chapter-sized chunks rather than the one or two paragraphs of hypercard chunking from the PAEE. That meant dispensing with structuring an elaborately hierarchical document prior to machine collation: the highest level of hierarchical organization below the document node is a paragraph, and the structural components of the novel (chapter and volume and letter divs) are signalled with milestone-style elements. Milestone elements are used to signal the boundaries of the pieces to be processed by collateX.
* Once we had processed the collations, we easily saw in collateX's plain text output tables many more errors to check and correct against our photofacsimiles.
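The conversion from pseudomarkup to angle brackets and the division of a flat document at milestone markers might be sketched as follows. This is a minimal illustration, not the project's code: the element and attribute names (`hi` with `rend`, `milestone` with `unit` and `n`), the function names, and the sample text are all assumptions made for the example. What it preserves from the description above is the flat shape of the result, in which paragraphs are the only containers and structural divisions appear as empty milestone elements.

```python
import re
from typing import List
from xml.sax.saxutils import escape

HEADING = re.compile(r"^(VOLUME|LETTER|CHAPTER)[ \t]+([IVXLC]+)\.*$")


def unit_to_xml(line: str) -> str:
    """Turn one pseudo-marked unit line into a milestone or a flat <p>."""
    heading = HEADING.match(line)
    if heading:
        unit, number = heading.group(1).lower(), heading.group(2)
        return f'<milestone unit="{unit}" n="{number}"/>'
    text = escape(line)  # protect &, < and > before adding markup
    text = re.sub(r"_(.+?)_", r'<hi rend="italic">\1</hi>', text)
    text = re.sub(r"\[(.+?)\]", r'<hi rend="smallcaps">\1</hi>', text)
    text = re.sub(r"%(.+?)%", r'<hi rend="center">\1</hi>', text)
    return f"<p>{text}</p>"


def chunk_at_milestones(plain_text: str, unit: str = "chapter") -> List[List[str]]:
    """Group converted units, opening a new chunk at each milestone of `unit`."""
    chunks: List[List[str]] = [[]]
    for line in plain_text.split("\n\n"):
        line = line.strip()
        if not line:
            continue
        xml = unit_to_xml(line)
        if xml.startswith(f'<milestone unit="{unit}"') and chunks[-1]:
            chunks.append([])
        chunks[-1].append(xml)
    return chunks


if __name__ == "__main__":
    # Invented sample in the pseudo-marked plain text format.
    sample = ("CHAPTER I.\n\nI am by birth a _Genevese_.\n\n"
              "CHAPTER II.\n\nWe were brought up together.")
    for number, chunk in enumerate(chunk_at_milestones(sample), start=1):
        print(number, chunk)
```

Because every chunk begins at a milestone that can be located in each witness, the chunks from different editions can be paired for collation without reference to any enclosing hierarchy.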
The irony of this is that, just as we thought we might be finished with stitching up the body of the Frankenstein Creature, we discovered a necessity to break it into pieces again in order to facilitate collation. Collation outputs are necessarily multiple now, and must be processed again to stitch them together. Collation is now a recurring process, helping us to note corrections still to be made in our base texts. The plain text and simple XML files we have prepared now serve as a sort of ur-text still leading to our goal of preparing a new synoptic TEI document combining the three texts. From the process, we begin to see some new flexibility and value in flattened, chunkable XML documents.
The need to work on the collation in small units is made more pressing by our analysis of the work ahead in bringing the manuscript notebooks into the collation with the print editions. At first we thought that the ur-text file would be an XML document containing angle bracket markup in the form of critical apparatus tags, thus:
<p>I am by birth a
<app>
<rdg wit="#c56"><ptr target="http://url.shelley-godwin"/></rdg>
<rdg wit="#p1818">Genevese</rdg>
<rdg wit="#p1823">Scotsman</rdg>
<rdg wit="#p1831">Martian</rdg>
</app>.
</p>
This encoding (of a nonsensical and nonexistent sample passage) demonstrates our first plan of 2016 for interweaving three text files together with pointers into the Shelley-Godwin Notebooks, which combine text and image and cannot be rendered as text documents. Our team member at MITH and my colleague on the TEI Technical Council, Raffaele Viglianti, first planned to encode linkages to associated manuscript notebook facsimile pages in the Shelley-Godwin Archive. In a published edition, the pointers would lead from our edition into related passages of the draft notebooks. However, we discovered two problems with this plan on examining the output of our collation of the 1818, 1823, and 1831 editions:
* First, the collation output is much too complicated to process easily by hand.
* Perhaps more significantly, the notebooks are themselves a variant edition in their own right, and it would perhaps be more efficient to process their texts with automatic collation in parallel with the print editions we have been preparing.
This raises a fresh set of challenges for our project. The notebook XML is chunked quite finely, with a separate file for each page, using diplomatic TEI markup that prioritizes the description of each page. Line breaks are particularly vexing because where words in the notebooks break at the ends of lines, there are no consistent, reliable symbols to indicate how they are joined on the next line.
Our plan is to pull out the line elements, use the existing markup to locate paragraphs, and preserve information about insertions and deletions in the TEI of the Shelley-Godwin Notebooks. The new version of XML we produce would then be more or less compatible with the editions we prepared of the 1818, 1823, and 1831 texts for the purposes of collation, working in some additional information from the diplomatic edition and finding a way to signal that information in the critical apparatus output. We would need to stitch the thousands of notebook pieces into the larger chunks at alignment points we can identify across all of the documents. This will undoubtedly prove the most challenging stage of our work so far.
Reflections toward a theory of up-translation
The Bicentennial Frankenstein project’s encounter with an impressive early hypertext edition raises more general questions worthy of reflection towards theorizing the up-transformation process:
How do we understand the relationships among generations of digital editions?
Our experience urges caution with hasty reappropriation or automated methods in up-translating dated documents. Careful document analysis and an assessment of the vision of the original edition will challenge the would-be up-translator to find some way to respect the vision and scope of a dated electronic edition.
What aspects of the old hypertext editions (or editions in formats not consistent with our own) transcend or exceed the structures we currently consider sustainable? What perspective might a thorough review of the first still extant hypertext editions contribute to our scholarly editing practice now?
The fragmentation of the early Frankenstein into nearly 500 pieces represents a particulated vision of collation that made difficulties for our up-translation process, and we did not anticipate that we would need to chunk the documents yet again, and yet once more (for the manuscript notebooks) in our own software-assisted collation.
The survival of a particulated, chunky Frankenstein edition is the most remarkably persistent feature of our work thus far. We have been discovering that large documents with entrenched hierarchies are difficult to process in portions. When we prepare chunks or segments of text for collation, we need to make sure their start and end points are aligned in some way, so we look for those moments of alignment and set milestone units as signal posts. But to produce tractable files, cutting through volume divs (for example) raises problems, particularly when those structural units of hierarchy are not consistent in the editions being compared.
What have we learned?
We discover that deep-nested hierarchies create problems when we need to compare different editions of the same text whose hierarchies are not aligned. We discover that readily fragmentable output may be preferable for indicating points of intersection. We consider that the process of up-conversion and up-translation might best produce multiple formats of output, where plain text and XML co-exist. We find, on the path toward our goal of producing a synoptic TEI edition, that it is desirable to share both a hierarchical document that stores information about comparison and separate edition files of each document. The imposition of structural and critical-apparatus hierarchy represents one interpretation of our documents, but that will not be the only one worth preserving over generations. Apparently, we need simple, granular documents, too.