How to cite this paper

Beshero-Bondar, Elisa E. “Adventures in Correcting XML Collation Problems with Python and XSLT: Untangling the Frankenstein Variorum.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Beshero-Bondar01.

Balisage: The Markup Conference 2022
August 1 - 5, 2022

Balisage Paper: Adventures in Correcting XML Collation Problems with Python and XSLT

Untangling the Frankenstein Variorum

Elisa E. Beshero-Bondar

Professor of Digital Humanities

Program Chair of Digital Media, Arts, and Technology

Penn State Erie, The Behrend College

Elisa Beshero-Bondar explores and teaches document data modeling with the XML family of languages. She serves on the TEI Technical Council and is the founder and organizer of the Digital Mitford project and its usually annual coding school. She experiments with visualizing data from complex document structures like epic poems and with computer-assisted collation of differently encoded editions of Frankenstein. Her ongoing adventures with markup technologies are documented on her development site at newtfire.org.

Copyright © 2022 Elisa Beshero-Bondar

Abstract

The process of instructing a computer to compare texts, known as computer-aided collation, might resemble trying to fix a power loom when the threads it is supposed to weave together become tangled. The power of the automated weaving continues, with the threads improperly aligned and the pattern broken in a way that can make it difficult to isolate the cause of the problem. Automating a tedious process magnifies the complexity of error-correction, sometimes calling for new tooling to help us perfect the weaving or collating process.

The authors are attempting to refine a collation algorithm to improve its alignment of variant passages in the Frankenstein Variorum project. We have begun with a Python script that tokenizes and normalizes the texts of the editions and delivers them to collateX, which processes the collation and delivers TEI-conformant output for our project. In post-processing stages after running the collation, we apply a series of XSLT transformations to the collation output. This post-collation XSLT pipeline publishes the digital variorum edition, preparing each output witness in TEI XML to store information about its own variance from the other editions. We have discussed that pipeline elsewhere, but our interest in this paper is in efforts to repair, correct, and improve the collation process.

We have applied Schematron and XSLT in post-processing to correct patterns of erroneous alignments, but eventually realized that the problems we were trying to solve required repairing the collation algorithm. We are now experimenting with revising the collation algorithm in two ways: 1) by fine-tuning the text preparation algorithms we apply in our Python file that delivers text to the collateX software, and 2) by attempting to introduce those same text preparation algorithms entirely with XSLT using the Text Alignment Network's XSLT application of tan:diff() and tan:collate(), introduced by Joel Kalvesmaki at the 2021 Balisage conference. In this paper we discuss the challenges of figuring out where and how to intervene in the collation process, and what we are learning about how far we can take XSLT and Schematron in helping to automate the preparation, collation, and correction process.

Table of Contents

Cycles of time in the Gothenburg model for computer-aided collation
The Frankenstein Variorum project and its production pipeline
Snags in the Weaving
A Schematron flashlight
XSLT for alignment correction
The smashed-tokens problem
XSLT for the win?

Cycles of time in the Gothenburg model for computer-aided collation

A project applying computer-aided collation, by which we mean the guidance of a computational system to compare texts, challenges our debugging capacities in ways that scale and consume far more time than humans may expect at the outset. Scholarly editors working on collation projects often do so in the context of preparing critical editions that present a text-bearing cultural object in all its variations. These editors tend to prioritize a high degree of accuracy in the marking of variants, and may come to a computer-aided collation project prepared for saintly patience and endurance to spot and correct each and every error. Yet endurance has its mortal limits, and computationally-generated errors require the developers' patient scrutiny to be directed not to the outputs but to the preparation of the inputs. Humanities scholars (the author included) may be tempted to hand-correct errors as they spot them to perfect an output for a swift project launch, but this is brittle and can even compound errors, even if applied by a computer script that finds and replaces every instance of a distinctive pattern in the output. We should limit correcting the output, even computationally, in favor of making changes at the guidance and preparation stages and re-running the collation software. Taking cycles of time to study the errors generated by a collation process should lead us to a state of enlightenment about precisely what we are comparing and precisely how it is to be compared. Do we consider computer collation a form of machine learning, in which we expend energy preparing the training set? But it is really the scholars who learn over cycles of collation re-processing, with enhanced awareness of their systems for comparison. Perhaps we should think of computer-aided collation as machine-assisted human learning.

This paper marks a return to work on a digital collation project that began in the years just before the pandemic, and paused a moment in 2020-2021 when several of the original project members moved to new jobs and I, as the one leading the preparation of the collation data, concentrated on teaching and running a program in a new university during a time of global pandemic crisis disrupting university work. Returning to work on the Frankenstein Variorum has necessitated reorienting myself to complicated processes and reviewing the problems we identified as we last documented and left them. This has been a labor of reviewing and sampling our past code, updating its documentation, and trying to find new and better ways to continue our work. We set out to prepare a comparison of five distinct versions of the novel Frankenstein, without taking any single edition as the default standard. Rather, we seek to provide readers a reading view for each of the five versions that demonstrates which passages were altered and how heavily they were altered. The Frankenstein Variorum should help readers to see a heavily revised text and understand the abandoned, discontinuous branches of that revision.

Much of this project has investigated methods for moving between hierarchical and flat text structures, moving back and forth between markup and strings as we thread our texts into the machine power-loom of collation software, examine the woven output, and contemplate how to untangle the snags and snarls we find in the woven textile of collation output. Our work is guided by the Gothenburg model, an elaboration of five distinct stages necessary to guide machine-assisted collation, established in 2009 by the developers of collateX and Juxta in Gothenburg, Sweden:

  1. Tokenization (deciding on the smallest units of comparison)

  2. Normalization/Regularization (deciding what distinct features in the source documents need to be suppressed from comparison. An easy example of a normalization step is indicating that ampersands stand in for the word and. A more difficult judgment call is the handling of notational metamarks or even editorial markup that assists reading in one document, but should be skipped over by the collation software as not meaningful for comparison with other documents.)

  3. Alignment (locating via software parallel equivalent passages and their moments of unison and divergence)

  4. Analysis/Feedback (reviewing, correcting, and adjusting the alignment algorithm as well as the normalization/regularization method)

  5. Visualization (finding an effective way to show and share the collation results)[1]
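
To make the first two stages concrete before discussing their cyclicality, here is a minimal Python sketch of tokenization and normalization as they might run before any alignment takes place. The sample string and the two rules shown (whitespace tokenization, ampersand expansion plus lowercasing) are illustrative assumptions only, not the project's actual rule set:

sample = "CHAPTER IV. It was on a dreary night of November, & I beheld"

def tokenize(text):
    # Stage 1: choose the smallest units of comparison; here, whitespace-separated words.
    return text.split()

def normalize(token):
    # Stage 2: suppress features we do not want the aligner to "see".
    token = token.replace("&", "and")   # treat an ampersand as a stand-in for the word "and"
    return token.lower()                # ignore case differences

tokens = tokenize(sample)
normalized = [normalize(t) for t in tokens]
# the aligner compares the normalized forms but reports the original tokens
print(list(zip(tokens, normalized)))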

These stages are usually numbered, but that numbering belies the cyclicality of the Gothenburg process. We scholars who collate have to try and try again to retool, rethink, and re-run, and even this is not a steady process, because we have to decide whether certain kinds of problems are best resolved by changing the pre-processing to improve normalization prior to collation, or by correcting errors in the collation output in post-processing. Collation projects of significant length and complexity require computational assistance to visualize not only the collation output but, crucially, to locate patterns in the collation errors. The Analysis/Feedback stage is critical, and without giving it procedural care we risk inaccuracies or the human error introduced by hand-correcting outputs.

In returning to work on the Frankenstein Variorum, I have concentrated on the Analysis/Feedback stage to try to find patterns in the snags of our collation weave, and to try different normalization and alignment methods to improve the work. The work requires close observation of meticulous details (plenty of myopic eye fatigue), but also a return to the long view of our research goals. As David J. Birnbaum and Elena Spadini have discussed, normalization is interpretive, reflecting on transcription decisions and deciding on what properties should and should not be considered worthy of comparison.[2] In our project, the great challenge for the normalization process has been contending with the diplomatic markup of the Shelley-Godwin Archive's manuscript notebook edition of Frankenstein. The meticulous page-by-page markup structured on page surfaces and zones cannot be compared with the semantic trees representing the other print-based editions in the project, though deletions do matter, as do the insertion points of marginal additions and corrections. Alignment of our variant passages is improved by revisiting our normalization algorithms, and also by testing our software tools as soon as we recognize problems.

In this paper, I venture that the software we choose to assist us with aligning collations matters not only for its effectiveness in applying an algorithm like Needleman-Wunsch or Dekker or TAN Diff, but also for its capacity to guide the scholar in the documentation of a highly complex, time-consuming, cyclical process. In the following sections I first unfold the elaborate pipeline process of our collation and visualization, and then introduce problems we have found and efforts to resolve them by exploring a new method of alignment. XSLT has been vital to the pre-processing and post-processing of our edition and collation data, but now we find it may be of benefit for handling the entire process of tree flattening, string collation, and edition construction. Long ago in the early years of XSLT, John Bradley extolled its virtues not only for publishing documents but also for scholarly text analysis and research.[3] Though it is common in the Digital Humanities community now to think of Python or R Studio as the only tools one needs for text analysis, the strengths of XSLT seem realized in its precision and elegance in transforming structured and unstructured text and its particular ease of documentation for the humanities researcher. In this paper, I contemplate a move from applying normalizing algorithms to stringified XML in Python to restating and revising those algorithms in XSLT.

The Frankenstein Variorum project and its production pipeline

The Frankenstein Variorum project (hereafter referred to as FV) is an ongoing collation project that began during the recent 1818-2018 bicentennial celebrating the first publication of Mary Shelley's novel. We are constructing a digital variorum edition that highlights alterations to the novel Frankenstein over five key moments from its first drafting in 1816 to its author’s final revisions by 1831. If we think of each source edition for the collation as a thread, the collation algorithm can be said to weave five threads together into a woven pattern that helps us to identify the moments when the editions align together and where they diverge. The project applies automated collation, so far working with the software collateX to create a weave of five editions of the novel, output in TEI-conformant XML. We have shared papers at Balisage in previous years about our processing work, preparing differently-encoded source editions for machine-assisted collation[4], and working with the output of collation to construct a spine file in TEI critical apparatus form, which coordinates the preparation of the edition files that hold elements locating moments of divergence from the other editions.[5]

Our source threads for weaving the collation pattern include two well-known digital editions: The Pennsylvania Electronic Edition (PAEE), an early hypertext edition produced at the University of Pennsylvania in the mid 1990s by Stuart Curran and Jack Lynch that represents the 1818 and 1831 published editions of the novel, and the Shelley-Godwin Archive's edition of the manuscript notebooks (S-GA) published in 2013 by the University of Maryland. We prepared two other editions in XML representing William Godwin’s corrections of his daughter's edition from 1823, and the Thomas copy, which represents Mary Shelley’s corrections and proposed revisions written in a copy of the 1818 edition left in Italy with her friend Mrs. Thomas.

Here is a summary of our project’s production pipeline, discussed in more detail in previous Balisage papers:

  1. Preparing differently-encoded XML files for collation. This involves several stages of document analysis and pre-processing:

    • Identifying markup that indicates structures we care about: letter, chapter, paragraph, and verse structures, for example. Where these change from version to version, we want to be able to track them by including this markup with the text of each edition in the collation.

    • Re-sequencing margin annotations coded in the SGA edition of the manuscript notebook, as these margin annotations were encoded at the ends of each XML file and needed to be moved into reading order for collation. (For this resequencing, we wrote XSLT to follow the @xml:ids and pointers on each SGA edition file).

    • Determining and marking chunk boundaries (dividing the texts into units that mostly start and end the same way to facilitate comparison): We divided Frankenstein into 33 chunks roughly corresponding with chapter structures. (Later in refining the collation we subdivided some of these chunks, breaking some complicated chunks into 3-5 sub-chunks, for a more granular comparison of shorter passages.)

    • Flattening the edition files' structural markup into Trojan milestone elements (converting structural elements that wrap chapters, letters, and paragraphs, like <div xml:id="chap-ID">, into self-closed matched pairs of elements that mark the starts and ends of structures: <div sID="chap-ID"/> and <div eID="chap-ID"/>). The use of these empty milestones facilitates the inclusion of markup in the collation when, for example, a version inserts a new paragraph or chapter boundary within an otherwise identical passage in the other editions. Ultimately this flattening in pre-processing facilitates post-processing of the edition files from text strings to XML holding collation data.[6] A minimal Python sketch of the flattening follows.
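
    This sketch is for illustration only: it assumes well-nested input and a hypothetical short list of wrapper elements, and it is not the project's production code:

      import re

      WRAPPERS = {"div", "p", "head"}   # hypothetical subset of structural wrappers to flatten

      def flatten_to_trojan(xml_text):
          """Replace wrapper start and end tags with self-closed sID/eID milestones."""
          stack, out, pos = [], [], 0
          # naive tag scan: assumes no ">" inside attribute values
          for m in re.finditer(r"<(/?)([\w:]+)([^>]*?)(/?)>", xml_text):
              out.append(xml_text[pos:m.start()])
              closing, name, attrs, selfclosing = m.groups()
              if name in WRAPPERS and not selfclosing:
                  if not closing:
                      idm = re.search(r'xml:id="([^"]*)"', attrs)
                      xid = idm.group(1) if idm else ""
                      stack.append(xid)
                      out.append(f'<{name} sID="{xid}"/>')
                  else:
                      out.append(f'<{name} eID="{stack.pop()}"/>')
              else:
                  out.append(m.group(0))   # everything else passes through untouched
              pos = m.end()
          out.append(xml_text[pos:])
          return "".join(out)

      print(flatten_to_trojan('<div xml:id="chap-ID"><p xml:id="p1">It was on a dreary night.</p></div>'))
      # <div sID="chap-ID"/><p sID="p1"/>It was on a dreary night.<p eID="p1"/><div eID="chap-ID"/>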

  2. Normalizing the text and processing the collation. Up to this point, we have been applying a Python script to read the flattened XML as text, and to introduce a series of about 25 normalizations via regular expression patterns that instruct collateX to (among other things):

    • Convert & into and

    • Ignore some angle-bracketed material that is not relevant to the collation: For example, convert <surface.+?/> and <zone.+?/> into "" (nothing) when comparing strings of the editions

    • Simplify to ignore the attribute values on Trojan milestone elements. For example, remove the @sID and @eID attributes by converting (<p)\s+.+?(/>) into $1$2 (or simply <p/>) so that all paragraph markers are read as the same.
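
    By way of illustration, the three replacements just listed can be sketched as Python regular-expression substitutions roughly like the following (a minimal sketch; the project's actual script holds roughly 25 such patterns and is excerpted later in this paper):

      import re

      def normalize_for_collation(text):
          # 1. an ampersand stands in for the word "and"
          text = re.sub(r"&", "and", text)
          # 2. surface and zone milestones carry no comparison value: drop them entirely
          text = re.sub(r"<surface.+?/>", "", text)
          text = re.sub(r"<zone.+?/>", "", text)
          # 3. strip @sID/@eID values so that all paragraph milestones compare as equal
          text = re.sub(r"(<p)\s+.+?(/>)", r"\1\2", text)
          return text

      print(normalize_for_collation('<surface n="1"/><p sID="novel1_letter4_chapter4_div4_div4_p1"/>light & dark'))
      # <p/>light and dark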

  3. Work with the output of the collation to prepare, in a series of stages, a very important file that we call our spine, which contains the collation data that organizes the variorum edition project. The format of the output is a prototype of a TEI critical apparatus, and in preliminary stages it is not quite valid against the TEI. Here is an example of a part of the spine in an early post-processing stage, featuring a passage in which Victor Frankenstein beholds his completed Creature in each of the five editions. Notice that the <rdg> elements contain mixed content: each holds a passage from the source edition that is, crucially, not normalized. This is our project’s particular approach to the parallel segmentation method of preparing a critical apparatus in TEI P5. Eventually the spine of the collation needs to contain data particular to each version in order to reconstruct that version so that it stores information about its variations from the other versions. In the not-quite-TEI stage of our post-collation process represented here, we can see a use of the Trojan milestone markers to indicate the starting points of paragraphs; their attribute values indicate their location in the original source edition's XML hierarchy.[7]

                                        
        <cx:apparatus xmlns:cx="http://interedition.eu/collatex/ns/1.0">
    	  <app>
    		    <rdgGrp n="['chapter']">
    			      <rdg wit="f1818"><milestone n="4" type="start" unit="chapter"/>
    			         <head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg>
    			      <rdg wit="f1823"><milestone n="4" type="start" unit="chapter"/>
    			         <head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg>
    			      <rdg wit="fThomas"><milestone n="4" type="start" unit="chapter"/>
    			         <head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg>
    			      <rdg wit="f1831"><milestone n="5" type="start" unit="chapter"/>
    			         <head sID="novel1_letter4_chapter5_div4_div5_head1"/>CHAPTER </rdg>
    			      <rdg wit="fMS"><lb n="c56-0045__main__1"/>
    			         <milestone spanTo="#c56-0045.04" unit="tei:head"/>Chapter </rdg>
    		    </rdgGrp>
    	  </app>
    	  <app>
    		    <rdgGrp n="['7<shi rend="sup"></shi><p/>']">
    			      <rdg wit="fMS">7<shi rend="sup"></shi>
    			         <milestone unit="tei:p"/><lb n="c56-0045__main__2"/> </rdg>
    		    </rdgGrp>
    		    <rdgGrp n="['iv.<p/>']">
    			      <rdg wit="f1818">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/>
    			         <p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg>
    			      <rdg wit="f1823">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/>
    			         <p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg>
    			      <rdg wit="fThomas">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/>
    			         <p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg>
    		    </rdgGrp>
    		    <rdgGrp n="['v.<p/>']">
    			      <rdg wit="f1831">V.<head eID="novel1_letter4_chapter5_div4_div5_head1"/>
    			         <p sID="novel1_letter4_chapter5_div4_div5_p1"/> </rdg>
    		    </rdgGrp>
    	  </app>
    	  <app>
    		    <rdgGrp n="['it']">
    			      <rdg wit="f1818">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T
    			         <hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg>
    			      <rdg wit="f1823">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T
    			         <hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg>
    			      <rdg wit="fThomas">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T
    			         <hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg>
    			      <rdg wit="f1831">I<hi sID="novel1_letter4_chapter5_div4_div5_p1_hi1"/>T
    			         <hi eID="novel1_letter4_chapter5_div4_div5_p1_hi1"/> </rdg>
    			      <rdg wit="fMS">It </rdg>
    		    </rdgGrp>
    	  </app>
    	  <app>
    		    <rdgGrp n="['was', 'on', 'a', 'dreary', 'night', 'of']">
    			      <rdg wit="f1818">was on a dreary night of </rdg>
    			      <rdg wit="f1823">was on a dreary night of </rdg>
    			      <rdg wit="fThomas">was on a dreary night of </rdg>
    			      <rdg wit="f1831">was on a dreary night of </rdg>
    			      <rdg wit="fMS">was on a dreary night of </rdg>
    		    </rdgGrp>
    	  </app>
    	  <app>
    		    <rdgGrp n="['november', '']">
    			      <rdg wit="fMS">November <lb n="c56-0045__main__3"/> </rdg>
    		    </rdgGrp>
    		    <rdgGrp n="['november,']">
    			      <rdg wit="f1818">November, </rdg>
    			      <rdg wit="f1823">November, </rdg>
    			      <rdg wit="fThomas">November, </rdg>
    			      <rdg wit="f1831">November, </rdg>
    		    </rdgGrp>
    	  </app>
    	  <app>
    		    <rdgGrp n="['that', 'i', 'beheld']">
    			      <rdg wit="f1818">that I beheld </rdg>
    			      <rdg wit="f1823">that I beheld </rdg>
    			      <rdg wit="fThomas">that I beheld </rdg>
    			      <rdg wit="f1831">that I beheld </rdg>
    			      <rdg wit="fMS">that I beheld </rdg>
    		    </rdgGrp>
    	  </app>
    	  <app>
    		    <rdgGrp n="['<del>the', 'frame', 'on', 'whic<del>',
    		    'my', 'man', 'completeed,.', '<del>and<del>']">
    			      <rdg wit="fMS"><del rend="strikethrough" sID="c56-0045__main__d2e9718"/>
    			         the frame on whic<del eID="c56-0045__main__d2e9718"/> my man 
    			         comple<mdel>at</mdel>teed,. 
    			         <del rend="strikethrough" sID="c56-0045__main__d2e9739"/>
    			         And<del eID="c56-0045__main__d2e9739"/><lb n="c56-0045__main__4"/> </rdg>
    		    </rdgGrp>
    		    <rdgGrp n="['the', 'accomplishment', 'of']">
    			      <rdg wit="f1818">the accomplishment of my toils. </rdg>
    			      <rdg wit="f1823">the accomplishment of my toils. </rdg>
    			      <rdg wit="fThomas">the accomplishment of my toils. </rdg>
    			      <rdg wit="f1831">the accomplishment of my toils. </rdg>
    		    </rdgGrp>
    	  </app>

  4. Apply this collation data in post-processing to prepare the edition files:

    • Raise the flattened editions in their original form (convert Trojan milestones into the original tree hierarchy).

    • With XSLT, pull information from the spine file, and insert <seg> elements in each of the distinct edition files to mark every point of variance captured in the collation. Each <seg> element is mapped to a specific location in the collation spine file. This mapping is done by recording in the value of the seg @xml:id a modified XPath expression for the particular <app> element in the collation spine. (The modification to the XPath is necessary so that the result is valid as an xml:id, which may not contain a solidus.) Here is an example of <seg> elements applied in the 1818 edition file. Here the @xml:id attributes are keyed to the specific <app> elements in the portion of the spine representing collation unit 10. Where the collation spine marks passages around chapter headings and paragraph boundaries, we apply a @part attribute to designate an initial and final portion within the structure of the edition XML file:

                                                   
               <div type="collation">
                  <milestone n="4" type="start" unit="chapter"/>
                  <head xml:id="novel1_letter4_chapter4_div4_div4_head1">
                      CHAPTER <seg part="I" xml:id="C10_app2-f1818__I">IV.</seg>
                  </head>
                  <p xml:id="novel1_letter4_chapter4_div4_div4_p1">
                     <seg part="F" xml:id="C10_app2-f1818__F">
                     </seg>I<hi xml:id="novel1_letter4_chapter4_div4_div4_p1_hi1">T</hi>
                        was on a dreary night of <seg xml:id="C10_app5-f1818">November, </seg>
                        that I beheld <seg xml:id="C10_app7-f1818">the accomplishment of my toils. </seg>
                        With an anxiety that almost amounted to <seg xml:id="C10_app9-f1818">agony, </seg>
                        I collected <seg xml:id="C10_app11-f1818">the </seg>instruments of life around 
                        <seg xml:id="C10_app14-f1818">me, that </seg>I 
                        <seg xml:id="C10_app16-f1818">might infuse </seg>a spark of being 
                        into the lifeless thing that lay at my feet. . . .
                  </p> 
                  . . . 
               </div>

    • Convert the text contents of the spine critical apparatus into stand-off pointers to the <seg> elements in the new edition files. Here is a passage from the converted standoff spine featuring only the normalized tokens for each variant passage, and containing pointers to the locations of <seg> elements in each edition XML file corresponding with a reading witness:

                                                   
                  <app xml:id="C10_app7" n="48">
                     <rdgGrp xml:id="C10_app7_rg1"
                             n="['<del>the', 'frame', 'on', 'whic<del>', 'my', 'man', 'completeed,.', '<del>and<del>']">
                        <rdg wit="#fMS">
                           <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/fMS_C10.xml#string-range(//tei:surface[@xml:id='ox-ms_abinger_c56-0045']/tei:zone[@type='main']//tei:line[3],15,59)"/>
                        </rdg>
                     </rdgGrp>
                     <rdgGrp xml:id="C10_app7_rg2" n="['the', 'accomplishment', 'of']">
                        <rdg wit="#f1818">
                           <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1818_C10.xml#C10_app7-f1818"/>
                        </rdg>
                        <rdg wit="#f1823">
                           <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1823_C10.xml#C10_app7-f1823"/>
                        </rdg>
                        <rdg wit="#fThomas">
                           <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/fThomas_C10.xml#C10_app7-fThomas"/>
                        </rdg>
                        <rdg wit="#f1831">
                           <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1831_C10.xml#C10_app7-f1831"/>
                        </rdg>
                     </rdgGrp>
                  </app>

    • Calculate special links and pointers to the original XML files of the Shelley-Godwin Archive's manuscript edition of Frankenstein, adjusting for the resequencing prior to collation.

  5. Publish the XML via a web framework (optimally CETEIcean, but for now using Node.js and React). Share the edition text files and XML data from our GitHub repo.[8]

    FV-webView: Web view of a variant passage in the Frankenstein Variorum reading interface

    A snapshot of the Frankenstein Variorum web interface showing the 1816 manuscript notebook edition at the moment when Victor Frankenstein looks at the form of his complete Creature, highlighting the variant passage featured in the preceding code blocks. The segmented passages highlight the variants with deepening shades based on Levenshtein distance measurements, divided into 3 ranges (very light for Levenshtein distances of a few characters, mid-range, and more intense for distances above 20).
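
    The shading logic can be paraphrased in a short Python sketch. Only the use of Levenshtein distance and the division into three intensity bands comes from the description above; the lower threshold and the class names below are assumptions for illustration:

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def shade(reading, other_reading, low=5, high=20):
        # hypothetical thresholds: "a few characters", a middle band, and distances above 20
        d = levenshtein(reading, other_reading)
        if d <= low:
            return "variance-light"
        if d <= high:
            return "variance-mid"
        return "variance-deep"

    print(shade("November, ", "November "))   # a one-character difference: light shading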

Snags in the Weaving

A Schematron flashlight

For an unhappy time two years ago, pressed by a deadline to launch a proof-of-concept partial version of the Variorum interface, I hand-corrected the outputs of collation assisted by a Schematron file that would help me to identify common problems. The Schematron was like a flashlight guiding me through a forest of code in efforts to revise the contents of <rdg> elements and refine the organization of <app> elements and their constituent <rdgGrp> elements. The sheer volume of corrections was exhausting, and a short novel like Frankenstein could seem infinitely long and tortuously slow to read while attempting to make spine adjustments. I knew this method was not sufficient for our needs. This Schematron file was merely my first attempt to try to address the problem systematically:

                        
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="app">
            <sch:report test="not(rdgGrp)" role="error">Empty app element--missing rdgGrps! That's an error introduced from editing the collation.</sch:report>
     <sch:report test="contains(descendant::rdg[@wit='fThomas'], '<del')" role="info">
         Here is a place where the Thomas text contains a deleted passage. Is it completely encompassed in the app?</sch:report>
            <sch:assert test="count(descendant::rdg/@wit) = count(distinct-values(descendant::rdg/@wit))" role="error">
                A repeated rdg witness is present! There's an error here introduced by editing the collation.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:pattern>
        <sch:rule context="app[count(rdgGrp) eq 1][count(descendant::rdg) eq 1]">
            <sch:report test="count(preceding-sibling::app[1]/rdgGrp) eq 1 or count(following-sibling::app[1]/rdgGrp) eq 1 or last()" role="warning">
                Here is a "singleton" app that may be best merged in with the preceding or following "unison" app as part of a new rdgGrp. 
            </sch:report>
        </sch:rule>
    </sch:pattern>
    <sch:pattern>
        <sch:let name="delString" value="'<del'"/>
    <sch:rule context="rdg[@wit='fThomas']" role="error">
        <sch:let name="textTokens" value="tokenize(text(), ' ')"/>
        <sch:let name="delMatch" value="for $t in $textTokens return $t[contains(., $delString)]"/>
        <sch:assert test="count($delMatch) mod 2 eq 0">
            Unfinished deletion in the Thomas witness. We count <sch:value-of select="count($delMatch)"/> deletion matches. 
            Make sure the Thomas witness deletion is completely encompassed in the app.</sch:assert>
    </sch:rule>
    </sch:pattern>
    <sch:pattern>
        <sch:rule context="rdgGrp[ancestor::rdgGrp]">
            <sch:report test=".">A reading group must NOT be nested inside another reading group!</sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
When reviewing and editing collation spine files, I would apply this schema to guide my work and stop me from making terrible mistakes in hand-correcting common collation output problems, such as, for example, pasting a <rdgGrp> element inside another <rdgGrp> (permissible and useful in some TEI projects, but an outright mistake in our very simple spine). Eye fatigue is a serious problem in just about every stage of the Gothenburg model, and this Schematron did intervene to prevent clumsy mistakes. More importantly for the long-range work of the project, this simple Schematron file helped me to begin documenting and describing repeating patterns of error in the collation, most significantly here the singleton <app>, containing only one witness, and the incomplete witness reading that I wanted to encompass a complete act of deletion.

XSLT for alignment correction

Revisiting the collation output in fall 2021 with two student research assistants, we began identifying patterns that XSLT could correct. The output from collateX would frequently create a problem we called loner dels: that is, failing to align text containing flattened <del> markup with corresponding witnesses, despite the presence of related content in the other witnesses. collateX would sometimes place these in an <app> with a single <rdgGrp> by themselves. Over a few Friday afternoon research team meetings in November we succeeded in correcting this while my students learned about tunneling parameters in XSLT. Our stylesheet is short but represents a step forward in documentation as well as outright repair of collation alignment:

                     
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns:cx="http://interedition.eu/collatex/ns/1.0"
    
    exclude-result-prefixes="xs math"
    version="3.0">
    <!--2021-09-24 ebb with wdjacca and amoebabyte: We are writing XSLT to try to move
    solitary apps reliably into their neighboring app elements representing all witnesses. 
    
    -->
  <xsl:mode on-no-match="shallow-copy"/>
  
<!-- ********************************************************************************************
        LONER DELS: These templates deal with collateX output of app elements 
        containing a solitary MS witness containing a deletion, which we interpret as usually a false start, 
        before a passage.
     *********************************************************************************************
    -->  
    <xsl:template match="app[count(descendant::rdg) = 1][contains(descendant::rdg, '<del')]">
  
        <xsl:if test="following-sibling::app[1][count(descendant::rdgGrp) = 1 and count(descendant::rdg) gt 1]">
               <xsl:apply-templates select="following-sibling::app[1]" mode="restructure">
                  <xsl:with-param as="node()" name="loner" select="descendant::rdg" tunnel="yes" />
                   <xsl:with-param as="attribute()" name="norm" select="rdgGrp/@n" tunnel="yes"/>
               </xsl:apply-templates>
               
           </xsl:if>
    </xsl:template>
    
    
    <xsl:template match="app[preceding-sibling::app[1][count(descendant::rdg) = 1][contains(descendant::rdg, '<del')]]"/>


    <xsl:template match="app" mode="restructure">
        <xsl:param name="loner" tunnel="yes"/>
        <xsl:param name="norm" tunnel="yes"/>
        <app>
        <xsl:apply-templates select="rdgGrp" mode="restructure">
                <xsl:with-param  as="node()" name="loner" tunnel="yes" select="$loner"/>
            </xsl:apply-templates>
            <xsl:variable name="TokenSquished">
                <xsl:value-of select="$norm ! string()||descendant::rdgGrp[descendant::rdg[@wit=$loner/@wit]]/@n"/>
            </xsl:variable>
            <xsl:variable name="newToken">
                <xsl:value-of select="replace($TokenSquished, '\]\[', ', ')"/>
            </xsl:variable>
           <rdgGrp n="{$newToken}">
              <rdg wit="{$loner/@wit}"><xsl:value-of select="$loner/text()"/>
              <xsl:value-of select="descendant::rdg[@wit = $loner/@wit]"/>
              </rdg>
               
           </rdgGrp> 
        </app> 
    </xsl:template>
    
    <xsl:template match="rdgGrp" mode="restructure">
        <xsl:param name="loner" tunnel="yes"/>

           <xsl:if test="rdg[@wit ne $loner/@wit]">
            <xsl:copy-of select="current()" />
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Next my little team and I began to tackle the noxious problem of empty tokens, that is, spurious <app> elements that would form around a single witness, the Shelley-Godwin Archive's manuscript witness. Most frequently, these isolated <app> elements would appear with an empty token, "", formed to represent the normalized form of a <lb/> element. The empty token itself is quite correct for our collation normalization process: The Shelley-Godwin Archive editors applied line-by-line encoding of the manuscript notebook, and we need to preserve the information about each line in order to help ensure that each passage is connected back to its source edition file. So in our project, we must not ignore this markup, but we do normalize it for the purposes of the collation since, in our document data model, it is not meaningful information about how the manuscript notebooks compare with the printed editions. In pre-processing the source files for collation, we converted the Shelley-Godwin Archive’s <line> elements into empty milestone markers as <lb/> marking the beginnings of each line, and these <lb> elements with their informative attributes are output as contents of <rdg wit="fMS"> in the spine file. Our normalizing algorithm converts them into empty tokens to indicate that they are meaningless for the purposes of collation, but they are nevertheless meaningful to the reconstruction of the 1816 manuscript edition information following collation.

In its most frequently occurring form, the empty-token problem would appear in separate isolated <app> elements in between two unison apps (that is, interrupting a passage in which all witnesses should align in unison). Here is an example of the pattern. Those reading vigilantly will note that the expected octothorpe or sharp # characters indicating that this is a pointer to a defined @xml:id do not appear in the @wit attribute values in this example. These are simply lacking at their point of output from collateX, before it is reprocessed in the post-production pipeline as a TEI document.

                        
	<app>
		<rdgGrp n="['as', 'i', 'did', 'not', 'appear', 'to', 'know']">
			<rdg wit="f1818">as I did not appear to know </rdg>
			<rdg wit="f1823">as I did not appear to know </rdg>
			<rdg wit="fThomas">as I did not appear to know </rdg>
			<rdg wit="f1831">as I did not appear to know </rdg>
			<rdg wit="fMS">as I did not appear to know </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['']">
			<rdg wit="fMS"><lb n="c57-0119__main__14"/> </rdg>
		</rdgGrp>
	</app>
	<app>
		<rdgGrp n="['the']">
			<rdg wit="f1818">the </rdg>
			<rdg wit="f1823">the </rdg>
			<rdg wit="fThomas">the </rdg>
			<rdg wit="f1831">the </rdg>
			<rdg wit="fMS">the </rdg>
		</rdgGrp>
	</app>
The empty token problem is evident here in the normalized form of an <lb/> element with all its attributes, being read properly as an array of one item, a null string, but somehow, despite being nullified, asserting its presence to interrupt a unison passage. Ideally there would be only one <app> element here, which should appear like this:
                        
	<app>
		<rdgGrp n="['as', 'i', 'did', 'not', 'appear', 'to', 'know', 'the']">
			<rdg wit="f1818">as I did not appear to know the </rdg>
			<rdg wit="f1823">as I did not appear to know the </rdg>
			<rdg wit="fThomas">as I did not appear to know the </rdg>
			<rdg wit="f1831">as I did not appear to know the </rdg>
			<rdg wit="fMS">as I did not appear to know 
			<rdg wit="fMS"><lb n="c57-0119__main__14"/> the </rdg>
		</rdgGrp>
	</app>

FV-whiteboard: Frankenstein collation notes on my office whiteboard

My student Jacqueline Chan, Penn State Behrend May 2022 Digital Media, Arts, and Technology graduate, carefully drafted these notes on the empty token collation problem and our efforts to solve it on my office whiteboard during a meeting with my student research assistants as we inspected collation output code together.

We did not complete the XSLT to resolve this particular problem because it is not outright disruptive to our collation pipeline. That is, the output collation shows two passages rather than one in which all witnesses are properly aligned. These problems would vanish if we instructed collateX to ignore the <lb/> elements entirely, but we cannot do that. Our output collation files need to preserve locational information stored in the @n attributes, indicating the position of a passage in a specific zone of the source TEI encoding of the manuscript notebook surface. The @n attributes store information needed to reconstruct an XPath to point to a specific location in the S-GA edition. We decided ultimately to live with this collation error for the purposes of the edition production pipeline.

The smashed-tokens problem

We discovered a much more pernicious problem, however, with our method of tokenizing and normalizing passages that contain angle-bracketed markup. This problem generates what we call smashed tokens, that is, two tokens losing their space separator and becoming one token. Here is a very short passage of XML from the Shelley-Godwin Archive's encoding:

                        
                 . . . the spot<add eID="c57-0117__main__d3e21951"/> & endeavoured . . .
In this passage, we will be normalizing to ignore the <add> element. Notice the spaces separating the element tag from the ampersand. Now, here is how collateX interprets the passage in its collation alignment:
                        
	<app>
		<rdgGrp n="['spot,', 'and', 'endeavoured,']">
			<rdg wit="f1818">spot, and endeavoured, </rdg>
			<rdg wit="f1823">spot, and endeavoured, </rdg>
			<rdg wit="fThomas">spot, and endeavoured, </rdg>
			<rdg wit="f1831">spot, and endeavoured, </rdg>
		</rdgGrp>
		<rdgGrp n="['spotand', 'endeavoured']">
			<rdg wit="fMS">spot& endeavoured </rdg>
		</rdgGrp>
	</app>
Somewhere in the normalizing process, the two completely separate tokens spot and and were spliced together into one token. We are not certain whether this problem lies more with Python's tokenization or with collateX's alignment of normalized empty strings representing markup irrelevant to the collation. It is as if the normalized markup refuses quite to disappear, haunting the collation machinery with a ghostly persistent trace. We are also not certain at the date of this writing whether choosing a different alignment algorithm in collateX might resolve the problem, but I think not, since the error seems to lie in the interpretation of spaces following normalization. That is, I believe the problem could be in the Python script that handles the normalization.
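
The symptom can be reproduced in a few lines. The following sketch is a guess at the mechanism rather than a diagnosis of our actual script: if the substitution that removes the <add/> tag also consumes the whitespace around it, the subsequent ampersand replacement welds the neighboring words together exactly as in the output above:

import re

text = "the spot<add eID='c57-0117__main__d3e21951'/> & endeavoured"

# hypothetical slip: the tag is removed together with its surrounding whitespace ...
step1 = re.sub(r"\s*<add[^<]*/>\s*", "", text)   # "the spot& endeavoured"
# ... so that the ampersand normalization then fuses the two neighboring tokens
step2 = re.sub(r"&", "and", step1)               # "the spotand endeavoured"

print(step1.split())   # ['the', 'spot&', 'endeavoured']
print(step2.split())   # ['the', 'spotand', 'endeavoured']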

At the date of this writing, we have not completed these XSLT post-processing tasks, in part because we all became busy with senior projects (for them, completing their projects, and for me, assisting with them). I have also paused a moment before continuing down this path because the more patterns we discovered to correct, the more I wanted to try another XSLT approach. If our normalization and alignment were so disjointed as to require so much XSLT post-processing, what if XSLT might simply be better at the task of normalizing markup and tokenizing and aligning around it? That is, would it be less brittle to use XSLT for all of our processing, including the collation itself?

XSLT for the win?

At the August 2021 Balisage conference, Joel Kalvesmaki introduced tan:diff and tan:collate. As I listened to the paper, I was struck by how clear, simple, and precise Kalvesmaki made the process of string comparison seem, using XPath and XSLT:

tan:diff() . . . extracts a series of progressively smaller samples from the shorter of the two texts and looks for a match in the longer one on the basis of fn:contains(). When a match is found, the results are separated by fn:substring-before() and fn:substring-after()[9]

The application of XPath functions described here made the alignment process seem more familiar and accessible, amenable to the kinds of modifications I regularly make with my own XSLT. I wondered whether starting with attempts to match longer samples and moving to match shorter ones might generate fewer aligned clusters and reduce the spurious particulation I had been seeing in my Python-assisted process with collateX. So I have lately begun experimenting with tan:diff() and tan:collate().
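
To internalize the description, it helps me to paraphrase the idea in a few lines of Python. This is only a rough analogue of the published description, not TAN's implementation, and it ignores TAN's staged sample sizes and other optimizations:

def simple_diff(a, b):
    """Split two strings into ('=', common), ('a', only-in-a), ('b', only-in-b) segments
    by hunting for progressively smaller samples of the shorter text in the longer one."""
    if not a or not b:
        return ([("a", a)] if a else []) + ([("b", b)] if b else [])
    shorter, longer, flipped = (a, b, False) if len(a) <= len(b) else (b, a, True)
    for size in range(len(shorter), 0, -1):               # progressively smaller samples
        for start in range(len(shorter) - size + 1):
            sample = shorter[start:start + size]
            if sample in longer:                           # the fn:contains() moment
                s_pre, s_post = shorter[:start], shorter[start + size:]
                l_pre, l_post = longer.split(sample, 1)    # substring-before / substring-after
                a_pre, b_pre = (l_pre, s_pre) if flipped else (s_pre, l_pre)
                a_post, b_post = (l_post, s_post) if flipped else (s_post, l_post)
                return (simple_diff(a_pre, b_pre) + [("=", sample)] + simple_diff(a_post, b_post))
    return [("a", a), ("b", b)]

print(simple_diff("the spot, and endeavoured,", "the spot and endeavoured"))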

At the time of this writing, I have begun testing a collation unit from the Frankenstein Variorum, the same unit featured above that produced the smashed-tokens problem in the previous section. Currently I am applying the software following Kalvesmaki’s detailed comments within his sample Diff XSLT files, which guide the user to generate TAN Diff's native XML and HTML output. The XML output is like, yet unlike, the TEI critical apparatus structure we have been using in the Frankenstein Variorum project. The markup is simply organized in <c> elements when all witnesses share a normalized string, and <u> elements when they vary from each other. There is nothing like the <app> element to bundle together related groups. But this is the realm of XSLT, and the developers encourage their users to adapt the code to their needs. Alas, however, at the time of this writing, it appears there may be no straightforward way for me, without assistance from the developer, to output the original text in the format we need for the spine of the Frankenstein Variorum project. This remains an area for future development.

Nevertheless, I have been applying and updating my 25 normalizing replacement patterns via XSLT and find this much more legible than my handling of the same in my Python script (which sometimes contained duplicate passages). XSLT, by nature of being written in XML, is XPathable, and with a large quantity of replacement sequences, I am able to check and sequence the process carefully in a way that is easier for me to document legibly.

                     
                <!-- Should punctuation be ignored? -->
    <xsl:param name="tan:ignore-punctuation-differences" as="xs:boolean" select="false()"/>
    
    <xsl:param name="additional-batch-replacements" as="element()*">
        <!--ebb: normalizations to batch process for collation. NOTE: We want to do these to preserve some markup \\
            in the output for post-processing to reconstruct the edition files. 
            Remember, these will be processed in order, so watch out for conflicts. -->
        <replace pattern="(<.+?>\s*)&gt;" replacement="$1" message="normalizing away extra right angle brackets"/>
        <replace pattern="&amp;" replacement="and" message="ampersand batch replacement"/>
        <replace pattern="</?xml>" replacement="" message="xml tag replacement"/>
        <replace pattern="(<p)\s+.+?(/>)" replacement="$1$2" message="p-tag batch replacement"/>
        <replace pattern="(<)(metamark).*?(>).+?\1/\2\3" replacement="" message="metamark batch replacement"/>
        <!--ebb: metamark contains a text node, and we don't want its contents processed in the collation, so 
        this captures the entire element. -->
        <replace pattern="(</?)m(del).*?(>)" replacement="$1$2$3" message="mdel-SGA batch replacement"/>  
        <!--ebb: mdel contains a text node, so this catches both start and end tag.
        We want mdel to be processed as <del>...</del>-->
        <replace pattern="</?damage.*?>" replacement="" message="damage-SGA batch replacement"/> 
        <!--ebb: damage contains a text node, so this catches both start and end tag. -->
        <replace pattern="</?unclear.*?>" replacement="" message="unclear-SGA batch replacement"/> 
        <!--ebb: unclear contains a text node, so this catches both start and end tag. -->
        <replace pattern="</?retrace.*?>" replacement="" message="retrace-SGA batch replacement"/> 
        <!--ebb: retrace contains a text node, so this catches both start and end tag. -->
        <replace pattern="</?shi.*?>" replacement="" message="shi-SGA batch replacement"/> 
        <!--ebb: shi (superscript/subscript) contains a text node, so this catches both start and end tag. -->
        <replace pattern="(<del)\s+.+?(/>)" replacement="$1$2" message="del-tag batch replacement"/>
        <replace pattern="<hi.+?/>" replacement="" message="hi batch replacement"/>
        <replace pattern="<pb.+?/>" replacement="" message="pb batch replacement"/>
        <replace pattern="<add.+?>" replacement="" message="add batch replacement"/>
        <replace pattern="<w.+?/>" replacement="" message="w-SGA batch replacement"/>
        <replace pattern="(<del)Span.+?spanTo="#(.+?)".*?(/>)(.+?)<anchor.+?xml:id="\2".*?>" 
            replacement="$1$3$4$1$3" message="delSpan-to-anchor-SGA batch replacement"/>
        <replace pattern="<anchor.+?/>" replacement="" message="anchor-SGA batch replacement"/>  
        <replace pattern="<milestone.+?unit="tei:p".+?/>" replacement="<p/> <p/>" 
            message="milestone-paragraph-SGA batch replacement"/>  
        <replace pattern="<milestone.+?/>" replacement="" message="milestone non-p batch replacement"/>  
        <replace pattern="<lb.+?/>" replacement="" message="lb batch replacement"/>  
        <replace pattern="<surface.+?/>" replacement="" message="surface-SGA batch replacement"/> 
        <replace pattern="<zone.+?/>" replacement="" message="zone-SGA batch replacement"/> 
        <replace pattern="<mod.+?/>" replacement="" message="mod-SGA batch replacement"/> 
        <replace pattern="<restore.+?/>" replacement="" message="restore-SGA batch replacement"/> 
        <replace pattern="<graphic.+?/>" replacement="" message="graphic-SGA batch replacement"/> 
        <replace pattern="<head.+?/>" replacement="" message="head batch replacement"/>
The documentation-centered nature of the TAN stylesheets makes the XSLT serve as documentation of decisions and makes it easy to amend and resequence the process. Contrast this with the brittle complexity of my Python normalization function and its dependencies:
                     
regexWhitespace = re.compile(r'\s+')
regexNonWhitespace = re.compile(r'\S+')
regexEmptyTag = re.compile(r'/>$')
regexBlankLine = re.compile(r'\n{2,}')
regexLeadingBlankLine = re.compile(r'^\n')
regexPageBreak = re.compile(r'<pb.+?/>')
RE_MARKUP = re.compile(r'<.+?>')
RE_PARA = re.compile(r'<p\s[^<]+?/>')
RE_INCLUDE = re.compile(r'<include[^<]*/>')
RE_MILESTONE = re.compile(r'<milestone[^<]*/>')
RE_HEAD = re.compile(r'<head[^<]*/>')
RE_AB = re.compile(r'<ab[^<]*/>')
# 2018-10-1 ebb: ampersands are apparently not treated in python regex as entities any more than angle brackets.
# RE_AMP_NSB = re.compile(r'\S&\s')
# RE_AMP_NSE = re.compile(r'\s&\S')
# RE_AMP_SQUISH = re.compile(r'\S&\S')
# RE_AMP = re.compile(r'\s&\s')
RE_AMP = re.compile(r'&')
# RE_MULTICAPS = re.compile(r'(?<=\W|\s|\>)[A-Z][A-Z]+[A-Z]*\s')
# RE_INNERCAPS = re.compile(r'(?<=hi\d"/>)[A-Z]+[A-Z]+[A-Z]+[A-Z]*')
# TITLE_MultiCaps = match(RE_MULTICAPS).lower()
RE_DELSTART = re.compile(r'<del[^<]*>')
RE_ADDSTART = re.compile(r'<add[^<]*>')
RE_MDEL = re.compile(r'<mdel[^<]*>.+?</mdel>')
RE_SHI = re.compile(r'<shi[^<]*>.+?</shi>')
RE_METAMARK = re.compile(r'<metamark[^<]*>.+?</metamark>')
RE_HI = re.compile(r'<hi\s[^<]*/>')
RE_PB = re.compile(r'<pb[^<]*/>')
RE_LB = re.compile(r'<lb.*?/>')
# 2021-09-06: ebb and djb: On <lb> collation troubles: LOOK FOR DOT MATCHES ALL FLAG
# b/c this is likely spanning multiple lines, and getting split by the tokenizing algorithm.
# 2021-09-10: ebb with mb and jc: trying .*? and DOTALL flag
RE_LG = re.compile(r'<lg[^<]*/>')
RE_L = re.compile(r'<l\s[^<]*/>')
RE_CIT = re.compile(r'<cit\s[^<]*/>')
RE_QUOTE = re.compile(r'<quote\s[^<]*/>')
RE_OPENQT = re.compile(r'“')
RE_CLOSEQT = re.compile(r'”')
RE_GAP = re.compile(r'<gap\s[^<]*/>')
# <milestone unit="tei:p"/>
RE_sgaP = re.compile(r'<milestone\sunit="tei:p"[^<]*/>')

. . .

def normalize(inputText):
# 2018-09-23 ebb THIS WORKS, SOMETIMES, BUT NOT EVERWHERE: RE_MULTICAPS.sub(format(re.findall(RE_MULTICAPS, inputText, flags=0)).title(), \
# RE_INNERCAPS.sub(format(re.findall(RE_INNERCAPS, inputText, flags=0)).lower(), \
    return RE_MILESTONE.sub('', \
        RE_INCLUDE.sub('', \
        RE_AB.sub('', \
        RE_HEAD.sub('', \
        RE_AMP.sub('and', \
        RE_MDEL.sub('', \
        RE_SHI.sub('', \
        RE_HI.sub('', \
        RE_LB.sub('', \
        RE_PB.sub('', \
        RE_PARA.sub('<p/>', \
        RE_sgaP.sub('<p/>', \
        RE_LG.sub('<lg/>', \
        RE_L.sub('<l/>', \
        RE_CIT.sub('', \
        RE_QUOTE.sub('', \
        RE_OPENQT.sub('"', \
        RE_CLOSEQT.sub('"', \
        RE_GAP.sub('', \
        RE_DELSTART.sub('<del>', \
        RE_ADDSTART.sub('<add>', \
        RE_METAMARK.sub('', inputText)))))))))))))))))))))).lower()
I came to realize that in the more writerly environment of the XSLT stylesheets for TAN Diff, it is easier to review the code, which would be helpful to me when setting it down and returning to it after some months or years.

More to the point of this experiment with a new alignment method, does it succeed in reducing the tokenization and normalization problems discussed in the previous section? So far, yes, often with identical correct alignments and usually with longer matches. For example, on the passage that illustrated the smashed-tokens problem, here is tan:collate()'s output:

                     
      <u>
         <txt>spot,</txt>
         <wit ref="2-1818_fullFlat_C27" pos="1423"/>
         <wit ref="4-Thomas_fullFlat_C27" pos="1423"/>
         <wit ref="3-1823_fullFlat_C27" pos="1421"/>
         <wit ref="5-1831_fullFlat_C27" pos="1422"/>
      </u>
      <u>
         <txt>spot</txt>
         <wit ref="1-msColl_C27" pos="317"/>
      </u>
      <c>
         <txt> and </txt>
         <wit ref="2-1818_fullFlat_C27" pos="1428"/>
         <wit ref="4-Thomas_fullFlat_C27" pos="1428"/>
         <wit ref="3-1823_fullFlat_C27" pos="1426"/>
         <wit ref="5-1831_fullFlat_C27" pos="1427"/>
         <wit ref="1-msColl_C27" pos="321"/>
      </c>
      <u>
         <txt>endeavoured,</txt>
         <wit ref="2-1818_fullFlat_C27" pos="1433"/>
         <wit ref="4-Thomas_fullFlat_C27" pos="1433"/>
         <wit ref="3-1823_fullFlat_C27" pos="1431"/>
         <wit ref="5-1831_fullFlat_C27" pos="1432"/>
      </u>
      <u>
         <txt>endeavoured </txt>
         <wit ref="1-msColl_C27" pos="326"/>
      </u>
I have found no problems with empty tokens created from markup, and overall the number of aligned groups is much smaller. Here is a summary of total aligned reading groups and unison aligned reading groups for the same collation unit as processed by Python-mediated CollateX vs. Tan Diff XSLT:

Table I

Alignments generated by CollateX vs. Tan Diff

Alignment                                      CollateX    Tan Diff/Collate
Unison Alignments (all witnesses the same)     1198        515
All Alignments (unison and divergent)          2315        1932
In general, the alignments were longer in tan:diff() (as expected), and the tokenization and normalization outcomes appear so far to be reasonably reliable, though this is difficult to tell without being able to see the original passages in the TAN collation XML output. When this functionality is achieved, it would make sound sense to take a moment and adapt tan:collate() to the pipeline of the Frankenstein Variorum project, since it appears even out of the box to produce fewer alignment problems. At the time of this writing, however, it is difficult to navigate the TAN library of included XSLT files and to fully understand how to intervene in the collation output process.
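
For the record, counts like those in the table can be tallied directly from the two output formats with short XPath expressions; here is a sketch of the sort of tally I mean, with hypothetical file names, using Python's lxml (a unison <app> here means a single <rdgGrp> holding all five witnesses):

from lxml import etree

# CollateX-style spine: <app> groups readings into <rdgGrp> elements
spine = etree.parse("C27_collatex_spine.xml")        # hypothetical file name
apps = spine.xpath("//app")
unison_apps = [a for a in apps
               if len(a.findall("rdgGrp")) == 1 and len(a.findall(".//rdg")) == 5]

# TAN collate output: <c> marks agreement of all witnesses, <u> marks divergence
tan = etree.parse("C27_tan_collate.xml")             # hypothetical file name
unison_tan = int(tan.xpath("count(//*[local-name()='c'])"))
all_tan = int(tan.xpath("count(//*[local-name()='c' or local-name()='u'])"))

print(len(unison_apps), len(apps), unison_tan, all_tan)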

Just because it is written in XSLT and is well-documented for public usability does not mean an XSLT library is any more or less adaptable to intensive customization than a library made of other forms of code. As I write this concluding paragraph in mid-July 2022, my student assistant Yuying Jin and I have made a significant breakthrough in modifying our pre-processing Python script for collateX, and we have succeeded in handling the problem of spliced tokens, though we have now multiplied the number of empty null-string tokens. Since it now appears easier to continue refining our Python script than to intervene in the labyrinthine XSLT library of functions, we will proceed in this direction for now, and look forward to future developments to improve the customization potential of the TAN library. The XSLT process promises to make the recursive pre-processing cycles of the Gothenburg model more amenable to clear documentation, and the effort to investigate two pre-processing methods, via Python and via XSLT, has proven fruitful in resolving many problems in our original process. The TAN library holds great promise for handling collations in XSLT, and while customizing its output proves for this moment no easy matter for an outsider to TAN, it seems a clear next step for this impressive XSLT library.



[1] Interedition Development Group, The Gothenburg Model, 2010-2019. https://collatex.net/doc/. For more on the summit and workshop of collation software developers in 2009 that formulated the Gothenburg model, see Ronald Haentjens Dekker, Dirk van Hulle, Gregor Middell, Vincent Neyt, and Joris van Zundert. Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project. Digital Scholarship in the Humanities 30:3 (December 2014) pp. 3-4. doi:https://doi.org/10.1093/llc/fqu007.

[2] Birnbaum, David J. and Elena Spadini. Reassessing the locus of normalization in machine-assisted collation. DHQ: Digital Humanities Quarterly 14:3 (2020). http://www.digitalhumanities.org/dhq/vol/14/3/000489/000489.html.

[3] John Bradley. Text Tools in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, and John Unsworth (Wiley, 2004), pages 505-522.

[4] Beshero-Bondar, Elisa Eileen. Rebuilding a Digital Frankenstein by 2018: Reflections toward a Theory of Losses and Gains in Up-Translation. Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Beshero-Bondar01.

[5] See Beshero-Bondar, Elisa E., and Raffaele Viglianti. Stand-off Bridges in the Frankenstein Variorum Project: Interchange and Interoperability within TEI Markup Ecosystems. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Beshero-Bondar01. I am also grateful to Michael Sperberg-McQueen and David Birnbaum for their guidance of my efforts to raise flattened markup into hierarchical markup leading to a Balisage paper evaluating multiple methods for attempting this. See Birnbaum, David J., Elisa E. Beshero-Bondar and C. M. Sperberg-McQueen. Flattening and unflattening XML markup: a Zen garden of XSLT and other tools. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Birnbaum01.

[6] On Trojan markup, see Steve DeRose, Markup Overlap: A Review and a Horse, Extreme Markup Languages 2004. https://www.academia.edu/42096176/Markup_Overlap_A_Review_and_a_Horse; and Sperberg-McQueen, C. M. Representing concurrent document structures using Trojan Horse markup. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01.

[7] On the parallel segmentation method for handling a critical apparatus in TEI P5, see the TEI Guidelines: TEI Consortium, eds. 4.3.2 Floating Texts. TEI P5: Guidelines for Electronic Text Encoding and Interchange. 4.4.0. April 19, 2022. TEI Consortium. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html#DSFLT (2022-07-19).

[8] Our production pipeline unfolds in a series of GitHub repos at https://github.com/FrankensteinVariorum. The repo storing and documenting the post-collation XSLT files, not the focus of discussion for this paper but likely of interest to our readers, is https://github.com/FrankensteinVariorum/fv-postCollation.

[9] Kalvesmaki, Joel. String Comparison in XSLT with tan:diff(). Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). doi:https://doi.org/10.4242/BalisageVol26.Kalvesmaki01.

Author's keywords for this paper:
collation; tokenization; normalization; alignment; Gothenburg model; Python; XSLT; stand-off markup; stand-off pointers