How to cite this paper

Bleeker, Elli, Bram Buitendijk and Ronald Haentjens Dekker. “Marking up microrevisions with major implications: Non-linear text in TAG.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Bleeker01.

Balisage: The Markup Conference 2020
July 27 - 31, 2020

Balisage Paper: Marking up microrevisions with major implications: Non-linear text in TAG

Elli Bleeker

Researcher, Research and Development

Research and Development Group, Netherlands Academy for Arts and Sciences

Elli Bleeker is a postdoctoral researcher in the Research and Development Team at the Humanities Cluster, part of the Royal Netherlands Academy of Arts and Sciences. She specializes in digital scholarly editing and computational philology, with a focus on modern manuscripts and genetic criticism. Elli completed her PhD at the Centre for Manuscript Genetics (2017) on the role of the scholarly editor in the digital environment. As a Research Fellow in the Marie Sklodowska-Curie funded network DiXiT (2013–2017), she received advanced training in manuscript studies, text modeling, and XML technologies for text modeling.

Bram Buitendijk

Software Developer, Research and Development

Research and Development Group, Netherlands Academy for Arts and Sciences

Bram Buitendijk is a software developer in the Research and Development team at the Humanities Cluster, part of the Royal Netherlands Academy of Arts and Sciences. He has worked on transcription and annotation software, collation software, and repository software.

Ronald Haentjens Dekker

Head of Research and Development and Software Architect

Research and Development Group, Netherlands Academy for Arts and Sciences

Ronald Haentjens Dekker is a software architect and lead engineer of the Research and Development Team at the Humanities Cluster, part of the Royal Netherlands Academy of Arts and Sciences. As a software architect, he is responsible for translating research questions into technology or algorithms and explaining to researchers and management how specific technologies will influence their research. He has worked on transcription and annotation software, collation software, and repository software, and he is the lead developer of the CollateX collation tool. He also conducts workshops to teach researchers how to use scripting languages in combination with digital editions to enhance their research.

Copyright ©2020 by the authors. Used with permission.

Abstract

The article discusses how micro-level textual variation can be expressed in an idiomatic manner using markup, and how the markup information is subsequently used by a digital collation tool for a more refined analysis of the textual variation. We take examples from the manuscript materials of Virginia Woolf's To the Lighthouse (1927), which bear the traces of the author's struggles in the form of deletions, additions, and rewrites. These in-text revisions typically constitute non-linear, discontinuous, or multi-hierarchical information structures. While digital technology has been instrumental in supporting manuscript research, the current data models for text provide only limited support for co-existing hierarchies or non-linear text features. The hypergraph data model of TAG is specifically designed to support and facilitate the study of complex manuscript text by way of its syntax TAGML and the collation tool HyperCollate. The article demonstrates how the study of textual variation can be augmented by designated markup to express the in-text, micro-level revisions, and by computer-assisted collation that takes that information into account.

Table of Contents

Introduction
Non-linear text
Challenges for text-encoding
Challenges for text analysis
Encoding non-linear text
Related work
Embedded markup approaches
Stand-off approaches
TAGML
TAGML Pipeline
Examples
Analysing non-linear text: collation
Related work
Creating additional witnesses for each in-text revision
Passing along markup
Comparing data-centric XML
HyperCollate
HyperCollate's approach
Examples
Discussion
Conclusion

Introduction[1]

When we say that text encoding is a means of making explicit an interpretation of a text, we mean that the encoder is compelled to explicitly formulate their underlying assumptions about the text. We often forget to point out that text encoding also implies a (subconscious) choice of a certain data model. Needless to say, not all data models are equally suitable to express and query all kinds of textual information. It is crucial, then, for encoders to be(come) aware of the consequences of their choices, not only on the level of the tagset but also on the level of the data model. As a result, practising digital textual scholarship – the modeling, encoding, and analysing – can be as informative and enlightening as the end product.

Since data models for text are usually developed according to a specific conceptual idea of text, it is interesting to see what textual features are natively supported by a data model. What are, for example, the consequences of expressing text as a consecutive sequence of characters, with annotations as ranges on the text (LMNL data model)? Or: how will representing textual information as RDF statements (cf. EARMARK, see Peroni and Vitali 2009) instead of an ordered rooted tree (cf. XML) change the way we think about text? In each case, the affordances of the chosen data model will inevitably affect our encoding practice and the outcomes of an analysis.

Ideally, then, the choice of a suitable data model is primarily informed by one's research questions and the particulars of the textual material, and not by a prevailing standard. As Michael Sperberg-McQueen noted in his concluding remarks of the Balisage 2009 conference: It is not standards in themselves that are harmful, but mindless adherence to standards that is harmful (Sperberg-McQueen 2009). Making an informed choice of a specific data model is accordingly related to the research needs of a scholar, who needs to be clear about what textual feature(s) they want to examine, what result(s) they expect from the text modeling, and how they intend to get there.

In this contribution, we investigate how the TAG data model addresses a persistent challenge for modeling and analysing literary and historical documents. The contribution builds upon two previous Balisage papers which introduced respectively the TAG model (Haentjens Dekker and Birnbaum 2017) and the TAGML syntax (Haentjens Dekker et al. 2018). Presently, we will expand on the potential of TAG and TAGML to model and process complex textual phenomena. We take our examples from fragments of the authorial manuscripts of Virginia Woolf's To the Lighthouse (1927). The text on these documents presents quite some modeling challenges: words are deleted midway through a sentence, phrases are inserted in between the lines or in the margin, paragraphs are transposed, changes to the text structure are indicated with arrows or metamarks, etc. In short: the documents contain the sort of textual phenomena that test the limitations of a data model. For this contribution, we concentrate on one particular phenomenon: in-text revisions and similar non-linear text structures.[2]

After briefly illustrating what we mean by in-text revisions and how they constitute non-linear information structures (section 2 Non-linear text), we go on to demonstrate how TAGML allows encoders to mark up non-linear text in a straightforward and idiomatic manner. Using the concept of the computational pipeline, we show in section 3 (Encoding non-linear text) how a TAGML transcription of non-linear text is tokenized, parsed, and stored as a single TAG hypergraph for text. The fourth section, Analysing non-linear text: collation, discusses the topic of automated collation and outlines how the non-linear information that is stored in the individual TAG hypergraphs can be used to come to a more refined collation output via graph-to-graph comparison.

The paper intends to demonstrate how TAG allows scholars to be extremely precise in expressing their interpretation of textual variance, which, in turn, positively affects the subsequent processing and analysis of the encoded texts. Because the TAG project is under ongoing development, this contribution will not be your average tool presentation. Rather, we intend to show in some detail how textual information is stored, interpreted, and processed by our data model. We consider this essential to understanding the potential of our model for supporting textual analysis. By providing detailed insights into the design choices and technical implementation of TAGML and HyperCollate, we emphasize how the choices made on the level of the storage and processing of textual information can affect the subsequent analysis.

Non-linear text

Challenges for text-encoding

Briefly put, non-linear text means that the text does not form a linear stream of characters. As we explained in Bleeker et al. 2018 and Haentjens Dekker et al. 2018, textual content is normally fully ordered information: the text characters form a stream of characters and their order is inherent to their meaning. Fully ordered text is parsed and processed as it is read: in Western scripts that means from left to right, from top to bottom.

In many cases, however, text is not a consistently linear structure. Textual variation, for example, may constitute partially ordered information. Take the in-text revision in Figure 1:

Figure 1: A simple revision

Example of a simple in-text revision on a typescript of the Swiss writer and poet Gustave Roud (facsimile CRLR_GR_MS1H16d_1r_1, Université de Lausanne, see https://www.unil.ch/clsr/fr/home/menuinst/projets-de-recherche/gustave-roud-oeuvres-completes.html).

The original text of this fragment reads aux pierres en saillie toute une écume. The words en saillie are crossed out and the word noyées is added above the line, changing the phrase into aux pierres noyées toute une écume. The words en saillie and noyées occupy the same location in the text and are thus mutually exclusive.

Text encoding implies the interpretation, transcription, and encoding of textual inscriptions on the document page. In the case of documents with in-text revisions, markup can be used to identify the subsequent stages in the writing and revision process or to label the different types of revisions. To illustrate the trickiness of encoding non-linear text, let's take a look at how it can be done in TEI/XML. The TEI Guidelines, the de facto standard for text encoding (TEI P5), are currently based on the XML data model, which means that literary texts are usually modeled as an ordered, monohierarchical tree. So, as text encoders set out to encode the various stages of writing and revision as thoroughly as possible, they face the tricky task of encoding partially ordered information in a fully ordered data structure. The example above can be encoded in TEI/XML with del and add elements, and – if needed – a subst element to group the two elements together as a single intervention:

aux pierres <subst><del>en saillie</del><add>noyées</add></subst> toute une écume.

Here, the opening tag <subst> indicates where the textual information becomes temporarily non-linear. We can say that the two readings en saillie and noyées constitute two paths through the text. The tags del and add identify the two separate paths. Within the individual paths, the text is fully ordered again: the words en saillie would have a different meaning (if any) if the text characters lost their order. At the closing tag </subst> the two branches rejoin and the text becomes fully linear again.
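To make the two paths concrete: the following lines derive both linear readings from the TEI encoding by stripping one branch of the subst at a time. This is an informal sketch in Python (assuming the lxml library), not part of any TEI or TAG toolchain:

from lxml import etree

XML = ('<s>aux pierres <subst><del>en saillie</del>'
       '<add>noyées</add></subst> toute une écume.</s>')

def reading(drop):
    # Keep one branch of every <subst> by removing the other element.
    tree = etree.fromstring(XML)
    etree.strip_elements(tree, drop, with_tail=False)
    return ' '.join(''.join(tree.itertext()).split())

print(reading('add'))  # aux pierres en saillie toute une écume.
print(reading('del'))  # aux pierres noyées toute une écume.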

Similar examples of non-linear or partially ordered information that is expressed in markup are the choice or app elements. Both are used to group a number of alternative encodings for the same point in the text.[3]

Because XML cannot natively express non-linear text structures at the level of the model or the syntax (Haentjens Dekker and Birnbaum 2017), the TEI Guidelines provide several dedicated elements and schemata. As a result, the temporary non-linearity as expressed with the TEI/XML element subst can be licensed by and validated against a schema.[4] In theory, a query processor that has access to and understands this schema will be able to recognize the non-linear information expressed by the markup. Following our example above, the processor will understand that the words en saillie and noyées are located at the same place in the text stream and that they are mutually exclusive alternatives to each other.

Unfortunately, this scenario does not apply to the majority of XML query processors. In most cases, query processing remains limited to the linear level. Put differently: the TEI/XML file may conceptually represent the encoder's idea of non-linear text, but that concept is typically not shared with a processor. This has significant consequences for searching, querying, and analysis. Desmond Schmidt found, for instance, that only 10% of digital editions using inline markup could find literal expressions that span inline substitutions (Schmidt 2019, note 3). The section below illustrates other complications that arise when collating texts with in-text variations.
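Schmidt's observation is easy to reproduce. When the tags of the example above are stripped, the two readings collapse into one character stream in which neither phrase survives contiguously, so a naive literal search fails for both (a minimal illustration; the string below is simply the concatenated text content):

flat = 'aux pierres en saillienoyées toute une écume.'  # all tags stripped
'pierres noyées' in flat   # False, although the phrase occurs in the revised reading
'saillie toute' in flat    # False, although the phrase occurs in the original reading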

Challenges for text analysis

In this contribution, we focus on one form of text analysis that would particularly benefit from having access to non-linear information: collation. Collation can be defined as the comparison of two or more versions (witnesses) of a literary text in order to establish a record of the textual variance. To this end, a scholar can use designated collation tools like CollateX (CollateX) or Juxta (Juxta). However, since these automated collation tools operate on character strings, the non-linear information about revisions within individual witnesses is not used to come to an alignment of the texts.

Editors who are working with witnesses containing in-text variation are compelled either to choose only one revision stage per text, or to pass on relevant information about deletions and additions through the collation pipeline so that it is present in the collation output (Bleeker 2017, section 2.2.5; Beshero-Bondar 2017; Bleeker et al. 2018; Birnbaum et al. 2018; Beshero-Bondar and Viglianti 2018).

The first option implies that relevant information about an author's writing and revision process is ignored by the collation tool. This may have a negative impact on the analysis of the textual variance. With the second option – passing on markup tags (in flattened form) – editors can at least use the retained information to visualize the deleted words in the collation output,[5] or to raise the flattened transcription again (see the Variorum Frankenstein project, Birnbaum et al. 2018, discussed in more detail in section 4.1.2 Passing along markup). However, this option requires a considerable set of technical skills that may not be available to most scholarly editors. Furthermore, the collation tool would still operate on the character stream and the non-linear information is not part of the alignment process.

This brings us to three important requirements for the way TAG should handle non-linear information. First, editors need to be able to mark up this kind of partially ordered information in a straightforward manner. Secondly, a processor needs to recognize non-linear information as such so that the texts can be queried and searched more effectively. And finally, we need a collation program that recognizes non-linear text as two or more mutually exclusive paths through the text, from which it then chooses the best match.

Our Balisage 2018 paper already demonstrated how to represent non-linear information in TAGML (Haentjens Dekker et al. 2018); section 3.2 TAGML will therefore focus on how this information is interpreted by the TAGML parser and stored as a hypergraph. Section 4.2 HyperCollate then describes how the individual TAG hypergraphs can be collated by the hypergraph-based collation tool HyperCollate (https://huygensing.github.io/hyper-collate/) that recognizes non-linear information and thus produces a more refined collation output. Both sections are preceded by an overview of the related work done in these areas.

Encoding non-linear text

TAGML

We have defined non-linear text as partially ordered information, and we have emphasized that it is desirable for a markup system to natively represent partially-ordered information. Now let's move on to the approach of TAG. This section first describes the pipeline of the TAGML parser. It subsequently illustrates the operations of the parser with three examples of textual variation within a manuscript fragment: a single deletion, an immediate deletion, and a grouped revision. For each example, we show the TAGML transcription and the Abstract Syntax Tree (AST) that is created by the TAGML parser.

Note that a draft manuscript can present a huge number of complex textual variations (substitutions within substitutions within substitutions, transpositions of multiple segments, revisions within a word, etc.). For this paper, we selected three short examples, lest the visualisations of the ASTs become too large and uninformative.

TAGML Pipeline

Figure 3 presents a schematic representation of the TAGML processing pipeline.

Figure 3: TAGML pipeline

The TAGML processing pipeline. The input is a TAGML document, the output is an Abstract Semantic Graph.

The input of the pipeline is a TAGML document that contains a combination of text and markup. Markup tags indicate whether the text is fully ordered, partially ordered, or unordered.[9] The TAGML document is first tokenized by the TAGML lexer, which produces a stream of TAGML tokens; each token contains information about its position in the TAGML document, its type, and its length. We will discuss the lexer in more detail below. The stream of TAGML tokens is subsequently parsed with an ANTLR-generated parser that uses the TAGML grammar, resulting in an AST of the input TAGML document. The most up-to-date version of the TAGML grammar can be found on GitHub. From the AST, an Abstract Semantic Graph (ASG) can be generated. In the TAG data model, the ASG is implemented either as a Multi-Colored Tree (MCT) or a hypergraph.[10] In a tree model the markup elements start at the top level and are (almost) all above the text elements, which are at the bottom in leaf nodes. In contrast, the TAG hypergraph model has the text elements at the centre of the model. The relationship between Text nodes and Markup nodes is expressed by hyperedges.
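The stages of the pipeline and the information carried by each token can be summarized in a short Python sketch. The names below are ours, for illustration only; they do not correspond to the actual implementation:

from dataclasses import dataclass

@dataclass
class TagmlToken:
    position: int  # offset of the token in the TAGML document
    type: str      # e.g. 'TEXT', 'MARKUP_OPENER', 'MARKUP_CLOSER'
    length: int    # number of characters the token spans

# tokens = lex(tagml_document)  # lexer: TAGML document -> token stream
# ast    = parse(tokens)        # ANTLR-generated parser: tokens -> AST
# asg    = to_hypergraph(ast)   # AST -> Abstract Semantic Graph (MCT or hypergraph)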

Examples

The examples in this section come from the authorial notebooks of Virginia Woolf's To the Lighthouse, written between 1926 and 1927 (Woolf 1927). Digital facsimiles of the notebook pages are available via the digital archive Woolf Online.[11] For each example we show four representations: the manuscript fragment, the TAGML transcription, its AST as produced by the parser, and its hypergraph representation. Again, we'd like to point out that these examples are selected for their simplicity; more complex examples would result in an exceptionally large AST visualisation that would not fit into this paper.

Single deletion

Figure 4: Manuscript fragment with a single deletion

A deleted word on a fragment of Virginia Woolf's manuscript of To the Lighthouse, Fol. 9; SD. p. 4.

If we follow the tag suggestions of the TEI Guidelines for the transcription of primary sources and map them to TAGML,[12] a textual fragment with the deletion can be represented as follows:

[TEI>[s>it seemed [?del>indeed<?del] as if it were now settled<s]<TEI]
The question mark prefix in the [?del> tag indicates that the element and its textual content are optional, yielding two ways of reading the text. When processed by the TAGML parser, the following AST is produced:

Figure 5: AST of a single deletion

Visualisation of the AST of a single deletion.

The TAGML grammar is context-sensitive,[13] and a TAGML document consists of one or more chunks. Each chunk can contain a start tag, an end tag, text, or text variation. In this diagram, the TAGML tokens are visualized in the blue leaf nodes. Because the lexer has different modes, the same character (sequence) can be reused: depending on its position in the TAGML document, it may get a different function. For example, if the lexer is in the 'Default' mode, a left square bracket [ is identified as the start of a markup opener, so the lexer switches to the 'Inside Markup Opener' mode. The lexer remains in this mode until it encounters a token that prompts it to switch to another mode; in this case that could be the > token, which triggers the lexer to switch back to the 'Default' mode ('pop mode // back to default'). The diagram shows how a TAGML token can trigger the shift to a different mode: the modes of the lexer are visualized in red and connected to the tokens that trigger them.
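The mode-switching behaviour described above can be imitated in a few lines of Python. The following toy lexer handles only a small TAGML subset and is emphatically not the ANTLR-generated one:

def toy_lex(tagml):
    # Yields (mode, token_type, value) triples; '[' and '<' switch the mode,
    # '>' and ']' pop it back to default, as in the diagram above.
    mode, buf = 'DEFAULT', ''
    for ch in tagml:
        if mode == 'DEFAULT':
            if ch in '[<':
                if buf:
                    yield (mode, 'TEXT', buf)
                    buf = ''
                mode = ('INSIDE_MARKUP_OPENER' if ch == '['
                        else 'INSIDE_MARKUP_CLOSER')
            else:
                buf += ch
        else:  # inside a markup opener or closer
            if ch in '>]':
                yield (mode, 'MARKUP_NAME', buf)
                buf = ''
                mode = 'DEFAULT'  # pop mode, back to default
            else:
                buf += ch
    if buf:
        yield ('DEFAULT', 'TEXT', buf)

list(toy_lex('[s>it seemed<s]'))
# [('INSIDE_MARKUP_OPENER', 'MARKUP_NAME', 's'),
#  ('DEFAULT', 'TEXT', 'it seemed'),
#  ('INSIDE_MARKUP_CLOSER', 'MARKUP_NAME', 's')]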

From the resulting AST a Multi-Colored Tree or a hypergraph can be generated. Because hypergraphs are more suitable to represent non-linear information, we will show the hypergraph for each TAGML document. Note that the AST only expresses the syntactic structure of the TAGML document, which means the start tags and end tags are not linked. Going from an AST to a hypergraph, then, means that the start and end tags will need to be reconnected in order to form Markup nodes in the hypergraph.

Figure 6: Hypergraph visualisation of a single deletion

A visualisation of the hypergraph of the single deletion, with the markup information in labeled hyperedges. Here, the Text nodes form a directed graph and the text can be read from left to right, following the arrows. The visualisation shows the two paths through the text that imply that the text can be read in two different ways: one version of the text includes the deleted word indeed, and one version excludes it. The Markup nodes connected to the Text nodes are visualized as coloured spheres. For example, the Markup node labeled del is connected to the Text node indeed.
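The structure in Figure 6 can be sketched informally as a toy data structure (ours, not the actual TAG implementation): the Text nodes form a directed graph with two paths, and the markup is attached through hyperedges that may span several Text nodes:

text_nodes = {0: 'it seemed ', 1: 'indeed', 2: ' as if it were now settled'}
edges = [(0, 1), (1, 2),  # the path that includes the deleted word
         (0, 2)]          # the path that skips it
hyperedges = {'s': [0, 1, 2],  # the sentence markup spans all Text nodes
              'del': [1]}      # the del markup is connected to 'indeed' only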

Immediate deletion

Figure 7: Manuscript fragment of an immediate deletion

An immediate deletion (currente calamo) on a fragment of Virginia Woolf's manuscript of To the Lighthouse, Fol. 9; SD. p. 4.

In TAGML, the immediate deletion is expressed as follows:

[TEI>[s>The [del>im<del] picture of stark & compromising severity.<s]<TEI]
Note that the [del> tag does not have the optional prefix ? because we interpret an immediate revision as part of the same writing stage as the rest of the text. This means there is just one reading of the text, and that reading includes the two deleted text characters im.

Figure 8: AST of an immediate deletion

An AST of the immediate deletion encoded in TAGML. The different modes of the lexer are visualized in red. Note, for instance, how the lexer interprets text characters either as Text or as a markupName depending on its current mode (the 'Default' mode and the 'InsideMarkupOpener' mode, respectively).

Figure 9: Hypergraph of an immediate deletion

Visualisation of the hypergraph as generated from the AST above. Because the immediate deletion is taken as part of the running text, the text in the hypergraph has no branches: there is only one way of reading the text. The fact that the text characters im are deleted is represented by associating the Text node with a Markup node labeled del.

Grouped revision

Figure 10: Manuscript fragment of a grouped revision

A grouped revision on a fragment of Virginia Woolf's manuscript of To the Lighthouse, Fol. 15; SD p. 7.

A grouped revision is a clear example of non-linear, partially ordered information: alternative readings for the same point in the text. In the TAGML syntax, the branching of the text stream is indicated with <|, the individual branches are separated with a vertical bar |, and the converging of the branches is indicated with |>. Individual branches contain markup and text. The present example can thus be encoded like this:

[TEI>[s> something <|[del>trustful<del]|[add>trusting<add]|>, something childlike<s]<TEI]
As we will see in the visualization of the AST below, the TAGML parser will recognize and interpret the divergence and convergence signs, so that the content of the branches is considered variant text ('ITV_text').
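As an aside, the divergence and convergence signs are easy to spot mechanically. The following naive sketch (not the TAGML parser, which handles nesting and escaping) pulls the branches out of the example with a regular expression:

import re

def toy_branches(tagml):
    # Naive: assumes no nested variation and no '|' inside a branch.
    m = re.search(r'<\|(.*?)\|>', tagml)
    return m.group(1).split('|') if m else []

toy_branches('[s>something <|[del>trustful<del]|[add>trusting<add]|>, something childlike<s]')
# ['[del>trustful<del]', '[add>trusting<add]']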

Figure 11: AST of a grouped revision

This visualization of the AST of the grouped revision example shows how the lexer switches constantly between modes. This ensures that the parser has the right information and can interpret non-linear, partially ordered information as it was intended by the human encoder: as an indication of two readings of the same point in the text.

Figure 12: Hypergraph of a grouped revision

This hypergraph visualization demonstrates the concept of branches by showing how the text diverges into two branches after the word something. Each branch contains both text and markup. The information in the branches is mutually exclusive: when read from left to right, the text in the hypergraph either reads something trustful, something childlike or something trusting, something childlike. The branches converge again after the text variation ends.

Taken together, the different visualizations of in-text revisions illustrate how TAG and TAGML allow for a natural and idiomatic digital representation of in-text variation, one that we think comes as close as possible to how a human encoder understands it. The next section will look at ways to transfer this understanding to a collation tool.

Analysing non-linear text: collation

Related work

We defined collation at its most basic level as the comparison of two or more versions (witnesses) of a text to find (dis)similarities between or among them. As mentioned above, collation software typically does not excel at handling markup within individual witnesses: such tools collate the witnesses on the basis of their plain text – either ignoring all tags and attributes or requiring users to remove them in a pre-processing phase – or they transform the markup to plain text characters so that the tags are collated but their meaning is ignored. In either case, any partially ordered information is overlooked. Still, because the requirement of including (certain) markup elements in the collation process continues to exist, scholars have invented some nifty ways to work around the limitations of prevalent collation software. We distinguish three main approaches, which are briefly discussed in the following paragraphs:

  1. Creating additional witnesses for each in-text revision;

  2. Passing along markup for postprocessing;

  3. Comparing structured (data-centric) XML files.

Creating additional witnesses for each in-text revision

Section 3.1.2 (Stand-off approaches) described how users of the MVD technology can create separate layers for each occurrence of in-text variation. These layers can subsequently be used as temporary witnesses for the plain text collation. Using the MVD collation functionality 'Compare', the temporary witnesses and the base texts are compared to one another. Both the official text versions and their temporary layers are merged into one MVD, which stores the text shared by each version and layer only once, similar to a variant graph. 'Compare' does not recognize markup elements and consequently does not differentiate between text and markup. In fact, it is not quite correct to say that the collation program ignores the markup information about non-linearity: the markup was never there in the first place.[14] The result of the collation is an MVD variant graph containing in-text revisions as well as between-document variation. In the variant graph, in-text non-linear structures (i.e., within one witness) are distinguished from external variation (i.e., between witnesses) by sigla.

Passing along markup

Some collation tools allow for markup elements to be passed on through their pipeline. The markup elements are ignored during the alignment process, but they are present in the output and can then be used for further analysis or visualisation purposes.

Juxta Commons accepts TEI/XML encoded files for collation. The tool offers an interface that lists all TEI/XML elements contained in the input witness; the user can select which elements should be part of the collation and which can be filtered out. The TEI/XML elements are not part of the alignment process proper: they are saved as stand-off annotations to the text tokens and passed on through the collation pipeline. The elements can be visualized in the heat map representation of the collation result, which shows deletions in red and additions in green. In other output formats of Juxta, the <add> and <del> tags are no longer there.

Another method to pass along markup through the collation pipeline requires coding on the side of the editor who interacts with the underlying code of the collation tool. We illustrate this approach by looking at the software CollateX.[15] The JSON input format for CollateX allows for extra properties to be added to words. In a preprocessing step, the text is tokenized and transformed into JSON word tokens. Editors can make a selection of markup elements that they wish to attach to the JSON token as a t property (the t standing for token). The collation is executed using the value of the token's n property (n for normalized), but the selected markup is included in the JSON alignment table output via the t property, and can be processed in the visualisation of the alignment table. This approach does approximate the goal of having in-text revisions marked as non-linear text in both in- and output, but at the point of alignment, the collation algorithm is unaware of any non-linearity in the witness' text and treats the partially ordered information as a linear structure.
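A sketch of such pretokenized input, assuming the Python version of CollateX; the markup carried in the t properties is illustrative:

from collatex import collate

json_input = {"witnesses": [
    {"id": "W1", "tokens": [
        {"t": "something", "n": "something"},
        {"t": "<del>trustful</del>", "n": "trustful"},  # markup kept in 't'
        {"t": ",", "n": ","}]},
    {"id": "W2", "tokens": [
        {"t": "something", "n": "something"},
        {"t": "trustful", "n": "trustful"},
        {"t": ",", "n": ","}]}]}

# Alignment uses only the 'n' values; the 't' values travel along into the output.
print(collate(json_input, output="table", segmentation=False))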

Comparing data-centric XML

Finally, there are several approaches to comparing XML trees (Barabucci et al. 2016, Ciancarini et al. 2016). Some XML editors, like oXygen, have a built-in XML comparison function. In theory, this approach would allow for the comparison of TEI/XML transcriptions containing non-linear information encoded with subst or app elements. The comparison functionality is primarily developed for structured, record-based XML documents – e.g., XML documents containing address information such as person name, address, age – and the documents are compared on the level of the XML elements.

Comparing two record-based XML documents results in an overview of the difference in markup. For example, an add element in witness A would be aligned with an add element that is on the same relative location in witness B. Note that these matches are made on the element level and not on the level of the textual content. Textual scholars studying the revisions in literary texts, however, would typically give preference to the textual content.

The DeltaXML software does detect and display changes between two XML documents, either on the level of the XML elements or on the level of the text content (Delta XML). Their Document comparator tool compares two XML documents and merges them into a new XML document that contains additional attributes representing the variation. What is more, DeltaXML provides the option to identify orderless containers: XML elements whose children can be arranged in an order that is not considered significant. Orderless containers are marked by placing the deltaxml:ordered="false" attribute on the element so that all its children will be considered to occur in any order. It is a description that seems to apply to our interpretation of the subst, the app, and the choice elements. Comparing XML documents that contain a combination of ordered and orderless data (i.e., the children of some XML elements are ordered, but others are orderless) remains, according to the DeltaXML developers, a challenge. They propose to solve it by preprocessing the input XML documents.[16]
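Applied to our running example, a hypothetical orderless encoding could look like this (the deltaxml:ordered attribute is documented by DeltaXML; the encoding itself is ours, and the deltaxml namespace would need to be declared):

aux pierres <subst deltaxml:ordered="false"><del>en saillie</del><add>noyées</add></subst> toute une écume.

The comparator would then treat the del and add children as an unordered pair when matching them against another document.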

That data-centric XML comparison methods usually do not work for text-centric encoded documents has already been pointed out by Di Iorio et al., who wrote that although XML is used to encode both literary documents and database dumps, there is something different from diff-ing an XML-encoded literary document and an XML-encoded database (Di Iorio et al. 2009, emphasis in original). In the case of comparing an encoded literary document, the authors state, the output of a diff should be evaluated on its naturalness, that is, on whether the algorithm identifies the changes that would also be identified by a manual, human approach. To this end, Di Iorio et al. developed the implementation JNDiff.[17] Di Iorio et al. were right to point out that the comparison of encoded literary documents is quite unique. In the majority of cases, the collation of such documents needs to take place on the level of the text. Rather than being compared, the markup indicating in-text revision needs to be interpreted as it was intended by the encoder: as identifying the start and end of textual variation.

HyperCollate

In developing HyperCollate, we started from the need for a collation tool that, first, recognizes the multiple writing stages within each witness text and, secondly, chooses the best match from these different stages. This implies that the non-linear structures in an encoded text need to be recognized by the collation program, and that the program needs to take this information into account during the alignment process. The technical implications are, first, that the input witnesses are not treated as a linear string of characters, but as hypergraphs consisting of partially ordered information. Secondly, that the collation program recognizes this partially ordered information and processes it accordingly.

As a result, a multidimensional text containing multiple revision stages would no longer have to be flattened before collation. More importantly, we estimate that the collation result would approximate the way a human editor would collate the texts. Accordingly, the collation tool would effectively support and advance scholarship.

HyperCollate's approach

HyperCollate takes as input two TAG hypergraphs of the individual witnesses. Note that each hypergraph may contain a combination of ordered, partially ordered and unordered information. The partially ordered information is represented as two or more branches in the hypergraph, with the text nodes in these branches having the same position in the text vis-à-vis the document root node (see the hypergraph visualisations in section 3.2.2 Examples).

The computational pipeline of HyperCollate can be visualized as follows:

Figure 13: The HyperCollate pipeline

A visualisation of HyperCollate's approach, to be read from the upper left to the upper right corner; then from the lower left to the lower right corner.

For each hypergraph witness, the textual content is first tokenized into text tokens. By default, the text nodes are segmented on whitespace and punctuation. Note that the output of the tokenizer is not a linear stream of tokens: the tokens contain information about their distance vis-à-vis the root node (their depth), and in the case of non-linearity there may be more than one text token at the same position. The two witnesses are subsequently aligned. Alignment here means that HyperCollate calculates the smallest possible number of changes needed to change one set into the other.
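The default segmentation can be imitated with a single regular expression (a toy equivalent in Python, not HyperCollate's actual tokenizer):

import re

def tokenize(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

tokenize("something trustful, something childlike")
# ['something', 'trustful', ',', 'something', 'childlike']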

The output of that alignment is a set of matching tokens. Again, the tokens in this set contain information about their depth, and there may be more than one text token at the same position. Based on the information from this set of matches, the two witness hypergraphs are merged into a new collation hypergraph that contains all information about the Text nodes and the Markup nodes of the input witnesses. All the nodes that are not aligned are unique to one of the two witnesses; all the nodes that are aligned can be reduced from two to one node, with labels on the edges indicating which node is part of which witness.
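A deliberately simplified sketch of what position-aware matching buys us (HyperCollate's actual alignment is more involved): the branch tokens of witness 1 share a position, and the match is chosen among them:

# Tokens per position; at position 1, witness 1 has two mutually exclusive branches.
w1 = {0: ['something'], 1: ['trustful', 'trusting'], 2: [','],
      3: ['something'], 4: ['childlike']}
w2 = {0: ['something'], 1: ['trustful'], 2: [','],
      3: ['something'], 4: ['childlike']}

matches = {pos: set(w1[pos]) & set(w2[pos]) for pos in w1}
# matches[1] == {'trustful'}: the del branch is the best match;
# 'trusting' remains unique to witness 1.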

For every witness, then, there is always a fully connected path of Text nodes through the collation hypergraph: from the start node to the end node, following the sigla on the edges. Labeled hyperedges are used to associate the Markup node(s) with the Text nodes. The collation hypergraph can be visualized as an ASCII table or a collation graph (in .dot, SVG or PNG format). In the case of more than two witnesses, the collation output could also be used as the basis for a new collation following the progressive alignment method (Spencer and Howe 2004).

Examples

In the paragraphs below, we return to the examples from section 3.2.2 Examples. Each example text fragment is collated with HyperCollate against another version of the same text that can be found in the print proofs of To the Lighthouse (Woolf 1927).[18] The HyperCollate output, a collation hypergraph, can be visualized in multiple ways, from an ASCII alignment table to an SVG graph. Each format has its own level of information density. Here, we provide an alignment table visualisation and an SVG collation graph visualisation to illustrate what difference this density makes: a simpler visualization may be clearer, but it may lack relevant information about the textual variance and/or the markup.

Single deletion

Input witness 1:

[TEI>[s>it seemed [?del>indeed<?del] as if it were now settled<s]<TEI]

Input witness 2:

[TEI>[s>as if it were settled<s]<TEI]

Figure 14: Alignment table visualisation of the collation output

The [-] in front of the text token 'indeed' is how HyperCollate visualizes that this token is marked as a deletion in the input. However, this simple alignment table visualisation does not show that there are two different paths through the text of witness 1.

Figure 15: Collation graph visualisation of the collation output

A collation hypergraph visualisation of the output of HyperCollate. As with a standard variant graph, the matching Text tokens are merged; only the variant Text tokens are made explicit. The witness sigla, represented on the edges as well as in the vertices of the collation hypergraph, indicate which tokens belong to which witness. Furthermore, vertices include a path to the markup associated with each Text node. In this visualisation, the two paths through the text of witness 1 are visible.

Immediate deletion

Input witness 1:

[TEI>[s>he appeared the [del>im<del] picture of stark & compromising severity.<s]<TEI]

Input witness 2:

[TEI>[s>he appeared the image of stark & compromising severity.<s]<TEI]

Figure 16: Alignment table visualisation of the collation output

This alignment table visualisation shows that the text characters 'im' (an immediate deletion in witness 1) align with the word 'image' in witness 2. Note that aligned is not the same as matched: two tokens may be placed above each other because they are at the same relative position between two matches, even though they do not constitute a match. Still, by including the deletion in the collation we see that Woolf started writing the word 'image' ('im'), deleted it, and opted for 'picture' in her draft manuscript, but that, judging from the print proofs, she went with 'image' after all.

Figure 17: Collation graph visualisation of the collation output

Visualisation of the collation hypergraph. The variant graph visualisation can include more information than an alignment table, which makes it a useful visualisation to analyse the collation outcome in more detail. Note, for instance, that this visualisation shows that the Text tokens 'im' (witness 1) and 'image' (witness 2) are indeed not considered matches by HyperCollate.

Grouped revision

Input witness 1:

[TEI>[s>something <|[del>trustful<del]|[add>trusting<add]|>, something childlike <s]<TEI]

Input witness 2:

[TEI>[s>something trustful, something childlike <s]<TEI]

Figure 18: Alignment table visualisation of the collation output

The alignment table visualisation of HyperCollate presents grouped revisions in a single cell, to indicate that there are two alternative readings for the same position in the text.

Figure 19: Collation graph visualisation of the collation output

Looking at the markup information in the vertices of this collation hypergraph, we see that the Text token 'trustful' has different markup in witness 1 compared to witness 2. Still, because HyperCollate takes the text as leading, the Text tokens are merged into one vertex.

Discussion

The contribution concentrated on representing in-text variation in TAGML and subsequently collating that information with HyperCollate. We described how the TAG model understands textual variation within one text version as non-linear, partially ordered information. The TAGML syntax allows encoders to express partially ordered information in a straightforward manner.

Partially ordered information is recognized and processed as such by the TAGML parser, and stored as multiple branches in a hypergraph for text. The Text tokens within each branch are mutually exclusive and have the same depth, meaning that they are both at the same distance from the root document node of the hypergraph.

HyperCollate is a hypergraph-based collation program that implements the TAG model. HyperCollate aligns Text tokens based on their relative position in the text as well as their depth in the hypergraph. The program recognizes the branches in the input hypergraphs: if two Text tokens have the same position number, the program finds the best match between them. The output of HyperCollate is a collation hypergraph that can be visualized in different ways; in this paper we showed an alignment table and a variant graph visualization.

Presently, the TAG approach to transcription and collation takes the text as leading, using the markup as the basis for recognizing in-text variation as partially ordered information. Nevertheless, future work could look into aligning hypergraphs while also looking at other types of markup, e.g., paragraph or sentence breaks. This would be a drastic adjustment to the collation algorithm, though, because it would require HyperCollate to prioritize not only the text but also the markup. Another topic of future work is the TAG query language (TAGQL), which will make it possible to query the information in both individual TAGML documents and the collation hypergraph.[19]

Finally, we continue working on the different output formats of the collation. Currently, the collation output can be visualized as an ASCII table or a collation graph (in .dot, SVG, or PNG format). The examples used in this contribution were small text fragments and simplified TAGML transcriptions for a reason: representing and collating two larger TAGML transcriptions, each containing several stages of revisions, would result in an AST and a collation hypergraph that, in their entirety, cannot be visualized in any meaningful way for the reader. In fact, the TAGML input and HyperCollate output contain a much larger variety of textual information than the visualization of a collation hypergraph. Since this information can be of instrumental value to a deeper study of the text, a future aim of HyperCollate is to provide an output in TAGML format. This could be similar to the TEI critical apparatus, and would allow scholars to continue their textual analysis on the collation output.

Conclusion

So far, all of our contributions to Balisage are characterized by an aspect of 'ongoing research' and the same applies to this contribution. Among other things, HyperCollate is not yet operating optimally and TAGML does not have a fully functioning query language yet. Nevertheless, we hope to have shown the value of looking beyond the prevalent standard and continuing to question how we think about, represent, and analyse texts digitally.

As more processes are automated and new methods spring from using digital technologies, we have more and better digital instruments to map the writing process. But we need to pay equal attention to the thoughts that go into making these instruments: how are scholarly activities automated? And how does that affect our understanding of and interaction with text? In short: it is important that textual scholars continue to explore different options and to critically evaluate to what extent a data model addresses their research needs.

The underlying goal of our contribution was therefore to provide transparency about the way a TAGML document is parsed and subsequently collated. We illustrated how we transferred our human understanding of in-text variation to the computer and how this intelligence is used to improve the alignment of textual witnesses. In testing, we have found that HyperCollate's more refined collation technology allows scholars to closely examine different forms of textual variation and to discover patterns in the writing processes of literary authors. We can see this even with the small examples of Woolf's text used in this contribution: in two of the three cases a word that was deleted in the draft version reoccurred in the print proofs.

We consider HyperCollate as an inclusive approach to collation: scholars are no longer required to 'flatten' the manuscript text, to dive deep into the code of the collation software, or to create additional transcriptions solely for collation purposes. Instead, they can preserve and use the information about an author's creative revision process and arrive at a collation result that may reveal information previously hidden.

References

[Alexandria] Alexandria. https://huygensing.github.io/alexandria/ ; Information about installing and using the Alexandria command line app is available at links on the TAG portal at https://huygensing.github.io/TAG/.

[Barabucci et al. 2016] Barabucci, Gioele, Paolo Ciancarini, Angelo Di Iorio, and Fabio Vitali. Measuring the quality of diff algorithms: a formalization. In Computer Standards & Interfaces, vol. 46 (2016), pp. 52-65. doi:https://doi.org/10.1016/j.csi.2015.12.005.

[Barrellon et al. 2017] Barrellon, Vincent, Pierre-Edouard Portier, Sylvie Calabretto, and Olivier Ferret. Linear extended annotation graphs. In Proceedings of ACM Document Engineering, Malta, September 2017 (2017). doi:https://doi.org/10.1145/3103010.3103011. Available online.

[Beshero-Bondar 2017] Beshero-Bondar, Elisa Eileen. Rebuilding a Digital Frankenstein by 2018: Reflections toward a Theory of Losses and Gains in Up-Translation. Presented at Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of the Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Beshero-Bondar01.

[Beshero-Bondar and Viglianti 2018] Beshero-Bondar, Elisa E., and Raffaele Viglianti. Stand-off Bridges in the Frankenstein Variorum Project: Interchange and Interoperability within TEI Markup Ecosystems. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Beshero-Bondar01.

[Birnbaum 2015] Birnbaum, David J. Using CollateX with XML: Recognizing and Tracking Markup Information During Collation. Blogpost, published on June 28, 2015 under Computer-supported collation with CollateX. Available on http://collatex.obdurodon.org/xml-json-conversion.xhtml.

[Birnbaum et al. 2018] Birnbaum, David J., Elisa E. Beshero-Bondar and C. M. Sperberg-McQueen. Flattening and unflattening XML markup: a Zen garden of XSLT and other tools. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Birnbaum01.

[Bleeker 2017] Bleeker, Elli. Mapping invention in writing: digital infrastructure and the role of the genetic editor. Ph.D. thesis, University of Antwerp (2017). Available via https://repository.uantwerpen.be/docman/irua/e959d6/155676.pdf.

[Bleeker et al. 2018] Bleeker, Elli, Bram Buitendijk, Ronald Haentjens Dekker, and Astrid Kulsdom. Including XML markup in the automated collation of literary texts. Presented at XML Prague 2018, Prague, Czech Republic, February 8–10, 2018. In XML Prague 2018 - Conference Proceedings (2018), pp. 77–95. Available via http://archive.xmlprague.cz/2018/files/xmlprague-2018-proceedings.pdf.

[Bleeker et al. 2019] Bleeker, Elli, Bram Buitendijk and Ronald Haentjens Dekker. Between Freedom and Formalisation: a Hypergraph Model for Representing the Nature of Text. Long paper presented at the TEI Conference and Members meeting 2019, September 16-20 2019, Graz, Austria. Slides available from: https://zenodo.org/record/3929350. doi:https://doi.org/10.5281/zenodo.3929350.

[Ciancarini et al. 2016] Ciancarini, Paolo, Angelo Di Iorio, Carlo Marchetti, Michelle Schririnzi, and Fabio Vitali. Bridging the gap between tracking and detecting changes in XML. In Software: Practice and Experience, vol. 46, no. 2 (2016), pp. 227-250. doi:https://doi.org/10.1002/spe.2305.

[CollateX] CollateX. https://pypi.python.org/pypi/collatex.

[Dekhtyar and Iacob 2005] Dekhtyar, Alex, and Ionut Emil Iacob. A Framework for Management of Concurrent XML Markup. In Data and Knowledge Engineering, vol. 52, no.2, pp. 185-215 (2005). doi:https://doi.org/10.1016/j.datak.2004.05.005.

[Delta XML] DeltaXML, v. 2019-09-12. https://www.deltaxml.com/.

[Di Iorio et al. 2009] Di Iorio, Angelo, Michele Schirinzi, Fabio Vitali, and Carlo Marchetti. A natural and multi-layered approach to detect changes in tree-based textual documents. In International Conference on Enterprise Information Systems, pp. 90-101. Springer: Berlin, Heidelberg (2009). doi:https://doi.org/10.1007/978-3-642-01347-8_8.

[Haentjens Dekker and Birnbaum 2017] Haentjens Dekker, Ronald and David J. Birnbaum. It’s more than just overlap: Text As Graph. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1–4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01.

[Haentjens Dekker et al. 2018] Haentjens Dekker, Ronald, Elli Bleeker, Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum. TAGML: A markup language of many dimensions. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01.

[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt, Claus, and C. M. Sperberg-McQueen. TexMECS: An experimental markup meta-language for complex documents. Last revised on October 5, 2003. Available online.

[Van Hulle et al. (editors) 2019] Van Hulle, Dirk, Mark Nixon, and Vincent Neyt, editors. Samuel Beckett's Digital Manuscript Project. Antwerp: University Press Antwerp. http://www.beckettarchive.org, last updated in 2019.

[Juxta] Juxta Commons. http://juxtacommons.org/ and http://www.juxtasoftware.org/.

[LMNL data model] LMNLWiki. LMNL data model. From the Lost Archives of LMNL. http://lmnl-markup.org/specs/archive/LMNL_data_model.xhtml.

[Marcoux et al. 2012] Marcoux, Yves, Claus Huitfeldt and C. M. Sperberg-McQueen. The MLCD Overlap Corpus (MOC): Project report. Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). doi:https://doi.org/10.4242/BalisageVol8.Huitfeldt02.

[Ostrowski et al.] Ostrowski, Donald, David J. Birnbaum, and Horace G. Lunt. The e-PVL: An electronic edition of the Rus' primary chronicle. Via http://pvl.obdurodon.org/.

[Peroni and Vitali 2009] Peroni, Silvio and Fabio Vitali. Annotation with EARMARK for Arbitrary, Overlapping and Out-of-Order Markup. Presented at the DocEng’09 conference, September 16-18, 2009. In Proceedings of the 2009 ACM Symposium on Document Engineering (2009), pp. 171-180. ACM: New York. doi:https://doi.org/10.1145/1600193.1600232.

[Peroni et al. 2014] Peroni, Silvio, Francesco Poggi and Fabio Vitali. Overlapproaches in documents: a definitive classification (in OWL, 2!). Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). doi:https://doi.org/10.4242/BalisageVol13.Peroni01. https://www.balisage.net/Proceedings/vol13/html/Peroni01/BalisageVol13-Peroni01.html.

[Piez 2008] Piez, Wendell. LMNL in miniature. An introduction. Amsterdam Goddag Workshop, 1–5 December 2008. http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html.

[Portier et al. 2012] Portier, Pierre-Édouard, Noureddine Chatti, Sylvie Calabretto, Elöd Egyed-Zsigmond and Jean-Marie Pinon. Modeling, Encoding And Querying Multi-structured Documents. In Information Processing & Management, vol. 48, no. 5 (2012), pp. 931-955. doi:https://doi.org/10.1016/j.ipm.2011.11.004.

[Schmidt 2008] Schmidt, Desmond. What is a Multi-Version Document? Blogpost, published on March 5, 2008. https://multiversiondocs.blogspot.com/2008/03/whats-multi-version-document.html.

[Schmidt and Colomb 2009] Schmidt, Desmond, and Robert Colomb. A data structure for representing multi-version texts online. In International Journal of Human-Computer Studies, vol. 67, no.6 (2009), pp. 497-514. doi:https://doi.org/10.1016/j.ijhcs.2009.02.001.

[Schmidt and Fiormonte 2010] Schmidt, Desmond, and Domenico Fiormonte. Multi-Version Documents: A digitisation solution for textual cultural heritage artefacts. In Intelligenza Artificiale, vol. 4, no. 1 (2010), pp. 56-61.

[Schmidt 2019] Schmidt, Desmond. A Model of Versions and Layers. In Digital Humanities Quarterly, vol. 13, no. 3 (2019), available from http://digitalhumanities.org/dhq/vol/13/3/000430/000430.html.

[Schonefeld 2007] Schonefeld, Oliver. XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup. In Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference (2007).

[Spencer and Howe 2004] Spencer, Matthew and Christopher Howe. Collating Texts Using Progressive Multiple Alignment. In Computers and the Humanities, vol. 38 (2004), pp. 253-270. doi:https://doi.org/10.1007/s10579-004-8682-1.

[Sperberg-McQueen 2009] Sperberg-McQueen, C. M. Sometimes a question of scale. Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Sperberg-McQueen02.

[TEI P5] Text Encoding Initiative Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.0.0, last updated 13 February 2020. Available from: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html.

[Vitali 2016] Vitali, Fabio. The expressive power of digital formats: Criticizing the manicure of the wise man pointing at the moon. Workshop lecture presented at the DiXiT Convention. Vol. 2. 2016. Slides available at http://dixit.uni-koeln.de/wp-content/uploads/Vitali_Digital-formats.pdf.

[Witt et al 2007] Witt, Andreas, Oliver Schonefeld, Georg Rehm, Jonathan Khoo, and Kilian Evang. On the Lossless Transformation of Single-File, Multi-Layer Annotations in Multi-Rooted Trees. In Proceedings of Extreme Markup Languages 2007, Montréal, Canada, 2007. Available from: http://conferences.idealliance.org/extreme/html/2007/Witt01/EML2007Witt01.xml.

[Woolf 1927] Woolf, Virginia. 1927. To the Lighthouse. Holograph ms. Berg Collection, New York Public Library; Proofs, Smith College Libraries. Pamela L. Caughie, Nick Hayward, Mark Hussey, Peter Shillingsburg, and George K. Thiruvathukal, editors. Woolf Online.



[1] The authors express their gratitude to the reviewers for their extensive and insightful comments.

[2] Overlap is evidently a recurring favorite of the markup community and the past decades have witnessed a significant number of alternative markup languages and/or data models to represent overlapping structures. However, we argued elsewhere that there is much more to modeling complex texts than overlapping hierarchies alone (Haentjens Dekker and Birnbaum 2017). In the context of in-text revisions, Desmond Schmidt noted that solving the overlap problem does not necessarily solve the challenges of modeling textual variation: not all cases of textual variation are cases of overlapping hierarchies, and hence solutions to overlapping hierarchies cannot adequately represent textual variation (Schmidt and Colomb 2009, p. 499). Indeed, as the table in Figure 2 shows, data models developed to represent overlapping structures do not necessarily provide for expressing non-linear information.

[3] See the TEI Guidelines on the choice element.

[4] Note that the schema needs to be written in a language that supports unorderedness, such as XML Schema, which offers the xs:all construct for this purpose.
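As a minimal sketch (our own illustration, not taken from any project schema), an XML Schema declaration for a hypothetical subst element could use xs:all to require both a del and an add child while leaving their order free:

    <!-- the xs prefix is bound to the XML Schema namespace -->
    <xs:element name="subst">
      <xs:complexType>
        <xs:all>
          <xs:element name="del" type="xs:string"/>
          <xs:element name="add" type="xs:string"/>
        </xs:all>
      </xs:complexType>
    </xs:element>

Under this declaration, <subst><del>…</del><add>…</add></subst> and <subst><add>…</add><del>…</del></subst> are equally valid.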

[5] This is done, for example, in the Beckett Digital Manuscript Project (Van Hulle et al. (editors) 2019); see their editorial policy: https://www.beckettarchive.org/editorial.jsp.

[6] For some examples of discontinuity and overlap in literary text, see Bleeker et al. 2019.

[7] Note that while the extended Annotation Graphs model is a stand-off annotation model, its syntax LeAG is an inline markup syntax.

[8] The practice of splitting individual files into multiple layers is also promoted by Witt et al. 2007, whose approach transforms multi-layer annotations of linguistic corpora into sets of separate XML documents that share identical text but carry different annotations.
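Schematically (our own illustration, not an example taken from Witt et al. 2007), the same line might be stored twice, once per annotation layer, with identical character content:

    <!-- layer 1: document structure -->
    <line>To the Lighthouse</line>

    <!-- layer 2: linguistic structure; same text, different markup -->
    <s><w>To</w> <w>the</w> <w>Lighthouse</w></s>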

[9] An example of unordered information is metadata.

[10] Technically, an ASG is a Directed Acyclic Graph (DAG) and the TAG hypergraph is a rooted mixed property hypergraph, but conceptually they can be considered similar, so in this context we take the TAG hypergraph to be an implementation of the ASG as well.

[11] We are grateful to and acknowledge the Society of Authors as the literary representative of the Estate of Virginia Woolf. The Woolf material may not be used for commercial purposes. Please credit the copyright holder when reusing Woolf's work.

[12] Mapping the TEI semantics to TAGML is currently not an official part of our research, but we intend to work towards TAGML being an alternative expression of TEI.

[13] Because TAGML allows markup ranges to overlap, markup does not have to be closed in the exact reverse order in which it was opened, as is required in XML. This makes the TAGML grammar context-sensitive. The ANTLR4 grammar used in the TAGML library, however, is context-free, because ANTLR4 does not provide a way to encode context-sensitive grammars. The current parser that is generated from the grammar therefore cannot check whether every open tag (e.g., [tag>) is eventually followed by a corresponding close tag (<tag]). This and other validity checks are done in post-processing. We are currently examining how to build a context-sensitive parser that does not require post-processing.
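For instance, in the schematic TAGML fragment below (layer identifiers omitted for brevity), range a is closed before range b even though b was opened last:

    [a>brown [b>fox<a] jumps<b]

A context-free parser can recognize each open and close tag in isolation, but verifying that [a> is paired with <a] across such interleavings is precisely the check that has to be deferred to post-processing.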

[14] Bleeker 2017, pp. 106-114, elaborates on the MVD "Compare" technology and why it will not always produce the desired results for the study of textual variation.

[15] A number of projects make use of CollateX's option to pass along information about relevant markup elements through the collation pipeline, e.g., the Beckett Digital Manuscript Project (Van Hulle et al. (editors) 2019), David J. Birnbaum's critical edition of the Rus' Primary Chronicle (Ostrowski et al.; Birnbaum 2015), and the Frankenstein Variorum Project (Beshero-Bondar and Viglianti 2018).
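This option relies on CollateX's JSON input format, in which each token consists of its literal text ("t"), an optional normalized form ("n"), and any number of additional properties that are passed through unchanged to the collation output. A schematic witness, in which the "del" property is our own illustrative choice:

    { "witnesses": [
        { "id": "A",
          "tokens": [
            { "t": "her",  "n": "her"  },
            { "t": "eyes", "n": "eyes", "del": true }
          ] }
    ] }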

[16] See the project page of the DeltaXML Document comparator, last accessed August 20, 2020.

[17] Unfortunately, we have not been able to test JNDiff: it has not been updated since 2014, and it is unclear whether it is still maintained.

[18] Digital facsimiles of the page proofs are also available via the digital archive Woolf Online. The same copyright notice applies.

[19] In recognition of the crucial role TEI/XML plays in the text encoding community, we already provide a TAGML-to-XML export function: a TAGML file uploaded to the Alexandria database can be exported to XML with the alexandria export-xml [filename] -o [filename].xml command. Of course, the conversion from hypergraph to tree implies information loss. Overlapping hierarchies are represented in the XML output as Trojan Horse markup. For more information, see the Alexandria documentation (Alexandria).
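Trojan Horse markup splits each overlapping element into a pair of empty milestone elements whose start and end are tied together by matching identifier attributes. A simplified sketch (attribute names follow the common sID/eID convention; the exact serialization produced by Alexandria may differ):

    <line>fishing <del sID="d1"/>all the <add sID="a1"/>night<del eID="d1"/> long<add eID="a1"/></line>

Here the del and add ranges overlap, so both are flattened into milestone pairs that a downstream tool can reassemble into ranges.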

Author's keywords for this paper:
textual genetic research; automated collation; data model; nonlinearity; graph database; multiple hierarchies; overlap; hypergraph; simultaneity; TAG; text as graph; TAGML; HyperCollate