How to cite this paper

Cayless, Hugh. “Implementing TEI Standoff Annotation in the browser.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Cayless01.

Balisage: The Markup Conference 2019
July 30 - August 2, 2019

Balisage Paper: Implementing TEI Standoff Annotation in the browser

Hugh Cayless

Hugh is a Senior Digital Humanities Developer at the Duke Collaboratory for Classics Computing (DC3).

© Hugh A. Cayless

Abstract

Proposes a method for encoding and visualizing arbitrary annotated segments of TEI documents.

Table of Contents

Types of Standoff Markup in TEI
Assertive Standoff Annotation
Postscript

Types of Standoff Markup in TEI[1]

The essential user story for standoff markup is: “I have a text that I want to add information to without modifying the source.” This might involve changing the structure of the source, e.g. taking a text with page containers and representing it with div and paragraph containers. It might mean adding linguistic annotation to the words of the source. It might mean the addition of notes or commentary on arbitrary chunks of text. It might mean representing the results of a machine or human named entity recognition process. It might mean noting when and where witnesses diverge from the base text. It might mean the alignment of one text with another, e.g. a translation with its source. Various mechanisms exist for handling the connections involved in these processes: the TEI Guidelines have a set of global attributes for connecting elements[2], <note> has a target attribute, <span> (which associates an interpretive annotation with a run of text) has target and from/to attributes, and <link> can associate any number of elements using its target attribute. The targets of these annotations may be referred to by their IDs, <anchor>s may be placed to mark targets, words or text segments may be wrapped in <w> or <seg> elements, and TEI Pointers may indicate arbitrary runs of text and/or elements. Most of these techniques involve what we might term “associative” markup: they attach additional information to some element or segment in the source text.

Properly speaking, any annotation that occurs away from the thing it is annotating might be called standoff markup. When the TEI Guidelines discuss standoff markup, however, they primarily refer to what we might term “restructuring” or “reconstructive” markup, wherein text from one place is imported into structured markup elsewhere. In section 16.9[3], for example, which is the main section on standoff markup, all of the examples use XInclude to pull text and/or markup from elsewhere into a new structure.

Figure 1: Example from section 16.9.3, “Stand-off Markup in TEI”

Source document:
<body>
 <p xml:id="par1">home, <emph>home</emph> on Brokeback Mountain.</p>
 <p xml:id="par2">That was the <emph>song</emph> that I sang</p>
</body>
Restructuring document:
<body>
  <div><include href="example1.xml" xmlns="http://www.w3.org/2001/XInclude"
  xpointer="range(xpath(id('par1')//emph),xpath(id('par2')//emph))"/>
  </div>
</body>  
Result document:
<body>
  <div>
    <p xml:id="par1">home, <emph>home</emph> on Brokeback Mountain.</p>
    <p xml:id="par2">That was the <emph>song</emph> that I sang</p>
  </div>
</body>
        
Given a source document, another document may pull chunks out of it and present them embedded in a new structure. It should be noted that the examples cited in section 16.7 (on <join>) do more or less the same thing in a purely TEI fashion, even though this is not described as a “standoff” technique in that section. This is essentially the method employed by (e.g.) CATMA[4] in its TEI export format to layer annotations onto a run of text.[5] CATMA uses <ptr> to refer to unmarked text and <seg> with an @ana attribute that refers to the annotation body (which is a TEI feature structure). Restructuring markup may be very useful in cases where overlap would otherwise be an issue. A “diplomatic” model of a codex might mark the text as contained by pages, for example, but there might well be a need for a parallel version where the text is structured into chapters and paragraphs.

Even though the TEI Guidelines consider restructuring markup to be the main form of “standoff” markup, for the purposes of this paper I will use a broader definition of “standoff” that includes any markup where information is added to a resource without directly modifying the resource itself. Associative markup need not be used in a standoff fashion, to be sure: <note>s may occur either inline or standoff, for example.

Figure 2: Notes

<p>Some text<note>with an inline note</note>.</p>

<p xml:id="id">Some text.</p>

... (elsewhere)
<note target="#id">with a standoff note</note>

<p>Some text.<ptr target="#id"/></p>

... (elsewhere)

<note xml:id="id">with a referenced note</note>
        

Between the two poles of associative and restructuring standoff annotation, there lies a third, and it is both important and, in TEI, almost entirely neglected. We might call this type “assertive” annotation, because rather than simply attaching new information to the text, it makes a positive, actionable statement about the segment of text in question. The only obvious way to do this in TEI is with inline markup. A common example of this kind of annotation is the marking of named entities. In TEI, a name is marked with the <name> element, wrapping the thing named, with a @ref attribute which points to some record of the entity identified (e.g. an entry in a personography or gazetteer). There are a number of refinements of <name>, including <persName>, <orgName>, and <placeName> as well as the more general <rs> (referring string). In all of these cases, the element is making a statement about its content and possibly linking (via the ref attribute) to additional information about the referent of the name. Someone wishing to say the same sort of thing using standoff markup quickly runs into difficulty. A typical way to identify a name with a person in TEI might look like this:

Figure 3: Identifying a personal name

<person xml:id="JC">
  <persName>Gaius Iulius Caesar</persName>
  <idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
...
<p>Litteris <persName ref="#JC">C. Caesaris</persName> consulibus redditis aegre ab his impetratum est summa 
tribunorum plebis contentione, ut in senatu recitarentur;...</p>
    

The marked up section is saying “this string identifies a personal name and refers to the person identified in the person element with id ‘JC’”. To say the same thing with standoff markup is much trickier. We could, for example, do something like:

Figure 4: Associating a person with a span of text

<person xml:id="JC">
  <persName>Gaius Iulius Caesar</persName>
  <idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
...
<p xml:id="p1">Litteris C. Caesaris consulibus redditis aegre ab his impetratum est summa 
tribunorum plebis contentione, ut in senatu recitarentur;...</p>
...
<span from="#match('p1','C. Caesaris')"><ptr target="#JC"/></span>
    

In this example, the span element points to the name “C. Caesaris” in the text, and annotates it with a pointer to the person element that identifies Julius Caesar. The semantics of the TEI do not permit us to explicitly say “that string represents a personal name”, however. We can only imply it by association. It should be noted that other annotation systems, like Web Annotation[6], for example, suffer from this same semantic drawback. In WA, associating a piece of information with a target is straightforward, but having the annotation make an assertion about the target involves (somewhat awkwardly) embedding RDF that makes the assertion into the body of the annotation. One solution to the problem of creating assertive standoff annotations in TEI would be to use restructuring markup to generate a new text that wrapped all of the named entities in their appropriate markup, but this is a heavyweight solution with some drawbacks. So how else could we implement assertive annotation in TEI using standoff markup?
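To make the "association only" point concrete, here is a minimal Python sketch that resolves a span like the one in Figure 4. It assumes un-namespaced elements, a hypothetical <doc> wrapper around the fragments, and a pointer of exactly the shape shown in the figure; the helper names (by_id, resolve_span) are illustrative and not part of any TEI tool.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Only the pointer shape shown in Figure 4 is recognised (an assumption of this sketch).
MATCH_POINTER = re.compile(r"#match\('([^']+)','([^']+)'\)")

# <doc> is a hypothetical wrapper so the fragments form one well-formed tree.
DOC = """<doc>
<person xml:id="JC">
  <persName>Gaius Iulius Caesar</persName>
  <idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
<p xml:id="p1">Litteris C. Caesaris consulibus redditis aegre ab his impetratum est</p>
<span from="#match('p1','C. Caesaris')"><ptr target="#JC"/></span>
</doc>"""


def by_id(root, identifier):
    """Find the element whose xml:id equals identifier."""
    return next(el for el in root.iter() if el.get(XML_ID) == identifier)


def resolve_span(root, span):
    """Return (matched text, URL of the associated person) for a <span> as in Figure 4."""
    parsed = MATCH_POINTER.fullmatch(span.get("from"))
    if parsed is None:
        raise ValueError("unsupported pointer")
    target_id, pattern = parsed.groups()
    # Search the flattened text content of the target element.
    text = "".join(by_id(root, target_id).itertext())
    found = re.search(pattern, text)
    if found is None:
        raise LookupError("pattern not found in target")
    # Follow the <ptr> to the person record and read its identifier.
    person = by_id(root, span.find("ptr").get("target").lstrip("#"))
    return found.group(0), person.findtext("idno")


root = ET.fromstring(DOC)
matched, url = resolve_span(root, root.find("span"))
```

The result pairs a string with a record, but nothing in the resolved data says that the string is a personal name; a consumer has to infer that from the fact that the pointer leads to a person element.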

Assertive Standoff Annotation

The question has both theoretical and practical implications. We will need to determine both what markup structures might be used and find a solution for creating the annotations themselves. Fortunately, an online annotation system capable of working with TEI already exists. Recogito[7] is an annotation tool developed by Pelagios Commons which provides for the machine-assisted annotation of texts, including TEI texts, with the names of persons, places, and events. These may be exported in a variety of formats, including TEI. The TEI export involves inserting inline assertive markup (e.g. <persName> tags) into the existing document, and the export mechanism understandably has trouble with overlapping markup (e.g. if part of a name already contains markup or the name overlaps another structure). The obvious fix for this is to export it as standoff. But it would have to be in a standard format that was still TEI, and it would have to be possible to do something useful with the output. One possible “something useful” that suggests itself is to create a browser-based view of the document plus annotation using CETEIcean[8], a prospect made even more inviting by the fact that Recogito uses CETEIcean to render TEI documents for annotation. All sorts of visualizations are possible, including turning the identified names into links, adding mouseover animations for them, index generation, and so on.

But the sticking point here is the “standard format” part. What we need is a TEI mechanism for applying markup to fragments of a source text without restructuring it: in other words, a way to do assertive standoff annotation. Such a construct does in fact exist, but it is not used in precisely this way. The elements in the Critical Apparatus module[9] are designed to model textual variance, that is, cases where the witnesses to a text differ and the editor wishes to record the alternate possibilities.

Figure 5: An example apparatus entry

<p n="1" xml:id="p1">
  <seg n="1" xml:id="seg-1.1">Bello Alexandrino
    conflato Caesar <app>
      <lem>Rhodo</lem>
      <rdg wit="#S" ana="#orthographical">Ordo</rdg>
    </app> atque ex Syria Ciliciaque omnem classem
    arcessit; ...</seg>
...
</p>
        
Here, the base text has “Rhodo” (Rhodes) and the editor wishes readers to know that a witness, S, has “Ordo” instead. It might at first seem outlandish to suggest using such a specialized type of markup to record annotations, but critical apparatus markup has several advantages that make it attractive. It can take either an inline or standoff form, it is designed explicitly for making assertive annotations and recording their provenance, it can accommodate differences in markup as well as text, it can cope with overlap, and it even has mechanisms for recording dependencies or conflicts between readings. A transposition in a variant, for example, requires that the base reading exclude the variant, and vice versa. Person or place identifications might have the same sorts of requirements, one identification implying—or ruling out—another. It is already common practice to note the suggestions of previous editors in the apparatus, so suggested emendations to the markup, such as the addition of <persName> tags around a name, would not be quite such a stretch as it might at first seem. It does not seem unreasonable to treat something like the identification of “C. Caesaris” as a personal name as a type of editorial emendation, even though it involves variant markup rather than variant text.

Usefully for us, a critical apparatus may appear in either an inline or a standoff position. A standoff apparatus attaches to the base text using @from and @to attributes, which indicate the start and end of the varying text (if a TEI Pointer expressing a range is used, or a single element contains the whole variant, only the @from attribute is needed). Given a text like

Figure 6: Base text

        <div type="textpart" subtype="chapter" n="1" xml:id="c1">
            <p type="textpart" subtype="section" n="1" xml:id="c1s1">
                <seg n="1" xml:id="c1s1p1">Gallia est omnis divisa in partes tres, quarum unam 
                    incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli 
                    appellantur.</seg>
               ...
            </p>
        </div>
      
We might do inline identification of the place named in the first word thus:

Figure 7: Base text with inline place name identification

        <div type="textpart" subtype="chapter" n="1" xml:id="c1">
            <p type="textpart" subtype="section" n="1" xml:id="c1s1">
                <seg n="1" xml:id="c1s1p1"><placeName   
                    ref="https://pleiades.stoa.org/places/993">Gallia</placeName> est omnis divisa 
                    in partes tres, quarum unam incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum 
                    lingua Celtae, nostra Galli appellantur.</seg>
               ...
            </p>
        </div>
      
But an annotation system like Recogito could use a standoff apparatus to propose this change instead:

Figure 8: A standoff place name identification

        <standOff>
          <listApp>
            <app from="#string-index(//seg[@xml:id='c1s1p1'],0)" to="#string-index(//seg[@xml:id='c1s1p1'],6)">
              <rdg><placeName ref="https://pleiades.stoa.org/places/993" source="#Damon">Gallia</placeName></rdg>
            </app>
          </listApp>
        </standOff>
      
without needing to alter the source text. Usefully, the semantics of the latter are explicit: “Damon says this piece of the text should be read as a place name.”[10]

Such an annotation structure could either be embedded in the original or delivered as a separate document.
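As a rough illustration of how an annotation tool might emit an entry like the one in Figure 8, the following Python sketch builds an <app> from two anchors, each consisting of an XPath to the containing element plus a character offset. The Anchor class and build_app function are hypothetical names introduced here, the elements are left un-namespaced for brevity, and the reading is emitted as an empty element rather than repeating the base text.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Anchor:
    """Hypothetical anchor: XPath to the containing element plus a character offset."""
    xpath: str
    offset: int

    def to_pointer(self):
        # Serialise as a string-index() pointer of the shape used in Figure 8.
        return f"#string-index({self.xpath},{self.offset})"


def build_app(start, end, tag, ref, source):
    """Build a standoff <app> proposing that the anchored range be read as <tag>."""
    app = ET.Element("app", {"from": start.to_pointer(), "to": end.to_pointer()})
    rdg = ET.SubElement(app, "rdg")
    ET.SubElement(rdg, tag, {"ref": ref, "source": source})
    return app


seg = "//seg[@xml:id='c1s1p1']"
app = build_app(
    Anchor(seg, 0),
    Anchor(seg, 6),
    "placeName",
    "https://pleiades.stoa.org/places/993",
    "#Damon",
)

# Wrap the entry in the containers used in Figure 8.
stand_off = ET.Element("standOff")
ET.SubElement(stand_off, "listApp").append(app)
serialized = ET.tostring(stand_off, encoding="unicode")
```

Because the entry carries only pointers and the proposed markup, the same serialized fragment could be embedded in the source document or shipped as a separate file.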

Given such a document, what could we do with it? And what problems will need to be overcome in order to use it? CETEIcean is already used as the basis for critical editions that use an inline apparatus to model textual variation. Experimentally, at least, resolving a set of standoff assertive annotations and applying them to the text as links is straightforward. Recogito internally marks the location of the start and end of an annotated string, using a generated XPath to register the nearest parent element(s) of the start and end points, and then indexes into the string containing the start and end. This method is isomorphic to the string-index() TEI Pointers[11] used in the example above.
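The resolution step can be sketched in a few lines. The Python below is a minimal illustration, not CETEIcean's or Recogito's actual code: it assumes un-namespaced elements, a pointer of exactly the shape used in Figure 8, a range that starts and ends in the same element, and a target that contains only text. The names parse_pointer and apply_app are hypothetical.

```python
import copy
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Matches only the pointer shape used in Figure 8 (an assumption of this sketch):
#   #string-index(//NAME[@xml:id='ID'],OFFSET)
POINTER = re.compile(r"#string-index\(//\w+\[@xml:id='([^']+)'\],(\d+)\)")


def parse_pointer(pointer):
    """Return (xml:id, character offset) for a string-index() pointer."""
    parsed = POINTER.fullmatch(pointer)
    if parsed is None:
        raise ValueError(f"unsupported pointer: {pointer}")
    return parsed.group(1), int(parsed.group(2))


def apply_app(base_root, app):
    """Wrap the range addressed by @from and @to in a copy of the <rdg> child element."""
    start_id, start = parse_pointer(app.get("from"))
    end_id, end = parse_pointer(app.get("to"))
    if start_id != end_id:
        raise ValueError("this sketch handles ranges inside a single element only")
    target = next(el for el in base_root.iter() if el.get(XML_ID) == start_id)
    if len(target):
        raise ValueError("this sketch handles text-only targets only")
    text = target.text or ""
    wrapper = copy.deepcopy(app.find("rdg")[0])  # e.g. the <placeName> element
    wrapper.text = text[start:end]   # the annotated string, taken from the base text
    wrapper.tail = text[end:]        # whatever follows the annotated string
    target.text = text[:start]       # whatever precedes it
    target.append(wrapper)
    return wrapper


BASE = """<p xml:id="c1s1"><seg xml:id="c1s1p1">Gallia est omnis divisa in partes tres</seg></p>"""
STANDOFF = """<standOff><listApp>
<app from="#string-index(//seg[@xml:id='c1s1p1'],0)" to="#string-index(//seg[@xml:id='c1s1p1'],6)">
<rdg><placeName ref="https://pleiades.stoa.org/places/993" source="#Damon">Gallia</placeName></rdg>
</app></listApp></standOff>"""

base = ET.fromstring(BASE)
for app in ET.fromstring(STANDOFF).iter("app"):
    wrapped = apply_app(base, app)
result = ET.tostring(base, encoding="unicode")
```

Targets that already contain child elements, or ranges that cross element boundaries, would need a walk over text nodes rather than a single string slice; that is where overlap handling becomes the real work.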

The difficult piece of the puzzle, and the one which remains unresolved at the time of writing (August 2019), is the shape of the TEI export. A proposal for a new <standOff> container for annotations pointing into the text is being debated by the TEI community. Some form of this will likely be adopted for a future release, and may then serve as a place to put standoff assertive annotations of the kind mooted above. Whether or not the existing critical apparatus markup is deemed suitable for such annotations is an open question. It is to be hoped that if it is not, some equivalent structure can be developed. The technological pieces of the puzzle are all in place, so we can hope that the standards development component will catch up before too long.

Postscript

As part of the TEI Council’s Fall face-to-face meeting, we convened several stakeholders from the community to try to work out a structure for standoff markup. The meeting was held on September 16th, 2019 in Graz, Austria. The following decisions were agreed upon:

  1. The <TEI> element will be allowed to nest, so that one TEI document may be embedded directly inside another.

  2. A new <standOff> element will be created, which will be of the model.resourceLike class, meaning that it can appear directly inside the <TEI> element, alongside the <teiHeader>, <text>, etc.

  3. <standOff> will contain most list-like elements, including <listPerson>, <listPlace>, and <listOrg>. <listApp> will also be available, though whether it will be used for assertive annotations of the type outlined here, or whether a new, parallel structure will be created, is an open question.

  4. <standOff> will also contain a new <listAnnotation> element, which will contain <annotationBlock> (used for linguistic annotation), and/or a new <annotation> element.

  5. The precise content model of the new <annotation> element is still to be determined. The plan is to model it after the Web Annotation data model[12], with some TEI-specific modifications.

Items 1–4 will be implemented right away, with a plan to include them in the next release (probably in Spring 2020), and discussions on item 5 will proceed in parallel.



[1] The idea for this paper presented itself right before the submission deadline, and as a result, it will require a good deal more fleshing out before presentation, but I think the skeleton is workable.

[4] Computer Assisted Text Markup and Analysis (https://catma.de/).

[6] https://www.w3.org/TR/annotation-model/

[8] https://github.com/TEIC/CETEIcean. See Cayless, Hugh, and Raffaele Viglianti. “CETEIcean: TEI in the Browser.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). https://doi.org/10.4242/BalisageVol21.Cayless01.

[10] This example uses a listApp wrapped in a proposed standoff element, which will be implemented in a future release of the TEI Guidelines.

[11] See TEI Guidelines 16.2.4 and Hugh A. Cayless, “Rebooting TEI Pointers”, Journal of the Text Encoding Initiative [Online], Issue 6 | December 2013 http://journals.openedition.org/jtei/907.

[12] https://www.w3.org/TR/annotation-model/