Implementing TEI Standoff Annotation in the browser

Hugh Cayless

© Hugh A. Cayless

Preliminary Proceedings

Balisage: The Markup Conference 2019
July 30 - August 2, 2019

Types of Standoff Markup in TEI[1]

The essential user story for standoff markup is: "I have a text that I want to add information to without modifying the source." This might involve changing the structure of the source, e.g. taking a text with page containers and representing it with div and paragraph containers. It might mean adding linguistic annotation to the words of the source. It might mean the addition of notes or commentary on arbitrary chunks of text. It might mean representing the results of a machine or human named entity recognition process. It might mean representing textual variation from the base text. It might mean the alignment of one text with another, e.g. a translation with its source. Various mechanisms exist for handling the connections involved in these processes: the TEI Guidelines have a set of global attributes for connecting elements; <note> has a target attribute; <span> (which represents a run of text) has target and from/to attributes; and <link> can associate any number of elements using its target attribute. The targets of these annotations may be referred to by their IDs, <anchor>s may be placed to mark targets, words or text segments may be wrapped in <w> or <seg> elements, and TEI Pointers may indicate arbitrary runs of text and/or elements. Most of these techniques involve "associative" markup: they simply attach additional information to some element or segment in the source text. Properly speaking, any annotation that occurs away from the thing it is annotating might be called standoff markup.

When the TEI Guidelines discuss standoff markup, however, they primarily refer to what we might term "restructuring" or "reconstructive" markup, wherein text from one place is imported into structured markup elsewhere. Section 16.9[2], for example, which is the main section on standoff markup, uses only XInclude in its discussion, despite noting that there are other ways to do it. It should be noted that the examples cited in section 16.7 (on <join>) do more or less the same thing in a purely TEI fashion, even though this is not described as a "standoff" technique in that section. This is essentially the method employed by (e.g.) CATMA[3] in its TEI export format to layer annotations onto a run of text.[4] CATMA uses <ptr> to refer to unmarked text and <seg> with an @ana attribute that refers to the annotation body (which is a TEI feature structure). In this paper, I will use a broad definition of "standoff" that includes any markup where information is added to a resource without directly modifying the resource itself.

Between the two poles of associative and restructuring standoff annotation, there lies a third, and it is both important and, in TEI, almost entirely neglected. We might call this type "assertive" annotation, because rather than simply attaching new information to the text, it makes a positive, actionable statement about the segment of text in question. The only obvious way to do this in TEI is with inline markup. A common example of this kind of annotation is the marking of named entities. In TEI, a name is marked with the <name> element, wrapping the thing named, with a @ref attribute which points to some record of the entity identified (e.g. an entry in a personography or gazetteer). There are a number of refinements of <name>, including <persName>, <orgName>, and <placeName> as well as the more general <rs> (referring string). In all of these cases, the element is making a statement about its content and possibly linking (via the ref attribute) to additional information about the referent of the name. Someone wishing to say the same sort of thing using standoff markup quickly runs into difficulty. A typical way to identify a name with a person in TEI might look like this:

Figure 1: Example 1

<person xml:id="JC">
  <persName>Gaius Iulius Caesar</persName>
  <idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
...
<p>Litteris <persName ref="#JC">C. Caesaris</persName> consulibus redditis aegre ab his impetratum est summa 
tribunorum plebis contentione, ut in senatu recitarentur;...</p>
    

The marked up section is saying "this string identifies a personal name and refers to the person identified in the person element with id 'JC'". To say the same thing with standoff markup is much trickier. We could, for example, do something like:

Figure 2: Example 2

<person xml:id="JC">
  <persName>Gaius Iulius Caesar</persName>
  <idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
...
<p xml:id="p1">Litteris C. Caesaris consulibus redditis aegre ab his impetratum est summa 
tribunorum plebis contentione, ut in senatu recitarentur;...</p>
...
<span from="#match('p1','C. Caesaris')"><ptr target="#JC"/></span>
    

In this example, the span element points to the name "C. Caesaris" in the text, and annotates it with a pointer to the person element that identifies Julius Caesar. The semantics of the TEI do not permit us to explicitly say "that string represents a personal name", however. We can only imply it by association. It should be noted that other annotation systems, like Web Annotation[5], for example, suffer from this same semantic drawback. In WA, associating a piece of information with a target is straightforward, but having the annotation make an assertion about the target involves (somewhat awkwardly) embedding RDF that makes the assertion into the body of the annotation. One solution to the problem of creating assertive standoff annotations in TEI would be to use restructuring markup to generate a new text that wrapped all of the named entities in their appropriate markup, but this is a heavyweight solution with some drawbacks. So how else could we implement assertive annotation in TEI using standoff markup?

Variant Standoff Annotation

This question has both theoretical and practical implications. We will need both to determine what markup structures might be used and to find a way of creating the annotations themselves. Fortunately, an online annotation system capable of working with TEI already exists. Recogito[6] is an annotation tool developed by Pelagios Commons which provides for the machine-assisted annotation of texts, including TEI texts, with the names of persons, places, and events. These may be exported in a variety of formats, including TEI. The TEI export involves inserting (e.g.) <persName> tags into the existing markup, but the export mechanism understandably has trouble with overlapping markup (e.g. if part of a name already contains markup). The obvious fix for this is to export it as standoff. But it would have to be in a relatively standard format and it would have to be possible to do something useful with the output. One possible "something useful" that suggests itself is to create a browser-based view of the document plus annotation using CETEIcean[7], a prospect made even more inviting by the fact that Recogito uses CETEIcean to render TEI documents for annotation. All sorts of visualizations are possible, including turning the identified names into links, adding mouseover animations for them, index generation, and so on.

But the sticking point here is the "relatively standard format" part. Is there an existing TEI mechanism for applying markup to fragments of a source text, without restructuring it? In fact there is; it just isn't typically used in quite that way. The elements in the Critical Apparatus module are designed to record textual variance, that is, cases where witnesses to a text differ. These differences may involve markup as well as (or even instead of) text. But would it be appropriate to use a feature originally designed to support the recording of textual variance in an edition to handle interpretive annotations on a text instead? This may not be as much of a stretch as it appears: it is common practice to use the critical apparatus to record editorial observations and emendations as well as readings found in versions of the text. It does not seem unreasonable to treat something like the identification of "C. Caesaris" as a personal name as a type of editorial emendation, even though it involves variant markup rather than variant text.

Usefully for us, a critical apparatus may appear in either an inline or a standoff position. A standoff apparatus attaches to the base text using @from and @to attributes, which can indicate the location of the start and end of the varying text (if a TEI Pointer expressing a range is used, or a single element contains the whole variant, only the @from attribute is needed). Given a text like

Figure 3: Example 3

        <div type="textpart" subtype="chapter" n="1" xml:id="c1">
            <p type="textpart" subtype="section" n="1" xml:id="c1s1">
                <seg n="1" xml:id="c1s1p1">Gallia est omnis divisa in partes tres, quarum unam 
                    incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli 
                    appellantur.</seg>
               ...
            </p>
        </div>
      
We might do inline identification of the place named in the first word thus:

Figure 4: Example 4

        <div type="textpart" subtype="chapter" n="1" xml:id="c1">
            <p type="textpart" subtype="section" n="1" xml:id="c1s1">
                <seg n="1" xml:id="c1s1p1"><placeName   
                    ref="https://pleiades.stoa.org/places/993">Gallia</placeName> est omnis divisa 
                    in partes tres, quarum unam incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum 
                    lingua Celtae, nostra Galli appellantur.</seg>
               ...
            </p>
        </div>
      
But an annotation system like Recogito could use a standoff apparatus to propose this change:

Figure 5: Example 5

        <standoff>
          <listApp>
            <app from="#string-index(c1s1p1,0)" to="#string-index(c1s1p1,6)">
              <rdg><placeName ref="https://pleiades.stoa.org/places/993">Gallia</placeName></rdg>
            </app>
          </listApp>
        </standoff>
      
without needing to alter the source text.[8]

Such an annotation structure could either be embedded in the original or delivered as a separate document.
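To make the mechanics concrete, the following is a minimal sketch in Python of how a consumer might apply the apparatus entry from Example 5 to the source. It is an illustration under stated assumptions, not the implementation described in this paper: the helper names parse_pointer and apply_app are hypothetical, the TEI namespace is omitted for brevity, and only the simplest case is handled, where both pointers target the same element and that element contains nothing but text.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Assumption: only pointers of the form #string-index(ID,OFFSET) are supported.
POINTER = re.compile(r"^#string-index\(([^,]+),\s*(\d+)\)$")

SOURCE = '<seg xml:id="c1s1p1">Gallia est omnis divisa in partes tres</seg>'
STANDOFF = """<listApp>
  <app from="#string-index(c1s1p1,0)" to="#string-index(c1s1p1,6)">
    <rdg><placeName ref="https://pleiades.stoa.org/places/993">Gallia</placeName></rdg>
  </app>
</listApp>"""


def parse_pointer(pointer):
    """Split '#string-index(ID,OFFSET)' into (ID, OFFSET)."""
    match = POINTER.match(pointer)
    if match is None:
        raise ValueError(f"unsupported pointer: {pointer}")
    return match.group(1), int(match.group(2))


def apply_app(source_root, app):
    """Wrap the range named by app/@from and app/@to in the reading's element."""
    start_id, start = parse_pointer(app.get("from"))
    end_id, end = parse_pointer(app.get("to"))
    if start_id != end_id:
        raise ValueError("sketch only handles ranges inside one element")
    target = next(el for el in source_root.iter() if el.get(XML_ID) == start_id)
    if len(target) > 0:
        raise ValueError("sketch only handles text-only targets")
    text = target.text or ""
    proposed = app.find("rdg")[0]  # e.g. the <placeName> inside <rdg>
    # Guard against drift: the reading must match the text it claims to annotate.
    if "".join(proposed.itertext()) != text[start:end]:
        raise ValueError("reading does not match the source text")
    wrapper = ET.Element(proposed.tag, dict(proposed.attrib))
    wrapper.text = text[start:end]
    wrapper.tail = text[end:]
    target.text = text[:start]
    target.append(wrapper)
    return source_root


root = ET.fromstring(SOURCE)
for app in ET.fromstring(STANDOFF).iter("app"):
    apply_app(root, app)
print(ET.tostring(root, encoding="unicode"))
```

The check that the reading's text matches the addressed range is a design choice of this sketch: it lets a consumer detect an annotation whose offsets no longer line up with the source.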

Given such a document, what could we do with it? And what problems will need to be overcome in order to use it? CETEIcean is already used as the basis for critical editions that use an inline apparatus to model textual variation. Some of this work could be repurposed to visualize annotations with a standoff apparatus. Recogito internally marks the location of the start and end of an annotated string, using a generated XPath to register the nearest parent element of the start and end points, and then indexes into the string containing the start or end. This method is isomorphic to the string-index() TEI Pointer[9] used in the example above (though the example references an @xml:id instead of an XPath). Resolving TEI Pointers is a somewhat complex task, but is solvable. The author wrote an implementation in JavaScript in 2013, which will be updated for demonstration. An XSLT 3.0 implementation would be possible, given the introduction of <xsl:evaluate> for the dynamic evaluation of XPath expressions, though unfortunately that functionality is unavailable in the free version of the reference implementation, Saxon, which makes the broad dissemination of the technique problematic.
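The correspondence between an element-plus-offset anchor and a string-index() pointer can be sketched as a simple round trip. The Anchor type and the function names below are hypothetical, chosen for illustration; they do not reproduce Recogito's actual internal format, and the sample XPath is likewise invented.

```python
import re
from typing import NamedTuple


class Anchor(NamedTuple):
    """A point in the text: a locator for the containing element plus an offset."""
    locator: str  # an xml:id or an XPath (assumed representation)
    offset: int   # characters from the start of the element's text


# The locator is matched greedily so that commas inside an XPath are kept;
# the offset is whatever digits follow the last comma before the final ')'.
POINTER = re.compile(r"^#?string-index\((?P<locator>.+),\s*(?P<offset>\d+)\)$")


def to_pointer(anchor):
    """Serialize an anchor as a string-index() pointer."""
    return f"#string-index({anchor.locator},{anchor.offset})"


def from_pointer(pointer):
    """Parse a string-index() pointer back into an anchor."""
    match = POINTER.match(pointer.strip())
    if match is None:
        raise ValueError(f"not a string-index pointer: {pointer}")
    return Anchor(match.group("locator"), int(match.group("offset")))


start = Anchor("/TEI/text/body/div[1]/p[1]/seg[1]", 0)
print(to_pointer(start))
print(from_pointer("#string-index(c1s1p1,6)"))
```

Nothing is lost in either direction, which is all that "isomorphic" requires here; actually resolving the locator against a document is a separate step.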

Of course, it's not quite that simple. Annotations may run into overlap issues, and indeed one prime motivation for creating standoff assertive annotations for people, places, etc. instead of putting them inline would be to avoid messy, complex markup with multiple concerns. In such cases, the annotation will have to be split across the multiple text nodes in the source. Moreover, CETEIcean renderings of the source may insert HTML elements or text for display purposes. These issues will have to be worked around or ignored during annotation resolution, and additional metadata about the annotation target may be required in order for this to be possible. Annotations may also conflict with one another, a problem which the critical apparatus structures are well prepared to handle, since textual variations too may often occur in irreconcilable ways. There is much work to be done, but the combination of these technologies may make it possible to render standoff annotations on XML sources in a regular web browser and thus enable the development of collaborative, information-rich, sustainable digital editions.
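The splitting step can be sketched as follows, assuming the text nodes of the target element have already been collected in document order. The function name split_range and the sample markup (a hypothetical <hi> around "C.") are illustrative only and are not taken from CETEIcean or Recogito.

```python
def split_range(text_nodes, start, end):
    """Map [start, end) in the concatenated text to per-node slices.

    Returns a list of (node_index, local_start, local_end) tuples, one for
    each text node that the range touches.
    """
    if not 0 <= start <= end:
        raise ValueError("invalid range")
    pieces = []
    node_start = 0
    for index, text in enumerate(text_nodes):
        node_end = node_start + len(text)
        lo = max(start, node_start)
        hi = min(end, node_end)
        if lo < hi:  # the range covers part of this node
            pieces.append((index, lo - node_start, hi - node_start))
        node_start = node_end
    if end > node_start:
        raise ValueError("range runs past the end of the text")
    return pieces


# Text nodes of: <p>Litteris <hi>C.</hi> Caesaris consulibus</p>
nodes = ["Litteris ", "C.", " Caesaris consulibus"]
pieces = split_range(nodes, 9, 20)  # the offsets of "C. Caesaris"
print(pieces)
print("".join(nodes[i][a:b] for i, a, b in pieces))
```

Each returned slice could then be wrapped separately. Text inserted purely for display would presumably have to be left out of text_nodes before the offsets are computed, so that they still agree with the source.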



[1] The idea for this paper presented itself right before the submission deadline, and as a result, it will require a good deal more fleshing out before presentation, but I think the skeleton is workable.

[3] Computer Assisted Text Markup and Analysis (https://catma.de/).

[5] https://www.w3.org/TR/annotation-model/

[7] https://github.com/TEIC/CETEIcean. See Cayless, Hugh, and Raffaele Viglianti. “CETEIcean: TEI in the Browser.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Cayless01.

[8] Example 5 uses a listApp wrapped in a proposed standoff element, which will be implemented in a future release of the TEI Guidelines.

[9] See TEI Guidelines 16.2.4 and Hugh A. Cayless, "Rebooting TEI Pointers", Journal of the Text Encoding Initiative [Online], Issue 6 | December 2013 http://journals.openedition.org/jtei/907. doi:https://doi.org/10.4000/jtei.907.