Implementing TEI Standoff Annotation in the browser
© Hugh A. Cayless
The essential user story for standoff markup is: "I have a text that I want to add
information to without modifying the source." This might involve changing the structure of the
source, e.g. taking a text with page containers and representing it with div and paragraph
containers. It might mean adding linguistic annotation to the words of the source. It might
mean the addition of notes or commentary on arbitrary chunks of text. It might mean
representing the results of a machine or human named entity recognition process. It might mean
representing textual variation from the base text. It might mean the alignment of one text
with another, e.g. a translation with its source. Various mechanisms exist for handling the
connections involved in these processes: The TEI Guidelines have a set of global attributes
for connecting elements;
<note> has a target attribute;
<span> (which represents a run of text) has target and from/to attributes; and
<link> can associate any number of elements using its target attribute. The targets of these
elements may be referred to by their IDs,
<anchor>s may be placed to mark targets,
words or text segments may be wrapped in
<w> or <seg> elements, and TEI Pointers may indicate arbitrary runs of text and/or elements. Most of these
techniques involve "associative" markup: they simply attach additional information to an
element or segment in the source text. Properly speaking, any annotation that occurs outside
the thing it is annotating might be called standoff markup.
When the TEI Guidelines discuss standoff markup, however, they primarily refer to what we
might term "restructuring" or "reconstructive" markup, wherein text from one place is included
into structured markup elsewhere. Section 16.9, for example, which is the main section on standoff markup, uses only XInclude in
its discussion, despite noting that there are other ways to do it. It should be noted that the
examples cited in section 16.7 (on
<join>) do more or less the same thing in a
purely TEI fashion, even though this is not described as a "standoff" technique in that
section. This is essentially the method employed by (e.g.) CATMA in its TEI export format to layer annotations onto a run of text. CATMA uses
<ptr> to refer to unmarked text and
<seg> with an
@ana attribute that refers to the annotation body
(which is a TEI feature structure). In this paper, I will use a broad definition of standoff
that includes any markup where information is added to a resource without directly modifying
the resource itself.
Between the two poles of associative and restructuring standoff annotation, there is a
third, and it is both important and, in TEI, almost entirely neglected. We might call this
type "assertive" annotation, because rather than simply attaching new information
to the text,
it makes a positive, actionable statement about the segment of text in question. The most
obvious way to do this in TEI is with inline markup. A common example of this kind of
annotation is the marking of named entities. In TEI, a name is marked with the
<name> element, wrapping the thing named, with a @ref attribute which points
to some record of the entity identified (e.g. an entry in a personography or gazetteer). There
are a number of refinements of
<name>, such as <persName> and
<placeName>, as well as the more general
<rs> (referring string). In all of these cases, the element is making a
statement about its content and possibly linking (via the ref attribute) to additional
information about the referent of the name. Someone wishing to say the same sort of thing
using standoff markup quickly runs into difficulty. A typical way to identify the name of a
person in TEI might look like this:
Figure 1: Example 1
<person xml:id="JC">
  <persName>Gaius Iulius Caesar</persName>
  <idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
...
<p>Litteris <persName ref="#JC">C. Caesaris</persName> consulibus redditis aegre ab his impetratum est summa tribunorum plebis contentione, ut in senatu recitarentur;...</p>
The marked up section is saying "this string identifies a personal name and refers to the person identified in the person element with id 'JC'". To say the same thing with standoff markup is much trickier. We could, for example, do something like:
Figure 2: Example 2
<person xml:id="JC">
  <persName>Gaius Iulius Caesar</persName>
  <idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
...
<p xml:id="p1">Litteris C. Caesaris consulibus redditis aegre ab his impetratum est summa tribunorum plebis contentione, ut in senatu recitarentur;...</p>
...
<span from="#match('p1','C. Caesaris')"><ptr target="#JC"/></span>
In this example, the span element points to the name "C. Caesaris" in the text, and annotates it with a pointer to the person element that identifies Julius Caesar. The semantics of the TEI do not permit us to explicitly say "that string represents a personal name", however. We can only imply it by association. It should be noted that other annotation systems, such as Web Annotation, suffer from this same semantic drawback. In Web Annotation, associating a piece of information with a target is straightforward, but having the annotation make an assertion about the target involves (somewhat awkwardly) embedding RDF that makes the assertion into the body of the annotation. One solution to the problem of creating assertive standoff annotations in TEI would be to use restructuring markup to generate a new text that wraps all of the named entities in their appropriate markup, but this is a heavyweight solution with some drawbacks. So how else could we implement assertive annotation in TEI using standoff markup?
This question has both theoretical and practical implications. We will need to
determine what markup structures might be used and find a way of creating the
annotations themselves. Fortunately, an online annotation system capable of working with TEI
already exists. Recogito is an annotation tool developed by Pelagios Commons which provides for the
machine-assisted annotation of texts, including TEI texts, with the names of persons, places,
and events. These may be exported in a variety of formats, including TEI. The TEI export
involves inserting (e.g.)
<persName> tags into the existing markup, but the
export mechanism understandably has trouble with overlapping markup (e.g. if part
of a name
already contains markup). The obvious fix for this is to export the annotations as standoff. But the export would
have to be in a relatively standard format and it would have to be possible to do something
useful with the output. One possible "something useful" that suggests itself is to produce a
browser-based view of the document plus annotation using CETEIcean, a prospect made even more inviting by the fact that Recogito uses CETEIcean to
render TEI documents for annotation. All sorts of visualizations are possible, including
turning the identified names into links, adding mouseover animations for them, index
generation, and so on.
But the sticking point here is the "relatively standard format" part. Is there an existing TEI mechanism for applying markup to fragments of a source text without restructuring it? In fact there is; it just isn't typically used in quite that way. The elements in the Critical Apparatus module are designed to record textual variance, that is, cases where witnesses to a text differ. These differences may involve markup as well as (or even instead of) text. But would it be appropriate to use a feature originally designed to support the recording of textual variance in an edition to handle interpretive annotations on a text instead? This may not be as much of a stretch as it appears: it is common practice to use the critical apparatus to record editorial observations and emendations as well as readings found in versions of the text. It does not seem unreasonable to treat something like the identification of "C. Caesaris" as a personal name as a type of editorial emendation, even though it involves variant markup rather than variant text.
Usefully for us, a critical apparatus may appear either in an inline or standoff position. A standoff apparatus attaches to the base text using @from and @to attributes, which can indicate the location of the start and end of the varying text (if a TEI Pointer expressing a range is used, or a single element contains the whole variant, only the @from attribute is needed). Given a text like
Figure 3: Example 3
<div type="textpart" subtype="chapter" n="1" xml:id="c1">
  <p type="textpart" subtype="section" n="1" xml:id="c1s1">
    <seg n="1" xml:id="c1s1p1">Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.</seg>
    ...
  </p>
</div>
we could mark "Gallia" as a place name inline:
Figure 4: Example 4
<div type="textpart" subtype="chapter" n="1" xml:id="c1">
  <p type="textpart" subtype="section" n="1" xml:id="c1s1">
    <seg n="1" xml:id="c1s1p1"><placeName ref="https://pleiades.stoa.org/places/993">Gallia</placeName> est omnis divisa in partes tres, quarum unam incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.</seg>
    ...
  </p>
</div>
The same assertion could instead be made with a standoff apparatus entry:
Figure 5: Example 5
<standoff>
  <listApp>
    <app from="#string-index(c1s1p1,0)" to="#string-index(c1s1p1,6)">
      <rdg><placeName ref="https://pleiades.stoa.org/places/993">Gallia</placeName></rdg>
    </app>
  </listApp>
</standoff>
Such an annotation structure could either be embedded in the original or delivered as a separate document.
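As a rough illustration of how an annotation tool's output might be serialized into the apparatus structure of Example 5, the following TypeScript sketch builds a single <app> entry from an annotation record. The StandoffNote shape and the toApp and escapeXml helpers are hypothetical names invented for this sketch; they are not Recogito's export format or part of CETEIcean.

```typescript
// Hypothetical annotation record; the field names are assumptions made for
// this sketch, not the data model of Recogito or any other tool.
interface StandoffNote {
  targetId: string; // xml:id of the element containing the annotated string
  start: number;    // offset at which the annotated string begins
  end: number;      // offset just past the end of the annotated string
  element: string;  // TEI element making the assertion, e.g. "placeName"
  ref: string;      // URI identifying the referent
  quote: string;    // the annotated string itself
}

// Escape characters that are significant in XML text and attribute values.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serialize one annotation as an <app> whose @from and @to use
// string-index() pointers, following the pattern of Example 5.
function toApp(note: StandoffNote): string {
  const from = `#string-index(${note.targetId},${note.start})`;
  const to = `#string-index(${note.targetId},${note.end})`;
  return (
    `<app from="${from}" to="${to}">` +
    `<rdg><${note.element} ref="${escapeXml(note.ref)}">` +
    `${escapeXml(note.quote)}</${note.element}></rdg>` +
    `</app>`
  );
}

const galliaApp: string = toApp({
  targetId: "c1s1p1",
  start: 0,
  end: 6,
  element: "placeName",
  ref: "https://pleiades.stoa.org/places/993",
  quote: "Gallia",
});
```

Keeping the annotated string in the reading, as Example 5 does, means that a consumer can check the pointer against the source text before applying the markup.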
Given such a document, what could we do with it? And what problems will need to be
overcome in order to use it? CETEIcean is already used as the basis for critical editions that
use an inline apparatus to model textual variation. Some of this work could be repurposed to
visualize annotations with a standoff apparatus. Recogito internally marks the location of the
start and end of an annotated string, using a generated XPath to register the nearest containing
element of the start and end points, and then indexes into the string containing the start or
end. This method is isomorphic to the string-index() TEI Pointer used in the example above (though the example references an
xml:id instead of an XPath). Resolving TEI Pointers is a somewhat complex task, but is solvable.
An XSLT 3.0 implementation would be possible, given the introduction of
<xsl:evaluate> for the dynamic invocation of XPaths, though unfortunately that
functionality is unavailable in the free version of the reference implementation, which
makes the broad dissemination of the technique problematic.
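To make the pointer resolution step more concrete, here is a minimal TypeScript sketch that parses the string-index() pointers used in Example 5 and extracts the indicated run from the text of the target element. It assumes the simplest case only: both pointers address the same element by xml:id, and that element's text is available as one plain string. The names parseStringIndex and resolveRange, and the textById lookup table, are hypothetical and do not belong to any existing library.

```typescript
// A parsed string-index() pointer: the xml:id it addresses and the offset.
interface StringIndexPointer {
  targetId: string;
  offset: number;
}

// Parse a pointer of the form "#string-index(ID,OFFSET)".
// Other TEI Pointer schemes (match(), range(), xpath(), ...) are not handled.
function parseStringIndex(pointer: string): StringIndexPointer {
  const m = /^#string-index\(\s*([^,\s)]+)\s*,\s*(\d+)\s*\)$/.exec(pointer);
  if (m === null) {
    throw new Error(`Not a string-index pointer: ${pointer}`);
  }
  return { targetId: String(m[1]), offset: Number(m[2]) };
}

// Resolve a from/to pair against a table mapping xml:id to element text.
// Caveat: JavaScript string offsets count UTF-16 code units, which differ
// from character counts for text outside the Basic Multilingual Plane.
function resolveRange(
  from: string,
  to: string,
  textById: Record<string, string>
): string {
  const start = parseStringIndex(from);
  const end = parseStringIndex(to);
  if (start.targetId !== end.targetId) {
    throw new Error("This sketch only handles ranges within one element");
  }
  const text = textById[start.targetId];
  if (text === undefined) {
    throw new Error(`No element with xml:id ${start.targetId}`);
  }
  if (start.offset > end.offset || end.offset > text.length) {
    throw new Error("Offsets fall outside the target text");
  }
  return text.slice(start.offset, end.offset);
}

const resolvedName: string = resolveRange(
  "#string-index(c1s1p1,0)",
  "#string-index(c1s1p1,6)",
  { c1s1p1: "Gallia est omnis divisa in partes tres" }
);
```

In a browser the lookup table would presumably be replaced by reading the text content of the element with the matching ID in the rendered document, which is where inserted display content becomes a complication.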
Of course, it's not quite that simple. Annotations may run into overlap issues, and indeed one prime motivation for creating standoff assertive annotations for people, places, etc. instead of putting them inline would be to avoid messy, complex markup with multiple concerns. In such cases, the annotation will have to be split across the multiple text nodes in the source. Moreover, CETEIcean renderings of the source may insert HTML elements or text for display purposes. These issues will have to be worked around or ignored during annotation resolution, and additional metadata about the annotation target may be required in order for this to be possible. Annotations may also conflict with one another, a problem which the critical apparatus structures are well prepared to handle, since textual variations too may often occur in irreconcilable ways. There is much work to be done, but the combination of these technologies may make it possible to render standoff annotations on XML sources in a regular web browser and thus enable the development of collaborative, information-rich, sustainable digital editions.
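The splitting of an annotation across text nodes mentioned above can be sketched as follows. This TypeScript example assumes the target element's content has already been flattened into an ordered list of text-node strings, with any text inserted for display already filtered out, and it maps one character range onto per-node slices. The NodeSlice type and splitAcrossTextNodes function are hypothetical names, and the sample markup (a <hi> around "C.") is invented for illustration.

```typescript
// One piece of an annotation that falls inside a single text node.
interface NodeSlice {
  nodeIndex: number; // position of the text node in document order
  start: number;     // offset within that node where the slice begins
  end: number;       // offset within that node just past the slice
  text: string;      // the covered characters
}

// Map the range [start, end) over the concatenated text onto the individual
// text nodes it touches, skipping nodes it does not overlap.
function splitAcrossTextNodes(
  nodeTexts: string[],
  start: number,
  end: number
): NodeSlice[] {
  const slices: NodeSlice[] = [];
  let nodeStart = 0; // offset of the current node within the whole text
  nodeTexts.forEach((nodeText, nodeIndex) => {
    const nodeEnd = nodeStart + nodeText.length;
    const sliceStart = Math.max(start, nodeStart);
    const sliceEnd = Math.min(end, nodeEnd);
    if (sliceStart < sliceEnd) {
      slices.push({
        nodeIndex,
        start: sliceStart - nodeStart,
        end: sliceEnd - nodeStart,
        text: nodeText.slice(sliceStart - nodeStart, sliceEnd - nodeStart),
      });
    }
    nodeStart = nodeEnd;
  });
  return slices;
}

// Text nodes of a hypothetical <seg>Litteris <hi>C.</hi> Caesaris consulibus</seg>;
// the name "C. Caesaris" occupies offsets 9 to 20 of the concatenated text.
const caesarSlices: NodeSlice[] = splitAcrossTextNodes(
  ["Litteris ", "C.", " Caesaris consulibus"],
  9,
  20
);
```

Each slice could then be wrapped separately in the rendered HTML, with the pieces sharing a class or identifier so that they behave as one annotation.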
 The idea for this paper presented itself right before the submission deadline, and as a result, it will require a good deal more fleshing out before presentation, but I think the skeleton is workable.
 https://github.com/TEIC/CETEIcean. See Cayless, Hugh, and Raffaele Viglianti. “CETEIcean: TEI in the Browser.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Cayless01.
 Example 5 uses a listApp wrapped in a proposed standoff element, which will be implemented in a future release of the TEI Guidelines.