On Implementing string-range() for TEI

Hugh A. Cayless

Analyst/Programmer

NYU

Adam Soroka

Engineer

UVA

Balisage: The Markup Conference 2010
August 3 - 6, 2010

Copyright © 2010 Hugh A. Cayless and Adam Soroka

How to cite this paper

Cayless, Hugh A., and Adam Soroka. “On Implementing string-range() for TEI.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Cayless01.

Abstract

The Text Encoding Initiative Guidelines specify a number of pointer schemes for use in implementing standoff markup. This paper reports on an implementation of one of these pointer schemes, string-range(), and discusses the issues surrounding standoff markup in the context of TEI.

Table of Contents

Introduction
TEI, standoff markup, and string-range()

Introduction

The genesis of this paper lies in a discussion[1] on the Humanist mailing list that began with a request for comment from Desmond Schmidt on his recent article in LLC, The inadequacy of embedded markup for cultural heritage texts. [Schmidt2010] The core of the article is an argument (really a series of arguments) that the insertion of what we will call inline markup (the format of which is typically XML) into the midst of a text to be interpreted is in some sense a violation of that text. Schmidt comes at this from several angles, highlighting the overlap problem, the imposition of subjective interpretation on the text in the form of markup that could become obsolete before the text itself does, the ways in which inline markup may duplicate information that could be derived automatically, and the fact that markup technologies like XML are industrial and inherit from textual command languages designed for print.

The authors aren’t sure they completely agree with all of this, but Schmidt’s is a thoughtful article, and a useful contribution to the ongoing debate over how satisfactory XML is for representing text. The subsequent discussion on Humanist went on for an unusually long series of posts, and was at times quite contentious. It inspired Hugh Cayless to call a session on The (in)adequacies of markup [http://thatcamp.org/2010/the-inadequacies-of-markup/] at the THATCamp meeting held shortly afterwards at George Mason University. The session participants quickly agreed on a ruthlessly practical approach. As programmers, we are quite pleased that XML is an industrial tool. While we’ll happily acknowledge the shortcomings of the Text Encoding Initiative (TEI), the size of its install base and the number of texts already encoded using it led us to look for solutions to the problems inherent in inline markup that could be implemented within the context of XML and the TEI. The obvious alternative to inline markup is standoff markup, and the TEI Guidelines have at least some things to say about doing standoff markup in TEI.

TEI, standoff markup, and string-range()

Section 16.2.4 of the Text Encoding Initiative Guidelines outlines a number of pointer schemes that are related to functions defined in the XPointer specification [XPtr]. These can (notionally at least) be used to produce standoff markup on a TEI document. There are a variety of problems with the pointer schemes defined by the Guidelines, and also with the related XPointer functions, but the most basic is that most of them don't have any implementation. There is, therefore, no good way to use them, and, because they are unused, no good reason to implement them either. It is a Catch-22. The TEI pointer schemes are clearly meant to be used in concert with XInclude, as functions that retrieve text or node sets (see the example in 16.9.3), but their effects are underspecified in the Guidelines.

Recent developments in the TEI have opened up the possibility of creating an implementation of at least one of these schemes, namely string-range(). The string-range() pointer scheme is defined thus:

16.2.4.5 string-range(fragmentIdentifier, offset [, length])

The string-range() scheme locates a range based on character positions. While string-range endpoints are points adjacent to character positions, they must be designated by the characters to which they are adjacent, in the same way that the nodes corresponding to XML elements are. This avoids ambiguity about which point between two characters is indicated when characters are interrupted by markup.

The first argument to string-range() designates a node or a range within which a string is to be located. No string range, even an empty one, can be defined by a string-range() if the fragment identified has the empty string as its value. Every string-range is defined based on an ‘origin character’. The origin is numbered 0, and designates the first character of the string-value of pointer. The offset is a character index relative to the origin; the start of the resulting range is the position designated by the sum of the origin and offset.

If length is specified, the end of the range is at a point adjacent to the character designated by the origin added to the offset and length. If the offset is negative, or length is sufficiently large, a string-range can designate characters outside the string-value of the initial pointer. In this case, characters are located using the string-value of the entire document. It is also legal for length plus the origin to exceed the length of the string-value of the document by one, in order to accommodate ranges that include the last character of a document.

If length is not specified, it defaults to the value 1, and the string range contains one character. If it is specified as 0, the zero-length range is interpreted as the point immediately preceding the origin character or offset character if there is one. [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSSR][2]

In theory, at least, string-range can be used to indicate an arbitrary section of text in a TEI document, without regard to the way that text is nested within the document's structure. A range could start inside one element, and end inside another. Put another way, it can span multiple text() nodes. This means that if string-range() can be implemented, it would present a solution to the overlapping hierarchies problem.
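To make the definition concrete, here is a minimal Python sketch of how such a pointer might be resolved. It is not the XSLT implementation discussed in this paper: the function name `resolve_string_range` and the sample document are invented for illustration, the fragment identifier is assumed to be an @xml:id value, and ranges that reach outside the identified fragment are simply rejected rather than resolved against the whole document.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def resolve_string_range(root, fragment_id, offset, length=1):
    """Return the characters designated by string-range(fragment_id, offset, length)."""
    for elem in root.iter():
        if elem.get(XML_ID) == fragment_id:
            # The string-value is the concatenation of all descendant text nodes.
            string_value = "".join(elem.itertext())
            if not string_value:
                raise ValueError("no string range can be defined on an empty string-value")
            if offset < 0 or offset + length > len(string_value):
                raise ValueError("range leaves the fragment (not handled in this sketch)")
            return string_value[offset:offset + length]
    raise KeyError(fragment_id)

# Hypothetical sample: the range starts inside <hi> and ends inside the second <p>.
doc = ET.fromstring(
    '<div xml:id="d1" xml:space="preserve">'
    '<p>The quick <hi>brown</hi> fox </p><p>jumps over</p></div>'
)
print(resolve_string_range(doc, "d1", 10, 15))  # brown fox jumps
```

The returned string crosses the boundaries of both `<hi>` and `<p>`, which is exactly the kind of non-hierarchical selection that inline markup cannot express directly.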

Since string-range depends on marking a starting point and length of text within a section of the document, it runs immediately into a problem with the way XML regards some whitespace as "ignorable". Space between elements, for example, is not necessarily preserved during operations on the document. Someone editing a document might pretty-print it in order to make it more readable. This would introduce extra newline and space characters into the document, and immediately break any string-range() pointers. In other words, the ignorable whitespace content of the document could be changed as a part of normal processing that doesn’t involve any editing of the document. This year, for the first time, TEI has begun to allow the xml:space attribute. [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.html] This means that the ignorable whitespace issue can be accommodated in a standard way.
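The effect of pretty-printing on character offsets can be shown with a small Python sketch. The two sample strings and the helper `string_value` are invented for illustration; they stand for the same document before and after indentation has been added.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same content, compact and pretty-printed.
compact = '<div><p>alpha</p><p>beta</p></div>'
pretty = '<div>\n  <p>alpha</p>\n  <p>beta</p>\n</div>'

def string_value(xml_text):
    """Concatenate every text node of the document, whitespace included."""
    return "".join(ET.fromstring(xml_text).itertext())

# The word "beta" is found at a different offset once indentation is added,
# so a pointer computed against the compact form no longer selects it.
print(string_value(compact).index("beta"))  # 5
print(string_value(pretty).index("beta"))   # 11
```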

A second problem, and one that applies to several of the pointer schemes that the Guidelines specify, is that they extend the XML data model. The TEI pointer scheme conceives of Nodes and Node Sets (both of which correspond to objects in the XML Infoset/DOM), but also Points and Ranges. Points are theoretical objects that must lie between element nodes or between characters in text nodes. This is a useful concept for marking arbitrary ranges in a document, but since it does not correspond to anything conceived of by the XML specifications, there are no hooks in XML processing tools on which to hang Points. They cannot be passed to or returned by any XPath function or XSLT instruction. This makes implementation a complex task. At best, they can be encapsulated in special-purpose markup for passing as messages or handled as uninterpreted XPath expressions. The former technique introduces a problem of standardization and the latter requires second-order processing, with the dangers and difficulties that implies. Since string-range focuses on text, however, it is possible to count, for each text node, the concatenated length of text nodes on the preceding axis, and thereby to locate the text nodes containing the start and end points indicated in a string-range() pointer.
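The counting approach can be sketched in Python as follows. This is an illustration under stated assumptions rather than the XSLT code itself: the names `text_nodes` and `locate` are hypothetical, the running total `preceding` stands in for the summed length of text nodes on the preceding axis, and offsets are counted in Unicode code points within a single container element.

```python
from xml.dom import minidom, Node

def text_nodes(node):
    """Yield the text nodes below `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            yield child
        elif child.nodeType == Node.ELEMENT_NODE:
            yield from text_nodes(child)

def locate(container, start, length):
    """Return (text_node, local_start, local_end) for each text node the range touches."""
    end = start + length
    preceding = 0  # combined length of the text nodes already passed
    hits = []
    for node in text_nodes(container):
        node_start, node_end = preceding, preceding + len(node.data)
        if node_start < end and node_end > start:  # this node overlaps the range
            hits.append((node,
                         max(start, node_start) - node_start,
                         min(end, node_end) - node_start))
        preceding = node_end
    return hits

# Hypothetical sample document.
doc = minidom.parseString(
    '<div><p>The quick <hi>brown</hi> fox </p><p>jumps over</p></div>'
)
pieces = [node.data[a:b] for node, a, b in locate(doc.documentElement, 10, 15)]
print(pieces)  # ['brown', ' fox ', 'jumps']
```

The first and last entries identify the text nodes that contain the start and end points; everything between them lies wholly inside the range.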

A third problem with string-range() as defined by the TEI, and in fact with all of its XPointer schemes, is that the specification (the TEI Guidelines) doesn't properly address what implementation would mean. The example in 16.9.3 uses string-range in XInclude elements to import text from one XML document to another. Of course this example doesn’t work, because TEI’s string-range has no XInclude implementation. But the (unstated) implication seems to be that the string-range() function returns plain text only. String-range could certainly be used to declaratively indicate arbitrary sections of a document, but without some mechanism for executing it, there is nothing concrete for an implementer to do. A further complication is that there is nothing stopping a string-range from indicating text that overlaps elements in a non-hierarchical fashion. Should an implementer ignore elements thus captured? Or return them somehow? A related issue is the fact that since string-range defines text-based locations, elements are effectively invisible to it. A standalone element (e.g. <lb/>) immediately before text that one wants to mark with a string-range() won't automatically be part of that range.

Given the underspecified functionality of string-range, the authors have made some assumptions about implementation details. We have decided not to extend any existing XInclude implementation. Instead, we have decided to use string-range only in a declarative fashion, as a pointing mechanism within TEI, and we are developing XPath 2.0 functions that complement and use string-range(). Where it declares a range, they will be able to retrieve that range. We propose three functions, with the following signatures:

get-string-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])

- takes as arguments an XPath indicating a parent element (e.g. a div on which @xml:space="preserve" has been set) and a set of integer pairs of character offsets

- returns a sequence of strings derived from text nodes or portions of text nodes between the pairs of points passed in as parameters.

get-milestone-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])

- takes as arguments an XPath indicating a parent element (e.g. a div on which @xml:space="preserve" has been set) and a set of integer pairs of character offsets

- returns a sequence where elements have been converted to milestones (e.g. <p-start> and <p-end> instead of <p>).

get-fragment-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])

- takes as arguments an XPath indicating a parent element (e.g. a div on which @xml:space="preserve" has been set) and a set of integer pairs of character offsets

- returns a well-formed document fragment, where elements split by the range have been automatically opened or closed.

An XSLT 2.0 stylesheet that implements these functions is under development at http://github.com/hcayless/tei-string-range.
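As a rough illustration of the milestone behaviour described above, the following Python sketch handles a single pair of offsets. It is not taken from the stylesheet: the names `events` and `milestone_range` are hypothetical, the pair is assumed to mean a start point (inclusive) and an end point (exclusive), and the rules for which tag boundaries fall "inside" the range are one possible reading among several.

```python
from xml.dom import minidom, Node

def events(node):
    """Flatten a subtree into ('start'|'end'|'empty', name) and ('text', data) events."""
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            yield ("text", child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            if not child.childNodes:
                yield ("empty", child.tagName)
            else:
                yield ("start", child.tagName)
                yield from events(child)
                yield ("end", child.tagName)

def milestone_range(container, start, end):
    """Return the text and milestone markers lying between two character points."""
    out, pos = [], 0
    for kind, value in events(container):
        if kind == "text":
            lo, hi = max(start, pos), min(end, pos + len(value))
            if lo < hi:
                out.append(("text", value[lo - pos:hi - pos]))
            pos += len(value)
        # Assumption: a start tag sitting exactly at the end point is outside the
        # range, and an end tag sitting exactly at the start point is outside it.
        elif kind == "start" and start <= pos < end:
            out.append(("milestone", value + "-start"))
        elif kind == "end" and start < pos <= end:
            out.append(("milestone", value + "-end"))
        elif kind == "empty" and start <= pos < end:
            out.append(("milestone", value))
    return out

# Hypothetical sample document.
doc = minidom.parseString(
    '<div><p>The quick <hi>brown</hi> fox </p><p><lb/>jumps over</p></div>'
)
result = milestone_range(doc.documentElement, 10, 25)
for item in result:
    print(item)
```

For this sample the result contains a `p-end` with no preceding `p-start` and a `p-start` with no following `p-end`, which is why the milestone form is needed: the range cuts across both paragraphs. A fragment-returning variant would instead have to open and close those elements itself.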

A fourth problem lies in the ease-of-use of the string-range function. Determining the index location of a piece of arbitrary text in a TEI document is prohibitively difficult for a human editor. It would be relatively easy to programmatically generate a string-range based on a selected range in an XML editor, like oXygen, but without this kind of functionality, it will be quite hard for someone marking up a document to create the expression with facility. What is needed at a bare minimum is a means to mark range starts and ends, in an editor-independent fashion, which can then be converted to string-range expressions. We propose using processing instructions in the form <?range-start r="n"?>/<?range-end r="n"?>, where "n" identifies a particular range. Pairs of these will mark range starts and ends, and can be processed by an XSLT stylesheet to create <linkGrp>s containing links that use string-range() to identify the marked ranges.
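The conversion from processing instructions to offsets can be sketched in Python. This is an illustration only, not the stylesheet: the function name `ranges_from_pis` and the sample document are invented, the enclosing element is assumed to carry an @xml:id, and the sketch prints bare pointer strings rather than building <linkGrp> elements.

```python
import re
from xml.dom import minidom, Node

XML_NS = "http://www.w3.org/XML/1998/namespace"

def ranges_from_pis(container):
    """Map each range identifier to (offset, length) in the container's string-value."""
    starts, ranges, pos = {}, {}, 0

    def walk(node):
        nonlocal pos
        for child in node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                pos += len(child.data)  # only text advances the character position
            elif child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
                match = re.search(r'r="([^"]*)"', child.data)
                if match is None:
                    continue
                rid = match.group(1)
                if child.target == "range-start":
                    starts[rid] = pos
                elif child.target == "range-end" and rid in starts:
                    begin = starts.pop(rid)
                    ranges[rid] = (begin, pos - begin)
            elif child.nodeType == Node.ELEMENT_NODE:
                walk(child)

    walk(container)
    return ranges

# Hypothetical sample: range "1" starts in one paragraph and ends in the next.
doc = minidom.parseString(
    '<div xml:id="d1"><p>The quick <?range-start r="1"?><hi>brown</hi> fox </p>'
    '<p>jumps<?range-end r="1"?> over</p></div>'
)
div = doc.documentElement
for rid, (offset, length) in ranges_from_pis(div).items():
    print(f"#string-range('{div.getAttributeNS(XML_NS, 'id')}', {offset}, {length})")
# prints: #string-range('d1', 10, 15)
```

Because processing instructions contribute nothing to the string-value, the markers can sit anywhere, including on opposite sides of an element boundary, without disturbing the offsets they produce.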

Our implementation, then, consists of a simple way to create string-range() pointers using an XSLT 2.0 stylesheet transformation and a set of functions that can be used to process the data marked by a string-range() in the context of an XPath 2.0 processor. Using these stylesheets it is possible, for example, to mark up ranges of text in a non-hierarchical way and then generate a set of links denoting those ranges, to which additional standoff markup may be linked, or one can convert a document with inline markup to one where a division contains plain text and a second division contains markup and pointers to the text.
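The second kind of conversion, from inline markup to plain text plus pointers, can be approximated in a few lines of Python. This is a sketch under assumptions, not the stylesheet: `to_standoff` is a hypothetical name, the identifier `d1` stands for the @xml:id of the division that would hold the plain text, and every text node, including whitespace-only ones, is replaced by a <ptr>.

```python
from xml.dom import minidom, Node

def to_standoff(container, text_id):
    """Replace every text node below `container` with a <ptr>; return the plain text."""
    doc, pieces, pos = container.ownerDocument, [], 0

    def walk(node):
        nonlocal pos
        for child in list(node.childNodes):  # copy, because the child list is modified
            if child.nodeType == Node.TEXT_NODE:
                ptr = doc.createElement("ptr")
                ptr.setAttribute(
                    "target",
                    f"#string-range('{text_id}', {pos}, {len(child.data)})",
                )
                node.replaceChild(ptr, child)
                pieces.append(child.data)
                pos += len(child.data)
            elif child.nodeType == Node.ELEMENT_NODE:
                walk(child)

    walk(container)
    return "".join(pieces)

# Hypothetical sample: three text nodes, one of them inside <supplied>.
doc = minidom.parseString('<ab>ab<supplied reason="lost">cde</supplied>f</ab>')
text = to_standoff(doc.documentElement, "d1")
targets = [p.getAttribute("target") for p in doc.getElementsByTagName("ptr")]
print(text)     # abcdef
print(targets)
```

The element structure survives unchanged; only the character data moves out, leaving one pointer per former text node.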

While the authors intend this effort to be a practical addition to the TEI’s arsenal of tools, this kind of implementation raises theoretical questions that bring us back to the question of the adequacy of inline markup. In the example below, taken from http://github.com/hcayless/tei-string-range/blob/master/bgu.1.116.xml, a transcription of a document written on papyrus from the Arsinoite nome in Egypt, some of the text content in the edition <div> is readable in the original, and some has been supplied by the editor.

<lb n="1"/><handShift new="m3"/> <num value="62">ξβ</num> 
<lb n="2"/><handShift new="m1"/> 
  <supplied reason="lost">Ἁρποκρατίω</supplied>ν<supplied reason="lost">ι</supplied> 
  τ<supplied reason="lost">ῷ κ</supplied>αὶ Ἱέρακι 
  <expan>β<supplied reason="lost">ασ<ex>ιλικῷ</ex></supplied></expan> 
<lb n="3"/><supplied reason="lost"><expan>γρ<ex>αμματεῖ</ex></expan> 
  <expan>Ἀρσ<ex>ινοΐτου</ex></expan></supplied> 
  <expan>Ἡρ<supplied reason="lost">ακ<ex>λείδου</ex></supplied></expan>
  <supplied reason="lost"> με</supplied>ρίδος 
<lb n="4"/><supplied reason="lost">παρὰ</supplied> 
  Ὡ<supplied reason="lost">ριγέ</supplied><unclear>ν</unclear>ους 
  Ἰσιδ<supplied reason="lost">ώ</supplied>ρο<supplied reason="lost">υ</supplied> 
<lb n="5"/><supplied reason="lost">τῶν ἀπὸ</supplied> τῆ<supplied reason="lost">ς</supplied> 
  <expan>μ<supplied reason="lost">ητρ</supplied>ο<ex>πόλεως</ex></expan> 
  <expan>ἀπογε<supplied reason="lost">γρ</supplied>α<ex>μμένου</ex></expan> 
<lb n="6"/><supplied reason="lost">ἐπʼ <expan>ἀμφό<ex>δου</ex></expan> </supplied>
  <gap reason="lost" quantity="1" unit="character"/><abbr>ερω</abbr> 
  Θε<gap reason="lost" quantity="1" unit="character"/><abbr><unclear>μι</unclear>
  <gap reason="illegible" quantity="1" unit="character"/></abbr>.
        

A transcription of the first six lines following the Leiden convention reads thus:

(hand 3) ξβ 
(hand 1) [Ἁρποκρατίω]ν[ι] τ[ῷ κ]αὶ Ἱέρακι β[ασ(ιλικῷ)] 
[γρ(αμματεῖ) Ἀρσ(ινοΐτου)] Ἡρ[ακ(λείδου) με]ρίδος 
[παρὰ] Ὡ[ριγέ]ν̣ους Ἰσιδ[ώ]ρο[υ] 
[τῶν ἀπὸ] τῆ[ς] μ[ητρ]ο(πόλεως) ἀπογε[γρ]α(μμένου) 
[ἐπʼ ἀμφό(δου) ̣]ερω( ) Θε[ ̣]μ̣ι̣[ ̣]( ).

A “plain text” version, obtained by stripping the markup and keeping only the text content of the TEI document, looks like:

ξβ 
Ἁρποκρατίωνι τῷ καὶ Ἱέρακι βασιλικῷ 
γραμματεῖ Ἀρσινοΐτου Ἡρακλείδου μερίδος 
παρὰ Ὡριγένους Ἰσιδώρου 
τῶν ἀπὸ τῆς μητροπόλεως ἀπογεγραμμένου 
ἐπʼ ἀμφόδου ερω Θεμι.
          

while the extracted markup, with <ptr> elements that refer back to the text div, looks like:
            
<lb n="1"/>
<handShift new="m3"/>
<ptr target="#string-range('d2e120', 6, 1)"/>
<num value="62">
  <ptr target="#string-range('d2e120', 7, 2)"/>
</num>
<ptr target="#string-range('d2e120', 9, 7)"/>
<lb n="2"/>
<handShift new="m1"/>
<ptr target="#string-range('d2e120', 16, 1)"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 17, 10)"/>
</supplied>
<ptr target="#string-range('d2e120', 27, 1)"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 28, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 29, 2)"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 31, 3)"/>
</supplied>
<ptr target="#string-range('d2e120', 34, 10)"/>
<expan>
  <ptr target="#string-range('d2e120', 44, 1)"/>
  <supplied reason="lost">
    <ptr target="#string-range('d2e120', 45, 2)"/>
    <ex>
      <ptr target="#string-range('d2e120', 47, 5)"/>
    </ex>
  </supplied>
</expan>
<ptr target="#string-range('d2e120', 52, 7)"/>
<lb n="3"/>
<supplied reason="lost">
  <expan>
    <ptr target="#string-range('d2e120', 59, 2)"/>
    <ex>
      <ptr target="#string-range('d2e120', 61, 7)"/>
     </ex>
  </expan>
  <ptr target="#string-range('d2e120', 68, 1)"/>
  <expan>
    <ptr target="#string-range('d2e120', 69, 3)"/>
    <ex>
      <ptr target="#string-range('d2e120', 72, 7)"/>
    </ex>
  </expan>
</supplied>
<ptr target="#string-range('d2e120', 79, 1)"/>
<expan>
  <ptr target="#string-range('d2e120', 80, 2)"/>
  <supplied reason="lost">
    <ptr target="#string-range('d2e120', 82, 2)"/>
    <ex>
      <ptr target="#string-range('d2e120', 84, 6)"/>
    </ex>
  </supplied>
</expan>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 90, 3)"/>
</supplied>
<ptr target="#string-range('d2e120', 93, 12)"/>
<lb n="4"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 105, 4)"/>
</supplied>
<ptr target="#string-range('d2e120', 109, 2)"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 111, 4)"/>
</supplied>
<unclear>
  <ptr target="#string-range('d2e120', 115, 1)"/>
</unclear>
<ptr target="#string-range('d2e120', 116, 8)"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 124, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 125, 2)"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 127, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 128, 7)"/>
<lb n="5"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 135, 7)"/>
</supplied>
<ptr target="#string-range('d2e120', 142, 3)"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 145, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 146, 1)"/>
<expan>
  <ptr target="#string-range('d2e120', 147, 1)"/>
  <supplied reason="lost">
    <ptr target="#string-range('d2e120', 148, 3)"/>
  </supplied>
  <ptr target="#string-range('d2e120', 151, 1)"/>
  <ex>
    <ptr target="#string-range('d2e120', 152, 6)"/>
  </ex>
</expan>
<ptr target="#string-range('d2e120', 158, 1)"/>
<expan>
  <ptr target="#string-range('d2e120', 159, 5)"/>
  <supplied reason="lost">
    <ptr target="#string-range('d2e120', 164, 2)"/>
  </supplied>
  <ptr target="#string-range('d2e120', 166, 1)"/>
  <ex>
    <ptr target="#string-range('d2e120', 167, 6)"/>
  </ex>
</expan>
<ptr target="#string-range('d2e120', 173, 7)"/>
<lb n="6"/>
<supplied reason="lost">
  <ptr target="#string-range('d2e120', 180, 4)"/>
  <expan>
    <ptr target="#string-range('d2e120', 184, 4)"/>
    <ex>
      <ptr target="#string-range('d2e120', 188, 3)"/>
    </ex>
  </expan>
  <ptr target="#string-range('d2e120', 191, 1)"/>
</supplied>
<gap reason="lost" quantity="1" unit="character"/>
<abbr>
  <ptr target="#string-range('d2e120', 192, 3)"/>
</abbr>
<ptr target="#string-range('d2e120', 195, 3)"/>
<gap reason="lost" quantity="1" unit="character"/>
<abbr>
  <unclear>
    <ptr target="#string-range('d2e120', 198, 2)"/>
  </unclear>
  <gap reason="illegible" quantity="1" unit="character"/>
</abbr>
            
          

This example is actually a fairly unproblematic one, since it does not contain any alternate readings or editorial corrections or normalization. Yet even here there are difficulties: “Θεμι” (as is clear in the Leiden version) contains two gaps and unclear text, but since these visual features of the document are indicated using <gap/> and <unclear/> tags, it looks like an undamaged word-fragment in the plain text version. It must be noted that the traditional way of publishing these documents in print employs inline markup. So, in this example at least, a plain text version would itself be a somewhat misleading version of the document. This is not a refutation of Schmidt’s points, because there are many other ways one could encode the document, using standoff markup, that would mitigate this problem. But perhaps it suggests that there are at least some uses of inline markup (when it encodes features of the text that cannot be expressed straightforwardly in Unicode) that may be hard to replace.

The ability to extract the markup from the text and still preserve the manipulability it previously enjoyed suggests some additional possibilities: one could now layer in name and place information, lexical and grammatical analysis, structural information, such as line containment, rather than just marking line beginnings, etc. Different views could be generated, using these individually or using combinations of them. Nothing stops us from layering these on top of inline markup either.

Since it relies on character offsets, any implementation of string-range() is inherently somewhat brittle. The adoption of @xml:space by the TEI closes off one means by which links using string-range could be broken, but can do nothing to mitigate the danger of someone editing the text directly. Projects that use this mechanism will have to prevent the breakage of string-range links either through workflow or editing environments that manage shifting offsets.
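One way an editing environment might manage shifting offsets is sketched below in Python. It covers insertions only and is purely illustrative: the function name `shift_for_insertion` is hypothetical, ranges are (offset, length) pairs against a single string-value, and the policy that an insertion exactly at the start of a range moves the range rather than growing it is an assumption.

```python
def shift_for_insertion(ranges, insert_at, count):
    """Adjust (offset, length) pairs after `count` characters are inserted at `insert_at`."""
    adjusted = []
    for offset, length in ranges:
        if insert_at <= offset:
            offset += count   # inserted before the range: move it to the right
        elif insert_at < offset + length:
            length += count   # inserted inside the range: let the range grow
        # otherwise the insertion lies after the range, which is unaffected
        adjusted.append((offset, length))
    return adjusted

# Three characters are inserted at position 12 of the text.
print(shift_for_insertion([(0, 4), (10, 15), (30, 2)], 12, 3))
# [(0, 4), (10, 18), (33, 2)]
```

Deletions are harder, since a deletion can swallow part or all of a range, and a real environment would need a policy for that case as well.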

We have already learned a good deal from our implementation efforts to date. If this approach is something other users of TEI or even the TEI Consortium itself wishes to support, there are several changes we would suggest. First, that the Guidelines be emended to contain a more thorough specification of the TEI pointer schemes. Second, that a working group be formed to look at practical implementations of standoff markup and at appropriate usage patterns for these. We must note that the example stylesheet we provide to generate a text + standoff markup version of a valid TEI document results in invalid TEI when applied to the bgu.1.116 example, because elements like <ex/> can only contain text, not pointers to text. Moreover, if one wants to extract a string-range with the inline markup converted to standalone elements, then again the result will not be valid TEI. We hope our efforts outlined above will prompt some useful examination and perhaps revision of the TEI Guidelines’ perspective on standoff markup.

Bibliography

[TEIP5] Burnard, L. and S. Bauman (eds), Text Encoding Initiative: P5 Guidelines, http://www.tei-c.org/Guidelines/P5/ (2007).

[XPtr] DeRose, Steve, Eve Maler, and Ron Daniel Jr., XML Pointer Language (XPointer) Version 1.0, http://www.w3.org/TR/WD-xptr (2001).

[Schmidt2010] Schmidt, Desmond, The inadequacy of embedded markup for cultural heritage texts, Literary and Linguistic Computing 25.2 (2010).



[2] We are so far being quite restrictive in our interpretation of the term fragmentIdentifier. In theory this could encompass any means of identifying a section of the document, including functions in the xpointer framework, for example. In practice, fragment identifiers are context-dependent, relying both on the MIME type of the document identified by the URI and on the functionality of the technology used to call them. For example, in the context of an XInclude element, some xpointer functions will work, whereas in the context of a browser-based hyperlink, only @id or @xml:id values work. Since we are working outside XInclude, we take the narrow view that a fragment identifier in a string-range can only be the value of an @xml:id attribute somewhere in the current document or in an external XML document.

Author's keywords for this paper: TEI; standoff markup; XSLT/XPath 2.0

Hugh A. Cayless

Analyst/Programmer

NYU

Hugh Cayless works on digital papyrology for the NYU Digital Library Technology Services team. He holds a Ph.D. in Classics and an MS in Information Science and has research interests in the application of digital technologies to problems in the study of the ancient world.

Adam Soroka

Engineer

UVA

Adam Soroka is an engineer in the Research and Development section of the Department of Digital Research and Scholarship of the University of Virginia Library. His XML-related interests include the uses of tree automata and integrating geospatial data into textual markup.