16.2.4.5 string-range(fragmentIdentifier, offset [, length])

The string-range() scheme locates a range based on character positions. While
string-range endpoints are points adjacent to character positions, they must be designated
by the characters to which they are adjacent, in the same way that the nodes corresponding
to XML elements are. This avoids ambiguity about which point between two characters is
indicated when characters are interrupted by markup.

The first argument to string-range() designates a node or a range within which a
string is to be located. No string range, even an empty one, can be defined by a
string-range() if the fragment identified has the empty string as its value. Every
string-range is defined based on an ‘origin character’. The origin is numbered 0, and
designates the first character of the string-value of pointer. The offset is a character
index relative to the origin; the start of the resulting range is the position designated
by the sum of the origin and offset.

If length is specified, the end of the range is at a point adjacent to the character
designated by the origin added to the offset and length. If the offset is negative, or
length is sufficiently large, a string-range can designate characters outside the
string-value of the initial pointer. In this case, characters are located using the
string-value of the entire document. It is also legal for length plus the origin to exceed
the length of the string-value of the document by one, in order to accommodate ranges that
include the last character of a document.

If length is not specified, it defaults to the value 1, and the string range contains
one character. If it is specified as 0, the zero-length range is interpreted as the point
immediately preceding the origin character or offset character if there is one.
[http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSSR]

So far we have been quite restrictive in our interpretation of the term
fragmentIdentifier. In theory this could encompass any means of
identifying a section of the document, including functions in the XPointer framework,
for example. In practice, fragment identifiers are context-dependent, relying both on
the MIME type of the document identified by the URI and on the functionality of the
technology used to call them. For example, in the context of an XInclude element, some
XPointer functions will work, whereas in the context of a browser-based hyperlink,
only @id or @xml:id values work. Since we are working outside XInclude, we take the
narrow view that a fragment identifier in a string-range can only be the value of an
@xml:id attribute somewhere in the current document or in an external XML
document.
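
The offset arithmetic in the definition quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the authors' XSLT implementation: the names parse_string_range and resolve are hypothetical, the pointer syntax accepted (an optional leading '#', an xml:id, an offset, and an optional length) is assumed from the signature in the Guidelines, and the fallback to the string-value of the whole document for out-of-range offsets is left out.

```python
import re
from typing import NamedTuple, Tuple


class StringRange(NamedTuple):
    target_id: str  # an @xml:id value (the narrow reading taken in this paper)
    offset: int     # character index relative to the origin (character 0)
    length: int     # number of characters; defaults to 1 when omitted


# Assumed surface syntax: string-range(id, offset[, length]), optionally
# preceded by '#'. Only xml:id-style identifiers are accepted here.
_POINTER = re.compile(
    r"^#?string-range\(\s*([^\s,()]+)\s*,\s*(-?\d+)\s*(?:,\s*(\d+)\s*)?\)$"
)


def parse_string_range(pointer: str) -> StringRange:
    """Split a string-range() pointer into its three arguments."""
    match = _POINTER.match(pointer.strip())
    if match is None:
        raise ValueError(f"not a string-range pointer: {pointer!r}")
    target_id, offset, length = match.groups()
    return StringRange(target_id, int(offset), 1 if length is None else int(length))


def resolve(string_value: str, offset: int, length: int = 1) -> Tuple[int, int, str]:
    """Return (start point, end point, covered text) within string_value.

    Points are counted as positions between characters, so a zero-length
    range is the point immediately before the character at `offset`.
    """
    if string_value == "":
        raise ValueError("no string range can be defined on an empty string-value")
    start, end = offset, offset + length
    if start < 0 or end > len(string_value):
        # The Guidelines fall back to the document's string-value here;
        # that lookup is deliberately left out of this sketch.
        raise ValueError("range extends beyond the string-value of the target")
    return start, end, string_value[start:end]
```

Under these assumptions, resolve("Hello world", 6, 5) returns (6, 11, "world"), and a length of 0 yields an empty range whose start and end points coincide.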
In theory, at least, string-range can be used to indicate an arbitrary section
of text in a TEI document, without regard to the way that text is nested within the document's
structure. A range could start inside one element, and end inside another. Put another way, it
can span multiple text() nodes. This means that if string-range() can be implemented, it will
offer a solution to the overlapping hierarchies problem.

Since string-range depends on marking a starting point and length of text within a section
of the document, it runs immediately into a problem with the way XML regards some whitespace
as "ignorable". Space between elements, for example, is not necessarily preserved during
operations on the document. Someone editing a document, for example, might pretty-print it in
order to make it more readable. This would introduce extra newline and space characters into
the document, and immediately break any string-range() pointers. In other words, the ignorable
whitespace content of the document could be changed as a part of normal processing that
doesn’t involve any editing of the document. This year, for the first time, TEI has begun to
allow the xml:space attribute.
[http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.html] This
means that the ignorable whitespace issue can be accommodated in a standard way.

A second problem, and one that applies to several of the pointer schemes that the
Guidelines specify, is that they extend the XML data model. The TEI pointer scheme conceives
of Nodes and Node Sets (both of which correspond to objects in the XML Infoset/DOM), but also
Points and Ranges. Points are theoretical objects that must lie between element nodes or
between characters in text nodes. This is a useful concept for marking arbitrary ranges in a
document, but since it does not correspond to anything conceived of by the XML specifications,
there are no hooks in XML processing tools on which to hang Points. They cannot be passed
to or returned by any XPath function or XSLT instruction. This makes implementation a complex
task. At best, they can be encapsulated in special-purpose markup for passing as messages or
handled as uninterpreted XPath expressions. The former technique introduces a problem of
standardization and the latter requires second-order processing, with the dangers and
difficulties that implies. Since string-range focuses on text, however, it is possible to
count, for each text node, the concatenated length of text nodes on the preceding axis, and
thereby to locate the text nodes containing the start and end points indicated in a
string-range() pointer.

A third problem with string-range() as defined by the TEI, and in fact with all of its
XPointer schemes, is that the specification (the TEI Guidelines) doesn't properly address what
implementation would mean. The example in 16.9.3 uses string-range in XInclude elements to
import text from one XML document to another. Of course this example doesn’t work, because
TEI’s string-range has no XInclude implementation. But the (unstated) implication seems to be
that the string-range() function returns plain text only. String-range could certainly be used
to declaratively indicate arbitrary sections of a document, but without some mechanism for
executing it, there is nothing concrete for an implementer to do. A further complication is
that there is nothing stopping a string-range from indicating text that overlaps elements in a
non-hierarchical fashion. Should an implementer ignore elements thus captured? Or return them
somehow? A related issue is the fact that since string-range defines text-based locations,
elements are effectively invisible to it. A standalone element (e.g. <lb/>)
immediately before text that one wants to mark with a string-range() won't automatically be
part of that range.

Given the underspecified functionality of string-range, the authors have made some
assumptions about implementation details. We have decided not to extend any existing XInclude
implementation. Instead, we have decided to use string-range only in a declarative fashion, as
a pointing mechanism within TEI, and we are developing XPath 2.0 functions that complement and
use string-range(). Where it declares a range, they will be able to retrieve that range. We
propose three functions, with the following signatures:
get-string-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which
  @xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a sequence of strings derived from text nodes or portions of text nodes
  between the pairs of points passed in as parameters.

get-milestone-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which
  @xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a sequence where elements have been converted to milestones (e.g.
  <p-start> and <p-end> instead of <p>).

get-fragment-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which
  @xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a well-formed document fragment, where elements split by the range have been
  automatically opened or closed.
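
To make the first of these signatures concrete, here is a minimal Python sketch of the text-node counting idea described earlier: walk the text nodes under the parent in document order, keep a running total of the characters already passed, and slice out the portions that fall between each pair of points. It is an illustration only and not the XSLT code referred to in this paper. The name get_string_range, the use of Python's xml.etree.ElementTree, the reading of each pair as a start point and an end point, and the choice to return one list of text portions per pair are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List


def get_string_range(parent: ET.Element, *offsets: int) -> List[List[str]]:
    """For each (start, end) pair of points, return the text-node portions
    that lie between them, in document order."""
    if len(offsets) < 2 or len(offsets) % 2:
        raise ValueError("offsets must be given as (start, end) pairs")
    results = []
    for start, end in zip(offsets[0::2], offsets[1::2]):
        pieces = []
        seen = 0  # combined length of the text nodes already passed
        for text in parent.itertext():  # text nodes in document order
            node_start, node_end = seen, seen + len(text)
            lo, hi = max(start, node_start), min(end, node_end)
            if lo < hi:  # this text node overlaps the requested range
                pieces.append(text[lo - node_start:hi - node_start])
            seen = node_end
        results.append(pieces)
    return results


# Hypothetical sample document, with whitespace declared significant.
SAMPLE = ET.fromstring(
    '<div xml:space="preserve"><p>Hello <hi>big</hi> world</p><p>Bye</p></div>'
)
print(get_string_range(SAMPLE, 3, 11))  # [['lo ', 'big', ' w']]
```

The range from point 3 to point 11 starts inside one text node, crosses the <hi> element, and ends inside a third text node, which is exactly the case where element boundaries are invisible to string-range. A milestone or fragment variant would need to record the element starts and ends passed during the same walk.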
An XSLT 2.0 stylesheet that implements these functions is under development at
http://github.com/hcayless/tei-string-range.

A fourth problem lies in the ease of use of the string-range function. Determining the
index location of a piece of arbitrary text in a TEI document is prohibitively difficult for a
human editor. It would be relatively easy to programmatically generate a string-range based on
a selected range in an XML editor, like oXygen, but without this kind of functionality, it
will be quite hard for someone marking up a document to create the expression with facility.
What is needed at a bare minimum is a means to mark range starts and ends, in an
editor-independent fashion, which can then be converted to string-range expressions. We
propose using processing instructions in the form
<?range-start r="n"?>/<?range-end r="n"?>, where "n" identifies a particular
range. Pairs of these will mark range starts and ends, and can be processed by an XSLT
stylesheet to create <linkGrp>s containing links that use string-range() to
identify the marked ranges.

Our implementation, then, consists of a simple way to create string-range() pointers using
an XSLT 2.0 stylesheet transformation and a set of functions that can be used to process the
data marked by a string-range() in the context of an XPath 2.0 processor. Using these
stylesheets it is possible, for example, to mark up ranges of text in a non-hierarchical way
and then generate a set of links denoting those ranges, to which additional standoff markup
may be linked, or one can convert a document with inline markup to one where a division
contains plain text and a second division contains markup and pointers to the text.

While the authors intend this effort to be a practical addition to the TEI’s arsenal of
tools, this kind of implementation raises theoretical questions that bring us back to the
question of the adequacy of inline markup. In the example below, taken from
http://github.com/hcayless/tei-string-range/blob/master/bgu.1.116.xml, a
transcription of a document written on papyrus from the Arsinoite nome in Egypt, some of the text
content in the edition <div> is readable in the original, and some has been
supplied by the editor.
This example is actually a fairly unproblematic one, since it does not contain any
alternate readings or editorial corrections or normalization. Yet even here there are
difficulties: “Θεμι” (as is clear in the Leiden version) contains two gaps and unclear text,
but since these visual features of the document are indicated using <gap/>
and <unclear/> tags, it looks like an undamaged word-fragment in the plain
text version. It must be noted that the traditional way of publishing these documents in print
employs inline markup. So, in this example at least, a plain text version would itself be a
somewhat misleading version of the document. This is not a refutation of Schmidt’s points,
because there are many other ways one could encode the document, using standoff markup, that
would mitigate this problem. But perhaps it suggests that there are at least some uses of
inline markup (when it encodes features of the text that cannot be expressed straightforwardly
in Unicode) that may be hard to replace.

The ability to extract the markup from the text and still preserve the manipulability it
previously enjoyed suggests some additional possibilities: one could now layer in name and
place information, lexical and grammatical analysis, structural information, such as line
containment, rather than just marking line beginnings, etc. Different views could be
generated, using these individually or using combinations of them. Nothing stops us from
layering these on top of inline markup either.

Since it relies on character offsets, any implementation of string-range() is inherently
somewhat brittle. The adoption of @xml:space by the TEI closes off one means by which links
using string-range could be broken, but can do nothing to mitigate the danger of someone
editing the text directly. Projects that use this mechanism will have to prevent the breakage
of string-range links either through workflow or editing environments that manage shifting
offsets.

We have already learned a good deal from our implementation efforts to date. If this
approach is something other users of TEI or even the TEI Consortium itself wishes to support,
there are several changes we would suggest. First, that the Guidelines be emended to contain a
more thorough specification of the TEI pointer schemes. Second, that a working group be formed
to look at practical implementations of standoff markup and at appropriate usage patterns for
these. We must note that the example stylesheet we provide to generate a text + standoff
markup version of a valid TEI document results in invalid TEI when applied to the bgu.1.116
example, because elements like <ex/> can only contain text, not pointers to
text. Moreover, if one wants to extract a string-range with the inline markup converted to
standalone elements, then again the result will not be valid TEI. We hope our efforts outlined
above will prompt some useful examination and perhaps revision of the TEI Guidelines'
perspective on standoff markup.

Bibliography

Burnard, L. and S. Bauman (eds), Text Encoding Initiative: P5 Guidelines,
http://www.tei-c.org/Guidelines/P5/ (2007).
DeRose, Steve, Eve Maler, and Ron Daniel Jr., XML Pointer Language (XPointer) Version 1.0,
http://www.w3.org/TR/WD-xptr (2001).
Schmidt, Desmond, The inadequacy of embedded markup for cultural heritage texts,
Literary and Linguistic Computing 25.2 (2010).