Proceedings

Three Ways to Enhance the Interoperability of Cross-References in TEI XML

Joel Kalvesmaki

Editor in Byzantine Studies

Dumbarton Oaks

Symposium on Cultural Heritage Markup
August 10, 2015

“Three Ways to Enhance the Interoperability of Cross-References in TEI XML” is licensed under a Creative Commons Attribution 4.0 International License.

How to cite this paper

Kalvesmaki, Joel. “Three Ways to Enhance the Interoperability of Cross-References in TEI XML.” Presented at Symposium on Cultural Heritage Markup, Washington, DC, August 10, 2015. In Proceedings of the Symposium on Cultural Heritage Markup. Balisage Series on Markup Technologies, vol. 16 (2015). DOI: 10.4242/BalisageVol16.Kalvesmaki01.

Abstract

Systems are 'interoperable' if each can work with products of the other with minimal external intervention. Semantic interoperability (exchange of underlying meaning not just syntax) is the goal. Currently supported TEI cross-reference mechanisms are typically not interoperable without extensive human intervention. I offer three practical ways to make standard TEI cross-references more semantically interoperable. The first is the deployment of Canonical Text Services URNs. The second is informal agreements among communities to adopt shared Schematron rules. Both of these methods can be implemented right now; the barriers are practical not technical. The third method is stand-off markup based on the Text Alignment Network, a planned TEI-friendly XML format for the interchange of aligned texts.

Table of Contents

Introduction
Standard Cross-References in TEI
TEI @cRef + Canonical Text Services URNs
TEI @cRef + Shared Schematron
TEI + Stand-off Markup
Conclusion

Introduction

Two systems are said to be interoperable if each is able to work with the parts or products of the other, with minimal if any external intervention. When applied to formats of digital texts, interoperability is of two kinds, syntactic and semantic.

Note

My distinction between and use of syntactic and semantic is congruent with that of the European Interoperability Framework (European Commission 2010, 23):

Semantic interoperability is about the meaning of data elements and the relationship between them. It includes developing vocabulary to describe data exchanges, and ensures that data elements are understood in the same way by communicating parties.

Syntactic interoperability is about describing the exact format of the information to be exchanged in terms of grammar, format and schemas.

Syntactic interoperability refers to consistency or completeness in encoding, markup, and related conventions attached to that markup. It generally implies the complete, lossless exchange of data, no matter its meaning. We witness syntactic interoperability every day that we use the Web. Major updated browsers accessing the data in any page written validly in a version of Hypertext Markup Language (HTML) will present different readers with the same content and roughly the same display. Likewise, in the realm of textual scholarship, files validly marked up with one of the Text Encoding Initiative (TEI) formats are, in general, syntactically interoperable. A valid TEI file created by one party can be shared with any other to be studied, processed, or otherwise used.

Semantic interoperability stands a level higher, and characterizes systems that can losslessly exchange not just the data but any associated or underlying meaning. For example, the UTF-8 string "France" may be syntactically interoperable with other systems that handle UTF-8, but for it to be semantically interoperable, the underlying significance or meaning, i.e., that the string represents the name of the country France, should also be preserved after exchange. Such semantics admit degrees of interest and importance. For example, in both HTML and TEI, <div> and <p> have some semantic meaning, but for most users that meaning is of little import or precision. HTML 5 has allowed a few other semantically interesting elements, e.g., <article>, but there are not many of these, thus keeping the vocabulary to fewer than 120 elements. In its more concerted effort to support scholarly concepts with markup, the TEI Consortium has produced many more and with even greater precision, e.g., <watermark> and <residence>, so that in its full schema TEI supports nearly 550 elements. TEI encourages projects and users to build on this effort by customizing the TEI to add their own semantically precise elements, or to remove ones that have no relevance to a given project.

But assigning an XML element to every possible concept of interest is impractical, even in a customized TEI scheme. Thousands of concepts could be encoded, but with what result? If an elemental vocabulary gets too large, it winds up being misunderstood or misused. Or it may legitimate interpretations that members of the community may regard as wrongly deviating from standard usage.

An alternative has emerged to making elements the main carrier of semantics. Known loosely and variously as linked data, open linked data, or the semantic web, this set of practices builds upon a recommendation of the World Wide Web Consortium (W3C) called the Resource Description Framework (RDF), a relatively simple data model that envisions data as a network of nodes connected by lines, termed rather misleadingly edges (http://www.w3.org/RDF/).

Note

In everyday usage, edge implies the juncture of two surfaces of one or more solid objects, with no implications for where that edge might begin or end, if it does at all. None of these sine qua nons for real-life edges have a place in the RDF appropriation of the metaphor. A newcomer may be forgiven for objecting that what is depicted looks like a line, not an edge.

Semantic web applications call for the use of uniform resource identifiers (URIs), or their internationalized counterparts, internationalized resource identifiers (IRIs), as the data content of nodes and edges to uniquely name things and concepts. These URIs are recommended to take the form of http:// uniform resource locators (URLs), so that further information about a thing or concept can be automatically retrieved. The method of transferring semantics thus shifts, from elements and attributes to the data they contain, namely URIs.

RDF conventions have been implemented in markup languages to varying degrees. Across the Internet, RDFa and other forms of structured markup (Microdata, Microformats) have been applied widely, helping HTML become a major vehicle for semantic interoperability. The Web is populated with billions of assertions that are semantically comparable.

Note

The University of Mannheim's Web Data Commons project, http://webdatacommons.org, conducts regular crawls of the entire Web. The project showed that in winter 2014 31% of HTML pages retrieved from 2.01 billion URLs (up from 26% of 2.24 billion in 2013) had some kind of structured markup, resulting in 20.5 billion RDF quads (RDF triples attached to a named graph; this figure is up from 17.2 billion in 2013). See http://webdatacommons.org/structureddata/2014-12/stats/stats.html and http://webdatacommons.org/structureddata/2013-11/stats/stats.html.

A comparable effort within TEI has remained largely nascent. In this article I argue that, whether or not TEI is capable of full semantic interoperability, it is capable of at least some, certainly much more than is currently outlined in the TEI Guidelines. Non-intrusive improvements could be made in several ways, resulting in rewards that far outweigh any extra work or requirements. Although a variety of features targeted by TEI for markup could be enhanced, I focus here on a relatively straightforward candidate for semantic interoperability, the cross-reference, particularly to well-known or frequently cited literature. These are good test candidates because the human-readable syntax of the cross-reference is rather controlled and simple (e.g., Homer, Iliad 1.1; Confucius, Analects 1.2.3.1). These canonical reference schemes, which are probably better termed standardized reference systems, or simply reference systems, can be easily and quickly understood and processed by humans independent of any individual version, corpus, or project. They would seem ideal for computer exchange.

Note

For a theoretical reflection on canonical or standardized reference numbers and their place in digital projects, see Kalvesmaki 2014.

In this article I offer three practical ways to make standardized references in TEI more semantically interoperable. The first of these, deployment of Canonical Text Services URNs, is somewhat well known but has not yet been broadly used in TEI cross-references. The second has, to my knowledge, not yet been tried at all, namely, informal communities agreeing to adopt Schematron files, to be added to the prolog of TEI files to standardize cross-references to a work that is frequently cited. My third and final approach shifts to stand-off markup, and I offer a model based upon the Text Alignment Network, a planned TEI-friendly XML format for the interchange of aligned texts.

Standard Cross-References in TEI

[B]ecause the choice of tags is guided by human interpretation, TEI-XML encoded files are in general not interoperable (Schmidt 2014)

Doubts about the interoperability of the XML format supported by the Text Encoding Initiative (TEI) have been voiced on numerous occasions, even within the flagship journal of the TEI, as in the quote above.

Note

See also Schmidt and the October 2014 discussions on the public TEI-L listserv, initiated by Roberto Rosselli Del Turco under the subject line "Interchange of TEI documents: examples?": https://listserv.brown.edu/archives/cgi-bin/wa.

Although the skeptics' richly complex counterexamples have persuaded me that XML and TEI are ill-equipped to handle assertions made by textual scholars when they are at their most expressive, I am also convinced that some of their most common, basic assertions could be made more semantically interoperable. The humble cross-reference is a good candidate. It is supported in TEI through several mechanisms, commonly @cRef, in tandem with <ptr> or <ref> (and sometimes supplemented by <cRefPattern>). But there are other ways as well. One could also use those elements with @target or @type. Or one could use <quote> along with @source. Other methods include the use of <link> and <linkGrp>, or even loose, unstructured mechanisms such as <bibl>. (The variety of options, as I shall argue, hampers interoperability.)

A few of these many options are discussed further in this paper. But for ease of discussion, I will concentrate on @cRef, presented in the TEI Guidelines as an ideal solution for an encoder who wishes to create a cross-reference to another work by means of a standardized or canonical reference. The relevant parts of the Guidelines, §3.10.4 and §16.2.5, although accurate, are disjoint, technical, and not clearly connected to everyday usage. So I present the material somewhat differently, from the perspective of the ordinary encoder who is putting a project together and doing the best to follow the recommended steps.

Note

All references to the TEI guidelines are based on version 2.8.0 of the P5 Guidelines, http://www.tei-c.org/Guidelines/P5/, last accessed 3 July 2015.

The Guidelines illustrate @cRef with the example of a text that quotes from the gospel of Matthew, chapter 5 verse 7 (Guidelines §16.2.5). Let us enhance this example by considering the needs of an encoder who is editing works by Anne Brontë and who has decided to encode explicit quotations, including the quotation from Matthew 5:7 that appears at chapter 5, paragraph 18 of Agnes Grey. Because our focus is on both syntax and semantics, let us assume that the encoder wishes to provide a cross-reference that will refer to as many versions of that text as possible, created independently by other encoders or projects, and will be as useful as possible to the maximum number of users, with a minimum of human intervention for processing the data. Let us also assume that all the TEI transcriptions that exist in the world are both discoverable and available. Of course, this is a terrible assumption to make in real life, but the problems associated with discoverability and availability are ubiquitous for this method and every other one, whether discussed in this article or not. Assessing those problems here would be repetitive and tangential to the main point, interoperability.

We turn to the Brontë encoder, who has prepared a plain TEI transcription of Agnes Grey, and now turns to marking up cross-references. Following the TEI guidelines, the encoder tags the quotation with <quote>. After seeing that only <gloss>, <term>, <ptr>, and <ref> support @cRef, the encoder ignores the first two. Upon further reading, particularly of the examples, the encoder, feeling that both <ptr> and <ref> are equally valid, decides that the markup is more of a reference than a pointer, and so adds <ref> nearby in a valid location. The relevant part of the TEI file might look like this:

.....
<div type="chapter" n="5">
   .....
   <p xml:base="•••••••••">‘But, for the child’s own sake, it ought not to be encouraged to have such amusements,’
      answered I, as meekly as I could, to make up for such unusual pertinacity.
      <said>‘<quote>“Blessed are the merciful, for they shall obtain mercy</quote><ref
      cRef="•••"/>.”’</said></p>
   .....
</div>
.....

The encoder has given @cRef and @xml:base dummy values because it is as yet unknown what kind of values are expected. A target Bible text must be chosen, and then it must be interrogated to find out what elements and attributes have been used, and with what values. So the encoder finds one in TEI format. After noting the URL, the encoder studies the file and finds that it has the following structure at the place quoted:

.....
<div n="Matt">
   .....
   <div type="chap" n="5">
      .....
      <ab type="v" n="7">Blessed are the merciful, for they will be shown mercy.</ab>
      .....
   </div>
   .....
</div>
.....

The encoder therefore replaces ••••••••• with the target URL (let's call it http://example.com/nt.xml) and replaces ••• with Matt 5:7. But the latter, being so far parsable only to humans, must be converted to something a computer can act upon. So the Brontë encoder, again following the Guidelines, adds a statement to the <teiHeader>, something like this:

<teiHeader>
   .....
   <encodingDesc>
      <refsDecl xml:id="biblical">
         <cRefPattern matchPattern="(.+) (.+):(.+)"
            replacementPattern="#xpath(//div[@n='$1']/div[@n='$2']/ab[@n='$3'])">
            <p>This pointer pattern extracts and references the <q>book,</q>
               <q>chapter,</q> and <q>verse</q> parts of a biblical reference.</p>
         </cRefPattern>
      </refsDecl>
   </encodingDesc>
   .....
</teiHeader>

Note

The program listing above departs slightly from the official example in the TEI Guidelines (§16.2.5), which uses #xpath(//div[@n='$1']/div[$2]/div[$3]), an XPath expression that assumes that verse labels and positions are isomorphic. That is a false assumption for most modern editions, which suppress or demote verses considered spurious without altering the canonical numbering. The @replacementPattern in my example also takes into account advice at §16.3 that Bible verses should be tagged <ab>.

This <cRefPattern> stipulates for any TEI processor that Matt 5:7 should be converted to the URL http://example.com/nt.xml#xpath(//div[@n='Matt']/div[@n='5']/ab[@n='7']).

The encoder's job finishes, and the work now moves to those who wish to process, publish, or study the data. This requires the use of some TEI-compliant and -aware processing mechanism, which will take the TEI elements and attributes that have been used for cross-referencing, resolve them to retrieve a string or document fragment, and then transform that data according to whatever purpose is intended. Although the end result differs widely from one processor to another, the initial, preparatory step is common across the board. All processors must be programmed to find instances of @cRef, take the string value, find a matching pattern in @matchPattern (in <cRefPattern>), create an XPath expression to be applied to the target XML file of Matthew (specified by @xml:base), and then retrieve the document fragment, for later transformation.
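The preparatory step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a TEI-sanctioned implementation: the function name resolve_cref is hypothetical, the whole @cRef value is assumed to have to match @matchPattern, and @xml:base is simply prefixed to the expanded pattern, as in the example above.

```python
import re

# Values copied from the running example; the URL is the placeholder used above.
XML_BASE = "http://example.com/nt.xml"
MATCH_PATTERN = r"(.+) (.+):(.+)"
REPLACEMENT_PATTERN = "#xpath(//div[@n='$1']/div[@n='$2']/ab[@n='$3'])"


def resolve_cref(cref, match_pattern, replacement_pattern, xml_base):
    """Expand a @cRef value into a URL; return None if the pattern does not match.

    Hypothetical helper: assumes the entire @cRef value must match.
    """
    match = re.fullmatch(match_pattern, cref)
    if match is None:
        return None
    # Replace $1, $2, ... with the corresponding captured groups.
    fragment = re.sub(
        r"\$(\d)",
        lambda m: match.group(int(m.group(1))),
        replacement_pattern,
    )
    return xml_base + fragment


print(resolve_cref("Matt 5:7", MATCH_PATTERN, REPLACEMENT_PATTERN, XML_BASE))
```

Retrieving the fragment itself would still require fetching the target file and evaluating the XPath expression with an XPath-capable library, which is left out here.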

But even in this preparatory stage, the processor requires some human intervention. Someone must first step in and configure it to address irregularities not found in other TEI files. The person configuring the processor must study the Brontë text and discern which elements have been used for cross-references, and with what kind of editorial consistency. Perhaps the configurer is surprised to find that the encoder chose <ref> instead of <ptr>, and that the former was left empty. Perhaps the configurer is surprised to find that the Brontë encoder was enamoured by the attraction of @cRef and ignored a simpler solution, that of <quote> with @source. Perhaps the encoder and configurer will engage in a spirited discussion as to the best use of TEI.

Perhaps the configurer and encoder are not on speaking terms, and <ref> stands. The configurer must interrogate the use of the element even further to determine what relationship any given <quote> and <ref> pair share. After all, the former could be the previous sibling, next sibling, parent, or child to the latter. (Of these four valid configurations, three are offered as examples in the TEI Guidelines.) The configurer might find that in a series of adjacent quotes it is difficult to tell which <quote> is paired with which <ref>, and the encoder may not have been consistent. The variety of options in TEI is the source of extra work for the person configuring the pre-processor. As Schmidt points out, in the quote above, the choice of an element, as well as its placement, is subject to human interpretation, and is therefore detrimental to interoperability.

Such a workflow also requires quite a lot of human intervention and interpretation at both stages (transcription, pre-processing configuration). And not only does it fail to preserve any data required for semantic interoperability, such as URNs, but it can scarcely be said to be even syntactically interoperable. The syntax of the values of @cRef and @replacementPattern is guaranteed to be applicable only to one quoting version and one quoted version. Any attempts to apply the data to other versions of the New Testament (reflected by, say, changing the value of @xml:base) must be preceded by checking the structure and contents of the new file. In addition, once @cRef is used this way, it becomes difficult to use the attribute to refer to works other than the New Testament.

Note

This is most acute when an encoder wishes to use @cRef to point to multiple works, a practice that would tax the limits of @xml:base.

All in all, @cRef as an interoperable cross-reference mechanism proves to be rather limited. It may be suitable for a single project depending upon specific files, but it is not prepared to handle a distributed network of independently created TEI files.

TEI @cRef + Canonical Text Services URNs

The limitations of @cRef prompt many TEI users to migrate to more complex TEI linking mechanisms (discussed below). But @cRef need not be abandoned so quickly. Its syntactic and semantic value can be enhanced rather easily through Canonical Text Services (CTS) URNs, a convention that defines a way to coin unique, computer-actionable references to literary works independent of individual versions. A description of the syntax of CTS URNs would take us too far afield, and one is easily found elsewhere.

Note

Discussed informally at http://www.homermultitext.org/hmt-doc/cite/cts-subreferences.html and defined formally at http://www.homermultitext.org/hmt-docs/specifications/ctsurn/. See also Kalvesmaki 2014, paras. 15-24. See esp. notes 12-17, where I register some concerns about the design of CTS URNs.

For the sake of the example adopted for this article, let us assume that the following CTS URN provides a unique reference to Matthew 5:7: urn:cts:greekLit:tlg0031.tlg001:5.7 (the Greek New Testament is catalogued by the Thesaurus Linguae Graecae as author number 0031, and Matthew as work number 001). This URN is said, by definition, to be valid for any version of Matthew.

Let us revisit the workflow of our example. Above we started with the Brontë encoder, and we placed no special requirements upon the TEI-compliant version of Matthew she or he used. But under the CTS URN method, the process has to start earlier, with the target text. Or rather, more precisely, a new participant is introduced as an intermediary between the New Testament encoder and the Brontë one, namely, a CTS server.

The person who administers a CTS server finds one or more TEI-compliant New Testament texts, and processes those texts, importing them into an RDF-compliant data store. During that process each segment of text is converted into RDF data that connects the text string with a CTS URN (in RDF terms, the latter would be the subject and the former the object). The data could be stored and served in any number of ways, for example as a relational database or as a SPARQL Protocol and RDF Query Language (SPARQL) endpoint.

Note

Whereas the architects of CTS have developed CTS as a SPARQL endpoint, Jochen Tiepmar, at the University of Leipzig, has deployed a CTS server as a MySQL database. See https://github.com/cite-architecture/sparqlcts and http://www.culingtec.uni-leipzig.de/ESU_C_T/node/471

The CTS administrator makes all the text available to queries in an application program interface (API) and creates and publishes a method for searching the CTS data store, so that anyone who submits a CTS URN will get in return one or more spans of text (provided that the intended text is in the CTS server).

In our example, we start with an administrator of a CTS server, who finds a TEI New Testament. After interrogating the data structure, the administrator imports the verses of the New Testament, along with their proper CTS URNs into the service. The administrator publishes specifications for the API that state that any queries should target the URL http://ctsservice.example.com/text, add a question mark, then the CTS URN.

Work shifts to the Brontë transcriber, who now does not need to study the structure of any particular New Testament text. All he or she needs to do is get the base URL for the CTS service, follow the specifications for the API, and encode the novel accordingly, e.g.:

.....
<div xml:base="http://ctsservice.example.com/text?">
   <p>‘But, for the child’s own sake, it ought not to be encouraged to have such amusements,’
      answered I, as meekly as I could, to make up for such unusual pertinacity.
      ‘<quote>“Blessed are the merciful, for they shall obtain mercy.”</quote><ref
      cRef="urn:cts:greekLit:tlg0031.tlg001:5.7"/>’</p>
.....

This particular CTS URN points to every version of the New Testament held in a particular CTS service. But if the Brontë encoder knows that the quotation is from a specific version of Matthew, say a handwritten diary, and finds that version available in a CTS service, the value of @cRef can simply be narrowed further, e.g., urn:cts:greekLit:tlg0031.tlg001.diaryA:5.7.
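The difference between the general and the version-specific URN can be made visible by splitting each into its colon- and dot-delimited parts. The following Python sketch is an assumption-laden illustration, not a full CTS URN parser: the names CtsUrn and parse_cts_urn are hypothetical, and exemplar identifiers, passage ranges, and subreferences are deliberately ignored.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Hypothetical container for the parts of a CTS URN used in this article."""
    namespace: str
    text_group: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn):
    """Split a CTS URN into namespace, text group, work, version, and passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    # The work component is dot-delimited: textgroup.work.version
    # (any fourth, exemplar-level part is dropped in this sketch).
    work_parts = parts[3].split(".")
    text_group, work, version = (work_parts + [None, None])[:3]
    passage = parts[4] if len(parts) == 5 else None
    return CtsUrn(parts[2], text_group, work, version, passage)


general = parse_cts_urn("urn:cts:greekLit:tlg0031.tlg001:5.7")
specific = parse_cts_urn("urn:cts:greekLit:tlg0031.tlg001.diaryA:5.7")
print(general.version, specific.version)
```

A processor could use the presence or absence of the version part to decide whether one text or many should be expected in return.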

The two attributes @xml:base and @cRef are all that is required of the transcriber. The syntax of the CTS URN renders <cRefPattern> unnecessary.

The work now shifts to the person configuring the processor, who still must interrogate the Brontë text, to see how elements and attributes have been used for cross-referencing. But once that is accomplished, the processor can be preconfigured by simply concatenating @xml:base and @cRef. Before sending this request to the CTS service, the configurer may wish to restrict the number of versions returned, which is simple enough: the value of @cRef or the SPARQL query is changed to specify the version or versions intended. The text or texts that are returned from the CTS service are then ready for transformation.
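As a minimal sketch of that configuration step, assuming the hypothetical API described above (base URL, question mark, then the CTS URN), the request can be built by plain string concatenation. The function names cts_request_url and narrow_to_version are illustrative only, and the narrowing helper assumes a URN whose work component has exactly the form textgroup.work.

```python
def cts_request_url(xml_base, cref):
    """Concatenate @xml:base and @cRef, as the hypothetical API above specifies."""
    return xml_base + cref


def narrow_to_version(urn, version):
    """Restrict a work-level CTS URN to one version by appending a version identifier."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"expected a CTS URN with a passage: {urn!r}")
    if parts[3].count(".") != 1:
        raise ValueError(f"expected a textgroup.work component: {parts[3]!r}")
    parts[3] = f"{parts[3]}.{version}"
    return ":".join(parts)


base = "http://ctsservice.example.com/text?"
urn = "urn:cts:greekLit:tlg0031.tlg001:5.7"

print(cts_request_url(base, urn))
print(cts_request_url(base, narrow_to_version(urn, "diaryA")))
```

Note that this follows the concatenation described in the text; it does not perform general relative-URI resolution against @xml:base.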

Under this method, the amount of work required of the transcriber and the pre-processor is reduced considerably. The transcriber does not need to know anything about regular expressions, XPath, or replacement patterns. The person configuring the processor does not need to rewrite any preprocessing stylesheets. The syntactic and semantic interoperability of the Brontë TEI file is increased significantly. The syntactic irregularities inherent in the customary use of @cRef are eliminated by the CTS specifications, which dictate exactly how a valid URN must be constructed. And a new level of semantic interoperability not traditionally part of TEI files has been introduced. In that single CTS URN, one has a machine-actionable name not only for a particular passage but for a collection, a work, or, possibly, a specific version. The Brontë encoder has not only pointed to a specific set of texts in a CTS service, but has uniquely named both a work (gospel of Matthew) and a specific part of that work (5:7). That URN can be used by any other system that is CTS URN-aware to collate the assertion governed by @cRef into heterogeneous datasets. And that means that the cross-reference declared in the TEI file of the Brontë transcription has now been released to the semantic web.

This approach to cross-references assumes, of course, that a quoted text is available in a CTS service, an assumption we made at the outset (see above). But the need to have an available CTS server is a reminder that this method introduces a major step into the workflow, and an added point of possible failure in data processing. The relationship between source text, cross-reference, and target text is now mediated. In addition, the extra labor on the part of the CTS administrator is not to be underestimated. CTS services require software packages (e.g., SPARQL endpoints) that must be configured and maintained, requiring server administrator skills well beyond simply uploading a plain XML file to a public server. The average TEI encoder who has a basic website is not likely to be ready to administer a CTS server. There are also, at this time, few examples of CTS services, and only as that number grows will the specifics of other opportunities and shortcomings be made clear.

TEI @cRef + Shared Schematron

At the heart of a CTS URN is a familiar, standardized canonical reference system that has been transformed into a syntactically regularized string, to bridge independently created texts. Another way a community of encoders and projects can exploit so-called canonical references in the name of interoperability is to transform standardized references into an agreed controlled vocabulary, and then to specify the rules for that vocabulary with a Schematron file. Anyone choosing to use the convention need merely add a reference to the Schematron file in the head of their TEI documents. This inclusion not only tells other users that the shared cross-reference system has been adopted, but, in the validation process, can weed out bad values and provide contextual help to the TEI encoder who may not know all the rules for the cross-reference system.

Note

The method advocated below resembles somewhat the constraints applied by the schemas developed for the Mary Baker Eddy Library, which regulates the syntax of cross-references within a single corpus to a variety of works. For documentation see http://www.wwp.neu.edu/outreach/seminars/mbel/TEI_development/schemas/mbel.odd; http://www.wwp.neu.edu/outreach/seminars/mbel/TEI_development/schemas/mbel.doc.html#att.pointing; and http://www.wwp.neu.edu/outreach/seminars/mbel/TEI_development/schemas/mbel.isosch. But whereas the Mary Baker Eddy schema focuses on the needs of a single project dealing with multiple works, in this section I deal with the inverse: multiple projects trying to interoperably quote a single work, no matter the specific version.

This method starts further upstream than either the Brontë encoder or a putative CTS server. It begins with the community that wishes to make Matthew and the rest of the New Testament (maybe the Bible in general) open to standardized cross-references. Out of that community a person or project (or perhaps a TEI special interest group) agrees to host and maintain master versions of the schema files. The community agrees to create a pair of Schematron files, one to regulate transcriptions of the New Testament, the other, transcriptions of texts that quote from the New Testament.

The first file defines the structure of the New Testament text and permissible values. Let us suppose the community has agreed that any New Testament transcription should have three levels of <div>, one for books, one for chapters, and one for verses. They also agree on a set of abbreviations that should be used for the names of the books. They envision transcriptions of the New Testament having a TEI <text> that looks something like this:

<text>
   <body>
      <div n="Mt">
      .....
         <div n="5">
         .....
            <div n="7">
               <p>μακάριοι οἱ ἐλεήμονες, ὅτι αὐτοὶ ἐλεηθήσονται.</p>
            </div>
         .....
         </div>
      .....
      </div>
      .....
   </body>
</text>

To enforce this structure, the community encodes assorted rules in the first of the two Schematron files. For example, this rule defines permissible book abbreviations:

<rule context="tei:div">
   <let name="hierarchy" value="count(ancestor::tei:div) + 1"/>
   <report test="$hierarchy = 1 and not(matches(@n,'^(Mt|Mk|Lu|Jn|Ac|
      Ro|1Co|2Co|Gal|Eph|Php|Col|1Th|2Th|1Tim|2Tim|Tit|Phm|
      Heb|Jam|1Pe|2Pe|1Jn|2Jn|3Jn|Jud|Re)$','x'))"
      >Book value must be one of the following: Mt, Mk, Lu, Jn, Ac, Ro, 1Co, 2Co, Gal, Eph,
      Php, Col, 1Th, 2Th, 1Tim, 2Tim, Tit, Phm, Heb, Jam, 1Pe, 2Pe, 1Jn, 2Jn, 3Jn, Jud,
      Re.</report>
   .....
</rule>

The example above concisely specifies that the first-level <div>s (those at the book level in the hierarchy) must have values of @n that draw from one of the abbreviations adopted by the community for the twenty-seven books of the New Testament. In the case of Matthew, the agreed abbreviation is Mt.

This <report> is but one of many that could be declared within the same <rule>. Another could include a specification as to the number of chapters allowed in a particular book. This next <report> specifies that the second level <div>s pertaining to the book of Matthew must be numbered 1 through 28:

<report test="$hierarchy = 2 and ../@n ='Mt' and @n and 
   not(matches(@n,'^([1-9]|1[0-9]|2[0-8])$'))">Mt has a maximum 
   of 28 chapters.</report>

The verse numbers too can be defined, as here, which specifies that verse numbers for Matthew 5 fall from 1 through 48:

<report test="$hierarchy = 3 and ../../@n = 'Mt' and ../@n = '5' and 
   @n and not(matches(@n,'^([1-9]|[1-3][0-9]|4[0-8])$'))">Mt 5 takes 
   verses 1 through 48.</report>
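The same three constraints (book abbreviation, chapter range, verse range) can be restated outside Schematron, for instance by a processor that wants to sanity-check a reference before resolving it. The Python sketch below is only an analogue of the rules above, not a replacement for them; check_reference and the two lookup tables are hypothetical, and the tables contain only the figures given in this article (28 chapters for Mt, 48 verses for Mt 5).

```python
import re

# The twenty-seven abbreviations agreed by the hypothetical community above.
BOOKS = frozenset(
    "Mt Mk Lu Jn Ac Ro 1Co 2Co Gal Eph Php Col 1Th 2Th 1Tim 2Tim Tit Phm "
    "Heb Jam 1Pe 2Pe 1Jn 2Jn 3Jn Jud Re".split()
)
# Deliberately incomplete: only the values stated in the article.
MAX_CHAPTER = {"Mt": 28}
MAX_VERSE = {("Mt", "5"): 48}


def _in_range(n, maximum):
    """True if n is a positive integer string without leading zeros, at most maximum."""
    return re.fullmatch(r"[1-9][0-9]*", n) is not None and int(n) <= maximum


def check_reference(book, chapter, verse):
    """Return a list of problems found in a (book, chapter, verse) triple of @n values."""
    if book not in BOOKS:
        return [f"Unknown book abbreviation: {book}"]
    problems = []
    max_chapter = MAX_CHAPTER.get(book)
    if max_chapter is not None and not _in_range(chapter, max_chapter):
        problems.append(f"{book} has a maximum of {max_chapter} chapters.")
    max_verse = MAX_VERSE.get((book, chapter))
    if max_verse is not None and not _in_range(verse, max_verse):
        problems.append(f"{book} {chapter} takes verses 1 through {max_verse}.")
    return problems


print(check_reference("Mt", "5", "7"))
print(check_reference("Mt", "5", "49"))
```

Values are kept as strings because that is how @n arrives from the XML; books or chapters missing from the tables are simply not range-checked.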

Furthermore, let us suppose that this community agrees with many modern text editors that certain verses should be deprecated, but they do not wish to render a text that includes them as being invalid. For example, Matthew 18:11, widely regarded as spurious, could be flagged in a report, but merely as a warning:

<report test="$hierarchy = 3 and ../../@n = 'Mt' and ../@n = '18' and @n='11'"
   role="warning">Most critical editions suppress Mt 18.11 as spurious.</report>

Perhaps most important of all, the schema file can declare that every <div> should have values of @n such that every <div> furthest from the root is uniquely citable, what I call the Leaf Div Uniqueness Rule:

<pattern>
   <let name="leafdiv-flatrefs"
      value="for $i in (//tei:div[not(descendant::tei:div)]) return 
      string-join($i/ancestor-or-self::tei:div/@n,' ')"/>
   <rule context="tei:div">
      .....
      <let name="this-ref" value="string-join(./ancestor-or-self::tei:div/@n,' ')"/>
      .....
      <report
         test="not(descendant::tei:div) and count(index-of($leafdiv-flatrefs,$this-ref)) > 1"
            >Canonical references must be unique. </report>
  </rule>
</pattern>

The <pattern> above binds to the variable $leafdiv-flatrefs a sequence of canonical references for all leaf <div>s. Each item in the sequence is a string made up of all the @n values of a leaf <div> and its ancestors joined by a delimiter, e.g., Mt 5 7. Each item must be unique to the sequence, a rule that is checked by the <report>. If it is not, the duplicate leaf <div>s are marked as invalid. Enforcement of the Leaf Div Uniqueness Rule allows chains of @n joined vertically along an XML hierarchy to act as an ID, one that economically follows the standardized (canonical) reference systems that are familiar to human encoders.
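The logic of the Leaf Div Uniqueness Rule can also be restated outside Schematron. The following is a minimal sketch in Python, not part of the shared schema files; the function names leaf_div_refs and duplicate_refs and the sample document are assumptions made for illustration. It flattens each leaf <div> and its ancestors into a reference string and reports any string that occurs more than once.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DIV = "{http://www.tei-c.org/ns/1.0}div"

# Hypothetical sample: chapter 24 is split around chapter 30 (allowed),
# but verse 22 of chapter 24 appears twice (not allowed).
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div n="Prov">
    <div n="24"><div n="1"/><div n="22"/></div>
    <div n="30"><div n="1"/></div>
    <div n="24"><div n="23"/><div n="22"/></div>
  </div>
</body></text></TEI>
"""


def leaf_div_refs(root):
    """Return one flattened reference (e.g. 'Prov 24 1') per leaf <div>, in document order."""
    refs = []

    def walk(elem, trail):
        if elem.tag == DIV:
            n = elem.get("n")
            if n is not None:
                trail = trail + [n]
            # A leaf <div> has no descendant <div>.
            if elem.find(".//" + DIV) is None:
                refs.append(" ".join(trail))
                return
        for child in elem:
            walk(child, trail)

    walk(root, [])
    return refs


def duplicate_refs(root):
    """Return the flattened references shared by more than one leaf <div>."""
    counts = Counter(leaf_div_refs(root))
    return sorted(ref for ref, count in counts.items() if count > 1)


if __name__ == "__main__":
    print(duplicate_refs(ET.fromstring(SAMPLE.strip())))
```

In the sample, the repeated @n of the split chapter raises no complaint; only the repeated leaf reference is reported.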

Note

The uniqueness rule must apply only to leafmost <div>s because there are cases where a <div> midlevel in the hierarchy is intentionally split. For example, in the Greek Septuagint (LXX) version of Proverbs, the 30th chapter is split, and interleaved with the two halves of chapter 24 (24.1 - 24.22e [22a - 22e are LXX verses not extant in the Hebrew]; 30.1 - 30.14; 24.23 - 24.34; and 30.15 - 30.33). In this case the @n values of the two halves of each split chapter <div> must be identical. This also explains why the report is tested not against a leafmost <div>'s siblings (which may be only a partial selection of the siblings defined by the reference system) but against the entire sequence of leafmost <div>s.

The Rule also preserves the hierarchical organization of texts intuitive to humans and ensures that @n has little if any repetition.

Note

Such repetition is found in alternate approaches such as those that use @xml:id in the leafmost <div>, e.g., <div xml:id="Mt.5.7">, where Mt and 5 could have been inferred from the ancestors' @xml:id values. Abbreviations of book names and chapter numbers would need to be repeated for all ca. eight thousand verses of the New Testament.

We turn now to the second part of the pair of shared Schematron files, that pertaining to the quoting text and the syntax of the cross-reference. Here rules are superimposed upon @cRef (or @source or @ref). The community anticipates that the attribute might be used for multiple space-delimited cross-references, and to works other than the New Testament. They anticipate complex quoting files that might look something like this (illustrating the work of an encoder who wishes to add cross-references outside the New Testament, here to Proverbs 11:17):

.....
   <div type="chapter" n="5">
      <p n="18">‘But, for the child’s own sake, it ought not to be encouraged to have such
         amusements,’ answered I, as meekly as I could, to make up for such unusual
         pertinacity. ‘<quote>“Blessed are the merciful, for they shall
         obtain mercy.”</quote><ref cRef="NT.Mt.5.7 HebB.Prov.11.17"/>’</p>
   </div>
.....

The community therefore defines both a prefix for the work (NT) and some character to be used as a delimiter (here a period, but many other nonspacing, nonword characters would also serve). And the community specifies that every value of @cRef that begins with the reserved prefix should construct the cross-reference according to the established rules. For example, this next rule specifies that the second element of any New Testament cross-reference (e.g., the Mt in NT.Mt.5.7) should be one of the acceptable book abbreviations:

<pattern>
   <rule context="@cRef">
   <let name="delimiter" value="'\.'"/>
   <let name="these-refs" value="tokenize(.,'\s+')"/>
   <let name="invalid-books"
      value="for $i in $these-refs return
      if(matches($i,concat('^NT',$delimiter)) 
         and not(matches(tokenize($i,$delimiter)[2],'^(Mt|Mk|Lu|Jn|Ac|
            Ro|1Co|2Co|Gal|Eph|Php|Col|1Th|2Th|1Tim|2Tim|Tit|Phm|
            Heb|Jam|1Pe|2Pe|1Jn|2Jn|3Jn|Jud|Re)$','x')))
      then true()
      else false()"/>
   <report test="some $i in $invalid-books satisfies $i = true()">Error in cross-reference
      no. <value-of select="index-of($invalid-books,true())"/>. Book value must be one of the
      following: Mt, Mk, Lu, Jn, Ac, Ro, 1Co, 2Co, Gal, Eph, Php, Col, 1Th, 2Th, 1Tim, 2Tim,
      Tit, Phm, Heb, Jam, 1Pe, 2Pe, 1Jn, 2Jn, 3Jn, Jud, Re, separated from subsequent values by
      this delimiter: <value-of select="replace($delimiter,'\\','')"/></report>
   .....
   </rule>
</pattern>

Under this <rule>, every @cRef is tokenized into a sequence of space-delimited cross-references, assigned to the variable $these-refs. Another variable checks the ones that begin with NT, and makes sure that the next part (defined by the delimiter, the period) is one of the acceptable abbreviations for a New Testament book. If any value does not conform, that @cRef is marked as invalid, and a message is returned, indicating which cross-reference is faulty, as well as a list of acceptable values and the delimiter that should be used to separate parts of a cross-reference.
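The same check can be sketched on the processor side. The following Python fragment is an illustration only, not the community's rule; the names NT_BOOKS and invalid_nt_refs are assumptions, and the prefix, delimiter, and abbreviations are taken from the example above. It returns the positions of the New Testament cross-references whose book abbreviation is not recognized, and ignores references that carry some other prefix.

```python
NT_BOOKS = {
    "Mt", "Mk", "Lu", "Jn", "Ac", "Ro", "1Co", "2Co", "Gal", "Eph", "Php",
    "Col", "1Th", "2Th", "1Tim", "2Tim", "Tit", "Phm", "Heb", "Jam", "1Pe",
    "2Pe", "1Jn", "2Jn", "3Jn", "Jud", "Re",
}


def invalid_nt_refs(cref, prefix="NT", delimiter="."):
    """Return 1-based positions of prefixed cross-references with an unknown book."""
    invalid = []
    for position, ref in enumerate(cref.split(), start=1):
        # Only references that begin with the reserved prefix are checked.
        if not ref.startswith(prefix + delimiter):
            continue
        book = ref.split(delimiter)[1]
        if book not in NT_BOOKS:
            invalid.append(position)
    return invalid


if __name__ == "__main__":
    print(invalid_nt_refs("NT.Mt.5.7 HebB.Prov.11.17"))
    print(invalid_nt_refs("NT.Matt.5.7 HebB.Prov.11.17 NT.Lk.6.36"))
```

In the second call, Matt and Lk are rejected because the agreed abbreviations are Mt and Lu, while the reference to Proverbs is left alone.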

Other reports that are found in the first Schematron file can be replicated here as well. For example, allowable chapter and verse numbers can be specified (examples suppressed here for the sake of brevity). That second shared Schematron file could also specify exactly where the <ref> should be placed relative to the quotation:

   .....
   <report test="$this-val[1] = 'NT' and not(name(../preceding-sibling::*[1]) = 'quote')">An
      element containing @cRef must come immediately after the closing tag of the matching
      quote element.</report>
   .....

This report specifies that the element containing @cRef must be the very next sibling of its corresponding <quote>. This test removes the guesswork as to where a quotation's cross-reference is to be found, and so saves some labor on the part of the person configuring a processor.
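A processor can rely on the same placement rule to pair each cross-reference with its quotation. The following Python sketch is an assumption-laden illustration: the variable $this-val is not defined in the excerpt above, so the sketch simply treats the part of @cRef before the first delimiter as the prefix, and the names local_name and misplaced_refs are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def misplaced_refs(root, prefix="NT", delimiter="."):
    """Return @cRef values that use the prefix but do not directly follow a <quote>."""
    misplaced = []
    for parent in root.iter():
        previous = None
        for child in parent:
            cref = child.get("cRef")
            if cref is not None and cref.split(delimiter)[0] == prefix:
                # The immediately preceding sibling element must be <quote>.
                if previous is None or local_name(previous.tag) != "quote":
                    misplaced.append(cref)
            previous = child
    return misplaced


GOOD = ('<p xmlns="http://www.tei-c.org/ns/1.0">text '
        '<quote>Blessed are the merciful</quote>'
        '<ref cRef="NT.Mt.5.7 HebB.Prov.11.17"/></p>')
BAD = ('<p xmlns="http://www.tei-c.org/ns/1.0"><ref cRef="NT.Mt.5.7"/> text '
       '<quote>Blessed are the merciful</quote></p>')

if __name__ == "__main__":
    print(misplaced_refs(ET.fromstring(GOOD)))
    print(misplaced_refs(ET.fromstring(BAD)))
```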

The blocks of code in the examples above are not necessarily computationally efficient, nor do they necessarily represent the best use of TEI elements. They merely illustrate the types of patterns and rules a community of practice might embrace. Once the community has established their rules, the two master Schematron files are posted in a central location. The community has the freedom to update those rules as the community learns what works and what doesn't, and the updates benefit every user.

Now work shifts to the two different communities of transcribers. The first consists of those who wish to provide a citable transcription of the New Testament. They begin by adding to a pre-existing TEI file an extra prolog statement, for example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_lite.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_lite.rng"
      schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="http://example.org/schemas/nt/1.0/nt-quotable.sch" 
      schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   .....
</TEI>

The transcriber runs the validator, and might find that the once-valid TEI file is now rendered invalid, because it does not follow the new rules precisely. But the explanations provided by the error messages will advise the transcriber on how and where to alter the file to make it valid, so it can be made interoperable with all others.

Note

In fact, the Schematron file could be supplied with Schematron Quick Fixes, which in SQF-aware XML processors would allow the invalid data to be corrected with just two clicks or keystrokes, or even automatically. See http://www.schematron-quickfix.com/.

The transcriber complies, and corrects the transcription.

We now turn to the Brontë encoder, who, like the New Testament transcribers, adds a prolog:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
	schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="http://example.org/schemas/nt/1.0/quoting-nt.sch" 
      schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   .....
</TEI>

And once again, the encoder runs the validator, and the extra Schematron pattern is used to see if the citations to the New Testament conform to the rules agreed upon by the community. If there are any errors, the message specifies exactly where and for what reason. The Brontë encoder edits the file until there are no more error messages.

This process can be repeated as often as one wants, upon any version of any text, whether quoting or quoted. In fact, they can be combined in the same file, to allow a New Testament to be marked with internal cross-references. No matter the context, the Schematron reports steer the transcriber into the (usually small) fixes that need to be made. @cRef alone is sufficient to declare the cross-reference. Neither @xml:base nor <cRefPattern> is necessary. An @xml:base could be supplied, if so desired, but the @cRef is now applicable to any version of the New Testament that adopts the shared Schematron files.

Work now turns to the processor to do something with the cross-reference. Here, because the structure of every New Testament TEI file has been precisely defined (as a series of tessellated <div>s), very little human intervention is needed. Or, rather, the type of human intervention shifts, primarily to deciding which and how many of the available versions of the New Testament should be processed (compare the same wealth of riches in the CTS method). Once a processor is configured to handle these user-defined cross-references to the New Testament, it can be used on any valid file that also uses them, with no extra work. Naturally, this applies only to the preprocessing phase. How exactly that data will be used (display, statistics, etc.) is determined by what users want.

This method greatly improves both the syntactic and semantic interoperability of TEI files. It requires no new infrastructure, and it supports both customized and standard TEI schemas. The shared Schematron files provide structure and predictability (a controlled vocabulary for cross-references) in areas where encoders most want it. As with CTS, a middleman has been introduced, but it is rather simple and benign: two relatively small Schematron files made available by HTTP request that will normally be cached by users on their local drive for day-to-day work. So maintenance and overhead are rather light.

Note too that the shared Schematron files can be used on TEI Lite, TEI All, or even customized TEI. No one has to use the same version of TEI in order to make New Testament references interoperable. The validation files do not preclude any other markup within a leaf <div>. They can be used on any version of the New Testament, partial or complete, in any language, and the books or chapters need not be in a specified order (thereby accommodating unusual editions that adopt alternative orders of the books of the New Testament).

Furthermore, this effort could be extended outside the TEI realm. That same community might create variations of the Schematron file pairs for XHTML 1, thereby allowing web pages to serve as host to syntactically and semantically interoperable transcriptions of New Testaments, or of texts quoting the New Testament.

But this general method also has a few major problems. It might work fine for heavily quoted works, but what about less frequently quoted ones? Organizing a community of practice to agree on rules might be difficult if not impossible for some texts (including, ironically, the Bible). Further, how would reserved keywords (here, NT) within the value of @cRef be minted without conflict? What happens in the case of duplicate or ambiguous prefixes adopted by independent communities? Such questions should be regarded not as reasons for abandonment but as problems that can and should be solved. But those solutions go beyond the scope of this article.

Note

The problem of conflicting prefixes could be solved if they were handled like namespace prefixes. But such "work prefixes" would require new specifications in the TEI Guidelines, to ensure the integrity of the method.

A central problem in these questions is to distinguish real objections from the theoretical, but such discernment would require experimentation and real-world examples, to see what works and what doesn't.

TEI + Stand-off Markup

The three methods discussed so far assume cross-references that are embedded within a transcription. Such inline annotation is the most common way an encoder points from one text to another, not just in TEI but also in HTML. But the TEI Guidelines (§§16.9-16.10) provide for an alternative approach, stand-off markup, where linking and cross-referencing are placed in a file separate from the transcriptions. Such stand-off markup or annotation has a few drawbacks, the most immediate being that it is difficult to see the text to which an annotation applies, either because the files must be navigated and edited independently or because the semantics in the pointing scheme may be difficult for a human to parse (character counting, complex or opaque XPath expressions, etc.). But stand-off markup also has great benefits. It allows multiple complementary or competing annotations to be made of the same base transcription; stand-off markup files can be created, edited, and served independently of any source texts; it facilitates a division of labor that allows transcribers and annotators to focus independently and concurrently on their discrete tasks.

The current specifications of the TEI Guidelines provide for a specific method of stand-off markup. It presumes that one or more transcription files are to be found somewhere, and an external aligning file stands apart from them. That external file can point to the source files either by means of XInclude elements (explained at TEI Guidelines §16.9) or by using @target with <ptr>, <ref>, or <link> (TEI Guidelines §§16.2, 16.7). Common to all these methods is a reliance upon the TEI XPointer scheme, which provides a precise, stable, and expressive reference system that follows a straightforward, consistent syntax. The following examples show two different ways to create a stand-off cross-reference from the Brontë novel's quotation to the New Testament:

.....
<linkGrp>
    <link target="http://example2.com/agnesgray.xml#xpath(//div[@n='5']/p[18])
    http://example.com/nt.xml#xpath(//div[@n='Matt']/div[@n='5']/div[@n='7'])"/>
</linkGrp>
.....

.....
<body>
    <div>
        <include href="http://example2.com/agnesgray.xml" xmlns="http://www.w3.org/2001/XInclude"
          xpointer="range(xpath(//div[@n='5']/p[18]))"/>
        <include href="http://example.com/nt.xml" xmlns="http://www.w3.org/2001/XInclude"
          xpointer="range(xpath(//div[@n='Matt']/div[@n='5']/div[@n='7']))"/>
    </div>
 </body>
.....

Other examples using <ptr> or <ref> would look similar to the first one above. The XPointer framework stands at the heart of them all, pinpointing the precise node or document fragment that is meant. But as currently constructed, this XPointer scheme shares with @cRef a lack of semantics behind the syntax. That is, no information about the meaning of a particular node is built into the XPointer scheme. For the examples above, there is no way to imply in the XPath fragment div[@n='Matt'] that the div means a book and that the @n means the name of that book. In addition, this fragment has currency only within a specific TEI file. Its interoperability is as limited as @cRef was shown to be above, since the XPointers are not guaranteed to have any validity for other versions of the same work. For every new version of Matthew or Agnes Grey that the encoder wishes to include, the file structure must be interrogated and a new XPointer expression created.

I propose a different approach to stand-off cross-references, one that relies upon semantically defined alignment. My proposal shares points with the previous two methods (CTS URNs and community-written Schematron files) but is more extensive in scope, anticipating an ecosystem of scholarly texts in which stand-off markup is the norm for all types of annotations, not simply cross-references. This ecosystem is the goal of a project that is still in development, the Text Alignment Network (TAN; http://textalign.net), a suite of XML encoding formats and set of recommended best practices to serve anyone who wishes to encode, exchange, and study varieties of text reuse: translations, quotations, paraphrases, adaptations, summaries, and so forth. In this section I use fragments of examples created in the TAN format to illustrate how stand-off annotation might be used to maximize the syntactic and semantic interoperability of the cross-reference.

Note

Because the TAN format is still under development, examples provided in this article may be rendered invalid in any public release.

Methods discussed above moved the beginning of the encoding workflow earlier, either to a new network of CTS servers or to communities of practice coming up with their own Schematron files. Under the TAN method, work begins with what I hope will become an informal community that actively develops and maintains TAN validation schemas, documentation, and examples, and that houses those files in a central repository.

To make the format maximally useful to TEI users, TAN defines a minor customization of the TEI All schema, introducing a few constraints. Every transcription file must:

  1. be dedicated exclusively to a normalized text of one version of one work found on one text-bearing object;

  2. be uniquely named;

  3. uniquely name the work that has been transcribed;

  4. segment the transcription of the work into a series of nested <div>s. Each <div> must:

    1. contain other <div>s or no <div> at all;

    2. take @type and @n, specifying the type of division and its name;

    3. observe the Leaf Div Uniqueness Rule (explained above).

  5. define every metadatum with both human-readable names and machine-readable ones (URI/IRIs).

There are some other constraints, but they are not central to this discussion. The five rules above mean that every TAN-compliant TEI transcription, whether quoting or quoted, will have a regularized, sometimes predictable structure. That structure does not preclude extra TEI markup within leaf <div>s, but such markup is likely to be ignored by TAN users, since they are interested in TEI files primarily as a source of normalized, well-segmented transcriptions. Extra markup, such as nuanced, complex cross-references, is expected to be found in a separate file.

So, coming back to our example, we start with the transcriber of Agnes Grey, who makes a few adjustments to the TEI file (explained below):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.sch" type="application/xml" 
   schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" TAN-version="1" id="tag:textalign.net,2015-04-07:test1">
   <teiHeader>
      .....
   </teiHeader>
   <head xmlns="tag:textalign.net,2015:ns">
      .....
      <declarations>
         <work>
            <IRI>http://dbpedia.org/resource/Agnes_Grey</IRI>
            <name>Agnes Grey</name>
         </work>
         <div-type xml:id="chapter">
            <IRI>http://dbpedia.org/resource/Chapter_(books)</IRI>
            <name>chapter</name>
         </div-type>
         <div-type xml:id="p">
             <IRI>http://dbpedia.org/resource/Paragraph</IRI>
             <name>paragraph</name>
         </div-type>
         .....
      </declarations>
      .....
   </head>
   <body xml:lang="eng">
      .....
      <div type="chapter" n="5">
         .....
         <div n="18" type="p">
            <p>‘But, for the child’s own sake, it ought not to be encouraged to have such
               amusements,’ answered I, as meekly as I could, to make up for such unusual
               pertinacity. ‘“Blessed are the merciful, for they shall obtain
               mercy.”’</p>
         </div>
         .....
      </div>
      .....
   </body>
</TEI>

That is all the Brontë encoder need do. The New Testament transcriber has a similar responsibility:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.sch" type="application/xml" 
   schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" TAN-version="1" id="tag:textalign.net,2015-04-07:test2">
   <teiHeader>
      .....
   </teiHeader>
   <head xmlns="tag:textalign.net,2015:ns">
      .....
      <declarations>
         <work>
            <IRI>http://dbpedia.org/resource/New_testament</IRI>
            <name>New Testament</name>
         </work>
         <div-type xml:id="bk">
            <IRI>http://dbpedia.org/resource/Book</IRI>
            <name>book</name>
         </div-type>
         <div-type xml:id="ch">
            <IRI>http://dbpedia.org/resource/Chapter_(books)</IRI>
            <name>chapter</name>
         </div-type>
         <div-type xml:id="v">
             <IRI>tag:textalign.net,2015-04-07:div-type:verse:biblical</IRI>
             <name>verse (Bible)</name>
         </div-type>
         .....
      </declarations>
      .....
   </head>
   <body xml:lang="eng">
      <div n="Matt" type="bk">
         .....
         <div n="5" type="ch">
            .....
            <div n="7" type="v"><ab>Blessed are the merciful: for they shall obtain mercy.</ab></div>
            .....
         </div>
      </div>
      .....
   </body>
</TEI>

Starting from the top of both examples, observe the following:

  1. The prolog contains two declarations, one pointing to a customized TEI schema in RELAX NG (compact syntax) and another pointing to a Schematron file. (These URLs do not resolve; they are merely illustrative.)

  2. The rootmost element, <TEI>, has @TAN-version and @id. The latter is a user-defined URN naming the file. (Actually, the name applies to all versions of that file, but I avoid a full explanation here.)

  3. There is a new <head> element. The TAN suite has formats for different kinds of data (some of which one would never use TEI to encode). Metadata from one type of TAN file to the next must be predictably and consistently structured. In a word, <teiHeader> is inadequate for TAN files, and would be confusing when juxtaposed with other TAN files. The <tan:head> structures metadata in a manner consistent with other TAN files. The need for predictability is also why it is a sibling, not a child, of <teiHeader>.

  4. The literary work and the division types are defined by <work> and <div-type>, which take what I call an IRI + name pattern, a recurrent feature of all TAN files. One or more <IRI>s supply a computer-readable name in the form of an Internationalized Resource Identifier (IRI, an extension of URI, Uniform Resource Identifier) and one or more <name>s, a human-readable one. The @xml:id provides a local identifier so that the entity, properly defined by its IRI values, can be easily referenced. Thus, the two examples assign the division "chapter" different abbreviations (ch versus chapter), but this difference does not matter because the definition, made by <IRI>, is shared.

  5. <body> takes a set of nested <div>s. Any markup inside a leaf <div> is optional, and will be ignored by many users of the file. (For this reason, a bare TAN format for transcriptions is provided, to support users who prefer plain text to TEI.)

The transcribers' work is finished. Before we move to the next phase, however, it is worth noting some important gains in interoperability that have already been made. Because a TAN transcriber is compelled to segment a single work according to a semi-intuitive reference system, and to declare the work and the types of division according to IRI/URIs, we have in place the foundation for computer-actionable alignment. That is, if one were to have one hundred people each independently transcribe a different version of Agnes Grey or the New Testament along TAN rules, it is likely that many of them would structure, define, and label <div>s in a similar fashion. Thus, a good number of these versions will already be prepared for automatic alignment, with no human intervention whatsoever. There will always be some versions encoded differently, of course, and the TAN format provides the tools for an aligner to easily reconcile differences where they exist. But even before the aligner has arrived, the stage has been set for computers to create multilingual editions of versions of the same work with minimal human intervention.

At this point, work shifts to the annotator who wants to encode the cross-reference. The TAN format specifies two formats for cross-referencing. One is designed exclusively for pairs of texts (bitexts) and is used to create clusters of words (or merely letters) that correspond across the bitexts. This format, intended for highly detailed, nuanced, and complex work, provides a kind of microscopic alignment. But we focus here on the other kind of format, macroscopic, which is intended to be used to align any number of versions of any number of works, and to specify further alignments on the basis of leaf <div>s (but larger or more precise alignments, down to the level of words, can also be made).

Let us suppose an aligner has found not only our two example TAN transcription files but another version of each work, and wishes to declare a cross-reference from the Brontë novel to the New Testament that applies to all four. That alignment file will look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/schemas/1/TAN-TEI.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/schemas/1/TAN-TEI.sch" type="application/xml" 
   schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-div xmlns="tag:textalign.net,2015:ns"
    TAN-version="1" id="tag:textalign.net,2015-04-07:alignment-test1">
    <head>
       .....
       <source xml:id="bronte">
          <IRI>tag:textalign.net,2015-04-07:test1</IRI>
          <name>Agnes Grey in English</name>
          <location when-accessed="2015-07-13">test1.xml</location>
       </source>
       <source xml:id="bronte-fra">
          <IRI>tag:textalign.net,2015-04-07:test3</IRI>
          <name>Agnes Grey in French</name>
          <location when-accessed="2015-07-13">test3.xml</location>
       </source>
       <source xml:id="nt">
          <IRI>tag:textalign.net,2015-04-07:test2</IRI>
          <name>King James version of the New Testament</name>
          <location when-accessed="2015-07-13">test2.xml</location>
       </source>
       <source xml:id="nt-grc">
          <IRI>tag:textalign.net,2015-04-07:test4</IRI>
          <name>Nestle Aland version of the Greek New Testament</name>
          <location when-accessed="2015-07-13">test4.xml</location>
       </source>
       .....
    </head>
    <body>
        <align>
            <div-ref src="bronte" ref="chapter 5 p 18"/>
            <div-ref src="nt" ref="bk Matt ch 5 v 7"/>
        </align>
    </body>
</TAN-A-div>

The <head> is somewhat long, because four different versions are in play, and they each need the IRI + name pattern (see above) as well as one or more <location>s, to specify where the source has been found. But the <body> is relatively straightforward. A single <align> encloses a set of <div-ref>s, each of which names a particular passage by identifying the source and reference. The pair of <div-ref>s provides a two-way cross-reference that follows a human-friendly syntax that does not require any knowledge of XPath, XPointer, regular expressions, and so forth.
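To illustrate how a processor might read such references, the following Python sketch splits the value of @ref into pairs of division type and @n value, then replaces each local division-type id with the IRI declared for it in the source file. This is a guess at the processing model, not the TAN specification: the alternating type/value syntax is inferred only from the two examples above, and the names parse_ref, resolve_ref, BRONTE_TYPES, and NT_TYPES are assumptions made for illustration.

```python
# Division-type declarations copied from the two transcription files above.
BRONTE_TYPES = {
    "chapter": "http://dbpedia.org/resource/Chapter_(books)",
    "p": "http://dbpedia.org/resource/Paragraph",
}
NT_TYPES = {
    "bk": "http://dbpedia.org/resource/Book",
    "ch": "http://dbpedia.org/resource/Chapter_(books)",
    "v": "tag:textalign.net,2015-04-07:div-type:verse:biblical",
}


def parse_ref(ref):
    """Split 'bk Matt ch 5 v 7' into [('bk', 'Matt'), ('ch', '5'), ('v', '7')]."""
    tokens = ref.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating division types and @n values: %r" % ref)
    return list(zip(tokens[0::2], tokens[1::2]))


def resolve_ref(ref, div_types):
    """Replace each local division-type id with its declared IRI."""
    return [(div_types[local_id], n) for local_id, n in parse_ref(ref)]


if __name__ == "__main__":
    print(resolve_ref("chapter 5 p 18", BRONTE_TYPES))
    print(resolve_ref("bk Matt ch 5 v 7", NT_TYPES))
```

Once resolved, chapter 5 of the novel and ch 5 of Matthew carry the same division-type IRI even though the two files abbreviate it differently, so the local ids never need to be coordinated between transcribers.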

Even though this cross-reference invokes only the sources given the id bronte and nt, the reference applies to all four sources. That is because <div>-based alignment rules stipulate that every processor must infer alignment wherever possible and that, unless otherwise specified, alignment is transitive. If two texts are versions of the same work (discerned through the <IRI> values of each source's <work>), then their constituent parts—their <div>s—should be aligned wherever they can (using the IRI values of @type and the data values of @n). Furthermore, if special alignment is made across works (such as the cross-reference above), then that alignment is to be treated as transitive unless otherwise specified. That is, if an <align> says that X ~ (aligns with) Y, then for every A ~ X and every B ~ Y, A ~ B.
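The inference described above can be sketched in a few lines of Python. This is a simplified illustration rather than the TAN algorithm: it assumes that every version of a work uses the same reference for the same passage, that the French and Greek sources declare the same work IRIs as their English counterparts (the paper does not show those files), and that the names SOURCE_WORK and expand_alignment are hypothetical.

```python
# Work IRI declared by each source; the entries for bronte-fra and nt-grc
# are assumptions, since those transcription files are not shown.
SOURCE_WORK = {
    "bronte": "http://dbpedia.org/resource/Agnes_Grey",
    "bronte-fra": "http://dbpedia.org/resource/Agnes_Grey",
    "nt": "http://dbpedia.org/resource/New_testament",
    "nt-grc": "http://dbpedia.org/resource/New_testament",
}


def expand_alignment(align, source_work):
    """Extend each (source, ref) pair to every source that transcribes the same work."""
    expanded = set()
    for src, ref in align:
        work = source_work[src]
        for other, other_work in source_work.items():
            if other_work == work:
                expanded.add((other, ref))
    return expanded


if __name__ == "__main__":
    align = [("bronte", "chapter 5 p 18"), ("nt", "bk Matt ch 5 v 7")]
    for member in sorted(expand_alignment(align, SOURCE_WORK)):
        print(member)
```

The single <align> that names two sources thus yields four aligned passages, one per version, with no additional markup from the aligner.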

There are a number of benefits to the simplified <div>-based alignment illustrated above, but one should be singled out. The value of @when-accessed (a required attribute of <location>) indicates when the aligner last saw a source transcription. If that file is corrected and updated, and the date of the change is logged in the source file, then when the aligner validates the alignment file, the Schematron pattern will issue a warning that the source has been updated. The aligner can then decide if the changes have any important consequences. So transcribers can keep their files in a central location and have the liberty of correcting typographical errors. They need not worry about altering any stand-off markup files or hunting down every person using their files. The Schematron schemas do the notifying. Those who depend upon the source file can be automatically informed of any changes, one of the signal strengths of stand-off markup.
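The comparison behind that warning amounts to a date check. The following Python sketch assumes that the source file logs its changes as ISO dates; how TAN actually records changes is not shown here, and the function name source_is_stale is hypothetical.

```python
from datetime import date


def source_is_stale(when_accessed, change_dates):
    """True if any change logged in the source is later than the aligner's last access."""
    accessed = date.fromisoformat(when_accessed)
    return any(date.fromisoformat(changed) > accessed for changed in change_dates)


if __name__ == "__main__":
    # The aligner last consulted the source on 2015-07-13.
    print(source_is_stale("2015-07-13", ["2015-04-07"]))
    print(source_is_stale("2015-07-13", ["2015-04-07", "2015-08-01"]))
```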

The aligner's task is finished, and work shifts to the processor. Configuration of the pre-processor is a one-time affair that will apply not only to any version of a particular text (as was the case with the method of the shared Schematron file, discussed above) but to any TAN div-based alignment file for any work. That is, those who configure processors do not need to learn the structure of a given work or transcription file. They need only to know the TAN specifications for alignment (i.e., how to interpret a TAN-A-div file). Any TAN-compliant processor can be used on any TAN-A-div file, no matter how many works or versions it has. How the processor uses or transforms the data is another issue altogether, because that depends upon the purpose and questions the transformation serves. But the preliminary pre-processing stage need be configured only once, since all valid TAN files (both transcription and alignment) are interoperable, both syntactically and semantically.

There is obviously much more I should say about TAN alignment, in response to important questions or concerns. What if independent transcriptions of the same work are discordant, using different values for @n? What if division types and works are defined by different IRI vocabularies? What about versions of the same work that use altogether different reference systems? What about works that are similar but not really the same? What about coordinating specific ranges of text smaller than the leaf <div>? What if a commonly used reference system is misleading or inadequate?

These questions and more have been anticipated, and will be addressed in the full specifications for the Text Alignment Network. Explaining any single point adequately would involve moving into territory outside the remit of this article, and would raise yet other questions that would require a full discussion of the TAN design principles and rules.

But let us assume for the sake of argument that these concerns are not handled adequately under TAN specifications. Inevitable shortcomings aside, consider how much additional interoperability has been secured in the simple examples above. Like CTS URNs, TAN-compliant TEI provides a means for uniquely naming literary works. Like the shared Schematron method, TAN-TEI offers transcribers rules to make their texts consistent and predictably structured (and therefore citable). And by requiring that <div> be given a semantically precise definition, TAN specifications allow an otherwise generic element to become highly productive. That is, a transcriber is now free to define <div> to mean a textual division that might be unusual or specific to a field. Thus the world of textual divisions is now opened to the semantic web.

Even if TAN proves to have fatal flaws, I hope these examples inspire someone to create a better stand-off annotation system. If the goal is to allow a cross-reference to apply to any number of versions of any two works, then in-line annotation is not viable, because it indelibly impresses the cross-reference into a single version. To be applicable to other versions the cross-reference must be freed.
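
The contrast between an in-line and a freed cross-reference can be made concrete with a small sketch. The Python below is an assumption-laden illustration, not the TAN format: the work IRIs, version labels, reference values, and placeholder texts are all hypothetical. The point is only that a cross-reference expressed as work plus reference, rather than embedded in one file, can be resolved against every version of both works.

```python
from typing import NamedTuple


class CrossRef(NamedTuple):
    """A stand-off cross-reference: it names works and references, not files."""
    from_work: str
    from_ref: str
    to_work: str
    to_ref: str


# Hypothetical versions, keyed by work IRI, then by version label, then by
# a shared reference system. The texts are placeholders, not real content.
VERSIONS = {
    "http://example.org/work/a": {
        "grc": {"1.1": "Greek text of A 1.1"},
        "eng": {"1.1": "English text of A 1.1"},
    },
    "http://example.org/work/b": {
        "grc": {"3.2": "Greek text of B 3.2"},
        "lat": {"3.2": "Latin text of B 3.2"},
    },
}


def resolve(xref, versions):
    """Return every (from_version, to_version) pair the cross-reference covers."""
    pairs = []
    for from_label, from_divs in versions.get(xref.from_work, {}).items():
        for to_label, to_divs in versions.get(xref.to_work, {}).items():
            # The pairing holds only if both versions contain the cited division.
            if xref.from_ref in from_divs and xref.to_ref in to_divs:
                pairs.append((from_label, to_label))
    return pairs


xref = CrossRef("http://example.org/work/a", "1.1",
                "http://example.org/work/b", "3.2")
pairs = resolve(xref, VERSIONS)
print(pairs)
```

A single stand-off statement here yields one pairing for each combination of versions, whereas an in-line cross-reference would have to be repeated, and maintained, in every version of the citing work.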

Conclusion

Three methods for enhancing the syntactic and semantic interoperability of cross-references in TEI files have been offered: Canonical Text Services URNs, shared Schematron files, and the stand-off markup of the Text Alignment Network. The first two could be implemented now. The principal barrier is practical: getting independent scholars, projects, and groups to adopt a method, try it out, and through trial and experience develop the protocols behind it. The third method needs both experimentation and development before it can be widely used. But all three show that greater interoperability is possible through a few modest adjustments to our approach to TEI. First, make source transcriptions predictably structured. Second, make sure that references to those predictably structured sources are themselves predictably structured. Third, define the syntax of the metadata such that each constituent part retains its semantics, defined by IRIs/URIs. Even if a reader finds one of the three methods unconvincing, that method is successful if, in the end, it catalyzes a better way.

References

[European Commission 2010] European Commission, Annex 2, ‘Towards Interoperability for European Public Services’, ver. 744final (2010-12-16), http://ec.europa.eu/isa/documents/isa_annex_ii_eif_en.pdf (accessed 2015-07-02).

[Kalvesmaki 2014] Joel Kalvesmaki, Canonical References in Electronic Texts: Rationale and Best Practices, Digital Humanities Quarterly 8.2 (2014), http://www.digitalhumanities.org/dhq/vol/8/2/000181/000181.html.

[Schmidt 2014] Desmond Schmidt, Towards an Interoperable Digital Scholarly Edition, Journal of the Text Encoding Initiative [Online], Issue 7 (November 2014), published online 2014-11-12, http://jtei.revues.org/979 (accessed 2015-03-24). doi:10.4000/jtei.979.

[Schmidt] Desmond Schmidt, The Inadequacy of Embedded Markup for Cultural Heritage Texts, Literary and Linguistic Computing 25 (2010): 337-356. doi:10.1093/llc/fqq007.

Joel Kalvesmaki

Joel Kalvesmaki (PhD, early Christian studies, Catholic University of America, 2006) is Editor in Byzantine Studies at Dumbarton Oaks. His research centers on Greek theological and philosophical texts from late antiquity. Editor of the digital-only scholarly reference work Guide to Evagrius Ponticus, Joel also serves broadly as an advisor on the digital humanities. In 2015 he began the Text Alignment Network, a suite of TEI-friendly XML formats intended to facilitate the interoperable exchange of text alignments.