<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2" xml:id="Bal2010bans1222"><title>Why TEI stand-off annotation doesn't quite work</title><subtitle>and why you might want to use it nevertheless</subtitle><info><confgroup><conftitle>Balisage: The Markup Conference 2010</conftitle><confdates>August 3 - 6, 2010</confdates></confgroup><abstract><para>The present submission focuses on the concept of stand-off annotation as it is implemented in the current
        version of the TEI Guidelines. We look at the motivation for choosing the stand-off approach to encoding
        Language Resources, briefly recount the history of the concept within the broadly conceived TEI setting (since
        TEI P3 and the LT NSL suite, through CES and XCES, ending in TEI P5), review the various kinds of hyperlink
        semantics and identify three kinds of reasons for the poor uptake of the TEI-recommended stand-off annotation
        approach to corpus encoding. We also suggest some solutions that may contribute to a change in the current
        state of affairs.</para></abstract><author><personname><firstname>Piotr</firstname><surname>Bański</surname></personname><personblurb><para>Piotr Bański is an Assistant Professor at the Institute of English Studies, University of Warsaw, where he
          teaches formal linguistics (primarily linguistic morphology and syntax), lexicography, and the history of
          English. He has participated, in the role of the XML architect, in projects building the IPI PAN corpus of
          Polish (encoded in the XCES) and the National Corpus of Polish (a 10<superscript>9</superscript>-word
          resource encoded in multi-level stand-off TEI). He is co-administrator of two TEI-based multilingual
          projects, FreeDict (grouping bilingual dictionaries) and Open-Content Text Corpus (with multiple monolingual
          and aligned parts; currently at the alpha stage).</para></personblurb><affiliation><jobtitle>Assistant Professor</jobtitle><orgname>Institute of English Studies, University of Warsaw</orgname></affiliation><email>bansp@o2.pl</email></author><legalnotice><para>Copyright © 2010 by the author.  Used with
permission.</para></legalnotice><keywordset role="author"><keyword>TEI</keyword><keyword>stand-off annotation</keyword><keyword>hyperlink semantics</keyword><keyword>corpus encoding</keyword></keywordset></info><section xreflabel="Section 1" xml:id="sect_intro"><title>1. Introduction</title><para>The present contribution concerns the application of the <link xlink:href="http://www.tei-c.org/Guidelines/P5/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">TEI Guidelines</link> (<xref linkend="teip5"/>) to the
      description of Language Resources (LRs), defined as follows: <blockquote><para>A Language Resource is <quote>any physical or digital item that is a product of language documentation,
            description, or development, or is a tool that specifically supports the creation and use of such
            products</quote> (<xref linkend="simons-bird08"/>).</para></blockquote>
    </para><para>More specifically, we are looking at linguistic corpora – what <xref linkend="wittetal09a"/> call <emphasis role="ital">static text-based LRs</emphasis>. We
      furthermore restrict the discussion to text corpora, though we believe that much of it is true of e.g. speech corpora, or multimodal corpora in general.
      We also believe that the general implications of the discussion can be carried over to other places where linguistics meets markup, or, more generally still, where
      two communities with different backgrounds meet to describe a range of phenomena of interest to both of them.</para><para>Among text corpora, we look at those encoded in TEI XML. <footnote><para><emphasis role="ital">TEI</emphasis> stands for <link xlink:href="http://www.tei-c.org/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest"><emphasis role="ital">Text Encoding Initiative</emphasis></link> – the project, the organization and the community
          whose primary deliverable is the <emphasis role="ital">Guidelines for Electronic Text Encoding and
            Interchange</emphasis> (<xref linkend="teip5"/>), summing up or recommending best practices for the encoding
          of a multitude of varieties of textual phenomena (see <xref linkend="jannidis09"/> for a concise description
          and <xref linkend="renear04"/> for broader discussion from the perspective of version P4 and for the placement
          of the TEI in the context of text encoding studies). As <xref linkend="renear04"/> puts it, <quote>after HTML,
            the TEI is probably the most extensively used SGML/XML text encoding system in academic
          applications</quote>. It is also worth pointing out that the TEI partially informed the development of the
          XLink and XPointer W3C recommendations, and also the ISO Feature Structure Representation standard (ISO
          24610-1). Currently, there is some interaction between the TEI encoding methods and the emerging Language
          Annotation Framework (LAF) standards, created by the <link xlink:href="http://www.tc37sc4.org/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">ISO TC 37 SC
            4</link> committee.</para><para>The TEI began in 1987 and has been through several versions, coming all the way from SGML to XML and
          following closely the developments in the field of XML specifications. As we shall show below, some of the
          attempts are still to be completed. The current version of the TEI is TEI P5 – see <xref linkend="witternetal09"/> for a brief account of the innovations that this version introduces, and <xref linkend="cummings08"/> for a broader view in the context of the Humanities.</para></footnote> The aim of the paper is to assess the suitability of the TEI for the purpose of creating multi-layer
      descriptions of linguistic phenomena, but also for more focused applications, such as those described in <xref linkend="boot09"/> or <xref linkend="cummings09"/>.</para><para>The question we will ask is not whether stand-off annotation in the TEI is <emphasis role="ital">doable</emphasis> – the answer to that is clearly yes, as successful stand-off TEI systems exist. The question will
      be rather: is stand-off TEI <emphasis role="ital">feasible</emphasis>, available out-of-the-box to an XML-literate
      OWL (Ordinary Working Linguist) with a crush on the TEI?<footnote><para>The acronym OWL in the sense of "ordinary working linguist" predates the Web Ontology Language by a few
          decades. A relevant book reference is e.g. <xref linkend="lawler-dry"/>, but as <link xlink:href="http://groups.google.com/group/gold-ontology/msg/d624be2303c8492c?hl=en" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">Michael Maxwell
            (personal communication)</link> tells me, the term was in use among field linguists at least in the 80's.
          He goes on to say that <quote>its coining on the part of field linguists (particularly SIL field linguists)
            was a reaction to the disdain with which some theoretical linguists looked down on field work, or at least
            on field work that wasn't grounded in some (acceptable) theory.</quote></para></footnote> Here, the answer is <quote>unfortunately, it depends</quote>, and we shall look at the dependencies.
      Some of them are internal to the TEI and hence potentially open to relatively quick local fixing, some of them
      external and affecting the XML world at large. The TEI-internal issues will be shown to have a twofold nature,
      technological and sociological, the former easier to solve and conditioning the latter.</para><para>The discussion is based on the author's experience in setting up stand-off TEI architecture in the <link xlink:href="http://nkjp.pl/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">National Corpus of Polish</link> (a 10<superscript>9</superscript>-word resource
      nearing completion, cf. <xref linkend="bansp-adamp10"/>), the <link xlink:href="http://octc.sourceforge.net/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">Open-Content Text Corpus</link> (at the alpha stage, cf. <xref linkend="bansp-beataw10"/>), and the Foreign
      Language Examination Corpus (at the planning stage, cf. <xref linkend="banski-gozdawa10"/>).</para><para>In what follows, we first attempt to answer the question of why bother: why stand-off markup is an attractive
      technique from the point of view of a linguist (<xref linkend="sect_why_stand-off"/>). Next, in <xref linkend="sect_history"/>, we briefly look at the history of SGML and XML stand-off approaches in the broadly
      defined context of the TEI and also at the semantics postulated for the interpretation of stand-off devices. In
        <xref linkend="sect_TEI"/>, we look at the TEI's approach to stand-off annotation, and in <xref linkend="sect_problems"/> at the various issues that may condition the insufficient level of uptake of this
      approach in the linguistic community. Finally, in <xref linkend="sect_solutions"/>, we sketch some solutions for
      the problems identified in the present article. <xref linkend="sect_conclusion"/> concludes the paper.</para></section><section xml:id="sect_why_stand-off" xreflabel="Section 2"><title>2. Practical motivation for stand-off representations</title><para>This section looks at the motivation for using stand-off representations as seen from the point of view of an Ordinary Working Linguist. The arguments come
      mostly from modularity, both theoretical and practical, but we also look at the issues of sustainability and interoperability of LRs. We finish by presenting three
      different multi-layer TEI stand-off annotation systems as illustration and a point of reference for further discussion.</para><section xreflabel="Section 2.1" xml:id="sect_why-OHCO"><title>2.1. OHCO, overlap, modularity, and the nature of OWLs</title><para>One of the claims that gave markup studies a solid push was the thesis that text is OHCO, an ordered
        hierarchy of content objects<footnote><para>See <xref linkend="deroseetal90"/> for a manifesto, <xref linkend="renearetal93"/> for a re-evaluation,
            and both <xref linkend="renear04"/> and <xref linkend="cummings08"/> for discussion of OHCO in the context
            of the TEI.</para></footnote>. The thesis has been shown to be both inaccurate as a general claim and valid as a statement of
        tendencies and pragmatic advantages: while the OHCO thesis does not hold in all cases, due to the existence of
        overlapping hierarchies and non-contiguous objects, it appears to constrain many conceptualizations of the
        nature of text, and OHCO-based approaches to e.g. text editing appear to have practical advantages. Much of
        linguistic modelling is also done assuming OHCO as the general conceptual approach, accompanied by additional
        devices (movement, linking, feature percolation, re-entrancy, etc.) as ways to more or less system(at)ically plug the holes
        that OHCO alone cannot fill.</para><para>Criticism of the OHCO thesis has appeared extensively in the literature – see e.g. <xref linkend="renearetal93"/> for an early formulation of the problems and reformulations of the thesis, and <xref linkend="derose04"/> for an overview of ways in which non-OHCO structures can be represented, also within the
        TEI; the Extreme Markup Languages and Balisage series contain numerous articles devoted to this issue. Our
        purpose here is not to provide new flashy arguments for something that has already spawned extensive research on
        alternatives to XML and on ways to handle the failure of the OHCO thesis by devices native to XML. Our aim is
        practical: we point out that overlap and discontinuity, and the need to embrace rather than trick them, are
        inherent in both theoretical linguistic constructs and in corpus linguistic practice. We furthermore point out
        that the existence of mismatches in description is one of the arguments for a modular approach to linguistic
        modelling, whereby objects with sometimes strikingly different properties are supposed to constitute separate
        domains of study, which are linked by correspondence or mapping rules. This is what we mean by theoretical
        modularity. There is also a more practical aspect of modularity, where it is advisable to keep the output of
        various linguistic tools separated, especially where each of these separate outputs may constitute the base for
        further descriptions in a multi-layer system.</para><para>This state of affairs is not only due to the fact that different kinds of linguistic description require
        different and often conflicting segmentations at various levels, some examples of which we shall look at below.
        It is also due to the fact that there is no single way to demarcate the domain of any component of grammar –
        there are a multitude of syntactic, semantic, morphological, phonological, etc. theories with differing
        theoretical apparatus, and sometimes even with differing domains of application, although they are theories of
        seemingly the same phenomena. Consider the virtual non-existence of linguistic morphology in the days of the
        early Generative Grammar (from the late 50's throughout the 60's), when the syntagmatic aspect of word
        composition (ordering of morphs, the "atoms of word forms") was delegated to the syntactic component, and its
        paradigmatic aspect (allomorphy, i.e. modifications in the shape of morphs) was delegated to the ultra-powerful
        phonological component (cf. <xref linkend="anderson92"/>, ch. 2 for a concise discussion and references).
        Consider also the lack of interest of Classical Phonemics in morphophonological phenomena<footnote><para>Morphophonology, defined roughly as dealing with the alternation of phonemes (the abstract contrastive
            elements of speech), was often – and with some embarrassment – kept in dark corners of most structuralist
            theories of at least the first half of the 20<superscript>th</superscript> century.</para></footnote>, which later became part of the focus of Generative Phonology. Similar remarks concern the division
        of labour between and across the semantic and pragmatic components – for those models that distinguish between
        the two – vis-à-vis models based on so-called Cognitive Grammar, which introduce different divisions. The point
        is that there is no single unified approach to morphology, syntax or semantics, etc., and any encoding strategy
        choosing one particular perspective as privileged is bound to attract criticism and to discourage researchers
        working in different paradigms. We OWLs can sometimes agree that what we want to describe is stretches of
        manifestations of natural language. Sometimes, this is also the limit of our consent, and anything beyond this,
        e.g. our views on the proper segmentation of these stretches, should be presented as equal variants rather than
        one "proper" version with possible "deviations".</para><para>On a plane more familiar to markup specialists, consider an example from <xref linkend="woerneretal06"/>
        that concisely presents the nature of the problem with overlapping linguistic hierarchies: the French
        preposition <emphasis role="ital">de</emphasis> and article <emphasis role="ital">la</emphasis> are pronounced
        as a single phonological unit, <code>[dla]</code>. At the same time, the preposition and the article are children of two
        different nodes in a syntactic tree:</para><figure xml:id="fig_Woerneretal_overlap"><title>Overlapping lexical, phonological and syntactic hierarchies (copied from <xref linkend="woerneretal06"/>)</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Banski01/Banski01-001.png" width="50%"/></imageobject></mediaobject></figure><para>The same is true of other cases involving separate syntactic elements getting merged morphologically or
        prosodically, as in the German <emphasis role="ital">in das</emphasis> becoming <emphasis role="ital">ins</emphasis>, English <emphasis role="ital">gonna</emphasis>, <emphasis role="ital">won't</emphasis> or
          <emphasis role="ital">I'd've</emphasis>, the last of which is a fairly typical example of cliticization
        (where a syntactically independent element is prosodically dependent on another; in this case, both the
        contracted <emphasis role="ital">'d</emphasis> and <emphasis role="ital">'ve</emphasis> cliticize onto the
        pronoun <emphasis role="ital">I</emphasis>), and of numerous other examples cited in the linguistic literature,
        often under the heading of "bracketing paradoxes" or "mismatches" of various sorts.</para><para>Consider also somewhat different misalignments, for example conflicting POS (part-of-speech) descriptions.
        These may involve changes in the number and the kind of grammatical labels used (e.g. compare the various
        tagsets of the <link xlink:href="http://ucrel.lancs.ac.uk/claws/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">CLAWS tagger</link>), but the differences may
        in many cases go deeper and may involve conflicting segmentations: compare the divisions within
          <code>[does][n't]</code> (CLAWS/Penn Treebank) vs. <code>[doesn]['][t]</code> (<link xlink:href="http://www.coli.uni-saarland.de/~thorsten/tnt/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">TnT Tagger</link>), as adduced by <xref linkend="chiarcosetal09"/>.<footnote><para><code>[doesn]['t]</code> and the lemmatized (or string-mapped) <code>[does][not]</code> are also viable
            strategies; see <xref linkend="ide-suderman07"/> and <xref linkend="chiarcosetal09"/> for more examples and
            discussion; see also the tokenization section of the ACL Special Interest Group for Annotation wiki (<link xlink:href="http://cims.nyu.edu/~meyers/SIGANN-wiki/wiki/index.php/Tokenization_Standards" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://cims.nyu.edu/~meyers/SIGANN-wiki/wiki/index.php/Tokenization_Standards</link>) for a concise and
            up-to-date summary of the various issues concerning splitting the textual stream into interesting
            units.</para></footnote> Some relevant cases are illustrated below.</para><figure xml:id="fig_tokenization1"><title>Conflicting tokenizations: morpholexical (of the English <emphasis role="ital">doesn't</emphasis> and the Polish
            <emphasis role="ital">goście</emphasis>) and syntactic (<emphasis role="ital">się</emphasis>-haplology in
          Polish)</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Banski01/Banski01-002.png" width="70%"/></imageobject></mediaobject><caption><para>On the left, we present two attested strategies for the tokenization of the English <emphasis role="ital">doesn't</emphasis> (after <xref linkend="chiarcosetal09"/>). In the middle, two possibilities
            of the interpretation of the string <emphasis role="ital">goście</emphasis> in Polish are shown, whereas
            the diagram on the right illustrates overlapping syntactic segments, where <emphasis role="ital">obawiał
              się</emphasis> means "he was afraid" and <emphasis role="ital">się uśmiechnąć</emphasis> means "to
            smile"; notice that both strings involve the reflexive marker <emphasis role="ital">się</emphasis>,
            preposed with respect to the second verb's lemma (= canonical form).</para></caption></figure><para>While the first case demonstrates a single function word with multiple possible segmentations depending on
        the given software tool, case (b) shows a single form that realizes distinct "underlying" sequences: either a
        plural noun (consisting of a stem and an ending (desinence) – but this level of detail is rarely needed) or a
        weak pronoun <emphasis role="ital">go</emphasis> "him" followed by an auxiliary (person-number) clitic <emphasis role="ital">śmy</emphasis>. Case (c) shows two overlapping syntactic words – this is an example of the
        haplology of the Polish reflexive marker (see <xref linkend="kupsc99"/>). The marker is obligatory for both
        verbs used here (the forms *<emphasis role="ital">obawiał</emphasis> and *<emphasis role="ital">uśmiechnął</emphasis> are ungrammatical without the accompanying <emphasis role="ital">się</emphasis>) but
        under appropriate circumstances, multiple instances of <emphasis role="ital">się</emphasis> may (and in fact
        should, in idiomatic Polish) reduce to a single occurrence that is perceived as shared by the verbs involved.
        (As a further complication, these parts of the reflexive verb need not be adjacent.)</para><para>Although all of the examples above present various cases of overlap, we do not want to treat them in the
        same way. Cases (b) and (c) belong to the same respective levels of grammatical description (basic segmentation
        in (b), syntactic word identification in (c)) and the contrast between the alternatives in each case is not
        based on any theoretical difference – in the words of <xref linkend="renearetal93"/>, they belong to a single
        perspective, and therefore are a counterexample to even the weakest version of OHCO. At the same time, we do
        not want to subject them to any kind of non-OHCO mechanism apart from a simple disjunction between (sub)trees:
        we want a single document to provide us with both variant readings in the case of (b), and both syntactic words
        in the case of (c). Example (a), on the other hand, may be argued to show different perspectives, as defined
        from the point of view of the software tool that is used to tokenize and tag the resulting strings. In such
        cases, we want the different tokenizations to reside in different documents.<footnote><para>See <xref linkend="chiarcosetal09"/> for discussion of cases where such fundamentally different
            tokenizations need to be merged and for a proposal of a merging algorithm.</para></footnote></para><para>Similarly, if the only difference lies in the assignment of POS labels – for example, the tag for the
        comparative degree of an adjective (<emphasis role="ital">better</emphasis>, <emphasis role="ital">older</emphasis>) in the CLAWS-5 tagset used to tag the British National Corpus, is "AJC", whereas in the
        CLAWS-8 tagset it is "JJR" – then, although expressing the labels in a single document would be trivial (e.g.
        in multi-valued attributes), we want them placed in separate documents, because they represent different
        perspectives or at least different tools. This is completely independent from the practical issue of validation
        of such multi-token attributes and the like – even if the validation were trivial, these perspectives are
        fundamentally different for practical reasons and should be kept separate also with an eye to using one of them
        but not the other for the purpose of building the next annotation layer. <footnote><para>Although some correlations between the violations of <xref linkend="renearetal93"/>'s OHCO-3 and the
            placement of the offending structures within the same document suggest themselves, I believe this issue –
            if it is a valid issue at all – to be beyond the scope of the present submission.</para></footnote>
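        As a minimal sketch of such a separation (the file names, identifiers and element choices below are illustrative assumptions, not taken from any of the corpora discussed here), two annotation documents may label the same segment independently, each pointing to a hypothetical segmentation layer <code>segmentation.xml</code> in which <code>seg_5</code> identifies the word form <emphasis role="ital">older</emphasis>:
        <programlisting xml:space="preserve">&lt;!-- ann_claws5.xml (hypothetical file): CLAWS-5 perspective --&gt;
&lt;seg xml:id="c5_seg_5" corresp="segmentation.xml#seg_5"&gt;
  &lt;fs type="morph"&gt;
    &lt;f name="pos"&gt;&lt;symbol value="AJC"/&gt;&lt;/f&gt;
  &lt;/fs&gt;
&lt;/seg&gt;

&lt;!-- ann_claws8.xml (hypothetical file): CLAWS-8 perspective --&gt;
&lt;seg xml:id="c8_seg_5" corresp="segmentation.xml#seg_5"&gt;
  &lt;fs type="morph"&gt;
    &lt;f name="pos"&gt;&lt;symbol value="JJR"/&gt;&lt;/f&gt;
  &lt;/fs&gt;
&lt;/seg&gt;</programlisting>
        Under these assumptions, either document can serve on its own as the base for a further annotation layer, and neither needs to be modified when the other is revised or replaced.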
      </para><para>Consider one more example, of the ambiguous sentence <emphasis role="ital">they killed the man with an
          umbrella</emphasis>. The realistic phrase structure analysis in such cases stops at the level of chunking (shallow
        parsing) – in this case, <code>[they][killed][the man][with an umbrella]</code>, with no indication of the structure of
        the verb phrase (VP) that starts at <emphasis role="ital">kill</emphasis> and continues to the end of the sentence on
        either reading. If deep parsing were attempted, the result would be as in (a) below, where the prepositional phrase
        modifies either the verb <emphasis role="ital">kill</emphasis> (upper tree) or the noun <emphasis role="ital">man</emphasis> (lower tree).</para><figure xml:id="fig_tree1"><title>Conflicting phrase structure analyses of a single sentence (a) vs. dependency analyses (b) and (c)</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Banski01/Banski01-003.png" width="80%"/></imageobject></mediaobject><caption><para>The prepositional phrase (PP) <emphasis role="ital">with an umbrella</emphasis> can be interpreted
            either as a separate instrumental adverbial (upper tree in (a)) or as part of the noun phrase (NP) object (a
            modifier of the noun – not indicated separately in the lower tree in (a)). In the dependency analysis, we
            are looking at graphs with labelled edges. The meanings of the labels are as follows: <emphasis role="ital">main</emphasis> introduces the entire structure, <emphasis role="ital">subj</emphasis> = subject,
              <emphasis role="ital">obj</emphasis> = object, <emphasis role="ital">instr</emphasis> = instrumental
            adverbial, <emphasis role="ital">det</emphasis> = determiner (article), <emphasis role="ital">pcomp</emphasis> = prepositional complement; diagram (c) shows the modification of the object <emphasis role="ital">man</emphasis> necessary to reflect the interpretation whereby the man was carrying an
            umbrella (=the lower tree in (a)), <emphasis role="ital">mod</emphasis> = modifier.</para></caption></figure><para>Examples (b) and (c) represent a dependency analysis of the same sentence (based on the <link xlink:href="http://www.connexor.eu/technology/machinese/demo/syntax/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.connexor.eu/ "machinese" demo</link>),
        with (b) corresponding to the interpretation encoded by the top tree in (a), while (c) reflects the interpretation of the
        bottom tree of (a). Dependency analyses involve graphs with labelled edges, which may – but need not – be isomorphic
        with trees, and therefore their OHCO-compliance can at best be partial.</para><para>In essence, what is needed for linguistic description is best captured by keeping the text in as neutral a form as
        possible, and offering various views of it, depending on the whim or the particular set of linguistic beliefs of the
        given user. One way to achieve this goal is to use <emphasis role="bold">stand-off annotation</emphasis>, whereby the
        source data is kept separate, either as raw text or with “low density” XML markup (i.e., with gross structural markup
        alone, e.g. identifying headers and paragraphs but little more, in order not to instil any theoretical linguistic
        interpretation into the text), and whereby all the possible linguistic interpretations are kept in separate documents,
        either referencing the source text directly, or forming a hierarchy of annotation layers (see e.g. <xref linkend="goeckeetal10"/> or <xref linkend="ide-romary07"/> for more details).<footnote><para>Naturally, stand-off annotation is not restricted to XML applications alone, but this is what we take as our
            focus here.</para></footnote>
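        A minimal sketch of this arrangement follows; the file names, identifiers and the example sentence are assumptions made for illustration, and the <code>string-range()</code> pointers (with offsets assumed to be zero-based) merely follow the general shape of the TEI pointing scheme of that name:
        <programlisting xml:space="preserve">&lt;!-- text.xml (hypothetical file): source text with low-density markup --&gt;
&lt;p xml:id="p1"&gt;She doesn't look older.&lt;/p&gt;

&lt;!-- segmentation.xml (hypothetical file): one possible tokenization,
     kept apart from the source and addressing it by character ranges --&gt;
&lt;seg xml:id="seg_1" corresp="text.xml#string-range(p1,0,3)"/&gt;   &lt;!-- She --&gt;
&lt;seg xml:id="seg_2" corresp="text.xml#string-range(p1,4,4)"/&gt;   &lt;!-- does --&gt;
&lt;seg xml:id="seg_3" corresp="text.xml#string-range(p1,8,3)"/&gt;   &lt;!-- n't --&gt;
&lt;seg xml:id="seg_4" corresp="text.xml#string-range(p1,12,4)"/&gt;  &lt;!-- look --&gt;
&lt;seg xml:id="seg_5" corresp="text.xml#string-range(p1,17,5)"/&gt;  &lt;!-- older --&gt;
&lt;seg xml:id="seg_6" corresp="text.xml#string-range(p1,22,1)"/&gt;  &lt;!-- . --&gt;</programlisting>
        In such a setup, a further layer (e.g. a morphosyntactic one) would point to the identifiers of the segmentation document rather than to the text itself, while a competing tokenization, such as <code>[doesn]['][t]</code>, would reside in another segmentation document with different ranges over the same unchanged source.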
      </para><para>While some of the cases of overlap and discontinuity presented here are open to reanalysis in terms other
        than stand-off annotation, even in the TEI itself – by means of milestone elements, fragmentation, in-file
        stand-off elements such as <code>&lt;link&gt;</code> and <code>&lt;join&gt;</code> or <link xlink:href="http://tei.oucs.ox.ac.uk/P5/Guidelines-web/en/html/ref-att.global.linking.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">linking
          attributes</link> such as @exclude, @synch and others (cf. <xref linkend="derose04"/> and chapters <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">16</link> and <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">20</link> of the TEI Guidelines),
        there is one important factor that rules out such strategies, and that is <emphasis role="bold">modularity of description</emphasis>. Descriptions of properties belonging to different theoretical
        perspectives are expected to be separate, in order to constitute separate modules that can be judged, verified
        and challenged on their own.<footnote><para>It is also worth mentioning that in some cases, the objects of interest do not form a contiguous whole
            and neither is there any hierarchy to talk about. For example, the layer of word sense disambiguation need
            only contain references to the particular forms of the lexemes disambiguated in the accompanying lexicon.
            This is exactly the case in the National Corpus of Polish, cf. <xref linkend="fig_ncp"/>.</para></footnote>
      </para></section><section xreflabel="Section 2.2" xml:id="sect_intro-buzzwords"><title>2.2. Issues of sustainability and interoperability</title><para>It is one of the tenets of at least some sustainability-oriented encoding practices that the object of
        description (in our case, text) be maximally divorced from its possible theoretical views (annotations). This
        way, the text, kept in as neutral a form as possible, remains an attractive resource, open to future analyses and
        to the creation of new views, i.e., new annotation layers (this goes under the heading of extensibility).
        Equally attractive are the annotations themselves – they can serve as the basis for comparison of tools and theories.
        Security (immutability) of the text itself is also essential, and this is what stand-off approaches strive to
        guarantee, because they are non-destructive with respect to the resource that gets annotated.<footnote><para>On sustainability of LRs in general, see e.g. <xref linkend="bird-simons03"/> and <xref linkend="simons-bird08"/>. On sustainability in the context of the TEI, see <xref linkend="wittetal09b"/>. </para><para>For more on interoperability of LRs, see e.g. <xref linkend="ide-romary07"/> or <xref linkend="wittetal09a"/>. We use the stand-off terminology in accordance with <xref linkend="goeckeetal10"/>.</para></footnote></para><para>Interoperability values the ease of transduction, both in the case of the source text and with respect to
        its annotations. Sometimes, the ease of mapping a single layer of annotation to another resource (e.g. a
        translated document) is also important.</para><para>On the other hand, <xref linkend="rehmetal10"/> point out that stand-off approaches are not optimal from the
        point of view of sustainability because they require dedicated tools in order to merge annotations with the
        source text. This is very true of the current state of affairs. Our point is that if stand-off annotation can be
        handled by generic XML tools then the issue of the longevity of the annotation layers (note that the source text
        is relatively safe) piggybacks on the general well-being of XML technology, ages together with it, and is open to
        whatever plastic surgery is applied to make XML or its descendants look good 20 years from now.</para><para><xref linkend="rehmetal10"/> argue that the approach they suggest, <emphasis role="ital">multiply-annotated text</emphasis>, has more advantages than stand-off approaches that keep a single copy of the source text:
        it also uses layers of annotation, but each layer contains an exact copy of the source text, so that
        sustainability is achieved through redundancy. It is not our aim to argue against that
        theoretical stance because, like the stand-off approach that we concentrate on here, it assumes modularity of
        description, and modularity is what OWLs need. Additionally, in principle, both approaches can be mixed in e.g.
        crowd-sourced corpora where annotation layers are contributed by external parties. Both approaches also appear
        to share one more problem: the lack of generic XML tool support, a matter which we will return to below.</para><para>Summing up, stand-off technology has both advantages and disadvantages from the point of view of the two
        deservedly hot leitmotifs of language documentation and linguistic infrastructure: sustainability and
        interoperability. On the one hand, the advocates of stand-off markup note the relative stability of source text
        with low-density markup (or with no markup at all), as well as the putative flexibility of the annotation
        layers. On the other hand, those who concentrate on the holistic advantages of language resources note that the
        merger of the source with the annotation layers requires dedicated machinery. In the next section, we look at
        three stand-off TEI systems that attempt to cope with these issues in various ways.</para></section><section xml:id="sect_why-demo" xreflabel="Section 2.3"><title>2.3. Selected TEI stand-off systems</title><para>The present section contains brief descriptions of selected complex systems involving versions of the TEI stand-off technology. The selection is absolutely
        partial and subjective, but, we believe, it serves its purpose nevertheless, exemplifying three out of many possible variants of stand-off systems.</para><para>The first resource to be presented is the National Corpus of Polish (NCP), an over-10<superscript>9</superscript>-segment deliverable of a 3-year
        state-funded project ending in late 2010, available for searching at <link xlink:href="http://nkjp.pl/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://nkjp.pl/</link>. We present the structure of a
        single corpus text in the diagram below.</para><figure xml:id="fig_ncp"><title>National Corpus of Polish (NCP): dependencies in a robust multi-layer stand-off system</title><mediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-004.png" width="60%" format="png"/></imageobject></mediaobject><caption><para>Dependencies among the annotation layers in the National Corpus of Polish. Red arrows (<inlinemediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-005.png" format="png"/></imageobject></inlinemediaobject> ) denote the dependencies among the various parts of the hierarchy. Blue arrows (<inlinemediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-006.png" format="png"/></imageobject></inlinemediaobject>) and purple arrows (<inlinemediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-007.png" format="png"/></imageobject></inlinemediaobject>) signal the inclusion of the local header and the main corpus header, respectively. For the sake of readability, these relationships are
            not indicated in the diagrams that follow.</para></caption></figure><para>In the NCP, the source text has minimal structural inline markup, down to the level of the paragraph
          (<code>&lt;p&gt;</code> or <code>&lt;ab&gt;</code>, the latter standing for "anonymous block" where we don't want
        to make a semantic commitment). Sentence boundaries as well as the individual tokens are identified at the
        segmentation layer (1.); this is also where segmental ambiguities such as those discussed in <xref linkend="fig_tokenization1"/> (b) are indicated. The segmentation layer serves as the basis for the layer
        that, firstly, identifies all the morphological interpretations of the given segment, and secondly, attempts to
        disambiguate them in the morphosyntactic context (2.); this layer is referenced by the next two: (3.) the layer
        of syntactic words (grouping e.g. analytic tense realizations but also elements such as <emphasis role="ital">obawiać się</emphasis> and <emphasis role="ital">uśmiechnąć się</emphasis>, cf. <xref linkend="fig_tokenization1"/> (c)) and (4.) the layer of word-sense disambiguation (experimental, for 100
        selected lexemes with multiple interpretations). The layer of syntactic words is the basis for the final two
        layers: the layer of named-entity recognition (5.)<footnote><para>In the earlier version of the corpus, for technical reasons, this layer was based on the
            morphosyntactic disambiguation layer (2.), which is reflected in some of the early publications.</para></footnote> and the layer of shallow parsing, identifying syntactic chunks (6.). All NCP documents, source text
        and annotations alike, include two kinds of headers: the local header that describes the properties of the
        source text and contains a changelog for all the modifications and additions that affect the given directory,
        and the single corpus header, which contains information shared by all parts of the corpus, including
        definitions of various taxonomies, which are referenced from the local headers. An early version of TEI ODD
        "literate encoding" documents describing some of these schemas has been made available at <link xlink:href="http://nlp.ipipan.waw.pl/TEI4NKJP/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://nlp.ipipan.waw.pl/TEI4NKJP/</link>. See <xref linkend="bansp-adamp10"/> for more description and references to more detailed papers on each of the
        annotation layers.</para><para>The two resources that follow took the overall model of the NCP as their starting point, and tailored it to
        their specific purposes and the context in which they are deployed. The first of them is the Open-Content Text
        Corpus. This is a resource meant both to serve as an open-source testing ground for TEI stand-off applications and,
        at the same time, to constitute a common platform for collective research and academic work on preserving and
        describing language resources, especially those for "lower-density languages". While we leave the details aside
        (see <xref linkend="bansp-beataw10"/>), we note that the multilingual nature of the corpus (at the time of
        writing, it contains mini-subcorpora for 55 languages) forces the introduction of one more layer of
        organization, with its own header. Thus, each text of the OCTC includes three headers, the links to which have
        been mercifully omitted from the diagram below. Because the corpus is meant as a platform for many possible
        research or student teams, it is explicitly modelled as a <emphasis role="bold">multi-instance</emphasis>
        stand-off structure, which means that it is expected that a single annotation layer of the OCTC may come in
        many variants, depending on the tools used to create it. This is indicated below.</para><figure xml:id="fig_octc"><title>Open-Content Text Corpus (OCTC): dependencies in a multi-instance stand-off system</title><mediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-008.png" width="60%" format="png"/></imageobject></mediaobject><caption><para>Dependencies among the annotation layers in the Open-Content Text Corpus. Red arrows (<inlinemediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-005.png" format="png"/></imageobject></inlinemediaobject>
          ) denote the dependencies among the various parts of the hierarchy.</para></caption></figure><para>The final example presents the prototype structure of the English subcorpus of the FLEC (Foreign Language
        Examination Corpus), a learner corpus containing examination essays by students of the University of Warsaw.
        Its primary aims are to study the transfer of linguistic structures of Polish onto the second language and to
        measure the inter-rater agreement in order to attain greater objectivity in grading language exams (see <xref linkend="banski-gozdawa10"/> for details). The electronic source texts are produced by transcribers (the exams are
        written by hand), who fill out templates already divided into sentence-sized chunks, and introduce extra markup
        for unclear passages, special textual features, gaps and the like. The source text is then tokenized according
        to an agreed tokenization standard (by default, according to whitespace and punctuation, but the English part
        additionally obeys the CLAWS tokenization rules), and each token is indexed. This becomes the new base that
        other annotation layers reference, and in effect, the original source text remains only as a backup. This
        system is close to what <xref linkend="cummings09"/> describes; to distinguish it from systems in which the
        source text receives only light tagging (or no tagging at all, as in the <link xlink:href="http://www.americannationalcorpus.org/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">American National Corpus</link>), we use the term
        "rich-base stand-off system" to refer to it.</para><figure xml:id="fig_flec"><title>Foreign-Language Examination Corpus (FLEC): dependencies in a rich-base stand-off system</title><mediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-009.png" width="60%" format="png"/></imageobject></mediaobject><caption><para>Dependencies among the annotation layers in the Foreign-Language Examination Corpus. Red arrows (<inlinemediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-005.png" format="png"/></imageobject></inlinemediaobject> ) denote the dependencies among the various parts of the hierarchy. The black arrow
            indicates that the segmentation layer takes over all the functions and the content of the source layer,
            which is retained only for archival purposes but does not participate in further processing of the
            corpus.</para></caption></figure><para>Each essay is rated by at least two instructors, on special forms that make it possible to transcribe the ratings into electronic form. The morphosyntactic
        layer is added separately, to make it possible to perform searches. A syntactic level is planned for the future; it will allow searches based on syntactic
        criteria, e.g. for specific constructions.</para><para>Here is where modularity is required and enforced by practical considerations: the individual parts of the
        corpus are scheduled to be created at different points in time, as dictated by the availability of the new
        data, the ability of the transcriber team to cope with the hand-written exams, then to cope with the raters'
        judgements, while the morphosyntactic descriptions are created.</para></section><section xml:id="sect_why-summary" xreflabel="Section 2.4"><title>2.4. Motivation for stand-off representation: summary</title><para>This section looked at the motivation for the use of stand-off annotation in linguistic applications. We
        first looked at the application of the OHCO thesis to linguistic theorizing and found that there was no
        straightforward relationship between cases where OHCO failed and the preferred encoding strategy. The choice
        appears to depend on the particular perspective (hinted at rather than defined here with reference to
        linguistics): whether OHCO-conforming or not, perspectives that reflect the modular nature of the grammar, or
        that are due to practical issues such as the choice of the particular tagging tool or tagging system, call for
        dedicated annotation documents. In systems such as the FLEC, the practical issues reach even further and depend
        on the human annotators of each kind of document (the essays are planned to be always transcribed as soon as
        possible, to provide raw material for studies of the lexical content).</para><para>Let us reiterate: many of the above issues could be encoded within single files, thanks to the ingenuity of
        the many designs allowing for non-OHCO representations. But we OWLs are not even going to try them: we want our
        encoding layers separated, for reasons both theoretical (they encode different pieces of our descriptions and we
        want to keep it that way) and practical (we want relatively generic systems that "just work" without requiring
        dedicated tools; if they require too much hassle, we'll just grab a different, existing solution, and not
        necessarily one based on XML).</para></section></section><section xml:id="sect_history" xreflabel="Section 3"><title>3. Stand-off annotation: the semantics of hyperlinks</title><para>To our knowledge, the earliest mentions of stand-off annotation, at least in the broad context of the TEI,
      were made in papers co-authored by Henry Thompson and David McKelvie, with the participation of Amy Isard and
      Chris Brew (<xref linkend="thompson-mckelvie97"/>, <xref linkend="mckelvieetal98"/>, <xref linkend="isardetal98"/>). They were mostly made in the context of the LT NSL package (later to become LT XML), created at the
      University of Edinburgh. In these papers, the foundations for stand-off semantics were laid. Below, we look at
      some of the possible interpretations of stand-off links, the first four defined by the LT NSL group. Much of that
      has later surfaced in the XInclude and XLink specifications (the latter partially based on TEI pointing
      techniques).</para><para>We can distinguish at least the following kinds of possible interpretations of linking attributes and elements:<orderedlist><listitem><para><emphasis role="bold">inclusion</emphasis> (with or without the loss of metadata, cf. <xref linkend="thompson-mckelvie97"/> vs. <xref linkend="isardetal98"/>),</para><figure xml:id="fig_inclusion"><title>Inclusion semantics</title><mediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-010.png" format="png" width="100%"/></imageobject></mediaobject><caption><para>Inclusion semantics of hyperlinks as originally presented in <xref linkend="thompson-mckelvie97"/>,
                with a simplified notation of the @target attributes. The proposals were unclear about the target
                metadata (in red) and either involved the loss of it (<xref linkend="thompson-mckelvie97"/>, <xref linkend="mckelvieetal98"/>) or preserved it (<xref linkend="isardetal98"/>); the latter option is
                shown in the figure above.</para></caption></figure></listitem><listitem><para><emphasis role="bold">replacement</emphasis> (later turning into XML Inclusions; involving the loss of
            the pointer metadata),</para><figure xml:id="fig_replacement"><title>Replacement semantics</title><mediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-011.png" format="png" width="100%"/></imageobject></mediaobject><caption><para>Replacement semantics of hyperlinks as presented in <xref linkend="mckelvieetal98"/> and <xref linkend="isardetal98"/>. This is the straightforward ancestor of XInclude semantics.</para></caption></figure></listitem><listitem><para><emphasis role="bold">inverse replacement</emphasis> ("include everything but the element that I point
            at, and use me instead of it"); this raises the question of feasibility of implementation if more than one
            such replacement is performed in a single document;</para></listitem><listitem><para><emphasis role="bold">multiple-point linking</emphasis> (it later became the semantics of the
            TEI's <code>&lt;link&gt;</code>, specialized into <code>&lt;join&gt;</code>). See also <xref linkend="listing_octc_align"/> below.</para></listitem><listitem><para><emphasis role="bold">correspondence</emphasis> semantics, the most underspecified semantic
            relationship possible (not mentioned in the LT NSL system but logically necessary and somewhat akin to
            multiple-point linking); correspondence semantics may be enough for visualising applications – i.e., there
            is no need to derive an extra TEI representation: all that the application has to know is which fragments
            of one layer correspond to which fragments of another, and that is enough to act on them. Correspondence
            semantics is also necessary in the case of multimodal corpora, where annotation layers (in)directly address
            binary streams. In the TEI, there exist a variety of devices for simple pointing, from the @target
            attribute (and the deprecated @targets), sometimes embedded in the <code>&lt;ptr&gt;</code> or
              <code>&lt;ref&gt;</code> elements, through the entire range of pointers with added shades of interpretation
            beyond pointing or linking, such as @corresp, @ref or @ana, among many others (see <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ATTS.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ATTS.html</link> for a complete list of TEI
            attributes). Note also that the simplest version of multiple-point linking semantics involves this kind of
            pointing, but at <emphasis role="ital">more than one</emphasis> resource at the same time.</para></listitem><listitem><para><emphasis role="bold">merger</emphasis> semantics, possible under limited circumstances, "merge my
            attributes/content with the attributes/content of the element I am pointing at" – this is a viable
            possibility for e.g. a morphosyntactic layer composed out of <code>&lt;seg&gt;</code> elements containing
            feature structures and pointing to a segmentation layer composed of empty <code>&lt;seg&gt;</code>
            elements, whose only role is to address character spans in the source text. A variation of this scenario
            with more content is illustrated below.</para><figure xml:id="fig_merger"><title>Merger semantics</title><mediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-012.png" format="png" width="100%"/></imageobject></mediaobject><caption><para>Merger semantics, possible if the relevant schemas are non-conflicting (in the extreme case, if
                both annotation documents are instances of the same schema that also allows the occurrence of both
                kinds of element content together). The effect is that of unification of descriptions.</para></caption></figure></listitem><listitem><para><emphasis role="bold">reverse inclusion</emphasis> semantics – literal interpretation of the semantics
            of CES links (see below); untenable for at least practical reasons, though we stress that it has been used
            mostly in the context of virtual representations, and with such a proviso, reverse inclusion semantics may
            even be argued to be useful for descriptions of binary streams, which get <emphasis role="ital">virtually</emphasis> "adorned" with the annotation information.</para><figure xml:id="fig_reverse-inclusion"><title>Reverse inclusion (virtual)</title><mediaobject><imageobject><imagedata fileref="../../../vol5/graphics/Banski01/Banski01-013.png" format="png" width="100%"/></imageobject></mediaobject><caption><para>This type of inclusion can be called, in the context of the system presented here, "reverse
                inclusion". It represents the literal reading of the most popular characterisation of stand-off markup
                in the Corpus Encoding Standard documentation. The result is a virtual structure that in fact had to be
                realised as either straight inclusion or replacement, or else correspondence semantics.</para></caption></figure></listitem></orderedlist>
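      As a rough textual counterpart of the merger case in the list above, the sketch below pairs the empty segmentation <code>&lt;seg&gt;</code> for <emphasis role="ital">aby</emphasis> (as it appears in <xref linkend="listing_nkjp"/>) with a morphosyntactic <code>&lt;seg&gt;</code> that points at it, and then shows one conceivable outcome of unifying the two. Only the segmentation element comes from an existing resource; the file name <code>ann_segmentation.xml</code>, the <code>morph_</code> identifier, the feature names and values, and the choice of which attributes survive the merger are assumptions made purely for illustration, not a description of how any existing tool behaves.
      <programlisting xml:space="preserve">
&lt;!-- segmentation layer (file name assumed: ann_segmentation.xml): an empty seg addressing a character span --&gt;
&lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,215,3)" xml:id="segm_2.35-seg"/&gt;

&lt;!-- morphosyntactic layer (hypothetical): a seg carrying a feature structure and pointing at the segment --&gt;
&lt;seg corresp="ann_segmentation.xml#segm_2.35-seg" xml:id="morph_2.35-seg"&gt;
  &lt;fs type="morph"&gt;
    &lt;f name="base"&gt;&lt;string&gt;aby&lt;/string&gt;&lt;/f&gt;
    &lt;f name="ctag"&gt;&lt;symbol value="comp"/&gt;&lt;/f&gt;
  &lt;/fs&gt;
&lt;/seg&gt;

&lt;!-- one possible merged result (assumed policy): the pointer keeps its own xml:id and content,
     and inherits the character-span address of the element it pointed at --&gt;
&lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,215,3)" xml:id="morph_2.35-seg"&gt;
  &lt;fs type="morph"&gt;
    &lt;f name="base"&gt;&lt;string&gt;aby&lt;/string&gt;&lt;/f&gt;
    &lt;f name="ctag"&gt;&lt;symbol value="comp"/&gt;&lt;/f&gt;
  &lt;/fs&gt;
&lt;/seg&gt;
      </programlisting>
      Other policies are equally imaginable, e.g. keeping the identifier of the target instead; the point of the sketch is only that the two descriptions do not conflict and can therefore be unified.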
    </para><para>The last kind of semantics is inspired by the prose descriptions of the Corpus Encoding Standard (<link xlink:href="http://www.cs.vassar.edu/CES/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.cs.vassar.edu/CES/</link>), an SGML-based specialization
      of the <link xlink:href="http://www.tei-c.org/Vault/GL/P3/index.htm" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">TEI-P3</link> (cf. <xref linkend="ide-veronis93"/>, <xref linkend="ide98"/>) and its later XML version, XCES (<link xlink:href="http://www.cs.vassar.edu/XCES" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.cs.vassar.edu/XCES</link>, cf. <xref linkend="ideetal00"/>). This standard was an important way to offer OWLs a handy TEI-like tool for quick deployment and proved very
      popular in corpus-linguistic circles. One of its most important features was that it severely narrowed down the
      sometimes enormous range of options offered by the unconstrained TEI – this task had already been performed for
      the OWL, who only needed to choose from among the optional elements and attributes, but crucially, not so much
      from <emphasis role="ital">equivalent</emphasis> ways to annotate texts. Other important features of the XCES
      were: <itemizedlist><listitem><para>focus on the implementation of stand-off methodology,</para></listitem><listitem><para>three different content models for the three different layers (source text, analysis (morphosyntax and
            chunks), alignment),</para></listitem><listitem><para>re-entrant <code>cesCorpus</code> (supporting the structural encoding of subcorpora),</para></listitem><listitem><para>specific recommendations for morphosyntactic and alignment markup.</para></listitem></itemizedlist></para><para>(X)CES hyperlink semantics was always stated in free prose and typically mentioned “annotations virtually
      added to the base document” – treated literally, it would result in something like the “reverse inclusion”
      mentioned above. However, while it is natural to be able to predict and shape the behaviour and composition of
      the document under the control of the annotator, i.e. the annotation document, it is not necessarily so with
      respect to the source. This means that at best, we can expect inclusion or replacement semantics here, although
      what was often meant in the XCES, we believe, might have been simply correspondence semantics, with redundant
      text fragments copied from the source text layer, somewhat in the manner of multiply-annotated text of e.g. <xref linkend="goeckeetal10"/>, but with no guarantee of exhaustivity of description.</para><para>In the listings below, we illustrate some of the above-mentioned concepts of hyperlink semantics with
      fragments of existing text resources, the National Corpus of Polish (<xref linkend="listing_nkjp"/>) and the
      Open-Content Text Corpus. The NCP uses correspondence semantics; it is worth pointing out that the word <emphasis role="ital">abyś</emphasis> is split into the sentential conjunction <emphasis role="ital">aby</emphasis> "in
      order to" and the person-number clitic <emphasis role="ital">ś</emphasis>, which is marked as orthographically
      adjoined to its host (text tokens are listed in comments above the corresponding <code>&lt;seg&gt;</code>
      elements).</para><programlisting xml:space="preserve" xml:id="listing_nkjp" xreflabel="Listing 1">
   &lt;p corresp="text.xml#txt_2-div" xml:id="segm_2-p"&gt;
     ...
    &lt;s xml:id="segm_2.40-s"&gt;
      ...
      &lt;!-- pragnę --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,207,6)" xml:id="segm_2.33-seg"/&gt;
      &lt;!-- , --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,213,1)" nkjp:nps="true)" 
      xml:id="segm_2.34-seg"/&gt;
      &lt;!-- aby --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,215,3)" xml:id="segm_2.35-seg"/&gt;
      &lt;!-- ś --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,218,1)" nkjp:nps="true)" 
      xml:id="segm_2.36-seg"/&gt;
      &lt;!-- nim --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,220,3)" xml:id="segm_2.37-seg"/&gt;
      &lt;!-- pozostał --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,224,8)" xml:id="segm_2.38-seg"/&gt;
      &lt;!-- ” --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,232,1)" nkjp:nps="true)" 
      xml:id="segm_2.39-seg"/&gt;
      &lt;!-- . --&gt;
      &lt;seg corresp="text_structure.xml#string-range(txt_2.1-ab,233,1)" nkjp:nps="true)" 
      xml:id="segm_2.40-seg"/&gt;
    &lt;/s&gt;
    &lt;s xml:id="segm_2.55-s"&gt;...&lt;/s&gt;
  &lt;/p&gt;
    </programlisting><para>In the OCTC listings below, we first look at a segmentation file that uses mixed semantics: correspondence
      semantics for the containing element, <code>&lt;ab&gt;</code> ("anonymous block"), and replacement semantics
      realised by the XInclude directive using the W3C-defined <code>xpointer()</code> scheme.<footnote><para>The NCP listing is taken from <link xlink:href="http://nlp.ipipan.waw.pl/TEI4NKJP/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://nlp.ipipan.waw.pl/TEI4NKJP/</link> and slightly modified. The OCTC examples in <xref linkend="listing_octc_before"/> and <xref linkend="listing_octc_align"/> are taken from the OCTC
          SVN repository. They use W3C-defined pointers, which can be used as a fallback from the TEI XPointer version (not shown in the listing).</para></footnote></para><programlisting xml:id="listing_octc_before" xreflabel="Listing 2" xml:space="preserve">
&lt;ab xml:id="swh_sgm_2-ab" type="para" corresp="text.xml#swh_txt_2-p"&gt;
    &lt;seg xml:id="swh_sgm_2.1-seg"&gt;
      &lt;xi:include href="text.xml"
        xpointer="xpointer(string-range(id('swh_txt_2-p')/text()[1],'',1,6)[1])"/&gt;
    &lt;/seg&gt;
    &lt;seg xml:id="swh_sgm_2.2-seg"&gt;
      &lt;xi:include href="text.xml"
        xpointer="xpointer(string-range(id('swh_txt_2-p')/text()[1],'',8,7)[1])"/&gt;
    &lt;/seg&gt;
    &lt;seg xml:id="swh_sgm_2.3-seg"&gt;
      &lt;xi:include href="text.xml"
        xpointer="xpointer(string-range(id('swh_txt_2-p')/text()[1],'',16,2)[1])"/&gt;
    &lt;/seg&gt;
    &lt;seg xml:id="swh_sgm_2.4-seg" rend="glued"&gt;
      &lt;xi:include href="text.xml"
        xpointer="xpointer(string-range(id('swh_txt_2-p')/text()[1],'',18,1)[1])"/&gt;
    &lt;/seg&gt;
    ...
&lt;/ab&gt;
    </programlisting><para>The result of resolving XInclude directives – the first segments of the Universal Declaration of Human
      Rights in Swahili – is provided in <xref linkend="listing_octc_after"/> below.</para><programlisting xml:id="listing_octc_after" xreflabel="Listing 3" xml:space="preserve">
 &lt;ab type="para" corresp="text.xml#swh_txt_2-p"&gt;
    &lt;seg&gt;Katika&lt;/seg&gt;
    &lt;seg&gt;Disemba&lt;/seg&gt;
    &lt;seg&gt;10&lt;/seg&gt;
    &lt;seg rend="glued"&gt;,&lt;/seg&gt;
    ...
  &lt;/ab&gt;
    </programlisting><para>The final listing illustrates one possible take on the multiple-point semantics – a fragment of a document
      aligning the Polish and the Swahili versions of the Universal Declaration. It does not use the
        <code>&lt;link&gt;</code> element, usually suggested for this purpose, but rather separate <code>&lt;ptr&gt;</code>
      elements, for greater granularity (in some cases, many:many relationships between fragments of text must be
      expressed and using <code>&lt;link&gt;</code> elements with multi-valued <code>@target</code> attributes would be
      rather tedious).</para><programlisting xml:id="listing_octc_align" xreflabel="Listing 4" xml:space="preserve">
&lt;div xml:id="pol-swh_aln_2-div" type="tu" part="N" prev="#pol-swh_aln_1.1-linkGrp" org="uniform"&gt;
    &lt;linkGrp xml:id="pol-swh_aln_2.1-linkGrp"&gt;
      &lt;ptr xml:id="pol-swh_aln_2.1.1-ptr" target="pol/UDHR/text.xml#pol_txt_1-head" type="tuv" 
      xml:lang="pl"/&gt;
      &lt;ptr xml:id="pol-swh_aln_2.1.2-ptr" target="swh/UDHR/text.xml#swh_txt_1-head" type="tuv" 
      xml:lang="sw"/&gt;
    &lt;/linkGrp&gt;
    &lt;linkGrp xml:id="pol-swh_aln_2.2-linkGrp"&gt;
      &lt;ptr xml:id="pol-swh_aln_2.2.1-ptr" target="pol/UDHR/text.xml#pol_txt_2-p" type="tuv" 
      xml:lang="pl"/&gt;
      &lt;ptr xml:id="pol-swh_aln_2.2.2-ptr" target="swh/UDHR/text.xml#swh_txt_2-p" type="tuv" 
      xml:lang="sw"/&gt;
    &lt;/linkGrp&gt;
    ...
&lt;/div&gt;
    </programlisting><para>Finally, it is worth mentioning the concept of <quote>radical stand-off</quote> that the XCES
      evolved into, in the context of the American National Corpus, which keeps the source files in the form of raw
      UTF-16 text and uses dedicated software (ANCTool) to merge the raw text with the annotations selected by the
      user, cf. <xref linkend="ide-suderman06"/>. Our interpretation of the XCES evolving in the context of the ANC
      and ending up merely as one of the output formats of the ANCTool is that the creators of the ANC have drawn
      conclusions from the stalled development of the W3C XPointer standard, various versions of which the XCES
      attempted to use over the years, and finally gave up on it and switched to an in-house tool that made it possible
      for them to go all the way towards radical stand-off annotation, which undoubtedly has some advantages from the
      point of view of sustainability of LRs (the texts are kept as read-only, so there is no danger of corrupting them
      by fixes and adaptations of markup).<footnote><para>Another way to cope with the lack of XPointer support has been to use XPointer-like attributes with
          in-house tools. This is the case of the PAULA, as illustrated by a fragment of Figure 1 in <xref linkend="dipper05"/>:
          <programlisting xml:space="preserve">&lt;mark id="tok 1" xlink:href="#xpointer(string-range(//body,'',1,3)))"/&gt;</programlisting>
          Such XPointers, however, would under the W3C definition return sequences of spans rather than single spans:
          what is missing is a position predicate of the kind found in the OCTC listing above. This is not meant as
          criticism of the PAULA approach but merely as an observation that XPointer xpointer() syntax, due to the lack
          of tools implementing it, started a life of its own, being used as a concise replacement for what could be,
          e.g. <code>@node</code>, <code>@from</code>, and <code>@length</code> attributes.</para></footnote>
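      The observation made in the preceding footnote can be sketched as follows. The first element adds a position predicate, <code>[1]</code>, of the kind used in <xref linkend="listing_octc_before"/>, to an XPointer-like reference; the second states the same span with plain attributes instead. Both are illustrative assumptions: the element name, the identifier and the attribute names <code>node</code>, <code>from</code> and <code>length</code> are hypothetical and are not taken from PAULA, the ANC, or any other of the systems discussed here.
      <programlisting xml:space="preserve">
&lt;!-- hypothetical: a W3C-style pointer narrowed to a single span by a position predicate --&gt;
&lt;mark id="tok_1" xlink:href="#xpointer(string-range(//body,'',1,3)[1])"/&gt;

&lt;!-- hypothetical: the same span expressed with plain attributes rather than XPointer syntax --&gt;
&lt;mark id="tok_1" node="//body" from="1" length="3"/&gt;
      </programlisting>
      The second form gives up any claim to W3C conformance, but it makes explicit that an in-house tool needs no more than a node address, an offset and a length.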
    </para></section><section xml:id="sect_TEI" xreflabel="Section 4"><title>4. Stand-off markup in the TEI</title><para>We begin by distinguishing two uses of stand-off devices and then concentrate on what features of the XCES can be found implemented in TEI P5.</para><section xml:id="sect_TEI-local" xreflabel="Section 4.1"><title>4.1. <quote>local stand-off</quote></title><para>Recall that one of the purposes of stand-off annotation is to make it possible to handle
        overlapping hierarchies and any other sort of conflicting markup. The TEI has several
        devices for this purpose, mentioned in <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">chapter 20</link>
        of the Guidelines. The prototypical example is <code>&lt;join&gt;</code>, aggregating
        elements that it points to into a virtual object. Other members of the family include
          <code>&lt;alt&gt;</code>, <code>&lt;span&gt;</code>, and the least semantics-laden
          <code>&lt;link&gt;</code>. <footnote><para>We ignore <code>&lt;ptr&gt;</code> and <code>&lt;ref&gt;</code>, because these are,
            respectively, the plain and the adorned instances of the infrastructure that makes
              <code>&lt;join&gt;</code> and its kin usable. Likewise, we ignore the role of pointing
            attributes on elements other than those mentioned here.</para></footnote></para><para>Because pointers in TEI P5 are URI-based, these elements may be used as
        both "local stand-off" and "remote stand-off" elements (where the former is not an oxymoron
        and the latter not a tautology): if the metaphor for “stand-off” is paraphrased as
        “creating/organizing a structure in resource A out of elements of resource B by pointing to
        them”, then in the cases where the pointing is local, A and B are the same resource. The
        remaining discussion in this section does not refer to such uses, but we return to them in
          <xref linkend="sect_solutions"/>. In the remainder of this section, we look at the kind of
        stand-off annotation that involves pointing across separate documents.</para></section><section xml:id="sect_TEI-XCES" xreflabel="Section 4.2"><title>4.2. The converging paths of the XCES and TEI P5</title><para><link xlink:href="http://www.tei-c.org/release/doc/tei-p4-doc/htm" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">TEI P4</link> was an
        XML-ised version of P3, with minimal changes, and only the introduction of <link xlink:href="http://www.tei-c.org/Guidelines/P5/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">TEI P5</link> saw drastic modifications
        in the recommendations for linking. To an outside observer, this may be described
        as the TEI's <emphasis role="ital">reabsorption</emphasis> of the modifications introduced
        by the CES as the latter forked from TEI P3. Apart from important low-level modifications of
        elements designed to annotate linguistic structure, the important changes were the
        introduction of a self-nesting <code>&lt;teiCorpus&gt;</code> element (it did not self-nest
        in P4; this is a feature crucial in resources such as the above-mentioned NCP or OCTC; see
        also <xref linkend="sect_why-demo"/>) and the generalization of the concept of stand-off markup, made possible by the
        switch from IDREF-based pointing to the current URI-based version.<footnote><para>This is not to claim that the change was unidirectional and conditioned by the (X)CES. In fact, as Lou
            Burnard (personal communication) points out, TEI P3 already had a basis for stand-off systems thanks to the
              <code>&lt;xptr&gt;</code> element that handled external pointing. This has of course got streamlined in
            P5, with the adoption of uniform pointing devices. For more on these changes, cf. <xref linkend="witternetal09"/>. Our usage of the term "reabsorption" concerns the fact that the CES/XCES
            was/is a specialized system in which, by its very nature and due to the theoretical assumptions that shaped
            it, language-resource-oriented solutions had to be introduced from the outset. Whether their emergence in
            TEI P4 and P5 was only a matter of convergence of two independent lines of thought or whether the more
            specialized standard informed the more general one is something that I do not concern myself with, although
            I would welcome a situation whereby one open project benefits from the findings of another, rather than
            waste its time on reinventing the wheel.</para></footnote>
      </para><para>It is worth highlighting an ingenious move in the introduction of stand-off annotation
        in the TEI, namely the use of the XInclude standard
          (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/xinclude/</link>).</para><para>The initial version of the XCES (cf. <xref linkend="ideetal00"/>) used XLink (then
          <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/xlink/</link>, currently
          <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/xlink11/</link>) as the pointing device, with XPointer
          <code>xpointer()</code> scheme expressions (back then, the scheme was called <code>xptr()</code>) as
        the content of <code>xlink:href</code>. The XPointer <code>xpointer()</code> draft
          (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/xptr-xpointer/</link>) was being born right at that moment and
        nothing forewarned of its remaining at the draft stage for ever, or at least until now.
        The XLink recommendation was also fresh and promising to see much heavier use than it does
        today, remaining endemic to only a few specifications. See <xref linkend="ide00"/> for a glimpse of the optimism that the introduction of new W3C standards brought into the LR community.</para><para>TEI P5 documents concerning the use of stand-off annotation
          (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/Activities/Workgroups/SO/</link>) date from 2003 at the
        earliest, and at that time, the XInclude recommendation was at least at the Working Draft
        stage and promising to become at least useful. XInclude explicitly uses replacement
        semantics (not inclusion semantics, despite the similarity in names) as defined by <xref linkend="thompson-mckelvie97"/> (see <xref linkend="sect_history"/>), and that must have appeared
        the perfect solution, the more so that it allowed the TEI to delegate some of the intended
        functionality to an independent W3C standard, promising to get wide support in XML parsers.<footnote><para>This prediction was eventually borne out, after some issues of infoset merger were
            solved, especially those concerning <code>@xml:base</code> fixup
              (cf. <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://norman.walsh.name/2005/04/01/xinclude</link>); nowadays, full XInclude
            support is among the basic features that all general-purpose parsers are expected to
            have.</para></footnote> Additionally, TEI P5 stand-off implementation was designed with the use of
        TEI-defined XPointer schemes in mind, which was another brilliant move because nothing
        (except perhaps for <code>xpointer()</code>'s prolonged draft status, but one should always
        hope) signalled that these schemes would remain as unimplemented by parsers today as they were at the
        time of their registration.<footnote><para>For an explanation of the various XPointer-related terms and some guidance across the labyrinth of
            terminology, see <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://wiki.tei-c.org/index.php/XPointer</link>. The W3C repository for third-party XPointer
            schemes is located at <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/2005/04/xpointer-schemes/</link>.</para></footnote></para></section></section><section xml:id="sect_problems" xreflabel="Section 5"><title>5. Problems with implementing TEI stand-off annotation</title><para>Stand-off annotation, with all its advantages for language description and documentation, typically requires
      a dedicated tool to implement the hyperlink semantics, compare e.g. LTXML2
        (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.ltg.ed.ac.uk/software/ltxml2</link>), ANNIS
        (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.sfb632.uni-potsdam.de/d1/annis/</link>), ANCTool
        (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.americannationalcorpus.org/tools/anctool.html</link>) or applications using MonetDB
        (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://monetdb.cwi.nl/</link>). In this context, let us repeat that using XInclude accompanied by a set
      of XPointer schemes was an ingenious move, because in theory, it should allow an OWL to use TEI
      stand-off with ordinary off-the-shelf tools. In this section, we look at the various factors that conspire to
      produce what we believe is the undeserved lack of uptake of TEI stand-off devices. We divide these factors into external
      and internal, and among the latter, we make a distinction between technological and sociological.</para><section xml:id="sect_problems-technical-external" xreflabel="Section 5.1"><title>5.1. Technical issues external to the TEI</title><para>In this section, much depends on the reader being able to make the distinction between (i)
          <code>@xpointer</code> as the name of an XInclude attribute that can sometimes contain just a shorthand
        pointer (an NCname), (ii) XPointer as referring to the entire XPointer Framework, and (iii)
          <code>xpointer()</code> as referring to one of the XPointer schemes – that is why the sentence
          “<code>@xpointer</code> can hold XPointer's <code>xpointer()</code>” is meaningful, and true. Unsurprisingly,
        these terms are notoriously confused on various occasions. Similar remarks concern <code>string-range()</code>
        as the name of one of the <code>xpointer()</code> scheme's <emphasis role="ital">functions</emphasis>, defined
        by the W3C draft, and <code>string-range()</code> as the name of a TEI-defined XPointer <emphasis role="ital">scheme</emphasis>, on a par with <code>xpointer()</code>, <code>element()</code>, and other third-party
        schemes registered with the W3C. Some of these issues are addressed in a TEI Wiki article at <link xlink:href="http://wiki.tei-c.org/index.php/XPointer" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://wiki.tei-c.org/index.php/XPointer</link>.</para><para>As has been mentioned above, the TEI recommendations for stand-off annotation rely on
        the use of external standards, most importantly XInclude and the XPointer Framework, with
        its potential for defining third-party XPointer schemes. To cut a long story short: tool
        support for W3C XPointer schemes other than <code>element()</code> (obligatory for XInclude) and
          <code>xmlns()</code> does not exist, as far as the popular XML parsers are
        concerned, and support for third-party schemes is scant.</para><para>It might be claimed that these issues are internal to the TEI, after all, because they
        depend on the TEI's internal choice to use XInclude and to use its own XPointer schemes for pointing
        into the text. However, there appears to be no alternative to the use of XPointer schemes
        for pointing into spans of characters (short of unwinding history back to the era of TEI
        extended pointers whence XPointer comes), and in this sense, the lack of support for
        XPointer's xpointer() scheme blocks the possible development of support for the TEI-defined
        schemes, because that support should ideally piggyback (in terms of data structures, basic
        mechanisms, etc.), on the support for the W3C-defined <code>xpointer()</code>.<footnote><para>It may be speculated that, since XInclude already allows for including raw text (as a whole resource),
            it should also be allowed to address into raw text (in the <quote>extreme stand-off</quote> fashion), with
            appropriate XPointer schemes. For example, an appropriate scheme for addressing raw text could be a variant
            of the <code>string-range()</code> function, e.g.
            <programlisting xml:space="preserve">
              (string*)text-range([string-to-match], offset, length)
              (string)text-range(offset, length)
              (string)text-span(startoffset, endoffset)
            </programlisting>Several
            issues would have to be addressed at the application level (e.g. skipping the BOM character, recognizing
            character encoding, etc.), but on the whole, it does not feel particularly exotic and would only require
            lifting the XInclude ban on the simultaneous presence of <code>@parse="text"</code> and
              <code>@xpointer</code> as well as an adjustment in the MIME types enumerated in the XPointer Framework
            Recommendation. Both issues are demand-driven and potentially interrelated: if XInclude were able to handle
            addressing into raw text, interest in developing modular implementations of XPointer schemes might follow.
            This is further elaborated on in <xref linkend="bansp10"/>.</para></footnote></para><para>It has to be noted that there exists a single widely accessible implementation of XInclude that goes beyond
        the minimum prescribed by the W3C Recommendation: <code>libxml2</code> (<link xlink:href="http://xmlsoft.org/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://xmlsoft.org/</link>) with the <code>xmllint</code> parser that supports limited
          <code>xpointer()</code> functionality, although unfortunately in a buggy way, so that while it can be tasted and
        demonstrated, it cannot be employed full-scale.<footnote><para>The two outstanding bugs that block the use of W3C-defined schemes in <code>xmllint</code> are reported
            in the Gnome Bugzilla, at <link xlink:href="https://bugzilla.gnome.org/show_bug.cgi?id=620190" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">https://bugzilla.gnome.org/show_bug.cgi?id=620190</link> and <link xlink:href="https://bugzilla.gnome.org/show_bug.cgi?id=620195" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">https://bugzilla.gnome.org/show_bug.cgi?id=620195</link>. The OCTC now implements workarounds for both
            these bugs.</para></footnote>
      </para><para>The lack of tool support led the developers of the National Corpus of Polish to abandon
        XInclude-based stand-off in favour of the underspecified semantics of the <code>@corresp</code> attribute that
        simply states correspondence between two elements (or an element and a span of characters), cf. <xref linkend="listing_nkjp"/>. Since this has to be handled by dedicated project tools anyway, it is enough for
        these tools to read information from <code>@corresp</code> rather than mimic the behaviour of an XInclude
        processor. This shows the inaccessibility of TEI stand-off to users without technical background or technical
        support – it does not work out-of-the-box despite the measures that the Guidelines took to ensure that the
        technique is lucidly described. The OCTC attempts to use the W3C-defined <code>xpointer()</code> scheme as much as
        possible (cf. <xref linkend="listing_octc_before"/>) in order to be able to fall back from W3C technology to
        TEI schemes when the latter are finally implemented.</para><para><xref linkend="cayless-soroka10"/> point out the possibility of TEI pointers to point outside of their
        "lawful" domains and out across the whitespace, ignorable or not. This issue is real and actually seen in
        practice, in the xmllint bugs reported by the present author. Cayless and Soroka's observations should be
        treated as imposing specific constraints on stand-off pointers, whose ranges <emphasis role="ital">must</emphasis> be located inside the elements identified by each individual pointer; this is trivial in the
        case of the lower value of the offset (which should not go below 1 in the case of the W3C pointers or below 0
        in the case of TEI-defined pointer schemes) but becomes less than trivial when it comes to ensuring that the
        maximal value of the pointer does not extend beyond the addressed string.</para></section><section xml:id="sect_problems-technical-internal" xreflabel="Section 5.2"><title>5.2. TEI-internal technical issues</title><para>We argue here that, despite properly seizing the opportunity for a <quote>free ride</quote> with W3C
        specifications, some aspects of the putative reabsorption of XCES innovations into the TEI were not fully
        addressed.</para><para>The TEI diverges from its path of reabsorbing the XCES in that it packages all
        information, be it the source text or its annotations, into a single
          <code>teiCorpus/TEI/text</code> format. (Recall that the XCES used three different DTDs
        for the source text, the morphosyntactic analysis, and the alignment documents; they were
        different up to the root element.) This is inadequate for two reasons:<itemizedlist><listitem><para>it strains the semantics of the <code>&lt;text&gt;</code> element (annotations do
              not contain text, or at least do not have to contain it to be useful),</para></listitem><listitem><para>it packages technical annotation documents into the format expected of source text
              documents, which means that rather than putting a sequence of e.g. <link xlink:href="http://tei.oucs.ox.ac.uk/P5/Guidelines-web/en/html/ref-seg.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest"><code>&lt;seg&gt;</code></link> ("segment") elements with morphosyntactic information in
              feature structures straight into <code>text/body</code> (which is not ideal, as pointed out above,
              but for many would probably suffice), the developer has to trace the TEI content model
              and use e.g. the <link xlink:href="http://tei.oucs.ox.ac.uk/P5/Guidelines-web/en/html/ref-ab.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest"><code>&lt;ab&gt;</code></link> ("anonymous block") element as a wrapper for
                <code>&lt;seg&gt;</code>s only for the purpose of satisfying the content model
              designed for texts; similarly, it is impossible to keep a sequence of <link xlink:href="http://tei.oucs.ox.ac.uk/P5/Guidelines-web/en/html/ref-linkGrp.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest"><code>&lt;linkGrp&gt;</code></link> or <link xlink:href="http://tei.oucs.ox.ac.uk/P5/Guidelines-web/en/html/ref-spanGrp.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest"><code>&lt;spanGrp&gt;</code></link> elements in <code>text/body</code> – one has
              to use at least an empty dummy <code>&lt;div&gt;</code> for the document to validate. This
              feels like a kludge and the developer is tempted at this point to leave the TEI for
              the XCES or PAULA.</para><para>Note that one way out would be to redefine the content model inside
                <code>&lt;text&gt;</code> – after all, the TEI offers the mechanism of ODD for this
              purpose (cf. <xref linkend="burnard-rahtz04"/>). However, that would still mean that <itemizedlist><listitem><para>annotations are kept under <code>&lt;text&gt;</code>,</para></listitem><listitem><para>special effort must be put into designing the ODD beyond a mere selection of
                    the appropriate modules and elements, and</para></listitem><listitem><para>the resulting document is not TEI-conformant, because it changes the content model of
                      <code>&lt;text&gt;</code> (cf. <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/USE.html#CF" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">chapter 23.3 of the
                      Guidelines</link>). That in itself is no tragedy, but recall that we want to cater, among others,
                    to OWLs (ordinary working linguists), also with a view to the issues raised in the following
                    section. An OWL would often look for an out-of-the-box solution that the XCES promised (although
                    only for morphosyntactic annotation). Furthermore, breaking TEI-conformance may mean that whatever
                    tools there exist for handling the TEI may refuse to handle the non-conformant documents. Again, we
                    are thinking of an OWL who wants it to "just work", and we bear in mind that the recourse to using
                    W3C standards was exactly a step towards ensuring that things "just work". Having an OWL design
                    their own ODD in order to store stand-off annotations does not contribute to that goal.</para></listitem></itemizedlist>
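              To make the wrapping discussed in this item concrete, a minimal sketch of the body of an annotation document may look as follows. The file name, the identifiers, and the feature names and values are invented for the example; only the use of <code>&lt;ab&gt;</code> as a wrapper for <code>&lt;seg&gt;</code> elements reflects the situation described above.
              <programlisting xml:space="preserve">
&lt;text&gt;
  &lt;body&gt;
    &lt;!-- The ab element carries no information here:
         it is present only to satisfy the content model of body. --&gt;
    &lt;ab&gt;
      &lt;!-- Hypothetical morphosyntactic annotation of a segment
           kept in a separate source document, text.xml. --&gt;
      &lt;seg xml:id="morph_1" corresp="text.xml#seg_1"&gt;
        &lt;fs type="morph"&gt;
          &lt;f name="base"&gt;&lt;string&gt;kot&lt;/string&gt;&lt;/f&gt;
        &lt;/fs&gt;
      &lt;/seg&gt;
    &lt;/ab&gt;
  &lt;/body&gt;
&lt;/text&gt;
              </programlisting>
              In such a document, neither <code>&lt;text&gt;</code> nor <code>&lt;ab&gt;</code> says anything about the annotation itself.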
            </para></listitem></itemizedlist></para><para>The users are aware of some of this, cf. a recent TEI-L (<link xlink:href="http://listserv.brown.edu/archives/cgi-bin/wa?A0=TEI-L" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://listserv.brown.edu/archives/cgi-bin/wa?A0=TEI-L</link>) discussion thread (e.g. <link xlink:href="http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind1003&amp;L=TEI-L&amp;D=0&amp;T=0&amp;P=34628" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">Martin Holmes's message of Sat, 20 Mar 2010 06:38:00 -0700</link>) on where to keep
          <code>&lt;linkGrp&gt;</code> and similar elements. No consensus has been reached: there are users who keep them
        in <code>text/body</code> or <code>text/back</code>, others who would rather keep them in the header (because they
        feel it is a different kind of data), and, finally, those who would keep them “somewhere else”; we will see presently what this
        suggestion is.</para></section><section xml:id="sect_problems-socio" xreflabel="Section 5.3"><title>5.3. Social and sociological issues</title><para>Standards are good only insofar as they meet the expectations of, and get a chance of getting feedback from,
        the community that they are targeted at. However, corpus linguists or corpus producers appear to be
        underrepresented in the TEI community. TEI-L, an otherwise helpful mailing list with a very high
        signal-to-noise ratio, falls largely silent when corpus-design-related questions are asked, in comparison to
        e.g. questions regarding the encoding of manuscripts, literary works, or bibliographies. Instead of concluding
        that TEI-ers with a corpus-design twist are the most unfriendly of TEI-ers, one should rather conclude that
        there are few such TEI-ers around, and ask why, and whether the XCES has taken them all away.</para><para>A corpus linguist, if s/he is going to choose XML (rather than plain text or an RDBMS) as
        the format of choice, may easily choose the XCES – dated but simple and sufficient for
        lightly-analysed corpora, or PAULA (<xref linkend="dipper05"/>) – not so simple but with an
        array of technical backup and a bunch of smart people popularizing it, or finally a more
        dedicated format such as that of the ANC (“extreme stand-off”, cf. <xref linkend="ide-suderman06"/>),  a testbed for the nascent ISO Linguistic Annotation
        Framework (<xref linkend="ide-romary07"/>) and more specifically, for the abstract pivot
        format, GrAF (<xref linkend="ide-suderman07"/>).</para><para>The question is whether the TEI has a chance to become an alternative to these systems
        for an OWL. Incidentally, when searching the Net for a reference to the origin of the
        interpretation of “OWL” used here, I came across a similar point made by <xref linkend="farrar-moran"/>:</para><blockquote><para>“[...] any new approach or technology requires critical mass. If too few in a
          community use the technology, then it will usually fail. TEI recommendations (using SGML)
          never caught on with the ordinary working linguist, likely due to the unavailability of
          tools to produce it. The situation with recent best-practice XML recommendations has been
          only slightly better.”</para><attribution><xref linkend="farrar-moran"/></attribution></blockquote><para>One of the points we want to make in this contribution is that the TEI still has to win
        some of the corpus linguistic audience in order to kickstart the development-feedback cycle
        for stand-off corpus encoding. It needs to make a move towards a corpus-oriented OWL,
        possibly by addressing the issues raised here and by a pressure towards the implementation
        of a widely-accessible generic tool that supports stand-off architecture. That would be the
        next of the numerous services to the linguistic and XML community that the TEI has done over
        the years.</para></section></section><section xml:id="sect_solutions" xreflabel="Section 6"><title>6. A sketch of solutions</title><para>The TEI is a very good choice for complex corpus encoding with a view to sustainability
      and interoperability because, apart from its other virtues, it offers a <emphasis role="ital">homogeneous</emphasis> format for encoding all the annotations and storing “formal
      metadata” in the TEI headers – this point is made in <xref linkend="bansp-adamp10"/>, and
      illustrated in <xref linkend="adamp-bansp10"/>. However, at present, stand-off TEI deployment
      for such purposes requires dedicated visualisation and query tools acting on the
      correspondence semantics of hyperlinks (because XInclusions fail due to the lack of support for third-party
      XPointer schemes anyway), as well as some dedication to simplify parts of the architecture
      that are unduly complex for someone who wants to “just do it”, just create a stand-off
      annotation document without having to create a mock text document in the process.</para><para>Some of the necessary pieces of the puzzle are already there. Recall that the (X)CES used
      three different content models for annotating the source text, the annotations, and the
      alignment. We do not want to argue for that – that would not be as homogeneous a format as
      what the TEI offers currently. Instead, we would like to point out that it is possible to
      reabsorb the (X)CES inventions more fully into the current TEI model, by keeping text under
        <code>&lt;text&gt;</code>, and non-text elsewhere. Let us have a look at the content model
      of the <code>&lt;TEI&gt;</code> element:</para><programlisting xml:space="preserve">
element TEI
{
   att.global.attributes,
   attribute version { xsd:decimal }?,
   ( teiHeader, ( ( model.resourceLike+, text? ) | text ) )
}
    </programlisting><para><code>model.resourceLike</code> contains <code>&lt;facsimile&gt;</code> (for <link xlink:href="http://tei.oucs.ox.ac.uk/P5/Guidelines-web/en/html/PH.html#PHFAX" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">digital facsimiles</link>) and
        <code>&lt;fsdDecl&gt;</code> (for <link xlink:href="http://tei.oucs.ox.ac.uk/P5/Guidelines-web/en/html/FS.html#FD" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">feature system declarations</link>).
      Another element that is planned to be included in this model, defined by a planned new chapter of the Guidelines
      devoted to genetic editions, is <code>&lt;document&gt;</code> (<quote>the physical object, the manuscript or
        other primary source, comprising one or more written surfaces</quote>, see <link xlink:href="http://www.tei-c.org/SIG/Manuscripts/genetic.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/SIG/Manuscripts/genetic.html</link>). Another addition, suggested by <xref linkend="boot09"/>, is <code>&lt;dataSection&gt;</code> (to store, e.g., <code>&lt;linkGrp&gt;</code> elements
      that just do not fit under <code>&lt;text&gt;</code>). In two e-mails to the TEI-L (on <link xlink:href="http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind1003&amp;L=TEI-L&amp;T=0&amp;F=&amp;S=&amp;P=35185" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">21 Mar 2010</link> and on <link xlink:href="http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind1003&amp;L=TEI-L&amp;T=0&amp;F=&amp;S=&amp;P=36329" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">22 Mar 2010</link>), the present author has suggested the introduction of <code>&lt;standOff&gt;</code> for
      the same purpose. What this shows is that a need for an extra sibling of <code>&lt;text&gt;</code> is recognized
      in the community, and, in the case of <code>&lt;document&gt;</code>, it is actually being implemented.
      Implementing an element that could store stand-off markup would be the final, and in our view necessary, step in
      reabsorbing the XCES innovations into the TEI. It would simplify the content model for the annotation creators,
      making it possible for them not to abuse the semantics of <code>&lt;text&gt;</code>, and at the same time not to
      be bound by the requirements of <code>&lt;text&gt;</code>'s content model.</para><para>We sketch the possible configurations resulting from implementing this solution using
        <code>&lt;standOff&gt;</code> because the name of this element corresponds with the topic of the present
      contribution. We note, however, that the name "standOff" is only used for the sake of discussion and that e.g.
      Peter Boot's suggestion for the name of the element in question, namely <code>&lt;dataSection&gt;</code>, sounds
      just as good, although perhaps too generic (everything here is data).</para><para>Recall that the TEI contains a range of elements with uses that we have referred to in
        <xref linkend="sect_TEI-local"/> by the mildly fortunate term "local stand-off". They are elements
      that may sometimes be perfectly justified inside <code>&lt;text&gt;</code>, but there have
      been suggestions to move them into the teiHeader or (as in <xref linkend="boot09"/>) to move
      them to a sibling of <code>&lt;text&gt;</code>. We suggest that two of these solutions can be
      used, depending on the annotating task at hand (we reject the option of locating these
      elements in the header). Those for whom the "local stand-off" elements are justified inside
      <code>&lt;text&gt;</code> can keep them there. Those who would rather separate them from
      text proper could use three child elements of <code>&lt;TEI&gt;</code> at the same time:
        <code>{ teiHeader, standOff, text }</code>. A third group, the creators of the "classical"
      stand-off annotation levels, would use <code>{ teiHeader, text }</code> for text documents but
        <code>{ teiHeader, standOff}</code> for annotations. This is not the place to suggest the
      content model of the putative <code>&lt;standOff&gt;</code> element – suffice it to say that
      we would expect it to hold, among others, elements from the "analysis" and "linking" parts of
      the TEI module inventory.<footnote><para>The NCP and to a slightly larger extent the OCTC would introduce one more innovation
          in the handling of annotation structures of the same text: complexes involving a <code>{
            teiHeader, text }</code> and multiple instances of <code>{ teiHeader, standOff}</code>
          would become, abstractly, <code>&lt;teiHeader, { text,
              standOff<subscript>(1..N)</subscript>}&gt;</code> (where angle brackets denote an
          ordered pair, and curly brackets a set), because in both corpora, the header is shared
          (XIncluded) among all the relevant documents. We leave the interesting consequences of
          such a setup for another occasion.</para></footnote>
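      As a rough illustration of the last configuration, <code>{ teiHeader, standOff }</code>, an annotation document might look as follows. This is only a sketch: the content model of <code>&lt;standOff&gt;</code> is deliberately left undefined here, and the file name, the identifiers, and the choice of <code>&lt;spanGrp&gt;</code> as its content are assumptions made for the sake of the example.
      <programlisting xml:space="preserve">
&lt;TEI xmlns="http://www.tei-c.org/ns/1.0"&gt;
  &lt;teiHeader&gt;
    &lt;!-- Header content elided in this sketch. --&gt;
  &lt;/teiHeader&gt;
  &lt;!-- Hypothetical sibling of text: an annotation document
       needs no text element of its own. --&gt;
  &lt;standOff&gt;
    &lt;spanGrp type="segmentation"&gt;
      &lt;!-- Points at an element of the separate source document. --&gt;
      &lt;span xml:id="seg_1" from="text.xml#p_1"/&gt;
    &lt;/spanGrp&gt;
  &lt;/standOff&gt;
&lt;/TEI&gt;
      </programlisting>
      The corresponding source document would keep the familiar <code>{ teiHeader, text }</code> shape.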
    </para><para>It is also worth mentioning that, for the purpose of establishing character offsets in
      XPointer schemes, the W3C proposals point to character segments and use "1" as the initial
      offset, whereas the TEI and LAF proposals look at inter-character points and use "0" as the initial
      offset. We consider this an unfortunate difference, as it is not user-friendly, and the cost of
      adapting to the W3C model, while the TEI schemes are not implemented yet, should be minimal.</para><para>Lastly, community pressure (or simply funding) is needed to implement XPointer extensions
      in a <emphasis role="ital">generic</emphasis> XML tool (the best candidate being libxml2 by
      Daniel Veillard, with the xmllint parser, because it already offers some of this functionality,
      which no other popular and freely available parser does), so that the ingenious TEI stand-off
      system based on XML Inclusions can do something more than merely look nice.<footnote><para>Note that the ability of XInclude to act on IDs suffices only for
          including pieces of an XML tree (or entire trees). However, stand-off annotation is (or
          can be, with the replacement semantics of XInclude rather than the correspondence
          semantics of <code>@corresp</code>) <emphasis role="ital">not</emphasis> about including
          elements and thus merging infosets – given the need for different content models for
          different annotation layers, stand-off annotation is about including <emphasis role="ital">text</emphasis>. A small step towards this goal is offered by the <link xlink:href="http://simonstl.com/ietf/draft-stlaurent-xpath-frag-00.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest"><code>xpath1()</code></link> XPointer extension, supported by some parsers. Being an
          XPath implementation, however, it cannot support addressing into the text content, and thus
          misses a crucial part of the entire enterprise (one cannot XInclude substrings for the
          purpose of defining a segmentation level, with segments often defined over substrings of
          orthographic words, cf. <xref linkend="bansp-adamp09"/>).</para></footnote>
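      To make both the offset difference and the missing functionality concrete, consider the following sketch. The file name, the identifiers and the pointer syntax are hypothetical and merely modelled on the schemes discussed here, and no claim is made that any existing generic parser resolves such an inclusion. What matters is the arithmetic: the substring <code>off</code> of <code>stand-off</code> begins at inter-character point 6 when counting from 0, but at character 7 when counting from 1, with a length of 3 in both cases.
      <programlisting xml:space="preserve">&lt;!-- text.xml: hypothetical source document --&gt;
&lt;p xml:id="p-1"&gt;stand-off&lt;/p&gt;

&lt;!-- characters, counted from 1 (W3C):           s=1 t=2 a=3 n=4 d=5 -=6 o=7 f=8 f=9 --&gt;
&lt;!-- inter-character points, from 0 (TEI, LAF):  0 s 1 t 2 a 3 n 4 d 5 - 6 o 7 f 8 f 9 --&gt;

&lt;!-- segmentation layer: hypothetical inclusion of the substring "off" (offset 6, length 3) --&gt;
&lt;seg xml:id="seg-2" xmlns:xi="http://www.w3.org/2001/XInclude"&gt;
  &lt;xi:include href="text.xml" xpointer="string-range(p-1,6,3)"/&gt;
&lt;/seg&gt;</programlisting>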
    </para><para>Our claim is that it should be possible for a corpus-oriented OWL with basic TEI awareness to read
      portions of <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CC.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">chapter 15</link> on
      language corpora, <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">chapter
        16</link> on stand-off linking and <link xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">chapter 17</link> on analytical
      mechanisms in order to be able to construct a simple working stand-off-annotated corpus prototype that they will
      be able to visualise and perhaps even to query. Ideally, even that step should be simplified and expressed as a
      single chapter-like set of recommendations (possibly in the form of a TEI ODD file) targeted specifically at corpus linguists.<footnote><para>It is worth mentioning two TEI-stand-off-oriented tools refined in the National Corpus of Polish and
          available under the GPL: <link xlink:href="http://sourceforge.net/projects/poliqarp/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">poliqarp</link>, for
          querying and concordancing multi-level TEI corpora, and <link xlink:href="http://nlp.ipipan.waw.pl/Anotatornia/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">anotatornia</link>, for manual annotation of individual
          layers.</para></footnote>
    </para></section><section xml:id="sect_conclusion" xreflabel="Section 7"><title>7. Conclusion</title><para>The motivation for the present analysis was to flesh out certain inadequacies of the TEI approach to
      stand-off annotation, in order to see how many of the problems have causes internal to the TEI Guidelines, and
      how many can be attributed to external factors, such as the lack of sufficiently developed generic XML tools or
      the inadequacies of standards assumed by the TEI. The ultimate question, therefore, is: should the TEI be used
      for modern stand-off text encoding at all, or should developers turn to other formats, such as the excellent
      PAULA toolkit (cf. <xref linkend="dipper05"/>), the slightly aged XCES (<xref linkend="ideetal00"/>) or the more
      generic but still incomplete LAF family of standards (<xref linkend="ide-romary07"/>)?</para><para>If the TEI wants to become a viable alternative to other formats, it should ensure that an OWL can easily
      implement and use a prototype stand-off corpus. This is conditioned by two factors: one internal (making content
      models of stand-off documents maximally friendly and packaging them for out-of-the-box deployment) and one
      external (the lack of generic parsing tools that would implement XInclude with third-party XPointer schemes,
      ideally as modules). Both issues are solvable. Both would normally be solved by community pressure, but the
      community (a sub-community of NLP-oriented TEI users) has yet to be formed. In its absence, it is the rest of the TEI community that
      should act, for the sake of making that community richer and more dynamic, and of supplying all of
      us with new ideas and research topics, and with new tools (including better support for stand-off document creation,
      visualisation and querying) to go with them, because tools follow users. We have identified three classes of
      problems, all tied together. In order to move on, the tie should be cut, preferably at a few places at the same
      time.</para><para><xref linkend="witternetal09"/> mention that a <link xlink:href="http://www.tei-c.org/Activities/Workgroups/SO/sow05.xml" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">separate chapter</link> concerning corpus
      annotation was considered for inclusion in the Guidelines but was never finished or included. We
      believe we have shown here that such a chapter, after considerable revisions, might be one of the ways in which the
      TEI reaches out to OWLs interested in corpora, whether of the purely textual or the multimodal kind. The nascent
      Special Interest Group for linguists (<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://wiki.tei-c.org/index.php/TEI_for_linguists</link>) is another
      step towards that goal.</para></section><section xml:id="sect_ack" xreflabel="Acknowledgements"><title>Acknowledgements</title><para>I wish to thank the six anonymous Balisage reviewers for their encouraging and helpful comments. It was not
      possible to implement all of the suggestions in a single article, but I did my best. The responsibility for any
      remaining errors is naturally my own.</para><para>I would like to express my gratitude to B. Tommie Usdin for her enormous patience and support, without which
      I would not have been able to complete this article in time for publication.</para></section><bibliography><title>Bibliography</title><bibliomixed xreflabel="Anderson, 1992" xml:id="anderson92">Anderson, S. (1992). <emphasis role="ital">A-Morphous
    Morphology</emphasis>. Cambridge Studies in Linguistics (No. 62). CUP.</bibliomixed><bibliomixed xreflabel="Bański, 2010" xml:id="bansp10">Bański, P. (2010). XIncluding plain-text fragments for
      symmetry and profit. Poster presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3–6,
      2010. Available from <link xlink:href="http://bansp.users.sourceforge.net/pdf/Banski-Balisage2010-poster.pdf" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://bansp.users.sourceforge.net/pdf/Banski-Balisage2010-poster.pdf</link></bibliomixed><bibliomixed xreflabel="Bański &amp; Gozdawa-Gołębiowski, 2010" xml:id="banski-gozdawa10">Bański, P.,
      Gozdawa-Gołębiowski, R. (2010). Foreign Language Examination Corpus for L2-Learning Studies. In Rapp, R.,
      Zweigenbaum, P., Sharoff, S. (Eds.) Proceedings of the 3rd Workshop on Building and Using Comparable Corpora
      (BUCC), <quote>Applications of Parallel and Comparable Corpora in Natural Language Engineering and the
        Humanities</quote>, 22 May 2010, Valletta, Malta, pp. 56–64. Available from <link xlink:href="http://www.lrec-conf.org/proceedings/lrec2010/workshops/W12.pdf" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.lrec-conf.org/proceedings/lrec2010/workshops/W12.pdf</link>.</bibliomixed><bibliomixed xreflabel="Bański &amp; Przepiórkowski, 2009" xml:id="bansp-adamp09">Bański, P., Przepiórkowski, A.
      (2009). Stand-off TEI annotation: the case of the National Corpus of Polish. In Ide, N., Meyers, A. (Eds.)
        <emphasis role="ital">Proceedings of the Third Linguistic Annotation Workshop (LAW III)</emphasis> at
      ACL-IJCNLP 2009, Singapore, pp. 64-67.</bibliomixed><bibliomixed xreflabel="Bański &amp; Przepiórkowski, 2010" xml:id="bansp-adamp10">Bański, P., Przepiórkowski, A.
      (2010). The TEI and the NCP: the model and its application. In Arranz, V., van Eerten, L. (Eds.) Proceedings of
      the LREC workshop on <quote>Language Resources: From Storyboard to Sustainability and LR Lifecycle
        Management</quote> (LRSLM2010), 23 May 2010, Valletta, Malta, pp. 34–39. Available from <link xlink:href="http://www.lrec-conf.org/proceedings/lrec2010/workshops/W20.pdf" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.lrec-conf.org/proceedings/lrec2010/workshops/W20.pdf</link>.</bibliomixed><bibliomixed xreflabel="Bański &amp; Wójtowicz, 2010" xml:id="bansp-beataw10">Bański, P., Wójtowicz, B. (2010). The
      Open-Content Text Corpus project. In Arranz, V., van Eerten, L. (Eds.) Proceedings of the LREC workshop on
        <quote>Language Resources: From Storyboard to Sustainability and LR Lifecycle Management</quote> (LRSLM2010),
      23 May 2010, Valletta, Malta, pp. 19–25. Available from <link xlink:href="http://www.lrec-conf.org/proceedings/lrec2010/workshops/W20.pdf" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.lrec-conf.org/proceedings/lrec2010/workshops/W20.pdf</link>.</bibliomixed><bibliomixed xreflabel="Bird and Simons, 2003" xml:id="bird-simons03">Bird, S., Simons, G. (2003). Seven dimensions
      of portability for language documentation and description. <emphasis role="ital">Language</emphasis> 79(3), pp.
      557–582; <biblioid class="doi">10.1353/lan.2003.0149</biblioid></bibliomixed><bibliomixed xreflabel="Boot, 2009" xml:id="boot09">Boot, P. (2009). Towards a TEI-based encoding scheme for the
      annotation of parallel texts. <emphasis role="ital">Literary and Linguistic Computing</emphasis> 24(3), pp.
      347–361; <biblioid class="doi">10.1093/llc/fqp023</biblioid></bibliomixed><bibliomixed xreflabel="Burnard &amp; Rahtz, 2004" xml:id="burnard-rahtz04">Burnard, L., Rahtz, S. (2004). RelaxNG
      with Son of ODD. Presented at Extreme Markup Languages 2004, Montréal, Québec. Available from
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://conferences.idealliance.org/extreme/html/2004/Burnard01/EML2004Burnard01.html</link></bibliomixed><bibliomixed xreflabel="Cayless &amp; Soroka, 2010" xml:id="cayless-soroka10">Cayless, H., Soroka, A. (2010). On
      Implementing string-range() for TEI. Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August
      3–6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol.
      5 (2010); <biblioid class="doi">10.4242/BalisageVol5.Cayless01</biblioid></bibliomixed><bibliomixed xreflabel="Chiarcos et al., 2009" xml:id="chiarcosetal09">Chiarcos, Ch., Ritz, J., Stede, M. (2009).
        <emphasis role="ital">By all these lovely tokens...</emphasis> Merging conflicting tokenizations. In Ide, N.,
      Meyers, A. (Eds.) <emphasis role="ital">Proceedings of the Third Linguistic Annotation Workshop (LAW
        III)</emphasis> at ACL-IJCNLP 2009, Singapore, pp. 35-43.</bibliomixed><bibliomixed xreflabel="Cummings, 2008" xml:id="cummings08">Cummings, J. (2008). The Text Encoding Initiative and
      the Study of Literature. In Schreibman, S., Siemens, R. <emphasis role="ital">A Companion to Digital Literary
        Studies</emphasis>. Oxford: Blackwell. <link xlink:href="http://www.digitalhumanities.org/companionDLS/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.digitalhumanities.org/companionDLS/</link></bibliomixed><bibliomixed xreflabel="Cummings, 2009" xml:id="cummings09">Cummings, J. (2009). Converting Saint Paul: A new TEI
      P5 edition of <emphasis role="ital">The Conversion of Saint Paul</emphasis> using stand-off methodology.
        <emphasis role="ital">Literary and Linguistic Computing</emphasis> 24(3), pp. 307–317; <biblioid class="doi">10.1093/llc/fqp019</biblioid></bibliomixed><bibliomixed xreflabel="DeRose, 2004" xml:id="derose04">DeRose, S. (2004). Markup overlap: a review and a horse.
      Proceedings of Extreme Markup Languages 2004.</bibliomixed><bibliomixed xreflabel="DeRose et al., 1990" xml:id="deroseetal90">DeRose, S., Durand, D., Mylonas, E., Renear, A.
      (1990). What is text, really? <emphasis role="ital">Journal of Computing in Higher Education</emphasis>, Winter
      1990, Vol. I (2), pp. 3–26</bibliomixed><bibliomixed xreflabel="Dipper, 2005" xml:id="dipper05">Dipper, S. (2005). XML-based stand-off representation and
      exploitation of multi-level linguistic annotation. In <emphasis role="ital">Proceedings of Berliner XML Tage 2005
        (BXML 2005)</emphasis>. Berlin, pp. 39–50.</bibliomixed><bibliomixed xreflabel="Goecke et al., 2010" xml:id="goeckeetal10">Goecke, D., Metzing, D., Lüngen, H.,
      Stührenberg, M., Witt, A. (2010). Different views on markup: distinguishing levels and layers. In <emphasis role="ital">Linguistic modeling of information and markup languages. Contributions to language
        technology</emphasis>. Springer Netherlands, pp. 1–21.</bibliomixed><bibliomixed xreflabel="Farrar &amp; Moran, 2008" xml:id="farrar-moran">Farrar, S., Moran, S. (2008). The
      e-Linguistics Toolkit. Presented at e-Humanities – an emerging discipline: Workshop in the 4th IEEE International
      Conference on e-Science.
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://faculty.washington.edu/farrar/documents/inproceedings/FarrarMoran2008.pdf</link></bibliomixed><bibliomixed xreflabel="Ide, 1998" xml:id="ide98">Ide, N. (1998). Corpus Encoding Standard: SGML Guidelines for
      Encoding Linguistic Corpora. Proceedings of the First International Language Resources and Evaluation Conference,
      Granada, Spain, pp. 463–470.</bibliomixed><bibliomixed xreflabel="Ide, 2000" xml:id="ide00">Ide, N. (2000). The XML Framework and Its Implications for the
      Development of Natural Language Processing Tools. Proceedings of the COLING Workshop on Using Toolsets and
      Architectures to Build NLP Systems, Luxembourg, 5 August 2000.
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.cs.vassar.edu/~ide/papers/coling00-ws-final.pdf</link></bibliomixed><bibliomixed xreflabel="Ide et al., 2000" xml:id="ideetal00">Ide, N., Bonhomme, P., Romary, L. (2000). XCES: An
      XML-based Standard for Linguistic Corpora. <emphasis role="ital">Proceedings of the Second Language Resources and
        Evaluation Conference (LREC)</emphasis>, Athens, Greece, pp. 825–830.</bibliomixed><bibliomixed xreflabel="Ide &amp; Romary, 2007" xml:id="ide-romary07">Ide, N., Romary, L. (2007). Towards
      International Standards for Language Resources. In Dybkjaer, L., Hemsen, H., Minker, W. (Eds.), <emphasis role="ital">Evaluation of Text and Speech Systems</emphasis>, Springer, pages 263–284.</bibliomixed><bibliomixed xreflabel="Ide &amp; Suderman, 2006" xml:id="ide-suderman06">Ide, N., Suderman, K. (2006). Integrating
      Linguistic Resources: The American National Corpus Model. In <emphasis role="ital">Proceedings of the Fifth
        Language Resources and Evaluation Conference (LREC)</emphasis>, Genoa, Italy.</bibliomixed><bibliomixed xreflabel="Ide &amp; Suderman, 2007" xml:id="ide-suderman07">Ide, N., Suderman, K. (2007). GrAF: A
      Graph-based Format for Linguistic Annotations. In the proceedings of the Linguistic Annotation Workshop, held in
      conjunction with ACL 2007, Prague, June 28-29, pp. 1–8.</bibliomixed><bibliomixed xreflabel="Ide &amp; Véronis, 1993" xml:id="ide-veronis93">Ide, N., Véronis, J. (1993). Background and
      context for the development of a Corpus Encoding Standard, EAGLES Working Paper,
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.cs.vassar.edu/CES/CES3.ps.gz</link></bibliomixed><bibliomixed xreflabel="Isard et al., 1998" xml:id="isardetal98">Isard, Amy, McKelvie, David, Thompson, Henry S.
      (1998). Towards a minimal standard for dialogue transcripts: a new SGML architecture for the HCRC map task
      corpus. In <emphasis role="ital">5<superscript>th</superscript> International Conference on Spoken Language
        Processing - 1998</emphasis>, paper 0322.</bibliomixed><bibliomixed xreflabel="Jannidis, 2009" xml:id="jannidis09">Jannidis, F. (2009). TEI in a crystal ball. <emphasis role="ital">Literary and Linguistic Computing</emphasis> 24(3), pp. 253–265; <biblioid class="doi">10.1093/llc/fqp015</biblioid></bibliomixed><bibliomixed xreflabel="Lawler &amp; Aristar Dry, 1998" xml:id="lawler-dry">Lawler, J. and H. Aristar Dry (Eds.)
      (1998). <emphasis role="ital">Using computers in linguistics: a practical guide</emphasis>. London:
      Routledge.</bibliomixed><bibliomixed xreflabel="Kupść, 1999" xml:id="kupsc99">Kupść, A. (1999). Haplology of the Polish Reflexive Marker.
      In Borsley, R.D., Przepiórkowski, A. (Eds.) <emphasis role="ital">Slavic in HPSG</emphasis>, pp. 91–124,
      Stanford, CA: CSLI Publications.</bibliomixed><bibliomixed xreflabel="McKelvie et al., 1998" xml:id="mckelvieetal98">McKelvie, D., Brew, Ch., Thompson, H.
      (1998). Using SGML as a basis for Data-Intensive Natural Language Processing. Often listed as appearing in
        <emphasis role="ital">Computers and the Humanities</emphasis>, 31 (5): pp. 367–388, but <link xlink:href="http://www.springerlink.com/content/x428658732t6/?p=6fe37e29b2aa414584bf9ce52c73fe02&amp;pi=23" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">not
        present in that volume, according to its publisher</link>. Available as manuscript from <link xlink:href="http://xml.coverpages.org/mckelvieNLP98-ps.gz" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://xml.coverpages.org/mckelvieNLP98-ps.gz</link></bibliomixed><bibliomixed xreflabel="Przepiórkowski &amp; Bański, 2010" xml:id="adamp-bansp10">Przepiórkowski, A., Bański, P.
      (2010). TEI P5 as a text encoding standard for multilevel corpus annotation. In Fang, A.C., Ide, N. and J.
      Webster (eds). <emphasis role="ital">Language Resources and Global Interoperability. The Second International
        Conference on Global Interoperability for Language Resources (ICGL2010)</emphasis>. Hong Kong: City University
      of Hong Kong, pp. 133–142.</bibliomixed><bibliomixed xreflabel="Rehm et al., 2010" xml:id="rehmetal10">Rehm, G., Schonefeld, O., Trippel, T., Witt, A.
      (2010). Sustainability of linguistic resources revisited. Presented at the International Symposium on XML for the
      Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the
      International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on
      Markup Technologies, vol. 6 (2010); <biblioid class="doi">10.4242/BalisageVol6.Witt01</biblioid></bibliomixed><bibliomixed xreflabel="Renear et al., 1993" xml:id="renearetal93">Renear, A., Mylonas, E., Durand, D. (1993).
      Refining our notion of what text really is: the problem of overlapping hierarchies. Final version, January 6,
        1993. <link xlink:href="http://www.stg.brown.edu/resources/stg/monographs/ohco.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.stg.brown.edu/resources/stg/monographs/ohco.html</link></bibliomixed><bibliomixed xreflabel="Renear, 2004" xml:id="renear04">Renear, A. (2004). Text Encoding. In Schreibman, S.,
      Siemens, R., Unsworth, J. <emphasis role="ital">A Companion to Digital Humanities</emphasis>. Oxford: Blackwell.
        <link xlink:href="http://www.digitalhumanities.org/companion/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.digitalhumanities.org/companion/</link></bibliomixed><bibliomixed xreflabel="Simons &amp; Bird, 2008" xml:id="simons-bird08">Simons, G.F., Bird, S. (2008). Toward a
      global infrastructure for the sustainability of language resources. In <emphasis role="ital">Proceedings of the
        22nd Pacific Asia Conference on Language, Information and Computation: PACLIC 22</emphasis>. pp.
      87–100.</bibliomixed><bibliomixed xreflabel="TEI Consortium, 2010" xml:id="teip5">TEI Consortium (Eds.) (2010). TEI P5: Guidelines for
      Electronic Text Encoding and Interchange. Version 1.6.0. Last updated on February 12th 2010. TEI Consortium.
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/Guidelines/P5/</link></bibliomixed><bibliomixed xreflabel="Thompson &amp; McKelvie, 1997" xml:id="thompson-mckelvie97">Thompson, H. S., McKelvie, D.
      (1997). Hyperlink semantics for standoff markup of read-only documents, <emphasis role="ital">Proceedings of SGML
        Europe</emphasis>. Available from <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.ltg.ed.ac.uk/~ht/sgmleu97.html</link>.</bibliomixed><bibliomixed xreflabel="Witt et al., 2009a" xml:id="wittetal09a">Witt, A., Heid, U., Sasaki, F., Sérasset, G.
      (2009). Multilingual language resources and interoperability. In <emphasis role="ital">Language Resources and
        Evaluation</emphasis>, vol. 43:1, pp. 1–14. <biblioid class="doi">10.1007/s10579-009-9088-x</biblioid>
    </bibliomixed><bibliomixed xreflabel="Witt et al., 2009b" xml:id="wittetal09b">Witt, A., Rehm, G., Hinrichs, E., Lehmberg, T.,
      Stegmann, J. (2009). SusTEInability of linguistic resources through feature structures. <emphasis role="ital">Literary and Linguistic Computing</emphasis> 24(3), pp. 363–372; <biblioid class="doi">10.1093/llc/fqp024</biblioid>
    </bibliomixed><bibliomixed xreflabel="Wittern et al., 2009" xml:id="witternetal09">Wittern, Ch., Ciula, A., Tuohy, C. (2009). The
      making of TEI P5. <emphasis role="ital">Literary and Linguistic Computing</emphasis> 24(3), pp. 281–296;
        <biblioid class="doi">10.1093/llc/fqp017</biblioid></bibliomixed><bibliomixed xreflabel="Wörner et al., 2006" xml:id="woerneretal06">Wörner, K., Witt, A., Rehm, G., Dipper, S.
      (2006). Modelling Linguistic Data Structures. Presented at Extreme Markup Languages 2006, Montréal, Québec.
      Available from
      <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://conferences.idealliance.org/extreme/html/2006/Witt01/EML2006Witt01.html</link></bibliomixed></bibliography></article>
