<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2"><title>SGF - An integrated model for multiple annotations and its application in a linguistic
    domain</title><info><confgroup><conftitle>Balisage: The Markup Conference 2008</conftitle><confdates>August 12 - 15, 2008</confdates></confgroup><abstract><para>Seamless integration of various, often heterogeneous linguistic resources (in terms of
        their output formats) and merging of the respective annotation layers are crucial tasks for
        linguistic research. After a decade of concentration on the development of formats in order
        to structure single annotations for specific linguistic issues, a variety of specifications
        to store multiple annotations over the same primary data has been developed in recent
        years. Among these approaches three main architectures can be identified: Prolog-based
        architectures, XML-related approaches and graph-based models that follow the XML syntax.
        However, these architectures are not free of disadvantages when used in real world
        applications. In the <emphasis role="ital">Sekimo</emphasis> project the XML-based <emphasis role="ital">Sekimo Generic Format</emphasis> (SGF) was developed for the purpose of
        storing multiple annotations on the same primary data and examining relationships between
        elements of different annotation layers without prior conversion. SGF is based on the
        design principles of graph-based approaches but makes use of the XML-inherent tree
        structures whenever possible to reduce processing costs. Analysing data stored in SGF can be
        done via standard XML-related specifications such as XPath, XSLT or XQuery and is done in
        our project in the linguistic application domain of anaphora resolution.</para></abstract><author><personname><firstname>Maik</firstname><surname>Stührenberg</surname></personname><personblurb><para>Maik Stührenberg studied Computational Linguistics at Bielefeld University. He worked
          for four years as a research assistant at Giessen University in different text-technological
          projects (both funded by the German government and the German Research Foundation). He now
          works as a research assistant at Bielefeld University together with Andreas Witt, Dieter
          Metzing and Daniela Goecke in the <emphasis role="ital">Sekimo</emphasis> project of the
          Research Group <emphasis role="ital">Text-technological modelling of
          information</emphasis> funded by the German Research Foundation. His main research
          interests include specifications for structuring multiple annotated data and query
          languages and query processing. </para></personblurb></author><author><personname><firstname>Daniela</firstname><surname>Goecke</surname></personname><personblurb><para>Daniela Goecke studied Computational Linguistics at Bielefeld University. She finished
          her master's thesis in cooperation with IBM Scientific Center Heidelberg and worked for four
          years at Philips Speech Processing Aachen. She now works as a research assistant at
          Bielefeld University together with Andreas Witt, Dieter Metzing and Maik Stührenberg in
          the <emphasis role="ital">Sekimo</emphasis> project of the Research Group 437 <emphasis role="ital">Text-technological modelling of information</emphasis> funded by the German
          Research Foundation. Her main research topics are the unification of text-technological
          resources and anaphora resolution. </para></personblurb></author><legalnotice><para>Copyright © 2008 by the authors.  Used with
permission.</para></legalnotice><keywordset role="author"><keyword>Concurrent Markup</keyword></keywordset></info><note><para> The work presented in this paper is part of the project A2 (<emphasis role="ital">Sekimo</emphasis>) of the Research Group 437 <emphasis role="ital">Text-technological
        modelling of information</emphasis> funded by the German Research Foundation.<footnote><para>More information about the project can be obtained at <link xlink:href="http://www.text-technology.de/Sekimo" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.text-technology.de/Sekimo</link>. </para></footnote></para></note><section xml:id="sec.introduction"><title>Introduction</title><para>A large number of machine-readable structured linguistic documents (often XML
      annotated) are available to the public, as well as several NLP tools which allow for the analysis
      of linguistic data. Besides corpora annotated for several linguistic phenomena, external
      knowledge bases like lexical nets (<emphasis role="ital">WordNet</emphasis>, cf. <xref linkend="Fellbaum1998"/>;
      <emphasis role="ital">GermaNet</emphasis>, cf. <xref linkend="Hamp1997"/>) are an important
      source for linguistic studies. However, these resources are often heterogeneous with respect to
      both the underlying schema of the output format and the functionality provided. Furthermore,
      their use for (semi-)automatic annotation raises the problem of how to represent
      multi-dimensional, possibly overlapping markup - which often occurs when different linguistic
      annotation levels are unified (e.g. syllables vs. morphemes). Different methods for the
      annotation of multiple information levels have been developed: separation of multiple
      annotation levels in separate files, fragmentation or milestones (cf. <xref linkend="Sperberg-McQueen2002a"/>). In the <emphasis role="ital">Sekimo</emphasis> project
      different approaches for the integration of heterogeneous linguistic resources were developed
      and applied in the domain of anaphora resolution. For the task of anaphora resolution
      different types of information are necessary: POS, syntactic knowledge, world knowledge (e.g.
      in terms of an ontology) and the like. Therefore various linguistic resources such as parsers,
      dictionaries, wordnets or ontologies have to be combined. However, in most cases the output
      format of a linguistic resource A is not suitable as input format for a linguistic resource B,
      which means that a cascaded application of several resources is not possible. After
      experiences with a Prolog fact base approach (cf. <xref linkend="sec.prolog"/>) we have
      developed an XML-based abstract representation format similar to the standoff annotation model
      described by <xref linkend="Thompson1997"/> which encodes the same textual data in separate
      files according to different document grammars addressing different relevant phenomena.</para><para>Information structuring can always be split up into a conceptual process and a technical
      realization. We follow the discussion in <xref linkend="Buchkapitel"/> and use the term
        <emphasis role="ital">level</emphasis> to refer to the information modelling concept (e.g.,
      morphological structure, phrase structure) and the term <emphasis role="ital">layer</emphasis>
      for the technical realization, i.e. the XML markup. Levels and layers can be in different
      relations (1:1 relation, 1:n, m:1 or n:m) which can lead to overlapping markup in the layer
      structure. The annotation format described in <xref linkend="Witt2005"/> solves this issue and
      ensures a 1:1 relation. For the sake of clarity we prefer the term <emphasis role="ital">multi-rooted trees</emphasis> over <emphasis role="ital">multiple
      annotations</emphasis> when talking about the architecture used in our project because the
      different levels of annotation are stored in a single representation.</para><para>The remainder of this paper is structured as follows: First, we give an overview of
      different approaches for integrating multiple annotated data, followed by a description of the
      Sekimo Generic Format (SGF) sketched out in <xref linkend="sec.sgf"/>. In <xref linkend="sec.application"/> we will demonstrate how the SGF is used in the application
      domain of anaphora resolution. Finally, the paper closes with <xref linkend="sec.conclusion"/> in which possible extensions and future work are
    discussed.</para></section><section xml:id="sec.approaches"><title>Different approaches to multiple annotated markup</title><para>There is a variety of approaches for dealing with multiple annotated data (or multiple
      hierarchies) already available. <xref linkend="DeRose2004"/> summarizes some solutions
      (including both XML-based and non-XML-based approaches) with their respective strengths and
      weaknesses. We propose to group a selection of the available solutions into three categories: <orderedlist><listitem><para>Prolog-based architectures.</para></listitem><listitem><para>XML-related architectures.</para></listitem><listitem><para>Graph-based architectures that follow the XML syntax.</para></listitem></orderedlist></para><para>The reason for this grouping is partially due to a chronological ordering (e.g. the roots
      of the Prolog-based architectures go back more than ten years) and partially because of the
      underlying technical foundation (e.g. the separation of XML-based and non-XML-based
      architectures). The last point is crucial with respect to the support in terms of tools (e.g.
      parsers, transformation processors, query tools) when it comes to the <emphasis role="ital">application</emphasis> of a specific architecture (cf. <xref linkend="sec.application"/>).</para><section xml:id="sec.prolog"><title>Prolog-based architectures</title><para>In <xref linkend="Sperberg-McQueen2000"/> and <xref linkend="Sperberg-McQueen2002"/> an
        abstract representation format to represent meaning and interpretation of markup based on a
        Prolog fact base was introduced. <xref linkend="Witt2002"/> extended this architecture for
        dealing with multiple annotated data. In this extension textual data and annotation are
        split up in order to avoid overlapping markup (cf. <xref linkend="Bayerl2003"/> for a
        further discussion). The elements, attributes and text nodes of the annotation layers are
        stored as Prolog predicates which contain the following information (for details refer to
          <xref linkend="Witt2005"/>):</para><itemizedlist><listitem><para>The type of node (element, attribute or text) as the name of the predicate.</para></listitem><listitem><para>The name of the annotation layer.</para></listitem><listitem><para>The absolute start and end positions of the annotated text sequence.</para></listitem><listitem><para>The position of the node in the document tree.</para></listitem><listitem><para>The name of the element or attribute.</para></listitem><listitem><para>The value of an attribute.</para></listitem></itemizedlist><para>Each character in the text base (the <emphasis role="ital">primary data</emphasis>) can
        be addressed by its offset (its position) as shown in <xref linkend="numbering"/>. A single
        character has a start and end position and a step size of 1.</para><figure xml:id="numbering"><title>Addressing character positions</title><programlisting xml:space="preserve">
  T  h  i  s     i  s     a     s  e  n  t  e  n  c  e  .
00|01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19</programlisting></figure><para>On the basis of the Prolog fact base format, possible relationships between element
        instances of different annotation levels can be examined via Prolog predicates (cf. <xref linkend="Durusau2002"/> and <xref linkend="Witt2005"/>). As a further option, a unified
        version can be created and exported back to XML where overlaps are handled by using
        milestones or fragments.</para><para>Although the conversion itself can be done very quickly (two implementations are
        available, one programmed in Python, another one in Perl), the fact remains that a
        conversion from XML to Prolog is necessary both for markup unification and for analysing
        relations between different annotation levels. The need for information about the position
        of each single character of the primary data - which is demanded for reconstructing the
        primary data - and the distributed storing of element and attribute information result in
        rather large Prolog fact bases: for the largest single text stored in our corpus a single
        annotation layer of 1.7 MB in size is converted to a 6.4 MB-size Prolog fact base, and the
        combined three annotation layers that are used in our project (logical document structure,
        POS, anaphoric relations) result in a 14.3 MB-size Prolog fact base.</para></section><section xml:id="sec.nonxml"><title>XML-related architectures</title><para>Several XML-related but non-XML-based approaches for storing multiple annotated data
        have been developed in recent years, including the Layered Markup and Annotation Language
        (LMNL, cf. <xref linkend="Tennison2002"/>, <xref linkend="Cowan2006"/>), TexMECS (cf. <xref linkend="Huitfeldt2001"/>), Generalized Ordered-Descendant Directed Acyclic Graphs
        (GODDAG, cf. <xref linkend="Sperberg-McQueen2004"/>), Multi-colored Trees (MCT, cf. <xref linkend="Jagadish2004"/>) and Delay Nodes (cf. <xref linkend="LeMaitre2006"/>). XCONCUR,
        formerly known as MuLaX (cf. <xref linkend="Hilbert2005"/> and <xref linkend="Hilbert2005a"/>) has been recently accompanied by XCONCUR-CL (cf. <xref linkend="Schonefeld2007"/>, <xref linkend="Witt2007"/>) as a constraint-based validation language. </para><para>Although some of these approaches (e.g. LMNL, TexMECS, XCONCUR) support inline
        annotation of multiple annotation layers, these documents can get very complex when dealing
        with a large number of annotation layers. A further drawback is that both the design and the implementation of
        most of these architectures rely on the work of only a few people. Therefore,
        specifications such as XCONCUR largely remain in the state of experimental markup languages
        lacking the support of the large number of tools that is available for XML-based
      solutions.</para></section><section xml:id="sec.graph"><title>Graph-based architectures</title><para>A variety of graph-based architectures that use the XML syntax has been developed in
        recent years. Starting with the Annotation Graph (AG) model presented by <xref linkend="Bird1999"/> and <xref linkend="Bird2001"/>, architectures such as the <emphasis role="ital">NITE Object Model</emphasis> (cf. <xref linkend="Carletta2003"/>) in
        conjunction with <emphasis role="ital">NITE-XML</emphasis>, <emphasis role="ital">ATLAS</emphasis> (cf. <xref linkend="Bird2000"/>; <xref linkend="Laprun2002"/>) and the
        ATLAS Interchange Format (AIF), the Linguistic Annotation Framework pivot format (cf. <xref linkend="Ide2004"/>) and the similar <emphasis role="ital">Potsdam Austauschformat für
          Linguistische Annotationen</emphasis> (PAULA, cf. <xref linkend="Dipper2005"/>), the
        Graph-based Format for Linguistic Annotation (GraF, cf. <xref linkend="Ide2007"/>) or the
        Graph Exchange Language (GXL, cf. <xref linkend="Holt2006"/>, first used in the
        graph-based linguistic database HyGraphDB<footnote><para>The HyGraphDB (cf. <xref linkend="Gleim2007"/>) has been developed as part of the X1
            project of the collaborative research centre (CRC) 673 <emphasis role="ital">Alignment
              in Communication</emphasis> and of the <emphasis role="ital">Indogram</emphasis>
            project of the Research Group 437 <emphasis role="ital">Text-technological modelling of
              information</emphasis>.</para></footnote> to represent linguistic data structures) were published. </para><para>In principle, these graph-based formats allow the representation of nearly every possible
        linguistic annotation. However, as these formats tend to split even single annotation layers
        into separate files (such as a markable/token file which delimits text spans used in
        annotation, a structure file for storing relations between annotation elements and a feature
        file which stores the former annotation), they are often used only as interchange formats.
        In addition, the higher complexity of computing graph structures in contrast to tree
        structures in combination with the fact that at least most single annotation layers can be
        structured in trees, leads to a certain inefficiency (cf. <xref linkend="Dipper2007"/> who
        transform a standoff annotation into an inline representation for efficient querying).
        Because our main focus was the development of a tool allowing for the comparison of
        different annotations we decided to implement an additional standoff format: The <emphasis role="ital">Sekimo Generic Format</emphasis>, SGF.</para></section></section><section xml:id="sec.sgf"><title>The Sekimo Generic Format</title><para>After the experiences made with the Prolog fact base format the decision was made to
      develop a similar representation based on XML. The initial goal was to use a native XML
      database as storage backend. However, during the development of the Sekimo Generic Format
      (SGF) several implementations were tested, including the use on a per-file basis, different
      native XML databases (e.g. eXist<footnote><para>
          <link xlink:href="http://www.exist-db.org" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.exist-db.org</link>
        </para></footnote>, Berkeley DB XML<footnote><para>
          <link xlink:href="http://www.oracle.com/database/berkeley-db/xml/index.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.oracle.com/database/berkeley-db/xml/index.html</link>
        </para></footnote>, Qizx/db<footnote><para>
          <link xlink:href="http://www.xmlmind.com/qizx/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.xmlmind.com/qizx/</link>
        </para></footnote>, IBM DB2 Express-C 9.5<footnote><para>
          <link xlink:href="http://www-306.ibm.com/software/data/db2/9/edition-express-c.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www-306.ibm.com/software/data/db2/9/edition-express-c.html</link>
        </para></footnote>), and a relational database (MySQL<footnote><para>
          <link xlink:href="http://www.mysql.com/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.mysql.com/</link>
        </para></footnote>, cf. <xref linkend="sec.serengeti"/>). In the following sections we will present
      SGF in detail. The annotation layers shown in <xref linkend="lst.phrase"/> and <xref linkend="lst.syll"/> will serve for demonstration purposes. In <xref linkend="sec.application"/> we will show a real world example from the domain of anaphora
      resolution.</para><figure xml:id="lst.phrase"><title>Phrase structure annotation</title><!--<programlisting xml:space="preserve" linenumbering="numbered">--><programlisting xml:space="preserve">&lt;s xmlns="http://www.text-technology.de/sekimo/phrase"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.text-technology.de/sekimo/phrase phrase.xsd"&gt;
  &lt;np&gt;
    &lt;pron&gt;This&lt;/pron&gt;
  &lt;/np&gt;
  &lt;vp&gt;
    &lt;v&gt;is&lt;/v&gt;
    &lt;np&gt;
      &lt;det&gt;a&lt;/det&gt;
      &lt;n&gt;sentence&lt;/n&gt;
    &lt;/np&gt;
  &lt;/vp&gt;.
&lt;/s&gt;</programlisting></figure><figure xml:id="lst.syll"><title>Syllable annotation</title><!--<programlisting xml:space="preserve" linenumbering="numbered">--><programlisting xml:space="preserve">&lt;syll xmlns="http://www.text-technology.de/sekimo/syll"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.text-technology.de/sekimo/syll syll.xsd"&gt;
  &lt;s&gt;This&lt;/s&gt;
  &lt;s&gt;is&lt;/s&gt;
  &lt;s&gt;a&lt;/s&gt;
  &lt;s&gt;sen&lt;/s&gt;
  &lt;s&gt;tence&lt;/s&gt;.
&lt;/syll&gt;</programlisting></figure><section xml:id="sec.concept"><title>The concept of SGF</title><para>SGF was developed for storing multiple annotated linguistic corpus data and examining
        relationships between elements derived from different annotation layers. The format consists
        of a base layer, providing the structure of an SGF instance and global attributes that are
        imported by the different annotation layers (cf. <xref linkend="sec.base"/>). The use of
        metadata in SGF is described in <xref linkend="sec.meta"/> while <xref linkend="sec.adding_layers"/>, <xref linkend="sec.disjoint"/> and <xref linkend="sec.validation"/> deal with different aspects of the format. Finally, we will
        discuss processing and querying of SGF annotated data in <xref linkend="sec.query"/> and
        conclude with possible caveats of the format in <xref linkend="sec.caveats"/>.</para><figure xml:id="fig.root"><title>Diagram of the <code>corpus</code> root element</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Stuehrenberg01/Stuehrenberg01-001.png" width="100%"/></imageobject></mediaobject></figure><para>SGF can be used in two different ways as shown in <xref linkend="fig.root"/>: <orderedlist><listitem><para>As a container format that contains optional meta data (cf. <xref linkend="sec.meta"/>) and the corpus data, i.e. the whole corpus is saved as a
              single SGF instance. This is the appropriate way when using SGF for storing small and
              medium sized corpora in conjunction with a native XML database (cf. <xref linkend="lst.sgf"/>).</para></listitem><listitem><para>On a per-file basis or when dealing with larger corpora a meta SGF file is used
              containing (again optional) metadata for and references to the actual corpus files
              (cf. <xref linkend="lst.sgf_meta"/>).</para></listitem></orderedlist>
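        For the second mode, the references in the meta file have to be resolved before the
        individual corpus entries can be processed. The following Python sketch is not part of the
        Sekimo tools; it merely illustrates, under the assumption that a meta file follows the
        structure shown in <xref linkend="lst.sgf_meta"/>, how the <code>xml:id</code> and
        <code>uri</code> values of all <code>corpusDataRef</code> elements could be collected. The
        function name <code>referenced_corpus_files</code> is hypothetical.
        <programlisting xml:space="preserve">import xml.etree.ElementTree as ET

# Namespace of the SGF base layer, as used in the listings of this paper.
SGF_NS = "http://www.text-technology.de/sekimo"
# The xml: prefix is bound to this namespace by the XML specification.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def referenced_corpus_files(meta_xml):
    """Return (xml:id, uri) pairs for each corpusDataRef in an SGF meta file.

    meta_xml is the serialized meta file as a string. Reading the file from
    disk and loading the referenced corpus entries are left to the caller.
    """
    root = ET.fromstring(meta_xml)
    refs = []
    # iter() walks the tree in document order, so the order of the
    # references in the meta file is preserved.
    for ref in root.iter("{%s}corpusDataRef" % SGF_NS):
        refs.append((ref.get(XML_ID), ref.get("uri")))
    return refs
</programlisting>
        The sketch ignores the <code>mime-type</code> and <code>encoding</code> attributes; a real
        application would presumably use them to decide how each referenced file is to be read.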
      </para><figure xml:id="lst.sgf"><title>Storing a whole corpus in a single SGF instance</title><!--<programlisting linenumbering="numbered" startinglinenumber="1" xml:space="preserve">--><programlisting xml:space="preserve">&lt;corpus xmlns="http://www.text-technology.de/sekimo"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:base="http://www.text-technology.de/sekimo"
  xsi:schemaLocation="http://www.text-technology.de/sekimo root.xsd"&gt;
  &lt;corpusData xml:id="c1" type="text" sgfVersion="1.0"&gt;
    &lt;!-- [...] --&gt;
  &lt;/corpusData&gt;
  &lt;corpusData xml:id="c2" type="text" sgfVersion="1.0"&gt;
    &lt;!-- [...] --&gt;
  &lt;/corpusData&gt;
&lt;/corpus&gt;</programlisting></figure><figure xml:id="lst.sgf_meta"><title>Splitting up a whole corpus into multiple SGF instances (SGF meta file use)</title><!--<programlisting linenumbering="numbered" startinglinenumber="1" xml:space="preserve">--><programlisting xml:space="preserve">&lt;base:corpus xmlns="http://www.text-technology.de/sekimo"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:base="http://www.text-technology.de/sekimo"
  xsi:schemaLocation="http://www.text-technology.de/sekimo ../xsd/root.xsd"&gt;
  &lt;base:corpusDataRef xml:id="c1" uri="c1.xml" mime-type="text/xml"
   encoding="UTF-8"/&gt;
  &lt;base:corpusDataRef xml:id="c2" uri="c2.xml" mime-type="text/xml"
   encoding="UTF-8"/&gt;
  &lt;base:corpusDataRef xml:id="c3" uri="c3.xml" mime-type="text/xml"
   encoding="UTF-8"/&gt;
&lt;/base:corpus&gt;</programlisting></figure><para>In both cases the root element is the <code>corpus</code> element; underneath this a
          <code>corpusDataRef</code> element or a <code>corpusData</code> element can be inserted.
        The empty <code>corpusDataRef</code> element allows for referring to an external file
        containing a corpus entry via its <code>uri</code> attribute and for specifying the external
        data in terms of encoding and mime-types (respective attributes of the same name). In this
        case the root element of the corpus entry instances that are referenced by the SGF meta file
        should be the <code>corpusData</code> element (cf. <xref linkend="sec.base"/>).</para></section><section xml:id="sec.base"><title>The base layer</title><para>The <code>corpusData</code> element is used for storing a single corpus entry containing
        optional metadata (cf. <xref linkend="sec.meta"/>), the primary data, the segmentation of
        the primary data, and zero or more respective annotation layer(s) (cf. <xref linkend="sec.adding_layers"/>). An example base layer is shown in <xref linkend="lst.base"/>. The <code>xml:id</code> attribute is obligatory while the <code>sgfVersion</code>
        attribute is optional (with a default value of <emphasis role="ital">1.0</emphasis>).</para><figure xml:id="fig.corpusdata"><title>Diagram of the <code>corpusData</code> element</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Stuehrenberg01/Stuehrenberg01-002.png" width="100%"/></imageobject></mediaobject></figure><figure xml:id="lst.base"><title>The SGF base layer</title><!--<programlisting linenumbering="numbered" startinglinenumber="1" xml:space="preserve">--><programlisting xml:space="preserve">
&lt;corpusData xmlns="http://www.text-technology.de/sekimo"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:base="http://www.text-technology.de/sekimo"
  xsi:schemaLocation="http://www.text-technology.de/sekimo root.xsd"
  xml:id="c1" type="text" sgfVersion="1.0"&gt;
  &lt;primaryData start="0" end="19" xml:lang="en"&gt;
    &lt;textualContent&gt;This is a sentence.&lt;/textualContent&gt;
    &lt;checksum algorithm="md5"&gt;d15ba5f31fa7c797c093931328581664&lt;/checksum&gt;
  &lt;/primaryData&gt;
&lt;/corpusData&gt;</programlisting></figure><para>The <code>corpusData</code> element holds the <code>type</code> attribute which can be
        either set to the value <emphasis role="ital">text</emphasis> or <emphasis role="ital">multimodal</emphasis> while the <code>primaryData</code> child element contains either
        the textual primary data (i.e. the text that is used as basis for annotation) as text node
        of the <code>textualContent</code> element or a reference to a file containing the primary
        data (in case of larger texts or non-textual primary data) via a <code>location</code> child
        element (not shown in the example listing). In the latter case an optional checksum of the
        input file can be provided in the corresponding element to preserve integrity of primary
        data when dealing with multiple annotation resources. Note that we do not handle any byte
        offset problems arising from different encodings (e.g. Latin 1 vs. UTF-16); therefore, the use
        of the <code>encoding</code> attribute is highly recommended.<footnote><para>Relying on character offsets can be a source of trouble. For that reason one has to
            ensure that whitespace differences between the textual primary data and annotation
            layers are normalized. Different whitespace normalizer tools were developed as part of
            our project.</para></footnote></para><para>When using SGF for storing multimodal annotations, multiple <code>primaryData</code>
        elements are allowed. In this case, the attribute <code>role</code> has to be provided which
        marks exactly one primary data file as "master" while the other primary
        data files are marked as "slaves". The master primary data file sets the
        timeline; the slave files can be aligned to the master file via an optional
        <code>offset</code> attribute.</para></section><section xml:id="sec.meta"><title>Metadata</title><para>Metadata can be used in several locations in an SGF instance: as child element of the
          <code>corpus</code> element (for information regarding the whole corpus), underneath a
          <code>corpusData</code> entry (denoting metadata related to a single corpus entry and its
        annotation layer(s)), or as child of an annotation level. In the underlying XML schema
        description of the base layer the <code>meta</code> element is declared as a wrapper element for
        elements derived from a different namespace while the <code>processContents</code> attribute
        is set to <emphasis role="ital">lax</emphasis>, i.e. if an optional XML schema description
        for the referenced namespace is available it should be used for validation. In our case we
        use OLAC metadata (cf. <xref linkend="Simons2001"/>) which has turned out to be an adequate
        solution for a variety of linguistic data. <xref linkend="lst.layers"/> shows an SGF
        instance containing OLAC metadata.</para></section><section xml:id="sec.adding_layers"><title>Adding layers</title><para>Several annotations of the primary data can be stored inside a <code>corpusData</code>
        element. Whenever an annotation layer is added, two steps have to be undertaken:<orderedlist><listitem><para>The segments which delimit the annotated parts of the primary data are
            defined.</para></listitem><listitem><para>A converted representation of the original annotation is stored.</para></listitem></orderedlist></para><para>The <code>segments</code> element consists of at least one <code>segment</code>. Each
        segment is defined by its start and end position in the character stream - similar to the
        Prolog fact base format discussed in <xref linkend="sec.prolog"/> (for an alternative
        definition of segments cf. <xref linkend="sec.disjoint"/>). We use simple numeric attributes
        (defined as <code>nonNegativeInteger</code> data type in the underlying XML Schema, cf. <xref linkend="sec.validation"/> and <xref linkend="XMLSchema2004b"/>) for defining the start
        and end position - in contrast to the PAULA format (<xref linkend="Dipper2005"/>), which
        uses XLink (<xref linkend="DeRose2001"/>) and the XPointer framework (<xref linkend="Grosso2003"/>) to identify text spans. Because single characters have a step size
        of 1 (cf. <xref linkend="numbering"/>), empty elements use the same value for start and end
        position. An optional segment <code>type</code> attribute can be used to provide more
        information about the segment (available values are <emphasis role="ital">empty</emphasis>,
          <emphasis role="ital">char</emphasis> for character data, <emphasis role="ital">ws</emphasis> for whitespace characters, <emphasis role="ital">pun</emphasis> for
        punctuation characters, <emphasis role="ital">dur</emphasis> for duration in case of
        multimodal primary data and <emphasis role="ital">seg</emphasis> for referring to already
        defined segments, cf. <xref linkend="sec.disjoint"/>).</para><para><xref linkend="lst.layers"/> shows the SGF representation of the two annotation layers
        given in <xref linkend="lst.phrase"/> and <xref linkend="lst.syll"/>. Note that a segment
        has to be defined only once, even if it is used in different annotation layers - in contrast
        to some other graph-based approaches (cf. <xref linkend="sec.graph"/>) which define the same
        character span separately for each annotation layer. This results in a smaller number of
        segments that have to be defined even for a large number of annotation layers.</para><para>The annotation of the primary data is stored in the corresponding element. Following the
        terminological distinction between levels and layers (cf. <xref linkend="sec.introduction"/>), each <code>level</code> element contains - in addition to optional metadata - exactly
        one <code>layer</code> element consisting of the markup representation of the corresponding
        annotation level. An <code>annotation</code> element may contain more than one
        <code>level</code> element; this mechanism can be used to subsume annotation levels (e.g.
        when the corresponding elements are declared in the same document grammar). The
        <code>layer</code> element is a wrapper element containing elements derived from a different
        namespace, similar to the meta element (cf. <xref linkend="sec.meta"/>). However, while the
        value of the <code>processContents</code> attribute of the latter is set to <emphasis role="ital">lax</emphasis>, the value of the respective attribute of the
        <code>layer</code> element is set to <emphasis role="ital">strict</emphasis>, resulting in
        the fact that an XML schema has to be provided for each annotation layer (cf. <xref linkend="sec.validation"/>).</para><figure xml:id="fig.level"><title>Diagram of the <code>level</code> element</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Stuehrenberg01/Stuehrenberg01-003.png" width="100%"/></imageobject></mediaobject></figure><figure xml:id="lst.layers"><title>SGF instance containing two annotation layers</title><!--<programlisting linenumbering="numbered" startinglinenumber="1" xml:space="preserve">--><programlisting xml:space="preserve">&lt;corpus xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.text-technology.de/sekimo root.xsd"
  xmlns="http://www.text-technology.de/sekimo"
  xmlns:base="http://www.text-technology.de/sekimo"&gt;
  &lt;corpusData xml:id="c1" type="text"&gt;
    &lt;primaryData start="0" end="19" xml:lang="en"&gt;
      &lt;textualContent&gt;This is a sentence.&lt;/textualContent&gt;
      &lt;checksum algorithm="md5"&gt;d15ba5f31fa7c797c093931328581664&lt;/checksum&gt;
    &lt;/primaryData&gt;
    &lt;segments&gt;
      &lt;segment xml:id="seg0" type="char" start="0" end="19" /&gt;
      &lt;segment xml:id="seg1" type="char" start="0" end="4" /&gt;
      &lt;segment xml:id="seg2" type="char" start="5" end="18" /&gt;
      &lt;!--[...]--&gt;
    &lt;/segments&gt;
    &lt;annotation&gt;
      &lt;level xml:id="al1" priority="1"&gt;
        &lt;meta&gt;
          &lt;olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
            xmlns="http://purl.org/dc/elements/1.1/"
            xmlns:dcterms="http://purl.org/dc/terms/"
            xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/
            meta/olac.xsd"&gt;
            &lt;format&gt;text/xml&lt;/format&gt;
            &lt;dcterms:isFormatOf&gt;sentence.txt&lt;/dcterms:isFormatOf&gt;
            &lt;description&gt;Phrase structure annotation.&lt;/description&gt;
          &lt;/olac:olac&gt;
        &lt;/meta&gt;
        &lt;layer xmlns:phrase="http://www.text-technology.de/sekimo/phrase"
          xsi:schemaLocation="http://www.text-technology.de/sekimo/phrase
          phrase.xsd"&gt;
          &lt;phrase:s base:segment="seg0" xml:lang="en"&gt;
            &lt;phrase:np base:segment="seg1"&gt;
              &lt;phrase:pron base:segment="seg1" /&gt;
            &lt;/phrase:np&gt;
            &lt;phrase:vp base:segment="seg2"&gt;
              &lt;phrase:v base:segment="seg3" /&gt;
              &lt;phrase:np base:segment="seg4"&gt;
                &lt;phrase:det base:segment="seg5" /&gt;
                &lt;phrase:n base:segment="seg6" /&gt;
              &lt;/phrase:np&gt;
            &lt;/phrase:vp&gt;
          &lt;/phrase:s&gt;
        &lt;/layer&gt;
      &lt;/level&gt;
    &lt;/annotation&gt;
    &lt;annotation&gt;
      &lt;level xml:id="al2" priority="1"&gt;
        &lt;meta&gt;
          &lt;olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
            xmlns="http://purl.org/dc/elements/1.1/"
            xmlns:dcterms="http://purl.org/dc/terms/"
            xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/
            meta/olac.xsd"&gt;
            &lt;description&gt;Syllable annotation.&lt;/description&gt;
          &lt;/olac:olac&gt;
        &lt;/meta&gt;
        &lt;layer xmlns:syll="http://www.text-technology.de/sekimo/syll"
          xsi:schemaLocation="http://www.text-technology.de/sekimo/syll
          syll.xsd"&gt;
          &lt;syll:syll base:segment="seg0"&gt;
            &lt;syll:s base:segment="seg1" /&gt;
            &lt;syll:s base:segment="seg3" /&gt;
            &lt;syll:s base:segment="seg5" /&gt;
            &lt;syll:s base:segment="seg7" /&gt;
            &lt;syll:s base:segment="seg8" /&gt;
          &lt;/syll:syll&gt;
        &lt;/layer&gt;
      &lt;/level&gt;
    &lt;/annotation&gt;
  &lt;/corpusData&gt;
&lt;/corpus&gt;</programlisting></figure><para>As one can observe in <xref linkend="fig.sgf.id"/>, SGF makes heavy use of XML's
        inherent ID/IDREF(S) mechanism to connect segments of the primary data with single or
        multiple annotation layers (displayed as solid red lines).</para><figure xml:id="fig.sgf.id"><title>Use of XML's ID/IDREF(S) mechanism in SGF</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Stuehrenberg01/Stuehrenberg01-004.png" width="100%"/></imageobject></mediaobject></figure><para>When comparing the two annotation layers with the namespace prefixes <code>phrase</code>
        and <code>syll</code> with their respective original representations given in <xref linkend="lst.phrase"/> and <xref linkend="lst.syll"/>, a second design goal of SGF becomes
        visible: to preserve as much of the original annotation format as possible. Still, a
        conversion has to be made consisting of the following steps:<itemizedlist><listitem><para>Elements with a mixed content model are converted into container elements.</para></listitem><listitem><para>Elements containing text nodes are converted into empty elements.</para></listitem><listitem><para>The <code>base:segment</code> attribute is added to former non-empty elements as
              an obligatory attribute (and as an optional attribute for empty elements).</para></listitem></itemizedlist> The same conversion rules are applied to the underlying XSD (cf. <xref linkend="sec.validation"/>). As shown in <xref linkend="lst.layers"/> the hierarchy of
        elements and all attributes remain intact, i.e. there is no need for additional files such
        as structure files which are needed for the graph-based annotation formats discussed in
          <xref linkend="sec.graph"/>. However, this statement is only true as long as the
        XML-inherent tree structures are adequate.<footnote><para>Of course it is possible to use graph-based annotation layers as well; however, the
            advantages of SGF over the formats discussed in <xref linkend="sec.graph"/> would be
            minimized in such cases (cf. <xref linkend="sec.query"/>).</para></footnote> An XSLT implementation is available for converting arbitrary inline annotation
        layers into their respective SGF representation, while a second XSLT script merges different
        annotation layers that refer to the same primary data into a single SGF instance. Therefore,
        it is possible to add further <code>annotation</code> elements to an already existing SGF
        instance at any time (as long as the primary data is not changed). Work has begun on a
        second implementation (written in Java).</para></section><section xml:id="sec.disjoint"><title>Disjoint and continuous segments</title><para>Segments often consist of other segments. New segments can therefore be created not
        only by defining their start and end positions but also by referring to already defined segments
        using the <code>segments</code> attribute (cf. <xref linkend="lst.segments"/>). In
        order to distinguish whether such a newly established segment includes all segments from
        the first referred segment up to the last referred one or defines a disjoint span, the
        attribute <code>mode</code> has to be set to the value <emphasis role="ital">continuous</emphasis> or <emphasis role="ital">disjoint</emphasis>, respectively. The
        example in <xref linkend="lst.segments"/> shows a disjoint span.</para><figure xml:id="lst.segments"><title>Definition of a disjoint segment by referring to already established
          ones</title><!--<programlisting xml:space="preserve" linenumbering="unnumbered">--><programlisting xml:space="preserve">&lt;segment xml:id="seg6" type="seg" segments="seg1 seg3" mode="disjoint"/&gt;</programlisting></figure><para>Note that this feature of SGF could be used for conversion between SGF instances and
        architectures mentioned in <xref linkend="sec.nonxml"/>; however, up to now it has been of
        theoretical use only.</para></section><section xml:id="sec.validation"><title>Validation</title><para>An important aspect when dealing with multiply annotated data is the question of
        how to validate this data. In case of overlaps it is strictly impossible to provide a document
        grammar that can validate the unification of different annotation layers -
        quite apart from the amount of work that producing such a document grammar would require.
        Therefore, we propose that each annotation level is validated separately - in addition to
        the SGF instance as a whole - with a transformed version of its original document grammar.
        This conversion follows the conversion of the annotation layer described in <xref linkend="sec.adding_layers"/>.</para><para>We decided to use the W3C XML Schema Definition Language (XSD) (cf. <xref linkend="XMLSchema2004"/>) as the underlying schema language for SGF for several
        reasons. As already stated, SGF relies heavily on two aspects: <itemizedlist><listitem><para>ID/IDREF(S) mechanism, and</para></listitem><listitem><para>Namespace support.</para></listitem></itemizedlist> While ID/IDREF(S) is already present in XML Document Type Definitions, DTDs
        lack real support for XML namespaces. Furthermore, SGF makes use of XML Schema data types
          (<xref linkend="XMLSchema2004b"/>) and when external document grammars (for annotation
        layers and metadata) are imported, the control of the processing of the imported document
        grammars is crucial (cf. <xref linkend="sec.serengeti"/> for the discussion of the Serengeti
        log functionality and the role of XML Schema's <code>processContents</code> attribute).
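        As a minimal sketch, such control can be expressed with XSD wildcards; the type names
        <code>metaType</code> and <code>layerType</code> below are hypothetical and not taken from the
        SGF schema, and only the <code>processContents</code> values follow the description of the
        <code>meta</code> and <code>layer</code> elements given above: <programlisting xml:space="preserve">&lt;xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://www.text-technology.de/sekimo"
  elementFormDefault="qualified"&gt;
  &lt;!-- Hypothetical type for metadata: foreign elements are validated
       only if a declaration for them can be found. --&gt;
  &lt;xs:complexType name="metaType"&gt;
    &lt;xs:sequence&gt;
      &lt;xs:any namespace="##other" processContents="lax"
        maxOccurs="unbounded"/&gt;
    &lt;/xs:sequence&gt;
  &lt;/xs:complexType&gt;
  &lt;!-- Hypothetical type for an annotation layer: every foreign element
       must be declared in an available schema. --&gt;
  &lt;xs:complexType name="layerType"&gt;
    &lt;xs:sequence&gt;
      &lt;xs:any namespace="##other" processContents="strict"
        maxOccurs="unbounded"/&gt;
    &lt;/xs:sequence&gt;
  &lt;/xs:complexType&gt;
&lt;/xs:schema&gt;</programlisting> With the value <emphasis role="ital">strict</emphasis>, validation fails if no declaration for a foreign element is available, which is why a schema has to accompany each annotation layer.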
        Because of this we had to choose one of the XML schema languages available. XSD was
        favoured over RELAX NG (<xref linkend="RELAX2003"/>) because of the better software support,
        e.g. with Saxon-SA<footnote><para>
            <link xlink:href="http://www.saxonica.com" xlink:title="Saxonica Homepage" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.saxonica.com</link>
          </para></footnote> a schema-aware XSLT and XQuery engine is available which allows for the use of
        the id() and idref() functions for the task of comparing different annotation layers (cf.
          <xref linkend="sec.sgf_analysis"/>). Of course it would be possible to use simple string
        comparisons; however, XML IDs are usually indexed by the XSLT processor (for Saxon cf.
        <link xlink:href="http://saxon.wiki.sourceforge.net/indexing" xlink:title="Saxon Wiki - Indexing" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://saxon.wiki.sourceforge.net/indexing</link>)
        and lookups via IDs are for this reason - in most cases - much more efficient than the equivalent XPath
        expression using a string comparison predicate (cf. <xref linkend="Kay2008"/>, pp. 802-804).
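        The two variants can be sketched in XSLT 2.0 as follows. This is an illustration only and not
        the stylesheet used in the project; the output elements <code>spans</code> and
        <code>span</code> are hypothetical, and the sketch assumes that the processor treats
        <code>xml:id</code> attributes as IDs: <programlisting xml:space="preserve">&lt;xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:base="http://www.text-technology.de/sekimo"
  exclude-result-prefixes="base"&gt;
  &lt;xsl:template match="/"&gt;
    &lt;spans&gt;
      &lt;xsl:apply-templates/&gt;
    &lt;/spans&gt;
  &lt;/xsl:template&gt;
  &lt;!-- Matches every annotation element that refers to a segment. --&gt;
  &lt;xsl:template match="*[@base:segment]"&gt;
    &lt;!-- Variant 1: lookup via id(), which can use the ID index. --&gt;
    &lt;xsl:variable name="byId" select="id(@base:segment)"/&gt;
    &lt;!-- Variant 2: the same lookup as a string comparison predicate
         evaluated over all segment elements (shown for comparison). --&gt;
    &lt;xsl:variable name="byString"
      select="//base:segment[@xml:id = current()/@base:segment]"/&gt;
    &lt;span name="{name()}" start="{$byId/@start}" end="{$byId/@end}"/&gt;
    &lt;xsl:apply-templates/&gt;
  &lt;/xsl:template&gt;
  &lt;!-- Suppress text nodes such as the primary data. --&gt;
  &lt;xsl:template match="text()"/&gt;
&lt;/xsl:stylesheet&gt;</programlisting> Provided that the IDs are unique, both variables select the same <code>segment</code> element, but only the first variant can benefit from the processor's ID index.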
        This helps to reduce processing costs when dealing with larger SGF instances; the
        downside, however, is that validating against each associated XSD takes some time (approximately one to
        two seconds in our case).</para><para>Apart from XSD validation, embedded Schematron (<xref linkend="Schematron"/>) assertions
        are used as additional constraints, for example to reject segments whose end position
        is less than their start position (cf. <xref linkend="Robertson2002"/>). In the upcoming version
        1.1 of XML Schema, the <code>assert</code> element will be used for fulfilling this task
          (<xref linkend="XMLSchema2008"/>).</para></section><section xml:id="sec.query"><title>Querying</title><para>One of the goals during the development of SGF has been the possibility of analyzing the
        relationships between elements of different layers. In contrast to the work described by
          <xref linkend="Alink2006"/> and <xref linkend="Alink2006a"/>, which involves new standoff
        XPath axis steps, or the linguistic query language LPath, which extends the XPath 1.0 syntax
        and which was introduced by <xref linkend="Bird2006"/>, SGF uses unchanged XML-related
        specifications for querying data. Up to now we have employed XSLT 2.0, XPath 2.0 and XQuery
        1.0 queries for typical tasks carried out in our project (cf. <xref linkend="sec.application"/>). <xref linkend="Bird2006"/> and <xref linkend="Dipper2007"/>
        suggest different example queries to evaluate their architectures. So far, Q1
        ("Find all sentences that include the word 'kam'"), Q2 ("Find all
        sentences that do not include the word 'kam'"), Q3 ("Find all NPs. Return
        the reference to that NP") and Q7 ("Find all pairs of anaphors and direct
        antecedents in which the anaphor is a personal pronoun") described in <xref linkend="Dipper2007"/> have been implemented.
        <footnote><para>The other queries were not appropriate for the corpus under investigation.</para></footnote>

        <xref linkend="lst.xquery.q7"/> shows Q7 for our
        corpus.</para><figure xml:id="lst.xquery.q7"><title>XQuery Q7 adapted for the corpus under investigation</title><programlisting xml:space="preserve">
declare boundary-space strip;
declare namespace base="http://www.text-technology.de/sekimo";
declare namespace doc="http://www.text-technology.de/sekimo/doc";
declare namespace cnx="http://www.text-technology.de/cnx";
declare namespace chs="http://www.text-technology.de/sekimo/chs";
declare variable $doc := "ling-deu-003-sgf-noWS.xml";
&lt;resultset file="{$doc}"&gt;
{
let $d := doc($doc)
for $s in $d//chs:semRel/chs:cospecLink[id(@phorIDRef)/
id(@base:segment)/idref(@xml:id)/..[name()='cnx:token'
and @pos='PRON' and contains(@morpho,'Pers')]]
return
  &lt;relation&gt;
  {$s/@*}
  {&lt;anapher&gt;
    {$s/id(@phorIDRef)/id(@headRef)/data(@text)}
    &lt;/anapher&gt;,
    &lt;antecedent&gt;
    {$s/id(@antecedentIDRefs)/id(@headRef)/data(@text)}
    &lt;/antecedent&gt;}
  &lt;/relation&gt;
}
&lt;/resultset&gt;</programlisting></figure><para>In addition, we have implemented Q8 ("Find all pairs of anaphors and
        antecedents and their respective parent(s) on the logical document layer"), for
        which it is necessary for the XQuery processor to traverse back to the segments, compare
        several <code>segment</code> elements and then to find the corresponding annotations. Most
        of the queries perform comparably to the respective inline queries referred to in <xref linkend="Dipper2007"/>, but in general they are difficult to compare since our corpus (six
        German scientific articles and eight German newspaper articles, containing 3,084 sentences,
        56,203 tokens, 11,740 markables, 4,323 anaphoric relations, three annotation levels: logical
        document structure, POS, anaphoric relations) is different both in terms of size and
        annotation levels. Apart from Q7, most parts of the queries can be performed inline (which
        is a benefit of SGF over other architectures discussed in <xref linkend="sec.graph"/>),
        which allows us to abstain from converting SGF instances to inline representation prior to
        analyzing the relations (which was one of the motivations in developing SGF) as proposed by
          <xref linkend="Dipper2007"/>.</para><para>For a first evaluation we have chosen both the aforementioned complete corpus and our
        largest single text, a German scientific article comprising 157 paragraphs, 696 sentences,
        12,345 tokens, 2,550 markables and 1,358 anaphoric relations (14,985 segments in total),
        annotated on the three annotation levels described above. All values are average results
        after five executions on two different machines: <orderedlist><listitem><para>PC1: a Sun Fire V20z equipped with two single-core AMD Opteron 248 processors clocked at 2.2
              GHz and 6 GB RAM running on Sun Solaris 10 (64bit) with Saxon-SA 9.0.0.1J on Java
              1.5.0_15 (2 GB RAM allocated for Java VM) and SWI-Prolog 5.6.21 (128 MB allocated as
              local stack limit).</para></listitem><listitem><para>PC2: a standard PC equipped with an Intel dual-core Core2Duo E6600 clocked at 2.99
              GHz with 3.12 GB RAM running on Microsoft Windows XP SP3 (32bit) with Saxon-SA
              9.0.0.1J on Java 1.6.0_06 (1 GB RAM allocated for Java VM) and SWI-Prolog 5.6.57 (128
              MB allocated as local stack limit).</para></listitem></orderedlist> Included in the XQuery results is the validation of five XSD files (<emphasis role="ital">-val</emphasis> parameter) and the output of an XML file (<emphasis role="ital">-o</emphasis> parameter) with a <code>resultset</code> root element and the
        corresponding query results underneath. For comparison, we evaluated the same queries for
        the Prolog fact base architecture used in the first project phase (cf. <xref linkend="sec.prolog"/>) on the same two machines. For the latter the amount of time for
        consulting the Prolog fact base containing the annotated data (14.3 MB in size, 3.37 sec on
        PC1; 2.94 sec on PC2) and the Prolog query file (4.3 KB in size, 0.0 sec on both machines)
        is not included in the results. The query results are output to a separate text file.</para><table border="1" xml:id="tab.results"><caption><para>Evaluation results (in seconds). Average of five executions.</para></caption><thead><tr><th>Query</th><th>Prolog query results for single text (PC1 / PC2)</th><th>XQuery results for single text (PC1 / PC2)</th><th>XQuery results for whole corpus (PC1 / PC2)</th></tr></thead><tbody><tr><td>Q1</td><td>0.22 / 0.054</td><td>4.612 / 1.244</td><td>9.609 / 4.162</td></tr><tr><td>Q2</td><td>13.502 / 4.554</td><td>5.161 / 1.234</td><td>9.390 / 4.357</td></tr><tr><td>Q3</td><td>0.084 / 0.03</td><td>4.035 / 1.219</td><td>9.556 / 4.084</td></tr><tr><td>Q7</td><td>30.66 / 7.798</td><td>5.764 / 1.481</td><td>11.669 / 5.35</td></tr><tr><td>Q8</td><td>84.16 / 24.738</td><td>15.379 / 11.134</td><td>152.683 / 114.525</td></tr></tbody></table><para>Note that in contrast to the graph-based architectures described in <xref linkend="sec.graph"/>, the XQueries and their evaluation results depend on the annotation
        layers that are imported into the SGF base layer. This means that especially Q1, Q2 and Q3
        are very fast because they can be performed inline in our corpus (i.e. both sentence and
        token information are descendants of the same annotation element - and the
        <code>token</code> element contains its textual content in its <code>text</code> attribute).
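        Such an inline query can be sketched in XSLT 2.0 as follows. This is an illustration of Q1
        under the assumption that the element and attribute names of the <code>cnx</code> layer shown in
        <xref linkend="lst.example"/> apply; it is not the XQuery used for the evaluation, and the
        parameter <code>word</code> and the output element <code>sentence</code> are hypothetical: <programlisting xml:space="preserve">&lt;xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:cnx="http://www.text-technology.de/cnx"
  exclude-result-prefixes="cnx"&gt;
  &lt;!-- Hypothetical parameter: the word form to search for. --&gt;
  &lt;xsl:param name="word" select="'kam'"/&gt;
  &lt;xsl:template match="/"&gt;
    &lt;resultset&gt;
      &lt;!-- Sentence and token elements belong to the same layer, so the
           descendant axis suffices and no segment lookup is needed. --&gt;
      &lt;xsl:for-each select="//cnx:sentence[.//cnx:token/@text = $word]"&gt;
        &lt;sentence id="{@id}"/&gt;
      &lt;/xsl:for-each&gt;
    &lt;/resultset&gt;
  &lt;/xsl:template&gt;
&lt;/xsl:stylesheet&gt;</programlisting> The query never leaves the <code>cnx</code> layer, so neither the id() nor the idref() function is involved.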
        For Q7, information derived from different annotation layers has to be taken into account;
        however, since only the id() function is used, the results are satisfactory as well. Q8 is
        the only XQuery that requires the identification of the respective <code>segment</code>
        element and the use of the idref() function afterwards in order to get the corresponding
        annotations. For these reasons, the advantage when using SGF over comparable architectures
        rises or drops depending on the imported annotation layers. To further reduce processing
        costs it is possible to use merged inline annotation layers (e.g. a logical document layer
        and a POS layer) as a combined, single SGF layer and use separate SGF layers only when
        overlaps occur. In this case the XML-inherent hierarchies can be used for the (inline) analysis
        of large parts of the annotated data, while SGF's ID/IDREF mechanism
        has to be used only where it cannot be avoided.</para><para>The performance figures for the Prolog fact base format show higher performance for
        simple queries but lower performance for more complex ones. These figures result from the
        fact that our corpus annotation makes heavy use of attributes, which leads to distributed
        information. We believe that a re-implemented Prolog fact base format could both reduce file
        size and speed up the querying.</para></section><section xml:id="sec.caveats"><title>Caveats and problems</title><para>Up to now, several former inline annotation layers have been converted into SGF and the
        format as such is quite stable (although minor changes may occur). Apart from the huge
        amount of markup that is necessary to do this kind of analysis, problems may arise when the
        annotation layers that are stored in SGF are exported back into their original inline
        representation. This is especially true when the annotation layers contain empty elements,
        for which it is impossible to provide the exact position in the original document tree (of
        course the <code>base:segment</code> attribute can be used for these elements as well; however, when
        a large number of empty elements appear in a row, the values of all their respective
          <code>base:segment</code> attributes would be identical). Although our largest SGF
        instance has a size of 6 MB including optional whitespace segments (4.8 MB without optional
        whitespace segments), it is still smaller than the respective Prolog fact base
        representation (14.3 MB, cf. <xref linkend="sec.prolog"/>).</para><para>When it comes to queries, SGF relies on the imported annotation layers. For this reason,
        there is no standard set of queries available and the execution time cannot be easily
        predicted.</para></section></section><section xml:id="sec.application"><title>Application of SGF</title><para>Various application domains require the analysis of different information resources in
      order to answer a specific question. <xref linkend="Alink2006"/>, <xref linkend="Alink2006a"/>, for example, describe the analysis of multiple markup in the domain of digital forensics.
      In our project, we focus on linguistic phenomena, especially on anaphora resolution. Anaphora
      occurs when the interpretation of a linguistic unit (the anaphor) is dependent on the
      interpretation of another element in the previous context (the antecedent). The anaphor is
      often an abbreviated or reformulated reference to its antecedent and thus provides for the
      progression of discourse topics and discourse coherence. Anaphoric relations can be
      categorized according to different axes (cf. <xref linkend="Mitkov2002"/> for an overview): type
      of anaphora (pronoun, NP, adverb, etc.), type of antecedent (e.g. nominal vs. abstract entity)
      and type of relation. In this paper, we will focus on nominal anaphora with nominal
      antecedents only. According to the relation type, anaphoric relations may either express
      reference identity between the anaphor and its antecedent (<xref linkend="ex.1"/>) or the
      respective expressions are related via associative links (<xref linkend="ex.2"/>).</para><orderedlist><listitem><para xml:id="ex.1" xreflabel="Example 1">I met a man yesterday. He told me a story.
          (example taken from <xref linkend="Clark1977"/>, p. 414)</para></listitem><listitem><para xml:id="ex.2" xreflabel="Example 2">I looked into the room. The ceiling was very high.
          (example taken from <xref linkend="Clark1977"/>, p. 415)</para></listitem></orderedlist><para>In order to resolve anaphoric relations, different kinds of information have to be taken
      into account that are provided by different resources: POS taggers, chunkers, parsers, word nets
      and ontologies. These resources provide information on gender or number agreement, noun
      phrases, grammatical function, lexico-semantic relations and domain or world knowledge. The
      resolution of the anaphoric relation given in <xref linkend="ex.1"/> is dependent on agreement
      information of the pronoun <emphasis role="ital">he</emphasis> whereas the resolution of <xref linkend="ex.2"/> requires the knowledge that a room typically has a ceiling which is
      provided in terminological nets such as <emphasis role="ital">WordNet</emphasis> (<xref linkend="Fellbaum1998"/>) or other ontological resources.<footnote><para>
          <link xlink:href="http://wordnet.princeton.edu/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://wordnet.princeton.edu/</link>
        </para></footnote></para><para>We apply SGF for the integration of different resources and access to these data. In terms
      of levels and layers, each resource provides information for a specific level and this
      information is stored in a respective layer: A POS tagger provides information on part of
      speech tags and respective markup is generated in the tool's output file whereas access to a
      word net provides information on the semantic relatedness of words in terms of the distance between
      the words' synsets. This information has to be stored and accessed for the anaphora resolution
      process. <xref linkend="fig.resource"/> exemplifies the integration: Each resource is applied
      and the resulting markup is stored independently from the primary data. On the basis of the
      information stored in SGF it is possible to query the data, to create new markup
      layers, or to create inline versions of the markup and the primary data.</para><para>
      <!--<figure xml:id="fig.resource" pgwide="0">-->
      <figure xml:id="fig.resource"><title>Application of multiple resources</title><mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Stuehrenberg01/Stuehrenberg01-005.png" width="100%"/></imageobject></mediaobject></figure>
    </para><section xml:id="sec.sgf_analysis"><title>Analysing annotations</title><para>In the application domain of anaphora resolution, a raw text document is taken as input
        and annotation layers are created for different levels. All layers are converted to SGF and
        can be analyzed afterwards. For the task of anaphora resolution, a set of antecedent
        candidates is created for each anaphoric element via an XSLT script (cf. <xref linkend="lst.xslt"/>, an example candidate list is shown in <xref linkend="lst.candidates"/>). The candidate list consists of several <code>semRel</code> elements each containing one
          <code>anaphor</code> element and several <code>antecedentCandidate</code> elements.
        Information on the relation type between anaphor and correct antecedent is stored as
        attribute information in the <code>semRel</code> element. The <code>anaphor</code> element
        describes properties of the anaphoric element whereas the <code>antecedentCandidate</code> elements describe
        information on the antecedent candidates. In both cases this information is stored in terms
        of attributes. Number or gender agreement can be computed from the <code>morpho</code>
        attribute. Additional information is given for
        part of speech (<code>pos</code>), grammatical function (<code>syntax</code>), dependency
        structure (<code>dependHead</code>), position of element in the whole document (<code>position</code>), the parent element on the logical document layer (<code>docParent</code>) as
        well as for the head noun both in surface form (<code>text</code>) and lemma
        (<code>lemma</code>). Together with other pieces of information a score for the most
        probable antecedent candidate can be computed (cf. <xref linkend="Goecke2007"/> for a similar approach). For the
        anaphora resolution system each anaphor-candidate pair is interpreted as a feature vector
        which is used for training a classifier. Information on the correct antecedent candidate is
        necessary in order to classify positive and negative training examples (cf. <xref linkend="Soon2001"/>, <xref linkend="Strube2003"/>, <xref linkend="Yang2004"/>).</para><para>The annotated example sentence in <xref linkend="lst.example"/> is an extract of a
        German newspaper article that is part of our corpus. The content of the text excerpt is as
        follows: </para><para><blockquote><para>Lurup ist ein sozialer Brennpunkt der Hansestadt, ein Vorort mit Einzelhäusern, aber
            auch vielen Wohnblocks im Westen der Stadt.</para></blockquote> which translates as: <blockquote><para>Lurup is a deprived area of the Hanseatic city (Hansestadt), a suburb with detached
            houses but also many apartment blocks in the west of the city (Stadt).</para></blockquote>
      </para><para>In <xref linkend="lst.example"/> all levels that are used in the <emphasis role="ital">Sekimo</emphasis> project can be observed: the logical document structure (namespace
        prefix <code>doc</code>), the output of the commercial Parser/Tagger <emphasis role="ital">Machinese Syntax</emphasis> by Connexor Oy (namespace prefix <code>cnx</code>), the
        discourse entity level and the semantic relations level (namespace prefix <code>chs</code>).
        The segment <code>seg1</code> delimits the whole text, while <code>seg2</code> delimits a
        paragraph (containing a single sentence, cf. the <code>doc:text</code> and
        <code>doc:para</code> elements in the logical document layer and the
        <code>cnx:sentence</code> element in the <code>cnx</code> layer). The segments identified by
          <code>seg1589</code> and <code>seg1620</code> mark the two tokens (and respective discourse
        entities) "Hansestadt" and "Stadt". There is a
        cospecification relation (to be more specific: a hypernym relation) between these two
        discourse entities which is stored in the <code>chs:cospecLink</code> element located in the
          <code>chs</code> layer.</para><figure xml:id="lst.example"><title>SGF instance of a German newspaper text (excerpt)</title><!--<programlisting xml:space="preserve" linenumbering="unnumbered">--><programlisting xml:space="preserve">&lt;corpus xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.text-technology.de/sekimo root.xsd"
  xmlns="http://www.text-technology.de/sekimo"
  xmlns:base="http://www.text-technology.de/sekimo"&gt;
  &lt;corpusData xml:id="c15" type="text"&gt;
    &lt;primaryData start="0" end="8208" fileref="c15-pd.txt" xml:lang="de"&gt;
      &lt;checksum algorithm="md5"&gt;6ee0021b23c56b5917703746579e9ce8&lt;/checksum&gt;
    &lt;/primaryData&gt;
    &lt;segments&gt;
      &lt;segment xml:id="seg1" type="char" start="0" end="8207"/&gt;
      &lt;segment xml:id="seg2" type="char" start="0" end="16"/&gt;
      &lt;segment xml:id="seg1577" type="char" start="4439" end="4567"/&gt;
      &lt;segment xml:id="seg1578" type="char" start="4439" end="4444"/&gt;
      &lt;segment xml:id="seg1589" type="char" start="4473" end="4487"/&gt;
      &lt;segment xml:id="seg1592" type="char" start="4477" end="4487"/&gt;
      &lt;segment xml:id="seg1620" type="seg" segments="seg1621 seg1623"/&gt;
      &lt;segment xml:id="seg1621" type="char" start="4557" end="4560"/&gt;
      &lt;segment xml:id="seg1623" type="char" start="4561" end="4566"/&gt;
      &lt;!-- [...] --&gt;
    &lt;/segments&gt;
    &lt;annotation&gt;
      &lt;level xml:id="doc" priority="0"&gt;
        &lt;meta&gt;&lt;!-- [...] --&gt;&lt;/meta&gt;
        &lt;layer xmlns:doc="http://www.text-technology.de/sekimo/doc"
          xsi:schemaLocation="http://www.text-technology.de/sekimo/doc doc.xsd"&gt;
          &lt;doc:text base:segment="seg1" xml:lang="de"&gt;
            &lt;doc:para base:segment="seg2" skip="no"/&gt;
            &lt;!-- [...] --&gt;
          &lt;/doc:text&gt;
        &lt;/layer&gt;
      &lt;/level&gt;
    &lt;/annotation&gt;
    &lt;annotation&gt;
      &lt;level xml:id="cnx" priority="0"&gt;
        &lt;meta&gt;&lt;!-- [...] --&gt;&lt;/meta&gt;
        &lt;layer xmlns:cnx="http://www.text-technology.de/cnx"
          xsi:schemaLocation="http://www.text-technology.de/cnx cnx.xsd"&gt;
          &lt;!-- [...] --&gt;
          &lt;cnx:sentence base:segment="seg1577" id="w826" auto="no"&gt;
            &lt;!-- [...] --&gt;
            &lt;cnx:token base:segment="seg1578" text="Lurup" dependHead="w828"
              pos="N" syntax="@NH" lemma="lurup" dependValue="subj" morpho="NOM"
              id="w827"/&gt;
            &lt;!-- [...] --&gt;
            &lt;cnx:token base:segment="seg1592" text="Hansestadt" dependHead="w831"
              pos="N" syntax="@NH" lemma="hanse#stadt" dependValue="mod"
              morpho="FEM SG GEN" id="w833"/&gt;
            &lt;!-- [...] --&gt;
            &lt;cnx:token base:segment="seg1621" text="der" dependHead="w848"
              pos="DET" syntax="@PREMOD" lemma="die" dependValue="det"
              morpho="Def FEM SG GEN" id="w847"/&gt;
            &lt;cnx:token base:segment="seg1623" text="Stadt" dependHead="w846"
              pos="N" syntax="@NH" lemma="stadt" dependValue="mod"
              morpho="FEM SG GEN" id="w848"/&gt;
          &lt;/cnx:sentence&gt;
          &lt;!-- [...] --&gt;
        &lt;/layer&gt;
      &lt;/level&gt;
    &lt;/annotation&gt;
    &lt;annotation&gt;
      &lt;level xml:id="de" priority="1"&gt;
        &lt;meta&gt;&lt;!-- [...] --&gt;&lt;/meta&gt;
        &lt;layer xmlns:chs="http://www.text-technology.de/sekimo/chs"
          xsi:schemaLocation="http://www.text-technology.de/sekimo/chs chs.xsd"&gt;
          &lt;!-- [...] --&gt;
          &lt;chs:de base:segment="seg1589" deID="de226" headRef="w833" /&gt;
          &lt;chs:de base:segment="seg1620" deID="de231" headRef="w848" deType="nom"/&gt;
          &lt;!-- [...] --&gt;
        &lt;/layer&gt;
      &lt;/level&gt;
      &lt;level xml:id="chs" priority="1"&gt;
        &lt;meta&gt;&lt;!-- [...] --&gt;&lt;/meta&gt;
        &lt;layer xmlns:chs="http://www.text-technology.de/sekimo/chs"
          xsi:schemaLocation="http://www.text-technology.de/sekimo/chs
          chs.xsd"&gt;
          &lt;chs:semRel&gt;
            &lt;!-- [...] --&gt;
            &lt;chs:cospecLink id="sr86" relType="hypernym" phorIDRef="de231"
              antecedentIDRefs="de226"/&gt;
            &lt;!-- [...] --&gt;
          &lt;/chs:semRel&gt;
        &lt;/layer&gt;
      &lt;/level&gt;
    &lt;/annotation&gt;
  &lt;/corpusData&gt;
&lt;/corpus&gt;</programlisting></figure><para>Apart from resources that have already been mentioned, further information is needed in
        order to create a suitable set of antecedent candidates for training and resolution. In
        general, a fixed search window in terms of markables (i.e. elements between which anaphoric
        relations can hold), sentences or paragraphs is chosen. This approach works well for pronoun
        anaphora due to the fact that pronouns tend to find their antecedents within a short
        distance (cf. <xref linkend="Mitkov2002"/>). However, for the resolution of non-pronominal
        definite noun phrases (definite descriptions) and the processing of long texts the
        application of a fixed search window is not feasible because definite descriptions tend to
        find their antecedents at a greater distance than pronouns. For the corpus under
        investigation that has been manually annotated for anaphoric relations (cf. <xref linkend="Diewald2008"/> for further information regarding the corpus and the annotation
        scheme), 26.8% of all non-pronominal anaphors (i.e. 20.9% of all anaphors in the corpus)
        find their antecedent at a distance of two or more paragraphs. We apply structural
        information to create candidate sets that include not only candidates at a short distance
        but also those at a larger distance. A small excerpt of the XSLT stylesheet that is used for
        the extraction is shown in <xref linkend="lst.xslt"/>. </para><figure xml:id="lst.xslt"><title>Excerpt of the XSLT stylesheet used for extracting candidates</title><programlisting xml:space="preserve">&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns="http://www.text-technology.de/sekimo"
  xmlns:base="http://www.text-technology.de/sekimo"
  xmlns:doc="http://www.text-technology.de/sekimo/doc"
  xmlns:cnx="http://www.text-technology.de/cnx"
  xmlns:chs="http://www.text-technology.de/sekimo/chs"&gt;
  &lt;!-- [...] --&gt;
  &lt;xsl:template match="chs:bridgingLink | chs:cospecLink"&gt;
    &lt;xsl:variable name="link" select="."/&gt;
    &lt;semRel&gt;
      &lt;xsl:attribute name="relationID" select="@id"/&gt;
      &lt;xsl:attribute name="type" select="local-name()"/&gt;
      &lt;xsl:for-each select="id(@phorIDRef)"&gt;
        &lt;xsl:variable name="anaphoraPosition"&gt;
          &lt;xsl:number level="single"/&gt;
        &lt;/xsl:variable&gt;
        &lt;!-- [...] --&gt;
        &lt;anaphor&gt;
          &lt;!-- [...] --&gt;
          &lt;xsl:copy-of select="idref(id(@base:segment))[name()='cnx:token']/@*"/&gt;
          &lt;xsl:variable name="segstart" select="id(@base:segment)/@start"/&gt;
          &lt;xsl:variable name="segend" select="id(@base:segment)/@end"/&gt;
          &lt;xsl:for-each select="//element()[contains(name(),'doc')]"&gt;
            &lt;xsl:if test="id(@base:segment)/@start &lt;= $segstart and
            id(@base:segment)/@end &gt;= $segend"&gt;
              &lt;xsl:attribute name="docParent"&gt;
                &lt;xsl:value-of select="name()"/&gt;
                &lt;xsl:text&gt;[&lt;/xsl:text&gt;
                &lt;xsl:number level="single"/&gt;
                &lt;xsl:text&gt;]&lt;/xsl:text&gt;
              &lt;/xsl:attribute&gt;
            &lt;/xsl:if&gt;
          &lt;/xsl:for-each&gt;
        &lt;/anaphor&gt;
        &lt;xsl:for-each select="preceding-sibling::chs:de[position() &lt;= $de_distance]"&gt;
          &lt;xsl:variable name="antecedentPosition"&gt;
            &lt;xsl:number level="single"/&gt;
          &lt;/xsl:variable&gt;
          &lt;antecedentCandidate&gt;
            &lt;!-- [...] --&gt;
          &lt;/antecedentCandidate&gt;
        &lt;/xsl:for-each&gt;
      &lt;/xsl:for-each&gt;
    &lt;/semRel&gt;
  &lt;/xsl:template&gt;
&lt;/xsl:stylesheet&gt;</programlisting></figure><para>Because the <code>segment</code> element is the central and critical mechanism in SGF
        (cf. <xref linkend="fig.sgf.id"/>), we have to use the <code>id()</code> and <code>idref()</code> XPath functions to
        analyze elements derived from different annotation layers. <xref linkend="lst.candidates"/>
        shows a result candidate list, extracted with a maximum distance of 10 discourse entities.</para><figure xml:id="lst.candidates"><title>Candidate list extracted from the SGF instance</title><!--<programlisting linenumbering="numbered" startinglinenumber="1" xml:space="preserve">--><programlisting xml:space="preserve">&lt;candidateList xmlns="http://www.text-technology.de/sekimo"
  xmlns:base="http://www.text-technology.de/sekimo"
  maxDeDistance="10" filename="c15-sgf.xml"&gt;
  &lt;!-- [...] --&gt;
  &lt;semRel relationID="sr86" type="cospecLink" subtype="hyperonym" phorIDRef="de231"
    antecedentIDRefs="de226"&gt;
    &lt;anaphor base:segment="seg1623" deID="de231" headRef="w848" deType="nom"
      text="Stadt" dependHead="w846" pos="N" syntax="@NH" lemma="stadt"
      dependValue="mod" morpho="FEM SG GEN" id="w848" position="195" type="char"
      start="4557" end="4566" docParent="doc:para[13]"/&gt;
    &lt;antecedentCandidate base:segment="seg1560" deID="de221" headRef="w815"
      deType="nom" text="Kind" pos="N" syntax="@NH" lemma="kind"
      morpho="NEU SG NOM" id="w815" position="185" deDistance="10" type="char"
      start="4375" end="4403" docParent="doc:para[12]"/&gt;
    &lt;!-- [...] --&gt;
    &lt;antecedentCandidate base:segment="seg1587" deID="de225" headRef="w831"
      deType="nom" text="Brennpunkt" dependHead="w828" pos="N" syntax="@NH"
      lemma="brenn#punkt" dependValue="comp" morpho="MSC SG NOM" id="w831"
      position="189" deDistance="6" type="char" start="4449" end="4472"
      docParent="doc:para[13]"/&gt;
    &lt;antecedentCandidate correctAntecendent="yes" base:segment="seg1592"
      deID="de226" headRef="w833" deType="nom" text="Hansestadt" dependHead="w831"
      pos="N" syntax="@NH" lemma="hanse#stadt" dependValue="mod"
      morpho="FEM SG GEN" id="w833" position="190" deDistance="5"
      type="char" start="4473" end="4487" docParent="doc:para[13]"/&gt;
    &lt;antecedentCandidate base:segment="seg1598" deID="de227" headRef="w836"
      deType="nom" text="Vorort" pos="N" syntax="@NH" lemma="vorort"
      morpho="MSC SG NOM" id="w836" position="191" deDistance="4" type="char"
      start="4489" end="4499" docParent="doc:para[13]"/&gt;
    &lt;antecedentCandidate base:segment="seg1602" deID="de228" headRef="w838"
      deType="nom" text="Einzelhäusern" dependHead="w836" pos="N" syntax="@NH"
      lemma="einzelhaus" dependValue="mod" morpho="NEU PL DAT" id="w838"
      position="192" deDistance="3" type="char" start="4504" end="4517"
      docParent="doc:para[13]"/&gt;
    &lt;!-- [...] --&gt;
  &lt;/semRel&gt;
&lt;/candidateList&gt;</programlisting></figure><para>For all <code>antecedentCandidate</code> elements (i.e. former <code>chs:de</code>
        elements) <code>position</code> and <code>deDistance</code> attributes have been added.
        Apart from the discourse structure that is used to model accessibility of antecedent
        candidates (cf. <xref linkend="Polanyi1988"/>), the logical document structure provides
        information on the hierarchical structure of texts by describing
        the organisation of the text document in terms of chapters, sections, paragraphs, and the
        like and is stored in the <code>doc</code> layer of the SGF instance.<footnote><para>The logical document layer is a shortened variant of the DocBook schema (cf. <xref linkend="Bayerl2003"/> for details).</para></footnote> Based on this information, which can be obtained from formats such as DocBook, OpenDocument, or
        LaTeX, an application-independent, layout-oriented presentation can be generated.
        Especially for texts from e-publishing sources, a set of logical document structure elements
        is readily available and can be used to identify different text segments. The influence of
        the logical document structure on the choice of an antecedent might be either (a) a direct
        influence on the markables (or antecedent life span) or (b) an influence on the search
        window (cf. <xref linkend="Goecke2006"/>). In our candidate list shown in <xref linkend="lst.candidates"/> the <code>docParent</code> attribute supplies information about
        the (virtual) parent element of the logical document layer, i.e. the element of the logical
        document layer that refers to a segment whose start position is lower than or equal to and whose end
        position is greater than or equal to that of the segment referred to by the element analyzed. </para><para>Regarding the document structure, corpus evidence shows that some discourse entities are
        more prominent throughout the whole document than others, e.g. markables occurring in the
        abstract of a text might be accessible during the whole text whereas markables that occur in
        a footnote structure are less likely to serve as antecedents for anaphoric elements in the main
        text. In a corpus of 4323 anaphoric relations, 65.3% of
        all anaphor-antecedent pairs are located in the same segment. Regarding the remaining
        anaphor-antecedent pairs, we expect markables described in hierarchically higher elements
        (e.g. subsection) to be much more prone to finding their antecedents in structuring elements of a
        higher level (section) than in a preceding but hierarchically lower segment
        (subsubsection). Thus, the influence on the search window may either enlarge the search
        window, i.e. the antecedent may be located outside the standard window (e.g. located in the
        whole paragraph or in a preceding one), or may narrow the search window, e.g. due to the
        start of a new chapter or section. Furthermore, the position of an antecedent candidate
        within a paragraph indicates how likely that candidate is to be the correct one.
        An analysis of our corpus data shows that 50.2% of the antecedents are located
        in paragraph-initial position and 29.1% in paragraph-final position, whereas only 20.2% are located in
        the middle of the paragraph. Thus, in addition to the information regarding the search
        window, information on logical document structure might give cues for selecting the correct
        antecedent from a set of candidates.</para></section><section xml:id="sec.serengeti"><title>SGF as import and export format</title><para>While the main reason for the development of SGF was analyzing relations between
        elements derived from different annotations (cf. <xref linkend="sec.application"/>), the
        format is also used in another application in our project. The <emphasis role="ital">Serengeti</emphasis> web-based annotation tool described in <xref linkend="Stührenberg2007"/> is currently being enhanced to support different annotation schemes. This upcoming version of
        Serengeti will be used not only at Bielefeld University but also as an expert annotation
        tool in the <emphasis role="ital">AnaWiki</emphasis> project (cf. <xref linkend="Poesio2008"/>) and will use SGF as its import and export format. For this reason, an SGF API (written
        in Perl) was implemented that allows the mapping of SGF to the relational MySQL database
        that is used as a backend for Serengeti.</para><para>During this development, a logging facility was added to SGF so that
        information about added, deleted or modified data is not only stored in the Serengeti
        application but can also be included in the exported SGF instance. A <code>log</code> can be
        stored as a child element of an annotation level and contains at least one log
        <code>entry</code>, consisting of optional metadata and one or more <code>action</code>
        elements. The user responsible for the log entry is identified via a respective attribute,
        together with the time the entry was made (<code>timestamp</code> attribute). Each action is
        specified by its <code>type</code> attribute (<emphasis role="ital">add</emphasis>,
          <emphasis role="ital">delete</emphasis>, <emphasis role="ital">modify</emphasis>) and
        refers to the affected elements via an optional IDREF <code>affectedItem</code> attribute
        (which is omitted when the value of the <code>type</code> attribute is <emphasis role="ital">add</emphasis>). The content of an <code>action</code> element is a sequence of elements
        from any namespace (otherwise modification of segments would not be possible). Because XML
        Schema's <code>processContents</code> attribute is set to <emphasis role="ital">skip</emphasis>, the same IDs can be used several times (e.g. when
        modifying a <code>segment</code> element).</para><para>In addition, an SGF application for storing lexical chains was developed. <emphasis role="ital">SGF-LC</emphasis>, a lightweight XSD that is imported into the SGF base layer
        and that makes use of the attributes provided by the base layer, is described in <xref linkend="Waltinger2008"/> and is used as an export format for the <emphasis role="ital">Scientific Workplace</emphasis> tool<footnote><para>
            <link xlink:href="http://www.scientific-workplace.org/" xlink:title="Scientific Workplace Homepage" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.scientific-workplace.org/</link>
          </para></footnote> developed by the project A4 (<emphasis role="ital">Indogram</emphasis>) of our
        Research Group.</para></section></section><section xml:id="sec.conclusion"><title>Conclusion and outlook</title><para>In this paper we presented the Sekimo Generic Format (SGF) as an alternative approach for
      storing multiply annotated data alongside a variety of already established architectures and
      formats. SGF is used as an XML-based solution for storing and especially analyzing a corpus of
      multiply annotated documents (multi-rooted trees) in the linguistic application domain of
      anaphora resolution. Future work regarding our linguistic task of anaphora resolution focuses
      on the analysis of relations between logical document structure and the distribution of
      antecedent detection. On the technical side, we will adapt SGF to the upcoming version 1.1 of
      XML Schema, which includes assertions similar to the Schematron asserts used in the current
      version of SGF. Other possible developments include the implementation of converter scripts
      between SGF and some of the graph-based architectures mentioned and the further testing of the
      efficiency of SGF in large scale corpora using a wider set of sample queries.</para></section><bibliography><title>References</title><bibliomixed xml:id="Alink2006" xreflabel="Alink et al., 2006">Alink, W., Bhoedjang, R., de
      Vries, A. P., and Boncz, P. A. <emphasis role="ital">Efficient XQuery Support for Stand-Off
        Annotation</emphasis>. In: Proceedings of the 3rd International Workshop on XQuery
      Implementation, Experience and Perspectives, in cooperation with ACM SIGMOD, Chicago, USA,
      2006. </bibliomixed><bibliomixed xml:id="Alink2006a" xreflabel="Alink et al., 2006a">Alink, W., Jijkoun, V., Ahn,
      D., and de Rijke, M. <emphasis role="ital">Representing and Querying Multi-dimensional Markup
        for Question Answering</emphasis>. In: Proceedings of the 5th EACL Workshop on NLP and XML
      (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing, Trento, 2006.</bibliomixed><bibliomixed xml:id="Bayerl2003" xreflabel="Bayerl et al., 2003">Bayerl, P. S., Lüngen, H.,
      Goecke, D., Witt, A. and Naber, D. <emphasis role="ital">Methods for the semantic analysis of
        document markup</emphasis>. In: Roisin, C., Munson, E. and Vanoirbeek, C. (eds.), Proceedings
      of the 3rd ACM Symposium on Document Engineering (DocEng), Grenoble, pages 161-170, 2003. 
      doi:<biblioid class="doi">10.1145/958220.958250</biblioid>.</bibliomixed><bibliomixed xml:id="Bird1999" xreflabel="Bird and Liberman, 1999">Bird, S. and Liberman,
        M. <emphasis role="ital">Annotation graphs as a framework for multidimensional linguistic
        data analysis</emphasis>. In: Proceedings of the Workshop "Towards Standards and Tools for
      Discourse Tagging", pages 1–10. Association for Computational Linguistics, 1999.</bibliomixed><bibliomixed xml:id="Bird2000" xreflabel="Bird et al., 2000"> Bird, S., Day, D., Garofolo, J.,
      Henderson, J., Laprun, C. and Liberman, M. <emphasis role="ital">ATLAS: A flexible and
        extensible architecture for linguistic annotation</emphasis>. In: Proceedings of the Second
      International Conference on Language Resources and Evaluation, pages 1699–1706, Paris, 2000.
      European Language Resources Association. </bibliomixed><bibliomixed xml:id="Bird2001" xreflabel="Bird and Liberman, 2001">Bird, S. and Liberman, M.
        <emphasis role="ital">A formal framework for linguistic annotation</emphasis>. Speech
      Communication, 33(1–2): pages 23–60, 2001. 
      doi:<biblioid class="doi">10.1016/S0167-6393(00)00068-6</biblioid>.</bibliomixed><bibliomixed xml:id="Bird2006" xreflabel="Bird et al., 2006"> Bird, S., Chen, Y., Davidson, S.,
      Lee, H. and Zheng, Y. <emphasis role="ital">Designing and Evaluating an XPath Dialect for
        Linguistic Queries</emphasis>. In: Proceedings of the 22nd International Conference on Data
      Engineering (ICDE), Atlanta, USA, 2006. 
      doi:<biblioid class="doi">10.1109/ICDE.2006.48</biblioid>. </bibliomixed><bibliomixed xml:id="Carletta2003" xreflabel="Carletta et al., 2003">Carletta, J., Kilgour, J.,
      O’Donnell, T. J., Evert, S. and Voormann, H. <emphasis role="ital">The NITE Object Model
        Library for Handling Structured Linguistic Annotation on Multimodal Data Sets</emphasis>.
      In: Proceedings of the EACL Workshop on Language Technology and the Semantic Web (3rd Workshop
      on NLP and XML (NLPXML-2003)), Budapest, Hungary, 2003. </bibliomixed><bibliomixed xml:id="Clark1977" xreflabel="Clark, 1977">Clark, H. <emphasis role="ital">Bridging</emphasis>. In: Johnson-Laird, P. N. and Wason, P. C. (eds.), Thinking: Readings in
      Cognitive Science, pages 411–420. Cambridge: Cambridge University Press, 1977. </bibliomixed><bibliomixed xml:id="Cowan2006" xreflabel="Cowan et al., 2006"> Cowan, J., Tennison, J. and Piez,
      W. <emphasis role="ital">LMNL update</emphasis>. In: Proceedings of Extreme Markup Languages,
      Montréal, Québec, 2006. </bibliomixed><bibliomixed xml:id="DeRose2001" xreflabel="DeRose et al., 2001"> DeRose, S., Maler, E. and
      Orchard, D. <emphasis role="ital">XML Linking Language (XLink) Version 1.0</emphasis>. W3C
      Recommendation, World Wide Web Consortium, June 2001. Online: <link xlink:href="http://www.w3.org/TR/2001/REC-xlink-20010627/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2001/REC-xlink-20010627/</link>. </bibliomixed><bibliomixed xml:id="DeRose2004" xreflabel="DeRose, 2004">DeRose, S. J. <emphasis role="ital">Markup Overlap: A Review and a Horse</emphasis>. In: Proceedings of Extreme Markup
      Languages, 2004. </bibliomixed><bibliomixed xml:id="Diewald2008" xreflabel="Diewald et al. (submitted)">Diewald, N.,
      Stührenberg, M., Garbar, A. and Goecke, D. <emphasis role="ital">Serengeti – Webbasierte
        Annotation semantischer Relationen</emphasis>. To appear in LDV Forum - Zeitschrift für
      Computerlinguistik und Sprachtechnologie.</bibliomixed><bibliomixed xml:id="Dipper2005" xreflabel="Dipper, 2005">Dipper, S. <emphasis role="ital">XML-based stand-off representation and exploitation of multi-level linguistic
      annotation</emphasis>. In: Proceedings of Berliner XML Tage 2005 (BXML 2005), pages 39–50,
      Berlin, Germany, 2005. </bibliomixed><bibliomixed xml:id="Dipper2007" xreflabel="Dipper et al., 2007"> Dipper, S., Götze, M.,
      Küssner, U. and Stede, M. <emphasis role="ital">Representing and Querying Standoff
      XML</emphasis>. In: Rehm, G., Witt, A. and Lemnitzer, L. (eds.), Datenstrukturen für
      linguistische Ressourcen und ihre Anwendungen. Data Structures for Linguistic Resources and
      Applications. Proceedings of the Biennial GLDV Conference 2007, pages 337–346, Tübingen, 2007.
      Gunter Narr Verlag. </bibliomixed><bibliomixed xml:id="Durusau2002" xreflabel="Durusau and O'Donnell, 2002"> Durusau, P. and
      O'Donnell, M. B. <emphasis role="ital">Concurrent Markup for XML Documents</emphasis>. In:
      Proceedings of the XML Europe conference 2002.</bibliomixed><bibliomixed xml:id="Fellbaum1998" xreflabel="Fellbaum, 1998">Fellbaum, C. <emphasis role="ital">WordNet: An electronic lexical database</emphasis>. Cambridge, Mass.: MIT Press, 1998.</bibliomixed><bibliomixed xml:id="Gleim2007" xreflabel="Gleim et al., 2007"> Gleim, R., Mehler, A. and
      Eikmeyer, H.-J. <emphasis role="ital">Representing and Maintaining Large Corpora</emphasis>.
      In: Proceedings of the Corpus Linguistics 2007 Conference, Birmingham (UK), 2007.</bibliomixed><bibliomixed xml:id="Goecke2006" xreflabel="Goecke and Witt, 2006">Goecke, D. and Witt, A.
        <emphasis role="ital">Exploiting Logical Document Structure for Anaphora
      Resolution.</emphasis> In: Proceedings of the 5th International Conference on Language
      Resources and Evaluation (LREC 2006). Genoa, Italy, 2006. </bibliomixed><bibliomixed xml:id="Goecke2007" xreflabel="Goecke et al. (to appear)">Goecke, D., Stührenberg,
      M. and Wandmacher, T. <emphasis role="ital">Extraction and representation of semantic
        relations for resolving definite descriptions</emphasis>. To appear in LDV Forum -
      Zeitschrift für Computerlinguistik und Sprachtechnologie.</bibliomixed><bibliomixed xml:id="Buchkapitel" xreflabel="Goecke et al., 2008"> Goecke, D., Lüngen, H.,
      Metzing, D., Stührenberg, M. and Witt, A. <emphasis role="ital">Different Views on Markup.
        Distinguishing levels and layers</emphasis>. In: Linguistic modeling of information and
      Markup Languages. Contributions to language technology. Springer, 2008.</bibliomixed><bibliomixed xml:id="Grosso2003" xreflabel="Grosso et al., 2003"> Grosso, P., Maler, E., Marsh,
      J. and Walsh, N. <emphasis role="ital">XPointer Framework</emphasis>. W3C Recommendation,
      World Wide Web Consortium, March 2003. Online: <link xlink:href="http://www.w3.org/TR/2003/REC-xptr-framework-20030325/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2003/REC-xptr-framework-20030325/</link>. </bibliomixed><bibliomixed xml:id="Hamp1997" xreflabel="Hamp and Feldweg, 1997">Hamp, B. and Feldweg, H.
        <emphasis role="ital">GermaNet - a Lexical-Semantic Net for German</emphasis>. In:
      Proceedings of ACL workshop "Automatic Information Extraction and Building of Lexical
      Semantic Resources for NLP Applications", pages 9–15, New Brunswick, New Jersey,
      1997. Association for Computational Linguistics. </bibliomixed><bibliomixed xml:id="Hilbert2005" xreflabel="Hilbert, 2005"> Hilbert, M. <emphasis role="ital">MuLaX – ein Modell zur Verarbeitung mehrfach XML-strukturierter Daten</emphasis>. Diploma
      thesis, Bielefeld University, 2005. </bibliomixed><bibliomixed xml:id="Hilbert2005a" xreflabel="Hilbert et al., 2005"> Hilbert, M., Schonefeld, O.
      and Witt, A. <emphasis role="ital">Making CONCUR work</emphasis>. In: Proceedings of Extreme
      Markup Languages, 2005. </bibliomixed><bibliomixed xml:id="Holt2006" xreflabel="Holt et al., 2006"> Holt, R., Schürr, A., Elliott Sim,
      S. and Winter, A. <emphasis role="ital">GXL: A graph-based standard exchange format for
      reengineering</emphasis>. In: Science of Computer Programming, 60(2): 149-170, 2006. 
      doi:<biblioid class="doi">10.1016/j.scico.2005.10.003</biblioid>.
    </bibliomixed><bibliomixed xml:id="Huitfeldt2001" xreflabel="Huitfeldt and Sperberg-McQueen, 2001"> Huitfeldt,
      C. and Sperberg-McQueen, C.M. <emphasis role="ital">Texmecs: An experimental markup
        meta-language for complex documents</emphasis>. Markup Languages and Complex Documents
      (MLCD) Project, February 2001. </bibliomixed><bibliomixed xml:id="Ide2004" xreflabel="Ide and Romary, 2004"> Ide, N. and Romary, L. <emphasis role="ital">International Standard for a Linguistic Annotation Framework</emphasis>. Journal
      of Natural Language Engineering, 10(3-4): pages 211-225, 2004.
      doi:<biblioid class="doi">10.1017/S135132490400350X</biblioid>.</bibliomixed><bibliomixed xml:id="Ide2007a" xreflabel="Ide and Romary, 2007"> Ide, N. and Romary, L.
        <emphasis role="ital">Towards International Standards for Language Resources</emphasis>. In:
      Dybkjaer, L., Hemsen, H., and Minker, W., editors, Evaluation of Text and Speech Systems,
      pages 263–284. Springer, 2007. </bibliomixed><bibliomixed xml:id="Ide2007" xreflabel="Ide and Suderman, 2007"> Ide, N. and Suderman, K.
        <emphasis role="ital">GrAF: A Graph-based Format for Linguistic Annotations</emphasis>. In:
      Proceedings of the Linguistic Annotation Workshop, pages 1-8, Prague, Czech Republic.
      Association for Computational Linguistics, 2007.</bibliomixed><bibliomixed xml:id="Laprun2002" xreflabel="Laprun et al., 2002"> Laprun, C., Fiscus, J. G.,
      Garofolo, J. and Pajot, S. <emphasis role="ital">Recent improvements to the ATLAS
      architecture</emphasis>. In: Proceedings of HLT 2002, Second International Conference on Human
      Language Technology Research, 2002. </bibliomixed><bibliomixed xml:id="RELAX2003" xreflabel="ISO/IEC 19757-2:2003"> ISO/IEC 19757-2:2003.
        <emphasis role="ital">Information technology – Document Schema Definition Language (DSDL) –
        Part 2: Regular-grammar-based validation – RELAX NG (ISO/IEC 19757-2)</emphasis>.
      International Standard, International Organization for Standardization, Geneva, 2003. </bibliomixed><bibliomixed xml:id="Schematron" xreflabel="ISO/IEC 19757-3:2006"> ISO/IEC 19757-3:2006.
        <emphasis role="ital">Information technology – Document Schema Definition Language (DSDL) –
        Part 3: Rule-based validation – Schematron</emphasis>. International standard, International
      Organization for Standardization, Geneva, 2006. </bibliomixed><bibliomixed xml:id="Jagadish2004" xreflabel="Jagadish et al., 2004"> Jagadish, H. V.,
      Lakshmanan, L. V. S., Scannapieco, M., Srivastava, D. and Wiwatwattana, N. <emphasis role="ital">Colorful XML: One hierarchy isn’t enough</emphasis>. In: Proceedings of ACM
      SIGMOD International Conference on Management of Data (SIGMOD 2004), pages 251–262, Paris,
      June 13-18 2004. ACM Press New York, NY, USA. 
      doi:<biblioid class="doi">10.1145/1007568.1007598</biblioid>. </bibliomixed><bibliomixed xml:id="Kay2008" xreflabel="Kay 2008"> Kay, M. <emphasis role="ital">XSLT 2.0 and
        XPath 2.0 Programmer’s Reference</emphasis>. Wiley Publishing, Indianapolis, 4th edition,
      2008. </bibliomixed><bibliomixed xml:id="LeMaitre2006" xreflabel="Le Maitre, 2006"> Le Maitre, J. <emphasis role="ital">Describing multistructured XML documents by means of delay nodes</emphasis>. In:
      DocEng ’06: Proceedings of the 2006 ACM symposium on Document engineering, pages 155–164, New
      York, NY, USA, 2006. ACM Press. 
      doi:<biblioid class="doi">10.1145/1166160.1166200</biblioid>. </bibliomixed><bibliomixed xml:id="Mitkov2002" xreflabel="Mitkov, 2002">Mitkov, R. <emphasis role="ital">Anaphora resolution</emphasis>. London: Longman, 2002.</bibliomixed><bibliomixed xml:id="Poesio2008" xreflabel="Poesio and Kruschwitz 2008"> Poesio, M. and
      Kruschwitz, U. <emphasis role="ital">Anawiki: Creating anaphorically annotated resources
        through web cooperation</emphasis>. In: Proceedings of LREC 2008. </bibliomixed><bibliomixed xml:id="Polanyi1988" xreflabel="Polanyi, 1988"> Polanyi, L. <emphasis role="ital">A
        formal model of the structure of discourse</emphasis>. In: Journal of Pragmatics 12 (1988),
      pages 601-638. doi:<biblioid class="doi">10.1016/0378-2166(88)90050-1</biblioid>.</bibliomixed><bibliomixed xml:id="Robertson2002" xreflabel="Robertson, 2002"> Robertson, E. <emphasis role="ital">Combining Schematron with other XML Schema languages</emphasis>, June 2002.
      Online: <link xlink:href="http://www.topologi.com/public/Schtrn_XSD/Paper.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.topologi.com/public/Schtrn_XSD/Paper.html</link>. </bibliomixed><bibliomixed xml:id="Schonefeld2007" xreflabel="Schonefeld, 2007"> Schonefeld, O. <emphasis role="ital">XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of
        concurrent markup</emphasis>. In: Rehm, G., Witt, A., Lemnitzer, L. (eds.), Datenstrukturen
      für linguistische Ressourcen und ihre Anwendungen. Data Structures for Linguistic Resources
      and Applications. Proceedings of the Biennial GLDV Conference 2007, Tübingen, Germany, 2007.
      Gunter Narr Verlag. </bibliomixed><bibliomixed xml:id="Soon2001" xreflabel="Soon et al., 2001">Soon, W.M., Lim, D.C.Y. and Ng,
      H.T. (2001). <emphasis role="ital">A Machine Learning Approach to Coreference Resolution of
        Noun Phrases</emphasis>. In: Computational Linguistics 27 (2001), No. 4, pages 521-544. 
      doi:<biblioid class="doi">10.1162/089120101753342653</biblioid>.</bibliomixed><bibliomixed xml:id="Simons2001" xreflabel="Simons and Bird, 2003"> G. Simons and S. Bird.
        <emphasis role="ital">OLAC Metadata</emphasis>. OLAC: Open Language Archives Community,
      2003. Online: <link xlink:href="http://www.language-archives.org/OLAC/metadata.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.language-archives.org/OLAC/metadata.html</link>. </bibliomixed><bibliomixed xml:id="Sperberg-McQueen2000" xreflabel="Sperberg-McQueen et al., 2000">Sperberg-McQueen, C. M., Huitfeldt, C. and Renear, A. <emphasis role="ital">Meaning and
        Interpretation of markup</emphasis>. Markup Languages - Theory &amp; Practice, 2, pages
      215-234, 2000. doi:<biblioid class="doi">10.1162/109966200750363599</biblioid>.</bibliomixed><bibliomixed xml:id="Sperberg-McQueen2002" xreflabel="Sperberg-McQueen et al., 2002">Sperberg-McQueen, C. M., Dubin, D., Huitfeldt, C. and Renear, A. <emphasis role="ital">Drawing inferences on the basis of markup</emphasis>. In: Proceedings of Extreme Markup
      Languages, 2002. </bibliomixed><bibliomixed xml:id="Sperberg-McQueen2002a" xreflabel="Sperberg-McQueen and       Burnard, 2002">
      Sperberg-McQueen, C. M. and Burnard, L. (eds.). <emphasis role="ital">TEI P4: Guidelines
        for Electronic Text Encoding and Interchange</emphasis>. Published for the TEI Consortium by
      Humanities Computing Unit, University of Oxford, Oxford, Providence, Charlottesville, Bergen,
      2002. </bibliomixed><bibliomixed xml:id="Sperberg-McQueen2004" xreflabel="Sperberg-McQueen and       Huitfeldt, 2004">
      Sperberg-McQueen, C. M. and Huitfeldt, C. <emphasis role="ital">GODDAG: A Data Structure for
        Overlapping Hierarchies</emphasis>. In: King, P. and Munson, E. V. (eds.), Proceedings of
      the 5th International Workshop on the Principles of Digital Document Processing (PODDP 2000),
      volume 2023 of Lecture Notes in Computer Science, pages 139–160. Springer, 2004. </bibliomixed><bibliomixed xml:id="Strube2003" xreflabel="Strube and Müller, 2003">Strube, M. and Müller, C.
      (2003). <emphasis role="ital">A machine learning approach to pronoun resolution in spoken
        dialogue</emphasis>. In: ACL '03: Proceedings of the 41st Annual Meeting on Association for
      Computational Linguistics. Morristown, NJ, USA : Association for Computational Linguistics,
      2003, pages 168-175. 
      doi:<biblioid class="doi">10.3115/1075096.1075118</biblioid>.</bibliomixed><bibliomixed xml:id="Stührenberg2007" xreflabel="Stührenberg et al., 2007"> Stührenberg, M.,
      Goecke, D., Diewald, N., Cramer, I. and Mehler, A. <emphasis role="ital">Web-based annotation
        of anaphoric relations and lexical chains</emphasis>. In: Proceedings of the Linguistic
      Annotation Workshop (LAW), pages 140–147, Prague. Association for Computational Linguistics,
      2007</bibliomixed><bibliomixed xml:id="Tennison2002" xreflabel="Tennison, 2002"> Tennison, J. <emphasis role="ital">Layered Markup and Annotation Language (LMNL)</emphasis>. In: Proceedings of
      Extreme Markup Languages, Montréal, Québec, 2002. </bibliomixed><bibliomixed xml:id="Thompson1997" xreflabel="Thompson and McKelvie, 1997"> Thompson, H. S. and
      D. McKelvie. <emphasis role="ital">Hyperlink semantics for standoff markup of read-only
        documents</emphasis>. In: Proceedings of SGML Europe ’97: The next decade – Pushing the
      Envelope, pages 227–229, Barcelona, 1997. </bibliomixed><bibliomixed xml:id="Waltinger2008" xreflabel="Waltinger et al., 2008"> Waltinger, U., Mehler,
      A., and Stührenberg, M. <emphasis role="ital">An Integrated Model of Lexical Chaining:
        Application, Resources and its Format</emphasis>. Accepted for Proceedings of Konvens 2008.</bibliomixed><bibliomixed xml:id="Witt2002" xreflabel="Witt, 2002"> Witt, A. <emphasis role="ital">Meaning
        and interpretation of concurrent markup</emphasis>. In: Proceedings of ALLC-ACH2002, Joint
      Conference of the ALLC and ACH, 2002. </bibliomixed><bibliomixed xml:id="Witt2004" xreflabel="Witt, 2004"> Witt, A. <emphasis role="ital">Multiple
        hierarchies: New Aspects of an Old Solution</emphasis>. In: Proceedings of Extreme Markup
      Languages, 2004. </bibliomixed><bibliomixed xml:id="Witt2005" xreflabel="Witt et al., 2005">Witt, A., Goecke, D., Sasaki, F.,
      and Lüngen, H. <emphasis role="ital">Unification of XML Documents with Concurrent
      Markup</emphasis>. Literary and Linguistic Computing, 20(1): pages 103-116, 2005.
      doi:<biblioid class="doi">10.1093/llc/fqh046</biblioid>.</bibliomixed><bibliomixed xml:id="Witt2007" xreflabel="Witt et al., 2007"> Witt, A., Schonefeld, O., Rehm,
      G., Khoo, J. and Evang, K. <emphasis role="ital">On the lossless transformation of
        single-file, multi-layer annotations into multi-rooted trees</emphasis>. In: Proceedings of
      Extreme Markup Languages, Montréal, Québec, 2007. </bibliomixed><bibliomixed xml:id="XMLSchema2004" xreflabel="XML Schema Part 1, 2004">XML Schema Part 1:
      Structures Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
      Online: <link xlink:href="http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/</link>.</bibliomixed><bibliomixed xml:id="XMLSchema2004b" xreflabel="XML Schema Part 2, 2004">XML Schema Part 2:
      Datatypes Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
      Online: <link xlink:href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/</link>.</bibliomixed><bibliomixed xml:id="XMLSchema2008" xreflabel="XML Schema 1.1 Part 1, 2008"> W3C XML Schema
      Definition Language (XSD) 1.1 Part 1: Structures. W3C Working Draft, World Wide Web
      Consortium, 20 June 2008. Online: <link xlink:href="http://www.w3.org/TR/2008/WD-xmlschema11-1-20080620/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2008/WD-xmlschema11-1-20080620/</link>.</bibliomixed><bibliomixed xml:id="Yang2004" xreflabel="Yang et al., 2004">Yang, X., Su, J., Zhou, G. and Tan,
      C. L. (2004). <emphasis role="ital">Improving pronoun resolution by incorporating
        coreferential information of candidates.</emphasis> In: Proceedings of the 42nd Annual
      Meeting of the Association for Computational Linguistics (ACL04). Barcelona, Spain,
    2004. 
      doi:<biblioid class="doi">10.3115/1218955.1218972</biblioid>.</bibliomixed></bibliography></article>
