<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2"><title>Refining the Taxonomy of XML Schema Languages. A new Approach for Categorizing XML Schema
    Languages in Terms of Processing Complexity</title><info><confgroup><conftitle>Balisage: The Markup Conference 2010</conftitle><confdates>August 3 - 6, 2010</confdates></confgroup><abstract><para>This paper presents a refined taxonomy of XML schema languages based on the work by
          <xref linkend="Murata2005"/>. It can be seen as a first building block for a more elaborate
        formal analysis of XML and its accompanying specifications, in this case: XML schema
        languages such as DTD, XSD and RELAX NG.</para></abstract><author><personname><firstname>Maik</firstname><surname>Stührenberg</surname></personname><personblurb><para>Maik Stührenberg studied Computational Linguistics at Bielefeld University. After
          working for four years as research assistant at Giessen University in different
          text-technological projects, he is now a Ph. D. student and research assistant at
          Bielefeld University. His main research interests include XML schema languages and
          specifications for structuring and querying multi-dimensional annotated data.</para></personblurb></author><author><personname><firstname>Christian</firstname><surname>Wurm</surname></personname><personblurb><para>Christian Wurm is a Ph. D. student in Computational and Mathematical Linguistics at
          Bielefeld University in the Cognitive Interaction Technology – Center of Excellence
          (CITEC).</para></personblurb></author><legalnotice><para>Copyright © 2010 by the authors.  Used with
   permission.</para></legalnotice><keywordset role="author"><keyword>XML</keyword><keyword>Formal Language Theory</keyword><keyword>Schema Languages</keyword></keywordset></info><note><para>The authors would like to thank both the reviewers for their constructive comments and our
      colleagues Marcus Kracht and Jens Michaelis who provided additional insightful remarks.</para></note><section><title>Introduction</title><para>In this paper, we continue the fruitful research that has been performed on XML and
      formal language theory (see <xref linkend="sec.xml_formal"/>). As there is a close
      correspondence between schema languages and a hierarchy of tree languages discussed by <xref linkend="Murata2005"/>, and to which we will refer as the Murata hierarchy, formal language
      theory has been very useful to determine and describe the expressiveness and computability of
      different XML schema languages. From the point of view of formal language theory (ignoring
      things such as user-friendliness or software support, amongst others), the question which
      grammar formalism is most apt for defining a document grammar for a given XML markup language
      is determined by a trade-off between expressiveness on one side and processing complexity on
      the other side. The more expressive a grammar formalism is, the more resources we need for
      processing the corresponding languages, and the more likely they are to fall prey to
      ambiguity.</para><para>Expressiveness thus always comes at a cost; it is, however, not always clear at
      which cost. Making this cost explicit is one goal of this paper. We will look at some
      well-known classes of formal grammars/languages that are relevant for XML, and scrutinize
      their expressivity and processing cost. Pursuing this approach, we will see that there are
      some other interesting and relevant classes, which have not been formally established yet to
      the best of our knowledge, and which we will define in the sequel. These classes partly
      refine, partly complement the hierarchy of <xref linkend="Murata2005"/>, and, as we will see,
      are partly already tacitly in use. Our main focus will not be on (non-)determinism in content
      models, that is, on properties of regular expressions; rather we will focus on determinism in
      tree structures, and try to formally clarify the relation between determinism and
      expressivity, as well as between locally and globally ambiguous grammars/trees. On the way, we
      provide some new results for the problem of ambiguity not only of grammars, but also of
      languages: as we will see, there are languages for which there is no unambiguous grammar, and
      there is not yet a class of grammars that generates all and only the languages for which there is an
      unambiguous grammar.</para><para>In contrast to other comparative analyses of XML schema languages, such as <xref linkend="Lee2000"/> or <xref linkend="Ansari2009"/>, the main research goal of our paper is
      thus a more elaborate and fine-grained theoretical approach based on work that has been
      already undertaken (see <xref linkend="sec.xml_formal"/>). The application examples shown in
      this paper (especially the ones given in <xref linkend="sec.murata.locality"/>) are clearly
      for demonstration purposes and do not necessarily introduce new findings.</para></section><section><title>XML and Formal Language Theory</title><para>The history of document grammars begins long before XML's success: in <xref linkend="Goldfarb1978"/> the first published formal <emphasis role="ital">document type
        descriptions</emphasis> can be found, while in 1986 SGML Document Type Definitions (DTDs, <xref linkend="SGML"/>) were established, followed by XML's DTD (<xref linkend="XML10"/>) and
      other XML schema languages that were created during XML's ongoing success. This timeline
      begins even earlier, in 1955, when Noam Chomsky published his theory on formal grammar (<xref linkend="Chomsky1955"/>, <xref linkend="Chomsky1956"/>).</para><para>One of the several benefits of XML markup languages is the possibility to use a schema (a
      document grammar) to assure not only <emphasis role="ital">well-formedness</emphasis>, but
      also <emphasis role="ital">validity</emphasis> of an instance of a given markup language. The
      XML specification defines well-formedness as follows: <blockquote><para>A textual object is a well-formed XML document if: </para><orderedlist><listitem><para>Taken as a whole, it matches the production labeled document.</para></listitem><listitem><para>It meets all the well-formedness constraints given in this specification.</para></listitem><listitem><para>Each of the parsed entities which is referenced directly or indirectly within the
              document is well-formed.</para></listitem></orderedlist><attribution><xref linkend="XML"/>, Section 2.1 "Well-Formed XML Documents"</attribution></blockquote> A valid instance in addition declares conformance and actually conforms to the
      rules of a schema of a given markup language. Or, in a more general way: <blockquote><para>The intention or purpose of validation is to subject a document or data set to a test,
          to determine whether it conforms to a given set of external criteria.</para><attribution><xref linkend="Piez2001"/>, p. 144</attribution></blockquote> A validating parser takes an instance document (and the corresponding schema) as
      input and produces a validation report, <quote>which includes at least a return code reporting
        whether [t]he document is valid and an optional Post Schema Validation Infoset (PSVI),
        updating the original document's infoset (the information obtained from the XML document by
        the parser) with additional information (default values, datatypes, etc.)</quote> (<xref linkend="vanderVlist2001"/>). During this process different levels of validation may be
      checked, depending on the XML schema language used: validation of the instance's structure
      (i.e., the markup), datatyping (i.e., the content of individual leaf nodes), integrity (in
      terms of links, either between nodes within a document or between documents). In addition,
      other tests, usually called <emphasis role="ital">business rules</emphasis>, may apply as well
      (see <xref linkend="vanderVlist2001"/>). While one may differentiate between the terms
        <emphasis role="ital">valid</emphasis> (as defined by the XML specification and therefore
      only referring to validity according to a Document Type Definition) and <emphasis role="ital">schema-valid</emphasis> (i.e., valid according to one of the externally defined XML schema
      languages), we will use the former term throughout this paper to denote
      conformance to a given set of external criteria, regardless of the schema language
      used. Furthermore, we will only discuss validation mechanisms based on schema languages;
      therefore, technologies such as the <emphasis role="ital">Content Assembly
        Mechanism</emphasis> (CAM, see <xref linkend="Carey2009"/>) or meta-validation techniques
      such as <emphasis role="ital">Namespace-based Validation Dispatching Language</emphasis>
        (<xref linkend="NVDL"/>) are not considered any further. Since we will focus on <emphasis role="ital">grammar-based schema languages</emphasis>, <emphasis role="ital">rule-based</emphasis><footnote><para>For a short discussion of whether Schematron is a rules language (or rule-based language) see
            <xref linkend="Jeliffe2009"/>.
        </para></footnote> (or <emphasis role="ital">constraint-based</emphasis>) <emphasis role="ital">schema languages</emphasis> such as <xref linkend="Schematron"/>, the <emphasis role="ital">Constraint Language in XML</emphasis> (CLiX, <xref linkend="CLiX"/>) or <xref linkend="Moeller2005"/> will not be considered either.<footnote><para>For a different discussion on the topic of expressing constraints see <xref linkend="Bauman2008"/>.</para></footnote> For clarity, we follow the definitions of <xref linkend="Costello2008"/>:<blockquote><para>A grammar-based schema language specifies the structure and contents of elements and
          attributes in an XML instance document. For example, a grammar-based schema language can
          specify the presence and order of elements in an XML instance document, the number of
          occurrences of each element, and the contents and datatype of each element and attribute.
          A rule-based schema language specifies the relationships that must hold between the
          elements and attributes in an XML instance document. For example, a rule-based schema
          language can specify that the value of certain elements must conform to a rule or
          algorithm.</para><attribution><xref linkend="Costello2008"/></attribution></blockquote>
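          To make this distinction concrete, the following two fragments give a minimal sketch; the element names <code>order</code>, <code>item</code> and <code>total</code> and the attribute <code>price</code> are hypothetical and serve only as an illustration. A grammar-based description (here in RELAX NG) fixes which elements occur and in which order:
          <programlisting xml:space="preserve">&lt;!-- Hypothetical example: an order contains one or more items, then a total. --&gt;
&lt;element name="order" xmlns="http://relaxng.org/ns/structure/1.0"&gt;
  &lt;oneOrMore&gt;
    &lt;element name="item"&gt;
      &lt;attribute name="price"/&gt;
      &lt;empty/&gt;
    &lt;/element&gt;
  &lt;/oneOrMore&gt;
  &lt;element name="total"&gt;
    &lt;text/&gt;
  &lt;/element&gt;
&lt;/element&gt;</programlisting>
          A rule-based description (here in Schematron) can instead state a relationship between values, which lies outside the content models of such a grammar:
          <programlisting xml:space="preserve">&lt;!-- Hypothetical example: the total must equal the sum of the item prices. --&gt;
&lt;schema xmlns="http://purl.oclc.org/dsdl/schematron"&gt;
  &lt;pattern&gt;
    &lt;rule context="order"&gt;
      &lt;assert test="number(total) = sum(item/@price)"&gt;The total must equal the sum of the item prices.&lt;/assert&gt;
    &lt;/rule&gt;
  &lt;/pattern&gt;
&lt;/schema&gt;</programlisting>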
    </para><para>When someone starts developing a new XML-based markup language, sooner or later the
      question about a formalism to define the corresponding document grammar arises, since there is
      a variety of schema (definition) languages available. While a schema can be considered as a
      formal definition of a grammar of the XML-based markup language (e.g. as a set of rules or
      criteria), the schema language is <quote>a formal language for expressing schemas</quote>
        (<xref linkend="Moeller2006"/>). Usually, choosing a schema language depends on several
      factors such as familiarity with a given formalism or support provided by the chosen authoring
      software or processing tools such as <xref linkend="XSLT2"/> or <xref linkend="XQuery"/>.
      These factors are very specific to one's own needs and environment, and we will not give any
      advice regarding these topics. However, what we want to demonstrate in this paper are the
      differences in terms of expressiveness and computability between the three most widely used XML
      schema languages, starting with XML's inherent Document Type Definition (DTD, see <xref linkend="XML10"/>), based on SGML's DTD (see <xref linkend="SGML"/>, <xref linkend="Goldfarb1991"/>, <xref linkend="Maler1995"/>) where a non-XML syntax is used<footnote><para>We will not discuss any proposals for extended DTDs such as <xref linkend="Buck2000"/>, <xref linkend="Papakonstantinou2000"/>, <xref linkend="Vitali2003"/>
          <xref linkend="Balmin2004"/> or <xref linkend="Fiorello2004"/> since these play only minor
          roles in the wild, if any.</para></footnote>, over W3C's XML Schema Description Language (<xref linkend="XMLSchema2004"/>,
        <xref linkend="XMLSchema2004a"/>, <xref linkend="XMLSchema2004b"/>) and the formal language
      theory based RELAX NG (see <xref linkend="RELAX"/>, <xref linkend="vanderVlist2003"/>, <xref linkend="RELAX2nd"/>) as a successor to both RELAX (<xref linkend="RELAXCore"/>) and TREX
        (<xref linkend="Clark2001"/>).</para></section><section xml:id="sec.xml_formal"><title>Formal Language Theory and XML</title><para>Although the formal model of an XML instance is always a single rooted tree, the different
      schema languages that can be used to define and constrain instances can be differentiated
      according to their expressiveness and – in a further step – according to their
      computability, which may be interesting when dealing with a task such as programming a
      validating parser. Different authors have dealt with the relationship between XML applications
      and formal languages, for example <xref linkend="Brueggemann-Klein1992"/>, <xref linkend="Brueggemann-Klein1993"/>, <xref linkend="Brueggemann-Klein1997"/>, <xref linkend="Hopcroft2000"/>, <xref linkend="Rizzi2001"/>, <xref linkend="Mani2001"/>, <xref linkend="Murata2001"/>, <xref linkend="Brueggemann-Klein2002"/>, <xref linkend="Sperberg-McQueen2003"/>, <xref linkend="Klarlund2003"/>, <xref linkend="Brueggemann-Klein2004"/>, <xref linkend="Murata2005"/>, <xref linkend="Martens2005"/>, <xref linkend="Kilpeläinen2007"/>, <xref linkend="Comon2008"/>, <xref linkend="Martens2009"/>, and <xref linkend="Gelade2009"/>. Often, a formal specification of
      XML's inherent <code>ID</code>/<code>IDREF</code> mechanism is omitted; however, <xref linkend="Abiteboul2000"/>, p. 33 claim that these references can be used to describe graphs
      rather than trees, since they allow for multidominance structures (see <xref linkend="Stuehrenberg2009"/> for a practical implementation of using
        <code>ID</code>/<code>IDREF</code> for realizing graph structures within XML's tree model).
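        As a minimal sketch of this idea (the element and attribute names are hypothetical and not taken from the cited works), the following instance is a tree as far as element nesting is concerned, but the two <code>link</code> elements refer to the same <code>node</code>, so that <code>n1</code> is reached from two different elements once the references are read as edges:
        <programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!-- Hypothetical vocabulary; the internal subset declares the ID/IDREF types. --&gt;
&lt;!DOCTYPE doc [
  &lt;!ELEMENT doc (node+)&gt;
  &lt;!ELEMENT node (#PCDATA | link)*&gt;
  &lt;!ATTLIST node id ID #REQUIRED&gt;
  &lt;!ELEMENT link EMPTY&gt;
  &lt;!ATTLIST link target IDREF #REQUIRED&gt;
]&gt;
&lt;doc&gt;
  &lt;node id="n1"&gt;shared content&lt;/node&gt;
  &lt;!-- Both of the following nodes point to n1. --&gt;
  &lt;node id="n2"&gt;first referrer &lt;link target="n1"/&gt;&lt;/node&gt;
  &lt;node id="n3"&gt;second referrer &lt;link target="n1"/&gt;&lt;/node&gt;
&lt;/doc&gt;</programlisting>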
        <xref linkend="Kracht2010"/> uses modal logic to provide a semantics for XML-documents, and
      to characterize XML markup and search and retrieval mechanisms such as XPath. Other work
      leaves the formal model of XML and deals with graph structures that can be described by either
      XML or XML-like markup languages (see <xref linkend="Marcoux2008"/> for a graph
      characterization of TexMECS and other overlapping markup formalisms), but in this paper we
      will concentrate on schema languages that describe well-formed (that is, tree-like)
      XML-documents. </para><para>Typically, DTDs are characterized as <emphasis role="ital">extended context-free
        grammars</emphasis> (see <xref linkend="Hopcroft2000"/> and <xref linkend="Rizzi2001"/>),
      that is, on the right-hand-side of a production rule regular expressions are allowed. This
      means for the declaration of an element that its allowed content is described by a regular
      expression using other element names (i.e., referring to other or the very same globally
      declared elements) or reserved keywords such as <code>#PCDATA</code> or <code>EMPTY</code>. In
      current work, especially <xref linkend="Murata2005"/> and <xref linkend="Moeller2006"/>, a
      family of <emphasis role="ital">tree grammars</emphasis> is used to model XML schema
      languages; for example, DTDs are defined as <emphasis role="ital">local tree
        grammars</emphasis> which can be considered strongly equivalent to CFGs, with the only
      difference that they allow non-finitary branching: <blockquote><para>Ignoring the attributes for a moment, there is a simple but elegant connection between
          DTDs and context-free grammars, namely, each DTD corresponds to an <emphasis role="ital">extended context-free grammar</emphasis>, where productions may have regular
          expressions on their right-hand side. Then, an XML document is valid with respect to the
          DTD precisely when its associated tree is a correct derivation tree for that
          grammar.</para><attribution><xref linkend="Klarlund2003"/>, p. 13</attribution></blockquote></para><para>In addition to using local tree grammars for characterizing DTDs, <xref linkend="Murata2005"/> construct a taxonomy of XML schema languages. The authors introduce
        <emphasis role="ital">single-type tree grammars</emphasis>, as characterizing XML Schema,
      and use <emphasis role="ital">regular tree grammars</emphasis> to characterize RELAX NG.
      Although this work is quite extensive, the formal analysis can be further improved by
      clarifying some propositions. Given the (still growing) importance of XML and the broad range
      of tasks it is used for, stronger theoretical background seems to be the best way to find new
      applications. Before we present our results, we have to introduce the formal concepts we are
      working with.</para></section><section><title>Refining the Taxonomy of XML Schema Languages</title><section><title>Introduction to the Formal Concepts</title><para>In this paper, we will mainly use tree grammars. Since the use of tree grammars is well
        established in the XML community, we will only briefly provide the necessary definitions;
        for a more explicit treatment and motivation, we refer the reader to <xref linkend="Gecseg1997"/> or <xref linkend="Murata2005"/>.</para><para><!-- xml:id="definition1" --><emphasis role="bold">Definition 1 </emphasis>
        <emphasis role="ital">A regular tree grammar (RTG) is a 4-tuple (N,T,S,P), where N is a
          finite set of nonterminals, T is a finite set of terminals, S is a set of start symbols,
          which form a subset of N, P is a set of production rules, which have the form: A →
          a(r), where A ∈ N, a ∈ T, and r is a regular expression over elements of N. We
          call A the left hand side of a rule, a the terminal or label which is introduced by the
          rule, and r its content model.</emphasis></para><para> We generally use uppercase letters for nonterminals, and lower-case letters for
        terminals. Note that this grammar generates trees, not strings, and that the nonterminals do
        not remain the labels of the (non-leaf) nodes they introduce, but are substituted by the
        terminal labels. The class of languages generated by RTGs is called the regular tree
        languages (RTLs). This is the most general class of languages we will consider here; and we
        now introduce various restrictions on this class of grammars, as they are defined by <xref linkend="Murata2005"/>. </para><para><!-- xml:id="definition2" --><emphasis role="bold">Definition 2</emphasis>
        <emphasis role="ital">We call two rules of an RTG competing if they introduce the same
          terminal but have different left-hand sides. Thus, A → a(r) and B →
          a(r') are competing. </emphasis></para><para>In general, in an RTG we can merge any two rules which have the same left-hand side and
        introduce the same terminal, by merging their content models, because for any two regular
        expressions we can easily form a single expression which denotes the union of both
        (for example, A → a(r) and A → a(r') become A → a(r | r')).
        Therefore, we will generally assume that our grammars contain no two rules with the same left-hand
        side and the same terminal. As a consequence, the concept of competing rules is the
        crucial point when we deal with determinism and ambiguity. For the same reason, we can speak
        of competing nonterminals almost in the same way as of competing rules: competing
        nonterminals are the left-hand sides of competing rules; a grammar has competing
        nonterminals exactly if it has competing rules, and exactly as many of the former as of the latter.</para><para><!-- xml:id="definition3" --><emphasis role="bold">Definition 3</emphasis>
        <emphasis role="ital">A local tree grammar (LTG) is an RTG with no competing
          rules.</emphasis></para><para>In an LTG, we have thus a one-to-one correspondence of nonterminals and terminals, which
        makes them very similar to context-free grammars (though not identical, since regular
        content models allow for non-finitary branching).</para><para><!-- xml:id="definition4" --><emphasis role="bold">Definition 4</emphasis>
        <emphasis role="ital">A single type tree grammar (STG) is an RTG, where competing
          nonterminals must not occur in the same content model.</emphasis></para><para>As <xref linkend="Murata2005"/> point out, LTGs roughly correspond to DTDs, STGs
        correspond to XML Schema, and RTGs correspond to RELAX NG. Note that this correspondence is
        established by regarding only the "core" syntactic features of the schema languages; that is,
        XML's inherent reference mechanism is not taken into account. These are thus the most
        important grammar types for XML. <xref linkend="Murata2005"/> still add another type:</para><para><!-- xml:id="definition5" --><emphasis role="bold">Definition 5</emphasis>
        <emphasis role="ital">A restrained competition grammar (RCG) is an RTG, where competing
          nonterminals must not occur in the same content model and with the same prefix of
          nonterminals; we thus disallow rules with identical left-hand side, terminals, and content
          models of the form (Γ A Δ) and (Γ B Δ'), where A and B are competing
          nonterminals, and where uppercase Greek letters refer to possibly empty sequences of
          nonterminals.</emphasis></para><para>Notably, the restriction concerns only the left context of the competing nonterminals.
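        For illustration, consider a minimal sketch with hypothetical names: the rules Doc → doc(Para1 Para2*), Para1 → para(ε) and Para2 → para(#pcdata). Assuming that named RELAX NG definitions play the role of nonterminals, this grammar could be written as follows:
        <programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!-- Hypothetical example: Para1 and Para2 compete, since both introduce "para". --&gt;
&lt;grammar xmlns="http://relaxng.org/ns/structure/1.0"&gt;
  &lt;start&gt;
    &lt;element name="doc"&gt;
      &lt;!-- Content model: Para1 Para2* --&gt;
      &lt;ref name="Para1"/&gt;
      &lt;zeroOrMore&gt;
        &lt;ref name="Para2"/&gt;
      &lt;/zeroOrMore&gt;
    &lt;/element&gt;
  &lt;/start&gt;
  &lt;define name="Para1"&gt;
    &lt;element name="para"&gt;
      &lt;empty/&gt;
    &lt;/element&gt;
  &lt;/define&gt;
  &lt;define name="Para2"&gt;
    &lt;element name="para"&gt;
      &lt;text/&gt;
    &lt;/element&gt;
  &lt;/define&gt;
&lt;/grammar&gt;</programlisting>
        Here the competing nonterminals Para1 and Para2 occur in the same content model, so the grammar is not single-type. However, no sequence generated by this content model places them after the same prefix: Para1 only occurs after the empty prefix, and Para2 only after a nonempty one. Competition is therefore restrained, and the interpretation of each <code>para</code> node follows from its position among its siblings.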
        Of course, there exists a parallel definition for the right context. The problem is that
        both definitions lack some generalization, as they both generate different classes of
        languages, and there is no inclusion in either direction. If we, however, generalize the
        restriction of competition to both the left and the right context (which weakens the overall
        restriction on the grammar), some problems arise. We will discuss possible generalizations
        later on.</para><para>It is easily seen that there is a hierarchy of proper inclusion of the grammar types
        presented: LTGs are always STGs, which are always RCGs, which are always RTGs, whereas the
        converse does not hold.</para><para>We furthermore define an interpretation of a given tree against a given grammar as
        follows:</para><para><!-- xml:id="definition6" --><emphasis role="bold">Definition 6 </emphasis>
        <emphasis role="ital">An interpretation I of a
          tree t against a grammar G is a mapping from each node of t, denoted by e, to a
          nonterminal N of the grammar, such that</emphasis></para><itemizedlist><listitem><para><emphasis role="ital">I(e) is a start symbol when e is the root of
            t,</emphasis></para></listitem><listitem><para><emphasis role="ital">for each e and its daughter nodes e<subscript>0</subscript>,
                e<subscript>1</subscript>,...,e<subscript>n</subscript>, there is a production rule
              A → a(r) in G, such that</emphasis></para><itemizedlist><listitem><para><emphasis role="ital">I(e) is A,</emphasis></para></listitem><listitem><para><emphasis role="ital">the label of e is a,</emphasis></para></listitem><listitem><para><emphasis role="ital">I(e<subscript>0</subscript>),
                  I(e<subscript>1</subscript>),...,I(e<subscript>n</subscript>) matches
                  r.</emphasis></para></listitem></itemizedlist></listitem></itemizedlist><para>As is easily seen, with this definition, the ease of interpretation directly interacts
        with the Murata hierarchy. We will continue in this vein. To keep things clear, it is
        crucial to distinguish between the label of a node and its interpretation: the label of a
        node corresponds to its terminal in the production rule (recall that tree grammars directly
        generate trees, not strings via trees as CFGs do), and it is immediately visible in the tree.
        By the interpretation of a node in turn we denote the nonterminal by which the node label
        has been produced. This nonterminal is not visible and has to be inferred. In addition, we
        have to distinguish between rule and rule instantiation: since content models are regular
        expressions over nonterminals, they denote sets of sequences of nonterminals, and one member
        of this set is an instantiation. An additional problem arises in interaction with the fact
        that there is no necessary one-to-one correspondence of nonterminals and terminals (i.e.,
        labels); a possible consequence is that the same sequence of labels can be produced by
        different instantiations of a content model (we will exhaustively discuss this source of
        ambiguity later on).</para><para>So far, we have introduced the main concepts which are well-known in the literature, and
        which we are going to use and elaborate in this paper.</para></section><section><title>An Example Grammar</title><para>We will use the following example to demonstrate some of our findings. We want to define
        a document grammar for a text. The text may contain an optional title, followed by either at
        least a single section or a single paragraph. An optional author entity (possibly encoded
        using an attribute) may contain information about the text's author. Inside a section, an
        optional title followed by other (sub-)sections or paragraphs is allowed. The title
        consists of raw text while a paragraph may contain raw text or a reference to other
        paragraphs, since these may have an optional identifier (using XML's <code>ID</code>
        type).</para><para>If we try to express these constraints more formally we might end up with a grammar
        similar to the one shown in <xref linkend="fig.grammar"/>. Again, nonterminals are printed
        in capital letters, while node labels or terminals are printed in small letters. Note that
        in this formulation elements, attributes and raw text are defined as terminals.</para><figure xml:id="fig.grammar"><title>An example grammar</title><programlisting xml:space="preserve"><emphasis role="ital">S → text(author,Title? (Section|Para))
Section → section(Title? (Section|Para))
Title → title(#pcdata)
Para → para(id, #pcdata|Xref)
Xref → xref(href,ε) </emphasis></programlisting></figure><section><title>The Example Grammar Realized by a DTD (Local Tree Grammar)</title><para>Following <xref linkend="Murata2005"/>, DTDs can be classified as local tree grammars.
          A possible realization for our example grammar can be seen in <xref linkend="lst.dtd"/>.</para><figure xml:id="lst.dtd"><title>DTD realization of the example grammar</title><programlisting xml:space="preserve">&lt;!ELEMENT text (title?, (section | para)+)&gt;
&lt;!ATTLIST text author CDATA #IMPLIED&gt;
&lt;!ELEMENT title (#PCDATA)&gt;
&lt;!ELEMENT section (title?, (section | para)+)&gt;
&lt;!ELEMENT para (#PCDATA | xref)*&gt;
&lt;!ATTLIST para id ID #IMPLIED&gt;
&lt;!ELEMENT xref EMPTY&gt;
&lt;!ATTLIST xref href IDREF #REQUIRED&gt;</programlisting></figure><para>Since local tree grammars and DTDs only support globally declared elements (and
          locally declared attributes), the content models of the <code>text</code> and the
            <code>section</code> element share references to the same three elements
            (<code>title</code>, <code>section</code> and <code>para</code>) and contain both a
          sequence and a choice together with the occurrence indicators <code>+</code> (at least one
          occurrence) and <code>?</code> (optional). The content model of the <code>para</code>
          element contains mixed content, that is both raw text and the <code>xref</code> element
          are allowed as children. Since DTDs force the use of the choice group (<code>|</code>) and
          the trailing asterisk (<code>*</code>) as occurrence indicator, there is no other way to
          define this specific content model using this schema language.</para><para>Note that DTDs do not support any type mechanism. <xref linkend="Buck2000"/>, <xref linkend="Papakonstantinou2000"/>, <xref linkend="Balmin2004"/> and <xref linkend="Martens2006"/> propose extending DTDs with types, while DTD++,
          proposed by <xref linkend="Vitali2003"/>, adds namespace awareness on top, and DTD++ 2.0
          (see <xref linkend="Fiorello2004"/>) even supports co-constraints.</para></section><section><title>The Example Grammar Realized by an XSD (Single Type Tree Grammar)</title><para>An XML schema description (i.e., a single type tree grammar) of the same document
          grammar may look like the one in <xref linkend="lst.xsd"/>.</para><figure xml:id="lst.xsd"><title>XSD realization of the example grammar</title><programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"&gt;
  &lt;xs:element name="text"&gt;
    &lt;xs:complexType&gt;
      &lt;xs:complexContent&gt;
        &lt;xs:extension base="textType"&gt;
          &lt;xs:attribute name="author" type="xs:string" use="optional"/&gt;
        &lt;/xs:extension&gt;
      &lt;/xs:complexContent&gt;
    &lt;/xs:complexType&gt;
  &lt;/xs:element&gt;
  &lt;xs:element name="title" type="xs:string"/&gt;
  &lt;xs:element name="section" type="textType"/&gt;
  &lt;xs:element name="para"&gt;
    &lt;xs:complexType mixed="true"&gt;
      &lt;xs:sequence&gt;
        &lt;xs:element name="xref" minOccurs="0"&gt;
          &lt;xs:complexType&gt;
            &lt;xs:attribute name="href" type="xs:IDREF" use="required"/&gt;
          &lt;/xs:complexType&gt;
        &lt;/xs:element&gt;
      &lt;/xs:sequence&gt;
      &lt;xs:attribute ref="id" use="optional"/&gt;
    &lt;/xs:complexType&gt;
  &lt;/xs:element&gt;
  &lt;xs:attribute name="id" type="xs:ID"/&gt;
  &lt;xs:complexType name="textType"&gt;
    &lt;xs:sequence&gt;
      &lt;xs:element ref="title" minOccurs="0"/&gt;
      &lt;xs:group ref="sectOrPara" maxOccurs="unbounded"/&gt;
    &lt;/xs:sequence&gt;
  &lt;/xs:complexType&gt;
  &lt;xs:group name="sectOrPara"&gt;
    &lt;xs:choice&gt;
      &lt;xs:element ref="section"/&gt;
      &lt;xs:element ref="para"/&gt;
    &lt;/xs:choice&gt;
  &lt;/xs:group&gt;
&lt;/xs:schema&gt;</programlisting></figure><para>Note that this XML schema description is only one possible realization out of a
          variety of different XML schema descriptions that would fit our needs. Although it may
          not be very human-readable, it was designed to show some features that are supported by XSD.
          The <code>text</code> element is derived by extension of the globally declared complexType
            <code>textType</code> which itself refers to the globally declared model group
            <code>sectOrPara</code>. The schema contains both locally and globally declared
          attributes (<code>author</code> vs. <code>id</code>) and elements (<code>xref</code> as an
          example of a locally declared element). Apart from that, XSD supports XML namespaces (<xref linkend="XMLNS"/>), which are not shown in the example above. As <xref linkend="Martens2006"/> has already pointed out, the actual extra expressive power
          of XSDs over DTDs can only be used to a very limited extent due to the <emphasis role="ital">Element Declarations Consistent</emphasis> (EDC) constraint (see <xref linkend="XMLSchema2004a"/>, Section 3.8.6).</para></section><section><title>The Example Grammar Realized by a RELAX NG Grammar (Regular Tree Grammar)</title><para>RELAX NG can be classified as regular tree grammar according to <xref linkend="Murata2005"/>. A possible realization with RELAX NG is shown in <xref linkend="lst.relaxng"/>.</para><figure xml:id="lst.relaxng"><title>RELAX NG realization of the example grammar</title><programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;grammar xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"&gt;
  &lt;start&gt;
    &lt;element name="text"&gt;
      &lt;optional&gt;
        &lt;attribute name="author"/&gt;
      &lt;/optional&gt;
      &lt;choice&gt;
        &lt;optional&gt;
          &lt;group&gt;
            &lt;ref name="element.title"/&gt;
            &lt;ref name="element.section"/&gt;
          &lt;/group&gt;
        &lt;/optional&gt;
        &lt;optional&gt;
          &lt;group&gt;
            &lt;ref name="element.title"/&gt;
            &lt;ref name="element.para"/&gt;
          &lt;/group&gt;
        &lt;/optional&gt;
      &lt;/choice&gt;
    &lt;/element&gt;
  &lt;/start&gt;
  &lt;define name="element.title"&gt;
    &lt;element name="title"&gt;
      &lt;text/&gt;
    &lt;/element&gt;
  &lt;/define&gt;
  &lt;define name="element.section"&gt;
    &lt;element name="section"&gt;
      &lt;optional&gt;
        &lt;ref name="element.title"/&gt;
      &lt;/optional&gt;
      &lt;ref name="sectOrPara"/&gt;
    &lt;/element&gt;
  &lt;/define&gt;
  &lt;define name="element.para"&gt;
    &lt;element name="para"&gt;
      &lt;optional&gt;
        &lt;attribute name="id"&gt;
          &lt;data type="ID"/&gt;
        &lt;/attribute&gt;
      &lt;/optional&gt;
      &lt;zeroOrMore&gt;
        &lt;choice&gt;
          &lt;text/&gt;
          &lt;element name="xref"&gt;
            &lt;attribute name="href"&gt;
              &lt;data type="IDREF"/&gt;
            &lt;/attribute&gt;
            &lt;empty/&gt;
          &lt;/element&gt;
        &lt;/choice&gt;
      &lt;/zeroOrMore&gt;
    &lt;/element&gt;
  &lt;/define&gt;
  &lt;define name="sectOrPara"&gt;
    &lt;group&gt;
      &lt;oneOrMore&gt;
        &lt;choice&gt;
          &lt;ref name="element.section"/&gt;
          &lt;ref name="element.para"/&gt;
        &lt;/choice&gt;
      &lt;/oneOrMore&gt;
    &lt;/group&gt;
  &lt;/define&gt;
&lt;/grammar&gt;</programlisting></figure><para>In contrast to DTD and XSD, RELAX NG is based both on the mathematical theory of regular
          expressions and the concept of hedge grammars (<xref linkend="vanderVlist2003"/> and <xref linkend="Murata2005"/>). As an XML schema language, RELAX NG has some advantages over
          other schema languages: while in DTDs and XSD mixed content models may contain child
          elements and text nodes in any arbitrary order, RELAX NG allows for ordering of the
          element child nodes (see <xref linkend="vanderVlist2003"/>, p. 57f.). Co-occurrence
          constraints can be used to specify the content model of an item according to the value of
          another item, allowing non-deterministic content models which cannot be realized in DTD or
          XSD (see <xref linkend="vanderVlist2003"/>, p. 62f, and <xref linkend="sec.determinism"/>
          for a discussion). In general, a co-occurrence constraint (or co-constraint, as it is
          called by <xref linkend="Pawson2007"/>) may <quote>be a constraint over multiple items,
            not just two items</quote> and <quote>may exist between XML structure components
            (elements, attributes) as well as between data values</quote>. One may differentiate
          between element-to-element, element-to-attribute, and attribute-to-attribute co-occurrence
          constraints, based on the items involved.</para><para>In addition, SGML's interleave operator <code>&amp;</code> (see <xref linkend="Goldfarb1991"/>, p. 291), which is missing in XML DTD and XSD, can be used in
          RELAX NG as well, although this adds nothing to its expressive power. In contrast to the
          two other schema languages discussed in this paper, RELAX NG does not support default
          values (which are supported for attributes in DTD and for attributes and elements in XSD).
          While both DTD and XSD support XML references via
            <code>ID</code>/<code>IDREF</code>(<code>S</code>) attribute types, RELAX NG's built-in
          datatype library provides only the <code>string</code> and <code>token</code> types; however, as seen in the example grammar, it is possible to
          include the datatype library of <xref linkend="XMLSchema2004b"/>.</para><para>The document instance given in <xref linkend="lst.instance"/> would be valid according
          to all of the document grammars defined above (the example shows validation against the
          XML schema; adding a document type declaration and removing the
            <code>noNamespaceSchemaLocation</code> attribute of the root element would result in a
          valid instance according to the DTD). Note that RELAX NG does not define a standard way to
          associate a RELAX NG schema with an XML instance since it was designed as part of the ISO
          DSDL framework (in this framework, the <xref linkend="NVDL"/> should be used as a general
          external mechanism for validating instances).</para><figure xml:id="lst.instance"><title>Valid XML instance</title><programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;text xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="text.xsd" author="maik"&gt;
  &lt;title&gt;A simple title&lt;/title&gt;
  &lt;section&gt;
    &lt;title&gt;A section title&lt;/title&gt;
    &lt;para id="p1"&gt;Introductory para&lt;/para&gt;
    &lt;section&gt;
      &lt;title&gt;A subsection title&lt;/title&gt;
      &lt;para&gt;Some text with a reference: &lt;xref href="p1"/&gt;.&lt;/para&gt;
    &lt;/section&gt;
  &lt;/section&gt;
&lt;/text&gt;</programlisting></figure></section></section></section><section><title>Determinism and Local Ambiguity</title><para>Determinism is an important property for XML documents, schema languages and
      interpretation. If a grammar is deterministic, parsing will be much more efficient, since in
      general we do not have to keep in mind any information, do not have to backtrack, do not need
      any non-local information in case we search, etc.<footnote><para><xref linkend="Murata2005"/> present an algorithm using tree-automata that allows for
          efficient parsing of regular tree grammars.</para></footnote> The concept of determinism is closely related to the concept of local ambiguity:
      if there is no local ambiguity, then at every point in the parsing process we know the
      structure (in this case: interpretation) of what we have seen so far, because there is
      only one possible local analysis. If there is some local ambiguity, non-determinism arises: we
      cannot assign a unique interpretation locally, for we would need information which is not
      available yet, and we need to apply some heuristics.</para><section xml:id="sec.determinism"><title>Determinism, Algorithms and Local Ambiguities</title><para>In this section we will review the concept of determinism, as opposed to local ambiguity
        of a grammar. As an introductory point, we show that determinism depends not only on the
        grammar we use, but also on the algorithm. In regular tree languages, there can be no
        one-dimensional concept of determinism, as there is for regular string languages. Note that
        this is more than a metaphor, since we can conceive of regular string languages as
        one-dimensional structures; in order to talk about trees or the strings formed by their
        leaves, we need at least two dimensions (see <xref linkend="Rogers2003"/>). In the
        remainder, we will show how different grammar types provide determinism for all, some
        particular class of, or no algorithms.</para></section><section><title>Deterministic Content Models</title><para>Firstly, we consider the concept of deterministic content models. This draws on the
        notion of deterministic (or 1-unambiguous) regular expressions (DREs), thoroughly surveyed
        by <xref linkend="Brueggemann-Klein1997"/><footnote><para>See <xref linkend="Mani2001"/> for a discussion of the pros and cons of
            1-unambiguous content models in XML schema languages.</para></footnote>. Assume we have a string <emphasis role="ital">s</emphasis> which is denoted by
        some regular expression <emphasis role="ital">E</emphasis>. Assume furthermore we build our
        expressions with letters from an alphabet <emphasis role="ital">Π</emphasis>, which is
        identical to the alphabet <emphasis role="ital">Σ</emphasis>, except for the fact that
        we have additional indices for the letters (taken here from the set of natural numbers). If
        we build an expression from <emphasis role="ital">Π</emphasis>, we have to make sure
        that every index occurs at most once in it. Constructors for regular expressions are as
        usual, and indices are passed on to the letters of the strings the expression denotes. We
        say that a letter in <emphasis role="ital">s</emphasis> instantiates a letter in <emphasis role="ital">E</emphasis>, if the following holds:</para><para>
        <emphasis role="bold">Definition 7</emphasis>
        <emphasis role="ital">A letter a<subscript>i</subscript> in s instantiates a letter
            a<subscript>j</subscript> from E, if i = j.</emphasis></para><para>The index ensures that for every string an expression denotes, there is a unique
        mapping from letters in <emphasis role="ital">s</emphasis> to letters in
          <emphasis role="ital">E</emphasis>. We now define a mapping <emphasis role="ital">♮</emphasis> on strings over <emphasis role="ital">Π</emphasis> to strings
        over <emphasis role="ital">Σ</emphasis>, such that <emphasis role="ital">
            ♮(x<subscript>i</subscript>):= x,
        ♮(xs):=♮(x)♮(s)</emphasis>, where <emphasis role="ital"> s</emphasis>
        is a string and <emphasis role="ital">x</emphasis> a letter (the first of the string). This
        homomorphism simply deletes the indices and leaves everything else untouched. Note that for a
        single string <emphasis role="ital">♮s</emphasis>, and a given expression <emphasis role="ital">E</emphasis>, there might be several <emphasis role="ital">s ∈
          L(E)</emphasis>, such that their mapping under <emphasis role="ital">♮</emphasis> is
        identical. In this case we say that a letter in <emphasis role="ital">s</emphasis> might
        instantiate several letters in <emphasis role="ital">E</emphasis>. A deterministic regular
        expression over <emphasis role="ital">Π</emphasis> is then defined as follows:</para><para>
        <emphasis role="bold">Definition 8</emphasis>
        <emphasis role="ital">E is deterministic or one-unambiguous, if for all strings u, v, w over
          Π, and all letters x, y in Π, the following holds: if uxv and uyw ∈ L(E),
          and x ≠ y, then also ♮ x ≠ ♮ y</emphasis>.</para><para> This means that we can skip the indices, and we still know which letter in <emphasis role="ital">E</emphasis> is instantiated by any letter in <emphasis role="ital">s</emphasis>, simply from knowing its left context. Formally, this means that the mapping
        ♮ is reversible. A regular language is deterministic if it is denoted by a
        deterministic regular expression. A simple example of a non-deterministic expression is the
        expression <emphasis role="ital">a*b*a*</emphasis>, as we can easily check for the string
          <emphasis role="ital">a</emphasis>, which might be the instantiation of either of the two
          <emphasis role="ital">a</emphasis>s.</para><para>As a consequence for processing, quite informally, we can state that reading <emphasis role="ital">s</emphasis> ∈ <emphasis role="ital">L(E)</emphasis> from the left to
        the right, where <emphasis role="ital">E</emphasis> is a DRE, at any point in <emphasis role="ital">s</emphasis>, we know at which point in <emphasis role="ital">E </emphasis> we
        find ourselves. In automata theory, DREs correspond to deterministic Glushkov
        automata.</para><para>There is however a problem if we apply this concept to content models in regular tree
        grammars: in regular expressions, we can see from a letter in the string which type of
        letter in the expression it instantiates (thus, we have a unique letter in <emphasis role="ital">Σ</emphasis>, though not in <emphasis role="ital">Π</emphasis>). We
        can thus deduce from a letter in the string a letter in the expression, though not a letter
        instantiation, if the expression is not deterministic. We cannot, however, deduce from a
        given tree node label a unique type of nonterminal: if we have competing rules, different
        nonterminals introduce identical labels, and we still have to keep them apart. Thus, even if the
        content models of the nonterminals are themselves deterministic, this is of little use if we cannot
        infer from a given label the unique nonterminal it belongs to.</para><para>By way of example, consider the following grammar rule:</para><para><emphasis role="ital">A → a(ABC|CBA)</emphasis></para><para> Its content model is surely deterministic. However, if <emphasis role="ital">A</emphasis> and <emphasis role="ital">C</emphasis> are competing rules (have identical
        labels), this is of little use. We have to check each subtree until we have its unique
        interpretation. This means, in the worst case, we have to check both subtrees (as we will
        see, this is the case in which the trees generated by <emphasis role="ital">A (C)</emphasis>
        form a subset of the trees generated by <emphasis role="ital">C (A)</emphasis>).</para><para>The obvious reason why this concept of determinism falls short is that it
        originates in one-dimensional strings. As our trees are two-dimensional, we can define
        determinism only with respect to directions in which our analysis proceeds. The main
        difference is, of course, the one between top-down and bottom-up processing. In this paper
        we will not consider the difference between depth-first and breadth-first parsing, though
        this is surely worthwhile.</para><para>In the next subsection, we will reformulate and complete the Murata hierarchy in a way
        that makes clear which kind of determinism is facilitated by which kind of grammar. In the
        sequel, we will disregard the one-dimensional problem of non-deterministic content models,
        since they have been thoroughly analyzed, and we have nothing to add (see <xref linkend="Brueggemann-Klein1997"/>). At this point, our interest is the second dimension:
        importantly, this means that when talking about determinism, we always implicitly add: provided
        that content models are deterministic in the above sense. We thus exclude all problems which
        may arise from regular expressions.</para></section><section xml:id="sec.murata.locality"><title>The Murata Hierarchy as Hierarchy of Locality Conditions</title><para>As our main concern will be the formal properties of the grammar types, as they affect
        processing, we will firstly reformulate the hierarchy. This reformulation aims at making
        clear which information we need in order to uniquely interpret a local node or
        subtree.</para><orderedlist numeration="arabic"><listitem><para>In a <emphasis role="bold">local tree grammar</emphasis>, for any node <emphasis role="ital">a</emphasis> in any context, we know its unique interpretation. This is
            obvious, since for any node label, there is only one single rule which generates it, by
            the very definition of a local tree grammar. As a consequence, parsing is deterministic
            for any algorithm (provided we have deterministic content models), and the problem of
            giving a certain node of a given tree its interpretation is solvable in a constant
            number of steps.</para></listitem><listitem><para>In a <emphasis role="bold">single type tree grammar</emphasis>, for any node label
              <emphasis role="ital">a</emphasis> in any context, we can determine its unique
            interpretation if we know the interpretation of its mother node. This follows directly
            from the definition: if we know the interpretation of a node's mother node (rules
            correspond to interpretations), we know its content model, and within a content model
            no competing nonterminals may occur.</para><para>Note, however, that it is not sufficient to know the mother node's label. We can
            easily construct an example to show this: we have two competing rules, <emphasis role="ital">A → a(C)</emphasis> and <emphasis role="ital">B →
              a(D)</emphasis>, whose nonterminals do not occur in the same content model of any
            rule. </para><para>Furthermore, we have the two rules <emphasis role="ital">C → b(r)</emphasis>
            and <emphasis role="ital">D → b(r')</emphasis>. Then both nodes, as introduced by
              <emphasis role="ital">C</emphasis> and <emphasis role="ital">D</emphasis>, have label
              <emphasis role="ital">b</emphasis>, and their mother nodes both have the label <emphasis role="ital">a</emphasis>, although the two nodes have different interpretations.</para><para>For processing, this has an immediate consequence: a top-down parser will at any
            point immediately know the interpretation of any node, whereas if we start
            interpretation from the bottom, in the worst case we will have to go up to the root in
            order to get the correct interpretation. The matter of providing the interpretation of a
            given node is nonetheless a linear search problem, since for a given tree and a given
            node, it is sufficient to follow the path from the root to that node.</para></listitem><listitem><para> In a <emphasis role="bold">restrained competition grammar</emphasis> (RCG), in
            order to give a node its unique interpretation, we must have the interpretation of its
            mother node, and check its left siblings, in case it has any. Note that from how we
            defined RCGs, it follows that we only need the label of the siblings, not their
            interpretation: because any two competing nonterminals within a single content model are
            distinguished by a unique prefix, this prefix itself must not consist of competing
            nonterminals only, and neither must it be empty. This keeps the grammar unambiguous, and
            easy to process. However, it imposes some restrictions we do not necessarily want:
            the unique interpretation of a label might have to depend not on a left sibling, but on a
            right sibling. For example, in RCGs we cannot have competing nonterminals as leftmost
            symbols in a content model, but we can have them as rightmost symbols, given some left context. This causes an
            asymmetry which seems quite arbitrary. Of course we can equally define the mirror-image
            counterpart of RCG, checking for unique suffixes instead of prefixes; but care must be
            taken: since we have to fix the type for the class of languages (i.e., document
            grammars) we define, we have the same problem. If we generalize the concept to both
            unique suffix and prefix, some problems arise, which can however be remedied. </para></listitem><listitem><para>We now define a <emphasis role="bold">generalized restrained competition
              grammar</emphasis> (GRCG), as follows: </para><para>In a GRCG, for any two competing nonterminals <emphasis role="ital">A</emphasis> and
              <emphasis role="ital">B</emphasis> within a single content model <emphasis role="ital">r</emphasis>, one of (<emphasis role="ital">Γ A Δ</emphasis>) and
              (<emphasis role="ital">Γ B Δ</emphasis>) fails to match <emphasis role="ital">r</emphasis> (Greek variables range over possibly empty sequences of
            nonterminals).</para><para>We now have generalized the restriction from the left (right, respectively) to the
            entire context. Note that we have relaxed the overall restriction on the grammar, by
            making the restriction on content models more specific (indeed, this type properly
            includes the RCGs). </para><para>This small relaxation, however, causes a vast increase in processing complexity:
            in order to assign a unique interpretation to any node <emphasis role="ital">a</emphasis> in any context, in the worst case one now needs to know the
            interpretation of its mother, the interpretation of its siblings, and the interpretation
            of its subtrees as well. And even then, GRCGs might still be ambiguous, allowing more
            than one interpretation for one and the same tree. </para><para>This needs some explanation. The first point is easy to see: that we need to know
            the interpretation of the mother node follows a fortiori from the preceding argument
            (single type grammars are properly included in restrained competition grammars). But
            this is insufficient, since competing nonterminals may occur within the same content
            model. We have to match all the sister labels to the nonterminals of the content model
            of the mother's interpretation in order to get a unique interpretation (according to the
            definition). </para><para>Note, however, that we need the interpretation of the sister nodes; it is not
            sufficient to have their labels. This we can easily verify with the following grammar
            rule: <emphasis role="ital">A → a(BC|B'C')</emphasis>, where <emphasis role="ital">B</emphasis> and <emphasis role="ital">B'</emphasis> as well as <emphasis role="ital">C</emphasis> and <emphasis role="ital">C'</emphasis> are competing rules introducing
            the labels <emphasis role="ital">b</emphasis> and <emphasis role="ital">c</emphasis>,
            respectively.</para><para>This satisfies all conditions on GRCGs. In order to get the correct interpretation
            for the labels <emphasis role="ital">b</emphasis> and <emphasis role="ital">c</emphasis>, it is not sufficient to know the sister node's label; we also need its
            interpretation. </para><para>Things can get even worse if we consider the case where the interpretation of
            a sister node depends on the interpretation of the node under consideration <emphasis role="ital">itself</emphasis>. Consider the following example rule: <emphasis role="ital">A → a(BC|CB)</emphasis>, where <emphasis role="ital">B</emphasis>
            and <emphasis role="ital">C</emphasis> are competing rules.</para><para>Obviously, they are both uniquely determined by their neighbor within the content
            model of <emphasis role="ital">A</emphasis> (<emphasis role="ital">C</emphasis> may only
            occur with <emphasis role="ital">B</emphasis> to its left or its right, and vice versa).
            However, as they carry the same label, checking the label of the sister (which is
            identical) is insufficient to determine either of them. Furthermore, we might
            have the case where it is impossible to interpret one of the subtrees without its
            sister's interpretation (e.g., if one of the competing nonterminals generates a subset of
            the trees generated by the other).</para><para>We will consider this case more thoroughly in the next section, showing that there
            are globally ambiguous GRCGs, and that for every language that can be generated by an
            RTG, there is also a GRCG that generates the same language. </para><para>From the point of view of processing, we see that in GRCGs, expressive power comes
            at a high cost: neither a bottom-up nor a top-down parser is capable of assigning a
            unique interpretation locally, and maybe not even globally. The problem of assigning a
            given node its unique interpretation might thus be an exponential search problem, and in
            the worst case it may have no unique solution at all. We will therefore introduce a subtype of GRCG, which
            we will call unambiguous RCG. This type is properly included in the class of
            GRCGs, and properly includes the class of STGs as well as that of RCGs, as can be seen
            easily.</para></listitem><listitem><para>We now introduce <emphasis role="bold">unambiguous restrained competition
              grammars</emphasis> (URCGs). What we want to eliminate is ambiguity, which can be
            caused by the fact that in a GRCG, we might have entire competing contexts, or the
            interpretation of the context of a label in a content-model might depend on the
            interpretation of the label itself. In the resulting grammar, it should be possible to
            obtain the unique interpretation of a node from the interpretation of its mother and the
            labels (not interpretations) of its sisters.</para><para>We characterize the grammar type in the following terms: We introduce an alphabet of
            meta-variables <emphasis role="ital">O</emphasis>, which we use in the following way: </para><para>We form a set of all sets of nonterminals from <emphasis role="ital">N</emphasis>
            which compete with each other; we call these sets the <emphasis role="ital">competition
              sets</emphasis> (which are possibly singletons). We are interested in the sets where
            every nonterminal occurs in exactly one competition set, such that the whole set forms a
            partition of <emphasis role="ital">N</emphasis>. For every such partition we iterate the
            following procedure: To every competition set, we assign a single symbol from <emphasis role="ital">O</emphasis>. We call this an <emphasis role="ital">O-assignment</emphasis>. Then, for all content models, we check for all nonterminals,
            whether the content models still satisfy the GRCG condition, if we replace all other
            nonterminals by the symbols from <emphasis role="ital">O</emphasis> they are assigned
            to. In case there is more than one overall assignment, that is, a single nonterminal
            belongs to more than one competition set, we have to iterate this for every possible
            assignment. If for every assignment, nonterminal and content model, the resulting
            grammar is a GRCG, then the original grammar is a URCG.</para><para>Note that the assignments are only introduced for this evaluation procedure. We will
            call contexts which are identical under the <emphasis role="ital">O-assignment</emphasis>
            <emphasis role="bold">similar</emphasis>. We define accordingly a URCG as a grammar
            where competing nonterminals must not occur in the same content model and in similar
            contexts (this obviously subsumes identical contexts). It is easy to see that now we
            have made sure that competing nonterminals must not occur within the same contexts of
            labels (as opposed to nonterminals).</para><para>Thus, a URCG is a GRCG where for every node we find its unique interpretation if we
            have the interpretation of the mother node and the labels of its sister nodes. In
            particular, we can interpret any node without having to resort to its sister node's
            interpretation: the content model <emphasis role="ital">(BC|CB)</emphasis>, where
              <emphasis role="ital">B</emphasis> and <emphasis role="ital">C</emphasis> are
            competing nonterminals, does not satisfy the condition, because both <emphasis role="ital">B</emphasis> and <emphasis role="ital">C</emphasis> occur in the context
              <emphasis role="ital">__X</emphasis> or <emphasis role="ital">X__</emphasis>,
            respectively, where <emphasis role="ital">X</emphasis> is the <emphasis role="ital">O-assignment</emphasis> for both. </para><para>This kind of grammar is useful for the following reason: there is no global
            ambiguity in it (as we have eliminated the only source of ambiguity, namely that interpretations
            of labels may depend on each other), and it is the most expressive of the unambiguous
            grammar types we have considered. However, it is not capable of generating every
            language which is not inherently ambiguous, as we will show later. Note that in order
            to provide the unique interpretation of a node, we still might have to check all of its
            sister nodes, but it is sufficient to check the labels.</para><para>It is easily seen that URCGs properly include RCGs, as both left and right context
            can count as distinctive (we will make this more precise later on). As we will also see
            further down, there are languages which can be generated by GRCGs, but not by URCGs.
            This will follow from the fact that actually every RTL can be generated by a GRCG, but
            there are languages for which there are no unambiguous grammars, and, obviously, URCGs
            are always unambiguous.</para><para>Interestingly, the search problem for URCGs is still linear, since we only need to
            go down the path from the root to a given node, and in addition check finitely many
            sister labels (note that while regular expressions allow for arbitrary branching, the
            branching of a given tree is, of course, always finite).</para><para>An example of a URCG could be the use of attribute-based co-occurrence constraints
            or attribute-element constraints in the following RELAX NG declaration. We extend our
            example grammar by adding a <code>type</code> attribute to the <code>section</code>
            element. If the type is set to the value "global", other <code>section</code>
            child elements are allowed as part of the content model; if its value is set to
            "sub" only <code>para</code> child elements are allowed (see <xref linkend="lst.grammar.rng.extd"/>). <figure xml:id="lst.grammar.rng.extd"><title>Extended RELAX NG grammar</title><programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;grammar xmlns="http://relaxng.org/ns/structure/1.0"
  xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"&gt;
  &lt;start&gt;
    &lt;element name="text"&gt;
      &lt;optional&gt;
        &lt;attribute name="author"/&gt;
      &lt;/optional&gt;
      &lt;optional&gt;
        &lt;ref name="element.title"/&gt;
      &lt;/optional&gt;
      &lt;oneOrMore&gt;
        &lt;choice&gt;
          &lt;ref name="element.section"/&gt;
          &lt;ref name="element.para"/&gt;
        &lt;/choice&gt;
      &lt;/oneOrMore&gt;
    &lt;/element&gt;
  &lt;/start&gt;
  &lt;define name="element.section"&gt;
    &lt;choice&gt;
      &lt;oneOrMore&gt;
        &lt;element name="section"&gt;
          &lt;optional&gt;
            &lt;ref name="element.title"/&gt;
          &lt;/optional&gt;
          &lt;optional&gt;
            &lt;attribute name="type"&gt;
              &lt;value&gt;global&lt;/value&gt;
            &lt;/attribute&gt;
          &lt;/optional&gt;
          &lt;oneOrMore&gt;
            &lt;choice&gt;
              &lt;ref name="element.section"/&gt;
              &lt;ref name="element.para"/&gt;
            &lt;/choice&gt;
          &lt;/oneOrMore&gt;
        &lt;/element&gt;
      &lt;/oneOrMore&gt;
      &lt;oneOrMore&gt;
        &lt;element name="section"&gt;
          &lt;optional&gt;
            &lt;ref name="element.title"/&gt;
          &lt;/optional&gt;
          &lt;optional&gt;
            &lt;attribute name="type"&gt;
              &lt;value&gt;sub&lt;/value&gt;
            &lt;/attribute&gt;
          &lt;/optional&gt;
          &lt;ref name="element.para"/&gt;
        &lt;/element&gt;
      &lt;/oneOrMore&gt;
    &lt;/choice&gt;
  &lt;/define&gt;
  &lt;define name="element.title"&gt;
    &lt;element name="title"&gt;
      &lt;text/&gt;
    &lt;/element&gt;
  &lt;/define&gt;
  &lt;define name="element.para"&gt;
    &lt;element name="para"&gt;
      &lt;interleave&gt;
        &lt;optional&gt;
          &lt;attribute name="id"
            datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"&gt;
            &lt;data type="ID"/&gt;
          &lt;/attribute&gt;
        &lt;/optional&gt;
        &lt;optional&gt;
          &lt;element name="xref"&gt;
            &lt;attribute name="href"
              datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"&gt;
              &lt;data type="IDREF"/&gt;
            &lt;/attribute&gt;
          &lt;/element&gt;
        &lt;/optional&gt;
        &lt;text/&gt;
      &lt;/interleave&gt;
    &lt;/element&gt;
  &lt;/define&gt;
&lt;/grammar&gt;
</programlisting></figure>
          </para><para>The interesting part is the definition of the <code>element.section</code> pattern.
            It allows for two different elements named "section" with different content
            models according to the value of the optional <code>type</code> attribute. The result is
            that the instance shown in <xref linkend="lst.instance.rng.ex1"/> is valid according to
            this RELAX NG grammar while the one shown in <xref linkend="lst.instance.rng.ex2"/> is
            not.</para><figure xml:id="lst.instance.rng.ex1"><title>Valid XML instance according to the extended RELAX NG grammar</title><programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;text author="ms"&gt;
  &lt;title&gt;A simple title&lt;/title&gt;
  &lt;section type="global"&gt;
    &lt;title&gt;A section title&lt;/title&gt;
    &lt;para id="p1"&gt;Introductory para&lt;/para&gt;
    &lt;section type="sub"&gt;
      &lt;title&gt;A subsection title&lt;/title&gt;
      &lt;para&gt;Some text with a reference: &lt;xref href="p1"/&gt;.&lt;/para&gt;
    &lt;/section&gt;
  &lt;/section&gt;
&lt;/text&gt;
</programlisting></figure><note><para>Without any <code>type</code> attribute the instance shown in <xref linkend="lst.instance.rng.ex1"/> would still be valid.</para></note><figure xml:id="lst.instance.rng.ex2"><title>Invalid XML instance according to the extended RELAX NG grammar</title><programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;text author="ms"&gt;
  &lt;title&gt;A simple title&lt;/title&gt;
  &lt;section type="sub"&gt;
    &lt;title&gt;A section title&lt;/title&gt;
    &lt;para id="p1"&gt;Introductory para&lt;/para&gt;
    &lt;section type="sub"&gt;
      &lt;title&gt;A subsection title&lt;/title&gt;
      &lt;para&gt;Some text with a reference: &lt;xref href="p1"/&gt;.&lt;/para&gt;
    &lt;/section&gt;
  &lt;/section&gt;
&lt;/text&gt;</programlisting></figure><para>Other well deployed RELAX NG examples of attribute based co-occurrence constraints
            can be found in <xref linkend="vanderVlist2003"/>, Chapter 7, in <xref linkend="Clark2003"/>, Section 15, or in the RELAX NG schema for the JLTF (Japanese
            Layout Taskforce) aligned document shown in <xref linkend="Sasaki2010"/>.</para><para>Importantly, co-occurrence constraints do not add anything to the formal generative
            capacity of the grammar. This is because attributes (or their values, respectively) add
            an additional specification to the <emphasis role="ital">terminals</emphasis>. Thereby
            we can convert competing terminals (or, equivalently, rules) into non-competing ones,
            but not vice versa. Any co-occurrence constraint thus gives us the possibility to
            distinguish maybe otherwise indistinguishable non-terminals, thereby at most keeping the
            complexity of the grammar constant, or even reducing it. Furthermore, as co-occurrence
            constraints affect only immediate subtrees (i.e., content models), their expressivity
            is entirely contained within the expressive capacities of standard regular tree
            rewriting rules; the only thing we might need to add to our formal grammar model is some
            additional specification on the terminals.</para><para>Neither DTD<footnote><para><xref linkend="Fiorello2004"/> discuss DTD++ 2.0 which supports a large number
                of co-constraints using a syntax closely resembling DTD.</para></footnote> nor XSD 1.0 supports such attribute-element constraints, although there are
            some workarounds or hacks that can be used in XML Schema to mimic co-occurrence
            constraints: either the use of the <code>xsi:type</code> attribute or
              <code>xs:key</code><footnote><para>See <xref linkend="vanderVlist2003"/>, p. 65 and <link xlink:href="http://ajwelch.blogspot.com/2008/06/xml-schema-co-occurrence-constraint.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://ajwelch.blogspot.com/2008/06/xml-schema-co-occurrence-constraint.html</link>
                for further information.</para></footnote>. Another option to realize this particular constraint is the use of embedded
              <xref linkend="Schematron"/> business rules or conditional type assignment using type
            alternatives or assertions that are introduced in <xref linkend="XMLSchema2009"/> (for
            complex types) and <xref linkend="XMLSchema2009-2"/> (for simple types). <xref linkend="lst.grammar.xsd11"/> shows a possible XSD 1.1 realization.</para><figure xml:id="lst.grammar.xsd11"><title>XSD grammar with XSD 1.1 <code>assert</code> element</title><programlisting xml:space="preserve">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"&gt;
  &lt;xs:element name="text"&gt;
    &lt;xs:complexType&gt;
      &lt;xs:complexContent&gt;
        &lt;xs:extension base="textType"&gt;
          &lt;xs:attribute name="author" type="xs:string" use="optional"/&gt;
        &lt;/xs:extension&gt;
      &lt;/xs:complexContent&gt;
    &lt;/xs:complexType&gt;
  &lt;/xs:element&gt;
  &lt;xs:element name="title" type="xs:string"/&gt;
  &lt;xs:element name="section"&gt;
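    &lt;!-- Illustrative sketch, not part of the original schema: the same constraint
         written as an embedded Schematron rule, as an alternative to the xs:assert
         further down. The sch prefix is assumed to be bound to the ISO Schematron
         namespace; whether embedded rules are evaluated depends on the validator. --&gt;
    &lt;xs:annotation&gt;
      &lt;xs:appinfo&gt;
        &lt;sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron"&gt;
          &lt;sch:rule context="section[@type = 'sub']"&gt;
            &lt;sch:assert test="not(section)"&gt;A section of type "sub" must not
              contain section child elements.&lt;/sch:assert&gt;
          &lt;/sch:rule&gt;
        &lt;/sch:pattern&gt;
      &lt;/xs:appinfo&gt;
    &lt;/xs:annotation&gt;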
    &lt;xs:complexType&gt;
      &lt;xs:sequence&gt;
        &lt;xs:element ref="title" minOccurs="0"/&gt;
        &lt;xs:group ref="sectOrPara" maxOccurs="unbounded"/&gt;
      &lt;/xs:sequence&gt;
      &lt;xs:attribute name="type" use="optional"&gt;
        &lt;xs:simpleType&gt;
          &lt;xs:restriction base="xs:string"&gt;
            &lt;xs:enumeration value="global"/&gt;
            &lt;xs:enumeration value="sub"/&gt;
          &lt;/xs:restriction&gt;
        &lt;/xs:simpleType&gt;
      &lt;/xs:attribute&gt;
      &lt;!-- holds unless a section of type "sub" has section children;
           a missing type attribute is treated like "global" --&gt;
      &lt;xs:assert test="not(@type = 'sub') or not(child::section)"/&gt;
    &lt;/xs:complexType&gt;
  &lt;/xs:element&gt;
  &lt;xs:element name="para"&gt;
    &lt;xs:complexType mixed="true"&gt;
      &lt;xs:sequence&gt;
        &lt;xs:element name="xref" minOccurs="0"&gt;
          &lt;xs:complexType&gt;
            &lt;xs:attribute name="href" type="xs:IDREF" use="required"/&gt;
          &lt;/xs:complexType&gt;
        &lt;/xs:element&gt;
      &lt;/xs:sequence&gt;
      &lt;xs:attribute ref="id" use="optional"/&gt;
    &lt;/xs:complexType&gt;
  &lt;/xs:element&gt;
  &lt;xs:complexType name="textType"&gt;
    &lt;xs:sequence&gt;
      &lt;xs:element ref="title" minOccurs="0"/&gt;
      &lt;xs:group ref="sectOrPara" maxOccurs="unbounded"/&gt;
    &lt;/xs:sequence&gt;
  &lt;/xs:complexType&gt;
  &lt;xs:group name="sectOrPara"&gt;
    &lt;xs:choice&gt;
      &lt;xs:element ref="section"/&gt;
      &lt;xs:element ref="para"/&gt;
    &lt;/xs:choice&gt;
  &lt;/xs:group&gt;
  &lt;xs:attribute name="id" type="xs:ID"/&gt;
&lt;/xs:schema&gt;</programlisting></figure><para>The <code>section</code> element contains an XSD 1.1 <code>assert</code> element
            that uses an XPath expression in its <code>test</code> attribute to restrict the
            possible child elements according to the value of the <code>type</code> attribute of the
              <code>section</code> element.</para><para>Note that when using XSD 1.1 assertions for expressing co-occurrence constraints we
            are not limited to immediate subtrees: although the evaluation of the XPath expression
            is done in the context of the parent element (i.e., XSD 1.1's <code>xs:assert</code>
            element uses a tree fragment rooted at the element whose type it is tested against) one
            could put the assertion at the level of the common ancestor, that is, the element that
            contains all the data needed to compute the assertion. The support for full XPath (i.e.,
            axes such as ancestor, parent or preceding and preceding-sibling and following and
            following-sibling, respectively) may be implementation-dependent. XSD 1.1 type alternatives
            are restricted to tests against constants or attributes on the element itself but not to
            ancestors, descendants, siblings or children or their attributes while Schematron
            business rules are not restricted to an XPath subset.</para></listitem><listitem><para>There is one type we should add, which cannot be assigned a place on the hierarchy
            from STGs to RTGs, which is, however, weakly and strongly between LTGs and RTGs.
            Grammars of this type satisfy the following conditions: we want to be able to assign a
            unique interpretation to any node <emphasis role="ital">a</emphasis>, provided we know
            the complete subtree it governs. This kind of grammar would facilitate deterministic
            parsing for bottom-up algorithms. In terms of grammar, this imposes the following
            restrictions on the formalism:</para><orderedlist numeration="loweralpha"><listitem><para>Every leaf-terminal is introduced by a single nonterminal,</para></listitem><listitem><para>for every nonterminal <emphasis role="ital">N</emphasis> in a given grammar,
                there is at most one rule which has a given terminal <emphasis role="ital">a</emphasis>, and <emphasis role="ital">N</emphasis> appears in its content
                model. </para></listitem></orderedlist><para>We call this grammar type <emphasis role="bold">unique subtree grammar</emphasis>
            (USG). Note that this grammar type does not include STGs and URCGs, nor is it included
            by them. The restrictions do not restrain the occurrence of competing nonterminals in
            content models, but rather the labels which belong to nonterminals in the content model.
            More precisely, whereas the former types restrain the occurrence of competing
            nonterminals within content models, USGs restrain the content models of the competing
            rules themselves. </para><para>However, like all other types, they properly include the class of LTGs, as every LTG
            is a USG, but not vice versa, and it is easy to find a language which is generated by a
            USG, but by no LTG. Furthermore, they are properly included in the class of RTGs, in
            the strong and the weak sense (and weakly within the class of GRCGs, as we will see). </para><para>As is also easy to see, the distinctive USG property provides deterministic
            bottom-up parsing. In order to get the interpretation of a given node, it is therefore
            sufficient to find a path from the leaves, which is a linear search problem.</para></listitem></orderedlist></section></section><section><title>Structure and Global Ambiguity</title><para>First, we will present some well-known results, which are important for our further
      discussion.</para><para><!-- xml:id="theorem1" -->
      <emphasis role="bold">Theorem 1 </emphasis><emphasis role="ital">RTGs are weakly equivalent to
        CFGs (that is, the set of strings of leaves generated by CFGs is equivalent to the set of
        strings of leaves generated by RTGs).</emphasis>
    </para><para><!-- xml:id="theorem2" --><emphasis role="bold">Theorem 2 </emphasis>
      <emphasis role="ital">RTGs are strongly equivalent
        to graphs generated by CFGs, modulo a homomorphism of node labels (i.e., a homomorphism
        which maps various node labels in a given tree onto a single one), provided the RTG has
        finitary branching.</emphasis>
    </para><para><!-- xml:id="theorem3" --><emphasis role="bold">Theorem 3 </emphasis>
      <emphasis role="ital">LTGs are strongly equivalent
        to graphs generated by CFGs, provided finitary branching.</emphasis>
    </para><para>The proof is trivial: since in LTGs every node label is generated by exactly one nonterminal, and
      in CFGs nodes which are not leaves are labeled by nonterminals, there is a one-to-one
      correspondence. This has some importance for the relation of LTGs and RTGs. It follows that
      LTGs and RTGs are equivalent up to homomorphism, as are all grammar types in between.
      Every discussion we have about generative capacity therefore concerns only non-equivalence up to
      isomorphism. Since the strings of leaves are also identical for all grammar types, we can only
      be interested in the sets of trees. In the sequel, if we speak of the weak generative capacity
      of a tree grammar, we refer to the sets of trees it generates, not the strings of
      leaves.</para><para>Since to our knowledge, only the strong generative capacity of the grammar types of <xref linkend="Murata2005"/> has been in the focus of research, we will now scrutinize their weak
      generative capacity.</para><para><!-- xml:id="theorem4" --><emphasis role="bold">Theorem 4 </emphasis><emphasis role="ital">The
        sets of trees generated by STGs form a proper subset of the sets of trees generated by
        RCGs.</emphasis></para><para>For proof, consider the trees generated by the following grammar: <programlisting xml:space="preserve">S → a(AB)
A → b(C)
B → b(D)
C → c(D)
D → c(ε)</programlisting></para><para>This is an RCG, since A and B do not occur in similar contexts. In order to see that there
      is no way to generate this tree with an STG, consider the following fact: <emphasis role="ital">a</emphasis> governs two identical labels, which however govern different
      subtrees. It is therefore impossible to introduce them with identical rules, and (by
      definition of STGs) forbidden to have two competing rules in the content model of the first
      rule. This is sufficient for the proof of weak inclusion, since all STG rules are also RCG
      rules, and therefore the languages generated form a proper subset.</para><para><!-- xml:id="theorem5" --><emphasis role="bold">Theorem 5 </emphasis><emphasis role="ital">The
        sets of trees generated by LTGs form a proper subset of the sets of trees generated by
        STGs.</emphasis></para><para>Consider the trees generated by the following grammar: <programlisting xml:space="preserve"><emphasis role="ital">A → a(B)
B → b(C)
C → a(D)
D → c(ε)</emphasis></programlisting></para><para>This is a single type tree grammar, and no LTG is able to generate such a tree (remember
      that LTGs are strongly equivalent to graphs generated by CFGs, provided finitary
      branching).</para><section><title>Restrained Competition Grammars and Variants</title><para>In this section we will scrutinize formal properties of the different types of
        restrained competition grammars. We will show which kind of languages cannot be generated by
        RCGs; we will prove that there are GRCGs which are ambiguous, and that for every language
        which can be generated by an RTG, there is a GRCG which generates the same language.
        Finally, we will show that URCGs do not have these properties, are properly included within
        GRCGs and properly include RCGs. </para><para>The type of languages we cannot generate with RCGs is quickly described as follows: all
        grammars, where a single content model contains competing nonterminals, which can be
        uniquely distinguished only from their left (right, respectively) context, are not RCGs.
        Consequently, we cannot generate sets of trees, where a certain node has different subtrees
        depending on its right siblings. If we want to get rid of this asymmetry, and allow for
        GRCGs, where competing nonterminals in a single content model are uniquely determined by
        their left or right context, we run into problems: </para><para><!-- xml:id="theorem6" --><emphasis role="bold">Theorem 6 </emphasis><emphasis role="ital">There
          are GRCGs which are ambiguous.</emphasis></para><para> For proof, consider the following rule: <emphasis role="ital">S →
          a(AB|BA)</emphasis>. Suppose, <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis> are competing nonterminals; suppose furthermore, that there is some
        overlapping between <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis>; i.e., the nonterminals generate overlapping sets of trees. In particular,
        we may assume that the trees generated by <emphasis role="ital">A</emphasis> form a subset
        of the trees generated by <emphasis role="ital">B</emphasis>. For example, <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis> generate identical trees
        up to depth <emphasis role="ital">n</emphasis>; <emphasis role="ital">B</emphasis> in
        addition generates a tree of depth <emphasis role="ital">n+1</emphasis>. In this case, the
        trees of the language have the root <emphasis role="ital">a</emphasis>, with two symmetrical
        sets of subtrees up to depth <emphasis role="ital">n</emphasis>, and possibly one subtree
        with depth <emphasis role="ital">n+1</emphasis>. It is easily seen that now it is impossible
        to merge <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis>, for then
        we would be incapable of expressing the condition that at most one subtree has depth
          <emphasis role="ital">n+1</emphasis>. However, for the trees, where the subtrees
        introduced by <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis> have
        depth at most <emphasis role="ital">n</emphasis>, there is necessarily more than one
        analysis. The grammar we have described so far is, however, a GRCG, because neither
          <emphasis role="ital">A</emphasis> nor <emphasis role="ital">B</emphasis> occur in
        identical contexts (though in similar contexts, remember the preceding section).</para><para><!-- xml:id="theorem7" --><emphasis role="bold">Theorem 7 </emphasis><emphasis role="ital">For
          every language which can be generated by an RTG, there is a GRCG which generates the same
          language.</emphasis></para><para> To prove this theorem, we describe a simple procedure to convert any RTG into a GRCG
        which generates the same language. We define competing sequences of nonterminals of length <emphasis role="ital">n</emphasis> as follows: two such sequences
        compete if, for all positions <emphasis role="ital">i</emphasis> from 1 to <emphasis role="ital">n</emphasis>, the <emphasis role="ital">i</emphasis>th nonterminal of one sequence competes with the <emphasis role="ital">i</emphasis>th nonterminal of the other sequence. Assume a content model
          <emphasis role="ital">r</emphasis> which does not conform to the GRCG condition. Then there have to be
        two competing nonterminals or competing sequences of nonterminals <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis> in <emphasis role="ital">r</emphasis>, such that for possibly empty sequences of nonterminals <emphasis role="ital">Γ</emphasis> and <emphasis role="ital">Δ</emphasis>, (<emphasis role="ital">Γ A Δ</emphasis>) and (<emphasis role="ital">Γ B Δ</emphasis>) match
        with <emphasis role="ital">r</emphasis>.</para><para> Given this, we can be sure, that in the instantiations of <emphasis role="ital">r</emphasis>, which violate the GRCG condition, <emphasis role="ital">A</emphasis> and
          <emphasis role="ital">B</emphasis> occur in exactly the same global tree contexts. By
        global tree context we here mean that a tree in which a node <emphasis role="ital">a</emphasis> governs the subtrees generated by
          <emphasis role="ital">A</emphasis> is part of the language iff <emphasis role="ital">a</emphasis> also governs the set of
        subtrees generated by <emphasis role="ital">B</emphasis>. Since this is the case, we can
        simply merge the two nonterminals to a new one, <emphasis role="ital">C</emphasis>, which is
        the union of the former two. This new nonterminal substitutes all instantiations of
          <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis>, which occur in
        the same global tree context. These, by definition, are the instantiations which violate the
        GRCG condition. This we can apply to all nonterminals which violate the GRCG condition. The
        only thing we have to take care of is that we apply this only to those instantiations of the
        content models where two competing nonterminals match equally (this might force us to change
        some regular expressions). We do not show an exact algorithm at this point, since it is
        clear that an equivalent GRCG exists, and the details of the construction are of no practical
        interest at this point. </para><para>We now show that there is a hierarchy of proper inclusion RCG ⊂ URCG ⊂ GRCG.
        To show that RCG ⊂ URCG, consider the following: every rule which is admitted by an
        RCG is also admitted by a URCG, because if competing nonterminals in the same content model
        have a unique prefix, a fortiori they also have a unique context (we have already shown that
        a unique prefix of nonterminals is tantamount to a unique prefix of labels/siblings, by
        induction). Above, we have already shown that it is impossible for an RCG to generate
        languages such as the one generated by the following grammar, which is a URCG. <programlisting xml:space="preserve"><emphasis role="ital">S → a(AC|BD)
A → b(C)
B → b(D)
C → c(ε)
D → d(ε)</emphasis></programlisting>
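        Rendered as XML (an illustrative encoding assumed here, in which every tree node becomes an
        element named after its label), the language of this grammar consists of just two
        documents; what the <code>b</code> element may contain depends on its right sibling: <programlisting xml:space="preserve">&lt;!-- derived via S → a(AC): b is introduced by A --&gt;
&lt;a&gt;&lt;b&gt;&lt;c/&gt;&lt;/b&gt;&lt;c/&gt;&lt;/a&gt;

&lt;!-- derived via S → a(BD): b is introduced by B --&gt;
&lt;a&gt;&lt;b&gt;&lt;d/&gt;&lt;/b&gt;&lt;d/&gt;&lt;/a&gt;</programlisting>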
        <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis> compete, but are
        determined uniquely by their context.</para><para>This concludes the first part; the second part will be a corollary of the next section:
        We will show that some languages are inherently ambiguous, that is, there is no unambiguous
        grammar for them. By Theorem 7 we know that we can generate these languages with GRCGs, but
        URCGs cannot:</para><para><!-- xml:id="proposition7" --><emphasis role="bold">Proposition 1 </emphasis><emphasis role="ital">A URCG cannot be ambiguous.</emphasis></para><para>This is easy to see: an ambiguous grammar assigns two different sequences of
        nonterminals to the daughters of one node (since root nodes are unambiguous). Then, however,
        there must be at least two competing nonterminals which occur in the same content model in
        similar contexts, which, by definition, is impossible.</para></section><section><title>Inherently Ambiguous Languages</title><para>As a corollary, we can show that there are regular tree languages, for which there is no
        unambiguous grammar. There are sets of trees, which are generated by an ambiguous GRCG, but
        by no URCG. We will call these languages <emphasis role="ital">inherently
          ambiguous</emphasis>. </para><para><!-- xml:id="theorem8" --><emphasis role="bold">Theorem 8 </emphasis><emphasis role="ital">Some
          regular tree languages are inherently ambiguous.</emphasis></para><para>This can be seen easily, if we spell out a grammar which we described in the above
        subsection. We will then show that there is no way to write an unambiguous grammar which
        generates the same language.<programlisting xml:space="preserve"><emphasis role="ital">S → r(AB|BA)
A → a(C)
B → a(D)
C → b(ε)
D → b(ε|E)
E → c(ε)</emphasis></programlisting></para><para>There is no way to merge <emphasis role="ital">A</emphasis> and <emphasis role="ital">B</emphasis>, since they generate different sets of subtrees (we can write <emphasis role="ital">L(A)≠L(B)</emphasis>); but since they overlap (<emphasis role="ital">L(A)∩ L(B)≠ ∅</emphasis>), there is no way to have a unique
        interpretation in the cases where the subtrees generated by the nonterminals are identical.
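        Consider, as a sketch, the tree <emphasis role="ital">r(a(b),a(b))</emphasis>, written here as an XML instance (an
        illustrative encoding, not part of the grammar formalism); the comments list the two
        possible assignments of nonterminals to its <code>a</code> nodes: <programlisting xml:space="preserve">&lt;r&gt;
  &lt;!-- reading 1 (S → r(AB)): A → a(C), C → b(ε)
       reading 2 (S → r(BA)): B → a(D), D → b(ε) --&gt;
  &lt;a&gt;&lt;b/&gt;&lt;/a&gt;
  &lt;!-- reading 1 (S → r(AB)): B → a(D), D → b(ε)
       reading 2 (S → r(BA)): A → a(C), C → b(ε) --&gt;
  &lt;a&gt;&lt;b/&gt;&lt;/a&gt;
&lt;/r&gt;</programlisting>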
        There will always be two ways to generate trees in this case.</para><para>We can, furthermore, precisely state the conditions, under which a regular tree language
        is ambiguous. To this end, however, we need to introduce some notation. We now for
        simplicity write trees as terms: a tree with root <emphasis role="ital">a</emphasis> and
        daughters <emphasis role="ital">b</emphasis> and <emphasis role="ital">c</emphasis> is
        denoted as <emphasis role="ital">a(b,c)</emphasis>, etc. As a next step, we define a context
        as the position, where certain subtrees occur within trees of a language. </para><para>
        <emphasis role="bold">Definition 9</emphasis>
        <emphasis role="ital">A context C is a tree-term with exactly one variable. We say that a
          set of subtrees α occurs in a context C in a language L, if the following holds: We
          can instantiate the variable of C with any tree from α, and the resulting tree is in
          L</emphasis>.</para><para>Note that sets of subtrees correspond to nonterminals, when we speak of languages rather
        than grammars. In the sequel, for simplicity we will use lower case Greek variables equally
        for sets of subtrees as for ordered sequences of sets of subtrees. The definition of a
        context is easily accommodated to sequences. A set of sequences of trees of length <emphasis role="ital">n</emphasis> consists of ordered tuples of trees of length <emphasis role="ital">n</emphasis>, of the form (<emphasis role="ital">t<subscript>0</subscript>,...,t<subscript>n-1</subscript></emphasis>). Sets of subtrees
        are then simply sets of one-tuples. We will not provide a proof for the
        following proposition and therefore state it as a conjecture; we will, however, sketch the
        argument: </para><para><!-- xml:id="proposition2" --><emphasis role="bold">Conjecture 1</emphasis>
        <emphasis role="ital">A tree-language is inherently ambiguous iff at least one node fulfills
          all of the following conditions: </emphasis></para><para><emphasis role="ital">We need to have one node with an arbitrary label a, with at least
          two (sequences of) sets of subtrees α and β, such that </emphasis></para><orderedlist numeration="arabic"><listitem><para><emphasis role="ital">α ∩ β ≠ ∅</emphasis>;</para></listitem><listitem><para><emphasis role="ital">α ≠ β</emphasis>;</para></listitem><listitem><para>There is at least one context <emphasis role="ital">C</emphasis> in <emphasis role="ital">L</emphasis>, such that both <emphasis role="ital">a(Γ,
                (t<subscript>1</subscript>,...,t<subscript>n</subscript>), Δ,
                (u<subscript>1</subscript>,...,u<subscript>n</subscript>), Θ)</emphasis> and
              <emphasis role="ital">a(Γ,
                (u<subscript>1</subscript>,...,u<subscript>n</subscript>), Δ,
                (t<subscript>1</subscript>,...,t<subscript>n</subscript>), Θ)</emphasis> occur
            in <emphasis role="ital">C</emphasis>, for all <emphasis role="ital">(t<subscript>1</subscript>,...,t<subscript>n</subscript>) ∈ α</emphasis>
            and all <emphasis role="ital">(u<subscript>1</subscript>,...,u<subscript>n</subscript>)
              ∈ β</emphasis>, where uppercase Greek letters designate possibly empty
            sequences of daughter sub-trees; note that the sequences need to have equal length in
            order to meet condition 1.</para></listitem></orderedlist><para>Due to space restrictions, we leave the proof of this conjecture open here; it
        is, however, reminiscent of a theorem in <xref linkend="Odgen1968"/> for string languages. We
        will give a rather informal discussion of the individual conditions in the next section. It is not hard
        to see that this is merely a generalization of the cases we have described above. As we
        will see, we can derive some useful facts from these properties of ambiguous languages, even
        without a general proof: we can show that we can construe grammars for languages which do
        not fulfill one of the conditions, and, moreover, which type of grammars we can
        construe.</para></section><section><title>Unambiguous Languages</title><para><!-- xml:id="theorem9" --><emphasis role="bold">Theorem 9</emphasis>
        <emphasis role="ital">From the grammar types sketched so far, there is no type which
          generates all and only the RTLs that are not inherently ambiguous.</emphasis></para><para>We will demonstrate this going through the three conditions mentioned in the preceding
        section, and see which unambiguous grammar we can construe if one condition is not met.
        This is to be read as follows: if a condition is not met, then among all
        nodes of the tree language there may be some which meet the conditions not in question, but
        none which meets the one currently under consideration.</para><orderedlist><listitem><para>If there is no intersection between the subtrees of a given node, the grammar is of
            course not ambiguous. We cannot, however, necessarily construe a URCG for this language,
            since in the content model of the mother node there are competing nonterminals in
            similar contexts (recall the example given above).</para><para>We can, however, construe a USG for such a language, since subtrees are uniquely
            identifiable.</para></listitem><listitem><para>If the second condition is not met, there are no two sets of subtrees governed by the same node which
            are not identical. We can thus introduce them by the same nonterminals, and have a local
            tree grammar (having no different sets of subtrees governed by the same node amounts to
            saying that we need no competing rules in the grammar, as nonterminals correspond to sets of
            subtrees).</para></listitem><listitem><para>If the third condition is not met, then we can construe nonterminals (corresponding
            to the sets of subtrees) such that for all of them the following holds: assuming they
            compete (introduce identical labels), they either occur in different contexts, in which
            case they are distinguishable thereby and no ambiguity arises; or they occur in
            identical contexts, in which case we can use a unique nonterminal which is the merge of
            both (this also holds for root nodes). The critical case, in which the content of one
            (sequence of) set(s) of subtrees depends on the other, so that they occur in similar
            contexts but cannot be merged, is excluded by
            assumption.</para><para>Since this argument holds inductively from the root to all subtrees, we can construct
            a URCG for a language where condition 3 is not met, but we cannot use any strictly
            weaker type. The only thing we have established is that if two sets of subtrees occur in
            similar contexts (for the grammar we construct), then they actually occur in identical
            contexts. It follows that we do not need competing nonterminals in similar
            contexts.</para></listitem></orderedlist><para>This shows that we still have not solved the problem of defining a canonical grammar type
        which generates all and only the unambiguous languages, since there are languages which are
        generated by USGs but by no other canonical class which allows no type of ambiguity, and
        languages which are generated by URCGs and by no other such type. So far, we are still lacking
        a characterization of the unambiguous languages in terms of grammar rules.</para></section></section><section><title>Application and Future Research</title><para>When we speak of XML schema languages and applications, the first thing that comes to
      mind is parsing an instance and validating it against the respective schema. <xref linkend="Murata2005"/> have presented algorithms for parsing the three types of tree grammars we
      have already discussed. However, providing algorithms for the new
      grammar types we have defined remains an open task.</para><para><xref linkend="Mani2002"/> demonstrated the use of the theory of regular tree grammars for
      XML-to-relational conversion as an additional application of formal language theory in the XML
      context. Again, this work could be extended using the newly established grammar types.</para><para>Regarding future research, this paper may serve as a foundation in the fields of XML
      applications and formal languages. New features introduced in XSD 1.1, such as
      conditional type assignment, assertions, and the <code>openContent</code> element, as well as
      the relaxed <emphasis role="ital">Unique Particle Attribution</emphasis> rule (UPA, also known as the
      determinism rule, see <xref linkend="XMLSchema2009"/>, Section 3.8.6.4) and the changed behavior
      of wildcards, affect the expressiveness of the language. Apart from these natural enhancements,
      another focus may lie in examining the relationships between XPath and XQuery and formal
      languages on the basis of the work undertaken in this paper; we expect to shed some light on
      this topic in future research. In addition, a more formal approach to the analysis of
      overlapping markup structures such as GODDAGs (<xref linkend="Sperberg-McQueen2004"/>) could
      be an interesting field for future work.</para><para>Seen from a practical perspective and in light of the findings in <xref linkend="Martens2006"/>, a large portion of the XML document grammars that can be found in
      the wild are structurally equivalent to DTDs or <emphasis role="ital">specialized
        DTDs</emphasis> (that is, regular DTDs extended with a mechanism to decouple element names from
      their types, see <xref linkend="Papakonstantinou2000"/> and <xref linkend="Balmin2004"/>;
      also called EDTDs by <xref linkend="Martens2006"/>), and hence use roughly the
      expressiveness of local tree grammars. This is often due to nontransparent restrictions in the
      XML Schema specification such as the already discussed <emphasis role="ital">Element Declarations
        Consistent</emphasis> (EDC) constraint. <xref linkend="Bex2009"/> and <xref linkend="Martens2007"/> provide simplifications for XSDs and XSD authoring tools that should
      relieve authors of the burden of these constraints by <quote>automatically transforming
        nondeterministic expressions into concise deterministic ones</quote>. Regarding RELAX NG
      document grammars, we think that restricting their expressive power to the class of URCGs would
      provide a feasible compromise. For now, we hope that this more fine-grained hierarchy
      may serve others as a guide for choosing a specific XML schema language depending on the
      expressivity of the markup language that has to be developed.</para></section><bibliography><title>Bibliography</title><bibliomixed xml:id="Abiteboul2000" xreflabel="Abiteboul et al., 2000">Abiteboul, S., P.
      Buneman, and D. Suciu (2000). Data on the Web: From Relations to Semistructured Data and XML.
      Morgan Kaufmann Publishers, San Francisco, California.</bibliomixed><bibliomixed xml:id="Ansari2009" xreflabel="Ansari et al., 2009">Ansari, M. S., Zahid, N., and
      K.-G. Doh. A Comparative Analysis of XML Schema Languages. In Slezak, D., Kim, T., Zhang, Y.,
      Ma, J., and K. Chung, eds., Database Theory and Application. International Conference, DTA
      2009, Held as Part of the Future Generation Information Technology Conference, FGIT 2009, Jeju
      Island, Korea, December 10-12, 2009. Proceedings, volume 64, pages 41–48. Springer, Berlin,
      Heidelberg, 2009. doi: <biblioid class="doi">10.1007/978-3-642-10583-8_6</biblioid>.</bibliomixed><bibliomixed xml:id="Balmin2004" xreflabel="Balmin et al., 2004">Balmin, A., Papakonstantinou,
      Y., and V. Vianu (2004). Incremental validation of XML documents. ACM Transactions on Database
      Systems (TODS), 29(4):710–751. doi: <biblioid class="doi">10.1145/1042046.1042050</biblioid>.</bibliomixed><bibliomixed xml:id="Bauman2008" xreflabel="Bauman, 2008">Bauman, S. (2008). Freedom to
      Constrain: where does attribute constraint come from, mommy? In Proceedings of Balisage: The
      Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi: <biblioid class="doi">10.4242/BalisageVol1.Bauman01</biblioid>.</bibliomixed><bibliomixed xreflabel="Bex et al., 2009" xml:id="Bex2009">Bex, G. J., Gelade, W., Martens, W.
      and F. Neven (2009). Simplifying XML Schema: Effortless Handling of Nondeterministic Regular
      Expressions. In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on
      Management of data, pages 731–744, New York, NY, USA, ACM. doi: <biblioid class="doi">10.1145/1559845.1559922</biblioid>.</bibliomixed><bibliomixed xml:id="Brueggemann-Klein1992" xreflabel="Brüggemann-Klein and Wood, 1992">Brüggemann-Klein, A., and D. Wood (1992). Deterministic Regular Languages. In Finkel, A. and
      M. Jantzen, eds., STACS 92. 9th Annual Symposium on Theoretical Aspects of Computer Science
      Cachan, France, February 13–15, 1992 Proceedings, volume 577 of Lecture Notes in Computer
      Science, pages 173–184. Springer, Berlin, Heidelberg. doi: <biblioid class="doi">10.1007/3-540-55210-3_182</biblioid>.</bibliomixed><bibliomixed xml:id="Brueggemann-Klein1993" xreflabel="Brüggemann-Klein, 1993">Brüggemann-Klein,
      A. (1993). Formal Models in Document Processing. Habilitation, Albert-Ludwig-Universität zu
      Freiburg i. Br.</bibliomixed><bibliomixed xml:id="Brueggemann-Klein1997" xreflabel="Brüggemann-Klein and Wood, 1997">Brüggemann-Klein, A., and D. Wood (1997). One-unambiguous regular languages. Information and
      computation, 142:182–206. doi: <biblioid class="doi">10.1006/inco.1997.2695</biblioid>.</bibliomixed><bibliomixed xml:id="Brueggemann-Klein2002" xreflabel="Brüggemann-Klein and Wood, 2002">Brüggemann-Klein, A., and D. Wood (2002). The parsing of extended context-free grammars.
      HKUST Theoretical Computer Science Center Research Report HKUST-TCSC-2002-08, The Hong Kong
      University of Science and Technology Library.</bibliomixed><bibliomixed xml:id="Brueggemann-Klein2004" xreflabel="Brüggemann-Klein and Wood, 2004">Brüggemann-Klein, A., and D. Wood (2004). Balanced context-free grammars, hedge grammars and
      pushdown caterpillar automata. In Proceedings of Extreme Markup Languages, Montréal,
      Québec.</bibliomixed><bibliomixed xml:id="Buck2000" xreflabel="Buck et al., 2000">Buck, L., Goldfarb, C. F., and P.
      Prescod (2000). Datatypes for DTDs (DT4DTD) 1.0. W3C Note 13 January 2000, World Wide Web
      Consortium.</bibliomixed><bibliomixed xml:id="Carey2009" xreflabel="Carey, 2009">Carey, B. M. (2009). Meet CAM: A new XML
      validation technology. Take semantic and structural validation to the next level. IBM
      developerworks, IBM Corporation. <link xlink:href="http://www.ibm.com/developerworks/xml/library/x-cam/?S_TACT=105AGX54&amp;S_CMP=C0924&amp;ca=dnw-1036&amp;ca=dth-x&amp;open&amp;cm_mmc=6015-_-n-_-vrm_newsletter-_-10731_131528&amp;cmibm_em=dm:0:13962324" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.ibm.com/developerworks/xml/library/x-cam/?S_TACT=105AGX54&amp;S_CMP=C0924&amp;ca=dnw-1036&amp;ca=dth-x&amp;open&amp;cm_mmc=6015-_-n-_-vrm_newsletter-_-10731_131528&amp;cmibm_em=dm:0:13962324</link>.</bibliomixed><bibliomixed xml:id="Chomsky1955" xreflabel="Chomsky, 1955">Chomsky, N. (1955). Logical Syntax
      and Semantics: Their Linguistic Relevance. Language, 31(1):36–45, 1955. doi: <biblioid class="doi">10.2307/410891</biblioid>.</bibliomixed><bibliomixed xml:id="Chomsky1956" xreflabel="Chomsky, 1956">Chomsky, N. (1956). Three Models for
      the Description of Language. IRE Transactions on Information Theory, 2:113–124,
      1956. doi: <biblioid class="doi">10.1109/TIT.1956.1056813</biblioid>.</bibliomixed><bibliomixed xml:id="Clark2001" xreflabel="Clark, 2001">Clark, J. (2001). TREX – Tree Regular
      Expressions for XML Language Specification. Technical report, Thai Open Source Software Center
      Ltd.</bibliomixed><bibliomixed xml:id="Clark2003" xreflabel="Clark et al., 2003">Clark, J., J. Cowan, and M.
      Murata (2003). RELAX NG Compact Syntax Tutorial. Working Draft 26 March 2003, OASIS –
      Organization for the Advancement of Structured Information Standards. <link xlink:href="http://relaxng.org/compact-tutorial-20030326.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://relaxng.org/compact-tutorial-20030326.html</link>.</bibliomixed><bibliomixed xml:id="Comon2008" xreflabel="Comon et al., 2008">Comon, H., M. Dauchet, R.
      Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi (2007). Tree Automata
      Techniques and Applications. Release November, 18th 2008. <link xlink:href="http://www.grappa.univ-lille3.fr/tata" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.grappa.univ-lille3.fr/tata</link>.</bibliomixed><bibliomixed xml:id="Costello2008" xreflabel="Costello and Simmons, 2008"> Costello, R. L., and
      R. A. Simmons (2008). Tutorials on Schematron: Two Types of XML Schema Language. <link xlink:href="http://www.xfront.com/schematron/Two-types-of-XML-Schema-Language.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.xfront.com/schematron/Two-types-of-XML-Schema-Language.html</link>.</bibliomixed><bibliomixed xml:id="Moeller2005" xreflabel="DSD2">Møller, A. (2005). Document Structure
      Description 2.0. Technical report, BRICS (Basic Research in Computer Science, Aarhus
      University), 2005. <link xlink:href="http://www.brics.dk/DSD/dsd2.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.brics.dk/DSD/dsd2.html</link>.</bibliomixed><bibliomixed xml:id="Fiorello2004" xreflabel="Fiorello et al., 2004">Fiorello, D., Gessa, N.,
      Marinelli, P., and F. Vitali (2004). DTD++ 2.0: Adding support for co-constraints. In Proceedings of
      Extreme Markup Languages, Montréal, Québec.</bibliomixed><bibliomixed xml:id="Gelade2009" xreflabel="Gelade et al., 2009">Gelade, W., Martens, W., and F.
      Neven (2009). Optimizing Schema Languages for XML: Numerical Constraints and Interleaving.
      SIAM Journal on Computing, 38(5):2021–2043. doi: <biblioid class="doi">10.1137/070697367</biblioid>.</bibliomixed><bibliomixed xml:id="Goldfarb1978" xreflabel="Goldfarb, 1978">Goldfarb, C. F. (1978). DCF GML
      User’s Guide (IBM SH20-9160). IBM, 1978.</bibliomixed><bibliomixed xml:id="Goldfarb1991" xreflabel="Goldfarb, 1991">Goldfarb, C. F. (1991). The SGML
      Handbook. Oxford University Press, Oxford.</bibliomixed><bibliomixed xml:id="Gecseg1997" xreflabel="Gécseg and Steinby, 1997">Gécseg, F., and M. Steinby
      (1997). Tree languages. In Handbook of Formal Languages, volume 3, pages 1–68. Springer, New
      York.</bibliomixed><bibliomixed xml:id="Hopcroft2000" xreflabel="Hopcroft et al., 2000">Hopcroft, J., R. Motwani,
      and J. Ullman (2000). Introduction to Automata Theory, Languages, and Computation. 2nd
      edition. Addison Wesley Longman, Amsterdam.</bibliomixed><bibliomixed xml:id="Jeliffe2009" xreflabel="Jeliffe, 2009">Jelliffe, R. (2009). Is Schematron a
      rules language? Online: <link xlink:href="http://broadcast.oreilly.com/2009/01/is-schematron-a-rules-language.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://broadcast.oreilly.com/2009/01/is-schematron-a-rules-language.html</link>.</bibliomixed><bibliomixed xml:id="Kilpeläinen2007" xreflabel="Kilpeläinen and Tuhkanen, 2007">Kilpeläinen,
      P., and R. Tuhkanen (2007). One-unambiguity of regular expressions with numeric occurrence
      indicators. Information and Computation, 205(6):890–916. doi: <biblioid class="doi">10.1016/j.ic.2006.12.003</biblioid>.</bibliomixed><bibliomixed xml:id="Klarlund2003" xreflabel="Klarlund et al., 2003">Klarlund, N., T.
      Schwentick, and D. Suciu (2003). XML: Model, Schemas, Types, Logics and Queries. In Chomicki,
      J., R. van der Meyden, and G. Saake, eds., Logics for Emerging Applications of Databases,
      pages 1–41. Springer, Berlin, Heidelberg.</bibliomixed><bibliomixed xml:id="Kracht2010" xreflabel="Kracht, 2010">Kracht, M. (to appear). Modal Logic
      Foundations of Markup Structures in Annotation Systems. In Mehler, A., Kühnberger, K.-U.,
      Lobin, H., Lüngen, H., Storrer, A., and A. Witt, eds., Modeling, Learning and Processing of
      Text Technological Data Structures, Studies in Computational Intelligence. Springer,
      Dordrecht.</bibliomixed><bibliomixed xml:id="Lee2000" xreflabel="Lee and Chu, 2000">Lee, D. and W. Chu. Comparative
      Analysis of Six XML Schema Languages. ACM SIGMOD Record, 29(3):76–87, September
      2000. doi: <biblioid class="doi">10.1145/362084.362140</biblioid>.</bibliomixed><bibliomixed xml:id="Maler1995" xreflabel="Maler and Andaloussi, 1995">Maler, E., and J. E.
      Andaloussi (1995). Developing SGML DTDs: From Text to Model to Markup. Prentice Hall, Upper
      Saddle River, New Jersey.</bibliomixed><bibliomixed xml:id="Mani2001" xreflabel="Mani, 2001">Mani, M. (2001). Keeping chess alive: Do we
      need 1-unambiguous content models? In Proceedings of Extreme Markup Languages, Montréal,
      Québec.</bibliomixed><bibliomixed xml:id="Mani2002" xreflabel="Mani and Lee, 2002">Mani, M., and D. Lee (2002). XML
      to Relational Conversion using Theory of Regular Tree Grammars. In Proceedings of the 28th
      VLDB Conference, Hong Kong, China.</bibliomixed><bibliomixed xml:id="Marcoux2008" xreflabel="Marcoux, 2008">Marcoux, Y. (2008). Graph
      characterization of overlap-only TexMECS and other overlapping markup formalisms. In
      Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies,
      vol. 1. Montréal, Québec. doi: <biblioid class="doi">10.4242/BalisageVol1.Marcoux01</biblioid>.</bibliomixed><bibliomixed xml:id="Martens2005" xreflabel="Martens et al., 2005">Martens, W., Neven, F., and
      T. Schwentick (2005). Which XML Schemas Admit 1-Pass Preorder Typing? In Eiter, T., and L.
      Libkin, eds., Database Theory – ICDT 2005, volume 3363 of Lecture Notes in Computer Science,
      pages 68–82. Springer, Berlin, Heidelberg, 2005. doi: <biblioid class="doi">10.1007/978-3-540-30570-5_5</biblioid>.</bibliomixed><bibliomixed xml:id="Martens2006" xreflabel="Martens et al., 2006">Martens, W., Neven, F.,
      Schwentick, T., and G. Bex (2006). Expressiveness and Complexity of XML Schema. ACM
      Transactions on Database Systems (TODS), 31(3):770–813. doi: <biblioid class="doi">10.1145/1166074.1166076</biblioid>.</bibliomixed><bibliomixed xml:id="Martens2007" xreflabel="Martens et al., 2007">Martens, W., Neven, F. and T.
      Schwentick (2007). Simple off the shelf abstractions for XML schema. SIGMOD Rec.,
      36(3):15–22. doi: <biblioid class="doi">10.1145/1324185.1324188</biblioid>.</bibliomixed><bibliomixed xml:id="Martens2009" xreflabel="Martens et al., 2009">Martens, W., Neven, F. and T.
      Schwentick (2009). Complexity of Decision Problems for XML Schemas and Chain Regular
      Expressions. SIAM Journal on Computing, 39(4):1486–1530. doi: <biblioid class="doi">10.1137/080743457</biblioid>.</bibliomixed><bibliomixed xml:id="Moeller2006" xreflabel="Møller and Schwartzbach, 2006">Møller, A., and M.
      Schwartzbach (2006). An Introduction to XML and Web Technologies, chapter Schema Languages,
      pages 92–187. Addison-Wesley, Harlow, England.</bibliomixed><bibliomixed xml:id="Murata2001" xreflabel="Murata et al., 2001">Murata, M., D. Lee, and M.
      Mani (2001). Taxonomy of XML Schema Languages using Formal Language Theory. In Proceedings of
      Extreme Markup Languages, Montréal, Québec.</bibliomixed><bibliomixed xml:id="Murata2005" xreflabel="Murata et al., 2005">Murata, M., D. Lee, M. Mani,
      and K. Kawaguchi (2005). Taxonomy of XML Schema Languages Using Formal Language Theory. ACM
      Transactions on Internet Technology, 5(4):660–704. doi: <biblioid class="doi">10.1145/1111627.1111631</biblioid>.</bibliomixed><bibliomixed xml:id="CLiX" xreflabel="Nentwich, 2005">Nentwich, C. (2005). CLiX – A
      Validation Rule Language for XML. Presented by Anthony Finkelstein at W3C Workshop on Rule
      Languages for Interoperability, 27-28 April 2005, Washington D.C. <link xlink:href="http://www.w3.org/2004/12/rules-ws/paper/24/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/2004/12/rules-ws/paper/24/</link>.</bibliomixed><bibliomixed xml:id="NVDL" xreflabel="NVDL">ISO/IEC 19757-4:2006. Information technology —
      Document Schema Definition Languages (DSDL) — Part 4: Namespace-based Validation Dispatching
      Language (NVDL), International Standard, International Organization for Standardization,
      Geneva.</bibliomixed><bibliomixed xml:id="Odgen1968" xreflabel="Odgen, 1968">Ogden, W. (1968). A Helpful Result for
      Proving Inherent Ambiguity. In Mathematical Systems Theory, 2(3):191–194. doi: <biblioid class="doi">10.1007/BF01694004</biblioid>.</bibliomixed><bibliomixed xml:id="Pawson2007" xreflabel="Pawson, 2007">Pawson, D. (2007). ISO Schematron
      tutorial. <link xlink:href="http://www.dpawson.co.uk/schematron/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.dpawson.co.uk/schematron/</link>.</bibliomixed><bibliomixed xml:id="Papakonstantinou2000" xreflabel="Papakonstantinou and Vianu, 2000">Papakonstantinou, Y., and V. Vianu (2000). DTD inference for views of XML data. In PODS ’00:
      Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
      systems, pages 35–46, New York, NY, USA, ACM. doi: <biblioid class="doi">10.1145/335168.335173</biblioid>.</bibliomixed><bibliomixed xml:id="Piez2001" xreflabel="Piez, 2001">Piez, W. (2001). Beyond the “descriptive
      vs. procedural” distinction. In Markup Languages – Theory &amp; Practice,
      3(2):141–172. doi: <biblioid class="doi">10.1162/109966201317356380</biblioid>.</bibliomixed><bibliomixed xml:id="RELAXCore" xreflabel="RELAX Core">ISO/IEC TR 22250-1:2002. Information
      technology – Document description and processing languages – Regular Language Description for
      XML – part 1: RELAX Core. International Standard, International Organization for
      Standardization, Geneva.</bibliomixed><bibliomixed xml:id="RELAX" xreflabel="RELAX NG">ISO/IEC 19757-2:2008. Information technology –
      Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based validation – RELAX
      NG (ISO/IEC 19757-2). International Standard, International Organization for Standardization,
      Geneva.</bibliomixed><bibliomixed xml:id="RELAX2nd" xreflabel="RELAX NG (2nd Ed.)">ISO/IEC 19757-2:2008. Information
      technology – Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based
      validation – RELAX NG (ISO/IEC 19757-2). Second Edition. International Standard, International
      Organization for Standardization, Geneva.</bibliomixed><bibliomixed xml:id="Rizzi2001" xreflabel="Rizzi, 2001">Rizzi, R. (2001). Complexity of
      context-free grammars with exceptions and the inadequacy of grammars as models for XML and
      SGML. Markup Languages – Theory &amp; Practice, 3(1):107–116. doi: <biblioid class="doi">10.1162/109966201753537222</biblioid>.</bibliomixed><bibliomixed xml:id="Rogers2003" xreflabel="Rogers, 2003">Rogers, J. (2003). Syntactic
      Structures as Multi-dimensional Trees. In Research on Language and Computation,
      1(3-4):265–305. doi: <biblioid class="doi">10.1023/A:1024695608419</biblioid>.</bibliomixed><bibliomixed xml:id="Sasaki2010" xreflabel="Sasaki, 2010">Sasaki, F. (2010). How to avoid
      suffering from markup: A project report about the virtue of hiding xml. In XML Prague 2010
      Conference Proceedings, pages 105–123, Prague, Czech Republic, March 13–14 2010. Institute for
      Theoretical Computer Science.</bibliomixed><bibliomixed xml:id="Schematron" xreflabel="Schematron">ISO/IEC 19757-3:2006 Information
      technology — Document Schema Definition Languages (DSDL) — Part 3: Rule-based validation —
      Schematron. International Standard, International Organization for Standardization,
      Geneva.</bibliomixed><bibliomixed xml:id="SGML" xreflabel="SGML">ISO 8879:1986. Information Processing — Text and
      Office Information Systems — Standard Generalized Markup Language. International Standard,
      International Organization for Standardization, Geneva.</bibliomixed><bibliomixed xml:id="Sperberg-McQueen2003" xreflabel="Sperberg-McQueen, 2003">Sperberg-McQueen,
      C. M. (2003). Logic grammars and XML Schema. In Proceedings of Extreme Markup Languages,
      Montréal, Québec.</bibliomixed><bibliomixed xml:id="Sperberg-McQueen2004" xreflabel="Sperberg-McQueen and        Huitfeldt, 2004">Sperberg-McQueen, C. M. and C.
      Huitfeldt (2004). GODDAG: A Data Structure for Overlapping Hierarchies. In King, P. and E. V.
      Munson, eds. Proceedings of the 5th International Workshop on the Principles of Digital
      Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science, pages
      139–160. Springer, 2004.</bibliomixed><bibliomixed xml:id="Stuehrenberg2009" xreflabel="Stührenberg and Jettka, 2009">Stührenberg, M.
      and D. Jettka (2009). A toolkit for multi-dimensional markup: The development of SGF to
      XStandoff. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup
      Technologies, vol. 3 (2009). Montréal, Québec. doi: <biblioid class="doi">10.4242/BalisageVol3.Stuhrenberg01</biblioid>.</bibliomixed><bibliomixed xml:id="Vitali2003" xreflabel="Vitali et al., 2003">Vitali, F., Amorosi, N., and N.
      Gessa (2003). Datatype- and namespace-aware DTDs: A minimal extension. In Proceedings of Extreme
      Markup Languages, Montréal, Québec.</bibliomixed><bibliomixed xml:id="vanderVlist2001" xreflabel="van der Vlist, 2001">van der Vlist, E. (2001).
      Comparing XML Schema Languages, 12 December 2001. <link xlink:href="http://www.xml.com/pub/a/2001/12/12/schemacompare.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.xml.com/pub/a/2001/12/12/schemacompare.html</link>.</bibliomixed><bibliomixed xml:id="vanderVlist2003" xreflabel="van der Vlist, 2003">van der Vlist, E. (2003).
      RELAX NG. O’Reilly, Sebastopol.</bibliomixed><bibliomixed xml:id="XML10" xreflabel="XML 1.0">Extensible Markup Language (XML) 1.0. W3C
      Recommendation, World Wide Web Consortium, 10 February 1998. <link xlink:href="http://www.w3.org/TR/1998/REC-xml-19980210" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/1998/REC-xml-19980210</link>.</bibliomixed><bibliomixed xml:id="XML" xreflabel="XML 1.0 (Fifth Edition)">Extensible Markup Language (XML)
      1.0 (Fifth Edition). W3C Recommendation, World Wide Web Consortium, 26 November 2008. <link xlink:href="http://www.w3.org/TR/2008/REC-xml-20081126/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2008/REC-xml-20081126/</link>.</bibliomixed><bibliomixed xml:id="XMLNS" xreflabel="XML Namespaces (Third Edition)">Namespaces in XML 1.0
      (Third Edition). W3C Recommendation, World Wide Web Consortium, 8 December 2009. <link xlink:href="http://www.w3.org/TR/2009/REC-xml-names-20091208/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2009/REC-xml-names-20091208/</link>.</bibliomixed><bibliomixed xml:id="XMLSchema2004" xreflabel="XML Schema 1.0 Part 0">XML Schema Part 0: Primer
      Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004. <link xlink:href="http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/</link>.</bibliomixed><bibliomixed xml:id="XMLSchema2004a" xreflabel="XML Schema 1.0 Part 1">XML Schema Part 1:
      Structures Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
        <link xlink:href="http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/</link>.</bibliomixed><bibliomixed xml:id="XMLSchema2004b" xreflabel="XML Schema 1.0 Part 2">XML Schema Part 2:
      Datatypes Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
        <link xlink:href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/</link>.</bibliomixed><bibliomixed xml:id="XMLSchema2009" xreflabel="XML Schema 1.1 Part 1">W3C XML Schema Definition
      Language (XSD) 1.1 Part 1: Structures. W3C Working Draft, World Wide Web Consortium, 3
      December 2009. <link xlink:href="http://www.w3.org/TR/2009/WD-xmlschema11-1-20091203/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2009/WD-xmlschema11-1-20091203/</link>.</bibliomixed><bibliomixed xml:id="XMLSchema2009-2" xreflabel="XML Schema 1.1 Part 2">W3C XML Schema
      Definition Language (XSD) 1.1 Part 2: Datatypes. W3C Working Draft, World Wide Web Consortium,
      3 December 2009. <link xlink:href="http://www.w3.org/TR/2009/WD-xmlschema11-2-20091203/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2009/WD-xmlschema11-2-20091203/</link>.</bibliomixed><bibliomixed xml:id="XSLT2" xreflabel="XSLT 2.0">XSL Transformations (XSLT) Version 2.0. W3C
      Recommendation, World Wide Web Consortium, 23 January 2007. <link xlink:href="http://www.w3.org/TR/2007/REC-xslt20-20070123/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2007/REC-xslt20-20070123/</link>.</bibliomixed><bibliomixed xml:id="XQuery" xreflabel="XQuery 1.0">XQuery 1.0: An XML Query Language. W3C
      Recommendation, World Wide Web Consortium, 23 January 2007. <link xlink:href="http://www.w3.org/TR/2007/REC-xquery-20070123/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2007/REC-xquery-20070123/</link>.</bibliomixed></bibliography></article>
