Refining the Taxonomy of XML Schema Languages. A new Approach for Categorizing XML Schema Languages in Terms of Processing Complexity

Refining the Taxonomy of XML Schema Languages. A new Approach for Categorizing XML Schema Languages in Terms of Processing Complexity Balisage: The Markup Conference 2010 August 3 - 6, 2010 This paper presents a refined taxonomy of XML schema languages based on the work by . It can be seen as first building block for a more elaborate formal analysis of XML and its accompanied specifications, in this case: XML schema languages such as DTD, XSD and RELAX NG. Maik Stührenberg Maik Stührenberg studied Computational Linguistics at Bielefeld University. After working for four years as research assistant at Giessen University in different text-technological projects, he is now a Ph. D. student and research assistant at Bielefeld University. His main research interests include XML schema languages and specifications for structuring and querying multi-dimensional annotated data. Christian Wurm Christian Wurm is a Ph. D. student in Computational and Mathematical Linguistics at Bielefeld University in the Cognitive Interaction Technology – Center of Excellence (CITEC) at Bielefeld University. Copyright © 2010 by the authors. Used with permission. XML Formal Language Theory Schema Languages The authors would like to thank both the reviewers for their constructive comments and our colleagues Marcus Kracht and Jens Michaelis who provided additional insightful remarks.

Introduction In this paper, we continue the fruitful research that has been performed on XML and formal language theory (see ). As there is a close correspondence between schema languages and a hierarchy of tree languages discussed by , and to which we will refer as the Murata hierarchy, formal language theory has been very useful to determine and describe the expressiveness and computability of different XML schema languages. From the point of view of formal language theory (ignoring things such as user-friendliness or software support, amongst others), the question which grammar formalism is most apt for defining a document grammar for a given XML markup language is determined by a trade-off between expressiveness on one side and processing complexity on the other side. The more expressive a grammar formalism is, the more resources we need for processing the corresponding languages, and the more likely are they to fall prey to ambiguity. Expressiveness thus always comes at a cost; it is, however, not always quite clear at which cost. To make these things more clear is one goal of this paper. We will look at some well-known classes of formal grammars/languages that are relevant for XML, and scrutinize their expressivity and processing cost. Pursuing this approach, we will see that there are some other interesting and relevant classes, which have not been formally established yet to the best of our knowledge, and which we will define in the sequel. These classes partly refine, partly complement the hierarchy of , and, as we will see, are partly already tacitly in use. Our main focus will not be on (non-)determinism in content models, that is, on properties of regular expressions; rather we will focus on determinism in tree structures, and try to formally clarify the relation between determinism and expressivity, as well as between locally and globally ambiguous grammars/trees. On the way, we provide some new results for the problem of ambiguity not only of grammars, but also of languages: as we will see, there are languages for which there is no unambiguous grammar, and not yet a class of grammars which generate all and only the languages for which there is an unambiguous grammar. In contrast to other comparative analysis of XML schema languages, such as or , the main research goal of our paper is thus a more elaborate and fine-grained theoretical approach based on work that has been already undertaken (see ). The application examples shown in this paper (especially the ones given in ) are clearly for demonstration purposes and do not necessarily introduce new findings.

XML and Formal Language Theory The history of document grammars begins long before XML's success: in the first published formal document type descriptions can be found, while 1986 SGML Document Type Definition (DTD, ) were established, followed by XML's DTD () and other XML schema languages that were created during XML's ongoing success. This time line begins even earlier, in 1955, when Noam Chomsky published his theory on formal grammar (, ). One of the several benefits of XML markup languages is the possibility to use a schema (a document grammar) to assure not only well-formedness, but also validity of an instance of a given markup language. The XML specification defines well-formedness as follows:

A textual object is a well-formed XML document if: Taken as a whole, it matches the production labeled document. It meets all the well-formedness constraints given in this specification. Each of the parsed entities which is referenced directly or indirectly within the document is well-formed. , Section 2.1 "Well-Formed XML Documents"

A valid instance in addition declares conformance and actually conforms to the rules of a schema of a given markup language. Or, in a more general way:

The intention or purpose of validation is to subject a document or data set to a test, to determine whether it conforms to a given set of external criteria. , p. 144

A validating parser takes an instance document (and the corresponding schema) as input and produces a validation report, which includes at least a return code reporting whether [t]he document is valid and an optional Post Schema Validation Infoset (PSVI), updating the original document's infoset (the information obtained from the XML document by the parser) with additional information (default values, datatypes, etc.) (). During this process different levels of validation may be checked, depending on the XML schema language used: validation of the instance's structure (i.e., the markup), datatyping (i.e., the content of individual leaf nodes), integrity (in terms of links, either between nodes within a document or between documents). In addition, other tests, usually called business rules may apply as well (see ). While one may differentiate between the terms valid (as defined by the XML specification and therefore only referring to validity according to a Document Type Definition) and schema-valid (i.e., valid according to one of the externally defined XML schema languages) we will use the former term throughout this paper as equal term for depicting the feature of confirmation to a given set of external criteria regardless of the schema language used. Furthermore, we will only discuss validation mechanisms through schema languages, therefore, technologies such as the Content Assembly Mechanism (CAM, see ) or meta-validation techniques such as Namespace-based Validation Dispatching Language () are not observed any further. Since we will focus on grammar-based schema languages, rule-based For a short discussion if Schematron is a rules language (or rule-based language) see . (or constraint-based) schema languages such as , the Constraint Language in XML (CLiX, ) or will not be observed either. For a different discussion on the topic of expressing constraints see . For clarification reasons we follow the definitions of :

A grammar-based schema language specifies the structure and contents of elements and attributes in an XML instance document. For example, a grammar-based schema language can specify the presence and order of elements in an XML instance document, the number of occurrences of each element, and the contents and datatype of each element and attribute. A rule-based schema language specifies the relationships that must hold between the elements and attributes in an XML instance document. For example, a rule-based schema language can specify that the value of certain elements must conform to a rule or algorithm.

When someone starts developing a new XML-based markup language sooner or later the question about a formalism to define the corresponding document grammar arises, since there is a variety of schema (definition) languages available. While a schema can be considered as a formal definition of a grammar of the XML-based markup language (e.g. as a set of rules or criteria), the schema language is a formal language for expressing schemas (). Usually, choosing a schema language depends on several factors such as familiarity with a given formalism or support provided by the chosen authoring software or processing tools such as or . These factors are very specific for one's own needs and environment and we will not give any advice regarding these topics. However, what we want to demonstrate in this paper are the differences in terms of expressiveness and computability between the three most used XML schema languages, starting with XML's inherent Document Type Definition (DTD, see ), based on SGML's DTD (see , , ) where a non-XML syntax is used We will not discuss any proposals for extended DTDs such as , , or since these play only minor roles in the wild, if any. , over W3C's XML Schema Description Language (, , ) and the formal language theory based RELAX NG (see , , as a successor to both RELAX () and TREX ().

Formal Language Theory and XML Although the formal model of an XML instance is always a single rooted tree, the different schema languages that can be used to define and constrain instances can be differentiated according to their expressiveness and – in a further step – according to their computability, which may be interesting when dealing with a task such as programming a validating parser. Different authors have dealt with the relationship between XML applications and formal languages, for example , , , , , , , , , , , , , , , , and . Often, a formal specification of XML's inherent ID/IDREF mechanism is omitted; however, , p. 33 claim that these references can be used to describe graphs rather than trees, since they allow for multidominance structures. (see for a practical implementation of using ID/IDREF for realizing graph structures within XML's tree model). uses modal logic to provide a semantics for XML-documents, and to characterize XML markup and search and retrieval mechanisms such as XPath. Other work leaves the formal model of XML and deals with graph structures that can be described by either XML or XML-like markup languages (see for a graph characterization of TexMECS and other overlapping markup formalisms), but in this paper we will concentrate on schema languages that describe well-formed (that is tree-like) XML-documents. Typically, DTDs are characterized as extended context-free grammars (see and ), that is, on the right-hand-side of a production rule regular expressions are allowed. This means for the declaration of an element that its allowed content is described by a regular expression using other element names (i.e., referring to other or the very same globally declared elements) or reserved keywords such as #PCDATA or EMPTY. In current work, especially and , a family of tree grammars is used to model XML schema languages; for example, DTDs are defined as local tree grammars which can be considered strongly equivalent to CFGs, with the only difference that they allow non-finitary branching:

Ignoring the attributes for a moment, there is a simple but elegant connection between DTDs and context-free grammars, namely, each DTD corresponds to an extended context-free grammar, where productions may have regular expressions on their right-hand side. Then, an XML document is valid with respect to the DTD precisely when its associated tree is a correct derivation tree for that grammar. , p. 13

In addition to using local tree grammars for characterizing DTDs, construe a taxonomy of XML schema languages. The authors introduce single-type tree grammars, as characterizing XML Schema, and use regular tree grammars to characterize RELAX NG. Although this work is quite extensive, the formal analysis can be further improved by clarifying some propositions. Given the (still growing) importance of XML and the broad range of tasks it is used for, stronger theoretical background seems to be the best way to find new applications. Before we present our results, we have to introduce the formal concepts we are working with.

Refining the Taxonomy of XML Schema Languages

Introduction to the Formal Concepts In this paper, we will mainly use tree grammars. Since the use of tree grammars is well established in the XML community, we will just shortly provide the necessary definitions; for a more explicit treatment and motivation, we defer the reader to or . Definition 1 A regular tree grammar (RTG) is a 4-tuple (N,T,S,P), where N is a finite set of nonterminals, T is a finite set of terminals, S is a set of start symbols, which form a subset of N, P is a set of production rules, which have the form: A → a(r), where A ∈ N, a ∈ T, and r is a regular expression over elements of N. We call A the left hand side of a rule, a the terminal or label which is introduced by the rule, and r its content model. We generally use uppercase letters for nonterminals, and lower-case letters for terminals. Note that this grammar generates trees, not strings, and that the nonterminals do not remain the labels of the (non-leaf) nodes they introduce, but are substituted by the terminal labels. The class of languages generated by RTGs is called the regular tree languages (RTLs). This is the most general class of languages we will consider here; and we now introduce various restrictions on this class of grammars, as they are defined by . Definition 2 We call two rules of a RTG competing, if they introduce the same terminal nodes, but have different left hand sides. Thus, A → a(r) and B → a(r') are competing. In general, in an RTG we can merge any two rules which have the same left-hand side and introduce the same terminal, by merging their content models, because for any two regular expressions we can easily form a single expression which denotes is the union of both. Therefore, we will generally assume that in our grammars any two rules with same left-hand side and same terminal do not exist. As a consequence, the concept of competing rules is the crucial point if we deal with determinism and ambiguity. For the same reason, we can speak of competing nonterminals almost in the same way as of competing rules: competing nonterminals are the left-hand sides of competing rules; a grammar has competing nonterminals exactly if it has competing rules, and exactly as many as. Definition 3 A local tree grammar (LTG) is an RTG with no competing rules. In an LTG, we have thus a one-to-one correspondence of nonterminals and terminals, which makes them very similar to context-free grammars (though not identical, since regular content models allow for non-finitary branching). Definition 4 A single type tree grammar (STG) is an RTG, where competing nonterminals must not occur in the same content model. As point out, LTGs roughly correspond to DTDs, STGs correspond to XML Schema, and RTGs correspond to Relax NG. Note that this correspondence is established by only regarding the "core" syntactic features of the schema languages, that is XML's inherent reference mechanism is not taken into account. These are thus the most important grammar types for XML. still add another type: Definition 5 A restrained competition grammar (RCG) is an RTG, where competing nonterminals must not occur in the same content model and with the same prefix of nonterminals; we thus disallow rules with identical left-hand side, terminals, and content models of the form (Γ A Δ) and (Γ B Δ'), where A and B are competing nonterminals, and where uppercase Greek letters refer to possibly empty sequences of nonterminals. Notably, the restriction concerns only the left context of the competing nonterminals. Of course, there exists a parallel definition for the right context. The problem is that both definitions lack some generalization, as they both generate different classes of languages, and there is no inclusion in either direction. If we, however, generalize the restriction of competition to both the left and the right context (which weakens the overall restriction on the grammar), some problems arise. We will discuss possible generalizations later on. It is easily seen that there is a hierarchy of proper inclusion of the grammar types presented: LTGs are always STGs, which are always RCGs, which are always RTGs, whereas the converse does not hold. We furthermore define an interpretation of a given tree against a given grammar as follows: Definition 6 An interpretation I of a tree t against a grammar G is a mapping from each node label of t, denoted by e, to a nonterminal N of the grammar, such that I(e) is a start symbol when e is the root of t, for each e and its daughter nodes e0, e1,...,en, there is a production rule A → a(r) in G, such that I(e) is A, the label of e is a, I(e0), I(e1),...,I(en) matches r. As is easily seen, with this definition, the ease of interpretation directly interacts with the Murata hierarchy. We will continue in this vein. To keep things clear, it is crucial to distinguish between the label of a node and its interpretation: the label of a node corresponds to its terminal in the production rule (recall that tree grammars directly generate trees, not strings via trees as CFGs), and it is immediately visible in the tree. By the interpretation of a node in turn we denote the nonterminal by which the node label has been produced. This nonterminal is not visible and has to be inferred. In addition, we have to distinguish between rule and rule instantiation: since content models are regular expressions over nonterminals, they denote sets of sequences of nonterminals, and one member of this set is an instantiation. An additional problem arises in interaction with the fact that there is no necessary one-to-one correspondence of nonterminals and terminals (i.e., labels); a possible consequence is that the same sequence of labels can be produced by different instantiations of a content model (we will exhaustively discuss this source of ambiguity later on). So far, we have introduced the main concepts which are well-known in the literature, and which we are going to use and elaborate in this paper.

An Example Grammar We will use the following example to demonstrate some of our findings. We want to define a document grammar for a text. The text may contain an optional title, followed by either at least a single section or a single paragraph. An optional author entity (possibly decoded using an attribute) may contain information about the text's author. Inside a section an optional title followed by other (sub-)sections or paragraphs are allowed. The title consists of raw text while a paragraph may contain raw text or a reference to other paragraphs, since these may have an optional identifier (using XML's ID type). If we try to express these constraints more formally we might end up with a grammar similar to the one shown in . Again, nonterminals are printed in capital letters, while node labels or terminals are printed in small letters. Note, that in this formulation elements, attributes and raw text are defined as terminals.

An example grammar S → text(author,Title? (Section|Para)) Section → section(Title? (Section|Para)) Title → title(#pcdata) Para → para(id, #pcdata|Xref) Xref → xref(href,ε)

The Example Grammar Realized by a DTD (Local Tree Grammar) Following , DTDs can be classified as local tree grammars. A possible realization for our example grammar can be seen in .

DTD realization of the example grammar <!ELEMENT text (title?, (section | para)+)> <!ATTLIST text author CDATA #IMPLIED> <!ELEMENT title (#PCDATA)> <!ELEMENT section (title?, (section | para)+)> <!ELEMENT para (#PCDATA | xref)*> <!ATTLIST para id ID #IMPLIED> <!ELEMENT xref EMPTY> <!ATTLIST xref href IDREF #REQUIRED> Since local tree grammars and DTDs only support globally declared elements (and locally declared attributes), the content models of the text and the section element share references to the same three elements (title, section and para) and contain both a sequence and a choice together with the occurrence indicators + (at least one occurrence) and ? (optional). The content model of the para element contains mixed content, that is both raw text and the xref element are allowed as children. Since DTDs force the use of the choice group (|) and the trailing asterisk (*) as occurrence indicator, there is no other way to define this specific content model using this schema language. Note that DTDs do not support any type mechanism. , , and suppose the extension of DTDs by adding types, while DTD++, proposed by adds namespace awareness on top, and DTD++ 2.0 (see ) even supports co-constraints.

The Example Grammar Realized by an XSD (Single Type Tree Grammar) An XML schema description (i.e., a single type tree grammar) of the same document grammar may look like the one in .

XSD realization of the example grammar <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="text"> <xs:complexType> <xs:complexContent> <xs:extension base="textType"> <xs:attribute name="author" type="xs:string" use="optional"/> </xs:extension> </xs:complexContent> </xs:complexType> </xs:element> <xs:element name="title" type="xs:string"/> <xs:element name="section" type="textType"/> <xs:element name="para"> <xs:complexType mixed="true"> <xs:sequence> <xs:element name="xref" minOccurs="0"> <xs:complexType> <xs:attribute name="href" type="xs:IDREF" use="required"/> </xs:complexType> </xs:element> </xs:sequence> <xs:attribute ref="id" use="optional"/> </xs:complexType> </xs:element> <xs:attribute name="id" type="xs:ID"/> <xs:complexType name="textType"> <xs:sequence> <xs:element ref="title" minOccurs="0"/> <xs:group ref="sectOrPara" maxOccurs="unbounded"/> </xs:sequence> </xs:complexType> <xs:group name="sectOrPara"> <xs:choice> <xs:element ref="section"/> <xs:element ref="para"/> </xs:choice> </xs:group> </xs:schema> Note that this XML schema description is only one possible realization out of a variety of different XML schema descriptions that would fit our needs. Although it may be not very human-readable, it was designed to show some features that are supported by XSD. The text element is derived by extension of the globally declared complexType textType which itself refers to the globally declared model group sectOrPara. The schema contains both locally and globally declared attributes (author vs. id) and elements (xref as an example for a locally declared element). Apart from that, XSD supports which are not shown in the example above. As has already pointed out, that the actual extra expressive power of XSDs over DTDs can only be used to a very limited extent due to the Element Declarations Consistent (EDC) constraint (see , Section 3.8.6).

The Example Grammar Realized by a RELAX NG Grammar (Regular Tree Grammar) RELAX NG can be classified as regular tree grammar according to . A possible realization with RELAX NG is shown in .

RELAX NG realization of the example grammar <?xml version="1.0" encoding="UTF-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <start> <element name="text"> <optional> <attribute name="author"/> </optional> <choice> <optional> <group> <ref name="element.title"/> <ref name="element.section"/> </group> </optional> <optional> <group> <ref name="element.title"/> <ref name="element.para"/> </group> </optional> </choice> </element> </start> <define name="element.title"> <element name="title"> <text/> </element> </define> <define name="element.section"> <element name="section"> <optional> <ref name="element.title"/> </optional> <ref name="sectOrPara"/> </element> </define> <define name="element.para"> <element name="para"> <optional> <attribute name="id"> <data type="ID"/> </attribute> </optional> <zeroOrMore> <choice> <text/> <element name="xref"> <attribute name="href"> <data type="IDREF"/> </attribute> <empty/> </element> </choice> </zeroOrMore> </element> </define> <define name="sectOrPara"> <group> <oneOrMore> <choice> <ref name="element.section"/> <ref name="element.para"/> </choice> </oneOrMore> </group> </define> </grammar> Compared to DTD or XSD, RELAX NG is based both on the mathematical theory of regular expressions and the concept of hedge grammars ( and ). As an XML schema language, RELAX NG has some advantages over other schema languages: while in DTDs and XSD mixed content models may contain child elements and text nodes in any arbitrary order, RELAX NG allows for ordering of the element child nodes (see , p. 57f.). Co-occurrence constraints can be used to specify the content model of an item according to the value of another item, allowing non-deterministic content models which cannot be realized in DTD or XSD (see , p. 62f, and for a discussion). In general, a co-occurrence constraint (or co-constraint as they are called by ) may be a constraint over multiple items, not just two items and may exist between XML structure components (elements, attributes) as well as between data values. One may differentiate between element-to-element, element-to-attribute, or attribute-attribute co-occurrence constraint, based on the items involved. In addition, SGML's interleave operator & (see , p. 291) that is missing in XML DTD and XSD can be used in RELAX NG as well, although this adds nothing to its expressive power. In contrast to the two other schema languages discussed in this paper, RELAX NG does not support default values (which are supported for attributes in DTD and for attributes and elements in XSD). While both DTD and XSD support XML references via ID/IDREF(S) attribute types, RELAX NG has no included datatype library; however, as seen in the example grammar, it is possible to include the datatype library of . The document instance given in would be valid according to all of the above defined document grammars (the example shows validation against the XML schema, adding a Doctype declaration and removing the noNamespaceSchemaLocation attribute of the root element would result in a valid instance according to the DTD. Note that RELAX NG does not contain a standard way to associate a RELAX NG schema to an XML instance since it was designed as part of the ISO DSDL framework (in this framework, the should be used as general external mechanism for validating instances).

Valid XML instance <?xml version="1.0" encoding="UTF-8"?> <text xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="text.xsd" author="maik"> <title>A simple title</title> <section> <title>A section title</title> <para id="p1">Introductory para</para> <section> <title>A subsection title</title> <para>Some text with a reference: <xref href="p1"/>.</para> </section> </section> </text>

Determinism and Local Ambiguity Determinism is a important property for XML Documents, schema languages and interpretation. If a grammar is deterministic, parsing will be much more efficient, since in general we do not have to keep in mind any information, do not have to backtrack, do not need any non-local information in case we search, etc. present an algorithm using tree-automata that allows for efficient parsing of regular tree grammars. The concept of determinism is closely related to the concept of local ambiguity: if there is no local ambiguity, then at every point in the parsing process we know the structure (in this case: interpretation) of what we have seen so far, because there is only one possible local analysis. If there is some local ambiguity, non-determinism arises: we cannot assign a unique interpretation locally, for we would need information which is not available yet, and we need to apply some heuristics.

Determinism, Algorithms and Local Ambiguities In this section we will review the concept of determinism, as opposed to local ambiguity of a grammar. As introductory issue, we show that determinism does not only depend on the grammar we use, but also on the algorithm. In regular tree languages, there can be no one-dimensional concept of determinism, as there is for regular string languages. Note that this is more than a metaphor, since we can perceive of regular string languages as one-dimensional structures; in order to talk about trees or the strings formed by their leaves, we need at least two dimensions (see ). In the remainder, we will show how different grammar types provide determinism for all, some particular class of, or no algorithms.

Deterministic Content Models Firstly, we consider the concept of deterministic content models. This draws on the notion of deterministic (or 1-unambiguous) regular expressions (DREs), thoroughly surveyed by See for a discussion of the pros and cons of 1-unambiguous content models in XML schema languages. . Assume we have a string s which is denoted by some regular expression E. Assume furthermore we build our expressions with letters from an alphabet Π, which is identical to the alphabet Σ, except for the fact that we have additional indices for the letters (taken here from the set of natural numbers). If we build an expression from Π, we have to make sure that every index must occur at most once in it. Constructors for regular expressions are as usual, and indices are passed on to the letters of the strings the expression denotes. We say that a letter in s instantiates a letter in E, if the following holds: Definition 7 A letter ai in s instantiates a letter aj from E, if i = j. The index ensures that for every string an expression denotes, there is a unique surjective mapping from letters in s to letters in E. We now define a mapping ♮ on strings over Π to strings over Σ, such that ♮(xi):= x, ♮(xs):=♮(x)♮(s), where s is a string and x a letter (the first of the string). This homomorphism simply deletes the indices and leaves anything else untouched. Note that for a unique string ♮s, and a given expression E, there might be several s ∈ E, such that their mapping under ♮ is identical. In this case we say that a letter in s might instantiate several letters in E. A deterministic regular expression over Π is then defined as follows: Definition 8 E is deterministic or one-unambiguous, if for all strings u, v, w over Π, and all letters x, y in Π, the following holds: if uxv and uyw ∈ L(E), and x ≠ y, then also ♮ x ≠ ♮ y. This means that we can skip the indices, and we still know which letter in E is instantiated by any letter in s, simply from knowing its left context. Formally, this means that the mapping ♮ is reversible. A regular language is deterministic if it is denoted by a deterministic regular expression. A simple example of a non-deterministic expression is the expression a*b*a*, as we can easily check for the string a, which might be the instantiation of either of the two as. As a consequence for processing, quite informally, we can state that reading s ∈ L(E) from the left to the right, where E is a DRE, at any point in s, we know at which point in E we find ourselves. In automata theory, DREs correspond to deterministic Glushkov automata. There is however a problem if we apply this concept to content models in regular tree grammars: in regular expressions, we can see from a letter in the string which type of letter in the expression it instantiates (thus, we have a unique letter in Σ, though not in Π). We can thus deduce from a letter in the string a letter in the expression, though not a letter instantiation, if the expression is not deterministic. We cannot, however, deduce from a given tree node label a unique type of nonterminal: if we have competing rules, different nonterminals introduce identical labels; and we still have to keep them apart. Thus, if the content model of nonterminals itself is deterministic, this is of little use if we cannot infer from a given label the unique nonterminal it belongs to. By way of example, consider the following grammar rule: A → a(ABC|CBA) Its content model is surely deterministic. However, if A and C are competing rules (have identical labels), this is of little use. We have to check each subtree until we have its unique interpretation. This means, in the worst case, we have to check both subtrees (as we will see, this is the case in which the trees generated by A (C) form a subset of the trees generated by C (A)). The obvious reason for the fact that this concept of determinism comes short is that it originates in one-dimensional strings. As our trees are two dimensional, we can define determinism only with respect to directions in which our analysis proceeds. The main difference is, of course, the one between top-down and bottom-up processing. In this paper we will not consider the difference between depth-first and breadth-first parsing, though this is surely worthwhile. In the next subsection, we will reformulate and complete the Murata hierarchy in a way, that makes clear which kind of determinism is facilitated by which kind of grammar. In the sequel, we will disregard the one-dimensional problem of non-deterministic content models, since they have been thoroughly analyzed, and we have nothing to add (see ). At this point, our interest is the second dimension: importantly, this means that talking about determinism, we implicitly always add: provided that content models are deterministic in the above sense. We thus exclude all problems which may arise from regular expressions.

The Murata Hierarchy as Hierarchy of Locality Conditions As our main concern will be the formal properties of the grammar types, as they affect processing, we will firstly reformulate the hierarchy. This reformulation aims at making clear which information we need in order to uniquely interpret a local node or subtree. In a local tree grammar, for any node a in any context, we know its unique interpretation. This is obvious, since for any node label, there is only one single rule which generates it, by the very definition of a local tree grammar. As a consequence, parsing is deterministic for any algorithm (provided we have deterministic content models), and the problem of giving a certain node of a given tree its interpretation is solvable in a constant amount of steps. In a single type tree grammar, for any node label a in any context, we can determine its unique interpretation if we know the interpretation of its mother node. This follows directly from the definition: if we know the interpretation of a node's mother node (rules correspond to interpretations), we know its content model. Within a content model there must not occur any competing nonterminals. Note, however, that it is not sufficient to know the mother nodes label. We can easily construct an example to show this: we have two competing rules, A → a(C) and B → a(D), whose nonterminals do not occur in the same content model of any rule. Furthermore, we have the two rules C → b(r) and D → b(r'). Then both nodes, as introduced by C and D, have label b, their mother nodes both have the label a, despite they have different interpretations. For processing, this has an immediate consequence: a top down parser will at any point immediately know the interpretation of any node, whereas if we start interpretation from the bottom, in the worst case we will have to go up to the root in order to get the correct interpretation. The matter of providing the interpretation of a given node is nonetheless a linear search problem, since for a given tree and a given node, it is sufficient to go a path from the root to that node. In a restrained competition grammar (RCG), in order to give a node its unique interpretation, we must have the interpretation of its mother node, and check its left siblings, in case it has any. Note that from how we defined RCGs, it follows that we only need the label of the siblings, not their interpretation: because any two competing nonterminals within a single content model are distinguished by a unique prefix, this prefix itself must not consist of competing nonterminals only, and neither must it be empty. This keeps the grammar unambiguous, and easy to process. However, it makes some restrictions we do not necessarily want to make: maybe the unique interpretation of a label should not depend on a left sibling, but on a right sibling. For example, in RCGs we cannot have competing nonterminals as leftmost symbols in a content-model, but as rightmost, given some left context. This causes an asymmetry which seems quite arbitrary. Of course we can equally define the asymmetric counterpart of RCG, checking for unique suffixes instead of prefixes; but care is to be taken: since we have to fix the type for the class of languages (i.e., document grammars) we define, we have the same problem. If we generalize the concept to both unique suffix and prefix, some problems arise, which can however be remedied. We now define a generalized restrained competition grammar (GRCG), as follows: In a GRCG, for any two competing nonterminals A and B within a single content model r, one of (Γ A Δ) and (Γ B Δ) fails to match r (Greek variables range over possible empty sequences of nonterminals). We now have generalized the restriction from the left (right, respectively) to the entire context. Note that we have relaxed the overall restriction on the grammar, by making the restriction on content models more specific (indeed, this type properly includes the RCGs). This little relaxation however causes a vast increase in processing complexity: because now, in order to give its unique interpretation to any node a in any context, in the worst case one needs to know the interpretation of its mother, the interpretation of its siblings, and the interpretation of its subtrees as well. And even then, GRCGs might still be ambiguous, allowing more than one interpretation for a entire single tree. This needs some explanation. The first point is easy to see: that we need to know the interpretation of the mother node follows a fortiori from the preceding argument (single type grammars are properly included in restrained competition grammars). But this is insufficient, since competing nonterminals may occur within the same content model. We have to match all the sister labels to the nonterminals of the content model of the mothers interpretation in order to get a unique interpretation (according to the definition). Note, however, that we need the interpretation of the sister nodes; it is not sufficient to have their labels. This we can easily verify with the following grammar rule: A → a(BC|B'C'), where B and B' and C and C' are competing rules, introduce the labels b and c, respectively. This satisfies all conditions on GRCGs. In order to get the correct interpretation for the labels b and c, it is not sufficient to know the sister node's label, but its interpretation. Things can thus get even worse, if we consider the case where the interpretation of a sister node depends on the interpretation of the node under consideration itself. Look at the following example rule: A → a(BC|CB), where B and C are competing rules. Obviously, they are both uniquely determined by their neighbor within the content model of A (A may only occur with B to its left or its right, and vice versa). However, as they carry the same labels, it is insufficient to determine either of them if we just check the label of its sister (since it is identical). Furthermore, we might have the case where it is impossible to interpret one of the subtrees, without its sisters interpretation (e.g., if one of the competing nonterminals generates a subset of the trees generated by the other). We will consider this case more thoroughly in the next section, showing that there are globally ambiguous GRCGs, and that for every language that can be generated by an RTG, there is also a GRCG grammar that generates the same language. From the point of view of processing, we see that in GRCGs, expressive power comes at a high cost: neither a bottom-up nor a top-down parser is capable of assigning a unique interpretation locally, and maybe not even globally. The problem of giving a given node its unique interpretation might thus be an exponential search problem, and in the worst case not even decidable. We will therefore introduce a subtype of GRCG, which we will call unambiguous RCG. This type is properly included in the class of GRCG grammars, and includes properly the class of STGs as well as RCGs, as can be seen easily. We now introduce unambiguous restrained competition grammars (URCGs). What we want to eliminate is ambiguity, which can be caused by the fact that in a GRCG, we might have entire competing contexts, or the interpretation of the context of a label in a content-model might depend on the interpretation of the label itself. In the resulting grammar, it should be possible to yield the unique interpretation of a node from the interpretation of its mother and the labels (not interpretations) of its sisters. We characterize the grammar type in the following terms: We introduce an alphabet of meta-variables O, which we use in the following way: We form a set of all sets of nonterminals from N which compete with each other; we call these sets the competition sets (which are possibly singletons). We are interested in the sets where every nonterminal occurs in exactly one competition set, such that the whole set forms a partition of N. For every such partition we iterate the following procedure: To every competition set, we assign a single symbol from O. We call this an O-assignment. Then, for all content models, we check for all nonterminals, whether the content models still satisfy the GRCG condition, if we replace all other nonterminals by the symbols from O they are assigned to. In case there is more than one overall assignment, that is, a single nonterminal belongs to more than one competition set, we have to iterate this for every possible assignment. If for every assignment, nonterminal and content model, the resulting grammar is a GRCG, then the original grammar is a URCG. Note that the assignments are only introduced for this evaluation procedure. We will call contexts which are identical under the O-assignment similar. We define accordingly a URCG as a grammar where competing nonterminals must not occur in the same content model and in similar contexts (this obviously subsumes identical contexts). It is easy to see that now we have made sure that competing nonterminals must not occur within the same contexts of labels (as opposed to nonterminals). Thus, a URCG is a GRCG where for every node we find its unique interpretation if we have the interpretation of the mother node and the labels of its sister nodes. In particular, we can interpret any node without having to recur to its sister node's interpretation: the content model (BC|CB), where B and C are competing nonterminals, does not satisfy the condition, because both B and C occur in the context __X or X__, respectively, where X is the O-assignment for both. This kind of grammar is useful for the following reason: there is no global ambiguity in it (as we have deleted the only source of ambiguity, that interpretations of labels may depend on each other); and it is the strongest of the non-ambiguous grammar types we have considered. However, it is not capable of generating every language which is not inherently ambiguous, as we will show later. Note, that in order to provide the unique interpretation of a node, we still might have to check all of its sister nodes, but it is sufficient to check the labels. It is easily seen that URCGs properly include RCGs, as both left and right context can count as distinctive (we will make this more precise later on). As we will also see further down, there are languages which can be generated by GRCGs, but not by URCGs. This will follow from the fact that actually every RTL can be generated by a GRCG, but there are languages for which there are no unambiguous grammars, and, obviously, URCGs are always unambiguous. Interestingly, the search problem for URCGs is still linear, since we only need to go down the path from the root to a given node, and in addition check finitely many sister labels (note that while regular expressions allow for arbitrary branching, the branching of a given tree is, of course, always finite). An example of a URCG could be the use of attribute based co-occurrence constraints or attribute-element constraints in the following RELAX NG declaration. We extend our example grammar by adding a type information to the section element. If the type is set to the value "global" other section child elements are allowed as part of the content model, if its value is set to "sub" only para child elements are allowed (see ).

Extended RELAX NG grammar <?xml version="1.0" encoding="UTF-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <start> <element name="text"> <optional> <attribute name="author"/> </optional> <optional> <ref name="element.title"/> </optional> <oneOrMore> <choice> <ref name="element.section"/> <ref name="element.para"/> </choice> </oneOrMore> </element> </start> <define name="element.section"> <choice> <oneOrMore> <element name="section"> <optional> <ref name="element.title"/> </optional> <optional> <attribute name="type"> <value>global</value> </attribute> </optional> <oneOrMore> <choice> <ref name="element.section"/> <ref name="element.para"/> </choice> </oneOrMore> </element> </oneOrMore> <oneOrMore> <element name="section"> <optional> <ref name="element.title"/> </optional> <optional> <attribute name="type"> <value>sub</value> </attribute> </optional> <ref name="element.para"/> </element> </oneOrMore> </choice> </define> <define name="element.title"> <element name="title"> <text/> </element> </define> <define name="element.para"> <element name="para"> <interleave> <optional> <attribute name="id" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <data type="ID"/> </attribute> </optional> <optional> <element name="xref"> <attribute name="href" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <data type="IDREF"/> </attribute> </element> </optional> <text/> </interleave> </element> </define> </grammar> The interesting part is the definition of the element.section pattern. It allows for two different elements named "section" with different content models according to the value of the optional type attribute. The result is that the instance shown in is valid according to this RELAX NG grammar while the one shown in is not.

Valid XML instance according to the extended RELAX NG grammar <?xml version="1.0" encoding="UTF-8"?> <text author="ms"> <title>A simple title</title> <section type="global"> <title>A section title</title> <para id="p1">Introductory para</para> <section type="sub"> <title>A subsection title</title> <para>Some text with a reference: <xref href="p1"/>.</para> </section> </section> </text> Without any type attribute the instance shown in would still be valid.

Invalid XML instance according to the extended RELAX NG grammar <?xml version="1.0" encoding="UTF-8"?> <text author="ms"> <title>A simple title</title> <section type="sub"> <title>A section title</title> <para id="p1">Introductory para</para> <section type="sub"> <title>A subsection title</title> <para>Some text with a reference: <xref href="p1"/>.</para> </section> </section> </text> Other well deployed RELAX NG examples of attribute based co-occurrence constraints can be found in , Chapter 7, in , Section 15, or in the RELAX NG schema for the JLTF (Japanese Layout Taskforce) aligned document shown in . Importantly, co-occurrence constraints do not add anything to the formal generative capacity of the grammar. This is because attributes (or their values, respectively) add an additional specification to the terminals. Thereby we can convert competing terminals (or, equivalently, rules) into non-competing ones, but not vice versa. Any co-occurrence constraint thus gives us the possibility to distinguish maybe otherwise indistinguishable non-terminals, thereby at most keeping the complexity of the grammar constant, or even reducing it. Furthermore, as co-occurrence constraints do only affect immediate subtrees (i.e., content models), their expressivity is entirely contained within the expressive capacities of standard regular tree rewriting rules; the only thing we might need to add to our formal grammar model is some additional specification on the terminals. Neither DTD discuss DTD++ 2.0 which supports a large number of co-constraints using a syntax closely resembling DTD. nor XSD 1.0 support such attribute-element constraint, although there are some workarounds or hacks that can be used in XML schema to mimic co-occurrence constraints: either the use of the xsi:type attribute or xs:key See , p. 65 and http://ajwelch.blogspot.com/2008/06/xml-schema-co-occurrence-constraint.html for further information. . Another option to realize this particular constraint is the use of embedded business rules or conditional type assignment using type alternatives or assertions that are introduced in (for complex Types) and (for simple Types). shows a possible XSD 1.1 realization.

XSD grammar with XSD 1.1 <code>assert</code> element <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="text"> <xs:complexType> <xs:complexContent> <xs:extension base="textType"> <xs:attribute name="author" type="xs:string" use="optional"/> </xs:extension> </xs:complexContent> </xs:complexType> </xs:element> <xs:element name="title" type="xs:string"/> <xs:element name="section"> <xs:complexType> <xs:sequence> <xs:element ref="title" minOccurs="0"/> <xs:group ref="sectOrPara" maxOccurs="unbounded"/> </xs:sequence> <xs:attribute name="type" use="optional"> <xs:simpleType> <xs:restriction base="xs:string"> <xs:enumeration value="global"/> <xs:enumeration value="sub"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:assert test="@type!='sub' and (child::para | child::section) or @type='sub' and not(child::section)" /> </xs:complexType> </xs:element> <xs:element name="para"> <xs:complexType mixed="true"> <xs:sequence> <xs:element name="xref" minOccurs="0"> <xs:complexType> <xs:attribute name="href" type="xs:IDREF" use="required"/> </xs:complexType> </xs:element> </xs:sequence> <xs:attribute ref="id" use="optional"/> </xs:complexType> </xs:element> <xs:complexType name="textType"> <xs:sequence> <xs:element ref="title" minOccurs="0"/> <xs:group ref="sectOrPara" maxOccurs="unbounded"/> </xs:sequence> </xs:complexType> <xs:group name="sectOrPara"> <xs:choice> <xs:element ref="section"/> <xs:element ref="para"/> </xs:choice> </xs:group> <xs:attribute name="id" type="xs:ID"/> </xs:schema> The section element contains an XSD 1.1 assert element that uses an XPath expression in its type attribute to restrain the possible child elements according to the type attribute's value of the section element. Note that when using XSD 1.1 assertions for expressing co-occurrence constraints we are not limited to immediate subtrees: although the evaluation of the XPath expression is done in the context of the parent element (i.e., XSD 1.1's xs:assert element uses a tree fragment rooted at the element whose type it is tested against) one could put the assertion at the level of the common ancestor, that is, the element that contains all the data needed to compute the assertion. The support for full XPath (i.e., axes such as ancestor, parent or preceding and preceding-sibling and following and following-sibling, respectively) may be implementation-dependent. XSD 1.1 type alternatives are restricted to tests against constants or attributes on the element itself but not to ancestors, descendants, siblings or children or their attributes while Schematron business rules are not restricted to an XPath subset. There is one type we should add, which cannot be assigned a place on the hierarchy from STGs to RTGs, which is, however, weakly and strongly between LTGs and RTGs. Grammars of this type satisfy the following conditions: we want to be able to assign a unique interpretation to any node a, provided we know the complete subtree it governs. This kind of grammar would facilitate deterministic parsing for bottom-up algorithms. In terms of grammar, this imposes the following restrictions on the formalism: Every leaf-terminal is introduced by a single nonterminal, for every nonterminal N in a given grammar, there is at most one rule which has a given terminal a, and N appears in its content model. We call this grammar type unique subtree grammar (USG). Note that this grammar type does not include STGs and URCGs, nor is it included by them. The restrictions do not restrain the occurrence of competing nonterminals in content models, but rather the labels which belong to nonterminals in the content model. More precisely, whereas the former types restrain the occurrence of competing nonterminals within content models, USGs restrain the content models of competing rules itself. However, as all other types, they properly include the class of LTGs, as every LTG is a USG, but not vice versa, and it easy to find a language which is generated by a USG, but not by LTG. Furthermore, they are properly included in the class of RTGs, in the strong and the weak sense (and weakly within the class of GRCGs, as we will see). As is also easy to see, the distinctive USG property provides deterministic bottom up parsing. In order to get the interpretation of a given node, it is therefore sufficient to find a path from the leaves, which is a linear search problem.

Structure and Global Ambiguity First, we will present some well-known results, which are important for our further discussion. Theorem 1 RTGs are weakly equivalent to CFGs (that is, the set of strings of leaves generated by CFGs is equivalent to the set of strings of leaves generated by RTGs). Theorem 2 RTGs are strongly equivalent to graphs generated by a CFGs, modulo a homomorphism of node labels (i.e., a homomorphism which maps various node labels in a given tree onto a single one), provided the RTG has finitary branching. Theorem 3 LTGs are strongly equivalent to graphs generated by CFGs, provided finitary branching. Proof is trivial: as in LTGs every node label is generated by exactly one nonterminal; and in CFGs, nodes which are not leaves are labeled by nonterminals, there is a one-to-one correspondence. This has some importance for the relation of LTGs and RTGs. It follows, that LTGs and RTGs are equivalent up to homomorphism, as well as all grammar types in between. Every discussion we have about generative capacity concerns only non-equivalence in isomorphism. Since also the strings of leaves for all grammar types are identical, we can only be interested in the sets of trees. In the sequel, if we speak of the weak generative capacity of a tree grammar, we refer to the sets of trees it generates, not the strings of leaves. Since to our knowledge, only the strong generative capacity of the grammar types of has been in the focus of research, we will now scrutinize their weak generative capacity. Theorem 4 The sets of trees generated by STGs form a proper subset of the sets of trees generated by RCGs. For proof, consider the trees generated by the following grammar: S → a(AB) A → b(C) B → b(D) C → c(D) D → c(ε) This is an RCG, since A and B do not occur in similar contexts. In order to see that there is no way to generate this tree with an STG, consider the following fact: a governs two identical labels, which however govern different subtrees. It is therefore impossible to introduce them with identical rules, and (by definition of STGs) forbidden to have two competing rules in the content model of the first rule. This is sufficient for the proof of weak inclusion, since all STG rules are also RCG rules, and therefore the languages generated form a proper subset. Theorem 5 The sets of trees generated by LTGs form a proper subset of the sets of trees generated by STGs. Consider the trees generated by the following grammar: A → a(B) B → b(C) C → a(D) D → c(ε) This is a single type tree grammar, and no LTG is able to generate such a tree (remember that LTGs are strongly equivalent to graphs generated by CFG, provided finitary branching).

Restrained Competition Grammars and Variants In this section we will scrutinize formal properties of the different types of restrained competition grammars. We will show which kind of languages cannot be generated by RCGs; we will prove that there are GRCGs which are ambiguous, and that for every language which can be generated by an RTG, there is a GRCG which generates the same language. Finally, we will show that URCGs do not have these properties, are properly included within GRCGs and properly include RCGs. The type of languages we cannot generate with RCGs is quickly described as follows: all grammars, where a single content model contains competing nonterminals, which can be uniquely distinguished only from their left (right, respectively) context, are not RCGs. Consequently, we cannot generate sets of trees, where a certain node has different subtrees depending on its right siblings. If we want to get rid if this asymmetry, and allow for GRCGs, where competing nonterminals in a single content model are uniquely determined by their left or right context, we run into problems: Theorem 6 There are GRCGs which are ambiguous. For proof, consider the following rule: S → a(AB|BA). Suppose, A and B are competing nonterminals; suppose furthermore, that there is some overlapping between A and B; i.e., the nonterminals generate overlapping sets of trees. In particular, we may assume that the trees generated by A form a subset of the trees generated by B. For example, A and B generate identical trees up to depth n; B in addition generates a tree of depth n+1. In this case, the trees of the language have the root a, with two symmetrical sets of subtrees up to depth n, and possibly one subtree with depth n+1. It is easily seen that now it is impossible to merge A and B, for then we would be incapable of expressing the condition that at most one subtree has depth n+1. However, for the trees, where the subtrees introduced by A and B have depth at most n, there is necessarily more than one analysis. The grammar we have described so far is, however, a GRCG, because neither A nor B occur in identical contexts (though in similar contexts, remember the preceding section). Theorem 7 For every language which can be generated by an RTG, there is a GRCG which generates the same language. To proof this theorem, we describe a simple procedure to convert any RTG into a GRCG, which generates the same language. We define competing sequences of length n of nonterminals as follows: two sequences of nonterminals compete, if for all n, the nth nonterminal of one sequence competes with the nth nonterminal of the other sequence. We have to assume a content model r which is not GRCG conform. Therefore, there have to be two competing nonterminals or competing sequences of nonterminals A and B in r, such that for possibly empty sequences of nonterminals Γ and Δ, (Γ A Δ) and (Γ B Δ) match with r. Given this, we can be sure, that in the instantiations of r, which violate the GRCG condition, A and B occur in exactly the same global tree contexts. By global tree context we here mean that a tree with a governing the subtrees generated by A is part of the language iff a also governs the set of subtrees generated by B. Since this is the case, we can simply merge the two nonterminals to a new one, C, which is the union of the former two. This new nonterminal substitutes all instantiations of A and B, which occur in the same global tree context. This, by definition, are the instantiations which violate the GRCG condition. This we can apply to all nonterminals which violate the GRCG condition. The only thing we have to take care of is that we apply this only to those instantiations of the content models where two competing nonterminals match equally (this might force us to change some regular expressions). We do not show an exact algorithm at this point, since it is clear that an equivalent GRCG exist, and the details of the construction are of no practical interest at this point. We now show that there is a hierarchy of proper inclusion RCG ⊂ URCG ⊂ GRCG. To show that RCG ⊂ URCG, consider the following: every rule which is admitted by an RCG is also admitted by a URCG, because if competing nonterminals in the same content model have a unique prefix, a fortiori they also have a unique context (we have already shown that a unique prefix of nonterminals is paramount to a unique prefix of labels/siblings, by induction). Above, we have already shown that for an RCG it is impossible to generate languages as the following, which is a URCG. S → a(AC|BD) A → b(C) B → b(D) C → c(ε) D → d(ε) A and B compete, but are determined uniquely by their context. This concludes the first part; the second part will be a corollary of the next section: We will show that some languages are inherently ambiguous, that is, there is no unambiguous grammar for them. By Theorem 7 we know that we can generate these languages with GRCGs, but URCGs cannot: Proposition 1 A URCG cannot be ambiguous. This is easy to see: an ambiguous grammar assigns two different sequences of nonterminals to the daughters of one node (since root nodes are unambiguous): Then however, there must be at least two competing nonterminals which occur in the same content model in similar contexts, which, by definition, is impossible.

Inherently Ambiguous Languages As a corollary, we can show that there are regular tree languages, for which there is no unambiguous grammar. There are sets of trees, which are generated by an ambiguous GRCG, but by no URCG. We will call these languages inherently ambiguous. Theorem 8 Some regular tree languages are inherently ambiguous. This can be seen easily, if we spell out a grammar which we described in the above subsection. We will then show that there is now way to write an unambiguous grammar which generates the same language.S → r(AB|BA) A → a(C) B → a(D) C → b(ε) D → b(ε|E) E → c(ε) There is no way to merge A and B, since they generate different sets of subtrees (we can write L(A)≠L(B)); but since they overlap (L(A)∩ L(B)≠ ∅), there is no way to have a unique interpretation in the cases where the subtrees generated by the nonterminals are identical. There will always be two ways to generate trees in this case. We can, furthermore, precisely state the conditions, under which a regular tree language is ambiguous. To this end, however, we need to introduce some notation. We now for simplicity write trees as terms: a tree with root a and daughters b and c is denoted as a(b,c), etc. As a next step, we define a context as the position, where certain subtrees occur within trees of a language. Definition 9 A context C is a tree-term with exactly one variable. We say that a set of subtrees α occurs in a context C in a language L, if the following holds: We can instantiate the variable of C with any tree from α, and the resulting tree is in L. Note that sets of subtrees correspond to nonterminals, when we speak of languages rather than grammars. In the sequel, for simplicity we will use lower case Greek variables equally for sets of subtrees as for ordered sequences of sets of subtrees. The definition of a context is easily accommodated to sequences. A set of sequences of trees of length n consists of ordered tuples of trees of length n, of the form (t0,...,tn-1). Sets of subtrees are then simply sets of one-tuples. Importantly, we will not provide a proof for the following proposition, and leave it open as a conjecture. However, we will sketch the argument. We now make the following conjecture: Conjecture 1 A tree-language is inherently ambiguous iff at least one node fulfills all of the following conditions: We need to have one node with an arbitrary label a, with at least two (sequences of) sets of subtrees α and β, such that α ∩ β ≠ ∅; α ≠ β; There is at least one context C in L, such that both a(Γ, (t1,...,tn), Δ, (u1,...,un), Θ) and a(Γ, (u1,...,un), Δ, (t1,...,tn), Θ) occur in C, for all (t1,...,tn) ∈ α and all (u1,...,un) ∈ β, where uppercase Greek letters designate possibly empty sequences of daughter sub-trees; note that the sequences need to have equal length in order to meet condition 1. Due to space restrictions, we leave the prove for this conjecture open here; this reminds however of a theorem in for string languages. But we will give some rather informal discussion of the points in the next section. It is not hard to see that this is merely a generalization of the cases we have been described above. As we will see, we can derive some useful facts from these properties of ambiguous languages, even without a general proof: we can show that we can construe grammars for languages which do not fulfill one of the conditions, and, moreover, which type of grammars we can construe.

Unambiguous Languages Theorem 9 From the grammar types sketched so far, there is no type which generates all and only the RTLs that are not inherently ambiguous. We will demonstrate this going through the three conditions mentioned in the preceding section, and look which unambiguous grammar we can construe if one condition is not met. This is to be read as follows: if one condition is not met, then it means, that from all nodes of the tree language, there might be any one which meets the ones not in question, but none which meets the one currently under consideration. If there is no intersection between the subtrees of a given node, the grammar is of course not ambiguous. We can, however not necessarily construe a URCG for this grammar, since in the content model of the mother node there are competing nonterminals in similar contexts (recall the example given above). We can, however, construe a USG for such a language, since subtrees are uniquely identifiable. This means that there are no two sets of subtrees governed by the same node which are not identical. We can thus introduce them by the same nonterminals, and have a local tree grammar (having no different sets of subtrees governed by the same node amounts to say we need no competing rules in the grammar, as nonterminals correspond to sets of subtrees). If the third condition is not met, then we can construe nonterminals (corresponding to the sets of subtrees) such that for all of them the following holds: assuming they compete (introduce identical labels), they either occur in different contexts, in which case they are distinguishable thereby and no ambiguity arises; or they occur in identical contexts, in which case we can use a unique nonterminal which is the merge of both (this also holds for root nodes). The critical case, where the content of one (sequence of) set(s) of subtrees depends on the other one, which makes them occur in similar contexts, while making it impossible to merge them, however, we have excluded by assumption. Since this argument holds inductively from the root to all subtrees, we can construe a URCG for the language were condition 3 is not met, but we cannot use any strictly weaker type. The only thing we have provided is that if two sets of subtrees occur in similar contexts (for the grammar we construe), then they actually occur in identical contexts. It follows that we do not need competing nonterminals in similar contexts. This shows that we still have not solved the problem to define a canonical grammar type which generates all and only the unambiguous languages, since there are languages which are generated by USGs, but no other canonical class which does not allow any ambiguity type, and languages which are generated by URCGs and no other such type. So far, we are still lacking a characterization of the unambiguous languages in terms of grammar rules.

Application and Future Research When we speak of XML schema languages and applications, the first thing that comes into mind is parsing an instance and validate it according to a respective schema. have shown algorithms for parsing the three types of tree grammars we discussed already. However, a task which is still open is to provide algorithms for the new grammars types we have defined. demonstrated the use of the theory of regular tree grammars for the XML to relational conversion as an additional application of formal language in the XML context. Again, this work could be extended using the newly established grammar types. Regarding future research this paper may serve as just a foundation in the fields of XML applications and formal languages. New features that are introduced in XSD 1.1 such as conditional type assignment, assertions and the openContent element as well as the relaxed Unique Particle Attribution rule (UPA, aka the determinism rule, see , Section 3.8.6.4), or changed behavior regarding wildcards have effects on the expressiveness. Apart from these natural enhancements, another focus may lie in examining the relationships between XPath and XQuery and formal languages on the basis of the work undertaken in this paper; we expect to shed some light on this topic during future research. In addition, a more formal approach in the analysis of overlapping markup structures such as GODDAGs () could be an interesting field for future work. Seen from a practical perspective and under consideration of the findings in , a large portion of the XML document grammars that can be found in the wild are structurally equivalent to DTDs or specialized DTDs (that is, adding a mechanism to decouple element names from their types to regular DTDs, see and – also called EDTDs by ), hence use roughly the expressiveness of local tree grammars. This is often due to nontransparent restrictions in the XML Schema spec such as the already discussed Element Declarations Consistent (EDC) constraint. and provide simplifications for XSDs and XSD authoring tools that should relive authors from the burden of these constraints by automatically transforming nondeterministic expressions into concise deterministic ones. Regarding RELAX NG document grammars we think that restraining its expressive power to the class of URCGs would provide a feasible compromise. Up to this point we hope that this more fine-grained hierarchy may serve others as guide for choosing a specific XML schema language depending on the expressivity of the markup language that has to be developed.

Bibliography Abiteboul, S., P. Buneman, and D. Suciu (2000). Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann Publishers, San Francisco, California. Ansari, M. S., Zahid, N., and K.-G. Doh. A Comparative Analysis of XML Schema Languages. In Slezak, D., Kim, T., Zhang, Y., Ma, J., and K. Chung, eds., Database Theory and Application. International Conference, DTA 2009, Held as Part of the Future Generation Information Technology Conference, FGIT 2009, Jeju Island, Korea, December 10-12, 2009. Proceedings, volume 64, pages 41– 48. Springer, Berlin, Heidelberg, 2009. doi:10.1007/978-3-642-10583-8_6. Balmin, A., Papakonstantinou, Y., and V. Vianu (2004). Incremental validation of XML documents. ACM Transactions on Database Systems (TODS), 29(4):710–751. doi:10.1145/1042046.1042050. Bauman, S., (2008). Freedom to Constrain: where does attribute constraint come from, mommy? In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:10.4242/BalisageVol1.Bauman01. Bex, G. J., Gelade, W., Martens, W. and F. Neven (2009). Simplifying XML Schema: Effortless Handling of Nondeterministic Regular Expressions. In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 731–744, New York, NY, USA, ACM. doi:10.1145/1559845.1559922. Brüggemann-Klein, A., and D. Wood (1992). Deterministic Regular Languages. In Finkel, A. and M. Jantzen, eds., STACS 92. 9th Annual Symposium on Theoretical Aspects of Computer Science Cachan, France, February 13–15, 1992 Proceedings, volume 577 of Lecture Notes in Computer Science, pages 173–184. Springer, Berlin, Heidelberg. doi:10.1007/3-540-55210-3_182. Brüggemann-Klein, A. (1993). Formal Models in Document Processing. Habilitation, Albert-Ludwig-Universität zu Freiburg i. Br. Brüggemann-Klein, A., and D. Wood (1997). One-unambiguous regular languages. Information and computation, 142:182–206. doi:10.1006/inco.1997.2695. Brüggemann-Klein, A., and D. Wood (2002). The parsing of extended context-free grammars. HKUST Theoretical Computer Science Center Research Report HKUST-TCSC-2002-08, The Hong Kong University of Science and Technology Library. Brüggemann-Klein, A., and D. Wood (2004). Balanced context-free grammars, hedge grammars and pushdown caterpillar automata. In Proceedings of Extreme Markup Languages, Montréal, Québec. Buck, L., Goldfarb, C. F., and P. Prescod (2000). Datatypes for DTDs (DT4DTD) 1.0. W3C Note 13 January 2000, World Wide Web Consortium. Carey, B. M. (2009). Meet CAM: A new XML validation technology. Take semantic and structural validation to the next level. IBM developerworks, IBM Corporation. http://www.ibm.com/developerworks/xml/library/x-cam/?S_TACT=105AGX54&S_CMP=C0924&ca=dnw-1036&ca=dth-x&open&cm_mmc=6015-_-n-_-vrm_newsletter-_-10731_131528&cmibm_em=dm:0:13962324. Chomsky, N. (1955). Logical Syntax and Semantics: Their Linguistic Relevance. Language, 31(1):36–45, 1955. doi:10.2307/410891. Chomsky, N. (1956). Three Models for the Description of Language. IRE Transactions on Information Theory, 2:113–124, 1956. doi:10.1109/TIT.1956.1056813. Clark, J. (2001). TREX – Tree Regular Expressions for XML Language Specification. Technical report, Thai Open Source Software Center Ltd. Clark, J., J. Cowan, and M. Murata, (2003). Relax NG Compact Syntax Tutorial. Working Draft 26 March 2003, OASIS –- Organization for the Advancement of Structured Information Standards. http://relaxng.org/compact-tutorial-20030326.html. Comon, H., M. Dauchet, R. Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi (2007). Tree Automata Techniques and Applications. Release November, 18th 2008. http://www.grappa.univ-lille3.fr/tata. Costello, R. L., and R. A. Simmons (2008). Tutorials on Schematron: Two Types of XML Schema Language. http://www.xfront.com/schematron/Two-types-of-XML-Schema-Language.html. Møller, A. (2005). Document Structure Description 2.0. Technical report, BRICS (Basic Research in Computer Science, Aarhus University), 2005. http://www.brics.dk/DSD/dsd2.html. Fiorello, D., Gessa, N., Marinelli, P., and F. Vitali. DTD++ 2.0: Adding support for co-constraints. In Proceedings of Extreme Markup Languages, Montréal, Québec. Gelade, W, Martens, W., and F. Neven (2009). Optimizing Schema Languages for XML: Numerical Constraints and Interleaving. SIAM Journal on Computing, 38(5):2021–2043. doi:10.1137/070697367. Goldfarb, C. F. (1978). DCF GML User’s Guide (IBM SH20-9160). IBM, 1978. Goldfarb, C. F. (1991). The SGML Handbook. Oxford University Press, Oxford. Gécseg, F., and M. Steinby (1997). Tree languages. In Handbook of Formal Languages, volume 3, pages 1-68. Springer, New York. Hopcroft, J., R. Motwani, and J. Ullman (2000). Introduction to Automata Theory, Languages, and Computation. 2nd edition. Addison Wesley Longman, Amsterdam. Jeliffe, R. (2009). Is Schematron a rules language? Online: http://broadcast.oreilly.com/2009/01/is-schematron-a-rules-language.html. Kilpeläinen, P., and R. Tuhkanen (2007). One-unambiguity of regular expressions with numeric occurrence indicators. Information and Computation, 205(6):890–916. doi:10.1016/j.ic.2006.12.003. Klarlund, N., T. Schwentick, and D. Suciu (2003). XML: Model, Schemas, Types, Logics and Queries. In Chomicki, J., R. van der Meyden, and G. Saake, eds., Logics for Emerging Applications of Databases, pages 1-41. Springer, Berlin, Heidelberg. Kracht, M. (to appear). Modal Logic Foundations of Markup Structures in Annotation Systems. In Mehler, A., Kühnberger, K.-U., Lobin, H., Lüngen, H., Storrer, A., and A. Witt, eds., Modeling, Learning and Processing of Text Technological Data Structures, Studies in Computational Intelligence. Springer, Dordrecht. Lee, D. and W. Chu. Comparative Analysis of Six XML Schema Languages. ACM SIGMOD Record, 29(3):76–87, September 2000. doi:10.1145/362084.362140. Maler, E., and J. E. Andaloussi (1995). Developing SGML DTDs: From Text to Model to Markup. Prentice Hall, Upper Saddle River, New Jersey M. Mani (2001). Keeping chess alive: Do we need 1-unambiguous content models? In Proceedings of Extreme Markup Languages, Montréal, Québec. Mani, M., and D. Lee (2002). XML to Relational Conversion using Theory of Regular Tree Grammars. In Proceedings of the 28th VLDB Conference, Hong Kong, China. Marcoux, Y. (2008). Graph characterization of overlap-only TexMECS and other overlapping markup formalisms. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1. Montréal, Québec. doi:10.4242/BalisageVol1.Marcoux01. Martens, W., Neven, F., and T. Schwentick (2005). Which XML Schemas Admit 1-Pass Preorder Typing? In Eiter, T., and L. Libkin, eds., Database Theory – ICDT 2005, volume 3363 of Lecture Notes in Computer Science, pages 68–82. Springer, Berlin, Heidelberg, 2005. doi:10.1007/978-3-540-30570-5_5. Martens, W., Neven, F., Schwentick, T., and G. Bex (2006). Expressiveness and Complexity of XML Schema. ACM Transactions on Database Systems (TODS), 31(3):770–813. doi:10.1145/1166074.1166076. Martens, W., Neven, F. and T. Schwentick (2007). Simple off the shelf abstractions for XML schema. SIGMOD Rec., 36(3):15–22. doi:10.1145/1324185.1324188. Martens, W., Neven, F. and T. Schwentick (2009). Complexity of Decision Problems for XML Schemas and Chain Regular Expressions. SIAM Journal on Computing, 39(4):1486–1530. doi:10.1137/080743457. Møller, A., and M. Schwartzbach (2006). An Introduction to XML and Web Technologies, chapter Schema Languages, pages 92–187. Addison-Wesley, Harlow, England. Murata, M., D. Lee, and M. Mani (2001). Taxonomy of XML Schema Languages using Formal Language Theory. In Proceedings of Extreme Markup Languages, Montréal, Québec. Murata, M., D. Lee, M. Mani, and K. Kawaguchi (2005). Taxonomy of XML Schema Languages Using Formal Language Theory. ACM Transactions on Internet Technology, 5(4):660–704. doi:10.1145/1111627.1111631. Nentwich, C. (2005). CLiX – A Validation Rule Language for XML. Presented by Anthony Finkelstein at W3C Workshop on Rule Languages for Interoperability, 27-28 April 2005, Washington D.C. http://www.w3.org/2004/12/rules-ws/paper/24/. ISO/IEC 19757-4:2006. Information technology — Document Schema Definition Languages (DSDL) — Part 4: Namespace-based Validation Dispatching Language (NVDL), International Standard, International Organization for Standardization, Geneva. Odgen, W. (1968). A Helpful Result for Proving Inherent Ambiguity. In Mathematical Systems Theory, 2(3):191–194. doi:10.1007/BF01694004. Pawson, D. (2007). ISO Schematron tutorial. http://www.dpawson.co.uk/schematron/. Papakonstantinou, Y., and V. Vianu (2000). DTD inference for views of XML data. In PODS ’00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 35–46, New York, NY, USA, ACM. doi:10.1145/335168.335173. Piez, W. (2001). Beyond the “descriptive vs. procedural” distinction. In Markup Languages – Theory & Practice, 3(2):141–172. doi:10.1162/109966201317356380. ISO/IEC TR 22250-1:2002. Information technology – Document description and processing languages – Regular Language Description for XML – part 1: RELAX Core. International Standard, International Organization for Standardization, Geneva. ISO/IEC 19757-2:2008. Information technology – Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based validation – RELAX NG (ISO/IEC 19757-2). International Standard, International Organization for Standardization, Geneva. ISO/IEC 19757-2:2008. Information technology – Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based validation – RELAX NG (ISO/IEC 19757-2). Second Edition. International Standard, International Organization for Standardization, Geneva. Rizzi, R. (2001). Complexity of context-free grammars with exceptions and the inadequacy of grammars as models for XML and SGML. Markup Languages – Theory & Practice, 3(1):107–116. doi:10.1162/109966201753537222. Rogers, J. (2003). Syntactic Structures as Multi-dimensional Trees. In Research on Language and Computation, 1(3-4):265–305. doi:10.1023/A:1024695608419. Sasaki, F. (2010). How to avoid suffering from markup: A project report about the virtue of hiding xml. In XML Prague 2010 Conference Proceedings, pages 105–123, Prague, Czech Republic, March 13–14 2010. Institute for Theoretical Computer Science. ISO/IEC 19757-3:2006 Information technology — Document Schema Definition Languages (DSDL) — Part 3: Rule-based validation — Schematron. International Standard, International Organization for Standardization, Geneva. ISO 8879:1986. Information Processing — Text and Office Information Systems — Standard Generalized Markup Language. International Standard, International Organization for Standardization, Geneva. Sperberg-McQueen, C. M. (2003). Logic grammars and XML Schema. In Proceedings of Extreme Markup Languages, Montréal, Québec. Sperberg-McQueen, C. M. and C. Huitfeldt (2004). GODDAG: A Data Structure for Overlapping Hierarchies. In King, P. and E. V. Munson, eds. Proceedings of the 5th International Workshop on the Principles of Digital Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science, pages 139–160. Springer, 2004 Stührenberg, M. and D. Jettka (2009). A toolkit for multi-dimensional markup: The development of SGF to XStandoff. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). Montréal, Québec. doi:10.4242/BalisageVol3.Stuhrenberg01. Vitali, F., Amorosi, N., and N. Gessa. Datatype- and namespace-aware DTDs: A minimal extension. In Proceedings of Extreme Markup Languages, Montré́al, Québec. van der Vlist, E. (2001). Comparing XML Schema Languages, 12 December 2001. http://www.xml.com/pub/a/2001/12/12/schemacompare.html. van der Vlist, E. (2003). RELAX NG. O’Reilly, Sebastopol. Extensible Markup Language (XML) 1.0. W3C Recommendation, World Wide Web Consortium, 10 February 1998. http://www.w3.org/TR/1998/REC-xml-19980210. Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation, World Wide Web Consortium, 26 November 2008. http://www.w3.org/TR/2008/REC-xml-20081126/. Namespaces in XML 1.0 (Third Edition). W3C Recommendation, World Wide Web Consortium, 8 December 2009. http://www.w3.org/TR/2009/REC-xml-names-20091208/. XML Schema Part 0: Primer Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/. XML Schema Part 1: Structures Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/. W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. W3C Working Draft, World Wide Web Consortium, 3 December 2009. http://www.w3.org/TR/2009/WD-xmlschema11-1-20091203/. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C Working Draft, World Wide Web Consortium, 3 December 2009. http://www.w3.org/TR/2009/WD-xmlschema11-2-20091203/. XSL Transformations (XSLT) Version 2.0. W3C Recommendation, World Wide Web Consortium, 23 January 2007. http://www.w3.org/TR/2007/REC-xslt20-20070123/. XQuery 1.0: An XML Query Language. W3C Recommendation, World Wide Web Consortium, 23 January 2007. http://www.w3.org/TR/2007/REC-xquery-20070123/.