A textual object is a well-formed XML document if: Taken as a whole, it matches the production labeled document.It meets all the well-formedness constraints given in this specification.Each of the parsed entities which is referenced directly or indirectly within the
document is well-formed., Section 2.1 "Well-Formed XML Documents"
A valid instance in addition declares conformance and actually conforms to the
rules of a schema of a given markup language. Or, in a more general way:
The intention or purpose of validation is to subject a document or data set to a test,
to determine whether it conforms to a given set of external criteria., p. 144
A validating parser takes an instance document (and the corresponding schema) as
input and produces a validation report, which includes at least a return code reporting
whether [t]he document is valid and an optional Post Schema Validation Infoset (PSVI),
updating the original document's infoset (the information obtained from the XML document by
the parser) with additional information (default values, datatypes, etc.) (). During this process different levels of validation may be
checked, depending on the XML schema language used: validation of the instance's structure
(i.e., the markup), datatyping (i.e., the content of individual leaf nodes), integrity (in
terms of links, either between nodes within a document or between documents). In addition,
other tests, usually called business rules may apply as well
(see ). While one may differentiate between the terms
valid (as defined by the XML specification and therefore
only referring to validity according to a Document Type Definition) and schema-valid (i.e., valid according to one of the externally defined XML schema
languages) we will use the former term throughout this paper as equal term for depicting the
feature of confirmation to a given set of external criteria regardless of the schema language
used. Furthermore, we will only discuss validation mechanisms through schema languages,
therefore, technologies such as the Content Assembly
Mechanism (CAM, see ) or meta-validation techniques
such as Namespace-based Validation Dispatching Language
() are not observed any further. Since we will focus on grammar-based schema languages, rule-basedFor a short discussion if Schematron is a rules language (or rule-based language) see
.
(or constraint-based) schema languages such as , the Constraint Language in XML (CLiX, ) or will not be observed either.For a different discussion on the topic of expressing constraints see . For clarification reasons we follow the definitions of :
A grammar-based schema language specifies the structure and contents of elements and
attributes in an XML instance document. For example, a grammar-based schema language can
specify the presence and order of elements in an XML instance document, the number of
occurrences of each element, and the contents and datatype of each element and attribute.
A rule-based schema language specifies the relationships that must hold between the
elements and attributes in an XML instance document. For example, a rule-based schema
language can specify that the value of certain elements must conform to a rule or
algorithm.
When someone starts developing a new XML-based markup language sooner or later the
question about a formalism to define the corresponding document grammar arises, since there is
a variety of schema (definition) languages available. While a schema can be considered as a
formal definition of a grammar of the XML-based markup language (e.g. as a set of rules or
criteria), the schema language is a formal language for expressing schemas
(). Usually, choosing a schema language depends on several
factors such as familiarity with a given formalism or support provided by the chosen authoring
software or processing tools such as or .
These factors are very specific for one's own needs and environment and we will not give any
advice regarding these topics. However, what we want to demonstrate in this paper are the
differences in terms of expressiveness and computability between the three most used XML
schema languages, starting with XML's inherent Document Type Definition (DTD, see ), based on SGML's DTD (see , , ) where a non-XML syntax is usedWe will not discuss any proposals for extended DTDs such as , , or since these play only minor
roles in the wild, if any., over W3C's XML Schema Description Language (,
, ) and the formal language
theory based RELAX NG (see , , as a successor to both RELAX () and TREX
().Formal Language Theory and XMLAlthough the formal model of an XML instance is always a single rooted tree, the different
schema languages that can be used to define and constrain instances can be differentiated
according to their expressiveness and – in a further step – according to their
computability, which may be interesting when dealing with a task such as programming a
validating parser. Different authors have dealt with the relationship between XML applications
and formal languages, for example , , , , , , , , , , , , , , , , and . Often, a formal specification of
XML's inherent ID/IDREF mechanism is omitted; however, , p. 33 claim that these references can be used to describe graphs
rather than trees, since they allow for multidominance structures. (see for a practical implementation of using
ID/IDREF for realizing graph structures within XML's tree model).
uses modal logic to provide a semantics for XML-documents, and
to characterize XML markup and search and retrieval mechanisms such as XPath. Other work
leaves the formal model of XML and deals with graph structures that can be described by either
XML or XML-like markup languages (see for a graph
characterization of TexMECS and other overlapping markup formalisms), but in this paper we
will concentrate on schema languages that describe well-formed (that is tree-like)
XML-documents. Typically, DTDs are characterized as extended context-free
grammars (see and ),
that is, on the right-hand-side of a production rule regular expressions are allowed. This
means for the declaration of an element that its allowed content is described by a regular
expression using other element names (i.e., referring to other or the very same globally
declared elements) or reserved keywords such as #PCDATA or EMPTY. In
current work, especially and , a
family of tree grammars is used to model XML schema
languages; for example, DTDs are defined as local tree
grammars which can be considered strongly equivalent to CFGs, with the only
difference that they allow non-finitary branching:
Ignoring the attributes for a moment, there is a simple but elegant connection between
DTDs and context-free grammars, namely, each DTD corresponds to an extended context-free grammar, where productions may have regular
expressions on their right-hand side. Then, an XML document is valid with respect to the
DTD precisely when its associated tree is a correct derivation tree for that
grammar., p. 13
In addition to using local tree grammars for characterizing DTDs, construe a taxonomy of XML schema languages. The authors introduce
single-type tree grammars, as characterizing XML Schema,
and use regular tree grammars to characterize RELAX NG.
Although this work is quite extensive, the formal analysis can be further improved by
clarifying some propositions. Given the (still growing) importance of XML and the broad range
of tasks it is used for, stronger theoretical background seems to be the best way to find new
applications. Before we present our results, we have to introduce the formal concepts we are
working with.Refining the Taxonomy of XML Schema LanguagesIntroduction to the Formal ConceptsIn this paper, we will mainly use tree grammars. Since the use of tree grammars is well
established in the XML community, we will just shortly provide the necessary definitions;
for a more explicit treatment and motivation, we defer the reader to or .Definition 1 A regular tree grammar (RTG) is a 4-tuple (N,T,S,P), where N is a
finite set of nonterminals, T is a finite set of terminals, S is a set of start symbols,
which form a subset of N, P is a set of production rules, which have the form: A →
a(r), where A ∈ N, a ∈ T, and r is a regular expression over elements of N. We
call A the left hand side of a rule, a the terminal or label which is introduced by the
rule, and r its content model. We generally use uppercase letters for nonterminals, and lower-case letters for
terminals. Note that this grammar generates trees, not strings, and that the nonterminals do
not remain the labels of the (non-leaf) nodes they introduce, but are substituted by the
terminal labels. The class of languages generated by RTGs is called the regular tree
languages (RTLs). This is the most general class of languages we will consider here; and we
now introduce various restrictions on this class of grammars, as they are defined by . Definition 2We call two rules of a RTG competing, if they introduce the same
terminal nodes, but have different left hand sides. Thus, A → a(r) and B →
a(r') are competing. In general, in an RTG we can merge any two rules which have the same left-hand side and
introduce the same terminal, by merging their content models, because for any two regular
expressions we can easily form a single expression which denotes is the union of both.
Therefore, we will generally assume that in our grammars any two rules with same left-hand
side and same terminal do not exist. As a consequence, the concept of competing rules is the
crucial point if we deal with determinism and ambiguity. For the same reason, we can speak
of competing nonterminals almost in the same way as of competing rules: competing
nonterminals are the left-hand sides of competing rules; a grammar has competing
nonterminals exactly if it has competing rules, and exactly as many as.Definition 3A local tree grammar (LTG) is an RTG with no competing
rules.In an LTG, we have thus a one-to-one correspondence of nonterminals and terminals, which
makes them very similar to context-free grammars (though not identical, since regular
content models allow for non-finitary branching).Definition 4A single type tree grammar (STG) is an RTG, where competing
nonterminals must not occur in the same content model.As point out, LTGs roughly correspond to DTDs, STGs
correspond to XML Schema, and RTGs correspond to Relax NG. Note that this correspondence is
established by only regarding the "core" syntactic features of the schema languages, that is
XML's inherent reference mechanism is not taken into account. These are thus the most
important grammar types for XML. still add another type:Definition 5A restrained competition grammar (RCG) is an RTG, where competing
nonterminals must not occur in the same content model and with the same prefix of
nonterminals; we thus disallow rules with identical left-hand side, terminals, and content
models of the form (Γ A Δ) and (Γ B Δ'), where A and B are competing
nonterminals, and where uppercase Greek letters refer to possibly empty sequences of
nonterminals.Notably, the restriction concerns only the left context of the competing nonterminals.
Of course, there exists a parallel definition for the right context. The problem is that
both definitions lack some generalization, as they both generate different classes of
languages, and there is no inclusion in either direction. If we, however, generalize the
restriction of competition to both the left and the right context (which weakens the overall
restriction on the grammar), some problems arise. We will discuss possible generalizations
later on.It is easily seen that there is a hierarchy of proper inclusion of the grammar types
presented: LTGs are always STGs, which are always RCGs, which are always RTGs, whereas the
converse does not hold.We furthermore define an interpretation of a given tree against a given grammar as
follows:Definition 6 An interpretation I of a
tree t against a grammar G is a mapping from each node label of t, denoted by e, to a
nonterminal N of the grammar, such thatI(e) is a start symbol when e is the root of
t,for each e and its daughter nodes e0,
e1,...,en, there is a production rule
A → a(r) in G, such thatI(e) is A,the label of e is a,I(e0),
I(e1),...,I(en) matches
r.As is easily seen, with this definition, the ease of interpretation directly interacts
with the Murata hierarchy. We will continue in this vein. To keep things clear, it is
crucial to distinguish between the label of a node and its interpretation: the label of a
node corresponds to its terminal in the production rule (recall that tree grammars directly
generate trees, not strings via trees as CFGs), and it is immediately visible in the tree.
By the interpretation of a node in turn we denote the nonterminal by which the node label
has been produced. This nonterminal is not visible and has to be inferred. In addition, we
have to distinguish between rule and rule instantiation: since content models are regular
expressions over nonterminals, they denote sets of sequences of nonterminals, and one member
of this set is an instantiation. An additional problem arises in interaction with the fact
that there is no necessary one-to-one correspondence of nonterminals and terminals (i.e.,
labels); a possible consequence is that the same sequence of labels can be produced by
different instantiations of a content model (we will exhaustively discuss this source of
ambiguity later on).So far, we have introduced the main concepts which are well-known in the literature, and
which we are going to use and elaborate in this paper.An Example GrammarWe will use the following example to demonstrate some of our findings. We want to define
a document grammar for a text. The text may contain an optional title, followed by either at
least a single section or a single paragraph. An optional author entity (possibly decoded
using an attribute) may contain information about the text's author. Inside a section an
optional title followed by other (sub-)sections or paragraphs are allowed. The title
consists of raw text while a paragraph may contain raw text or a reference to other
paragraphs, since these may have an optional identifier (using XML's ID
type).If we try to express these constraints more formally we might end up with a grammar
similar to the one shown in . Again, nonterminals are printed
in capital letters, while node labels or terminals are printed in small letters. Note, that
in this formulation elements, attributes and raw text are defined as terminals.The Example Grammar Realized by a DTD (Local Tree Grammar)Following , DTDs can be classified as local tree grammars.
A possible realization for our example grammar can be seen in .Since local tree grammars and DTDs only support globally declared elements (and
locally declared attributes), the content models of the text and the
section element share references to the same three elements
(title, section and para) and contain both a
sequence and a choice together with the occurrence indicators + (at least one
occurrence) and ? (optional). The content model of the para
element contains mixed content, that is both raw text and the xref element
are allowed as children. Since DTDs force the use of the choice group (|) and
the trailing asterisk (*) as occurrence indicator, there is no other way to
define this specific content model using this schema language.Note that DTDs do not support any type mechanism. , , and suppose the extension of DTDs by adding types, while DTD++,
proposed by adds namespace awareness on top, and DTD++ 2.0
(see ) even supports co-constraints.The Example Grammar Realized by an XSD (Single Type Tree Grammar)An XML schema description (i.e., a single type tree grammar) of the same document
grammar may look like the one in .Note that this XML schema description is only one possible realization out of a
variety of different XML schema descriptions that would fit our needs. Although it may be
not very human-readable, it was designed to show some features that are supported by XSD.
The text element is derived by extension of the globally declared complexType
textType which itself refers to the globally declared model group
sectOrPara. The schema contains both locally and globally declared
attributes (author vs. id) and elements (xref as an
example for a locally declared element). Apart from that, XSD supports which are not shown in the example above. As has already pointed out, that the actual extra expressive power
of XSDs over DTDs can only be used to a very limited extent due to the Element Declarations Consistent (EDC) constraint (see , Section 3.8.6).The Example Grammar Realized by a RELAX NG Grammar (Regular Tree Grammar)RELAX NG can be classified as regular tree grammar according to . A possible realization with RELAX NG is shown in .Compared to DTD or XSD, RELAX NG is based both on the mathematical theory of regular
expressions and the concept of hedge grammars ( and ). As an XML schema language, RELAX NG has some advantages over
other schema languages: while in DTDs and XSD mixed content models may contain child
elements and text nodes in any arbitrary order, RELAX NG allows for ordering of the
element child nodes (see , p. 57f.). Co-occurrence
constraints can be used to specify the content model of an item according to the value of
another item, allowing non-deterministic content models which cannot be realized in DTD or
XSD (see , p. 62f, and
for a discussion). In general, a co-occurrence constraint (or co-constraint as they are
called by ) may be a constraint over multiple items,
not just two items and may exist between XML structure components
(elements, attributes) as well as between data values. One may differentiate
between element-to-element, element-to-attribute, or attribute-attribute co-occurrence
constraint, based on the items involved. In addition, SGML's interleave operator & (see , p. 291) that is missing in XML DTD and XSD can be used in
RELAX NG as well, although this adds nothing to its expressive power. In contrast to the
two other schema languages discussed in this paper, RELAX NG does not support default
values (which are supported for attributes in DTD and for attributes and elements in XSD).
While both DTD and XSD support XML references via
ID/IDREF(S) attribute types, RELAX NG has no
included datatype library; however, as seen in the example grammar, it is possible to
include the datatype library of .The document instance given in would be valid according
to all of the above defined document grammars (the example shows validation against the
XML schema, adding a Doctype declaration and removing the
noNamespaceSchemaLocation attribute of the root element would result in a
valid instance according to the DTD. Note that RELAX NG does not contain a standard way to
associate a RELAX NG schema to an XML instance since it was designed as part of the ISO
DSDL framework (in this framework, the should be used as general
external mechanism for validating instances).Determinism and Local AmbiguityDeterminism is a important property for XML Documents, schema languages and
interpretation. If a grammar is deterministic, parsing will be much more efficient, since in
general we do not have to keep in mind any information, do not have to backtrack, do not need
any non-local information in case we search, etc. present an algorithm using tree-automata that allows for
efficient parsing of regular tree grammars. The concept of determinism is closely related to the concept of local ambiguity:
if there is no local ambiguity, then at every point in the parsing process we know the
structure (in this case: interpretation) of what we have seen so far, because there is
only one possible local analysis. If there is some local ambiguity, non-determinism arises: we
cannot assign a unique interpretation locally, for we would need information which is not
available yet, and we need to apply some heuristics.Determinism, Algorithms and Local AmbiguitiesIn this section we will review the concept of determinism, as opposed to local ambiguity
of a grammar. As introductory issue, we show that determinism does not only depend on the
grammar we use, but also on the algorithm. In regular tree languages, there can be no
one-dimensional concept of determinism, as there is for regular string languages. Note that
this is more than a metaphor, since we can perceive of regular string languages as
one-dimensional structures; in order to talk about trees or the strings formed by their
leaves, we need at least two dimensions (see ). In the
remainder, we will show how different grammar types provide determinism for all, some
particular class of, or no algorithms.Deterministic Content ModelsFirstly, we consider the concept of deterministic content models. This draws on the
notion of deterministic (or 1-unambiguous) regular expressions (DREs), thoroughly surveyed
by See for a discussion of the pros and cons of
1-unambiguous content models in XML schema languages.. Assume we have a string s which is denoted by
some regular expression E. Assume furthermore we build our
expressions with letters from an alphabet Π, which is
identical to the alphabet Σ, except for the fact that
we have additional indices for the letters (taken here from the set of natural numbers). If
we build an expression from Π, we have to make sure
that every index must occur at most once in it. Constructors for regular expressions are as
usual, and indices are passed on to the letters of the strings the expression denotes. We
say that a letter in s instantiates a letter in E, if the following holds:Definition 7A letter ai in s instantiates a letter
aj from E, if i = j.The index ensures that for every string an expression denotes, there is a unique
surjective mapping from letters in s to letters in
E. We now define a mapping ♮ on strings over Π to strings
over Σ, such that
♮(xi):= x,
♮(xs):=♮(x)♮(s), where s
is a string and x a letter (the first of the string). This
homomorphism simply deletes the indices and leaves anything else untouched. Note that for a
unique string ♮s, and a given expression E, there might be several s ∈
E, such that their mapping under ♮ is
identical. In this case we say that a letter in s might
instantiate several letters in E. A deterministic regular
expression over Π is then defined as follows:Definition 8E is deterministic or one-unambiguous, if for all strings u, v, w over
Π, and all letters x, y in Π, the following holds: if uxv and uyw ∈ L(E),
and x ≠ y, then also ♮ x ≠ ♮ y. This means that we can skip the indices, and we still know which letter in E is instantiated by any letter in s, simply from knowing its left context. Formally, this means that the mapping
♮ is reversible. A regular language is deterministic if it is denoted by a
deterministic regular expression. A simple example of a non-deterministic expression is the
expression a*b*a*, as we can easily check for the string
a, which might be the instantiation of either of the two
as.As a consequence for processing, quite informally, we can state that reading s ∈ L(E) from the left to
the right, where E is a DRE, at any point in s, we know at which point in E we
find ourselves. In automata theory, DREs correspond to deterministic Glushkov
automata.There is however a problem if we apply this concept to content models in regular tree
grammars: in regular expressions, we can see from a letter in the string which type of
letter in the expression it instantiates (thus, we have a unique letter in Σ, though not in Π). We
can thus deduce from a letter in the string a letter in the expression, though not a letter
instantiation, if the expression is not deterministic. We cannot, however, deduce from a
given tree node label a unique type of nonterminal: if we have competing rules, different
nonterminals introduce identical labels; and we still have to keep them apart. Thus, if the
content model of nonterminals itself is deterministic, this is of little use if we cannot
infer from a given label the unique nonterminal it belongs to.By way of example, consider the following grammar rule:A → a(ABC|CBA) Its content model is surely deterministic. However, if A and C are competing rules (have identical
labels), this is of little use. We have to check each subtree until we have its unique
interpretation. This means, in the worst case, we have to check both subtrees (as we will
see, this is the case in which the trees generated by A (C)
form a subset of the trees generated by C (A)).The obvious reason for the fact that this concept of determinism comes short is that it
originates in one-dimensional strings. As our trees are two dimensional, we can define
determinism only with respect to directions in which our analysis proceeds. The main
difference is, of course, the one between top-down and bottom-up processing. In this paper
we will not consider the difference between depth-first and breadth-first parsing, though
this is surely worthwhile.In the next subsection, we will reformulate and complete the Murata hierarchy in a way,
that makes clear which kind of determinism is facilitated by which kind of grammar. In the
sequel, we will disregard the one-dimensional problem of non-deterministic content models,
since they have been thoroughly analyzed, and we have nothing to add (see ). At this point, our interest is the second dimension:
importantly, this means that talking about determinism, we implicitly always add: provided
that content models are deterministic in the above sense. We thus exclude all problems which
may arise from regular expressions.The Murata Hierarchy as Hierarchy of Locality ConditionsAs our main concern will be the formal properties of the grammar types, as they affect
processing, we will firstly reformulate the hierarchy. This reformulation aims at making
clear which information we need in order to uniquely interpret a local node or
subtree.In a local tree grammar, for any node a in any context, we know its unique interpretation. This is
obvious, since for any node label, there is only one single rule which generates it, by
the very definition of a local tree grammar. As a consequence, parsing is deterministic
for any algorithm (provided we have deterministic content models), and the problem of
giving a certain node of a given tree its interpretation is solvable in a constant
amount of steps.In a single type tree grammar, for any node label
a in any context, we can determine its unique
interpretation if we know the interpretation of its mother node. This follows directly
from the definition: if we know the interpretation of a node's mother node (rules
correspond to interpretations), we know its content model. Within a content model there
must not occur any competing nonterminals.Note, however, that it is not sufficient to know the mother nodes label. We can
easily construct an example to show this: we have two competing rules, A → a(C) and B →
a(D), whose nonterminals do not occur in the same content model of any
rule. Furthermore, we have the two rules C → b(r)
and D → b(r'). Then both nodes, as introduced by
C and D, have label
b, their mother nodes both have the label a, despite they have different interpretations.For processing, this has an immediate consequence: a top down parser will at any
point immediately know the interpretation of any node, whereas if we start
interpretation from the bottom, in the worst case we will have to go up to the root in
order to get the correct interpretation. The matter of providing the interpretation of a
given node is nonetheless a linear search problem, since for a given tree and a given
node, it is sufficient to go a path from the root to that node. In a restrained competition grammar (RCG), in
order to give a node its unique interpretation, we must have the interpretation of its
mother node, and check its left siblings, in case it has any. Note that from how we
defined RCGs, it follows that we only need the label of the siblings, not their
interpretation: because any two competing nonterminals within a single content model are
distinguished by a unique prefix, this prefix itself must not consist of competing
nonterminals only, and neither must it be empty. This keeps the grammar unambiguous, and
easy to process. However, it makes some restrictions we do not necessarily want to make:
maybe the unique interpretation of a label should not depend on a left sibling, but on a
right sibling. For example, in RCGs we cannot have competing nonterminals as leftmost
symbols in a content-model, but as rightmost, given some left context. This causes an
asymmetry which seems quite arbitrary. Of course we can equally define the asymmetric
counterpart of RCG, checking for unique suffixes instead of prefixes; but care is to be
taken: since we have to fix the type for the class of languages (i.e., document
grammars) we define, we have the same problem. If we generalize the concept to both
unique suffix and prefix, some problems arise, which can however be remedied. We now define a generalized restrained competition
grammar (GRCG), as follows: In a GRCG, for any two competing nonterminals A and
B within a single content model r, one of (Γ A Δ) and
(Γ B Δ) fails to match r (Greek variables range over possible empty sequences of
nonterminals).We now have generalized the restriction from the left (right, respectively) to the
entire context. Note that we have relaxed the overall restriction on the grammar, by
making the restriction on content models more specific (indeed, this type properly
includes the RCGs). This little relaxation however causes a vast increase in processing complexity:
because now, in order to give its unique interpretation to any node a in any context, in the worst case one needs to know the
interpretation of its mother, the interpretation of its siblings, and the interpretation
of its subtrees as well. And even then, GRCGs might still be ambiguous, allowing more
than one interpretation for a entire single tree. This needs some explanation. The first point is easy to see: that we need to know
the interpretation of the mother node follows a fortiori from the preceding argument
(single type grammars are properly included in restrained competition grammars). But
this is insufficient, since competing nonterminals may occur within the same content
model. We have to match all the sister labels to the nonterminals of the content model
of the mothers interpretation in order to get a unique interpretation (according to the
definition). Note, however, that we need the interpretation of the sister nodes; it is not
sufficient to have their labels. This we can easily verify with the following grammar
rule: A → a(BC|B'C'), where B and B' and C and C' are competing rules, introduce
the labels b and c,
respectively.This satisfies all conditions on GRCGs. In order to get the correct interpretation
for the labels b and c, it is not sufficient to know the sister node's label, but its
interpretation. Things can thus get even worse, if we consider the case where the interpretation of
a sister node depends on the interpretation of the node under consideration itself. Look at the following example rule: A → a(BC|CB), where B
and C are competing rules.Obviously, they are both uniquely determined by their neighbor within the content
model of A (A may only
occur with B to its left or its right, and vice versa).
However, as they carry the same labels, it is insufficient to determine either of them
if we just check the label of its sister (since it is identical). Furthermore, we might
have the case where it is impossible to interpret one of the subtrees, without its
sisters interpretation (e.g., if one of the competing nonterminals generates a subset of
the trees generated by the other).We will consider this case more thoroughly in the next section, showing that there
are globally ambiguous GRCGs, and that for every language that can be generated by an
RTG, there is also a GRCG grammar that generates the same language. From the point of view of processing, we see that in GRCGs, expressive power comes
at a high cost: neither a bottom-up nor a top-down parser is capable of assigning a
unique interpretation locally, and maybe not even globally. The problem of giving a
given node its unique interpretation might thus be an exponential search problem, and in
the worst case not even decidable. We will therefore introduce a subtype of GRCG, which
we will call unambiguous RCG. This type is properly included in the class of GRCG
grammars, and includes properly the class of STGs as well as RCGs, as can be seen
easily.We now introduce unambiguous restrained competition
grammars (URCGs). What we want to eliminate is ambiguity, which can be
caused by the fact that in a GRCG, we might have entire competing contexts, or the
interpretation of the context of a label in a content-model might depend on the
interpretation of the label itself. In the resulting grammar, it should be possible to
yield the unique interpretation of a node from the interpretation of its mother and the
labels (not interpretations) of its sisters.We characterize the grammar type in the following terms: We introduce an alphabet of
meta-variables O, which we use in the following way: We form a set of all sets of nonterminals from N
which compete with each other; we call these sets the competition
sets (which are possibly singletons). We are interested in the sets where
every nonterminal occurs in exactly one competition set, such that the whole set forms a
partition of N. For every such partition we iterate the
following procedure: To every competition set, we assign a single symbol from O. We call this an O-assignment. Then, for all content models, we check for all nonterminals,
whether the content models still satisfy the GRCG condition, if we replace all other
nonterminals by the symbols from O they are assigned
to. In case there is more than one overall assignment, that is, a single nonterminal
belongs to more than one competition set, we have to iterate this for every possible
assignment. If for every assignment, nonterminal and content model, the resulting
grammar is a GRCG, then the original grammar is a URCG.Note that the assignments are only introduced for this evaluation procedure. We will
call contexts which are identical under the O-assignmentsimilar. We define accordingly a URCG as a grammar
where competing nonterminals must not occur in the same content model and in similar
contexts (this obviously subsumes identical contexts). It is easy to see that now we
have made sure that competing nonterminals must not occur within the same contexts of
labels (as opposed to nonterminals).Thus, a URCG is a GRCG where for every node we find its unique interpretation if we
have the interpretation of the mother node and the labels of its sister nodes. In
particular, we can interpret any node without having to recur to its sister node's
interpretation: the content model (BC|CB), where
B and C are
competing nonterminals, does not satisfy the condition, because both B and C occur in the context
__X or X__,
respectively, where X is the O-assignment for both. This kind of grammar is useful for the following reason: there is no global
ambiguity in it (as we have deleted the only source of ambiguity, that interpretations
of labels may depend on each other); and it is the strongest of the non-ambiguous
grammar types we have considered. However, it is not capable of generating every
language which is not inherently ambiguous, as we will show later. Note, that in order
to provide the unique interpretation of a node, we still might have to check all of its
sister nodes, but it is sufficient to check the labels.It is easily seen that URCGs properly include RCGs, as both left and right context
can count as distinctive (we will make this more precise later on). As we will also see
further down, there are languages which can be generated by GRCGs, but not by URCGs.
This will follow from the fact that actually every RTL can be generated by a GRCG, but
there are languages for which there are no unambiguous grammars, and, obviously, URCGs
are always unambiguous.Interestingly, the search problem for URCGs is still linear, since we only need to
go down the path from the root to a given node, and in addition check finitely many
sister labels (note that while regular expressions allow for arbitrary branching, the
branching of a given tree is, of course, always finite).An example of a URCG could be the use of attribute based co-occurrence constraints
or attribute-element constraints in the following RELAX NG declaration. We extend our
example grammar by adding a type information to the section
element. If the type is set to the value "global" other section
child elements are allowed as part of the content model, if its value is set to
"sub" only para child elements are allowed (see ).
The interesting part is the definition of the element.section pattern.
It allows for two different elements named "section" with different content
models according to the value of the optional type attribute. The result is
that the instance shown in is valid according to
this RELAX NG grammar while the one shown in is
not.Without any type attribute the instance shown in would still be valid.Other well deployed RELAX NG examples of attribute based co-occurrence constraints
can be found in , Chapter 7, in , Section 15, or in the RELAX NG schema for the JLTF (Japanese
Layout Taskforce) aligned document shown in .Importantly, co-occurrence constraints do not add anything to the formal generative
capacity of the grammar. This is because attributes (or their values, respectively) add
an additional specification to the terminals. Thereby
we can convert competing terminals (or, equivalently, rules) into non-competing ones,
but not vice versa. Any co-occurrence constraint thus gives us the possibility to
distinguish maybe otherwise indistinguishable non-terminals, thereby at most keeping the
complexity of the grammar constant, or even reducing it. Furthermore, as co-occurrence
constraints do only affect immediate subtrees (i.e., content models), their expressivity
is entirely contained within the expressive capacities of standard regular tree
rewriting rules; the only thing we might need to add to our formal grammar model is some
additional specification on the terminals.Neither DTD discuss DTD++ 2.0 which supports a large number
of co-constraints using a syntax closely resembling DTD. nor XSD 1.0 support such attribute-element constraint, although there are
some workarounds or hacks that can be used in XML schema to mimic co-occurrence
constraints: either the use of the xsi:type attribute or
xs:keySee , p. 65 and http://ajwelch.blogspot.com/2008/06/xml-schema-co-occurrence-constraint.html
for further information.. Another option to realize this particular constraint is the use of embedded
business rules or conditional type assignment using type
alternatives or assertions that are introduced in (for
complex Types) and (for simple Types). shows a possible XSD 1.1 realization.The section element contains an XSD 1.1 assert element
that uses an XPath expression in its type attribute to restrain the
possible child elements according to the type attribute's value of the
section element.Note that when using XSD 1.1 assertions for expressing co-occurrence constraints we
are not limited to immediate subtrees: although the evaluation of the XPath expression
is done in the context of the parent element (i.e., XSD 1.1's xs:assert
element uses a tree fragment rooted at the element whose type it is tested against) one
could put the assertion at the level of the common ancestor, that is, the element that
contains all the data needed to compute the assertion. The support for full XPath (i.e.,
axes such as ancestor, parent or preceding and preceding-sibling and following and
following-sibling, respectively) may be implementation-dependent. XSD 1.1 type alternatives
are restricted to tests against constants or attributes on the element itself but not to
ancestors, descendants, siblings or children or their attributes while Schematron
business rules are not restricted to an XPath subset.There is one type we should add, which cannot be assigned a place on the hierarchy
from STGs to RTGs, which is, however, weakly and strongly between LTGs and RTGs.
Grammars of this type satisfy the following conditions: we want to be able to assign a
unique interpretation to any node a, provided we know
the complete subtree it governs. This kind of grammar would facilitate deterministic
parsing for bottom-up algorithms. In terms of grammar, this imposes the following
restrictions on the formalism:Every leaf-terminal is introduced by a single nonterminal,for every nonterminal N in a given grammar,
there is at most one rule which has a given terminal a, and N appears in its content
model. We call this grammar type unique subtree grammar
(USG). Note that this grammar type does not include STGs and URCGs, nor is it included
by them. The restrictions do not restrain the occurrence of competing nonterminals in
content models, but rather the labels which belong to nonterminals in the content model.
More precisely, whereas the former types restrain the occurrence of competing
nonterminals within content models, USGs restrain the content models of competing rules
itself. However, as all other types, they properly include the class of LTGs, as every LTG
is a USG, but not vice versa, and it easy to find a language which is generated by a
USG, but not by LTG. Furthermore, they are properly included in the class of RTGs, in
the strong and the weak sense (and weakly within the class of GRCGs, as we will see). As is also easy to see, the distinctive USG property provides deterministic bottom
up parsing. In order to get the interpretation of a given node, it is therefore
sufficient to find a path from the leaves, which is a linear search problem.Structure and Global AmbiguityFirst, we will present some well-known results, which are important for our further
discussion.Theorem 1 RTGs are weakly equivalent to
CFGs (that is, the set of strings of leaves generated by CFGs is equivalent to the set of
strings of leaves generated by RTGs).Theorem 2 RTGs are strongly equivalent
to graphs generated by a CFGs, modulo a homomorphism of node labels (i.e., a homomorphism
which maps various node labels in a given tree onto a single one), provided the RTG has
finitary branching.Theorem 3 LTGs are strongly equivalent
to graphs generated by CFGs, provided finitary branching.Proof is trivial: as in LTGs every node label is generated by exactly one nonterminal; and
in CFGs, nodes which are not leaves are labeled by nonterminals, there is a one-to-one
correspondence. This has some importance for the relation of LTGs and RTGs. It follows, that
LTGs and RTGs are equivalent up to homomorphism, as well as all grammar types in between.
Every discussion we have about generative capacity concerns only non-equivalence in
isomorphism. Since also the strings of leaves for all grammar types are identical, we can only
be interested in the sets of trees. In the sequel, if we speak of the weak generative capacity
of a tree grammar, we refer to the sets of trees it generates, not the strings of
leaves.Since to our knowledge, only the strong generative capacity of the grammar types of has been in the focus of research, we will now scrutinize their weak
generative capacity.Theorem 4 The
sets of trees generated by STGs form a proper subset of the sets of trees generated by
RCGs.For proof, consider the trees generated by the following grammar: S → a(AB)
A → b(C)
B → b(D)
C → c(D)
D → c(ε)This is an RCG, since A and B do not occur in similar contexts. In order to see that there
is no way to generate this tree with an STG, consider the following fact: a governs two identical labels, which however govern different
subtrees. It is therefore impossible to introduce them with identical rules, and (by
definition of STGs) forbidden to have two competing rules in the content model of the first
rule. This is sufficient for the proof of weak inclusion, since all STG rules are also RCG
rules, and therefore the languages generated form a proper subset.Theorem 5 The
sets of trees generated by LTGs form a proper subset of the sets of trees generated by
STGs.Consider the trees generated by the following grammar: A → a(B)
B → b(C)
C → a(D)
D → c(ε)This is a single type tree grammar, and no LTG is able to generate such a tree (remember
that LTGs are strongly equivalent to graphs generated by CFG, provided finitary
branching).Restrained Competition Grammars and VariantsIn this section we will scrutinize formal properties of the different types of
restrained competition grammars. We will show which kind of languages cannot be generated by
RCGs; we will prove that there are GRCGs which are ambiguous, and that for every language
which can be generated by an RTG, there is a GRCG which generates the same language.
Finally, we will show that URCGs do not have these properties, are properly included within
GRCGs and properly include RCGs. The type of languages we cannot generate with RCGs is quickly described as follows: all
grammars, where a single content model contains competing nonterminals, which can be
uniquely distinguished only from their left (right, respectively) context, are not RCGs.
Consequently, we cannot generate sets of trees, where a certain node has different subtrees
depending on its right siblings. If we want to get rid if this asymmetry, and allow for
GRCGs, where competing nonterminals in a single content model are uniquely determined by
their left or right context, we run into problems: Theorem 6 There
are GRCGs which are ambiguous. For proof, consider the following rule: S →
a(AB|BA). Suppose, A and B are competing nonterminals; suppose furthermore, that there is some
overlapping between A and B; i.e., the nonterminals generate overlapping sets of trees. In particular,
we may assume that the trees generated by A form a subset
of the trees generated by B. For example, A and B generate identical trees
up to depth n; B in
addition generates a tree of depth n+1. In this case, the
trees of the language have the root a, with two symmetrical
sets of subtrees up to depth n, and possibly one subtree
with depth n+1. It is easily seen that now it is impossible
to merge A and B, for then
we would be incapable of expressing the condition that at most one subtree has depth
n+1. However, for the trees, where the subtrees
introduced by A and B have
depth at most n, there is necessarily more than one
analysis. The grammar we have described so far is, however, a GRCG, because neither
A nor B occur in
identical contexts (though in similar contexts, remember the preceding section).Theorem 7 For
every language which can be generated by an RTG, there is a GRCG which generates the same
language. To proof this theorem, we describe a simple procedure to convert any RTG into a GRCG,
which generates the same language. We define competing sequences of length n of nonterminals as follows: two sequences of nonterminals
compete, if for all n, the nth nonterminal of one sequence competes with the nth nonterminal of the other sequence. We have to assume a content model
r which is not GRCG conform. Therefore, there have to be
two competing nonterminals or competing sequences of nonterminals A and B in r, such that for possibly empty sequences of nonterminals Γ and Δ, (Γ A Δ) and (Γ B Δ) match
with r. Given this, we can be sure, that in the instantiations of r, which violate the GRCG condition, A and
B occur in exactly the same global tree contexts. By
global tree context we here mean that a tree with a governing the subtrees generated by
A is part of the language iff a also governs the set of
subtrees generated by B. Since this is the case, we can
simply merge the two nonterminals to a new one, C, which is
the union of the former two. This new nonterminal substitutes all instantiations of
A and B, which occur in
the same global tree context. This, by definition, are the instantiations which violate the
GRCG condition. This we can apply to all nonterminals which violate the GRCG condition. The
only thing we have to take care of is that we apply this only to those instantiations of the
content models where two competing nonterminals match equally (this might force us to change
some regular expressions). We do not show an exact algorithm at this point, since it is
clear that an equivalent GRCG exist, and the details of the construction are of no practical
interest at this point. We now show that there is a hierarchy of proper inclusion RCG ⊂ URCG ⊂ GRCG.
To show that RCG ⊂ URCG, consider the following: every rule which is admitted by an
RCG is also admitted by a URCG, because if competing nonterminals in the same content model
have a unique prefix, a fortiori they also have a unique context (we have already shown that
a unique prefix of nonterminals is paramount to a unique prefix of labels/siblings, by
induction). Above, we have already shown that for an RCG it is impossible to generate
languages as the following, which is a URCG. S → a(AC|BD)
A → b(C)
B → b(D)
C → c(ε)
D → d(ε)A and B compete, but are
determined uniquely by their context.This concludes the first part; the second part will be a corollary of the next section:
We will show that some languages are inherently ambiguous, that is, there is no unambiguous
grammar for them. By Theorem 7 we know that we can generate these languages with GRCGs, but
URCGs cannot:Proposition 1 A URCG cannot be ambiguous.This is easy to see: an ambiguous grammar assigns two different sequences of
nonterminals to the daughters of one node (since root nodes are unambiguous): Then however,
there must be at least two competing nonterminals which occur in the same content model in
similar contexts, which, by definition, is impossible.Inherently Ambiguous LanguagesAs a corollary, we can show that there are regular tree languages, for which there is no
unambiguous grammar. There are sets of trees, which are generated by an ambiguous GRCG, but
by no URCG. We will call these languages inherently
ambiguous. Theorem 8 Some
regular tree languages are inherently ambiguous.This can be seen easily, if we spell out a grammar which we described in the above
subsection. We will then show that there is now way to write an unambiguous grammar which
generates the same language.S → r(AB|BA)
A → a(C)
B → a(D)
C → b(ε)
D → b(ε|E)
E → c(ε)There is no way to merge A and B, since they generate different sets of subtrees (we can write L(A)≠L(B)); but since they overlap (L(A)∩ L(B)≠ ∅), there is no way to have a unique
interpretation in the cases where the subtrees generated by the nonterminals are identical.
There will always be two ways to generate trees in this case.We can, furthermore, precisely state the conditions, under which a regular tree language
is ambiguous. To this end, however, we need to introduce some notation. We now for
simplicity write trees as terms: a tree with root a and
daughters b and c is
denoted as a(b,c), etc. As a next step, we define a context
as the position, where certain subtrees occur within trees of a language. Definition 9A context C is a tree-term with exactly one variable. We say that a
set of subtrees α occurs in a context C in a language L, if the following holds: We
can instantiate the variable of C with any tree from α, and the resulting tree is in
L.Note that sets of subtrees correspond to nonterminals, when we speak of languages rather
than grammars. In the sequel, for simplicity we will use lower case Greek variables equally
for sets of subtrees as for ordered sequences of sets of subtrees. The definition of a
context is easily accommodated to sequences. A set of sequences of trees of length n consists of ordered tuples of trees of length n, of the form (t0,...,tn-1). Sets of subtrees
are then simply sets of one-tuples. Importantly, we will not provide a proof for the
following proposition, and leave it open as a conjecture. However, we will sketch the
argument. We now make the following conjecture: Conjecture 1A tree-language is inherently ambiguous iff at least one node fulfills
all of the following conditions: We need to have one node with an arbitrary label a, with at least
two (sequences of) sets of subtrees α and β, such that α ∩ β ≠ ∅;α ≠ β;There is at least one context C in L, such that both a(Γ,
(t1,...,tn), Δ,
(u1,...,un), Θ) and
a(Γ,
(u1,...,un), Δ,
(t1,...,tn), Θ) occur
in C, for all (t1,...,tn) ∈ α
and all (u1,...,un)
∈ β, where uppercase Greek letters designate possibly empty
sequences of daughter sub-trees; note that the sequences need to have equal length in
order to meet condition 1.Due to space restrictions, we leave the prove for this conjecture open here; this
reminds however of a theorem in for string languages. But we
will give some rather informal discussion of the points in the next section. It is not hard
to see that this is merely a generalization of the cases we have been described above. As we
will see, we can derive some useful facts from these properties of ambiguous languages, even
without a general proof: we can show that we can construe grammars for languages which do
not fulfill one of the conditions, and, moreover, which type of grammars we can
construe.Unambiguous LanguagesTheorem 9From the grammar types sketched so far, there is no type which
generates all and only the RTLs that are not inherently ambiguous.We will demonstrate this going through the three conditions mentioned in the preceding
section, and look which unambiguous grammar we can construe if one condition is not met.
This is to be read as follows: if one condition is not met, then it means, that from all
nodes of the tree language, there might be any one which meets the ones not in question, but
none which meets the one currently under consideration.If there is no intersection between the subtrees of a given node, the grammar is of
course not ambiguous. We can, however not necessarily construe a URCG for this grammar,
since in the content model of the mother node there are competing nonterminals in
similar contexts (recall the example given above).We can, however, construe a USG for such a language, since subtrees are uniquely
identifiable.This means that there are no two sets of subtrees governed by the same node which
are not identical. We can thus introduce them by the same nonterminals, and have a local
tree grammar (having no different sets of subtrees governed by the same node amounts to
say we need no competing rules in the grammar, as nonterminals correspond to sets of
subtrees).If the third condition is not met, then we can construe nonterminals (corresponding
to the sets of subtrees) such that for all of them the following holds: assuming they
compete (introduce identical labels), they either occur in different contexts, in which
case they are distinguishable thereby and no ambiguity arises; or they occur in
identical contexts, in which case we can use a unique nonterminal which is the merge of
both (this also holds for root nodes). The critical case, where the content of one
(sequence of) set(s) of subtrees depends on the other one, which makes them occur in
similar contexts, while making it impossible to merge them, however, we have excluded by
assumption.Since this argument holds inductively from the root to all subtrees, we can construe
a URCG for the language were condition 3 is not met, but we cannot use any strictly
weaker type. The only thing we have provided is that if two sets of subtrees occur in
similar contexts (for the grammar we construe), then they actually occur in identical
contexts. It follows that we do not need competing nonterminals in similar
contexts.This shows that we still have not solved the problem to define a canonical grammar type
which generates all and only the unambiguous languages, since there are languages which are
generated by USGs, but no other canonical class which does not allow any ambiguity type, and
languages which are generated by URCGs and no other such type. So far, we are still lacking
a characterization of the unambiguous languages in terms of grammar rules. Application and Future ResearchWhen we speak of XML schema languages and applications, the first thing that comes into
mind is parsing an instance and validate it according to a respective schema. have shown algorithms for parsing the three types of tree grammars we
discussed already. However, a task which is still open is to provide algorithms for the new
grammars types we have defined. demonstrated the use of the theory of regular tree grammars for
the XML to relational conversion as an additional application of formal language in the XML
context. Again, this work could be extended using the newly established grammar types.Regarding future research this paper may serve as just a foundation in the fields of XML
applications and formal languages. New features that are introduced in XSD 1.1 such as
conditional type assignment, assertions and the openContent element as well as
the relaxed Unique Particle Attribution rule (UPA, aka the
determinism rule, see , Section 3.8.6.4), or changed behavior
regarding wildcards have effects on the expressiveness. Apart from these natural enhancements,
another focus may lie in examining the relationships between XPath and XQuery and formal
languages on the basis of the work undertaken in this paper; we expect to shed some light on
this topic during future research. In addition, a more formal approach in the analysis of
overlapping markup structures such as GODDAGs () could
be an interesting field for future work.Seen from a practical perspective and under consideration of the findings in , a large portion of the XML document grammars that can be found in
the wild are structurally equivalent to DTDs or specialized
DTDs (that is, adding a mechanism to decouple element names from their types to
regular DTDs, see and
– also called EDTDs by ), hence use roughly the
expressiveness of local tree grammars. This is often due to nontransparent restrictions in the
XML Schema spec such as the already discussed Element Declarations
Consistent (EDC) constraint. and provide simplifications for XSDs and XSD authoring tools that should
relive authors from the burden of these constraints by automatically transforming
nondeterministic expressions into concise deterministic ones. Regarding RELAX NG
document grammars we think that restraining its expressive power to the class of URCGs would
provide a feasible compromise. Up to this point we hope that this more fine-grained hierarchy
may serve others as guide for choosing a specific XML schema language depending on the
expressivity of the markup language that has to be developed.BibliographyAbiteboul, S., P.
Buneman, and D. Suciu (2000). Data on the Web: From Relations to Semistructured Data and XML.
Morgan Kaufmann Publishers, San Francisco, California.Ansari, M. S., Zahid, N., and
K.-G. Doh. A Comparative Analysis of XML Schema Languages. In Slezak, D., Kim, T., Zhang, Y.,
Ma, J., and K. Chung, eds., Database Theory and Application. International Conference, DTA
2009, Held as Part of the Future Generation Information Technology Conference, FGIT 2009, Jeju
Island, Korea, December 10-12, 2009. Proceedings, volume 64, pages 41– 48. Springer, Berlin,
Heidelberg, 2009. doi:10.1007/978-3-642-10583-8_6.Balmin, A., Papakonstantinou,
Y., and V. Vianu (2004). Incremental validation of XML documents. ACM Transactions on Database
Systems (TODS), 29(4):710–751. doi:10.1145/1042046.1042050.Bauman, S., (2008). Freedom to
Constrain: where does attribute constraint come from, mommy? In Proceedings of Balisage: The
Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:10.4242/BalisageVol1.Bauman01.Bex, G. J., Gelade, W., Martens, W.
and F. Neven (2009). Simplifying XML Schema: Effortless Handling of Nondeterministic Regular
Expressions. In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on
Management of data, pages 731–744, New York, NY, USA, ACM. doi:10.1145/1559845.1559922.Brüggemann-Klein, A., and D. Wood (1992). Deterministic Regular Languages. In Finkel, A. and
M. Jantzen, eds., STACS 92. 9th Annual Symposium on Theoretical Aspects of Computer Science
Cachan, France, February 13–15, 1992 Proceedings, volume 577 of Lecture Notes in Computer
Science, pages 173–184. Springer, Berlin, Heidelberg. doi:10.1007/3-540-55210-3_182.Brüggemann-Klein,
A. (1993). Formal Models in Document Processing. Habilitation, Albert-Ludwig-Universität zu
Freiburg i. Br.Brüggemann-Klein, A., and D. Wood (1997). One-unambiguous regular languages. Information and
computation, 142:182–206. doi:10.1006/inco.1997.2695.Brüggemann-Klein, A., and D. Wood (2002). The parsing of extended context-free grammars.
HKUST Theoretical Computer Science Center Research Report HKUST-TCSC-2002-08, The Hong Kong
University of Science and Technology Library.Brüggemann-Klein, A., and D. Wood (2004). Balanced context-free grammars, hedge grammars and
pushdown caterpillar automata. In Proceedings of Extreme Markup Languages, Montréal,
Québec.Buck, L., Goldfarb, C. F., and P.
Prescod (2000). Datatypes for DTDs (DT4DTD) 1.0. W3C Note 13 January 2000, World Wide Web
Consortium.Carey, B. M. (2009). Meet CAM: A new XML
validation technology. Take semantic and structural validation to the next level. IBM
developerworks, IBM Corporation. http://www.ibm.com/developerworks/xml/library/x-cam/?S_TACT=105AGX54&S_CMP=C0924&ca=dnw-1036&ca=dth-x&open&cm_mmc=6015-_-n-_-vrm_newsletter-_-10731_131528&cmibm_em=dm:0:13962324.Chomsky, N. (1955). Logical Syntax
and Semantics: Their Linguistic Relevance. Language, 31(1):36–45, 1955. doi:10.2307/410891.Chomsky, N. (1956). Three Models for
the Description of Language. IRE Transactions on Information Theory, 2:113–124,
1956. doi:10.1109/TIT.1956.1056813.Clark, J. (2001). TREX – Tree Regular
Expressions for XML Language Specification. Technical report, Thai Open Source Software Center
Ltd.Clark, J., J. Cowan, and M.
Murata, (2003). Relax NG Compact Syntax Tutorial. Working Draft 26 March 2003, OASIS –-
Organization for the Advancement of Structured Information Standards. http://relaxng.org/compact-tutorial-20030326.html.Comon, H., M. Dauchet, R.
Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi (2007). Tree Automata
Techniques and Applications. Release November, 18th 2008. http://www.grappa.univ-lille3.fr/tata. Costello, R. L., and
R. A. Simmons (2008). Tutorials on Schematron: Two Types of XML Schema Language. http://www.xfront.com/schematron/Two-types-of-XML-Schema-Language.html.Møller, A. (2005). Document Structure
Description 2.0. Technical report, BRICS (Basic Research in Computer Science, Aarhus
University), 2005. http://www.brics.dk/DSD/dsd2.html.Fiorello, D., Gessa, N.,
Marinelli, P., and F. Vitali. DTD++ 2.0: Adding support for co-constraints. In Proceedings of
Extreme Markup Languages, Montréal, Québec.Gelade, W, Martens, W., and F.
Neven (2009). Optimizing Schema Languages for XML: Numerical Constraints and Interleaving.
SIAM Journal on Computing, 38(5):2021–2043. doi:10.1137/070697367.Goldfarb, C. F. (1978). DCF GML
User’s Guide (IBM SH20-9160). IBM, 1978.Goldfarb, C. F. (1991). The SGML
Handbook. Oxford University Press, Oxford.Gécseg, F., and M. Steinby
(1997). Tree languages. In Handbook of Formal Languages, volume 3, pages 1-68. Springer, New
York.Hopcroft, J., R. Motwani,
and J. Ullman (2000). Introduction to Automata Theory, Languages, and Computation. 2nd
edition. Addison Wesley Longman, Amsterdam.Jeliffe, R. (2009). Is Schematron a
rules language? Online: http://broadcast.oreilly.com/2009/01/is-schematron-a-rules-language.html.Kilpeläinen,
P., and R. Tuhkanen (2007). One-unambiguity of regular expressions with numeric occurrence
indicators. Information and Computation, 205(6):890–916. doi:10.1016/j.ic.2006.12.003.Klarlund, N., T.
Schwentick, and D. Suciu (2003). XML: Model, Schemas, Types, Logics and Queries. In Chomicki,
J., R. van der Meyden, and G. Saake, eds., Logics for Emerging Applications of Databases,
pages 1-41. Springer, Berlin, Heidelberg.Kracht, M. (to appear). Modal Logic
Foundations of Markup Structures in Annotation Systems. In Mehler, A., Kühnberger, K.-U.,
Lobin, H., Lüngen, H., Storrer, A., and A. Witt, eds., Modeling, Learning and Processing of
Text Technological Data Structures, Studies in Computational Intelligence. Springer,
Dordrecht.Lee, D. and W. Chu. Comparative
Analysis of Six XML Schema Languages. ACM SIGMOD Record, 29(3):76–87, September
2000. doi:10.1145/362084.362140.Maler, E., and J. E.
Andaloussi (1995). Developing SGML DTDs: From Text to Model to Markup. Prentice Hall, Upper
Saddle River, New JerseyM. Mani (2001). Keeping chess alive: Do we
need 1-unambiguous content models? In Proceedings of Extreme Markup Languages, Montréal,
Québec.Mani, M., and D. Lee (2002). XML
to Relational Conversion using Theory of Regular Tree Grammars. In Proceedings of the 28th
VLDB Conference, Hong Kong, China.Marcoux, Y. (2008). Graph
characterization of overlap-only TexMECS and other overlapping markup formalisms. In
Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies,
vol. 1. Montréal, Québec. doi:10.4242/BalisageVol1.Marcoux01.Martens, W., Neven, F., and
T. Schwentick (2005). Which XML Schemas Admit 1-Pass Preorder Typing? In Eiter, T., and L.
Libkin, eds., Database Theory – ICDT 2005, volume 3363 of Lecture Notes in Computer Science,
pages 68–82. Springer, Berlin, Heidelberg, 2005. doi:10.1007/978-3-540-30570-5_5.Martens, W., Neven, F.,
Schwentick, T., and G. Bex (2006). Expressiveness and Complexity of XML Schema. ACM
Transactions on Database Systems (TODS), 31(3):770–813. doi:10.1145/1166074.1166076.Martens, W., Neven, F. and T.
Schwentick (2007). Simple off the shelf abstractions for XML schema. SIGMOD Rec.,
36(3):15–22. doi:10.1145/1324185.1324188.Martens, W., Neven, F. and T.
Schwentick (2009). Complexity of Decision Problems for XML Schemas and Chain Regular
Expressions. SIAM Journal on Computing, 39(4):1486–1530. doi:10.1137/080743457.Møller, A., and M.
Schwartzbach (2006). An Introduction to XML and Web Technologies, chapter Schema Languages,
pages 92–187. Addison-Wesley, Harlow, England.Murata, M., D. Lee, and M.
Mani (2001). Taxonomy of XML Schema Languages using Formal Language Theory. In Proceedings of
Extreme Markup Languages, Montréal, Québec.Murata, M., D. Lee, M. Mani,
and K. Kawaguchi (2005). Taxonomy of XML Schema Languages Using Formal Language Theory. ACM
Transactions on Internet Technology, 5(4):660–704. doi:10.1145/1111627.1111631.Nentwich, C. (2005). CLiX – A
Validation Rule Language for XML. Presented by Anthony Finkelstein at W3C Workshop on Rule
Languages for Interoperability, 27-28 April 2005, Washington D.C. http://www.w3.org/2004/12/rules-ws/paper/24/.ISO/IEC 19757-4:2006. Information technology —
Document Schema Definition Languages (DSDL) — Part 4: Namespace-based Validation Dispatching
Language (NVDL), International Standard, International Organization for Standardization,
Geneva.Odgen, W. (1968). A Helpful Result for
Proving Inherent Ambiguity. In Mathematical Systems Theory, 2(3):191–194. doi:10.1007/BF01694004.Pawson, D. (2007). ISO Schematron
tutorial. http://www.dpawson.co.uk/schematron/.Papakonstantinou, Y., and V. Vianu (2000). DTD inference for views of XML data. In PODS ’00:
Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, pages 35–46, New York, NY, USA, ACM. doi:10.1145/335168.335173.Piez, W. (2001). Beyond the “descriptive
vs. procedural” distinction. In Markup Languages – Theory & Practice,
3(2):141–172. doi:10.1162/109966201317356380.ISO/IEC TR 22250-1:2002. Information
technology – Document description and processing languages – Regular Language Description for
XML – part 1: RELAX Core. International Standard, International Organization for
Standardization, Geneva.ISO/IEC 19757-2:2008. Information technology –
Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based validation – RELAX
NG (ISO/IEC 19757-2). International Standard, International Organization for Standardization,
Geneva.ISO/IEC 19757-2:2008. Information
technology – Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based
validation – RELAX NG (ISO/IEC 19757-2). Second Edition. International Standard, International
Organization for Standardization, Geneva.Rizzi, R. (2001). Complexity of
context-free grammars with exceptions and the inadequacy of grammars as models for XML and
SGML. Markup Languages – Theory & Practice, 3(1):107–116. doi:10.1162/109966201753537222.Rogers, J. (2003). Syntactic
Structures as Multi-dimensional Trees. In Research on Language and Computation,
1(3-4):265–305. doi:10.1023/A:1024695608419.Sasaki, F. (2010). How to avoid
suffering from markup: A project report about the virtue of hiding xml. In XML Prague 2010
Conference Proceedings, pages 105–123, Prague, Czech Republic, March 13–14 2010. Institute for
Theoretical Computer Science.ISO/IEC 19757-3:2006 Information
technology — Document Schema Definition Languages (DSDL) — Part 3: Rule-based validation —
Schematron. International Standard, International Organization for Standardization,
Geneva.ISO 8879:1986. Information Processing — Text and
Office Information Systems — Standard Generalized Markup Language. International Standard,
International Organization for Standardization, Geneva.Sperberg-McQueen,
C. M. (2003). Logic grammars and XML Schema. In Proceedings of Extreme Markup Languages,
Montréal, Québec.Sperberg-McQueen, C. M. and C.
Huitfeldt (2004). GODDAG: A Data Structure for Overlapping Hierarchies. In King, P. and E. V.
Munson, eds. Proceedings of the 5th International Workshop on the Principles of Digital
Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science, pages
139–160. Springer, 2004Stührenberg, M.
and D. Jettka (2009). A toolkit for multi-dimensional markup: The development of SGF to
XStandoff. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup
Technologies, vol. 3 (2009). Montréal, Québec. doi:10.4242/BalisageVol3.Stuhrenberg01.Vitali, F., Amorosi, N., and N.
Gessa. Datatype- and namespace-aware DTDs: A minimal extension. In Proceedings of Extreme
Markup Languages, Montré́al, Québec.van der Vlist, E. (2001).
Comparing XML Schema Languages, 12 December 2001. http://www.xml.com/pub/a/2001/12/12/schemacompare.html.van der Vlist, E. (2003).
RELAX NG. O’Reilly, Sebastopol.Extensible Markup Language (XML) 1.0. W3C
Recommendation, World Wide Web Consortium, 10 February 1998. http://www.w3.org/TR/1998/REC-xml-19980210.Extensible Markup Language (XML)
1.0 (Fifth Edition). W3C Recommendation, World Wide Web Consortium, 26 November 2008. http://www.w3.org/TR/2008/REC-xml-20081126/.Namespaces in XML 1.0
(Third Edition). W3C Recommendation, World Wide Web Consortium, 8 December 2009. http://www.w3.org/TR/2009/REC-xml-names-20091208/.XML Schema Part 0: Primer
Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/.XML Schema Part 1:
Structures Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/.XML Schema Part 2:
Datatypes Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/.W3C XML Schema Definition
Language (XSD) 1.1 Part 1: Structures. W3C Working Draft, World Wide Web Consortium, 3
December 2009. http://www.w3.org/TR/2009/WD-xmlschema11-1-20091203/.W3C XML Schema
Definition Language (XSD) 1.1 Part 2: Datatypes. W3C Working Draft, World Wide Web Consortium,
3 December 2009. http://www.w3.org/TR/2009/WD-xmlschema11-2-20091203/.XSL Transformations (XSLT) Version 2.0. W3C
Recommendation, World Wide Web Consortium, 23 January 2007. http://www.w3.org/TR/2007/REC-xslt20-20070123/.XQuery 1.0: An XML Query Language. W3C
Recommendation, World Wide Web Consortium, 23 January 2007. http://www.w3.org/TR/2007/REC-xquery-20070123/.