How to cite this paper
Marcoux, Yves, Michael SperbergMcQueen and Claus Huitfeldt. “Modeling overlapping structures: Graphs and serializability.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6  9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). https://doi.org/10.4242/BalisageVol10.Marcoux01.
Balisage: The Markup Conference 2013
August 6  9, 2013
Balisage Paper: Modeling overlapping structures
Graphs and serializability
Yves Marcoux
Associate professor
Université de Montréal, Canada
Michael SperbergMcQueen
Senior consultant
Black Mesa
Technologies
Claus Huitfeldt
Associate professor
University of Bergen, Norway
Copyright © 2013 by the authors. Used with permission.
Abstract
The problem of overlapping structures has long been familiar to the
structured document community. In a poem, for example, the verse and line
structures overlap, and having them both available simultaneously is
convenient, and sometimes necessary (for example for automatic analyses).
However, only structures that embed nicely can be represented directly in
XML. Proposals to address this problem include XML solutions (based
essentially on a layer of semantics) and nonXML ones. Among the latter is
TexMecs HS2003, a markup language that allows overlap
(and many other features).
XML documents, when viewed as graphs, correspond to trees. Marcoux M2008 characterized overlaponly TexMecs documents by
showing that they correspond exactly to completionacyclic nodeordered directed acyclic graphs.
In this paper, we elaborate on that result in two ways.
First, we cast it in the setting of a strictly larger class of graphs,
childarcordered directed graphs, that
includes multigraphs and nonacyclic graphs, and show that —
somewhat surprisingly — it does not hold in general for graphs with
multiple roots. Second, we formulate a stronger condition, fullcompletionacyclicity, that guarantees
correspondence with an overlaponly document, even for graphs that have
multiple roots.
The definition of fullycompletionacyclic graph does not in itself
suggest an efficient algorithm for checking the condition, nor for
computing a corresponding overlaponly document when the condition is
satisfied. We present basic polynomialtime upper bounds on the complexity
of accomplishing those tasks.
Table of Contents
 1. Motivation and related work

 1.1. Graphs and documents
 1.2. The problem of serialization
 1.3 The approach of this paper
 2. Childarcordered directed graphs
 3. Overlaponly documents
 4. Correspondence between a graph and a document
 5. Main results
 6. Checking fullcompletionacyclicity
 7. Conclusion and future work
 Appendix A. Notation and symbols
1. Motivation and related work
1.1. Graphs and documents
Many operations are more conveniently performed on a graph
representation than on a linear representation of a marked up
document, and vice versa. One of the strengths of XML is that
XML documents in serial form are readily deserialized into
ordered trees, which form a convenient data structure for many
useful operations.^{[1]}
Socalled “XML trees” are directed acyclic
graphs with single parenthood and a total ordering on leaf
nodes. While this constitutes an intuitively natural and
generally suitable model for the representation of the
structure of most documents, and for most purposes, it also
poses a challenge for the representation of complex structures
such as overlapping, fragmented or disordered document
elements, and multiple coexisting alternative structures,
which allow for a more natural representation of complex
documents in a wide range of situations.
For such purposes, a different kind of graph
representation has been proposed, the socalled Goddag SH2004. Roughly, Goddags (General OrderedDescendant Directed Acyclic
Graphs) are like XML trees except that they allow
multiple parenthood and do not require a total ordering on
leaf nodes. (Thus, XML trees constitute a subset of
Goddags.)
Documents using different techniques for representing such
structures in XML form (e.g., milestones, fragmentation,
virtual elements, etc. B1995 SH1999 W2005) can be mapped onto
Goddags, though not without applicationspecific mechanisms
typically involving levels of indirection which may appear
cumbersome. The experimental markup system TexMecs HS2003 offers mechanisms for the representation of
complex structures which can be mapped on to Goddags
independently of such knowledge.
Since its introduction, the Goddag data structure has
frequently been cited, and it is used as a reference in
various works on overlap. (For example, Moore M2012
studied Goddags in the context of access control, and
introduced the notion of globally ordered
Goddag.) However, the original description of
Goddags is rather informal, and exhibits the kinds of gaps,
vaguenesses, and ambiguities that have, over time, given
informality a bad name among mathematicians and others
interested in firm results.
For example, it was conjectured that a
linearized document which made use, in addition to the
mechanisms of XML, only of markup for overlapping elements,
could be represented by a Goddag with a total order on leaf
nodes (socalled restricted
Goddags), but no proofs were given of this fact. The paper was
silent and its authors agnostic about the serializability of
graphs with multiple roots, and the relationship between Goddags
and markup in terms of serializability was not systematically
investigated. In HS2003, it was assumed, but
no attempt was made to prove, that all TexMecs documents could
be represented as Goddags, or that all Goddags could be
serialized as TexMecs documents. The present paper is a modest
contribution towards straightening up the situation, by way of a
systematic study of the fine point of the complex relationship
between markup formalisms like TexMecs and graph structures like
Goddags.
1.2. The problem of serialization
The general problem is this: whenever a markup system (be it
XML, TexMecs, or another system) provides more than one way to
represent a given abstract structure,^{[2]}
that same abstract structure can be written out again
(serialized) in more than one way. Can we control the
serialization process to provide the markedup forms we find
easiest to work with at a given moment? Can we tell, by
inspection of a given graph, what serialization formats are
possible for the graph? In many cases, a marked up form using
overlapping elements, seems to at least some observers to be the
most natural
representation of a given document;
when can a graph be serialized using overlap alone, and when
does it require use of the more powerful mechanisms of virtual
or discontinuous elements?
A concrete example may help illustrate the point.
In the following fragment (adapted from D2004), the verse
elements are
empty milestones marking the beginning and end of verses, in
Trojan Horse
style markup. A Goddag structure
representing this fragment would have nodes for the verses,
but those nodes do not correspond one to one with XML elements
in this serialization:^{[3]}
<div xmlns="http://www.teic.org/ns/1.0">
<p>
<verse xml:id="Jer.2.1"/>
Moreover the word of the LORD
came to me, saying,
<verse eID="#Jer.2.1"/>
<q n="QJer.2.2A">
<verse xml:id="Jer.2.2"/>
Go and cry in the
hearing of Jerusalem, saying,
<q n="QJer.2.2B">
Thus says the LORD:
<q n="QJer.2.2C">
I remember you,
The kindness of your youth,
The love of your betrothal,
When you went after Me in the wilderness,
In a land not sown.
<verse eID="#Jer.2.2"/>
<verse xml:id="Jer.2.3"/>
Israel [was] holiness to the LORD,
The firstfruits of His increase.
All that devour him will offend;
Disaster will come upon them,
</q>
<!True Close QJer.2.2C>
says the LORD.</q>
<verse eID="#Jer.2.3"/>
<!* ... *>
</q>
</p>
</div>
The same Goddag structure can also be serialized in an extended
form of TexMecs notation.^{[4]}
<div
<p
<[email protected]Moreover the word
of the LORD came to me, saying,verse>
<q n="QJer.2.2A"
<[email protected]
Go and cry in the
hearing of Jerusalem, saying,
<[email protected]
Thus says the LORD:~>
<[email protected]
I remember you,
The kindness of your youth,
The love of your betrothal,
When you went after Me in the wilderness,
In a land not sown.~>
verse>
<q n="QJer.2.2B"
<[email protected]=>
<q n="QJer.2.2C"
<=Jer.2.2c=>
<[email protected]
Israel [was] holiness to the LORD,
The firstfruits of His increase.
All that devour him will offend;
Disaster will come upon them,
~>
q>
q>
<[email protected]
<=Jer.2.3a=>
says the LORD.
verse>
<* ... *>
q>
p>
div>
This particular Goddag structure can also be
serialized without virtual elements, just by allowing
the q
and verse
elements
to overlap:
<div
<p
<[email protected]
Moreover the word of the LORD
came to me, saying,
verse>
<q n="QJer.2.2A"
<[email protected]
Go and cry in the
hearing of Jerusalem, saying,
<q n="QJer.2.2B"
Thus says the LORD:
<q n="QJer.2.2C"
I remember you,
The kindness of your youth,
The love of your betrothal,
When you went after Me in the wilderness,
In a land not sown.
verse>
<[email protected]
Israel [was] holiness to the LORD,
The firstfruits of His increase.
All that devour him will offend;
Disaster will come upon them,
q>
says the LORD.
verse>
q>
<* ... etc. ... *>
q>
p>
div>
Intuitively, many readers find the overlaponly version of the
document simpler and more natural than the version using
bilocation tags. But (as demonstrated by
M2008), not all Goddag structures can be
written out using only overlap, without virtual elements,
discontinuous elements, or bilocation tags.
This leads directly and obviously to the questions
When can graphs be serialized
using overlap only? And conversely, when are other
markup mechanisms necessary?
Marcoux M2008 introduced the notions of
noDAG and overlaponly
(oo) TexMecs as a first step towards answering these
questions.
A noDAG is a nodeordered directed acyclic graph,
i.e., a slight variation on the Goddag, where there is a strict partial
ordering on nodes. As a markup language, ooTexMecs is the subset of TexMecs that
allows multiple roots and overlapping elements, but not virtual or interrupted
elements. Marcoux established that a noDAG is serializable if and only if it is
completionacyclic, and that
“roundtripping” is possible, in that there is essentially a
bijective correspondence between noDAGs and ooTexMecs documents.
1.3 The approach of this paper
In order to investigate whether and how the results of
M2008 apply to other classes
of graphs, we introduce here the more general notion of a childarcordered directed graph (CODG), and
demonstrate that the results from M2008 hold also for CODGs,
with the somewhat surprising exception of some CODGs with multiple roots. By defining the stronger
notion of “fully completionacyclic” graphs, we succeed in
identifying this subset: the ooserializable CODGs are exactly the
fullycompletionacyclic ones. We also give basic polynomialtime upper bounds
on the complexity of checking fullcompletionacyclicity and of actually
computing an ooserialization of fullycompletionacyclic
CODGs.
2. Childarcordered directed graphs
2.1 Definition A childarcordered directed graph (CODG for short) G = (V, ch) is a
directed graph over a
finite
nonempty set of vertices (or nodes) V, where ch (for children) is a total mapping from V to finite (and possibly
empty) sequences of nodes from V. The set of arcs (or edges) of G, noted
E(G), comprises exactly those ordered pairs (v, w) for which
(∃n ∈ N)[ ch(v, n) = w ],
where N represents the set of nonnegative integers. The notation
ch(v, n) is used as a shorthand for (ch(v))(n), that is, for the element with
index n in the sequence ch(v). We use 0origin indexing; thus, for all v
∈ V, ch(v, 0) denotes the first child of v (or is undefined, if v has no child).
Note: Throughout this paper, the
“parent” relation must be understood to be the exact inverse of
the “child” relation (we bother to make this explicit because it
is not the case in some other models, such as
the XPath 1.0 data model).
It is possible for the same child to show up at more than one place
in a sequence of children; that is, ch(v, n) = ch(v, m) with m ≠ n is
possible. Loops are allowed; that is, ch(v, n) = v for a given n is possible.
Note that (v, w) ∈ E(G) for given v and w tells only part of the story: There
could be many distinct values of n for which ch(v, n) = w. Also note that the
length of ch(v), i.e., the smallest value of n (≥ 0) for which ch(v, n) is undefined,
is greater than or equal to the number of distinct children of v (if v has no child,
ch(v) = ∅, which, as a sequence, is of length 0).
CODGs are very loose structures: they can be
“multigraphs,” in that more than one arc can link any given
pair of nodes. They can have cycles and loops (i.e., cycles of length one).
There can be both a direct (length one) and indirect path between any two given
nodes.
The rationale for the adjective childarcordered
is that for all
v ∈ V, ch(v) can be seen as inducing an ordering on the arcs going out of
v (the
childarcs
of v).
Examples We present examples of
CODGs illustrating some of their features.
Example 2.1 illustrates that CODGs can be disconnected, and that, in a
CODG:

A node can have more than one parent (here most simply
node d, with parents a and c, but also node b [with
parents a and b]). This is a significant departure from
the rule of single parenthood in XML.

There can be cycles and loops (cycles of length 1); here
the only example is the loop on node b.

The same node can occur more than once as a child of
some parent (here node c, which is both second and fourth
among the children of node a).

There can be both direct and indirect paths between two
nodes (here node a dominates node d both directly and
via node c).
Note that the order of the
outgoing arcs is usually not shown explicitly in the visual representation of a CODG.
We adopt the convention of drawing
the arcs going out of any node in order from left to right (even if
the arcs must cross each other further down, in order to reach
the child node they point to). So the leftmost arc leaving any
parent is pointing to that parent's first child, and the
rightmost arc points to that parent's last child.^{[5]}
Thus, in Example 1, ch(a) = (b, c, d, c).
On the rare occasions that this convention is not practical, we use
explicit green arrows between the outgoing
arcs to indicate their order, as in the next example.
In Example 2.2, ch(a) = (b, c) and ch(d) = (e, c). Rightpointing
arrows, though superfluous, are sometimes shown as a reminder of the implicit
convention.
Sibling precedence For all v, ch(v)
induces a siblingprecedence
relation sp(v) among
the children of v, defined by:
sp(v) =_{def} { (w, x) ∈ V × V
 (∃m, n ∈ N)[ m < n & ch(v, m) = w & ch(v, n) = x ]
}.
This relation may or may not be a strict order relation. When it is, we say that v
orders its children.
Example 2.3 illustrates that parents may order the same nodes
differently as children. Thus, note that ch(a) = (b, c), which induces the
strict order relation b
< c, and ch(d) = (c, b), which induces the strict order relation c < b.
In examples in which no pair of nodes is ordered differently by different
parents, we will usually place the green arrows between nodes, rather than between
arcs:
Auxiliary concepts We now define a
number of auxiliary concepts useful in discussions of CODGs. All
of them are secondary concepts in the sense that they are entirely and uniquely
determined by the set of vertices and the sequences of children of the
graph.
2.2 Definition Let G = (V, ch) be a
CODG. Then:

⇒_{G} denotes the (positive)
transitive closure of E(G).
Note that ⇒_{G} is not necessarily
antireflexive, as E(G) may contain cycles.

⇒^{*}_{G}
denotes the reflexive transitive closure of E(G), that is:
⇒_{G} ∪ { (v, v)  v ∈
V }.

sp(G) =_{def} { (v, w, x) ∈ V × V × V
 (∃m, n ∈ N)[ m < n & ch(v, m) = w & ch(v, n) = x ]
}.
The name “sp” stands for “sibling
precedence.” Note that, iff w occurs more than once in the sequence of
children of v, then (v, w, w) ∈ sp(G). Note also that it is entirely
possible for both (v, w, x) and and (v, x, w) to be members of sp(G).
Finally note that sp(G) is the union over all v ∈ V of ({v} × sp(v)),
where sp(v) is the siblingprecedence relation induced by ch(v), as defined above
at Sibling precedence.

gsp(G) =_{def} { (w, x) ∈ V × V 
(∃v ∈ V)[ (v, w, x) ∈ sp(G) ] }.
The name “gsp” stands for “global sibling
precedence.” It is the projection of sp(G) onto the last two components.
It follows from the observations in the preceding point that there can be loops
and cycles in gsp(G).
2.3 Notation Let G = (V, ch) be a CODG.
Unless otherwise stated:

V can also be denoted by V(G),

E denotes E(G),

⇒ denotes ⇒_{G},

⇒^{*} denotes
⇒^{*}_{G},

sp denotes sp(G),

gsp denotes gsp(G).
3. Overlaponly documents
The phenomenon we wish to study in this paper is how the structural
properties of a CODG relate to the fact that it mimics the containment and
precedence relationships among elements in some markedup document. More
specifically, we want to consider documents expressed in markup languages that
allow overlapping elements, such as TexMecs HS2003. Thus, we
need to define a model for such documents.
TexMecs allows many more constructs than element embedding and
overlap. However, in this paper, we concentrate on those two, ignoring the
others, such as virtual elements, interrupted elements, empty elements,
attribute specifications, entity references, generic identifier coindexing
(for handling selfoverlap), unordered contents, and comments. This is why we
speak of “overlaponly” (or “oo”) documents. When
the structure of a CODG corresponds to the containment and precedence
relationships of some oodocument (to be defined precisely in a moment), we say
the CODG is “ooserializable,” because the oodocument can be
viewed as a serialization (a representation in
serial form) of the CODG.
Instead of defining documents as character strings with syntactic
constraints, we use a more abstract approach that avoids some complications and
leads to results that are simpler to formulate. More constraints on the
definition of document could later be added to suit specific markup languages
if and when desired.
Intuitively, we adopt a tokenized view of the document, where tokens
are tags and leaves. Tokens are represented by their (integer) position in the
sequence of tokens that make up the document.
The tags in our model of oodocuments correspond, in the XML
world, to start and endtags for nonempty elements.
The leaves in our model of oodocuments correspond, in the
XML world, to text nodes (#PCDATA) and empty elements. Note that
our model abstracts away from the actual textual content of
elements and documents, and also ignores the differences among
different element types. We claim, however, that our abstraction
captures the essential structural aspects of markedup documents
with possible element overlap.
3.1 Definition An oodocument is a finite set of pairs of
the form (x, y), where x, y ∈ N (the set of nonnegative
natural numbers) and x ≤ y, additionally satisfying a number of
“wellformedness” constraints (stated below).
The pairs in a document are called ranges. If r = (x, y) is a
range, then r_{1} and r_{2} denote
respectively x and y.
Intuitively, a range gives the position of a starttag and of the
corresponding endtag, or the position of a leaf, in which case, x =
y. Formally, if D is an oodocument:

leaves(D) =_{def} { x ∈ N  (x, x)
∈ D },

stags(D) =_{def} domain(D) 
leaves(D),

etags(D) =_{def} image(D) 
leaves(D).
Note that, as usual:

domain(D) =_{def} { x ∈ N  ( ∃ y ∈ N  (x,
y) ∈ D ) }, and

image(D) =_{def} { y ∈ N  ( ∃ x ∈ N  (x,
y) ∈ D ) }, and
Oodocuments are subject to the following wellformedness
constraints.
For all oodocument D:

D is a partial function over N, i.e., for all x ∈ N,
there is at most one y such that (x, y) ∈ D.

D^{1} (that is, the inverse
of D) is also a partial function over N, i.e., for all y
∈ N, there is at most one x such that (x, y)
∈ D.

stags(D) ∩ etags(D) = ∅.
Put less formally: No token is both a starttag and an
endtag.
It must also be remembered that (as stated at the beginning of the
definition) for all range r, r_{1} ≤
r_{2}, which corresponds to the normal
rule of syntax that starttag must precede its matching endtag.
Note that we do not require the numbering of token positions to be
gapfree, nor do we forbid consecutive leaves without intervening
tags. There is also no requirement of an element spanning the whole
document: this is of course crucial for oodocuments to be able to
correspond to graphs with multiple roots.
3.2 Definition Let D be an oodocument,
and r, s ∈ D:
Note that in the latter case, r and s may or may not overlap. Also
note that r cannot both contain and precede
s.
4. Correspondence between a graph and a document
Intuitively, a CODG and an oodocument correspond to each other when the nodes of the graph and
the ranges of the document can be put in correspondence in such a way that node
reachability corresponds to range containment, and gsp corresponds to range
precedence.
4.1 Definition A CODG G and an
oodocument D correspond to each other iff
there exists a bijective mapping g from V(G) to D, such that all of the
following conditions hold:

(∀ b, c ∈ V(G)) [(b ⇒ c) iff g(b) contains
g(c)]

(∀ b, c ∈ V(G)) [if (b, c) ∈ gsp(G), then
g(b) precedes g(c)]
We then say that G and D correspond to each other
through g.
4.2 Definition A CODG G is said to
be ooserializable iff there exists an
oodocument that corresponds to G.
It is clear that every oodocument has a corresponding CODG: use
ranges as nodes, and the transitive reduction of range containment as
parentchild relation. Then, order all sets of siblings in range precedence
order.
It is also clear that some CODGs are
not ooserializable: for example, a CODG with cycles would imply (by
transitivity) a range containing itself, which is impossible. But are all
acyclic CODGs ooserializable? The following examples, inspired from
M2008, show that the question is at best not trivial:
Both graphs are acyclic and they differ by just the presence/absence
of one arc. Yet, only the first one is ooserializable M2008.^{[6]}
In M2008, Marcoux defined noDAGs, or
nodeordered DAGs, as (essentially) directed
acyclic graphs (DAGs) in which the nodes are partially ordered in such a way
that siblings (children of a common parent), as well as
distinct roots, are totally ordered. He then defined the property of
completionacyclicity for noDAGs, and showed
that ooserializable noDAGs are exactly the completionacyclic ones.
In order to investigate whether the same is true of CODGs, we must
define an analogous property for CODGs. The following definition is the natural
adaptation of completionacyclicity to CODGs.
4.3 Definition Let G = (V, ch) be a
CODG. Then:

ssba(G) =_{def} { (w, x) ∈ V × V 
(∃v ∈ V)[ v ⇒ w & (v, x) ∈ gsp & x
⇏ w ] }.
The name “ssba” stands for
“shouldstartbefore additions.”

ssb(G) denotes the transitive closure of (E ∪ gsp ∪
ssba).
The relation “ssb” is called the
“shouldstartbefore completion” of G.

seaa(G) =_{def} { (w, x) ∈ V × V 
(∃v ∈ V)[ v ⇒ w & (v, x) ∈
gsp^{1} & x ⇏ w
] }.
The name “seaa” stands for “shouldendafter
additions.” The relation gsp^{1} is
the inverse of relation gsp.

sea(G) denotes the transitive closure of (E ∪
gsp^{1} ∪ seaa).
The relation “sea” is called the
“shouldendafter completion” of G.
4.4 Notation Let G = (V, ch) be a CODG.
Unless otherwise stated:

ssba denotes ssba(G),

ssb denotes ssb(G),

seaa denotes seaa(G),

sea denotes sea(G).
The relations ssb and sea can be understood as meaning:
“should <something>
in any ooserialization of the
CODG,” for example “should
end after in any ooserialization of the
CODG.” Thus, “(v, w) ∈ ssb” can be read out as:
“v should start before w in any ooserialization of the CODG.” In
other words, ssb (respectively, sea) represents the start (respectively, end)
tagprecedence relations that can be deduced from the topology of the CODG,
supposing parentchild relations are interpreted as element containment, and
siblingprecedence relations as start and endtagprecedence.
The relations ssba and seaa represent the “additional”
arcs (over and above those in E and gsp or gsp^{1})
that must be considered to compute all the
possible ssb and sea pairs that can be deduced from the CODG topology.
4.5 Definition A CODG G = (V, ch) is
said to be completionacyclic (CA) iff each of
ssb(G) and sea(G) is acyclic.
5. Main results
Things are not as simple with CODGs as with noDAGs. There
are, it turns out, CODGs that are completionacyclic, yet not ooserializable. Our
first result is to show that Example 2.2 above, as well as the following CODG,
are in that situation:
Since the number of structurally distinct documents that can
possibly correspond to a 5node graph is finite, we could
exhaustively enumerate them and verify that none of them correspond
to either Example 2.2 or to
Example 6.
However, that would not be very
insightful.^{[7]}
We will thus rather proceed by way of a lemma (5.9) that
provides a general characterization of ooserializable CODGs, and
will be useful for our second main result.
5.1 Definition Let G = (V, ch) be a
CODG. The ancestral precedence relation of G,
denoted ap(G), is defined as:
ap(G) =_{def} { (v, w) ∈ V × V  (v
⇒ w) & (w ⇏ v) }.
It is easy to show that ap(G) is always a strict partialorder on V.
Informally, we could say that ap(G) gets rid of the cycles in G by contracting
its (maximal) stronglyconnected components, then reexpanding them to an equal
number of disconnected
vertices.
5.2 Definition Let G = (V, ch) be a
CODG. A root in G is a vertex r ∈ V for
which:
(∄w ∈ V)[ (w, r) ∈ ap(G) ].
5.3 Notation Let G = (V, ch) be a CODG.
Unless otherwise stated, ap denotes ap(G).
The next result establishes that for any distinct roots v and w,
either v and w are in the same stronglyconnected component, or else v and w
are unordered in each of ssb and sea.
5.4 Lemma Let G = (V, ch) be a CODG,
and v and w two distinct roots in G. Then, either:
or
{ (v, w), (w, v) } ∩ ((ssb ∪ sea) − ⇒) = ∅.
Note: For space consideration, most
proofs are omitted.
Thus, an important difference between noDAGs and CODGs is that the
latter can have unordered root pairs, whereas noDAGs have (by definition) their
roots totallyordered.
5.5 Definition A CODG G = (V, ch) is
said to be spconsistent iff gsp is an acyclic
relation, i.e., iff:
(∀v ∈ V)[(v, v) ∉ transitiveclosure(gsp)].
Note that if G is truly a multigraph, i.e., if some node occurs more
than once as a child of the same parent, then G is certainly
not spconsistent. However, G could fail to be
spconsistent without being a true multigraph, for example if two siblings are
ordered differently by two distinct parents.
5.6 Definition A CODG G = (V, ch) is
said to be reduced iff no node is both
directly and indirectly reachable from some other node, i.e., iff:
(∄v, w, x ∈ V)[ {(v, w), (v, x)} ⊆ E(G) &
w ⇒ x].
Note that if there is a cycle in G, then it is not reduced.
5.7 Lemma If a CODG is
spconsistent, is reduced, and has a single root, then it is isomorphic to a
noDAG; thus, by M2008, it is ooserializable iff it is
completionacyclic.
5.8 Definition Let G be a CODG. A
singlerooted extension (sre) of a G, is
identical to G with an added root that has as children the roots of the
original CODG, in some ordering, without
repetition.
Note that, in general, a CODG has more than one sre (in effect, n!,
where n is the number of roots in the original CODG, i.e., one for each
possible ordering of the original roots).
5.9 Lemma A CODG is ooserializable
iff it has an sre that is ooserializable.
Proof sketch. (←): Let G be a
CODG, and H an ooserializable sre of G. Let D be any serialization of H.
Because H has only one root, and D corresponds to H, there must be a range in D
that contains all the others. Thus, the first and last tag of H must be
matching tags. By “removing” those tags from D, we obtain an
ooserialization of G.
(→): Let G be an ooserializable CODG, and D any serialization
of G. By “adding” a starttag and a matching end tag at
(respectively) the beginning and end of D, we obtain a document that can be
shown to correspond to some sre of G.
We are now ready to state our first main result:
5.10 Theorem There exist CODGs that
are completionacyclic but not ooserializable.
Proof sketch. The theorem follows
from the observations that:

Each of Examples 2.2 and 6 is completionacyclic.

Each of Examples 2.2 and 6 has exactly two sres, each of which is
spconsistent, reduced, and (by definition of sre) has a single root.

Each sre of each of Examples 2.2 and 6 is completioncyclic.
By Lemma 5.7, none of Examples 2.2 and 6 has an sre that is
ooserializable. Thus, by Lemma 5.9, none of Examples 2.2 and 6 is
ooserializable.
Our second main result is easiest seen as a corollary to the proof of
the preceding theorem. First, we define:
5.11 Definition A CODG is said to be
fullycompletionacyclic (FCA) iff it has an sre
that is completionacyclic.
5.12 Theorem A CODG is
ooserializable iff it is fullycompletionacyclic.
Proof sketch. The theorem follows
from the proof of the preceding theorem and the following lemma:
5.13 Lemma If a CODG is not
spconsistent or is not reduced, then it is not completionacyclic.
6. Checking fullcompletionacyclicity
An obvious way to check whether a CODG is fullycompletionacyclic is to try
out all possible sres and see if at least one is completionacyclic. From a
completionacyclic sre, it would be easy to derive an ooserialization
of the CODG. However, since there are n! different sres to check (where n is
the number of roots in the CODG), this can be very inefficient. It would be
nice to be able to check whether a CODG is fullycompletionacyclic without
having to generate all possible sres.
It turns out it suffices to check each pair of roots for a particular
condition which is verifiable in polynomial time. Since there are n × (n  1)
/ 2 rootpairs and checking the condition can be done in polynomial time, it
follows that fullcompletionacyclicity can in fact be checked in polynomial
time.
The condition to be checked is as follows.
6.1 Definition Let r and s be two roots of
some CODG G = (V, ch) that are unordered with respect to ssb(G). We say that r
must precede s, noted r ↝ s, iff there
exist vertices x and y standing in either (or both) of the following configurations
with
respect to r and s:
Here, the doublearrows represent the reachability (⇒) relation, not just
parentchild relationships. The red doublearrow (with a stroke through it)
means the complement of ⇒ (thus, in Configuration 1, s ⇏ x). It does not
matter whether or not r ⇒ y (resp. s ⇒ x) in Configuration 1 (resp.
Configuration 2). In other words, at least one of r
⇏ y and s ⇏ x must be the case. The dotted green arrow means that (x, y) ∈
(ssb − ⇒), in other words, that x precedes y without being an ancestor of
it.
An instance of at least one of those configurations is found in each of the following
CODGs:
6.2 Lemma Let G be a CODG, and r and s be
two roots in G such that r ↝ s. Then, for each H that is a CA sre of G, (r, s)
∈ gsp(H).
Proof sketch. If there exists no CA sre of G, or if
there are no two roots r and s in G such that r ↝ s, then the lemma is
vacuously verified. Let thus H be any CA sre of G and, for the sake of
contradiction, suppose r and s are two Groots such that r ↝ s and such that
(r, s) ∉ gsp(H). Suppose x and y are two vertices as in Configuration 1
above (we prove only the case of Configuration 1; that of
Configuration 2 is proved similarly).
By the definition of sre, if (r, s) ∉ gsp(H), then it must be the case that
(s, r) ∈ gsp(H), and thus, by construction of ssb(H) (Definition 4.32),
that (s, r) ∈ ssb(H). Hence, it follows by Lemma 4 of M2008 and the fact that r and s do not stand in
ancestordescendant relationship (being both Groots, they are
⇒incomparable), that (r, s) ∈ sea(H). Similarly, from (x, y) ∈ ssb(H) and x ⇏
y, and y ⇏ x (because (x, y) ∈ ssb(H)), we conclude that (y, x) ∈
sea(H).
Now, by construction of sea(H) (Definition 4.34), and from the facts
that r ⇒ x, that (r, s) ∈ gsp^{1}(H), and that s ⇏ x,
we conclude that (x, s) ∈ sea(H). So, we have (y, x) ∈ sea(H) (established
earlier), and (x, s) ∈ sea(H), and (s, y) ∈ sea(H) (because s ⇒ y). Thus,
sea(H) is cyclic, contrary to our hypothesis that H is CA, and so we must
reject the hypothesis that (r, s) ∉ gsp(H), and conclude that (r, s) ∈
gsp(H).
QED
6.3 Theorem A CODG is FCA iff it is CA and
it does not have any two roots r and s such that r ↝ s and s ↝ r.
Proof sketch.
(⇒) Any FCA CODG is CA. If a CODG had roots r and s such that r ↝ s and s ↝
r, then by Lemma 6.2, it would have a cycle in ssb, and thus could not be
CA.
(⇐) Let G be a CA CODG in which no two roots r and s are such that r ↝ s and
s ↝ r. Note that by Lemma 5.13 and the fact that G is CA, we know that G
is spconsistent and reduced, and will thus take this for granted.
We give an algorithm for constructing an ordering of the roots of G that can
be used as the rootorder in a sre H of G which will be shown to be FCA. In
the algorithm, ssb(G, ROR) denotes the result of building ssb as per
Definitions 4.31 and 4.32, but using gsp(G) ∪ ROR instead of
gsp(G)
Algorithm:

Let ROR = { (r, s)  r and s are roots in G and r ↝ s }.

Let SSB = ssb(G, ROR).

WHILE (∃ x, y ∈ V)[ch(x) = ch(y) & {(x, y), (y, x)} ∩ SSB =
∅]

Pick any x and y satisfying the WHILE clause, and
let X = x, and Y = y.

Let ROR = ROR ∪ {(r, s)  r and s are roots in G,
and r ⇒^{*} X, and s
⇒^{*} Y}.

Let SSB = ssb(G, ROR).
When the algorithm stops, ROR is the rootorder to be used for
constructing H.
The intuition behind the algorithm is best conveyed with examples.
Essentially, the algorithm goes like this: start with the rootorderings that
are imposed by the topology of the CODG, i.e., those pairs of roots (r, s) for
which r ↝ s. Then, for the other rootpairs, they can basically be ordered
randomly, as long as no silly decision
is taken.
To see what silly decisions
would be, consider
Example 10. A silly decision would be to stick root r between s and t. To
avoid such decisions, the rootordering must be built gradually, considering
one by one (the order does not matter) the unordered pairs of parents of the
same children. For each such pair (x, y), decide of an arbitrary order, then,
make sure all roots reachable upwards from x come before all those reachable
upwards from y (there can be no root reachable both ways, otherwise, x and y
would have been ordered to start with).
Each time such rootorderings are added, the consequences on the global ordering
of the CODG are recomputed and propagated down from the roots.
Example 11 provides a more intricate example, in which eight possible
pairs of unordered parents of the same children exist: (u, w), (u, x), (v, w),
and (v, x), and their inverse. Whichever pair is chosen, it will result in roots r
and s
being ordered at step 3b, and then, in all eight pairs being ordered at
step 3c.
Examples 12 and 13 show that the addition of a leaf can cause some
rootpairs to be ordered from the start: (r ↝ s) and (r ↝ t) in
Example 12; (r ↝ s) in Example 13.
Proof sketch of termination: At each turn, at least one
pair of the kind sought for in (3) is ordered. Indeed, it can be shown that
the chosen pair (x, y) causes arcs to be added to ROR that will necessarily
order the pair (x, y) itself. Thus, eventually, no pair satisfying the WHILE
clause will remain.
Proof sketch that ROR orders all pairs of roots: The
existence of an unordered rootpair
implies that there is a pair satisfying the WHILE clause. Thus, when no more
such pair exists, all the roots
have been ordered.
Proof sketch that ssb(G, ROR) is acyclic: Any condition
that might cause a cycle would also cause a cycle in { (r, s)  r and s are
roots in G and r ↝ s }, a contradiction.
A rough analysis of the algorithm shows that its running time is polynomial
(probably with degree at most 3 or 4). Obviously, it could be used to actually
build a CA sre of any FCA CODG, and thus an ooserialization of the CODG. This
establishes a polynomial upperbound on the task of verifying
fullcompletionacyclicity and of generating an ooserialization of a FCA
CODG. While interesting, we do not believe these upperbounds to be tight, and
hence consider the exact complexity of these tasks to be open
questions.
7. Conclusion and future work
In this paper, we defined a class of graphs, childarcordered directed graphs (CODGs), that includes
multigraphs and nonacyclic graphs, and investigated the conditions under
which a CODG is “ooserializable”, i.e., has a structure which
corresponds to that of an overlaponly markedup document. We found that the
property of completionacyclicity does not
guarantee ooserializability in general for CODGs, by showing that there exist
completionacyclic CODGs that are not ooserializable. By contrast, Marcoux
has shown that for the less general class of nodeorderedDAGs (noDAGs), completionacyclicity does
guarantee ooserializability M2008.
We then defined a condition strictly stronger than
completionacyclicity, fullcompletionacyclicity, and showed that it does
guarantee ooserializability for all CODGs.
Finally, we presented polynomialtime algorithms for checking
fullcompletionacyclicity and for computing an ooserialization of
fullycompletionacyclic CODGs. However, we do not believe these algorithms to
be optimal. Thus, open questions include determining the exact complexity of
— and finding optimal algorithms for — checking
completionacyclicity, fullcompletionacyclicity, and of actually computing
an ooserialization of a CODG once it is found to be serializable.
Another area of research we hope to pursue in the near future is
investigating whether and how some forms of interrupted and virtual elements,
as found in TexMecs, can be characterized in terms of graphs.
Appendix A. Notation and symbols
For the convenience of readers who find the notation used
here unfamiliar, we list here the symbols and conventional
variable names used in this paper.
↝

the mustprecede relation on roots of a graph
G: r ↝ s if and only if there exist vertices x and y standing in some
specific configurations with respect to r and s (see Definition 6.1).

⇒_{G}

the positive transitive closure of
E(G), for any graph G; sometimes known as the
reachability relation of G; often abbreviated
to ⇒ when the identity of G is understood.

⇒^{*}_{G}

the reflexive transitive closure of
E(G), for any graph G; often abbreviated
to ⇒^{*} when
the identity of G is understood.

∈

is an element of. So x ∈
y means that x is an element in the
set y.

∪

set union. So x ∪ y
denotes the set of all elements which are members either
of set x or of set y or both.

∩

set intersection. So x ∩ y
denotes the set of all elements which are members of
both set x and set y.

ap(G)

the ancestralprecedence relation of a graph G;
abbreviated ap when G is understood. This is a binary
relation consisting of all node pairs (v, w) for which
(informally) v is an ancestor of w and not viceversa.

ch

(core meaning) a unary function that maps each
node of a graph to a sequence of nodes of that
graph.
(secondary meaning) a twoargument function that
maps from a pair (v, n) (where v is a node and n is
an integer) to at most one node among the children of v.

ch(v)

for any node v in a graph G,
ch(v) denotes a sequence of nodes in
G.
Note that sequences are typically modeled as
sets of pairs (n, e) where n is a number and e an
element of the sequence. The set of pairs
denoted by ch(v) can thus be treated as a function
from nonnegative integers to nodes in the graph:
for any node v and any suitable integer
n, ch(v)(n) denotes the nth child of v;
ch(v)(0) denotes the first child, ch(v)(1) denotes
the second, etc. To reduce the need for parentheses,
ch(v)(n) is normally written in the
simpler form ch(v,n) .

ch(v, n)

denotes the nth child (counting from 0)
of node v. This is a shorthand form for the
expression ch(v)(n).

E(G)

the set of arcs in a graph G

G

the conventional variable for a graph (here
invariably a CODG)

gsp

a shorthand form of gsp(G) when the identity of
the graph G is understood.

gsp(G)

the global siblingprecedence relation of a graph G;
written as sp when G is understood.
This is (speaking informally) a relation consisting of all
node pairs (v, w) which share a parent, and for which v
precedes w among the children of that parent. Note that
the same pair of children may share more than one parent,
occurring in one order for one parent and in the other
order for the other parent. So gsp(Example 2.3)
includes both (b, c) and (c, b).

N

the natural numbers (0, 1, 2, ...)

n, m

conventional variables used to represent
individual natural numbers

r

conventional variable used to denote an
arbitrary range.

r_{1}

for a given range r,
r_{1} denotes the first
element of r.

r_{2}

for a given range r,
r_{2} denotes the second
element of r.

r, s

conventional variables used to denote two
roots of a CODG.

sea(G)

the shouldendafter relation
of a graph G; abbreviated sea when G is
understood

seaa(G)

the shouldendafter additions
relation of a graph G; abbreviated seaa
when G is understood

sp

a shorthand form of sp(G) when the identity of
the graph G is understood.

sp(G)

the siblingprecedence relation of a graph G; this is
a ternary relation consisting of all node triples (v, w,
x) for which (informally) w precedes x among the children
of v; note that if w occurs more than once among the
children of v, then the triple (v, w, w) appears in
sp. If some node x occurs between the two occurrences of
w, then (v, w, x) and (v, x, w) are both in sp.

sre

singlerooted extension of some graph G. Note that
any multirooted graph G has many singlerooted extensions
(one for each possible ordering of the roots of G).

ssb(G)

the shouldstartbefore
relation of a graph G; abbreviated ssb when
G is understood

ssba(G)

the shouldstartbefore
additions relation of a graph G; abbreviated
ssba when G is understood

V

the set of nodes (or vertices) in a graph;
a shorthand for V(G)

V(G)

the set of vertices in a graph G

V × V

the Cartesian product of V, the set of nodes;
that is, the set of pairs (v, w) where v ∈ V and w
∈ V

V × V × V

the set of triples (v, w, x) where v, w and x
are all ∈ V.

v, w, x, u

variables conventionally used
for individual nodes in a graph (by convention, all of
v, w, x, u ∈ V)

References
[B1995] David Barnard, Lou Burnard, JeanPierre
Gaspart, Lynne A. Price, C. M. SperbergMcQueen, and Giovanni Battista Varile.
“Hierarchical Encoding of Text: Technical Problems and SGML
Solutions”, in Computers and the
Humanities, 29/3 1995, pp. 211231.
http://www.springerlink.com/content/p7775247276v88h3/, http://xml.coverpages.org/barnardHierps.gz. doi:https://doi.org/10.1007/BF01830617
[D2004] Steven DeRose. “Markup Overlap: A
Review and a Horse”. Paper delivered at Extreme Markup
Languages, 2004, Montréal.
http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html
[HS2003] Claus Huitfeldt and C. M. SperbergMcQueen.
TexMECS: An experimental markup metalanguage for complex
documents. University of Bergen, January 2001, rev. October 2003.
http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html
[M2008] Yves Marcoux. Graph
characterization of overlaponly TexMECS and other overlapping markup
formalisms. Proceedings of the Balisage
2008 conference, 1215 august 2008, Montréal (Canada).
doi:https://doi.org/10.4242/BalisageVol1.Marcoux01
[M2012] Moore, Neil. Multihierarchical documents
and finegrained access control (2012). Theses and
DissertationsComputer Science. Paper 6. http://uknowledge.uky.edu/cs_etds/6
[SH2004] C. M. SperbergMcQueen and Claus Huitfeldt.
GODDAG: A Data Structure for Overlapping
Hierarchies. SpringerVerlag (2004).
Preprint at
http://cmsmcq.com/2000/poddp2000.html
[SH1999] C. M. SperbergMcQueen and Claus Huitfeldt:
“Concurrent Document Hierarchies in MECS and SGML”, in
Literary and Linguistic Computing, 14 1999,
pp. 2942.
http://llc.oxfordjournals.org/cgi/content/abstract/14/1/29. doi:https://doi.org/10.1093/llc/14.1.29
[W2005] Andreas Witt. “Multiple Hierarchies: New
Aspects of an Old Solution”, in: Stefanie Dipper, Michael Götze, and
Manfred Stede (eds.), Heterogeneity in Focus: Creating
and Using Linguistic Databases, vol. 2 of Interdisciplinary
Studies on Information Structure (ISIS), Working Papers of the SFB 632.
University of Potsdam, Germany, 2005. (Corrected reprint of an Extreme Markup
2004 paper.)
http://www.sfb632.unipotsdam.de/publications/isis02_4witt.pdf
×Yves Marcoux. Graph
characterization of overlaponly TexMECS and other overlapping markup
formalisms. Proceedings of the Balisage
2008 conference, 1215 august 2008, Montréal (Canada).
doi:https://doi.org/10.4242/BalisageVol1.Marcoux01
×Andreas Witt. “Multiple Hierarchies: New
Aspects of an Old Solution”, in: Stefanie Dipper, Michael Götze, and
Manfred Stede (eds.), Heterogeneity in Focus: Creating
and Using Linguistic Databases, vol. 2 of Interdisciplinary
Studies on Information Structure (ISIS), Working Papers of the SFB 632.
University of Potsdam, Germany, 2005. (Corrected reprint of an Extreme Markup
2004 paper.)
http://www.sfb632.unipotsdam.de/publications/isis02_4witt.pdf