<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2"><title>Graph characterization of overlap-only TexMECS and other overlapping
	 markup formalisms</title><info><confgroup><conftitle>Balisage: The Markup Conference 2008</conftitle><confdates>August 12 - 15, 2008</confdates></confgroup><abstract><para>We establish a necessary and sufficient condition for a graph to
		  correspond to the structure of an overlapping markup document, such as a
		  well-formed TexMECS document (not using interrupted or virtual elements). This
		  provides a test for determining if any given graph can be serialized into a
		  TexMECS document—or any other similar language—using only
		  overlapping markup. Such a test may prove useful in DOM-based applications, to
		  determine if an attempted modification operation would preserve the
		  overlap-only serializability of the document. For example, in a document editor
		  using a graph-oriented interface, the user could be warned when a requested
		  operation would prevent the document from being serializable with overlapping
		  elements only. To our knowledge, no such characterization has been given
		  before.</para></abstract><author><personname><firstname>Yves</firstname><surname>Marcoux</surname></personname><personblurb><para>Yves Marcoux has been a faculty member at EBSI, University of Montréal,
			 since 1991. He is mainly involved in teaching and research activities in the
			 field of document informatics. Prior to his appointment at EBSI, he worked
			 for 10 years in systems maintenance and development, in Canada, the U.S., and
			 Europe. He obtained his Ph.D. in theoretical computer science from the University
			 of Montréal in 1991. His main research interests are document theory,
			 structured document implementation methodologies, and information retrieval in
			 structured documents. Through GRDS, his research group at EBSI, he has been
			 principal architect for the Governmental Framework for Integrated Document
			 Management, a project funded by the National Archives of Québec and by the
			 Québec Treasury Board. He is currently a visiting researcher at Aksis,
			 University of Bergen (Norway).</para></personblurb><affiliation><jobtitle>Associate professor</jobtitle><orgname>Université de Montréal, Canada</orgname></affiliation><affiliation><jobtitle>Visiting researcher</jobtitle><orgname>University of Bergen, Norway</orgname></affiliation><email>yves.marcoux@umontreal.ca</email></author><legalnotice><para>Copyright © 2008 by the authors.  Used with
permission.</para></legalnotice></info><section><title>Introduction and motivation</title><section><title>Overlapping markup</title><para>Overlapping structures are generally recognized as a reality with
		  which structured documents have to deal. For example, if we want to encode
		  in a single document both the paragraph and page structures of a text, we have
		  to deal with overlapping structures in one way or another. Numerous approaches
		  to tackle the problem have been proposed in the literature or used in practical
		  encoding projects. An excellent survey of such approaches can be found in the
		  introduction of a 2006 EML article by Sperberg-McQueen
		  <citation linkend="S2006">S2006</citation>. Examples include TEI
		  <emphasis role="ital">milestones</emphasis>
		  <citation linkend="TEI">TEI</citation>, LMNL (Layered Markup and
		  Annotation Language) of Jennison et al.
		  <citation linkend="LMNL">LMNL</citation>, and multi-colored trees of
		  Jagadish et al.
		  <citation linkend="J2004">J2004</citation>.</para><para>A number of those solutions involve <emphasis role="ital">overlapping markup</emphasis>, i.e., markup in which elements must
		  have matching start- and end-tags, but need not nest properly as in XML. For
		  example, the following document is not well-formed XML, because the elements do
		  not nest properly:</para><blockquote><para>
			 <programlisting xml:space="preserve">&lt;A&gt;Hello &lt;B&gt;small&lt;/A&gt; world!&lt;/B&gt;</programlisting></para></blockquote><para>Yet, it could be considered well-formed in some hypothetical markup
		  language allowing overlap.</para></section><section><title>Overlap in graphs</title><para>Structured documents are often drawn as graphs. Some would say it
		  is one of the best ways to clearly bring out the structure of a document. For
		  authors, it might be one of the most usable representations of a document, in
		  which the nature and meaning of the various manipulations that can be performed
		  on the edited document are most obvious and clear. In fact, most extant
		  structured editors offer a view of the document that can be regarded as a graph
		  representation.</para><para>When overlapping markup is allowed in documents, a graph
		  representation becomes even more useful, because <emphasis role="ital">arrows</emphasis> can then be used to represent the overlap
		  relationships, which the relative geometric positioning of elements could only
		  awkwardly convey. So, a graph-based user interface for editing documents with
		  overlapping markup is a sensible idea.</para><para>Suppose you have developed a graph-based editor for structured
		  documents with overlapping markup. One day, a minimalist poet composes a
		  document with the following structure:</para><para>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-001.png"/></imageobject></mediaobject> </para><para>TexMECS (Huitfeldt and Sperberg-McQueen
		  <citation linkend="HS2003">HS2003</citation>) is a markup language that
		  allows overlapping elements. It will be discussed in more detail in
		  <xref linkend="texmecs"/>, but we can say right now that start-tags are of the
		  form <code>&lt;a|</code> and end-tags of the form <code>|a&gt;</code>. In
		  TexMECS, the above structure is thus representable as follows:</para><para>
		  <programlisting xml:space="preserve">&lt;book|
  &lt;prelude|
    autumn
  &lt;poem|
  &lt;afterthought|
    leaves
  |prelude&gt;
    fall
  |poem&gt;
    down
  |afterthought&gt;
|book&gt;</programlisting> </para><para>Leaving out any unmatched tag, the contents of, for example,
		  <code>afterthought</code> are indeed <code>" leaves fall down "</code>, as the
		  graph structure says it should; and so on, for all elements.</para><para>Feeling particularly Zen that day, the poet decides to thin down
		  the poem even more by removing <code>" leaves "</code> from the
		  <code>poem</code>:</para><para>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-002.png"/></imageobject></mediaobject></para><para>Would your editor allow that arrow removal? Actually, it follows
		  from our Theorem 1 that the resulting structure is <emphasis role="ital">not
		  representable at all</emphasis> with overlapping elements in TexMECS.<footnote><para>It is representable in TexMECS, but only with features more
				powerful than overlapping markup; see <xref linkend="texmecs"/>.</para></footnote></para><para>As the reader may want to verify, any trial-and-error attempt to
		  write a TexMECS document corresponding to the new structure fails, either by
		  aborting, or by producing the following document:</para><para>
		  <programlisting xml:space="preserve">&lt;book|
  &lt;prelude|
    autumn
  &lt;afterthought|
    leaves
  |prelude&gt;
  &lt;poem|
    fall
  |poem&gt;
    down
  |afterthought&gt;
|book&gt;</programlisting> </para><para>which corresponds to this structure:</para><para>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-003.png"/></imageobject></mediaobject></para><para>Intuitively, the problem is that we have to start
		  <code>afterthought</code> before <code>poem</code> starts, and end it after
		  <code>poem</code> ends, making it impossible not to embed <code>poem</code>
		  within <code>afterthought</code>. The problem is obvious in this example;
		  however, the general conditions under which a given graph is serializable with
		  overlapping elements turn out to be not so simple, as the statement of Theorem
		  1 shows.</para><para>A good editor should at least warn the author when an attempted
		  operation would jeopardize the serializability of the document. For this, a
		  <emphasis role="ital">characterization</emphasis> of exactly which graphs are
		  serializable is needed. Without such a characterization, an editor has no
		  way to determine whether the graph is serializable.</para></section><section><title>Overview of this paper</title><para>We give here an exact characterization of graphs that are
		  serializable with overlapping markup. As far as we know, this is the first time
		  such a characterization has been given. We state a criterion (Theorem 1) which
		  guarantees a graph to be serializable, and which is satisfied by
		  <emphasis role="ital">all</emphasis> serializable graphs. By testing this
		  criterion, an editor can verify at all times whether the edited graph is
		  serializable or not.</para><para>We also give the inverse characterization (Theorem 2): we show that
		  <emphasis role="ital">all</emphasis> well-formed overlapping-markup documents
		  can be obtained by serializing some graph (necessarily satisfying the
		  criterion). Thus, an editor offering <emphasis role="ital">only</emphasis> a
		  graph-based interface (and no plain-text view), and allowing only the creation
		  of serializable graphs, would still be complete, in that it would permit the
		  creation of <emphasis role="ital">any</emphasis> document representable by
		  overlapping markup.</para><para>Our results are formulated in terms of the TexMECS markup language,
		  but they apply to any markup language (or subset of a markup language)
		  allowing overlapping elements.</para></section><section><title>Related work</title><para>Sperberg-McQueen and Huitfeldt
		  <citation linkend="SH2004">SH2004</citation> define a graph model for
		  essentially the same subset of TexMECS as the one we deal with here: the
		  <emphasis role="ital">general ordered-descendant directed acyclic
		  graph</emphasis> (GODDAG). They study certain restrictions of the model but do
		  not give an exact characterization of serializable graphs.</para></section><section><title>Node ordering</title><para>It is clear that, in graphs corresponding to documents, at least
		  <emphasis role="ital">some</emphasis> ordering of the nodes is important. In
		  the above example, the prelude is "autumn leaves" and not "leaves autumn". In a
		  book, we expect the author to be able to specify the order of the chapters,
		  etc. To account for this, we use <emphasis role="ital">node-ordered</emphasis>
		  graphs, in which the children of a node are ordered relative to each
		  other.</para><para>One part of our main result (Theorem 1) states that it is actually
		  useless, for overlap-only documents, to allow specifying the order of
		  appearance of the nodes over and above that of siblings. In other words, an
		  overlap-only marked-up document is entirely and uniquely determined by the
		  combination of parent-child relationships and sibling ordering. So, for
		  example, an editor for overlap-only documents would offer no extra expressivity
		  by allowing authors to arbitrarily specify the order of appearance of the
		  nodes, over and above specifying the parent-child relationships and ordering
		  siblings.</para></section></section><section><title>Basic definitions and notation</title><section xml:id="dags" xreflabel="2.1"><title>Digraphs, DAGs</title><para>A <emphasis role="ital">directed graph</emphasis> (or
		  <emphasis role="ital">digraph</emphasis>) G = (N<subscript>G</subscript>,
		  A<subscript>G</subscript>) is made up of a set of <emphasis role="ital">nodes</emphasis>, N<subscript>G</subscript>, and of a set of
		  <emphasis role="ital">arcs</emphasis>, A<subscript>G</subscript>. The set
		  A<subscript>G</subscript> is a subset of (N<subscript>G</subscript> ×
		  N<subscript>G</subscript>) and, thus, determines a <emphasis role="ital">binary
		  relation</emphasis> on N<subscript>G</subscript>.</para><para>Let G = (N<subscript>G</subscript>, A<subscript>G</subscript>) be a
		  digraph, and b, c, and d, nodes in N<subscript>G</subscript>.</para><para>Iff (b, c) ∈ A<subscript>G</subscript>, we say that c is a
		  <emphasis role="ital">child</emphasis> (or <emphasis role="ital">direct
		  child</emphasis>) <emphasis role="ital">of</emphasis> b, and that b is a
		  <emphasis role="ital">parent of</emphasis> c. The fact that c is a child of b
		  is noted b →<subscript>G</subscript> c.</para><para>Iff both b →<subscript>G</subscript> c and b
		  →<subscript>G</subscript> d, and c ≠ d, we say that c and d are
		  <emphasis role="ital">siblings</emphasis>.</para><para>We define the <emphasis role="ital">reachability</emphasis> (or
		  <emphasis role="ital">dominance</emphasis>) relation of G, noted
		  ⇒<subscript>G</subscript>, as the transitive closure of
		  A<subscript>G</subscript>. Node c is said to be <emphasis role="ital">reachable</emphasis> from, or a <emphasis role="ital">descendant</emphasis> of, or <emphasis role="ital">dominated
		  by</emphasis>, node b iff b ⇒<subscript>G</subscript> c. Node b is said
		  to be an <emphasis role="ital">ancestor</emphasis> of node c iff c is a
		  descendant of b.</para><para>When G is clear from the context or irrelevant, we may use the
		  notations → and ⇒ instead of, respectively,
		  →<subscript>G</subscript> and ⇒<subscript>G</subscript>.</para><para>Iff b → c, we say that c is <emphasis role="ital">directly
		  reachable</emphasis> from b. Iff there exists some d ∈
		  N<subscript>G</subscript> such that b ⇒ d and d ⇒ c, we say that
		  c is <emphasis role="ital">indirectly reachable</emphasis> from (or an
		  <emphasis role="ital">indirect descendant</emphasis> of) b.</para><para>The <emphasis role="ital">internal nodes</emphasis> of a digraph
		  are the nodes that have at least one child; its <emphasis role="ital">leaves</emphasis> are the nodes without any child; its
		  <emphasis role="ital">roots</emphasis> are the nodes that are not a child of
		  any other node.</para><para>A <emphasis role="ital">cycle-free</emphasis> or
		  <emphasis role="ital">acyclic</emphasis> digraph is one in which no node is its
		  own descendant. An acyclic digraph is also called a <emphasis role="ital">directed acyclic graph</emphasis> (DAG).</para></section><section xml:id="orderrel" xreflabel="2.2"><title>Strict partial orders</title><para>Note that, when G is a DAG, ⇒<subscript>G</subscript> is a
		  transitive, antireflexive (and hence antisymmetric) binary relation on
		  N<subscript>G</subscript>. Such relations are called <emphasis role="ital">strict</emphasis> (or <emphasis role="ital">antireflexive</emphasis>, or sometimes <emphasis role="ital">irreflexive</emphasis>) partial orders on
		  N<subscript>G</subscript>. Thus, for every DAG G,
		  ⇒<subscript>G</subscript> is a strict partial order on
		  N<subscript>G</subscript>.</para><para>When R is a strict partial order on N<subscript>G</subscript>, we
		  say that b and c are R-<emphasis role="ital">ordered</emphasis> (or R-<emphasis role="ital">comparable</emphasis>) iff either (b, c) ∈ R or (c, b)
		  ∈ R. Otherwise, we say that they are R-<emphasis role="ital">unordered</emphasis> (or R-<emphasis role="ital">incomparable</emphasis>). We say R is <emphasis role="ital">total</emphasis> (or a <emphasis role="ital">strict total
		  order</emphasis>) on N<subscript>G</subscript> iff for all b, c ∈
		  N<subscript>G</subscript>, b and c are R-comparable, unless b = c.</para><para>For any binary relation R, the fact that (b, c) ∈ R is noted
		  b R c.</para></section></section><section><title>Graphs and TexMECS documents</title><section xreflabel="2.3" xml:id="nodags"><title>General strategy</title><para>To define a correspondence between graphs and TexMECS documents,
		  one option would have been to assign textual labels to the nodes of a graph,
		  and define a serialization algorithm to collect the labels and produce a
		  well-formed TexMECS document. However, we preferred the (essentially
		  equivalent) avenue of requiring an isomorphic mapping between the nodes of a
		  graph and the set of <emphasis role="ital">ranges</emphasis> (defined below)
		  that correspond to the various parts of a well-formed TexMECS document. We
		  chose that approach for the following reasons:</para><orderedlist><listitem><para>While a serialization algorithm can be defined in such a way
				that it always yields a well-formed TexMECS document from a DAG, there exist
				DAGs for which the element-containment structure of the serialized document
				does not reproduce the parent-child relationships of the graph. This is in fact
				a direct corollary of our Theorem 1. Thus, the very notion of correspondence
				based on a serialization algorithm would be contingent on the structural
				constraints of Theorem 1.</para></listitem><listitem><para>The approach better shields our proofs from the superficial
				details of the serialization algorithm, and thus, from the “lexical
				sugar” of the markup language.</para></listitem></orderedlist></section><section xreflabel="2.3"><title>Node-ordered DAGs (noDAGs)</title><para>We now define a type of graph that allows ordering siblings.
		  Intuitively, this ordering specifies the desired order of appearance of the
		  corresponding elements in the serialized document.</para><para>The definition also makes it possible to specify an order among
		  nodes that are not siblings. As noted earlier, one of our results (Theorem 1)
		  entails that, for overlap-only documents, the order among siblings entirely and
		  uniquely determines the order of appearance of <emphasis role="ital">all</emphasis> the elements in the document, and that, thus, the
		  extra ordering capability of the graph model does not increase its
		  expressivity. In future work, we plan to investigate other markup formalisms
		  for which the extra ordering capability might increase expressivity.</para><para><emphasis role="bold">Definition:</emphasis> Let G be a DAG, and
		  R<subscript>G</subscript> a strict partial order on N<subscript>G</subscript>.
		  We say that (G, R<subscript>G</subscript>) is a <emphasis role="ital">node-ordered DAG</emphasis> (noDAG) iff R<subscript>G</subscript>
		  is such that, for all b and c ∈ N<subscript>G</subscript>, the following
		  holds:</para><blockquote><para> if ((b and c are siblings) or (b and c are distinct roots)),
			 then (b R<subscript>G</subscript> c or c R<subscript>G</subscript> b)</para></blockquote><para>that is, siblings are totally R<subscript>G</subscript>-ordered
		  relative to each other, and so are distinct roots. Some other pairs of nodes
		  may be R<subscript>G</subscript>-ordered, but it is not necessary.</para><para><emphasis role="bold">Definition:</emphasis> Let
		  (G, R<subscript>G</subscript>) be a noDAG. We define the <emphasis role="ital">minimal ordering</emphasis> of (G, R<subscript>G</subscript>),
		  noted &lt;<subscript>G</subscript>, as follows:</para><blockquote><para> &lt;<subscript>G</subscript> = {(b, c) ∈
			 R<subscript>G</subscript> | (b and c are siblings) or (b and c are distinct
			 roots)}</para></blockquote><para>Note that &lt;<subscript>G</subscript> is a (not necessarily
		  proper) subset of R<subscript>G</subscript>. It is also a strict partial order,
		  and it contains exactly those pairs from R<subscript>G</subscript> necessary to
		  totally order siblings and distinct roots.</para><para>All the examples of noDAGs in this paper have
		  R<subscript>G</subscript> = &lt;<subscript>G</subscript>. The graphs given as
		  examples earlier in the paper are all noDAGs, with the (minimal) ordering of
		  siblings represented by left-to-right disposition of the arrows going out of
		  the parent node. Unless otherwise stated, the (minimal) ordering of siblings is
		  always represented in that way in our examples.</para><para><emphasis role="bold">Notation:</emphasis> In the remainder of this
		  paper, if (G, R<subscript>G</subscript>) is a noDAG, the notation G may be used
		  as a shorthand standing for (G, R<subscript>G</subscript>), unless it would
		  cause ambiguity.</para><para><emphasis role="bold">Definitions:</emphasis> Let G be a noDAG, and
		  b, c, d, and e stand for nodes in N<subscript>G</subscript>. Node c is called
		  the <emphasis role="ital">first</emphasis> (or <emphasis role="ital">leftmost</emphasis>) child of b iff b → c and for no other
		  child d of b is it the case that d &lt;<subscript>G</subscript> c. Conversely,
		  c is called the <emphasis role="ital">last</emphasis> (or <emphasis role="ital">rightmost</emphasis>) child of b iff b → c and for no other
		  child d of b is it the case that c &lt;<subscript>G</subscript> d. We say
		  c and d are <emphasis role="ital">consecutive siblings</emphasis> iff there
		  exists b such that b → c, b → d, and for no other child e of
		  b is it the case that c &lt;<subscript>G</subscript> e
		  &lt;<subscript>G</subscript> d.</para></section><section><title>Ranges</title><para>A range corresponds intuitively to a stretch of consecutive
		  character positions within some document or character string. Ranges and the
		  associated relations can be defined in various ways; the following definitions
		  suffice for our purposes.</para><para>A <emphasis role="ital">range</emphasis> is a pair of integers (x,
		  y), where 1 ≤ x ≤ y.</para><para>The <emphasis role="ital">start-point</emphasis> of a range r = (x,
		  y) is x, and is noted start(r); its <emphasis role="ital">end-point</emphasis>
		  is y, and it is noted end(r).</para><para>Intuitively, start(r) is the character position in the document
		  where r starts, and end(r) is <emphasis role="ital">one more</emphasis> than
		  the position where it ends. So, for all integers x ≥ 1, (x, x) is what
		  we call an <emphasis role="ital">empty</emphasis> range.</para><para>Range r is said to <emphasis role="ital">properly
		  contain</emphasis> range s iff start(r) &lt; start(s) and end(s) &lt;
		  end(r).</para><para>Range r is said to <emphasis role="ital">precede</emphasis> range s
		  iff start(r) &lt; start(s).</para><para>Note that both the proper containment and the precedence relations
		  among ranges are strict partial orders.</para><para>Let S be a set of ranges, and r, s ∈ S. We say that r
		  <emphasis role="ital">directly contains</emphasis> s <emphasis role="ital">with
		  respect to</emphasis> (wrt) S iff the two following statements hold:</para><blockquote><para>1. r properly contains s</para><para>2. no t ∈ S is such that r properly contains t and t
			 properly contains s</para></blockquote><para>When the set S is clear from context, we may drop the “with
		  respect to S” part.</para><para>We say S has <emphasis role="ital">distinct boundaries</emphasis>
		  iff no two distinct ranges in S have the same end-point; that is, iff for all
		  r, s ∈ S, unless (r = s), end(r) ≠ end(s).</para></section><section><title>Correspondence between a noDAG and a set of ranges</title><para><emphasis role="bold">Definition:</emphasis> Let S be a set of
		  ranges. We say that S <emphasis role="ital">corresponds</emphasis> to some
		  noDAG G (or, conversely, that G corresponds to S) iff there exists a bijective
		  mapping g between N<subscript>G</subscript> and S, such that all of the
		  following conditions hold:</para><blockquote><para>1. (for all b, c ∈ N<subscript>G</subscript>) [(b
			 →<subscript>G</subscript> c) iff g(b) directly contains g(c) wrt
			 S]</para><para>2. (for all b, c ∈ N<subscript>G</subscript>) [if (b
			 R<subscript>G</subscript> c), then start(g(b)) &lt; start(g(c))]</para></blockquote><para>We then say that G and S correspond to each other
		  <emphasis role="ital">through</emphasis> g.</para><para>Condition (1) means that the direct containment relationships among
		  the ranges in S correspond exactly to the direct dominance relationships among
		  nodes in G. Condition (2) means that whenever R<subscript>G</subscript>
		  specifies an order between two nodes, that order corresponds to the precedence
		  relation on ranges.</para><para><emphasis role="bold">Notation:</emphasis> For any b ∈
		  N<subscript>G</subscript>, b' will denote g(b). Conversely, for any r ∈
		  S, r' will denote g<superscript>-1</superscript>(r).</para><para><emphasis role="bold">Observation:</emphasis> It is not too
		  difficult to show that if a noDAG and a set of ranges correspond to each other,
		  then they do so through a <emphasis role="ital">unique</emphasis>
		  bijection.</para></section><section xml:id="texmecs" xreflabel="the section on TexMECS"><title>TexMECS</title><para>Huitfeldt and Sperberg-McQueen introduce TexMECS as “a
		  markup language (or, more precisely, a markup meta-language or family of markup
		  languages) intended for experimental work in dealing with complex
		  documents”
		  <citation linkend="HS2003">HS2003, p. 1</citation>. The main
		  differentiating characteristic of TexMECS, relative to XML, is that it allows
		  elements to overlap, rather than requiring them to nest in a strictly
		  hierarchical way.</para><para>We present below a small portion of TexMECS, called
		  <emphasis role="ital">overlap-only TexMECS</emphasis>, which includes just
		  enough of the full language to allow overlapping elements. Of interest to
		  readers familiar with the complete language, here are the features of TexMECS
		  excluded from overlap-only TexMECS:</para><itemizedlist><listitem><para>virtual elements;</para></listitem><listitem><para> interrupted elements;</para></listitem><listitem><para>empty elements;</para></listitem><listitem><para>attribute specifications;</para></listitem><listitem><para>entity references;</para></listitem><listitem><para>generic identifier co-indexing (for handling
				self-overlap);</para></listitem><listitem><para>unordered contents;</para></listitem><listitem><para>comments.</para></listitem></itemizedlist><para>Please bear in mind that TexMECS is an experimental language, and
		  may evolve in the future.</para><para>All our results are stated—and proved—relative to
		  overlap-only TexMECS, but clearly, they would hold <emphasis role="ital">mutatis mutandis</emphasis> for any equivalent formalism. At least
		  at first glance, there is no reason to believe that throwing in all but the first
		  two features from the above list would cause any major problem (aside from
		  adding nitty-gritty details to the proofs). Our proofs do not apply—and,
		  in fact, we strongly suspect the results do not hold (although we propose no
		  formal proof of this)—if we add interrupted elements and/or virtual
		  elements.</para><section><title>Overlap-only TexMECS</title><para>The following are adapted or taken from Huitfeldt and
			 Sperberg-McQueen
			 <citation linkend="HS2003">HS2003</citation>.</para><para>Start-tags are of the form <code>&lt;a|</code> and end-tags of
			 the form <code>|a&gt;</code>, where <code>a</code> stands for any acceptable
			 XML generic identifier. Two tags (start- or end-) are said to
			 <emphasis role="ital">match</emphasis> iff they bear the same generic
			 identifier.</para><para>The <emphasis role="ital">depth</emphasis> of any tag T in a
			 document is defined as the difference:</para><blockquote><para>(number of start-tags matching T and occurring at or before T
				in the document)</para><para>  − (number of end-tags matching T and occurring before
				T in the document)</para></blockquote><para>A character string is said to be a <emphasis role="ital">well-formed overlap-only TexMECS document</emphasis> iff all of the
			 following conditions hold:</para><orderedlist><listitem><para>There are equal numbers of start- and end-tags.</para></listitem><listitem><para>No tag has depth 0.</para></listitem><listitem><para>The document starts with a tag and ends with a tag.</para></listitem></orderedlist><para>Note that, under those definitions, the following are
			 well-formed:</para><itemizedlist><listitem><para><code>&lt;A|a&lt;B|b|A&gt;c|B&gt;</code></para></listitem><listitem><para><code>&lt;A|a|A&gt;abc&lt;B|b|B&gt;</code></para></listitem><listitem><para><code>&lt;A||A&gt;&lt;B||B&gt;</code></para></listitem></itemizedlist><para>but not the following:</para><itemizedlist><listitem><para><code>abc</code></para></listitem><listitem><para><code>a&lt;A||A&gt;</code></para></listitem><listitem><para><code>&lt;A||A&gt;a</code></para></listitem></itemizedlist><para>A start-tag S and another tag E are said to <emphasis role="ital">correspond</emphasis> (or be <emphasis role="ital">paired</emphasis>) to each other iff E is the tag closest to S
			 among those that (1) occur after S in the document; (2) match S; and (3)
			 have the same depth as S. It can be shown that, in well-formed documents, every
			 tag corresponds to exactly one other tag, and that corresponding tags have
			 opposite polarities (one is a start-tag and the other is an end-tag).</para></section></section><section><title>Correspondence between a noDAG and a TexMECS document</title><para><emphasis role="bold">Definition:</emphasis> Let D be a well-formed
		  overlap-only TexMECS document. We define the <emphasis role="ital">set of
		  ranges associated to</emphasis> D, noted ranges(D), as the set containing
		  exactly the following ranges:</para><blockquote><para>1. For each start-tag in D, a range going from the first
			 character of the start-tag up to (and including) the last character of the
			 corresponding end-tag.</para><para>2. For each start-tag in D not followed immediately by another
			 start-tag, a range going from the character immediately following the start-tag
			 up to (and excluding) the first character of the next upcoming tag (start- or
			 end-).</para><para>3. For each end-tag in D not followed immediately by another tag,
			 a range going from the character immediately following the end-tag up to (and
			 excluding) the first character of the next upcoming tag (start- or
			 end-).</para></blockquote><para>Note that (2) will introduce an empty range for each start-tag
		  immediately followed by an end-tag, and that no other configuration of tags
		  will introduce an empty range.</para><para><emphasis role="bold">Definition:</emphasis> We say that a
		  well-formed overlap-only TexMECS document D <emphasis role="ital">corresponds</emphasis> to some noDAG G (or, conversely, that G
		  corresponds to D) through some mapping g iff ranges(D) corresponds to G through
		  g.</para><para><emphasis role="bold">Example:</emphasis> Let D be the following
		  well-formed TexMECS document (character positions given underneath, for ease of
		  reference):</para><blockquote><para>
			 <programlisting xml:space="preserve">&lt;A|&lt;B|x&lt;C||B&gt;y|C&gt;|A&gt;
000000000111111111122
123456789012345678901</programlisting></para></blockquote><para>Then, ranges(D) will contain six ranges:</para><orderedlist><listitem><para>(1,21) corresponding to element <code>A</code>;</para></listitem><listitem><para>(4,14) corresponding to element <code>B</code>;</para></listitem><listitem><para>(8,18) corresponding to element <code>C</code>;</para></listitem><listitem><para>(7,8) corresponding to character <code>x</code>;</para></listitem><listitem><para>(11,11) corresponding to the empty string between
				<code>&lt;C|</code> and <code>|B&gt;</code>;</para></listitem><listitem><para>(14,15) corresponding to character <code>y</code>.</para></listitem></orderedlist><para>The reader can readily verify that the set ranges(D) corresponds to
		  the following noDAG (through the bijection implicitly defined
		  above):</para><mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-004.png"/></imageobject></mediaobject><para><emphasis role="bold">Observation 1:</emphasis> For all well-formed
		  overlap-only TexMECS documents D, the set ranges(D) has distinct boundaries.
		  This follows simply from our definition of ranges(D) and the fact that TexMECS
		  end-tags have positive length. Note that no two ranges in ranges(D) have the
		  same start-point either, but, because of the way our various definitions are
		  set up, we do not need this in our proofs.</para></section><section><title>Empty strings and spurious overlap</title><para>Intuitively, the ranges we associate with a document are the
		  various “locations” in the document where its contents are
		  situated. We have chosen to consider that the empty string is
		  “content” only when located between a start-tag and an adjacent
		  following end-tag. This seems to us the most natural point of view.</para><para>However, one can view things differently; for example, it could be
		  considered that the empty string between <emphasis role="ital">any</emphasis>
		  two adjacent tags is content. Looking at things this way, the document in the
		  above example would have two more ranges: (4,4) and (18,18), and accordingly,
		  the corresponding noDAG would have two more nodes.</para><para>Why would one want to do that? One answer is that it allows
		  discussing <emphasis role="ital">spurious overlap</emphasis> in terms of
		  graphs. Sperberg-McQueen and Huitfeldt
		  <citation linkend="SH2004">SH2004</citation> define spurious overlap as
		  the use of overlapping elements in places where properly-nesting elements would
		  have adequately represented an “equivalent” structure. For
		  example, the above document contains spurious overlap, because flipping the
		  <code>&lt;C|</code> and <code>|B&gt;</code> tags would eliminate the overlap,
		  yet only change the hierarchical positioning of an empty string. Another tag
		  configuration considered spurious overlap is illustrated in the following
		  document:</para><blockquote><para>
			 <programlisting xml:space="preserve">&lt;A|&lt;B|xyz|A&gt;|B&gt;</programlisting></para></blockquote><para>Here, flipping either the start- or end-tags would result in
		  properly-nesting elements, while preserving the fact that all three data
		  characters are within both an <code>A</code> element and a <code>B</code>
		  element.</para><para>To define spurious overlap precisely in terms of graphs,
		  Sperberg-McQueen and Huitfeldt naturally take the approach that there is empty
		  content between any two adjacent tags, then define appropriate conditions on
		  graphs that correspond to spurious overlap.</para><para>We are convinced that, in general, spurious overlap can have
		  distinctive meaningful semantics, and thus, should not be forbidden.
		  Nevertheless, we recognize it can be useful to detect it, for example, as a
		  helper function for authors, or when it is known to be unintentional
		  <citation linkend="SH2004">SH2004</citation>. With our current
		  definition of ranges(D) and the ensuing graph representation of a document, the
		  discussion of spurious overlap found in
		  <citation linkend="SH2004">SH2004</citation> cannot be held directly.
		  To change this, we would simply need to define ranges(D) so that an empty range
		  is introduced for any two adjacent tags. Our results would still hold, except
		  that the statement of Theorem 1 would need to include the additional condition
		  that the first and last child of any internal node are both leaves (they could
		  be the same leaf).<footnote><para>The proofs, however, would need to be slightly more complex,
				because the set ranges(D) would not necessarily have distinct boundaries any
				more.</para></footnote></para></section></section><section><title>Results</title><section><title>Completion-acyclic noDAGs</title><para>We now introduce a class of noDAGs that is central in our
		  characterization.</para><para>Let G be a noDAG, and b, c, and d, nodes in
		  N<subscript>G</subscript>.</para><para>We define two relations, derived from G, which we call collectively
		  the <emphasis role="ital">completions</emphasis> of G. One is the
		  <emphasis role="ital">starts-before-</emphasis> (or SB-) <emphasis role="ital">completion</emphasis> of G, and is noted SB(G); the other is the
		  <emphasis role="ital">ends-after-</emphasis> (or EA-) <emphasis role="ital">completion</emphasis> of G, and is noted EA(G). The rationale for
		  those names will become clear later.</para><para><emphasis role="bold">Definition:</emphasis> Relation SB(G) is the
		  transitive closure of the union of the following sets:</para><blockquote><para>1. A<subscript>G</subscript></para><para>2. &lt;<subscript>G</subscript></para><para>3. {(b, c) | (∃d)[(d ⇒ b) &amp; (d
			 &lt;<subscript>G</subscript> c)] &amp; c ⇏ b}</para></blockquote><para><emphasis role="bold">Notation:</emphasis> For convenience, the
		  third set above will be noted CSB(G).</para><para><emphasis role="bold">Definition:</emphasis> Relation EA(G) is the
		  transitive closure of the union of the following sets, where
		  &gt;<subscript>G</subscript> denotes the <emphasis role="ital">inverse</emphasis> relation of &lt;<subscript>G</subscript>:</para><blockquote><para>1. A<subscript>G</subscript></para><para>2. &gt;<subscript>G</subscript></para><para>3. {(b, c) | (∃d)[(d ⇒ b) &amp; (d
			 &gt;<subscript>G</subscript> c)] &amp; c ⇏ b}</para></blockquote><para><emphasis role="bold">Notation:</emphasis> For convenience, the
		  third set above will be noted CEA(G).</para><para>Note that the only difference between the two preceding definitions
		  is that, for EA(G), &gt;<subscript>G</subscript> is used instead of
		  &lt;<subscript>G</subscript>.</para><para>As will be shown in the examples below, the completions of a noDAG
		  are not necessarily cycle-free, when seen as arcs linking the nodes of G. The
		  case where they are is, however, of particular interest, so we define the
		  following.</para><para><emphasis role="bold">Definition:</emphasis> A noDAG G is said to
		  be <emphasis role="ital">completion-acyclic</emphasis> iff both digraphs
		  (N<subscript>G</subscript>, SB(G)) and (N<subscript>G</subscript>, EA(G)) are
		  acyclic.</para></section><section><title>Examples</title><para>In the upcoming examples (as in the earlier ones), the ordering of
		  siblings is represented by left-to-right disposition of the arrows going out of
		  the parent node. In the completions, this ordering (or its inverse) appears
		  explicitly as added arcs.</para><para>The arcs added to form the completions are shown in red, but
		  <emphasis role="bold">most added arcs obtainable by transitivity are not
		  shown</emphasis>, to increase readability. For each example, we first show the
		  original graph, then the starts-before completion, then the ends-after
		  completion.</para><para>The first two examples use the same graph structures as earlier
		  examples.</para><para>
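The two completions are straightforward to compute for small graphs. As a minimal sketch (ours, not part of the formal development), the following Python code builds SB(G) and EA(G) for the noDAG read off the six ranges of the example document given earlier. The node names, the function names, and the encoding of a noDAG as sets of pairs are assumptions made for illustration only; the node <code>empty</code> stands for the empty range (11,11).</para><blockquote><para>
<programlisting xml:space="preserve">
```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (source, target) pairs."""
    closure = set(pairs)
    while True:
        step = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if step.issubset(closure):
            return closure
        closure |= step


def completion(arcs, order):
    """Close arcs, order and the third set of the definition under transitivity.

    Pass the sibling order for SB(G) and its inverse for EA(G).
    """
    dominates = transitive_closure(arcs)  # proper reachability
    third = {(b, c)
             for (d, b) in dominates
             for (d2, c) in order
             if d == d2 and (c, b) not in dominates}
    return transitive_closure(set(arcs) | set(order) | third)


def is_acyclic(closed_relation):
    """A transitively closed relation has a cycle iff it relates a node to itself."""
    return all(a != b for (a, b) in closed_relation)


def in_order(nodes, closed_relation):
    """List the nodes of a strict total order, first to last."""
    def successors(n):
        return sum(1 for (a, _) in closed_relation if a == n)
    return sorted(nodes, key=successors, reverse=True)


# Assumed encoding of the noDAG of the earlier example document.
nodes = ["A", "B", "C", "x", "empty", "y"]
arcs = {("A", "B"), ("A", "C"), ("B", "x"), ("B", "empty"),
        ("C", "empty"), ("C", "y")}
siblings = {("B", "C"), ("x", "empty"), ("empty", "y")}
inverse = {(c, b) for (b, c) in siblings}

sb = completion(arcs, siblings)
ea = completion(arcs, inverse)
print(is_acyclic(sb), is_acyclic(ea))
print(in_order(nodes, sb))
print(in_order(nodes, ea))
```
</programlisting></para></blockquote><para>Under these assumptions, both completions come out acyclic, and the two orders printed agree, respectively, with increasing start-points and decreasing end-points of the ranges listed for that example.</para><para>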
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-005.png"/></imageobject><caption><para>Example 1: Original graph EX1</para></caption></mediaobject>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-006.png"/></imageobject><caption><para>Example 1: SB(EX1)</para></caption></mediaobject>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-007.png"/></imageobject><caption><para>Example 1: EA(EX1)</para></caption></mediaobject> </para><para>Notice how, for example, in SB(EX1), arc EC has been added because
		  (B &lt;<subscript>G</subscript> C) &amp; (B ⇒ E) &amp; (C ⇏ E),
		  and in EA(EX1), arc GB has been added because (C &gt;<subscript>G</subscript>
		  B) &amp; (C ⇒ G) &amp; (B ⇏ G).</para><para>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-008.png"/></imageobject><caption><para>Example 2: Original graph EX2</para></caption></mediaobject>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-009.png"/></imageobject><caption><para>Example 2: SB(EX2)</para></caption></mediaobject>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-010.png"/></imageobject><caption><para>Example 2: EA(EX2)</para></caption></mediaobject> </para><para>Notice how, in SB(EX2), the added arcs CD and FC form the cycle
		  CDFC.</para><para>The third example shows a noDAG for which the SB-completion has no
		  cycle, but the EA-completion has one.</para><para>
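The same situation can be reproduced on a small hypothetical graph, which is not claimed to be the graph EX3 of the figures. In the sketch below, a root <code>R</code> has children <code>P</code> and <code>Q</code> (in that order), <code>P</code> has children <code>u</code> and <code>s</code>, and <code>Q</code> has children <code>t</code> and <code>s</code>. The encoding and all function names are assumptions made for illustration. Since a transitive closure has a cycle exactly when the relation it closes has one, it suffices to test the union of the three generating sets, here with <code>graphlib</code> from the Python standard library (version 3.9 or later).</para><blockquote><para>
<programlisting xml:space="preserve">
```python
from graphlib import CycleError, TopologicalSorter


def descendants(arcs, node):
    """Nodes properly dominated by `node` (reachable through one or more arcs)."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for (parent, child) in arcs:
            if parent == current and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def generators(arcs, order):
    """Pairs whose transitive closure is a completion of the graph.

    Pass the sibling order for SB(G) and its inverse for EA(G).
    """
    third = {(b, c)
             for (d, c) in order
             for b in descendants(arcs, d)
             if b not in descendants(arcs, c)}
    return set(arcs) | set(order) | third


def has_cycle(nodes, pairs):
    """True iff the digraph (nodes, pairs) contains a cycle."""
    sorter = TopologicalSorter({n: set() for n in nodes})
    for (before, after) in pairs:
        sorter.add(after, before)  # `before` is a predecessor of `after`
    try:
        sorter.prepare()
    except CycleError:
        return True
    return False


# Hypothetical noDAG: node s is shared by P and Q.
nodes = ["R", "P", "Q", "u", "s", "t"]
arcs = {("R", "P"), ("R", "Q"), ("P", "u"), ("P", "s"),
        ("Q", "t"), ("Q", "s")}
siblings = {("P", "Q"), ("u", "s"), ("t", "s")}
inverse = {(c, b) for (b, c) in siblings}

print(has_cycle(nodes, generators(arcs, siblings)))  # SB side
print(has_cycle(nodes, generators(arcs, inverse)))   # EA side
```
</programlisting></para></blockquote><para>For this graph, the SB side is cycle-free, whereas the EA side contains the cycle formed by the added pair (t, P), the arc from P to s, and the inverse sibling pair (s, t).</para><para>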
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-011.png"/></imageobject><caption><para>Example 3: Original graph EX3</para></caption></mediaobject>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-012.png"/></imageobject><caption><para>Example 3: SB(EX3)</para></caption></mediaobject>
		  <mediaobject><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-013.png"/></imageobject><caption><para>Example 3: EA(EX3)</para></caption></mediaobject> </para><para>Notice how, in EA(EX3), the added arcs ED and DB form the cycle
		  BEDB.</para><para>Since both completions of EX1 are cycle-free, EX1 is
		  completion-acyclic. Because at least one completion of EX2 and EX3 has a cycle,
		  neither EX2 nor EX3 is completion-acyclic.</para></section><section><title>Main results</title><para>Before we get to our main results, we need to establish preliminary
		  results.</para><para><emphasis role="bold">Lemma 1:</emphasis> Let G be a noDAG to which
		  some set of ranges corresponds. Then, no node in G is both directly and
		  indirectly dominated by some other node.</para><para><emphasis role="ital">Proof</emphasis>. Immediate by the
		  definitions of correspondence and of direct containment.</para><para> <emphasis role="bold">QED (Lemma 1)</emphasis> </para><para><emphasis role="bold">Corollary 1 to Lemma 1:</emphasis> Let G be a
		  noDAG to which some set of ranges S corresponds. Then, for all b, c ∈
		  N<subscript>G</subscript>, if (b &lt;<subscript>G</subscript> c), then b
		  ⇏<subscript>G</subscript> c.</para><para><emphasis role="ital">Proof</emphasis>. By the definition of
		  &lt;<subscript>G</subscript>, if (b &lt;<subscript>G</subscript> c), then
		  either b and c are both roots, or they are siblings. If c is a root, then,
		  clearly b ⇏<subscript>G</subscript> c. If b and c are siblings, then (b
		  ⇒<subscript>G</subscript> c) would contradict the lemma.</para><para> <emphasis role="bold">QED (Corollary 1 to Lemma 1)</emphasis>
		  </para><para><emphasis role="bold">Corollary 2 to Lemma 1:</emphasis> Let G be a
		  noDAG to which corresponds some set of ranges with distinct boundaries S
		  through g. Then, for all b, c ∈ N<subscript>G</subscript>, if (b
		  &lt;<subscript>G</subscript> c), then end(b') &lt; end(c').</para><para><emphasis role="ital">Proof</emphasis>. By Corollary 1, b
		  ⇏<subscript>G</subscript> c. By the definitions of a noDAG and of
		  correspondence, we know that start(b') &lt; start(c'). But end(c') &lt; end(b')
		  would imply b ⇒<subscript>G</subscript> c. So, we must conclude that
		  end(b') ≤ end(c'). Since S has distinct boundaries, we conclude that
		  end(b') &lt; end(c').</para><para> <emphasis role="bold">QED (Corollary 2 to Lemma 1)</emphasis>
		  </para><para><emphasis role="bold">Lemma 2:</emphasis> Let G be a noDAG to which
		  corresponds some set of ranges with distinct boundaries S through g. Then, for
		  all r, s ∈ N<subscript>G</subscript>, the following two conditions
		  hold:</para><blockquote><para>1- if (r, s) ∈ SB(G), then start(r') &lt; start(s')</para><para>2- if (r, s) ∈ EA(G), then end(r') &gt; end(s')</para></blockquote><para><emphasis role="ital">Proof</emphasis>. We will prove the lemma
		  only for SB(G); the proof for EA(G) is entirely similar (but makes use of
		  Corollary 2 to Lemma 1).</para><para>By the definition of starts-before-completion, we know that:</para><para> SB(G) = transitive-closure(A<subscript>G</subscript> ∪
		  &lt;<subscript>G</subscript> ∪ CSB(G)).</para><para>We will prove that (1) above holds for each arc in
		  (A<subscript>G</subscript> ∪ &lt;<subscript>G</subscript> ∪
		  CSB(G)). Then, by transitivity of &lt; on integers, the lemma will
		  follow.</para><para>For all arcs (r, s) ∈ (A<subscript>G</subscript> ∪
		  &lt;<subscript>G</subscript>), it is clear from the appropriate definitions that
		  start(r') &lt; start(s').<!--Note: even for A_G, I do not need a stronger notion of distinct boundaries.--> It remains to treat the case where (r, s) ∈
		  CSB(G). We prove this by contradiction.</para><para>Suppose that (r, s) ∈ CSB(G) and start(s') ≤
		  start(r'). By the definition of CSB(G), there must exist d such that
		  (d ⇒ r) &amp; (d &lt;<subscript>G</subscript> s) &amp; s ⇏
		  r.</para><para>From (d &lt;<subscript>G</subscript> s), by Corollary 1 to Lemma 1,
		  we conclude that d ⇏ s.</para><para>Since, by hypothesis, start(s') ≤ start(r'), and because s
		  ⇏ r, we conclude that end(s') ≤ end(r'). Because d ⇒ r, we
		  know that end(r') ≤ end(d'). Thus, end(s') ≤ end(d'). But d
		  &lt;<subscript>G</subscript> s implies that start(d') &lt; start(s'). Thus, we
		  conclude that d ⇒ s, a contradiction.</para><para>We must thus reject the hypothesis that start(s') ≤
		  start(r'), and conclude that, for all arcs (r, s) ∈ CSB(G), start(r')
		  &lt; start(s').</para><para> <emphasis role="bold">QED (Lemma 2)</emphasis></para><para><emphasis role="bold">Lemma 3:</emphasis> Let G be a finite noDAG.
		  Then, for all distinct b, c ∈ N<subscript>G</subscript>, both of the
		  following hold:</para><para>a) At least one of the pairs (b, c) and (c, b) is in SB(G).</para><para>b) At least one of the pairs (b, c) and (c, b) is in EA(G).</para><para><emphasis role="ital">Proof</emphasis>. Here again, we will prove
		  the lemma only for SB(G), the proof for EA(G) being entirely similar.</para><para>Let G be a finite noDAG, and S = SB(G). The proof is simpler if G has only
		  one root, and it generalizes easily (though somewhat tediously) to the cases
		  where G has more than one root. So, we will treat only the case where G has a
		  unique root.</para><para>Let distinct b, c ∈ N<subscript>G</subscript>. If either b
		  ⇒<subscript>G</subscript> c or c ⇒<subscript>G</subscript> b,
		  then, by definition, one of (b, c) and (c, b) is in S. It remains to
		  treat the case where b and c are
		  (⇒<subscript>G</subscript>)-incomparable. Suppose they are, and let m be
		  a (⇒<subscript>G</subscript>)-minimal-upper-bound (mub) of b and
		  c.<footnote xml:id="mub"><para>A (⇒<subscript>G</subscript>)-minimal-upper-bound (mub)
				of b and c is any node m such that (m ⇒<subscript>G</subscript> b) and
				(m ⇒<subscript>G</subscript> c), and such that, for all descendants e of
				m, (e ⇏ b) or (e ⇏ c). Since we assumed G is finite and has a
				single root, a simple argument shows that at least one
				(⇒<subscript>G</subscript>)-mub exists for all b and
				c.</para></footnote></para><para>Suppose, without loss of generality, that neither b nor c is a
		  child of m (if either is, similar—and actually simpler—arguments
		  lead to the same conclusions). Then, by the definition of mub, and because b
		  and c are (⇒<subscript>G</subscript>)-incomparable, there must exist e,
		  f ∈ N<subscript>G</subscript> such that (m → e ⇒ b) &amp;
		  (m → f ⇒ c) &amp; (f ⇏ b) &amp; (e ⇏ c). Since e
		  and f are siblings, they are &lt;<subscript>G</subscript>-comparable. If, on
		  the one hand, e &lt;<subscript>G</subscript> f, then, we have:</para><blockquote><para> (e ⇒ b) &amp; (e &lt;<subscript>G</subscript> f) &amp; (f
			 ⇏ b)</para></blockquote><para>and, thus, by construction of S, (b, f) ∈ S, and, by
		  transitivity, (b, c) ∈ S. If, on the other hand, f
		  &lt;<subscript>G</subscript> e, then, we have:</para><blockquote><para> (f ⇒ c) &amp; (f &lt;<subscript>G</subscript> e) &amp; (e
			 ⇏ c)</para></blockquote><para>and, thus, by construction of S, (c, e) ∈ S, and, by
		  transitivity, (c, b) ∈ S.</para><para><emphasis role="bold">QED (Lemma 3)</emphasis></para><para><emphasis role="bold">Corollary to Lemma 3:</emphasis> Both
		  completions of a finite completion-acyclic noDAG G are strict total orders on
		  N<subscript>G</subscript>.</para><para><emphasis role="ital">Proof</emphasis>. By the definition, both
		  completions of a completion-acyclic noDAG are cycle-free. Thus, it is immediate
		  that they are antireflexive and antisymmetric. Since they are obtained by
		  transitive closure, they are transitive. It follows from the lemma that they
		  are total.</para><para> <emphasis role="bold">QED (Corollary to Lemma 3)</emphasis></para><para><emphasis role="bold">Lemma 4:</emphasis> For all
		  completion-acyclic noDAGs G, SB(G) ∩ EA(G) =
		  (⇒<subscript>G</subscript>).</para><para><emphasis role="ital">Proof</emphasis>. Let G be a
		  completion-acyclic noDAG. By the definitions of the respective completions, it
		  suffices to show that no pair (b, c) can be both in CSB(G) and CEA(G).</para><para>We prove this by contradiction: suppose there exists a pair (b, c)
		  that is a member of both sets. Since (b, c) is in CSB(G), then it is also in
		  SB(G), by definition. Because it is in CEA(G), we know that c ⇏ b, and
		  that there exists d such that c &lt;<subscript>G</subscript> d and d
		  ⇒<subscript>G</subscript> b. From this and the definition of SB(G), we
		  conclude that the pair (c, b) is in SB(G). So, both (b, c) and (c, b) are in
		  SB(G). Hence, SB(G) contains a cycle, which contradicts the hypothesis that G
		  is completion-acyclic.</para><para> <emphasis role="bold">QED (Lemma 4)</emphasis></para><para><emphasis role="bold">Lemma 5:</emphasis> Let G be a finite
		  completion-acyclic noDAG. For all b, c ∈ N<subscript>G</subscript>, if
		  (b, c) ∈ SB(G) but b ⇏ c, then any node dominated by b and not by
		  c precedes, in SB(G)-order, c and any node dominated by c. Similarly, if (b,
		  c) ∈ EA(G) but b ⇏ c, then any node dominated by b and not by c
		  precedes, in EA(G)-order, c and any node dominated by c.</para><para><emphasis role="ital">Proof.</emphasis> Let G be a finite
		  completion-acyclic noDAG. First observe that, by Corollary to Lemma 3 and
		  Lemma 4, for all distinct, ⇒-incomparable b, c ∈
		  N<subscript>G</subscript>, if (b, c) ∈ SB(G), then (c, b) ∈
		  EA(G), and vice-versa.</para><para> Again, we prove the lemma only for SB, the proof for EA being
		  entirely similar. Let b and c be as in the statement of the lemma, and suppose
		  d and e are such that b ⇒ d, c ⇏ d, c ⇒ e. Note that d
		  ⇏ c. Suppose, for the sake of contradiction, that (c, d) ∈ SB(G).
		  Then, by the preceding observation, we have (d, c) ∈ EA(G), (c, b)
		  ∈ EA(G), and, because b ⇒ d, (b, d) ∈ EA(G). Thus, there
		  is a cycle in EA(G), contradicting the fact that G is completion-acyclic. So,
		  by Corollary to Lemma 3, we must conclude that (d, c) ∈ SB(G), and thus,
		  by transitivity, that (d, e) ∈ SB(G).</para><para> <emphasis role="bold">QED (Lemma 5)</emphasis></para><para><emphasis role="bold">Theorem 1:</emphasis> Let (G,
		  R<subscript>G</subscript>) be a noDAG. Then, there exists a well-formed
		  overlap-only TexMECS document corresponding to G iff all of the following
		  hold:</para><blockquote><para>1- N<subscript>G</subscript> is finite and non-empty.</para><para>2- No node of G is both directly and indirectly reachable from
			 some other node.</para><para>3- G is completion-acyclic.</para><para>4- R<subscript>G</subscript> is a subset of SB(G).</para><para>5- No two leaves are consecutive in both SB(G)-order and reverse
			 EA(G)-order.</para><para>6- Neither the first nor the last root of G (in
			 &lt;<subscript>G</subscript> order) is a leaf.</para></blockquote><para>Note that point (4) asserts that the ordering among nodes specified
		  by R<subscript>G</subscript> − (&lt;<subscript>G</subscript>) can only
		  confirm what is already implicit in &lt;<subscript>G</subscript>. Points (5)
		  and (6) may seem mysterious at first, but simply translate idiosyncratic
		  morphological contingencies of well-formed overlap-only TexMECS. An alternate
		  formulation of point (5) is that no two consecutive siblings are leaves with
		  the same set of parents.</para><para><emphasis role="ital">Proof.</emphasis></para><para>(⇒) Let (G, R<subscript>G</subscript>) be a noDAG to which
		  corresponds a well-formed overlap-only TexMECS document D. We show in
		  succession that G satisfies conditions (1) to (6) in the statement of the
		  theorem.</para><para>(1): Follows immediately from the various definitions.</para><para>(2): Follows immediately from Lemma 1.</para><para>(3): Follows from Lemma 2 (which, by Observation 1, we know is
		  applicable) because, if either SB(G) or EA(G) had a cycle, it would imply that,
		  for some r ∈ ranges(D), start(r) &lt; start(r) or end(r) &lt;
		  end(r).</para><para>(4): By (1) and (3), G is finite and completion-acyclic. Thus,
		  Corollary to Lemma 3 applies, and we conclude that SB(G) is a strict total
		  order on N<subscript>G</subscript>. Since SB(G) is total, any pair (r, s) that
		  is not <emphasis role="ital">in</emphasis> it can only contradict it. Suppose,
		  then, there exists a pair (r, s) ∈ R<subscript>G</subscript> −
		  SB(G). By the preceding argument, we conclude that (s, r) ∈ SB(G). So,
		  by Lemma 2, start(s') &lt; start(r'). However, by the definition of
		  correspondence, (r, s) ∈ R<subscript>G</subscript> implies that
		  start(r') &lt; start(s'), a contradiction.</para><para>(5): <emphasis role="ital">Proof sketch.</emphasis> Suppose, for
		  the sake of contradiction, that there exist two leaves v and w in G consecutive
		  in both SB(G)-order and reverse EA(G)-order. Suppose, without loss of
		  generality, that v and w are not roots. Then, it is not hard to see that v' and
		  w' must be both directly contained (wrt ranges(D)) in some other range, without
		  any range starting and/or ending between them. But, by construction of
		  ranges(D), this is impossible.</para><para>(6): We treat only the case of the first root; the proof for the
		  last root is entirely similar. It is easy to show that, in a completion-acyclic
		  noDAG, the first node in SB-order is always the first root. So, if r is the
		  first root of G, in &lt;<subscript>G</subscript> order, then r is the first
		  node in SB(G)-order. By Lemma 2 and the construction of ranges(D), then,
		  start(r') = 1. Because D is well-formed, it must start with a tag; thus, r is
		  not a leaf.</para><para>(⇐) <emphasis role="ital">Proof sketch.</emphasis> Let (G,
		  R<subscript>G</subscript>) be a noDAG satisfying conditions (1) to (6) in the
		  statement of the theorem. We construct a well-formed overlap-only TexMECS
		  document D and show it corresponds to G.</para><para>We sketch informally the construction of D.</para><para>We start by assigning an arbitrary, distinct, generic identifier to
		  each internal node of G. Then, on each side (left and right) of the node, we
		  stick TexMECS tags derived from the generic identifier: a start-tag on the
		  left, and an end-tag on the right; for example “&lt;QC|” on the
		  left and “|QC&gt;” on the right.</para><para>Then, we let those tags slide down from their node, following the
		  arrow linking the node to its first child (for a start-tag) and to its last
		  child (for an end-tag). When a tag arrives at a new internal node, it continues
		  sliding down, using the same arrow position (first or last) as initially, all
		  the way down, until it ends up on a leaf. When it does, it goes to the left of
		  the leaf (if it is a start-tag), or to its right (if it is an end-tag), and
		  stays there.</para><para>After both tags of all internal nodes have slid down to a leaf, we
		  order the start-tags on the left of each leaf in SB(G) order
		  <emphasis role="ital">of the node they come from</emphasis>. Then, we order the
		  end-tags on the right of each leaf in <emphasis role="ital">reversed</emphasis>
		  EA(G) order of the node they come from.</para><para>Then, for each leaf, we create a character string consisting of the
		  concatenation of all the start-tags to its left, in order, then, the character
		  “A”, then all the end-tags to its right, again in order.</para><para>Finally, we pick up and concatenate the strings created for the
		  leaves, in SB(G) order. That is our document D.</para><para>The mapping g between ranges(D) and N<subscript>G</subscript> is
		  the one that maps to each internal node the range stretching from one to the
		  other of (and including) the two tags that originate from it, and to each leaf
		  the character “A” that it contributed to D.</para><para>To see that D is a well-formed overlap-only TexMECS document,
		  observe that: (1) there are equal numbers of start- and end-tags; (2) all
		  generic identifiers are distinct, and thus, the fact (following from Lemma 5)
		  that any start-tag occurs to the left of the matching end-tag suffices to
		  guarantee that no tag has depth 0; and (3) the document starts and ends with a
		  tag.</para><para>We now sketch a proof that the set ranges(D) corresponds to G
		  through g, i.e., that the following statements both hold:</para><blockquote><para>i. (for all b, c ∈ N<subscript>G</subscript>) [(b
			 →<subscript>G</subscript> c) iff (b' directly contains c') wrt
			 ranges(D)]</para><para>ii. (for all b, c ∈ N<subscript>G</subscript>) [if (b
			 R<subscript>G</subscript> c), then start(b') &lt; start(c')]</para></blockquote><para>To prove (i), we will show that proper containment on ranges(D)
		  corresponds to reachability on G. Since, by hypothesis, G contains only direct
		  reachability relationships, (i) will follow.</para><para>Suppose first that b ⇒<subscript>G</subscript> c. Then, by
		  construction of D, Lemma 5, and the fact that b precedes c in SB(G)-order,
		  start(b') &lt; start(c'). Likewise, end(c') &lt; end(b'). Thus, b' properly
		  contains c'.</para><para>Now suppose for some r and s in ranges(D), r properly contains s.
		  By construction of D, and Lemma 5, it can be shown that
		  r' ⇒<subscript>G</subscript> s'.</para><para>To prove (ii), we will show that for all (b, c) ∈ SB(G),
		  start(b') &lt; start(c'). Since, by hypothesis, R<subscript>G</subscript> is a
		  subset of SB(G), (ii) will follow. By transitivity of &lt; on integers, it
		  suffices to show that for all (b, c) ∈ (A<subscript>G</subscript>
		  ∪ &lt;<subscript>G</subscript> ∪ CSB(G)), start(b') &lt;
		  start(c'). For each of the three sets constituent of the union, the desired
		  result follows from Lemma 5.</para><para> <emphasis role="bold">QED (Theorem 1)</emphasis> </para><para><emphasis role="bold">Theorem 2:</emphasis> Every well-formed
		  overlap-only TexMECS document corresponds to some noDAG satisfying all the
		  conditions in the statement of Theorem 1.</para><para><emphasis role="ital">Proof sketch</emphasis>. From D, any
		  well-formed overlap-only TexMECS document, first compute ranges(D), then the
		  direct containment relationships between the members of ranges(D). Construct a
		  noDAG (G, R<subscript>G</subscript>) as follows:</para><blockquote><para>1. N<subscript>G</subscript> = ranges(D)</para><para>2. For all r, s ∈ N<subscript>G</subscript>, r → s iff r directly contains
			 s</para><para>3. For all r, s ∈ N<subscript>G</subscript>, r R<subscript>G</subscript> s iff
			 start(r) &lt; start(s)</para></blockquote><para>It is easily seen that (G, R<subscript>G</subscript>) corresponds
		  to D through the identity function. Note that (G, R<subscript>G</subscript>)
		  necessarily satisfies all the conditions in the statement of Theorem 1, since
		  it corresponds to a well-formed, overlap-only TexMECS document.</para><para> <emphasis role="bold">QED (Theorem 2)</emphasis> </para></section><section><title>Serializability and leaves</title><para>Defining restrictions on DAGs by requiring the existence of some
		  special ordering of the leaves, or by restricting which nodes dominate which
		  leaves, or both, is quite natural. For example, with a slightly different data
		  structure than noDAGs (GODDAGs), Sperberg-McQueen and Huitfeldt
		  <xref linkend="SH2004"/> define interesting restrictions through criteria
		  involving the sets of leaves dominated by internal nodes.</para><para>The following example, however, shows that no criteria involving
		  only the leaves can guarantee serializability for noDAGs in general. We exhibit
		  (Figure <xref linkend="one-leaf"/>) a noDAG with only one leaf, which is not
		  completion-acyclic and, thus, by Theorem 1, not serializable. As can be seen in
		  Figure <xref linkend="one-leaf-SB"/>, its SB-completion has the cycle CDFC
		  (here again, added arcs obtainable by transitivity are not shown).</para><para><mediaobject xml:id="one-leaf" xreflabel="9"><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-014.png"/></imageobject><caption><para>Figure 9. Non-completion-acyclic noDAG with only one
				  leaf.</para></caption></mediaobject></para><para><mediaobject xreflabel="10" xml:id="one-leaf-SB"><imageobject><imagedata format="png" fileref="../../../vol1/graphics/Marcoux01/Marcoux01-015.png"/></imageobject><caption><para>Figure 10. SB-completion of the noDAG in Figure
				  <xref linkend="one-leaf"/>.</para></caption></mediaobject></para></section></section><section><title>Applications</title><para>Perhaps the most important application of this characterization is to
		allow the precise definition of a DOM (document object model) for overlapping
		markup documents, and the inclusion of serializability verification functions
		in software products (editors, etc.) based on a DOM representation of
		documents.</para><para>Among other things, it would allow a graph-interface document editor
		to determine whether or not some node-oriented operation attempted by an author
		(node creation or displacement) preserves the serializability of the
		graph.</para><para>Another application area is the <emphasis role="ital">validation</emphasis> of documents. Validation is usually defined
		on the serialized form of documents. However, in some cases, validation on the
		graph (or DOM) structure is preferable. To be able to define validation
		mechanisms on the <emphasis role="ital">graphs</emphasis> of documents, it is
		important to know precisely the structural properties of exactly those graphs
		that the validation mechanism must be able to handle. The kind of
		characterization we provide here is a useful and precise description of the
		exact class of graphs that all validation mechanisms must be able to deal
		with.</para></section><section><title>Conclusion and future work</title><para>In this article, we established a necessary and sufficient condition
		for a node-ordered graph (noDAG) to correspond to the structure of an
		overlapping markup document, such as a well-formed TexMECS document that uses
		only overlapping markup (and no interrupted or virtual elements). That
		characterization provides a criterion which graph-based (e.g., DOM-based)
		applications can test to determine if a graph can be serialized into a document
		using only overlapping markup. Future work includes establishing the complexity
		of testing the criterion and finding optimal algorithms to do so.</para><para>A consequence of that characterization is that, for overlap-only
		markup, no expressivity is gained by letting authors order the nodes
		arbitrarily, beyond specifying the parent-child relationships and
		ordering siblings: those two together determine entirely and uniquely the
		serialized document.</para><para>We also showed that <emphasis role="ital">all</emphasis> well-formed
		overlapping-markup documents can be obtained by serializing some graph that
		satisfies the criterion. Thus, an editor offering only a graph-based interface
		(and no plain-text view), and allowing only the creation of graphs that satisfy
		the criterion, would still be complete, in that it would permit the creation of
		<emphasis role="ital">any</emphasis> well-formed overlapping-markup
		document.</para><para>A corollary to our proofs is that round-tripping between noDAGs and
		TexMECS documents is possible. This means that algorithms can be given to
		transform any TexMECS document into a (labelled) noDAG, and any serializable
		(labelled) noDAG into a TexMECS document, in such a way that applying the two
		transformations one after the other will give essentially the original object.
		Developing efficient algorithms to do this, along with appropriate definitions of
		canonical forms, is something we want to investigate in the future.</para><para>We also believe that somewhat relaxing the conditions in Theorem 1
		could characterize variants of overlap-only TexMECS, for example, a variant in
		which overlapping markup is prohibited, but interrupted elements are allowed.
		This is also on the future work agenda.</para><para>Finally (and this is actually what brought us to overlapping markup
		in the first place), we would like to investigate what sort of
		“natural” intertextual semantics might be defined for overlapping
		markup.</para></section><section><title>Acknowledgements</title><para>The author thanks Claus Huitfeldt and Lars Johnsen for exciting and
		fruitful discussions, as well as for useful comments on drafts of this
		paper.</para></section><bibliography><title>References</title><bibliomixed xml:id="HS2003">Claus Huitfeldt and C. M. Sperberg-McQueen.
		“TexMECS: An experimental markup meta-language for complex
		documents.” (2003)
		<link xlink:href="http://decentius.aksis.uib.no/mlcd/2003/Papers/texmecs.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://decentius.aksis.uib.no/mlcd/2003/Papers/texmecs.html</link></bibliomixed><bibliomixed xml:id="J2004">H. V. Jagadish, Laks V. S. Lakshmanan, Monica
		Scannapieco, Divesh Srivastava, and Nuwee Wiwatwattana. “Colorful XML:
		One hierarchy isn't enough.” <emphasis role="ital">Proceedings of the
		2004 ACM SIGMOD International Conference on Management of Data.</emphasis>
		(2004)
		<biblioid class="doi">10.1145/1007568.1007598</biblioid></bibliomixed><bibliomixed xml:id="S2006">C. M. Sperberg-McQueen. “Rabbit/duck
		grammars: a validation method for overlapping structures.”
		<emphasis role="ital">Proceedings of the 2006 Extreme Markup Languages
		conference.</emphasis> (2006)
		<link xlink:href="http://www.idealliance.org/papers/extreme/proceedings/html/2006/SperbergMcQueen01/EML2006SperbergMcQueen01.html" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.idealliance.org/papers/extreme/proceedings/html/2006/SperbergMcQueen01/EML2006SperbergMcQueen01.html</link></bibliomixed><bibliomixed xml:id="SH2004">C. M. Sperberg-McQueen and Claus Huitfeldt.
		“GODDAG: A Data Structure for Overlapping Hierarchies.” In: P.
		King and E.V. Munson (Eds.): <emphasis role="ital">DDEP-PODDP 2000</emphasis>,
		Lecture Notes in Computer Science 2023, Springer-Verlag, pp. 139-160.
		(2004)</bibliomixed><bibliomixed xml:id="TEI">Text Encoding Initiative Consortium.
		<emphasis role="ital">TEI Guidelines for Electronic Text Encoding and
		Interchange.</emphasis>
		<link xlink:href="http://www.tei-c.org/Guidelines/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/Guidelines/</link></bibliomixed><bibliomixed xml:id="LMNL">Jeni Tennison, Gavin Thomas Nicol, and Wendell
		Piez. <emphasis role="ital">LMNL (Layered Markup and Annotation Language)
		Tutorial.</emphasis>
		<link xlink:href="http://lmnl.net/prose/tutorial/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://lmnl.net/prose/tutorial/</link></bibliomixed></bibliography></article>
