The XML tree paradigm has several well-known limitations for document modeling and
processing, some of which have received a lot of attention (especially overlap; see the
overviews in Sperberg-McQueen and Huitfeldt 2000 and DeRose 2004) and
some of which have received less (e.g., discontinuity, simultaneity, transposition, white
space as crypto-overlap). Many of these have work-arounds, also well known, that—as is
implicit in the term
work-around—have disadvantages, but because they get the
job done and because XML has a large user community with diverse levels of technological
expertise, it is difficult to overcome inertia and move to a technology that might offer a
more comprehensive fit with the full range of document structures with which researchers need
to interact both intellectually and programmatically. Proceeding from a high-level view of why
XML has the limitations it has, this presentation explores how an alternative model of Text as
Graph (TAG) might address these types of structures and tasks in a more natural and idiomatic
way than is available within an XML paradigm.
From an informatic perspective all documents are structured, including those that are
traditionally identified as plain text. Some of the structural properties of plain-text
documents are expressed through formatting conventions, such as the use of blank lines to
separate paragraphs, or of indentation to mark the beginning of a paragraph, or of centering
to mark a header. The sequence of words in a text, delimited in a complex way that involves
white space, punctuation, and other symbols, constitutes, on a certain level, an implicit
organizational tier above the sequence of characters. The conventions at work in plain text do not formally, completely, unambiguously,
or in a wholly standardized way differentiate the content of a document from the coded
representation of its structure (or, perhaps more accurately, structures), which problematizes
using plain text for document processing (whether for data mining, publication, or other
purposes). The challenges this poses have come to be addressed by representing the structural
properties of a document not through plain-text characters (which might be considered
pseudo-markup), but through formal, standardized markup, such as XML.
The XML data model is an ordered tree, or, more precisely, a rooted and ordered directed
acyclic graph that prohibits multiple parentage, which in the document-processing community
has come to be understood as representing an Ordered Hierarchy of Content Objects (OHCO). It
is well known that the OHCO model works reasonably well for describing structures that consist
of single ordered hierarchies, such as the exhaustive tesselated division of a novel into
chapters and the chapters into paragraphs, but it is not well suited to modeling structures
that cannot be represented fully by a single tree. The markup community has focused intensively on overlapping hierarchies as a
challenge to the OHCO model, and with good reason, but we argue below that overlap is only one manifestation of
a higher-level problem, and this perspective has implications for deciding how best to
overcome it. If overlap were the problem, projecting multiple
trees over the content might solve it (e.g., through the SGML CONCUR feature), as might the adoption of a model that permits but does not require hierarchy,
including multiple hierarchies, such as the range model exemplified by LMNL. But if the problem is that a tree is inadequate for higher-level reasons that are
only partially exemplified by overlap, we might have more success if we address the issue at
that higher level. The Text As Graph model that we introduce below is not intended to be a
the overlap problem in XML; it is built around a fresh
consideration of the textual structures, both latent and overt, that a data model will need to
be able to represent. It is nonetheless not accidental that TAG agrees in some respects with
XML, in others with GODDAG or TexMECS, and in others with LMNL, since all of these
specifications have sought, in partially converging ways, to model text structure.
Below we identify specific situations that pose problems for an OHCO perspective. With
respect to hierarchy, in addition to overlap, where text may have multiple overlapping
hierarchies, text may not be hierarchical at all. We also identify situations where text is
not ordered, as well as those where XML creates artifactual content objects that do not
clearly correspond to what a human would consider a textual content object. In other words,
TAG seeks to interrogate and address the O, the H, and the CO of OHCO, and not only the
well-known multiple-hierarchy challenge.
Challenges for text modeling
In this section we illustrate several types of textual structures that have proven awkward
for XML because they contradict or otherwise are not part of the OHCO tree model. For each we
provide an abstract description of the problem, of one or more XML workarounds, and their
GODDAG, TexMECS, and LMNL counterparts (as appropriate), illustrated with examples drawn from
use cases in Digital Humanities research projects.
The challenge to text modeling in XML that has attracted the most attention is overlap.
For example, notice in the image below how the phrase
Two vast and trunkless legs of
stone Stand in the desart begins in the middle of line 2 and ends in the middle of
line 3, an absence of synchronicity between verse lines and sentences that is called
Figure 2: Percy Bysshe Shelley,
Piez’s illustration is actually of LMNL ranges, rather than of XML element trees. The
same structure might be visualized as independent overlapping trees as follows, where cyan
represents the tree of metrical lines and green represents the tree of linguistic
Figure 3: Percy Bysshe Shelley,
Projecting two independent trees over a common set of words. In XML the individual
words would not be children of
elements, and would instead be grouped inside Text node children of those
Because it is not possible to represent the preceding structure in XML markup, the
following pseudo-XML is not well-formed:
<line><phrase>Who said —</phrase> <phrase>“Two vast and trunkless legs of stone</line>
<line>Stand in the desart….</phrase> <phrase>Near them,</phrase> <phrase>on the sand</phrase></line>
New XML users often misunderstand the prohibition against overlap as a prohibition
against overlapping tags, but if that were the entire
issue, it could be remedied by simply removing the syntactic prohibition. But the rule about
tags exists because tags must represent a tree, hierarchy in a tree prohibits multiple
parentage, and overlap would permit a node to have more than one parent. Overlap is possible
in GODDAG only incidentally because TexMECS permits overlapping tags; at a higher level it
is because GODDAG permits Text nodes to have multiple parents and TexMECS serializes the
GODDAG model. LMNL sawtooth syntax may look like XML syntax with the prohibition against
overlapping tags removed, but the real difference is at the level of the data model: LMNL
ranges can overlap and XML elements cannot because the content between XML start and end tags is a
sequence of descendant nodes in a tree, and not a range of textual atoms.
TAG represents overlap naturally because the TAG counterpart to an XML element is a
directed hyperedge that associates a head Markup node with a set of tail Text nodes. To tag
a line of poetry in the example above, TAG would create a hyperedge from a Markup node with
name property value of
line (comparable to a
<line> element in XML) to a set of Text nodes (comparable to Text nodes
in XML). Sets are unordered, but because the TAG model requires
edges between Text nodes, which record the continuous order of the text stream (comparable
to the sequence of atoms in the LMNL model), the textual content of the line is fully
specified by (= can be retrieved by examining) the membership of the set of tail Text nodes
and the sequence edges between them. In the illustration below, the black arrows represent
regular edges that connect Text nodes in order, the irregular colored bounding lines
demarcate the sets of tail Text nodes, and a similarly colored arrow points into them from
their Markup node heads:
Figure 4: Percy Bysshe Shelley,
A hypergraph representing overlap of phrases and metrical lines.
Additional use cases involving overlap challenges in XML include pages vs paragraphs in
publications of novels, folios vs texts in medieval manuscripts, and speeches vs metrical
lines in drama. Overlap in poetic structures has been explored in detail in Piez 2014, which also discusses an unusual structural paradox involving
Chapter 24 of Mary Shelley’s Frankenstein. Overlap
involving word and metrical foot boundaries in poetry is discussed below.
Sperberg-McQueen and Huitfeldt 2008b offer the following paragraph from Lewis
Carroll’s Alice in Wonderland as an example of
Alice was beginning to get very tired of sitting by her sister on the bank, and of
having nothing to do: once or twice she had peeped into the book her sister was reading,
but it had no pictures or conversations in it,
and what is the use of a
book, thought Alice
without pictures or conversation?
There is no way to mark up this passage in XML without fragmenting the quotation into
two elements (and relying on semantics to stitch together the pieces in the application
layer), yet our human intuition is that there is a single quotation, and that the model,
therefore, should represent it as a single object. As Sperberg-McQueen and Huitfeldt 2008b also observe, there is a sense in
without are adjacent and a different sense in
thought are adjacent. XML syntax and the XML
tree cannot represent both of these realities simultaneously, which means that at least one
of them must be handed off to the application layer.
Sperberg-McQueen and Huitfeldt 2008b situate this type of structure in a GODDAG
context, where it intersects with the distinction between containment and dominance. Concerning LMNL,
they write that
[With respect to] the unity of discontinuous elements: such a unity may be asserted by
the application layer (that is, by the definition of a LMNL vocabulary), but it is not
visible on the LMNL level, and thus need not be accounted for at the level of LMNL
The design of LMNL thus seems to require that any account of dominance (as distinct
from containment), and any account of discontinuous elements, be handled in the
application layer. LMNL itself achieves a degree of simplicity and regularity as a result,
at the expense of complexity in the application.
Piez 2008 describes discontinuity in LMNL as modeled by the limen, where the example provided
records it through coindexed annotations. That dependency seems to locate discontinuity in
an application layer, since whether coindexed annotations represent discontinuity is not an
inherent property of the coindexing. As was noted earlier, TexMECS is capable of modeling
discontinuity directly with suspend-tags and resume-tags, but if a TexMECS document is to be
parsed into GODDAG, the fragments are then modeled as separate structural components
TAG prioritizes the representation of text structures, including discontinuity, in the
model, without dependency on application-layer semantics. The example from Alice in Wonderland described above would have the following form
Figure 5: Lewis Carroll,
Alice in Wonderland
Each hyperedge points from a Markup node head (identifying the type of text
structure) to a tail set of Text nodes. Normal edges connect the Text nodes in order. In
the TAG model, then, a discontinuous textual object is represented the same way as a
continuous one, as an ordered set of Text nodes that form the tail of a Markup-to-Text
Other examples of discontinuity involve stage directions in dramatic text, such as the
following example from George Bernard Shaw’s Mrs. Warren’s
VIVIE. Sit down: I’m not ready to go back to work yet. [Praed sits]. You both think I
have an attack of nerves. Not a bit of it. But there are two subjects I want dropped, if
you don’t mind.
One of them [to Frank] is love’s young dream in any shape or form: the other [to
Praed] is the romance and beauty of life, especially Ostend and the gaiety of Brussels.
You are welcome to any illusions you may have left on these subjects: I have none. If we
three are to remain friends, I must be treated as a woman of business, permanently single
[to Frank] and permanently unromantic [to Praed].
Here the last two stage directions interrupt not just the speech, but the
Hierarchy, containment and dominance
The challenges that have emerged from our experience of XML as a model of text involve
not only the limitations of OHCO, but also its tyranny. If text is understood as an Ordered Hierarchy of Content
Objects, are there aspects of text that are not ordered (O), are there aspects that are not
hierarchical (H, by which we mean not just that they are not mono-hierarchical, but that
they are not hierarchical at all), and does the model create content objects artifactually,
that is, where they are not perceived as inherent properties of the text being modeled (CO)?
XML requires us to model all content as both ordered and hierarchical, and it represents
content objects as elements (at least as content objects are described in DeRose et al. 1990). GODDAG and LMNL both grew out of a recognition that not all
properties of text can be modeled effectively as a single hierarchy, and their focus is not
limited to that issue, but they differ in the extent to which they interrogate features of
text that may not be hierarchical at all, that may not be ordered, and that may not involve
what a human would consider a content object.
As was mentioned earlier, the XML data model does not distinguish containment from
dominance, which Tennison explains and illustrates in LMNL terms as follows:
Containment is a happenstance relationship between ranges while dominance is one
that has a meaningful semantic. A page may happen to contain a stanza, but a poem dominates
the stanzas that it contains.[Tennison 2008]
In XML, an ancestor element both contains (starts
before and ends after in the serialization) its descendants and dominates them (is connected to them by a path that travels only downward in
the tree). In the XML view below of the Shakespearean sonnet the we used as a TAG example
<poem> element both contains and dominates three
<quatrain> elements and one
<couplet> element, and
<couplet> elements both contain and
<line>No longer mourn for me when I am dead</line>
<line>Than you shall hear the surly sullen bell</line>
<line>Give warning to the world that I am fled</line>
<line>From this vile world with vilest worms to dwell:</line>
<line>Nay, if you read this line, remember not</line>
<line>The hand that writ it, for I love you so,</line>
<line>That I in your sweet thoughts would be forgot,</line>
<line>If thinking on me then should make you woe.</line>
<line>O! if,—I say you look upon this verse,</line>
<line>When I perhaps compounded am with clay,</line>
<line>Do not so much as my poor name rehearse;</line>
<line>But let your love even with my life decay;</line>
<line>Lest the wise world should look into your moan,</line>
<line>And mock you with me after I am gone.</line>
Figure 6: William Shakespeare, Sonnet 71
The XML tree encodes containment and hierarchy in the same way.
In the earlier TAG example of this sonnet, a Markup-to-Text hyperedge defines the tail
as a set of Text nodes and labels (datatypes) it. In the TAG version of this example, all
quatrain and line Markup-to-Text hyperedges point to sets of Text nodes, and containment is
modeled by subset relations among the Text-node tails of those hyperedges. Where the Text
nodes that constitute the tail of a Markup-to-Text hyperedge with the
property (on the Markup node) of
line form a proper subset of the Text nodes
that constitute the tail of a Markup-to-Text hyperedge with the
(on the Markup node) of
quatrain, the quatrain contains the line. In this emphasis on containment, rather than dominance, TAG is similar to flat
LMNL, except that LMNL ranges must be continuous (LMNL handles discontinuity separately),
while contiguity is not relevant in defining the set of Text nodes that may serve as the
tail of a hyperedge in TAG (see the discussion of Discontinuity, above). In the TAG version
we have chosen not to make the three quatrains and the couplet what in XML terms would be
children of a root
<poem> element, but we could, should we wish, create a
Markup-to-Text hyperedge with a
name property value (on the Markup node) of
poem. This could point, through a hyperedge, to the set of all Text nodes
in the poem, which would let us model containment. It could also serve as the head of a
Markup-to-Markup hyperedge from it to the Markup nodes with
name property values, which would let us model dominance. In other words, Markup-to-Text nodes model containment, rather than dominance
(indirectly, through subset properties of the Text nodes to which they point), and where it
is important to distinguish dominance from containment, the TAG model supports this through
One final consequence of the XML conflation of containment and dominance is that when
exactly the same text must be tagged in two ways simultaneously, XML requires one of the
elements to contain the other. But, as was noted above, if a Markup node with the
name property value of
paragraph and a Markup node with the
name property value of
quotation both point to exactly the
same set of Text nodes, in TAG it does not make sense to ask whether the paragraph contains
the quotation or the quotation contains the paragraph because containment in TAG is defined
as a proper subset relationship among sets of Text nodes. Whether a paragraph consists of a
quotation or a quotation consists of a paragraph is a reasonable question, but in TAG it is
a question of dominance, expressed through Markup-to-Markup hyperedges, and not of
containment, expressed (indirectly) through Markup-to-Text hyperedges.
As we described above, in XML, markup is (among other things) a form of datatyping, and
the XML spec uses the word
type explicitly in this meaning:
Each element has a type, identified by name, sometimes called its
identifier (GI) [W3C XML §3]
This means that when XML assigns a type to part of a document by making it an
element, it simultaneously creates an element node, which pushes the textual content down a
level in the document hierarchy. Consider the following XML structure, represented here in
markup and as hierarchy:
<title><name>Romeo</name> and <name>Juliet</name></title>
Figure 7: Romeo and Juliet (XML)
In XML, the three words of the title are spread over two levels of the hierarchy.
Cyan ellipses are element nodes and pink rectangles are Text nodes.
If we wish to specify in XML that the first and third words of the title are of type
name, we can tag them as elements of that type, with the result that the
Text nodes they contain wind up on a different level of the hierarchy than the conjunction
between them. This contradicts our intuition that the title contains three words, two of which
have the type
name, replacing it with a model in which the title contains two
objects of type
name with a word between them, and it is the
name objects that contain the first and third words.
Because TAG separates the use of markup in hierarchy and its use for datatyping, it is
possible to assign a type to text without distorting the hierarchy. Here is the TAG
representation of the same content:
Figure 8: Romeo and Juliet (TAG)
In this TAG example, the markup on names specifies their type and content without
imposing a hierarchy. The title contains three text nodes, represented by pink ellipses.
Markup nodes are green (for names) and cyan (for titles), and the document node is
As illustrated in the example above, markup of Text nodes in the TAG model, unlike in
XML, does not create a hierarchical layer as a side effect of datatyping. As we have seen
earlier, it is possible to represent hierarchy in TAG, but it is not an inescapable
consequence of all markup, as it is in XML.
White space as crypto-overlap
In natural language processing, tokenization is the process of breaking up a string of
plain text characters into substrings (typically words and punctuation, which may be
adjacent or separated by white space), often while removing token separators in the process.
Tokenization of plain text in XML is performed using regular expressions and the
tokenize() function, but
tokenize() atomizes its first argument,
which means that it cannot be used on tagged text without losing the markup in the process.
Even tokenization that would not create overlap-based well-formedness violations, such as
splitting and tagging the words of a line of poetry in which the stressed vowels are tagged
<stress> (see the illustration below), requires intermediary temporary
manipulations, such as converting the markup to text, tokenizing with
tokenize(), and then converting the temporary text back into markup, or
adding additional markup, tokenizing with
<xsl:for-each-group>, and then
removing the temporary markup.
The reason tokenizing tagged text is awkward in XML even where overlap is not a risk has
one explanation in terms of the syntax and another in terms of the data model. In terms of
the syntax, the markup and text are intertwined in a way that makes it impossible to ignore
markup during tokenization while retaining access to it after the process is complete. In
terms of the data model, as noted above, tagging the stressed vowels in a line of verse
pushes their textual content down a level in the hierarchy, so the line no longer forms a
string. Furthermore, although it is not usually described this way, the use of white space
to separate words may be understood as pseudo-markup, which means that the words in tagged
text potentially represent overlapping hierarchies in plain-text disguise.
In TAG, however, Markup nodes on one layer point to Text nodes on another layer, one
that contains nothing but Text nodes, which makes it possible to tokenize the text without
interference from the markup. The tokenization splits larger Text nodes into smaller ones,
but they remain in the tail of their old Markup-to-Text hyperedges, while new Markup-to-Text
hyperedges are added to tag the new individual words. In the simplified illustrations below,
we have created a
poem that consists entirely of a single three-word line
No longer mourn). In the firt of these illustrations, the stressed vowels
are tagged but the words are not:
Figure 9: A simplified poem without word tokenization
Without word tokenization, there are five Text nodes, three of which are not marked
up individually and two of which form the tails of Markup-to-Text hyperedges with a
name property of
stress on the Markup node, indicating
that these are stressed vowels.
Because stress is marked on a single vowel sound, XML would be capable of tagging the
individual words while retaining the stress markup, since no overlap would result. For that
reason, the following XML representation, which tags both words and stressed vowels, is well
Yet if we try to use
tokenize() in a transformation to add the
<word> markup to a line that already contains the
<stress> markup, the
<stress> markup will be lost
This situation is not a challenge for TAG. In the example below, we have added
Markup-to-Text nodes to tag the words, which can be determined by tokenizing the text on
white space. Tokenization is possible because the Text nodes are not interrupted by the
markup, which points to them without being inserted between them (syntactically) and without
pushing them to different levels of the hierarchy (in the tree structure):
Figure 10: A simplified poem with word tokenization
With word tokenization, there are now seven Text nodes. Two of them still form the
tails of Markup-to-Text hyperedges with a
name property of
stress on the Markup node, indicating that these are stressed vowels.
There are now also three Markup-to-Text hyperedges with a
name property of
word on the Markup node (in orange); the first points to a single Text
node, and the others each point to three Text nodes.
The additional markup requires additional division of Text nodes, but all modifications
are local, and the only part of the graph that has to be updated is the part to which the
markup is being added.
The preceding example does not create overlap because the Text nodes that are marked up
for stress are subsets of those that are marked up as words. But if we also want to tag
poetic feet, which are needed to identify caesura (a regular coincidence of word and foot
boundaries in the lines of a poem), overlap would become an issue in XML. One work-around in
an XML environment has turned out to involve, surprisingly, tagging neither the feet nor the
words (see [Birnbaum and Thorsen 2015]), deriving both from other properties of the
line during processing, but the fact that we can use white-space pseudo-markup to escape the
consequences of syntactic overlap doesn’t mean that the the overlap isn’t there. A data
model that can represent both feet and words explicitly, and that could identify caesura as
a relationship between those two types of structural components, would represent explicitly
the human understanding of caesura, and the explicit representation of structure is much of
what markup is all about. In the illustration below, we have added foot markup to the
Figure 11: A simplified poem with word and foot markup
When we tag the feet by adding Markup-to-Text hyperedges with a
property value of
foot on the Markup node (in violet), we create overlap
between feet and words, since the word
longer is split between the ictic
second syllable of the first iamb and the non-ictic first syllable of the second.
A corresponding XML-like structure that tags words and feet would not be well formed
<foot> elements would overlap with the
[t]he identification of caesura requires the identification of both feet
and words, which are not coextensive and which frequently overlap. The challenge, then, is
to locate where foot and line boundaries coincide without employing markup in a way that
would violate well-formedness overlap constraints. [Birnbaum and Thorsen 2015] In TAG, where overlap is not an issue, caesura is possible when two adjacent Text nodes
are in the tails of different Markup-to-Text
word hyperedges and different
foot hyperedges. Caesura is typically 1) at or near the middle
of the line, and 2) implemented consistently, so not every coincidence of word and foot
boundaries proclaims a caesura; that coincidence is necessary, but not sufficient.
Scope of reference
Footnotes can be understood as annotations on text, but in XML they are typically
represented by elements at the location where the note reference should occur in a reading
text, as with the
<footnote> element in DocBook or the
<note> element in TEI. Anchoring a footnote at a point in the text
stream, instead of as an annotation on a string of (possibly tagged) text with a beginning
and an end, is problematic because it does not mark explicitly the scope of the note, such
as whether a footnote reference at the end of a paragraph points to the preceding sentence
or the preceding two sentences or more, or to the entire paragraph. The TEI
<note> element avoids this limitation because it can point to an
arbitrary target with XPointer, but this stand-off strategy is an indirect way of specifying
what might have been represented more immediately as an attribute if XML attributes were
able 1) to model rich content, and 2) to annotate something without being forced to give it
a generic identifier that specifies its type.
TAG avoids the XML prohibition against markup in attribute values because in TAG the
Text nodes of an annotation can be a target of markup, just like those of the main text. TAG
avoids the scope of reference problem because the annotation can point to a Markup node with
a name if an appropriate one exists (such as paragraph in a document that marks up
paragraphs). In the example below, because TAG permits anonymous Markup nodes (that is,
name property of Markup nodes is optional), we annotate arbitrary
text without giving it the equivalent of an XML generic identifier, although in a revision
currently under development, we are exploring pointing directly from the annotation to the
Text nodes, which would obviate the need for the anonymous Markup node. With either of these
approaches, footnote-like relationships can be modeled in TAG as what they are: rich-text
annotations on text regardless of whether the target of the annotation corresponds to a
Content Object with an identifiable type. TAG is similar to LMNL in this respect, except
that in TAG text being footnoted that is discontinuous is no different from continuous text;
it is a set of Text nodes that constitute the tail of a Markup-to-Text hyperedge.
In the simplified example below, we add a footnote to the second and third lines of a
poem by using an Annotation node (orange) to point to a Markup node (violet) that is the
head of an anonymous Markup-to-Text hyperedge, and the text of the annotation also has
markup (a sky blue Markup node with a
name property of
points to a single Text node). Neither of these features is available with attribute markup
in XML because elements must have generic identifiers (= cannot be anonymous) and attribute
values cannot contain markup. And if the footnote target happens to be something that would
create overlap in XML (e.g., if it runs from the middle of one line to the middle of another
and the lines have been tagged explicitly), XML is further encumbered by the prohibition
Figure 12: A poem with an annotation (footnote) on lines 2 and 3
The violet ellipse is an anonymous Markup node that serves as the head of a
Markup-to-Text hyperedge and points to the Text nodes for lines 2 and 3 of the poem. The
orange ellipse is an annotation on the anonymous Markup node. Annotations have their own
textual content, which is represented by a chain of Text nodes that begins with the
Annotation node, much as the main text is represented by a chain of Text nodes that
begins with the Document node. The text of an annotation may be the target of markup,
and in this example a Markup-to-Text hyperedge with a
name property of
emphasis on the Markup node points to a single Text node.
Insofar as a footnote can be considered metadata about text, the structure illustrated
above represents it as an annotation, but it does not require us to assign a type to the
target of the annotation as a side effect of referring to it, and it allows us to add markup
to the footnote text itself.
Data model versus syntax
Syntax is not necessarily the same as a data model. A data model could, at least in
principle, be serialized in multiple ways, and syntax developed to represent one data model
could be coopted to represent a different one. TAG does not at present have its own
serialization syntax, and the Alexandria Markup implementation described below can read and
write LMNL sawtooth syntax and TexMECS (parsing the results as a representation of the TAG
data model, rather than of LMNL or GODDAG), and it is intended to be able to do the same
with XML syntax.
One challenge of comparing TAG to XML, LMNL, GODDAG, and TexMECS is that TAG, like LMNL
and GODDAG, is a data model, while XML and TexMECS are defined by their syntax. Perhaps a
bit surprisingly in the context of Balisage, which describes itself as
the markup conference, our focus here is not on markup (that
is, on syntax and serialization), but on the data models that may be expressed through
markup, which means that for comparative purposes we may sometimes need to infer a data
model from a syntactic specification.
The situation is especially complicated in the case of XML because although it does not
have a data model, it also has three almost-data-models: XML DOM [W3C DOM], which is an object model and API; the XML InfoSet [W3C XML InfoSet],
which is an information model; and XDM [W3C XDM], which is a data model
for processing XML. Our inferred data model for XML for comparative purposes here includes
the seven node types specified in XDM (not the twelve of XML DOM or the eleven types of
information items of the XML InfoSet), along with the structural properties of the ordered
tree that are relevant for understanding (but not necessarily adequate for processing)
well-formed XML (e.g., attribute nodes on an element are unordered). Our aim is not to
create a data model for XML, which lies far outside the scope of this paper, but to identify
features of the way XML models text that can be used comparatively to help elucidate
features of TAG.
The fact that some of our objects of comparison are serializations and others are data
models matters because, as the etymology of the term implies, serialization is an ordered
linear expression, which is not a requirement of data models. If, for example, a paragraph
is exactly coextensive with a quotation, in XML syntax, LMNL sawtooth syntax, and TexMECS
syntax, the start tag of either the paragraph or the quotation must come first in linear
order. But in LMNL the relative order of the ranges defined by the tags is not an obligatory
part of the model, which permits two ranges to begin at the same location in the text, and
the same is true of TAG. In XML, however, one element must be the parent of the other, and
the order of the start tags reflects both containment and hierarchy. TexMECS negotiates this
issue by using different start- and end-tag delimiters to distinguish when the relative
order of the tags is informational and when it is not.
Semantics versus application level
Another challenge for text modeling involves distinguishing properties that inhere in
the structure of the text being modeled from those that depend on semantics that must be
interpreted at a higher (application) level. A failure to make this distinction may have two
types of consequences (which are really aspects of the same thing, the delegation of
information that should be part of the model to the application layer): either the
application must know that some properties of the model are not informational and are to be
ignored, or the application must know that there is information that is not represented
entirely by the model and must therefore be added during processing. If, however, the model
explicitly represents the structural properties of the text and nothing else, the
application level is freed from having to supplement the model, and can concentrate on
features that are truly application-specific.
Moving structural information out of the application layer and into the model is a
priority in the design of TAG, and here are two illustrations of the issue:
The pairing of start and end tags in XML markup is inherent in the markup itself,
and is available during parsing with no reference to semantics. In contrast, the
pairing of XML milestones that are used to simulate container tags as a work-around
for overlap (see the discussion of Trojan markup in DeRose 2004)
depends on semantics. XML applications do not need to know that regular start and end
tags delimit an element because that information is an inalienable feature of all XML
documents that is fully specified by the syntax, but they do need to know when empty
tags are being used to simulate the beginning and end of a content object and when
they are not, or which pseudo-start-tags are to be associated with which
pseudo-end-tags. A robust and efficient strategy would represent all structural
features as parts of the model itself, instead of requiring that some of them be
handled through semantic information that is available only at the application
Because XML models an ordered hierarchy, element always have order, which requires
the application layer to distinguish situations where order is semantically meaningful
from situations where it isn’t. For example, the TEI
element has the semantics of associating content objects that do not have a natural
order with respect to one another, such as an abbreviation and its expansion or an
error and its correction. How those should be rendered is the proper business of the
application layer, but the XML model requires that one option proceed or follow the
other even when the order does not represent an inherent, informational property of
the text being modeled. This has the undesirable consequence that, incorrectly (from
the perspective of what the marked-up text means), an XML processor will regard two
TEI documents as different if they differ only in the order of the children of their
<choice> elements unless the processor is given access to TEI
markup semantics. Imposing an arbitrary order as a schema enhancement (for example,
requiring that an abbreviation always precede its expansion inside a TEI
<choice> element) will avoid the problem of distinguishing when
two documents should be considered the same or different, but at the cost of making
order informational in some situations and arbitrary in others, that is, of imposing
order on something that is not inherently ordered. A more robust and efficient model
would not specify order when it must then be ignored, so that a processor will know
when order is informational and when it is not from the model, without recourse to
Concerning the first of these issues, matching up pseudo-start-tags with pseudo-end-tags
during processing does not arise in TAG not only because TAG does not at present have its
own syntactic expression (although we can represent some features of TAG by borrowing LMNL
sawtooth syntax or TexMECS), but also because the fact that TAG permits overlap makes such
workarounds unnecessary. The second issue is more challenging, and because TAG currently
models text as a single chain of Text nodes, it does not yet distinguish situations where
order is not informational. But because that is a feature of what text is (that is, because
the first O of OHCO is as much an issue as the H that follows it), it is a design
requirement that we intend to address as development continues (see Appendix B).
[Birnbaum and Thorsen 2015] Birnbaum, David J.,
and Elise Thorsen.
Markup and meter: Using XML tools to teach a computer to think about
versification. Presented at Balisage: The Markup Conference 2015, Washington, DC,
August 11–14, 2015. In Proceedings of Balisage: The Markup Conference
2015. Balisage Series on Markup Technologies, vol. 15 (2015). doi:10.4242/BalisageVol15.Birnbaum01. https://www.balisage.net/Proceedings/vol15/html/Birnbaum01/BalisageVol15-Birnbaum01.html
[Bleeker 2017] Bleeker, Elli.
invention in writing: digital infrastructure and the role of the genetic editor. PhD
dissertation, University of Antwerp, 2017.
[Coombs et al. 1987] Coombs, James H., Allen H.
Renear, and Steven J. DeRose.
Markup systems and the future of scholarly text
Communications of the association for computing machinery,
30.11 (Nov. 1987): 933–47. doi:10.1145/32206.32209
[DeRose 2004] DeRose, Steven J.
overlap: a review and a horse. Proceedings of Extreme Markup Languages 2004.
[DeRose et al. 1990] DeRose, Steven J., David G.
Durand, Elli Mylonas, and Allen H. Renear.
What is text, really?, Journal of computing in higher education, 1.2 (1990): 3–26.
[Hilbert, Schonefeld, and Witt 2005] Hilbert,
Mirco, Oliver Schonefeld, and Andreas Witt.
Making CONCUR work.
Proceedings of Extreme Markup Languages 2005.
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt,
Claus and C. Michael Sperberg-McQueen.
TexMECS. An experimental markup meta-language
for complex documents. Revision of 5 October 2003.
[Ide and Suderman 2007] Ide, Nancy and Keith Suderman.
GrAF: a graph-based format for linguistic annotations. Proceedings of the
Linguistic Annotation Workshop, held in conjunction with ACL 2007, Prague, June 28–29, 1–8.
[LMNL data model] LMNLWiki.
model. From the Lost Archives of LMNL.
[Peroni et al. 2014] Peroni, Silvio, Francesco Poggi
and Fabio Vitali.
Overlapproaches in documents: a definitive classification (in OWL,
2!). Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 -
8, 2014. In Proceedings of Balisage: The Markup Conference 2014.
Balisage Series on Markup Technologies, vol. 13 (2014). doi:10.4242/BalisageVol13.Peroni01.
[Piez 2008] Piez, Wendell.
LMNL in miniature.
An introduction. Amsterdam Goddag Workshop, 1–5 December 2008.
[Piez 2010] Piez, Wendell.
markup. An architectural outline. Presentation at Digital Humanities 2010, King’s
College, London. http://piez.org/wendell/papers/dh2010/. The screen shot in this
paper is taken from
[Piez 2014] Piez, Wendell.
range space: From LMNL to OHCO. Presented at Balisage: The
Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup
Technologies, vol. 13 (2014). doi:10.4242/BalisageVol13.Piez01.
[Renear, Mylonas, and Durand 1996] Renear, Allen H.,
Elli Mylonas, and David G. Durand.
Refining our notion of what text really is: the
problem of overlapping hierarchies.
Research in humanities computing, ed. Nancy Ide and Susan
Hockey. Oxford: Oxford University Press. 1996.
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M., and Claus Huitfeldt.
Containment and dominance in Goddag
structures. Talk at conference on
Processing Text-Technological Resources, Center for
interdisciplinary research, University of Bielefeld, March 2008. http://cmsmcq.com/2008/bielefeld/slides.html#(1)
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C. M. and Claus Huitfeldt.
GODDAG: a data structure for overlapping
Digital documents: systems and principles: 8th international conference
on digital documents and electronic publishing, DDEP 2000, 5th international workshop on the
principles of digital document processing, PODDP 2000, Munich, Germany, September 13–15,
2000, revised papers, ed. Peter King and Ethan V. Munson. NY: Springer, 2004,
139–60. A revised version is available at
[Sperberg-McQueen and Huitfeldt 2008a] Sperberg-McQueen, C. M., and Claus Huitfeldt.
Containment and dominance in Goddag
structures. Presented at Processing text-technological resources, Bielefeld, March
13-15, 2008, organized by the Zentrum für interdisziplinäre Forschung der Universität
Bielefeld. Slides (but not full text) available on the Web at
[Sperberg-McQueen and Huitfeldt 2008b] Sperberg-McQueen, C. M. and Claus Huitfeldt.
Markup Discontinued: Discontinuity in
TexMecs, Goddag structures, and rabbit/duck grammars.
Presented at Balisage: The Markup Conference 2008, Montréal, Canada,
August 12 - 15, 2008. In Proceedings of Balisage: The
Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008).
[Tennison 2008] Tennison, Jeni.
containment and dominance. Jeni’s musings,
[TEI Genetic editions] TEI WG-GE.
An encoding model
for genetic editions.
[W3C DOM] W3C.
What is the Document Object
Document Object Model (DOM). Level 2 Core Specification. Version
[W3C XML] W3C. Extensible Markup
Language (XML) 1.0 (fifth edition).
[W3C XML InfoSet] W3C. XML
Information Set (second edition).
[W3C XDM] W3C. XQuery and XPath
Data Model 3.1.