How to cite this paper
A tag algebra for document markup
Balisage: The Markup Conference 2011
August 2 - 5, 2011
Digital documents: Representational forms and operations
When computers were first used for purposes of text processing, documents were typically
represented as linear sequences of characters. This format lends reasonable support
operations on documents, such as visual presentation, editing, retrieval, content
linguistic and quantitative analysis, and various forms of transformation.
However, the objects of such operations are not always most naturally thought of,
easily identified, in terms of character sequences. Some operations are performed
structural partitions of documents such as sections, paragraphs, or sentences. Objects
in terms of their thematic, linguistic, rhetorical or other properties, such as names,
words, high-lighted phrases etc. may also be the target of operations to be performed.
Sometimes such objects can be identified at least indirectly in terms of properties
character sequences, but seldom reliably, and often not at all.
Document markup can be seen as a response to these problems: when candidate objects
operations are explicitly marked, text processing applications are relieved of the
identifying them, and developers of such applications may concentrate on the operations
performed rather than on the identification of the objects to be processed. Thus,
documents provide improved support for text processing, compared to documents without
Therefore, it is perhaps only natural that with the introduction of document markup,
number and kinds of document objects that became subject of interest for processing
Markup languages and tools were developed not only for general purpose document publication
and analysis, but also for more specialized purposes and document types.
For a number of reasons, the markup languages which became most widely used, and the
grammars which ultimately acquired the status of international standards for those
such as SGML and XML, were based on context-free grammars. One obvious reason was that representations
based on context-free grammars were processable according to well known and computationally
efficient algorithms. Over the years a plethora of general as well as special-purpose
for XML document processing has become available.
Such markup languages invite an abstract view of document structure and content,
the "artifacts" of markup in the document in its serial form are represented in a
conspicuous way. Many or most XML processors do not actually perform operations on
document in its serialized form, but only (or largely) on a model of the document,
the form of a labeled directed acyclic graph. The graph is constructed from the serialized
form of the document, and reserialized only for purposes of storage and exchange.
and the same graph may be represented in the form of more than one serialization,
to be transferred on the graph (or, in XML terms, the document object model), rather
serial form of the document.
The Overlap Problem
The coincidence of the two factors mentioned above (i.e. the increasing interest
marking up more and more aspects of documents, and the fact that available systems
on context-free grammars) lead to a number of closely related problems which are now
referred to as "the overlap problem". On the one hand, the obvious advantages of markup lead to interest in marking up
wide variety of aspects of document structure, often necessitating different segmentations
one and the same document. On the other hand, standard markup systems did not provide
straight-forward and natural way of representing such competing or conflicting segmentations.
So although marked up documents provide better support for a broad range of
processing purposes than documents without markup, the introduction of markup also
lead to a
number of problems one did not have much means or motivation to identify or formulate
the introduction of markup.
The so-called overlap problem was identified fairly early, and the literature of proposed
solutions goes more than 30 years back in time. The responses to the problem may be said to fall largely in four groups: 1)
Alternate linear forms, 2) Alternate document models, 3) Stand-off markup, and 4)
Alternate linear forms
Since the syntax of XML enforces a strict hierarchical nesting of XML elements, it
seems like a tempting solution simply to lift or soften this restriction. A large
such proposals have been made.
Some notable examples are SGML CONCUR [Goldfarb 1990], XCONCUR
[Schonefeld 2007], MECS [Huitfeldt 1998],
TexMecs [Huitfeldt and Sperberg-McQueen 2003], and LMNL [Tennison and Piez 2002].
The advantages of most of these proposals is that they provide what for human readers
seems like a more natural and easily readable linear representation of overlapping
than is provided by alternative proposals adhering to XML syntax. They may also provide
better support for certain operations on overlapping elements. However, for the most
is unclear how to perform anything like a full range of standard operations on their
constructs, and there are unresolved issues of processability, expressive power etc.
Alternate document models
Some proposals provide alternatives to what is commonly perceived as the XML document
model, i.e. tree-shaped graphs like XDM or XML DOM, either in combination with or
independently of proposals for alternative linear forms.
Some notable examples are LMNL [Tennison and Piez 2002], Goddag
[Sperberg-McQueen and Huitfeldt 2000], multi-colored trees [Jagadish et al. 2004], and Multi-Document Graphs [Schmidt and Colomb 2009].
Although these solutions in general provide better support for at least certain
operations on overlapping elements, the question of how to perform other standard
is mostly unclear or not systematically addressed. For some of the proposals it is
unclear whether, or to what extent, it is possible to support roundtripping between
proposed document model and its form, or even what such roundtripping would amount
is the case, for example, concerning the relation between Goddag and TexMecs.) Again,
are in general unresolved issues of processability, expressive power etc.
Quite early on, it was suggested that many or most of the problems related to overlap
can be seen as a result of the general limitations of embedded markup. [Raymond et al. 1992, Raymond et al. 1995] Thus, the idea
of separating the markup from the document to be marked up, in the form of so-called
"stand-off" markup has received considerable attention and adoption.
Some notable examples are xStandoff [Stührenberg and Jettka 2009],
Nite [Carletta et al. 2005], Earmark [Di Iorio et al. 2009], and (as an interesting limiting case) Multix [Chatti et al. 2007]. TEI [TEI P5] also contains recommendation for stand-off
Stand-off markup does indeed provide potentially very strong expressive power and
seem to be the one sweeping solution to all problems related to overlap. And perhaps
it is. We note, however, that standoff markup either has been used, or could be used,
in combination with any or all of the linear formats and document models referred
Thus, all the unclarities that pertain to those, and to the relationships between
pertain also to stand-off markup.
Since there is such a multitude of proposals for alternate linear forms and document
models, since they all have their pros and cons, and since no universal agreement
unified solution seems to be forthcoming, it has been proposed that one should instead
to provide a common framework of algorithms for transformation between different linear
forms and/or document models.
The advantages of this approach may be as obvious as its disadvantages: A major
advantage is that, to the extent that transformations may be performed without loss
information, it makes all the strengths of all of the solutions above available to
users. Major disadvantages are: First, for every alternate serial form and/or document
introduced, the number of required transformation algorithms increases exponentially.
(and perhaps most importantly), there seems to be no established and agreed-upon way
deciding whether or not any given transformation algorithm introduces information
The idea of a tag algebra
There seems to be no consensus neither on whether the basic elements of marked up
documents are most suitably regarded as partitions of a character stream (tags, elements
or nodes of a graph of some kind, nor on what the basic operations on these elements
how they are to be defined. As long as these issues are not clear, it is hard to compare
evaluate the various approaches to the overlap problem.
Elementary algebra assumes that numbers are the same whether written in Arabic or
notation, or, for that matter, whether we use a decimal, a digital or a hexadecimal
system, and that the sum of two numbers is the same irrespective of which algorithm
to calculate it. Certainly, some notations may seem less intuitive than others, and
notations require the use of more complex algorithms than others in order to perform
operations. However, while important enough, such issues have no bearing on the question
what the correct result of an operation is. There is one and only one answer to the
of what the sum of two numbers are, irrespective of which notation and algorithm one
choose to use in computing that sum.
Similarly, a tag algebra should be able to tell us whether two representations of
document marked up using different notations are representations of the same document
and what the result of adding or subtracting a specific element to or from that document
should be. In what follows, we hope to establish at common basis for the solution
of at least
the first of these problems. We believe this basis will also be useful for defining
on marked up documents, although we do not get that far in this paper.
Bracketed notations and matching
In the rest of this paper, we will explore the relationship between markup without
overlap and markup with overlap by way of comparing XML to another, imaginary markup
This other markup language, which we may call O-XML, is exactly like XML, except that
overlap is allowed. So whereas nesting structures like
are well-formed in both
XML and O-XML, overlapping structures like
are ill-formed in XML,
but well-formed in O-XML. As we are concerned exclusively with element structure,
all other features or mechanisms of XML such as attributes, entities, comments, declarations,
cdata sections, processing instructions, etc.
We may start with the observation that in XML all elements nest, while in documents
overlap not all of them do. We may also observe that the element structure of XML documents can be deduced
without looking at the generic identifiers of end tags, whereas in notations allowing overlap this information is essential. We might
summarize this by saying that in XML the start and end tags of nesting elements always
whereas in other notations they do not always match.
But there is something puzzling about this way of putting the matter. Exactly what
mean by the terms
observe that whereas there is no overlap in
and there is overlap in
the difference between
them disappears if we leave out the generic identifiers:
< >...< >...</ >...</ >
Our guess is that it
probably seems natural to most readers to assume that the element structure of the
example above is identical to the first, rather than the second. But is this necessarily
and if so, why? Let us pursue the similarities between the nesting of start and end
the nesting of simple parentheses a bit further.
Consider any string consisting of left (i.e. start, or open) and right (i.e. end,
close) parentheses. One common assumption is that there should be a one to one correspondence
between left and right parentheses, so that they are equal in number. This ensures
((())) is to be regarded as well-formed, while the string
((()) is not. Another common assumption is that for every left parenthesis there
is a succeeding right parenthesis. This ensures that the string
)()( is not to be
regarded as well-formed. In the following, we will refer to these two assumptions
The two basic assumptions do not by themselves imply any specific answer to the question
which of the left parentheses correspond to which of the right parentheses in the
cases. The intuitively most plausible (or common) reading of the string
be in accordance with a "nesting" convention, to the effect that the first left parenthesis
corresponds to the last right parenthesis. According to this convention, the structure
()(()) looks like this, where subscripts indicate correspondence:
However, nothing in the two basic assumptions as stated stops us from adopting
other conventions. We might imagine a "mirroring" convention, where the first left
corresponds to the first right, the second left parenthesis to the second right, and
so on. A
"mirroring" convention for the same expression may look like this:
(Note the "overlap" between parentheses subscripted
Other conventions may be imagined in which some parentheses are selected according
specific rules, while the rest follow some general scheme like nesting or mirroring.
Thus, the two basic assumptions concerning the well-formedness of parenthetical
expressions are compatible with alternate ways of assigning structure to those expressions.
This important fact is easy to overlook, because one common method of testing for
well-formedness, namely the use of rewrite grammars, also lends itself very easily
assignment of a structure in accordance with the "nesting" convention above. The following
example will illustrate this.
A rewrite grammar for generating strings of parentheses which obey the two basic
assumptions may look like this:
A string is well-formed, in accordance with the two basic assumptions, if and
only if it can be generated by this grammar, which generates strings like
As mentioned, this grammar may also be used in order to assign a structure to the
expressions in accordance with the "nesting" convention. This may be done by associating
each other the two parentheses that are generated together in each of the subordination
base steps. Selecting an index from the sequence number of rule application, the string
(()()) could be derived with indicated associations as follows, where each rule
P in the string coming from the immediately preceding rule application,
and substitutes it with the right hand side of the →:
0th step, start symbol
1st step, subordination
(1 P )1
2nd step, coordination step
(1 P P )1
3rd step, base step
(1 (3 )3 P )1
4th step, base step
(1 (3 )3(4 )4 )1
After the fourth step, there are no non-terminals left in the expression, and
the process halts. As can be seen from the last line, the structure that has been
the expression is in exact accordance with the "nesting" convention.
This confirms that rewrite grammars may be used for assigning structure as well as
checking well-formedness. However, one may very well use a rewrite grammar to check
well-formedness without using that same grammar for assigning
structure. Moreover, there are other ways both of checking well-formedness and of
structure than using rewrite grammars.
This should make clear that well-formedness is separate from structure. In the appendix
prove that well-formedness is preserved even when inserting randomly any left parenthesis,
succeeded by another randomly inserted right parenthesis, into a well-formed string.
We observe that the rule for matching of start and end tags in XML is identical to
called a "nesting" convention for simple parenthetical expressions above. We also
this seems to be the "default" rule for matching of start and end tags in markup languages
which allow overlap, such as O-XML. In such languages, however, deviations from the
rule are allowed, and they are signaled by the selection of generic identifiers for
This explains why, as we noted above, the element structure of XML documents can be
identified independently of information about the generic identifiers of start and
whereas in O-XML this information is essential. In the following sections we will
these observations in order to look at alternate ways of describing the difference
nesting and overlap.
We believe that lattice theory [Grätser 1971] may usefully be
applied to the construction of document models from marked-up documents. Lattice theory,
graph theory, is a way of reasoning over ordering relations. The theory has two components,
one is a theory of order, the other of algebraic operations defined in terms of that
this section we will consider the order aspect, while algebraic operations will be
in the next section ("Algebraic characterization").
A lattice is a partially ordered set of objects, here referred to as nodes, which
provide the elements of document models. Not all nodes will find their way into a
model, only those that do may also be referred to as elements, in agreement with the
this term in XML and XSLT. A key requirement for an ordered collection of nodes to
a lattice, is that there should be one unique largest and one unique smallest node
collection. If there is only a largest (or smallest) node, it is what is called a
semi-lattice. Examples of lattices so constructed abound. Any society ordered by social rank (or
constitutional order) is a semi-lattice, with e.g president or monarch at the top.
are semi lattices, with a root node on the top and text nodes at the bottom.
We analyze the stream of characters making up a document morphologically into a series
typed objects consisting of start tags, end tags, and simples. The terms "start tag" and "end tag" should require no further explanation. We will
used XML notation in specifying them.
The term "simples" refers to PCDATA. Informally, we might say that a simple is whatever
occurs between two tags.
Empty elements are treated as a start tag followed by an end tag, without any intervening
We assume a linear order on the morphological constituents. This order is derived
from the serialized form of the document.
Consider the following XML sample document, D1:
Figure 1: Sample document 1 (D1)
According to the definitions just given, this document is a linearly ordered set
token objects, consisting of the following morphological constituents with the properties
indicated in this table, along with position indexes which, for our purpose, serve
indexes for the types:
For any document D, let S be the set of all start tag tokens and E the set of all
tokens in the document. Take the Cartesian product (all possible combinations) of
S and E, and
subtract all pairs for which the position of the end tag does not succeed the start
object then represents all the possibilities for a start tag to match up with an end
the simples, and name the result L(D).
L(D1) now consists of the following nodes:
The reader may be puzzled by the fact that we include non-matching pairs of tags,
nodes with start and end tags with different generic identifiers (GIs), among the
nodes of the
lattice. This is simply because the point of the exercise at this stage is to see
how far we
are able to build document models without taking the generic identifiers of tags into
We define a hierarchical order relation between nodes in terms of
containment as follows: A node x is larger than a node y if and only if
the start position of x is smaller or equal to that of y, and the end position of
x is greater
or equal to that of y (viewed as strings the larger, or higher, node contains the
other). This relation is asymmetric and transitive, so it is a partial order.
With this order in place, L(D1) as given in the table above is a semi-lattice. It
has a top node, namely the pair consisting of the first start tag and last end tag
<a>₁</a>₉. It also has a set of minimal nodes, the
simples. Each pair of tag tokens is a node in the semilattice. Lattice structures
displayed using Hasse-diagrams (familiar from tree structures).
Nodes are positioned vertically according to their position in the hierarchy, with
the more inclusive nodes at the top and the less inclusive further down.
Nodes at the same vertical level are positioned from left to right according to
their linear order in the document.
The lattice structure for L(D1):
As mentioned, the hierarchical order is transitive. According to the conventions of
diagrams like these, however, lines are drawn only between nodes immediately ordered.
Before we proceed, we need to define a relation we call relatives: Two nodes are relatives if they share a token start or end tag. This is
a reflexive, symmetric and non-transitive relation.
We also need to define a concept which has already been introduced implicitly, the
of a minimal node: A node is minimal if it is lowest in the
hierarchy, i.e. if it contains no other node.
Building document models from lattices
Modeling documents without overlap
In what follows, we will show how what we may call a "standard" document model can
built on the basis of the lattice. In building this model, we will rely on distinguishing
between nodes only in terms of the morphology introduced earlier, i.e. on the basis
start tags, end tags, or simple tokens they contain. In particular, we do not take the GIs of tags into consideration in the building of this
A document model for D1 from the lattice L(D1) is built through a
number of steps by selecting subsets from L(D1). Hence, the model building process
can be viewed as a filtering process. We start from the bottom of the lattice and
all the lowest-level minimal nodes. We copy these minimal nodes, i.e. the nodes labeled
X₂, Y₄ and Z₇, from L and transfer them to a subset of L(D1)
which we call S0.
We then delete these minimal nodes from L(D1
and call the result L0
We repeat the process of selecting minimal nodes, now from L0,
and add them to S0, naming the result S1. In
this case the minimal nodes are those labeled <b>₃</b>₅ and
<c>₆</c>₈. S1 inherits the order from L, and looks like
Next, we delete these minimal nodes, as well as all their
, from L0
, and call the result
Removal of the relatives assures that a tag can only enter into a relationship once.
soon as a node is selected for the model, neither the start nor the end tag of that
enter into another node of the model. The relatives of <b>₃</b>₅
in L(D1) are the following nodes
relatives b/b: Relatives of <b>₃</b>₅
L1 looks like this:
As L1 only contains the single node
<a>₁</a>₉ a repetition of the step above, which copies this
minimal node to S1 resulting in S2, will
terminate the process. There are now no more nodes to select, and the process stops with the following
candidate model M of D1:
Lattice D2: subset S2=M(D1)
What we find in S2 is, mutatis mutandis, identical to the
"standard" XML document tree for the document we started with:
Lattice D3: Standard document model of D1
While GIs have been left out of consideration during the building, the candidate
will also be subjected to a well-formedness check: For each node in the model, the
GI of the start tag must be identical to the GI of the end tag.. For this
model, this is the case.
We conclude that with the method described here, lattices can be used to generate
"standard" models of XML documents in the form of conventional XML document trees.
that we may also refer to nodes that are members of the model as elements. The claim
that these constitue trees will be
substantiated below, where we will show that the model building procedure indeed
guarantees production of tree structures only.
Modeling documents with overlap
In this section, we will show that lattices can also be used in order to generate
of O-XML documents in the form of GODDAGs [Sperberg-McQueen and Huitfeldt 2000]. Consider the following O-XML sample document, D2:
Figure 10: Sample document 2 (D2)
The lattice L(D2) for this document is as follows (while we have used capital M
to designate document models for XML documents, we will use capital O to designate
models for O-XML models):
An application of the method described in the previous section to this lattice, yields
the following candidate model:
Lattice P: Candidate model of D2, M(D2)
It easy enough to see that something must have gone wrong here, as there is disagreement
between the GIs of the start and end tags for the nodes <b>₂</c>₈ and
<c>₄</b>₆. Now D2 is not a well-formed XML document, and we have used the
same procedure for generating M(D2) from D2 (which is a well-formed O-XML document
with overlap, but not a well-formed XML document) as we did for building M(D1) from
D1 (which is a well-formed XML document). The method will build document models for
well-formed as well as ill-formed documents. However, the ill-formedness of D2 will be
captured by the well-formedness condition introduced above.
In constructing O(D2), it suffices to introduce one single modification to the
procedure described in the previous section: Before performing any other operations
L(D2), we remove all nodes that have start and end tags with different GIs. Once
this rule is introduced, the application of the procedure produces the following candidate
What we find here is, mutatis mutandis, identical to the GODDAG document model of
Lattice R: GODDAG model of D2
We conclude that with the method described here, lattices can also be used for
generating document models of O-XML documents with overlap in the form of GODDAGs.
extra rule that we introduced in this section does not affect the result for XML documents,
one and the same method can be used for generating XML trees from XML documents and
from O-XML documents. In other words, if D is a well-formed XML document M(D)=O(D).
O(D) does not have any built-in well-formedness check.
In the previous section we noticed, in passing, that of the proposed method for building
models from XML and O-XML documents, only M-models carry a well-formedness constraint.
introducing the rule that all nodes with different GIs on start and end tags should
deleted from the lattice, we effectively made sure that O(D) fulfills one well-formedness
constraint on documents. The difference between the M- and O-constructions is subtle:
M-construction will always produce trees, while the O-construction will not necessarily
produce trees. That issue is discussed in a subsequent section. The question to be
here is: Can document lattices be used to capture the full set of well-formedness
The model building in itself may work even if document D starts with and end tag or
with a start tag. The process will just ignore them. If they are taken into account
simples, the requirement that L(D) is a lattice will rule them out, since they will
ordered on the same level as the widest start tag/end tag pair.
Another source of ill-formedness stems from unbalanced parenthesis. For M(D) this
taken care of by requiring that the top node of L(D) is also the top node of M(D).
reason for this is that the top node of L(D) will be the widest start/end tag pair.
pair is not a node it will be because all the tags are used up, so to speak, before
reached, as the model building moves inside out. Consider this ill-formed document
The top node for L(D) is
but this node will be removed from M(D) as a relative of the minimal
For O(D) this doesn't work since the removal process looks at tag GI, so if the document
it is <b>₂</a>₃ which is left out. So we cannot
pin the requirement on the top node. However, we take it that for any document, including
that each tag should find a partner. Clearly <b>₂ in this document does
not have one.
We conjecture that, in addition to the already introduced rule of deleting nodes with
non-matching GIs on start and end tags, the following rule should suffice to capture
remaining well-formedness constraints on O-XML:
For any x.start or x.end in L(D) there must be an element E in M(D) or O(D) such
E.start=x.start or E.end=x.end.
This should make sure that every tag in L(D) finds a match in M(D) or O(D). Moreover,
M(D) is constructed without first removing non macthin GIs, the GI agreement constraint
mentioned above will rule out any overlap.
If this conjecture is correct, it means that checking for well-formedness as well
asignment of document structure can be accomplished by using document lattices, and
the use of rewrite grammars.
We believe that the proposed use of document lattices for building document models
marked up documents provides a unified account of the assignment of structure to documents
with and without overlap. If our conjecture about well-formedness is correct, this
is also able to capture all well-formedness constraints on both types of markup.
The proposed method exploits the observations made about possible alternative
conventions for the assignment of structure to simple parenthetical expressions made
earlier. In building models for both kinds of documents, XML and O-XML, we rely on
"nesting" convention. The difference is that for XML documents, information about
the GIs of
tags may be ignored, whereas for O-XML documents this information is essential.
In introducing O-XML, we said that the only difference between XML and O-XML is that
O-XML allows overlapping elements, whereas XML does not. What about so-called self-overlap,
discontinuous elements and virtual elements, mechanisms which have been proposed in
example TexMECS, LMNL and other alternative markup languages? We believe that the
of self-overlap with document lattices is just a question of adjusting the basic morphology.
Whether, or to what extent, document lattices can also be used for markup languages
discontinuous or virtual elements we simply do not know.
Meet and join
The meet and join operations are a central part of lattice theory. In our context,
may serve to characterize the difference between the models for XML and O-XML, and
relationship to the lattice L.
The meet operation selects the largest node below its operands, while join selects
smallest above its operands. A requirement for lattices is that any two nodes have
and a join.
Since we are dealing not primarily with lattices in the strict sense, but mostly with
semi-lattices, we need to lighten this requirement for the meet operation: two nodes
allowed not to have any meet if they do not have any common nodes below them at all.
We indicate the vertical ordering of nodes in a lattice by using the symbols ⊒
and ⊑, so that the opening is faced towards the larger node. In other words,
x⊒y means that x is higher in the lattice than y, while x⊑y means that y
is higher than x.
We can now define the meet operation, indicated by the operator ⊓, and the
join operation, indicated by the operator ⊔, as follows:
x⊓y=z iff, for all u, whenever u⊑x and u⊑y then
x⊔y=z iff, for all u, whenever u⊒x and u⊒y then
The meet and join operations can also be characterized in terms of tag configurations.
Let the start and end tag for a node x be written as x.start and x.end. The meet operation
can be captured by saying that whenever z=x⊓y then z.start=max(x.start, y.start)
and z.end=min(x.end,y.end). Similarly, the join operation can be captured by saying
whenever z=x⊔y then z.start=min(x.start,y.start) and z.end=max(x.end, y.end).
Are M(D) and O(D) lattices?
The first step of the method we have proposed consists in constructing, on the basis
a marked up document D, a lattice L(D). The construction method ensures that L(D)
a lattice, i.e. that it has a unique largest node, that every pair of nodes has a
that every pair of non-minimal nodes has a meet. The document models, M(D) for XML
and O(D) for O-XML documents, however, are derived from L(D) through a filtering process.
Therefore, it is appropriate to ask whether M(D) and O(D) are also lattices, or not.
In this section we will argue that M(D) is not only guaranteed to be a lattice, it
also guaranteed to be a tree. O(D), however, is obviously not guaranteed to be a tree,
it is also not guaranteed to be a lattice.
We have earlier observed the similarity between the document model M(D) and XML trees.
The following argument will show that this is necessarily the case for any model M(D).
XML, whenever both x.start<y.start and y.start<x.end is true, it is safe to conclude
that y.end<x.end (in other words, there is no overlap). This is also true for the
relationships in M(D).
To see why, consider x and y, both elements of a model M(D), with the assumption
x.start<y.start and y.start<x.end. We want to show that it will always be the case
that y.end<x.end under these circumstances. So suppose for contradiction that this
so, and that x.end<y.end. Then there exists a node z=(y.start,x.end) in L(D), since
y.start<x.end (recall that L(D) contains all pairs of start tags followed by end tags).
It is also clear that z is not a member of M(D). Further, from z's definition, we
z is smaller than both x and y, and also a relative of both. This is enough to contradict
the selection assumption for M(D): if z is smaller than x, then, since z=(y.start,
cannot have been removed in the model building process as a candidate for x, because
node is below all the relatives it removes, and by assumption z<x . So z must then
been removed as candidate for M(D) considered as a relative of y. But this contradicts
assumption that z<y, since y<z for z to be removed. We can therefore conclude that
assuming that x.end<y.end is not tenable, and therefore that y.end<x.end. This means
that if ⊓ is confined to elements of M(D), u⊓v is either one of u or v (if it
has a value at all); if u strictly comes after v, they would not share any nodes.
consequence of this is that any element x must have a unique parent (if it has one
if both u and v where distinct parents in M(D) (i.e. smallest nodes above x), x⊑u
x⊑v, which would mean that x⊑u⊓v=u (or v). We can therefore conclude that
M(D) is a tree.
The above argument does not hold for O(D) which models overlap structures. The example
GODDAG in Lattice R shows that the structure is not a tree, as the simple Y has
two parents. Also, the join may be undefined when restricted to O(D) (that is, redefined
within O(D) and not computed within L(D)). In O(D) two elemnts do not necessarily
unique smallest dominating element, even if there are elements above them. For example,
elements x and y may be dominated by elementss a and b, and neither of these dominate
other if they overlap. In this example it is therefore impossible to select a smallest
dominating element to serve as the join of x and y.
In this section and the ones that follow, we will look at the differences between
and M-models in terms of meet and join. This also brings us to regard M(D) and O(D)
subsets of L(D), and consider ways of adding nodes to them in order to make them satisfy
When M(D) and O(D) are seen as subsets of L(D), with meet and join computed in L(D),
there are a number of differences between them that were not captured above, when
ourselves to look at their structures only. Our first question is whether the meet
operations are closed in M(D) and O(D). (An operation is closed in a subset A if the
of the operation stays in A when operating on elements of A.)
One defining difference between O(D) and M(D) in terms of closure is that while the
operation is closed in M(D), it is not in O(D). As we saw above, given two elements
x and y
in M(D), their meet is either x or y, which is still in M(D). If u and v in O(D) are
distinct and overlapping, which means that u.start<v.start while u.end<v.end, their
meet is w=(v.start,u.end), which is not an element of O(D). (It is a node in L(D),
it is a relative of both u and v, it is not a member of O(D)).
The join operation will not always be closed, neither in M(D) nor in O(D). If x precedes
y with both tags of x before y.start, their join is (x.start,y.end). This node cannot
element neither of M(D) nor of O(D), since it is a relative of both x and y, but not
identical to any of them.
An observation concerning M(D) is that there are nested documents that are closed
both join and meet. Documents where no node precedes another are of that type, like
<r><a>X</a><r>. In a document like this, where ⊑ is a
total order, it is the case that if x⊔y=x then x⊓y=y and vice versa. A
document with overlap will in general leak into L(D) for both meet and join.
Consider D2 from the previous section. The lattice L(D2) may be used to
represent M(D2) as well as O(D2). D2 is an O-XML document with overlap, so
M(D2) will be ill-formed, but that is not of relevance at this point; M(D2) will
simply exhibit the nested view of the document. In the figure below, showing the lattice
L(2), the elements of M(D2) are indicated with a ✓, while the elements of
O(D2) are indicated with a ✗; the simples are members of both models.
Lattice OMD2: L(D2) with subsets ✓M(D2) and ✗O(D2)
We write M(D)* for the closure of M(D). It is obtained by collecting all nodes
x⊓y and x⊔y in L(D) where x, y are elements of M(D), with recombinations.
Similarly, we write O(D)* for the closure of O(D). The new nodes that are added in
are called semi-elements, distinct from the elements of M(D) or O(D).
Here is what the O-XML document D2 with overlap looks like with semi-elements
added. The semi-elements are indicated with a check mark, ✓. Note that the
semi-elements are precisely the candidate elements for M(D2), an observation we return
Lattice oOMD2: O(D2)* with semi-elements check marked.
The semi-elements are considered nameless, they may be annotated with an algebraic
formula having that node as value.
The reader may check that the above diagram represents the closure of O(D2). Here
are all possible combinations alongside their value, not including combinations involving
elements x and y such that x≤y.
All possible meet and join combinations for O(D2)
|X₃⊔Y₅ = b = <b>₂</b>₆
|X₃⊔Z₇ = b⊔c = <b>₂</c>₈
|Z₇⊔Y₅ = c = <c>₄</c>₈
|b⊓c = <c>₄</b>₆
It was noted above that M(D2) was a subset of O(D2)*. We want to demonstrate
that it holds for any document D that M(D) is a subset of O(D)*, even though M(D)
need only have the top element in common. The converse does not hold. A consequence
is that O-XML documents with overlap cannot be reached from algebraic manipulations
but that the corresponding XML document without overlap can be reached from O(D)*.
For this to make sense, D must be a well-formed overlapping O-XML document. If D
overlap, it will also be a well-formed XML document for which we have already established
that O(D)=M(D). Hence, in that case M(D) will automatically be a subset of O(D)* (=M(D)*).
Further, we take it without argument that if O(D) contains an element x that does
overlap with any other element y in O(D), x will also be a member of M(D). It doesn't
if x contains, or is contained, in an overlap configuration as long x does not itself
overlap with another element. To illustrate, the a and b elements of the following
D4 will be elements of both M(D4) and O(D4), even though c and d overlaps.
The latter two will obviously not be members of M(D4):
addition to the simples, the elements of M(D4
) will be these (token indexes are skipped
since the GIs serve to uniquely identify token tags), where the linear order of the
also reflects the hierarchical order:
(<a>,</a>) (<d>,</c>) (<c>,</d>) (<b>,</b>)
So the odd elements here, which are not elements of O(D4
), are these two
It can be seen that these
two nodes are semi-elements of O(D4
)* given the equations
The above equations hold also in the general case. Consider an O-XML document D, and
element m in M(D), so m may possibly not have matching tags. Then it will be the case
that either m
is in O(D), or else there will be elements in O(D) so that m is their join, their
meet or a
combination of join and meet. To see this, note that m will either be well-formed
to XML (checked through the GI-identity constraint) or it will not, but still well-formed
according to O-XML. So it is like either
(<d>,</c>) of M(D4). If m is a member of O(D), it will also be in
O(D)*. So consider the case where m is not an element in O(D). Then there must be
elements a and b in O(D) so that a.start=m.start and b.end=m.end, this because of
constraint which says all tag tokens must be part of an element. Now there are three
a.start<b.start and a.end<b.end (this subsumes overlap and
b.start<a.start and b.end<a.end (overlap only)
a.start<b.start and b.end<a.end (nesting)
In the first case m=a⊔b, while in the second, m=a⊓b, parallel to the case
for D4.The third case, the nesting case is a bit trickier. The tag configuration for a and
b is now
), another way of stating that m=(a.start, b.end). Recall
that the M-models are constructed without looking at GIs, so for m to be an element
all the tags between a.start(=m.start) and b.end(=m.end) must match up. In particular
must be an equal number of start tags and end tags between a.start and b.end. Since
already have b.start in there, we know there must be at least one extra end tag between
a.start and b.end. That end tag must be matched in O(D) by a start tag outside of
(a.start, b.end) configuration. Which means that there must be an element c of O(D)
that c.end is between a.start and b.end and such that c.start is outside of the pair,
is c.start<a.start. Now m can be written on the form m=a⊓c⊔b, and so is a
semi-element of O(D)*.
This result, that M(D) is a subset of O(D)*, but not vice versa, pinpoints an asymmetry
between overlapping and nesting documents. Nested documents can be recovered, so to
from overlapping documents, but there are overlapping documents that cannot be described
algebraicly in terms of nested documents.
Closure models may be help in representing certain cases so-called spurious overlap
as discussed in [Huitfeldt and Sperberg-McQueen 2003]. Roughly, spurious overlap occurs
whenever two elements a and b overlap, and there is no PCDATA either between the two
tags, or between the two end tags, or between the start tag of a and the end tag of
following O-XML document D3 may illustrate.
Its document model is this
Except for token indexes, it is identical to the document model for the following
The closures of the two document models are slightly different, though. The
document with spurious overlap has this closure:
while the document without overlap has this closure:
This may serve to illustrate the close relationship between the lattice model and
serial form of the marked up document. By the use of closure models with semi-elements,
documents with spurious overlap can be distinguished from other documents which share
same document model.
We have described a method for building lattices from marked up documents with and
without overlap, and for generating, from these lattices, document models in the form
for XML documents, and in the form of GODDAGs for documents with overlap. We have
one and the same method can be used for generating both kinds of models, and we have
reasons to believe that lattices can also be used to implement well-formedness constraints
both kinds of documents. In this sense, this is also a step away from relying on context
grammars and the like in analyzing documents. As presented here, model building is
separate from grammar, but we leave open the question of how this bears on document
We have discussed and compared some of the algebraic features of the document models
the relations between them, and pointed out some interesting results. For example,
algebraic operations provide a link between nested documents and overlapping documents.
former can be computed by algebraic means from the latter, but not the other way around.
have also disussed how the algebraically closed models can be used to distinguish
spurious overlap from nested models, via the introduction of semi-elements.
Therefore, although many of the details in the work presented here probably are in
of correction and revision, we believe that the application of lattices to marked
may provide a unified account of markup languages with and without overlap.
In the section called "Bracketed notations and matching" we defined the grammar P.
formally prove that any two parentheses (a left then a right) can be added to a well-formed
parenthetical expression according to this grammar, while preserving well-formedness.
with the following lemma
Lemma on string splitting
Let the string
S over left and right parentheses be derived according to P,
and decomposed into
can be reduced, by removing non-terminal Ps
from each of them into
Z', such that
Z'=)q, so that
S' is well-formed.
X is the leftmost string of
S, any right
X must occur in the pattern
such a pattern can safely be removed without destroying the well-formedness of the
X will in the end be reduced to
X' as a consecutive string of
left parentheses. A parallel argument goes for
Z and its reduction to
Z' as a string of right parentheses. For
Y the story is a bit
different, since a left or right parenthesis may be generated with a corresponding
Z. However, we can safely conclude that if all
Ps are removed from Y, then there can be no string of the form
Y, since either
) is derived with a corresponding
( is derived with a corresponding
W. Thus, Y' must be a string of
)'s followed by
Since removing a
P from a well-formed string won't affect its well-formedness,
S'=X'Y'Z' is well-formed.
This lemma will be used to show that any pair of left and right parentheses can be
inserted into a well-formed string without affecting its well-formedness.
Well-formedness of additional parentheses
Let the string S be a well-formed parenthetical expression according to P. Inserting
left parenthesis anywhere in
S followed by a right parenthesis anywhere after the
left, will result in a well-formed string according to
Proof: Split the string
S into three substrings
XYZ at the
positions where ( and ) are to be inserted, so that The new string is
Now according to lemma, X Y and Z can be reduced and re-concatenated into the well-formed
Inserting the new left-( and right-) into their respective positions, results in
Now, all that is left to do is just to shift the superscripts to the right and
left; this expression can be written as
But this is just the expression
(X'Y'Z'), and since the
can be reduced to a sequence of
Ps, the resulting
expression is also well-formed, being then on the form
This proposition guarantees that the intended matching of the parentheses is of no
concern to the well-formedness of the expression.
[Abney 1997] Abney, Steven,
Partial Parsing via
Finite-State Cascades. Journal of Natural Language Engineering
[Barnard et al. 1988] Barnard, David, Ron
Hayter, Maria Karababa, George Logan, and John McFadden,
SGML-based markup for literary
texts: Two problems and some solutions, Computers and the
Humanities, 22: 265-276. http://www.springerlink.com/content/r1p6t63627663436/
[Barnard et al. 1995] Barnard, David; Burnard,
Lou; Gaspart, Jean-Pierre; Price, Lynne A.; Sperberg-McQueen, C. M.; Varile, Giovanni
Hierarchical Encoding of Text: Technical Problems and SGML Solutions,
Computers and the Humanities, The Text Encoding Initiative:
Background and Contents, Guest Editors Nancy Ide and Jean Vèronis, 29/3, 211-231.
[Bauman 2005] Bauman, Syd,
around, Proceedings of Extreme Markup Languages®.
[Carletta et al. 2005] Carletta, J.; S. Evert;
U. Heid; and J. Kilgour,
The NITE XML Toolkit: data model and query, Language Resources and Evaluation, 39.4: 313-334. http://www.springerlink.com/content/j37h2p15u682075g/
[Chatti et al. 2007] Chatti, Noureddine; Suha
Kaouk; Sylvie Calabretto; and Jean Marie Pinon,
MultiX: an XML-based formalism to
encode multi-structured documents, Proceedings of Extreme
Markup Languages. http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html
[DeRose 2004] DeRose, Steven J.,
overlap: A review and a horse, Proceedings of Extreme Markup
[Di Iorio et al. 2009] Di Iorio, Angelo; Silvio
Peroni; and Fabio Vitali,
Towards markup support for full GODDAGs and beyond: the
EARMARK approach, Balisage Series on Markup
Technologies, vol. 3. http://www.balisage.net/Proceedings/vol3/html/Peroni01/BalisageVol3-Peroni01.html. doi:https://doi.org/10.4242/BalisageVol3.Peroni01
[Goldfarb 1990] Charles F. Goldfarb, The SGML Handbook, Clandon Press, Oxford.
[Grätser 1971] Gråtser, George, Lattice Theory, first concepts and distributive lattices San
Fransisco, Calif., 1971. xv+212 pp. Softcover edition, Dover Publications, 2009. http://server.maths.umanitoba.ca/homepages/gratzer.html
[Hilbert et al. 2005] Hilbert, Mirco; Oliver
Schonefeld; and Andreas Witt,
Making CONCUR work, Proceedings of Extreme Markup Languages®. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml
[Huitfeldt 1998] Huitfeldt C,
MECS - A
Multi-Element Code System, Working Papers from the
Wittgenstein Archives at the University of Bergen, No 3, Version October 1998.
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt, Claus; and C. M.
TexMECS: An experimental markup meta-language for complex
documents, Working paper of the project Markup Languages for
Complex Documents (MLCD), University of Bergen. http://decentius.aksis.uib.no/mlcd/2003/Papers/texmecs.html
[Jagadish et al. 2004] Jagadish, H.V.; Laks V.
S. Lakshmanan; Monica Scannapieco; Divesh Srivastava; and Nuwee Wiwatwattana,
XML: one hierarchy isn't enough, Proceedings of the 2004 ACM
SIGMOD international conference on Management of data, Paris, France: 251-262.
[Marcoux2008] Marcoux, Yves.
Graph characterization of overlap-only
TexMECS and other overlapping markup formalisms. Presented at Balisage: The Markup
Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings
of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies,
vol. 1 (2008). doi:https://doi.org/10.4242/BalisageVol1.Marcoux01. http://balisage.net/Proceedings/vol1/html/Marcoux01/BalisageVol1-Marcoux01.html
[Marinelli et al. 2008] Marinelli, P.,
Vitali, F., Zacchiroli, S.,
Towards the unification of formats for overlapping
markup, The New Review of Hypermedia and
[Nicol 2002a] Nicol, Gavin,
algebra: Toward a formal theory of markup, Extreme Markup
[Nicol 2002b] Gavin Nicol,
algebra: Extending core range algebra to arbitrary structures, . http://www.mind-to-mind.com/library/papers/ara/attributed-range-algebra-07-2002.pdf
[Raymond et al. 1992] Raymond, Darrell R;
Tompa, Frank William, and Wood, Derick:
Markup Reconsidered. University of
Western Ontario. www.cs.uwaterloo.ca/~fwtompa/.papers/markup.ps
[Raymond et al. 1995] Raymond, Darrell R;
Tompa, Frank William, and Wood, Derick:
From Data Representation to Data Model:
Meta-Semantic Issues in the Evlution of SGML.
[Schonefeld and Witt 2006] Schonefeld,
Oliver, and Andreas Witt,
Towards validation of concurrent markup, Proceedings of Extreme Markup Languages®, . http://conferences.idealliance.org/extreme/html/2006/Schonefeld01/EML2006Schonefeld01.html
[Schonefeld 2007] Schonefeld, Oliver Georg Rehm,
Andreas Witt, Lothar Lemnitzer (eds.),
XCONCUR and XCONCUR-CL: A constraint-based
approach for the validation of concurrent markup, Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures
linguistic resources and applications: Proceedings of the Biennial GLDV Conference
2007, Tübingen: Gunter Narr Verlag. Pp. 347-356.
[Schmidt and Colomb 2009] Schmidt, D.,
A data structure for representing multi-version texts online,
International Journal of Human-Computer Studies, . http://portal.acm.org/citation.cfm?id=1523966
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C. M., and Claus
Huitfeldt Peter R. King and Ethan V. Munson (eds.),
GODDAG: A Data Structure for
Overlapping Hierarchies, Digital documents: systems and
principles. Lecture Notes in Computer Science 2023, Berlin: Springer, 2004, pp.
139-160. Paper given at Digital Documents: Systems and Principles. 8th International
Conference on Digital Documents and Electronic Publishing, DDEP 2000, 5th International
Workshop on the Principles of Digital Document Processing, PODDP 2000, Munich, Germany,
September 13-15, 2000.. http://www.w3.org/People/cmsmcq/2000/poddp2000.html
[Stührenberg and Jettka 2009] Stührenberg, Maik; and Daniel Jettka,
A toolkit for multi-dimensional markup: The
development of SGF to XStandoff, Proceedings of Balisage: The
Markup Conference 2009. Balisage Series on Markup Technologies, Vol. 3. http://www.balisage.net/Proceedings/vol3/html/Stuhrenberg01/BalisageVol3-Stuhrenberg01.html. doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01
[TEI P5] Lou Burnard and Syd Bauman (eds.),
Non-hierarchical Structures, Guidelines for the
Encoding and Interchange of Machine-Readable Texts (TEI P5), The TEI Consortium.
[Tennison and Piez 2002] Tennison, J. and W.
The Layered Markup and Annotation Language (LMNL), Proceedings of Extreme Markup Languages®. http://www.idealliance.org/papers/extreme/proceedings/author-pkg/2002/Tennison02/EML2002Tennison02.zip
[Witt 2005] Andreas Witt Stefanie Dipper, Michael
Götze, and Manfred Stede (eds.),
Multiple Hierarchies: New Aspects of an Old
Solution, Heterogeneity in Focus: Creating and and Using
Linguistic Databases, volume 2 of Interdisciplinary Studies on Information
Structure (ISIS), Working Papers of the SFB 632. University of Potsdam, Germany. (Corrected
reprint of an Extreme Markup 2004 paper). http://en.scientificcommons.org/42597903