How to cite this paper
A tag algebra for document markup
Balisage: The Markup Conference 2011
August 2 - 5, 2011
Digital documents: Representational forms and operations
When computers were first used for purposes of text processing, documents were typically
represented as linear sequences of characters. This format lends reasonable support to basic
operations on documents, such as visual presentation, editing, retrieval, content extraction,
linguistic and quantitative analysis, and various forms of transformation.
However, the objects of such operations are not always most naturally thought of, or not
easily identified, in terms of character sequences. Some operations are performed on
structural partitions of documents such as sections, paragraphs, or sentences. Objects defined
in terms of their thematic, linguistic, rhetorical or other properties, such as names, foreign
words, high-lighted phrases etc. may also be the target of operations to be performed.
Sometimes such objects can be identified at least indirectly in terms of properties of
character sequences, but seldom reliably, and often not at all.
Document markup can be seen as a response to these problems: when candidate objects of
operations are explicitly marked, text processing applications are relieved of the burden of
identifying them, and developers of such applications may concentrate on the operations to be
performed rather than on the identification of the objects to be processed. Thus, marked up
documents provide improved support for text processing, compared to documents without markup.
Therefore, it is perhaps only natural that with the introduction of document markup, the
number and kinds of document objects that became subject of interest for processing increased.
Markup languages and tools were developed not only for general purpose document publication
and analysis, but also for more specialized purposes and document types.
For a number of reasons, the markup languages which became most widely used, and the
grammars which ultimately acquired the status of international standards for those languages,
such as SGML and XML, were based on context-free grammars. One obvious reason was that representations
based on context-free grammars were processable according to well known and computationally
efficient algorithms. Over the years a plethora of general as well as special-purpose software
for XML document processing has become available.
Such markup languages invite an abstract view of document structure and content, in which
the "artifacts" of markup in the document in its serial form are represented in a more
conspicuous way. Many or most XML processors do not actually perform operations on the
document in its serialized form, but only (or largely) on a model of the document, usually in
the form of a labeled directed acyclic graph. The graph is constructed from the serialized
form of the document, and reserialized only for purposes of storage and exchange. Since one
and the same graph may be represented in the form of more than one serialization, focus tends
to be transferred on the graph (or, in XML terms, the document object model), rather than the
serial form of the document.
The Overlap Problem
The coincidence of the two factors mentioned above (i.e. the increasing interest in
marking up more and more aspects of documents, and the fact that available systems were based
on context-free grammars) lead to a number of closely related problems which are now commonly
referred to as "the overlap problem". On the one hand, the obvious advantages of markup lead to interest in marking up a
wide variety of aspects of document structure, often necessitating different segmentations of
one and the same document. On the other hand, standard markup systems did not provide a
straight-forward and natural way of representing such competing or conflicting segmentations. So although marked up documents provide better support for a broad range of
processing purposes than documents without markup, the introduction of markup also lead to a
number of problems one did not have much means or motivation to identify or formulate before
the introduction of markup.
The so-called overlap problem was identified fairly early, and the literature of proposed
solutions goes more than 30 years back in time. The responses to the problem may be said to fall largely in four groups: 1)
Alternate linear forms, 2) Alternate document models, 3) Stand-off markup, and 4)
Alternate linear forms
Since the syntax of XML enforces a strict hierarchical nesting of XML elements, it
seems like a tempting solution simply to lift or soften this restriction. A large number of
such proposals have been made.
Some notable examples are SGML CONCUR [Goldfarb 1990], XCONCUR
[Schonefeld 2007], MECS [Huitfeldt 1998],
TexMecs [Huitfeldt and Sperberg-McQueen 2003], and LMNL [Tennison and Piez 2002].
The advantages of most of these proposals is that they provide what for human readers
seems like a more natural and easily readable linear representation of overlapping elements
than is provided by alternative proposals adhering to XML syntax. They may also provide
better support for certain operations on overlapping elements. However, for the most part it
is unclear how to perform anything like a full range of standard operations on their markup
constructs, and there are unresolved issues of processability, expressive power etc.
Alternate document models
Some proposals provide alternatives to what is commonly perceived as the XML document
model, i.e. tree-shaped graphs like XDM or XML DOM, either in combination with or
independently of proposals for alternative linear forms.
Some notable examples are LMNL [Tennison and Piez 2002], Goddag
[Sperberg-McQueen and Huitfeldt 2000], multi-colored trees [Jagadish et al. 2004], and Multi-Document Graphs [Schmidt and Colomb 2009].
Although these solutions in general provide better support for at least certain
operations on overlapping elements, the question of how to perform other standard operations
is mostly unclear or not systematically addressed. For some of the proposals it is also
unclear whether, or to what extent, it is possible to support roundtripping between the
proposed document model and its form, or even what such roundtripping would amount to. (This
is the case, for example, concerning the relation between Goddag and TexMecs.) Again, there
are in general unresolved issues of processability, expressive power etc.
Quite early on, it was suggested that many or most of the problems related to overlap
can be seen as a result of the general limitations of embedded markup. [Raymond et al. 1992, Raymond et al. 1995] Thus, the idea
of separating the markup from the document to be marked up, in the form of so-called
"stand-off" markup has received considerable attention and adoption.
Some notable examples are xStandoff [Stührenberg and Jettka 2009],
Nite [Carletta et al. 2005], Earmark [Di Iorio et al. 2009], and (as an interesting limiting case) Multix [Chatti et al. 2007]. TEI [TEI P5] also contains recommendation for stand-off
Stand-off markup does indeed provide potentially very strong expressive power and might
seem to be the one sweeping solution to all problems related to overlap. And perhaps it is. We note, however, that standoff markup either has been used, or could be used,
in combination with any or all of the linear formats and document models referred to above.
Thus, all the unclarities that pertain to those, and to the relationships between them,
pertain also to stand-off markup.
Since there is such a multitude of proposals for alternate linear forms and document
models, since they all have their pros and cons, and since no universal agreement on a
unified solution seems to be forthcoming, it has been proposed that one should instead try
to provide a common framework of algorithms for transformation between different linear
forms and/or document models.
The advantages of this approach may be as obvious as its disadvantages: A major
advantage is that, to the extent that transformations may be performed without loss of
information, it makes all the strengths of all of the solutions above available to the
users. Major disadvantages are: First, for every alternate serial form and/or document model
introduced, the number of required transformation algorithms increases exponentially. Second
(and perhaps most importantly), there seems to be no established and agreed-upon way of
deciding whether or not any given transformation algorithm introduces information
The idea of a tag algebra
There seems to be no consensus neither on whether the basic elements of marked up
documents are most suitably regarded as partitions of a character stream (tags, elements etc.)
or nodes of a graph of some kind, nor on what the basic operations on these elements are and
how they are to be defined. As long as these issues are not clear, it is hard to compare and
evaluate the various approaches to the overlap problem.
Elementary algebra assumes that numbers are the same whether written in Arabic or Roman
notation, or, for that matter, whether we use a decimal, a digital or a hexadecimal numeral
system, and that the sum of two numbers is the same irrespective of which algorithm we employ
to calculate it. Certainly, some notations may seem less intuitive than others, and some
notations require the use of more complex algorithms than others in order to perform certain
operations. However, while important enough, such issues have no bearing on the question of
what the correct result of an operation is. There is one and only one answer to the question
of what the sum of two numbers are, irrespective of which notation and algorithm one might
choose to use in computing that sum.
Similarly, a tag algebra should be able to tell us whether two representations of a
document marked up using different notations are representations of the same document or not,
and what the result of adding or subtracting a specific element to or from that document
should be. In what follows, we hope to establish at common basis for the solution of at least
the first of these problems. We believe this basis will also be useful for defining operations
on marked up documents, although we do not get that far in this paper.
Bracketed notations and matching
In the rest of this paper, we will explore the relationship between markup without
overlap and markup with overlap by way of comparing XML to another, imaginary markup language.
This other markup language, which we may call O-XML, is exactly like XML, except that in O-XML
overlap is allowed. So whereas nesting structures like
are well-formed in both
XML and O-XML, overlapping structures like
are ill-formed in XML,
but well-formed in O-XML. As we are concerned exclusively with element structure, we ignore
all other features or mechanisms of XML such as attributes, entities, comments, declarations,
cdata sections, processing instructions, etc.
We may start with the observation that in XML all elements nest, while in documents with
overlap not all of them do. We may also observe that the element structure of XML documents can be deduced
without looking at the generic identifiers of end tags, whereas in notations allowing overlap this information is essential. We might
summarize this by saying that in XML the start and end tags of nesting elements always match,
whereas in other notations they do not always match.
But there is something puzzling about this way of putting the matter. Exactly what do we
mean by the terms
observe that whereas there is no overlap in
and there is overlap in
the difference between
them disappears if we leave out the generic identifiers:
< >...< >...</ >...</ >
Our guess is that it
probably seems natural to most readers to assume that the element structure of the last
example above is identical to the first, rather than the second. But is this necessarily so,
and if so, why? Let us pursue the similarities between the nesting of start and end tags and
the nesting of simple parentheses a bit further.
Consider any string consisting of left (i.e. start, or open) and right (i.e. end, or
close) parentheses. One common assumption is that there should be a one to one correspondence
between left and right parentheses, so that they are equal in number. This ensures that the
((())) is to be regarded as well-formed, while the string
((()) is not. Another common assumption is that for every left parenthesis there
is a succeeding right parenthesis. This ensures that the string
)()( is not to be
regarded as well-formed. In the following, we will refer to these two assumptions as "basic".
The two basic assumptions do not by themselves imply any specific answer to the question
which of the left parentheses correspond to which of the right parentheses in the well-formed
cases. The intuitively most plausible (or common) reading of the string
be in accordance with a "nesting" convention, to the effect that the first left parenthesis
corresponds to the last right parenthesis. According to this convention, the structure of the
()(()) looks like this, where subscripts indicate correspondence:
However, nothing in the two basic assumptions as stated stops us from adopting
other conventions. We might imagine a "mirroring" convention, where the first left parenthesis
corresponds to the first right, the second left parenthesis to the second right, and so on. A
"mirroring" convention for the same expression may look like this:
(Note the "overlap" between parentheses subscripted
Other conventions may be imagined in which some parentheses are selected according to
specific rules, while the rest follow some general scheme like nesting or mirroring.
Thus, the two basic assumptions concerning the well-formedness of parenthetical
expressions are compatible with alternate ways of assigning structure to those expressions.
This important fact is easy to overlook, because one common method of testing for
well-formedness, namely the use of rewrite grammars, also lends itself very easily to the
assignment of a structure in accordance with the "nesting" convention above. The following
example will illustrate this.
A rewrite grammar for generating strings of parentheses which obey the two basic
assumptions may look like this:
A string is well-formed, in accordance with the two basic assumptions, if and
only if it can be generated by this grammar, which generates strings like
As mentioned, this grammar may also be used in order to assign a structure to the
expressions in accordance with the "nesting" convention. This may be done by associating with
each other the two parentheses that are generated together in each of the subordination or
base steps. Selecting an index from the sequence number of rule application, the string
(()()) could be derived with indicated associations as follows, where each rule
P in the string coming from the immediately preceding rule application,
and substitutes it with the right hand side of the →:
0th step, start symbol
1st step, subordination
(1 P )1
2nd step, coordination step
(1 P P )1
3rd step, base step
(1 (3 )3 P )1
4th step, base step
(1 (3 )3(4 )4 )1
After the fourth step, there are no non-terminals left in the expression, and
the process halts. As can be seen from the last line, the structure that has been assigned to
the expression is in exact accordance with the "nesting" convention.
This confirms that rewrite grammars may be used for assigning structure as well as for
checking well-formedness. However, one may very well use a rewrite grammar to check for
well-formedness without using that same grammar for assigning
structure. Moreover, there are other ways both of checking well-formedness and of assigning
structure than using rewrite grammars.
This should make clear that well-formedness is separate from structure. In the appendix we
prove that well-formedness is preserved even when inserting randomly any left parenthesis,
succeeded by another randomly inserted right parenthesis, into a well-formed string.
We observe that the rule for matching of start and end tags in XML is identical to what we
called a "nesting" convention for simple parenthetical expressions above. We also observe that
this seems to be the "default" rule for matching of start and end tags in markup languages
which allow overlap, such as O-XML. In such languages, however, deviations from the default
rule are allowed, and they are signaled by the selection of generic identifiers for end tags.
This explains why, as we noted above, the element structure of XML documents can be
identified independently of information about the generic identifiers of start and end tags,
whereas in O-XML this information is essential. In the following sections we will exploit
these observations in order to look at alternate ways of describing the difference between
nesting and overlap.
We believe that lattice theory [Grätser 1971] may usefully be
applied to the construction of document models from marked-up documents. Lattice theory, like
graph theory, is a way of reasoning over ordering relations. The theory has two components,
one is a theory of order, the other of algebraic operations defined in terms of that order. In
this section we will consider the order aspect, while algebraic operations will be discussed
in the next section ("Algebraic characterization").
A lattice is a partially ordered set of objects, here referred to as nodes, which will
provide the elements of document models. Not all nodes will find their way into a document
model, only those that do may also be referred to as elements, in agreement with the use of
this term in XML and XSLT. A key requirement for an ordered collection of nodes to constitute
a lattice, is that there should be one unique largest and one unique smallest node in the
collection. If there is only a largest (or smallest) node, it is what is called a
semi-lattice. Examples of lattices so constructed abound. Any society ordered by social rank (or
constitutional order) is a semi-lattice, with e.g president or monarch at the top. XML trees
are semi lattices, with a root node on the top and text nodes at the bottom.
We analyze the stream of characters making up a document morphologically into a series of
typed objects consisting of start tags, end tags, and simples. The terms "start tag" and "end tag" should require no further explanation. We will
used XML notation in specifying them.
The term "simples" refers to PCDATA. Informally, we might say that a simple is whatever
occurs between two tags.
Empty elements are treated as a start tag followed by an end tag, without any intervening
We assume a linear order on the morphological constituents. This order is derived directly
from the serialized form of the document.
Consider the following XML sample document, D1:
Figure 1: Sample document 1 (D1)
According to the definitions just given, this document is a linearly ordered set of
token objects, consisting of the following morphological constituents with the properties
indicated in this table, along with position indexes which, for our purpose, serve as token
indexes for the types:
For any document D, let S be the set of all start tag tokens and E the set of all end tag
tokens in the document. Take the Cartesian product (all possible combinations) of S and E, and
subtract all pairs for which the position of the end tag does not succeed the start tag. This
object then represents all the possibilities for a start tag to match up with an end tag. Add
the simples, and name the result L(D).
L(D1) now consists of the following nodes:
The reader may be puzzled by the fact that we include non-matching pairs of tags, i.e.
nodes with start and end tags with different generic identifiers (GIs), among the nodes of the
lattice. This is simply because the point of the exercise at this stage is to see how far we
are able to build document models without taking the generic identifiers of tags into
We define a hierarchical order relation between nodes in terms of
containment as follows: A node x is larger than a node y if and only if
the start position of x is smaller or equal to that of y, and the end position of x is greater
or equal to that of y (viewed as strings the larger, or higher, node contains the other). This relation is asymmetric and transitive, so it is a partial order.
With this order in place, L(D1) as given in the table above is a semi-lattice. It
has a top node, namely the pair consisting of the first start tag and last end tag
<a>₁</a>₉. It also has a set of minimal nodes, the
simples. Each pair of tag tokens is a node in the semilattice. Lattice structures are
displayed using Hasse-diagrams (familiar from tree structures).
Nodes are positioned vertically according to their position in the hierarchy, with
the more inclusive nodes at the top and the less inclusive further down.
Nodes at the same vertical level are positioned from left to right according to
their linear order in the document.
The lattice structure for L(D1):
As mentioned, the hierarchical order is transitive. According to the conventions of
diagrams like these, however, lines are drawn only between nodes immediately ordered.
Before we proceed, we need to define a relation we call relatives: Two nodes are relatives if they share a token start or end tag. This is
a reflexive, symmetric and non-transitive relation.
We also need to define a concept which has already been introduced implicitly, the concept
of a minimal node: A node is minimal if it is lowest in the
hierarchy, i.e. if it contains no other node.
Building document models from lattices
Modeling documents without overlap
In what follows, we will show how what we may call a "standard" document model can be
built on the basis of the lattice. In building this model, we will rely on distinguishing
between nodes only in terms of the morphology introduced earlier, i.e. on the basis of which
start tags, end tags, or simple tokens they contain. In particular, we do not take the GIs of tags into consideration in the building of this
A document model for D1 from the lattice L(D1) is built through a
number of steps by selecting subsets from L(D1). Hence, the model building process
can be viewed as a filtering process. We start from the bottom of the lattice and identify
all the lowest-level minimal nodes. We copy these minimal nodes, i.e. the nodes labeled
X₂, Y₄ and Z₇, from L and transfer them to a subset of L(D1)
which we call S0.
We then delete these minimal nodes from L(D1
and call the result L0
We repeat the process of selecting minimal nodes, now from L0,
and add them to S0, naming the result S1. In
this case the minimal nodes are those labeled <b>₃</b>₅ and
<c>₆</c>₈. S1 inherits the order from L, and looks like
Next, we delete these minimal nodes, as well as all their
, from L0
, and call the result
Removal of the relatives assures that a tag can only enter into a relationship once. As
soon as a node is selected for the model, neither the start nor the end tag of that node can
enter into another node of the model. The relatives of <b>₃</b>₅
in L(D1) are the following nodes
relatives b/b: Relatives of <b>₃</b>₅
L1 looks like this:
As L1 only contains the single node
<a>₁</a>₉ a repetition of the step above, which copies this
minimal node to S1 resulting in S2, will
terminate the process. There are now no more nodes to select, and the process stops with the following
candidate model M of D1:
Lattice D2: subset S2=M(D1)
What we find in S2 is, mutatis mutandis, identical to the
"standard" XML document tree for the document we started with:
Lattice D3: Standard document model of D1
While GIs have been left out of consideration during the building, the candidate model
will also be subjected to a well-formedness check: For each node in the model, the
GI of the start tag must be identical to the GI of the end tag.. For this
model, this is the case.
We conclude that with the method described here, lattices can be used to generate
"standard" models of XML documents in the form of conventional XML document trees. Recall
that we may also refer to nodes that are members of the model as elements. The claim that these constitue trees will be
substantiated below, where we will show that the model building procedure indeed
guarantees production of tree structures only.
Modeling documents with overlap
In this section, we will show that lattices can also be used in order to generate models
of O-XML documents in the form of GODDAGs [Sperberg-McQueen and Huitfeldt 2000]. Consider the following O-XML sample document, D2:
Figure 10: Sample document 2 (D2)
The lattice L(D2) for this document is as follows (while we have used capital M
to designate document models for XML documents, we will use capital O to designate document
models for O-XML models):
An application of the method described in the previous section to this lattice, yields
the following candidate model:
Lattice P: Candidate model of D2, M(D2)
It easy enough to see that something must have gone wrong here, as there is disagreement
between the GIs of the start and end tags for the nodes <b>₂</c>₈ and
<c>₄</b>₆. Now D2 is not a well-formed XML document, and we have used the
same procedure for generating M(D2) from D2 (which is a well-formed O-XML document
with overlap, but not a well-formed XML document) as we did for building M(D1) from
D1 (which is a well-formed XML document). The method will build document models for
well-formed as well as ill-formed documents. However, the ill-formedness of D2 will be
captured by the well-formedness condition introduced above.
In constructing O(D2), it suffices to introduce one single modification to the
procedure described in the previous section: Before performing any other operations on
L(D2), we remove all nodes that have start and end tags with different GIs. Once
this rule is introduced, the application of the procedure produces the following candidate
What we find here is, mutatis mutandis, identical to the GODDAG document model of the
Lattice R: GODDAG model of D2
We conclude that with the method described here, lattices can also be used for
generating document models of O-XML documents with overlap in the form of GODDAGs. Since the
extra rule that we introduced in this section does not affect the result for XML documents,
one and the same method can be used for generating XML trees from XML documents and GODDAGs
from O-XML documents. In other words, if D is a well-formed XML document M(D)=O(D). However,
O(D) does not have any built-in well-formedness check.
In the previous section we noticed, in passing, that of the proposed method for building
models from XML and O-XML documents, only M-models carry a well-formedness constraint. By
introducing the rule that all nodes with different GIs on start and end tags should be
deleted from the lattice, we effectively made sure that O(D) fulfills one well-formedness
constraint on documents. The difference between the M- and O-constructions is subtle: The
M-construction will always produce trees, while the O-construction will not necessarily
produce trees. That issue is discussed in a subsequent section. The question to be asked
here is: Can document lattices be used to capture the full set of well-formedness
The model building in itself may work even if document D starts with and end tag or ends
with a start tag. The process will just ignore them. If they are taken into account as e.g.
simples, the requirement that L(D) is a lattice will rule them out, since they will be
ordered on the same level as the widest start tag/end tag pair.
Another source of ill-formedness stems from unbalanced parenthesis. For M(D) this is
taken care of by requiring that the top node of L(D) is also the top node of M(D). The
reason for this is that the top node of L(D) will be the widest start/end tag pair. If that
pair is not a node it will be because all the tags are used up, so to speak, before it is
reached, as the model building moves inside out. Consider this ill-formed document
The top node for L(D) is
but this node will be removed from M(D) as a relative of the minimal
For O(D) this doesn't work since the removal process looks at tag GI, so if the document
it is <b>₂</a>₃ which is left out. So we cannot
pin the requirement on the top node. However, we take it that for any document, including O-XML,
that each tag should find a partner. Clearly <b>₂ in this document does
not have one.
We conjecture that, in addition to the already introduced rule of deleting nodes with
non-matching GIs on start and end tags, the following rule should suffice to capture all
remaining well-formedness constraints on O-XML:
For any x.start or x.end in L(D) there must be an element E in M(D) or O(D) such that
E.start=x.start or E.end=x.end.
This should make sure that every tag in L(D) finds a match in M(D) or O(D). Moreover, ff
M(D) is constructed without first removing non macthin GIs, the GI agreement constraint
mentioned above will rule out any overlap.
If this conjecture is correct, it means that checking for well-formedness as well as
asignment of document structure can be accomplished by using document lattices, and without
the use of rewrite grammars.
We believe that the proposed use of document lattices for building document models from
marked up documents provides a unified account of the assignment of structure to documents
with and without overlap. If our conjecture about well-formedness is correct, this account
is also able to capture all well-formedness constraints on both types of markup.
The proposed method exploits the observations made about possible alternative
conventions for the assignment of structure to simple parenthetical expressions made
earlier. In building models for both kinds of documents, XML and O-XML, we rely on the
"nesting" convention. The difference is that for XML documents, information about the GIs of
tags may be ignored, whereas for O-XML documents this information is essential.
In introducing O-XML, we said that the only difference between XML and O-XML is that
O-XML allows overlapping elements, whereas XML does not. What about so-called self-overlap,
discontinuous elements and virtual elements, mechanisms which have been proposed in for
example TexMECS, LMNL and other alternative markup languages? We believe that the handling
of self-overlap with document lattices is just a question of adjusting the basic morphology.
Whether, or to what extent, document lattices can also be used for markup languages with
discontinuous or virtual elements we simply do not know.
Meet and join
The meet and join operations are a central part of lattice theory. In our context, they
may serve to characterize the difference between the models for XML and O-XML, and their
relationship to the lattice L.
The meet operation selects the largest node below its operands, while join selects the
smallest above its operands. A requirement for lattices is that any two nodes have a meet
and a join.
Since we are dealing not primarily with lattices in the strict sense, but mostly with
semi-lattices, we need to lighten this requirement for the meet operation: two nodes are
allowed not to have any meet if they do not have any common nodes below them at all.
We indicate the vertical ordering of nodes in a lattice by using the symbols ⊒
and ⊑, so that the opening is faced towards the larger node. In other words,
x⊒y means that x is higher in the lattice than y, while x⊑y means that y
is higher than x.
We can now define the meet operation, indicated by the operator ⊓, and the
join operation, indicated by the operator ⊔, as follows:
x⊓y=z iff, for all u, whenever u⊑x and u⊑y then
x⊔y=z iff, for all u, whenever u⊒x and u⊒y then
The meet and join operations can also be characterized in terms of tag configurations.
Let the start and end tag for a node x be written as x.start and x.end. The meet operation
can be captured by saying that whenever z=x⊓y then z.start=max(x.start, y.start)
and z.end=min(x.end,y.end). Similarly, the join operation can be captured by saying that
whenever z=x⊔y then z.start=min(x.start,y.start) and z.end=max(x.end, y.end).
Are M(D) and O(D) lattices?
The first step of the method we have proposed consists in constructing, on the basis of
a marked up document D, a lattice L(D). The construction method ensures that L(D) always is
a lattice, i.e. that it has a unique largest node, that every pair of nodes has a join, and
that every pair of non-minimal nodes has a meet. The document models, M(D) for XML documents
and O(D) for O-XML documents, however, are derived from L(D) through a filtering process.
Therefore, it is appropriate to ask whether M(D) and O(D) are also lattices, or not.
In this section we will argue that M(D) is not only guaranteed to be a lattice, it is
also guaranteed to be a tree. O(D), however, is obviously not guaranteed to be a tree, but
it is also not guaranteed to be a lattice.
We have earlier observed the similarity between the document model M(D) and XML trees.
The following argument will show that this is necessarily the case for any model M(D). In
XML, whenever both x.start<y.start and y.start<x.end is true, it is safe to conclude
that y.end<x.end (in other words, there is no overlap). This is also true for the
relationships in M(D).
To see why, consider x and y, both elements of a model M(D), with the assumption that
x.start<y.start and y.start<x.end. We want to show that it will always be the case
that y.end<x.end under these circumstances. So suppose for contradiction that this is not
so, and that x.end<y.end. Then there exists a node z=(y.start,x.end) in L(D), since
y.start<x.end (recall that L(D) contains all pairs of start tags followed by end tags).
It is also clear that z is not a member of M(D). Further, from z's definition, we note that
z is smaller than both x and y, and also a relative of both. This is enough to contradict
the selection assumption for M(D): if z is smaller than x, then, since z=(y.start, x.end), z
cannot have been removed in the model building process as a candidate for x, because any
node is below all the relatives it removes, and by assumption z<x . So z must then have
been removed as candidate for M(D) considered as a relative of y. But this contradicts the
assumption that z<y, since y<z for z to be removed. We can therefore conclude that
assuming that x.end<y.end is not tenable, and therefore that y.end<x.end. This means
that if ⊓ is confined to elements of M(D), u⊓v is either one of u or v (if it
has a value at all); if u strictly comes after v, they would not share any nodes. A
consequence of this is that any element x must have a unique parent (if it has one at all):
if both u and v where distinct parents in M(D) (i.e. smallest nodes above x), x⊑u and
x⊑v, which would mean that x⊑u⊓v=u (or v). We can therefore conclude that
M(D) is a tree.
The above argument does not hold for O(D) which models overlap structures. The example
GODDAG in Lattice R shows that the structure is not a tree, as the simple Y has
two parents. Also, the join may be undefined when restricted to O(D) (that is, redefined
within O(D) and not computed within L(D)). In O(D) two elemnts do not necessarily have a
unique smallest dominating element, even if there are elements above them. For example, two
elements x and y may be dominated by elementss a and b, and neither of these dominate the
other if they overlap. In this example it is therefore impossible to select a smallest
dominating element to serve as the join of x and y.
In this section and the ones that follow, we will look at the differences between O-
and M-models in terms of meet and join. This also brings us to regard M(D) and O(D) as
subsets of L(D), and consider ways of adding nodes to them in order to make them satisfy
When M(D) and O(D) are seen as subsets of L(D), with meet and join computed in L(D),
there are a number of differences between them that were not captured above, when we limited
ourselves to look at their structures only. Our first question is whether the meet and join
operations are closed in M(D) and O(D). (An operation is closed in a subset A if the result
of the operation stays in A when operating on elements of A.)
One defining difference between O(D) and M(D) in terms of closure is that while the meet
operation is closed in M(D), it is not in O(D). As we saw above, given two elements x and y
in M(D), their meet is either x or y, which is still in M(D). If u and v in O(D) are
distinct and overlapping, which means that u.start<v.start while u.end<v.end, their
meet is w=(v.start,u.end), which is not an element of O(D). (It is a node in L(D), but since
it is a relative of both u and v, it is not a member of O(D)).
The join operation will not always be closed, neither in M(D) nor in O(D). If x precedes
y with both tags of x before y.start, their join is (x.start,y.end). This node cannot be any
element neither of M(D) nor of O(D), since it is a relative of both x and y, but not
identical to any of them.
An observation concerning M(D) is that there are nested documents that are closed under
both join and meet. Documents where no node precedes another are of that type, like for
<r><a>X</a><r>. In a document like this, where ⊑ is a
total order, it is the case that if x⊔y=x then x⊓y=y and vice versa. A
document with overlap will in general leak into L(D) for both meet and join.
Consider D2 from the previous section. The lattice L(D2) may be used to
represent M(D2) as well as O(D2). D2 is an O-XML document with overlap, so
M(D2) will be ill-formed, but that is not of relevance at this point; M(D2) will
simply exhibit the nested view of the document. In the figure below, showing the lattice
L(2), the elements of M(D2) are indicated with a ✓, while the elements of
O(D2) are indicated with a ✗; the simples are members of both models.
Lattice OMD2: L(D2) with subsets ✓M(D2) and ✗O(D2)
We write M(D)* for the closure of M(D). It is obtained by collecting all nodes
x⊓y and x⊔y in L(D) where x, y are elements of M(D), with recombinations.
Similarly, we write O(D)* for the closure of O(D). The new nodes that are added in this way
are called semi-elements, distinct from the elements of M(D) or O(D).
Here is what the O-XML document D2 with overlap looks like with semi-elements
added. The semi-elements are indicated with a check mark, ✓. Note that the
semi-elements are precisely the candidate elements for M(D2), an observation we return
Lattice oOMD2: O(D2)* with semi-elements check marked.
The semi-elements are considered nameless, they may be annotated with an algebraic
formula having that node as value.
The reader may check that the above diagram represents the closure of O(D2). Here
are all possible combinations alongside their value, not including combinations involving
elements x and y such that x≤y.
All possible meet and join combinations for O(D2)
|X₃⊔Y₅ = b = <b>₂</b>₆
|X₃⊔Z₇ = b⊔c = <b>₂</c>₈
|Z₇⊔Y₅ = c = <c>₄</c>₈
|b⊓c = <c>₄</b>₆
It was noted above that M(D2) was a subset of O(D2)*. We want to demonstrate
that it holds for any document D that M(D) is a subset of O(D)*, even though M(D) and O(D)
need only have the top element in common. The converse does not hold. A consequence of this
is that O-XML documents with overlap cannot be reached from algebraic manipulations on M(D),
but that the corresponding XML document without overlap can be reached from O(D)*.
For this to make sense, D must be a well-formed overlapping O-XML document. If D has no
overlap, it will also be a well-formed XML document for which we have already established
that O(D)=M(D). Hence, in that case M(D) will automatically be a subset of O(D)* (=M(D)*).
Further, we take it without argument that if O(D) contains an element x that does not
overlap with any other element y in O(D), x will also be a member of M(D). It doesn't matter
if x contains, or is contained, in an overlap configuration as long x does not itself
overlap with another element. To illustrate, the a and b elements of the following document
D4 will be elements of both M(D4) and O(D4), even though c and d overlaps.
The latter two will obviously not be members of M(D4):
addition to the simples, the elements of M(D4
) will be these (token indexes are skipped
since the GIs serve to uniquely identify token tags), where the linear order of the elements
also reflects the hierarchical order:
(<a>,</a>) (<d>,</c>) (<c>,</d>) (<b>,</b>)
So the odd elements here, which are not elements of O(D4
), are these two
It can be seen that these
two nodes are semi-elements of O(D4
)* given the equations
The above equations hold also in the general case. Consider an O-XML document D, and an
element m in M(D), so m may possibly not have matching tags. Then it will be the case that either m
is in O(D), or else there will be elements in O(D) so that m is their join, their meet or a
combination of join and meet. To see this, note that m will either be well-formed according
to XML (checked through the GI-identity constraint) or it will not, but still well-formed
according to O-XML. So it is like either
(<d>,</c>) of M(D4). If m is a member of O(D), it will also be in
O(D)*. So consider the case where m is not an element in O(D). Then there must be distinct
elements a and b in O(D) so that a.start=m.start and b.end=m.end, this because of the
constraint which says all tag tokens must be part of an element. Now there are three cases
a.start<b.start and a.end<b.end (this subsumes overlap and
b.start<a.start and b.end<a.end (overlap only)
a.start<b.start and b.end<a.end (nesting)
In the first case m=a⊔b, while in the second, m=a⊓b, parallel to the case
for D4.The third case, the nesting case is a bit trickier. The tag configuration for a and
b is now
), another way of stating that m=(a.start, b.end). Recall
that the M-models are constructed without looking at GIs, so for m to be an element of M(D),
all the tags between a.start(=m.start) and b.end(=m.end) must match up. In particular there
must be an equal number of start tags and end tags between a.start and b.end. Since we
already have b.start in there, we know there must be at least one extra end tag between
a.start and b.end. That end tag must be matched in O(D) by a start tag outside of the
(a.start, b.end) configuration. Which means that there must be an element c of O(D) such
that c.end is between a.start and b.end and such that c.start is outside of the pair, that
is c.start<a.start. Now m can be written on the form m=a⊓c⊔b, and so is a
semi-element of O(D)*.
This result, that M(D) is a subset of O(D)*, but not vice versa, pinpoints an asymmetry
between overlapping and nesting documents. Nested documents can be recovered, so to speak,
from overlapping documents, but there are overlapping documents that cannot be described
algebraicly in terms of nested documents.
Closure models may be help in representing certain cases so-called spurious overlap as discussed in [Huitfeldt and Sperberg-McQueen 2003]. Roughly, spurious overlap occurs
whenever two elements a and b overlap, and there is no PCDATA either between the two start
tags, or between the two end tags, or between the start tag of a and the end tag of b. The
following O-XML document D3 may illustrate.
Its document model is this
Except for token indexes, it is identical to the document model for the following
The closures of the two document models are slightly different, though. The
document with spurious overlap has this closure:
while the document without overlap has this closure:
This may serve to illustrate the close relationship between the lattice model and the
serial form of the marked up document. By the use of closure models with semi-elements,
documents with spurious overlap can be distinguished from other documents which share the
same document model.
We have described a method for building lattices from marked up documents with and
without overlap, and for generating, from these lattices, document models in the form of trees
for XML documents, and in the form of GODDAGs for documents with overlap. We have shown that
one and the same method can be used for generating both kinds of models, and we have given
reasons to believe that lattices can also be used to implement well-formedness constraints for
both kinds of documents. In this sense, this is also a step away from relying on context free
grammars and the like in analyzing documents. As presented here, model building is kept
separate from grammar, but we leave open the question of how this bears on document
We have discussed and compared some of the algebraic features of the document models and
the relations between them, and pointed out some interesting results. For example, that the
algebraic operations provide a link between nested documents and overlapping documents. The
former can be computed by algebraic means from the latter, but not the other way around. We
have also disussed how the algebraically closed models can be used to distinguish models of
spurious overlap from nested models, via the introduction of semi-elements.
Therefore, although many of the details in the work presented here probably are in need
of correction and revision, we believe that the application of lattices to marked up documents
may provide a unified account of markup languages with and without overlap.
In the section called "Bracketed notations and matching" we defined the grammar P. We will
formally prove that any two parentheses (a left then a right) can be added to a well-formed
parenthetical expression according to this grammar, while preserving well-formedness. We start
with the following lemma
Lemma on string splitting
Let the string
S over left and right parentheses be derived according to P,
and decomposed into
can be reduced, by removing non-terminal Ps
from each of them into
Z', such that
Z'=)q, so that
S' is well-formed.
X is the leftmost string of
S, any right
X must occur in the pattern
such a pattern can safely be removed without destroying the well-formedness of the expression.
X will in the end be reduced to
X' as a consecutive string of
left parentheses. A parallel argument goes for
Z and its reduction to
Z' as a string of right parentheses. For
Y the story is a bit
different, since a left or right parenthesis may be generated with a corresponding parenthesis
Z. However, we can safely conclude that if all
Ps are removed from Y, then there can be no string of the form
Y, since either
) is derived with a corresponding
( is derived with a corresponding
W. Thus, Y' must be a string of
)'s followed by
Since removing a
P from a well-formed string won't affect its well-formedness,
S'=X'Y'Z' is well-formed.
This lemma will be used to show that any pair of left and right parentheses can be
inserted into a well-formed string without affecting its well-formedness.
Well-formedness of additional parentheses
Let the string S be a well-formed parenthetical expression according to P. Inserting a
left parenthesis anywhere in
S followed by a right parenthesis anywhere after the
left, will result in a well-formed string according to
Proof: Split the string
S into three substrings
XYZ at the
positions where ( and ) are to be inserted, so that The new string is
Now according to lemma, X Y and Z can be reduced and re-concatenated into the well-formed
Inserting the new left-( and right-) into their respective positions, results in
Now, all that is left to do is just to shift the superscripts to the right and
left; this expression can be written as
But this is just the expression
(X'Y'Z'), and since the
can be reduced to a sequence of
Ps, the resulting
expression is also well-formed, being then on the form
This proposition guarantees that the intended matching of the parentheses is of no
concern to the well-formedness of the expression.
[Abney 1997] Abney, Steven,
Partial Parsing via
Finite-State Cascades. Journal of Natural Language Engineering
[Barnard et al. 1988] Barnard, David, Ron
Hayter, Maria Karababa, George Logan, and John McFadden,
SGML-based markup for literary
texts: Two problems and some solutions, Computers and the
Humanities, 22: 265-276. http://www.springerlink.com/content/r1p6t63627663436/
[Barnard et al. 1995] Barnard, David; Burnard,
Lou; Gaspart, Jean-Pierre; Price, Lynne A.; Sperberg-McQueen, C. M.; Varile, Giovanni
Hierarchical Encoding of Text: Technical Problems and SGML Solutions,
Computers and the Humanities, The Text Encoding Initiative:
Background and Contents, Guest Editors Nancy Ide and Jean Vèronis, 29/3, 211-231. http://www.springerlink.com/content/p7775247276v88h3/
[Bauman 2005] Bauman, Syd,
around, Proceedings of Extreme Markup Languages®.
[Carletta et al. 2005] Carletta, J.; S. Evert;
U. Heid; and J. Kilgour,
The NITE XML Toolkit: data model and query, Language Resources and Evaluation, 39.4: 313-334. http://www.springerlink.com/content/j37h2p15u682075g/
[Chatti et al. 2007] Chatti, Noureddine; Suha
Kaouk; Sylvie Calabretto; and Jean Marie Pinon,
MultiX: an XML-based formalism to
encode multi-structured documents, Proceedings of Extreme
Markup Languages. http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html
[DeRose 2004] DeRose, Steven J.,
overlap: A review and a horse, Proceedings of Extreme Markup
[Di Iorio et al. 2009] Di Iorio, Angelo; Silvio
Peroni; and Fabio Vitali,
Towards markup support for full GODDAGs and beyond: the
EARMARK approach, Balisage Series on Markup
Technologies, vol. 3. http://www.balisage.net/Proceedings/vol3/html/Peroni01/BalisageVol3-Peroni01.html. doi:10.4242/BalisageVol3.Peroni01
[Goldfarb 1990] Charles F. Goldfarb, The SGML Handbook, Clandon Press, Oxford.
[Grätser 1971] Gråtser, George, Lattice Theory, first concepts and distributive lattices San
Fransisco, Calif., 1971. xv+212 pp. Softcover edition, Dover Publications, 2009. http://server.maths.umanitoba.ca/homepages/gratzer.html
[Hilbert et al. 2005] Hilbert, Mirco; Oliver
Schonefeld; and Andreas Witt,
Making CONCUR work, Proceedings of Extreme Markup Languages®. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml
[Huitfeldt 1998] Huitfeldt C,
MECS - A
Multi-Element Code System, Working Papers from the
Wittgenstein Archives at the University of Bergen, No 3, Version October 1998.
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt, Claus; and C. M.
TexMECS: An experimental markup meta-language for complex
documents, Working paper of the project Markup Languages for
Complex Documents (MLCD), University of Bergen. http://decentius.aksis.uib.no/mlcd/2003/Papers/texmecs.html
[Jagadish et al. 2004] Jagadish, H.V.; Laks V.
S. Lakshmanan; Monica Scannapieco; Divesh Srivastava; and Nuwee Wiwatwattana,
XML: one hierarchy isn't enough, Proceedings of the 2004 ACM
SIGMOD international conference on Management of data, Paris, France: 251-262.
[Marcoux2008] Marcoux, Yves.
Graph characterization of overlap-only
TexMECS and other overlapping markup formalisms. Presented at Balisage: The Markup
Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings
of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies,
vol. 1 (2008). doi:10.4242/BalisageVol1.Marcoux01. http://balisage.net/Proceedings/vol1/html/Marcoux01/BalisageVol1-Marcoux01.html
[Marinelli et al. 2008] Marinelli, P.,
Vitali, F., Zacchiroli, S.,
Towards the unification of formats for overlapping
markup, The New Review of Hypermedia and
[Nicol 2002a] Nicol, Gavin,
algebra: Toward a formal theory of markup, Extreme Markup
[Nicol 2002b] Gavin Nicol,
algebra: Extending core range algebra to arbitrary structures, . http://www.mind-to-mind.com/library/papers/ara/attributed-range-algebra-07-2002.pdf
[Raymond et al. 1992] Raymond, Darrell R;
Tompa, Frank William, and Wood, Derick:
Markup Reconsidered. University of
Western Ontario. www.cs.uwaterloo.ca/~fwtompa/.papers/markup.ps
[Raymond et al. 1995] Raymond, Darrell R;
Tompa, Frank William, and Wood, Derick:
From Data Representation to Data Model:
Meta-Semantic Issues in the Evlution of SGML.
[Schonefeld and Witt 2006] Schonefeld,
Oliver, and Andreas Witt,
Towards validation of concurrent markup, Proceedings of Extreme Markup Languages®, . http://conferences.idealliance.org/extreme/html/2006/Schonefeld01/EML2006Schonefeld01.html
[Schonefeld 2007] Schonefeld, Oliver Georg Rehm,
Andreas Witt, Lothar Lemnitzer (eds.),
XCONCUR and XCONCUR-CL: A constraint-based
approach for the validation of concurrent markup, Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for
linguistic resources and applications: Proceedings of the Biennial GLDV Conference
2007, Tübingen: Gunter Narr Verlag. Pp. 347-356.
[Schmidt and Colomb 2009] Schmidt, D.,
A data structure for representing multi-version texts online,
International Journal of Human-Computer Studies, . http://portal.acm.org/citation.cfm?id=1523966
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C. M., and Claus
Huitfeldt Peter R. King and Ethan V. Munson (eds.),
GODDAG: A Data Structure for
Overlapping Hierarchies, Digital documents: systems and
principles. Lecture Notes in Computer Science 2023, Berlin: Springer, 2004, pp.
139-160. Paper given at Digital Documents: Systems and Principles. 8th International
Conference on Digital Documents and Electronic Publishing, DDEP 2000, 5th International
Workshop on the Principles of Digital Document Processing, PODDP 2000, Munich, Germany,
September 13-15, 2000.. http://www.w3.org/People/cmsmcq/2000/poddp2000.html
[Stührenberg and Jettka 2009] Stührenberg, Maik; and Daniel Jettka,
A toolkit for multi-dimensional markup: The
development of SGF to XStandoff, Proceedings of Balisage: The
Markup Conference 2009. Balisage Series on Markup Technologies, Vol. 3. http://www.balisage.net/Proceedings/vol3/html/Stuhrenberg01/BalisageVol3-Stuhrenberg01.html. doi:10.4242/BalisageVol3.Stuhrenberg01
[TEI P5] Lou Burnard and Syd Bauman (eds.),
Non-hierarchical Structures, Guidelines for the
Encoding and Interchange of Machine-Readable Texts (TEI P5), The TEI Consortium.
[Tennison and Piez 2002] Tennison, J. and W.
The Layered Markup and Annotation Language (LMNL), Proceedings of Extreme Markup Languages®. http://www.idealliance.org/papers/extreme/proceedings/author-pkg/2002/Tennison02/EML2002Tennison02.zip
[Witt 2005] Andreas Witt Stefanie Dipper, Michael
Götze, and Manfred Stede (eds.),
Multiple Hierarchies: New Aspects of an Old
Solution, Heterogeneity in Focus: Creating and and Using
Linguistic Databases, volume 2 of Interdisciplinary Studies on Information
Structure (ISIS), Working Papers of the SFB 632. University of Potsdam, Germany. (Corrected
reprint of an Extreme Markup 2004 paper). http://en.scientificcommons.org/42597903