The MLCD Overlap Corpus (MOC)
Balisage: The Markup Conference 2012
August 7 - 10, 2012
For some time now, people interested in descriptive markup have been considering the problem
of how best to handle overlapping structures in electronic representations of documents. There
have been proposals for handling such overlap in SGML using
CONCUR, for handling it
in SGML or XML using application-level semantics (milestone elements, Trojan Horse markup,
fragmentation and recombination using virtual elements of various kinds, standoff markup), for
CONCUR in the XML context, and for a variety of non-XML approaches
(colored XML, LMNL, Just-in-Time trees, TexMECS, GODDAG structures, EARMARK). The literature on
the subject is still manageable, but it has grown to the point where it is hard to keep track
even of the number of reviews of the literature.
The proliferation of proposals has led to some secondary phenomena which seem to be problems
in their own right. Because there are so many proposals for dealing with overlap, it can be
difficult to keep track of them all. Because so many of the papers describing them use only a
few terse examples, it can be challenging to understand just how a proposal works in practice,
and unclear just how any given proposal resembles or differs from other proposals made
elsewhere. Most important of all, it is currently difficult to compare different techniques for
dealing with overlap with each other and reach well founded conclusions as to their relative
merits.
The MLCD Overlap Corpus (MOC) is a first step toward improving this situation. This paper
describes the current state of MOC and future plans for the project.
The main immediate goal of the MOC project is to build a corpus of well understood and
well documented examples of overlap, discontinuity, alternate ordering, and related phenomena
in various notations, for use in the investigation of methods of recording such phenomena.
Where possible, we would like to allow, indeed to encourage, participation of and
contributions from a wider community in building the corpus. When the corpus has reached a
suitable size and degree of completeness, we would also like to make it available for research
and to encourage its use.
To address the concerns which led to the project, the MOC corpus should satisfy a number
of requirements:
It should provide illustrative examples to make it easier to understand the various
proposals.
It should provide readily available documentation of overlap proposals (with
pointers to the original papers).
Its samples should cover as wide a range of problems as is feasible, in the
interests of seeing whether different proposals for overlap work better on different
kinds of problems.
The corpus may provide, or should at least support work toward, some kind of
systematic categorization or typology of overlap phenomena.
The samples in the corpus should be able to serve as a kind of testbed for the
development of tools, including editors, translators, query languages, and so on.
It should be able to serve as a testbed for head-to-head comparison of overlap
solutions, by making it possible to build demonstration applications using the same
documents in different encodings and compare the volume and complexity of the code
needed to support the different encodings.
It is currently difficult to compare different techniques for dealing with overlap with
each other and to apply concrete metrics to them. We believe that MOC may provide a testbed
for application and tool development, and an empirical basis for answering questions such as:
How successful is a given syntactic proposal in capturing relevant information about
a given document with overlapping structures?
How verbose or succinct is the proposal's markup for the document?
How complex is each proposal's markup (assuming it is possible
to specify some quantitative measure of markup complexity)?
How complex is the task of parsing a given syntax and mapping it to a given data
structure?
How successful is a given proposed data structure in capturing the relevant
information about overlapping structures in a document?
Given a representation of a particular document in a given data structure, how
complex is the task of operating on that data structure in support of a given
application using the document?
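As one illustration of how the verbosity question might be made measurable, the sketch below compares the number of markup characters with the number of text characters in an encoded sample. It is not a metric defined by MOC: the function name `markup_ratio` and the regex-based notion of "markup" (anything between angle brackets) are assumptions made for this example, and they apply only to XML-like notations.

```python
import re

# Assumption: "markup" is anything between '<' and '>' in an XML-like
# notation. Other notations (LMNL, TexMECS, ...) would need their own tokenizer.
TAG = re.compile(r"<[^>]*>")

def markup_ratio(encoded: str) -> float:
    """Return (markup characters) / (text characters) for one encoded sample."""
    markup_chars = sum(len(m.group(0)) for m in TAG.finditer(encoded))
    text_chars = len(TAG.sub("", encoded))
    if text_chars == 0:
        raise ValueError("sample contains no character data")
    return markup_chars / text_chars

# The same ten characters of text, once as a single element and once fragmented.
plain = "<s>Hey, Paul!</s>"
fragmented = '<s id="a" next="b">Hey, </s><s id="b" prev="a">Paul!</s>'

print(markup_ratio(plain), markup_ratio(fragmented))
```

A ratio of this kind says nothing about how much information the markup carries, so it would at best be one input among several when comparing encodings.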
Content of the corpus
In its initial form, MOC will comprise three sets of samples:
toy samples, typically just a few lines in length.
Most toy samples are drawn from the literature on overlap. These samples usually
reduce the problem of overlap to very minimal terms, which makes them helpful for
highlighting the essential features of a particular overlap problem and the proposed
solution. By the same token, they elide many of the details that must be handled in
short samples, each typically a few pages long.
These samples are designed to be large enough to illustrate the interaction of
overlapping structures with other problems of text encoding, but short enough to make
it feasible to encode them multiple times by hand.
long samples, each typically a complete document (e.g. a play, long
short story, or novel).
These samples are designed to be large enough to make it feasible to build simple
text applications (e.g. interactive search and retrieval systems or text visualization
systems) using the MOC samples as data, and to illuminate technical issues in the
processing of overlapping structures. By current standards, however, none of the
samples in this class are expected to be big in the sense of
Along with the samples, MOC records information about each sample group, notation,
vocabulary, and idiom used in the corpus, together with bibliographic references. For
notations like XML and vocabularies like TEI, documentation is readily available and MOC
makes no attempt to compete with other sources as regards completeness of its lists of
bibliographic references. But for less commonly known notations, it is hoped that MOC's
collection of information may be helpful to those seeking to learn more.
Each notation, vocabulary, idiom, sample, and sample group in MOC has a distinct URI;
users can dereference the URI to see the information MOC has about the item in question.
Since its purpose is to illuminate problems connected with overlap and with existing
proposals for handling it, MOC does not attempt to make the selection of texts
representative of any particular linguistic or textual population. (MOC is not a
corpus in that sense.) For MOC, the relevant population is not a
particular set of natural-language users, but the set of overlap-related problems
encountered by people who work with natural-language texts for whatever purposes.
Accordingly, MOC takes a resolutely opportunistic approach to samples; we will take
samples anywhere we can find them. This is particularly visible in the toy samples: the
current version of MOC includes among the toy samples many short samples originally
published in papers on overlap that have come to our attention. Opportunistic sampling is
less fruitful when it comes to the short and long samples.
Since one of the purposes of MOC is to support investigations of different kinds of
overlap as well as different ways of encoding overlap, the collections of short and long
samples will, to the extent possible, reflect a variety of overlap phenomena and textual
interests. In the absence of a well grounded categorization of different kinds of overlap,
it's difficult to be certain how many really different kinds of overlap there are, and which
kinds of overlap are structurally and conceptually isomorphic. In the absence of such a well
grounded categorization, we hope to include examples at least of the following kinds of
phenomena:
structural overlap and multiple hierarchies (as in verse drama, or physical and
logical hierarchies [page vs paragraph], or in the analysis of the Peter / Paul
example above into utterances and into syntactic units)
overlapping annotation targets (as in fine-grained commentary on specific
change-history markup showing the revision of a text over time (of practical
import for technical documentation, but also of interest for genetic editions)
overlapping sites of textual variation (as in text-critical editions)
discontinuous and disordered elements (as in cases where one text is quoted and
commented on in another text, for which songs and plays-within-plays in drama provide
examples; a well known example in the overlap literature is the attempt of Hughie,
Louis, and Dewey to remember a haiku)
We also hope to provide examples that illustrate the occurrence of overlap in texts and
applications of interest in different communities:
language corpora (discourse analysis, syntax, prosody, ...)
documentary, historical-critical, genetic, and other scholarly editions
Each sample in the corpus passes through the following states:
candidate: The sample has been collected and may
or may not be included in the corpus proper. (We expect this will apply just to toy
samples, but it may also apply to others.)
projected: We have agreed in principle and in
theory that we want this sample.
planned: We have agreed on the desired properties
of the sample in sufficient detail to allow data capture to proceed:
incomplete: Data capture has begun but has not
yet been completed.
rough: Data capture has been completed, and the
person who did the data capture has done an initial proofreading.
validated (or wf-checked): The sample has been validated against all appropriate
schemas, if there are any, or (if there is no schema) has been checked for
well-formedness by some automatic tool.
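The states listed above can be sketched as a small ordered enumeration. This is only an illustration: the class and function names are hypothetical, and the strictly linear progression is an assumption made for the example rather than a rule stated for MOC.

```python
from enum import IntEnum

class SampleState(IntEnum):
    """Workflow states of a sample, in the order listed in the text."""
    CANDIDATE = 1
    PROJECTED = 2
    PLANNED = 3
    INCOMPLETE = 4
    ROUGH = 5
    VALIDATED = 6  # or "wf-checked" when no schema exists

def advance(state: SampleState) -> SampleState:
    """Move a sample to the next state (assumes a strictly linear workflow)."""
    if state is SampleState.VALIDATED:
        raise ValueError("no further state in this sketch; proofreading follows")
    return SampleState(state + 1)

state = SampleState.CANDIDATE
while state is not SampleState.VALIDATED:
    state = advance(state)
    print(state.name.lower())
```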
At this point the paths divide. Toy samples and short samples undergo repeated
proofreadings (the initial plan is to do three proofreadings for each, but that plan has not
yet been put to the test). Long samples are (we assume) too long for multiple proofreadings
(or possibly even one). Instead, we perform a single proofreading and several spot checks.
Once it reaches the
validated state, each long sample acquires a list of
spot checks to be performed. One by one, not necessarily in any prescribed order, the
prescribed checks are performed. Each check results, possibly, in corrections and
re-validation, and possibly in the addition of new checks to the to-check list. Whenever we
notice something odd or amiss in the document, especially if it could be a systematic
problem, then a new task is added to the to-check list (assuming we can devise a way to
check systematically for the error in question).
It is not yet clear exactly what spot-checks we need to do; we expect them to vary with
the notation, the idioms, the sample, etc. But some examples may make the idea clearer:
When feasible, a spell-checker is used to check the text for typographic
errors.
A selected one-, ten-, or one-hundred-percent sample of markup constructs
(typically occurrences of particular element types or attributes) is checked for
semantic plausibility (their syntactic correctness having already been guaranteed by
validation). For example, we might spot-check one percent, ten percent, or all of the
markup used for page breaks, page numbers, the TEI
part attribute, the
next and prev attributes, the
instances of markup for discontinuous elements, instances of fragmented elements, or
Trojan horses, to make sure they are semantically correct. The specific constructs
that need checking will, of course, typically depend on the idiom used.
Systematic errors found in spot checks are fixed in whatever way we can
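Selecting a one-, ten-, or one-hundred-percent sample of markup constructs for a spot check could be done along the following lines. This is a minimal sketch, not MOC tooling: the function name, the fixed seed, and the rule of always checking at least one occurrence are assumptions made for the example.

```python
import random

def pick_for_spot_check(occurrences, percent, seed=0):
    """Pick a reproducible sample of the given size, kept in document order."""
    if percent not in (1, 10, 100):
        raise ValueError("percent must be 1, 10, or 100")
    n = len(occurrences)
    if n == 0:
        return []
    # Round up, and always check at least one occurrence (an assumption).
    k = max(1, (n * percent + 99) // 100)
    chosen = sorted(random.Random(seed).sample(range(n), k))
    return [occurrences[i] for i in chosen]

# Hypothetical identifiers standing in for 250 page-break elements.
page_breaks = [f"pb-{i}" for i in range(250)]
to_check = pick_for_spot_check(page_breaks, 10)
print(len(to_check))
```

Fixing the seed makes the selection repeatable, so a second person can re-run the same check after corrections and re-validation.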
The ultimate aim of MOC is to provide a fully populated matrix of materials: for each
sample group, one sample in each relevant combination of notation, vocabulary, and idiom.
As a first step towards this larger goal we have built a prototype corpus of
toy samples (MOC-POC), as a proof of concept. MOC-POC currently contains
52 samples distributed over 14 sample groups, 4 notations and 6 idioms.
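The idea of a fully populated matrix can be made concrete by listing the cells that are still empty. The sketch below assumes, for simplicity, that every combination of notation, vocabulary, and idiom is relevant to every sample group; the group names and combinations are illustrative labels, not the actual MOC-POC inventory.

```python
from itertools import product

def missing_cells(groups, combos, samples):
    """Return the (group, combination) pairs for which no sample exists yet."""
    have = set(samples)
    return [(g, c) for g, c in product(groups, combos) if (g, c) not in have]

# Hypothetical sample groups and (notation, vocabulary, idiom) combinations.
groups = ["peter-paul", "haiku"]
combos = [("XML", "TEI", "next-prev"), ("XML", "TEI", "trojan-horse-1")]
samples = [
    ("peter-paul", combos[0]),
    ("haiku", combos[0]),
    ("haiku", combos[1]),
]

for group, combo in missing_cells(groups, combos, samples):
    print(group, combo)
```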
The text fragments comprising the samples of MOC-POC are taken from a selection of
research publications on the overlap problem. This prototype does not claim any kind of
completeness; it has, however, successfully identified a number of weak spots in our initial
Notations currently represented in MOC-POC are:
Most of the samples are encoded using a vocabulary taken from or based on some version of
TEI, but ad hoc vocabularies are also represented.
Samples encoded in XML are encoded using six different TEI idioms to resolve overlap problems:
Fragmentation using the next and prev attributes defined as part of the TEI tag set
for segmentation and alignment.
Fragmentation using the part attribute provided for certain elements in the TEI
vocabulary. (If a TEI-encoded example requires fragmentation of an element for which TEI
provides no part attribute, the attribute is added.)
Fragmentation using the part and id attributes and the join element defined as part
of the TEI tag set for segmentation and alignment.
The "Trojan horse 1" idiom uses milestone tags to resolve overlap. Milestones are
used only when necessary; normal XML elements are used in all other cases. It is left to
the encoder to decide which element to mark with milestones.
"Trojan horse 2", like "Trojan horse 1", uses milestone tags to resolve overlap.
However, the Trojan horse 2 idiom represents every element in the overlap as a pair of
milestones with intervening content.
The "XStandoff" idiom uses XML-conformant markup that points to the character data
("primary data"), which is kept in a separate location.
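All of these idioms exist because two logical elements can overlap without either containing the other. The sketch below states that condition over character offsets into the primary data, in the general spirit of standoff annotation; it is not the XStandoff format. The `Span` type, the function name, and the assignment of the text to two speeches are assumptions made for illustration.

```python
from typing import NamedTuple

class Span(NamedTuple):
    name: str
    start: int  # inclusive character offset into the primary data
    end: int    # exclusive character offset

def properly_overlap(a: Span, b: Span) -> bool:
    """True if the spans intersect but neither contains the other."""
    return (a.start < b.start < a.end < b.end) or (b.start < a.start < b.end < a.end)

text = "Hey, Paul! Would you pass me the hammer?"

# Assumed division of the text into speeches and sentences.
cut = text.index("the hammer?")
speech1 = Span("sp", 0, cut)
speech2 = Span("sp", cut, len(text))
sent1 = Span("s", 0, text.index("Would"))
sent2 = Span("s", text.index("Would"), len(text))

print(properly_overlap(speech1, sent2))  # the second sentence crosses the speech boundary
print(properly_overlap(speech2, sent2))  # containment only, so no proper overlap
```

Only pairs for which this test is true need fragmentation, milestones, or standoff pointers; pairs that nest can be written as ordinary XML elements.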
A few preliminary results of our work on MOC can be mentioned.
Our attempts to explore the solution space of techniques like TEI-style fragmentation led
very quickly to the realization that the TEI's techniques for handling overlapping structures
(here we will use the
next and prev attributes as an example, but
the same observations apply to all the techniques described by the TEI) do not in themselves
fully determine the encoding of a given sample, even when there is no uncertainty about which
textual features are to be encoded. This is not surprising in itself; the TEI almost always
leaves a great deal of leeway to the individual project and its encoding policies. But it does
mean that a full description of how the TEI is used to encode a given sample must go beyond
saying that the
next and prev attributes are used.
When next and prev are used, an overlap of two logical elements
is resolved by breaking one of the logical elements into smaller pieces
(fragmentation) and using next and prev to signal
that each XML element is just a fragment of the original logical element. For example, the
Peter/Paul example given earlier might be encoded this way:
<s id="s1">Hey, Paul!</s>
<s id="s2a" next="s2b">Would you pass me </s>
<stage>Handing him the hammer</stage>
<s id="s2b" prev="s2a">the hammer?</s>
Here sentences are tagged using the
s (sentence-unit) element, and the second
sentence is fragmented to fit within the hierarchy defined by the
sp (speech) elements.
It would be logically possible, however, to break the
speech elements as
needed to fit within the
s elements instead:
pref="sp1a">Would you pass
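To show what an application has to do with the first encoding given above (the one that fragments the s element), the following sketch follows the next links to rebuild each logical sentence. The wrapping `root` element, the function name, and the use of bare IDs as attribute values (as in the sample) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The fragmented sample from the text, wrapped in a hypothetical root element.
SAMPLE = """<root>
<s id="s1">Hey, Paul!</s>
<s id="s2a" next="s2b">Would you pass me </s>
<stage>Handing him the hammer</stage>
<s id="s2b" prev="s2a">the hammer?</s>
</root>"""

def logical_elements(xml_text, tag="s"):
    """Return the text of each logical element, joining next/prev fragments."""
    root = ET.fromstring(xml_text)
    by_id = {e.get("id"): e for e in root.iter(tag) if e.get("id")}
    texts = []
    for first in root.iter(tag):
        if first.get("prev") is not None:
            continue  # a later fragment; it is reached through a next link
        parts, seen, cur = [], set(), first
        while cur is not None and id(cur) not in seen:
            seen.add(id(cur))  # guard against circular next links
            parts.append("".join(cur.itertext()))
            cur = by_id.get(cur.get("next"))
        texts.append("".join(parts))
    return texts

print(logical_elements(SAMPLE))
```

Code of this kind is part of the cost of an idiom: the volume and complexity of what is needed to recover the logical elements is one of the things a head-to-head comparison could measure.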
In order to be usable, an encoding using next and prev to
resolve overlap problems will need to be consistent in choosing which logical elements to
fragment and which to leave intact. In some cases, it will suffice to say, for each element
type in the vocabulary, whether or not it is to be fragmented in case of need. Those elements
which are never to be fragmented or modified are referred to, jokingly, as
sacred; the others in contrast as
profane. But a binary
classification of element types as either sacred or profane suffices only when every pair of
overlapping elements has one sacred and one profane member: it does not provide adequate
guidance when both elements in the pair are sacred, or both profane. In more complex cases,
therefore, it may be desirable to formulate a scale of values assigning
each element type a degree of sacredness or profanity, and to ensure that no two element types
which overlap each other have the same value. Then the rule can be formulated: for any pair of
overlapping logical elements, represent the more sacred logical element as a single XML
element, and fragment the less sacred element in order to make the XML elements nest.
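The rule just formulated can be sketched in a few lines. The numeric scale below is hypothetical (higher means more sacred) and is not taken from any MOC encoding policy; the function simply refuses to decide when two element types share a value, which is the case the rule is meant to exclude.

```python
def element_to_fragment(a: str, b: str, sacredness: dict) -> str:
    """Given two overlapping element types, return the one to fragment."""
    rank_a, rank_b = sacredness[a], sacredness[b]
    if rank_a == rank_b:
        raise ValueError(f"{a!r} and {b!r} have the same value; the scale must separate them")
    # The less sacred element is fragmented; the more sacred one stays whole.
    return a if rank_a < rank_b else b

# Hypothetical scale of values for three element types.
scale = {"sp": 3, "s": 2, "stage": 1}

print(element_to_fragment("sp", "s", scale))
```

Under this scale the s element is fragmented, which corresponds to the first of the two encodings of the Peter/Paul example discussed above.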
The sacred / profane distinction has been picked up (and stretched into a slightly
different shape) by [Marinelli / Vitali / Zacchiroli 2008].
Conclusion and future work
The MOC project has been presented to markup-related communities on three occasions: a
poster session at Digital Humanities 2010 in London, a nocturne at Balisage 2010, and a
talk at the TEI 2010 Members Meeting in Zadar, Croatia. In all cases, the response of
participants suggested that a corpus along the lines envisaged for MOC may meet a need of
the community. The nocturne, in particular, led to the creation of a mailing list for
project-related discussion at Brown University (rather quiet so far, but still
As already mentioned, however, MOC is still work in progress. More lies ahead than behind.
Our first task is to make a first version of MOC which is reasonably complete and suitable
for at least some of its intended uses. The steps we intend to take are:
Concerning the technical infrastructure:
Finalize and document decisions on the corpus repository structure and
linking possibilities and mechanisms.
Adjust the structure of the current repository to conform to the above decisions.
Build, test, and deploy a multi-user and user-friendly interface to the
Call on any group or community interested in overlap to take part in
the effort of populating the corpus.
Develop collaboration and work-organization strategies (including
Populate the corpus up to a
critical mass size (including full-size samples):
systematic extension of the bibliography
selection of useful toy examples from the literature
identification of a small but illustrative set of idioms to be illustrated
selection and careful encoding of a small set of small examples
selection and careful encoding of a (very) small set of large examples
systematic encoding of all examples in all applicable notations, vocabularies, and
idioms
Once MOC has something like a critical mass of samples, it should be possible to use it to
investigate and illustrate the relative merits of various encodings, building applications to
operate on the data, for example by displaying it using visualizations like those developed by
Wendell Piez for demonstrations of LMNL, or simple search and retrieval interfaces.
Such applications should make it possible to explore the suggestion by Fabio Vitali and
his research team [Di Iorio et al. 2009] that SPARQL might be a more useful query
language for overlapping structures than the various extensions to XPath described in the
literature.
[ACH/ACL/ALLC 1994] Association for Computers and the
Humanities, Association for Computational Linguistics, and Association for Literary and
Linguistic Computing. 1994. Guidelines for Electronic Text Encoding and Interchange
(TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text
Encoding Initiative, 1994.
[Barnard et al. 1988] Barnard, D., Hayter, R.,
Karababa, M., Logan, G. and McFadden, J. 1988.
SGML Markup for Literary Texts.
Computers and the Humanities 22: 265-276. doi:10.1007/BF00118602.
[Barnard et al. 1995] Barnard, D., Burnard, L.,
Gaspart, J. P., Price, L. A., Sperberg-McQueen, C. M. and Varile, G. B. 1995.
Hierarchical encoding of text: Technical problems and SGML solutions.
Computers and the Humanities 29: 211-231. doi:10.1007/BF01830617.
[Carletta et al. 2005] Carletta, J., Evert, S.,
Heid, U. and Kilgour, J. 2005.
The NITE XML Toolkit: data model and query.
Language Resources and Evaluation 39(4): 313-334. doi:10.1007/s10579-006-9001-9.
[Chatti et al. 2007] Chatti, N., Kaouk, S.,
Calabretto, S. and Pinon, J. M. 2007.
MultiX: an XML-based formalism to encode multi-structured documents.
Proceedings of Extreme Markup Languages 2007. Montréal (Canada)
Aug. 2007. http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html
[DeRose 2004] DeRose, Steven. 2004.
Markup overlap: A review and a horse.
Proceedings of Extreme Markup Languages 2004. Montréal (Canada)
Aug. 2004. http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html
[Di Iorio et al. 2009] Di Iorio, A.; Peroni, S.;
and Vitali, F.
Towards markup support for full GODDAGs and beyond: the EARMARK approach.
Proceedings of Balisage: The Markup Conference 2009. Montréal
(Canada), August 11-14, 2009. doi:10.4242/BalisageVol3.Peroni01.
[Durusau and O’Donnell 2002] Durusau, Patrick and O’Donnell, Matthew Brook.
Coming down from the
trees: Next step in the evolution of markup?.
Proceedings of Extreme Markup Languages® 2002. http://www.durusau.net/publications/Down_from_the_trees.pdf
[Hilbert et al. 2005] Hilbert, Mirco;
Schonefeld, Oliver; and Witt, Andreas.
Making CONCUR work
Proceedings of Extreme Markup Languages® 2005. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml
[Huitfeldt and Marcoux 2010] Huitfeldt, Claus and
Marcoux, Yves.
The MLCD overlap corpus: A markup research infrastructure.
Presented at the TEI Members Meeting 2010 Zadar (Croatia).
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt, Claus and Sperberg-McQueen, C. M.
TexMECS: An experimental markup meta-language for complex documents. Working paper of the project
Markup Languages for Complex Documents (MLCD) University of Bergen
January 2001, rev. October 2003. http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html
[Huitfeldt et al. 2010] Huitfeldt, Claus;
Sperberg-McQueen, C. M.; and Marcoux, Yves.
The MLCD Overlap Corpus (MOC). Poster
presented at the Digital Humanities 2010 Conference. King's College,
London, 7-10 July 2010. http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-633.html
[Jagadish et al. 2004] Jagadish, H. V.;
Lakshmanan, L. V. S.; Scannapieco, M.; Srivastava, D.; and Wiwatwattana, N.
XML: one hierarchy isn't enough.
Proceedings of the 2004 ACM SIGMOD international conference on Management of
data. Paris, France: pp. 251-262, 2004. doi:10.1145/1007568.1007598
[Marinelli / Vitali / Zacchiroli 2008] Marinelli,
Paolo; Vitali, Fabio; Zacchiroli, Stefano.
Towards the unification of formats for overlapping markup.
The New Review of Hypermedia and Multimedia 14 57-94. doi:10.1080/13614560802316145; see http://en.scientificcommons.org/38517317, http://www.tandfonline.com/doi/full/10.1080/13614560802316145, and http://hal.archives-ouvertes.fr/docs/00/34/05/78/PDF/nrhm-overlapping-conversions.pdf
[Schonefeld 2007] Schonefeld, Oliver.
XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent
markup. Georg Rehm, Andreas Witt, Lothar Lemnitzer Datenstrukturen
für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic
resources and applications: Proceedings of the Biennial GLDV Conference 2007.
Tübingen: Gunter Narr Verlag, pp. 347-356, 2007. See also http://www.xconcur.org/.
[Sperberg-McQueen 2010] Sperberg-McQueen, C.
MOC catalog and maintenance plan
formal model itself, in Alloy, is available at http://mlcd.blackmesatech.com/mlcd/2010/A/moc-catalog.als).
[Sperberg-McQueen / Huitfeldt 1998] Sperberg-McQueen,
C.M. and Huitfeldt, Claus. 1998.
Concurrent Document Hierarchies in MECS and SGML.
Literary and Linguistic Computing 14 29-42
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C. M. and Huitfeldt, Claus.
GODDAG: A Data Structure
for Overlapping Hierarchies. Peter R. King and Ethan V. Munson Digital
documents: systems and principles. Lecture Notes in Computer Science 2023 Berlin:
Springer, 2004, pp. 139-160. Paper given at Digital Documents: Systems and Principles. 8th
International Conference on Digital Documents and Electronic Publishing, DDEP 2000, 5th
International Workshop on the Principles of Digital Document Processing, PODDP 2000, Munich,
Germany, September 13-15, 2000. 2004 http://www.springerlink.com/content/98j1vbu5nby73ul3/?p=4eefed0ac09e4ee381d09d3ac2afcb46&pi=8
[Stührenberg / Goecke 2008] Stührenberg,
M. and Goecke, D. 2008.
SGF — An integrated model for multiple annotations and
its application in a linguistic domain.
Proceedings of Balisage: The Markup Conference 2008. Montréal
(Canada) August 12-15, 2008. http://www.balisage.net/Proceedings/vol1/html/Stuehrenberg01/BalisageVol1-Stuehrenberg01.html.
[Stührenberg and Jettka 2009] Stührenberg, M. and Jettka, D.
A toolkit for multi-dimensional markup:
The development of SGF to XStandoff.
Proceedings of Balisage: The Markup Conference 2009. Montréal
(Canada), August 11-14, 2009. doi:10.4242/BalisageVol3.Stuhrenberg01.
[TEI 2007] TEI Consortium. 2007. TEI P5:
Guidelines for Electronic Text Encoding and Interchange. Ed. Lou Burnard and Syd
Bauman. Oxford, Providence, Charlottesville, Nancy: The TEI Consortium, 2007, rev. 2010.
[Tennison and Piez 2002] Tennison, Jeni and Piez,
Wendell.
The Layered Markup and Annotation Language (LMNL).
Proceedings of Extreme Markup Languages® 2002. http://conferences.idealliance.org/extreme/html/2002/Tennison02/EML2002Tennison02.html (abstract only). Some information on LMNL can be found at http://www.piez.org/wendell/LMNL/lmnl-page.html.
[Witt 2004] Witt, Andreas. 2004.
Multiple hierarchies: new aspects of an old solution. Paper given at Extreme Markup Languages
2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html
[Witt / Lüngen / Goecke 2005] Witt, A.,
Lüngen, H., Sasaki, F. and Goecke, D. 2005.
Unification of XML Documents with Concurrent Markup.
Literary and Linguistic Computing 20(1): 103-116. doi:10.1093/llc/fqh046.