Peter: Hey, Paul! Would you pass me —
Paul: [Handing him the hammer] — the hammer?
notation: By notation we mean a particular markup syntax, such as SGML, XML, TexMecs, etc.

vocabulary: the same textual material may be encoded in XML or another notation using different vocabularies, such as (for example) TEI, HTML, OSIS, or an ad hoc vocabulary invented for the example.

(MOC is rather casual about the affiliation of vocabularies with notations. Most public vocabularies (TEI for instance) are defined using just one notation (typically XML), so it's stretching things a bit to say (as MOC does) that a vocabulary defined only for one notation can be used in samples encoded in another notation (say, TexMecs). For MOC purposes, that is, a vocabulary provides information about the meaning to be attached to particular identifiers used in an encoding, without being particular about the notation in which the identifiers occur.)

idiom: a given vocabulary may provide more than one way to encode overlapping structures; we refer to a particular way of using a vocabulary as an idiom. Some vocabularies are designed to reduce such variation as far as possible; others tolerate or even encourage it. In the TEI vocabulary, for example, overlapping elements can be encoded in several ways:

- using for-the-purpose milestone elements (such as pb and lb)
- using generic milestone elements (milestone)
- using Trojan Horse markup []
- using virtual elements fragmented to fit into the imposed hierarchy and then knit together in a variety of ways:
  - using the join element
  - using the attributes next and prev
  - using the attribute part with the values I, M, and F
- using stand-off markup of various kinds

In all of these cases, the encoder faces the choice of which logical elements of the document structure (if any) to encode in the conventional way (one logical element, one XML element) and which to encode in the alternative way using milestones, multiple XML elements, or elements in a stand-off annotation structure.
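To illustrate the milestone approach: where a page boundary overlaps a paragraph, TEI records the page break with the empty pb element, marking the boundary as a point inside the paragraph rather than as a containing element. A minimal sketch using standard TEI elements (the sample text and page number are invented for illustration):

```xml
<p>This paragraph begins on one page
  <pb n="2"/>
and ends on the next; the empty pb element marks the
page boundary as a point rather than as a container.</p>
```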
(See further discussion of sacred and profane elements below.) The convenience of concrete operations on the document may vary widely depending on the choices made, so ideally MOC should provide a wide range of variation in choices here, to enable them to be compared empirically. MOC reifies idioms by defining them and giving them names so they can be tracked and compared. Each sample may instantiate any number of named idioms.

source: a bibliographic reference to the source of the sample. Omitted for samples constructed by the MOC project.

description: a prose description of the sample, commenting on any points of particular interest or importance.

status, to-do list, and change history: provisions for work-flow management; see discussion below.

A formal model of (an early draft of) the MOC catalog has been created and is described in [].

Ancillary materials

Along with the samples, MOC records information about each sample group, notation,
vocabulary, and idiom used in the corpus, together with bibliographic references. For
notations like XML and vocabularies like TEI, documentation is readily available and MOC
makes no attempt to compete with other sources as regards completeness of its lists of
bibliographic references. But for less commonly known notations, it is hoped that MOC's
collection of information may be helpful to those seeking to learn more. Each notation, vocabulary, idiom, sample, and sample group in MOC has a distinct URI;
users can dereference the URI to see the information MOC has about the item in question.
Selection criteria

Since its purpose is to illuminate problems connected with overlap and with existing
proposals for handling it, MOC does not attempt to make the selection of texts
representative of any particular linguistic or textual population. (MOC is not a
corpus in that sense.) For MOC, the relevant population is not a
particular set of natural-language users, but the set of overlap-related problems
encountered by people who work with natural-language texts for whatever purposes.

Accordingly, MOC takes a resolutely opportunistic approach to samples; we will take
samples anywhere we can find them. This is particularly visible in the toy samples: the
current version of MOC includes among the toy samples many short samples originally
published in papers on overlap that have come to our attention. Opportunistic sampling is
less fruitful when it comes to the short and long samples.

Since one of the purposes of MOC is to support investigations of different kinds of
overlap as well as different ways of encoding overlap, the collections of short and long
samples will, to the extent possible, reflect a variety of overlap phenomena and textual
interests. In the absence of a well-grounded categorization of different kinds of overlap, it is difficult to be certain how many really different kinds of overlap there are, and which kinds are structurally and conceptually isomorphic. Lacking such a categorization, we hope to include examples of at least the following kinds of overlap:

- structural overlap and multiple hierarchies (as in verse drama, or physical and logical hierarchies [page vs paragraph], or in the analysis of the Peter / Paul example above into utterances and into syntactic units)
- overlapping annotation targets (as in fine-grained commentary on specific texts)
- change-history markup showing the revision of a text over time (of practical import for technical documentation, but also of interest for genetic editions)
- overlapping sites of textual variation (as in text-critical editions)
- discontinuous and disordered elements (as in cases where one text is quoted and commented on in another text, for which songs and plays-within-plays in drama provide examples; a well-known example in the overlap literature is the attempt of Hughie, Louie, and Dewey to remember a haiku)

We also hope to provide examples that illustrate the occurrence of overlap in texts and applications of interest in different communities:

- literary study
- lexicology
- metrical study
- language corpora (discourse analysis, syntax, prosody, ...)
- textual criticism
- document publishing
- documentary, historical-critical, genetic, and other scholarly editions
- analytical bibliography
- historical annotation
- legal documents

Work flow

Each sample in the corpus goes through the following processes, leading to the corresponding status:

- candidate: The sample has been collected and may
or may not be included in the corpus proper. (We expect this will apply just to toy
samples, but it may also apply to others.)
- projected: We have agreed in principle and in theory that we want this sample.
- planned: We have agreed on the desired properties of the sample in sufficient detail to allow data capture to proceed:
  - sample group (i.e., information about which sample group the sample belongs to)
  - source text
  - notation
  - vocabulary
  - idiom
- incomplete: Data capture has begun but has not yet been completed.
- rough: Data capture has been completed, and the person who did the data capture has done an initial proofreading.
- validated (or wf-checked): The sample has been validated against all appropriate schemas, if there are any, or (if there is no schema) has been checked for well-formedness by some automatic tool.

At this point the paths divide. Toy samples and small samples undergo repeated
proofreadings (the initial plan is to do three proofreadings for each, but that plan has not
yet been put to the test). Large samples are (we assume) too long for multiple proofreadings
(or possibly even one). Instead, we perform a single proofreading and several spot checks.

Once it reaches the validated state, each large sample acquires a list of
spot checks to be performed. One by one, not necessarily in any prescribed order, the
prescribed checks are performed. Each check results, possibly, in corrections and
re-validation, and possibly in the addition of new checks to the to-check list. Whenever we
notice something odd or amiss in the document, especially if it could be a systematic
problem, then a new task is added to the to-check list (assuming we can devise a way to
check systematically for the error in question).

It is not yet clear exactly what spot-checks we need to do; we expect them to vary with the notation, the idioms, the sample, etc. But some examples may make the idea clearer:

- When feasible, a spell-checker is used to check the text for typographic errors.
- A selected one-, ten-, or one-hundred-percent sample of markup constructs
(typically occurrences of particular element types or attributes) is checked for
semantic plausibility (their syntactic correctness having already been guaranteed by
validation). For example, we might spot-check one percent, ten percent, or all of the
markup used for page breaks, page numbers, the TEI part attribute, the
next and prev attributes, the join element,
instances of markup for discontinuous elements, instances of fragmented elements, or
Trojan horses, to make sure they are semantically correct. The specific constructs
that need checking will, of course, typically depend on the idiom used.
- Systematic errors found in spot checks are fixed in whatever way we can manage.

Current status

The ultimate aim of MOC is to provide a fully populated matrix of materials: for each sample group, one sample in each relevant combination of notation, vocabulary, and idiom.

As a first step towards this larger goal we have built a prototype corpus of toy samples (MOC-POC), as a proof of concept. MOC-POC currently contains 52 samples distributed over 14 sample groups, 4 notations, and 6 idioms. Not all sample groups contain samples in all notations and idioms. When completed, this toy corpus of 14 sample groups will contain 126 samples.

The text fragments comprising the samples of MOC-POC are taken from a selection of
research publications on the overlap problem. This prototype does not claim any kind of
completeness; it has, however, successfully identified a number of weak spots in our initial
design.

Notations currently represented in MOC-POC are:

- XML
- XConcur
- LMNL saw-tooth notation
- TexMecs

Most of the samples are encoded using a vocabulary taken from or based on some version of TEI, but ad hoc vocabularies are also represented.

Samples encoded in XML are encoded using six different TEI idioms to resolve overlap problems. (For other vocabularies, analogous attributes and elements are assumed or provided.) The idioms are:

1. Fragmentation using the next and prev attributes defined as part of the TEI tag set for segmentation and alignment.
2. Fragmentation using the part attribute provided for certain elements in the TEI vocabulary. (If a TEI-encoded example requires fragmentation of an element for which TEI provides no part attribute, the attribute is added.)
3. Fragmentation using the part and id attributes and the join element defined as part of the TEI tag set for segmentation and alignment.
4. The "Trojan horse 1" idiom uses milestone tags to resolve overlap. Milestones are used only when necessary; normal XML elements are used in all other cases. It is left to the encoder's choice to decide which elements to mark with milestones.
5. "Trojan horse 2", like "Trojan horse 1", uses milestone tags to resolve overlap. However, the Trojan horse 2 idiom represents every element in the overlap as a pair of milestones with intervening content.
6. The "XStandoff" idiom uses XML-conformant markup that points to the character data ("primary data"), which is kept in a separate location. (We express warm thanks to Maik Stührenberg (Bielefeld) for having prepared and allowed us to use XStandoff samples in all sample groups.)

Preliminary results

A few preliminary results of our work on MOC can be mentioned.

Our attempts to explore the solution space of techniques like TEI-style fragmentation led
very quickly to the realization that the TEI's techniques for handling overlapping structures
(here we will use the next and prev attributes as an example, but
the same observations apply to all the techniques described by the TEI) do not in themselves
fully determine the encoding of a given sample, even when there is no uncertainty about which
textual features are to be encoded. This is not surprising in itself; the TEI almost always
leaves a great deal of leeway to the individual project and its encoding policies. But it does
mean that a full description of how the TEI is used to encode a given sample must go beyond
saying that the next and prev attributes are used.

When next and prev are used, an overlap of two logical elements
is resolved by breaking one of the logical elements into smaller pieces
(fragmentation) and using next and prev to signal
that each XML element is just a fragment of the original logical element. For example, the
Peter/Paul example given earlier might be encoded this way:
<sp>
<speaker>Peter</speaker>
<p>
<s id="s1">Hey, Paul!</s>
<s id="s2a" next="s2b">Would you pass me </s>
—
</p>
</sp>
<sp>
<speaker>Paul</speaker>
<stage>Handing him the hammer</stage>
<p>—
<s id="s2b" prev="s2a">the hammer?</s>
</p>
</sp>

Here sentences are tagged using the s (sentence-unit) element, and the second sentence is fragmented to fit within the hierarchy defined by the speech elements.

It would be logically possible, however, to break the speech elements as
needed to fit within the s-unit hierarchy:
<p>
<s id="s1">
<sp who="Peter"
id="sp1a"
next="sp1b">Hey, Paul!</sp>
</s>
<s id="s2">
<sp id="sp1b"
prev="sp1a">Would you pass
me —</sp>
<sp id="sp2"
who="Paul">—
the hammer?</sp>
</s>
</p>

In order to be usable, an encoding using next and prev to
resolve overlap problems will need to be consistent in choosing which logical elements to
fragment and which to leave intact. In some cases, it will suffice to say, for each element
type in the vocabulary, whether or not it is to be fragmented in case of need. Those elements
which are never to be fragmented or modified are referred to, jokingly, as
sacred; the others in contrast as profane. But a binary
classification of element types as either sacred or profane suffices only when every pair of
overlapping elements has one sacred and one profane member: it does not provide adequate
guidance when both elements in the pair are sacred, or both profane. In more complex cases,
therefore, it may be desirable to formulate a scale of values assigning
each element type a degree of sacredness or profanity, and to ensure that no two element types
which overlap each other have the same value. Then the rule can be formulated: for any pair of
overlapping logical elements, represent the more sacred logical element as a single XML
element, and fragment the less sacred element in order to make the XML elements nest.

The sacred / profane distinction has been picked up (and stretched into a slightly different shape) by [].

Conclusion and future work

The MOC project has been presented to markup-related communities on three occasions: a
poster session at Digital Humanities 2010 in London, a nocturne at Balisage 2010, and a
talk at the TEI 2010 Members Meeting in Zadar, Croatia. In all cases, the response of
participants suggested that a corpus along the lines envisaged for MOC may meet a need of
the community. The nocturne, in particular, led to the creation of a mailing list for
project-related discussion at Brown University (rather quiet so far, but still
there). As already mentioned, however, MOC is still work in progress. More lies ahead than behind.

Our first task is to make a first version of MOC which is reasonably complete and suitable
for at least some of its intended uses. The steps we intend to take are:

- Concerning the technical infrastructure:
  - Finalize and document decisions on the corpus repository structure and linking possibilities and mechanisms.
  - Adjust the structure of the current repository to conform to the above decisions.
  - Build, test, and deploy a multi-user and user-friendly interface to the repository.
- Call for the contribution of any group or community interested in overlap to take part in the effort of populating the corpus.
- Develop collaboration and work-organization strategies (including funding).
- Populate the corpus up to a critical-mass size (including full-size samples):
  - systematic extension of the bibliography
  - selection of useful toy examples from the literature
  - identification of a small but illustrative set of idioms to be illustrated
  - selection and careful encoding of a small set of small examples
  - selection and careful encoding of a (very) small set of large examples
  - systematic encoding of all examples in all applicable notations, vocabularies, and idioms

Once MOC has something like a critical mass of samples, it should be possible to use it to
investigate and illustrate the relative merits of various encodings, building applications to
operate on the data, for example by displaying it using visualizations like those developed by
Wendell Piez for demonstrations of LMNL, or simple search and retrieval interfaces. Such applications should make it possible to explore the suggestion by Fabio Vitali and
his research team [] that SPARQL might be a more useful query
language for overlapping structures than the various extensions to XPath described in the
literature.

References

Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.

Barnard, D., Hayter, R., Karababa, M., Logan, G. and McFadden, J. 1988. SGML Markup for Literary Texts. Computers and the Humanities 22: 265-276. doi:10.1007/BF00118602.

Barnard, D., Burnard, L., Gaspart, J. P., Price, L. A., Sperberg-McQueen, C. M. and Varile, G. B. 1995. Hierarchical encoding of text: Technical problems and SGML solutions. Computers and the Humanities 29: 211-231. doi:10.1007/BF01830617.

Carletta, J., Evert, S., Heid, U. and Kilgour, J. 2005. The NITE XML Toolkit: data model and query. Language Resources and Evaluation 39(4): 313-334. doi:10.1007/s10579-006-9001-9.

Chatti, N., Kaouk, S., Calabretto, S. and Pinon, J. M. 2007. MultiX: an XML-based formalism to encode multi-structured documents. Proceedings of Extreme Markup Languages 2007, Montréal (Canada), Aug. 2007. http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html

DeRose, Steven. 2004. Markup overlap: A review and a horse. Proceedings of Extreme Markup Languages 2004, Montréal (Canada), Aug. 2004. http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html

Di Iorio, A.; Peroni, S.; and Vitali, F. Towards markup support for full GODDAGs and beyond: the EARMARK approach. Proceedings of Balisage: The Markup Conference 2009, Montréal (Canada), August 11-14, 2009. doi:10.4242/BalisageVol3.Peroni01.

Durusau, Patrick and O’Donnell, Matthew Brook. Coming down from the trees: Next step in the evolution of markup? Proceedings of Extreme Markup Languages® 2002. http://www.durusau.net/publications/Down_from_the_trees.pdf

Hilbert, Mirco; Schonefeld, Oliver; and Witt, Andreas. Making CONCUR work. Proceedings of Extreme Markup Languages® 2005. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml

Huitfeldt, Claus and Marcoux, Yves. The MLCD overlap corpus: A markup research infrastructure. Presented at the TEI Members Meeting 2010, Zadar (Croatia).

Huitfeldt, Claus and Sperberg-McQueen, C. M. TexMECS: An experimental markup meta-language for complex documents. Working paper of the project Markup Languages for Complex Documents (MLCD), University of Bergen, January 2001, rev. October 2003. http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html

Huitfeldt, Claus; Sperberg-McQueen, C. M.; and Marcoux, Yves. The MLCD Overlap Corpus (MOC). Poster presented at the Digital Humanities 2010 Conference, King's College, London, 7-10 July 2010. http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-633.html

Jagadish, H. V.; Lakshmanan, L. V. S.; Scannapieco, M.; Srivastava, D.; and Wiwatwattana, N. Colorful XML: one hierarchy isn't enough. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, pp. 251-262, 2004. doi:10.1145/1007568.1007598.

Marinelli, Paolo; Vitali, Fabio; and Zacchiroli, Stefano. Towards the unification of formats for overlapping markup. The New Review of Hypermedia and Multimedia 14: 57-94. doi:10.1080/13614560802316145; see http://en.scientificcommons.org/38517317, http://www.tandfonline.com/doi/full/10.1080/13614560802316145, and http://hal.archives-ouvertes.fr/docs/00/34/05/78/PDF/nrhm-overlapping-conversions.pdf

Schonefeld, Oliver. XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup. In Georg Rehm, Andreas Witt, and Lothar Lemnitzer (eds.), Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007. Tübingen: Gunter Narr Verlag, pp. 347-356, 2007. See also http://www.xconcur.org/.

Sperberg-McQueen, C. M. MOC catalog and maintenance plan. http://mlcd.blackmesatech.com/mlcd/2010/A/moc-catalog-sketch.xml (the formal model itself, in Alloy, is available at http://mlcd.blackmesatech.com/mlcd/2010/A/moc-catalog.als).

Sperberg-McQueen, C. M. and Huitfeldt, Claus. 1998. Concurrent Document Hierarchies in MECS and SGML. Literary and Linguistic Computing 14: 29-42.

Sperberg-McQueen, C. M. and Huitfeldt, Claus. GODDAG: A Data Structure for Overlapping Hierarchies. In Peter R. King and Ethan V. Munson (eds.), Digital documents: systems and principles. Lecture Notes in Computer Science 2023. Berlin: Springer, 2004, pp. 139-160. Paper given at Digital Documents: Systems and Principles. 8th International Conference on Digital Documents and Electronic Publishing, DDEP 2000, 5th International Workshop on the Principles of Digital Document Processing, PODDP 2000, Munich, Germany, September 13-15, 2000. http://www.springerlink.com/content/98j1vbu5nby73ul3/?p=4eefed0ac09e4ee381d09d3ac2afcb46&pi=8, http://cmsmcq.com/2000/poddp2000.html, http://www.w3.org/People/cmsmcq/2000/poddp2000.html

Stührenberg, M. and Goecke, D. 2008. SGF — An integrated model for multiple annotations and its application in a linguistic domain. Proceedings of Balisage: The Markup Conference 2008, Montréal (Canada), August 12-15, 2008. http://www.balisage.net/Proceedings/vol1/html/Stuehrenberg01/BalisageVol1-Stuehrenberg01.html. doi:10.4242/BalisageVol1.Stuehrenberg01.

Stührenberg, M. and Jettka, D. A toolkit for multi-dimensional markup: The development of SGF to XStandoff. Proceedings of Balisage: The Markup Conference 2009, Montréal (Canada), August 11-14, 2009. doi:10.4242/BalisageVol3.Stuhrenberg01.

TEI Consortium. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Ed. Lou Burnard and Syd Bauman. Oxford, Providence, Charlottesville, Nancy: The TEI Consortium, 2007, rev. 2010.

Tennison, Jeni and Piez, Wendell. The Layered Markup and Annotation Language (LMNL). Proceedings of Extreme Markup Languages® 2002. http://conferences.idealliance.org/extreme/html/2002/Tennison02/EML2002Tennison02.html (abstract only). Some information on LMNL can be found at http://www.piez.org/wendell/LMNL/lmnl-page.html.

Witt, Andreas. 2004. Multiple hierarchies: new aspects of an old solution. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html

Witt, A., Lüngen, H., Sasaki, F. and Goecke, D. 2005. Unification of XML Documents with Concurrent Markup. Literary and Linguistic Computing 20(1): 103-116. doi:10.1093/llc/fqh046.