How to cite this paper

Piez, Wendell. “Luminescent: parsing LMNL by XSLT upconversion.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). https://doi.org/10.4242/BalisageVol8.Piez01.

Balisage: The Markup Conference 2012
August 7 - 10, 2012

Balisage Paper: Luminescent: parsing LMNL by XSLT upconversion

Wendell Piez

Mulberry Technologies, Inc.

`<wapiez@mulberrytech.com>`

Wendell Piez has been attending Balisage and its antecedent conferences since the early days of XML; among his contributions has been, with Jeni Tennison, the original LMNL proposal (2002).

Abstract

Among attempts to deal with the overlap problem, LMNL (Layered Markup and Annotation Language) has attracted its share of attention but has also never grown much past its origins as a thought experiment. LMNL’s conceptual model differs from XML’s, and by design its notation also differs from XML’s. Nonetheless, a pipeline of XSLT transformations can parse LMNL input and construct an XML representation of LMNL, with the resulting benefit that further XML tools can be used to analyze and process documents originating from the alien notation. The key is to regard the task as an upconversion: structural induction performed over plain text.

LMNL: the Layered Markup and Annotation Language

Ranges
Arbitrary overlap
Annotations
Atoms

xLMNL: an XML-based representation of the LMNL data model

Compiling LMNL syntax into xLMNL via XSLT upconversion

Checking LMNL syntax for well-formedness

Working with the model: prototype LMNL applications

Reflections

Appendix A. xLMNL example

LMNL syntax:
Compiled into xLMNL

Appendix B. RNC schema for xLMNL

Appendix C. Demonstrations and source code

Luminescent is a prototype parser and compiler for LMNL syntax, converting LMNL documents into xLMNL, an XML-based representation of the LMNL model suitable for further processing. It consists of a series of XSLT 2.0 stylesheets, currently running in a web server (using Cocoon) or in batch mode (using an XProc pipeline). A second XProc pipeline can apply Schematron validation to the intermediate formats generated in Luminescent to detect and locate syntax errors in the input document.

LMNL: the Layered Markup and Annotation Language

LMNL (Layered Markup and Annotation Language) is an approach to markup first proposed by Jeni Tennison and myself in 2002 [Tennison and Piez 2002]. It emulates XML in some respects, but also differs from it in several fundamental ways, suggesting some very different approaches to modeling text-based information using markup, with some very different applications. For this reason, even if an alternative processing stack could never be built on LMNL (which presumably it could, given enough time, effort and resources), and even if LMNL is never regarded as a replacement for XML (which it was never intended to be), it turns out to be a fertile laboratory for solutions to modeling problems - including XML-based solutions for XML platforms.

XML is defined [XML Recommendation] as a syntax, but implies a model, which was described by the (non-normative) XML Information Set [XML Infoset], expressed in any number of code libraries and APIs (both official and unofficial), and finally standardized (at least in one variant) in the XPath 2.0/XQuery Data Model (XDM) [XDM]. LMNL inverts this, being defined first as an abstract model, whose syntax is proposed incidentally, as a form of representation (and as such, one among many conceivable). Nevertheless, the idea is the same: a formal model stabilizes a set of capabilities for tools performing useful operations over text-based information sets, and provides a basis for interoperability, while a syntax provides a serialization format and an interface for developers and users. Like XML, LMNL is conceived in order to support markup, a means of assigning labels and attributing properties and relationships to data points or fields in text, by means of text; and like XML, LMNL expects to provide a basis for descriptive and declarative markup applications (although, again like XML, not only those), which support document and data processing within layered systems that can thus benefit from separation of concerns (between authoring, editorial, data management, and production tasks, for example), and that are not locked into single applications. Again like XML, LMNL does this by leaving it to applications to define their own sets of names, labels or keywords, to which they can assign whatever semantics they see fit. In this respect, LMNL syntax (like XML) is a meta-language while LMNL itself (like the XDM) is a meta-model: a model (with a design and hence a particular set of affordances in application) that we use to make models, of documents, families of documents, and assorted information sets of whatever description.

This much is similar; the differences from XML are (primarily) in the design of the model itself, and (secondarily) in the syntax proposed to represent it. The syntax is designed to look as little like XML as possible, for two reasons: first, so that LMNL syntax may be embedded directly into XML syntax, or the reverse; and secondly, to reduce cognitive overload when thinking about LMNL and XML together, or when thinking about LMNL with the burden of expectations formed by XML. (At the level of the model, we have similarly tried to avoid using XML terminology for LMNL concepts except where the connections are strong.) In the interests of brevity, rather than explicate the model fully and offer rationales for it here, I offer a simple summary description of the model, and of LMNL syntax, together.

Note

Readers may wish to review some of the historical LMNL specifications, which can now be found at lmnl-markup.org.

Ranges

Where XML has elements, LMNL has ranges. Unlike XML elements, ranges in LMNL have no necessary relation with one another: they are neither parents nor children of each other, nor in any hierarchy at all. Ranges may be named (names in LMNL are qualified by namespaces in much the way they are in XML), or anonymous. The assumption is that they will ordinarily have generic names indicating their type, like XML elements. Ranges are properties of an owner limen (using the Latin word for doorstep to designate this important data object type), which belongs either to the document as a whole or an annotation, and which has a value comprising a single string (a sequence of contiguous characters). The value of the range will be a substring of the value of the limen, while its position will be the character offset within its limen where it starts.

In order to avoid confusion with XML, LMNL syntax uses a different set of delimiters to identify starts and ends of ranges. This example shows a chunk of LMNL syntax with two types of ranges, s and l, marked over the stream of text. s ranges do not overlap with other s ranges, and l never overlaps with l, but the two types overlap each other:

[s}[l}He manages to keep the upper hand{l]
[l}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l}We fence our flowers in and the hens range.{l]{s]
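As a rough illustration of the model just described (not of any LMNL tool), a range can be pictured as a name plus start and end offsets into the text of its limen. The Python sketch below is only that: the names `Range`, `value` and `covering` are hypothetical, and locating offsets by searching the text is an assumption made to keep the example short.

```python
from dataclasses import dataclass

# The limen's text: the Frost lines above with all tagging removed.
TEXT = (
    "He manages to keep the upper hand\n"
    "On his own farm. He's boss. But as to hens:\n"
    "We fence our flowers in and the hens range."
)

@dataclass(frozen=True)
class Range:
    name: str
    start: int  # character offset within the limen where the range starts
    end: int    # offset just past the last character of the range

    def value(self, limen: str) -> str:
        """The value of a range is a substring of its limen's value."""
        return limen[self.start:self.end]

def covering(name: str, first: str, last: str, limen: str = TEXT) -> Range:
    """Hypothetical helper: a range from the start of `first` to the end of `last`."""
    start = limen.index(first)
    end = limen.index(last, start) + len(last)
    return Range(name, start, end)

s1 = covering("s", "He manages", "farm.")   # first sentence
l2 = covering("l", "On his own", "hens:")   # second verse line
print(repr(s1.value(TEXT)))
print(repr(l2.value(TEXT)))
```

The two ranges share the text "On his own farm." while neither contains the other, which is the overlap between s and l that the tagging expresses.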

In the way that XML has a concise empty-element syntax, empty ranges may also be marked with single tags, as in [br]. Empty ranges have no value (or their value is an empty string), although they do have a position within their owner layer.

It is sometimes convenient (although LMNL syntax does not require it) to designate a single range covering the entire document:

[excerpt}
[s}[l}He manages to keep the upper hand{l]
[l}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l}We fence our flowers in and the hens range.{l]{s]
{excerpt]

Arbitrary overlap

LMNL supports arbitrary overlap, which is to say overlapping ranges of the same type. This is important for certain potential applications such as annotation frameworks and range indexing, where ranges of text need to be identified that may overlap, while still being of the same type.

In LMNL syntax, this example shows two ranges named r, overlapping each other:

[r=r1}A case [r=r2}of{r=r1] arbitrary overlap{r=r2]

While the range identifier (given after the =) is optional, when it is not given, a close tag is presumed to match the most recent open tag with the same combination of name and identifier; thus to express overlap of this kind (rather than one r range simply being enclosed in the other), the identifier is necessary on the tags marking at least one of the ranges involved. But the identifier is not formally part of the name.
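The matching rule can be sketched in a few lines of Python. This is a toy under stated assumptions, not Luminescent's logic: it handles only simple start and end tags (no annotations, atoms, comments or empty ranges), and `parse_flat` and its regular expression are hypothetical names introduced for the illustration. Each end tag is paired with the most recent unmatched start tag carrying the same name-and-identifier label.

```python
import re

# Start tags look like [name} or [name=id}; end tags like {name] or {name=id].
TAG = re.compile(r"\[([^\[\]{}\s]+)\}|\{([^\[\]{}\s]+)\]")

def parse_flat(lmnl):
    """Return (text, ranges); each range is (label, start, end)."""
    text_parts, open_tags, ranges = [], [], []
    offset = pos = 0
    for m in TAG.finditer(lmnl):
        chunk = lmnl[pos:m.start()]
        text_parts.append(chunk)
        offset += len(chunk)          # offsets count text only, not tagging
        pos = m.end()
        if m.group(1) is not None:    # start tag: remember where it opened
            open_tags.append((m.group(1), offset))
            continue
        label = m.group(2)            # end tag: find the most recent match
        for i in range(len(open_tags) - 1, -1, -1):
            if open_tags[i][0] == label:
                _, start = open_tags.pop(i)
                ranges.append((label, start, offset))
                break
        else:
            raise ValueError(f"no start tag matches end tag {{{label}]")
    text_parts.append(lmnl[pos:])
    return "".join(text_parts), ranges

print(parse_flat("[r=r1}A case [r=r2}of{r=r1] arbitrary overlap{r=r2]"))
print(parse_flat("[r}A case [r}of{r] arbitrary overlap{r]"))
```

With identifiers, the two r ranges overlap; without them, the same text yields one r range enclosed in the other, since the first end tag closes the most recently opened r. Start tags left unclosed are silently ignored in this sketch.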

Annotations

While XML elements may have attributes, LMNL ranges may have annotations. Unlike XML attributes, there is no restriction against assigning more than one annotation with the same name to a given range; likewise, the order of annotations on a range is supported in the model.

In the syntax, annotations are represented by using tagging inside tagging:

[excerpt [source}The Housekeeper{source] [author}Robert Frost{author]}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]

In order to reduce tagging overhead, when annotations contain only simple string values, their close tags may be presented in abbreviated notation (resembling anonymous end tags):

...[l [n}145{]}On his own farm.{s [id}s1{]]...

In addition (as this example also shows), the syntax permits placing annotations on end tags, not only on start tags.

Finally, while attributes in XML assign properties to elements as name-value pairs, LMNL annotations may be structured. In the LMNL model, annotations are isomorphic to LMNL documents: like a document, an annotation has a limen with content and optionally ranges over that content. And like ranges (including ranges over annotation content), annotations may themselves be annotated.

Given this flexibility, it is sometimes convenient for an annotation, like a range, to be empty, having no content but only annotations of its own, which it groups, orders and names.

So this is legal syntax and represents a coherent LMNL document object:

[excerpt
  [source [date}1915{][title}The Housekeeper{]]
  [author
    [name}Robert Frost{]
    [dates}1874-1963{]] }
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]

In this example, the excerpt range carries two empty annotations, source and author, each of which has annotations of its own.

This is an especially powerful feature of LMNL, not only because it provides a very useful capability in modeling (as it presents annotations in a directed graph structure – as if XML attributes could have their own attributes), but also because of its implications for the way documentary information is organized and linked. For example, a LMNL system might well support attaching a document dynamically as an annotation to a range in another document.

Atoms

At its base, a LMNL document is defined as a sequence of atoms: the most common type of atom will ordinarily be a character atom, represented by a single Unicode character in the syntax. Yet while every character in Unicode maps to a corresponding atom, atoms in LMNL are also capable of representing other information of whatever kind an application may find it useful to represent in this way.

An atom has string length of 1. Consequently, and unlike empty ranges, atoms not only have location, but they occupy space, are included in the value of ranges in which they participate, and can be marked up. Atoms are identified with their own notation, {{ }}, in the syntax.^[1] In this example, an atom named logo is marked up with a range named link:

[link [href}lmnl-markup.org{]}{{logo [src}lmnl-markup.org/hat.png{]}}{link]

xLMNL: an XML-based representation of the LMNL data model

One way LMNL builds on the conceptual foundation of XML is by differentiating between operations on the syntax, which imply parsing, and operations on optimized representations of documents held in memory: the model. This differentiation gives us leverage in development, since we have the opportunity to identify either syntax or model as the appropriate place for design and implementation, whether that be of the tag set itself (considered as a set of labels and constraints over their use), user interfaces, transformations or anything else.

Paradoxically, while the LMNL model is designed in deliberate contrast to XML, it is nevertheless useful to specify an XML-based representation of it, for several reasons. First, it exposes instances conveniently by giving us the opportunity to serialize LMNL documents in XML syntax. Second, it makes it possible to use XML-based tools (such as XSLT, schema technologies, XQuery, XML servers, CMS and database technology) to query and manipulate LMNL – an advantage for those of us who are well-practiced in these technologies for data processing, but not in Java or Python. And thirdly, it clarifies some of the resemblances and differences between LMNL and other approaches (especially XML-based approaches) to the problem set.

Since 2002, I have experimented with adapting XML to LMNL in several different ways. Not only can XML elements be construed as LMNL ranges and XML attributes as LMNL annotations (this is the essence of the CLIX and ECLIX approaches, cf Piez 2004); also, XML-based notations for representing overlap, such as milestone-based notations or segmented and aligned XML elements, can be mapped into LMNL. This provides a framework, at least, for thinking systematically about how to implement and maintain processes to manage these awkward and difficult forms of XML.

Yet the real power of the LMNL model as such cannot be exploited without a more direct representation. xLMNL is an XML-based representation of the model itself: that is to say, it leaves behind the concept of a document as an information set represented in embedded markup (literal tags applied directly to literal text), and simply uses XML as a kind of poor man's (hierarchical) database. This gives us many of the advantages of an XML platform described above, while making downstream applications more tractable, inasmuch as they can work directly with LMNL as conceptualized, rather than at a remove. At the price of being somewhat heavyweight and memory intensive, xLMNL is thus a useful interim format for testing ideas and demonstrating concepts.

Again, the most concise way of presenting this design is by way of an example: the xLMNL equivalent of the document given above is presented in Appendix A.

Note

Note however that the notation itself is not at all concise! In fact there are many redundancies built into xLMNL, as compared to a bare LMNL range model, in order to streamline downstream processes. For example, text layer content is broken up into spans which are indexed to the ranges in which they participate. While a LMNL processor might wish to calculate this on the fly, when working on a static document it makes sense to index them only once, so this is done in xLMNL. It should go without saying that this does not preclude a more lightweight standoff-based XML representation of LMNL.
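To make the idea of spans indexed to ranges concrete, here is a minimal Python sketch, assuming ranges are given as start and end offsets keyed by an identifier. The function name `spans` and the tuple layout are hypothetical, not part of xLMNL. The text is cut at every range boundary, and each resulting span lists the ranges that cover it.

```python
def spans(text, ranges):
    """Split `text` at every range boundary.

    `ranges` maps a range ID to (start, end) offsets.
    Returns a list of (start, end, substring, [IDs of covering ranges]).
    """
    cuts = sorted({0, len(text)} | {o for se in ranges.values() for o in se})
    result = []
    for a, b in zip(cuts, cuts[1:]):
        members = [rid for rid, (s, e) in ranges.items() if s <= a and b <= e]
        result.append((a, b, text[a:b], members))
    return result

for span in spans("A case of arbitrary overlap", {"r1": (0, 9), "r2": (7, 27)}):
    print(span)
```

Empty ranges produce a cut point but never a span of their own here, since a span always has at least one character.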

xLMNL has undergone several iterations since I first started modeling LMNL directly with XML in 2004 [Piez 2004, and see also Piez 2010].

Developers who work on the overlap problem in XML will recognize this as a standoff representation of ranges. As such, it might be generated and maintained in any number of ways – even (if rather onerously) by hand.

Nevertheless, no claim should be inferred that I suppose xLMNL to be at all an optimal approach to working with LMNL on an XML platform. The best argument for doing this is that fairly dramatic demonstrations of the interest of overlapping markup are not all that hard to come by if one only has a means by which to create them, and xLMNL is a step along the way.

A schema for xLMNL, using RELAX NG (compact syntax), appears in Appendix B.

Compiling LMNL syntax into xLMNL via XSLT upconversion

In its current form, the complete Luminescent pipeline has thirteen steps, each of which is implemented in an XSLT 2.0 transformation. These can be chained together using any available means; I have used both XProc and Cocoon (which is convenient for hooking Luminescent together with further transformations processing xLMNL into various targets). Several of the steps could be combined for greater efficiency; the reason to have so many presently is to maximize transparency for development and debugging.

The steps proceed as follows:

Comments are extracted using a regular expression matching on open and close comment delimiters ([!-- and --]). This has to be done first so that markup inside comments will not be processed in subsequent steps. The result is a single element (representing the root of the tag tree) containing a sequence of strings and elements representing comments.
Tokenization: all open and close tag delimiters, [, {, ] and } in document content (i.e., not inside comments) are matched and wrapped as XML t elements (for token). The result is a sequence of strings interspersed with comments and these elements, representing tag delimiters.
The token (t) elements are marked with line and character offsets, to be carried forward for purposes of any error reporting that has to be performed later.
A sibling recursion is applied to infer tagging from the tokens. A tag element is initiated with each open delimiter ([ or {); each close delimiter (] or }) ends the tag element most recently started. The result is a rudimentary tag tree of the document. Delimiters and comments are retained.
Types are assigned to the tags, which are mapped to start, end, empty and atom elements. This works by inferring each type of tag from its open and close delimiters: [r} for start, {r] for end, [e] for empty, and {{a}} for atom. The extra level of delimiters required for atoms is respected; tags with outer shells but no inner shells (that is, that fail to respect the double-brace syntax of atoms, as in {{atom}}) are marked as errors.

Simultaneously, tag names (generic identifiers) are extracted from their values. Any tag that has a range identifier along with its generic identifier keeps the range identifier as part of its GI. (So a tag [range=r1} is represented as <range gi="range=r1"/>.)
Start tags are marked with unique identifiers (distinct from any range identifiers already given).
By means of another sibling recursion, end tags are marked with the identifier of the most recent start tag with the same GI.

Since range identifiers are still, at this stage, considered part of the GI, the sibling recursion in this process matches end tags to start tags correctly.
Matching start and end-tag pairs appearing inside tags are promoted into annotations.

This is the trickiest step, for two reasons. First, abbreviated syntax permitted for simple annotations means that anonymous end tags ({]) may be matched with named start tags. Secondly, annotations may contain markup, and so not just any tag directly inside a tag is actually an annotation delimiter (it could mark up a range over content inside the annotation). This process must work, again, via sibling recursion (the third one performed in the pipeline). Where tagging is not correct, error elements may be generated.
Character offsets are marked on start, end, empty and atom tag elements, and text spans are wrapped (with span elements) and marked with character offsets within their owner layer (or limen in LMNL terminology: the annotation or document within which they appear). The offsets are determined from the lengths of string content (text nodes in the XML), with any atoms appearing being given length 1, while comments and range markers have length 0.
Proper generic identifiers (range names) are derived from combinations of ranges with their identifiers. (The identifiers are saved as label attributes in case they may be wanted.)
Unique identifiers are assigned to ranges; range start and end tags have the same identifier, while empty range tags have their own. Similarly, annotations are marked with unique identifiers, as is the document as a whole.
Layer identifiers are assigned to spans, corresponding to the limen (annotation or document) in which the span appears. Strictly speaking these identifiers are redundant, since the same information is given by the xLMNL document structure; but they are useful for optimizing subsequent (downstream) processes or (potentially) for processing or aggregating LMNL documents described in multiple xLMNL instances.

The result of this step is a comprehensive tag tree of the marked up LMNL syntax instance.

(A later project goal will be to codify this format for interchange; it maps to the earlier CLIX format. This may also prove to be more robust than xLMNL for maintenance of LMNL data sets in XML, since ranges are still represented by tags within the text stream rather than standoff markup.)
The tag tree is converted into xLMNL by reading range elements from start/end tag pairs, or from empty range markers as the case may be. Ranges are marked with the start and end offsets, read from their tags. Spans are marked with pointers to the ranges in which they participate. (A fourth sibling recursion accomplishes this. Again, the information here is redundant but useful.)
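The early steps (tokenizing delimiters, inferring a rudimentary tag tree, and typing tags from their delimiters) can be approximated outside XSLT. The Python sketch below is an analogy under stated assumptions rather than a port of Luminescent: it uses a stack where the stylesheets use sibling recursion, it ignores comments and offsets, and the dictionary-based node shape and function names are hypothetical. The error label is borrowed from the diagnostics reported in this paper.

```python
import re

def tokenize(text):
    """Split text into strings and single-character delimiter tokens."""
    return [t for t in re.split(r"([\[\]{}])", text) if t]

def build_tag_tree(tokens):
    """Each open delimiter starts a tag; each close delimiter ends
    the tag most recently started."""
    root = {"open": None, "close": None, "children": []}
    stack = [root]
    for tok in tokens:
        if tok in ("[", "{"):
            tag = {"open": tok, "close": None, "children": []}
            stack[-1]["children"].append(tag)
            stack.append(tag)
        elif tok in ("]", "}"):
            if len(stack) == 1:   # nothing is open: an orphaned delimiter
                root["children"].append(
                    {"error": "UNEXPECTED-TAGGING", "delimiter": tok})
            else:
                stack.pop()["close"] = tok
        else:
            stack[-1]["children"].append(tok)
    return root

TYPES = {("[", "}"): "start", ("{", "]"): "end", ("[", "]"): "empty"}

def tag_type(tag):
    """Infer the tag type from its delimiters. Meant for outer tags only:
    the inner shell of an atom is not classified on its own."""
    pair = (tag.get("open"), tag.get("close"))
    if pair == ("{", "}"):
        kids = tag["children"]
        inner_ok = (len(kids) == 1 and isinstance(kids[0], dict)
                    and (kids[0].get("open"), kids[0].get("close")) == ("{", "}"))
        return "atom" if inner_ok else "error"
    return TYPES.get(pair, "error")

tree = build_tag_tree(tokenize("[l [n}145{]}On his own farm.{l]"))
print([tag_type(c) for c in tree["children"] if isinstance(c, dict)])
```

Keeping the typing step separate from tree building mirrors the pipeline's preference for small, inspectable stages over a single combined pass.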

Checking LMNL syntax for well-formedness

Rather than stop processing, the pipeline currently emits error elements when it encounters problems, with codes identifying the issue. This appears to work well.

In addition, more precise diagnostics are performed by applying Schematron validation to particular steps in the pipeline. (This is implemented with a second XProc pipeline specification that imports the main one, applies Schematron schemas to the results of two of Luminescent's intermediate formats, aggregates their results together and formats them.) For example, using Schematron it is easy to check whether all start tags have matching end tags or vice-versa, or that range or annotation names follow their rules. Because the intermediate formats carry forward information on the location of tagging in the original LMNL syntax instance, Schematron can report the locations of tagging found to be problematic.

This is especially important since LMNL syntax becomes hard to read as the markup becomes more complex.^[2] For example, here is a malformed instance:

[excerpt [source}The Housekeeper{source] [author}Robert Frost{author]]}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]

(The error occurs at the end of the first line, where an extra ] appears before the } ending the start tag.)

Schematron reports this:

Error UNEXPECTED-TAGGING reported for } at 1:71,
  C:\Projects\LMNL\Luminescent\lmnl\frost-quote.lmnl
No start tag matches end tag {excerpt] at 5:1,
  C:\Projects\LMNL\Luminescent\lmnl\frost-quote.lmnl

The processor has taken the mistaken ], as it must, as the end of the tag; and since it therefore makes an empty range marker, the end tag that is supposed to match it is found to have no start tag.

The two errors are detected differently. The first error is reported for any tag delimiter that can't be matched with a corresponding delimiter of the opposite kind (start or end). The second is reported for the failure to follow the constraint that all start tags must have end tags and vice versa.

The line numbers and offsets reported (1:71 and 5:1) correctly locate the problems; character 71 of line 1 is the location of the orphaned tag close delimiter } (which would have closed a start tag had the ] character not intervened), while line 5 character 1 is where the orphaned end tag is located.
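The reported positions can be checked by hand with a small Python sketch. It assumes 1-based line and character numbers, which is consistent with the report above, and `line_col` is a hypothetical helper, not part of Luminescent.

```python
DOC = "\n".join([
    "[excerpt [source}The Housekeeper{source] [author}Robert Frost{author]]}",
    "[s}[l [n}144{n]}He manages to keep the upper hand{l]",
    "[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]",
    "[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]",
    "{excerpt]",
])

def line_col(text, index):
    """Convert a 0-based string index into 1-based (line, character)."""
    line = text.count("\n", 0, index) + 1
    col = index - text.rfind("\n", 0, index)  # rfind gives -1 on the first line
    return line, col

orphan_brace = DOC.index("]]}") + 2      # the } following the stray ]
orphan_end_tag = DOC.index("{excerpt]")
print(line_col(DOC, orphan_brace))
print(line_col(DOC, orphan_end_tag))
```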

Working with the model: prototype LMNL applications

Currently I have several processes running with xLMNL as source. Some of these are tuned to particular tag sets, while others are generic. A selection is offered in place of presentation slides for this paper (the zipped package contains a mix of HTML, XML and SVG and can be reviewed starting from index.html using any current web browser).

A generic diagnostic stylesheet can report which range types overlap with which other range types. (This is most useful to know for process customization.)
XML can be extracted from xLMNL dynamically, using a parameterized listing of range types to be reflected as a hierarchy of XML elements. Ranges of these types are promoted into XML elements; their annotations become, when they have simple values, XML attributes. Ranges not among these types, and annotations that are not cast to attributes, become XML elements representing range delimiters (tags) or annotation structures. Spans of text are kept with pointers to the ranges in which they participate, when these have not been cast to ancestor elements.

This process can be run independently, but its functionality is also available dynamically as a function call in XSLT, operating on any xLMNL document or annotation (or a subset of spans from within a document or annotation, perhaps those associated with a given range) and casting it into XML.

This is also a generic process, although the particular ranges to be converted into XML elements are passed in at run time.
SVG graphs and HTML renditions can be generated to display and depict LMNL documents. These transformations, to be sure, are not always trivial; but their difficulties are greatly mitigated by the XML extraction process just mentioned, used to cast LMNL into intermediate XML formats (hierarchical views of the LMNL).

These are not generic processes, since of course particular displays are optimized for particular tagging semantics, but some of them do rely on imported functionalities implemented generically (such as the logic that generates SVG bubble graphs), so that this logic can be shared.
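The first of these processes, reporting which range types overlap with which, is simple enough to sketch in Python. This is a hedged illustration, not the diagnostic stylesheet itself: the function names are hypothetical, and the offsets for the Frost excerpt are read off the span boundaries shown in Appendix A. Two ranges are taken to overlap when they intersect and neither contains the other.

```python
from itertools import combinations

def overlaps(a, b):
    """True when ranges (name, start, end) intersect without nesting."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def overlap_report(ranges):
    """Return the sorted pairs of range names found to overlap."""
    found = set()
    for a, b in combinations(ranges, 2):
        if overlaps(a, b):
            found.add(tuple(sorted((a[0], b[0]))))
    return sorted(found)

FROST = [
    ("excerpt", 0, 123),
    ("s", 1, 51), ("s", 52, 62), ("s", 63, 122),
    ("l", 1, 34), ("l", 35, 78), ("l", 79, 122),
]
print(overlap_report(FROST))
```

The pairwise comparison is quadratic in the number of ranges, which is acceptable for a small demonstration but would need an interval index for larger documents.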

Links to demonstrations are provided in Appendix C.

Reflections

I can make no pretense as to the efficiency or scalability of this approach. So far, it has only worked well enough for my purposes: to demonstrate its feasibility in principle, and to test the specifications. While it has performed adequately well on documents up to several hundred Kb in size, and experience suggests that processing bottlenecks for Luminescent are actually more likely coming out of xLMNL rather than into it, I have no data to confirm my intuitions here. There does appear to be a rich and interesting set of problems at hand.

Nevertheless, if nothing else, this exercise has suggested some very interesting things about markup technologies beyond XML. One of the keys appears to be the separation of the parsing of the syntax from the construction of the model; so the parse tree is a tree only of the tags, from which the document model is derived by a different process. (The parse itself works like a parse of S-expressions, in which open and close delimiters are recursively parsed into tags.^[3]) In this view of things, machine-automated text processing can support a very different form of document description than that provided by the operational semantics of XML, which in order to build a document model from the markup in a single pass, must limit itself to a syntax in which not just tags but the element structure itself can be described by a context-free grammar.^[4] Thus its document models are limited to trees and to graphs projected over that tree [Bos 2005]. While not, formally, more expressive than XML markup (since graphs projected over a tree can express the same relations as LMNL markup, as indeed they do in xLMNL or other XML-based representations of LMNL), LMNL markup is practically so; it can get closer to the text than XML does, inasmuch as in order to fit within its own rules, XML's representation of a document (or at any rate, of a document in which overlapping structures or features, or structured annotations, are represented) is always getting in its own way.

Related to this is another aspect of this work: this parsing or compiling process does not assume a single depth-first traversal of structures implicit in the syntax, and so does not perform a single pass over the data. Instead, it considers that the entire text is available to the parser at once, and works by applying several distinct heuristic operations in sequence: first tags are inferred from delimiting tokens; then different types of tags (open, close, empty or atom) are recognized; then open/close pairs are matched, etc. Whether this technique is very novel or interesting, or how it relates to (or evades, or complicates) classic problems in text processing, I am not highly qualified to say. Yet it might be interesting for the sole reason that it serves as a proof of concept for generalized plain text processing in XSLT.

What I as a markup user find most remarkable, however, is what happens once a tool chain like this is in place. XML practitioners, I think, or at least those of us who work with structurally complex texts, are familiar with a conflict between the wish to describe our information accurately, capably and gracefully, and the need to force everything into a single hierarchy of elements – for reasons having nothing to do with the purposes of the markup, but only because the processing infrastructure insists on it, behind the scenes, before work has even begun. This conflict is apparent every time we work with (or must develop) a schema that has to make design compromises in order to address a requirement to represent things that overlap, introducing one or more of the well-worn but cumbersome workarounds for doing so. Sometimes we are faced with truly vexing problems in tagging, and even in the best case, having to use workarounds generates a certain amount of mental background noise. When working with LMNL markup, all this clamor is silenced. Even in small demonstrations, I am finding it liberating to be able to mark exactly what I wish to describe, with concern only for its clearest denotation in tags and its fidelity to what I want to represent in the text. If this is possible at all (and it evidently is), XML's early commitment to a single tree representation of something as complex as a text (meaning that word in the sense that literary scholars do, with everything it entails) appears to be a premature optimization – in other words, not always an optimization at all. When tags in plain text can be used to represent whatever structures in and features of text we care to discover, irrespective of whether they fit easily into a single tree-shaped model, then the potentials of markup are magnified immensely. We have only just started to explore the possibilities.

Appendix A. xLMNL example

LMNL syntax:

[excerpt}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt   
  [source [date}1915{][title}The Housekeeper{]]
  [author
    [name}Robert Frost{]
    [dates}1874-1963{]] ]

Compiled into xLMNL

White space is added for legibility, and LF characters in the data are indicated with &#xA;.

<?xml version="1.0" encoding="UTF-8"?>
<x:document xmlns:x="http://lmnl-markup.org/ns/xLMNL" ID="N.d1e1"
  base-uri="file:/c:/Projects/LMNL/Luminescent/lmnl/frost-example.lmnl">
  <x:content>
    <x:span start="0" end="1" layer="N.d1e1" ranges="R.d1e2">&#xA;</x:span>
    <x:span start="1" end="34" layer="N.d1e1" ranges="R.d1e2 R.d1e5 R.d1e6">He manages to keep the upper hand</x:span>
    <x:span start="34" end="35" layer="N.d1e1" ranges="R.d1e2 R.d1e5">&#xA;</x:span>
    <x:span start="35" end="51" layer="N.d1e1" ranges="R.d1e2 R.d1e5 R.d1e15">On his own farm.</x:span>
    <x:span start="51" end="52" layer="N.d1e1" ranges="R.d1e2 R.d1e15"> </x:span>
    <x:span start="52" end="62" layer="N.d1e1" ranges="R.d1e2 R.d1e15 R.d1e25">He's boss.</x:span>
    <x:span start="62" end="63" layer="N.d1e1" ranges="R.d1e2 R.d1e15"> </x:span>
    <x:span start="63" end="78" layer="N.d1e1" ranges="R.d1e2 R.d1e15 R.d1e31">But as to hens:</x:span>
    <x:span start="78" end="79" layer="N.d1e1" ranges="R.d1e2 R.d1e31">&#xA;</x:span>
    <x:span start="79" end="122" layer="N.d1e1" ranges="R.d1e2 R.d1e31 R.d1e37">We fence our flowers in and the hens range.</x:span>
    <x:span start="122" end="123" layer="N.d1e1" ranges="R.d1e2">&#xA;</x:span>
  </x:content>
  <x:range start="0" end="123" ID="R.d1e2" sl="1" so="1" name="excerpt" el="9" eo="25">
    <x:annotation ID="N.d1e49" sl="6" so="3" el="6" eo="47" name="source">
      <x:annotation ID="N.d1e50" sl="6" so="11" el="6" eo="22" name="date">
        <x:content>
          <x:span start="0" end="4" layer="N.d1e50">1915</x:span>
        </x:content>
      </x:annotation>
      <x:annotation ID="N.d1e53" sl="6" so="23" el="6" eo="46" name="title">
        <x:content>
          <x:span start="0" end="15" layer="N.d1e53">The Housekeeper</x:span>
        </x:content>
      </x:annotation>
      <x:content/>
    </x:annotation>
    <x:annotation ID="N.d1e56" sl="7" so="3" el="9" eo="23" name="author">
      <x:annotation ID="N.d1e57" sl="8" so="5" el="8" eo="24" name="name">
        <x:content>
          <x:span start="0" end="12" layer="N.d1e57">Robert Frost</x:span>
        </x:content>
      </x:annotation>
      <x:annotation ID="N.d1e60" sl="9" so="5" el="9" eo="22" name="dates">
        <x:content>
          <x:span start="0" end="9" layer="N.d1e60">1874-1963</x:span>
        </x:content>
      </x:annotation>
      <x:content/>
    </x:annotation>
  </x:range>
  <x:range start="1" end="51" ID="R.d1e5" sl="2" so="1" name="s" el="3" eo="32"/>
  <x:range start="1" end="34" ID="R.d1e6" sl="2" so="4" name="l" el="2" eo="52">
    <x:annotation ID="N.d1e7" sl="2" so="7" el="2" eo="15" name="n">
      <x:content>
        <x:span start="0" end="3" layer="N.d1e7">144</x:span>
      </x:content>
    </x:annotation>
  </x:range>
  <x:range start="35" end="78" ID="R.d1e15" sl="3" so="1" name="l" el="3" eo="71">
    <x:annotation ID="N.d1e16" sl="3" so="4" el="3" eo="12" name="n">
      <x:content>
        <x:span start="0" end="3" layer="N.d1e16">145</x:span>
      </x:content>
    </x:annotation>
  </x:range>
  <x:range start="52" end="62" ID="R.d1e25" sl="3" so="34" name="s" el="3" eo="49"/>
  <x:range start="63" end="122" ID="R.d1e31" sl="3" so="51" name="s" el="4" eo="62"/>
  <x:range start="79" end="122" ID="R.d1e37" sl="4" so="1" name="l" el="4" eo="59">
    <x:annotation ID="N.d1e38" sl="4" so="4" el="4" eo="12" name="n">
      <x:content>
        <x:span start="0" end="3" layer="N.d1e38">146</x:span>
      </x:content>
    </x:annotation>
  </x:range>
</x:document>
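
As an illustration of how this representation can be consumed, the following Python sketch (not part of Luminescent) reads an xLMNL document with the standard library's ElementTree, recovers the text of each range by joining the spans that list its ID, and reports pairs of ranges that overlap. The embedded document is a hypothetical reduced fragment modelled on the example above, with its own offsets and IDs; the function names are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

X = "{http://lmnl-markup.org/ns/xLMNL}"  # xLMNL namespace, in ElementTree notation

# Hypothetical reduced fragment: two sentences (s) overlapping two lines (l).
XLMNL = """<x:document xmlns:x="http://lmnl-markup.org/ns/xLMNL" ID="N.1">
  <x:content>
    <x:span start="0" end="33" layer="N.1" ranges="R.1 R.2">He manages to keep the upper hand</x:span>
    <x:span start="33" end="34" layer="N.1" ranges="R.1">&#xA;</x:span>
    <x:span start="34" end="50" layer="N.1" ranges="R.1 R.3">On his own farm.</x:span>
    <x:span start="50" end="51" layer="N.1" ranges="R.3"> </x:span>
    <x:span start="51" end="61" layer="N.1" ranges="R.3 R.4">He's boss.</x:span>
  </x:content>
  <x:range start="0" end="50" ID="R.1" name="s"/>
  <x:range start="0" end="33" ID="R.2" name="l"/>
  <x:range start="34" end="61" ID="R.3" name="l"/>
  <x:range start="51" end="61" ID="R.4" name="s"/>
</x:document>"""


def range_texts(xml_text):
    """Map each range ID to (name, text), joining the spans that list it."""
    root = ET.fromstring(xml_text)
    spans = root.findall(f"{X}content/{X}span")
    result = {}
    for rng in root.findall(f"{X}range"):
        rid = rng.get("ID")
        text = "".join(s.text or "" for s in spans
                       if rid in (s.get("ranges") or "").split())
        result[rid] = (rng.get("name"), text)
    return result


def overlapping_pairs(xml_text):
    """List (a, b) where range b starts inside range a and ends after it."""
    root = ET.fromstring(xml_text)
    bounds = [(r.get("ID"), int(r.get("start")), int(r.get("end")))
              for r in root.findall(f"{X}range")]
    return [(a, b) for a, sa, ea in bounds for b, sb, eb in bounds
            if sa < sb < ea < eb]


for rid, (name, text) in range_texts(XLMNL).items():
    print(rid, name, repr(text))
print(overlapping_pairs(XLMNL))
```

Because every span already records the ranges that cover it, neither function needs to reason about tag order: overlap shows up as plain arithmetic on offsets.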

Appendix B. RNC schema for xLMNL

namespace x = "http://lmnl-markup.org/ns/xLMNL"

start =
  element x:document {
    document-model }

document-model =
    attribute base-uri { xsd:anyURI }?,
    attribute ID { xsd:ID },
    attribute name { xsd:QName }?,
    debug-support?,
    (annotation | comment)*,
    ( content,
      range*,
      (annotation | comment)*)?
    
annotation =
  element x:annotation {
    document-model }

content =
  element x:content {
    element x:span {
      attribute layer { xsd:IDREF },
      attribute ranges { xsd:IDREFS }?,
      attribute start { xsd:integer },
      attribute end { xsd:integer },
      (text
       | element x:atom {
           attribute name { xsd:NCName },
           debug-support?,
           annotation*
         }
       | comment )+
    }*
  }
range =
  element x:range {
    attribute ID { xsd:ID },
    attribute name { xsd:NCName }?,
    attribute start { xsd:integer },
    attribute end { xsd:integer },
    debug-support?,
    (annotation | comment)*
  }

comment =
  element x:comment { 
    debug-support?,
    text }
    
    
debug-support =
    attribute sl { xsd:integer },
    attribute so { xsd:integer },
    attribute el { xsd:integer },
    attribute eo { xsd:integer }

A full specification for xLMNL would include constraints not captured by this RNC schema: offsets (start and end attributes) must be whole numbers (positive integers or 0); the value of end must be greater than or equal to the value of start on the same range; the difference between the start and end of a span (its length) must equal its string length plus the count of its atom children; referential integrity must be maintained between spans, ranges and layers (limina); and so forth.
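
Constraints of this kind could be checked in a separate pass over the xLMNL document. The following Python sketch is one possible reading of the constraints listed above, not the validation that Luminescent itself performs: the function name and message wording are assumptions, the end-not-before-start test is applied to spans as well as ranges, and string length is counted in Python characters (code points).

```python
import xml.etree.ElementTree as ET

X = "{http://lmnl-markup.org/ns/xLMNL}"  # namespace declared in the schema above


def check_xlmnl(xml_text):
    """Return a list of messages for violated constraints (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    # Layers (limina) are the document and every annotation.
    layer_ids = {e.get("ID") for e in root.iter()
                 if e.tag in (X + "document", X + "annotation")}
    range_ids = {e.get("ID") for e in root.iter(X + "range")}

    for el in root.iter():
        if el.tag not in (X + "span", X + "range"):
            continue
        label = f"{el.tag[len(X):]} {el.get('ID') or el.get('start')}"
        try:
            start, end = int(el.get("start")), int(el.get("end"))
        except (TypeError, ValueError):
            problems.append(f"{label}: start and end must be integers")
            continue
        if start < 0 or end < 0:
            problems.append(f"{label}: offsets must be whole numbers")
        if end < start:
            problems.append(f"{label}: end precedes start")
        if el.tag == X + "span":
            # Count only text directly inside the span (its own text plus
            # the tails of its children), so comment text is excluded.
            chars = len(el.text or "") + sum(len(c.tail or "") for c in el)
            atoms = sum(1 for c in el if c.tag == X + "atom")
            if end - start != chars + atoms:
                problems.append(f"{label}: length does not match content")
            if el.get("layer") not in layer_ids:
                problems.append(f"{label}: unknown layer")
            for ref in (el.get("ranges") or "").split():
                if ref not in range_ids:
                    problems.append(f"{label}: unknown range {ref}")
    return problems


# Hypothetical test documents, made up for this illustration.
GOOD = """<x:document xmlns:x="http://lmnl-markup.org/ns/xLMNL" ID="N.1">
  <x:content>
    <x:span start="0" end="3" layer="N.1" ranges="R.1">He </x:span>
    <x:span start="3" end="8" layer="N.1" ranges="R.1 R.2">rules</x:span>
  </x:content>
  <x:range start="0" end="8" ID="R.1" name="s"/>
  <x:range start="3" end="8" ID="R.2" name="w"/>
</x:document>"""
# Break the second span: wrong length and a reference to a missing range.
BAD = GOOD.replace('end="8" layer="N.1" ranges="R.1 R.2"',
                   'end="9" layer="N.1" ranges="R.1 R.9"')

print(check_xlmnl(GOOD))
print(check_xlmnl(BAD))
```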

Appendix C. Demonstrations and source code

A demonstration showing results of the Luminescent pipeline accompanies this paper, in the Slides and Materials linked in the Proceedings. Unzip the package and open index.html, which describes the examples and presents links for examining them.

Many browsers will now attempt to render the SVG examples, and may do a reasonable job. But the best results will be obtained from a fully conformant SVG viewer that supports panning and zooming to arbitrary levels of scale. (Most browsers will not zoom in as far as you may want to go.) Apache Squiggle (distributed with Batik) is recommended.

Source code for Luminescent is available on GitHub, at https://github.com/wendellpiez/Luminescent.

References

[Bos 2005] Bos, Bert. The XML data model. 2005. See http://www.w3.org/XML/Datamodel.html

[Cayless and Soroka 2010] Cayless, Hugh A., and Adam Soroka. On Implementing string-range() for TEI. Presented at Balisage: The Markup Conference 2010 (Montréal, Canada, August 3 - 6, 2010). In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:https://doi.org/10.4242/BalisageVol5.Cayless01.

[DeRose 2004] DeRose, Steven. Markup Overlap: A Review and a Horse. Presented at Extreme Markup Languages 2004 (Montréal, Canada).

[Durusau and O'Donnell n.d.] Durusau, Patrick, and Matthew Brook O'Donnell. JITTs (Just-in-time Trees). http://www.durusau.net/publications/NY_xml_sig.pdf.

[lmnl-markup.org] LMNL-markup.org. See http://www.lmnl-markup.org.

[Piez 2004] Piez, Wendell. Half-steps toward LMNL. Presented at Extreme Markup Languages 2004 (Montréal, Canada). See http://www.piez.org/wendell/papers/LMNL-halfsteps.pdf.

[Piez 2010] Piez, Wendell. Towards Hermeneutic Markup: An architectural outline. Presented at Digital Humanities 2010 (London, England). See http://www.piez.org/wendell/papers/dh2010/index.html.

[Portier and Calabretto 2009] Portier, Pierre-Édouard, and Sylvie Calabretto. “Methodology for the construction of multi-structured documents.” Presented at Balisage: The Markup Conference 2009 (Montréal, Canada, August 11 - 14, 2009). In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Portier01.

[Portier and Calabretto 2010] Portier, Pierre-Édouard, and Sylvie Calabretto. “Multi-structured documents and the emergence of annotations vocabularies.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:https://doi.org/10.4242/BalisageVol5.Portier01.

[Pondorf and Witt 2010] Pondorf, Denis, and Andreas Witt. Freestyle Markup Language: Specification of an intuitive, powerful, polyhierarchical new extensible markup language. Presented at Balisage: The Markup Conference 2010 (Montréal, Canada, August 3 - 6, 2010). In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:https://doi.org/10.4242/BalisageVol5.Pondorf01.

[Schmidt 2010] Schmidt, Desmond. The inadequacy of embedded markup for cultural heritage texts. In Literary and Linguistic Computing (2010) 25 (3): 337-356. doi:https://doi.org/10.1093/llc/fqq007.

[Sperberg-McQueen and Huitfeldt 1999] Sperberg-McQueen, Michael, and Claus Huitfeldt: "Concurrent Document Hierarchies in MECS and SGML". In Literary and Linguistic Computing (1999) 14, pp 29-42. doi:https://doi.org/10.1093/llc/14.1.29.

[Stegmann and Witt 2009] Stegmann, Jens, and Andreas Witt. TEI Feature Structures as a Representation Format for Multiple Annotation and Generic XML Documents. Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Stegmann01.

[Stührenberg and Jettka 2009] Stührenberg, Maik, and Daniel Jettka. A toolkit for multi-dimensional markup: The development of SGF to XStandoff. Presented at Balisage: The Markup Conference 2009 (Montréal, Canada, August 11 - 14, 2009). In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01.

[Tennison and Piez 2002] Tennison, Jeni, and Wendell Piez. The Layered Markup and Annotation Language (LMNL). Presented at Extreme Markup Languages 2002 (Montréal, Canada).

[XDM] Berglund, Anders, Mary Fernández, Ashok Malhotra, Jonathan Marsh, Marton Nagy, and Norman Walsh, eds. XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition) W3C Recommendation 14 December 2010. http://www.w3.org/TR/xpath-datamodel/.

[XML Infoset] Cowan, John, and Richard Tobin, eds. XML Information Set (Second Edition). W3C Recommendation 4 February 2004. http://www.w3.org/TR/xml-infoset/.

[XML Recommendation] Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau, eds. Extensible Markup Language (XML) 1.0 (Fifth Edition) W3C Recommendation 26 November 2008. http://www.w3.org/TR/REC-xml/.

^[1] This raises the question whether characters can be represented with atom syntax, whether they can be annotated, and so forth.

The character A may indeed be represented as {{#x41}} (using a shorthand reference) or {{lmnl:char [codepoint}41{]}} using a reserved name for the atom with an annotation to identify it. But add another annotation to the latter form and it will not map back again. (It would be an annotated character, and as such could not be represented in a Unicode serialization by itself.)

^[2] This is a problem for which embedded markup, of course, has no built-in solution (as Desmond Schmidt has pointed out, Schmidt 2010) other than using only tag sets that do not permit complexity – a high price to pay (a baby for less bath water), and not the idea at all. Of course, the syntax is not ultimately the point of the LMNL model (which might be supported in all kinds of different interfaces) but only a means to an end.

^[3] In fact the initial insight that led to the development of this pipeline was that if one were to perform simple string substitutions as follows, the result would be S-expression-like:

[ and { (open tag delimiters) become ([ and ({
] and } become ]) and })

Performing this substitution on this text:

[poem [by}Apollinaire{]}Et [red}l'unique [gold}cordeau{red]
  des [green}trompettes{gold] marines{green]{poem]

we get:

([poem ([by})Apollinaire({])})Et ([red})l'unique ([gold})cordeau({red])
  des ([green})trompettes({gold]) marines({green])({poem])

Here, each parenthetical expression represents a tag.
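
The substitution is simple enough to sketch in a few lines of Python. This only illustrates the string rewriting described in this note and is not the code used in the pipeline; the function name is hypothetical.

```python
# Each LMNL delimiter is rewritten so that every tag ends up wrapped in
# parentheses: open delimiters gain a leading "(", close delimiters a
# trailing ")".
SUBSTITUTIONS = str.maketrans({
    "[": "([",
    "{": "({",
    "]": "])",
    "}": "})",
})


def parenthesize_tags(lmnl):
    """Apply the delimiter substitution to a string of LMNL syntax."""
    return lmnl.translate(SUBSTITUTIONS)


sample = (
    "[poem [by}Apollinaire{]}Et [red}l'unique [gold}cordeau{red]\n"
    "  des [green}trompettes{gold] marines{green]{poem]"
)
result = parenthesize_tags(sample)
print(result)
```

Since each tag contributes exactly one opening and one closing delimiter, the parentheses in the result balance whenever the tags themselves are complete, even where the ranges they mark overlap.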

^[4] Thus the XML Recommendation has a well-formedness constraint (http://www.w3.org/TR/REC-xml/#GIMatch in XML Recommendation) on an XML document that is not, in itself, a definition of syntax, but only a restriction on the way it may be used: end tags must have the same name as the most recent unclosed start tag (the GI matching constraint). (The reason this is not a definition of syntax is because syntactically, an end tag is an end tag irrespective of whether it matches the most recent start tag; so this rule is not for the integrity of the syntax qua syntax, but rather in order that a second tree may be built out of the syntax parse tree.) In connection with the production for element (http://www.w3.org/TR/REC-xml/#NT-element), this is how XML is able to bridge from well-formedness to its set of validity constraints – something still undefined for LMNL. To be sure, formally speaking validation is optional in XML, and systems that validate XML not in the sense of the Recommendation (which entails a DTD) but using other models for validation have been implemented several times (and in several different ways) since the Recommendation was published in 1998.

While the GI matching constraint is suspended for LMNL, the question remains how a validation technology can be developed for a range model rather than for a graph such as this constraint makes possible. But XML and LMNL both demonstrate that processing can occur with only implicit validation in the application of a markup language.
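
To make the contrast concrete, here is a small Python sketch of the two checks over a sequence of tag events. The first function implements the GI matching constraint as stated above. The second is only an assumed relaxation for illustration (an end tag may close any open range of the same name) and should not be read as a definition of LMNL well-formedness. The event encoding and function names are hypothetical.

```python
from collections import Counter


def gi_matches(tags):
    """XML rule: each end tag must name the most recent unclosed start tag."""
    stack = []
    for kind, name in tags:
        if kind == "start":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack  # nothing may remain open


def ranges_close(tags):
    """Assumed relaxed rule: an end tag closes some open range of that name."""
    open_counts = Counter()
    for kind, name in tags:
        if kind == "start":
            open_counts[name] += 1
        elif open_counts[name] == 0:
            return False
        else:
            open_counts[name] -= 1
    return not any(open_counts.values())


# Tag events of the Apollinaire example in note 3 (annotation omitted).
OVERLAP = [("start", "poem"), ("start", "red"), ("start", "gold"),
           ("end", "red"), ("start", "green"), ("end", "gold"),
           ("end", "green"), ("end", "poem")]
NESTED = [("start", "poem"), ("start", "red"), ("end", "red"), ("end", "poem")]

print(gi_matches(OVERLAP), ranges_close(OVERLAP))
print(gi_matches(NESTED), ranges_close(NESTED))
```

The stack in the first function is what lets an XML parser build a tree as it goes; the second keeps only counts, which is enough to confirm that every range is closed but yields no hierarchy.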

