Note

This paper was inspired in part by Sam Wilmott's 1993 internal report, Beyond SGML[w93]. I also want to thank my colleague Jacques Légaré for his valuable comments and clarifications, and Stilo International for giving me time to do interesting work.

Introduction

SGML had this feature called SHORTREF. It allowed the DTD designer to specify that certain strings called shortrefs should in some contexts be interpreted as markup tags. For the authors using an SGML DTD with a well-designed set of shortrefs, the effect was similar to using a kind of Wiki markup.

As with other parts of SGML, the specification syntax for shortrefs was idiosyncratic.[s86] Furthermore, the method of their specification typically relied on some other rarely-used features of SGML DTDs, such as STARTTAG entities. This combination ensured that only an expert in SGML DTDs could hope to design shortrefs correctly, so they remained obscure and rarely used. When SGML was replaced by its simplified successor XML, nobody regretted their omission.

Or did they?

Many people stubbornly refuse to abandon their non-XML syntaxes. Programming language designers still use the old-fashioned EBNF grammars[b59] in their specifications instead of XML Schema. Even some languages that are at the very core of various XML technologies, such as XPath, are not XML. The RELAX NG schema language, though specified in XML syntax[c01], defines a non-XML compact syntax[c02c] as well.

The strongest evidence of yearning for shortrefs, however, is the myriad of Wiki languages in existence. Here we have a large family of actual markup languages, whose main purpose is to be converted to HTML, another markup language, and still they are not fully tagged XML. SGML DTDs with shortrefs and appropriate declarations could accomplish the task.[j04] Instead, Wiki engines typically store their pages as plain text, parse them using hand-coded parsers written in various general-purpose languages, and convert them directly to HTML for presentation.[b07]

There are many downsides to this architecture. Most Wiki pages are stored unvalidated and unstructured, which makes them suboptimal for searching and very difficult to automatically restructure. They are missing all XML tool chain support. All these problems are judged to be outweighed by the benefit of the special notation. A solution that preserves this notational convenience while keeping markup in XML documents would be a clear winner.

The present paper aims to deliver one solution that satisfies these criteria: given a relatively simple syntax specification that follows the established standards, it allows the author to create valid XML without entering XML tags. In other words, it resurrects SGML shortrefs in a more modern context of well-formed XML and RELAX NG schema specifications.

RELAX NG schema as a grammar

If our job is to specify how some text is to be parsed, one obvious place to start is from grammars, or more specifically context-free grammars; they have been successfully used for this purpose for more than half a century[b59]. Here is an example of such a grammar for a small fragment of a Wiki markup language, specified in a variant of the EBNF notation:

paragraph  ::= (plain-text | bold | italic)* "\n\n"?
bold       ::= "**" (plain-text | italic)* "**"
italic     ::= "//" (plain-text | bold)* "//"
plain-text ::= ([^\n*/]+ | "\n" [^\n] | "*" [^*] | "/" [^/])+

The plain-text production is rather tricky. This context-free grammar is working directly on plain-text input with no help from any lexical layer, so plain-text has to exclude the three markers (**, //, and the newline) in order to avoid ambiguity. The production would become even more complicated as more markup is added to the grammar.
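To see where the production stops, it can be transcribed into a regular expression and run against a few inputs. The following Python sketch does just that; the helper name plain_text_prefix is ours and is not part of any tool described in this paper.

```python
import re

# The plain-text production from the grammar above, transcribed as a regex.
PLAIN_TEXT = re.compile(r"(?:[^\n*/]+|\n[^\n]|\*[^*]|/[^/])+")

def plain_text_prefix(source: str) -> str:
    """Return the prefix of `source` matched by the plain-text production."""
    match = PLAIN_TEXT.match(source)
    return match.group(0) if match else ""

# Matching stops in front of each of the three markers.
print(repr(plain_text_prefix("Here's a **fat")))  # stops before "**"
print(repr(plain_text_prefix("a*b//c")))          # a lone "*" is plain text
print(repr(plain_text_prefix("x\ny\n\nz")))       # stops before the blank line
```

Every new marker would need another exclusion in the character class and another alternative for its proper prefixes, which is the maintenance burden the text refers to.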

Once the input text is parsed according to the grammar, we can represent the resulting abstract syntax tree as XML and use the following compact RELAX NG schema for its validation:

paragraph  = element para { (plain-text | bold | italic)* }
bold       = element bold { (plain-text | italic)* }
italic     = element italic { (plain-text | bold)* }
plain-text = text

The similarities between the two notations above are striking. The main difference is that the former specifies a concrete syntax, and the latter the abstract syntax[m62]. To become concrete, and thus useful for parsing text, the RELAX NG schema needs to specify the string markers, or terminal symbols. We could try the following modification, which brings the schema even closer to the EBNF grammar:

paragraph  = element para {
                (plain-text | bold | italic)*,
                "\x{A}\x{A}"?
             }
bold       = element bold { "**", (plain-text | italic)*, "**" }
italic     = element italic { "//", (plain-text | bold)*, "//" }
plain-text = text

The RELAX NG specification[c01] unfortunately does not allow text-matching and element-matching patterns to be grouped together, and that makes the above schema invalid. To make our concrete-syntax schema syntactically correct, we need to enclose each string marker into an element of its own. These elements will belong to the special terminal namespace so we can distinguish them from the structural elements:

paragraph  = element para {
                (plain-text | bold | italic)*,
                paragraph_separator?
             }
bold       = element bold {
                bold_marker,
                (plain-text | italic)*,
                bold_marker
             }
italic     = element italic {
                italic_marker,
                (plain-text | bold)*,
                italic_marker
             }
plain-text = text

bold_marker         = element terminal:bold_marker { "**" }
italic_marker       = element terminal:italic_marker { "//" }
paragraph_separator = element terminal:paragraph_separator {
                         "\x{A}\x{A}"
                      }

We could also replace the text pattern by string{pattern="([^\n*/]+|\n[^\n]|\*[^*]|/[^/])+"} to replicate the grammar even more closely. As noted above, however, this pattern grows more complex as more markers are added to the grammar, which makes it difficult to maintain. Another downside is that the schema would lose the modularity properties that RELAX NG normally provides.

The plain-text pattern is meant to match any text up to any marker that is allowed in the context. Rather than require the user to construct this pattern every time a new marker is introduced, we can change the meaning of the text pattern to match what we need. In the standard RELAX NG semantics, text matches all text content up to the next element tag; in our modified semantics, it will match all text content until the next marker recognizable in the context, or until the next element tag.
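The modified meaning of text can be illustrated with a short Python sketch. This is not the parser's actual code: the function consume_text and the representation of the recognizable markers as a plain set of strings are assumptions made for illustration, and an element tag is approximated by the character <.

```python
def consume_text(source: str, markers: set) -> tuple:
    """Split `source` at the earliest marker recognizable in the current
    context, or at the next element tag, whichever comes first."""
    positions = [source.find(stop) for stop in markers | {"<"}]
    cut = min((p for p in positions if p != -1), default=len(source))
    return source[:cut], source[cut:]

# Inside a bold element the recognizable markers are the closing "**"
# and the opening "//" of a nested italic element.
consumed, rest = consume_text("fat\nand somewhat //slanted", {"**", "//"})
print(repr(consumed), repr(rest))

# In a context where "**" is not a marker, it is ordinary text.
print(consume_text("2 ** 3 <emphasis>", {"//"}))
```

The same input string can therefore be split differently depending on which markers the schema allows at that point, which is what frees the schema author from writing an exclusion pattern by hand.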

Our parser must construct an abstract syntax tree with element nodes like bold that are not present in the input. To achieve this, we need to add another semantic extension and infer the missing element tags[b10]. This is especially necessary for features like Wiki lists, where a single indented asterisk can denote the beginning of both a list and a list item. This is similar to the OMITTAG feature of SGML, the main difference being that our input must be well-formed XML; the element's start-tag and its end-tag must both be present or both omitted.

The only elements with omissible tags will be those in the terminal namespace and those whose namespace URI begins with the prefix omissible+ (which is perfectly legal according to RFC 2396). In the schema fragment above, the default namespace should be made omissible; in other words, the schema should be preceded by

default namespace = "omissible+http://my.namespace.com/"
namespace terminal = "http://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols#Terminal_symbols"

The elements in the terminal namespace are perfectly ordinary XML elements; what gives them a special meaning is that the parser deletes them from the constructed syntax tree together with their content. The elements with the omissible+ namespace prefix will be kept in the normalized XML output, but their URI prefix will be removed. This stripping of terminal elements and omissible namespace prefixes is the default mode of operation. The parser can also be made to emit all the terminal nodes and keep the omissible namespace prefixes. For the above example schema and the input paragraph

Here's a **fat
and somewhat //slanted
// text**
example.

the default output of the parser is

<para xmlns="http://my.namespace.com/">Here's a <bold>fat
and somewhat <italic>slanted
</italic> text</bold>
example.</para>

and the raw output, if requested, would be

<para
   xmlns="omissible+http://my.namespace.com/"
   xmlns:terminal="http://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols#Terminal_symbols"
>Here's a <bold><terminal:bold_marker>**</terminal:bold_marker>fat
and somewhat <italic><terminal:italic_marker>//</terminal:italic_marker>slanted
<terminal:italic_marker>//</terminal:italic_marker></italic> text<terminal:bold_marker>**</terminal:bold_marker></bold>
example.<terminal:paragraph_separator>

</terminal:paragraph_separator></para>

Both these outputs are well-formed XML and contain no text markers. The former is valid against the original RELAX NG schema, and the latter is valid against the enriched schema. If we want to replicate the behaviour of an SGML DTD, where one can alternate between shortrefs and regular element tags, all we need do is combine the two schemata into one. The cleanest way to accomplish the same effect is to have the concrete-syntax schema include the original one, combining the original definitions with its own. If the original schema was defined in file strict.rng, the extended schema could be defined in a separate file as follows:

default namespace = "omissible+http://my.namespace.com/"
namespace terminal = "http://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols#Terminal_symbols"

include "strict.rng"

paragraph  |= element para {
                 (plain-text | bold | italic)*,
                 paragraph_separator?
              }
bold       |= element bold {
                 bold_marker,
                 (plain-text | italic)*,
                 bold_marker
              }
italic     |= element italic {
                 italic_marker,
                 (plain-text | bold)*,
                 italic_marker
              }
plain-text  = text

bold_marker         = element terminal:bold_marker { "**" }
italic_marker       = element terminal:italic_marker { "//" }
paragraph_separator = element terminal:paragraph_separator {
                         "\x{A}\x{A}"
                      }

Both the default and the raw output (i.e., the abstract and the concrete syntax tree) now conform to the same RELAX NG schema, and we can use any conforming RELAX NG validator to verify this.
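The relationship between the raw and the default output can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. The sketch below is only an illustration of the stripping step described in this section, applied to a shortened version of the example; the function names split_tag and normalize are hypothetical, and the actual parser is not implemented this way.

```python
import xml.etree.ElementTree as ET

TERMINAL_NS = ("http://en.wikipedia.org/wiki/"
               "Terminal_and_nonterminal_symbols#Terminal_symbols")
OMISSIBLE = "omissible+"

def split_tag(tag):
    """Split an ElementTree tag of the form '{uri}local' into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def normalize(elem):
    """Delete terminal elements and strip the omissible+ URI prefix, in place."""
    uri, local = split_tag(elem.tag)
    if uri.startswith(OMISSIBLE):
        elem.tag = "{%s}%s" % (uri[len(OMISSIBLE):], local)
    previous = None  # the last child kept so far
    for child in list(elem):
        if split_tag(child.tag)[0] == TERMINAL_NS:
            # Drop the marker with its content, but keep the text after it.
            tail = child.tail or ""
            if previous is None:
                elem.text = (elem.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            elem.remove(child)
        else:
            normalize(child)
            previous = child
    return elem

RAW = (
    '<para xmlns="omissible+http://my.namespace.com/" xmlns:terminal="%s">'
    "Here's a <bold><terminal:bold_marker>**</terminal:bold_marker>fat"
    "<terminal:bold_marker>**</terminal:bold_marker></bold> example."
    "</para>" % TERMINAL_NS
)
root = normalize(ET.fromstring(RAW))
print(root.tag, "".join(root.itertext()))
```

The only subtle point is that ElementTree stores the text following an element in that element's tail, so the tail of each deleted marker has to be reattached to the preceding node.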

Implementation

The parser for the schema specifications described in the previous section has been implemented in Haskell and can be found at http://hackage.haskell.org/package/concrete-relaxng-parser. It compiles to a standalone executable that requires two file names as arguments: the target RELAX NG schema (with or without any concrete-syntax extensions), and the input XML document.

The implementation of the concrete-syntax parser is based on the RELAX NG reference implementation[c02] with its novel algorithm based on Brzozowski derivatives[b64] [s05], together with some extensions described in our previous work[b10]. In particular, the inference of the missing element tags is the same as in [b10], the only change being its restriction to the set of elements whose namespace URI begins with the string omissible+. The rest of this section will concentrate on details that have not been described elsewhere.

The biggest change from [b10] is in the textDeriv function. Both in the reference validator and in the previous normalizer implementation, this function must match its pattern argument against its entire text node argument. Now a pattern is allowed to consume only a prefix of the current text node, so the Brzozowski derivatives cannot be calculated as easily. One possible solution would be to calculate the derivative character by character, but its performance would be unacceptable. We also considered introducing a lexical layer that separates all possible syntactic markers from the rest of the text, but in the end we settled for a mixed derivative/continuation-passing algorithm. The textDeriv function takes two continuations, one invoked in case the pattern consumes the entire text node and the other in case there is some leftover text. This way each pattern is free to consume as much text as it can match in a single try, and pass the rest to the continuation pattern.
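The two-continuation idea can be sketched as follows. The real implementation is written in Haskell and operates on the full RELAX NG pattern type; this Python sketch is a simplified model under our own assumptions: patterns are encoded as tuples, only three pattern kinds are supported, and the names text_deriv, parse, value, text_until, and group are chosen for illustration.

```python
EMPTY = ("empty",)

def value(literal):
    return ("value", literal)

def text_until(markers):
    return ("text", frozenset(markers))

def group(first, second):
    return second if first == EMPTY else ("group", first, second)

def text_deriv(pattern, node, on_whole, on_leftover):
    """Match `pattern` against a prefix of the text node `node`.

    on_whole(residual) is called if the pattern consumed the entire node;
    residual is what remains to be matched after the node.
    on_leftover(rest) is called if the pattern is finished and text remains.
    Returns None if the pattern cannot match."""
    kind = pattern[0]
    if kind == "empty":
        return on_leftover(node) if node else on_whole(EMPTY)
    if kind == "value":
        literal = pattern[1]
        if not node.startswith(literal):
            return None
        rest = node[len(literal):]
        return on_leftover(rest) if rest else on_whole(EMPTY)
    if kind == "text":
        hits = [node.find(marker) for marker in pattern[1]]
        cut = min((h for h in hits if h != -1), default=len(node))
        # With no marker in sight, text may continue in a later node.
        return on_whole(pattern) if cut == len(node) else on_leftover(node[cut:])
    if kind == "group":
        _, first, second = pattern
        return text_deriv(
            first, node,
            on_whole=lambda residual: on_whole(group(residual, second)),
            on_leftover=lambda rest: text_deriv(second, rest, on_whole, on_leftover))
    raise ValueError(kind)

def parse(pattern, node):
    """Return the residual pattern, or None if the node cannot be matched."""
    return text_deriv(pattern, node,
                      on_whole=lambda residual: residual,
                      on_leftover=lambda rest: None)

BOLD = group(value("**"), group(text_until({"**"}), value("**")))
print(parse(BOLD, "**fat**"))     # the whole pattern is consumed
print(parse(BOLD, "**fat and "))  # text and the closing marker are still owed
```

Each pattern consumes as much of the node as it can in one step and hands the remainder to the next pattern through on_leftover, so no character-by-character derivative is ever computed.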

This technique unfortunately does not implement the interleave patterns properly. If their semantics from the RELAX NG specification were carried over to the text nodes literally, an interleave pattern would have to match any interleaving of the character sequences matched by its two branches. This semantics would be very difficult to implement efficiently, but more importantly, it would probably be useless in practice. Instead, textDeriv implements the interleave pattern as an alternation: one of its branches is matched, followed by the other. This semantics is unfortunately not composable. At this time we must recommend against the use of interleave in concrete syntax definitions. The semantics of interleave across multiple XML elements and text nodes is not affected by this problem.
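The difference between the two readings of interleave on text can be made concrete with a small Python sketch. It simplifies each branch to a fixed string, which is an assumption made for illustration only; the function names are ours.

```python
from functools import lru_cache

def is_interleaving(s, a, b):
    """Literal reading: `s` is any character-level interleaving of `a` and `b`."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(a) and j == len(b):
            return True
        k = i + j  # position in s reached after taking i chars of a, j of b
        return ((i < len(a) and s[k] == a[i] and go(i + 1, j)) or
                (j < len(b) and s[k] == b[j] and go(i, j + 1)))
    return len(s) == len(a) + len(b) and go(0, 0)

def is_alternation(s, a, b):
    """Implemented reading: one branch in full, followed by the other."""
    return s == a + b or s == b + a

for candidate in ("abcd", "cdab", "acbd"):
    print(candidate,
          is_interleaving(candidate, "ab", "cd"),
          is_alternation(candidate, "ab", "cd"))
```

A string such as acbd is accepted by the literal reading but rejected by the alternation.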

Another significant hurdle to overcome in the adaptation of RELAX NG to the task of parsing text is its text pattern. Having been designed for the validation of XML documents, RELAX NG allows the text pattern to match any arbitrary contiguous region of text. The boundaries of this region are determined by the surrounding markup tags. Since we cannot count on these hard boundaries, we must keep track of all syntactic markers that can appear instead of element tags. These markers are divided into two sets, the alternate set and the follow set. The former contains all markers that can begin an alternative to the current pattern, while the latter contains all markers that can appear after the end of the current pattern.

The same approach is applied to data and dataExcept patterns: they are bounded by the next following marker. They consume the longest possible prefix, recognized by the data type, of the text preceding the marker.
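A minimal Python sketch of this rule follows. It assumes, purely for illustration, that a data type can be modelled by a regular expression over its lexical form; the function consume_data and the INTEGER pattern are hypothetical and do not correspond to any datatype library API.

```python
import re

def consume_data(source, markers, lexical):
    """Consume the longest prefix of the text before the next marker that is
    recognized by the data type, modelled here by the regex `lexical`.
    Returns (consumed, rest), or None if no non-empty prefix is recognized."""
    hits = [source.find(marker) for marker in markers]
    bound = min((h for h in hits if h != -1), default=len(source))
    for end in range(bound, 0, -1):  # try the longest candidate first
        if re.fullmatch(lexical, source[:end]):
            return source[:end], source[end:]
    return None

INTEGER = r"[+-]?[0-9]+"
print(consume_data("42**bold**", {"**"}, INTEGER))
print(consume_data("12abc**", {"**"}, INTEGER))
print(consume_data("abc", {"**"}, INTEGER))
```

The marker only sets an upper bound on what the data pattern may consume; the data type itself decides how much of that region is actually taken.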

Whitespace is for the most part handled the same as all other text. The only two exceptions are that the whitespace consumption does not affect the alternate set and follow set of syntactic markers, and that any amount of whitespace can precede an explicit element tag. The latter feature follows the behaviour of the standard RELAX NG validator, which ignores whitespace between elements.

Results and future directions

The presented RELAX NG extension could be applied to many RELAX NG schemata and used to shorten their instances. Whether it should be applied to any particular schema depends mostly on outside factors like the target audience and document corpus. There are also, however, several technical factors that must be taken into consideration.

  • Syntactic markers can only be used to infer element tags without any specified attributes. This shortcoming is partly a consequence of the inability to specify fixed attribute values in RELAX NG, and could potentially be remedied by future extensions.

  • While a schema extended with syntactic markers and omissible element tags can replicate most common uses of the SGML SHORTREF feature, it is a fundamentally different mechanism. A SHORTREF can expand to any general entity, which is free to include multiple elements with specified attributes and arbitrary content. A syntactic marker serves only to tell the parser which omissible elements should be inferred, and these inferred elements are the only possible addition to the parsed output.

  • SGML derives some benefit from being a large and integrated specification. In particular, we can offer no equivalent to the SGML USEMAP declaration, which can activate an arbitrary set of shortrefs at any position in the document, or turn them all off. Since our input is well-formed XML, we could instead introduce special processing instructions that affect the parser's behaviour. The main obstacle currently is that the RELAX NG infrastructure normalizes the XML input, removing all processing instructions prior to validation and parsing. CDATA marked sections are also normalized away, which presents an even more serious problem because the parser may infer elements within them.

  • The current performance of the parser is sufficient for authoring documents with syntactic markers and occasional one-off conversion to a fully tagged instance, but it would impose a significant overhead in a repeatedly invoked markup-processing pipeline. The worst-case performance of any parser implementation will depend on the details of the schema; since RELAX NG does not impose LL(1) or similar constraints, neither do we.

  • A judicious use of syntactic markers can ease XML document authoring in a text editor. Their benefits would be diminished if used with an XML editor; they could even degrade the experience in this context.

  • There is currently no support for automatic inference of the desired element nesting level, such as Wiki markup performs based on the indentation of list item bullets. To allow an element to be nested within itself, the schema must specify a different syntactic marker for each element nesting level. Alternatively, one can always nest explicit element tags.

  • On the positive side, the concrete-syntax schema can be as modular as a regular, abstract-syntax RELAX NG schema. It is possible to experiment with multiple different concrete syntaxes for the same abstract syntax, for example, or vice versa.

  • The parser translates an XML document from concrete to abstract syntax. There is currently no tool support for performing the reverse translation. This would be a problem for any deployment scenario that allows a document to be edited in both its explicitly tagged and its concrete-syntax variants.

As a proof of concept, the present paper has been written in concrete syntax and translated to the abstract syntax conforming with the target schema. The concrete-syntax schema extension is given in Appendix A.

The sample schema extension modifies seven elements: code, emphasis, listitem, para, programlisting, quote, and title. Their tags are made omissible in all contexts where they can occur, with the exception of emphasis, which must be explicitly tagged inside programlisting and inside an inferred emphasis. Each of the seven elements is also given a concrete syntax with different terminal symbols. Authored with the full use of these extensions, the present paper contains a total of 141 element tags, mostly of elements with required attributes. Once parsed into an explicitly tagged XML instance, it gains an additional 284 element tags.

Another example in Appendix B presents a small extension of the modularized RELAX NG schema for XHTML 1.0[c08]. We hope to prepare more concrete syntax extensions like these for other XML schemata in the future.

Related work

The tool presented herein treats the RELAX NG schema as an abstract syntax description, and sprinkles it with some extensions for describing the concrete syntax of the language. There have been other tools[p09] [q11] using the same approach of starting with the abstract syntax and extending it with concrete syntax annotations. The abstract syntax notation in these related works is tool-specific, since they don't use XML as the abstract syntax tree.

On the other hand, there are numerous reports[b00] [c03] [m04] [r05] that focus on using XML as the target abstract syntax tree (AST) notation of a parser for some concrete syntax. To perform their parsing, however, they use parser generators such as ANTLR[p95] and other traditional parsing tools, so they specify their concrete syntax in the formalism those tools require. Those that use an XML schema at all use it only to validate the generated AST.

Appendix A. Concrete syntax schema extension for Balisage submissions

<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="omissible+http://docbook.org/ns/docbook"
         xmlns:explicit="http://docbook.org/ns/docbook"
         xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:terminal="http://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols#Terminal_symbols"
         xmlns:non-syntactic="http://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols#Nonterminal_symbols"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

  <!-- The balisage-1-3a.rng schema included below is semantically equivalent to the original Balisage 
       schema, but slightly refactored with the following definitions added for reuse:

    - code.content
    - emphasis.content
    - para.content
    - programlisting.content
    - quote.content
    - title.content
  -->
  <include href="balisage-1-3a.rng">
    <define name="programlisting.content">
      <ref name="programlisting.content.explicit"/>
    </define>
  </include>

  <define name="title" combine="choice">
    <element name="title">
      <ref name="title.attlist"/>
      <ref name="title.content"/>
      <ref name="paragraph_separator"/>
    </element>
  </define>

  <define name="para" combine="choice">
    <element name="para">
      <ref name="para.attlist"/>
      <ref name="para.content.non-recursive"/>
      <ref name="paragraph_separator"/>
    </element>
  </define>

  <define name="programlisting" combine="choice">
    <element name="programlisting">
      <ref name="programlisting.attlist"/>
      <ref name="programlisting_open_marker"/>
      <ref name="programlisting.content.explicit"/>
      <ref name="programlisting_close_marker"/>
    </element>
  </define>

  <define name="listitem" combine="choice">
    <element name="listitem">
      <ref name="listitem.attlist"/>
      <ref name="listitem_marker"/>
      <oneOrMore>
        <ref name="para.level"/>
      </oneOrMore>
    </element>
  </define>

  <define name="code" combine="choice">
    <element name="code">
      <ref name="code.attlist"/>
      <ref name="code_marker"/>
      <ref name="code.content"/>
      <ref name="code_marker"/>
    </element>
  </define>

  <define name="emphasis" combine="choice">
    <element name="emphasis">
      <ref name="emphasis.attlist"/>
      <ref name="emphasis_marker"/>
      <ref name="emphasis.content.non-recursive"/>
      <ref name="emphasis_marker"/>
    </element>
  </define>

  <define name="quote" combine="choice">
    <element name="quote">
      <ref name="quote.attlist"/>
      <ref name="quote_marker"/>
      <ref name="quote.content"/>
      <ref name="quote_marker"/>
    </element>
  </define>

  <!-- inlined emphasis.content, but with only explicit nested emphasis -->
  <define name="emphasis.content.non-recursive">
    <zeroOrMore>
      <choice>
        <text/>
        <ref name="link"/>
        <ref name="citation"/>
        <ref name="emphasis.explicit"/>
        <ref name="footnote"/>
        <ref name="trademark"/>
        <ref name="email"/>
        <ref name="code"/>
        <ref name="superscript"/>
        <ref name="subscript"/>
        <ref name="quote"/>
        <ref name="xref"/>
      </choice>
    </zeroOrMore>
  </define>

  <!-- emphasis element with explicit tags -->
  <define name="emphasis.explicit">
    <element name="explicit:emphasis">
      <ref name="emphasis.attlist"/>
      <ref name="emphasis.content"/>
    </element>
  </define>

  <!-- para.content minus the block-level elements which can recursively nest a paragraph -->
  <define name="para.content.non-recursive">
    <zeroOrMore>
      <choice>
        <text/>
        <ref name="citation"/>
        <ref name="code"/>
        <ref name="email"/>
        <ref name="emphasis"/>
        <ref name="equation"/>
        <ref name="inlinemediaobject"/>
        <ref name="link"/>
        <ref name="subscript"/>
        <ref name="superscript"/>
        <ref name="trademark"/>
        <ref name="quote"/>
        <ref name="xref"/>
      </choice>
    </zeroOrMore>
  </define>

  <!-- programlisting.content with only the explicit emphasis -->
  <define name="programlisting.content.explicit">
    <zeroOrMore>
      <choice>
        <text/>
        <ref name="emphasis.explicit"/>
        <ref name="superscript"/>
        <ref name="subscript"/>
      </choice>
    </zeroOrMore>
  </define>

  <define name="emphasis_marker">
    <element name="terminal:emphasis_marker">
      <value type="string">''</value>
    </element>
  </define>

  <define name="paragraph_separator">
    <element name="terminal:paragraph_separator">
      <value type="string">&#x0a;&#x0a;</value>
    </element>
  </define>

  <define name="programlisting_open_marker">
    <element name="terminal:programlisting_open_marker">
      <value type="string">{{{&#x0a;</value>
    </element>
  </define>

  <define name="programlisting_close_marker">
    <element name="terminal:programlisting_close_marker">
      <value type="string">&#x0a;}}}</value>
    </element>
  </define>

  <define name="listitem_marker">
    <element name="terminal:listitem_marker">
      <value type="token">*</value>
    </element>
  </define>

  <define name="code_marker">
    <element name="terminal:code_marker">
      <value type="string">`</value>
    </element>
  </define>

  <define name="quote_marker">
    <element name="terminal:quote_marker">
      <value type="string">"</value>
    </element>
  </define>
</grammar>

Appendix B. Concrete syntax extension of XHTML schema

<grammar ns="omissible+http://www.w3.org/1999/xhtml"
         xmlns:explicit="http://www.w3.org/1999/xhtml"
         xmlns:terminal="http://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols#Terminal_symbols"
         xmlns="http://relaxng.org/ns/structure/1.0">

<include href="xhtml/xhtml-strict.rng"/>

<define name="head" combine="choice">
  <element name="head">
    <ref name="head.content"/>
  </element>
</define>

<define name="title" combine="choice">
  <element name="title">
    <text/>
  </element>
</define>

<define name="body" combine="choice">
  <element name="body">
    <ref name="Block.model"/>
  </element>
</define>

<define name="p" combine="choice">
  <element name="p">
    <ref name="paragraph_separator"/>
    <ref name="Inline.model"/>
  </element>
</define>

<define name="ol" combine="choice">
  <element name="ol">
    <oneOrMore>
      <ref name="ol.li"/>
    </oneOrMore>
  </element>
</define>

<define name="ul" combine="choice">
  <element name="ul">
    <oneOrMore>
      <ref name="ul.li"/>
    </oneOrMore>
  </element>
</define>

<define name="hr" combine="choice">
  <element name="hr">
    <ref name="hr_marker"/>
  </element>
</define>

<define name="em" combine="choice">
  <element name="em">
    <ref name="emphasis_marker"/>
    <ref name="em.content.non-recursive"/>
    <ref name="emphasis_marker"/>
  </element>
</define>

<define name="ol.li">
  <element name="li">
    <ref name="ol_item_marker"/>
    <ref name="li.content.non-recursive"/>
  </element>
</define>

<define name="ul.li">
  <element name="li">
    <ref name="ul_item_marker"/>
    <ref name="li.content.non-recursive"/>
  </element>
</define>

<define name="em.content.non-recursive">
  <zeroOrMore>
    <choice>
      <text/>
      <ref name="abbr"/>
      <ref name="acronym"/>
      <ref name="br"/>
      <ref name="cite"/>
      <ref name="code"/>
      <ref name="dfn"/>
      <ref name="kbd"/>
      <ref name="q"/>
      <ref name="samp"/>
      <ref name="span"/>
      <ref name="strong"/>
      <ref name="var"/>
      <ref name="em.explicit"/>
    </choice>
  </zeroOrMore>
</define>

<define name="li.content.non-recursive">
  <zeroOrMore>
    <choice>
      <text/>
      <ref name="Inline.class"/>
      <ref name="address"/>
      <ref name="blockquote"/>
      <ref name="div"/>
      <ref name="pre"/>
      <ref name="Heading.class"/>
      <ref name="dl"/>
      <ref name="p.explicit"/>
      <ref name="ol.explicit"/>
      <ref name="ul.explicit"/>
    </choice>
  </zeroOrMore>
</define>

<define name="em.explicit">
  <element name="explicit:em">
    <ref name="em.attlist"/>
    <ref name="Inline.model"/>
  </element>
</define>

<define name="p.explicit">
  <element name="explicit:p">
    <ref name="p.attlist"/>
    <ref name="Inline.model"/>
  </element>
</define>

<define name="ol.explicit">
  <element name="explicit:ol">
    <ref name="ol.attlist"/>
    <oneOrMore>
      <ref name="li"/>
    </oneOrMore>
  </element>
</define>

<define name="ul.explicit">
  <element name="explicit:ul">
    <ref name="ul.attlist"/>
    <oneOrMore>
      <ref name="li"/>
    </oneOrMore>
  </element>
</define>

<define name="emphasis_marker">
  <element name="terminal:emphasis_marker">
    <value type="string">*</value>
  </element>
</define>

<define name="paragraph_separator">
  <element name="terminal:paragraph_separator">
    <value type="string">&#x0a;&#x0a;</value>
  </element>
</define>

<define name="line_separator">
  <element name="terminal:line_separator">
    <value type="string">&#x0a;</value>
  </element>
</define>

<define name="ol_item_marker">
  <element name="terminal:ol_item_marker">
    <value type="token">&#x0a;# </value>
  </element>
</define>

<define name="ul_item_marker">
  <element name="terminal:ul_item_marker">
    <value type="token">&#x0a;* </value>
  </element>
</define>

<define name="hr_marker">
  <element name="terminal:hr_marker">
    <value type="token">&#x0a;----</value>
  </element>
</define>
</grammar>

References

[b59] Backus, J.W., The Syntax and Semantics of the Proposed International Algebraic Language of Zürich ACM-GAMM Conference, Proceedings of the International Conference on Information Processing, UNESCO, 1959, pp.125-132.

[b64] Brzozowski, J. A. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481-494. doi:https://doi.org/10.1145/321239.321249.

[b00] Greg J. Badros. 2000. JavaML: a markup language for Java source code. Computer Networks 33, 1-6 (June 2000), 159-177. doi:https://doi.org/10.1016/S1389-1286(00)00037-2.

[b07] Mark Bergsma, 2007. Wikimedia architecture http://www.nedworks.org/~mark/presentations/kennisnet/Wikimedia%20architecture%20(kennisnet).pdf

[b10] Mario Blažević, 2010. Grammar-driven Markup Generation. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). http://www.balisage.net/Proceedings/vol5/html/Blazevic01/BalisageVol5-Blazevic01.html. doi:https://doi.org/10.4242/BalisageVol5.Blazevic01.

[c01] James Clark and Makoto Murata. RELAX NG Specification. http://relaxng.org/spec-20011203.html, 2001. ISO/IEC 19757-2:2003.

[c02] James Clark. An algorithm for RELAX NG validation http://www.thaiopensource.com/relaxng/derivative.html

[c02c] James Clark. RELAX NG compact syntax, Committee Specification 21 November 2002, OASIS http://relaxng.org/compact-20021121.html

[c08] James Clark. Modularization of XHTML in RELAX NG http://www.thaiopensource.com/relaxng/xhtml/

[c03] James R. Cordy, 2003. Generalized Selective XML Markup of Source Code Using Agile Parsing. In Proceedings of the 11th IEEE International Workshop on Program Comprehension (IWPC '03). IEEE Computer Society, Washington, DC, USA, 144-

[j04] Rick Jelliffe. From Wiki to XML, through SGML. http://www.xml.com/pub/a/2004/03/03/sgmlwiki.html

[m62] John McCarthy, Towards a Mathematical Science of Computation, Proceedings of IFIP Congress 1962, pages 21-28, North Holland Publishing Company, Amsterdam

[m04] J.I. Maletic, M. Collard, and H. Kagdi, Leveraging XML technologies in developing program analysis tools. IEEE Digest 2004, 80 (2004), doi:https://doi.org/10.1049/ic:20040255.

[p95] Parr, T. J. and Quong, R. W. ANTLR: A predicated-LL(k) parser generator. Software: Practice and Experience, volume 25, issue 7, 1995. John Wiley & Sons, Ltd. doi:https://doi.org/10.1002/spe.4380250705

[p09] Jaroslav Porubän, Michal Forgáč, and Miroslav Sabo, Annotation Based Parser Generator. Proceedings of the International Multiconference on Computer Science and Information Technology, 2009, pp. 707–714

[q11] Luis Quesada, Fernando Berzal, and Juan-Carlos Cubero, A Tool for Model-Based Language Specification. Department of Computer Science and Artificial Intelligence, CITIC, University of Granada, http://arxiv.org/abs/1111.3970v1

[r05] Raihan Al-Ekram and Kostas Kontogiannis. 2005. An XML-Based Framework for Language Neutral Program Representation Generic Analysis. In Proceedings of the Ninth European Conference on Software Maintenance and Reengineering (CSMR '05). IEEE Computer Society, Washington, DC, USA, 42-51. doi:https://doi.org/10.1109/CSMR.2005.10

[s05] Sperberg-McQueen, C. M. Applications of Brzozowski derivatives to XML schema processing. In Extreme Markup Languages 2005, page 26, Internet, 2005. IDEAlliance.

[s86] Standard Generalized Markup Language (SGML) International Organization for Standardization ISO 8879:1986

[w93] Sam Wilmott, Beyond SGML. Exoterica Technical Report ETR-9, 1993. http://developers.omnimark.com/etcetera/etr09/

Mario Blažević

Senior software architect

Stilo International plc.

The author has a Master's degree in Computer Science from the University of Novi Sad, Yugoslavia. Since moving to Canada in 2000, he has been working for OmniMark Technologies, later acquired by Stilo International plc., mostly in the area of markup processing and on the development of the OmniMark programming language.