How to cite this paper

Flynn, Peter. “Your Standard Average Document Grammar: just not your average standard.” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017).

Balisage: The Markup Conference 2017
August 1 - 4, 2017

Balisage Paper: Your Standard Average Document Grammar

just not your average standard

Peter Flynn

Peter Flynn manages the Academic and Collaborative Technologies Group in IT Services at University College Cork, Ireland. He trained at the London College of Printing and did his MA in computerized planning at Central London Poly (now the University of Westminster). He worked in the UK for the Printing and Publishing Industry Training Board as a DP Manager and for United Information Services of Kansas as IT consultant before joining UCC as Project Manager for academic and research computing. In 1990 he installed Ireland’s first Web server and now concentrates on academic and research publishing support. He has been Secretary of the TeX Users Group, Deputy Director for Ireland of EARN, and a member both of the IETF Working Group on HTML and of the W3C XML SIG; and he has published books on HTML, SGML/XML, and LaTeX. Peter also runs the markup and typesetting consultancy Silmaril, and is editor of the XML FAQ as well as an irregular contributor to conferences and journals in electronic publishing, markup, and Humanities computing, and a regular speaker and session chair at the XML SummerSchool in Oxford. He completed a late-life PhD in User Interfaces to Structured Documents with the Human Factors Research Group in Applied Psychology in UCC. He maintains a fairly random technical blog at

Article copyright © 2017 by Peter Flynn.


Most document XML applications adopt or adapt one of a small number of well-known public document grammars. These are basically all expressions of a shared and accepted fundamental logical view of document structure. There are variants and outliers and long tails, but despite differences in detail, they form a Standard Average Document Grammar, which lets us describe the overwhelming majority of text documents.

The grammar includes a hierarchy of nested, headed sections; arbitrarily recurring groups of common components; links between places in and out of the document; signifiers of importance, relevance, or sequence; and restrictions on what may and may not occur in different places. The modifications and customizations users make to these document grammars are informative both in their variety and their similarity, and in the fact that they all fit relatively comfortably within the Standard Average Document Grammar.

Table of Contents

From clay tablet to PDF
Core features
Standard Average?
Feature set
Adopt, Adapt, Build
Drawing the line

From clay tablet to PDF

There appears to be a set of structural features common to the majority of text documents that have become a part of the way the human race has recorded textual information over the millennia. As I have shown elsewhere, it is apparent that from clay tablets to PDFs, we have slowly evolved various models of a document that have many features in common [Flynn14, Ch.1]. Part of this may be due to the need (until recently) to agree upon a generalized physical representation for the document that others would recognize, but this could not have been done without there being a mental model of the document to start from. It is not known if anyone actually sat down at the dawn of writing, or even at the dawn of printing, to decide that certain features are what makes up a document,[1] but we can see evidence of such decisions in the design of commands and structures in older markup systems such as RUNOFF, Scribe, [S]GML, LaTeX, and others which inherit their paradigms.

Strictly speaking, a document grammar (in the case of XML, for example, a DTD, W3C Schema, or RNG Schema) is a set of definitions and declarations for modeling a class of text documents. It defines the components of the documents it describes, as well as the rules governing their presence in the documents of that class [Tekli11]; a similar application has been noted in linguistics [Power03]. However, we are more concerned here with the document components themselves, and with the rules governing their arrangement, than with the expressive power of the particular grammatical notation used to describe them.

Core features

In comparing the features of text document markup vocabularies for earlier research, the existence of a core set of features became evident because it recurred in one form or another in virtually every system examined. Not only were the functions replicated, but the associations between them, and the rules under which they operated, were extremely similar. These features have been observed and discussed many times, and are used as examples in our theories of document grammars, but they do not appear to have been codified across multiple instances of their occurrence. To test the feasibility of codification, an experimental Table of the Elements fragment was constructed from a small sample of document types of varying age and popularity [Table I], looking principally for obvious evidence of common requirements such as metadata (principally the document identity), hierarchical structure, non-hierarchical categorization, and object reference. Although incomplete and unrefined, the table showed the existence of some common features, as well as numerous gaps.

Table I

(Non-Periodic) Table of the Elements from selected XML grammars (LaTeX has been included for comparison)

Feature HTML DocBook DITA TEI 12083 JATS Briefing Bulletin LaTeX
title title title title title title article-title title title \title
author author author author author briefeditors author \author
summary abstract shortdesc abstract abstract abstract abstract abstract
preface preface front preface \frontmatter
part part section div|div0 part sec \part
chapter h1 chapter section div|div1 chapter sec report \chapter
section h2 sect1 section div|div2 section sec story section \section
subsection h3 sect2 section div|div3 subsect1 level3 sub.section \subsection
subsubsection h4 sect3 section div|div4 subsect2 level4 \subsubsection
appendix appendix appendix afterwrd \appendix
bibliography bibliography listBibl biblist Ref-list biblist thebibliography
index index index index index
glossary glossary glossary glossary glossary glosslist glossary
paragraph p para p p p p para ptxt \par
quotation blockquote blockquote lq quote bq block.quote quotation
numbered list ol orderedlist ol list list list numberlist list enumerate
bulleted list ul itemizedlist ul list list list bulletlist itemize
dictionary list dl variablelist dl list deflist list defnlist description
figure img figure fig figure fig fig illus figure figure
table table table table table table table table table table
mathematics equation formula formula mml:math formula $$
cross-reference a xref link ref secref xref eiro.ref \ref
bibliographic reference a biblioref cite ref citeref biblio \cite
external link a link xref ptr weblink external.ref \hyperref
emphasis em emphasis emph emph emph emph1 \emph
language lang foreignphrase foreign language.phrase \selectlanguage

From this data the features of a common grammar begin to emerge:

  • document models provide for self-labelling: in concrete terms, titles, authors, and other [meta]data within the document;

  • the models provide for an ordered hierarchical division of the information;

  • within those divisions, there is a non-hierarchical sequence of text-bearing components (and some for graphical content);

  • at the level of the discourse itself (text), there may be interspersed identifiers which describe relationships between objects or which signify some special quality to be observed, and which may themselves contain further text, identifiers, or signifiers.

I have so far avoided assigning the conventional labels of markup theory or the names used in any specific system to these features (element, attribute, environment, etc; or title, para, or sect1, etc). However, for practicality and convenience in discussion, the grouping of the features in Table I corresponds with terminology commonly used: metadata, hierarchy, pool, and flow.[2]
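As a rough illustration of how this grouping could be put to work, the following Python sketch records a few rows of Table I and looks up the feature and group for an element name. Only rows that have a value in every column are included, so that no cell alignment has to be guessed; no flow row meets that condition here, so none appears. The assignment of rows to groups, and the function names `lookup` and `translate`, are assumptions made for this sketch.

```python
# Columns of Table I, in the order given in its header row.
VOCABULARIES = ["HTML", "DocBook", "DITA", "TEI", "12083", "JATS",
                "Briefing", "Bulletin", "LaTeX"]

# feature -> (group, element names per vocabulary). Only rows with a value in
# every column are copied here; the group labels are this sketch's assumption.
ROWS = {
    "title": ("metadata", ["title", "title", "title", "title", "title",
                           "article-title", "title", "title", r"\title"]),
    "section": ("hierarchy", ["h2", "sect1", "section", "div|div2", "section",
                              "sec", "story", "section", r"\section"]),
    "paragraph": ("pool", ["p", "para", "p", "p", "p", "p", "para", "ptxt",
                           r"\par"]),
    "numbered list": ("pool", ["ol", "orderedlist", "ol", "list", "list",
                               "list", "numberlist", "list", "enumerate"]),
    "table": ("pool", ["table"] * 9),
}


def lookup(name, vocabulary):
    """Return (feature, group) for an element name, or None if not listed."""
    column = VOCABULARIES.index(vocabulary)
    for feature, (group, names) in ROWS.items():
        # A cell such as "div|div2" lists alternative names for one feature.
        if name in names[column].split("|"):
            return feature, group
    return None


def translate(name, source, target):
    """Map an element name from one vocabulary to its counterpart in another."""
    found = lookup(name, source)
    if found is None:
        return None
    return ROWS[found[0]][1][VOCABULARIES.index(target)]


print(lookup("sect1", "DocBook"))
print(translate("ptxt", "Bulletin", "DocBook"))
```

A lookup of this kind is only as good as the table behind it: the gaps in Table I would appear as missing entries, which is itself a crude indicator of where a vocabulary departs from the common set.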

Standard Average?

The human race seems to like to categorize things. We do it on the basis of perception (loud|quiet, bright|dark, hot|cold), cognition (cheap|expensive, fast|slow, wet|dry), and even guesswork (bull|bear [market]) — ultimately it’s a survival trait (dangerous|harmless) [Lakoff90]. More experienced humans have more points on their scales: flooded|sodden|wet|damp|moist|dry|bone-dry|parched|desert, because it’s more useful that way. It’s also possible to measure on a sliding scale, for example 100%=flooded and 0%=desert, or any point in-between. But as most of us live neither under water nor in a desert, neither in perpetual daylight nor perpetual night, neither on top of a mountain nor at the bottom of a canyon, there is a tendency for most humans to have an affinity for somewhere between the extremes. This clustering, or central tendency, is a hallmark of natural behavior, and has been known since antiquity, although formalized in statistics only since the late 1600s.[3]

Average therefore seems to be an appropriate way to describe the clustering observed in the way in which documents are constructed (at least in SGML/XML and LaTeX), even if it is not used in the strictly mathematical sense required by statistics. There is a cluster of recognizable types of information around the title and author, another around the hierarchy, another around the pool, and another around the flow.

The standards we use daily, whether formalized by ISO or just accepted as patterns of behavior, have been formed from a similar principle to the average: a degree of genericness or commonality has been seen to be useful as a model because it is representative or descriptive of the whole. In effect, we are unconsciously applying the duck test of abductive reasoning: if it [repeatedly] looks useful, it probably is.

The suggested term Standard Average Document Grammar is derived from (but entirely unassociated with) the linguistic term Standard Average European coined in the late 1930s to describe a set of grammatical similarities which characterize Indo-European languages.[4] The term Standard Average on its own has to some extent become a portmanteau phrase in everyday language for acceptably common behaviour.[5]

Feature set

The set of features for the derived grammar is expanded below, but we should first deal with what it does not describe.

There are many classes of document structures that do not or cannot follow a generic model but have their own: those which are too short to exhibit much in the way of structure; those which are intended as ephemeral or singular; and those which by convention of their nature require a specialist structure. But even amongst these, some of the features may be present, even if (for example) in the metadata rather than the text body.

The point of standard and average as described above is that such a grammar should be able to cover enough of the spectrum to be a useful pattern or model in a majority of cases, and that this fit should be generally accepted by the user community. There will nevertheless be some specific factors which must be considered in testing this acceptance:

  • there must be broad agreement between users on semantics;

  • not all features have to be present: there can be rules about requirement and optionality;

  • if features are present, then they must be used in the manner generally accepted.

Naming is also important, and has been the topic of much discussion over the years on XML-related mailing lists. Not only are names a prerequisite of any concrete instantiation, but we need them informally as handles during discussion, so they may as well be meaningful in the language of that discussion. This raises other linguistic and cultural questions, but in essence we are simply requiring agreement that the feature we refer to as a title is in fact the title of a document (or section, or whatever) as commonly understood, and not a mosquito or a bottle of beer.

Because of the traditional separation of concerns between logical and physical in dealing with document markup, the visual appearance of a grammatical feature is not generally relevant. However, for the purposes of usability and — as here — illustration, when features are given an appearance, it is common to use one of the widely-accepted styles.

The salient features of a Standard Average Document Grammar are summarized in Figure 1 to Figure 4. There may be disagreement over the presence or absence of some specifics, but enough of these appear to occur in enough instances of otherwise disparate types of document to make them worth including.

Figure 1: Identification

Naming and explaining

  • Title

  • Subtitle

  • Author

  • Summary

The features in Figure 1 are often regarded as metadata, as they typically stand outside the running text. It nevertheless seems to be accepted as part of the function of the grammar that it should label the document (title), link to an authority outside the document (author), and provide an overview or synopsis (summary).

Figure 2: Formation

Hierarchical structure

  • Preamble

  • Major Division

    • Subdivision

      • Minor Divisions

  • Postamble

The core structure of a document appears most commonly as a hierarchical nesting of divisions, with each level able to reoccur as siblings (Figure 2). As encoders of documents are well aware, this does not hold true for many early documents, and even for some contemporary ones, but it is sufficiently true elsewhere for it to be useful as a model, and is sometimes imposed upon otherwise unstructured or semi-structured documents to make them usable in conventional modern contexts. In formally-published documents, especially books, there is usually material preceding and following the hierarchical structure (prefaces, forewords, indexes, appendices).

Figure 3: Text Content

Recurrent, reusable, ordered

  • Paragraph

  • List Item

  • Table

  • Figure

  • Quotation

  • Image

  • Notation

While the function of a hierarchical structure is to provide a referential framework within which the author can develop or express an argument (at the least, something like introduction, exposition, analysis, and conclusion), the text itself uses a set of building-blocks to present that argument (Figure 3), of which a small subset seems to be widely used.

  • The most basic seems to be the paragraph (a novel consists largely just of these and nothing else apart from chapter headings).

  • A list is a collection of thoughts or topics in some way related by order or concept.

  • Tables and figures are ways of expressing or relating more complex collections of information in such a way that they do not interrupt the flow of the argument but remain available for consultation.

  • Images and other notations (mathematics, music) are specialist ways of presenting collections of information that cannot reasonably be given in normal textual form because they need their own language.

  • Quotations are arguably a form of external link (see Figure 4), but reproduce the content of the target verbatim so that it becomes part of the author’s argument.

The critical point about these building-blocks is that they occur and reoccur many times. While the components of the hierarchical structure which contain them may reoccur as often as needed as siblings (that is, at their own level), they cannot occur out of depth (that is, you cannot have a subsubsection as a child of a chapter), whereas the building-blocks of content can occur and reoccur at any level within the hierarchical structure. Whatever about the constraints imposed by the hierarchical model, this distinction seems to be a key aspect of document grammars.
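The distinction can be sketched as a small check over a document tree. This is a minimal illustration rather than a description of how any real schema enforces it: the level names, the pool set, and the rule that a child division must sit exactly one level below its parent are assumptions chosen to match the chapter and subsubsection example above.

```python
# Hierarchy levels in nesting order, and pool components that may occur at
# any level. Both sets are illustrative assumptions, not a real schema.
HIERARCHY = ["chapter", "section", "subsection", "subsubsection"]
POOL = {"paragraph", "list", "table", "figure", "quotation"}


def problems(node, depth=None, path=""):
    """Yield a message for each component that breaks the depth rule.

    A node is a (name, children) pair. The root is given depth None, so its
    child divisions may start at any level.
    """
    name, children = node
    here = f"{path}/{name}"
    for child in children:
        child_name = child[0]
        if child_name in POOL:
            continue  # pool components are allowed at every level
        if child_name in HIERARCHY:
            child_depth = HIERARCHY.index(child_name)
            # Inside a division, a child division must be one level deeper.
            if depth is not None and child_depth != depth + 1:
                yield f"{child_name} out of depth inside {here}"
            yield from problems(child, child_depth, here)
        else:
            yield f"unknown component {child_name} inside {here}"


good = ("document", [
    ("chapter", [
        ("paragraph", []),
        ("section", [("paragraph", []), ("subsection", [("table", [])])]),
        ("section", []),
    ]),
])
bad = ("document", [("chapter", [("subsubsection", [("paragraph", [])])])])

print(list(problems(good)))
print(list(problems(bad)))
```

Sibling divisions repeat freely in `good`, and pool components appear at three different depths without complaint; only the misplaced division in `bad` is reported.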

Figure 4: Reference


  • Internal Link

  • External Link

  • Signifiers

    • importance

    • relevance

    • sequence

Unlike the other features in Figure 1 to Figure 3, where at least one of them must be present, otherwise you have no document at all, the reference features are entirely optional, and are used at the author’s discretion according to sense (Figure 4).

In the detail of running text, there may be a need to link components within the document for reference or to link to other documents elsewhere. While these features perform a closely related function, an internal reference can be checked immediately, so it is dependent, whereas a link to another document is independent, as it cannot be known at the time of writing if the reader will have access to the document concerned.

Signifiers are ways to express some special nature of a feature, so that it takes on a quality which impresses itself on the reader. Emphasis or terminology are probably the most frequently-used in continuous text; specifiers of sequence occur in structures like numbered lists and the titles of sections.

Adopt, Adapt, Build[6]

In this author’s experience, the core set of features, or one very similar, is where most concrete instantiations of document grammars appear to have started, as far back as the days of SGML DTDs. Additional features, and deviations from the norm, are legion, and may be specialist within a field or topic, or introduced for practical, technical, or political reasons — it is these which distinguish one implementation from another. The ease (or otherwise) with which a particular type of document can be modified seems to depend largely on the original authors’ intentions:

  • some structures are designed to be modified, and therefore provide facilities for doing so, such as parameterization;

  • some certainly can be modified, and occasionally are, but it’s a big effort and it’s usually easier to put up with the occasional semantic mismatch;

  • some are not intended to be modified at all.

Not all parts of a document grammar may be equal to the task: in some cases it may be hard to modify the metadata but easy to modify the hierarchy; in others the reverse. There is also significant debate (not a part of this analysis) about the extent to which modifications should allow or deny a user the right to continue to claim that they are [still] using the type of document they started with.


The simplest use case is no changes. This implies that the requirements of the documents to be created or encoded are identical to those envisaged by the creators of the grammar, or at least so similar that the differences can be ignored. Using an existing document grammar in this way, without any modification at all, seems to this author to be relatively rare in the long run, with some specific exceptions noted below; but collecting hard data on numbers would be difficult to undertake. Certainly it makes an excellent starting-point for those with no history of structured-document usage, but the process needs to be managed in order to avoid rejection because of unexpected conflicts between the provisions of the grammar and the view that users have of their own document types.

One obvious exception is a need to adhere to a de facto standard, and HTML is the most prominent example. It is something of a special case because it was implemented by software (browsers and editors) that ignored or even encouraged syntactic errors. While XHTML and HTML5 are sometimes now well-formed, the uncounted millions of earlier HTML web pages remain in use and are likely to do so for the foreseeable future. HTML itself has been adapted on occasions for specialist use, but usually just in restricted forms like the subset of XHTML used in EPUBs rather than extending the grammar in other directions; and this author (and separately, the ISO HTML committee) did produce versions which used a hierarchical structure in the body of the document.

Another exception is the mandated use of specialist document types in a vertical market such as a single industry. The success of many industrial document types relies either on agreement between the companies in an industry that their use is, in effect, grammatically identical, or on an obvious advantage such as common software.

JATS, for example, while parameterized and open to modification, is seldom changed much except by very large organizations (and even then mostly only in the metadata) because significant change would break the shared model of an article in journal publishing, as well as the toolset. Some extensive modification has been done to produce BITS (book interchange) and NISO STS (standards), but these are more in the nature of forks or full-scale derivatives.


Three commonly-adapted grammars are TEI, DocBook, and DITA. All provide extensive facilities for adaptation, implemented in different ways, and all can generate DTDs, W3C Schemas, or RNG Schemas.

  • TEI is generated by the ODD system (One Document Does it all), and user modifications can be created via the Roma web tool by adding features to a minimal core or subtracting them from an all-in version. More specialist modifications can also be done manually by creating customized ODD files and generating the schema afresh.

  • DocBook is maintained in RNG, and features (specified as RNG patterns) can be selectively disabled and enabled in a customization layer, and additional features introduced. The documentation is careful to distinguish between creating subsets, which remain valid DocBook instances, and extensions, which can no longer be called DocBook [Walsh16b].

  • DITA is maintained in RNG and allows for adding and removing topic or element types, as well as applying effectivities (conditionalizations). Specializations can be managed centrally by the sponsoring agency which maintains the standard (OASIS) or locally by users or industry groups.

Despite enquiry, I have failed to identify any modified version of any of these three which has involved changing any of the element type names shown in Table I, or their structure relative to one another. Additions and exclusions occur in more specialist areas, as noted above, but the basic grammar of a hierarchical structure containing sequences of text blocks containing mixed text and referential signifiers appears to satisfy that particular core of demand for what constitutes a document.

However, from discussions among developers of document types and classes (for example, on the TEI, DocBook, HTML, XML, LaTeX, and other related forums), it is clear that there have been questions of structural relationships and content modeling in the grammar at the design level which appear largely to have been resolved, at least within the encoding communities served by each system. A few examples:

  1. Should further discursive block-level content be permissible after the close of the last hierarchical child in a hierarchical container?

    • After the end of the last sect1 in a DocBook chapter? Yes, but limited to simplesect;

    • After the end of the last div1 in a TEI div0? No, perhaps oddly, given that the TEI is designed to be able to model historical documents which often do not conform to rigid modern hierarchical structures;

    • After the end of the last div in a HTML5 div? Sure, no problem.

  2. Should hierarchical containers be numbered (by level) or not?

    • DocBook provides names for Parts and Chapters but sections within them are numbered by level; but there is an unnumbered section which can be used instead;

    • TEI provides level-numbered divisions and keeps naming to attributes; but it too provides an undistinguished div;

    • ISO 12083 names the components down to the section level but numbers the levels beneath;

    • HTML and others simply use recurrent containers of the same name at all depths.

  3. To what extent should block-level (pool) components occur within themselves, alongside normal unmarked text?

    • Not at all — TEI (in SGML, one of the most notable victims of pernicious mixed content);

    • Within limits — DocBook (not those with complex internal structure);

    • Go for it — HTML (as implemented).

    (Some systems — Microsoft Word, for example — go to extreme lengths to avoid mixed content entirely.)

  4. Is it the responsibility of the grammar to describe or prescribe the possible types of content of a document?

    • TEI is largely descriptive, in that it was designed to cope with the planet’s literary, historical, and cultural Nachlaß;

    • DocBook is mildly prescriptive (no lists in an Abstract, for example);

    • Specialist grammars can be almost completely prescriptive in structure, although rarely in text content.

The degree to which the chosen grammar offers acceptable constraints, or fails to offer sufficient descriptive accuracy, will largely determine the level of adaptation needed. This is not a failing on either side, simply an acknowledgement that both sides are close enough to the standard average to get along together except for a few areas where they need to go their own way.


The decision to write your own document type or class — to design your own grammar, often from scratch — seems to me to be less common than before, when the public offerings were more limited, document-grammar analysis skills were rare, and a full understanding of ISO 8879 itself rarer still. Specialist requirements continue to mean that vertical-market document type grammars will still need to be written. Maler and el Andaloussi (1999) and others are clear about the commitment of time and effort required to undertake the task at an industrial level, but there must be many hundreds, possibly thousands, of personal or localized schemas originally written for ad hoc purposes which have become embedded into workflows and still continue to function.

In the original analysis for this paper, four small examples were used: EIRO Bulletin and Croner Briefing, which appear in Table I because they show some commonality with the rest; and BiBTeXML and Daybook, which have no correlation with the Standard Average Document Grammar.


EIRO Bulletin was written for the publishing workflow of a European Union labor research institution. The design is not easily extensible: it has an abbreviated hierarchy and pool, just enough for the practicalities of publishing, and a curious selection of inline signifiers aimed at the requirements of the publishing process, which needed to be able to identify many different aspects (locations, organizations, people, documents, and three different styles of emphasis) for indexing and retrieval as well as visual formatting.


Croner Publications had the Croner Briefing grammar developed for a frequently-issued series of business briefings. There is a simple hierarchical structure, but it is remarkable for the pool having 12 different element types for lists (surely some kind of record). There is a significant amount of metadata for document control in a publishing workflow, even for a relatively small unit of writing. Some of the inlines are clearly designed to be retro-fitted after formatting (position and page number).


BiBTeXML shows one possible way of tackling the naming problem when the field is (by design) very narrow. It would, of course, have been perfectly possible to encode the referenced document types (eg article, book, inproceedings, etc) in an attribute of the entry element, but this would mean either attempting a hugely complex content model to restrict the element types, or making the content model elements unconstrained and leaving it to the encoder to make the right choice.

The designers opted for the more pragmatic route of constraining the content model with an element type for each referenced document type, so that the element types available within them reflect exactly those a user would expect from any other interface to a BiBTeX file. This is in some ways an exercise in obviousness: part of the solution in usability is sometimes making the affordances so obvious that it minimizes training.
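The effect of giving each referenced document type its own content model can be sketched in Python. The field lists below follow common BibTeX usage and are assumptions for illustration; they are not copied from the BiBTeXML schema, and `check_entry` is a hypothetical helper.

```python
# Illustrative content models: required and optional fields per entry type.
# The field lists follow common BibTeX usage and are assumptions here; they
# are not taken from the BiBTeXML schema.
MODELS = {
    "article": {
        "required": {"author", "title", "journal", "year"},
        "optional": {"volume", "number", "pages"},
    },
    "inproceedings": {
        "required": {"author", "title", "booktitle", "year"},
        "optional": {"editor", "pages", "publisher"},
    },
}


def check_entry(entry_type, fields):
    """Return a list of problems for one entry (empty if it conforms)."""
    model = MODELS.get(entry_type)
    if model is None:
        return [f"unknown entry type: {entry_type}"]
    present = set(fields)
    # Required fields that are absent, then fields this type does not allow.
    found = [f"missing {name}" for name in sorted(model["required"] - present)]
    allowed = model["required"] | model["optional"]
    found += [f"not allowed in {entry_type}: {name}"
              for name in sorted(present - allowed)]
    return found


entry = {"author": "A", "title": "T", "year": "2017", "booktitle": "P"}
print(check_entry("inproceedings", entry))
print(check_entry("article", entry))
```

With a single generic entry and a type attribute, the same check would need either one very large model covering every combination or no constraint at all, which is the trade-off described above.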


Daybook was designed for the transcription of parliamentary proceedings. Legislative records not only have to be exact (perhaps in some jurisdictions even when the truth has been redacted) but, for retrieval, an attempt has to be made to represent the class of material being debated, so there are element types for General Debate, Oral Answers, Written Answers, and Private Notice Questions. They can be nested, so the structure is discrete; class within class, rather than hierarchical in the normal chapter–section–subsection manner.

Ultimately, the write or adapt decision has to be made on many grounds: accuracy, practicality, security (independence), ease of use, speed, convenience, software availability, skill requirements, and others. Not all of these can necessarily be measured directly with money: there may be less-quantifiable aspects such as human relations and organizational politics involved.

Drawing the line

If there is anything we can learn from a Standard Average Document Grammar, it seems to be that it’s a convenient term for a phenomenon which needs more accurate measurement. One way of looking at it would be to pursue the pseudo-statistical theme and construct values for concrete use cases, with their distance from the theoretical SADG as a measure of divergence.
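One very simple way to put a number on such a distance is to treat each grammar as a set of features and compare it with the features listed in Figure 1 to Figure 4. The sketch below uses the Jaccard distance purely as a stand-in, since no particular measure has been settled on here; the feature labels and the example grammar are hypothetical.

```python
# Features taken from Figure 1 to Figure 4, written as plain labels.
SADG = frozenset({
    "title", "subtitle", "author", "summary",
    "preamble", "division", "postamble",
    "paragraph", "list item", "table", "figure", "quotation", "image",
    "notation",
    "internal link", "external link", "signifier",
})


def divergence(features, reference=SADG):
    """Jaccard distance: 0.0 for identical feature sets, 1.0 for disjoint."""
    features = frozenset(features)
    union = features | reference
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(features & reference) / len(union)


# A hypothetical grammar for novels: little more than headed divisions
# containing paragraphs.
novel = {"title", "author", "division", "paragraph"}
print(round(divergence(novel), 3))
```

This measure penalizes both missing features and features that lie outside the reference set, and it weights every feature equally; a more careful measure would probably need to weight the hierarchy and pool more heavily than, say, the subtitle.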

When an organization or individual considers using an existing document grammar, there will eventually be a pain point at which they in effect say, “No, that really isn’t how we see things here; we need something closer to how we work.” From that point on, it’s a case of adaptation: new names, perhaps, or a new structure, or an extended or contracted content model. If such a fork is public, it may attract additional users, particularly if it is designed for a vertical market. Takeup and the amount of divergence from the original can be measured.

Some will never get to that point, and will use an existing grammar unadapted, or perhaps with only the most trivial of changes to, say, attribute value lists. In these circumstances, we are effectively adding to the number of use cases at the mode (the most commonly occurring value in a distribution).

Those who elect to build their own grammar are in effect initially located beyond some as yet undetermined measure of deviation, although if the resulting structures end up bearing enough similarity to the SADG, the grammar may be considered to have added to the base of contributory systems.

In this author’s experience, the adaptations of existing grammars are undertaken for multiple reasons, often related to “not enough”, “too many”, or “not what we call it”:

  • insufficient or over-complex metadata requirements (some people need more, others need less);

  • too many or too few restrictions on the formation of the hierarchy: a modeling mismatch with the way the organization or individual works;

  • missing or excessive provision for pool components which lie at the heart of structured document writing and editing;

  • similar problems with the inline flow components.

Cutting back on the richness of some of the standard offerings is likely to ease editing complexity, but there can also be extra work if some components are named in a way that causes ambiguity or uncertainty in the circumstances of use. When this reaches frustration point among document users, there may be a rise in tag abuse or other inaccuracy, leading to calls for adaptation or writing a new grammar.

Given that the creation of a new document grammar and new document type or class is non-trivial, it would be useful to have some measure of how far off-piste you have to be to justify it.


[Mark Clifton’s 1952 story] Clifton, Mark (1952) Star, Bright. Galaxy Science Fiction, 4:4 (July), World Editions (Edizione Mondiale), New York, NY,

[Flynn14] Flynn, Peter (2014) Human Interfaces to Structured Documents, PhD Thesis, University College Cork, Cork, Ireland,

Kosek, Jirka (2017) Improving validation of structured text. In Proc. XML London 2017, University College London, June 11–12, pp.56–67. doi:

[Lakoff90] Lakoff, George (1990) Women, Fire, and Dangerous Things. University of Chicago Press, Chicago, IL, 9780226468044.

[Maler and el Andaloussi (1999)] Maler, Eve; and el Andaloussi, Jeanne (1999) Developing SGML DTDs: from Text to Model to Markup. Prentice-Hall, Upper Saddle River, NJ, 0-13-309881-8.

[Oppenheim67] Oppenheim, A Leo (1967) Letters from Mesopotamia: Official, Business, and Private Letters on Clay Tablets from Two Millennia. University of Chicago Press, Chicago, IL.

[Power03] Power, Richard; Scott, Donia; and Bouayad-Agha, Nadjet (2003) Document Structure. In Computational Linguistics 29:2, p.223 et seq. doi:

[Southall (1989)] Southall, Richard (1989) Interfaces between the Designer and the Document. In André, Jacques; Furuta, Richard; and Quint, Vincent; Structured Documents, CUP, Cambridge, England pp.119-131, 0521365546.

[Tekli11] Tekli, Joe; Chbeir, Richard; Traina, Agma JM; and Traina Jr, Caetano (2011) XML document-grammar comparison: related problems and applications. In Central European Journal of Computer Science (Springer, Versita), 1:1, pp.117–136, doi:

[OMalley64] Vesalius, Andreas (1554) Letter to Johannes Oporinus. In O’Malley, Charles Donald (1964) Andreas Vesalius of Brussels, 1514–1564. University of California Press, Berkeley CA (text at, retrieved May 2017).

[Walsh16a] Walsh, Norman (2016) Underlying Technologies. In XML and Publishing, XML Summer School, St Edmund Hall, Oxford, p.19

[Walsh16b] Walsh, Norman (2016) Customizing DocBook. Ch.5 in Publishing DocBook Documents,

[1] Although in the first case, the authors of clay-tablet business documents do appear to have settled on shared modes of expression [Oppenheim67]; and in the second case, Vesalius came fairly close [OMalley64].

[2] The terms pool and flow are taken from the design conventions of Document Type Definitions as used in SGML and XML: Maler and el Andaloussi (1999) derive them from an Open Software Foundation DTD design committee. They are in widespread use and occur in the specifications for both DocBook and HTML, although they appear much earlier under the terms hierarchy, containment, and sequence in Southall (1989). The terms blocks (for pool) and inlines (for flow) are also in common use.

[3] The word average derives from the Latin havaria, the sharing of the expense of lost cargoes between shipping merchants, which ultimately gave us the concept of insurance.

[4] I am indebted to Michael Sperberg-McQueen for the suggestion.

[5] As in Mark Clifton’s 1952 story about the father of an exceptionally bright young daughter warning her against feigning stupidity in order to be accepted in school: “Now, look,” I cautioned, “don’t overdo it. That’s as bad as being too quick. The idea is that everybody has to be just about standard average. That’s the only thing we will tolerate. […]”

[6] (tl;dr: don’t) [Walsh16a].

