Text. You keep using that word …

C. M. Sperberg-McQueen

Founder and principal

Black Mesa Technologies LLC

Copyright © 2017 by the author.


Balisage: The Markup Conference 2017
August 1 - 4, 2017

Some of you will recognize my title Text. You keep using that word … as an allusion to a line in the movie The Princess Bride. One character persistently misuses the word inconceivable, and eventually elicits from the character Inigo Montoya, played by Mandy Patinkin, the remark You keep using that word. I do not think it means what you think it means.[1]

I've been thinking about the word text a lot recently (more than usual, I mean) and about the model of text entailed by SGML and XML and related technologies. One reason perhaps is that the paper by Ronald Haentjens Dekker and David Birnbaum [Haentjens Dekker and Birnbaum 2017] refers back to a paper published a couple of decades ago (1990, to be exact) by Steve DeRose and Allen Renear and a couple of other people under the title What is Text, Really? [DeRose et al. 1990]. That paper comes up a lot in discussions of the textual model of SGML and XML. And whenever I see discussions of the textual model of SGML and XML and of the way this paper What is Text, Really? defines it, I find myself thinking No, it's not really quite like that. It's a little more complicated. I don't think that any of those words mean what you think they mean.

The OHCO model and the OHCO thesis

The paper is well worth reading even at the distance of 27 years. It describes a model of text as an ordered hierarchy of content objects. (The authors abbreviate this concept O-H-C-O, so it's often referred to as the OHCO paper.) The thesis that text is an ordered hierarchy of content objects is frequently referred to as the OHCO thesis. The paper defines (or at least sketches) the model; it briefly describes some alternative models; it argues for the superiority of the OHCO model. The 1990 paper does not go quite as far as the 1987 paper by some of the same authors [Coombs et al. 1987], in which the claim is made that SGML is not only the best available model of text, but the best imaginable model of text. (Perhaps between 1987 and 1990 some of the authors strengthened their imaginations? We all learn; we all improve.) But the 1990 paper does argue explicitly for the superiority of the OHCO model to the available alternatives.

In the time since, it has frequently been claimed that OHCO is the philosophy underlying SGML and XML and most applications of SGML and XML, so the OHCO paper comes up frequently in critiques of the Text Encoding Initiative Guidelines [hereafter TEI], sometimes in the form of a claim, more often in the form of a tacit assumption, that the OHCO model underlies TEI. When I read such critiques I find myself asking Is OHCO really the philosophy of the TEI? Is the OHCO model really the model entailed by SGML and XML? On mature consideration my answer to this question is: Yes. No. Maybe, in part. Yes, but …. No, but …. Perhaps I should explain.

First, let's recall that yes and no are not the only possible answers to a yes/no question. Sometimes the line between agreement and disagreement is complicated. Allen Renear once asked Can we disagree about absolutely everything? Or would that be like an airline so small that it had no non-stop flights? There are many possible stages between belief and disbelief, or assent and refutation.[2]

A second difficulty is that the authors of the OHCO paper don't actually define the terms hierarchy, ordered hierarchy, or content object. Not every reader is certain that they understand what is meant by the claim that text is an ordered hierarchy of content objects. And those readers — and they seem to be the majority — who do feel confident that they understand what is meant clearly don't always understand the same things.

I have held the lack of definitions against the authors for some time, but thinking about it more recently, I have concluded there is probably good reason that they don't nail the terms down any more tightly than they do. Hierarchy and ordered hierarchy seem reasonably clear, even if we can quibble a bit about the edges. Content object is a pointedly vacuous term; to some readers it may suggest objects like chapters and paragraphs and quotations, and those are in fact some sample content objects that the authors mention. But we can also interpret a content object as an object that constitutes the content of something else, in which case it is as pointedly vacuous and possibly as intentionally vacuous as the SGML and XML technical terms element or entity. Those terms are vacuous for a reason, vague for a purpose, vague because it is you, the user (or we, the users), who are to fill them with meaning by determining for ourselves what we wish to treat as elements or entities. There is in SGML and XML and perhaps in the OHCO model no a priori inventory of concepts which are held to be appropriate for inclusion in a marked up document. The choice of concepts is left to the user. That is one of the crucial points of SGML.

The third difficulty I have in deciding how to answer the question is that we need to decide what it means to say that OHCO is the philosophy underlying SGML and XML. What do we mean by SGML or XML? There are at least three things we could plausibly mean. In the narrow sense, when people talk about SGML I frequently think that they mean the text of the specification itself, ISO 8879. And when they talk about XML, I assume they mean the text of the XML specification as published by the W3C. Sometimes when I'm feeling particularly expansive, I may think they mean XML and related specs. But (possibly because of my personal history) I do tend to assume that references to SGML and XML mean (or perhaps I mean that they ought to mean) the specs and the specs' texts themselves, independent of implementations and independent of related technologies.

But frequently when people are talking about the textual model in this or anything implicit in SGML or XML, what they have in mind is not the specifications themselves but what the creators of those specifications were trying to accomplish, or to be more precise what we now think that they were then trying to accomplish. Our ideas of what they were then trying to accomplish may or may not be based on asking them. And if we do ask them today what they were trying to accomplish then, we must bear in mind that they may not know. Or they may know and not want to tell us.

The third usage, probably the most frequent when people are talking about what is implicit in the use of SGML and XML, is what one might call standard average SGML, which includes but goes perhaps a little beyond the kind of commonalities that were identified by Peter Flynn the other day [Flynn 2017]. By standard average SGML I mean the set of beliefs, propositions, and practices that would be your impression of what the technology entails if you hung around a conference like this one or like the GCA SGML conferences of the late 1980s and early 1990s. It is quite clear that not everybody attending or speaking at those conferences believed the same things, but there was a certain commonality of ideas, and that commonality is not actually too far from the beliefs and propositions expressed in the OHCO paper.

OHCO and the SGML/XML specs

But, of course, as I say, because of my personal history I keep coming back to the text of ISO 8879 and the text of the XML spec, and in consequence I tend to resist the idea that SGML (the SGML spec), or even XML (that is, the XML spec), entails the idea that text consists of a single hierarchical structure. There are several reasons for this. I don't resist the idea that there is a model of text intrinsic to the specs. Specs do embody models, and we need careful hermeneutics of specs. But if we want to interpret the worldview of a spec, we need to pay close attention to the words used in the spec because words are what specs use to embody their views. It is no part of careful hermeneutics to offer an interpretation of a text which ignores the details of the text.

The privacy of your own CPU

ISO 8879 and XML define serialization formats. They define a data format; I would say they defined a file format, except that ISO 8879 notoriously does not use the term file. (That's one of the things that made it so desperately difficult to understand.) I sometimes suspect that ISO 8879 didn't use the word file because the editor of the spec, who had spent years working on IBM mainframes, really wanted to keep the door open to partitioned data sets. (Partitioned data sets may be simplistically described as files within files; Unix tar files and [except for the compression] ZIP files are roughly similar examples in other computing environments). The avoidance of the term file turned out to be very handy later, because when the Internet became more important everything in the SGML spec could be applied without any change in the spec at all to a stream of data coming in over a networking socket. If ISO 8879 had said this is what's in the file there would certainly have been language lawyers who said Well, a socket is not really a file so it doesn't apply; you can't use SGML or XML for that. The very abstract terminology of ISO 8879 is perhaps a mixed blessing. It made the spec much harder to understand in the first place, but it achieved a certain generality that went beyond what some readers (me, for example) might have expected.

Now, the usual model of processing for a format like that is that the spec defines what goes over the wire, what comes in over the socket, or what's in the file, and not the processing. Software that processes the data is going to read the file, parse it, build whatever data structures it wants to build, do whatever computations it wants to do, and serialize its data structures in an appropriate way. Even if it is going to write its output in the same format, the data structures used internally will not necessarily be closely tied to the input or output format. There is no necessary connection, and certainly no tight connection, between the grammar of the format and the form of the data structures. What you do in the privacy of your own CPU is your business.

In this, be it noted, SGML is very different from the kind of database models that were contemporary at the time it was being developed. Database models, as their name suggests, specified a model, to which access is provided via an API. Implicitly, or abstractly, database models define (abstract) data structures. SGML avoids doing so. It provides no API, and it requires no particular abstract data structure. Both specs are rather vague and general on the question of how the data are to be processed. The XML spec does say that we (the authors of the spec) conceive of a piece of software called an XML parser reading XML data and handing it in some convenient form to a downstream consuming application. That's as concrete as we got. ISO 8879 is not even that concrete; there is nothing in ISO 8879 that says there will be a general-purpose SGML parser and it will serve as client application software. ISO 8879 is compatible with that view, but it's also compatible with the view that guided by an SGML Declaration and a Document Type Definition, users of that DTD will write software that reads data that conforms to that DTD. If I remember correctly, this is how the Amsterdam SGML Parser worked [Warmer / van Egmond 1989]. It did not parse arbitrary SGML data coming in; you handed it an SGML Declaration and a Document Type Definition, and it handed you back a piece of software that would parse documents valid against that Document Type Definition. There was no general SGML application involved in the Amsterdam SGML Parser. It was a parser generator, as it were, a DTD compiler; general-purpose SGML parsers could in contrast be classified as DTD interpreters.

There is also no assertion in ISO 8879 that it should be used as a hub format. It's entirely possible to view it as a carrier format, to use the distinction that Wendell Piez introduced on Monday [Piez 2017]. This has both technical and political implications. For political reasons, it is frequently easier to persuade people to tolerate a format if they think Well, it's just a carrier format. It doesn't compete with my internal format. I have my own format which I use internally, and I can use this as a convenience to simplify interchange with other people. It doesn't require that I change anything at the core of my system, it only affects the periphery.

A graph, not a tree

The second reason to be skeptical of the claim that SGML embodies the OHCO thesis is an insight I owe to Jean-Pierre Gaspart, who wrote probably the first widely available SGML parser. He used to make emphatic reference to the idea that SGML does not define S-expressions. I infer that he may have been a LISP programmer by background. Those who aren't LISP programmers may want some clarification. An S-expression in LISP is either an atomic bit of data (a number, a string, an identifier, ...) or a parenthesized list of whitespace-separated S-expressions; since parentheses can nest, S-expressions can nest, and if you restrict yourself to S-expressions you build trees … and nothing but trees. Arbitrary directed graphs cannot be built with S-expressions because S-expressions have no pointers. In Gaspart's account, SGML does not define a data format which is isomorphic to S-expressions because SGML does have pointers; it has IDs and IDREFs. It follows that if there is a single data structure intrinsic to ISO 8879 and XML, it is the directed graph, not the tree. The portion of that graph captured by the element structure of the document is indeed a tree, which means that the data structure intrinsic to SGML and XML is a directed graph with an easily accessible spanning tree. And (if I may refer to yet another old-timer) Erik Naggum, the genius loci of comp.text.sgml for many years, used to insist (in ways that I didn't always find terribly helpful) that SGML does not define trees. He would deny with equal firmness that it defines graphs; he meant that SGML defines a sequence of characters, and the data structures you build from SGML input are completely independent, which, again, is true enough.[3]
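Gaspart's point can be made concrete with a small sketch. The sample document below is invented for illustration, and the attribute names id and ref are assumptions standing in for attributes that a DTD would declare as ID and IDREF (Python's ElementTree reads no DTD, so the sketch has to pick the names itself). The element nesting yields the tree edges; the pointers yield the extra edges that make the whole a directed graph.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The attribute names "id" and "ref" stand in
# for attributes that a DTD would declare as ID and IDREF.
SAMPLE = """
<doc>
  <sec id="s1"><p id="p1">See <xref id="x1" ref="s2"/>.</p></sec>
  <sec id="s2"><p id="p2">Compare <xref id="x2" ref="p1"/>.</p></sec>
</doc>
"""

def edges(root):
    """Return (tree_edges, pointer_edges) as lists of (source, target) labels."""
    def label(el):
        # Use the id where there is one, otherwise fall back to the tag name.
        return el.get("id") or el.tag

    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    tree_edges, pointer_edges = [], []
    for el in root.iter():
        # Parent-to-child edges: the spanning tree given by element nesting.
        for child in el:
            tree_edges.append((label(el), label(child)))
        # Pointer edges: an IDREF-like attribute naming another element.
        target = el.get("ref")
        if target in by_id:
            pointer_edges.append((label(el), target))
    return tree_edges, pointer_edges

root = ET.fromstring(SAMPLE)
tree_edges, pointer_edges = edges(root)
print(tree_edges)
print(pointer_edges)
```

In this sample the combined graph even contains a cycle: p1 contains x1, x1 points to s2, s2 contains p2 and x2, and x2 points back to p1. No S-expression without pointers can express that.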

What about CONCUR?

Finally, of course, if the view of text in SGML were that text has a single hierarchical structure, there would be no explanation for the feature known as CONCUR. CONCUR, for those of you who haven't used it, allows a single SGML document to instantiate multiple document types, each with its own element structure. With CONCUR, an SGML document has, oversimplifying slightly, two or three or ninety-nine element-structure trees and directed graphs drawn over the same sequence of character data.[4]
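The idea of several structures drawn over one sequence of characters can be sketched without any SGML at all. What follows is not CONCUR syntax; it is a stand-off approximation under invented names, in which each view is a list of (name, start, end) spans over the same string. Each view nests properly on its own, and the two views cross each other.

```python
# One sequence of character data, shared by both views.
TEXT = "The quick fox jumps. The lazy dog sleeps."

def span(name, fragment):
    """Locate fragment in TEXT and return a (name, start, end) span."""
    start = TEXT.index(fragment)
    return (name, start, start + len(fragment))

# Logical view: two sentences.
logical = [span("s", "The quick fox jumps."), span("s", "The lazy dog sleeps.")]
# Physical view: two lines, with the line break falling inside a sentence.
physical = [span("line", "The quick fox jumps. The lazy"), span("line", "dog sleeps.")]

def crosses(a, b):
    """True if spans a and b overlap without either containing the other."""
    _, a0, a1 = a
    _, b0, b1 = b
    return a0 < b0 < a1 < b1 or b0 < a0 < b1 < a1

crossing = [(a, b) for a in logical for b in physical if crosses(a, b)]
print(crossing)
```

The second sentence and the first line cross, so no single tree can hold both views as element structure; two trees over the same characters can.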

For all of these reasons, I tend to bridle at the suggestion that ISO 8879 embodies the thesis that text consists of a single ordered hierarchy of content objects. Certainly, the text of ISO 8879 limits documents neither to a hierarchy (as opposed to a directed graph) nor to a single structure (as opposed to multiple concurrent structures). It is certainly true that for most of the goals that the SGML Working Group had and discussed in public, a single hierarchy would probably suffice. And historically it's well-attested that CONCUR was a very late addition to the spec, added at a time when the Office Document Architecture [hereafter ODA], the big competition within ISO, was making great play of the fact that they could describe both the logical structure of a document and its formatted structure. In order not to be demonstrably less capable than the Office Document Architecture, it was essential that the SGML spec have something analogous. At this point, it's worth while to notice an important property of what the Working Group did and did not do. They did not say Okay, you can have two structures, a logical structure and a physical structure. In a move which I can only explain as a touching instance of blind faith in the design principle of generality, the Working Group said you can have more than one structure. They didn't say that one structure is logical and one is physical; they didn't say anything about what one might want to use multiple structures for, and they didn't supply an upper limit to the number of structures. (Even if all you want is a logical structure and a physical structure, a document might have multiple physical structures if it is sometimes rendered on A4 paper and sometimes on 5x8 book pages.) In the same way that they avoided specifying the word file, they avoided telling you what these multiple hierarchies were for.

The result, of course, is that the introductory part of the Office Document Architecture spec was a joy to read. It was clear, it was concrete — I loved it. The authors nailed things down, they were specific, they said exactly what they intended, they didn't say some image format or other, they specified what image format ODA processors would support. And so on. The only problem was that by the time I had saved the money to buy a copy of the ODA spec and was reading it and enjoying its concreteness, the spec was technologically obsolete. No one in their right minds would by that time have chosen that format for graphics and that format for photographs; it was just crazy. The ODA group had driven stakes into the ground, and then the tide had moved the shoreline, and they were high and dry, and they were not where they wanted to be. There were a number of ODA implementation efforts, but I'm not sure that any of them was ever completed, because by the time the implementation was nearing completion, it was easy to lose interest, because no user was going to want to use it.

SGML was wiser in a way. The SGML WG held things like graphics and photographic formats at arm's length. Instead of prescribing photo formats for SGML processors to support (which would seem to be a good idea for interoperability), they provided syntax to allow the user to declare the notation being used, and syntax for referring to an external non-SGML entity. The details of how an SGML system used your declaration to find appropriate processing modules is completely out of scope for SGML, with the consequence that SGML was compatible both with the formats that were contemporary with its development and with all the many, many formats that came later.

OHCO and Standard Average SGML/XML

So what's the alternative to OHCO?

On the other hand, OHCO does seem a fairly good description of the view of text that I remember from SGML conferences. That is, it matches up pretty well with Standard Average SGML, even if not with the text of the spec. When I first started working with SGML, I spent a lot of time consciously training myself to think about documents as trees. Now, why did I do this? Because up until then I had used alternative textual models, and it's worth pointing out that none of them was really a graph model.

One of the main alternative models of text available to us before ISO 8879 was text as a series of punchcards or punchcard records. (I won't bother trying to explain why I think that was a problematic model of text.) It was an improvement when someone introduced the notion that text is simply a string of characters. But, you know, text isn't really a string of characters; if you just have a string of characters, you can't even get decent output.

If you want decent output, you have to control the formatting process. The next advance in the modeling of text was that text is a string of characters with interspersed formatting instructions. If you want a name for this model, I call it the text is one damn thing after another (ODTAO) model. Imagine that you have indexed the text. Using the index, you search for a word and find an instance of the word somewhere in the middle of the document. In order to display the passage to the user, you want to know where you are in the text, what page you are on, what act and scene you are in, what language the current passage is in. And if we're talking about a text with embedded formatting instructions, you also need to know what the current margins are and what the current page number is. That means essentially that you want to know which of the instructions in the document are still applicable at this point. There is a simple way to find out; it is the only way to find out. You read the entire document from the beginning, making note of the commands that are in effect at the moment and the moment at which they are no longer in effect, and when you reach the point that your index pointed you at, you know what commands are still in effect at this point. You have now lost every advantage that using the index gave you in the first place because the whole point of an index is to avoid having to read the entire document from the start up to the point that you're interested in, in order to find things out.
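A minimal sketch of that replay, under assumed conventions: the stream below is invented, plain strings stand for text, and tuples stand for formatting instructions. The only way to learn what is in effect at a given position is to walk everything that precedes it.

```python
# Hypothetical stream: plain strings are text, tuples are formatting instructions.
STREAM = [
    ("margin", 10),
    "Act one begins. ",
    ("lang", "fr"),
    "Bonjour. ",
    ("margin", 4),
    ("lang", "en"),
    "The target word appears here.",
]

def state_at(stream, item_index):
    """Replay every item before item_index to learn which settings are in effect there."""
    state = {}
    steps = 0
    for item in stream[:item_index]:
        steps += 1
        if isinstance(item, tuple):
            key, value = item
            state[key] = value  # a later instruction overrides an earlier one
    return state, steps

# Suppose an index has pointed us at the last item in the stream.
state, steps = state_at(STREAM, 6)
print(state, steps)
```

The cost is proportional to the position of the hit in the document, not to anything local to the hit, which is exactly the advantage the index was supposed to provide.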

I should mention one other alternative model that may explain why so many of us grasped at SGML as if we were reaching for a life ring. I recently encountered a description of a pre-SGML proprietary system for descriptive markup. Like SGML, it allowed you to define your own fields. Unlike SGML, it was a binary format; there was no serialization form. There was no validation mechanism, so it was essentially just a data structure. I won't try to describe all the details of the data structure (and couldn't if I wanted to), but anyone for whom the following sentence is meaningful will have an immediately clear idea of the essentials: it was a MARC record with names instead of numbers for the fields.

Now, those of you who don't know what MARC records are like need to know that the MARC record was invented by a certifiable genius (Henriette Avram) who in the 1960s analysed the exceptionally complicated structure of library catalog data and found a way to represent it electronically in a machine-tractable form. Unfortunately, the machine-tractable form that she came up with has made many a grown programmer gnash their teeth and try to tear their own hair out. First of all, MARC defines units called records; everything is either a record or a collection of records. A record consists of a header, which has a fixed structure so it can be read automatically; you have a general purpose header reader that reads the header, and it then knows what's in the record. In the case of MARC, the header is a series of numbers which identify specific fields (types of information), followed by pointers that identify the starting position and length of the field within a data payload which follows the header and makes up the rest of the record.

Now this arrangement has some beautiful properties; you can read a record in from tape and process it easily. There are pointers, so it's easy to get at any portion of the record that you're interested in to process it, and you never have to move/copy things around in memory. If you're editing a MARC record, you can just add more data to the end of the payload and change the pointers and leave the old data as dead data (like junk DNA). And then if you want, you can clean it up later before you write it back up to tape or to disk. (In fact most of the first MARC processors used tapes, not disks, because disks weren't big enough.)

Notice that with such a record structure there is no problem with overlapping fields. There's nothing to prevent two different items in the header from pointing at overlapping ranges. I don't think anyone in their right minds ever tried to use overlapping fields, but there's nothing in the specification itself to prevent it.
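The record structure just described can be sketched in a few lines. This is a deliberately simplified stand-in, not the actual MARC byte layout: the class name, the method names, and the use of a dictionary for the header are all assumptions made for illustration, and the field tags serve only as labels.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """A simplified, hypothetical record: a directory of pointers plus a payload."""
    directory: dict = field(default_factory=dict)  # tag -> (start, length)
    payload: str = ""

    def get(self, tag):
        """Follow the pointer for tag into the payload."""
        start, length = self.directory[tag]
        return self.payload[start:start + length]

    def put(self, tag, value):
        """Edit by appending: the old data stays behind in the payload as dead data."""
        self.directory[tag] = (len(self.payload), len(value))
        self.payload += value

rec = Record()
rec.put("245", "Moby Dick")
rec.put("100", "Melville, Herman")
rec.put("245", "Moby-Dick; or, The Whale")  # the old title is now dead data

# Nothing in the structure stops an entry from pointing at a range
# that overlaps other fields.
rec.directory["999"] = (5, 10)
print(rec.get("245"), "|", rec.get("100"), "|", rec.get("999"))
```

The last entry reads characters that belong partly to the dead title and partly to the author field. The structure permits it; only convention forbids it.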

OHCO as liberation

Compared to these models, a model of text based on user-specified content objects was almost guaranteed to feel like an improvement, especially if those content objects were suitably descriptive and generic. A model based on hierarchical structure is easier to work with, visualize, and reason about than the one-damn-thing-after-another model, and hierarchical structures allow a richer description of textual structure than a format based on flat fields within a record. If the hierarchy allows arbitrarily deep nesting, it is a much better match for text as we see it in the wild than the fixed-level document / paragraph / character hierarchy on which most word-processing software is built. And of course a model which keeps track of ordering is essential for text, in which order is by default always meaningful, in contrast to relational database management systems in which the data are by default always unordered.

A tree or graph also makes it easier to identify and understand the relevant context at any point in the document, if you jump into the document by following an index pointer. In a tree or graph, you can just consult the nodes above you in the graph to know your context. Each ancestor node may be associated with semantics of various kinds (whether those are abstractions like act and scene or formatting properties like margin settings) and will tell you something about the current environment. Assuming a reasonably coherent usage of SGML elements, that will probably give you the information you're looking for. N.B. This is not the only way to use SGML elements; if you're using milestone elements, you're back to reading from the beginning of the document. But in most SGML applications, thinking of text as a tree instead of as a set of flat fields or as a series of one damn thing after another was a huge step forward.
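As a sketch of what consulting the nodes above you looks like in practice, assume a small invented play encoded with act, scene, and speech elements. Python's ElementTree keeps no parent pointers, so the sketch builds a child-to-parent map first; the element and attribute names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of a play; element and attribute names are illustrative.
PLAY = """
<play lang="en">
  <act n="3">
    <scene n="1">
      <speech who="Hamlet"><l>To be, or not to be</l></speech>
    </scene>
  </act>
</play>
"""

root = ET.fromstring(PLAY)
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def context(node):
    """Collect (tag, attributes) for each ancestor of node, innermost first."""
    out = []
    while node in parent_of:
        node = parent_of[node]
        out.append((node.tag, dict(node.attrib)))
    return out

hit = root.find(".//l")  # stands in for a hit returned by an index
ctx = context(hit)
print(ctx)
```

The number of steps is the depth of the hit in the tree, not its distance from the start of the document.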

So, OHCO had a great appeal to me. But not really because I thought text was intrinsically a single ordered hierarchy of content objects. I thought that a single ordered hierarchy of content objects would be a good way to model text even though text by nature was slightly more complicated. That is to say, I more or less agreed with the character in Hemingway's The Sun Also Rises who, when someone suggests something to him, says Oh, wouldn't it be pretty to think so? [Hemingway 1926] Text may not be an ordered hierarchy of content objects, but wouldn't it be pretty to think so, or with the Italians se non è vero, è ben trovato.[5]

Now, as formulated, the OHCO paper makes a clear and refutable claim: we think that our point can be scarcely overstated: text is an ordered hierarchy of content objects. What really bothers me about that sentence is the singular article an. I'm pretty sure, however, that this is a case of even forward thinkers not being able to capture exactly what they mean. This can happen because, especially when you're trying to persuade other people, there's a limit to exactly how far you can go in your formulation. Some of you will remember the columnist Jerry Pournelle, who in the 1980s promoted a shift away from mainframe computing to personal computing with the slogan One user, one CPU. And within ten years found himself having to explain I didn't mean maximally one CPU; I meant at least one CPU.

I discovered, re-reading the OHCO paper recently, that the authors didn't actually argue that text is at most one ordered hierarchy of content objects; they explicitly claim, in a discussion of future developments, that many documents have more than one useful structure. They observe that Some structures cannot be fully described even with multiple hierarchies, but require arbitrary network structures. And they point out that version control and the informative display of document version comparisons will pose great challenges.

Those who have ears to hear, let them hear

I think the OHCO thesis can be regarded perhaps as an attempt to capture what was new: not everything in ISO 8879, but what was new and liberating. There are many things it doesn't address; I've mentioned some of them. I haven't mentioned something that I never thought of as terribly important but which I know the editor of ISO 8879 felt was important: namely, the more-or-less complete orthogonality of the logical structure from the physical organization of the document into entities. XML limited that orthogonality; XML requires every element to start and end in the same entity. That degree of harmony between logical and physical structure was not required in ISO 8879; we lost a certain amount of flexibility at that point. I personally have never missed it. I don't quite know why WG8 thought it was an advantage to have it (or why that particular member of WG8 thought so), but anyone who prefers complete orthogonality of storage organization and logical structure will think of that rule of XML as a step backwards.

The things that I think the OHCO thesis captures well are

  • the focus on the notion of the essential parts of a document

  • the notion that those essential parts might vary with the application

  • the notion that it was the user's right and responsibility to decide what the essential parts of the application are

  • the separation of document content and structure from processing, which led allegedly to simpler authoring and certainly to simpler production work, simpler generic formatting, more consistent formatting, better information retrieval, better ways for retrieval, easier compound documents, and some notion of data integrity.

Those are the kinds of things that SGML and XML have that serve to make it possible to do the kinds of cool things we've heard about here in several talks: for example Murray Maloney's multiple editions of Bob Glushko's book on information [Maloney 2017], or Pietro Liuzzo's project on Ethiopic manuscripts [Liuzzo 2017]. Anne Brüggemann-Klein and her students have shown us that a sufficiently powerful model of text, as instantiated not just in XML but in the entire XML stack, can handle things that are not at all what we normally think of as text [Al-Awadai et al. 2017]. Each project in its way, I think, is a tour de force.

But to be fair, not everybody cares about all of these things. Some people are not impressed when things shift from being impossible to being possible; they want them to be easy, or they're not interested. And that may be why some of the things that we were excited about 30 years ago, we're not excited about now. Relatively few users were worried then about overlap; very few people cared. They voted with their feet for implementations of SGML that did not implement that optional feature. I sometimes think that the reason the OHCO paper focuses where it does is that it was trying to say things people could understand and trying to avoid saying things that people were not ready to hear. It doesn't help much to say things that people are not ready to understand, although occasionally someone will remember thirty years later, the way I now remember Jean-Pierre Gaspart telling us all that SGML is not just an S-expression with angle brackets instead of parentheses. It was thirty years before I understood what he was talking about, but now I think I understand.

What goes in the model, what goes in the application?

If we step back, if we ask how various models of text compare with each other (OHCO, what is actually in ISO 8879, hypertext the way Ted Nelson defines it, hypertext the way HyperCard defines it, the one-damn-thing-after-another model), I think OHCO looks pretty good compared to most of its contemporaries. It's probably true, however, that a more general graph structure might be better; it would certainly be more expressive.

The design of a model is sometimes a tradeoff among competing factors like expressive power, simplicity for the users, and simplicity for the implementors. There are reasons to want to get as much information as possible into the model. If you can capture a constraint in a model and if you have generic software to enforce it, then every application built on top of the data gets the preservation of that constraint for free. This is why it makes perfect sense for relational databases to allow you to declare referential integrity constraints and to enforce them: once you declare the constraints, the database enforces them (unless you were using MySQL a few years ago, when it didn't enforce them), and you don't have to ensure that every application program that reads your database is careful about those constraints.
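A small illustration of a constraint that lives in the model rather than in each application, using Python's bundled sqlite3 module. The table and column names are invented for the example. SQLite happens to leave foreign-key enforcement off unless it is switched on per connection, so the sketch turns it on explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when asked, per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT, "
    "author_id INTEGER NOT NULL REFERENCES author(id))"
)
conn.execute("INSERT INTO author (id, name) VALUES (1, 'A. Author')")
conn.execute("INSERT INTO paper (title, author_id) VALUES ('Valid', 1)")

def try_insert(title, author_id):
    """Return True if the database accepts the row, False if it rejects it."""
    try:
        conn.execute(
            "INSERT INTO paper (title, author_id) VALUES (?, ?)",
            (title, author_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("Dangling", 99)  # there is no author with id 99
paper_count = conn.execute("SELECT COUNT(*) FROM paper").fetchone()[0]
print(accepted, paper_count)
```

The application code contains no check of its own; the declared constraint does the work, and it would do the same for any other program writing to the same tables.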

A richer model provides, in this way, a safer environment in which to program. Everything that's not in the model, everything that's not expressed formally in an enforceable way, is something that every application program you write has got to be careful about. So the more tightly and formally your model can describe what is correct, the cleaner your data can be. It may be true, as Evan Owens told us on Monday, that you will never know 100% of the ways that authors can get things wrong, but over time you can (if you pay attention) learn more and more of them, and if you can get them into the schema, you can protect your downstream software … if you have formalisms that allow you to check those things [Owens 2017].

Further, it has been demonstrated by several papers here, including the ones by David Lee and Matt Turner, that the fuller the model is, the more information it has about the data, the more interesting things it can do by itself without much input from us [Lee 2017, Turner 2017]. And processors that know more about the data can optimize better and more safely than processors that don't know anything about the data. So, it's better that things go into the model — up to a certain limit. The countervailing argument is that sometimes it's better to have less elaborate models, because simpler models are easier to use and understand and easier to support in software, and many applications don't in fact need complex constraints. If constraints are hard to implement or slow things down, then implementors may omit them, or users may turn them off, the way some users turn off referential integrity constraint checking even in databases that support it, because they would rather have fast, wrong software than correct, slow software. That's a choice they get to make.

I notice there's a relation here between modeling and whether the format being defined is intended as a carrier or a hub. Formats that don't impose tight constraints may be better for carrier format functions. If a formalism or model is opinionated and says This is the way it's got to be, it's going to make it easier to do interesting things with the data that obeys those constraints, and it may be what you want for a hub format, whereas a more cynical format that doesn't really have any strong convictions but just allows anything to happen, like the carrier format that Wendell Piez was talking about (viz. HTML) [Piez 2017], may be better for carrier functions. Both SGML and many SGML applications like TEI were kind of vague about whether they expected to be used as a hub format or a carrier format. I think that was partly for political reasons, partly because the distinction may not have been clear to us all at the time.

There's another instructive example that we can spend a moment on, I think. In the 1950s, programming languages were defined in prose. And writing a parser for a programming language meant struggling with the prose of the spec and trying to figure out what on earth it meant in the corner case that you were currently facing in your code. And the nature of human natural-language prose being what it is, different readers occasionally reached different conclusions. This is one reason that when in 1960 the Algol 60 Report came out and introduced the notation called Backus-Naur Form [hereafter BNF] to provide a formal definition of the syntax, computer scientists were, as far as I can tell, immediately won over to the formalism. A concise formal definition of the syntax makes it possible to make inferences from the notation. One could know how the corner case was supposed to be handled, assuming that the grammar was correct — and the grammar was by definition correct, so you were home free. Computer science spent the next ten or twenty years developing one method after another to go from a formal definition of a grammar in BNF, or later in extended BNF [hereafter EBNF], to a parser, systematically and eventually automatically, so you could just write a grammar, run a program on that grammar, and have a parser. So pretty much every programming language now provides a grammar of the language in BNF or EBNF.

But the formalism doesn't capture everything. Not every string of characters that's legal against the Algol 60 grammar is a legal Algol 60 program. And the same is true for any other programming language that's more than an intellectual curiosity, because programming languages are not, in fact, context-free languages; they are context sensitive. Algol 60 was typed, and if over here you had declared a certain variable as of type Boolean, you were not allowed to assign it the value 42 over there. But that amounts to saying that the set of assignment statements legal at any given point is dependent on the context, and context is precisely what a context-free grammar cannot capture.

In the preparation of Algol 68, the Dutch computer scientist Adriaan van Wijngaarden made a concerted effort to fix this state of affairs. In Algol 68, he was determined to push all those constraints into the formalism. To manage that, he needed, and he duly invented, a stronger grammatical formalism (known today as two-level grammars or van Wijngaarden grammars). He noticed that if there were a finite number of legal identifiers, a context-free grammar could actually capture the kinds of constraint mentioned above involving declaration and typing of the variables. And if you have an infinite number of identifiers (as you do in any realistic programming language), you can manage to express the constraint if you allow yourself to imagine, not a finite context-free grammar, but an infinite context-free grammar. So, van Wijngaarden invented infinite context-free grammars. Now, he didn't want to try to write any infinite grammars down line by line, so he invented two-level grammars. At one level there is a context-free grammatical base that has notions called hyper-notions and meta-notions which in turn generate, at the second level, an infinite number of rules. For any given Algol 68 program, you can generate a finite subset of the infinite grammar of Algol 68 that suffices for parsing the particular program before you.[6]
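A toy sketch may help with the last step, the finite subset of rules that suffices for one particular program. This is emphatically not a van Wijngaarden grammar, and it uses none of the Algol 68 notation. It only illustrates the observation that once the declared identifiers are known and finite, ordinary context-free rules can carry the typing constraint. The rule names, the two types, and the tiny set of literals are all assumptions made for the example.

```python
def expand_rules(declarations):
    """Instantiate one 'assignment' rule per declared identifier.

    declarations maps an identifier to its type name, e.g. {"x": "bool"}.
    The result is a finite, plain context-free grammar: each nonterminal
    maps to a list of alternatives, and each alternative is a list of symbols.
    """
    rules = {
        "bool_literal": [["true"], ["false"]],
        "int_literal": [["0"], ["1"], ["42"]],  # tiny stand-in for all integers
        "assignment": [],
    }
    for name, type_name in declarations.items():
        # The identifier's type is baked into the right-hand side of its rule.
        rules["assignment"].append([name, ":=", f"{type_name}_literal"])
    return rules


def derives(rules, symbol, tokens):
    """True if symbol derives exactly the token list (non-recursive grammars only)."""
    if symbol not in rules:  # a terminal matches only itself
        return tokens == [symbol]
    return any(matches(rules, alternative, tokens) for alternative in rules[symbol])


def matches(rules, sequence, tokens):
    """True if the symbols in sequence, in order, derive exactly the token list."""
    if not sequence:
        return not tokens
    head, rest = sequence[0], sequence[1:]
    # Try every way of splitting the tokens between the first symbol and the rest.
    return any(
        derives(rules, head, tokens[:i]) and matches(rules, rest, tokens[i:])
        for i in range(len(tokens) + 1)
    )


grammar = expand_rules({"x": "bool", "n": "int"})
for statement in ["x := true", "n := 42", "x := 42", "y := 1"]:
    print(statement, "->", derives(grammar, "assignment", statement.split()))
```

Here "x := 42" fails for a purely syntactic reason: no rule in the generated grammar produces it. The price is that the grammar now depends on the program being parsed.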

It is a brilliant mechanism; its only flaw is that the grammar is now unreadable. It is almost certainly impossible for anyone in that Working Group (including, I suspect, van Wijngaarden himself) to look at the grammar and know for sure whether a given formulation is or is not a correct expression of the design agreed by the WG on some particular technical point — because the grammar is too complicated. It's like reading source code for a parser. There is a good reason that most programming languages are not defined by reference implementations: it is too hard to tell whether the reference implementation is correct or not. Now, of course, a reference implementation is correct by definition, but it's only correct by definition once the Working Group has said it's correct.

And so there are really not many implementations of two-level grammars. I know of exactly one, and I think it was a partial implementation. So, in a way, having too strong a formalism is like going back to the 1950s: you have to study the spec and try to figure out what it means. People wanted these constraints to be in the grammar because experience had shown that grammars were easy to understand, but by the time those constraints are pushed into the grammar — into the model — the model is no longer easy to understand. There is a tradeoff. Most programming languages now define context-free grammars, and then they define an additional list of context-sensitive constraints that you have to meet. You can formalize that, too. Attribute grammars are a way of formalizing that. Essentially, attribute grammars have a different kind of two-layer formalism: a context-free grammar, together with a set of rules for assigning attributes to each instance of a non-terminal and calculating the values for those attributes. So, perhaps the solution is to have layered models in which each layer individually is relatively simple, easy to understand, and easy to check, and in which the conjunction of all layers and their constraints allows you to do things that are more complex.
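The attribute-grammar style of layering can be sketched in a few lines. The sketch below is a simplified illustration, not any particular attribute-grammar system: the node classes, the two type names, and the treatment of literals are assumptions. The context-free layer is represented by the shape of the tree; the second layer assigns each expression a type attribute and states the constraint in terms of those attributes.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Literal:
    text: str


@dataclass
class Variable:
    name: str


@dataclass
class Assignment:
    target: Variable
    value: Union[Literal, Variable]


def type_of(node, env) -> Optional[str]:
    """Compute the 'type' attribute of an expression node.

    env plays the role of an inherited attribute: the declarations in scope,
    as a mapping from identifier to type name. The returned type is a
    synthesized attribute. None means no type could be assigned.
    """
    if isinstance(node, Literal):
        if node.text in ("true", "false"):
            return "bool"
        return "int" if node.text.isdigit() else None
    if isinstance(node, Variable):
        return env.get(node.name)  # None for an undeclared identifier
    raise TypeError(f"not an expression node: {node!r}")


def is_well_typed(statement: Assignment, env) -> bool:
    """The context-sensitive rule: both sides must have the same known type."""
    target_type = type_of(statement.target, env)
    return target_type is not None and target_type == type_of(statement.value, env)


declarations = {"x": "bool", "n": "int"}
print(is_well_typed(Assignment(Variable("n"), Literal("42")), declarations))
print(is_well_typed(Assignment(Variable("x"), Literal("42")), declarations))
```

Each layer stays small enough to read: the grammar says what an assignment looks like, and the attribute rules say which assignments are legal in a given context.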

That's the way programming languages work; Will Thompson showed us a nice example of the kind of thing I have in mind [Thompson 2017]. The underlying model that he is working with doesn't know anything about redundancy; he wants to introduce controlled redundancy, so he invents a way of marking the redundancy and writes a little library that fits on top of the underlying engine and provides a more complicated model, in which you can have the redundant version of the document that's easy to retrieve or strip the redundancy for other purposes. It feels like a very SGML/XML-like thing to do. If the off-the-shelf models and tools don't do what we need, we can layer what we need on top of them.

Of course, sometimes layers of that kind just feel like work-arounds, like hacks. Sometimes what you need to do is step back and think things through from the beginning. The outstanding example at this conference is the paper by Ronald Haentjens Dekker and David Birnbaum showing what things can look like if you step back and try to re-think the model of text from the beginning [Haentjens Dekker and Birnbaum 2017]. Their notion of using hyper-graphs as a way to keep the model simpler than other graph models — brilliant. I don't know how such a model can support the multiple orderings you propose as a topic for future work; you might need to layer something on top of it. But it's very exciting work.

Another open question there is how to match the capabilities offered by SGML and XML that work together so well. SGML and XML formally define a serialization format, but implicitly they suggest a data structure: an element tree with pointers. And the element tree in turn suggests a validation mechanism. You can write document grammars and treat the element structure as an abstract syntax tree for that document grammar, so you can constrain your data in ways that help you find a number of mechanical errors automatically. When you re-think things from the ground up, lots of interesting things become possible. Mary Holstege provided a very challenging but rather exhilarating example of the kind of considerations that need to go into the re-thinking of a model or a language — lessons from long ago that may nevertheless still be useful [Holstege 2017].
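As a rough illustration of treating the element tree as a syntax tree for a document grammar, here is a minimal sketch using Python's standard xml.etree.ElementTree and re modules. The vocabulary (article, section, title, para) and its content models are hypothetical, and the sketch is far weaker than a DTD or schema validator: it checks only the sequence of child elements and ignores character data and attributes.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models. Each maps an element name to a regular
# expression over the names of its child elements, each followed by a space.
CONTENT_MODELS = {
    "article": r"(title )(section )+",
    "section": r"(title )(para )*",
    "title": r"",
    "para": r"",
}


def validate(element):
    """Return a list of error messages for an element and all its descendants."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    children = "".join(child.tag + " " for child in element)
    if model is None:
        errors.append(f"undeclared element: {element.tag}")
    elif re.fullmatch(model, children) is None:
        found = children.strip() or "(no child elements)"
        errors.append(f"bad content in {element.tag}: {found}")
    for child in element:
        errors.extend(validate(child))
    return errors


good = ET.fromstring(
    "<article><title>T</title>"
    "<section><title>S</title><para>p</para></section></article>"
)
bad = ET.fromstring("<article><section><para>p</para></section></article>")

print(validate(good))
print(validate(bad))
```

The missing titles in the second document are exactly the kind of mechanical error that can be caught once, generically, instead of in every downstream program.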

Sometimes the model that we want to formalize is whatever we know how to model formally. Sometimes it's where we think we found a sweet spot in the tradeoffs between expressive power and simplicity. Sometimes the model expresses what you can get the people in the room to agree on.[7]

OHCO captures, I think, pretty well, at least in the wouldn't-it-be-pretty-to-think-so sense, what most standard average users of SGML could agree on. They didn't all agree on CONCUR, or rather they did mostly agree on CONCUR: they agreed they didn't want it. They mostly didn't think that ID and IDREF were a fundamental part of the model, even though Jean-Pierre Gaspart did. They did think that trees were important, so all of the tutorials talk about the tree structure of XML. This is one reason that people believe that the OHCO model is what motivated ISO 8879: they read the tutorials rather than ISO 8879 — can we blame them?

But what we agree on, of course, varies with time and geography, and it changes when we hear other people who think differently and we argue with them. Sometimes we persuade each other. And to do that, to hear others and argue with them and persuade or be persuaded, we come to conferences like this one. I have learned a lot at this year's Balisage; I hope you have, too. I have enjoyed hearing from you in talks and during breaks and arguing with some of you about this and that, including the right way to model text. Thank you for coming to Balisage. Let's do it again sometime![8]


[Al-Awadai et al. 2017] Al-Awadai, Zahra, Anne Brüggemann-Klein, Michael Conrads, Andreas Eichner and Marouane Sayih. XML Applications on the Web: Implementation Strategies for the Model Component in a Model-View-Controller Architectural Style. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Bruggemann-Klein01.

[Coombs et al. 1997] Coombs, J. H., A. H. Renear, and S. J. DeRose. Markup systems and the future of scholarly text processing. Communications of the Association for Computing Machinery 30.11 (Nov. 1987): 933–947.

[DeRose et al. 1990] DeRose, Steven J., David G. Durand, Elli Mylonas and Allen H. Renear. What is text, really? Journal of Computing in Higher Education 1, no. 2 (1990): 3-26. doi:https://doi.org/10.1007/BF02941632.

[Flynn 2017] Flynn, Peter. Your Standard Average Document Grammar: just not your average standard. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Flynn01.

[Grune / Jacobs 2008] Grune, Dick, and Ceriel J. H. Jacobs. Parsing Techniques: A Practical Guide. New York: Ellis Horwood, 1990; Second edition [New York]: Springer, 2008.

[Haentjens Dekker and Birnbaum 2017] Haentjens Dekker, Ronald, and David J. Birnbaum. It's more than just overlap: Text As Graph. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01.

[Hemingway 1926] Hemingway, Ernest. The Sun Also Rises. New York: Charles Scribner's Sons, 1926. Reprint. New York: Scribner, 2006.

[Holstege 2017] Holstege, Mary. The Concrete Syntax of Documents: Purpose and Variety. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Holstege01.

[Lee 2017] Lee, David. The Secret Life of Schema in Web Protocols, API's and Software Type Systems. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Lee01.

[Liuzzo 2017] Liuzzo, Pietro Maria. Encoding the Ethiopic Manuscript Tradition: Encoding and representation challenges of the project Beta maṣāḥǝft: Manuscripts of Ethiopia and Eritrea. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Liuzzo01.

[Maloney 2017] Maloney, Murray. Using DocBook5: To Produce PDF and ePub3 Books. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Maloney01.

[Owens 2017] Owens, Evan. Symposium Introduction. Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Owens01.

[Piez 2017] Piez, Wendell. Uphill to XML with XSLT, XProc … and HTML. Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Piez02.

[YouTube Film Clip] You keep using that word. I do not think it means what you think it means. The Princess Bride, YouTube video, 00:07. Clip from film released in 1987. Posted by Bob Vincent, January 9, 2013. https://www.youtube.com/watch?v=wujVMIYzYXg.

[Thompson 2017] Thompson, Will. Automatically Denormalizing Document Relationships. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Thompson01.

[Turner 2017] Turner, Matt. Entity Services in Action with NISO STS. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Turner01.

[Warmer / van Egmond 1989] Warmer, Jos, and Sylvia van Egmond. The implementation of the Amsterdam SGML Parser. Electronic Publishing 2.2 (December 1989): 65-90. A copy is on the Web at http://cajun.cs.nott.ac.uk/compsci/epo/papers/volume2/issue2/epjxw022.pdf.

[van Winjgaarden et al. 1976] van Wijngaarden, A[driaan], et al. Revised Report on the Algorithmic Language Algol 68. Berlin, Heidelberg, New York: Springer, 1976.

[1] There are YouTube clips of the moment, among them the one at [YouTube Film Clip].

[2] He has not, as far as I know, put this line into a published paper; I heard it in a talk he gave at the Society for Textual Scholarship in (if memory serves) 1995.

[3] Jean-Pierre Gaspart wrote an SGML parser for a company then called SoBeMAp (Société Belge de Mathématique Appliquée) and later called SEMA Group. I heard him make the observation attributed to him in more than one TEI working group meeting. No doubt he made the point elsewhere, as well; it had the air of a well-rehearsed claim. Erik Naggum's remarks can no doubt be found in online Netnews archives, though the volume of his contributions may make it a time-consuming search.

[4] This is an over-simplification I used for a long time because I couldn't bear the complication of the actual truth, but the simple fact of the matter is that the character data within a given document type can consist in part or even in whole of ENTITY references, and ENTITY references can vary between document types. So there is actually also no guarantee that a document with concurrent markup will have the same string of base characters in each document type. It gets hard to think about when you contemplate that, so most discussions are concerned only with the case in which each document type has the same character data.

[5] The actual remark in Hemingway is slightly different:

Oh, Jake, Brett said, we could have had such a damned good time together.

Ahead was a mounted policeman in khaki directing traffic. He raised his baton. The car slowed suddenly pressing Brett against me.

Yes, I said. Isn't it pretty to think so?

[6] The authoritative source for Algol 68 is of course the Algol 68 report [van Winjgaarden et al. 1976]; for a somewhat more accessible treatment of two-level grammars see Grune and Jacobs [Grune / Jacobs 2008].

[7] Justice Frankfurter is reputed to have said that the single most important skill in a Supreme Court justice is the ability to count to five (five being what gives you a majority on the nine-member Supreme Court). This may be folklore; at least, I have not found any source to cite. The principle is sometimes said to be the most important skill for an appellate lawyer arguing before the Court.

[8] I am grateful to Tonya Gaylord for transcribing the recording of these remarks; in copy-editing the text I have mostly let the oral style of the original remain, though I have tried to make the text clearer and easier to read. I have not hesitated, however, to elaborate some points well out of proportion to their treatment in the initial presentation.