Do we really want to see markup?

James David Mason

Abstract

Markup fanatics have long cried, “We need to see the markup!” Yet since the earliest stages of developing the SGML standard, there has been an urge even among standards developers to avoid having to write tags everywhere. The recent urge to create “Invisible XML” is but the latest symptom of a smoldering disease, from which I too suffer.

Prologue

Why do we want to see markup?

That's not a question I would have asked forty years ago when I started using computers to process text. I first experienced document markup as an editor and writer in the publishing organization at Oak Ridge National Laboratory: I taught myself the coding for our typesetting system (developed in house by a physicist) so I could have more control over my documents. No WYSIWYG was available to me then! I worked with markup on hard copy and edited it using a line editor on a teletype terminal. Because I had done that typesetting and also some FORTRAN programming, I was picked to be the guinea pig for our new UNIX-based publishing system and eventually to train the rest of our staff. I found myself with a full-screen editor on a CRT (much quieter than the teletype), learning troff, tbl, and eqn. Basic troff typesetting wasn't all that different from what I knew (the systems shared Runoff as a common ancestor), but Joe Ossanna and Brian Kernighan had made troff programmable, and that meant that there were macro packages, the abstractions of patterns in markup.

My life changed forever: I had encountered Generic Markup! This appealed to me. All my life I had been interested in patterns. I had encountered Joseph Campbell and The Hero with a Thousand Faces early in my college career, studied Jungian archetypes, and written a dissertation on patterns in early Germanic literature. Now I had found something based on patterns I could use in my work—and get paid for it.

I chose the MM (Bell Laboratories Memorandum Macros) package as being most suited to our work at ORNL and set about adapting the package to our requirements. I also rewrote parts of eqn. Then I started training our composition staff and eventually the other editors. I attended one of the early conferences in the Seybold series, where someone from IBM talked about something called Generic Markup Language. I realized there was a kind of community; other people were working on other types of generic markup.

My success with the project at ORNL led to my being asked to present it at a Department of Energy conference. There I met Millard Collins, chairman of a new ANSI committee (X3V1) working on how to make the new word-processing systems just becoming popular communicate with each other. Since part of my job was to get text out of word processors and into our UNIX system, I joined the committee at its organizational meeting in the fall of 1981. At that meeting, I met Charles Card, who suggested I join his committee (X3J6), which was working on, among other things, a Standard Generalized Markup Language. I first attended X3J6 in the spring of 1982, and there I had my first encounter with Charles Goldfarb, the project editor and driving force behind SGML.

The first of these committees (and its ISO counterpart, ISO/IEC JTC1/SC18) started work on something called Office Document Architecture (later Open Document Architecture), ISO 8613 [ODA], now largely forgotten. SGML, ISO 8879 [SGML], developed originally by X3J6 and its ISO counterpart, the JTC1 Experts Group on Computer Languages for Processing Text, is still with us. The two ISO committees eventually merged into SC18. In the fall of 1985, I became the convenor of the ISO working group responsible for SGML and related projects (SC18/WG8). ODA was managed by a parallel working group (SC18/WG3). After the demise of ODA, my SGML group became the primary committee in 1998, just as XML was getting started. (As ISO/IEC JTC1/SC34, it still exists [SC34].) The competition between SGML and ODA went on for nearly eighteen years. While there were many technical issues (and much electro-politics) involved, in many ways the competition was about the difference between visible and invisible markup.

ODA, SGML, and the First Hints of Invisible SGML

Most of the people working on SGML came, like me, from the documentation and the scientific and technical publishing industries. We prided ourselves on our connection to technology, and we were used to typing codes into computers. We were used to long, highly structured documents—and lots of code.

Those who joined the world of descriptive markup only after the arrival of XML may not realize how endangered that world had been only a few years earlier. The SGML/ODA Wars are, thankfully, long over and forgotten, except by those of us who still have scars from them. In retrospect, I think SGML might have survived on its own, in a niche community; but if we had not survived the wars, we wouldn't have been able to build a support system for it. In particular, we wouldn't have had DSSSL (Document Style Semantics and Specification Language, ISO/IEC 10179), and without that we wouldn't have had the basis to build XSL and XQuery.

The ODA project was driven largely by makers of word-processing systems and also by national telecommunications agencies that were looking to offer yet another tariffed service. While they dreamed of WYSIWYG, the reality of their work was long constrained by their hardware, particularly its inability to produce more than typewriter-like output when the project began. What ODA seemed to desire most was a system that offered a working screen free of codes. Nonetheless, ODA had a foundation that was not so simple as its surface goals might suggest, and indeed it had considerable influence on SGML and its approach to coding. From its beginnings in Wolfgang Horak's dissertation, ODA had an implicit interest in generic structures [Horak-Kroenert-83], and in the earliest ISO drafts ODA proposed that documents possessed two concurrent, interleaved, high-level document structures, layout and logical. What these structures involved was never made completely explicit, though layout obviously had to do with rendition on the screen and page. The logical structure apparently dealt with paragraph-like objects. ODA was the cloud computing of the 1980s: an office was expected to rent an ODA terminal from its telephone company, and the documents would reside on the company's mainframes. The ODA standard was eventually published in 14 volumes, with several supporting technical reports.

From the beginning, ODA assumed that the serialization of documents would be in binary form, ODIF (Office Document Interchange Format). The notation selected was based on ASN.1 (Abstract Syntax Notation One [ASN1]), though with modifications because of the concurrent structures. Below the page level, the layout structure was control codes for rendering devices, which amounted to invisible inline procedural markup. For the logical structure, however, the developers turned to type-length-value triplets, with byte count pointers as a kind of implied stand-off generic markup.

During the earliest years of the ODA project, I attended their meetings and brought back their discussions to the SGML committee. Most of the SGML team considered ODA a distraction, but it intrigued Goldfarb, who took it as a personal challenge to develop an SGML representation for anything and everything proposed for ODIF. One of the first results of this was the introduction of the CONCUR feature into SGML. Because ODA never developed an explicit schema mechanism, Goldfarb had to develop a mechanism for dealing with ad hoc and implicit structures. The result was Architectural Forms. Goldfarb's SGML rendering of something that began as binary and invisible into visible markup was eventually folded back into the ODIF standard as an alternative serialization.

In the two serializations of ODIF, we had (at least in theory) the materials for a reversible transformation between a document whose only visible manifestation was something that appeared on a presentation system and one that was encoded in conventional, and readable, character markup. It was sufficiently interesting to Goldfarb that he played with the idea of developing a binary version of the whole SGML design, on the assumption that it would be more compact and therefore easier to transmit over a bandwidth-limited network. That came to an end when NIST calculated the relative sizes of binary- and SGML-encoded ODA documents and found the latter to be more compact.

Although the reversible transformation between visible and invisible markup was defined, at least in the definition of the ODA serialization, it never worked in practice. While we all know SGML and its heirs, which have multiple implementations, ODA was never completely implemented and today is largely forgotten. It had, on paper, a bewildering number of options from which profiles could be extracted, only a few of which had even trial laboratory implementations. Those of us who had to cope with its presence generally think of it as an expensive failure. Yet it influenced DSSSL, and thus XSL, through its page model. And it started the debate of how to represent overlapping structures that still intrigues participants in Balisage.

One of the things that killed the ODA project was visible markup. ODA was not intended to be seen, even in the SGML encoding. ODA was not really even intended to be created directly (though Philips did at one point attempt, unsuccessfully, to build an ODA editor as a laboratory project). ODA was originally intended to be used in invisible environments, for communication between systems. It was too hard for all but a few specialists to comprehend its rather abstract model and its difficult binary representation. ODIF could be generated only by machines, doing things like pointer arithmetic. SGML markup, in contrast, was expected to be created by end users. It turned out to be something we could, and did, create by hand, and we expected to see it, since the same markup served as both the document markup and the interchange format. Yves Marcoux and Martin Sévigny considered eye-readability to be the primary reason that SGML succeeded where ODA did not [Marcoux].

I trace the last gasps of ODA to the SC18 plenary in 1995. The convenors of the working groups were sitting together at the head table, and I was next to Steve Price, the convenor of WG3 and the chief public advocate for ODA. I happened to look at his laptop screen and saw he was taking notes in a text editor, in HTML. I leaned over and whispered to him, “I'm glad to see you've come over to our side.” “What do you mean?” he asked. “You're taking notes in SGML,” I replied. “No,” he shot back, “it's this new World Wide Web thing.” “Yes, I can see it's HTML, and that's an SGML application.” He was crushed. His group, which had big money behind it, had spent years trying to compete with ours, which had worked because of a passion for its project. All this time the ODA developers had never really grasped what we were doing. Meanwhile, we sold our concept quietly, planting it in places like CERN, where it spawned HTML, and the ODA team didn't realize they had been subverted. They tried to keep their project going for another couple of years, but it was futile.

I don't think that it was merely the technical superiority of SGML that led to its victory over ODA. The ODA developers had started with confidence that they had the next great thing. They were, after all, professional standards developers, backed by powerful organizations, and they were working on something that would fit into Open Systems Interconnection. The SGML developers knew little about standards development; we were just end users with a common interest. (As Sharon Adler remarked, “If we ever figure out how this standards process works, it will be time for us to retire.”) In the long run, it was probably to the advantage of the SGML developers that they were working on something that they wanted and needed themselves, rather than something that corporate bodies expected to impose on end users. The design of SGML is improvised: sometimes amateurish, sometimes obscure. The resulting application languages are nonetheless something that can be seen and used directly by humans. The visibility of SGML markup was part of what enabled Bill Tunnicliffe to sell it to the U.S. Department of Defense in 1983, and that led to our going public with the GENCODE standard later that year [GENCODE]. ODA, with its thousands of permutations of options, was much harder to grasp, and to implement. All its advocates could do was publish descriptive papers. You can write SGML in a simple text editor. You can't do that with ODA. So in the end, the leader of ODA development picked up on the utility of HTML and actually used it. Visible markup had won.^[1]

Digression on Word Processors and Seeing Coding

WYSIWYG is a seductive concept. The earliest stand-alone word-processing systems—expensive, yet limited, behemoths—promoted it. But by the time SGML and its offspring really gained traction, the stand-alone devices had been supplanted by programs running on general-purpose personal computers. And in the end, the multitude of early applications had largely fallen by the wayside while two major competitors fought to control the marketplace, Microsoft's Word and Corel's Word Perfect. Word was based on work at Xerox PARC, and as a consequence it was fundamentally object oriented. It understood units of text such as strings and paragraphs and applied properties to them, and it understood generalized structure and inheritance of both structure and properties. That meant it could easily support stylesheets with inheritable properties and things that depended on structure, like outlining. Word Perfect, in contrast, just serialized control functions in whatever order the user happened to insert them; there was no overall concept of structure. (I thought of it as one damn thing after another.) Stylesheets and outlining came only late to Word Perfect and were relatively weak, compared to those in Word.

Conceptually, Word was in closer sympathy with SGML, while Word Perfect followed the layout structure of ODA. (It is perhaps significant that Corel was one of the very few companies to attempt an ODIF export filter for their product.) Word beat Word Perfect to full WYSIWYG with Word for Windows (no surprise there), but my observation of hundreds of users of these two products showed an interesting phenomenon: serious Word Perfect users almost always ran the program in split-screen mode, with reveal codes at the bottom of the editing screen. Using reveal codes was important because the program enforced no discipline about how codes were entered; users could do things in random order, and just seeing the cursor in the WYSIWYG screen gave few hints about what was actually going on in the procedural coding. Word users didn't need this because the program managed the coding in a structured way, always told them what object they were in, and could also tell them what its properties were. So in a fully structured environment, it was not necessary to look at coding; but in an undisciplined one, visibility of coding was essential.

Early Invisibility in SGML

As proud as the hard-core SGML developers were of our ability to bang markup into a terminal, we were nonetheless practical—or lazy. Almost from the beginning we had markup minimization. In the early days, before we had syntax-directed editors designed for SGML, we took it on faith that the SGML Parser (whatever that turned out to be) would be intelligent enough to keep track of the current context and so save us the trouble of typing full tags. Goldfarb, of course, had to generalize that idea into the full scope of minimization options in the final standard (see below, Appendix A).

I can remember the first SGML editor I used, from Datalogics: it was basically a text editor, with an attached batch parser. I could type tags, attributes and all, and end tags; then I could check to see how many mistakes I'd made. Software Exoterica (later known by the name of its primary product, OmniMark) came out with Checkmark, based on a simple text editor for the Macintosh, but with a live parser. The ability to get validation while a document was being created was so useful that I, like a number of other people, kept an ancient Mac alive for years just to run Checkmark after Exoterica stopped updating it for later systems.

XML, hoping to simplify life for the parser writer, decided to drop minimization. Ironically, most of the problems with minimization had been solved by then, and furthermore we had real SGML editors like SoftQuad's Author/Editor and Arbortext, so the problem had ceased to be an issue. With the arrival of real SGML editors, users suddenly had the option of deciding how much SGML they wanted to see. They could see full source code, they could see schematic block tags, or they could see no tags at all. As I write this in <oXygen/>, I'm looking at a page very similar to what I saw more than twenty years ago in Author/Editor, and I'm switching between visible and hidden tags according to what tasks I'm performing at the moment. Even if I were still in Author/Editor, there would be no minimization in my output document.

As I've looked at some recent papers on Invisible XML, I've kept thinking, “We're back where I was about 1983.”

What was the state of SGML back then, and how does it lead to Invisible SGML, if not to Invisible XML?

By 1982 our image of what an SGML document would look like would be largely recognizable to an XML user today. A document would have tags with angle brackets, and the elements indicated by the tags would be in a hierarchy. Attributes would be specified in start tags. What we lacked then was a formal way to define the tags and hierarchy. In short, we needed a way to specify a schema, and developing such a specification was harder than forming a basic expectation of what SGML would look like. In 1982 we were already thinking about specifications for content models that were somehow related to regular expressions, but we did not yet have a settled syntax for them. When we did start to develop a syntax for declarations in 1983, one of our first drafts was actually a whitespace-delimited table inside a declaration (then called STRUC, for structure), with columns for element names and models. Multiple elements could be declared in a single table. We'd leave until later the problem of how to parse such a table and use the results.

Given this state of development, it was sometime in late 1982 that I inadvertently launched an idea that would result in Invisible SGML. I had to do a presentation about SGML, and I picked for my example a conventional memo, with From, To, Subject, and other such fields. Not yet having a real syntax for a schema, I wrote out a series of definitions borrowing from regular expressions that included string literals as components of content models. I don't have the original any longer, but it was something like

memo: to, from, subject, date, body
  to: "To: ", #PCDATA
from: "From: ", #PCDATA
etc.

Afterwards, I showed it to Goldfarb, who fired back that it was all wrong, that wasn't what he intended to do at all, that he wasn't using full regular expressions, and so there could be no literals in the models. Content models included only element names (plus reserved characters for grouping, sequencing, and occurrence indication).

But Goldfarb being Goldfarb, my error gave him a challenge. Rather than drop the idea of literal strings in the input as replacements for tags, he decided to implement it, and the 1983 version of the STRUC declaration did include some limited cases of literals in models for character strings. It also included the first cut at what became the DATATAG option in an SGML configuration [GENCODE]. At the cost of adding another delimiter role to separate them from element names, string literals came back into content models as separators between elements. When a declared literal pattern is encountered in the source, it ends one element, forcing the start of the next in the model, while at the same time being passed on as part of the source. With the final DATATAG syntax of 1986, the declaration

<!ELEMENT row - o ([cell, ", ", " "], cell)>

describes a two-column table row to be made from a row in a comma-separated list, one line per implied row, where the comma is followed by a space (", ") and then followed by optional padding spaces " ", then by the second cell.
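The effect of such a declaration can be approximated outside an SGML parser. What follows is a minimal Python sketch, not an SGML implementation: the function name rows_to_markup and the regular expression are illustrative assumptions, and the sketch assumes that the separator string is passed through as data of the containing row, which is one reading of how a data tag behaves.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical stand-in for the data tag pattern [cell, ", ", " "]:
# a comma and a space, followed by any number of padding spaces.
DATATAG = re.compile(r", +")


def rows_to_markup(text):
    """Turn each non-empty line into a two-cell row, one row per line."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue
        match = DATATAG.search(line)
        if match is None:
            raise ValueError(f"no cell separator in line: {line!r}")
        first = line[:match.start()]
        separator = match.group(0)   # ends the first cell AND stays as data
        second = line[match.end():]
        rows.append(
            f"<row><cell>{escape(first)}</cell>{separator}"
            f"<cell>{escape(second)}</cell></row>"
        )
    return "\n".join(rows)


if __name__ == "__main__":
    print(rows_to_markup("alpha, beta\ngamma,   delta"))
```

The point of the sketch is only that the matched string does double duty: it triggers the end of one element and the start of the next, yet it is not consumed.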

If strings (#PCDATA) can become markup, what about strings that change roles according to context? Goldfarb did not stop with simple alternatives to tagging: he went on to generalize the concept of recognizing strings in situations such as smart quotes. His solution, short references and short reference maps, cost two more markup declarations (SHORTREF and USEMAP) and considerable indirection. When a string that has been declared as a short reference is encountered, it is replaced by an entity, whose replacement text supplies an element name and determines whether that name is used in a start tag or an end tag. Furthermore, invoking an element (either by encountering it in text or by generating it from a short reference) can change the mapping from a short reference to an entity. Thus encountering a quotation mark in text could start an element and a new map; encountering another quotation mark under the new map could end the element and revert to the original map. (Handling nested quotes or cases like single quotes in English, which can have more than one role, requires complex patterns and mappings.)

<!USEMAP textmap p>                      
          <!-- In normal text, the "textmap" is active. -->
<!USEMAP quotemap quote>                 
          <!-- In a quotation, the "quotemap" is active -->

<!ENTITY quotetag "<quote>" >            
          <!-- The "quotetag" entity is the start tag for a quotation. -->
<!ENTITY endquotetag "</quote>" >        
          <!-- The "endquotetag" entity is the end tag for a quotation. -->

<!SHORTREF textmap '"' quotetag>         
          <!-- Within the "textmap" a double quote resolves to the "quotetag" entity. --> 
<!SHORTREF quotemap '"' endquotetag>     
          <!-- Within the "quotemap" a double quote resolves to the "endquotetag" entity. -->

DATATAG and SHORTREF are complementary techniques. DATATAG is a technique for markup minimization; SHORTREF is an alternative method for entering markup and potentially modifying its meaning. When DATATAG is enabled, a string that matches a pattern serves as both data and end tag; the characters of the string are passed through to the output at the same time that they cause a parsing event. The start tag that began the element is generally assumed to be minimized. A string that matches a SHORTREF pattern is just markup in Invisible SGML; it causes an event but is consumed in the process.
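The map switching in the quotation example can be simulated in a few lines. This is a minimal Python sketch under stated assumptions: the dictionaries SHORTREF and USEMAP and the function expand are hypothetical names that mirror the declarations above, only single-character short references are handled, and no content model is checked.

```python
# Hypothetical tables mirroring the SHORTREF and USEMAP declarations above.
SHORTREF = {
    "textmap": {'"': "<quote>"},     # in normal text, a quote opens an element
    "quotemap": {'"': "</quote>"},   # inside a quotation, a quote closes it
}
USEMAP = {"p": "textmap", "quote": "quotemap"}


def expand(text, root="p"):
    """Replace short references with tags, switching maps as elements open and close."""
    stack = [root]
    out = [f"<{root}>"]
    for ch in text:
        active_map = SHORTREF[USEMAP[stack[-1]]]
        tag = active_map.get(ch)
        if tag is None:
            out.append(ch)            # ordinary data passes through
        elif tag.startswith("</"):
            stack.pop()               # leaving the element restores the outer map
            out.append(tag)           # the short reference itself is consumed
        else:
            stack.append(tag[1:-1])   # entering the element activates its map
            out.append(tag)
    while stack:                      # close whatever is still open
        out.append(f"</{stack.pop()}>")
    return "".join(out)


if __name__ == "__main__":
    print(expand('He said "hello" twice.'))
```

Unlike the data tag case, the quotation mark never reaches the output: it causes an event and disappears.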

For all his ingenuity in creating these techniques, Goldfarb still didn't give me precisely what I was asking for: I wanted matching a pattern to create an implied start tag. In its first draft DATATAG supported both start and end tags, but the final version provides implied end tags, or rather it provides element separators that involve an implied end tag for one element and a start tag for the next. Perhaps SHORTREF could be stretched (Goldfarb seemed not to like long short references), rather than DATATAG, to get what I was looking for:

<!SHORTREF memomap "&#RS;To: " to
                   "&#RS;From: " from>
<!ENTITY to "<to>">
<!ENTITY from "<from>">
<!ELEMENT to   o o (%text;)>
<!ELEMENT from o o (%text;)>

So long as whatever %text; resolved to didn't include the string "To: " or "From: ", that might work. ("&#RS;" is a long-forgotten SGML predefined entity reference to the start of a data record; there was a corresponding "&#RE;" for the end of a record.)
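What these declarations are meant to achieve can be shown with a small simulation. This is a minimal Python sketch, not SGML processing: MEMOMAP and tag_memo are hypothetical names, each input line is treated as one record, and the omitted end tags are written out explicitly, as a normalizing parser might report them.

```python
from xml.sax.saxutils import escape

# Hypothetical map: a literal at the start of a record implies a start tag.
MEMOMAP = {"To: ": "to", "From: ": "from"}


def tag_memo(text):
    """Wrap each record that begins with a known literal in the implied element."""
    out = []
    for record in text.splitlines():
        for literal, name in MEMOMAP.items():
            if record.startswith(literal):
                content = record[len(literal):]   # the literal itself is consumed
                out.append(f"<{name}>{escape(content)}</{name}>")
                break
        else:
            out.append(escape(record))            # no short reference matched
    return "\n".join(out)


if __name__ == "__main__":
    print(tag_memo("To: Charles\nFrom: Jim"))
```

Because the literal is matched only at the start of a record, a "To: " appearing later in a line is left alone in this sketch; the caveat about the content of %text; still applies to the real declarations.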

As the SGML standard makes explicit (Appendix C.1.3), one intent of these techniques was to capture simple WYSIWYG data, as it was seen in the 1980s. In effect, we were trying to capture typewriter-like markup, expressed largely through whitespace and punctuation. This was about as much as the stand-alone word processors of the early 1980s were able to export. Given that the only output devices available to them, such as daisy-wheel printers, were little more than glorified typewriters, that's about as much as could be expected. The day of the stand-alone devices was ending because they were beginning to be supplanted by programs running on personal computers. As laser printers arrived, with new output capabilities, the programs grew in flexibility and also in complexity of coding. With the new word-processing programs, it was often possible to extract more coding data, though I saw little evidence of SGML users stretching these techniques to deal with extended coding. In the period when Word Perfect was the dominant program, writing SHORTREF structures would have offered even more challenges than dealing with multilingual quotes because there were so many codes and no programmatic discipline at all over the order in which they could be entered.

By the time I was building real SGML publishing systems, we had separate conversion tools and then OmniMark to do the work for us. But the work was still nontrivial.

The longest discussions of the DATATAG and SHORTREF techniques that I know, in Appendix C of the ISO standard (and Goldfarb's annotation of it in The SGML Handbook [Goldfarb-1990]) and Martin Bryan's book SGML: An Author's Guide [Bryan-88], concentrate on techniques such as turning vertical whitespace into new elements in a sequence, turning comma-separated (or TAB-separated) data into tables, and handling quotations and similar constructs. These discussions predate the rise of word-processing programs, so they did not deal with translation of formatting codes.

All the mechanisms necessary to enable these techniques were dropped from XML:

the SGML DECLARATION, necessary to enable minimization, DATATAG, and SHORTREF;
markup minimization as a concept;
the SHORTREF and USEMAP markup declarations;
markup roles declared in ENTITY declarations; and
predefined entities, especially the #RS and #RE, often used in short references for the concepts of record start and record end.

These techniques were not heavily used, and implementing them was probably too much for the desperate Perl hacker envisioned as the potential XML parser writer.

The absence of these features in XML has not prevented enthusiasts from trying to reinvent them. Simon St. Laurent had a habit of showing up at the Montréal conference that has since become Balisage and suggesting ways of resurrecting things lost in XML. In 2001 his target was using textual patterns as markup [StLaurent].

To Be Seen or Not To Be Seen

So do we want to see markup?

At first glance, the current interest in Invisible XML suggests that we don't want to see markup anymore [Pemberton-2013]. But I think that is not really the case. Invisibility is not the goal in this effort; markup is. As Steven Pemberton has said about his project, “Invisible XML is a technique for treating any parsable format as if it were XML, and thus allowing any parsable object to be injected into an XML pipeline” [Pemberton-2016]. In this sense, Invisible XML is like a continuation of Goldfarb's demonstration of how to generate SGML out of comma-delimited values, which can be traced back as far as the 1983 GENCODE standard.

I think that the greatest differences between Invisible XML technologies and SGML technologies are the underlying assumptions and the technologies available. In the 1980s we made few assumptions about the data, other than that we could find some patterns upon which to operate. The patterns might be complex, as in Goldfarb's incomplete attempt to mark up sentences and words (ISO 8879, Appendix C, p. 106) or Bryan's handling of multilingual quotation marks (Appendix A.3, pp. 274–286), but they were derived simply from direct examination of documents. Invisible XML, in contrast, treats documents from the beginning as though they were expressions of a parse tree, with the expectation that it must be possible to describe the data using a context-free grammar [Sperberg-McQueen-2019] and to write out that grammar to drive a processor. In the 1980s we had few tools available with which to ingest documents into SGML, so Goldfarb built requirements for the tools into the standard itself, hoping that some programmer would implement them. Since XML has omitted the basis on which Goldfarb improvised his tools, we must now depend on something outside the XML parser. Fortunately, we have other tools, many of them XML-aware, and so Sperberg-McQueen can propose Aparecium as a library for XSLT or XQuery. The emphasis in Invisible XML is, after all, not on “Invisible” but on “XML.” And this is still the goal we had in the 1980s: How do we get our data marked up so we can make further use of it? Invisible XML, requiring an external processor, is more complex and more capable than the original set of techniques, but the interest it has aroused suggests that we still need something to do that work. So Invisible XML is a way of making the invisible appear.

The techniques I have described that were built into SGML were originally a way of making markup disappear. Everything grew out of minimization, and that started as a way of saving effort for users in the days when all the coding would have to be typed in manually, not inserted by a syntax-directed editor. While this was a labor-saving technology, I suspect there was also an unconscious awareness that this new SGML notation for markup was much more verbose than what our team had been used to in Script, troff, and other systems. SGML, before the final version, was actually much more verbose than we think of it now. There were more delimiters and more delimiter roles: one reviewer accused the code of looking like chicken tracks! That these techniques turned into a way of simplifying the process of getting markup into documents that were being imported was an unintended consequence, though a fortunate one.

We put up with SGML because it was what we needed, what we had created, and we didn't have much other choice. It was successful in spite of what some saw as flaws. We sold it to the Department of Defense, the European Union, CERN, the American Association of Publishers, and dozens of other organizations. Major applications that we are still discussing at Balisage this year, such as DocBook and TEI, started out in SGML. Nonetheless, most of us were glad to see the arrival of applications like Author/Editor that disguised the chicken tracks and allowed us to forget about minimization. Most of the time what we cared about wasn't so much what the markup looked like but that we knew it was there and we could get at it as needed. As I write this, most of the time I have tags hidden. I sometimes turn them on when I need to know where my selection cursor really is. And on occasion I go into full code view because there are some things I just can't do any other way.

There is a difference between working with documents where there is no visible markup, yet which you can treat as though they are marked up, and working with documents where you make the markup that is present disappear because that helps your creative process. In either case, the goal is to have information identified. Whether I am importing data or creating it from scratch, what is important is that the markup is applied to the data. What was on my mind in 1998, whether I just said it at a conference or wrote it down, was not only that visible markup had helped the success of SGML over ODA but also that, having vanquished what we had thought was a mortal threat, we could relax and make SGML less overtly visible.^[1] Visibility, per se, is not a goal. I think that the core issue is connected to the idea of ownership of data. Putting your mark on the data (or rather in it) is an effective way of establishing that. The SGML/XML model of inline markup has thus been vastly more successful in that respect than the ODA approach of binary pointers.

Looking back over more than three decades of working with descriptive markup, I think the issue is not just seeing markup but making markup comprehensible to humans. If making markup visible is what it takes to do that, I'm all for visible markup.

Appendix A. Markup Minimization

With modern XML editors, markup minimization has ceased to be an issue. XML dropped the whole concept as being irrelevant in a time of syntax-directed editors, as well as being too difficult to implement in a parser.

But when SGML was under development, minimization was much desired—and debated—in our meetings. The final form of the ELEMENT declaration in the 1986 standard had two fields for minimization between the element name and its model, one for start tags and the other for end tags. Either, or both, could be declared omissible. The STRUC declaration in the 1983 GENCODE draft of SGML had several other kinds of minimization, and more than one kind of minimization could be specified in each of the two fields (pp. 40–46, 64–65).

`-`	Tag is required.
`O`	Tag can be omitted.
`C`	A containing element can end elements within it.
`E`	The current element can be ended by its container.
`N`	Null tag: the current element type is the same as the previous. There are many variants on this, but in general they meant typing only delimiters, without including the whole generic identifiers within them.
`D`	Data tags: literal strings could serve for either open or close tags.

We eventually realized this was excessively complex. When we created so many conditions, we didn't actually have an SGML parser with which to test minimization. As we gained experience in parser design, we realized, for example, that ending a container element naturally ended any contained elements on the stack. In the published standard, each field became binary: `-` (required) or `O` (omissible).
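The observation about the stack can be illustrated with a minimal Python sketch. It is not taken from any SGML parser: the function name `close_element` and the element names are hypothetical, and the sketch ignores whether the end tags being implied were actually declared omissible.

```python
def close_element(stack, name):
    """Close `name` and every element opened inside it.

    `stack` is a list of open element names, outermost first.
    Returns the elements closed, innermost first, and mutates `stack`.
    """
    if name not in stack:
        raise ValueError(f"no open element named {name!r}")
    closed = []
    while stack:
        top = stack.pop()
        closed.append(top)
        if top == name:
            break  # the requested container is now closed
    return closed


# Hypothetical open-element stack at the moment </list> is seen.
open_elements = ["section", "list", "item", "p"]
implied = close_element(open_elements, "list")
```

Here `implied` lists the contained elements that were ended along with the container, and `open_elements` is left holding only what was open outside it. No separate minimization code is needed to express that behavior.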

Planning minimization for an application required some skill: you had to think like a parser and maintain a mental stack of contexts. Consider a document type that required the title of a section to be followed by a paragraph and did not allow paragraphs to be nested:

<!ELEMENT section - - (title, p+) >
<!ELEMENT title   - - (#PCDATA)  >
<!ELEMENT p       O O (%text;) -(p) >

(For those who are not familiar with SGML DTDs, the -(p) is an SGML exclusion: even if %text; includes p in its content model, p cannot appear within another p.) The result might look like:

<section>
<title>A section title</title>
The first paragraph
<p>
A second paragraph
<p>
A third paragraph
</section>

Just such a model is what led Tim Berners-Lee to think that the <p> tag was just a separator, analogous to a newline in typewriter text and not a container for text! The mess that we recognize in HTML is a prime case of why markup should not be made invisible.
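To make the "think like a parser" exercise concrete, here is a rough Python sketch that restores the omitted tags in the minimized section shown above. It is not a general SGML parser: the rules are hard-coded for this one document type, and the names `TOKEN`, `normalize`, and `EXAMPLE` are assumptions made for illustration.

```python
import re

# A token is a start tag, an end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")

EXAMPLE = """<section>
<title>A section title</title>
The first paragraph
<p>
A second paragraph
<p>
A third paragraph
</section>"""


def normalize(source):
    """Return `source` with the omitted <p> and </p> tags made explicit."""
    stack = []  # open elements, outermost first
    out = []
    for match in TOKEN.finditer(source):
        closing, name, text = match.groups()
        if text is not None:
            if not text.strip():
                continue  # skip whitespace between tags
            if stack and stack[-1] == "section":
                # Data directly inside section: a p is required here,
                # so its omitted start tag is inferred.
                stack.append("p")
                out.append("<p>")
            out.append(text.strip())
        elif closing:
            # An end tag also ends any elements still open inside it.
            while stack and stack[-1] != name:
                out.append(f"</{stack.pop()}>")
            if not stack:
                raise ValueError(f"unexpected end tag </{name}>")
            stack.pop()
            out.append(f"</{name}>")
        else:
            if name == "p" and stack and stack[-1] == "p":
                # p cannot nest, so a new <p> ends the open one.
                stack.pop()
                out.append("</p>")
            stack.append(name)
            out.append(f"<{name}>")
    return "".join(out)


print(normalize(EXAMPLE))
```

The sketch keeps the same mental stack of contexts an author had to maintain by hand. Every inferred tag comes from one of three rules, each of which corresponds to a line of the DTD.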

References

AT&T Bell Laboratories (and later modifiers). groff_mm man page. https://www.mankier.com/7/groff_mm.

[Bryan-88] Bryan, Martin. SGML: An Author's Guide. New York: Addison-Wesley (1988).

[GENCODE] Graphic Communications Association. GCA Standard 101-1983, GENCODE and the Standard Generalized Markup Language.

[Goldfarb-1990] Goldfarb, Charles, and Yuri Rubinsky. The SGML Handbook. Oxford: Oxford University Press (1990).

[Horak-Kroenert-83] Horak, Wolfgang, and Guenther Kroenert (1983). "Techniques for Preparing and Interchanging Mixed Text-Image Documents at Multifunctional Workstations", Siemens Forschungs- und Entwicklungsberichte/Siemens Research and Development Reports. 12. 61-69. https://www.researchgate.net/publication/282210430_TECHNIQUES_FOR_PREPARING_AND_INTERCHANGING_MIXED_TEXT-IMAGE_DOCUMENTS_AT_MULTIFUNCTIONAL_WORKSTATIONS.

[ODA] International Organization for Standardization/International Electrotechnical Commission. ISO/IEC 8613-1:1994, Information technology—Open Document Architecture (ODA) and interchange format: Introduction and general principles, https://www.iso.org/standard/15928.html.

International Organization for Standardization/International Electrotechnical Commission. ISO/IEC 8613-2:1994, Information technology—Open Document Architecture (ODA) and interchange format: Open Document Interchange Format, https://www.iso.org/standard/23410.html.

[SGML] International Organization for Standardization/International Electrotechnical Commission. ISO/IEC 8879:1986, Information processing—Text and office systems—Standard Generalized Markup Language (SGML), https://www.iso.org/standard/16387.html.

[ASN1] International Telecommunication Union, Abstract Syntax Notation 1, ASN.1, X.680 series, https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One.

[SC34] International Organization for Standardization/International Electrotechnical Commission. ISO/IEC JTC1/SC34, Document description and processing languages, https://www.iso.org/committee/45374.html, https://en.wikipedia.org/wiki/ISO/IEC_JTC_1/SC_34.

[Marcoux] Marcoux, Yves, and Martin Sévigny. Why SGML? Why Now?. Journal of the American Society for Information Science 48, No. 7, July 1997, p. 584.

[Pemberton-2013] Pemberton, Steven. Invisible XML. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6–9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.

[Pemberton-2016] Pemberton, Steven. Data Just Wants to Be Format-Neutral. Presented at XML Prague, 2016, Prague, Czech Republic. Proceedings of XML Prague 2016, pp. 109–120. http://archive.xmlprague.cz/2016/files/xmlprague-2016-proceedings.pdf, https://homepages.cwi.nl/%7Esteven/Talks/2016/02-12-prague/data.html.

[StLaurent] St. Laurent, Simon. Regular fragmentations: Treating complex textual content as markup. Paper given at Extreme Markup Languages 2001, Montréal, sponsored by IDEAlliance. Abstract on the Web at http://conferences.idealliance.org/extreme/html/2001/StLaurent01/EML2001StLaurent01.html.

[Sperberg-McQueen-2019] Sperberg-McQueen, C. M. Aparecium: An XQuery / XSLT library for invisible XML. Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 – August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). doi:https://doi.org/10.4242/BalisageVol23.Sperberg-McQueen01.

^[1] I said something about visibility/invisibility in a session at SGML/XML Europe 1998 in Paris, where I responded to a query by François Chahuneau with a comment that it was perhaps time to streamline SGML and that we no longer needed to be attached to the specifics of what SGML looked like. I have been convinced for some years that I had published somewhere not long afterwards an opinion piece on how visibility/invisibility affected the SGML/ODA Wars and what that meant for the future of markup. Diligent searching by several people has failed to discover a published article, and my wife has declared it to be a Fig Newton of my imagination. So now I am committing to text what I should have said then.

James David Mason

James D. Mason, originally trained as a mediaevalist and linguist, is retired from being a writer, publishing systems developer, and manufacturing engineer at U.S. Department of Energy facilities in Oak Ridge, Tennessee. In 1981, he joined the ISO’s work on standards for document management and interchange. He chaired ISO/IEC JTC1/SC34, which was responsible for SGML, DSSSL, Topic Maps, and related standards, from 1985 until 2007. Dr. Mason has been a frequent writer and speaker on standards and their applications. For his work on SGML, Dr. Mason has received the Gutenberg Award from Printing Industries of America and the Tekkie Award from the Graphic Communications Association. He has also done research in horology and the history of pipe organs.

Balisage: The Markup Conference
