Why Literate Programming?

First of all, I'd like to thank Stilo International[1] for giving me access to some of their internal documents. This paper wouldn't exist without their contribution.

Literate Programming, the integration of program code with its documentation, has been a feature of both the programming and the documentation fields for over thirty years. It got off to a good start with Knuth's WEB System but there hasn't been a lot of new work in the field over most of the intervening decades. However, now there seems to be a renewed interest in Literate Programming. Why? you might ask.

Integrating programming code and its documentation isn't just about having them in the same document, as was done in Knuth's WEB: it's about eliminating duplication of information coded in both programming and documentation forms. In contrast, and most commonly, programmers are continuing with the traditional model of completely separating code and its documentation. The difficulty with this approach is that it duplicates a lot of information: information coded in a programming language and also written up in its documentation, as text or in tables. The result is costly in three ways:

  • Duplication increases the amount of work required by the programmers and the documenters (who may be the same or different folk) to initially create and to later update the code and the documentation.

  • Organizing multiple copies of information can itself increase the cost of developing, maintaining and managing programming code: there's more to do that way.

  • Most importantly, duplication increases the chances of error: rewriting a text description into code can result in an error, as can describing code using text. Updating one can easily cause it to be out of step with the other, even when they were previously in step.

There have been a number of approaches taken to deal with these difficulties:

  • All production programming languages support integrating comments with code. Comments are most commonly used to help the reader understand the details of why coding is done in a particular way. Comments are also used to document how a program is to be used, but they don't make for good reading for the users, and force the user to read the code.

  • Many ways have been looked at for adding markup to program languages' comments, both XML and non-XML (often wiki-like) markup. This certainly improves things. It means that the user documentation can be extracted from the programming code and repurposed for the user: a big help.

    Marked-up comments still have a problem, however: a lot of information needs to be duplicated, so that there's a "human" version of the information and a "computer" version of it too. For example, the information in function/method headers needs to be available to and understood by both the user and the computer.

  • Taking things one step further, there have been a few approaches to adding markup to a programming language's code itself, so that it can be used within the documentation without duplication. This is where the future lies and what I'll be talking about here; it is something still in development.

There are a number of good papers at this conference already covering different aspects of this problem:

  • "Code Up: Marking up Programming Languages and the winding road to an XML Syntax"[2] describes and analyzes various approaches, from simple commenting to a program that's all XML.

  • "On XML Languages"[3] describes both XML and "compact" (non-XML) syntaxes for existing W3C scripting languages, discussing the advantages and disadvantages of each approach.

  • "Encoding Transparency: Literate Programming and Test Generation for Scientific Function Libraries"[4] describes an XML-based approach to duplicating what was achieved with Donald Knuth's Literate Programming tools (his WEB targeting TeX).

This paper adds to the discussion in two ways:

  • It presents an existing system, integrating programming code and its documentation in a practical way.

  • It discusses further issues that have to be dealt with when designing languages and building tools for such a system.

A Blast From The Past: A Case Study

Back in the days when SGML was still new (when XML hadn't shown up yet), and when the C programming language was still a practical language of choice for cross-platform tool development (when C was about the only language that ran uniformly on all major platforms, and when there was a much larger variety of machine architectures than there is in our now Intel-dominated world), I implemented one of the still-existing SGML parsers. Almost uniquely, I think, the SGML parser is itself an SGML document. (It helped a lot that it was the second SGML parser developed by the company, so that the first one could be used to initially process the second one. Once the second SGML parser was well developed, it took over and was used to help process itself.)

The following examples are taken from the code of the SGML parser used in Stilo International's OmniMark programming language. This code has been in use for over twenty years, so it serves as a good example of "real world" markup-based literate programming. The markup language used to mark up the SGML parser's code is quite complex, but you'll get most of its ideas from the following examples.

In practice, the following program-oriented markup elements are included in otherwise common paragraph-level markup.

Here's the header of the module processing SGML declarations (other than ENTITY declarations)[5]:

<!-- xkdecl.doc: 

   Copyright (C) Stilo International plc, 1991 - 2011
   All Rights Reserved


<revinfo>$Id: xkdecl.doc,v 1.83 2001/10/19 15:11:08 kernel Exp $
<module defined>decl <!-- Declarations;-->;
basic; mem; syn; lex; var; ent; mod; con; attr; edec; err; fsm1

It contains:

  • Importantly, copyright and distribution information.[5]

  • A chapter heading, both as a lead comment in the code and as a chapter start and title in the user's and programmer's documentation.

  • Revision information for the revision control system used at the time. (For a stable piece of software such as this, it doesn't get updated often, as you can see.)

  • Information about the system name, the used module name, what other modules are used, and what C include files are needed.

Data structures are documented rather than coded:

<struct external>document type definition
# The data structure which describes a document type definition (a "compiled"
  DTD) and which points to all the data structures for the objects declared
  for the document type definition. #
The following fields provide information about specific features of
a document type.
=  document element: 'element definition'*
   # Pointer to the definition for the document element. #
=  default general entity: 'entity definition'*
   # Pointer to the default general entity. #

Documentation of a structure as a whole as well as of each field is required. The markup used for the fields of a structure is exactly the same as for an ordered list of labeled textual items: no distinction is made between the markup for documentation and for code.
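To make the "documented rather than coded" idea concrete, here is a minimal sketch of how one set of field entries could feed both a C declaration and a field listing for the documentation. It is not the processing software described in this paper, whose internals aren't shown here: the function names, the name-flattening rule and the shape of the emitted C are all assumptions made for illustration.

```python
import re

# "=  field name: 'type name'*", with an optional trailing "*" for a pointer.
FIELD = re.compile(r"^=\s+(?P<name>[^:]+):\s*'(?P<type>[^']+)'(?P<pointer>\*?)\s*$")
# "# documentation text #" on a single line.
DOC = re.compile(r"^\s*#\s*(?P<doc>.*?)\s*#\s*$")


def c_name(name):
    """Hypothetical rule: lower-case, runs of other characters become '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_").lower()


def parse_fields(lines):
    """Collect (name, type, is_pointer, doc) for each field entry."""
    fields = []
    for line in lines:
        field = FIELD.match(line)
        doc = DOC.match(line)
        if field:
            fields.append([field["name"].strip(), field["type"],
                           bool(field["pointer"]), ""])
        elif doc and fields:
            fields[-1][3] = doc["doc"]  # documentation for the latest field
    return [tuple(entry) for entry in fields]


def to_c(struct_name, fields):
    """Emit a C struct declaration (assumed layout) from the parsed fields."""
    body = [f"    struct {c_name(t)} {'*' if p else ''}{c_name(n)};"
            for n, t, p, _ in fields]
    return "\n".join([f"struct {c_name(struct_name)} {{", *body, "};"])


def to_doc(fields):
    """Emit one 'name: description' line per field for the documentation."""
    return "\n".join(f"{n}: {d}" for n, _, _, d in fields)


source = [
    "=  document element: 'element definition'*",
    "   # Pointer to the definition for the document element. #",
    "=  default general entity: 'entity definition'*",
    "   # Pointer to the default general entity. #",
]
fields = parse_fields(source)
print(to_c("document type definition", fields))
print(to_doc(fields))
```

The point of the sketch is only that the field names, types and descriptions are written once; both outputs are derived from them.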

Structures, functions and other constructs have attributes specified with them that are meaningful either to the target code, the programmer's documentation, the user's documentation or two or more of these things.

Global names are marked up (surrounded by apostrophes in code and by "at" signs in text), and are chosen to be appropriate for documentation. The processing software replaces these names with the kinds of names required by the target language, together with appropriate prefixing. This approach makes the code more portable between systems.
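The name replacement can be pictured with a small sketch. The "xk_" prefix and the rule of turning spaces and dots into underscores are hypothetical; the actual processing software presumably handled qualified names such as 'parsing state.selected element' with more care than this does.

```python
import re


def to_c_identifier(doc_name, prefix="xk_"):
    """Turn a documentation-style name into a C-style identifier.

    Both the prefix and the flattening rule are assumptions for illustration.
    """
    body = re.sub(r"[^A-Za-z0-9]+", "_", doc_name.strip()).strip("_").lower()
    return prefix + body


def translate_marked_names(code_line, prefix="xk_"):
    """Rewrite every apostrophe-delimited name in one line of code."""
    return re.sub(r"'([^']+)'",
                  lambda match: to_c_identifier(match.group(1), prefix),
                  code_line)


line = "parsing_state->'parsing state.selected element' = 'element definition.null';"
print(translate_marked_names(line))
```

A real translator would also have to leave C character literals such as 'a' alone, which this sketch makes no attempt to do.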

Functions/methods have their interface information marked up as documentation:

<function external>initialize document type: 'boolean'
# Prepare for parsing a document instance. #
<inout setself>parsing_state: 'parsing state'*
<in>base_element: 'element definition'*
This procedure prepares ~parsing_state~ to parse a document instance using the
current document type definition in ~parsing_state~.  Three options are
available for selecting what is to be parsed, depending on the value of
~base_element~, as follows:
=  If ~base_element~ is the base document element of the document type (i.e.
the one named following the keyword DOCTYPE in the DTD), then the following
input text is parsed with that element as the document element (see
ISO 8879-1986, definition 4.99).
=  If ~base_element~ is any other element in the DTD, then that other element
is treated as if it were the document element for the purposes of parsing the
following input text.  This allows parts of documents to be parsed, such as a
single chapter.
=  If ~base_element~ is "null" (@element definition.null@), then the following
text may consist of any sequence of elements defined in the DTD.
   In the first two cases, @initialize document type@ sets the number of opened
elements to zero (0).
   <return/'initialize parsing state generally'
                 ('document type definition'*) 'document type definition.null',
                 ('document syntax'*) 'document syntax.null',
                 ('document syntax'*) 'document syntax.null',
                 ('parsing state setup result'*) 0)/

Function arguments are documented both by text and by the markup of the argument. Code in the body of a function is the one place where program code is used in preference to markup, for a number of reasons:

  • Dense code is generally easier to read in a more compact form.

  • The code is only used in two (potential) targets: the generated C code that is to be compiled, and the annotated code documentation.

That said, there are exceptions to using a non-SGML form for code:

  • Constructs that impact a function's interface, such as "return" (but not things like "if") are marked up.[6]

  • The big issue in choosing verbose markup or compact markup is in the trade-off of readability and utility. This trade-off can be subjective -- different people will come to different conclusions, depending largely on what markup and other notations they are familiar with.

  • References to names in the software's interface, either of interest to the user or of global interest to the software's developers, are marked up, so that they can be easily found if needed. One use of marked-up names is that an index of all uses of every name can be generated.
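As an illustration of why marked-up names are easy to find, the following sketch builds such an index from a few sample lines written in the style of this paper's examples. The function name and the sample input are assumptions; the delimiters follow the convention described earlier (apostrophes in code, "at" signs in text).

```python
import re
from collections import defaultdict

# Names in code are wrapped in apostrophes, names in text in "at" signs.
MARKED_NAME = re.compile(r"'([^'\n]+)'|@([^@\n]+)@")


def build_name_index(lines):
    """Map each marked-up name to the sorted line numbers that use it."""
    index = defaultdict(set)
    for number, line in enumerate(lines, start=1):
        for match in MARKED_NAME.finditer(line):
            name = match.group(1) or match.group(2)
            index[name].add(number)
    return {name: sorted(numbers) for name, numbers in sorted(index.items())}


sample = [
    "@parsing state.selected element@ is to be made the definition of the element.",
    "parsing_state->'parsing state.selected element' = 'element definition.null';",
    "'report error' (parsing_state, 'exception code.undefined element in end tag');",
]
for name, numbers in build_name_index(sample).items():
    print(f"{name}: {numbers}")
```

Ordinary apostrophes in running text (as in "it's") would confuse a pattern this simple, which is one reason a real markup parser is a better tool for the job than a regular expression.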

Using markup also means that things that are better coded as tables than as code, but which need to be run as code, can be included. This was done in the SGML parser, by coding the syntactic parsing logic as a finite state machine (FSM). For example, here's the logic for parsing an SGML end tag (in a somewhat abbreviated form):

   From Clause 7.5, End-tag:

<fsm>end-tag (TAG):
=  name {end-tag}: +generic identifier specification
=  tagc {back over lexeme; check end-tag shorttag}: +checked shorttag
=  * {impossible}
#  checked shorttag
   {empty end-tag}: +generic identifier specification
#  generic identifier specification (TAG):
=  tagc {end of end-tag: other prolog; end of tag}: content
=  stago no rhs, etago no rhs
   {back over lexeme; report missing end tag tagc missing;
    end of end-tag: other prolog; end of tag}: content
=  * {backup needed; 'unrecognized item'}: +unrecognized
#  unrecognized
   {end of end-tag: other prolog; end of tag}: content

Each entry has four parts: the thing or things being recognized, the lexical context in effect (i.e. what tokens are recognized, identified by a keyword such as "TAG"), the action to be taken when recognizing that thing (in curly braces), and what state in the state machine to go to next. In particular:

  • "#" introduces a sub-state and "+" prefixes a local reference to the next state. Next states with no "+" prefix are major states, like "end-tag".

  • Substates need not recognize anything, but just do something, like "checked shorttag".

  • Groups of common actions are coded as entity references ("&more;" and "&s;").

Note that the above example is very heavily marked up: all of ( ) { } + = # ' and ; are compact markup (a.k.a. SHORTREFs).
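For readers more used to seeing state machines as data, here is a minimal sketch of the same end-tag table driven by a small interpreter. It is an assumption-laden paraphrase, not the generated C code: action names are only recorded rather than executed, the ": other prolog" qualifier on "end of end-tag" is dropped, and "back over lexeme" is the only action given any behaviour (the token is left unconsumed).

```python
# Each state maps a recognized token to (actions, next state). "*" is the
# catch-all entry; a None key marks a sub-state that recognizes nothing and
# just performs its actions.
CLOSE = ["end of end-tag", "end of tag"]
MISSING_TAGC = (["back over lexeme", "report missing end tag tagc missing"]
                + CLOSE, "content")

END_TAG_FSM = {
    "end-tag": {
        "name": (["end-tag"], "generic identifier specification"),
        "tagc": (["back over lexeme", "check end-tag shorttag"],
                 "checked shorttag"),
        "*": (["impossible"], None),
    },
    "checked shorttag": {
        None: (["empty end-tag"], "generic identifier specification"),
    },
    "generic identifier specification": {
        "tagc": (CLOSE, "content"),
        "stago no rhs": MISSING_TAGC,
        "etago no rhs": MISSING_TAGC,
        "*": (["backup needed", "unrecognized item"], "unrecognized"),
    },
    "unrecognized": {
        None: (CLOSE, "content"),
    },
}


def run(fsm, start, tokens, final="content"):
    """Drive the table over a token list; return the action names in order."""
    trace, state, position = [], start, 0
    while state != final:
        table = fsm[state]
        if None in table:  # sub-state: consumes no input
            actions, next_state = table[None]
        else:
            if position >= len(tokens):
                raise ValueError(f"input ended in state {state!r}")
            token = tokens[position]
            actions, next_state = table.get(token, table["*"])
            if "back over lexeme" not in actions:
                position += 1  # the token is consumed
        trace.extend(actions)
        if next_state is None:
            raise ValueError(f"no transition out of state {state!r}")
        state = next_state
    return trace


print(run(END_TAG_FSM, "end-tag", ["name", "tagc"]))  # an end tag with a name
print(run(END_TAG_FSM, "end-tag", ["tagc"]))          # an empty end tag
```

The table is the part a reader would want to see in the documentation; the interpreter is the part a reader would rather not have to re-derive by hand for every state.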

Actions in the FSM are marked up specially:

<action value>end tag
# Process an end tag containing an element name and signal the
change of context to the application. #
The current lexical item is the name of the element.
@parsing state.selected element@ is to be made the definition of the element.
to the previous state after closing one element, or go to the alternate
state after having reported an error.
<local>element: 'element definition'*
<local>opened_element: 'opened element'*
   if (parsing_state->'parsing state.opened element count' > 0)
      opened_element = parsing_state->'parsing state.opened element stack';
      opened_element = 'opened element.null';
   if (!'look up element' (parsing_state,
                          parsing_state->'parsing state.opened entity stack'->
                                          'opened entity.item start',
                          (parsing_state->'parsing state.opened entity stack'->
                                          'opened entity.item end' -
                           parsing_state->'parsing state.opened entity stack'->
                                          'opened entity.item start'),
      parsing_state->'parsing state.selected element' =
                     'element definition.null';
      'report error' (parsing_state,
                      'exception code.undefined element in end tag');
   parsing_state->'parsing state.selected element' = element;
   'create opened element' (parsing_state);
   'initiate closing current element' (parsing_state, opened_element);

Actions can be compiled as functions, with calls to them included in the FSM code, or they can be marked as a "macro", and included in-line. The "value" attribute indicates that the action (potentially) returns a value to the invoking application. This illustrates the use of markup not just for documentation purposes, but to make the coding simpler.

The FSM markup language made it easy to create program code, and was easy to work with. It greatly shortened the time of creating a high-performance SGML parser.

Using SGML to help create an SGML parser had nothing, of course, to do with the fact that it was an SGML parser that was being developed using this technique. However, it did help to speed up development of the product in an otherwise inappropriate programming language: C. One could also argue that it took someone with expertise in implementing and using SGML to perform both tasks.

It's unclear whether this use of literate programming was a success or not:

  • Its use helped greatly in the project in which it was used.

  • It wasn't reused in later projects.

So an argument could be made both for success and for failure.

An Aside On Short References

The work described in this paper makes extensive use of short references and illustrates how they can be useful.

Another paper being presented at this conference[7] describes a simplified mechanism for introducing the advantages of Wiki Markup and SGML short references into XML. As that paper correctly points out, it's not easy to get SGML short references right. The difficulty is not so much compact markup itself -- it's in the mechanism for defining it, in the tool support for such markup, and in the quality of the documentation of such markup. (If anything, it's in the last of these that the use of SGML short references failed most notably.)

XML was designed and made different from SGML on the assumption that markup support tools, such as XML editors and XML exporting support in word processors, had or would develop to the point where users were no longer entering XML markup "by hand", but would use semi-automated tools for doing so. This is true for a large class of users. But there is also a large number of users entering XML tags using non-XML-specific editors: one major category of such users works in programming language environments, where those languages have syntaxes in addition to that of XML. To be effective, user-helpful tools need to support multiple syntaxes: not just that of the programming language or languages used, or just XML, but all of them.

One difficulty with using compact markup is that it's best used sparingly. That is, only a small number of compact markup forms should be used in any particular context. Successful Wiki Markup languages are a testament to this principle. Too many different compact forms result in confusion. The classical paper on the subject is Miller's The Magical Number Seven, which says that the limit on the number of usable forms (per context) is about 7 (plus or minus 2).

At this stage in the development of markup languages, it doesn't seem to be a particularly controversial statement to say that the best use of fully-tagged and compact markup is in some combination of the two -- with the balance chosen based on the needs of a particular application. One size does not fit all. For an example, consider the mixture of fully-tagged XML and compact XPath that appears in most XSLT programs.

There are a number of ways in which the advantages of compact markup can be realized in an XML context, including:

  • A general facility could be added to XML structure descriptors (DTD, schema, RELAX NG, etc.), maybe some updated form of short references as suggested in another paper here[7], so that markup language developers could develop their own compact markup.

  • A similar facility could be created as a separate process, complementary to existing XML structure descriptors and usable with any of them, that, for example, adds further element structure to a preexisting parsed XML tree based on discovered compact markup.

  • Some special-purpose compact markup could be supported as a separate process. This approach would be appropriate if there were a limited number of applications of compact markup -- only for literate programming applications, for example -- and no need for a general approach.
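The second option, a separate process that adds element structure to an already-parsed tree, can be sketched in a few lines. This is only an illustration under stated assumptions: the ~name~ convention is borrowed from the function documentation shown earlier, while the "var" element name and the function name are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

COMPACT_VAR = re.compile(r"~([^~]+)~")


def expand_compact_markup(element, tag="var"):
    """Rewrite ~name~ runs in character data as child elements, recursively."""
    children = list(element)
    # Character data in document order: the leading text, then each child's tail.
    runs = [element.text or ""] + [child.tail or "" for child in children]
    for child in children:
        expand_compact_markup(child, tag)
        element.remove(child)
        child.tail = None
    element.text = None
    previous = None  # node that the next piece of plain text attaches to

    def append_text(text):
        if not text:
            return
        if previous is None:
            element.text = (element.text or "") + text
        else:
            previous.tail = (previous.tail or "") + text

    for i, run in enumerate(runs):
        # With one capturing group, split gives [plain, name, plain, ..., plain].
        for j, piece in enumerate(COMPACT_VAR.split(run)):
            if j % 2 == 0:
                append_text(piece)
            else:
                previous = ET.SubElement(element, tag)
                previous.text = piece
        if i < len(children):
            element.append(children[i])  # put the original child back in place
            previous = children[i]


para = ET.fromstring(
    "<p>This prepares ~parsing_state~ using <b>the</b> value of ~base_element~.</p>"
)
expand_compact_markup(para)
print(ET.tostring(para, encoding="unicode"))
```

Because it works on the parsed tree, a process like this is independent of whether the document was validated against a DTD, a schema, or nothing at all.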

The Literate Programming work described in this paper wouldn't have really been possible without the use of some form of compact markup to complement the primary markup (SGML or XML). The level of detail would make full XML markup, for example, difficult to read, especially for programmers, whose primary interest is the programming code.

Another Aside, On The Kinds Of Documentation

The SGML/C project described above supported four kinds of documentation that could be targeted by marked-up code and documentation:

  • User documentation: information for the end user of a software system.

  • Design documentation: information for helping maintain a software system, outlining the structure of the software and how it works.

  • Fully annotated code: for use by those actually working with the code, detailing what is actually done, and how and why.

    These three categories of documentation are incremental: generally speaking, design documentation includes everything the user is told, and annotated code includes all the user and design information.

  • Comments: There is some documentation that falls outside of any of the above categories: comments detailing the how and why of specific code snippets (rather than the more general techniques that apply to whole methods or other segments of code). These comments are inseparable from the code they annotate, and seem to be best entered as language-specific comments rather than as marked-up documentation. Unlike the above categories of documentation, these kinds of comments need no special handling.

And of course, there's the code itself: what the programming language's compiler needs to be given. In practice there can be more than one kind of code:

  • The "production" code, that appears in the final product. There can be multiple products, or multiple versions of a product, originating in one set of code.

  • In addition, code can exist as part of the software development process, with lots of extra checks and reports.

Markup can effectively distinguish between different versions and kinds of code.

So there are at least four kinds of things created from marked-up code: user, design and annotated code documentation, and the compiler's code.
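The incremental relationship between these targets can be sketched as a simple selection over tagged blocks. The block kinds, the data class and the selection rule below are assumptions made for illustration; they are not the attribute scheme of the SGML/C project.

```python
from dataclasses import dataclass

# Documentation levels are incremental: each one includes those before it.
LEVELS = ["user", "design", "annotated"]


@dataclass
class Block:
    kind: str  # "user", "design", "annotated" or "code"
    text: str


def select(blocks, target):
    """Return the text of the blocks that belong in the given target."""
    if target == "code":
        return [block.text for block in blocks if block.kind == "code"]
    wanted = set(LEVELS[: LEVELS.index(target) + 1])
    if target == "annotated":
        wanted.add("code")  # annotated code shows the code with its documentation
    return [block.text for block in blocks if block.kind in wanted]


blocks = [
    Block("user", "Prepare for parsing a document instance."),
    Block("design", "Delegates to the general parsing-state initializer."),
    Block("annotated", "The opened element count is reset before parsing."),
    Block("code", "return initialize_generally(parsing_state);"),
]
for target in ["user", "design", "annotated", "code"]:
    print(target, len(select(blocks, target)))
```

Each target is a view of the same source, so a correction made once shows up everywhere it applies.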

A Literate Programming Markup Language As A New Language

Adding comments to program code doesn't change the programming language used in any way. It remains the same programming language plus comments. But once major programming language constructs, such as data structure declarations and function headings, are replaced by documentation-friendly markup, we find ourselves looking at a different programming language.

At what point changing the syntax of a programming language makes it a different language depends largely on one's point of view. From the point of view of the programming language designer, syntax is a minor issue: functionality is their focus. From the point of view of the language user, syntax is just about everything: it's important how to code an "if" statement, even though its semantics are more-or-less the same in every programming language. As a consequence, any useful definition of what constitutes a programming language, and the extent to which two are the same, has got to take syntax into account.

A major impediment to acceptance of a literate programming language is the fact that it is a different language. It's not the programming language that a programmer knows, and switching over is not a small job. And I'm afraid to say that I've found computer programmers in general very conservative in what languages they are willing to work with: they generally stick with what they know. A major selling job is needed to convince programmers to switch.

Being a different language from the one programmers were used to seems to be a large part of the reason that the SGML/C-based programming language described in this paper failed. It may well have been for other reasons: lack of promotion of the language, or a well-established base of other software that management and the programmers didn't want to change. These things have to be taken into consideration when developing a new language, to ensure its better acceptance.

Conclusions And Observations

Literate Programming is something that clearly needs more work:

  • More use of Literate Programming needs to be undertaken so that useful ideas can be developed. If nobody does it, it's not going to happen.

  • Markup conventions for Literate Programming need to be developed, either with respect to a particular programming language, or which apply to a variety of programming languages. There is not going to be general acceptance of Literate Programming if every language or, worse yet, every system has its own set of conventions.

    As noted earlier, the trade-offs between full and compact markup are somewhat subjective. As a consequence, these conventions will need to be arbitrary. And that has to be accepted.

  • Literate Programming tools need to be integrated into software development systems. At present, Literate Programming is usually implemented as a preprocessor. But this doesn't fit well with most visual software development systems, or with the expectations of most programmers.

  • The use of compact markup in XML documents needs to be researched further. Whether XML itself needs to be extended to support compact markup, whether that can best be done outside of XML, or whether it's unwise to try either needs to be reexamined.

Markup-based Literate Programming gives us the opportunity to bring the advantages of markup in general, and XML in particular, to a wider community. More than any new programming language feature -- which language designers are always on the lookout for -- better and more reliable documentation could make a difference to how computer programmers work. But it's not a small task: it's as big as developing a whole new programming language.


[C programming language] Home page of ISO/IEC JTC1/SC22/WG14 - C http://www.open-std.org/jtc1/sc22/wg14

[Knuth's WEB System] Donald E. Knuth, Literate Programming http://www.literateprogramming.com/knuthweb.pdf

[Literate Programming] Literate Programming Web Site http://www.literateprogramming.com

[The Magical Number Seven] George A. Miller, Harvard University, "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" http://www.psych.utoronto.ca/users/peterson/psy430s2001/Miller GA Magical Seven Psych Review 1955.pdf

[OmniMark] OmniMark Developer Resources http://developers.omnimark.com

[SGML] Standard Generalized Markup Language (SGML) International Organization for Standardization ISO 8879:1986

[TeX] TeX Users Group http://www.tug.org

[Wiki Markup] Wikipedia Wiki Markup Help page http://en.wikipedia.org/wiki/Help:Wiki_markup

[XML] Extensible Markup Language (XML) 1.1 (Second Edition) http://www.w3.org/TR/xml11

[XPath] XML Path Language (XPath) 3.0 http://www.w3.org/TR/xpath-30

[XSLT] XSL Transformations (XSLT) http://www.w3.org/TR/xslt

[1] Stilo International http://stilo.com

[2] David Lee, MarkLogic, "Code Up: Marking up Programming Languages and the winding road to an XML Syntax", to be presented at Balisage 2012, Wednesday 2:00pm.

[3] Norman Walsh, MarkLogic, "On XML Languages", to be presented at Balisage 2012, Wednesday 2:45pm.

[4] Mark Flood, Matthew McCormick and Nathan Palmer, Office of Financial Research, Department of the Treasury, "Encoding Transparency: Literate Programming and Test Generation for Scientific Function Libraries", to be presented at Balisage 2012, Wednesday 4:00pm.

[5] Please note that the copyright information isn't just an example. This code is copyright and extracted from the original code of the product (with some abbreviations to make it easier to present). Stilo International has kindly allowed me to use snippets of it as examples in a public forum.

[6] I forget exactly why returns were marked up. Oh well.

[7] Mario Blazevic, Stilo International, "Extending XML with SHORTREFs specified in RELAX NG", to be presented at Balisage 2012, Wednesday 4:45pm.

Author's keywords for this paper:
Literate Programming; Markup Language Implementation; SGML; Short References; Wiki Markup; XML

Sam Wilmott

Sam Wilmott started using markup languages in the late '60s. Since then he has led the development of typesetting/text-formatting systems for the Canadian Government Printing Office and for a major real-estate company, implemented one of the first SGML parsers (which was also the first pull-model markup parser), and is the originator of the OmniMark programming language, with its strong support of SGML, XML, and text transformation.

More recently Sam has been working in the XSLT world: he has recently contributed to the implementation of an XSLT compiler and currently works as an XSLT programmer and analyst. As a side project, he is working on new programming language ideas for markup language processing.