How to cite this paper

Walsh, Norman. “Customizing DocBook.” Presented at Symposium on Markup Vocabulary Customization, Washington, DC, July 29, 2019. In Proceedings of the Symposium on Markup Vocabulary Customization. Balisage Series on Markup Technologies, vol. 24 (2019). https://doi.org/10.4242/BalisageVol24.Walsh01.

Symposium on Markup Vocabulary Customization
July 29, 2019

Balisage Paper: Customizing DocBook

Norman Walsh

Norman Walsh is a Principal Engineer at MarkLogic Corporation where he helps to develop APIs and tools for advanced content applications. At OASIS, he was chair of the DocBook Technical Committee for many years and is the author of DocBook: The Definitive Guide. Norm has spent more than twenty years developing commercial and open source software including significant DSSSL, XSLT 1.0, and XSLT 2.0 stylesheets for DocBook.

Copyright ©2019 Norman Walsh

Abstract

DocBook is a general purpose XML vocabulary particularly well suited to books and papers about computer hardware and software (though it is by no means limited to these applications). DocBook has been under active maintenance for more than 20 years; it began life as an SGML document type definition. Because it is a large and robust schema, and because its main structures correspond to the general notion of what constitutes a “book,” DocBook has been adopted by a large and growing community of authors writing books of all kinds. After a brief introduction to DocBook, we will discuss the mechanisms built into DocBook for customization.

Table of Contents

History
Early development
Growth and popularity
Versioning and copyright
DTD structures
The modern day
Conversion to RELAX NG
DocBook is a RELAX NG grammar
What is RELAX NG: a brief tutorial
Pattern annotations
The DocBook RELAX NG grammar
Customizing DocBook
Removing elements or attributes
Adding elements or attributes
Make required elements optional
Make optional elements required
Make optional elements forbidden
Change the semantics of a component
Document the customizations
Plus Schematron
Appendix A. Building DocBook

History

DocBook has a long history. It began in the early 90s as an SGML DTD. At the time, there were several commercial Unix vendors and O’Reilly Media (then O’Reilly & Associates) had built a successful publishing business supplying technical books about the Unix ecosystem.

One aspect of the Unix system, the man page, contributed significantly to the documentation of Unix and its commands, tools, APIs, and subsystems. Anecdotally, the origin story for DocBook begins with some of the Unix vendors, in cooperation with O’Reilly and HaL Computer Systems, getting together to build an interchange vocabulary for man pages.

The man pages were not a differentiating factor for the vendors; they didn’t need or want them to be proprietary. All Unix systems shipped with largely the same man pages.

The idea was that each Unix vendor would have their own man pages (in troff documents with their own custom macros) but when they wished to exchange them, they would transform them into a common language and the recipient would transform them from the common language back into the local flavor of troff.

SGML was selected as the appropriate interchange format and DocBook development began. Much of the early design of DocBook was the result of extensive document analysis over the corpus of O’Reilly Unix-related books (including several collections of “man pages”, formatted in the house style, entire series of X-Windows books, and others). Markup was invented for each significant feature of the corpus.

From the very beginning, DocBook was a descriptive format, not a prescriptive one. If a structure could occur in a corpus of reasonable documents, it was (to the maximum extent possible) allowed.

Early development

Once there were several players, a maintenance organization was formed: The Davenport Group. Davenport met several times a year, stewarding the development of DocBook.

The Davenport Group established that the focus of DocBook would be computer hardware and software documentation: Unix man pages, software documentation, documentation about networking hardware, etc. That remains its focus today.

By the mid 90s DocBook had accumulated several years of managed but somewhat ad hoc growth. When Eve Maler joined the design team, she undertook to bring design principles to the structure of the DTD. Maler and Jeanne El Andaloussi developed the markup design philosophy, methodology, and techniques described in their book Developing SGML DTDs: From Text to Model to Markup during the development of DocBook 3.0.

The structure of DocBook still reflects that methodology, even though much has evolved.

Growth and popularity

By the late 90s and early 2000s, DocBook was very popular. It arrived on the SGML scene (relatively) early, and it had meaningful element names, good documentation, and a reasonably complete set of open source stylesheets for transforming it into HTML and print.

Ironically, the Unix vendors had all gone away by this time and interchange had ceased to be the dominant use case for DocBook. The most common use case became, and remains to this day, simply to use DocBook as an authoring format, often with very little customization.

Versioning and copyright

From the very earliest days, the Davenport Group made distribution and reuse of the DTD as easy as possible. It was freely available, it was free to use, and organizations were free to customize it.

The rules that apply to customizations of DocBook are as simple as possible: do anything you like, but if you change it, don’t call it DocBook.

Many, many (many!) organizations began with DocBook and built their own systems by extension and subsetting. Many more simply cannibalized DocBook for the markup structures that they found useful. For example, the elements and attributes for indexing, or the table model (later formalized as the XML Exchange Table Model Document Type Definition), were often excised from DocBook and used in other systems.

DTD structures

Anyone familiar with the Maler and El Andaloussi methodologies will recognize the structure of the DocBook DTD. It makes extensive use of parameter entities to make customization (i.e., changing the schema without editing the original source files) as easy as possible.

Elements are divided into classes, with extensibility:

<!ENTITY % local.admon.class "">
<!ENTITY % admon.class
                "Caution|Important|Note|Tip|Warning %local.admon.class;">

Classes are accumulated into mixtures, also with extensibility:

<!ENTITY % local.component.mix "">
<!ENTITY % component.mix
                "%list.class;           |%admon.class;
                |%linespecific.class;   |%synop.class;
                |%para.class;           |%informal.class;
                |%formal.class;         |%compound.class;
                |%genobj.class;         |%descobj.class;
                |%ndxterm.class;
                %local.component.mix;">

And the content models of individual elements are constructed by combining elements and mixtures:

<!ENTITY % admon.elements "INCLUDE">
<![ %admon.elements; [
<!ELEMENT (%admon.class;) - - (Title?, (%admon.mix;)+) %admon.exclusion;>
<!--end of admon.elements-->]]>

DocBook provided the additional feature that elements could be selectively excluded with a parameter entity (admon.elements in this case).

The modern day

Many members of The Davenport Group became central figures in the working groups at the W3C that led to the development of XML. In this period, development of DocBook languished. It was revitalized in 1998 when maintenance moved to OASIS as the first Technical Committee.

Conversion to RELAX NG

As the XML ecosystem evolved, namespaces became a significant feature of the landscape. The maintainers were eager that DocBook should continue to evolve and participate fully with the emerging standards (XLink, XInclude, etc.) that required namespaces.

DTDs are incapable of validating namespaced documents (in the general case), so moving to a new validation technology was necessary. Practically speaking, only two grammar-based choices presented themselves: W3C XML Schemas and RELAX NG. There was unanimity among the members of the Technical Committee that RELAX NG was a better fit for modeling the structure of prose documents.

Over the course of a couple of years, through a series of experimental releases and design reviews, a new set of RELAX NG based models was developed. These debuted in DocBook V5.0 in 2008.

A primary goal of the conversion was that as many valid documents as practical should remain valid. That is, a DTD-valid DocBook V4.5 document could be converted to a RELAX NG-valid DocBook V5.0 document simply by removing any SGML features used in the document and adding a namespace declaration (and normalizing mixed case, if necessary).

In converting the normative format from DTDs to RELAX NG, the OASIS Technical Committee decided that the DocBook schema should take advantage of RELAX NG features that would improve the constraints. While on its face this seems a very sensible approach, this decision casts a long shadow.

Unlike DTDs and W3C XML Schemas, RELAX NG allows ambiguity in content models. This is useful in a prose schema because it allows the vocabulary designer to more precisely model what users actually want to do.

In particular, DocBook’s descriptive rather than prescriptive nature leads to a lot of optionality. Consider a (highly!) simplified description of a DocBook book. This simplified book could be described as an optional table of contents, followed by optional chapters, followed by an optional table of contents. (In French publishing, tables of contents often come at the end.)

So the structure is:

toc?, chapter*, toc?

That’s perfectly reasonable and logical. No human being looking at that is troubled by it. And yet neither DTDs nor W3C XML Schema can express that content model because of its ambiguity: if you see a toc, you can’t determine (without looking ahead) if it’s the first one before chapters or the last one after absent chapters.
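
As a minimal sketch, the simplified book can be written directly in the RELAX NG compact syntax (introduced in the tutorial section later in this paper). The element content used here is a placeholder chosen for illustration, not the DocBook definition:

start = element book { toc?, chapter*, toc? }
toc = element toc { empty }
chapter = element chapter { text }

RELAX NG accepts this grammar as written. A book containing a single toc and no chapters is valid, and the validator never has to decide which of the two toc? positions was matched.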

One of the secondary goals of the transition to RELAX NG was that it should be possible to generate useful (though not normative) DTD and W3C XML Schema versions of the schema.

That turned out to be impractical.

DocBook is a RELAX NG grammar

DocBook is normatively defined by a RELAX NG grammar. The actual construction of the published schema from its sources is a somewhat complicated affair (Appendix A), but to the end user, DocBook is a monolithic RELAX NG grammar. The mechanisms that you have available to customize DocBook are precisely those afforded by RELAX NG.

What is RELAX NG: a brief tutorial

Very broadly speaking, RELAX NG is a language for performing pattern matching on trees roughly analogous to the way a regular expression is a language for performing pattern matching on strings.

A RELAX NG schema (or grammar) defines a set of patterns. A document is valid against that grammar if there exists a valid arrangement of those patterns that matches the document.

There are two syntaxes for RELAX NG, an XML syntax and a compact syntax. The two are entirely equivalent and it’s possible to translate losslessly between them. In the interest of space, and because many people find it more readable, this paper gives its examples in the compact syntax.

Let’s consider something smaller than DocBook to explore the way a RELAX NG grammar works.

Here are three patterns:

a = element a { empty }
b = element b { empty }
c = a|b

The first matches an empty a, the second an empty b, and the third anything that matches a or anything that matches b. It’s important to remember that validity is about pattern matching. Although it’s convenient to name patterns after elements, technically what matches c isn’t an a element or a b element, it’s an a pattern or a b pattern.

If RELAX NG were limited to matching empty elements without attributes, it wouldn’t be very useful! Let’s extend our example to add some attributes and content.

One way to do this is to extend an existing pattern with a new one. If you extend an existing pattern, you have to specify how your extension should fit into the current pattern: is it a new choice, or is it allowed to be interleaved anywhere in the existing pattern?

Here’s an example that extends the “a” pattern with a choice (signaled with “|=”):

a = element a { empty }

a |= element a {
                attribute priority { "high" | "highest" },
                empty
            }

b = element b { empty }
c = a|b

Now a matches either an a element with a “high” or “highest” priority attribute or an a element with no attributes.
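
The other option, interleaving, is signaled with “&=”. As a sketch (the a.attlist pattern name and the owner attribute are invented for this illustration), two definitions of an attribute list can be combined so that either attribute, both, or neither may appear, in any order:

a.attlist = attribute priority { "high" | "highest" }?
a.attlist &= attribute owner { text }?

a = element a { a.attlist, empty }

Only one of the definitions may omit the combining operator; every additional definition of the same name must use the same one.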

Writing an easily customized RELAX NG grammar is, in part, about making the patterns easily customizable. Making the a pattern an explicit choice between two element patterns isn’t the best approach. It would be easier to customize if we used different pattern names. This grammar is equivalent:

ordinary = element a {
               empty
           }

important = element a {
                attribute priority { "high" | "highest" },
                empty
            }

a = ordinary | important
b = element b { empty }

Now a customization layer has the freedom to adjust, in ways we’ll come to in a moment, the ordinary and important patterns independently.

As these patterns stand, we can match either a single a element or a single b element. Let’s add a wrapper to hold a collection of elements:

doc = element doc { (a|b)* }

This pattern matches an element named doc that contains any number, including none, of things that match the a pattern or things that match the b pattern in any order.

The content model rules are straightforward, if you find regular expressions straightforward, and will be familiar if you’ve written DTDs.

  • a matches exactly one a pattern.

  • a? matches an optional (exactly 0 or 1) a pattern.

  • a* matches zero or more a patterns.

  • a+ matches one or more a patterns.

  • (a,b), a sequence, matches an a pattern followed by a b pattern.

  • (a|b), a choice, matches an a pattern or a b pattern.

  • (a&b), an interleave, matches an a pattern and a b pattern, in any order.
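
These operators compose. As an illustration only (the title and note elements are invented for this sketch and are not part of the running example), the following pattern requires a title followed by one or more a or b elements in any mix, and allows note elements to be interleaved anywhere among those children:

title = element title { text }
note = element note { text }

doc = element doc {
    (title, (a | b)+) & note*
}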

Finally, RELAX NG requires that we enumerate the top level patterns that our document must match. This is not possible in DTDs and requires a certain amount of gymnastics in W3C XML Schema.

start = doc|a

Combining these patterns into a grammar, we get:

start = doc|a

doc = element doc { (a|b)* }

important = element a {
                attribute priority { "high" | "highest" },
                empty
            }

ordinary = element a {
               empty
           }

a = ordinary | important
b = element b { empty }

This grammar matches documents whose root element is either a doc element or an a element. A root a element may have no attributes at all or a priority attribute with the value “high” or “highest”; no other priority value is allowed.

With a schema this simple, it doesn’t seem impractical to stop here. Extending the important or ordinary patterns by redefining the entire element pattern wouldn’t be too burdensome.

That’s much less practical in a schema with hundreds of elements and attributes containing complex content models. Let’s rewrite this grammar in a way that more closely matches the overall pattern structure in the DocBook schema.

start = doc|a

doc.contentmodel = (a|b)*
doc.attlist = empty
doc = element doc {
          doc.attlist,
          doc.contentmodel
      }

high_priority = attribute priority { "high" | "highest" }
priority = high_priority

important.attlist = priority
important.contentmodel = empty

important = element a {
                important.attlist,
                important.contentmodel
            }

ordinary.attlist = empty
ordinary.contentmodel = empty

ordinary = element a {
               ordinary.attlist,
               ordinary.contentmodel
           }

a = ordinary | important

b.attlist = empty
b.contentmodel = empty

b = element b {
        b.attlist,
        b.contentmodel
}

This grammar validates exactly the same documents, but it’s much, much easier to customize as we shall see below.

RELAX NG allows you to create a grammar with reference to another, existing grammar. Suppose the schema above is accessible at the URI “base.rnc”. I can write a new grammar by reference:

# My custom schema

include "base.rnc" {
}

We can add any patterns we like outside of the curly braces that follow the filename, “base.rnc”. Within those curly braces, the patterns that we specify entirely replace patterns with the same names in the original, base schema. To augment a base pattern rather than replace it, combine a new definition with it (using “|=” or “&=”) outside the braces.

As it stands, this is an uninteresting grammar that matches exactly the same things as the base grammar; all I’ve introduced is a comment. But from here we can begin to look at customizations.

First, observe that our base grammar allows an explicit high priority attribute, but doesn’t allow an explicit low or medium priority attribute. We can easily add such an attribute. Second, let’s add the requirement that a high priority element must have an ID.

# My custom schema

ordinary_priority = attribute priority { "low" | "medium" }
id = attribute xml:id { xsd:ID }

include "base.rnc" {
    ordinary.attlist = ordinary_priority?
    important.attlist = priority & id
}

This grammar defines a new pattern to match a low or medium priority attribute and extends the definition of ordinary.attlist to include it. By making the pattern optional (“?”), we still allow ordinary elements without the new attribute.

The important.attlist is defined to interleave a required ID attribute. RELAX NG incorporates all of the W3C XML Schema data types, so we can define it to be an xsd:ID. Since attributes are unordered in XML, it’s natural to interleave them. (But putting a comma between them would have the same effect; you cannot make order matter even if you technically make the attributes a sequence in your grammar.)

Next, let’s imagine that we want to add an “emergency” priority. There are, in fact, several ways that we could do this.

We could redefine the high_priority pattern:

include "base.rnc" {
    high_priority = attribute priority { "high" | "highest" | "emergency" }
}

Or we could extend it:

emergency_priority = attribute priority { "emergency" }
include "base.rnc" {
    priority = high_priority | emergency_priority
}

At this point, you might wish that the base schema had defined a pattern for the list of values:

high_priorities = "high" | "highest"
high_priority = attribute priority { high_priorities }

Then our customization could simply be:

include "base.rnc" {
    high_priorities = "high" | "highest" | "emergency"
}

Schema designers have to strike a balance between complexity (make everything a pattern) and maintainability. Invariably, it will be the case that for some customizations, you’ll wish there had been another pattern in the base schema.

RELAX NG offers a special pattern called notAllowed that allows us to remove things in a customization layer. Suppose, for example, that we want to remove the notion of priority from this schema entirely:

include "base.rnc" {
  priority = notAllowed
}
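
To see why a single redefinition is enough, it may help to trace its effect through the base patterns. The following fragment is a hand-written sketch of what the customized grammar is equivalent to, not the output of any tool:

# priority can never match, so an attribute list that requires it can never match
important.attlist = notAllowed

# an element pattern whose attribute list cannot match cannot match either
important = notAllowed

# a choice with a notAllowed branch reduces to the remaining branch
a = ordinary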

Pattern annotations

It is possible to add annotations to patterns in RELAX NG. This can be used to elaborate the grammar. For example, while RELAX NG has no provision for default attribute values, there is a defined vocabulary of annotations for this purpose. We can use these annotations to make “medium” the default value of the priority attribute:

namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

[ a:defaultValue = "medium" ]
ordinary_priority = attribute priority { "low" | "medium" }

include "base.rnc" {
    ordinary.attlist = ordinary_priority?
}

Note that this default value will be added by a processor that understands and interprets the compatibility annotations in addition to performing RELAX NG validation. An “ordinary” validator will simply ignore them.

Another common use of annotations is for documentation. Also in aid of documentation, RELAX NG allows patterns to be grouped within div elements.
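
As a brief sketch of both features in the compact syntax: a comment that begins with “##” becomes a documentation annotation on the pattern that follows it, and div groups related definitions without changing what the grammar matches. The wording of the documentation strings is illustrative:

div {
    ## An a element with no priority attribute.
    ordinary = element a { empty }

    ## An a element that carries a high or highest priority.
    important = element a {
        attribute priority { "high" | "highest" },
        empty
    }
}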

The DocBook RELAX NG grammar

The DocBook RELAX NG grammar continues to reflect the structure established in the DTD. Many patterns are defined simply to establish mixtures that will later be used in content models:

db.nopara.blocks =
   db.list.blocks
 | db.wrapper.blocks
 | db.formal.blocks
 | db.informal.blocks
 | db.publishing.blocks
 | db.graphic.blocks
 | db.technical.blocks
 | db.verbatim.blocks
 | db.bridgehead
 | db.remark
 | db.revhistory

db.para.blocks =
   db.anchor
 | db.para
 | db.formalpara
 | db.simpara

db.all.blocks =
   db.nopara.blocks
 | db.para.blocks
 | db.extension.blocks

A consistent arrangement of patterns is used to define each element. Here, for example, is the set of patterns that define the para element:

[
   db:refname [ "para" ]
   db:refpurpose [ "A paragraph" ]
]
div {
   db.para.role.attribute = attribute role { text }

   db.para.attlist =
      db.para.role.attribute?
    & db.common.attributes
    & db.common.linking.attributes

   db.para.info = db._info.title.forbidden

   db.para =
      element para {
         db.para.attlist,
         db.para.info,
         (db.all.inlines | db.nopara.blocks)*
      }
}

It begins with a couple of annotations that are used for documentation purposes. What follows are typically:

  • A pattern that defines the role attribute. In the base schema, these definitions are all the same. A distinct pattern is provided in each case because one common form of customization is to make the role attribute have a delimited set of values.

  • A pattern that defines the attributes on the element. In the base schema this is often just a mixture of the role and common attributes.

  • For block elements, a pattern for the info element. The info element is a generic wrapper for block-level metadata, things like title, author, and copyright. (Those of you with long memories may recall BookInfo and ChapterInfo from DocBook’s SGML DTD days. Those all got replaced with a single info element that has varying content models in the RELAX NG grammar.)

    Out of the box, this comes in several flavors. For para, the info element that cannot contain a title is used.

  • Finally, there is the content model itself, most often a combination of elements and mixtures.
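
The separate role pattern makes the customization mentioned in the first item a one-line change. In this sketch the permitted values are hypothetical (substitute whatever roles your stylesheets actually recognize), and docbookxi.rnc is the XInclude-enabled flavor of the schema:

include "docbookxi.rnc" {
    db.para.role.attribute = attribute role { "summary" | "legal" | "aside" }
}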

Customizing DocBook

Customizing DocBook is, effectively, nothing more than applying the RELAX NG grammar features to the particular set of patterns that define the DocBook schema.

Removing elements or attributes

There may be nothing easier to do in a RELAX NG customization layer than remove things. Suppose, for example, you wanted to remove the revisionflag attribute. If you aren’t tracking changes in your DocBook sources, you don’t need it. At the element level, if your publishing system doesn’t support generated callouts, you don’t need the area element.

namespace db = "http://docbook.org/ns/docbook"
default namespace = "http://docbook.org/ns/docbook"

# ======================================================================

# docbookxi.rnc is the flavor of DocBook that has
# XInclude mixed-in in appropriate places.
include "docbookxi.rnc" {
  db.revisionflag.attribute = notAllowed

  db.area.units-enum.attribute = notAllowed
  db.area.units-other.attributes = notAllowed
}

Sometimes, as with the area element above, you have to make a few patterns notAllowed to do the job completely. In this case, there are two patterns, one with an enumerated list of values for the units attribute and one with a value of “other” for the units attribute and a required otherunits attribute.

Adding elements or attributes

To add something new, you have to do two things: create a pattern to match the new item and add that pattern to the appropriate mixtures. For example, here’s a customization layer that adds a new inline element, port. The author uses this customization in the documentation for XProc.

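include "docbookxi.rnc"

# Combine outside the include: a definition placed inside the braces
# would replace db.markup.inlines rather than extend it.
db.markup.inlines |= db.port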

# ======================================================================

db.port.role.attribute = attribute role { text }
db.port.attlist =
   db.port.role.attribute?
 & db.common.attributes
 & db.common.linking.attributes

db.port =
   element port {
      db.port.attlist, (db.programming.inlines | db._text)*
   }

There is no requirement that you name patterns following the conventions used in the DocBook schema, but doing so is likely to help you keep them organized in a similar way and will make them easier for other DocBook customizers to understand.
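
Adding an attribute follows the same two steps. The sketch below adds an optional attribute to para; the attribute name and its values are hypothetical. Because the new pattern is combined with the existing attribute list rather than replacing it, the “&=” definition sits outside the include:

include "docbookxi.rnc"

db.para.reviewstatus.attribute =
   attribute reviewstatus { "draft" | "approved" }

db.para.attlist &= db.para.reviewstatus.attribute?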

Make required elements optional

Changing content models is, necessarily, contextual. Making a title optional, for example, is just a matter of changing the pattern used for the relevant info element. This customization makes chapter titles optional:

include "/projects/docbook/docbook/relaxng/schemas/docbookxi.rnc" {
  # This pattern is db._info.title.req in the base schema
  db.chapter.info = db._info
}

The descriptive nature of DocBook has led to a schema without a lot of required elements. Books without chapters? Indexes without index terms? Check and check.

To pick an example where a less elegant customization is necessary, let’s consider ordered lists. In DocBook, they’re required to have at least one list item. Suppose we wanted to relax that requirement?

It happens that there isn’t a “.contentmodel” pattern for the content of orderedlist, so we’ll simply have to redefine the whole thing.

include "docbookxi.rnc" {
   db.orderedlist = element orderedlist {
      db.orderedlist.attlist,
      db.orderedlist.info,
      db.all.blocks*,
      db.listitem*
   }
}

A more practical customization here might be to remove the optional blocks from before the first list item, but that’s not an example of making required elements optional.
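
That more practical variant might look like the following sketch, which keeps the requirement for at least one list item and drops the leading blocks. It reuses only the pattern names that appear in the redefinition above:

include "docbookxi.rnc" {
   db.orderedlist = element orderedlist {
      db.orderedlist.attlist,
      db.orderedlist.info,
      db.listitem+
   }
}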

Make optional elements required

Making optional elements required is very much the same as making required elements optional. If you wish to require bibliography elements to have a title, change the db.bibliography.info pattern so that it matches an info element with a required title, db._info.title.req.
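
In the compact syntax, that is a single redefinition. This sketch assumes the pattern names given in the preceding paragraph:

include "docbookxi.rnc" {
   db.bibliography.info = db._info.title.req
}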

For a more interesting example, consider that some style guides frown on nested hierarchy elements without any intervening prose: a chapter that begins immediately with a top-level section or a top-level section that begins immediately with a second-level section.

If you examine the DocBook schema, you’ll find that the chapter content model is defined by the db.chapter.contentmodel pattern. That pattern is, in turn, defined as db.component.contentmodel, the common content model for all “components” (roughly, elements at the level of chapter).

The common content model for components is:

db.component.contentmodel =
  db.navigation.components*,
  db.toplevel.blocks.or.sections,
  db.navigation.components*

This allows navigational components (indexes, tables of contents, etc.) to appear at either the front or the back. Between them, “top level blocks or sections”:

db.toplevel.blocks.or.sections =
  (db.all.blocks+, db.toplevel.sections?) | db.toplevel.sections

DocBook has two independent section hierarchies: a numbered one (sect1, sect2, …) and a recursive one (section). That’s captured by the db.toplevel.sections pattern in a way that makes it easy to choose either one or both.

Anyway, this deep in the maze, we can see that forbidding immediately nested hierarchy elements for all components would require a simple change to this pattern:

include "docbookxi.rnc" {
  db.toplevel.blocks.or.sections =
    db.all.blocks+, db.toplevel.sections?
}

If, for some reason, this change were necessary only for chapters, more dramatic surgery would be required. How we approach it depends on whether or not we expect our customization layer to be further customized.

The most direct method would be simply to redefine the content model for chapters:

include "docbookxi.rnc" {
  db.chapter.contentmodel =
    db.navigation.components*,
    db.all.blocks+,
    db.toplevel.sections?,
    db.navigation.components*
}

This is sufficient, but we’ve “unpicked” the pattern structure significantly. We haven’t, for example, made it any easier to apply this change to appendix elements later, if we need to.

Exercise for the reader: consider how you might make it easier for a future customizer of your customization layer.
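
One possible answer, offered as a sketch rather than the only approach: name the stricter content model once, outside the include, and point each affected element at it. The “my.” pattern names are invented for this example:

my.blocks.then.sections =
  db.all.blocks+, db.toplevel.sections?

my.strict.contentmodel =
  db.navigation.components*,
  my.blocks.then.sections,
  db.navigation.components*

include "docbookxi.rnc" {
  db.chapter.contentmodel = my.strict.contentmodel
}

A later customization could then apply the same rule to appendixes by redefining db.appendix.contentmodel in the same way, assuming the base schema names that pattern by the same convention as db.chapter.contentmodel.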

Make optional elements forbidden

Structured editing tools that constrain authors to write valid documents are wonderful. But one of the disadvantages of a broad, standard schema is that editing tools will expose all of the flexibility that the standard allows to your authors.

One of the easiest ways to make authoring easier is to remove all of the things that you don’t want your authors to use. This is straightforward in RELAX NG.

The flexibility to produce French books with tables of contents at the back is wonderful. But if you don’t publish books in French, it’s just extra cognitive load for your authors.

There are giant swaths of DocBook that you will probably never use unless you write for a particular domain of hardware or software.

  • Do you write about programming language APIs? No? Then you don’t need all the synopsis elements.

  • Do you write about networking? No? Then you don’t need all the inlines about that.

  • Do your documents have bibliographies? Both the raw and cooked forms? Ditch the one(s) you don’t use.

  • Do you produce back-of-the-book indexes in markup? No? Then you don’t need indexentry and its descendants.

  • Do your documents have mathematics? No? Then flush the equation elements!

  • Do your documents have admonitions? Q&A sets? Screenshots? Video? Audio? Drop all the blocks you don’t need.

  • You don’t need msgset.

This’ll simplify your authoring environment:

include "docbookxi.rnc" {
  db.synopsis.blocks = notAllowed
  db.systemitem = notAllowed
  db.biblioentry = notAllowed
  db.indexdiv = notAllowed
  db.indexentry = notAllowed
  db.segmentedlist = notAllowed
  db.equation = notAllowed
  db.informalequation = notAllowed
  db.inlineequation = notAllowed
  db.math.inlines = notAllowed
  db.admonition.blocks = notAllowed
  db.videoobject = notAllowed
  db.audioobject = notAllowed
  db.screenshot = notAllowed
  db.qandadiv = notAllowed
  db.qandaentry = notAllowed
  db.qandaset = notAllowed
  db.msg = notAllowed
  db.msgexplan = notAllowed
  db.msgmain = notAllowed
  db.msgrel = notAllowed
  db.msgset = notAllowed
  db.msgsub = notAllowed
}

Change the semantics of a component

DocBook, perhaps because of its history as an interchange format, doesn’t attempt to bring a great deal of rigor to the semantics of its elements. The reference documentation provides a description of its intended semantics, as the DocBook designers understood it, but those descriptions are often intentionally vague. Saying that a productnumber is “a number assigned to a product” is not especially precise.

There are a number of elements right down on the leaves of the tree where a customization layer could impose stricter syntactic constraints that would limit the opportunities for misunderstanding. For example, pubdate could be restricted to an ISO 8601 date or date-time. Similarly, if your organization has product numbers that follow a predictable pattern, you could add a constraint to enforce that.
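
As a sketch of what such constraints might look like: the attribute-list pattern names below are assumed to follow the schema’s usual naming convention, and the product number format is a made-up example. The xsd:date and xsd:dateTime datatypes use lexical forms based on ISO 8601:

include "docbookxi.rnc" {
  db.pubdate = element pubdate {
    db.pubdate.attlist,
    (xsd:date | xsd:dateTime)
  }

  # Hypothetical format: two capital letters, a hyphen, four digits
  db.productnumber = element productnumber {
    db.productnumber.attlist,
    xsd:string { pattern = "[A-Z]{2}-[0-9]{4}" }
  }
}

Note the trade-off: once the content is a datatype, these elements can no longer contain inline markup.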

Document the customizations

The RELAX NG grammar allows documentation to be combined with the schema. Elements from other namespaces simply become annotations that the validator ignores. In this way, DocBook prose, for example, could be combined directly with the schema in a “literate programming” style.

Unfortunately, this is fairly cumbersome in practice, in part because the DocBook schema is authored in the compact syntax. The compact syntax, as mentioned earlier, can be losslessly converted to and from the XML syntax. However, the particular representation of arbitrary XML in the compact syntax is, in a word, awful.

For example, consider this simple fragment of documentation in the XML syntax:

<db:para>This is a <emphasis role="important">feature</emphasis>,
not a bug.</db:para>

In the compact syntax, it becomes this annotation:

db:para [
  "This is a "
  rng:emphasis [ role = "important" "feature" ]
  ",\x{a}" ~
  "not a bug."
]

That’s not…practical.

As a result, the DocBook schema limits the embedded documentation to the single-sentence summary of each pattern (its man page refpurpose) and the description of enumerated attribute values.

For example, here are the patterns for the revisionflag attribute and its enumerated values:

db.revisionflag.enumeration =
   ## The element has been changed.
   "changed"
 | ## The element is new (has been added to the document).
   "added"
 | ## The element has been deleted.
   "deleted"
 | ## Explicitly turns off revision markup for this element.
   "off"

db.revisionflag.attribute =
   [
      db:refpurpose [ "Identifies the revision status of the element" ]
   ]
   attribute revisionflag { db.revisionflag.enumeration }

The rest of the reference documentation is maintained separately, in DocBook, and combined with the schema annotations through a fairly complicated process of shaking and stirring.

Plus Schematron

One last observation. No set of grammatical constraints can conveniently capture all of the useful constraints of an authoring schema. DocBook uses Schematron rules, embedded in the RELAX NG grammar as annotations, to enforce a number of extra-grammatical constraints.

If you add structures that carry with them extra-grammatical constraints, you’d be wise to add Schematron rules for as many of them as practical.
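
For instance, suppose a customization adds a reviewer attribute to para and wants it to be present whenever the paragraph is marked as changed. A minimal sketch of how that might look follows. The reviewer attribute is hypothetical, and the sketch assumes that the base schema is available as docbook.rnc, that db.para.attlist is the pattern that holds the attributes of para, and that your Schematron processor accepts the ISO Schematron namespace used here.

```rnc
namespace db = "http://docbook.org/ns/docbook"
namespace s = "http://purl.oclc.org/dsdl/schematron"

include "docbook.rnc"

# Bind the db prefix for the XPath expressions in the rules below.
s:ns [ prefix = "db" uri = "http://docbook.org/ns/docbook" ]

# Hypothetical extension: an optional reviewer attribute on para.
# The grammar only makes the attribute available; the Schematron rule
# carries the co-occurrence constraint.
[
  s:pattern [
    s:title [ "Changed paragraphs name a reviewer" ]
    s:rule [
      context = "db:para[@revisionflag = 'changed']"
      s:assert [
        test = "@reviewer"
        "A para marked revisionflag='changed' must have a reviewer attribute."
      ]
    ]
  ]
]
db.para.attlist &= attribute reviewer { text }?
```

Keeping the rule next to the pattern it constrains means the two are less likely to drift apart as the customization evolves.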

Appendix A. Building DocBook

A reviewer commented that additional detail about the process by which a collection of source files becomes the DocBook RELAX NG grammar would be interesting. What follows is a summary of the process. If you want to see all the gory details, they’re publicly available in the DocBook repository at GitHub. The process begins down in the /relaxng/schemas/ directory.

The source files themselves are stored in a collection of logical modules (markup related to admonitions, sections, tables, technical content, general publishing information, etc.). The idea is that if you want to completely excise some module, you can simply construct your own driver file that omits it.
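
A driver file of that kind might look something like the following sketch. The module file names are hypothetical stand-ins, not the names used in the repository, and the sketch assumes that db.admonition.blocks is defined only in the omitted module.

```rnc
# Hypothetical driver file: the module file names below are illustrative.
include "core.rnc"
include "sections.rnc"
include "tables.rnc"
include "technical.rnc"
include "publishing.rnc"

# The admonitions module is simply not included:
# include "admonitions.rnc"

# If the remaining modules still refer to a pattern from the omitted
# module, that pattern must be defined somewhere. Defining it as
# notAllowed means it can never match. (Assumption: db.admonition.blocks
# is defined only in the omitted module.)
db.admonition.blocks = notAllowed
```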

The goal of the assembly process is to transform a set of schema files organized to be convenient for authoring and generating documentation into a small, efficient RELAX NG grammar that can be used for validation. The process follows this basic plan:

  1. The RNC files are handy for authoring, but not actually useful for processing. The first step is to use trang to convert them all to XML RNG files.

  2. The set of schema files is composed into a single XML document. There’s provision at this level for a few special cases through the use of some custom “control” markup in another namespace in the RELAX NG grammars.

  3. Any patterns that are entirely unreferenced are excluded from the composed schema. Several published customization layers are subsets; removing the unused patterns from the base schema makes the customization layers smaller and easier to understand.

  4. Schematron validation is used for a number of extra-grammatical constraints. In the context of DocBook, many of these can be derived from the patterns themselves. The control markup has features for expressing this. For example:

    ctrl:exclude [ from="db.footnote" exclude="db.formal.blocks" ]
    

    This single control structure is transformed into a set of Schematron rules that forbid any element in the “db.formal.blocks” pattern from appearing as a descendant of any element in the “db.footnote” pattern.

  5. Another cleanup pass is performed to remove redundant inherited attributes and sort out namespace issues (the namespace for the control vocabulary is no longer needed at this point, for example).

  6. All of the markup annotations related to documentation are removed from the generated schema.

  7. Copyright messages are moved into the right places and updated with the build information (date, version number, etc.).

  8. The build process also runs a subset of the DocBook test suite. (Builds are published from an online continuous integration server and the full suite runs longer than the allowed time for jobs.) Locally, developers can and should run the whole test suite, of course.

That process produces docbook.rng ready for validation. Running trang again produces docbook.rnc.
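
To make step 4 a little more concrete, here is a sketch of the kind of rule the ctrl:exclude example might expand to, written as a compact-syntax annotation. It assumes that db.example is one of the patterns in db.formal.blocks and that the s prefix is bound to the ISO Schematron namespace. The title and message wording are illustrative; the generated schema may phrase them differently.

```rnc
namespace s = "http://purl.oclc.org/dsdl/schematron"

# Fragment only: one assertion per member of db.formal.blocks would be
# generated; only db.example is shown here.
# The db prefix in the XPath expressions is assumed to be bound by an
# s:ns declaration elsewhere in the schema.
s:pattern [
  s:title [ "Element exclusion" ]
  s:rule [
    context = "db:footnote"
    s:assert [
      test = "not(.//db:example)"
      "example must not occur among the descendants of footnote"
    ]
  ]
]
```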

A slightly longer and more elaborate process can be used to turn the sources into a fully elaborated, 24MB XML document that can be used to drive the process of building the documentation sources.

The current state of the art is that this process is neither well documented nor especially portable. It would be possible to leverage these tools for local customizations, but it would not be easy.