Customizing DocBook
Symposium on Markup Vocabulary Customization
July 29, 2019
DocBook is a general purpose XML vocabulary particularly well suited to books and papers about computer hardware and software (though it is by no means limited to these applications). DocBook has been under active maintenance for more than 20 years; it began life as an SGML document type definition. Because it is a large and robust schema, and because its main structures correspond to the general notion of what constitutes a “book,” DocBook has been adopted by a large and growing community of authors writing books of all kinds. After a brief introduction to DocBook, we will discuss the mechanisms built into DocBook for customization.
Norman Walsh
Norman Walsh is a Principal Engineer at MarkLogic
Corporation where he helps to develop APIs and tools for advanced
content applications. At OASIS, he was chair of the DocBook
Technical Committee for many years and is the author of
DocBook: The Definitive Guide.
Norm has spent more than twenty years developing
commercial and open source software including significant
DSSSL, XSLT 1.0, and XSLT 2.0 stylesheets for DocBook.
Copyright ©2019 Norman Walsh
History
DocBook has a long history. It began in the early 90s as an SGML
DTD. At the time, there were several commercial Unix vendors and
O’Reilly Media (then O’Reilly & Associates) had built a successful
publishing business supplying technical books about the Unix
ecosystem.
One aspect of the Unix system, the man page,
contributed significantly to the documentation of Unix and its
commands, tools, APIs, and subsystems. Anecdotally, the origin story
for DocBook begins with some of the Unix vendors, in cooperation with
O’Reilly and HaL Computer Systems, getting together to build an
interchange vocabulary for man pages.
The man pages were not a differentiating factor for the vendors; they
didn’t need or want them to be proprietary. All Unix systems shipped
with largely the same man pages.
The idea was that each Unix vendor would have their own man pages (in
troff documents with their own custom macros) but when they wished to
exchange them, they would transform them into a common language and the
recipient would transform them from the common language back into the
local flavor of troff.
SGML was selected as the appropriate interchange format and
DocBook development began. Much of the early design of DocBook was the
result of extensive document analysis over the corpus of O’Reilly
Unix-related books (including several collections of “man pages”, formatted in
the house style, entire series of X-Windows books, and others).
Markup was invented for each significant feature of the corpus.
From the very beginning, DocBook was a descriptive format, not a
prescriptive one. If a structure could occur in a corpus of reasonable
documents, it was (to the maximum extent possible) allowed.
Early development
Once there were several players, a maintenance organization was
formed: The Davenport Group. Davenport met several times a year,
stewarding the development of DocBook.
The Davenport Group established that the focus of DocBook would be
computer hardware and software documentation: Unix man pages, software
documentation, documentation about networking hardware, etc. That
remains its focus today.
By the mid 90s DocBook had accumulated several years of managed but
somewhat ad hoc growth. When Eve Maler joined the design team, she
undertook to bring design principles to the structure of the DTD.
Maler and Jeanne El Andaloussi developed the markup design philosophy,
methodology, and techniques described in their book Developing SGML
DTDs: From Text to Model to Markup during the development of DocBook
3.0.
The structure of DocBook still reflects that methodology, even though
much has evolved.
Growth and popularity
By the late 90s and early 2000s, DocBook was very popular. It
arrived on the SGML scene (relatively) early, it had meaningful
element names, good documentation, and a reasonably complete set of
open source stylesheets for transforming it into HTML and print.
Ironically, the Unix vendors had all gone away by this time and
interchange had ceased to be the dominant use case for DocBook. The
most common use case became, and remains to this day, simply to use
DocBook as an authoring format, often with very little customization.
Versioning and copyright
From the very earliest days, the Davenport Group made distribution and
reuse of the DTD as easy as possible. It was freely available, it was
free to use, and organizations were free to customize it.
The rules that apply to customizations of DocBook are as simple as
possible: do anything you like, but if you change it, don’t call it
DocBook.
Many, many (many!) organizations began with DocBook and built
their own systems by extension and subsetting. Many more simply
cannibalized DocBook for the markup structures that they found
useful. For example, the elements and attributes for indexing, or the
table model (later formalized as the XML Exchange
Table Model Document Type Definition), were often excised out
of DocBook and used in other systems.
DTD structures
Anyone familiar with the Maler and El Andaloussi methodologies will
recognize the structure of the DocBook DTD. It makes extensive use of
parameter entities to make customization (i.e., changing the schema
without editing the original source files) as easy as possible.
Elements are divided into classes, with extensibility:
<!ENTITY % local.admon.class "">
<!ENTITY % admon.class
"Caution|Important|Note|Tip|Warning %local.admon.class;">
Classes are accumulated into mixtures, also with extensibility:
<!ENTITY % local.component.mix "">
<!ENTITY % component.mix
"%list.class; |%admon.class;
|%linespecific.class; |%synop.class;
|%para.class; |%informal.class;
|%formal.class; |%compound.class;
|%genobj.class; |%descobj.class;
|%ndxterm.class;
%local.component.mix;">
And the content models of individual elements are constructed
by combining elements and mixtures:
<!ENTITY % admon.elements "INCLUDE">
<![ %admon.elements; [
<!ELEMENT (%admon.class;) - - (Title?, (%admon.mix;)+) %admon.exclusion;>
<!--end of admon.elements-->]]>
DocBook provided the additional feature that elements could be
selectively excluded with a parameter entity (admon.elements in this
case).
The modern day
Many members of The Davenport Group became central figures in the
working groups at the W3C that led to the development of XML. In this
period, development of DocBook languished. It was revitalized in 1998
when maintenance moved to OASIS as the first Technical Committee.
Conversion to RELAX NG
As the XML ecosystem evolved, namespaces became a significant feature
of the landscape. The maintainers were eager that DocBook should
continue to evolve and participate fully with the emerging standards
(XLink, XInclude, etc.) that required namespaces.
DTDs are incapable of validating namespaced documents (in the general
case), so moving to a new validation technology was necessary.
Practically speaking, only two grammar-based choices presented
themselves: W3C XML Schemas and RELAX NG. There was unanimity among
the members of the Technical Committee that RELAX NG was a better fit
for modeling the structure of prose documents.
Over the course of a couple of years, through a series of experimental
releases and design reviews, a new set of RELAX NG based models was
developed. These debuted in DocBook V5.0 in 2008.
A primary goal of the conversion was that as many valid documents as
practical should remain valid. That is, a DTD-valid DocBook V4.5
document could be converted to a RELAX NG-valid DocBook V5.0 document
simply by removing any SGML features used in the document and adding a
namespace declaration (and normalizing mixed case, if necessary).
In converting the normative format from DTDs to RELAX NG, the
OASIS Technical Committee decided that the DocBook schema should take
advantage of RELAX NG features that would improve the constraints.
While on its face this seems a very sensible approach, this decision
casts a long shadow.
Unlike DTDs and W3C XML Schemas, RELAX NG allows ambiguity in content
models. This is useful in a prose schema because it allows the
vocabulary designer to more precisely model what users actually want
to do.
In particular, DocBook’s descriptive rather than prescriptive nature
leads to a lot of optionality. Consider a (highly!) simplified
description of a DocBook book. This simplified book could be described
as an optional table of contents, followed by optional chapters,
followed by an optional table of contents. (In French publishing,
tables of contents often come at the end.)
So the structure is:
toc?, chapter*, toc?
That’s perfectly reasonable and logical. No human being looking at
that is troubled by it. And yet neither DTDs nor W3C XML Schema
can express that content model because of its ambiguity: if you see
a toc, you can’t determine (without looking ahead) if it’s the first
one before chapters or the last one after absent chapters.
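For comparison, the simplified book can be written down directly as a RELAX NG grammar. This is only a sketch: the element names book, toc, and chapter and their trivial content are illustrative and are not the real DocBook patterns.

```rnc
# A deliberately tiny "book": an optional toc at the front,
# any number of chapters, and an optional toc at the back.
start = book

book = element book { toc?, chapter*, toc? }

# Illustrative content only.
toc = element toc { empty }
chapter = element chapter { text }
```

A book containing a single toc and no chapters is valid against this grammar; a RELAX NG validator does not need to decide which of the two toc positions was matched.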
One of the secondary goals of the transition to RELAX NG was that it
should be possible to generate useful (though not normative) DTD and
W3C XML Schema versions of the schema.
That turned out to be impractical.
DocBook is a RELAX NG grammar
DocBook is normatively defined by a RELAX NG grammar. The actual
construction of the published schema from its sources is a somewhat
complicated affair (see “Building DocBook”, below), but to the
end user, DocBook is a monolithic RELAX NG grammar. The mechanisms
that you have available to customize DocBook are precisely those
afforded by RELAX NG.
What is RELAX NG: a brief tutorial
Very broadly speaking, RELAX NG is a language for performing
pattern matching on trees roughly analogous to the way a regular
expression is a language for performing pattern matching on
strings.
A RELAX NG schema (or grammar) defines a set of patterns. A document
is valid against that grammar if there exists a valid arrangement of
those patterns that matches the document.
There are two syntaxes for RELAX NG, an XML syntax and a compact
syntax. The two are entirely equivalent and it’s possible to translate
losslessly between them. In the interest of space, and because many
people find it more readable, this paper gives its examples in the
compact syntax.
Let’s consider something smaller than DocBook to explore the way a
RELAX NG grammar works.
Here are three patterns:
a = element a { empty }
b = element b { empty }
c = a|b
The first matches an empty a, the second an empty b, and the third
anything that matches a or anything that matches b. It’s important
to remember that validity is about pattern matching. Although it’s
convenient to name patterns after elements, technically what matches
c isn’t an a element or a b element, it’s an a pattern or a b pattern.
If RELAX NG were limited to matching empty elements without
attributes, it wouldn’t be very useful! Let’s extend our example to
add some attributes and content.
One way to do this is to extend an existing pattern with a new
one. If you extend an existing pattern, you have to specify how your
extension should fit into the current pattern: is it a new choice, or
is it allowed to be interleaved anywhere in the existing pattern?
Here’s an example that extends the “a” pattern with
a choice (signaled with “|=”):
a = element a { empty }
a |= element a {
attribute priority { "high" | "highest" },
empty
}
b = element b { empty }
c = a|b
Now a matches either an a element with a “high” or
“highest” priority attribute or an a element
with no attributes.
Writing an easily customized RELAX NG grammar is, in part, about
making the patterns easily customizable. Making the a pattern an
explicit choice between two element patterns isn’t the best approach.
It would be easier to customize if we used different pattern names.
This grammar is equivalent:
ordinary = element a {
empty
}
important = element a {
attribute priority { "high" | "highest" },
empty
}
a = ordinary | important
b = element b { empty }
Now a customization layer has the freedom to adjust, in ways we’ll
come to in a moment, the ordinary and important patterns
independently.
As these patterns stand, we can match either a single a element
or a single b element. Let’s add a wrapper to hold a collection
of elements:
doc = element doc { (a|b)* }
This pattern matches an element named doc that contains any number,
including none, of things that match the a pattern or things that
match the b pattern in any order.
The content model rules are straightforward, if you find regular
expressions straightforward, and will be familiar if you’ve written
DTDs.
a matches exactly one a pattern.
a? matches an optional (exactly 0 or 1) a pattern.
a* matches zero or more a patterns.
a+ matches one or more a patterns.
(a,b), a sequence, matches an a pattern followed by a b pattern.
(a|b), a choice, matches an a pattern or a b pattern.
(a&b), an interleave, matches an a pattern and a b pattern, in any order.
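To see the operators working together, here is a small sketch of a grammar that uses each of them once. The memo vocabulary is invented for this illustration; none of these names come from DocBook.

```rnc
start = memo

# "to" and "from" may come in either order (interleave), then an
# optional subject, one or more paragraphs, and finally any number
# of signatures or initials (a repeated choice).
memo = element memo {
  (sender & recipient),
  subject?,
  para+,
  (signature | initials)*
}

sender = element from { text }
recipient = element to { text }
subject = element subject { text }
para = element para { text }
signature = element signature { text }
initials = element initials { text }
```

Note that the compact syntax does not let the sequence, choice, and interleave operators be mixed at the same level without parentheses, which is why the interleave and the choice are each wrapped in their own group.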
Finally, RELAX NG requires that we enumerate the top level patterns
that our document must match. This is not possible in DTDs and
requires a certain amount of gymnastics in W3C XML Schema.
start = doc|a
Combining these patterns into a grammar, we get:
start = doc|a
doc = element doc { (a|b)* }
important = element a {
attribute priority { "high" | "highest" },
empty
}
ordinary = element a {
empty
}
a = ordinary | important
b = element b { empty }
This grammar matches documents that begin with a doc element or an a
element. The a element may have no attributes at all, or it may have
a priority attribute with the value “high” or “highest”.
With a schema this simple, it doesn’t seem impractical to stop here.
Extending the important or ordinary patterns by redefining the
entire element pattern wouldn’t be too burdensome.
That’s much less practical in a schema with hundreds of elements and
attributes containing complex content models. Let’s rewrite this grammar
in a way that more closely matches the overall pattern structure in the
DocBook schema.
start = doc|a
doc.contentmodel = (a|b)*
doc.attlist = empty
doc = element doc {
doc.attlist,
doc.contentmodel
}
high_priority = attribute priority { "high" | "highest" }
priority = high_priority
important.attlist = priority
important.contentmodel = empty
important = element a {
important.attlist,
important.contentmodel
}
ordinary.attlist = empty
ordinary.contentmodel = empty
ordinary = element a {
ordinary.attlist,
ordinary.contentmodel
}
a = ordinary | important
b.attlist = empty
b.contentmodel = empty
b = element b {
b.attlist,
b.contentmodel
}
This grammar validates exactly the same documents, but it’s much, much
easier to customize as we shall see below.
RELAX NG allows you to create a grammar with reference to another,
existing grammar. Suppose the schema above is accessible at the
URI “base.rnc”. I can write a new grammar by reference:
# My custom schema
include "base.rnc" {
}
We can add any patterns we like outside of the curly braces that
follow the filename, “base.rnc”. Within those curly braces, the
patterns that we specify entirely replace patterns with the same names
in the original, base schema. (To augment a base pattern rather than
replace it, combine a new definition with it, using “|=” or “&=”,
outside the braces.)
As it stands, this is an uninteresting grammar that matches
exactly the same things as the base grammar; all
I’ve introduced is a comment. But from here we can begin
to look at customizations.
First, observe that our base grammar allows an explicit high priority
attribute, but doesn’t allow an explicit low or medium priority
attribute. We can easily add such an attribute. Second, let’s add
the requirement that a high priority element must have an ID.
# My custom schema
ordinary_priority = attribute priority { "low" | "medium" }
id = attribute xml:id { xsd:ID }
include "base.rnc" {
ordinary.attlist = ordinary_priority?
important.attlist = priority & id
}
This grammar defines a new pattern to match a low or medium priority
attribute and extends the definition of ordinary.attlist to include it.
By making the pattern optional (“?”), we still allow ordinary elements
without the new attribute.
The important.attlist pattern is redefined to interleave a required ID
attribute. RELAX NG incorporates all of the W3C XML Schema data types,
so we can define it to be an xsd:ID. Since attributes are unordered
in XML, it’s natural to interleave them. (But putting a comma between them
would have the same effect: you cannot make order
matter even if you technically make the attributes a sequence in your grammar.)
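The point about attribute order can be checked with a small standalone sketch. The pattern names interleaved.attlist and sequenced.attlist are invented for this illustration; the two elements accept exactly the same attributes, in whatever order they are written in the document.

```rnc
id = attribute xml:id { xsd:ID }
priority = attribute priority { "high" | "highest" }

# Attributes combined with interleave...
interleaved.attlist = priority & id

# ...and the same attributes combined as a sequence.
sequenced.attlist = priority, id

# Both elements require both attributes; neither constrains their order.
start =
  element a { interleaved.attlist, empty }
  | element b { sequenced.attlist, empty }
```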
Next, let’s imagine that we want to add an “emergency” priority.
There are, in fact, several ways that we could do this.
We could redefine the high_priority pattern:
include "base.rnc" {
high_priority = attribute priority { "high" | "highest" | "emergency" }
}
Or we could extend it:
emergency_priority = attribute priority { "emergency" }
include "base.rnc" {
priority = high_priority | emergency_priority
}
At this point, you might wish that the base schema had defined a
pattern for the list of values:
high_priorities = "high" | "highest"
high_priority = attribute priority { high_priorities }
Then our customization could simply be:
include "base.rnc" {
high_priorities = "high" | "highest" | "emergency"
}
Schema designers have to strike a balance between complexity (make
everything a pattern) and maintainability. Invariably, it will be
the case that for some customizations, you’ll wish there had been
another pattern in the base schema.
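One further variant of the “emergency” customization is worth sketching: instead of replacing the priority pattern, combine a new alternative with it. This assumes the base.rnc grammar shown earlier, in which priority is defined once with a plain “=”.

```rnc
# No override block: base.rnc is included unchanged.
include "base.rnc"

# Outside the include, "|=" adds a new choice to the existing
# definition of priority rather than replacing it, so the result
# is equivalent to: priority = high_priority | <this attribute>
priority |= attribute priority { "emergency" }
```

The advantage of this form is that the customization never has to restate what the base schema already says about priority.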
RELAX NG offers a special pattern called notAllowed that allows us to
remove things in a customization layer. Suppose, for example, that we
want to remove the notion of priority from this schema entirely:
include "base.rnc" {
priority=notAllowed
}
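The same technique works at the element level. As a sketch against the same base.rnc, the following removes the important flavor of the a element while leaving the ordinary one alone.

```rnc
include "base.rnc" {
  # A choice with a notAllowed branch simply loses that branch, so
  # "a = ordinary | important" now behaves like "a = ordinary".
  important = notAllowed
}
```

Nothing else needs to change: the priority patterns are still defined, but no element pattern that can match refers to them any longer.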
Pattern annotations
It is possible to add annotations to patterns in RELAX NG. This can be
used to elaborate the grammar. For example, while RELAX NG has no
provision for default attribute values, there is a defined vocabulary
of annotations for this purpose. We can use these annotations to make
“medium” the default value of the priority attribute:
namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"
ordinary_priority =
[ a:defaultValue = "medium" ]
attribute priority { "low" | "medium" }
include "base.rnc" {
ordinary.attlist = ordinary_priority?
}
Note that this default value will be added by a processor that
understands and interprets the compatibility annotations in addition
to performing RELAX NG validation. An “ordinary” validator will
simply ignore them.
Another common use of annotations is for documentation. Also in aid
of documentation, RELAX NG allows patterns to be grouped within
div elements.
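As a small sketch of both features, here are the ordinary patterns from the tutorial grammar grouped in a div and documented with “##” comments, which the compact syntax treats as documentation annotations rather than as ordinary comments. The wording of the documentation strings is invented for this illustration.

```rnc
start = ordinary

div {
  ## An item that needs no special attention.
  ordinary = element a {
    ordinary.attlist,
    ordinary.contentmodel
  }

  ## Attributes allowed on an ordinary item.
  ordinary.attlist = empty

  ## Content allowed in an ordinary item.
  ordinary.contentmodel = empty
}
```

The div has no effect on validation; it only groups the related definitions so that tools (and readers) can treat them as a unit.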
The DocBook RELAX NG grammar
The DocBook RELAX NG grammar continues to reflect the structure
established in the DTD. Many patterns are defined simply to establish
mixtures that will later be used in content models:
db.nopara.blocks =
db.list.blocks
| db.wrapper.blocks
| db.formal.blocks
| db.informal.blocks
| db.publishing.blocks
| db.graphic.blocks
| db.technical.blocks
| db.verbatim.blocks
| db.bridgehead
| db.remark
| db.revhistory
db.para.blocks =
db.anchor
| db.para
| db.formalpara
| db.simpara
db.all.blocks =
db.nopara.blocks
| db.para.blocks
| db.extension.blocks
A consistent arrangement of patterns is used to define each element.
Here, for example, is the set of patterns that define the para
element:
[
db:refname [ "para" ]
db:refpurpose [ "A paragraph" ]
]
div {
db.para.role.attribute = attribute role { text }
db.para.attlist =
db.para.role.attribute?
& db.common.attributes
& db.common.linking.attributes
db.para.info = db._info.title.forbidden
db.para =
element para {
db.para.attlist,
db.para.info,
(db.all.inlines | db.nopara.blocks)*
}
}
It begins with a couple of annotations that are used for documentation
purposes. What follows are typically:
A pattern that defines the role attribute. In the base schema, these
definitions are all the same. A distinct pattern is provided in each
case because one common form of customization is to make the role
attribute have a delimited set of values.
A pattern that defines the attributes on the element. In the base
schema this is often just a mixture of the role and common attributes.
For block elements, a pattern for the info element. The info element
is a generic wrapper for block-level metadata, things like title,
author, and copyright. (Those of you with long
memories may recall BookInfo and ChapterInfo
from DocBook’s SGML DTD days. Those all got replaced with a single
info element that has varying content models in the RELAX NG
grammar.) Out of the box, this comes in several
flavors. For para, the info element that
cannot contain a title is used.
Finally, there is the content model itself, most often a combination
of elements and mixtures.
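The role attribute customization mentioned in the first item might look like the following sketch. The two role values are hypothetical; substitute whatever vocabulary your stylesheets actually understand.

```rnc
include "docbookxi.rnc" {
  # Replace the free-text role on para with a closed list of values.
  # The values "summary" and "legal" are invented for this example.
  db.para.role.attribute = attribute role { "summary" | "legal" }
}
```

Because each element has its own role pattern, this change affects only para; the role attribute on every other element remains free text.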
DocBook is a RELAX NG grammar
Customizing DocBook is, effectively, nothing more than applying the
RELAX NG grammar features to the particular set of patterns that
define the DocBook schema.
Removing elements or attributes
There may be nothing easier to do in a RELAX NG customization
layer than remove things. Suppose, for example, you wanted to remove
the revisionflag attribute. If you aren’t tracking changes
in your DocBook sources, you don’t need it. At the element level,
if your publishing system doesn’t support generated callouts, you don’t
need the area element.
namespace db = "http://docbook.org/ns/docbook"
default namespace = "http://docbook.org/ns/docbook"
# ======================================================================
# docbookxi.rnc is the flavor of DocBook that has
# XInclude mixed-in in appropriate places.
include "docbookxi.rnc" {
db.revisionflag.attribute = notAllowed
db.area.units-enum.attribute = notAllowed
db.area.units-other.attributes = notAllowed
}
Sometimes, as with the area element above, you
have to make a few patterns notAllowed to do the job
completely. In this case, there are two patterns, one with an enumerated
list of values for the units attribute and one with a value of “other”
for the units attribute and a required otherunits attribute.
Adding elements or attributes
To add something new, you have to do two things: create a
pattern to match the new item and add that pattern to the appropriate
mixtures. For example, here’s a customization layer that adds a new
inline element, port. The author uses this customization
in the documentation for XProc.
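include "docbookxi.rnc"
# The "|=" definition sits outside any include override block:
# a definition inside the braces would replace the base pattern
# outright instead of adding a choice to it.
db.markup.inlines |= db.port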
# ======================================================================
db.port.role.attribute = attribute role { text }
db.port.attlist =
db.port.role.attribute?
& db.common.attributes
& db.common.linking.attributes
db.port =
element port {
db.port.attlist, (db.programming.inlines | db._text)*
}
There is no requirement that you name patterns following the
conventions used in the DocBook schema, but doing so is likely to help
you keep them organized in a similar way and will make them easier for
other DocBook customizers to understand.
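Adding an attribute follows the same two steps. The sketch below adds a hypothetical reviewstatus attribute wherever the common attributes are allowed; the attribute name, its values, and the local. prefix on the pattern name are all assumptions made for this illustration.

```rnc
include "docbookxi.rnc"

# Step one: a pattern for the new (hypothetical) attribute.
local.reviewstatus.attribute =
  attribute reviewstatus { "draft" | "reviewed" | "approved" }

# Step two: interleave it, as an optional attribute, with the
# existing common attributes. "&=" combines with the base
# definition rather than replacing it.
db.common.attributes &= local.reviewstatus.attribute?
```

Elements whose attribute lists are not built from db.common.attributes would not pick up the new attribute and would need their own adjustment.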
Make required elements optional
Changing content models is, necessarily, contextual. Making a title
optional, for example, is just a matter of changing the pattern used
for the relevant info element. This customization makes chapter titles
optional:
include "/projects/docbook/docbook/relaxng/schemas/docbookxi.rnc" {
# This pattern is db._info.title.req in the base schema
db.chapter.info = db._info
}
The descriptive nature of DocBook has led to a schema without a lot
of required elements. Books without chapters? Indexes without index
terms? Check and check.
To pick an example where a less elegant customization is necessary,
let’s consider ordered lists. In DocBook, they’re required to have at
least one list item. Suppose we wanted to relax that requirement?
It happens that there isn’t a “.contentmodel” pattern for the
content of orderedlist, so we’ll simply have to redefine the whole
thing.
include "docbookxi.rnc" {
db.orderedlist = element orderedlist {
db.orderedlist.attlist,
db.orderedlist.info,
db.all.blocks*,
db.listitem*
}
}
A more practical customization here might be to remove the
optional blocks from before the first list item, but that’s not an
example of making required elements optional.
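For completeness, that more practical customization might be sketched as follows. It reuses the pattern names from the redefinition above and keeps the base schema’s requirement of at least one list item.

```rnc
default namespace = "http://docbook.org/ns/docbook"

include "docbookxi.rnc" {
  # Same as the base definition, minus the blocks that were
  # allowed before the first list item.
  db.orderedlist = element orderedlist {
    db.orderedlist.attlist,
    db.orderedlist.info,
    db.listitem+
  }
}
```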
Make optional elements required
Making optional elements required is very much the same as making
required elements optional. If you wish to require bibliography
elements to have a title, change the db.bibliography.info pattern
so that it matches an info element with a required title,
db._info.title.req.
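Written out, that change is a one-line override. This sketch uses only the two pattern names mentioned above.

```rnc
include "docbookxi.rnc" {
  # Every bibliography must now carry an info-level title.
  db.bibliography.info = db._info.title.req
}
```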
For a more interesting example, consider that some style guides frown
on nested hierarchy elements without any intervening prose: a chapter
that begins immediately with a top-level section or a top-level
section that begins immediately with a second-level section.
If you examine the DocBook schema, you’ll find that the chapter
content model is defined by the db.chapter.contentmodel pattern.
That pattern is, in turn, defined as db.component.contentmodel, the
common content model for all “components” (roughly, elements at the
level of chapter).
The common content model for components is:
db.component.contentmodel =
db.navigation.components*,
db.toplevel.blocks.or.sections,
db.navigation.components*
This allows navigational components (indexes, tables of contents,
etc.) to appear at either the front or the back. Between them, “top
level blocks or sections”:
db.toplevel.blocks.or.sections =
(db.all.blocks+, db.toplevel.sections?) | db.toplevel.sections
DocBook has two independent section hierarchies, a numbered one
(sect1, sect2, …) and a recursive one (section). That’s captured
by the db.toplevel.sections pattern in a way that makes it easy to
choose either one or both.
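As a sketch of choosing just one hierarchy, the following keeps the recursive section element and drops the numbered one. It assumes that the element pattern for sect1 is named db.sect1, following the naming convention shown earlier for para.

```rnc
include "docbookxi.rnc" {
  # Assumed pattern name. With sect1 gone, any branch of
  # db.toplevel.sections that requires it can no longer match,
  # and the deeper numbered sections become unreachable.
  db.sect1 = notAllowed
}
```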
Anyway, this deep in the maze, we can see that forbidding
immediately nested hierarchy elements for all components would require
a simple change to this pattern:
include "docbookxi.rnc" {
db.toplevel.blocks.or.sections =
db.all.blocks+, db.toplevel.sections?
}
If, for some reason, this change were necessary
only for chapters, more dramatic surgery would be
required. How we approach it depends on whether or not we
expect our customization layer to be further
customized.
The most direct method would be simply to redefine the content model
for chapters:
include "docbookxi.rnc" {
db.chapter.contentmodel =
db.navigation.components*,
db.all.blocks+,
db.toplevel.sections?,
db.navigation.components*
}
This is sufficient, but we’ve “unpicked” the pattern structure
significantly. We haven’t, for example, made it any easier to apply
this change to appendix elements later, if we need to.
Exercise for the reader: consider how you might make it easier for a
future customizer of your customization layer.
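One possible answer, offered only as a sketch, is to give the stricter content model a name of its own so that it can be reused. The local. pattern names are invented for this illustration.

```rnc
include "docbookxi.rnc" {
  # Chapters use the stricter model in place of
  # db.component.contentmodel.
  db.chapter.contentmodel = local.strict.component.contentmodel
}

# Hypothetical named patterns that a later customization layer can
# reuse or redefine without unpicking the chapter definition again.
local.blocks.then.sections =
  db.all.blocks+, db.toplevel.sections?

local.strict.component.contentmodel =
  db.navigation.components*,
  local.blocks.then.sections,
  db.navigation.components*
```

If appendixes later need the same treatment, and assuming the schema provides a corresponding content model pattern for them, that pattern can simply be pointed at local.strict.component.contentmodel as well.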
Make optional elements forbidden
Structured editing tools that constrain authors to write valid
documents are wonderful. But one of the disadvantages of a broad,
standard schema is that editing tools will expose all of the
flexibility that the standard allows to your authors.
One of the easiest ways to make authoring easier is to remove all of the
things that you don’t want your authors to use. This is
straightforward in RELAX NG.
The flexibility to produce French books with tables of contents at the
back is wonderful. But if you don’t publish books in French, it’s just
extra cognitive load for your authors.
There are giant swaths of DocBook that you will probably never use
unless you write for a particular domain of hardware or software.
Do you write about programming language APIs? No? Then you don’t
need all the synopsis elements.
Do you write about networking? No? Then you don’t need all the inlines
about that.
Do your documents have bibliographies? Both the raw and cooked forms?
Ditch the one(s) you don’t use.
Do you produce back-of-the-book indexes in markup? No? Then you
don’t need indexentry and its descendants.
Do your documents have mathematics? Flush the equation elements!
Do your documents have admonitions? Q&A sets?
Screenshots? Video? Audio? Drop all the blocks you don’t
need.
You don’t need msgset.
This’ll simplify your authoring environment:
include "docbookxi.rnc" {
db.synopsis.blocks = notAllowed
db.systemitem = notAllowed
db.biblioentry = notAllowed
db.indexdiv = notAllowed
db.indexentry = notAllowed
db.segmentedlist = notAllowed
db.equation = notAllowed
db.informalequation = notAllowed
db.inlineequation = notAllowed
db.math.inlines = notAllowed
db.admonition.blocks = notAllowed
db.videoobject = notAllowed
db.audioobject = notAllowed
db.screenshot = notAllowed
db.qandadiv = notAllowed
db.qandaentry = notAllowed
db.qandaset = notAllowed
db.msg = notAllowed
db.msgexplan = notAllowed
db.msgmain = notAllowed
db.msgrel = notAllowed
db.msgset = notAllowed
db.msgsub = notAllowed
}
Change the semantics of a component
DocBook, perhaps because of its history as an interchange
format, doesn’t attempt to bring a great deal of rigor to the
semantics of its elements. The reference documentation provides a
description of its intended semantics, as the DocBook designers
understood it, but those descriptions are often intentionally vague.
Saying that a productnumber is “a number assigned to a
product” is not especially precise.
There are a number of elements right down on the leaves of the
tree where a customization layer could impose stricter syntactic
constraints that would limit the opportunities for misunderstanding.
For example, pubdate could be restricted to an ISO 8601 date
or date-time. Similarly, if your organization has product numbers that follow
a predictable pattern, you could add a constraint to enforce that.
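A sketch of both constraints follows. It assumes that the schema defines db.pubdate.attlist and db.productnumber.attlist, following the per-element convention described earlier, and the product number format (two capital letters, a hyphen, four digits) is invented for this illustration.

```rnc
default namespace = "http://docbook.org/ns/docbook"

include "docbookxi.rnc" {
  # pubdate must be a date or a date-time in the W3C XML Schema
  # lexical forms, which are profiles of ISO 8601.
  db.pubdate = element pubdate {
    db.pubdate.attlist,
    (xsd:date | xsd:dateTime)
  }

  # productnumber must look like "AB-1234" (hypothetical format).
  db.productnumber = element productnumber {
    db.productnumber.attlist,
    xsd:string { pattern = "[A-Z]{2}-[0-9]{4}" }
  }
}
```

The cost of this kind of tightening is that these elements can no longer contain inline markup, so it is best reserved for leaves where plain values are all you ever write.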
Document the customizations
The RELAX NG grammar allows documentation to be combined with the
schema. Elements from other namespaces simply become ignored
annotations to the validator. In this way, DocBook prose, for example,
could be combined directly with the schema in a “literate programming”
style.
Unfortunately, this is fairly cumbersome in practice, in part because
the DocBook schema is authored in the compact syntax. The compact
syntax, as mentioned earlier, can be losslessly converted to and from
the XML syntax. However, the particular representation of arbitrary
XML in the compact syntax is, in a word, awful.
For example, consider this simple fragment of documentation in the XML
syntax:
<db:para>This is a <emphasis role="important">feature</emphasis>,
not a bug.</db:para>
In the compact syntax, it becomes this annotation:
db:para [
"This is a "
rng:emphasis [ role = "important" "feature" ]
",\x{a}" ~
"not a bug."
]
That’s not…practical.
As a result, the DocBook schema limits the embedded documentation
to the single-sentence summary of each pattern (its man page
refpurpose), and the description of enumerated attribute values.
For example, here are the patterns for the revisionflag attribute and its
enumerated values.
db.revisionflag.enumeration =
## The element has been changed.
"changed"
| ## The element is new (has been added to the document).
"added"
| ## The element has been deleted.
"deleted"
| ## Explicitly turns off revision markup for this element.
"off"
db.revisionflag.attribute =
[
db:refpurpose [ "Identifies the revision status of the element" ]
]
attribute revisionflag { db.revisionflag.enumeration }
The rest of the reference documentation is maintained separately, in
DocBook, and combined with the schema annotations through a fairly
complicated process of shaking and stirring.
Plus Schematron
One last observation. No set of grammatical constraints can
conveniently capture all of the useful constraints of an authoring
schema. DocBook uses Schematron rules, embedded in the RELAX NG
grammar as annotations, to enforce a number of extra-grammatical
constraints.
If you add structures that carry with them extra-grammatical
constraints, you’d be wise to add Schematron rules for as many of
them as practical.
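For the port element added earlier in this paper, an embedded rule might look like the sketch below. The constraint itself (a port may not be nested inside another port) is invented for this illustration, and whether the rules are actually enforced depends on your toolchain extracting and running the Schematron annotations.

```rnc
namespace s = "http://purl.oclc.org/dsdl/schematron"

include "docbookxi.rnc"

div {
  # Bind the db prefix used in the XPath expressions below.
  s:ns [ prefix = "db" uri = "http://docbook.org/ns/docbook" ]

  s:pattern [
    s:title [ "Constraints on port (hypothetical)" ]
    s:rule [
      context = "db:port"
      s:assert [
        test = "not(ancestor::db:port)"
        "A port must not appear inside another port."
      ]
    ]
  ]
}
```

A plain RELAX NG validator ignores these annotation elements entirely, so the grammar remains usable even where Schematron is not available.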
Building DocBook
A reviewer commented that additional detail about the process by
which a collection of source files becomes the DocBook RELAX NG
grammar would be interesting. What follows is a summary of the process.
If you want to see all the gory details, they’re publicly available
in the DocBook repository at GitHub. The process begins down in the
/relaxng/schemas/ directory.
The source files themselves are stored in a collection of logical
modules (markup related to admonitions, sections, tables, technical
content, general publishing information, etc.). The idea is that if
you want to completely excise some module, you can simply construct
your own driver file that omits it.
The goal of the assembly process is to transform a set of schema
files organized to be convenient for authoring and generating
documentation into a small, efficient RELAX NG grammar that can be
used for validation. The process follows this basic plan:
The RNC files are handy for authoring, but not actually useful for
processing. The first step is to use trang to convert them
all to XML RNG files.
The set of schema files is composed into a single XML document.
There’s provision at this level for a few special cases through the use
of some custom “control” markup in another namespace in the RELAX NG grammars.
Any patterns that are entirely unreferenced are excluded from
the composed schema. Several published customization layers are
subsets; removing the unused patterns from the base schema makes the
customization layers smaller and easier to understand.
Schematron validation is used for a number of extra-grammatical
constraints. In the context of DocBook, many of these can be derived
from the patterns themselves. The control markup has features for
expressing this. For example:
ctrl:exclude [ from="db.footnote" exclude="db.formal.blocks" ]
This single control structure is transformed into a set of Schematron
rules that forbid any element in the “db.formal.blocks” pattern
from appearing as a descendant of (any element in the) “db.footnote”
pattern.
Another cleanup pass is performed to remove redundant inherited
attributes and sort out namespace issues (the namespace for the control
vocabulary is no longer needed at this point, for example).
All of the markup annotations related to documentation are removed
from the generated schema.
Copyright messages are moved into the right places and updated with
the build information (date, version number, etc.).
The build process also runs a subset of the DocBook test suite.
(Builds are published from an online continuous integration server and the
full suite runs longer than the allowed time for jobs.) Locally, developers
can and should run the whole test suite, of course.
That process produces docbook.rng ready for validation.
Running trang again produces docbook.rnc.
A slightly longer and more elaborate process can be used to turn the
sources into a fully elaborated, 24MB XML document that can be used to drive
the process of building the documentation sources.
The current state of the art is that this process is neither well documented
nor especially portable. It would be possible to leverage these tools for local
customizations, but it would not be easy.