Guy Gosselin - Director, Building Regulations, NRC-CNRC Construction
Guyane Mougeot-Lemay - Building Regulations, Manager, Production & Marketing
Tarek Raafat - Building Regulations, Head, Information Systems
Helen Tikhonova - Building Regulations, Information Systems Specialist
David Taylor - Building Regulations, SGML/XML Specialist and CMS Administrator
Henning Heinemann - Building Regulations, Project Manager, Information Systems
The Building Codes content shown in any of the images or examples accompanying this paper are copyright the National Research Council of Canada and used with permission.
A Look at the Building Codes
Canada's building codes date back to 1941 when the first version of the National Building Code was published. At that time, a typical page from the document looked like Figure 1.
Figure 1: Sample content from the 1941 Building Code
Markup and computer based publishing (and computers for that matter) were still off in the future. To keep things moving along and relevant to this conference, I'll gloss over the ensuing forty years of Building Codes development and publishing activity.
By the late eighties, Codes production had shifted to desktop publishing (Pagemaker) and a typical page looked like Figure 2.
Figure 2: Sample page from the 1990 Building Code
Even as the Building Codes were being converted to desktop publishing tools, markup and, in particular, the notion of separating document content from document presentation, especially for large technical publications, was becoming more common. So too were the tools and expertise necessary to handle markup.
Despite the claimed advantages of a markup based publishing approach, it was still a leap of faith to go to the expense of converting data from proprietary formats to SGML and retooling the publishing chain. Nonetheless, for the 1995 version of the Codes, the content was converted to SGML with an accompanying DTD. Arbortext was selected as the editing and page composition tool which also required that a FOSI be developed to format the output. The SGML/Arbortext printed copy looked like Figure 3.
Figure 3: Sample page from the 1995 Building Code
One year later, the Codes were issued in their first electronic version using Dynatext from Electronic Book Technologies. Dynatext was a publishing system that allowed SGML content to be combined with other media like vector and raster graphics and audio and video clips into a book or book collection that could be shipped on a CD. The Dynatext version of the Building Codes looked like Figure 4.
Figure 4: Sample page from the Dynatext electronic Building Code
The Dynatext release of the Building Codes was important for demonstrating:
The advantage of a non-proprietary data format that could be processed by different tool chains to create very different output products - a particularly important message given the cost of conversion to SGML and the retooling to support the conversion.
The added value of an electronic document over paper - search, hyper links, light and compact (CD vs. paper), etc.
At the time however, a well-thumbed copy of a paper version of the Building Codes, thrown into the cab of a pickup truck on a building site, or stored on a building professional's desk, was a more realistic delivery scenario than a format that required ready access to a laptop or desktop computer.
It was 10 years before the next release of the Building Codes. As before, the paper version of the Codes was edited and composed in Arbortext although by 2005 the DTD and content had been converted from SGML to XML. The conversion was not difficult as the original conversion from Pagemaker to SGML did not take significant advantage of the SGML features that were dropped when XML was designed (although we have had many opportunities to lament the loss of inclusions in the XML DTD - having to allow for change-begin and change-end elements nearly everywhere in the XML DTD is much messier than being able to specify their inclusion once). This sample page from the 2005 version of the Codes (Figure 5) shows a strong resemblance to the 1995 version (Figure 3), save for the change to a single column format.
Figure 5: Sample page from the 2005 Building Code
This is where I come in to the story. By 2005, Dynatext was no longer available and the Canadian Codes Centre had selected the NXT CD publishing tool to create the next electronic version of the Building Codes. My initial brief was to create HTML output from the XML source suitable for import into NXT. As far as possible, the content was to be formatted like the paper copy. The FOSI used for the printed Codes, being a stylesheet itself, provided me with a useful leg up in creating the CSS.
When we started work converting the XML to HTML (using XSLT) for the NXT CD tool
we did not actually have the NXT software. Our initial conversion delivered a
2-frame HTML view of the output with a Table of Contents in the left frame and the
Codes content in the right. Figure 6 shows a sample.
Figure 6: Sample page from the 2006 electronic Building Code
Once the NXT software arrived, and as we learned its specific requirements, we
modified the conversion scripts to support both the original framed output and the
NXT output. In the next few sections I'll outline some of the more interesting
challenges we had to deal with.
I have already mentioned that the plain HTML output and the NXT output were
different. The changes were mostly related to the Table of Contents (TOC) and
the format of hyper links. Setting up the conversion scripts to handle these
differences was fairly straightforward. Formatting was another thing entirely.
At the time, we were trying to support Firefox 3, Internet Explorer 6, and CSS 2
- it turned out that if we got that right, the NXT output would look OK too. We
wanted to have a single CSS style sheet to reduce long term maintenance
headaches. Effectively, we were trying to support 3 different rendering engines
with one set of HTML files and one style sheet. Our initial conversion scripts
tried to take advantage of HTML elements like <P> but the browsers attach
some amount of built-in formatting to the HTML elements. Of course the
formatting was different for each browser as were the interactions of the
predefined formatting with the linked CSS. We simplified the problem by mapping
the XML into DIV and SPAN elements for block and running text respectively. As I
learned at Balisage in 2011 over a beer one evening, DIV and SPAN were
introduced specifically to be unformatted so we were able to limit my hair
pulling to resolving differences between how the browsers interpreted CSS2. This
was entertaining enough - we had to tweak both the output HTML and the CSS to
achieve my goals. For example, one instance we had to wrap output in both DIV
and SPAN elements to get similar presentations in Firefox and Internet Explorer.
The CSS has a disturbing number of comments like:
IE and FF have different opinions about how to layout para-nmbrd caused by FF
not honoring sentnum width. Numbers in FF on para-nmbrd text will therefore be
shifted left by 1em plus the difference between the width of the number (including
its trailing ')') and the width of an 'm' character.
My claim in the previous section about using only DIV and SPAN elements was true to a point. Tables were that point. Tabular output required engaging the table rendering engines in the browsers and so my HTML output does include HTML table elements. Anyone who has had to convert tables marked up using the Oasis Exchange Table Model into HTML tables knows just how much work this can be. For example, an Oasis table can have multiple TGROUP elements where each TGROUP can support a different number of columns. There is no analog in HTML tables - each table can only have a single number of columns. You therefore have 2 options:
Convert each TGROUP to support the number of columns in the least common multiple of the columns in all TGROUPs.
Output each TGROUP as a separate table and rely on rendering the tables with no intervening space to look like a single table.
The first option is unspeakably horrible as it involves setting up column spans or converting existing column spans (named or positional) and all the references to the span information in the table data amongst other nasties. The latter option is much, much easier but had a dark side that was not apparent at the time. That dark side showed up years later when we converted our output to be accessible. No longer could we present a single logical table as multiple printed tables. We had to present the logical table as a single HTML table with a CAPTION element. Fortunately, and after an extensive review of our content, we found that, while we did have to support multiple TRGOUPS, we did not have different numbers of columns or different types of presentational attributes in each TGROUP.
The tables in the Codes documents are both numerous and often complex. Even things like figuring out which borders to render on a table cell required looking at all the possible places cell borders could be specified starting at the Oasis TABLE element and working down through TGROUP, COLSPEC, ROW, and ENTRY elements. Ultimately, the results worked out well enough. The following table samples (Figure 7, Figure 8) give a feeling for the complexity of our tables.
One feature of the Codes documents is that each new version highlights significant changes from the previous version using change bars in the page margins. This is a very common technique in print, but HTML and CSS were not designed to support this level of page fidelity. We settled on using shaded text to highlight the differences (see Figure 9).
Figure 9: Version change highlighting
The interesting problem was that Arbortext encoded change bars in the XML as switches (empty elements) that told the Arbortext page composition engine to start (or stop) rendering a change bar. You can see the "change-begin" and "change-end" empty elements in the sample below.
<sentence id="es007023"><intentref xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="ei000287.xml" xlink:title="Intent"/><text>Where hangers are used to support <term refid="nmnll-hr">nominally
horizontal</term> piping, they shall be</text>
<text>metal rods of not less than</text>
<text><change-begin/><meas>6 mm</meas> diam to support piping <meas>2 inches</meas> or
less in <term refid="z">size</term>,<change-end/></text>
<text><change-begin/><meas>8 mm</meas> diam to support piping <meas>4 inches</meas> or
less in <term refid="z">size</term>, and<change-end/></text>
<text><change-begin/><meas>13 mm</meas> diam to support piping over <meas>4
inches</meas> in <term refid="z">size</term>, or<change-end/></text>
What we wanted to do was emit an element start tag when we encountered the
element that started the change bar and emit the corresponding end tag when we
hit the stop change bar element so that we could wrap content in a DIV or SPAN
element (with a CSS class). Of course XSLT does not normally allow a partial
element to be emitted in a template. We had to hide what we were doing from the
XSLT engine by outputting the start and end tags in different XSLT templates
using character entities like so:
<!-- CHANGE-BEGIN Any changes in this code should be mirrored in CHANGE-END. -->
<!-- CHANGE-END Any changes in this code should be mirrored in CHANGE-BEGIN. -->
The XSLT output serializer then converted the character entities back into regular < > characters where they would be interpreted as markup (and therefore as a DIV or SPAN wrapping content) by the browsers. Codes text that included < and > characters and that we did not want interpreted as markup had to be hidden by doubly encoding them as &lt; and &gt;.
Of course converting singleton elements functioning as switches to an element wrapping content did not initially produce reliably well-formed output in every case. Subsequent stages of our rendering pipeline choked on the output. Our solution was to manually relocate the offending change singletons in the source XML. In most cases this was as simple as moving a change-begin singleton from preceding a start tag to immediately following the start tag (for example) which left well-formed output that had the same effect as the original change markup.
Ultimately, the NXT output (and interface) looked like Figure 10.
The NXT output preserves the look of the printed Codes text (in the right pane) and also offers a number of advantages over print:
Active hyper linking within and between Codes documents
Full text search
More complete (the electronic output included the intent statements which were not released on paper). These are accessed through the links at the left of each sentence.
Much more portable
Between the time we published the Codes on CD using NXT and the time we had to start preparing for the next release of the Codes in 2010, changes on the NXT side suggested strongly that we have a plan B for releasing an electronic copy of the Codes in 2010. Plan B turned out to rely on Arbortext for both the print and electronic copies of the Codes using Arbortext PDF output. The electronic PDF output, like the SGML and HTML electronic version before it offers active hyper links, a TOC, and search capabilities. As you can imagine, with PDF as the output for both the print and electronic versions of the Codes, the presentation was nearly identical and the entire production process was greatly streamlined. The output looks very much like the 2005 Codes so I haven't included an example here.
For the foreseeable future, Arbortext will be the composition engine for both the print and electronic copies and so this part of my tale ends. The next part of this paper describes a different aspect of the work I've been involved with at the Canadian Codes Centre.
Maintaining and Developing the Building Codes
As we saw at the start of this paper, the Codes have an extensive set of stakeholders all of whom both contribute to and must be kept apprised of development work on the Codes. In addition, once the Codes are adopted by a jurisdiction (province, territory), they acquire legal standing. Until recently, the tracking of each proposed change to the Codes was managed with a MS-Word template like Figure 11. The template shows the original Code text, the proposed change, the rationale for the change, and a variety of administrative and tracking details.
Figure 11: MS-Word Proposed Change Template (heavily edited)
Keeping the Word templates up to date required a lot of manual work and discipline on the part of the technical committee chairs. In order to provide better process traceability and accountability and to help manage the increased number of documents in production the Canadian Codes Centre implemented a Content Management System (CMS).
The CMS we are using is Interwoven Teamsite. In the CMS, the Word template was replaced with a proper electronic form (the Proposed Change Form or PCF) with work flow, versioning, fielded searching, reporting, and centralized administration - all typical characteristics of a CMS. The new form looks like Figure 12.
Figure 12: Teamsite Proposed Change Form
This paper is not about the CMS though or even the PCF (which is itself an XML document behind the scenes). I instead want to focus on one aspect of the PCF - the part of the form that contains the text under consideration for change. This corresponds to the second tab in the PCF form: Figure 13.
Figure 13: PCF Code Reference Tab
In Figure 13, the reference is to an entire article.
As has been hinted at in the sample output shown so far, the Codes documents are highly structured documents following a deep hierarchical model: division/part/section/subsection/article/sentence/clause/subclause in the normative portions of the Codes and a different model for non-normative appendices.
A proposed change might include one or more sentences or higher level constructs (article, subsection, etc.). In the past, the Code text under consideration was cut from a PDF version of the Code document and pasted into the MS-Word PCF template. As anyone who has done this knows, the results can be ugly, especially if the cut text spans a page boundary in the PDF. Quite apart from that problem, the source for the Codes content is maintained as fragments of XML. Converting the source to PDF (for publication), then to Word (the PCF), and then back to XML (for our fragment library including regenerating all the meta data in the XML - IDs, IDREFs, etc.) once changes had been made was largely manual, time-consuming and error-prone. We wanted to try linking the source XML to the Codes revision process somehow so that we could improve the overall throughput, reliability, and integrity of the revision process, at least as far as the content was concerned.
The CMS allows files to be attached to forms so rather than inserting Code text into the form (much like the old Word templates), we decided to attach portions of the XML source to the form. Before I describe our solution, I'll take a short diversion into the XML library that contains the source for the Building Codes document.
XML Fragment Library
The XML source for the Codes documents are maintained as a single tree of XML fragments. The leaves contain the bulk of the Codes text (sentences, tables, appendix notes, intent analysis). Higher levels in the tree contain structural information fragments. Tables, appendix notes, and intent analysis fragments are referenced from sentence fragments.
A sample structural fragment looks like:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NRC-IRC//DTD Code_2010//EN"
<!ENTITY ES000432 SYSTEM "../../../sentence/es/000/es000432.xml">
<!ENTITY ES000433 SYSTEM "../../../sentence/es/000/es000433.xml">
<title>Group A, Division 2, up to 6 Storeys, Any Area, Sprinklered</title>
You can see that the article fragment is little more than a title element followed by entity references to the sentence fragments that make up the article (you can see the SGML heritage here).
One of the sentence fragments in the above article looks like:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sentence PUBLIC "-//NRC-IRC//DTD Code_2010//EN"
<ref.intent xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="ei002082.xml" xlink:title="Intent"/>
<text>A <term refid="bldng">building </term> classified as Group A, Division 2, that is
not limited by <term refid="bldng-r">building area</term>, is permitted
to conform to <ref.int refid="es000433" type="short"/> provided</text>
<text>except as permitted by <ref.int pretext="Sentences"
refid="es000398"/> <ref.int pretext="and" refid="es000422"/>, the <term
refid="bldng">building</term> is <term refid="prnklrd">sprinklered</term
Each sentence has a corresponding intent analysis which describes why a sentence is important based on a number of objectives and functional requirements. The XLink information points to intent analysis fragments. Appendix notes can be included at any level in the document hierarchy and contain explanatory text, figures, examples, and equations. All content includes many cross-references to other parts of the document (see the ref.int elements above).
The XML fragment library is a directory. The leaf filenames are the same as the ID attribute on that fragment and the IDs capture the semantics of the directory structure. This will be important later.
Linking the XML Library to the CMS
Clearly, the technical committee chairs could not be expected to know the XML fragment ID of a block of content that they needed to attach to a form. We needed some sort of selection process to allow for easier content selection. Apart from allowing a more useful selection process, we also wanted to ensure that the content we built to attach to the form included everything that a technical committee might need to know about that content in order to amend it. This meant that we not only needed the Code text, but also the intent analysis for each sentence and any appendix notes that applied to the attached text. The old Word forms did not impose any such discipline and so portions of the Codes document were sometimes overlooked during the revision cycle. We called the content blobs composite fragments (CFs). A special version of the main Codes DTD allows for the structure of the composite fragments.
The CMS form editor supports changes to the PCF form made by the committee chairs directly or a side effect of a work flow process. The composite fragments though have to be edited separately as the Teamsite CMS does not understand XML at the level we need. We added a feature to the PCF form that put an "edit" button beside each attached composite fragment. Clicking "edit" will cause the CMS to push the composite fragment down to a local workstation from the CMS server and start up Arbortext on that composite fragment. When an editing session is complete, the edited composite fragment is copied back up to the server.
The mechanism that supports the creation of the composite fragments is
server-side CGI scripts can be used to extend the basic CMS functionality. However,
we took a different route to integrate our XML fragment library with the CMS. We
built a separate web server (the DSF Server) to sit between the CMS and the XML
fragment library. The following diagram shows the main moving parts in our system
Figure 14: System Architecture
send URLs to the DSF Server which responds with documents or pointers to documents.
So, for example, if a technical committee chair wanted to attach a sentence from the
National Plumbing Code to an open PCF, the CMS sends a URL to our DSF server, the
server creates the composite fragment and returns it to the CMS. The PCF form then
gets updated with a link to the composite fragment file.
The DSF Server relies on a custom NoSQL database to resolve which XML fragments should be used to populate the composite fragment. This initial content is then parsed for references to other material that must be included in the composite fragment until we have a complete package of content, appendix notes, and intent statements.
A useful side-effect of our DSF Server architecture is that it isolates the XML fragment library and all our fragment processing from the CMS itself. If the CMS is upgraded or even replaced, we retain all the composite fragment functionality unchanged.
The PCF form (and the composite fragment editing) is useful but we also need to render the forms so that the technical committees can see all the information presented in context from both the PCF form and the attached composite fragments. We render to HTML, PDF and MHT depending on the downstream use. Most of the rendering code comes from the code that was developed to render our Codes to HTML for the NXT CD deliverable in 2006. A small amount of rendering code was added to support the material in the PCF form itself. A rendered PCF form (edited for presentation) looks like Figure 15.
Figure 15: Rendered Proposed Change Form
The existing provision section shows that the referenced appendix note has been added to the composite fragment to ensure that a technical committee will consider it in their deliberations. The appendix note does not render in the proposed change section as no changes have been made (yet) and so we suppress its display (more on this below).
Of no small interest, Arbortext supports change tracking while editing, like all good editors. The technical committees wanted to preserve the change tracking in the composite fragments attached to the PCF and display the changes in the rendered PCF. Arbortext change tracking causes new elements to be added to the XML file being edited. The elements are embedded in the edited XML file until such time as a document editor accepts the changes. The change tracking elements are not part of the document model (DTD, Schema) for the XML document - Arbortext deals with them appropriately. However, since the added elements change the element hierarchy in the XML document, any processing that is based on an assumption about the element hierarchy as modeled in the DTD (or Schema) will no longer work. In fact, Arbortext does not recommend working directly with XML files containing change tracking elements.
Our solution to handle change tracking display, developed after several false
starts, was to introduce a rendering preprocessing step that converted the Arbortext
change tracking elements into change tracking attributes on every element wrapped by
the change tracking element. We then strip out the change tracking elements,
restoring the document to its model conformant state, so we can render it correctly.
A composite fragment with change tracking like this (clause 'b' in Figure 16):
<clause id="es001725b" cnum="b*">
<atict:del user="U1">comply with Sentence 126.96.36.199.(2)-2015 and</atict:del>
<atict:add user="U1">they are separated a minimum distance from sources
of contaminants in accordance with </atict:add>Table 188.8.131.52<atict:adduser="U1">.</atict:add>
<atict:del user="U1"> for minimum distances.</atict:del>
Will ultimately display like Figure 16 (a more complete example of change tracking output).
Figure 16: Change Tracking
There are two sets of sequence numbers in the generated PCF output. The rightmost set are the sequence numbers that the content had when it was published and the left most set are generated on the fly as the PCF is rendered. The former set helps tie discussions back to the original published documents and the latter provide context for discussions about changes. Note that the two clauses shown in Figure 16 are new and therefore have no original numbers (shown as "--)"). The new numbers are critical in situations like this.
We preserve the "user" attributes from the Arbortext change tracking elements as classes in the HTML output. This has allowed us to experiment with presenting the change tracking output differently for each user or class of users so that we can distinguish changes made by an editor from those made by a technical committee for example. If you look carefully at Figure 16 you can see this in the shaded content. This represents changed material altered by one of the Codes editors. The unshaded changed material was altered by a technical committee chair.
Aside from rendering the change tracking visually, the rendering code exploits the
change tracking attributes to suppress content in the output. In general we try to
suppress content that is unchanged so that technical committees, editors, and
translators can focus on material that has changed. For example, if an article has
no changes, we will render just the article title to provide some context while
allowing the technical committees to focus on more relevant material. We are still
in the early stages of content suppression based on change tracking and we are
trying to avoid having to deal with requests like "Show me only what I want to see
at the moment."
Once a Code change has been approved, the XML in the composite fragment attached to the PCF must be returned to the XML library. Since the composite fragment is a single document, we need to burst the composite fragment back into its component pieces (sentences, tables, structural fragments, appendix notes, etc.). The ID attribute semantics tells us what filenames and file paths we need to create for the burst output. For example an ID on a sentence like:
indicates that this is an English ('e') sentence ('s') with a filename of 'es000001.xml' stored in the XML fragment tree at
Bursting also recreates the structural fragments as necessary.
Our bursting process exploits the semantics of the IDs in the composite fragment not only to burst the composite fragment, but also to do a number of internal consistency checks on the composite fragment to help ensure that the burst fragments will be properly linked together. We do not burst the composite fragments directly back into the XML fragment library to allow for final validity and consistency checking on the burst pieces.