How to cite this paper

Kalvesmaki, Joel. “Serializing the Locator Format of the United States Government Publishing Office as XML.” Presented at Balisage: The Markup Conference 2023, Washington, DC, July 31 - August 4, 2023. In Proceedings of Balisage: The Markup Conference 2023. Balisage Series on Markup Technologies, vol. 28 (2023). https://doi.org/10.4242/BalisageVol28.Kalvesmaki01.

Balisage: The Markup Conference 2023
July 31 - August 4, 2023

Balisage Paper: Serializing the Locator Format of the United States Government Publishing Office as XML

Joel Kalvesmaki

Joel Kalvesmaki is a software developer for the United States Government Publishing Office, founder and director of the Text Alignment Network, and a scholar specializing in early Christianity.

Copyright Joel Kalvesmaki, released under a Creative Commons Attribution 4.0 International License

Abstract

The United States Government Publishing Office makes significant use of so-called locators, a specialized file format for commercial typography. In this article, I explain the origins of the locator format, discuss its architecture, and explore some of the challenges inherent in understanding and converting the format. I present Slax, Serializing Locators as XML, an expression of the locator format in XML and an application that makes the conversion. In telling the story of the locator format, which is poorly documented, I show that the specifications developed slowly over time as user expectations changed, and in tandem with improvements to its primary processor, MicroComp. Slax also provides an excellent example of where XSLT is an ideal technology for document conversion, and where it is not, and provides a model for the symbiotic use of declarative and imperative programming models.

Table of Contents

Introduction
The Background of the Locator Format
The MicroComp Locator Format
Specifications and Formats
Grids
Fonts and Hyphenation
Challenges in Converting Locators
Serializing Locators as XML (Slax)
The Slax Format
Conclusion and Further Work

Introduction

The Government Publishing Office (GPO), which serves the legislative, executive, and judicial branches of the United States Federal Government, makes significant use of so-called locators, a specialized format for commercial typography. At present, the format is actively used on a daily basis to typeset the Congressional Record, the Federal Register, and other significant publications.

At Balisage 2020, Andries and Wood presented an approach to converting locator files to structured XML. (andries_and_wood) Their article focuses upon the U.S. Code and its up-conversion to the semantically rich United States Legislative Markup (USLM) format, and it provides an overview of a complete workflow.

In this paper, I present a generalized process that serializes the locator format as XML. Unlike Andries and Wood's application, which focused on the U.S. Code, this process deals with all known locator publications. And whereas Andries and Wood's approach utilized Java to deal with the conversion process, mine adopts a mix of two technologies: XSLT to manage the sprawling data resources and a C# application to replicate the byte handling of MicroComp, GPO's flagship desktop composition application.

My paper offers two significant contributions. First, in telling the story of locators, I argue that the format, which is poorly documented, developed slowly and silently over time as user expectations changed, and as MicroComp evolved in the 1990s. MicroComp, in fact, became a de facto black-box specification for a format that was never fixed. Second, the serialization of locators into XML provides an excellent example of places where XSLT is an ideal technology for document conversion, and places where it is not, and offers a model for the symbiotic use of declarative and imperative programming models.[1]

The Background of the Locator Format

The Government Printing Office (now Government Publishing Office, GPO) was created by an Act of Congress in 1860 to centralize and stabilize a relatively chaotic printing system. It is one of the oldest agencies of the Federal Government, and has had storied moments. GPO seconded its staff to support the defense of Washington, DC during the Civil War, and was responsible for printing the Emancipation Proclamation.

GPO has been known for innovation and early adoption of new technology in printing. Necessity was a driving force, given the volume of government output and the costs of labor and materials. Starting in 1963, GPO used electronic printing methods, beginning with a video unit and two Linofilm keyboards created by the Mergenthaler Linotype Corporation. GPO's first electronically produced publication, from March that year, set in 8/9 Bodoni Book, was “believed to be the first anywhere to use a pre-existing tabulating card file to produce typographical quality output.” (uspo_1963)

Over the next several years, GPO took a small staff of programmers, added to their ranks compositors taught to program, and collaborated with IBM and Mergenthaler to create a digital typographic system for the IBM 1401. (boyle) Teletype system (TTS) code was converted to binary-coded decimal (BCD), and divided into blocks, each block being assigned an identification code. That prefatory code was critical to locating a particular block of text, so it was called a locator.

In 1967, the Linotron 1010, a new photocomposition machine developed specially by Mergenthaler and CBS Laboratories, was introduced to GPO production. (gpo_1967) The mechanical process involved directing the rays of a mercury arc lamp through a grid of 256 characters on the cathodes of a vacuum tube. An external logic signal controlling an electrostatic wire grid selection matrix ensured that only one of the 256 fragments of the beam—the chosen character—passed to an electron multiplier. (mlc_1966) The Linotron magazine could hold up to four grids for a total of 1,024 characters. The photocomposition system was operated by a Master Typography Program (MTP), a federated suite of small, independent programs written for the IBM 360/50 and other IBM peripherals. (cavanaugh) Customers would give GPO magnetic tape that had text or data, structured by a limited set of characters and reference codes. These codes identified text block categories, but not typographic styles. For example, a first-level header might have been marked with code 11, but the typographical specifications for block style 11 were defined separately in files curated by staff of GPO's Electronic Photocomposition Unit. This approach meant that the same structured text could, without alteration, be slotted into different publications, and not require redesign or new markup. That approach is still used today: the locator code for many legislative bills is copied without change into the Congressional Record.

Typographical specifications rested in two different GPO libraries. The parameter library (later called a format library) dictated page layouts, paragraph dimension, and the behavior of lines. A second grid library coordinated input with the Linotron grid magazine by supplying offset lookup tables, character width values, and so forth. These and other tape libraries were updated by GPO staff via cards. (jcp_1970, 90, 93–106) Thus, the Linotron system of the 1960s—a landmark in printing history—indelibly influenced the terminology (“locators,” “formats,” “grids,” “parameters”) and the architecture of GPO's electronic publication program.

The Linotron system was adequate for simple text-centric publications where digitized data already existed, but not for more complex ones with graphics, tables, multiple columns, and spanning heads, and not for publications where text had not yet been digitized. These publications were still handled by the costly hot metal process. GPO was eager to bring its entire line of publications into an electronic workflow, but as it considered options beyond the Linotron, it soon realized that no outside contractor would be able to write the software needed to publish complex documents. So in the 1970s GPO decided that it would develop its own typesetting software in-house. To build the staff of its Electronic Photocomposition Division, GPO recruited craftsmen from Linotype and Monotype.

GPO also sought to purchase different kinds of typesetting hardware. In 1974 GPO acquired a system that combined optical character recognition (OCR) with text editing. Designed by Compuscan and Atex, the system combined a model 170 scanner with a PDP 11/35 computer and two video display terminals (VDT). Its output was compatible with the Linotron. (cavanaugh) Two years later GPO procured VideoComp 500 machines for the Atex systems. The MTP software, which had been designed specially for the Linotron, was now revised for the PDP 11/35 mainframe hooked to the VideoComp 500. Printing jobs were gradually transferred from the Linotron to the VideoComp systems.

New Atex systems were procured in 1978, the same year that GPO established its Graphic Systems Development Division (GSDD), which was tasked with creating and supporting typography software that was now distinctively its own. At the heart of the application network was the Automated Composition System (AComp), the core composition algorithm. (gpo_1978) GSDD developed new features for AComp. In 1979 it introduced functionality for tables, which it called subformats. (gpo_1979) The following year it introduced features to allow tables and graphics to span columns, or to let tables flow across column or page breaks. (gpo_1980)

In the 1980s, GPO enhanced the existing VideoComp systems, but also looked ahead. Experience had shown that a locator-based approach to typography was conceptually sound and remained independent of any particular system. The format had matured to the point where it could handle all its publications, so GSDD prepared to make AComp and its other programs more widely available. In 1984, GSDD management announced that an unnamed typsetting program had entered production on the local network. (rollert) The shift to electronic publishing was complete, and in February 1985 GPO officially closed down its hot metal operations. But software development continued. The aging Atex system was scheduled to be replaced by a VAX 6210, requiring patches and new builds, and there was a pressing need to repurpose the codebase to enable typesetting on the microcomputer. In 1989, GSDD released for IBM-compatible systems an application it called MicroComp, a name that paid homage to the older VideoComp systems (for both, “Comp” was short for “composition”). (gpo_1989) By 1991 more than thirty offices in Congress were using MicroComp. (gpo_1991)

Very quickly, MicroComp became the mainstay for typesetting at GPO, Congress, and other parts of the Federal Government. Informally, the terms locators and MicroComp became interchangeable. Many aspects of MicroComp were advanced for the time. Despite the 256-character limit imposed by the byte, MicroComp gave users access to thousands of characters in dozens of styles. It allowed users control over a wide variety of formatting and typographic features, some highly sophisticated. Paragraph types could be programmed with conditional leaders. Columns could be feathered (leading applied so that lines would vertically fill a column). It was extremely fast, typesetting many pages per second. AComp managed to do this in a mere eighty thousand lines of code, written largely in C++, albeit with C-style, non-object architecture.

Despite these forward-looking accomplishments, MicroComp remained entrenched in 1960s hardware and concepts. Like its predecessors, MicroComp was a federated suite of programs that worked across a variety of computers and peripherals. The photomechanical grids of the Linotron 1010 had been replaced by digital, conceptual ones that remained agnostic of any meaning behind specific characters. MicroComp continued to rely upon the convention of locators (which now no longer located anything) and other terse codes already familiar to its Federal customers. It retained jargon, lingo, and engineering solutions rooted in 1960s paradigms.

MicroComp underwent upgrades and changes in the 1990s, although enhancements were documented poorly, if at all. Most significantly, it began to support SGML input, in tandem with conversion of the Federal Register and the Code of Federal Regulations. Gradual changes to byte handling were introduced. Some new format features were silently introduced. Format and grid libraries continued to evolve. Nevertheless, MicroComp continued to look backward as it progressed. GSDD's solution to integrating SGML with MicroComp was to have AComp first convert the input to an internal locator byte stream, and then typeset that stream, as if the user had simply used locators, not SGML. Browsing the incremental changes in MicroComp up to the turn of the millennium, one gets an eerie feeling that Unicode is distant and irrelevant, and that PostScript type 1 fonts represent the apex of font technology.

In the early 2000s, a moratorium was placed on any but the most urgent changes to MicroComp, while GPO leadership deliberated whether to pursue incremental development of the existing software, or to completely rearchitect it. The latter path was chosen. In 2009 a bid went out for vendors to supply a composition system replacement. As of September 2023, Congress is preparing to retire MicroComp, as it gradually switches its legislative workflow to XPub, GPO's next-generation, all-XML composition system. Much could be said about that transition. But events of the last twenty years are stories for another day.

The MicroComp Locator Format

Specifications and Formats

The most comprehensive, detailed account of the locator system was published in the early 1980s in a book with the arcane title Publishing from a Full Text Data Base (2nd edition, 1983, herein PFTDB). (gsdd_1983) PFTDB is the closest approximation to a formal specification for the locator format, because it documents the intentions of the designers of the locator format. But it is imperfect. Some of the jargon is impenetrable. Some of PTFDB's features were dropped or changed by MicroComp and, of course, PTFDB excludes features that MicroComp introduced. My description below draws from PFTDB and related documents used by GPO Prepress staff for their daily work.[2] I have also relied upon experiments with current compiled MicroComp executables, and upon selective study of the source code.

A typical locator file might look like this (control characters have been converted to Unicode control pictures):

␇S0634
␇I66F
␇I81SENATE RESOLUTION 77_DESIGNATING FEBRUARY 16, 2023, AS ``NATIONAL ELIZABETH PERATROVICH DAY'' 
␇I11Mr. ␇T4SULLIVAN␇T1 (for himself and Ms. ␇T4Murkowski␇T1) submitted the following resolution; 
which was considered and agreed to:
␇I74S. Res. 77
␇I27Whereas Elizabeth Wanamaker Peratrovich, Tlingit, was a member of the Lukaa␇g840␇T5C␇K.aÿAE1di 
clan in the Raven moiety with the Tlingit name of ␇g840␇T5B␇Kaa␇g840␇T5C␇Kgal.aat (referred to in 
this preamble as ``Elizabeth'') who fought for social equality, civil liberties, and respect for 
Alaska Native and Native American communities;

The code should not seem utterly strange. Quite a lot should be recognizable as text. The rest is markup of one sort or another. A number of markup characters begin with ␇, which is technically called the precedence code. More colloquially it is called the bell code, because it is commonly (but not always) handled by codepoint 7 (now U+0007, ALERT). The character that follows the precedence code specifies a particular action that should be taken, and it may be followed by other characters that supply parameters for that action. The most common code is ␇I followed by two digits. These are locators proper, tagging blocks of text by type.

Although the text looks like it might be ASCII or an ASCII-based codepage, appearances are deceptive. The locator format is actually codepage agnostic, or, to be more precise, the encoding is format dependent. Each MicroComp format allows a designer to reassign any of the 256 input bytes as desired to any permitted text or control character. More on that later.

The above ten lines were typset by MicroComp as follows:

Figure 1: Corrected typeset version of Congressional Record, February 16, 2023, p. S464 cols. 2–3, https://www.govinfo.gov/app/details/CREC-2023-02-16

More than two thousand MicroComp formats have been developed over the years, for various publications. Each publication is assigned one or more format numbers, and the typographic specifications are tailored to the publication in question. The format files that MicroComp uses are binary, and they are created by compiling a source ASCII file, created and edited by GPO staff. Line 1 in the example above invokes format number 0634. Here are some samples from the source ASCII file for 0634 (the full file is 439 lines):

/FORMAT NO./ /0634/ .PCF
/GRID TABLE/ /738/ /072/ /739/ /780/
/PAGE/
/00/ /003/ number of text columns
/01/ /177/ column width (including gutter)
/02/ /716/ column length--odd page column 1
/03/ /716/ column length--odd page column 2
/04/ /716/ column length--odd page column 3
/11/ /019/ column sink--odd page column 1
/12/ /019/ column sink--odd page column 2
/13/ /019/ column sink--odd page column 3
/20/ /716/ column length--even page column 1
/21/ /716/ column length--even page column 2
/22/ /716/ column length--even page column 3
/29/ /019/ column sink--even page column 1
/30/ /019/ column sink--even page column 2
/31/ /019/ column sink--even page column 3
/38/ /716/ column length--first page column 1
/39/ /716/ column length--first page column 2
/40/ /716/ column length--first page column 3
/47/ /019/ column sink--first page column 1
/48/ /019/ column sink--first page column 2
/49/ /019/ column sink--first page column 3
/56/ /.../ x-origin of the head in storage area 1
/57/ /.../ y-origin of the head in storage area 1
/76/ /001/ bottom leading for running (continued) heads
/77/ /010/ Footnote sep lding in pts--10-4-77
/78/ /002/ Footnote sep max carding allowed in pts--10-1-77
/79/ /001/ kill last page square-off
/80/ /1/   Rotation of the job and frame number and folio
/81/ /1/   Rotation of the page

. . . . . . .

/LOC 1/                                             FRSLN NEXT
LOC  PTSZ LDING  LNLG  TPLD  BTLD  Y1LN PRIND SCIND PRIOR PRIOR
/10/ /008/ /009/ /168/ /009/ /000/ /008/ /000/ /000/ /3/ /2/
/11/ /008/ /009/ /168/ /009/ /000/ /008/ /008/ /000/ /3/ /2/
/12/ /008/ /009/ /168/ /009/ /000/ /008/ /000/ /016/ /1/ /1/
/13/ /008/ /009/ /168/ /009/ /000/ /008/ /000/ /024/ /1/ /1/

. . . . . . . 

LOC   MINSP MAXSP RHOD  RHDE STORHD SETSZ GPLD PARLD
/10/ /006/ /009/ /.../ /.../ /.../ /.../ /010/ /009/
/11/ /006/ /009/ /.../ /.../ /.../ /.../ /012/ /009/
/12/ /006/ /009/ /.../ /.../ /.../ /.../ /012/ /009/
/13/ /006/ /009/ /.../ /.../ /.../ /.../ /012/ /008/

. . . . . . .

    F  L G T L T R  L  L L M H H S S S X F  S  S L C LOC
    O  D R F N S U  D  D D C D I E P N - T  E  C I N POR
    T  T N N T F L  C  R T L T E H C P S N  P  T N C TION
    P  P O O P D I  H  B F N P A D L G T O  C  F N S
                 D     L       R         T  H    B
LOC              X     K       Y         E       R
/10 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../
/11 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../
/12 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../
/13 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../

. . . . . . .

/CHRTBL UNSHIFT/
   Displacement Table                      Displacement Table
                        Input Grid                        Input Grid
   Char                 Code  Pos.            Char        Code  Pos.
bypass qc               /004/ /000/
bypass qm               /005/ /000/        open bracket   /091/ /091/
small bullet ␁          /001/ /001/                       /.../ /092/
small cap ampersand     /224/ /002/        close bracket  /093/ /093/
plus-minus              /027/ /003/        minus          /094/ /094/
lc mu   sss0            /160/ /005/        open quote     /096/ /096/
twist                   /006/ /006/        a              /097/ /097/
lc phi  sss1            /161/ /007/        b              /098/ /098/
lc delta sss7           /167/ /008/        c              /099/ /099/
                        /.../ /009/        d              /100/ /100/

. . . . . . .

ampersand &             /038/ /038/        SPECIAL CHARACTERS
apostrophe '            /039/ /039/
open paren (            /040/ /040/        The output code for all functions
close paren )           /041/ /041/        in the character table can be
asterisk *              /042/ /042/        anything between 192 and 240.
plus +                  /043/ /043/        There should be no outputs in the
comma ,                 /044/ /044/        character table between 128 to 191.
                        /.../ /045/
.                       /046/ /046/        hyphen         /045/ /192/
                        /.../ /047/        shill          /047/ /193/
0                       /048/ /048/        em dash        /095/ /194/
1                       /049/ /049/        word space     /032/ /195/

. . . . . .

/HEADS/
/ALL/
/C111012000522000000␇G3␇T2CONGRESSIONAL RECORD␇P/
/@/
/CONSTANT/
/CC/ /_Continued␇P/
/XXXXXXXXXXXXXXXXXXX/

Relevant data is wrapped by pairs of slashes. Nearly everything else is ignored, so can be treated as comments. Fields are identified by position within the source document and by localized groups.

Every format file has four sections. The first deals with global type and page settings. The second handles typographic rules for up to ninety-nine locators. This section is lengthy because it allows numerous block/paragraph properties to be defined, so it is broken up into three subsections. The portion above shows parts of the three subsections relevant to locators 10 through 13. The third major section of the format source file is a character displacement table, and the last handles miscellanea such as running heads.

For the purposes of this paper, the third section is critical, because it facilitates byte remapping, and therefore defines the codepage. In the example above, the hyphen, decimal 45, is targeted to be converted to 192. Incoming codepoint 160, treated as a μ, is to be shifted to byte 5. Incoming 95 (the lowbar, _) is to be shifted to byte 196, the em dash (see the locator sample plus figure 1 for an example). Any input codepoint can be redirected to another value, as long as the resulting codepoint is either from 0 to 127 inclusive (the character block), or 177 to 255 inclusive (the function code block). (The comment in the example above, indicating that the permitted range is 192 through 240, reflects an earlier, more restrictive scope. Function codes were added over time, some for internal use, others for users.) Yes, the null byte is allowed, and can be either an input byte or a target one.

As mentioned above, the precedence code is normally described as codepoint 7, because that is what creators and users of locator files typically type and see. But that is not completely accurate. The real precedence code, after byte displacement, is actually 200. That shift, as well as the choice of codepoint 7, is arbitrary, and not universal. In about five percent of GPO's two thousand formats some other character is defined as the precedence code, and codepoint 7 is used for other purposes. The ability to shift bytes howsoever one wishes makes the locator format highly enabling, and highly unpredictable. In a study of the more than two thousand formats, I have found some strange, surprising mappings. It is even possible to write a format specification that completely rescrambles all the letters and digits (fortunately, none of the existing standard formats do so).

When MicroComp's typographic core, AComp, receives an input byte stream, its first order of business is to perform the appropriate character displacement described just above. But before it can do that, it must perform another perfunctory initial character displacement to eradicate noise from word processors. PFTDB presumes the input to be a pure locator byte stream, and not the result of any customer, off-the-shelf software. But with MicroComp (introduced after the publication of PFTDB ), users began to use XYWrite and WordPerfect to create locator files, and GSDD staff had to deal with input that included reserved codepoints. For example, XYWrite used codepoints 174 and 175 to wrap excluded material. Word processors might include tabs, codepoint 9. That proved to be something of a nuisance, because in most MicroComp formats codepoint 9 becomes the en dash, a very common character. For such cases, MicroComp was written so as to strip away known XYWrite and WordPerfect reserved codepoints. If a user needed to directly access those codepoints, they were given a workaround: codepoint 255 followed by a two-digit hexadecimal number. During this initial process, AComp would convert those three-byte codes to the single, specified byte. Thus, even today, one uses ÿ09 to get an en dash in MicroComp, because to encode it as a tab would guarantee that it would be stripped out before processing begins.

Grids

After AComp has performed those two passes of byte displacement, it has a new byte stream consisting of either characters (codepoints 0 through 127) or functions (codepoints 177 and above). For now we ignore the function codepoints and look more closely at the character codepoints.

AComp's next point of business is to connect a character to a font glyph, and gather any associated properties. That connection is made through a grid specification. Every format defines four default grids (grids 1 through 4). In the format example above, the four grids are 738, 072, 739, and 780. All formats have tacit access to a standard set of four more grids that feature commonly needed specialized characters (relative grids 5 through 8, targeting absolute grid numbers 805 through 808).

A grid file, which is just as important as the format file already discussed, is also a binary artefact generated by compiling an ASCII source file. Keeping with our example, here are excerpts from the ASCII source file for format 0634's grid 1 (= absolute grid number 738):

;JUNE 24, 1996
;                                            Column                          
;Ionic/Century Schoolbook                     1 Typeface                     
;                                             2 Displacement                 
;MICROGRD.738                                 3 Output Code                  
;                                             4 Font Number                  
;T1 Ionic Roman C&lc (052)                    5 Synthetic font               
;T2 Century Schoolbook Bold C&lc (038)        6 Postscript widths            
;T3 Ionic Ital C&lc (054)                     7 Hyphenation Code             
;T4 Ionic Roman C&sc (052)                    8 Entity Tag & Description     
;T5 Ionic Roman C&sc (052) (full figs)
;
1 000 000 000 00 00000 000 00000000 ; [T1 Ionic Roman C&lc]
1 001 183 086 00 00556 000 sbull    ; ␁ Small bullet
1 002 038 052 01 00833 000 scamp    ; ␂ Small cap &
1 003 177 009 00 00549 000 plusmn   ; ␃ Plus-minus
1 004 101 009 00 00439 000 epsiv | egr ; ␄ lc greek epsilon
1 005 109 009 00 00576 000 mu | micro | mgr ; ␅ lc greek mu

. . . . . .

1 097 097 052 00 00615 097          ; a Lowercase a
1 098 098 052 00 00615 098          ; b Lowercase b
1 099 099 052 00 00563 099          ; c Lowercase c

. . . . . .

5 121 089 052 01 00833 121          ; y Small cap Y
5 122 090 052 01 00729 122          ; z Small cap Z
5 123 189 086 00 00551 003 lparb    ; { Black open paren
5 124 161 006 00 00333 000 iexcl    ; | Upside down bang
5 125 190 086 00 00551 002 rparb    ; } Black close paren
5 126 191 006 00 00611 000 iquest   ; ÿ7E Upside down question
5 127 162 009 00 00247 000 prime | min | foot | quot ; ␡ Single prime
GGGGGGGGGGGGGGGGGGGGGG
␚

The grid file is equivalent to the grid tape library that controlled the Linotron 1010 grid magazine. But where the Linotron grid magazine allowed four different sets of 256 characters each, the MicroComp grid supports five different sets of 128 characters each. A MicroComp grid is navigated by a pair of integers: a typeface option (1 through 5) and a codepoint (0 through 127).[3] The typeface-codepoint pair calls up a particular row, which has six pieces of information (identified in the example as nos. 3–8): (3) a glyph number, (4) a font number, (5) a special code that indicates what kind of special typesetting modifications should be applied, (6) a number specifying the relative width of that font's glyph, (7) a hyphenation code, and (8) entity names that are aliases for the codepoint (for conversion from SGML). The modifications (number 5) can be quite significant: small caps, superior (superscript), double superior, inferior (subscript), double inferior, cancel (strikethrough), underscore, condensed, extra condensed, extended, or some combination thereof.

GPO has defined more than 180 grid files. Commonly, a single grid file targets one or two faces, and one or more fonts from each face. In lines 7–11 in our example above, the comment indicates that the second typeface option (labeled T2) targets Century Schoolbook Bold. The other four options support the Ionic face. Two work with caps and lowercase: one roman and one italic. Another two typeface options target roman Ionic as caps and small caps, one with text figures (old-style numerals) and the other with full or lining figures.

But just because a given grid typeface option is declared to target a particular font and font modification, it always mixes glyphs from other fonts, to provide access to characters not in the target font, and to fill up the entire set of 128 slots. In the grid extract above, typeface 1 codepoint 1, the small bullet, points not to Ionic (font number 52) but BGsddV01 (font 86), an in-house custom font. Typeface 1 codepoint 4, Greek epsilon, targets font 9, Symbol.

When processing a given byte of text, how does AComp determine which grid and typeface option to use? As noted above, every format has eight predefined grids, and the format defines a default grid and typeface number for each locator. As AComp walks through the byte stream, it changes the default grid and typeface as it moves from one locator to the next. Some special precedence codes can change the current grid and typeface numbers in the middle of a locator. ␇G1 through ␇G8 changes the relative grid number, and ␇gNNN changes to absolute grid number NNN. ␇T1 through ␇T5 changes to a particular typeface in the grid. ␇K cancels the special grid-typeface call and returns AComp to the current locator's default grid and typeface number. The sample locator text above provides examples of all these codes.

So when AComp processes a text byte, it uses the immediate context to determine which grid and typeface should be consulted. It notes the target font number and glyph number. It stores the relative width of the character and any special modifications, and uses that information to determine how to space words, and where to place line breaks. (For better or worse, AComp performs no kerning.) And in one final act of displacement, it shifts the current byte to the target glyph byte, which is what will be written to the output PostScript file.

Fonts and Hyphenation

Unlike the older Linotron and VideoComp systems, which created output directly, the MicroComp system's primary goal is to produce a sequence of one or more PostScript files, one per printed page. Once that is finished, its job is done. It is up to subsequent applications to handle the PostScript files as desired, whether by passing them to a printer that has the fonts preinstalled, or by distilling the files into a PDF, or something else.

The font library, however, poses its own challenges. The current library consists of approximately 150 Type 1 PostScript fonts. Many of these fonts are in the standard PostScript encoding (“standard” is a bit of a misnomer: some assignments in that encoding table are rather eclectic). But many of them are non-standard, e.g., fonts with the Cyrillic alphabet, the Greek alphabet, woodcut ornaments, box drawing, wingdings, and symbols. Some fonts are in-house creations. GPO commonly fulfills requests from customers to design new glyphs, so that unusual characters are available at a keystroke (see Figure 1). GPO's piecemeal typography has included Chinese ideographs, IPA symbols, mathematical equations, and even the signature of sitting Presidents.

From the perspective of AComp's operations, the semantics of the target glyphs are irrelevant. The only exception pertains to hyphenation, which is significant in Federal publications. In legal documents such as legislative bills, line numeration, and therefore line breaks, must be stable and deterministic. A word break dictionary is shared across all MicroComp systems, and no user word break dictionary is permitted. Any update to the word break dictionary requires MicroComp to be released in a new major version. A user can force hyphenation in a word, but only by changing the input locator file, a safeguard against tampering.

Challenges in Converting Locators

There is a significant need to provide a conversion service that universally converts locators to some modern format, be it XML, JSON, HTML, or something else. The model offered by Andries and Wood is both original and significant, but represents only the start, and is focused upon the U.S. Code. Around eighty government publications have been published with MicroComp, and they need similar attention. If locator files have been accumulating over the last fifty years, across many branches of government, it is in the interest of archivists and historians to be able to convert them. Equally important to the past is the future. A conversion service should be able to handle cases when (not if) changes are made to GPO's format and grid specifications for publications that still use locators.

My summary of the history of MicroComp, and my description of how it processes locator files, show that there are several challenges inherent in the effort to convert locator files universally. In this discussion I presume an interest in XML (not JSON or another format) as the target format.

The locator format is not a context-free language. Quite the contrary, it is heavily contextual. A given locator file cannot be interpreted without access to all the formats that are invoked, and to all the grids, even those that are not referenced by the formats, because the creator of a locator file can on a whim draw from any of the 180 grids. When thinking about converting a locator file, one must deliberate over how much of that context needs to be placed within the target file.

The text itself poses problems. XML requires Unicode. Translating locator characters to Unicode can pose a significant challenge. Many font glyphs require interpretation. Sometimes, inspection of a particular glyph leaves mysteries about the intent of the font designer. For many glyphs, no Unicode value exists, or no Unicode value ever will exist. Some glyphs are composites, and require resolution to multiple Unicode characters. And even after determining a Unicode value, it is reasonable to suspect that a character may have been introduced capriciously, used for multiple functions, or simply misused. I have seen evidence of this in actual congressional legislation, where the German eszett, ß, was used as a Greek beta, or an obscure line drawing character was used instead of a hyphen or en dash.

Internally, locator files do not declare a version number, and they assume that users know and have access to the format and grid files. But format files and grid files have changed little by little over the years. Some of the fonts have as well. The great majority of these changes have been made silently, without any versioning system, so that it is impossible to retrieve a previous version. For any given locator file, one would be hard-pressed to reproduce the original typeset artefact, because one would have to configure MicroComp so that it matched exactly the version in use at the time.

Some of the locator conventions are not well documented. They do not feature in PFTDB, and they are not explained in user manuals. This phenomenon is seen in some of the ASCII source files for the formats, where new key-pairs suddenly appear in some of the formats, and are not explained. Only by experimentation with MicroComp can one discover exactly what certain features were designed to do. That can be true for some documented phenomena, as well. A feature may be explained in PFTDB quite sparingly, or in impenetrable jargon, and only tests with MicroComp can bring to light what is meant. Sometimes those experiments exhibit unexpected behavior that may or may not have been part of the intended design. Such Easter eggs also highlight MicroComp's error handling. A number of types of syntax errors are forgiven, and some permitted syntactic constructions can throw an error. The error messages returned by MicroComp are commonly inaccurate, misplaced, or uninformative.

The peculiar dependency of the format upon its processor highlights another challenge for conversion, that of replicating MicroComp's byte displacement rules. Locators were designed under an imperative programming paradigm. All AComp operations are applied byte by byte. It is common for AComp, at a given byte, to look behind or look ahead before applying a rule. Determining what those rules are requires experimentation with MicroComp and review of the source code, which has very little documentation. Variable names are cryptic. Conditional jumps out of iterative routines to completely different modules are frequent and unexplained. To describe AComp as spaghetti code would be generous. The code has been difficult to maintain across the years, evidenced by developers' comments, e.g., “these instructions make no sense to me,” “recursive nonsense,” “don't know why did this . . . that goddamned unified agenda is the reason.”

Some parts of the specifications appear to be ignored by MicroComp, or commented out. Other parts of the specifications have unclear functionality. In many such situations, I have been unable to find archived locator files with the specific coding, to test and verify the intended function against actual past publications.

Serializing Locators as XML (Slax)

A conversion process for locators could target any one of a number of formats. USLM XML, the target of Andries and Wood's effort, is only one of many possible destinations. Users of GPO's official document repository GovInfo.gov exhibit a strong preference to access and read documents in HTML. Archivists or other practitioners may wish to develop pipelines into Akoma Ntoso, a generalized XML format for legal documents. Parliamentary proceedings in other countries have been of interest to historians and linguists, and it is likely that many U.S. documents would be ideal for a pipeline into TEI XML. And we should not forget the possibility that some government publications would be ideal candidates for JATS XML.

Perhaps most important, GPO's next-generation composition service, XPub, uses XML Publishing Professional (XPP) for its core composition engine. The input model GPO has designed for XPP (which can handle a variety of input formats) consists of relatively flat XML files punctuated by processing instructions. In those XML files, typesetting concerns are at the forefront. Ostensibly, in the future, XPub may wish to incorporate, or migrate to, another typographic engine (say InDesign) that supports typographic-centered XML for input or output.

Every target format is lossy to one degree or another. Certainly, it is possible to take, say, USLM, and pack it with new attributes or elements to try to stanch the loss. But that would result in a kludge, and wind up making USLM something it wasn't designed to be.

Rather than try to force locators into an existing format, I asked myself what GSDD designers in the 1970s would have done if had they been given the opportunity to serialize the format as XML. The term “serialize” here is important, because it signals a change in perspective. My goal was not so much to convert locators to XML, but to express them as XML. Securing an XML expression of locators puts us in an excellent place. Locators serialized as XML could be easily reverted to the original format, if one liked. And it could become a pivot format, facilitating straight-forward conversion to any other format.

The concept developed into a name, Slax, meaning Serializing Locators as XML. Slax is the name for both the XML format and the application that does the conversion.

But what to do about the many challenges (see previous section)?

Some of those challenges, such as misuse or creative use of glyphs, are intractable, and not peculiar to locators themselves. They would be put to the side.

The sprawling format-grid-font context, it seemed to me, could be handled optimally by an XSLT-based workflow. The ASCII source files for formats and grids are structured data. One might be tempted, given our current technology, to apply Invisible XML to them. Personally, I wouldn't want to try. XSLT proved to be an excellent choice for parsing the 2,000+ formats and 180+ grids. Certainly other languages could have been used. But I found XSLT's <xsl:analyze-string> to be a powerful tool, and its praises need to be sung more loudly. The construct allows the user to losslessly segment a string into principal parts, and to help developers clearly visualize how the input string wil be segmented, and to quickly identify the relationship between both matches and non-matches and their role within the result tree. I have found that the task of losslessly consuming unparsed text to build a complex tree is much more manageable with <xsl:analyze-string> than it is in other programming language counterparts. The XSLT code is generally easier to read and maintain.

The grids were relatively easy to convert through this process. An XSLT application of about 200 lines was able to convert all the grids to something like this:

<?xml version="1.0" encoding="UTF-8"?>
<!--This data was generated by the stylesheet at 
file:/D:/XPUB_UTILITIES/Slax/SlaxBuilders/grid%20source%20to%20xml.xsl 
on 2022-09-20T21:59:09.01525-04:00.-->
<grid n="738">
   <type>microcomp</type>
   <!--JUNE 24, 1996-->
   <!--                                            Column                          -->
   <!--Ionic/Century Schoolbook                     1 Typeface                     -->
   <!--                                             2 Displacement                 -->
   <!--MICROGRD.738                                 3 Output Code                  -->
   <!--                                             4 Font Number                  -->
   <!--T1 Ionic Roman C&lc (052)                    5 Synthetic font               -->
   <!--T2 Century Schoolbook Bold C&lc (038)        6 Postscript widths            -->
   <!--T3 Ionic Ital C&lc (054)                     7 Hyphenation Code             -->
   <!--T4 Ionic Roman C&sc (052)                    8 Entity Tag & Description     -->
   <!--T5 Ionic Roman C&sc (052) (full figs)-->
   <!---->
   <typeface n="1">
      <glyph cp="0">
         <output-code>0</output-code>
         <font-number>0</font-number>
         <synthetic-font>0</synthetic-font>
         <postscript-width>00000</postscript-width>
         <hyphenation-code>0</hyphenation-code>
         <entity>
            <code>00000000</code>
         </entity>
         <!-- [T1 Ionic Roman C&lc]-->
      </glyph>
      <glyph cp="1">
         <output-code>183</output-code>
         <font-number>86</font-number>
         <synthetic-font>0</synthetic-font>
         <postscript-width>00556</postscript-width>
         <hyphenation-code>0</hyphenation-code>
         <entity>
            <code>sbull</code>
         </entity>
         <!-- ␁ Small bullet-->
      </glyph>
. . . . . .

The formats were more challenging, but a multiple-pass approach within a single XSLT file (about 1,200 lines of code) was sufficient. Here a sample from format 0634:

<?xml version="1.0" encoding="UTF-8"?>
<!--This data was generated by Format source to XML version 2.00, at file:/D:/XPUB_UTILITIES/Slax/SlaxBuilders/format%20source%20to%20xml.xsl on 2022-11-16T23:20:49.32532-05:00.-->
<format n="0634" converted="2022-11-16T23:20:49.32532-05:00">
   <type>microcomp</type>
   <grids>
      <grid>738</grid>
      <grid>072</grid>
      <grid>739</grid>
      <grid>780</grid>
   </grids>
   <group key="PAGE">
      <column-count>3</column-count>
      <column-width>177</column-width>
      <column-length folio="odd" n="1">716</column-length>
      <column-length folio="odd" n="2">716</column-length>
      <column-length folio="odd" n="3">716</column-length>
. . . . . . 
   <group key="LOC">
      <locator n="10">
         <point-size>8</point-size>
         <leading>9</leading>
         <line-length>168</line-length>
         <leading type="top">9</leading>
         <leading type="bottom">0</leading>
         <column-sink>8</column-sink>
         <indentation type="first">0</indentation>
         <indentation type="next">0</indentation>
         <carding-priority type="first">3</carding-priority>
         <carding-priority type="next">2</carding-priority>
         <word-space-minimum>6</word-space-minimum>
         <word-space-maximum>9</word-space-maximum>
         <leading type="group">10</leading>
         <leading type="paragraph">9</leading>
         <grid>1</grid>
         <typeface>1</typeface>
         <line-type>J</line-type>
         <split-column>W</split-column>
      </locator>
. . . . . .
   <group key="CHRTBL UNSHIFT">
      <character-map>
         <char>
            <label>bypass qc</label>
            <input cp-dec="4" cp-hex="4"/>
            <output cp-dec="0" cp-hex="0"/>
         </char>
         <char>
            <label>bypass qm</label>
            <input cp-dec="5" cp-hex="5"/>
            <output cp-dec="0" cp-hex="0"/>
         </char>
         <char>
            <label>precedence</label>
            <input cp-dec="7" cp-hex="7"/>
            <output cp-dec="200" cp-hex="C8"/>
         </char>
. . . . . . .
   <group key="HEADS"/>
   <group key="ALL">
      <head type="C111012000522000000␇G3␇T2CONGRESSIONAL">
         <value n="1">RECORD␇P</value>
      </head>
   </group>
   <group key="@"/>
   <group key="CONSTANT">
      <constant>_Continued␇P</constant>
   </group>
   <group key="XXXXXXXXXXXXXXXXXXX"/>
</format>

In places, the XML files are much more legible than their ASCII source counterparts. Element and attribute names provide much better documentation than the abbreviated, cryptic headers. And in the course of building a library of format and grid XML files, it was easy to ferret out numerous typographical errors, some of which affected MicroComp output.

For both the formats and grids, relatively succinct XSLT applications were able to generate, very quickly, lossless XML serializations. XSLT's <xsl:analyze-string> allowed me to capture and retain non-data fields as comments, so that it was possible to reverse the process and rebuild the source ASCII files nearly byte for byte (except where an occasional stray null byte was in the source file). With these XSLT applications in hand, whenever formats and grids were updated, the Slax resources could be as well.

The fonts posed a distinct challenge. There, I had to get my hands dirty. I studied the entire library, and focused on the idiosyncratic fonts. I did my best to provide a sensible mapping to the Unicode standard. Where such mapping was impossible, undesirable, or arbitrary, I left notes. Because most fonts had the standard PostScript encoding, I focused only on mapping the exceptions, which I did in a simple XML file, of course.

It seemed to me that at the very minimum, a user of a Slax XML file should not have to perform any more byte displacements. But to get there—to arrive at the proper Unicode values—one must take a complicated journey, through a given format, through a given grid, through a given typeface, to a particular font glyph, and finally to its Unicode meaning. In building the pipeline to retrieve that information, I felt that about one third of the grid material had been mined and placed directly in the Slax XML file. If the grid files could be altogether dispensed with, users of the Slax format would benefit. Hence the output indicates, for every locator, or for any special characters, the remaining relevant information from the grid file: the target font name and family, and any special modifications that should be applied.[4]

The format files, however, were a different story. They have such an abundance of information (points, leading, line length, justification, indentation, margins, relative padding, complementary formats, etc.) that to put any of that within a typical Slax file would result in a bloated, illegible, and repetitive format. It would be better, I thought, to give Slax file users access to the library of format files, serialized as XML, so that they can query typesetting specifications of interest as needed.

I now had a set of XSLT tools that could place the vast mechanism of formats, grids, and fonts at service for the XML stack, and could be responsive to any changes to that context. That left me with one final challenge: how to orchestrate the conversion process, i.e., how to start with an actual locator file, trawl those XML resources, and return an XML file with the intended Unicode text.

I quickly ruled out XML technologies for this task. The conversion process would need to handle control bytes (prohibited in XML 1.0) and null bytes (prohibited in both XML 1.0 and 1.1). A proper handling of the character displacement process would require bytewise operations with look-ahead and look-behind features. My previous experiments with byte- and bit-wise functions, such as implementing the MD5 hash algorithm in XSLT, suggested that an XSLT or XQuery locator processor would not be performative.[5]

Andries and Wood opted for a Java-based declarative approach. I felt that I would have most success along a different route. The locator format was designed with a particular outlook, and was deeply enmeshed with AComp. I thought that the possibility of error and omissions would be minimized if I developed a system that mimicked AComp's native process.

My solution was to write the Slax application in C#. It was designed as a class LocatorAnalyzer, whose primary method Load() processed an arbitrary locator byte stream along a streamlined version of what I detected AComp was doing, or was meant to do. Load() walks the byte stream along the same stages AComp does. At many of those stages, AComp consults formats and grids as compiled binaries. Slax needed to do the equivalent, quickly looking up character displacement maps, and related operations.

Because the formats, grids, and fonts were now available as XML, I developed another XSLT application that built the tuples, records, and other low-level data structures required for the C# code. In an early experiment with this model, the XSLT application took the C# source code as unparsed text, located specially marked sections designed for fields, and populated them with the raw data serialized in C# syntax. This was an effective but impractical solution. The very long C# code that resulted rendered Visual Studio unresponsive. The technique also meant that any updates to the formats, grids, or fonts would require recompiling the application, and submitting it repeatedly for security audits.

The more tractable approach was to write the data to stand-off text files, optimized for rapid ingestion as tuples, hashsets, dictionaries, and other data types. This approach was clean, and meant that any future updates to the grids and formats would not require recompiling the core executable. The automated XSLT-based process of grouping and distilling the thousands of format, grid, and font files into small datasets proved to be somewhat time-consuming, taking about ten minutes. But such updates are expected to be infrequent, so the excess time is at present negligible, and represents work that the C# application does not have to perform.

Overall, both XSLT and C# performed their share of challenging work. XSLT parses and manages the resources, and prepares the C# source code with the data it needs. The C# code approaches byte handling in an object-oriented environment with very fast low-level operations. The result is that Slax is highly performative. A typical locator file is converted in a tenth of a second. A one megabyte file takes less than a second. Slax is one of GPO's fastest applications.

The Slax Format

The best way to illustrate the design of Slax XML is by example, using the locator file presented at the beginning of this article:

<slax ver="0.09" xmlns="http://gpo.gov/slax">
   <format n="634">
      <locator n="66" font-no="86" font-mod-no="0" font-name="BGsddV01" font-mod-name="none">
         <c font-no="0" font-name="Gpospec5">
            <svg-path for-cp="102"
               d="M7870 228h2l-4 -3l-6 -6h2l-7 -3c-4 -4 -8 -6 -8 -6c-1 1 2 2 8 6c2 1 3 2 5 3l-1 1l-36 
               -27l-2700 -98l-1159 -49l-3937 154l-1 30c-28 7 2 30 21 31l3946 195l1208 -68l2665 -134c10 
               -9 8 -19 2 -26z"
            />
         </c>
      </locator>
      <locator n="81" font-no="52" font-mod-no="0" font-name="MIonic" font-mod-name="none">SENATE
         RESOLUTION 77—DESIGNATING FEBRUARY 16, 2023, AS “NATIONAL ELIZABETH PERATROVICH DAY” </locator>
      <locator n="11" font-no="52" font-mod-no="0" font-name="MIonic" font-mod-name="none">Mr. <span
            typeface="4" font-mod-no="1" font-mod-name="small-caps"><c font-mod-no="0"
               font-mod-name="none">SULLIVAN</c></span><span typeface="1"> (for himself and Ms.
            </span><span typeface="4" font-mod-no="1" font-mod-name="small-caps"><c font-mod-no="0"
               font-mod-name="none">M</c>URKOWSKI</span><span typeface="1">) submitted the following
            resolution; which was considered and agreed to: </span></locator>
      <locator n="74" font-no="52" font-mod-no="1" font-name="MIonic" font-mod-name="small-caps"><c
            font-mod-no="0" font-mod-name="none">S</c><c font-mod-no="0" font-mod-name="none">.</c>
         <c font-mod-no="0" font-mod-name="none">R</c>ES<c font-mod-no="0" font-mod-name="none"
            >.</c>
         <c font-mod-no="0" font-mod-name="none">7</c><c font-mod-no="0" font-mod-name="none">7</c>
      </locator>
      <locator n="27" font-no="52" font-mod-no="0" font-name="MIonic" font-mod-name="none">Whereas
         Elizabeth Wanamaker Peratrovich, Tlingit, was a member of the Lukaa<span font-no="0"
            font-name="Gpospec5">x̱</span>.ádi clan in the Raven moiety with the Tlingit name of
            <span font-no="0" font-name="Gpospec5">Ḵ</span>aa<span font-no="0" font-name="Gpospec5"
            >x̱</span>gal.aat (referred to in this preamble as “Elizabeth”) who fought for social
         equality, civil liberties, and respect for Alaska Native and Native American communities;
      </locator>

The root element is <slax>, and it takes a sequence of <format>s. Each format takes a sequence of block elements such as <locator> and <table> (on which see below). Each <locator> takes mixed content of text and inline <span>s. Individual characters might be marked by <c>s, which are always wrapped by <span>s.

Line 1 of the input code, ␇S0634, creates the first <format>. And line 2, ␇I66F, creates the first <locator> for locator 66. Information about the default font and font modification is captured in attributes twice, once with the relevant code and a second time with a human-readable version. A future version of Slax may handle such repetition differently.

Note, the F of locator file line 2 is not the letter F, but rather a character from one of GPO's specialized fonts. This represents the elongated decorative lozenge seen in figure 1. After all byte displacement has taken place, F does not target a Unicode character but a graphic, so it is replaced with an <svg-path>, whose attribute d holds a SVG representation of the font Gpospec5's glyph 102.

The standard way in locators to express double quotation marks is through two separate glyphs, as either two backticks, ``, or two straight apostrophes, ''. Slax converts these to typographic quotation marks, because that is what the user intends, even though MicroComp would have typeset these as pairs of single quotation marks.

In line 4, we have the following code: ␇I11Mr. ␇T4SULLIVAN␇T1 (for himself and Ms. ␇T4Murkowski␇T1). That line invokes locator 11, whose default is typeface 1, which has no special font modifications. But when typeface 4 is specially invoked (twice) it calls upon small caps as a modifier for the lowercase letters. Each word governed by ␇T4 includes uppercase letters, which are not subject to the small caps modification. Hence, the Slax representation results in some structures where font modification is toggled on and off. Note that Slax output has URKOWSKI in all caps. That reflects the actual byte displacement process defined by MicroComp's grid, which calls for the codepoints to be changed from lowercase to uppercase, and also instructs MicroComp to reduce their size. The glyph width information in the grid specifications is used to specify the width of the letters rendered as small caps, so that they are smaller than the letters not so rendered.

The locator sample I have used throughout this paper has a relatively simple structure, and the bell codes are used only to navigate formats, grids, and typefaces. The locator format has other bell codes, for other components, and to describe them would make this article overly long. However, at least something should be said about tables, because of their complexity.[6] Here is a sample table in the locator format and its typeset form:

␇c5,L2,i1,s50,12,12,12,12  
␇I95Comparison of Base Charge and Rates  
␇h1  
␇h1FY 2020 
␇h1FY 2021 
␇h1Amount change 
␇h1Percent change
␇j
␇I01Base Charge ($)
␇D$66,419,402
␇D$65,443,462
␇D-$975,940
␇D-1.5
␇I01Composite Rate (mills/kWh)
␇D18.08
␇D18.10
␇D0.02
␇D0.1
␇I01Energy Rate (mills/kWh)
␇D9.04
␇D9.05
␇D0.01
␇D0.1
␇I01Capacity Rate ($/kW-Mo)
␇D1.75
␇D1.69
␇D-0.06
␇D-3.4
␇e

Figure 2: Typeset version of Federal Register, 85 no. 169, August 31, 2020, p. 53810, https://www.govinfo.gov/app/details/FR-2020-08-31

A table in locators is commonly called a subformat, because the traditional format specifications continue to be used, but different settings are invoked. A table's start and end are signaled by ␇c and ␇e, respectively, with the division between head and body signaled by ␇j. The first line has a complex line of code declaring the parameters for the table and the specifications for each column. There are other features, not illustrated above, that make tables one of the most daunting parts of the locator format. Rather than point out every feature of the table, I provide here the Slax representation:

<table cols="5" rules="cross down" standard-stub-indent="1" font-no="5" font-mod-no="0"
   font-name="Helvetica" font-mod-name="none">
   <tgroup>
      <colspec colnum="1" colname="c1" min-width="50" width-unit="point" type="stub"/>
      <colspec colnum="2" colname="c2" min-width="12" width-unit="figure" type="figure"/>
      <colspec colnum="3" colname="c3" min-width="12" width-unit="figure" type="figure"/>
      <colspec colnum="4" colname="c4" min-width="12" width-unit="figure" type="figure"/>
      <colspec colnum="5" colname="c-last" min-width="12" width-unit="figure" type="figure"/>
      <title loc-no="95" continues="true">Comparison of Base Charge and Rates </title>
      <thead>
         <head level="1" column="1"> </head>
         <head level="1" column="2">FY 2020 </head>
         <head level="1" column="3">FY 2021 </head>
         <head level="1" column="4">Amount change </head>
         <head level="1" column="5">Percent change</head>
      </thead>
      <tbody>
         <row>
            <entry loc-no="1" leadered="true" base-indentation-in-ems="0">Base Charge
               ($)</entry>
            <entry>$66,419,402</entry>
            <entry>$65,443,462</entry>
            <entry>-$975,940</entry>
            <entry>-1.5</entry>
         </row>
         <row>
            <entry loc-no="1" leadered="true" base-indentation-in-ems="0">Composite Rate
               (mills/kWh)</entry>
            <entry>18.08</entry>
            <entry>18.10</entry>
            <entry>0.02</entry>
            <entry>0.1</entry>
         </row>
         <row>
            <entry loc-no="1" leadered="true" base-indentation-in-ems="0">Energy Rate
               (mills/kWh)</entry>
            <entry>9.04</entry>
            <entry>9.05</entry>
            <entry>0.01</entry>
            <entry>0.1</entry>
         </row>
         <row>
            <entry loc-no="1" leadered="true" base-indentation-in-ems="0">Capacity Rate
               ($/kW-Mo)</entry>
            <entry>1.75</entry>
            <entry>1.69</entry>
            <entry>-0.06</entry>
            <entry>-3.4</entry>
         </row>
      </tbody>
   </tgroup>
</table>

Although the Slax version is more verbose than the locator version, it is also more informative, and a significant step in the direction of making the tables HTML- or CALS-compliant. The locator version indirectly or implicitly points to settings that are expressed explictly in the Slax version. For example, the first line in the locator file is used to populate various attributes for <table> and <colspec>. Within the table body, ␇I01 in the locator file invokes subformat specifications about whether a cell is leadered or not, or what level of indentation it should receive, expressed in the <entry> attributes.

For both tables and running text, when Slax serializes locators, it may encounter syntax errors. If they are not fatal, the location of the error, and its specific message, are recorded. When the Slax XML file is written, each erroneous location is anchored by a comment whose content is an integer one greater than the previous anchor's value. At the end of the file all errors are listed. A Slax Schematron schema associates each error with the particular place in the file, so that users in Oxygen or other XML editors can quickly find errors. Schematron, of course, is not the main Slax schema, which is written in RELAX-NG, so if the locator serialization process results in syntactic absurdities (some of which MicroComp might overlook), users have two registers of validation.

The Slax XML samples above are representative. Slax has been applied to tens of thousands of locator files, and more often than not what emerges is a relatively flat tree. The occasional <format> adds some depth to what is otherwise normally an undifferentiated long sequence of <locator>s, whose attributes give no sign of semantic function. Within the <locator> blocks are inline elements that deal with character- or phrase-based exceptions to the block rules.

In looking at Slax XML I am reminded of Microsoft Word docx files, OpenOffice odt files, and early word processor formats, which share a similar typology. I think of them not as trees but as hedgerows. The hierarchy is rarely deep, even though the printed result might give the illusion of depth. Behind the scenes, the files consist of very lengthy sequences of text blocks, kept all on the same level, like hedgerows along a road.

The exception to that general trend pertains to tables and lists. In the locator format, tables never nest. So when a table is encountered, its depth always has a maximum limit. Here the locator format is shallower than its word processor counterparts, which normally allow tables and other structures to nest recursively. In the case of lists, the locator format makes no attempt to preserve the hierarchical structure, but simply assigns each list item, regardless of its semantic depth, to one of a series of <locator>s. Whether the user creates deep, beautiful lists, or illogical or inconsistent ones, it is all the same flat structure. The locator format stays out of the business of monitoring the proper use of hierarchical structures.

Conclusion and Further Work

Slax is still under development. Securing an adequate test suite is difficult. Many of the dark corners of AComp remain to be explored. Although much work has been done testing the serialization, only some effort has been put into the next stage, namely, using Slax XML as a pivot format for conversions to USLM, HTML, and XML for typesetting with XPP. More work needs to be done to allow users to supply their own custom format and grid specifications. That is critical for any attempts to reconstruct previous versions of formats and grids, in order to accurately convert legacy locator files.

Desiderata aside, Slax shows enormous promise. It demonstrates the power of XSLT for lossless parsing of complex documents. It models an optimal balance between declarative and imperative programming. And it results in a tool and an XML format I wish I could gift to my predecessors.

References

[andries_and_wood] Andries, Patrick, and Lauren Wood. “Converting Typesetting Codes to Structured XML.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27–31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). doi:https://doi.org/10.4242/BalisageVol25.Wood01.

[boyle] Boyle, John J. “Electronic Composing System Applications.” In Electronic Composition in Printing: Proceedings of a Symposium, edited by Richard W. Lee and Roy W. Worral, pp. 90–93. National Bureau of Standards Special Publication 295. Washington, DC: National Bureau of Standards, 1968. https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication295.pdf.

[cavanaugh] Cavanaugh, John F. “Text Handling at the United States Government Printing Office.” Technical Communication 25, no. 3 (1978): 12–15.

[gpo_1967] Government Printing Office. “Annual Report of the Public Printer.” GPO, 1967. https://www.govinfo.gov/content/pkg/GOVPUB-GP-f1e8461d646b2f25f3c9487e7d1a8619/pdf/GOVPUB-GP-f1e8461d646b2f25f3c9487e7d1a8619.pdf.

[gpo_1978] Government Printing Office. “Annual Report of the Public Printer.” GPO, 1978. https://www.govinfo.gov/content/pkg/GOVPUB-GP-3677bb1bb9a1aadf7e794f482d785082/pdf/GOVPUB-GP-3677bb1bb9a1aadf7e794f482d785082.pdf.

[gpo_1979] Government Printing Office. “Annual Report of the Public Printer.” GPO, 1979. https://www.govinfo.gov/content/pkg/GOVPUB-GP-32dc4c28ea4b47eb69076a03bc3330d4/pdf/GOVPUB-GP-32dc4c28ea4b47eb69076a03bc3330d4.pdf.

[gpo_1980] Government Printing Office. “Annual Report of the Public Printer.” GPO, 1980. https://www.govinfo.gov/content/pkg/GOVPUB-GP-ddfb1d6276c5911c3e64db0a406ebdcc/pdf/GOVPUB-GP-ddfb1d6276c5911c3e64db0a406ebdcc.pdf.

[gpo_1989] Government Printing Office. “Annual Report of the Public Printer.” GPO, 1989. https://www.govinfo.gov/content/pkg/GOVPUB-GP-9b5c6bc8466be4c144860e27da375e58/pdf/GOVPUB-GP-9b5c6bc8466be4c144860e27da375e58.pdf.

[gpo_1991] Government Printing Office. “Annual Report of the Public Printer.” GPO, 1991. https://www.govinfo.gov/content/pkg/GOVPUB-GP-2c3e69ce51feee11c9b9b677cb2ab61c/pdf/GOVPUB-GP-2c3e69ce51feee11c9b9b677cb2ab61c.pdf.

[gsdd_1983] Graphic Systems Development Division. Publishing from a Full Text Data Base. 2nd ed. Government Printing Office, 1983.

[hincherick] Hincherick, William. “Automated Composition: The Development and Utilization of a Unique System for the U.S. Government Printing Office.” Government Publications Review 12, no. 3 (May 1, 1985): 215–25. doi:https://doi.org/10.1016/0277-9390(85)90024-X.

[jcp_1970] Joint Committee on Printing. A Review of the Costs of Electronic Composition. Washington, DC: Government Printing Office, 1970. https://books.googleusercontent.com/books/content?req=AKW5QaeWNgnmTS8iokrWgPMpUCmH7iL_b6O-2hWqseh2KFS7vx4VG9LKehbSXls1GDhp53niauWamLgtWxHpI3nAXm_aOQAQtF9NaeuaCAKa0LiK0kCSiOhgWEEMHAR7-jQajQmiHaYs1FlZHklblXrVAMdou8z7vXlzgt9d-kX05Ns9ZQteUpgX6FOQL2hMgLG0Rd2AXFTom1wWKdSHno4GY_mlZln097bCA3Tv67-CjaF3atlrrxVbx6-tkB4zBPLua2KVKPOn3qt75Sl3WHSmnLOJRpvx_w.

[mlc_1966] Mergenthaler Linotype Corporation. Linotron 1010, 2013. https://vimeo.com/75532295.

[rollert] Rollert, Donald. “Memo,” March 24, 1984. Production Engineering Department, Government Printing Office.

[uspo_1963] United States Patent Office. Roster of Attorneys Registered to Practice Before the U.S. Patent Office. Washington, DC: Government Printing Office, 1963. https://books.googleusercontent.com/books/content?req=AKW5Qad1E04U2swX7_CxXQayhBIFoxd-VodKezL9dQjpDkTbqlBUEfvliUEK5P5gdu15OoemBK57eNCGrGeZJ4G8b974MgGUmiEA6daCblZZYRDeEHr7dW4yjr21dZecUcit00FzOs_TczdKV565Dv7M9AqgfGSIMwq4mPTTXr2LWXeU4lJtdqaDNi3wP3INIe1_zM9KJTCkr7BQQRC_En0XXXz4i_4P0c8lqytWu2n8rGrtd7TVOS2os_Z8ZTgv9uqsw4buoEMsw5AMy1S_uaxFyj4vLgvYWw.

[stevens_and_little] Stevens, Mary Elizabeth, and John L. Little. Automatic Typographic-Quality Typesetting Techniques: A State-of-the-Art Review. National Bureau of Standards Monograph 99. Washington, DC: National Bureau of Standards, 1967. https://www.govinfo.gov/content/pkg/GOVPUB-C13-49f6b3c3ddbcc96d84cc5b060de5ae76/pdf/GOVPUB-C13-49f6b3c3ddbcc96d84cc5b060de5ae76.pdf.

Worsnop, R. L. (1968). “Computers in Publishing.” Editorial Research Reports 2 (1968). http://library.cqpress.com/cqresearcher/cqresrre1968071000.



[1] I sincerely thank Peter W. Binns and Mark Harcourt for reading a draft of this communication, and providing excellent suggestions.

[2] Various documents, including PFTDB, can be accessed at the Request for Proposal site archived from 2009, https://web.archive.org/web/20091008230407/gpo.gov/vendors/composition.htm.

[3] In standard typography, typeface refers to a unified ensemble of type design, e.g., Ionic, which is represented by one or more fonts, e.g., Ionic roman, Ionic italic. Locator documentation, however, uses typeface in an idiosyncratic way, to refer to one of the five parts of a grid file, i.e., a set of 128 characters. In this article when I use typeface, it is in the locators sense; when I mean the standard typographic term, I use simply face. Thanks to Peter Binns for prompting me to be clear on terminology.

[4] A grid also lists the relative width of each glyph, and the hyphenation code, but most users of Slax would have little if any use for that information. If someone really wanted to know about the PostScript font glyphs, they could access the grid specifications in XML, or the actual font tables, as separate resources.

[5] My MD5 hash algorithm is part of the Text Alignment Network XSLT function library, and is documented here: https://textalign.net/release/TAN-2021/guidelines/xhtml/ch13s02.xhtml#function-md5.

[6] I thank Mark Harcourt for prompting me to include this discussion. I wrote the Slax conversion for locator tables independently of similar work performed by Priscilla Walmsley and Martin Smith. I have benefitted from the output of their conversion routines for testing Slax output. That we reached similar conclusions on how the tables should be interpreted is gratifying.

×

Andries, Patrick, and Lauren Wood. “Converting Typesetting Codes to Structured XML.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27–31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). doi:https://doi.org/10.4242/BalisageVol25.Wood01.

×

Boyle, John J. “Electronic Composing System Applications.” In Electronic Composition in Printing: Proceedings of a Symposium, edited by Richard W. Lee and Roy W. Worral, pp. 90–93. National Bureau of Standards Special Publication 295. Washington, DC: National Bureau of Standards, 1968. https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication295.pdf.

×

Cavanaugh, John F. “Text Handling at the United States Government Printing Office.” Technical Communication 25, no. 3 (1978): 12–15.

×

Graphic Systems Development Division. Publishing from a Full Text Data Base. 2nd ed. Government Printing Office, 1983.

×

Hincherick, William. “Automated Composition: The Development and Utilization of a Unique System for the U.S. Government Printing Office.” Government Publications Review 12, no. 3 (May 1, 1985): 215–25. doi:https://doi.org/10.1016/0277-9390(85)90024-X.

×

Mergenthaler Linotype Corporation. Linotron 1010, 2013. https://vimeo.com/75532295.

×

Rollert, Donald. “Memo,” March 24, 1984. Production Engineering Department, Government Printing Office.

×

Stevens, Mary Elizabeth, and John L. Little. Automatic Typographic-Quality Typesetting Techniques: A State-of-the-Art Review. National Bureau of Standards Monograph 99. Washington, DC: National Bureau of Standards, 1967. https://www.govinfo.gov/content/pkg/GOVPUB-C13-49f6b3c3ddbcc96d84cc5b060de5ae76/pdf/GOVPUB-C13-49f6b3c3ddbcc96d84cc5b060de5ae76.pdf.

×

Worsnop, R. L. (1968). “Computers in Publishing.” Editorial Research Reports 2 (1968). http://library.cqpress.com/cqresearcher/cqresrre1968071000.