SGML in the Age of XML

Betty Harvey

Abstract

Today (2016!), there are organizations, especially in the military, who have SGML documents and/or requirements to meet SGML-based specifications. Given the unfashionability of SGML and the shrinking availability of SGML tools and SGML expertise, these organizations face significant challenges. How can they best approach the task of working with existing SGML document collections? What about a requirement to create SGML that will integrate cleanly into existing SGML document collections to be processed with existing SGML tools? What questions should someone facing an SGML requirement ask? What resources are they going to need? How much can they do with XML infrastructure to meet SGML requirements and where must they “cut over” to SGML? How should they make SGML if they really need to? How can they leverage XML tools while maintaining SGML source requirements?

Introduction

This year we celebrate the 30th anniversary of Standard Generalized Markup Language (SGML) becoming an international standard. Many reading this paper may never have heard of SGML or the role it played in the acceptance and success of the World Wide Web (WWW). In some cases it has also revolutionized the publishing of information by providing an easy way to output complex information to different media channels.

Through HTML, SGML gave Internet browsers, for the first time, the ability to display and publish information on the WWW. In December 1990, the Global Hypertext Project at CERN, the European Laboratory for Particle Physics, under the direction of Tim Berners-Lee, provided the capability of displaying and linking information across the internet. At that time the internet was mainly available to educational and government organizations. The Global Hypertext Project developed a very simple SGML vocabulary for presenting and transporting information across the internet. Many working in the SGML space knew the flexibility and usefulness of SGML, but HTML actually proved to the naysayers that there was both intellectual and monetary value in using markup to describe information. HTML became, and still is, the largest use of SGML (some nameless organizations dispute this fact but they haven't been able to prove it yet).

There were many pitfalls along the way. The U.S. Department of Defense (DoD) was one of the first adopters of SGML for technical publications. This resulted in positive movement toward adopting SGML by other organizations in manufacturing, data warehouses, publishing, etc. The negative side of DoD jumping in so early was that it poured massive amounts of money into companies (in many cases traditional DoD contractors) to develop software applications to support DoD. As a result, many of the early SGML software applications (authoring, databases, publishing, etc.) were too costly for small and medium-sized organizations to adequately leverage the power of SGML.

With the success of HTML, visionaries saw that SGML really could be affordable to the masses. SGML could be used by small and medium organizations to manage and disseminate their information.

SGML wasn't without its problems. XML was designed to alleviate some of the inherent problems and pain points that SGML had. XML was originally designed to be a subset of SGML. Some of these problems of SGML, either real or perceived, were:

SGML Declarations: The SGML declaration was a complex file that relayed information to the SGML application. A few of the parameters were:
- Allowed length of element and attribute names. Some of the early SGML vocabularies restricted element names to 2 characters. Most were restricted to 8 to 32 characters.
- Tag minimization. You could say whether the opening tag, closing tag or both tags could be omitted.
- Alternative delimiters. You could use other characters in place of the less-than and greater-than (pointy brackets) in documents.
SGML DTD: SGML requires that validation against a DTD always be performed before any application will process the information. Although validation is important when developing and managing information for presentation and dissemination, it is not important to the end user. The SGML DTD allows many constructs, such as inclusions, exclusions, inline comments, and tag minimization, that caused inconsistencies in tools and parsers.
Character entities: SGML used the ISO character entity sets for characters (for example, &aacute; for á). XML uses Unicode natively.

XML became a W3C Recommendation in February 1998. Organizations quickly jumped on board and adopted XML for their data. Organizations that had originally adopted SGML were slower to switch, but as XML software improved and became more affordable than SGML software, and as the SGML tool market declined, they also moved to XML. An educated guess would be that 98-99% of organizations using markup for their data are using XML. Today, even popular software such as Microsoft Word uses XML for its underlying data. Open a Microsoft .docx or .xlsx file in a zip application and take a peek inside: all the underlying data is XML.

However, there are a few organizations, mainly within DoD, that have not adopted XML and have stuck with SGML almost 20 years later. This paper is designed to help organizations navigate the necessity of delivering SGML in an XML world.

Authoring SGML Content

There are a few editors that still support SGML authoring. These editors originated in the SGML world. However, if you research their literature you will find very little information about their SGML authoring capability. Each of these authoring tools supports full SGML editing, as well as authoring with native SGML DTDs. These editors are:

JustSystems' XMetaL

Adobe's FrameMaker + XML

PTC's Arbortext Editor

If you need to deliver SGML, the easiest and most efficient way is to use one of the editors that support both SGML and XML authoring.

The same SGML editors can be used to create XML files. The above editors also allow you to save the file as XML or SGML. The big differences between an XML and SGML instance are:

XML declaration vs. SGML declaration
DOCTYPE statement
Empty tags: <linebreak/> vs. <linebreak>. Normalizing an empty XML element from <linebreak/> to <linebreak></linebreak> will result in validity in both SGML and XML.
Case sensitivity. SGML is not case sensitive. This means that the elements <TITLE> and <title> are exactly the same in an SGML document. This is not true in XML, where they are treated as two different elements. SGML editors handle case differently. For example, some editors use all capital letters for element names whereas other editors use all lower case. If you are going from XML to SGML this isn't important, but when moving from SGML to XML case becomes significant. It is something to be aware of when deciding your authoring process.
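
The case and empty-tag differences can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name fold_element_case and the sample fragment are hypothetical, the input is assumed to be well-formed XML already, and attribute names are left untouched. Whether an explicit end tag is accepted for an element declared EMPTY should be confirmed against your own DTD and SGML parser.

```python
import xml.etree.ElementTree as ET

def fold_element_case(root, to_lower=True):
    """Rewrite every element name in the tree to a single case (hypothetical helper)."""
    for element in root.iter():
        element.tag = element.tag.lower() if to_lower else element.tag.upper()
    return root

# Hypothetical fragment, as if saved by an editor that upper-cased one element name.
source = "<poem><TITLE>The Raven</TITLE><linebreak/></poem>"
root = fold_element_case(ET.fromstring(source))

# short_empty_elements=False writes <linebreak></linebreak> instead of <linebreak/>.
normalized = ET.tostring(root, encoding="unicode", short_empty_elements=False)
print(normalized)
```

Picking one case convention early, and enforcing it in a step like this, avoids having <TITLE> and <title> appear as two different elements once the content is processed as XML.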

Document Declaration Subset

The document declaration subset is a construct that provides a mechanism at the beginning of an SGML or XML document for creating both file entities and text entities. The document declaration subset was a commonly used construct in SGML. Almost every file contained one. Document declaration subsets are still used in XML, but not as commonly as they once were. The reason is that XSLT cannot process the information in the document declaration subset.

The example below shows a document using a document declaration subset:

<!DOCTYPE poem SYSTEM "poem.dtd"[
<!ENTITY author SYSTEM "poepic.jpg"  NDATA jpg>
]>
<poem id="poem1">
	<title>The Raven</title>
	<poet>Edgar Allan Poe</poet>
	<author-picture src="author"/>
	<stanza id="stanza1">
		<line>Once upon a midnight dreary, while I pondered, weak and weary,</line>
		<line>Over many a quaint and curious volume of forgotten lore-</line>
		<line>While I nodded, nearly napping, suddenly there came a tapping</line>
		<line>As of some one gently rapping, rapping at my chamber door.</line>
		<line>"’Tis some visitor," I muttered, "tapping at my chamber door-</line>
		<line>Only this and nothing more."</line>
	</stanza>
...	

</poem>

If you are faced with this situation, establishing authoring rules can allow files to be processed by both XML and SGML applications. For example, if you establish a rule that the entity name always matches the file name, then XSLT can determine the graphic's file name without needing to look at the document declaration subset.

Other organizations have used a metadata field in the XML document to place the document declaration subset information in the file. The metadata field gets stripped during the conversion to SGML. The document declaration subset is created at the time of the conversion to SGML and included in the file.
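
As a rough sketch of how the subset could be generated at conversion time, the Python below scans a document for graphic references and builds the DOCTYPE statement from them. It assumes the naming rule described above (entity name equals file name without its extension). The function name build_doctype, the src attribute, the jpg extension used as the notation name, and the sample document are all assumptions made for illustration, not part of any specification.

```python
import xml.etree.ElementTree as ET

def build_doctype(xml_text, dtd="poem.dtd", attr="src", ext="jpg"):
    """Build a DOCTYPE with one NDATA entity per referenced graphic (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():
        name = element.get(attr)
        if name and name not in names:  # keep first-seen order, skip repeats
            names.append(name)
    # Naming rule assumed here: entity "poepic" always refers to file "poepic.jpg".
    entities = "\n".join(
        f'<!ENTITY {name} SYSTEM "{name}.{ext}" NDATA {ext}>' for name in names
    )
    return f'<!DOCTYPE {root.tag} SYSTEM "{dtd}"[\n{entities}\n]>'

document = '<poem id="poem1"><author-picture src="poepic"/></poem>'
doctype = build_doctype(document)
print(doctype)
```

The same rule lets an XSLT stylesheet work in the other direction, deriving the file name from the attribute value alone.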

Authoring in Native XML Editor

If you already have an XML editor in-house and prefer to keep using it, you can. You just need to be aware of the slight differences between SGML and XML instances described above.

SGML DTD to XML DTD

If you decide to author content in an XML editor you will need a valid XML DTD. You will need to either obtain the XML version of the DTD or convert the SGML DTD to a valid XML version. This can be a daunting task, especially with large complex DTDs. There are some good articles on the modifications required to convert an SGML DTD to an XML DTD. One such article was written by Norm Walsh in 1998 and is available at XML.com [DTD].

If you need to convert a complicated and all-inclusive DTD such as MIL-STD-38784C, it may be best to do a data analysis of your documents, determine which components of the DTD are required for your set of documents, and develop a subset of the DTD. This approach has two advantages:

best defines your documents
makes authoring documents easier by reducing the number of unnecessary elements.
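
One possible starting point for such a data analysis is a simple count of which elements actually occur in the document set. The Python sketch below assumes the documents have already been normalized to well-formed XML. The function name element_usage and the two sample documents are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def element_usage(documents):
    """Count how often each element name occurs across a set of XML documents."""
    counts = Counter()
    for text in documents:
        counts.update(element.tag for element in ET.fromstring(text).iter())
    return counts

# Hypothetical sample documents standing in for a real collection.
samples = [
    "<poem><title>A</title><stanza><line>x</line><line>y</line></stanza></poem>",
    "<poem><title>B</title><stanza><line>z</line></stanza></poem>",
]
usage = element_usage(samples)
for name, count in usage.most_common():
    print(name, count)
```

Elements that are declared in the full DTD but never appear in the counts are candidates for leaving out of the subset.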

In some cases organizations have requirements to deliver their data in multiple SGML/XML formats for multiple clients. This happens all the time in the manufacturing world. In this case organizations usually find it cost effective to develop their own XML DTD and/or schema and convert the document to multiple formats based on business requirements. Their DTD/schema may be based upon an industry standard.

Many open source standards that started in the SGML world have both XML and SGML DTDs, as well as XML Schema versions, available. DocBook (http://docbook.org/) and the Text Encoding Initiative (TEI) (http://www.tei-c.org/index.xml) are two initiatives that provide both SGML and XML DTDs.

Converting and Parsing XML Native Data to SGML Native Data

Converting an XML document to an SGML document is trivial. By normalizing the XML file you will have an SGML document that you can parse against the SGML DTD. You will want to parse the document against the DTD before any delivery of data. One of the best tools for parsing an SGML document is James Clark's SP. SP requires a little knowledge of the SGML application to use.
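
A minimal sketch of this normalize-then-parse workflow in Python follows. It rests on several assumptions: the helper names xml_to_sgml and parse_with_sp and the sample document are hypothetical; the SP parser is assumed to be installed as nsgmls (or as onsgmls in the OpenSP distribution); and the -s option is used to suppress normal output so that only error messages are reported. The parser is also assumed to exit with a non-zero status when it finds errors. Any document declaration subset would need to be re-created separately, because this parsing step does not preserve it.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET

def xml_to_sgml(xml_text, system_id):
    """Normalize an XML instance and give it a DOCTYPE that points at the SGML DTD."""
    root = ET.fromstring(xml_text)  # the XML declaration is dropped by parsing
    body = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    return f'<!DOCTYPE {root.tag} SYSTEM "{system_id}">\n{body}\n'

def parse_with_sp(path):
    """Validate a file with SP if it is installed; return None when it is not."""
    exe = shutil.which("nsgmls") or shutil.which("onsgmls")
    if exe is None:
        return None
    # -s: suppress normal output, leaving only error messages on stderr.
    result = subprocess.run([exe, "-s", path], capture_output=True, text=True)
    # Assumption: a zero exit status means no errors were reported.
    return result.returncode == 0, result.stderr

xml_source = '<?xml version="1.0"?>\n<poem id="poem1"><title>The Raven</title><linebreak/></poem>'
sgml_text = xml_to_sgml(xml_source, "poem.dtd")
print(sgml_text)
```

In practice the converted text would be written to a file and passed to parse_with_sp, and delivery would be blocked whenever the parser reports errors.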

Native SGML Publishing and Specifications

Early SGML publishing was accomplished using proprietary publishing systems. These systems were very expensive. Several of these publishing systems are still available and are still quite costly. In the late 1980s DoD started developing a specification for publishing SGML. The specification was called Formatting Output Specification Instance (FOSI).

The specification MIL-PRF-28001 (MARKUP REQUIREMENTS AND GENERIC STYLE SPECIFICATION FOR EXCHANGE OF TEXT AND ITS PRESENTATION) was originally published in 1992. The last revision was MIL-PRF-28001C, which was published in 1997. MIL-PRF-28001 specified the use of SGML for all new technical manuals within DoD. Each branch of DoD (Army, Navy, Air Force) developed DTDs for use within their individual organizations based on their specific requirements. These DTDs adhered to the requirements in MIL-PRF-28001.

In addition to specifying the SGML constructs for developing DTDs, MIL-PRF-28001 provided a DTD and specification for applying styling to the SGML. This styling specification is the FOSI. Appendix B of the specification contained the DTD that supported the use of FOSI for presentation. Several SGML vendors were part of the working group and worked toward developing a FOSI-based publishing system. Two vendors, Datalogics and Arbortext, successfully developed FOSI-based formatting within their products. However, the two implementations differed slightly because of different interpretations of the specification and ambiguity in the DTD. The result was that a FOSI developed for one system could not be used in the other system.

FOSIs are still used today by Datalogics and Arbortext. Arbortext has slowly tried to replace the FOSI with its own style specification called Styler. Styler uses both FOSI constructs and its own style constructs.

SGML and Loose-Leaf Publishing

When SGML was a new standard, large publishing environments required loose-leaf publishing. In the 'olden days' large organizations published administrative and technical manuals on paper. When modifications to the manuals were made, only the pages that were modified were printed and sent to the users. A manifest was sent with the 'change pages' which told the manual administrators or librarians which pages to remove and add to the paper document. The manifests were often printed on blue paper and the manifest and change pages were called 'blue pages'. Many organizations still use these manifests in their XML as well as SGML publishing, and still call them 'blue pages'. An example of a manifest document is available at the Patent and Trademark Office.

The users would remove old pages and insert new pages into binders. If revised content ran longer than the original page, the extra pages were numbered with a suffix. For example, if page 1200-2 was modified and the revision ran over onto a second page, the users would receive page 1200-2 and page 1200-2a.
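
The numbering in that example can be sketched in a few lines of Python. The source only shows the 1200-2 and 1200-2a case, so continuing the sequence with b, c and so on is an assumption, as is the function name change_page_numbers. A real loose-leaf system would follow the numbering rules of the governing specification.

```python
from string import ascii_lowercase

def change_page_numbers(page, page_count):
    """Return the page numbers for a revised page that now fills page_count pages."""
    if not 1 <= page_count <= len(ascii_lowercase) + 1:
        raise ValueError("page_count out of range for single-letter suffixes")
    # The first page keeps the original number; overflow pages get a, b, c, ...
    suffixes = [""] + list(ascii_lowercase[: page_count - 1])
    return [page + suffix for suffix in suffixes]

print(change_page_numbers("1200-2", 2))
```

Because the overflow pages carry a suffix, the pages that follow (1200-3 onward) keep their numbers and do not need to be reprinted.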

SGML publishing systems that support loose-leaf publishing compose the SGML first and place processing instructions at the point of each page break in the SGML. When revision elements are placed in the SGML, the publishing system makes a second run through the document to determine where page breaks occur and then calculates the correct page breaks and page numbers.

Modern technology has negated the need for loose-leaf publishing because manuals can be disseminated in total without the necessity to print and disseminate single paper pages. Most of today's workforce are used to on-line, PDF or e-book technology and prefer electronic dissemination of information to paper.

However, there are still pockets of organizations that are stuck in the 1970s and still require loose-leaf publishing capability. Suppliers to these organizations therefore have a requirement to provide information as antiquated paper pages, which include change pages.

If you are required to support loose-leaf publishing, it can still be done with XML but will require out-of-the-box thinking and additional processes in order to emulate loose-leaf publishing. XSL-FO does not support loose-leaf publishing.

Creating Published Documents from Native SGML

There are multiple publishing formats for SGML/XML documents. Most organizations are looking to create print (PDF), HTML and/or e-books. It is possible to get all of these outputs from your SGML documents. In some cases you will need to convert the SGML to XML.

SGML had two main specifications for publishing: FOSI (Formatting Output Specification Instance) and DSSSL (Document Style Semantics and Specification Language). FOSI came first, then DSSSL followed.

FOSI (Formatting Output Specification Instance)

Shortly after SGML became a standard, the U.S. Department of Defense decided to adopt SGML as the standard architecture for developing technical manuals and IETMs (Interactive Electronic Technical Manuals). DoD initiated a project called CALS (initially Computer Aided Logistics Support, then Continuous Acquisition and Lifecycle Support and lastly Commerce at Lightspeed). DoD needed a mechanism to produce printed documents from the SGML they were creating.

Industry was slow in developing a standard for publishing SGML. There were pockets of proprietary software. DoD started an industry initiative to develop a standard for publishing SGML. The partnership between industry and DoD resulted in the FOSI specification. The FOSI specification was incorporated in the DoD MIL-PRF-28001 specification [MIL-PRF-28001]. Ultimately there were two vendors that supported the FOSI specification, Arbortext (now PTC) and Datalogics. Both Arbortext Editor and Datalogics DL Composer still support FOSIs for publishing SGML data.

A FOSI is actually an SGML document controlled by an SGML DTD. It was a bold concept. Newer SGML and XML specifications continued this practice of using markup to write output specifications.

The last update to MIL-PRF-28001 was almost 20 years ago in May 1997.

DSSSL (Document Style Semantics and Specification Language)

DSSSL came chronologically after FOSI. DSSSL became an ISO (International Organization for Standardization) standard [ISO] in 1996. Like FOSI, DSSSL was adopted by few products. However, DSSSL can be considered the mother of the W3C XSLT specification.

DSSSL, like XSLT, had two parts. The first part was transformation: it provided the standard for how to convert the document. The second part provided the formatting information needed to obtain the presentation of the data.

Creating Published Documents from XML

There are several possibilities for creating PDF output from native SGML. As previously discussed, you can use one of the tools that support the SGML style specifications. There are also proprietary publishing systems that support SGML. For most organizations proprietary publishing software is cost-prohibitive. Organizations that can't afford the software and/or the expertise to develop the stylesheets will use a third-party composition company, which can also be expensive.

However, XSLT and XSL-FO are obvious choices for creating PDF and printable files. As stated previously, converting the SGML files to XML or vice versa is relatively trivial. XSLT is a relatively easy skillset to obtain internally or externally. XSL-FO expertise is a little harder to obtain, but it should be easy either to train individuals internally or to obtain outside help.

There are also standard XML vocabularies that have standard stylesheets available. DocBook and DITA are two that are commonly used. If the presentation of the SGML is relatively straightforward, it might be worthwhile to convert the document to DocBook or DITA and modify the stylesheets that come as part of these specifications.

DITA would be the best choice for DoD technical manuals; however, extensive modification of the stylesheets would be required and it wouldn't be the easiest approach. Some DoD contractors have been able to negotiate with their DoD customers to deliver a complete PDF book as a 'new book' without the loose-leaf publishing requirement. In this case XSL-FO is used to create the PDF of the technical manual.

Others have negotiated to supply the SGML to the DoD customer as well as an HTML rendition of the document [IETM]. In this case the DoD facility has the capability to publish the SGML in-house using the JCALS (Joint Computer Aided Logistics Support) system that they still have available. JCALS was a joint program of the Army, Navy and Air Force that developed a publishing system, including custom and proprietary software, in the mid-1990s. Arbortext Editor is the editor and Datalogics DL Composer is the composition software.

Conclusion

In conclusion, 30 years after SGML became an international standard it is still being created and used. If you and/or your organization find yourself in a position where you need to deliver SGML documents, it is still possible. It will take careful thought to develop the document constructs and the workflow. Hopefully, in the not too distant future, organizations will move from SGML to XML.

References

XML Specification, https://www.w3.org/TR/1998/REC-xml-19980210

W3C Note on SGML and XML differences, https://www.w3.org/TR/NOTE-sgml-xml.html

[DTD] Converting an SGML DTD to XML, Norm Walsh, July 08, 1998, http://www.xml.com/pub/a/98/07/dtd/

An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language, James Clark, http://www.jclark.com/sp/

[MIL-PRF-28001] Markup Requirements and Generic Style Specification for Exchange of Text and Its Presentation, http://www.navsea.navy.mil/Portals/103/Documents/NSWC_Carderock/28001c.pdf

[ISO] International Organization for Standardization, http://www.iso.org/iso/home.html

[IETM] Betty Harvey, Balisage 2012, Developing Low-Cost Functional Class 3 IETM, http://www.balisage.net/Proceedings/vol8/html/Harvey01/BalisageVol8-Harvey01.html, doi:https://doi.org/10.4242/BalisageVol8.Harvey01

Betty Harvey

Electronic Commerce Connection, Inc.

As President of Electronic Commerce Connection, Inc. since 1995, Ms. Harvey has led many federal government and commercial enterprises in planning and executing their migration to the use of structured information for their critical functions. She has helped develop strategic XML solutions for her clients. Ms. Harvey has been instrumental in developing industry XML standards. She is the co-author of "Professional ebXML Foundations" published by Wrox. Ms. Harvey founded the Washington, DC Area SGML/XML Users Group. Ms. Harvey is a member of "The XML Guild" and was a coauthor of the book "Advanced XML Applications From the Experts at The XML Guild" published by Thomson.

Balisage: The Markup Conference
