Parser Possibilities: Why Write A Markup Parser

Norman Earl Smith

SAIC Technical Fellow and Assistant VP of Technology

Science Applications International Corp.

Copyright, Science Applications International Corporation. All rights reserved. Unpublished rights reserved under copyright laws of the United States.

expand Abstract

expand Norman Earl Smith

Balisage logo

Proceedings

expand How to cite this paper

Parser Possibilities: Why Write A Markup Parser

Balisage: The Markup Conference 2008
August 12 - 15, 2008

Introduction

XML and SGML parsers are available that are mature, robust, and widely used. It's not like the early days of XML when it seemed that every new week brought a new parser. Soon, the first SAX-based XML parsers appeared, followed by DOM (Document Object Model) parsers with standard APIs. Once high-quality validating parsers, such as Expat and AElfred, became widely available, the community seemed to lose interest in writing new parsers. Many of the early XML parsers were non-validating because they were considerably easier to write than validating parsers.

SAX was originally patterned after the ESIS (Element Structure Information Set) output from the SGMLS and NSGMLS SGML parsers. (See ClarkJ) DOM implementations usually use an underlying SAX parser to feed the content to the DOM objects. SAX-based applications are generally faster and require less memory than a DOM-based application. My experience is that an event-based parsing model, like SAX, can easily handle 80% to 90% of XML applications. Either are appropriate for the next 5% to 10% of applications, and the last 5% of applications really need DOM or something similar.

Even with numerous XML parser options, I have never been completely satisfied with the available XML parsers because I still deal with SGML and other markup languages. The remainder of this paper looks at reasons for writing a parser. Questions that you should ask yourself before starting to write a custom parser and the road to writing my own mlParser are also examined.

What's A Markup Parser

A markup language, such as XML, is a language for writing application-level markup languages. HTML and DocBook are examples of application-level markup languages. The dual use of the term "markup language" is especially confusing to non-technical people. Wikipedia's definition of SGML (See WikipediaSGML) describes it as "a metalanguage in which one can define markup languages" and I believe the term metalanguage is appropriate. Wikipedia further defines markup language as "a set of annotations to text that describe how it is to be structured, laid out, or formatted."

Both SGML and XML use "<" and ">" to delimit markup. Common usage has evolved to the point that just about any markup that uses "<" and ">" is called XML. Notice that I have been using the term "markup language parser" instead of "XML parser" for the most part so far. This is done on purpose.

There are four types of markup language in use that I know of:

  1. SGML

  2. XML

  3. WordML

  4. *SP

HTML and DocBook are not included in the list because they are XML/SGML application-level markup languages, not markup meta languages. WordML and *SP are not meta-markup lanugages. However, their syntax sufficiently differs from XML and SGML such that normal parsers don't handle them either. They are included to show markup that might require a custom parser to process, and because mlParser can handle both.

SGML, or Standard Generalized Markup Language, is an ISO standard. Its heritage is in the publishing industry and it has an IBM General Markup Language (GML) lineage. Computers in the early days had limited processing power for the individual, which translated to no SGML editor that performed real-time validation and formatting. Therefore, SGML contained many features to minimize the keystrokes required to enter content. For example, end tags could be declared optional. The structure of an SGML document is defined by a Document Type Definition (DTD). SGML is a markup metalanguage. My general belief is that SGML is still best for defining publishing markup languages.

XML, or eXtensible Markup Language, has its roots in the database world. It is a direct result of Tim Bray's work at OpenText for handing structured data that may or may not be SGML compliant. XML is also a markup metalanguage. Element structure may be defined thru a variety of "languages" including DTDs, Schemas, and RelaxNG. It is an SGML subset with just enough syntax differences to prevent processing markup files with each other's tools. I know because I tried for years to treat markup as SGML for some tools and the same markup as XML with other tools with limited success.

XML also introduced the concept of valid versus well-formed XML documents. A valid document is one that has been validated successfully against its DTD or Schema with a validating parser. A well-formed document is one that has matching start and end tags and follows the other rules of XML markup syntax. A well-formed document may or may not be valid. For that matter, it may or may not have a DTD or Schema.

There is no concept of well-formed documents in SGML, only valid or not valid. Most SGML tools validate the document every time it is used. A common real-world view is that a document only needs to be validated in specific cases:

  • The document is edited by a human.

  • During program development, until you are sure that valid markup is generated.

  • When data is supplied from an external source, markup may or may not need to be validated on every data exchange depending on circumstances.

WordML is my name for the output from saving a Microsoft Word document as Filtered HTML. The markup appears to follow some XML syntax rules, some SGML syntax rules, and some of its own, unique syntax rules. Much Word-specific data is stored in comments. I have even run across a few WordML documents that were not well formed! Few XML parsers can handle WordML.

*SP represents the various Serve Page markup languages such as ASP and JSP. It is not often that a Java developer attempts to run Java Server Pages thru an external parser. The JSP compiler normally takes care of the parsing. There are times when it can be very revealing. The output of the JSP compiler is a Java program that generates an HTML page. Normally, the embedded Java code generates dynamic values on the page. XML parsers do not normally handle *SP files.

Finally, the answer to the question "What's a markup parser?" A markup parser basically reads data that contains application-level markup, extracts tags and attributes from the markup, and generates some output. Validating parsers also read the structure definition in the form of a DTD, Schema, or other format. The output may range from structural error messages to ESIS output or anything in between. Some parsers, like OmniMark have built-in programming languages. Others provide SAX or DOM programming language interfaces. The possibilities are limited only by the requirements of the application and imagination of the developer. A markup parser may handle multiple types of markup, not just XML or SGML, from the point of view presented here.

ESIS is the primary output for the SGMLS and NSGMLS parsers. It is a record-oriented format where the first character on each line represents the markup event and the rest of the line is data. For example, a start tag event is represented by an '(' event type and the rest of the line is data. The tag name is the data in this case. '(mytag' is the ESIS output generated for <MYTAG> in the input document.

ESIS is very significant from a historical prospective. Back in prime SGML days of the early 1990s, SGMLS was just about the only widely available free SGML parser. Commercial SGML tools were all expensive because of the cost of either writing or licensing an SGML parser to include in the tools. Taking advantage of the SGMLS ESIS output was the only way to test-drive SGML without spending a lot of money. ESIS format is easy to process and a good programmer could do amazing things with ESIS output and a Perl script or two. ESIS also contains the idea behind SAX. The call-back events in programs logically work on an ESIS stream.

Why Write A Parser?

With the parser background out of the way, let's take a look at reasons both to write your own parser and reasons not to write your own parser. First, there are good reasons for not writing a custom parser:

  • Writing a basic parser is a lot of work

  • Writing a validating parser is a lot more work

  • Implementing XSL/XSLT/X-Path/XQuery, etc., support is not practical for most individual developers

  • Writing a parser, even a non-validating parser that implements SAX and/or DOM, is a huge effort

  • There are multiple, structure definition languages, such as DTD, Schema, RelaxNG, etc., needed for validation that are complex

Writing any markup parser is hard work. The additional complexity and effort required to write a validating parser instead of a non-validating/well-formed parser is significant. For many cases, it exceeds the point of diminishing returns . Writing a validating parser requires writing the validating part, plus the code to parse a DTD or schema, and implement the code to actually validate the element structure at each change-in-tag state. Are the benefits worth the effort?

The "X" add-ons are also complex. They require a very sharp staff to implement the standards. The resources to properly implement XSL/XSLT/X-Path/XQuery is substantial. These are usually beyond the average individual developer for short-term projects. Complexity and the associated learning curve are the main impediments.

Both the SAX and DOM API's are large and fairly complex, which means the average developer won't implement them completely, if at all. I know from experience; I didn't bother implementing either. A validating parser also requires the code to load allowable element structure in order to be able to validate a document, with another increase in code complexity and size.

All of these items represent increased implementation effort that just might not be worth the trouble if an existing parser meets most of your requirements. You have to carefully gauge the value proposition for each increase in effort that the next step brings. It may or may not be worth the effort.

On the other hand, there are still many reasons for writing your own markup parser. The rest of this section takes a look as some of those reasons. They include:

  • The learning experience.

  • No existing parser meets your specific requirements.

  • You have complete control.

  • You can mix and match markup languages, i.e., SGML, XML, etc.

  • Not tied to existing APIs.

  • You need to create documents with "live" content.

The following paragraphs expand these points.

The learning experience. Writing a markup parser, even a "simple" non-validating parser, is always a learning experience. There are a number of learning experience possibilities, such as:

  • Learning about markup languages. Writing a parser will teach you about markup language rules. Having to account for every possible condition in the markup will quickly enlighten you!

  • Learning about software state machines. I have used the state machine technique for writing a parser more than once. It is a straight-forward way to work thru the program. An in depth knowledge of the rules of the markup language is required. There is a basic state machine for parsing SGML published in the Practical Guide To SGML And XML Filters. (See Smith1998) If you don't know what a State Transition Diagram is, you will learn a lot of new things writing a markup parser.

  • Providing a vehicle to learn a new language. I wrote the first implementation of mlParser because I needed to write a non-trivial program to learn Java and I already understood the ins and outs of SGML. This allowed me to concentrate on learning the programming language and not the application.

As you can see, there are many things that can be learned from the experience of writing a parser.

No existing parser meets your specific requirements. This is not an unusual occurrence, especially if your application is a little non-standard. Reasons include:

  • Need a light-weight parser

  • Need to parse multiple markup meta languages

  • The programming language for an application is incompatible with existing parsers

Robert Bajzat (See Bajzat) needed a light-weight XML parser for his Thinlet package, which is a small (39k) Java windowing framework aimed at cell phones. The on-screen widgets for a Thinlet-based application are in simple XML. The Thinlet XML parser handles its slightly restricted XML syntax. Comments not spanning a line is an example restrictions. The Thinlet parser also is closely tied to the application and knows how to handle just the widget markup. The Thinlet parser is an integral part of the framework, and the short cuts in syntax make it tiny!

Incompatibility with existing parsers usually means the markup data buries information within comments, does not follow the syntax rules, or mixes and matches syntax rules from different markup meta languages. WordML is a good example of this. WordML is what I call the result of saving a Word document as "Filtered HTML." It appears to follow some SGML rules, some XML rules, invents a few rules, and puts a great deal of information into comments. Few parsers can handle this markup.

You have complete control. This is the real reason most people write their own version of an application. Tailoring a program to meet your system requirements has a strong draw for many people. I would rather write a hand-tuned set of Java classes to represent markup data, for example, than use one of the canned frameworks that seems to create hundreds of classes/methods when three or four well-designed classes are easier to understand and more efficient.

Control extends to which markup features are supported. The parser may not need to handle attributes at all if the data does not contain attributes and is record-oriented. The parser may do simple transforms as the input stream goes by that will significantly simplify downstream code. The parser doesn't always have to stop on errors.

You can mix and match markup languages. I started out using SGML around 1990. I have been thru the start of XML and the rise of Server Page languages such as JSP and ASP. There have been plenty of times when I wanted to mix and match SGML and XML files, in particular. Writing your own parser makes this practical.

Not tied to existing APIs. The existing SAX and DOM APIs are large and complex. If you go back and examine ESIS closely, basic markup processing can be done with a much simpler subset API. But once you make the leap that you don't have to implement existing APIs, you are free to build what meets your application requirements. That said, I firmly believe that this is the least acceptable reason in my list for writing a parser.

You need to create documents with "live" content. A favorite application technique of mine is creating a live document. By live, I mean that some portion of the content is dynamically generated. Executing an external script, nesting the output of another parse, running an SQL query, or retrieving a URN are some of the many things that can be incorporated transparently with a custom parser. The data just shows up in the data to the downstream processing programs. A great deal of extra functionality can be transparently dropped into a markup-based application by including the ability to execute arbitrary programs/code on a parsing event.

The Road To mlParser

I initially wrote my markup language parser, mlParser, in 2002. The journey from being thrown into the SGML ocean in 1990 to the birth of mlParser and its subsequent evolution into a multi, meta-markup language parser was a long road. In the early SGML days, SGML tools were expensive. My customer wanted to use SGML as the exchange format for bibliographic data and could not justify the expense of purchasing SGML tools without being sure of success. We developed small-scale applications around SGMLS using Perl scripts to process ESIS output. It worked well, and I learned a lot about processing an ESIS stream.

The following markup file is used as input for sample code for the examples that follow:

<RECORDS>
<RECORD>
  <NAME>John Doe</NAME>
  <PHONE>555-123-4567</PHONE>
  <EMAIL>JDoe@anymail.com</EMAIL>
  <STATE>Confusion</STATE>
</RECORD>
<RECORD>
  <NAME>Jane Smith</NAME>
  <PHONE>555-345-9876</PHONE>
  <EMAIL>smithj@anymail.com</EMAIL>
  <STATE>Nirvana</STATE>
</RECORD>
</RECORDS>

The ESIS output from mlParser for the above markup is:

# mlParser (c) Science Applications International Corporation, 2002, 2006, 2007.
All rights reserved.
(records
-\n
(record
-\n
(name
-John Doe
)name
-\n
(phone
-555-123-4567
)phone
-\n
(email
-JDoe@anymail.com
)email
-\n
(state
-Confusion
)state
-\n
)record
-\n
(record
-\n
(name
-Jane Smith
)name
-\n
(phone
-555-345-9876
)phone
-\n
(email
-smithj@anymail.com
)email
-\n
(state
-Nirvana
)state
-\n
)record
)records
C
Done

The code described in the following paragraphs produces this output:

 Name: John Doe   Email: JDoe@anymail.com
 Name: Jane Smith   Email: smithj@anymail.com

Over the years, I wrote two other parsers. The first was part of a Forth (See Smith1997) interpreter. The only restriction on Forth function names is that the name cannot contain white space. Therefore, a function name can be a tag name. So, <RECORD> is both a tag and a function. A function definition begins with ':' and ends with ';'. Forth is a stack-oriented language that has a Reverse Polish syntax, which means that parameters come before the function name and no parens are necessary. A function is called by simply referencing its name.

The application implementation approach was:

  • Define a function for each tag

  • Each function must consume characters up to the next tag/end tag

  • A document processed itself when fed to the Forth interpreter

The document effectively executed itself. It was an interesting idea that never got past the toy stage. The following is a snippet of Forth code used to process the markup file described above:

 String content
 : <NAME>
      " Name:" print
      content collect ;  \ Read the content for <name>
 : </NAME>
      content print ;
 : <EMAIL>
      " E-Mail: " print
      content collect ;
 : </EMAIL>
      content print
      " \n" print ;
 " test.xml" cload

The cload function loads and interprets the filename test.xml. the collect function consumes the charcters up to the next '<' and store it in content. The document executes itself generating the output. The parser was implemented as a state machine.

My next parser was a Perl library, which was based completely on regular expressions. The primary functions were:

 $content = &get_XML_field($string,"tag");
 @results = &get_next_XML_field($string,"tag");

get_XML_field() returns the contents of <TAG> from the string of markup. It is useful when processing record oriented markup one record at a time. get_next_XML_field() extracts a repeating tag from the markup string, returning a status, the tag contents, and the remainder of the markup string. Typically, the whole file is read into a string and get_next_XML_field() extracts the next record to operate on. Then calls to get_XML_field() pull out the fields individually for processing.

The approach for using this parser was:

  • Read data into a string

  • Extract a "wrapper" element into another string

  • Extract individual fields from the wrapper string

  • Do whatever processing is needed

The following Perl code snippet generates the example output:

 require "xml.pl";

 my $file;
 my $record;
 my $name;
 my $email;
 my $template ="Name: \$name   Email: \$email\n";

 $/ = "</RECORD>";

 while(<STDIN>){   # Read a record
      $record = $_;
      $name   = &get_XML_field($record,"NAME");
      $email  = &get_XML_field($record,"EMAIL");
      eval("print \"$template\"");
 }
 exit;

This Perl parsing library has been used successfully in several production applications. The constraints are that the markup does not contain attributes and that huge records have to fit in memory.

Both of these solutions worked for a small subset of applications with simplified markup. They handle start tags, content, and end tags and that's about it. Neither ever grew into a robust, general purpose parser. The Perl library has found its way into several production applications though.

The following code snippet is part of the mlParser program to process the sample input file. It has the call-backs for the parser output events. The simple nature of the input markup, makes for an extremely simple Java example.

    ...
    HashMap element = new HashMap();
    String  content = "";
    ...
    public void writeStartTag(String sTag)
    {
        String tag     = sTag.toLowerCase();

        if(tag.equals("record"){
            element.clear();    // Wipe the hash for each record.
        }
    }
   ...
    public void writeEndTag(String eTag)
   	{
   	    String tag     = eTag.toLowerCase();
        if(tag.equals("record")){
            system.out.print("Name: "    + element.get("name")  +
                             "  Email: " + element.get("email") + "\n";
        }else{
            element.put(tag, content);  // Collect all element content in a Hash
        }
   	}

Assume that the content callback happens and leaves the content in the content global variable. The only processing needed in the writeStartTag() method clears the element hash. All of the other processing occurs in writeEndTag(). When the end tag is </RECORD>, we know that all of the fields in the record have been collected. Simply putting the content in a hash with the tag as the key is a convenient way to collect the data without testing for each tag as it passes by.

Mixing SGML And XML

I am an old-time SGMLer who was as skeptical as the next person about XML when it first came out of the closet at the 1996 SGML Conference. The false promise that got a lot of the SGML community on the XML train was that "XML is just SGML without DTDs" and XML is a subset of SGML. I had often wished to process SGML files without a DTD. The implication I misread into the original XML discussions was being able to process SGML markup with XML tools.

Not being able to do much without a DTD was always an SGML issue for two reasons. First, I never saw the need to validate an SGML file every time it was touched. An SGML file only needs to be validated when modified. Second, I received SGML files from external sources without a DTD with enough regularity that having to reverse engineer a DTD in order to be able to use SGML tools became a real annoyance. Most of the time, only a handful of tags were processed. Writing what was essentially a throw-away DTD always rubbed me the wrong way.

An XML well-formed document versus a valid document was the leap that was supposed to enable the "SGML without DTDs" concept. Well, that didn't quite happen. By the time XML hit the streets, minor syntax changes and the fact that non-validating parsers threw errors when processing virtually every SGML file made feeding an SGML file to a non-validating XML parser a waste of CPU cycles. Error examples include:

  • The first line of the file had to be the XML declaration (<?xml version="1.0" encoding="UTF-8"?>), which is a processing instruction in SGML.

  • The SGML empty tag representation causes the document not to be well-formed and therefore causes parsing problems.

  • Any entity reference not in the default XML set (<, >, and &) threw errors.

I eventually found xmln, which seemed to be the answer for a while. The C source was available and it generated an ESIS-like output stream, which meant it could be utilized by programs such as the Perl scripts I had already written to use the ESIS output from SGMLS. I thought I could modify xmln to handle SGML files as well but could never track down one of the C header files, which forced an end to my customization attempts.

I attempted to use markup interchangeably as SGML and XML for a couple of years. I finally threw in the towel when a couple of developers started using Java XML tools on a project. Since then, I consider SGML and XML cousins at best. XML is not a subset of SGML as originally advertised!

I work with some systems that started life as SGML applications. Eventually, the data and code will be migrated to XML data and tools. In the meantime, wouldn't it be convenient to mix SGML data and XML data transparently rather than have to do a conversion? I believe this is still a valid requirement for mixing XML and SGML interchangeably and a good reason to write a custom parser.

This section discusses my frustration mixing SGML and XML and the road to writing my own parser. mlParser did not happen directly as a result, rather the knowledge gained along the way was applied to its initial implementation and eventual evolution.

mlParser

In 2002, my next project required that I be fluent in Java. I had done light Java maintenance up to that point. The project seemed pretty important, so I needed to be productive on day one! I knew several programming languages at that point - getting up to speed did not seem impossible. The obvious approach was writing a non-trivial Java program.

I decided to implement the parser described by the software finite state machine from Practical Guide To SGML/XML Filters. (See Smith1998) Having experimented with both state machines and simple parsers in the past, I knew the technique and subject area. This allowed me to concentrate on the learning Java aspect in writing the parser.

The initial implementation started as a simple, pretty-much exact implementation from the book. The parser generated an ESIS output stream. I verified the output by comparing the mlParser ESIS output with the NSGMLS output. The first implementation was much easier than I had expected.

The second version implemented a few of the XML syntax differences, such as the <TAG/> empty tag. By the time I got the parser digesting basic SGML and XML, I began to see the potential for my own parser!

The third iteration was a restructuring of the code with interfaces for input processing, the parsing engine, and the output call-back class. My thinking was:

  • Input Interface. The assumption was a developer could supply a markup stream from any source, not just a file. The default input interface class handles the file as a stream, which was a good choice because of the large number of input types that can be mapped into a stream. There has been no need to implement another input class, but I still see the potential for custom input classes.

  • Parsing Engine. The interface makes it possible to replace the default state machine with some other parsing engine. The state machine is the heart of mlParser and its ability to parse multiple meta-markup languages is an important capability for me, so I don't see replacing current the state machine. However, it is possible.

  • Output Call-Backs. Each markup event triggers a call-back to an interface-defined method. The interface is organized around ESIS events plus a couple, such as document start and end. The interface is reminiscent of SAX, only a great deal simpler. This interface is the hook to embed mlParser into applications.

Two additional implementation cycles added a couple of significant capabilities, parsing WordML and *SP files. Not long after SGML and XML parsing were stable, I saved a Microsoft Word document as Filtered HTML and fed it to mlParser. mlParser did not get very far. I was disappointed, but not surprised. There was no immediate requirement to parse Word output. I wanted mlParser to handle all common markup types transparently by this time.

Examination of a WordML file reveals that it follows some SGML rules, some XML rules, and invents a few of its own. Additionally, a great deal of formatting data is stored in comments. I wrote a program to convert WordML to a very generic HTML as an excuse to keep tweaking the parser. Successfully parsing WordML became an obsession and mlParser was a time-consuming hobby at this point.

I volunteered to convert Word documents to HTML many times over the next couple of years. Each document seemed to present some unaccounted-for nuance that required a code tweak to the state machine class. The mlParser and WordML converter program now handles most Word documents, although the generated HTML is often ugly. The HTML is usually good enough to pass a validating parser with no significant validation errors.

I put some thought into what additional features would be necessary to make mlParser useful in a production application. The obvious items included:

  • Identifying empty tags without a DTD or Schema

  • Simple entity resolution

  • Setting most options via the command line

  • The option to stop on errors or continue processing a file

  • Implementing additional input data sources such as strings and URNs

  • Allowing nested parsing

SGML and XML have different markup syntax for empty elements. XML uses the form <Tag/> and SGML just uses a start tag and no end tag. The XML form enables recognizing an empty element without a DTD or Schema. The SGML requires that empty tags be explicitly identified. The mlParser approach is simply a file that identifies empty elements that get loaded into a hash at startup. It's simple and effective. When a missing end tag is detected, a check of the hash determines whether or not it is an empty element, and the processing handles the condition correctly. This approach allows SGML files that contain empty elements to work properly.

Simple entity substitution is a similar problem to empty elements; how do you represent entity values without a DTD or Schema? The mlParser solution is a simple Java properties file with the entity string as the name and the substitution value as the value. The default entity translation file looks like this:

 &amp;=&
 &lt;=<
 &gt;=>

If no entity property file is specified, a default set is used. Entities not found in the property file are passed thru unchanged instead of generating an error.

Some applications, such as a browser or building a document, may require that a markup parser continue even when errors occur. Other applications may need to stop on each error so they can be corrected. mlParser can do either and allows the user to specify the behavior as a command-line option. Applications that embed mlParser can set the option via a setter method.

A definite requirement for using mlParser in several of my applications is the ability to launch nested parses. Instantiating another parser object is straightforward and adds almost no complication to application code.

The Java JSP compiler is pretty forgiving, and as long as it can parse the Java code out of "<%" and "%>" delimiters, it is happy. What happens when the JSP compiler is successful and there is a missing angle bracket or two in the HTML markup? The result is usually unexplained "stuff" on the generated HTML page. What do you do? You use mlParser, which can handle the server page syntax, and parse the problem JSP file. Even if the parser only checks that the file is well formed, you may find problems. This actually happened on one project. mlParser showed the JSP file to not be well formed. When the markup was fixed in the JSP file, the unexplained "stuff" went away.

At this point, mlParser is mature with the capabilities identified above and is used in multiple production applications on multiple projects.

The remainder of this section gives a brief overview of the Software Finite State Machine that is the mlParser parsing engine. I first ran across state machines about 20 years ago in a presentation about navigation in an Adventure game where the current state represented room or location on the adventure map. The direction selection by the user caused the state to transition to a new location. State machines have fascinated me ever since and I have managed to use them a few times over the years. Markup parsers are close to an ideal state machine application.

The following State Transition Diagram shows the states for a simple SGML parser:

SGML State Transistion Diagram

jpg image ../../../vol1/graphics/Smith01/Smith01-001.jpg

Simple SGML State Transition Diagram

The state machine has to handle every character in the input stream. The initial state is TEXT. State changes often correspond to parsing events, although the State Transition Diagram does not show the call-back points. All parsing states eventually return to the TEXT state.

Two characters are significant in the TEXT state - the Start Tag Open (STAGO), which is '<' and Entity Reference Open (ERO), which is '&'. STAGO fires a call-back to the Content method and changes the state to TAG.

ERO changes to the ENTITY state to handle collection and substitution of the entity. A ';' or white-space character triggers transition back into the TEXT state. All other characters do not cause the state machine to exit the ENTITY state and are accumulated to form the entity name.

The TAG state is entered from the TEXT state via the '<' character. A white-space character triggers transition into ATTRIBUTE Name state. The Tag Close (TAGC) character '>' exits TAG state back to TEXT state and fires a call-back to the start tag method. Call-backs to the attribute method fire at the end of each attribute. This is one difference between mlParser and SAX. mlParser fires a call-back for each attribute and SAX returns the attributes in an Attributes object. Attributes are available when the start tag method call-back executes in both.

There are two sub-states for collecting attributes: Attribute Name and Attribute Value. The top level state is ATTRIBUTE. The need for sub-states may not be obvious, but works well. The attribute name ends with either an '=' character or a '>' character. In either case, a state change is triggered. The Attribute Value sub-state is a little complicated because the value may be enclosed in quotes or simply terminated by white space. The end of Attribute Value sub-state also signals the end of ATTRIBUTE. The current state changes back to TAG because the TAGC character will eventually be found. In practice, there are sub-states for several states. A global variable keeps track of the current state.

The ability to look ahead one character simplifies the state machine code significantly. Reading the next character needs to be a method and not just a read loop around the state cases. The sub-states certainly contribute to needing to be able to read characters at any point, and I believe a read character method is the right approach for this type of state machine application.

The state transition diagram certainly helps to think thru a state machine implementation at its start. I usually draw these diagrams during design. I then prepare a State Transition Table by the time I start coding. It speeds up the coding process significantly. The following State Transition Table corresponds to the State Transition diagram above:

Table I

State Transition Table

Current StateCharacter
STAGO (<)WhitespaceTAGC (>) = ERO(&)ERC(;)Other
TextTagTextTextTextEntityText
TagAttributeTextTag
AttributeTagSwitch sub-statesAttribute
EntityTextTextEntity

The table contains more details than the diagram. The table also points out potential error conditions. Any cell without an entry usually represents an error condition that should be examined closely before deciding if it is truly an error condition or if it can be ignored. For example, the cell at TEXT state and TAGC was blank in the initial version. The cell should have had TEXT because a lone '>' is just another text character. The empty cell at TAG state and STAGO is definitely an error and must be handled.

A software finite state machine is the ideal implementation technique for a markup parser. There are a relatively small number of states, a limited number of state change conditions, and the implementation is not overly complex. The current State Machine class in mlParser is about 1800 lines of heavily commented Java. The whole parser is about 3000 lines and the compiled mlParser JAR file is only 22K.

mlParser Today

This section covers mlParser in its present form. Design choices I made and implemented features are discussed. mlParser has grown and evolved a great deal since that first Java implementation. It is a robust markup parsing tool that is at the heart of several internal applications and is being used on multiple projects. The most significant mlParser use is as the core for an integrated software documentation tool where it has displaced OmniMark applications.

Early sections of this paper identified reasons for writing a parser and design choices to make before you get too far into the effort. My initial reason for writing mlParser was learning Java. My reasons for continuing development evolved almost as much as the code itself. My current list is:

  • Mixing different markup meta-languages

  • An application development framework

  • An easily customizable parsing package when XML-type markup just won't work

  • Fullfilling the "SGML without DTDs" vision

  • Replacing all my OmniMark code

The integrated software documentation tool has been in use for several years; there are thousands of SGML files across many projects. The software will eventually migrate completely to XML; however, the legacy SGML files will never be converted to XML. There will be a period where documents will be built from some SGML files and some XML files. The system must handle the legacy SGML on demand for the foreseeable future. mlParser enables this mixed markup environment.

An application development framework has grown up with mlparser. The output call-back interface forms the basis for mlParser application development. A new application can often be implemented with just two classes. One class is the main program that collects command line arguments, registers the output call-back class, and launches the parsing engine. Starting the parsing process is done by simply invoking the state machine class. Simple applications usually do not require additional classes. Applications that handle multiple document types will need to invoke multiple parser objects with associated, output call-back classes. Each document type should have its own parsing object.

My intimate knowledge of the internals, especially the state machine, allows me to customize the parser for specific applications. For example, there is an application that uses an HTML subset plus a handful of extra tags. I want to use generic, out-of-the-box OpenOffice as the editor so the user never sees a tag. We tried processing instructions in place of the extra tags without success. For OpenOffice to work as the editor for this application, editing must be done in normal WYSIWYG mode. Our solution was to include the extra three or four tags in the document with '{' and '}' as tag delimiters instead the normal '<' and '>'. It turned out to be a simple solution from both the developer's and user's points of view. The developer was happy because no extra code was required to be able to use OpenOffice, and the user was happy because he didn't have to deal with tags for the most part. OpenOffice happily passes around the special tags unmodified, and the application code sees them as normal tags. Tweaking mlParser was straightforward and the result was a major editing improvement from the user's perspective without affecting application code.

The "XML is just SGML without DTDs" sales pitch from the XML faction at the 1996 SGML Conference convinced me and the SGML community at large that XML was worth pursuing. I envisioned XML as a true SGML subset where I could transparently use the expensive SGML tools that I fought so hard to purchase over the years. New XML tools would certainly be less expensive and compatible. By the time that the XML 1.0 standard hit the streets, that fantasy was crushed. I never gave up the dream of transparent markup though. I wanted to be able to treat markup as SGML to use SGML tools and XML to use XML tools transparently. mlParser essentially allows me to treat both XML and SGML as just markup. The mlParser parsing engine recognizes the syntax differences between the two, and therefore fulfills the "XML is just SGML without DTDs" as "XML and SGML are just markup," which is even better!

mlParser is built around the following design/features:

  • Runs either from the command line or embedded in an application

  • Generates an ESIS output stream by default

  • Handles simple entity character substitution

  • Non-validating parser

  • Handles both SGML and XML empty elements

  • Allows nested parsing

  • Sets most parsing options from the command line

  • Handles multiple types of markup transparently (XML, SGML, WordML)

  • Can include or exclude comments from the output stream

  • Handles CDATA

  • Does not implement marked sections

  • Includes the option to stop on parsing errors or continue when possible

  • Built around ESIS events

  • Based on a relatively simple software state machine

  • Includes a sample output call-back class that implements a jump table for "document" applications

  • Supports multiple input sources

The same Java JAR file is used for both the stand-alone program and when embedded in an application. Simple entity substitution is driven via a Java properties file. If no entity property file is specified, a default set is used. Entities not found in the property file are passed thru unchanged. A full-blown entity implementation seemed to be almost as much effort as the rest of the parser; and for me, simple entity substitution is sufficient.

Several of my applications require nested parsing with different DTDs. An application simply invokes another parser object, handles the output from the nested parser object, and picks up where it left off in the original input stream.

The need to stop on a parsing error or continue transparently is application specific. Stopping on an error is the default behavior for stand-alone parser operation. An application processing a document may want to simply unwind the element stack, fire end tag call-backs and keep going. This works reasonably well for missing end tags, but can create a mess with a missing start tag. mlParser provides the option to stop or continue when errors are detected.

Multiple input sources is a useful feature. Markup traditionally comes from files and must be complete. mlParser accepts files, strings, or URNs. Parsing a string implies well-formed markup, but no <!DOCTYPE and no XML declaration. Transparently handling URNs allows an application to parse markup directly from the web or some external source such as a Subversion repository. Building a multi-file document directly from a Subversion repository ensures that the output document contains the latest checked-in files without a process to manually extract and process them.

The default output call-back interface for handling documents is constructed around two jump tables and Java Reflection. There is one jump table for start tags and one for end tags. The jump tables are constructed as HashMaps with the tag as the key, and the value is the Java method via Reflection. This approach creates a level of indirection between the tags and the code that processes them:

  • One method can handle multiple tags. For example, the same method can be used for dropping a tag and its contents or for passing the tag unchanged.

  • Only tags of interest have to be accounted for; all others can be automatically ignored. An unexpected tag will not cause an exception to be thrown or the program to bomb.

  • The source code does not contain page-after-page of nested if statements to determine which method to call to process a specific tag.

The following start tag call-back method uses the jump table approach:

   	public void writeStartTag(String sTag)
   	{
            String key     = sTag.toLowerCase();
            try{
                myMethod = (Method)sTagAction.get(key);
                myMethod.invoke(this,new Object[] {});
            }
            catch(Exception e){
                performDefaultStart();
            }
   	}

Handling a new tag requires no code change here. Instead, a new entry is simply added to the Start Tag Action hash. I find the jump table approach clean, flexible, and elegant. The jump table approach can be used with any SAX-based parser, not just mlParser.

mlParser has evolved over the years from a learning exercise with potential to a major part of my markup applications. The thought of implementing the current mlParser from scratch is a daunting one. Starting small and adding capabilities as needed is a doable job.

Summary

This paper discusses reasons to write your own markup parser and documents my journey to writing my own parser. mlParser started with very modest beginnings and has evolved into a primary tool in my markup application toolkit. I use it to generate ESIS streams that are processed by Perl scripts, for the core of an integrated software documentation tool, and for markup conversion applications.

I don't believe that I would have tackled writing a parser with all of the features and capabilities that are currently in mlParser. Starting with a small, simple, and flexible base allowed me to evolve a useful and capable parsing tool!

I found out along the way that while XML is king for record-oriented markup, I still prefer SGML for documents. SGML's inclusions, exclusions, and default attribute values make a developer's life much less stressful for document-centric applications.

The experience has shown that there are still many valid reasons to write a markup parser. This paper has identified and explored many of these reasons. Not finding a parser that meets your needs is what most of the reasons boil down to. Don't be afraid to follow the path to your own parser!

References

[Bajzat] Bajzat, Robert, Thinlet Home Page, http://www.thinlet.com/index.html.

[ClarkJ] Clark, James, Nsgmls Output Format, http://www.jclark.com/sp/.

[Smith1998] Smith, Norman E., Practical Guide to SGML/XML Filters, Wordware Publishing, Inc., ISBN 1-55622-587-3, © July 1998.

[Smith1997] Smith, Norman E., Write Your Own Programming Language Using C++, 2nd Edition, Wordware Publishing, Inc., ISBN 1-55622-492-3, © 1997.

[WikipediaMU] Wikipedia, Markup Language Definition, http://en.wikipedia.org/wiki/Markup_language.

[WikipediaSGML] Wikipedia, SGML Definition, http://en.wikipedia.org/wiki/SGML.

Author's keywords for this paper: parser; XML; SGML