<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2" xml:id="HR-23632987-8973"><title>Parser Possibilities: Why Write A Markup Parser</title><info><confgroup><conftitle>Balisage: The Markup Conference 2008</conftitle><confdates>August 12 - 15, 2008</confdates></confgroup><abstract><para>
 In the early days of XML, there seemed to be a new XML parser just about
 every week. This was in stark contrast to SGML where there might be half a
 dozen working parsers ever written. As XML matured and SAX became the first de facto XML
 parser API, the stream of new parsers slowed to a trickle.
 Once robust XML parsers, such as Expat,
 became widely available, there seemed little reason left to write your own parser.
 Expat is robust, fast, and still provides the XML underpinnings for many programming
 languages.
</para><para>I believe there remain many valid reasons for writing your own markup
	 language parser. This paper identifies reasons you might want to write a custom
	 parser and examines the choices I made writing mlParser.
</para></abstract><author><personname><firstname>Norman</firstname><othername>Earl</othername><surname>Smith</surname></personname><personblurb><para>Mr. Smith has been a software developer for 30+ years and involved with
markup applications starting with SGML in 1990. He has worked for SAIC for 26
years on a variety of projects ranging from automated document creation to
robotics to web applications. He has authored 12 books including two on
SGML/XML. Mr. Smith was selected as an SAIC Technical Fellow in 2004.</para></personblurb><affiliation><jobtitle>SAIC Technical Fellow and Assistant VP of Technology</jobtitle><orgname>Science Applications International Corp.</orgname></affiliation><email>smithno@saic.com</email></author><legalnotice><para>Copyright, Science Applications International Corporation. All rights reserved.  Unpublished rights reserved under copyright laws of the United States.</para></legalnotice><keywordset role="author"><keyword>parser</keyword><keyword>XML</keyword><keyword>SGML</keyword></keywordset></info><section><title>Introduction</title><para>XML and SGML parsers are available that are mature, robust,
		and widely used. It's not like the early days of XML when it seemed that every new
		week brought a new parser. Soon, the first SAX-based XML parsers appeared,
		followed by DOM (Document Object Model) parsers with standard APIs. Once
		high-quality parsers, such as Expat and AElfred, became widely available,
		the community seemed to lose interest in writing new parsers. Many of
		the early XML parsers were non-validating because they were considerably easier to write
		than validating parsers.</para><para>SAX was originally patterned after the ESIS (Element Structure Information Set)
        output from the SGMLS and NSGMLS
		SGML parsers. (<emphasis role="ital">See</emphasis> <xref linkend="ClarkJ"/>)
		DOM implementations usually use an underlying SAX parser to feed the
		content to the DOM objects. SAX-based applications are generally faster and
		require less memory than DOM-based applications. My experience is that an
		event-based parsing model, like SAX, can easily handle 80% to 90% of XML applications.
		Either model is appropriate for the next 5% to 10% of applications, and the last 5%
		of applications really need DOM or something similar.</para><para>Even with numerous XML parser options, I have never been completely
		satisfied with the available XML parsers because I still deal with SGML and other
		markup languages. The remainder of this paper looks at reasons for
		writing a parser. Questions that you should ask yourself before starting to write
		a custom parser and the road to writing my own mlParser are also examined.</para></section><section><title>What's A Markup Parser</title><para>A markup language, such as XML, is a language for writing application-level
           markup languages. HTML and DocBook are examples of application-level
		markup languages. The dual use of the term "markup language" is especially
		confusing to non-technical people. Wikipedia's definition of SGML
             (<emphasis role="ital">See</emphasis> <xref linkend="WikipediaSGML"/>) describes it
		as "a metalanguage in which one can define markup languages" and I believe the
		term metalanguage is appropriate. Wikipedia further defines markup language as "a
		set of annotations to text that describe how it is to be structured, laid out,
		or formatted."</para><para>Both SGML and XML use "&lt;" and "&gt;" to delimit markup.  Common usage
		has evolved to the point that just about any markup that uses "&lt;" and "&gt;"
		is called XML. Notice that I have been using the term "markup language parser" instead
		of "XML parser" for the most part so far. This is done on purpose. </para><para>There are four types of markup language
		in use that I know of:</para><orderedlist startingnumber="1"><listitem><para>SGML</para></listitem><listitem><para>XML</para></listitem><listitem><para>WordML</para></listitem><listitem><para>*SP</para></listitem></orderedlist><para>HTML and DocBook are not included in the list because they are
	 XML/SGML application-level markup languages, not meta-markup languages.
	 WordML and *SP are not meta-markup languages either. However, their syntax
	 differs enough from XML and SGML that standard parsers cannot handle
	 them. They are included to show markup that might require a
	 custom parser to process, and because mlParser can handle both.
	 </para><para>SGML, or Standard Generalized Markup Language, is an ISO standard. Its
		heritage is in the publishing industry and it has an IBM General Markup Language
		(GML) lineage. Computers in the early days had limited processing power for the
		individual, which translated to no SGML editor that performed real-time validation
		and formatting. Therefore, SGML contained many features to minimize the
		keystrokes required to enter content. For example, end tags could be declared optional.
		The structure of an SGML document is defined by a Document Type Definition (DTD).
		SGML is a markup metalanguage. My general belief is that SGML is still best for
		defining publishing markup languages.</para><para>XML, or eXtensible Markup Language, has its roots in the database
		world. It is a direct result of Tim Bray's work at OpenText on handling
		structured data that may or may not be SGML compliant. XML is also a markup
		metalanguage. Element structure may be defined through a variety of "languages"
		including DTDs, Schemas, and RELAX NG. It is an SGML subset with just enough
		syntax differences to prevent processing markup files with each other's tools.
		I know because I tried for years to treat markup as SGML for some tools and the
		same markup as XML with other tools with limited success. </para><para>XML also introduced the concept of valid versus well-formed XML documents. A
		valid document is one that has been validated successfully against its DTD or Schema with a
		validating parser. A well-formed document is one that has matching start and
		end tags and follows the other rules of XML markup syntax. A well-formed
		document may or may not be valid. For that matter, it may or may not have a DTD
		or Schema.</para><para>There is no concept of well-formed documents in SGML, only
		valid or not valid. Most SGML tools validate the document every time it is
		used. A common real-world view is that a document only needs to be validated in
		specific cases:</para><itemizedlist><listitem><para>The document is edited by a human.</para></listitem><listitem><para>During program development, until you are sure that valid markup is
			 generated.</para></listitem><listitem><para>When data is supplied from an external source, markup may or may not
			 need to be validated on every data exchange depending on
			 circumstances.</para></listitem></itemizedlist><para>WordML is my name for the output from saving a Microsoft Word document
		as Filtered HTML. The markup appears to follow some XML syntax rules, some SGML
		syntax rules, and some of its own, unique syntax rules.
		Much
		Word-specific data is stored in comments. I have even run across a few WordML
		documents that were not well formed! Few XML parsers can handle WordML.</para><para>*SP represents the various Server Page markup languages such as ASP and
		JSP. It is not often that a Java developer attempts to run Java Server Pages
		through an external parser. The JSP compiler normally takes care of the parsing.
		There are times when doing so can be very revealing. The output of the JSP compiler
		is a Java program that generates an HTML page. Normally, the embedded Java code
		generates dynamic values on the page. XML parsers do not normally handle
		*SP files.</para><para>Finally, the answer to the question "What's a markup parser?" A markup parser
		basically reads data that contains application-level markup, extracts tags and
		attributes from the markup, and generates some output. Validating parsers also
		read the structure definition in the form of a DTD, Schema, or other format.
		The output may range from structural error messages to ESIS output or anything
		in between. Some parsers, like OmniMark, have built-in programming languages.
		Others provide SAX or DOM programming language interfaces. The possibilities
		are limited only by the requirements of the application and imagination of the
		developer. A markup parser may handle multiple types of
		markup, not just XML or SGML, from the point of view presented here.</para><para>ESIS is the primary output for
		the SGMLS and NSGMLS parsers. It is a record-oriented format where the first
		character on each line represents the markup event and the rest of the line is
		data. For example, a start tag event is represented by a '(' event type, and
		the data on the rest of the line is the tag name.
		'(mytag' is the ESIS output generated for &lt;MYTAG&gt; in the input document.</para><para>ESIS is very significant from a historical perspective. Back in the prime
		SGML days of the early 1990s, SGMLS was just about the only widely available
		free SGML parser. Commercial SGML tools were all expensive because of the cost
		of either writing or licensing an SGML parser to include in the tools. Taking
		advantage of the SGMLS ESIS output was the only way to test-drive SGML without
		spending a lot of money. ESIS format is easy to process and a good programmer
		could do amazing things with ESIS output and a Perl script or two.
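</para><para>To make the record format concrete, here is a minimal Java sketch of an ESIS consumer. It is an illustration only: the class and method names (<emphasis role="bold">EsisDemo</emphasis>, <emphasis role="bold">describe()</emphasis>) are assumptions invented for this example, not part of SGMLS, NSGMLS, or mlParser, and only the '(', ')', and '-' record types that appear in this paper are recognized.</para><programlisting xml:space="preserve">
public class EsisDemo {

    /** Describes one ESIS record: the first character is the event type, the rest is data. */
    static String describe(String line) {
        if (line.isEmpty()) {
            return "empty record";
        }
        char type = line.charAt(0);
        String data = line.substring(1);
        switch (type) {
            case '(': return "start tag: " + data;   // e.g. "(mytag"
            case ')': return "end tag: " + data;     // e.g. ")mytag"
            case '-': return "data: " + data;        // e.g. "-John Doe"
            default:  return "other: " + line;       // record types not covered by this sketch
        }
    }

    public static void main(String[] args) {
        String[] esis = { "(name", "-John Doe", ")name" };
        for (String line : esis) {
            System.out.println(describe(line));
        }
    }
}
</programlisting><para>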
		ESIS also contains the idea behind SAX. The call-back events in programs
		logically work on an ESIS stream. </para></section><section><title>Why Write A Parser?</title><para>With the parser background out of the way, let's take a look at reasons
		both for and against writing your own parser. First, there are
		good reasons for not writing a custom parser:</para><itemizedlist><listitem><para>Writing a basic parser is a lot of work</para></listitem><listitem><para>Writing a validating parser is a lot more work</para></listitem><listitem><para>Implementing XSL/XSLT/XPath/XQuery, etc., support is not practical for most
			individual developers</para></listitem><listitem><para>Writing a parser, even a non-validating parser that implements SAX
		        and/or DOM, is a huge effort
			 </para></listitem><listitem><para>The structure definition languages needed for validation, such as DTD,
		  	   Schema, and RELAX NG, are numerous and complex</para></listitem></itemizedlist><para>Writing any markup parser is hard work. The additional complexity and
	        effort required to write a validating parser instead of a
	        non-validating/well-formed parser is significant. In many cases,
	        it passes the point of diminishing returns. A validating parser
	        requires code to parse a DTD or schema, plus code
		to validate the element structure at each tag transition.
Are the benefits worth the effort?</para><para>The "X" add-ons are also complex. They require a very sharp staff to
		implement the standards. The resources needed to properly implement
		XSL/XSLT/XPath/XQuery are substantial and are usually beyond the
		average individual developer on a short-term project.
		Complexity and the associated learning curve are the main
		impediments.</para><para>
		Both the SAX and DOM APIs are large and fairly complex, which means the
		average developer won't implement them
completely, if at all. I know from experience; I didn't
		bother implementing either.
A validating parser also requires code to load the allowable element
		structure before it can validate a document, with another increase in
		code complexity and size.</para><para>
	    All of these items represent increased implementation effort that just
		might not be worth the trouble if an existing parser meets most of your
		requirements. You have to carefully gauge the value proposition for
		each increase in effort that the next step brings. It may or may not be worth
		the effort.
	</para><para>On the other hand, there are still many reasons for writing your own
		markup parser. The rest of this section takes a look at some of those reasons.
		They include:</para><itemizedlist><listitem><para>The learning experience.</para></listitem><listitem><para>No existing parser meets your specific requirements.</para></listitem><listitem><para>You have complete control.</para></listitem><listitem><para>You can mix and match markup languages, i.e., SGML, XML,
			 etc.</para></listitem><listitem><para>Not tied to existing APIs.</para></listitem><listitem><para>You need to create documents with "live" content. </para></listitem></itemizedlist><para>The following paragraphs expand these points. </para><para><emphasis role="ital">The learning experience.</emphasis> Writing a markup parser, even a
		"simple" non-validating parser, is always a learning experience. There are a
		number of learning experience possibilities, such as:</para><itemizedlist><listitem><para>Learning about markup languages. Writing a parser will teach you
			 about markup language rules. Having to account for every possible condition
			 in the markup will quickly enlighten you!</para></listitem><listitem><para>Learning about software state machines. I have used the state
			 machine technique for writing a parser more than once. It is a straightforward
			 way to work through the program. An in-depth knowledge of the rules of the markup
			 language is required. There is a basic state machine for parsing SGML published
			 in the <emphasis role="ital">Practical Guide To SGML/XML Filters.</emphasis>
			 (<emphasis role="ital">See</emphasis> <xref linkend="SmithN"/>)
			 If you don't know
             what a State Transition Diagram is, you will learn a lot of new things
			 writing a markup parser.</para></listitem><listitem><para>Providing a vehicle to learn a new language. I wrote the first implementation
			 of mlParser because I needed to write a non-trivial program to learn
			 Java and I already understood the ins and outs of SGML.
			 This allowed me to concentrate
			 on learning the programming language and not the application.</para></listitem></itemizedlist><para>As you can see, there are many things that can be learned from the
		experience of writing a parser.</para><para><emphasis role="ital">No existing parser meets your specific requirements.</emphasis> This is
		not an unusual occurrence, especially if your application is a little
		non-standard. Reasons include:</para><itemizedlist><listitem><para>Need a light-weight parser</para></listitem><listitem><para>Need to parse multiple markup meta languages</para></listitem><listitem><para>The programming language for an application is incompatible with
		  existing parsers</para></listitem></itemizedlist><para>Robert Bajzat (<emphasis role="ital">See</emphasis> <xref linkend="Bajzat"/>)
        needed a light-weight XML parser for his Thinlet package,
		which is a small (39k) Java windowing framework aimed at cell phones. The
		on-screen widgets for a Thinlet-based application are in simple XML. The Thinlet
		XML parser handles its slightly restricted XML syntax. For example,
		a comment may not span more than one line.
		The Thinlet parser also is closely tied to the application
		and knows how to handle just the widget markup. The Thinlet parser is an
		integral part of the framework, and the short cuts in syntax make it tiny! </para><para>Incompatibility with existing parsers usually means the markup data
		buries information within comments, does not follow the syntax rules, or mixes
		and matches syntax rules from different markup meta languages. WordML is a good
		example of this. WordML is what I call the result of saving a Word document as
		"Filtered HTML." It appears to follow some SGML rules, some XML rules, invents
		a few rules, and puts a great deal of information into comments. Few parsers
		can handle this markup.</para><para><emphasis role="ital">You have complete control.</emphasis>
	        This is the real reason most people write their
		own version of an application. Tailoring a program to meet your system
		requirements has a strong draw for many people. I would rather write a
		hand-tuned set of Java classes to represent markup data, for example,
		than use one of the
		canned frameworks that seems to create hundreds of classes/methods when three
		or four well-designed classes are easier to understand and more efficient.
        </para><para>
		Control extends to which markup features are supported. The parser may not
		need to handle attributes at all if the data does not
		contain attributes and is record-oriented. The
		parser may do simple transforms as the input stream goes by that will
		significantly simplify downstream code. The parser
		doesn't always have to stop on
		errors. </para><para><emphasis role="ital">You can mix and match markup languages.</emphasis> I started out using
		SGML around 1990. I have been through the start of XML and the rise of Server Page
		languages such as JSP and ASP. There have been plenty of times when I wanted
		to mix and match SGML and XML files, in particular. Writing your own parser
		makes this practical. </para><para><emphasis role="ital">Not tied to existing APIs.</emphasis> The existing SAX and DOM APIs are
		large and complex. If you go back and examine ESIS closely, basic markup
		processing can be done with a much simpler subset API. But once you make the
		leap that you don't have to implement existing APIs, you are free to build what
		meets your application requirements. That said, I firmly believe that this is
		the least acceptable reason in my list for writing a parser.</para><para><emphasis role="ital">You need to create documents with "live" content.</emphasis> A favorite
		application technique of mine is creating a live document. By live, I mean that
		some portion of the content is dynamically generated. Executing an external
		script, nesting the output of another parse, running an SQL query, or retrieving
		a URN are some of the many things that can be incorporated transparently with a
		custom parser. The results just show up in the stream delivered to the downstream processing
		programs. A great deal of extra functionality can be transparently dropped into
		a markup-based application by including the ability to execute arbitrary
		programs/code on a parsing event.</para></section><section><title>The Road To mlParser</title><para>I initially wrote my markup language parser, mlParser, in 2002. The
		journey from being thrown into the SGML ocean in 1990 to the birth of mlParser
		and its subsequent evolution into a multi, meta-markup language parser was a
		long road. In the early SGML days, SGML tools were expensive. My customer wanted to
		use SGML as the exchange format for bibliographic data and could not justify
		the expense of purchasing SGML tools without being sure of success. We
		developed small-scale applications around SGMLS using
		Perl scripts to process ESIS output. It worked well, and I
		learned a lot about processing an ESIS stream. </para><para>The following markup file is used as input for sample code for the examples
	       that follow:</para><programlisting xml:space="preserve">
&lt;RECORDS&gt;
&lt;RECORD&gt;
  &lt;NAME&gt;John Doe&lt;/NAME&gt;
  &lt;PHONE&gt;555-123-4567&lt;/PHONE&gt;
  &lt;EMAIL&gt;JDoe@anymail.com&lt;/EMAIL&gt;
  &lt;STATE&gt;Confusion&lt;/STATE&gt;
&lt;/RECORD&gt;
&lt;RECORD&gt;
  &lt;NAME&gt;Jane Smith&lt;/NAME&gt;
  &lt;PHONE&gt;555-345-9876&lt;/PHONE&gt;
  &lt;EMAIL&gt;smithj@anymail.com&lt;/EMAIL&gt;
  &lt;STATE&gt;Nirvana&lt;/STATE&gt;
&lt;/RECORD&gt;
&lt;/RECORDS&gt;
</programlisting><para>The ESIS output from mlParser for the above markup is:</para><programlisting xml:space="preserve">
# mlParser (c) Science Applications International Corporation, 2002, 2006, 2007.
All rights reserved.
(records
-\n
(record
-\n
(name
-John Doe
)name
-\n
(phone
-555-123-4567
)phone
-\n
(email
-JDoe@anymail.com
)email
-\n
(state
-Confusion
)state
-\n
)record
-\n
(record
-\n
(name
-Jane Smith
)name
-\n
(phone
-555-345-9876
)phone
-\n
(email
-smithj@anymail.com
)email
-\n
(state
-Nirvana
)state
-\n
)record
)records
C
Done
</programlisting><para>The code described in the following paragraphs produces this output:
		</para><programlisting xml:space="preserve">
 Name: John Doe   Email: JDoe@anymail.com
 Name: Jane Smith   Email: smithj@anymail.com
</programlisting><para>Over the years, I wrote two other parsers. The first was part of a Forth
 (<emphasis role="ital">See</emphasis> <xref linkend="Forth"/>)
	interpreter. The only restriction on Forth function names is that the name
	cannot contain white space. Therefore, a function name can be a tag name. So,
	&lt;RECORD&gt; is both a tag and a function. A function
	definition begins with ':' and ends
	with ';'. Forth is a stack-oriented language that has a Reverse Polish syntax,
	which means that parameters come before the function name and no parens are
	necessary. A function is called by simply referencing its name.</para><para>The application implementation approach was:</para><itemizedlist><listitem><para>Define a function for each tag</para></listitem><listitem><para>Each function must consume characters up to the
		next tag/end tag</para></listitem><listitem><para>A document processed itself when fed to the
		  Forth interpreter</para></listitem></itemizedlist><para>The document effectively executed itself. It was an interesting idea
		that never got past the toy stage. The following is a snippet of
		Forth code used to
		process the markup file described above:</para><programlisting xml:space="preserve">
 String content
 : &lt;NAME&gt;
      " Name:" print
      content collect ;  \ Read the content for &lt;name&gt;
 : &lt;/NAME&gt;
      content print ;
 : &lt;EMAIL&gt;
      " E-Mail: " print
      content collect ;
 : &lt;/EMAIL&gt;
      content print
      " \n" print ;
 " test.xml" cload
</programlisting><para>The <emphasis role="bold">cload</emphasis> function loads and interprets the filename
		<emphasis role="bold">test.xml</emphasis>.
	The <emphasis role="bold">collect</emphasis> function consumes
	the characters up to the next '&lt;' and stores them in
	<emphasis role="bold">content.</emphasis>
	The document executes itself generating the
		output. The parser was implemented as a state machine.
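</para><para>The state machine idea can be sketched in a few lines of Java. This is a deliberately tiny illustration with two states, not the Forth interpreter's parser and not mlParser; the names <emphasis role="bold">TagScanner</emphasis> and <emphasis role="bold">scan()</emphasis> are assumptions made up for this example. It emits ESIS-like records for start tags, end tags, and content only.</para><programlisting xml:space="preserve">
public class TagScanner {

    enum State { CONTENT, TAG }

    /** Scans markup one character at a time and returns ESIS-like records, one per line. */
    static String scan(String markup) {
        StringBuilder out = new StringBuilder();
        StringBuilder buf = new StringBuilder();
        State state = State.CONTENT;

        for (char c : markup.toCharArray()) {
            switch (state) {
                case CONTENT:
                    if (c == '&lt;') {                 // tag open: flush any pending content
                        if (buf.length() &gt; 0) {
                            out.append('-').append(buf).append('\n');
                        }
                        buf.setLength(0);
                        state = State.TAG;
                    } else {
                        buf.append(c);
                    }
                    break;
                case TAG:
                    if (c == '&gt;') {                 // tag close: emit a start or end event
                        String name = buf.toString().toLowerCase();
                        if (name.startsWith("/")) {
                            out.append(')').append(name.substring(1));
                        } else {
                            out.append('(').append(name);
                        }
                        out.append('\n');
                        buf.setLength(0);
                        state = State.CONTENT;
                    } else {
                        buf.append(c);
                    }
                    break;
            }
        }
        if (state == State.CONTENT &amp;&amp; buf.length() &gt; 0) {
            out.append('-').append(buf).append('\n');   // flush trailing content
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(scan("&lt;NAME&gt;John Doe&lt;/NAME&gt;"));
    }
}
</programlisting><para>Fed the string &lt;NAME&gt;John Doe&lt;/NAME&gt;, the sketch emits the three records (name, -John Doe, and )name. Attributes, comments, entity references, and the newline escaping seen in real ESIS output are deliberately left out.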
	 </para><para>My next parser was a Perl library, which was based completely on regular
		expressions. The primary functions were:</para><programlisting xml:space="preserve">
 $content = &amp;get_XML_field($string,"tag");
 @results = &amp;get_next_XML_field($string,"tag");
</programlisting><para>
            <emphasis role="bold">get_XML_field()</emphasis> returns the contents
            of &lt;TAG&gt; from the string of markup. It is useful when processing
            record-oriented markup one record at a time.
            <emphasis role="bold">get_next_XML_field()</emphasis> extracts a
            repeating tag from the markup string, returning a status, the tag
            contents, and the remainder of the markup string. Typically, the whole
            file is read into a string and <emphasis role="bold">get_next_XML_field()</emphasis>
            extracts the next record to operate on. Then calls to
            <emphasis role="bold">get_XML_field()</emphasis> pull out the fields
            individually for processing.
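</para><para>For readers who do not use Perl, the same regular-expression technique might look like the following in Java. This is a sketch under stated assumptions, not a port of the library: <emphasis role="bold">FieldExtractor</emphasis> and <emphasis role="bold">getField()</emphasis> are hypothetical names, and returning an empty string for a missing tag is a choice made for this example only.</para><programlisting xml:space="preserve">
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExtractor {

    /**
     * Returns the content of the first &lt;tag&gt;...&lt;/tag&gt; pair found in markup,
     * or an empty string when the tag is absent. Attributes are not handled.
     */
    static String getField(String markup, String tag) {
        String name = Pattern.quote(tag);
        // (?s) lets '.' match newlines; the reluctant .*? stops at the first end tag.
        Pattern p = Pattern.compile("(?s)&lt;" + name + "&gt;(.*?)&lt;/" + name + "&gt;");
        Matcher m = p.matcher(markup);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String record = "&lt;RECORD&gt;\n"
                      + "  &lt;NAME&gt;John Doe&lt;/NAME&gt;\n"
                      + "  &lt;EMAIL&gt;JDoe@anymail.com&lt;/EMAIL&gt;\n"
                      + "&lt;/RECORD&gt;";
        System.out.println("Name: " + getField(record, "NAME")
                + "   Email: " + getField(record, "EMAIL"));
    }
}
</programlisting><para>As with the Perl library, this approach ignores attributes and assumes that the markup being searched fits in memory as a single string.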
         </para><para>The approach for using this parser was:</para><itemizedlist><listitem><para>Read data into a string</para></listitem><listitem><para>Extract a "wrapper" element into another string</para></listitem><listitem><para>Extract individual fields from the wrapper string</para></listitem><listitem><para>Do whatever processing is needed</para></listitem></itemizedlist><para>The following Perl code snippet generates the example output:</para><programlisting xml:space="preserve">
 require "xml.pl";

 my $file;
 my $record;
 my $name;
 my $email;
 my $template ="Name: \$name   Email: \$email\n";

 $/ = "&lt;/RECORD&gt;";

 while(&lt;STDIN&gt;){   # Read a record
      next unless /&lt;RECORD&gt;/;   # Skip trailing text after the last record
      $record = $_;
      $name   = &amp;get_XML_field($record,"NAME");
      $email  = &amp;get_XML_field($record,"EMAIL");
      eval("print \"$template\"");
 }
 exit;
</programlisting><para>This Perl parsing library has been used successfully in several production
		applications. The constraints are that the markup cannot contain attributes
		and that each record, however large, has to fit in memory. </para><para>Both of these solutions worked for a small subset of applications with
		simplified markup. They handle start tags, content, and end tags and that's
		about it. Neither ever grew into a robust, general-purpose parser.
</para><para>The following code snippet is part of the mlParser program to
        process the sample input file. It has the call-backs for the parser
        output events. The simple nature of the input markup makes for an
        extremely simple Java example. </para><programlisting xml:space="preserve">
    ...
    // Requires: import java.util.HashMap;
    HashMap&lt;String, String&gt; element = new HashMap&lt;&gt;();
    String  content = "";
    ...
    public void writeStartTag(String sTag)
    {
        String tag = sTag.toLowerCase();

        if(tag.equals("record")){
            element.clear();    // Wipe the hash for each record.
        }
    }
    ...
    public void writeEndTag(String eTag)
    {
        String tag = eTag.toLowerCase();

        if(tag.equals("record")){
            // All fields of the record have been collected; print the summary line.
            System.out.print("Name: "     + element.get("name")  +
                             "   Email: " + element.get("email") + "\n");
        }else{
            element.put(tag, content);  // Collect all element content in a hash.
        }
    }
</programlisting><para>Assume that the content callback happens and leaves the
         content in the <emphasis role="ital">content</emphasis> global variable.
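</para><para>A self-contained sketch of that assumption follows. The content call-back name, <emphasis role="bold">writeContent()</emphasis>, is hypothetical because the real mlParser method is not shown in this paper; the <emphasis role="bold">RecordPrinter</emphasis> class, its <emphasis role="bold">report()</emphasis> method, and the hand-written event sequence in <emphasis role="bold">main()</emphasis> are likewise stand-ins for the parser driving the call-backs.</para><programlisting xml:space="preserve">
import java.util.HashMap;
import java.util.Map;

public class RecordPrinter {
    private final Map&lt;String, String&gt; element = new HashMap&lt;&gt;();
    private final StringBuilder report = new StringBuilder();
    private String content = "";

    public void writeStartTag(String sTag) {
        if (sTag.equalsIgnoreCase("record")) {
            element.clear();                 // wipe the hash for each record
        }
    }

    // Hypothetical content call-back: remembers the most recent character data.
    public void writeContent(String text) {
        content = text;
    }

    public void writeEndTag(String eTag) {
        String tag = eTag.toLowerCase();
        if (tag.equals("record")) {
            // Every field of the record has been collected by now.
            report.append("Name: ").append(element.get("name"))
                  .append("   Email: ").append(element.get("email")).append('\n');
        } else {
            element.put(tag, content);       // tag name is the key, content the value
        }
    }

    public String report() {
        return report.toString();
    }

    public static void main(String[] args) {
        RecordPrinter p = new RecordPrinter();
        // Simulated event stream for one record of the sample input.
        p.writeStartTag("RECORD");
        p.writeStartTag("NAME");
        p.writeContent("John Doe");
        p.writeEndTag("NAME");
        p.writeStartTag("EMAIL");
        p.writeContent("JDoe@anymail.com");
        p.writeEndTag("EMAIL");
        p.writeEndTag("RECORD");
        System.out.print(p.report());
    }
}
</programlisting><para>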
         The only processing needed in the <emphasis role="bold">writeStartTag()</emphasis>
         method is clearing the <emphasis role="ital">element</emphasis> hash. All of
         the other processing occurs in <emphasis role="bold">writeEndTag().</emphasis> When
         the end tag is &lt;/RECORD&gt;, we know that all of the fields in the
         record have been collected. Simply putting the content in a hash with
         the tag as the key is a convenient way to collect the data without
         testing for each tag as it passes by. </para></section><section><title>Mixing SGML And XML</title><para>I am an old-time SGMLer who was as skeptical as the next person about XML
		when it first came out of the closet at the 1996 SGML Conference. The false
		promise that got a lot of the SGML community on the XML train was that "XML is
		just SGML without DTDs" and XML is a subset of SGML. I had often wished to
		process SGML files without a DTD. The implication I misread into the original
		XML discussions was being able to process SGML markup with XML tools. </para><para>Not being able to do much without a DTD was always an SGML issue for two
		reasons. First, I never saw the need to validate an SGML file every time it
		was touched. An SGML file only needs to be validated when modified. Second, I
		received SGML files from external sources without a DTD with enough regularity
		that having to reverse engineer a DTD in order to be able to use SGML tools
		became a real annoyance. Most of the time, only a handful of tags were
		processed. Writing what was essentially a throw-away DTD always rubbed me the
		wrong way. </para><para>An XML well-formed document versus a valid document was the leap that was
		supposed to enable the "SGML without DTDs" concept. Well, that didn't quite
		happen. By the time XML hit the streets, minor syntax changes and the fact that
		non-validating parsers threw errors when processing virtually every SGML file made
		feeding an SGML file to a non-validating XML parser a waste of CPU cycles.
		Error examples include:</para><itemizedlist><listitem><para>The first line of the file had to be the XML declaration
		(&lt;?xml version="1.0" encoding="UTF-8"?&gt;),
		which is a processing instruction in SGML.</para></listitem><listitem><para>The SGML empty tag representation causes
                          the document not to be well-formed and
                          therefore causes parsing problems.</para></listitem><listitem><para>Any entity reference outside the five predefined XML entities (for &lt;, &gt;,
		  &amp;, ', and ") threw errors. </para></listitem></itemizedlist><para>I eventually found xmln, which seemed to be the answer
		for a while. The C source was available and it generated an ESIS-like output
		stream, which meant it could be utilized by programs
		such as the Perl scripts I had already written
		to use the
		ESIS output from SGMLS. I thought I could modify xmln to handle SGML files as
		well but could never track down one of the C header files, which forced an end
		to my customization attempts.</para><para>I attempted to use markup interchangeably as SGML and XML for a couple
		of years. I finally threw in the towel when a couple of developers started using
		Java XML tools on a project. Since then, I consider SGML and XML
		cousins at best. XML is not
		a subset of SGML as originally advertised!</para><para>
		I work with some systems that started life as SGML applications. Eventually,
		the data and code will be migrated to XML data and tools. In the meantime,
		wouldn't it be convenient to mix SGML data and XML data transparently rather
		than have to do a conversion? I believe this is still a valid requirement for
		mixing XML and SGML interchangeably and a good reason to write a custom
		parser.
	 </para><para>This section discusses my frustration mixing SGML and XML and
	        the road to writing my own parser. mlParser
		did not happen directly as a result; rather, the knowledge gained along the way
		was applied to its initial implementation and eventual evolution.</para></section><section><title>mlParser</title><para>In 2002, my next project required that I be fluent in Java. I had done
		light Java maintenance up to that point. The project seemed pretty important, so
		I needed to be productive on day one! I knew several programming
		languages at that point - getting up to speed did not seem impossible. The
		obvious approach was writing a non-trivial Java program. </para><para>I decided to implement the parser described by the software finite state
		machine from
		<emphasis role="ital">Practical Guide to SGML/XML Filters</emphasis>
		 (<emphasis role="ital">see</emphasis> <xref linkend="SmithN"/>). Having experimented
		with both state machines and simple parsers in the past, I knew the technique
		and subject area. This allowed me to concentrate on learning Java while
		writing the parser. </para><para>The initial implementation started as a simple, pretty-much exact
		implementation from the book. The parser generated an ESIS output stream. I
		verified the output by comparing the mlParser ESIS output
		with the NSGMLS output. The
		first implementation was much easier than I had expected. </para><para>The second version implemented a few of the XML syntax differences, such
		as the <emphasis role="ital">&lt;TAG/&gt;</emphasis> empty tag. By the time I
                got the parser digesting basic SGML
		and XML, I began to see the potential for my own parser! </para><para>The third iteration was a restructuring of the code with interfaces for
		input processing, the parsing engine, and the output call-back class. My
		thinking was: </para><itemizedlist><listitem><para><emphasis role="ital">Input Interface.</emphasis> The assumption was
                         that a developer could supply
			 a markup stream from any source, not just a file. The default input interface
			 class handles the file as a stream, which was a good choice because of the
			 large number of input types that can be mapped into a stream. There has been no
			 need to implement another input class, but I still see the
			 potential for custom
			 input classes.</para></listitem><listitem><para><emphasis role="ital">Parsing Engine.</emphasis> The interface makes it possible to replace
			 the default state machine with some other parsing engine. The state machine is the
			 heart of mlParser and its ability to parse multiple meta-markup languages is
			 an important capability for me, so I don't foresee replacing the current state machine.
			 However, it is possible.</para></listitem><listitem><para><emphasis role="ital">Output Call-Backs.</emphasis> Each markup event triggers a call-back
			 to an interface-defined method. The interface is organized around ESIS events
			 plus a couple of additions, such as document start and end. The interface is reminiscent of
			 SAX, only a great deal simpler. This interface is the hook to embed mlParser into
			 applications.</para></listitem></itemizedlist><para>Two additional implementation cycles added a couple of significant
		capabilities: parsing WordML and *SP (server page) files. Not long after SGML and XML parsing
		were stable, I saved a Microsoft Word document
                as <emphasis role="ital">Filtered HTML</emphasis> and
		fed it to mlParser. mlParser did not get very far. I was disappointed, but not
		surprised. There was no immediate requirement to parse Word output, but by this time I wanted
		mlParser to handle all common markup types transparently. </para><para>Examination of a WordML file reveals that it follows some SGML rules,
		some XML rules, and invents a few of its own. Additionally, a great deal of
		formatting data is stored in comments. I wrote a program to convert WordML to a
		very generic HTML as an excuse to keep tweaking the parser. Successfully
		parsing WordML became an obsession and mlParser was a time-consuming hobby
		at this point. </para><para>I volunteered to convert Word documents to HTML many times over the next
		couple of years. Each document seemed to present some unaccounted-for nuance
		that required a code tweak to the state machine class. Together, mlParser and the WordML
		converter program now handle most Word documents, although the generated HTML
		is often ugly. The HTML is usually good enough to pass a validating parser with
		no significant validation errors.</para><para>I put some thought into what additional features would be necessary to
		make mlParser useful in a production application. The obvious items
		included:</para><itemizedlist><listitem><para>Identifying empty tags without a DTD or Schema</para></listitem><listitem><para>Simple entity resolution</para></listitem><listitem><para>Setting most options via the command line</para></listitem><listitem><para>The option to stop on errors or continue processing a file</para></listitem><listitem><para>Implementing additional input data sources such as strings and
		  URNs</para></listitem><listitem><para>Allowing nested parsing</para></listitem></itemizedlist><para>SGML and XML have different markup syntax for empty elements.
                XML uses the form &lt;Tag/&gt;, while SGML uses a start tag
                with no end tag. The XML form makes it possible to recognize an empty
                element without a DTD or Schema. The SGML form requires that empty
                elements be explicitly identified. The mlParser approach is simply a file
                that lists the empty elements and is loaded into a hash at
                startup. It's simple and effective. When a missing end tag is
                detected, a check of the hash determines whether or not the element is an
                empty element, and the processing handles the
                condition correctly. This approach allows
                SGML files that contain empty elements to work properly. </para><para>
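A minimal sketch of this lookup follows. It is an illustration rather than mlParser source: the class name EmptyElementSketch, the one-name-per-line file format, and the case-insensitive matching are all assumptions.</para><programlisting xml:space="preserve"><![CDATA[
import java.io.BufferedReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not mlParser source: remembers which elements
// may legitimately appear without an end tag.
class EmptyElementSketch {
    private final Set<String> emptyElements = new HashSet<>();

    // Assumed file format: one element name per line; blank lines are skipped.
    void load(Reader source) {
        new BufferedReader(source).lines()
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .forEach(name -> emptyElements.add(name.toLowerCase(Locale.ROOT)));
    }

    // Called when a missing end tag is detected for elementName.
    boolean isEmptyElement(String elementName) {
        return emptyElements.contains(elementName.toLowerCase(Locale.ROOT));
    }
}
]]></programlisting><para>When the parser notices a missing end tag, a call such as isEmptyElement("br") tells it whether the element was simply empty or whether an end tag really is missing.</para><para>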
	       Simple entity substitution poses a problem similar to that of empty
               elements: how do you represent entity values without a DTD
               or Schema?  The mlParser solution is a simple Java properties
               file with the entity string as the name and the substitution
               value as the value. The default entity translation file looks
               like this:</para><programlisting xml:space="preserve"> &amp;amp;=&amp;
 &amp;lt;=&lt;
 &amp;gt;=&gt;
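</programlisting><para>A file in this format could be consumed as in the following sketch. It assumes that the standard java.util.Properties class does the parsing; the class EntityTableSketch and its methods are hypothetical names, not mlParser code. A reference with no entry is returned unchanged.</para><programlisting xml:space="preserve"><![CDATA[
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not mlParser source.
class EntityTableSketch {
    private final Properties entities = new Properties();

    // Loads name=value lines; the entity reference itself is the property name.
    void load(String propertiesText) {
        try {
            entities.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the substitution value, or the reference unchanged when it is unknown.
    String resolve(String entityReference) {
        return entities.getProperty(entityReference, entityReference);
    }
}
]]>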
</programlisting><para>If no entity property file is specified, a default set is used. Entities not
		found in the property file are passed through unchanged
		instead of generating an error. </para><para>Some applications, such as a browser or a document builder,
                may require that a markup parser continue even when errors occur.
                Other applications may need to stop on each error so that it can be
                corrected. mlParser can do either and allows the user to specify
                the behavior as a command-line option. Applications that embed
                mlParser can set the option via a setter method.</para><para>A definite requirement for using mlParser in several of my
               applications is the ability to launch nested parses. Instantiating
               another parser object is straightforward and adds almost no complication
               to application code.</para><para>The Java JSP compiler is pretty forgiving, and as long as it can parse the
		Java code out of "&lt;%" and "%&gt;" delimiters, it is happy. What happens when
		the JSP compiler is successful and there is a missing angle bracket or two
		in the HTML markup? The result is usually
		unexplained "stuff" on the generated HTML page. What do you do? You
		use mlParser, which can handle the server page
		syntax, and parse the problem JSP file. Even if the parser only checks that the
		file is well formed, you may find problems. This actually happened on one project.
		mlParser showed that the JSP file was not well formed. When the markup was fixed in
		the JSP file, the unexplained "stuff" went away.</para><para>At this point, mlParser is mature with the capabilities identified above
		and is used in multiple production applications on multiple projects.</para><para>The remainder of this section gives a brief overview of the Software
		Finite State Machine that is the mlParser parsing engine. I first ran across
		state machines about 20 years ago in a presentation about navigation in an
		Adventure game where the current state represented a room or location on the adventure
		map. The direction selection by the user caused the state to transition to a
		new location. State machines have fascinated me ever since and I have managed to use
		them a few times over the years. Markup parsers are close to an ideal
		state machine application. </para><para>The following State Transition Diagram shows the states for a simple
		SGML parser:</para><figure xml:id="StateDiagram" floatstyle="1" xreflabel="SGML State Transistion Diagram"><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Smith01/Smith01-001.jpg" width="100%"/></imageobject><caption><para>Simple SGML State Transition Diagram</para></caption></mediaobject></figure><para>The state machine has to handle every character in the input stream. The
		initial state is TEXT. State changes often correspond to parsing events, although
		the State Transition Diagram does not show the call-back points. All parsing
		states eventually return to the TEXT state.</para><para>Two characters are significant in the TEXT state: Start Tag Open
		(STAGO), which is '&lt;', and Entity Reference Open (ERO), which is '&amp;'.
		STAGO fires a call-back to the Content method and changes the state to TAG. </para><para>ERO changes to the ENTITY state to handle collection and substitution of
		the entity. A ';' or white-space character triggers a transition back into the TEXT
		state. All other characters keep the state machine in the ENTITY
		state and are accumulated to form the entity name.</para><para>The TAG state is entered from the TEXT state via the '&lt;' character. A
	        white-space
		character triggers a transition into the ATTRIBUTE Name sub-state. The Tag Close (TAGC)
		character '&gt;' exits the TAG state back to the TEXT state and fires a call-back to
		the start tag method. Call-backs to the attribute method fire at the end of
		each attribute. This is one difference between mlParser and SAX: mlParser fires
		a call-back for each attribute, whereas SAX passes the attributes to the start element call-back in an
		<emphasis role="ital">Attributes</emphasis> object. In both, the attributes are available when the start tag
		call-back executes.</para><para>There are two sub-states for collecting attributes: Attribute Name and
		Attribute Value. The top-level state is ATTRIBUTE. The need for sub-states may
		not be obvious, but the approach works well. The attribute name ends with either an '='
		character or a '&gt;' character. In either case, a state change is triggered.
		The Attribute Value sub-state is a little complicated because the value may be
		enclosed in quotes or simply terminated by white space. The end of Attribute
		Value sub-state also signals the end of ATTRIBUTE. The current state changes
		back to TAG because the TAGC character will eventually be found.
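</para><para>The transitions described so far can be condensed into a short sketch. It is an illustration under stated assumptions, not the mlParser state machine: the class TinyMarkupStateMachine and its event strings are hypothetical, call-backs are replaced by a list of recorded events, entity substitution is omitted, a tag name starting with '/' is assumed to be an end tag, no white space is expected around '=', and only one error condition is checked.</para><programlisting xml:space="preserve"><![CDATA[
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not mlParser source. Events are recorded as strings
// where a real parser would fire call-backs on its output interface.
class TinyMarkupStateMachine {
    enum State { TEXT, TAG, ATTR_NAME, ATTR_VALUE, ENTITY }

    final List<String> events = new ArrayList<>();
    private State state = State.TEXT;                        // initial state
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder name = new StringBuilder();  // tag or entity name
    private final StringBuilder attrName = new StringBuilder();
    private final StringBuilder attrValue = new StringBuilder();
    private char quote;                                      // 0 = unquoted value

    void parse(String input) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case TEXT:       inText(c);      break;
                case TAG:        inTag(c);       break;
                case ATTR_NAME:  inAttrName(c);  break;
                case ATTR_VALUE: inAttrValue(c); break;
                case ENTITY:     inEntity(c);    break;
            }
        }
        flushText();
    }

    private void inText(char c) {
        if (c == '<')      { flushText(); name.setLength(0); state = State.TAG; }
        else if (c == '&') { name.setLength(0); state = State.ENTITY; }
        else               { text.append(c); }               // a lone '>' is just text
    }

    private void inEntity(char c) {
        if (c == ';' || Character.isWhitespace(c)) {
            // Substitution is omitted: the reference is passed through unchanged.
            text.append('&').append(name).append(c);
            state = State.TEXT;
        } else {
            name.append(c);
        }
    }

    private void inTag(char c) {
        if (c == '>')                       { endOfTag(); }
        else if (c == '<')                  { throw new IllegalStateException("'<' inside a tag"); }
        else if (Character.isWhitespace(c)) { state = State.ATTR_NAME; }
        else                                { name.append(c); }
    }

    private void inAttrName(char c) {
        if (c == '=')                       { quote = 0; state = State.ATTR_VALUE; }
        else if (c == '>')                  { endOfAttribute(); endOfTag(); }
        else if (Character.isWhitespace(c)) { endOfAttribute(); }   // name without a value
        else                                { attrName.append(c); }
    }

    private void inAttrValue(char c) {
        if (quote == 0 && attrValue.length() == 0 && (c == '"' || c == '\'')) {
            quote = c;                                        // opening quote
        } else if (quote != 0 && c == quote) {
            endOfAttribute(); state = State.TAG;              // closing quote
        } else if (quote == 0 && Character.isWhitespace(c)) {
            endOfAttribute(); state = State.ATTR_NAME;        // next attribute may follow
        } else if (quote == 0 && c == '>') {
            endOfAttribute(); endOfTag();
        } else {
            attrValue.append(c);
        }
    }

    private void endOfAttribute() {
        if (attrName.length() > 0) {
            events.add("attribute " + attrName + "=" + attrValue);
        }
        attrName.setLength(0);
        attrValue.setLength(0);
    }

    private void endOfTag() {
        // Assumption: a name starting with '/' marks an end tag.
        boolean end = name.length() > 0 && name.charAt(0) == '/';
        events.add(end ? "end " + name.substring(1) : "start " + name);
        state = State.TEXT;
    }

    private void flushText() {
        if (text.length() > 0) {
            events.add("text " + text);
            text.setLength(0);
        }
    }
}
]]></programlisting><para>Under those assumptions, parsing &lt;p class="x"&gt;A &amp;amp; B&lt;/p&gt; records four events in order: attribute class=x, start p, text A &amp;amp; B, and end p. The attribute event precedes the start tag event, so the attribute is available when the start tag call-back runs.</para><para>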
		In practice, there are sub-states for several states. A global
		variable keeps track of the current state. </para><para>The ability to look ahead one
		character simplifies the state machine code significantly. Reading the next
		character needs to be a method and not just a read loop around the state cases.
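</para><para>One way to provide such a method is to wrap the input in a small class. The sketch below assumes the standard java.io.PushbackReader; the class name LookAheadInput and its read and peek methods are hypothetical and are not taken from mlParser.</para><programlisting xml:space="preserve"><![CDATA[
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not mlParser source: a read method with
// one character of look-ahead.
class LookAheadInput {
    static final int EOF = -1;
    private final PushbackReader in;

    LookAheadInput(Reader source) {
        in = new PushbackReader(source, 1);   // buffer for one pushed-back character
    }

    // Consumes and returns the next character, or EOF at end of input.
    int read() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the next character without consuming it, or EOF at end of input.
    int peek() {
        int c = read();
        if (c != EOF) {
            try {
                in.unread(c);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return c;
    }
}
]]></programlisting><para>Any state or sub-state handler can then call read or peek wherever it needs to, for example to check whether a '/' is immediately followed by '&gt;' before deciding how to treat the tag.</para><para>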
		The sub-states certainly contribute to needing to be able to read characters at
		any point, and I believe a read character method is the right approach for this
		type of state machine application.</para><para>The state transition diagram certainly helps in thinking through a state
		machine implementation at its start. I usually draw
		these diagrams during design and then prepare
		a State Transition Table by the time I start coding. It speeds up the coding
		process significantly. The following State Transition Table corresponds to the
		State Transition diagram above:</para><table border="1" cellpadding="2" cellspacing="0"><caption><para>State Transition Table</para></caption><thead><tr><th>Current State</th><th colspan="7">Character</th></tr></thead><tbody><tr><td/><td>STAGO (&lt;)</td><td>Whitespace</td><td>TAGC (&gt;)</td><td> = </td><td>ERO(&amp;)</td><td>ERC(;)</td><td>Other</td></tr><tr><td>Text</td><td>Tag</td><td>Text</td><td>Text</td><td>Text</td><td>Entity</td><td/><td>Text</td></tr><tr><td>Tag</td><td/><td>Attribute</td><td>Text</td><td/><td/><td/><td>Tag</td></tr><tr><td>Attribute</td><td/><td/><td>Tag</td><td>Switch sub-states</td><td/><td/><td>Attribute</td></tr><tr><td>Entity</td><td/><td>Text</td><td/><td/><td/><td>Text</td><td>Entity</td></tr></tbody></table><para>The table contains more details than the diagram. The table also points
		out potential error conditions. Any cell without an entry
                usually represents an error
		condition that should be examined closely before
                deciding if it is truly an error condition
		or if it can be ignored. For example, the cell at TEXT state and TAGC was blank
		in the initial version. The cell should have had TEXT because a lone '&gt;' is
		just another text character. The
                empty cell at TAG state and STAGO
		is definitely an error and must be handled.</para><para>A software finite state machine is the ideal implementation technique
		for a markup parser. There are a relatively small number of states, a limited
		number of state change conditions, and the implementation is not overly
		complex. The current State Machine class in mlParser is about 1800 lines of
		heavily commented Java. The whole parser is about 3000
                lines and the compiled mlParser JAR file is only 22K.</para></section><section><title>mlParser Today</title><para>This section covers mlParser in its present form, including the design choices I made
		and the features I implemented. mlParser has grown and
		evolved a great deal since that first Java implementation. It is a
		robust markup parsing tool that is at the heart of several internal
		applications and is being used on multiple projects. The most significant
		mlParser use is as the core for an integrated software documentation tool where
		it has displaced OmniMark applications.</para><para>Early sections of this paper identified reasons for writing a parser and
		design choices to make before you get too far into the effort. My initial
		reason for writing mlParser was learning Java. My reasons for continuing
		development evolved almost as much as the code itself. My current list
		is:</para><itemizedlist><listitem><para>Mixing different markup meta-languages</para></listitem><listitem><para>An application development framework</para></listitem><listitem><para>An easily customizable parsing package when XML-type markup just
			 won't work</para></listitem><listitem><para>Fulfilling the "SGML without DTDs" vision</para></listitem><listitem><para>Replacing all my OmniMark code</para></listitem></itemizedlist><para>The integrated software documentation tool has been in use for several
		years; there are thousands of SGML files across many projects. The software
		will eventually migrate completely to XML;
		however, the legacy SGML files will never be converted
		to XML. There will be a period where
                documents will be built from some SGML files and
		some XML files. The system must handle the legacy SGML on demand for the
		foreseeable future. mlParser enables this mixed markup environment. </para><para>An application development framework has grown up with mlParser. The
		output call-back interface forms the basis for mlParser application
		development. A new application can often be implemented with just two classes.
		One class is the main program that collects command line arguments, registers
		the output call-back class, and launches the parsing engine. Starting the
		parsing process is done by simply invoking the state machine class. Simple
		applications usually do not require additional classes. Applications that
		handle multiple document types will need to invoke multiple parser objects with
		associated output call-back classes. Each document type should have its own
		parsing object.</para><para>My intimate knowledge of the internals, especially the state machine,
		allows me to customize the parser for specific applications. For example, there
		is an application that uses an HTML subset plus a handful of extra tags. I want
		to use generic, out-of-the-box OpenOffice as the editor so the user never
		sees a tag. We tried processing instructions in
		place of the extra tags without success. For OpenOffice to work as the editor
		for this application, editing must be done in normal WYSIWYG mode. Our solution
		was to include the extra three or four tags in the document with '{' and '}' as tag
		delimiters instead of the normal '&lt;' and '&gt;'. It turned out to be a simple
		solution from both the developer's and user's points of view. The developer was
		happy because no extra code was required to be able to use OpenOffice, and the
		user was happy because he didn't have to deal with tags
                for the most part. OpenOffice happily
		passes around the <emphasis role="ital">special</emphasis>
                tags unmodified, and the application code sees
		them as normal tags. Tweaking mlParser was straightforward and the result was
		a major editing improvement from the user's perspective without
                affecting application code.</para><para>The "XML is just SGML without DTDs" sales pitch from the XML faction at
		the 1996 SGML Conference convinced me and the SGML community at large that
		XML was worth pursuing. I envisioned XML as a true SGML subset where I could
		transparently use the expensive SGML tools that I fought so hard to purchase
		over the years. New XML
		tools would certainly be less expensive and compatible. By the time that the
		XML 1.0 standard hit the streets, that fantasy was crushed. I never gave up the
		dream of transparent markup though. I wanted to be able to treat markup as SGML
		to use SGML tools and XML to use XML tools transparently. mlParser essentially
		allows me to treat both XML and SGML as just <emphasis role="ital">markup.</emphasis> The mlParser
		parsing engine recognizes the syntax differences between the two, and therefore
		fulfills the "XML is just SGML without DTDs" promise as "XML and SGML are just markup,"
		which is even better!</para><para>mlParser is built around the following design/features:</para><itemizedlist><listitem><para>Runs either from the command line or embedded in an
			 application</para></listitem><listitem><para>Generates an ESIS output stream by default</para></listitem><listitem><para>Handles simple entity character substitution</para></listitem><listitem><para>Non-validating parser</para></listitem><listitem><para>Handles both SGML and XML empty elements</para></listitem><listitem><para>Allows nested parsing</para></listitem><listitem><para>Sets most parsing options from the command line</para></listitem><listitem><para>Handles multiple types of markup transparently (XML, SGML,
			 WordML)</para></listitem><listitem><para>Can include or exclude comments from the output stream</para></listitem><listitem><para>Handles CDATA</para></listitem><listitem><para>Does not implement marked sections</para></listitem><listitem><para>Includes the option to stop on parsing errors or continue when
			 possible</para></listitem><listitem><para>Built around ESIS events</para></listitem><listitem><para>Based on a relatively simple software state machine</para></listitem><listitem><para>Includes a sample output call-back class that implements a jump
			 table for "document" applications</para></listitem><listitem><para>Supports multiple input sources</para></listitem></itemizedlist><para>The same Java JAR file is used for both the stand-alone program and when
		embedded in an application. Simple entity substitution is driven via a
		Java properties file. If no entity property file
		is specified, a default set is used. Entities not
		found in the property file are passed through unchanged. A full-blown entity
		implementation seemed to be almost as much effort as the rest of the parser, and
		for me, simple entity substitution is sufficient.</para><para>Several of my applications require nested parsing with different DTDs.
		An application simply instantiates another parser object, handles the output from
		the nested parser object, and picks up where it left off in the original input
		stream.</para><para>The need to stop on a parsing error or continue transparently is
		application specific. Stopping on an error is the default behavior for
		stand-alone parser operation. An application
		processing a document may want to simply
		unwind the element stack, fire end tag call-backs and keep going. This works
		reasonably well for missing end tags, but can create a mess with a missing
		start tag. mlParser provides the option to stop or continue when errors
		are detected.
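</para><para>The unwinding strategy might be sketched as follows. The class ElementStackSketch, its event strings, and the decision to skip an end tag that has no matching start tag are assumptions made for illustration; mlParser itself may behave differently.</para><programlisting xml:space="preserve"><![CDATA[
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not mlParser source.
class ElementStackSketch {
    private final Deque<String> open = new ArrayDeque<>();   // currently open elements
    private final boolean stopOnError;
    final List<String> events = new ArrayList<>();           // stands in for call-backs

    ElementStackSketch(boolean stopOnError) {
        this.stopOnError = stopOnError;
    }

    void startTag(String name) {
        open.push(name);
        events.add("start " + name);
    }

    void endTag(String name) {
        if (!open.contains(name)) {
            // No matching start tag: there is nothing sensible to unwind to.
            error("end tag without start tag: " + name);
            return;
        }
        while (!open.peek().equals(name)) {
            // A nested element was never closed; report it, then close it anyway.
            error("missing end tag: " + open.peek());
            events.add("end " + open.pop());
        }
        events.add("end " + open.pop());
    }

    // Stops by throwing, or records the problem and lets parsing continue.
    private void error(String message) {
        if (stopOnError) {
            throw new IllegalStateException(message);
        }
        events.add("error " + message);
    }
}
]]></programlisting><para>With stopOnError set to false, the sequence start doc, start para, end doc records a missing end tag error for para, fires its end tag event, and then closes doc. With stopOnError set to true, the same sequence throws at the first error.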
		</para><para>Support for multiple input sources is a useful feature. Markup traditionally comes
		from files and must be complete. mlParser accepts files, strings, or URNs.
		Parsing a string implies well-formed markup, but no &lt;!DOCTYPE and no XML
		declaration. Transparently handling URNs allows an application to
		parse markup directly from the web or some external source such as a
		Subversion repository. Building a multi-file document directly from a
		Subversion repository ensures that the output document contains the latest
		checked-in files without a process to manually extract and process them.</para><para>The default output call-back interface for handling documents is
		constructed around two jump tables and Java Reflection. There is one jump table
		for start tags and one for end tags. The jump tables are constructed as
		HashMaps with the tag as the key and the Java method, obtained via
		Reflection, as the value. This approach creates a level of indirection between the tags and
		the code that processes them:</para><itemizedlist><listitem><para>One method can handle multiple tags. For example, the same method
			 can be used for dropping a tag and its contents or for passing the tag
			 unchanged.</para></listitem><listitem><para>Only tags of interest have to be accounted for; all others can be
			 automatically ignored. An unexpected tag will not cause an exception to be
			 thrown or the program to bomb.</para></listitem><listitem><para>The source code does not contain page-after-page of nested
			 <emphasis role="ital">if</emphasis> statements to determine which method to call to process a specific
			 tag. </para></listitem></itemizedlist><para>The following start tag call-back method uses the jump table approach:</para><programlisting xml:space="preserve">
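   	public void writeStartTag(String sTag)
   	{
            // Normalize the tag to lowercase before the jump table lookup.
            String key     = sTag.toLowerCase();
            try{
                // Fetch the handler registered for this tag and call it via Reflection.
                myMethod = (Method)sTagAction.get(key);
                // An unregistered tag leaves myMethod null; the resulting exception
                // is caught below and routed to the default action.
                myMethod.invoke(this,new Object[] {});
            }
            catch(Exception e){
                performDefaultStart();
            }
   	}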
</programlisting><para>
		Handling a new tag requires no code change here. Instead, a new entry
		is simply added to the Start Tag Action hash.  I find the jump table approach
		clean, flexible, and elegant. The jump
		table approach can be used with any SAX-based parser, not just mlParser.
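</para><para>For completeness, here is one way the jump table itself might be populated. This is a sketch, not mlParser source: the class JumpTableSketch, its handler methods, and the registered tags are hypothetical, and an explicit null check replaces the catch-all exception handler used above.</para><programlisting xml:space="preserve"><![CDATA[
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical output class, not mlParser source.
class JumpTableSketch {
    private final Map<String, Method> startTagAction = new HashMap<>();
    final List<String> log = new ArrayList<>();      // records which handler ran

    JumpTableSketch() {
        register("b", "passThrough");                // several tags may share
        register("i", "passThrough");                // one handler method
        register("script", "dropWithContent");
    }

    // Adds one jump table entry; the handler is looked up by name via Reflection.
    private void register(String tag, String methodName) {
        try {
            startTagAction.put(tag, JumpTableSketch.class.getDeclaredMethod(methodName));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no handler named " + methodName, e);
        }
    }

    void writeStartTag(String tag) {
        Method handler = startTagAction.get(tag.toLowerCase());
        if (handler == null) {
            performDefaultStart();                   // tag of no interest
            return;
        }
        try {
            handler.invoke(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("handler failed for " + tag, e);
        }
    }

    void passThrough()         { log.add("pass"); }
    void dropWithContent()     { log.add("drop"); }
    void performDefaultStart() { log.add("default"); }
}
]]></programlisting><para>Handling another tag then takes one more register call, and any tag without an entry falls through to the default action.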
		</para><para>mlParser has evolved over the years from a learning exercise with
		potential to a major part of my markup applications. The thought of
		implementing the current
		mlParser from scratch is a daunting one. Starting small and adding capabilities
		as needed is a doable job. </para></section><section><title>Summary</title><para>
		This paper discusses reasons to write your own markup parser and
		documents my journey to writing my own parser. mlParser
		started with very modest beginnings and has evolved into a
		primary tool in my markup application toolkit. I use it to generate ESIS
		streams that are processed by Perl scripts, for the core of
		an integrated software documentation tool, and for
		markup conversion applications.
	 </para><para>
		I don't believe that I would have tackled writing a parser with
		all of the features and capabilities that are currently in
		mlParser. Starting with a small, simple, and flexible base
		allowed me to evolve a useful and capable parsing tool!
	 </para><para>I found out along the way that while XML is king for record-oriented
		markup, I still prefer SGML for documents. SGML's inclusions,
		exclusions, and default attribute values make a developer's life
		much less stressful for document-centric applications. </para><para>
		The experience has shown that there are still many valid reasons
		to write a markup parser. This paper has identified and explored
		many of these reasons. Most of them boil down to not finding a
		parser that meets your needs. Don't be afraid to
		follow the path to your own parser!
	</para></section><bibliography><title>Bibliography</title><bibliomixed xml:id="Bajzat" xreflabel="Bajzat">Bajzat, Robert,
		   <emphasis role="ital">Thinlet Home Page,</emphasis>
           <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.thinlet.com/index.html</link>.
		</bibliomixed><bibliomixed xml:id="ClarkJ" xreflabel="ClarkJ">Clark, James,
		   <emphasis role="ital">Nsgmls Output Format,</emphasis>
           <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.jclark.com/sp/</link>.
		</bibliomixed><bibliomixed xml:id="SmithN" xreflabel="Smith1998">Smith, Norman E.,
		   <emphasis role="ital">Practical Guide to SGML/XML Filters,</emphasis>
           Wordware Publishing, Inc., ISBN 1-55622-587-3, © July 1998.
           <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.wordware.com</link>.
		</bibliomixed><bibliomixed xml:id="Forth" xreflabel="Smith1997">Smith, Norman E.,
		   <emphasis role="ital">Write Your Own Programming Language Using C++, 2nd Edition,</emphasis>
           Wordware Publishing, Inc., ISBN 1-55622-492-3, © 1997.
           <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.wordware.com</link>.
		</bibliomixed><bibliomixed xml:id="WikipediaMU" xreflabel="WikipediaMU">Wikipedia,
		   <emphasis role="ital">Markup Language Definition,</emphasis>
           <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://en.wikipedia.org/wiki/Markup_language</link>.
		</bibliomixed><bibliomixed xml:id="WikipediaSGML" xreflabel="WikipediaSGML">Wikipedia,
		   <emphasis role="ital">SGML Definition,</emphasis>
           <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://en.wikipedia.org/wiki/SGML</link>.
		</bibliomixed></bibliography></article>
