Grammar-driven Markup Generation

Grammar-driven Markup Generation Balisage: The Markup Conference 2010 August 3 - 6, 2010 This paper defines the concept of grammar-driven normalization of incomplete instances, sketches its implementation for RELAX NG schema and XML documents, and presents an example of its practical use for automated document conversion. Mario Blažević The author has a Master's degree in Computer Science from University of Novi Sad, Yugoslavia. Since moving to Canada in 2000, he has been working for OmniMark Technologies, later acquired by Stilo International plc., mostly in the area of markup processing and on development of the OmniMark programming language. mblazevic@stilo.com Senior software developer Stilo International plc. Copyright © 2010 by the author. Used with permission. I want to thank my colleagues Jacques Légaré, Joe Gollner, Patrick Baker, and Roy Amodeo for their valuable comments, and Stilo International for providing the initial motivation for the work and giving me time to do it properly.

Introduction A grammar is a set of rules that describes the composition of a language. Grammars have existed for at least two and half millenia, but their use was mainly didactic. The practical applications of grammars started with their mathematical formalization in the 1950s and the emergence of computer programming languages . There are three ways to apply a grammar: Validate a given sentence to determine if it is grammatically correct. Generate a correct sentence according to the grammar. Parse a given sentence to determine its structure and constituent parts. The historical motivation for development of the theory of formal languages was to explain the method of generation of valid sentences as well as their recognition and parsing. Generative grammars are still an active area of reasearch in linguistics. In computer science, however, it is parsing that is the primary role of grammars; their use for generation of valid sentences appears to be forgotten. One reason for this lack of interest in grammar-driven generation is that the problem is poorly defined. The output of a generator is a grammatically valid sentence. But what is its input, other than the grammar? The task of a parser is clearer: while its output in practice can be anything, it is usually defined as a parse tree of some form. While many kinds of generator inputs are possible and interesting to contemplate, we shall concentrate on the specific task of generating a grammatically valid sentence from a sequence of tokens (i.e., terminals) allowed by the grammar. Furthermore, the generator must exactly reproduce any input that is already a grammatically valid sentence. We shall use the term normalization to refer to this form of sentence generation, and the term normalizer for the generator that performs it. Given a normalizer that maps a set of token sequences T into a set of cannonical, grammatically valid sequences S, one can define a new language S' as the union of the core language S and the non-core language T. For an actual example of this kind of language definition, we need look no further than SGML. An SGML parser performs parsing, normalization, and validation at the same time. For example, given the following SGML DTD that declares the end tag of element a and both tags of element b to be omissible: <!element a - o (b)> <!element b o o (#pcdata)> an SGML parser with OMITTAG feature enabled will parse the instance <a>Hello, World! into a syntax tree corresponding to <a><b>Hello, World!</b></a>. Any conformant SGML parser with support for omissible tags will consider these two instances as equivalent, as long as the same DTD is used to parse them. If the DTD should change, however, an instance that omits element tags may become not only invalid but unparsable. For this reason and others, an XML parser does not perform any normalization. A well-formed XML instance is always parsed in the same way according to a fixed grammar, regardless of what DTD or schema the instance happens to also conform to. Various types of XML schemas are used only for validation, not for parsing or normalization. This paper will demonstrate how an XML schema (a RELAX NG schema specifically) can be used to generate a valid instance of the schema from an invalid but well-formed XML instance. In other words, we shall normalize a well-formed XML instance so it conforms to a given schema. The only element tags allowed in the input are those that occur in the target schema. If normalization succeeds with no errors, the output of the normalizer will: be well-formed XML, conform to the target schema, and contain the entire input instance as its subsequence, interleaved with additional element tags. There is one exception. The normalizer may reorder those parts of the input content that the target schema models as interleaved. In essence, its output will conform to the target schema strictified by replacing its every interleave pattern node by a group node. The normalizer is much more forgiving than the SGML parser in what kinds of input it can normalize. The target grammar is allowed to be ambiguous, though this is not recommended for performance reasons. There is no unique particle contribution constraint , nor is the generation restricted to contextually required elements . If the input instance is ambiguous, i.e., there is more than one way to normalize it, the normalization with the least amount of added markup will be chosen. This choice is not made at the point when the particle is encountered, because the subsequent content could make that choice invalid, but deferred until the enclosing element ends. The biggest constraint is that only plain element tags with no attributes can be inserted. Even this constraint could be overcome by automatically generating some arbitrary grammatically correct values for all required attributes, but we felt that this might cause more harm than benefit in practice. The primary purpose of SGML's minimization features was making SGML easier to author. We do not recommend using the XML normalizer in this way; we feel that it is important that every XML document in permanent storage should conform to a clearly expressed schema. The legitimate place for the normalization process is inside a processing pipeline, where it can usefully be applied to or produce a transient XML document instance. Normalization can thus be applied before another process in order to transform the input document into a form easier to process. The target schema for the normalizer may be just a stricter form of the input document's schema, or it may specify additional disambiguating markup. Alternatively, a normalizer can appear at the opposite end of the processing pipeline, where it ensures that the output of the preceding process conforms to the target schema. We shall demonstrate this latter possibility of practical use in a later section. The rest of the paper is organized as follows: the next section will present an example of normalizer's use, and afterwards its design and implementation will be explained. The section following after that explains the purpose and results of the normalizer in production, and the paper concludes with a discussion of related work.

Example This section will demonstrate the capabilities of our normalizer through an example. Let us take the following RELAX NG schema as our target: start = document block = p | ol | ul document = element document { title, block+, section* } section = element section { title, block+, section* } title = element title { text } p = element p { text } ol = element ol { li+ } ul = element ul { li+ } li = element li { block+ } The goal is to convert our input to a valid instance of this grammar. As for the input, it may start as a plain text file like the following one: Normalizer test input This is a short test input for the normalizer. Purpose The purpose of this document is to demonstrate the capabilities of the normalizer. The normalizer is schema-driven, and it tries to fit its input into the target schema. Constraints The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed: - A piece of input that cannot be fit into the schema will be dropped. - The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input. We first need to ensure that our input is a well-fomed XML document, which can be achieved simply by putting the entire text inside the <document>...</document> tags. If we then run this XML instance through the normalizer, it will be converted to <document> <title></title> <p> Normalizer test input This is a short test input for the normalizer. Purpose The purpose of this document is to demonstrate the capabilities of the normalizer. The normalizer is schema-driven, and it tries to fit its input into the target schema. Constraints The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed: - A piece of input that cannot be fit into the schema will be dropped. - The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input. </p> </document> As we can verify, the output indeed does conform to the target schema. The content model of document requires a title element, for example, and this element has been inserted. Since it has no content, though, it may not be all that useful. In order to produce a more appropriate instance, we need to place certain constraints on the solution. One way to do this would be make the schema stricter. We could require the title content to be non-empty, and the paragraph content to contain no line-breaks. Alternatively, we can specify the constraints by adding the desired markup to the input document. For example, we can mark up all input lines that appear to be titles: <document> <title>Normalizer test input</title> This is a short test input for the normalizer. <title>Purpose</title> The purpose of this document is to demonstrate the capabilities of the normalizer. The normalizer is schema-driven, and it tries to fit its input into the target schema. <title>Constraints</title> The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed: - A piece of input that cannot be fit into the schema will be dropped. - The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input. </document> Given this enriched input together with the original schema, the normalizer produces the following output: <document> <title>Normalizer test input</title> <p> This is a short test input for the normalizer. </p> <section> <title>Purpose</title> <p> The purpose of this document is to demonstrate the capabilities of the normalizer. The normalizer is schema-driven, and it tries to fit its input into the target schema. </p> <section> <title>Constraints</title> <p> The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed: - A piece of input that cannot be fit into the schema will be dropped. - The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input. </p> </section> </section> </document> The good news is that the normalizer has not only respected and kept our title markup, it has inferred that every title but the first one has to start a new section. The bad news is that the second section has been nested within the first one. While this does conform to the target schema, the two sections belong at the same level. Furthermore, we want every line to be a separate paragraph; and there appears to be a list at the end of our document and it should be marked up as such. In this simple example we could just visually determine where each section, paragraph, or list item starts and ends and mark them up with XML element tags. In practice, however, we'd rather have this work done automatically. A simple pattern-matching rule would be capable of inserting a start tag wherever there is a hyphen at the beginning of the line, but finding where it ends is much more difficult. To make this marking-up process easier, the normalizer recognizes a number of processing instructions as output constraints. Once we apply them, our input document might look like this: <document> <title>Normalizer test input</title> <?derivative.start-anew <p>?>This is a short test input for the normalizer. <?derivative.start-anew <section>?><title>Purpose</title> <?derivative.start-anew <p>?>The purpose of this document is to demonstrate the capabilities of the normalizer. <?derivative.start-anew <p>?>The normalizer is schema-driven, and it tries to fit its input into the target schema. <?derivative.start-anew <section>?><title>Constraints</title> <?derivative.start-anew <p>?>The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed: <?derivative.proceed-with <ul>?><?derivative.start-anew <li>?> A piece of input that cannot be fit into the schema will be dropped. <?derivative.proceed-with <ul>?><?derivative.start-anew <li>?> The normalizer does not attempt to conjure any required attribute values nor text data. <?derivative.start-anew <p>?>These must be present in the input. </document> The processing instructions basically act as placeholders for element start tags, letting the normalizer know which element is supposed to enclose the following content. The exact location of the element's end tag will be inferred by the normalizer. The details of the processing instructions and the inference algorithm will be explained in the next chapter. The normalizer's output, based on our last enriched input above, looks as follows: <document> <title>Normalizer test input</title> <p>This is a short test input for the normalizer.</p> <section> <title>Purpose</title> <p>The purpose of this document is to demonstrate the capabilities of the normalizer.</p> <p>The normalizer is schema-driven, and it tries to fit its input into the target schema.</p> </section> <section> <title>Constraints</title> <p>The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed:</p> <ul> <li> <p>A piece of input that cannot be fit into the schema will be dropped.</p> </li> <li> <p>The normalizer does not attempt to conjure any required attribute values nor text data.</p> <p>These must be present in the input.</p> </li> </ul> </section> </document>

Design and Implementation The basic idea that drove the normalizer design came after implementing the validator automaton for RELAX NG . This automaton uses a flexible validation algorithm based on Brzozowski derivatives . We shall now sketch the basic idea of this validation algorithm; a more thorough treatment can be found in the literature . Informally, the derivative of a grammar G with respect to a sequence of tokens S is the grammar G', such that a sequence S' is valid according to G' if and only if the concatenation of S and S' is valid according to G. We can think of the grammar G as the initial state of a validating automaton, and the grammar derivative calculation as its transition function. A very simple and generic validator algorithm can be built on top of this concept , and one such validator is the standard RELAX NG validation algorithm. While working on the implementation of an RELAX NG validator library for OmniMark , it became clear that the automaton could be relatively easily extended in two orthogonal ways: with additional pattern constructs, and with additional actions to perform. In particular, the validating automaton could be modified to produce arbitrary output, in the same way a deterministic finite automaton can be extended to a Moore machine . The original RELAX NG validation algorithm represents the grammar as a tree data structure. Here is its definition in Haskell : data Pattern = Empty | NotAllowed | Text | Choice Pattern Pattern | Interleave Pattern Pattern | Group Pattern Pattern | OneOrMore Pattern | List Pattern | Data Datatype ParamList | DataExcept Datatype ParamList Pattern | Value Datatype String Context | Attribute NameClass Pattern | Element NameClass Pattern | After Pattern Pattern The constructors of this data type, with the exception of the last one, correspond to the constructs of the RELAX NG schema language. The last constructor, After, appears when a start tag of an element is consumed, and matches the remainder of the XML input together with the corresponding end tag. We need two extensions to this data structure in order to enable the automaton to perform normalization. First, we need to be able to assume the presence of an omitted element tag. Secondly, we must be able to output the normalized version of our input, together with the omitted and inferred tags. So we add two more data constructors: data Pattern = Empty | ... | Region Bool QName Pattern | Result [ResultDelta] Pattern data ResultDelta = StartTag ChildNode | Add ChildNode | EndTag | Inferred ResultDelta The Result constructor keeps track of the output that should be emitted if the pattern it wraps succeeds. Whenever the current pattern (i.e., the state of the automaton) has a Result node at its root, the accumulated chunk of output is flushed and the root node is replaced by its child pattern. The Region constructor denotes an open element region, which may have come from the input or may have been inserted in order to normalize it. In the former role, it acts as a replacement for the After node. To accomodate this change in the representation of open element regions, we have to make another adjustment to the Interleave constructor. It needs an extra integer field to keep track of which of its patterns has consumed unmatched start tags, and how many of them. As long as the input conforms to the target schema, the normalizer acts the same way as the RELAX NG validator, except it builds the Result nodes and flushes them at the output. When a particle that doesn't match the current pattern is encountered, the normalizer attempts to recover by descending into the content models of elements that are allowed at the current point. The derivative of the current pattern is then the sum of derivatives of all content models that accept the particle, with each content derivative wrapped into a Region node and a Result node; these keep track of the inferred element start tag. This modification of the RELAX NG validation algorithm results in the normalizer whose complete source code is listed in the paper's appendix . This simple normalizer is completely correct, but far from perfect. It easily runs into a combinatorial explosion of the pattern tree, and it cannot be customized to resolve the ambiguity in a desired way. While the available space is too short to describe all of the additional features and optimizations, the role of the processing instruction guides is too important to leave out. Many interesting target schemas contain a large number of elements which have nearly identical content models and which are allowed to appear in the same places. Such is the case, for example, with DocBook 5 elements glossary, bibliography, index, toc, dedication, acknowledgements, preface, chapter, appendix, article, and colophon. This represents a problem for the normalizer because it must keep track of all the alternative elements in case of recovery. If the user can specify which of the elements should be used for recovery, this improves the output and the performance at the same time. At the moment, the normalizer accepts the following processing instructions: <?derivative.start-anew depth? start-tag?> <?derivative.start-nested depth? start-tag?> <?derivative.proceed-with depth? start-tag?> <?derivative.ensure-inside element-name?> <?derivative.ensure-outside element-name?> where depth has the form region-identifier:integer. Processing instructions are used instead of element tags because the input document has to be well-formed XML, so it cannot contain unmatched element start tags. Using empty element tags in a special namespace could cause difficulties because the target schema may or may not allow them in that position. Finally, since these instructions target one specific tool, the normalizer, this is an appropriate use of processing instructions if there ever was one. The normalizer evaluates these processing instructions as follows: start-anew Any existing element region that was open with the same region-identifier and larger or equal integer depth will be closed. If depth is not specified, any existing element region with the same element name will be closed. A new element region will be started as if the specified start-tag appeared instead of the instruction. start-nested Any existing element region that was open with the same region-identifier and larger integer depth will be closed. A new element region will be started as if the specified start-tag appeared instead of the instruction. proceed-with Any existing element region that was open with the same region-identifier and larger integer depth will be closed. If there is an existing element region with the same region-identifier and equal integer depth, this region is kept and nothing else happens. Otherwise a new element region will be started as if the specified start-tag appeared instead of the instruction. ensure-inside The current pattern is trimmed to leave only those possibilities which assume the specified element-name is currently open. ensure-outside The existing schema of the remainder of the document is trimmed to leave only those possibilities which assume the specified element-name is not currently open.

Results The schema-driven normalizer described above has been implemented in the OmniMark programming language, to serve as a component in various document conversion solutions. We shall now briefly describe this environment in order to illustrate the practical benefits of the approach. The wider purpose is to convert documents from a legacy presentation-oriented format into a semantically richer XML format. Some examples of the source document format are FrameMaker, HTML, InDesign, or Quark Xpress. The target format of the conversion can be DITA, DocBook, or a user-specified XML format. One obvious simplification when faced with converting M source formats to N target formats is to introduce an intermediate format, and thus replace M∗N direct converters by M input converters to the intermediate format plus N output converters from the intermediate to the target format. Another benefit of having the intermediate document is that it can be enriched with user annotations to guide the output converter. Experience we gained in building specific output converters showed that hand-written implementations were difficult to verify and maintain, so we needed a more robust and more flexible solution. Some customers required outputs that conform to a specialization of DITA, others wanted DocBook with various extensions. The core of the output converters is the topic of this paper: the schema-driven normalizer. Its draft implementation, given in the appendix, was coded in Haskell and tested using the QuickCheck library . After this proof of concept, the production implementation was developed in OmniMark. Once the normalizer was ready, the old hand-coded output converters for DITA and DocBook were replaced with new converters in a matter of days. Compared to the hand-written output converters they replaced, the converters based on the schema-driven normalizer provide several benefits: Assuming the normalizer itself is correct, the output of the converter is guaranteed to be conformant to the target schema. The correctness of hand-written converters' output is more difficult to verify. A new target format can be supported very quickly as long as we have a schema for it, while before it would require rewriting the entire converter. If the new target language is a strict subset of an existing one, no coding is required whatsoever. For a small extension of an already supported schema, only a few lines of code must be added. Many users have various stylistic requirements that can be supported simply by restricting the target schema, with no code modifications required. The new converters are capable of automatically reordering parts of the input to make it conform to the target schema, with no user intervention required beyond the usual annotations. One example that we have often encountered in practice is that of DITA task topics: the DITA 1.1 schema requires the prereq, context, steps, result, example, and postreq elements to be listed in this specific order. Most existing technical documentation is not written this way. Note that this reordering transformation applies only where the target content model specifies that its parts can appear in any order. In case of DITA task topics, this means that we have to modify the target schema so it accepts the elements listed above in any order. The normalizer will output them in the canonical order. One downside of the new converters is their performance, especially their worst-case memory consumption. The worst case in question is generally an input document which is annotated in a way that makes it non-conformant with the target schema. The hand-coded converters would in this situation either report errors (which is the desired behaviour) or generate invalid output (which is not). The schema-driven converters try to recover by inserting the missing element tags and by rearranging the input. The recovery eventually fails and the normalizer produces an error report, but its attempt costs a lot of time and memory. We are still looking for a way to fix this issue.

Related Work Automatic grammar-driven sentence normalization has been done in linguistics, for the automatic correction of grammatically incorrect English sentences . For markup languages, particularly XML, there has been some related work in the area of automatic markup . The main difference with our work is that the input of an automatic markup system is typically plain text, usually produced by OCR. As a consequence, even those automatic markup systems whose output schema is not fixed require additional configuration in form of pattern-matching rules. The normaliser approach presented in this paper can be coupled with a pattern-matching engine as well, but it does not depend on one. The automatic markup system that seems nearest to ours is the Vasari project , which is also schema-driven to some extent. The target schema must be enriched using a special pattern-matching language which determines where the required element tags can be inserted. It is not clear if the location of every tag required by the target schema must be specified this way. There is a large body of work on the automatic translation between structured documents that conform to similar schemas . The approaches based on the concept of syntax-directed translation schema and its extensions, such as extended syntax-directed translation schema and tree transformation grammars , seem to be most related to our work. The syntax-directed translation is simpler and more direct than our approach to translation, but it requires both the source and target format to be well structured. Our problem, quite common in practice, is that the source format is only weakly structured. The transformation process presented in this paper proceeds in two stages: the weakly structured source document is first translated into another weakly structured document which uses the tags allowed by the target schema. This intermediate document is then normalized to a valid schema instance.

Sample implementation This appendix contains a sample normalizer implementation in Haskell. The implementation depends on the sample code for the RELAX NG validator , which it expects to find in module file Validator.hs. module Normalizer where import Control.Monad (liftM2) import Control.Exception (assert) import Data.List (minimumBy) -- The RELAX NG validator automaton code from http://www.thaiopensource.com/relaxng/derivative.html import Validator (contains, datatypeAllows, datatypeEqual, strip, whitespace, Uri, LocalName, ParamList, Prefix, Context, Datatype, NameClass(..), QName(..), ChildNode(..), AttributeNode(..)) data Pattern = Empty | NotAllowed | Text | Choice Pattern Pattern | Interleave Int Pattern Pattern | Group Pattern Pattern | OneOrMore Pattern | List Pattern | Data Datatype ParamList | DataExcept Datatype ParamList Pattern | Value Datatype String Context | Attribute NameClass Pattern | Element NameClass Pattern | Region Bool QName Pattern | Result [ResultDelta] Pattern deriving (Eq, Show) data ResultDelta = StartTag ChildNode | Add ChildNode | EndTag | Inferred ResultDelta deriving (Eq, Show) nullable:: Pattern -> Bool nullable (Group p1 p2) = nullable p1 && nullable p2 nullable (Interleave balance p1 p2) = balance == 0 && nullable p1 && nullable p2 nullable (Choice p1 p2) = nullable p1 || nullable p2 nullable (OneOrMore p) = nullable p nullable (Element _ _) = False nullable (Attribute _ _) = False nullable (List _) = False nullable (Value _ _ _) = False nullable (Data _ _) = False nullable (DataExcept _ _ _) = False nullable NotAllowed = False nullable Empty = True nullable Text = True nullable (Region False _ _) = False nullable (Region True _ p) = nullable p nullable (Result _ p) = nullable p inferred :: ResultDelta -> Bool inferred (Inferred _) = True inferred _ = False nullableResults:: Pattern -> [[ResultDelta]] nullableResults (Group p1 p2) = case nullableResults p2 of [] -> [] l -> assert (all null l) (nullableResults p1) nullableResults (Interleave 0 p1 p2) = liftM2 (++) (nullableResults p1) (nullableResults p2) nullableResults Interleave{} = [] nullableResults (Choice p1 p2) = nullableResults p1 ++ nullableResults p2 nullableResults (OneOrMore p) = nullableResults p nullableResults (Element _ _) = [] nullableResults (Attribute _ _) = [] nullableResults (List _) = [] nullableResults (Value _ _ _) = [] nullableResults (Data _ _) = [] nullableResults (DataExcept _ _ _) = [] nullableResults NotAllowed = [] nullableResults (Region False _ _) = [] nullableResults (Region True q p) = map (++ [Inferred EndTag]) (nullableResults p) nullableResults Empty = [[]] nullableResults Text = [[]] nullableResults (Result r p) = map (r ++) (nullableResults p) bestNullableResult :: Pattern -> [ResultDelta] bestNullableResult p = if nullableResults p == [] then error ("bestNullableResult: " ++ show p) else minimumBy innermostLength (nullableResults p) where innermostLength r1 r2 = compare (length r1) (length r2) childDeriv :: Context -> Pattern -> ChildNode -> Pattern childDeriv cx p (TextNode s) = textDeriv cx False p s childDeriv _ p (ElementNode qn cx atts children) = let p1 = startTagOpenDeriv p (ElementNode qn cx atts []) p2 = attsDeriv cx p1 atts p3 = startTagCloseDeriv p2 p4 = childrenDeriv cx p3 children in endTagDeriv p4 textDeriv :: Context -> Bool -> Pattern -> String -> Pattern textDeriv cx inf (Choice p1 p2) s = choice (textDeriv cx inf p1 s) (textDeriv cx inf p2 s) textDeriv cx False (Interleave 0 p1 p2) s = choice (interleave 0 (textDeriv cx False p1 s) p2) (interleave 0 p1 (textDeriv cx False p2 s)) textDeriv cx False (Interleave balance p1 p2) s = if balance < 0 then interleave balance (textDeriv cx False p1 s) p2 else interleave balance p1 (textDeriv cx False p2 s) textDeriv cx True (Interleave 0 p1 p2) s = choice (interleave 0 (textDeriv cx True p1 s) p2) (interleave 0 p1 (textDeriv cx True p2 s)) textDeriv cx True (Interleave balance p1 p2) s = if balance < 0 then interleave balance (textDeriv cx True p1 s) p2 else interleave balance p1 (textDeriv cx True p2 s) textDeriv cx inf (Group p1 p2) s = let p = group (textDeriv cx inf p1 s) p2 in if nullable p1 then choice p (result (bestNullableResult p1) (textDeriv cx inf p2 s)) else p textDeriv cx inf (OneOrMore p) s = group (textDeriv cx inf p s) (choice (OneOrMore p) Empty) textDeriv cx inf Text s = result [(if inf then Inferred else id) (Add (TextNode s))] Text textDeriv cx1 inf (Value dt value cx2) s = if datatypeEqual dt value cx2 s cx1 then result [(if inf then Inferred else id) (Add (TextNode s))] Empty else NotAllowed textDeriv cx inf (Data dt params) s = if datatypeAllows dt params s cx then result [(if inf then Inferred else id) (Add (TextNode s))] Empty else NotAllowed textDeriv cx inf (DataExcept dt params p) s = if datatypeAllows dt params s cx && not (nullable (textDeriv cx inf p s)) then result [(if inf then Inferred else id) (Add (TextNode s))] Empty else NotAllowed textDeriv cx inf (List p) s = if nullable (listDeriv cx p (words s)) then result [(if inf then Inferred else id) (Add (TextNode s))] Empty else NotAllowed textDeriv cx inf (Element (Name uri local) p) s = let qn = QName uri local in region True qn (textDeriv cx inf (startRecovery qn cx p) s) textDeriv cx inf (Element (NameClassChoice n1 n2) p) s = choice (textDeriv cx inf (Element n1 p) s) (textDeriv cx inf (Element n2 p) s) textDeriv cx inf (Element _ _) s = NotAllowed textDeriv cx inf (Region inferred qn p) s = region inferred qn (textDeriv cx inf p s) textDeriv cx inf (Result r p) s = result r (textDeriv cx inf p s) textDeriv _ _ _ _ = NotAllowed listDeriv :: Context -> Pattern -> [String] -> Pattern listDeriv _ p [] = p listDeriv cx p (h:t) = listDeriv cx (textDeriv cx False p h) t startRecovery :: QName -> Context -> Pattern -> Pattern startRecovery qn cx p = result [Inferred (StartTag (ElementNode qn cx [] []))] (startTagCloseDeriv p) region :: Bool -> QName -> Pattern -> Pattern region inferred qname NotAllowed = NotAllowed region inferred qname p = Region inferred qname p result :: [ResultDelta] -> Pattern -> Pattern result _ NotAllowed = NotAllowed result r (Choice p1 p2) = Choice (result r p1) (result r p2) result r1 (Result r2 p) = Result (r1 ++ r2) p result [] p = p result r p = Result r p addResult :: Pattern -> [ResultDelta] -> Pattern addResult NotAllowed _ = NotAllowed addResult (Choice p1 p2) r = Choice (addResult p1 r) (addResult p2 r) addResult (Group p1 p2) r = group (addResult p1 r) p2 addResult p@(Interleave balance p1 p2) r = interleave balance (addResult p1 r) p2 addResult (Result r1 p) r2 = result r1 (addResult p r2) addResult (Region inferred qn p) r = Region inferred qn (addResult p r) addResult p r = Result r p choice :: Pattern -> Pattern -> Pattern choice p NotAllowed = p choice NotAllowed p = p choice Empty Text = Text choice Text Empty = Text choice p1 p2 | p1 == p2 = p1 choice p1 p2 = Choice p1 p2 group :: Pattern -> Pattern -> Pattern group p NotAllowed = NotAllowed group NotAllowed p = NotAllowed group p Empty = p group Empty p = p group Text Text = Text group _ (Result _ _) = error "Can't have Result on the RHS of Group." group (Result r p1) p2 = result r (group p1 p2) group p1 p2 = Group p1 p2 interleave :: Int -> Pattern -> Pattern -> Pattern interleave _ p NotAllowed = NotAllowed interleave _ NotAllowed p = NotAllowed interleave _ p Empty = p interleave _ Empty p = p interleave _ Text Text = Text interleave balance p1 p2 = Interleave balance p1 p2 startTagOpenDeriv :: Pattern -> ChildNode -> Pattern startTagOpenDeriv (Choice p1 p2) node = choice (startTagOpenDeriv p1 node) (startTagOpenDeriv p2 node) startTagOpenDeriv (Element (NameClassChoice n1 n2) p) node = choice (startTagOpenDeriv (Element n1 p) node) (startTagOpenDeriv (Element n2 p) node) startTagOpenDeriv (Element nc@(Name uri local) p) node@(ElementNode qn cx _ _) = if contains nc qn then Region False qn (result [StartTag node] p) else let qn = QName uri local in region True qn (startTagOpenDeriv (startRecovery qn cx p) node) startTagOpenDeriv (Element nc p) node@(ElementNode qn cx _ _) = if contains nc qn then Region False qn (result [StartTag node] p) else NotAllowed startTagOpenDeriv (Interleave 0 p1 p2) node = choice (interleave (-1) (startTagOpenDeriv p1 node) p2) (interleave 1 p1 (startTagOpenDeriv p2 node)) startTagOpenDeriv (Interleave balance p1 p2) node = if balance < 0 then interleave (balance-1) (startTagOpenDeriv p1 node) p2 else interleave (balance+1) p1 (startTagOpenDeriv p2 node) startTagOpenDeriv (OneOrMore p) node = group (startTagOpenDeriv p node) (choice (OneOrMore p) Empty) startTagOpenDeriv (Group p1 p2) node = let x = group (startTagOpenDeriv p1 node) p2 in if nullable p1 then choice x (result (bestNullableResult p1) (startTagOpenDeriv p2 node)) else x startTagOpenDeriv (Region inferred qn p) node = region inferred qn (startTagOpenDeriv p node) startTagOpenDeriv (Result r p) node = result r (startTagOpenDeriv p node) startTagOpenDeriv _ _ = NotAllowed attsDeriv :: Context -> Pattern -> [AttributeNode] -> Pattern attsDeriv cx p [] = p attsDeriv cx p ((AttributeNode qn s):t) = attsDeriv cx (attDeriv cx p (AttributeNode qn s)) t attDeriv :: Context -> Pattern -> AttributeNode -> Pattern attDeriv cx (Choice p1 p2) att = choice (attDeriv cx p1 att) (attDeriv cx p2 att) attDeriv cx (Group p1 p2) att = choice (group (attDeriv cx p1 att) p2) (group p1 (attDeriv cx p2 att)) attDeriv cx (Interleave balance p1 p2) att = choice (interleave balance (attDeriv cx p1 att) p2) (interleave balance p1 (attDeriv cx p2 att)) attDeriv cx (OneOrMore p) att = group (attDeriv cx p att) (choice (OneOrMore p) Empty) attDeriv cx (Attribute nc p) (AttributeNode qn s) = if contains nc qn && valueMatch cx p s then Empty else NotAllowed attDeriv cx (Region inferred qn p) att = Region inferred qn (attDeriv cx p att) attDeriv cx (Result r p) att = result r (attDeriv cx p att) attDeriv _ _ _ = NotAllowed valueMatch :: Context -> Pattern -> String -> Bool valueMatch cx p s = (nullable p && whitespace s) || nullable (textDeriv cx False p s) startTagCloseDeriv :: Pattern -> Pattern startTagCloseDeriv (Choice p1 p2) = choice (startTagCloseDeriv p1) (startTagCloseDeriv p2) startTagCloseDeriv (Group p1 p2) = group (startTagCloseDeriv p1) (startTagCloseDeriv p2) startTagCloseDeriv (Interleave balance p1 p2) = interleave balance (startTagCloseDeriv p1) (startTagCloseDeriv p2) startTagCloseDeriv (OneOrMore p) = oneOrMore (startTagCloseDeriv p) startTagCloseDeriv (Attribute _ _) = NotAllowed startTagCloseDeriv (Region inferred qn p) = Region inferred qn (startTagCloseDeriv p) startTagCloseDeriv (Result r p) = result r (startTagCloseDeriv p) startTagCloseDeriv p = p oneOrMore :: Pattern -> Pattern oneOrMore NotAllowed = NotAllowed oneOrMore Empty = Empty oneOrMore Text = Text oneOrMore p@OneOrMore{} = p oneOrMore (Choice p Empty) = Choice (oneOrMore p) Empty oneOrMore (Choice Empty p) = Choice Empty (oneOrMore p) oneOrMore (Result r p) = result r (oneOrMore p) oneOrMore p = OneOrMore p childrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern childrenDeriv _ NotAllowed _ = NotAllowed childrenDeriv cx p [] = let p1 = textDeriv cx True p "" in choice (addResult p [Inferred (Add (TextNode ""))]) p1 childrenDeriv cx p [(TextNode s)] = let p1 = childDeriv cx p (TextNode s) in if whitespace s then choice (addResult p [Add (TextNode s)]) p1 else p1 childrenDeriv cx p children = stripChildrenDeriv cx p children stripChildrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern stripChildrenDeriv _ p [] = p stripChildrenDeriv cx p (h:t) = stripChildrenDeriv cx (if strip h then (addResult p [Add h]) else childDeriv cx p h) t endTagDeriv :: Pattern -> Pattern endTagDeriv (Choice p1 p2) = choice (endTagDeriv p1) (endTagDeriv p2) endTagDeriv (Region False qn p) = if nullable p then result (bestNullableResult p ++ [EndTag]) Empty else region False qn (endTagDeriv p) endTagDeriv (Region True qn p) = region True qn (endTagDeriv p) endTagDeriv (Result r p) = result r (endTagDeriv p) endTagDeriv (Group p1 p2) = group (endTagDeriv p1) p2 endTagDeriv (Interleave balance p1 p2) | balance < 0 = interleave (balance+1) (endTagDeriv p1) p2 | balance == 0 = NotAllowed | balance > 0 = interleave (balance-1) p1 (endTagDeriv p2) endTagDeriv _ = NotAllowed Bibliography AV Aho, JD Ullman, 1972. The Theory of Parsing, Translation, and Compiling. Prentice Hall Mohammad Abolhassani, Norbert Fuhr and Norbert Gövert, Information extraction and automatic markup for XML documents, In Blanken et al, 2003, 159--174, Springer. doi:10.1007/978-3-540-45194-5_11 Backus, J.W., The Syntax and Semantics of the Proposed International Algebraic Language of Zürich ACM-GAMM Conference, Proceedings of the International Conference on Information Processing, UNESCO, 1959, pp.125-132. Brzozowski, J. A. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481-494. doi:10.1145/321239.321249 Chomsky, Noam (1956). "Three models for the description of language". IRE Transactions on Information Theory 2: 113–124. doi:10.1109/TIT.1956.1056813 James Clark and Makoto Murata. RELAX NG Specification. http://relaxng.org/spec-20011203.html, 2001. ISO/IEC 19757-2:2003. James Clark. An algorithm for RELAX NG validation http://www.thaiopensource.com/relaxng/derivative.html QuickCheck: Automatic testing of Haskell programs http://hackage.haskell.org/package/QuickCheck-2.1.0.3 Peter Fankhauser and Yi Xu, MarkItUp! - An incremental approach to document structure recognition, Electronic Publishing, 1993, pages 447-456 Eila Kuikka and Martti Penttonen, Transformation of Structured Documents, Electronic Publishing Origination, Dissemination and Design, 8(4), 1995. Bertin Klein and Peter Fankhauser, Error tolerant Document Structure Analysis, International Journal on Digital Libraries, 1997, volume 1, pages 344-357. doi:10.1007/s007990050028 Lewis, P. M. and Stearns, R. E. 1968. Syntax-Directed Transduction. J. ACM 15, 3 (Jul. 1968), 465-488. doi:10.1145/321466.321477 Greger Lindén, Structured Document Transformations, 1997 Moore, E. F., [1956]. Gedanken experiments on sequential machines, Automata Studies, Princeton Univ. Press, Princeton, New Jersey, pp. 129-153. Makoto Murata, Transformation of Documents and Schemas by Patterns and Contextual Conditions, Proceedings of the Third International Workshop on Principles of Document Processing (PODP 96), 1997, pages 153-169, Springer-Verlag. doi:10.1007/3-540-63620-X_61 Sperberg-McQueen, C. M. Applications of Brzozowski derivatives to XML schema processing. In Extreme Markup Languages 2005, page 26, Internet, 2005. IDEAlliance. Kazem Taghva, Allen Condit, and Julie Borsack, Autotag: A tool for creating structured document collections from printed materials, Electronic Publishing, Artistic Imaging, and Digital Typography, Proc. of the EP ’98 and RIDT ’98 Conferences, 1998, pages 420-431, Springer-Verlag Tang, X. 2003 A High-Level Specification Language for Structured Document Transformation. Doctoral Thesis. UMI Order Number: AAINQ84932., University of Waterloo. Dénes Vargha, Schema method: a framework for correcting grammatically ill-formed input Proceedings of the 12th conference on Computational linguistics - Volume 1 Computer and Automation Institute, Hungarian Academy of Sciences Pages 341 - 347 Association for Computational Linguistics Morristown, NJ, USA ©1988 ISBN: 9638431563. doi:10.3115/991635.991705 Haskell 98 Language and Libraries, the Revised Report. December 2002. http://haskell.org/onlinereport/ Standard Generalized Markup Language (SGML) International Organization for Standardization ISO 8879:1986 OmniMark language documentation http://developers.omnimark.com/docs-extract/html/index.htm OmniMark RELAX NG (OMRELAXNG) library documentation http://developers.omnimark.com/docs-extract/html/library/125.htm XML Schema Part 1: Structures Second Edition, Analysis of the Unique Particle Attribution Constraint W3C Recommendation 28 October 2004 http://www.w3.org/TR/xmlschema-1/#non-ambig