The traditional approach to transforming XML documents is a three-step pipeline: validate, transform, validate. (Sometimes, of course, one or both of the validation steps is omitted.) Architectural forms, a feature first of the SGML-based hypermedia standard HyTime and then of SGML itself, made use of a combination of enhancements to DTDs and annotations in source documents to allow a two-step pipeline for certain simple transformations. In this pipeline, a valid SGML document could be automatically transformed using a specialized SGML parser, called an architectural engine (AE), into another SGML document valid against a more general DTD known as the meta-DTD. This permitted document creators to conform to a general document architecture without having to constrain their own documents to every detail of a specific schema.
However, DTDs have not seen wide uptake in the XML world, and the few XML architectural engines that have been built have conformed more to the letter than to the spirit of architectural forms. The emphasis has been on the creation of comprehensive and complex schemas which attempt simultaneously to serve local needs and the needs of interchange. Such schemas are usually arrived at by difficult, lengthy, and highly political negotiations between interested parties, with victory often going to the participants with the greatest weight of Sitzfleisch rather than the best ideas.
This paper describes an attempt to return to those thrilling days of yesteryear by providing a modern equivalent of SGML architectural engines. In principle any grammar-based schema language such as XML Schema or RELAX NG would be suitable for the methods outlined here. However, the software development (still very much a work in progress as of this writing) is using the much simpler Examplotron schema language. Examplotron is not well-known or much used in the XML environment, but I believe it to be extremely suitable to the stripped-down MicroXML environment in which I am now primarily interested. Since most people don't know Examplotron, I have written the paper to be accessible to anyone who can read simple DTD declarations.
In this paper, I will speak of the source document, which is the input to a schema-based transformation engine (TE), and of the target document, which is a TE's output. Additional inputs are the source schema and the target schema. In this paper the schemas are expressed as DTD fragments, but in actual use they will be Examplotron 0.8 schemas. In addition, we may supply the TE with the transformation name of the particular transformation to be performed on the source, possibly one of many such transformations. For clarity's sake, I will speak as if the various transformations are made one by one, but except for attribute defaulting they are all made simultaneously. For example, if all elements named foo are to be renamed bar, and all elements named bar are to be renamed baz, that does not mean that both foo and bar elements wind up being named baz.
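The difference between simultaneous and one-by-one renaming can be sketched in a few lines of Python. This is only an illustration: the rename table and both function names are hypothetical and are not part of any TE implementation.

```python
# Hypothetical rename table: source element name -> target element name.
RENAMES = {"foo": "bar", "bar": "baz"}

def rename_simultaneously(names, renames):
    """Look each name up exactly once in the original table (TE behaviour)."""
    return [renames.get(name, name) for name in names]

def rename_sequentially(names, renames):
    """Apply the rules one after another (NOT what a TE does)."""
    for old, new in renames.items():
        names = [new if name == old else name for name in names]
    return names

source_names = ["foo", "bar", "qux"]
print(rename_simultaneously(source_names, RENAMES))  # ['bar', 'baz', 'qux']
print(rename_sequentially(source_names, RENAMES))    # ['baz', 'baz', 'qux']
```

In the simultaneous version, an element that started out as foo ends up as bar and is never looked at again.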
2. Element Renaming and the Renaming Attribute
The first and simplest kind of transformation to be performed is element renaming. A TE does this by looking at each element of the source document for an attribute whose name is the same as the transformation name supplied to the TE. This attribute is called the renaming attribute.
For example, suppose we have the following source document:
    <limerick>
      <title>Relativity</title>
      <a>There was a young lady named Bright</a>
      <a>Who could travel much faster than light.</a>
      <b>She set out one day</b>
      <b>In a relative way</b>
      <a>And returned the previous night.</a>
    </limerick>
If we wish to transform it from its limerick-specific schema to a more general stanza schema, we might add a renaming attribute named stanza to every element, like this:
    <limerick stanza="stanza">
      <title stanza="title">Relativity</title>
      <a stanza="line">There was a young lady named Bright</a>
      <a stanza="line">Who could travel much faster than light.</a>
      <b stanza="line">She set out one day</b>
      <b stanza="line">In a relative way</b>
      <a stanza="line">And returned the previous night.</a>
    </limerick>
Running a TE on the above document, specifying stanza as the transformation name, would produce the following target document:
    <stanza>
      <title>Relativity</title>
      <line>There was a young lady named Bright</line>
      <line>Who could travel much faster than light.</line>
      <line>She set out one day</line>
      <line>In a relative way</line>
      <line>And returned the previous night.</line>
    </stanza>
Note that all occurrences of the renaming attribute have been removed from the target document.
What happens if an element doesn't have a renaming attribute? The answer is that the element is dropped in its entirety. For example, suppose we did not have a stanza attribute on the source document's title element. In that case, the target document would contain only a stanza element with five line child elements.
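The renaming rule, including the dropping of elements that lack a renaming attribute, might be sketched as follows using Python's standard xml.etree.ElementTree module. The function name rename is hypothetical, the sketch covers only the case where a transformation name is supplied, and it ignores any whitespace that follows a dropped element.

```python
import xml.etree.ElementTree as ET

def rename(elem, name):
    """Return a renamed copy of elem, or None if elem must be dropped."""
    new_tag = elem.get(name)
    if new_tag is None:
        return None  # no renaming attribute: drop the element and its content
    # Copy every attribute except the renaming attribute itself.
    attrs = {k: v for k, v in elem.attrib.items() if k != name}
    out = ET.Element(new_tag, attrs)
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        renamed = rename(child, name)
        if renamed is not None:
            out.append(renamed)
    return out

# A shortened limerick whose title has no renaming attribute.
SOURCE = (
    '<limerick stanza="stanza">'
    '<title>Relativity</title>'
    '<a stanza="line">There was a young lady named Bright</a>'
    '<b stanza="line">She set out one day</b>'
    '</limerick>'
)
target = rename(ET.fromstring(SOURCE), "stanza")
print(ET.tostring(target, encoding="unicode"))
```

The printed target contains a stanza element with two line children and no title, since the title element carried no stanza attribute.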
If you don't provide a TE with a transformation name, there is no renaming attribute, and rather than dropping all the elements, none of them are renamed. However, the target document may still differ from the source document in other ways.
The concept of renaming attributes comes from AEs; however, AEs do not require the name of the renaming attribute to be the same as the transformation name, and have different and more flexible rules about processing elements without renaming attributes.
3. Attribute Defaulting
This business of adding renaming attributes directly to the source document is irritating, and may be impossible if we aren't able to change the source document. Instead, we can take advantage of attribute defaulting by specifying a source schema. Consider the following DTD fragment:
    <!ATTLIST limerick stanza "stanza">
    <!ATTLIST title stanza "title">
    <!ATTLIST a stanza "line">
    <!ATTLIST b stanza "line">
This says that in the limerick element, if no stanza attribute is supplied, its value is assumed to be stanza. Likewise, for the title element, the default value of the stanza attribute is title, and for the a and b elements, it is line. Now we no longer have to alter our original limerick document when we want to transform it. If we specify the transformation name as stanza, we will get the same target document that we saw in the previous section.
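As a rough sketch of attribute defaulting, the ATTLIST declarations above can be held in a dictionary and applied before any other step. The DEFAULTS table and the apply_defaults function are assumptions made for illustration; a real TE would read the defaults from the source schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the ATTLIST declarations:
# element name -> {attribute name: default value}.
DEFAULTS = {
    "limerick": {"stanza": "stanza"},
    "title": {"stanza": "title"},
    "a": {"stanza": "line"},
    "b": {"stanza": "line"},
}

def apply_defaults(root, defaults):
    """Add each defaulted attribute that the document does not supply."""
    for elem in root.iter():
        for attr, value in defaults.get(elem.tag, {}).items():
            elem.attrib.setdefault(attr, value)  # explicit values are kept
    return root

doc = ET.fromstring(
    '<limerick><title>Relativity</title><a stanza="verse">x</a></limerick>'
)
apply_defaults(doc, DEFAULTS)
print(doc.get("stanza"))                # stanza
print(doc.find("title").get("stanza"))  # title
print(doc.find("a").get("stanza"))      # verse
```

An attribute that is already present in the source document, such as the stanza="verse" above, is left untouched.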
What is more, we can provide more than one renaming attribute in the same source schema. Suppose we add the following declarations to the above source schema:
    <!ATTLIST limerick estrofa "estrofa">
    <!ATTLIST title estrofa "título">
    <!ATTLIST a estrofa "línea">
    <!ATTLIST b estrofa "línea">
If we specify the transformation name as estrofa, we will generate a target document whose element names are in Spanish rather than English. However, the TE cannot automatically remove the defaulted stanza attribute when doing an estrofa transformation, nor vice versa, because it does not know which attributes might be used as renaming attributes in a different transformation run. In order to suppress them, we must provide the TE with a list of the renaming attributes that are not being used for the current transformation. In the rest of this paper we will assume that this list has been provided.
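Suppressing the unused renaming attributes amounts to a simple filter over the tree. The sketch below assumes the list is handed over as a plain Python list; the function name suppress_attributes is hypothetical.

```python
import xml.etree.ElementTree as ET

def suppress_attributes(root, unused):
    """Remove every attribute whose name appears in `unused`."""
    for elem in root.iter():
        for attr in unused:
            elem.attrib.pop(attr, None)  # absent attributes are ignored
    return root

# Result of an estrofa transformation that still carries defaulted
# stanza attributes.
doc = ET.fromstring(
    '<estrofa stanza="stanza"><línea stanza="line">x</línea></estrofa>'
)
suppress_attributes(doc, ["stanza", "index"])
print([dict(elem.attrib) for elem in doc.iter()])  # [{}, {}]
```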
Attribute defaulting is not restricted to renaming attributes. If any attribute is given a default value by the source schema but does not appear in the source document, it will be created, and by default will appear in the target document. Attribute defaulting is done in advance of all other transformations; a default attribute may have its name or value changed by a later transformation.
Attribute defaulting is inherent to DTD processing. The version of Examplotron used by TEs, Examplotron 0.8, allows the specification of default values for attributes, and in fact for elements too.
4. Element Reordering
So far, we haven't had to deal with child elements appearing in a different order in the source and target documents. However, this can often happen when the source document is data-oriented rather than content-oriented. In order to know how to reorder child elements, we must provide the TE with a target schema. Here's a simple target schema specifying a document containing people's names:
    <!ELEMENT people (person*)>
    <!ELEMENT person (last, first)>
In this schema, we see that a people element contains zero or more person elements and nothing else, and that each person element contains last and first elements in that order.
Now here's a source document:
    <people>
      <person>
        <first>John</first>
        <last>Cowan</last>
      </person>
      <person>
        <first>Dorian</first>
        <last>Cowan</last>
      </person>
    </people>
Suppose we pass this source document and the target schema to a TE without specifying a transformation name. In that case, there is no renaming attribute, and so no element renaming is done. However, since the order of child elements for the person element in the source document is not valid according to the target schema, they will be reordered so as to be valid in the target document, producing this:
    <people>
      <person>
        <last>Cowan</last>
        <first>John</first>
      </person>
      <person>
        <last>Cowan</last>
        <first>Dorian</first>
      </person>
    </people>
AEs do not perform element reordering.
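For a content model in which each child name occupies a single place, reordering can be sketched as a stable sort on the position of each name in the model. The function reorder_children is hypothetical, and the sketch assumes that every child is mentioned in the model and that the document contains no whitespace between elements.

```python
import xml.etree.ElementTree as ET

def reorder_children(parent, order):
    """Stable-sort the children by the position of their name in `order`."""
    rank = {name: i for i, name in enumerate(order)}
    parent[:] = sorted(parent, key=lambda child: rank[child.tag])

SOURCE = (
    "<people>"
    "<person><first>John</first><last>Cowan</last></person>"
    "<person><first>Dorian</first><last>Cowan</last></person>"
    "</people>"
)
people = ET.fromstring(SOURCE)
for person in people.findall("person"):
    reorder_children(person, ["last", "first"])  # from (last, first)
print(ET.tostring(people, encoding="unicode"))
```

Because the sort is stable, children that share a name keep their relative source order.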
5. Occurrence Control
Both source and target schemas can specify how many occurrences a child element can have within its parent element. In DTDs, we can repeat the element name to specify a fixed number of occurrences, as in this source schema for our limerick document:
    <!ELEMENT limerick (title, a, a, b, b, a)>
    <!ATTLIST limerick index "poem">
    <!ATTLIST a index "firstline">
Now suppose we run a TE, passing it the transformation name index, our original limerick document, the above source schema, and the following target schema:
    <!ELEMENT poem (firstline)>
The renaming attribute index will rename the limerick element to poem and the three a elements to firstline, dropping the title and b elements altogether. But since the target schema permits only a single firstline element in each poem element, the second and third firstline elements will also be dropped, producing the following target document:
    <poem>
      <firstline>There was a young lady named Bright</firstline>
    </poem>
This is suitable for inclusion in an index of first lines.
On the other hand, if the target schema requires more occurrences of an element than the source schema provides, sufficient elements are created following the mapped elements. For an example of that process, consider this source document with explicit renaming attributes:
    <couplet limerick="limerick">
      <line limerick="a">Go and tell the Spartans, passerby,</line>
      <line limerick="b">That here, obedient to their laws, we lie.</line>
    </couplet>
What happens if we transform this into a limerick using the limerick schema as the target schema? (There is nothing inherent in a schema that says whether it is a source or a target, only in how it is provided to a TE.) Limericks have to have a title and five lines, but we have only two lines here, one mapped (for some unknown reason) to an a element and one to a b element. Consequently, we get this target document:
    <limerick>
      <title/>
      <a>Go and tell the Spartans, passerby,</a>
      <a/>
      <b>That here, obedient to their laws, we lie.</b>
      <b/>
      <a/>
    </limerick>
Not very useful or pretty, perhaps, but certainly valid.
In this paper, newly created elements are shown as empty. However, if the Examplotron schema provides a default value for them, it will be used.
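One way to picture occurrence control for a fixed sequence such as (title, a, a, b, b, a) is to walk the target model slot by slot, reusing the next unused source child with the right name and creating an empty element when none is left. This is only a sketch of that idea, not the TE's actual algorithm; fit_to_sequence is a hypothetical name, and the input is the couplet after renaming.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, deque

def fit_to_sequence(parent, model):
    """Rebuild parent's children to match `model`, a list of child names."""
    pools = defaultdict(deque)  # child name -> unused source children
    for child in parent:
        pools[child.tag].append(child)
    fitted = []
    for name in model:
        if pools[name]:
            fitted.append(pools[name].popleft())  # reuse a source child
        else:
            fitted.append(ET.Element(name))       # create an empty one
    parent[:] = fitted  # children still in the pools are dropped
    return parent

couplet = ET.fromstring(
    "<limerick>"
    "<a>Go and tell the Spartans, passerby,</a>"
    "<b>That here, obedient to their laws, we lie.</b>"
    "</limerick>"
)
fit_to_sequence(couplet, ["title", "a", "a", "b", "b", "a"])
print([child.tag for child in couplet])
# ['title', 'a', 'a', 'b', 'b', 'a']
print([child.text is None for child in couplet])
# [True, False, True, False, True, True]
```

The same walk also drops surplus children: with the model ["firstline"], only the first of three firstline children survives.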
When specifying the content model of an element in a source or target schema, we can follow the name of a child element with * to mean "zero or more occurrences", as shown in the declaration of the people element. In the same way, ? means "zero or one occurrences" and + means "one or more occurrences". All these indicators are respected by a TE. So if two foo child elements appear in the source document, but the target schema specifies foo?, then the second one will be dropped. A TE cannot construct transformations based on more complex content models like ((a,b)+), in which the occurrence indicator follows a sequence of child element names, except as noted under the discussion of mixed content.
However, technically ambiguous content models like (line, line?, line?), meaning from one to three line elements, which are illegal in DTDs, are supported in Examplotron schemas as well as by a TE.
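The effect of the occurrence indicators on the number of children can be sketched by counting names. The two functions below are hypothetical and work on element names only; they assume a comma-separated model without parentheses and reject anything more complex.

```python
import re
from collections import Counter

def parse_model(model):
    """Split 'title, line+' into [('title', ''), ('line', '+')]."""
    items = []
    for token in model.split(","):
        match = re.fullmatch(r"\s*([\w.:-]+)([?*+]?)\s*", token)
        if match is None:
            raise ValueError(f"unsupported content model item: {token!r}")
        items.append(match.groups())
    return items

def fit_names(names, model):
    """Return the child names kept or created, in target order."""
    available = Counter(names)
    result = []
    for name, indicator in parse_model(model):
        have = available[name]
        if indicator == "":
            take = 1             # exactly one: created if missing
        elif indicator == "?":
            take = min(have, 1)  # at most one: extras are dropped
        elif indicator == "*":
            take = have          # all that are available
        else:                    # "+"
            take = max(have, 1)  # at least one: created if missing
        available[name] = max(have - take, 0)
        result.extend([name] * take)
    return result

print(fit_names(["foo", "foo"], "foo?"))                  # ['foo']
print(fit_names(["line", "line"], "line, line?, line?"))  # ['line', 'line']
print(fit_names([], "item+"))                             # ['item']
```

A model such as (a,b)+ raises ValueError here, mirroring the statement that a TE cannot transform against such models.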
AEs neither drop unwanted elements nor create new ones, but report validation errors instead.
6. Character Content
So far, the source and target schemas we have seen have been incomplete, because not all the elements used in the documents have been mentioned in the schemas. In particular, declarations for the elements whose only permitted content is characters, such as title, have been left out. Here's a complete version of the limerick source schema with all three renaming attributes provided:
    <!ELEMENT limerick (title, a, a, b, b, a)>
    <!ATTLIST limerick stanza "stanza">
    <!ATTLIST limerick estrofa "estrofa">
    <!ATTLIST limerick index "poem">
    <!ELEMENT title #PCDATA>
    <!ATTLIST title stanza "title">
    <!ATTLIST title estrofa "título">
    <!ELEMENT a #PCDATA>
    <!ATTLIST a stanza "line">
    <!ATTLIST a estrofa "línea">
    <!ATTLIST a index "firstline">
    <!ELEMENT b #PCDATA>
    <!ATTLIST b stanza "line">
    <!ATTLIST b estrofa "línea">
And here is an erroneous target schema for stanza documents:
    <!ELEMENT stanza (title, line*)>
    <!ELEMENT title #PCDATA>
    <!ELEMENT line EMPTY>
Let's see what happens if we do a stanza transformation using that target schema. We get this target document:
    <stanza>
      <title>Relativity</title>
      <line/>
      <line/>
      <line/>
      <line/>
      <line/>
    </stanza>
Because the target schema specified the line element as empty (no child elements or character content), the TE threw away the character content. Again, probably not very useful, but again certainly valid.
Reordering and occurrence control are really two aspects of the same thing, and they can both happen to the same children of an element at the same time. Here is a not-very-realistic example. Given the source document
    <root>
      <a id="a1"/>
      <b id="b1"/>
      <a id="a2"/>
      <b id="b2"/>
      <a id="a3"/>
    </root>
and the target schema
    <!ELEMENT root (a, a, b, b, b)>
the target document will be
    <root>
      <a id="a1"/>
      <a id="a2"/>
      <b id="b1"/>
      <b id="b2"/>
      <b/>
    </root>
That is, the a elements have been reordered before the b elements, the third a element has been dropped as unwanted, and a third b element has been created.
AEs allow greater control of what happens to character content when an element containing it is dropped from the target document: it may be discarded or included as part of the parent element. TEs always discard it unless the parent element's content model is specified as mixed content.
7. Mixed Content
An element has mixed content when its content includes both child elements and characters. Consider this limerick:
    <limerick>
      <title>Memory</title>
      <a>There was an old man of Khartoum</a>
      <a>Who kept two black sheep in his room.</a>
      <b><quote>"They remind me,"</quote> he said,</b>
      <b><quote>"Of two friends who are dead,</quote></b>
      <a><quote>But I <em>cannot</em> remember of whom."</quote></a>
    </limerick>
Because of the quote and em elements, this document isn't valid against our latest limerick schema. Let's add the following declarations to our limerick schema, replacing the existing declarations for the a and b elements:
    <!ELEMENT em (#PCDATA|quote|em)*>
    <!ELEMENT quote (#PCDATA|quote|em)*>
    <!ELEMENT a (#PCDATA|quote|em)*>
    <!ELEMENT b (#PCDATA|quote|em)*>
The meaning of these element declarations is that the specified child elements (quote and em in this case) may appear in any order, any number of times, interleaved with the character content if any. This is the only kind of mixed content that DTDs support. Examplotron permits more restrictive sorts of mixed content, but a TE cannot handle them. If we do a stanza transformation, then because the a and b elements are declared to have mixed content, instead of simply dropping the quote and em elements along with their content as you might expect, their content is preserved. The result, then, is the same as if no quotation or emphasis markup had appeared in the source document.
What would happen if the target schema for stanzas allowed em elements but not quote elements? Then the final line's content would become:
    <line>But I <em>cannot</em> remember of whom."</line>
By definition, reordering is never done on mixed content. It is the presence of mixed content in the source schema, not in the target schema, that triggers this style of processing, although you usually want to specify mixed content in both schemas.
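The unwrapping of disallowed elements in mixed content can be sketched as a recursive copy that keeps the character data but omits the tags of any element the target does not allow. The helper names append_text and copy_mixed are hypothetical, and the set of allowed names is passed in directly rather than read from a target schema.

```python
import xml.etree.ElementTree as ET

def append_text(parent, text):
    """Append character data at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text

def copy_mixed(source, target, allowed):
    """Copy source's mixed content into target, unwrapping other elements."""
    append_text(target, source.text)
    for child in source:
        if child.tag in allowed:
            kept = ET.SubElement(target, child.tag, dict(child.attrib))
            copy_mixed(child, kept, allowed)
        else:
            copy_mixed(child, target, allowed)  # keep content, drop the tags
        append_text(target, child.tail)
    return target

source = ET.fromstring(
    '<a><quote>But I <em>cannot</em> remember of whom."</quote></a>'
)
line = copy_mixed(source, ET.Element("line"), allowed={"em"})
print(ET.tostring(line, encoding="unicode"))
# <line>But I <em>cannot</em> remember of whom."</line>
```

Passing an empty set for allowed gives the behaviour of the stanza transformation, in which neither quote nor em survives but all of their text does.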
In summary, the content models that a TE supports are mixed content, character-only content, empty content, and element content consisting of a simple sequence of child element names, possibly decorated with occurrence indicators. All other content models are unsupported for transformation, though they are permitted for validation.
8. Attribute Mapping
So far, the value of a renaming attribute has been a single token, an element name. But if the renaming attribute contains multiple tokens separated by whitespace, the first token is the element name for element mapping, and the rest of the tokens are pairs of equivalent source and target attribute names. For example, here's a link element that contains a renaming attribute to map it to an HTML a element:
    <link target="http://examplotron.org" html="a target href"> Examplotron </link>
Running a TE on this source document and providing html as the transformation name produces this target document:
    <a href="http://examplotron.org"> Examplotron </a>
TEs support three special cases of attribute mapping. If the target attribute name is replaced by #NONE, then the source attribute will be omitted from the target document. If the source attribute is #CONTENT, then the target attribute's value does not come from any source attribute, but from the character content of the element; likewise, if the target attribute is #CONTENT, then the source attribute is removed and its value is used as character content of the target element. Here's an example of all three special cases. The source element
    <url purpose="linkage" label="Examplotron" html="a purpose #NONE label #CONTENT #CONTENT href"> http://examplotron.org </url>
is transformed by dropping the purpose attribute, putting the character content into the href attribute, and putting the value of the label attribute into the character content of the target element (an a element), thus producing the same result (modulo whitespace) as the transformation of the link element above.
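Attribute mapping with its three special cases might be sketched as below for a single element without children. Several points are assumptions of the sketch rather than statements about TEs: the name map_element is hypothetical, unmapped attributes are copied unchanged, character content moved into an attribute is stripped of surrounding whitespace, and content moved into an attribute is removed from the element unless another pair replaces it. All pairs read the original source values, in keeping with the simultaneous nature of the transformations.

```python
import xml.etree.ElementTree as ET

def map_element(elem, name):
    """Apply the renaming attribute `name` to one childless element."""
    tokens = elem.get(name, "").split()
    if not tokens:
        return None  # no renaming attribute: the element is dropped
    new_tag, rest = tokens[0], tokens[1:]
    pairs = list(zip(rest[0::2], rest[1::2]))  # (source, target) names
    attrs = {k: v for k, v in elem.attrib.items() if k != name}
    content = (elem.text or "").strip()        # original character content
    out = ET.Element(new_tag)
    replacement, moved = None, False
    for source, target in pairs:
        if source == "#CONTENT":
            out.set(target, content)           # content becomes an attribute
            moved = True
            continue
        value = attrs.pop(source, None)
        if value is None or target == "#NONE":
            continue                           # absent or omitted attribute
        if target == "#CONTENT":
            replacement = value                # attribute becomes content
        else:
            out.set(target, value)             # plain attribute renaming
    for key, value in attrs.items():
        out.set(key, value)                    # assumption: copy the rest
    if replacement is not None:
        out.text = replacement
    elif not moved:
        out.text = elem.text
    return out

url = ET.fromstring(
    '<url purpose="linkage" label="Examplotron" '
    'html="a purpose #NONE label #CONTENT #CONTENT href">'
    ' http://examplotron.org </url>'
)
print(ET.tostring(map_element(url, "html"), encoding="unicode"))
# <a href="http://examplotron.org">Examplotron</a>
```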
As a further extension to attribute mapping, if a source/target attribute name pair is followed by the token #MAPTOKEN, it is then followed by a source token and a target token. The source attribute value is then divided into tokens by whitespace, and if the source token appears in it, it is replaced by the target token. There may be any number of such triples of #MAPTOKEN, source token, target token following a source/target attribute pair.
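Parsing the attribute pairs together with their #MAPTOKEN triples might look like the following sketch. The function names and the sample renaming attribute (with its rel, stylesheet, and alternate tokens) are invented for illustration and do not come from any TE.

```python
def parse_mappings(tokens):
    """Parse the tokens after the element name into
    [(source_attr, target_attr, {source_token: target_token}), ...]."""
    mappings = []
    i = 0
    while i < len(tokens):
        if i + 1 >= len(tokens):
            raise ValueError("source attribute without a target attribute")
        source, target = tokens[i], tokens[i + 1]
        i += 2
        token_map = {}
        # Collect any number of #MAPTOKEN triples for this pair.
        while i < len(tokens) and tokens[i] == "#MAPTOKEN":
            if i + 2 >= len(tokens):
                raise ValueError("incomplete #MAPTOKEN triple")
            token_map[tokens[i + 1]] = tokens[i + 2]
            i += 3
        mappings.append((source, target, token_map))
    return mappings

def map_value(value, token_map):
    """Replace each whitespace-separated token that has a mapping."""
    return " ".join(token_map.get(tok, tok) for tok in value.split())

# Hypothetical renaming attribute value: "a rel rel #MAPTOKEN ... target href"
renaming = "a rel rel #MAPTOKEN stylesheet css #MAPTOKEN alternate alt target href"
mappings = parse_mappings(renaming.split()[1:])
print(mappings[0])
# ('rel', 'rel', {'stylesheet': 'css', 'alternate': 'alt'})
print(map_value("alternate stylesheet", mappings[0][2]))
# alt css
```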
This mechanism is usable but crude, and should eventually be replaced by something less hacky. In AEs the source/target attribute pairs and mapping-token triples are in a separate attribute from the renaming attribute.
International Organization for Standardization. SGML Extended Facilities, normative annex A to ISO/IEC 10744. "A.3 Architectural Form Definition Requirements (AFDR)." [online]. © 1992, 1997 [cited 12 July 2013]. http://www.pms.ifi.lmu.de/mitarbeiter/ohlbach/multimedia/HYTIME/ISO/clause-A.3.html.
van der Vlist, Eric. "Examplotron" [online]. © 2003 [cited 12 July 2013]. http://www.examplotron.org.