A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content)
Copyright © 2016 Uche Ogbuji
There have been many attempts to modify XML, usually to simplify it. This should be considered a normal consequence of XML's success. There have been even more episodes of insistence that XML is dead because some other format was supplanting it, whether YAML, HTML5 or JSON. This is also a natural consequence of XML's success. For a long time the idea of a modest simplification of XML has buzzed around groups of XML experts. MicroXML [MicroXML Spec] is both a W3C community group [MicroXML Community Group] and a spec that emerged from that group, offering a backward-compatible format that looks to keep the best of XML while omitting anything considered too complicating.
The MicroXML specification is only eight pages or so, compared to the 37-odd pages of the XML 1.0 specification. Even so, MicroXML provides something XML 1.0 did not: a data model. In the XML world the lack of a data model in the foundational spec led to a succession of separate specifications for XML data models, including the XML Infoset and the XPath Data Model (XDM) for XPath 2.0 and beyond. It's worth noting that the simplest of these, the XML Infoset, is in itself twice the length of the MicroXML spec. The most widely used data model was one of the first options, the Document Object Model (DOM), which was enormously complicated, partly because it also served as a scaffolding for dynamic, in-browser operations on HTML as well as XML. By including a data model which can be specified in 4 pages or so, MicroXML helps enforce simplicity and improves the likelihood of interoperability of implementations.
The next logical step after considering the MicroXML data model is thinking about how to jot down basic expressions in the context of a document. XPath provides a way to do so in the XML space, and developing a subset and variation on XPath suitable for MicroXML would be a valuable furtherance of that technology stack. There are several possible approaches to developing a MicroXPath, from creating an entirely new language to adapting an existing one such as CSS 3 or XPath. This paper describes an approach largely based on XPath 1.0, but with one major concept taken from XPath 2.0.
XPath, like XML, has been used in many different ways. XPath 1.0 developed as a unified selection language for XPointer and expression language for XSLT. Use of XPointer faded away, and XSLT predominated, but many limitations of XSLT 1.0 emerged. Many users, however, found XPath useful as a utility language within non-XSLT host environments, from XML databases to general-purpose programming languages. One simple XPath could eliminate dozens of lines of DOM traversal code. When the time came to develop XPath 2.0 there was a clamor for features from XSLT users as well as XML database users. This led to a language with many more features than XPath 1, but one that is also far more complex. This added complexity for the most part isn't needed with powerfully expressive host environments, such as Java or Python programs.
The target for MicroXPath is to assume that very sophisticated processing can be done by a Turing-complete and fully expressive modern programming language. MicroXPath focuses on delivering nodes from the document to the host environment. It does offer a system of expressions which goes beyond MicroXML nodes, but this is largely to power predicate operations which are used to narrow down the selection of nodes.
Before getting to MicroXPath design goals, it's worth remembering the key goals of MicroXML. As established by the Community Group these are as follows.
The syntax of MicroXML is a subset of XML 1.0.
MicroXML specifies a data model and a mapping from the syntax to the data model, which is substantially consistent with XML 1.0.
MicroXML is dramatically simpler than XML regarding its specification, syntax, and data model.
MicroXML is designed to complement rather than replace XML, JSON, and HTML.
MicroXML supports the needs of documents, in particular mixed content.
MicroXML supports Unicode.
MicroXML supports the use of text editors for authoring.
MicroXML is able to straightforwardly represent HTML.
The specification of MicroXML is as self-contained as is practical.
MicroXPath is inspired by the above, and has its own minimal set of design goals.
A large proportion of XPath 1.0 expressions produce similar results in MicroXPath, notably excluding expressions which involve namespaces.
MicroXPath incorporates additional features based on experience with the limitations of XPath 1.0.
MicroXPath is read-only. It does not modify the context provided by the host environment.
MicroXPath is designed to provide information from MicroXML nodes directly to a modern, Turing-complete host language. MicroXPath itself is not intended to be computationally complete in any way except in its reach of MicroXML nodes based on a provided context structure.
The final goal also enshrines the decision for MicroXPath to be substantially based on XPath 1.0 and not XPath 2.0 or 3.0. Other design principles important to these later XPath versions, such as composability, and thus mathematical closure, are pursued in MicroXPath only as far as practical. MicroXPath also preserves "syntactic sugar" from XPath, such as axes, which are a convenient way of writing node traversal along common relationships, and node tests, which are a useful abbreviation of some predicates.
MicroXPath has a data model that's a superset of the MicroXML data model. A key construct in MicroXPath is the sequence, which is based on the XPath 2.0 construct. XPath 1.0 was built around the concept of node sets, for a variety of reasons relating to its origins supporting XPointer and XSLT. This led to a great deal of confusion among users, especially when the result tree fragment construct from XSLT was brought into the picture. XPath 2.0 drew from many of those lessons to rely on a more versatile sequence construct, including a node list, which is a sequence of nodes. MicroXPath adopts this approach. The results of all MicroXPath expressions are sequences.
A MicroXPath sequence provides zero or more objects. An object can be one of four types.
element (MicroXML element item. There are no other node types.)
boolean (true or false)
number (floating-point number)
string (sequence of UCS characters. This is an abstract sequence, and not a MicroXML sequence object in itself.)
A MicroXPath sequence cannot contain another MicroXPath sequence. Nesting is not allowed. Any operation that would seem to result in nested sequences implicitly has those sequences flattened.
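The flattening rule can be illustrated with a short Python sketch. This is not taken from any MicroXPath implementation: the helper name `flatten` is hypothetical, and tuples and lists are assumed to stand in for MicroXPath sequences.

```python
from typing import Any, Iterable, Iterator


def flatten(items: Iterable[Any]) -> Iterator[Any]:
    """Yield the items of a possibly nested sequence as one flat sequence."""
    for item in items:
        # Tuples and lists stand in for MicroXPath sequences (an assumption).
        # Strings are atomic objects, so they are never split into characters.
        if isinstance(item, (tuple, list)):
            yield from flatten(item)
        else:
            yield item


nested = (10, (1, 2), (), (3, 4), (5,))
flat = list(flatten(nested))
```

Here `flat` holds the six items 10, 1, 2, 3, 4, 5, and the empty component sequence contributes nothing.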
A MicroXPath expression is evaluated with respect to a context, comprising the information that can affect the result of the expression, namely the following.
context node, the current item being processed
context object, an object of any type. If a node, it must be identical to the context node
context position, a non-zero positive integer giving the position of the context node within the sequence of items being processed
context size, a non-zero positive integer giving the number of nodes in the sequence of items being processed
variable bindings, a mapping from names to values set by the hosting environment
function library, a mapping from names to behaviors set by the hosting environment
key bindings, a mapping of mappings to make available through the key function
The biggest difference from the XPath 1 context is the addition of the context object. This is meant to support predicate expressions, which can operate on any sequence. In an expression such as (1, 2, 3, 4, 5)[. > 3], the predicate sees each number in order, and the . computes to each number, so that in this case the result would be the sequence (4, 5).
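To make the role of the context in a predicate concrete, here is a minimal Python sketch. The `Context` class and `apply_predicate` function are hypothetical names chosen for illustration, and the predicate is modeled as a plain Python callable rather than a parsed MicroXPath expression.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List


@dataclass
class Context:
    item: Any       # the context object, i.e. what "." computes to
    position: int   # 1-based position within the sequence being processed
    size: int       # number of items in the sequence being processed


def apply_predicate(seq: Iterable[Any],
                    predicate: Callable[[Context], bool]) -> Iterator[Any]:
    """Yield the items of seq for which the predicate holds."""
    items: List[Any] = list(seq)  # materialize so the context size is known
    for position, item in enumerate(items, start=1):
        if predicate(Context(item=item, position=position, size=len(items))):
            yield item


# Analogue of (1, 2, 3, 4, 5)[. > 3]
greater = list(apply_predicate((1, 2, 3, 4, 5), lambda ctx: ctx.item > 3))
# Analogue of selecting the last item via position and size
last = list(apply_predicate((1, 2, 3, 4, 5), lambda ctx: ctx.position == ctx.size))
```

Because the predicate only ever sees a `Context`, the same mechanism works whether the sequence holds nodes, numbers, strings or booleans.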
MicroXPath does introduce two node types which are not in the MicroXML data model. These are required in order to ensure that most XPath expressions retain similar semantics in MicroXPath. For example, without a root node, the semantics of absolute location paths would be radically different. MicroXPath introduces a root node object purely for purposes of expression evaluation. It also introduces attribute nodes. It is perfectly acceptable for an implementation to not construct root nodes or attribute nodes until required by expression semantics. The MicroXPath implementation I wrote does use such a just-in-time strategy in constructing root and attribute nodes.
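A just-in-time strategy of this kind might look roughly like the following Python sketch. The classes `Element`, `AttributeNode` and `RootNode` and the two helper functions are hypothetical stand-ins, not the data structures of any actual implementation.

```python
from typing import Dict, Iterator, List, Optional, Union


class Element:
    """Stand-in for a MicroXML element: a name, attributes and children."""

    def __init__(self, name: str, attrs: Optional[Dict[str, str]] = None,
                 children: Optional[List[Union["Element", str]]] = None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = list(children or [])
        self.parent: Optional["Element"] = None
        for child in self.children:
            if isinstance(child, Element):
                child.parent = self


class AttributeNode:
    """Built on demand for the attribute axis; never stored in the tree."""

    def __init__(self, parent: Element, name: str, value: str):
        self.parent, self.name, self.value = parent, name, value


class RootNode:
    """Built on demand so that absolute location paths have a starting point."""

    def __init__(self, document_element: Element):
        self.children = [document_element]


def attribute_axis(elem: Element) -> Iterator[AttributeNode]:
    # Attribute nodes come into existence only when an expression asks for them.
    for name, value in elem.attrs.items():
        yield AttributeNode(elem, name, value)


def root_of(elem: Element) -> RootNode:
    # Walk up to the document element, then wrap it in a fresh root node.
    while elem.parent is not None:
        elem = elem.parent
    return RootNode(elem)


link = Element("a", {"href": "http://example.org"}, ["home"])
doc = Element("html", children=[Element("body", children=[link])])
hrefs = [(a.name, a.value) for a in attribute_axis(link)]
root = root_of(link)
```

This sketch builds a new root node on every call. An implementation that needs node identity to be stable across expressions would presumably cache the root and attribute nodes it creates.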
A location path selects a set of nodes relative to the context node. MicroXPath location paths are very similar to XPath ones, sometimes nicknamed "Tumblers" among XPointer users. The main difference is that name tests are never in the form of QNames. The result of evaluating an expression that is a location path is a sequence of nodes.
All the examples from the top of section 2 of the XPath 1.0 spec [XPath 1.0] happen to be valid MicroXPath expressions. Ditto all location paths using abbreviated syntax in section 2.5. All XPath 1.0 axes are valid in MicroXPath excepting the namespace axis. In other words the following are MicroXPath axes, each containing the same nodes as in XPath 1.0: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, parent, preceding, preceding-sibling and self.
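As an illustration of how axes and name tests compose into location steps, here is a small Python sketch covering only the child and descendant axes. The `Element` class and the function names are hypothetical, and string children are assumed to be the text content of an element.

```python
from typing import Iterable, Iterator, List, Union


class Element:
    """Stand-in for a MicroXML element with element and string children."""

    def __init__(self, name: str,
                 children: Iterable[Union["Element", str]] = ()):
        self.name = name
        self.children: List[Union["Element", str]] = list(children)


def child(node: Element) -> Iterator[Element]:
    """child axis: element children, in document order."""
    for item in node.children:
        if isinstance(item, Element):
            yield item


def descendant(node: Element) -> Iterator[Element]:
    """descendant axis: depth-first, which is document order."""
    for item in child(node):
        yield item
        yield from descendant(item)


def name_test(nodes: Iterable[Element], name: str) -> Iterator[Element]:
    """Keep nodes whose name matches; '*' matches any element."""
    for node in nodes:
        if name == "*" or node.name == name:
            yield node


doc = Element("doc", [
    Element("sec", ["intro", Element("para"), Element("sec", [Element("para")])]),
    Element("para"),
])
# Roughly what descendant::para and child::* select with doc as context node.
paras = list(name_test(descendant(doc), "para"))
top = [e.name for e in name_test(child(doc), "*")]
```

Since there are no namespaces, the name test is a plain string comparison, which is the main simplification over XPath 1.0 QName matching.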
Attributes are the principal node type for the attribute axis, and elements for all other axes. There are only two node tests in MicroXPath, node() and text().
Syntactically MicroXPath is much the same as XPath 1.0, but there is one significant addition. MicroXPath provides a syntax for creating sequences, borrowing from XPath 2.0/3.0. The following are examples of expressions that construct sequences, taken from 3.4.1 of the XPath 3 spec. They have the same semantics in MicroXPath.
(10, 1, 2, 3, 4) results in a sequence of five integers.
(10, (1, 2), (), (3, 4), (5)) results in a sequence with six items, 10, 1, 2, 3, 4, 5. The five component sequences of length one, two, zero, two, and one, respectively, are combined.
(salary, bonus) results in a sequence containing all salary children of the context node followed by all bonus children. The salary and bonus children flow into the result separately in document order, but the resulting sequence may not be in document order.
($price, $price) results in a sequence with the value of the variable $price twice over. If $price is bound to the value 10.50, the result of this expression is the sequence 10.50, 10.50. If $price is bound to the sequence 1, 2, 3, the result of this expression is the sequence 1, 2, 3, 1, 2, 3.
As you can see from the second example, the empty sequence is expressed as () and a sequence of a single item ($item) is the same as the item expressed directly, $item. This derives from the fact that all MicroXPath expressions are sequences. XPath 2.0/3.0 range expressions are not supported, but there are core functions to provide similar features.
The MicroXPath union operator | behaves much as in XPath 1, and is also a way to arrange a sequence into document order. The first items in the result sequence of $a|$b would be all nodes in either $a or $b, in document order. Next would come all strings, sorted by code points, then all numbers, sorted numerically, then finally booleans, false if it occurs in either of the arguments, followed by true, if it occurs. This is a simple case of the more generalized function of the built-in union function discussed below.
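The ordering rules for a union can be sketched in Python as follows. The `Node` class with its `doc_order` attribute is a hypothetical stand-in for real document-order comparison, and the removal of duplicate strings and numbers is an assumption made here by analogy with the boolean rule, not something the text above states.

```python
from typing import Any, Dict, Iterable, List, Set


class Node:
    """Hypothetical node carrying its position in document order."""

    def __init__(self, name: str, doc_order: int):
        self.name = name
        self.doc_order = doc_order


def union(*sequences: Iterable[Any]) -> List[Any]:
    """Combine sequences: nodes, then strings, then numbers, then booleans."""
    nodes: Dict[int, Node] = {}
    strings: Set[str] = set()
    numbers: Set[float] = set()
    booleans: Set[bool] = set()
    for seq in sequences:
        for item in seq:
            if isinstance(item, Node):
                nodes[id(item)] = item      # de-duplicate by identity
            elif isinstance(item, bool):    # test bool first: it subclasses int
                booleans.add(item)
            elif isinstance(item, (int, float)):
                numbers.add(item)           # assumption: duplicates dropped
            elif isinstance(item, str):
                strings.add(item)           # assumption: duplicates dropped
    return (sorted(nodes.values(), key=lambda n: n.doc_order)
            + sorted(strings)               # Python orders str by code point
            + sorted(numbers)
            + sorted(booleans))             # False sorts before True


a, b = Node("a", 1), Node("b", 2)
result = union([b, "z", 3, True], [a, "A", 1.5, False, b])
```

The node `b` appears in both arguments but only once in `result`, and it follows `a` because of document order rather than argument order.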
Unlike XPath there is no syntactic distinction between core functions and extension functions. MicroXPath defines a core library of functions which must always be made available by a conforming host environment. The host environment can provide any additional functions as long as their names do not conflict with any of the core function names, nor the reserved names "node" and "text" (reserved because they are names of node tests which are semantically different from functions but use similar-seeming syntax).
XPath 1 functions related to namespace processing are not in the MicroXPath core library. The name function returns the simple node name (generic identifier) of an element or attribute node. Another change is that
Other changes have to do with the use of sequences. The count function now takes any sequence and returns the number of items in that sequence. The string function always operates on the first item in its argument sequence, regardless of
MicroXPath defines a number of additional functions, many of which derive from EXSLT [EXSLT]. It borrows the key function from XSLT 1.0 [XSLT 1.0], but MicroXPath does not specify a way to create lookup tables (e.g. <xsl:key …/>). Rather it's up to the host environment to provide lookup tables in the execution context. MicroXML does not support any aspect of DocType declarations (for ID DTD types), nor does it support namespaces (for xml:id attributes), so there is no standard way to specify element IDs. As such there is no id function in MicroXPath and users must rely on key. There is no lang function either, but there is a same-lang function which can be used to do similar ISO-639-modulo comparisons.
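One way a host environment might supply lookup tables for key is sketched below in Python. The shape of the key bindings (a mapping of table names to mappings of key values to node lists) follows the "mapping of mappings" described for the context. The function signature, the table names and the use of plain dicts as nodes are all assumptions for illustration.

```python
from typing import Any, Dict, List, Mapping

# key table name -> (key value -> matching nodes)
KeyBindings = Mapping[str, Mapping[str, List[Any]]]


def key(bindings: KeyBindings, name: str, value: str) -> List[Any]:
    """Return the sequence of nodes registered under value in table name."""
    return list(bindings.get(name, {}).get(value, []))


# The host builds the tables itself; nodes are plain dicts here for brevity.
people = [
    {"name": "person", "attrs": {"id": "p1", "dept": "eng"}},
    {"name": "person", "attrs": {"id": "p2", "dept": "eng"}},
    {"name": "person", "attrs": {"id": "p3", "dept": "ops"}},
]
by_id: Dict[str, List[Any]] = {}
by_dept: Dict[str, List[Any]] = {}
for person in people:
    by_id.setdefault(person["attrs"]["id"], []).append(person)
    by_dept.setdefault(person["attrs"]["dept"], []).append(person)

key_bindings = {"ids": by_id, "depts": by_dept}
```

A table keyed on an id-like attribute, such as the hypothetical "ids" table here, is how a host could recover the effect of the missing id function.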
union() takes two or more arguments and returns the sequence union of these arguments, with
union($a, $b, $c) returns the same results as
$a|$b|$c, but you might expect the former to be more readily optimized by implementations.
intersection() takes two or more arguments and returns the sequence intersection of these arguments, i.e. only items occurring in all the arguments. The resulting order is analogous to the results order of union().
MicroXPath offers a few core functions using the regular expression syntax as defined by Perl [Perl Regex]. This is very similar to the regular expressions defined in section 7.6 of XPath 2.0 Functions and Operators [XPath 2.0 Functions]. The functions are
object-type() is similar to exsl:object-type(), operating on the first item in the argument sequence.
evaluate() is similar to dyn:evaluate(), and provides some of the power of higher-order functions. MicroXPath does not support functions as a core data type, as in XPath 3 [XPath 3.0] (see section 3.2.2, "Dynamic Function Call"). For one thing, this would require side effects in the host environment. This is not necessarily a cast-in-iron design principle, but evaluate() seems to provide most of the useful additional power in a simpler package.
There are also MicroXPath core functions to provide similar features to XPath 2.0/3.0 range expressions and to provide map/apply capabilities.
There is an implementation in Amara 3 [Amara 3 MicroXPath], written for Python 3.4 or higher, which also implements MicroXML. Amara 3 is able to parse XML 1.0 into the MicroXML data model, so as long as you don't need to worry about deep namespace processing, you can readily try out MicroXPath on XML 1.0 documents as well as MicroXML.
The implementation uses the Ply library to generate the lexical scanner and parse table to create an AST. The AST's control flow uses Python's generators and the yield statement to reflect the fact that MicroXPath expressions result in a sequence. This also makes for straightforward and efficient computation. Expressions that aggregate from component expressions use the yield from statement, introduced in Python 3.3. This feature also makes it easy to treat sequences as the dynamic outcomes of operations, efficient and automatically flattened. The computation of location paths naturally results in document ordered sequences. Sorting by document order is only required with the union operator and the set functions.
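The generator-based approach can be sketched as follows. These AST classes and the `compute` method name are hypothetical, written to show the yield and yield from pattern rather than to reproduce the Amara 3 code, and the context is reduced to a plain dict of variable bindings.

```python
from typing import Any, Dict, Iterator


class Literal:
    """AST node for a literal value: a sequence of exactly one item."""

    def __init__(self, value: Any):
        self.value = value

    def compute(self, ctx: Dict[str, Any]) -> Iterator[Any]:
        yield self.value


class VariableRef:
    """AST node for $name; the binding may be one item or a sequence."""

    def __init__(self, name: str):
        self.name = name

    def compute(self, ctx: Dict[str, Any]) -> Iterator[Any]:
        value = ctx[self.name]
        if isinstance(value, (list, tuple)):
            yield from value
        else:
            yield value


class SequenceExpr:
    """AST node for (e1, e2, ...): aggregates its component expressions."""

    def __init__(self, *parts: Any):
        self.parts = parts

    def compute(self, ctx: Dict[str, Any]) -> Iterator[Any]:
        for part in self.parts:
            # yield from splices each component's items into the result,
            # so nested sequence expressions are flattened for free.
            yield from part.compute(ctx)


# Analogue of ($price, $price)
twice = SequenceExpr(VariableRef("price"), VariableRef("price"))
doubled = list(twice.compute({"price": (1, 2, 3)}))
# Analogue of (10, (1, 2), (), (5))
nested = SequenceExpr(Literal(10), SequenceExpr(Literal(1), Literal(2)),
                      SequenceExpr(), SequenceExpr(Literal(5)))
flat = list(nested.compute({}))
```

Because every `compute` is a generator, items are produced lazily, and a consumer that only needs the first item never forces the rest of the sequence to be computed.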
XPath 1.0 has always been an admirably well-designed language, in the context of its design constraints. It had to serve the uneven masters of XSLT and XPointer, the latter of which then faded into obscurity. It had to deal with the quirks of XML Namespaces. It had to account for limitations claimed by key implementors which now seem quaint in the light of the latest software engineering. Despite that it has been very successful. Even as many of the small ranks of hard-core XML types moved on to later XPath versions and XQuery, XPath 1.0 remains the most widely used and most recognizable processing technology. It has, especially in its location paths system, influenced other languages and processing tools.
The significant simplification of MicroXML over XML 1.x has opened up a fine opportunity to explore what XPath could be like with fewer shackles on its design. MicroXPath takes the best of XPath 1.0, and sprinkles in a few key bits from XPath 2.0/3.0 and EXSLT. The resulting language is about on par in complexity with XPath 1.0. In truth the "Micro" moniker is not as apt as it is with "MicroXML." MicroXPath omits dealing with namespaces, comments and processing instructions, but it adds operations related to sequences, and some useful core functions. Implementing MicroXPath was one test of it as a language, and efficient implementation in Python proved a quite natural use of that language's generator/iterator features.
[MicroXML Spec] MicroXML Specification http://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html.
[XPath 2.0] XML Path Language (XPath) 2.0 (Second Edition), W3C Recommendation 14 December 2010 https://www.w3.org/TR/xpath20/.
[Amara 3 XML Toolkit] https://github.com/uogbuji/amara3-xml/.