Times are reported in seconds. Each of the three ratio lines in the table sets the time for one
of the tests to 1 and expresses the times of the other implementations as multiples of it.

The results show that querying along the long axes took more than 16 times as much
time as querying along the sibling axes. Using the @offset attribute value
instead of either the long axes or the sibling axes saved an additional 11% in time, and
using the @last attribute value as well saved an additional 41% in time
over that. All told, the implementation that relies on the long axes took more than 25
times as much time as the one with the greatest optimization.

Is it XML?

The XML version of the poem has an inherent hierarchy (the poem contains books, which
contain lines, which contain words) and inherent order (the words occur in a particular
order, as do the lines and books). Those inherent features are encoded naturally in the
structure of the XML document because XML documents are obligatorily hierarchical (even
though in some projects the hierarchy may be flat) and ordered (even though in some
projects the user may ignore the order). The addition of @offset and
@last attributes and the adoption of a strategy that treats the
document as flat and never looks at the hierarchy essentially transform the approach
from one that is based on natural properties of XML documents to one that is based on a
flat-file database way of thinking. That is, we could map each
<word> element in the XML version to a record in a
database table, the fields of which would be the textual representation of the word (a
character string), the offset value (a unique positive integer), an indication of
whether the word falls at the end of a line (a boolean value), and the book and line
number (a string value, which is used in reporting). Records in a database do not have
an inherent order, but once we rely on the value of the @offset attribute
in the XML document, the <word> elements might as well be
sprinkled through the document in any order, and the <line>
and <book> elements play no role at all in the system. That
is, except for the book and line number, the most highly optimized (and most efficient)
implementation above adopts precisely a flat-file database approach, which raises the
question of whether this project should have been undertaken in XML in the first
place.

The answer to that rhetorical question is that of course it should have been
undertaken in XML because the order and hierarchy are meaningful. They are inherent in
the XML structure but must be written explicitly into a corresponding database
implementation, which indicates that this is data that wants, as it were, to be regarded
as an ordered and hierarchical XML document. The problem is not that the data is
inherently tabular, and therefore inherently suited to a flat-file database solution,
but that the XML tool available to manipulate the data was not sufficiently optimized for
the type of retrieval required.

Conclusion

The best solution would be, of course, an optimization within eXist that would let users write concise and legible XQuery code (using
the long axes), which would then be executed efficiently through optimization behind the
scenes. This type of solution would remove the need both for more complex code (along
the lines of the sibling-axes approach described above) and for modifying the XML to write
information into the document in character form when that information is already
inherent in the document structure. Until such a solution became available, though, the
strategies described above provided a substantial improvement over explicit use of the
long axes, salvaging a project that would otherwise have been unusable for reasons of
efficiency.

Addendum

eXist is an open-source project, which means that
impatient users who require an optimization not already present in the code have the
opportunity to implement that optimization themselves and contribute it to the project.
Unfortunately, in the present case this particular impatient user lacked the Java
programming skills to undertake the task. Fortunately, however, the eXist development team is very responsive to feature requests
from users, and shortly after I wrote to the developers about the problem they released
an upgrade that implemented precisely the modification described above (consult the
predicate first and retrieve only the nodes that will be needed from the designated
axis). Rerunning the original code that relied on the long axes on the same machine as
the earlier tests but using eXist version
1.3.0dev-rev9622-20090802, which includes this new optimization, yielded times of 1.754,
1.778, 1.765, 1.944, 1.944, 1.777, 18.949, 1.838, 1.763, and 1.798 seconds. The mean
time for these tests was 3.531 seconds, and if we exclude the aberrant long time on the
seventh trial (an artifact of a system process that woke up at an inconvenient moment?),
the mean drops to 1.818 seconds. The 3.531-second figure is 14.455% of the best mean
time (24.427 seconds) achieved with my XPath-based optimizations and 0.559% of the mean
time of the long-axes search (631.150 seconds) before the introduction of the eXist-internal optimization. The 1.818-second figure is
7.443% of the best mean time (24.427 seconds) achieved with my XPath-based optimizations
and 0.288% of the mean time of the long-axes search (631.150 seconds) before the
introduction of the eXist-internal optimization.

The eXist optimization works by checking the static
return type of the predicate expression to determine whether it is a positional
predicate. (This paragraph reproduces more or less verbatim an explanation provided by
the eXist developers.) If the answer is yes and there
is no context dependency, the predicate will be evaluated in advance and the result will
be used to limit the range of the context selection (e.g.,
following::word). For example, $i/following::word[1] would
benefit from the optimization because the predicate's static return type marks it as
positional and it entails no context dependency. On the other hand,
$i/following::word[position() = 1] would not be optimized because it
introduces a context dependency insofar as position() returns the position
of the current context item and cannot be evaluated without looking at the context.
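The effect of this predicate-first strategy can be sketched in Python. This is a toy model of the idea only, not eXist's Java implementation; the word list and function names are invented for illustration:

```python
# Toy model of the eXist optimization described above: when a positional
# predicate has no context dependency, evaluate it first and stop the axis
# traversal as soon as enough nodes have been retrieved.
# The word list and function names are invented for illustration.
from itertools import islice

def following_axis(words, i):
    """Lazily yield the 'following' words for the word at index i."""
    yield from words[i + 1:]

def naive_nth_following(words, i, n):
    # Unoptimized: materialize the entire axis, then apply the predicate [n].
    axis = list(following_axis(words, i))
    return axis[n - 1] if n <= len(axis) else None

def optimized_nth_following(words, i, n):
    # Optimized: consult the predicate first and fetch only n nodes.
    return next(islice(following_axis(words, i), n - 1, n), None)

words = ["sing", "o", "muse", "of", "the", "wrath", "of", "achilles"]
assert naive_nth_following(words, 1, 1) == optimized_nth_following(words, 1, 1) == "muse"
assert optimized_nth_following(words, 1, 3) == "the"
```

Both functions return the same word; the optimized version simply never looks past the n-th following word, which is why an expression like $i/following::word[1] can stay cheap even in a large document.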
Furthermore, determining the static type is not always easy. In particular, the type
information is propagated through local variables declared in a let or
for clause, but it is lost across function calls. My original query,
for $j in (1 to 3) return $i/following::word[$j], works, but if
$j were a function parameter, it would not. Additionally, support for
this optimization with particular XPath functions is being introduced only
incrementally, to avoid breaking existing code. For example, the developers’ initial
attempt at an optimization failed with the reverse() function that I used
to retrieve the three preceding words in the correct order, although support for this
function was later added to the optimization.

The unsurprising technical conclusion, then, is that, at least in the present case,
optimization of the XPath code by the user to reduce the scope of a query can achieve
substantial improvement, but much more impressive results are obtained by optimizing the
Java code underlying the XPath interpreter. What this experiment also reveals, though,
is that, at least in the present case, the user was not reduced to waiting helplessly
for a resolution by the developers, and was able to achieve meaningful improvement in
those areas that he did control, viz., the XML, XPath, and XQuery.

In his concluding statement at the Balisage 2009 pre-conference Symposium on
Processing XML Efficiently, Michael Kay invoked David Wheeler’s advice that application
developers optimize the code that users actually write, that is, that they find out what
people are doing and make that go quickly. From an end-user perspective, though, the
lesson can be reversed: Find out what goes quickly and use it.
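As a check, the summary statistics reported in the addendum can be reproduced with a short script; the times and reference means are taken from the text above, and the variable names are mine:

```python
# Reproduce the addendum's summary statistics (all times in seconds).
times = [1.754, 1.778, 1.765, 1.944, 1.944, 1.777,
         18.949, 1.838, 1.763, 1.798]

mean_all = round(sum(times) / len(times), 3)            # mean of all ten trials
trimmed = [t for t in times if t != max(times)]         # drop the aberrant 18.949 s trial
mean_trimmed = round(sum(trimmed) / len(trimmed), 3)    # mean of the remaining nine

best_xpath_mean = 24.427   # best mean with the hand-tuned XPath (from the text)
long_axes_mean = 631.150   # mean of the unoptimized long-axes search (from the text)

print(mean_all)                                          # → 3.531
print(mean_trimmed)                                      # → 1.818
print(round(100 * mean_all / best_xpath_mean, 3))        # → 14.455
print(round(100 * mean_all / long_axes_mean, 3))         # → 0.559
print(round(100 * mean_trimmed / best_xpath_mean, 3))    # → 7.443
print(round(100 * mean_trimmed / long_axes_mean, 3))     # → 0.288
```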