The corpus

Alain de Lille’s (Alanus ab Insulis) allegorical-philosophical epic, the Anticlaudianus, is a poem divided into nine books with a brief verse prologue and an equally brief prose pre-prologue. The text was published by Robert Bossuat in 1955 (Anticlaudianus / Alain de Lille: texte critique avec une introduction et des tables publié par R. Bossuat [Paris : J. Vrin, 1955]), cleaned up in 2009 by Danuta Shanzer (University of Illinois at Urbana-Champaign, shanzer@illinois.edu) as a Dumbarton Oaks Medieval Library Latin Series Work In Progress, and converted to XML and published as a queryable concordance by David J. Birnbaum (University of Pittsburgh, djbpitt@pitt.edu). When this report was last edited in August 2009, the concordance was freely accessible at http://clover.slavic.pitt.edu:8081/exist/acl/search.html; it will eventually move to a different stable, freely accessible address, which has not yet been determined, but which should be discoverable through search engines.

The corpus consists of a single XML file subdivided into <book> elements (prose introduction, verse introduction, nine principal books). Each <book> is divided into <line> elements (4344 <line> elements in the nine principal <book> elements, for an average of 482.7 <line> elements per <book> element), and each <line> is divided into <word> elements (27222 <word> elements in the nine principal books, or approximately 6.27 <word> elements per <line> element; the <word> counts per <line> range from a low of 4 to a high of 10). For example, Book 2 begins:

<book n="2">
  <line>
    <word>Regia</word>
    <word>tota</word>
    <word>silet;</word>
    <word>expirat</word>
    <word>murmur</word>
    <word>in</word>
    <word>altum,</word>
  </line>
  <line>
    <word>cum</word>
    <word>visu</word>
    <word>placidos</word>
    <word>delegat</word>
    <word>curia</word>
    <word>vultus,</word>
  </line>
  <!-- more lines -->
</book>

Because scholars are likely to be more interested in the contents of the nine principal books than in the contents of the prose and verse introductions, the counts and calculations below omit the introductions. The introductions are nonetheless relevant in certain situations, such as when a query consults all <word> elements in the document, including those in the introductions. The prose introduction contains 477 <word> elements and the verse introduction 50.
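For example, a whole-document count along the following lines (the database path is the one used in the retrieval examples later in this report) would return 27749, that is, 27222 + 477 + 50:

count(doc("/db/acl/acl.xml")//word)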

The task

The goal of the electronic concordance project is to enable users to search for words and generate a keyword-in-context (KWIC) report on the fly. The system was originally implemented using the eXist XML database (http://www.exist-db.org), version 1.3.0dev-rev:8710-20090308. Version 1.3 of eXist, still in beta at the time this report was last revised, is the first to introduce a Lucene-based index, which is intended eventually to replace the original proprietary full-text index, and the present concordance is indexed and accessed using the Lucene index. When the user enters a query string and launches a search, the system retrieves all hits and returns them with several preceding and following words (three each by default, but the user can adjust this number). Line breaks in the original are rendered as slashes in the KWIC output (this part of the code has been omitted from most of this report in the interest of legibility, except when it is the object of optimization). For example, a search for pugna (Latin for ‘fight’), which occurs four times in the corpus, returns:

  • Acl.8.254: Martis amore / succensi, pugna cupiunt incidere vitam. /

  • Acl.9.283: Luxum / Sobrietas, sed pugna favet Virtutibus, harum /

  • Acl.9.331: fraudesque recurrit. / degeneri pugna, servili Marte, dolosa /

  • Acl.9.384: indignata sub umbras. / Pugna cadit, cedit iuveni

The XPath preceding and following axes are ideally suited to this type of project, since they ignore the <line> element boundaries and treat adjacent <word> elements identically irrespective of whether they fall in the same <line> element as the target word or in preceding or following <line> elements. For example, the system can retrieve the three <word> elements that precede the target <word> element with the following XQuery (assume that $i represents the target <word> element):

for $j in reverse(1 to 3) return $i/preceding::word[$j]

This query returns the third, second, and first <word> elements before the target <word> element, in the specified order. An analogous statement can retrieve the three <word> elements that follow the hit, thus providing the rest of the context. Because queries along the long (preceding and following) axes make no distinction between preceding and following <word> elements within the same <line> element and those that require crossing a <line> element boundary, the resulting XQuery code is lucid and clean: easy to read, write, and maintain.
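For concreteness, the corresponding statement for the following context (quoted again in the Addendum, below) is:

for $j in (1 to 3) return $i/following::word[$j]

No reverse() is needed here because the document order of the nodes on the following axis already matches the desired output order.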

The problem

The preceding strategy retrieves the correct results, and does so with elegant code, but initially it proved unusable in practice for reasons of efficiency, even with appropriately configured eXist Lucene and range indexes (“Configuring,” “Lucene,” “Tuning”). Until a recent revision in the eXist code base (see the Addendum, below), XPath expressions that addressed the long axes were inefficient because eXist retrieved the entire set of nodes on the specified axis before looking at the predicate. For example, in the worst case a hit would fall at the beginning of the first <line> in the first <book> element, which meant that in order to find the three immediately following <word> elements by looking on the following axis eXist would first retrieve as many as 27221 following <word> elements and only then apply a numerical predicate to filter the returned result. Since the context includes both preceding and following <word> elements (that is, it requires accessing the preceding axis in the former case and the following axis in the latter), hits that have fewer preceding <word> elements have more following ones, and vice versa, which means that every hit requires sets of retrievals that span the entire document. The overhead was not a significant problem with queries that retrieved a mere handful of hits, but those that retrieved as few as two hundred hits could take several minutes to return, and sometimes they failed to return at all because they generated Java heap overflow errors (which could have been avoided by increasing the heap size, but that would not have addressed the underlying efficiency problem).

This problem is not a unique or inherent property of the long axes. Rather, it is a property of the number of nodes on which the predicate operates. For example, in a flattened tree (imagine transforming the document to one where all <word> elements are directly under the root and line and book boundaries are encoded as empty milestone tags), a hit at the beginning of the document that queried the following-sibling axis, rather than the following axis, would nonetheless retrieve tens of thousands of unwanted <word> elements before applying the predicates to select the mere three that were actually needed. For this reason, although the problem may initially appear to be an overlap issue, since in its original form it crosses the boundaries of <line> elements, the flattening thought experiment reveals that it is actually an optimization problem that is independent of both the specific axes involved and the depth of the nesting. If, for example, eXist were to look first at the predicate and then access only the necessary elements on the specified axis, instead of first retrieving all elements on that axis and only then consulting the predicate and discarding the unwanted ones (in the present case, all but one per retrieval), processing would not be suffocated by unnecessary retrieval irrespective of whether the query needed to cross a <line> element boundary.
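For concreteness, the flattened tree invoked in this thought experiment might begin as follows (the root element and the names of the milestone tags are illustrative assumptions, not part of the actual corpus):

<poem>
  <bookBoundary n="2"/>
  <lineBoundary/>
  <word>Regia</word>
  <word>tota</word>
  <word>silet;</word>
  <!-- remaining words of the line -->
  <lineBoundary/>
  <word>cum</word>
  <!-- more words and milestones -->
</poem>

In such a document every <word> element is a sibling of every other <word> element, so a query along the following-sibling axis from a <word> near the beginning would traverse nearly the entire corpus.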

As is discussed in the Addendum, below, eXist has since implemented this type of optimization, but the problem nonetheless continues to merit consideration. Among other things, the issue was never about eXist, since were that the case, an obvious solution would have been to use an alternative platform that provided the necessary optimization. Instead, the problem provides an opportunity to reflect more generally on the nature of XPath expressions and the relationship between XPath and XML. For example, although the inefficiency described above is independent of the specific axis involved insofar as it could also have arisen with the sibling axes in a flattened tree, it nonetheless does depend on the XML structure, or, more precisely, on the query's indifference to that structure. What the long axes in the actual problem and the sibling axes in the flattened tree alternative have in common is that they operate independently of the tree structure. By contrast, the sibling axes in the original tree constrain the number of nodes involved because the tree is balanced in a way that ensures that no <word> element will have more than nine sibling <word> elements. What the long axes in that same tree and the sibling axes in the hypothetical flattened tree alternative have in common, on the other hand, is that all <word> elements are treated as though they are on the same level of the tree (in the former case because the long axes ignore the tree and in the latter case because the tree is flattened, and therefore irrelevant). This suggests that unless one can be certain that the software that will evaluate one’s XPath expressions will optimize one’s query, the designer should take into consideration the number of nodes that will be addressed by those expressions.

An XPath solution

Since eXist was unable to optimize the queries in question, that duty fell on the user, who, in this case, first adopted a strategy that avoided the long axes, favoring instead the sibling axes at all levels (<word> and <line>). This constrains searches for <word> elements within a <line> element to an average of 2.64 (5.27 / 2) sibling elements (the average <line> contains 6.27 <word> elements, one of which is the target itself) and searches for <line> elements within a <book> element to an average of 240.85 (481.7 / 2) elements, and in the worst case to 5.27 and 481.7, respectively. These numbers should be multiplied by six for a typical query, which retrieves three preceding and three following <word> elements, but they still compare very favorably to queries along the long axes, which consult 13610.5 (27221 / 2) <word> elements on average and 27221 in the worst case. A prose explanation of the strategy for retrieving the three <word> elements that follow the target using only the following-sibling axis, rather than the following axis, is:

  1. If there is a <word> element in the appropriate position on the following-sibling axis (that is, in the same <line> element), retrieve it.

  2. If not, navigate along the parent axis to the containing <line> element, get its first following-sibling <line> element, and retrieve the appropriate <word> child element from within that <line>.

The following XQuery snippet retrieves the three <word> elements that follow the target. Assume that $i refers to the target (here and in subsequent examples):

if (count($i/following-sibling::word) ge 3)
then for $j in (1 to 3) return $i/following-sibling::word[$j]
else
  if (count($i/following-sibling::word) eq 2)
  then (for $j in (1 to 2) return $i/following-sibling::word[$j], 
    $i/parent::line/following-sibling::line[1]/word[1])
  else
    if (count($i/following-sibling::word) eq 1)
    then ($i/following-sibling::word[1], for $j in (1 to 2) 
      return ($i/parent::line/following-sibling::line[1]/word[$j]))
    else
      for $j in (1 to 3) return $i/parent::line/following-sibling::line[1]/word[$j]

An analogous strategy can retrieve the three <word> elements that immediately precede the target.

It is possible to generalize this solution to allow the user to specify at run time the number of preceding or following <word> elements to provide as context along the following lines (assume that $scope specifies the number of words of context to provide after the target <word>):

if (count($i/following-sibling::word) ge $scope) 
then for $j in (1 to $scope) return $i/following-sibling::word[$j]
else for $k in (0 to ($scope - 1)) return
  if (count($i/following-sibling::word) eq $k) 
  then (
    for $j in (1 to $k) return $i/following-sibling::word[$j],
    for $j in (1 to ($scope - $k)) return $i/parent::line/following-sibling::line[1]/word[$j]
  )
  else () (: the empty sequence, so that non-matching values of $k contribute nothing to the result :)

An analogous strategy can retrieve a user-specified number of <word> elements that immediately precede the target. This generalization, however, is fragile because it looks only at the <line> element that contains the target <word> element plus the immediately preceding or following sibling <line> element. This means that, for example, if the user wants to include a context of five following <word> elements, the code will fail when the target <word> falls at the end of a <line> and the following (sibling) <line> contains only four <word> elements. It might be possible to circumvent this limitation through a recursive approach, but by that point the code would become so convoluted as to be difficult to understand and impractical to maintain.
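For the record, such a recursive approach might look something like the following sketch (a hypothetical local function, not part of the project code), which spills across as many <line> elements as necessary but, as written, stops at the end of a <book>:

declare function local:following-words($i as element(word), $n as xs:integer)
    as element(word)*
{
  let $sibs := $i/following-sibling::word
  return
    if (count($sibs) ge $n) then
      (: enough <word> elements remain in the current <line> :)
      $sibs[position() le $n]
    else (
      (: take what the current <line> offers ... :)
      $sibs,
      (: ... then continue from the first <word> of the next <line>, if any :)
      let $next := $i/parent::line/following-sibling::line[1]/word[1]
      return
        if (empty($next)) then ()
        else ($next, local:following-words($next, $n - count($sibs) - 1))
    )
};

A call such as local:following-words($i, 5) would then return five <word> elements of following context regardless of how they are distributed across <line> elements, but the point about maintainability stands: a preceding-context twin of this function and the slash-writing logic would all have to be maintained in parallel.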

A range index solution

In addition to the new Lucene-based full-text index (and the original full-text index, which is still available), eXist also supports range indexes, which provide very fast access to element nodes on the descendant-or-self (//) or child (/) axis and attribute nodes on the attribute (/@) axis according to their typed data value. This suggests an alternative approach:

  1. Before storing the XML source document, create an @offset attribute for each <word> element and assign it a unique sequential integer value. The first <word> element in the document has an @offset value of 1, the next has a value of 2, etc. (A sketch of a preprocessing query that can generate these values follows this list.)

  2. Retrieve the target <word> element by using the new Lucene full-text index.

  3. Retrieve the adjacent context <word> elements according to their @offset attribute values by counting backward or forward from the @offset value of the target <word> element.
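As an illustration of the first step, the @offset attributes might be generated by a one-time preprocessing query along the following lines (the source file name and the root <poem> element are assumptions, and the handling of the introductory <book> elements' attributes is simplified). Note, with some irony, that the long preceding axis is perfectly serviceable here, where the query runs once, offline, rather than at retrieval time:

let $doc := doc("acl-source.xml")
return
  <poem>{
    for $book in $doc//book
    return
      <book n="{$book/@n}">{
        for $line in $book/line
        return
          <line>{
            for $w in $line/word
            return
              (: the offset is the word's 1-based position in document order :)
              <word offset="{count($w/preceding::word) + 1}">{string($w)}</word>
          }</line>
      }</book>
  }</poem>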

This approach is available only where the designer has control over the XML source and is able to incorporate a specific @offset attribute that is to be used only for navigation during retrieval. It has at least two weaknesses, one aesthetic and one technical:

  • The aesthetic problem is that the ordinal position of each <word> element within the document is an inherent property of the document structure, and the user should not have to write into the document, as character-based markup, a value that the markup structure already encodes implicitly but consistently and unambiguously.

  • The technical problem is that the insertion or deletion of a word in the document will break the numbering, requiring that the @offset values be calculated anew and rewritten. In the present case the document is relatively stable (that is, more stable than, for example, an on-line commerce site that writes new data for every transaction), but the editor may still choose to modify her reading at some point as she reconsiders the available evidence, or she may need to integrate data from newly discovered manuscript witnesses.

In practice, neither of these weaknesses imposes a serious inconvenience in the context of the present project, and the opportunity to use the eXist range index feature provides a substantial improvement in response time over alternative strategies (see the time test data below).

With the modification to the XML source described above, the first two lines of Book 2 now look like:

<book n="2">
  <line>
    <word offset="3708">Regia</word>
    <word offset="3709">tota</word>
    <word offset="3710">silet;</word>
    <word offset="3711">expirat</word>
    <word offset="3712">murmur</word>
    <word offset="3713">in</word>
    <word offset="3714">altum,</word>
  </line>
  <line>
    <word offset="3715">cum</word>
    <word offset="3716">visu</word>
    <word offset="3717">placidos</word>
    <word offset="3718">delegat</word>
    <word offset="3719">curia</word>
    <word offset="3720">vultus,</word>
  </line>
  <!-- more lines -->
</book>

The following code will now use a properly configured eXist range index to retrieve the three <word> elements following the target <word> element (assume $i is the target <word> element):

let $offset := $i/@offset
return
for $j in (1 to 3) return doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]

An analogous strategy can retrieve the three <word> elements that immediately precede the target. Furthermore, this solution is easily generalized to allow the user to specify the number of preceding or following context <word> elements at run time (assume that $scope specifies the number of words of context to provide after the target <word>):

let $offset := $i/@offset
return
for $j in (1 to $scope) return doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]
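Both of the queries above depend on a range index over @offset. For reference, such an index is declared in the collection.xconf configuration file for the collection that holds the document; a minimal configuration along the following lines should suffice (the Lucene index on <word> is included on the assumption that the full-text queries target that element):

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <!-- Lucene full-text index used to find the target words -->
    <lucene>
      <text qname="word"/>
    </lucene>
    <!-- typed range index on @offset, used to find the context words -->
    <create qname="@offset" type="xs:integer"/>
  </index>
</collection>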

The strategy of writing structural information about the XML (such as offset position) into attribute values of the XML source itself in the interest of improved execution time can also be applied to writing slashes to mark the ends of lines. The most natural XPath way to write a slash after the target <word> element when it ends a <line> element in the XML source is:

if (not($i/following-sibling::word)) then " / " else ""

A similar approach can be used to write slashes after the leading and trailing context <word> elements when they end a <line> element. This method is not maddeningly slow, because the number of nodes on the sibling axes is typically small, but a further improvement in processing speed is available by modifying the XML source to include the string " / " (without the quotation marks) as an attribute value on <word> elements that fall at the end of <line> elements, and then retrieving that value when generating the report instead of consulting the sibling axis. The first two lines of Book 2 now look like:

<book n="2">
  <line>
    <word offset="3708">Regia</word>
    <word offset="3709">tota</word>
    <word offset="3710">silet;</word>
    <word offset="3711">expirat</word>
    <word offset="3712">murmur</word>
    <word offset="3713">in</word>
    <word last=" / " offset="3714">altum,</word>
  </line>
  <line>
    <word offset="3715">cum</word>
    <word offset="3716">visu</word>
    <word offset="3717">placidos</word>
    <word offset="3718">delegat</word>
    <word offset="3719">curia</word>
    <word last=" / " offset="3720">vultus,</word>
  </line>
  <!-- more lines -->
</book>
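Like @offset, the @last attribute can be generated mechanically. In the hypothetical preprocessing query sketched earlier, the <word> constructor might be extended as follows, reusing the same end-of-line test as the slash-writing code above:

for $w in $line/word
return
  <word offset="{count($w/preceding::word) + 1}">{
    (: attribute nodes must precede the other content of the constructor :)
    if (not($w/following-sibling::word)) then attribute last { " / " } else (),
    string($w)
  }</word>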

The XQuery code to write trailing context <word> elements is then:

let $offset := $i/@offset
for $j in (1 to 3) return (
  " ", 
  data(doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]),
  data(doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]/@last)
)

This approach writes a space after the target <word> element, followed by the appropriate trailing context <word> element, followed by the value of the @last attribute of that <word> element. If there is a @last attribute, it has the value " / " (without the quotation marks), which is what we want to write. If the attribute is missing, the data() call returns the empty sequence, producing no output.

Time test results

To test the relative efficiency of the various coding strategies described above, the same query was executed ten times with each of four strategies. The query was a search for the word sed (Latin for ‘but’), which occurs 221 times in the corpus, and the XQuery scripts were all written to return it along with its location (book and line number) and with three context words on either side. The search strategies were:

  • Long axes: Search for context words using the preceding and following axes. Place slashes at the ends of lines by checking whether each output word has a following sibling <word> element in the same <line>.

  • Sibling axes: Search for context words using the sibling axes. If there are not enough context <word> elements in the same <line>, find the nearest sibling <line> and navigate to the desired <word> element along the child axis. Place slashes as described above.

  • @offset: Modify the XML to add an @offset attribute to every <word> element and find the context words by counting down or up from the @offset attribute value for the target <word>. Place slashes as described above.

  • @last: Same as @offset, except add a @last attribute in the XML to every <word> element that is the last in its parent <line>, and place slashes at the ends of lines by returning the value of that attribute.

The tests were conducted on a Gateway 3 GHz Pentium D with 1 GB of memory, running Microsoft Windows Vista with Service Pack 1. The Java version is 1.6.0_13 and the eXist version is 1.3.0dev-rev:0000-20090528.

Table I

Test no.   Long axes   Sibling axes   @offset   @last
1          447.750     38.477         34.345    24.413
2          440.265     38.390         34.446    24.416
3          559.141     38.710         34.438    24.445
4          905.472     38.484         35.409    24.384
5          702.915     38.590         34.562    24.447
6          530.739     38.341         34.798    24.424
7          851.608     38.714         34.145    24.415
8          473.601     38.496         34.410    24.395
9          670.772     39.521         34.423    24.463
10         473.601     38.317         34.429    24.469
Mean       631.150     38.604         34.541    24.427
Ratio 1    25.838      1.580          1.414     1.000
Ratio 2    18.273      1.118          1.000
Ratio 3    16.349      1.000

Times are reported in seconds. The three ratio lines in the table set the time for one of the tests at a value of 1 and then calculate the amount of time the other implementations required in proportion to it.

The results show that querying along the long axes took more than 16 times as much time as querying along the sibling axes. Using the @offset attribute value instead of the sibling axes saved an additional 11% in time, and using the @last attribute value as well reduced the time by a further 29% (that is, the @offset implementation took 41% longer than the @last implementation). All told, the implementation that relies on the long axes took more than 25 times as much time as the one with the greatest optimization.

Is it XML?

The XML version of the poem has an inherent hierarchy (the poem contains books, which contain lines, which contain words) and inherent order (the words occur in a particular order, as do the lines and books). Those inherent features are encoded naturally in the structure of the XML document because XML documents are obligatorily hierarchical (even though in some projects the hierarchy may be flat) and ordered (even though in some projects the user may ignore the order). The addition of @offset and @last attributes and the adoption of a strategy that treats the document as flat and never looks at the hierarchy essentially transforms the approach from one that is based on natural properties of XML documents to one that is based on a flat-file database way of thinking. That is, we could map each <word> element in the XML version to a record in a database table, the fields of which would be the textual representation of the word (a character string), the offset value (a unique positive integer), an indication of whether the word falls at the end of a line (a boolean value), and the book and line number (a string value, which is used in reporting). Records in a database do not have an inherent order, but once we rely on the value of the @offset attribute in the XML document, the <word> elements might as well be sprinkled through the document in any order, and the <line> and <book> elements play no role at all in the system. That is, except for the book and line number, the most highly optimized (and most efficient) implementation above adopts precisely a flat-file database approach, which raises the question of whether this project should have been undertaken in XML in the first place.

The answer to that rhetorical question is that of course it should have been undertaken in XML because the order and hierarchy are meaningful. They are inherent in the XML structure but must be written explicitly into a corresponding database implementation, which indicates that this is data that wants, as it were, to be regarded as an ordered and hierarchical XML document. The problem is not that the data is inherently tabular, and therefore inherently suited to a flat-file database solution, but that the XML tool available to manipulate the data was not sufficiently optimized for the type of retrieval required.

Conclusion

The best solution would be, of course, an optimization within eXist that would let users write concise and legible XQuery code (using the long axes), which would then be executed efficiently through optimization behind the scenes. This type of solution would remove the need both for more complex code (along the lines of the sibling-axes approach described above) and for writing into the document, in character form, information that is already inherent in the document structure. Until such a solution became available, though, the strategies described above provided a substantial improvement over explicit use of the long axes, salvaging a project that would otherwise have been unusable for reasons of efficiency.

Addendum

eXist is an open-source project, which means that impatient users who require an optimization not already present in the code have the opportunity to implement that optimization themselves and contribute it to the project. Unfortunately, in the present case this particular impatient user lacked the Java programming skills to undertake the task. Fortunately, however, the eXist development team is very responsive to feature requests from users, and shortly after I wrote to the developers about the problem they released an upgrade that implemented precisely the modification described above (consult the predicate first and retrieve only the nodes that will be needed from the designated axis). Rerunning the original code that relied on the long axes on the same machine as the earlier tests but using eXist version 1.3.0dev-rev9622-20090802, which includes this new optimization, yielded times of 1.754, 1.778, 1.765, 1.944, 1.944, 1.777, 18.949, 1.838, 1.763, and 1.798 seconds. The mean time for these tests was 3.531 seconds, and if we exclude the aberrant long time on the seventh trial (an artifact of a system process that woke up at an inconvenient moment?), the mean drops to 1.818 seconds. The 3.531-second figure is 14.455% of the best mean time (24.427 seconds) achieved with my XPath-based optimizations and 0.559% of the mean time of the long-axes search (631.150 seconds) before the introduction of the eXist-internal optimization. The 1.818-second figure is 7.443% of the best mean time and 0.288% of the long-axes mean time.

The eXist optimization works by checking the static return type of the predicate expression to determine whether the predicate is positional. (This paragraph reproduces more or less verbatim an explanation provided by the eXist developers.) If the answer is yes and there is no context dependency, the predicate will be evaluated in advance and the result will be used to limit the range of the context selection (e.g., following::word). For example, $i/following::word[1] would benefit from the optimization because the static return type of the predicate is numeric, which marks it as positional, and it entails no context dependency. On the other hand, $i/following::word[position() = 1] would not be optimized because it introduces a context dependency insofar as position() returns the position of the current context item and cannot be evaluated without looking at the context. Furthermore, determining the static type is not always easy. In particular, the type information is passed along through local variables declared in a let or for, but it gets lost through function calls. My original query, for $j in (1 to 3) return $i/following::word[$j], is optimized, but if $j were a function parameter, it would not be. Additionally, support for this optimization with particular XPath functions is being introduced only incrementally, to avoid breaking existing code. For example, the developers’ initial attempt at an optimization failed with the reverse() function that I used to retrieve the three preceding words in the correct order, although support for this function was later added to the optimization.
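To restate the contrast in code form (the local function is hypothetical, constructed only to illustrate how the static type information disappears at a function boundary):

(: optimized: $j carries a statically known numeric type :)
for $j in (1 to 3) return $i/following::word[$j]

(: not optimized: position() makes the predicate context-dependent :)
$i/following::word[position() = 1]

(: not optimized: the static type of $j is lost across the function call :)
declare function local:context($i as element(word), $j) {
  $i/following::word[$j]
};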

The unsurprising technical conclusion, then, is that, at least in the present case, optimization of the XPath code by the user to reduce the scope of a query can achieve substantial improvement, but much more impressive results are obtained by optimizing the Java code underlying the XPath interpreter. What this experiment also reveals, though, is that, at least in the present case, the user was not reduced to waiting helplessly for a resolution by the developers, and was able to achieve meaningful improvement in those areas that he did control, viz., the XML, XPath, and XQuery.

In his concluding statement at the Balisage 2009 pre-conference Symposium on Processing XML Efficiently, Michael Kay invoked David Wheeler’s advice that application developers optimize the code that users actually write, that is, that they find out what people are doing and make that go quickly. From an end-user perspective, though, the lesson can be reversed: Find out what goes quickly and use it.

References

[Configuring] “Configuring Database Indexes.” (Part of the eXist documentation.) http://www.exist-db.org/indexing.html. Accessed 2009-05-31.

[Lucene] “Lucene-based Full Text Index.” (Part of the eXist documentation.) http://www.exist-db.org/lucene.html. Accessed 2009-05-31.

[Tuning] “Tuning the Database.” (Part of the eXist documentation.) http://exist.sourceforge.net/tuning.html. Accessed 2009-05-31.

Author's keywords for this paper:
XPath; XQuery; eXist; optimization; efficiency; indexing

David J. Birnbaum

Professor and Chair

Department of Slavic Languages and Literatures, University of Pittsburgh

David J. Birnbaum is Professor and Chair of the Department of Slavic Languages and Literatures at the University of Pittsburgh. He has been involved in the study of electronic text technology since the mid-1980s, has delivered presentations at a variety of electronic text technology conferences, and has served on the board of the Association for Computers and the Humanities, the editorial board of Markup Languages: Theory and Practice, and the Text Encoding Initiative Council. Much of his electronic text work intersects with his research in medieval Slavic manuscript studies, but he also often writes about issues in the philosophy of markup.