Investigating the streamability of XProc pipelines

Norman Walsh

Abstract

XProc: An XML Pipeline Language is a specification being developed by the W3C for describing a sequence of XML operations performed over a set of input documents. Not all of the steps in XProc are known to streamable and consequently, the XProc specification does not require implementations to support streaming.

It's an open question whether or not a streaming implementation would be likely to achieve significant performance improvements over a similar non-streaming implementation. Using data collected from real-world pipelines, this paper examines that question.

Introduction

XProc is a language for expressing a pipeline of XML operations. One simple example of the sort of use cases that XProc is designed to meet is this pipeline:

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
  <p:xinclude/>
  <p:validate-with-xml-schema>
    <p:input port="schema">
      <p:document href="/uri/of/schema.xsd"/>
    </p:input>
  </p:validate-with-xml-schema>
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="/uri/of/stylesheet.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>

This pipeline performs XInclude processing, validates the expanded document, applies a stylesheet to it, and returns the result of the transformation.

XProc steps

The XProc specification defines 31 required and 10 optional steps, summarized in Table I.

Table I

The XProc Standard Step Library

p:add-attribute	p:http-request	p:set-attributes	p:wrap-sequence
p:add-xml-base	p:identity	p:sink	p:wrap
p:compare	p:insert	p:split-sequence	p:www-form-urldecode^*
p:count	p:label-elements	p:store	p:www-form-urlencode^*
p:delete	p:load	p:string-replace	p:xinclude
p:directory-list	p:make-absolute-uris	p:unescape-markup	p:xquery^*
p:error	p:namespace-rename	p:unwrap	p:xsl-formatter^*
p:escape-markup	p:pack	p:uuid^*	p:xslt
p:exec^*	p:parameters	p:validate-with-relax-ng^*
p:filter	p:rename	p:validate-with-schematron^*
p:hash^*	p:replace	p:validate-with-xml-schema^*
^* An optional step.

For more details about the nature of XProc pipelines, see XProc: An XML Pipeline Language.

Streamability

When XProc development first began, there were a lot of opinions about streamability. Some people suggested the constraint that all XProc steps must be streamable. Others suggested only that they should be streamable.

In the end, the working group determined that it would impose no streamability constraints on implementations. Instead it would attempt to define steps such that it was (at least usually) possible to stream them and leave the actual streamability as a quality of implementation issue.

There seems to be an informal consensus that a streaming implementation would be expected to outperform a similar non-streaming implementation. In the particular case of XProc, the author believes that this depends significantly on the extent to which real world pipelines are composed of streamable steps.

The standard XProc steps can be divided roughly into three categories: those for which a streaming can always be achieved (e.g., p:identity), those for which streaming can often be achieved (e.g., p:delete), and those for which no general, streaming implementation is known (e.g., p:xslt).

The distinction between the first and second catagories amounts to the question of XPath streamability. The p:delete step, for example, is streamable precisely to the extent that the XPath expression used to select the nodes to be deleted is streamable.

Taking an optimistic view, this paper assumes that if a step could be streamable then it is. This won't always be true, of course, but taking that view will establish an upper bound on the number of streamable pipelines.

With that in mind, we assume that the following steps can not be streamed: p:exec, p:http-request, p:validate-with-relaxng, p:validate-with-schematron, p:validate-with-xml-schema, p:xquery, and p:xslt. We assume that all other steps can be streamed.

XML Calabash

XML Calabash is the author's implementation of XProc. XML Calabash is written in Java and is built on top of the Saxon 9 APIs. The author's principle motivation for writing XML Calabash is to demonstrate a complete and correct implementation. Demonstrating at least two complete and correct implementations of every feature is a necessary prerequisite for getting XProc approved as a W3C Recommendation.

At the present time, XML Calabash (version 0.9.13) evaluates all pipelines in a linear, non-streaming, single-threaded fashion. Future versions of XML Calabash may provide streaming and multi-threading, though high performance is not a stated goal at this time.

Collecting data about real world pipelines

XML Calabash has a feature for collecting information about actual pipelines that are being used. By default, XML Calabash sends a simple report about each pipeline that it runs back to a central server. This report includes some basic information about the pipeline and a list of each step that was run and the order in which they were run.

In the interests of privacy, XML Calabash never collects any information about the documents that are processed by the pipeline. Users who feel uncomfortable about this data collection service can disable it completely.

Data collected

This paper reports on the data collected by XML Calabash between 21 Dec 2008 and 11 Jul 2009. During that period, data was collected on more than 294,000 pipeline runs.

Ignoring the outliners, this data shows reasonably steady growth in XML Calabash usage.

Data aggregation

Figure 3 presents a summary of the raw data, showing for each step how many times it was used in the data collected.

Not shown in that table are the errors: 396 pipelines failed with static errors, 7778 failed with dynamic errors.

The whole collection of pipeline reports includes many runs of the same pipeline. So while there are almost 300,000 runs over six months, there are no where near that many distinct pipelines. The reports have a very consistent structure. This consistency allows us to easily identify duplicate runs.

What is a duplicate run? Two runs are considered to be the same if they come from the same IP address and perform the same steps in the same order at runtime. Note that this does not mean that the source pipelines were necessarily the same, though it's certainly likely that most duplicate runs are from the same pipeline. Note also that the same pipeline might produce runs of different lengths when presented with different data.

Our analysis reveals that there are 3,994 distinct runs in the data collected. Many pipelines run only a single step, but one pipeline ran 4,051 steps. (For an average of 73.8 steps/pipeline.)

Figure Figure 4 plots each unique pipeline against the length of its run, sorted by length of run. Plots of different ranges of this graph show a very similar structure.

It's an open question whether analysis of the unique runs is more or less valuable than analysis of all the runs. Using only the unique runs biases against pipelines that were run many times, for which performance could be an important factor. Using all the data baises against distinct runs that occurred only once.

Since the focus of this paper is on the characteristics of XProc in general, and not the efficiencies of individual pipelines, the balance of this paper consideres only the set of unique pipeline runs.

Data analysis

Now let us consider the streamability of a pipeline. At one extreme, if a pipeline consists entirely of steps which cannot be streamed, then streaming offers no advantage for that pipeline. At the other extreme, if a pipeline consists entirely of streamable steps, then streaming would offer the most possible benefit for that pipeline. In the middle are those pipelines that contain a mixture of streamable and non-streamable steps.

For a collection of pipeline runs that are “n” steps long, we can compute three values: the length of the longest streamable subpipeline, the length of the shortest streamable subpipeline, and the corresponding average length. From this data, we obtain Figure 5.

(A small number of runs consisting of thousands of steps significantly reduced the readability of the graph; they have been elided.)

Figure 6 shows a magnified version of the first portion of this graph.

With respect to our earlier simlifying assumption that all steps that can be streamed are streamed, a casual inspection of the actual match patterns used revealed that to be true at least 80% of the time.

Conclusions

The preliminary analysis performed when this paper was proposed suggested that less than half “real world” pipelines would benefit from a streaming implementation.

The data above seems to indicate that the benefits may be considerably larger than that. Although it is clear that there are piplines for which streaming wouldn't offer significant advantages, it's equally clear that for essentially any set of pipelines of a given length, there are pipelines which would be almost entirely streamble.

Perhaps the most interesting aspect of this analysis is the fact that as pipeline runs grow longer, they appear to become more and more amenable to streaming. That is to say, it appears that a pipeline that runs to 300 steps is, on average, more likely to benefit from streaming than one that's only 100 steps long.

We have not yet had a chance to investigate why this is the case.

Caveats

In the interests of openness, it should be noted that there are several weaknesses in the data.

The data is collected on an “opt-out” basis. That makes it very difficult to publish the raw data. Without explicit consent from users, I'm reluctant. On the other hand, it's likely that much less data would be available if an explicit, “opt-in” approach had been taken.
The XProc specification is currently a Candidate Recommendation. Today's users represent the very leading edge of adoption. It's not obvious that the usage patterns apparent in this data will accurately reflect the usage that occurs when XProc is widely adopted.

Norman Walsh

Principal Technologist

Mark Logic Corporation

Norman Walsh is a Principal Technologist in the Information & Media group at Mark Logic Corporation where he assists in the design and deployment of advanced content applications. Norm is also an active participant in a number of standards efforts worldwide: he is chair of the XML Processing Model Working Group at the W3C where he is also co-chair of the XML Core Working Group. At OASIS, he is chair of the DocBook Technical Committee.

Before joining Mark Logic, he participated in XML-related projects and standards efforts at Sun Microsystems. With more than a decade of industry experience, Mr. Walsh is well known for his work on DocBook and a wide range of open source projets. He is the principle author of DocBook: The Definitive Guide

BalisageThe Markup Conference

Balisage Paper: Investigating the streamability of XProc pipelines