Robinson, James A. “Performance of XML-based applications: a case-study.” Presented at International Symposium on Processing XML Efficiently: Overcoming Limits on Space,
Time, or Bandwidth, Montréal, Canada, August 10, 2009. In Proceedings of the International Symposium on Processing XML Efficiently: Overcoming
Limits on Space, Time, or Bandwidth. Balisage Series on Markup Technologies, vol. 4 (2009). https://doi.org/10.4242/BalisageVol4.Robinson01.
Balisage Paper: Performance of XML-based applications: a case-study
HighWire Press is the online publishing operation of the Stanford University Libraries,
and currently hosts online journals for over 140 separate publishers. HighWire has developed
and deployed a new XML-based publishing platform, codenamed H2O, and is in the process of
migrating all of its publishers to this new platform.
This paper describes four XML-based systems developed for our new H2O platform, and
describes some of the performance characteristics of each. We describe some limitations
encountered with these systems, and conclude with thoughts about our experience migrating to
an XML-based platform.
In late 2006 HighWire had started internal discussions over whether or not we needed to
implement a radical overhaul of our publishing, parsing and content delivery system. We wanted
the system to be much more flexible when it came to incorporating new data, sharing data
between systems, and delivering new features.
At that time our system, which had been built up over the past decade, followed a
traditional model consisting of a display layer, a business logic layer, and a
metadata/storage layer. Specifically, our original system could be described as a combination of:
Perl and Nsgmls based tools to process data supplied by file providers.
NFS servers to hold derivatives (e.g., SHTML or similar files).
Relational database servers, accessed via SQL, to hold metadata.
Java Mediators talking to the NFS and database servers.
Java Servlets with a custom templating language similar to JSP, named DTL, to build
pages for the browser.
While this system has served us well, and continues to do so, there were some basic
problems we were finding difficult to overcome:
The translation from the relational database model into Java objects, and then into
DTL objects for use in the final display layer, often forced the writing of new
application features to become a senior developer task. In order to handle new metadata,
the developer had to determine whether or not the existing relational tables were
enough for the new data, or whether new tables were needed. Next would be the job of
extending, or creating, appropriate stored procedures to access the new metadata. Finally,
work would be needed in the Java layer to add object mapping support, including routines
to map the metadata into DTL.
Beyond the problem of mapping new metadata from the relational layer of the system to
the DTL layer, the introduction of completely new models was daunting. The original
database had been built to support journals whose primary components were modeled as
issues with articles, and articles with figures and tables. The original design of the
system had intended that we support new models by creating new relational tables, or
entire databases, and then building new Java mediators to handle the translation from the
database to the display layer. Unfortunately, the reality was that it was more difficult
than we would have liked to support customers who wanted different, non-traditional models.
The original system hadn't been built with either XML or Unicode in mind. Much of the
core system had been developed in the late 1990s, around the same time XML 1.0 was
published, and before it was widely adopted. By the same token, the DTL system was
developed before Unicode was supported in mainstream browsers. These shortcomings meant it
was very difficult for us to properly ingest XML and produce valid XHTML with Unicode
support on the display side. Attempting to add Unicode support alone was a daunting task,
as it required careful vetting of all code which worked at the character level, beginning
with the Perl system, moving through to the database systems and filesystems, and finally
in the Java layers.
What we decided to build:
An end-to-end XML-based system. We would accept incoming XML, transform it into
different XML representations, store it as XML, query it as XML, and generate XML
We would encode certain types of oft-used relationship data up front, trying as much
as possible to compute it only once.
We decided to build everything following a RESTful [Fielding2000] model,
with the idea that using a simple set of operators (GET/HEAD, POST, PUT, and DELETE),
using unique URIs (vs. SOAP documents submitted to a single URI shared by all resources),
and embedding hyperlink metadata into our documents would make it easier to spin off
After about six months of discussions and prototyping, we had an outline of what we would
be building and which new software technologies we would be using. By January of 2007 we had
built a demonstration system which made use of XSLT and XQuery to transform incoming
metadata and XHTML, and to deliver dynamically built XHTML pages.
After about fifteen months of work following this prototyping, HighWire had a beta site
operating, and was ready to announce its new platform, dubbed H2O. In the first week
of 2008 we launched our first migrated site, Proceedings of the National Academy of Sciences
of the United States of America (PNAS). Since that time we've launched 57 additional sites,
consisting of a mixture of new launches and migrations.
There are three primary tiers of XML-based technology in the H2O system:
Firenze, a HighWire-developed XSLT 2.0 pipeline execution system used to build both
front-end sites and back-end data services.
Schema, Addressing, and Storage System (SASS), a data store implementing an internally
developed protocol, the HighWire Publishing Protocol (HPP) built on top of the Atom
Publishing Protocol (APP) [AtomPub2007] and Atom Syndication Format (ASF)
[Atom2005]. SASS is used to manage and serve content, implemented in two
different technologies: XSLT 2.0 using Saxon-SA (read-only) and XQuery using MarkLogic
Server 3.x (read/write).
Babel XSLT, a HighWire-developed vector processing engine which we use to drive XSLT
In this paper we'll discuss how these systems work, and will examine their different
The first layer of the H2O system we'll describe is the Firenze application framework. The
Firenze framework, written in 87,000 lines of code across 526 Java classes, is the piece
of technology we run that services all dynamic page generation requests in our public-facing
H2O web servers. All of the dynamically generated content served by the public-facing sites
flows through this framework.
The bulk of Firenze is a vendor-agnostic set of classes which rely on various public
standard APIs, e.g., the Java Servlet API, JAXP, and HTTP handlers. An additional
classes are then needed to provide a vendor-specific service-provider to execute XSLT
Transformations, and to provide custom URI Resolver implementations. We've written
additional Java classes which use Saxon-SA 9.x to implement this service-provider
functionality. The original implementation of Firenze used Saxon-SA 8.x APIs directly, but in
a subsequent rewrite we decided that we would benefit from abstracting the smaller
vendor-specific parts away from the larger, more general, framework.
A Firenze application pushes an incoming request through four basic stages to produce a response:
It transforms an incoming HTTP request from the Java Servlet API into an XML
representation, req:request, based on an internal schema.
Firenze pushes the req:request through zero or more Filters which may
amend the req:request document, adding or removing data from the document.
Next, Firenze pushes the amended req:request through a chain of one or
more XSLT Templates to produce an XML response representation,
rsp:response, which is also based on an internal schema.
Finally, Firenze transforms the rsp:response into appropriate calls
against the Java Servlet API to output an HTTP response to the client.
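The four stages can be sketched in a few lines of Python. This is only an illustration of the control flow, not Firenze itself (which is Java driving XSLT 2.0 over SAX2 events); the element names, the `sass.example.org` host, and every function name here are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_request(method, path):
    """Stage 1: represent the incoming HTTP request as an XML document."""
    return ET.Element("request", {"method": method, "path": path})


def content_peek_filter(req):
    """Stage 2: a filter amends the request document with extra data."""
    attr = ET.SubElement(req, "attribute", {"name": "content-peek"})
    # Hypothetical data-store URL derived from the request path.
    attr.text = "http://sass.example.org" + req.get("path") + ".atom"
    return req


def page_template(req):
    """Stage 3: a template turns the amended request into a response document."""
    rsp = ET.Element("response", {"status": "200"})
    body = ET.SubElement(rsp, "body")
    peek = req.find("attribute[@name='content-peek']")
    body.text = "Resource: " + peek.text
    return rsp


def emit(rsp):
    """Stage 4: translate the response document into (status, body)."""
    return int(rsp.get("status")), rsp.findtext("body")


def run_pipeline(method, path, filters, templates):
    """Push one request through filters, then templates, then emit it."""
    doc = build_request(method, path)
    for step in list(filters) + list(templates):
        doc = step(doc)
    return emit(doc)
```

Calling `run_pipeline("GET", "/pnas/106/27/10877", [content_peek_filter], [page_template])` walks one request through all four stages.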
The process of pushing these documents through the pipeline is handled via SAX2 events,
implemented by various event-handlers. As each event-handler is invoked it has a chance to
operate on parts of the req:request or rsp:response as the documents
flow through the pipeline. Once each handler completes its area of responsibility it is
removed from the execution stack, thereby reducing the number of SAX2 events fired along the
length of the pipeline.
The demarcation of responsibilities between Filters and Templates is fuzzy: both may amend
req:request documents in any fashion, and the decision whether to make the
process a specific Filter or part of the Template chain is up to the author. The Template
pipeline is then responsible for transforming the req:request into a final
rsp:response document. It may be interesting to note that almost all of the
Filters we've implemented are XSLT stylesheets. Only a few of the Filters have been
implemented directly in Java.
As an example of a pipeline operation, a request flowing through the pipeline might begin
its life as a req:request document as built via the Java Servlet API:
The initial req:request document describes the HTTP request in its entirety.
The first Filter handler which processes this req:request may then, for example,
add additional information about the resource being requested, shown here as an additional
msg:attribute element added to the end of the req:request
The msg:attribute element in this case, given a name
content-peek, provides detailed information about how to retrieve the requested
resource from the central data store, in this case via the URL
http://sass.highwire.org/pnas/106/27/10877.atom?with-ancestors=yes. A second
Filter may then consume this msg:attribute and use it to fill in the contents of that resource, in
this case a full text article with metadata about its ancestry (the issue it is in, its
volume, etc.) expanded in-line, by adding another msg:attribute to the end of the req:request.
In this fashion, Filters and Templates aggregate data from multiple sources until enough
information has been accumulated to produce the final response. It is probably obvious that
the amount of data produced within the pipeline over the lifetime of a request may
exceed the final size of the response document sent to the client. For the example above, the
final req:request document produced by the pipeline exceeds 375 kilobytes of
serialized XML, while the final XHTML document sent to the user is a mere 60 kilobytes.
The public-facing H2O sites run almost entirely off of Firenze, executing a codebase
consisting of a little under 50,000 lines of XSLT 2.0 code (including comments and
whitespace), spread across a set of 160 stylesheets. This codebase is maintained under a
shared repository, with each site then applying overriding stylesheets for any site-specific
functionality. Currently each site has between 10 and 23 site-specific stylesheets, though
some of these only consist of a single include statement referencing a shared stylesheet. When
a context is deployed, a copy of the shared and local stylesheets is pushed out to the
server, and the context compiles and caches its own private set of Templates objects.
This means that the shared development model for stylesheets doesn't get carried all the way
through to the runtime, and each site must allocate memory to compile and store a copy
of all the stylesheets.
Firenze implements two levels of caching. The first is context-wide caching. Each
context-wide cache records all retrieved documents, e.g., JAXP Source and Templates objects,
as well as any documents generated and cached on-the-fly by the pipeline. These cached items
are available for any subsequent requests that execute in the context. Each context-wide cache
may have a specific cache-retention policy and algorithm associated with it, as well as its
own implementation of a backing store. Currently a memory-based store is used, but we are in
the process of implementing a disk-based store as well (our intent is to create a combined
memory and disk cache).
Each context-wide cache then serves as a provider for request-specific caches. These
request-specific caches store items in memory for the duration of a given request. The
request-specific caches were originally developed to fulfill a contract implied by the XSLT
specification, which requires that a document, once read via functions like doc
or doc-available, remain immutable for the duration of the transformation. We've
extended this policy to require that any document read during the execution of a pipeline
remain unchanged for the duration of that request.
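The relationship between the two cache levels can be illustrated with a short Python sketch. The class and method names are hypothetical and the real implementation is Java; the point is only that a request pins the first version of a document it sees, even if the context-wide cache is refreshed mid-request.

```python
class ContextCache:
    """Context-wide cache: shared by every request in the context."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical function: uri -> document
        self._items = {}

    def get(self, uri):
        if uri not in self._items:
            self._items[uri] = self._loader(uri)
        return self._items[uri]

    def invalidate(self, uri):
        self._items.pop(uri, None)


class RequestCache:
    """Request-specific cache: pins each document for one request."""

    def __init__(self, context_cache):
        self._context = context_cache
        self._seen = {}

    def doc(self, uri):
        # The first read is remembered, so later reads in the same
        # request return the same document even if the source changed.
        if uri not in self._seen:
            self._seen[uri] = self._context.get(uri)
        return self._seen[uri]
```

A new `RequestCache` is created per request, so the next request observes any refreshed content.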
Currently the sites implement a very straightforward context-wide caching policy. A
Templates object survives for the lifetime of the context, while most Source objects
survive for a lifetime of 30 minutes, or until the cache size has reached a 1,000-entry
limit. Once the cache has reached its size limit, the addition of any new entries causes the
oldest Source objects to be evicted. In some cases we fine-tune the policy for Source objects,
either by ensuring they will survive for the lifetime of the context, or by increasing or decreasing the
size of the cache. This level of caching has proved to be sufficient, if not ideal, for our
current level of traffic. The default cache policy is most effective for the scenario of a
user who is looking at an abstract and then the full text of the article, or the scenario of a
user scanning the abstracts of an Issue in sequential order, clicking from one abstract to the
next within a fairly short period of time.
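A minimal sketch of such a policy follows, assuming a 30-minute lifetime, a 1,000-entry limit, and least-recently-used eviction. The class name and the explicit `now` argument (used so the behaviour is easy to check) are choices made for this example, not details of the Firenze implementation.

```python
from collections import OrderedDict


class SourceCache:
    """Size-limited cache with a per-entry time-to-live."""

    def __init__(self, max_entries=1000, ttl_seconds=30 * 60):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        # key -> (inserted_at, value); least recently used entries first.
        self._entries = OrderedDict()

    def put(self, key, value, now):
        self._entries.pop(key, None)
        self._entries[key] = (now, value)
        # Evict the least recently used entries once over the limit.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            return None
        inserted_at, value = entry
        if now - inserted_at >= self.ttl_seconds:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return value
```

A reader moving from an abstract to the full text within a few minutes hits the cache both times; an entry untouched for half an hour is rebuilt on its next request.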
To determine the response times for requests handled by Firenze, we examined 78 days'
worth of Apache access logs, consisting of 727 million requests. These requests cover all
types of requests, meaning they include static file requests as well as dynamic page requests.
We found that 214 million requests were dynamic, meaning that Firenze would have handled them.
Examining the response times for those Firenze requests, we found the following response times:
Response Times for Firenze requests
1 - 2
2 - 3
3 - 4
This means that 92% of Firenze requests took less than 2 seconds to complete, 5% took
2 to 4 seconds to complete, and 3% took more than 4 seconds to complete. Our performance goal
is to be able to serve all dynamic page requests, and any associated static content, within
1.5 seconds, so we have not yet met our performance goals.
We've been hosting sets of 15 to 20 sites per load-balanced cluster, with each cluster
consisting of either two or five servers. Each server has 32 gigabytes of memory and between
four and eight CPU cores operating between 2.3 and 2.5 GHz. CPU on the clusters is not heavily
taxed; we routinely record only 15% to 25% CPU usage per server, even during peak traffic.
By far the least efficient part of the Firenze system we've seen is its memory footprint.
What we've seen is that each 32-gigabyte server may use 9 gigabytes of memory during the
lowest traffic periods, up to 15 gigabytes of memory in normal traffic periods, and
gigabytes during very high traffic periods. During normal traffic periods, memory use may
cycle between 9 and 15 gigabytes within the space of a minute, with all the activity occurring
in the Eden and Survivor spaces of the JVM. The need to allocate between one and two gigabytes
of memory per site is a serious impediment to packing enough sites onto a machine to fully
utilize the CPUs.
In order to improve the response time of the Firenze layer, we are currently building
Cache-Channel [Nottingham2007] support into Firenze, and are building a disk-based cache
implementation. The disk-based cache will allow us to store dynamically generated
of pages for a longer period of time, and will be paired with the in-memory caches to take
advantage of optimized Source representations.
We also plan to implement an Adaptive Replacement Cache (ARC) [Megiddo2003]
algorithm to replace the Least Recently Used (LRU) algorithm currently used by the
cache. The HighWire sites receive a steady stream of traffic from indexing web-crawlers, and
we've found that these crawlers tend to push highly-used resources out of cache when they
start crawling large amounts of archival content.
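The crawler effect is easy to reproduce with a toy LRU cache. The sketch below is not ARC and is not the Firenze cache; it is a small Python simulation, with made-up key names and a capacity of five, showing how a single sequential scan of archival content displaces a small set of frequently requested entries.

```python
from collections import OrderedDict


class LRUCache:
    """Plain least-recently-used cache that only tracks keys."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert the key and evict."""
        if key in self._items:
            self._items.move_to_end(key)
            return True
        self._items[key] = True
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)
        return False


def simulate_crawl():
    cache = LRUCache(capacity=5)
    hot = ["abstract-1", "abstract-2", "abstract-3"]
    for _ in range(3):                  # readers keep the hot set warm
        for key in hot:
            cache.access(key)
    hits_before = [cache.access(key) for key in hot]
    for i in range(10):                 # crawler touches each archive item once
        cache.access(f"archive-{i}")
    hits_after = [cache.access(key) for key in hot]
    return hits_before, hits_after
```

Under LRU every hot entry is a miss after the scan. A policy such as ARC, which weighs frequency as well as recency, is intended to keep entries like these resident through a one-pass scan.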
Firenze can be thought of as a framework designed to aggregate data from multiple sources
and then to manipulate that aggregate data into a final form. We currently aggregate data from
many different services, gathering information for access control, running searches,
gathering reference data, etc. One of the primary services that the sites use is the Schema,
Addressing, and Storage System (SASS). A SASS service is a data store that provides a
view of metadata and content for publications we host.
One of the first decisions we needed to make when we were building the replacement system
was how we would store metadata and related resources. Early on we mocked up a file-system
based set of XML files along with an XQuery prototype that combined these files dynamically,
feeding the combined resources to an XSLT-based display layer. These experiments proved
that the fairly simple concept of using a hierarchy of XML files could actually provide
flexibility and functionality for our needs. We decided we could replace some of the
relational databases of our original system with a hierarchy of XML files. Our implementation
of this, SASS, is the result of that decision.
A SASS service uses an HTTP-based protocol we've defined, named HighWire Publishing
Protocol (HPP), to handle requests. HPP is built on top of the Atom Publishing Protocol (APP),
which specifies much of the basic functionality we needed for a database system:
It specifies how resources are created, retrieved, updated, and deleted.
It specifies the format of introspection documents, which provide configuration information.
It specifies the output format for an aggregation of resources, using feeds of entries.
It offers a flexible data model, allowing foreign namespaced elements and attributes.
After examining APP in detail, we decided we needed a little more functionality, and
concluded that we would adopt APP and then extend it in three ways. Our extensions
give us the ability to:
Add Atom entries with hierarchical parent/child relationships.
Create multiple alternative representations of an Atom entry, called variants.
Create relationship links between Atom entries.
As with APP, we have defined a form of XML-based introspection document, similar to an APP
Service Document. These introspection documents define which collections exist and what types
of resources the collections will accept. Because our HPP system allows every single Atom
entry to potentially act as a service point to which new resources may be attached, each
entry references a unique introspection document. Each introspection document describes which
content types and roles may be used for variant representations of the entry, and
or more collections for sub-entries, with constraints regarding the types and roles of
resources that may be added to each collection.
Both client and server examine the introspection document, the client being responsible
for evaluating which collection should best be used for a new resource it wishes to create,
and the server for evaluating whether or not to allow a particular operation. Once the server has
decided that a client's request to create a new resource is allowed, it further examines the
introspection document to:
Determine how to name the new resource.
Determine what, if any, modifications need to be made to the new resource, e.g.,
creating server-managed hyperlinks to its parent.
Determine what other resources might need to be created or modified, e.g., creating
reciprocal hyperlinks between resources or creating a new Atom sub-entry for a media resource.
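The server-side checks described above might look roughly like the following Python sketch. The dictionary is a simplified, hypothetical stand-in for an introspection document (the real ones are XML), and the collection names, naming rule, and link layout are assumptions made purely for illustration.

```python
# Hypothetical stand-in for introspection documents, keyed by the type
# of the parent entry. Each collection lists the child types it accepts.
INTROSPECTION = {
    "journal": {"collections": {"volumes": {"accepts": {"volume"}}}},
    "volume": {"collections": {"issues": {"accepts": {"issue"}}}},
}


def create_child(parent_path, parent_type, collection, child_type, slug):
    """Validate a create request, then name and link the new entry."""
    collections = INTROSPECTION.get(parent_type, {}).get("collections", {})
    accepted = collections.get(collection, {}).get("accepts", set())
    if child_type not in accepted:
        raise PermissionError(
            f"{collection!r} under a {parent_type} does not accept {child_type}"
        )
    return {
        # Naming: the new resource lives beneath its parent's path.
        "path": f"{parent_path}/{slug}.atom",
        "type": child_type,
        # Server-managed hyperlink back to the parent entry.
        "links": [{"rel": "parent", "href": f"{parent_path}.atom"}],
    }
```

For example, `create_child("/pnas", "journal", "volumes", "volume", "106")` succeeds, while asking for an `issue` in the same collection is rejected until the introspection data is edited to allow it.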
An example slice of resource paths in our data store shows a top level entry, a journal
entry (pnas), a volume (number 106) entry, an issue (number 27) entry and an article
10879) entry, and a child (Figure 1) of the article.
Examining the extensions of the paths above, you will see a number of media types
represented. The .atom files are metadata resources, while the .xml,
.html, .gif, .jpg, and .pdf files allow
us to serve alternative representations (variants) of those resources. SASS therefore provides a
unified system for serving metadata paired with an array of alternative representations.
The system is intended to be flexible enough that we can model new relationships
relatively quickly, and once those relationships are defined we can immediately start
and serving the new resources and relationships. As an example, the hierarchy of relationships
we originally designed was, in part:
This model allows one or more Journals, each Journal may have one or more Volumes, each
Volume may have one or more Issues, etc. When we encountered a journal whose hierarchy didn't
match this model, we simply edited the templates for the introspection documents. In this
case, we edited them to allow for an Issue to be attached directly to a Journal:
When deciding how to implement SASS, we concluded early on that we needed to provide both
a read-only and a read/write service. The read-only SASS services would be used by the
public-facing sites, and the read/write service would be used by our back-end publishing
system. Splitting the services this way would allow us to optimize each service for its
primary use-case: transactions and complicated searching would be needed in the read/write
SASS service, but would not be needed in the read-only SASS service.
Based on our early prototyping work, we were confident that we could develop a
filesystem-based read-only implementation of the HPP protocol that would serve the needs of
the public-facing sites. Currently we've implemented a read-only version of SASS using XSLT 2.0.
The description of SASS to this point has only described the basic hierarchical layout
of resources. To take advantage of this hierarchy, HPP defines ways to request an expanded
view of the metadata and content representations for a resource. Specifically, a set of
parameters may be provided when requesting an Atom entry:
These parameters may be combined in variations and some may be repeated. Taken together,
the parameters serve as a way to drive the expansion of a resource along its parent, child,
and variant axes, returning a compound document consisting of an appropriate slice of the
hierarchy. As an example, requesting
http://sass.highwire.org/pnas/106/27/10877.atom will retrieve the Atom entry
associated with PNAS Volume 106, Issue 27, Page 10877:
while adding the parameter to expand its ancestry axis in full,
http://sass.highwire.org/pnas/106/27/10877.atom?with-ancestors=yes,
additionally expands the Atom entries for the article's Issue, its Volume, etc:
<atom:title>HighWire Atom Store</atom:title>
<atom:title>Proceedings of the National Academy of Sciences</atom:title>
<atom:name>National Academy of Sciences</atom:name>
<atom:name>National Academy of Sciences</atom:name>
<atom:name>National Academy of Sciences</atom:name>
<atom:title>Should Social Security numbers be replaced by modern, more secure identifiers?</atom:title>
<atom:name>William E. Winkler</atom:name>
<nlm:name name-style="western" hwp:sortable="Winkler William E.">
The difference between the two documents is that the parent link in the entry:
Because the with-ancestors value was yes, each entry has had its parent
link expanded into a c:parent element, pulling in metadata all the way up to
the root of the hierarchy.
Likewise, a client may also request with-descendants, and a common request
sent by the sites is for a Journal Issue with its ancestors expanded completely, and its
descendants expanded to a depth of one. This in effect gives them the metadata for the Issue
and its Article children, from which they may do things like build a Table of Contents.
In effect, these parameters allow us to perform operations somewhat like a join
operation in a relational database. If you think of the Atom entries as relational
and atom:link elements as foreign keys, we have a limited ability to join documents
together on those keys.
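The ancestor expansion, and the sense in which it behaves like a join, can be sketched in Python. The in-memory `STORE`, its titles, and the function names are invented for this example; in SASS the entries are Atom documents and the expansion is driven by the with-ancestors parameter.

```python
# Hypothetical miniature store: each entry holds a link to its parent.
STORE = {
    "/pnas.atom": {"title": "PNAS", "parent": None},
    "/pnas/106.atom": {"title": "Volume 106", "parent": "/pnas.atom"},
    "/pnas/106/27.atom": {"title": "Issue 27", "parent": "/pnas/106.atom"},
    "/pnas/106/27/10877.atom": {
        "title": "Article 10877",
        "parent": "/pnas/106/27.atom",
    },
}


def fetch(path, with_ancestors=False):
    """Return a copy of an entry, optionally expanding its parent links."""
    entry = dict(STORE[path])
    parent = entry["parent"]
    if with_ancestors and parent is not None:
        # Follow the link (the "foreign key") and inline the parent entry.
        entry["parent"] = fetch(parent, with_ancestors=True)
    return entry


def ancestor_titles(entry):
    """Walk the expanded parent chain from nearest to farthest ancestor."""
    titles = []
    parent = entry["parent"]
    while isinstance(parent, dict):
        titles.append(parent["title"])
        parent = parent["parent"]
    return titles
```

Without expansion the article carries only a link to its issue; with it, one response holds the issue, volume, and journal metadata as well.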
The read-only SASS service hosts an Apache front-end, load-balancing requests to a set
of four Tomcat servers. Each Tomcat server uses two AMD 1210 cores, 8 gigabytes of memory,
and two local SATA disks. Each Tomcat server runs Firenze to execute the read-only SASS
service stylesheets. The stylesheets in turn pull data from a service named SASSFS, running
on an Apache-only server using four AMD 2218 CPU cores, 32 gigabytes of memory, and
terabytes of FC attached SATA storage. The SASSFS service holds a synchronized clone of the
read/write SASS service. The SASSFS system is, in effect, a network-based storage system for
SASS, accessed over HTTP instead of a more traditional NFS protocol.
The XSLT implementation of read-only SASS consists of just under 5,000 lines of XSLT
code (including whitespace and comments), spread across a set of 13 stylesheets. About
of those lines of code are an interim prototype for disk-based caching.
Our initial version of the read-only SASS service used the default in-memory caching
available in Firenze. This default would store the most recent 1,000 resources requested
from SASSFS in memory as Source documents (the underlying representation being a Saxon
TinyTree). This caching proved to be effective, and the service performed very well under
high load. While we were satisfied with the performance, we knew that we wanted to implement
a more effective caching algorithm for Firenze as a whole, and we decided to use the
read-only SASS service as a test-bed for prototyping part of this work.
Because HighWire hosts a great deal of material that does not change very often, we
wanted to implement a caching system that could take advantage of the fact that most of the
material is for all intents and purposes written once and then read many times. Our research
turned up the Cache-Channel specification, describing a method where clients could use an
efficient service to detect when cached items were stale. If we implemented this system, we
could cache responses built by the SASS service and, for the most part, never have to build
them again. Thus, we could trade disk space for time, allowing us to short circuit
all processing within the Firenze system when we had a cached response available.
To prototype this work, we implemented a set of extensions for Saxon that allowed us to
write serialized XML to a local disk partition. When an incoming request could be fulfilled
by the cache, we could simply stream the data from disk, bypassing the bulk of the
processing we would otherwise have to perform.
In the XSLT prototype, the req:request representation of the HTTP request
is processed via the following steps:
1. Examine the HTTP PATH of the req:request and check that the resource
is available on SASSFS; if it is not, return a not-found error code.
2. If the media type of the requested resource is not XML, stream the resource from
SASSFS to the client.
3. If the resource is not in cache, build the response. SASS reads resources from
SASSFS, storing the resources in local cache. Using the resources fetched from SASSFS,
the SASS service builds an XML rsp:response, and stores that response in
cache. Each resource written to the cache is accompanied by a corresponding XML metadata file.
4. If the resource was in cache, check the metadata and perform HTTP HEAD requests
against SASSFS to see whether or not the item needs to be rebuilt. The rebuild would
be needed if any one of the constituent resources on SASSFS has changed. If nothing
has changed, stream the response from disk to the client. Otherwise a
rsp:response is built as in step 3.
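The steps above can be sketched as a single Python function. This is not the XSLT prototype: the backing store and cache are plain dictionaries, ETag comparison stands in for the HTTP HEAD requests, and `constituents` is a hypothetical callback naming the resources a response is built from.

```python
def handle(path, sassfs, cache, constituents):
    """Serve `path`; `sassfs` maps path -> {"type", "etag", "body"}."""
    # Step 1: the resource must exist on the backing store.
    if path not in sassfs:
        return 404, None
    resource = sassfs[path]

    # Step 2: non-XML media types are passed through untouched.
    if not resource["type"].endswith("xml"):
        return 200, resource["body"]

    # Step 4: reuse a cached response only if no constituent changed.
    cached = cache.get(path)
    if cached is not None:
        unchanged = all(
            sassfs.get(p, {}).get("etag") == etag
            for p, etag in cached["metadata"].items()
        )
        if unchanged:
            return 200, cached["response"]

    # Step 3: build the response and store it with its metadata.
    parts = constituents(path)
    response = "".join(sassfs[p]["body"] for p in parts)
    cache[path] = {
        "response": response,
        "metadata": {p: sassfs[p]["etag"] for p in parts},
    }
    return 200, response
```

The metadata recorded in step 3 is what makes step 4 possible: each cached response remembers the ETag of every resource it was built from.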
For the XSLT-based prototype work, we decided not to implement the actual Cache-Channel
client or to hook into the in-memory cache of frequently used Source objects. We would
tackle these items later, when we implemented the caching logic in Java.
We expected this prototype to be slower than the original implementation, both because
Firenze would now need to parse XML from disk for every request, instead of simply
reusing cached Source objects, and because we would be polling SASSFS to see if a resource had changed.
Our initial analysis of the prototype's performance simply examined the average response
time across all requests. We were very unpleasantly surprised to find that for single
entries the average response time jumped from 0.031 seconds to 0.21 seconds. The average
response times for compound entries jumped from 0.05 seconds to 0.26 seconds. Looking at
those averages, we decided we needed to know whether or not the slowdown was across the
board, or whether the averages reflected large outliers.
We examined response times for a day's worth of requests using each of the two caching
implementations, and sorted the requests into two categories. One category was for requests
that would return a single resource, effectively a transfer of a file from SASSFS to the
client via SASS, with some cleanup processing applied. The second category was for requests
that returned compound resources. These were resources built by SASS, using component
resources fetched from SASSFS. We examined the response time for these requests, and broke
them into percentiles:
SASS Response Times in Seconds per Cache implementation
Native Firenze Cache
XSLT Prototype Disk Cache
This analysis shows that the disk-caching prototype was 1-2.5 times slower than the
memory-based cache for about 75% of the requests, but that performance was significantly
worse for the remaining 25% of the requests.
What we discovered were two bottlenecks occurring with the prototype. The first, and
most significant, bottleneck was the IO subsystem. The hardware on our machines couldn't
keep up with the level of read/write activity being asked of them. When measuring disk
activity, we found it was operating at around 700 block writes per second and around
block reads per second. This level of activity was overwhelming the 7,200 rpm SATA disks
used by the servers, causing high IO wait times.
The second bottleneck turned out to be the portion of XSLT code responsible for
executing HTTP HEAD requests to determine whether or not a resource had changed. When we
profiled the application on a stand-alone machine (eliminating disk contention), we found
that the following snippet of code was responsible for 30% of the execution time:
some $m in $cache:metadata
satisfies cache:is-stale($m)" />
The cache:is-stale function takes as an argument a small XML metadata
element storing a URL, a timestamp, and an HTTP ETag value. The function executes an HTTP
HEAD request against the URL to determine whether or not the resource has been modified.
As Saxon does not take heavy advantage of multi-threading, this XPath expression
ends up running serially. Because underlying resources don't change very often, the
algorithm usually ends up running through every metadata element only to find nothing has changed.
These discoveries were actually good news to us, as we knew that we could both reduce
disk contention and parallelize the check for stale resources when we implemented the cache
in Java as a native Firenze service. We're in the process of completing this work, and in
the meantime we have rolled the XSLT prototype code into active service.
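As an illustration of the parallel check, here is a short sketch using a thread pool. It is written in Python rather than Java, and `is_stale` is a hypothetical callable standing in for the HTTP HEAD comparison performed by cache:is-stale; the worker count is likewise an arbitrary choice for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def any_stale(metadata, is_stale, max_workers=8):
    """Return True if any metadata element refers to a changed resource.

    The checks are submitted to a thread pool, so network-bound calls
    overlap instead of running one after another.
    """
    if not metadata:
        return False
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return any(pool.map(is_stale, metadata))
```

Because the common case is that nothing has changed, every element has to be checked anyway, which is why overlapping the requests matters more here than stopping early.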
Performance of the prototype has proven to be adequate. Examining 12 days of access logs
from read-only SASS, the service is handling an average of 5.9 million requests per day,
ranging from a low of 3.3 million requests to a high of 7.8 million requests. On average the
service is processing 70 requests per second, writing 3.5 megabytes per second to
Overall the read-only SASS service is serving an average of 266 gigabytes per day.
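The per-second rate follows directly from the daily totals. The short calculation below, using only the figures quoted above, shows that 5.9 million requests per day works out to roughly 68 requests per second, consistent with the rounded figure of 70.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(requests_per_day):
    """Convert a daily request count into an average per-second rate."""
    return requests_per_day / SECONDS_PER_DAY


average_rate = per_second(5.9e6)  # about 68.3 requests per second
low_rate = per_second(3.3e6)      # about 38.2 requests per second
high_rate = per_second(7.8e6)     # about 90.3 requests per second
```

These are averages over a full day; the instantaneous rate during peak hours would be higher.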
Because SASS serves both XML markup and binary data, and because binary data may be served
directly from the SASSFS system without any intermediate processing by SASS, only part
of those 266 gigabytes is XML processed via Firenze. A breakdown of the two types of data
shows we serve an average of 166 gigabytes of XML data per day, and an average of
gigabytes of binary data:
Gigabytes served per day by read-only SASS
It has proven difficult to compare these numbers against our older system because the
SASS service combines services that are spread out across multiple database and NFS servers
in the older system.
In order to implement the read/write SASS service, we knew we needed to build a
transactional system. We had to be able to know that we could roll back any operation that
met with an error condition. In addition, we wanted a system that would allow us to query
the XML documents without needing to write custom code or build new indexes for every
query we might come up with.
After exploring the available systems, we decided to license the XML server provided by
Mark Logic Corporation. In addition, since both the MarkLogic Server and its underlying
XQuery technology were new to us, we contracted with Mark Logic for consultants to help
us build an implementation of our HPP specification. HighWire staff provided details
regarding the specification and Mark Logic consultants wrote an implementation in XQuery.
The implementation was written in just under 7,600 lines of XQuery code, spread across
We're currently running MarkLogic Server version 3.2, which is a few years old, and
which uses a draft version of XQuery. Newer releases of MarkLogic implement the XQuery
specification, and we plan to eventually modify the XQuery implementation to take advantage
of the newer releases.
We are currently running the MarkLogic implementation on one dedicated production server
using four AMD 2218 CPU cores, 32 gigabytes of memory, and 3.7 terabytes of FC-attached
storage. This server is currently handling between 7 and 8 million requests per month, and is
used as the system of record for our production processing system. The breakdown of
requests for the months of April, May, and June 2009 was:
Number of requests to SASS read/write service
GET (non-search) reflects the retrieval of a single Atom entry or variant
GET (search) reflects the execution of a search
GET (report) reflects the execution of custom reporting modules we've written
What these numbers translate to is the loading of between 40,000 and 50,000 articles a
month, though in our first month of operations, when we were migrating PNAS, we loaded
93,829 articles.
As of June 18, 2009, the read/write SASS service held the following counts of resource
types (there are others, but these are the ones whose counts may be of general interest):
read/write SASS service resource counts
In the table above, Journal/Volume, Journal/Issue, and Journal/Article resources
correspond to the obvious parts of a journal. Journal/Fragment resources indicate content
extracted from an article to create a sub-resource; in this case they are representations of
figures and tables. Adjuncts are media resources that provide supplemental data to an article
(e.g., raw data sets submitted along with an article). All variants consist of alternative
representations, including XHTML files, PDFs, images, etc.
In general we've found the performance of MarkLogic to be very good, and have not yet
reached the level of use that would require us to add additional servers. When we reach
that point, an important advantage we see in MarkLogic is that we ought to be able to add
capacity by simply creating a MarkLogic cluster of multiple servers.
There are two areas where MarkLogic has had some trouble with our particular application:
Complex DELETE operations are slow
Some ad-hoc XQuery reporting may be resource intensive depending on the
In MarkLogic, individual delete transactions are very efficient, but to properly
implement a DELETE operation in SASS the application executes an expensive traversal
algorithm, building a list of resources, including:
Resources that are children of the targeted resource.
Resources that refer to the resource targeted for deletion or to any of its child resources.
The application then needs to delete all the descendant resources and remove
all references to those deleted resources. Deleting a single article could require that the
application perform a dozen searches, delete fifty resources, and then update all the
entries that refer to those deleted resources. This algorithm is costly to execute, and
makes DELETE far slower than the other operations.
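The shape of this traversal can be sketched in a few lines of Python. This is an in-memory
illustration of the steps described above, not the XQuery implementation running in
MarkLogic; the `Resource` class, its `parent` and `refs` fields, and `delete_resource` are
hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Resource:
    """Hypothetical stand-in for a stored resource (e.g., an Atom entry)."""
    parent: Optional[str] = None                  # id of the containing resource
    refs: Set[str] = field(default_factory=set)   # ids this resource links to


def delete_resource(store: Dict[str, Resource], target: str) -> Set[str]:
    """Delete `target` and its descendants, then scrub references to them.

    Returns the set of deleted ids.
    """
    # 1. Collect the target plus every descendant. Each child lookup here is a
    #    full scan, standing in for one search against the data store.
    doomed = {target}
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for rid, res in store.items():
            if res.parent == current and rid not in doomed:
                doomed.add(rid)
                frontier.append(rid)

    # 2. Delete the collected resources.
    for rid in doomed:
        store.pop(rid, None)

    # 3. Update every surviving resource that referred to a deleted one.
    for res in store.values():
        res.refs -= doomed

    return doomed
```

Even in this toy form the cost is visible: one search per deleted resource to find its
children, plus a pass over everything that might hold a reference, all for what the client
sees as a single DELETE.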
For each type of HTTP operation, a sample of 5,000 log entries was examined for
execution times:
Seconds to complete a request
GET (non-search) reflects the retrieval of a single Atom entry or a variant
GET (search) reflects the execution of a search
Performance is excellent for the GET (non-search), POST, and PUT operations, and fairly
good for GET (search), but DELETE operations are far slower than any other operation. The
intrinsic problem with handling a DELETE is the complexity of the algorithm and the number
of documents that need to be searched and modified. In theory we ought to be able to
optimize how the searches are performed, implementing a more efficient algorithm and
speeding up the execution. Because DELETE operations make up such a small number of the
requests we execute, we have not yet seriously investigated implementing such an optimization.
The other problem area we've had with MarkLogic is constructing efficient ad-hoc
queries. MarkLogic automatically creates indexes for any XML that it stores, and while these
indexes cover many types of possible queries, it is possible to construct queries that do
not take advantage of them. At various times we want to run ad-hoc reports against
the database, and we've found that some of these queries can time out if they are written
without applying some knowledge of how the server's query optimizer works. Given the
structure of our XML, a challenge for some of our ad-hoc queries has been that our version
of MarkLogic Server will not use an index if the expression is within nested predicates. As
an example, if we have an index built on the two attributes @scheme and
@term for atom:category in an Atom entry, which together
function as a key/value pair:
as well as on the element:
then if we wanted to find those entries with the values represented by the
variables $scheme, $term, and $issue-id, the XPath
expression must be written along the lines of
for $cat in /atom:entry[nlm:issue-id = $issue-id]/atom:category[@scheme eq $scheme and @term eq $term]
Writing it in an alternative way, using nested predicates,
/atom:entry[atom:category[@scheme eq $scheme and @term eq $term] and nlm:issue-id = $issue-id]
prevents the server from using the @scheme and @term indexes,
which leads to longer execution times. As more predicates are added to a query, it can become
very difficult to figure out how best to structure the query to take full advantage of the indexes.
As an example, the following XQuery expression searches for Atom entries under a
specified $journal-root location, and identifies those Atom entries that match particular
atom:link and atom:category criteria. The nested predicates listed are required to ensure that no
false positives are returned:
[atom:link[@rel eq $hpp:rel.parent and @c:role = $hpp:model.journal]]
[atom:category[@scheme eq $hpp:role.scheme and @term eq $hpp:model.adjunct]]
[not(atom:link[@rel eq 'related' and @c:role = $hpp:model.adjunct.related])]
This query takes some 472 seconds to run against a $journal-root which
contains a little over 1.8 million resources. Rewriting the query to first look for just one
of the criteria in each nested predicate listed above, thereby allowing the server to use
more indexes, reduces the execution time to around 4.6 seconds, roughly a hundredfold improvement:
for $entry in
[atom:link/@c:role = $hpp:model.journal]
[atom:category/@term = $hpp:model.adjunct]
[not(atom:link/@c:role = $hpp:model.adjunct.related)]
$entry/atom:category[@scheme eq $hpp:role.scheme and @term eq $hpp:model.adjunct]
and $entry/atom:link[@rel eq $hpp:rel.parent and @c:role = $hpp:model.journal]
and not($entry/atom:link[@rel eq 'related' and @c:role = $hpp:model.adjunct.related])
Both queries produce the correct results; it's just a matter of how quickly those results are
computed. Another way we could improve the performance of this query is to change the
structure of our XML to be better aligned with MarkLogic's indexes. For this application,
that was not an option.
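The idea behind the rewritten query, narrowing candidates with a cheap single-attribute
lookup and only then applying the exact nested predicate, can be illustrated outside of
XQuery. The Python sketch below is an analogy under stated assumptions, not a model of
MarkLogic's index internals: entries are plain dictionaries, `(rel, role)` pairs stand in for
atom:link/@rel and @c:role, and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical entry shape: {"links": [(rel, role), ...]}
Entry = Dict[str, List[Tuple[str, str]]]


def build_role_index(entries: Dict[str, Entry]) -> Dict[str, Set[str]]:
    """Index entry ids by every link role they contain (a single-attribute index)."""
    index: Dict[str, Set[str]] = {}
    for entry_id, entry in entries.items():
        for _rel, role in entry["links"]:
            index.setdefault(role, set()).add(entry_id)
    return index


def exact_match(entry: Entry, rel: str, role: str) -> bool:
    """The nested predicate: a single link must carry both values at once."""
    return (rel, role) in entry["links"]


def two_phase_query(
    entries: Dict[str, Entry], index: Dict[str, Set[str]], rel: str, role: str
) -> List[str]:
    # Phase 1: cheap, index-backed narrowing on one attribute.
    candidates = index.get(role, set())
    # Phase 2: exact check on the survivors only, discarding false positives.
    return sorted(eid for eid in candidates if exact_match(entries[eid], rel, role))


def full_scan_query(entries: Dict[str, Entry], rel: str, role: str) -> List[str]:
    """Reference answer: apply the exact predicate to every entry."""
    return sorted(eid for eid, e in entries.items() if exact_match(e, rel, role))
```

The first phase may admit false positives (an entry with the right role on the wrong link),
which is why the second phase is still required; the saving comes from running the expensive
check on far fewer entries.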
MarkLogic Server is able to provide detailed information about which parts of a query
are using an index, and is able to provide very detailed statistics regarding cache hit
rates for a query. Many queries in MarkLogic can be fully evaluated out of the indexes, and
these queries are very efficient, usually returning in sub-second time. However, as queries
become more complex, the developer needs to understand the impact of the query's conditions
and the way they interact with the indexes. MarkLogic provides accurate responses to
queries, and as a query is rewritten to make more use of the indexes, response times improve.
As an example, the following query makes full use of the indexes to identify those
resources that contain a given DOI value $doi, and MarkLogic can return results for this
type of query in less than 0.1 seconds:
for $doc in
atom:entry/nlm:article-id[@pub-id-type eq "doi"][. eq $doi]
The final component of the XML-based systems used in H2O is the Babel XSLT processing
engine. Babel XSLT is a batch processing engine that we use to transform incoming
into resources for loading into the read/write SASS service. We've implemented an
client in XSLT 2.0 (using Java extensions to allow XSLT programs to act as an HTTP client),
and we perform the bulk of our content loading using the Babel XSLT engine to POST
into the read/write SASS service.
Babel XSLT is an HTTP service that accepts XML documents describing a batch operation to
perform. A batch consists of an XSLT stylesheet to run, an optional set of XSLT parameters
(these parameters may be complex content, meaning they may contain document fragments or
sequences), and one or more input sources to process, along with corresponding output
When a batch is submitted, it is queued for processing until the server has free
Once the server begins processing a batch, it draws from a pool of threads to apply the
specified stylesheet to each specified input source in parallel. Upon completion, a batch log
report is produced that indicates the start and stop time of each transformation, as well as
any xsl:message log events captured during the execution of the individual
transformations. As with the input parameters, the xsl:message log events may be complex content.
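The batch model described above (one stylesheet, one set of parameters, many inputs, a log
with per-transformation timings and messages) can be sketched as follows. This is a minimal
Python illustration, not Babel XSLT itself: `run_batch`, `LogEntry`, and the `transform`
callable standing in for a compiled stylesheet are assumptions made for the example, and the
output is kept in the log entry only to keep the sketch short.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A "transform" stands in for a compiled stylesheet: it takes the input text and
# the batch-wide parameters and returns (output, messages).
Transform = Callable[[str, Dict[str, str]], Tuple[str, List[str]]]


@dataclass
class LogEntry:
    source: str
    started: float
    stopped: float
    messages: List[str]
    output: str


def run_batch(
    transform: Transform,
    params: Dict[str, str],
    sources: Dict[str, str],
    workers: int = 4,
) -> List[LogEntry]:
    """Apply one transform, with one set of parameters, to every source in parallel."""

    def run_one(item: Tuple[str, str]) -> LogEntry:
        name, text = item
        started = time.monotonic()
        output, messages = transform(text, params)
        return LogEntry(name, started, time.monotonic(), messages, output)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, giving a stable batch log.
        return list(pool.map(run_one, sources.items()))
```

A caller would submit something like `run_batch(my_transform, {"mode": "full"}, {"a.xml": "<a/>"})`
and inspect the returned entries much as a batch log report is inspected.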
The Babel XSLT service keeps a permanent cache of compiled Templates for the stylesheets
it is asked to execute. Because a batch requires the uniform application of any XSLT
parameters to every input source in a batch, the server is then able to set up its
workflow once and then apply that workflow en masse to all the inputs listed in the batch.
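A cache of compiled stylesheets amounts to compile-once, reuse-many memoization. The
following Python sketch shows the pattern in isolation; it is not the Babel XSLT code, and
`TemplateCache`, its injected `compile_fn`, and the `compilations` counter are hypothetical
names used only to make the behavior observable.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class TemplateCache(Generic[T]):
    """Compile each stylesheet at most once and reuse the compiled form afterwards."""

    def __init__(self, compile_fn: Callable[[str], T]) -> None:
        self._compile = compile_fn      # hypothetical compiler, injected by the caller
        self._compiled: Dict[str, T] = {}
        self._lock = threading.Lock()   # worker threads may ask for templates concurrently
        self.compilations = 0           # counter kept only to make the effect visible

    def get(self, stylesheet_uri: str) -> T:
        with self._lock:
            if stylesheet_uri not in self._compiled:
                self._compiled[stylesheet_uri] = self._compile(stylesheet_uri)
                self.compilations += 1
            return self._compiled[stylesheet_uri]
```

Holding the lock during compilation keeps the sketch simple at the cost of serializing
first-time compiles; a production cache might use a finer-grained scheme.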
We currently use Babel XSLT to produce and, via its HTTP and HPP client extensions,
load and update almost all H2O content. The production process includes tasks such as applying
Schematron assertions to produce reports on the content, applying normalization routines to
the article source XML, enriching the article source XML to include extra metadata, and
converting those article source files into Atom entries and variant representations (e.g.,
XHTML). HighWire has written about 48,000 lines of XSLT 2.0 code (including comments and
whitespace), spread across 318 stylesheets, to perform this work.
We are currently running Babel XSLT on two servers. Each server uses two AMD 1210
gigabytes of memory, and various NFS-mounted storage arrays. Across both servers we are
executing an average of 3,839 transformations per hour. At peak times we've run anywhere from
23,000 to 57,000 transformations in an hour. Transformation execution times range from a low of
0.20 seconds to a high of 20.0 seconds, with 95% of transformations taking less than
seconds to complete.
The biggest efficiency headache we've encountered with the Babel XSLT service has been
related to its memory requirements. A large enough batch job can run into memory limits as the server
converts the incoming batch into a JDOM object, runs its XSLT transformations, and then has
to produce the batch log report. The Babel XSLT servers have a minimum memory footprint
ranging from 200 to 300 megabytes, but can easily use up to 5 gigabytes of memory depending on
their workloads. In the space of one minute, a server might jump from needing 500 megabytes to
needing 2.5 gigabytes of memory.
Currently HighWire uses a Perl-based framework to submit Babel XSLT jobs. The Perl framework is
responsible for identifying which stylesheet and which input and output files need to be
submitted for a given batch, based on the phase of processing in a workflow. It is also
responsible for producing a batch, submitting it to the Babel XSLT system, and then reading
the batch log report to determine whether or not the job was completed successfully and to
report any messages emitted by the stylesheet.
By far our most challenging experience has been that of educating everyone within our
organization. Our developers are faced with new systems that make use of a bewildering array
of specifications and standards, and it has not been easy for everyone involved to come up to
speed on everything; our developers have demanded better documentation and clearer
explanations of how the new systems work.
In terms of performance, we've found the XML-based technologies to be adequate, if not
stellar. When we've needed to improve performance we've applied traditional techniques:
Don't perform work if you don't need to (e.g., Firenze's ability to remove handlers
from the stack when the handler has completed its task).
Take advantage of optimized representations of your data, if available (e.g., using
compiled Templates, making use of optimized Source implementations).
Develop caching techniques at multiple layers, trading space for time.
Examine your algorithms to determine if they are the best fit for the task.
Applying these techniques, the XML-based technologies we've discussed here can be
fast enough for most of our needs.
The advantages we see to using a unified, RESTful, XML data store paired with high-level
declarative programming languages like XSLT and XQuery are:
It is easier to introduce changes to our data models.
There's no need to spend time writing code that converts data from one data model
into another (e.g., from relational form to an object-oriented form and back).
I would like to thank Craig Jurney, the architect and
developer of the Firenze system, and Jules Milner-Brage, the
primary architect of the SASS specification and the architect and developer of the
system, for their comments and advice during the preparation of this paper.