Introduction

It’s a truism that things were different when I was a lad, but with open-source implementations of markup technologies there seems to be truth to it.

When I began to develop with XML technologies in the very early 2000s there was a slew of toolkits and implementations for XML parsers, multiple DOM-like implementations outside web browsers, XSLT implementations, etc. (almost ad nauseam).

Now, GNOME’s LibXML seems to be the de facto standard implementation, used with many (non-Java) language bindings; there’s only one open-source XSLT 2 implementation; and technologies like XQuery seem to be restricted to specialist use and implementations on top of XML databases.

What happened? Open-source technologies in related areas, including document-like data stores and document manipulation (albeit in JavaScript running in a browser), have come on in leaps and bounds. The XML technology space seems to have contracted and stagnated, at least to a casual observer.

Why should you care about this? Will it affect you? And if you do care, what can you do?

The greener grass

For comparison, let’s take a look at open-source web development, and what’s happened to that in the last 10 years. Web development is a good comparison point because it is, in essence, in the business of markup production.

If we go back to 2003 we’re in a world where Perl and variants of the CGI model are dominant, and PHP (a variant of the CGI model itself) is fast rising. Python is heavily used by self-respecting developers (Google being the poster child for Python at this time). MySQL has already won and taken its place as the default database backing the web. The common thread is that there are very few frameworks, as we would understand them now. The frameworks that are there seem to largely be restricted to proprietary toolkits like WebObjects and J2EE. The open-source web development world is making do with CGI, templating libraries, and SQL. A uniform API to connect to different SQL databases and issue queries is pretty much the height of sophistication.

In the XML world, we have LibXML 2 fast emerging, but still not installed by default on pretty much every computer system (as it is now). Saxon and Xerces/Xalan are the heavyweights here.

July 2004: Ruby on Rails

Rails was the first proper full-stack web framework to get significant traction and adoption. Django, the Python framework which most closely matches it in terms of scope, was first released in July 2005.

The releases of Rails and Django are not unique, but they are significant enough to stand in for the changes in backend web development as a whole.

What we see as they pick up speed is an explosion of libraries and plugins and the emergence of ecosystems surrounding them. Ruby makes for a nice subject here. Rails was largely responsible for it becoming a language with widespread adoption, especially outside Japan. If we look at statistics from GitHub (launched 2008) about the growth in the number of public repositories for projects written in Ruby, then we see growth that looks exponential.

RubyGems is Ruby’s package management system, with all public gems hosted by rubygems.org. There are currently 55,075 public gems. Looking at Rails itself, the current version (released 18 March 2013) has (as of 17 April 2013) been installed 380,730 times.

What does this tell us? Apart from the obvious – there’s a lot happening – it’s fairly clear that the buzz about web development technologies 10 years ago translated into sustained and impressive development of an open-source ecosystem. Contrast that with the buzz around XML technologies, say, 15 years ago, and there’s no real comparison. The real question is why.

A parallel story

Back in the early 2000s there were several competing open-source relational databases, in addition to Oracle and DB2, where ‘real’ work was done. Of those available open-source DBs, MySQL was the one which became the default choice, quickly almost entirely displacing its open-source competitors (and Oracle) from general web development. You’d be hard pressed to find anyone who actually understands relational database implementation who’ll say that MySQL was the best technology, and you’ll find many who’ll say it was pretty bad in the early days. Yet, in the battle of MySQL vs. everything else, MySQL destroyed the competition. Why?

There are two critical factors in getting a developer to adopt a technology like a relational DB.

  1. Is it straightforward to install?

  2. Is it easy to use?

MySQL’s great weapon was that it was trivial to install on almost any system with a C compiler, and it soon had client bindings for almost all languages.

If you wanted to use a relational DB, you could either spend an age figuring out how to satisfy the dependencies and configuration requirements for a competing RDBMS or you could spend 5 minutes installing MySQL.

It was a SQL database, so writing queries for it was easy. It had native bindings for your language, so it was easy to integrate. It was quick and worked well enough for the 80% case that you didn’t immediately notice its flaws.

In short, it was a database for casual users.

A true story of pain and bewilderment: Validating XML

Recently, I have been experimenting with a web service which has a REST API with an XML serialisation format, and which provides XML Schema grammars for the various endpoints. I was writing a client in Ruby and wanted to validate the XML I generated as part of my automated tests. How could I go about that?

My first thought was to shell out to a command line utility and pass or fail tests based on the return value, in classic Unix style.
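The shell-out approach can be sketched as follows, as an illustration only. Python is used here rather than Ruby so that the sketch needs nothing beyond a standard library. The choice of xmllint with its `--noout` and `--schema` options is purely a placeholder for "some command-line validator", and the helper names `passes` and `validator_command` are invented for this sketch.

```python
import subprocess
import sys


def passes(command):
    """Run an external command; return True when it exits with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the tool's diagnostics so a failing test can be debugged.
        sys.stderr.write(result.stderr)
    return result.returncode == 0


def validator_command(schema_path, document_path):
    """Build the argument list for the validator.

    Assumption: xmllint stands in for whichever validator is available;
    any tool that signals failure with a non-zero exit status fits here.
    """
    return ["xmllint", "--noout", "--schema", schema_path, document_path]
```

In a test suite this would be used as something like `assert passes(validator_command("endpoint.xsd", "request.xml"))`, where both file names are hypothetical. The whole approach stands or falls on finding a validator that can be trusted.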

So. LibXML’s XML Schema support is incomplete, a polite way of saying dangerous and broken.

Saxon HE doesn’t include schema validation (which is fine, but meant I couldn’t use it).

Xerces’ command-line utilities (C++ or Java) are really hard to figure out (and the JVM startup tax is really hefty when shelling out dozens of times in an automated test suite). (Norm Walsh released a wrapper that does schema validation, but I didn’t find it until researching for this paper, and I know Norm.)

Having tried to do this a few times in the past, it’s at this point that I usually give up, because nothing has changed since the last time I went looking.

This time, I realised that a Ruby XML/HTML library (Nokogiri, more on this later) wrapped Xerces-J under JRuby, and Xerces-J’s XML Schema implementation works. So, now I have XML Schema validation integrated into my test suite, but only when it runs under JRuby. Under MRI (the standard Ruby implementation) LibXML 2’s broken Schema implementation explodes.

XML Schema is a technology that’s been a TR since 2001. It seems absurd that, in 2013, its simplest task – validation – requires so many hoops to be jumped through in order to integrate it into a sensible, modern, web development workflow.

Hpricot, Nokogiri, and getting things done

One of the side-effects of the HTML/XHTML kerfuffle was that, by and large, tools for dealing with HTML (without resorting to regexes) were tools for dealing with XML, at least in Python and Ruby. Partly as a legacy of its SGML roots, but mostly because humans are incredibly good at being incredibly bad at things, vast swathes of HTML content wasn’t (still isn’t) even well-formed HTML, let alone XML. Even more content isn’t valid HTML.

XML’s default error handling (terror, immediate exit) makes it extremely problematic to use with HTML, and HTML constitutes the largest body of markup on the internet.
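The difference in error handling is easy to see in a small experiment. The sketch below uses Python's standard-library parsers as stand-ins (not the libraries discussed in this paper) and feeds both the same scrap of invented tag soup. The names `TagCollector`, `parse_as_xml` and `parse_as_html` are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical real-world HTML: unquoted attribute, unclosed <p> and <br>.
TAG_SOUP = "<div class=note><p>First<br><p>Second</div>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_as_xml(markup):
    """Return None on success, or the XML parser's error message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        # The XML parser stops at the first well-formedness error.
        return str(error)
    return None


def parse_as_html(markup):
    """Return the start tags seen; tag soup does not raise an error."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

Here `parse_as_xml(TAG_SOUP)` returns an error message and yields no document at all, while `parse_as_html(TAG_SOUP)` carries on and reports the `div`, `p`, `br` and `p` start tags.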

Python’s Beautiful Soup and lxml, and Ruby’s Hpricot, provided tools for coping with HTML. They set the DOM aside in favour of search interfaces based on idiomatic constructs and even CSS selectors.

Hpricot was effectively superseded by Nokogiri, which wraps LibXML 2 in an API based on Hpricot’s. Before it was installed by default on Mac OS X, Nokogiri was the sole reason that a lot of people installed LibXML 2.

These libraries are widely used, and successful precisely because of their pragmatic and idiomatic approach.

Language bindings matter, and why LibXML won

If MySQL won because any fool with a C compiler and Make could get it running, LibXML 2 won because it was actually fast enough to use, and had language bindings for almost everything. That was enough for it to creep onto almost every developer’s system as part of the base OS install (a side-effect of the XML-as-data-storage fad which meant that core OS systems needed to read and write XML). (Mac OS X did this in 2007, the last holdout.) Even before then, the pain of installing LibXML 2 + language bindings (not inconsiderable if you were installing from source) was far outweighed by the orders-of-magnitude better performance than most other libraries outside the Java world.

LibXML 2 / LibXSLT 1 has what amounts to a frozen feature set, though: no XSLT 2, no XPath 2, no XQuery, incomplete XML Schema support. Its ubiquity and competence far outweigh its restrictions for most people, but that means that very few will ever explore beyond XSLT 1 in the way that its introduction allowed many people to explore beyond simple document parsing.

The standard HTML & XML processing libraries for almost all dynamic languages are wrappers around LibXML 2.

Next steps

What can we do about this state of affairs? Is there a way to continue to advance the state of the art whilst also making it easy for new developers to jump in?

The short answer is that there needs to be. The longer answer, I think, draws on what we’ve learnt from MySQL and LibXML.

There’s no excuse for being inaccessible

Java seems to be where current open-source markup technology development is taking place. If you’re not a Java developer it’s often a pain to get started. If you are a Java developer it’s often tedious to do common, trivial, tasks.

There’s no excuse for being inaccessible to non-Java developers, and there’s no excuse for tedium. Let’s take Jenkins as an example. Jenkins is a continuous integration server written in Java. There are many CI servers written in many languages. Jenkins is beating them all because it’s trivial to get started with it. It bundles a simple Java app server in its .war, meaning that, if you want, you can get started with nothing more complex than downloading the .war, and running it with java -jar jenkins.war. There’s nothing else to it, and that built-in server is enough for almost everyone.

Saxon can be used from the command line, but it doesn’t have a dedicated utility, which means there’s no man page, and no simple tab-completion for a half-remembered command name. xmllint and xsltproc, the utilities shipped with LibXML 2, are so useful because they are standard command line utilities: they are invoked with a single-word command, they have man pages, they aren’t dependent on CLASSPATH or on remembering where you put saxon9he.jar.

(If Saxon is PostgreSQL, technically superior in almost every way, then LibXML 2 is MySQL. LibXML 2 is utterly ubiquitous: it’s on your phone.)

Language bindings are important

The popularity of the JVM as a host for implementations of popular dynamic languages (JRuby, Jython) and new languages (Scala, Clojure) means that even Java-native libraries like Saxon can be made obvious and easy to use for non-Java developers. Nokogiri uses Xerces as its parser and XSLT engine under JRuby, which means that I use JRuby + Nokogiri to validate documents against XML Schema (although, obviously, I’m limited to XML Schema 1.0). Why not have idiomatic Saxon bindings for other JVM host languages?

There’s more to life than DOM

Hpricot and Nokogiri made it fun and easy to work with complex HTML and XML. Their shelving of the DOM API in favour of idiomatic Ruby made many common tasks vastly easier than with DOM. That led in turn to Nokogiri’s near-total dominance of XML handling in Ruby.
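To give a flavour of the difference, here is a rough analogue using only Python's standard library. It is not Nokogiri's API: `xml.dom.minidom` plays the part of the W3C DOM, and ElementTree's path expressions stand in for an idiomatic search interface. The sample document and function names are invented for the illustration.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Invented sample data: three entries, two of them published.
DOC = """<feed>
  <entry status="published"><title>First</title></entry>
  <entry status="draft"><title>Second</title></entry>
  <entry status="published"><title>Third</title></entry>
</feed>"""


def titles_with_dom(source):
    """DOM style: walk node lists and unwrap text nodes by hand."""
    document = xml.dom.minidom.parseString(source)
    titles = []
    for entry in document.getElementsByTagName("entry"):
        if entry.getAttribute("status") != "published":
            continue
        title = entry.getElementsByTagName("title")[0]
        titles.append(title.firstChild.data)
    return titles


def titles_idiomatic(source):
    """Path-expression style: one query and a list comprehension."""
    root = ET.fromstring(source)
    return [t.text for t in root.findall("./entry[@status='published']/title")]
```

Both functions return the titles of the published entries. The second says what is wanted in a single expression, which is the quality that made the Ruby libraries pleasant to use.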

Imagine if XQuery were opened up in that way. Imagine if the DSDL validation pipeline were made trivial to use. There are lots of XML-clad web APIs out there. Imagine how much better documented they’d be with, say, RELAX NG + Schematron that anyone could use with trivial ease.

Matt Patterson

Matt Patterson is a web developer who has worked with HTML and XML for over 10 years. He lives in Berlin, Germany.