Where did all the document kids go?
Open-source, markup, and the casual developer
Balisage: The Markup Conference 2013
August 6 - 9, 2013
It’s a truism that things were different when I was a lad, but with open-source implementations
of markup technologies there seems to be truth to it.
When I began to develop with XML technologies in the very early 2000s there was a
slew of toolkits and implementations for XML parsers, multiple DOM-like implementations
outside web browsers, XSLT implementations, etc. (almost ad nauseam).
Now, GNOME’s LibXML seems to be the de facto standard implementation, used with many
(non-Java) language bindings; there’s only one open-source XSLT 2 implementation; and
technologies like XQuery seem to be restricted to specialist use and implementations
on top of XML databases.
What happened? Open-source technologies in related areas including document-like data
on in leaps and bounds. The XML technology space seems to have contracted and stagnated,
at least to a casual observer.
Why should you care about this? Will it affect you? And if you do care, what can you
The greener grass
For comparison, let’s take a look at open-source web development, and what’s happened
to that in the last 10 years. Web development is a good comparison point because it
is, in essence, in the business of markup production.
If we go back to 2003 we’re in a world where Perl and variants of the CGI model are
dominant, and PHP (a variant of the CGI model itself) is fast rising. Python is heavily
used by self-respecting developers (Google being the poster child for Python at this
time). MySQL has already won and taken its place as the default database backing the
web. The common thread is that there are very few frameworks, as we would understand
them now. The frameworks that are there seem to largely be restricted to proprietary
toolkits like WebObjects and J2EE. The open-source web development world is making
do with CGI, templating libraries, and SQL. A uniform API to connect to different
SQL databases and issue queries is pretty much the height of sophistication.
In the XML world, we have LibXML 2 fast emerging, but still not installed by default
on pretty much every computer system (as it is now). Saxon and Xerces/Xalan are the
July 2004: Ruby on Rails
Rails was the first proper full-stack web framework to get significant traction and
adoption. Django, the Python framework which most closely matches it in terms of scope,
was first released in July 2005.
The releases of Rails and Django are not unique, but they are significant enough to
stand in for the changes in backend web development as a whole.
What we see as they pick up speed is an explosion of libraries and plugins and the
emergence of ecosystems surrounding them. Ruby makes for a nice subject here. Rails
was largely responsible for it becoming a language with widespread adoption, especially
outside Japan. If we look at statistics from GitHub (launched 2008) about the growth
in number of public repositories for projects written in Ruby, then we see growth
that looks exponential.
Rubygems is Ruby’s package management system, with all public gems hosted by rubygems.org.
There are currently 55,075 public gems. Looking at Rails itself, the current version
(released 18 March 2013) has (as of 17 April 2013) been installed 380,730 times.
What does this tell us? Apart from the obvious – there’s a lot happening – it’s fairly
clear that the buzz about web development technologies 10 years ago translated into
sustained and impressive development of an open-source ecosystem. Contrast that with
the buzz around XML technologies, say, 15 years ago, and there’s no real comparison.
The real question is why.
A parallel story
Back in the early 2000s there were several competing open-source relational databases,
in addition to Oracle and DB2, where ‘real’ work was done. Of those available open-source
DBs, MySQL was the one which became the default choice, quickly almost entirely displacing
its open-source competitors (and Oracle) from general web development. You’d be hard
pressed to find anyone who actually understands relational database implementation
who’ll say that MySQL was the best technology, and many who’ll say it was pretty bad
in the early days. Yet, in the battle of MySQL vs. everything else, MySQL destroyed
the competition. Why?
There are two critical factors in getting a developer to adopt a technology like a
Is it straightforward to install?
Is it easy to use?
MySQL’s great weapon was that it was trivial to install on almost any system with
a C compiler, and it soon had client bindings for almost all languages.
If you wanted to use a relational DB, you could either spend an age figuring out how
to satisfy the dependencies and configuration requirements for a competing RDBMS or
you could spend 5 minutes installing MySQL.
It was a SQL database, so writing queries for it was easy. It had native bindings
for your language, so it was easy to integrate. It was quick and worked well enough
for the 80% case that you didn’t immediately notice its flaws.
In short, it was a database for casual users.
A true story of pain and bewilderment: Validating XML
Recently, I have been experimenting with a web service which has a ReST API with an
XML serialisation format, and which provides XML Schema grammars for the various endpoints.
Writing a client in Ruby, and wanting to validate XML I generated as part of my automated
tests, how could I go about that?
My first thought was to shell out to a command line utility and pass or fail tests
based on the return value, in classic Unix style.
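The shell-out idea can be sketched in a few lines. This is a minimal illustration in Python rather than the Ruby of my test suite, and it assumes only that the validator signals failure through a non-zero exit status. The helper name `passes` and the file names in the docstring are hypothetical.

```python
import subprocess
import sys


def passes(command):
    """Run an external validator and treat exit status 0 as success.

    `command` is an argument list, for example (hypothetical file names):
    ["xmllint", "--noout", "--schema", "endpoint.xsd", "request.xml"]
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


# Stand-in commands so the sketch runs without any XML tooling installed:
# a child Python process that exits with a chosen status.
succeeding = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]
print(passes(succeeding), passes(failing))
```

A test then only has to assert on the boolean. The cost of this approach is one process launch per document, which matters if the validator is slow to start.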
So. LibXML’s support is incomplete, a polite way of saying dangerous and broken.
Saxon HE doesn’t include it (which is fine, but meant I couldn’t use it).
Xerces’ command-line utilities (C++ or Java) are really hard to figure out (and the
JVM startup tax is really hefty when shelling out dozens of times in an automated
test suite). (Norm Walsh released a wrapper that does schema validation, but I didn’t
find it until researching for this paper, and I know Norm.)
Having tried to do this a few times in the past, it’s at this point that I usually
give up, because nothing has changed since the last time I went looking.
This time, I realised that a Ruby XML/HTML library (Nokogiri, more on this later)
wrapped Xerces-J under JRuby, and Xerces-J’s XML Schema implementation works. So,
now I have XML Schema validation integrated into my test suite, but only when it runs
under JRuby. Under MRI (the standard Ruby implementation) LibXML 2’s broken Schema
XML Schema is a technology that’s been a TR since 2001. It seems absurd that, in 2013,
its simplest task (validation) requires so many hoops to be jumped through in order to
integrate it into a sensible, modern, web development workflow.
Hpricot, Nokogiri, and getting things done
One of the side-effects of the HTML/XHTML kerfuffle was that, by and large, tools
for dealing with HTML (without resorting to regexes) were tools for dealing with XML,
at least in Python and Ruby. Partly as a legacy of its SGML roots, but mostly because
humans are incredibly good at being incredibly bad at things, vast swathes of HTML
content weren’t (and still aren’t) even well-formed HTML, let alone XML. Even more content
isn’t valid HTML.
XML’s default error handling (terror, immediate exit) makes it extremely problematic
to use with HTML, and HTML constitutes the largest body of markup on the internet.
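The difference in error handling is easy to demonstrate. The sketch below uses only Python's standard library (`xml.etree.ElementTree` as the strict XML parser, `html.parser` as a forgiving HTML tokenizer); the sample markup and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical tag soup: nothing is closed, yet browsers render it happily.
TAG_SOUP = "<p>First paragraph<br><p>Second <b>paragraph"


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # The XML parser refuses the whole document at the first error.
        return False


class TagCollector(HTMLParser):
    """Records every start tag seen, carrying on past unclosed elements."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(parses_as_xml(TAG_SOUP))  # the XML parser rejects the input
print(start_tags(TAG_SOUP))     # the HTML tokenizer still reports the tags
```

The XML parser yields nothing usable from the tag soup, while the HTML tokenizer reports every start tag it encounters.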
Python’s Beautiful Soup & lxml, and Ruby’s Hpricot provided tools for coping with
HTML. They ignored the DOM for search interfaces based on idiomatic constructs and
Hpricot was effectively superseded by Nokogiri, which wraps LibXML 2 in an API based
on Hpricot’s. Before it was installed by default on Mac OS X, Nokogiri was the sole
reason that a lot of people installed LibXML 2.
These libraries are widely used, and successful precisely because of their pragmatic
and idiomatic approach.
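To give a flavour of the contrast, here is the same query written twice with Python's standard library: once against a DOM (`xml.dom.minidom`) and once with a path-based search (`xml.etree.ElementTree`). These are stand-ins chosen so the sketch is self-contained; they are not the APIs of Hpricot, Nokogiri, or Beautiful Soup, and the sample document is invented.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

DOC = """<feed>
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>"""


def titles_with_dom(source):
    """Collect entry titles by walking DOM nodes explicitly."""
    document = minidom.parseString(source)
    titles = []
    for entry in document.getElementsByTagName("entry"):
        for title in entry.getElementsByTagName("title"):
            # Text lives in child nodes, so it has to be gathered by hand.
            text = "".join(
                node.data
                for node in title.childNodes
                if node.nodeType == node.TEXT_NODE
            )
            titles.append(text)
    return titles


def titles_with_search(source):
    """The same query as a path expression plus a list comprehension."""
    root = ET.fromstring(source)
    return [title.text for title in root.findall("entry/title")]


print(titles_with_dom(DOC))
print(titles_with_search(DOC))
```

Both functions return the same list. The second reads like ordinary code in the host language, which is the quality these libraries traded the DOM for.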
Language bindings matter, and why LibXML won
If MySQL won because any fool with a C compiler and Make could get it running, LibXML
2 won because it was actually fast enough to use, and had language bindings for almost
everything. That was enough for it to creep onto almost every developer’s system as
part of the base OS install (a side-effect of the XML-as-data-storage fad which meant
that core OS systems needed to read and write XML). (Mac OS X did this in 2007, the
last holdout.) Even before then, the pain of installing LibXML 2 + language bindings
(not inconsiderable if you were installing from source) was far outweighed by the
orders-of-magnitude better performance than most other libraries outside the Java
LibXML 2 / LibXSLT 1 have what amounts to a frozen feature set though: no XSLT 2,
no XPath 2, no XQuery, incomplete XML Schema support. Its ubiquity and competence
far outweigh its restrictions for most people, but that means that very few will ever
explore beyond XSLT 1 in the way that its introduction allowed many people to explore
beyond simple document parsing.
The standard HTML & XML processing libraries for almost all dynamic languages are
wrappers around LibXML 2.
What can we do about this state of affairs? Is there a way to continue to advance the
state-of-the-art whilst also making it easy for new developers to jump in?
The short answer is that there needs to be. The longer answer, I think, draws on what
we’ve learnt from MySQL and LibXML.
There’s no excuse for being inaccessible
Java seems to be where current open-source markup technology development is taking
place. If you’re not a Java developer it’s often a pain to get started. If you are
a Java developer it’s often tedious to do common, trivial, tasks.
There’s no excuse for being inaccessible to non-Java developers, and there’s no excuse
for tedium. Let’s take Jenkins as an example. Jenkins is a continuous integration
server written in Java. There are many CI servers written in many languages. Jenkins
is beating them all because it’s trivial to get started with it. It bundles a simple
Java app server in its .war, meaning that, if you want, you can get started with nothing
more complex than downloading the .war, and running it with:
    java -jar jenkins.war
There’s nothing else to it, and that built-in server is enough for almost everyone.
Saxon can be used from the command line, but it doesn’t have a dedicated utility,
which means there’s no man page, and no simple tab-completion for a half-remembered
command name. xmllint and xsltproc, the utilities shipped with LibXML 2, are so useful
because they are standard command line utilities: they are invoked with a single-word
command, they have man pages, they aren’t dependent on CLASSPATH or on remembering
where you put saxon9he.jar.
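A dedicated utility need not be much. As a rough sketch, and not anything Saxon ships, a small wrapper could remember the jar location and expose a single-word command. The `SAXON_JAR` environment variable, the function names, and the usage string are all inventions for this example; the `-s:`, `-xsl:` and `-o:` options are assumed to be those of Saxon's command-line transformer.

```python
import os
import subprocess
import sys


def saxon_command(source, stylesheet, output=None, jar=None):
    """Build the argument list for a Saxon transform.

    The jar path comes from the `jar` argument or, failing that, from a
    SAXON_JAR environment variable (a name invented for this sketch).
    """
    jar = jar or os.environ.get("SAXON_JAR")
    if not jar:
        raise ValueError("set SAXON_JAR or pass jar=...")
    command = ["java", "-jar", jar, f"-s:{source}", f"-xsl:{stylesheet}"]
    if output:
        command.append(f"-o:{output}")
    return command


def main(argv):
    """Entry point for a hypothetical single-word `saxon` command."""
    if len(argv) not in (2, 3):
        print("usage: saxon SOURCE STYLESHEET [OUTPUT]", file=sys.stderr)
        return 2
    return subprocess.run(saxon_command(*argv)).returncode


# Show the command that would be run, without needing Java installed.
print(saxon_command("in.xml", "style.xsl", "out.html", jar="saxon9he.jar"))
```

Installed on the PATH with `main(sys.argv[1:])` as its entry point, something of this shape would give tab-completion and a natural home for a man page.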
(If Saxon is PostgreSQL, technically superior in almost every way, then LibXML 2 is
MySQL. LibXML 2 is utterly ubiquitous: it’s on your phone.)
Language bindings are important
The popularity of the JVM as a host for implementations of popular dynamic languages
(JRuby, Jython) and new languages (Scala, Clojure) means that even Java-native libraries
like Saxon can be made obvious and easy to use for non-Java developers. Nokogiri uses
Xerces as its parser and XSLT engine under JRuby, which means that I use JRuby + Nokogiri
to validate documents against XML Schema (although, obviously, I’m limited to XML
Schema 1.0). Why not have idiomatic Saxon bindings for other JVM host languages?
There’s more to life than DOM.
Hpricot and Nokogiri made it fun and easy to work with complex HTML and XML. Their
shelving of the DOM API in favour of idiomatic Ruby made many common tasks vastly
easier than with DOM. That led in turn to Nokogiri’s near-total dominance of XML handling
Imagine if XQuery were opened up in that way. Imagine if the DSDL validation pipeline
were made trivial to use. There are lots of XML-clad web APIs out there. Imagine how
much better documented they’d be with, say, RelaxNG + Schematron that anyone could
use with trivial ease.