How to cite this paper
Berjon, Robin. “Mending Fences and Saving Babies.” Presented at Symposium on HTML5 and XML, Washington, DC, August 4, 2014. In Proceedings of the Symposium on HTML5 and XML. Balisage Series on Markup Technologies, vol. 14 (2014). https://doi.org/10.4242/BalisageVol14.Berjon01.
Symposium on HTML5 and XML
August 4, 2014
Balisage Paper: Mending Fences and Saving Babies
Robin Berjon is a freelance consultant carrying out research, prototyping, and
standardisation in Web, mobile, and XML technologies. He has worked on both Web and XML
standards for over a decade, and is currently trying to herd HTML5 to Recommendation as
part of the W3C team. He lives in Paris, France, with his wife, two daughters, and a
rather idiotic cat.
Copyright © 2014 Robin Berjon
The harshest squabbles are fraternal, and as fraternal squabbles go, the one between
proponents of XML and HTML has been at times quite brutal. This kerfuffle has opened
rifts in what is largely a like-minded community. The differences between XML and HTML are
genuine, especially when considering not just the markup but the full family of technologies
that have grown around them. But do they really justify animosity? Both XML and HTML have
created strong solutions to varied problems often ignored by other angle bracketists. The
many commonalities mean that the XML and HTML communities need not throw away one another’s
babies in a big slosh of bathwater. It is time for a candid conversation about flaws and
limitations, and from there to mend fences.
In this paper we look at some mythical preconceptions that each community has about the
other, go through a number of topics that show where there is value in “looking over the
fence”, and reach what is hopefully a pragmatic conclusion as to what each community
Table of Contents
- Myths of the Markup Community
- HTML in an XML pipeline
- Extending HTML: Web Components
One of the original hopes for XML and its family of technologies was that it would
markup infrastructure for the Web. This goal was notably materialised in the suite
specifications, in SVG and MathML, as well as in the use cases considered for a number of
technologies such as XSLT, XSL-FO, XLink, and many others.
This dream, however, has failed. XML is without a doubt a very successful set of technologies
and benefits from a healthy community and powerful tooling. It is nevertheless close
from Web content.
The failure of this dream, and the way it was brought about, has created a lot of
the broader markup community. XML aficionados feel they have been cheated out of their
by HTML, browser vendors, and whoever is even remotely associated with today’s Web. On the
other side, HTML people feel they had to fight XML bitterly in W3C and in their daily
order for Web technology to have the properties they felt were needed for its success.
In the aftermath of this dispute, the two communities are largely estranged from one
disparage XML in much the same way that one makes fun of Java.
It is this paper’s position that this attitude is hurting both. While it seems unlikely
(and not even desirable) that some form of grand merge of XML and HTML would take place, there is
nevertheless value in opening up a bidirectional discussion between the two communities so
that they may learn from one another’s tools and ideas, mending fences as it were so that each
side stops throwing away babies with bathwater.
Myths of the Markup Community
Without getting into excessive details about myths, rumours, and hearsay, it is useful to take
a quick look at the myths that the XML and HTML communities entertain about one another. If
nothing else, it tells us where each is coming from and can help avoid clichés as well as map
out places of genuine contention.
To the HTML crowd, XML is essentially overly strict, full of overly complicated
solutions that are perceived to be either enterprise-like (invented to give Java a
reason to exist) or completely academic and impractical.
There is no doubt that technologies such as XML Schema, not to mention the whole SOAP
have a lot to do with this perception. But as we will see below, the famed Desperate Perl Hacker does
not have a monopoly on useful technologies, and some tools that may be pitched in terms
reminiscent of enterprise-speak (likely because that is where they can be sold today)
have value in many other contexts.
Conversely, to XML people HTML is messy, hackish, cannot be parsed or processed
reliably, isn’t extensible, and is plagued with tools designed by and for amateurs,
Again, there are genuine issues that brought about this perception. For the longest time, HTML
was indeed impossible to parse properly, and it is only now acquiring extensibility. The fact
that its tooling is accessible to beginners (a strength) entails that there are
beginners dabbling in it. But the sheer scale of the Web and its obvious ability to
complex, major, highly successful projects put together with the massive creativity
communities of developers should give indication that it may not be entirely stupid
Covering all of the ways in which one community could inform the other would require more
space than there is here. We can, however, look at a few examples in the hope that they will
whet the readers’ curiosity.
By its very dynamical and interactive nature, it is seen to single-handedly destroy
document’s meaning and any hope for the processing of data outside the visual, interactive
mediation of the browser.
In many cases, and for a large class of documents, that has indeed been true. Since an
HTML page may be a document just as well as an application, and since the difference between
the two is blurry at best, one cannot in general process HTML in a meaningful manner outside
of the browser.
This can be contrasted with what was a large part of the vision of XML on the Web. The idea,
at least that of many, was that one could produce purely semantic content in the form of an
XML document, and then attach to it some XSLT that would transform it for clients that
required such transformations. This was seen as providing a separation of concerns better than
that afforded by HTML+CSS since one can easily introduce elements more meaningful than
div. (How the actual semantics of a given arbitrary markup
language were supposed to be conveyed to users, tools, or the accessibility layer was,
however, largely swept under the rug as a problem to be solved at a later date.)
return of that very usage.
Single Page Applications (SPA) are Web applications in which all of the resources
inside of which content can change. A well-known advantage of SPAs is that they massively
increase performance and thereby provide users with a better, snappier experience. They also
avoid having to deal with application logic that is split between client and server, and are
therefore very much desirable for developers, even in cases where the application is
content-oriented (e.g. a blog). Note that, contrary to still-popular belief, SPAs
can be made
URL-friendly through use of the
To date, SPAs have mostly been used in the production of very application-oriented
The reason is that the robotic crawlers used by search engines have so far been
unable to process them, and no one wants their content kept away from search engines.
do so. This opens up the door to far broader deployment of SPAs, and since they are
more convenient for developers the odds are strong that they will come to dominate
This essentially brings about the content/transformation distinction that XSLT and
aiming for on the Web. One can maintain (and make available) a set of “pure” documents that
the SPA then renders on the client. The content may be simplified, semantic, possibly
HTML that adequately captures the intended meaning (and is happily devoid of all the
navigation and useless paraphernalia that most pages would normally contain). It can be,
for more data-oriented content, JSON. And naturally, if desired, it can be XML. In
XSLT cannot be expected to be natively available for transformation, but there exist
solutions that can be picked based on what best matches one’s needs. (The author routinely
uses jQuery as a transformation language precisely for this sort of task.) If one
there are even XSLT and XQuery libraries available — it becomes up to you to use the
just as a VM, and deploy whichever technology you prefer.
The attentive reader will note that even if SPAs effectively make the client-side
publishing workflow a reality, they still do not solve the problem of properly conveying
arbitrary semantics that was mentioned above. Hopefully, though, in being successful they will
make the problem more salient and thereby bring about a solution.
HTML in an XML pipeline
HTML was initially supposed to be defined as an application of SGML. But few implementations
effectively followed that path, and it quickly grew to be defined solely as a set
and bugs mimicked from others’ bugs, leading to the well-known “tag soup” situation that
essentially made it scrapable at best only through regular expressions.
While that situation prevailed for a long time, it no longer reflects reality. The HTML
parsing algorithm has now been fully defined, and is highly interoperable. It certainly has
its complex, dirty corners, but those only need to be implemented once. And in many ways they
are no worse than some of the warts found in the likes of XML or SGML.
Today, when applying an HTML parser, you obtain real, usable DOMs that are guaranteed to be
interoperably the same across implementations. The HTML DOM even benefits from a mapping to
XML known as the “Infoset coercion” rules. As a result, largely any tool that you can apply to
XML can be applied equally well to HTML provided you front your pipeline with an HTML parser.
No need to even stick to the so-called “polyglot” syntax (which has issues of its own).
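The idea of fronting an XML pipeline with an HTML parser can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's forgiving `html.parser` tokenizer, which does not implement the standard HTML parsing algorithm (implied end tags, for instance, are not handled), so a real pipeline would substitute a conforming parser. The `TreeParser` class, the `root` wrapper element, and the sample markup are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take an end tag in HTML
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeParser(HTMLParser):
    """Builds an ElementTree from sloppy (but properly nested) HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Boolean attributes arrive with a value of None; store them as ""
        el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text following a child element is that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(source: str) -> ET.Element:
    parser = TreeParser()
    parser.feed(source)
    parser.close()
    return parser.root


def table_rows(root: ET.Element, css_class: str):
    """Return the rows of every table with the given class as lists of strings."""
    rows = []
    for table in root.findall(f".//table[@class='{css_class}']"):
        for tr in table.iter("tr"):
            rows.append(["".join(cell.itertext()).strip()
                         for cell in tr if cell.tag in ("td", "th")])
    return rows


# Uppercase tags, an unquoted attribute, and an unclosed <br>: not XML
SAMPLE = """
<TABLE class=data>
  <tr><th>Format</th><th>Year</th></tr>
  <tr><td>HTML<br></td><td>1993</td></tr>
  <tr><td>XML &amp; friends</td><td>1998</td></tr>
</TABLE>
"""

tree = parse_html(SAMPLE)
rows = table_rows(tree, "data")
as_xml = ET.tostring(tree, encoding="unicode")  # well-formed XML serialisation
print(rows)
```

Once the tree exists, everything downstream is ordinary XML tooling: the path query in `table_rows` and the serialisation in `as_xml` know nothing about the input having been HTML.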
As things stand today, processing a large HTML corpus remains painful. There are full-text
indexers, but they rarely afford much flexibility in taking the structure into account. One
can naturally parse the HTML and process the DOM, but doing that for every search on a large
corpus is of course prohibitively expensive. It is possible to produce ad hoc indices
with such processing, but that removes the benefits of arbitrary querying. In other words,
there is no such thing as an “HTML database” to match the existing XML databases.
In developing Web standards, we regularly need to look at large HTML corpora to determine
whether a given usage is common or how people actually use the technology (for instance, a
dump of the front pages of the top million sites). The tool we use for this? Typically:
Yet a lot of data is captured as HTML. Huge corpora contain a humongous amount of data,
for instance in tables, that is being locked there.
That’s a situation for which something like XQuery could prove itself extremely useful. There
is, in fact, very little that prevents one from loading HTML directly into an XML database
and processing it. Yet few do it, likely because of the “X” in “XQuery”, serving as a
scarecrow. As Liam Quin recently put it, it may have been better for XQuery to be called
something like “Fast Forest”.
Slightly to the side of HTML, but by and large in the same technological bucket, a similar
situation applies to JSON. There do exist JSON databases (many of them, actually) but their
query abilities are often poor to laughable. Solutions built atop XQuery, such as
would without a doubt solve many real problems that people are facing when managing
JSON data. Yet the mutual ignorance is such between our communities that such tools
Another great example of technology built for XML being applied to HTML comes from
Producing meaningful diffs of HTML content is, today, largely a painful situation, especially
if the HTML is irregular, large, and heavily marked up. That is a problem largely solved for
XML, and HTML could benefit greatly from it becoming more available.
Extending HTML: Web Components
A strong point of contention between XML and HTML is the notion of extensibility, more
precisely of extensibility carried out by arbitrary third-parties with no requirement to
work their way through a centralised standard. In other words, “distributed extensibility”.
XML’s solution to distributed extensibility is XML Namespaces. For all that they may be
reviled, namespaces do work in bringing distributed extensibility to XML — but only
Extensibility can happen at many levels: the syntax, the vocabulary, the meaning,
the behaviour… Neither XML nor HTML has extensible syntax and there seems to be only limited
demand for that. Namespaces deliver vocabulary extensibility: you can create your own
vocabulary easily, and if you’re not entirely daft you can do so in a manner that does not
conflict with anyone else. However, namespaces stop there. Even without considering the
problems inherent in interactive behaviour, just discovering how two vocabularies used
together need to be processed is an unsolved problem and requires resorting to ad hoc
development. This situation is worsened by the fact that some schema languages, most notably
XML Schema, don’t even consider XML to be extensible by default.
HTML does not have a real solution at the vocabulary level (unless you count prefixing
elements in a global namespace a solution). It does, however, have an approach from the other
end of the spectrum: Web Components.
There is not enough space in this paper to provide a full-fledged introduction to Web
Components (for more on the topic, I recommend the webcomponents.org website) but the part that
is essential for this discussion should be easy to grasp without a full understanding
Essentially, the point at which HTML behaviour is integrated into a browser engine is the
HTMLElement interface. That is where the common APIs hang off of, where CSS
applies, where integration with the class and ID system happens, and much more. What Web
Components enable is essentially for developers to create their own arbitrary elements by extending
HTMLElement and providing their own implementation, injected into the
Once that has been done, the new element is treated just like any other built-in element.
What’s more, thanks to the concept of shadow trees (essentially subtrees of the DOM that can
be hiding recursively behind regular DOM nodes) it is possible to intermix Web Component
content and regular HTML at will.
It is therefore interesting to note that neither XML nor HTML has solved the distributed
extensibility problem across the board. Each has solved it from the end that made the most sense
to its more common usage. Because of a difference in use cases, depending on where one stands
either one may be seen as extensible and the other not, or vice-versa.
But this difference of viewpoint leads not necessarily to an opposition but rather to a
complementarity. It is important to know and to understand both so that one may be able to
rely on either when the applicable need arises rather than shun half of the solution
In my XML Prague 2014 paper “Distributed Extensibility: Finally Done Right?” I go so far as
to point out how one could transform
XML-namespaced content into a syntax friendly to Web Components in order to implement the
behaviour of an XML language, and provide indications as to how the two could be integrated
more closely together. Deciding whether that is wise or not is left as an exercise to the
reader, but it does point to complementarity rather than opposition.
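To make the idea of such a transformation concrete, here is a minimal Python sketch of one possible mapping. It is not claimed to be the mapping used in the paper cited above: it simply rewrites each namespaced element name into a hyphenated `prefix-local` name, relying on the fact that custom element names must contain a hyphen. The namespace URI, the `cook` prefix, the `PREFIXES` table, and the sample document are all hypothetical, and namespaced attributes are deliberately left out of scope.

```python
import xml.etree.ElementTree as ET

# Hypothetical table mapping namespace URIs to custom element prefixes
PREFIXES = {"http://example.org/ns/recipe": "cook"}


def to_custom_elements(el: ET.Element) -> ET.Element:
    """Return a copy of `el` in which {uri}local becomes prefix-local.

    Elements in unmapped namespaces (XHTML here) keep their bare local name.
    """
    tag = el.tag
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        prefix = PREFIXES.get(uri)
        # Custom element names are lowercase and must contain a hyphen
        tag = f"{prefix}-{local}".lower() if prefix else local
    out = ET.Element(tag, dict(el.attrib))
    out.text, out.tail = el.text, el.tail
    for child in el:
        out.append(to_custom_elements(child))
    return out


SOURCE = """<r:recipe xmlns:r="http://example.org/ns/recipe"
          xmlns="http://www.w3.org/1999/xhtml">
  <r:title>Bathwater soup</r:title>
  <p>Keep the <r:ingredient>baby</r:ingredient>.</p>
</r:recipe>"""

result = to_custom_elements(ET.fromstring(SOURCE))
markup = ET.tostring(result, encoding="unicode")
print(markup)
```

The output carries no namespace declarations at all, so it could be handed to an HTML parser, where elements such as `cook-recipe` would be candidates for a Web Component implementation supplying the behaviour.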
We hope to have shown through this overview that there is value for anyone sitting on one side
of the fence to go look at what is going on on the other side, assuming that the “others” may
in fact be smart people with somewhat different needs rather than dumb people who
A good example here is SVG. While originally defined in XML, and in fact deeply steeped in XML
technology throughout, it struggled for years to reach any decent level of usage. At some
point came the realisation that “SVG isn’t about
XML, or even syntax, it’s about sassy, sexy, wicked cool graphics that make you go
Ever since adopting the changes that make it usable equally well in XML and HTML contexts,
SVG has undergone a period of blooming and has grown to be a solid part of the Web
There are several more such stories waiting to be written.