Introduction

One of the original hopes for XML and its family of technologies was that it would be the markup infrastructure for the Web. This goal was notably materialised in the suite of XHTML specifications, in SVG and MathML, as well as in the use cases considered for a number of XML technologies such as XSLT, XSL-FO, XLink, and many others.

This dream, however, has failed. XML is without a doubt a very successful set of technologies and benefits from a healthy community and powerful tooling. It is nevertheless close to absent from Web content.

The failure of this dream, and the way it was brought about, has created a lot of animosity in the broader markup community. XML aficionados feel they have been cheated out of their future by HTML, browser vendors, and whoever is even remotely associated with today’s Web. On the other side, HTML people feel they had to fight XML bitterly in W3C and in their daily jobs in order for Web technology to have the properties they felt were needed for its success.

In the aftermath of this dispute, the two communities are largely estranged from one another. XML heads turn up their noses at JavaScript and at HTML parsing; HTML people fashionably disparage XML in much the same way that one makes fun of Java.

It is this paper’s position that this attitude is hurting both. While it seems unlikely — and not even desirable — that some form of grand merge of XML and HTML would take place, there is nevertheless value in opening up a bidirectional discussion between the two communities so that they may learn from one another’s tools and ideas, mending fences as it were so that each side stops throwing away babies with bathwater.

Myths of the Markup Community

Without getting into excessive detail about myths, rumours, and hearsay, it is useful to take a quick look at the myths that the XML and HTML communities entertain about one another. If nothing else, it tells us where each is coming from and can help avoid clichés as well as map out places of genuine contention.

To the HTML crowd, XML is essentially overly strict, full of overly complicated solutions that are perceived to be either enterprise-like (invented to give Java a reason to exist) or completely academic and impractical.

There is no doubt that technologies such as XML Schema, not to mention the whole SOAP stack, have a lot to do with this perception. But as we will see below, the famed Desperate JavaScript Hacker (who in today’s world has come to replace the Desperate Perl Hacker) does not have a monopoly on useful technologies, and some tools that may be pitched in a manner reminiscent of enterprise-speak — likely because that is where they can be sold today — can have value in many other contexts.

Conversely, to XML people HTML is messy, hackish, cannot be parsed or processed reliably, isn’t extensible, and is plagued with tools designed by and for amateurs, chief amongst which stands JavaScript, often considered to be a toy language.

Again, there are genuine issues that brought about this perception. For the longest time, HTML was indeed impossible to parse properly, and it is only now acquiring extensibility. The fact that its tooling is accessible to beginners — a strength — entails that there are many beginners dabbling in it. But the sheer scale of the Web, and its demonstrated ability to deliver complex, major, highly successful projects built with the massive creativity of its communities of developers, should give some indication that it may not be entirely stupid and unreliable.

Covering all of the ways in which one community could inform the other would require more space than there is here. We can, however, look at a few examples in the hope that it will whet the readers’ curiosity.

How JavaScript Is Saving the Document

There are few communities in which JavaScript is more reviled than amongst document lovers. By its very dynamic and interactive nature, it is seen as single-handedly destroying a document’s meaning and any hope of processing its data outside the visual, interactive mediation of the browser.

In many cases, and for a large class of documents, that has indeed been true. Since any HTML page may be a document just as well as an application, and since the difference between the two is blurry at best, one cannot in general process HTML in a meaningful manner outside of the browser.

This can be contrasted with what was a large part of the vision of XML on the Web. The idea, at least that of many, was that one could produce purely semantic content in the form of an XML document, and then attach to it some XSLT that would transform it for clients that required such transformations. This was seen as providing a separation of concerns superior to that afforded by HTML+CSS since one can easily introduce elements more meaningful than, say, div. (How the actual semantics of a given arbitrary markup language were supposed to be conveyed to users, tools, or the accessibility layer was, however, largely swept under the rug as a problem to be solved at a later date.)

It may therefore come as a shock to some that, today, JavaScript is paving the way for a return of that very usage.

Single Page Applications (SPAs) are Web applications in which all of the resources that define a page (the HTML chrome, JavaScript, CSS, etc.) are loaded once and then used as a shell inside of which content can change. A well-known advantage of SPAs is that, after the initial load, only content needs to travel over the network rather than entire pages, which provides users with a better, snappier experience. They also avoid having to deal with application logic that is split between client and server, and are therefore very much desirable for developers — even in cases where the application is largely content-oriented (e.g. a blog). Note that, contrary to still-popular belief, SPAs can be made URL-friendly through use of the History API, as the sketch below illustrates.
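
By way of illustration, here is a minimal sketch; navigate() and render() are placeholder names standing in for the application’s own routing and view code, not part of any particular framework:

    // Minimal sketch: keeping an SPA URL-friendly with the History API.
    // render() stands in for whatever draws the view for a given path.
    function navigate(path) {
        history.pushState({ path: path }, "", path); // update the address bar
        render(path);
    }
    // Redraw the right view when the user presses Back or Forward.
    window.addEventListener("popstate", function (ev) {
        render(ev.state ? ev.state.path : location.pathname);
    });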

To date, SPAs have mostly been used in the production of very application-oriented content. The reason is that the robotic crawlers used by search engines have so far been unable to process them, and no one wants their content kept away from search engines. This situation is changing: crawlers are increasingly able to process JavaScript-heavy pages, and do so. This opens the door to far broader deployment of SPAs, and since they are often more convenient for developers the odds are strong that they will come to dominate even content-based sites.

This essentially brings about the content/transformation distinction that XSLT and XML were aiming for on the Web. One can maintain (and make available) a set of “pure” documents that the SPA then renders on the client. The content may be simplified, semantic, possibly enhanced HTML that adequately captures the intended meaning (and is happily devoid of all the navigation and useless paraphernalia that most pages would normally contain). It can also be, for more data-oriented content, JSON. And naturally, if desired, it can be XML. In all cases, XSLT cannot be expected to be natively available for transformation, but there exist many solutions that can be picked based on what best matches one’s needs. (The author routinely uses jQuery as a transformation language precisely for this sort of task.) If one prefers XML, there are even XSLT and XQuery libraries available — it becomes a matter of treating the browser as a mere VM and deploying whichever technology one prefers.
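
To make that concrete, here is a minimal sketch in the jQuery style just alluded to; the /posts/42.json resource and its title and bodyHTML fields are hypothetical:

    // Minimal sketch: fetch a "pure" content document and render it
    // inside the SPA shell. The URL and field names are made up.
    $.getJSON("/posts/42.json", function (post) {
        $("<article>")
            .append($("<h1>").text(post.title))
            .append($("<div>").html(post.bodyHTML)) // trusted semantic fragment
            .appendTo("main");
    });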

The attentive reader will note that even if SPAs effectively make the client-side XSLT publishing workflow a reality, they still do not solve the problem of properly conveying arbitrary semantics that was mentioned above. Hopefully, though, in being successful they will make the problem more salient and thereby bring about a solution.

HTML in an XML Pipeline

HTML was initially supposed to be defined as an application of SGML. But few implementations effectively followed that path, and it quickly grew to be defined solely by implementations mimicking one another’s hacks and bugs, leading to the well-known “tag soup” situation in which HTML was, at best, scrapable through regular expressions.

While that situation prevailed for a long time, it no longer reflects reality. The HTML parsing algorithm has now been fully defined, and is highly interoperable. It certainly has its complex, dirty corners, but those only need to be implemented once. And in many ways, they are no worse than some of the warts found in the likes of XML or SGML.

Today, when applying an HTML parser, you obtain real, usable DOMs that are guaranteed to be interoperably the same across implementations. The HTML DOM even benefits from a mapping to XML known as the “Infoset coercion” rules. As a result, nearly any tool that you can apply to XML can be applied equally well to HTML provided you front your pipeline with an HTML parser. There is no need to even stick to the so-called “polyglot” syntax (which has issues of its own).
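
By way of illustration, here is a minimal in-browser sketch (the markup string is a made-up example) in which XML tooling, here XPath, is applied to a DOM obtained from the HTML parser:

    // Minimal sketch: front the pipeline with an HTML parser, then apply
    // XML-style tooling (here, XPath) directly to the resulting DOM.
    var doc = new DOMParser().parseFromString(
        "<table><tr><td>cell</td></tr></table>", "text/html");
    var result = doc.evaluate("//td", doc, null,
                              XPathResult.FIRST_ORDERED_NODE_TYPE, null);
    console.log(result.singleNodeValue.textContent); // "cell"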

As things stand today, processing a large HTML corpus remains painful. There are full-text indexers, but they rarely afford much flexibility in taking the structure into account. One can naturally parse the HTML and process the DOM, but doing that for every search on a large corpus is of course prohibitively expensive. It is possible to produce ad hoc indices built with such processing, but that forgoes the benefit of arbitrary querying. In other words, there is no such thing as an “HTML database” to match the existing XML databases.

In developing Web standards, we regularly need to look at large HTML corpora to determine whether a given usage is common or how people actually use the technology (for instance, a dump of the front pages of the top million sites). The tool we use for this? Typically: grep.

Yet a lot of data is captured as HTML. Huge corpora contain a humongous amount of information, for instance in tables, that is locked away there.
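
Unlocking it takes very little machinery once the document is parsed. A minimal sketch, assuming a parsed Document in doc (obtained, say, as in the previous sketch):

    // Minimal sketch: pull tabular data out of a parsed HTML document.
    var rows = [];
    var trs = doc.querySelectorAll("table tr");
    for (var i = 0; i < trs.length; i++) {
        var cells = trs[i].querySelectorAll("td, th");
        var row = [];
        for (var j = 0; j < cells.length; j++)
            row.push(cells[j].textContent.trim());
        rows.push(row);
    }
    // rows is now an array of arrays, ready for indexing or querying.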

That is a situation in which something like XQuery could prove extremely useful. There is, in fact, very little that prevents one from loading HTML directly into an XML database and processing it. Yet few do, likely because the “X” in “XQuery” serves as a scarecrow. As Liam Quin recently put it, it may have been better for XQuery to be called something like “Fast Forest”.

Slightly to the side of HTML, but by and large in the same technological bucket, a similar situation applies to JSON. There do exist JSON databases — many of them, actually — but their query abilities are often poor to laughable. Solutions built atop XQuery, such as JSONiq, would without a doubt solve many real problems that people face when managing their JSON data. Yet such is the mutual ignorance between our communities that these tools remain largely unknown.

Another great example of technology built for XML being applied to HTML comes from DeltaXML. Producing meaningful diffs of HTML content is, today, largely a painful exercise, especially if the HTML is irregular, large, and heavily marked up. That problem is largely solved for XML, and HTML could benefit greatly from the solution becoming more widely available.

Extending HTML: Web Components

A strong point of contention between XML and HTML is the notion of extensibility, or more precisely of extensibility carried out by arbitrary third parties with no requirement to work their way through a centralised standard. In other words, “distributed extensibility”.

XML’s solution to distributed extensibility is XML Namespaces. For all that they may be reviled, namespaces do work in bringing distributed extensibility to XML — but only in a limited sense.

Extensibility can happen at many levels: the syntax, the vocabulary, the meaning, the styling, the behaviour… Neither XML nor HTML has an extensible syntax, and there seems to be only limited demand for that. Namespaces deliver vocabulary extensibility: you can create your own vocabulary easily, and if you’re not entirely daft you can do so in a manner that won’t conflict with anyone else’s. However, namespaces stop there. Even without considering the problems inherent in interactive behaviour, just discovering how two vocabularies mixed together need to be processed is an unsolved problem and requires resorting to ad hoc development. This situation is worsened by the fact that some schema languages, most notably XML Schema, don’t even consider XML to be extensible by default.
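
A minimal sketch of the vocabulary side makes the point (the namespace URI and vocabulary are made up): names are safely disambiguated, yet nothing conveys what they mean or how to process them.

    // Minimal sketch: namespaces disambiguate names, but nothing more.
    var NS = "http://example.org/recipes"; // made-up namespace
    var xml = '<r:recipe xmlns:r="' + NS + '">' +
              '<r:title>Soup</r:title></r:recipe>';
    var doc = new DOMParser().parseFromString(xml, "application/xml");
    var title = doc.getElementsByTagNameNS(NS, "title")[0];
    console.log(title.textContent); // "Soup", with no risk of prefix clashes
    // ...but what a "recipe" means, how to render it, or how to process it
    // when mixed with another vocabulary remains entirely ad hoc.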

HTML does not have a real solution at the vocabulary level (unless you count prefixing your elements in a global namespace as a solution). It does, however, have an approach from the other end of the spectrum: Web Components.

There is not enough space in this paper to provide a full-fledged introduction to Web Components (for more on the topic, I recommend the webcomponents.org website) but the part that is essential for this discussion should be easy to grasp without a full understanding of the technology.

Essentially, the point at which HTML behaviour is integrated into a browser engine is the HTMLElement interface. That is what the common APIs hang off of, where CSS applies, where integration with the class and ID system happens, and much more. What Web Components enable is essentially for developers to create their own arbitrary elements by subclassing HTMLElement and providing their own implementation, injected into the runtime.

Once that has been done, the new element is treated just like any other built-in element. What’s more, thanks to the concept of shadow trees (essentially subtrees of the DOM that can hide, recursively, behind regular DOM nodes) it is possible to intermix Web Component content and regular HTML at will.
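
A minimal sketch, using the current custom elements API; the element name and its rendering are purely illustrative:

    // Minimal sketch: a custom element with its own shadow tree.
    class FancyGreeting extends HTMLElement {
        connectedCallback() {
            // Reuse the shadow root if it already exists, else create one.
            var shadow = this.shadowRoot || this.attachShadow({ mode: "open" });
            shadow.innerHTML =
                "<style>p { color: rebeccapurple; }</style>" +
                "<p>Hello, " + (this.getAttribute("name") || "world") + "!</p>";
        }
    }
    customElements.define("fancy-greeting", FancyGreeting);
    // It is then used in markup like any built-in element:
    //   <fancy-greeting name="XML"></fancy-greeting>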

It is therefore interesting to note that neither XML nor HTML has solved the distributed extensibility problem across the board. Each has solved it from the end that made the most sense for its most common usage. Because of this difference in use cases, depending on where you stand either one may be seen as extensible and the other not, or vice versa.

But this difference of viewpoint need not lead to opposition; rather, it points to complementarity. It is important to know and understand both so that one can rely on either when the need arises rather than shun half of the solution space.

In my XML Prague 2014 paper “Distributed Extensibility: Finally Done Right?” I go so far as to point out how one could transform XML-namespaced content into a syntax friendly to Web Components in order to implement the behaviour of an XML language, and I provide indications as to how the two could be integrated more closely together. Deciding whether that is wise or not is left as an exercise for the reader, but it does point to complementarity rather than opposition.
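
The gist of that transformation can be sketched in a few lines; the prefix-to-hyphen naming convention used here is one possible choice, not a standard:

    // Minimal sketch: map namespaced XML elements onto custom-element
    // names so that Web Components can supply their behaviour.
    function toComponents(node, htmlDoc) {
        if (node.nodeType === Node.TEXT_NODE)
            return htmlDoc.createTextNode(node.data);
        if (node.nodeType !== Node.ELEMENT_NODE) return null;
        // e.g. <r:recipe> becomes <r-recipe>, a valid custom element name
        var name = (node.prefix ? node.prefix + "-" : "x-") + node.localName;
        var el = htmlDoc.createElement(name);
        for (var i = 0; i < node.childNodes.length; i++) {
            var child = toComponents(node.childNodes[i], htmlDoc);
            if (child) el.appendChild(child);
        }
        return el;
    }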

Conclusion

We hope to have shown through this overview that there is value for anyone sitting on one side of the fence to go look at what is going on on the other side, assuming that the “others” may in fact be smart people with somewhat different needs rather than dumb people who just don’t “get it”.

A good example here is SVG. While originally defined in XML, and in fact deeply steeped in XML technology throughout, it struggled for years to reach any decent level of usage. At some point came the realisation that “SVG isn’t about XML, or even syntax, it’s about sassy, sexy, wicked cool graphics that make you go wow.”

Ever since adopting the changes that make it usable equally well in XML and HTML contexts, SVG has undergone a period of blooming and has grown to be a solid part of the Web platform. There are several more such stories waiting to be written.

Robin Berjon

freelance

W3C

Robin Berjon is a freelance consultant carrying out research, prototyping, and standardisation in Web, mobile, and XML technologies. He has worked on both Web and XML standards for over a decade, and is currently trying to herd HTML5 to Recommendation as part of the W3C team. He lives in Paris, France, with his wife, two daughters, and a rather idiotic cat.