Markup Formats In Context

Liam R. E. Quin

Documents on Paper

When a human wishes to communicate extended ideas with another human not physically present, paper and a pencil can be used. Historically, this mechanism was extended using people trained to copy documents onto new, additional sheets of paper, but this was slow and expensive, and, after only a few thousand years, replaced by the automated printing press.

Paper documents are difficult to revise and cannot easily be searched.

Paper documents are independent of software and with care can be archived indefinitely.

Electronic Paper

Today if someone wants to write something for the benefit of another reader, they can use a word processor and either send computer-printed paper or send electronic files that the recipient can print and read. Those files are sent sometimes as plain text (electronic mail), or for longer documents or documents with more complex formatting, as PDF page images or as word processing files.

Word processing files represent a document complete with formatting but in an editable form, so that text can re-flow as needed. Such files generally make use of system resources such as fonts, so that a document may be differently paginated or, in the case of specialty symbol or language-specific fonts, may be partly or entirely unreadable. For some languages (Northern Cree and some of the scripts or writing systems used in India and in Africa come to mind) it’s customary to write documents using a font with a custom encoding, as Unicode coverage is (or is perceived to be) incomplete or insufficient; this means that if the recipient does not have the right font installed, the document may appear correct but will have some characters in the document silently substituted for others. To be fair this problem exists for all of the document formats discussed in this paper, but some of the formats alleviate the difficulties, or at least let documents be explicit about what was done, more than others.

Word processor documents today use complex and proprietary formats (although increasingly these are represented in XML). This means that they can be difficult to search, although they are usually easy to revise. Later versions of a word processor may interpret older files differently, with or without warning, so that the documents become tied to specific versions of software running in specific operating environments. Because word processor formats are (implicitly or explicitly) tied to specific versions of specific software, as well as to system resources such as fonts, they are not suitable for archival use.

Portable Document Format (PDF), a document format produced and maintained by Adobe Systems Inc. of California, USA, has a corresponding ISO archival standard, although in practice PDF documents can and do make use of extensions that are not archival. However, PDF files do contain all needed resources such as fonts and images, and are in most cases considered to be of archival quality. Software that creates PDF may have options to create PDF/A, the archival variant.

PDF documents do not generally reflow text if printed or viewed on a differently sized device than that for which they were created. The page dimensions are part of a PDF document and cannot easily be altered; hyphenation has been performed, footnotes have been numbered and so on. PDF documents can be extremely difficult to read on smaller devices, as the user may need to scroll horizontally back and forth to read each line of text.

The formatting of both word processor documents and PDF files is explicit (except as noted already) and it is therefore possible for a search engine to process and index the text in them and then to display formatted results and previews. However, because the formats are not intended for this purpose, there are some difficulties. For example, PDF does not require that the creating software explicitly mark word boundaries, since each “glyph” can be positioned independently from any other. If word and phrase boundaries are not clearly marked, indexing has to use heuristics: sometimes one will run across search engine results in which characters have been joined between paragraph breaks, or where words have been incorrectly split, or where hyphenated words result in two smaller sub-words being indexed separately.
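The word-boundary heuristic described above can be sketched as follows. This is a minimal illustration, not how any particular indexer works: the glyph tuples, the units and the `gap_ratio` threshold are all assumptions made for the example, and a real system would need a PDF library to obtain glyph positions.

```python
from typing import List, Optional, Tuple

# Hypothetical glyph record: (character, x of left edge, advance width).
Glyph = Tuple[str, float, float]

def group_glyphs_into_words(glyphs: List[Glyph], gap_ratio: float = 0.3) -> List[str]:
    """Guess word boundaries on one line of text from horizontal gaps."""
    words: List[str] = []
    current = ""
    prev_right: Optional[float] = None
    for char, x, width in sorted(glyphs, key=lambda g: g[1]):
        # Start a new word when the gap since the previous glyph is large
        # relative to this glyph's width. The threshold is only a guess.
        if prev_right is not None and x - prev_right > gap_ratio * width:
            if current:
                words.append(current)
            current = ""
        current += char
        prev_right = x + width
    if current:
        words.append(current)
    return words

# Six glyphs with a gap of 3 units between "e" and "c".
LINE = [("t", 0.0, 5.0), ("h", 5.0, 5.0), ("e", 10.0, 5.0),
        ("c", 18.0, 5.0), ("a", 23.0, 5.0), ("t", 28.0, 5.0)]

print(group_glyphs_into_words(LINE))                 # ['the', 'cat']
print(group_glyphs_into_words(LINE, gap_ratio=1.0))  # ['thecat']
```

The second call shows the kind of failure mentioned above: with a slightly different threshold the same glyphs are indexed as one joined word.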

Documents that are essentially pictures of documents, whether because of proprietary or poorly documented file formats, because of insufficient information, or because the document actually contains bitmap (raster) images of text rather than the actual text, can pose difficult or insurmountable problems not only to search engines but also to people who cannot easily read from the pictures, for example because the lines of text do not reflow (creating a need for difficult sideways scrolling) or because the user is relying on a text reader to speak the text out loud and the document does not actually contain any text. Any system for mediating communication between humans must be usable by all humans.

When Robots Watch

When people share documents and also expect their documents to be processed by automatic robotic services such as search engine indexers they must use formats that can be read by an unknown audience. HTML can be a suitable format because it has well-defined behaviour: the robots know where paragraphs start and end, which markup breaks up words or phrases and which does not, and how relationships to other resources such as images or linked documents are represented.
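As a rough illustration of why well-defined markup helps a robot, the sketch below extracts indexable text from HTML using Python's standard `html.parser`. The `BLOCK_ELEMENTS` set is a deliberately small assumption for the example, not the full list from the HTML specification, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed, incomplete set of elements that end a word or phrase.
BLOCK_ELEMENTS = {"p", "div", "h1", "h2", "h3", "li", "br", "td", "th"}

class TextExtractor(HTMLParser):
    """Collect text, inserting a space only where block-level markup occurs."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_ELEMENTS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_ELEMENTS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Inline elements such as em or b contribute no break of their own.
        self.parts.append(data)

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())

print(extract_text("<p>un<em>believ</em>able</p><p>Next</p>"))
# unbelievable Next
```

The inline `em` element does not split the word, while the paragraph boundary does; a robot can only make that distinction because the vocabulary is known in advance.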

Although HTML 5 has added new structural elements such as article, it is common today for Web sites to use div elements with CSS-based styling for such things; this can increase the difficulty of determining the intended formatting: the search engines can determine word and phrase breaks only by applying the CSS styles. With the increased use of JavaScript-based styling this becomes harder, but fortunately there are strong financial incentives for commercial producers of HTML to use clear markup, as otherwise their Web sites do not appear in users’ search results.

HTML is a moving, changing format and is not necessarily safe for archival purposes. PDF can be used with mediation, but PDF documents are not necessarily sufficiently accessible; it is possible to create PDF documents that consist of scanned bitmap page images rather than text.

Documents that Last

When people share documents and need them to be archived for several years or longer, a combination of formats may be best.

XML is a suitable basis for archival formatting because the syntax of XML is not evolving significantly (unlike HTML). Since there are no behavioural semantics within XML there is nothing to change: it is a framework for carrying meaning. However, precisely because XML does not have universal behavioural semantics, a robot, or a future human, cannot necessarily determine word, phrase and paragraph boundaries, nor relationships to other resources, by inspection. HyTime Architectural Forms (for use with the older SGML standard document format) might have provided a way for robots to do this, but they have not been adopted for XML.

Since XML documents cannot reliably be presented to humans or to robots it is necessary to augment them, either with transformations or with alternate additional document formats.

Whenever information is provided in multiple formats there is a possibility of errors and contradictions between the various versions of the documents. Providing one or more automated transformations, using standardized and non-proprietary transformation languages such as XSLT or XQuery, and clearly marking the XML version of the document as authoritative, may be sufficient to minimize the impact of the lack of default formatting for arbitrary XML vocabularies.
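The idea of keeping the XML authoritative and deriving other formats automatically can be sketched as below. The paper names XSLT and XQuery for this job; the sketch uses Python's standard library instead, purely to keep the example self-contained. The `memo`, `title`, `para` and `term` element names are a hypothetical vocabulary invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source vocabulary; the XML remains the authoritative version.
SOURCE = ("<memo><title>Sock inventory</title>"
          "<para>Count the <term>socks</term> weekly.</para></memo>")

def memo_to_html(xml_text: str) -> str:
    """Derive a simple HTML rendition from the authoritative XML."""
    memo = ET.fromstring(xml_text)
    body = ET.Element("body")
    heading = ET.SubElement(body, "h1")
    heading.text = memo.findtext("title")
    for para in memo.findall("para"):
        p = ET.SubElement(body, "p")
        # itertext() flattens inline markup such as <term>; a fuller
        # transformation would map it to an HTML element instead.
        p.text = "".join(para.itertext())
    return ET.tostring(body, encoding="unicode")

print(memo_to_html(SOURCE))
# <body><h1>Sock inventory</h1><p>Count the socks weekly.</p></body>
```

Because the HTML is regenerated from the XML on demand, the two versions cannot drift apart in the way that hand-maintained parallel copies can.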

Suitably augmented XML, then, is suitable for archiving, can be transmitted across networks, and can be formatted to reflow on different devices or pages. The cost of attaining this goal can be high: it is the cost of anticipating the needs of others (including later, older versions of ourselves) as opposed to the cost of reacting only to our own present-moment needs. To motivate the expenditure we must realize short-term benefits. The ability to produce documents in multiple formats is part of this; other benefits will be discussed later in this document.^[1]

On Delivering XHTML or HTML

Many people have written to say that XHTML has no advantages over HTML, or even has disadvantages. However, those writers all seem to be writing from a perspective in which HTML is itself considered a good thing, and in which the primary purpose of creating a document is to display it in a Web browser.

When a single document is to be consumed by many processes within a single organization the ability to use XML tools on it can make XHTML very useful. In addition, ebook readers are currently using XHTML and XML rather than unrestricted HTML 5.

Even if XHTML documents are served on the Web as text/html and not as XML, the design of “polyglot” XHTML is such that the result is predictable, and yet the document can still be processed with XML tools. The value, then, is to the producer. Any value to the consumer is coincidental, but there is also no significant detriment.
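A small experiment illustrates the producer-side value: the same bytes can be read by an XML parser and by an HTML tokenizer with the same result. The sample document is an assumption for the example, and Python's `html.parser` is a simple tokenizer rather than a conforming HTML 5 parser, so this is only suggestive.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A tiny document written to be both well-formed XML and acceptable HTML.
DOC = ('<html xmlns="http://www.w3.org/1999/xhtml">'
       "<head><title>Polyglot</title></head>"
       "<body><p>Hello<br/>world</p></body></html>")

class TagCollector(HTMLParser):
    """Record element names in the order their start tags appear."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tags_via_xml(doc: str) -> list:
    root = ET.fromstring(doc)
    # Strip the namespace that ElementTree prefixes to each element name.
    return [el.tag.replace("{%s}" % XHTML_NS, "") for el in root.iter()]

def tags_via_html(doc: str) -> list:
    collector = TagCollector()
    collector.feed(doc)
    collector.close()
    return collector.tags

print(tags_via_xml(DOC) == tags_via_html(DOC))  # True
```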

When Documents are not Documents

When information is divorced from any context and becomes a set of facts it can be tempting to switch to RDF, the underlying knowledge representation used in the Linked Data Initiative. That context, however, may still be needed over time, so in larger projects RDF is most commonly used when it is automatically generated and known to be context-free, or kept in named RDF graphs and regenerated as needed, for example to support repudiation of facts or restrictions on sharing.

RDF cannot in general be represented in document format except through visualizations of graphs, and thus is even harder to format in search results or accessibility tools than XML (although since RDF can be interchanged in XML there is clearly and necessarily some overlap).^[2]

Programmer to Programmer, Machine to Machine: program-specific data formats

When a computer program needs to communicate complex information to another program different considerations apply from human-readable documents.

Whatever format is used must map directly to data structures used within the programs at both ends, as otherwise the primary goal of communication between programs will not be achieved.
Programmers often consider efficiency to be an important goal, as measured by number of lines of code for parsing, amount of memory consumed, amount of processing used, and amount of data transmitted for a given result. For this reason terse formats are often preferred.
Flexibility of representation is not a benefit when one is marshalling data, saving/restoring/transmitting objects, or exchanging application-specific data. Instead, a very specific format may be easier to parse.
Standardization is not usually considered important by developers except insofar as widely-deployed code libraries might reduce work. One therefore often sees one-off formats in use.

One widely-used program-to-program data format is JavaScript Object Notation (JSON). Although, as the name suggests, this was originally a serialized form of data structures such as are found in Web browsers, the popularity of the World Wide Web and the desire to create and devour Web browser data structures on Web servers has meant that most programming languages today have libraries or native support available for handling JSON.

Since object serializations are by nature tied to specific versions of specific programs, and since JSON is not in general self-labelling with regard to version or conformance, JSON cannot be said to be suitable for archiving. None the less the syntax is compact and familiar to programmers working with most of the widely-used languages today, languages whose design was influenced by the C programming language.

Another widely used format is the “comma-separated values” (CSV) file. There are dozens of different syntax variations and software that reads CSV files often has to ask users to identify particular aspects of the variant in use, showing that the format is not very suitable for interchange or archiving. Recent work at W3C in supplying metadata for CSV files may help in this area in the future.
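The ambiguity is easy to demonstrate with Python's standard `csv` module. The sample line is made up for the illustration: it is meant to suggest two decimal-comma numbers separated by a semicolon, but nothing in the file says so.

```python
import csv

# One line of data; the file itself does not say which variant it uses.
LINE = "1,5;2,75"

def parse(line: str, delimiter: str) -> list:
    """Parse a single CSV line using the given delimiter."""
    return next(csv.reader([line], delimiter=delimiter))

print(parse(LINE, ";"))  # ['1,5', '2,75']
print(parse(LINE, ","))  # ['1', '5;2', '75']
```

Both readings are syntactically acceptable, so the reader has to be told, or has to guess, which one was intended.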

Programs and Humans: program-specific text formats

A variant on machine-to-machine communication is the set of markup formats designed by programmers for use in specific programs but intended to be authored and edited by humans using text editors.

This list includes languages such as Markdown (used for formatting wiki entries and for describing programs on GitHub) and Microsoft-style “ini” files, and perhaps also TeX and troff macros.

Over the years there have been many such formats, and long experience suggests several difficulties with the use of such formats and several strengths.

Ease of parsing can be so great there may not even be an identifiable piece of code that’s a parser. This can be both a strength (rapid prototyping and development) and a drawback (higher cost of maintenance).
Ad-hoc formats tend not to have any explicit document format version indication and yet be specific to specific versions of the software for which they were written.
If there is only one interpreter for a language it’s common to find that undocumented features become used, hindering future attempts at a second implementation and frustrating attempts to interpret data in the absence of the software for which it was created.
Errors in a file created in an ad-hoc format might go undetected, and, without other implementations to compare, or without a concept of validation, can become difficult, expensive or even impossible to correct after the fact.

Ameliorating some of the concerns is the fact that many human/programmer text formats are widely implemented.

One such widely-implemented format, Markdown, is used in multiple programs. Markdown is a text-based format designed for use in Web forms such as Wiki pages, with a syntax such as using equals-signs to underline a heading. It has the advantage that the text looks similar to the result of formatting, although the markup for that same reason tends to be presentational and not aimed at representing information which can be re-purposed. Unfortunately, there are many incompatible variations of Markdown and the format is not self-labeling, so that one can't be certain which variation one is seeing.
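The equals-sign heading convention mentioned above can be handled in a few lines, which also illustrates how ad-hoc such parsers tend to be. This sketch covers only that one rule and is not an implementation of any particular Markdown variant; the function name is hypothetical.

```python
import html

def underlined_headings_to_html(text: str) -> str:
    """Convert lines underlined with equals-signs into h1 elements."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        following = lines[i + 1].strip() if i + 1 < len(lines) else ""
        # A non-empty line followed by a line made only of "=" is a heading.
        if lines[i].strip() and following and set(following) == {"="}:
            out.append("<h1>%s</h1>" % html.escape(lines[i].strip()))
            i += 2  # skip the underline as well
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)

print(underlined_headings_to_html("Title\n=====\nBody text"))
# <h1>Title</h1>
# Body text
```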

A strength claimed for Markdown is that people unaccustomed to HTML or other markup languages can work with it. Direct content-editing in Web browsers removes much of that appeal, since a word-processor style of input editing is presumably even more appealing to the same people who don't like HTML. In fairness one should also mention programmers who want a text-based document format but feel that XML and HTML are too verbose for their needs.

Factors for Evaluation

This section describes some of the factors that determine which format to use in a given situation. There is no complete list because situational and contextual factors are always the most significant in practice. Note that evaluation here is not in the sense of deciding one format to be in some way superior to another, but to suggest applications for which each is the most suited.

Information Life Cycle

Information that will be archived for future research purposes must be clear when taken out of context. This might be achieved through careful documentation and avoiding relying on application-specific or opaque formats.

Information that will be used once and discarded, such as an API message in a Web service or notification that a user moved a pointing device, could reasonably be in an application-specific format, but if multiple programs might make use of the same message then there is greater value in a more generic format.

Information that will be stored and processed and perhaps queried will need to be in a format that supports that processing. This is the most common case for documents today and the least common for data (since the data is more easily queried in a data store than interchanged en masse).

Self-describing or clearly documented information will generally make querying easier and will facilitate recovery from an archive in the future, but that follows for all possible data formats. However, not all data formats are such that documents can easily, and routinely do, identify the format used and version of that format. For example, neither CSV files nor Markdown documents can in any standard manner identify the specification or language to which they might conform, and HTML 5 documents do not identify the dated version of the "living standard" to which they conform.

Audience Language and Culture

Information that contains mixed languages, scripts or dialects will need a mechanism to indicate this, such as xml:lang in XML or lang in HTML.
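As a sketch of how a consumer might use xml:lang, the function below walks an XML fragment and reports each run of text with the language in effect, inheriting the value from ancestor elements. The sample paragraph and the `para` and `label` element names are assumptions for the example, as is tagging the word "sokken" as Dutch (`nl`).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def text_with_languages(element, inherited=None):
    """Return (language, text) pairs; xml:lang is inherited by descendants."""
    lang = element.get(XML_LANG, inherited)
    runs = []
    if element.text and element.text.strip():
        runs.append((lang, element.text.strip()))
    for child in element:
        runs.extend(text_with_languages(child, lang))
        # Text after a child belongs to the parent, so it uses the parent's language.
        if child.tail and child.tail.strip():
            runs.append((lang, child.tail.strip()))
    return runs

DOC = ('<para xml:lang="en">Push the button labeled '
       '<label xml:lang="nl">sokken</label> to continue.</para>')

print(text_with_languages(ET.fromstring(DOC)))
# [('en', 'Push the button labeled'), ('nl', 'sokken'), ('en', 'to continue.')]
```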

Where human-readable content is included and could be in any language (now or in the future), rich text (mixed content) will almost certainly be needed, at a minimum for supporting Japanese or Chinese ruby annotations.

Where text may be translated, in part or whole, a text replacement mechanism may be needed to make a translated version of a document. It may also be necessary to mark which parts are to be left untranslated (push the button labeled sokken: the label on the physical vending machine on the platform doesn't change just because you have an English guide book).

Universal Access

Any information presented to people will need to be accessible to them. This means that accessibility must be built in at all levels. Some of the formats described in this paper are accessibility-agnostic, but others can include or encourage user interface elements that can be harmful or exclusionary; in such cases extra vigilance may be needed on the part of document authors and system developers.

Relationships between Documents

Sometimes a document or piece of data might stand alone, but that surely is rare. A document might form part of a sequence, might contain links, or might be contained in, or be, a database, so that joins between sets of values might be performed.

Link discovery requires a standard vocabulary such as HTML or XLink or a standard discovery mechanism such as HyTime's architectural forms for SGML years earlier.

Implicit links, such as might be found by joins, are thus format-dependent; a dictionary site might make a link out of every word or phrase in a paragraph to a corresponding definition, but might do so programmatically (often with poor results in the face of homonyms). This ability is independent of format, but explicit linking requires syntax as does marking terms not intended to participate in such links.

Although simple querying can be performed on any of the formats, since they are text based, structure-aware querying is currently defined only for some formats, including RDF (SPARQL), XML (XQuery) and (although not a standard) JSON (JSONIQ).

Structure-based querying often has difficulty when one syntax is embedded within another: which HTML documents contain a definition for a particular JavaScript function with a given type signature, or which JSON documents contain a string with embedded HTML having a div element with a particular class attribute. Such hybrid queries can involve complex textual escaping conventions; XQuery systems supporting SPARQL queries of RDF embedded in XML provide a promising counterexample.
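The second hybrid query mentioned above (a JSON document containing a string with embedded HTML that has a div with a given class) might be approached as in the following sketch, which parses each layer with its own parser rather than searching the escaped text. All names and the sample message are hypothetical.

```python
import json
from html.parser import HTMLParser

class DivClassFinder(HTMLParser):
    """Set .found when a div carrying the wanted class is seen."""

    def __init__(self, wanted: str):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted in classes:
                self.found = True

def strings_in(value):
    """Yield every string value nested anywhere inside decoded JSON."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from strings_in(item)
    elif isinstance(value, list):
        for item in value:
            yield from strings_in(item)

def json_has_div_with_class(json_text: str, wanted: str) -> bool:
    for text in strings_in(json.loads(json_text)):
        finder = DivClassFinder(wanted)
        finder.feed(text)
        finder.close()
        if finder.found:
            return True
    return False

MESSAGE = json.dumps({"id": 7, "body": '<div class="note warning">Mind the gap</div>'})

print(json_has_div_with_class(MESSAGE, "warning"))  # True
print('class="warning"' in MESSAGE)                 # False
```

The last line shows why a plain text search is not enough: the quotation marks are escaped inside the JSON string, and the class attribute holds more than one value.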

Default Formatting

Documents on the open Web need to be findable, and that generally means that search engines will need to parse them and then in response to user searches generate result snippets, short extracts that users can use to decide whether to read the longer document. Phrase and word breaks and basic formatting are necessary for the snippets.

Default formatting is also needed for operations such as copy and paste.

Validation

Although validation is a dirty word in some HTML circles, in other circles it's an essential part of doing business: context determines function.

Validation can be at the syntax checking level, or at the business logic level (every invoice must have a date, a customer number and an amount), or can be at the application level (the file is OK if the program reads it). Of these, the application level validation is the most powerful (arbitrary code) and the least portable. A standard way to express business or grammar rules means that documents can be tested against multiple programs and can also serve as documentation over time.
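The first two levels can be contrasted in a short sketch that uses the invoice rule from the paragraph above. The element names (`invoice`, `date`, `customer-number`, `amount`) and the sample values are assumptions; in practice such rules would more likely be expressed in a schema language than in code.

```python
import xml.etree.ElementTree as ET

# Business rule from the text: every invoice needs these three items.
REQUIRED = ("date", "customer-number", "amount")

def check_invoice(xml_text: str) -> list:
    """Return a list of problems; an empty list means the invoice passed."""
    try:
        invoice = ET.fromstring(xml_text)   # syntax level: is it well-formed?
    except ET.ParseError as err:
        return ["not well-formed: %s" % err]
    problems = []
    for name in REQUIRED:                   # business level: required content
        if not (invoice.findtext(name) or "").strip():
            problems.append("missing " + name)
    return problems

print(check_invoice("<invoice><date>2015-08-11</date><amount>12.50</amount></invoice>"))
# ['missing customer-number']
```

Hard-coding the rule like this is closer to the application-level end of the scale; it is the least portable option, which is the point the paragraph makes in favour of a standard way to express such rules.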

Data Typing

A document may contain components with identifiable data types such as "numbers" or "sequence of characters, string" or "truth value". This is essential for data binding and object dumping (as in JSON), but for some other systems it's also important to support user-defined types such as sock-colour or MailingAddress.

Program Compatibility

The constructs that a data format can represent should match the objects that a program needs in order to manipulate that data. If a format is too difficult to process it will not be popular with developers.

This must be balanced by the fact that programmers may not be the only, or even the most critical, stakeholders in a project.

In some cases (and some contexts) a compromise can be reached using scripting languages such as JavaScript, but then security implications must be considered.

The need to process data is intrinsic to computing with data; having standard data processing and transformation languages can help with staffing needs as well as system portability and longevity at the expense of using languages that are not necessarily optimized for the particular task at hand.

Information Modelling

One of the decisive factors for many projects in the past has been whether the goal of using markup is to model information (which may exist outside of the marked-up document, for example in a physical book or manuscript being transcribed or quoted, or an existing business process) or whether it is to guide presentation.

Markup as part of information modelling can be contrasted with markup as a syntax for conveying data, such as node-and-arc graphs or objects, which themselves may represent (or be) models.

HTML and Web Browsers

The markup in HTML is primarily driven today by the goals of Web browser vendors.

Although lip-service is paid to so-called “semantic tagging” what is meant is markup divorced from presentation specifics and yet tailored to a specific type of software application, the Web browser. An HTML document represents part of a Web Application, together with other resources such as Cascading Style Sheets (CSS), images, JavaScript programs, and perhaps input data in JSON or other formats.

So-called semantic tags (actually elements) added to HTML 5 have mostly included markup for blogging. Transcribing a play, writing a poem, even sharing song lyrics, these are not on the HTML agenda.^[3]

Recent work on user-defined elements in HTML concentrates on their “behaviour” rather than on what (if anything) is being represented.

Since Cascading Style Sheets have built-in support for HTML features rather than being a general-purpose styling language for marked-up documents it is more convenient to use HTML rather than some other XML markup language when using CSS, whether for Web browser use or otherwise.

Since the HTML language is intended for use with CSS and JavaScript, primarily within a Web browser, and not for document modeling, it makes sense to use XML for authoring, transcriptions, and archival purposes, and to transform to HTML when needed.

Multiple Consumers: Transformations

The need for document creators to produce EPUB documents for electronic readers alongside other formats has led to an increase in the usage of XML, as opposed to (or as well as) proprietary page design or word processing formats. There is nothing about XML that makes it inherently more amenable to transformation than JSON, or than any format that can be parsed reliably and in an interoperable manner. In practice, however, the existence of XSLT, of XQuery and XPath, and the widespread availability of tools implementing those languages, means that XML is a particularly convenient choice. The use of XML schema languages to check that documents meet specified constraints can also help to control the scope of transformation programs.

It should be noted that a strength of XSLT is that it can be written, read and maintained by people who do not see themselves as programmers, but as document people. The declarative nature of XSLT, and the limited control flow possibilities, help to make the XSLT transformations easy to understand. As a result, organizations with people working on predominantly textual documents are very likely to have staff who can comfortably use XSLT, making XML in turn an excellent choice as a basis for transformations.

HTML and JSON, by contrast, do not have such transformation languages; JavaScript is much closer to “regular programming” than XSLT and may be seen as inappropriate for technical writers to use.

Comparison of Formats

So far this paper has introduced some use cases and (indirectly) markup formats. This section summarizes the strengths and weaknesses of each format using the factors for evaluation described above, after a brief introduction to make clear what is meant in this paper by each format.

It should be stressed that this is not a complete list of markup formats; the goal of this paper is to help the reader choose among several of the most likely formats to be used today, and to provide a starting-point for discussion.

Plain Text (Unstructured)

Mentioned here only for completeness, plain text files with no claim to using any particular markup strategy can be read by humans and if there is some regular ad-hoc syntax then a program can read the file, but there is no Network Effect: if the syntax were widely enough used to have multiple implementations and a user community it would no longer be considered a plain text file, but would have identifiable structure.

Since plain unstructured text does not by itself constitute a markup language, it will not be compared further.

Markdown

Although there are a number of mostly-compatible variants of Markdown, in this paper we will imagine a world in which a single variant dominates. The stated intent of Markdown is as a text to HTML conversion tool for Web writers.

Life Cycle: because Markdown is not a standard, variations between versions may mean Markdown is not ideal for archiving. This is exacerbated because Markdown files are not self-describing: they do not label themselves as Markdown and do not identify the version of Markdown to which they conform.

Audience, Language and Culture: Markdown is not internationalized. Lack of support of mixed language paragraphs, indications of language in use, explicit right-to-left markup, Ruby annotations and script selection may make it unsuitable for mixed language content. Lack of named identifiers for sections and paragraphs may make it difficult to keep translations in sync.

Universal Access: Markdown has limited support for HTML accessibility from a reader perspective; on the other hand Markdown has found a use for people writing blogs, because it can fairly easily be created in a text editor and uploaded, avoiding the user interface for the blogging system.

Situations: Markdown is suitable for simple computer-mediated human-to-human communication, since Markdown files can easily be read in their text form as well as when converted to HTML. Markdown cannot represent complex documents such as mathematical research papers.

Relationships: Markdown supports explicit URL-based links.

Default formatting: Markdown files can be seen as text files or as HTML, and it is reasonable to say that, although not as powerful or widely supported as HTML in this regard, Markdown documents are transparent with respect to the author's formatting intentions.

Data Typing and Validation: not provided except for basic syntax checking.

Program Compatibility: Markdown is not significantly easier to process in programs than HTML, and a common way to process it is in fact to convert it to HTML first.

Use case: Markdown is primarily used where a text-based "rich text" is needed for people uncomfortable dealing with HTML or XML directly, and where no tools are available.

Information Modelling: not attempted.

JSON

JSON (JavaScript Object Notation) is a mechanism for transmitting data that can easily be instantiated as programming-language-level objects by the receiver. The format was originally defined for JavaScript but JSON is now supported by most of the major programming languages. JSON is included in this paper because, even though it is not perhaps a markup language, and does not attempt to be particularly suited for textual documents, it is widely seen as a replacement for XML in Web services and interactive Web usage (AJAX), where JSON strings contain escaped fragments of HTML.

Situations: JSON is intended for program-to-program communication.

Life Cycle: JSON is primarily aimed at information that will be used once and discarded, such as search results communicated from a Web server to a Web browser. However, today there are databases for storing and querying "JSON documents".

Audience, Language and Culture: JSON documents do not have standard ways (at the time of writing) to mark the natural language used for text strings; even if they did, JavaScript objects are the wrong level of abstraction for this. It is, however, possible to embed escaped HTML strings in JSON, and these can contain language tags. JSON is not intended as an authoring format for textual documents.

Universal Access: since JSON is intended for program-to-program communication this is not an issue. It is up to the creator of any HTML embedded inside JSON to ensure accessibility, however.

Relations between Documents: JSON documents represent objects with simple names; if it's known through some external source that the same name in multiple documents represents the same information then database query languages can associate the information. Additionally, JSON strings might include escaped HTML markup with links, but there is no meaningful way to point into a JSON file with a link, nor is there a standard meaning. JSON Schema defines a mechanism to point to JSON objects using a reserved name, "id".

The JSONIQ query language gives an extended XPath-like syntax, and there are other ways to refer to the inside of a JSON document, but pointing into an object in a computer program isn't the same as linking to part of a document.

There are no widely used ways to transform JSON objects outside of a programming language, although there is (or will be) JSON support in XQuery 3.1, XSLT 3 and JSONIQ.

Default Formatting: There is no default presentation for JSON objects beyond the "source code view" of the actual document.

Validation and Data Typing: The IETF JSON Schema language is still a draft, and has not gained much traction yet, but is gaining maturity. It was influenced by XML Schema but does not support user-defined data types. It is intended for use at a programmer and API level, not at a business level.

Program Compatibility: This is the greatest strength of JSON: JSON documents are also JavaScript fragments. They can be embedded in the source code of programs, they can be read with "eval" (although security implications suggest this should be preceded with validation) and they can be generated directly from any object in a JavaScript program. Although usage in other programming languages typically requires a library, JSON's data structures usually map exactly onto data structures in popular programming languages, unlike (for example) HTML or XML, where attributes and mixed content must be modeled in terms of such data structures.
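The difference in mapping can be seen with Python's standard library; the sample data is invented for the illustration. JSON decodes directly into the language's own dictionaries, lists, numbers and booleans, whereas mixed content in XML has to be reassembled from separate text and tail fields.

```python
import json
import xml.etree.ElementTree as ET

# JSON decodes straight into native data structures.
order = json.loads('{"item": "socks", "sizes": [38, 40], "in_stock": true}')
print(order["sizes"][1])   # 40
print(type(order).__name__, type(order["sizes"]).__name__)   # dict list

# Mixed content has no such direct counterpart: the text of one sentence
# is spread across an element's text, a child element, and that child's tail.
para = ET.fromstring("<p>Push the <b>red</b> button.</p>")
pieces = [para.text, para[0].text, para[0].tail]
print(pieces)              # ['Push the ', 'red', ' button.']
```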

Information Modelling: JSON is all about program modelling and not information modelling. It's just syntax: one can map from SGML or HTML or XML into JSON, but the primary strength of JSON is its convenience for developers, not its ease (or otherwise) of modelling information. Another indication of the JSON culture is that JSON Schema does not provide for user-defined types, just number, string, boolean, array, object and null. Schema authors can restrict the value space to say that a field called socks_owned must be a whole number not less than zero, but cannot say that socks_owned is of type socks_count; this reflects the type system of JavaScript but is not, for example, a good match for the way people think about documents or objects outside the computer.
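The socks_owned restriction might be written roughly as follows. This is a sketch only: the schema fragment is hypothetical, and check_socks is a hand-written check of that one constraint for the purposes of illustration, not a JSON Schema validator.

```python
import json

# Hypothetical schema fragment: a value-space restriction, with no named type.
schema = json.loads('''
{
  "type": "object",
  "properties": {
    "socks_owned": {"type": "integer", "minimum": 0}
  }
}
''')


def check_socks(document, schema):
    """Hand-rolled check of the single constraint above (illustration only)."""
    rule = schema["properties"]["socks_owned"]
    value = document.get("socks_owned")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        return False
    return value >= rule["minimum"]
```

The constraint is attached to the field name; nothing in the fragment gives the idea of a sock count a name of its own.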

HTML

The HyperText Markup Language, standardized first at the IETF and the ISO and later at W3C, is a fixed markup language aimed at delivering documents to the World Wide Web. It is a vocabulary largely controlled by Web browser makers.

A recent variant, HTML 5, adds support for "Web Components", essentially user-defined HTML elements with content templates and JavaScript and CSS styles to supply any required browser-side behaviour. Unfortunately, HTML 5 is a "living standard" and features come and go from time to time. This is balanced by excellent support from Web browsers and clear documentation (in almost all cases) on exactly how a Web browser should recover from errors.

Situations: HTML is primarily intended for computer-mediated human to human communication of documents, but it is also increasingly used today for computer-to-human interactions with "Web Applications."

HTML is also used for computer-to-computer messages, but in this case the error recovery rules employed by Web browsers and by conforming HTML 5 implementations may not always be appropriate. Silent correction or acceptance of errors has in other languages and systems famously led to deaths in space missions and other engineering problems.

Information Life Cycle: HTML is implemented in perhaps a dozen or more Web browsers, with a very large deployment. As a result it is difficult for HTML to change in incompatible ways. Nonetheless, such changes are often attempted, and, as a result, archived HTML documents need to be explicit about the version of HTML they used.

The culture of HTML tends to be very much aimed at Web browser use. As such, behavioural and presentation semantics are emphasized, with "semantic" elements such as section and article being hailed as an advance over equally generic names such as div. Again, the challenge here for archiving is that the actual meanings of markup constructs will and do change over time, and also that JavaScript code may or may not continue working over a period of decades and may or may not sufficiently describe behaviour and intent.

A large number of content management systems and databases for storing HTML exist; some of them prefer XHTML, which can be parsed more reliably; see the next section for more details.

Audience, Language and Culture: HTML has strong internationalization and localization support, especially when used in conjunction with the Internationalization Tag Set (ITS). Individual elements down to the word or sub-word level can be marked for language, region and script, and can be marked as not to be translated. Ongoing work, for example in supporting all forms of Chinese and Japanese ruby annotations, is improving the situation still further, but, overall, HTML offers one of the best formats for international and multilingual documents today.

Early versions of HTML, unfortunately, put human-readable content such as alternate replacement text for when an image is not available, in attributes, precluding markup for mathematics, for Ruby annotations, for emphasis; this defect is slowly being corrected, for example with the picture element.

Universal Access: Extensive and very helpful information is available for document and application authors working with HTML. There are plenty of challenges since not all HTML documents are automatically accessible, but that is also true of other rich formats, especially when they are scriptable. A complex system of fallbacks makes it possible to write Web applications that will work on a wide range of devices and with assistive technologies such as text readers, alternate pointing devices and even Braille terminals.

Relationships between Documents: HTML has a rich vocabulary for representing relationships from one document to another, including explicit hypertext links and link relations as well as implicit links (for example with URI Templates) and links between information and remote descriptions with microdata and RDFa annotations.

There is no automated mechanism today for link discovery when links are implicit.

There is no widely-deployed standard HTML querying language, and there is no standard way in HTML to represent relationships between documents outside of any document.

Default Formatting: HTML today is used for the representation and formatting of best-selling printed books; it is not as sophisticated as other publishing platforms but it is growing rapidly in that area. HTML documents have default associated formatting, although the increasing use of cascading style sheets to redefine the formatting and purpose of elements can weaken that association, and should be avoided.

Validation: There are widely-used syntax checkers for HTML, such as that at validator.nu and the W3C HTML validator. Validation at the business level, for example to say a heading must be followed by a paragraph, must be handled with other mechanisms, such as by using XHTML and XML Schema.

Data Typing: HTML did not define any specific data model until HTML 5; before that, although the HTML DOM was widely used, it was not mandated by HTML. Like JavaScript objects, however, the HTML DOM is not strongly typed.

Program Compatibility: Unlike JSON, HTML documents cannot easily be processed by programs in most traditional languages, even JavaScript. Attempts to alleviate this, such as the popular jQuery library, have been largely successful where they are available. HTML is not a strong choice for object serialization and deserialization, which is why JSON exists.

Information Modelling: HTML documents are closely (and increasingly) tied to Web browser design. HTML is adequate in many cases for modelling a blog, although it does not have standard support for song lyrics, poems, footnotes, or a host of other basic rhetorical forms and devices.

XHTML

There are two main versions of XHTML in use today, and two meanings of the term; XHTML 1 was designed to be an XML-based version of HTML 4 which can be served to Web browsers as either XML or HTML. XHTML 5 is an XML serialization of HTML 5 with the same goal: that when a Web browser reads an XHTML 5 document it creates the same internal representation (DOM) regardless of whether the HTML or the XML syntax was used. XHTML 5 is not, however, a successor to earlier versions of XHTML.

All of the considerations for HTML apply to the XML syntax for HTML, except that parsing of XHTML as XML means firstly that errors may be treated as fatal and second that XML tools can be used with XHTML documents.
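The difference in error handling can be sketched with two parsers from the Python standard library. This is an approximation made for illustration: html.parser is a lenient tokenizer and does not implement the HTML 5 error recovery rules, but it does show the general contrast between input that is silently accepted and input that is rejected. The malformed fragment is invented for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Invented fragment with unclosed p and b elements.
malformed = "<p>first<p>second <b>bold</p>"


class TagCollector(HTMLParser):
    """Record start and end tags as the lenient parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# The HTML tokenizer accepts the input without complaint.
collector = TagCollector()
collector.feed(malformed)
collector.close()

# The XML parser treats the mismatched tags as a fatal error.
try:
    ET.fromstring(malformed)
    xml_error = None
except ET.ParseError as err:
    xml_error = err
```

After this runs, collector.events holds the tags exactly as written, with no indication that anything was wrong, while xml_error holds the exception raised by the XML parser.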

RDF and Linked Data

The Resource Description Framework, RDF, is a standard for representing metadata as sets of decontextualized triples of atomic values that form a (possibly disconnected) graph. RDF is most often exchanged in three formats: RDF/XML; Turtle (a text-based syntax); and SPARQL Results in XML, a format intended to be transformed (often with XSLT) into a user-visible format such as HTML or SVG.
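To make the triple model concrete, here is a minimal sketch that holds a graph as a set of Python tuples and matches simple patterns against it. The example.org URIs and property names are hypothetical, and the match function is only a stand-in for what a real query language such as SPARQL provides.

```python
# Hypothetical namespace and data, for illustration only.
EX = "http://example.org/"

# A graph is a set of (subject, predicate, object) triples.
triples = {
    (EX + "entry1", EX + "author", EX + "person1"),
    (EX + "entry1", EX + "title", "An Example Title"),
    (EX + "person1", EX + "name", "Example Person"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything the graph says about entry1.
about_entry = match(triples, s=EX + "entry1")
```

Note that each triple stands alone: nothing in the data structure groups the statements about entry1 together, which is what is meant above by decontextualized.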

Linked Data (LD) is a name for the practice of publishing and combining RDF-based graphs; it is mentioned here in the context of making abstract RDF graphs available from documents.

Situation: RDF is primarily used in computer-to-computer communication, although many RDF data sets are hand-authored.

Information Life Cycle: RDF documents are frequently stored in databases, whether hybrid or RDF-only (RDF-only databases are often called triple stores). Although RDF can be used for one-off communication it is more often stored and queried. RDF is also commonly embedded in other formats, especially HTML. The most common standard querying language for RDF is SPARQL.

Since RDF uses URIs, and URIs are defined to be opaque and meaningless to an outside observer, RDF is strictly speaking not self describing. In practice, though, URIs are normally made from natural language words and represent what those words name. Most RDF serializations do identify the file as conforming to a specific version of RDF.

Audience, Language and Culture: RDF nodes have opaque identifiers that are not in any natural language. It is possible to create "labelFor" nodes in the RDF graph and give them language tags, although it should be noted that RDF does not handle XML or HTML style mixed content well.

The Linked Data culture wants all information about everything and everyone to be public. Privacy and security remain challenges for the various RDF communities. A talk at XML Prague suggested storing RDF graphs in XML databases and using XQuery to construct a set of triples for SPARQL queries based on security, but this should probably be seen as an outlier; in the long term one can expect SPARQL itself to learn about security. A technical challenge is that there is nowhere in a triple to store sharing or security information.

Universal Access: RDF, like JSON, does not have any inherent user interaction. Graphical visualizations, however, can be a challenge for people who are not able to see them, and alternatives therefore need to be considered.

Relationships between Documents: RDF is all about relationships, but, oddly, cannot easily refer from one graph to another. RDF named graphs (new in RDF 1.1) may provide a mechanism there, but it is too soon to measure deployment.

Default Formatting: RDF documents do not have textual representations other than (like JSON) as source. However, they are conventionally represented as node and arc graphs. This visual representation conveys the overall structure of an RDF graph but not necessarily the actual content.

Validation: There has been recent work on RDF Shape Expressions for constraining the shape of RDF graphs; this is not yet deployed.

Data Typing: RDF does support associating data types with values, and these can be user defined.

Program Compatibility: The RDF model is graph based, not object based, and does not correspond to the native data structures and type systems of modern programming languages. However, those same current languages are easily able to represent RDF graphs, and there is no mixed content to complicate things.

Information Modelling: RDF is about modelling knowledge, not information. It is a knowledge representation system used primarily for first-order logic and inferencing.

XML

The Extensible Markup Language, defined at W3C as a subset or profile of SGML (and originally known as Web SGML), is not really a single markup language like HTML, but instead a framework for defining one's own markup languages, all of which have a common syntax.

This paper distinguishes where appropriate between arbitrary XML documents and documents in some specific XML-based markup language such as XHTML 5 or DocBook.

Situation: XML is used in all areas of communication: person to person, person to computer, and computer to computer, and can to some extent also be used without computer mediation (that is, text-oriented XML documents can be moderately readable, although not as much as Markdown documents).

Information Life Cycle: XML documents have a life cycle that depends on how they are used more than on the fact they are XML. For example, a message from an automobile engine to a garage mechanic's diagnostic system, a message from one operating system component to another when a user double-clicks on a desktop icon, a transcription of an Anglo-Saxon poem, and a health-care provider's record of treatment for a patient are all likely to be in XML, and each has different longevity and processing characteristics.

Trees based on parsing XML documents can be stored in relational, XML-native or hybrid data stores, and the XQuery language can be used to access them efficiently.

Audience, Language and Culture: XML documents can support all of the internationalization features of HTML and XHTML, but it depends on the specific XML vocabulary. If you are designing an XML representation for text you should consider adopting the HTML model where possible because of widespread understanding and adoption.

The W3C Internationalization Tag Set (ITS) can be used directly in XML to help with translation and localization.

Universal Access: Again, this depends on the ways in which the XML documents are used. Awareness of the W3C Web Content Accessibility Guidelines can help document designers to create accessible systems using XML.

Relationships between Documents: The XLink specification has not gained much traction, and today people are more likely to use an ad-hoc attribute called href, or possibly to use the HTML "a" element by means of an XML namespace. It is also possible to embed RDF in XML documents.

Default Formatting: This is one of the two biggest weaknesses of XML: since there are no default presentational semantics, search engines cannot generate reliable snippets for results. Using XML on the World Wide Web can therefore be a problem.

Validation: XML has a wide variety of validation mechanisms, from simple and widely-supported DTDs, through to the baroque complexities of W3C XML Schema. A part-way compromise is RELAX NG, but this does not perform the data binding role of XML Schema, as described in the next paragraph. User-defined data types and compound types are available.

Data Typing: XML Schema validation can assign type annotations to elements in the parsed XML tree; type labels can be user-defined type names as well as built-in types. Note that RELAX NG does not support assignment of type annotations in a deterministic way, so that XML Schema is generally used where data binding (object loading and dumping) is required.

Program Compatibility: This is the second of the two main weaknesses of XML: the concept of an annotated tree of nodes is not a native data structure in most programming languages. As with HTML, mixed content such as paragraphs with embedded elements considerably complicates processing.
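The difficulty with mixed content can be seen in a small sketch using Python's xml.etree.ElementTree. The paragraph is invented for the example. In this particular API the text following an embedded element is stored on that element as its "tail", so code that walks the tree has to interleave text and elements itself.

```python
import xml.etree.ElementTree as ET

# Invented paragraph with mixed content.
para = ET.fromstring("<p>Water is H<sub>2</sub>O, not <i>aqua</i> regia.</p>")


def flatten(element):
    """Yield (kind, value) pieces of mixed content in document order."""
    if element.text:
        yield ("text", element.text)
    for child in element:
        yield ("element", child.tag)
        yield from flatten(child)
        # Text after the child's end tag belongs to the child, not the parent.
        if child.tail:
            yield ("text", child.tail)


pieces = list(flatten(para))

# The plain text alone is easy to recover; the structure is what takes work.
plain = "".join(para.itertext())
```

A record-like structure such as a dictionary has no natural place for the ordering of text and elements that flatten has to reconstruct here.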

The situation is mitigated by the popularity of XSLT and XQuery, XML-specific languages for querying and manipulating trees.

Information Modelling: This is the greatest strength of XML: that it can be used, and culturally is used, to model documents or other information outside of any particular application or process. This strength comes at a cost: because XML documents are usually independent of any one program they are also not optimized for processing by any one program, and this can make XML unpopular with application developers.

Some Use Cases

This section gives examples chosen to illustrate a typical use case for each of the main formats discussed, together with indication of how to represent the example in the other formats.

An Object Dump

Consider a JavaScript program running in node.js on a Web server, communicating with a database to provide persistent storage of objects. Objects will have JavaScript types and values; the obvious choice is JSON, which was designed for this purpose.

One could use RDF instead; direct mappings from UML to RDF exist. But then a library would be needed, and the various transfer syntaxes of RDF are not as convenient for JavaScript programmers. In languages where JSON also needs a library, or where JSON does not map well to objects, RDF may be a stronger contender.

XML is also commonly used for object dumps. A library is needed, both for serialization and for loading, but such libraries exist for most languages. Since object dumps tend to be specific to a particular state of a particular program at a particular time, they are not easily reused by other programs; JSON may be more suited in that case. The strongest use cases for XML are when documents will be used in multiple ways.

The lack of standard transformation tools for JSON (compared to XML for example) is likely to be short-lived; there are several contenders as well as native-JSON NoSQL databases in widespread use.

A Technical Dictionary

In this example an organization edits a complex dictionary and produces editions in print, in PDF, in HTML on a subscriber-only Web site and in EPUB for ebook readers. Subsidiary products are also produced and might include a dictionary defining only terms needed for specific high-school (K12) or undergraduate courses, or subsets containing, say, only entries that mention a specific compound.

Dictionaries are examples of documents that often feature mixed content very heavily: superscripts and subscripts, mathematics, terms that are to link to definitions, multiple languages, symbols and small diagrams may all occur in running text. Even a simple English dictionary may contain relatively mixed content, as in the example in Figure 6.

Since EPUB 3, used for electronic books, is essentially a "Web site on a stick", there is considerable pressure to use HTML. However, custom markup can support business-level validation (for example, every major definition must have at least three examples), and can help with research and querying.

A compromise is to use (X)HTML augmented using ARIA attributes to provide so-called structural semantics, with microdata, or even with custom XML elements; since HTML 5 Web Components provide a standard way to add elements this approach is likely to become popular. However, enforcing appropriate markup on authors may be necessary to preserve the value of the work, and that may suggest a custom XML-based markup with transformations to HTML as needed. Multilingual mixed content is today the home turf of the XML team.

RDF metadata can be embedded in dictionary entries, or, more likely, generated on the fly, perhaps with XQuery or XSLT, from the higher-level XML notations that are more convenient for authors to work with. Representing mixed content in RDF would typically involve explicit and tedious representation of sequences of anonymous nodes.

Markdown quickly runs out of power to express complex texts, whether multilingual like the English dictionary or containing chemical formulae and mathematics as in the technical dictionary. Variants that are sufficiently powerful start to stretch what is feasible with ad-hoc text-based syntax, and the extra difficulty of using HTML or XML for the simpler parts probably pays off with consistent markup for the harder parts.

Extended Journal Bibliography

In this example entries for different authors are to be connected; any text formatting is minimal and formulaic. RDF is a strong candidate here. JSON could also be used.

A common need with bibliographical data is powerful full text searching, including similarity, starts-with, lexical containment, proximity within a field or element, and more. The XPath and XQuery Full Text extension was created with this in mind, suggesting that in some environments an XML-compatible representation may be worth investigating. Note that XQuery and XSLT 2 and later are defined to operate on trees which, although commonly created from XML, could come from any source that meets the necessary constraints.

Although Markdown again is not a likely choice, it should be noted that the text-based format pioneered by Mike Lesk for the refer program in the 1970s, and later taken up by BibTeX, is widely used and widely supported in technical and academic communities.

Web-based Authoring Interface

This example considers a Wiki-like situation, with a large and diverse group of authors for most of whom interaction with the Web site is not a major part of their lives, so that they will have little interest in learning about “syntax.”

This is a typical use case for Markdown today. The Markdown markup is embedded in an HTML form, and the user interacts with the Web browser's built-in text editor.

More recently, the content-editable property of HTML elements can be used to support word-processing style editing of parts of documents in place, which may reduce the desire to use Markdown.

Hybrid Approaches

Just as it would be wrong to suggest that the various formats all compete in the same space, so it would be wrong to insist that they stand alone. Some obvious combinations are given in this section, but it is necessarily not an exhaustive list.

RDF and JSON

People are already exchanging linked data using JSON instead of XML or N3 to transmit RDF graphs. This is to be expected since RDF is primarily a format for machine-to-machine communication and programmers like the strong match between JSON and internal data structures.

There are a number of competing formats, including JSON-LD, RDF/JSON, JSN3, JROn and more, although JSON-LD may at the time of writing be winning out.
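As a rough sketch of why this appeals to programmers, a small JSON-LD document can be loaded as ordinary data structures and read as triples. The document below is hypothetical, and flat_triples is a hand-written reading that handles only this flat case with string values; it is not a JSON-LD processor.

```python
import json

# Hypothetical JSON-LD document: the context maps a short name to a URI.
doc = json.loads("""
{
  "@context": {"name": "http://xmlns.com/foaf/0.1/name"},
  "@id": "http://example.org/people/a1",
  "name": "Example Person"
}
""")


def flat_triples(doc):
    """Read a flat document with string values as triples (illustration only)."""
    context = doc["@context"]
    subject = doc["@id"]
    return [
        (subject, context[key], value)
        for key, value in doc.items()
        if not key.startswith("@")
    ]


triples = flat_triples(doc)
```

To the program this is an ordinary dictionary; only the reserved keys beginning with "@" mark it as carrying a graph.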

RDF and XML

There are three main approaches to adding RDF to XML: storing RDF triples explicitly within XML documents alongside other XML information; storing RDF separately from XML, perhaps in a triple store; generating RDF from XML documents. Each has its place as circumstances dictate, and combinations of these methods are also in use.

Converting from RDF to XML (other than serializing RDF as RDF/XML or some other XML representation of RDF graphs) is not useful in general, but the results of querying an RDF graph with SPARQL are often processed with XML tools such as XSLT or XQuery for presentation in human-readable form.
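Processing query results in this way might look like the following sketch, which reads a small document in the SPARQL Query Results XML Format with Python's ElementTree rather than with XSLT or XQuery. The result document, its variable names and its values are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hypothetical result document with one row and two variables.
results_xml = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="author"/><variable name="title"/></head>
  <results>
    <result>
      <binding name="author"><uri>http://example.org/people/a1</uri></binding>
      <binding name="title"><literal xml:lang="en">An Example Title</literal></binding>
    </result>
  </results>
</sparql>"""

root = ET.fromstring(results_xml)

# Column names come from the head element.
variables = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]

# Each result becomes a plain dictionary, ready for presentation.
rows = []
for result in root.findall("sr:results/sr:result", NS):
    row = {}
    for binding in result.findall("sr:binding", NS):
        value = binding[0]  # the uri, literal or bnode child element
        row[binding.get("name")] = value.text
    rows.append(row)
```

From here the rows could be written out as an HTML table or fed to whatever presentation layer is in use.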

Visualizations of RDF graphs as SVG and also using the XML-based GraphML should also be mentioned here.

HTML and XML

Mixing two document formats, rather than a data format and a document format, rarely seems to be productive. The combination of HTML and XML is HTML represented in XML (XHTML). Another combination is found commonly in RSS feeds and Atom, and is escaped HTML inside XML. This is done because HTML (not XHTML) has different syntax rules that conflict with XML, so that one cannot simply embed HTML inside XML.
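The escaping used in feeds can be sketched as follows with Python's ElementTree. The element names item and description follow common RSS usage, and the HTML fragment is invented for the example. Because the fragment is stored as text, the XML serializer escapes its markup, and the recipient gets the original string back after parsing.

```python
import xml.etree.ElementTree as ET

# Invented fragment; the unclosed br element is legal HTML but not well-formed XML.
html_fragment = "<p>Line one<br>line two &amp; more</p>"

item = ET.Element("item")
desc = ET.SubElement(item, "description")
desc.text = html_fragment  # stored as character data, not as child elements

# Serializing escapes the markup characters in the text.
serialized = ET.tostring(item, encoding="unicode")

# A recipient parses the XML and recovers the HTML as a string.
recovered = ET.fromstring(serialized).find("description").text
```

The cost is that XML tools see the description only as an opaque string: the HTML inside it must be parsed separately before it can be queried or transformed.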

Conclusions

It is not possible to give universal recommendations for when to use a particular format because many unforeseeable considerations may apply. For example, local knowledge of particular programming languages or ways of working may dictate consideration of a subset of the formats, or may even mandate the use of a particular format regardless of suitability to task.

The formats discussed here do not compete with one another. They complement one another, and are often used in conjunction with each other.

Balisage Paper: Markup Formats In Context

A comparison of the strengths of some widely-used markup systems
