Figure 2: person-to-person communication using electronic
transportation of digital media objects.
Today if someone wants to write something for the benefit of another
reader, they can use a word processor and either send computer-printed
paper or send electronic files that the recipient can print and read.
Those files are sent sometimes as plain text (electronic mail), or for
longer documents or documents with more complex formatting, as PDF page
images or as word processing files.
Word processing files represent a document complete with formatting
but in an editable form, so that text can re-flow as needed. Such files
generally make use of system resources such as fonts, so that a document
may be differently paginated or, in the case of specialty symbol or
language-specific fonts, may be partly or entirely unreadable. For some
languages (Northern Cree and some of the scripts or writing systems used
in India and in Africa come to mind) it’s customary to write documents
using a font with a custom encoding, as Unicode coverage is (or is
perceived to be) incomplete or insufficient; this means that if the
recipient does not have the right font installed, the document may appear
correct but will have some characters in the document silently substituted
for others. To be fair, this problem exists for all of the document formats discussed
in this paper, but some of the formats alleviate the difficulties,
or at least let documents be explicit about what was done, more than others.
Word processor documents today use complex and proprietary formats
(although increasingly these are represented in XML). This means that they
can be difficult to search, although they are usually easy to revise.
Later versions of a word processor may interpret older files differently,
with or without warning, so that the documents become tied to specific
versions of software running in specific operating environments. Because
word processor formats are (implicitly or explicitly) tied to specific
versions of specific software, as well as to system resources such as
fonts, they are not suitable for archival use.
Portable Document Format (PDF), a document format produced and
maintained by Adobe Systems Inc. of California, USA, has a corresponding
ISO archival standard, although in practice PDF documents can and do make
use of extensions that are not archival. However, PDF files do contain all
needed resources such as fonts and images, and are in most cases
considered to be of archival quality. Software that creates PDF may have
options to create PDF/A, the archival variant.
PDF documents do not generally reflow text if printed or viewed on a
differently sized device than that for which they were created. The page
dimensions are part of a PDF document and cannot easily be altered;
hyphenation has been performed, footnotes have been numbered and so on.
PDF documents can be extremely difficult to read on smaller devices, as
the user may need to scroll horizontally back and forth to read each line.
The formatting of both word processor documents and PDF files is
explicit (except as noted already) and it is therefore possible for a
search engine to process and index the text in them and then to display
formatted results and previews. However, because the formats are not
intended for this purpose, there are some difficulties. For example, PDF
does not require that the creating software explicitly mark word
boundaries, since each “glyph” can be positioned independently from any
other. If word and phrase boundaries are not clearly marked, indexing has
to use heuristics: sometimes one will run across search engine results in
which characters have been joined between paragraph breaks, or where words
have been incorrectly split, or where hyphenated words result in two
smaller sub-words being indexed separately.
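The kind of heuristic an indexer might apply can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine or PDF library works: the glyph tuples, their coordinates and the `gap_threshold` parameter are all invented for the example.

```python
from typing import List, Tuple

# Each glyph is (character, x_start, x_end) on one line of text.
# The coordinates are hypothetical; a real PDF extractor would supply them.
Glyph = Tuple[str, float, float]


def group_glyphs_into_words(glyphs: List[Glyph], gap_threshold: float = 2.0) -> List[str]:
    """Guess word boundaries from the horizontal gaps between glyphs."""
    words: List[str] = []
    current = ""
    previous_end = None
    for char, x_start, x_end in sorted(glyphs, key=lambda g: g[1]):
        # A gap wider than the threshold is assumed to be a word break.
        if previous_end is not None and x_start - previous_end > gap_threshold:
            words.append(current)
            current = ""
        current += char
        previous_end = x_end
    if current:
        words.append(current)
    return words


line = [("t", 0, 4), ("h", 4.5, 9), ("e", 9.5, 14),
        ("c", 18, 22), ("a", 22.5, 27), ("t", 27.5, 32)]
print(group_glyphs_into_words(line))                     # two words
print(group_glyphs_into_words(line, gap_threshold=5.0))  # words joined
```

With a threshold that is too generous the two words are joined into one, which is exactly the sort of indexing error described above.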
Documents that are essentially pictures of documents, whether because of
proprietary or poorly documented file formats, because of insufficient
information, or because the document actually contains bitmap (raster)
images of text rather than the actual text, can pose difficult or
insurmountable problems not only to search engines but also to people who
cannot easily read from the pictures, for example because the lines of
text do not reflow (creating a need for difficult sideways scrolling) or
because the user is relying on a text reader to speak the text out loud
and the document does not actually contain any text. Any system for
mediating communication between humans must be useable by all humans.
Factors for Evaluation
This section describes some of the factors that determine
which format to use in a given situation. There is no complete list because
situational and contextual factors are always the most significant in
practice. Note that evaluation here is not in the sense of deciding one
format to be in some way superior to another, but to suggest
applications for which each is the most suited.
Information Life Cycle
Information that will be archived for future research purposes
must be clear when taken out of context. This might be achieved
through careful documentation and avoiding relying on application-specific
or opaque formats.
Information that will be used once and discarded, such as an API
message in a Web service or notification that a user moved a pointing device
could reasonably be in an application-specific format, but if multiple
programs might make use of the same message then there is greater
value in a more generic format.
Information that will be stored and processed and perhaps queried
will need to be in a format that supports that processing. This is the most
common case for documents today and the least common for data (since the
data is more easily queried in a data store than interchanged en masse).
Self-describing or clearly documented information will generally
make querying easier and
will facilitate recovery from an archive in the future, but that follows for all
possible data formats. However, not all data formats are such that documents
can easily, and routinely do, identify the format used and version of that format.
For example, neither CSV files nor Markdown documents can in any standard manner
identify the specification or language to which they might conform, and
HTML 5 documents do not identify the dated version of the "living standard" to which
they conform.
Audience Language and Culture
Information that contains mixed languages, scripts or dialects will
need a mechanism to indicate this, such as xml:lang in XML or lang in HTML.
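As a small sketch of how a program might read such language markers, the following Python uses the standard library's ElementTree to work out the language in effect for each element, with xml:lang inherited from ancestors. The element names `para` and `label` and the language codes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_in_effect(root):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestors."""
    def walk(element, inherited):
        lang = element.get(XML_LANG, inherited)
        yield element.tag, lang
        for child in element:
            yield from walk(child, lang)
    yield from walk(root, None)


# Hypothetical mixed-language paragraph; "nl" is assumed for the label.
doc = ET.fromstring(
    '<para xml:lang="en">Push the button labeled '
    '<label xml:lang="nl">sokken</label>.</para>'
)
result = list(languages_in_effect(doc))
print(result)
```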
Where human-readable content is included and could be in any language
(now or in the future), rich text (mixed content) will almost certainly be needed,
at a minimum for supporting Japanese or Chinese ruby annotations.
Where text may be translated, in part or whole, a text replacement
mechanism may be needed to make a translated version of a document.
It may also be necessary to mark which parts are to be left untranslated (push the
button labeled sokken: the label on the
physical vending machine on the platform doesn't change just because you have
an English guide book).
Any information presented to people will need to be accessible to
them. This means that accessibility must be built in at all levels.
Some of the formats described in this paper are accessibility-agnostic,
but others can include or encourage user interface elements that can be
harmful or exclusionary; in such cases extra vigilance may be needed on
the part of document authors and system developers.
Relationships between Documents
Sometimes a document or piece of data might stand alone,
but that surely is rare. A document might form part of a sequence, might
contain links, might be contained in, or
be a database, so that joins between
sets of values might be performed.
Link discovery requires a standard vocabulary such as HTML or XLink
or a standard discovery mechanism such as HyTime's architectural forms
for SGML years earlier.
Implicit links, such as might be found by joins, are thus format-dependent;
a dictionary site might make a link out of every word or phrase in a paragraph to
a corresponding definition, but might do so programmatically (often
with poor results in the face of homonyms). This ability is independent of format,
but explicit linking requires syntax as does marking terms not intended to participate
in such links.
Although simple querying can be performed on any of the formats, since
they are text based, structure-aware querying is currently defined only
for some formats, including RDF (SPARQL), XML (XQuery) and (although
not a standard) JSON (JSONIQ).
Structure-based querying often has difficulty when one syntax
is embedded within another: which HTML documents contain a definition
JSON documents contain a string with embedded HTML having a div element with a
particular class attribute. Such hybrid queries can involve complex textual escaping
XQuery systems supporting SPARQL
queries of RDF embedded in XML provide a promising counterexample.
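To make the difficulty with embedded syntaxes concrete, here is a rough Python sketch of one such hybrid query done by hand: it decodes a JSON document, then scans every string in it as HTML, looking for a div element with a given class. The function and class names are invented for the example, and the standard library's `html.parser` is only a tokenizer, not a full HTML 5 parser.

```python
import json
from html.parser import HTMLParser


class DivClassFinder(HTMLParser):
    """Records whether a div with the wanted class token was seen."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.found = True


def strings_in(value):
    """Yield every string value anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from strings_in(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from strings_in(item)


def json_has_div_with_class(json_text, wanted_class):
    """True if any string in the JSON holds HTML with a matching div."""
    for text in strings_in(json.loads(json_text)):
        finder = DivClassFinder(wanted_class)
        finder.feed(text)
        finder.close()
        if finder.found:
            return True
    return False


message = json.dumps({"id": 7, "body": '<div class="definition note">A sock.</div>'})
print(json_has_div_with_class(message, "definition"))
```

Two parsers and an unescaping step are needed for what would be a single path expression if the content were in one syntax.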
Documents on the open Web need to be findable, and that generally
means that search engines will need to parse them and then in response
to user searches generate result snippets, short extracts that users can
use to decide whether to read the longer document. Phrase and word
breaks and basic formatting are necessary for the snippets.
Default formatting is also needed for operations such as copy and paste.
Although validation is a dirty word in some HTML circles, in other
circles it's an essential part of doing business: context determines function.
Validation can be at the syntax checking level, or at the business logic level
(every invoice must have a date, a customer number and an amount), or can be
at the application level (the file is OK if the program reads it). Of these, the application
level validation is the most powerful (arbitrary code) and the least portable. A standard
way to express business or grammar rules means that documents can be tested
against multiple programs and can also serve as documentation over time.
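The invoice rule can be sketched as application-level code in Python. The field names below are assumptions for the example. Note that this is exactly the non-portable kind of validation described above: the rule lives in one program rather than in a schema that other tools could share.

```python
# Hypothetical field names for the "every invoice must have..." rule.
REQUIRED_INVOICE_FIELDS = ("date", "customer_number", "amount")


def invoice_problems(invoice: dict) -> list:
    """Return a list of business-rule violations (empty if none)."""
    problems = []
    for field in REQUIRED_INVOICE_FIELDS:
        # A field counts as missing if absent, empty or null.
        if field not in invoice or invoice[field] in ("", None):
            problems.append(f"missing {field}")
    return problems


good = {"date": "2015-08-11", "customer_number": "C-42", "amount": 19.5}
bad = {"date": "2015-08-11"}
print(invoice_problems(good))
print(invoice_problems(bad))
```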
A document may contain components with identifiable data types such as
"numbers" or "sequence of characters, string" or "truth value". This is essential
for data binding and object dumping (as in JSON), but for some other systems it's
important to support user-defined types such as sock-colour or MailingAddress.
The constructs that a data format can represent should match the
objects that a program needs in order to manipulate that data. If a
format is too difficult to process it will not be popular with programmers.
This must be balanced by the fact that programmers may not be the
only, or even the most critical, stakeholders in a project.
In some cases (and some contexts)
a compromise can be reached using scripting languages
The need to process data is intrinsic to computing with data;
having standard data processing and transformation languages can
help with staffing needs as well as system portability and longevity at
the expense of using languages that are not necessarily optimized for the particular
task at hand.
One of the decisive factors for many projects in the past has been
whether the goal of using markup is to model information (which may
exist outside of the marked-up document, for example in a physical
book or manuscript being transcribed or quoted, or an existing
business process) or whether it is to guide presentation.
Markup as part of information modelling can be contrasted
with markup as a syntax for conveying data, such as node-and-arc graphs or
objects, which themselves may represent (or be) models.
Comparison of Formats
So far this paper has introduced some use cases and (indirectly)
markup formats. This section summarizes the strengths and weaknesses of
each format using the factors for evaluation described above, after a brief introduction
to make clear what is meant in this paper by each format.
It should be stressed that this is not a complete list of markup
formats; the goal of this paper is to help the reader choose among several
of the most likely formats to be used today, and to provide a starting-point
Plain Text (Unstructured)
Mentioned here only for completeness, plain text files with no claim
to using any particular markup strategy can be read by humans and if
there is some regular ad-hoc syntax then a program can read the file,
but there is no Network Effect: if the syntax were widely enough used to
have multiple implementations and a user community it would no longer be
considered a plain text file, but would have identifiable
Since plain unstructured text does not by itself constitute a
markup language, it will not be compared further.
Markdown
Although there are a number of mostly-compatible variants of Markdown,
in this paper we will imagine a world in which a single variant dominates.
The stated intent of Markdown is as a text to HTML conversion tool for
Life Cycle: because Markdown is not a standard, variations between
versions may mean Markdown is not ideal for archiving. This is
exacerbated because Markdown files are not self-describing: they do not
label themselves as Markdown and do not identify the version of Markdown
to which they conform.
Audience, Language and Culture: Markdown is not internationalized. Lack
of support of mixed language paragraphs, indications of language in use,
explicit right-to-left markup, Ruby annotations and
script selection may make it unsuitable for mixed language content. Lack of
named identifiers for sections and paragraphs may make it difficult to keep
translations in sync.
Universal Access: Markdown has limited support for HTML
accessibility from a reader perspective; on the other hand Markdown has
found a use for people writing blogs, because it can fairly easily be
created in a text editor and uploaded, avoiding the user interface for
the blogging system.
Situations: Markdown is suitable for simple computer-mediated
human-to-human communication, since Markdown files can easily be read in
their text form as well as when converted to HTML. Markdown cannot
represent complex documents such as mathematical research papers.
Relationships: Markdown supports explicit URL-based links.
Default formatting: Markdown files can be seen as text files or as HTML,
and it is reasonable to say that, although not as powerful or widely
supported as HTML in this regard,
Markdown documents are transparent with respect to the author's formatting
Data Typing and Validation: not provided except for basic syntax checking.
Program Compatibility: Markdown is not significantly easier to process
in programs than HTML, and a common way to process it is in fact to convert
it to HTML first.
Use case: Markdown is primarily used where a text-based "rich text" is
needed for people uncomfortable dealing with HTML or XML directly, and where
no tools are available.
Information Modelling: not attempted.
JSON
data that can easily be instantiated as programming-language-level objects by
most of the major programming languages. JSON is included in this paper
because, even though it is not perhaps a markup language, and does not attempt
to be particularly suited for textual documents, it is widely seen as a replacement
for XML in Web services and interactive Web usage (AJAX), where JSON strings contain
escaped fragments of HTML.
Situations: JSON is intended for program-to-program communication.
Life Cycle: JSON is primarily aimed at information that will be used once and discarded,
such as search results communicated from a Web server to a Web browser. However,
today there are databases for storing and querying "JSON documents".
Audience, Language and Culture: JSON documents do not have standard
ways (at the time of writing) to mark the natural language used for text strings;
however, it is possible to embed escaped HTML strings in JSON, and these can contain
language tags. JSON is not intended as an authoring format for textual documents.
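A minimal Python sketch of that workaround follows. The property name `html` and the fragment are invented for the example, and the language code "nl" is an assumption. The language information survives the round trip, but only inside an escaped string that JSON tools themselves know nothing about.

```python
import json

# Hypothetical HTML fragment carrying its own language tags.
fragment = '<p lang="en">Push the button labeled <span lang="nl">sokken</span>.</p>'

# Encoding escapes the double quotes inside the string value.
encoded = json.dumps({"html": fragment})
print(encoded)

# Decoding recovers the original markup unchanged.
decoded = json.loads(encoded)
print(decoded["html"] == fragment)
```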
Universal Access: since JSON is intended for program-to-program communication
this is not an issue. It is up to the creator of any HTML embedded inside JSON to
ensure accessibility, however.
Relations between Documents: JSON documents represent objects
with simple names; if it's known through some external source that the same
name in multiple documents represents the same information then database
query languages can associate the information. Additionally, JSON strings might
include escaped HTML markup with links, but there is no meaningful way to point
into a JSON file with a link, nor is there a standard meaning. JSON Schema defines
a mechanism to point to JSON objects using a reserved name, "id".
The JSONIQ query language gives an extended XPath-like syntax, and there are
other ways to refer to the inside of a JSON document, but pointing into an
object in a computer program isn't the same as linking to part of a document.
There are no widely used ways to transform JSON objects outside of a
programming language, although there is (or will be) JSON support in
XQuery 3.1, XSLT 3 and JSONIQ.
Default Formatting: There is no default presentation for JSON objects
beyond the "source code view" of the actual document.
Validation and Data Typing: The IETF JSON Schema language is still a
draft, and does not have large traction yet, but is gaining maturity. It
was influenced by XML Schema but does not support user-defined data
types. It is intended for use at a programmer and API level, not at a
Program Compatibility: This is the greatest strength of JSON: JSON
source code of programs, they can be read with "eval" (although security
implications suggest this should be preceded with validation) and they
Although usage in other programming languages typically requires a
library, JSON's data structures usually map exactly onto data structures
in popular programming languages, unlike (for example) HTML or XML,
where attributes and mixed content must be modeled in terms of such data
Information Modelling: JSON is all about program modelling and not
information modelling. It's just syntax: one can map from SGML or HTML
or XML into JSON, but the primary strength of JSON is its convenience
for developers, not its ease (or otherwise) of modeling information.
Another indication of the JSON culture is that JSON Schema does not
provide for user-defined types, just number, string, boolean, array,
object and null. Schema authors can restrict the value space to say that
a field called socks_owned must be a whole number not less than zero,
but cannot say that socks_owned is of type socks_count; this reflects
the way people think about documents or objects outside the
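The socks_owned restriction can be sketched as follows. The schema is written as a Python dictionary in the style of JSON Schema ("integer" and "minimum" are JSON Schema keywords), but the checking function is a hand-rolled stand-in that understands only the keywords used here. It is not a JSON Schema implementation.

```python
import json

# Schema in the style of JSON Schema: a value-space restriction,
# with no way to name the type "socks_count".
socks_schema = {
    "type": "object",
    "properties": {"socks_owned": {"type": "integer", "minimum": 0}},
    "required": ["socks_owned"],
}


def check_socks(document: dict, schema: dict = socks_schema) -> bool:
    """Hand-rolled check covering only the keywords used above."""
    for name in schema["required"]:
        if name not in document:
            return False
    for name, rule in schema["properties"].items():
        if name not in document:
            continue
        value = document[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if rule["type"] == "integer" and (
            isinstance(value, bool) or not isinstance(value, int)
        ):
            return False
        if "minimum" in rule and value < rule["minimum"]:
            return False
    return True


print(check_socks(json.loads('{"socks_owned": 12}')))
print(check_socks(json.loads('{"socks_owned": -1}')))
```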
HTML
The HyperText Markup Language, standardized first at the IETF and
the ISO and later at W3C, is a fixed markup language aimed at delivering
documents to the World Wide Web. It is a vocabulary largely controlled by
Web browser makers.
A recent variant, HTML 5, adds support for "Web Components",
essentially user-defined HTML elements with content templates and
Unfortunately, HTML 5 is a "living standard" and features come and go
from time to time. This is balanced by excellent support from Web
browsers and clear documentation (in almost all cases) on exactly how a
Web browser should recover from errors.
Situations: HTML is primarily intended for computer-mediated human
to human communication of documents, but it is also increasingly used today for
computer-to-human interactions with "Web Applications."
HTML is also used for computer-to-computer messages, but in this case
the error recovery rules employed by Web browsers and by conforming HTML 5
implementations may not always be appropriate. Silent correction or acceptance
of errors has in other languages and systems famously led to deaths in space
missions and other engineering problems.
Information Life Cycle: HTML is implemented in perhaps a dozen
or more Web browsers, with a very large deployment. As a result it is difficult
for HTML to change in incompatible ways. Nonetheless, changes of that kind
are often attempted, and, as a result, archived HTML documents
need to be explicit about the version of HTML they used.
The culture of HTML tends to be very much aimed at Web browser use.
As such, behavioural and presentation semantics are emphasized, with
"semantic" elements such as section and article being hailed as an
advance over equally generic names such as div. Again, the challenge
here for archiving is that the actual meanings of markup constructs will
continue working over a period of decades and may or may not
sufficiently describe behaviour and intent.
A large number of content management systems and databases for
storing HTML exist; some of them prefer XHTML, which can be parsed more
reliably; see the next section for more details.
Audience, Language and Culture: HTML has strong internationalization
and localization support, especially when used in conjunction with the Internationalization
Tag Set (ITS). Individual elements down to the word or sub-word level can
be marked for language, region and script, and can be marked as not to
be translated. Ongoing work, for example in supporting all forms of Chinese
and Japanese ruby annotations, is improving the situation still further, but,
overall, HTML offers one of the best formats for international and multilingual
Early versions of HTML, unfortunately, put human-readable content
such as alternate replacement text for when an image is not available, in
attributes, precluding markup for mathematics, for Ruby annotations, for
emphasis; this defect is slowly being corrected, for example with the picture element.
Universal Access: Extensive and very helpful information is available for
document and application authors working with HTML. There are plenty of
challenges since not all HTML documents are automatically accessible, but
that is also true of other rich formats, especially when they are scriptable.
A complex system of fallbacks makes it possible to write Web applications that
will work on a wide range of devices and with assistive technologies such as
text readers, alternate pointing devices and even Braille terminals.
Relationships between Documents: HTML has a rich vocabulary
for representing relationships from one document to another, including
explicit hypertext links and link relations as well as implicit links (for example
with URI Templates) and links between information and remote descriptions
with microdata and RDFa annotations.
There is no automated mechanism today for link discovery when links
There is no widely-deployed standard HTML querying language, and there is no standard
way in HTML to represent relationships between documents outside of any document.
Default Formatting: HTML today is used for the representation and
formatting of best-selling printed books; it is not as sophisticated as
other publishing platforms but it is growing rapidly in that area. HTML
documents have default associated formatting, although an increase in
the use of cascading style sheets to redefine the formatting and purpose
of elements can weaken that, and should be avoided.
Validation: There are widely-used syntax checkers for HTML, such as
that at validator.nu and the W3C HTML validator. Validation at the business
level, for example to say a heading must be followed by a paragraph, must
be handled with other mechanisms, such as by using XHTML and XML Schema.
Data Typing: HTML did not define any specific data model until HTML 5;
before that, although the HTML DOM was widely used, it was not mandated by HTML.
Program Compatibility: Unlike JSON, HTML documents cannot easily be
Attempts to alleviate this, such as the popular jQuery library, have
been largely successful where they are available. HTML is not a strong
choice for object serialization and deserialization, which is why JSON
Information Modelling: HTML documents are closely (and increasingly) tied
to Web browser design. HTML is adequate in many cases for modelling a blog,
although it does not have standard support for song lyrics, poems, footnotes,
or a host of other basic rhetorical forms and devices.
XHTML
There are two main versions of XHTML in use today, and two meanings of
the term; XHTML 1 was designed to be an XML-based version of HTML 4 which can
be served to Web browsers as either XML or HTML. XHTML 5 is an XML serialization of
HTML 5 with the same goal: that when a Web browser reads an XHTML 5 document it
creates the same internal representation (DOM) regardless of whether the HTML or
the XML syntax was used. XHTML 5 is not, however, a successor to earlier
versions of XHTML.
All of the considerations for HTML apply to the XML syntax for HTML,
except that parsing of XHTML as XML means first that errors may be treated as
fatal and second that XML tools can be used with XHTML documents.
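The difference in error handling can be illustrated with Python's standard library. This is a sketch under assumptions: `html.parser` is a lenient tokenizer standing in for HTML-style processing, and does not implement the HTML 5 error recovery rules that browsers follow. The XML parser, by contrast, rejects the same input outright.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start tags, accepting whatever markup it is given."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(text):
    """True if the text is well-formed XML; errors are fatal."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


broken = "<p>An unclosed <em>emphasis element.</p>"

collector = TagCollector()
collector.feed(broken)
collector.close()

print(collector.tags)         # the lenient reader carries on
print(parses_as_xml(broken))  # the XML parser refuses the document
```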
RDF and Linked Data
The Resource Description Framework, RDF, is a standard for
representing metadata as sets of decontextualized triples of atomic
values that form a (possibly disconnected) graph. RDF is most often
exchanged in three formats: RDF/XML; Turtle (a text-based syntax); and
SPARQL Results in XML, a format intended to be transformed (often with
XSLT) into a user-visible format such as HTML or SVG.
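The triple model itself is simple enough to sketch without an RDF library. In the Python below a graph is just a set of three-part tuples, and a query is a pattern with wildcards. The example.org URIs and the `match` helper are invented for illustration, and the string literal "Alice" glosses over RDF's distinction between URIs and literals.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

EX = "http://example.org/"  # hypothetical namespace for the example

# A graph is an unordered set of (subject, predicate, object) triples.
graph: Set[Triple] = {
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob", EX + "knows", EX + "carol"),
    (EX + "alice", EX + "name", "Alice"),
}


def match(graph: Set[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


for triple in match(graph, predicate=EX + "knows"):
    print(triple)
```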
Linked Data (LD) is a name for the practice of publishing and
combining RDF-based graphs; it is mentioned here in the context of
making abstract RDF graphs available from documents.
Situation: RDF is primarily used in computer-to-computer communication,
although many RDF data sets are hand-authored.
Information Life Cycle: RDF documents are frequently stored in databases,
whether hybrid or RDF-only (RDF-only databases are often called triple stores).
Although RDF can be used for one-off communication it is more often
stored and queried. RDF is also commonly embedded in other formats,
especially HTML. The most common standard querying language for RDF is SPARQL.
Since RDF uses URIs, and URIs are defined to be opaque and
meaningless to an outside observer, RDF is strictly speaking not self
describing. In practice, though, URIs are normally made from natural
language words and represent what those words name. Most RDF
serializations do identify the file as conforming to a specific version
Audience, Language and Culture: RDF nodes have opaque identifiers
that are not in any natural language. It is possible to create
"labelFor" nodes in the RDF graph and give them language tags, although
it should be noted that RDF does not handle XML or HTML style mixed
The Linked Data culture wants all information about everything and
everyone to be public. Privacy and security remain challenges for the
various RDF communities. A talk at XML Prague suggested storing
RDF graphs in XML databases and using XQuery to construct a set of
triples for SPARQL queries based on security, but this should probably be seen as
an outlier; in the long term one can expect SPARQL itself to learn about security.
A technical challenge is that there is nowhere in a triple to store sharing or security
Universal Access: RDF, like JSON, does not have any inherent user
interaction. Graphical visualizations, however, can be a challenge for
people who are not able to see them, and alternatives therefore need to
Relationships between Documents: RDF is all about relationships,
but, oddly, cannot easily refer from one graph to another. RDF named graphs
(new in RDF 1.1) may provide a mechanism there, but it is too soon to measure
Default Formatting: RDF documents do not have textual representations
other than (like JSON) as source. However, they are conventionally represented as
node and arc graphs. This visual representation conveys the overall structure of a
graph but not necessarily the actual content.
Validation: There has been recent work on RDF Shape Expressions for
constraining the shape of RDF graphs; this is not yet deployed.
Data Typing: RDF does support associating data types with values, and
these can be user defined.
Program Compatibility: The RDF model is graph based, not object based,
and does not correspond to the native data structures and type systems of
modern programming languages. However, those same current languages are
easily able to represent RDF graphs, and there is no mixed content to
Information Modelling: RDF is about modelling knowledge, not information.
It is a knowledge representation system used primarily for first-order logic and
XML
The Extensible Markup Language, defined at W3C as a subset or profile
of SGML (and originally known as Web SGML), is not really a single markup language
like HTML, but instead a framework for defining one's own markup languages,
all of which have a common syntax.
This paper distinguishes where appropriate between arbitrary XML documents and
documents in some specific XML-based markup language such as XHTML 5 or
Situation: XML is used in all areas of communication: person to
person, person to computer, and computer to computer, and can to some
extent also be used without computer mediation (that is, text-oriented
XML documents can be moderately readable, although not as much as
Information Life Cycle: XML documents have a life cycle that depends on
how they are used more than on the fact they are XML. For example, a message
from an automobile engine to a garage mechanic's diagnostic system, a
message from one operating system component to another when a user double-clicks
on a desktop icon, a transcription of an Anglo-Saxon poem,
a health-care provider's record of treatment for a patient, all are likely to be in XML,
and each have different longevity and processing characteristics.
Trees based on parsing XML documents can be stored in relational, XML-native or hybrid databases,
and the XQuery language can be used to access them efficiently.
Audience Language and Culture: XML documents can support all of the
internationalization features of HTML and XHTML, but it depends on the specific
XML vocabulary. If you are designing an XML representation for text you should
consider adopting the HTML model where possible because of widespread
understanding and adoption.
The W3C Internationalization Tag Set (ITS) can be used directly in XML to
help with translation and localization.
Universal Access: Again, this depends on the ways in which the XML documents
are used. Awareness of the W3C Web Content Accessibility Guidelines can help
document designers to create accessible systems using XML.
Relationships between Documents: The XLink specification has not
gained much traction, and today people are more likely to use an ad-hoc attribute
called href, or possibly to use the HTML "a" element
by means of an XML namespace. It is also possible to embed RDF in XML documents.
Default Formatting: This is one of the two biggest weaknesses of XML:
since there are no default presentational semantics,
search engines cannot generate reliable snippets for results. Using XML on
the World Wide Web can therefore be a problem.
Validation: XML has a wide variety of validation mechanisms, from simple
and widely-supported DTDs, through to the baroque complexities of W3C XML Schema.
A part-way compromise is RELAX NG, but this does not perform the data binding
role of XML Schema, as described in the next paragraph. User-defined data types and
compound types are available.
Data Typing: XML Schema validation can assign type annotations to
elements in the parsed XML tree; type labels can be user-defined type names as well
as built-in types. Note that RELAX NG does not support assignment of type
annotations in a deterministic way, so that XML Schema is generally used where
data binding (object loading and dumping) is required.
Program Compatibility: This is the second of the two main weaknesses of XML:
the concept of an annotated tree of nodes is not a native data structure in most
programming languages. As with HTML, mixed content such as paragraphs with embedded
elements considerably complicates processing.
The situation is mitigated by the popularity of XSLT and XQuery, XML-specific
languages for querying and manipulating trees.
Information Modelling: This is the greatest strength of XML: that it
can be used, and culturally is used, to
model documents or other information outside of any particular
application or process. This strength comes at a cost: because XML
documents are usually independent of any one program they are also not
optimized for processing by any one program, and this can make XML
unpopular with application developers.
Some Use Cases
This section gives examples chosen to illustrate a typical use case
for each of the main formats discussed, together with indication of how to
represent the example in the other formats.
An Object Dump
types and values; the obvious choice is JSON, which was designed for
One could use RDF instead; direct mappings from UML to RDF exist.
But then a library would be needed, and the various transfer syntaxes of
JSON also needs a library, or where JSON does not map well to objects,
RDF may be a stronger contender.
XML is also commonly used for object dumps. A library is needed,
both for serialization and for loading, but such libraries exist for
most languages. Since object dumps tend to be specific to a particular
state of a particular program at a particular time, they are not easily
reused by other programs; JSON may be better suited in that case. The
strongest use cases for XML are when documents will be used in multiple
The lack of standard transformation tools for JSON (compared to XML
for example) is likely to be short-lived; there are several contenders
as well as native-JSON NoSQL databases in widespread use.
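The trade-off can be sketched with a toy object dumped both ways. The Point class, the type attribute on the XML elements and the helper names are assumptions made for this illustration, not part of any standard binding; only the Python standard library is used.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class Point:
    """Hypothetical object to be dumped and reloaded."""
    x: int
    y: int
    label: str


def to_json(obj):
    # JSON carries numbers and strings natively, so no type labels are needed.
    return json.dumps(asdict(obj), sort_keys=True)


def from_json(text):
    return Point(**json.loads(text))


def to_xml(obj):
    # XML content is text, so this sketch records each field's type
    # in an invented "type" attribute.
    root = ET.Element("Point")
    for name, value in asdict(obj).items():
        child = ET.SubElement(root, name)
        child.set("type", type(value).__name__)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    converters = {"int": int, "str": str}
    root = ET.fromstring(text)
    fields = {c.tag: converters[c.get("type")](c.text or "") for c in root}
    return Point(**fields)


origin = Point(1, 2, "origin")
print(to_json(origin))
print(to_xml(origin))
```

The JSON route needs almost no code because the data model already matches the object; the XML route needs a typing convention, which is the job that XML Schema type annotations do in a real data-binding library.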
A Technical Dictionary
In this example an organization edits a complex dictionary and
produces editions in print, in PDF, in HTML on a subscriber-only Web
site and in EPUB for ebook readers. Subsidiary products are also
produced and might include a dictionary defining only terms needed for
specific high-school (K12) or undergraduate courses, or subsets
containing, say, only entries that mention a specific compound.
Dictionaries are examples of documents that often feature mixed
content very heavily: superscript and subscripts, mathematics, terms
that are to link to definitions, multiple languages, symbols and small
diagrams may all occur in running text. Even a simple English dictionary
may contain relatively mixed content, as in the example in Figure 6.
Figure 6: A definition from a 1730s dictionary showing mixed
English and Greek used in an etymology.
Since EPUB 3, used for electronic books, is essentially a "Web site on
a stick", there is considerable pressure to use HTML. However, custom
markup can support business-level validation (for example, every major
definition must have at least three examples), and can help with research
A compromise is to use (X)HTML augmented using ARIA attributes to
provide so-called structural semantics, with microdata, or
even with custom XML elements; since HTML 5 Web Components provide a
standard way to add elements, this approach is likely to become popular.
However, enforcing appropriate markup on authors may be necessary to
preserve the value of the work, and that may suggest a custom XML-based
markup with transformations to HTML as needed. Multilingual mixed
content is today the home turf of the XML team.
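The business-level rule used as an example above (every major definition must have at least three examples) is easy to check once the markup names the things the rule talks about. The following is a minimal sketch in Python; the element names dictionary, entry, definition and example, the level attribute and the checking function are all hypothetical, and a production system would more likely express the rule in a schema language or in Schematron.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom markup for two dictionary entries.
DICTIONARY = """\
<dictionary>
  <entry id="acid">
    <definition level="major">
      <example>citric acid</example>
      <example>acetic acid</example>
      <example>sulfuric acid</example>
    </definition>
  </entry>
  <entry id="base">
    <definition level="major">
      <example>sodium hydroxide</example>
    </definition>
    <definition level="minor"/>
  </entry>
</dictionary>"""


def entries_lacking_examples(xml_text, minimum=3):
    """Return (entry id, example count) for each major definition below minimum."""
    root = ET.fromstring(xml_text)
    failures = []
    for entry in root.iter("entry"):
        for definition in entry.iter("definition"):
            if definition.get("level") != "major":
                continue
            count = len(definition.findall("example"))
            if count < minimum:
                failures.append((entry.get("id"), count))
    return failures


print(entries_lacking_examples(DICTIONARY))   # [('base', 1)]
```

The same check against generic HTML would first have to infer which paragraphs are definitions and which are examples, which is the argument for custom markup.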
RDF metadata can be embedded in dictionary entries, or, more likely,
generated on the fly, perhaps with XQuery or XSLT, from the higher-level
XML notations that are more convenient for authors to work with.
Representing mixed content in RDF would typically involve explicit and
tedious representation of sequences of anonymous nodes.
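To give a feel for that tedium, the sketch below turns one short piece of mixed content into an RDF collection built from anonymous (blank) nodes linked with rdf:first and rdf:rest. The example.org vocabulary, the def and foreign element names and the to_triples function are assumptions for illustration, and the sketch handles only one level of nesting; triples are shown as plain Python tuples, not produced with an RDF library.

```python
import itertools
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/terms#"   # hypothetical vocabulary


def mixed_content_items(element):
    """Yield text runs and child elements in document order."""
    if element.text:
        yield element.text
    for child in element:
        yield child
        if child.tail:
            yield child.tail


def to_triples(element):
    """Represent an element's mixed content as an rdf:List of blank nodes."""
    counter = itertools.count(1)
    items = list(mixed_content_items(element))
    cells = [f"_:b{next(counter)}" for _ in items]   # one list cell per item
    triples = []
    for i, (cell, item) in enumerate(zip(cells, items)):
        if isinstance(item, str):
            triples.append((cell, RDF + "first", f'"{item}"'))
        else:
            node = f"_:b{next(counter)}"              # blank node for the element
            triples.append((cell, RDF + "first", node))
            triples.append((node, RDF + "type", EX + item.tag))
            triples.append((node, RDF + "value", f'"{item.text or ""}"'))
        rest = cells[i + 1] if i + 1 < len(cells) else RDF + "nil"
        triples.append((cell, RDF + "rest", rest))
    return triples


definition = ET.fromstring("<def>from Greek <foreign>logos</foreign>, word</def>")
for triple in to_triples(definition):
    print(triple)
```

Three items of running text become eight triples and four blank nodes, before any language tagging or deeper nesting is considered.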
Markdown quickly runs out of power to express complex texts, whether
multilingual like the English dictionary or containing chemical formulae
and mathematics as in the technical dictionary. Variants that are
sufficiently powerful start to stretch what is feasible with ad-hoc
text-based syntax, and the extra difficulty of using HTML or XML for the
simpler parts probably pays off with consistent markup for the harder
Extended Journal Bibliography
In this example entries for different authors are to be connected;
any text formatting is minimal and formulaic. RDF is a strong candidate
here. JSON could also be used.
A common need with bibliographical data is powerful full text
searching, including similarity, starts-with, lexical containment,
proximity within a field or element, and more. The XPath and XQuery Full
Text extension was created with this in mind, suggesting that in some
environments an XML-compatible representation may be worth
investigating. Note that XQuery and XSLT 2 and later are defined to
operate on trees which, although commonly created from XML, could come
from any source that meets the necessary constraints.
Although Markdown again is not a likely choice, it should be noted
that the text-based format pioneered by Mike Lesk for the refer
program in the 1970s, and later taken up
by BibTeX, is widely used and widely supported in technical and academic
Web-based Authoring Interface
This example considers a Wiki-like situation, with a large and
diverse group of authors for most of whom interaction with the Web site
is not a major part of their lives, so that they will have little
interest in learning about “syntax.”
This is a typical use case for
Markdown today. The Markdown markup is embedded in an HTML
form, and the user interacts with the Web browser's built-in text
More recently, the contenteditable attribute of HTML elements
has made it possible to support word-processing style editing of parts
of documents in place, which may reduce the desire to use Markdown.