How to cite this paper
Tai, Andreas. “WebVTT versus TTML: XML considered harmful for web captions?” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). https://doi.org/10.4242/BalisageVol10.Tai01.
Balisage: The Markup Conference 2013
August 6 - 9, 2013
Balisage Paper: WebVTT versus TTML: XML considered harmful for web captions?
Institut fuer Rundfunktechnik, Munich
Copyright © 2013 Institut fuer Rundfunktechnik GmbH
This paper investigates why the XML standard TTML was rejected as a timed text format
by the WHATWG and shows how this relates to the discussion about the use of XML on
the web.
Table of Contents
- Established industries versus emerging user communities
- Quality control versus non-draconian error handling
- Don't reinvent the wheel versus keep it simple
- Readability: curse or blessing?
- Appendix A. TTML and WebVTT document samples
In 2010 the WHATWG mailing list discussed which existing timed text format should
be used as a base to distribute timed text with the newly defined track element in
HTML5 [P10]. The most important use case for timed text was the display of captions and subtitles
with a related video.
In the end, the format chosen was not TTML [TC10] (formerly specified as DFXP), the existing XML standard for timed text, but SRT, a text-based format for subtitles
that originated in the web user community and had already been implemented in a
large range of video players [H10]. After the adoption of SRT by the WHATWG, the timed text format was first called
WebSRT and later renamed WebVTT.
The arguments exchanged in the mailing list threads were not limited to timed text and the
domain of subtitling. The decision against TTML was also a decision against XML and
other standards from the XML family.
The importance of having a new timed text format is driven by the large increase in
video content distribution over IP-based networks. The demand for subtitles to show
with that video is rising as well. In some regions the provision of subtitles with
online content is even a regulatory obligation (see for example [FCC12]).
Proprietary formats aside, the TTML and WebVTT standards currently receive the most
attention from the market for the distribution of subtitles on the web.
The competition of these two standards follows a pattern which is similar to the one
seen with the adoption of XML for the web. More than a competition between different
technologies, it appears to be a clash of the ‘web culture’ with the ‘XML culture’.
The unresolved problems in this relationship could have a blocking effect on the progress
of IP-based delivery of subtitles and on web-accessibility in general.
In 2003 the W3C set up a working group to specify a timed text markup language [TT03]. According to the requirements, which were laid out already in 2002 [TT02], the goal was to define a non-proprietary, standardized format that could be used
for displaying text synchronized with other elements such as audio and video.
In the requirements use cases such as karaoke, credit rolls and text overlays were
listed next to the main use case: the display of subtitles. One of the top priorities
for the targeted architecture was to have an XML representation of the format.
Members of the working group represented, amongst others, a broadcaster, a professional
engineering association from the moving picture industry, a research institution on
accessible media and vendors of existing video players for the web. TTML was published
as a detailed standard in November 2010 as Timed Text Markup Language (TTML) 1.0 [TF10].
A simple TTML example illustrates the expression of a two line subtitle:
<p begin="00:00:00.000" end="00:00:02.000">
This is a subtitle<br/>
on two lines
</p>
In the TTML presentation semantics a p element represents a block-level element and the
br element inserts a line break. The begin and end attributes specify the timecodes
between which the subtitle should be shown.
Like other local names of TTML elements that represent subtitle content, p and br are
derived from (X)HTML. Although they are defined in a different namespace, their semantics are similar.
TTML uses a subset of the XML Infoset to formalize the data model.
A reduced notation of the p information element would be:
<p
  begin = <timeExpression>
  end = <timeExpression>>
  Content: (#PCDATA | br)*
</p>
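Because the data model is defined on the XML Infoset, any generic XML parser can read the reduced p element described above. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name read_subtitle and the sample string are illustrative assumptions; the namespace URI is the one used in Appendix A.

```python
import xml.etree.ElementTree as ET

# TTML namespace URI as declared on the tt root element in Appendix A.
TTML_NS = "http://www.w3.org/ns/ttml"

# Hypothetical sample: the two-line subtitle, with the namespace declared inline.
SAMPLE = (
    '<p xmlns="http://www.w3.org/ns/ttml" '
    'begin="00:00:00.000" end="00:00:02.000">'
    "This is a subtitle<br/>on two lines</p>"
)


def read_subtitle(xml_text):
    """Return (begin, end, lines) for a single reduced TTML p element."""
    p = ET.fromstring(xml_text)
    if p.tag != f"{{{TTML_NS}}}p":
        raise ValueError("expected a TTML p element")
    # Text before the first br is the first subtitle line.
    lines = [(p.text or "").strip()]
    for child in p:
        # The reduced content model only allows br children.
        if child.tag != f"{{{TTML_NS}}}br":
            raise ValueError(f"unexpected element {child.tag}")
        # Text following a br starts a new subtitle line.
        lines.append((child.tail or "").strip())
    return p.get("begin"), p.get("end"), lines


begin, end, lines = read_subtitle(SAMPLE)
print(begin, end, lines)
```

The sketch only covers the reduced content model (#PCDATA | br)*; a full TTML processor would also have to handle span, styling and region references.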
In 2004 a group of W3C members decided to split off from the W3C HTML specification
effort and to advance HTML outside of, but in close collaboration with, the W3C. The Web Hypertext
Application Technology Working Group (WHATWG) was set up with the goal to “create
technical specifications that are intended for implementation in mass-market web browsers,
in particular Safari, Mozilla, and Opera” [W04].
Although WebVTT is currently specified by the W3C Web Media Text Tracks Community
Group [W13], the decision to use SRT as its base was taken by the WHATWG and the initial
specification was written with Ian Hickson as responsible editor. Priorities
for the choice of a timed text format included simplicity, compatibility with existing
players and integration into the HTML5 specification effort [H10].
A simple two line subtitle would be expressed in WebVTT as a text cue:
00:00:00.000 --> 00:00:02.000
This is a subtitle
on two lines
The delimiter for begin and end timecodes is the string “-->”. A character sequence
representing a newline (LF, CR or CRLF) separates the timing information from subtitle
text and breaks a subtitle text line.
WebVTT does not use a formal grammar to describe the syntax but a sequence of rules
written in normative prose. A reduced definition of the text cue shown in the example
would be:
- A WebVTT timestamp representing the start time offset of the cue.
- The string "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
- A WebVTT timestamp representing the end time offset of the cue.
- A WebVTT line terminator.
- Zero or more WebVTT cue text spans, representing the text of the cue, each optionally separated from the next by a WebVTT line terminator.
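The reduced cue definition above can be approximated with a regular expression. The following is a minimal sketch, not the normative WebVTT parsing algorithm: the function name parse_cue and the simplified timestamp pattern are assumptions made for illustration, and cue identifiers and cue text markup are not handled.

```python
import re

# Simplified timestamp pattern (assumption): optional hours, then mm:ss.ttt.
TIMESTAMP = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"

# Start timestamp, the string "-->", end timestamp, optional cue settings.
TIMING_LINE = re.compile(
    rf"^({TIMESTAMP})[ \t]+-->[ \t]+({TIMESTAMP})(?:[ \t]+(.*))?$"
)


def parse_cue(block):
    """Return (start, end, text_lines) for one cue block, or None if it does not match."""
    # A line terminator is CRLF, CR or LF; CRLF is tried first so it counts once.
    lines = re.split(r"\r\n|\r|\n", block.strip("\r\n"))
    match = TIMING_LINE.match(lines[0])
    if match is None:
        return None
    start, end, _settings = match.groups()
    return start, end, lines[1:]


cue = parse_cue("00:00:00.000 --> 00:00:02.000\nThis is a subtitle\non two lines\n")
print(cue)
```

Returning None for a non-matching block, rather than raising an error, mirrors the forgiving style of processing discussed later in this paper.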
Established industries versus emerging user communities
While XML has been well received and is used in established industries, it has at
least a disputable role on the web. The most prominent areas of debate are the draconian
error handling implemented by XHTML-supporting web browsers (see [K12, W12]) and the growing displacement of XML by JSON as an interchange format for data
on the web (see [D06, V13]).
Similar to these debates, the origin of the recent failure of TTML to be adopted by
HTML5 can be found in the separation of user communities.
At the same time as the WHATWG decided against TTML, other standardization committees
from the broadcast and movie domain adopted and promoted TTML as a format for subtitles.
The Society of Motion Picture and Television Engineers (SMPTE) extended TTML to SMPTE-TT [S10], YouView defined a profile of TTML for delivery to interactive TV sets and set-top
boxes in the UK [Y11], the Digital Entertainment Content Ecosystem consortium (DECE) defined a TTML profile
for the Common File Format (CFF-TT) [D12] and the EBU published the TTML subset EBU-TT [E12] for the interchange, archiving and production of subtitles.
The adoption of TTML by the European Broadcasting Union (EBU) is a good example of
why an XML standard was chosen by the different standardization initiatives. The EBU
was looking for a successor to the binary subtitle format EBU STL [E13] and the TTML standard met the requirements for a reference standard. It was expressive
enough to cover the desired semantics and at the same time it had the option to be
constrained. Furthermore, the XML standard has well-documented Unicode support (something
missing in the binary EBU STL format) and is ideally suited to implement the “Create
once, publish everywhere” strategy. While the translation process of spoken text into
subtitles still requires a large amount of manual work, the deployment of subtitles
in different subtitle formats for linear and non-linear TV is only practically feasible
when it is automated.
Similar to other professional sectors with a highly automated production process, the
broadcast industry depends on reliable and stable standards to guarantee the quality
of its services. The benefit of a formal standard that uses XML as an established
technology in this context outweighs the extra effort to implement its potential
The web environment, however, is home to a deeply rooted “everybody can do it” philosophy.
Anyone can and should be able to be a sender. The belief that it should be easy to
publish and to distribute content without high investments is shared with the open
source movement. From this point of view the hurdle to apply an XML standard is high
(especially if it references other XML standards). Implementation may require an academic
background or special training. Furthermore, free and user-friendly authoring tools
for XML are rare. Therefore the WHATWG preferred a subtitle format with an
easier-to-understand notation that could be used with a simple text editor.
Another difference from the TTML use case is that WebVTT is designed to serve only
as a web distribution format for subtitles. There is no ambition for it to be used
as an intermediary format. Although documents and extensions were published later in
the specification process to support the translation from existing US broadcast
standards into WebVTT [P11], support for legacy formats had not been a requirement from the beginning.
Quality control versus non-draconian error handling
The XML specification, together with grammar-based or rule-based constraining languages
such as W3C XML Schema 1.0, Relax NG or Schematron, forms a framework to test information
which is exchanged between applications and/or organizations for standard conformance.
In highly professionalized sectors such as the broadcast industry this support for
QC processes is well-appreciated and used.
Paradoxically, the strictness of the XML specification in guaranteeing ‘well-formedness’
could be seen as a reason for its bad reputation in the web developer community. The
behaviour of some web browsers, which do not recover gracefully from well-formedness errors
in XHTML documents but instead interrupt the rendering process, has resulted in disruptive
user experiences (see [TA10, B11]). The return of the HTML5 spec to a more "forgiving" parser behavior can be seen
as a direct result of these problems.
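The strict behaviour described above follows from the well-formedness rules themselves: a conforming XML parser reports a fatal error instead of guessing what the author meant. The following minimal sketch uses Python's standard xml.etree.ElementTree module. The sample strings and the helper name is_well_formed are illustrative assumptions and are not taken from any browser implementation.

```python
import xml.etree.ElementTree as ET

# HTML-style markup: the br element is never closed, so this is not well-formed XML.
ILL_FORMED = "<p>This is a subtitle<br>on two lines</p>"

# The same content with an empty-element tag for the line break.
WELL_FORMED = "<p>This is a subtitle<br/>on two lines</p>"


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser stops at the first error and produces no partial tree.
        return False
    return True


print(is_well_formed(ILL_FORMED))   # False
print(is_well_formed(WELL_FORMED))  # True
```

A forgiving HTML parser would accept both inputs and render the same two lines, which is the behaviour that amateur authors have come to expect.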
The strictness of XML was one reason for the decision against TTML on the WHATWG mailing
list. Well-formedness errors and namespace problems were marked as potential problems
for authors. As in general discussions related to "XML in the browser", the expectation
was that amateur developers will want to provide content and therefore the format
has to be kept simple.
For professional content providers, including broadcasting stations, automated QC
is becoming increasingly important. A strict format such as TTML is therefore preferred
both on the production side and for distribution. It gives broadcast stations
the means to guarantee the quality of their online services and to live up to the
expectations of their audience. Note that the reputation of most broadcasters has been
established not through the web, but rather by the use of high-quality broadcast standards.
TTML and WebVTT provide different options to handle document conformance. TTML provides
an informative W3C XML Schema and Relax NG Schema to support automatic document validation
while WebVTT integrates conformance testing and error handling in a normative text
A good illustration for the different concepts is the specification of text alignment.
In TTML, text alignment of subtitle text can be expressed through the attribute tts:textAlign:
<p begin="00:00:00.000" end="00:00:02.000" tts:textAlign="left">
One line aligned to the left.
</p>
A simplified type definition of the tts:textAlign attribute in the TTML XML Schema is shown below:
The validation of a TTML document where the textAlign attribute contains a value other
than "start", "end", "left", "center" or "right" would fail. Depending on the
implementation, a TTML decoder could stop further processing
of the document or continue parsing but reject the document at the end because the
conformance test was negative.
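The effect of such a value restriction can be sketched without a schema processor. The following Python example only mirrors the five values listed above and is not a substitute for validation against the TTML schemas. The function name invalid_text_align_values and the sample documents are assumptions made for illustration; the styling namespace URI is the one declared in Appendix A.

```python
import xml.etree.ElementTree as ET

# TTML styling namespace URI as declared in Appendix A.
TTS_NS = "http://www.w3.org/ns/ttml#styling"

# The values named in the text as accepted for tts:textAlign.
ALLOWED_TEXT_ALIGN = {"start", "end", "left", "center", "right"}


def invalid_text_align_values(xml_text):
    """Return every tts:textAlign value in the document that is not allowed."""
    root = ET.fromstring(xml_text)
    attr = f"{{{TTS_NS}}}textAlign"
    invalid = []
    for element in root.iter():  # iter() includes the root element itself
        value = element.get(attr)
        if value is not None and value not in ALLOWED_TEXT_ALIGN:
            invalid.append(value)
    return invalid


# Hypothetical template for a single namespaced p element.
TEMPLATE = (
    '<p xmlns="http://www.w3.org/ns/ttml" '
    'xmlns:tts="http://www.w3.org/ns/ttml#styling" '
    'begin="00:00:00.000" end="00:00:02.000" tts:textAlign="{}">'
    "One line aligned to the left.</p>"
)

print(invalid_text_align_values(TEMPLATE.format("left")))    # []
print(invalid_text_align_values(TEMPLATE.format("middle")))  # ['middle']
```

A strict decoder would reject the document as soon as the returned list is non-empty; whether it stops immediately or finishes parsing first is an implementation choice, as noted above.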
In WebVTT, text alignment can be controlled by the alignment cue setting:
00:00:00.000 --> 00:00:02.000 align:left
One line aligned to the left.
There are two sources for document conformance tests in the WebVTT spec: the syntax
definition and the parsing algorithm. The syntax definition is intended as authoring
instruction, while the parsing algorithm is a guideline for the implementation of
The syntax for the alignment cue setting is as follows:
A WebVTT alignment cue setting consists of the following components, in the order given:
- The string "align".
- A U+003A COLON character (:).
- One of the following strings: "start", "middle", "end", "left", "right".
The setting should be parsed as shown below:
- Let name be the leading substring of setting up to and excluding the first U+003A COLON character (:) in that string.
- Let value be the trailing substring of setting starting from the character immediately after the first U+003A COLON character (:) in that string.
- Run the appropriate substeps that apply for the value of name, as follows:
  - If name is a case-sensitive match for "align":
    - If value is a case-sensitive match for the string "start", then let cue's text track cue alignment be start alignment.
    - If value is a case-sensitive match for the string "middle", then let cue's text track cue alignment be middle alignment.
    - If value is a case-sensitive match for the string "end", then let cue's text track cue alignment be end alignment.
    - If value is a case-sensitive match for the string "left", then let cue's text track cue alignment be left alignment.
    - If value is a case-sensitive match for the string "right", then let cue's text track cue alignment be right alignment.
Following the WebVTT text parsing algorithm, the setting will simply be ignored if
its value is not one of the defined strings. The parsing of the document continues and the alignment
of the text cue will not be changed.
There is no recommendation on whether and how this error should be signaled.
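The parsing steps quoted above translate almost directly into code. The following Python sketch covers only the "align" branch. The function name apply_setting, the dictionary representation of a cue and the "middle" starting value are assumptions made for illustration.

```python
# The strings accepted as values of the alignment cue setting.
ALIGNMENTS = {"start", "middle", "end", "left", "right"}


def apply_setting(cue, setting):
    """Return a copy of cue with the setting applied, or cue unchanged if it is ignored."""
    # name: leading substring up to and excluding the first colon;
    # value: trailing substring after the first colon.
    name, colon, value = setting.partition(":")
    if not colon:
        return cue  # no colon at all: nothing to apply
    # Both comparisons are case-sensitive, as in the quoted steps.
    if name == "align" and value in ALIGNMENTS:
        return dict(cue, alignment=value)
    # Unknown names and undefined values are silently ignored.
    return cue


cue = {"alignment": "middle"}  # assumed starting value for this sketch
print(apply_setting(cue, "align:left"))    # {'alignment': 'left'}
print(apply_setting(cue, "align:centre"))  # {'alignment': 'middle'}
print(apply_setting(cue, "align:Left"))    # {'alignment': 'middle'}
```

Unlike the strict validation approach, nothing in this code path reports the ignored setting, so an author only notices the mistake by looking at the rendered result.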
Don't reinvent the wheel versus keep it simple
One principle in standardization is not to reinvent the wheel. If a required functionality
has been defined elsewhere, it would be appropriate to reference it. Often a compromise
is necessary because not all requirements are covered by existing standards and also
because it is sometimes neither easy nor desirable to extend them. This compromise
typically is acceptable for the goals of technology dissemination and future interoperability
The XML ecosystem makes use of this principle and there are many cross-references
between the different core standards of the ‘XML family’, such as XPath, W3C XML Schema
From the start the Timed Text Working Group (TTWG) has had the requirement to use
existing W3C technologies. This has been implemented by the integration of SMIL as
the semantic reference for the timing model and by the integration of XSL-FO for the
semantics of the formatting model.
In the ‘WHATWG versus TTML’ debate, the reference to XSL-FO and the claimed
incompatibility between XSL-FO and CSS in particular have been important arguments. Furthermore,
the re-use of TTML was blocked by critical comments about the use of XML. In this
sense the re-use of standards actually led to the rejection to be used by another
Although WebVTT also makes use of other standards and has the design goal to integrate
well with HTML5 and CSS, it also duplicates the functionality of TTML without making
a reference to this existing standard.
Readability: curse or blessing?
When comparing the advantages of WebVTT and SRT with those of TTML, readability is often highlighted
as a difference. The construction principle of simple WebVTT/SRT files seems easy
to understand, while a TTML document which expresses the same semantics appears to
be more verbose and opaque, especially to a reader who is not familiar with XML.
While comprehensibility of a format by reading it may be important when the manual
creation of conformant documents is an important option, the ease of parsing and implementation
is crucial for automated processing.
The readability argument leads back to the differences in requirements between established
industries and emerging user communities, but in any case it seems more relevant for
the creation than for the interpretation of a format. While a document with subtitles
can be directly created by a human being, the decoding and rendering of the document
will always be done by an automated process. It is therefore questionable how the
complexity of a format can be judged by taking human readability as the criterion.
As a side note it should be mentioned that in Europe the binary (so not human-readable
at all!) EBU-STL format is highly trusted as an exchange format for subtitles. Actually,
some concerns exist regarding its replacement with a human readable XML format! It
would be an interesting field of investigation if under special circumstances an opaque
format is more likely to gain trust and therefore adoption than a transparent format
Besides the technical and semantic differences between TTML and WebVTT, the sheer
existence of two standards with duplicated functionality lowers the speed of adoption,
blocks specification processes and in the long run affects interoperability between
Although it is not likely that the two formats will be merged, the efforts to combine
both activities in one W3C working group can be seen as a promising step. This nevertheless
leaves one problem still unresolved: the divergence between the XML and the web world.
These two worlds should not only meet at conferences, but they should be represented
in the same standard committees. That could improve the interoperability perspective
for new standards right from the beginning.
Appendix A. TTML and WebVTT document samples
Two short but complete examples for TTML and WebVTT are documented below. Both documents
represent the same formatting semantics for a timed, two-line subtitle:
- The virtual box for the subtitle is 80% of the video frame width.
- The virtual box of the subtitle has a 10% offset to the left edge of the video frame.
- The subtitle text is centered.
- The subtitle has an identifier with the value "sub1".
- A color of green is applied to one word by the use of a referenced style definition.
- An italic typeface is applied to one word by the use of an inline style.
- The vertical position is in the lower third of the video frame.
The representation in TTML would be:
<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:tt="http://www.w3.org/ns/ttml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <tt:head>
    <tt:styling>
      <tt:style xml:id="s1" tts:color="green"/>
    </tt:styling>
    <tt:layout>
      <tt:region tts:extent="80% 80%" tts:origin="10% 10%" tts:displayAlign="after"/>
    </tt:layout>
  </tt:head>
  <tt:body>
    <tt:div>
      <p xml:id="sub1" begin="00:00:00.000" end="00:00:02.000">
        This is a green <span style="s1">word</span>.<br/>
        This is an italic <span tts:fontStyle="italic">word</span>.
      </p>
    </tt:div>
  </tt:body>
</tt>
The representation in WebVTT would be:
WEBVTT

sub1
00:00:00.000 --> 00:00:02.000 size:80% position:10% line:70% align:middle
This is a green <c.s1>word</c>.
This is an italic <i>word</i>.
WebVTT has the option to use CSS style sheet definitions. Currently these definitions
are not embedded in the WebVTT document. In the example they are included in the WebVTT
file by the use of the @import rule.
The content of the imported file in the example would be as follows:
[B11] Bovens, Andreas, No more "XML parsing failed" errors, 28 September 2011.
[TA10] Çelik, Tantek, XHTML Is Dead, Long Live XML-Valid HTML5, 29 October 2010.
[D06] Crockford, Douglas, JSON: The Fat-Free Alternative to XML, XML 2006 Boston, 6 December 2006.
[D12] Digital Entertainment Content Ecosystem (DECE) LLC, Common File Format & Media Formats
Specification, Version 1.0.6, 23 February 2013.
[E13] European Broadcasting Union (EBU), EBU Tech 3264, Specification of the EBU Subtitling
data exchange format, February 1991.
[E12] European Broadcasting Union (EBU), EBU Tech 3350, EBU-TT Part 1 Subtitling format
definition, Version 1.0, July 2012.
[TC10] Glenn Adams (ed.), Timed Text Markup Language (TTML) 1.0, W3C Candidate Recommendation
23 February 2010.
[TF10] Glenn Adams (ed.), Timed Text Markup Language (TTML) 1.0, W3C Recommendation 18 November 2010.
[FCC12] Federal Communications Commission, Small Entity Compliance Guide, Closed Captioning
of Internet Protocol Delivered Video Programming: Implementation of the Twenty First
Century Communications and Video Accessibility Act of 2010, MB Docket No. 11 - 154,
FCC 12 - 9.
[H10] Hickson, Ian, Timed tracks for <video>, Email on the Public mailing list for the
WHAT working group, 2010-04-09.
[W13] Hickson, Ian, Pfeiffer, Silvia, WebVTT: The Web Video Text Tracks Format, W3C Community
Group Draft Report.
[P11] Pfeiffer, Silvia, Conversion of 608/708 captions to WebVTT. Draft Community Group
Specification 11 July 2013. https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html
[P10] Pfeiffer, Silvia, Introduction of media accessibility features, Email thread started
by Silvia Pfeiffer on the Public mailing list for the WHAT working group, 2010-04-09.
[S10] Society of Motion Picture and Television Engineers (SMPTE), SMPTE Standard for Television
- Timed Text Format (SMPTE-TT), SMPTE ST 2052-1:2010.
[V13] Van der Vlist, Eric, Embracing JSON? Of course, but how?, XML Prague 2013, Conference
[K12] Van Kesteren, Anne, XML5's Story, XML Prague 2012, Conference Proceedings, p. 23-25.
[W04] Web Hypertext Application Technology Working Group (WHATWG), WHAT open mailing list
announcement, June 4th 2004.
[W12] World Wide Web Consortium (W3C), DraconianErrorHandling, W3C Wiki, 9 February 2010.
[TT03] World Wide Web Consortium (W3C), W3C Timed Text Working Group Charter (TTWG), Initial
[TT02] World Wide Web Consortium (W3C), Standardized Timed-text Format, W3C Working Draft
21 March 2002.
[Y11] YouView TV Ltd, YouView Core Technical Specification, For Launch, Version 1.0, 14