Introduction

In 2010 the WHATWG mailing list discussed which existing timed text format should be used as a base to distribute timed text with the newly defined track element in HTML5 [P10]. The most important use case for timed text was the display of captions and subtitles with a related video.[1]

In the end, the existing XML standard for timed text, TTML [TC10] (formerly specified as DFXP), was not chosen. Instead the WHATWG selected SRT, a text-based subtitle format that originated in the web user community and had already been implemented in a wide range of video players [H10]. After its adoption by the WHATWG the format was first called WebSRT and later renamed WebVTT.

The arguments exchanged in mailing threads were not limited to timed text and the domain of subtitling. The decision against TTML also was a decision against XML and other standards from the XML-family.

The importance of having a new timed text format is driven by the large increase in video content distribution over IP-based networks. The demand for subtitles to show with that video is rising as well. In some regions the provision of subtitles with online content is even a regulatory obligation (see for example [FCC12]).

Proprietary formats aside, TTML and WebVTT are currently the two standards for the distribution of subtitles on the web that receive the most attention from the market. The competition between them follows a pattern similar to the one seen with the adoption of XML for the web. More than a competition between different technologies, it appears to be a clash of the ‘web culture’ with the ‘XML culture’. The unresolved problems in this relationship could have a blocking effect on the progress of IP-based delivery of subtitles and on web accessibility in general.

TTML

In 2003 the W3C set up a working group to specify a timed text markup language [TT03]. According to the requirements, which were laid out already in 2002 [TT02], the goal was to define a non-proprietary, standardized format that could be used for displaying text synchronized with other elements such as audio and video.

The requirements listed use cases such as karaoke, credit rolls and text overlays next to the main use case: the display of subtitles. One of the top priorities for the targeted architecture was to have an XML representation of the format.

Members of the working group represented, amongst others, a broadcaster, a professional engineering association from the moving picture industry, a research institution on accessible media and vendors of existing video players for the web. TTML was published as a detailed standard in November 2010 as Timed Text Markup Language (TTML) 1.0 [TF10].

A simple TTML example illustrates the expression of a two line subtitle:[2]

<p begin="00:00:00.000" end="00:00:02.000">
    This is a subtitle<br/>
    on two lines
</p>
		

In the TTML presentation semantics a p element represents a block level element and the br element inserts a line break. The begin and end attributes specify the timecodes between which the subtitle should be shown.
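As an illustration of these semantics, the following minimal Python sketch reads the fragment above with the standard library XML parser and recovers the two timecodes and the two text lines. The function name subtitle_lines is hypothetical, and the fragment is parsed without a namespace, as in the simplified example; a complete TTML document places these elements in the http://www.w3.org/ns/ttml namespace.

```python
import xml.etree.ElementTree as ET

# The simplified fragment from the example (no namespace declaration).
FRAGMENT = """<p begin="00:00:00.000" end="00:00:02.000">
    This is a subtitle<br/>
    on two lines
</p>"""


def subtitle_lines(p):
    """Split the mixed content of a p element into lines at each br."""
    lines = [(p.text or "").strip()]
    for child in p:
        if child.tag == "br":
            # Text that follows <br/> is stored in the tail of the br element.
            lines.append((child.tail or "").strip())
    return lines


p = ET.fromstring(FRAGMENT)
begin, end = p.get("begin"), p.get("end")
lines = subtitle_lines(p)
print(begin, end, lines)
```

The sketch only shows how the timing attributes and the line break map onto the XML data model; it does not interpret the time expressions or any styling.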

Like other local names of TTML elements that represent subtitle content, p and br are derived from (X)HTML.[3] Although they are defined in a different namespace, their semantics are similar.

TTML uses a subset of the XML Infoset to formalize the data model.[4]

A reduced notation of the p information element would be:[5]

<p>
    begin = <timeExpression>
    end = <timeExpression>
    Content: (#PCDATA | br )*
</p>		
		

WebVTT

In 2004 a group of W3C members decided to break away from the W3C HTML specification effort and to advance HTML outside of, but in close collaboration with, the W3C. The Web Hypertext Application Technology Working Group (WHATWG) was set up with the goal to “create technical specifications that are intended for implementation in mass-market web browsers, in particular Safari, Mozilla, and Opera” [W04].

Although WebVTT is currently specified by the W3C Web Media Text Tracks Community Group [W13], the decision to use SRT as its base was taken by the WHATWG, and the initial specification was written with Ian Hickson as responsible editor. Priorities for the choice of a timed text format included simplicity, compatibility with existing players and integration into the HTML5 specification effort [H10].

A simple two line subtitle would be expressed in WebVTT as a text cue:

00:00:00.000 --> 00:00:02.000
This is a subtitle
on two lines
		

The delimiter for begin and end timecodes is the string “-->”. A character sequence representing a newline (LF, CR or CRLF) separates the timing information from subtitle text and breaks a subtitle text line.[6]

WebVTT does not use a formal grammar to describe the syntax but a sequence of rules written in normative prose. A reduced definition of the text cue shown in the example would be:

A WebVTT timestamp representing the start time offset of the cue.
The string "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
A WebVTT timestamp representing the end time offset of the cue.
A WebVTT line terminator
Zero or more WebVTT cue text span, representing the text of the cue each optionally separated from the next by a WebVTT line terminator.
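As an illustration, the reduced text cue definition can be followed step by step in a small Python sketch. The names TIMING_LINE, to_seconds and parse_cue are hypothetical, the timestamp pattern covers only the hh:mm:ss.ttt form used in the example, and cue settings and cue identifiers are not handled. It is not the normative WebVTT parsing algorithm.

```python
import re

# Start timestamp, the string "-->", end timestamp (simplified pattern).
TIMING_LINE = re.compile(
    r"^(?P<start>\d{2,}:\d{2}:\d{2}\.\d{3})[ \t]+-->[ \t]+"
    r"(?P<end>\d{2,}:\d{2}:\d{2}\.\d{3})"
)


def to_seconds(timestamp):
    """Convert an hh:mm:ss.ttt timestamp into seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_cue(block):
    """Return (start, end, text_lines) for one simple text cue."""
    # splitlines() accepts LF, CR and CRLF (and a few other separators,
    # which is acceptable for a sketch).
    timing, *text_lines = block.splitlines()
    match = TIMING_LINE.match(timing)
    if match is None:
        raise ValueError("not a cue timing line: %r" % timing)
    return to_seconds(match["start"]), to_seconds(match["end"]), text_lines


cue = parse_cue("00:00:00.000 --> 00:00:02.000\nThis is a subtitle\non two lines")
print(cue)
```

The first line terminator separates the timing line from the cue text, and every further terminator breaks a subtitle line, which is all the structure the simple example needs.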
	

Established industries versus emerging user communities

While XML has been well received and is used in established industries, its role on the web is at least disputed. The most prominent areas of debate are the draconian error handling implemented by web browsers that support XHTML (see [K12, W12]) and the growing displacement of XML by JSON as an interchange format for data on the web (see [D06, V13]).

As with these debates, the origin of the recent failure of TTML to be adopted by HTML5 can be found in the separation of user communities.

At the same time as the WHATWG decided against TTML, other standardization committees from the broadcast and movie domain adopted and promoted TTML as a format for subtitles. The Society of Motion Picture and Television Engineers (SMPTE) extended TTML to SMPTE-TT [S10], YouView defined a profile of TTML for delivery to interactive TV sets and set-top boxes in the UK [Y11], the Digital Entertainment Content Ecosystem consortium (DECE) defined a TTML profile for the common file format (CFF-TT) [D12] and the EBU published the TTML subset EBU-TT [E12] for the interchange, archiving and production of subtitles.

The adoption of TTML by the European Broadcasting Union (EBU) is a good example of why an XML standard was chosen by the different standardization initiatives. The EBU was looking for a successor to the binary subtitle format EBU STL [E13] and the TTML standard met the requirements for a reference standard. It was expressive enough to cover the desired semantics and at the same time it could be constrained. Furthermore XML has well-documented Unicode support (something missing in the binary EBU STL format) and is ideally suited to implementing the “Create once, publish everywhere” strategy. While the translation of spoken text into subtitles still requires a large amount of manual work, the deployment of subtitles in different subtitle formats for linear and non-linear TV is only practically feasible when it is automated.

Like other professional sectors with a highly automated production process, the broadcast industry depends on reliable and stable standards to guarantee the quality of its services. In this context the benefit of a formal standard that uses XML as an established technology outweighs the extra effort needed to implement its potential complexities.

The web environment, however, is home to a deeply rooted “everybody can do it” philosophy. Anyone can and should be able to be a sender. The belief that it should be easy to publish and to distribute content without high investments is shared with the open source movement. From this point of view the hurdle to apply an XML standard is high (especially if it references other XML standards). Implementation may require an academic background or special training. Furthermore, free and user-friendly authoring tools for XML are rare. Therefore the WHATWG preferred a subtitle format with a notation that is easier to understand and that can be written with a simple text editor.

Another difference from the TTML use case is that WebVTT is designed to serve only as a web distribution format for subtitles. There is no ambition for it to be used as an intermediary format. Although documents and extensions were published later in the specification process to support the translation from existing US broadcast standards into WebVTT [P11], support for legacy formats had not been a requirement from the beginning.

Quality control versus non-draconian error handling

The XML specification, together with grammar-based or rule-based constraining languages such as W3C XML Schema 1.0, Relax NG or Schematron, forms a framework to test information which is exchanged between applications and/or organizations for standard conformance. In highly professionalized sectors such as the broadcast industry this support for QC processes is well-appreciated and used.

Paradoxically, the strictness of the XML specification in guaranteeing ‘well-formedness’ can be seen as a reason for its bad reputation in the web developer community. Some web browsers do not recover gracefully from well-formedness errors in XHTML documents but instead abort the rendering process, which has resulted in disruptive user experiences (see [TA10, B11]). The return of the HTML5 spec to a more "forgiving" parser behavior can be seen as a direct result of these problems.
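The strictness in question can be observed with any conforming XML parser. The following minimal Python sketch uses the standard library parser; the helper name is_well_formed and the two sample strings are illustrative assumptions. A single unclosed br element is enough for the whole document to be rejected.

```python
import xml.etree.ElementTree as ET

WELL_FORMED = "<p>This is a subtitle<br/>on two lines</p>"
ILL_FORMED = "<p>This is a subtitle<br>on two lines</p>"  # br is never closed


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # An XML parser reports the error; it does not try to
        # guess what the author meant.
        return False
    return True


print(is_well_formed(WELL_FORMED), is_well_formed(ILL_FORMED))
```

An HTML parser would accept the second string and insert the line break, which is the kind of "forgiving" behavior referred to above.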

The strictness of XML was one reason for the decision against TTML on the WHATWG mailing list. Well-formedness errors and namespace problems were marked as potential problems for authors. As in general discussions related to "XML in the browser", the expectation was that amateur developers will want to provide content and therefore the format has to be kept simple.

For professional content providers, including broadcasting stations, automated QC is becoming increasingly important. A strict format such as TTML is therefore preferred both on the production side and for distribution. It gives broadcast stations the means to guarantee the quality of their online services and to live up to the expectations of their audience. Note that the reputation of most broadcasters has been established not through the web, but rather by the use of high-quality broadcast standards.

TTML and WebVTT provide different options to handle document conformance. TTML provides an informative W3C XML Schema and Relax NG Schema to support automatic document validation while WebVTT integrates conformance testing and error handling in a normative text parsing algorithm.

A good illustration of the different concepts is the specification of text alignment.

In TTML text alignment of subtitle text can be expressed through the attribute textAlign:

<p begin="00:00:00.000" end="00:00:02.000" tts:textAlign="left" >
    One line aligned to the left.
</p>		
		

A simplified type definition of the tts:textAlign attribute in the TTML XML Schema is shown below:

<xs:attribute name="textAlign">
    <xs:simpleType>
        <xs:restriction base="xs:token">
            <xs:enumeration value="start"/>
            <xs:enumeration value="end"/>
            <xs:enumeration value="left"/>
            <xs:enumeration value="center"/>
            <xs:enumeration value="right"/>
        </xs:restriction>
    </xs:simpleType>
</xs:attribute>
		

The validation of a TTML document in which the textAlign attribute contains a value other than "start", "end", "left", "center" or "right" would fail. Depending on the implementation, a TTML decoder could stop further processing of the document, or continue parsing but reject the document at the end because the conformance test was negative.
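The effect of the enumeration can be imitated with a hand-written check. The Python sketch below is not schema validation (the standard library has no XML Schema validator); it only tests tts:textAlign values against the five enumerated tokens. The function name invalid_text_align_values and the two sample documents are assumptions, and the samples add the xmlns:tts declaration that the shortened example omits.

```python
import xml.etree.ElementTree as ET

TTS_NS = "http://www.w3.org/ns/ttml#styling"
ALLOWED_TEXT_ALIGN = {"start", "end", "left", "center", "right"}


def invalid_text_align_values(document):
    """Collect every tts:textAlign value that is not an enumerated token."""
    attribute = "{%s}textAlign" % TTS_NS
    root = ET.fromstring(document)
    return [
        element.get(attribute)
        for element in root.iter()
        if element.get(attribute) is not None
        and element.get(attribute) not in ALLOWED_TEXT_ALIGN
    ]


VALID = '<p xmlns:tts="%s" tts:textAlign="left">One line</p>' % TTS_NS
INVALID = '<p xmlns:tts="%s" tts:textAlign="middle">One line</p>' % TTS_NS

print(invalid_text_align_values(VALID), invalid_text_align_values(INVALID))
```

A decoder that treats a non-empty result as fatal corresponds to the strict behavior described above; what happens next is left to the implementation.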

In WebVTT text alignment can be controlled by the alignment cue setting:

00:00:00.000 --> 00:00:02.000 align:left
One line aligned to the left.		
		

There are two sources for document conformance tests in the WebVTT spec: the syntax definition and the parsing algorithm. The syntax definition is intended as authoring instruction, while the parsing algorithm is a guideline for the implementation of WebVTT decoders.

The syntax for the alignment cue setting is as follows:

A WebVTT alignment cue setting consists of the following components, in the order given:
    The string "align".
    A U+003A COLON character (:).
    One of the following strings: "start", "middle", "end", "left", "right".

The setting should be parsed as shown below:

Let name be the leading substring of setting up to and excluding the first U+003A COLON character (:) in that string.

Let value be the trailing substring of setting starting from the character immediately after the first U+003A COLON character (:) in that string.

Run the appropriate substeps that apply for the value of name, as follows:

If name is a case-sensitive match for "align"

        If value is a case-sensitive match for the string "start", then let cue's text track cue alignment be start alignment.
        If value is a case-sensitive match for the string "middle", then let cue's text track cue alignment be middle alignment.
        If value is a case-sensitive match for the string "end", then let cue's text track cue alignment be end alignment.
        If value is a case-sensitive match for the string "left", then let cue's text track cue alignment be left alignment.
        If value is a case-sensitive match for the string "right", then let cue's text track cue alignment be right alignment.

Following the WebVTT text parsing algorithm, the setting will simply be ignored if its value is not one of the defined strings. The parsing of the document continues and the alignment of the text cue will not be changed.
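The quoted steps translate almost literally into code. The following Python sketch covers only the align setting; the function name parse_settings, the dictionary representation of a cue and the skipping of tokens without a colon are assumptions of this sketch, not quotations from the specification.

```python
ALIGNMENTS = {"start", "middle", "end", "left", "right"}


def parse_settings(settings, cue):
    """Apply the align setting to a copy of cue; ignore everything else."""
    result = dict(cue)
    for setting in settings.split():
        if ":" not in setting:
            continue  # assumption of this sketch: skip tokens without a colon
        # name: substring before the first colon, value: substring after it
        name, value = setting.split(":", 1)
        if name == "align" and value in ALIGNMENTS:  # case-sensitive matches
            result["align"] = value
        # Any other name or value is dropped silently: no error is raised
        # and the cue keeps its previous alignment.
    return result


default_cue = {"align": "middle"}  # default value chosen for this sketch
print(parse_settings("align:left", default_cue))
print(parse_settings("align:centre", default_cue))
```

The second call returns the cue unchanged, and nothing tells the author that "centre" was not understood.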

There is no recommendation if and how this error should be signaled.

Don't reinvent the wheel versus keep it simple

One principle in standardization is not to reinvent the wheel. If a required functionality has been defined elsewhere, it would be appropriate to reference it. Often a compromise is necessary because not all requirements are covered by existing standards and also because it is sometimes neither easy nor desirable to extend them. This compromise typically is acceptable for the goals of technology dissemination and future interoperability between systems.

The XML ecosystem makes use of this principle and there are many cross-references between the different core standards of the ‘XML family’, such as XPath, W3C XML Schema and XSLT.

From the start the Timed Text Working Group (TTWG) has had the requirement to use existing W3C technologies. This has been implemented by integrating SMIL as the semantic reference for the timing model and XSL-FO as the semantic reference for the formatting model.

In the ‘WHATWG versus TTML’ debate, the reference to XSL-FO and the claimed incompatibility between XSL-FO and CSS were especially important arguments. Furthermore, the re-use of TTML was blocked by critical comments about the use of XML. In this sense TTML's re-use of existing standards actually led to its rejection for re-use by another standard (!).

Although WebVTT also makes use of other standards and has the design goal to integrate well with HTML5 and CSS, it duplicates the functionality of TTML without making a reference to this existing standard.

Readability: curse or blessing?

When the advantages of WebVTT and SRT are compared with those of TTML, readability is often highlighted as a difference. The construction principle of simple WebVTT/SRT files seems easy to understand, while a TTML document which expresses the same semantics appears to be more verbose and opaque, especially to a reader who is not familiar with XML.

While comprehensibility of a format by reading it may be important when the manual creation of conformant documents is an important option, the ease of parsing and implementation is crucial for automated processing.

The readability argument leads back to the differences in requirements between established industries and emerging user communities, but in any case it seems more relevant for the creation than for the interpretation of a format. While a document with subtitles can be directly created by a human being, the decoding and rendering of the document will always be done by an automated process. It is therefore questionable how the complexity of a format can be judged by taking human readability as the criterion.

As a side note it should be mentioned that in Europe the binary (so not human-readable at all!) EBU-STL format is highly trusted as an exchange format for subtitles. Actually, some concerns exist regarding its replacement with a human readable XML format! It would be an interesting field of investigation if under special circumstances an opaque format is more likely to gain trust and therefore adoption than a transparent format like XML.

Conclusion

Besides the technical and semantic differences between TTML and WebVTT, the mere existence of two standards with duplicated functionality slows adoption, blocks specification processes and in the long run affects interoperability between systems.

Although it is not likely that the two formats will be merged, the efforts to combine both activities in one W3C working group can be seen as a promising step. This nevertheless leaves one problem unresolved: the divergence between the XML world and the web world. These two worlds should not only meet at conferences, but should also be represented in the same standards committees. That could improve the interoperability perspective for new standards right from the beginning.

Appendix A. TTML and WebVTT document samples

Two short but complete examples for TTML and WebVTT are documented below. Both documents represent the same formatting semantics for a timed, two-line subtitle:

  • The virtual box for the subtitle is 80% of the video frame width.

  • The virtual box of the subtitle has a 10% offset to the left edge of the video frame.

  • The subtitle text is centered.

  • The subtitle has an identifier with the value "sub1".

  • A color of green is applied to one word by the use of a referenced style definition.

  • An italic typeface is applied to one word by the use of an inline style.

  • The vertical position is in the lower third of the video frame.[7]

The representation in TTML would be:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:tt="http://www.w3.org/ns/ttml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <head>
        <styling>
            <tt:style xml:id="s1" tts:color="green"/>
        </styling>
        <layout>
            <tt:region xml:id="r1" tts:extent="80% 80%" tts:origin="10% 10%" tts:displayAlign="after" tts:textAlign="center"/>
        </layout>
    </head>
    <body>
        <div>
            <p xml:id="sub1" region="r1" begin="00:00:00.000" end="00:00:02.000">
                This is a green <span style="s1">word</span>.<br/>
                This is an italic <span tts:fontStyle="italic">word</span>.
            </p>
        </div>
    </body>
</tt>
 

The representation in WebVTT would be:

WEBVTT
Style:
@import(foo.css)
##
Language: en

sub1
00:00:00.000 --> 00:00:02.000 size:80% position:10% line:70% align:middle 
This is a green <c.s1>word</c>.
This is an italic <i>word</i>.

WebVTT has the option to use CSS style sheet definitions. Currently these definitions are not embedded in the WebVTT document. In the example they are included in the WebVTT file by the use of the @import rule.

The content of the imported file in the example would be as follows:

::cue(c.s1) {
  color: green;
}

References

[B11] Bovens, Andreas, No more "XML parsing failed" errors, 28 September 2011. http://my.opera.com/ODIN/blog/2011/09/28/no-more-xml-parsing-failed-errors

[TA10] Çelik, Tantek, XHTML Is Dead, Long Live XML-Valid HTML5, 29 October 2010. http://tantek.com/2010/302/b1/xhtml-dead-long-live-xml-valid-html5

[D06] Crockford, Douglas, JSON: The Fat-Free Alternative to XML, XML 2006 Boston, 6 December 2006. http://www.json.org/fatfree.html

[D12] Digital Entertainment Content Ecosystem (DECE) LLC, Common File Format & Media Formats Specification, Version 1.0.6, 23 February 2013. http://www.uvvuwiki.com/images/f/f6/CFFMediaFormat-C1.0.6.pdf

[E13] European Broadcasting Union (EBU), EBU Tech 3264, Specification of the EBU Subtitling data exchange format, February 1991. http://tech.ebu.ch/docs/tech/tech3264.pdf

[E12] European Broadcasting Union (EBU), EBU Tech 3350, EBU-TT Part 1 Subtitling format definition, Version 1.0, July 2012. http://tech.ebu.ch/docs/tech/tech3350.pdf?vers=1.0

[TC10] Glenn Adams (ed.), Timed Text Markup Language (TTML) 1.0, W3C Candidate Recommendation 23 February 2010. http://www.w3.org/TR/2010/CR-ttaf1-dfxp-20100223/

[TF10] Glenn Adams (ed.), Timed Text Markup Language (TTML) 1.0, W3C Recommendation 18 November 2010. http://www.w3.org/TR/2010/REC-ttaf1-dfxp-20101118/

[FCC12] Federal Communications Commission, Small Entity Compliance Guide, Closed Captioning of Internet Protocol Delivered Video Programming: Implementation of the Twenty First Century Communications and Video Accessibility Act of 2010, MB Docket No. 11-154, FCC 12-9. http://www.fcc.gov/document/closed-captioning-internet-protocol-delivered-video-programming-1

[H10] Hickson, Ian, Timed tracks for <video>, Email on the Public mailing list for the WHAT working group, 2010-04-09. http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-July/027386.html

[W13] Hickson, Ian, Pfeiffer, Silvia, WebVTT: The Web Video Text Tracks Format, W3C Community Group Draft Report. http://dev.w3.org/html5/webvtt/

[P11] Pfeiffer, Silvia, Conversion of 608/708 captions to WebVTT, Draft Community Group Specification 11 July 2013. https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html

[P10] Pfeiffer, Silvia, Introduction of media accessibility features, Email thread started by Silvia Pfeiffer on the Public mailing list for the WHAT working group, 2010-04-09. http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-April/025898.html

[S10] Society of Motion Picture and Television Engineers (SMPTE), SMPTE Standard for Television - Timed Text Format (SMPTE-TT), SMPTE ST 2052-1:2010. http://www.smpte.org/sites/default/files/st2052-1-2010.pdf

[V13] Van der Vlist, Eric, Embracing JSON? Of course, but how?, XML Prague 2013, Conference Proceedings, p.163-188. http://archive.xmlprague.cz/2013/files/xmlprague-2013-proceedings.pdf

[K12] Van Kesteren, Anne, XML5's Story, XML Prague 2012, Conference Proceedings, p. 23-25. http://archive.xmlprague.cz/2012/files/xmlprague-2012-proceedings.pdf

[W04] Web Hypertext Application Technology Working Group (WHATWG), WHAT open mailing list announcement, June 4th 2004. http://www.whatwg.org/news/start

[W12] World Wide Web Consortium (W3C), DraconianErrorHandling, W3C Wiki, 9 February 2010. http://www.w3.org/html/wg/wiki/index.php?title=DraconianErrorHandling&oldid=1526

[TT03] World Wide Web Consortium (W3C), W3C Timed Text Working Group Charter (TTWG), Initial Version. http://www.w3.org/AudioVideo/TT/ttcharter20020901.html

[TT02] World Wide Web Consortium (W3C), Standardized Timed-text Format, W3C Working Draft 21 March 2002. http://www.w3.org/AudioVideo/timetext.html

[Y11] YouView TV Ltd, YouView Core Technical Specification, For Launch, Version 1.0, 14 April 2011. http://industry.youview.com/resources/YouView_Core_Technical_Specification_1.0.pdf



[1] The term “captions” describes on-screen text for use by deaf and hard of hearing audiences. The term “subtitles” describes on-screen text for translation purposes. For easier reading only the term “subtitles” is used in this article; it may be read interchangeably with the term “captions”.

[2] Complete TTML and WebVTT documents can be found in Appendix A. Note that the examples in this article are only intended to illustrate the design principles of the different standards. They do not always reflect the complete syntax. For implementation the relevant standards should be consulted.

[3] See Appendix J1 of [TC10]

[4] See Appendix A of [TC10]

[5] The type <timeExpression> is defined in section 10.3.1 of [TC10]

[6] The syntax in the sample is equivalent to what was already defined in SRT, the base format of WebVTT.

[7] Because of different positioning concepts the represented vertical position is only approximately the same.