Facilitating multi-media publishing workflows
Publishing textual content to several media, such as printed books, ebooks and the web, faces a number of challenges. Most workflows seem to incorporate parts or all of one of the following common publishing solutions:
Small-scale, non-academic text publishing generally relies on text production in word processing applications (chiefly Microsoft Word), from which the text is exported as HTML. For a web version, the HTML is then cleaned and converted to use the tags, attributes and classes of the site that the text is embedded into. If an ebook version is created at all, it is often created as an EPUB by converting the HTML file further, using a tool such as Calibre. To obtain a print version, the text is imported into and set up within a desktop publishing application, such as Adobe InDesign. If resources for this kind of conversion are lacking, a PDF may be created directly from Microsoft Word, which leads to suboptimal output quality.
Small-scale, academic text publishing is at times instead done using tools such as LaTeX, which convert human-readable source text into good-looking PDFs that are well suited for print. These tools are also much better than word processors at additional features such as consistent bibliography management or mathematical formulas. Engines such as pdfTeX convert the LaTeX source files into printable PDF files. For ebook and web output, a transformation to HTML has to occur first, and although conversion tools such as HEVEA, latex2html and TeX4ht exist, conversions seldom go smoothly, and cleanup by hand is mostly required. The conversion of the input is similarly problematic: unless the author directly offers the text in LaTeX format, it needs to be converted from a word processor format, which can seldom be done automatically.
Text publishing by larger companies and organizations is often done via an intermediate XML step: the original text is first converted from a word processor format to an XML format and then cleaned up manually. It is then converted to PDF, HTML and EPUB using one of a number of different chains of conversion tools. For example, a PDF can be obtained by applying an XSLT stylesheet to the XML file with an XSLT processor and passing the output through an XSL-FO formatter. An HTML file can be obtained from the XML file by applying another XSLT stylesheet with the XSLT processor, and an EPUB can be obtained by converting the HTML file. In theory these processes could be entirely automated, but in practice a lot of manual editing is often required at some stage, because the contents contain elements very specific to the type of publication in question that the creators of the conversion software had not anticipated.
A slightly different workflow is also XML-centered, but instead of converting the XML directly, the XML is imported into InDesign where it is then styled and adjusted for print. The problem is that if the XML file has been changed and the output file needs to be updated, changes made in InDesign will have to be reapplied.
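The XML-to-HTML conversion step in the XML-centered workflows above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree` module as a stand-in for an XSLT processor (the standard library has none), and the element names in `TAG_MAP` and `SAMPLE` are hypothetical. It also reports source elements that the mapping did not anticipate, which is the kind of case that tends to require manual editing.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source XML element names to HTML element names.
# A real workflow would express this as an XSLT stylesheet instead.
TAG_MAP = {"chapter": "section", "title": "h1", "para": "p", "emph": "em"}


def xml_to_html(xml_source):
    """Rename known elements to HTML tags and collect unanticipated ones."""
    root = ET.fromstring(xml_source)
    unanticipated = []
    for element in root.iter():
        if element.tag in TAG_MAP:
            element.tag = TAG_MAP[element.tag]
        else:
            # Not foreseen by the mapping: flag it and fall back to <div>.
            unanticipated.append(element.tag)
            element.tag = "div"
    return ET.tostring(root, encoding="unicode", method="html"), unanticipated


# Hypothetical source document containing one element the mapping lacks.
SAMPLE = (
    "<chapter><title>Intro</title>"
    "<para>Some <emph>text</emph>.</para>"
    "<sidebar>Note</sidebar></chapter>"
)

html_output, needs_review = xml_to_html(SAMPLE)
print(html_output)
print("Needs manual review:", needs_review)
```

The list of unanticipated elements is returned rather than silently dropped, so that a person can decide how each publication-specific element should be handled.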
All these conversion systems have in common that they are rather labor-intensive and that separate workflow steps are needed for the different output formats. While the most professional solution, involving XML, can at least in theory work with just one source file that is updated along the way, XML is not easily editable, and XHTML seems to be giving way to the non-XML-conforming HTML5 in much of web publishing.
Additionally, the most common way of styling XML files, using XSL-FO, is running into trouble. The number of print products created with XSL-FO is still increasing, and XSL-FO continues to have some features used in print products that are more advanced than those of CSS. However, further standardization of XSL-FO seems to have halted indefinitely due to lack of interest, with the W3C believing that CSS will replace it [Graham2014] [Kelly2014].
The need for a web-based content solution
Looking at the currently existing publishing solutions, it was clear to us that none of them functions perfectly or automatically. We also noted that the central place that XML currently has in many publishing workflows is likely mainly a historical artifact from the period around the turn of the century when HTML was to be replaced by XML in the form of XHTML [Simpson2000]. XML is seldom the final output format and is just about absent from the web [Berjon2014], and there are far fewer editors for editing XML in a rich-text WYSIWYG fashion than there are for HTML. As a result, XML creates largely unneeded conversion steps.
If, on the other hand, one used HTML as the main content file format, some steps of conversion could be made much smaller or eliminated entirely:
In the case of EPUB files, the most common ebook file type, the textual content comes in the form of files containing a restricted version of HTML. The styling of these pages is defined through a restricted version of CSS, the same language used to define the styling of web pages. Conversion from an HTML source file to an EPUB could therefore largely be done automatically. If one is able to restrict the tags, attributes and CSS rules used in the source files, the conversion should in most cases be entirely automatic.
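Restricting the tags and attributes used in the source files could be enforced with a simple check before conversion. The following is a minimal sketch using Python's standard `html.parser` module. The sets `ALLOWED_TAGS` and `ALLOWED_ATTRS` are assumptions made up for illustration; they are not the restrictions defined by the EPUB specification.

```python
from html.parser import HTMLParser

# Assumed whitelists for illustration only; a real workflow would derive
# them from the target format and the house style.
ALLOWED_TAGS = {"html", "head", "title", "body", "h1", "h2", "p",
                "em", "strong", "a", "img"}
ALLOWED_ATTRS = {"href", "src", "alt", "class", "id"}


class RestrictionChecker(HTMLParser):
    """Collects tags and attributes that fall outside the whitelists."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag not in ALLOWED_TAGS:
            self.problems.append(f"tag <{tag}>")
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS:
                self.problems.append(f"attribute {name!r} on <{tag}>")


def find_problems(html_source):
    """Return a list of descriptions of disallowed tags and attributes."""
    checker = RestrictionChecker()
    checker.feed(html_source)
    checker.close()
    return checker.problems


# Hypothetical source fragment with one disallowed tag and attribute.
SAMPLE = '<p class="lead">Hello <font color="red">world</font></p>'
for problem in find_problems(SAMPLE):
    print("Not allowed:", problem)
```

A source file for which `find_problems` returns an empty list uses only the agreed subset, which is the precondition under which a fully automatic conversion becomes plausible.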
While solutions for web and EPUB publishing would not have to change much, the situation in print is quite different. As we have seen above, none of the standard print typesetting workflows is centered around HTML, and print does not require the text ever to be converted into a web-centric format. Source files will be a mix of Microsoft Word, Adobe InDesign and, in the case of large publishers, XML files.
We believe that many publishers could benefit from switching their workflow to HTML, while some publishers will still benefit from using XML. Independently of that choice, they will benefit from switching from XSL-FO to CSS for print styling, because this allows them to use the same or similar style definitions for all types of output.
Existing HTML-centric print formatters
The existing HTML-centric print formatters are stand-alone executables that accept HTML and CSS input and output printable PDFs. At least two major publishers have switched to HTML and CSS for book publishing: O'Reilly Media [McKesson2012] [McKesson2013] [Kleinfeld2013] and the Hachette Book Group [Cramer2012]. Even though the formatters accept fairly common HTML elements, the implementation of each formatter differs slightly. Authors of web-based content, and of the editors used to create it, try to comply not only with existing web specifications but also with the most common web browsers' actual implementations of those standards. They therefore test how their content renders in Google Chrome, Apple Safari, Mozilla Firefox and Internet Explorer, but not in formatters meant solely for print. Web content that renders without problems in all major browsers will consequently need extra attention before it can be converted by these formatters, for two reasons: features are implemented slightly differently in the formatters than in the browsers, and the formatters are relatively slow to support new CSS features, since their vendors implement the core engines on their own with much smaller development teams than the browsers have. This is one of the difficulties that the print-publishing industry currently faces in CSS typesetting. Other difficulties are that standard CSS does not include rules for everything needed for book styling, and that the extensions that add styling features important for book printing are at a very early stage of development [Bos2013].
Things needed: common styling specifications for print
Because we believe all styles should be configurable through CSS, part of our focus lies in ensuring that the extra elements that are only important for print and other page-based media are sufficiently defined in web specifications to ensure interoperability with other projects.
One of the more important specifications is the CSS Paged Media module. Several typesetting engines already support CSS Paged Media, the Antenna House Formatter and PrinceXML among them.
Browsers have implemented ways for users to create PDFs of web pages. Unfortunately, support for the CSS Paged Media specification has not been implemented in the main browsers. The same is true for most ebook display solutions.
Additionally, the typesetting engines supporting CSS Paged Media contain proprietary and mutually incompatible vendor extensions, which means that source files cannot easily be moved between engines.
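One way to gauge how portable a stylesheet is between engines is to list the vendor-prefixed properties it uses. The sketch below does this with a regular expression from Python's standard `re` module. It is a rough heuristic rather than a CSS parser, and the prefixes `-enginea-` and `-engineb-` as well as the property names in `SAMPLE_CSS` are hypothetical; they do not correspond to any real engine.

```python
import re

# Matches property names of the form -vendor-something that are followed
# by a colon. The lookbehind avoids matching the middle of ordinary
# hyphenated properties such as page-break-before.
VENDOR_PROPERTY = re.compile(r"(?<![\w-])-([a-z]+)-[a-z-]+(?=\s*:)")


def vendor_properties(css_source):
    """Group vendor-prefixed property names by their vendor prefix."""
    found = {}
    for match in VENDOR_PROPERTY.finditer(css_source):
        found.setdefault(match.group(1), set()).add(match.group(0))
    return {vendor: sorted(names) for vendor, names in found.items()}


# Hypothetical print stylesheet mixing standard and vendor-specific rules.
SAMPLE_CSS = """
@page { size: A5; margin: 20mm; }
h1 { -enginea-example-feature: on; page-break-before: always; }
p  { -engineb-example-feature: on; }
"""

for vendor, names in vendor_properties(SAMPLE_CSS).items():
    print(vendor, names)
```

A stylesheet for which this returns an empty dictionary relies only on unprefixed properties, although that alone does not guarantee that every engine implements them identically.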
With the Vivliostyle project we prioritize advancing the development of web standards so that Vivliostyle.js will be interoperable with other and future web-based print solutions.
We have started to work with the World Wide Web Consortium (W3C) to enhance and promote CSS Paged Media and related specifications such as CSS Page Floats and CSS Generated Content for Paged Media.
Vivliostyle.js has been in development for half a year, and work on it continues. It parses page-related CSS properties that regular browsers ignore. So far it is able to do basic page styling, including footnotes, page numbering, floats and page headers.
Figure 1: Vivliostyle.js
[Berjon2014] Berjon, Robin.
Mending Fences and Saving Babies. Presented at Symposium on HTML5 and XML, Washington, DC, August 4, 2014. In Proceedings of the Symposium on HTML5 and XML. Balisage Series on Markup Technologies, vol. 14 (2014). doi:https://doi.org/10.4242/BalisageVol14.Berjon01.
[Bos2013] Bos, Bert.
Can you typeset a book with CSS? Presented at 2nd W3C Workshop on Electronic Books and the Open Web Platform, Tokyo, Japan, June 4, 2013. http://www.w3.org/Talks/2013/0604-CSS-Tokyo/.
[Cramer2012] Cramer, Dave.
Production as if Digital Mattered: Making books with HTML and CSS. New York, 2012. http://infogridpacific.typepad.com/files/idpf-2012-cramer-smaller.pdf.
[Graham2014] Graham, Tony.
Formatting from XML. XML Prague, Prague, Czech Republic, February 14–16, 2014. In XML Prague 2014 Conference Proceedings. http://archive.xmlprague.cz/2014/files/xmlprague-2014-proceedings.pdf.
[Kelly2014] Kelly, Mike.
XSL-FO Is Dead, CSS Paged Media Is Prime Suspect. [online]. Rockweb. June 4, 2014. [cited 13 Apr
[Kleinfeld2013] Kleinfeld, Sanders.
The Case for Authoring and Producing Books in (X)HTML5. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6–9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Kleinfeld01.
[McKesson2012] McKesson, Nellie.
Building Books with CSS3. [online]. Rockweb. June 12, 2012. [cited 13 Apr
[McKesson2013] McKesson, Nellie.
Publisher Case Study: O'Reilly Media. IDPF Seminar: EPUB and the Open Web Platform for Publishers, Noida, India, November 30, 2013. http://idpf.org/sites/default/files/file_attach/oreilly-case-study.pdf.