How to cite this paper

Graham, Tony. “Copy-fitting for Fun and Profit.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). https://doi.org/10.4242/BalisageVol21.Graham01.

Balisage: The Markup Conference 2018
July 31 - August 3, 2018

Balisage Paper: Copy-fitting for Fun and Profit

Tony Graham

Antenna House, Inc.

`<tony@antennahouse.com>`

`<tgraham@antenna.co.jp>`

Tony Graham is a Senior Architect with Antenna House, where he works on their XSL-FO and CSS formatter, cloud-based authoring solution, and related products. He also provides XSL-FO and XSLT consulting and training services on behalf of Antenna House.

Tony has been working with markup since 1991, with XML since 1996, and with XSLT/XSL-FO since 1998. He is Chair of the Print and Page Layout Community Group at the W3C and previously an invited expert on the W3C XML Print and Page Layout Working Group (XPPL) defining the XSL-FO specification, as well as an acknowledged expert in XSLT. Tony is the developer of the ‘stf’ Schematron testing framework and also Antenna House’s ‘focheck’ XSL-FO validation tool, a committer to both the XSpec and Juxy XSLT testing frameworks, the author of “Unicode: A Primer”, and a qualified trainer.

Tony’s career in XML and SGML spans Japan, USA, UK, and Ireland. Before joining Antenna House, he had previously been an independent consultant, a Staff Engineer with Sun Microsystems, a Senior Consultant with Mulberry Technologies, and a Document Analyst with Uniscope. He has worked with data in English, Chinese, Japanese, and Korean, and with academic, automotive, publishing, software, and telecommunications applications. He has also spoken about XML, XSLT, XSL-FO, EPUB, and related technologies to clients and conferences in North America, Europe, Japan, and Australia.

Abstract

Copy-fitting is the fitting of words into the space available for them or, sometimes, adjusting the space available to fit the words. Copy-fitting is included in “Extensible Stylesheet Language (XSL) Requirements Version 2.0” and is a common part of producing real-world documents. This talk describes an ongoing internal project for automatically finding and fixing problems that can be fixed by copy-fitting in XSL-FO.

Introduction

Copy-fitting as estimating
Copy-fitting as adjustment

Copy-fitting for fun

Copy-fitting for profit

Books
Manuals and other documentation

Standards for copy-fitting of XML or HTML

Extensible Stylesheet Language (XSL) 1.1
Extensible Stylesheet Language (XSL) Requirements Version 2.0
List of CSS features required for paged media

Existing Extensions

Print & Page Layout Community Group
AH Formatter
FOP

Copy-fitting Implementation

Error condition XSLT
Formatting error XML
PDF error report
Copy-fitting instructions

Future Work

Conclusion

Introduction

Copy-fitting has two meanings: it is both the “process of estimating the amount of space typewritten copy will occupy when converted into type” [White] and the process of adjusting the formatted text to fit the available space. Since automated formatting, such as with an XSL-FO formatter, is now so common, many of the manual processes for estimating the amount of space are superfluous.

Copy-fitting as estimating

There are multiple, and sometimes conflicting, aspects to the relationship between copy-fitting and profit. For the commercial publisher, there is tension between more pages costing more money and more pages providing a better reading experience or even a better “shelf appeal”, for want of a better term. We tell ourselves not to judge a book by its cover, but we do still sometimes judge a book by the width of its spine when we look at it on the shelf in the bookstore. Copy-fitting, in this first sense, is part of the process of making the text fill the number of pages (the ‘extent’) that has been decided in advance by the publisher. This may mean increasing the font-size, leading, other spaces, and the margins to fill more pages, or it may mean reducing them so that more text fits on a page.

The six steps to copyfitting from “How to Spec Type” [White] are:

Count manuscript characters

Select the typeface and type size you want.

Cast off to determine the number of set characters per line

Divide set characters per line into total manuscript character count. Result is number of lines of set type.

Add leading to taste.

If too short: add leading or increase type size or decrease line length.

If too long: remove leading or decrease type size or increase line length.

However:

Most important of all, decisions should be made with the ultimate aim of benefiting the reader.

— “Book Typography: A Designer’s Manual”, Mitchell & Wightman
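The arithmetic behind the estimating steps above can be sketched in a few lines of Python. This is only an illustration: the function name, the treatment of leading as extra points added to the type size, and the sample figures are assumptions, not values taken from “How to Spec Type”.

```python
import math


def estimate_depth(manuscript_chars, chars_per_line, type_size_pt, leading_pt):
    """Estimate the number of set lines and their total depth in points.

    chars_per_line is the result of casting off for the chosen typeface,
    type size and line length; it is an input here, not computed.
    leading_pt is assumed to be extra space added to each line.
    """
    if chars_per_line <= 0:
        raise ValueError("chars_per_line must be positive")
    # Divide set characters per line into the manuscript character count.
    lines = math.ceil(manuscript_chars / chars_per_line)
    # Each line occupies the type size plus the leading.
    depth_pt = lines * (type_size_pt + leading_pt)
    return lines, depth_pt


# Hypothetical job: 12,000 characters, 60 characters per line, 10pt type
# with 2pt of leading.
lines, depth_pt = estimate_depth(12000, 60, 10, 2)
```

If the resulting depth is too short or too long for the space, the last two steps apply: change the leading, the type size, or the line length (which changes `chars_per_line`) and estimate again.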

Copy-fitting as adjustment

‘Copy-fitting’ as adjustment is now the more common use for the term.

Copy-fitting for fun

It’s an interesting problem to solve.

Copy-fitting for profit

Books

The total number of pages in a book is generally a multiple of 16 or 32, and any variation can involve additional cost:

It must be remembered that books are normally printed and bound in sheets of 16, 32, 64 (or multiples upwards) pages, and this exact figure for any job must be known by the designer at the point of designing: it will be the designer’s job to see that the book makes the required number of pages exactly. If a book is being printed and bound in sheets of 32 pages (16 on each side of the sheet), it is generally possible to add one leaf of 2 pages by ‘tipping in’, or 4, 8, or any other multiple of 4 pages extra by additional binding operation, but the former will mean hand-work, and even the latter will involve disproportionately higher cost.

— “The Thames and Hudson Manual of Typography”, Ruari McLean, 1980

The role here for copy-fitting (in the second sense) is to ensure that the overall page count is at or close to a multiple of the number of pages in a signature.

Even Print-On-Demand (POD) printing has similar constraints. For example, the Blurb POD service requires that books in “trade” formats – e.g. 6×9 inches – are a multiple of six pages [Blurb]. In a simplistic example, suppose that the document with the default styles applied formats as 15 pages.

The (self-)publisher has the choice of three alternatives:

Leave the layout unchanged. However, the publisher is paying for the blank pages, and the number of blank pages may be more than the house style allows.
Copy-fit to reduce the page count to the next lower multiple of the signature size. This, obviously, is cheaper than paying for blank pages, but “the public still has a tendency to judge the value of a book by its thickness” [Williamson, p. 287].

Figure 3: After copy-fitting to reduce page count
Copy-fit to increase the page count to better fill a multiple of the signature size. This, just as obviously, costs more than reducing the page count, but it has the potential for helping sales.

Figure 4: After copy-fitting to increase page count
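The page-count targets behind these three alternatives are simple to compute. The following is a minimal sketch, with a hypothetical function name and result keys, using the 15-page document and six-page multiple from the example above.

```python
def signature_options(page_count, signature_size):
    """Return the page-count targets for a given signature size.

    lower and upper are the nearest multiples of signature_size at or
    below and at or above page_count.
    """
    if page_count < 0 or signature_size <= 0:
        raise ValueError("page_count must be >= 0 and signature_size > 0")
    lower = (page_count // signature_size) * signature_size
    upper = -(-page_count // signature_size) * signature_size  # ceiling
    return {
        "lower": lower,
        "upper": upper,
        # Blank pages paid for if the layout is left unchanged.
        "blank_if_unchanged": upper - page_count,
        # Pages that copy-fitting must save to reach the lower target.
        "pages_to_save": page_count - lower,
    }


options = signature_options(15, 6)
```

For the 15-page example, leaving the layout unchanged pays for three blank pages, copy-fitting down must save three pages to reach 12, and copy-fitting up has three pages to fill on the way to 18.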

Manuals and other documentation

A manufacturer who provides printed documentation along with their product faces a different set of trade-offs. The format for the documentation may be constrained by the size of the product and its packaging, regulatory requirements, or the house style for documentation of a particular type. A manufacturer may need to print thousands or even millions^[1] of documents. Users expect clear documentation but may be unwilling to pay extra for improved documentation, while unclear documentation can lead to increased support costs or, in some cases, to fatalities.

Suppose, for example, that the text for a document has been approved by the subject matter experts – and, in some cases, also by the company’s lawyers and possibly the government regulator – and has to be printed on a single standard-size page, or possibly a single side of a standard-size page, yet there is too much text for the space that is available. To let the editorial staff rewrite the text so that it fits the available space is definitely not an option, so it becomes necessary to apply copy-fitting to adjust the formatting until the text does fit. Figure 5 shows an information sheet that was included with the Canon MP830 when sold in EMEA. The same information in 24 languages is printed on a single-sided sheet of paper. Most users would look at the information once, at most. If the information had been allowed to extend to the back of the sheet, it would have considerably increased the cost of providing the information for no real benefit. Some form of copy-fitting was probably used to make sure that the information could fit on one side of the sheet.

Multilingual text has other complications. Figure 6 shows two corresponding pages from the English and Brazilian Portuguese editions of the Canon MP830 Quick Start Guide. The same information is presented on both pages. However, translations of English are typically longer than the corresponding English, and this is no exception. The Brazilian Portuguese page has more lines of text on it and, as Figure 7 shows, the font-size and leading have also been reduced in the Brazilian Portuguese page. A copy-fitting process could have been used to adjust the font-size and leading across the whole document by the minimum necessary so that no text overflowed its page.

Standards for copy-fitting of XML or HTML

There are currently no standards for how to specify copy-fitting for either XML or HTML markup. However, copy-fitting was covered in the requirements for XSL 2.0, and the forward-looking “List of CSS features required for paged media” by Bert Bos has an extensive section on copy-fitting.

Extensible Stylesheet Language (XSL) 1.1

XSL 1.1 does not address copy-fitting, but it does define multiple properties for controlling aspects of formatting that, if the properties are not applied, could lead to problems that would need to be corrected using copy-fitting:

`hyphenation-keep`	Controls whether the last line of a column or page may end with a hyphen.
`hyphenation-ladder-count`	Limits the number of successive hyphenated lines.
`orphans`	The minimum number of lines that must be left at the bottom of a page.
`widows`	The minimum number of lines that must be left at the top of a page.

Extensible Stylesheet Language (XSL) Requirements Version 2.0

“Extensible Stylesheet Language (XSL) Requirements Version 2.0” [XSLReq2.0] includes:

2.1.4 Copyfitting

Add support for copyfitting, for example to shrink or grow content (change properties of text, line-spacing, ...) to make it constrain to a certain area. This is going to be managed by a defined set of properties, and in the stylesheet it will be possible to define the preference and priority for which properties should be changed. That list of properties that can be used for copyfitting is going to be defined.

Additionally, multiple instances of alternative content can be provided to determine best fit.

This includes copyfitting across a given number of pages, regions, columns etc, for example to constrain the number of pages to 5 pages.

Add the ability to keep consistency in the document, e.g. when a specific area is copyfitted with 10 pt fonts, all other similar text should be the same.

List of CSS features required for paged media

“List of CSS features required for paged media” (https://www.w3.org/Style/2013/paged-media-tasks) by Bert Bos has a ‘Copyfitting’ section. Part of it is relevant to fitting content into specified pages.

20. Copyfitting

Copyfitting is the process of selecting fonts and other parameters such that text fits a given space. This may range from making a book have a certain number of pages, to making a word fit a certain box.

20.1 Micro-adjustments

If a page has enough content, nicer-looking alignments and line breaks can often be achieved by “cheating” a little: instead of the specified line height, use a fraction of a point more or less. Instead of the normal letter sizes, make the letters a tiny bit wider or narrower…

This can also help in balancing columns: In a newspaper, e.g., it may look better to have all columns of an article the same height at the cost of a slightly bigger line height in the last column, than to have all lines aligned but with a gap below the last column.

The French newspaper “Le Canard enchainé” is an example of a publication that favors full columns over equal line heights.

20.2 Automatic selection of font size

One common case is choosing a font size such that a headline exactly fills the width of the page.

A variant is the case where each individual line of the text may be given a different font size, as small as possible above a certain minimum.

Two models suggested for CSS are to see copyfitting either as one of several algorithms available for justification, and thus as a possible value for ‘text-justify’; or as a way to treat overflow, and thus as a possible value for ‘overflow-style’. Both can be useful and they can co-exist:
H1 {text-align: justify; text-justify: copyfit}
H2 {height: 10em; overflow: hidden; overflow-style: copyfit}
The first rule could mean that in each line of the block, rather than shrinking or stretching the interword space to fill out the line, the font size of each letter is decreased or increased by a certain factor so that the line is exactly filled out. The latter could mean that the font size of all text in the block is decreased or increased by a common factor so that the font size is as large as possible without causing the text to overflow. (As the example shows, this type of copyfitting requires the block’s width and height to be set.)

Figure 9: The title of the chapter is one word that exactly fills the width of the page

20.3 Alternative content or style

If line breaks or page breaks turn out very bad, a designer may go back to the author and ask if he can’t replace a word or change a sentence somewhere, or add or remove an image.

In CSS, we assume we cannot ask the author, but the author may have proposed alternatives in advance.

Alternatives can be in the style sheet (e.g., an alternative layout for some images) or in the source (e.g., alternative text for some sentence).

In the style sheet, those alternatives would be selected by some selector that only matches if that alternative is better by some measure than the first choice.

Some alternatives may be provided in the form of an algorithm instead of a set of fixed alternatives. E.g., in the case of alternative image content, the alternative may consist of progressively cropping and scaling the image up to a certain limit and in such a way that the most important content always remains visible.

E.g., an image of a group of people around two main characters can be divided into zones that are progressively less important: the room they are in, people’s feet, the less important people, up to just the heads of the two main characters, which should always be there.

Existing Extensions

Print & Page Layout Community Group

The Print and Page Layout Community Group developed a series of open-source extensions for XSLT processors so you can run any number of iterations of your XSL-FO processor from within your XSLT transformation, which allows you to make decisions based on formatted sizes of areas.

The extensions are currently available for Java and DotNet and use either the Apache FOP XSL formatter or the Antenna House AH Formatter to produce the area trees.

To date, stylesheets that use the extensions have been bespoke: writing a stylesheet that uses the extensions has required knowledge of the source XML, and the stylesheet for transforming the XML into XSL-FO is the stylesheet that uses the XSLT extensions.

AH Formatter

AH Formatter, from Antenna House, extends the overflow property. When text overflows the area defined for it, the text may either be replaced or one of a set of properties – including font-size and font-stretch – can be automatically reduced (down to a defined lower limit) to make the text fit into the defined area.

FOP

FOP provides fox:orphan-content-limit and fox:widow-content-limit extension properties for specifying a minimum length to leave behind or carry forward, respectively, when a table or list block breaks over a page.

Copy-fitting Implementation

The currently implemented processing paths are shown in the following figure. The simplest processing path is the normal processing of an XSL-FO file to produce formatted pages as PDF. The copy-fitting processes instead require the XSL-FO to be formatted and output as Area Tree XML (an XML representation of the formatted pages), which is analyzed to detect error conditions.

As currently implemented, each supported error condition is implemented as an XSLT 2.0 template defined in a separate XSLT file. A separate XSLT stylesheet takes the Area Tree XML as input and imports the error condition stylesheets. The simplest version of this stylesheet outputs an XML representation of the errors found. This XML can be processed to generate a report detailing the error conditions. Alternatively, the error information can be combined with the Area Tree XML to generate a version of the formatted document that has the errors highlighted.

Since copy-fitting involves modifying the document, another alternative stylesheet uses the XSLT extension functions from the Print & Page Layout Community Group at the W3C to run the XSL-FO formatter during the XSLT transformation, iteratively adjusting selected aspects of the XSL-FO until the Area Tree XML does not contain any errors (or until the limit of either the adjustment tolerance or the maximum number of iterations has been reached).

Error condition XSLT

The individual XSLT file for an error condition consists of an XSLT template that matches on a node with that specific error. The result of the template is an XML node encoding the error condition and its location. The details of how to represent the information are not part of the template (and are still in flux anyway).

<!-- Report a line area that ends a page with a hyphen. The page number
     and the current x and y position arrive as tunnel parameters. -->
<xsl:template
    match="at:LineArea[ahf:is-page-end-hyphen(.)]"
    mode="ahf:errors">
  <xsl:param name="page" as="xs:integer" tunnel="yes" required="yes" />
  <xsl:param name="x" as="xs:double" tunnel="yes" required="yes" />
  <xsl:param name="y" as="xs:double" tunnel="yes" required="yes" />

  <!-- Add this area's own offsets, when present, to the position. -->
  <xsl:variable
      name="x"
      select="if (exists(@left-position))
                then $x + ahf:length-to-pt(@left-position)
              else $x"
      as="xs:double" />
  <xsl:variable
      name="y"
      select="if (exists(@top-position))
                then $y + ahf:length-to-pt(@top-position)
              else $y"
      as="xs:double" />

  <!-- Locate the error at the right-hand end of the line. -->
  <xsl:sequence
      select="ahf:error('page-end-hyphen', $page, $x + ahf:length-to-pt(@width),
          $y - ahf:length-to-pt(at:TextArea[last()]/@font-size) div 2)" />

  <!-- Let any other matching error templates process this line. -->
  <xsl:next-match />
</xsl:template>

<!-- True when the line has no following sibling line area, its last
     text area ends with '-' or '‐' (U+2010), and its nearest column
     reference area ancestor has no following sibling. -->
<xsl:function name="ahf:is-page-end-hyphen" as="xs:boolean">
  <xsl:param name="line-area" as="element(at:LineArea)" />

  <xsl:sequence
      select="empty($line-area/following-sibling::at:LineArea) and
              $line-area/at:TextArea[last()][@text[ends-with(., '-') or ends-with(., '‐')]] and
              empty($line-area/ancestor::at:ColumnReferenceArea[1]/following-sibling::*)" />
</xsl:function>

Formatting error XML

The XML for reporting errors is essentially just a list of errors and their locations. Again, this is still in flux.

<errors>
   <error code="max-hyphens" page="1" x0="523.2752" y0="583.367"/>
   <error code="page-end-hyphen" page="2" x0="523.2760000000001" y0="740"/>
   <error code="paragraph-widow" page="6" x0="94.462" y0="418"/>
   <error code="page-end-hyphen" page="7" x0="523.2760000000001" y0="740"/>
   <error code="page-sequence-widow" page="8" x0="72" y0="72"/>
</errors>
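Because the error XML is a flat list, a report can start from a simple tally. The following Python sketch is not part of the project described here; it only assumes the element and attribute names shown in the sample above, which the text notes are still in flux, and `summarize_errors` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The sample error list from above.
ERRORS_XML = """<errors>
   <error code="max-hyphens" page="1" x0="523.2752" y0="583.367"/>
   <error code="page-end-hyphen" page="2" x0="523.2760000000001" y0="740"/>
   <error code="paragraph-widow" page="6" x0="94.462" y0="418"/>
   <error code="page-end-hyphen" page="7" x0="523.2760000000001" y0="740"/>
   <error code="page-sequence-widow" page="8" x0="72" y0="72"/>
</errors>"""


def summarize_errors(xml_text):
    """Return (count of errors per code, sorted list of affected pages)."""
    root = ET.fromstring(xml_text)
    errors = list(root.iter("error"))
    by_code = Counter(error.get("code") for error in errors)
    pages = sorted({int(error.get("page")) for error in errors})
    return by_code, pages


by_code, pages = summarize_errors(ERRORS_XML)
```

The per-code counts give a quick measure of how one formatting round compares with another, and the page list shows where to look in the formatted output.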

PDF error report

The error XML can be processed to generate a report. It is, of course, also possible to augment the Area Tree XML to add indications of the errors to the formatted result, as in the simple example below.

Copy-fitting instructions

The copy-fitting instructions consist of sets of contexts and the changes to make in each context. The sets are applied in turn until either the current formatting round does not generate any errors or the sets are exhausted, in which case the results from the round with the fewest errors are used. Within each set of contexts and changes, the changes can either be applied in sequence or all together. Like the rest of the processing, the XML format is still in flux.

<copyfitsets>
  <copyfit use="all" name="copyfit1">
    <match role="chapter-drop" />
    <use height="25%"/>
    <match font-family="'Source Serif Pro', serif" font-size="11pt" />
    <use font-size="10.5pt" line-height="13.5pt"/>
  </copyfit>
  <copyfit use="first">
    <match role="chapter-drop" />
    <use height="20%"/>
    <match font-family="'Source Serif Pro', serif" font-size="11pt" />
    <use font-size="10.5pt" line-height="13.5pt"/>
  </copyfit>
</copyfitsets>
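The way the sets are tried in turn can be sketched as a short loop. This Python outline is an illustration only: `format_with` is a hypothetical callback standing in for "apply this set's changes, run the formatter, and count the errors in the resulting Area Tree XML", and keeping the earliest round when two rounds tie is an assumption.

```python
def apply_copyfit_sets(sets, format_with):
    """Try each copy-fit set in turn and return (best_set, error_count).

    Stops at the first round with no errors. If every round has errors,
    the round with the fewest errors is returned. Returns (None, None)
    when there are no sets.
    """
    best_set = None
    best_errors = None
    for candidate in sets:
        errors = format_with(candidate)  # hypothetical formatting round
        if best_errors is None or errors < best_errors:
            best_set, best_errors = candidate, errors
        if errors == 0:
            break  # nothing left to fix, so later sets are not tried
    return best_set, best_errors


# Stand-in for real formatting rounds: error counts per named set.
fake_rounds = {"copyfit1": 3, "copyfit2": 1, "copyfit3": 2}
result = apply_copyfit_sets(list(fake_rounds), fake_rounds.get)
```

Here no set removes every error, so the second set, which leaves the fewest, is the one whose results would be used.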

When the XSLT extensions from the Print & Page Layout Community Group are used, the changes instruction indicates a range of values. The XSLT initially uses the .start value and, if errors are found, does a binary search between the .start and .end values. Iterations continue until no errors occur, the maximum number of iterations is reached, or the difference between iterations is less than the allowed tolerance.

<copyfitsets>
  <copyfit condition="page-sequence-widow">
    <match  font-size="11pt" />
    <use line-height.start="14pt" line-height.end="13pt" />
  </copyfit>
</copyfitsets>
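The search between the `.start` and `.end` values can be sketched as follows. This is a minimal Python outline rather than the XSLT used in the project: `count_errors` is a hypothetical callback that formats with the given value and returns the number of errors, values are plain numbers in points, and the sketch assumes that moving from the start value towards the end value is what reduces errors.

```python
def search_value(start, end, count_errors, max_iterations=10, tolerance=0.05):
    """Search between start and end for a value that formats without errors.

    Returns (value, errors) for the first error-free value found, or for
    the value with the fewest errors if the iteration limit or the
    tolerance is reached first.
    """
    errors = count_errors(start)
    iterations = 1
    best = (start, errors)
    if errors == 0:
        return best
    lo, hi = start, end  # lo is the last value known to produce errors
    while iterations < max_iterations and abs(hi - lo) > tolerance:
        mid = (lo + hi) / 2
        errors = count_errors(mid)
        iterations += 1
        if errors == 0:
            return mid, 0
        if errors < best[1]:
            best = (mid, errors)
        lo = mid  # still errors, so move further towards the end value
    return best


# Hypothetical document whose widow disappears at 13.3pt line-height or less.
found = search_value(14.0, 13.0, lambda value: 0 if value <= 13.3 else 2)
```

With the 14pt to 13pt range from the example above, the sketch tries 14, then 13.5, then 13.25, and stops there because that round has no errors.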

The copy-fitting instructions are transformed into XSLT that is executed by the XSLT processor, similarly to how Schematron files and XSpec files are transformed into XSLT that is then executed.

Future Work

There should be an XML format for selecting which error tests to use and what threshold values to use for each test. That XML would be converted into the XSLT that is run when checking for errors.
There is currently only a limited number of properties that can be matched on. The range is due to be expanded as we get the hang of doing copy-fitting. The match conditions are transformed into match attributes in the generated XSLT, so there is a lot of scope for improvement.
The range of correction actions is due to be increased to include, for example, supplying alternate text.

Conclusion

Automated detection and correction of formatting problems can solve a set of real problems for real documents. There is a larger set of formatting problems that can be recognized automatically and reported to the user in a variety of ways but which so far are not amenable to automatic correction. Work is ongoing to extend both the set of formatting problems that can be recognized and the set of problems that can be corrected automatically.

References

[Blurb] Uploading an existing PDF to Blurb. https://support.blurb.com/hc/en-us/articles/207792796-Uploading-an-existing-PDF-to-Blurb

[White] White, Alex. How to Spec Type. Roundtable Press.

[XSLReq2.0] Extensible Stylesheet Language (XSL) Requirements Version 2.0. https://www.w3.org/TR/xslfo20-req/#copyfitting

[Bos] List of CSS features required for paged media. https://www.w3.org/Style/2013/paged-media-tasks#copyfitting

[Ext] XSLTExtensions. https://www.w3.org/community/ppl/wiki/XSLTExtensions

[WebMD] The 10 Most Prescribed Drugs. https://www.webmd.com/drug-medication/news/20110420/the-10-most-prescribed-drugs#1

[Williamson] Williamson, Hugh. Methods of Book Design, 3rd ed. Yale University Press, 1983. ISBN 0-300-03035-5

^[1] A prescription or over-the-counter medication comes with a printed package insert, and, for example, nearly 4 billion prescriptions were written in the U.S. in 2010, with the most-prescribed drug prescribed over 100 million times [WebMD].
