The Hard Edges of Soft Hyphens

Syd Bauman

Senior XML Programmer / Analyst

Northeastern University / Library / CDS / WWP

Copyright © 2016 Syd Bauman. Some rights reserved.



Balisage: The Markup Conference 2016
August 2 - 5, 2016

Note to the reader

A link to an updated version (or even a newer edition) of this paper may be available on the WWP bibliography page.


The section “Introduction” explains what soft hyphens are and how they are encoded, and then discusses the desired processing (called resolution). The next two sections (“Seems easy …” and “Further complications”) present an algorithm for performing that resolution, followed by a somewhat detailed discussion of the features of TEI encoding that make it difficult, along with a few of the policies at the WWP that try to make it a bit easier. Lastly, the section “Attempts” briefly discusses various attempts to perform this processing.


Soft hyphens

In the modern post-Unicode era, a soft hyphen is typically defined as a spot where “you, the word processor, may break this word across a line break, if needed”.[1] But even as recently as ISO 8859 a soft hyphen was “for use when a line break has been established within a word”.[2] Although not called a soft hyphen back then, this use of the hyphen has been around for centuries. E.g., the OED cites NWEW as saying “Hyphen … is used … when one part of a word concludes the former Line, and the one begins the next.” It is this latter (older) definition with which we are concerned here: a computer character (or other XML construct) used in a transcription to indicate where an end-of-line hyphen was printed in the source text to indicate “this word is continued on the next line”.

The use of such characters (hyphen to indicate word continued on next line) is nearly ubiquitous in printed works (at least in English). For example, I searched Google Books for the word balisage, and looked at the first book listed.[3] Even though I cannot read it because it is in French, there are obviously four soft hyphens on the first page of printed prose alone (i.e., ignoring the title page, etc.); that page has just over 200 words spread over 20 lines. In the first full chapter of Michael Kay’s book[4] I counted 67 soft hyphens in roughly 17,760 words over roughly 1410 lines.

Recording lineation and end-of-line hyphens

The Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange contain a discussion of how to handle these extant typographic indicators.[5] One common solution is to ignore the soft hyphens, and to simply transcribe the word that has been broken across a line break as a single word. Consider the following example.[6]

png image ../../../vol17/graphics/Bauman01/Bauman01-001.png
This passage might be encoded[7] as
      so far they’d been smart enough to keep quiet about it. I’d never seen any
      posts about the Tomb of Horrors on any gunter message boards. I realized,
      of course, that this might be because my theory about the old D&amp;D
      module was completely lame and totally off base.</p>
or even
      so far they’d been smart enough to keep quiet about it. I’d
      never seen any posts about the Tomb of Horrors on any
      gunter message boards. I realized, of course, that this might
      be because my theory about the old D&amp;D module was
      completely lame and totally off base.</p>
or, if encoding original lineation
<lb/>so far they’d been smart enough to keep quiet about it. I’d never seen any
      <lb/>posts about the Tomb of Horrors on any gunter message boards. I realized,
      <lb/>of course, that this might be because my theory about the old D&amp;D
      <lb/>module was completely lame and totally off base.</p>

In all three of the above encodings, the word realized has been silently reconstituted from its constituent parts, the initial portion immediately prior to the soft hyphen, and the final portion shortly after the soft hyphen. In the first and third examples the soft hyphen is resolved by moving the final portion of the word up from the beginning of its line to the end of the previous line (which I call finalUp). It could just as easily have been resolved by moving the initial portion of the word from the end of its line to the beginning of the next line (which I call initDown). For most of this paper I will discuss resolution in only the finalUp direction, but the issues generally apply equally well to both directions.

Personally, I do not like that third (last) encoding. It explicitly asserts there was a line break in the source document between realized, and of course, which is not true. But nonetheless, the practice is not uncommon. The first encoded example is not nearly so bad, as its implication that there was a line break at that spot is implicit, not explicit. The middle of the three encoding possibilities is not objectionable in assertion of line breaks at all, since it makes no such assertions. (One could infer them, but they are not implied.) However, many projects will find it a disadvantage to transcribe prose completely irrespective of original lineation. Keeping track of original lineation is very helpful when trying to align the source document with the transcription (or the output of processing the transcription). Even if a project does not think the users of its transcribed texts will appreciate this alignment, the project proofreaders will — a lot.

Another common approach is to explicitly record both the hyphen character and original lineation. Consider the following example.[8]

         It turns out that the most important voice in the Su‐
      preme Court nomination battle is not the American peo‐
      ple’s, as Senate Republicans have insisted from the mo‐
      ment Justice Antonin Scalia died last month. It is not even
      that of the senators. It’s the National Rifle Association’s.
         That is what the majority leader, Mitch McConnell,
      said the other day when asked about the possibility of con‐
      sidering and confirming President Obama’s nominee,
      Judge Merrick Garland, after the November elections. “I
      can’t imagine that a Republican majority in the United
      States Senate would want to confirm, in a lame-duck ses‐
      sion, a nominee opposed by the National Rifle Associa‐
      tion,” he told “Fox News Sunday.”
This excerpt from a New York Times editorial might be encoded as follows.
<p>It turns out that the most important voice in the Su<pc force="weak">-</pc>
      <lb break="no"/>preme Court nomination battle is not the American peo<pc force="weak">-</pc>
      <lb break="no"/>ple’s, as Senate Republicans have insisted from the mo<pc force="weak">-</pc>
      <lb break="no"/>ment Justice Antonin Scalia died last month. It is not even
      <lb/>that of the senators. It’s the National Rifle Association’s.</p>
      <p>That is what the majority leader, Mitch McConnell,
      <lb/>said the other day when asked about the possibility of con<pc force="weak">-</pc>
      <lb break="no"/>sidering and confirming President Obama’s nominee,
      <lb/>Judge Merrick Garland, after the November elections. “I
      <lb/>can’t imagine that a Republican majority in the United
      <lb/>States Senate would want to confirm, in a lame-duck ses<pc force="weak">-</pc>
      <lb break="no"/>sion, a nominee opposed by the National Rifle Associa<pc force="weak">-</pc>
      <lb break="no"/>tion,” he told “Fox News Sunday.”</p>
Here it is explicit that the hyphen character is not a word separator (force="weak"), and that the line break does not imply the end of an orthographic token (break="no"). It is worth noting that many TEI projects choose to use either <pc force="weak"> or <lb break="no">, but not both.

At my project[9] we encode soft hyphens using the Unicode character SOFT HYPHEN (U+00AD). Given that this character is explicitly of the “a word processor may insert a hyphen here if needed” variety, in some sense it is technically incorrect to use it for this purpose. Furthermore, the TEI Guidelines do not recommend this use. In our defense, we chose this path back in the ISO 8859 days, when &shy; was an SGML SDATA reference that did not necessarily mean code-point 0xAD. But more importantly, the detail of which character is used to represent the “this word is continued on the next line” glyph that was on the source page does not matter, so long as it is not also used for some other purpose in the same file. So, given that we are encoding early modern printed books, we could just as well have used the EURO SIGN (U+20AC) for this purpose. In either case it is character abuse; however the abuse of SOFT HYPHEN seems much less dramatic than would be the abuse of EURO SIGN: this has something to do with hyphenation, and nothing to do with currency.

So our encoding of the excerpt from the New York Times editorial would be as follows.[10]

<p>It turns out that the most important voice in the Su&#xAD;
        <lb/>preme Court nomination battle is not the American peo&#xAD;
        <lb/>ple's, as Senate Republicans have insisted from the mo&#xAD;
        <lb/>ment Justice Antonin Scalia died last month. It is not even
        <lb/>that of the senators. It's the National Rifle Association's.</p>
        <p>That is what the majority leader, Mitch McConnell,
        <lb/>said the other day when asked about the possibility of con&#xAD;
        <lb/>sidering and confirming President Obama's nominee,
        <lb/>Judge Merrick Garland, after the November elections. <said>I
        <lb/>can't imagine that a Republican majority in the United
        <lb/>States Senate would want to confirm, in a lame-duck ses&#xAD;
        <lb/>sion, a nominee opposed by the National Rifle Associa&#xAD;
        <lb/>tion,</said> he told <title>Fox News Sunday.</title></p>

Desired output

Encoding texts serves little purpose unless some sort of analysis or output generation (or both) is undertaken. If all we wanted to do was read the text, scanned images of the pages would do.

Consider the following snippet of an encoded text:[11]

<p>Whatever has been ſaid 
        <lb n="14"/>by Men of more Wit than
        <lb n="15"/>Wiſdom, and perhaps of
        <lb n="16"/>more malice than either,
        <lb n="17"/>that Women are natural&#xAD;
        <lb n="18"/>ly Incapable of acting Pru&#xAD;
        <lb n="19"/>dently, or that they are
        <lb n="20"/>neceſſarily determined to
        <lb n="21"/>folly, …
For most analyses we would prefer the words naturally and Prudently to occur in our data, and the tokens ly, Pru, and dently not to occur. That is, we would like the soft hyphens resolved. The exception is the physical bibliographer who is interested in the phenomenon of breaking a word across a line.

As with analyses, for most purposes we would prefer to read the text with as few interruptions to words as possible. The obvious exception is when we want to align reading of the processed output with the physical source page or a facsimile thereof. This alignment makes proofreading much easier.

Thus for proofreading we might like to see something like the following.

        13: Whatever has been ſaid 
        14: by Men of more Wit than
        15: Wiſdom, and perhaps of
        16: more malice than either,
        17: that Women are natural-
        18: ly Incapable of acting Pru-
        19: dently, or that they are
        20: neceſſarily determined to
        21: folly, …
Whereas for casual reading, we might prefer:
        Whatever has been said by Men of more Wit than Wisdom, and
        perhaps of more malice than either, that Women are naturally
        Incapable of acting Prudently, or that they are necessarily
        determined to folly, …
The question is, of course, how to get that resolved output.
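The two target outputs can be illustrated with a minimal Python sketch. It assumes the transcription has already been reduced to (line number, text) pairs, which skips everything that makes the real problem hard; the function names `proofreading_view` and `reading_view` are hypothetical and used only for illustration.

```python
SHY = "\u00ad"  # SOFT HYPHEN, as used in the transcription

def proofreading_view(lines):
    """One output line per source line, with the soft hyphen shown as a visible hyphen."""
    return [f"{n}: {text.strip().replace(SHY, '-')}" for n, text in lines]

def reading_view(lines):
    """Running text with soft hyphens resolved and long s modernized."""
    pieces = []
    for _, text in lines:
        text = text.strip()
        if text.endswith(SHY):
            pieces.append(text[:-1])   # no space: the word continues on the next line
        else:
            pieces.append(text + " ")  # ordinary line end acts as a word separator
    return "".join(pieces).strip().replace("ſ", "s")

lines = [
    (17, "that Women are natural\u00ad"),
    (18, "ly Incapable of acting Pru\u00ad"),
    (19, "dently, or that they are"),
]
print("\n".join(proofreading_view(lines)))
print(reading_view(lines))
```

The proofreading view keeps lineation and merely makes the hyphen visible; the reading view discards lineation, which is why it can get away without deciding which portion of the word moves.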

When I’m wrong, I can be really wrong

Famous last words: (figuratively, expressing sarcasm) A statement which is overly optimistic, results from overconfidence, or lacks realistic foresight.


Like many people, I make good predictions and bad ones. But sometimes I make truly horrible predictions. E.g., in early 1995 or thereabouts I infamously said something like “remember, the web is not our friend, it is our enemy”. Hard to be more wrong than that. But when it came to soft hyphens, it may turn out I was. You see, sometime during the early days of the Women Writers Project I asserted that software could read our documents (in which soft hyphens were encoded using first the &shy. Waterloo Script set symbol (essentially a variable), and later the &shy; SGML SDATA entity reference), and resolve the soft hyphen for creating full-text searchable word lists or reading output for undergraduates. I said this with a “how hard can it be?” attitude, I’m sure.[12]

Well, as will be discussed in the rest of this paper, it has turned out to be quite hard.

Seems easy …

At first blush, this does not seem like it would be a difficult programming task. Basically, when you find a soft hyphen, drop it and replace it with the first token from the next line. Correspondingly, in order to avoid duplicating the first token from the next line,[13] when you find a text node whose immediately preceding text node ended in soft hyphen, drop the first token. For example, consider the following passage.[14]

png image ../../../vol17/graphics/Bauman01/Bauman01-002.png
Or, in modern typography,
png image ../../../vol17/graphics/Bauman01/Bauman01-003.png
If this passage is transcribed as
<lb/>procuring a ſpeedy adminiſtration of Juſ&#xAD;
 <lb/>tice for the impartiall puniſhment of all
 <lb/>offenders, to the relief and comfort of the 
then to resolve the soft hyphen, it needs to be replaced by the first text token of the line that immediately follows.
png image ../../../vol17/graphics/Bauman01/Bauman01-004.png
In an XSLT context, this means that the template that matches the blue portion above needs to strip off the &#xAD; character and replace it with the red portion in the above; and the template that matches the text node that includes the red portion needs to strip off said red portion (since it has already been put into the output stream by the template that matched the blue portion).
png image ../../../vol17/graphics/Bauman01/Bauman01-005.png
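As a rough illustration of the finalUp idea on plain strings (not on XML nodes, and not the XSLT described here), the following Python sketch treats each typographic line as one string. It assumes a word is split across at most one line break; `resolve_final_up` is a hypothetical name.

```python
SHY = "\u00ad"  # SOFT HYPHEN

def resolve_final_up(lines):
    """Move the final portion of each split word up to the end of the previous line."""
    out = []
    carry = False  # True when the previous line ended in a soft hyphen
    for line in lines:
        text = line.strip()
        if carry:
            # The first token of this line completes the word left open above,
            # so append it there and drop it here to avoid duplicating it.
            head, _, text = text.partition(" ")
            out[-1] += head
        carry = text.endswith(SHY)
        if carry:
            text = text[:-1]  # strip the soft hyphen itself
        out.append(text)
    return out

lines = [
    "procuring a ſpeedy adminiſtration of Juſ\u00ad",
    " tice for the impartiall puniſhment of all",
    "offenders, to the relief and comfort of the ",
]
for line in resolve_final_up(lines):
    print(line)
```

Note that stripping each line sidesteps the whitespace question entirely, which is only safe because each string here is a whole line rather than one text node among several.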


That doesn’t sound too tough. Of course it is obviously a little harder than the diagrams above make it look, for they ignore the whitespace between the blue and red portions:

png image ../../../vol17/graphics/Bauman01/Bauman01-006.png

In order to handle that whitespace we need to

  1. ignore whitespace at the end-of-line when looking for soft hyphens

  2. ensure that the end-of-line whitespace is not inserted between the two parts of the broken word, either by stripping end-of-line whitespace off along with the &#xAD; character, or by carefully replacing only that character (such that the whitespace comes after the re-constituted word)

Of course we cannot just normalize whitespace using XPath’s built-in normalize-space() function, as in many cases leading and trailing space are important. E.g., given the following fragment,[15]

<p rend="first-indent(1)">How far the passages of scripture
    <lb/>she mentions were applicable to the
    <lb/>conduct of <persName>Mr B</persName> it is not our prov&#xAD;
    <lb/>ince to determine; but it is not 
using normalize-space() on ␣it␣is␣not␣our␣prov&#xAD;↲ would lead to “…conduct of Mr Bit is not our province to determine; …”, because the space in front of “it” would be lost.[16] [17] But even with this whitespace concern, this is not particularly difficult. And if that’s all there was to it, well, I wouldn’t be writing this paper.
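The difference between trimming and merely collapsing whitespace can be shown with a small Python sketch. The three helper names are hypothetical; `normalize_space` is meant only to mimic the behaviour of the XPath function on the four XML whitespace characters, not to reproduce any particular processor.

```python
import re

SHY = "\u00ad"          # SOFT HYPHEN
XML_WS = r"[ \t\r\n]"   # the whitespace characters XML recognizes

def normalize_space(s):
    """Mimic XPath normalize-space(): collapse inner runs and trim both ends."""
    return re.sub(XML_WS + "+", " ", s).strip(" ")

def collapse_space(s):
    """Collapse each whitespace run to one space, keeping leading and trailing space."""
    return re.sub(XML_WS + "+", " ", s)

def strip_soft_hyphen(s):
    """Remove a node-final soft hyphen together with any whitespace after it."""
    return re.sub(SHY + XML_WS + "*$", "", s)

# The text node that follows </persName> in the fragment above.
node = " it is not our prov\u00ad\n    "

print(repr("Mr B" + normalize_space(node)))                    # leading space lost
print(repr("Mr B" + strip_soft_hyphen(collapse_space(node))))  # leading space kept
```

Collapsing without trimming keeps the word boundary after “Mr B” intact, and the end-of-line whitespace is removed only as part of removing the soft hyphen.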

Further complications


The first thing to keep in mind is that XML constructs other than just the <lb> element may come between the text node that ends in SOFT HYPHEN and the text node that contains the representation of the continued word. Besides the obvious (XML comments and XML processing instructions), first and foremost the feature that forced the typographer to break the word in the first place may have been a page break, not a line break. Page breaks usually have other information associated with them (page numbers, catch words, signature marks, running titles) that are generally encoded where they lie such that they further interrupt the word that has been split. E.g.[18][19]

<p>Whoever may come out in any society as Mis&#xAD;
            <pb n="247"/>
            <milestone unit="sig" n="M4r"/>
            <mw type="pageNum">247</mw>
            <lb/>sionaries or teachers, whether here or at <placeName>Sierra-
            <lb/>Leone</placeName>, had need to guard against assimilating too
            <lb/>much in habit or sentiment with other <rs type="properAdjective">European</rs>
            <lb/>residents, …
Notice that included among those things that follow the soft hyphen is a text node (247) which is not part of the split word Missionaries. [20] (Note also that the hyphen glyph in Sierra-Leone looks exactly the same in the source as the hyphen glyph in Mis-sionaries, but the encoding asserts it is a hard hyphen even though it occurs at end-of-line. This is because other occurrences of Sierra-Leone have a hyphen, even when it is in the middle of a typographic line. The hard hyphen is probably best encoded with a HYPHEN character (U+2010), but is typically recorded with a HYPHEN-MINUS character (U+002D).)
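One way to picture the problem is a Python sketch that flattens a fragment to reading text while skipping page apparatus. It assumes un-namespaced elements, that <mw> is the only apparatus element present, and an abbreviated copy of the example above; the names `APPARATUS`, `main_text_chunks`, and `reading_text` are hypothetical. It produces running text only and does not preserve lineation.

```python
import re
import xml.etree.ElementTree as ET

SHY = "\u00ad"  # SOFT HYPHEN
# Assumption: elements whose text is page apparatus rather than main text.
APPARATUS = {"mw"}

def main_text_chunks(elem):
    """Yield text in document order, skipping the content of APPARATUS elements."""
    if elem.tag in APPARATUS:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from main_text_chunks(child)
        if child.tail:
            yield child.tail  # text after a skipped element is still main text

def reading_text(xml_fragment):
    """Flatten a fragment to running text with soft hyphens resolved."""
    flat = "".join(main_text_chunks(ET.fromstring(xml_fragment)))
    flat = re.sub(SHY + r"\s*", "", flat)  # drop soft hyphen and end-of-line whitespace
    return " ".join(flat.split())

SAMPLE = """<p>Whoever may come out in any society as Mis&#xAD;
  <pb n="247"/>
  <milestone unit="sig" n="M4r"/>
  <mw type="pageNum">247</mw>
  <lb/>sionaries or teachers</p>"""

print(reading_text(SAMPLE))
```

Without the skip list, the page number 247 would land between the two halves of Missionaries. The sample is cut short before Sierra-Leone on purpose: a hard hyphen at end-of-line would need separate handling that this sketch does not attempt.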

But sadly, it is not only the obvious and predictable (XML comments, XML processing instructions, line breaks, column breaks, and page breaks with their apparatuses) that may come between a soft hyphen and the final portion of a word. The most common culprits here are annotations and figures, but handwritten additions (either authorial or by a later hand) could also occur.

In the following example[21] an entire tipped-in plate sits between the soft hyphen and the final portion of the word.

          <label>I.</label> God spoke of Be-he-moth. What ani&#xAD;
          <pb n="facing 48"/>
          <pb n="facing 49"/>
            <figDesc>An engraving of a “behemoth” (resembles the
            elephant) standing on a grassy bank drinking from a body
            of water, vegitation in background</figDesc>
            <ab type="caption">To face page 49.</ab>
          <pb n="49"/>
          <milestone n="E5r" unit="sig"/>
          <mw rend="align(outside)" type="pageNum">49</mw>
          <lb/>mal is that?

Sibling of Overlap

In all of the examples so far, the initial and final portions of the word divided by a soft hyphen are at least at the same hierarchical level of encoding. That is (in XPath terms) from the text node that contains the soft hyphen, the final portion of the word is on the following-sibling:: axis, even if it is not the first text node, or even the first non-whitespace-only text node, on that axis.

However, we are not always so lucky. Here is a modern diplomatic transcription of a heading.[22]

png image ../../../vol17/graphics/Bauman01/Bauman01-007.png
The word Honourable is half in roman (or upright) type and half in italics. To account for this font shift, the encoding uses the TEI <hi> element and the global @rend attribute to indicate that while the entire heading is (in general) in italics, the first typographic line is highlighted by being in roman typeface.[23]
<head rend="slant(italic)"><hi rend="slant(upright)">To all vertuous Ladies Honou&#xAD;</hi>
    <lb/>rable or Worſhipfull, and to all other
    <lb/>of <persName rend="slant(upright)">He<vuji>u</vuji>ahs</persName> ſex fearing God, and lo<vuji>u</vuji>ing their
    <lb/><vuji>i</vuji>uſt reputation, grace and peace through
    <lb/><persName>Chriſt</persName>, to eternall glory.

It would be reasonable to think this phenomenon pernicious, not particularly important, and rare. But a different manifestation of the same hierarchical problem is anything but. When a book is damaged (e.g., by a coffee spill, or torn or mouse-eaten edges of pages), it is common for the damage to be on only one side of the page. Such damage will cause a problem reading either the initial portion (if it is on the right edge) or the final portion (if it is on the left edge) of a word split across a line break.

In the following example,[24] the encoder has indicated that she cannot read a few characters at the beginning of each of four lines due to damage, but that either from context alone or from looking at a different edition of the same book she has been able to surmise what must have been printed.

<lb/>not your own. It is a miſe&#xAD;
  <lb/><supplied reason="damaged">ra</supplied>ble thing for any Wo&#xAD;
  <lb/><supplied reason="damaged">ma</supplied>n, though never ſo great,
  <lb/><supplied reason="damaged">not</supplied> to be able to teach her
  <lb/><supplied reason="damaged">ſerv</supplied>ants; …
This is a particularly thorny case, because in order to resolve the soft hyphen, software will have to recognize that not only should the following <supplied> element be moved from the beginning of its line to the end of the previous line (replacing the &#xAD; character), but also the first token of the text node immediately following the <supplied> needs to move with it.
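The point that the final portion may span several nodes can be sketched in Python by treating the content after the line break as a sequence of text chunks in document order (here, the content of <supplied> followed by the text node after it). The function name `final_portion` is hypothetical, and the sketch assumes any leading whitespace has already been dealt with.

```python
import re

def final_portion(chunks):
    """Return the characters up to the first whitespace, crossing chunk boundaries."""
    parts = []
    for chunk in chunks:
        match = re.search(r"\s", chunk)
        if match:
            parts.append(chunk[:match.start()])  # the token ends inside this chunk
            break
        parts.append(chunk)                      # whole chunk belongs to the token
    return "".join(parts)

# Chunks following the first two soft hyphens in the example above.
print(final_portion(["ra", "ble thing for any Wo\u00ad\n"]))
print(final_portion(["ma", "n, though never ſo great,"]))
```

This only identifies which characters belong to the final portion. Actually moving them, together with the <supplied> markup that wraps some of them, is the hard part and is not shown.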

Text that is not there

If the text that is damaged cannot be read at all, the TEI Guidelines recommend using the <gap> element. While the <gap> element may have content, if it does that content does not provide a transcription of the source text, but rather provides a description of or information about what was not transcribed from the source text; and more often than not <gap> is empty. In the following example,[25] the encoder is asserting that she could not read a significant portion of the last line.

<lb/>And alſo ge<vuji>u</vuji>eth them grace to <vuji>v</vuji>ſe in his
  <lb/>glorye, po<vuji>u</vuji>ertie, ignomine, infamie, in&#xAD;
  <lb/>firmitie, with all ad<vuji>u</vuji>erſitie, and the pri&#xAD;
  <lb/><gap extent="over one third of the line" reason="flawed-reproduction"/>tes,
     e<vuji>u</vuji>en to the death<unclear>,</unclear>
When a <gap> occurs after a soft hyphen, but before any non-ignorable content, we have a case for which it is particularly difficult to resolve the soft hyphen; thankfully, it is also a case for which it is particularly unimportant to do so.

It is difficult to do so for two main reasons. First, because (unlike most other empty elements we would encounter after a soft hyphen: <cb>, <lb>, <milestone>, and <pb>) the <gap> represents content, it would have to be moved as if it were the first token of content. Second, because a <gap> may represent less than a single word, a single word, or more than a single word, the software will need to parse its attributes (and perhaps content) to determine whether or not the first token of an immediately following text node (that does not start with whitespace) needs to be moved along with the <gap>.

It is unimportant because under no circumstances can the soft hyphen resolution process meet the goal of reconstituting the entire word. Whether for spell checking, for indexing for search, or for generating an easy-to-read display, having pri<gap extent="rest of word"/><lb/><gap extent="roughly one third of the line minus roughly one half of the first word"/> is no better than what you had to begin with.

Choosing the shy

The TEI uses a parallel elements mechanism for recording a variety of editorial interventions. Here I will discuss the correction of apparent errors (<choice>, <sic>, and <corr>), but the same issues hold true for the simple expansion of abbreviations (<choice>, <abbr>, and <expan>), the substitution of one bit of text for another (<subst>, <del>, and <add>), the regularization of archaic or eccentric spelling or typography (<choice>, <orig>, and <reg>), and the simultaneous encoding of multiple variant witnesses (<app>, <rdg>, and <lem>).

The following example[26] demonstrates two errors in one title, each of which is directly involved in the use of soft hyphens. I will discuss the second error here, and the first one in the next subsection.

png image ../../../vol17/graphics/Bauman01/Bauman01-008.png

If you look carefully at the end of the 3rd line, you will see that the soft hyphen character is not a hyphen at all. In this reproduction you may find it hard to figure out what it is, but in other editions (I am told) it is more obvious that the character there is a period.

Presuming the encoding project would like to record both the error as it appears in the source text and a modern correction of it, there are two likely TEI encodings of this: letter-level and word-level.

<lb/>Bench</placeName>, for the releaſing of all pri<choice><sic>.</sic><corr>&#xAD;</corr></choice>
<lb/>ſoners for Debt, according to
The above letter-level encoding makes resolving the soft hyphen potentially quite a bit more difficult. The difficulty lies in the fact that if we were to apply the simple algorithm discussed above — namely to replace the soft hyphen character with the first token of the following line, we would suddenly be asserting that the partial word soners was somehow a correction of a period:

<lb/>Bench</placeName>, for the releaſing of all pri<choice><sic>.</sic><corr>ſoners</corr></choice>
<lb/>for Debt, according to
In many, if not the vast majority, of situations this would not really be a problem. When performing soft hyphen resolution for the purpose of generating word lists or indices, we generally do not care about simultaneously handling both the source text and the editorial correction. We usually just want the corrected version, in which case the entire <choice> construct is itself resolved to the content of <corr>. Whether this is done before or after soft hyphen resolution, we end up with the desired words.
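For the corrected-text-only case, a Python sketch can resolve each <choice> to the content of its <corr> while flattening, and then resolve soft hyphens on the result. It assumes un-namespaced elements and that every <choice> holds a <sic>/<corr> pair; `corrected_chunks` and `corrected_reading_text` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

SHY = "\u00ad"  # SOFT HYPHEN

def corrected_chunks(elem):
    """Yield text in document order, taking only <corr> from each <choice>."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tag == "choice":
            for alternative in child:
                if alternative.tag == "corr":
                    yield from corrected_chunks(alternative)
        else:
            yield from corrected_chunks(child)
        if child.tail:
            yield child.tail

def corrected_reading_text(xml_fragment):
    """Flatten to corrected running text, then resolve soft hyphens."""
    flat = "".join(corrected_chunks(ET.fromstring(xml_fragment)))
    flat = re.sub(SHY + r"\s*", "", flat)
    return " ".join(flat.split())

SAMPLE = """<p><lb/>for the releaſing of all pri<choice><sic>.</sic><corr>&#xAD;</corr></choice>
<lb/>ſoners for Debt, according to</p>"""

print(corrected_reading_text(SAMPLE))
```

Because the <choice> is resolved while flattening, the soft hyphen supplied in <corr> behaves exactly like one typed directly in the text, and the erroneous period in <sic> never reaches the output.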

In rare cases we might be interested in the uncorrected source text, in which case soft hyphen resolution software has to be smart enough to perform the resolution on the text in <sic> based on the content of <corr>. In theory a project may want to perform soft hyphen resolution in both the uncorrected source and the editorially corrected text. I do not address this particular situation here, as I have never even heard this idea entertained.

Word-level correction is a bit easier for soft hyphen resolution, as the simple algorithm yields a perfectly acceptable result.

<lb/>Bench</placeName>, for the releaſing of all <choice>
</choice> for Debt, according to
However, it has a different drawback: with this system counting lines on the page — a common and important task — is harder, in that the counter has to know that the choice/sic/lb and the choice/corr/lb together need to be counted as only one line break.

Shy of the choice

Anything more than a cursory or rapid read of the first two lines reveals an egregious error, probably by the typesetter: the word commanders is spelled commanmanders, as the medial letters man are not only in the initial portion of the word, but are also repeated after the soft hyphen. Multiple possible encodings jump to mind. The editor may consider the man at the end of the first line as the correct one, and thus the man at the beginning of the second line as the one in error; or vice-versa. And in each case the encoder may use letter-level or word-level encoding.

<titlePart type="second">Alſo a Petition of divers Comman&#xAD;
<lb/><choice><sic>man</sic><corr/></choice>ders, priſoners in the <placeName>Kings  

<titlePart type="second">Alſo a Petition of divers <choice>
</choice>, priſoners in the <placeName>Kings    

<titlePart type="second">Alſo a Petition of divers Com<choice><sic>man</sic><corr/></choice>&#xAD;
<lb/>manders, priſoners in the <placeName>Kings 

<titlePart type="second">Alſo a Petition of divers <choice>
</choice>, priſoners in the <placeName>Kings    
Furthermore, when using word-level encoding, project editorial policy may allow elision of the soft hyphen and line break in the corrected version:

<titlePart type="second">Alſo a Petition of divers <choice>
</choice>, priſoners in the <placeName>Kings    

Saving graces

So we see that there are quite a few complications to soft hyphen resolution. Luckily, at least at the WWP, there are a few encoding practices we have put in place that ease the process, rather than interfere.

  • Soft hyphens are consistently encoded

    We never encode soft hyphen with anything else, ever.[27] That is (as demonstrated in section “Choosing the shy”), a U+00AD character is encoded at every soft hyphen even if the source text erroneously has a different character, or indeed no character at all, to represent the soft hyphen.

  • U+00AD is unique to this purpose

    We never use U+00AD for anything else, ever. This is a slight exaggeration, but foregrounds the important point. On rare occasion an actual U+00AD character will creep into the discussion about the encoding of a file in its metadata, e.g., in a change log entry that discusses fixing a soft hyphen. But this usage never occurs in the content. Furthermore, an actual soft hyphen never occurs in metadata. Thus all U+00AD within /TEI/text are soft hyphens, all U+00AD within /TEI/teiHeader are discussions about soft hyphen characters.

  • U+00AD is always in element content

    At the WWP our encoding is such that any U+00AD in an attribute value is in error; and, for this purpose, any U+00AD in an XML comment or processing instruction (or the <teiHeader>) is ignorable.

  • Once you’ve seen one white space, you’ve seen ’em all

    As with many text encoding projects, the WWP cares very much about the presence or absence of most whitespace in the encoded XML file, but we don’t care at all about the details of said whitespace, i.e. how many or which whitespace characters occur. We would consider the following three examples entirely equivalent (although obviously, humans prefer to work on the first).

        <byline>To the tune of <title>Don’t Cry for me Argentina</title> by Andrew Lloyd Webber and Tim Rice</byline>
        <l>Don’t cry for me Charles Goldfarb,</l>
        <l>The truth is I do not miss them,</l>
        <l>All of those features,</l>
        <l>Because we’re lazy,</l>
        <l>To save us typing,</l>
        <l>They drove us crazy.</l>
    <lg><byline>To the tune
      of <title>Don’t Cry for me
        Argentina</title> by Andrew
      Lloyd Webber and Tim Rice</byline>    
    <l>        Don’t cry for me Charles Goldfarb,
    </l><l>The truth is I do not miss them,
    </l><l>        All of those features,
           </l><l> Because we’re lazy,
           </l><l> To save us typing,
           </l><l> They drove us crazy.
    </l> </lg>
    <lg><byline>To the tune of <title>Don’t Cry for me Argentina</title> by Andrew Lloyd Webber and Tim
      Rice</byline> <l> Don’t cry for me Charles Goldfarb, </l><l> The truth is I do not miss them, </l><l>
      All of those features,</l><l>Because we’re lazy,</l><l>To save us typing,</l><l>They drove us crazy.</l></lg>

The results of these encoding practices are that it is trivially easy to find all the occurrences of soft hyphens that require resolution (without any false positives), and we can regularize whitespace (even if we can’t use the normalize-space() function; see section “Whitespace”), making tokenization and reconstitution of strings easier.
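Under these practices, an audit for soft hyphens needing resolution can be sketched in a few lines of Python. The sketch assumes an un-namespaced document with <text> as a direct child of <TEI> (a real TEI file would need the TEI namespace in the lookup); `audit_soft_hyphens` is a hypothetical name. With the default ElementTree parser, comments and processing instructions are not kept in the tree, so they are ignored automatically.

```python
import xml.etree.ElementTree as ET

SHY = "\u00ad"  # SOFT HYPHEN

def audit_soft_hyphens(root):
    """Count soft hyphens in the content of <text>; list attributes that contain one."""
    body = root.find("text")
    in_content = sum(chunk.count(SHY) for chunk in body.itertext())
    in_attributes = [
        (element.tag, name)
        for element in root.iter()
        for name, value in element.attrib.items()
        if SHY in value
    ]
    return in_content, in_attributes

DOC = """<TEI><teiHeader><change>fixed a &#xAD; on p. 3</change></teiHeader>
<text><p rend="a&#xAD;b">natural&#xAD;
<lb/>ly Incapable of acting Pru&#xAD;
<lb/>dently</p></text></TEI>"""

print(audit_soft_hyphens(ET.fromstring(DOC)))
```

The U+00AD in the header's change log is not counted, the two in the paragraph content are, and the one in the rend attribute is reported as an encoding error rather than as something to resolve.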


Early Days

Roughly speaking, in the 1980s the WWP used Waterloo Script; in the early 1990s we used Waterloo GML; in the mid 1990s we used Waterloo GML using pointy brackets (< and >) instead of the default tag delimiters : and .; in the late 1990s we used SGML, but still did most processing with Waterloo Script; and in the early 21st century we switched to XML.

In mid-1991 the WWP embarked on a collaboration with Oxford University Press to publish a series of books based on our textbase files. Thus I went to work on a program to generate camera-ready PostScript output from our pseudo-SGML input, using Waterloo Script. I believe this was the first time we actually wrote code to resolve our soft hyphens. The snippet of code below is from a subroutine of that program written in 1991-10. The &*txt0. set symbol contains a line of text with each SPACE (U+0020) converted to a COMMERCIAL AT (U+0040) character. In the input files at that time a soft hyphen was encoded just like a hard hyphen, i.e. using the HYPHEN-MINUS character (U+002D).

. .*
. .* check the first character; if it is a blank ("@") AND our "we
. .* chopped a hyphen off last time we appended" flag is set, chop off
. .* the blank.
. .*
. .if "&'substr( &*txt0., 1, 1 )" = "@" & &nw_shyl. = 1
.   .sr *txt1 = &'substr( &*txt0., 2 )
. .el .sr *txt1 = &*txt0.
. .*
. .*
. .* parse off the last character; if it is the CONTinuation character,
. .* chop it off.  (For some reason in this context Script treats it as
. .* a text character.)
. .*
. .sr *len = &'length( &*txt1. )
. .sr *last = &'substr( &*txt1., &*len., 1 )
. .*
. .if "&*last." = "&$cont." .sr *txt2 = "&'substr(&*txt1.,1,&*len.-1 )"
. .el                       .sr *txt2 = "&*txt1."
. .*
. .*
. .* if there are still characters left, check the last one; if it is a
. .* hyphen, chop it off (Script will not treat it as a soft hyphen
. .* here!), and set a flag
. .*
. .sr *len = &'length( &*txt2. )
. .if &*len. gt 0 .do begin
.   .sr *last = "&'substr( &*txt2., &*len., 1 )"
.   .*
.   .if "&*last." = "&shy." .th .do begin
.     .sr *txt3 = "&'substr(&*txt2., 1, &*len.-1 )"
.     .sr nw_shyl = 1
.     .do end
.   .el .do begin
.     .sr *txt3 = &*txt2.
.     .sr nw_shyl = 0
.     .do end
.   .do end
. .*

My vague recollection is that the above code worked reasonably well, but an e-mail I sent in 1994-03 makes it clear it always had problems: “hyphens [are] top prio[rity], so I will be tackling that … My thinknig rightg now is that I've spent years trying to figure out how to get SCRIPT to handle this w/o success. But it would be trivial to massage the original file w/ Perl (or maybe even BBEdit) in order to remove soft hyphens, at least in simple <lb> case, and probably others.” I do not recall what the problems were. My vague recollection is this system had the capability to handle section “’Twixt” problems well, because this routine was not called on strings that were not part of the main text flow; i.e. it was not used for page apparatus, annotations, figure descriptions, etc.
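
The per-line logic of the Script subroutine above can be restated in a more familiar notation. What follows is only a minimal sketch in Python, not the WWP's code: the function names are hypothetical, the value chosen for the continuation character is a placeholder (the paper does not say what &$cont. held), and it assumes, as the text states, that spaces have already been converted to @ and that a soft hyphen is a plain HYPHEN-MINUS.

```python
SHY = "-"     # soft hyphens were encoded as HYPHEN-MINUS at the time
CONT = "\\"   # placeholder for Script's continuation character (assumption)


def resolve_line(txt0, chopped_last_time):
    """Return (text, flag) for one input line whose blanks are already '@'."""
    # 1. Drop a leading blank if the previous line lost its trailing hyphen.
    if txt0.startswith("@") and chopped_last_time:
        txt1 = txt0[1:]
    else:
        txt1 = txt0
    # 2. Drop a trailing continuation character.
    txt2 = txt1[:-1] if txt1.endswith(CONT) else txt1
    # 3. Nothing left: leave the flag as it was (mirrors the original,
    #    which only touches the flag when characters remain).
    if not txt2:
        return txt2, chopped_last_time
    # 4. Drop a trailing hyphen and remember that we did so.
    if txt2.endswith(SHY):
        return txt2[:-1], True
    return txt2, False


def resolve_lines(lines):
    """Run resolve_line over successive lines, threading the flag through."""
    flag = False
    pieces = []
    for line in lines:
        piece, flag = resolve_line(line, flag)
        pieces.append(piece)
    return pieces
```

Because soft and hard hyphens were indistinguishable in the input of that era, a sketch like this (like the original) also removes a hard hyphen that happens to fall at the end of a line.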

Special-purpose: Perl version

It is clear from an e-mail exchange from mid 1994-03 that I wrote a special-purpose MacPerl program at that time to handle the simple soft hyphen cases, i.e. when an end-of-line hyphen was followed by a breaking element (<pgbk>, <lb>, or <cl>). I have not been able to find that original Perl program, but I believe that the soft hyphen handling portion of a later routine was based on it. In the following snippet, the entire input file is stored as one long string in the variable $in.

 $in =~ s,&shy;\s*<lb[^>]*>(<anchor[^>]*>)([^ \t\r\n<]*),\2\1,igs;
 $in =~ s,&shy;\s*(<anchor[^>]*>)\s*<lb[^>]*>([^ \t\r\n<]*),\2\1,igs;
 $in =~ s,&shy;\s*<lb[^>]*>,,igs;

This snippet of code does not handle <pgbk> or <cl> elements, because by the time this program, based on the original MacPerl program, was written they no longer existed in our encoding system. It does handle an empty <anchor> element, whether it is before (line 2) or after (line 1) the <lb> that follows the soft hyphen. I can only guess at the reason why it does not handle <pb>, the replacement for <pgbk>: handling the section “’Twixt” problem would be too difficult.
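
For readers more at home in Python than Perl, the three substitutions can be sketched with the standard re module. This is a transliteration under the same assumptions as the Perl (the whole file is one string, the soft hyphen is written as the entity reference &shy;, and <lb> and <anchor> are empty elements); the function name resolve_simple is hypothetical.

```python
import re

# i = ignore case, s = dot matches newline, as in the Perl /igs modifiers
# (Python's re.sub is already global).
FLAGS = re.IGNORECASE | re.DOTALL


def resolve_simple(text):
    """Resolve soft hyphens that are followed by <lb>, with or without <anchor>."""
    # <anchor> right after the <lb>: pull the rest of the word up, anchor after it
    text = re.sub(r"&shy;\s*<lb[^>]*>(<anchor[^>]*>)([^ \t\r\n<]*)",
                  r"\2\1", text, flags=FLAGS)
    # <anchor> before the <lb>: same result, and the <lb> is dropped
    text = re.sub(r"&shy;\s*(<anchor[^>]*>)\s*<lb[^>]*>([^ \t\r\n<]*)",
                  r"\2\1", text, flags=FLAGS)
    # plain case: soft hyphen, optional whitespace, <lb>
    text = re.sub(r"&shy;\s*<lb[^>]*>", "", text, flags=FLAGS)
    return text
```

As in the Perl, the <lb> is deleted rather than moved, so the lineation of the source is lost in the output.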

Special-purpose: CMS Pipelines

However, we found the MacPerl program to be too cumbersome and slow.[28] Thus a few days later (1994-03-19) I wrote a CMS Pipelines version of the same command. The program is written in Rexx, but all the work is done by a single call to the CMS pipe command. That main call follows.

 ** Now do the real work in one big pipeline; it would be fast
 ** except that the SPILL stage is written in Rexx. Oh well.
 'pipe (long endchar #) <' fn ft fm, /* read file in */
    '| nfind <pgbk',                 /* nuke page-break lines */
    '| join * /@/',                  /* now 1 line, remembering \n's  */
    '| split after string />/',      /* chop into reasonable size parts */
    '| change /-@<lb/<NUKEME/',      /* mark hyphen-EOL-<lb             */
    '| change /-@<cl>/<NUKEME>/',    /* mark hyphen-EOL-<cl>            */
    '| change /-@<cl /<NUKEME /',    /* mark hyphen-EOL-<cl_            */
    '| join *',                      /* back into 1 line                */
    '| split before string /</',     /* now chop up such as to separate */
    '| split after string />/',      /*   tags onto lines of their own  */
    '| nfind <NUKEME',               /* and kill marked records         */
    '| t: find <',                   /* take only the tags              */
    '| change / /%/',                /* and protect internal blanks     */
    '| a: faninany',                 /* get back non-tags               */
    '| join *',                      /* back into 1 line, again         */
    '| split before string /@/',     /* cut into pieces at orginal \n's */
    '| change /@//',                 /* nuke our \n markers */
    '| spill 153 sep|',              /* make sure not too long          */
    '  change /%/ /',                /* restore protected blanks-in-tags*/
    '| >' ofid,                      /* write to output file */
    '# t:',
    '| a:'

By the time this command is issued, the input file (fn ft fm) has been tested to ensure it has no @ or % characters, and the name of the output file (ofid) has been set up. The spill stage (which was added a month or so later) is not a standard CMS Pipeline stage, but rather was a pipeline stage written at Brown by James Mathiesen.[29] Its purpose was to Spill lines at a particular column … to wrap one-paragraph-per-line input into a wrapped text. This routine was clearly written back when soft hyphens were encoded just as hard hyphens, i.e. using the HYPHEN-MINUS character (U+002D).

Like the Perl program before it, this program was only designed to handle the simple soft hyphens that were followed immediately by a <cl>, <lb>, or <pgbk> element. These were, of course, the vast majority of cases. The problem described in section “’Twixt” is handled, at least for page breaks, in a novel way: the entire record is simply discarded. This worked because during this era it was policy to record all details about a page break on a single line.

It is worth mentioning that the program can match <lb> elements just by searching for the first three characters, which will always be <lb. However, the same shorthand does not work for <cl> elements, because the first characters of the element name are not unique: there were also <close>, <closer>, <closing>, and <clbk> elements at different times in our history.
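
The net effect of the pipeline on the text can be approximated in a few lines of Python. This is a sketch of the effect only, not of the pipeline's stages: it assumes that nfind drops records beginning with the given string, it omits the spill stage and the protection of blanks inside tags, and the function name is hypothetical. It does show why <cl needs a more careful pattern than <lb.

```python
import re

# hyphen, end of line, then <lb...>, <cl> or <cl ...>; a bare "<cl" prefix
# would also catch <close>, <closer>, <closing> and <clbk>
HYPHEN_EOL_TAG = re.compile(r"-\n<(?:lb[^>]*|cl(?: [^>]*)?)>")


def resolve_pipeline_style(lines):
    """Drop page-break lines, then delete hyphen + newline + <lb>/<cl> tag."""
    kept = [ln for ln in lines if not ln.startswith("<pgbk")]
    text = "\n".join(kept)
    text = HYPHEN_EOL_TAG.sub("", text)
    return text.split("\n")
```

For example, given the lines `Jus-`, `<pgbk n=3>`, `<lb>tice for the`, the page-break record is discarded first, after which the hyphen, the line end, and the <lb> are removed together, leaving `Justice for the`.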


And so it went: for years the WWP limped by on various hacks to resolve soft hyphens. Then in 2011 we began moving our publication to XTF, a system built almost entirely on XSLT.[30] Thus we attempted to resolve soft hyphens in XSLT.

First try: text nodes

My first crack at this was just a simple attempt to implement the algorithm loosely described in section “Seems easy …”. One template matched text()[contains(.,'&#xAD;')] and grabbed the first token of the next (i.e., closest following) non-all-whitespace text node that was not inside an <mw> element. Another template matched that closest following non-all-whitespace text node that was not inside an <mw>, and dropped the first token before spitting it into the output stream.

This code became quite thorny when I added the conditions to handle some of the complications mentioned above. But it is even thornier than you might imagine because any given text node may fall into both categories: it may end in &#xAD; and may also immediately follow a line that ends in &#xAD;. I was ending up with code that almost worked, but was horrible to read and maintain. Debugging was a nightmare.

Second try: decorated elements around those text nodes

Eventually it occurred to me that XSLT’s forte is processing trees of element nodes and their attributes, not text nodes. A large part of the problem I was having was needing to repeat a test performed in template A so that template B could figure out what template A had thought of a given node. Instead, if I processed in separate passes, template A could record what it thought of each node so that template B, running at a later pass, would know. Of course, one needs a place to record this information, and a text node doesn’t really have any convenient place.

So a first pass wraps all text nodes, other than those that need to be ignored anyway, with a temporary element, <pcdata>. This element is given attributes that record useful information about the text node for later examination. E.g., whether or not it ends in a soft hyphen, whether or not it starts with whitespace, the first token parsed off, etc. The following is what the example from section “Seems easy …” looks like after the text nodes have been wrapped.

  <pcdata xml:id="d2t6"
          restWords="a ſpeedy adminiſtration of Juſ&#xAD; ">procuring a ſpeedy adminiſtration of Juſ&#xAD;
  <pcdata xml:id="d2t8"
          restWords="for the impartiall puniſhment of all ">tice for the impartiall puniſhment of all
  <pcdata xml:id="d2t10"
          restWords="to the relief and comfort of the ">offenders, to the relief and comfort of the

A second pass further decorates the new <pcdata> elements with attributes that record information about other nodes. For example, whether or not a text node immediately follows a text node that ended in a soft hyphen is recorded on a new attribute @immedFollowsShy that is added to its wrapper <pcdata> element.
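
The idea of the two decorating passes can be illustrated outside XSLT. The following is a minimal Python sketch, not the WWP stylesheet: each text node is assumed to arrive as a plain string, already filtered down to the pertinent (non-ignored, non-all-whitespace) nodes; a dictionary stands in for the <pcdata> wrapper; and apart from restWords and immedFollowsShy the key names are hypothetical.

```python
SHY = "\u00AD"  # SOFT HYPHEN


def wrap(text):
    """First pass: record facts about one text node, looked at in isolation."""
    tokens = text.split()
    return {
        "text": text,
        "startsWithSpace": text[:1].isspace(),
        "endsInShy": text.rstrip().endswith(SHY),
        "firstToken": tokens[0] if tokens else "",
        "restWords": " ".join(tokens[1:]),
    }


def decorate(nodes):
    """Second pass: record facts that depend on the neighbouring nodes."""
    prev_ends_in_shy = False
    for node in nodes:
        node["immedFollowsShy"] = prev_ends_in_shy
        prev_ends_in_shy = node["endsInShy"]
    return nodes
```

The point of the design is that the second pass never re-examines the text itself: it reads only what the first pass recorded, which is exactly the repetition of tests that the single-pass attempt could not avoid.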

Given the easy access to information now associated with each pertinent text node, it should be much easier to resolve the soft hyphens by moving the first token following a soft hyphen to the end of the <pcdata> containing the soft hyphen (replacing the soft hyphen itself). And, of course, a final pass would clean up by removing the temporary <pcdata> elements.

And in fact, I did find it easier to think about and handle the various tests needed to see which bits should be moved to replace the soft hyphen. Nonetheless, I found this a daunting task and never got a fully working version.

Third try: decorated elements around tokens

Eventually it occurred to me that one of the problems I was facing was the difficulty presented by a single <pcdata>-wrapped text node that both immediately follows a soft hyphen and ends in a soft hyphen; and that another was that keeping track of which text nodes contained multiple tokens and which did not was, although not particularly difficult, an unneeded layer of complexity.

There is no such thing (in English) as a word that is long enough to wrap around more than one line. That is, a single token will never both immediately follow a soft hyphen and end with a soft hyphen. (Note to self: see FLWs.) Thus my current approach is to wrap each pertinent text token in a temporary, decorated (i.e., information-rich) element.

 <tmp:tok xml:id="d2t6.1" endsInShy="false" tmp:spaceBefore="false">procuring</tmp:tok>
 <tmp:tok xml:id="d2t6.2" endsInShy="false">a</tmp:tok>
 <tmp:tok xml:id="d2t6.3" endsInShy="false">ſpeedy</tmp:tok>
 <tmp:tok xml:id="d2t6.4" endsInShy="false">adminiſtration</tmp:tok>
 <tmp:tok xml:id="d2t6.5" endsInShy="false">of</tmp:tok>
 <tmp:tok xml:id="d2t6.6" endsInShy="true">Juſ</tmp:tok>
 <tmp:tok xml:id="d2t8.1" endsInShy="false" tmp:spaceBefore="false">tice</tmp:tok>
 <tmp:tok xml:id="d2t8.2" endsInShy="false">for</tmp:tok>
 <tmp:tok xml:id="d2t8.3" endsInShy="false">the</tmp:tok>
 <tmp:tok xml:id="d2t8.4" endsInShy="false">impartiall</tmp:tok>
 <tmp:tok xml:id="d2t8.5" endsInShy="false">puniſhment</tmp:tok>
 <tmp:tok xml:id="d2t8.6" endsInShy="false">of</tmp:tok>
 <tmp:tok xml:id="d2t8.7" endsInShy="false">all</tmp:tok>
 <tmp:tok xml:id="d2t10.1" endsInShy="false" tmp:spaceBefore="false">offenders,</tmp:tok>
 <tmp:tok xml:id="d2t10.2" endsInShy="false">to</tmp:tok>
 <tmp:tok xml:id="d2t10.3" endsInShy="false">the</tmp:tok>
 <tmp:tok xml:id="d2t10.4" endsInShy="false">relief</tmp:tok>
 <tmp:tok xml:id="d2t10.5" endsInShy="false">and</tmp:tok>
 <tmp:tok xml:id="d2t10.6" endsInShy="false">comfort</tmp:tok>
 <tmp:tok xml:id="d2t10.7" endsInShy="false">of</tmp:tok>
 <tmp:tok xml:id="d2t10.8" endsInShy="false">the</tmp:tok>

At the time of this writing, the program that uses this method runs, and handles the simple case well. It also has the added advantage that it will resolve soft hyphens in either direction: finalUp, moving the final portion of the word up to replace the soft hyphen, or initDown, moving the initial portion of the word down to the beginning of the next line. It still has quite a few bugs, though; in particular, it does not handle the problem pointed out in section “Sibling of Overlap” well at all. However, I am still holding out hope.
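
For the simple case only, the token-based approach and its two directions can be sketched as follows. This is an illustration in Python, not the XSLT program described above: lines are assumed to be plain strings with no intervening markup, the function names are hypothetical, and the sketch relies on the assumption stated earlier that no token both follows and ends in a soft hyphen.

```python
SHY = "\u00AD"  # SOFT HYPHEN


def tokenize(lines):
    """Wrap every token in a small record; one list of records per line."""
    return [
        [{"text": w.rstrip(SHY), "endsInShy": w.endswith(SHY)} for w in line.split()]
        for line in lines
    ]


def resolve(lines, direction="finalUp"):
    """Join words split by a soft hyphen, moving one half to the other line."""
    toks = tokenize(lines)
    for upper, lower in zip(toks, toks[1:]):
        if not (upper and lower and upper[-1]["endsInShy"]):
            continue
        if direction == "finalUp":
            # move the final portion up to replace the soft hyphen
            upper[-1]["text"] += lower.pop(0)["text"]
            upper[-1]["endsInShy"] = False
        else:
            # initDown: move the initial portion down to start the next line
            lower[0]["text"] = upper.pop()["text"] + lower[0]["text"]
    return [" ".join(t["text"] for t in line) for line in toks]
```

With the two lines `of Juſ` + SOFT HYPHEN and `tice for the`, finalUp yields `of Juſtice` and `for the`, while initDown yields `of` and `Juſtice for the`.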


[NWEW] Edward Phillips, The New World of English Words, or, a General Dictionary, 4th edition; 1678.

[OED] The Oxford English Dictionary, online edition, accessed 2016-04-22.

[FLWs] famous last words in Wiktionary, accessed 2016-04-21.

[SH] soft hyphen in Wiktionary, accessed 2016-04-22.

[SHHP] Soft hyphen (SHY) – a hard problem?, accessed 2016-04-22.

[TEI] Burnard, Lou and Syd Bauman, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.0.0, 2016-03. TEI Consortium. Accessed 2016-04-22.

[1] E.g.,

soft hyphen: (computing, typography) A generally invisible text character marking a point where hyphenation can occur without forcing a line break in an inconvenient place if the text is later re-flowed.


[2] From SHHP, an excellent resource by Jukka Korpela. That said, I’m a little concerned because Mr. Korpela quotes clause 6.3.3 of ISO 8859-1. I have not yet gotten my hands on a copy of ISO 8859-1:1987 or earlier, but the 1998 edition does not seem to have a clause 6.3.3.

It is worth noting here that the hard problem Mr. Korpela discusses in his paper is not at all the same difficult problem I am trying to tackle in the current paper.

[3] Mémoire sur l'éclairage et le balisage des côtes de France, Volume 2 by Léonce Reynaud. Sadly, it is not about markup technologies. (Not surprising, though: it was published in 1864.)

[4] XSLT 2.0 and XPath 2.0, 4th edition.

[6] From Ready Player One by Ernest Cline. 1st edition, paperback, ISBN 978-0-307-88744-3, page 67.

[7] All encoded examples use TEI unless otherwise specified.

[8] From The Senate Defers to the N.R.A., The New York Times, page A24 (editorials), 2016-03-24. Although you can read the editorial online, it does not have the same lineation as the printed National Edition.

[9] Formerly the Brown University Women Writers Project, now the Women Writers Project, which is part of the Digital Scholarship Group in the Northeastern University Library.

[10] Here, as elsewhere, the hyphen glyph in the source is transcribed as a numeric character reference because an actual SOFT HYPHEN does not show up in a web browser.

[11] Copied from lines 697–705 of the WWP transcription of Mary Astell’s 1694 book A Serious Proposal to the Ladies as of revision r27555, last updated 2015-12-30; I then added the @n attributes to make talking about the lines easier.

[12] In my own defense, by summer 1994 I had posted to the internal WWP list that this was a difficult problem. Re: Missing hyphens and spaces, posted 1994-07-21 to WWPTAG-L.

[13] The re-peat Pete, identi-cal Cal, or duplic-ate 8 problem.

[14] From page 5 of The petition of the Jewes for the repealing of the act of Parliament for their banishment out of England by Johanna Cartwright (with her son Ebenezer Cartwright), 1648. The image is from the Hathi Trust page image. The transcription is copied from the WWP transcription of the same edition as of revision r27244, last updated 2015-11-24.

[15] Adapted from the WWP transcription of Memoir of Mrs. Chloe Spear, a native of Africa, who was enslaved in childhood and died in Boston, January 3, 1815...aged 65 years by A lady of Boston, as of revision r27576, last updated 2016-01-04.

[16] One might imagine that a processor should know that a <persName> is always a word unto itself, and thus should be followed by whitespace. I.e., that the presence of the <persName> element should cause whitespace around its content, thus giving us Mrs. B it as opposed to Mrs. Bit. But this turns out not to be the case. Personal names are often immediately followed by a non-whitespace character. While these characters are most commonly punctuation (e.g., an apostrophe, a comma, or a period) that might be encoded inside the <persName>, there are cases where even such white lie encoding will not work. E.g., the following passage copied from the WWP transcription of Lady Mary Chudleigh’s 1701 work The Ladies Defence as of r28816, last updated 2016-06-09.

<l><persName>Narciſſius</persName>-like, you your own Graces view,</l>
      <l>Think none deſerve to be admir'd but you:</l>
      <l>Your own Perfections always you adore,</l>
      <l>And think all others deſpicably poor:</l>

[17] I often use a WWP function explicitly for this purpose:

  <xsl:function name="wwp:regularize-space" as="xs:string">
    <!-- Collapse all strings of whitespace *including leading & trailing white-  -->
    <!-- space* in the parameter (a string) to a single space (U+0020) character. -->
    <!-- Written long ago on a computer far away by Syd Bauman; copyleft.         -->
    <xsl:param name="arg" as="xs:string"/>
    <xsl:variable name="intermediate" select="concat('␠', $arg, '␠')"/>
    <xsl:variable name="semifinal" select="normalize-space( $intermediate )"/>
    <xsl:value-of select="substring( $semifinal, 2, string-length( $semifinal ) -2 )"/>
  </xsl:function>

[18] Adapted from the WWP transcription of Memoir of the late Hannah Kilham, 1837, as of revision r28478, last updated 2016-04-21.

[19] The <mw> element is the WWP’s version of the TEI <fw> element.

[20] An overall helpful anonymous reviewer suggested that page numbers should be encoded in an attribute value instead of in element content, and further suggested the TEI Guidelines recommend an attribute value. The reviewer is certainly correct that the process of soft hyphen resolution would be much easier if there were never any element content between the initial and final portions of a word broken across a line, column, or page break. And, indeed, in TEI 3.10.3 Milestone Elements the Guidelines say The global @n attribute is used in each case to provide a value for the [page number]. However, this is a mechanism for recording what the page number is, not the page number as it is written on the page. The two may not match, and (in the general case) it is definitionally not possible to record what is written on the page in an attribute value, for two reasons:

  1. it may include characters outside of Unicode — which need to be represented using markup, in the TEI case the <g> element;

  2. it may require markup for other reasons, for example the correction of an apparent error, said correction made either by the current encoder (which would entail the use of the TEI <choice>, <sic>, and <corr> elements) or by an 18th century librarian (which would entail the use of, e.g., the TEI <subst>, <del>, and <add> elements).

The TEI provides the <fw> element precisely to record page numbers etc. actually present in the document being encoded (see TEI 11.6 Headers, Footers, and Similar Matter).

[21] Adapted from the WWP transcription of Favell Mortimer’s 1842 publication The History of Job, in Language Adapted to Children, as of revision r29046, last updated 2016-07-07.

[22] In particular, the heading at the top of page 1 of A Muzzle for Melastomus by Rachel Speght, published in 1617. The heading is actually the complete title of the book, and because there is a lot of front matter, occurs almost halfway through. This image is of Shirley Marc’s Renascence Editions edition, which can be found at the University of Oregon’s Scholars’ Bank.

[23] The encoding also uses the <vuji> element, which is not a TEI element. It is WWP shorthand for the typographic regularization of V, v, U, u, J, j, I, i, VV, and vv. E.g., the expanded TEI form of the WWP <vuji>u</vuji> would be <choice><orig>u</orig><reg>v</reg></choice>.

[24] Copied from the WWP transcription of The cook's guide: or, rare receipts for cookery by Hannah Wolley, 1664, as of r27331, last updated 2015-12-03.

[25] Copied from the WWP transcription of the second edition of Sermons of Barnardine Ochyne, (to the number of. 25.) concerning the predestination and election of god, translated by Ann Bacon in 1570, as of r27244, last updated 2015-11-24.

[26] From the title page of The petition of the Jewes for the repealing of the act of Parliament for their banishment out of England by Johanna Cartwright (with her son Ebenezer Cartwright), 1648. The image is from the Hathi Trust page image. The transcription is copied from the WWP transcription of the same edition as of revision r27244, last updated 2015-11-24.

[27] This is in accordance with the enthymeme I often give clients: I’d prefer your encoding be consistently wrong than inconsistent.

[28] Apparently a large part of the problem was that for reasons I do not know and may have never known, we could not run our preferred Mac↔mainframe transfer program at the same time as MacPerl.

[29] Written 1991-05-24. Interestingly, James and I shared an apartment at the time.

[30] The eXtensible Text Framework from the California Digital Library.

Author's keywords for this paper: XML; soft hyphen; XSLT; TEI