With One Voice: A Modular Approach to Streamlining Character Data for Full-Text Search
Balisage: The Markup Conference 2019
July 30 - August 2, 2019
Full-text search and text analysis often rely on tokenization—the careful division of textual content into smaller, discrete units (here, words). Many tools for search and natural language processing require plain text inputs, from which tokens are derived. Markup tags are not desired because these tools cannot parse them as annotations, only as a bizarre sort of plain text.
The documentation for Apache Lucene states outright:
Applications that build
their search capabilities upon Lucene may support documents in various formats – HTML,
XML, PDF, Word – just to name a few. Lucene does not care about the Parsing of these and other document formats, and it is the
responsibility of the application using Lucene to use an appropriate Parser to convert the original format into plain text before
passing that plain text to Lucene. (Lucene)
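To make the problem concrete, consider what happens when a plain-text tool is handed markup it cannot parse. The sketch below is my own illustration (a deliberately naive regex tokenizer; the sample markup is adapted from examples later in this paper), showing tag names and attribute values leaking into the token stream:

```python
import re

# A naive word tokenizer of the sort many plain-text tools apply:
# any run of letters (or apostrophes) counts as a token.
def tokenize(text):
    return re.findall(r"[A-Za-z']+", text)

markup = '<persName rend="slant(upright)">The Prophet Ioel</persName>'
print(tokenize(markup))
# The tags are not parsed as annotation; their names and attribute
# values simply join the token stream alongside the real words.
```

Run against the snippet above, the tokenizer emits persName, rend, slant, and upright as if they were words of the text—the "bizarre sort of plain text" in action.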
One might next expect the argument that these tools—ranging from the search engine to the many word cloud generators currently in existence—should be able to parse markup. It may in fact be true that such applications could benefit from the nuance of a marked-up document, but such an argument is beyond the scope of this paper. Rather than advocating for the creation of comprehensive, omni-input tools, I describe a modular, transparent approach, one which prioritizes both the need for tokenization and the need for informational markup, and one which allows for customization, but not at the expense of revisiting the rubrics of a markup project again and again.
Real text content
To extract tokens from the text content of an XML document, it is first necessary to determine what the document’s content is. A novice in XML encoding might expect that the text nodes are the real content of the document—to get text out of marked-up text, you just remove the markup.
This approach is reductive, but not absurdly so. The Fifth Edition of the XML 1.0 specification states: “All text that is not markup constitutes the character data of the document.” (XML 1.0) Novices may be encouraged to read XML by moving the markup out of cognitive focus, and taking in only the character data. This method emphasizes the document’s linear progression of text nodes, mentally filtering out the tags.
However, not all character data are comparable, or even useful for all activities. The Text Encoding Initiative, for example, defines the
<teiHeader> element for metadata about a document, and
<text> for the document itself. Both elements have use in
discovery as well as in analysis. The
<teiHeader> tends to play a contextual
role, allowing one to filter a corpus or determine a document’s licensing information. The
<text> element tends to house the words and the structure of the document in
question, but calling
<text>’s text nodes the
real content is
still overly broad in several senses.
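The reductive premise is easy to mechanize. In this sketch (using Python’s standard-library xml.etree; the miniature TEI-like document and the omitted namespaces are simplifications of mine), concatenating every text node in document order pulls the header’s metadata straight into the “content”:

```python
import xml.etree.ElementTree as ET

# A miniature TEI-like document (namespaces omitted for simplicity).
doc = ET.fromstring(
    '<TEI><teiHeader><titleStmt><title>Poems</title></titleStmt></teiHeader>'
    '<text><body><p>Shall I compare thee</p></body></text></TEI>'
)

# itertext() walks every text node in document order -- the "remove the
# markup" approach. The header's title runs straight into the body text.
print(''.join(doc.itertext()))  # -> 'PoemsShall I compare thee'
```

The title Poems is character data just as much as the poem itself is, but few searchers or analysts would want the two streams mingled.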
I will provide examples from the Women Writers Online (WWO) corpus. WWO is not intended to serve as commentary on the women’s writing that it makes available to its subscribers. WWO encoding has the goal of accurate representation of the original document, placing an emphasis on structural and semantic tagging, rather than the presentational. However, there are times when the encoding fails to accurately capture the nuance of the original document, or something in the encoding will be lost when it is published to the web. In some of those cases, the encoder of the text is asked to write a public-facing description of the missing nuance. The description is tagged as a <note> of type “WWP”:
<note xml:id="n003" target="#a003" type="WWP">
<p>These characters symbolize several things. The first is the Sun and the Moon.
They also represent the “eye and the horn of the lamb.”
The characters are also supposed to evoke “O C” for Oliver Cromwell.</p>
</note>
Fig. 1. Example of a WWP note, from The Benediction from the Almighty
Omnipotent by Lady Eleanor Davies. Here’s part of the sentence in which the
note is anchored:
witneſs ☉ ☾ their Golden
Characters, ſtiled Eyes and Horns of the Lamb, &c. (Davies, The Benediction)
This note by WWP alumna Sarah Stanley captures several meanings invoked by Lady Eleanor Davies in The Benediction from the Almighty Omnipotent. The note is undoubtedly important and potentially useful data; someone searching Women Writers Online for mentions of Oliver Cromwell should be able to find Benediction, even though Davies rarely mentions Cromwell without employing some kind of coded wordplay. However, for programmatic analysis of the works authored and published by Davies, commentary from the modern era would not be useful.
Lady Eleanor Davies wrote pamphlets—short in page length, but made dense with added markup. In Benediction, the TEI’s
<choice> element offers equally useful alternate readings: expansions (
<expan>) are given for
Davies’s abbreviations (
<abbr>). The former would be of use for searchability; the latter for analysis of character data collected from the original pages.
Anagram, <mcr rend="slant(upright)" xml:id="cromwel1">Howl <placeName>Rome</placeName></mcr>: And thus
<lb/>with one voice, <said rend="slant(upright)">come and ſee,
<persName><choice><abbr>O:</abbr> <expan>Oliver</expan></choice>
<choice><abbr>C:</abbr><expan>Cromwel</expan></choice></persName></said>
The above example contains two
<choice>s, with whitespace added for
readability. The example also contains an instance of the WWP’s custom element
<mcr> (meaningful change in rendition), which here marks an
anagram for a person mentioned earlier in Benediction, one Oliver Cromwell.
While the alternatives can be considered on equal footing,
<choice> represents an open question for indexing and text analysis purposes. Which child of
<choice> should be ignored? Or do we choose to let the character data
O: and Oliver occur sequentially?
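One answer can be sketched in Python with xml.etree (the preference set and the extraction helper are my own illustration, not the WWP’s code): pick a single child of each <choice> by policy, here preferring the regularized reading.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: inside a <choice>, keep only a preferred child
# (the regularized reading) and drop its sibling alternative.
PREFERRED = {'expan', 'reg'}

def extract(elem):
    parts = [elem.text or '']
    for child in elem:
        if child.tag == 'choice':
            # Keep the first child whose name is in the preference set.
            kept = next((c for c in child if c.tag in PREFERRED), None)
            if kept is not None:
                parts.append(extract(kept))
        else:
            parts.append(extract(child))
        parts.append(child.tail or '')
    return ''.join(parts)

snippet = ET.fromstring(
    '<persName><choice><abbr>O:</abbr><expan>Oliver</expan></choice> '
    '<choice><abbr>C:</abbr><expan>Cromwel</expan></choice></persName>'
)
print(extract(snippet))  # -> 'Oliver Cromwel'
```

Swapping the preference set to {'abbr', 'orig'} yields the original typography, O: C:, instead—the point being that the policy is a project decision, not something a generic tool can infer.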
Implied or insignificant whitespace
Here is the same excerpt, with different spacing:
Anagram, <mcr rend="slant(upright)" xml:id="cromwel1">Howl <placeName>Rome</placeName></mcr>:
And thus<lb/>with one voice, <said rend="slant(upright)">come and ſee, <persName><choice><abbr>O:</abbr>
<expan>Oliver</expan></choice> <choice><abbr>C:</abbr><expan>Cromwel</expan></choice></persName></said>
Despite some changes in lineation and white space, this encoding is functionally
equivalent to the excerpt in the previous section. First, the TEI defines only elements as valid
children of
<choice>, and so, whitespace-only text nodes are considered
to be insignificant when they are the children of
<choice>. (TEI Guidelines)
Even though the first choice has a space between
O: and Oliver, a schema-aware processor would show
O:Oliver, as if
the newline and spaces weren’t there.
Second, <lb>, or line beginning, implies the
existence of a newline between
And thus and with one voice,
regardless of whether or not the newline character is actually present in the previous or
following text node. For reasons of formatting and readability, a newline is usually placed
immediately before the
<lb>, but it does not have to be. WWO gives
<lb> a default rendition of
break(yes), such that
<lb> is treated as if it occurs after a newline character.
Tags and differing worldviews
It’s worth noting that the WWP uses extensive intra-word markup.
Tags can and do occur in the midst of a word—meaning, one can assume that most elements in
WWO imply no surrounding whitespace at all. For example, the
<wwp:vuji> tag is used as a convenient
shorthand for a
<choice> marking old-style letterforms and their regularizations.
The Prophet <persName rend="slant(upright)"><vuji>I</vuji>oel</persName>
The Prophet <persName rend="slant(upright)"><choice><orig>I</orig><reg>J</reg></choice>oel</persName>
The first snippet uses
<wwp:vuji> to mark the character
I, which would be written as
J in modern usage. The
second snippet shows the TEI-conformant version of the same content. (Davies, The Benediction)
Since <wwp:vuji> only marks one or two characters, the tag
occurs most often inside words and never implies whitespace. We might prefer to read
the original Ioel or the modernized Joel. No one would
be happy with
Prophet I oel.
Many XML-aware tools have a different understanding of implied whitespace. In XTF, eXist-DB, and Morphadorner, every element—by default—implies that there is at least one whitespace
character on either side.
I list these tools in particular because each has been used with WWO documents at one
time or another. Women Writers Online runs on the XTF platform. Eventually, eXist-DB will
replace XTF as a platform for WWO publication. As part of work on the Word Vector Interface, the WWP
experimented with Morphadorner for regularization, especially on works from the early
modern era. These tools do not share the WWP’s worldview on tags and whitespace, but we
have been able to customize them to parse WWO documents with reasonable success.
Both eXist-DB and Morphadorner provide a configuration option which lets one define a
list of tags which should be considered
inline, or, as implying no
whitespace. (eXist-DB; Burns 2013) Configuration can be
a humbling process when most tags must be exempted from the default behavior!
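The effect of such an “inline list” can be sketched in a few lines of Python (the element names in the exemption set are hypothetical, and real configurations for eXist-DB or Morphadorner use their own syntax; this only models the behavior):

```python
import re
import xml.etree.ElementTree as ET

# Default rule: every element implies a word boundary on either side;
# element names in INLINE are exempted from that rule.
INLINE = {'vuji', 'abbr', 'expan', 'emph'}  # hypothetical exemption list

def to_text(elem, inline=INLINE):
    sep = '' if elem.tag in inline else ' '
    parts = [sep, elem.text or '']
    for child in elem:
        parts.append(to_text(child, inline))
        parts.append(child.tail or '')
    parts.append(sep)
    return ''.join(parts)

def words(elem, inline=INLINE):
    # Collapse runs of whitespace so boundaries are easy to read.
    return re.sub(r'\s+', ' ', to_text(elem, inline)).strip()

doc = ET.fromstring('<p>The Prophet <persName><vuji>J</vuji>oel</persName></p>')
print(words(doc))                # exempted:          'The Prophet Joel'
print(words(doc, inline=set()))  # default behavior:  'The Prophet J oel'
```

With an empty exemption list—the default worldview of these tools—the intra-word <vuji> splits the name into J and oel, which is exactly the failure mode described above.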
XTF, on the other hand, can only parse tags as delimiters between discrete
terms. (XTF Users List) Until recently, WWO pre-publication processes resolved most
intra-word ambiguity before XTF indexed the documents. For example,
<wwp:vuji>s were transformed into the modern forms of their character data:
The Prophet <persName slant="upright">Joel</persName>
XTF’s parser does a fine job of telling Lucene to index the terms
Prophet and Joel, setting aside the stopword
The.
Recently, the WWP unveiled a new feature of the Women Writers Online interface which
allows readers to toggle between the regularized and the original typography. To do so,
<wwp:vuji> tags were retained, although the character data was still modernized:
The Prophet <persName slant="upright"><vuji>J</vuji>oel</persName>
Similarly, <vuji> tags were introduced around
each long-s character, so that the interface can switch between regularized and original
letterforms to match the reader’s set preferences.
The WWP staff soon discovered that XTF was displaying the content as expected, but
was also indexing the terms
J and oel. In fact, XTF had always done this with other intra-word
markup, such as the relatively rare
<emph>, which has always
retained its tags during indexing. We only found out when the abrupt increase in
intra-word markup made XTF’s assumption a great deal more apparent.
The <hit><term>Prophet</term></hit> J oel as fore s aw
Keyword-in-context snippet from XTF’s raw XML results for a WWO search on the term Prophet.
Words and their boundaries
The concerns listed in previous sections are not new; they cannot be solved once and for
all. Rather, they are confronted and addressed in XML database index configurations,
stylesheets for publication formats, discussions of schema design, &c., &c. And
because there are as many approaches as there are projects, Lucene and the word cloud
generators of the world can perhaps be forgiven for sticking to plain text input with a
single layer of data. These tools don’t have to interpret or reduce complexity beyond the
character level—all content is “real” content.
In the following sections, I describe the
fulltexting routines used by the
Women Writers Project. The foundation of the routines is fulltext.xsl, also known as the
fulltextBot, which defines steps for the creation of an intermediary, derived XML document. The
XML intermediary can be (and has been) used for indexing, XPath queries, the extraction of
plain text, and simple HTML display.
At the time of this writing, fulltextBot development has three guiding principles:
no matter the reasons one has for needing reliable word boundaries in character
data, some normalization processes will always be useful;
to support as many applications as possible, the markup should be preserved for as
long as it remains valuable; and
it should always be possible to determine where and why a normalized document
differs from the original.
Early versions of the fulltextBot favored human-readable output over verbosity, and so it may come as no surprise that the fulltextBot creates an XML intermediary
which can be read using the reductive premise described in the introduction—that one can
determine the so-called
real content of an XML document by focusing only on the
text nodes. Alternate readings are removed from text nodes, leaving only regularized content.
The original content is not lost, though. Whenever the fulltextBot decides that character
data should be dropped from the document’s regularized content, it moves the string into a
custom attribute called
@read (as in,
“for this element, read this original character data”). Examples are shown below.
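The @read convention can be sketched as follows (a Python illustration with an invented, simplified letterform mapping; the real fulltextBot is an XSLT stylesheet):

```python
import xml.etree.ElementTree as ET

MODERN = {'I': 'J', 'V': 'U', 'u': 'v', 'i': 'j'}  # simplified, invented mapping

# When regularization replaces character data, the original string is not
# discarded: it moves into a custom @read attribute on the element.
def regularize_vuji(root):
    for vuji in root.iter('vuji'):
        original = vuji.text or ''
        vuji.set('read', original)
        vuji.text = MODERN.get(original, original)
    return root

doc = ET.fromstring('<persName><vuji>I</vuji>oel</persName>')
print(ET.tostring(regularize_vuji(doc), encoding='unicode'))
# -> <persName><vuji read="I">J</vuji>oel</persName>
```

A consumer that wants only the regularized stream reads the text nodes; a consumer that wants provenance reads @read as well.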
Origins of the fulltextBot
In 2016, Syd Bauman and I started work on a small application to serve WWO data out of an
XML database. The project ultimately didn’t go anywhere, but it did include an XSLT stylesheet
intended to create index-friendly derivatives of WWO documents. This stylesheet, the first
fulltextBot, was also an experiment in soft hyphen processing.
Soft hyphens: an interlude
Soft hyphens are the hyphens which occur at the end of a printed line, in the middle
a word, where a hyphen would not normally occur.
witneſs <seg xml:id="a003" corresp="#n003">☉ ☾</seg> their Golden Cha-
<lb/>racters, ſtiled Eyes and Horns of the
An example of a soft hyphen in WWO encoding. (Davies, The Benediction)
For display purposes, I have used a hard hyphen character (-) instead of the soft hyphen character (U+00AD).
Soft hyphens are also the most tenacious of intra-word markup. The soft hyphen
phenomenon is encoded in WWO as the Unicode character U+00AD. That is to
say, a soft hyphen occurs
alongside other character data in a text
node. The presence of a soft
hyphen overrides any whitespace implied by the next printed line (<lb>). In
fact, any whitespace should be considered insignificant if it occurs after the soft hyphen
character and before the orphaned wordpart. Ideally, the wordpart before the soft hyphen
should be joined up with the next eligible wordpart.
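Within a single text node, that ideal resolution is easy to state (a Python sketch of mine; the real difficulty, discussed below, is that the two wordparts usually live in different text nodes separated by markup):

```python
import re

SHY = '\u00AD'  # U+00AD SOFT HYPHEN, as encoded in WWO

# Delete the soft hyphen and any whitespace before the orphaned wordpart,
# rejoining 'Cha' + 'racters,' into a single word.
def resolve_soft_hyphens(text):
    return re.sub(SHY + r'\s*', '', text)

line = 'their Golden Cha' + SHY + '\nracters, stiled Eyes and Horns'
print(resolve_soft_hyphens(line))
# -> 'their Golden Characters, stiled Eyes and Horns'
```

When an <lb/> or a run of page artifacts sits between the two halves, no single regular expression over one text node will do, which is where the XSLT gets complicated.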
In 2016, WWP staff spent weeks debugging the soft hyphen processing in the Women Writers
Online stylesheets. Syd Bauman’s paper
The Hard Edges of Soft Hyphens goes
into great depth about the intricacies of whitespace and axis relationships, all of which
make it difficult to obtain a single word from two parts separated by a soft hyphen. He
writes of his experimental method:
Eventually it occurred to me that XSLT’s forte is processing trees of element nodes
and their attributes, not text nodes. A large part of the problem
I was having was needing to repeat a test performed in template A so that template B
could figure out what template A had thought of a given node. Instead, if I
processed in separate passes, template A could record what it thought of each node so
that template B, running at a later pass, would know. Of course, one needs a place to
record this information, and a text node doesn’t really have any convenient
place for it. (Bauman 2016, emphasis mine)
I, in turn, wanted to reduce the cognitive load required for humans to parse and debug
the XPaths needed for template A’s and template B’s tests. I followed the status quo established in the WWO stylesheets: when a soft hyphen
occurs, the XSLT
moves the second wordpart to the first, and deletes the soft
hyphen. When a text node has a soft hyphen in it, an XSLT template (template A) must
correctly identify the next part of the word, and copy that wordpart. Consequently,
another template (template B) must also be able to identify text nodes which contain the
copied wordpart, and delete the string. As noted in Bauman 2016, a
successful resolution can only occur when both text nodes are processed.
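Bauman’s two-pass idea can be sketched outside XSLT as well (a Python illustration; the marker attribute has-shy is my invention): pass 1 records its verdict where pass 2 can find it, so the test is performed only once.

```python
import xml.etree.ElementTree as ET

SHY = '\u00AD'

# Pass 1 (the role of template A): test each element's text nodes for a
# soft hyphen, and record the verdict as an attribute on the element.
def pass_one(root):
    for elem in root.iter():
        texts = [elem.text or ''] + [child.tail or '' for child in elem]
        if any(SHY in t for t in texts):
            elem.set('has-shy', 'true')  # invented marker attribute
    return root

# Pass 2 (the role of template B): no re-testing, just read the marker.
def pass_two(root):
    return [e.tag for e in root.iter() if e.get('has-shy') == 'true']

doc = ET.fromstring('<p>Golden Cha' + SHY + '<lb/>racters</p>')
print(pass_two(pass_one(doc)))  # -> ['p']
```

The payoff is that the expensive or error-prone test lives in exactly one place, and later passes consult a recorded decision instead of re-deriving it.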
My one innovation in soft hyphen processing was to first group together sequences of
elements which represent artifacts around pages, such as catchwords (<mw
type="catch">), signatures (
<mw type="sig">), and page beginnings (<pb>).
<pb n="2"/>
<milestone unit="sig" n="A1v"/>
WWO encoding indicating the beginning of page 2, which has an idealized signature of
A1v. (Davies, The Benediction)
A template matches these elements, and determines if the current node has any other
pbGroup artifacts before it. If not, the current node is processed by the pbSubsequencer
template. The pbSubsequencer recursively gathers up all
pbGroup candidates which appear immediately after the triggering element. The resulting
collection of elements and whitespace-only text nodes is contained within an
<ab type="pbGroup">. With the phenomena around page beginnings grouped together on a
first pass, templates in the second pass—
unifier mode—could safely ignore pbGroups
when deciding whether a text node is on either side of a soft hyphen.
An anxiety of soft hyphens
On November 29, 2016, I wrote an optimistic commit message:
complete shy handling. I was wrong, of course, and I knew it even then, even though my test data looked
clean. Soft hyphens are the most volatile of intra-word markup because so much of their
behavior depends upon: implied whitespace; elements with character data that should be
ignored when looking for the next wordpart; elements which should halt shy processing
(such as <gap>); and how much node ancestry is shared by the affected wordparts.
Here, for example, is a fulltextBot resolution of the soft hyphen shown earlier:
witness <seg xml:id="a003" corresp="#n003">☉ ☾</seg> their Golden
Cha<seg read="">racters,</seg><lb/> <seg read="racters,"/> stiled Eyes and Horns of the
Knowing this, I surveyed the WWO corpus for soft hyphens, looking for encoding which
might cause bugs. In his paper, Syd states that
it is trivially easy to find all the
occurrences of soft hyphens that require resolution in WWO documents. (Bauman 2016) I found this to be accurate. On the other hand, it is much harder
to classify the ways in which soft hyphens interact with the XML structures around them. It
is even more difficult to do so programmatically, at scale.
For testing purposes, elements and/or attributes were introduced at the sites of the
fulltextBot’s interventions. Besides
@read, the fulltextBot would add
@type, @subtype, and @resp to
communicate the kind of intervention made. The fulltextBot also would be able to recognize
WWO elements which imply break behavior. If the element had no preceding whitespace
delimiter, the fulltextBot would add one.
<ab type="pbGroup" subtype="add-element" resp="fulltextBot"><pb n="2"/>
<milestone unit="sig" n="A1v"/>
witness <seg xml:id="a003" corresp="#n003">☉ ☾</seg> their Golden
Cha<seg read="-" type="shy-part" subtype="add-element mod-content" resp="fulltextBot">racters,</seg><lb/>
<seg read="racters," type="shy-part" subtype="add-element del-content" resp="fulltextBot"/> stiled Eyes and Horns of the
The Prophet <persName rend="slant(upright)"
><vuji read="I" subtype="mod-content" resp="fulltextBot">J</vuji>oel</persName>
Anagram, <mcr rend="slant(upright)" xml:id="cromwel1">Howl <placeName>Rome</placeName></mcr>:
And thus <lb/>with one voice, <said rend="slant(upright)"><quote>come and see</quote>, <persName
><abbr read="O:" type="choice" subtype="del-content" resp="fulltextBot"/><expan>Oliver</expan>
<abbr read="C:" type="choice" subtype="del-content" resp="fulltextBot"/><expan>Cromwel</expan></persName></said>
Some effects of a 2017 fulltextBot on WWO encoding. (Derived from Davies, The Benediction.)
Note that the
ſ character has been silently regularized to a
lower-cased s. For human readability, long-s regularization remains the only unmarked
intervention type in the fulltextBot.
Beyond the fulltextBot XSLT, a companion XQuery was
developed to gather regularized WWO content into a tab-separated values file. Each row represents a document from Women Writers Online. Besides a cell
containing a plain text representation of the document, each row also contains metadata
about the source material.
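A table of that shape can be sketched with Python’s csv module (the column names and the miniature document here are hypothetical; the real companion script, fulltext2table.xq, is XQuery):

```python
import csv
import io
import xml.etree.ElementTree as ET

# One row per document: metadata cells first, then the plain-text cell.
docs = {
    'davies.benediction': '<TEI><title>The Benediction</title>'
                          '<text><p>with one voice</p></text></TEI>',
}

out = io.StringIO()
writer = csv.writer(out, delimiter='\t')
writer.writerow(['document', 'title', 'fulltext'])
for name, xml in sorted(docs.items()):
    root = ET.fromstring(xml)
    title = root.findtext('title')
    # Collapse runs of whitespace so the text fits safely in one cell.
    fulltext = ' '.join(''.join(root.find('text').itertext()).split())
    writer.writerow([name, title, fulltext])
print(out.getvalue())
```

Because each row carries both metadata and a single normalized text cell, the output can be fed directly to tools that expect tabular or plain text input.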
By April 2017, general development on the sample WWO application had stopped. The remaining
commits in the app repository were on the fulltextBot or its companion XQuery. With
the push to retain tags and to capture the provenance of interventions on WWO character data,
the XSLT was becoming a transparent, open system. At this point, the fulltextBot and the
XQuery were moved to the Women Writers Project Public Code Share as modular parts of a fulltexting toolset.
By applying a baseline of normalization first, the toolset as a whole reduces the barrier
of entry to creating plain text from WWO documents. The XSLT and XQuery
allow further customizations and free users to define for themselves what constitutes
relevance in marked-up text.
The fulltexting routines have since been used for many purposes, mostly by WWP staff and
encoders. These endeavors include: gathering data on the titles in WWO; providing
plain text to researchers; creating input files for training word embedding models;
and spellchecking WWO texts before publication.
The toolset has continued to grow in response to these endeavors. The processes already
described continue to be fine-tuned as new bugs are discovered. In order to reduce the time
needed to run the original fulltext2table.xq, a new version called
fulltext2table.enmasse.xq was invented to create one TSV file per XML
document. The fulltextBot offered the option to move
<note>s out of the
<wwp:hyperDiv> and next to their anchors. Sarah Connell and I wrote a new
XQuery to get plain text out of generic XML. Also, starting in a later fulltextBot version, I
reworked soft hyphen handling—instead of moving wordparts around, the fulltextBot deletes
the whitespace that occurs between wordparts.
Customizable extraction of plain text
With some effort, the intermediary XML can be used to walk back from a plain text
snippet to the original WWO XML. The first real use of the fulltextBot was to create a
report of <title>s which only appear once in WWO. For human
readability, it was necessary for each
<title> to be normalized... and, for
actionability, it was necessary to be able to get back to the original node using an XPath.
I used the fulltextBot to create intermediary XML of each published WWO document. I then
ran an XQuery script which calculated the number of times the content of each
<title> appeared across the corpus, and inserted an
attribute on those which appeared only once. The
singleton-intertextual-titles Inspectre report contained copies of the
passages in which those
<title>s appeared. The Inspectre application
transformed the passages into HTML, and also provided an XML view and an XPath for cases
where more context was needed.
Once the Inspectre report was complete, I used another XQuery to insert bibliographic
markup around each
<title>. The nature of the intermediary
XML allowed me to programmatically determine what the original text content of a given node
would have been at the time of the report’s creation. The annotations told me which document
each node appeared in, and the
<milestone> preceding the node’s containing element.
With one voice
In the intermediary form described above, WWO markup retains its value even when the
character data is being prioritized. At a minimum, fulltextBot results provide a window into
the original encoding. They can be queried just as regular WWO texts can, and they allow
one to answer questions with XPath that would ordinarily require XQuery or XSLT and
a day of
developer time. More than that, the XSLT—and the assumptions under its code—can be tested by
searching the output for the intervention-marking attributes and their fulltextBot-specific values.
The fulltexting routines have been used on other TEI-based corpora with a change to the
default namespace declarations, and some document analysis to find any ignorable elements.
Even so, I think the toolset’s most valuable asset is that it gives shape and context
to the invisible rules underlying WWO encoding. In short, the fulltextBot works best on WWO
documents because it has been tailored to the dimensions of the WWO corpus.
As previously stated, there are as many approaches to tokenization as there are projects.
But it is perhaps more useful to say that every project has baked-in assumptions about which
textual content is important, and how XML nodes play off one another. It is perhaps most
important to examine these assumptions, to test them, and to build a foundation on which
common understanding can rest.
I owe a significant debt of gratitude to Syd Bauman for his support and for his work on
processing soft hyphens. The fulltextBot would not be nearly so comprehensive if Syd hadn’t
pointed out many, many pitfalls to me.
I’d also like to thank Sarah Connell, who probably has a copy of almost every version of
the fulltextBot. She’s an expert in modifying the fulltexting routines to accommodate
different needs and different kinds of markup. Her feedback and feature requests have made the
tools more powerful and more accessible than they would be otherwise.
Finally, thanks to the peer reviewers for Balisage, for all their suggestions, especially
regarding the overall shape of this paper.
Any errors or missteps are mine and mine alone.
Appendix B. Processing in fulltext.xsl version 2.3
The fulltextBot at version 2.3 can be found on GitHub: https://github.com/NEU-DSG/wwp-public-code-share/blob/72060eaa3e9883088a69036e78c74991cb5c28ed/fulltext/fulltext.xsl.
Pass 1: default mode
Most regularization takes place, including the following:
long-s characters are changed to lower-case s characters;
<choice>s are made;
WWP-authored content is deleted;
implied whitespace is made explicit;
pbGroup members are wrapped together in an <ab type="pbGroup">.
Once whitespace is in a reliable state and metawork is marked with
@read, soft hyphens can be resolved. Whitespace is deleted if it
occurs after a soft hyphen and before a subsequent wordpart.
If the parameter
$move-notes-to-anchors is toggled on (it is off by
default), unifier mode is first run on
<note>s. The resulting
<note>s are tunnelled through to their anchor points in the
<text> proper. Notes are not inserted next to their anchors if the note
would appear in the middle of a word.
If $move-notes-to-anchors is toggled on and there are
<note>s which could not be placed with their anchor, those notes are
returned to their original locations.
This would be the pass where the remaining notes would be placed after the
interrupting wordpart. However, this kind of manipulation is easier to do with XQuery
Update, so I left it out of the XSLT stylesheet.
Otherwise, the results from unifier mode are returned.
[Lucene] Apache Software Foundation. Lucene 8.0.0 documentation.
[Bauman 2016] Bauman, Syd. “The Hard Edges of Soft Hyphens.” Presented at Balisage: The Markup
Conference 2016, Washington, DC, August 2–5, 2016. In Proceedings of Balisage:
The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016).
[Burns 2013] Burns, Philip R. 2013. “MorphAdorner v2: A Java Library for the Morphological
Adornment of English Language Texts.” Northwestern University. https://morphadorner.northwestern.edu/morphadorner/download/morphadorner.pdf. Accessed 2019-07-05.
[Davies, The Benediction] Davies, Lady Eleanor. 2015. The Benediction, 1651. From the Women Writers Online XML, last modified 2019-02-10 (commit 36259). Published at https://www.wwp.northeastern.edu/texts/davies.benediction.html. (Requires subscription.)
[eXist-DB] eXist-db Project. Documentation. “Whitespace Treatment and Ignored Content.” In Full Text Index.
[Jockers 2016] Jockers, Matthew L. 2016. “Text Quality, Text Variety, and Parsing XML.” In Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer.
[TEI Guidelines] TEI Consortium. “Elements.” In P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.5.0. https://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html.
[XML 1.0] W3C. Extensible Markup Language (XML) 1.0 (Fifth Edition). Section 2.4, “Character Data and Markup.” https://www.w3.org/TR/REC-xml/#syntax.
[XQuery and XPath Full Text 1.0] W3C. XQuery and XPath Full Text 1.0. https://www.w3.org/TR/xpath-full-text-10/.
[XTF Users List] XTF Users List. 2012-02-06 – 2012-05-04. Forum thread, “Tags that break up words.” https://groups.google.com/forum/#!topic/xtf-user/hsvFOTM0b9E. Accessed 2019-07-04.