Cayless, Hugh A. “Linking Page Images to Transcriptions with SVG.” Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). https://doi.org/10.4242/BalisageVol1.Cayless01.
Balisage Paper: Linking Page Images to Transcriptions with SVG
Hugh is head of the Research & Development group at the Carolina Digital
Library and Archives, UNC Chapel Hill. He holds a Ph.D. in Classics and an M.S. in
Information Science from UNC. His research interests include the application of
computational techniques to problems in Classics (and the Humanities in general).
This paper will present the results of ongoing experimentation with the linking of
manuscript images to TEI transcriptions. The method being tested involves the automated
conversion of images containing text to SVG, using Open Source tools. Once the text is
converted to SVG paths, these can be grouped in the document to mark the words therein, and
these groups can then be linked using standard methods to tokenized versions of the
transcriptions. The goal of these experiments is to achieve a much more fine-grained linking
and annotation mechanism than is so far possible with available tools, e.g. the Image Markup
Tool and TEI P5 facsimile markup, both of which annotate only rectangular sections of the
image. The method envisioned here would produce a legible tracing of the word, expressed in
XML, to which transcripts and annotations might be attached and which can be superimposed
upon the original image.
Mass digitization has become a fact of life at most major University Libraries in recent
years. In the case of printed books, initiatives like Google’s and the Internet Archive’s
produce page images with attached metadata. These are OCR’ed and processed to make them
searchable and readable via a web browser and (in the case of the IA) in a variety of
formats, such as PDF. These processes do not, however, accommodate documents for which OCR is
impossible. Handwritten documents can be, and regularly are, digitized, but human beings
must produce transcriptions for them (if this is done at all) and the transcriptions are
typically linked to the images only on the level of the page.
This project began as an attempt to see what could be done in an automated or
semi-automated way to allow linkage between transcription and image at a deeper level, and
annotation of the image at the level of the text on the page. Transcription of the text in a
page image can be considered a special case of annotation.
UNC Chapel Hill is contemplating a very large scale manuscript digitization project,
potentially covering the entire Southern Historical Collection. The Carolina Digital Library
and Archives (CDLA) also has an ongoing, smaller manuscript digitization project funded by
the Watson-Brown foundation and several completed projects, published under its Documenting
the American South program, which deal with manuscript images and transcriptions in TEI.
There are a number of existing tools that provide for user-controlled image annotation.
This is typically accomplished by providing the user with drawing tools with which to
draw shape overlays on the image. These overlays can in turn be linked to text annotations
entered by the user. This is the way image annotation works on Flickr, for example, and in
the IMT. The TEI P5 facsimile markup conceives of text-image linking in this fashion as well.
Drawing rectangular overlays on top of an image is a good compromise between ease-of-use
and utility, and rectangles fit well with most types of written text. It does prompt
questions of whether it is possible to go deeper, however, and what to do with lines
that cannot be captured by rectangles. I noted with interest the proof-of-concept work
my colleagues Sean Gillies and Tom Elliott did earlier this year using the OpenLayers
library to trace inscriptions, in which the tracings could be
saved and used as a linking mechanism. Inscriptol was the inspiration for the work described
in this paper. Some further discussion about tools for text and image linking took place on
the Stoa site at the same time.
The starting point for the experiments described here is potrace, a tool for tracing raster
images and converting them to vector graphics. Potrace will convert a bitmap to SVG, among
other formats. It is licensed under the GPL. Tests with the tool on manuscript pages were
promising, so I decided to see whether a toolchain could be constructed, using only Free,
Open Source software, that would start with a manuscript image in a standard format such as
JPEG and take it into an environment in which the image could be linked to a transcription.
The goal of this experiment is to see whether it is possible to go from a page image with a
TEI-based transcription to a linked presentation of the two, using only Free, Open Source
tools. In addition, the experiment is intended to evaluate the extent to which this process
might be automated and, conversely, where and how much human intervention will be required.
The process of tracing a raster image to produce a vector analog requires a bitmap
format as input. The source images in DocSouth are most likely to be either TIFF or JPEG, so
they must be converted to a format usable by potrace. The convert utility that comes with
the open source image processing library ImageMagick performs this function with ease using a command such as:
convert mss01-01-p01.jpg mss01-01-p01.pnm
ImageMagick also supports a wide variety of additional image manipulations, and is
likely to prove useful in other kinds of image preprocessing. As we work on refining the
techniques outlined here, it is likely that operations such as image sharpening and
increasing the contrast will be added to the preprocessing pipeline in order to produce
sources that potrace can do a better job of converting.
Conversion to SVG
Potrace handles the conversion from bitmap to SVG. As part of the process, it collapses
the image’s color space to 1 bit (black/white) and then creates vector paths tracing the
black shapes in the image. Setting the cutoff at which it determines whether a pixel is
black or white is one of the main steps at which human intervention is presently required.
In experimenting on a number of images, I was able to obtain good results after a little
trial and error.
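The trial and error over the cutoff can be made more systematic by producing one tracing per candidate value and comparing the results by eye. The sketch below only builds the command lines and does not run them. It assumes potrace's --svg, --blacklevel and --output options; the function name and the output file naming scheme are hypothetical and are not part of the toolchain described in this paper.

```python
from pathlib import Path


def potrace_commands(pnm_path, cutoffs):
    """Build one potrace command line per candidate black/white cutoff.

    Assumption: potrace is installed and accepts --svg, --blacklevel and
    --output. The output naming scheme is made up for this illustration.
    """
    stem = Path(pnm_path).stem
    commands = []
    for cutoff in cutoffs:
        output = f"{stem}-k{cutoff}.svg"
        commands.append([
            "potrace", "--svg",
            "--blacklevel", str(cutoff),
            "--output", output,
            pnm_path,
        ])
    return commands


if __name__ == "__main__":
    for command in potrace_commands("mss01-01-p01.pnm", [0.4, 0.5, 0.6]):
        # To execute a command, pass the list to subprocess.run(command, check=True).
        print(" ".join(command))
```

Keeping the cutoff in the output file name makes it easy to see which value produced the most legible tracing.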
The image in figure 1 was converted to the image in figure 2 with potrace. The paths it produces
use SVG’s moveto (m), cubic bézier curveto (c) and lineto (l) commands in relative mode;
that is, the first moveto command determines the start point of the path, and subsequent
coordinates are given as offsets from the current point rather than as absolute positions.
This mode is obviously a convenient notation for a
program creating the paths, but it is less convenient for working with the paths, so
conversion to the absolute notation is necessary. In addition, the SVG output by potrace is
marked as version 1.0 (although it is compatible with the current standard, 1.1) and the
paths need to have ids assigned to them so that they can be referred to.
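To make the difference between the two notations concrete, here is a minimal Python sketch of the relative-to-absolute conversion for the m, l, c and z commands. It illustrates what the conversion does and is not how Inkscape implements it; the function names are hypothetical, and other SVG path commands (h, v, s, q, t, a) are not handled.

```python
import re

# Commands handled here (m, l, c, z) and how many numbers each one takes.
ARG_COUNT = {"m": 2, "l": 2, "c": 6, "z": 0}
TOKEN = re.compile(r"[MmLlCcZz]|[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def fmt(value):
    """Format a coordinate without trailing zeros."""
    return f"{value:.6f}".rstrip("0").rstrip(".")


def to_absolute(d):
    """Rewrite path data made of m/l/c/z commands using absolute M/L/C/Z."""
    tokens = TOKEN.findall(d)
    current = start = (0.0, 0.0)
    command = None
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i].isalpha():
            command = tokens[i]
            i += 1
        if command is None:
            raise ValueError("expected a command letter")
        kind = command.lower()
        if kind == "z":
            out.append("Z")
            current = start   # closing a subpath returns to its start point
            command = None    # a new command letter must follow
            continue
        count = ARG_COUNT[kind]
        numbers = [float(t) for t in tokens[i:i + count]]
        if len(numbers) < count:
            raise ValueError(f"too few coordinates for '{command}'")
        i += count
        points = []
        for x, y in zip(numbers[0::2], numbers[1::2]):
            if command.islower():
                # Every pair within one command is an offset from the same current point.
                x, y = x + current[0], y + current[1]
            points.append((x, y))
        current = points[-1]
        if kind == "m":
            start = current
            # Extra coordinate pairs after a moveto are implicit linetos.
            command = "l" if command.islower() else "L"
        out.append(kind.upper() + " " + " ".join(f"{fmt(x)},{fmt(y)}" for x, y in points))
    return " ".join(out)


if __name__ == "__main__":
    print(to_absolute("m 10,10 l 5,0 c 1,1 2,2 3,3 z"))
```

Once every coordinate is absolute, a path's bounding rectangle can be read directly from its points, which is what makes this notation easier to work with in later analysis.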
The SVG editor Inkscape (which in fact uses potrace internally to trace images) may be
used from the command line to output a version of the SVG with the relative path notation
converted to absolute. If invoked with the -l parameter, Inkscape will output a ‘plain’ SVG
file without the additional namespaces the program typically adds.
XSLT was used to insert id numbers and do additional small pieces of cleanup, including
removal of a duplicate SVG namespace and setting the version number to 1.1. After
processing, an example path looks like:
<path d="M 88.96875,276.20312 C 88.96875,276.89375 86.36875,280.42813 85.271875,281.2
C 84.825,281.48438 84.5,281.93125 84.5,282.175 C 84.5,282.37813 84.0125,282.94687
83.44375,283.39375 C 82.753125,283.9625 82.55,284.36875 82.834375,284.65312
C 83.321875,285.14062 86.978125,283.51562 90.065625,281.40312 L 92.178125,279.98125
L 93.4375,280.875 C 95.021875,282.0125 97.90625,282.45938 97.90625,281.60625
C 97.90625,281.28125 97.25625,280.79375 96.403125,280.55 C 95.55,280.30625
94.371875,279.65625 93.7625,279.16875 C 89.171875,275.30937 88.96875,275.1875
88.96875,276.20312 z M 90.390625,279.33125 C 90.146875,279.98125 87.34375,281.6875
86.571875,281.6875 C 86.2875,281.6875 86.9375,280.875 88.034375,279.85938
C 89.821875,278.15312 90.91875,277.90938 90.390625,279.33125 z" id="path13332"/>
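The cleanup step was done with XSLT in this experiment. For readers more comfortable with Python, the following sketch shows an equivalent of two of its tasks, assigning ids and setting the version, using only the standard library. The function name and the sequential id scheme are assumptions made for illustration; they do not reproduce the stylesheet actually used.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def add_ids_and_version(svg_text, prefix="path"):
    """Give every path without an id a sequential one and set version to 1.1."""
    # Serialize SVG as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_text)
    root.set("version", "1.1")
    paths = root.iter(f"{{{SVG_NS}}}path")
    for number, path in enumerate(paths, start=1):
        if path.get("id") is None:
            path.set("id", f"{prefix}{number}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<svg xmlns="http://www.w3.org/2000/svg" version="1.0">'
        '<g><path d="M 0,0 L 1,1"/><path d="M 2,2 L 3,3"/></g></svg>'
    )
    print(add_ids_and_version(sample))
```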
At this point, we have an SVG document that is ready for analysis. The initial
experiment uses a Python script that simply attempts to detect lines in the image and
organize the paths within those lines into groups within the document. The paths are sorted
left to right and top to bottom and then merged using a simple algorithm. The process starts
with the bounding rectangle of the leftmost, topmost path and looks at the next path’s
bounding rectangle. If they overlap top to bottom more than 45%, then the two are merged
into a group. This continues until no more overlapping rectangles can be found, and the
remaining paths that have not been assigned to a group are passed to the function again. The
process repeats until all paths have been assigned to a group. When analysis is complete,
the Python script writes the results out to disk in two formats. Initially, the script
produced an SVG file with grouped paths and with bounding rectangles inserted to make the
boundaries of the line groups visible. The second format supports the
display described below.
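The grouping pass can be sketched in a few lines of Python. This is a reconstruction from the description above, not the script used in the experiment. It makes two assumptions the text does not settle: that overlap is measured as a share of the shorter rectangle's height, and that a group's rectangle grows as paths are merged into it. All names are hypothetical.

```python
def vertical_overlap(a, b):
    """Share of the shorter box's height that boxes a and b have in common.

    Boxes are (left, top, right, bottom) with y growing downwards, as in SVG.
    """
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    if shared <= 0 or shorter <= 0:
        return 0.0
    return shared / shorter


def merge(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_into_lines(boxes, threshold=0.45):
    """Group path ids into lines; boxes maps path id -> bounding box."""
    # Sort left to right, then top to bottom, so the first entry is the seed.
    remaining = sorted(boxes.items(), key=lambda item: (item[1][0], item[1][1]))
    lines = []
    while remaining:
        (seed_id, line_box), rest = remaining[0], remaining[1:]
        members, unassigned = [seed_id], []
        for path_id, box in rest:
            if vertical_overlap(line_box, box) > threshold:
                members.append(path_id)
                line_box = merge(line_box, box)
            else:
                unassigned.append((path_id, box))
        lines.append({"paths": members, "box": line_box})
        remaining = unassigned  # leftovers go through the same pass again
    # Report lines in reading order, top to bottom.
    return sorted(lines, key=lambda line: line["box"][1])


if __name__ == "__main__":
    sample = {
        "a": (0, 0, 10, 10), "b": (12, 2, 20, 11),
        "c": (0, 20, 10, 30), "d": (12, 21, 22, 29),
    }
    for line in group_into_lines(sample):
        print(line["paths"], line["box"])
```

Because the group's rectangle grows with each merge, a long descender can pull a path from the next line into the group, which is one reason the overlap percentage still has to be tuned by hand.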
I noted above that the proof-of-concept work on tracing inscriptions by Elliott and
Gillies used an Open Source map display library called OpenLayers as the basis for its
display and annotation capabilities. OpenLayers allows the insertion of a single image as a
base layer (though it supports tiled images as well), so it is quite simple to insert a page
image into it.
OpenLayers also supports simple vector structures, such as points, lines,
polylines, and polygons. It is therefore possible to represent the line-containing
rectangles generated by the Python script as a vector layer on top of a JPEG version of the
page image. In order to display the paths traced over the writing itself, however, some
additional work had to be done.
The experimental system discussed here adds several functions to the OpenLayers library
in order to support paths and groups of paths. OpenLayers represents vectors in the browser
using either SVG or VML, depending on the browser’s capabilities. This test only attempted
to display the traced text in Firefox and Safari, both of which have SVG support, so only
the SVG serialization code was modified. In theory, the VML generation code should allow
similar functionality, but this has not been attempted. OpenLayers provides a very useful
platform for display because it has built-in functionality like zoom and pan, as well as the
ability to turn layers on and off and to add event handlers for structures it draws. This
enables, for example, highlighting to be activated when the mouse hovers over a line.
After the functions to store path data and serialize it as SVG were added to OpenLayers,
it became possible to display
the paths and bounding rectangles as separate layers on top of the image. OpenLayers does, as
described above, enable the addition of event handlers to polygon features, so in order to
demonstrate the ability to link the grouped vectors to lines in a transcription, I mapped
path group ids to their corresponding line number using a pair of JSON objects.
The event handler for a mouseover on one of the rectangles bounding a line of text calls
a function that changes the color of the paths within the line and the stroke of the
bounding rectangle itself.
It also changes the font-weight of the corresponding line of text to bold. An
inverted data structure points from lines of text to paths and, with a mouseover event
bound to each line, enables the highlighting of the paths when a user hovers over a line of
the transcription.
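The two lookup structures are easy to derive from one another. The sketch below builds a forward map from path groups to transcription lines, inverts it, and prints both as JSON. The ids are hypothetical placeholders, not the ones used in the test page, and the display code that consumes the JSON is not shown.

```python
import json


def invert(group_to_line):
    """Turn {group id: line id} into {line id: [group ids]}."""
    line_to_groups = {}
    for group_id, line_id in group_to_line.items():
        line_to_groups.setdefault(line_id, []).append(group_id)
    return line_to_groups


# Hypothetical ids: SVG path groups as keys, transcription lines as values.
group_to_line = {
    "line-group-1": "l1",
    "line-group-2": "l2",
    "line-group-3": "l2",
}
line_to_groups = invert(group_to_line)

if __name__ == "__main__":
    print(json.dumps(group_to_line))
    print(json.dumps(line_to_groups))
```

The inverted map holds lists so that a transcription line can point to more than one path group, which may happen when the grouping pass splits a single written line in two.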
The experiments outlined above show that it is feasible to go from a page image with a
TEI-based transcription to an online display in which the image can be panned and zoomed,
and the text on the page can be linked to the transcription (and vice-versa). The steps in
the process that have not yet been fully automated are the selection of a black/white cutoff
for the page image, the decision of what percentage of vertical overlap to use in
recognizing that two paths are members of the same line, and the need for line beginning
(<lb/>) tags to be inserted into the TEI transcription (if it does not already
contain them). The tools employed to produce the SVG tracing and the interactive display are
all stable and well-supported (although the path support added to OpenLayers needs
additional work). It seems clear that additional testing and the attempt to produce a
working implementation will be well worth the effort.
Where to go from here
One question posed by the apparent success of the method is what should link to what. What
structures in the vector graphics document can be detected (beyond lines) and how can they
be linked to the transcriptions? There are some very thorny concurrency issues here, since a
stroke from one letter may touch another, and thus form a single path consisting of multiple
letters, making it impossible to isolate letters, or even words if the letters belong to
different words. A descender on the letter ‘f’ might touch a letter in the line below, making
it impossible to easily identify the two lines as separate. These difficulties mean that linking
word for word or letter for letter between documents is not necessarily possible. The two streams
are not parallel. Of course, vector paths can be sliced, and the image and text streams
therefore could be made parallel, but this kind of operation will almost certainly require a
human being with an SVG editor such as Inkscape. A second, related issue is that text
transcriptions in XML may well define document structure in a semantic, rather than a physical,
way. Line, word, and letter segments can be marked in TEI, but they frequently are not. The
DocSouth example used as a test case here does not have line breaks marked, for example.
The mechanism used for linking bears further thought and study. The current implementation
hand-waves over the problem, simply mapping an id in one document to one or more ids in the
other and vice versa using JSON. It would be much better to develop a standard for this kind
of linking, since there is no guarantee that the id from one document would easily be
available to the other. TEI P5 envisions the alignment of different document streams of this
type using a link group. Using TEI for this is a possible solution, but it does involve changing the TEI
document, which may not be desirable. As the P5 standard remarks: “If it is not feasible to
add more markup to the original text, some form of stand-off markup will be needed.” Stand-off
markup seems a better solution in the abstract, but it isn’t immediately clear what the
best way to implement this solution is.
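One possible shape for a stand-off solution is a separate file holding a TEI link group whose links point into both documents by URI. The sketch below generates such a group with the Python standard library. The file names, the ids and the type value are hypothetical. The target attribute on link follows recent P5 releases and should be checked against the release in use. Nothing here is part of the current implementation.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_link_group(pairs, svg_uri, tei_uri):
    """Serialize (SVG group id, TEI line id) pairs as a stand-off linkGrp."""
    # Serialize TEI as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", TEI_NS)
    group = ET.Element(f"{{{TEI_NS}}}linkGrp", {"type": "alignment"})
    for svg_id, tei_id in pairs:
        # Each link points at one fragment in each document.
        target = f"{svg_uri}#{svg_id} {tei_uri}#{tei_id}"
        ET.SubElement(group, f"{{{TEI_NS}}}link", {"target": target})
    return ET.tostring(group, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file names and ids, for illustration only.
    pairs = [("line-group-1", "l1"), ("line-group-2", "l2")]
    print(build_link_group(pairs, "page01.svg", "transcription.xml"))
```

Keeping the links in their own file leaves both the SVG tracing and the TEI transcription untouched, at the cost of requiring stable ids in each.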
The proof-of-concept system illustrated above attempts to detect lines only, and that in a
very simple way, by looking along the x-axis for overlapping structures. Probabilistic methods
may well prove the best way to determine whether any given path belongs to the same group as
another path, or whether a previously constructed group really holds together. The methods
for structure detection therefore need a great deal of refinement and it is not yet clear how
deep it is possible to go in detecting structure within the SVG image automatically. How
human intervention can be asked for, provided, and enabled within this framework is an
important question too. OpenLayers provides some limited vector editing capabilities, but how
reasonable is it to ask a user to manually split, for example, two lines that have
been combined into a single group?
The prospects for further development of this idea seem rich. I hope to proceed by
developing and refining the structure detection routines, by refining the display capabilities
of the web interface, and by improving and standardizing the linking mechanisms. I plan to seek
grant funding to work on this in the context of one or more of UNC Library’s digitized
manuscript collections later this year.