<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2"><title>Linking Page Images to Transcriptions with SVG</title><info><confgroup><conftitle>Balisage: The Markup Conference 2008</conftitle><confdates>August 12 - 15, 2008</confdates></confgroup><abstract><para>This paper will present the results of ongoing experimentation with the linking of
        manuscript images to TEI transcriptions. The method being tested involves the automated
        conversion of images containing text to SVG, using Open Source tools. Once the text has been
        converted to SVG paths, these can be grouped in the document to mark the words therein and
        these groups can then be linked using standard methods to tokenized versions of the
        transcriptions. The goal of these experiments is to achieve a much more fine-grained linking
        and annotation mechanism than is so far possible with available tools, e.g. the Image Markup
        Tool and TEI P5 facsimile markup, both of which annotate only rectangular sections of an
        image. The method envisioned here would produce a legible tracing of the word, expressed in
        XML, to which transcripts and annotations might be attached and which can be superimposed
        upon the original image.</para></abstract><author><personname><firstname>Hugh</firstname><othername>A.</othername><surname>Cayless</surname></personname><personblurb><para>Hugh is head of the Research &amp; Development group at the Carolina Digital
          Library and Archives, UNC Chapel Hill. He holds a Ph.D. in Classics and an M.S. in
          Information Science from UNC. His research interests include the application of
          computational techniques to problems in Classics (and the Humanities in general), and
          Digital Curation.</para></personblurb><affiliation><jobtitle>Head of R&amp;D</jobtitle><orgname>CDLA, UNC Chapel Hill Library</orgname></affiliation><email>philomousos@gmail.com</email></author><legalnotice><para>Copyright © 2008 Hugh A. Cayless. Used by permission.</para></legalnotice><keywordset role="author"><keyword>image and text linking</keyword><keyword>SVG</keyword><keyword>Open Source</keyword></keywordset></info><section><title>Introduction</title><para>Mass digitization has become a fact of life at most major University Libraries in recent
      years. In the case of printed books, initiatives like Google’s and the Internet Archive’s
      produce page images with attached metadata. These are OCR’ed and processed to make them
      searchable and readable via a web browser and (in the case of the IA) in a variety of other
      formats, such as PDF. These processes do not, however, accommodate documents for which OCR is
      currently impossible. Handwritten documents can be, and regularly are, digitized, but human beings
      must produce transcriptions for them (if this is done at all) and the transcriptions are
      typically linked to the images only on the level of the page.</para><para>This project began as an attempt to see what could be done in an automated or
      semi-automated way to allow linkage between transcription and image at a deeper level and also
      annotation of the image at the level of the text on the page. Transcription of the text in a
      page image can be considered a special case of annotation.</para><para>UNC Chapel Hill is contemplating a very large scale manuscript digitization project,
      potentially covering the entire Southern Historical Collection. The Carolina Digital Library
      and Archives (CDLA) also has an ongoing, smaller, manuscript digitization project funded by
      the Watson-Brown foundation and several completed projects published under its Documenting the
      American South program which deal with manuscript images and transcriptions in TEI. </para><para>There are a number of existing tools that provide for user-controlled image annotation.
      This is typically accomplished by providing the user with drawing tools with which they may
      draw shape overlays on the image. These overlays can in turn be linked to text annotations
      entered by the user. This is the way image annotation works on Flickr, for example, and also
      the Image Markup Tool (IMT). The TEI P5 facsimile markup also conceives of text-image linking in this fashion.</para><para>Drawing rectangular overlays on top of an image is a good compromise between ease-of-use
      and utility, and rectangles fit well with most types of written text. It does, however, prompt the
      questions of whether it is possible to go deeper, and what to do with lines of text
      that cannot be captured by rectangles. I noted with interest the proof-of-concept work
      my colleagues Sean Gillies and Tom Elliott did earlier this year using the OpenLayers
      Javascript library as a means of tracing text on a sample inscription<footnote xml:id="cay01"><para> See <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://atlantides.org/inscriptol/</link>. </para></footnote>. The vector drawings overlaid on the image are serialized as SVG which can then be
      saved and used as a linking mechanism. Inscriptol was the inspiration for the work presented
      in this paper. Some further discussion about tools for text and image linking took place on
      the Stoa site at the same time.<footnote><para>See <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.stoa.org/?p=776</link>.</para></footnote> The starting point for the experiments described here is a tool for tracing raster
      images and converting them to vector graphics named potrace<footnote xml:id="cay02"><para>
          <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://potrace.sourceforge.net/</link>
        </para></footnote>. Potrace will convert a bitmap to SVG, among other formats. It is licensed under
      the GPL. Tests with the tool on manuscript pages were promising, so I decided to see whether a
      toolchain could be constructed, using only Free, Open Source software, that would start with a
      manuscript image in a standard format such as JPEG and take it into an environment where the
      image could be linked to a transcription.</para></section><section><title>The Toolchain</title><para>The goal of this experiment is to see whether it is possible to go from a page image and a
      TEI-based transcription to a linked presentation of the two, using only Free, Open Source
      tools. In addition the experiment is intended to evaluate the extent to which this process
      might be automated and, conversely, where and how much human intervention will be required in
      the process.</para><section><title>Image preparation</title><para>The process of tracing a raster image to produce a vector analog requires a bitmap
        format as input. The source images in DocSouth are most likely to be either TIFF or JPEG, so
        they must be converted to a format usable by potrace, such as PNM. The convert utility that comes with
        the open source image processing library ImageMagick<footnote xml:id="cay03"><para>
            <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.imagemagick.org</link>
          </para></footnote> performs this function with ease using a command such as: <programlisting xml:space="preserve">convert mss01-01-p01.jpg mss01-01-p01.pnm</programlisting>
        <figure xml:id="fig01"><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Cayless01/Cayless01-001.jpg" width="100%"/></imageobject><caption><para>Papyrus (P. Mich. 1.78)</para></caption></mediaobject></figure></para><para>ImageMagick also supports a wide variety of additional image manipulations, and is
        likely to prove useful in other kinds of image preprocessing. As we work on refining the
        techniques outlined here, it is likely that operations such as image sharpening and
        increasing the contrast will be added to the preprocessing pipeline in order to produce
        sources that potrace can do a better job of converting.</para></section><section><title>Conversion to SVG</title><para>Potrace handles the conversion from bitmap to SVG. As part of the process, it collapses
        the image’s color space to 1 bit (black/white) and then creates vector paths tracing the
        black shapes in the image. Setting the cutoff at which it determines whether a pixel becomes
        black or white is one of the main steps at which human intervention is presently required.
        In experimenting on a number of images, I was able to obtain good results after a period of
        trial and error.</para><para>The image in figure 1 was converted to the image in figure 2 using the following
        command: <programlisting xml:space="preserve">potrace -s -t 4 -k 0.3 -G 2.5 P.Mich.inv.\ 3088.pnm</programlisting>
        <figure xml:id="fig02"><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Cayless01/Cayless01-002.jpg" width="100%"/></imageobject><caption><para>Figure 1 after conversion to SVG</para></caption></mediaobject></figure> The black/white cutoff, represented by the -k parameter on the command, is
        considerably different for a papyrus image like the one above than for paper.</para><para>Potrace produces an SVG document with paths that look like <programlisting xml:space="preserve">
&lt;path d="M5471 5212 c-1 -13 -5 -20 -11 -17 -5 3 -10 1 -10 -5 0
-14 -25 -40 -39 -40 -6 0 -11 -7 -11 -16 0 -12 -4 -14 -12 -7 -16
13 -67 22 -72 14 -2 -3 -67 -6 -144 -6 l-139 0 -12 -44 c-17 -62 1
-78 37 -34 l27 33 93 -4 c51 -2 95 -6 98 -9 11 -11 6 -91 -6 -95
-6 -2 -27 -17 -45 -33 -18 -16 -37 -29 -43 -29 -12 0 -24 -29 -16
-41 4 -8 58 21 74 41 11 14 100 60 114 60 9 0 16 8 16 20 0 11 -7
20 -15 20 -8 0 -15 6 -15 14 0 15 30 44 30 30 0 -5 4 -2 10 6 5 8
12 11 15 8 4 -4 3 -13 -1 -20 -5 -7 -10 -35 -11 -63 -1 -27 -6 -51
-11 -53 -11 -3 -3 -52 9 -52 5 0 9 7 9 15 0 8 4 15 9 15 4 0 6 -12
4 -27 l-4 -28 16 28 c18 32 23 32 54 11 19 -14 25 -14 31 -3 10
15 4 29 -13 29 -7 0 -21 9 -31 21 -16 19 -16 27 -5 71 6 28 20 59
31 70 30 30 40 65 23 82 -9 8 -19 25 -24 38 l-10 23 0 -23z m-146
-149 c-8 -32 -13 -39 -21 -30 -10 9 5 57 17 57 7 0 9 -10 4 -27z"/&gt;</programlisting> They
        use SVG’s moveto (m), cubic Bézier curveto (c) and lineto (l) commands in relative mode:
        the initial moveto command (M) fixes the start point of the path in absolute coordinates,
        and the coordinates of each subsequent command are relative to the current point, that is,
        the end point of the preceding command. This mode is a convenient notation for a
        program creating the paths, but it is less convenient for working with the paths, so a
        conversion to the absolute notation is necessary. In addition, the SVG output by potrace is
        marked as version 1.0 (although it is compatible with the current standard, 1.1) and the
        paths need to have ids assigned to them so that they can be referred to
      later.</para></section><section><title>SVG Cleanup</title><para>The SVG editor Inkscape (which in fact uses potrace internally to trace images) may be
        used from the command line to output a version of the SVG with the relative path notation
        converted to absolute. If invoked with the -l parameter, Inkscape will output a ‘plain’ SVG
        file without the additional namespaces the program typically adds. For example:
        <programlisting xml:space="preserve">inkscape -l P.Mich.inv.\ 3088.svg P.Mich.inv.\ 3088.svg</programlisting>
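        The effect of this conversion can be illustrated with a short Python sketch. It is not
        how Inkscape performs the conversion; it is a minimal illustration that assumes the path
        uses only the commands seen in the potrace output above (M/m, C/c, L/l, Z/z) and plain
        decimal numbers. The names to_absolute and fmt are hypothetical.
        <programlisting xml:space="preserve">
```python
import re

# Numbers as they appear in the potrace output above (no exponents).
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
# A command letter followed by everything up to the next command letter.
SEGMENT = re.compile(r"([MmCcLlZz])([^MmCcLlZz]*)")
# How many numbers each command consumes per repetition.
ARG_COUNT = {"m": 2, "l": 2, "c": 6}


def fmt(value):
    """Format a coordinate without trailing zeros."""
    return f"{value:.6f}".rstrip("0").rstrip(".")


def to_absolute(d):
    """Rewrite a path's d attribute using absolute M, L, C and Z commands."""
    cur_x = cur_y = 0.0      # current point
    start_x = start_y = 0.0  # start of the current subpath (target of Z)
    out = []
    for cmd, arg_text in SEGMENT.findall(d):
        kind = cmd.lower()
        if kind == "z":
            out.append("Z")
            cur_x, cur_y = start_x, start_y
            continue
        nums = [float(n) for n in NUMBER.findall(arg_text)]
        size = ARG_COUNT[kind]
        for k in range(0, len(nums), size):
            points = []
            for j in range(k, k + size, 2):
                x, y = nums[j], nums[j + 1]
                if cmd.islower():
                    # Every pair in one command is relative to the current
                    # point as it was before that command.
                    x, y = x + cur_x, y + cur_y
                points.append((x, y))
            cur_x, cur_y = points[-1]
            letter = kind.upper()
            if kind == "m":
                if k == 0:
                    start_x, start_y = cur_x, cur_y
                else:
                    letter = "L"  # extra pairs after a moveto are linetos
            coords = " ".join(f"{fmt(x)},{fmt(y)}" for x, y in points)
            out.append(f"{letter} {coords}")
    return " ".join(out)


print(to_absolute("M10 10 l5 0 c0 5 5 5 5 10 z m1 1 l2 2"))
# M 10,10 L 15,10 C 15,15 20,15 20,20 Z M 11,11 L 13,13
```
</programlisting>
        In this example the relative lineto "l5 0" becomes "L 15,10" because it is offset from the
        current point (10,10) rather than from the origin, and the moveto after "z" is offset from
        the start of the subpath that was just closed.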
        XSLT was used to insert id numbers and do additional small pieces of cleanup, including the
        removal of a duplicate SVG namespace and setting the version number to 1.1. After
        processing, an example path looks like <programlisting xml:space="preserve">
&lt;path d="M 88.96875,276.20312 C 88.96875,276.89375 86.36875,280.42813 85.271875,281.2
C 84.825,281.48438 84.5,281.93125 84.5,282.175 C 84.5,282.37813 84.0125,282.94687
83.44375,283.39375 C 82.753125,283.9625 82.55,284.36875 82.834375,284.65312
C 83.321875,285.14062 86.978125,283.51562 90.065625,281.40312 L 92.178125,279.98125
L 93.4375,280.875 C 95.021875,282.0125 97.90625,282.45938 97.90625,281.60625
C 97.90625,281.28125 97.25625,280.79375 96.403125,280.55 C 95.55,280.30625
94.371875,279.65625 93.7625,279.16875 C 89.171875,275.30937 88.96875,275.1875
88.96875,276.20312 z M 90.390625,279.33125 C 90.146875,279.98125 87.34375,281.6875
86.571875,281.6875 C 86.2875,281.6875 86.9375,280.875 88.034375,279.85938
C 89.821875,278.15312 90.91875,277.90938 90.390625,279.33125 z" id="path13332"/&gt;
</programlisting>
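        One practical benefit of the absolute notation is that a bounding rectangle for a path,
        of the kind used for line detection in the next section, can be read almost directly from
        the coordinates. The following Python sketch is an illustration under stated assumptions
        rather than the script used in the experiment: it assumes the path contains only absolute
        M, C, L and z commands, as in the example above, so that every number belongs to an x,y
        pair. It also uses the Bézier control points as they stand, which yields a rectangle that
        encloses the curve but may be somewhat larger than the tightest possible one. The name
        bounding_box is hypothetical.
        <programlisting xml:space="preserve">
```python
import re

# Plain decimal numbers, as in the absolute path example above.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def bounding_box(d):
    """Return (min_x, min_y, max_x, max_y) for an absolute-notation path.

    Assumes only M, C, L and z commands, so the numbers alternate x, y.
    Control points are included, so the box may overestimate the curve.
    """
    values = [float(n) for n in NUMBER.findall(d)]
    xs = values[0::2]  # every first number of a pair
    ys = values[1::2]  # every second number of a pair
    return min(xs), min(ys), max(xs), max(ys)


print(bounding_box("M 1,2 C 3,8 5,0 7,4 L 2,9 z"))
# (1.0, 0.0, 7.0, 9.0)
```
</programlisting>
        The same function would not work on the relative notation, where each number is an offset
        whose meaning depends on all of the commands before it.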
      </para></section><section><title>SVG Analysis</title><para>At this point, we have an SVG document that is ready for analysis. The initial
        experiment uses a Python script that simply attempts to detect lines in the image and to
        organize the paths within those lines into groups within the document. The paths are sorted
        left to right and top to bottom and then merged using a simple algorithm. The process starts
        with the bounding rectangle of the leftmost, topmost path and looks at the next path’s
        bounding rectangle. If they overlap top to bottom by more than 45%, then the two are merged
        into a group. This continues until no more overlapping rectangles can be found, and the
        remaining paths that have not been assigned to a group are passed to the function again. The
        process repeats until all paths have been assigned to a group. When analysis is complete,
        the Python script writes the results out to disk in two formats. Initially, the script
        produced an SVG file with grouped paths and with bounding rectangles inserted to make the
        boundaries of the line groups visible. <figure><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Cayless01/Cayless01-003.jpg" width="100%"/></imageobject><caption><para>SVG with the original image embedded, after line detection.</para></caption></mediaobject></figure> Subsequently a Javascript serialization was added to support the browser-based
        display described below.</para></section><section><title>Display</title><para>I noted above that the proof-of-concept work on tracing inscriptions by Elliott and
        Gillies used an Open Source map display library called OpenLayers as the basis for its
        display and annotation capabilities. OpenLayers allows the insertion of a single image as a
        base layer (though it supports tiled images as well), so it is quite simple to insert a page
        image into it. <figure><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Cayless01/Cayless01-004.jpg" width="100%"/></imageobject><caption><para>OpenLayers with embedded image and transcription</para></caption></mediaobject></figure> OpenLayers also supports simple vector structures, such as points, lines,
        polylines, and polygons. It is therefore possible to represent the line-containing
        rectangles generated by the Python script as a vector layer on top of a JPEG version of the
        page image. <figure><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Cayless01/Cayless01-005.jpg" width="100%"/></imageobject><caption><para>The layers in the image viewer.</para></caption></mediaobject></figure> In order to display the paths traced over the writing itself, however, some
        additional work had to be done. </para><para>The experimental system discussed here adds several functions to the OpenLayers library
        in order to support paths and groups of paths. OpenLayers represents vectors in the browser
        using either SVG or VML, depending on the browser’s capabilities. This test only attempted
        to display the traced text in Firefox and Safari, both of which have SVG support, so only
        the SVG serialization code was modified. In theory, the VML generation code should support
        similar functionality, but this has not been attempted. OpenLayers provides a very useful
        platform for display because it has built in functionality like zoom and pan, as well as the
        ability to turn layers on and off and to add event handlers for structures it draws. This
        enables, for example, highlighting to be activated when the mouse hovers over a line.</para><para>After the functions to store path data and serialize it as SVG were added to OpenLayers,
        the Python script was modified to output instructions to OpenLayers in Javascript that draw
        the paths and bounding rectangles as separate layers on top of the image. OpenLayers can, as
        described above, enable the addition of event handlers to polygon features, so in order to
        demonstrate the ability to link the grouped vectors to lines in a transcription, I matched
        path group ids to their corresponding line numbers using a pair of JSON objects, for example:
        <programlisting xml:space="preserve">
var img2xml = {
grpath168: ["ln0", "ln1"],
grpath282: ["ln2"],
...
grpath3426: ["ln26"],
grpath3626: ["ln27"]
};
</programlisting>
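        The listing shows only the direction from path groups to transcription lines. The
        opposite direction need not be maintained by hand, since it can be derived from the first
        object. The following is a minimal sketch in Python (the display code itself is
        Javascript); the function name invert_mapping and the variable xml2img are hypothetical,
        and the sample data is a subset of the listing above.
        <programlisting xml:space="preserve">
```python
def invert_mapping(img2xml):
    """Build a line-id to path-group-ids lookup from the group-to-lines one."""
    xml2img = {}
    for group_id, line_ids in img2xml.items():
        for line_id in line_ids:
            # A line may in principle be covered by more than one group,
            # so each line id maps to a list of group ids.
            xml2img.setdefault(line_id, []).append(group_id)
    return xml2img


img2xml = {
    "grpath168": ["ln0", "ln1"],
    "grpath282": ["ln2"],
}
print(invert_mapping(img2xml))
# {'ln0': ['grpath168'], 'ln1': ['grpath168'], 'ln2': ['grpath282']}
```
</programlisting>
        Note that group grpath168 spans two lines of the transcription, so both ln0 and ln1 point
        back to the same group.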
      </para><para>The event handler for a mouseover on one of the rectangles bounding a line of text calls
        a function that changes the color of the paths within the line and the stroke of the
        bounding rectangle itself. <figure><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Cayless01/Cayless01-006.jpg" width="100%"/></imageobject><caption><para>Hovering over a line in the image.</para></caption></mediaobject></figure> It also changes the font-weight of the corresponding line of text to bold. An
        inverted data structure points from lines of text to paths; with a mouseover handler
        bound to each line of the transcription, it enables the corresponding paths to be highlighted when a user hovers over a line of
        text. <figure><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol1/graphics/Cayless01/Cayless01-007.jpg" width="100%"/></imageobject><caption><para>Hovering over a line of text.</para></caption></mediaobject></figure></para></section><section><title>Conclusions</title><para>The experiments outlined above prove that it is feasible to go from a page image with a
        TEI-based transcription to an online display in which the image can be panned and zoomed,
        and the text on the page can be linked to the transcription (and vice-versa). The steps in
        the process that have not yet been fully automated are the selection of a black/white cutoff
        for the page image, the decision of what percentage of vertical overlap to use in
        recognizing that two paths are members of the same line, and the need for line beginning
        (&lt;lb/&gt;) tags to be inserted into the TEI transcription (if it does not already
        contain them). The tools employed to produce the SVG tracing and the interactive display are
        all stable and well-supported (although the path support added to OpenLayers needs
        additional work). It seems clear that additional testing and the attempt to produce a
        working implementation will be well worth the effort.</para></section></section><section><title>Where to go from here</title><para>One question posed by the apparent success of the method is what should link to what. What
      structures in the vector graphics document can be detected (beyond lines) and how should they
      be linked to the transcriptions? There are some very thorny concurrency issues here, since ink
      from one letter may touch another, and thus form a single path consisting of multiple letters,
      making it impossible to isolate letters, or even words if the letters belong to different
      words. A descender on the letter ‘f’ might touch a letter in the line below, making it
      impossible to easily identify the two lines as separate. These difficulties mean that linking
      word for word or letter for letter between documents is not necessarily possible. The streams
      are not parallel. Of course, vector paths can be sliced, and the image and text streams
      therefore could be made parallel, but this kind of operation will almost certainly require a
      human being with an SVG editor such as Inkscape. A second, related issue is that text
      transcriptions in XML may well define document structure in a semantic, rather than a physical
      way. Line, word, and letter segments can be marked in TEI, but they frequently are not. The
      DocSouth example used as a test case here does not have line breaks marked, for example.</para><para>The mechanism used for linking bears further thought and study. The current implementation
      hand waves over the problem, simply mapping an id in one document to one or more ids in the
      other and vice versa using JSON. It would be much better to develop a standard for this kind
      of linking, since there is no guarantee that the id from one document would easily be
      available to the other. TEI P5 envisions the alignment of different document streams of this
      type using a link group <footnote xml:id="cay04"><para>See
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACSXA</link>.</para></footnote>. Using TEI for this is a possible solution, but it does involve changing the TEI
      document, which may not be desirable. As the P5 standard remarks: “If it is not feasible to
      add more markup to the original text, some form of stand-off markup will be needed.” Stand-off
      markup seems a better solution in the abstract, but the best way to implement it is not
      immediately clear.</para><para>The proof-of-concept system illustrated above attempts to detect lines only, and that in a
      very simple way, by looking along the x-axis for paths whose bounding rectangles overlap vertically. Probabilistic methods
      may well prove the best way to determine whether any given path belongs to the same group as
      another path, or whether a previously constructed group really holds together. The algorithms
      for structure detection therefore need a great deal of refinement and it is not yet clear how
      deep it is possible to go in detecting structure within the SVG image automatically. How much
      human intervention can be asked for, provided, and enabled within this framework is an
      important question too. OpenLayers provides some limited vector editing capabilities, but how
      reasonable is it to ask a user to manually split, for example, two lines that have mistakenly
      been combined into a single group?</para><para>The prospects for further development of this idea seem rich. I hope to proceed by further
      developing and refining the structure detection routines, by refining the display capabilities
      of the web interface and improving and standardizing the linking mechanisms. I plan to seek
      grant funding to work on this in the context of one or more of UNC Library’s digitized
      manuscript collections later this year.</para></section><bibliography><title>References</title><bibliomixed xml:id="cayb07" xreflabel="APIS1769">Advanced Papyrological Information System, "APIS record:
      michigan.apis.1769,"
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://wwwapp.cc.columbia.edu/ldpd/app/apis/search?mode=search&amp;institution=michigan&amp;pubnum_coll=P.Mich.&amp;pubnum_vol=1&amp;pubnum_page=78&amp;sort=date&amp;resPerPage=25&amp;action=search&amp;p=1</link></bibliomixed><bibliomixed xml:id="cayb08" xreflabel="DSTrue01">Documenting the American South, "Letter from John and Ebenezer
      Pettigrew to Charles Pettigrew,"
      <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://docsouth.unc.edu/true/mss01-01/mss01-01.html</link>.</bibliomixed><bibliomixed xml:id="cayb06" xreflabel="PMich1.78">Duke Databank of Documentary Papyri, "P.Mich.: Michigan Papyri
      1.78,"
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Atext%3A1999.05.0163;layout=;query=1%3A78;loc=78</link>.</bibliomixed><bibliomixed xml:id="cayb01" xreflabel="Gillies2008">Gillies, Sean, "Digitizing Ancient Inscriptions with OpenLayers,"
      (blog post)
      <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://sgillies.net/blog/691/digitizing-ancient-inscriptions-with-openlayers</link>,
      February 21, 2008.</bibliomixed><bibliomixed xml:id="cayb04" xreflabel="OL2.6">"OpenLayers, v. 2.6," <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://openlayers.org</link>. Release
      notes for v. 2.6 at <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://trac.openlayers.org/wiki/Release/2.6/Notes</link>.</bibliomixed><bibliomixed xml:id="cayb02" xreflabel="Selinger1.8">Selinger, Peter, "Potrace, v. 1.8,"
      <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://potrace.sourceforge.net/</link>.</bibliomixed><bibliomixed xml:id="cayb03" xreflabel="Selinger2003">Selinger, Peter, "Potrace: a polygon-based tracing algorithm,"
      <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://potrace.sourceforge.net/potrace.pdf</link>, September 20, 2003.</bibliomixed><bibliomixed xml:id="cayb05" xreflabel="Terras2008">Terras, Melissa, "Palaeographic Image Markup Tools," (blog post)
        <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.stoa.org/?p=776</link>, February 19, 2008.</bibliomixed></bibliography></article>
