Linking Page Images to Transcriptions with SVG

Balisage: The Markup Conference 2008, August 12 - 15, 2008

This paper will present the results of ongoing experimentation with the linking of manuscript images to TEI transcriptions. The method being tested involves the automated conversion of images containing text to SVG, using Open Source tools. Once the text has been converted to SVG paths, these can be grouped in the document to mark the words therein, and these groups can then be linked using standard methods to tokenized versions of the transcriptions. The goal of these experiments is to achieve a much more fine-grained linking and annotation mechanism than is so far possible with available tools, e.g. the Image Markup Tool and TEI P5 facsimile markup, both of which annotate only rectangular sections of an image. The method envisioned here would produce a legible tracing of the word, expressed in XML, to which transcripts and annotations might be attached and which can be superimposed upon the original image.

Hugh A. Cayless
Head of R&D
CDLA, UNC Chapel Hill Library
philomousos@gmail.com

Hugh is head of the Research & Development group at the Carolina Digital Library and Archives, UNC Chapel Hill. He holds a Ph.D. in Classics and an M.S. in Information Science from UNC. His research interests include the application of computational techniques to problems in Classics (and the Humanities in general), and Digital Curation.

Copyright © 2008 Hugh A. Cayless. Used by permission.

Keywords: image and text linking, SVG, Open Source

Introduction

Mass digitization has become a fact of life at most major University Libraries in recent years. In the case of printed books, initiatives like Google’s and the Internet Archive’s produce page images with attached metadata. These are OCR’ed and processed to make them searchable and readable via a web browser and (in the case of the IA) in a variety of other formats, such as PDF. These processes do not, however, accommodate documents for which OCR is currently impossible. Handwritten documents can be, and regularly are, digitized, but human beings must produce transcriptions for them (if this is done at all), and the transcriptions are typically linked to the images only at the level of the page. This project began as an attempt to see what could be done in an automated or semi-automated way to allow linkage between transcription and image at a deeper level, and also annotation of the image at the level of the text on the page. Transcription of the text in a page image can be considered a special case of annotation.

UNC Chapel Hill is contemplating a very large scale manuscript digitization project, potentially covering the entire Southern Historical Collection. The Carolina Digital Library and Archives (CDLA) also has an ongoing, smaller manuscript digitization project funded by the Watson-Brown Foundation, and several completed projects published under its Documenting the American South program, which deal with manuscript images and transcriptions in TEI.

There are a number of existing tools that provide for user-controlled image annotation. This is typically accomplished by providing the user with drawing tools with which they may draw shape overlays on the image. These overlays can in turn be linked to text annotations entered by the user. This is the way image annotation works on Flickr, for example, and also in the Image Markup Tool (IMT). The TEI P5 facsimile markup also conceives of text-image linking in this fashion.
Drawing rectangular overlays on top of an image is a good compromise between ease-of-use and utility, and rectangles fit well with most types of written text. It does raise the questions of whether it is possible to go deeper, however, and of what to do with lines of text that cannot be captured by rectangles. I noted with interest the proof-of-concept work my colleagues Sean Gillies and Tom Elliott did earlier this year using the OpenLayers Javascript library as a means of tracing text on a sample inscription (see http://atlantides.org/inscriptol/). The vector drawings overlaid on the image are serialized as SVG, which can then be saved and used as a linking mechanism. Inscriptol was the inspiration for the work presented in this paper. Some further discussion about tools for text and image linking took place on the Stoa site at the same time (see http://www.stoa.org/?p=776).

The starting point for the experiments described here is a tool named potrace (http://potrace.sourceforge.net/), which traces raster images and converts them to vector graphics. Potrace will convert a bitmap to SVG, among other formats. It is licensed under the GPL. Tests with the tool on manuscript pages were promising, so I decided to see whether a toolchain could be constructed, using only Free, Open Source software, that would start with a manuscript image in a standard format such as JPEG and take it into an environment where the image could be linked to a transcription.

The Toolchain

The goal of this experiment is to see whether it is possible to go from a page image and a TEI-based transcription to a linked presentation of the two, using only Free, Open Source tools. In addition, the experiment is intended to evaluate the extent to which this process might be automated and, conversely, where and how much human intervention will be required in the process.

Image preparation

The process of tracing a raster image to produce a vector analog requires a bitmap format as input. The source images in DocSouth are most likely to be either TIFF or JPEG, so they must be converted to a format usable by potrace. The convert utility that comes with the open source image processing library ImageMagick (http://www.imagemagick.org) performs this function with ease using a command such as:

    convert mss01-01-p01.jpg mss01-01-p01.pnm

[Figure 1: Papyrus (P. Mich. 1.78)]

ImageMagick also supports a wide variety of additional image manipulations, and is likely to prove useful in other kinds of image preprocessing. As we work on refining the techniques outlined here, it is likely that operations such as image sharpening and increasing the contrast will be added to the preprocessing pipeline in order to produce sources that potrace can do a better job of converting.

Conversion to SVG

Potrace handles the conversion from bitmap to SVG. As part of the process, it collapses the image’s color space to 1 bit (black/white) and then creates vector paths tracing the black shapes in the image. Setting the cutoff at which it determines whether a pixel becomes black or white is one of the main steps at which human intervention is presently required. In experimenting on a number of images, I was able to obtain good results after a period of trial and error. The image in figure 1 was converted to the image in figure 2 using the following command:

    potrace -s -t 4 -k 0.3 -G 2.5 P.Mich.inv.\ 3088.pnm
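To make the role of the black/white cutoff concrete, here is a minimal sketch of what collapsing a grayscale image to 1 bit involves. It is an illustration only, not potrace's own code: it assumes brightness values scaled from 0 (black) to 1 (white) and assumes that pixels darker than the cutoff become black. The function names and the tiny sample image are hypothetical.

```python
def to_bitmap(gray, cutoff=0.5):
    """Collapse a grayscale image (rows of brightness values, 0 = black,
    1 = white) to 1 bit: 1 marks a black pixel, 0 a white one.
    Assumption: a pixel turns black when its brightness is below the cutoff."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def render(bitmap):
    """Draw the bitmap as text, '#' for black and '.' for white."""
    return "\n".join("".join("#" if bit else "." for bit in row) for row in bitmap)


# A vertical stroke whose ends are fainter than its middle (made-up values).
faint_stroke = [
    [0.9, 0.4, 0.9],
    [0.9, 0.2, 0.9],
    [0.9, 0.4, 0.9],
]

for cutoff in (0.3, 0.5):
    print("cutoff", cutoff)
    print(render(to_bitmap(faint_stroke, cutoff)))
```

With the lower cutoff only the darkest pixel survives, while the higher cutoff keeps the whole stroke. This is why a value that suits dark ink on paper may lose faint ink, or pick up background texture, on a different writing surface.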

[Figure 2: Figure 1 after conversion to SVG]

The black/white cutoff, represented by the -k parameter on the command, is considerably different for a papyrus image like the one above than for paper. Potrace produces an SVG document with paths that look like

    <path d="M5471 5212 c-1 -13 -5 -20 -11 -17 -5 3 -10 1 -10 -5 0 -14 -25 -40 -39 -40 -6 0 -11 -7 -11 -16 0 -12 -4 -14 -12 -7 -16 13 -67 22 -72 14 -2 -3 -67 -6 -144 -6 l-139 0 -12 -44 c-17 -62 1 -78 37 -34 l27 33 93 -4 c51 -2 95 -6 98 -9 11 -11 6 -91 -6 -95 -6 -2 -27 -17 -45 -33 -18 -16 -37 -29 -43 -29 -12 0 -24 -29 -16 -41 4 -8 58 21 74 41 11 14 100 60 114 60 9 0 16 8 16 20 0 11 -7 20 -15 20 -8 0 -15 6 -15 14 0 15 30 44 30 30 0 -5 4 -2 10 6 5 8 12 11 15 8 4 -4 3 -13 -1 -20 -5 -7 -10 -35 -11 -63 -1 -27 -6 -51 -11 -53 -11 -3 -3 -52 9 -52 5 0 9 7 9 15 0 8 4 15 9 15 4 0 6 -12 4 -27 l-4 -28 16 28 c18 32 23 32 54 11 19 -14 25 -14 31 -3 10 15 4 29 -13 29 -7 0 -21 9 -31 21 -16 19 -16 27 -5 71 6 28 20 59 31 70 30 30 40 65 23 82 -9 8 -19 25 -24 38 l-10 23 0 -23z m-146 -149 c-8 -32 -13 -39 -21 -30 -10 9 5 57 17 57 7 0 9 -10 4 -27z"/>

These paths use SVG’s moveto (m), cubic Bézier curveto (c) and lineto (l) commands in relative mode: the first moveto command determines the start point of the path, and the coordinates of each subsequent segment are then given relative to the current point, that is, the point at which the previous segment ended. This mode is obviously a convenient notation for a program creating the paths, but it is less convenient for working with the paths, so a conversion to the absolute notation is necessary. In addition, the SVG output by potrace is marked as version 1.0 (although it is compatible with the current standard, 1.1), and the paths need to have ids assigned to them so that they can be referred to later.
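The relative-to-absolute conversion can be sketched in a few lines of Python. This is a minimal illustration of what the conversion involves, not the tool used in the experiment. It handles only the moveto, lineto, curveto and closepath commands that appear in the sample above, and the names `to_absolute`, `TOKEN` and `ARGS` are hypothetical.

```python
import re

# A token is either a command letter or a number.
TOKEN = re.compile(r"[MmLlCcZz]|[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")
ARGS = {"M": 2, "L": 2, "C": 6}  # numbers consumed per segment


def fmt(value):
    """Format a coordinate without trailing zeros."""
    return ("%.6f" % value).rstrip("0").rstrip(".")


def to_absolute(d):
    """Rewrite path data so that every command uses absolute coordinates."""
    if re.search(r"[^MmLlCcZz0-9eE\s,.+-]", d):
        raise ValueError("only M, L, C and Z commands are handled")
    tokens = TOKEN.findall(d)
    out, cmd = [], None
    cur = start = (0.0, 0.0)  # current point and start of the current subpath
    i = 0
    while i < len(tokens):
        if tokens[i].isalpha():
            cmd = tokens[i]
            i += 1
            if cmd in "Zz":
                out.append("Z")
                cur, cmd = start, None  # closing returns to the subpath start
                continue
        if cmd is None:
            raise ValueError("coordinates without a command")
        n = ARGS[cmd.upper()]
        nums = [float(t) for t in tokens[i:i + n]]
        if len(nums) < n:
            raise ValueError("incomplete segment")
        i += n
        if cmd.islower():
            # Relative: every x/y pair is offset from the current point.
            nums = [v + cur[k % 2] for k, v in enumerate(nums)]
        cur = (nums[-2], nums[-1])
        out.append(cmd.upper() + " " + " ".join(fmt(v) for v in nums))
        if cmd in "Mm":
            start = cur
            # Further pairs after a moveto are implicit lineto segments.
            cmd = "l" if cmd == "m" else "L"
    return " ".join(out)


print(to_absolute("M10 10 l5 0 c1 1 2 2 3 3 z m1 1 l2 2"))
```

Note that a single `c` may be followed by several groups of six numbers, as in the potrace output; each group is offset from the end point of the group before it, which is why the current point is updated after every segment rather than after every command letter.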

SVG Cleanup

The SVG editor Inkscape (which in fact uses potrace internally to trace images) may be used from the command line to output a version of the SVG with the relative path notation converted to absolute. If invoked with the -l parameter, Inkscape will output a ‘plain’ SVG file without the additional namespaces the program typically adds. For example:

    inkscape -l P.Mich.inv.\ 3088.svg P.Mich.inv.\ 3088.svg

XSLT was used to insert id numbers and do additional small pieces of cleanup, including the removal of a duplicate SVG namespace and setting the version number to 1.1. After processing, an example path looks like

    <path d="M 88.96875,276.20312 C 88.96875,276.89375 86.36875,280.42813 85.271875,281.2 C 84.825,281.48438 84.5,281.93125 84.5,282.175 C 84.5,282.37813 84.0125,282.94687 83.44375,283.39375 C 82.753125,283.9625 82.55,284.36875 82.834375,284.65312 C 83.321875,285.14062 86.978125,283.51562 90.065625,281.40312 L 92.178125,279.98125 L 93.4375,280.875 C 95.021875,282.0125 97.90625,282.45938 97.90625,281.60625 C 97.90625,281.28125 97.25625,280.79375 96.403125,280.55 C 95.55,280.30625 94.371875,279.65625 93.7625,279.16875 C 89.171875,275.30937 88.96875,275.1875 88.96875,276.20312 z M 90.390625,279.33125 C 90.146875,279.98125 87.34375,281.6875 86.571875,281.6875 C 86.2875,281.6875 86.9375,280.875 88.034375,279.85938 C 89.821875,278.15312 90.91875,277.90938 90.390625,279.33125 z" id="path13332"/>
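The experiment used XSLT for this step. Purely as an illustration of what the cleanup does, the same two changes (assigning ids to paths and setting the version to 1.1) can be sketched in Python with the standard library's ElementTree. The function name `add_path_ids` and the sequential numbering scheme are assumptions, not the author's stylesheet.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def add_path_ids(svg_text, prefix="path"):
    """Return the SVG with version 1.1 and an id on every path element.

    Paths that already carry an id are left alone; the rest are numbered
    in document order (a hypothetical scheme chosen for this sketch).
    """
    # Serialize the SVG namespace as the default one, without a prefix.
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_text)
    root.set("version", "1.1")
    for n, path in enumerate(root.iter("{%s}path" % SVG_NS), start=1):
        if path.get("id") is None:
            path.set("id", "%s%d" % (prefix, n))
    return ET.tostring(root, encoding="unicode")


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" version="1.0">'
    '<g><path d="M 0,0 L 1,1"/><path d="M 2,2 L 3,3"/></g>'
    "</svg>"
)
print(add_path_ids(sample))
```

Stable ids matter here because everything downstream, from line grouping to the links into the transcription, refers to paths by id.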

SVG Analysis

At this point, we have an SVG document that is ready for analysis. The initial experiment uses a Python script that simply attempts to detect lines in the image and to organize the paths within those lines into groups within the document. The paths are sorted left to right and top to bottom and then merged using a simple algorithm. The process starts with the bounding rectangle of the leftmost, topmost path and looks at the next path’s bounding rectangle. If the two rectangles overlap vertically by more than 45%, they are merged into a group. This continues until no more overlapping rectangles can be found, and the remaining paths that have not been assigned to a group are passed to the function again. The process repeats until all paths have been assigned to a group.

When analysis is complete, the Python script writes the results out to disk in two formats. Initially, the script produced an SVG file with grouped paths and with bounding rectangles inserted to make the boundaries of the line groups visible.

[Figure: SVG with the original image embedded, after line detection.]

Subsequently a Javascript serialization was added to support the browser-based display described below.
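The line-detection pass described in this section can be sketched as follows. This is a reconstruction from the prose, not the script used in the experiment, and it rests on two assumptions the text does not settle: overlap is measured as a fraction of the shorter rectangle's height, and a group's rectangle grows as paths are merged into it. Bounding boxes are (x0, y0, x1, y1) tuples with y increasing downward, and all names are hypothetical.

```python
def vertical_overlap(a, b):
    """Fraction of the shorter box's height that boxes a and b share."""
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    if shared <= 0 or shorter <= 0:
        return 0.0
    return shared / shorter


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_lines(boxes, threshold=0.45):
    """Group path ids into lines; boxes maps path id -> bounding box."""
    # Sort left to right, then top to bottom.
    remaining = sorted(boxes.items(), key=lambda item: (item[1][0], item[1][1]))
    groups = []
    while remaining:
        (seed_id, line_box), rest = remaining[0], remaining[1:]
        members, leftover = [seed_id], []
        for path_id, box in rest:
            if vertical_overlap(line_box, box) > threshold:
                members.append(path_id)
                line_box = union(line_box, box)
            else:
                leftover.append((path_id, box))
        groups.append(members)
        remaining = leftover  # unassigned paths go through the pass again
    return groups


boxes = {
    "a": (0, 0, 10, 10),
    "b": (12, 1, 20, 11),
    "c": (0, 20, 10, 30),
    "d": (15, 22, 25, 31),
}
print(group_lines(boxes))
```

In the sample, paths a and b share most of their vertical extent and form one line, while c and d form a second. A descender that reaches into the next line would enlarge the group rectangle under the second assumption, which is one way such a simple pass can wrongly merge two lines.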

Display

I noted above that the proof-of-concept work on tracing inscriptions by Elliott and Gillies used an Open Source map display library called OpenLayers as the basis for its display and annotation capabilities. OpenLayers allows the insertion of a single image as a base layer (though it supports tiled images as well), so it is quite simple to insert a page image into it.

[Figure: OpenLayers with embedded image and transcription]

OpenLayers also supports simple vector structures, such as points, lines, polylines, and polygons. It is therefore possible to represent the line-containing rectangles generated by the Python script as a vector layer on top of a JPEG version of the page image.

[Figure: The layers in the image viewer.]

In order to display the paths traced over the writing itself, however, some additional work had to be done. The experimental system discussed here adds several functions to the OpenLayers library in order to support paths and groups of paths. OpenLayers represents vectors in the browser using either SVG or VML, depending on the browser’s capabilities. This test only attempted to display the traced text in Firefox and Safari, both of which have SVG support, so only the SVG serialization code was modified. In theory, the VML generation code should support similar functionality, but this has not been attempted.

OpenLayers provides a very useful platform for display because it has built-in functionality like zoom and pan, as well as the ability to turn layers on and off and to add event handlers for structures it draws. This enables, for example, highlighting to be activated when the mouse hovers over a line. After the functions to store path data and serialize it as SVG were added to OpenLayers, the Python script was modified to output instructions to OpenLayers in Javascript that draw the paths and bounding rectangles as separate layers on top of the image.

OpenLayers can, as described above, enable the addition of event handlers to polygon features, so in order to demonstrate the ability to link the grouped vectors to lines in a transcription, I matched path group ids to their corresponding line numbers using a pair of JSON objects, e.g.

    var img2xml = { grpath168: ["ln0", "ln1"], grpath282: ["ln2"], ... grpath3426: ["ln26"], grpath3626: ["ln27"] };

The event handler for a mouseover on one of the rectangles bounding a line of text calls a function that changes the color of the paths within the line and the stroke of the bounding rectangle itself.

[Figure: Hovering over a line in the image.]

It also changes the font-weight of the corresponding line of text to bold. An inverted data structure points from lines of text to path groups. When a mouseover handler is bound to a line of text, this structure enables the corresponding paths to be highlighted when a user hovers over that line.

[Figure: Hovering over a line of text.]
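The pair of lookup objects described in this section can be derived from one another. The following sketch shows one way a Python script might build the inverted mapping and write both out as Javascript. The group and line ids are taken from the example in the text, while the function names and the variable name `xml2img` are hypothetical.

```python
import json


def invert(img2xml):
    """Turn {path group id: [line ids]} into {line id: [path group ids]}."""
    xml2img = {}
    for group_id, line_ids in img2xml.items():
        for line_id in line_ids:
            xml2img.setdefault(line_id, []).append(group_id)
    return xml2img


def to_javascript(name, mapping):
    """Serialize a mapping as a Javascript variable declaration."""
    return "var %s = %s;" % (name, json.dumps(mapping))


img2xml = {"grpath168": ["ln0", "ln1"], "grpath282": ["ln2"]}
xml2img = invert(img2xml)

print(to_javascript("img2xml", img2xml))
print(to_javascript("xml2img", xml2img))
```

Keeping lists on both sides allows one path group to cover several transcription lines, as grpath168 does here, and equally allows one line of text to map to several path groups.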

Conclusions

The experiments outlined above show that it is feasible to go from a page image with a TEI-based transcription to an online display in which the image can be panned and zoomed, and the text on the page can be linked to the transcription (and vice-versa). The steps in the process that have not yet been fully automated are the selection of a black/white cutoff for the page image, the decision of what percentage of vertical overlap to use in recognizing that two paths are members of the same line, and the need for line beginning (<lb/>) tags to be inserted into the TEI transcription (if it does not already contain them). The tools employed to produce the SVG tracing and the interactive display are all stable and well-supported (although the path support added to OpenLayers needs additional work). It seems clear that additional testing and the attempt to produce a working implementation will be well worth the effort.

Where to go from here

One question posed by the apparent success of the method is what should link to what. What structures in the vector graphics document can be detected (beyond lines), and how should they be linked to the transcriptions? There are some very thorny concurrency issues here, since ink from one letter may touch another and thus form a single path consisting of multiple letters, making it impossible to isolate letters, or even words if the letters belong to different words. A descender on the letter ‘f’ might touch a letter in the line below, making it difficult to identify the two lines as separate. These difficulties mean that linking word for word or letter for letter between documents is not necessarily possible. The streams are not parallel. Of course, vector paths can be sliced, and the image and text streams could therefore be made parallel, but this kind of operation will almost certainly require a human being with an SVG editor such as Inkscape.

A second, related issue is that text transcriptions in XML may well define document structure in a semantic, rather than a physical, way. Line, word, and letter segments can be marked in TEI, but they frequently are not. The DocSouth example used as a test case here does not have line breaks marked, for example.

The mechanism used for linking bears further thought and study. The current implementation hand-waves over the problem, simply mapping an id in one document to one or more ids in the other and vice versa using JSON. It would be much better to develop a standard for this kind of linking, since there is no guarantee that the id from one document would easily be available to the other. TEI P5 envisions the alignment of different document streams of this type using a link group (see http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACSXA). Using TEI for this is a possible solution, but it does involve changing the TEI document, which may not be desirable.
As the P5 standard remarks: “If it is not feasible to add more markup to the original text, some form of stand-off markup will be needed.” Stand-off markup seems a better solution in the abstract, but it isn’t immediately clear what the best way to implement it is.

The proof-of-concept system illustrated above attempts to detect lines only, and that in a very simple way, by looking along the x-axis for overlapping structures. Probabilistic methods may well prove the best way to determine whether any given path belongs to the same group as another path, or whether a previously constructed group really holds together. The algorithms for structure detection therefore need a great deal of refinement, and it is not yet clear how deep it is possible to go in detecting structure within the SVG image automatically.

How much human intervention can be asked for, provided, and enabled within this framework is an important question too. OpenLayers provides some limited vector editing capabilities, but how reasonable is it to ask a user to manually split, for example, two lines that have mistakenly been combined into a single group?

The prospects for further development of this idea seem rich. I hope to proceed by further developing and refining the structure detection routines, by refining the display capabilities of the web interface, and by improving and standardizing the linking mechanisms. I plan to seek grant funding to work on this in the context of one or more of UNC Library’s digitized manuscript collections later this year.

References

Advanced Papyrological Information System, "APIS record: michigan.apis.1769," http://wwwapp.cc.columbia.edu/ldpd/app/apis/search?mode=search&institution=michigan&pubnum_coll=P.Mich.&pubnum_vol=1&pubnum_page=78&sort=date&resPerPage=25&action=search&p=1

Documenting the American South, "Letter from John and Ebenezer Pettigrew to Charles Pettigrew," http://docsouth.unc.edu/true/mss01-01/mss01-01.html.

Duke Databank of Documentary Papyri, "P.Mich.: Michigan Papyri 1.78," http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Atext%3A1999.05.0163;layout=;query=1%3A78;loc=78.

Gillies, Sean, "Digitizing Ancient Inscriptions with OpenLayers," (blog post) http://sgillies.net/blog/691/digitizing-ancient-inscriptions-with-openlayers, February 21, 2008.

"OpenLayers, v. 2.6," http://openlayers.org. Release notes for v. 2.6 at http://trac.openlayers.org/wiki/Release/2.6/Notes.

Selinger, Peter, "Potrace, v. 1.8," http://potrace.sourceforge.net/.

Selinger, Peter, "Potrace: a polygon-based tracing algorithm," http://potrace.sourceforge.net/potrace.pdf, September 20, 2003.

Terras, Melissa, "Palaeographic Image Markup Tools," (blog post) http://www.stoa.org/?p=776, February 19, 2008.