Multiple annotated documents
Markup languages are often defined for structuring the information of a specific text type, such as web pages (HTML), technical articles or books (DocBook), or a set of information items, such as vector graphics (SVG) or protocol information (SOAP). Therefore, their structure is (in limits) determined by a document grammar that allows for specific elements and attributes. In addition, the different XML-based document grammar formalisms allow to a certain degree the combination of elements (and attributes) from different markup languages – usually by means of XML namespaces (Bray et al., 2009). In practice, one host language can include islands of foreign markup (guest languages). There are different examples for the combination of host and guest markup languages (apart from the already mentioned SOAP). A certain XHTML driver (Ishikawa, 2002) allows for the combination of XHTML (as a host language), MathML and SVG (as guest languages), and the Atom Syndication Format (Nottingham and Sayre, 2005) can be used in conjunction with a wide range of extensions (e.g. for Threading, see Snell, 2006, or Activity Streams, see Atkins et al., 2011) while it is also meant to be embedded in parts in the RSS format (Winer, 2009).
Although XML namespaces support the combination of elements derived from different markup languages, they do not change XML's formal model that prohibits overlapping markup. However, standoff markup (instead of inline annotation) may be used to circumvent this problem. The meta markup language XStandoff (Stührenberg and Jettka, 2009) embeds (slightly transformed) islands of guest languages (with respective XML namespaces) in combination with a standardized standoff approach as key feature for the storage of multiple (and possibly overlapping) hierarchies.
Typical problems when dealing with multiple and/or standoff annotations are related
production and processing of instances. Although usually each markup language involved
defined by a document grammar on its own, it can often be cumbersome to validate an
combining elements from a large variety of document grammars (although XStandoff is
validating these instances, adapted XML schema files have to be present for each guest
language). This behaviour can be controlled by means of the document grammar formalism.
example, XML Schema allows different values of its
which may occur on the
any element. The value
lax provided in Figure 1 (taken from XStandoff's
layer element) Fallside and Walsmley, 2004.
In addition, the
namespace attribute may be used to control the allowed
namespaces. While XSD 1.0 allows the values
##other or a list
of namespaces only (including the preserved values
##local, see Thompson et al., 2004), RELAX NG supports the exclusion
of namespaces (by using the
except pattern in combination with
nsName). XSD 1.1 (Gao et al., 2012) introduced the
The production of multiple annotated documents is typically the result of the combination of formerly stand-alone documents (or their parts), such as the inclusion of externally created SVG graphics in an XHTML host document, or the outcome of a mostly automated process (see Stührenberg and Jettka, 2009 for a discussion on the production of XStandoff instances). What is still lacking is an API (Application Programming Interface) that is flexible enough to support the production and processing of multiple annotated instances, even if annotations are referring to the same primary data by means of standoff annotation. We will demonstrate such an API in the reminder of this article.
Creating an extensible API
XML::Loy (Diewald, 2011) is a Perl library, that
provides a simple programming interface for the creation of XML documents with multiple
namespaces. It is based on Mojo::DOM, an HTML/XML DOM parser that is part
of the Mojolicious framework (Riedel, 2008).
The basic methods for the manipulation of the XML Document Object Model provided by
set(). By applying
these methods new nodes can be introduced as children to every node in the document.
add() always appends additional nodes to the document,
appends nodes in case no child of the given type exists. Both methods are invoked
by a chosen
node in the document tree (acting as the parent node of the newly introduced node).
accept the element name as a string parameter, followed by an optional hash reference
containing attributes and a string containing optional textual content of the element.
string can be used to put a comment in front of the element.
In the example presented in Figure 2 a new XML::Loy
document instance is created with a root element
document. Applying the
set() method, a new
title element is introduced as a child of the
root element. The second call of
set() overwrites the content of the
title element. By using the
add() method we insert multiple
paragraph elements without overwriting existing ones. These elements are
defined with both an
id attribute and textual content.
By applying the
to_pretty_xml() method, the result can be printed as XML.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <document> <title>My New Title</title> <paragraph id="p-1">First Paragraph</paragraph> <paragraph id="p-2">Second Paragraph</paragraph> </document>
The strength of this simple approach for document manipulation is the ability to pass these methods to new extension modules that can represent APIs for specific XML namespaces, as both host and guest languages. The example given in Figure 3 is meant to illustrate these capabilities by creating a simple XML::Loy extension for morpheme annotations.
The class inherits all XML creation methods from XML::Loy and thus
all XML traversal methods from Mojo::DOM. When defining the base class,
an optional namespace
http://www.xstandoff.net/morphemes is bound to the
morph prefix, which means, all invocations of
add() from this class will be bound to the
morph namespace. The
morphemes() method appends a
morphemes element bound
to the given namespace as a child of the invoking node.
To implement simple grammar rules to the API the methods can check the invoking context,
example by constraining the introduction of
morpheme elements to
morphemes parent nodes only (see the regular expression check
By using the generic methods
set() provided by
XML::Loy, the class can easily be used for extending an existing
XML::Loy based class (i.e. as a guest language inside another host
language). In the example shown in Figure 6 a simplified HTML
instance is read and instantiated. Elements from the
http://www.xstandoff.net/morphemes namespace are appended using the API
described above (the output is shown in Figure 7).
By extending the XML::Loy base object with the newly created class using
extension() method, all method calls from the extension class are available for namespace aware
traversal and manipulation. In general, using such an extensible API provides at least
functionality usually made available by document grammars (the nesting of elements
example) and adds methods to create and manipulate the respective class of instances.
XStandoff as an example application
XStandoff's predecessor SGF (Sekimo Generic Format) was developed in 2008 (see Stührenberg and Goecke, 2008) as a meta format for storing and analyzing multiple annotated
instances as part of a linguistic corpus. In 2009 the format was generalized and enhanced.
Since then, XStandoff combines standoff notation with the formal model of General
Ordered-Descendant Directed Acyclic Graphs (GODDAG, introduced in Sperberg-McQueen and Huitfeldt, 2004; see Sperberg-McQueen and Huitfeldt, 2008 for a more
recent discussion). The format as such is capable of representing multiple hierarchies
specifically challenging structures such as overlaps, discontinuous elements and virtual
elements. The basic structure of an XStandoff instance consists of the root element
corpusData underneath which the child elements
primaryData (optional in the proposed
release 2.0, see Stührenberg, 2013),
annotation are subsumed. Figure 8 shows an example
In this example, the sentence
The sun shines brighter. is annotated with
two linguistic levels (and respective layers): morphemes and syllables. We cannot
annotation layers in an inline annotation, since there is an overlap between the two
ter and the two morphemes
er (see Figure 9 for a visualization of the
Each annotation is encapsulated underneath a
layer element (which in turn is
a child element of a
level element, since it is possible to have more than one
serialization, that is, layer, for a conceptual level). The
xsf:segment attribute is used to link the annotation with the
respective part of the primary data. Similar to other standoff approaches, XStandoff
character positions for defining segments over textual primary data. Changes of the
result in an out-of-sync situation between primary data and annotation. Processing
instances requires dealing with at least n+1 XML namespaces: one for
XStandoff itself and one for each of the n annotation layers.
Up to now, these instances are created by transforming inline annotations via a set of XSLT 2.0 stylesheets (see Stührenberg and Jettka, 2009 for a detailed discussion). We will outline an example API for XStandoff based on XML::Loy that makes it easy to deal with the dynamic creation of multi-layered annotations in the following section.
Creating and processing XStandoff instances using XML::Loy
As presented in the previous section, XStandoff associates annotations to primary data by defining segment spans to which the annotations are linked to via XML ID/IDREF integrity features. There are multiple ways to cope with standoff annotation: Compared to the XStandoff-Toolkit discussed in Stührenberg and Jettka, 2009, our API will provide an additional way to access and manipulate both annotations and primary data directly.
In Figure 10 a new
corpusData element is created.
textualContent element is added
(below an automatically introduced
primaryData element with a unique
Seven manually defined
segment elements are appended for selecting spans over the textual primary data
aligned to the words and the sentence as a whole. Figure 11 shows
The document creation is simple, as most elements such as
segment have corresponding API methods for
finding, appending, updating and removing elements of the document. Segments are appended
defining their scope.
The manipulation of the primary data is possible by applying the
segment_content() method, that associates primary data with segment spans (see
The textual content virtually delimited by a segment can be retrieved, replaced and manipulated, while all other segments stay intact and update their according start and end position values by calculating the new offsets in case they change. This addresses one of the key problems with standoff annotation: Usually, if one alters the primary data without updating the corresponding segments, association of annotations and corresponding primary data will break. Due to the dynamic access of primary data information provided by this API, work with standoff annotations can be nearly as flexible as with inline annotations, without the limitations these annotation formats have, for example to represent overlapping (see Figure 9).
The morpheme extension created in section “Creating an extensible API” can be simply adopted to represent an annotation layer with overlapping segment spans with an annotation of syllables (see Figure 13).
The resulting document is similar to listing Figure 8 but with a modified
primary data of
The moon shines brighter. and updated segment spans.
Another problem with some standoff formats is the association with decoupled primary
content. In XStandoff the primary data can be included in the XSF instance (as seen
previous examples) or stored in a separate file and referenced via the
primaryDataRef element (in case of larger textual primary data, multimedia-based or
multiple primary data files). If this file is on a local storage, the API will take
of updating the external textual content as well. Trying to modify files that are
modifiable (e.g. accessible online only) will result in a
Since metadata in XStandoff can be either included inline or referenced in the same
way, the handling of
metadata in our API can be treated alike, with a slight difference
if the metadata itself is a well-formed XML document. The example given in Figure 15 assumes a simple metadata document in RDF with a Dublin Core
namespace at the location
files/meta.xml in the local file system (shown in Figure 14).
The API enables the reference to the external document and supports the access by
a new XML::Loy object with an extension for dealing with Dublin Core data. As a result, the Dublin Core annotated
title element can be accessed
directly, although the data is not embedded in the document.
Conclusion and future work
We have demonstrated the XML::Loy API that can be used as a framework for development of extensible modules for given namespaces (and therefore markup languages). Modules created as extensions can then be used in a simple but yet powerful way to create and process multiple annotated instances, even with standoff markup and referenced documents for primary and metadata information.
The current implementation of XML::Loy is written in pure Perl, with the focus on demonstrating the flexibility and extensibility of our approach, rather than creating a performance optimized system. Since the whole API (including the extension modules and examples described in this paper) is available under a free license at http://github.com/Akron/XML-Loy-XStandoff further possible steps could include performance optimizations and the creation of an extension repository for popular standardized markup languages (such as OLAC, DocBook and TEI).
We would like to thank the anonymous reviewers of this paper for their helpful comments and ideas.
[Bray et al., 2009] Tim Bray, Dave Hollander, Andrew Layman, Richard Tobin, and Henry S. Thompson (2009). Namespaces in XML 1.0 (Third Edition). W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2009/REC-xml-names-20091208/
[Fallside and Walsmley, 2004] David C. Fallside and Priscilla Walmsley (2004). XML Schema Part 0: Primer Second Edition. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[Gao et al., 2012] Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson (2012). W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/
[Goecke et al., 2010] Daniela Goecke, Harald Lüngen, Dieter Metzing, Maik Stührenberg, and Andreas Witt (2010). Different views on markup. Distinguishing Levels and Layers. In: Witt, A. and Metzing, D. (eds.), Linguistic Modeling of Information and Markup Languages. Dordrecht: Springer. doi:https://doi.org/10.1007/978-90-481-3331-4_1.
[Ishikawa, 2002] Masayasu Ishikawa (2002). An XHTML+MathML+SVG Profile. W3C Working Draft, World Wide Web Consortium (W3C). http://www.w3.org/TR/XHTMLplusMathMLplusSVG/xhtml-math-svg.html
[van Kesteren and Hunt, 2013] Anne Van Kesteren, and Lachlan Hunt (2013). Selectors API Level 1. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2013/REC-selectors-api-20130221/
[Sperberg-McQueen and Huitfeldt, 2004] C. M. Sperberg-McQueen and Claus Huitfeldt (2004). GODDAG: A Data Structure for Overlapping Hierarchies. In: King, P. and Munson, E. V. (eds.), Proceedings of the 5th International Workshop on the Principles of Digital Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science, Springer
[Sperberg-McQueen and Huitfeldt, 2008] C. M. Sperberg-McQueen and Claus Huitfeldt (2008). GODDAG. Presented at the Goddag workshop, Amsterdam, 1-5 December 2008
[Stührenberg and Goecke, 2008] Maik Stührenberg and Daniela Goecke (2008). SGF – An integrated model for multiple annotations and its application in a linguistic domain. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In: Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1. doi:https://doi.org/10.4242/BalisageVol1.Stuehrenberg01
[Stührenberg and Jettka, 2009] Maik Stührenberg and Daniel Jettka (2009). A toolkit for multi-dimensional markup: The development of SGF to XStandoff. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3. doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01.
[Stührenberg, 2013] Maik Stührenberg. A What, when, where? Spatial and temporal annotations with XStandoff. In Proceedings of Balisage: The Markup Conference 2013. doi:https://doi.org/10.4242/BalisageVol10.Stuhrenberg01.
[Thompson et al., 2004] Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn (2004). XML Schema Part 1: Structures Second Edition. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
 The leading minus symbol is a shortcut for the
XML::Loy module namespace,
meaning, that the qualified name is
XML::Loy::Example::Morphemes. More than one extension can be passed
 Think of different POS taggers for example.
 This extension is not described in this article.