An extensible API for documents with multiple annotation layers
Balisage: The Markup Conference 2013
August 6 - 9, 2013
Both XML namespaces and standoff annotation are promising approaches to tackle possibly
overlapping multiple annotation layers in XML instances. The creation and processing of
standoff instances can be cumbersome – especially when the underlying textual primary data
is allowed to be modified after the annotation has been added. In this paper we present a
powerful API that is capable of dealing with these tasks by providing an extension mechanism
that allows for the easy creation of modules corresponding to a certain namespace (and
therefore markup language). We use XStandoff as a working example since it is a standoff
format that highly depends on XML namespaces for different annotation layers.
Nils Diewald
Nils Diewald received a B.A. in German philology and Text Technology and an M.A. in
Linguistics (with a focus on Computational Linguistics) from Bielefeld University.
Currently he is employed as a research assistant in the KorAP project at the IDS Mannheim
(Institute for the German Language) and is a Ph.D. candidate in Computer Science.
His doctoral studies focus on communication in social networks,
originating from his work as a research assistant in the
Linguistic Networks project of the BMBF (Federal Ministry of Education and Research).
Before that, he was a research and graduate assistant in the Sekimo project, part of the
DFG Research Group on Text-Technological Modelling of Information.
Universität Bielefeld
Institut für Deutsche Sprache (IDS) Mannheim
nils.diewald@uni-bielefeld.de
Maik Stührenberg
Maik Stührenberg received his Ph.D. in Computational Linguistics and Text Technology
from Bielefeld University in 2012. After graduating in 2001 he worked in different
text-technological projects at Gießen University, Bielefeld University and the Institut
für Deutsche Sprache (IDS, Institute for the German Language) in Mannheim. He is currently
employed as research assistant at Bielefeld University.
His main research interests include specifications for structuring multiple annotated
data, schema languages, and query processing.
Universität Bielefeld
maik.stuehrenberg@uni-bielefeld.de
Copyright © 2013 by the authors. Used with permission.
Multiple annotated documents
Markup languages are often defined for structuring the information of a specific text
type, such as web pages (HTML), technical articles or books (DocBook), or a set of information
items, such as vector graphics (SVG) or protocol information (SOAP). Their
structure is therefore determined, within limits, by a document grammar that allows for specific
elements and attributes. In addition, the different XML-based document grammar formalisms allow,
to a certain degree, the combination of elements (and attributes) from different markup languages,
usually by means of XML namespaces (Bray et al. 2009). In practice, one host
language can include islands of foreign markup (guest languages). Apart from the already mentioned
SOAP, there are several examples of such combinations of host and guest markup languages.
An XHTML driver (Ishikawa 2002) allows for the combination of
XHTML (as a host language) with MathML and SVG (as guest languages), and the Atom Syndication
Format (Nottingham and Sayre 2005) can be used in conjunction with a wide range of
extensions (e.g. for Threading, see Snell 2006, or Activity Streams, see
Atkins et al. 2011), while it is also meant to be embedded in parts in the RSS
format (Winer 2009).
Although XML namespaces support the combination of elements derived from different
markup languages, they do not change XML's formal model that prohibits overlapping markup.
However, standoff markup (instead of inline annotation) may be used to circumvent this
problem. The meta markup language XStandoff (Stührenberg and Jettka 2009) embeds
(slightly transformed) islands of guest languages (with their respective XML namespaces) in
combination with a standardized standoff approach as its key feature for the storage of multiple
(and possibly overlapping) hierarchies.
Typical problems when dealing with multiple and/or standoff annotations are related to the
production and processing of instances. Although usually each markup language involved is
defined by a document grammar on its own, it can often be cumbersome to validate an instance
combining elements from a large variety of document grammars (although XStandoff is capable of
validating these instances, adapted XML schema files have to be present for each guest
language). This behaviour can be controlled by means of the document grammar formalism. For
example, XML Schema allows different values of the processContents attribute, which may occur
on the any element. The value lax (used, for instance, in the declaration of XStandoff's layer
element) "instructs an XML processor to validate the element content on
a can-do basis: It will validate elements and attributes for which it can obtain schema
information, but it will not signal errors for those it cannot obtain any schema
information" (Fallside and Walmsley 2004, Section 5.5, Any Element, Any Attribute).
In addition, the namespace attribute may be used to control the allowed
namespaces. While XSD 1.0 only allows the values ##any, ##other or a list
of namespaces (including the reserved values ##targetNamespace and
##local, see Thompson et al. 2004), RELAX NG supports the exclusion
of namespaces (by using the except pattern in combination with
nsName). XSD 1.1 (Gao et al. 2012) introduced the
notNamespace and notQName attributes.
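The XSD 1.0 namespace constraint described above can be made concrete with a small sketch. The following Python function is only an illustration of the matching rule, not part of any schema processor; the name wildcard_allows and its parameters are hypothetical. It assumes the XSD 1.0 reading of ##other, where an element must be namespace-qualified and must not be in the schema's target namespace.

```python
def wildcard_allows(namespace_attr, target_ns, element_ns):
    """Return True if an XSD 1.0 wildcard admits an element in element_ns.

    namespace_attr -- value of the wildcard's namespace attribute
    target_ns      -- target namespace of the schema (None if absent)
    element_ns     -- namespace of the candidate element (None if unqualified)
    """
    if namespace_attr == "##any":
        return True
    if namespace_attr == "##other":
        # XSD 1.0: qualified elements outside the target namespace only.
        return element_ns is not None and element_ns != target_ns
    # Otherwise the value is a whitespace-separated list of namespaces,
    # which may contain the reserved values ##targetNamespace and ##local.
    allowed = set()
    for token in namespace_attr.split():
        if token == "##targetNamespace":
            allowed.add(target_ns)
        elif token == "##local":
            allowed.add(None)
        else:
            allowed.add(token)
    return element_ns in allowed


# Hypothetical host namespace used only for this illustration.
HOST_NS = "http://example.org/host"
MORPH_NS = "http://www.xstandoff.net/morphemes"

print(wildcard_allows("##other", HOST_NS, MORPH_NS))
print(wildcard_allows("##other", HOST_NS, HOST_NS))
print(wildcard_allows("##targetNamespace ##local", HOST_NS, None))
```

An explicit list cannot express "everything except namespace X", which is the gap that the RELAX NG except pattern and the XSD 1.1 notNamespace attribute fill.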
The production of multiple annotated documents is typically the result of the combination
of formerly stand-alone documents (or their parts), such as the inclusion of externally
created SVG graphics in an XHTML host document, or the outcome of a mostly automated process
(see Stührenberg and Jettka 2009 for a discussion on the production of XStandoff
instances). What is still lacking is an API (Application Programming Interface)
that is flexible enough to support the production
and processing of multiple annotated instances, even if annotations refer to the same
primary data by means of standoff annotation. We will demonstrate such an API in the remainder
of this article.
Creating an extensible API
XML::Loy (Diewald 2011) is a Perl library that
provides a simple programming interface for the creation of XML documents with multiple
namespaces. It is based on Mojo::DOM, an HTML/XML DOM parser that is part
of the Mojolicious framework (Riedel 2008).
Mojo::DOM provides CSS selector based methods for DOM traversal (van Kesteren and Hunt 2013),
similar to JavaScript's querySelector() and querySelectorAll() methods.
The basic methods for the manipulation of the XML Document Object Model provided by
XML::Loy are add() and set(). By applying
these methods, new nodes can be introduced as children of any node in the document. While
add() always appends additional nodes to the document, set() only
appends nodes in case no child of the given type exists. Both methods are invoked on a chosen
node in the document tree (acting as the parent node of the newly introduced node). They
accept the element name as a string parameter, followed by an optional hash reference
containing attributes and an optional string containing the textual content of the element. A final
string can be used to put a comment in front of the element.
In the example, a new XML::Loy
document instance is created with the root element document. Applying the
set() method, a new title element is introduced as a child of the
root element. The second call of set() overwrites the content of the
title element. By using the add() method, we insert multiple
paragraph elements without overwriting existing ones. These elements are
defined with both an id attribute and textual content.
By applying the to_pretty_xml() method, the result can be printed as XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<document>
  <title>My New Title</title>
  <paragraph id="p-1">First Paragraph</paragraph>
  <paragraph id="p-2">Second Paragraph</paragraph>
</document>
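The difference between the two methods can be sketched independently of Perl. The following Python fragment is not XML::Loy code; it merely mimics the described semantics with the standard library's ElementTree, and the helper names add and set_child are hypothetical. It assumes that set() replaces the content of an existing child of the same name, as described for the title element above.

```python
import xml.etree.ElementTree as ET


def add(parent, name, attrs=None, text=None):
    """Always append a new child element (the add() behaviour)."""
    child = ET.SubElement(parent, name, attrs or {})
    child.text = text
    return child


def set_child(parent, name, attrs=None, text=None):
    """Append a child only if none of that name exists; otherwise
    overwrite the existing one (the set() behaviour)."""
    existing = parent.find(name)
    if existing is None:
        return add(parent, name, attrs, text)
    existing.clear()                    # drop old attributes and content
    existing.attrib.update(attrs or {})
    existing.text = text
    return existing


doc = ET.Element("document")
set_child(doc, "title", text="My Title")
set_child(doc, "title", text="My New Title")   # overwrites, does not append
add(doc, "paragraph", {"id": "p-1"}, "First Paragraph")
add(doc, "paragraph", {"id": "p-2"}, "Second Paragraph")

print(ET.tostring(doc, encoding="unicode"))
```

Apart from the XML declaration and indentation, the serialized tree has the same structure as the output shown above: one title and two paragraph elements.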
The strength of this simple approach to document manipulation is the ability to pass
these methods on to new extension modules that can represent APIs for specific XML namespaces, as
both host and guest languages. To illustrate these capabilities, consider
a simple XML::Loy extension
for morpheme annotations.
The extension class inherits all XML creation methods from XML::Loy and thus
all XML traversal methods from Mojo::DOM. When defining the base class,
the optional namespace http://www.xstandoff.net/morphemes is bound to the
morph prefix, which means that all invocations of set() and
add() from this class will be bound to the morph namespace. The
newly created morphemes() method appends a morphemes element bound
to the given namespace as a child of the invoking node.
To implement simple grammar rules in the API, the methods can check the invoking context, for
example by constraining the introduction of morpheme elements to
morphemes parent nodes only (using the regular expression check
/^(?:morph:)?morphemes$/).
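The idea of a namespace-bound extension with a context check can be sketched as follows. This is a Python analogue under stated assumptions, not the XML::Loy extension itself: the functions morphemes and morpheme are modelled on the method names in the text, while the helper _qname and the choice to return None on a rejected call are assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET

MORPH_NS = "http://www.xstandoff.net/morphemes"
ET.register_namespace("morph", MORPH_NS)

# ElementTree stores qualified names as "{namespace}local", so the context
# check accepts both the qualified and the unqualified parent name.
_PARENT_CHECK = re.compile(r"(?:\{%s\})?morphemes" % re.escape(MORPH_NS))


def _qname(local):
    return "{%s}%s" % (MORPH_NS, local)


def morphemes(parent):
    """Append a morphemes container bound to the morph namespace."""
    return ET.SubElement(parent, _qname("morphemes"))


def morpheme(parent, text):
    """Append a morpheme, but only below a morphemes parent."""
    if not _PARENT_CHECK.fullmatch(parent.tag):
        return None                     # simple grammar rule violated
    node = ET.SubElement(parent, _qname("morpheme"))
    node.text = text
    return node


root = ET.Element("document")
container = morphemes(root)
morpheme(container, "bright")
morpheme(container, "er")
rejected = morpheme(root, "sun")        # wrong context, nothing is added

print(ET.tostring(root, encoding="unicode"))
```

The check plays the role of a minimal content model: the API refuses to create a morpheme anywhere except below morphemes, which is the kind of constraint a document grammar would otherwise enforce at validation time.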
This newly created API for the http://www.xstandoff.net/morphemes namespace
can now be used to create new document instances.
By using the generic methods add() and set() provided by
XML::Loy, the class can easily be used for extending an existing
XML::Loy based class (i.e. as a guest language inside another host
language). In the corresponding example, a simplified HTML
instance is read and instantiated, and elements from the
http://www.xstandoff.net/morphemes namespace are appended using the API
described above.
By extending the XML::Loy base object with the newly created class using
the extension() method, all method calls from the extension class become available for
namespace-aware traversal and manipulation. (In the extension name passed to this method,
a leading minus symbol is a shortcut for the XML::Loy module namespace,
meaning that the qualified name is
XML::Loy::Example::Morphemes. More than one extension can be passed
at once.) In general, using such an extensible API provides at least some
functionality usually made available by document grammars (the nesting of elements, for
example) and adds methods to create and manipulate the respective class of instances.
XStandoff as an example application
XStandoff's predecessor SGF (Sekimo Generic Format) was developed in 2008 (see Stührenberg and
Goecke 2008) as a meta format for storing and analyzing multiple annotated
instances as part of a linguistic corpus. In 2009 the format was generalized and enhanced.
Since then, XStandoff combines standoff notation with the formal model of General
Ordered-Descendant Directed Acyclic Graphs (GODDAG, introduced in Sperberg-McQueen and
Huitfeldt 2004; see Sperberg-McQueen and Huitfeldt 2008 for a more
recent discussion). The format as such is capable of representing multiple hierarchies and
specifically challenging structures such as overlaps, discontinuous elements and virtual
elements. The basic structure of an XStandoff instance consists of the root element
corpusData, underneath which the child elements meta (optional),
resources (optional), primaryData (optional in the proposed
release 2.0, see Stührenberg 2013), segmentation and
annotation
are subsumed. (Example XStandoff documents can be found at http://www.xstandoff.net/examples.)
In the example used here, the sentence "The sun shines brighter." is annotated with
two linguistic levels (and respective layers): morphemes and syllables. We cannot combine both
annotation layers in an inline annotation, since there is an overlap between the two syllables
"brigh" and "ter" and the two morphemes "bright" and "er".
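The overlap can be verified from the character positions alone. The following Python sketch assumes zero-based offsets with exclusive end positions, a convention chosen for this illustration; the helper properly_overlaps is hypothetical. It reports which syllable and morpheme spans cross each other without one containing the other, which is exactly the configuration that cannot be nested in a single XML tree.

```python
text = "The sun shines brighter."
start = text.index("brighter")

# Character spans (start, end) with exclusive end positions.
syllables = {"brigh": (start, start + 5), "ter": (start + 5, start + 8)}
morphemes = {"bright": (start, start + 6), "er": (start + 6, start + 8)}


def properly_overlaps(a, b):
    """True if the spans intersect and neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


conflicts = [
    (syl, mor)
    for syl, syl_span in syllables.items()
    for mor, mor_span in morphemes.items()
    if properly_overlaps(syl_span, mor_span)
]
print(conflicts)
```

Only the pair of the syllable "ter" and the morpheme "bright" conflicts: the two spans share the letter "t", but neither contains the other. All other pairs are either disjoint or nested and could be expressed inline.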
Each annotation is encapsulated underneath a layer element (which in turn is
a child element of a level element, since it is possible to have more than one
serialization, that is, layer, for a conceptual level; think of different POS taggers,
for example).
The xsf:segment attribute is used to link the annotation with the
respective part of the primary data. Similar to other standoff approaches, XStandoff uses
character positions for defining segments over textual primary data. Changes to the input text
therefore result in an out-of-sync situation between primary data and annotation. Processing
XStandoff instances requires dealing with at least n+1 XML namespaces: one for
XStandoff itself and one for each of the n annotation layers.
Up to now, these instances are created by transforming inline annotations via a set of
XSLT 2.0 stylesheets (see Stührenberg and Jettka 2009 for a detailed discussion). In the
following section, we will outline an example API for XStandoff based on XML::Loy that makes it
easy to deal with the dynamic creation of multi-layered annotations.
The software presented there is freely available under the GPL or the
Artistic License at http://github.com/Akron/XML-Loy-XStandoff.
Creating and processing XStandoff instances using XML::Loy
As presented in the previous section, XStandoff associates annotations with primary data by
defining segment spans to which the annotations are linked via XML ID/IDREF integrity features.
(In the following example we limit our view to segments defined by character
positions. See Stührenberg 2013 for examples of other segmentation
methods supported by XStandoff.) There
are multiple ways to cope with standoff annotation: Compared to the XStandoff-Toolkit
discussed in Stührenberg and Jettka 2009, our API provides an additional
way to access and manipulate both annotations and primary data directly.
In the example, a new corpusData element is created.
Next, a textualContent element is added
(below an automatically introduced primaryData element with a unique xml:id).
Seven manually defined segment
elements are appended for selecting spans over the textual primary data,
aligned to the words and the sentence as a whole.
The document creation is simple, as most elements such as corpusData,
textualContent and segment have corresponding API methods for
finding, appending, updating and removing elements of the document. Segments are appended by
defining their scope.
The manipulation of the primary data is possible by applying the
segment_content() method, which associates primary data with segment spans.
The textual content virtually delimited by a segment can be retrieved, replaced and
manipulated, while all other segments stay intact and update their start and end
position values by calculating the new offsets in case they change.
This addresses one of the key problems
with standoff annotation: Usually, if one alters the primary data without updating the
corresponding segments, the association of annotations and corresponding primary data will break.
Due to the dynamic access to primary data information provided by this API,
work with standoff annotations can
be nearly as flexible as with inline annotations, without the limitations these annotation
formats have, for example in representing overlapping structures.
The morpheme extension created above can simply be adapted
to represent an annotation layer whose segment spans overlap with those of an annotation of
syllables.
The resulting document is similar to the earlier XStandoff example, but with the modified
primary data "The moon shines brighter." and updated segment spans.
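The offset bookkeeping behind such an update can be sketched in a few lines. The following Python model is not the XML::Loy::XStandoff implementation; the classes Segment and StandoffText are hypothetical, and only the method name segment_content is borrowed from the text. It assumes zero-based character offsets with exclusive end positions and shifts every boundary that lies at or behind the end of the replaced span.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    id: str
    start: int
    end: int


class StandoffText:
    def __init__(self, text, segments):
        self.text = text
        self.segments = {seg.id: seg for seg in segments}

    def segment_content(self, seg_id, replacement=None):
        """Return the text of a segment or, if a replacement is given,
        replace it and shift all affected segment boundaries."""
        seg = self.segments[seg_id]
        if replacement is None:
            return self.text[seg.start:seg.end]
        old_end = seg.end
        delta = len(replacement) - (seg.end - seg.start)
        self.text = self.text[:seg.start] + replacement + self.text[old_end:]
        for other in self.segments.values():
            if other is seg:
                continue
            if other.start >= old_end:   # segment starts behind the change
                other.start += delta
            if other.end >= old_end:     # segment ends at or behind the change
                other.end += delta
        seg.end += delta
        return replacement


doc = StandoffText(
    "The sun shines brighter.",
    [
        Segment("seg1", 0, 24),   # the sentence as a whole
        Segment("seg2", 0, 3),    # The
        Segment("seg3", 4, 7),    # sun
        Segment("seg4", 8, 14),   # shines
        Segment("seg5", 15, 23),  # brighter
    ],
)
doc.segment_content("seg3", "moon")
print(doc.text)
print(doc.segment_content("seg5"))
```

After the replacement, the segments for "shines" and "brighter" and the end of the sentence segment have each moved by one character, so every segment still selects the text it was defined for.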
Another problem with some standoff formats is the association with decoupled primary data
content. In XStandoff the primary data can be included in the XSF instance (as seen in the
previous examples) or stored in a separate file and referenced via the
primaryDataRef
element (in case of larger textual primary data, multimedia-based or
multiple primary data files). If this file resides on local storage, the API will take care
of updating the external textual content as well. Trying to modify files that are not
modifiable (e.g. accessible online only) will result in a
warning.
Since metadata in XStandoff can be either included inline or referenced in the same way, the
handling of metadata in our API can be treated alike, with a slight difference
if the metadata itself is a well-formed XML document. The example assumes a simple metadata
document in RDF with a Dublin Core
namespace at the location files/meta.xml in the local file system.
The API enables the reference to the external document and supports the access by defining
a new XML::Loy object with an extension for dealing with Dublin Core data
(this extension is not described in this article).
As a result, the Dublin Core annotated title element can be accessed
directly, although the data is not embedded in the document.
Conclusion and future work
We have demonstrated the XML::Loy API, which can be used as a framework
for the development of extensible modules for given namespaces (and therefore markup
languages). Modules created as extensions can then be used in a simple yet powerful way to
create and process multiple annotated instances, even with standoff markup and referenced
documents for primary data and metadata.
The current implementation of XML::Loy is written in pure Perl, with
the focus on demonstrating the flexibility and extensibility of our approach, rather than
creating a performance-optimized system. Since the whole API (including the extension modules
and examples described in this paper) is available under a free license at
http://github.com/Akron/XML-Loy-XStandoff, further possible steps could include
performance optimizations and the creation of an extension repository for popular standardized
markup languages (such as OLAC, DocBook and TEI).
Acknowledgements
We would like to thank the anonymous reviewers of this paper for their helpful comments
and ideas.
References
Martin Atkins, Will Norris, Chris Messina, Monica Wilkinson, and Rob Dolin (2011). Atom Activity Streams 1.0. http://activitystrea.ms/specs/atom/1.0/
Tim Bray, Dave Hollander, Andrew Layman, Richard Tobin, and Henry S. Thompson (2009). Namespaces in XML 1.0 (Third Edition). W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2009/REC-xml-names-20091208/
Nils Diewald (2011). XML::Loy – Extensible XML Reader and Writer. http://search.cpan.org/dist/XML-Loy/
David C. Fallside and Priscilla Walmsley (2004). XML Schema Part 0: Primer Second Edition. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson (2012). W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/
Daniela Goecke, Harald Lüngen, Dieter Metzing, Maik Stührenberg, and Andreas Witt (2010). Different views on markup. Distinguishing Levels and Layers. In: Witt, A. and Metzing, D. (eds.), Linguistic Modeling of Information and Markup Languages. Dordrecht: Springer. doi:10.1007/978-90-481-3331-4_1.
Masayasu Ishikawa (2002). An XHTML+MathML+SVG Profile. W3C Working Draft, World Wide Web Consortium (W3C). http://www.w3.org/TR/XHTMLplusMathMLplusSVG/xhtml-math-svg.html
Anne van Kesteren and Lachlan Hunt (2013). Selectors API Level 1. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2013/REC-selectors-api-20130221/
Mark Nottingham and Robert Sayre (2005). The Atom Syndication Format. The Internet Society. http://tools.ietf.org/html/rfc4287
Sebastian Riedel (2008). Mojolicious. Real-time web framework. http://search.cpan.org/dist/Mojolicious/
James M. Snell (2006). Atom Threading Extensions. The Internet Society. http://www.ietf.org/rfc/rfc4685.txt
C. M. Sperberg-McQueen and Claus Huitfeldt (2004). GODDAG: A Data Structure for Overlapping Hierarchies. In: King, P. and Munson, E. V. (eds.), Proceedings of the 5th International Workshop on the Principles of Digital Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science, Springer.
C. M. Sperberg-McQueen and Claus Huitfeldt (2008). GODDAG. Presented at the Goddag workshop, Amsterdam, 1-5 December 2008.
Maik Stührenberg and Daniela Goecke (2008). SGF – An integrated model for multiple annotations and its application in a linguistic domain. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In: Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1. doi:10.4242/BalisageVol1.Stuehrenberg01
Maik Stührenberg and Daniel Jettka (2009). A toolkit for multi-dimensional markup: The development of SGF to XStandoff. In: Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3. doi:10.4242/BalisageVol3.Stuhrenberg01.
Maik Stührenberg (2013). What, when, where? Spatial and temporal annotations with XStandoff. In: Proceedings of Balisage: The Markup Conference 2013. doi:10.4242/BalisageVol10.Stuhrenberg01.
Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn (2004). XML Schema Part 1: Structures Second Edition. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
Dave Winer (2009). RSS 2.0 Specification. http://www.rssboard.org/rss-specification