Poio API and GraF-XML
A radical stand-off approach in language documentation and language typology
Balisage: The Markup Conference 2013
August 6 - 9, 2013
Field linguists in language documentation projects have increasingly adopted the latest
technologies and tools. Their work has led to remarkable developments in digital corpora,
exemplified by The Language Archive at the MPI in Nijmegen. The next step in research is
now the analysis and theoretical exploitation of the huge amount of data that has been
collected in numerous language documentation projects. This research will rely on
computer-based strategies, as data is already available in digital formats.
In our talk we will present a recent project within the CLARIN framework
("Common Language Resources and Technology Infrastructure") that develops a solution
for researchers to access the data collected in language documentation
projects via GrAF data structures. Our project consists of three parts:
graf-python: a Python implementation of GrAF as defined in the ISO document;
Poio API: a Python library that maps data formats used in language documentation to
GrAF and back again;
CLASS: a web application that provides a REST and user interface to access, search
and modify the data.
We will focus on the advantages of the radical stand-off approach (internally and via its
XML representation) for collaboration in language documentation and analysis projects. We
will also show how our project connects to existing infrastructure via building a bridge
to existing identity providers within the linguistic, scientific environment.
A Common Research Infrastructure
The project was developed by the German section of the European CLARIN project. The CLARIN
project aims at establishing a web-based research infrastructure for the social sciences
and humanities. The infrastructure embraces existing standards and focusses on loosely
coupled REST-style web services. To assure interoperability, Text Corpus Format (TCF)
– an exchange format for annotations – has been
defined. TCF is fully compatible with the Linguistic Annotation Format (LAF) and
Graph-based Format for Linguistic Annotations (GrAF).
The DoBeS-Archive,
one of the biggest digital archives for data from language documentation projects, is
part of this infrastructure project. The CLASS webservice
interfaces with this archive as part of the CLARIN infrastructure.
GrAF and Poio API
One problem that our research field faces at the moment is that the data and annotations
show a highly heterogeneous layout. Not only are there different file formats in the
archives, but the structure of "tiers" and the annotation schemes used differ
from project to project. Our implementation aims at unifying the existing data formats and
the existing tier hierarchies into a standardized pivot format as defined by ISO 24612
"Language resource management - Linguistic annotation framework (LAF)". This standard
is based on annotation graphs and was developed as an underlying data model
for linguistic annotations [Ide/Suderman 2007]. The "Graph Annotation Framework"
(GrAF), an implementation of LAF, was originally developed to publish the "Manually Annotated Subcorpus" (MASC)
of the American National Corpus
and consists of three parts:
As the data model and its specification are somewhat liberal regarding the layout of the
data, we first had to define a common annotation graph structure for data from language documentation
projects (for the advantages of the liberal approach of GrAF for our
purposes see section “Language Documentation and Language Typology”). In our project we use the Graph Annotation Framework
Converting from and to GrAF and GrAF-XML
We started with the XML representation
of the software Elan, which
is currently the de facto standard in language documentation. First, we separated the metadata content of Elan's
EAF files from the bare content with data and annotations. Most metadata contained in Elan files
describes the types and restrictions on linguistic tiers, which we store separately from the
graphs. The graphs themselves capture two items: the tier hierarchy in the EAF file and
the content of the annotations.
The following listing shows the start of one tier that stores "utterances" in the EAF file:
<TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="utterance" PARTICIPANT="" TIER_ID="W-Spch">
<ALIGNABLE_ANNOTATION ANNOTATION_ID="a8" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts23">
<ANNOTATION_VALUE>so you go out of the Institute to the Saint Anna Straat.</ANNOTATION_VALUE>
The tokenization into individual words is stored in a separate child tier for "words":
<TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="words" PARENT_REF="W-Spch" PARTICIPANT="" TIER_ID="W-Words">
<ALIGNABLE_ANNOTATION ANNOTATION_ID="a23" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6">
Both tiers being time-aligned, the basic video or audio data serves as reference
point to describe which annotations of a child tier are contained in one annotation
of the parent tier. As this layout is already a stand-off annotation (although
everything is stored in a single file), the transformation to an annotation graph is
an easy task. The utterance is stored as a GrAF node with an annotation that
contains a feature structure. The first annotation of the tier W-Spch with the
annotation value "so you go out of the Institute to the Saint Anna Straat." looks
like this in GrAF:
<region anchors="780 4090" xml:id="utterance..W-Spch..ra8"/>
<a as="utterance" label="utterance" ref="utterance..W-Spch..na8" xml:id="a8">
<f name="annotation_value">so you go out of the Institute to the Saint Anna Straat.</f>
The node is linked to a region that contains the values of the time slots
of the original EAF file. The annotation for the node has a feature structure
with one feature for the annotation value. The corresponding annotation for
the "word" annotation in EAF looks like this:
<region anchors="780 1340" xml:id="words..W-Words..ra23"/>
<edge from="utterance..W-Spch..na8" to="words..W-Words..na23" xml:id="ea23"/>
<a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
The node for the word annotation is similar to the utterance node, except for
the edge tag that links the node to the corresponding
utterance node. Nodes and edges are created for all annotations of tiers that have
time slots and a parent tier in EAF (those tiers have the stereotype
"Time Subdivision" in EAF).
Poio API contains a plugin mechanism to allow future support of other file formats
that can be mapped onto GrAF structures. This mechanism consists of base classes
Writer classes that are used by a general
Converter class which constructs the graph. This architecture minimizes
the amount of work needed to implement support for new file formats. The basic
idea is that we map the semantics of file formats in language documentation onto
annotation graphs, thus creating a certain subset of general annotation
graphs that is suitable to store the information from the original files.
The notion of "tiers" also has to be mapped and stored in annotation graphs
during conversion. This will be explained in the next section.
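The mapping from time-aligned EAF tiers to nodes, regions and edges described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration, not the Poio API implementation: the function name `eaf_to_graph`, the plain dictionaries used instead of graf-python objects, and the value of the word annotation are assumptions. It also assumes that every time slot carries a time value and that a parent tier precedes its child tiers in the file.

```python
import xml.etree.ElementTree as ET

# Abridged EAF document modelled on the listings above. The value of the
# word annotation ("so") is an illustrative assumption.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="780"/>
    <TIME_SLOT TIME_SLOT_ID="ts6" TIME_VALUE="1340"/>
    <TIME_SLOT TIME_SLOT_ID="ts23" TIME_VALUE="4090"/>
  </TIME_ORDER>
  <TIER LINGUISTIC_TYPE_REF="utterance" TIER_ID="W-Spch">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a8" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts23">
        <ANNOTATION_VALUE>so you go out of the Institute to the Saint Anna Straat.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER LINGUISTIC_TYPE_REF="words" PARENT_REF="W-Spch" TIER_ID="W-Words">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a23" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6">
        <ANNOTATION_VALUE>so</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def eaf_to_graph(eaf_xml):
    """Build plain-dict nodes, regions and edges from time-aligned tiers."""
    root = ET.fromstring(eaf_xml)
    # Resolve time slot IDs to their time values.
    times = {slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
             for slot in root.iter("TIME_SLOT")}
    nodes, regions, edges = {}, {}, []
    spans = {}  # tier ID -> list of (start, end, node ID)
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        # IDs are prefixed with the linguistic type and the tier name.
        prefix = "{}..{}".format(tier.get("LINGUISTIC_TYPE_REF"), tier_id)
        parent_tier = tier.get("PARENT_REF")
        spans[tier_id] = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            ann_id = ann.get("ANNOTATION_ID")
            node_id = "{}..n{}".format(prefix, ann_id)
            start = times[ann.get("TIME_SLOT_REF1")]
            end = times[ann.get("TIME_SLOT_REF2")]
            regions["{}..r{}".format(prefix, ann_id)] = (start, end)
            nodes[node_id] = {"annotation_value": ann.findtext("ANNOTATION_VALUE")}
            spans[tier_id].append((start, end, node_id))
            # Link the node to the parent annotation whose time span contains it.
            for p_start, p_end, p_node in spans.get(parent_tier, []):
                if p_start <= start and end <= p_end:
                    edges.append((p_node, node_id))
    return nodes, regions, edges


nodes, regions, edges = eaf_to_graph(EAF)
print(regions)
print(edges)
```

The containment test on time spans mirrors the observation that the media timeline serves as the reference point for deciding which child annotations belong to a parent annotation.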
In Elan, information about the tier hierarchy is stored only implicitly with the
layout of our graph. However, there is no restriction in GrAF itself that would
disallow a node in the word tier to connect to any other node in other tiers but
the utterance tier. This is not possible in Elan. To describe those restrictions
we added another data structure in Poio API that we call "data structure type".
This data structure separately stores information about any tier hierarchy that we
can extract from the graph later. A simple data structure type indicating that the
researcher wants to tokenize a text into words before adding a word-for-word
translation and a translation for the whole utterance looks like this:
[ 'utterance', [ 'word', 'wfw' ], 'translation' ]
One advantage in representing annotation schemes through those simple trees
is that the linguists instantly understand how such a tree works and they can give a
representation of "their" annotation schema. In language documentation and
general linguistics researchers tend to create ad-hoc annotation schemes according to
their background and then normally start to create only those annotations related
to their current research project. This is for example reflected in an annotation
software like Elan, where the user can freely create tiers with any names and
arrange them in custom hierarchies. As we need to map those data into our internal
representation, we try to ease the creation of custom annotation schemes which are
easy to understand for users. Subsequently, users can create their own
data structure types and derive the annotation schemes for GrAF files from those.
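To make the tree reading of such a data structure type concrete, the following sketch derives a parent for every tier from the nested list shown above. It assumes that the first item of each list is the parent of the remaining items in that list; the helper name `parent_map` is hypothetical and not part of Poio API.

```python
def parent_map(hierarchy, parent=None):
    """Return {tier: parent tier}, reading the first item of each list
    as the parent of the remaining items (an assumed convention)."""
    head, *children = hierarchy
    mapping = {head: parent}
    for child in children:
        if isinstance(child, list):
            # A nested list opens a new sub-tree below the current head.
            mapping.update(parent_map(child, head))
        else:
            mapping[child] = head
    return mapping


hierarchy = ['utterance', ['word', 'wfw'], 'translation']
parents = parent_map(hierarchy)
print(parents)
```

Under this reading, 'word' and 'translation' hang below 'utterance', and 'wfw' hangs below 'word', which matches the description of tokenizing first and then adding a word-for-word translation.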
The node IDs in our GrAF data have a prefix that also contains the
tier name (for example, the ID of the utterance node in the GrAF listing shown earlier).
In the case of Elan the tier name in the node ID is preceded
by another string, the so-called "Linguistic Type" of Elan. This is a general restriction
of our annotation graphs: we use those string IDs to be able to query the
graphs quickly, for example to get all annotations which are children
of another annotation on a certain tier, or to get all annotations for a "Linguistic Type".
In other words we apply the semantics of tier-based annotation
onto annotation graphs and make sure that the graphs we create and that the user
handles in Poio API are a valid subset of general annotation graphs, as we defined it.
The advantage of this approach is that within Poio API we are sure that the annotation
graphs show a certain layout. The following sections will give an outlook on how this
basic unification will support collaborations and scientific workflows in the future.
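The kind of ID-based query mentioned above can be illustrated with plain string handling. This is a sketch under assumptions: it takes the ID layout from the listings (linguistic type, tier name and a local ID, separated by ".."), the helper names are hypothetical, and the second word node `na24` is invented for the example. It would not work for tier names that themselves contain "..".

```python
def split_node_id(node_id):
    """Split 'linguistic_type..tier..local_id' into its three parts."""
    linguistic_type, tier, local_id = node_id.split("..")
    return linguistic_type, tier, local_id


def nodes_of_type(node_ids, linguistic_type):
    """All nodes whose ID starts with the given Linguistic Type."""
    return [n for n in node_ids if split_node_id(n)[0] == linguistic_type]


def children_on_tier(edges, parent_id, tier):
    """All children of one annotation that live on a certain tier."""
    return [child for parent, child in edges
            if parent == parent_id and split_node_id(child)[1] == tier]


node_ids = [
    "utterance..W-Spch..na8",
    "words..W-Words..na23",
    "words..W-Words..na24",  # hypothetical second word node
]
edges = [
    ("utterance..W-Spch..na8", "words..W-Words..na23"),
    ("utterance..W-Spch..na8", "words..W-Words..na24"),
]

word_nodes = nodes_of_type(node_ids, "words")
children = children_on_tier(edges, "utterance..W-Spch..na8", "W-Words")
print(word_nodes)
print(children)
```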
We also want to mention that we store the annotations of each tier in a separate file
in GrAF-XML. This radical stand-off approach allows easy sharing of annotation files
and unsupervised collaboration, as discussed in section “Annotation graphs in linguistic workflows and collaborative projects”. Still,
we do not propose to introduce GrAF-XML as another file format for software tools
or archiving in language documentation, but as an easy way to serialize annotation
graphs for later processing in a scientific workflow.
All in all, it was an easy task to map an Elan
EAF file to an annotation graph in GrAF. Those graphs are now the basis for other
mappings of file formats onto GrAF. We are currently working on mapping from SIL
Toolbox files, Typecraft XML files and Weblicht TCF files.
Annotation graphs in linguistic workflows and collaborative projects
Annotation graphs have long been discussed as the underlying data model for
linguistic annotations [Bird/Liberman 2001]. Furthermore, the graphs
themselves can be used to extend and augment existing corpora without a central,
supervising instance. As the XML representation of GrAF constitutes a radical
stand-off approach, it is possible to add and mix annotation tiers in collaborative
projects without administrative overhead (for advantages of radical stand-off
solutions over previous approaches see [Cayless/Soroka 2010]
and [Banski/Przepiórkowski 2009]). The amount of work to
manage and synchronize the data is thereby kept to a minimum.
In contrast to Elan we store each tier in a separate XML file, so that users can
share annotations independently from any other tier and across sources. With this stand-off
approach, a common data structure and unified access via an API facilitate the
analysis of linguistic data, for example in typological research.
GrAF itself was developed with interoperability in mind, so researchers
can also use existing workflow tools like GATE and Apache UIMA in analysis.
Last but not least, existing graph-based methods can now be applied directly to
the linguistic graphs in analysis, for example graph-traversal algorithms to
generate co-occurrence statistics of annotations (for preliminary thoughts in
that direction see also [Ide/Suderman 2007]). Graph visualization
techniques can be instantly used by researchers to gain insight into their data
and allow distant reading.
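As a rough illustration of such a traversal, the sketch below counts how often two annotation values occur under the same parent node. The toy graph, its node names and the function `cooccurrence` are assumptions made for this example; a real analysis would walk the graph objects provided by graf-python or Poio API instead of plain lists and dictionaries.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy graph: two utterance nodes with word children (hypothetical data).
edges = [("u1", "w1"), ("u1", "w2"),
         ("u2", "w3"), ("u2", "w4"), ("u2", "w5")]
labels = {"w1": "you", "w2": "go", "w3": "you", "w4": "go", "w5": "out"}


def cooccurrence(edges, labels):
    """Count pairs of distinct values that share a parent node."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    counts = Counter()
    for kids in children.values():
        # Sort the distinct values so that each pair has one canonical order.
        values = sorted({labels[kid] for kid in kids})
        counts.update(combinations(values, 2))
    return counts


counts = cooccurrence(edges, labels)
print(counts.most_common())
```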
Language Documentation and Language Typology
Data used in documentary linguistics and language typology is different from data
used in other areas of linguistics. The data produced in language documentation
projects consists mostly of audio and video recordings with time-aligned annotations.
This predominance of time-aligned annotations determines the requirements for data
structures. Because of these requirements an annotation format used in this area
of linguistics should allow the definition of regions of media recordings using timestamps
as well as string ranges of annotations.
The linguistic diversity of the data in documentary linguistics and language
typology makes the creation of standardized tag sets difficult, if not impossible,
and even tier hierarchies differ considerably. Furthermore, many documented and
analyzed languages will be subjected to further research and projects need to work with competing annotations
and analyses. The specifics of the data and research questions require a format that
can represent these complex and varied structures, while providing a unified access
to data and annotation.
We hope that the usage of annotation graphs as pivot structures and the unification
of tier-based annotation via our subset of graphs will help in the ongoing task
to unify tier structures and annotation schemes. The
annotation graphs that our library constructs may be augmented, for example,
with ISOcat links, added as regular features to already existing feature
structures. This way, the mapping between different tier structures and annotation
schemes remains very general and does not depend on any file format.
In this sense we see Poio API as one important building block to support cross-corpus
research projects in the future.
CLASS web application as gateway to the archive
The CLASS web application implements tools for search and analysis based on
Poio API, provides easy-to-use web interfaces to facilitate field linguists'
research and integrates with the CLARIN service infrastructure. In order to offer
a convenient web-based workflow it is desirable that the users of the application
may access resource files for analysis directly from CLARIN data centers and
The main target of this effort is currently the DoBeS corpus, a core resource
maintained by the TLA (The Language Archive) at the MPI (Max Planck Institute
for Psycholinguistics, Nijmegen, NL). Since most of the collections within the
corpus are protected on a personalized level for privacy and ethical reasons and
may only be accessed by the corresponding owner or research group, the retrieval
of data by external services has not been viable in the past.
The CLASS web application will introduce a solution to this problem using single
sign-on technology in a delegated scenario. The implementation will enable users
to allow the CLASS service to access the data without exposing their credentials
to the service itself. This behavior has become increasingly popular with social
networks and cloud services (Google, Dropbox, Twitter, Facebook etc.) and is widely
accepted by internet users but has had little impact on scientific practice.
The infrastructure for Federated Authentication in scientific networks has already
been established by national and international programs such as CLARIN using the
Shibboleth/SAML protocol. Maintenance and configuration of delegated scenarios in
large federations, however, prove impractical, and the underlying profile ECP
(Enhanced Client or Proxy) is currently not supported by the majority of Identity
Providers within CLARIN and DFN-AAI (German Authentication and Authorization
Infrastructure). To reach the goals of this project we are collaborating with
the TLA in introducing a SAML/OAuth2 bridge, thus combining Federated Authentication
with the benefits of OAuth2 delegation. The experimental nature of this implementation
comes with first-mover risk but will greatly improve accessibility of
research data in field linguistics once successfully established.
Summary and outlook
We have shown how our project maps data from a common XML format in language documentation to a
unified data structure and its XML representation as defined in ISO 24612. Those annotation
graphs are a big step forward on the way to unified access to data and annotations from language
documentation projects. Future projects in language typology and quantitative language comparison
can make use of this unified access via the easy-to-use programming interface as defined
in Poio API, without having to handle and parse different file formats. Once the data
has been transformed into annotation graphs as defined by GrAF, the individual tiers can be shared
as GrAF-XML and extended independently without any supervising instance. Our hope is that this
will foster collaboration in future analysis projects. GrAF-XML can then be used as a common
interchange format in those projects.
Banski, Piotr and Przepiórkowski, Adam (2009).
Stand-off TEI annotation: the case of the National Corpus of Polish;
Proceedings of the Third Linguistic Annotation Workshop (LAW III), pp. 65–67.
Bird, Steven and Liberman, Mark (2001).
A formal framework for linguistic annotation;
Speech Communication 33, pp. 23-60. doi:10.1016/S0167-6393(00)00068-6.
Cayless, Hugh A. and Soroka, Adam (2010).
On implementing string-range() for TEI;
Proceedings of Balisage: The Markup Conference 2010
[online] [cited 19 April 2013].
Ide, Nancy and Suderman, Keith (2007).
GrAF: A graph-based format for linguistic annotations;
Proceedings of the Linguistic Annotation Workshop, pp. 1–8,
Prague, Czech Republic, June. Association for Computational Linguistics. doi:10.3115/1642059.1642060.
Ide, Nancy and Suderman, Keith (2009).
Bridging the gaps: interoperability for GrAF, GATE, and UIMA;
Proceedings of the Third Linguistic Annotation Workshop, pp. 27–34,
Suntec, Singapore, August 6-7. Association for Computational Linguistics. doi:10.3115/1698381.1698385.