Field linguists in language documentation projects have increasingly adopted the latest technologies and tools. Their work has led to remarkable developments in digital corpora, exemplified by The Language Archive at the MPI in Nijmegen. The next step in research is now the analysis and theoretical exploitation of the huge amount of data that has been collected in numerous language documentation projects. This research will rely on computer-based strategies, as the data is already available in digital formats.
In our talk we will present a recent project within the CLARIN framework ("Common Language Resources and Technology Infrastructure") that develops a solution for researchers to access the data collected in language documentation projects via GrAF data structures. Our project consists of three parts:
graf-python: a Python implementation of GrAF as defined in the ISO document;
Poio API: a Python library that maps data formats used in language documentation to GrAF and back again;
CLASS: a web application that provides a REST and user interface to access, search and modify the data.
A Common Research Infrastructure
The project was developed by the German section of the European CLARIN project. The CLARIN project aims at establishing a web-based research infrastructure for the social sciences and humanities. The infrastructure embraces existing standards and focusses on loosely coupled REST-style web services. To assure interoperability, the Text Corpus Format (TCF), an exchange format for annotations, has been defined. TCF is fully compatible with the Linguistic Annotation Framework (LAF) and the Graph-based Format for Linguistic Annotations (GrAF).
The DoBeS archive, one of the biggest digital archives for data from language documentation projects, is part of this infrastructure project. The CLASS web service interfaces with this archive as part of the CLARIN infrastructure.
GrAF and Poio API
One problem that our research field faces at the moment is that the data and annotations are highly heterogeneous in layout. Not only are there different file formats in the archives, but the structure of "tiers" and the annotation schemes used differ from project to project. Our implementation aims at unifying the existing data formats and the existing tier hierarchies into a standardized pivot format as defined by ISO 24612 "Language resource management - Linguistic annotation framework (LAF)". This standard is based on annotation graphs and was developed as an underlying data model for linguistic annotations [Ide/Suderman 2007]. The "Graph Annotation Format" (GrAF), an implementation of LAF, was originally developed to publish the "Manually Annotated Subcorpus" (MASC) of the American National Corpus, and consists of three parts:
an abstract data model;
an API for manipulating the data model;
a simple XML serialization of the data model.
Converting from and to GrAF and GrAF-XML
We started with the XML representation used by Elan, a software tool that is the de-facto standard in language documentation at the moment. First, we separated the metadata of Elan's EAF files from the bare content with data and annotations. Most metadata contained in Elan files describes the types and restrictions on linguistic tiers, which we store separately from the graphs. The graphs themselves capture two items: the tier hierarchy in the EAF file and the content of the annotations.
The following listing shows the start of one tier that stores "utterances" in the EAF file:
<TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="utterance" PARTICIPANT="" TIER_ID="W-Spch">
  <ANNOTATION>
    <ALIGNABLE_ANNOTATION ANNOTATION_ID="a8" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts23">
      <ANNOTATION_VALUE>so you go out of the Institute to the Saint Anna Straat.</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION>
  </ANNOTATION>
  <ANNOTATION> [...] </ANNOTATION>
</TIER>
The tokenization into individual words is stored in a separate child tier for "words":
<TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="words" PARENT_REF="W-Spch" PARTICIPANT="" TIER_ID="W-Words">
  <ANNOTATION>
    <ALIGNABLE_ANNOTATION ANNOTATION_ID="a23" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6">
      <ANNOTATION_VALUE>so</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION>
  </ANNOTATION>
  <ANNOTATION> [...] </ANNOTATION>
</TIER>
Since both tiers are time-aligned, the underlying video or audio data serves as the reference point to describe which annotations of a child tier are contained in one annotation of the parent tier. As this layout is already a stand-off annotation (although everything is stored in a single file), the transformation to an annotation graph is an easy task. The utterance is stored as a GrAF node with an annotation that contains a feature structure. The first annotation of the tier W-Spch with the annotation value "so you go out of the Institute to the Saint Anna Straat." looks like this in GrAF:
<node xml:id="utterance..W-Spch..na8">
  <link targets="utterance..W-Spch..ra8"/>
</node>
<region anchors="780 4090" xml:id="utterance..W-Spch..ra8"/>
<a as="utterance" label="utterance" ref="utterance..W-Spch..na8" xml:id="a8">
  <fs>
    <f name="annotation_value">so you go out of the Institute to the Saint Anna Straat.</f>
  </fs>
</a>
The <node> is linked to a <region> that contains the values of the time slots of the original EAF file. The annotation <a> for the node has a feature structure <fs> with one feature <f> for the annotation value. The corresponding annotation for the "word" annotation in EAF looks like this:
<node xml:id="words..W-Words..na23">
  <link targets="words..W-Words..ra23"/>
</node>
<region anchors="780 1340" xml:id="words..W-Words..ra23"/>
<edge from="utterance..W-Spch..na8" to="words..W-Words..na23" xml:id="ea23"/>
<a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
  <fs>
    <f name="annotation_value">so</f>
  </fs>
</a>
The node for the word annotation is similar to the utterance node, except for an additional <edge> tag that links the node to the corresponding utterance node. Nodes and edges are created for all annotations of tiers that have time slots and a parent tier in EAF (those tiers have the stereotype "Time Subdivision" in EAF).
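The mapping described above can be sketched in a few lines of Python. This is a minimal illustration only and not the Poio API implementation: the function name eaf_to_graph, the plain dictionaries used for nodes, and the abridged EAF document are assumptions made for this example. The time slot values are taken from the region anchors in the GrAF listings, and a child annotation is attached to the parent annotation whose time span contains it.

```python
import xml.etree.ElementTree as ET

# Abridged EAF document (assumed for illustration); the time values
# match the region anchors of the GrAF listings above.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="780"/>
    <TIME_SLOT TIME_SLOT_ID="ts6" TIME_VALUE="1340"/>
    <TIME_SLOT TIME_SLOT_ID="ts23" TIME_VALUE="4090"/>
  </TIME_ORDER>
  <TIER LINGUISTIC_TYPE_REF="utterance" TIER_ID="W-Spch">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a8"
          TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts23">
        <ANNOTATION_VALUE>so you go out of the Institute to the Saint Anna Straat.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER LINGUISTIC_TYPE_REF="words" PARENT_REF="W-Spch" TIER_ID="W-Words">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a23"
          TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6">
        <ANNOTATION_VALUE>so</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def eaf_to_graph(eaf_xml):
    """Return (nodes, edges) for the time-aligned tiers of an EAF string."""
    root = ET.fromstring(eaf_xml)
    # Resolve time slot IDs to their time values.
    times = {slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
             for slot in root.iter("TIME_SLOT")}
    nodes, tier_parent = {}, {}
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        tier_parent[tier_id] = tier.get("PARENT_REF")
        # Node IDs follow the pattern <linguistic type>..<tier>..n<annotation id>.
        prefix = "{0}..{1}..n".format(tier.get("LINGUISTIC_TYPE_REF"), tier_id)
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            nodes[prefix + ann.get("ANNOTATION_ID")] = {
                "tier": tier_id,
                "anchors": (times[ann.get("TIME_SLOT_REF1")],
                            times[ann.get("TIME_SLOT_REF2")]),
                "annotation_value": ann.findtext("ANNOTATION_VALUE", default=""),
            }
    # Link each child node to the parent-tier node whose region contains it.
    edges = []
    for child_id, child in nodes.items():
        start, end = child["anchors"]
        for parent_id, parent in nodes.items():
            if (parent["tier"] == tier_parent[child["tier"]]
                    and parent["anchors"][0] <= start
                    and end <= parent["anchors"][1]):
                edges.append((parent_id, child_id))
                break
    return nodes, edges


nodes, edges = eaf_to_graph(EAF)
```

For the two tiers shown, the sketch produces one node per annotation and a single edge from the utterance node to the word node, mirroring the <edge> element in the GrAF listing.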
Poio API contains a plugin mechanism to allow future support of other file formats that can be mapped onto GrAF structures. This mechanism consists of base classes, among them Writer classes, that are used by a general Converter class which constructs the graph. This architecture minimizes the amount of work needed to implement support for new file formats. The basic idea is that we map the semantics of file formats in language documentation onto annotation graphs, thus creating a certain subset of general annotation graphs that is suitable to store the information from the original files. The notion of "tiers" also has to be mapped and stored in annotation graphs during conversion. This will be explained in the next section.
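To make the idea of such a plugin mechanism concrete, the following is a minimal sketch of how format-specific classes could plug into a generic converter. All class and method names here (Parser, Writer, Converter, ListParser, TextWriter) are hypothetical and chosen for illustration; they are not the actual Poio API interfaces, and the graph is reduced to a plain dictionary.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Hypothetical reader interface: one subclass per input format."""

    @abstractmethod
    def tiers(self):
        """Yield (tier_name, parent_tier_name or None) pairs."""

    @abstractmethod
    def annotations(self, tier_name):
        """Yield (annotation_id, value) pairs of one tier."""


class Writer(ABC):
    """Hypothetical writer interface: one subclass per output format."""

    @abstractmethod
    def write(self, graph):
        """Serialize the graph and return it as text."""


class Converter:
    """Generic converter: builds the graph from any Parser."""

    def __init__(self, parser):
        self.parser = parser
        self.graph = {"nodes": {}, "tier_parents": {}}

    def parse(self):
        for tier, parent in self.parser.tiers():
            self.graph["tier_parents"][tier] = parent
            for ann_id, value in self.parser.annotations(tier):
                self.graph["nodes"]["{0}..n{1}".format(tier, ann_id)] = value
        return self.graph


class ListParser(Parser):
    """Toy format: {tier: (parent, [(annotation_id, value), ...])}."""

    def __init__(self, data):
        self.data = data

    def tiers(self):
        for tier, (parent, _) in self.data.items():
            yield tier, parent

    def annotations(self, tier_name):
        return iter(self.data[tier_name][1])


class TextWriter(Writer):
    """Toy output: one tab-separated line per node."""

    def write(self, graph):
        return "\n".join("{0}\t{1}".format(node_id, value)
                         for node_id, value in sorted(graph["nodes"].items()))


converter = Converter(ListParser({
    "utterance": (None, [("a8", "so you go")]),
    "word": ("utterance", [("a23", "so")]),
}))
graph = converter.parse()
text = TextWriter().write(graph)
```

In this sketch, supporting a new file format only requires a new Parser or Writer subclass; the graph construction in Converter stays untouched.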
The tier hierarchy of an Elan file is stored only implicitly in the layout of our graph. However, there is no restriction in GrAF itself that would prevent a node in the word tier from connecting to nodes in tiers other than the utterance tier, which is not possible in Elan. To describe those restrictions we added another data structure to Poio API that we call "data structure type". This data structure stores separately the information about any tier hierarchy, which we would otherwise have to extract from the graph later. A simple data structure type indicating that the researcher wants to tokenize a text into words before adding a word-for-word translation and a translation for the whole utterance looks like this:
[ 'utterance', [ 'word', 'wfw' ], 'translation' ]
One advantage of representing annotation schemes through such simple trees is that linguists instantly understand how a tree works and can give a representation of "their" annotation schema. In language documentation and general linguistics, researchers tend to create ad-hoc annotation schemes according to their background and then normally create only those annotations related to their current research project. This is reflected, for example, in annotation software like Elan, where the user can freely create tiers with any names and arrange them in custom hierarchies. As we need to map this data onto our internal representation, we try to ease the creation of custom annotation schemes that are easy for users to understand. Users can thus create their own data structure types and derive the annotation schemes for GrAF files from those structures.
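As an illustration of how such a nested list can be used to restrict the graph, the sketch below derives a parent for every tier and checks whether an edge between two tiers is permitted. The function names are hypothetical, and the reading of the list is an assumption: the first element of a list is taken as the parent of all remaining elements, and a nested list is a subtree of its own.

```python
def tier_parents(hierarchy, parent=None):
    """Return {tier: parent tier} for a nested-list data structure type.

    Assumed reading: the first element of each list is the parent of all
    following elements; a nested list starts a subtree.
    """
    head, rest = hierarchy[0], hierarchy[1:]
    parents = {head: parent}
    for item in rest:
        if isinstance(item, list):
            parents.update(tier_parents(item, head))
        else:
            parents[item] = head
    return parents


def edge_allowed(parents, parent_tier, child_tier):
    """True if the data structure type permits an edge between the tiers."""
    return parents.get(child_tier) == parent_tier


parents = tier_parents(['utterance', ['word', 'wfw'], 'translation'])
```

Under this reading, 'word' and 'translation' hang below 'utterance', and 'wfw' hangs below 'word', so an edge from a translation node to a word node would be rejected.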
The node IDs in our GrAF data have a prefix that also contains the tier name (for example W-Spch in the utterance node ID utterance..W-Spch..na8 from the listing above). In the case of Elan the tier name in the node ID is preceded by another string, the so-called "Linguistic Type" of Elan. This is a general restriction of our annotation graphs: we use those string IDs to be able to query the graphs quickly, for example to get all annotations which are children of another annotation on a certain tier, or to get all annotations for a "Linguistic Type". In other words, we apply the semantics of tier-based annotation to annotation graphs and make sure that the graphs we create, and that the user handles in Poio API, are a valid subset of general annotation graphs as we defined it. The advantage of this approach is that within Poio API we can be sure that the annotation graphs show a certain layout. The following sections will give an outlook on how this basic unification will support collaborations and scientific workflows in the future.
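The kind of query that these structured IDs make possible can be sketched as follows. The helper names are hypothetical and not part of Poio API; the sketch assumes that IDs follow the pattern seen in the listings (linguistic type, tier name and "n" plus annotation ID, joined by "..") and that tier names do not themselves contain "..". The node ID ending in na24 is invented for the example.

```python
from collections import namedtuple

NodeId = namedtuple("NodeId", "linguistic_type tier annotation")


def parse_node_id(node_id):
    """Split an ID such as 'utterance..W-Spch..na8' into its three parts."""
    linguistic_type, tier, local = node_id.split("..")
    return NodeId(linguistic_type, tier, local[1:])  # drop the leading "n"


def nodes_of_type(node_ids, linguistic_type):
    """All node IDs that belong to one Elan "Linguistic Type"."""
    return [n for n in node_ids
            if parse_node_id(n).linguistic_type == linguistic_type]


def children_on_tier(edges, parent_id, tier):
    """All children of one annotation node that live on a given tier."""
    return [child for source, child in edges
            if source == parent_id and parse_node_id(child).tier == tier]


node_ids = ["utterance..W-Spch..na8",
            "words..W-Words..na23",
            "words..W-Words..na24"]  # na24 is a made-up second word node
edges = [("utterance..W-Spch..na8", "words..W-Words..na23"),
         ("utterance..W-Spch..na8", "words..W-Words..na24")]

word_nodes = nodes_of_type(node_ids, "words")
children = children_on_tier(edges, "utterance..W-Spch..na8", "W-Words")
```

Because tier and type are encoded in the ID, both queries reduce to string comparisons and need no lookup of the annotations themselves.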
We also want to mention that we store the annotations of each tier in a separate file in GrAF-XML. This radical stand-off approach allows easy sharing of annotation files and unsupervised collaboration, as discussed in section “Annotation graphs in linguistic workflows and collaborative projects”. Still, we do not propose to introduce GrAF-XML as another file format for software tools or archiving in language documentation, but as an easy way to serialize annotation graphs for later processing in a scientific workflow.
All in all, it was an easy task to map an Elan EAF file to an annotation graph in GrAF. Those graphs are now the basis for other mappings of file formats onto GrAF. We are currently working on mapping from SIL Toolbox files, Typecraft XML files and Weblicht TCF files.
Annotation graphs in linguistic workflows and collaborative projects
Annotation graphs have long been discussed as the underlying data model for linguistic annotations [Bird/Liberman 2009]. Furthermore, the graphs themselves can be used to extend and augment existing corpora without a central, supervising instance. As the XML representation of GrAF constitutes a radical stand-off approach, it is possible to add and mix annotation tiers in collaborative projects without administrative overhead (for advantages of radical stand-off solutions over previous approaches see [Cayless/Soroka 2010] and [Banski/Przepiórkowski 2009]). The amount of work needed to manage and synchronize the data is thereby kept to a minimum.
In contrast to Elan, we store each tier in a separate XML file, so that users can share annotations independently of any other tier and across sources. With this stand-off approach, a common data structure and unified access via an API facilitate the analysis of linguistic data, for example in typological research. GrAF itself was developed with interoperability in mind, so researchers can also use existing workflow tools like GATE and Apache UIMA in their analysis [Ide/Suderman 2009].
Last but not least, existing graph-based methods can now be applied directly to the linguistic graphs in analysis, for example graph-traversal algorithms to generate co-occurrence statistics of annotations (for preliminary thoughts in that direction see also [Ide/Suderman 2007]). Graph visualization techniques can be used immediately by researchers to gain insight into their data and allow distant reading.
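As a small illustration of such a graph-based method, the sketch below counts how often two annotation values occur under the same parent node, using a one-step traversal of the outgoing edges. It is a simplified example under stated assumptions: the graph is given as a plain edge list with a dictionary of annotation values, the node names and data are invented, and the function name cooccurrences is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrences(values, edges):
    """Count pairs of annotation values that share a parent node.

    values: {node id: annotation value}
    edges:  iterable of (parent node id, child node id)
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    counts = Counter()
    for kids in children.values():
        # Each distinct value counts once per parent; pairs are sorted
        # so that (a, b) and (b, a) fall into the same bucket.
        labels = sorted({values[kid] for kid in kids})
        counts.update(combinations(labels, 2))
    return counts


# Invented toy data: two utterance nodes with their word nodes.
values = {"w1": "so", "w2": "you", "w3": "go", "w4": "so", "w5": "you"}
edges = [("u1", "w1"), ("u1", "w2"), ("u1", "w3"),
         ("u2", "w4"), ("u2", "w5")]

counts = cooccurrences(values, edges)
```

Here "so" and "you" co-occur in both utterances, while "go" appears with each of them only once.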
Language Documentation and Language Typology
Data used in documentary linguistics and language typology is different from data used in other areas of linguistics. The data produced in language documentation projects consists mostly of audio and video recordings with time-aligned annotations. This predominance of time-aligned annotations determines the requirements for data structures. Because of these requirements an annotation format used in this area of linguistics should allow the definition of regions of media recordings using timestamps as well as string ranges of annotations.
The linguistic diversity of the data in documentary linguistics and language typology makes the creation of standardized tag sets difficult, if not impossible, and even tier hierarchies differ considerably. Furthermore, many documented and analyzed languages will be subjected to further research, and projects need to work with competing annotations and analyses. The specifics of the data and research questions require a format that can represent these complex and varied structures while providing unified access to data and annotations.
We hope that the usage of annotation graphs as pivot structures and the unification of tier-based annotation via our subset of graphs will help in the ongoing task of unifying tier structures and annotation schemes. The annotation graphs that our library constructs may be augmented, for example, with ISOcat links, added as regular features to already existing feature structures. This way, the mapping between different tier structures and annotation schemes remains very general and does not depend on any file format. In this sense we see Poio API as one important building block to support cross-corpus research projects in the future.
CLASS web application as gateway to the archive
The CLASS web application implements tools for search and analysis based on Poio API, provides easy-to-use web interfaces to facilitate field linguists' research and integrates with the CLARIN service infrastructure. In order to offer a convenient web-based workflow, it is desirable that the users of the application can access resource files for analysis directly from CLARIN data centers and other sources.
The main target of this effort is currently the DoBeS corpus, a core resource maintained by the TLA (The Language Archive) at the MPI (Max Planck Institute for Psycholinguistics, Nijmegen, NL). Since most of the collections within the corpus are protected on a personalized level for privacy and ethical reasons and may only be accessed by the corresponding owner or research group, the retrieval of data by external services has not been viable in the past.
The CLASS web application will introduce a solution to this problem using single sign-on technology in a delegated scenario. The implementation will enable users to allow the CLASS service to access the data without exposing their credentials to the service itself. This behavior has become increasingly popular with social networks and cloud services (Google, Dropbox, Twitter, Facebook etc.) and is widely accepted by internet users but has had little impact on scientific practice.
The infrastructure for Federated Authentication in scientific networks has already been established by national and international programs such as CLARIN using the Shibboleth/SAML protocol. Maintenance and configuration of delegated scenarios in large federations, however, prove impractical, and the underlying profile ECP (Enhanced Client or Proxy) is currently not supported by the majority of Identity Providers within CLARIN and DFN-AAI (German Authentication and Authorization Infrastructure). To reach the goals of this project we are collaborating with the TLA in introducing a SAML/OAuth2 bridge, thus combining Federated Authentication with the benefits of OAuth2 delegation. The experimental nature of this implementation carries the usual first-mover risk, but it will greatly improve the accessibility of research data in field linguistics once successfully established.
Summary and outlook
We have shown how our project maps data from a common XML format in language documentation to a unified data structure and its XML representation as defined in ISO 24612. Those annotation graphs are a big step forward on the way to unified access to data and annotations from language documentation projects. Future projects in language typology and quantitative language comparison can make use of this unified access via the easy-to-use programming interface defined in Poio API, without having to handle and parse different file formats. As soon as the data has been transformed into annotation graphs as defined by GrAF, the individual tiers can be shared as GrAF-XML and extended independently without any supervising instance. Our hope is that this will foster collaboration in future analysis projects. GrAF-XML can then be used as a common interchange format in those projects.
Banski, Piotr and Przepiórkowski, Adam (2009). Stand-off TEI annotation: the case of the National Corpus of Polish. Proceedings of the Third Linguistic Annotation Workshop (LAW III), pp. 65–67.
Cayless, Hugh A. and Soroka, Adam (2010). On implementing string-range() for TEI. Proceedings of Balisage: The Markup Conference 2010 [online] [cited 19 April 2013].
Ide, Nancy and Suderman, Keith (2007). GrAF: A graph-based format for linguistic annotations. Proceedings of the Linguistic Annotation Workshop, pp. 1–8, Prague, Czech Republic, June. Association for Computational Linguistics. doi:10.3115/1642059.1642060.
Ide, Nancy and Suderman, Keith (2009). Bridging the gaps: interoperability for GrAF, GATE, and UIMA. Proceedings of the Third Linguistic Annotation Workshop, pp. 27–34, Suntec, Singapore, August 6-7. Association for Computational Linguistics. doi:10.3115/1698381.1698385.
 http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format, accessed 19.4.2013
 http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326, accessed 21.3.2013
 http://www.americannationalcorpus.org/MASC/Home.html, accessed 19.4.2013
 Elan uses an XML format called EAF: http://www.mpi.nl/tools/elan/EAF_Annotation_Format.pdf, accessed 19.4.2013