A directed graph (or digraph), [...] denoted G = (V, E), consists of a finite set of
vertices [(or nodes)] V and a set of ordered pairs of vertices E called arcs. We denote an
arc from v to w by v→w. A path in a digraph is a sequence of vertices v1,
v2,...,vk, k≥1, such that
vi→vi+1 is an arc for each i,
1≤i<k. We say the path is from v1 to
vk. [...] If v→w is an arc we say v is a predecessor of w
and w is a successor of v.
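
The definitions above can be made concrete with a small sketch. The class name Digraph and its method names are illustrative assumptions and are not taken from any of the cited works; the path test simply checks that every consecutive pair of vertices in the sequence is an arc.

```python
from dataclasses import dataclass, field


@dataclass
class Digraph:
    """A finite directed graph G = (V, E); arcs are stored as ordered pairs."""
    vertices: set = field(default_factory=set)
    arcs: set = field(default_factory=set)

    def add_arc(self, v, w):
        # Adding an arc v -> w also registers both endpoints as vertices.
        self.vertices.update((v, w))
        self.arcs.add((v, w))

    def predecessors(self, w):
        # All v such that v -> w is an arc.
        return {v for (v, x) in self.arcs if x == w}

    def successors(self, v):
        # All w such that v -> w is an arc.
        return {w for (x, w) in self.arcs if x == v}

    def is_path(self, seq):
        # v1..vk (k >= 1) is a path if vi -> vi+1 is an arc for 1 <= i < k.
        if not seq or any(v not in self.vertices for v in seq):
            return False
        return all((a, b) in self.arcs for a, b in zip(seq, seq[1:]))


g = Digraph()
g.add_arc("a", "b")
g.add_arc("b", "c")
print(g.is_path(["a", "b", "c"]), g.is_path(["a", "c"]))
```

A single vertex counts as a path of length zero here, which matches the condition k≥1 in the definition.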
An ordered, directed tree is a digraph that has a single root node (a node
that has no predecessors and from which there is a path to every vertex).
Each node other than the root node has exactly one predecessor and is connected to this single
parent via one (and only one) edge. The successors of each node are ordered from left to right
(, p. 3).

It is usually agreed that XML instances follow the formal model of a single-rooted
tree: in the XML specification it is stated that [t]here is exactly one element, called
the root, or document element, no part of which appears in the content of any other element.
For all other elements, if the start-tag is in the content of another element, the end-tag
is in the content of the same element. More simply stated, the elements, delimited by start-
and end-tags, nest properly within each other. And indeed, if we stick with the
nesting of elements (and attributes) we end up with a tree. A tree, however, has certain
limitations: since crossing arcs are not allowed, it is not possible to use a tree model for
the annotation of discontinuous segments (for example multi-word idioms discussed in or the Alice in Wonderland example quoted in ). Although it would be possible to use TEI's milestone
elements or fragmentation (see ) one would still have to deal with
separate element instances, that is the relation between the parts of the elements would be
implicit.

A related disadvantage of trees is that it is often not possible to annotate concurrent
– and possibly overlapping – hierarchies.
A hierarchy is formed by a subset
of the elements of the markup language used to encode the document. The elements within a
hierarchy have a clear nested structure. When more than one such hierarchy is present in the
markup language, the hierarchies are called concurrent.
(, p. 186).

Even if two concurrent hierarchies do not overlap, it is
impossible to merge them into a single tree if they do not share the same root, since trees
are only allowed to have a single root node (see definition above). But the major problem
related to concurrent markup is that multiple hierarchies may lead to multiple parentage of
nodes:
Overlap can be represented by graphs that are very like trees, but in which nodes may
have multiple parents. Overlap is multiple parentage.
().

Since linguistic corpora are one of the main driving forces behind the creation of multi-dimensionally annotated
documents, the TEI Guidelines have not only
raised awareness among Digital Humanities scholars of the problems in this
field of research, but also provided some solutions to them. However, the different possible
solutions (multiple documents, milestone elements, fragmentation and standoff markup) that are
part of Chapter 20 of the aforementioned Guidelines all have disadvantages.
Using multiple documents (cf. Section 20.1 of ) results in redundant
storage of the primary data, that is the character stream which is to be annotated and – as
an effect – makes further changes to both primary data and annotation files time-consuming,
which in turn can result in inconsistencies between the various instances. In addition
there is no explicit indication that the various views, which might be in separate files,
are related to each other: it might prove difficult to combine the views or access information
from one view while processing the file that contains the encoding of another (, p. 621). The last point can be addressed by using the primary data as
reference system, that is the positions in the character stream delimit the start and end points
of corresponding markup, see (which is already referred to in the
Guidelines) or and the standoff approaches discussed below. The
related approach of twin documents shown in in addition to the primary data redundantly stores the so-called
sacred markup, that is markup which is shared between different
annotation layers (in contrast to profane markup that is related
to a single layer). Although redundancy may lead to an improved sustainability (according to
) we tend to follow the Guidelines in believing that the price in form
of possible inconsistencies is too high.

For these reasons several proposals for graph-based formal models and alternative representation
formats have been discussed in the last decade. As already stated above, a graph is the
superclass of trees and therefore allows both multiple parentage and multiple root nodes.
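
To illustrate why multiple parentage and multiple roots fall outside the tree model, the following sketch tests the tree conditions given earlier (a single root, exactly one predecessor for every other node, every vertex reachable from the root). The function name is_directed_tree and the sample node names are hypothetical, and the left-to-right ordering of successors is ignored for brevity.

```python
def is_directed_tree(vertices, arcs):
    """Return True if (vertices, arcs) satisfies the directed-tree conditions."""
    preds = {v: [] for v in vertices}
    succs = {v: [] for v in vertices}
    for v, w in arcs:
        preds[w].append(v)
        succs[v].append(w)

    # Exactly one node without predecessors: the root.
    roots = [v for v in vertices if not preds[v]]
    if len(roots) != 1:
        return False
    root = roots[0]

    # Every other node has exactly one predecessor (no multiple parentage).
    if any(len(preds[v]) != 1 for v in vertices if v != root):
        return False

    # Every vertex must be reachable from the root.
    seen, stack = set(), [root]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(succs[v])
    return seen == set(vertices)


nested = is_directed_tree({"text", "p", "q"}, [("text", "p"), ("p", "q")])
two_parents = is_directed_tree(
    {"text", "l", "q", "w"},
    [("text", "l"), ("text", "q"), ("l", "w"), ("q", "w")],
)
two_roots = is_directed_tree({"r1", "r2", "x"}, [("r1", "x"), ("r2", "x")])
print(nested, two_parents, two_roots)
```

The second and third inputs are still perfectly valid directed acyclic graphs; they merely fail the stricter tree conditions.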
Again, first proposals for the XML representation of graphs can be found in Chapter 18 of the TEI Guidelines, which introduces feature structures. (It may be of interest that the mention of feature structures in the TEI Guidelines can be traced back to the first proposal (P1), written in Waterloo Script. Even this draft version, dated 1990, covered feature structures as a means for linguistic annotation.)
Feature structures are single-rooted labeled directed acyclic graphs, often displayed as attribute value matrices,
that can be used for representing various kinds of information. The TEI approach was standardized as international
standard
and can be used as serialization format for multiple annotations as shown by .
However, as discussed in that paper, the resulting XML instances can become very large, which limits the practical use of this
approach.
Another alternative formal model for markup languages that has received much attention is the
General Ordered-Descendant Directed Acyclic Graph (GODDAG) which was introduced in (see for a more
recent discussion). To be more precise, there is a whole range of GODDAG sub-classes, such as
the restricted GODDAG (r-GODDAG), the generalized GODDAG, the clean GODDAG, the normalized
GODDAG and the colored GODDAG (the latter two have been introduced in ). (taken from ) shows a GODDAG representing the aforementioned
Alice in Wonderland example.

GODDAGs (and especially clean r-GODDAGs) can be serialized as TexMECS instances (see for a detailed discussion about the relationships between GODDAG
sub-class and TexMECS serialization). The respective GODDAG serialization of the above-named example
is shown below:

<p|Alice
was beginning to get very tired ...
it had no pictures or conversations in it,
<q|and what is the use of a book,|-q>
thought Alice
<+q|without pictures or conversation?|q>
|p>

Apart from TexMECS there are other serialization options for representing GODDAGs.
Especially the work done by is of interest, since they have
shown that a data structure based on RDF, called EARMARK (Extreme Annotational RDF Markup), not
only fully supports the expressiveness of GODDAGs but additionally introduces a new sub-type,
called e-GODDAG (extended GODDAG) that adds anonymous non-terminal nodes (for establishing
multiple arcs between two nodes and therefore allowing repetitive structures).

A second alternative data model for markup languages is the Annotation Graph introduced by
which was especially designed for linguistic annotations. An AG
formally is a labeled directed acyclic graph (labeled DAG) which uses an
order-preserving map assigning times to (some of) the nodes (, p. 2). This formal model is used for example in the annotation tool
EXMARaLDA discussed in . An extended version can be found in the
NITE Object Model (cf. , ) which
combines hierarchies between nodes (similar to ordered directed trees) and the timing
information. Both formal models use plain XML as serialization format. We will discuss this
finding in a few paragraphs.

The third alternative formal model is based on the Core Range Algebra, introduced in and extended in . It uses flat ranges over
the primary data and allows for overlapping ranges. A related serialization format is the
Layered Markup and Annotation Language (LMNL, , , ). LMNL uses the primary data as base
consisting of zero or more atoms (representing a Unicode char or something completely
different). Ranges over the base contain the atoms between a matching start tag and end tag
and may overlap. Even self-overlap (that is overlapping of elements, or ranges that bear the
same generic identifier, see for an example) is supported, as
well as anonymous ranges (similar to the aforementioned e-GODDAGs). Annotations can be located
at both the start and end tag and since LMNL completely abandons hierarchy there is no need
for a 'root range' (although the containment relation can be used via the use of base layers,
see ). Despite its naming as 'markup language' LMNL was developed as
a formal model, therefore several serialization formats exist. Apart from LMNL's own Sawtooth
syntax there is Canonical LMNL in XML (CLIX, formerly known as HORSE, Hierarchy-Obfuscating
Really Spiffy Encoding, , ), ECLIX
(extended CLIX) and xLMNL. While CLIX and ECLIX use TEI milestone elements, xLMNL is a flat
representation, similar to a standoff approach (examples of all these formats can be found at
http://www.piez.org/wendell/papers/dh2010/clix-sonnets/). shows a
possible graphical representation of ranges and annotations in LMNL (here syllables and morphemes).
Note that, due to the two-dimensional
approach, the hierarchy that is implied by the vertical arrangement of the bars is not compulsory in LMNL.

Other approaches that shall be mentioned here for the sake of completeness are multi-colored
XML (cf. ), the use of delay nodes (), the tabling approach described by and XCONCUR by . While some of the aforementioned data models make use of a
serialization format of their own, others succeed in using plain XML. This indicates that the
formal model of XML instances has a greater expressive power than a directed ordered tree. And
indeed, if we leave the field of Digital Humanities, there are a number of authors who
agree that the formal model of XML instances is that of a graph: , , or . The discrepancy between these findings depends on whether one observes only the hierarchical
relations of elements or also takes the XML-inherent integrity constraints
into consideration, that is ID/IDREF/IDREFS token type attributes (in DTD) or
xs:ID/xs:IDREF/xs:IDREFS and xs:key/xs:keyref (in XSD) respectively. In this context a line can be drawn between well-formed XML instances (in that case we still have to deal with a tree) and valid
XML instances according to a document grammar that makes use of the aforementioned integrity
constraints. Using a native XML approach has the advantage of being able to make use not only of a
large range of software products but also of related specifications such as XPath, XSLT,
and XQuery. Especially the upcoming XSLT 3.0 is quite interesting since it supports streamable
transformations allowing for the manipulation of fairly big XML instances (cf. ). In addition, XML-based visualization formats such as the 2D SVG and
newer approaches such as the 3D X3D are promising formats for the visualization of concurrent
annotations (see and ). We have already found proofs
that the full power of valid XML instances can be used to serialize Annotation Graphs or LMNL
ranges. demonstrates that valid XML can even make use
of cyclic paths (or arcs) and therefore definitely exceeds the formal power of trees.

Together with the standoff approach mentioned both in the TEI Guidelines and , this expressive power can be used to capture multiply annotated data. In and the authors
discuss the XStandoff meta annotation format which is capable of representing discontinuous elements, multiple parentage and virtual elements (amongst others). Since it is XML-based
we have chosen it as one of the two formats (besides xLMNL) to discuss visualization
aspects.

XStandoff as a starting point for visualization

XStandoff is a representation format for multiple hierarchies which evolved from the work of
the research project Secondary structuring of information and comparative discourse
analysis (Sekimo). (The Sekimo project was part of the distributed research group Text Technological Modelling of
Information, which ran from 2003 to 2009.) The format is a successor of the Sekimo Generic Format (SGF, cf. ) and was presented in detail at Balisage 2009 (cf. , for current developments see the XStandoff website). XStandoff can be seen as the combination of the standoff approach
and the formal model of GODDAGs, capable of using native XML to represent multiple hierarchies and the
specifically challenging structures such as overlaps, discontinuous elements, or virtual elements. Since XStandoff makes use of the XML-inherent ID/IDREF mechanism the underlying model can
be seen as a graph and therefore the format is able to represent any graph-based structure.
Because of this it can become quite complicated to construct XStandoff instances manually. For
this reason the XStandoff toolkit was implemented (the stylesheets and corresponding documentation are available at
http://www.xstandoff.net/tk.html), providing XSLT 2.0 stylesheets for the
creation of XStandoff instances on the basis of standard inline XML annotations and their corresponding
primary data (inline2XSF.xsl), the merging of XSF instances
(mergeXSF.xsl), the extraction or deletion of levels or layers
from XStandoff instances (extractXSFcontent.xsl)
and the transformation of standard XStandoff instances to inline XStandoff representations
(XSF2inline.xsl), the latter mainly for demonstration
purposes. (Levels refer to the conceptual realization of annotations and layers to the technical
realization (cf. ). This distinction is reflected by XStandoff
in providing the corresponding meta elements <xsf:level> and
<xsf:layer>.)
The workflow for creating an XStandoff instance can be demonstrated by the following
example. The basis for the construction is given by two separate annotations () for a single primary data text ():

The stylesheet inline2XSF.xsl can be used to build
XStandoff instances for each of the input annotations by using the Saxon XSLT processor (inline2XSF.xsl makes use of Saxon-specific extensions
which are available in the older XSLT 2.0 versions of Saxon (-B and -SA) and the newer
versions PE and EE; see http://saxon.sourceforge.net/):

saxon -o:[output.xml] -s:[input.xml] -xsl:inline2XSF.xsl primary-data=[primary-data-file.txt]

Afterwards the two instances can be merged
with the help of the stylesheet mergeXSF.xsl:

saxon -o:[combined-output.xml] -s:[input-xsf-1.xml] -xsl:mergeXSF.xsl merge-with=[input-xsf-2.xml]

This process results in the integration of the separate annotations into a single XStandoff instance:

There are several parameters which can be specified by the user to influence the
actual serialization of the XStandoff annotation (for a detailed overview see the
online stylesheet documentation).
Apart from this, it should be obvious how the format deals with
challenging structures like overlaps or discontinuous elements, namely by instantiating an
underlying graph model through the use of string range references to parts of the primary data
(xsf:segment elements). At the same time the hierarchical structures of the input annotations are kept nearly
unchanged (except for the addition of the xsf:segment attribute which refers to the respective
xsf:segment element) by storing them separately under <xsf:level> and
<xsf:layer> elements. Note that there is no mandatory relationship
between the string ranges (containment) and the dominance relations implied by the hierarchical structure
(cf. the Alice in Wonderland example in ). In and we will present approaches
to the visualization of XStandoff instances like the one shown in . However, as discussed above, we would like to have a second
XML-based option as starting point for a visualization of concurrent markup. Therefore we explored the possibility of
converting other formats into XStandoff and vice versa. This would allow for the graphic rendering of distinct formats by the visualization
approaches we will introduce in and . As a possible candidate
for conversion we have chosen xLMNL which we will briefly
present in the following section.

xLMNL as a starting point for visualization

xLMNL is an XML-based serialization format for LMNL which was introduced by as
an ad-hoc solution for representing LMNL in XML. Since it uses string
ranges in a way similar to XStandoff, it was chosen as a starting point for a conversion project between
XStandoff and other XML-based formats.
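
The shared idea of string ranges can be sketched independently of either format. The names Segment, Annotation and covered_text are hypothetical, the sample text and offsets are invented for illustration, and the half-open offset convention (start inclusive, end exclusive) is an assumption of this sketch rather than a statement about XStandoff or xLMNL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A half-open character range [start, end) over the primary data."""
    start: int
    end: int


@dataclass(frozen=True)
class Annotation:
    layer: str        # hypothetical layer name
    name: str         # element name within that layer
    segment: Segment  # the string range the annotation refers to


def covered_text(primary: str, ann: Annotation) -> str:
    # The annotation never contains the text itself, only a range over it.
    return primary[ann.segment.start:ann.segment.end]


primary = "Hello world, said Alice."
annotations = [
    Annotation("syntax", "greeting", Segment(0, 12)),
    Annotation("discourse", "speaker", Segment(18, 23)),
]
for ann in annotations:
    print(ann.layer, ann.name, repr(covered_text(primary, ann)))
```

Because every layer points into the same character stream, the primary data is stored once and the layers stay independent of each other.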
The corresponding simplified xLMNL
serialization for the annotations shown in can be
seen in which demonstrates the use of character positions (in start and
end attributes) referring to the normalized textual content of x:content.

This illustrates the main difference between XStandoff and xLMNL in that the latter does not
consider a hierarchical structure and imposes a completely flat structure of annotations.
Admittedly, in contrast to dominance relations, containment relations can well be derived by
taking into account the string ranges. Nevertheless, the distinct approaches of xLMNL and
XStandoff towards the representation of potentially concurrent annotations constitute a
serious challenge for the conversion enterprise because annotation hierarchies are not present
in xLMNL. There are two possible ways to deal with this issue. Since XStandoff in principle
allows for the capturing of arbitrary graph-like structures, the xLMNL representation could be
integrated without making any assumptions about hierarchies. Another
strategy, which would make more sense if one wanted to visualize the annotations by the methods
introduced later on, would be the analysis of the individual relations between annotations on the
basis of their string ranges and to try to construct hierarchies of annotations by considering
the containment relations. Conflicting annotations could be separated from each other to avoid
representation problems. This strategy admittedly inserts information which is not directly present;
however, it would not be a problem to remove the additional information again in
a later step.

In future work we will examine creating a syntactic conversion framework, or integrating XStandoff into an existing one,
for existing representation formats like the one described in
. Although it would be possible to realize individual
format-to-format conversions, it seems much more straightforward to have a framework
which is based on a common model. For this purpose the above-mentioned meta markup language
EARMARK, which can be used to represent GODDAGs, appears to be a quite promising candidate for a
pivot format.

2D visualization of concurrent markup

Basic principles of the visualization of concurrent markup

For the visualization of concurrent markup there are two main issues to be considered
and solved:

- the illustration of the relationship of primary data and annotations
- the visualization of potentially overlapping annotations (including other
  tree-challenging phenomena like discontinuous elements)

In the case of XStandoff the visualization of multiple hierarchies can at first glance be based on a
relatively simple principle, namely the delineation of separate tree structures. This of course only makes sense when the focus is on dominance
relationships. As stated above, it is possible to represent graph structures, too. This will
be addressed in more detail in . Before that, we take a look
at a general visualization principle for multiple tree structures. A very
basic visualization method is given in where two annotation layers corresponding
to common textual primary data are represented by vertically ordered colored bars:

Here the horizontally ordered segments of each level represent the individual
annotations and their length is used to demonstrate the correspondence to the dominated
annotations (edges are inferable by the width of the bars) and the spanned textual content.
This strategy, as indicated above, is based on tree structure visualization. Admittedly it
could be used to represent minimal extensions to trees, for example multiple parents, which would
allow for the capturing of overlapping structures; remember that overlap is multiple
parentage (). However, there seems to
be no way to represent more advanced graph structures. In addition there are some
stylistic disadvantages: first of all, the overall width of the graphic and the visual
accessibility mainly depend on the length of the primary data. Secondly, in this basic
strategy line breaks from the primary data would have to be replaced in order to facilitate
the visualization of continuously ordered annotation segments.

The stylistic shortcomings named above could be dealt with by changing the direction of the
illustration and ordering the annotation levels horizontally. This concept can be
demonstrated on the basis of the annotations introduced in . Since there
is a classic overlap of the second l element
(/text/body/lg[1]/l[2]) of the verse annotation and the first q
element (/text/body/p[1]/q) of the direct discourse annotation which holds for
the string baby, can't you see, the annotation levels cannot simply be
integrated into a common tree structure. Following the representation in Witt (2005) the
present annotations could be visualized as in
(in order to emphasize the present tree
structures there is an additional representation of nodes and edges):

To avoid the above-mentioned stylistic disadvantages of the horizontal ordering of
annotation segments (vertical ordering of annotation levels), the representation could be rotated by 90° to the right and
mirrored horizontally:

From this state, it is only a few steps towards an adequate readability of the text and the
consideration of line breaks from the primary data. This can be shown
by a visualization method implemented by . On the basis of LMNL
markup he realized the visualization of concurrent annotations by both an
'arcs'-visualization and an interactive SVG 'map' (shown in below).

The present annotation layers and element types are displayed in the top left corner of
the graphic and their appearance can be switched on and off by mouse click. The actual
instances of the underlying annotation are represented by two distinct illustrations: as
bars on the left hand side and circles on the right hand side. The primary data
text is located in between. The correspondence of segments of the primary data and annotations is
demonstrated by interactive mouse-over effects (see the SVG provided online at Piez' website). Overlaps of annotations from the individual layers can be identified in the graphic by having a look at
non-matching borders of the bars or intersecting lines of the circles.

While explicitly states that the described visualization method primarily
serves as a basic demonstration, there are certain technical and theoretical
difficulties which should be named:

- The annotation layers of Piez'
  examples only contain elements which span text segments large enough to
  avoid problems with the visualization of the corresponding bars. If there were
  annotations for single words or even smaller parts of the text, the bars and circles would
  become too small for a reasonable visualization (see ).
- The use of circles for representing annotations is only feasible as long as there
  are no very large annotated segments because the diameter could grow too big.
- Since all of the present annotation layers span the complete textual content
  without any gaps, there might be the impression that the method is arranged very
  clearly. In fact, other configurations of annotations which leave out certain parts of
  the text could lead to a less clear picture.
These restrictions, however, do not decrease the overall
usefulness of the approach to visualize overlapping structures.
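
The notion of a classic overlap used in this section can be stated precisely in terms of string ranges: two ranges overlap when they share characters but neither contains the other. The following is a minimal sketch; the function name overlaps and the offsets in the usage lines are invented for illustration, and ranges are assumed to be half-open (start, end) pairs.

```python
def overlaps(a, b):
    """True if ranges a and b share characters but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    shares_characters = s1 < e2 and s2 < e1
    a_in_b = s2 <= s1 and e1 <= e2
    b_in_a = s1 <= s2 and e2 <= e1
    return shares_characters and not (a_in_b or b_in_a)


crossing = overlaps((0, 10), (5, 15))  # ranges cross each other
nested = overlaps((0, 10), (2, 5))     # second range lies inside the first
adjacent = overlaps((0, 5), (5, 10))   # ranges only touch
print(crossing, nested, adjacent)
```

Only the first pair would show up as non-matching bar borders in a bar-based visualization; nested and adjacent ranges fit into a single tree.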
Rendering SVG from XStandoff

The creation of two-dimensional SVG-based visualizations for XStandoff instances is to a great extent
inspired by the approach of discussed in the previous section. Accordingly, the visualization
includes a section displaying the textual primary data and a section with representations of
annotations which in turn correspond to spanned segments of the primary data. The possible
visualization of annotations by circles was not implemented since it can be assumed that
this method leads to problems for large annotation segments, as already stated. Piez'
method was extended by some additional features for user
interactivity like the horizontal switching of annotation levels and the optional display
of classic overlaps. The general appearance of an XStandoff instance visualized in SVG can
be seen in . This representation is based on the XStandoff instance given
in (an online
version of the example is available for testing the interactive features; also
consider the online
visualization corresponding to ).

There are two options for the user to influence the configuration of the responsible XSLT stylesheet
XSF2SVG.xsl (available at
http://www.xstandoff.net/tk.html) and the
resulting visualization: the stylesheet parameters font-size and
max-line-length. Since most SVG viewers enable the user to zoom in and out of
the graphic anyway, the parameter font-size simply determines the initial
appearance of the resulting graphic. More attention should be drawn to the parameter
max-line-length which determines the maximal length of a single line of
primary data. This has to be considered since lines of a certain length in
combination with relatively small annotation segments can lead to visualization difficulties. Due to the correspondence between the height of a displayed annotation
segment and the individual characters of a line of the primary data, annotations spanning
over only a few characters might not be visualized accurately. That is the reason why the
value of the parameter max-line-length is determined
automatically by default in order to provide an optimal illustration of the annotation segments.
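
The effect of the line length on small segments can be illustrated with a simplified model. It assumes that each wrapped line of primary data has a fixed height and that the height of a bar grows linearly with the number of characters it spans; this is a reading of the description above, not the actual computation in XSF2SVG.xsl, and the function name bar_extent and the line_height default are hypothetical.

```python
def bar_extent(start, end, max_line_length, line_height=12.0):
    """Return (y, height) of a bar for the half-open range [start, end).

    Simplifying assumption: one text line of max_line_length characters
    occupies line_height units, so each character gets an equal share.
    """
    per_char = line_height / max_line_length
    return start * per_char, (end - start) * per_char


# A one-character segment (for example a tagged comma) at offset 20.
_, height_15 = bar_extent(20, 21, max_line_length=15)
_, height_40 = bar_extent(20, 21, max_line_length=40)
print(height_15, height_40)
```

Under this model a single character receives 1/15 of a line height at 15 characters per line but only 1/40 at 40 characters per line, which is why long lines make short annotations hard to spot.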
Although generally it is up to the user to vary the maximal line length, the
circumstance that a high value could lead to inaccurate visualizations has to be kept in
mind. demonstrates the possible difficulties by comparing a
visualization based on a maximal line length of 15 characters (automatically computed as maximum) with one
which is based on 40 characters per line:

Even in the case of a short line length of 15 characters (on the left hand side of ) it is difficult to
spot the segment for the tagged comma. Certainly, there are possible solutions to this problem. For instance, an advanced
zooming method for the individual annotations and the corresponding textual content from the
primary data could be implemented. Furthermore, it would be possible to realize some kind of
page-wise navigation through the primary data, which would reduce the amount of
simultaneously displayed text. Nevertheless, the main problems of the present SVG
visualization stem from its conceptual foundation. The focus on tree structures
(with minimal possible extensions) prohibits the coverage of phenomena other than overlaps
and discontinuous elements, e.g. repetitive structures. This circumstance could be addressed
by an increased focus on the annotations, which will be demonstrated in the following section.

Adding the third dimension

A different perspective on the visualization of concurrent annotations can be taken by considering 3D graphic rendering. The
recent developments in native browser support for 3D graphics, especially the specification of
HTML5 () and its element <canvas> allowing for
programmatic rendering through APIs like WebGL (cf. ), promise to
provide a fruitful development and application framework for advanced graphical representation
of concurrent markup. At the time of writing this article, WebGL is supported by the
currently available builds of the browsers Firefox 5 and Chrome 12 (see http://www.khronos.org/webgl/wiki/Getting_a_WebGL_Implementation for further details). With X3DOM (see http://www.x3dom.org/ for further details) and the serialization format X3D () there
is an appropriate solution for defining 3D graphics in XML. Accordingly, it is possible to
implement transformation scenarios for XML-based representation formats for concurrent markup
similar to the one shown for XStandoff and SVG for 3D visualizations without
leaving the XML context. Certainly, a native browser support of XSLT 2.0 would make the
framework even more straightforward, which naturally holds for the SVG approach, too. As an alternative
has shown some notable advantages in implementing a JavaScript version of
Saxon, called Saxon Client Edition or Saxon-CE, bringing XSLT 2.0 to the browser.

Considerations

Since 3D visualizations accompanied by interactive user navigation open up different
perspectives than the SVG approach presented in the previous section, the basic underlying principle could focus
on different aspects. While in the mentioned two-dimensional representation the primary data
is in focus and minimally extended tree structures for concurrent markup can be
represented, a three-dimensional approach could envisage the comprehensible
visualization of annotations with an underlying graph-based model by constructing horizontally
(along the z-axis of a 3D space) ordered trees, extended tree structures (e.g., allowing
multiple parentage), or even full-blown graphs (including repetitive structures and cyclic paths). In order to construct comparable layers of annotations, the structures could be
normalized with respect to the corresponding primary data. In this context two methods could
be considered: horizontal normalization and vertical
normalization. The horizontal normalization of the displayed structures
refers to the horizontal position of the nodes representing annotations and could be based on
the primary data virtually transformed into a single line. Along this line of characters the
nodes could be located by positioning them at the center of their spanned character string
(x-axis of ).

The vertical normalization could make use of a very similar strategy. By dividing the
number of spanned characters of an annotation by the total number of characters in the
primary data, the vertical position of nodes could be determined. Admittedly, this strategy
could lead to confusion since it is probable that nodes of one level do not have the
same vertical position, while nodes from different levels have the same position. Having in
mind that the described normalization method arranges nodes with respect to the concept of containment,
it would be possible to allow for different realizations of layer visualizations, that is, a containment
perspective and a dominance perspective.

The graphic incorporates the normalized structures of the two annotation layers of the
above-mentioned XStandoff instance
(). The normalized node positions reflect
the concept of containment.
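
The two normalization steps described above amount to simple arithmetic on the string range of each annotation. The following is a minimal sketch; the function name normalize and the sample offsets are assumptions for illustration, and ranges are taken to be half-open.

```python
def normalize(start, end, total_chars):
    """Return (x, y) for an annotation node spanning [start, end).

    x: center of the spanned character string (horizontal normalization)
    y: spanned characters divided by all characters (vertical normalization)
    """
    x = (start + end) / 2
    y = (end - start) / total_chars
    return x, y


total = 100
whole_text = normalize(0, 100, total)   # spans the complete primary data
first_part = normalize(0, 40, total)
second_part = normalize(40, 100, total)
print(whole_text, first_part, second_part)
```

A node that spans the complete primary data ends up centered and at the maximal vertical value, while shorter ranges are placed lower, so the layout mirrors containment rather than dominance.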
In addition to the respective XStandoff instance, the first structure can also be seen as
a visualization of the containment relations from the xLMNL instance ()
if a virtual node is imposed which spans the complete primary data.
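
Deriving such a containment structure from flat ranges can be sketched as follows. The function name containment_tree, the "#root" label and the sample ranges are hypothetical. The sketch assumes uniquely named ranges that lie within the primary data and do not overlap; with overlapping ranges it would record only one of several possible containers, so conflicting annotations would have to be separated first.

```python
def containment_tree(ranges, total_chars):
    """Map each (name, start, end) range to the name of its smallest container.

    A virtual root spanning [0, total_chars) is imposed so that every
    range has a parent.
    """
    root = ("#root", 0, total_chars)
    parent = {}
    stack = [root]
    # Sort by start offset, longer ranges first, so containers come first.
    for rng in sorted(ranges, key=lambda r: (r[1], -r[2])):
        # Drop candidates from the stack that do not contain this range.
        while not (stack[-1][1] <= rng[1] and rng[2] <= stack[-1][2]):
            stack.pop()
        parent[rng[0]] = stack[-1][0]
        stack.append(rng)
    return parent


flat = [("w2", 6, 11), ("s", 0, 24), ("w1", 0, 5)]
tree = containment_tree(flat, total_chars=24)
print(tree)
```

The resulting parent relation is inferred purely from string ranges; it is information added by the procedure and not something stated in the flat input.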
Note that the hierarchy between the nodes in the structure for an xLMNL instance is only
implicitly present as already shown in
– in contrast to hierarchies in XStandoff instances. Thus, in general,
for the visualization of concurrent markup two distinct
visualization methods (containment vs. dominance) should be considered.

XStandoff supports the differentiation of containment and dominance relations (see ),
using the start and end positions of the referenced segments for computing whether a string range virtually delimited
by an annotation is contained inside a second one and using the hierarchical relations between two nodes on the same
annotation layer to express a dominance between these nodes. Therefore, it would be reasonable to consider
these two possible normalization methods, allowing for the generation of both visualization methods.As a benefit from using a 3D approach it would still be possible to use tree-like visualizations as a starting point since both
the handling of overlapping annotations and the arrangement of different annotation layers can
be managed by using the z-axis.

The actual realization of a 3D rendering of concurrent markup could vary in its complexity and in the
number of realized features. (corresponding to the XStandoff instance in
) demonstrates the dominance perspective mentioned above (as opposed to the
containment perspective), in which there is a 1:1 relationship between nodes and
annotation elements. It is based on a hierarchical organization of the annotations.

Besides these minimalistic illustrations, more complex and sophisticated graphics could be
realized. For example, it would be possible to represent hierarchies which are based on graphs
and include phenomena like discontinuous elements or repetitive structures. These would be
visualized on the basis of present containment relations, that is, nodes are normalized with regard to their
referenced textual content and edges reflect containment relations.

Regarding the visualization of the relationship between primary data and annotations, several
solutions are conceivable. Firstly, it would be possible to simply display the spanned textual content of a node in tooltips
as indicated in . Alternatively, it would
be conceivable to take a 3D space like in as a basis
and project the textual primary data onto the back wall. By mouse-over effects the user could
focus the spanned textual content, for example by evoking light and shadow effects which
highlight the corresponding primary data section(s). At the same time, information about the annotation could
be shown in a tooltip.

In the visualization from , which shows horizontally and vertically normalized trees,
the appearance and position of nodes depend on the presence of distinct string
ranges for which there are annotations, that is, a single node might represent more than one
annotation element. This should be kept in mind.

Apart from the actual design there are some core features which should be realized in the
envisaged approach:
- free user navigation through the graphic, including zooming in and out;
- draggable structures for layers (e.g. draggable as a whole along the z-axis);
- mouse-over effects, for example information on spanned primary data (textual content and positions), information on the annotation, and XPath;
- highlighting of specific structures (distinct element relations, overlaps, discontinuous elements, virtual/repetitive structures);
- the choice between displaying annotated or plain textual content for a node;
- illustration of the left and right context of focused annotation elements and the corresponding textual content (plus specification of the range of considered context).

Besides these rather stylistic considerations, which focus on the informational level of
the visualization, the conceptual advantages of a 3D approach to concurrent markup should
have become clear. Since it is not automatically restricted to strictly hierarchical
structures, it would be possible to display graph-based constructs like repetitive/reentrant
structures. Furthermore, relations between individual hierarchies of graph structures
could be illustrated, and representations of dominance and/or
containment relations could be distinguished through the actual instantiation of the edges of graphs.
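The distinction between containment edges and dominance edges can be illustrated with a small Python sketch. It rests on assumptions that are not taken from XStandoff itself: the `Node` type and its field names are hypothetical, ranges are half-open character offsets, and containment is treated as proper containment (identical ranges do not contain each other) so that the resulting edges stay acyclic.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Node:
    """An annotation node; all field names are hypothetical."""
    ident: str
    layer: str
    start: int                    # offset of the first spanned character
    end: int                      # offset after the last spanned character
    parent: Optional[str] = None  # ident of the dominating node on the same layer


Edge = Tuple[str, str]


def contains(outer: Node, inner: Node) -> bool:
    """Proper containment of string ranges, regardless of annotation layer."""
    same_range = (outer.start, outer.end) == (inner.start, inner.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same_range


def overlaps(a: Node, b: Node) -> bool:
    """True if the two ranges cross without one containing the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


def containment_edges(nodes: Iterable[Node]) -> Set[Edge]:
    """Edges derived from start and end positions only (containment perspective)."""
    return {(o.ident, i.ident) for o, i in permutations(list(nodes), 2) if contains(o, i)}


def dominance_edges(nodes: Iterable[Node]) -> Set[Edge]:
    """Edges derived from the hierarchy within one layer (dominance perspective)."""
    return {(n.parent, n.ident) for n in nodes if n.parent is not None}


# Two hypothetical layers over ten characters of primary data.
nodes = [
    Node("a1", "A", 0, 10), Node("a2", "A", 0, 6, "a1"), Node("a3", "A", 6, 10, "a1"),
    Node("b1", "B", 0, 10), Node("b2", "B", 0, 4, "b1"), Node("b3", "B", 4, 10, "b1"),
]
```

In this example the dominance edges never leave their layer, whereas the containment edges also connect nodes of different layers (for instance `a2` and `b2`), and `a2` and `b3` overlap without either containing the other.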
Prototypic 3D visualization

We have implemented a first prototypic 3D visualization based on an XSLT stylesheet
named XSF2X3D.xsl that transforms
XStandoff instances into X3D graphics like the one in .
Since there is no complete implementation available yet, in the remainder of this section we
will concentrate on what has already been accomplished, followed by possible future enhancements.

The current implementation of a 3D visualization of concurrent hierarchies reflects the
considerations from the previous sections. The direct embedding of X3D into HTML5 allows for the
rendering of 3D visualizations in current browser versions. The visualization has
been successfully tested in Google's Chrome 12.0.742.112 and Mozilla Firefox 5.0, except for certain HTML5
constructs like range inputs on the latter. Support depends on the installed GPU: it runs fine on an
NVIDIA GeForce GT 330M installed in a MacBook Pro, whereas on other configurations Chrome had to be
started with the '--ignore-gpu-blacklist' startup parameter and Firefox had to be
configured via the about:config page by enabling the parameter 'webgl.force-enabled'.
The current state of the prototype is shown in .
The main component of the visualization is a 3D space indicated as a cube which contains the
layers from the corresponding XStandoff instance () ordered along the z-axis.
At present, the normalization methods described in the previous section have not been fully implemented. In a later realization of the XSLT
stylesheet it should be possible for the
user to choose the normalization method, that is, the visualization of dominance or containment
relations.
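The arrangement of layers along the z-axis can be sketched as follows. This is not the XSF2X3D.xsl stylesheet, which is written in XSLT and is not reproduced here; it is a minimal Python illustration that emits an X3D fragment with one `Transform` per layer. The function name `build_scene`, the input format (node identifier plus already normalized x and y positions), the sphere radius, and the `z_step` spacing are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hypothetical input: layer name -> list of (node identifier, x, y).
Layers = Dict[str, List[Tuple[str, float, float]]]


def build_scene(layers: Layers, z_step: float = 2.0) -> ET.Element:
    """Build an X3D tree in which each layer is shifted back along the z-axis."""
    x3d = ET.Element("X3D", {"profile": "Immersive", "version": "3.2"})
    scene = ET.SubElement(x3d, "Scene")
    for index, (layer_name, nodes) in enumerate(layers.items()):
        z = 0.0 - index * z_step  # first layer at z = 0, later layers behind it
        group = ET.SubElement(
            scene, "Transform", {"DEF": layer_name, "translation": f"0 0 {z:g}"}
        )
        for ident, x, y in nodes:
            node = ET.SubElement(
                group,
                "Transform",
                {"DEF": f"{layer_name}_{ident}", "translation": f"{x:g} {y:g} 0"},
            )
            shape = ET.SubElement(node, "Shape")
            ET.SubElement(shape, "Sphere", {"radius": "0.1"})
    return x3d


scene = build_scene({"layer1": [("n1", 1.5, 0.125)], "layer2": [("n2", 12, 1)]})
x3d_text = ET.tostring(scene, encoding="unicode")
```

Grouping each layer under its own `Transform` means that moving a whole layer only requires changing one `translation` value; calling `build_scene` with `z_step=0` places all layers in the same plane.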
The illustration given in indicates most of the available user interactivity. Besides
free navigation like zooming in and out of the graphic and rotating it, there are certain predefined viewpoints
like front view and side view, which could be interesting for the user and can be selected
from the 'View' menu. In addition, it is possible to freely drag the hierarchies along the
z-axis by using the sliders, which are available for each individual layer in the info box on the
right-hand side. An interesting feature of the graphic is the possibility to virtually merge layers
by either dragging them into the appropriate positions or selecting the predefined 'Merge layers' option
from the 'Layers' submenu. The initial configuration of the layers can be
reestablished by a click on 'Reset layers'. If the user gets lost in 3D space, the 'Reload' button
on the left-hand side restores the initial state of the graphic.

Information on the present annotations in the individual layers can be gathered by hovering over the
nodes with the cursor, which evokes a tooltip containing basic information like element names,
string ranges, and XPath expressions. Other desirable features for an appropriate visualization of
concurrent markup, like the ones listed in the previous section, will be considered in a later version.

Conclusion and future research

In this paper we demonstrated two aspects: firstly, that the formal model of XML
instances can exceed that of trees; in fact, we have proven that it is fully capable of
representing graphs. This, secondly, was used as a starting point to choose two XML-based representation
formats for multiple annotations that can be converted into 2D visualizations. Although it could be shown
that the first visualization approach provides an adequate (though admittedly suboptimal) solution to
overlapping structures, it is not capable of illustrating enhanced graph-based phenomena like
discontinuous elements or repetitive structures. Therefore we have sketched possible
3D renderings of concurrent markup. A first prototypic realization demonstrated how adding a further
dimension could in principle contribute to the appropriate visualization of concurrent markup and could serve
as the basis for further research. The current version will be made available under
the GNU Lesser General Public License (LGPL v3) at the XStandoff website.
Unresolved tasks like an improved visualization of overlapping annotations
and the treatment of discontinuous and repetitive structures could be tackled in a future release.
Bibliography

Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web: From Relations to Semistructured Data and XML. San Francisco, California: Morgan Kaufmann Publishers, 2000.
Bauman, S. TEI HORSEing Around. In: Proceedings of Extreme Markup Languages, Montréal, Québec, 2005.
Bird, S. and Liberman, M. Annotation graphs as a framework for multidimensional linguistic data analysis. In: Proceedings of the Workshop "Towards Standards and Tools for Discourse Tagging". Association for Computational Linguistics, 1999.
Burnard, L. and Bauman, S. (eds.). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Published for the TEI Consortium by Humanities Computing Unit, University of Oxford, Oxford, Providence, Charlottesville, Bergen. Version 1.9.1. Last updated on March 5th 2011.
Carletta, J., Kilgour, J., O’Donnell, T. J., Evert, S., and Voormann, H. The NITE Object Model Library for Handling Structured Linguistic Annotation on Multimodal Data Sets. In: Proceedings of the EACL Workshop on Language Technology and the Semantic Web (3rd Workshop on NLP and XML (NLPXML-2003)), Budapest, Hungary, 2003.
Carletta, J., Evert, S., Heid, U., and Kilgour, J. The NITE XML Toolkit: data model and query language. In: Language Resources and Evaluation 39, Springer, Dordrecht, 2005.
Coombs, J. H., Renear, A. H., and DeRose, S. J. Markup Systems and the Future of Scholarly Text Processing. In: Communications of the ACM 30.11, 1987.
Cowan, J., Tennison, J., and Piez, W. LMNL Update. In: Proceedings of Extreme Markup Languages, Montréal, Québec, 2006.
Cowan, J. MicroXML. Poster presented at XML Prague 2010.
Cowan, J. MicroXML. Editor's Draft 2011-06-30. http://www.ccil.org/~cowan/MicroXML.html.
DeRose, S. J. Markup Overlap: A Review and a Horse. In: Proceedings of Extreme Markup Languages, Montréal, Québec, 2004.
Dekhtyar, A. and Iacob, I. E. A framework for management of concurrent XML markup. Data & Knowledge Engineering, 52(2):185–208, 2005.
Di Iorio, A., Peroni, S., and Vitali, F. Towards markup support for full GODDAGs and beyond: the EARMARK approach. In: Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Peroni01.
Durusau, P. and Brook O'Donnell, M. Tabling the Overlap Discussion. In: Proceedings of Extreme Markup Languages, Montréal, Québec, 2004.
Goecke, D., Lüngen, H., Metzing, D., Stührenberg, M., and Witt, A. Different views on markup. Distinguishing Levels and Layers. In: Witt, A. and Metzing, D. (eds.), Linguistic Modeling of Information and Markup Languages. Dordrecht: Springer, 2010. doi:10.1007/978-90-481-3331-4.
Gou, G. and Chirkova, R. Efficiently Querying Large XML Data Repositories: A Survey. In: IEEE Transactions on Knowledge and Data Engineering 19.10, 2007.
Hopcroft, J. E. and Ullman, J. D. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.
HTML5: A vocabulary and associated APIs for HTML and XHTML, W3C Working Draft 05 April 2011. World Wide Web Consortium. http://www.w3.org/TR/html5/.
Huitfeldt, C. and Sperberg-McQueen, C. M. Representing and processing of GODDAG structures: implementation strategies and progress report. In: Proceedings of Extreme Markup Languages, Montréal, Québec, 2006.
ISO/IEC 19776-1:2009, Information technology – Computer graphics, image processing and environmental data representation – Extensible 3D (X3D) encodings – Part 1: Extensible Markup Language (XML) encoding. International Standard, International Organization for Standardization, 2009.
ISO/TC 37/SC 4. ISO 24610-1:2006: Language Resource Management – Feature Structures – Part 1: Feature Structure Representation. International Standard, International Organization for Standardization, 2006.
Jagadish, H. V., Lakshmanan, L. V. S., Scannapieco, M., Srivastava, D., and Wiwatwattana, N. Colorful XML: One hierarchy isn’t enough. In: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2004), ACM Press, New York, NY, USA, 2004. doi:10.1145/1007568.1007598.
Kay, M. A streaming XSLT processor. In: Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Kay01.
Kay, M. XSLT in the Browser. In: Kosek, J. (ed.), XML Prague 2011 Conference Proceedings, number 2011-519 in ITI Series, pages 125–134, Prague, Czech Republic, March 2011. Institute for Theoretical Computer Science.
Le Maitre, J. Describing multistructured XML documents by means of delay nodes. In: DocEng ’06: Proceedings of the 2006 ACM symposium on Document engineering, ACM Press, New York, NY, USA, 2006.
Marcoux, Y. Graph characterization of overlap-only TexMECS and other overlapping markup formalisms. In: Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:10.4242/BalisageVol1.Marcoux01.
Marinelli, P., Vitali, F., and Zacchiroli, S. Towards the unification of formats for overlapping markup. In: New Review of Hypermedia and Multimedia, 14(1), 2008. doi:10.1080/13614560802316145.
Møller, A. and Schwartzbach, M. I. XML Graphs in Program Analysis. In: PEPM ’07: Proceedings of the 2007 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation. Nice, France, 2007.
Nicol, G. T. Attributed Range Algebra. Extending Core Range Algebra to Arbitrary Structures, 2002.
Nicol, G. T. Core Range Algebra: Toward a Formal Model of Markup. In: Proceedings of Extreme Markup Languages. Montréal, Québec, 2002.
Pianta, E. and Bentivogli, L. Annotating Discontinuous Structures in XML: the Multiword Case. In: Proceedings of LREC 2004 Workshop on "XML-based richly annotated corpora", Lisbon, Portugal, 2004.
Piez, W. Half-steps toward LMNL. In: Proceedings of Extreme Markup Languages. Montréal, Québec, 2004.
Piez, W. Towards Hermeneutic Markup: An architectural outline. In: Digital Humanities 2010 Conference Abstract, London, 2010.
Polyzotis, N. and Garofalakis, M. Statistical Synopses for Graph-Structured XML Databases. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, 2002. doi:10.1145/564691.564733.
Rehm, G., Schonefeld, O., Trippel, T., and Witt, A. Sustainability of linguistic resources revisited. In: Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). doi:10.4242/BalisageVol6.Witt01.
Schmidt, T. The transcription system EXMARaLDA: An application of the annotation graph formalism as the Basis of a Database of Multilingual Spoken Discourse. In: Proceedings of the IRCS Workshop On Linguistic Databases. Philadelphia: Institute for Research in Cognitive Science, University of Pennsylvania, 2001.
Schonefeld, O. XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup. In: Rehm, G., Witt, A., Lemnitzer, L. (eds.), Datenstrukturen für linguistische Ressourcen und ihre Anwendungen. Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, Tübingen, Germany, 2007. Gunter Narr Verlag.
Sperberg-McQueen, C. M. and Huitfeldt, C. GODDAG: A Data Structure for Overlapping Hierarchies. In: King, P. and Munson, E. V. (eds.), Proceedings of the 5th International Workshop on the Principles of Digital Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science, Springer, 2004.
Sperberg-McQueen, C. M. and Huitfeldt, C. Markup Discontinued: Discontinuity in TexMecs, Goddag structures, and rabbit/duck grammars. In: Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:10.4242/BalisageVol1.Sperberg-McQueen01.
Sperberg-McQueen, C. M. and Huitfeldt, C. GODDAG. Presented at the Goddag workshop, Amsterdam, 1-5 December 2008.
Stegmann, J. and Witt, A. TEI Feature Structures as a Representation Format for Multiple Annotation and Generic XML Documents. In: Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Stegmann01.
Stührenberg, M. and Goecke, D. SGF – An integrated model for multiple annotations and its application in a linguistic domain. In: Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:10.4242/BalisageVol1.Stuehrenberg01.
Stührenberg, M. and Jettka, D. A toolkit for multi-dimensional markup: The development of SGF to XStandoff. In: Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Stuhrenberg01.
Tennison, J. Layered Markup and Annotation Language (LMNL). In: Proceedings of Extreme Markup Languages, Montréal, Québec, 2002.
Thompson, H. S. and McKelvie, D. Hyperlink semantics for standoff markup of read-only documents. In: Proceedings of SGML Europe ’97: The next decade – Pushing the Envelope, Barcelona, 1997.
WebGL Specification. Version 1.0, 10 February 2011. Khronos Group. https://www.khronos.org/registry/webgl/specs/1.0/.
Witt, A. Multiple Informationsstrukturierung mit Auszeichnungssprachen. XML-basierte Methoden und deren Nutzen für die Sprachtechnologie. Dissertation, Universität Bielefeld, 2002.
Witt, A. Multiple hierarchies: New Aspects of an Old Solution. In: Proceedings of Extreme Markup Languages, Montréal, Québec, 2004.
Witt, A., Goecke, D., Sasaki, F., and Lüngen, H. Unification of XML Documents with Concurrent Markup. Literary and Linguistic Computing, 20(1):103–116, 2005. doi:10.1093/llc/fqh046.