The Sekimo Generic Format
Based on our experience with the Prolog fact base format, we decided to
develop a similar representation based on XML. The initial goal was to use a native XML
database as storage backend. During the development of the Sekimo Generic Format
(SGF), however, several implementations were tested, including the use on a per-file basis, different
native XML databases (e.g. eXist, Berkeley DB XML, Qizx/db, IBM DB2 Express-C 9.5), and a relational database (MySQL, cf. section “SGF as import and export format”). In the following sections we will present
SGF in detail. The annotation layers shown in Figure 2 and Figure 3 will serve for demonstration purposes. In section “Application of SGF” we will show a real world example from the domain of anaphora
Figure 2: Phrase structure annotation
Figure 3: Syllable annotation
The concept of SGF
SGF was developed for storing multiply annotated linguistic corpus data and examining
relationships between elements derived from different annotation layers. The format consists
of a base layer, providing the structure of an SGF instance and global attributes that are
imported by the different annotation layers (cf. section “The base layer”). The use of
metadata in SGF is described in section “Metadata”, while section “Adding layers”, section “Disjoints and continuous segments” and section “Validation” deal with different aspects of the format. Finally, we will
discuss processing and querying of SGF-annotated data in section “Querying” and
conclude with possible caveats of the format in section “Caveats and problems”.
Figure 4: Diagram of the corpus root element
SGF can be used in two different ways as shown in Figure 4:
As a container format that contains optional metadata (cf. section “Metadata”) and the corpus data, i.e. the whole corpus is saved as a
single SGF instance. This is the appropriate way when using SGF for storing small and
medium-sized corpora in conjunction with a native XML database (cf. Figure 5).
On a per-file basis or when dealing with larger corpora, a meta SGF file is used
containing (again optional) metadata for and references to the actual corpus files
(cf. Figure 6).
Figure 5: Storing a whole corpus in a single SGF instance
<corpusData xml:id="c1" type="text" sgfVersion="1.0">
<!-- [...] -->
<corpusData xml:id="c2" type="text" sgfVersion="1.0">
<!-- [...] -->
In both cases the root element is the corpus element; underneath it, a
corpusDataRef element or a corpusData element can be inserted. The
corpusDataRef element allows for referring to an external file
containing a corpus entry via its uri attribute and for specifying the external
data in terms of encoding and mime-types (respective attributes of the same name). In this
case the root element of the corpus entry instances that are referenced by the SGF meta file
should be the corpusData element (cf. section “The base layer”).
The base layer
The corpusData element is used for storing a single corpus entry containing
optional metadata (cf. section “Metadata”), the primary data, the segmentation of
the primary data, and zero or more respective annotation layer(s) (cf. section “Adding layers”). An example base layer is shown in Figure 8. The
xml:id attribute is obligatory while the
sgfVersion attribute is optional (with a default value of 1.0).
Figure 7: Diagram of the
Figure 8: The SGF base layer
<corpusData xml:id="c1" type="text" sgfVersion="1.0">
<primaryData start="0" end="19" xml:lang="en">
<textualContent>This is a sentence.</textualContent>
The corpusData element holds the type attribute, which can be set to
either text or multimodal, while the
primaryData child element contains either
the textual primary data (i.e. the text that is used as basis for annotation) as a text node
of the textualContent element or a reference to a file containing the primary
data (in case of larger texts or non-textual primary data) via a separate
element (not shown in the example listing). In the latter case an optional checksum of the
input file can be provided in the corresponding element to preserve the integrity of the primary
data when dealing with multiple annotation resources. Note that we do not handle any byte
offset problems caused by different encodings (e.g. Latin 1 vs. UTF-16); therefore, the use
of the encoding attribute is highly recommended.
When using SGF for storing multimodal annotations, multiple primaryData
elements are allowed. In this case, the attribute
role has to be provided, which
marks exactly one primary data file as "master" while the other primary
data files are marked as "slaves". The master primary data file sets the
timeline, the slave files can be aligned to the master file via an optional
Adding layers
Several annotations of the primary data can be stored inside a corpusData
element. Whenever an annotation layer is added, two steps have to be undertaken:
The segments which delimit the annotated parts of the primary data are defined.
A converted representation of the original annotation is stored.
The segments element consists of at least one segment element. A
segment is defined by its start and end position in the character stream, similar to the
Prolog fact base format discussed in section “Prolog-based architectures” (for an alternative
definition of segments cf. section “Disjoints and continuous segments”). We use simple numeric attributes
(of the nonNegativeInteger data type in the underlying XML Schema, cf. section “Validation” and XML Schema Part 2, 2004) for defining the start
and end position, in contrast to the PAULA format (Dipper, 2005), which
uses XLink (DeRose et al., 2001) and the XPointer framework (Grosso et al., 2003) to identify text spans. Because single characters have a step size
of 1 (cf. Figure 1), empty elements use the same value for start and end
position. An optional type attribute on the segment element can be used to provide more
information about the segment (available values are empty,
char for character data, ws for whitespace characters, pun for
punctuation characters, dur for duration in case of
multimodal primary data, and seg for referring to already
defined segments, cf. section “Disjoints and continuous segments”).
Figure 10 shows the SGF representation of the two annotation layers
given in Figure 2 and Figure 3. Note that a segment
has to be defined only once, even if it is used in different annotation layers. This is in contrast
to some other graph-based approaches (cf. section “Graph-based architectures”), which define the same
character span separately for each annotation layer. As a result, fewer
segments have to be defined, even for a large number of annotation layers.
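The idea of defining each character span only once and sharing it between layers can be illustrated with a minimal Python sketch. It is not part of SGF or its tooling: the Segment class, the build_segment_table helper and the generated segment IDs are hypothetical, and the spans are simply the word and syllable offsets of the example sentence.

```python
from dataclasses import dataclass

PRIMARY_DATA = "This is a sentence."


@dataclass(frozen=True)
class Segment:
    """A span of the primary data given by character offsets (end is exclusive)."""
    seg_id: str
    start: int
    end: int

    def text(self, primary: str) -> str:
        return primary[self.start:self.end]


def build_segment_table(spans):
    """Register every distinct (start, end) span exactly once.

    The IDs (seg0, seg1, ...) are generated for illustration only and do not
    claim to match the numbering used in Figure 10.
    """
    table = {}
    for span in spans:
        if span not in table:
            table[span] = Segment(f"seg{len(table)}", span[0], span[1])
    return table


# Spans needed by a phrase structure layer and by a syllable layer.
phrase_spans = [(0, 19), (0, 4), (5, 7), (8, 9), (10, 18)]
syllable_spans = [(0, 4), (5, 7), (8, 9), (10, 13), (13, 18)]

table = build_segment_table(phrase_spans + syllable_spans)

# Each layer only stores references to segment IDs.
phrase_layer = [table[s].seg_id for s in phrase_spans]
syllable_layer = [table[s].seg_id for s in syllable_spans]
shared = sorted(set(phrase_layer) & set(syllable_layer))

print(len(table), "segments for", len(phrase_spans) + len(syllable_spans), "references")
print("shared between both layers:", shared)
```

Ten span references from two layers collapse into seven segment definitions here, because three spans are used by both layers.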
The annotation of the primary data is stored in the corresponding element. Following the
terminological distinction between levels and layers (cf. section “Introduction”), each
level element contains, in addition to optional metadata, exactly one
layer element consisting of the markup representation of the corresponding
annotation level. An
annotation element may contain more than one
level element; this mechanism can be used for subsuming annotation levels (e.g.
when the corresponding elements are declared in the same document grammar). The
layer element is a wrapper element containing elements derived from a different
namespace, similar to the meta element (cf. section “Metadata”). However, while the
value of the
processContents attribute of the latter is set to lax, the value of the respective attribute of the
layer element is set to strict, which means
that an XML schema has to be provided for each annotation layer (cf. section “Validation”).
Figure 9: Diagram of the
Figure 10: SGF instance containing two annotation layers
<corpusData xml:id="c1" type="text">
<primaryData start="0" end="19" xml:lang="en">
<textualContent>This is a sentence.</textualContent>
<segment xml:id="seg0" type="char" start="0" end="19" />
<segment xml:id="seg1" type="char" start="0" end="4" />
<segment xml:id="seg2" type="char" start="5" end="18" />
<level xml:id="al1" priority="1">
<description>Phrase structure annotation.</description>
<phrase:s base:segment="seg0" xml:lang="en">
<phrase:pron base:segment="seg1" />
<phrase:v base:segment="seg3" />
<phrase:det base:segment="seg5" />
<phrase:n base:segment="seg6" />
<level xml:id="al2" priority="1">
<syll:s base:segment="seg1" />
<syll:s base:segment="seg3" />
<syll:s base:segment="seg5" />
<syll:s base:segment="seg7" />
<syll:s base:segment="seg8" />
As one can observe in Figure 11, SGF heavily makes use of XML's
inherent ID/IDREF(S) mechanism to connect segments of the primary data with single or
multiple annotation layers (displayed as solid red lines).
Figure 11: Use of XML's ID/IDREF(S) mechanism in SGF
When comparing the two annotation layers with the namespace prefixes phrase and
syll with their respective original representations given in Figure 2 and Figure 3, a second design goal of SGF becomes
visible: to conserve as much of the former annotation format as possible. Still, a
conversion has to be made consisting of the following steps:
Elements with a mixed content model are converted into container elements.
Elements containing text nodes are converted into empty elements.
A base:segment attribute is added to former non-empty elements as
an obligatory attribute (and as an optional attribute for empty elements).
The same conversion rules are applied to the underlying XSD (cf. section “Validation”). As shown in Figure 10,
the hierarchy of
elements and all attributes remain intact, i.e. there is no need for additional files such
as structure files which are needed for the graph-based annotation formats discussed in
section “Graph-based architectures”. However, this statement is only true as long as the
XML-inherent tree structures are adequate.
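The conversion steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XSLT implementation mentioned in the text: the to_standoff function, the inline input document and the generated segment IDs are hypothetical, and only the namespace URI for the base prefix is taken from the query in Figure 13.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "base" prefix in Figure 13; everything else is assumed.
BASE_NS = "http://www.text-technology.de/sekimo"
SEGMENT_ATTR = f"{{{BASE_NS}}}segment"


def to_standoff(inline_xml: str):
    """Turn an inline annotation into primary data, segments and empty elements."""
    root = ET.fromstring(inline_xml)
    segments = {}  # (start, end) -> hypothetical segment ID

    def segment_id(start, end):
        # Reuse the ID if the same span has already been registered.
        return segments.setdefault((start, end), f"seg{len(segments)}")

    def convert(elem, offset):
        start = offset
        # Copy tag and attributes, but drop all text content.
        new = ET.Element(elem.tag, dict(elem.attrib))
        offset += len(elem.text or "")
        for child in elem:
            new_child, offset = convert(child, offset)
            new.append(new_child)
            offset += len(child.tail or "")
        new.set(SEGMENT_ATTR, segment_id(start, offset))
        return new, offset

    converted, _ = convert(root, 0)
    primary = "".join(root.itertext())
    return primary, segments, converted


inline = "<s><pron>This</pron> <v>is</v> <det>a</det> <n>sentence</n>.</s>"
primary, segments, converted = to_standoff(inline)

print(primary)
for (start, end), seg_id in sorted(segments.items(), key=lambda item: item[1]):
    print(seg_id, start, end, repr(primary[start:end]))
print(ET.tostring(converted, encoding="unicode"))
```

The element hierarchy and the original attributes are kept, while the text moves into the primary data and is reachable only through the segment references.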
An XSLT implementation is available for converting arbitrary inline annotation
layers into their respective SGF representation, while a second XSLT script merges different
annotation layers referring to the same primary data into a single SGF instance. Therefore,
it is possible to add further annotation layers
to an already existing SGF
instance at any time (as long as the primary data is not changed). Work has begun on a
second implementation (written in Java).
Disjoints and continuous segments
Segments often consist of other segments. It is therefore possible to create new segments not
only by defining their start and end positions but also by referring to already defined segments
via the segments attribute (cf. Figure 12). In
order to distinguish whether such a newly established segment includes all segments from
the first referred segment up to the last referred one, or defines a disjoint span, the attribute
mode has to be set to the value continuous or disjoint, respectively. The
example in Figure 12 shows a disjoint span.
Figure 12: Definition of a disjoint segment by referring to already established segments
<segment xml:id="seg6" type="seg" segments="seg1 seg3" mode="disjoint"/>
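How the two values of the mode attribute differ can be sketched in Python. The resolve function and the dictionary layout are hypothetical, and the offsets of seg1 and seg3 are assumed for illustration; only the attribute names and the values continuous and disjoint come from the description above.

```python
def resolve(seg_id, segments):
    """Return the character spans covered by a segment as (start, end) pairs."""
    seg = segments[seg_id]
    if "start" in seg:
        # Ordinary segment defined directly by its offsets.
        return [(seg["start"], seg["end"])]
    # Segment defined by references to other segments.
    spans = sorted(
        span for ref in seg["segments"].split() for span in resolve(ref, segments)
    )
    if seg["mode"] == "continuous":
        # Everything from the first referred segment up to the last one.
        return [(spans[0][0], max(end for _, end in spans))]
    if seg["mode"] == "disjoint":
        # Only the referred spans themselves, gaps excluded.
        return spans
    raise ValueError(f"unknown mode: {seg['mode']!r}")


# Offsets are assumptions made for this example.
segments = {
    "seg1": {"start": 0, "end": 4},
    "seg3": {"start": 8, "end": 9},
    "seg6": {"segments": "seg1 seg3", "mode": "disjoint"},
    "seg7": {"segments": "seg1 seg3", "mode": "continuous"},
}

primary = "This is a sentence."
for seg_id in ("seg6", "seg7"):
    spans = resolve(seg_id, segments)
    print(seg_id, spans, [primary[start:end] for start, end in spans])
```

With these assumed offsets the disjoint segment covers "This" and "a" only, whereas the continuous one covers "This is a".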
Note that this feature of SGF could be used for conversion between SGF instances and
the architectures mentioned in section “XML-related architectures”; up to now, however, it has been of
theoretical use only.
Validation
An important aspect when dealing with multiply annotated data is the question of
validating this data. In case of overlaps it is impossible to provide a document
grammar that is suitable for validating the unification of different annotation layers,
quite apart from the amount of work that would be needed to produce such a document grammar.
Therefore, we propose that each annotation level is validated separately, in addition to
the SGF instance as a whole, with a transformed version of its original document grammar.
This transformation follows the conversion of the annotation layer described in section “Adding layers”.
We decided to use the W3C XML Schema Definition Language (XSD) (cf. XML Schema Part 1, 2004) as the underlying schema language for SGF for several
reasons. As already stated, SGF relies heavily on two aspects: the ID/IDREF(S) mechanism and XML namespaces.
While ID/IDREF(S) is already present in XML Document Type Definitions, DTDs
lack real support for XML namespaces. Furthermore, SGF makes use of XML Schema data types
(XML Schema Part 2, 2004), and when external document grammars (for annotation
layers and metadata) are imported, the control of the processing of the imported document
grammars is crucial (cf. section “SGF as import and export format”
for the discussion of the Serengeti
log functionality and the role of XML Schema's processContents attribute).
Because of this we had to choose one of the XML schema languages available. XSD was
favoured over RELAX NG (ISO/IEC 19757-2:2003) because of the better software support:
with Saxon-SA, for example,
a schema-aware XSLT and XQuery engine is available which allows for the use of
the id() and idref() functions for the task of comparing different annotation layers (cf.
section “Analysing annotations”). Of course it would be possible to use simple string
comparisons. However, XML IDs are usually indexed by the XSLT processor
and are for this reason, in most cases, much more efficient than the equivalent XPath
expression using a string comparison predicate (cf. Kay 2008, p. 802-804).
This helps to reduce processing costs when dealing with larger SGF instances. The
downside is that the validation of each associated XSD takes some time (approximately one to
two seconds in our case).
Apart from XSD validation, embedded Schematron (ISO/IEC 19757-3:2006) asserts
are used as additional constraints, for example for refusing end positions of segments that
are less than start positions (cf. Robertson, 2002). In the upcoming version
1.1 of XML Schema, the
assert element will be used for fulfilling this task
(XML Schema 1.1 Part 1, 2008).
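The constraint on segment positions mentioned above can be expressed outside of Schematron as well. The following Python sketch only illustrates the rule itself (an end position must not be smaller than the start position); the invalid_segments function and the sample segments are hypothetical and do not reproduce the Schematron assertions embedded in the SGF schema.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as reported by ElementTree.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def invalid_segments(segments_xml: str):
    """Return the IDs of all segments whose end precedes their start."""
    root = ET.fromstring(segments_xml)
    rejected = []
    for seg in root.iter("segment"):
        start, end = seg.get("start"), seg.get("end")
        if start is None or end is None:
            # Segments defined by references carry no offsets to compare.
            continue
        if int(end) < int(start):
            rejected.append(seg.get(XML_ID))
    return rejected


sample = """
<segments>
  <segment xml:id="seg0" type="char" start="0" end="19"/>
  <segment xml:id="seg1" type="char" start="4" end="0"/>
  <segment xml:id="seg2" type="empty" start="7" end="7"/>
  <segment xml:id="seg6" type="seg" segments="seg0 seg2" mode="disjoint"/>
</segments>
""".strip()

print(invalid_segments(sample))
```

Equal start and end positions are accepted, since empty elements use the same value for both.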
Querying
One of the goals during the development of SGF has been the possibility of analyzing the
relationships between elements of different layers. In contrast to the work described by
Alink et al., 2006 and Alink et al., 2006a, which involves new standoff
XPath axis steps, or the linguistic query language LPath, which extends the XPath 1.0 syntax
and which was introduced by Bird et al., 2006, SGF uses unchanged XML-related
specifications for querying data. Up to now we have employed XSLT 2.0, XPath 2.0 and XQuery
1.0 queries for typical tasks carried out in our project (cf. section “Application of SGF”). Bird et al., 2006 and Dipper et al., 2007
suggest different example queries to evaluate their architectures. So far, Q1
("Find all sentences that include the word 'kam'"), Q2 ("Find all
sentences that do not include the word 'kam'"), Q3 ("Find all NPs. Return
the reference to that NP") and Q7 ("Find all pairs of anaphors and direct
antecedents in which the anaphor is a personal pronoun") described in Dipper et al., 2007 have been implemented.
Figure 13 shows Q7 for our corpus.
Figure 13: XQuery Q7 adapted for the corpus under investigation
declare boundary-space strip;
declare namespace base="http://www.text-technology.de/sekimo";
declare namespace doc="http://www.text-technology.de/sekimo/doc";
declare namespace cnx="http://www.text-technology.de/cnx";
declare namespace chs="http://www.text-technology.de/sekimo/chs";
declare variable $doc := "ling-deu-003-sgf-noWS.xml";
let $d := doc($doc)
for $s in $d//chs:semRel/chs:cospecLink[id(@phorIDRef)/
and @pos='PRON' and contains(@morpho,'Pers')]]
In addition, we have implemented Q8 ("Find all pairs of anaphors and
antecedents and their respective parent(s) on the logical document layer"), for
which it is necessary for the XQuery processor to traverse back to the segments, compare
segment elements and then find the corresponding annotations. Most
of the queries perform comparably to the respective inline queries referred to in Dipper et al., 2007, but in general they are difficult to compare since our corpus (six
German scientific articles and eight German newspaper articles, containing 3,084 sentences,
56,203 tokens, 11,740 markables, 4,323 anaphoric relations, three annotation levels: logical
document structure, POS, anaphoric relations) differs both in size and in
annotation levels. Apart from Q7, most parts of the queries can be performed inline (which
is a benefit of SGF over other architectures discussed in section “Graph-based architectures”).
This allows us to abstain from converting SGF instances to an inline representation prior to
analyzing the relations, as proposed by
Dipper et al., 2007; avoiding this step was one of the motivations in developing SGF.
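The traversal needed for queries such as Q8 (from an annotation element back to its segment and from there to the elements of other layers) can be sketched in Python. This is not the XQuery used in the evaluation: the miniature instance, the namespace URIs for the phrase and syll prefixes, and the co_located function are assumptions made for illustration, and the dictionary merely plays the role that the index behind idref() plays in a schema-aware processor.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

BASE_NS = "http://www.text-technology.de/sekimo"  # "base" prefix as in Figure 13
SEGMENT_ATTR = f"{{{BASE_NS}}}segment"

# Miniature instance; the phrase/syll namespace URIs are placeholders.
SGF = f"""
<corpusData xmlns:base="{BASE_NS}"
            xmlns:phrase="urn:example:phrase"
            xmlns:syll="urn:example:syll">
  <segments>
    <segment xml:id="seg1" start="0" end="4"/>
    <segment xml:id="seg3" start="5" end="7"/>
    <segment xml:id="seg9" start="0" end="7"/>
  </segments>
  <layer>
    <phrase:pron base:segment="seg1"/>
    <phrase:v base:segment="seg3"/>
  </layer>
  <layer>
    <syll:s base:segment="seg1"/>
    <syll:s base:segment="seg3"/>
  </layer>
</corpusData>
""".strip()


def build_reference_index(root):
    """Map each segment ID to the elements referring to it (built once, reused)."""
    index = defaultdict(list)
    for elem in root.iter():
        ref = elem.get(SEGMENT_ATTR)
        if ref is not None:
            index[ref].append(elem)
    return index


def co_located(elem, index):
    """Elements of any layer that refer to the same segment as elem."""
    return [other for other in index[elem.get(SEGMENT_ATTR)] if other is not elem]


root = ET.fromstring(SGF)
index = build_reference_index(root)
pron = root.find(".//{urn:example:phrase}pron")
print([other.tag for other in co_located(pron, index)])
```

Building the index once and looking segment IDs up in it avoids comparing attribute strings across the whole instance for every element, which is the same reasoning given above for preferring id() and idref() over string comparison predicates.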
For a first evaluation we have chosen both the aforementioned complete corpus and our
largest single text, a German scientific article comprising 157 paragraphs, 696 sentences,
12,345 tokens, 2,550 markables and 1,358 anaphoric relations (14,985 segments in total),
annotated on the three annotation levels described above. All values are average results
of five executions on two different machines:
PC1: a Sun Fire V20z equipped with two single-core AMD Opteron 248 processors clocked at 2.2
GHz and 6 GB RAM running Sun Solaris 10 (64bit) with Saxon-SA 22.214.171.124J on Java
1.5.0_15 (2 GB RAM allocated for Java VM) and SWI-Prolog 5.6.21 (128 MB allocated as
local stack limit).
PC2: a standard PC equipped with an Intel dual-core Core2Duo E6600 clocked at 2.99
GHz with 3.12 GB RAM running Microsoft Windows XP SP3 (32bit) with Saxon-SA
126.96.36.199J on Java 1.6.0_06 (1 GB RAM allocated for Java VM) and SWI-Prolog 5.6.57 (128
MB allocated as local stack limit).
Included in the XQuery results is the validation of five XSD files (-val
parameter) and the output of an XML file (-o
parameter) with a root element and the
corresponding query results underneath. For comparison, we evaluated the same queries for
the Prolog fact base architecture used in the first project phase (cf. section “Prolog-based architectures”) on the same two machines. For the latter, the amount of time for
consulting the Prolog fact base containing the annotated data (14.3 MB in size, 3.37 sec on
PC1; 2.94 sec on PC2) and the Prolog query file (4.3 KB in size, 0.0 sec on both machines)
is not included in the results. The query results are output to a separate text file.
Evaluation results (in seconds). Average of five executions. All values are given as PC1 / PC2.
| Query | Prolog query results for single text | XQuery results for single text | XQuery results for whole corpus |
|-------|--------------------------------------|--------------------------------|---------------------------------|
| Q1    | 0.22 / 0.054                         | 4.612 / 1.244                  | 9.609 / 4.162                   |
| Q2    | 13.502 / 4.554                       | 5.161 / 1.234                  | 9.390 / 4.357                   |
| Q3    | 0.084 / 0.03                         | 4.035 / 1.219                  | 9.556 / 4.084                   |
| Q7    | 30.66 / 7.798                        | 5.764 / 1.481                  | 11.669 / 5.35                   |
| Q8    | 84.16 / 24.738                       | 15.379 / 11.134                | 152.683 / 114.525               |
Note that in contrast to the graph-based architectures described in section “Graph-based architectures”, the XQueries and their evaluation results depend on the annotation
layers that are imported into the SGF base layer. This means that especially Q1, Q2 and Q3
are very fast because they can be performed inline in our corpus (i.e. both sentence and
token information are descendants of the same annotation element, and the
token element itself carries its textual content).
For Q7, information derived from different annotation layers has to be taken into account;
however, since only the id() function is used, the results are satisfactory as well. Q8 is
the only XQuery that requires the identification of the respective segment
element and the subsequent use of the idref() function in order to get the corresponding
annotations. For these reasons, the advantage of using SGF over comparable architectures
rises or drops depending on the imported annotation layers. To further reduce processing
costs it is possible to use merged inline annotation layers (e.g. a logical document layer
and a POS layer) as a combined, single SGF layer and use separate SGF layers only when
overlaps occur. In this case the XML-inherent hierarchies can be used for (inline) analysis
of large parts of the annotated data, while SGF's ID/IDREF mechanism
is used only where it cannot be avoided.
The performance figures for the Prolog fact base format show higher performance for
simple queries but lower performance for more complex ones. These figures result from the
fact that our corpus annotation makes heavy use of attributes, which leads to distributed
information. We believe that a re-implemented Prolog fact base format could both reduce file
size and speed up the querying.
Caveats and problems
Up to now, several former inline annotation layers have been converted into SGF and the
format as such is quite stable (although minor changes may occur). Apart from the huge
amount of markup that is necessary to do this kind of analysis, problems may arise when the
annotation layers that are stored in SGF are exported back into their original inline
representation. This is especially true when the annotation layers contain empty elements,
for which it is impossible to provide the exact position in the original document tree (of
course, the base:segment attribute can be used for these elements as well; but when
a large number of empty elements appears in a row, the values of all their respective
base:segment attributes would be identical). Although our largest SGF
instance is 6 MB in size including optional whitespace segments (4.8 MB without optional
whitespace segments), it is still smaller than the respective Prolog fact base
representation at 14.3 MB, cf. section “Prolog-based architectures”.
When it comes to queries, SGF relies on the imported annotation layers. For this reason,
there is no standard set of queries available and the execution time cannot be easily
[Alink et al., 2006] Alink, W., Bhoedjang, R., de
Vries, A. P., and Boncz, P. A. Efficient XQuery Support for Stand-Off
Annotation. In: Proceedings of the 3rd International Workshop on XQuery
Implementation, Experience and Perspectives, in cooperation with ACM SIGMOD, Chicago, USA,
[Alink et al., 2006a] Alink, W., Jijkoun, V., Ahn,
D., and de Rijke, M. Representing and Querying Multi-dimensional Markup
for Question Answering. In: Proceedings of the 5th EACL Workshop on NLP and XML
(NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing, Trento, 2006.
[Bayerl et al., 2003] Bayerl, P. S., Lüngen, H.,
Goecke, D., Witt, A. and Naber, D. Methods for the semantic analysis of
document markup. In: Roisin, C., Munson, E. and Vanoirbeek, C. (eds.), Proceedings
of the 3rd ACM Symposium on Document Engineering (DocEng), Grenoble, pages 161-170, 2003.
[Bird and Liberman, 1999] Bird, S. and Liberman,
M. Annotation graphs as a framework for multidimensional linguistic
data analysis. In: Proceedings of the Workshop "Towards Standards and Tools for
Discourse Tagging", pages 1–10. Association for Computational Linguistics, 1999.
[Bird et al., 2000] Bird, S., Day, D., Garofolo, J.,
Henderson, J., Laprun, C. and Liberman, M. ATLAS: A flexible and
extensible architecture for linguistic annotation. In: Proceedings of the Second
International Conference on Language Resources and Evaluation, pages 1699–1706, Paris, 2000.
European Language Resources Association.
[Bird and Liberman, 2001] Bird, S. and Liberman, M.
A formal framework for linguistic annotation. Speech
Communication, 33(1–2): pages 23–60, 2001.
[Bird et al., 2006] Bird, S., Chen, Y., Davidson, S.,
Lee, H. and Zheng, Y. Designing and Evaluating an XPath Dialect for
Linguistic Queries. In: Proceedings of the 22nd International Conference on Data
Engineering (ICDE), Atlanta, USA, 2006.
[Carletta et al., 2003] Carletta, J., Kilgour, J.,
O’Donnell, T. J., Evert, S. and Voormann, H. The NITE Object Model
Library for Handling Structured Linguistic Annotation on Multimodal Data Sets.
In: Proceedings of the EACL Workshop on Language Technology and the Semantic Web (3rd Workshop
on NLP and XML (NLPXML-2003)), Budapest, Hungary, 2003.
[Clark, 1977] Clark, H. (1977). Bridging. In: Johnson-Laird, P.N. and Wason, P.C. (eds.): Thinking: Readings in
Cognitive Science. Cambridge: Cambridge University Press, 1977, pages 411-420.
[Cowan et al., 2006] Cowan, J., Tennison, J. and Piez,
W. LMNL update. In: Proceedings of Extreme Markup Languages,
Montréal, Québec, 2006.
[DeRose et al., 2001] DeRose, S., Maler, E. and
Orchard, D. XML Linking Language (XLink) Version 1.0. W3C
Recommendation, World Wide Web Consortium, June 2001. Online: http://www.w3.org/TR/2001/REC-xlink-20010627/.
[DeRose, 2004] DeRose, S. J. Markup Overlap: A Review and a Horse. In: Proceedings of Extreme Markup
[Diewald et al. (submitted)] Diewald, N.,
Stührenberg, M., Garbar, A. and Goecke, D. Serengeti -- Webbasierte
Annotation semantischer Relationen. To appear in LDV Forum - Zeitschrift für
Computerlinguistik und Sprachtechnologie.
[Dipper, 2005] Dipper, S. XML-based stand-off representation and exploitation of multi-level linguistic
annotation. In: Proceedings of Berliner XML Tage 2005 (BXML 2005), pages 39–50,
Berlin, Germany, 2005.
[Dipper et al., 2007] Dipper, S., Götze, M.,
Küssner, U. and Stede, M. Representing and Querying Standoff
XML. In: Rehm, G., Witt, A. and Lemnitzer, L. editors, Datenstrukturen für
linguistische Ressourcen und ihre Anwendungen. Data Structures for Linguistic Resources and
Applications. Proceedings of the Biennial GLDV Conference 2007, pages 337–346, Tübingen, 2007.
Gunter Narr Verlag.
[Durusau and O'Donnell, 2002] Durusau, P. and
O'Donnell, M. B. Concurrent Markup for XML Documents. In:
Proceedings of the XML Europe conference 2002.
[Fellbaum, 1998] Fellbaum, C. WordNet: An electronic lexical database. Cambridge, Mass.: MIT Press, 1998.
[Gleim et al., 2007] Gleim, R., Mehler, A. and
Eikmeyer, H.-J. Representing and Maintaining Large Corpora.
In: Proceedings of the Corpus Linguistics 2007 Conference, Birmingham (UK), 2007.
[Goecke and Witt, 2006] Goecke, D. and Witt, A.
Exploiting Logical Document Structure for Anaphora
Resolution. In: Proceedings of the 5th International Conference on Language
Resources and Evaluation (LREC 2006). Genoa, Italy, 2006.
[Goecke et al. (to appear)] Goecke, D., Stührenberg,
M. and Wandmacher, T. Extraction and representation of semantic
relations for resolving definite descriptions. To appear in LDV Forum -
Zeitschrift für Computerlinguistik und Sprachtechnologie.
[Goecke et al., 2008] Goecke, D., Lüngen, H.,
Metzing, D., Stührenberg, M. and Witt, A. Different Views on Markup.
Distinguishing levels and layers. In: Linguistic modeling of information and
Markup Languages. Contributions to language technology. Springer, 2008.
[Grosso et al., 2003] Grosso, P., Maler, E., Marsh,
J. and Walsh, N. XPointer Framework. W3C Recommendation,
World Wide Web Consortium, March 2003. Online: http://www.w3.org/TR/2003/REC-xptr-framework-20030325/.
[Hamp and Feldweg, 1997] Hamp, B. and Feldweg, H.
GermaNet - a Lexical-Semantic Net for German. In:
Proceedings of ACL workshop "Automatic Information Extraction and Building of Lexical
Semantic Resources for NLP Applications", pages 9–15, New Brunswick, New Jersey,
1997. Association for Computational Linguistics.
[Hilbert, 2005] Hilbert, M. MuLaX – ein Modell zur Verarbeitung mehrfach XML-strukturierter Daten. Diploma
thesis, Bielefeld University, 2005.
[Hilbert et al., 2005] Hilbert, M., Schonefeld, O.
and Witt, A. Making CONCUR work. In: Proceedings of Extreme
Markup Languages, 2005.
[Holt et al., 2006] Holt, R., Schürr, A., Elliott Sim,
S and Winter, A. GXL: A graph-based standard exchange format for
reengineering. In: Science of Computer Programming, 60(2): 149-170, 2006.
[Huitfeldt and Sperberg-McQueen, 2001] Huitfeldt,
C. and Sperberg-McQueen, C.M. Texmecs: An experimental markup
meta-language for complex documents. Markup Languages and Complex Documents
(MLCD) Project, February 2001.
[Ide and Romary, 2004] Ide, N. and Romary, L. International Standard for a Linguistic Annotation Framework. Journal
of Natural Language Engineering, 10(3-4): pages 211-225, 2004.
[Ide and Romary, 2007] Ide, N. and Romary, L.
Towards International Standards for Language Resources. In:
Dybkjaer, L., Hemsen, H., and Minker, W., editors, Evaluation of Text and Speech Systems,
pages 263--284. Springer.
[Ide and Suderman, 2007] Ide, N. and Suderman, K.
GrAF: A Graph-based Format for Linguistic Annotations. In:
Proceedings of the Linguistic Annotation Workshop, pages 1-8, Prague, Czech Republic.
Association for Computational Linguistics, 2007.
[Laprun et al., 2002] Laprun, C., Fiscus, J. G.,
Garofolo, J. and Pajot, S. Recent improvements to the ATLAS
architecture. In: Proceedings of HLT 2002, Second International Conference on Human
Language Technology Research, 2002.
[ISO/IEC 19757-2:2003] ISO/IEC 19757-2:2003.
Information technology – Document Schema Definition Language (DSDL) –
Part 2: Regular-grammar-based validation – RELAX NG (ISO/IEC 19757-2).
International Standard, International Organization for Standardization, Geneva, 2003.
[ISO/IEC 19757-3:2006] ISO/IEC 19757-3:2006.
Information technology – Document Schema Definition Language (DSDL) –
Part 3: Rule-based validation – Schematron. International standard, International
Organization for Standardization, Geneva, 2006.
[Jagadish et al., 2004] Jagadish, H. V.,
Lakshmanan, L. V. S., Scannapieco, M., Srivastava, D. and Wiwatwattana, N. Colorful XML: One hierarchy isn’t enough. In: Proceedings of ACM
SIGMOD International Conference on Management of Data (SIGMOD 2004), pages 251–262, Paris,
June 13-18 2004. ACM Press New York, NY, USA.
[Kay 2008] Kay, M. XSLT 2.0 and
XPath 2.0 Programmer’s Reference. Wiley Publishing, Indianapolis, 4th edition,
[Le Maitre, 2006] Le Maitre, J. Describing multistructured XML documents by means of delay nodes. In:
DocEng ’06: Proceedings of the 2006 ACM symposium on Document engineering, pages 155–164, New
York, NY, USA, 2006. ACM Press.
[Mitkov, 2002] Mitkov, R. Anaphora resolution. London: Longman, 2002
[Poesio and Kruschwitz 2008] Poesio, M. and
Kruschwitz, U. Anawiki: Creating anaphorically annotated resources
through web cooperation. In: Proceedings of LREC 2008.
[Polanyi, 1988] Polanyi, L. A
formal model of the structure of discourse. In: Journal of Pragmatics 12 (1988),
pages 601-638. doi:10.1016/0378-2166(88)90050-1.
[Robertson, 2002] Robertson, E. Combining Schematron with other XML Schema languages, June 2002.
[Schonefeld, 2007] Schonefeld, O. XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of
concurrent markup. In: Rehm, G., Witt, A., Lemnitzer, L. (eds.), Datenstrukturen
für linguistische Ressourcen und ihre Anwendungen. Data Structures for Linguistic Resources
and Applications. Proceedings of the Biennial GLDV Conference 2007, Tübingen, Germany, 2007.
Gunter Narr Verlag.
[Soon et al., 2001] Soon, W.M., Lim, D.C.Y. and Ng,
H.T. (2001). A Machine Learning Approach to Coreference Resolution of
Noun Phrases. In: Computational Linguistics 27 (2001), No. 4, pages 521-544.
[Simons and Bird, 2003] Simons, G. and Bird, S.
OLAC Metadata. OLAC: Open Language Archives Community,
2003. Online: http://www.language-archives.org/OLAC/metadata.html.
[Sperberg-McQueen et al., 2000] Sperberg-McQueen, C. M., Huitfeldt, C. and Renear, A. Meaning and
Interpretation of markup. Markup Languages - Theory & Practice, 2, pages
215-234, 2000. doi:10.1162/109966200750363599.
[Sperberg-McQueen et al., 2002] Sperberg-McQueen, C. M., Dubin, D., Huitfeldt, C. and Renear, A. Drawing inferences on the basis of markup. In: Proceedings of Extreme Markup
[Sperberg-McQueen and Burnard, 2002]
Sperberg-McQueen, C. M. and Burnard, L. (eds.). TEI P4: Guidelines
for Electronic Text Encoding and Interchange. published for the TEI Consortium by
Humanities Computing Unit, University of Oxford, Oxford, Providence, Charlottesville, Bergen,
[Sperberg-McQueen and Huitfeldt, 2004]
Sperberg-McQueen, C. M. and Huitfeldt, C. GODDAG: A Data Structure for
Overlapping Hierarchies. In: King, P. and Munson, E. V. (eds.), Proceedings of
the 5th International Workshop on the Principles of Digital Document Processing (PODDP 2000),
volume 2023 of Lecture Notes in Computer Science, pages 139–160. Springer, 2004.
[Strube and Müller, 2003] Strube, M. and Müller, C.
(2003). A machine learning approach to pronoun resolution in spoken
dialogue. In: ACL '03: Proceedings of the 41st Annual Meeting on Association for
Computational Linguistics. Morristown, NJ, USA : Association for Computational Linguistics,
2003, pages 168-175.
[Stührenberg et al., 2007] Stührenberg, M.,
Goecke, D., Diewald, N., Cramer, I. and Mehler, A. Web-based annotation
of anaphoric relations and lexical chains. In: Proceedings of the Linguistic
Annotation Workshop (LAW), pages 140–147, Prague. Association for Computational Linguistics,
[Tennison, 2002] Tennison, J. Layered Markup and Annotation Language (LMNL). In: Proceedings of
Extreme Markup Languages, Montréal, Québec, 2002.
[Thompson and McKelvie, 1997] Thompson, H. S. and
McKelvie, D. Hyperlink semantics for standoff markup of read-only
documents. In: Proceedings of SGML Europe ’97: The next decade – Pushing the
Envelope, pages 227–229, Barcelona, 1997.
[Waltinger et al., 2008] Waltinger, U., Mehler,
A. and Stührenberg, M. An Integrated Model of Lexical Chaining:
Application, Resources and its Format. Accepted for Proceedings of Konvens 2008.
[Witt, 2002] Witt, A. Meaning
and interpretation of concurrent markup. In: Proceedings of ALLC-ACH2002, Joint
Conference of the ALLC and ACH, 2002.
[Witt, 2004] Witt, A. Multiple
hierarchies: New Aspects of an Old Solution. In: Proceedings of Extreme Markup
[Witt et al., 2005] Witt, A., Goecke, D., Sasaki, F.,
and Lüngen, H. Unification of XML Documents with Concurrent
Markup. Literary and Linguistic Computing, 20(1): pages 103-116, 2005.
[Witt et al., 2007] Witt, A., Schonefeld, O., Rehm,
G., Khoo, J. and Evang, K. On the lossless transformation of
single-file, multi-layer annotations into multi-rooted trees. In: Proceedings of
Extreme Markup Languages, Montréal, Québec, 2007.
[XML Schema Part 1, 2004] XML Schema Part 1:
Structures Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
[XML Schema Part 2, 2004] XML Schema Part 2:
Datatypes Second Edition. W3C Recommendation, World Wide Web Consortium, 28 October 2004.
[XML Schema 1.1 Part 1, 2008] W3C XML Schema
Definition Language (XSD) 1.1 Part 1: Structures. W3C Working Draft, World Wide Web
Consortium, 20 June 2008. Online: http://www.w3.org/TR/2008/WD-xmlschema11-1-20080620/.
[Yang et al., 2004] Yang, X., Su, J., Zhou, G. and Tan,
C. L. (2004). Improving pronoun resolution by incorporating
coreferential information of candidates. In: Proceedings of the 42nd Annual
Meeting of the Association for Computational Linguistics (ACL04). Barcelona, Spain,