Why TEI stand-off annotation doesn't quite work
and why you might want to use it nevertheless
Piotr Bański
Assistant Professor
Institute of English Studies, University of Warsaw
Abstract
This paper examines the concept of stand-off annotation as it is implemented in the current
version of the TEI Guidelines. We look at the motivation for choosing the stand-off approach to encoding
language resources, briefly recount the history of the concept within the broadly conceived TEI setting (from
TEI P3 and the LT NSL suite, through CES and XCES, to TEI P5), review the various kinds of hyperlink
semantics, and identify three kinds of reasons for the poor uptake of the TEI-recommended stand-off annotation
approach to corpus encoding. We also suggest some solutions that may help change the current
state of affairs.
Balisage: The Markup Conference 2010
August 3–6, 2010
Author's keywords for this paper: TEI; stand-off annotation; hyperlink semantics; corpus encoding