Balisage Paper: Why TEI stand-off annotation doesn't quite work

and why you might want to use it nevertheless

Balisage: The Markup Conference 2010
August 3 - 6, 2010

Author's keywords for this paper:
TEI; stand-off annotation; hyperlink semantics; corpus encoding