How to cite this paper

Quin, Liam R. E. “Characterizing ill-formed XML on the web: An analysis of the Amsterdam Corpus by document type.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012).

Balisage: The Markup Conference 2012
August 7 - 10, 2012

Balisage Paper: Characterizing ill-formed XML on the web

An analysis of the Amsterdam Corpus by document type

Liam R. E. Quin

XML Activity Lead

The World Wide Web Consortium (W3C)

Copyright © 2012 by the author. Used with permission.


This paper builds on the work of Steven Grijzenhout to analyze the Amsterdam XML Corpus in more detail. Where Grijzenhout focused primarily on XML validation, this paper focuses on well-formedness; in addition, rather than measuring error frequency by Internet domain or by country of origin, the analysis presented here is by document type. The aim is to bring a more XML-centric view to the work and to inform work on error recovery in XML parsing.

Author's keywords for this paper:
markup; syntax errors; well-formedness; parsing; XML errors; error recovery; XML corpus