Quin, Liam R. E. “Characterizing ill-formed XML on the web: An analysis of the Amsterdam Corpus by document type.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012).

Balisage: The Markup Conference 2012
August 7 - 10, 2012

Balisage Paper: Characterizing ill-formed XML on the web

An analysis of the Amsterdam Corpus by document type

Liam R. E. Quin

XML Activity Lead

The World Wide Web Consortium (W3C)

This paper builds on the work of Steven Grijzenhout to analyze the Amsterdam XML Corpus in more detail. Where Grijzenhout focused primarily on XML validation, this paper focuses on well-formedness; in addition, rather than measuring error frequency by Internet domain or by country of origin, the analysis presented here is by document type. The aim is to bring a more XML-centric view to the analysis and to inform work on error recovery in XML parsing.

Author's keywords for this paper:
markup; syntax errors; well-formedness; parsing; XML errors; error recovery; XML corpus