Characterizing ill-formed XML on the web

An analysis of the Amsterdam Corpus by document type

Liam R. E. Quin

XML Activity Lead

The World Wide Web Consortium (W3C)

Copyright © 2012 by the author. Used with permission.

Balisage: The Markup Conference 2012
August 7 - 10, 2012

Abstract

This paper builds on the work of Steven Grijzenhout to analyze the Amsterdam XML Corpus in more detail. Where Grijzenhout focused primarily on XML validation, this paper focuses on well-formedness; in addition, rather than measuring error frequency by Internet domain or by country of origin, the analysis presented here is organized by document type. The aim is to bring a more XML-centric view to the work and to inform work on error recovery in XML parsing.

Author's keywords for this paper: markup; syntax errors; well-formedness; parsing; XML errors; error recovery; XML corpus