How to cite this paper

Mietchen, Daniel, Chris Maloney and Nils Dagsson Moskopp. “Inconsistent XML as a barrier to reuse of Open Access Content.” Presented at Impromptu JATS User Group Meeting, Washington, DC, October 22, 2013. In Proceedings of the Impromptu JATS User Group Meeting. Balisage Series on Markup Technologies, vol. 12 (2013).

Impromptu JATS User Group Meeting
October 22, 2013

Balisage Paper: Inconsistent XML as a barrier to reuse of Open Access Content

Daniel Mietchen

Museum für Naturkunde - Leibniz-Institut für Evolutions- und Biodiversitätsforschung, Germany

Open Knowledge Foundation, Germany

Chris Maloney

PMC/NCBI/NIH, Bethesda, MD

Nils Dagsson Moskopp

Humboldt University, Germany (student)


In this paper, we describe the current state of selected aspects of article tagging within the PMC Open Access (OA) subset. As a case study, we draw on our experience developing the Open Access Media Importer, a tool that harvests content from the OA subset for automated upload to Wikimedia Commons.

Tagging inconsistencies affect several aspects of the articles, ranging from licensing to keywords to the media types of supplementary materials. All of these complicate large-scale reuse, but unclear licensing statements had the greatest impact: they required us to implement text-mining-like algorithms to determine accurately whether specific content was compatible with reuse on Wikimedia Commons.
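To make the licensing problem concrete, the following is a minimal sketch of the kind of fallback logic such a check might need. It is not the Open Access Media Importer's actual code. The function names, the regular expressions, and the allow-list of license families are assumptions made for illustration, as is the choice to look only at JATS `<license>` elements: first at the `xlink:href` attribute, then at nested `<ext-link>` elements, and finally at a license URL buried in the running text.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Assumption for illustration: only these Creative Commons families are
# treated as compatible with reuse (no NC or ND variants).
COMPATIBLE = re.compile(
    r"creativecommons\.org/(licenses/(by|by-sa)/|publicdomain/zero/)", re.I
)
# Any Creative Commons URL that appears in free text.
ANY_CC = re.compile(r"creativecommons\.org/(?:licenses|publicdomain)/[\w\-./]+", re.I)


def find_license_url(article_xml):
    """Return the first license URL found in the article, or None."""
    root = ET.fromstring(article_xml)
    for lic in root.iter("license"):
        # Best case: the URL is tagged on the <license> element itself.
        href = lic.get(XLINK_HREF)
        if href:
            return href.strip()
        # Next: an <ext-link> inside the license statement.
        for link in lic.iter("ext-link"):
            href = link.get(XLINK_HREF)
            if href and "creativecommons.org" in href:
                return href.strip()
        # Last resort: scan the running text of the statement for a URL.
        match = ANY_CC.search(" ".join(lic.itertext()))
        if match:
            return "http://" + match.group(0).rstrip(".")
    return None


def is_compatible(url):
    """True if the URL belongs to one of the assumed compatible families."""
    return bool(url and COMPATIBLE.search(url))
```

Calling `is_compatible(find_license_url(xml_text))` on an article's XML then gives a yes/no answer. Each fallback step exists only because the preceding, more machine-readable form of tagging is missing, which is the cost that inconsistent license tagging imposes on reusers.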

Besides presenting examples of incorrectly tagged XML from a range of publishers, we also explore past and current efforts towards standardizing license tagging, and we offer a set of recommendations for tagging certain data so that it is compatible with existing standards, consistent, and machine-readable.