jats4r
A client-side JATS4R validator using Saxon-CE

Chris Maloney

JATS4R & NCBI


Alf Eaton

JATS4R & PeerJ


Jeff Beck

JATS4R & NCBI


Balisage: The Markup Conference 2015 (August 10-14, 2015)

What is JATS4R?


JATS4R is "JATS for Reuse"

A group of publishers and others formed to describe "Best Practices" for tagging articles in JATS.

What is JATS4R?

So: What is JATS?

JATS


JATS is NISO Z39.96-2012 Journal Article Tag Suite

It defines a suite of XML elements and attributes that describe the content and metadata of journal articles including research and non-research articles.

It was developed from the NLM DTDs, which were released in 2003

... and has become the standard for tagging journal content.

JATS


The NLM/JATS article models were developed for archiving, so they are more descriptive than prescriptive.

This makes it easier for anyone trying to convert articles from many different models into one model

A wide target is easier to hit.

It also made it easier for anyone publishing original content in NLM/JATS.

Fewer rules mean you can do what you want.

So, what's the problem?

Reuse

Large sets of documents all tagged in JATS are not tagged consistently enough for easy reuse of the XML.

So, I was paying attention

epigraph

The Bot

blank bot

In 2012, an automated software tool, the Open Access Media Importer (OAMI), started using the articles in the PMC Open Access Subset to find audio and video objects that could be loaded to Wikimedia Commons for use on Wikipedia and elsewhere.

The OAMI used several JATS elements and attributes including those for licensing, keywords and media types. This use revealed inconsistencies in the XML available from PMC.

Although JATS is a standard and PMC performs some standardization of the submitted XML during ingest, JATS has had to allow for the fine nuances of publishing and the varying requirements of different types of content and different publishers.

So, publishers use JATS inconsistently, which leads to problems when reusing the content.

Reuse

reuse 1

Reuse

reuse 2

Reuse

reuse 3

Reuse

reuse 4

<permissions>

Contained in: <article-meta> section of the <front> matter.

permissions example

Things can go wrong

<permissions>
  <copyright-statement>
    Uosaki et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
  </copyright-statement>
  <copyright-year>2011</copyright-year>
</permissions>
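To make the problem concrete: a reuse tool that looks for a dedicated <license> element finds nothing in the example above, because the license terms are buried in the copyright statement. The sketch below is a deliberately naive illustration of such a check. The function name and both sample strings are hypothetical, the assumption that a consumer keys on <license> is ours, and a regular expression is no substitute for a real XML parser or for the Schematron tests described later.

```javascript
// Naive sketch (assumption: a consumer looks for a <license> element inside
// <permissions>). Not the validator's code; regex matching is only illustrative.
function hasLicenseElement(xmlText) {
  var permissions = /<permissions\b[^>]*>([\s\S]*?)<\/permissions>/.exec(xmlText);
  return permissions !== null && /<license\b/.test(permissions[1]);
}

// Shortened version of the example above: license terms only in free text.
var untagged =
  '<permissions>' +
  '<copyright-statement>Uosaki et al. This is an open-access article ' +
  'distributed under the terms of the Creative Commons Attribution License' +
  '</copyright-statement>' +
  '<copyright-year>2011</copyright-year>' +
  '</permissions>';

// Hypothetical counterpart that carries the license in its own element.
var tagged =
  '<permissions>' +
  '<copyright-year>2011</copyright-year>' +
  '<license xlink:href="http://creativecommons.org/licenses/by/4.0/"/>' +
  '</permissions>';

console.log(hasLicenseElement(untagged));
console.log(hasLicenseElement(tagged));
```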

JATS for Reuse

In June 2014, a group of publishers and aggregators met to discuss JATS reusability issues and formed "JATS for Reuse" (JATS4R).

They decided to publish best-practice tagging recommendations to improve the reusability of JATS-tagged article content.

But tagging rules do no one any good without a way to check if they have been met.

"Schematron is your friend"—J Cowan

We wrote the tests in Schematron.

And wanted to make a public tool for testing articles.

http://jats4r.org/validator/

The Tool

When run against an instance document, the tool

Demo

Tool Dataflow

tool data flow

XML Parsing and DTD Validation

Saxon-CE uses the browser’s native XML parser, which, in most browsers, does not read the DTD specified in the DOCTYPE declaration. That is a problem when the document uses named character entities, because they cannot be resolved without the DTD.

So, a separate tool was required to parse the documents using the DTD, and resolve those named entity references with their corresponding replacement text. To accomplish this, we have incorporated a JavaScript port of xmllint.

The GitHub project xml.js is a port of libxml, including xmllint, to JavaScript, built with Emscripten.

Emscripten is a free tool for compiling C and C++ into optimized JavaScript code.

The following code invokes xmllint to parse and validate the instance document:

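// xmllint options: load the DTD, validate against it, and expand entity references
var args = ['--loaddtd', '--valid', '--noent', 'dummy.xml'];
// Files made visible to xmllint: the instance document and its DTD
var files = [
  { path: 'dummy.xml', data: contents },
  { path: dtd_filename, data: dtd_contents }
];
result = xmltool(args, files);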

In case it wasn't obvious to you

Note that the DTD is passed into the function as an entry in the `files` array, with `dtd_filename` as its path.

The JavaScript implementation of xmllint looks the DTD up by its SYSTEM identifier, so the validator ensures that the `dtd_filename` variable matches that SYSTEM identifier.

NLM and JATS DTDs

Before the XML is parsed, the validator uses a regular expression to check for the presence of a DOCTYPE and extracts the PUBLIC and SYSTEM identifiers.

It uses the PUBLIC identifier to determine which specific NLM or JATS DTD to use to parse the file (there are currently 62 variants). It records the SYSTEM identifier as "dtd_filename" and passes it to xmllint, which dereferences that name to get the DTD contents.
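A minimal sketch of the DOCTYPE step, assuming a declaration that carries both identifiers. The function name and the regular expression are illustrative and are not taken from the validator's source; the sample uses the public identifier of the JATS 1.0 archiving DTD.

```javascript
// Sketch: extract the PUBLIC and SYSTEM identifiers from a DOCTYPE declaration.
// parseDoctype and its pattern are hypothetical, not the validator's own code.
function parseDoctype(xmlText) {
  // Matches: <!DOCTYPE root PUBLIC "public-id" "system-id"
  var pattern = /<!DOCTYPE\s+([^\s>]+)\s+PUBLIC\s+(["'])([\s\S]*?)\2\s+(["'])([\s\S]*?)\4/;
  var m = pattern.exec(xmlText);
  if (!m) {
    return null; // no DOCTYPE with both identifiers was found
  }
  return { root: m[1], publicId: m[3], systemId: m[5] };
}

var sample =
  '<?xml version="1.0"?>\n' +
  '<!DOCTYPE article PUBLIC ' +
  '"-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" ' +
  '"JATS-archivearticle1.dtd">\n' +
  '<article/>';

var doctype = parseDoctype(sample);
console.log(doctype.publicId); // used to choose among the DTD variants
console.log(doctype.systemId); // would become dtd_filename
```

Returning null for a missing DOCTYPE lets the caller decide how to report it, rather than failing inside the parser.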

Schematrons of JATS4R Recommendations

As described above, the JATS4R recommendations are encoded in Schematron. The recommendations are broken down in two ways:

The Schematron rules for each of the combinations of level and topic are encapsulated in their own files.

schematrons

There are two "master" Schematron files which break down the tests in two different ways:

Schematron Offline Processing

Because Saxon-CE is an XSLT processor and not a Schematron processor, we convert the Schematrons to XSLT 2.0 files using the conversion available from http://schematron.com.

The validator has a selector for reporting level: errors, warnings, or info.

Topic-specific validation using the client-side validator is not available at this time, but could be added easily if there is a demand for it.

Client-side Validation

The validator runs the instance document through the appropriate XSLT, which generates a report in Schematron Validation Report Language XML (SVRL).

The validator code invokes Saxon, passing the URL of the appropriate XSLT file. The results, in SVRL format, are converted into an HTML report using a separate XSLT transformation.

Saxon-CE then inserts the HTML report into the page's DOM, where it is presented to the user.
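The two transformations can be sketched roughly as follows. This is an outline under assumptions, not the validator's actual code: the function name and the stylesheet file names in the usage comment are hypothetical, and the Saxon-CE object is passed in as a parameter only to keep the flow explicit. It relies on the Saxon-CE calls parseXML, requestXML, newXSLT20Processor, transformToDocument and updateHTMLDocument.

```javascript
// Sketch of the two-step flow. `saxon` stands for the global Saxon object that
// Saxon-CE provides in the browser; both stylesheet URLs are hypothetical.
function validate(saxon, instanceText, schematronXsltUrl, reportXsltUrl) {
  // instanceText is assumed to be the entity-resolved output of xmllint.
  var instanceDoc = saxon.parseXML(instanceText);

  // Step 1: run the compiled Schematron (an XSLT 2.0 stylesheet) to produce SVRL.
  var checker = saxon.newXSLT20Processor(saxon.requestXML(schematronXsltUrl));
  var svrlDoc = checker.transformToDocument(instanceDoc);

  // Step 2: convert the SVRL to HTML and let Saxon-CE write it into the page.
  var reporter = saxon.newXSLT20Processor(saxon.requestXML(reportXsltUrl));
  reporter.updateHTMLDocument(svrlDoc);

  return svrlDoc;
}

// Possible use in the browser once Saxon-CE has loaded (file names are made up):
// var onSaxonLoad = function () {
//   validate(Saxon, contents, 'jats4r-errors.xsl', 'svrl-to-html.xsl');
// };
```

With updateHTMLDocument, the report stylesheet itself decides where in the page its output goes, so the calling code does not need to touch the DOM directly.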

xpath-locator function

The Problem

Schematron outputs the locations of problems as XPath expressions, which are not very user-friendly.

/article[1]/front[1]/article-meta[1]/permissions[1]/license[1]

How do we show the exact location in the input file?
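One small piece of the puzzle is turning such a path into steps that can be displayed or walked through the document. The sketch below handles only the simple name[index] form shown above. The function names are hypothetical, and it is not presented as the approach the validator takes.

```javascript
// Sketch: split a location like /article[1]/front[1] into named, indexed steps.
// Only plain "name[index]" steps are supported; anything else throws.
function parseLocation(xpath) {
  return xpath.split('/').filter(Boolean).map(function (step) {
    var m = /^([^\[\]]+)(?:\[(\d+)\])?$/.exec(step);
    if (!m) {
      throw new Error('Unsupported step: ' + step);
    }
    // A step without a positional predicate is treated as the first match.
    return { name: m[1], index: m[2] ? parseInt(m[2], 10) : 1 };
  });
}

// Render the steps as a breadcrumb, showing the index only when it is not 1.
function describeLocation(xpath) {
  return parseLocation(xpath).map(function (s) {
    return s.index > 1 ? s.name + ' (' + s.index + ')' : s.name;
  }).join(' > ');
}

console.log(describeLocation(
  '/article[1]/front[1]/article-meta[1]/permissions[1]/license[1]'
));
```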

Possible solutions

What we've done

Future work

How can I get involved?

Poster

poster