JATS4R is "JATS for Reuse"
A group of publishers and others formed to describe "Best Practices" for tagging articles in JATS.
So: What is JATS?
JATS is NISO Z39.96-2012 Journal Article Tag Suite
It defines a suite of XML elements and attributes that describe the content and metadata of journal articles including research and non-research articles.
It was developed from the NLM DTDs, which were released in 2003
... and has become the standard for tagging journal content.
The NLM/JATS article models were developed for archiving, so they are more descriptive than prescriptive.
This makes it easier for anyone trying to convert articles from many different models into one model
A wide target is easier to hit.
It also makes it easier for anyone publishing original content in NLM/JATS.
Fewer rules mean you can do what you want.
Reuse
Large sets of documents all tagged in JATS are not tagged consistently enough for easy reuse of the XML.
In 2012, an automated software tool, the Open Access Media Importer (OAMI), started using the articles in the PMC Open Access Subset to find audio and video objects that could be uploaded to Wikimedia Commons for use on Wikipedia and elsewhere.
The OAMI used several JATS elements and attributes including those for licensing, keywords and media types. This use revealed inconsistencies in the XML available from PMC.
Although JATS is a standard and PMC performs some standardization of the submitted XML during ingest, JATS has had to allow for the fine nuances of publishing and the varying requirements of different types of content and different publishers.
So, publishers use JATS inconsistently, which leads to problems when reusing the content.
Contained in: <article-meta> section of the <front> matter.
<permissions>
  <copyright-statement>Uosaki et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</copyright-statement>
  <copyright-year>2011</copyright-year>
</permissions>
In June 2014, a group of publishers and aggregators met to discuss JATS reusability issues and formed "JATS for Reuse".
They decided to publish best-practice tagging recommendations to improve the reusability of JATS-tagged article content.
But tagging rules do no one any good without a way to check if they have been met.
We wrote the tests in Schematron, and we wanted to make a public tool for testing articles.
http://jats4r.org/validator/
When run against an instance document, the tool
Saxon-CE uses the browser’s native XML parser, which, in most browsers, does not read the DTD specified in the DOCTYPE declaration. That is a problem when the document uses named character entities, because those entities are defined in the DTD.
So, a separate tool was required to parse the documents using the DTD, and resolve those named entity references with their corresponding replacement text. To accomplish this, we have incorporated a JavaScript port of xmllint.
The GitHub project xml.js is a port of libxml, including xmllint, to JavaScript, built with Emscripten.
Emscripten is a free tool for compiling C and C++ into optimized JavaScript code.
The following code invokes xmllint to parse and validate the instance document:
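// --loaddtd: load the DTD; --valid: validate against it;
// --noent: replace entity references with their replacement text
var args = ['--loaddtd', '--valid', '--noent', 'dummy.xml'];
// Virtual files handed to xmllint: the instance document and the DTD
var files = [
  { path: 'dummy.xml', data: contents },
  { path: dtd_filename, data: dtd_contents }
];
result = xmltool(args, files);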
Note that the DTD is passed into the function as an entry in the `files` array, with `dtd_filename` as its path.
The JavaScript implementation of xmllint locates the DTD by its SYSTEM identifier, so the validator ensures that the `dtd_filename` variable matches that SYSTEM identifier.
Before the XML is parsed, the validator uses a regular expression to check for the presence of a DOCTYPE and extracts the PUBLIC and SYSTEM identifiers.
It uses the PUBLIC identifier to determine which specific NLM or JATS DTD to use to parse the file (there are currently 62 variants). It records the SYSTEM identifier as "dtd_filename" and passes it to xmllint, which dereferences that name to get the DTD contents.
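The DOCTYPE check can be sketched roughly as follows. This is a minimal illustration, not the validator's actual code: the function name `parseDoctype` and the regular expression are assumptions, and the sketch only handles DOCTYPE declarations with quoted PUBLIC/SYSTEM literals (it does not skip a DOCTYPE that happens to appear inside a comment).

```javascript
// Hypothetical helper (not taken from the JATS4R validator source):
// extract the root name and the PUBLIC / SYSTEM identifiers from a DOCTYPE.
// Handles  <!DOCTYPE name PUBLIC "pubid" "sysid">  and  <!DOCTYPE name SYSTEM "sysid">.
function parseDoctype(xml) {
  var re = /<!DOCTYPE\s+([^\s>\[]+)(?:\s+PUBLIC\s+("[^"]*"|'[^']*')\s+("[^"]*"|'[^']*')|\s+SYSTEM\s+("[^"]*"|'[^']*'))?/;
  var m = re.exec(xml);
  if (!m) {
    return null; // no DOCTYPE declaration found
  }
  // Strip the surrounding quote characters from a matched literal.
  function unquote(s) {
    return s ? s.slice(1, -1) : null;
  }
  return {
    rootName: m[1],
    publicId: unquote(m[2]),
    systemId: unquote(m[3] || m[4])
  };
}

// Example input using the JATS 1.0 archiving DTD identifiers.
var sample =
  '<?xml version="1.0"?>\n' +
  '<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">\n' +
  '<article/>';

var doctype = parseDoctype(sample);
console.log(doctype.publicId); // used to choose which DTD to load
console.log(doctype.systemId); // would become dtd_filename
```

In a sketch like this, the `publicId` would drive the choice among the DTD variants, and the `systemId` would be reused as the virtual file path handed to xmllint.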
As described above, the JATS4R recommendations are encoded in Schematron. The recommendations are broken down in two ways:
The Schematron rules for each of the combinations of level and topic are encapsulated in their own files.
There are two "master" Schematron files which break down the tests in two different ways:
Because Saxon-CE is an XSLT processor and not a Schematron processor, we convert the Schematron files to XSLT 2.0 stylesheets using the conversion available from http://schematron.com.
The validator has a selector for reporting level: errors, warnings, or info.
Topic-specific validation using the client-side validator is not available at this time, but could be added easily if there is a demand for it.
The validator runs the instance document through the appropriate XSLT, which generates a report in Schematron Validation Report Language XML (SVRL).
The validator code invokes Saxon, passing the URL of the appropriate XSLT file. The results, in SVRL format, are converted into an HTML report using a separate XSLT transformation.
The HTML report is then inserted by Saxon-CE into the HTML DOM, and thus presented to the user.
The Problem
Schematron reports the locations of problems as XPath expressions, which are not very user-friendly.
/article[1]/front[1]/article-meta[1]/permissions[1]/license[1]
How do we show the exact location in the input file?
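As a small illustration of working with such a location string, the sketch below splits a path like the one above into steps and renders it as a readable trail. It is not the validator's method, and it does not find the line in the input file. The function names `parseLocation` and `describeLocation` are hypothetical, and only simple absolute paths of the form /name[n]/name[n] are handled (no namespace predicates).

```javascript
// Hypothetical helper: split a simple location path into named, indexed steps.
// Supports only steps of the form  name  or  name[n].
function parseLocation(xpath) {
  return xpath
    .split("/")
    .filter(function (step) { return step.length > 0; })
    .map(function (step) {
      var m = /^([^\[\]]+)(?:\[(\d+)\])?$/.exec(step);
      if (!m) {
        throw new Error("Unsupported location step: " + step);
      }
      // A missing positional predicate is treated as position 1.
      return { name: m[1], index: m[2] ? Number(m[2]) : 1 };
    });
}

// Render the steps as a trail; the position is shown only when it is not 1.
function describeLocation(xpath) {
  return parseLocation(xpath)
    .map(function (s) {
      return s.index > 1 ? "<" + s.name + "> #" + s.index : "<" + s.name + ">";
    })
    .join(" > ");
}

console.log(describeLocation(
  "/article[1]/front[1]/article-meta[1]/permissions[1]/license[1]"
));
// prints "<article> > <front> > <article-meta> > <permissions> > <license>"
```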