Be in the Room Where It Happens

Digital Preservation at Portico and the JATS Ecosystem

Sheila Morrissey


John Meyer


Sushil Bhattarai


Copyright ©2018 ITHAKA

Symposium on Markup Vocabulary Ecosystems
July 30, 2018

Standardization is a social process by which humans come to take things for granted. Through standardization, inventions become commonplace, novelties become mundane, and the local becomes universal. It is a historical, and therefore contested process whose success depends upon the obfuscation of its founding conflicts and contingencies. Successful standards, if they are noticed at all, simply appear as authoritative, objective, uncontroversial, and natural. Standards are, as other scholars have noted, recipes for reality whose black boxes are rarely opened and whose subjectivity and contingency are rarely revealed.


No one really knows how the game is played
The art of the trade
How the sausage gets made
We just assume that it happens
But no one else is in
The room where it happens.



Institutions such as Portico that are engaged in ensuring that the digital record of our time is accessible, usable, discoverable, and verifiable for the very long term continually face the challenge of processing and managing content at very large scales, often with minimal, and sometimes diminishing, resources to accomplish the task.

A key resource in meeting the challenge of preserving born-digital and digitized scholarly literature has been the NLM and JATS standards, and the community of practice centered on those standards. This paper discusses our shared experience in developing those standards: what motivated our participation, how we participated in the evolution of JATS, what benefits we have seen, and what challenges we still face.


What is Portico?

Portico is a community-supported digital preservation service for electronic journals, books, and other content. Portico is a service of ITHAKA, a not-for-profit organization dedicated to helping the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. Portico understands digital preservation as the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability, and accessibility of content over the very long term.

Portico serves as a permanent archive for the content of, at present, 553 publishers (from 57 countries, and on behalf of over 2000 learned societies and associations), with 29,068 committed electronic journal titles, 1,242,793 committed e-book titles, and 187 committed digitized historical collections. The archive currently contains over 93 million archival units (journal articles, e-books, etc.), comprising over 1.5 billion preserved files. Portico is sustained by the support of over 1000 libraries in 22 countries.

Portico functions as a “dark archive”. While a limited number of credentialed users at both depositing publishers’ and subscribing libraries’ sites can access the archive content via an audit interface, subscribers to content in the archive generally continue to access that content at the publishers’ host sites. Participating libraries, including their students, faculty, and staff, gain direct access to archived content when specific conditions or “trigger events” occur that cause titles to be no longer available from the publisher or any other source.

From a technical perspective, the Portico archive is designed to preserve content in an application-neutral manner. Each archived object is packaged in a ZIP file (more precisely, in a ZIP file conforming to the BagIt specification), with all original publisher-provided files, along with any Portico-created digital artifacts and metadata associated with the object. For each journal article in the archive, for example, Portico preserves all original publisher-provided digital artifacts, including PDF page images, along with any Portico-created digital artifacts associated with the item. These latter include structural, technical, descriptive, and provenance metadata, and a normalization of the publisher-provided SGML or XML journal article files to JATS. The entire archive can be reconstituted as a file system object, using non-platform-specific readers, completely independent of the Portico archive system.
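As a rough illustration of what BagIt-style packaging involves, the following Python sketch writes a minimal bag (a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-sha256.txt` checksum manifest) and then re-verifies the payload against the manifest. It is only a sketch under stated assumptions: the function names, the file names, and the bag name `article-0001` are hypothetical, the choice of SHA-256 is an assumption, and nothing here reflects Portico's actual package layout or tooling. Zipping the bag and the optional tag files are omitted.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write payload files under data/, then the bag declaration and manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data_dir / name
        target.write_bytes(content)
        # One manifest line per payload file: checksum, then relative path.
        manifest_lines.append(f"{sha256_of(target)}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> list[str]:
    """Return the payload paths that are missing or whose checksum has changed."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        file_path = bag_dir / rel_path
        if not file_path.is_file() or sha256_of(file_path) != expected:
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "article-0001"  # hypothetical archival unit name
        make_bag(bag, {"article.xml": b"<article/>", "article.pdf": b"%PDF-1.4"})
        print("fixity failures:", verify_bag(bag))
```

Because the manifest and payload are ordinary files, such a package can be inspected and checked with generic tools, which is the application-neutral property described above.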

Why standards in Preservation?

One of the three key aims of all standards, as Andrew Russell has noted [Russell], is compatibility. Digital preservation practitioners’ shorthand for what they do is interoperability with the future – that is to say, compatibility over time [Pasking]. The preservation of digital artifacts is a relatively new endeavor, with, by definition, an always-receding goal. Practitioners must act in the present to make provision for unanticipated, perhaps not-yet-existing uses and contexts for those preserved artifacts.

Standards are a hedge against this inherent uncertainty. They provide a means of uncoupling content from single-vendor or proprietary tools or formats [Morrissey JEP]. And, potentially at least, they mitigate the risks associated with that uncertainty. As Portico’s first CTO, Evan Owens, has pointed out, interoperability with the present is at least a good first step towards ensuring interoperability with the future [Owens]. So digital preservation practitioners construct their repositories, services, and systems informed by many different standard frameworks. A key framework is the Reference Model for an Open Archival Information System (OAIS) [OAIS] and its companion standard for audit of trustworthy digital repositories, ISO 16363:2012, derived from the Center for Research Libraries (CRL) Trustworthy Repositories Audit and Certification (TRAC) standard and checklist [TRAC]. National libraries and consortia of institutions engaged in preservation maintain best-practice checklists for such things as archive replication (number of copies, storage types), the frequency and algorithms for content check-summing, and criteria of choice and recommendations for use of file formats. A great deal of work has gone into defining what metadata (in OAIS terms, “representation information”) are essential for ensuring there is sufficient context to make use of digital objects over the very long term, as well as providing sufficient, and sufficiently reliable, provenance to ensure the authenticity of those objects. Key instances here are PREMIS and METS, both of which have defined XML schemas. There is a considerable body of work on the specification of persistent identifier schemes, including identifiers for digital objects, people, and institutions, and on the community- and institution-building necessary to ensure that, first, those identifiers are used, and second, that they in fact continue to be persistently resolvable.

Why JATS in Preservation?

Portico’s original remit, fifteen years ago, was to develop a sustainable repository for electronic scholarly journals, by acquiring publishers’ “original materials”, from which various manifestations in print and online are derived, and rationalizing, managing, and preserving those content streams. As Evan Owens noted [Owens], quite apart from issues such as non-standard practice in naming, in packaging, in the handling of author-supplied supplementary materials, in the use of persistent identifiers, and in the versioning of content:

Journal publishing models are still evolving: after ten years of delivery of e-journals on the web, there is still wide variation in practice and online PDF and online HTML are many; in effect, an e-journal article is a work with multiple “manifestations.” This makes preservation an interesting challenge, particularly when the manifestation delivered via the web (the HTML) is a subset of richer content and information resources that exist behind the scenes, as it were.

The richer content and information resources were to be captured and preserved by Portico, which would then provide a “normalized” view across all this variegated content for discovery, presentation, and archive management. The tactic employed to accomplish this normalized view was the migration of publisher-provided article metadata to a common journal article metadata vocabulary (while, of course, retaining and preserving the original metadata supplied by the publisher).


As it developed, there were a great many people in the room where, first, the NLM Archiving and Interchange DTD, and, then, the JATS standard happened. And, with all due respect to poor Aaron Burr, the people in the room, both at the time, and after, have been very happy to detail how the sausage was being made. (For some samples, see, first of all, the minutes of the NLM Working Group, the history and description of NLM and JATS by Jeff Beck [Beck JEP, Beck Balisage], and many reports in the JATS-CON Proceedings, which inform the narrative below.)

Both Portico and the NLM/JATS vocabulary share a common origin in a 2001-2002 set of e-journal archive planning projects funded by the Andrew W. Mellon Foundation [Cantara]. One of the projects, at Harvard University Library (HUL), and including Blackwell Publishing, the University of Chicago Press, and John Wiley and Sons, undertook to investigate the issues presented by dynamic e-journals – ones whose content changes frequently. The HUL project report [HUL Report] effectively served as a blueprint when, supported by Mellon Foundation funding, Portico (originally called the JSTOR Electronic Archive Project) was founded in 2003.

As part of its investigation, HUL commissioned Inera’s Bruce Rosenblum to study the feasibility of creating a common e-journal archival DTD. One of the key artifacts of the HUL project was Inera’s report [Inera], which, having surveyed 10 DTDs (from primarily scientific, technological, engineering, and medical publishers), anatomized the key components and characteristics of what a common archival DTD might be, and what likely issues would be encountered in transformations from publisher-specific vocabularies to a single common vocabulary.

Intertwined with these developments in Portico’s history were developments at the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). As Jeff Beck has described [Beck JEP], PubMed Central (PMC) was founded in 2000 in order to take full-text article submissions from publishers and make them available through the PubMed Central database. As content from more and more publishers was submitted, it became clear that the original PMC DTD was insufficiently expressive to handle all the elements in incoming content. NCBI engaged Mulberry Technologies to review the PMC DTD, and assist in the design of a replacement. By this time, the HUL/Inera report was available. Its analysis and recommendations were incorporated in the developing pmc-2.dtd.

That second-generation PMC DTD was submitted to Bruce Rosenblum for review, and was the basis for a 2002 meeting with NCBI/NLM, HUL, the Mellon Foundation, Mulberry Technologies, and Inera to formulate the project that would adapt pmc-2.dtd to a new DTD suitable for general use for archiving any electronic journal article: the NLM Archiving and Interchange Tag Suite. Version 1 of the NLM DTD, released in December, 2002, was the outcome of that collaboration.

In 2003, the NLM Working Group was formed, hosted by NCBI, with participants (some only occasionally) from Microsoft, BioMed Central, American Physical Society (APS), Data Conversion Laboratories (DCL), IEEE, Portico, HighWire Press, Public Library of Science, Mulberry Technologies (who served as secretariat), California Institute of Technology, Griffin Brown, Cadmus, and Inera. The working group refined the tag suite, shepherding it through Version 3, released in 2008. In 2009, the NLM Working Group morphed into a formal NISO working group (with many continuing members), and in August 2012 released NISO JATS Version 1.0. JATS 1.1 was released in November, 2015.

Both the NLM and NISO working groups have sought broad-based input to, and comment on, proposed developments in the standards. Mulberry administers the JATS listserv. Since 2010, JATS-CON has been hosted at NLM. The NLM maintains extensive documentation about both NLM and JATS tag suites, including the NLM working group notes.


Any markup tag scheme is an interpretation. Though the conventions and conventional structures of the scholarly journal article are long established and broadly understood, their detailed expression has been articulated and elaborated in many ways over the more than two-decade-long print-to-digital transition in scholarly publishing. As an indication, in its 15 years, Portico has processed over 1000 different tag sets.

Figure 1: Publisher Markup Formats in Portico Archive

While accomplishing its goal of normalizing these vocabularies to a single tag set, Portico has to ensure that its transformation of incoming content to a normalized format does not distort the original interpretation in the source document. This requirement for fidelity means, in turn, that broad participation, key to the development of any consensus standard, was absolutely essential for the formulation and refinement of an archival tag suite.

So the standards process itself was a first crucial benefit to Portico. It was not just that there were lots of others in the room where it happened. Many of these people, besides representing their own institutions, brought broad-based, in-depth experience with many publisher vocabularies and practices from beyond their institution. Often their institutions were themselves aggregators of content from many sources, in many vocabularies. As the Working Group meeting notes reflect, these participants, as did Portico, brought to the working group meetings specific examples of content that raised various issues, both philosophic and pragmatically gritty (boilerplate text versus implied content, reference tagging, semantic significance of display elements such as bold and italic). These concrete examples challenged the working group in developing and refining not just the tag set, but also well-articulated rationales for the choices made, and rich documentation and detailed examples for guidance and use.

The NLM DTD has modularization capabilities and a recommended process for creating a custom profile of the tag set. In earlier years, Portico used a customization of the earlier versions of the NLM DTD to implement normalization policies that had not yet been incorporated into whatever was the current version of the public tag set [Morrissey, Meyer, Bhatterai et al.]. For example, Portico transformations generate (sometimes boilerplate) text or punctuation, titles, and labels that are only implicit in publisher markup, but that appear in the display version of an article. As this is Portico-generated, rather than publisher-supplied, content, there was a need for some form of markup to make the origin of that content explicit, wherever in the document it occurs. Portico created its customization, but at the same time shared its specific use cases and examples with the working groups. Portico currently uses the latest version of the JATS DTD without any customization to express these and other use cases.
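The idea of marking generated content so that its origin stays explicit can be sketched in a few lines of Python. The example below inserts separator punctuation between keyword elements and wraps each separator in an `x` element, the JATS element for generated text and punctuation. Everything else is an assumption made for illustration: the function names are hypothetical, the sample keyword group is invented, the sketch does not check whether `x` is valid at a given location in any particular version of the tag set, and it is not Portico's actual transformation (which the paper describes as XSL-based).

```python
import xml.etree.ElementTree as ET

GENERATED_TAG = "x"  # JATS element for generated text and punctuation


def add_generated_separators(group: ET.Element, item_tag: str, separator: str) -> None:
    """Insert a marker element holding `separator` after each item except the last."""
    items = [child for child in group if child.tag == item_tag]
    for item in items[:-1]:
        marker = ET.Element(GENERATED_TAG)
        marker.text = separator
        # Recompute the position each time, since earlier inserts shift indexes.
        position = list(group).index(item)
        group.insert(position + 1, marker)


def strip_generated(root: ET.Element) -> None:
    """Remove every marker element, leaving only the supplied content."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == GENERATED_TAG:
                parent.remove(child)


if __name__ == "__main__":
    # Hypothetical publisher-supplied markup with no explicit separators.
    source = "<kwd-group><kwd>preservation</kwd><kwd>JATS</kwd><kwd>standards</kwd></kwd-group>"
    group = ET.fromstring(source)
    add_generated_separators(group, "kwd", "; ")
    print(ET.tostring(group, encoding="unicode"))
    print("".join(group.itertext()))
```

Because the generated punctuation lives in its own element, a display rendering can include it while any later process can remove it and recover exactly what the publisher supplied.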

Others besides Portico, of course, also use one or another of the NLM/JATS/BITS tag sets. Of those 1079 formats processed by and archived in Portico, 420 (about 39%) are from that format family. While this content still requires some amount of normalization, the effort to create normalizing XSL transforms is considerably less than for content in other vocabularies. Typically, configuring normalization tools for header-only content in the JATS family takes one-quarter to one-half the time needed for proprietary header-only content. Configuring tools for full-text articles marked up with JATS takes one-half to two-thirds the time needed for proprietary full-text content. These savings in resources are extremely important for the sustainability of the not-for-profit Portico archive over the very long term.

Though not, strictly speaking, a markup issue, the community of practice and discussion fostered by the development of NLM and JATS, including with such groups as JATS4R, has in turn fostered an ecosystem of standards (such as those for the “packaging” of supplementary journal article materials) and practices (especially for correct and consistent use of persistent identifiers for digital objects such as articles and data sets, for authors, for funders, for institutions) – all crucial components for creating and maintaining the accessible, discoverable, navigable, and, in many cases, machine-actionable context that ensures digital artifacts will retain their meaning over the very long term.


Few will be surprised to hear that no tag set, however elegantly crafted, by however broad-based and well-informed a consensus, solves all problems in the interchange of information between the systems of any two or more institutions.

There is inevitable variation in the use of these tag sets. The effort described above to make JATS-to-JATS transformations at Portico is a sure indication that we have not closed the interoperability gap between syntax and semantics. [Morrissey, Meyer, Bhatterai et al.]

While the JATS community certainly fosters best practices of many sorts, it can neither ensure nor enforce them. We still see tag abuse. We still see content – including JATS content – that is not well-formed and valid. Somewhat surprisingly to us, roughly the same percentage (15-20%) of XML content is in some way defective whether it comes from providers who moved their processes to JATS or from those who did not.
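A first-pass screen for this kind of defect can be sketched with the Python standard library. The example below tests only well-formedness and reports the share of documents that fail. Checking validity against a DTD would need a validating parser, which the standard library's ElementTree is not. The function names and the four sample documents are hypothetical, and this is not a description of Portico's quality-control tooling.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML (no DTD validation)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def defect_rate(documents: list[str]) -> float:
    """Return the fraction of documents that are not well-formed."""
    if not documents:
        return 0.0
    defective = sum(1 for text in documents if not is_well_formed(text))
    return defective / len(documents)


if __name__ == "__main__":
    # Hypothetical samples: two well-formed, two with common defects.
    samples = [
        "<article><front/><body><p>Text</p></body></article>",
        "<article><front><article-meta/></front></article>",
        "<article><front></article>",        # mismatched end tag
        "<article><p>a & b</p></article>",   # unescaped ampersand
    ]
    print(f"defective: {defect_rate(samples):.0%}")
```

A screen like this catches only syntactic defects. Tag abuse, where markup is well-formed and valid but semantically wrong, still requires inspection against the intended meaning of the elements.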

JATS is a living, evolving standard. Its tradition of broad-based, inclusive, responsive development entails its own challenges, characteristic, as Andrew Russell has described [Russell], of open standards development. He describes as well the characteristic potential entropy of such processes. Tommie Usdin has described [Usdin] the particular pressures this has placed on the development of the JATS tag sets, as well as guidelines and guidance for constructively meeting these pressures. This process makes for some complications. Portico is receiving content from publishers whose content conforms to a now superseded draft version (JATS V1.2D1) of the next, upcoming release of JATS. Later draft versions are incompatible with that earlier draft, as V1.2 is also expected to be when released.

Though JATS is elegant, flexible, and expressive, working natively in it is apparently still not accessible to what the library community refers to as the long tail of small, specialized, and typically non-STM scholarly journals. It is a markup scheme developed by and large out of the experience and practice of relatively large-scale, sophisticated publishers. Uptake by long-tail publications would require easily accessible, broadly used, and likely free authoring tools that natively produce JATS.

Be in the room where it happens

As we have described, the articulation and on-going refinement of the NLM/JATS family of tag sets by inclusive process, with broadly and deeply experienced participants, has been crucial to Portico’s mission to preserve digital scholarly artifacts for the long term.

Firmly committed to on-going active participation in this community, Portico hopes that sharing its experience and practice can help others in the community, as we have been helped and enriched by participation.

It has been our experience that we all benefit when all who wish to be are in the room where JATS happens.


Author's keywords for this paper: Digital Preservation; JATS; NLM; Standards; Portico