Quality Control of PMC Content: A Case Study
International Symposium on Quality Assurance and Quality Control in XML
August 6, 2012
PubMed Central [PMC01] is the U.S. National Institutes
of Health's free digital archive of full-text biomedical and life sciences journal
literature. Content is stored in XML at the article level and is displayed dynamically from the
XML each time that a user retrieves an article.
PubMed Central was started in 1999 to allow free full-text access to journal articles.
Participation by journals is voluntary. From the beginning there has always been a requirement
that participating journals provide their content to NCBI marked up in some "reasonable"
XML format [Beck01] along with the highest-resolution
images available, PDF files (if available), and all supplementary material. Complete details of
PMC's file requirements are available [PMC02].
The Promise of Marked-up Content
Building a full-text journal article repository seemed like a pretty straightforward
task when PMC was conceived in 2000. After all, participating journals already had their content
in SGML or XML that they were putting online or sending to a vendor to be put online. All we
would need to do would be to get the marked-up content, render it in PMC, and store it.
But we soon found out that the content we were receiving was not of the same quality as
the articles that had been printed or even as the articles showing on the journals' websites.
This seemed odd; the SGML was created for the online product, and this SGML was delivered to
PMC for the archive. At the turn of the century, XML-first publishing workflows (where
articles are authored in XML or converted to XML as soon as they are submitted to the journal
and then processed as XML) were not the norm.
The workflow that we encountered generally went something like this:
1. The article is written, submitted, peer reviewed, revised, and accepted as a Word
   document (or more specifically, a printout of a Word document).
2. The accepted article is copyedited in the Word file (to capture the author's keystrokes).
3. The copyedited Word file is ingested into a typesetting system and made into pages.
4. Any changes in the author proof cycle are made in the typesetting system under the
   direction of copyeditors or proofreaders.
5. The typesetting files are built into issues, checked by copyeditors or proofreaders, and
   sent for printing.
6. The typesetting files are then converted to SGML/XML - generally to a model defined by
   the online service.
7. The SGML/XML files are converted to HTML, and the HTML is checked for errors.
8. The HTML is corrected so that the online article represents the printed article.
9. The SGML/XML files are stored away - never to be thought of again.
Things were going pretty well in that workflow up through item 5. The author's keystrokes
had been used from the original word processing files, reducing errors that might
have been introduced by rekeying the article; a person familiar with the content (usually at
least two of a copyeditor, proofreader, or author) checked the files and any changes at each
stage until the article was sent off to press.
At this point the article is out of the hands of the content folks and into the hands of the
converters. In step 6, the typesetting files used as the source for the transformation
contained some information about the structure of the article, but it certainly was not enough
to build a proper SGML/XML representation of the article. Assumptions were made during the
transformation, and a surprising amount of hand tagging - copy and paste - was done in
the files. The problem areas are not surprising: metadata, tables, and math.
From the PMC point of view, the real tragedy in this (generalized and non-specific)
workflow happens at step 8, when the files are reviewed by a proofreader but corrections are
made to the output HTML files and not back in the source SGML files. Once we started
processing those SGML files in the young PMC system, we took a close look at the output
rendering of the files to identify any problems with the way our system was handling them.
We started to find a good number of problems with the supplied SGML that were not in the
HTML version available on the web - and a surprising number that were incorrect on the web as well.
Unfortunately we did not count the errors that we encountered at this point. We simply corrected
them, reported them to the publisher so that any that also appeared on their site could be
fixed, and moved on to the next article or issue. This is when we knew we would have to have
a team of content specialists checking all of the content we get in PMC.
PMC Ingest Workflow
The PMC processing model has been addressed in detail previously [Beck01]. Briefly, it is diagrammed in Fig. 1. For
each article, we receive a set of files that includes the text in SGML or XML, the highest
resolution figures available, a PDF file if one has been created for the article, and any
supplementary material or supporting data. The text is converted to the current version of the
NISO Archiving and Interchange Tag Set [JATS01], and the
images are converted to a web-friendly format. The source SGML or XML, original images,
supplementary data files, PDFs, and NLM XML files are stored in the archive. Articles are
rendered online using the NLM XML, PDFs, supplementary data files, and the web-friendly images.
Fig. 1: PMC Processing Model
For purposes of this paper, we will concentrate on the text processing.
PMC Philosophy of Text Processing
There are four main principles to the PMC philosophy of text processing.
First, we expect to receive marked-up versions of an article that are well-formed, valid,
and accurately represent the article as it was published. We do not go so far as to require
that the content be true or correct as Syd Bauman illustrates [Bauman01], merely that the article represents the version of record.
The question of what is the "Version of Record" is left up to the publisher. It may be a
printed copy, a PDF version, or the journal's website.
Second, unlike the early days of PMC as described above, we do not correct articles or files. That
is, we will not fix something that is wrong in the version of record nor will we make a
correction to an XML file. All problems found in either processing or QA of files are reported
back to the publisher to be corrected and resubmitted.
Thirdly, the goal of PMC is to represent the content of the article and not the formatting
of the printed page, the PDF, or the journal's website.
And finally, we must run a Quality Assessment on content coming into PMC to ensure that
the content in PMC accurately reflects the article as it was published. Our QA is a
combination of automated checks and manual checking of articles. To help ensure that the
content we are spending time ingesting to PMC is likely to be worthwhile, journals must go
through an evaluation process before they can send content to PMC in a regular production workflow.
The Data Evaluation Process
The data evaluation process has been described previously [Beck01], but it is integral to ensuring the quality of content that is being
submitted to PMC.
Journals joining PMC must pass two tests. First, the journal must pass PMC's scientific
quality standard, which means that the journal must be approved for the NLM collection by
NLM's Selection and Acquisitions section [NLM01]. This
check ensures that the journal's content is "in scope" for a medical library and is of
sufficient scientific quality for the archive.
Next, the journal must go through a technical evaluation to "be sure that the journal can
routinely supply files of sufficient quality to generate complete and accurate articles
without the need for human action to correct errors or omissions in the data." [PMC03]
For the data evaluation, a journal supplies a set of sample files for approximately
articles. These files are put through a series of automated and human checks to ensure that
the XML is valid, accurately reflects the publication model of the journal (publication
for example) and is a complete and accurate representation of the published article. There is
a baseline set of "Minimum Data Requirements" that must be met before the evaluation proceeds
to the more human-intense content accuracy checking [PMC04]. These minimum criteria are listed briefly below:
Each sample package must be complete: all required data files (XML/SGML, PDF if
available, image files, supplementary data files) for every article in the package must
be present and named correctly.
All XML files must conform to an acceptable journal article DTD.
All XML/SGML files must parse according to their DTD.
Regardless of the XML/SGML DTD used, the following metadata information must be
present and tagged with correct values in every sample file:
Journal ISSN or other unique Journal ID
Copyright statement (if applicable)
License statement (if applicable)
Issue number (if applicable)
Pagination/article sequence number
Issue-based or Article-based publication dates. Articles submitted to PMC must
contain publication dates that accurately reflect the journal’s publication model.
All image files for figures must be legible, and submitted in high-resolution TIFF
or EPS format, according to the PMC Image File Requirements.
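Several of these minimum requirements lend themselves to an automated first pass. The following is a minimal sketch, not PMC's actual tooling: the function name `check_sample`, the `REQUIRED_METADATA` mapping, and the JATS-style element names it looks for are assumptions made for illustration. It checks only package completeness, well-formedness, and the presence of a few metadata items; validation against a DTD and the correctness of the values are outside what it can do.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a required metadata item to element paths that
# could carry it. The element names assume JATS-style tagging.
REQUIRED_METADATA = {
    "journal ISSN or other journal ID": [".//issn", ".//journal-id"],
    "pagination or article sequence number": [".//fpage", ".//elocation-id"],
    "publication date": [".//pub-date"],
}


def has_content(element):
    """True if the element carries text or child elements."""
    return bool((element.text or "").strip()) or len(element) > 0


def check_sample(xml_text, files_present, files_expected):
    """Return a list of problems found in one sample article package."""
    problems = []

    # Package completeness: every expected file must be present.
    for name in sorted(set(files_expected) - set(files_present)):
        problems.append(f"missing file: {name}")

    # Well-formedness only; DTD validation is not attempted here.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        problems.append(f"XML does not parse: {exc}")
        return problems

    # Presence of required metadata (not correctness of the values).
    for label, paths in REQUIRED_METADATA.items():
        found = any(has_content(el) for p in paths for el in root.findall(p))
        if not found:
            problems.append(f"missing metadata: {label}")
    return problems
```

A package that returns an empty list has cleared only this mechanical pass; it would still need the human review described in this section.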
These seem like simple and obvious things - XML files must be valid - but the minimum data
requirements have greatly reduced the amount of rework that the PMC Data Evaluation team has
to do. Certainly it helps to be explicit about even the most obvious of things.
During the data evaluation, PMC reviews 100% of the sample articles by eye.
The PMC production team is made up of a set of Journal Managers who are responsible for
the processing and checking of content for the journal titles that are assigned to them. They
use a combination of automated processing checks and manual checking of articles to
ensure the accuracy of the content in PMC.
We can leverage the fact that all content is sent to us in SGML or XML to eliminate
article files that are not well-formed (if XML) or valid to the model they claim to be valid
against. 100% of not-well-formed or invalid files are returned to the provider to be corrected.
The PMC Style Checker [Beck01] is used during the ingest process to ensure that all content flowing
into PMC is in the PMC common XML format for loading to the database. The errors reported by
the Style Checker provide us with a level of automated checking on the content itself that
can highlight problems, but it only goes so far. For example, the Style Checker can tell you
if an electronic publication date is tagged completely to PMC Style (contains values in year,
month, and day elements) in a file, but it can't tell you if the values themselves are correct
and actually represent the electronic publication date of the article.
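The limit of this kind of check can be illustrated with a short sketch. It is not the PMC Style Checker; the function name is hypothetical, and it assumes JATS-style tagging in which the electronic publication date is a `pub-date` element with `pub-type="epub"` containing `year`, `month`, and `day` children.

```python
import xml.etree.ElementTree as ET


def epub_date_is_complete(article_xml):
    """True if an epub pub-date has non-empty year, month, and day."""
    root = ET.fromstring(article_xml)
    for pub_date in root.iter("pub-date"):
        if pub_date.get("pub-type") != "epub":
            continue
        # Structural completeness only: each part must be present and non-empty.
        return all(
            (pub_date.findtext(part) or "").strip()
            for part in ("year", "month", "day")
        )
    return False  # no electronic publication date tagged at all


# A date that is structurally complete but almost certainly wrong.
suspicious = (
    '<article><pub-date pub-type="epub">'
    "<day>14</day><month>10</month><year>1066</year>"
    "</pub-date></article>"
)
result = epub_date_is_complete(suspicious)
```

Here `result` is `True`: the tagging is complete, so the check passes, even though a person comparing the file with the version of record would immediately question the year.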
PMC also has a series of automated data integrity checks that run once content is loaded
to the database. These can identify, among other things, problems like duplicate articles being
submitted to the system, and potential discrepancies in issue publication dates for
a group of articles in the same issue.
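The two integrity checks named above can be sketched as follows. This is an illustration only: the record fields (`doi`, `journal`, `volume`, `issue`, `issue_date`) and the choice of the DOI as the duplicate key are assumptions, since the paper does not describe how the PMC database identifies duplicates.

```python
from collections import Counter, defaultdict


def find_duplicates(articles):
    """Return identifiers that were loaded more than once (DOI assumed as key)."""
    counts = Counter(a["doi"] for a in articles)
    return sorted(doi for doi, n in counts.items() if n > 1)


def issues_with_mixed_dates(articles):
    """Return issues whose articles disagree on the issue publication date."""
    dates = defaultdict(set)
    for a in articles:
        dates[(a["journal"], a["volume"], a["issue"])].add(a["issue_date"])
    return {key: sorted(vals) for key, vals in dates.items() if len(vals) > 1}


# Hypothetical loaded records, for illustration.
loaded = [
    {"doi": "10.1000/a1", "journal": "J", "volume": "12", "issue": "3",
     "issue_date": "2012-08-01"},
    {"doi": "10.1000/a2", "journal": "J", "volume": "12", "issue": "3",
     "issue_date": "2012-09-01"},
    {"doi": "10.1000/a1", "journal": "J", "volume": "12", "issue": "3",
     "issue_date": "2012-08-01"},
]
duplicates = find_duplicates(loaded)
mixed = issues_with_mixed_dates(loaded)
```

Both results are only flags for a Journal Manager to follow up; a date discrepancy may turn out to be legitimate for some publication models.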
Early in the PMC project the questions of "what and how much to check?" were left to the
discretion of the Journal Manager. But we soon found out that certain things had to be checked
more frequently. All articles with marked-up math (in MathML or LaTeX) get a close look
because the quality of marked-up math has been one of the problem areas since the beginning of
the project. In addition, all published corrections and retractions need to be checked
closely to ensure that the tagging provided in the source XML allows PMC to build links
between the correction/retraction and the article being corrected/retracted. We also check
every PDF that we generate for manuscripts being tagged for the NIH Public Access Project [NIH01].
As the project began to grow, both in terms of the amount of content coming into PMC and
the number of Journal Managers required to manage the content, it became obvious that we
needed to quantify the QA process. First, the production team compiled a "Content QA
checklist" which was a collection of some of the most common (and serious) errors found
over the years. Next, we built a system that shows each Journal Manager the journals
assigned to him or her, and which articles need to be checked (Fig. 2). The system
selects a percentage of articles from each "batch" of new content deposited by the journal as
they are processed and loaded to the PMC database. The number of articles in each batch is
determined by the publication model and participation mode of the journal itself. For example,
journals which deposit a whole issue at a time into PMC will normally have a "batch"
consisting of all the articles in the issue. Journals which publish on a continuous basis
and send content to PMC article-by-article will have batches which reflect how much content was
deposited during a given day.
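The batching and selection just described might be sketched like this. The function names, the record fields, the use of simple random sampling, and the rule that at least one article is always selected from a non-empty batch are all assumptions for illustration; the paper does not specify how the PMC system picks articles within a batch.

```python
import math
import random
from collections import defaultdict


def group_into_batches(deposits, issue_based):
    """Group deposited articles by issue, or by deposit day for continuous journals."""
    batches = defaultdict(list)
    for d in deposits:
        key = d["issue"] if issue_based else d["deposit_date"]
        batches[key].append(d["article_id"])
    return dict(batches)


def select_for_qa(article_ids, percentage, rng=None):
    """Pick roughly `percentage` percent of a batch for manual checking."""
    if not article_ids:
        return []
    rng = rng or random.Random()
    # Assumption: always check at least one article from a non-empty batch.
    count = max(1, math.ceil(len(article_ids) * percentage / 100))
    return sorted(rng.sample(article_ids, min(count, len(article_ids))))


# Hypothetical deposits, for illustration.
deposits = [
    {"article_id": "a1", "issue": "12(3)", "deposit_date": "2012-08-01"},
    {"article_id": "a2", "issue": "12(3)", "deposit_date": "2012-08-01"},
    {"article_id": "a3", "issue": "12(3)", "deposit_date": "2012-08-02"},
]
by_issue = group_into_batches(deposits, issue_based=True)
by_day = group_into_batches(deposits, issue_based=False)
```

With these deposits, an issue-based journal yields one batch of three articles, while a continuously publishing journal yields one batch per deposit day.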
While it would be nice to be able to check 100% of the articles being loaded into the
database, this is generally not reasonable because of resources. So, once a journal has
passed the tightly monitored data evaluation period, and has demonstrated over a period of
time after moving into production that it can provide generally error-free data, QA
is done on more of a spot-check basis. By default, new journals that come out of data evaluation
and into production are set with a higher threshold of articles selected for QA in the beginning.
Once PMC's Journal Manager is confident in the journal's ability to provide good data,
the article-selection percentages are lowered over time. Journals with a proven track record
of providing error-free data into PMC generally have a lower percentage of articles selected for
QA. If the Journal Manager begins to see problems on an ongoing basis, the percentage of
articles checked may be increased.
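In PMC this adjustment rests on the Journal Manager's judgment. Purely to make the feedback loop concrete, the sketch below encodes one hypothetical rule: lower the selection percentage by a fixed step when the recent error rate is within a tolerance, and raise it otherwise. The function name, the step, the floor, and the tolerance are all invented for illustration and are not PMC's values.

```python
def adjust_percentage(current, recent_error_rates,
                      floor=5, ceiling=100, step=10, tolerance=0.02):
    """Suggest a new QA selection percentage from recent batch error rates.

    `recent_error_rates` holds, per recent batch, the fraction of checked
    articles in which errors were found. All thresholds are hypothetical.
    """
    if not recent_error_rates:
        return current  # no evidence yet, leave the setting alone
    average = sum(recent_error_rates) / len(recent_error_rates)
    if average <= tolerance:
        # Proven track record: check fewer articles, but never below the floor.
        return max(floor, current - step)
    # Ongoing problems: check more articles, up to every article.
    return min(ceiling, current + step)
```

For example, `adjust_percentage(50, [0.0, 0.01])` suggests 40, while `adjust_percentage(50, [0.2])` suggests 60.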
Fig. 2: QA System Dashboard
The system includes the original list of standard errors that was compiled as the
"Content QA checklist"; this list has grown over the years as new errors are encountered.
Article Errors are grouped into eight major categories: Article Information, Article Body,
Back Matter, Figures and Tables, Special Characters and Math, Generic Errors, Image Quality,
and PDF Quality. Within each of these major categories, there may be one or more
sub-categories. For example, in the "Article Body" section there is a subcategory called
"Sections and Subsections", containing errors for missing sections, or sections that are
nested incorrectly in the flow of the body text. In Fig. 3 the errors specific
to Article Information are shown in the purple box. If a Journal Manager finds that
"Publication date is missing or does not match the version of record", then he or she
checks that box and that error is recorded for that article. Of course, there is always a
box available to allow a Journal Manager to enter an unforeseen error.
Fig. 3: Article Errors List
Processing can be done either on the issue level or on a group of articles that arrives
at PMC in a given time period. Each error is classified as "severe" or "normal" and
as a "PMC Error" or a "Data Error". See the purple box in Fig. 4 for an example of a
totaled batch of articles.
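One way to picture a recorded error and a batch total is the sketch below. The class and field names are hypothetical; only the two classifications ("severe" or "normal", "PMC Error" or "Data Error") and the category names come from the description above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleError:
    article_id: str
    category: str      # e.g. "Article Information"
    description: str
    severity: str      # "severe" or "normal"
    source: str        # "PMC Error" or "Data Error"


def total_batch(errors):
    """Count the errors in a batch by (severity, source)."""
    return Counter((e.severity, e.source) for e in errors)


# Hypothetical errors recorded for one batch.
batch_errors = [
    ArticleError("a1", "Article Information",
                 "Publication date is missing or does not match the version of record",
                 "severe", "Data Error"),
    ArticleError("a2", "Article Body", "Section nested incorrectly",
                 "normal", "Data Error"),
    ArticleError("a2", "Figures and Tables", "Table rendered incorrectly",
                 "normal", "PMC Error"),
]
totals = total_batch(batch_errors)
```

Keeping the error source separate from the severity makes it possible to report only the "Data Error" entries to the publisher while routing "PMC Error" entries internally.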
At any time, the Journal Manager can review the accumulated errors for a batch as shown in
Fig. 5. This report can be output in RTF format so it can be sent to the
publisher as a Word file (Fig. 6).
Fig. 6: Batch Error Report
The automated QA system has greatly improved both the quantity and the quality of the
work that is being done for the PMC project, but we find that even this long after
anniversary of the invention of XML there is still a need for manual review of content.
Fortunately we are able to sort out the deepest structural problems with XML tools, but
questions of accuracy of content still need to be addressed by eye.
This work was supported by the Intramural Research Program of the NIH, National Library of
Medicine, National Center for Biotechnology Information.
[PMC01] PubMed Central, http://www.ncbi.nlm.nih.gov/pmc/.
[Beck01] Beck, Jeff. “Report from the Field: PubMed Central, an XML-based
Archive of Life Sciences Journal Articles.” Presented at International Symposium on XML for
the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August
In Proceedings of the International Symposium on XML for the Long Haul: Issues in
the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6
[PMC02] How to Join PMC, http://www.ncbi.nlm.nih.gov/pmc/about/pubinfo.html.
[JATS01] NISO Journal Article Tag Suite (JATS), http://jats.nlm.nih.gov/archiving/.
[Bauman01] Bauman, Syd. (2010) "The 4 Levels of XML Rectitude", Balisage
[NLM01] NLM Collection Development and Acquisitions, http://www.nlm.nih.gov/tsd/acquisions/mainpage.html.
[PMC03] How to Join PMC, http://www.ncbi.nlm.nih.gov/pmc/about/pubinfo.html.
[PMC04] Minimum Criteria for PMC Data Evaluation Submissions, http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/mindatareq.pdf.
[NIH01] National Institutes of Health Public Access, http://publicaccess.nih.gov/.