Quality Control of PMC Content: A Case Study

Christopher Kelly and Jeff Beck

National Center for Biotechnology Information, National Library of Medicine, US National Institutes of Health

beck@ncbi.nlm.nih.gov

Presented at Balisage, Montreal, QC, August 6, 2012

Brief Intro to PMC

PMC is the US National Library of Medicine's electronic archive of full-text journal literature.

Content is stored in XML at the article level and is displayed dynamically from the archival XML each time a user retrieves an article.

Browse Issues

Tables of Contents

And read the article

Or come from PubMed

Workflow picture

Piece of cake!

Participation by publishers is voluntary, although we require that the content be submitted in SGML or XML.

So, publishers send XML

and we have all of these fancy XML tools

Our jobs are easy!

History

Syd Bauman taught us

That XML can be well-formed

  ... and valid

    ... and make sense

      ... and not be true

or in our case not represent the article it is supposed to represent.

Bauman, Syd. (2010) "The 4 Levels of XML Rectitude", Balisage 2010, poster.

Editorial Comment: Best Balisage poster ever.

Leveraging XML

  1. We use XML tools to check incoming files for well-formedness and validity
  2. We've defined a preferred XML Tagging Style and developed a set of Stylesheets to test against it
    • This helps us test for Sensibility on some level.
    • <xref ref-type="fig"> points to a <fig>
    • <article article-type="correction"> has a <related-article related-article-type="corrected-article"/>
  3. We have a regression-testing system to make sure that changes to our ingest XSL stylesheets will not break the articles already in the database if they need to be reconverted.
  4. But we still need to get eyes on the articles.

    There is no XML test or tool for Veracity.
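The two Sensibility checks listed above can be sketched in a few lines. This is only an illustration, not PMC's actual stylesheets, which are written in XSL: the function name `check_style` is hypothetical, it uses Python's standard `xml.etree.ElementTree`, it tests well-formedness only (not DTD validity), and it assumes each `rid` holds a single ID.

```python
import xml.etree.ElementTree as ET


def check_style(xml_text):
    """Return a list of style problems found in one article; empty means none found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Level one: the file is not even well-formed.
        return [f"not well-formed: {err}"]

    problems = []
    # Map every id in the document to the name of the element that carries it.
    ids = {el.get("id"): el.tag for el in root.iter() if el.get("id")}

    # An <xref ref-type="fig"> should point to a <fig>.
    for xref in root.iter("xref"):
        if xref.get("ref-type") == "fig":
            rid = xref.get("rid")
            if ids.get(rid) != "fig":
                problems.append(f'xref rid="{rid}" does not point to a <fig>')

    # A correction should carry a link to the article it corrects.
    if root.tag == "article" and root.get("article-type") == "correction":
        links = [
            ra for ra in root.iter("related-article")
            if ra.get("related-article-type") == "corrected-article"
        ]
        if not links:
            problems.append("correction has no corrected-article link")

    return problems
```

A clean result says only that the tagging hangs together. It says nothing about whether the figure or the corrected article is the right one.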

The samples in this presentation

... are real, but the XML has been changed to protect me.

They are also all well-formed and valid.

Sample article XML 1

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article SYSTEM "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article article-type="example">
    <front>
        <journal-meta>
            <journal-id journal-id-type="nlm-ta">J Example Studies</journal-id>
            <issn>1111-XXXX</issn>
        </journal-meta>
        <article-meta>
            <title-group>
                <article-title>Good Science Info for You</article-title>
            </title-group>
            <contrib-group>
                <contrib>
                    <name>
                        <surname>Snap</surname>
                        <given-names>Ginger P</given-names>
                    </name>
                </contrib>
                <contrib>
                    <name>
                        <surname>House</surname>
                        <given-names>Toul</given-names>
                    </name>
                </contrib>
            </contrib-group>
            <pub-date>
                <month>03</month>
                <year>2012</year>
            </pub-date>
            <volume>12</volume>
            <issue>14</issue>
            <fpage>155</fpage>
            <lpage>159</lpage>
        </article-meta>
    </front>
</article>
        

Sample article XML 2

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article SYSTEM "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article article-type="example">
    <front>
        <journal-meta>
            <journal-id journal-id-type="nlm-ta">J Example Studies</journal-id>
            <issn>1111-XXXX</issn>
        </journal-meta>
        <article-meta>
            <title-group>
                <article-title>Good Science Info for You</article-title>
            </title-group>
            <contrib-group>
                <contrib>
                    <name>
                        <surname>Taylor</surname>
                        <given-names>Katy Rose</given-names>
                    </name>
                </contrib>
                <contrib>
                    <name>
                        <surname>Hamelers</surname>
                        <given-names>Audrey</given-names>
                    </name>
                </contrib>
            </contrib-group>
            <pub-date>
                <month>03</month>
                <year>2012</year>
            </pub-date>
            <volume>12</volume>
            <issue>14</issue>
            <fpage>160</fpage>
            <lpage>164</lpage>
        </article-meta>
    </front>
</article>
        

Sample article XML 3

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article SYSTEM "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article article-type="example">
    <front>
        <journal-meta>
            <journal-id journal-id-type="nlm-ta">J Example Studies</journal-id>
            <issn>1111-XXXX</issn>
        </journal-meta>
        <article-meta>
            <title-group>
                <article-title>Good Science Info for You</article-title>
            </title-group>
            <contrib-group>
                <contrib>
                    <name>
                        <surname>Waters</surname>
                        <given-names>Roger</given-names>
                    </name>
                </contrib>
                <contrib>
                    <name>
                        <surname>Gilmour</surname>
                        <given-names>David</given-names>
                    </name>
                </contrib>
                <contrib>
                    <name>
                        <surname>Wright</surname>
                        <given-names>Rick</given-names>
                    </name>
                </contrib>
                 <contrib>
                    <name>
                        <surname>Best</surname>
                        <given-names>Pete</given-names>
                    </name>
                </contrib>
           </contrib-group>
            <pub-date>
                <month>03</month>
                <year>2012</year>
            </pub-date>
            <volume>12</volume>
            <issue>14</issue>
            <fpage>165</fpage>
            <lpage>168</lpage>
        </article-meta>
    </front>
</article>
        

The TOC

Good Science Info for You
Ginger P Snap and Toul House
J Example Studies 2012, 12(14): 155–159.

Good Science Info for You
Katy Rose Taylor and Audrey Hamelers
J Example Studies 2012, 12(14): 160–164.

Good Science Info for You
Roger Waters, David Gilmour, Rick Wright, and Pete Best
J Example Studies 2012, 12(14): 165–168.

Cut and Paste

It really happens ... still
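A script cannot tell which of the three articles, if any, really carries that title, but it can at least point a reviewer at the suspects. The sketch below is a hypothetical helper (`find_repeated_titles` is not part of any PMC tool) that groups titles repeated within one volume and issue, assuming the metadata sits at `front/article-meta` as in the samples.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_repeated_titles(articles):
    """Group article titles that occur more than once within the same volume and issue.

    `articles` is an iterable of (label, xml_text) pairs; the label can be
    any name for the article, such as a filename.
    """
    seen = defaultdict(list)
    for label, xml_text in articles:
        meta = ET.fromstring(xml_text).find("front/article-meta")
        if meta is None:
            continue
        title_el = meta.find("title-group/article-title")
        title = "".join(title_el.itertext()) if title_el is not None else ""
        # Normalize case and whitespace so trivial differences do not hide a repeat.
        normalized = " ".join(title.split()).lower()
        key = (meta.findtext("volume"), meta.findtext("issue"), normalized)
        seen[key].append(label)
    # Keep only the titles shared by two or more articles.
    return {key: labels for key, labels in seen.items() if len(labels) > 1}
```

Legitimately repeated titles (an issue with several pieces called "Editorial", say) would be flagged too, so the output is a list for someone to look at, not a list of errors.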

Early PMC vs Current PMC

In the early days, all participants had XML or SGML that had been created for some other purpose. We were simply going to reuse it, because reuse is one thing that marked-up content promised us.

Now over 70% of the content coming to PMC is in JATS (NLM DTD).

Two Phases of PMC Participation

Evaluation

Because we convert incoming SGML or XML to our article model for loading to PMC, we need to find out, before a publisher starts sending us content, whether we can map their article model to ours.

Of course, we use XML tools to check for well-formedness and validity

But we have to put eyes on these articles, because we've seen things like ...

A Generic Sample Article

Publisher site HTML

<html>            
    <head>
        <title>My Article</title>
    </head>
    <body>
        <p><font size="-1"><i>J Example Studies</i> <b>12</b>(14):155-159.</font></p>
        <p>
            <b>
                <font size="+4">Good Science Info for You</font>
            </b>
        </p>
        <p>
            <i>Ginger P Snap PhD and Toul House, PhD</i>
        </p>
        <p>
            <b>Abstract</b>
        </p>
        <p><b>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </b></p>
        <p>
            <b>Introduction</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>
            <b>Materials and Methods</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>
            <b>Results</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>
            <b>Discussion</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        
    </body>
</html>       
            
        

Article submitted in JATS XML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" 
"http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article>
    <front>
        <journal-meta>
            <journal-id/>
            <journal-title-group>
                <journal-title>J Example Studies</journal-title>
            </journal-title-group>
            <issn/>
        </journal-meta>
        <article-meta>
            <title-group>
                <article-title>Good Science Info for You</article-title>
            </title-group>
            <pub-date>
                <year>2012</year>
            </pub-date>
        </article-meta>
    </front>
    <body>
    <p>
        <![CDATA[
   <html>
    <head>
        <title>My Article</title>
    </head>
    <body>
        <p><font size="-1"><i>J Example Studies</i> <b>12</b>(14):155-159.</font></p>
        <p>
            <b>
                <font size="+4">Good Science Info for You</font>
            </b>
        </p>
        <p>
            <i>Ginger P Snap PhD and Toul House, PhD</i>
        </p>
        <p>
            <b>Abstract</b>
        </p>
        <p><b>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </b></p>
        <p>
            <b>Introduction</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>
            <b>Materials and Methods</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>
            <b>Results</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>
            <b>Discussion</b>
        </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis
            faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec
            mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis </p>
        
    </body>
</html>       
]]>
    </p>
    </body>
</article>
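A parser hands the contents of a CDATA section back as ordinary text, so a file like this passes validation untouched. One rough way to flag it for a reviewer is to count tag-like strings inside text content. The sketch below is an assumption-laden heuristic, not a PMC tool: the name `escaped_markup`, the regular expression, and the default threshold of 3 are all illustrative choices.

```python
import re
import xml.etree.ElementTree as ET

# Matches text that looks like a start or end tag, e.g. "<p>", "</html>", '<font size="-1">'.
TAG_LIKE = re.compile(r"</?[A-Za-z][A-Za-z0-9:-]*(\s[^<>]*)?>")


def escaped_markup(xml_text, threshold=3):
    """Return tag names of elements whose own text holds at least `threshold` tag-like strings."""
    root = ET.fromstring(xml_text)
    flagged = []
    for el in root.iter():
        # An element's own text is its leading text plus the tails of its children.
        chunks = [el.text or ""] + [child.tail or "" for child in el]
        hits = sum(1 for chunk in chunks for _ in TAG_LIKE.finditer(chunk))
        if hits >= threshold:
            flagged.append(el.tag)
    return flagged
```

The threshold keeps a stray "<b>" in running text from raising a flag. An article that quotes markup on purpose would still be flagged, which is acceptable when the output only decides what a person looks at first.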
   

Same article in proprietary XML

<!DOCTYPE article SYSTEM "ourdtd.dtd">
<article>

J Example Studies 12(14):155-159.

Good Science Info for You

Ginger P Snap PhD and Toul House, PhD

Abstract

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Introduction

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Materials and Methods

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Results

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Discussion

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam ut mauris vel turpis faucibus porta. Nunc eleifend blandit dolor, at placerat tellus fermentum sed. Sed nec mi eget risus pretium facilisis. Aenean orci mauris, scelerisque a sagittis 
</article>
		
        

But it is well-formed and valid!

And at least they shipped us their DTD.

<!ELEMENT article   (#PCDATA) >
		
        

Documentation not necessary.

Production

In Eval, every article from the sample set is checked by XML tools and by eye.

We don't have the staff to check every article once a journal has moved into production, but we've built a system to manage the QA work.

Once an article clears ingest, it moves into this system.

QA System Dashboard

Article Errors List

Errors Totals

Batch Errors

Batch Error Report

The QA System

Allows us to quantify errors found in batches

Creates those nasty Word (well, almost Word) reports that we send out.

Reduces the level of expertise in XML needed to do QA.
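The slides do not show how the QA system stores its findings, so the following is only a guess at the kind of tallying involved. It assumes reviewers record each finding as an (article id, error type) pair; the function names and the error types used with them are hypothetical.

```python
from collections import Counter


def summarize_batch(findings):
    """Count QA findings per error type and per article.

    `findings` is a list of (article_id, error_type) pairs recorded by reviewers.
    """
    by_type = Counter(error_type for _, error_type in findings)
    by_article = Counter(article_id for article_id, _ in findings)
    return by_type, by_article


def report_lines(batch_name, findings):
    """Build the plain-text lines of a batch error report, most frequent error first."""
    by_type, by_article = summarize_batch(findings)
    lines = [f"Batch {batch_name}: {len(findings)} errors in {len(by_article)} articles"]
    for error_type, count in by_type.most_common():
        lines.append(f"  {error_type}: {count}")
    return lines
```

Because reviewers only pick an error type from a list, they do not need to read the XML to contribute to the report.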

Conclusions

Even though it was not the intent when PMC was created, we have built an XML publishing system

where we can't trust the content that is being sent to us.

Even with the power of XML in the palm of our hands, we still need to get eyes on articles.

Because there is no XML test or tool for Veracity.

Thank you