Report from the Field: PubMed Central, an XML-based Archive of Life Sciences Journal Articles

International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, August 2, 2010

PubMed Central (PMC) is an XML-based archive of life sciences journal literature at the U.S. National Institutes of Health that allows public access to full-text journal articles. The archive was created in 2000 and has grown steadily to over 2 million records. The project has been successful in part because of the strict XML control and the flexibility that PMC gives its submitters. This paper gives an overview of the PMC data evaluation process; the XML processing model; the PMC philosophy toward XML use, including use of the NLM DTD, XML Tagging Style, usability or reusability of the XML, public XML tools, and our people; and some challenges we continue to face maintaining the archive.

Jeff Beck
Technical Information Specialist, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH)

Jeff has been involved in the PubMed Central project at NLM since 2000. He has been working in journal publishing since the early 1990s.

Author's contribution to the Work was done as part of the Author's official duties as an NIH employee and is a Work of the United States Government. Therefore, copyright may not be established in the United States. 17 U.S.C. § 105. If Publisher intends to disseminate the Work outside the U.S., Publisher may secure copyright to the extent authorized under the domestic laws of the relevant country, subject to a paid-up, nonexclusive, irrevocable worldwide license to the United States in such copyrighted work to reproduce, prepare derivative works, distribute copies to the public and perform publicly and display publicly the work, and to permit others to do so.

PubMed Central [PMC01] is the U.S. National Institutes of Health's free digital archive of full-text biomedical and life sciences journal literature. Content is stored in XML at the article level. Structures above the article (issues, volumes, etc.) are built as collections of articles. The content is displayed dynamically from the archival XML every time a user retrieves an article. In that respect, every request confirms that the archival copy is still functional. PMC contains over 2 million article records from over 1,000 titles, with 10,000-15,000 new records being added each month. On average, 600,000-700,000 articles are retrieved by 400,000 unique users each day [PMC02].

PubMed Central was started in 1999 to allow free full-text access to journal articles. Participation by journals is voluntary. From the beginning, participating journals have been required to provide their content to NCBI marked up in some "reasonable" SGML or XML format along with the highest-resolution images available, PDF files (if available), and all supplementary material. Complete details on PMC's file requirements are available [PMC03].

A discussion of what makes a "reasonable" SGML or XML format could be an interesting topic for another paper. For now, "reasonable" means that there is sufficient granularity in the source model to map those elements critical to the understanding of the article (and/or its functioning in the PMC system) from the original article to the appropriate place in the PMC XML model.

The Data Evaluation Process

Journals joining PMC must pass two tests. First, the content must be approved for the NLM collection [NLM01]. Essentially this is just a check to be sure that the content is "in scope" for a medical library. Next, the journal must go through a technical evaluation to "be sure that the journal can routinely supply files of sufficient quality to generate complete and accurate articles online without the need for human action to correct errors or omissions in the data." [PMC03]

For the evaluation, a journal supplies a sample set of articles (at least 50). These articles are put through a series of automated and human checks to ensure that the XML is valid and that it accurately represents the article content. There is a set of "Minimum Data Requirements" that must be met before the evaluation proceeds to the more human-intense content accuracy checking [PMC04]. These minimum criteria are listed briefly below:

1. Each sample package must be complete: all required data files (XML/SGML, PDF if available, image files, supplementary data files) for every article in the package must be present and named correctly.
2. All XML files must conform to an acceptable journal article DTD.
3. All XML/SGML files must parse according to their DTD.
4. Regardless of the XML/SGML DTD used, the following metadata information must be present and tagged with correct values in every sample file:
   - Journal ISSN or other unique Journal ID
   - Journal Publisher
   - Copyright statement (if applicable)
   - License statement (if applicable)
   - Volume number
   - Issue number (if applicable)
   - Pagination/article sequence number
   - Issue-based or Article-based publication dates. Articles submitted to PMC must contain publication dates that accurately reflect the journal’s publication model.
5. All image files for figures must be legible, and submitted in high-resolution TIFF or EPS format, according to the PMC Image File Requirements.
These seem like simple and obvious things (of course XML files must be valid), but the minimum data requirements have greatly reduced the amount of rework that the PMC Data Evaluation group has to do. It certainly helps to be explicit about even the most obvious of things.
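As an illustration of how a metadata requirement like the one above could be checked automatically, here is a minimal Python sketch. It is not PMC's evaluation code. It assumes the article is already in NLM-style XML with a `front` element, the `REQUIRED` mapping and the `missing_metadata` function are hypothetical names, and it only tests for the presence of non-empty elements. It does not validate against a DTD, which the standard library cannot do.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a requirement to the element names that can
# satisfy it. The element names follow the NLM Archiving and Interchange
# DTD; any other source DTD would need its own mapping.
REQUIRED = {
    "journal ID or ISSN": ["journal-id", "issn"],
    "publisher": ["publisher-name"],
    "volume": ["volume"],
    "pagination or sequence": ["fpage", "elocation-id"],
    "publication date": ["pub-date"],
}


def missing_metadata(xml_text):
    """Return the required items for which no non-empty element was found."""
    front = ET.fromstring(xml_text).find("front")
    if front is None:
        return sorted(REQUIRED)
    missing = []
    for label, tags in REQUIRED.items():
        # An item is satisfied if any candidate element has text content.
        found = any(
            "".join(el.itertext()).strip()
            for tag in tags
            for el in front.iter(tag)
        )
        if not found:
            missing.append(label)
    return sorted(missing)


SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">Example J</journal-id>
      <publisher><publisher-name>Example Press</publisher-name></publisher>
    </journal-meta>
    <article-meta>
      <pub-date pub-type="epub"><year>2010</year></pub-date>
      <volume>12</volume>
    </article-meta>
  </front>
</article>"""

if __name__ == "__main__":
    print(missing_metadata(SAMPLE))
```

The sample article has no `fpage` or `elocation-id`, so the check reports the pagination item as missing. Conditional items such as issue number or license statement are left out because "if applicable" cannot be decided from the file alone.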

PMC Processing Model

The PMC processing model is diagrammed in the figure "PMC Processing Model". For each article, we receive a set of files that includes the text in SGML or XML, the highest-resolution figures available, a PDF file if one has been created for the article, and any supplementary material or supporting data. The text is converted to the current version of the NLM Archiving and Interchange DTD [JATS01], and the images are converted to a web-friendly format. The source SGML or XML, original images, supplementary data files, PDFs, and NLM XML files are stored in the archive. Articles are rendered online using the NLM XML, PDFs, supplementary data files, and the web-friendly images.

[Figure: PMC Processing Model]

This processing model seems obvious now, but the content was not always handled this way. The early PMC processing model is shown in the figure "Original PMC Processing Model".

[Figure: Original PMC Processing Model]

The images, PDFs, and supplementary material were handled in much the same way as they are now, but they are not interesting for this paper, so I will concentrate on the text processing from now on. The SGML or XML was loaded into a database in its native format. When the article was requested by a user, the original text was pulled from the database and run through a piece of proprietary software to render it in HTML on the web. It is easy to see that this is not a scalable solution - especially if we were planning for any kind of success.

As we started getting more and more submitters, it became obvious that we needed to take the processing hit of conversion to a single text format up front and allow the database and rendering system to concentrate on only one type of XML (see the figure "Updated PMC Processing Model"). We created the pmc-1.dtd [PMC05], which was based on the first two DTDs in which content was submitted to PMC: keton.dtd, an SGML DTD somehow descended from ISO12083, and the BioMed Central article.dtd, an XML DTD that was very similar.
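The design choice described above, converting every submission to one format at load time so that rendering deals with a single model, can be sketched in Python as follows. This is only an illustration: the format names, the toy string-replacing converters, and the `ingest` and `render` functions are hypothetical, and the real PMC conversions are XSL transforms written per submission DTD.

```python
from typing import Callable, Dict

# Registry of converters keyed by a source-format name. The names and the
# toy converters below are hypothetical stand-ins for real transforms.
CONVERTERS: Dict[str, Callable[[str], str]] = {}


def converter(source_format: str):
    """Register a function that converts one source format to the common one."""
    def register(func):
        CONVERTERS[source_format] = func
        return func
    return register


@converter("publisher-a")
def from_publisher_a(text: str) -> str:
    return text.replace("<art>", "<article>").replace("</art>", "</article>")


@converter("publisher-b")
def from_publisher_b(text: str) -> str:
    return text.replace("<paper>", "<article>").replace("</paper>", "</article>")


def ingest(source_format: str, text: str, store: dict, key: str) -> None:
    """Convert once at load time; keep both the source and the common form."""
    if source_format not in CONVERTERS:
        raise ValueError(f"no conversion built for {source_format!r}")
    store[key] = {"source": text, "common": CONVERTERS[source_format](text)}


def render(store: dict, key: str) -> str:
    """The renderer only needs to understand the single common format."""
    common = store[key]["common"]
    return common.replace("<article>", "<div>").replace("</article>", "</div>")


if __name__ == "__main__":
    archive = {}
    ingest("publisher-a", "<art>Hi</art>", archive, "a1")
    print(render(archive, "a1"))
```

Adding a new submitter then means adding one converter. The storage and rendering code stays the same, which is the scalability argument made above.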

[Figure: Updated PMC Processing Model]

The pmc-1.dtd was written to simplify access to full-text articles online. It was a small and simple DTD that soon showed its limitations as we started to convert more submission formats into it. It grew quickly, and quickly grew unwieldy. At this point, work began on a new DTD that became the NLM Archiving and Interchange Tag Suite [Beck01].

XML Philosophy

Several aspects of our XML philosophy, which may at first seem contradictory, have led to the success of PMC and the manageability of the article ingest work. In general, we have found a balance between strictness and flexibility that allows us to keep a certain amount of control without being too restrictive on our submitters.

Some Strict Things

The first thing we are strict about is that we do a complete review of any new DTD in which content is being submitted, as described above. We do not take articles in HTML. We also do a complete review of sample articles for each new journal to be sure that the content provider is able to provide content that is structurally and semantically correct.

Another thing we are strict about is that all content must be valid according to the DTD in which it was submitted - not just during data evaluation but in the ongoing production process as well. This seems like an obvious one, but there was a surprising amount of controversy about this in the early days of PMC, and we still get invalid files. Problems usually arise now because the submitter has made a DTD change (as simple as adding a new character entity to the DTD) without telling us or sending an updated DTD.

Also, we do not fix text; all content changes must be made by the submitter, and the content must be resubmitted.

Some Flexible Things

We are not so strict about some other things, which reduces the burden on our submitters. First, we don't require all content to be in our format or to follow our tagging rules. We don't force updates of content to the latest DTD version, and we generally follow journal style where it does not interfere with our systems.

The DTD

We use the NLM Archiving and Interchange DTD as the format for all articles loaded to the PMC database. This model was created specifically for archiving article content. It was designed to be an "easy target to hit" when transforming content from the more than 40 different input models in which we receive content. Currently we are writing content into version 3.0.

We do not migrate all content to each new version of the NLM DTD when it comes out. The system is robust enough to handle content from version 1.0 of the DTD through version 3.0, so we are not constantly churning the data.

We manage all of the versions of the DTD with an XML Catalog, which we also use to manage all of the input DTDs (SGML and XML). We maintain all mappings of PUBLIC and SYSTEM IDs for any DTD that we use in the XML Catalog on our Linux machines and then create other catalogs from it each time it is updated. We create an SGML Catalog for the SGML tools that we use; a single "Oxygen" catalog that everyone on the team can use over the network with the XML editor; and a copy of the catalog that refers to http-based copies of the DTDs for PMC International sites. The XML Catalog is an essential piece of the PMC system.
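The idea of one master catalog from which other catalogs are derived can be sketched in Python as follows. This is an illustration under stated assumptions, not the PMC tooling: the catalog content, identifiers, and paths are invented examples, the function names are hypothetical, and the lookup is simplified (SYSTEM entry first, then PUBLIC), ignoring OASIS XML Catalog features such as `prefer`, `rewriteSystem`, and `nextCatalog`.

```python
import xml.etree.ElementTree as ET

# Namespace of OASIS XML Catalog files.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Illustrative master catalog; identifiers and paths are examples only.
MASTER_CATALOG = """<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//EXAMPLE//DTD Article v1.0//EN" uri="dtds/example/article1.dtd"/>
  <system systemId="article1.dtd" uri="dtds/example/article1.dtd"/>
  <public publicId="-//EXAMPLE//DTD Article v2.0//EN" uri="dtds/example/article2.dtd"/>
</catalog>"""


def load_mappings(catalog_xml):
    """Return (public, system) dicts mapping identifiers to DTD locations."""
    root = ET.fromstring(catalog_xml)
    public = {e.get("publicId"): e.get("uri")
              for e in root.iter(f"{{{CATALOG_NS}}}public")}
    system = {e.get("systemId"): e.get("uri")
              for e in root.iter(f"{{{CATALOG_NS}}}system")}
    return public, system


def resolve(public_id, system_id, public, system):
    """Simplified lookup: try the SYSTEM mapping, then the PUBLIC mapping."""
    return system.get(system_id) or public.get(public_id)


def to_sgml_catalog(public, system):
    """Derive SGML Open catalog entries from the same mappings."""
    lines = [f'PUBLIC "{pid}" "{uri}"' for pid, uri in public.items()]
    lines += [f'SYSTEM "{sid}" "{uri}"' for sid, uri in system.items()]
    return "\n".join(lines)


def rebase(mapping, base_url):
    """Derive a mapping that points at http-based copies of the DTDs."""
    return {key: base_url.rstrip("/") + "/" + uri for key, uri in mapping.items()}


if __name__ == "__main__":
    public, system = load_mappings(MASTER_CATALOG)
    print(to_sgml_catalog(public, system))
```

Because every derived catalog is generated from the same mappings, adding or changing a DTD is a single edit to the master file followed by regeneration.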

PMC Tagging Style

Next, we've defined a set of rules for objects within articles that is more restrictive than the DTD. This allows us to have normalized structures (figures, tables, contributors) in articles for ease of processing and rendering. We call these rules the PMC Tagging Style, and all articles must "pass style" before being loaded to the database. They are documented in the PMC Tagging Guidelines [PMC06].
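To show what "more restrictive than the DTD" can mean in practice, here is a minimal Python sketch of a rule-based style check. The three rules are invented for illustration and are not taken from the PMC Tagging Guidelines, and the `RULES` table and `check_style` function are hypothetical. The actual PMC StyleChecker is an XSL transform.

```python
import xml.etree.ElementTree as ET

# Each rule is (element name, message, predicate returning True when OK).
# These rules are invented examples of constraints a DTD might leave
# optional; the real rules are documented in the PMC Tagging Guidelines.
RULES = [
    ("fig", "fig must have an id attribute",
     lambda el: bool(el.get("id"))),
    ("fig", "fig must contain a graphic",
     lambda el: el.find("graphic") is not None),
    ("contrib", "contrib must have a contrib-type attribute",
     lambda el: bool(el.get("contrib-type"))),
]


def check_style(xml_text):
    """Return one message per rule violation, in rule order."""
    root = ET.fromstring(xml_text)
    errors = []
    for tag, message, ok in RULES:
        for el in root.iter(tag):
            if not ok(el):
                errors.append(message)
    return errors


if __name__ == "__main__":
    article = "<article><contrib/><fig><graphic/></fig></article>"
    for problem in check_style(article):
        print(problem)
```

An article "passes style" in this sketch when `check_style` returns an empty list. Running such a check only after DTD validation keeps the two concerns separate: the DTD says what is allowed, and the style rules say which of the allowed forms is preferred.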

(Re)Usability of XML

Finally, our XML must be usable by others. The NLM XML that we create from whatever was submitted to us is always available to the submitting publisher (the content owner), and the subset of articles that are Open Access is available to anyone for download through the PMC Open Archives Service [PMC07].

This keeps us honest. We can't allow ourselves to take shortcuts with the data. All articles must be valid according to the public NLM DTD version that they reference, and we only use Processing Instructions for instructions about processing.

PMC Tools

Another thing that has been helpful is that we share the tools we've built with our submitters. The more help we can give to people sending us content, the cleaner the content is, and the easier it is for us to process. Reducing or eliminating correction cycles greatly reduces cost and stress on the PMC team.

PMC Tagging Guidelines [PMC06] - This is a document that describes the PMC Tagging Style mentioned above. We first created the guidelines as an internal document that we would use to keep track of any tagging rules we set up to keep our XML as normal as possible. As we got more and more questions from providers about how structures should be tagged, we decided to share our internal document. The DTD allows many ways to tag certain structures, but we do have a preference for many of them, and sharing them has been helpful both for us and for PMC submitters. While we don't require that content submitted to PMC be "to style", we certainly aren't going to turn away content that is.

PMC StyleChecker [PMC08] - This is a tool that applies the rules defined in the Tagging Guidelines. It is an XSL transform that reviews each article and reports what it finds. This is the same transform that we have inline in our production process, after validation against the DTD, to confirm that the structures within the articles we are loading into the database meet our standards for tagging. It is available as an online tool to which users can upload one article at a time, or as a set of stylesheets that data providers can build into their workflow [PMC09].

Online SGML Validator [PMC10] - This is a tool that validates SGML against the SGML DTDs that we have in the PMC system, as long as the PUBLIC or SYSTEM ID is recognized in our SGML Catalog file. That is, as long as we have set up the particular SGML DTD for ingest in our system.

Online XML Validator [PMC11] - This is a tool that validates any XML file whose SYSTEM ID is resolvable on the system or whose PUBLIC or SYSTEM ID is mapped in our XML Catalog file.

PMC Article Previewer [PMC12] - This is a tool that runs an article through a PMC ingest workflow. First the SGML or XML is validated against the DTD per the rules given above for the SGML and XML validators. Then, if we recognize the DTD and have a conversion built for it, the article runs through the XSL transform to normalized PMC XML. This output is validated against the output DTD and is checked with the StyleChecker. If all of these steps succeed, the article is displayed on a page in PMC display format so that the publisher can preview it. The publisher also gains reasonable confidence that the article will be ingested into PMC with no problems. The Article Previewer requires users to log in with a "My NCBI" account [NCBI01] because the content stays in the database until the user removes it.
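The Article Previewer workflow described above is a chain of steps in which each step runs only if the previous one succeeded. A minimal Python sketch of that control flow follows. The stage functions are toy stand-ins with hypothetical names. The real steps are DTD validation, an XSL conversion, output validation, the StyleChecker, and rendering.

```python
from typing import Callable, List, Optional, Tuple

# A stage takes the current document text and returns (ok, text, message).
Stage = Callable[[str], Tuple[bool, str, str]]


def run_workflow(text: str, stages: List[Tuple[str, Stage]]):
    """Run stages in order; stop at the first failure and report it."""
    log = []
    result: Optional[str] = text
    for name, stage in stages:
        ok, result, message = stage(result)
        log.append((name, "ok" if ok else f"failed: {message}"))
        if not ok:
            return None, log
    return result, log


# Toy stand-ins for the real steps (all hypothetical).
def validate_source(text):
    return text.startswith("<"), text, "source does not parse"


def convert(text):
    out = text.replace("<art>", "<article>").replace("</art>", "</article>")
    return True, out, ""


def validate_output(text):
    return text.startswith("<article>"), text, "output is not valid"


def check_style(text):
    return "TODO" not in text, text, "style errors found"


def render(text):
    return True, text.replace("article>", "div>"), ""


PREVIEW = [
    ("validate source", validate_source),
    ("convert", convert),
    ("validate output", validate_output),
    ("style check", check_style),
    ("render", render),
]

if __name__ == "__main__":
    page, log = run_workflow("<art>Hello</art>", PREVIEW)
    print(page)
    print(log)
```

Returning the log along with the result means a submitter sees which step failed and why, which is the point of offering the previewer before production submission.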

Our People

Finally, there has to be some acknowledgment of the people who work on the project. We have a fantastic set of programmers who build us some very nice software, and we have a good group of Content Managers who marshal the content through the workflow, deal with the exceptions, and communicate with the publishers.

But we also have a group in the middle that writes and maintains the XSL transforms. It is not easy to find the right people for this group. Our document conversion writers need to have a good programmer's strong attention to detail, unending curiosity, and desire to always find a better way to do things. But they also need to live in a world where mixed content does not frighten them. Because we are writing transforms to maintain the meaning and intent of the original article, we slip out of the black-and-white programmer world into a world of grays. We need to represent the article that was submitted to the archive - and not necessarily the file that was submitted.

A successful PMC document conversion writer lives with a foot in both camps: the world of the wants and desires of the content owners and the rule-driven world of people who build and run large systems. This is not unique to PubMed Central by any means, but it does surprise some people that our group is made up of people who studied English, Art, and Library Sciences rather than Systems Design, COBOL, or The DOM. Because we work exclusively with documents and document XML, I have found that it is much easier to start with a (technically oriented) person with a publishing or printing background and teach them how to write a transform than to take a brilliant programmer and teach them how to think about documents.

Some Challenges: When Print Goes Away

There are still a few things that give us problems in PMC. First, because of the rapid changes in article publishing since the invention of the Internet, journal publishing models have started to change. A business that had a pretty stable model of articles in issues that were printed and mailed on a periodic schedule has been shifting to publishing on a continuous basis - with articles made available online as soon as they are ready and then collected into print issues after some time (or not collected and printed at all).

Early PMC was created based on the issue model, and we have had to be flexible to handle these changes. One thing we have done to remain flexible in dealing with this continuous publication model is to be very strict about publication dates in articles. Depending on the publication model of a journal, an article may have one publication date (for traditional print-based journals) or an article-level publication date and a collection-level publication date (for articles that are published online and then collected into a print or online issue, volume, or collection). See the essay on <pub-date> in the Tagging Guidelines [PMC06].

The rise of the Internet has also brought about another "continuous publication" idea, and that is continuous publication of the same document - or updates to an article. We have not seen too much of this in PMC yet, but we have been preparing ourselves. Each article in the PMC database is assigned an ArticleID and an ArticleInstanceID. Any new version of an article creates a new ArticleInstance, and all ArticleInstances for a given article have the same ArticleID. This allows us to store multiple versions of articles, to retrieve the latest (by ArticleID), and to retrieve any version (by ArticleInstanceID). Fortunately, how journal publishers will handle article updates is something the journal publishers themselves will need to decide.
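The ArticleID and ArticleInstanceID scheme described above can be sketched as a small in-memory data structure in Python. This is an illustration only, assuming sequential integer identifiers and that the most recently added instance is the latest. The class and method names are hypothetical, and the actual PMC database design is not described in this paper.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List


@dataclass(frozen=True)
class ArticleInstance:
    article_id: int    # shared by every version of the same article
    instance_id: int   # unique to this one version
    xml: str


class ArticleStore:
    """Toy store: many instances per article, looked up either way."""

    def __init__(self):
        self._instances: Dict[int, ArticleInstance] = {}
        self._by_article: Dict[int, List[int]] = {}
        self._next_article = count(1)
        self._next_instance = count(1)

    def add_article(self, xml: str) -> ArticleInstance:
        """Create a new article with its first instance."""
        return self._add(next(self._next_article), xml)

    def add_version(self, article_id: int, xml: str) -> ArticleInstance:
        """Add a new instance to an existing article."""
        if article_id not in self._by_article:
            raise KeyError(article_id)
        return self._add(article_id, xml)

    def _add(self, article_id: int, xml: str) -> ArticleInstance:
        inst = ArticleInstance(article_id, next(self._next_instance), xml)
        self._instances[inst.instance_id] = inst
        self._by_article.setdefault(article_id, []).append(inst.instance_id)
        return inst

    def latest(self, article_id: int) -> ArticleInstance:
        """Retrieve the most recently added version of an article."""
        return self._instances[self._by_article[article_id][-1]]

    def get_instance(self, instance_id: int) -> ArticleInstance:
        """Retrieve one specific version."""
        return self._instances[instance_id]


if __name__ == "__main__":
    store = ArticleStore()
    first = store.add_article("<article>v1</article>")
    store.add_version(first.article_id, "<article>v2</article>")
    print(store.latest(first.article_id).xml)
```

Old instances are never overwritten, so a citation to a specific version keeps resolving to the same content while a request by ArticleID always gets the newest one.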

Conclusion

The success we have had at PMC has really been due to a combination of factors, including our strictness in upholding the XML models that we use, our flexibility (to a point) with our content providers, the tools that we've put together to make our jobs and the jobs of our content providers easier, and the fine people we have working on the team.

To be fair, it is also nice to have the resources of the National Center for Biotechnology Information and the National Library of Medicine behind us, as well as the support of the NIH in general, which has declared that copies of all peer-reviewed articles that report on NIH-funded research must be deposited to PubMed Central and made available within 12 months of the publication date [Beck02]. I have to include that just to be fair.

Acknowledgments

This work was supported by the Intramural Research Program of the NIH, National Library of Medicine, National Center for Biotechnology Information.

References

[PMC01] PubMed Central, http://www.ncbi.nlm.nih.gov/pmc/.
[PMC02] PubMed Central National Advisory Committee. (2009) Minutes of the PubMed Central National Advisory Committee Meeting, June 15, 2009. National Institutes of Health, Bethesda, MD.
[PMC03] How to Join PMC, http://www.ncbi.nlm.nih.gov/pmc/about/pubinfo.html.
[NLM01] NLM Collection Development and Acquisitions, http://www.nlm.nih.gov/tsd/acquisions/mainpage.html.
[PMC04] Minimum Criteria for PMC Data Evaluation Submissions, http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/mindatareq.pdf.
[JATS01] NLM Archiving and Interchange DTD, http://dtd.nlm.nih.gov/archiving/.
[PMC05] pmc-1.dtd, http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/dtd/pmc-1.dtd.
[Beck01] Beck and Lapeyre (2003) New Public Domain Journal Archiving and Interchange DTDs. XML 2003. http://www.idealliance.org/papers/dx_xml03/papers/04-01-02/04-01-02.html.
[PMC06] PMC Tagging Guidelines, http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html.
[PMC07] PMC Open Archive Interface, http://www.ncbi.nlm.nih.gov/pmc/about/oai.html.
[PMC08] PMC Online StyleChecker, http://www.pubmedcentral.nih.gov/utils/style_checker/stylechecker.cgi.
[PMC09] PMC Downloadable StyleChecker, http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/stylechecker/stylecheck-README.html.
[PMC10] PMC SGML Validator, http://www.pubmedcentral.nih.gov/utils/validate/sgmlcheck.cgi.
[PMC11] PMC XML Validator, http://www.pubmedcentral.nih.gov/utils/validate/xmlcheck.cgi.
[PMC12] PMC Article Previewer, http://www.pubmedcentral.nih.gov/utils/pv/.
[NCBI01] My NCBI, http://www.ncbi.nlm.nih.gov/sites/myncbi/.
[Beck02] Beck, Jeff. “How XML made the NIH "Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research" possible.” In Proceedings of Extreme Markup Languages 2005. http://conferences.idealliance.org/extreme/html/2005/Beck01/EML2005Beck01.html