PubMed Central Overview

PubMed Central (PMC) is a free archive of life science journal literature from the National Library of Medicine (NLM) and is the digital counterpart to NLM's collection of print journals. Currently PMC includes more than 3.5 million articles from approximately 1700 full participation journals and more than 3500 partial participation journals. Full-text articles are archived in PMC with XML conforming to the JATS Archiving and Interchange DTD.

Back Issue Digitization

From 2004 to 2010, the NLM, in partnership with the Wellcome Trust and the U.K. Joint Information Systems Committee (JISC), ran a project to digitize content of some PMC-participating journals. The project used a destructive scanning process, and its output was high-resolution page TIFFs, OCR full text, and XML metadata for approximately 1.3 million journal articles from more than 160 journals. These records were added to the PMC archive and are freely available. In 2014, the NLM and Wellcome Trust signed a memorandum of understanding to begin a second project to digitize historical content and make it freely available in PMC.

Unlike the first project, which employed a destructive scanning process on donated source material, this second project will use NLM's own collection and thus requires a non-destructive scanning method. In addition, the current project focuses on journals identified by NLM's History of Medicine Division as orphaned or out-of-copyright material of significant historical value.

These titles span approximately 200 years, and while the basic structure of journal articles has remained largely unchanged in that time, the specifics of journal and article citation information are nuanced. The experience gained by NLM and PMC staff on the first digitization project is invaluable for handling this new project, but as is to be expected with any project, there are exceptions.

In an attempt to minimize the surprises in the project, staff from NLM and PMC are working together to review the material to be scanned and identify any anomalies that do not conform to our typical or expected data formats.

The Fascicular Series

During the analysis of one of the titles chosen for scanning, NLM staff came across a series in the Medico-Chirurgical Review (London: S. Highley, 1824-1847) titled the Fascicular Series. As the name suggests, this series of the journal was published in fascicles or small bundles. Up until January 1828, the journal regularly published quarterly issues of 288 pages. In an address to the subscribers of the journal appearing in January 1828 [Figure 1], the editorial staff of the journal explained that they would be changing the publishing form and instead of a single 288-page issue once a quarter, they would be issuing six 48-page fascicles that would be sent to the subscribers half-monthly.

To those who startle at innovation, we would put this plain question:—Can there be any objection, that each packet or fasciculus...of this Journal, should go forth to those who wish to have it every fifteen days, (half-monthly) instead of remaining in the printing office for the space of many weeks? [*]

Figure 1: Address

The publisher's concern that potentially timely material was lingering in a printing office for weeks is one familiar to modern publishers. The time between acceptance of a manuscript and publication of the final edited version can sometimes be months. Instead of holding these articles, many publishers issue ahead-of-print or online-first versions. Additionally, many electronic-only publications have adopted a continuous publication model whereby articles are published on a rolling basis and are collected into issues which have publication dates that may be as broad as an entire year. It would seem, then, that the publisher of the Medico-Chirurgical Review adopted a continuous publication model in the print medium almost 200 years ago.

Structure of the Series

The first two issues published in the Fascicular Series (issues 16 and 17) included very distinct headings identifying the fasciculus number and date [Figure 2].

Figure 2: Series identification

Fasciculus I: Jan 12, 1828

Fasciculus II: Jan 26, 1828

Following the first two issues of the series, the publisher did not include the same heading but continued to identify the fasciculus number in the footer of the first page [Figure 3]. Because the bound volume in the collection does not include covers or tables of contents, this note in the footer, along with date information in the running head [Figure 4], is our only indication that the issues were being released in groups of 48 pages twice a month.

Figure 3: Fasciculus identification in footer

Each issue of this series has two categorical sections: Analytical Reviews and Periscope. Both sections appear in each individual fasciculus but are grouped together in the bound volume of the collection. This results in the content in the bound volumes being out of chronological order [Figure 4].

Figure 4: January 12 Periscope following March 22 Analytical Series

Beginning with issue 21, the dates in the running heads change to month only. The fasciculus numbering continues with six per issue, but they are no longer printed with the day in the running head. During this run of the series, the issue date is expressed as a month range. This pattern continues through the end of 1833 when the journal ceases the Fascicular Series and begins the Decennial Series. From the beginning of the Decennial Series through the end of the publication, the journal returns to a single quarterly issue.

Tagging Considerations

Fortunately for the project, the JATS Archiving model will natively handle all of the structures included in this journal. There are, however, two distinct issues that need to be addressed: article division and publication dates.

Article Division

For the back issue scanning project, PMC has outlined very specific instructions for how to identify and group types of articles. They include consulting the tables of contents and reviewing the content of the articles. If articles contain brief announcements or news-type items, they are grouped as a single article. The Periscope section of the journal falls into this classification as it includes items such as brief communications, society announcements, and obituaries. Per the project rules, these brief announcements would all be captured as a single article titled "Periscope".

The challenge with this decision, however, is that each individual fasciculus contained a Periscope section, so for each issue in the Fascicular Series, six Periscope sections were published. These divisions do not exist in the bound volume, where the sections are presented as one continuous section following all of the Analytical Series articles of the issue.

How do we address this? Should we stray from the specification and create a separate Periscope article for each fasciculus, imposing a division where one does not necessarily exist? If not, and we follow the general guidelines we laid out for ourselves, what, then, is the article publication date for an article that was published in six installments?

Publication Dates

The challenge of addressing publication dates extends past the Periscope-specific question. Since we have identified this as a print continuous publication model, the first step would be to try to tag the dates with a method parallel to that of the electronic continuous publication model we currently handle.

For electronic continuous publication models, PMC requires both a collection date and an electronic publication date, which must include the day, month, and year.

<pub-date publication-format="electronic" date-type="collection"/>
<pub-date publication-format="electronic" date-type="pub">
 <day/>
 <month/>
 <year/>
</pub-date>

The JATS attributes of publication-format and date-type allow separation of the publication format from the event type, so one possible solution would be to use this same pairing of values but just change "electronic" to "print".

<pub-date publication-format="print" date-type="collection"/>
<pub-date publication-format="print" date-type="pub">
 <day/>
 <month/>
 <year/>
</pub-date>
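
As a hedged illustration of this print pairing, Fasciculus I of issue 16 (dated Jan 12, 1828, in the heading shown in Figure 2) might be tagged along the following lines. The year-only collection date and the two-digit month value are assumptions made for this sketch, not confirmed PMC practice for this title.

```xml
<!-- Sketch only: the collection date is assumed to be the year of the bound issue -->
<pub-date publication-format="print" date-type="collection">
 <year>1828</year>
</pub-date>
<!-- Date taken from the Fasciculus I heading: Jan 12, 1828 -->
<pub-date publication-format="print" date-type="pub">
 <day>12</day>
 <month>01</month>
 <year>1828</year>
</pub-date>
```
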

There are, however, two issues with this potential solution for PMC's current system:
  1. PMC style requires the date accompanying the collection date to contain a day, month, and year.

    For the first two issues in the Fascicular Series, this is not an issue. In the next three issues, the fasciculus day exists only in the running head. So it is possible to identify the date, but NLM staff would need to inventory each issue and list the divisions for the vendor. The later issues in the Fascicular Series, however, contain only a month and year, not a day.

    Since we can't impose a day where none was provided, we look at the reason behind that rule. PMC requires that date to include a day so we can ensure that we have the correct release date identified for the article. Since this content is more than a century old, the release date is irrelevant, so this requirement could be modified.

  2. PMC style requires an electronic publication date to be present if a collection date exists.

    Until this point, PMC has only ever encountered electronic collection dates. Because we occasionally receive data in which electronic dates are incorrectly identified as print dates, this check catches the problem at an early stage of processing and prevents the incorrect information from being loaded to the database. We receive such data frequently enough that easing this restriction, and with it our ability to identify these incorrect dates, is not a viable option.

With the current constraints of the PMC system, it does not seem that capturing these fascicular dates as publication dates is feasible. The option we have remaining is to tag the information as a more generic history date rather than an actual publication date. There is more flexibility in the history dates as there is a lot of variety in the kinds of event dates publishers capture for articles. This would allow PMC to retain the information about the fascicular date in the XML, but the date would not appear anywhere in the rendered content. This option also does not address the question of the appropriate publication date for the Periscope articles.
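A minimal sketch of the history-date option, again using Fasciculus I (Jan 12, 1828), might look like the following. The date-type value "fasciculus" is a hypothetical label invented here for illustration; it is not an established PMC or JATS value. For the later, month-only fascicles, the day element would simply be left out, assuming the date model in use allows a date without a day.

```xml
<history>
 <!-- "fasciculus" is a hypothetical date-type value used only for this sketch -->
 <date date-type="fasciculus">
  <day>12</day>
  <month>01</month>
  <year>1828</year>
 </date>
</history>
```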

Inconclusion

PMC's approach to archiving has always been focused more on preserving the intellectual content than the physical format. Does information about this publishing model and the physical form of the journal even belong in PMC? If not, where does it belong? And if this information isn't captured now, when these already-fragile volumes are being digitized, we run the risk of losing it completely when the physical copies are no longer viable.

The original focus of this analysis was to figure out how to capture the fasciculus date as a publication date, but as it progressed, the question has become whether or not we should. In addition to the specific tagging questions that we have not been able to answer, the bound volume itself raises the question of whether or not the specific fasciculus information is of significant importance to the content. If the party responsible for binding the issues did not think the structure significant to preserve, should PMC really depart from that?

References

[bib1] Address: To the Subscribers of the Medico-Chirurgical Review. Medico-Chirurgical Review. London: S. Highley. Jan 1828; p 284-287.

Laura Randall

Technical Information Specialist

National Library of Medicine

Laura Randall has been working with markup languages longer than she cares to admit and currently works for the PubMed Central project at the National Library of Medicine.